<h1 style=color:red;text-align:center;font-weight:bold;>Web Scraping Project: Motorcycle Data from Hindustan Times Auto</h1>
<h3 style=color:orange;font-weight:bold>Description</h3>
<p>This project involves scraping motorcycle data from the Hindustan Times Auto website (<a href='https://auto.hindustantimes.com/new-bikes/search'>https://auto.hindustantimes.com/new-bikes/search</a>) to collect information about various motorcycle models available in the Indian market. The scraped data includes details such as brand, model, price range, engine capacity, top speed, mileage, user ratings, and review counts. The data is processed and stored in a structured format using a Pandas DataFrame for further analysis.</p>
<h3 style=color:orange;font-weight:bold>Objectives</h3>
<ul>
<li>Extract motorcycle specifications and pricing data from multiple pages of the Hindustan Times Auto website.</li>
<li>Clean and preprocess the scraped data to ensure consistency and usability (e.g., handling missing values, converting units, and parsing price ranges).</li>
<li>Create a structured dataset suitable for analysis or visualization of motorcycle attributes.</li>
<li>Export the cleaned dataset to a CSV file.</li>
</ul>
<h3 style=color:orange;font-weight:bold>Tools and Libraries</h3>
<ul>
<li>'requests': For sending HTTP requests to fetch web pages.</li>
<li>'BeautifulSoup (from bs4)': For parsing HTML content and extracting relevant data.</li>
<li>'pandas': For organizing and cleaning the scraped data into a DataFrame.</li>
<li>'numpy': For handling missing values and numerical operations.</li>
<li>'re': For regular expression-based text processing.</li>
</ul>

In [76]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

<h4 style=font-weight:bold;color:darkviolet>The dataset includes the following columns:</h4>
<ol>
<li style=color:maroon;>Brand.</li>
<li style=color:maroon;>Model.</li>
<li style=color:maroon;>Engine(cc).</li>
<li style=color:maroon;>Top Speed(kmph).</li>
<li style=color:maroon;>Mileage(kmpl).</li>
<li style=color:maroon;>Rating(out of 5).</li>
<li style=color:maroon;>Review Count.</li>
<li style=color:maroon;>Min Price.</li>
<li style=color:maroon;>Max Price.</li>
</ol>

In [77]:
#lists to collect data
Brand=[]
Model=[]
Rating=[]
Review_Count=[]
Engine=[]
Top_Speed=[]
Mileage=[]
Price=[]

<h4 style=color:darkviolet;text-align:center>Extraction of motorcycle specifications and pricing data from multiple pages of the Hindustan Times Auto website.</h4>

In [78]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}  # same header for every request
for page in range(1,26):  # search result pages 1 to 25
    url='https://auto.hindustantimes.com/new-bikes/search?pageNo={}'.format(page)
    webpage=requests.get(url=url,headers=headers).text
    soup=BeautifulSoup(webpage,'lxml')
    for i in soup.find_all('div',class_='FinderProductCard_cardContent__k_jWg'):
        bike_name=i.find('div',class_='FinderProductCard_titleContainer__K2duM')
        # guard against a missing title container or fewer than two spans
        brand_model=bike_name.find_all('span') if bike_name else []
        Brand.append(brand_model[0].text.strip() if len(brand_model)>0 else np.nan)
        Model.append(brand_model[1].text.strip() if len(brand_model)>1 else np.nan)
        
        bike_price=i.find('div',class_='FinderProductCard_price__UPXye')
        Price.append(bike_price.text.strip() if bike_price else np.nan)
        
        rating_div=i.find('div',class_='rteWgt__rating-with-star rating-with-star rteWgt__star-after star-after')
        Rating.append(rating_div.find('span').text.strip() if rating_div else np.nan)
        
        review_count=i.find('span',class_='rteWgt__total-reviews total-reviews')
        Review_Count.append(review_count.text.strip() if review_count else np.nan)
        
        features=i.find('div',class_='FinderProductCard_specsContainer__vfNoH')
        # if the specs container is missing, every spec below falls back to NaN

        engine_cc=features.find('div',{'class':'FinderProductCardKeySpecs_spec__NJgXU','title':'Engine'}) if features else None
        Engine.append(engine_cc.find('div',class_='FinderProductCardKeySpecs_spec-label__Kg7UI').text.strip() if engine_cc else np.nan)

        speed=features.find('div',{'class':'FinderProductCardKeySpecs_spec__NJgXU','title':'Speed'}) if features else None
        Top_Speed.append(speed.find('div',class_='FinderProductCardKeySpecs_spec-label__Kg7UI').text.strip() if speed else np.nan)

        bike_mileage=features.find('div',{'class':'FinderProductCardKeySpecs_spec__NJgXU','title':'Mileage'}) if features else None
        Mileage.append(bike_mileage.find('div',class_='FinderProductCardKeySpecs_spec-label__Kg7UI').text.strip() if bike_mileage else np.nan)
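
<p>The extraction loop repeats one pattern: look up a tag, then fall back to 'np.nan' when it is absent. The following is a minimal sketch of that pattern as a reusable helper, not the notebook's own code. The helper name text_or_nan and the sample HTML with its class names are hypothetical. The built-in 'html.parser' is used instead of 'lxml' only to avoid the extra dependency.</p>

```python
import numpy as np
from bs4 import BeautifulSoup


def text_or_nan(parent, name, **kwargs):
    """Return the stripped text of the first matching tag, or NaN if absent."""
    if parent is None:
        return np.nan
    tag = parent.find(name, **kwargs)
    return tag.get_text(strip=True) if tag is not None else np.nan


# Hypothetical card: it has a price but no rating block
sample_html = """
<div class="card">
  <div class="price"> ₹2.97 Lakhs </div>
</div>
"""
card = BeautifulSoup(sample_html, 'html.parser').find('div', class_='card')

price = text_or_nan(card, 'div', class_='price')    # tag found, text returned
rating = text_or_nan(card, 'div', class_='rating')  # tag missing, NaN returned
```

<p>Because the helper also accepts a missing parent, a lookup inside an optional container would not need its own guard.</p>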
  

In [79]:
df=pd.DataFrame({'Brand':Brand,'Model':Model,'Price':Price,'Engine(cc)':Engine,'Top Speed(kmph)':Top_Speed,'Mileage(kmpl)':Mileage,'Rating(out of 5)':Rating,'Review Count':Review_Count})
df

Unnamed: 0,Brand,Model,Price,Engine(cc),Top Speed(kmph),Mileage(kmpl),Rating(out of 5),Review Count
0,TVS,iQube,"₹94,434 - 1.59 Lakhs",,82 kmph,,4.7,58
1,Yamaha,MT-15 V2,₹1.7 - 1.74 Lakhs,155.0 cc,122 kmph,56.87 kmpl,4.4,34
2,Royal Enfield,Hunter 350,₹1.5 - 1.82 Lakhs,349 cc,114 kmph,36.2 kmpl,4.0,102
3,Hero,Splendor Plus XTEC,"₹81,001 - 86,051",97.2 cc,87 kmph,70 kmpl,4.4,31
4,KTM,390 Duke,₹2.97 Lakhs,398.63 cc,167 kmph,28.9 kmpl,4.0,76
...,...,...,...,...,...,...,...,...
619,Honda,Rebel 500,₹5.12 Lakhs,471 cc,153 kmph,27 kmpl,,
620,Honda,XL750 Transalp [2025],₹11 Lakhs,755 cc,180 kmph,23 kmpl,,
621,Triumph,Speed T4,₹1.99 - 2.03 Lakhs,398.15 cc,135 kmph,30 kmpl,,
622,BMW,CE-02,₹4.49 Lakhs,,95 kmph,,4.0,1


In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Brand             624 non-null    object
 1   Model             624 non-null    object
 2   Price             624 non-null    object
 3   Engine(cc)        314 non-null    object
 4   Top Speed(kmph)   613 non-null    object
 5   Mileage(kmpl)     304 non-null    object
 6   Rating(out of 5)  207 non-null    object
 7   Review Count      207 non-null    object
dtypes: object(8)
memory usage: 39.1+ KB


<h4 style=color:darkviolet;text-align:center>Cleaning scraped data and type casting columns</h4>

In [81]:
df['Engine(cc)'] = df['Engine(cc)'].str.replace('cc', '', regex=False).str.strip().astype(float)
# strip the 'cc' unit from 'Engine(cc)' and convert the column to float

In [82]:
df['Top Speed(kmph)'] = df['Top Speed(kmph)'].str.lower().str.replace(r'kmph|km', '', regex=True).str.strip().astype(float)
# strip the 'kmph' (or 'km') unit from 'Top Speed(kmph)' and convert the column to float

In [83]:
df['Mileage(kmpl)'] = df['Mileage(kmpl)'].str.lower().str.replace('kmpl', '', regex=False).str.strip().astype(float)
# strip the 'kmpl' unit from 'Mileage(kmpl)' and convert the column to float
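
<p>The three conversions above each remove a known unit suffix before casting to float. As an alternative sketch, a regular expression can capture the leading number directly, so the result does not depend on how the unit is spelled. The function name leading_number is hypothetical, and the sketch assumes each value contains at most one number of interest.</p>

```python
import re

_NUMBER = re.compile(r'\d+(?:\.\d+)?')  # integer or decimal number


def leading_number(text):
    """Return the first number in text as a float, or None when there is none."""
    if not isinstance(text, str):
        return None  # covers the NaN placeholders left by the scrape
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


engine = leading_number('398.63 cc')
speed = leading_number('122 Kmph')
mileage = leading_number('56.87 kmpl')
missing = leading_number(float('nan'))
```

<p>On a DataFrame, the same function could be applied column-wise with 'Series.map'.</p>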

In [84]:
df['Rating(out of 5)'] = df['Rating(out of 5)'].astype(float)
# convert the 'Rating(out of 5)' column from object to float

In [85]:
df['Review Count'] = df['Review Count'].str.replace(',', '', regex=False)
df['Review Count'] = pd.to_numeric(df['Review Count'], errors='coerce').astype('Int64')
# strip thousands separators, then convert 'Review Count' from object to nullable Int64
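
<p>The review-count step relies on the nullable 'Int64' dtype, which can hold integers alongside missing values. A small sketch on made-up values shows how the same two calls behave. The sample strings, including the unparseable 'n/a', are assumptions for illustration.</p>

```python
import pandas as pd

raw = pd.Series(['58', '1,204', None, 'n/a'])  # made-up values

# Remove thousands separators, coerce anything unparseable to a missing
# value, then store the result as nullable integers
cleaned = raw.str.replace(',', '', regex=False)
counts = pd.to_numeric(cleaned, errors='coerce').astype('Int64')
```

<p>A plain 'int64' column could not represent the two missing entries, which is why the nullable type is used.</p>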

In [86]:
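def parse_price(text):
    """Return (min_price, max_price) in rupees for a string such as '₹1.7 - 1.74 Lakhs'."""
    if pd.isna(text):
        return (None, None)
    # drop the currency symbol and thousands separators, then lower-case the text
    text = str(text).replace('₹', '').replace(',', '').strip().lower()
    # a single price is treated as a range whose two bounds are equal
    if '-' in text:
        parts = [p.strip() for p in text.split('-')]
    else:
        parts = [text, text]
    # 'lakhs' written only on the upper bound also applies to the lower bound,
    # unless the lower number is already larger (e.g. '94434 - 1.59 lakhs')
    if ('lakhs' in parts[1] and 'lakhs' not in parts[0]):
        part_1=re.findall(r'[\d.]+', parts[0])
        part_2=re.findall(r'[\d.]+', parts[1])
        if(float(part_1[0])<float(part_2[0])):
            parts[0] += 'lakhs'
    if 'lakhs' in parts[0] and 'lakhs' not in parts[1]:
        parts[1] += 'lakhs'
    def convert(part):
        # take the first number in the part; 1 lakh = 100,000 rupees
        match = re.findall(r'[\d.]+', part)
        if not match:
            return None
        num = float(match[0])
        if 'lakhs' in part:
            return num * 100000
        else:
            return num

    min_price = convert(parts[0])
    max_price = convert(parts[1])
    return min_price, max_price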

df[['Min Price', 'Max Price']] = df['Price'].apply(parse_price).apply(pd.Series)
# split 'Price' into numeric 'Min Price' and 'Max Price' columns (in rupees)
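
<p>The price strings mix plain rupee amounts with amounts in lakhs, and the unit is sometimes written only on the upper bound. The sketch below restates that unit-inheritance rule in compact form and runs it on four price strings taken from the scraped table. The function name price_bounds is hypothetical. Unlike parse_price, it assumes a non-missing string with at most one hyphen and rounds the lakh conversion to two decimals.</p>

```python
import re

LAKH = 100_000  # 1 lakh = 100,000 rupees


def price_bounds(text):
    """Return (min, max) in rupees for strings like '₹1.7 - 1.74 Lakhs'."""
    cleaned = text.replace('₹', '').replace(',', '').lower()
    parts = [p.strip() for p in cleaned.split('-')]
    if len(parts) == 1:
        parts = parts * 2  # single price: both bounds are equal
    numbers = [float(re.search(r'\d+(?:\.\d+)?', p).group()) for p in parts]
    in_lakhs = ['lakh' in p for p in parts]
    # Unit written only on the upper bound: the lower bound shares it,
    # unless the lower number is already larger (it is then plain rupees).
    if in_lakhs[1] and not in_lakhs[0] and numbers[0] < numbers[1]:
        in_lakhs[0] = True
    low, high = (round(n * LAKH, 2) if lakh else n
                 for n, lakh in zip(numbers, in_lakhs))
    return low, high


mixed_units = price_bounds('₹94,434 - 1.59 Lakhs')   # rupees to lakhs
shared_unit = price_bounds('₹1.7 - 1.74 Lakhs')      # both bounds in lakhs
plain_rupees = price_bounds('₹81,001 - 86,051')      # no unit at all
single_price = price_bounds('₹2.97 Lakhs')           # one value, two bounds
```

<p>The four results match the 'Min Price' and 'Max Price' values shown for the corresponding rows in the cleaned dataset.</p>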

In [87]:
df.drop('Price',axis=1,inplace=True)
# drop the original 'Price' column now that it has been split

<h4 style=color:darkviolet;text-align:center>Structured dataset after type casting columns</h4>

In [88]:
df

Unnamed: 0,Brand,Model,Engine(cc),Top Speed(kmph),Mileage(kmpl),Rating(out of 5),Review Count,Min Price,Max Price
0,TVS,iQube,,82.0,,4.7,58,94434.0,159000.0
1,Yamaha,MT-15 V2,155.00,122.0,56.87,4.4,34,170000.0,174000.0
2,Royal Enfield,Hunter 350,349.00,114.0,36.20,4.0,102,150000.0,182000.0
3,Hero,Splendor Plus XTEC,97.20,87.0,70.00,4.4,31,81001.0,86051.0
4,KTM,390 Duke,398.63,167.0,28.90,4.0,76,297000.0,297000.0
...,...,...,...,...,...,...,...,...,...
619,Honda,Rebel 500,471.00,153.0,27.00,,,512000.0,512000.0
620,Honda,XL750 Transalp [2025],755.00,180.0,23.00,,,1100000.0,1100000.0
621,Triumph,Speed T4,398.15,135.0,30.00,,,199000.0,203000.0
622,BMW,CE-02,,95.0,,4.0,1,449000.0,449000.0


In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Brand             624 non-null    object 
 1   Model             624 non-null    object 
 2   Engine(cc)        314 non-null    float64
 3   Top Speed(kmph)   613 non-null    float64
 4   Mileage(kmpl)     304 non-null    float64
 5   Rating(out of 5)  207 non-null    float64
 6   Review Count      207 non-null    Int64  
 7   Min Price         624 non-null    float64
 8   Max Price         624 non-null    float64
dtypes: Int64(1), float64(6), object(2)
memory usage: 44.6+ KB


<h4 style=color:darkviolet;text-align:center>Converting the dataset into CSV file</h4>

In [90]:
df.to_csv('Motorcycle_Data_India.csv',index=False)
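
<p>A CSV file stores no dtype information, so a nullable integer column with missing values is inferred as float when the file is read back. The sketch below shows the round trip on a two-row stand-in DataFrame written to an in-memory buffer. The rows are made up, and the buffer stands in for 'Motorcycle_Data_India.csv'.</p>

```python
import io

import pandas as pd

df_small = pd.DataFrame({
    'Model': ['A', 'B'],                                  # made-up rows
    'Review Count': pd.array([58, None], dtype='Int64'),  # nullable integers
})

buffer = io.StringIO()
df_small.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)  # dtype inferred from the text
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={'Review Count': 'Int64'})
```

<p>Passing 'dtype' to 'read_csv' restores the nullable integer column. Without it, the column comes back as float64.</p>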

<h4 style=color:darkviolet;text-align:center>Project Summary</h4>
<p style=color:maroon>This project scraped motorcycle data from the Hindustan Times Auto website (<a href="https://auto.hindustantimes.com/new-bikes/search">https://auto.hindustantimes.com/new-bikes/search</a>) across 25 result pages, collecting details for 624 motorcycles. The dataset records brand, model, engine capacity, top speed, mileage, user rating, review count, and minimum and maximum price. Using the Python libraries 'requests', 'BeautifulSoup', 'pandas', 'numpy', and 're', the data was extracted, cleaned, and structured into a pandas DataFrame. Preprocessing recorded missing values as 'np.nan', split price ranges into minimum and maximum prices, removed the units 'cc', 'kmph', and 'kmpl', and converted columns to appropriate types (float for numerical values, Int64 for review counts). The final dataset was saved as 'Motorcycle_Data_India.csv'.</p>
<p style=color:maroon>Coverage is uneven: brand, model, and price are present for all 624 rows, but engine capacity is available for 314 rows, mileage for 304, and ratings and review counts for only 207, so any analysis of those fields covers a subset of the data. Within those limits, the dataset can support market research in the Indian market, such as price comparisons across brands, benchmarking of competing models, and comparison of performance metrics by consumers weighing a purchase.</p>