## Pre Setup

In [1]:
pip install tqdm



In [2]:
from tqdm import tqdm

## Data Scraping

This cell is responsible for scraping mobile phone data from a specified website. It begins by importing the necessary libraries, `requests` for handling HTTP requests and `BeautifulSoup` for parsing HTML content. The base URL of the website is set up, and an empty list `extractedLinks` is initialized to store the URLs of individual mobile phone pages.

A function `scrapePage(pageNum)` is defined to handle the process of fetching and parsing each page. It sends a request to the website for a given page, checks if the request was successful, and then parses the HTML content. If phone listings are found on the page, the function extracts the URL for each phone by navigating through specific HTML elements.

The script then iterates through pages 1 to 371, calling `scrapePage` for each one while a `tqdm` progress bar tracks how far the loop has run. Once all pages are processed, it prints the total number of links extracted. These URLs are the input to the next stage of data gathering.


In [3]:
import requests
from bs4 import BeautifulSoup

baseUrl = "https://www.gadgets360.com/mobiles/phone-finder?page={}"
extractedLinks = []

def scrapePage(pageNum):
    url = baseUrl.format(pageNum)
    response = requests.get(url)

    if response.status_code != 200:
        print(f"Failed to retrieve page {pageNum}")
        return

    soup = BeautifulSoup(response.content, 'html.parser')
    allPlistDiv = soup.find('div', id='allplist')

    if allPlistDiv:
        phoneDivs = allPlistDiv.find_all('div', class_='_lpdwgt _flx pdbtlinks')

        for phoneDiv in phoneDivs:
            innerDivWrapper = phoneDiv.find('div', class_='_flx _lpbwg')

            if innerDivWrapper:
                innerDivs = innerDivWrapper.find_all('div')

                if len(innerDivs) >= 2:
                    secondDiv = innerDivs[1]
                    h3Tag = secondDiv.find('h3')

                    if h3Tag:
                        aTag = h3Tag.find('a')

                        if aTag and 'href' in aTag.attrs:
                            link = aTag['href']
                            extractedLinks.append(link)

for page in tqdm(range(1, 372), desc="Scraping pages"):
    scrapePage(page)

print(f"\n\nTotal links extracted: {len(extractedLinks)}")


Scraping pages: 100%|██████████| 371/371 [05:06<00:00,  1.21it/s]



Total links extracted: 7317
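`scrapePage` gives up on a page after a single failed response. One way to retry instead is sketched below. It is not part of the notebook: `fetch_with_retry`, the `fetch` callable (assumed to return the page body on success and `None` on failure) and the example URL are all hypothetical.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until it returns a non-None result or retries run out.

    `fetch` is a hypothetical callable: it returns the page body on success
    and None on failure (for example, a thin wrapper around requests.get
    that checks status_code == 200).
    """
    for attempt in range(1, retries + 1):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer after each failure
    return None


# Usage with a fake fetcher that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    return "<html>ok</html>" if calls["count"] >= 3 else None


page = fetch_with_retry(flaky_fetch, "https://example.com/?page=1", delay=0.01)
```

Passing the fetcher in as a parameter keeps the retry logic independent of the HTTP library and easy to exercise without network access.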





This cell is dedicated to scraping the detailed specifications of each mobile phone using the URLs collected in the previous step. It starts by defining an empty list, `phoneSpecsList`, which will store the specifications of each phone.

The `scrapePhoneSpecs(phoneUrl)` function is defined to handle the process of extracting details from each phone’s page. It sends a request to the provided URL and, if successful, parses the HTML to locate the main specifications section. The function then iterates through each category of specifications, gathering attribute names and values, and adds them to a dictionary `phoneSpecs` with uniquely formatted keys to maintain clarity across categories.

The script then loops through each link in `extractedLinks`, calls `scrapePhoneSpecs` for each phone URL, and stores the extracted data in `phoneSpecsList`. It also accumulates the unique feature names found across all pages in the `features` set, with a `tqdm` progress bar showing progress. Once all phones are processed, it prints the total count of phones scraped and the number of unique features. A phone whose page fails to load contributes an empty dictionary, so it still counts toward the total.

This cell effectively gathers and structures detailed specification data for each phone, preparing it for further analysis.


In [6]:
import requests
from bs4 import BeautifulSoup

phoneSpecsList = []

def scrapePhoneSpecs(phoneUrl):
    response = requests.get(phoneUrl)

    if response.status_code != 200:
        print(f"Failed to retrieve phone details at {phoneUrl}")
        return {}

    decodedContent = response.text.encode("utf-8",errors="ignore").decode("utf-8")

    soup = BeautifulSoup(decodedContent, 'html.parser')

    specsDiv = soup.find('div', id='specs')

    phoneSpecs = {}

    if not specsDiv:
        print("No specs found for mobile " + phoneUrl)
        return {}

    for div in specsDiv.find_all('div'):
        div1 = div.find('div')

        if not div1:
            continue

        div2 = div1.find('div')
        if not div2:
            continue

        generalName = div2.text
        table = div.find('table')

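        if not table:
            print("No specs table found under category " + generalName)
            continue  # nothing to parse for this category

        # Fall back to the table itself if the markup has no <tbody>
        tbody = table.find('tbody')
        rows = (tbody if tbody is not None else table).find_all('tr')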
        tempGeneralName = generalName

        for row in rows:
            tds = row.find_all('td')

            if len(tds) > 2:
                print("Length greater than 2 for the phone " + phoneUrl + " under the specs " + generalName)
                continue

            if len(tds) == 1:
                tempGeneralName = f"{generalName}_{tds[0].text.strip()}"
                continue

            attributeName = tds[0].text.strip()
            attributeValue = tds[1].text.strip()
            key = f"{tempGeneralName}_{attributeName}"
            phoneSpecs[key] = attributeValue

    return phoneSpecs

features = set()

for link in tqdm(extractedLinks, desc="Scraping specs from links"):
    phoneSpecs = scrapePhoneSpecs(link)
    features.update(phoneSpecs.keys())
    phoneSpecsList.append(phoneSpecs)

print(f"\n\nTotal phones scraped: {len(phoneSpecsList)}")
print(f"\nTotal features extracted: {len(features)}")


Scraping specs from links: 100%|██████████| 7317/7317 [1:23:54<00:00,  1.45it/s]



Total phones scraped: 7317

Total features extracted: 118
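The key format produced by `scrapePhoneSpecs` can be illustrated without any HTML. The sketch below works on plain lists of cell texts and assumes, as the scraper does, that a one-cell row opens a sub-section and a two-cell row is an attribute/value pair. The helper name `build_spec_keys` and the sample rows are made up for illustration.

```python
def build_spec_keys(category, rows):
    """Turn the rows of one specification table into prefixed key/value pairs.

    `rows` is a list of rows, each a list of cell texts: a one-cell row starts
    a sub-section, a two-cell row is an attribute/value pair.
    """
    specs = {}
    prefix = category
    for cells in rows:
        if len(cells) == 1:
            prefix = f"{category}_{cells[0].strip()}"
        elif len(cells) == 2:
            name, value = (cell.strip() for cell in cells)
            specs[f"{prefix}_{name}"] = value
        # rows with any other number of cells are skipped
    return specs


# Illustrative rows (values are invented, not scraped)
rows = [
    ["Wi-Fi", "Yes"],
    ["SIM 1"],
    ["SIM Type", "Nano-SIM"],
    ["4G/ LTE", "Yes"],
    ["SIM 2"],
    ["SIM Type", "Nano-SIM"],
]
specs = build_spec_keys("Connectivity", rows)
```

Here the two `SIM Type` rows end up under distinct keys (`Connectivity_SIM 1_SIM Type` and `Connectivity_SIM 2_SIM Type`), which is why the sub-section name is folded into the prefix.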





## Data creation and Loading

This cell converts the list of phone specifications, `phoneSpecsList`, into a pandas DataFrame, `df`. Each element of `phoneSpecsList` corresponds to the specifications of a single phone, and the DataFrame is structured such that each row represents one phone, with columns for each unique specification attribute.

After creating the DataFrame, the cell saves the data to a CSV file named `phone_data.csv`, excluding the index column. This CSV file contains all the scraped phone specifications, making it easy to store and later analyze the data.


In [7]:
import pandas as pd
df = pd.DataFrame(phoneSpecsList)

df.to_csv('phone_data.csv', index=False)
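As a small illustration of how a list of dictionaries with differing keys becomes a table, the sketch below builds a DataFrame from three made-up records. The values are invented. The empty dictionary stands in for what `scrapePhoneSpecs` returns when a page cannot be loaded.

```python
import pandas as pd

records = [
    {"General_Brand": "A", "Hardware_RAM": "8GB"},
    {"General_Brand": "B", "Display_Refresh Rate": "120 Hz"},
    {},  # a phone whose page failed to load
]
frame = pd.DataFrame(records)

# Columns are the union of all keys; keys a record lacks become NaN.
print(frame.shape)
print(frame.isnull().sum())
```

The empty record still produces a row, filled entirely with NaN, so failed scrapes show up later as missing values rather than as missing rows.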

This cell loads the previously saved `phone_data.csv` file into a pandas DataFrame, `df`. The data from the CSV file, which contains the scraped phone specifications, is read back into the notebook, making it available for further analysis and processing.


In [94]:
import pandas as pd 
df = pd.read_csv('phone_data.csv')

This cell prints the dimensions of the DataFrame `df`, displaying the number of rows and columns in the dataset. It uses `df.shape` to retrieve the shape of the DataFrame, where `df.shape[0]` represents the number of rows (i.e., the number of phones) and `df.shape[1]` represents the number of columns (i.e., the number of specification attributes). This helps to quickly understand the size of the dataset.


In [95]:
print(f"There are {df.shape[0]} rows in the dataset with {df.shape[1]} columns")

There are 7303 rows in the dataset with 118 columns


## Data cleaning

This cell checks for missing values in the DataFrame `df`. It uses the `isnull()` method to identify any null or missing entries in each column and then applies `sum()` to count the number of missing values per column. The result is converted to a dictionary format, where the keys are column names, and the values represent the count of missing values in each corresponding column. This allows for a quick assessment of data completeness.


In [96]:
dict(df.isnull().sum())

{'General_Brand': 2,
 'General_Model': 2,
 'General_Price in India': 2682,
 'General_Release date': 285,
 'General_Launched in India': 4658,
 'General_Dimensions (mm)': 1579,
 'General_Weight (g)': 2203,
 'General_IP rating': 6896,
 'General_Battery capacity (mAh)': 207,
 'General_Removable battery': 2204,
 'General_Fast charging': 5491,
 'General_Colours': 1569,
 'Display_Refresh Rate': 6272,
 'Display_Screen size (inches)': 64,
 'Display_Touchscreen': 169,
 'Display_Resolution': 617,
 'Display_Protection type': 6616,
 'Display_Pixels per inch (PPI)': 5059,
 'Hardware_Processor': 685,
 'Hardware_Processor make': 2046,
 'Hardware_RAM': 638,
 'Hardware_Internal storage': 494,
 'Hardware_Expandable storage': 1018,
 'Camera_Rear camera': 22,
 'Camera_No. of Rear Cameras': 5758,
 'Camera_Front camera': 77,
 'Camera_No. of Front Cameras': 5797,
 'Software_Operating system': 326,
 'Software_Skin': 4371,
 'Connectivity_Wi-Fi': 170,
 'Connectivity_Wi-Fi standards supported': 3287,
 'Connectivi

This cell handles missing values in the `General_Removable battery` column of the DataFrame `df`. It uses the `fillna()` method to replace any null or missing entries in this specific column with the string `'Unknown'`. This ensures that all rows in the column have a value, preventing issues that might arise from missing data during analysis or model training.


In [97]:
df.loc[:,'General_Removable battery'] = df['General_Removable battery'].fillna('Unknown')

This cell addresses missing values in the `Connectivity_NFC` column of the DataFrame `df`. Similar to the previous cell, it uses the `fillna()` method to replace any null or missing entries in this column with the string `'Unknown'`. This ensures that the column has no missing values, making the dataset more complete for further analysis or modeling.


In [98]:
df.loc[:,'Connectivity_NFC'] = df['Connectivity_NFC'].fillna('Unknown')

This cell drops a large number of columns from the DataFrame `df`. The specified columns are removed as they may not be relevant for the analysis or model training, or they may contain redundant or non-essential information. The `drop()` method is used with the `inplace=True` argument to modify the DataFrame directly. The `errors='ignore'` argument ensures that no error is raised if any of the specified columns are not found in the DataFrame. This step helps in reducing the dataset's size and focuses on the most useful features for further analysis or modeling.


In [99]:
df.drop(columns=[
    'General_Launched in India',
    'General_IP rating',
    'Display_Touchscreen',
    'Display_Protection type',
    'Display_Pixels per inch (PPI)',
    'Camera_No. of Rear Cameras',
    'Camera_No. of Front Cameras',
    'Software_Skin',
    'Connectivity_Wi-Fi',
    'Connectivity_Wi-Fi standards supported',
    'Connectivity_USB Type-C',
    'Connectivity_Active 4G on both SIM cards',
    'Connectivity_SIM 1_SIM Type',
    'Connectivity_SIM 1_Supports 4G in India (Band 40)',
    'Connectivity_SIM 1_GSM/CDMA',
    'Connectivity_SIM 2_SIM Type',
    'Connectivity_SIM 2_GSM/CDMA',
    'Connectivity_SIM 2_Supports 4G in India (Band 40)',
    'General_Form factor',
    'Connectivity_Infrared',
    'Display_Aspect ratio',
    'Hardware_Expandable storage type',
    'Camera_Rear autofocus',
    'Camera_Rear flash',
    'Connectivity_GPS',
    'Connectivity_FM',
    'Sensors_Fingerprint sensor',
    'Sensors_Compass/ Magnetometer',
    'Sensors_Proximity sensor',
    'Sensors_Accelerometer',
    'Sensors_Ambient light sensor',
    'Sensors_Gyroscope',
    'General_Wireless charging',
    'Sensors_In-Display Fingerprint Sensor',
    'Sensors_Face unlock',
    'Hardware_Dedicated microSD slot',
    'Camera_Pop-Up Camera',
    'Connectivity_Wi-Fi Direct',
    'Sensors_Barometer',
    'Connectivity_USB OTG',
    'Camera_Lens Type (Second Rear Camera)',
    'Camera_Lens Type (Third Rear Camera)',
    'Camera_Front autofocus',
    'Camera_Front flash',
    'Sensors_3D face recognition',
    'General_Body type',
    'General_Brand Exclusive Features',
    'Camera_Lens Type (Fourth Rear Camera)',
    'Connectivity_Lightning',
    'General_Wireless Charging Type',
    'Sensors_Temperature sensor',
    'Connectivity_Wi-Fi 7',
    'General_Thickness',
    'Connectivity_SIM Type',
    'Connectivity_3G',
    'Connectivity_4G/ LTE',
    'Connectivity_Micro-USB',
    'Second display_Screen size (inches)',
    'Second display_Touchscreen',
    'Second display_Resolution',
    'Second display_Protection type',
    'Second display_Pixels per inch (PPI)',
    'General_SAR value',
    'Connectivity_Mobile High-Definition Link (MHL)',
    'Connectivity_GSM/CDMA',
    'Connectivity_Supports 4G in India (Band 40)',
    'General_Alternate names',
    'Third display_Protection type',
    'Camera_Rear third camera attr',
    'Second display_Aspect ratio',
    'Connectivity_5G',
    'General_Height',
    'Camera_Lens Type (Primary Rear Camera)',
    'General_Width',
    'Connectivity_SIM 3_SIM Type',
    'Connectivity_SIM 3_3G',
    'Connectivity_SIM 3_4G/ LTE',
    'Connectivity_SIM 3_Supports 4G in India (Band 40)',
    'General_Price in India (Expected)',
    'Connectivity_SIM 3_GSM/CDMA',
    'Third display_Screen size (inches)',
    'Third display_Touchscreen',
    'Third display_Resolution',
    'Camera_Lens Type (Secondary Front Camera)',
    'Camera_Lens Type (Front Camera)',
    'General_Dimensions (mm)',
    'General_Fast charging',
    'General_Colours',
    'Display_Resolution',
    'Display_Resolution Standard',
    'Connectivity_Headphones',
    'Hardware_Processor',
    'Connectivity_Bluetooth',
    'Connectivity_Number of SIMs',
    'Hardware_Expandable storage up to (GB)',
    'General_Weight (g)',
    'Hardware_Processor make',
    'General_Model',
    'Software_Operating system',
    'General_Brand'
    ],
    inplace=True,
    errors='ignore'
    )


This cell normalizes the `General_Release date` column in the DataFrame `df` to a four-digit year. For each non-null value, the lambda takes the last two characters of the value's string form and prefixes them with "20" (e.g., a value ending in "22" becomes "2022"). This assumes that every value ends with the year and that no phone was released before 2000. Null values are set to `'Unknown'`.


In [100]:
df.loc[:,'General_Release date'] = df['General_Release date'].apply(
    lambda x: "20" + str(x)[-2:] if pd.notnull(x) else 'Unknown'
)
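To see what this rule does to individual values, here is a pure-Python mirror of the lambda. The input strings are assumed formats, not taken from the scraped data, and the sketch checks only for `None` where the notebook uses `pd.notnull`.

```python
def normalize_release_year(value):
    """Mirror the notebook's rule: keep the last two characters, prefix '20'."""
    if value is None:
        return "Unknown"
    return "20" + str(value)[-2:]


examples = ["March 2022", "15th August 2019", "1999", None]
normalized = [normalize_release_year(v) for v in examples]
print(normalized)  # ['2022', '2019', '2099', 'Unknown']
```

The third example shows the limitation: a release year before 2000 would be mapped into the 2000s.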

This cell fills any missing values in the `Display_Refresh Rate` column of the DataFrame `df` with the default value `'60 Hz'`. It uses the `fillna()` method to replace null entries, ensuring that all rows in this column have a value. This step ensures consistency in the dataset by handling missing data for display refresh rates.


In [101]:
df.loc[:,'Display_Refresh Rate'] = df['Display_Refresh Rate'].fillna('60 Hz')

This cell fills any missing values in the `Hardware_Expandable storage` column of the DataFrame `df` with the string `'Unknown'`. The `fillna()` method is used to replace null or missing entries, ensuring that all rows in this column have a value. This step helps in maintaining a complete dataset by handling missing data for expandable storage information.


In [102]:
df.loc[:,'Hardware_Expandable storage'] = df['Hardware_Expandable storage'].fillna('Unknown')

This cell processes the `Camera_Rear camera` column in the DataFrame `df` to extract and standardize camera information. The `processCameraInfo()` function splits the camera details on the '+' symbol and uses a regular expression to find a whole-number megapixel value in each part. It stores the megapixel values of up to five cameras in separate columns (`camera_1_megapixel`, `camera_2_megapixel`, etc.), padding with zeros when fewer than five values are found. The number of '+'-separated parts is recorded in the `num_cameras` column, so it can be larger than the number of megapixel values found when a part carries no `N-megapixel` text. Missing entries yield zeros in all six columns.


In [103]:
import re

def processCameraInfo(cameraInfo):
    if pd.isna(cameraInfo):
        return [0, 0, 0, 0, 0, 0]
    cameras = cameraInfo.split('+')

    megapixels = []
    for cam in cameras:
        match = re.search(r'(\d+)-megapixel', cam)
        if match:
            megapixels.append(int(match.group(1)))
    while len(megapixels) < 5:
        megapixels.append(0)
    megapixels.append(len(cameras))
    return megapixels

cameraFeatures = df['Camera_Rear camera'].apply(processCameraInfo)
df[['camera_1_megapixel', 'camera_2_megapixel', 'camera_3_megapixel', 'camera_4_megapixel', 'camera_5_megapixel', 'num_cameras']] = pd.DataFrame(cameraFeatures.tolist(), index=df.index)
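The sketch below is a hedged variant of the same parsing idea, not the notebook's function. It differs in three ways that are choices made for this sketch: it accepts decimal values such as `0.3-megapixel` (the pattern `(\d+)-megapixel` would read that as 3), it counts matched sensors rather than '+' parts, and it truncates to `max_cameras` so the result always has the same length. The name `parse_camera_info` and the sample strings are hypothetical.

```python
import re

MEGAPIXEL_PATTERN = re.compile(r"(\d+(?:\.\d+)?)-megapixel")


def parse_camera_info(camera_info, max_cameras=5):
    """Return max_cameras megapixel values (zero-padded) plus the sensor count."""
    if not camera_info:
        return [0.0] * max_cameras + [0]
    values = [float(m) for m in MEGAPIXEL_PATTERN.findall(camera_info)]
    count = len(values)
    # Pad with zeros, then cut to a fixed length so every row has the same shape
    padded = (values + [0.0] * max_cameras)[:max_cameras]
    return padded + [count]


sample = parse_camera_info("50-megapixel (f/1.8) + 2-megapixel + 0.3-megapixel")
print(sample)  # [50.0, 2.0, 0.3, 0.0, 0.0, 3]
```

A fixed-length result matters because the values are expanded into a fixed set of DataFrame columns.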


This cell applies the same approach to the `Camera_Front camera` column. The `processFrontCameraInfo()` function splits the front camera details on the '+' symbol and uses a regular expression to find a whole-number megapixel value in each part. It stores the values of up to two front cameras in `front_camera_1_megapixel` and `front_camera_2_megapixel`, padding with zeros when fewer than two are found, and records the number of '+'-separated parts in `num_front_cameras`. Missing entries yield zeros in all three columns.


In [104]:
def processFrontCameraInfo(cameraInfo):
    if pd.isna(cameraInfo):
        return [0, 0, 0]
    cameras = cameraInfo.split('+')

    megapixels = []
    for cam in cameras:
        match = re.search(r'(\d+)-megapixel', cam)
        if match:
            megapixels.append(int(match.group(1)))
    while len(megapixels) < 2:
        megapixels.append(0)
    megapixels.append(len(cameras))
    return megapixels

cameraFeatures = df['Camera_Front camera'].apply(processFrontCameraInfo)
df[['front_camera_1_megapixel', 'front_camera_2_megapixel', 'num_front_cameras']] = pd.DataFrame(cameraFeatures.tolist(), index=df.index)


This cell derives the highest network generation for two SIM slots (`SIM 1` and `SIM 2`) in the DataFrame `df`. The function `getHighestNetwork()` checks the 5G, 4G, and 3G columns in that order and returns the first one whose value is not null; it tests only for presence, not for what the value says. If all three are missing, it returns `'Unknown'`. The results are stored in two new columns, `Network_SIM_1` and `Network_SIM_2`, by applying the function row-wise with `apply()`.


In [105]:
def getHighestNetwork(sim3G, sim4G, sim5G):
    if pd.notna(sim5G):
        return '5G'
    elif pd.notna(sim4G):
        return '4G'
    elif pd.notna(sim3G):
        return '3G'
    else:
        return 'Unknown'

df['Network_SIM_1'] = df.apply(
    lambda row: getHighestNetwork(row['Connectivity_SIM 1_3G'],
                                   row['Connectivity_SIM 1_4G/ LTE'],
                                   row['Connectivity_SIM 1_5G']),
    axis=1)

df['Network_SIM_2'] = df.apply(
    lambda row: getHighestNetwork(row['Connectivity_SIM 2_3G'],
                                   row['Connectivity_SIM 2_4G/ LTE'],
                                   row['Connectivity_SIM 2_5G']),
    axis=1)
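If the scraped cells can contain an explicit "No", a presence check alone would still report that generation as supported. The sketch below is a stricter variant under the assumption that supported networks are marked with the string "Yes". That assumption is not verified against the data, and `highest_network` is a hypothetical name.

```python
def highest_network(sim_3g, sim_4g, sim_5g):
    """Return the fastest generation whose cell says 'Yes', else 'Unknown'."""

    def supported(cell):
        # Missing values (None or float NaN) are not strings, so they fail here
        return isinstance(cell, str) and cell.strip().lower() == "yes"

    for label, cell in (("5G", sim_5g), ("4G", sim_4g), ("3G", sim_3g)):
        if supported(cell):
            return label
    return "Unknown"


print(highest_network("Yes", "Yes", None))  # 4G
print(highest_network("Yes", "Yes", "No"))  # 4G
print(highest_network(None, None, None))    # Unknown
```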


This cell drops several columns from the DataFrame `df` that are no longer needed after the processing steps. The columns removed include:
- Connectivity details for both `SIM 1` and `SIM 2` networks (3G, 4G, and 5G).
- Camera information for both the rear and front cameras.

The `inplace=True` argument ensures that the changes are applied directly to the DataFrame, modifying it without the need to assign the result back to a new variable.


In [106]:
df.drop(
    columns=[
        'Connectivity_SIM 1_3G',
        'Connectivity_SIM 1_4G/ LTE',
        'Connectivity_SIM 1_5G',
        'Connectivity_SIM 2_3G',
        'Connectivity_SIM 2_4G/ LTE',
        'Connectivity_SIM 2_5G',
        'Camera_Rear camera',
        'Camera_Front camera'
    ],
    inplace=True
)

This line of code processes the `General_Price in India` column in the DataFrame `df`. It performs the following steps:

1. **Stripping the currency symbol**: The `.str[1:]` removes the first character of the string, the currency symbol (e.g., "₹").
2. **Removing commas**: The `.replace(",","",regex=True)` removes any commas in the price value (e.g., "1,00,000" becomes "100000").
3. **Converting to numeric**: The `.apply(pd.to_numeric, errors='coerce')` converts the cleaned price string into a numeric type (integer or float). If a value cannot be converted to a number, it is coerced to `NaN`.

This results in a numeric column of prices in India, with any non-convertible values replaced by `NaN`.


In [107]:
df.loc[:,'General_Price in India'] = df['General_Price in India'].str[1:].replace(",","",regex=True).apply(pd.to_numeric, errors='coerce')
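The three steps can be tried on a handful of made-up price strings. This sketch uses `.str.replace` and a single `pd.to_numeric` call on the whole Series instead of `apply`, as a stylistic alternative. The sample values are invented.

```python
import pandas as pd

raw = pd.Series(["₹12,999", "₹1,00,000", None, "N/A"])

cleaned = pd.to_numeric(
    raw.str[1:].str.replace(",", "", regex=False),  # drop first character, then commas
    errors="coerce",                                # anything non-numeric becomes NaN
)
print(cleaned.tolist())
```

Note that `.str[1:]` drops the first character whatever it is, so a price stored without a currency symbol would lose its leading digit.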

In [108]:
df.loc[:,'General_Battery capacity (mAh)'] = df['General_Battery capacity (mAh)'].str.replace(",","",regex=True).replace("mAh","",regex=True).apply(pd.to_numeric, errors='coerce')

In [109]:
df.loc[:,'General_Release date'] = df['General_Release date'].apply(pd.to_numeric, errors='coerce')

This line of code fills any missing values (NaNs) in the `General_Price in India` column with the mean of the non-missing values in that column. Here's a breakdown:

1. **Handling missing values**: The `.fillna()` function is used to replace any missing (`NaN`) values in the `General_Price in India` column.
2. **Calculating the mean**: `df['General_Price in India'].mean()` computes the mean (average) of the values in the `General_Price in India` column, excluding `NaN` values.
3. **Filling missing values**: The missing values are replaced with the calculated mean.

This ensures that the column has no missing values and that any missing price data is replaced with the average price from the rest of the dataset.


In [110]:
df.loc[:,'General_Price in India'] = df['General_Price in India'].fillna(df['General_Price in India'].mean())

This line of code fills any missing values (NaNs) in the `General_Battery capacity (mAh)` column with the most frequent value (mode) in that column. Here's a breakdown:

1. **Handling missing values**: The `.fillna()` function is used to replace any missing (`NaN`) values in the `General_Battery capacity (mAh)` column.
2. **Calculating the mode**: `df['General_Battery capacity (mAh)'].mode()[0]` computes the mode, which is the most frequent value in the `General_Battery capacity (mAh)` column. The `[0]` is used to extract the first mode value if there are multiple modes.
3. **Filling missing values**: The missing values are replaced with the most frequent value.

This ensures that the column has no missing values and that any missing battery capacity data is replaced with the most common battery capacity from the rest of the dataset.



In [111]:
df.loc[:,'General_Battery capacity (mAh)'] = df['General_Battery capacity (mAh)'].fillna(df['General_Battery capacity (mAh)'].mode()[0])
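The two imputation strategies used in this notebook, mean for price and mode for battery capacity, can be compared on toy data. The numbers below are invented for illustration.

```python
import pandas as pd

prices = pd.Series([10000.0, 20000.0, None, 60000.0])
batteries = pd.Series([4000.0, 5000.0, 5000.0, None])

# mean() skips NaN: (10000 + 20000 + 60000) / 3 = 30000
prices_filled = prices.fillna(prices.mean())

# mode() returns the most frequent value(s) in sorted order; [0] takes the first
batteries_filled = batteries.fillna(batteries.mode()[0])

print(prices_filled.tolist())
print(batteries_filled.tolist())
```

When several values tie for most frequent, `mode()[0]` is the smallest of them, which is worth keeping in mind for columns with many distinct values.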


This line of code fills any missing values (NaNs) in the `Display_Screen size (inches)` column with the most frequent value (mode) in that column. Here's a breakdown:

1. **Handling missing values**: The `.fillna()` function is used to replace any missing (`NaN`) values in the `Display_Screen size (inches)` column.
2. **Calculating the mode**: `df['Display_Screen size (inches)'].mode()[0]` computes the mode, which is the most frequent value in the `Display_Screen size (inches)` column. The `[0]` is used to extract the first mode value if there are multiple modes.
3. **Filling missing values**: The missing values are replaced with the most frequent value.

This ensures that the column has no missing values and that any missing screen size data is replaced with the most common screen size from the rest of the dataset.


In [112]:
df.loc[:,'Display_Screen size (inches)'] = df['Display_Screen size (inches)'].fillna(df['Display_Screen size (inches)'].mode()[0])


This line of code fills any missing values (NaNs) in the `Hardware_Internal storage` column with the most frequent value (mode) in that column. Here's a breakdown:

1. **Handling missing values**: The `.fillna()` function is used to replace any missing (`NaN`) values in the `Hardware_Internal storage` column.
2. **Calculating the mode**: `df['Hardware_Internal storage'].mode()[0]` computes the mode, which is the most frequent value in the `Hardware_Internal storage` column. The `[0]` is used to extract the first mode value if there are multiple modes.
3. **Filling missing values**: The missing values are replaced with the most frequent value.

This ensures that the column has no missing values and that any missing internal storage data is replaced with the most common internal storage size from the rest of the dataset.


In [113]:
df.loc[:,'Hardware_Internal storage'] = df['Hardware_Internal storage'].fillna(df['Hardware_Internal storage'].mode()[0])


This line removes any rows in the DataFrame `df` that still contain missing values (`NaN`). After the fills above, the columns that can still hold `NaN` are `Hardware_RAM`, which was never filled, and `General_Release date`, whose `'Unknown'` entries became `NaN` in the numeric conversion. Only rows with complete data remain afterwards.


In [114]:
df = df.dropna()

This line checks for any missing values (`NaN`) in the dataframe `df`. It returns a dictionary where the keys are the column names, and the values are the count of missing values in each column. This helps to quickly identify which columns still have missing data.


In [115]:
dict(df.isnull().sum())

{'General_Price in India': 0,
 'General_Release date': 0,
 'General_Battery capacity (mAh)': 0,
 'General_Removable battery': 0,
 'Display_Refresh Rate': 0,
 'Display_Screen size (inches)': 0,
 'Hardware_RAM': 0,
 'Hardware_Internal storage': 0,
 'Hardware_Expandable storage': 0,
 'Connectivity_NFC': 0,
 'camera_1_megapixel': 0,
 'camera_2_megapixel': 0,
 'camera_3_megapixel': 0,
 'camera_4_megapixel': 0,
 'camera_5_megapixel': 0,
 'num_cameras': 0,
 'front_camera_1_megapixel': 0,
 'front_camera_2_megapixel': 0,
 'num_front_cameras': 0,
 'Network_SIM_1': 0,
 'Network_SIM_2': 0}

This line prints a message displaying the current number of rows and columns in the dataframe `df` after preprocessing. It uses `df.shape[0]` to get the number of rows and `df.shape[1]` to get the number of columns. The output will be in the format: "There are now {number_of_rows} rows in the dataframe with {number_of_columns} columns."


In [116]:
print(f"There are now {df.shape[0]} rows in the dataframe with {df.shape[1]} columns")

There are now 6420 rows in the dataframe with 21 columns


This line of code removes the " Hz" suffix from the `Display_Refresh Rate` column values and converts them into integers. It first uses `str.replace(' Hz', '', regex=False)` to strip the " Hz" part from the values, and then `.astype(int)` converts the result to integers.


In [117]:
df.loc[:, 'Display_Refresh Rate'] = df['Display_Refresh Rate'].str.replace(' Hz', '', regex=False).astype(int)


This block of code defines a function `extractMostValue` that processes storage and RAM values from strings like "64 GB", "128 MB", or "1 TB". Inside, it defines a helper function `convertToGb` that converts any given value to gigabytes (GB). The `convertToGb` function handles the conversion for values in TB, GB, or MB by checking the suffix and adjusting the number accordingly. It then applies the `extractMostValue` function to the `Hardware_Internal storage` and `Hardware_RAM` columns of the DataFrame, converting each value to its equivalent in GB and retaining the maximum value for each row.


In [118]:
def extractMostValue(values):
    def convertToGb(value):
        num = float(value[:-2].strip())
        suffix = value[-2:].strip()

        if suffix == 'TB':
            return num * 1024
        elif suffix == 'GB':
            return num
        elif suffix == 'MB':
            return num / 1024
        return 0

    numsInGb = [convertToGb(val) for val in values.split(',')]
    return max(numsInGb)

df.loc[:,'Hardware_Internal storage'] = df['Hardware_Internal storage'].apply(extractMostValue)
df.loc[:,'Hardware_RAM'] = df['Hardware_RAM'].apply(extractMostValue)
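A regex-based variant of the same conversion is sketched below. It is an alternative, not the notebook's implementation. It tolerates extra whitespace and mixed case, and returns 0.0 when no size is found. `largest_size_in_gb` and the sample strings are hypothetical.

```python
import re

UNIT_TO_GB = {"TB": 1024.0, "GB": 1.0, "MB": 1.0 / 1024.0}
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(TB|GB|MB)", re.IGNORECASE)


def largest_size_in_gb(text):
    """Return the largest size mentioned in `text`, in GB (0.0 if none)."""
    sizes = [
        float(number) * UNIT_TO_GB[unit.upper()]
        for number, unit in SIZE_PATTERN.findall(text)
    ]
    return max(sizes, default=0.0)


print(largest_size_in_gb("64GB, 128GB"))  # 128.0
print(largest_size_in_gb("512MB"))        # 0.5
print(largest_size_in_gb("1 TB"))         # 1024.0
```

Matching the number and unit together avoids relying on the unit being exactly the last two characters of each comma-separated part.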

This line of code reorders the columns in the DataFrame by excluding the column `General_Price in India` from the list of columns and then appending it to the end. This is done by creating a list comprehension that iterates over all column names in the DataFrame (`df.columns`), filtering out `General_Price in India`, and then concatenating it at the end. This effectively moves the `General_Price in India` column to the last position.


In [119]:
df = df[[col for col in df.columns if col != 'General_Price in India'] + ['General_Price in India']]

This line of code saves the cleaned and transformed DataFrame `df` to a CSV file named `phone_data_final.csv`. The `index=False` argument ensures that the index of the DataFrame is not included in the output CSV file, keeping only the data and the column names.


In [120]:
df.to_csv('phone_data_final.csv', index=False)


This cell imports the libraries needed for data processing and model training. It reads the data from `phone_data_final.csv` and separates the features (`x`) from the target (`y`). The target variable is `General_Price in India`, while the remaining columns serve as features. It then splits the dataset into training and test sets, with 20% of the data used for testing, and sets a random seed (`random_state=42`) so the split is reproducible.


In [124]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
import numpy as np

data = pd.read_csv('phone_data_final.csv')

x = data.drop('General_Price in India', axis=1)
y = data['General_Price in India']

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=42)


This code one-hot encodes the training (`xTrain`) and testing (`xTest`) feature sets separately using `pd.get_dummies`, dropping the first category of each encoded column to avoid multicollinearity. The two sets are then aligned on the training columns (`join='left'`): a column missing from the test set is added and filled with zeros via `fill_value=0`, and a column that appears only in the test set is dropped. This ensures that both feature sets have identical columns for model training and testing.


In [125]:
xTrainFinal = pd.get_dummies(xTrain, drop_first=True)
xTestFinal = pd.get_dummies(xTest, drop_first=True)

xTrainFinal, xTestFinal = xTrainFinal.align(xTestFinal, join='left', axis=1, fill_value=0)
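One subtlety of encoding the two splits separately with `drop_first=True` is that each split drops its own first category. The toy example below uses a made-up `nfc` column to show how that can lose information when the test split lacks the category that training dropped. It then sketches one possible remedy, which is not what the notebook does: fix the category list from the training split.

```python
import pandas as pd

train = pd.DataFrame({"nfc": ["No", "Unknown", "Yes"]})
test = pd.DataFrame({"nfc": ["Unknown", "Yes"]})  # "No" never appears here

# Encoding each split on its own: the test split drops "Unknown" as its first
# category, so after alignment the "Unknown" row no longer carries that flag.
naive_train = pd.get_dummies(train, drop_first=True)
naive_test = pd.get_dummies(test, drop_first=True)
naive_train, naive_test = naive_train.align(
    naive_test, join="left", axis=1, fill_value=0
)

# Fixing the category list from the training split keeps both encodings identical.
categories = sorted(train["nfc"].unique())


def encode(frame):
    fixed = frame.assign(nfc=pd.Categorical(frame["nfc"], categories=categories))
    return pd.get_dummies(fixed, drop_first=True)


fixed_train, fixed_test = encode(train), encode(test)
print(naive_test)
print(fixed_test)
```

With a large test split and only a few categories per column this situation may never arise, but the fixed-category version removes the dependence on which values happen to land in each split.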


This code uses the `MinMaxScaler` to scale the feature sets. The `fit_transform` method is applied to the training data (`xTrainFinal`) to compute the scaling parameters and apply the transformation, while the `transform` method is applied to the test data (`xTestFinal`) to scale it using the same parameters. Scaling ensures that the features are on the same scale, which can improve the performance of certain machine learning models.


In [126]:
scaler = MinMaxScaler()
xTrainScaled = scaler.fit_transform(xTrainFinal)
xTestScaled = scaler.transform(xTestFinal)
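The arithmetic behind min-max scaling is (x - min) / (max - min), with the minimum and maximum learned from the training data only. The pure-Python sketch below illustrates that idea on one invented column. It is a simplified stand-in, not scikit-learn's implementation, and the helper names are hypothetical.

```python
def fit_min_max(column):
    """Learn the minimum and maximum from the training values."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Scale values with previously learned bounds: (x - lo) / (hi - lo)."""
    span = hi - lo
    # A constant column has no range; this sketch maps it to 0.0
    return [(x - lo) / span if span else 0.0 for x in column]


train_battery = [3000, 4000, 5000]
test_battery = [3500, 6000]

lo, hi = fit_min_max(train_battery)
train_scaled = apply_min_max(train_battery, lo, hi)  # [0.0, 0.5, 1.0]
test_scaled = apply_min_max(test_battery, lo, hi)    # [0.25, 1.5]
print(train_scaled, test_scaled)
```

Because the bounds come from the training split, a test value outside the training range (6000 here) scales to a number outside [0, 1].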

This code evaluates three regression models: `RandomForestRegressor`, `SVR` (labeled `SupportVectorRegressor` in the output), and `DecisionTreeRegressor`. Each model is trained on the scaled training data and evaluated on the scaled test data. For each model, the following steps are performed:

- The model is trained using the training data (`xTrainScaled` and `yTrain`).
- Predictions are made on the test data (`xTestScaled`).
- The model's performance is evaluated using:
  - **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual values.
  - **Root Mean Squared Error (RMSE)**: Measures the square root of the average squared differences between predicted and actual values.
  - **R-squared (R2)**: Indicates the proportion of variance in the target variable explained by the model.

The results for each model (MAE, RMSE, R2 Score) are printed out to compare their performance.


In [127]:
model = RandomForestRegressor(n_estimators=500, max_features="sqrt")

model.fit(xTrainScaled,yTrain)

In [128]:
models = {
    "RandomForestRegressor": RandomForestRegressor(n_estimators=500,max_features="sqrt"),
    "SupportVectorRegressor": SVR(C=10),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=20)
}

for modelName, model in models.items():
    model.fit(xTrainScaled, yTrain)
    yPred = model.predict(xTestScaled)

    mae = mean_absolute_error(yTest, yPred)
    rmse = np.sqrt(mean_squared_error(yTest, yPred))
    r2 = r2_score(yTest, yPred)

    print(f"{modelName}:")
    print(f"  Mean Absolute Error: {mae:.2f}")
    print(f"  RMSE: {rmse:.2f}")
    print(f"  R2 Score: {r2:.2f}\n")

RandomForestRegressor:
  Mean Absolute Error: 4769.56
  RMSE: 10320.94
  R2 Score: 0.51

SupportVectorRegressor:
  Mean Absolute Error: 5794.05
  RMSE: 14474.73
  R2 Score: 0.03

DecisionTreeRegressor:
  Mean Absolute Error: 5783.18
  RMSE: 13605.15
  R2 Score: 0.15
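For reference, the three metrics can be computed by hand. The sketch below uses plain Python and three invented price pairs. It follows the standard definitions and is meant only to make the formulas concrete.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R2) for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


actual = [10000, 20000, 30000]
predicted = [12000, 18000, 33000]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, which is why the RMSE values in the results are roughly twice the MAE values: a few phones are mispredicted by a wide margin.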



### Findings and Conclusion:

1. **Best Performing Model: RandomForestRegressor**  
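   The **RandomForestRegressor** is the best performing model in this analysis, with the **lowest Mean Absolute Error (MAE)** of 4769.56 and the lowest **Root Mean Squared Error (RMSE)** of 10320.94 in the run shown above. It also achieved an **R2 Score of 0.51**, meaning it explains about 51% of the variance in the target variable (price in India). The tree-based models are created without a fixed `random_state`, so these figures shift somewhat from run to run.  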

   The reason behind its superior performance lies in its ability to capture complex relationships in the data through multiple decision trees and its ability to generalize better compared to the other models.

2. **Weakest Performing Model: SupportVectorRegressor**  
   The **SupportVectorRegressor** performed worst, with the highest **MAE** (5794.05) and **RMSE** (14474.73). Its **R2 Score** of only 0.03 indicates that it explained almost none of the variance in the target variable.  

   A plausible cause is the combination of the chosen hyperparameters and the scale of the target: the features are min-max scaled, but the price is left in rupees, and with `C=10` and the default `epsilon` the model is strongly regularized relative to errors of that size. Support vector regression also tends to struggle when the data is noisy or when the features are not well suited to the model's assumptions.

3. **DecisionTreeRegressor: Moderate Performance**  
   The **DecisionTreeRegressor** landed between the other two, with an **MAE** of 5783.18, an **RMSE** of 13605.15, and an **R2 Score** of 0.15. Its MAE is almost the same as the SupportVectorRegressor's, and it explains far less variance than the RandomForestRegressor.  

   A potential reason for its lower performance is its tendency to overfit the training data (here with `max_depth=20`), leading to poorer generalization on unseen test data. A single tree is also sensitive to hyperparameter choices, especially its depth.

### Final Conclusion:
The **RandomForestRegressor** is the most suitable model for predicting phone prices in India based on this dataset, as it strikes a good balance between accuracy and generalization. It should be prioritized for further fine-tuning and deployment. The **DecisionTreeRegressor** may also be considered if simpler models are preferred, though further optimization would be required. The **SupportVectorRegressor** underperformed significantly and should likely be excluded from consideration for this task.


In [129]:
df.to_csv("cleaned_phone_data.csv", index=False)

In [131]:
import joblib

joblib.dump(scaler, "scaler.pkl")

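# Save the trained Random Forest model. After the comparison loop, `model`
# refers to the last model fitted (the decision tree), so take it from `models`.
joblib.dump(models["RandomForestRegressor"], "model.pkl")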

['model.pkl']