
# **Car Price Prediction**

- The first step was to collect the data.
- We scraped pakwheels.com and gathered around 58,000 entries.
- We then loaded the data into a pandas DataFrame and performed data analysis.
- Removed the attributes that were not useful: the link (URL), Ad No, Last Updated, Features and Location.
- Removed the rows that have a missing value.
- Used label encoding to encode the categorical data.
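
The label-encoding step in the last bullet can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual code: it uses pandas category codes as a stand-in for a label encoder (both number the sorted unique values with integers), and the `encode_labels` helper and the toy rows are assumptions made for the example.

```python
import pandas as pd

def encode_labels(df, columns):
    """Return a copy of df with each listed column replaced by integer codes.

    Hypothetical helper: codes follow the sorted order of the unique values,
    which mirrors how a label encoder numbers its classes.
    """
    encoded = df.copy()
    for col in columns:
        encoded[col] = encoded[col].astype('category').cat.codes
    return encoded

# Toy rows using column names from the dataset (values are illustrative)
toy = pd.DataFrame({
    'Transmission': ['Manual', 'Automatic', 'Manual'],
    'Assembly': ['Imported', 'Local', 'Local'],
})
encoded_toy = encode_labels(toy, ['Transmission', 'Assembly'])
print(encoded_toy)
```

The integer codes carry an arbitrary ordering, so whether this encoding suits a given model is a separate choice that depends on the model.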


# **Business problem**

- The primary objective is to develop a machine learning model that can predict the price of used cars listed on PakWheels with a high degree of accuracy. This model will assist both buyers and sellers in making informed decisions about pricing and negotiations.
- For sellers, it will help ensure that their cars are competitively priced to attract potential buyers.
- For buyers, it will help in understanding the fair value of a car based on its characteristics.


In [3]:
# Load the CSV file
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



# Read data from file 'pakWheels.csv'
data = pd.read_csv('pakWheels.csv')
data.head()



| | Ad No | Name | Price | Model Year | Location | Mileage | Registered City | Engine Type | Engine Capacity | Transmission | Color | Assembly | Body Type | Features | Last Updated | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7472110 | Chevrolet Joy 1.0 2008 | 975000.0 | 2008 | I-10, Islamabad Islamabad | 144800 | Islamabad | Petrol | 1000 cc | Manual | Gold | Imported | Hatchback | AM/FM Radio, CD Player, Cassette Player, Keyl... | Jun 25, 2023 | https://www.pakwheels.com/used-cars/chevrolet-... |
| 1 | 7597614 | Toyota Corolla GLi 1.3 VVTi 2016 | 3375000.0 | 2016 | Shorkot, Shorkot Punjab | 122000 | Lahore | Petrol | 1300 cc | Manual | White | Local | Sedan | AM/FM Radio, Air Conditioning, CD Player, DVD... | Jun 25, 2023 | https://www.pakwheels.com/used-cars/toyota-cor... |
| 2 | 7280186 | Honda BR-V i-VTEC S 2018 | 4800000.0 | 2018 | NFC 1, Lahore Punjab | 18731 | Lahore | Petrol | 1500 cc | Automatic | Crystal Black Pearl | Local | MPV | ABS, AM/FM Radio, Air Bags, Air Conditioning,... | Jun 25, 2023 | https://www.pakwheels.com/used-cars/honda-br-v... |
| 3 | 7319452 | Suzuki Mehran VX 2010 | 680000.0 | 2010 | Pak Arab Housing Society, Lahore Punjab | 76316 | Bahawalpur | Petrol | 800 cc | Manual | Blue | Local | Hatchback | Front Speakers | Jun 25, 2023 | https://www.pakwheels.com/used-cars/suzuki-meh... |
| 4 | 7352581 | MG HS 1.5 Turbo 2022 | 8000000.0 | 2022 | DHA Defence, Lahore Punjab | 16950 | Punjab | Petrol | 1490 cc | Automatic | Black | Local | Crossover | ABS, AM/FM Radio, Air Bags, Air Conditioning,... | Jun 25, 2023 | https://www.pakwheels.com/used-cars/mg-hs-2022... |


In [4]:
# Size of dataset

data.shape

(57783, 16)

# **Data Processing**

## Data Cleaning

In [5]:
# Check for any missing values

data.isna().sum()   # This shows the number of missing values for each attribute

Ad No                 0
Name                  0
Price                 0
Model Year            0
Location              0
Mileage               0
Registered City       0
Engine Type         502
Engine Capacity      86
Transmission          0
Color                 0
Assembly              0
Body Type          5583
Features           3413
Last Updated          0
URL                   0
dtype: int64

In [6]:
# Remove missing values
data = data.dropna(axis=0)

In [7]:
# All the missing values are now removed

data.isna().sum()

Ad No              0
Name               0
Price              0
Model Year         0
Location           0
Mileage            0
Registered City    0
Engine Type        0
Engine Capacity    0
Transmission       0
Color              0
Assembly           0
Body Type          0
Features           0
Last Updated       0
URL                0
dtype: int64

In [8]:
# Size after removing missing values

data.shape

(50259, 16)
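
Dropping every row that has any missing value also discards rows whose only gap is in `Features`, a column that is removed in a later step. As a sketch of an alternative ordering (an assumption for illustration, not what this notebook does), the unused columns could be dropped first so that fewer rows are lost. The toy frame below is made up to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row is missing only 'Features',
# the third row is missing 'Body Type'
toy = pd.DataFrame({
    'Price': [975000.0, 3375000.0, 680000.0],
    'Body Type': ['Hatchback', 'Sedan', np.nan],
    'Features': ['AM/FM Radio', np.nan, 'Front Speakers'],
})

# Order used in the notebook: drop incomplete rows first, then the column
rows_first = toy.dropna(axis=0).drop(columns=['Features'])

# Alternative order: drop the unused column first, then incomplete rows
columns_first = toy.drop(columns=['Features']).dropna(axis=0)

print(len(rows_first), len(columns_first))
```

In the toy frame the alternative order keeps the row that lacks only `Features`; how many rows it would recover in the real dataset is not something this sketch measures.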

In [9]:
# Cleaning the data

# Remove the Ad No, URL, Last Updated, Location and Features columns from the dataset
data = data.drop(columns=['Ad No', 'URL', 'Last Updated', 'Location', 'Features'])

# Convert the 'Price' column to numeric values. Some listings say "Call for price"
# instead of a number; errors='coerce' turns those entries into NaN
data['Price'] = pd.to_numeric(data['Price'], errors='coerce')

# Drop any rows with NaN in 'Price' or in the other columns checked here
data = data.dropna(subset=['Price', 'Engine Type', 'Engine Capacity', 'Body Type'], axis=0)


# Split each name into a company (first word) and a model (second and third words)
company_list = []
model_list = []
for name in data['Name'].tolist():
    parts = name.split()
    company_list.append(parts[0])
    # Slicing with join() avoids an IndexError when a name has fewer than three words
    model_list.append(' '.join(parts[1:3]))

# Insert the company and model as the first two columns
data.insert(0, 'Company', company_list)
data.insert(1, 'Model', model_list)

# Remove the 'cc' suffix from engine capacity and convert it to numeric
data['Engine Capacity'] = data['Engine Capacity'].str.replace('cc', '', regex=False).str.strip()
data['Engine Capacity'] = pd.to_numeric(data['Engine Capacity'], errors='coerce')   # errors='coerce' will replace any non-numeric values with NaN



# The original 'Name' column is no longer needed once Company and Model exist
data.drop('Name', axis=1, inplace=True)
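
The name splitting and engine-capacity parsing in this cell can also be expressed with pandas string methods instead of an explicit loop. The following is a sketch under assumptions, not the notebook's code: `split_name` is a hypothetical helper, the toy values are copied from the preview rows, and it assumes every name starts with a one-word company.

```python
import pandas as pd

def split_name(names):
    """Split a Series of car names into 'Company' and 'Model' columns.

    Hypothetical helper: the company is the first word and the model is the
    second and third words joined by a space.
    """
    words = names.str.split()
    return pd.DataFrame({
        'Company': words.map(lambda w: w[0]),
        'Model': words.map(lambda w: ' '.join(w[1:3])),
    })

toy_names = pd.Series(['Toyota Corolla GLi 1.3 VVTi 2016', 'Suzuki Mehran VX 2010'])
split_toy = split_name(toy_names)
print(split_toy)

# Engine capacity: pull out the leading digits, then convert to numeric
toy_capacity = pd.Series(['1300 cc', '800 cc'])
capacity_cc = pd.to_numeric(
    toy_capacity.str.extract(r'(\d+)', expand=False), errors='coerce'
)
print(capacity_cc.tolist())
```

Extracting the digits with a pattern does not depend on the exact spelling or spacing of the `cc` suffix, which is the reason it is used in this sketch.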


In [10]:
# Preview the cleaned data

data.head()

| | Company | Model | Price | Model Year | Mileage | Registered City | Engine Type | Engine Capacity | Transmission | Color | Assembly | Body Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chevrolet | Joy 1.0 | 975000.0 | 2008 | 144800 | Islamabad | Petrol | 1000 | Manual | Gold | Imported | Hatchback |
| 1 | Toyota | Corolla GLi | 3375000.0 | 2016 | 122000 | Lahore | Petrol | 1300 | Manual | White | Local | Sedan |
| 2 | Honda | BR-V i-VTEC | 4800000.0 | 2018 | 18731 | Lahore | Petrol | 1500 | Automatic | Crystal Black Pearl | Local | MPV |
| 3 | Suzuki | Mehran VX | 680000.0 | 2010 | 76316 | Bahawalpur | Petrol | 800 | Manual | Blue | Local | Hatchback |
| 4 | MG | HS 1.5 | 8000000.0 | 2022 | 16950 | Punjab | Petrol | 1490 | Automatic | Black | Local | Crossover |


In [11]:
data.shape

(49778, 12)