# 🚗 Car Sales Data Preprocessing

This notebook is dedicated to preprocessing the data and making sure that it can be fitted to a model and visualized well.

In [1]:
# Making the required imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Importing the dataset
df = pd.read_csv("../data/car-sales.csv")
df

|     | Make | Colour | Odometer (KM) | Doors | Price |
| --- | --- | --- | --- | --- | --- |
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |
| 2 | Honda | White | 84714.0 | 4.0 | 28343.0 |
| 3 | Toyota | White | 154365.0 | 4.0 | 13434.0 |
| 4 | Nissan | Blue | 181577.0 | 3.0 | 14043.0 |
| ... | ... | ... | ... | ... | ... |
| 995 | Toyota | Black | 35820.0 | 4.0 | 32042.0 |
| 996 | NaN | White | 155144.0 | 3.0 | 5716.0 |
| 997 | Nissan | Blue | 66604.0 | 4.0 | 31570.0 |
| 998 | Honda | White | 215883.0 | 4.0 | 4001.0 |


In [3]:
df.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [4]:
df.dtypes

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object

# Filling the missing values

It makes sense to fill the missing numerical data with the median value of that particular column, as the median is less sensitive to outliers than the mean and is therefore a more stable measure of central tendency.
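
As a small illustration of why the median is the more stable choice, the sketch below uses made-up odometer readings (not taken from the dataset) with one extreme value. The mean is pulled far towards the outlier, while the median stays at the middle reading.

```python
import pandas as pd

# Hypothetical odometer readings; the last value is a deliberate outlier
readings = pd.Series([30_000.0, 45_000.0, 50_000.0, 55_000.0, 900_000.0])

mean_value = readings.mean()      # pulled upwards by the outlier
median_value = readings.median()  # middle value, unaffected by the outlier

print(f"mean:   {mean_value}")
print(f"median: {median_value}")
```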

In [5]:
df["Odometer (KM)"].median()

131821.0

In [6]:
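# Filling Odometer (KM) with the column median
# (assigning the result back avoids calling fillna(inplace=True) on a column selection)
df["Odometer (KM)"] = df["Odometer (KM)"].fillna(df["Odometer (KM)"].median())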

In [8]:
df.isna().sum()

Make             49
Colour           50
Odometer (KM)     0
Doors            50
Price            50
dtype: int64

In [9]:
# Filling Doors with 4
df["Doors"] = df["Doors"].fillna(4)

In [10]:
df.isna().sum()

Make             49
Colour           50
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [11]:
# Filling Price with the column median
df["Price"] = df["Price"].fillna(df["Price"].median())

In [12]:
df.isna().sum()

Make             49
Colour           50
Odometer (KM)     0
Doors             0
Price             0
dtype: int64
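
The three column-by-column fills above could also be written as a single `fillna` call that takes a dictionary mapping column names to fill values. The following is a minimal sketch on a toy frame with invented values, not the notebook's actual cell; `toy`, `fill_values`, and `toy_filled` are names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data with the same numeric column names as the dataset (values are made up)
toy = pd.DataFrame({
    "Odometer (KM)": [10_000.0, np.nan, 30_000.0],
    "Doors": [4.0, 5.0, np.nan],
    "Price": [5_000.0, 7_000.0, np.nan],
})

# One fill value per column: medians for Odometer and Price, a constant for Doors
fill_values = {
    "Odometer (KM)": toy["Odometer (KM)"].median(),
    "Doors": 4,
    "Price": toy["Price"].median(),
}
toy_filled = toy.fillna(fill_values)

print(toy_filled.isna().sum())
```

Keeping the fill values in one dictionary makes it easy to see at a glance which strategy is applied to which column.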

Now that all the numerical data has been filled, we need to drop the rows with missing Make and Colour values.

In [14]:
df = df.dropna()

In [15]:
df.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

Great, we've removed all the missing values from the dataset.

Let's check the shape of our dataset now.

In [16]:
df.shape

(902, 5)

Looks like we've lost about 100 rows by dropping the rows with missing Make or Colour values, leaving 902.
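
To check how many rows a `dropna` call removes, the row counts before and after can be compared. This is a sketch on invented data rather than the car sales file; the `subset` argument restricts the check to the two categorical columns, which is an optional refinement over the plain `dropna()` used above.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up): two rows have a missing Make or Colour
toy = pd.DataFrame({
    "Make": ["Honda", np.nan, "BMW", "Toyota"],
    "Colour": ["White", "Blue", np.nan, "Black"],
    "Price": [15_000.0, 20_000.0, 18_000.0, 9_000.0],
})

rows_before = len(toy)
# Drop only the rows where Make or Colour is missing
toy_clean = toy.dropna(subset=["Make", "Colour"])
rows_dropped = rows_before - len(toy_clean)

print(f"dropped {rows_dropped} of {rows_before} rows")
```

Note that the surviving rows keep their original index labels, so the index has gaps wherever a row was dropped.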

In [17]:
df

|     | Make | Colour | Odometer (KM) | Doors | Price |
| --- | --- | --- | --- | --- | --- |
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |
| 2 | Honda | White | 84714.0 | 4.0 | 28343.0 |
| 3 | Toyota | White | 154365.0 | 4.0 | 13434.0 |
| 4 | Nissan | Blue | 181577.0 | 3.0 | 14043.0 |
| ... | ... | ... | ... | ... | ... |
| 994 | BMW | Blue | 163322.0 | 3.0 | 31666.0 |
| 995 | Toyota | Black | 35820.0 | 4.0 | 32042.0 |
| 997 | Nissan | Blue | 66604.0 | 4.0 | 31570.0 |
| 998 | Honda | White | 215883.0 | 4.0 | 4001.0 |


In [18]:
# Save the dataset as a csv file
df.to_csv("../data/car-sales-remastered.csv", index=False)
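
Because the file is written with `index=False`, the row labels are not saved. As a quick sanity check, the sketch below writes a toy frame (invented values, hypothetical file name) to a temporary directory and reads it back: the reloaded data gets a fresh 0-based index and no extra index column.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a gap in the index, as left behind by dropping rows
toy = pd.DataFrame(
    {"Make": ["Honda", "BMW"], "Price": [15_000.0, 20_000.0]},
    index=[0, 3],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-car-sales.csv")  # hypothetical file name
    toy.to_csv(path, index=False)   # index labels are not written
    reloaded = pd.read_csv(path)    # gets a new default index

print(reloaded.columns.tolist())
print(reloaded.index.tolist())
```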