# Data Analysis

Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.


In this notebook we will inspect and clean the data that we extracted, then use it to train different machine learning models. We will use separate notebooks for the different models to keep the code tidy.

Importing the libraries we need:

In [62]:
import pandas as pd
import numpy as np

Reading the data from the **cars.csv** file that we created earlier:

In [91]:
cars = pd.read_csv("cars.csv")
cars.head(3)

Unnamed: 0,Name,style,Exterior color,interior color,Engine,drive type,Fuel Type,Transmission,Mileage,mpg city,mpg highway,price
0,2018 Nissan Titan,Pickup Truck,Deep Blue Pearl,Black,5.6L V-8 Gas,4WD,Gas,Automatic,82230,15,21,35620
1,2020 Honda Civic,Hatchback,Sonic Gray Pearl,Unknown,1.5L Inline-4 Gas Turbocharged,FWD,Gas,Automatic,24282,31,40,24999
2,2018 Dodge Charger,Sedan,Indigo Blue,Brazen Gold/Black,5.7L V-8 Gas,RWD,Gas,Automatic,19468,16,25,41999


As you can see, the **Year**, **Brand**, and model **Name** of each car are all packed into the **Name** column, in that order. We are going to extract them and give each its own column. <br>
We will also extract the **engine volume** and put it in an **Engine V** column, because it is a very important feature for car prices.

In [92]:
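# Raw strings (r"...") keep the regex backslashes from being read as Python escapes.
# Year: the first four consecutive digits in the name.
cars["Year"] = cars.Name.str.extract(r"(\d\d\d\d)", expand=False).astype(int)
cars["Name"] = cars.Name.str.replace(r"(\d\d\d\d )", "", regex=True)

# Brand: the first word that remains after the year is removed.
cars["Brand"] = cars.Name.str.extract(r"([\w]+)", expand=False)
cars["Brand"] = cars["Brand"].str.strip()

# Name: whatever is left once the leading brand word is removed.
cars["Name"] = cars.Name.str.replace(r"(^[\w]+ )", "", regex=True)
cars["Name"] = cars.Name.str.strip()

# Engine V: the leading number of the engine description, e.g. 5.6 from "5.6L V-8 Gas".
cars["Engine V"] = cars.Engine.str.extract(r"(\d\.*\d*)", expand=False).astype(float)
cars["Engine"] = cars.Engine.str.replace(r"(\d\.*\d*L )", "", regex=True)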

cars.head()

Unnamed: 0,Name,style,Exterior color,interior color,Engine,drive type,Fuel Type,Transmission,Mileage,mpg city,mpg highway,price,Year,Brand,Engine V
0,Titan,Pickup Truck,Deep Blue Pearl,Black,V-8 Gas,4WD,Gas,Automatic,82230,15,21,35620,2018,Nissan,5.6
1,Civic,Hatchback,Sonic Gray Pearl,Unknown,Inline-4 Gas Turbocharged,FWD,Gas,Automatic,24282,31,40,24999,2020,Honda,1.5
2,Charger,Sedan,Indigo Blue,Brazen Gold/Black,V-8 Gas,RWD,Gas,Automatic,19468,16,25,41999,2018,Dodge,5.7
3,F-150,Pickup Truck,Shadow Black,Medium Earth Gray,V-6 Gas Turbocharged,4WD,Gas,Automatic,195205,18,23,20995,2018,Ford,2.7
4,Altima,Sedan,White,Black,Inline-4 Gas,FWD,Gas,Automatic,92366,27,38,10995,2015,Nissan,2.5
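
The same split can also be written as a single pattern with named groups. This is only a sketch on a hypothetical three-row sample, not the approach used in the cell above. It assumes every name has the form "year, one-word brand, model"; a hyphenated or multi-word brand would not match and would need separate handling.

```python
import pandas as pd

# Hypothetical sample that mirrors the format of the "Name" column.
names = pd.Series(["2018 Nissan Titan", "2020 Honda Civic", "2018 Ford F-150"])

# One pattern with named groups: four-digit year, first word as brand, the rest as model.
pattern = r"^(?P<Year>\d{4})\s+(?P<Brand>\w+)\s+(?P<Name>.+)$"

# With named groups, str.extract returns a DataFrame whose columns use the group names.
parts = names.str.extract(pattern)
parts["Year"] = parts["Year"].astype(int)
parts["Name"] = parts["Name"].str.strip()

print(parts)
```

Rows that do not match the pattern come back as NaN in every group, which makes unexpected name formats easy to spot before converting types.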


Now that the data has the shape we wanted, it is time to check for missing values so we can decide how to handle them. The **isnull()** function returns a boolean DataFrame that is True for null or NaN cells and False for the rest. Because **sum()** adds up the values in each column and counts True as 1, chaining the two tells us how many missing values each column has.

In [93]:
cars.isnull().sum()

Name              0
style             0
Exterior color    0
interior color    0
Engine            0
drive type        0
Fuel Type         0
Transmission      0
Mileage           0
mpg city          0
mpg highway       0
price             0
Year              0
Brand             0
Engine V          1
dtype: int64
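
To make the two steps visible, here is a minimal sketch on a hypothetical toy frame (the rows are invented for illustration and are not taken from the dataset). It shows the boolean mask that **isnull()** produces, the per-column counts, and a way to look at the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing engine volume.
toy = pd.DataFrame({
    "Brand": ["Nissan", "Honda", "Ford"],
    "Engine V": [5.6, 1.5, np.nan],
})

mask = toy.isnull()               # boolean DataFrame: True where a value is missing
missing_per_column = mask.sum()   # True counts as 1, False as 0
print(missing_per_column)

# Look at the affected rows before deciding how to handle them.
print(toy[toy["Engine V"].isnull()])
```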

As you can see, there is only one missing value, in the **Engine V** column. That means a single row in our dataset has no engine volume. Imputing a value for just one row would be a waste of time, so we simply remove it.<br><br>

> This shows that our web scraper worked very well and extracted nearly all the data we wanted.

In [94]:
cars = cars.dropna()
cars.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6532 entries, 0 to 6532
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            6532 non-null   object 
 1   style           6532 non-null   object 
 2   Exterior color  6532 non-null   object 
 3   interior color  6532 non-null   object 
 4   Engine          6532 non-null   object 
 5   drive type      6532 non-null   object 
 6   Fuel Type       6532 non-null   object 
 7   Transmission    6532 non-null   object 
 8   Mileage         6532 non-null   int64  
 9   mpg city        6532 non-null   int64  
 10  mpg highway     6532 non-null   int64  
 11  price           6532 non-null   int64  
 12  Year            6532 non-null   int32  
 13  Brand           6532 non-null   object 
 14  Engine V        6532 non-null   float64
dtypes: float64(1), int32(1), int64(4), object(9)
memory usage: 791.0+ KB
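
Note that the index above still runs from 0 to 6532 although only 6532 rows remain, because **dropna()** keeps the original labels. The sketch below, on a hypothetical toy frame, shows one way to limit the check to a single column with `subset`, count the dropped rows, and renumber the index. It is an optional variation, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing engine volume.
toy = pd.DataFrame({
    "Name": ["Titan", "Civic", "F-150"],
    "Engine V": [5.6, np.nan, 2.7],
})

rows_before = len(toy)

# subset= restricts the null check to the listed columns;
# reset_index(drop=True) renumbers the rows 0..n-1 without keeping the old index.
cleaned = toy.dropna(subset=["Engine V"]).reset_index(drop=True)

rows_dropped = rows_before - len(cleaned)
print(f"dropped {rows_dropped} of {rows_before} rows")
```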


There you go: the dataset is clean and we are ready to start training models. Before that, let's take a statistical look at the dataset:

In [95]:
cars.describe(include="all")

Unnamed: 0,Name,style,Exterior color,interior color,Engine,drive type,Fuel Type,Transmission,Mileage,mpg city,mpg highway,price,Year,Brand,Engine V
count,6532,6532,6532,6532,6532,6532,6532,6532,6532.0,6532.0,6532.0,6532.0,6532.0,6532,6532.0
unique,269,9,535,312,30,4,3,2,,,,,,43,
top,F-150,Pickup Truck,Black,Black,Inline-4 Gas,FWD,Gas,Automatic,,,,,,Ford,
freq,1238,2098,447,2273,1682,2628,6468,6413,,,,,,2017,
mean,,,,,,,,,63627.757042,21.254133,28.302358,26390.029394,2017.260104,,3.025398
std,,,,,,,,,42866.056166,5.019364,6.209774,11107.289425,2.972726,,1.224997
min,,,,,,,,,143.0,10.0,16.0,2000.0,1997.0,,1.0
25%,,,,,,,,,31943.5,18.0,23.0,17999.0,2017.0,,2.0
50%,,,,,,,,,52081.0,20.0,27.0,25595.5,2018.0,,2.7
75%,,,,,,,,,86404.5,25.0,33.0,34488.5,2019.0,,3.6


This table gives us important information, such as which features are categorical and which encoder suits each feature. We will discuss it in more detail while training the models.
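
The `unique` row is the part that matters most for choosing encoders: **Transmission** has only 2 distinct values, while **Exterior color** has 535. As a rough sketch, the same counts can be pulled out directly for the non-numeric columns. The toy frame is hypothetical, and how to treat low and high cardinality columns is a modelling decision that is not fixed here.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Transmission": ["Automatic", "Manual", "Automatic", "Automatic"],
    "Exterior color": ["Black", "White", "Indigo Blue", "Shadow Black"],
    "price": [35620, 24999, 41999, 20995],
})

# Keep only the non-numeric columns, then count the distinct values in each.
categorical = toy.select_dtypes(exclude="number")
cardinality = categorical.nunique().sort_values()

print(cardinality)
```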

Let's export the cleaned dataset for later use:

In [84]:
cars.to_csv(
    "cleaned_data.csv", 
    index=False
)
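
A quick way to confirm the export is to read the file back. The sketch below does this on a hypothetical toy frame in a temporary directory, so it does not touch the real file. It also shows the effect of `index=False`: the row index is not written, so no extra `Unnamed: 0` column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset.
toy = pd.DataFrame({"Brand": ["Nissan", "Honda"], "Engine V": [5.6, 1.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_data.csv")
    toy.to_csv(path, index=False)   # the row index is not written to the file
    restored = pd.read_csv(path)

print(restored.columns.tolist())
print(restored.shape)
```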