In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Introduction

The dataset contains around 100,000 used-car listings from manufacturers such as Audi, BMW, Mercedes-Benz, Ford and Skoda. Using this dataset and machine learning, we want to build a model that predicts the market price of a car.


### Scope of the Project

From this data, we focus on one particular brand, Skoda. The same approach can later be applied to any of the other available brands. Since the goal is to predict the value of a car, this is a regression problem.

Our approach is as follows:
1. Cleaning the dataset.
2. Sorting out the missing values.
3. Replacing zero values with reasonable median values.
4. Visualizing the data.
5. Finding the factors that influence price.
6. Encoding the object-type data.
7. Selecting a model for training.
8. Tuning the hyperparameters of the selected model.
9. Finding the best model.
10. Predicting and validating the results.

In [None]:
import numpy as np                 # linear algebra
import pandas as pd                # data processing 
import matplotlib.pyplot as plt    # data visualization
import seaborn as sns              # data visualization

## 1. Data Exploration

### Importing and Cleaning the Data

In [None]:
# Out of the many brand data files, we focus only on Skoda
df = pd.read_csv('../input/used-car-dataset-ford-and-mercedes/skoda.csv')
df.head(15)

The file includes the following features: model of the car, year (purchase year), price, transmission, mileage (total miles driven so far), fuelType, tax (road tax), mpg (fuel consumption in miles per gallon) and engineSize.


In [None]:
# to view various info of available data
df.info()

The output above shows no missing values. Of the nine features, three are of object type and the remaining six are numeric.

In [None]:
# the shape of our data 
df.shape

In [None]:
# Count the missing values in each column (all zeros means nothing is missing)
df.isnull().sum()

For convenience, the year feature is converted into age.

In [None]:
# to calculate vehicle age 
df['age'] = 2020 - df['year']
df = df.drop(columns = 'year')    # removed that column
df.head()

In [None]:
# to count number of zero value in each column
df.isin([0]).sum()

The output above shows a few zero values in the tax and engineSize features. To improve the accuracy of the model, we replace them with reasonable values.


In [None]:
# Total number of zero values in 'engineSize' and 'tax'
# A zero in 'age' simply means the car was purchased in 2020
print((df['engineSize'] == 0).sum())
print((df['tax'] == 0).sum())

Since engineSize and tax contain a number of zero values, we first replace those zeros with NaN.

In [None]:
# Replace the zero values with NaN so they are treated as missing
df[["engineSize","tax"]] = df[["engineSize","tax"]].replace(0, np.nan)   # np.nan works in all NumPy versions (np.NaN was removed in NumPy 2.0)
df.isnull().sum()

In [None]:
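# Groupby.median: median of each numeric column per model, excluding missing values
# numeric_only=True skips the object columns, which recent pandas versions refuse to aggregate
median_to_fill = df.groupby("model").median(numeric_only=True)

for model, row in median_to_fill.iterrows():        # iterrows: iterate over the per-model median rows
    rows_to_fill = (df["model"] == model)           # boolean mask selecting the rows of this model
    df[rows_to_fill] = df[rows_to_fill].fillna(row) # fillna: fill each NaN with the median of its column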

As described earlier, the zero values have now been replaced by the median of each feature, computed per car model.
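The same zero-to-NaN-to-median idea can also be written without an explicit loop. The following is only a minimal sketch on a small made-up table (the values in `toy` are hypothetical and not taken from skoda.csv), using `groupby(...).transform("median")` to broadcast each model's median back onto its rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration (not from skoda.csv)
toy = pd.DataFrame({
    "model": ["Fabia", "Fabia", "Fabia", "Octavia", "Octavia", "Octavia"],
    "engineSize": [1.0, 0.0, 1.2, 2.0, 1.6, 0.0],
    "tax": [145.0, 0.0, 150.0, 0.0, 30.0, 20.0],
})

cols = ["engineSize", "tax"]

# Step 1: treat zeros as missing values
toy[cols] = toy[cols].replace(0, np.nan)

# Step 2: per-model median (NaN is skipped), aligned row by row with the original frame
group_median = toy.groupby("model")[cols].transform("median")

# Step 3: fill each missing value with the median of its own model
toy[cols] = toy[cols].fillna(group_median)

print(toy)
```

With this toy table, the missing Fabia engine size becomes the median of 1.0 and 1.2, and the missing Octavia tax becomes the median of 30 and 20.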

In [None]:
# Count the zero values in each column again to confirm the replacement
df.isin([0]).sum()

In [None]:
df.head(15)

In [None]:
df.describe() 

The values above look plausible. We will check this further in the following sections using the correlation matrix and data visualization.

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True) # numeric columns only; annot=True shows the values inside the boxes

The correlation matrix shows the pairwise correlation between the variables: each cell holds the correlation between two variables. A correlation matrix is used to summarize data, as an input to more advanced analyses, and as a diagnostic for those analyses.

Mileage and age clearly show a strong influence on price, which is logical: older cars with more miles tend to be cheaper. The second pair, tax and mpg, has a somewhat weaker influence on price. Finally, price increases with engineSize, which is reasonable, since a larger engine generally comes with a more expensive vehicle.


In [None]:
df.select_dtypes(include='number').corr().abs()   # numeric columns only; abs: takes the absolute value

The table of absolute values above shows the strength of the correlation between every pair of numeric variables, regardless of sign.
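To read off the most influential numeric features at a glance, the absolute correlations with price can be sorted. This is a small sketch on made-up numbers (the `toy` frame is hypothetical), not the result for the Skoda data:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "price":      [5000, 8000, 12000, 15000, 20000],
    "mileage":    [90000, 60000, 40000, 20000, 5000],
    "age":        [9, 6, 4, 2, 1],
    "engineSize": [1.0, 1.2, 1.4, 2.0, 2.0],
})

# Correlation of every feature with price, strongest first (sign ignored)
corr_with_price = (
    toy.corr()["price"]
       .drop("price")                 # the correlation of price with itself is always 1
       .abs()
       .sort_values(ascending=False)
)

print(corr_with_price)
```

Applied to the notebook's `df` (numeric columns only), the same chain would give a ranked list to go with the heatmap.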

Our data has nine attributes: six numerical (quantitative) and three categorical (qualitative). Since we are doing a regression analysis, it is important to visualize the categorical data as well. In the upcoming Encoding section we will convert the categorical data into a numerical format.

Because the data contains these two types, numerical and categorical, we visualize it in two sets:


In [None]:
numeric_type = df[['price','mileage','tax','engineSize','mpg','age']] # Only the int and float columns

fig = plt.figure(figsize=(15,10))                             # figsize: dimensions of the figure

for index,col in enumerate(numeric_type):                     # enumerate yields (position, column name) pairs
    sns.set_style('whitegrid')                                # Background theme of the plot
    plt.subplot(3,3,index+1)                                  # 3x3 grid of subplots; index+1 selects the position
    sns.set(font_scale = 1.0)                                 # Scaling the font size
    sns.histplot(df[col], kde = False, color='blue')          # Histogram of the column (histplot replaces the deprecated distplot)
fig.tight_layout(pad=1.0)                                     # Adjusts the spacing between subplots

Observations from the data:
1. The price distribution is right-skewed, with a mean of about 14,275 GBP; most vehicles are priced around this value.
2. The mileage distribution is also right-skewed; most vehicles are within about 20,000 miles.
3. The tax distribution has a mean of around 120 GBP.
4. The engineSize distribution has a mean of about 1.4.
5. The mpg distribution is concentrated around 60 for most vehicles.
6. The age distribution extends up to 16 years, but most vehicles are around 2.5 years old.
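Right skew can also be checked numerically rather than only by eye. The sketch below uses a made-up price series (`toy_price` is hypothetical) to show the two usual signs: a positive skewness coefficient and a mean that lies above the median.

```python
import pandas as pd

# Hypothetical prices with one expensive car forming a long right tail
toy_price = pd.Series([6000, 8000, 9000, 10000, 11000, 12000, 14000, 30000])

skewness = toy_price.skew()        # sample skewness; positive means a longer right tail
mean_price = toy_price.mean()      # pulled upwards by the expensive car
median_price = toy_price.median()  # robust to the tail

print(f"skewness = {skewness:.2f}, mean = {mean_price:.0f}, median = {median_price:.0f}")
```

On the real data, `df['price'].skew()` and `df['mileage'].skew()` would be the corresponding calls.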

In [None]:
category_type = df[['model','transmission','fuelType']]         # Only the object-type columns

fig = plt.figure(figsize=(20,5))

for index,col in enumerate(category_type):
    sns.set_style('whitegrid')
    plt.subplot(1,3,index+1)
    if(index == 0):
        plt.xticks(rotation=90)                                     # Show the x-axis labels vertically
    sns.set(font_scale = 1.0)
    sns.countplot(x=df[col], order = df[col].value_counts().index)  # Bar per category, most frequent first
fig.tight_layout(pad=1.0)

Observations from the data:
1. In the model distribution, Fabia and Octavia make up about half of the cars sold.
2. More than 50 percent of the cars sold have a manual transmission.
3. Petrol is clearly the most common fuel type among the vehicles bought.



### Plotting over Price

In [None]:
# various numeric_type data influence over price
numeric_type = df[['mileage','tax','engineSize','mpg','age']]               # Updated the set without Price

fig = plt.figure(figsize=(20,20))

for index,col in enumerate(numeric_type):                                   
        sns.set_style('whitegrid')                                          
        plt.subplot(4,3,index+1)                   
        sns.set(font_scale = 1.0)
        sns.scatterplot(data = df, x = col, y = 'price',color='blue', alpha = 0.5) 
fig.tight_layout(pad=1.0)   

Observations from the data:
1. In the mileage plot, a couple of vehicles have exceeded 250,000 miles and have low prices.
2. The tax plot shows a couple of low-priced vehicles with high tax values, which might be outliers.
3. Vehicles with an engine size of 2.5 show low prices, which might be outliers.
4. The mpg plot shows a group of vehicles at around 170 to 200 mpg, which are probably outliers.
5. The age plot shows a reasonable distribution over price.
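The suspected outliers above were identified visually. One common way to flag such points programmatically is the 1.5 x IQR rule. This is only a sketch of that rule on made-up mpg values (the helper `iqr_outlier_mask` and the series `toy_mpg` are hypothetical); the notebook itself does not remove any outliers.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical mpg values with one implausibly high entry
toy_mpg = pd.Series([45.0, 50.0, 55.0, 58.0, 60.0, 62.0, 65.0, 70.0, 180.0])

mask = iqr_outlier_mask(toy_mpg)
print(toy_mpg[mask])   # the flagged values
```

Whether flagged rows should be dropped, capped or kept is a modelling decision that depends on whether the values are data errors or genuine but rare cars.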

In [None]:
# various category_type data influence over price

fig = plt.figure(figsize=(20,5))

for index,col in enumerate(category_type):
    sns.set_style('whitegrid')
    plt.subplot(1,3,index+1)
    if(index == 0):
        plt.xticks(rotation=90)
    sns.set(font_scale = 1.0)
    sns.barplot(x=col, y='price', data = df, ci = None) # ci=None hides the error bars (seaborn >= 0.12 prefers errorbar=None)
fig.tight_layout(pad=1.0)

Observations from the data:
1. The different Skoda models have clearly different average prices.
2. For this brand, cars with automatic transmission have higher prices.
3. Hybrid vehicles are clearly more expensive than those with any other fuel type.

## 2. Data Representation

We have separated the features into two types. The numerical features have no missing values. The categorical features need to be one-hot encoded before they can be used in our model.

There are two common ways to encode such data: label encoding and one-hot encoding.

Here we have chosen one-hot encoding. It takes a column of categorical data and splits it into multiple columns, one per category. Each value is replaced by 1 or 0, depending on which column corresponds to the row's category. In our case this yields around twenty columns.

Why one-hot encoding: after label encoding, the model might wrongly assume that the integer codes carry some kind of order or hierarchy when they clearly do not. To avoid this, we one-hot encode those columns.
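The difference between the two encodings is easiest to see on a tiny column. The sketch below uses a made-up `fuelType` column (the `toy` frame is hypothetical) and plain pandas for both encodings:

```python
import pandas as pd

# Hypothetical toy column, only for illustration
toy = pd.DataFrame({"fuelType": ["Petrol", "Diesel", "Hybrid", "Petrol"]})

# Label encoding: every category becomes an integer code.
# pandas assigns the codes in sorted order: Diesel=0, Hybrid=1, Petrol=2,
# which suggests an ordering (Diesel < Hybrid < Petrol) that does not exist.
label_encoded = toy["fuelType"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(toy, columns=["fuelType"], dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column of its own category.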

### Encoding 

In [None]:
# Converting all the categorical data into numerical data using one-hot encoding
df_onehot = pd.get_dummies(df,columns=['model', 'transmission','fuelType'])         # pandas get_dummies one-hot encodes the listed columns
print(df_onehot.shape)
df_onehot.head()

In [None]:
#Splitting the Train and Test data
from sklearn.model_selection import train_test_split         # Splitting up the data as Train and Test set respectively
X = df_onehot.drop(columns=['price'])                        # X includes all data except target variable
y = df_onehot['price'].copy()                                # y has only target variable-Price
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0, test_size = 0.30) # Test size 30%

In [None]:
y_train.shape

## 3. Training

### Model Selection

To choose a suitable model, it is better to consider several algorithms and evaluate their performance through cross-validation. We are interested in the following algorithms:

1. Linear Regression
2. Gradient Boosting
3. Decision Tree
4. Random Forest

We have not standardized the features, because the algorithms chosen here do not require standardized inputs.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score                         # Evaluate a score by cross-validation
from sklearn.metrics import r2_score                                        # Coefficient of determination 

# Finding the best fit algorithm for our model
model_list = [(LinearRegression(), 'LinearRegression'),                     # List included all desired algorithms
              (GradientBoostingRegressor(),'GradientBoostingRegressor'),
              (DecisionTreeRegressor(),'DecisionTreeRegressor'),
              (RandomForestRegressor(),'RandomForestRegressor'),
              ]

model_score = []

for model, name in model_list:
    # cv=4: 4-fold cross-validation; scoring='r2': coefficient of determination
    score = cross_val_score(model, X_train, y_train, cv=4, scoring='r2')
    print(f'{name} score = {score.mean() * 100:.0f}')                      # mean R2 over the folds, as a percentage
    model_score.append([name, score.mean()])

The model selection block shows that, of all the algorithms, Gradient Boosting Regressor and Random Forest Regressor perform best. We have decided to take the Gradient Boosting Regressor and tune its hyperparameters to maximize accuracy.

Gradient boosting is a machine learning technique for regression problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
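To make the "ensemble of weak models" idea concrete, here is a deliberately simplified sketch of gradient boosting for squared error: each shallow tree is fitted to the residuals of the current ensemble, and its prediction is added with a small learning rate. The synthetic data (`X_toy`, `y_toy`) and all settings are assumptions for illustration; `GradientBoostingRegressor` does considerably more than this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (hypothetical data, not the car dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) * 5 + rng.normal(0, 0.3, size=200)

learning_rate = 0.1   # shrinks the contribution of each tree
n_stages = 50         # number of boosting stages

# Stage 0: start from a constant prediction, the mean of the target
boosted_pred = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - boosted_pred) ** 2)]

for _ in range(n_stages):
    residuals = y_toy - boosted_pred                        # what the ensemble still gets wrong
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_tree.fit(X_toy, residuals)                         # weak learner fitted to the residuals
    boosted_pred += learning_rate * weak_tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - boosted_pred) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The training error shrinks stage by stage, which is also why `n_estimators`, `max_depth` and `learning_rate` interact and are tuned together.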

### Tuning model's Hyperparameter

Now it is important to tune the hyperparameters of the model to achieve better accuracy. There is no universally optimal value for any hyperparameter; good values can only be found by trying different settings. For this model we have decided to tune the following three hyperparameters:

1. n_estimators - the number of boosting stages to perform; a larger number usually results in better performance.
2. max_depth - the maximum depth of each individual tree, which limits the number of nodes in the tree.
3. learning_rate - shrinks the contribution of each tree; we try three different values.

In total, grid search evaluates 3 x 3 x 3 = 27 hyperparameter combinations. For each combination, GridSearchCV runs cross-validation with 4 folds, which amounts to 27 x 4 = 108 model fits.
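The size of the search can be checked before starting it. This small sketch uses scikit-learn's `ParameterGrid` with the same candidate values as the grid below; the variable names (`grid_values`, `n_candidates`, `n_cv_fits`) are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the grid search of this notebook
grid_values = {
    "n_estimators": [200, 300, 500],
    "max_depth": [2, 4, 6],
    "learning_rate": [0.1, 0.3, 0.5],
}

n_candidates = len(ParameterGrid(grid_values))   # number of hyperparameter combinations
n_folds = 4                                      # cv=4 in GridSearchCV
n_cv_fits = n_candidates * n_folds               # fits done during cross-validation

print(n_candidates, n_cv_fits)
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best combination on the whole training set.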

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
# selecting the hyperparameter
param_grid = [
    {'n_estimators' : [200,300,500],                 # number of boosting stages to perform
          'max_depth' : [2,4,6],                     # maximum depth of each tree, limits its number of nodes
          'learning_rate' : [0.1,0.3,0.5]}           # learning rate: shrinks the contribution of each tree
]

# Through grid search finding the best Model
grid_search = GridSearchCV(GradientBoostingRegressor(), # estimator object
                           param_grid,                  # includes hyperparameter values
                           cv=4,                        # cross-validation splitting strategy     
                           scoring = 'r2')              # coefficient of determination

In [None]:
grid_search.fit(X_train,y_train)                   # running the grid search on the training data
y_pred = grid_search.predict(X_test)               # predicting the price with the best model found
my_model = grid_search.best_estimator_             # best estimator: the model with the best-performing parameters
my_model                                           # display the best model

my_model holds the best-performing parameter combination of all those tried. This model will be used to evaluate the results of our analysis.


## 4. Evaluation

In [None]:
my_model.fit(X_train,y_train)                     # Training the best model (optional: GridSearchCV has already refit it)
prediction = my_model.predict(X_test)             # Predicting the price values

In [None]:
# To generate a comparison table between predicted and actual Price of Car
result = X_test.copy()
result["predicted"] = my_model.predict(X_test)
result["actual"]= y_test.copy()
result =result[['predicted', 'actual']]
result['predicted'] = result['predicted'].round(2)
result.sample(10)

The comparison table above suggests that our model performs well. Let us look at the predicted and actual price values in a plot.

In [None]:
# Data visualization of actual price and predicted price of the car
XX = np.linspace(0, 40000, len(y_test))                          # evenly spaced x positions, one per test sample
plt.scatter(XX, y_pred, color="green", alpha = 0.2)              # green dots represents y_pred against XX         
plt.scatter(XX, y_test, color="blue", alpha = 0.5)               # blue dots represents y_test against XX

Prediction Error Plot:

A prediction error plot shows the actual targets from the dataset against the predicted values generated by our model. This allows us to see how much variance is in the model. Data scientists can diagnose regression models using this plot by comparing against the 45-degree line, where the prediction exactly matches the actual value.

In [None]:
import warnings
warnings.simplefilter("ignore")

In [None]:
# Result visualization
from yellowbrick.regressor import PredictionError   # To plot prediction error
# Instantiate the visualizer with the tuned model
visualizer = PredictionError(my_model)
visualizer.fit(X_train, y_train)                    # Fit the training data to the visualizer
visualizer.score(X_test, y_test)                    # Evaluate the model on the test data
visualizer.show()                                   # Finalize and render the figure

The y-axis shows the predicted price and the x-axis the actual price. The grey dashed line marks one hundred percent accuracy, that is, actual = predicted. Our model has an R-squared value of 0.95, so our best model performed really well.
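R-squared is a relative measure; error metrics in GBP are often easier to interpret. The sketch below computes R-squared, MAE and RMSE with scikit-learn on a few made-up prices (`actual` and `predicted` are hypothetical values, not this notebook's results):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices in GBP
actual = np.array([10000.0, 12000.0, 15000.0, 20000.0])
predicted = np.array([11000.0, 11500.0, 15500.0, 19000.0])

r2 = r2_score(actual, predicted)                        # share of variance explained
mae = mean_absolute_error(actual, predicted)            # average absolute error in GBP
rmse = np.sqrt(mean_squared_error(actual, predicted))   # penalizes large errors more strongly

print(f"R2 = {r2:.3f}, MAE = {mae:.0f} GBP, RMSE = {rmse:.0f} GBP")
```

For the notebook's model, the same calls would take `y_test` and `my_model.predict(X_test)` as arguments.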


## Conclusion

We conclude that hyperparameter tuning improved our model's performance, and we finally achieved an R-squared value of 0.95.

## Future Works

With this knowledge, we can also predict the price for any of the other available car data.


This brings us to the end of this notebook. This is the first regression problem I have solved; I will try a more challenging dataset and algorithm in the next one. I would greatly appreciate any feedback on this notebook. If you like it, please upvote! Thanks for visiting.