# 1) Introduction

This project uses a dataset of used Skoda cars in the UK, available at https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes

Our objective is to use machine learning algorithms to predict the price of a used Skoda car in the UK based on features such as model, year of introduction and engine size. To achieve this objective we will need the following Python libraries:
1. scikit-learn for developing prediction models
2. pandas for data management
3. seaborn for data visualization

Further, we will be using the following machine learning algorithms from the scikit-learn library:
1. LinearRegression
2. RandomForestRegressor
3. GradientBoostingRegressor

Moreover, to evaluate the models we will use two metrics:
1. R square
2. Mean Absolute Error
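
To make the two metrics concrete, here is a minimal sketch that computes them by hand. The four actual and predicted prices are invented for illustration and do not come from the Skoda data; in the notebook itself the same quantities come from scikit-learn's metrics module.

```python
# Toy values, invented purely for illustration (not from the Skoda data).
actual = [10000.0, 12000.0, 8000.0, 15000.0]
predicted = [11000.0, 11500.0, 8500.0, 14000.0]

n = len(actual)
mean_actual = sum(actual) / n

# Mean absolute error: average size of the prediction errors, in price units.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n

# R square: 1 - (residual sum of squares / total sum of squares).
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("R square:", round(r2, 4))
```

R square is unitless and measures the share of price variance explained, while MAE stays in price units, which is why the two are reported together.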

# 2) Problem statement

Our objective is to predict the price of a used Skoda car in the UK based on the following features:
1. Model of the Skoda car
2. Year of introduction
3. Transmission type
4. Mileage of the car
5. Fuel Type used by the car
6. Tax on the car
7. Miles Per Gallon
8. Engine Size of the car

# 3) Detailed methodology

Let's import the required libraries, which include:
1. numpy for numerical analysis and as a requirement for working with pandas
2. pandas for data management
3. matplotlib and seaborn for data visualization

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We import the dataset of used Skoda cars using pandas.
The "data" variable holds all of our data in a pandas data frame.

In [None]:
data = pd.read_csv('../input/used-car-dataset-ford-and-mercedes/skoda.csv')

# 3a) Data Description and Visualization

Let's look at the data type of each column of the data frame using the dtypes attribute in pandas.

In [None]:
data.dtypes

There are a total of 9 columns in our data frame, 3 of which are categorical.
We now use the pandas describe() function for a statistical summary of the data.

In [None]:
data.describe()

There are 6,000+ entries in the dataset. The means and quantiles for each column can be seen in the statistical summary.
Let's visualize scatterplots of each of these columns against price.

In [None]:
g = sns.PairGrid(data, y_vars=["price"], x_vars=["year", "mileage", "tax", "mpg", "engineSize"])
g.map(plt.scatter)

The scatterplots suggest a roughly linear relationship between these variables and price. Hence, this data can be fed to a supervised regression algorithm to develop predictions.

Additionally, we see some outliers in the data which need to be removed. We therefore proceed with data cleaning first.

# 3b) Data Cleaning

In the following cell, I have developed a loop over the numeric columns (price, year, mileage, tax, mpg, engineSize). For each column, I sort the data and remove the last 2% of entries, that is, the rows with the highest 2% of values. Only the top 2% is removed because the outliers are mostly concentrated in the higher percentiles; the lower percentiles are left untouched.

Once the loop has removed the top 2% for every column, the result is stored in a new variable called clean_data. Because the six cuts are applied one after another, about 11% of the rows are removed in total (0.98^6 is roughly 0.886).

In [None]:
sorted_data = data

# For each numeric column, sort ascending and drop the top 2% of the remaining rows.
for col in ["price", "year", "mileage", "tax", "mpg", "engineSize"]:
    sorted_data = sorted_data.sort_values(by=[col])
    num_of_outliers = len(sorted_data.index) * 0.02
    selected_values = len(sorted_data.index) - round(num_of_outliers, 0)
    sorted_data = sorted_data[:int(selected_values)]

clean_data = sorted_data
clean_data

With outliers removed, we now have 5552 rows. To see the difference, let us plot each variable against price again using seaborn.

In [None]:
g = sns.PairGrid(clean_data, y_vars=["price"], x_vars=["year", "mileage", "tax", "mpg", "engineSize"])
g.map(plt.scatter)

We see a clear difference between this plot and the earlier scatter plot. The outliers are removed and the data is now ready for feature engineering.
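
As an alternative to the sequential trimming in this section, the cutoffs can be computed once on the untrimmed data with quantiles. The sketch below is not the method used in this notebook: it runs on a small invented frame (the name toy and its values are hypothetical) and uses a 0.75 quantile only so that the toy example actually drops rows; the equivalent of the 2% cut would be quantile(0.98).

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are invented.
toy = pd.DataFrame({
    "price": [5000, 6000, 7000, 8000, 90000],
    "mileage": [10000, 20000, 30000, 250000, 15000],
})

numeric_cols = ["price", "mileage"]

# Compute every cutoff once, on the untrimmed data.
# 0.75 is used only so the toy example drops something; the notebook's cut is 2%.
cutoffs = toy[numeric_cols].quantile(0.75)

# Keep rows whose values are at or below the cutoff in every numeric column.
keep = (toy[numeric_cols] <= cutoffs).all(axis=1)
trimmed = toy[keep]

print(cutoffs)
print(trimmed)
```

With this variant each column's threshold does not depend on the order in which the columns are processed, at the cost of not removing an exact number of rows per column.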

# 3c) Feature Engineering

I have used the get_dummies() function in pandas to encode the categorical columns in our clean_data variable.

In [None]:
clean_coded_data= pd.get_dummies(clean_data, columns=['transmission','model', 'fuelType'])
clean_coded_data

The model, transmission and fuelType columns are now encoded as indicator columns.
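
To show what get_dummies() produces, here is a small sketch on three invented rows (the frame toy is hypothetical; only the column names match the dataset). It also shows the optional drop_first argument, which this notebook does not use.

```python
import pandas as pd

# Invented rows for illustration; the column names match the dataset.
toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Manual"],
    "fuelType": ["Petrol", "Diesel", "Petrol"],
    "price": [9000, 15000, 8000],
})

encoded = pd.get_dummies(toy, columns=["transmission", "fuelType"])

# Each category becomes its own indicator column named <column>_<category>.
print(list(encoded.columns))

# drop_first=True drops one level per column, which avoids perfectly
# collinear indicator columns in a linear model with an intercept.
encoded_reduced = pd.get_dummies(
    toy, columns=["transmission", "fuelType"], drop_first=True
)
print(list(encoded_reduced.columns))
```

Tree-based models are not affected by the extra indicator column, so dropping a level mainly matters when interpreting linear regression coefficients.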

We observe from the data that some columns have much higher values on average than others. For instance, the engineSize column has values from 0 to 2.5 while the price column has values from 995 to 91874.

Many machine learning models work better when the data is scaled or standardized. I used MinMaxScaler() from scikit-learn to scale the values of each numerical column. It maps the values to the range 0 to 1, giving all columns a homogeneous scale.

However, interestingly, it did not improve my predictions in any way. I then tried StandardScaler(), another scikit-learn transformer, which rescales each column to a mean of 0 and a standard deviation of 1 without changing the shape of its distribution. This did not improve the predictions either. That is consistent with the models used here: the predictions of ordinary least squares do not change when features are rescaled, and tree-based models depend only on the ordering of feature values.

Therefore, I decided to proceed without scaling or standardizing the data. However, I have left the following cell commented out to show the logic I used.

In [None]:
# code for scaling values with MinMaxScaler

#from sklearn.preprocessing import MinMaxScaler
#scaler= MinMaxScaler()

#def scaleColumns(df, cols_to_scale):
#    for col in cols_to_scale:
#        df[col] = pd.DataFrame(scaler.fit_transform(pd.DataFrame(clean_coded_data[col])),columns=[col])
#    return df

#clean_coded_data = scaleColumns(clean_coded_data, ['year','price', 'mileage', 'tax', 'mpg', 'engineSize'])
#clean_coded_data

# code for standardizing values with StandardScaler

#from sklearn.preprocessing import StandardScaler
#scaler= StandardScaler()

#def scaleColumns(df, cols_to_scale):
#    for col in cols_to_scale:
#        df[col] = pd.DataFrame(scaler.fit_transform(pd.DataFrame(clean_coded_data[col])),columns=[col])
#    return df

#clean_coded_data = scaleColumns(clean_coded_data, ['year','price', 'mileage', 'tax', 'mpg', 'engineSize'])
#clean_coded_data
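
For reference, here is a minimal sketch of the same two transformations on an invented frame (toy and its values are hypothetical). It passes index=toy.index when rebuilding the data frame so that the scaled rows stay aligned with the original rows. One detail worth checking in the commented-out version: pd.DataFrame(...) creates a fresh index starting at 0, which may not line up with the index of clean_coded_data after sorting and trimming, so assigning it back could introduce missing values.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Invented values; the non-contiguous index mimics clean_data after sorting and trimming.
toy = pd.DataFrame(
    {"mileage": [10000.0, 30000.0, 50000.0], "engineSize": [1.0, 1.5, 2.0]},
    index=[7, 2, 5],
)
cols = ["mileage", "engineSize"]

# Passing index=toy.index keeps the scaled rows aligned with the original rows.
minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[cols]), columns=cols, index=toy.index
)
standard = pd.DataFrame(
    StandardScaler().fit_transform(toy[cols]), columns=cols, index=toy.index
)

print(minmax)    # each column now spans 0 to 1
print(standard)  # each column now has mean 0 and unit variance
```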

# 3d) Model Development and Testing


Before we begin using models, we must split the data into dependent and independent variables. The x variable holds the independent columns and y holds the price column.

In [None]:
x= clean_coded_data.drop(columns='price')
y= clean_coded_data['price'].copy()

Let's split the data into training and testing subsets. We import the train_test_split function from scikit-learn. The size of the testing data is set to 30%. The rows are still selected at random, but setting the random_state parameter to 0 makes the split reproducible across runs.

We also import the metrics module, which will be used to compute R square and Mean Absolute Error for every algorithm we run.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
xTrain, xTest, yTrain, yTest= train_test_split(x, y, random_state=0, test_size=.30)

We will be using the following algorithms from scikit-learn to train on the data:
1. LinearRegression
2. RandomForestRegressor
3. GradientBoostingRegressor

For each algorithm, I print the R square and Mean Absolute Error as benchmarks to compare their results.

**Using Linear Regression**

We import the LinearRegression class from scikit-learn. The LinearRegression algorithm finds the line (more generally, the hyperplane) of best fit by minimizing the sum of squared differences between the observed and predicted prices.

1. The regressor variable holds an instance of LinearRegression().
2. The fit() function trains the model on xTrain and yTrain, which hold our training data.
3. The predict() function predicts the price based on xTest, which holds our testing data.
4. The prices predicted by the algorithm are stored in the yPredict variable.

m11 and m21 hold the benchmark values (R square and mean absolute error).

In [None]:
from sklearn.linear_model import LinearRegression

regressor= LinearRegression()
regressor.fit(xTrain, yTrain)
yPredict= regressor.predict(xTest)

m11= metrics.r2_score(yTest, yPredict)*100
m21= metrics.mean_absolute_error(yTest, yPredict)

print('R Square %: ', m11)
print('Mean Absolute Error: ', m21)
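
To illustrate what least squares fitting does, here is a minimal sketch using numpy on invented, perfectly linear data (the relation price = 20000 - 0.1 * mileage is an assumption made up for the example). It recovers the intercept and slope that minimize the sum of squared residuals, which is the same criterion LinearRegression uses.

```python
import numpy as np

# Invented, perfectly linear toy data: price = 20000 - 0.1 * mileage.
mileage = np.array([10000.0, 20000.0, 30000.0, 40000.0])
price = 20000.0 - 0.1 * mileage

# Add a column of ones so the fit includes an intercept.
X = np.column_stack([np.ones_like(mileage), mileage])

# Solve for the coefficients that minimise the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

print("intercept:", round(intercept, 2))
print("slope:", round(slope, 4))
```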

**Using Random Forest Regressor**

Now we will use the RandomForestRegressor class from scikit-learn. RandomForestRegressor builds many decision trees and averages their predictions. Of its parameters, we set only n_estimators, which is the number of trees in the forest. Adding trees generally makes the prediction more stable, with diminishing returns and longer training time.

1. The regressor variable holds an instance of RandomForestRegressor() with n_estimators=150.
2. The fit() function trains the model on xTrain and yTrain, which hold our training data.
3. The predict() function predicts the price based on xTest, which holds our testing data.
4. The prices predicted by the algorithm are stored in the yPredict variable.

m12 and m22 hold the benchmark values (R square and mean absolute error).

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor= RandomForestRegressor(n_estimators=150)
regressor.fit(xTrain, yTrain)
yPredict= regressor.predict(xTest)

m12= metrics.r2_score(yTest, yPredict)*100
m22= metrics.mean_absolute_error(yTest, yPredict)

print('R Square %: ', m12)
print('Mean Absolute Error: ', m22)

**Using Gradient Boosting Regressor**

Finally, we use the GradientBoostingRegressor class from scikit-learn. GradientBoostingRegressor is a powerful algorithm that builds weak regression trees one after another, each correcting the errors of the trees before it, and combines them for an accurate prediction. It has several parameters, but we will use n_estimators and max_depth in the next steps to optimize the prediction model.

1. The regressor variable holds an instance of GradientBoostingRegressor() with the defaults n_estimators=100 and max_depth=3.
2. The fit() function trains the model on xTrain and yTrain, which hold our training data.
3. The predict() function predicts the price based on xTest, which holds our testing data.
4. The prices predicted by the algorithm are stored in the yPredict variable.

m13 and m23 hold the benchmark values (R square and mean absolute error).

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
regressor= GradientBoostingRegressor()
regressor.fit(xTrain, yTrain)
yPredict= regressor.predict(xTest)

m13= metrics.r2_score(yTest, yPredict)*100
m23= metrics.mean_absolute_error(yTest, yPredict)

print('R Square %: ', m13)
print('Mean Absolute Error: ', m23)

**Parameter Optimization for Gradient Boosting Regressor**

We need to find a pair of n_estimators and max_depth parameters for the GradientBoostingRegressor algorithm that maximizes prediction accuracy.

Our approach is to run a loop 100 times, each time picking random values of n_estimators and max_depth. These values are used to run the algorithm, and the corresponding R square is stored in a list named results. Each entry of results is itself a list of 3 values: R square, the n_estimators value and the max_depth value.

In [None]:
import random

results = []
for i in range(100):
    n_est = random.randrange(100, 200)  # integer from 100 to 199
    # max_depth must be an integer; recent scikit-learn versions reject floats
    max_dep = random.randint(3, 10)

    regressor = GradientBoostingRegressor(n_estimators=n_est, max_depth=max_dep)
    regressor.fit(xTrain, yTrain)
    yPredict = regressor.predict(xTest)

    r2 = metrics.r2_score(yTest, yPredict)

    output = [r2, n_est, max_dep]
    results.append(output)

results[:5]

We need to select the entry in results with the highest R square, which is done using the max() function (lists are compared element by element, so the first element, R square, decides). This gives our best configuration of n_estimators and max_depth. Finally, we run GradientBoostingRegressor one last time with these parameter values and update the benchmark values m13 and m23.

In [None]:
r2, n_estimators,max_depth= max(results)

regressor= GradientBoostingRegressor(n_estimators=n_estimators, max_depth=max_depth)
regressor.fit(xTrain, yTrain)
yPredict= regressor.predict(xTest)

m13= metrics.r2_score(yTest, yPredict)*100
m23= metrics.mean_absolute_error(yTest, yPredict)

print('R Square %: ', m13)
print('Mean Absolute Error: ', m23)
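
Because the search in this section picks parameters by their score on the test set, the final test score is likely to be somewhat optimistic. A common alternative, not used in this notebook, is to score candidates with cross-validation on the training data only. The sketch below uses scikit-learn's RandomizedSearchCV on synthetic data generated with make_regression; the data, the variable names and the small n_iter value are assumptions chosen so the example runs quickly on its own.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for x and y so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(100, 200),  # integers from 100 to 199
        "max_depth": randint(3, 11),        # integers from 3 to 10
    },
    n_iter=5,      # kept small so the sketch runs quickly; the notebook draws 100
    cv=3,          # candidates are scored on folds of the training data only
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
# The test set is used once, after the parameters have been chosen.
test_r2 = search.score(X_test, y_test)
print("test R square:", test_r2)
```

With this setup the test set plays no part in choosing the parameters, so the reported test score is a fairer estimate of performance on unseen cars.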

# 3e) Model Results

Since we have stored the benchmarks for all models, we now create a pandas data frame to show them in a table.

In [None]:
metric_results = {'Model': ['Linear Regression', 'Random Forest', 'Gradient Boosting'],
                  'R Square': [m11, m12, m13],
                  'MAE': [m21, m22, m23]}
# Named metrics_table so it does not overwrite the imported sklearn metrics module
metrics_table = pd.DataFrame(metric_results)
metrics_table

Based on this table, we conclude that GradientBoostingRegressor is effective at predicting used car prices. It explains 95% of the variation in the price of used Skoda cars in the UK, with a mean absolute error below £910.

In [None]:
prediction= pd.DataFrame({'actual price': yTest, 'predicted price': yPredict})
sns.relplot(data=prediction, x='actual price', y='predicted price')

We can observe from the above graph that there is room to improve the predictions for more expensive cars. Our model gives good results for less expensive cars.
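
One way to put a number on this observation is to compute the mean absolute error separately for cheaper and more expensive cars. The sketch below uses invented actual and predicted prices in place of yTest and yPredict, and the band labels are hypothetical names chosen for the example.

```python
import pandas as pd

# Invented actual/predicted pairs standing in for yTest and yPredict.
prediction = pd.DataFrame({
    "actual price": [5000, 6000, 7000, 8000, 20000, 22000, 25000, 30000],
    "predicted price": [5200, 5900, 7300, 7800, 18000, 23500, 22000, 26000],
})

prediction["abs error"] = (
    prediction["actual price"] - prediction["predicted price"]
).abs()

# Split the cars into two equally sized bands by actual price.
prediction["band"] = pd.qcut(
    prediction["actual price"], q=2, labels=["cheaper half", "more expensive half"]
)

# Mean absolute error within each band.
mae_by_band = prediction.groupby("band", observed=True)["abs error"].mean()
print(mae_by_band)
```

Applied to the real test predictions, a clearly larger error in the upper band would confirm what the scatter plot suggests.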


# 4) Business Value of Results

We are now able to explain the variation in prices of used Skoda cars in the UK. Our model explains 95% of the variance in the prices of these cars. Further, with a mean absolute error of about £910, we can predict prices using the following features:
* Model of the Skoda car
* Year of introduction
* Transmission type
* Mileage of the car
* Fuel Type used by the car
* Tax on the car
* Miles Per Gallon
* Engine Size of the car

Price prediction is an important tool for car dealers because it helps them manage their profits. A dealer in Skoda cars can use these predictions to set buying prices and manage profit margins. It also helps with inventory management: as expensive cars are bought less often, a dealer can use this prediction model to optimize the mix of expensive and affordable cars in stock, thus minimizing inventory costs.

Another use can be in car modification and repair services. Car modification businesses can set their prices based on features added and removed. For instance, the owner of a workshop might be able to see the market value of converting a manual transmission car to automatic. The coefficients of the transmission feature in the linear regression model show how much transmission type influences the price of a car, which helps in managing commissions and profits.

On the other hand, a customer can use this model to find the market value of a used Skoda car. This helps the customer pay the true market price for the car, be aware of the dealer's commission and bargain better while buying.

# 5) Further Scope

The same approach can also be used to predict prices for other brands such as BMW, Nissan and Toyota.

Although our model explains 95% of the variance in price, our mean absolute error can be improved. A high MAE alongside a high R square can still make a model unhelpful in practice.

Finally, as already discussed, our model does a good job of predicting prices for less expensive cars. However, it can be improved for expensive used cars. This can be done either by adding more data rows on expensive cars or by dividing the data into expensive and less expensive cars and building two separate prediction models.