ASSIGNMENT NUMBER 1
In this assignment, you will solve the following problem: Chaky's company makes cars but has
difficulty setting their prices. Please build a simple web-based car price prediction system.
Note: You are ENCOURAGED to work with your friends, but DISCOURAGED from blindly copying
others’ work. If copying is detected, both parties will be given 0.
Note: Comments should be provided sufficiently so we know you understand. Failure to do so can
raise suspicion of possible copying/plagiarism.
Note: You will be graded upon (1) documentation, (2) experiment, (3) implementation.
Note: This is a two-week assignment, but start early.
Deliverables: The GitHub link containing the Jupyter notebook, a README.md for the GitHub repository, and
the folder of your web application called ‘app’.

***************************************************************************************************

Task 1. Preparing the datasets
Download the Car Price dataset from Google Classroom.
Perform loading, EDA, preprocessing, model selection, ..., inference.

There are some important coding considerations:
• For the feature owner, map First owner to 1, ..., Test Drive Car to 5
• For the feature fuel, remove all rows with CNG and LPG because CNG and LPG use a different
mileage system, i.e., km/kg, which is different from kmpl for Diesel and Petrol
• For the feature mileage, remove “kmpl” and convert the column to numerical type (e.g., float). Hint: use
df.mileage.str.split
• For the feature engine, remove “CC” and convert the column to numerical type (e.g., float)
• Do the same for max power
• For the feature brand, take only the first word and remove the rest
• Drop the feature torque, simply because Chaky’s company does not understand it well
• You will find out that Test Drive Cars are ridiculously expensive. Since we do not want to involve
  this, we will simply delete all samples related to it.
• Since selling price is a big number, it can cause your prediction to be very unstable. One trick is
  to first transform the label using log transform, i.e., y = np.log(df['selling_price'])
• During inference/testing, you have to transform your predicted y back before comparing with y
test, i.e., pred_y = np.exp(pred_y)

**********************************************************************************************

Task 2. Report - At the end of the notebook, please write a 2-3 paragraph summary discussing
and analysing the results in depth. Possible points of discussion:
• Which features are important? Which are not? Why?
• Which algorithm performs well? Which does not? Why? (here, you haven’t learned about any
algorithms yet, but you can search online a bit and start building an intuition)

**********************************************************************************************

Task 3. Deployment - Develop a web-based application that contains the model. Here you will be
tasked to self-study how to deploy the model into production. You have multiple options. Veteran
web developers may prefer their own web app stack, which is welcome. Those who are new to this
realm may consider a simpler, one-stop solution rather than learning the traditional, more flexible
approach.
The goal of this task is to expose/deploy our model for public use via the web interface. The main
scenario is the following:
1) Users enter the domain in their browser. They land on your page.
2) (optional) Users may need to navigate to a prediction page.
3) Users read the instructions on the page, which explain how the prediction works.
4) Users find the input form, put in the appropriate data, and click submit.
5) If users do not have information for a certain field, you have to allow them to skip that
field. In that case, we recommend filling the missing field with an imputation technique you
have learned in class.
6) A moment later (depending on your model and hardware performance), the result is returned
and printed below the form.
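
Step 5 can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper `fill_missing`, the field names, and the default values are all made up here. In a real app the defaults would be statistics (for example the training-set median or mean) saved alongside the model.

```python
# Hypothetical defaults: in practice, load the statistics computed on the training set.
DEFAULTS = {"max_power": 82.0, "mileage": 19.4, "car_age": 8.0}

def fill_missing(form_values, defaults=DEFAULTS):
    """Return model inputs, replacing skipped fields (None or '') with defaults."""
    filled = {}
    for field, default in defaults.items():
        raw = form_values.get(field)
        # A skipped field arrives as None or an empty string
        filled[field] = default if raw in (None, "") else float(raw)
    return filled

# The user typed only max_power and left the other two fields empty.
inputs = fill_missing({"max_power": "70", "mileage": "", "car_age": None})
print(inputs)
```

Keeping the defaults in one dictionary makes it easy to store them next to the pickled model, so the web app and the notebook impute with the same values.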
Deployment aside, the app should first work in your local environment (your machine). I would suggest
you use ‘Dash’ by ‘Plotly’ https://dash.plotly.com/ as a one-stop solution. Spend time studying the
‘Quick Start’ tutorial on the site and also ‘Dash Fundamental’. They are essential for understanding how
‘Dash’ works.
The deliverable for the app is a folder ‘app’ in your GitHub repository that contains ‘.Dockerfile’ and
‘docker-compose.yaml’ files and a ‘code’ folder.
Bootstrap: I know Dockerizing the app can be difficult for newcomers, and searching online can get
confusing, especially when you just trust ChatGPT to give you the right answer. So, for
those who want to postpone learning “Docker”, here is the Dockerized Dash project
link. Don’t worry, you will eventually need to do this yourself in the upcoming weeks. You cannot
escape this.

Loading the data set:

In [None]:
import pandas as pd
df_cars=pd.read_csv('Cars.csv')

Checking the loaded data set (number of samples (rows), number of variables (columns), and column names):

In [None]:
df_cars.shape

In [None]:
df_cars.columns

In [None]:
df_cars.head()

Coding the feature "owner": First Owner --> 1, Second Owner --> 2, Third Owner --> 3, Fourth & Above Owner --> 4, Test Drive Car --> 5:

In [None]:
owner_coding = {
    'First Owner': 1,
    'Second Owner': 2,
    'Third Owner': 3,
    'Fourth & Above Owner': 4,
    'Test Drive Car': 5
}

df_cars['owner'] = df_cars['owner'].map(owner_coding)

Remove rows with fuel values 'CNG' or 'LPG':

In [None]:
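# .copy() makes the filtered frame independent, so later column assignments do not trigger SettingWithCopyWarning
df_cars = df_cars[df_cars['fuel'].isin(['Petrol', 'Diesel'])].copy()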

Removing “kmpl” from the feature mileage and converting the column to a numerical type (e.g., float).
Removing “CC” from the feature engine and converting the column to a numerical type (e.g., float).
Removing “bph” from the feature max_power and converting the column to a numerical type (e.g., float).
For max_power, a single row holds only the value 'bph' with no number; it is changed to ' bph' before the split.

In [None]:
df_cars.mileage = df_cars.mileage.str.split(expand=True)[0].astype(float)

In [None]:
df_cars.engine = df_cars.engine.str.split(expand=True)[0].astype(float)

In [None]:
df_cars.loc[df_cars['max_power'] == 'bph', 'max_power'] = ' bph'

In [None]:
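# errors='coerce' turns entries with no leading number (e.g. a bare unit) into NaN instead of raising
df_cars.max_power = pd.to_numeric(df_cars.max_power.str.split(expand=True)[0], errors='coerce')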
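
As a plain-Python illustration of what the three conversions above do to each cell, the sketch below takes the first whitespace-separated token and tries to read it as a float. The helper name `parse_leading_number` and the example strings are made up for this illustration. The notebook itself works on whole columns with `str.split`.

```python
def parse_leading_number(text):
    """Return the first whitespace-separated token as a float, or None if that fails."""
    if text is None:
        return None
    tokens = str(text).split()
    if not tokens:
        return None  # empty cell
    try:
        return float(tokens[0])
    except ValueError:
        return None  # e.g. a cell that holds only the unit

# Made-up example cells, including a bare unit and missing values
examples = ["23.4 kmpl", "1248 CC", "74 bph", "bph", "", None]
parsed = [parse_leading_number(v) for v in examples]
print(parsed)
```

Returning None for a bare unit treats that row as a missing value, which can then be imputed in the preprocessing step like any other null.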

Keeping only the first word of the feature name, which is the brand:

In [None]:
df_cars['name'] = df_cars['name'].str.split(expand=True)[0]

Dropping the feature torque:

In [None]:
df_cars = df_cars.drop(columns=['torque'])

Deleting all samples related to Test Drive Cars (owner == 5):

In [None]:
df_cars = df_cars[df_cars['owner'] != 5]

Transforming the selling price with a log transform, since the selling price is a large number: y = np.log(df['selling_price']). From this cell on, the selling_price column holds log prices.

In [None]:
import numpy as np
df_cars['selling_price'] = np.log(df_cars['selling_price'])
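
The sketch below illustrates the round trip between price space and log space, and what an error in log space means for the price. It uses the standard `math` module instead of NumPy only to stay dependency-free, and the prices are made-up numbers, not values from the data set.

```python
import math

prices = [45000.0, 370000.0, 1250000.0]       # made-up selling prices
log_prices = [math.log(p) for p in prices]    # what the model is trained on
recovered = [math.exp(v) for v in log_prices] # back-transform used at inference

# A constant error of 0.1 in log space is a multiplicative error in price space:
# every predicted price is off by the same factor exp(0.1), regardless of its size.
pred_log = [v + 0.1 for v in log_prices]
pred_prices = [math.exp(v) for v in pred_log]
ratios = [p_hat / p for p_hat, p in zip(pred_prices, prices)]

print(recovered)
print(ratios)
```

This is why errors measured on the log scale (such as the MSE values later in the notebook) are best read as relative errors rather than errors in currency units.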

Since a car's production year is likely to be important for predicting its price, a new variable car_age is created to use as a feature:

In [None]:
from datetime import datetime
# car age in years, relative to the year in which the notebook is run
df_cars['car_age'] = datetime.now().year - df_cars['year']

Final check of the structure of the loaded car data set:

In [None]:
df_cars.head()

In [None]:
df_cars.info()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt 

In [None]:
import matplotlib
np.__version__, pd.__version__, sns.__version__, matplotlib.__version__

# 2. Exploratory Data Analysis

In [None]:
desc_num = df_cars.describe(include=['float', 'int'])
desc_str = df_cars.describe(include=['object'])

In [None]:
desc_str

In [None]:
desc_num

A horizontal count plot (bar plot) is created for each categorical feature to explore its distribution and spot any unusual values.

In [None]:
sns.countplot(data = df_cars, y = 'name', color = 'Grey')

In [None]:
sns.countplot(data = df_cars, y = 'fuel', color = 'Grey')

In [None]:
sns.countplot(data = df_cars, y = 'seller_type', color = 'Grey')

In [None]:
sns.countplot(data = df_cars, y = 'transmission', color = 'Grey')

In [None]:
sns.countplot(data = df_cars, y = 'owner', color = 'Grey')

In [None]:
sns.countplot(data = df_cars, y = 'seats', color = 'grey')

A distribution plot (histogram) shows how the continuous label and features are distributed. It also reveals deviations from a normal distribution, e.g., outliers, unusual observations, skewness, or kurtosis. Furthermore, these plots help choose an appropriate measure for imputing missing data. Outlier detection and imputation themselves are done in the preprocessing phase.

In [None]:
sns.displot(data = df_cars, x = 'selling_price', color = 'Green')

In [None]:
sns.displot(data = df_cars, x = 'km_driven', color = 'Green')

In [None]:
sns.displot(data = df_cars, x = 'mileage', color = 'Green')

In [None]:
sns.displot(data = df_cars, x = 'engine', color = 'Green')

In [None]:
sns.displot(data = df_cars, x = 'max_power', color = 'Green')

In [None]:
sns.displot(data = df_cars, x = 'car_age', color = 'Green')

A box plot is very useful for exploring the distribution of an attribute as well as its outliers. Creating this plot for a continuous attribute by the values of a categorical feature lets us compare distributions across subcategories. In this regression case, box plots of the label by each categorical feature are created.

In [None]:
sns.boxplot(y = df_cars["selling_price"]);
plt.ylabel("Selling Price")


In [None]:
sns.boxplot( y = df_cars["selling_price"], x = df_cars["fuel"]);
plt.ylabel("Selling Price")
plt.xlabel("Fuel Type")

In [None]:
sns.boxplot( y = df_cars["selling_price"], x = df_cars["seller_type"]);
plt.ylabel("Selling Price")
plt.xlabel("Seller Type")

In [None]:
sns.boxplot( y = df_cars["selling_price"], x = df_cars["transmission"]);
plt.ylabel("Selling Price")
plt.xlabel("Transmission Type")

In [None]:
sns.boxplot( y = df_cars["selling_price"], x = df_cars["owner"]);
plt.ylabel("Selling Price")
plt.xlabel("First Owner:1, Second Owner:2, Third Owner:3, Fourth & Above Owner:4")

A scatter plot shows how the label and the continuous features are related. It can reveal no correlation, a linear correlation, or a nonlinear correlation (not causation)
between the label and a feature, which is useful in feature selection.

In [None]:
sns.pairplot(df_cars[['selling_price', 'km_driven', 'mileage', 'engine', 'max_power', 'car_age']], diag_kind='kde', corner=True)

In [None]:
sns.scatterplot(x = df_cars['engine'], y = df_cars['selling_price'], hue=df_cars['seller_type'])

In [None]:
sns.scatterplot(x = df_cars['engine'], y = df_cars['selling_price'], hue=df_cars['transmission'])

In [None]:
sns.scatterplot(x = df_cars['engine'], y = df_cars['selling_price'], hue=df_cars['fuel'])

In [None]:
sns.scatterplot(x = df_cars['engine'], y = df_cars['selling_price'], hue=df_cars['owner'])

In [None]:
sns.scatterplot(x = df_cars['max_power'], y = df_cars['selling_price'], hue=df_cars['seller_type'])

In [None]:
sns.scatterplot(x = df_cars['max_power'], y = df_cars['selling_price'], hue=df_cars['transmission'])

In [None]:
sns.scatterplot(x = df_cars['max_power'], y = df_cars['selling_price'], hue=df_cars['fuel'])

In [None]:
sns.scatterplot(x = df_cars['max_power'], y = df_cars['selling_price'], hue=df_cars['owner'])

In [None]:
sns.scatterplot(x = df_cars['car_age'], y = df_cars['selling_price'], hue=df_cars['seller_type'])

In [None]:
sns.scatterplot(x = df_cars['car_age'], y = df_cars['selling_price'], hue=df_cars['transmission'])

In [None]:
sns.scatterplot(x = df_cars['car_age'], y = df_cars['selling_price'], hue=df_cars['fuel'])

In [None]:
sns.scatterplot(x = df_cars['car_age'], y = df_cars['selling_price'], hue=df_cars['owner'])

Correlation Matrix
A correlation matrix helps identify strong predictors of the selling price, although the scatter plots already gave some indication of this. It is also used to check whether certain features are too strongly correlated with each other.

In [None]:
# encode transmission as 0 (Manual) / 1 (Automatic)
df_cars['transmission_code'] = df_cars['transmission'].map({'Manual': 0, 'Automatic': 1})

In [None]:
# encode fuel as 0 (Petrol) / 1 (Diesel)
df_cars['fuel_code'] = df_cars['fuel'].map({'Petrol': 0, 'Diesel': 1})

To include the categorical features fuel, transmission, and seller_type in the correlation analysis, they are encoded as 0/1 features (seller_type is one-hot encoded, with Trustmark Dealer as the implicit baseline).

In [None]:
# one-hot encode seller_type; 'Trustmark Dealer' is the baseline (0 in both columns)
df_cars['individual_seller'] = df_cars['seller_type'].map({'Individual': 1, 'Dealer': 0, 'Trustmark Dealer': 0})
df_cars['Dealer_seller'] = df_cars['seller_type'].map({'Dealer': 1, 'Individual': 0, 'Trustmark Dealer': 0})

In [None]:
plt.figure(figsize = (10,5))
sns.heatmap(df_cars[['selling_price', 'km_driven', 'mileage', 'engine', 'max_power', 'seats', 'owner',\
    'car_age', 'fuel_code', 'transmission_code', 'individual_seller', 'Dealer_seller' ]].corr(), annot=True, cmap="coolwarm")

In [None]:
plt.figure(figsize = (10,5))
sns.heatmap(df_cars[['selling_price', 'km_driven', 'mileage', 'engine', 'max_power', 'seats', 'owner',\
    'car_age', 'fuel_code', 'transmission_code', 'individual_seller', 'Dealer_seller' ]].corr(method='spearman'), annot=True, cmap="coolwarm")
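
The two heatmaps differ only in the correlation method. As a dependency-free sketch of that difference, the code below computes Pearson and Spearman correlation by hand on made-up data with a monotonic but nonlinear relation. The helper names are illustrative, and the rank function assumes there are no ties, which is a simplification compared with the pandas implementation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]          # monotonic, but not linear

pearson_r = pearson(x, y)        # below 1, because the relation is curved
spearman_r = spearman(x, y)      # 1 (up to rounding), because the order is preserved
print(pearson_r, spearman_r)
```

A feature whose Spearman value is clearly higher than its Pearson value is therefore a hint of a monotonic, nonlinear relation with the label.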

In [None]:
import ppscore as pps


In [None]:
# before using pps, drop the raw categorical columns and year (car_age is used instead)
df_cars_copy = df_cars.drop(columns=['name', 'year', 'fuel', 'seller_type', 'transmission'])


In [None]:
#this needs some minor preprocessing because seaborn.heatmap unfortunately does not accept tidy data
matrix_df_cars = pps.matrix(df_cars_copy)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')

In [None]:
#plot
plt.figure(figsize = (10,5))
sns.heatmap(matrix_df_cars, vmin=0, vmax=1, cmap="Reds", linewidths=0.5, annot=True)

# 4. Feature selection
After exploring the data and calculating the Pearson and Spearman correlations and the predictive power scores, the features max_power, mileage, engine, car_age, and transmission_code appear to be the strongest predictors of the label selling_price. Because engine is strongly correlated with max_power (r = 0.7), it is left out of the model to avoid redundant features. The model below uses max_power, mileage, and car_age.

In [None]:

X = df_cars[['max_power', 'mileage', 'car_age']]
y = df_cars['selling_price']

## Creating the train and test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [None]:
y_train.shape, X_train.shape, y_test.shape, X_test.shape

# 5. Preprocessing

## Null Values

In [None]:
X_train.isna().sum()

In [None]:
X_test.isna().sum()

In [None]:
y_train.isna().sum()

In [None]:
y_test.isna().sum()

In [None]:
sns.boxplot(x = X_train['max_power'])

In [None]:
sns.boxplot(x = X_train['mileage'])

In [None]:
X_train[['max_power', 'mileage']].describe()

To impute null values for max_power and mileage, their distributions and outliers were considered. The distribution of max_power is skewed to the right and its mean differs somewhat from its median, so the median is used to fill its null values. The distribution of mileage has no outliers, so the mean is used.

In [None]:
# fill nulls with training-set statistics; assigning the result avoids chained in-place modification
X_train = X_train.fillna({'max_power': X_train['max_power'].median(), 'mileage': X_train['mileage'].mean()})

In [None]:
X_train.isna().sum()

In [None]:
# use the training-set statistics for the test set as well, so no test information leaks into preprocessing
X_test = X_test.fillna({'max_power': X_train['max_power'].median(), 'mileage': X_train['mileage'].mean()})

In [None]:
X_test.isna().sum()
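
The imputation logic above can be illustrated without pandas. In the sketch below the numbers are made up and deliberately right-skewed, so the mean is pulled up by one large value while the median is not. The fill value is computed from the training values only and then reused for the test values.

```python
from statistics import mean, median

train_max_power = [67.0, 74.0, None, 88.5, 120.0, 282.0]  # made-up, right-skewed
test_max_power = [None, 90.0]

# Statistics come from the observed training values only
observed = [v for v in train_max_power if v is not None]
fill_value = median(observed)

train_filled = [fill_value if v is None else v for v in train_max_power]
test_filled = [fill_value if v is None else v for v in test_max_power]

print("mean:", mean(observed), "median:", fill_value)
print(test_filled)
```

The same stored fill values are what a deployed app would need when a user leaves a field empty.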

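No scaling is applied because none of the selected features has very large values. Note that distance-based models such as SVR and KNeighborsRegressor are sensitive to feature scale, so standardizing the features may still be worth trying for them.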

# Modeling

In [None]:
from sklearn.linear_model import LinearRegression  
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)
yhat = lr.predict(X_test)

print('MSE: ', mean_squared_error(y_test, yhat))
print('R2: ', r2_score(y_test, yhat))
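
For reference, the two metrics printed above can be written out by hand. The sketch below is a plain-Python illustration with made-up values; the notebook itself uses the scikit-learn functions. Because the label is the log of the selling price, the MSE here is in squared log units.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [12.0, 13.0, 14.0]      # made-up log prices
perfect = [12.0, 13.0, 14.0]     # a perfect model
baseline = [13.0, 13.0, 13.0]    # always predicts the mean of y_true

print(mse(y_true, perfect), r2(y_true, perfect))
print(mse(y_true, baseline), r2(y_true, baseline))
```

An R2 of 0 therefore means "no better than always predicting the mean", which is a useful reference point when comparing the models below.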

In [None]:
from sklearn.linear_model import LinearRegression  #we are using regression models
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Libraries for model evaluation

# models that we will be using, put them in a list
algorithms = [LinearRegression(), SVR(), KNeighborsRegressor(), DecisionTreeRegressor(random_state = 0), 
              RandomForestRegressor(n_estimators = 100, random_state = 0)]

# The names of the models
algorithm_names = ["Linear Regression", "SVR", "KNeighbors Regressor", "Decision-Tree Regressor", "Random-Forest Regressor"]

In [None]:
from sklearn.model_selection import KFold, cross_val_score

# fixed random_state so the folds (and the scores) are reproducible
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for name, model in zip(algorithm_names, algorithms):
    # scores are negative MSE: the closer to zero, the better
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='neg_mean_squared_error')
    print(f"{name} - Score: {scores}; Mean: {scores.mean()}")
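
To make the cross-validation step more concrete, the sketch below generates shuffled 5-fold train/validation index splits by hand. The function `kfold_indices` is a hypothetical helper written for this illustration, not the scikit-learn implementation, and its exact shuffling differs from `KFold`. With scoring='neg_mean_squared_error', each fold yields one score that is zero or negative, and values closer to zero are better.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(kfold_indices(12, n_splits=5))
for train, validation in folds:
    print(len(train), "train samples,", len(validation), "validation samples")
```

Every sample is used for validation exactly once, so the mean of the five scores uses all of the training data without ever scoring a model on rows it was fitted on.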

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'bootstrap': [True], 'max_depth': [5, 10, None],
              'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15]}

rf = RandomForestRegressor(random_state = 1)

grid = GridSearchCV(estimator = rf, 
                    param_grid = param_grid, 
                    cv = kfold, 
                    n_jobs = -1, 
                    return_train_score=True, 
                    refit=True,
                    scoring='neg_mean_squared_error')

# Fit your grid_search
grid.fit(X_train, y_train);  #fit means start looping all the possible parameters
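
As a rough idea of the work done by the grid search, the sketch below enumerates the parameter combinations from `param_grid` with `itertools.product`. It only counts the model fits and does not train anything.

```python
from itertools import product

param_grid = {'bootstrap': [True], 'max_depth': [5, 10, None],
              'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15]}
n_splits = 5  # number of folds in kfold

# Every combination of one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each combination is fitted once per fold
n_fits = len(combinations) * n_splits
print(len(combinations), "combinations,", n_fits, "cross-validation fits")
```

With refit=True, one more fit of the best combination on the whole training set follows the cross-validation fits.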

In [None]:
grid.best_params_

In [None]:
# Find your grid_search's best score
best_mse = grid.best_score_

In [None]:
best_mse  # ignore the minus because it's neg_mean_squared_error

# Testing

In [None]:
yhat = grid.predict(X_test)
mean_squared_error(y_test, yhat)

In [None]:
#stored in this variable
#note that grid here is random forest
rf = grid.best_estimator_

rf.feature_importances_

In [None]:
#let's plot
plt.barh(X.columns, rf.feature_importances_)

In [None]:
#hmm...let's sort first
sorted_idx = rf.feature_importances_.argsort()
plt.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")

In [None]:
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(rf, X_test, y_test)

#let's plot
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
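
The idea behind permutation importance can be sketched in plain Python: shuffle one feature column, predict again, and measure how much the error grows. Everything below is a toy illustration under assumptions: the "model" is a fixed function that ignores its second feature, the data is made up, and the importance is measured as the increase in MSE. scikit-learn's `permutation_importance` instead reports the decrease in the estimator's score, which by default is R2 for regressors.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_model(row):
    # Stand-in for a fitted model: uses feature 0 and ignores feature 1
    return 2.0 * row[0]

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

X = [[float(i), float(i % 3)] for i in range(8)]  # made-up features
y = [2.0 * row[0] for row in X]                   # label depends on feature 0 only

importances = permutation_importance_sketch(toy_model, X, y)
print(importances)
```

Shuffling the ignored feature changes nothing, so its importance is zero, while shuffling the used feature makes the error grow. This is the property that the bar plot above visualizes for the random forest.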

In [None]:
import shap
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

In [None]:
#shap provides plot
shap.summary_plot(shap_values, X_test, plot_type="bar", feature_names = X.columns)

In [None]:
import pickle

In [None]:
filename = 'model/Car_selling_price.model'
# the 'model' folder must already exist; the with-block closes the file after writing
with open(filename, 'wb') as f:
    pickle.dump(grid, f)

In [None]:
df_cars[['max_power', 'mileage', 'car_age']].loc[1]

In [None]:
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [None]:
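# max_power, mileage, car_age; a DataFrame keeps the feature names the model was fitted with
sample = pd.DataFrame([[70, 15, 10]], columns=['max_power', 'mileage', 'car_age'])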

In [None]:
predicted_car_price = np.exp(loaded_model.predict(sample))
predicted_car_price