# <font color='blue'>Car Price Prediction</font>

This dataset was taken from the Kaggle community, at this link: https://www.kaggle.com/hellbuoy/car-price-prediction

#### Problem Statement:

Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

* Which variables are significant in predicting the price of a car
* How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.

#### Business Goal:

We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.


* Please note: the dataset is provided for learning purposes. Please don't draw inferences about real-world scenarios from it.

In addition to the Business Goal, let's try to answer 5 business questions:

* 1 - Car prices are around how many dollars?
* 2 - What kind of car appears most in the dataset?
* 3 - What type of fuel is most used?
* 4 - What is the most common engine power?
* 5 - What is the most common insurance risk classification?

## Importing required libraries

In [None]:
import pandas as pd 
import matplotlib as mat
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.graph_objects as go

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# openpyxl is required by pd.read_excel to read .xlsx files
%pip install openpyxl

## Opening the dataset

In [None]:
df = pd.read_csv('/kaggle/input/car-price-prediction/CarPrice_Assignment.csv')
df.head(5)

## Dataset description

In [None]:
# The data dictionary is trimmed and its columns renamed to keep the table presentable

dicionario = pd.read_excel('/kaggle/input/car-price-prediction/Data Dictionary - carprices.xlsx')
dicionario = dicionario.iloc[3:29,[7,11]].reset_index().drop(columns=['index']) # Remove blank rows and columns
dicionario = dicionario.rename(columns={'Unnamed: 7' : 'Column', 'Unnamed: 11' : 'Description'}) # Rename the columns
dicionario

## Analyzing the dataset

In [None]:
# Information about attribute types
df.info()

In [None]:
# Number of rows and columns
df.shape

In [None]:
# Statistical description of numeric dataset attributes
df.describe()

In [None]:
# Count null values per column (there are none)
df.isnull().sum()

In [None]:
# Outliers are mainly found in our target column: price

plt.figure(figsize=(20,5))
df.boxplot()
plt.show()

### Removing outliers in 'price' using IQR method.

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x=df['price'])
plt.show()

In [None]:
Q1 = df['price'].quantile(.25)
Q3 = df['price'].quantile(.75)

Q1,Q3

In [None]:
IQR = Q3 - Q1
IQR

In [None]:
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

lower,upper

#### Keeping only the rows between the lower and upper bounds

In [None]:
df = df[df['price'] >= lower] 
df = df[df['price'] <= upper]
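
The IQR steps above can be collected into a small reusable helper. The sketch below is illustrative only: the function names `iqr_bounds` and `drop_iqr_outliers` and the toy prices are assumptions, not part of the original notebook.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(frame, column, k=1.5):
    """Keep only the rows whose `column` value lies inside the IQR fences."""
    lower_fence, upper_fence = iqr_bounds(frame[column], k)
    return frame[frame[column].between(lower_fence, upper_fence)]


# Toy data (made up): five ordinary prices and one extreme value
toy = pd.DataFrame({"price": [5000, 6000, 7000, 8000, 9000, 45000]})
lower_demo, upper_demo = iqr_bounds(toy["price"])
filtered = drop_iqr_outliers(toy, "price")
print(lower_demo, upper_demo)
print(filtered)
```

`Series.between` is inclusive on both ends by default, so it matches the `>= lower` and `<= upper` filters used in the notebook.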

In [None]:
# There were 205 rows. Now we have 190 rows.
df.shape

In [None]:
df.describe()

In [None]:
plt.figure(figsize=(20,5))
df.boxplot()
plt.show()

## Business Questions

### 1 - Car prices are around how many dollars?
> Most prices are concentrated between 5000.00 and 10000.00

In [None]:
plt.title('Car price distribution', fontsize = 15)
sns.violinplot(x = 'price', data = df)
plt.show()
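
A visual impression such as the one stated for this question can be backed with a number by computing the share of rows that fall in the range. This is a minimal sketch on made-up prices; the `toy_prices` values are assumptions and are not taken from the dataset.

```python
import pandas as pd

# Made-up prices, for illustration only
toy_prices = pd.Series([5200, 6800, 7500, 9900, 12000, 16500, 8800, 7300])

# between() returns booleans; their mean is the fraction of True values
share_in_range = toy_prices.between(5000, 10000).mean()
print(f"{share_in_range:.0%} of the toy prices fall between 5000 and 10000")
```

On the notebook's data the same idea would read `df['price'].between(5000, 10000).mean()`.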

### 2 - What kind of car appears most in the dataset?

> The sedan-type cars are the most popular.

In [None]:
plt.title('Car body types', fontsize = 15)
sns.countplot(x = 'carbody', data = df)  # recent seaborn versions require keyword arguments here
plt.show()

### 3 - What type of fuel is most used?
> Gasoline is the most used fuel, with 90.0%, while diesel is used in 10.0% of cases.

In [None]:
fueltype = df['fueltype'].value_counts()
total = fueltype.sum()

porcentagem = fueltype / total
# Plot the pie chart; labels come from the index so they always match the counts
plt.title('Most used fuels', fontsize = 15)
plt.pie(porcentagem, labels=porcentagem.index, autopct='%1.1f%%');
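
The division of counts by their total can also be done in one step with `value_counts(normalize=True)`. The sketch below uses a made-up fuel column (`toy_fuel` is an assumption) so that it runs on its own.

```python
import pandas as pd

# Made-up column: nine gas cars and one diesel car
toy_fuel = pd.Series(["gas"] * 9 + ["diesel"])

# normalize=True returns fractions that sum to 1 instead of raw counts
shares = toy_fuel.value_counts(normalize=True)
print((shares * 100).round(1))
```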

### 4 - What is the most common engine power?
> The distribution shows that engines with power between 60 and 120 hp are the most common.

In [None]:
sns.histplot(df['horsepower'], kde=True)  # distplot is deprecated in recent seaborn releases
plt.title('Horsepower distribution', fontsize = 15)
plt.xlabel('Horsepower', fontsize = 15)
plt.ylabel('Total')
plt.show()

### 5 - What is the most common insurance risk classification?

> A zero (0) rating is the most common among all insurance risk ratings. The assigned insurance risk rating ranges from +3 (the car is risky) to -3 (the car is probably very safe).

In [None]:
df['symboling'].value_counts().sort_values().plot.bar()
plt.title('Insurance Risk Rating', fontsize = 15)
plt.xlabel('Risk Rating', fontsize = 15)
plt.ylabel('Total')
plt.show()

## Preparing the data

In [None]:
# Correlation between columns

plt.figure(figsize=(12,7))
correlacao = df.corr(numeric_only=True)  # text columns are skipped; pandas 2.x raises an error without this
sns.heatmap(correlacao, annot = True);

In [None]:
# Analyzing only the correlation between the target variable (price) with the other columns

correlations = df.corr(numeric_only=True)['price'].drop('price')
correlations.sort_values()

> The enginesize, curbweight, horsepower, carwidth and carlength columns have a strong positive correlation with the price column, while the highwaympg and citympg columns have a strong negative correlation.
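
Since both strong positive and strong negative correlations matter, it can help to rank the columns by the absolute value of their correlation with the target. The following is a sketch on a made-up frame; the `toy` values are assumptions chosen only to show the mechanics.

```python
import pandas as pd

# Made-up numbers, for illustration only
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 170],
    "citympg": [38, 31, 27, 22, 18],
    "carheight": [54, 52, 55, 53, 54],
    "price": [6000, 8000, 10500, 13000, 16000],
})

# Pearson correlation of every column with the target
corr_with_price = toy.corr()["price"].drop("price")

# Order by strength, ignoring the sign, while keeping the signed values
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```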

#### Transforming categorical variables into numerical variables so that they can also be used as model features

In [None]:
df['fueltype'] = df['fueltype'].map({'gas':'0','diesel':'1'})
df['aspiration'] = df['aspiration'].map({'std':'0','turbo':'1'})
df['doornumber'] = df['doornumber'].map({'two':'2','four':'4'})
df['carbody'] = df['carbody'].map({'convertible':'0','hatchback':'1','sedan':'2','wagon':'3','hardtop':'4'})
df['drivewheel'] = df['drivewheel'].map({'rwd':'0','fwd':'1','4wd':'2'})
df['enginelocation'] = df['enginelocation'].map({'front':'0','rear':'1'})
df['cylindernumber'] = df['cylindernumber'].map({'four':'4','six':'6','five':'5','three':'3','twelve':'12','two':'2','eight':'8'})

In [None]:
# Converting the mapped string codes to integers

df['fueltype'] = df['fueltype'].astype(int)
df['aspiration'] = df['aspiration'].astype(int)
df['doornumber'] = df['doornumber'].astype(int)
df['carbody'] = df['carbody'].astype(int)
df['drivewheel'] = df['drivewheel'].astype(int)
df['enginelocation'] = df['enginelocation'].astype(int)
df['cylindernumber'] = df['cylindernumber'].astype(int)
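
Integer codes for columns such as `carbody` or `drivewheel` imply an order between categories that may not exist. One commonly used alternative, not applied in this notebook, is one-hot encoding with `pd.get_dummies`. The sketch below uses a made-up frame (`toy_cars` is an assumption).

```python
import pandas as pd

# Made-up rows, for illustration only
toy_cars = pd.DataFrame({
    "carbody": ["sedan", "hatchback", "wagon", "sedan"],
    "horsepower": [111, 68, 95, 102],
})

# Each category becomes its own 0/1 column; numeric columns are left untouched
encoded = pd.get_dummies(toy_cars, columns=["carbody"], dtype=int)
print(encoded)
```

Whether this helps depends on the model: tree-based models often cope with integer codes, while linear and distance-based models are more sensitive to the implied ordering.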

In [None]:
# Only numeric variables

numerical_columns = df.select_dtypes(include = ['int32','int64','float'])
numerical_columns.head()

## Separating training data and testing data

In [None]:
from sklearn.model_selection import train_test_split

X = df[['symboling', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
       'carlength', 'carwidth', 'carheight', 'curbweight',
       'cylindernumber', 'enginesize', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']] # Only numeric values (21 features)

y= df['price'] # Target column

X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.3,random_state=55)

In [None]:
# Printing the results

print("{0:0.2f}% are training data".format((len(X_train)/len(df.index)) * 100))
print("{0:0.2f}% are testing data".format((len(X_test)/len(df.index)) * 100))

In [None]:
X_train.shape # Rows and columns for training

In [None]:
X_test.shape # Rows and columns for testing

## Training and testing the model

In [None]:
# Model evaluation metrics

from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score # Using RMSE, MAE and R2 as metrics
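
To make the three metrics concrete, the sketch below computes R², RMSE and MAE directly from their definitions with NumPy. The helper name `regression_metrics` and the toy values are assumptions; the notebook itself relies on the scikit-learn functions imported above.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute R2, RMSE and MAE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))             # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))      # root mean squared error
    ss_res = np.sum(residuals ** 2)              # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                   # coefficient of determination
    return {"R2": r2, "RMSE": rmse, "MAE": mae}


# Toy values: the errors are 2, -2 and 3
demo = regression_metrics([10, 20, 30], [12, 18, 33])
print(demo)
```

RMSE and MAE are expressed in the units of the target (dollars here), while R² is unitless and equals 1 for a perfect fit.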

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train,y_train)
prediction_lr = lr.predict(X_test)

print("Model\t\t\t\t R2 \t\t RMSE \t\t MAE")
print("""Linear Regression \t\t {:.2f} \t\t {:.4} \t {:.2f}""".format(r2_score(y_test, prediction_lr),
                                                                     np.sqrt(mean_squared_error(y_test, prediction_lr)),
                                                                     mean_absolute_error(y_test, prediction_lr)))

### Decision Trees Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor(random_state=42) 
dtr.fit(X_train, y_train) 
prediction_dtr = dtr.predict(X_test)

print("Model\t\t\t\t R2 \t\t RMSE \t\t MAE")
print("""Decision Tree Regressor \t {:.2f} \t\t {:.4} \t {:.2f}""".format(r2_score(y_test,prediction_dtr), 
                                                                     np.sqrt(mean_squared_error(y_test, prediction_dtr)), 
                                                                     mean_absolute_error(dtr.predict(X_test), y_test)))

### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=10, random_state=42)
rfr.fit(X_train, y_train)  # y_train is already 1-D; Series.ravel() is deprecated in recent pandas
prediction_rfr = rfr.predict(X_test)

print("Model\t\t\t\t R2 \t\t RMSE \t\t MAE")
print("""Random Forest Regressor \t {:.2f} \t\t {:.4} \t {:.2f}""".format(r2_score(y_test,prediction_rfr), 
                                                                     np.sqrt(mean_squared_error(y_test, prediction_rfr)), 
                                                                     mean_absolute_error(rfr.predict(X_test), y_test)))

### KNeighborsRegressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor(n_neighbors=5)
knr.fit(X_train,y_train)
prediction_knr = knr.predict(X_test)


print("Model\t\t\t\t R2 \t\t RMSE \t\t MAE")
print("""KNeighborsRegressor \t\t {:.2f} \t\t {:.4} \t {:.2f}""".format(r2_score(y_test,prediction_knr), 
                                                                     np.sqrt(mean_squared_error(y_test, prediction_knr)), 
                                                                     mean_absolute_error(knr.predict(X_test), y_test)))

### Support Vector Regressor

In [None]:
from sklearn.svm import SVR

svr = SVR(kernel='linear', C=1.0, epsilon=0.2)
svr.fit(X_train, y_train)
prediction_svr = svr.predict(X_test)

print("Model\t\t\t\t R2 \t\t RMSE \t\t MAE")
print("""Support Vector Regressor \t {:.2f} \t\t {:.4} \t {:.2f}""".format(r2_score(y_test,prediction_svr), 
                                                                     np.sqrt(mean_squared_error(y_test, prediction_svr)), 
                                                                     mean_absolute_error(svr.predict(X_test), y_test)))
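
The cells in this section repeat the same fit, predict and score steps for each model. One possible way to reduce the repetition is a loop over a dictionary of models, sketched below on synthetic data from `make_regression`. The names `X_demo`, `models` and `summary` are assumptions, and only two of the five models are shown to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the car features and prices
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=55)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "Model": name,
        "R2": r2_score(y_te, pred),
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
        "MAE": mean_absolute_error(y_te, pred),
    })

# One row per model, one column per metric
summary = pd.DataFrame(rows).set_index("Model")
print(summary.round(2))
```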

## Evaluating with Cross Validation

> Cross-validation is a widely used technique for evaluating the performance of machine learning models. In k-fold cross-validation the data is partitioned into k folds; each fold is used once for testing while the remaining folds are used for training, and the k scores are then averaged. Comparing these scores with the score on the training data is a good way to detect overfitting. The cross_val_score function receives the model, the feature data, the target values, the number of folds (cv) and the scoring method.
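
As a minimal, self-contained illustration of `cross_val_score`, the sketch below scores a linear model on synthetic data that is exactly linear, so every fold should reach an R² very close to 1. The data and the explicit `KFold` with shuffling are assumptions made for the example; the notebook itself passes `cv=5`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, noise-free data: y = 3x + 2
X_demo = np.arange(50, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + 2.0

# Five folds; each fold is held out once while the other four train the model
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2")

print(scores)
print("Mean:", scores.mean())
```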

In [None]:
from sklearn.model_selection import cross_val_score # Importing the Cross Validation

### Linear Regression

In [None]:
lr_scores = cross_val_score(lr, X_train,y_train, cv=5, scoring='r2')
print(lr_scores)
print("Mean:", lr_scores.mean())

### Decision Trees Regressor

In [None]:
dtr_scores = cross_val_score(dtr, X_train,y_train, cv=5, scoring='r2')
print(dtr_scores)
print("Mean:", dtr_scores.mean())

### Random Forest Regressor

In [None]:
rfr_scores = cross_val_score(rfr, X_train,y_train, cv=5, scoring='r2')
print(rfr_scores)
print("Mean:", rfr_scores.mean())

### KNeighborsRegressor

In [None]:
knr_scores = cross_val_score(knr, X_train,y_train, cv=5, scoring='r2')
print(knr_scores)
print("Mean:", knr_scores.mean())

### Support Vector Regressor

In [None]:
svr_scores = cross_val_score(svr, X_train,y_train, cv=5, scoring='r2')
print(svr_scores)
print("Mean:", svr_scores.mean())

## Comparing and evaluating models

In [None]:
# Table summary for better viewing

resultados = pd.DataFrame([
    {'Algorithm' : 'Linear Regression', 'Original' : r2_score(y_test,prediction_lr), 'Cross-validation': lr_scores.mean()},
    {'Algorithm' : 'Decision Trees Regressor', 'Original' : r2_score(y_test,prediction_dtr), 'Cross-validation': dtr_scores.mean()},
    {'Algorithm' : 'Random Forest Regressor', 'Original' : r2_score(y_test,prediction_rfr), 'Cross-validation': rfr_scores.mean()},
    {'Algorithm' : 'KNeighborsRegressor', 'Original' : r2_score(y_test,prediction_knr), 'Cross-validation': knr_scores.mean()},
    {'Algorithm' : 'Support Vector Regressor', 'Original' : r2_score(y_test,prediction_svr), 'Cross-validation': svr_scores.mean()}
])

resultados.sort_values(by=['Cross-validation'], ascending=False)

> Some models score higher and others lower under cross-validation than on the single held-out test split, with the coefficient of determination (R²) as the scoring metric. Note that the cross-validation scores are computed on the training data only, so the two columns are estimates from different subsets of the data.

> KNeighborsRegressor, for example, scored better on the test split than under cross-validation, while Linear Regression scored higher under cross-validation.

> Linear Regression had the best cross-validation score, with a mean R² of 0.739355. On the test split, Random Forest Regressor had the best result, with an R² of 0.897571.

## Saving the best model

In [None]:
import pickle
filename = 'rfr_model.sav'
with open(filename, 'wb') as f:  # the with-block closes the file after writing
    pickle.dump(rfr, f)

### Loading the model and forecasting with new datasets

> (X_test and y_test should be new datasets prepared with the same cleanup and transformation procedure)

In [None]:
with open(filename, 'rb') as f:  # the with-block closes the file after reading
    load_model = pickle.load(f)
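
A quick way to check that a saved model survives the round trip is to compare predictions before and after reloading. The sketch below does this with a small linear model and a temporary directory; the file name `demo_model.sav` and the toy data are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on y = 2x + 1
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")
    with open(path, "wb") as f:
        pickle.dump(model, f)        # serialize the fitted model
    with open(path, "rb") as f:
        restored = pickle.load(f)    # load it back as a new object

same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.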

In [None]:
resultado = load_model.predict(X_test[:100])

In [None]:
plt.figure(figsize=(12,8))

plt.title('Real values vs Predicted values')
plt.ylabel('Price')
plt.plot(resultado) # Predictions for the first 100 rows of X_test
plt.plot(y_test.values[:100]) # Real prices: first 100 values of y_test

plt.legend(['Predictions', 'Real Values'])
plt.show()

### Regression Test

In [None]:
df.columns

In [None]:
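# 21 features, named as in training so that scikit-learn does not warn about missing feature names
test = pd.DataFrame([[3,0,0,4,2,0,1,88.6,168.8,157.3,65.6,2585,5,130,2.20,3.40,9.0,120,5500,22,30]], columns=X.columns)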

In [None]:
rfr.predict(test)

## Conclusion

* After answering the business questions and finding some strong correlations between the target column (price) and other dataset variables, five machine learning regression algorithms were trained and evaluated to predict car prices.


* Linear Regression had the best cross-validation score, with a mean R² of 0.739355. On the held-out test split, Random Forest Regressor had the best result, with an R² of 0.897571. Cross-validation was used to evaluate the performance of the models.


* Random Forest Regressor was used to predict values because it obtained the best R² on the test split, but other models can also be used.


* Many other techniques can and should be tested, such as feature normalization.
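
As a starting point for the last item, the sketch below compares a k-nearest-neighbors regressor with and without standardized features, using a scikit-learn pipeline. The synthetic data is an assumption built to exaggerate the effect: the target depends only on a small-scale feature, while an irrelevant large-scale feature dominates the raw distances. Results on the car dataset may differ.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small_scale = rng.uniform(0, 1, size=200)        # informative feature in [0, 1]
large_scale = rng.uniform(0, 10_000, size=200)   # irrelevant feature with a huge range
X_demo = np.column_stack([small_scale, large_scale])
y_demo = 10.0 * small_scale

raw_knn = KNeighborsRegressor(n_neighbors=5)
# The pipeline refits the scaler inside each fold, which avoids leaking test-fold statistics
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

raw_scores = cross_val_score(raw_knn, X_demo, y_demo, cv=5, scoring="r2")
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring="r2")

print("Mean R2 without scaling:", raw_scores.mean())
print("Mean R2 with scaling:   ", scaled_scores.mean())
```

Distance-based and margin-based models such as KNeighborsRegressor and SVR are usually the most sensitive to feature scale, while tree-based models are largely unaffected.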