<h1> Car Price Prediction </h1>

<h3> Problem Statement </h3>

<p> A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts. </p>

<p> They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know: </p>

- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car

<p> Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market. </p>

<h3> Business Goal </h3>
<p> You are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market. </p>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings('ignore')
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Reading Data

In [None]:
raw_data = pd.read_csv("/kaggle/input/car-data/CarPrice_Assignment.csv")
df = raw_data.iloc[:, 2:].copy()
df.head()

# Examining The Datasets

In [None]:
# how many rows and columns are in the dataset?
print(f"Rows: {df.shape[0]} Columns: {df.shape[1]}")

In [None]:
# information about data
df.info()

In [None]:
df.isnull().sum()

There are no NaN values in the dataset, which simplifies data preparation.

In [None]:
# statistical data
df.describe()

# Preprocessing

In [None]:
# Splitting brand from CarName column

def splitBrand(x):
    return x.split(' ')[0]

brand = df.CarName.apply(splitBrand)
df.insert(1,"brand",brand)
df.drop('CarName',axis=1,inplace=True)
df.head()

In [None]:
# Some brand names are misspelled or abbreviated. Normalize the case first,
# then map each variant to its canonical spelling:

brand_fixes = {'maxda': 'mazda',
               'porcshce': 'porsche',
               'toyouta': 'toyota',
               'vokswagen': 'volkswagen',
               'vw': 'volkswagen'}

# Assign the result back instead of calling replace(..., inplace=True) on an
# attribute, which may not update df under newer pandas versions.
df['brand'] = df['brand'].str.lower().replace(brand_fixes)

df.brand.unique()
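Misspellings like these are easier to catch when the list of brands is scanned for near-duplicates first. The following is a minimal sketch using the standard library's `difflib`; the helper name `find_similar_names`, the sample list, and the `cutoff` of 0.75 are illustrative assumptions, not part of the original analysis.

```python
import difflib


def find_similar_names(names, cutoff=0.75):
    """Return pairs of distinct names whose similarity ratio is at least `cutoff`."""
    unique = sorted(set(names))
    pairs = []
    for i, name in enumerate(unique):
        # Compare each name only against the names after it to avoid duplicate pairs.
        rest = unique[i + 1:]
        for match in difflib.get_close_matches(name, rest, n=5, cutoff=cutoff):
            pairs.append((name, match))
    return pairs


# Hypothetical sample; in the notebook this would be df.brand.unique()
sample = ["mazda", "maxda", "toyota", "toyouta", "porsche", "porcshce", "audi"]
for pair in find_similar_names(sample):
    print(pair)
```

Abbreviations such as `vw` are too different from the full name to be caught by a similarity ratio, so they would still need a manual mapping.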

### Visualizing Data

Before the remaining preprocessing steps (scaling, dummy encoding, and the train-test split), it helps to explore the data visually and draw some initial inferences from it.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# distribution of car price
sns.histplot(df.price, kde = True)
plt.title("Distribution of Car Price")
plt.show();

In [None]:
plt.figure(figsize = (15,5))

plt.subplot(1,2,1)
a = sns.countplot(x = df.brand)
a.set_xticklabels(a.get_xticklabels(), rotation=90)
plt.title('Histogram of Brand')
a.set(xlabel = 'Car Brand', ylabel='Frequency of Brand')


plt.subplot(1,2,2)
a = sns.countplot(x = df.fueltype)
plt.title('Histogram of Fueltype')
a.set(xlabel = 'Car Fueltype', ylabel='Frequency of Fueltype')

plt.show()

<p>Looking at the charts:</p>
    <ul>
        <li> Toyota is the most frequent brand in the dataset.</li>
        <li> Gas-fueled cars outnumber diesel cars.</li>
    </ul>

In [None]:
plt.figure(figsize = (15,5))

plt.subplot(1,2,1)
a = sns.countplot(x = df.carbody)
plt.title('Histogram of Car Body')
a.set(xlabel = 'Car Body', ylabel='Frequency of Body')


plt.subplot(1,2,2)
temp = df.groupby(['brand'])['price'].mean()
a = sns.barplot(x = temp.index, y = temp.values)
plt.title('Brand and Mean of Price')
a.set_xticklabels(a.get_xticklabels(), rotation=90)
a.set(xlabel = 'Car Brand', ylabel='Car Price')

plt.show()

<p>Looking at the charts:</p>
    <ul>
        <li> Sedan is the most common body style in the dataset.</li>
        <li> Jaguar cars are the most expensive on average.</li>
    </ul>

In [None]:
plt.figure(figsize = (15,5))

plt.subplot(1,2,1)
temp = df.groupby(['carbody'])['price'].mean()
a = sns.barplot(x = temp.index, y = temp.values)
plt.title('Body and Mean of Price')
a.set(xlabel = 'Car Body', ylabel='Car Price')


plt.subplot(1,2,2)
temp = df.groupby(['enginetype'])['price'].mean()
a = sns.barplot(x = temp.index, y = temp.values)
plt.title('Engine and Mean of Price')
a.set(xlabel = 'Car Engine', ylabel='Car Price')

plt.show()

<p>Looking at the charts:</p>
    <ul>
        <li> Cars with a hardtop body are the most expensive on average.</li>
        <li> Cars with a dohcv engine are the most expensive on average.</li>
    </ul>
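A group mean can be driven by very few cars, so it may be worth reading each average next to its group size. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show the pattern; in the notebook the same aggregation could be applied to `df` before scaling.

```python
import pandas as pd

# Toy data: the prices and group sizes here are invented for illustration.
toy = pd.DataFrame({
    "enginetype": ["ohc", "ohc", "ohc", "dohcv"],
    "price": [7000.0, 9000.0, 8000.0, 31000.0],
})

# Mean price per group together with the number of cars behind each mean.
summary = (
    toy.groupby("enginetype")["price"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

A category that ranks first with a count of one or two is a single observation rather than a trend, which matters when deciding how much weight to give that chart.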

In [None]:
plt.figure(figsize = (15,5))

# correlation between horsepower and price
plt.subplot(1,2,1)
a = sns.scatterplot(x = df.horsepower, y = df.price)
plt.title('Horsepower and Price I')
a.set(xlabel = 'Horsepower', ylabel='Car Price')

# correlation between horsepower and price with enginetype
plt.subplot(1,2,2)
a = sns.scatterplot(x = df.horsepower, y = df.price, hue = df.enginetype)
plt.title('Horsepower and Price II')
a.set(xlabel = 'Horsepower', ylabel='Car Price')

plt.show()

In [None]:
plt.figure(figsize = (15,5))

# correlation between horsepower and price with fueltype
plt.subplot(1,2,1)
a = sns.scatterplot(x = df.horsepower, y = df.price, hue = df.fueltype)
plt.title('Horsepower and Price III')
a.set(xlabel = 'Horsepower', ylabel='Car Price')

# correlation between carlength and price with enginetype
plt.subplot(1,2,2)
a = sns.scatterplot(x = df.carlength, y = df.price, hue = df.enginetype)
plt.title('Car Length and Price')
a.set(xlabel = 'Car Length', ylabel='Car Price')


plt.show()

In [None]:
plt.figure(figsize = (15,5))

# correlation between enginesize and price with enginelocation
plt.subplot(1,2,1)
a = sns.scatterplot(x = df.enginesize, y = df.price, hue = df.enginelocation)
plt.title('Engine Size and Price I')
a.set(xlabel = 'Enginesize', ylabel='Car Price')

# correlation between enginesize and price with enginetype
plt.subplot(1,2,2)
a = sns.scatterplot(x = df.enginesize, y = df.price, hue = df.enginetype)
plt.title('Engine Size and Price II')
a.set(xlabel = 'Engine Size', ylabel='Car Price')

plt.show()

In [None]:
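# correlation between the numeric columns
# (the categorical columns are excluded explicitly, since newer pandas versions
# raise an error when corr() is called on non-numeric data)
plt.figure(figsize = (15,7))
sns.heatmap(df.select_dtypes(include = "number").corr(), annot = True)
plt.show();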
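A full heatmap can be hard to read when the question is simply which variables move with price. One option, sketched here on synthetic data, is to pull out the `price` column of the correlation matrix and sort it by absolute value. The column names mirror the dataset, but the numbers are randomly generated and purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
enginesize = rng.normal(130, 40, n)

# Synthetic stand-in for the car data (values are made up).
toy = pd.DataFrame({
    "enginesize": enginesize,
    "citympg": -0.1 * enginesize + rng.normal(0, 2, n),
    "carheight": rng.normal(54, 2, n),
    "fueltype": rng.choice(["gas", "diesel"], n),
})
toy["price"] = 100 * toy["enginesize"] + rng.normal(0, 1000, n)

# Correlation of every numeric column with price, strongest first.
price_corr = (
    toy.select_dtypes(include="number")
    .corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)
```

Sorting by absolute value keeps strong negative relationships (such as fuel economy in this toy example) near the top instead of at the bottom of the list.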

### Scaling, Dummies and Train-Test-Split

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Feature Scaling
scaler = MinMaxScaler()
num_columns = np.array(df.select_dtypes(include = ["int64","float64"]).columns)
df[num_columns] = scaler.fit_transform(df[num_columns])

In [None]:
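# Dummies
cat_columns = np.array(df.select_dtypes(include = ["object"]).columns)

for i in cat_columns:
    if i != "brand": 
        # prefix keeps the new column names unique when two columns share a value
        # (e.g. "four" and "two" occur in both doornumber and cylindernumber)
        df = pd.concat([df, pd.get_dummies(df[i], prefix = i, drop_first = True)], axis = 1)
        
# brand is not encoded above, so it is dropped here along with the encoded originals
df = df.drop(list(cat_columns), axis = 1)  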
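One thing to watch when calling `pd.get_dummies` on one column at a time: the new columns are named after the category values alone, so two source columns that share a value produce duplicate column names. The toy frame below is an assumption made for illustration; it contrasts the unprefixed result with the prefixed names that `pd.get_dummies` generates when given the whole frame.

```python
import pandas as pd

# Toy frame: both columns contain the values "two" and "four".
toy = pd.DataFrame({
    "doornumber": ["two", "four", "four"],
    "cylindernumber": ["four", "six", "two"],
})

# Column-by-column encoding without a prefix: names come from the values only.
no_prefix = pd.concat(
    [pd.get_dummies(toy[col], drop_first=True) for col in toy.columns],
    axis=1,
)
print(list(no_prefix.columns))    # ['two', 'six', 'two']

# Encoding the frame directly prefixes each dummy with its source column.
with_prefix = pd.get_dummies(toy, columns=list(toy.columns), drop_first=True)
print(list(with_prefix.columns))  # ['doornumber_two', 'cylindernumber_six', 'cylindernumber_two']
```

Unique names also keep later steps readable, for example the axis labels of a feature importance plot.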

In [None]:
df.head()

In [None]:
# train test split
x = df.drop("price", axis = 1)
y = df.price
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state = 42)
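In the cells above, the scaler is fit on the full dataset before the split, so the minimum and maximum of the test rows influence how the training rows are scaled. A common alternative is to split first and fit the scaler on the training portion only. The sketch below shows that ordering on synthetic data; the frame, the column names, and the upper-case variable names are assumptions chosen to avoid clashing with the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the numeric features and the price target.
toy = pd.DataFrame({
    "horsepower": rng.uniform(50, 250, 100),
    "enginesize": rng.uniform(60, 330, 100),
})
toy["price"] = 80 * toy["horsepower"] + rng.normal(0, 500, 100)

X = toy.drop("price", axis=1)
y = toy["price"]

# 1) Split first ...
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2) ... then learn the min/max from the training rows only
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)

# 3) ... and reuse those training statistics for the test rows.
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.describe().loc[["min", "max"]])
```

The tree-based models used below split on feature thresholds, so rescaling the features should not change their predictions; the ordering matters more if a scale-sensitive model is added later. Leaving the target unscaled, as in this sketch, keeps the error metrics in the original price units.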

# Creating and Training Models, and Evaluation (Before Tuning)

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

### Model 1: Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor()
dt_model.fit(x_train, y_train)

y_pred = dt_model.predict(x_test)
print("Before Tuning\n------------------------")
print('MAE:', mean_absolute_error(y_test,y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2:', r2_score(y_test, y_pred))
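The metrics used in this notebook measure different things: MAE and RMSE are errors in the units of the target (RMSE is the square root of MSE), while R² is a unitless score where 1.0 is a perfect fit. The helper below, `regression_report`, is a hypothetical convenience function (not part of scikit-learn) that computes all four in one place; the sample values are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # same units as the target
        "R2": r2_score(y_true, y_pred),
    }


# Small made-up example
report = regression_report([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Lower is better for the three error metrics and higher is better for R², so the two kinds of numbers should not be compared under the same label.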

In [None]:
# visualizing predicted and true values
sns.scatterplot(x = y_test, y = y_pred)
plt.title("Predicted and True Values Before Tuning")
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show();

### Model 2: Random Forest Regression

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor()
rf_model.fit(x_train, y_train)

y_pred = rf_model.predict(x_test)
print("Before Tuning\n------------------------")
print('MAE:', mean_absolute_error(y_test,y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2:', r2_score(y_test, y_pred))

In [None]:
# visualizing predicted and true values
sns.scatterplot(x = y_test, y = y_pred)
plt.title("Predicted and True Values Before Tuning")
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show();

# Parameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

#### For Decision Tree Regression Model

In [None]:
dt_params = {"max_depth" : [5,7,9,11,12,13,14],
             "min_samples_leaf":[8,9,10,20,30],
             "min_samples_split": [30,50,60,80,100],
             "max_leaf_nodes":[40,50,60,70,80,90] }

dt_cv_model = GridSearchCV(dt_model, dt_params, cv = 5, verbose = 2, n_jobs = -1).fit(x_train, y_train)

In [None]:
dt_cv_model.best_params_ 
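Besides `best_params_`, a fitted `GridSearchCV` exposes the cross-validated score of every candidate, which helps judge whether the winning combination is clearly better or only marginally so. The sketch below runs a deliberately small grid on synthetic data from `make_regression`; the grid values and the data are assumptions for illustration. For regressors the default scoring is the estimator's own `score` method, which is R².

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (not the car data).
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
search.fit(X, y)

# One row per parameter combination, best first.
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated R2 of the best candidate
print(top)

# With the default refit=True, the best model is already refitted on all the
# data passed to fit() and can be used directly.
best_tree = search.best_estimator_
```

Because `best_estimator_` is refitted by default, rebuilding the model by hand from `best_params_` is optional rather than required.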

#### For Random Forest Regression Model

In [None]:
rf_params = {"n_estimators": [100,200, 500],
            "min_samples_split": [10,20,30,40]}

rf_cv_model = GridSearchCV(rf_model, rf_params, cv = 5, verbose = 2, n_jobs = -1).fit(x_train, y_train)

In [None]:
rf_cv_model.best_params_

# Prediction and Evaluation (After Tuning)

#### For Decision Tree Regression Model

In [None]:
dt_tunned_model = DecisionTreeRegressor(max_depth = dt_cv_model.best_params_["max_depth"],
                                        min_samples_leaf = dt_cv_model.best_params_["min_samples_leaf"],
                                        min_samples_split = dt_cv_model.best_params_["min_samples_split"],
                                        max_leaf_nodes = dt_cv_model.best_params_["max_leaf_nodes"])

dt_tunned_model.fit(x_train, y_train)
y_pred = dt_tunned_model.predict(x_test)
print("After Tuning\n------------------------")
print('MAE:', mean_absolute_error(y_test,y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2:', r2_score(y_test, y_pred))

In [None]:
# visualizing predicted and true values
sns.scatterplot(x = y_test, y = y_pred)
plt.title("Predicted and True Values After Tuning")
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show();

#### For Random Forest Regression Model

In [None]:
rf_tunned_model = RandomForestRegressor(n_estimators = rf_cv_model.best_params_["n_estimators"],
                                        min_samples_split = rf_cv_model.best_params_["min_samples_split"])

rf_tunned_model.fit(x_train, y_train)
y_pred = rf_tunned_model.predict(x_test)
print("After Tuning\n------------------------")
print('MAE:', mean_absolute_error(y_test,y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2:', r2_score(y_test, y_pred))

In [None]:
# visualizing predicted and true values
sns.scatterplot(x = y_test, y = y_pred)
plt.title("Predicted and True Values After Tuning")
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show();

# Feature Importance

This section examines how strongly each variable influences the predicted car price, using the feature importances of the tuned models. The higher a variable's importance, the more the model relies on it when predicting price.

#### For Decision Tree Regression Model

In [None]:
plt.figure(figsize = (15,7))
# feature_importances_ sums to 1, so multiplying by 100 expresses it as a percentage
Importance = pd.DataFrame({'Importance':dt_tunned_model.feature_importances_*100}, 
                          index = x_train.columns)

a = sns.barplot(x = Importance.index, y = Importance.Importance)
a.set_xticklabels(a.get_xticklabels(), rotation=90)
plt.show();

#### For Random Forest Regression Model

In [None]:
plt.figure(figsize = (15,7))
# feature_importances_ sums to 1, so multiplying by 100 expresses it as a percentage
Importance = pd.DataFrame({'Importance':rf_tunned_model.feature_importances_*100}, 
                          index = x_train.columns)

a = sns.barplot(x = Importance.index, y = Importance.Importance)
a.set_xticklabels(a.get_xticklabels(), rotation=90)
plt.show();
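The bar charts above show impurity-based importances, which are computed from the training data. A complementary view, not used in the original analysis, is permutation importance: it measures how much the held-out score drops when one feature's values are shuffled. The sketch below computes both on synthetic data and sorts them; the feature names and coefficients are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: the target depends strongly on enginesize, weakly on
# carheight, and not at all on the noise column.
X = pd.DataFrame({
    "enginesize": rng.normal(size=300),
    "carheight": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 5 * X["enginesize"] + 0.5 * X["carheight"] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances (normalized to sum to 1), largest first.
impurity_rank = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Permutation importances on the held-out split, largest first.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
perm_rank = pd.Series(
    perm.importances_mean, index=X.columns
).sort_values(ascending=False)

print(impurity_rank)
print(perm_rank)
```

When the two rankings agree, as they should in this toy setup, the conclusion about which variables drive price is more trustworthy than either ranking alone.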