# Car Price Prediction

## Problem Statement

A Chinese automobile company, Geely Auto, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

* Which variables are significant in predicting the price of a car
* How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.

## Business goal

We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

## Variable information

**symboling:** 	The assigned insurance risk rating. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. (Categorical)

**carname:**	Name of car make and model. (Categorical)

**fueltype:**	Car fuel type i.e. gas or diesel. (Categorical)

**aspiration:**	Aspiration used in a car. Mode of air intake for the internal combustion engine i.e. natural (standard) or turbocharger. (Categorical)

**doornumber:**	Number of doors in a car i.e. two or four. (Categorical)

**carbody:**	Body of car i.e. convertible or hardtop or hatchback or sedan or wagon. (Categorical)

**drivewheel:**	Type of drive wheel. The wheel connected to the motor/engine transmission, which causes the vehicle to move i.e. Front-wheel drive or Rear-wheel drive or Four-wheel drive. (Categorical)

**enginelocation:**	Location of car engine i.e. front or rear. (Categorical)

**wheelbase:**	Length of wheelbase of car. Wheelbase is the distance between centers of front and rear wheels. (Numeric)

**carlength:**	Length of car. (Numeric)

**carwidth:**	Width of car. (Numeric)

**carheight:**	Height of car. (Numeric)

**curbweight:**	The weight of a car without occupants or baggage. (Numeric)

**enginetype:**	Type of engine i.e. l, ohc, ohcf, ohcv, dohc, dohcv, rotor. (Categorical)

**cylindernumber:**	Number of cylinders used inside the engine i.e. two to twelve. (Categorical)

**enginesize:**	Engine size, or the engine displacement in the car. Engine displacement is the swept volume of all the pistons inside the cylinders of a reciprocating engine in a single movement from top dead centre to bottom dead centre. (Numeric)

**fuelsystem:**	Fuel system used in the car i.e. 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi. The fuel system provides the fuel-air mixture to the engine. (Categorical)

**boreratio:**	Bore ratio of car. It is the ratio between cylinder bore diameter and piston stroke. (Numeric)

**stroke:**     Stroke or volume inside the engine. It is the distance travelled by the piston in each cycle. (Numeric)

**compressionratio:**	Compression ratio of car. It is the ratio of the maximum to minimum volume in the cylinder of an internal combustion engine. (Numeric)

**horsepower:**	Horsepower of the engine. The power an engine produces is called horsepower. In mathematical terms, one horsepower is the power needed to move 550 pounds one foot in one second. (Numeric)

**peakrpm:**	RPM at which engine delivers peak horsepower. (Numeric)

**citympg:**	Mileage in city. (Numeric)

**highwaympg:**	Mileage on highway. (Numeric)

**price:**      Price of car. (Numeric) (Dependent variable)

In [None]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# display settings
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# filtering warnings
import warnings
warnings.filterwarnings("ignore")

## Reading and understanding data

In [None]:
# reading data from csv and creating dataframe
df = pd.read_csv("../input/car-price-prediction/CarPrice_Assignment.csv")

In [None]:
# displaying first 5 rows
df.head()

In [None]:
# dropping the ID column as it will not be useful in predicting our dependent variable
df.drop(columns="car_ID", inplace=True)

In [None]:
# dimensions of dataframe
print("No. of rows: {}\tNo. of columns: {}".format(*df.shape))

In [None]:
# columns info
df.info()

In [None]:
# descriptive statistics
df.describe().T

## Feature engineering

In [None]:
# % of missing values
(df.isna().sum() / df.shape[0]) * 100

**Missing values:**
* There are no missing values observed.

### 1. Symboling

In [None]:
# converting from numeric to categorical variable type
df["symboling"] = df["symboling"].astype(str)

### 2. CarName

In [None]:
# extracting make from the values
df["make"] = df['CarName'].str.split(' ', expand=True)[0]

In [None]:
# unique values in make
df["make"].unique()

**Correcting typos in make values:**

maxda = mazda

Nissan = nissan

porcshce = porsche

toyouta = toyota

vokswagen = vw = volkswagen
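
As an illustration of the corrections listed above, the following sketch applies the same mapping to a small hand-made frame. The car names in `toy` are made up for the example and are not rows from the data set.

```python
import pandas as pd

# Toy car names (hypothetical values, not rows from the data set)
toy = pd.DataFrame({"CarName": ["maxda rx3", "Nissan latio", "porcshce panamera",
                                "toyouta tercel", "vokswagen rabbit", "vw dasher",
                                "audi 100 ls"]})

# The make is the first space-separated token of the car name
toy["make"] = toy["CarName"].str.split(" ", expand=True)[0]

# Same typo-to-canonical mapping as listed above
corrections = {"maxda": "mazda", "Nissan": "nissan", "porcshce": "porsche",
               "toyouta": "toyota", "vokswagen": "volkswagen", "vw": "volkswagen"}
toy["make"] = toy["make"].replace(corrections)

makes = toy["make"].tolist()
print(makes)
```

Values that are not keys of the mapping (here `audi`) pass through `replace` unchanged.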

In [None]:
# correcting the typo errors in make values
df["make"] = df["make"].replace({"maxda":"mazda",
                               "Nissan":"nissan",
                               "porcshce":"porsche",
                               "toyouta":"toyota",
                               "vokswagen":"volkswagen",
                               "vw":"volkswagen"})

In [None]:
# dropping the car name variable
df.drop(columns="CarName", inplace=True)

### 3. Creating price category

In [None]:
# categorizing price into standard and high-end
df["price_category"] = df["price"].apply(lambda x: "standard" if x <= 18500 else "high-end")

In [None]:
# creating list of numeric and categorical columns
col_numeric = list(df.select_dtypes(exclude="object"))

col_categorical = list(df.select_dtypes(include="object"))

## Exploratory Data Analysis

In [None]:
# visualizing the car make
plt.figure(figsize=(15,6))
df["make"].value_counts().sort_values(ascending=False).plot.bar()
plt.xticks(rotation=90)
plt.xlabel("Make", fontweight="bold")
plt.ylabel("Count", fontweight="bold")
plt.title("Countplot of Car Make", fontweight="bold")
plt.show()

**Insights:**

* Toyota is the most frequent make in the data set.
* Mercury is the least frequent make in the data set.

In [None]:
# visualizing the other categorical variables
plt.figure(figsize=(15,20))
for i,col in enumerate(col_categorical[:-2], start=1):
    plt.subplot(5,2,i)
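    # passing the column as x explicitly (positional data arguments are not accepted by newer seaborn versions)
    sns.countplot(x=df[col])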
    plt.xlabel(col, fontweight="bold")
plt.show()

**Insights:**

* `symboling`: A majority of the autos are neither safe nor risky. There appear to be more risky autos than safe ones.
* `fueltype`: The majority of the automobiles use gas.
* `aspiration`: The majority of the automobiles use standard aspiration.
* `doornumber`: The majority of the automobiles are 4 door models.
* `carbody`: Sedan is the most common body type, convertible the least common.
* `drivewheel`: Front-wheel drive is the most common, four-wheel drive the least common.
* `enginelocation`: Almost all the models have the engine at the front.
* `enginetype`: Most of the models have the 'ohc' engine type.
* `cylindernumber`: Most of the models are 4 cylinder models.
* `fuelsystem`: The majority of the models have 'mpfi' or '2bbl' fuel systems.

In [None]:
# pair plot to understand the correlation between the numeric variables (except price)
sns.pairplot(df[col_numeric[:-1]])
plt.show()

In [None]:
# heatmap to visualize the Pearson correlation matrix between the numeric variables (except price)
# restricting to the numeric columns, since corr() fails on object columns in recent pandas versions
corr = df[col_numeric[:-1]].corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, cmap="RdYlGn", square=True, mask=np.triu(np.ones_like(corr, dtype=bool), k=1))
plt.show()

**Insights:**

* Model specifications (`wheelbase`, `carlength`, `carwidth`, `carheight`, `curbweight`, `enginesize`, `boreratio`, `stroke`, `compressionratio`, `horsepower`) and performance metrics (`peakrpm`, `citympg`, `highwaympg`) are mostly negatively correlated.

In [None]:
# visualizing our dependent variable for outliers and skewness
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(x=df["price"])
plt.title("Boxplot for outliers detection", fontweight="bold")

plt.subplot(1,2,2)
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df["price"], kde=True)
plt.title("Distribution plot for skewness", fontweight="bold")

plt.show()

**Insights:**

* There are a few outliers towards the higher price range, suggesting that there are a few high-priced models.
* The distribution of price is right-skewed, so a transformation of the target may be worth considering.
* Most of the models are within the 5000 to 18000 price range.
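
To make the transformation idea above concrete, here is a minimal sketch of how a log transform affects skewness. It uses a synthetic log-normal sample (`toy_price`, an assumption for illustration) rather than the car prices, so the numbers it prints say nothing about this data set.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draw; NOT the car data)
rng = np.random.default_rng(0)
toy_price = pd.Series(rng.lognormal(mean=9.4, sigma=0.5, size=500))

skew_raw = toy_price.skew()           # positive for right-skewed data
skew_log = np.log(toy_price).skew()   # much closer to 0 after the log transform

print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

If a model were trained on the log of the price, its predictions would have to be mapped back with `np.exp` before being compared with actual prices.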

In [None]:
# average price of each make
df.groupby("make")["price"].mean().sort_values(ascending=False).plot.bar(figsize=(12,6))
plt.title("Average price of each make", fontweight="bold")
plt.ylabel("Price", fontweight="bold")
plt.xlabel("Make", fontweight="bold")
plt.show()

**Insights:**

* `jaguar` has the highest average price.
* `chevrolet` has the lowest average price.

In [None]:
# proportion of high-end models in each make
pd.crosstab(df["make"], df["price_category"], normalize="index").plot.bar(stacked=True, figsize=(10,5))
plt.xlabel("Make", fontweight="bold")
plt.ylabel("Proportion", fontweight="bold")
plt.title("Proportion of high-end models in each make", fontweight="bold")
plt.show()

**Insights:**

* `buick`, `jaguar` and `porsche` have only high-end models.
* `bmw` has 80% of its models in the high-end category.
* `volvo` has an equal proportion of high-end and standard price models.
* `audi`, `nissan` and `saab` have less than 33% of their models in the high-end category.
* The rest (the majority) of the car makers have only standard price models.
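
The proportions read off the chart come from a row-normalized crosstab. The sketch below reproduces that computation on a tiny invented frame (the makes `a`, `b`, `c` and their prices are hypothetical), using the same 18500 threshold as the `price_category` column.

```python
import pandas as pd

# Invented makes and prices, for illustration only
toy = pd.DataFrame({
    "make": ["a", "a", "a", "a", "b", "b", "c"],
    "price": [9000, 12000, 20000, 25000, 30000, 41000, 7000],
})
toy["price_category"] = toy["price"].apply(lambda x: "standard" if x <= 18500 else "high-end")

# normalize="index" divides each row by its total, so every make sums to 1
shares = pd.crosstab(toy["make"], toy["price_category"], normalize="index")
print(shares)
```

Because each row is a share of a make's own models, a make with a single model shows as 0% or 100% high-end, which is worth keeping in mind when reading the chart.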

In [None]:
# price analysis for each carbody type
fig, ax = plt.subplots(1,2, figsize=(15,5))

pd.crosstab(df["carbody"], df["price_category"], normalize="index").plot.bar(stacked=True, ax=ax[0])
ax[0].set(xlabel="Carbody type", ylabel="Proportion", title="Proportion of high-end models in each carbody type")

df.groupby("carbody")["price"].mean().sort_values(ascending=False).plot.bar(ax=ax[1])
ax[1].set(xlabel="Carbody type", ylabel="Average price", title="Average price of models in each carbody type")

plt.show()

**Insights:**

* `hardtop` and `convertible` have the highest average prices and also a high proportion of high-end models.

In [None]:
# visualizing distribution of price with the other categorical variables
plt.figure(figsize=(15,20))
for i,col in enumerate(col_categorical[:-2], start=1):
    plt.subplot(5,2,i)
    sns.violinplot(data=df, x=col, y="price", split=True, hue="price_category")
    plt.xlabel(col, fontweight="bold")
plt.show()

**Insights:**

* `symboling`, `fueltype`, `doornumber` and `carbody` don't seem to have much correlation with `price`.
* The safest models (symboling -2) seem to be standard priced only.
* `price` and `drivewheel` seem to have little correlation. All 4 wheel drive models are standard priced.
* `price` and `enginelocation` seem to be correlated. All the rear engine models are high-end models.
* `price` and `enginetype` seem to have little correlation. While standard priced models have all types of engines, high-end models have only 'dohc', 'ohc', 'ohcv' and 'ohcf' engine types.
* `price` and `cylindernumber` seem to be correlated. As the number of cylinders increases, the price of the model increases.
* `price` and `fuelsystem` seem to have little correlation. High-end models have only 'idi' and 'mpfi' fuel systems.

In [None]:
# visualizing distribution of price with continuous variables
col_numeric_pc = col_numeric.copy()
col_numeric_pc.append("price_category")
sns.pairplot(df[col_numeric_pc], hue="price_category")
plt.show()

In [None]:
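# heatmap to visualize the Pearson correlation between price and the other numeric variables
# restricting to the numeric columns, since corr() fails on object columns in recent pandas versions
corr = df[col_numeric].corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, cmap="RdYlGn", square=True, mask=np.triu(np.ones_like(corr, dtype=bool), k=1))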
plt.show()

**Insights:**

* `price` has a high positive correlation with `curbweight`, `enginesize` and `horsepower`.
* `price` has a high negative correlation with `citympg` and `highwaympg`.
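
The same reading can be done numerically by sorting the `price` column of the correlation matrix. The sketch below shows the pattern on synthetic data: the column names mirror the data set, but the values and coefficients are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data with a built-in positive and negative relationship to price
rng = np.random.default_rng(1)
weight = rng.normal(2500, 500, size=200)
toy = pd.DataFrame({
    "curbweight": weight,
    "citympg": 60 - 0.012 * weight + rng.normal(0, 2, size=200),
    "price": 5 * weight + rng.normal(0, 1500, size=200),
    "fueltype": "gas",  # object column, excluded from the correlation below
})

# Pearson correlation of every numeric column with price, strongest positive first
corr_with_price = (toy.select_dtypes(include="number")
                      .corr()["price"]
                      .drop("price")
                      .sort_values(ascending=False))
print(corr_with_price)
```

A sorted series like this is often easier to scan than the full heatmap when only the relationship with the target matters.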

## Data preparation

### Converting categorical variables into numeric

Applying label encoding since I will be using a tree based model.
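
For readers unfamiliar with label encoding, here is a small sketch of what it does to a categorical column. The frame `toy` is invented, and keeping one encoder per column in the `encoders` dictionary is a choice made for this example so that the codes can be mapped back to category names later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented categorical values, for illustration only
toy = pd.DataFrame({"fueltype": ["gas", "diesel", "gas"],
                    "carbody": ["sedan", "wagon", "hatchback"]})

# One encoder per column, kept so the integer codes can be decoded again
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

print(encoded)
print(list(encoders["carbody"].classes_))  # position in this list is the integer code
```

The codes follow the sorted order of the category names, so they carry no meaningful ranking. A tree can still split on them, which is the usual argument for this encoding with tree-based models. Note that scikit-learn documents `LabelEncoder` for target labels and offers `OrdinalEncoder` as the equivalent for feature columns.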

In [None]:
# converting categorical variables into numeric variables using label encoding
le = LabelEncoder()

df_encoded = df.drop(columns=["price_category"])
df_encoded[col_categorical[:-1]] = df_encoded[col_categorical[:-1]].apply(lambda col: le.fit_transform(col))

df_encoded.head()

### Creating dependent and independent variables

In [None]:
# independent variables
X = df_encoded.drop(columns="price")

# dependent variable
y = df_encoded["price"]

### Splitting data into train test data

In [None]:
# splitting into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model building

### Decision tree regressor

**Building base model**

In [None]:
# building a base model
base_model = DecisionTreeRegressor()
base_model.fit(X_train, y_train)

In [None]:
# scoring using test data
y_pred = base_model.predict(X_test)
# r2_score expects the true values first, then the predictions
print("R-squared:", r2_score(y_test, y_pred))

**Hyperparameter tuning**

In [None]:
# hyperparameter tuning for best model
parameters = {"max_depth":list(range(1,15))}

base_model = DecisionTreeRegressor()
cv_model = GridSearchCV(estimator=base_model, param_grid=parameters, scoring='r2', return_train_score=True, cv=5).fit(X_train,y_train)

# collecting the cross-validation results in a dataframe
results = pd.DataFrame(cv_model.cv_results_)

# mean cross-validated train and test scores for each max depth
plt.plot(results["param_max_depth"], results["mean_test_score"], label="test score")
plt.plot(results["param_max_depth"], results["mean_train_score"], label="train score")
plt.title("Training vs. Test score")
plt.ylabel("R-squared")
plt.xlabel("Max depth")
plt.legend()
plt.grid()
plt.show()

**Observations:**
- There is no improvement in training score after max depth 8, so we build our model with max depth 8.
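
A depth read from the plot can be cross-checked against the depth with the best mean cross-validated score, which `GridSearchCV` exposes directly. The sketch below shows this on a synthetic regression problem (`X_toy`, `y_toy` from `make_regression` are assumptions, not the car data), so the depth it prints does not apply to this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (NOT the car data)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(1, 15))},
    scoring="r2",
    cv=5,
    return_train_score=True,
).fit(X_toy, y_toy)

results = pd.DataFrame(search.cv_results_)

# best_params_ holds the setting with the highest mean validation score
print("best max_depth:", search.best_params_["max_depth"])
print("mean CV R-squared at that depth:", round(search.best_score_, 3))
```

The training score of a decision tree usually keeps rising with depth, so the validation score is generally the more informative curve for choosing `max_depth`.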

In [None]:
# building final model
model = DecisionTreeRegressor(max_depth=8)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# r2_score expects the true values first, then the predictions
print("R-squared:", r2_score(y_test, y_pred))
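
R-squared is not symmetric in its arguments, and scikit-learn's `r2_score` takes the true values first. The sketch below illustrates this on four invented prices and predictions, and also computes the RMSE from the `mean_squared_error` function imported at the top of the notebook.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Invented true prices and predictions, for illustration only
y_true = np.array([10000.0, 15000.0, 20000.0, 30000.0])
y_hat = np.array([12000.0, 15000.0, 18000.0, 24000.0])

r2_correct = r2_score(y_true, y_hat)   # true values first, then predictions
r2_swapped = r2_score(y_hat, y_true)   # swapped arguments give a different number
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # error in the units of price

print("R-squared:", round(r2_correct, 3))
print("R-squared with swapped arguments:", round(r2_swapped, 3))
print("RMSE:", round(rmse, 1))
```

Reporting RMSE next to R-squared gives an error figure in price units, which is often easier for a business audience to interpret.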