# Linear Regression

### Problem Statement

A Chinese automobile company **Geely Auto** aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts. 

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market. 

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Display all columns and rows

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
# Load data from CarPrice_Assignment.csv dataset

car_df = pd.read_csv("/kaggle/input/car-price-prediction/CarPrice_Assignment.csv", engine='python')

In [None]:
# Check the head of the dataset

car_df.head()

In [None]:
# Check for null values

car_df.isnull().sum()

In [None]:
# Dataset dimensions

car_df.shape

In [None]:
# Dataset information

car_df.info()

In [None]:
# More understanding about the dataset

car_df.describe()

#### Custom Functions

In [None]:
import statsmodels.api as sm

In [None]:
# Function to get VIF (Variance Inflation Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor

def get_VIF(X_train):
    # A dataframe that will contain the names of all the feature variables and their respective VIFs
    vif = pd.DataFrame()
    vif['Features'] = X_train.columns
    vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    print(vif)
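
For intuition about what `get_VIF` reports: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below recomputes this by hand with NumPy on made-up data. The helper name `manual_vif` and the synthetic columns are illustrative assumptions, not part of the dataset. Note that statsmodels' `variance_inflation_factor` does not add an intercept on its own, which is why the notebook later passes it a frame that already contains the constant column.

```python
import numpy as np

def manual_vif(X):
    """Return the VIF of each column of a 2-D array X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        # Regress column i on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)             # independent of a
c = a + 0.1 * rng.normal(size=200)   # nearly a copy of a
vif_a, vif_b, vif_c = manual_vif(np.column_stack([a, b, c]))
print(vif_a, vif_b, vif_c)
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1.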

### Data cleaning

In [None]:
# Creating a derived column for company name of cars from the column CarName

car_df.loc[:,'company'] = car_df.CarName.str.split(' ').str[0]

In [None]:
car_df.company = car_df.company.apply(lambda x: str(x).lower())

In [None]:
car_df.company.unique()

A few company names are evidently misspelled in the dataset; for example, toyota has been written as toyouta. 
We will go ahead and repair these. 

In [None]:
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions no longer propagate to car_df
company_fixes = {'maxda': 'mazda', 'porcshce': 'porsche', 'toyouta': 'toyota',
                 'vokswagen': 'volkswagen', 'vw': 'volkswagen'}
car_df['company'] = car_df['company'].replace(company_fixes)

In [None]:
# Dropping the CarName column

car_df.drop(columns = 'CarName', inplace=True)

In [None]:
car_df.fuelsystem.unique()

From business understanding of the automobile domain we can understand the following:

- mpfi stands for Multi Point Fuel Injection. There is no such thing as mfi in the automobile domain, so we treat it as a misspelling of mpfi.

In [None]:
car_df['fuelsystem'] = car_df['fuelsystem'].replace('mfi', 'mpfi')

In [None]:
car_df.enginetype.unique()

Here too, the following values look incorrect:

- ohc has been misspelled in places as ohcv
- dohc has been misspelled as dohcv

In [None]:
car_df['enginetype'] = car_df['enginetype'].replace({'dohcv': 'dohc', 'ohcv': 'ohc'})

In [None]:
car_df.drivewheel.unique()

Here 4wd is treated as a mis-entry of fwd.

In [None]:
car_df['drivewheel'] = car_df['drivewheel'].replace('4wd', 'fwd')

### Data Understanding and Preparation

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#### Data Visualization - Continuous Variable

In [None]:
# Plotting a pairplot for the continuous variables

sns.pairplot(car_df, diag_kind="kde")
plt.show()

In [None]:
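plt.figure(figsize=(20,12))
# numeric_only=True skips the text columns, which newer pandas versions refuse to correlate
sns.heatmap(car_df.corr(numeric_only=True), linewidths=.5, annot=True, cmap="YlGnBu")
plt.show()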

From the above plots we can understand the following:

1. The dependent variable **price** has a high positive correlation with:
    * horsepower
    * enginesize
    * curbweight
    * carwidth
    * carlength

2. The dependent variable **price** has a high negative correlation with:
    * highwaympg
    * citympg
    
Among the variables that are highly correlated with the dependent variable price, a few are also very highly correlated with one another:

- enginesize with horsepower and curbweight
- curbweight with enginesize, carwidth and carlength
- highwaympg with citympg

This multicollinearity needs to be considered while building the model, since the absence of multicollinearity among the predictors is one of the assumptions of linear regression.
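
Reading such pairs off a large heatmap is error-prone, so it can help to list them programmatically. The following is a minimal sketch on synthetic data. The helper `high_corr_pairs`, the 0.8 threshold and the demo columns are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
demo = pd.DataFrame({
    "x": x,
    "x_noisy": x + 0.05 * rng.normal(size=100),  # almost a copy of x
    "unrelated": rng.normal(size=100),
})
pairs = high_corr_pairs(demo)
print(pairs)
```

Calling `high_corr_pairs(car_df)` at this point in the notebook would give a comparable list for the car data, with the threshold adjusted to taste.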

#### Derived variable creation

In [None]:
# curbweight/enginesize

car_df.loc[:,'curbweight/enginesize'] = car_df.curbweight/car_df.enginesize

In [None]:
# enginesize/horsepower

car_df.loc[:,'enginesize/horsepower'] = car_df.enginesize/car_df.horsepower

In [None]:
# carwidth/carlength

car_df.loc[:,'carwidth/carlength'] = car_df.carwidth/car_df.carlength

In [None]:
# highwaympg/citympg

car_df.loc[:,'highway/city'] = car_df.highwaympg/car_df.citympg

In [None]:
# We can now drop the corresponding columns as we have taken a ratio.

car_df.drop(columns = ['enginesize','carwidth', 'carlength', 'highwaympg', 'citympg'], inplace = True)

In [None]:
# Checking the dataset once more

car_df.head()

In [None]:
# Dropping car_ID column as it is not useful

car_df.drop(columns = 'car_ID', inplace=True)

#### Data Visualization - Categorical Variable

The data dictionary describes symboling as the assigned insurance risk rating: a value of +3 indicates that the car is risky, while -3 indicates that it is probably quite safe.

We divide as follows: 

- -3,-2,-1 --> **Safe**
- 0,1      --> **Moderate**
- 2,3      --> **Risky**

In [None]:
car_df.symboling = car_df.symboling.map({-3: 'safe', -2: 'safe',-1: 'safe',0: 'moderate',1: 'moderate',2: 'risky',3:'risky'})

In [None]:
# Visualizing categorical data via boxplots

plt.figure(figsize=(20, 16))
plt.subplot(3,3,1)
sns.boxplot(x = 'symboling', y = 'price', data = car_df)
plt.subplot(3,3,2)
sns.boxplot(x = 'fueltype', y = 'price', data = car_df)
plt.subplot(3,3,3)
sns.boxplot(x = 'aspiration', y = 'price', data = car_df)
plt.subplot(3,3,4)
sns.boxplot(x = 'doornumber', y = 'price', data = car_df)
plt.subplot(3,3,5)
sns.boxplot(x = 'carbody', y = 'price', data = car_df)
plt.subplot(3,3,6)
sns.boxplot(x = 'drivewheel', y = 'price', data = car_df)
plt.subplot(3,3,7)
sns.boxplot(x = 'enginelocation', y = 'price', data = car_df)
plt.subplot(3,3,8)
sns.boxplot(x = 'cylindernumber', y = 'price', data = car_df)
plt.subplot(3,3,9)
sns.boxplot(x = 'fuelsystem', y = 'price', data = car_df)
plt.show()

1. Cars with rear engines are clearly priced higher than others.
2. Similarly, price shows a clear relationship with the number of cylinders and with whether the car is rated risky or safe. 
3. However, fuel type and number of doors do not seem to have much effect on the price of a car.

In [None]:
# Plotting company vs price

plt.figure(figsize=(20, 16))
sns.boxplot(x = 'company', y = 'price', data = car_df, palette="Reds")

Company name clearly seems to affect the price: companies such as BMW, Jaguar, Buick and Porsche manufacture some very expensive high-end cars. 

We can divide the companies into buckets of low, med and high based on the **median** price of each company. We choose the median price instead of the mean because there are some outliers in the prices within the feature "company".

In [None]:
median_dict = car_df.groupby(['company'])[['price']].median().to_dict()
median_dict = median_dict['price']
median_dict

In [None]:
dict_keys = list(median_dict.keys())

# Median price of category below 10000 is low, between 10000 and 20000 is med and above 20000 is high
for i in dict_keys:
    if median_dict[i] < 10000:
        median_dict[i] = 'low'
    elif median_dict[i] >= 10000 and median_dict[i] <= 20000:
        median_dict[i] = 'med'
    else:
        median_dict[i] = 'high'

median_dict
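
To see why the median is the safer choice here, consider a brand with four ordinary models and one very expensive one. The prices below are made up for illustration, and `price_bucket` is a hypothetical helper that mirrors the thresholds used in the loop above.

```python
from statistics import mean, median

# Hypothetical prices for one brand: four ordinary models and one halo car.
prices = [7000, 7500, 8000, 8500, 40000]

brand_mean = mean(prices)      # pulled upward by the single expensive model
brand_median = median(prices)  # unaffected by how extreme the outlier is

def price_bucket(value):
    """Bucket a company's typical price using the thresholds from the loop above."""
    if value < 10000:
        return 'low'
    if value <= 20000:
        return 'med'
    return 'high'

print(brand_mean, price_bucket(brand_mean))
print(brand_median, price_bucket(brand_median))
```

With the mean, the single outlier moves the whole brand into the med bucket, whereas the median keeps it in low, which better reflects what most of its cars cost.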

In [None]:
car_df.company = car_df.company.map(median_dict)
car_df.company.unique()

#### One Hot Encoding for the categorical variables

In [None]:
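# dtype=int keeps the dummy columns numeric (0/1) so that statsmodels can consume them later
car_df = pd.get_dummies(car_df, drop_first=True, dtype=int)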

In [None]:
# Checking dataframe after dummy variable creation

car_df.head()

#### Splitting the entire dataset into test and train data

Here we are splitting the data in a 70 and 30 ratio for train and test respectively.

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(car_df, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
print("Train data shape: ", df_train.shape)
print("Test data shape: ", df_test.shape)

#### Feature scaling

Feature scaling brings all continuous variables onto a comparable range, which makes the fitted coefficients easier to compare and helps gradient descent based solvers converge quickly.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
conti_vars = ['wheelbase', 'carheight', 'boreratio', 'stroke', 'compressionratio', 'peakrpm', 'horsepower', 'curbweight', 'price', 'curbweight/enginesize', 'carwidth/carlength', 'highway/city', 'enginesize/horsepower']
df_train[conti_vars] = scaler.fit_transform(df_train[conti_vars])

df_train.describe()
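
For reference, min-max scaling maps each value to x' = (x - min) / (max - min), where min and max are learned from the training data only and then reused for the test data. The sketch below reproduces that arithmetic with NumPy. The numbers are made up and only serve to show the mechanics.

```python
import numpy as np

train = np.array([50.0, 100.0, 150.0, 200.0])  # made-up training values
test = np.array([75.0, 250.0])                 # made-up test values

# "Fit": learn the range from the training data only.
lo, hi = train.min(), train.max()

# "Transform": apply the same range to both sets.
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)
print(test_scaled)
```

The training values land exactly in [0, 1], while a test value outside the training range (250 here) maps above 1. This is expected and is the reason the scaler is fitted on the training set alone.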

In [None]:
# X and y division

y_train = df_train.pop('price')
X_train = df_train

### Modeling

We first use sklearn's RFE (Recursive Feature Elimination) technique to reduce the model down to 10 features.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=10)             # running RFE to select 10 best features
rfe = rfe.fit(X_train, y_train)
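
As a rough picture of what RFE does with a linear model: fit, drop the feature with the smallest absolute coefficient, and repeat until the desired number of features remains. The sketch below is a simplified stand-in written with NumPy, not scikit-learn's implementation. The helper `rfe_linear` and the synthetic data are assumptions for illustration, and the ranking by coefficient size only makes sense because the features are on a comparable scale.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        design = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[1:])))  # skip the intercept
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(200, 5))       # five made-up features in [0, 1]
noise = 0.05 * rng.normal(size=200)
# Only features 0 and 3 actually drive the target.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + noise
selected = rfe_linear(X_demo, y_demo, n_keep=2)
print(selected)
```

The returned indices are the columns that survive the elimination, analogous to `rfe.support_` in the cell above.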

##### Model 1 - 10 features

In [None]:
# Checking the statistics of the model using the statsmodels library

col_rfe = X_train.columns[rfe.support_]
X_train = X_train[col_rfe].copy()  # copy so that later in-place drops act on an independent frame

X_train_sm = sm.add_constant(X_train)
lm_1 = sm.OLS(y_train, X_train_sm).fit()
print(lm_1.summary()) #stats
get_VIF(X_train_sm) #VIF

This seems a decent point to start removing features one by one. All the features except *carbody_hardtop* have acceptable p-values, so we remove *carbody_hardtop* first and rebuild the model.

In [None]:
X_train.drop(columns='carbody_hardtop', inplace=True)

##### Model 2 - 9 features

In [None]:
X_train_sm = sm.add_constant(X_train)
lm_2 = sm.OLS(y_train, X_train_sm).fit()
print(lm_2.summary()) #stats
get_VIF(X_train_sm) #VIF

*wheelbase* came out to have a high p-value. We remove it and rebuild the model.

In [None]:
X_train.drop(columns='wheelbase', inplace=True)

##### Model 3 - 8 features

In [None]:
X_train_sm = sm.add_constant(X_train)
lm_3 = sm.OLS(y_train, X_train_sm).fit()
print(lm_3.summary()) #stats
get_VIF(X_train_sm) #VIF

*carbody_sedan* has a high p-value and a VIF above 5. So it becomes a very good candidate to be dropped. 

In [None]:
X_train.drop(columns='carbody_sedan', inplace=True)

##### Model 4 - 7 features

In [None]:
X_train_sm = sm.add_constant(X_train)
lm_4 = sm.OLS(y_train, X_train_sm).fit()
print(lm_4.summary()) #stats
get_VIF(X_train_sm) #VIF

At this point all the p-values are below 0.05 and the VIFs are below 5, 
so we can be reasonably confident that this is a good model.

We also see that lm_4 model has an **R-squared value of 0.926** and an **adjusted R-squared value of 0.922**
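
The two figures are linked by the usual formula: adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of training rows and p the number of features. The short check below assumes 143 training rows, which is not stated above and depends on the size of the dataset, together with the 7 features of lm_4.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Assumed values: 143 training rows (not stated above) and the 7 features of lm_4.
adj = adjusted_r2(0.926, n_obs=143, n_features=7)
print(round(adj, 3))
```

Under that assumption the formula gives about 0.922, in line with the value in the summary. The small gap between the two numbers indicates that the model is not padded with uninformative features.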

### Residual Analysis

##### On training data

In [None]:
y_train_price = lm_4.predict(X_train_sm)

In [None]:
fig = plt.figure()
# histplot with kde=True replaces the deprecated distplot
sns.histplot((y_train - y_train_price), bins = 20, kde = True, stat = "density")
fig.suptitle('Residual Error Distribution', fontsize = 20)
plt.show()

The residual errors follow a bell-shaped curve centered at 0.0, which is consistent with the assumption that the error terms are normally distributed.

##### On testing data

In [None]:
# We are scaling the testing set with the already existing scaler object which has been fitted on the train dataset

df_test[conti_vars] = scaler.transform(df_test[conti_vars])

df_test.describe()

In [None]:
# X and y division

y_test = df_test.pop('price')
X_test = df_test

In [None]:
X_test = X_test[col_rfe]

In [None]:
X_test = X_test.drop(columns=['carbody_sedan', 'wheelbase', 'carbody_hardtop']) # Dropping columns which we dropped while building the model after RFE

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
y_pred = lm_4.predict(X_test_sm)

In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
plt.xlabel('y_test_price', fontsize=18)
plt.ylabel('y_pred', fontsize=16)

### Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score

In [None]:
rmse = sqrt(mean_squared_error(y_test, y_pred))
print('Model RMSE:',rmse)

r2=r2_score(y_test, y_pred)
print('Model r2_score:',r2)

**R2_score on training data**: 0.926

**R2_score on testing data**: 0.914

This suggests that the model explains the variance of the test set almost as well as it explains the variance of the training set.
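
Because price was min-max scaled along with the other continuous variables, the RMSE printed above is in scaled units rather than in currency. The sketch below spells out both metrics by hand and shows how a scaled RMSE could be converted back. The arrays and the price range are made-up values. In the notebook the real range would come from the fitted scaler.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

y_true_demo = np.array([0.10, 0.20, 0.40, 0.80])  # made-up scaled prices
y_pred_demo = np.array([0.15, 0.15, 0.45, 0.75])  # made-up predictions

# Hypothetical training price range used to undo the min-max scaling.
price_min, price_max = 5000.0, 45000.0
rmse_scaled = rmse(y_true_demo, y_pred_demo)
rmse_dollars = rmse_scaled * (price_max - price_min)
r2_demo = r2(y_true_demo, y_pred_demo)
print(rmse_scaled, rmse_dollars, r2_demo)
```

R² is unaffected by this rescaling, so the train and test scores above can be compared directly.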

In [None]:
c = list(range(1, len(y_test) + 1))  # one index per test observation

fig = plt.figure()
plt.plot(c,y_test,color="blue",linewidth=3,linestyle='-')
plt.plot(c,y_pred,color="red",linewidth=3,linestyle='-')
plt.ylabel('Car Price')
plt.xlabel('Index')

The actual and predicted test values almost overlap each other, which indicates good prediction.

The final model has the following features and coefficients:

| Feature Name | Coefficient |
| -: | -: |
| curbweight | 0.4328 |
| horsepower | 0.2874 |
| carbody_hatchback | -0.0232 |
| carbody_wagon | -0.0454 |
| enginelocation_rear | 0.1697 |
| company_low | -0.2831 |
| company_med | -0.2307 |

### Final Analysis and Recommendations

Geely Auto can use the above features to determine the price of a car which they are about to distribute in the US market. 

Along with that, our initial analysis brought out some salient features which have a large impact on the price of a car. These are summarized again below:

1. **Company Name** - Brand value is a big factor. Companies such as Porsche, BMW and Jaguar produce some expensive cars, so price depends a lot on the company of the car.
2. **Symboling** - Cars rated safe have a higher price range than others.
3. **Fueltype** - Diesel-powered cars tend to be very slightly more expensive than their petrol counterparts. This could be because diesel is less expensive than petrol and thus a diesel car will cost less to run over time.
4. **Engine Location** - Cars with the engine at the rear are significantly more expensive than cars with the engine at the front. This is mainly because expensive sports cars place the engine towards the back for better balance at high speeds and aerodynamic enhancement.
5. **Cylinder Number** - As the number of cylinders increases, the price increases as well.