## <font color=purple> PROBLEM STATEMENT</font>

A Chinese **automobile company**, Geely Auto, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the **factors affecting the pricing of cars in the American market**, since those may be very different from the Chinese market. 

The company wants to know:

- `Which variables are significant in predicting the price of a car.`


- `How well those variables describe the price of a car.`

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market.

### <font color=navy>Business Goal</font>

You are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#importing usual libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### <font color=navy>STEP 1: Reading and Understanding the Data</font>

1. Import data using the pandas library
2. Understanding the structure of the data

In [None]:
#importing dataset csv to pandas dataframe

automobile = pd.read_csv("/kaggle/input/CarPrice_Assignment.csv")
automobile.head()

In [None]:
#checking number of rows and columns

automobile.shape

In [None]:
#checking dtypes and null values of columns

automobile.info()

In [None]:
#checking summary of numeric variables

automobile.describe()

In [None]:
#checking number of columns of each data type for general EDA

automobile.dtypes.value_counts()

### <font color=navy>Step 2 : Data Cleaning and Exploratory Data Analysis</font>
1. Clean up carname to consider only company name as the independent variable for model building.
2. Identify null values.
3. Replace necessary values.
4. Convert dtypes if required.
5. Explore spread of variables and their influence on price

In [None]:
#cleaning Car Name to keep only brand(company) name and remove model names 

automobile['CarName']=automobile['CarName'].apply(lambda x:x.split(' ', 1)[0])
automobile.rename(columns = {'CarName':'companyname'}, inplace = True)
automobile.head()

In [None]:
#checking unique values in company name column

automobile.companyname.unique()

Invalid Values

There is some inconsistency in the spellings of company names, which needs to be fixed. We need to do the following replacements:

- maxda -> mazda
- Nissan -> nissan
- porcshce -> porsche 
- toyouta -> toyota
- vokswagen -> volkswagen
- vw -> volkswagen
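
The replacements listed above can also be expressed as a single dictionary lookup. The following is a minimal sketch on a toy Series; `raw_names` and `SPELLING_FIXES` are made-up names for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the companyname column (hypothetical sample values).
raw_names = pd.Series(
    ["maxda", "Nissan", "porcshce", "toyouta", "vokswagen", "vw", "audi"]
)

# Misspelling -> canonical name, taken from the list above.
SPELLING_FIXES = {
    "maxda": "mazda",
    "porcshce": "porsche",
    "toyouta": "toyota",
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
}

# Lower-casing handles "Nissan" -> "nissan"; replace() handles the rest.
clean_names = raw_names.str.lower().replace(SPELLING_FIXES)
print(clean_names.tolist())
```

Keeping the mapping in one dictionary makes it easy to review every correction at a glance.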

In [None]:
#counting number of unique company names

automobile.companyname.nunique()

**There are 28 unique companies right now**

In [None]:
# Fixing values in company name

automobile.companyname = automobile.companyname.str.lower()

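def replace_name(a,b):
    # assign back to the column; an inplace replace on an attribute may not update the dataframe in newer pandas
    automobile['companyname'] = automobile['companyname'].replace(a,b)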

replace_name('maxda','mazda')
replace_name('porcshce','porsche')
replace_name('toyouta','toyota')
replace_name('vokswagen','volkswagen')
replace_name('vw','volkswagen')

automobile.companyname.unique()

In [None]:
#counting number of unique company names

automobile.companyname.nunique()

**After fixing the six misspelled or inconsistently cased names, there are 22 unique companies**

Let's see their company wise popularity (count) and company wise average price

In [None]:
#plotting count of company names

plt.figure(figsize=(30, 8))
plt1=sns.countplot(x=automobile.companyname, data=automobile, order= automobile.companyname.value_counts().index)
plt.title('Company Wise Popularity', size=14)
plt1.set_xlabel('Car company', fontsize=14)
plt1.set_ylabel('Number of cars', fontsize=14)
plt1.set_xticklabels(plt1.get_xticklabels(),rotation=360, size=14)
plt.show()

**Inferences:**
- Toyota has the most rows and seems to be the most popular brand/company.
- Mercury has the fewest rows and is the least popular company.

In [None]:
#plotting company wise average price of car

plt.figure(figsize=(30, 6))

df = pd.DataFrame(automobile.groupby(['companyname'])['price'].mean().sort_values())
df=df.reset_index(drop=False)
plt1=sns.barplot(x="companyname", y="price", data=df)
plt1.set_title('Car Range vs Average Price', size=14)
plt1.set_xlabel('Car company', fontsize=14)
plt1.set_ylabel('Price', fontsize=14)
plt1.set_xticklabels(plt1.get_xticklabels(),rotation=360, size=14)
plt.show()

**INFERENCES:**
- Chevrolet has the cheapest average price amongst all companies.
- Jaguar has the highest average price.
- The average price seems to depend on the company name, which suggests that this variable is worth using in our model because it is associated with car price.


Now, since there are too many companies and they would create a lot of dummy variables, let's divide these companies into segments based on their average price.
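
The notebook assigns each company to a segment by hand. As an alternative illustration only, the same idea can be sketched with `pd.cut` on the per-company average price. The toy brands, prices, and bin edges below are hypothetical and are not derived from the dataset.

```python
import pandas as pd

# Toy data: hypothetical brands and prices, not taken from the dataset.
cars = pd.DataFrame({
    "companyname": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "price": [6000, 8000, 12000, 14000, 16000, 18000, 30000, 34000],
})

# Average price per company.
avg_price = cars.groupby("companyname")["price"].mean()

# Hypothetical bin edges; the segment names mirror those used in the notebook.
edges = [0, 10000, 15000, 20000, float("inf")]
labels = ["Low_End", "Budget", "Medium", "High_End"]
segment_of = pd.cut(avg_price, bins=edges, labels=labels).astype(str)

# Map each row's company to its segment.
cars["segment"] = cars["companyname"].map(segment_of)
print(cars)
```

Deriving the segments from explicit edges keeps the binning rule visible, at the cost of having to choose and justify the edges.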

In [None]:
#Binning the Car Companies based on avg prices of each Company.

def replace_values(a,b):
    # assign back to the column so the change always reaches the dataframe
    automobile['companyname'] = automobile['companyname'].replace(a,b)

replace_values('chevrolet','Low_End')
replace_values('dodge','Low_End')
replace_values('plymouth','Low_End')
replace_values('honda','Low_End')
replace_values('subaru','Low_End')
replace_values('isuzu','Low_End')
replace_values('mitsubishi','Budget')
replace_values('renault','Budget')
replace_values('toyota','Budget')
replace_values('volkswagen','Budget')
replace_values('nissan','Budget')
replace_values('mazda','Budget')
replace_values('saab','Medium')
replace_values('peugeot','Medium')
replace_values('alfa-romero','Medium')
replace_values('mercury','Medium')
replace_values('audi','Medium')
replace_values('volvo','Medium')
replace_values('bmw','High_End')
replace_values('porsche','High_End')
replace_values('buick','High_End')
replace_values('jaguar','High_End')

automobile.rename(columns = {'companyname':'segment'}, inplace = True)
automobile.head()

### Let's visualize the other categorical variables now, and see if they have any correlation with price.

In [None]:
## FUNCTION TO PLOT CHARTS

def plot_charts(var1, var2):
    plt.figure(figsize=(15, 10))
    # top row: counts of var1 and var1 vs price
    plt.subplot(2,2,1)
    plt.title('Histogram of '+ var1)
    ax = sns.countplot(x=automobile[var1], palette=("husl"))
    ax.set(xlabel=var1, ylabel='Frequency of ' + var1)
    
    plt.subplot(2,2,2)
    plt.title(var1+' vs Price')
    sns.boxplot(x=automobile[var1], y=automobile.price, palette=("husl"))
    
    # bottom row: counts of var2 and var2 vs price
    plt.subplot(2,2,3)
    plt.title('Histogram of '+ var2)
    ax = sns.countplot(x=automobile[var2], palette=("husl"))
    ax.set(xlabel=var2, ylabel='Frequency of ' + var2)
    
    plt.subplot(2,2,4)
    plt.title(var2+' vs Price')
    sns.boxplot(x=automobile[var2], y=automobile.price, palette=("husl"))
    
    plt.show()

In [None]:
plot_charts('symboling', 'fueltype')

**INFERENCES**
- The most common symboling values are 0 and 1. The box plot shows that symboling 1 has the lowest median price, followed by 0 and 2. Symboling values of -1 and -2 have the highest median car price.
- Symboling could be a good predictor variable because we can see a relation between its value and the price of the car.

- More cars use gas than diesel.
- Diesel cars have a higher median price than gas cars, although we can see some outliers in the gas box plot.

In [None]:
plot_charts('aspiration', 'doornumber')

**INFERENCES**
- Most cars have std aspiration. The box plot shows that cars with turbo aspiration have higher median price
- door number shows no relation to car price and hence seems like an insignificant variable right now.

In [None]:
plot_charts('drivewheel', 'carbody')

**INFERENCES**
- Most cars have fwd **(front wheel drive)**, followed by rwd **(rear wheel drive)**. The 4wd **(4 wheel drive)** is very uncommon and has the fewest records.
- Cars with rwd have a higher median price, but there are very few records to draw any conclusion from this.

- sedan followed by hatchback seem to be the most popular carbody.
- box plot shows that car body convertible and hardtop have higher median values, but very few entries again.

In [None]:
plot_charts('enginelocation', 'enginetype')

**INFERENCES**
- Most cars have engine located at the front and very few cars have engine located at the rear.
- The box plot shows that when the engine is at the rear, the median price is higher than when it is at the front.

- ohc engine is preferred over others.
- ohcv engine has the highest median value.

In [None]:
plot_charts('cylindernumber', 'fuelsystem')

**INFERENCES**
- 4 cylinders is the most common number, followed by 6. Four-cylinder cars have the second-lowest median price after three-cylinder cars (very few entries again).

- mpfi is the most frequently occurring fuel system. It has the highest median value and also contains outliers.

### Visualizing Numeric Variables


In [None]:
#checking distribution and spread of car price

plt.figure(figsize=(20,6))

plt.subplot(1,2,1)
plt.title('Car Price Distribution Plot')
sns.distplot(automobile.price)

plt.subplot(1,2,2)
plt.title('Car Price Spread')
sns.boxplot(y=automobile.price)

plt.show()

- The plot is right-skewed, meaning that most prices in the dataset are low (below 15,000).
- There is a significant difference between the mean and the median of the price distribution.
- There is a high variance in the car prices, data points are far spread out from the mean.
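
The observations above can be checked numerically as well as visually. Below is a minimal sketch on made-up prices (the `prices` values are hypothetical, not taken from the dataset) that compares the mean with the median and computes the sample skewness.

```python
import pandas as pd

# Hypothetical right-skewed prices: many cheap cars, a few expensive ones.
prices = pd.Series(
    [6000, 7000, 7500, 8000, 9000, 10000, 12000, 18000, 35000, 45000]
)

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # a positive value indicates a longer right tail

print(f"mean={mean_price:.0f}, median={median_price:.0f}, skew={skewness:.2f}")
```

On the real data, the same three numbers could be read from `automobile.price` to back up the plot-based inference.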

In [None]:
# checking numeric columns

automobile.select_dtypes(include=['float64','int64']).columns

In [None]:
#function to plot scatter plot numeric variables with price

def pp(x,y):
    sns.pairplot(automobile, x_vars=[x,y], y_vars='price',height=4, aspect=1, kind='scatter')
    plt.show()

pp('carlength', 'carwidth')
pp('carheight', 'curbweight')

- Except Car Height, all variables show a positive correlation with respect to price.

In [None]:
#function to plot scatter plot numeric variables with price

def pp(x,y,z):
    sns.pairplot(automobile, x_vars=[x,y,z], y_vars='price',height=4, aspect=1, kind='scatter')
    plt.show()

pp('wheelbase', 'compressionratio', 'enginesize')
pp('boreratio', 'horsepower', 'peakrpm')
pp('stroke', 'highwaympg', 'citympg')

- Compression Ratio, Stroke and Peakrpm show no obvious correlation with car price.
- Boreratio shows some positive correlation with a lot of variance.
- Citympg and highwaympg are negatively correlated to the price.
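
The direction of these relationships can also be summarised with Pearson correlation coefficients. The sketch below uses a small made-up frame (`toy`, with hypothetical numbers) and ranks the columns by their correlation with price.

```python
import pandas as pd

# Toy data with hypothetical numbers: bigger engines cost more, higher mpg costs less.
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 200],
    "highwaympg": [40, 35, 30, 27, 20],
    "price": [6000, 9000, 13000, 17000, 30000],
})

# Correlation of every other column with price, most negative first.
corr_with_price = toy.corr()["price"].drop("price").sort_values()
print(corr_with_price)
```

A coefficient near zero corresponds to the "no obvious correlation" cases noted above.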

In [None]:
#converting cylinder number to numeric and replacing values

def replace_values(a,b):
    # assign back to the column so the change always reaches the dataframe
    automobile['cylindernumber'] = automobile['cylindernumber'].replace(a,b)

replace_values('four','4')
replace_values('six','6')
replace_values('five','5')
replace_values('three','3')
replace_values('twelve','12')
replace_values('two','2')
replace_values('eight','8')

automobile.cylindernumber=automobile.cylindernumber.astype('int')

In [None]:
automobile.symboling.unique()

In [None]:
#converting symboling to categorical because the numeric values imply weight

def replace_values(a,b):
    # assign back to the column so the change always reaches the dataframe
    automobile['symboling'] = automobile['symboling'].replace(a,b)

replace_values(3,'Very_Risky')
replace_values(2,'Moderately_Risky')
replace_values(1,'Neutral')
replace_values(0,'Safe')
replace_values(-1,'Moderately_Safe')
replace_values(-2,'Very_Safe')

In [None]:
# Converting variables with 2 values to 1 and 0

automobile['fueltype'] = automobile['fueltype'].map({'gas': 1, 'diesel': 0})
automobile['aspiration'] = automobile['aspiration'].map({'std': 1, 'turbo': 0})
automobile['doornumber'] = automobile['doornumber'].map({'two': 1, 'four': 0})
automobile['enginelocation'] = automobile['enginelocation'].map({'front': 1, 'rear': 0})

In [None]:
#dropping car_ID because it has all unique values

automobile.drop(['car_ID'], axis =1, inplace = True)

In [None]:
#numeric variables

num_vars=automobile.select_dtypes(include=['float64','int64']).columns

In [None]:
# plotting heatmap to check correlation amongst variables

plt.figure(figsize = (20,10))  
sns.heatmap(automobile[num_vars].corr(),cmap="YlGnBu",annot = True)

In [None]:
#dropping variables which are highly correlated to other variables

automobile.drop(['compressionratio','carwidth','curbweight','wheelbase','citympg'], axis =1, inplace = True)
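
Reading highly correlated pairs off a large heatmap is error-prone, so it can help to list them programmatically. The following is a rough sketch, not the procedure used in this notebook: the helper `correlated_pairs`, the 0.8 threshold, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, corr) for column pairs whose |corr| exceeds threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            value = float(corr.iloc[i, j])
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(value, 2)))
    return pairs

# Toy data with hypothetical numbers: the two mpg columns move together.
toy = pd.DataFrame({
    "citympg": [21, 24, 30, 35, 19],
    "highwaympg": [27, 30, 37, 41, 25],
    "stroke": [3.1, 2.7, 3.4, 2.8, 3.3],
})
pairs = correlated_pairs(toy)
print(pairs)
```

From each reported pair, one column can then be dropped, as done in the cell above.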

In [None]:
automobile.head()

In [None]:
#getting dummies for categorical variables
#dtype=int keeps the dummy columns numeric (newer pandas returns booleans by default)

df = pd.get_dummies(automobile, dtype=int)
df.head()

In [None]:
#checking column names for dummy variables

df.columns

### DIVIDING INTO TRAIN AND TEST

In [None]:
# importing necessary libraries and functions

from sklearn.model_selection import train_test_split

# random_state is fixed so that the train and test sets always contain the same rows

df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)

### SCALING NUMERIC VARIABLES

In [None]:
# for scaling

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
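# Apply the scaler to the numeric columns; the dummy columns are left out.
# The binary (0/1) columns in the list stay unchanged as long as both values occur in the training set.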

num_vars = ['fueltype', 'aspiration', 'doornumber', 'enginelocation', 'enginesize','horsepower', 
            'peakrpm', 'highwaympg', 'carlength', 'carheight', 'boreratio', 'stroke', 'price']


df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

df_train.head()
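
The scaler is fitted on the training rows only and later reused on the test rows. A minimal sketch of what that means, using made-up single-column arrays (`train`, `test`, and `toy_scaler` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for one feature.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns min and max from train only
test_scaled = toy_scaler.transform(test)        # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test rows are scaled with the training minimum and maximum, scaled test values can fall outside the [0, 1] range. That is expected and avoids leaking test information into the fit.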

### Dividing into X and Y sets for the Model Building

In [None]:
#dividing into x and y sets where y has the variable we have to predict

y_train = df_train.pop('price')
X_train = df_train

In [None]:
# Importing RFE and LinearRegression

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Running RFE with the output number of the variable equal to 10
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=10)             # running RFE (keyword argument required in newer scikit-learn)
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
#checking RFE columns
col = X_train.columns[rfe.support_]
col
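
To see what RFE does in isolation, here is a small sketch on synthetic data where only two of five columns drive the target. The data, the seed, and the name `selector` are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 0 and 3 influence the target; the others are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Repeatedly fit the model and discard the feature with the smallest coefficient.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept columns
print(selector.ranking_)   # 1 for kept columns, larger numbers were dropped earlier
```

RFE with a linear model ranks features by coefficient size, so it works best when the features are on comparable scales, which is one reason the scaling step comes first.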

### Building model using statsmodel, for the detailed statistics

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_rfe = sm.add_constant(X_train_rfe)

In [None]:
#function for checking VIF

def checkVIF(X):
    vif = pd.DataFrame()
    vif['variable'] = X.columns    
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return(vif)
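
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy on synthetic data. Note the assumptions: `vif_by_hand` is a hypothetical helper, and it adds an intercept to the auxiliary regression, so its numbers need not match `checkVIF` above, which passes the columns to statsmodels without a constant.

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    # Intercept column for the auxiliary regression (an assumption of this sketch).
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)               # independent of a
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of a
X_toy = np.column_stack([a, b, c])

vifs = [vif_by_hand(X_toy, i) for i in range(3)]
print([round(v, 1) for v in vifs])
```

The near-duplicate columns get a large VIF while the independent column stays close to 1, which is the pattern used to decide which variables to drop in the models that follow.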

In [None]:
# building MODEL #1

lm = sm.OLS(y_train,X_train_rfe).fit() # fitting the model
print(lm.summary()) # model summary

In [None]:
#dropping constant to calculate VIF

X_train_rfe.drop('const', axis = 1, inplace=True)

In [None]:
#checking VIF

checkVIF(X_train_rfe)

In [None]:
#dropping boreratio because it has the highest p-value and also a high VIF. It is also difficult to explain to management

X_train_new = X_train_rfe.drop(["boreratio"], axis = 1)

In [None]:
#building MODEL #2 after dropping boreratio

X_train_new = sm.add_constant(X_train_new)
lm = sm.OLS(y_train,X_train_new).fit() # fitting the model
print(lm.summary()) # model summary

In [None]:
#dropping constant to calculate VIF

X_train_new.drop('const', axis=1, inplace=True)

In [None]:
#checking VIF

checkVIF(X_train_new)

In [None]:
#dropping enginelocation because it has the highest p-value and also a high VIF. It has very few values for rear, as we saw earlier

X_train_new.drop(["enginelocation"], axis=1, inplace=True)

In [None]:
#building MODEL #3 after dropping enginelocation

X_train_new = sm.add_constant(X_train_new)
lm = sm.OLS(y_train,X_train_new).fit() # fitting the model
print(lm.summary()) # model summary

In [None]:
#dropping constant to calculate VIF

X_train_new.drop('const', axis=1, inplace=True)

In [None]:
#checking VIF

checkVIF(X_train_new)

In [None]:
#dropping horsepower because it has a high VIF and exhibits multicollinearity.
#it is highly correlated to engine size and can be dropped.

X_train_new.drop(["horsepower"], axis=1, inplace=True)

In [None]:
#building MODEL #4 after dropping horsepower

X_train_new = sm.add_constant(X_train_new)
lm = sm.OLS(y_train,X_train_new).fit() # fitting the model
print(lm.summary()) # model summary

In [None]:
#dropping constant to calculate VIF

X_train_new.drop('const', axis=1, inplace=True)

In [None]:
#checking VIF

checkVIF(X_train_new)

In [None]:
#dropping carlength because it has a high VIF and exhibits multicollinearity.
#it is highly correlated to engine size and can be dropped.

X_train_new.drop(["carlength"], axis=1, inplace=True)

In [None]:
#building MODEL #5 after dropping carlength

X_train_new = sm.add_constant(X_train_new)
lm = sm.OLS(y_train,X_train_new).fit() # fitting the model
print(lm.summary()) # model summary

In [None]:
#dropping constant to calculate VIF

X_train_vif=X_train_new.drop('const', axis=1)

In [None]:
#checking VIF

checkVIF(X_train_vif)

## Residual Analysis of the train data

Now, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot the histogram of the error terms and see what it looks like.

In [None]:
#calculating price on train set using the model built

y_train_price = lm.predict(X_train_new)

In [None]:
# Plot the histogram of the error terms

fig = plt.figure()
sns.distplot((y_train - y_train_price), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

The error terms look approximately normally distributed and are centred around 0.
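
One point worth keeping in mind when reading this histogram: for an ordinary least squares fit that includes a constant, the training residuals average to zero by construction, so the informative part of the plot is its shape rather than its centre. A small NumPy sketch on synthetic data (all names and values are made up) illustrates this.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.2, size=100)

# Least-squares fit with an intercept column, like sm.add_constant + OLS.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

print(residuals.mean())  # zero up to floating-point error
```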

In [None]:
# Plotting y_train and y_train_price to understand the residuals.

plt.figure(figsize = (8,6))
plt.scatter(y_train,y_train_price)
plt.title('y_train vs y_train_price', fontsize=20)              # Plot heading 
plt.xlabel('y_train', fontsize=18)                          # X-label
plt.ylabel('y_train_price', fontsize=16)                          # Y-label

In [None]:
# Actual vs Predicted for TRAIN SET

plt.figure(figsize = (8,5))
c = [i for i in range(1,144,1)]
d = [i for i in range(1,144,1)]
plt.plot(c, y_train_price, color="blue", linewidth=1, linestyle="-")     #Plotting predicted
plt.plot(d, y_train, color="red",  linewidth=1, linestyle="-")  #Plotting actual
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('Car Price', fontsize=16)  
plt.show()

In [None]:
# Error terms for TRAIN SET
plt.figure(figsize = (8,5))
c = [i for i in range(1,144,1)]
plt.scatter(c,y_train-y_train_price)

plt.title('Error Terms', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                      # X-label
plt.ylabel('y_train - y_train_price', fontsize=16)                # Y-label

## Making Predictions

In [None]:
# Applying the scaling on the test sets

num_vars = ['fueltype', 'aspiration', 'doornumber', 'enginelocation', 'enginesize','horsepower', 
            'peakrpm', 'highwaympg', 'carlength', 'carheight', 'boreratio', 'stroke', 'price']

df_test[num_vars] = scaler.transform(df_test[num_vars])

In [None]:
# Dividing into X_test and y_test

y_test = df_test.pop('price')
X_test = df_test

In [None]:
X_train_new.drop('const', axis=1, inplace=True)

In [None]:
# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test[X_train_new.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
# Making predictions
y_pred = lm.predict(X_test_new)

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

## Model Evaluation

In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)                          # Y-label

In [None]:
# Actual vs Predicted
c = [i for i in range(1,63,1)]
d = [i for i in range(1,63,1)]
plt.plot(c, y_pred, color="blue", linewidth=1, linestyle="-")     #Plotting predicted
plt.plot(d, y_test, color="red",  linewidth=1, linestyle="-")  #Plotting actual
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('Car Price', fontsize=16)  
plt.show()

In [None]:
# Error terms

fig = plt.figure()
c = [i for i in range(1,63,1)]
plt.scatter(c,y_test-y_pred)

fig.suptitle('Error Terms', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                      # X-label
plt.ylabel('ytest-ypred', fontsize=16)                # Y-label

In [None]:
#RMSE score for test set

import numpy as np
from sklearn import metrics
print('RMSE :', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
#RMSE score for train set

import numpy as np
from sklearn import metrics
print('RMSE :', np.sqrt(metrics.mean_squared_error(y_train, y_train_price)))

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

In [None]:
r2_score(y_train, y_train_price)
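
For reference, both metrics used above can be written out directly. The sketch below uses a few made-up values (`y_true` and `y_hat` are hypothetical) to show the formulas behind `mean_squared_error` and `r2_score`.

```python
import numpy as np

# Hypothetical actual and predicted values on the scaled price axis.
y_true = np.array([0.10, 0.30, 0.50, 0.70])
y_hat = np.array([0.15, 0.25, 0.55, 0.60])

errors = y_true - y_hat
rmse = np.sqrt(np.mean(errors ** 2))                    # root mean squared error
ss_res = np.sum(errors ** 2)                            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)          # total sum of squares
r2 = 1.0 - ss_res / ss_tot                              # coefficient of determination

print(rmse, r2)
```

Since `price` was min-max scaled together with the other numeric columns, the RMSE values reported above are in scaled units rather than in currency.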