# Problem Statement 
Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.
 
They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know: 

- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market. 

# Importing libraries

In [None]:
#importing basic libraries
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

#importing sklearn libraries 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

#importing statsmodels api module
import statsmodels.api as sm

#importing warnings
from warnings import filterwarnings
filterwarnings('ignore')
sb.set_style("darkgrid")


In [None]:
pd.options.display.max_columns = 40

In [None]:
#Reading dataset
df=pd.read_csv("../input/car-price-prediction/CarPrice_Assignment.csv")

In [None]:
df.info()

In [None]:
df.head()

In [None]:
#Checking the percentage of null values in each column
round(((df.isna().sum() / df.shape[0]) * 100),2)

# Data Preparation 
The CarName variable has two parts: the first word is the car company and the rest is the car model. For example, chevrolet impala has 'chevrolet' as the company name and 'impala' as the model name. Only the company name is used as an independent variable for model building.

### Checking the CarName column


In [None]:
#Splitting the CarName column and keeping only the company name
df.CarName=df.CarName.apply(lambda x:x.split(" ")[0])

In [None]:
#Checking the unique values
df.CarName.unique()

A few of the company names are misspelled or inconsistently written, so we correct them

In [None]:
#Assigning the result back instead of replacing in place on an attribute-accessed column
df["CarName"]=df["CarName"].replace({"maxda":"mazda","vokswagen":"volkswagen","vw":"volkswagen","Nissan":"nissan","porcshce":"porsche","toyouta":"toyota"})
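The cleanup steps above can be sketched on a toy series. The values below are illustrative stand-ins rather than rows from the dataset, and the lower-casing step is an assumed variation that makes case fixes such as "Nissan" unnecessary.

```python
import pandas as pd

# Toy stand-in for the CarName column (illustrative values only)
names = pd.Series(["maxda rx3", "Nissan versa", "vw rabbit", "toyota corolla"])

# Keep the first word (the company), lower-case it, then fix known misspellings
company = names.str.split(" ").str[0].str.lower()
fixes = {"maxda": "mazda", "vw": "volkswagen"}  # subset of the corrections used above
company = company.replace(fixes)

print(company.tolist())
```

Lower-casing first means only genuine misspellings need an entry in the dictionary.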

### Checking the doornumber column

In [None]:
df.doornumber.unique()

In [None]:
#mapping the door number from object to numerical
df.doornumber=df.doornumber.replace({"two":2,"four":4}).astype(np.int64)

### Checking the drivewheel column

In [None]:
df.drivewheel.unique()

### Checking the cylindernumber column

In [None]:
df.cylindernumber.unique()

In [None]:
#mapping the cylinder count from words to numbers
df.cylindernumber=df.cylindernumber.map({"four":4,"six":6,"five":5,"three":3,"twelve":12,"two":2,"eight":8}).astype(np.int64)
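One thing to keep in mind with `.map` and a dictionary: any value missing from the dictionary becomes NaN, and a later integer cast would then fail. A minimal sketch of checking for unmapped values first, using a toy series in which "ten" is deliberately left out of the mapping:

```python
import pandas as pd

words = pd.Series(["four", "six", "twelve", "ten"])  # "ten" is deliberately unmapped
word_to_num = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}

mapped = words.map(word_to_num)

# Values that were not found in the dictionary show up as NaN after map
unmapped = words[mapped.isna()].unique().tolist()
print(unmapped)
```

If the list is empty, the integer cast is safe.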

In [None]:
df.cylindernumber.unique()

# Data Visualisation and Insights (EDA)

First, plotting the categorical columns

#### Checking the price distribution of ```fueltype```, ```aspiration``` and ```doornumber```, and their counts

In [None]:
plt.figure(figsize=(20,10))

#FUELTYPE

# fueltype vs price distribution
# (x/y are passed as keywords; recent seaborn versions reject positional data arguments)
plt.subplot(2,3,1)
plt.title("fueltype vs price")
sb.boxplot(x="fueltype",y="price",data=df)
#count
plt.subplot(2,3,4)
plt.title("No. of cars with specified fueltype")
sb.countplot(x="fueltype",data=df)

#ASPIRATION

#aspiration vs price distribution
plt.subplot(2,3,2)
plt.title("Aspiration used in the car vs price")
sb.boxplot(x="aspiration",y="price",data=df)
#count
plt.subplot(2,3,5)
plt.title("Count of specific aspiration type")
sb.countplot(x="aspiration",data=df)

#DOORNUMBER

#doornumber vs price distribution
plt.subplot(2,3,3)
plt.title("No. of doors vs price")
sb.boxplot(x="doornumber",y="price",data=df)
#count
plt.subplot(2,3,6)
plt.title("Count of cars with specific door number")
sb.countplot(x="doornumber",data=df)

Diesel cars are priced noticeably higher, but there are far fewer of them in the data.
Turbo-aspirated cars are also priced higher but are less common.
Price does not vary much with the number of doors, though four-door cars are more common.

#### Now analysing carbody against price

In [None]:
plt.figure(figsize=(20,4))
plt.subplot(1,2,1)
plt.title("Count of cars per carbody type")
sb.countplot(x="carbody",data=df)
plt.subplot(1,2,2)
plt.title("Type of carbody vs price")
sb.boxplot(x="carbody",y="price",data=df)

Again we see the same pattern: the higher-priced body types are the less common ones

#### Checking which enginelocation and enginetype are most common

In [None]:
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
plt.title("No. of cars vs engine location")
sb.countplot(x="enginelocation",data=df)
plt.subplot(1,2,2)
plt.title("No. of cars vs engine type")
sb.countplot(x="enginetype",data=df)

#### The large majority of cars have Engine Location ```FRONT``` and Engine Type ```OHC```

In [None]:
plt.figure(figsize=(7,4))
plt.title("No. of cylinders used in the cars")
sb.countplot(x="cylindernumber",data=df)


#### Cars with 4 cylinders are the most common

#### Now checking which fuelsystem is used the most and how price varies across fuel systems

In [None]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.title("No. of cars for particular fuelsystem")
sb.countplot(x="fuelsystem",data=df)
plt.subplot(1,2,2)
plt.title("fuelsystem vs car prices")
sb.barplot(x="fuelsystem",y="price",data=df)

#### From the above graph, mpfi has a somewhat higher price and the highest count, so ```2bbl``` can also be considered as a fuel system in terms of price and usage

### Now analysing the numerical columns

In [None]:
int_vars=df.select_dtypes(exclude="object")

In [None]:
int_vars.head()

In [None]:
nvars=int_vars.drop(labels=["car_ID","cylindernumber","price"],axis=1)

In [None]:
cols=list(nvars.columns)

### Plotting pairplots of the numerical columns against the price column

In [None]:
#Plotting in consecutive groups of five so that no column is skipped
sb.pairplot(data=df,x_vars=cols[:5],y_vars=["price"])

In [None]:
sb.pairplot(data=df,x_vars=cols[5:10],y_vars=["price"])

In [None]:
sb.pairplot(data=df,x_vars=cols[10:],y_vars=["price"])
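When a long column list is plotted in groups, the group boundaries can be generated rather than typed by hand, which avoids accidentally skipping a column. A minimal sketch, where `chunks` is a hypothetical helper and `demo_cols` holds placeholder names:

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

demo_cols = [f"col{i}" for i in range(12)]   # placeholder column names
groups = list(chunks(demo_cols, 5))

# Every column appears in exactly one group
print([len(g) for g in groups])
```

Each group could then be passed as `x_vars` to a separate pairplot call.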

### Plotting the correlation heatmap to see which variables are most correlated

In [None]:
plt.figure(figsize=(15,7))
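#Restricting to non-object columns, since recent pandas versions raise on non-numeric columns in corr()
sb.heatmap(data=df.select_dtypes(exclude="object").corr(),annot=True)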
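A heatmap with many columns can be hard to read, so it may help to list the correlations with the target directly, ordered by strength. The sketch below uses a small made-up frame (the numbers are not from the dataset) to show the idea:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 180, 200],
    "citympg": [38, 31, 27, 19, 17],
    "fueltype": ["gas", "gas", "diesel", "gas", "gas"],
    "price": [6000, 8500, 11000, 18000, 21000],
})

# Correlation of every numeric column with price, strongest first by absolute value
corr_with_price = (
    toy.select_dtypes(include="number")
       .corr()["price"]
       .drop("price")
       .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Sorting by absolute value keeps strong negative relationships, such as mileage against price in this toy frame, near the top.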

### Before dropping the CarName column, let's analyse how the average price is distributed across companies

In [None]:
#histplot with a KDE overlay stands in for the deprecated distplot
sb.histplot(df.groupby(by="CarName")["price"].mean().sort_values(ascending=False),bins=3,kde=True,stat="density")

Here we see a clear spread in price across car companies, so we can create a new derived metric, car_cat, that assigns each car a category according to its price.

In [None]:
def car(x):
    #Price bands used for the derived category
    if x>5000 and x<=15000:
        return "Low"
    elif x>15000 and x<=25000:
        return "Medium"
    else :
        #everything else; assumes no price at or below 5000
        return "High"

In [None]:
df["car_cat"]=df.price.apply(car)
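The same banding can also be expressed with `pd.cut`, shown here as an alternative sketch rather than the approach used in this notebook. The prices are illustrative. One behavioural difference worth noting: `pd.cut` returns NaN for values outside the bins (here, anything at or below 5000) instead of assigning them to a fallback label.

```python
import numpy as np
import pandas as pd

prices = pd.Series([5118.0, 15000.0, 15000.5, 25000.0, 41315.0])  # illustrative prices

# Right-closed bins: (5000, 15000], (15000, 25000], (25000, inf]
car_cat = pd.cut(
    prices,
    bins=[5000, 15000, 25000, np.inf],
    labels=["Low", "Medium", "High"],
)
print(car_cat.tolist())
```

The bin edges are inclusive on the right by default, which matches the `<=` comparisons in the function above.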

In [None]:
df.head()

# CREATING MODEL

### creating dummy variables

In [None]:
#GETTING DUMMIES
dummy_vars=["fueltype","aspiration","carbody","drivewheel","enginelocation","enginetype","fuelsystem","car_cat"]
dummies=pd.get_dummies(data=df[dummy_vars],drop_first=True)

In [None]:
#Dropping the identifier column and the categorical columns that have been replaced by dummies
df.drop(labels=["car_ID","CarName","fueltype","aspiration","carbody","drivewheel","enginelocation","enginetype","fuelsystem","car_cat"],axis=1,inplace=True)

In [None]:
final_df=pd.concat([df,dummies],axis=1)
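To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column. The first category in sorted order is dropped and becomes the baseline, represented by all zeros. The `dtype=int` argument is an assumption added so the output is numeric regardless of the pandas version.

```python
import pandas as pd

toy = pd.DataFrame({"car_cat": ["Low", "Medium", "High", "Low"]})

# "High" sorts first, so it is dropped and acts as the baseline category
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

Dropping one level per categorical variable avoids a set of dummy columns that always sum to one, which would be perfectly collinear with the intercept.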

In [None]:
final_df.head()

### Train_test_split

In [None]:
#SPLITTING THE DATASET

df_train,df_test=train_test_split(final_df,test_size=0.3,random_state=42)

### Scaling the dataset

In [None]:
#SCALING THE FEATUREs
scale=MinMaxScaler()

In [None]:
df_train.columns

In [None]:
cols=['symboling', 'doornumber', 'wheelbase', 'carlength', 'carwidth',
       'carheight', 'curbweight', 'cylindernumber', 'enginesize', 'boreratio',
       'stroke', 'compressionratio', 'horsepower', 'peakrpm', 'citympg',
       'highwaympg', 'price', 'fueltype_gas', 'aspiration_turbo',
       'carbody_hardtop', 'carbody_hatchback', 'carbody_sedan',
       'carbody_wagon', 'drivewheel_fwd', 'drivewheel_rwd',
       'enginelocation_rear', 'enginetype_dohcv', 'enginetype_l',
       'enginetype_ohc', 'enginetype_ohcf', 'enginetype_ohcv',
       'enginetype_rotor', 'fuelsystem_2bbl', 'fuelsystem_4bbl',
       'fuelsystem_idi', 'fuelsystem_mfi', 'fuelsystem_mpfi',
       'fuelsystem_spdi', 'fuelsystem_spfi', 'car_cat_Low', 'car_cat_Medium']

In [None]:
df_train[cols]=scale.fit_transform(df_train[cols])
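With its default settings, `MinMaxScaler` rescales each column as (x - min) / (max - min), where min and max come from the data it was fitted on. The NumPy sketch below mirrors that arithmetic on made-up numbers to show why the scaler is fitted on the training set only: test values outside the training range land outside [0, 1].

```python
import numpy as np

# Made-up single-column data for illustration
train = np.array([[1500.0], [2500.0], [3500.0]])
test = np.array([[2000.0], [4000.0]])

# Statistics are learned from the training data only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range   # reuses the training statistics

print(train_scaled.ravel())
print(test_scaled.ravel())
```

A scaled test value above 1 is expected behaviour here, not an error.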

### Making the target variable and predictor variable

In [None]:
y_train=df_train.pop("price")
X_train=df_train

In [None]:
X_train.shape

In [None]:
y_train.shape

## Building the Model

In [None]:
lm=LinearRegression()
lm.fit(X_train,y_train)

#### Using the Recursive Feature Elimination (RFE) technique to remove the unnecessary columns

In [None]:
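#n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe=RFE(lm,n_features_to_select=10)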
rfe=rfe.fit(X_train,y_train)

In [None]:
rfe.ranking_

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
final_cols=X_train.columns[rfe.support_].to_list()

In [None]:
#Checking the final columns 
final_cols

In [None]:
X_train.columns[~rfe.support_]
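The idea behind recursive feature elimination can be sketched in a few lines of NumPy: fit a linear model, drop the feature with the smallest absolute coefficient, and repeat until the desired number remains. This is a simplified illustration on synthetic data, not scikit-learn's implementation, and `rfe_sketch` is a hypothetical name.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Return indices of the columns kept after recursive elimination."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        # Fit least squares with an intercept on the remaining columns
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Drop the column with the smallest absolute slope (intercept excluded)
        weakest = int(np.argmin(np.abs(coef[1:])))
        keep.pop(weakest)
    return keep

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))   # features on a common 0-1 scale
y = 5 * X[:, 0] - 3 * X[:, 2] + rng.normal(0, 0.1, size=200)

selected = rfe_sketch(X, y, n_keep=2)
print(selected)
```

Comparing coefficients by magnitude only makes sense when the features share a scale, which is one reason the scaling step precedes RFE.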

### Building the model using statsmodels to get a statistically significant model

In [None]:
################################### ITERATION 1 ############################################

In [None]:
X_train_rfe=X_train[final_cols]

In [None]:
#Making the model using statsmodel (sm)
X_train_rfe=sm.add_constant(X_train_rfe)
lm = sm.OLS(y_train,X_train_rfe).fit()
lm.summary()

This is the statsmodels OLS summary for the RFE-selected columns

Next, checking the Variance Inflation Factor (VIF) of each column

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
vif = pd.DataFrame()
X=X_train_rfe.drop("const",axis=1)
vif['features'] = X.columns
vif['VIF']=[variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif=vif.sort_values(by='VIF',ascending=False)
vif
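For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The NumPy sketch below implements this textbook definition with an intercept in each auxiliary regression. `vif_sketch` is a hypothetical helper, the data is synthetic, and its numbers may differ from those of the statsmodels function when the matrix passed to it has no constant column.

```python
import numpy as np

def vif_sketch(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    n, p = X.shape
    out = []
    for i in range(p):
        target = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)   # nearly a copy of a
c = rng.normal(size=300)                  # unrelated to a and b

vifs = vif_sketch(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The two near-duplicate columns get very large VIFs while the unrelated column stays close to 1, which is the pattern the elimination steps in this notebook look for.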

#### Removing the ```citympg``` column as its p-value is high, and refitting the model on the remaining columns

In [None]:
X_train_2=X_train_rfe.drop(["citympg"],axis=1)

In [None]:
###################################     ITERATION 2   ###########################

In [None]:
X_train_2=sm.add_constant(X_train_2)
lm2 = sm.OLS(y_train,X_train_2).fit()
lm2.summary()

In [None]:
vif = pd.DataFrame()
X=X_train_2.drop("const",axis=1)
vif['features'] = X.columns
vif['VIF']=[variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif=vif.sort_values(by='VIF',ascending=False)
vif

Removing the ```carbody_wagon``` column due to its p-value (> 0.05)

In [None]:
X_train_3=X_train_2.drop(["carbody_wagon"],axis=1)

In [None]:
##################################      ITERATION 3    #####################

In [None]:
X_train_3=sm.add_constant(X_train_3)
lm3 = sm.OLS(y_train,X_train_3).fit()
lm3.summary()

In [None]:
vif = pd.DataFrame()
X=X_train_3.drop("const",axis=1)
vif['features'] = X.columns
vif['VIF']=[variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif=vif.sort_values(by='VIF',ascending=False)
vif

Refitting the model after removing the ```highwaympg``` column, as it has a high VIF value

In [None]:
X_train_4=X_train_3.drop(["highwaympg"],axis=1)

In [None]:
#######################################       ITERATION 4    ###################

In [None]:
X_train_4=sm.add_constant(X_train_4)
lm4 = sm.OLS(y_train,X_train_4).fit()
lm4.summary()

In [None]:
vif = pd.DataFrame()
X=X_train_4.drop("const",axis=1)
vif['features'] = X.columns
vif['VIF']=[variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif=vif.sort_values(by='VIF',ascending=False)
vif

Removing the ```compressionratio``` column due to its p-value (> 0.05) and refitting on the remaining columns

In [None]:
############################## ITERATION 5 ##############################

In [None]:
X_train_5=X_train_4.drop(["compressionratio"],axis=1)

In [None]:
X_train_5=sm.add_constant(X_train_5)
lm5 = sm.OLS(y_train,X_train_5).fit()
lm5.summary()

In [None]:
vif = pd.DataFrame()
X=X_train_5.drop("const",axis=1)
vif['features'] = X.columns
vif['VIF']=[variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif=vif.sort_values(by='VIF',ascending=False)
vif

Refitting the model after removing the ```carwidth``` column, as it has a high VIF value

In [None]:
##################################  ITERATION 6 ##############################

In [None]:
X_train_6=X_train_5.drop(["carwidth"],axis=1)

In [None]:
X_train_6=sm.add_constant(X_train_6)
lm6 = sm.OLS(y_train,X_train_6).fit()
lm6.summary()

In [None]:
vif = pd.DataFrame()
X=X_train_6.drop("const",axis=1)
vif['features'] = X.columns
vif['VIF']=[variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif=vif.sort_values(by='VIF',ascending=False)
vif

The p-values and VIFs are now within the required limits, so we check the remaining correlations with a heatmap

In [None]:
sb.heatmap(X_train_6.corr(),annot=True)

#### car_cat_Low and car_cat_Medium are highly correlated. Removing car_cat_Low drops the R2 score significantly, so we try removing car_cat_Medium instead and check the R2 score and p-values

In [None]:
##################################  ITERATION 7 ##############################

In [None]:
X_train_7=X_train_6.drop(["car_cat_Medium"],axis=1)

In [None]:
X_train_7=sm.add_constant(X_train_7)
lm7 = sm.OLS(y_train,X_train_7).fit()
lm7.summary()

In [None]:
vif = pd.DataFrame()
X=X_train_7.drop("const",axis=1)
vif['features'] = X.columns
vif['VIF']=[variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif=vif.sort_values(by='VIF',ascending=False)
vif

#### The VIFs, p-values and R2 score do not change significantly, so we take this as the final model and use the above columns for further analysis

# Residual Analysis

### For training data

In [None]:
#predicting the values
y_train_pred=lm7.predict(X_train_7)

In [None]:
#Finding residuals
res = y_train - y_train_pred
plt.axvline(x=0,color="red")

#distribution of error terms (histplot and rugplot stand in for the deprecated distplot)
sb.histplot(res,kde=True,stat="density",color="red")
sb.rugplot(res,color="red")
plt.xlabel("errors",fontsize=15)
r2_score(y_train, y_train_pred)

The training residuals are centred close to 0, though not perfectly symmetric, and their distribution is approximately normal
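One point worth keeping in mind when reading this plot: for an ordinary least squares fit that includes an intercept, the training residuals average to zero by construction, so the shape of the distribution is more informative than its centre. A small sketch on synthetic data illustrates this:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100)
y = 0.2 + 0.6 * x + rng.normal(0, 0.05, size=100)   # synthetic linear data

A = np.column_stack([np.ones_like(x), x])            # intercept + one predictor
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
res = y - A @ coef

# The mean of the residuals is zero up to floating point error
print(abs(res.mean()) < 1e-10)
```

The residuals on the test set carry no such guarantee, which is why they are checked separately.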

### For testing data

In [None]:
#Columns used by the final model, plus the target
print(X_train_7.columns)
var=['price', 'enginesize', 'carbody_hardtop', 'fuelsystem_4bbl',
       'car_cat_Low']

Scaling the test data

In [None]:
df_test[cols]=scale.transform(df_test[cols])

In [None]:
df_test_var=df_test[var]

Building the test dataframe to match the columns of the final model

In [None]:
df_test_var=sm.add_constant(df_test_var)

In [None]:
y_test=df_test_var.pop("price")
X_test_var=df_test_var

In [None]:
X_test_var.head()

Predicting the values

In [None]:
y_test_pred=lm7.predict(X_test_var)

Checking the distribution of test error values

In [None]:
res2 = y_test-y_test_pred
plt.axvline(x=0,color="blue")
#histplot and rugplot stand in for the deprecated distplot
sb.histplot(res2,kde=True,stat="density")
sb.rugplot(res2)

## Model Evaluation

In [None]:
#R2 score of the test data
r2_score (y_test, y_test_pred)
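For reference, the R2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A minimal NumPy sketch of that formula, with `r2_sketch` as a hypothetical name and toy numbers:

```python
import numpy as np

def r2_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_sketch(y_true, [1.0, 2.0, 3.0, 4.0])    # perfect predictions
baseline = r2_sketch(y_true, [2.5, 2.5, 2.5, 2.5])   # always predicting the mean
print(perfect, baseline)
```

A score of 1 means the predictions match exactly, while 0 means the model does no better than predicting the mean. On test data the score can also be negative.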

In [None]:
#plotting y_test against y_pred, with the axes matching their labels
plt.xlabel("y_test",fontsize=15)
plt.ylabel("y_pred",fontsize=15)
plt.scatter(y_test,y_test_pred)

y_test and y_pred show a roughly linear relationship, so the overall model fits reasonably well

# Final Inferences

From this analysis, we conclude that the following variables describe car prices to a good extent:

- enginesize
- car_cat_Low
- carbody_hardtop
- fuelsystem_4bbl
