# **Used Cars Price Prediction**
# **Milestone 2**

## **Model Building**

1. The target we want to predict is "Price". We will model its log-transformed version, 'price_log'.
2. Before we build the models, we have to encode the categorical features. We will drop high-cardinality categorical features like Name.
3. We'll split the data into train and test, to be able to evaluate the model that we build on the train data.
4. Build Regression models using train data.
5. Evaluate the model performance.

**Note:** Please load the data frame that was saved in Milestone 1 here before separating the data, and then proceed to the next step in Milestone 2.

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn import metrics
from sklearn.metrics import r2_score

import scipy.stats as stats

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Remove the limit on the number of displayed columns. It helps to see the entire dataframe while printing it
pd.set_option("display.max_columns", None)


### **Load the data**

In [2]:
import pandas as pd

cars_data = pd.read_csv("cars_data_updated.csv")

In [3]:
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7252 entries, 0 to 7251
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Name                   7252 non-null   object 
 1   Location               7252 non-null   object 
 2   Year                   7252 non-null   int64  
 3   Kilometers_Driven      7252 non-null   int64  
 4   Fuel_Type              7252 non-null   object 
 5   Transmission           7252 non-null   object 
 6   Owner_Type             7252 non-null   object 
 7   Mileage                7252 non-null   float64
 8   Engine                 7252 non-null   float64
 9   Power                  7252 non-null   float64
 10  Seats                  7252 non-null   float64
 11  New_price              7252 non-null   float64
 12  Price                  7252 non-null   float64
 13  kilometers_driven_log  7252 non-null   float64
 14  price_log              7252 non-null   float64
 15  Name

In [4]:
cars_data.nunique()

Name                     2041
Location                   11
Year                       23
Kilometers_Driven        3659
Fuel_Type                   5
Transmission                2
Owner_Type                  4
Mileage                   438
Engine                    150
Power                     386
Seats                       8
New_price                 642
Price                    1421
kilometers_driven_log    3659
price_log                1420
Name_Brand&Model          223
dtype: int64

In [5]:
cars_data.isnull().sum()

Name                     0
Location                 0
Year                     0
Kilometers_Driven        0
Fuel_Type                0
Transmission             0
Owner_Type               0
Mileage                  0
Engine                   0
Power                    0
Seats                    0
New_price                0
Price                    0
kilometers_driven_log    0
price_log                0
Name_Brand&Model         0
dtype: int64

### **Split the Data**

<li>Step 1: Separate the independent variables (X) and the dependent variable (y).
<li>Step 2: Encode the categorical variables in X using pd.get_dummies.
<li>Step 3: Split the data into train and test sets using train_test_split.
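
The three steps can be sketched on a tiny, made-up frame. The rows below are hypothetical and only illustrate the mechanics; they are not taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: hypothetical rows, used only to illustrate the three steps
toy = pd.DataFrame({
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel", "Petrol", "CNG"],
    "Power": [58.2, 126.2, 88.7, 88.8, 140.8, 60.0],
    "price_log": [0.56, 2.53, 1.50, 1.79, 2.88, 0.70],
})

# Step 1: separate the features from the target
X_toy = toy.drop(columns=["price_log"])
y_toy = toy["price_log"]

# Step 2: one-hot encode; drop_first=True drops the first level ("CNG")
# so that the remaining dummy columns are not perfectly collinear
X_toy = pd.get_dummies(X_toy, drop_first=True)

# Step 3: hold out part of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=1
)

print(list(X_toy.columns))
print(X_tr.shape, X_te.shape)
```

The dropped level ("CNG" here) becomes the baseline: a row with all `Fuel_Type_*` dummies equal to zero is a CNG row.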

**Think about it:** Why should we drop 'Name', 'Price', 'price_log', and 'Kilometers_Driven' from X before splitting?
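
One way to approach the question is to check how the dropped columns relate to the target. The sketch below uses the first five (Price, price_log) pairs printed by `y.head()` in this notebook and checks that `price_log` matches the natural log of `Price`.

```python
import numpy as np

# First five (Price, price_log) pairs as printed by y.head() in this notebook
price = np.array([1.75, 12.5, 4.5, 6.0, 17.74])
price_log = np.array([0.559616, 2.525729, 1.504077, 1.791759, 2.875822])

# If price_log is the natural log of Price, keeping either column in X
# would hand the target to the model (target leakage)
is_log_of_price = np.allclose(np.log(price), price_log, atol=1e-5)
print(is_log_of_price)
```

A similar argument applies to 'Kilometers_Driven', whose information is already carried by 'kilometers_driven_log'.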

In [6]:
# Step-1
X = cars_data.drop(['Name','Price','price_log','Kilometers_Driven'], axis = 1)

y = cars_data[["price_log", "Price"]]

X.head()

| | Location | Year | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | kilometers_driven_log | Name_Brand&Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | CNG | Manual | First | 26.6 | 998.0 | 58.16 | 5.0 | 5.51 | 11.184421 | Maruti Wagon |
| 1 | Pune | 2015 | Diesel | Manual | First | 19.67 | 1582.0 | 126.2 | 5.0 | 16.06 | 10.621327 | Hyundai Creta |
| 2 | Chennai | 2011 | Petrol | Manual | First | 18.2 | 1199.0 | 88.7 | 5.0 | 8.61 | 10.736397 | Honda Jazz |
| 3 | Chennai | 2012 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | 11.27 | 11.373663 | Maruti Ertiga |
| 4 | Coimbatore | 2013 | Diesel | Automatic | Second | 15.2 | 1968.0 | 140.8 | 5.0 | 53.14 | 10.613246 | Audi A4 |


In [7]:
y.head()

| | price_log | Price |
|---|---|---|
| 0 | 0.559616 | 1.75 |
| 1 | 2.525729 | 12.5 |
| 2 | 1.504077 | 4.5 |
| 3 | 1.791759 | 6.0 |
| 4 | 2.875822 | 17.74 |


In [8]:
# Step-2 Use pd.get_dummies(drop_first = True)
X = pd.get_dummies(X, drop_first = True)
X.head()

Unnamed: 0,Year,Mileage,Engine,Power,Seats,New_price,kilometers_driven_log,Location_Bangalore,Location_Chennai,Location_Coimbatore,Location_Delhi,Location_Hyderabad,Location_Jaipur,Location_Kochi,Location_Kolkata,Location_Mumbai,Location_Pune,Fuel_Type_Diesel,Fuel_Type_Electric,Fuel_Type_LPG,Fuel_Type_Petrol,Transmission_Manual,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third,Name_Brand&Model_Audi A3,Name_Brand&Model_Audi A4,Name_Brand&Model_Audi A6,Name_Brand&Model_Audi A7,Name_Brand&Model_Audi A8,Name_Brand&Model_Audi Q3,Name_Brand&Model_Audi Q5,Name_Brand&Model_Audi Q7,Name_Brand&Model_Audi RS5,Name_Brand&Model_Audi TT,Name_Brand&Model_BMW 1,Name_Brand&Model_BMW 3,Name_Brand&Model_BMW 5,Name_Brand&Model_BMW 6,Name_Brand&Model_BMW 7,Name_Brand&Model_BMW X1,Name_Brand&Model_BMW X3,Name_Brand&Model_BMW X5,Name_Brand&Model_BMW X6,Name_Brand&Model_BMW Z4,Name_Brand&Model_Bentley Continental,Name_Brand&Model_Bentley Flying,Name_Brand&Model_Chevrolet Aveo,Name_Brand&Model_Chevrolet Beat,Name_Brand&Model_Chevrolet Captiva,Name_Brand&Model_Chevrolet Cruze,Name_Brand&Model_Chevrolet Enjoy,Name_Brand&Model_Chevrolet Optra,Name_Brand&Model_Chevrolet Sail,Name_Brand&Model_Chevrolet Spark,Name_Brand&Model_Chevrolet Tavera,Name_Brand&Model_Datsun GO,Name_Brand&Model_Datsun Redi,Name_Brand&Model_Datsun redi-GO,Name_Brand&Model_Fiat Abarth,Name_Brand&Model_Fiat Avventura,Name_Brand&Model_Fiat Grande,Name_Brand&Model_Fiat Linea,Name_Brand&Model_Fiat Petra,Name_Brand&Model_Fiat Punto,Name_Brand&Model_Fiat Siena,Name_Brand&Model_Force One,Name_Brand&Model_Ford Aspire,Name_Brand&Model_Ford Classic,Name_Brand&Model_Ford EcoSport,Name_Brand&Model_Ford Ecosport,Name_Brand&Model_Ford Endeavour,Name_Brand&Model_Ford Fiesta,Name_Brand&Model_Ford Figo,Name_Brand&Model_Ford Freestyle,Name_Brand&Model_Ford Fusion,Name_Brand&Model_Ford Ikon,Name_Brand&Model_Ford Mustang,Name_Brand&Model_Hindustan Motors,Name_Brand&Model_Honda Accord,Name_Brand&Model_Honda 
Amaze,Name_Brand&Model_Honda BR-V,Name_Brand&Model_Honda BRV,Name_Brand&Model_Honda Brio,Name_Brand&Model_Honda CR-V,Name_Brand&Model_Honda City,Name_Brand&Model_Honda Civic,Name_Brand&Model_Honda Jazz,Name_Brand&Model_Honda Mobilio,Name_Brand&Model_Honda WR-V,Name_Brand&Model_Honda WRV,Name_Brand&Model_Hyundai Accent,Name_Brand&Model_Hyundai Creta,Name_Brand&Model_Hyundai EON,Name_Brand&Model_Hyundai Elantra,Name_Brand&Model_Hyundai Elite,Name_Brand&Model_Hyundai Getz,Name_Brand&Model_Hyundai Grand,Name_Brand&Model_Hyundai Santa,Name_Brand&Model_Hyundai Santro,Name_Brand&Model_Hyundai Sonata,Name_Brand&Model_Hyundai Tucson,Name_Brand&Model_Hyundai Verna,Name_Brand&Model_Hyundai Xcent,Name_Brand&Model_Hyundai i10,Name_Brand&Model_Hyundai i20,Name_Brand&Model_ISUZU D-MAX,Name_Brand&Model_Isuzu MU,Name_Brand&Model_Isuzu MUX,Name_Brand&Model_Jaguar F,Name_Brand&Model_Jaguar XE,Name_Brand&Model_Jaguar XF,Name_Brand&Model_Jaguar XJ,Name_Brand&Model_Jeep Compass,Name_Brand&Model_Lamborghini Gallardo,Name_Brand&Model_Land Rover,Name_Brand&Model_Mahindra Bolero,Name_Brand&Model_Mahindra E,Name_Brand&Model_Mahindra Jeep,Name_Brand&Model_Mahindra KUV,Name_Brand&Model_Mahindra Logan,Name_Brand&Model_Mahindra NuvoSport,Name_Brand&Model_Mahindra Quanto,Name_Brand&Model_Mahindra Renault,Name_Brand&Model_Mahindra Scorpio,Name_Brand&Model_Mahindra Ssangyong,Name_Brand&Model_Mahindra TUV,Name_Brand&Model_Mahindra Thar,Name_Brand&Model_Mahindra Verito,Name_Brand&Model_Mahindra XUV300,Name_Brand&Model_Mahindra XUV500,Name_Brand&Model_Mahindra Xylo,Name_Brand&Model_Maruti 1000,Name_Brand&Model_Maruti 800,Name_Brand&Model_Maruti A-Star,Name_Brand&Model_Maruti Alto,Name_Brand&Model_Maruti Baleno,Name_Brand&Model_Maruti Celerio,Name_Brand&Model_Maruti Ciaz,Name_Brand&Model_Maruti Dzire,Name_Brand&Model_Maruti Eeco,Name_Brand&Model_Maruti Ertiga,Name_Brand&Model_Maruti Esteem,Name_Brand&Model_Maruti Estilo,Name_Brand&Model_Maruti Grand,Name_Brand&Model_Maruti Ignis,Name_Brand&Model_Maruti 
Omni,Name_Brand&Model_Maruti Ritz,Name_Brand&Model_Maruti S,Name_Brand&Model_Maruti S-Cross,Name_Brand&Model_Maruti SX4,Name_Brand&Model_Maruti Swift,Name_Brand&Model_Maruti Versa,Name_Brand&Model_Maruti Vitara,Name_Brand&Model_Maruti Wagon,Name_Brand&Model_Maruti Zen,Name_Brand&Model_Mercedes-Benz A,Name_Brand&Model_Mercedes-Benz B,Name_Brand&Model_Mercedes-Benz C-Class,Name_Brand&Model_Mercedes-Benz CLA,Name_Brand&Model_Mercedes-Benz CLS-Class,Name_Brand&Model_Mercedes-Benz E-Class,Name_Brand&Model_Mercedes-Benz GL-Class,Name_Brand&Model_Mercedes-Benz GLA,Name_Brand&Model_Mercedes-Benz GLC,Name_Brand&Model_Mercedes-Benz GLE,Name_Brand&Model_Mercedes-Benz GLS,Name_Brand&Model_Mercedes-Benz M-Class,Name_Brand&Model_Mercedes-Benz New,Name_Brand&Model_Mercedes-Benz R-Class,Name_Brand&Model_Mercedes-Benz S,Name_Brand&Model_Mercedes-Benz S-Class,Name_Brand&Model_Mercedes-Benz SL-Class,Name_Brand&Model_Mercedes-Benz SLC,Name_Brand&Model_Mercedes-Benz SLK-Class,Name_Brand&Model_Mini Clubman,Name_Brand&Model_Mini Cooper,Name_Brand&Model_Mini Countryman,Name_Brand&Model_Mitsubishi Cedia,Name_Brand&Model_Mitsubishi Lancer,Name_Brand&Model_Mitsubishi Montero,Name_Brand&Model_Mitsubishi Outlander,Name_Brand&Model_Mitsubishi Pajero,Name_Brand&Model_Nissan 370Z,Name_Brand&Model_Nissan Evalia,Name_Brand&Model_Nissan Micra,Name_Brand&Model_Nissan Sunny,Name_Brand&Model_Nissan Teana,Name_Brand&Model_Nissan Terrano,Name_Brand&Model_Nissan X-Trail,Name_Brand&Model_OpelCorsa 1.4Gsi,Name_Brand&Model_Porsche Boxster,Name_Brand&Model_Porsche Cayenne,Name_Brand&Model_Porsche Cayman,Name_Brand&Model_Porsche Panamera,Name_Brand&Model_Renault Captur,Name_Brand&Model_Renault Duster,Name_Brand&Model_Renault Fluence,Name_Brand&Model_Renault KWID,Name_Brand&Model_Renault Koleos,Name_Brand&Model_Renault Lodgy,Name_Brand&Model_Renault Pulse,Name_Brand&Model_Renault Scala,Name_Brand&Model_Skoda Fabia,Name_Brand&Model_Skoda Laura,Name_Brand&Model_Skoda Octavia,Name_Brand&Model_Skoda 
Rapid,Name_Brand&Model_Skoda Superb,Name_Brand&Model_Skoda Yeti,Name_Brand&Model_Smart Fortwo,Name_Brand&Model_Tata Bolt,Name_Brand&Model_Tata Hexa,Name_Brand&Model_Tata Indica,Name_Brand&Model_Tata Indigo,Name_Brand&Model_Tata Manza,Name_Brand&Model_Tata Nano,Name_Brand&Model_Tata New,Name_Brand&Model_Tata Nexon,Name_Brand&Model_Tata Safari,Name_Brand&Model_Tata Sumo,Name_Brand&Model_Tata Tiago,Name_Brand&Model_Tata Tigor,Name_Brand&Model_Tata Venture,Name_Brand&Model_Tata Xenon,Name_Brand&Model_Tata Zest,Name_Brand&Model_Toyota Camry,Name_Brand&Model_Toyota Corolla,Name_Brand&Model_Toyota Etios,Name_Brand&Model_Toyota Fortuner,Name_Brand&Model_Toyota Innova,Name_Brand&Model_Toyota Land,Name_Brand&Model_Toyota Platinum,Name_Brand&Model_Toyota Prius,Name_Brand&Model_Toyota Qualis,Name_Brand&Model_Volkswagen Ameo,Name_Brand&Model_Volkswagen Beetle,Name_Brand&Model_Volkswagen CrossPolo,Name_Brand&Model_Volkswagen Jetta,Name_Brand&Model_Volkswagen Passat,Name_Brand&Model_Volkswagen Polo,Name_Brand&Model_Volkswagen Tiguan,Name_Brand&Model_Volkswagen Vento,Name_Brand&Model_Volvo S60,Name_Brand&Model_Volvo S80,Name_Brand&Model_Volvo V40,Name_Brand&Model_Volvo XC60,Name_Brand&Model_Volvo XC90
0,2010,26.6,998.0,58.16,5.0,5.51,11.184421,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2015,19.67,1582.0,126.2,5.0,16.06,10.621327,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2011,18.2,1199.0,88.7,5.0,8.61,10.736397,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2012,20.77,1248.0,88.76,7.0,11.27,11.373663,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,2013,15.2,1968.0,140.8,5.0,53.14,10.613246,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [9]:
from sklearn.model_selection import train_test_split

# Step-3 Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

print(X_train.shape, X_test.shape)

(5076, 247) (2176, 247)


In [10]:
# Let us write a function for calculating r2_score and RMSE on train and test data
# This function takes as input a model that has been fitted on the log of the price
# It converts the predictions back to the original price scale before computing the scores

def get_model_score(model, flag = True):
    '''
    model : fitted regressor used to predict price_log for X_train and X_test
    flag : if True, print the scores (default True)

    Returns [train_r2, test_r2, train_rmse, test_rmse], all computed on the Price scale
    '''
    # Predict on the log scale, then invert the log transform with np.exp
    pred_train = model.predict(X_train)
    pred_train_ = np.exp(pred_train)

    pred_test = model.predict(X_test)
    pred_test_ = np.exp(pred_test)

    train_r2 = metrics.r2_score(y_train['Price'], pred_train_)
    test_r2 = metrics.r2_score(y_test['Price'], pred_test_)

    # Taking np.sqrt of the MSE gives the RMSE without relying on the `squared`
    # argument, which is deprecated in recent scikit-learn releases
    train_rmse = np.sqrt(metrics.mean_squared_error(y_train['Price'], pred_train_))
    test_rmse = np.sqrt(metrics.mean_squared_error(y_test['Price'], pred_test_))

    # Collect all scores in a list
    score_list = [train_r2, test_r2, train_rmse, test_rmse]

    # Print the scores only if flag is True (the default)
    if flag:
        print("R-squared on training set : ", train_r2)
        print("R-squared on test set : ", test_r2)
        print("RMSE on training set : ", train_rmse)
        print("RMSE on test set : ", test_rmse)

    # Return the list with train and test scores
    return score_list
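
`get_model_score` reads `X_train`, `X_test`, `y_train`, and `y_test` from the notebook's global scope. As a sketch only, the variant below takes the data explicitly, which makes the back-transform from log price to price easier to test in isolation. The helper name `score_on_price_scale` and the synthetic data are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def score_on_price_scale(model, X, price):
    """Hypothetical helper: R-squared and RMSE of a log-price model, on the price scale."""
    pred_price = np.exp(model.predict(X))  # invert the natural-log transform
    r2 = r2_score(price, pred_price)
    rmse = np.sqrt(mean_squared_error(price, pred_price))
    return r2, rmse

# Synthetic, noise-free data: log(price) is exactly linear in the two features
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(200, 2))
price_demo = np.exp(1.0 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1])

# Fit on the log scale, score on the price scale
model = LinearRegression().fit(X_demo, np.log(price_demo))
r2, rmse = score_on_price_scale(model, X_demo, price_demo)
print(r2, rmse)
```

Because the metrics are computed after `np.exp`, the RMSE is expressed in the same units as `Price`, not in log units.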

<hr>

For Regression Problems, some of the algorithms used are :<br>

**1) Linear Regression** <br>
**2) Ridge / Lasso Regression** <br>
**3) Decision Trees** <br>
**4) Random Forest** <br>

### **Fitting a linear model**

Linear Regression can be implemented using: <br>

**1) Sklearn:** https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html <br>
**2) Statsmodels:** https://www.statsmodels.org/stable/regression.html

In [11]:
# Import Linear Regression from sklearn
from sklearn.linear_model import LinearRegression

In [12]:
# Create a linear regression model
lr = LinearRegression()

In [13]:
# Fit linear regression model
lr.fit(X_train, y_train['price_log']) 

LinearRegression()

In [14]:
# Get score of the model
LR_score = get_model_score(lr)

R-squared on training set :  0.9194220665928118
R-squared on test set :  0.8357036049135772
RMSE on training set :  3.0496936686528704
RMSE on test set :  4.337921116855664


**Observations from results: _____**

**Important variables of Linear Regression**

Building a model using statsmodels.

In [15]:
# Import Statsmodels 
import statsmodels.api as sm

# The statsmodels API does not add a constant by default. We need to add it explicitly
x_train = sm.add_constant(X_train)

# Add constant to test data
x_test = sm.add_constant(X_test)

def build_ols_model(train):
    
    # Create the model
    olsmodel = sm.OLS(y_train["price_log"], train)
    
    return olsmodel.fit()


# Fit linear model on new dataset
olsmodel1 = build_ols_model(x_train)

print(olsmodel1.summary())

                            OLS Regression Results                            
Dep. Variable:              price_log   R-squared:                       0.941
Model:                            OLS   Adj. R-squared:                  0.939
Method:                 Least Squares   F-statistic:                     329.9
Date:                Sun, 04 Sep 2022   Prob (F-statistic):               0.00
Time:                        21:24:31   Log-Likelihood:                 796.89
No. Observations:                5076   AIC:                            -1120.
Df Residuals:                    4839   BIC:                             428.4
Df Model:                         236                                         
Covariance Type:            nonrobust                                         
                                               coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------

In [16]:
# Retrieve the coefficient values and p-values and store them in a dataframe
olsmod = pd.DataFrame(olsmodel1.params, columns = ['coef'])

olsmod['pval'] = olsmodel1.pvalues

In [17]:
# Sort by p-value (descending) and keep only the significant coefficients (pval <= 0.05)

olsmod = olsmod.sort_values(by = "pval", ascending = False)

pval_filter = olsmod['pval']<= 0.05

olsmod[pval_filter]

| | coef | pval |
|---|---|---|
| Name_Brand&Model_Mahindra Scorpio | 4.379503e-01 | 4.644124e-02 |
| Name_Brand&Model_Chevrolet Beat | -4.333498e-01 | 4.584245e-02 |
| Name_Brand&Model_Maruti Alto | -4.313793e-01 | 4.548050e-02 |
| Name_Brand&Model_Volkswagen Beetle | 1.015055e-12 | 4.177178e-02 |
| Name_Brand&Model_Audi RS5 | 6.294938e-01 | 4.097310e-02 |
| ... | ... | ... |
| Location_Kolkata | -1.794738e-01 | 2.011904e-21 |
| Power | 2.561649e-03 | 2.243439e-24 |
| kilometers_driven_log | -6.225758e-02 | 2.391834e-25 |
| Year | 9.000227e-02 | 0.000000e+00 |


In [18]:
# We are looking for the overall significant variables

pval_filter = olsmod['pval']<= 0.05
imp_vars = olsmod[pval_filter].index.tolist()

# Map each significant (possibly one-hot encoded) column back to the original column(s) in cars_data
# Note: this is a substring match on the first part of the name, so 'Name' is also picked up through 'Name_Brand&Model'
sig_var = []
for col in imp_vars:
    first_part = col.split('_')[0]
    for c in cars_data.columns:
        if first_part in c and c not in sig_var :
            sig_var.append(c)

                
start = '\033[1m'   # ANSI code: bold on
end = '\033[0m'     # ANSI code: reset formatting
print(start+ 'Most overall significant categorical variables of LINEAR REGRESSION  are ' +end,':\n', sig_var)

Most overall significant categorical variables of LINEAR REGRESSION  are  :
 ['Name', 'Name_Brand&Model', 'Location', 'Engine', 'New_price', 'Owner_Type', 'Transmission', 'Power', 'kilometers_driven_log', 'Year']


**Build Ridge / Lasso Regression similar to Linear Regression:**<br>

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [19]:
# Import Ridge/ Lasso Regression from sklearn
from sklearn.linear_model import Ridge

In [20]:
# Create a Ridge regression model
rr = Ridge()

In [21]:
# Fit Ridge regression model
rr.fit(X_train, y_train['price_log']) 

Ridge()

In [22]:
# Get score of the model
RR_score = get_model_score(rr)

R-squared on training set :  0.8972426365657673
R-squared on test set :  0.8723792250086664
RMSE on training set :  3.443932509361351
RMSE on test set :  3.8232114825922547


**Observations from results: _____**
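
The heading mentions Lasso, but only Ridge is fitted above. The following is a minimal sketch, on synthetic data with assumed `alpha` values, of the main practical difference: Lasso's L1 penalty can set coefficients exactly to zero, while Ridge's L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of ten features drive the target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# alpha values are illustrative assumptions, not tuned
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", ridge_zeros)
print("Lasso zero coefficients:", lasso_zeros)
```

In this notebook the same could be tried with `Lasso(alpha=...).fit(X_train, y_train['price_log'])` followed by `get_model_score`. A suitable `alpha` for a log-scale target would need to be found by experiment; no value is suggested here.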

### **Decision Tree** 

https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

In [23]:
# Import Decision tree for Regression from sklearn
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

In [24]:
# Create a decision tree regression model, use random_state = 1
dt = DecisionTreeRegressor(random_state = 1)

In [25]:
# Fit decision tree regression model
dt.fit(X_train, y_train['price_log'])

DecisionTreeRegressor(random_state=1)

In [26]:
# Get score of the model
DT_score = get_model_score(dt)

R-squared on training set :  0.9999575675672687
R-squared on test set :  0.809614557440355
RMSE on training set :  0.06998373573769759
RMSE on test set :  4.669651843952255


**Observations from results: _____**

Print the importance of features in the tree building. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
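
Because the categorical features are one-hot encoded, the importance of one original column is spread over many dummy columns. The sketch below, on synthetic data with hypothetical values, shows that the importances sum to 1 and how they could be summed back to the source column.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with one numeric and one categorical feature (hypothetical values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Power": rng.uniform(50, 200, 300),
    "Fuel_Type": rng.choice(["CNG", "Diesel", "Petrol"], 300),
})
target = 0.01 * demo["Power"] + (demo["Fuel_Type"] == "Diesel") * 0.5

# dtype=float keeps the dummy columns numeric
X_demo = pd.get_dummies(demo, drop_first=True, dtype=float)
tree_demo = DecisionTreeRegressor(random_state=1).fit(X_demo, target)

imp = pd.Series(tree_demo.feature_importances_, index=X_demo.columns)

# Map each encoded column back to its source column, then sum per source
source = {c: ("Fuel_Type" if c.startswith("Fuel_Type_") else c) for c in X_demo.columns}
grouped = imp.groupby(source).sum().sort_values(ascending=False)
print(grouped)
```

For the notebook's data, the same grouping idea would apply to the `Location_`, `Fuel_Type_`, `Owner_Type_`, and `Name_Brand&Model_` prefixes.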


In [27]:
print(pd.DataFrame(dt.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                          Imp
Power                                0.650564
Year                                 0.191724
New_price                            0.063528
kilometers_driven_log                0.019576
Engine                               0.009430
...                                       ...
Name_Brand&Model_Mahindra Ssangyong  0.000000
Name_Brand&Model_Mahindra Quanto     0.000000
Name_Brand&Model_Mahindra NuvoSport  0.000000
Name_Brand&Model_Mahindra Logan      0.000000
Name_Brand&Model_Volvo XC90          0.000000

[247 rows x 1 columns]


**Observations and insights: _____**

In [28]:
# Another way: index by X.columns, which has the same columns as X_train
print(pd.DataFrame(dt.feature_importances_, index = X.columns, columns = ['Imp']).sort_values(by = 'Imp', ascending = False))

                                          Imp
Power                                0.650564
Year                                 0.191724
New_price                            0.063528
kilometers_driven_log                0.019576
Engine                               0.009430
...                                       ...
Name_Brand&Model_Mahindra Ssangyong  0.000000
Name_Brand&Model_Mahindra Quanto     0.000000
Name_Brand&Model_Mahindra NuvoSport  0.000000
Name_Brand&Model_Mahindra Logan      0.000000
Name_Brand&Model_Volvo XC90          0.000000

[247 rows x 1 columns]


### **Random Forest**

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [29]:
# Import Randomforest for Regression from sklearn
from sklearn.ensemble import RandomForestRegressor

In [30]:
# Create a Randomforest regression model 
rf = RandomForestRegressor(random_state = 1)

In [31]:
# Fit Randomforest regression model
rf.fit(X_train, y_train['price_log'])

RandomForestRegressor(random_state=1)

In [32]:
# Get score of the model
RF_score = get_model_score(rf)

R-squared on training set :  0.974278609041617
R-squared on test set :  0.8584377778154662
RMSE on training set :  1.7230400850261152
RMSE on test set :  4.026626234679523


**Observations and insights: _____**

**Feature Importance**

In [33]:
# Print important features similar to decision trees
print(pd.DataFrame(rf.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                            Imp
Power                                  0.638971
Year                                   0.190322
New_price                              0.066911
kilometers_driven_log                  0.019560
Engine                                 0.019321
...                                         ...
Name_Brand&Model_Toyota Prius          0.000000
Name_Brand&Model_Mahindra NuvoSport    0.000000
Name_Brand&Model_Lamborghini Gallardo  0.000000
Name_Brand&Model_Jaguar F              0.000000
Name_Brand&Model_Mahindra E            0.000000

[247 rows x 1 columns]


**Observations and insights: _____**

### **Hyperparameter Tuning: Decision Tree**

In [34]:
# Choose the type of estimator 
dtree_tuned = DecisionTreeRegressor(random_state = 1)

# Grid of parameters to choose from
# Check the documentation for all the parameters that the model takes and play with those
parameters = {'max_depth': [5,7,9],
             #'min_samples_split': [3,5,7],
             'min_samples_leaf': [5,10,15],
             #'max_features':['log2','sqrt']
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(r2_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring = scorer, cv = 10)

# Fitting the grid search on the train data
grid_obj = grid_obj.fit(X_train, y_train['price_log'])

# Set the model to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
dtree_tuned.fit(X_train, y_train['price_log'])


DecisionTreeRegressor(max_depth=9, min_samples_leaf=10, random_state=1)
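
A fitted `GridSearchCV` object also exposes the winning parameters and the mean cross-validated score. The sketch below uses synthetic data and an assumed small grid. It passes the built-in scorer name `"r2"`, which serves the same purpose as `make_scorer(r2_score)`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = np.sin(6 * X_demo[:, 0]) + 0.1 * rng.normal(size=200)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]},
    scoring="r2",  # built-in scorer name for R-squared
    cv=5,
)
grid.fit(X_demo, y_demo)

print(grid.best_params_)            # parameter combination with the best mean CV score
print(round(grid.best_score_, 3))   # mean cross-validated R-squared of that combination

# With the default refit=True, best_estimator_ is already refit on all the data passed to fit
best_tree = grid.best_estimator_
```

In the notebook, `grid_obj.best_params_` and `grid_obj.best_score_` give the corresponding values for the tuned decision tree. Since `refit=True` is the default, the extra `dtree_tuned.fit(...)` call is harmless but not strictly needed.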

In [35]:
# Get score of the dtree_tuned
DTTuned_score = get_model_score(dtree_tuned)


R-squared on training set :  0.878447609425652
R-squared on test set :  0.8043642505805639
RMSE on training set :  3.745673410078324
RMSE on test set :  4.733602026962086


**Observations and insights: _____**

**Feature Importance**

In [36]:
# Print important features of tuned decision tree similar to decision trees
print(pd.DataFrame(dtree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                       Imp
Power                             0.704142
Year                              0.202673
New_price                         0.063024
Name_Brand&Model_Honda Accord     0.006707
kilometers_driven_log             0.006249
...                                    ...
Name_Brand&Model_Hyundai Accent   0.000000
Name_Brand&Model_Hyundai Creta    0.000000
Name_Brand&Model_Hyundai EON      0.000000
Name_Brand&Model_Hyundai Elantra  0.000000
Name_Brand&Model_Volvo XC90       0.000000

[247 rows x 1 columns]


**Observations and insights: _____**

### **Hyperparameter Tuning: Random Forest**

In [37]:
# Choose the type of Regressor
rf_tuned = RandomForestRegressor(random_state=1)

# Define the parameters for Grid to choose from 
# Check the documentation for all the parameters that the model takes and play with those

parameters_rf = {
    'n_estimators': [100,150,200],
    'max_depth': [5,7,9],
    #'min_samples_split': [3,5,7],
    'min_samples_leaf': [5,10,15],
    #'max_features': ['log2','sqrt']
}


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(r2_score)

# Run the grid search
grid_obj2 = GridSearchCV(rf_tuned, parameters_rf, scoring = scorer, cv = 10)

# Fitting the grid search on the train data
grid_obj2 = grid_obj2.fit(X_train, y_train['price_log'])

# Set the model to the best combination of parameters
rf_tuned = grid_obj2.best_estimator_

# Fit the best algorithm to the data
rf_tuned.fit(X_train, y_train['price_log'])



RandomForestRegressor(max_depth=9, min_samples_leaf=5, n_estimators=150,
                      random_state=1)

In [38]:
# Get score of the model
RFTuned_score = get_model_score(rf_tuned)

R-squared on training set :  0.9112519059941248
R-squared on test set :  0.8456841098919237
RMSE on training set :  3.2005725275279557
RMSE on test set :  4.204099220247197


**Observations and insights: _____**

**Feature Importance**

In [39]:
# Print important features of the tuned random forest, similar to decision trees
print(pd.DataFrame(rf_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                          Imp
Power                                0.680058
Year                                 0.199712
New_price                            0.068058
Engine                               0.015600
kilometers_driven_log                0.008219
...                                       ...
Name_Brand&Model_Mahindra Jeep       0.000000
Name_Brand&Model_Mahindra KUV        0.000000
Name_Brand&Model_Mahindra Logan      0.000000
Name_Brand&Model_Mahindra NuvoSport  0.000000
Name_Brand&Model_Volvo XC90          0.000000

[247 rows x 1 columns]


**Observations and insights: ______**

In [40]:
# Defining list of models you have trained
models = [lr, rr, dt, dtree_tuned, rf, rf_tuned]

# Defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train = []
rmse_test = []

# Looping through all the models to get the rmse and r2 scores
for model in models:
    
    # R-squared and RMSE scores (flag = False suppresses printing)
    j = get_model_score(model, False)
    
    r2_train.append(j[0])
    
    r2_test.append(j[1])
    
    rmse_train.append(j[2])
    
    rmse_test.append(j[3])

In [41]:
comparison_frame = pd.DataFrame({'Model':['Linear Regression','Ridge Regression','Decision Tree', 'Decision Tree - Tuned', 'Random Forest', 'Random Forest - Tuned'], 
                                          'r2_train': r2_train,'r2_test': r2_test,
                                          'rmse_train': rmse_train,'rmse_test': rmse_test}) 
comparison_frame

| | Model | r2_train | r2_test | rmse_train | rmse_test |
|---|---|---|---|---|---|
| 0 | Linear Regression | 0.919422 | 0.835704 | 3.049694 | 4.337921 |
| 1 | Ridge Regression | 0.897243 | 0.872379 | 3.443933 | 3.823211 |
| 2 | Decision Tree | 0.999958 | 0.809615 | 0.069984 | 4.669652 |
| 3 | Decision Tree - Tuned | 0.878448 | 0.804364 | 3.745673 | 4.733602 |
| 4 | Random Forest | 0.974279 | 0.858438 | 1.72304 | 4.026626 |
| 5 | Random Forest - Tuned | 0.911252 | 0.845684 | 3.200573 | 4.204099 |


**Observations: _____**
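
One possible way to read the comparison table is to add the gap between train and test R-squared as a rough overfitting indicator and to sort by test RMSE. The scores below are copied from the comparison table above; the `r2_gap` column name is an assumption for illustration.

```python
import pandas as pd

# Scores copied from the comparison table in this notebook
scores = pd.DataFrame({
    "Model": ["Linear Regression", "Ridge Regression", "Decision Tree",
              "Decision Tree - Tuned", "Random Forest", "Random Forest - Tuned"],
    "r2_train": [0.919422, 0.897243, 0.999958, 0.878448, 0.974279, 0.911252],
    "r2_test": [0.835704, 0.872379, 0.809615, 0.804364, 0.858438, 0.845684],
    "rmse_test": [4.337921, 3.823211, 4.669652, 4.733602, 4.026626, 4.204099],
})

# A large train-test gap suggests the model fits the training data much better than unseen data
scores["r2_gap"] = scores["r2_train"] - scores["r2_test"]

ranked = scores.sort_values("rmse_test").reset_index(drop=True)
print(ranked)
```

On a live notebook the same two lines can be applied directly to `comparison_frame` instead of retyping the values.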

**Note:** You can also try some other algorithms such as KNN and compare the model performance with the existing ones.
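
A minimal sketch of a KNN regressor, assuming synthetic data and an untuned `n_neighbors=5`. KNN is distance-based, so the features are standardized first. The pipeline keeps the scaler fitted on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data on two very different scales (hypothetical "power" and "year" columns)
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.uniform(50, 200, 400), rng.uniform(2000, 2020, 400)])
y_demo = 0.01 * X_demo[:, 0] + 0.09 * (X_demo[:, 1] - 2000) + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Scale, then fit KNN; the scaler's statistics come from the training split only
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_tr, y_tr)

knn_r2 = knn.score(X_te, y_te)  # R-squared on the held-out rows
print(round(knn_r2, 3))
```

In the notebook, the same pipeline could be fitted with `knn.fit(X_train, y_train['price_log'])` and passed to `get_model_score(knn)` so that it is scored on the same price scale as the other models.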

### **Insights**

**Refined insights**:
- What are the most meaningful insights from the data relevant to the problem?

**Comparison of various techniques and their relative performance**:
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

**Proposal for the final solution design**:
- What model do you propose to be adopted? Why is this the best solution to adopt?