# Applying Boosting Algorithms

## Description of the Problem

A used-car listing website plans to offer a value-added feature to its customers. Once a customer uploads the details of the car they want to sell, the feature should estimate the price at which the car can be expected to sell. Currently, customers set the resale price based on their own experience or judgement. This results in lost revenue if they list the car too low, or in a delayed or failed sale if they list it too high. The feature will help sellers find buyers sooner and at the right price.

The website has collected past car resale data and plans to use it to build an ML model that estimates the resale price.

## Dataset

The dataset has 12 features for each car, plus a sequence Id and the price at which the car was sold. These are cars that were sold in 2019.

1. Id - Car's id. This is a sequence number.
2. Name - The brand and model of the car.
3. Location - The location in which the car is being sold or is available for purchase.
4. Year - The year or edition of the model.
5. Kilometers_Driven - The total distance driven by the previous owner(s), in km.
6. Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
7. Transmission - The type of transmission used by the car. (Automatic / Manual)
8. Owner_Type - First, Second, Third, or Fourth & Above
9. Mileage - The standard mileage offered by the car company in kmpl or km/kg
10. Engine - The displacement volume of the engine in CC.
11. Power - The maximum power of the engine in bhp.
12. Seats - The number of seats in the car.
13. New_Price - The price of a new car of the same model.
14. Price - The price of the car (target).

### Load Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

np.random.seed(100)

In [None]:
cars_df = pd.read_csv( "https://drive.google.com/uc?export=download&id=10-R6GyVWjt_gjWEFD86mKHDvSWD9lp1z" )

In [None]:
cars_df.sample(5)

Unnamed: 0,index,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price,age,KM_Driven,make,mileage_new,engine_new,power_new
1587,3117,Maruti Zen Estilo LXI BS IV,Mumbai,2009,47000,Petrol,Manual,First,19.0 kmpl,998 CC,67.1 bhp,5.0,,1.46,10,47,maruti,19.0,998.0,67.1
1888,3693,Hyundai Xcent 1.1 CRDi S,Mumbai,2016,63141,Diesel,Manual,Second,24.4 kmpl,1120 CC,71 bhp,5.0,,4.5,3,63,hyundai,24.4,1120.0,71.0
845,1694,Hyundai i10 Sportz AT,Chennai,2013,21000,Petrol,Automatic,First,16.95 kmpl,1197 CC,78.9 bhp,5.0,,3.96,6,21,hyundai,16.95,1197.0,78.9
362,761,Tata Nano Lx BSIV,Chennai,2011,35000,Petrol,Manual,First,25.4 kmpl,624 CC,37.48 bhp,4.0,,1.6,8,35,tata,25.4,624.0,37.48
2858,5562,Maruti Ciaz VDI SHVS,Kochi,2018,17804,Diesel,Manual,Second,28.09 kmpl,1248 CC,88.5 bhp,5.0,,7.97,1,17,maruti,28.09,1248.0,88.5
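The sample rows suggest how the derived columns (`age`, `KM_Driven`, `make`, `mileage_new`, `engine_new`, `power_new`) relate to the raw ones. The notebook does not show how they were built, so the sketch below is only an inference from the sample: `age` looks like 2019 minus `Year`, `KM_Driven` like `Kilometers_Driven` in whole thousands, `make` like the lowercased first word of `Name`, and the `*_new` columns like the numeric part of the unit strings. The `raw` frame and the `leading_number` helper are hypothetical names used for illustration.

```python
import pandas as pd

# Two rows copied from the sample output above (raw columns only).
raw = pd.DataFrame({
    "Name": ["Maruti Zen Estilo LXI BS IV", "Hyundai Xcent 1.1 CRDi S"],
    "Year": [2009, 2016],
    "Kilometers_Driven": [47000, 63141],
    "Mileage": ["19.0 kmpl", "24.4 kmpl"],
    "Engine": ["998 CC", "1120 CC"],
    "Power": ["67.1 bhp", "71 bhp"],
})


def leading_number(column):
    """Hypothetical helper: parse the number before the unit, e.g. '998 CC' -> 998.0."""
    return pd.to_numeric(column.str.split().str[0], errors="coerce").astype(float)


# Assumed derivations, inferred from the sample rows rather than from the author's code.
derived = raw.assign(
    age=2019 - raw["Year"],                        # cars were sold in 2019
    KM_Driven=raw["Kilometers_Driven"] // 1000,    # whole thousands of km
    make=raw["Name"].str.split().str[0].str.lower(),
    mileage_new=leading_number(raw["Mileage"]),
    engine_new=leading_number(raw["Engine"]),
    power_new=leading_number(raw["Power"]),
)
print(derived[["age", "KM_Driven", "make", "mileage_new", "engine_new", "power_new"]])
```

Using `errors="coerce"` turns unparseable strings into NaN instead of raising, which would then need imputation like `Seats`.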


In [None]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3092 entries, 0 to 3091
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              3092 non-null   int64  
 1   Name               3092 non-null   object 
 2   Location           3092 non-null   object 
 3   Year               3092 non-null   int64  
 4   Kilometers_Driven  3092 non-null   int64  
 5   Fuel_Type          3092 non-null   object 
 6   Transmission       3092 non-null   object 
 7   Owner_Type         3092 non-null   object 
 8   Mileage            3092 non-null   object 
 9   Engine             3092 non-null   object 
 10  Power              3092 non-null   object 
 11  Seats              3091 non-null   float64
 12  New_Price          411 non-null    object 
 13  Price              3092 non-null   float64
 14  age                3092 non-null   int64  
 15  KM_Driven          3092 non-null   int64  
 16  make               3092 

### Feature Set Selection

In [None]:
cars_df.columns

Index(['index', 'Name', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
       'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats',
       'New_Price', 'Price', 'age', 'KM_Driven', 'make', 'mileage_new',
       'engine_new', 'power_new'],
      dtype='object')

In [None]:
x_features = ['KM_Driven', 'Fuel_Type', 'age',
              'Transmission', 'Owner_Type', 'Seats', 
              'make', 'mileage_new', 'engine_new', 
              'power_new', 'Location']

In [None]:
cat_vars = ['Fuel_Type', 
                'Transmission', 'Owner_Type',
                'make', 'Location']

In [None]:
num_vars = list(set(x_features) - set(cat_vars))

In [None]:
num_vars

['engine_new', 'KM_Driven', 'age', 'power_new', 'Seats', 'mileage_new']
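A set difference does not guarantee any particular order, so the order of `num_vars` can change between runs. If a stable column order matters (for example, when mapping feature importances back to names), one option is a list comprehension that keeps the order of `x_features`. This is a small alternative sketch, not the approach used in the cells above.

```python
x_features = ['KM_Driven', 'Fuel_Type', 'age',
              'Transmission', 'Owner_Type', 'Seats',
              'make', 'mileage_new', 'engine_new',
              'power_new', 'Location']

cat_vars = ['Fuel_Type', 'Transmission', 'Owner_Type', 'make', 'Location']

# Keep every feature that is not categorical, in its original position.
num_vars_ordered = [col for col in x_features if col not in cat_vars]
print(num_vars_ordered)
```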

In [None]:
cars_df[x_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3092 entries, 0 to 3091
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   KM_Driven     3092 non-null   int64  
 1   Fuel_Type     3092 non-null   object 
 2   age           3092 non-null   int64  
 3   Transmission  3092 non-null   object 
 4   Owner_Type    3092 non-null   object 
 5   Seats         3091 non-null   float64
 6   make          3092 non-null   object 
 7   mileage_new   3092 non-null   float64
 8   engine_new    3092 non-null   float64
 9   power_new     3092 non-null   float64
 10  Location      3092 non-null   object 
dtypes: float64(4), int64(2), object(5)
memory usage: 265.8+ KB


### Setting X and y variables

In [None]:
X = cars_df[x_features]
y = cars_df['Price']

### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [None]:
X_train.shape

(2473, 11)

In [None]:
X_test.shape

(619, 11)
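The split sizes follow from the row count: with `train_size = 0.8` and 3092 rows, the training set gets the integer part of 0.8 × 3092 = 2473.6, and the test set gets the remainder. The sketch below reproduces that arithmetic on a placeholder array of the same length (`rows` is a hypothetical stand-in for the real data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with the same number of rows as cars_df.
rows = np.arange(3092)

train_rows, test_rows = train_test_split(rows, train_size=0.8, random_state=80)

print(len(train_rows), len(test_rows))
```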

## Defining Transformation

1. Data imputation for the Seats column
    - Mean imputation
2. Categorical encoding for categorical columns
    - One-hot encoding (OHE)
3. Data scaling
    - Standard scaling
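To make the three transformations concrete, here is a minimal sketch on toy values (the arrays are made up for illustration and are not taken from the dataset). It shows what mean imputation fills in, how `handle_unknown='ignore'` treats a category that was not seen during fitting, and what standard scaling does to a column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Mean imputation: the missing value becomes the mean of 5, 7 and 6.
seats = np.array([[5.0], [7.0], [np.nan], [6.0]])
seats_filled = SimpleImputer(strategy="mean").fit_transform(seats)

# 2. One-hot encoding: a category unseen at fit time becomes an all-zero row.
fuel_train = np.array([["Petrol"], ["Diesel"], ["Petrol"]])
encoder = OneHotEncoder(handle_unknown="ignore").fit(fuel_train)
unseen = encoder.transform(np.array([["CNG"]])).toarray()

# 3. Standard scaling: the column ends up with mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(seats_filled)

print(seats_filled.ravel(), unseen, scaled.ravel())
```

The all-zero row is why `handle_unknown='ignore'` is convenient for cross-validation: a `make` or `Location` that appears only in a validation fold does not raise an error.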

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputed_num_vars = ['Seats']

In [None]:
imputed_num_vars

['Seats']

In [None]:
non_imputed_num_vars = list(set(num_vars) - set(imputed_num_vars))

In [None]:
non_imputed_num_vars

['engine_new', 'KM_Driven', 'age', 'power_new', 'mileage_new']

In [None]:
mean_imputer = SimpleImputer(strategy='mean')

### Encode Categorical Variables

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')

### Scaling Numerical Vars

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

## Creating Pipelines

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
imputed_num_transformer = Pipeline( steps = [  
        ('imputation', mean_imputer),
        ('scaler', scaler)])

In [None]:
non_imputed_num_transformer = Pipeline( steps = [('scaler', scaler)])

In [None]:
cat_transformer = Pipeline( steps = [('ohencoder', ohe_encoder)])

In [None]:
preprocessor = ColumnTransformer(
    transformers=[  
        ('num_imputed', imputed_num_transformer, imputed_num_vars),
        ('num_not_imputed', non_imputed_num_transformer, non_imputed_num_vars),
        ('catvars', cat_transformer, cat_vars)])
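A `ColumnTransformer` sends each column group through its own transformer and concatenates the results side by side. The sketch below rebuilds a reduced version of the preprocessor on a toy frame to show the resulting shape. The names `toy` and `toy_preprocessor` are hypothetical, and `sparse_threshold=0` is set only so the output is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "age": [10, 3, 6, 8],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num_imputed",
         Pipeline(steps=[("imputation", SimpleImputer(strategy="mean")),
                         ("scaler", StandardScaler())]),
         ["Seats"]),
        ("num_not_imputed", StandardScaler(), ["age"]),
        ("catvars", OneHotEncoder(handle_unknown="ignore"), ["Fuel_Type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy)

# 1 column for Seats + 1 for age + 3 one-hot columns (CNG, Diesel, Petrol).
print(transformed.shape)
```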

### Decision Tree Regressor


In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree_regressor = DecisionTreeRegressor(max_depth=7)

In [None]:
reg = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', tree_regressor)])           

In [None]:
reg.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_imputed',
                                                  Pipeline(steps=[('imputation',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Seats']),
                                                 ('num_not_imputed',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['engine_new', 'KM_Driven',
                                                   'age', 'power_new',
                                                   'mileage_new']),
                                                 ('catvars

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score( reg,
                          X_train,
                          y_train,
                          cv = 10,
                          scoring = 'r2')

In [None]:
scores

array([0.79511729, 0.74005365, 0.78425536, 0.79843413, 0.78326706,
       0.70744353, 0.75331209, 0.75814041, 0.7863509 , 0.78538164])

In [None]:
scores.mean()

0.7691756070413229

In [None]:
scores.std()

0.027468146075534966

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
np.sqrt(mean_squared_error(y_test, reg.predict(X_test)))

1.0355771521768948
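The cross-validation scores above are R² values, while the test-set figure is an RMSE, so the two are not on the same scale. If a like-for-like comparison is wanted, scikit-learn can cross-validate RMSE directly through the `neg_root_mean_squared_error` scorer. The sketch below uses synthetic data from `make_regression` as a stand-in for the car data, so the numbers it prints say nothing about the real model.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the car dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=6,
                                 noise=5.0, random_state=0)

neg_rmse = cross_val_score(DecisionTreeRegressor(max_depth=7, random_state=0),
                           X_demo,
                           y_demo,
                           cv=10,
                           scoring="neg_root_mean_squared_error")

# scikit-learn scorers are "higher is better", so RMSE comes back negated.
rmse_scores = -neg_rmse
print(rmse_scores.mean(), rmse_scores.std())
```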

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gboost_regressor = GradientBoostingRegressor(n_estimators=100,learning_rate = 0.1)
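For intuition about what `n_estimators` and `learning_rate` control, gradient boosting with squared loss can be sketched by hand: start from the mean, then repeatedly fit a small tree to the current residuals and add a shrunken copy of its prediction. This is a simplified illustration on synthetic data, not the internals of `GradientBoostingRegressor`, and the tree depth of 3 is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 100

# Stage 0: predict the mean of the target for every row.
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_estimators):
    # For squared loss, the negative gradient is simply the residual.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3).fit(X_demo, residuals)
    # Add only a fraction of the correction: this is the shrinkage step.
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(initial_mse, final_mse)
```

A smaller learning rate makes each step more cautious, so it usually needs more trees to reach the same training error.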

In [None]:
gboost_reg = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', gboost_regressor)])           

In [None]:
gboost_reg.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_imputed',
                                                  Pipeline(steps=[('imputation',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Seats']),
                                                 ('num_not_imputed',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['engine_new', 'KM_Driven',
                                                   'age', 'power_new',
                                                   'mileage_new']),
                                                 ('catvars

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score( gboost_reg,
                          X_train,
                          y_train,
                          cv = 10,
                          scoring = 'r2')

In [None]:
scores

array([0.8737737 , 0.85253307, 0.8775764 , 0.89593407, 0.84973782,
       0.86631856, 0.86473147, 0.85039147, 0.85882263, 0.8743666 ])

In [None]:
scores.mean()

0.8664185774718696

In [None]:
scores.std()

0.01378915625357891

In [None]:
np.sqrt(mean_squared_error(y_test, gboost_reg.predict(X_test)))

0.7819054786959797

### XGBoost


#### objective
- Default = reg:squarederror (called reg:linear in older XGBoost releases, where that name is now deprecated)
- It defines the loss function to be minimized. The most commonly used values are:
- reg:squarederror: regression with squared loss.
- reg:squaredlogerror: regression with squared log loss, $\frac{1}{2}\left[\log(pred + 1) - \log(label + 1)\right]^2$. All input labels must be greater than -1.
- binary:logistic: logistic regression for binary classification; outputs a probability.
- multi:softmax: multiclass classification using the softmax objective; you also need to set num_class (the number of classes).
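The difference between the two regression objectives is easier to see with numbers. The sketch below evaluates both losses for a single prediction. The function names are hypothetical, and the factor of 1/2 on the squared loss is an assumption made to match the form of the squared log loss.

```python
import numpy as np


def squared_error(pred, label):
    """Squared loss for one example, with an assumed 1/2 factor."""
    return 0.5 * (pred - label) ** 2


def squared_log_error(pred, label):
    """Squared log loss for one example; requires pred and label > -1."""
    return 0.5 * (np.log1p(pred) - np.log1p(label)) ** 2


# Predicting 20 when the true price is 10.
se = squared_error(20.0, 10.0)
sle = squared_log_error(20.0, 10.0)
print(se, sle)
```

Because the log compresses large values, the squared log loss reacts far less to a large absolute miss on an expensive car, which can matter when prices are heavily right-skewed.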

In [None]:
from xgboost import XGBRegressor

In [None]:
params = { "n_estimators": 400,
           "max_depth": 4,
           "objective": 'reg:squarederror' }

xgb_regressor = XGBRegressor(**params)

In [None]:
xgb_reg = Pipeline(steps=[('preprocessor', preprocessor),
                          ('regressor', xgb_regressor)])           

In [None]:
xgb_reg.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_imputed',
                                                  Pipeline(steps=[('imputation',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Seats']),
                                                 ('num_not_imputed',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['engine_new', 'KM_Driven',
                                                   'age', 'power_new',
                                                   'mileage_new']),
                                                 ('catvars

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score( xgb_reg,
                          X_train,
                          y_train,
                          cv = 10,
                          scoring = 'r2')

In [None]:
scores

array([0.89895372, 0.88198918, 0.90409322, 0.91557371, 0.91022841,
       0.87555769, 0.91763234, 0.88597505, 0.89580965, 0.90969536])

In [None]:
scores.mean()

0.8995508342555605

In [None]:
scores.std()

0.013798427799221571

In [None]:
np.sqrt(mean_squared_error(y_test,xgb_reg.predict(X_test)))

0.6689574400209832

### XGBoost: Parameter Tuning


#### subsample
- Default = 1
- It denotes the fraction of observations to be randomly sampled for each tree.
- Lower values make the algorithm more conservative and help prevent overfitting, but values that are too small may lead to underfitting.
- Setting it to 0.5 means that XGBoost randomly samples half of the training data before growing each tree, which helps reduce overfitting.
- Subsampling occurs once in every boosting iteration.

#### colsample_bytree
- Default = 1
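- It belongs to a family of parameters for subsampling columns: colsample_bytree, colsample_bylevel, and colsample_bynode.
- colsample_bytree is the fraction of columns sampled once for each tree that is constructed.
- All colsample_by* parameters have a range of (0, 1], a default value of 1, and specify the fraction of columns to be subsampled.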

#### lambda
- Default = 1
- This is used to handle the regularization part of XGBoost.
- L2 regularization term on weights (analogous to Ridge regression).
- Increasing this value makes the model more conservative.

#### eta
- Default = 0.3
- This is the learning rate of the algorithm.
- It is the step size shrinkage applied in each update to prevent overfitting.
- It makes the model more conservative by shrinking the weights on each step.
- Range of eta is [0,1].

#### gamma
- Default = 0
- A node is split only when the resulting split gives a positive reduction in the loss function.
- Gamma specifies the minimum loss reduction required to make a split.
- The larger the gamma value, the more conservative the algorithm.
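Parameters like these are usually chosen by searching over a grid with cross-validation rather than by hand. The sketch below shows the mechanics with `GridSearchCV` on a pipeline, using the `regressor__` prefix to reach the model step. To keep it runnable with scikit-learn alone, it swaps in `GradientBoostingRegressor`, whose `subsample` and `learning_rate` play roles similar to XGBoost's `subsample` and `eta`. The data is synthetic and the grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the car dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(n_estimators=50, random_state=0)),
])

# "<step name>__<parameter>" addresses a parameter inside a pipeline step.
param_grid = {
    "regressor__subsample": [0.5, 0.75, 1.0],
    "regressor__learning_rate": [0.05, 0.1, 0.3],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring="r2")
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

The same pattern should carry over to a pipeline whose `regressor` step is an `XGBRegressor`, with the grid keys changed to that estimator's parameter names.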

### XGBoost: Classification Problems

#### scale_pos_weight
- Default = 1
- It controls the balance of positive and negative weights.
- It is useful for imbalanced classes.
- A value greater than 1 gives positive examples more weight, which can help the model converge faster when the positive class is rare.
- A typical value to consider: sum(negative instances) / sum(positive instances).
- It applies to binary classification only, so the regression model in this notebook does not use it.
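The typical value is a simple ratio of class counts. A minimal sketch, using a made-up label vector with 90 negatives and 10 positives:

```python
import numpy as np

# Hypothetical binary labels: 90 negative examples, 10 positive examples.
labels = np.array([0] * 90 + [1] * 10)

n_negative = (labels == 0).sum()
n_positive = (labels == 1).sum()

# Each positive example is weighted as much as nine negative ones.
scale_pos_weight = n_negative / n_positive
print(scale_pos_weight)
```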

In [None]:
from xgboost import XGBRegressor

In [None]:
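# Add row and column subsampling and a strong L2 penalty to the earlier settings.
params = { "n_estimators": 400,
           "max_depth": 5,
           "objective": 'reg:squarederror',
           "colsample_bytree": 0.8,   # each tree sees 80% of the columns
           "subsample": 0.75,         # each tree sees 75% of the rows
           "lambda": 100}             # L2 penalty; named reg_lambda in the scikit-learn wrapper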

xgb_regressor = XGBRegressor(**params)

In [None]:
xgb_reg = Pipeline(steps=[('preprocessor', preprocessor),
                          ('regressor', xgb_regressor)])           

In [None]:
xgb_reg.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_imputed',
                                                  Pipeline(steps=[('imputation',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Seats']),
                                                 ('num_not_imputed',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['engine_new', 'KM_Driven',
                                                   'age', 'power_new',
                                                   'mileage_new']),
                                                 ('catvars

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score( xgb_reg,
                          X_train,
                          y_train,
                          cv = 10,
                          scoring = 'r2')

In [None]:
scores

array([0.90833599, 0.89084002, 0.9105688 , 0.92241302, 0.9047683 ,
       0.89198298, 0.90333138, 0.88314297, 0.89971622, 0.92842068])

In [None]:
scores.mean()

0.904352036484334

In [None]:
scores.std()

0.013297055146125203

In [None]:
np.sqrt(mean_squared_error(y_test,xgb_reg.predict(X_test)))

0.6403870236671778
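To put the four test-set RMSE values reported in this notebook side by side, the short sketch below computes how much each model reduces the error relative to the single decision tree. The dictionary keys are labels chosen here for illustration. The values are copied from the outputs above.

```python
# Test-set RMSE values reported in the cells above.
test_rmse = {
    "decision_tree": 1.0355771521768948,
    "gradient_boosting": 0.7819054786959797,
    "xgboost_first": 0.6689574400209832,
    "xgboost_tuned": 0.6403870236671778,
}

baseline = test_rmse["decision_tree"]

# Percentage reduction in RMSE relative to the single decision tree.
reduction_pct = {name: 100 * (baseline - rmse) / baseline
                 for name, rmse in test_rmse.items()}

for name, pct in reduction_pct.items():
    print(f"{name}: {pct:.1f}% lower RMSE than the single tree")
```

These figures come from a single train/test split, so small differences between the two XGBoost runs should be read with caution.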