<a href="https://colab.research.google.com/github/rajasreekalli/Data-Visualization/blob/main/Intro_to_Gradient_Boosting_Assignment_(Core).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Gradient Boosting

![gradient boosting image](https://media.geeksforgeeks.org/wp-content/uploads/20200721214745/gradientboosting.PNG)

Image thanks to [Geeks for Geeks](https://www.geeksforgeeks.org/ml-gradient-boosting/)

In this assignment you will:
1. import and prepare a dataset for modeling
2. test and evaluate 3 different boosting models and compare the fit times of each.
3. tune the hyperparameters of the best model to reduce overfitting and improve performance.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, plot_confusion_matrix
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In this assignment you will be working with census data.  Your goal is to predict whether a person will make more or less than $50k per year in income.

The data is available [here](https://drive.google.com/file/d/1drlRzq-lIY7rxQnvv_3fsxfIfLsjQ4A-/view?usp=sharing)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Prepare your dataset for modeling.

Remember to: 
1. Check for missing data, bad data, and duplicates.
2. Check your target class balance.
3. Perform your validation split
4. Create a preprocessing pipeline to use with your models.
5. Fit and evaluate your models using pipelines

In [3]:
df = pd.read_csv('/content/drive/MyDrive/census_income.csv')
df.head(10)

Unnamed: 0.1,Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-class
0,0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,2,38,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,3,53,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,4,28,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,5,37,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,6,49,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,7,52,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,8,31,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,9,42,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [4]:
df = df.drop(columns = 'Unnamed: 0')

In [5]:
df.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income-class      object
dtype: object

In [6]:
df.duplicated().sum()

3465

In [7]:
df = df.drop_duplicates()

In [8]:
df.duplicated().sum()

0

In [9]:
df.isnull().sum()

age               0
workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income-class      0
dtype: int64

From on the given dataset i am find out some question marks. So i can removed it.

In [10]:
df[df.eq('?').any(1)]

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-class
14,40,Private,Assoc-voc,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,?,>50K
27,54,?,Some-college,Married-civ-spouse,?,Husband,Asian-Pac-Islander,Male,0,0,60,South,>50K
38,31,Private,Some-college,Married-civ-spouse,Sales,Husband,White,Male,0,0,38,?,>50K
51,18,Private,HS-grad,Never-married,Other-service,Own-child,White,Female,0,0,30,?,<=50K
61,32,?,7th-8th,Married-spouse-absent,?,Not-in-family,White,Male,0,0,40,?,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
32530,35,?,Bachelors,Married-civ-spouse,?,Wife,White,Female,0,0,55,United-States,>50K
32531,30,?,Bachelors,Never-married,?,Not-in-family,Asian-Pac-Islander,Female,0,0,99,United-States,<=50K
32539,71,?,Doctorate,Married-civ-spouse,?,Husband,White,Male,0,0,10,United-States,>50K
32541,41,?,HS-grad,Separated,?,Not-in-family,Black,Female,0,0,32,United-States,<=50K


In [11]:
df['workclass'].value_counts()

Private             19621
Self-emp-not-inc     2473
Local-gov            2040
?                    1632
State-gov            1272
Self-emp-inc         1091
Federal-gov           946
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [12]:
df['occupation'].value_counts()

Prof-specialty       3885
Exec-managerial      3719
Adm-clerical         3340
Craft-repair         3298
Sales                3270
Other-service        2996
Machine-op-inspct    1702
?                    1639
Transport-moving     1445
Handlers-cleaners    1179
Farming-fishing       962
Tech-support          874
Protective-serv       631
Priv-house-serv       147
Armed-Forces            9
Name: occupation, dtype: int64

In [13]:
df['native-country'].value_counts()

United-States                 25721
Mexico                          633
?                               580
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Japan                            62
Guatemala                        62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                      

These three columns contains the question marks. So i removed it.

In [14]:
#new_df = df.replace(to_replace ='?', value ='null') 

In [15]:
# df = df[df['occupation'] == '?']

In [16]:
df.drop(df.index[df['occupation'] == '?'], inplace=True)

In [17]:
df['occupation'].value_counts()

Prof-specialty       3885
Exec-managerial      3719
Adm-clerical         3340
Craft-repair         3298
Sales                3270
Other-service        2996
Machine-op-inspct    1702
Transport-moving     1445
Handlers-cleaners    1179
Farming-fishing       962
Tech-support          874
Protective-serv       631
Priv-house-serv       147
Armed-Forces            9
Name: occupation, dtype: int64

In [18]:
df['workclass'].value_counts()

Private             19621
Self-emp-not-inc     2473
Local-gov            2040
State-gov            1272
Self-emp-inc         1091
Federal-gov           946
Without-pay            14
Name: workclass, dtype: int64

In [19]:
df.drop(df.index[df['native-country'] == '?'], inplace=True)

In [20]:
df['native-country'].value_counts()

United-States                 24259
Mexico                          600
Philippines                     188
Germany                         128
Puerto-Rico                     109
Canada                          107
India                           100
El-Salvador                     100
Cuba                             92
England                          86
Jamaica                          80
South                            71
Italy                            68
China                            68
Dominican-Republic               67
Vietnam                          64
Guatemala                        61
Japan                            59
Columbia                         56
Poland                           56
Haiti                            42
Iran                             42
Taiwan                           42
Portugal                         34
Nicaragua                        33
Peru                             30
Greece                           29
France                      

So i removed the all the question marks in these columns.

In [21]:
df.head(20)

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-class
0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


I checked it. There is no question marks in these columns.

In [22]:
df['income-class'].value_counts()

<=50K    20024
>50K      6880
Name: income-class, dtype: int64

In [23]:
X = df.drop(columns=['income-class'])
y = df['income-class']

In [24]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [25]:
X_train.shape

(20178, 12)

In [26]:
y_train.shape

(20178,)

In [27]:
cat_select = make_column_selector(dtype_include='object')
num_select = make_column_selector(dtype_include='number')

In [28]:
encoder = OneHotEncoder(sparse='False',handle_unknown='ignore')
scalar = StandardScaler()

In [29]:
ohe_tupe = (encoder,cat_select)
num_tuple = (scalar,num_select)

In [30]:
column_transform = make_column_transformer(ohe_tupe,num_tuple,remainder='passthrough')

# eXtreme Gradient Boosting
We are going to compare both metrics and fit times for our models.  Notice the 'cell magic' in the top of the cell below.  By putting `%%time` at the top of a notebook cell, we can tell it to output how long that cell took to run.  We can use this to compare the speed of each of our different models.  Fit times can be very important for models in deployment, especially with very large dataset and/or many features.

Instantiate an eXtreme Gradient Boosting Classifier (XGBClassifier) below, fit it, and print out a classification report.  Take note of the accuracy, recall, precision, and f1-score, as well as the run time of the cell to compare to our next models.

In [31]:
%%time
xgb = XGBClassifier()
pipe = make_pipeline(column_transform,xgb)
pipe.fit(X_train,y_train)
train_pred = pipe.predict(X_train)
test_pred = pipe.predict(X_test)

CPU times: user 1.36 s, sys: 33.2 ms, total: 1.39 s
Wall time: 1.45 s


In [32]:
print('Classification Report for Training Set')
train_report = classification_report(y_train, train_pred)
print(train_report)
print('Classification Report for Test Set')
test_report = classification_report(y_test, test_pred)
print(test_report)

Classification Report for Training Set
              precision    recall  f1-score   support

       <=50K       0.87      0.95      0.91     15020
        >50K       0.81      0.59      0.68      5158

    accuracy                           0.86     20178
   macro avg       0.84      0.77      0.80     20178
weighted avg       0.86      0.86      0.85     20178

Classification Report for Test Set
              precision    recall  f1-score   support

       <=50K       0.87      0.95      0.91      5004
        >50K       0.80      0.58      0.67      1722

    accuracy                           0.85      6726
   macro avg       0.83      0.77      0.79      6726
weighted avg       0.85      0.85      0.85      6726



In [33]:
pipe.get_params()

{'columntransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('onehotencoder',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse='False'),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f3e20affa90>),
                                 ('standardscaler', StandardScaler(),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f3e20afff10>)]),
 'columntransformer__n_jobs': None,
 'columntransformer__onehotencoder': OneHotEncoder(handle_unknown='ignore', sparse='False'),
 'columntransformer__onehotencoder__categories': 'auto',
 'columntransformer__onehotencoder__drop': None,
 'columntransformer__onehotencoder__dtype': numpy.float64,
 'columntransformer__onehotencoder__handle_unknown': 'ignore',
 'columntransformer__onehotencoder__sparse': 'False',
 'columntransformer__re

In [34]:
%%time
param_test = {'xgbclassifier__n_estimators':[2,5,7,8,9,10,15,20],
               'xgbclassifier__max_depth':range(5,10,20),
                'xgbclassifier__learning_rate':[0.1,1,10]}
grid = GridSearchCV(pipe,param_grid=param_test);
grid.fit(X_train,y_train)
best_param = grid.best_params_
best_param

CPU times: user 37.2 s, sys: 645 ms, total: 37.8 s
Wall time: 39.1 s


In [35]:
print('Testing accuracy:', grid.score(X_test, y_test))
print('Training accuracy:',grid.score(X_train,y_train))

Testing accuracy: 0.8642581028843295
Training accuracy: 0.8755079789870156


We can see there is only slight increase in the accuracy than without tuning with best parameters but computation time has been increased.

Which target class is your model better at predicting?  Is it significantly overfit?

# More Gradient Boosting

Now fit and evaluate a Light Gradient Boosting Machine and a the Scikit Learn (sklearn) gradient boost model.  Remember to use the `%%time` cell magic command to get the run time.

## LightGBM

In [37]:
%%time
lgm = LGBMClassifier()
pipe2 = make_pipeline(column_transform,lgm)
pipe2.fit(X_train,y_train)
test2_pred  = pipe2.predict(X_test)
train2_pred = pipe2.predict(X_train)

CPU times: user 1.01 s, sys: 18.8 ms, total: 1.03 s
Wall time: 1.26 s


In [38]:
print('Classification Report for Training Set')
train_report = classification_report(y_train, train2_pred)
print(train_report)
print('Classification Report for Testin set')
test_report = classification_report(y_test,test2_pred)
print(test_report)

Classification Report for Training Set
              precision    recall  f1-score   support

       <=50K       0.90      0.94      0.92     15020
        >50K       0.81      0.71      0.75      5158

    accuracy                           0.88     20178
   macro avg       0.86      0.82      0.84     20178
weighted avg       0.88      0.88      0.88     20178

Classification Report for Testin set
              precision    recall  f1-score   support

       <=50K       0.89      0.94      0.91      5004
        >50K       0.78      0.66      0.72      1722

    accuracy                           0.87      6726
   macro avg       0.84      0.80      0.82      6726
weighted avg       0.86      0.87      0.86      6726



In [39]:
pipe2.get_params()

{'columntransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('onehotencoder',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse='False'),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f3e20affa90>),
                                 ('standardscaler', StandardScaler(),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f3e20afff10>)]),
 'columntransformer__n_jobs': None,
 'columntransformer__onehotencoder': OneHotEncoder(handle_unknown='ignore', sparse='False'),
 'columntransformer__onehotencoder__categories': 'auto',
 'columntransformer__onehotencoder__drop': None,
 'columntransformer__onehotencoder__dtype': numpy.float64,
 'columntransformer__onehotencoder__handle_unknown': 'ignore',
 'columntransformer__onehotencoder__sparse': 'False',
 'columntransformer__re

In [40]:
%%time
param_test = {'lgbmclassifier__n_estimators':[2,5,8,9,10,15,20,50,100],
               'lgbmclassifier__max_depth':range(8,10,15),
                'lgbmclassifier__learning_rate':[0.1,1,5],'lgbmclassifier__num_leaves': [31,35,40],'lgbmclassifier__subsample_freq':[ 0,1,2]}
grid1 = GridSearchCV(pipe2,param_grid=param_test)
grid1.fit(X_train,y_train)
best_param = grid1.best_params_
best_param

CPU times: user 3min 34s, sys: 2.39 s, total: 3min 36s
Wall time: 3min 40s


In [41]:
print('Testing accuracy:', grid1.score(X_test, y_test))
print('Training accuracy:',grid1.score(X_train,y_train))

Testing accuracy: 0.8642581028843295
Training accuracy: 0.8817028446823273


After using LGM classifier the scores are relatively same, there is a slight improvement of the score.

## GradientBoostingClassifier

In [42]:
%%time
gbc = GradientBoostingClassifier()
pipe3 = make_pipeline(column_transform,gbc)
pipe3.fit(X_train,y_train)
test3_pred = pipe3.predict(X_test)
train3_pred = pipe3.predict(X_train)

CPU times: user 2.43 s, sys: 15.9 ms, total: 2.45 s
Wall time: 2.45 s


In [43]:
print('Classification Report for Training Set')
train_report = classification_report(y_train, train3_pred)
print(train_report)
print('Classification Report for Testin set')
test_report = classification_report(y_test,test3_pred)
print(test_report)

Classification Report for Training Set
              precision    recall  f1-score   support

       <=50K       0.87      0.95      0.91     15020
        >50K       0.81      0.60      0.69      5158

    accuracy                           0.86     20178
   macro avg       0.84      0.77      0.80     20178
weighted avg       0.86      0.86      0.85     20178

Classification Report for Testin set
              precision    recall  f1-score   support

       <=50K       0.87      0.95      0.91      5004
        >50K       0.80      0.59      0.68      1722

    accuracy                           0.86      6726
   macro avg       0.83      0.77      0.79      6726
weighted avg       0.85      0.86      0.85      6726



In [44]:
pipe3.get_params()

{'columntransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('onehotencoder',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse='False'),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f3e20affa90>),
                                 ('standardscaler', StandardScaler(),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f3e20afff10>)]),
 'columntransformer__n_jobs': None,
 'columntransformer__onehotencoder': OneHotEncoder(handle_unknown='ignore', sparse='False'),
 'columntransformer__onehotencoder__categories': 'auto',
 'columntransformer__onehotencoder__drop': None,
 'columntransformer__onehotencoder__dtype': numpy.float64,
 'columntransformer__onehotencoder__handle_unknown': 'ignore',
 'columntransformer__onehotencoder__sparse': 'False',
 'columntransformer__re

In [45]:
param_test = {'gradientboostingclassifier__n_estimators':[5,20,50,100],
               'gradientboostingclassifier__max_depth':range(5,10,15),
                'gradientboostingclassifier__learning_rate':[0.1,1,10]}
grid3 = GridSearchCV(pipe3,param_grid=param_test,verbose=True);
grid3.fit(X_train,y_train)
best_param = grid.best_params_
best_param

Fitting 5 folds for each of 12 candidates, totalling 60 fits


{'xgbclassifier__learning_rate': 1,
 'xgbclassifier__max_depth': 5,
 'xgbclassifier__n_estimators': 20}

In [46]:
print('Testing accuracy:', grid3.score(X_test, y_test))
print('Training accuracy:',grid3.score(X_train,y_train))

Testing accuracy: 0.8660422242045792
Training accuracy: 0.8803151947665775


After doing prediction using best parameters test train accuracy has been increased.


# Tuning Gradient Boosting Models

Tree-based gradient boosting models have a LOT of hyperparameters to tune.  Here are the documentation pages for each of the 3 models you used today:

1. [XGBoost Hyperparameter Documentation](https://xgboost.readthedocs.io/en/latest/parameter.html)
2. [LightGBM Hyperparameter Documentation](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)
3. [Scikit-learn Gradient Boosting Classifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

Choose the model you felt performed the best when comparing multiple metrics and the runtime for fitting, and use GridSearchCV to try at least 2 different values each for 3 different hyper parameters in boosting model you chose.

See if you can create a model with an accuracy between 86 and 90.


# Evaluation

Evaluate your model using a classifiation report and/or a confusion matrix.  Explain in text how your model performed in terms of precision, recall, and it's ability to predict each of the two classes.  Also talk about the benefits or drawbacks of the computation time of that model.

The best score was produced by LGBMClassifier and also with less overfitting  ,so LGBMClassifier model is best for prediction for the data; but LGBMClassifier took more computational time than other models.

# Conclusion

In this assignment you practiced:
1. data cleaning
2. instantiating, fitting, and evaluating boosting models using multiple metrics
3. timing how long it takes a model to fit and comparing run times between multiple models
4. and choosing a final model based on multiple metrics.

