## Day 35 Lecture 1 Assignment

In this assignment, we practice gradient boosting. We load a dataset describing patient survival after breast cancer surgery, fit a gradient boosted classifier to it, and analyze the resulting model.

In [0]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [0]:
# Attributes:
# Age of patient at time of operation (numerical)
# Patient's year of operation (year - 1900, numerical)
# Number of positive axillary nodes detected (numerical)
# Survival status (class attribute)
#  -- 1 = the patient survived 5 years or longer
#  -- 2 = the patient died within 5 years

cols = ['age', 'op_year', 'nodes', 'survival']
cancer = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/haberman.data', names=cols)

In [0]:
cancer.head()

,age,op_year,nodes,survival
0,30,64,1,1
1,30,62,3,1
2,30,65,0,1
3,31,59,2,1
4,31,65,4,1


In [0]:
cancer['survival'].unique()

array([1, 2])

Check for missing data and remove all rows containing missing data

In [0]:
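# answer below:
cancer.info()
# Every column reports 306 non-null values, so dropna() removes nothing here
cancer = cancer.dropna()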


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       306 non-null    int64
 1   op_year   306 non-null    int64
 2   nodes     306 non-null    int64
 3   survival  306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB
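
The missing-data check can also be made explicit by counting nulls per column. The sketch below uses a small toy frame (an assumption for illustration, not the Haberman data) so that the effect of `dropna()` is visible.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative only, not the real dataset)
toy = pd.DataFrame({"age": [30, 31, np.nan], "nodes": [1, 3, 0]})

# Count of missing values in each column
missing_per_column = toy.isna().sum()

# Drop every row that contains at least one missing value
cleaned = toy.dropna()

print(missing_per_column.to_dict(), len(cleaned))
```

On the real data, `cancer.isna().sum()` should report zero for every column, which agrees with the `info()` output above.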


Adjust the target variable so that it has values of either 0 or 1

In [0]:
# Recode so that 1 = survived 5 years or longer, 0 = died within 5 years
y = (cancer["survival"] == 1).astype(int)
y.head()

0    1
1    1
2    1
3    1
4    1
Name: survival, dtype: int64
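
After recoding, it can help to look at the class balance, since an imbalanced target affects how accuracy should be read later. A minimal sketch on a toy series (the values are hypothetical, chosen only to show the recoding):

```python
import pandas as pd

# Toy survival codes in the original 1/2 encoding (illustrative only)
survival = pd.Series([1, 1, 2, 1, 2, 1], name="survival")

# Same recoding as above: 1 stays 1 (survived), 2 becomes 0 (died)
y_demo = (survival == 1).astype(int)

# Number of rows in each class
counts = y_demo.value_counts()
print(counts.to_dict())
```

The same call on the real target, `y.value_counts()`, would show how many patients fall in each class.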

Create a dummy variable from the number of nodes

In [0]:
X_no_dummies = cancer.drop('survival', axis=1)
X_no_dummies.nunique()

age        49
op_year    12
nodes      31
dtype: int64

In [0]:
X_no_dummies = cancer.drop('survival', axis=1)
X = pd.get_dummies(X_no_dummies, drop_first=True, columns=['nodes'])
X.head()

,age,op_year,nodes_1,nodes_2,nodes_3,nodes_4,nodes_5,nodes_6,nodes_7,nodes_8,nodes_9,nodes_10,nodes_11,nodes_12,nodes_13,nodes_14,nodes_15,nodes_16,nodes_17,nodes_18,nodes_19,nodes_20,nodes_21,nodes_22,nodes_23,nodes_24,nodes_25,nodes_28,nodes_30,nodes_35,nodes_46,nodes_52
0,30,64,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,30,62,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,30,65,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,31,59,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,31,65,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
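
To see what `get_dummies` with `drop_first=True` does, here is a minimal sketch on a toy frame (the values are assumptions for illustration). Each distinct node count becomes its own indicator column, and the lowest category is dropped to serve as the all-zeros baseline.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({"age": [30, 31, 45, 52], "nodes": [0, 1, 3, 1]})

# One indicator column per distinct node count; drop_first removes the
# lowest category (nodes == 0), which becomes the all-zeros baseline
encoded = pd.get_dummies(toy, columns=["nodes"], drop_first=True)

print(encoded.columns.tolist())
```

This matches the output above: the real data has 31 distinct node counts, so the encoding yields 30 `nodes_*` columns, and row 2 (zero nodes) is all zeros.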


Split the data into train and test (20% in test)

In [0]:
# answer below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.head()

,age,op_year,nodes_1,nodes_2,nodes_3,nodes_4,nodes_5,nodes_6,nodes_7,nodes_8,nodes_9,nodes_10,nodes_11,nodes_12,nodes_13,nodes_14,nodes_15,nodes_16,nodes_17,nodes_18,nodes_19,nodes_20,nodes_21,nodes_22,nodes_23,nodes_24,nodes_25,nodes_28,nodes_30,nodes_35,nodes_46,nodes_52
143,52,59,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
239,62,58,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,44,63,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79,44,67,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
57,42,59,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
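
The split above is not seeded, so the rows selected (and every score that follows) change between runs. One possible variant, sketched on synthetic data, fixes `random_state` and uses `stratify` to keep the class ratio the same in both parts. The seed value and the data here are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 2 features, a 3:1 class imbalance
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = np.array([1] * 75 + [0] * 25)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(y_te), int(y_te.sum()))
```

Applied to this notebook, that would read `train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)`, though the outputs shown here were produced without a seed.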


Create a gradient boosted classification algorithm with a learning rate of 0.01 and max depth of 5. Report the accuracy.

In [0]:
# answer below:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(learning_rate=.01, max_depth=5)
gbc.fit(X_train, y_train)


GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.01, loss='deviance', max_depth=5,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [0]:
gbc.score(X_train, y_train)

0.8114754098360656

In [0]:
gbc.score(X_test, y_test)

0.7096774193548387
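
A small learning rate shrinks each tree's contribution, so the ensemble needs more trees to reach the same fit. One way to watch this is `staged_predict`, which yields predictions after each boosting stage. The sketch below uses synthetic data from `make_classification` (an assumption, not the Haberman set) with the same `learning_rate` and `max_depth` as above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the Haberman set)
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(learning_rate=0.01, max_depth=5, random_state=0)
model.fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ..., n_estimators trees
staged_acc = [accuracy_score(y_te, pred) for pred in model.staged_predict(X_te)]

print(len(staged_acc), staged_acc[-1])
```

The last entry of `staged_acc` equals `model.score(X_te, y_te)`; plotting the whole list shows how quickly (or slowly) test accuracy moves as trees are added.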

Print the confusion matrix for the test data. What do you notice about our predictions?

In [0]:
# answer below:
from sklearn.metrics import classification_report
y_true = y_test
y_pred = gbc.predict(X_test)

print(classification_report(y_true, y_pred))


              precision    recall  f1-score   support

           0       0.33      0.06      0.10        17
           1       0.73      0.96      0.83        45

    accuracy                           0.71        62
   macro avg       0.53      0.51      0.46        62
weighted avg       0.62      0.71      0.63        62



In [0]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_true, y_pred)

array([[ 1, 16],
       [ 2, 43]])
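
To answer the question about the predictions, the matrix above can be unpacked into per-class rates. In scikit-learn's layout, rows are true labels and columns are predicted labels. The sketch below copies the reported matrix by hand; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions
cm = np.array([[1, 16],
               [2, 43]])

tn, fp, fn, tp = cm.ravel()

recall_died = tn / (tn + fp)              # share of class 0 correctly identified
recall_survived = tp / (tp + fn)          # share of class 1 correctly identified
predicted_survived = (fp + tp) / cm.sum() # share of all test rows predicted as 1

print(round(recall_died, 2), round(recall_survived, 2), round(predicted_survived, 2))
```

The two recall values agree with the classification report (0.06 and 0.96). The model predicts class 1 for 59 of the 62 test rows, so its accuracy comes almost entirely from the majority class, and it identifies only 1 of the 17 patients in class 0.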

Print the confusion matrix for a learning rate of 1 and a learning rate of 0.5. What do you see now that stands out to you in the confusion matrix?

In [0]:
gbc = GradientBoostingClassifier(learning_rate=1, max_depth=5)
gbc.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=1, loss='deviance', max_depth=5,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [0]:
# answer below:
gbc.score(X_train, y_train)

0.9836065573770492

In [0]:
gbc.score(X_test, y_test)

0.6451612903225806

In [0]:
gbc = GradientBoostingClassifier(learning_rate=.5, max_depth=5)
gbc.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.5, loss='deviance', max_depth=5,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [0]:
gbc.score(X_train, y_train)

0.9836065573770492

In [0]:
gbc.score(X_test, y_test)

0.6290322580645161
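
For both learning rates, the scores above show training accuracy near 0.98 and test accuracy below 0.65, which points to overfitting. The cells above report only scores, so here is a sketch of how the requested confusion matrices could be produced in one loop. It runs on synthetic, imbalanced data from `make_classification` (an assumption, not the Haberman set), so its numbers will not match this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption, not the Haberman set)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, weights=[0.25, 0.75], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

matrices = {}
for rate in (1.0, 0.5):
    model = GradientBoostingClassifier(
        learning_rate=rate, max_depth=5, random_state=0
    )
    model.fit(X_tr, y_tr)
    # Rows are true labels, columns are predicted labels
    matrices[rate] = confusion_matrix(y_te, model.predict(X_te))
    print(f"learning_rate={rate}")
    print(matrices[rate])
```

In this notebook the equivalent call would be `confusion_matrix(y_test, gbc.predict(X_test))` after each fit.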

Perform a grid search for the optimal learning rate.

In [0]:
# answer below:
from sklearn.model_selection import GridSearchCV
parameters = {'learning_rate':[.0001, .001, .01, .1, 1]}
grid_model = GridSearchCV(gbc, parameters)
grid_model.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.5,
                                                  loss='deviance', max_depth=5,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
      

In [0]:
grid_model.score(X_train, y_train)

0.8114754098360656

In [0]:
grid_model.score(X_test, y_test)

0.7096774193548387
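
The scores above equal those of the earlier `learning_rate=0.01` model, which is consistent with that value being selected but does not confirm it. The fitted search reports its choice directly through `best_params_`, and `cv_results_` holds the cross-validated score for each candidate. A minimal sketch on synthetic data (an assumption, not the Haberman set), with `cv=5` written out explicitly:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"learning_rate": [0.0001, 0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(
    GradientBoostingClassifier(max_depth=5, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X_syn, y_syn)

# Learning rate with the best mean cross-validated accuracy
print(search.best_params_)

# One row per candidate, with its mean score and rank
results = pd.DataFrame(search.cv_results_)
print(results[["param_learning_rate", "mean_test_score", "rank_test_score"]])
```

In this notebook, `grid_model.best_params_` would show which learning rate the search picked.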

List the feature importances for the model with the optimal learning rate.

In [0]:
# answer below:
features = {
    "feature": X_train.columns,
    "importance": grid_model.best_estimator_.feature_importances_,
}
features_df = pd.DataFrame(features)
features_df.sort_values("importance", ascending=False)


,feature,importance
0,age,0.198954
14,nodes_13,0.142697
24,nodes_23,0.135063
1,op_year,0.127186
6,nodes_5,0.123136
12,nodes_11,0.067225
10,nodes_9,0.061916
22,nodes_21,0.035246
18,nodes_17,0.021547
25,nodes_24,0.021534
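
Because `nodes` was one-hot encoded, its importance is spread across many `nodes_*` columns. One way to compare it with `age` and `op_year` is to sum the dummy importances back to their source feature. The sketch below uses only the ten rows listed above, copied by hand, so the `nodes` total is a lower bound. The `source` column name is an illustrative choice.

```python
import pandas as pd

# The ten rows listed above (the remaining dummy columns are not shown there)
shown = pd.DataFrame({
    "feature": ["age", "nodes_13", "nodes_23", "op_year", "nodes_5",
                "nodes_11", "nodes_9", "nodes_21", "nodes_17", "nodes_24"],
    "importance": [0.198954, 0.142697, 0.135063, 0.127186, 0.123136,
                   0.067225, 0.061916, 0.035246, 0.021547, 0.021534],
})

# Map each dummy column back to its source feature: "nodes_13" -> "nodes"
shown["source"] = shown["feature"].str.replace(r"^nodes_\d+$", "nodes", regex=True)

# Total importance per original feature, largest first
grouped = shown.groupby("source")["importance"].sum().sort_values(ascending=False)
print(grouped)
```

Grouped this way, the node-count indicators together outweigh `age` and `op_year` among the listed rows, even though `age` is the single largest entry in the table.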
