## Day 35 Lecture 1 Assignment

In this assignment, we will practice gradient boosting. We load a dataset describing patient survival after breast cancer surgery and analyze a gradient boosted model fit to it.

In [21]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [3]:
# Attributes:
# Age of patient at time of operation (numerical)
# Patient's year of operation (year - 1900, numerical)
# Number of positive axillary nodes detected (numerical)
# Survival status (class attribute)
#  -- 1 = the patient survived 5 years or longer
#  -- 2 = the patient died within 5 years

cols = ['age', 'op_year', 'nodes', 'survival']
cancer = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/haberman.data', names=cols)

In [4]:
cancer.head()

   age  op_year  nodes  survival
0   30       64      1         1
1   30       62      3         1
2   30       65      0         1
3   31       59      2         1
4   31       65      4         1


Check for missing data and remove any rows that contain missing values.

In [5]:
# answer below:

cancer.isnull().sum()

age         0
op_year     0
nodes       0
survival    0
dtype: int64

In [27]:
cancer = cancer.dropna()

Adjust the target variable so that it takes the values 0 or 1 (1 = survived 5 years or longer, 0 = died within 5 years).

In [28]:
# answer below:

cancer['survival'] = np.where(cancer['survival'] == 1, 1, 0)
cancer.survival.value_counts()

1    225
0     81
Name: survival, dtype: int64

Split the data into training and test sets, with 20% of the rows in the test set.

In [29]:
# answer below:

from sklearn.model_selection import train_test_split

X = cancer.drop('survival', axis=1)
y = cancer['survival']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
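
The split above is random and unseeded, so the scores that follow will vary from run to run. As a minimal sketch of one way to make the split repeatable and keep the class ratio similar in both parts, the example below uses the `stratify` and `random_state` arguments of `train_test_split`. The arrays `X_demo` and `y_demo` are synthetic stand-ins built from the class counts reported earlier (225 ones, 81 zeros), and the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the recoded target (225 ones, 81 zeros).
y_demo = np.array([1] * 225 + [0] * 81)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify keeps the class ratio similar in both splits;
# random_state makes the split repeatable (42 is an arbitrary choice).
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(len(yd_train), len(yd_test))
print(round(y_demo.mean(), 3), round(yd_train.mean(), 3), round(yd_test.mean(), 3))
```

Applied to the assignment, this would correspond to passing `stratify=y` and a fixed `random_state` in the call above. The outputs shown in the rest of this notebook come from the original unseeded split.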

Create a gradient boosted classification algorithm with a learning rate of 0.01 and max depth of 5. Report the accuracy.

In [30]:
# answer below:

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 5)
gbc.fit(X_train,y_train)
print('train score: ', gbc.score(X_train,y_train))
print('test score: ', gbc.score(X_test,y_test))

train score:  0.8647540983606558
test score:  0.7419354838709677
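
To judge whether a test accuracy of about 0.74 is good, it helps to compare it with a trivial baseline. The sketch below computes the accuracy of always predicting the majority class, using the class counts reported by `value_counts()` above. It is computed on the full dataset, so the ratio in any particular test split may differ slightly.

```python
import numpy as np

# Class counts reported by value_counts() above: 225 survivors (1), 81 non-survivors (0).
y_all = np.array([1] * 225 + [0] * 81)

# A "majority class" baseline predicts 1 for every patient.
baseline_pred = np.ones_like(y_all)
baseline_acc = (baseline_pred == y_all).mean()

print(f"majority-class baseline accuracy: {baseline_acc:.3f}")  # 225 / 306, about 0.735
```

Because the baseline is already close to the reported test score, accuracy alone says little about how well the model separates the two classes here.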


Print the confusion matrix for the test data. What do you notice about our predictions?

In [31]:
# answer below:

from sklearn.metrics import confusion_matrix

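# Note: this call scores the training split (244 rows), not the test split the prompt asks for.
# For the test data, use: confusion_matrix(y_test, gbc.predict(X_test))
confusion_matrix(y_train, gbc.predict(X_train))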

array([[ 34,  32],
       [  1, 177]])
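
To make the matrix easier to interpret, the sketch below derives per-class rates from the four counts printed above. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted order (0, then 1). The variable names are illustrative.

```python
import numpy as np

# Matrix printed above; rows are true labels (0, 1), columns are predicted labels (0, 1).
cm = np.array([[34, 32],
               [1, 177]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall_pos = tp / (tp + fn)      # share of actual 1s predicted as 1
recall_neg = tn / (tn + fp)      # share of actual 0s predicted as 0 (specificity)
precision_pos = tp / (tp + fp)   # share of predicted 1s that are truly 1

print(f"accuracy={accuracy:.3f} recall(1)={recall_pos:.3f} "
      f"recall(0)={recall_neg:.3f} precision(1)={precision_pos:.3f}")
```

The model almost never misses a survivor, but it labels nearly half of the non-survivors as survivors. The 244 rows and the accuracy of about 0.865 match the training score printed earlier, which indicates this matrix describes the training split.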

Print the confusion matrix for a learning rate of 1 and a learning rate of 0.5. What do you see now that stands out to you in the confusion matrix?

In [32]:
# answer below:

gbc = GradientBoostingClassifier(learning_rate = 1, max_depth = 5)

gbc.fit(X_train,y_train)

confusion_matrix(y_train,gbc.predict(X_train))

array([[ 64,   2],
       [  1, 177]])

In [33]:

gbc = GradientBoostingClassifier(learning_rate = 0.5, max_depth = 5)

gbc.fit(X_train,y_train)

confusion_matrix(y_train,gbc.predict(X_train))

array([[ 63,   3],
       [  0, 178]])

In [None]:
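# Predictions on the training data improve as the learning rate increases:
# misclassified class-0 rows drop from 32 (rate 0.01) to 2 (rate 1) and 3 (rate 0.5).
# These matrices use the training split, so the gain may reflect overfitting
# rather than better performance on unseen data.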
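
Since the matrices above are computed on the training split, one way to check whether a higher learning rate really helps is to compare training and test accuracy side by side. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for the cancer dataset, so its numbers say nothing about the assignment's results. The names `X_syn`, `y_syn`, and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the cancer dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0,
    weights=[0.3, 0.7], flip_y=0.1, random_state=0,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Fit one model per learning rate and record (train accuracy, test accuracy).
scores = {}
for lr in (0.01, 0.5, 1.0):
    model = GradientBoostingClassifier(learning_rate=lr, max_depth=5, random_state=0)
    model.fit(Xs_train, ys_train)
    scores[lr] = (model.score(Xs_train, ys_train), model.score(Xs_test, ys_test))

for lr, (train_acc, test_acc) in scores.items():
    print(f"learning_rate={lr}: train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between the training and test columns at high learning rates would point to overfitting. The same loop can be run on `X_train`, `X_test`, `y_train`, and `y_test` from the assignment.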

Perform a grid search for the optimal learning rate. Instead of accuracy, use a metric that will help your model predict the positive class.

In [34]:
# answer below:

from sklearn.model_selection import GridSearchCV

param = {'learning_rate': [0.01,0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}

gbc = GradientBoostingClassifier(n_iter_no_change=10)
clf = GridSearchCV(gbc, param, cv=5, scoring='recall', n_jobs=2)
clf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=10,
           

In [35]:

clf.best_params_

{'learning_rate': 0.01}
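
One possible explanation for the search preferring the smallest learning rate is that recall on the positive class rewards models that predict 1 for almost everyone. The notebook does not verify this, so treat it as a hypothesis. The sketch below shows the effect with a degenerate predictor that always outputs 1, using the class counts from the training confusion matrices above (66 zeros, 178 ones).

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, recall_score

# Class counts taken from the training confusion matrices above: 66 zeros, 178 ones.
y_true = np.array([0] * 66 + [1] * 178)
always_one = np.ones_like(y_true)  # a degenerate "predict survival for everyone" model

rec = recall_score(y_true, always_one)
bal = balanced_accuracy_score(y_true, always_one)
f1 = f1_score(y_true, always_one)
print(f"recall={rec:.3f} balanced_accuracy={bal:.3f} f1={f1:.3f}")
```

Recall is perfect for this useless model, while balanced accuracy drops to 0.5. Passing `scoring='balanced_accuracy'` or `scoring='f1'` to `GridSearchCV` is one way to penalize this behavior. Which metric is appropriate depends on which errors matter most for the application.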

List the feature importances for the model with the optimal learning rate.

In [36]:
# answer below:

x_cols = X_train.columns

pd.DataFrame({'columns': x_cols, 'importance scores':clf.best_estimator_.feature_importances_}).sort_values(
    by='importance scores', ascending=False)

   columns  importance scores
2    nodes           0.526703
0      age           0.385534
1  op_year           0.087764
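
The `feature_importances_` values above are impurity-based: they are computed from the training data and sum to 1. As a hedged complement, the sketch below compares them with permutation importance measured on held-out rows, using `sklearn.inspection.permutation_importance`. The data is synthetic (an assumption, not the cancer dataset), and the names `X_imp`, `y_imp`, and `perm` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three features (an assumption; not the cancer dataset).
X_imp, y_imp = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_imp, y_imp, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
model.fit(Xi_train, yi_train)

# Impurity-based importances: derived from the fitted trees, normalized to sum to 1.
impurity = model.feature_importances_

# Permutation importances: mean drop in held-out accuracy when one feature is shuffled.
perm = permutation_importance(model, Xi_test, yi_test, n_repeats=10, random_state=0)

for i in range(X_imp.shape[1]):
    print(f"feature {i}: impurity={impurity[i]:.3f} permutation={perm.importances_mean[i]:.3f}")
```

If the two rankings disagree, the permutation scores are usually the safer guide to what matters on unseen data, since they are measured on rows the model did not train on.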
