## Exercises 

For these exercises we will use the [Coimbra Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra). It consists of 9 quantitative measures and 1 target value that indicates the presence or absence of cancer (1 = Healthy, 2 = Cancerous).

Please submit the completed exercises as a PDF file.

In [1]:
import numpy as np
import pandas as pd

In [2]:
coimbra = pd.read_csv('dataR2.csv')
coimbra.head()

| | Age | BMI | Glucose | Insulin | HOMA | Leptin | Adiponectin | Resistin | MCP.1 | Classification |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48 | 23.5 | 70 | 2.707 | 0.467409 | 8.8071 | 9.7024 | 7.99585 | 417.114 | 1 |
| 1 | 83 | 20.690495 | 92 | 3.115 | 0.706897 | 8.8438 | 5.429285 | 4.06405 | 468.786 | 1 |
| 2 | 82 | 23.12467 | 91 | 4.498 | 1.009651 | 17.9393 | 22.43204 | 9.27715 | 554.697 | 1 |
| 3 | 68 | 21.367521 | 77 | 3.226 | 0.612725 | 9.8827 | 7.16956 | 12.766 | 928.22 | 1 |
| 4 | 86 | 21.111111 | 92 | 3.549 | 0.805386 | 6.6994 | 4.81924 | 10.57635 | 773.92 | 1 |


In [3]:
coimbra.describe()

| | Age | BMI | Glucose | Insulin | HOMA | Leptin | Adiponectin | Resistin | MCP.1 | Classification |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 116.0 | 116.0 | 116.0 | 116.0 | 116.0 | 116.0 | 116.0 | 116.0 | 116.0 | 116.0 |
| mean | 57.301724 | 27.582111 | 97.793103 | 10.012086 | 2.694988 | 26.61508 | 10.180874 | 14.725966 | 534.647 | 1.551724 |
| std | 16.112766 | 5.020136 | 22.525162 | 10.067768 | 3.642043 | 19.183294 | 6.843341 | 12.390646 | 345.912663 | 0.499475 |
| min | 24.0 | 18.37 | 60.0 | 2.432 | 0.467409 | 4.311 | 1.65602 | 3.21 | 45.843 | 1.0 |
| 25% | 45.0 | 22.973205 | 85.75 | 4.35925 | 0.917966 | 12.313675 | 5.474283 | 6.881763 | 269.97825 | 1.0 |
| 50% | 56.0 | 27.662416 | 92.0 | 5.9245 | 1.380939 | 20.271 | 8.352692 | 10.82774 | 471.3225 | 2.0 |
| 75% | 71.0 | 31.241442 | 102.0 | 11.18925 | 2.857787 | 37.3783 | 11.81597 | 17.755207 | 700.085 | 2.0 |
| max | 89.0 | 38.578759 | 201.0 | 58.46 | 25.050342 | 90.28 | 38.04 | 82.1 | 1698.44 | 2.0 |


### Exercise 1 

Check for missing values, and drop any rows which have missing values. Create a feature and target dataframe and split these into testing and training sets. (2 marks)

In [4]:
coimbra.isnull().values.any()  

False
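
The check above returns `False`, so there is nothing to drop in this dataset. For completeness, the following is a minimal sketch of how the "drop any rows" part of the exercise could be handled if missing values were present. The `toy` dataframe and its values are hypothetical stand-ins, not the Coimbra data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; two cells are missing.
toy = pd.DataFrame({
    "Age": [48, 83, np.nan, 68],
    "BMI": [23.5, 20.7, 23.1, np.nan],
    "Classification": [1, 1, 2, 2],
})

# Count missing values per column instead of only testing whether any exist.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
toy_clean = toy.dropna(axis=0, how="any").reset_index(drop=True)
print(toy_clean.shape)
```

On the real data the same `dropna` call would leave `coimbra` unchanged, since no values are missing.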

In [6]:
from sklearn.model_selection import train_test_split
X = coimbra.drop(['Classification'], axis = 1) # drop the target variable for the features
y = coimbra['Classification'] # create a target dataframe

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=0)
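
With only 116 rows, a purely random split can leave the two classes unevenly represented in the test set. One possible variation, not required by the exercise, is to pass `stratify` to `train_test_split`. The sketch below uses hypothetical synthetic data (`X_toy`, `y_toy`) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100 rows, 40 of class 1 and 60 of class 2.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 40 + [2] * 60)

# stratify=y_toy keeps the 40/60 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=0, stratify=y_toy
)

print(np.bincount(y_tr)[1:], np.bincount(y_te)[1:])
```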


### Exercise 2

Create a simple Random Forest classifier as a baseline model, and calculate the accuracy for this model. (2 marks)

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# instantiate the RFC with 200 ensemble members
rfc = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
y_pred = rfc.predict(X_test)  # calculate the predicted values
# print the accuracy of the RFC
print('Accuracy {0}'.format(np.round(accuracy_score(y_test, y_pred),3)))


Accuracy 0.542
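
It can help to translate this accuracy back into a number of correct predictions. The arithmetic below is a sketch that assumes scikit-learn rounds the training count down when `train_size` is a fraction and `test_size` is left unset, which is consistent with the accuracies reported in this notebook.

```python
import math

n_samples = 116    # row count reported by coimbra.describe()
train_size = 0.8   # fraction passed to train_test_split

# Assumed rounding rule: the training count is floored, the rest is test data.
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train

# Each test prediction moves the accuracy by one step of 1 / n_test.
step = 1 / n_test

# Number of correct predictions behind the reported accuracy of 0.542.
correct = round(0.542 * n_test)

print(n_train, n_test, round(step, 4), correct)
```

Under this assumption the test set has 24 rows, so accuracies can only change in steps of about 0.042, and small differences between models correspond to one or two test rows.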


### Exercise 3

Create an AdaBoost model with a Decision Stump and a learning rate of 1, and calculate the accuracy of this model. (2 marks)

In [7]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# note: learning_rate is set to 0.1 in this run, whereas the exercise text asks for 1
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=200, learning_rate=.1)
ada_clf.fit(X_train, y_train)
y_pred = ada_clf.predict(X_test)  # calculate the predicted values
# print the accuracy of the classifier
print('Accuracy {0}'.format(np.round(accuracy_score(y_test, y_pred),3)))


Accuracy 0.583
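
The decision stump is passed explicitly in the cell above. According to the scikit-learn documentation, `AdaBoostClassifier` also falls back to a `DecisionTreeClassifier` with `max_depth=1` when no base estimator is given. The sketch below checks this on hypothetical synthetic data from `make_classification`; the dataset parameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical synthetic data with 9 features, loosely mirroring the dataset shape.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=9, flip_y=0.2, random_state=0
)

# No base estimator is passed, so the library default is used.
clf = AdaBoostClassifier(n_estimators=10, learning_rate=1.0, random_state=0)
clf.fit(X_toy, y_toy)

# Collect the max_depth setting of every fitted ensemble member.
depths = {est.max_depth for est in clf.estimators_}
print(depths)
```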


### Exercise 4

For the same learning rate, find the optimal number of estimators. Use only the training and testing sets to calculate this; do not use cross-validation. (4 marks)

In [10]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [14]:
box = []  # test accuracy for each ensemble size
for n_est in range(1, 201):
    # note: this run uses learning_rate=0.1
    ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                 n_estimators=n_est, learning_rate=.1)
    ada_clf.fit(X_train, y_train)
    y_pred = ada_clf.predict(X_test)  # calculate the predicted values
    # store the accuracy of the classifier and print it for this ensemble size
    acc = np.round(accuracy_score(y_test, y_pred), 3)
    box.append(acc)
    print([acc])

    

[0.542]
[0.625]
[0.542]
[0.625]
[0.625]
[0.583]
[0.625]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.625]
[0.583]
[0.583]
[0.625]
[0.625]
[0.625]
[0.583]
[0.583]
[0.625]
[0.625]
[0.583]
[0.625]
[0.583]
[0.625]
[0.625]
[0.583]
[0.583]
[0.583]
[0.583]
[0.625]
[0.583]
[0.625]
[0.625]
[0.583]
[0.583]
[0.583]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.625]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.583]
[0.625]
[0.625]
[0.625]
[0.625]
[0.667]
[0.625]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.625]
[0.667]
[0.625]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.625]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.667]
[0.625]
[0.667]
[0.625]
[0.625]
[0.625]


In [15]:
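# candidate ensemble sizes: 1, 2, ..., 200
nests = np.arange(1, 201, 1)
results = np.zeros(len(nests))
for j in range(len(nests)):
    # the default base estimator is a decision stump (max_depth=1)
    ada_clf = AdaBoostClassifier(n_estimators=nests[j], learning_rate=1)
    ada_clf.fit(X_train, y_train)
    y_pred = ada_clf.predict(X_test)
    results[j] = accuracy_score(y_test, y_pred)  # test accuracy for this size
# collect the accuracies in a dataframe indexed by the number of estimators
results_df = pd.DataFrame(results, columns=['accuracy(learning_rate=1)'], index=nests)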
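
The loop above refits the model 200 times. A possible alternative is to fit the largest ensemble once and score every intermediate ensemble size with `staged_predict`, which yields the predictions after each boosting round. The sketch below uses hypothetical synthetic data and a smaller ensemble of 50 estimators; it is not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the Coimbra features and target.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=9, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=0
)

# Fit the largest ensemble once, using decision stumps as base learners.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, learning_rate=1.0, random_state=0)
clf.fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ... boosting rounds.
staged_acc = np.array(
    [accuracy_score(y_te, y_hat) for y_hat in clf.staged_predict(X_te)]
)

# Stage i (zero-based) corresponds to an ensemble of i + 1 estimators.
best_n = int(np.argmax(staged_acc)) + 1
print(best_n, staged_acc[best_n - 1])
```

Because boosting builds the ensemble sequentially, the first k rounds of a larger model play the role of a separately fitted k-estimator model, provided the random seed is fixed.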

In [16]:
results

array([0.54166667, 0.625     , 0.5       , 0.54166667, 0.54166667,
       0.66666667, 0.66666667, 0.75      , 0.70833333, 0.70833333,
       0.66666667, 0.66666667, 0.625     , 0.66666667, 0.625     ,
       0.54166667, 0.625     , 0.58333333, 0.58333333, 0.625     ,
       0.66666667, 0.70833333, 0.75      , 0.75      , 0.70833333,
       0.66666667, 0.70833333, 0.70833333, 0.70833333, 0.70833333,
       0.70833333, 0.70833333, 0.66666667, 0.70833333, 0.625     ,
       0.625     , 0.58333333, 0.625     , 0.66666667, 0.66666667,
       0.625     , 0.58333333, 0.58333333, 0.625     , 0.66666667,
       0.66666667, 0.66666667, 0.66666667, 0.70833333, 0.66666667,
       0.70833333, 0.70833333, 0.70833333, 0.70833333, 0.70833333,
       0.70833333, 0.66666667, 0.70833333, 0.70833333, 0.70833333,
       0.66666667, 0.70833333, 0.66666667, 0.70833333, 0.70833333,
       0.70833333, 0.66666667, 0.66666667, 0.66666667, 0.66666667,
       0.66666667, 0.66666667, 0.66666667, 0.66666667, 0.66666

In [20]:
max_value = max(results)
max_value

0.75

In [22]:
max_index = np.argmax(results, axis=0)
max_index

7

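Optimal number of estimators: 8. `np.argmax` returns the zero-based index 7, and the estimator counts in `nests` start at 1, so index 7 corresponds to `n_estimators = 8`, with a test accuracy of 0.75. `np.argmax` reports only the first maximum; among the values shown, the same accuracy of 0.75 is also reached with 23 and 24 estimators.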
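
To make the index-to-count conversion and the handling of ties explicit, the following sketch rebuilds the first 25 accuracies from the `results` output above as counts of correct predictions out of 24 test rows (an assumption consistent with the printed values). The names `results_head` and `nests_head` are hypothetical.

```python
import numpy as np

# Correct-prediction counts (out of 24) matching the first 25 accuracies above.
correct = np.array([13, 15, 12, 13, 13, 16, 16, 18, 17, 17,
                    16, 16, 15, 16, 15, 13, 15, 14, 14, 15,
                    16, 17, 18, 18, 17])
results_head = correct / 24
nests_head = np.arange(1, len(results_head) + 1)

# np.argmax returns the index of the first maximum only.
first_best = nests_head[np.argmax(results_head)]

# np.flatnonzero lists every position that reaches the maximum accuracy.
all_best = nests_head[np.flatnonzero(results_head == results_head.max())]

print(first_best, all_best)  # 8 [ 8 23 24]
```

When several ensemble sizes tie, choosing the smallest one, as `np.argmax` does here, gives the cheapest model with the same test accuracy.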