### Non-linear Models
##### Adapted by Dr. Harry Goldingay from solution by Miss Katrina Jones and Dr. Aniko Ekart (ML Module, 2021)

_Note: this task operates on the same dataset as the task from the previous week, so I will import much of the code from that task's solution. The code which is unique to this week's task starts after the "Cross-validation" heading below._ 

In [1]:
# Import general modules
import pandas as pd # For reading the csv
from sklearn.linear_model import Perceptron # Model from last week's task (kept with the imported code)
from sklearn.metrics import confusion_matrix # For building the confusion matrix
from sklearn.metrics import classification_report # For precision, recall and F1 scores
from sklearn.metrics import accuracy_score # For comparing overall accuracy against the confusion matrix
import matplotlib.pyplot as plt # For plotting

_Reading the dataset from the CSV file._

In [2]:
# Use header=None so that the first row is not treated as column headers
df = pd.read_csv("haberman.csv", header=None)
df.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 30 | 64 | 1 | 1 |
| 1 | 30 | 62 | 3 | 1 |
| 2 | 30 | 65 | 0 | 1 |
| 3 | 31 | 59 | 2 | 1 |
| 4 | 31 | 65 | 4 | 1 |


_We can add descriptive column headers for readability._

In [3]:
df.columns = ['age','operationYear','positiveAuxNodes','survivalStatus']
df.head()

| | age | operationYear | positiveAuxNodes | survivalStatus |
|---|---|---|---|---|
| 0 | 30 | 64 | 1 | 1 |
| 1 | 30 | 62 | 3 | 1 |
| 2 | 30 | 65 | 0 | 1 |
| 3 | 31 | 59 | 2 | 1 |
| 4 | 31 | 65 | 4 | 1 |


In [4]:
df.describe() # Useful with this many rows: shows the count, mean and range of values in each column

| | age | operationYear | positiveAuxNodes | survivalStatus |
|---|---|---|---|---|
| count | 306.0 | 306.0 | 306.0 | 306.0 |
| mean | 52.457516 | 62.852941 | 4.026144 | 1.264706 |
| std | 10.803452 | 3.249405 | 7.189654 | 0.441899 |
| min | 30.0 | 58.0 | 0.0 | 1.0 |
| 25% | 44.0 | 60.0 | 0.0 | 1.0 |
| 50% | 52.0 | 63.0 | 1.0 | 1.0 |
| 75% | 60.75 | 65.75 | 4.0 | 2.0 |
| max | 83.0 | 69.0 | 52.0 | 2.0 |
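
_The summary above is enough to work out how well a classifier would do by always predicting the more common class, which is a useful reference point for any accuracy reported on this dataset. The following is a minimal sketch, assuming the only labels in `survivalStatus` are 1 and 2 (as the min and max rows suggest); the count and mean are copied from the table, and the variable names are illustrative._

```python
# Values copied from the df.describe() output for survivalStatus
n_rows = 306
mean_label = 1.264706

# Assumption: labels are only 1 or 2, so mean = 1 + (fraction of rows labelled 2)
n_label_2 = round((mean_label - 1) * n_rows)
n_label_1 = n_rows - n_label_2

# Accuracy obtained by always predicting the more frequent label
majority_baseline = max(n_label_1, n_label_2) / n_rows
print(n_label_1, n_label_2, round(majority_baseline, 4))
```

_With the DataFrame loaded, the same figure could be read directly from `df['survivalStatus'].value_counts(normalize=True).max()`._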


In [5]:
X = df.iloc[:, 0:3] # Features: age, operationYear, positiveAuxNodes
y = df.iloc[:, 3] # Target: survivalStatus

### Cross-validation

In [6]:
# This solution adapts the example found in the sklearn
# documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

# Grid of parameters to search
p_grid = {
    "kernel": ["poly", "rbf"],
    "C": [1, 10],
    "gamma": [0.01, 0.1]
}

# Define how many inner and outer folds we want
inner_cv = KFold(n_splits=2, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=2, shuffle=True, random_state=42)

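# Nested CV with parameter optimization: GridSearchCV tunes the
# hyperparameters on the inner folds, and cross_val_score evaluates
# the tuned model on the outer folds
clf = GridSearchCV(estimator=SVC(), param_grid=p_grid, cv=inner_cv, refit=True)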
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv)
print(nested_score)
print("\n")

[0.77124183 0.7254902 ]
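
_To make explicit what `cross_val_score` does when it is handed a `GridSearchCV` object, the sketch below writes the outer loop by hand and checks it against the one-line version. It is an illustration rather than part of the original solution: it runs on synthetic stand-in data from `make_classification` (an assumption, so that it works without `haberman.csv`), and names such as `X_demo` and `manual_scores` are hypothetical._

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 3 numeric features, 2 classes
X_demo, y_demo = make_classification(
    n_samples=120, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

p_grid = {"kernel": ["poly", "rbf"], "C": [1, 10], "gamma": [0.01, 0.1]}
inner_cv = KFold(n_splits=2, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=2, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in outer_cv.split(X_demo):
    # Inner loop: tune hyperparameters using only the outer training fold
    search = GridSearchCV(SVC(), param_grid=p_grid, cv=inner_cv, refit=True)
    search.fit(X_demo[train_idx], y_demo[train_idx])
    # Outer loop: score the refitted best model on the held-out outer fold
    manual_scores.append(search.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The one-line equivalent used in the cell above
reference_scores = cross_val_score(
    GridSearchCV(SVC(), param_grid=p_grid, cv=inner_cv, refit=True),
    X=X_demo, y=y_demo, cv=outer_cv,
)

print(manual_scores, reference_scores)
print("mean:", manual_scores.mean(), "std:", manual_scores.std())
```

_The key point is that each outer test fold is never seen during tuning, so the outer scores estimate how the whole tune-then-fit procedure generalises. With only two outer folds the estimate is coarse, which is why the mean and spread are worth reporting together._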




In [7]:
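# cross_val_score fits clones of clf, so clf itself is still unfitted.
# To inspect the selected hyperparameters, run the grid search on the full dataset.
clf.fit(X, y)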
print("After inner CV...")
print(clf.best_params_)
print(clf.best_score_)
print("\n")

After inner CV...
{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.7549019607843137
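
_`best_params_` and `best_score_` report only the winning combination. A fitted `GridSearchCV` also stores the scores for every combination in its `cv_results_` attribute, which is easier to read as a DataFrame. The sketch below shows one way to do that; it uses synthetic stand-in data from `make_classification` (an assumption, so that it runs on its own), and `search` and `summary` are illustrative names._

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 3 numeric features, 2 classes
X_demo, y_demo = make_classification(
    n_samples=120, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

p_grid = {"kernel": ["poly", "rbf"], "C": [1, 10], "gamma": [0.01, 0.1]}
search = GridSearchCV(
    SVC(), param_grid=p_grid, cv=KFold(n_splits=2, shuffle=True, random_state=42)
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
summary = results[
    ["param_kernel", "param_C", "param_gamma",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```

_In the notebook itself, `pd.DataFrame(clf.cv_results_)` would give the corresponding table for the real data once `clf.fit(X, y)` has run. Comparing the mean scores across rows shows how much the choice of kernel, `C` and `gamma` actually matters, rather than just which combination came first._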


