# Module 4 Guidance

This notebook is a template for modules 4b and 4c. It will be tested in Google Colab, so your code needs to run there.
The structure has been provided to improve consistency and make it easier for markers to understand your code, while still giving students the flexibility to be creative.  You need to populate the required functions to solve this problem.  All dependencies should be documented in the next cell.

You can:
    add further cells or text blocks to extend or further explain your solution
    add further functions

Don't:
    rename functions
   

In [None]:
# Fixed dependencies - do not remove or change.
import pytest
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/gdrive/')
# Import your dependencies

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

# Models
from sklearn.ensemble import RandomForestClassifier

# Tuning
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

# Evaluating
from sklearn.metrics import confusion_matrix, accuracy_score




Mounted at /content/gdrive/


In [None]:
# Import data

def import_local_data(file_path):
    """Import the data file into Colab and return a pandas DataFrame.
    """
    raw_df = pd.read_excel(file_path)
    


    return raw_df

In [None]:
# The file is hosted on my google drive
%cd ../content/gdrive/My Drive
local_file = "breast-cancer.xls" 

/content/gdrive/My Drive


In [None]:
# Don't change
data = import_local_data(local_file)

### Conduct exploratory data analysis and explain your key findings - Examine the data, explain its key features and what they look like.  Highlight any fields that are anomalous.

In [None]:
data.head(10)
# Class is the target column. It is binary and can easily be mapped to a numerical value. At first glance there are some anomalous fields: what appear to be datetime objects in the inv-nodes and tumor-size columns.
# In addition to Class, node-caps, breast and irradiat are binary categorical values.

In [None]:
data.info() # ideally all of the categories will have a numerical value; those that are categorical will need to be encoded

In [None]:
# a quick check to see if there's any missing data in the columns
data.isnull().sum().sort_values(ascending=False)
# there is no missing data, but there may be additional spurious values such as the datetime objects noted when looking at head()

In [None]:
for col in data:
    print(data[col].value_counts(),'\n')

# Displaying each data category in this manner to determine the unique values
# Doing this demonstrated that although there is no 'missing' data, there are some suspect values in some columns that will need dealing with separately

In [None]:
# in the tumor-size and inv-nodes columns there are some values showing as datetime objects. These provide no value as they stand. To allow me to isolate them from the other values for some quick analysis
# I ensure they're all converted to strings, as per the other values
data.loc[:, 'tumor-size'] = data.loc[:, 'tumor-size'].map(lambda x: str(x))

data.loc[:, 'inv-nodes'] = data.loc[:, 'inv-nodes'].map(lambda x: str(x))

In [None]:
# having converted them to strings I can single out the spurious data, as it's a far longer string than the 'correct' data
per_tum = data['tumor-size'][data['tumor-size'].str.len()>5].count()/data['tumor-size'].count()*100
count_tum = data['tumor-size'][data['tumor-size'].str.len()>5].count()
print(f'number of rows with spurious data for tumor-size is: {count_tum}, as a percentage that is {per_tum.round(3)} of the total')
# this shows that the spurious data represents just over 11% of all data for this category.

per_inv = data['inv-nodes'][data['inv-nodes'].str.len()>5].count()/data['inv-nodes'].count()*100
count_inv = data['inv-nodes'][data['inv-nodes'].str.len()>5].count()
print(f'number of rows with spurious data for inv-nodes is: {count_inv}, as a percentage that is {per_inv.round(3)} of the total')

number of rows with spurious data for tumor-size is: 32, as a percentage that is 11.189 of the total
number of rows with spurious data for inv-nodes is: 66, as a percentage that is 23.077 of the total


In [None]:
# how many rows have spurious data in both tumor-size and inv-nodes?
data[(data['tumor-size'].str.len()>5) & (data['inv-nodes'].str.len()>5)]
# I did this to see if it would be reasonable to remove the anomalous data, but of the 98 datetime values found (32 + 66), only 2 rows had them in both columns.

*I spent some time here wondering how best to approach this. I didn't want to simply replace the anomalous values with the most common one, but equally there was too much spurious data to remove.*

*Then I remembered I'd seen this type of data 'corruption' before in Excel, where it decides that something it's seeing is a date when it's not.*

In [None]:
# I don't know if this happened after I downloaded the file, as I viewed it in Excel before moving it to Google Drive. I decided to accept it as something to deal with.
data['tumor-size'].value_counts()
# the two date values in tumor-size are most likely 10-14 and 05-09.
# it's really obvious now I've noticed it

In [None]:
data['inv-nodes'].value_counts()
# I think the values here should be 3-5, 6-8, 9-11 and 12-14
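
The Excel date corruption described above could in principle be reversed. The sketch below assumes Excel read a range such as 10-14 as month-day (14 October), so the month and day of the resulting datetime give back the two ends of the range. `restore_range` is a hypothetical helper, not part of the template, and the month-day reading is an assumption that should be checked against the value counts before relying on it.

```python
from datetime import datetime

def restore_range(value):
    """Turn a datetime that Excel created from 'm-d' back into an 'm-d' range string.

    Assumption: Excel parsed the original text as month-day.
    Any other value is passed through unchanged as a string.
    """
    if isinstance(value, datetime):
        return f"{value.month}-{value.day}"
    return str(value)

print(restore_range(datetime(2014, 10, 14)))  # a corrupted '10-14'
print(restore_range("30-34"))                 # an untouched range
```

Because pandas Timestamp objects subclass `datetime`, something like `data['tumor-size'].map(restore_range)` should work if it is applied before the columns are converted to strings.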

# Explain your key findings
*There are 9 contributing data fields and 1 target category.*

*All but 1 of the fields are object dtypes, which will need to be converted into numerical representations of the data.*

*There's no missing data, but there are some anomalous values. In node-caps and breast-quad, ? appears as a value. It's a small proportion of the whole, so it will simply be assigned a common value based on the rest of the data.*

*In tumor-size and inv-nodes a large number of fields show datetime objects as values. I've concluded this is a corruption of the intended value, where Excel has converted what it's seeing into a date, which is in turn represented as a datetime object in Python. There is an obvious pattern, so the intended values can be inferred.*

*I originally intended to replace the values with what they should have been for that field. However, the automated encoding will treat each distinct date as its own category and assign it a value, just as it would if I had manually replaced them.*
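
As a rough illustration of the ? handling described above, the sketch below treats ? as missing and fills it with the most frequent value in the column. The toy frame and its values are made up for the example; only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the two columns that contain '?'
toy = pd.DataFrame({
    "node-caps": ["no", "yes", "?", "no"],
    "breast-quad": ["left_low", "?", "left_up", "left_low"],
})

# Treat '?' as a missing value so pandas can count and fill it
cleaned = toy.replace("?", np.nan)
missing_counts = cleaned.isnull().sum()

# Fill each column with its own most frequent value
filled = cleaned.fillna(cleaned.mode().iloc[0])

print(missing_counts)
print(filled)
```

Inside a pipeline, `SimpleImputer(missing_values='?', strategy='most_frequent')` would be one way to get a similar effect, learned from the training data only.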

**Create any data pre-processing that you will conduct on seen and unseen data.  Regardless of the model you use, this dataframe must contain only numeric features and have a strategy for any expected missing values. Any objects that are needed to handle the test data and that are dependent on the training data can be stored in the model class.  You are recommended to use sklearn Pipelines or similar functionality to ensure reproducibility.**

In [None]:
# Split your data so that you can test the effectiveness of your model
X = data.iloc[:,0:9]
y = data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=42, shuffle=True)

In [None]:
# Checking that the shapes of the datasets match. The model will not function with mismatched value counts.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(200, 9) (200,)
(86, 9) (86,)


In [None]:
# Populate preprocess_training_data and preprocess_test_data to preprocess data.
# You must process test and train separately so your model does not accidentally gain information that a model wouldn't have in reality and therefore get better predictions

In [None]:
class Module4_Model:
    
    def __init__(self):
        self.model = None
        
    def preprocess_training_data(self, X_train, y_train):
        """
        This function should process the training data and store any features required in the class
        """
            
        #categorical_mask = train.dtypes==object
        #cols = train.columns[categorical_mask].tolist()
        processed_X_train = X_train.copy()
        # use a flat mapping and reassign; the nested {'Class': {...}} form of replace targets DataFrame columns, not a Series
        y_train = y_train.replace({'recurrence-events':1,'no-recurrence-events':0})
        processed_X_train.replace({'irradiat':{'yes':1, 'no':0}}, inplace=True)
        
        #newcol=train['Class'].copy()
        processed_X_train=pd.get_dummies(processed_X_train)
        #processed_train['Class']=newcol

        return processed_X_train, y_train

    def preprocess_test_data(self, X_test, y_test):
    
        #categorical_mask = test.dtypes==object
        #cols = test.columns[categorical_mask].tolist()
        processed_X_test = X_test.copy()
        # same flat mapping as for the training target
        y_test = y_test.replace({'recurrence-events':1,'no-recurrence-events':0})
        processed_X_test.replace({'irradiat':{'yes':1, 'no':0}}, inplace=True)
        
        #newcol=test['Class'].copy()
        processed_X_test = pd.get_dummies(processed_X_test)
        #processed_test['Class']=newcol

        return processed_X_test, y_test

    def process_for_model(self, x_train_processed,x_test_processed):
        # I opted for this option here to ensure that the test data has the same columns as the training data. This was an issue using this method. 
        # I only did it this way to fit with the class model that the challenge prescribed. I've approached a different way later to show how I intended 
        # to do it without this template.
        x_test_processed = x_test_processed.reindex(columns = x_train_processed.columns, fill_value=0)
        #X_train = x_train_processed.drop('Class', axis = 1)
        #y_train = x_train_processed.Class
        #X_test = x_test_processed.drop('Class', axis = 1)
        #y_test = x_test_processed.Class

        return x_train_processed, x_test_processed
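
The column alignment in `process_for_model` can be seen on a toy example. This is only a sketch with made-up rows: the test frame lacks one category, so `pd.get_dummies` produces fewer columns until it is reindexed against the training columns.

```python
import pandas as pd

train = pd.DataFrame({"menopause": ["premeno", "ge40", "lt40"]})
test = pd.DataFrame({"menopause": ["premeno", "ge40"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)  # has no 'menopause_lt40' column

# Align test to the training columns: missing columns are added as 0,
# and any column not seen in training would be dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

The trade-off is that a category appearing only in the test data is silently discarded, which is similar to what `OneHotEncoder(handle_unknown='ignore')` does.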






In [None]:
# Don't change
my_model = Module4_Model()

In [None]:
# Don't change
x_train_processed, y_train = my_model.preprocess_training_data(X_train, y_train)

In [None]:
print(x_train_processed.shape, y_train.shape)

(200, 39) (200,)


In [None]:
# Don't change
x_test_processed, y_test = my_model.preprocess_test_data(X_test, y_test)

In [None]:
# Ensuring both sets are the same shape
X_train,X_test = my_model.process_for_model(x_train_processed, x_test_processed)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(200, 39) (200,)
(86, 39) (86,)


In [None]:
# Create a model
# I chose the Random Forest Classifier. The first run uses standard parameters
random_forest = RandomForestClassifier(n_estimators=100)

In [None]:
# Train your model
random_forest.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [None]:
# use your model to make a prediction on unseen data
y_pred = random_forest.predict(X_test)
rf_acc = round(accuracy_score(y_test,y_pred),4)*100
print(f'{rf_acc}%')

72.09%


In [None]:
# At this point conduct randomized search cross validation to get a better grip on the parameters. Then switch to the other model before doing the same
# Some trial and error might make it better. Then demonstrate the pipeline version before moving toward NN

In [None]:
# run cross validation to get a mean idea of the performance
rfc_cv_score=cross_val_score(random_forest, X_train, y_train, cv=10, scoring='roc_auc')
# Cross validation shows a large variance in the ROC AUC of this model, with a mean roughly where I expected
print(f'The ROC AUC scores for the model over 10 folds:--\n {rfc_cv_score}\n')
print(f'The mean Cross Validation Score for the model:--\n{round(rfc_cv_score.mean()*100,2)}%')

The ROC AUC scores for the model over 10 folds:--
 [0.46428571 0.79761905 0.82142857 0.64880952 0.73809524 0.44047619
 0.69047619 0.79761905 0.45833333 0.54945055]

The mean Cross Validation Score for the model:--
64.07%
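
To put a number on the spread, the fold scores can be summarised with the standard library. This is a small sketch that simply copies the ten values from the cross-validation output above.

```python
from statistics import mean, stdev

# Scores copied from the cross-validation output above
scores = [0.46428571, 0.79761905, 0.82142857, 0.64880952, 0.73809524,
          0.44047619, 0.69047619, 0.79761905, 0.45833333, 0.54945055]

print(f"mean               = {mean(scores):.4f}")
print(f"standard deviation = {stdev(scores):.4f}")
print(f"range              = {min(scores):.4f} to {max(scores):.4f}")
```

With only 20 samples in each fold, a wide spread like this is not surprising.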


*I used randomised search cross validation to assess possible better parameters for random forest*

In [None]:
# I redefine the variables to avoid crossover.

X = data.iloc[:,0:9]
y = data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=42, shuffle=True)

x_train_processed, y_train = my_model.preprocess_training_data(X_train, y_train)

x_test_processed, y_test = my_model.preprocess_test_data(X_test, y_test)

X_train,X_test = my_model.process_for_model(x_train_processed, x_test_processed)

In [None]:
# number of trees in a random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
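# number of features to consider at every split
# ('auto' behaved like 'sqrt' for classifiers and was removed in scikit-learn 1.3, so newer versions accept only 'sqrt' here)
max_features = ['auto', 'sqrt']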
# maximum number of levels in the tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
# minimum number of samples to split a node
min_samples_split = [2,5,10]
# min samples required at each leaf node
min_samples_leaf = [1,2,4]
# method of selecting samples for training each tree
bootstrap = [True, False]

random_grid={'n_estimators':n_estimators,
             'max_features':max_features,
             'max_depth':max_depth,
             'min_samples_split':min_samples_split,
             'min_samples_leaf':min_samples_leaf,
             'bootstrap':bootstrap}

In [None]:
rf_random = RandomizedSearchCV(estimator=random_forest, param_distributions = random_grid, n_iter=100, cv=3, verbose=2, n_jobs=-1)

In [None]:
rf_random.fit(X_train,y_train)

In [None]:
# This shows the best parameters based on the variations tried.
best_fit = rf_random.best_params_
best_fit

{'bootstrap': True,
 'max_depth': 80,
 'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 1400}

In [None]:
random_cv = RandomForestClassifier(n_estimators=1400, max_depth=80, max_features='auto', min_samples_leaf=2,min_samples_split=2, bootstrap=True)

In [None]:
random_cv.fit(X_train,y_train)
# Run against unseen test data
y_pred = random_cv.predict(X_test)
random_rfacc = accuracy_score(y_test, y_pred)

In [None]:
print(f'Accuracy using the Randomized Search setting identified is {round(random_rfacc,4)*100}%')

Accuracy using the Randomized Search setting identified is 74.42%


In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[57  5]
 [17  7]]


0.7441860465116279

In [None]:
# Assess the accuracy of your model and explain your key findings

*The accuracy of the model is of a reasonable standard, but it varies by around 10 percentage points between training runs. Cross validation shows even wider variation, with ROC AUC scores ranging from the mid-40s to above 80%. The most common accuracy is around 70%, with the main hit to accuracy coming from a high number of false negatives, that is, a prediction of no recurrence where in reality recurrence was present.*

*My belief is that accuracy would be improved by a larger sample (286 rows is very low) and by a higher number of features that contribute to the target variable. No amount of tinkering with the hyperparameters can overcome this shortfall.*

*One factor I considered that may skew the data is age. If the data provided to models of this nature, for this purpose, includes individuals who are already more advanced in age, the likelihood of seeing recurrence is lower than for a younger person.*
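
The effect of the false negatives can be made concrete by deriving recall and precision from the confusion matrix printed for the tuned random forest. This sketch just re-uses those four counts; it assumes the usual scikit-learn layout of actual classes in rows and predicted classes in columns.

```python
# Confusion matrix from the tuned random forest above:
# rows are actual classes, columns are predicted classes
tn, fp = 57, 5
fn, tp = 17, 7

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)       # share of actual recurrences that were caught
precision = tp / (tp + fp)    # share of predicted recurrences that were correct

print(f"accuracy  = {accuracy:.3f}")
print(f"recall    = {recall:.3f}")
print(f"precision = {precision:.3f}")
```

Accuracy on its own hides the fact that most actual recurrences are missed, so recall is arguably the more informative figure for this problem.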

In [None]:
# ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

*I wanted to attempt a different encoding method to see if the result is different in terms of accuracy*

In [None]:
# For this method I need to ensure the values are all string values.
data.loc[:, 'tumor-size'] = data.loc[:, 'tumor-size'].map(lambda x:str(x))
data.loc[:, 'inv-nodes'] = data.loc[:, 'inv-nodes'].map(lambda x:str(x))

In [None]:
# Again, redefine the variables to avoid crossover
X = data.iloc[:,0:9]
y = data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=42, shuffle=True)

x_train_processed, y_train = my_model.preprocess_training_data(X_train, y_train)

x_test_processed, y_test = my_model.preprocess_test_data(X_test, y_test)

X_train,X_test = my_model.process_for_model(x_train_processed, x_test_processed)

In [None]:
le_data = data.copy()
categorical_mask = le_data.dtypes==object
categorical_cols = le_data.columns[categorical_mask].tolist()
le=LabelEncoder()
le_data[categorical_cols]=le_data[categorical_cols].apply(lambda col:le.fit_transform(col))

In [None]:
leX = le_data.iloc[:, 0:9]
ley = le_data.iloc[:, -1]
leX_train, leX_test, ley_train, ley_test = train_test_split(leX,ley, test_size = 0.3, random_state=42, shuffle=True)

In [None]:
le_rfc = RandomForestClassifier(n_estimators=100)
le_rfc.fit(leX_train,ley_train)
le_prediction = le_rfc.predict(leX_test)
# multiply before rounding to avoid floating-point noise in the printed percentage
print(f'Score using this method {round(le_rfc.score(leX_test,ley_test)*100,1)}%')

Score using this method 70.9%


In [None]:
# keep the fitted classifier and the cross validation scores in separate variables
le_cv_scores = cross_val_score(le_rfc, leX_train,ley_train, cv=10, scoring='roc_auc')
print(f'The mean ROC AUC for cross validation using the label encoder instead of Pandas dummy values is {round(le_cv_scores.mean()*100,1)}%')

The mean ROC AUC for cross validation using the label encoder instead of Pandas dummy values is 62.5%


*No real difference using the Label Encoder instead of dummy values*
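
For reference, the difference between the two encodings can be sketched without any library. `label_encode` and `one_hot` below are hypothetical helpers that mimic what `LabelEncoder` and `pd.get_dummies` produce for a single column; they are illustrative only.

```python
def label_encode(values):
    """Map each category to an integer, using sorted order like LabelEncoder."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values]

def one_hot(values):
    """Give each category its own 0/1 column, in sorted order."""
    classes = sorted(set(values))
    return [[int(v == c) for c in classes] for v in values]

column = ["premeno", "ge40", "lt40", "ge40"]
print(label_encode(column))
print(one_hot(column))
```

Label encoding implies an ordering between categories that one-hot encoding does not. Tree-based models split on thresholds, so they can often cope with either, which may be why the scores are so close here.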

*This is the method I'd prepared before looking at the template: I think it works better for the user and probably deals with the potential for irregular data better.*

In [None]:
# Define transformers for each data type with methods for handling unknown and missing data
# In the numeric_transformer I've opted for the most_frequent fill strategy, as that is my general method of handling missing data when doing it manually.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

numeric_features = data.select_dtypes(include=['int64', 'float64']).columns
categorical_features = data.select_dtypes(include=['object']).drop(['Class'], axis=1).columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

In [None]:
# Split and create the sample data for this method
pip_X = data.iloc[:,0:9]
pip_y = data.iloc[:, -1]
pip_X_train, pip_X_test, pip_y_train, pip_y_test = train_test_split(pip_X,pip_y, test_size=0.3, random_state=42, shuffle=True)

print(pip_X_train.shape, pip_y_train.shape)
print(pip_X_test.shape, pip_y_test.shape)

(200, 9) (200,)
(86, 9) (86,)


In [None]:
rf.fit(pip_X_train,pip_y_train)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='most_frequent',
                                                      

In [None]:
pip_y_pred = rf.predict(pip_X_test)

pip_rf_acc = round(rf.score(pip_X_test,pip_y_test)*100,2)

print(f'Accuracy of the pipeline method is {pip_rf_acc}%')

Accuracy of the pipeline method is 73.26%


In [None]:
cm = confusion_matrix(pip_y_test, pip_y_pred)
print(cm)
accuracy_score(pip_y_test, pip_y_pred)

[[51 11]
 [14 10]]


0.7093023255813954

In [None]:
# run cross validation to get a better overview of the scores
rfc_cv_score=cross_val_score(rf, pip_X_train, pip_y_train, cv=10, scoring='roc_auc')

In [None]:
# Cross validation shows a large variance in the ROC AUC of this model, with a mean roughly where I expected
print(f'The scores for the model over the 10 runs of the model:--\n {rfc_cv_score}\n')
print(f'The mean Cross Validation Score for the model:--\n{round(rfc_cv_score.mean()*100,2)}%')

The scores for the model over the 10 runs of the model:--
 [0.45238095 0.7797619  0.86309524 0.64880952 0.73809524 0.53571429
 0.68452381 0.88095238 0.4047619  0.53846154]

The mean Cross Validation Score for the model:--
65.27%


*My attempt at an Artificial Neural Network*

In [None]:
import tensorflow as tf

In [None]:
from tensorflow.keras.callbacks import LearningRateScheduler

In [None]:
tf.__version__

'2.2.0'

In [None]:
dataset = data.copy()
X = dataset.iloc[:, 0:9].values
y = dataset.iloc[:, -1].values

In [None]:
#encode each categorical column. I've done all but degmalig
le=LabelEncoder()
cols = [0,1,2,3,4,6,7,8]
for col in cols:
    X[:, col] = le.fit_transform(X[:, col])

In [None]:
y = le.fit_transform(y)

In [None]:
#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
# apply feature scaling to all the data; this is essential for deep learning
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# reuse the statistics learned from the training set; refitting on the test set would scale it differently
X_test = sc.transform(X_test)
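
A small sketch of the idea behind scaling the two sets consistently: the mean and standard deviation are learned from the training values only and then reused for the test values. `fit_scaler` and `transform` are hypothetical stand-ins for what `StandardScaler` does per column, and the numbers are made up.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardise values using statistics that were learned elsewhere."""
    return [(v - mu) / sigma for v in column]

train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]

mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)  # uses the training statistics

print(train_scaled)
print(test_scaled)
```

If the test set were scaled with its own statistics instead, the same raw value could end up with a different scaled value in each set.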

In [None]:
#initialising the ANN
#creating a sequence of layers as opposed to computational graph
ann = tf.keras.models.Sequential()

*adding the input layer and first hidden layer*

In [None]:
# add a fully connected layer to the NN. This can be done at any stage while building the model
ann.add(tf.keras.layers.Dense(units=6, activation = "relu")) # relu is the rectifier activation function
# units designates the number of hidden neurons in the layer
# there is no rule for how many neurons to use; it comes down to experimenting with the hyperparameters before training the model

In [None]:
#add the second layer. This is exactly the same as adding the first
ann.add(tf.keras.layers.Dense(units=6, activation = "relu"))

*adding the output layer*

In [None]:
# mostly the same, as we're adding a new layer again, just with slightly different settings as this is the output layer
# because we're doing classification on a binary target variable we only need one unit/neuron, with a sigmoid activation function
# this is only for the output layer; it gives the probability of the outcome rather than yes/no or 1/0
ann.add(tf.keras.layers.Dense(units = 1, activation = "sigmoid"))
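
As a quick illustration of what the sigmoid output means, the sketch below maps a raw score to a probability and then applies a 0.5 cut-off, as the prediction cells later in this notebook do. `sigmoid` and `to_label` are illustrative helpers, not Keras functions.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """Convert a probability into a 0/1 class using a strict threshold."""
    return int(probability > threshold)

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(z, round(p, 3), to_label(p))
```

Lowering the threshold would be one way to trade some false positives for fewer false negatives.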

*compiling the ANN*

In [None]:
# loss measures the difference between the prediction and the real result
# for binary classification you must use binary_crossentropy. For non-binary targets it should be categorical_crossentropy, and
# the activation on the output layer should be "softmax"
ann.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"]) # metrics can take multiple values, so they are passed as a list
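
To show what the binary cross-entropy loss is measuring, here is a plain-Python sketch of the formula. The function name and the clipping constant `eps` are choices made for this example rather than the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

The loss grows quickly as a confident prediction moves away from the true label, which is what drives the weight updates during training.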

*training the ANN*


In [None]:
#batch size is commonly 32. training is conducted over a number of epochs which you also must define
ann.fit(X_train, y_train, batch_size = 32, epochs = 100)

*predicting the test results*

In [None]:
y_pred = ann.predict(X_test)
y_pred = (y_pred >0.5)
#shows the predicted results against the actual test results
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

*create confusion matrix to show actual accuracy of the network*

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
# Not a great outcome. The network predicts no recurrence for every test sample, so all 32 actual recurrences are false negatives.

[[54  0]
 [32  0]]


0.627906976744186
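
One way to read this result is to compare it with a baseline that always predicts the majority class. The sketch below rebuilds the test labels from the row totals of the confusion matrix above (54 and 32) and computes that baseline.

```python
from collections import Counter

# Actual class counts taken from the rows of the confusion matrix above
y_test_labels = [0] * 54 + [1] * 32

majority_class, majority_count = Counter(y_test_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(y_test_labels)

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline matches the accuracy reported above, which suggests the network has learned little beyond the class balance at this point.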

*Change the learning rates*

In [None]:
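# redefine the samples and re-apply the scaling, since the split returns unscaled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)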

In [None]:
def step_decay_schedule(initial_lr=1e-3, decay_factor=0.75, step_size=10):


    def schedule(epoch):
        return initial_lr * (decay_factor ** np.floor(epoch/step_size))
    
    return LearningRateScheduler(schedule)

lr_sched = step_decay_schedule(initial_lr=1e-4, decay_factor=0.75, step_size=2)

ann.fit(X_train, y_train, batch_size=32, epochs=100, callbacks=[lr_sched])
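
To see how quickly this schedule shrinks the learning rate, the same formula can be evaluated on its own. The sketch uses integer division in place of `np.floor`, which is equivalent for non-negative whole epochs, and the defaults mirror the values passed above.

```python
def step_decay(epoch, initial_lr=1e-4, decay_factor=0.75, step_size=2):
    """Learning rate for a given epoch: multiplied by decay_factor every step_size epochs."""
    return initial_lr * decay_factor ** (epoch // step_size)

for epoch in (0, 1, 2, 10, 99):
    print(epoch, f"{step_decay(epoch):.4e}")
```

By the final epoch the rate is roughly 7.55e-11, so the later epochs barely change the weights. A larger `step_size` or a `decay_factor` closer to 1 would decay more gently.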

In [None]:
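# 7.5510e-11 is the learning rate the schedule above reaches by the final epoch
opt = tf.keras.optimizers.Adam(learning_rate=7.55e-11)
# pass the optimizer object itself; the string "adam" would ignore `opt` and use the default learning rate
ann.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])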

In [None]:
ann.fit(X_train, y_train, batch_size=32, epochs=100)

In [None]:
y_pred = ann.predict(X_test)
y_pred = (y_pred>0.5)
print(f'The ANN produced an accuracy score of {round(accuracy_score(y_test, y_pred)*100,1)}%')

The ANN produced an accuracy score of 65.1%


*As with the ML model there is not a huge degree of accuracy, despite efforts to narrow down the hyperparameters. I believe this to be down to the sample size of the original data and the number of contributory categories within.*