# PCR Classification Test File

This notebook lets the assessor test our classification model on unseen test data. Set the paths in the cell below to match your files, then run all cells in order.

In [1]:
TEST_PATH = "../testDataSetExample.xls" # path to the unseen dataset
TRAIN_PATH = "../trainDataset.csv" # path to the provided csv we uploaded to moodle
RESULTS_PATH = "../resultsPCR.csv" # path to the location in which you wish to save the results

In [2]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler 
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC 

This notebook reads the test file with `pd.read_excel`, which can only open `.xls` files if the xlrd dependency is installed:

`$ conda install xlrd`

## Model Training

Prepare training data. This process is explained in more detail in other files.

In [3]:
df = pd.read_csv(TRAIN_PATH)

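# Drop every row that contains the value 999
df = df[~(df == 999).any(axis=1)]
df.drop('ID', axis=1, inplace=True)
# Rename the image feature columns (position 12 onwards) to img_<position>
df.columns = list(df.columns[:12]) + ['img_' + str(i) for i in range(12, len(df.columns))]

df = df.reset_index()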

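threshold = 4 # standard deviations above the mean
# Replace high outliers in each image column with that column's mean.
# start=13 keeps i aligned with the column's position in df.
for i, col_name in enumerate(df.columns[13:], start=13):
    col = df[col_name]
    mean = np.mean(col)
    std = np.std(col)
    for j, x in enumerate(col):
        z = (x - mean) / std
        if z > threshold:
            df.iloc[j, i] = mean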

# Assign features to X (drop both outcomes, TrippleNegative and the helper index column)
X = df.drop(['pCR (outcome)', 'RelapseFreeSurvival (outcome)', 'TrippleNegative', 'index'], axis=1)

# Assign labels to y
y = df['pCR (outcome)']
X.head()

# Standardize the training features
scaler = StandardScaler()
Xs = scaler.fit_transform(X) 
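
The outlier replacement above can also be written without the inner loop. The following is a minimal sketch on made-up data, not the notebook's dataset; the column names `img_a` and `img_b` and the helper `replace_high_outliers` are hypothetical.

```python
import numpy as np
import pandas as pd

def replace_high_outliers(frame, columns, threshold=4.0):
    """Return a copy in which values more than `threshold` population
    standard deviations above the column mean are replaced by that mean."""
    out = frame.copy()
    for name in columns:
        col = out[name].astype(float)
        mean = col.mean()
        std = col.std(ddof=0)  # ddof=0 matches np.std
        if std == 0:
            continue  # constant column: nothing can be an outlier
        z = (col - mean) / std
        # Keep values whose z-score is within the threshold, else use the mean
        out[name] = col.where(z <= threshold, mean)
    return out

# Hypothetical toy data: one extreme value in img_a
toy = pd.DataFrame({
    "img_a": [1.0] * 29 + [100.0],
    "img_b": np.linspace(0.0, 1.0, 30),
})
cleaned = replace_high_outliers(toy, ["img_a", "img_b"])
print(cleaned["img_a"].max())
```

Selecting columns by name avoids having to keep a positional offset in sync with the frame layout.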

Apply PCA to the image features, choosing the number of components so that 90% of the variance is retained, then keep the first 10 of those components.

In [4]:
feature_names = list(X.columns) 
pca = PCA(n_components=0.90) # retain 90% of variance
img_pca = pca.fit_transform(Xs[:,9:]) # 9 is the position of img_12 once index, TrippleNegative, pCR (outcome) and RelapseFreeSurvival (outcome) are removed
print("N components:",pca.n_components_)

keep = 10
img_pca = img_pca[:,:keep] # retain the first 10 components
cols = ['pca_' + str(i+1) for i in range(keep)]
df_img_pca = pd.DataFrame(img_pca, columns=cols)

col_names = X.columns[:9] # columns before position 9 are the non-image features
Xs_pca = pd.concat([pd.DataFrame(Xs[:,:9], columns=col_names), df_img_pca], axis=1)

N components: 14
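
When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction. Truncating the result further, as done with `keep` above, therefore retains less than the requested fraction. A minimal sketch on synthetic data (the array shapes and the names `pca_90` and `keep_demo` are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 20 correlated features built from 5 latent factors
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 20))
data = latent @ mixing + 0.1 * rng.normal(size=(200, 20))

pca_90 = PCA(n_components=0.90)  # float in (0, 1): keep enough components for 90% variance
scores = pca_90.fit_transform(data)

cumulative = np.cumsum(pca_90.explained_variance_ratio_)
print("components kept:", pca_90.n_components_)
print("variance retained:", cumulative[-1])

# Keeping only the first few components lowers the retained variance
keep_demo = min(2, pca_90.n_components_)
print("variance in first", keep_demo, "components:", cumulative[keep_demo - 1])
```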


Use SMOTE (synthetic minority oversampling) to boost the population of the minority class in the training data.

In [5]:
Xs_pca_resampled, y_resampled = SMOTE().fit_resample(Xs_pca, y)
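
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates that idea with NumPy only. It is a simplified stand-in, not imblearn's implementation, and `smote_like_oversample` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority
    samples and their k nearest minority neighbours (needs >= 2 rows)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random minority sample
    pick = rng.integers(0, k, size=n_new)   # one of its k neighbours
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # position along the connecting segment
    return X_min[base] + gap * (X_min[partner] - X_min[base])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(minority, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region already occupied by the minority class.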

Train the model with the optimal parameters found in testing (SVM: C=50, gamma=0.1, kernel=RBF).

In [6]:
model = SVC(C=50, gamma=0.1, kernel="rbf")
model.fit(Xs_pca_resampled, y_resampled)

SVC(C=50, gamma=0.1)
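
The parameter search itself is not part of this notebook. One common way to run such a search is a cross-validated grid search. The following is a sketch on synthetic data under that assumption; the grid values and the names `demo_X`, `demo_y` and `search` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in for the resampled training data
demo_X, demo_y = make_classification(n_samples=120, n_features=6, random_state=0)

# Candidate values to try; every combination is scored with 3-fold cross-validation
param_grid = {"C": [1, 10, 50], "gamma": [0.01, 0.1], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(demo_X, demo_y)

print(search.best_params_)
```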

## Testing

Prepare testing data in the same manner as training data, not including SMOTE.

In [7]:
test_df = pd.read_excel(TEST_PATH)

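# Drop every row that contains the value 999
test_df = test_df[~(test_df == 999).any(axis=1)]
ids = test_df.iloc[:,0] # keep the IDs of the remaining rows for the results file
test_df.drop('ID', axis=1, inplace=True)
# Rename the columns from position 12 onwards to img_<position>
test_df.columns = list(test_df.columns[:12]) + ['img_' + str(i) for i in range(12, len(test_df.columns))]

test_df = test_df.reset_index()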

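threshold = 4 # standard deviations above the mean
# Replace high outliers in each of these columns with that column's mean.
# start=13 keeps i aligned with the column's position in test_df.
for i, col_name in enumerate(test_df.columns[13:], start=13):
    col = test_df[col_name]
    mean = np.mean(col)
    std = np.std(col)
    for j, x in enumerate(col):
        z = (x - mean) / std
        if z > threshold:
            test_df.iloc[j, i] = mean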

# Assign features to X
X = test_df.drop(['TrippleNegative', 'index'], axis=1)

# Scale test data with the scaler fitted on the training data
Xs = scaler.transform(X.to_numpy()) # plain array: only the column order has to match training


In [8]:
# Project the test image features onto the PCA components fitted on the training data
feature_names = list(X.columns) 
img_pca = pca.transform(Xs[:,9:]) # 9 is the position of the first image feature once index and TrippleNegative are removed
print("N components:",pca.n_components_)

keep = 10
img_pca = img_pca[:,:keep] # retain the first 10 components
cols = ['pca_' + str(i+1) for i in range(keep)]
df_img_pca = pd.DataFrame(img_pca, columns=cols)

col_names = X.columns[:9] # columns before position 9 are the non-image features
Xs_pca = pd.concat([pd.DataFrame(Xs[:,:9], columns=col_names), df_img_pca], axis=1)

N components: 14
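
Preparing test data in the same manner as training data generally means applying transformers that were fitted on the training set, so that both sets share one scale and one set of principal axes. A minimal sketch of that pattern on synthetic arrays (all names and shapes here are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the training and unseen feature matrices
train_features = rng.normal(loc=5.0, scale=2.0, size=(100, 8))
test_features = rng.normal(loc=5.0, scale=2.0, size=(20, 8))

# Fit the transformers on training data only
demo_scaler = StandardScaler().fit(train_features)
demo_pca = PCA(n_components=3).fit(demo_scaler.transform(train_features))

# Apply the already fitted transformers to the unseen data
test_scaled = demo_scaler.transform(test_features)
test_components = demo_pca.transform(test_scaled)

print(test_components.shape)  # (20, 3)
```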


In [22]:
y_predict = model.predict(Xs_pca)

output = pd.DataFrame(columns=['ID', 'PCR (Prediction)'])
output['ID'] = ids
output['PCR (Prediction)'] = y_predict

output.to_csv(RESULTS_PATH)
