### OBJECTIVE
You are trying to determine the 7-year survival of prostate cancer patients. A patient survived if they are still alive 7 years after diagnosis. This means that a patient is counted as dead whether or not the death was due to their cancer. You have been given details about the patients and their cancers to help you with your prediction.


### Model Selection: Random Forest Classifier in Scikit-Learn
Machine Learning techniques can be used for this modeling problem. Bearing in mind that this is ultimately a classification problem, a Random Forest Classifier was deemed to be the most favourable approach because:

1. A Logistic Regression model assumes a linear relationship between the features and the log-odds of the target. However, a quick exploratory analysis showed little linear correlation between the columns and the target ('survival_7_years'). Ensemble methods like Random Forest make no such assumption. Also, Logistic Regression needs more care with categorical (binary) features, and we expect to have some in this dataset.
2. Support Vector Machine is inefficient to train, and would need normalising across the data set. This could be fun to explore, but, given the time constraint, a random forest classifier would do the job without requiring those extra manipulations. 
3. Gradient Boosted Decision Trees could perform better, but they are prone to over-fitting and need many of their hyperparameters tuned carefully to perform well. Again, this is something interesting to explore, but for the sake of time, Random Forest could do the job without the fuss.

However, after modelling, Gradient Boosted Decision Trees outperformed Random Forests by about 2 percentage points of cross-validated accuracy, making them a more suitable choice for this project.
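The quick exploratory check mentioned in point 1 could look something like the sketch below. It uses a small made-up frame (`toy`) in place of `raw_train`, so the numbers it prints say nothing about the real data; only the pattern is meant to carry over.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values).
toy = pd.DataFrame({
    'age': [86, 66, 84, 86, 78, 70],
    'gleason_score': [4, 8, 9, 8, 8, 6],
    'survival_7_years': [0, 0, 1, 0, 0, 1],
})

# Pearson correlation of each numeric column with the target,
# ordered from strongest to weakest in absolute value.
corr_with_target = (
    toy.corr()['survival_7_years']
       .drop('survival_7_years')
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so low values here do not rule out a non-linear relationship that a tree-based model could still exploit.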

### General Strategy
1. Check the target column to ensure that the data is balanced. If it is unbalanced, i.e., the ratio of positive to negative classes is skewed, then sampling measures have to be taken to improve the model's sensitivity.


2. Clean the data: Because scikit-learn's ensemble methods do not deal well with missing values (NaN), values will have to be imputed where missing. Note: It will be interesting to compare the performance of a model with imputed data to that of an R model, which handles missing values.

3. Make dummy columns for categorical variables (such as the symptoms column), so the Random Forest can handle them. 

4. Train and retrain the model, using k-fold cross validation, as well as feature selection methods, to minimise bias and variance.

5. Iterate, based on performance. If satisfactory, then use the fitted model to make predictions on the test data provided.

6. Take a deep breath, stop overthinking scenarios, and step away from the problem :)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

raw_train = pd.read_csv('training_data.csv')
test_df = pd.read_csv('(name)_score.csv')
output_df = pd.read_csv('(name)_score.csv')
raw_train.head()

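| | id | diagnosis_date | gleason_score | t_score | n_score | m_score | stage | age | race | height | ... | symptoms | rd_thrpy | h_thrpy | chm_thrpy | cry_thrpy | brch_thrpy | rad_rem | multi_thrpy | survival_1_year | survival_7_years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Jun-05 | 4.0 | T1c | N0 | M0 | I | 86.0 | 4.0 | 66.0 | ... | U03 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 2 | Feb-06 | 8.0 | T3a | N1 | M0 | IV | 66.0 | 2.0 | 70.0 | ... | U06,S07 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 3 | Mar-06 | 9.0 | T1a | N0 | M0 | IIB | 84.0 | 4.0 | 69.0 | ... | U01,U02,U03,S10 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 3 | 4 | Feb-05 | 8.0 | T2b | N0 | M0 | IIB | 86.0 | 3.0 | 69.0 | ... | U01,U02,S10,O11 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 5 | Dec-01 | 8.0 | T4 | N0 | M0 | IV | 78.0 | 4.0 | 70.0 | ... | U01,U03,U05,S07 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |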


In [2]:
raw_train.loc[raw_train['survival_7_years'] == 1].shape[0]/raw_train.shape[0]

0.43230419239519013

With over 43% in the positive class, this dataset is not imbalanced, and can be trained without sampling.
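A slightly more readable way to inspect the class balance is sketched below, using a made-up target series (`toy_target`) rather than the real column. For a 0/1 target, the mean is the share of the positive class, and `value_counts(normalize=True)` shows both classes at once.

```python
import pandas as pd

# Toy 0/1 target (hypothetical values), standing in for raw_train['survival_7_years'].
toy_target = pd.Series([0, 0, 1, 0, 0, 1, 1, 0], name='survival_7_years')

# Share of each class, as fractions that sum to 1.
class_share = toy_target.value_counts(normalize=True)

# For a binary 0/1 column the mean equals the positive-class share.
positive_share = toy_target.mean()

print(class_share)
print(positive_share)
```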

### Cleaning Data
Eyeballing the dataset, there are missing values in various columns. One method would be to simply drop all rows containing missing values.

In [2]:
a_check = raw_train.dropna()
a_check.shape[0]/raw_train.shape[0]

0.15924601884952877

However, the simple calculation above shows that dropping every row with a missing value would leave us with only 16% of our data for training a model. There are some issues with this:
1. The data that we will be testing our model on might not be perfect, and could have a lot of missing values. The model is expected to make predictions for the entire data set, not just cherry-picked, perfect rows.
2. Even if the data we are testing is perfect, the trained model used only 16% of the data available to us, and will not be as accurate as a case where a model is built based on a larger data set. 

Bearing this in mind, it is prudent to clean the entire dataset before proceeding. 

In [3]:
columns = list(raw_train)
explore_table = pd.DataFrame()
explore_table['columns'] = pd.Series(columns)
for i in range(len(columns)):
    # Percentage of missing values per column, in the training and test sets.
    # isnull().mean() is the fraction of rows where the column is NaN.
    explore_table.loc[i, 'missing values train %'] = raw_train[columns[i]].isnull().mean() * 100
    explore_table.loc[i, 'missing values test %'] = test_df[columns[i]].isnull().mean() * 100
explore_table.head(len(columns))

| | columns | missing values train % | missing values test % |
|---|---|---|---|
| 0 | id | 0.0 | 0.0 |
| 1 | diagnosis_date | 0.0 | 0.0 |
| 2 | gleason_score | 2.079948 | 2.072674 |
| 3 | t_score | 0.0 | 0.0 |
| 4 | n_score | 0.0 | 0.0 |
| 5 | m_score | 0.0 | 0.0 |
| 6 | stage | 0.0 | 0.0 |
| 7 | age | 4.861878 | 5.619634 |
| 8 | race | 1.072473 | 1.049345 |
| 9 | height | 8.865778 | 9.045183 |


From the table above, we can see that both the test and training data have missing values in the same columns, except for 'survival_1_year' and 'survival_7_years', where the test set has a lot of missing values.
#### Take-aways:
1. There is clearly some correlation between a patient surviving the first year and a patient surviving 7 years, as a patient dead after 1 year is clearly dead after 7. However, with 50% of this data missing in the test set, we will either (a) have to ignore this as a feature in our model, or (b) predict 1st year survival, and then use it as a feature. To err on the side of caution, if our model performs fairly well, this column should be ignored.

2. While the missing values for weight and height are close to 10% in both cases, there is a correlation between weight and height. We can use this to impute values for weight and height, and keep them as features. However, it is likely the BMI relationship between weight and height that is relevant to a patient's survival, so imputing either value might distort it.

3. tumor 6 months, psa 6 months, tumor 1 year, psa 1 year, and psa diagnosis should be left out as they have a significant number of missing values.

4. While family history, tea, smoker, previous cancer and first degree history have ~10% missing values, their percentages of missing values are the same! The patients who have missing values in any of these fields have missing values in ALL of them. This implies that they are not missing at random, and a categorical value can be assigned where the values are NaN in these columns.

5. For all other missing value columns, imputation techniques can be used.
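The observation in take-away 4 (the lifestyle fields are missing together or not at all) can be checked directly rather than inferred from matching percentages. The sketch below does so on a made-up frame (`toy`); the column list is taken from the take-away, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

lifestyle_cols = ['family_history', 'first_degree_history',
                  'previous_cancer', 'smoker', 'tea']

# Toy frame (hypothetical values): rows 1 and 3 lack every lifestyle field.
toy = pd.DataFrame({
    'family_history':       [0, np.nan, 1, np.nan],
    'first_degree_history': [0, np.nan, 0, np.nan],
    'previous_cancer':      [1, np.nan, 0, np.nan],
    'smoker':               [0, np.nan, 1, np.nan],
    'tea':                  [2, np.nan, 0, np.nan],
})

missing = toy[lifestyle_cols].isnull()
any_missing = missing.any(axis=1)   # at least one field missing in the row
all_missing = missing.all(axis=1)   # every field missing in the row

# If the two masks agree on every row, the fields are only ever missing together.
missing_together = bool((any_missing == all_missing).all())
print(missing_together)
```

On the real data the same check would be run on `raw_train[lifestyle_cols]`; equal percentages alone do not guarantee that the same rows are affected.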


#### Imputation:

In [4]:

missing_type_a = ['gleason_score','age','tumor_diagnosis']
missing_type_b = ['tea','race','family_history', 'first_degree_history', 'previous_cancer', 'smoker']


### Use medians to fill in missing type a
for i in missing_type_a:
    med = raw_train[i].median()  # median() skips NaN by default
    raw_train[i] = raw_train[i].fillna(med)

### Flag missing type b with a sentinel category
for i in missing_type_b:
    raw_train[i] = raw_train[i].fillna(-999)
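One variation worth considering, which is not what this notebook does, is to compute the medians once on the training data and reuse them for the test data, so that both sets are filled with the same values. A minimal sketch on made-up frames (`toy_train`, `toy_test`) follows.

```python
import numpy as np
import pandas as pd

numeric_cols = ['gleason_score', 'age']

# Toy frames (hypothetical values) standing in for raw_train and test_df.
toy_train = pd.DataFrame({
    'gleason_score': [4.0, 8.0, np.nan, 9.0],
    'age':           [86.0, np.nan, 84.0, 66.0],
})
toy_test = pd.DataFrame({
    'gleason_score': [np.nan, 7.0],
    'age':           [70.0, np.nan],
})

# Medians come from the training data only.
train_medians = toy_train[numeric_cols].median()

# fillna with a Series fills each column with the value under the matching label.
toy_train[numeric_cols] = toy_train[numeric_cols].fillna(train_medians)
toy_test[numeric_cols] = toy_test[numeric_cols].fillna(train_medians)

print(toy_test)
```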


### Build the Training Set & Engineer Features

In [5]:
training_set = pd.DataFrame()
training_set['id']=raw_train['id']

#### Engineer the Categorical Columns, and add non-categorical features:

A lot of the columns in the dataset are categorical, especially the symptoms column, which holds a comma-separated list of codes. One way to handle such categorical data with ensemble methods is to create a dummy column for each category, with a '1' indicating positive for that category, and '0' otherwise.
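To make the two kinds of dummy encoding concrete, here is a small sketch on a made-up frame (`toy`). `pd.get_dummies` handles a single-valued column such as the stage, while `Series.str.get_dummies(sep=',')` splits a comma-separated column such as the symptoms into one indicator column per code.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one single-valued and one multi-valued column.
toy = pd.DataFrame({
    'stage': ['I', 'IV', 'IIB'],
    'symptoms': ['U03', 'U06,S07', 'U01,U03'],
})

# One indicator column per stage, named stage_<value>.
stage_dummies = pd.get_dummies(toy['stage'], prefix='stage')

# One indicator column per symptom code found anywhere in the column.
symptom_dummies = toy['symptoms'].str.get_dummies(sep=',')

encoded = pd.concat([stage_dummies, symptom_dummies], axis=1)
print(encoded.columns.tolist())
```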

In [None]:

categorical_columns = ['t_score', 'n_score', 'm_score', 'stage','race','family_history', 'first_degree_history', 
                       'previous_cancer', 'smoker', 'side','rd_thrpy', 'h_thrpy', 
                       'chm_thrpy','cry_thrpy', 'brch_thrpy', 'rad_rem', 'multi_thrpy']
other_columns = ['gleason_score','age','tumor_diagnosis','tea']
for i in categorical_columns:
    training_set = pd.concat([training_set, pd.get_dummies(raw_train[i], prefix=i)], axis = 1)
for j in other_columns:
    training_set = pd.concat([training_set, raw_train[j]], axis = 1)
training_set = pd.concat([training_set, raw_train['symptoms'].str.get_dummies(sep=',')], axis=1)
training_set = pd.concat([training_set, raw_train['survival_7_years']], axis = 1)
training_set.set_index('id')  # displayed only; the result is not assigned, so 'id' stays a column


In [7]:
features = list(training_set.columns)
features.remove('id')
features.remove('survival_7_years')
len(features)

83

We have 83 features with which to train our random forest model.

### Verifying The Model

In [8]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)

from sklearn.model_selection import KFold
# 10-fold cross-validation over shuffled rows
kf = KFold(n_splits=10, shuffle=True)

results = []

for traincv, testcv in kf.split(training_set):
    train = training_set.iloc[traincv] # Extract train data with cv indices
    valid = training_set.iloc[testcv] # Extract valid data with cv indices
    model = clf.fit(X = train[features], y = train['survival_7_years'])
    score = clf.score(X = valid[features], y = valid['survival_7_years'])
    results.append(score)

print(results)
print(sum(results)/len(results))



[0.62897985705003245, 0.59974009096816117, 0.63872644574398962, 0.64262508122157247, 0.61663417803768683, 0.61313394018205458, 0.61833550065019505, 0.63068920676202855, 0.63459037711313393, 0.6248374512353706]
0.624829212896


In [9]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100)

from sklearn.model_selection import KFold
# 10-fold cross-validation over shuffled rows
kf = KFold(n_splits=10, shuffle=True)

results = []

for traincv, testcv in kf.split(training_set):
    train = training_set.iloc[traincv] # Extract train data with cv indices
    valid = training_set.iloc[testcv] # Extract valid data with cv indices
    model = clf.fit(X = train[features], y = train['survival_7_years'])
    score = clf.score(X = valid[features], y = valid['survival_7_years'])
    results.append(score)

print(results)
print(sum(results)/len(results))

[0.64912280701754388, 0.63807667316439243, 0.65497076023391809, 0.66016894087069522, 0.63807667316439243, 0.64434330299089726, 0.64759427828348504, 0.62873862158647598, 0.64954486345903772, 0.64304291287386217]
0.645367983364


The model does not perform as well as we would like. Still, compared with a logistic model, an SVM, a neural network, an extra trees classifier and other ensemble methods, the random forest classifier scores higher. Gradient Boosting does slightly better again, by about 2 percentage points (mean accuracy of 0.645 versus 0.625).
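Because the two runs above shuffle their folds independently, part of the gap between the models could come from the split itself. A hedged sketch of a like-for-like comparison is shown below: both models are scored on the same seeded folds with `cross_val_score`. The data is synthetic (`make_classification`), and the fold count, tree count and seeds are illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for training_set[features] and the target.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A seeded splitter yields the same folds every time it is iterated,
# so both models are evaluated on identical train/validation splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'gradient_boosting': GradientBoostingClassifier(n_estimators=50, random_state=0),
}

mean_accuracy = {
    name: cross_val_score(model, X, y, cv=cv, scoring='accuracy').mean()
    for name, model in models.items()
}
print(mean_accuracy)
```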

### Making predictions
Using the same process as for the training set, we can engineer features for the test set (the set to be scored).

In [10]:
### Imputation

### Use medians to fill in missing type a
for i in missing_type_a:
    med = test_df[i].median()  # median() skips NaN by default
    test_df[i] = test_df[i].fillna(med)

### Flag missing type b with a sentinel category
for i in missing_type_b:
    test_df[i] = test_df[i].fillna(-999)
    
### Feature Engineering

test_set = pd.DataFrame()
test_set['id']=test_df['id']

for i in categorical_columns:
    test_set = pd.concat([test_set, pd.get_dummies(test_df[i], prefix=i)], axis = 1)
for j in other_columns:
    test_set = pd.concat([test_set, test_df[j]], axis = 1)
test_set = pd.concat([test_set, test_df['symptoms'].str.get_dummies(sep=',')], axis=1)
test_set = pd.concat([test_set, test_df['survival_7_years']], axis = 1)

test_set.set_index('id')  

### get feature columns (read from training_set here, so this restates the training feature count)
features_test = list(training_set.columns)
features_test.remove('id')
features_test.remove('survival_7_years')
len(features_test)



83
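Dummy columns are created from whatever categories appear in each file, so the test set can end up with fewer or more columns than the training set. One possible way to check and align them is sketched below on a made-up frame (`toy_test`) and a hypothetical feature list (`train_features`): set differences report the mismatches, and `reindex` puts the columns in the training order, filling absent dummies with 0 and dropping extras.

```python
import pandas as pd

# Hypothetical training feature list, standing in for `features`.
train_features = ['stage_I', 'stage_IIB', 'stage_IV', 'age']

# Toy test frame (hypothetical values): lacks stage_IIB, has an unseen stage_III.
toy_test = pd.DataFrame({
    'stage_I':   [1, 0],
    'stage_IV':  [0, 1],
    'stage_III': [0, 0],
    'age':       [70.0, 66.0],
})

missing_in_test = sorted(set(train_features) - set(toy_test.columns))
extra_in_test = sorted(set(toy_test.columns) - set(train_features))
print(missing_in_test, extra_in_test)

# Same columns, same order as training; absent dummy columns become all zeros.
aligned = toy_test.reindex(columns=train_features, fill_value=0)
print(aligned)
```

Filling with 0 is only appropriate for indicator columns; a missing numeric feature would need a different treatment.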

### Train & Fit Final model

In [11]:
from sklearn.ensemble import GradientBoostingClassifier
X_train = training_set[features]
X_test = test_set[features]
y_train = training_set['survival_7_years']

model = GradientBoostingClassifier(n_estimators=100)

model.fit(X_train,y_train)

test_df['survival_7_years'] = model.predict(X_test)
output_df['survival_7_years'] = test_df['survival_7_years']

In [12]:
output_df.to_csv('KaykayEssien_score.csv')