### Intro

In this notebook I implement tree-based models for predicting March blood donation from the Blood Transfusion Service Center data set. Previously I created logistic regression models, which can be found [here](https://github.com/sundodger97/BloodDonors/blob/master/Analysis%20and%20Logistic%20Model.ipynb). That notebook also contains the data exploration and the problem statement.



### 1. Getting the Data Set

Original Data Set: 
https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center

DrivenData.org has the data split into train and test: 
https://www.drivendata.org/competitions/2/warm-up-predict-blood-donations/data/

The data contains four continuous predictors and one binary outcome:

* Recency - months since last donation 

* Frequency - total number of donations

* Monetary - total blood donated in c.c. 

* Time - months since first donation

* A target variable representing whether the person donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating).

In [1]:
import pandas as pd
import numpy as np

In [36]:
df = pd.read_csv('data/train.csv')
df.head(2)

Unnamed: 0.1,Unnamed: 0,Months since Last Donation,Number of Donations,Total Volume Donated (c.c.),Months since First Donation,Made Donation in March 2007
0,619,2,50,12500,98,1
1,664,0,13,3250,28,1


In [39]:
df.columns=['ID','Recency','Donations','Monetary','Time','Target']

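# ID is not needed for training the model. Monetary is dropped as well: in the rows
# shown above it is exactly 250 c.c. times the number of donations, so it is redundant.
df = df.drop(['ID','Monetary'],axis=1)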
df.head(3)

Unnamed: 0,Recency,Donations,Time,Target
0,2,50,98,1
1,0,13,28,1
2,1,16,35,1
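
Dropping `Monetary` loses no information if every donation has the same volume. A quick check of that idea, sketched here on only the two rows displayed by `df.head(2)` earlier (the name `sample` is made up for this illustration; running the same ratio on the full frame before the drop would be the real test):

```python
import pandas as pd

# The two rows printed by df.head(2); nothing else from the data set is assumed.
sample = pd.DataFrame({"Donations": [50, 13], "Monetary": [12500, 3250]})

# Volume per donation: a single distinct value means Monetary is a rescaled Donations.
cc_per_donation = sample["Monetary"] / sample["Donations"]
print(cc_per_donation.unique())
```

If the ratio is constant across the whole data set, the two columns are perfectly collinear and a tree gains nothing from seeing both.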


### 2. Feature Creation

These are the same four engineered features used in the first logistic regression model.

In [40]:
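def split_space(X,Y,xlabel,ylabel,data):
    """Split rows by the line through (X[0],Y[0]) and (X[1],Y[1]); returns (above, below)."""
    slope = (Y[1]-Y[0])/(X[1]-X[0])
    line = slope*(data[xlabel]-X[0])+Y[0]
    # rows that fall exactly on the line count as below
    return (data[ylabel]>line, data[ylabel]<=line)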

In [41]:
# Feature 1 & 2: Recent multi-donors who fall above (or below) a minimum rate of donations over time. See plot 2

condition = (df.Donations>2)&(df.Recency<9)
x_vals,y_vals = ((29,100),(2,19))
x_label,y_label = ('Time','Donations')

above, below = split_space(x_vals,y_vals,x_label,y_label,df)

df['F1'] = (condition&above).astype(int)
df['F2'] = (condition&below).astype(int)
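
To make the geometry concrete, here is a plain-Python sketch of the same rule applied to one donor at a time. The helper names `line_height` and `f1_f2` are made up for this illustration; the endpoints and thresholds are the ones used in the cell above.

```python
def line_height(time, p0=(29, 2), p1=(100, 19)):
    """Height of the line through p0 and p1, given as (Time, Donations), at a Time."""
    slope = (p1[1] - p0[1]) / (p1[0] - p0[0])
    return slope * (time - p0[0]) + p0[1]


def f1_f2(recency, donations, time):
    """Return (F1, F2) for one donor, mirroring the vectorised cell above."""
    recent_multi_donor = donations > 2 and recency < 9
    above = donations > line_height(time)
    return int(recent_multi_donor and above), int(recent_multi_donor and not above)


# First two rows of the training data as displayed earlier: (Recency, Donations, Time)
for row in [(2, 50, 98), (0, 13, 28)]:
    print(row, f1_f2(*row))
```

Both donors are recent multi-donors who sit above the line, so each gets F1 = 1 and F2 = 0. A donor who fails the recency or donation-count condition gets 0 for both features.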

In [42]:
# Feature 3: Multi-donors who donated 2 or 4 months ago. See plot 3

condition = (df.Donations>1)&df.Recency.isin([2,4])

df['F3'] = condition.astype(int)

In [43]:
# Feature 4: Donations per month since first donation. 

df['F4']=df.Donations/df.Time

In [44]:
df=df[['Recency', 'Donations', 'Time','F1','F2', 'F3','F4','Target']]

df.head(2)

Unnamed: 0,Recency,Donations,Time,F1,F2,F3,F4,Target
0,2,50,98,1,0,1,0.510204,1
1,0,13,28,1,0,0,0.464286,1
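
The F3 and F4 values in this output can be checked by hand. A small sketch that rebuilds just these two donors (the frame `sample` is made up for this illustration and is not part of the notebook's pipeline):

```python
import pandas as pd

# The two donors displayed above, nothing more.
sample = pd.DataFrame({"Recency": [2, 0], "Donations": [50, 13], "Time": [98, 28]})

# F3: more than one donation AND last donation exactly 2 or 4 months ago.
sample["F3"] = ((sample["Donations"] > 1) & sample["Recency"].isin([2, 4])).astype(int)

# F4: donations per month since the first donation.
sample["F4"] = sample["Donations"] / sample["Time"]

print(sample.round(6))
```

The first donor last gave 2 months ago and so gets F3 = 1; the second gave this month and gets F3 = 0. F4 is 50/98 and 13/28 respectively.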


### 3. Preprocessing

* Split the data into train and test sets
* Normalization is not necessary for tree-based approaches, because each split depends only on the ordering of a feature's values, not on their scale

In [14]:
from sklearn.model_selection import train_test_split as tts

In [15]:
X_train,X_test,y_train,y_test = tts(df.drop('Target',axis=1),df.Target,test_size=0.3)
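
The split above draws a different random partition on every run, so the scores reported later are not exactly repeatable. With only about a quarter of the rows positive (45 of the 173 test rows below), the class ratio can also drift between train and test. A minimal sketch of a seeded, stratified split follows. It runs on synthetic stand-in data rather than the donor data, and the seed 42 is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 576 rows, 25% positives (made-up numbers, not the donor data).
rng = np.random.default_rng(0)
X = rng.normal(size=(576, 3))
y = np.array([1] * 144 + [0] * 432)

# stratify=y keeps the class ratio the same in both parts;
# random_state=42 makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(y_tr), len(y_te), y_tr.mean().round(3), y_te.mean().round(3))
```

In the notebook itself the equivalent change would be to pass `stratify=df.Target` and a fixed `random_state` to `tts`, at the cost of the numbers below no longer matching a fresh run.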

### 4. Decision Tree and Random Forest Models

In [22]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, confusion_matrix

First we'll try the single decision tree.

In [23]:
dtree = DecisionTreeClassifier()

In [24]:
dtree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [31]:
dt_predictions = dtree.predict(X_test)

In [33]:
print(confusion_matrix(y_test,dt_predictions))
print('\n')
print(classification_report(y_test,dt_predictions))

[[115  13]
 [ 34  11]]


             precision    recall  f1-score   support

          0       0.77      0.90      0.83       128
          1       0.46      0.24      0.32        45

avg / total       0.69      0.73      0.70       173
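
The figures in the classification report follow directly from the confusion matrix. As a worked check, assuming scikit-learn's convention that rows are true classes and columns are predicted classes (the helper `class_metrics` is made up for this illustration):

```python
# Confusion matrix printed above; rows are true classes, columns are predictions.
cm = [[115, 13],
      [34, 11]]


def class_metrics(cm, k):
    """Precision, recall and F1 for class k of a 2x2 confusion matrix."""
    tp = cm[k][k]
    fp = cm[1 - k][k]   # other class predicted as k
    fn = cm[k][1 - k]   # class k predicted as the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k in (0, 1):
    p, r, f = class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))
```

For class 1 this gives precision 11/24, recall 11/45 and F1 22/69, which round to the 0.46, 0.24 and 0.32 in the report: the tree finds fewer than a quarter of the actual March donors.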



Next we'll try a random forest.

In [17]:
rfc = RandomForestClassifier(n_estimators=200,max_features='sqrt')

In [18]:
rfc.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [28]:
rfc_predictions = rfc.predict(X_test)

In [29]:
print(confusion_matrix(y_test,rfc_predictions))
print('\n')
print(classification_report(y_test,rfc_predictions))

[[118  10]
 [ 31  14]]


             precision    recall  f1-score   support

          0       0.79      0.92      0.85       128
          1       0.58      0.31      0.41        45

avg / total       0.74      0.76      0.74       173
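
A useful yardstick for both reports is the majority-class baseline: predict "no donation" for every row. The sketch below computes it from the two confusion matrices printed above; `accuracy` is a helper written for this illustration.

```python
# Confusion matrices printed above (rows: true class, columns: predicted class).
tree_cm = [[115, 13], [34, 11]]
forest_cm = [[118, 10], [31, 14]]


def accuracy(cm):
    """Share of test rows on the diagonal, i.e. classified correctly."""
    return (cm[0][0] + cm[1][1]) / sum(map(sum, cm))


# Baseline: predicting 0 for everyone gets every true 0 right and every true 1 wrong.
n_negative = sum(tree_cm[0])
baseline = n_negative / sum(map(sum, tree_cm))

print(f"baseline {baseline:.3f}  tree {accuracy(tree_cm):.3f}  forest {accuracy(forest_cm):.3f}")
```

On this particular split the single tree comes out slightly below the baseline (126 versus 128 correct out of 173) and the forest slightly above it (132 correct). Because no random seed is fixed, another run may shift these numbers.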



### 5. Decision Tree Inspection

The table below shows the relative importance of each feature in the decision tree. It is interesting that F4 is weighted so highly here, when it was not a particularly effective variable in the logistic regression. I think this is because a decision tree splits on one predictor at a time using thresholds, so it can exploit cut-offs in F4 that a single linear coefficient in the logistic regression cannot represent.

In [27]:
pd.DataFrame(dtree.feature_importances_,index=X_train.columns,columns=['Importance'])

Unnamed: 0,Importance
Recency,0.187014
Donations,0.143287
Time,0.192275
F1,0.215415
F2,0.0
F3,0.027799
F4,0.23421
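
The importances in this table are impurity-based: they are computed on the training data and tend to favour features with many distinct values, such as the continuous F4, over binary flags like F1 to F3. Permutation importance on held-out rows is a common cross-check. The sketch below runs on synthetic data, not the donor data, and uses `sklearn.inspection.permutation_importance`, which needs scikit-learn 0.22 or later and so may be newer than the version used in this notebook. The column names are made up.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only column 0 carries signal, columns 1 and 2 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out rows and record the drop in accuracy.
result = permutation_importance(tree, X_te, y_te, n_repeats=20, random_state=0)
for name, mean in zip(["signal", "noise_1", "noise_2"], result.importances_mean):
    print(f"{name:8s} {mean:.3f}")
```

Applied to this notebook, the call would take `dtree`, `X_test` and `y_test`. A feature whose shuffling barely changes the test score is contributing little, whatever its impurity-based weight says.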


The diagram below shows the single decision tree. It can be recreated by running the code below and pasting the contents of the output file into the website named in the comment. The point here is the complexity of the tree itself. Decision trees are sometimes touted as simpler than random forests, but this is a very large structure, and it still performs worse than both the random forest and the logistic regression.

In [46]:
from sklearn.tree import export_graphviz

with open("treegraph.txt", "w") as f:
    # export_graphviz writes DOT text to the open file and returns None, so f is not rebound
    export_graphviz(dtree, out_file=f)

# copy the text from the treegraph.txt file and paste at http://webgraphviz.com/

![Single decision tree](tree.jpg)
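
The size of a fitted tree can also be measured directly rather than read off a picture. The sketch below uses synthetic, deliberately noisy data (not the donor data) to compare an unrestricted tree with one capped at `max_depth=3`, an arbitrary limit chosen for illustration. `export_text` needs scikit-learn 0.21 or later.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic, noisy data: the label follows column 0, but about 25% of labels are flipped.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.25
y[flip] = 1 - y[flip]

full = DecisionTreeClassifier(random_state=0).fit(X, y)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# tree_.node_count and tree_.max_depth describe the fitted structure.
print("full:   ", full.tree_.node_count, "nodes, depth", full.tree_.max_depth)
print("shallow:", shallow.tree_.node_count, "nodes, depth", shallow.tree_.max_depth)

# A depth-limited tree is small enough to read as plain text.
print(export_text(shallow, feature_names=["x0", "x1", "x2"]))
```

An unrestricted tree keeps splitting until its leaves are pure, so on overlapping classes it grows very large by memorising noise. Inspecting `dtree.tree_.node_count` in this notebook would put a number on the complexity visible in the diagram.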

### 6. Conclusion

#### Summary: 

The random forest classifier (RFC) performs about as well as the logistic regression models in the first notebook. This, together with the complexity of the single decision tree, indicates that the data set is rather scrambled: March donors and non-March donors cannot easily be separated by partitioning the available features.

