# Consolidated workflow (continued)

Import initial libraries and data file. N.B. The data being used read in here is already numercised for use in sklearn's decision trees and random forests

## Tanzania - data analysis

In [331]:
# import libraries and data
import numpy as np
import pandas as pd
import matplotlib as plot
import seaborn as sns

# check select_data
df = pd.read_csv('/Users/RAhmed/data store/Wesleyan_Capstone/all_numeric201808292240.csv')
df.shape

(59400, 24)

In [332]:
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

Sklearn's decision trees and random forests need categorical data changed into integers. See, e.g,: https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree

The above data set has already had this done.

In [333]:
df.columns

Index(['id', 'date_recorded', 'season_recorded', 'gps_height', 'installer',
       'longitude', 'latitude', 'basin', 'region_code', 'population',
       'public_meeting', 'scheme_management', 'permit', 'wpt_age',
       'construction_year', 'extraction_type_group', 'management_group',
       'payment_type', 'water_quality', 'quantity_group', 'source_type',
       'source_class', 'waterpoint_type_group', 'status_group'],
      dtype='object')

In [334]:
chosen_predictors = ['season_recorded', 'gps_height', 'installer', 'basin', 'region_code', 'population',
                     'public_meeting', 'scheme_management', 'permit', 'wpt_age','construction_year', 
                     'extraction_type_group', 'management_group', 'payment_type', 'water_quality', 
                     'quantity_group', 'source_type', 'waterpoint_type_group'
                    ]

predictors = df[chosen_predictors]
targets = df['status_group']

# train/test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.2)

print("pred_train.shape:", pred_train.shape)
print("pred_test.shape:", pred_test.shape)
print("tar_train.shape:", tar_train.shape)
print("tar_test.shape:", tar_test.shape)

pred_train.shape: (47520, 18)
pred_test.shape: (11880, 18)
tar_train.shape: (47520,)
tar_test.shape: (11880,)


Build a classifier model with a decision tree, on train set

In [335]:
# Build model on training data; initiate classifier from sklearn, then fit it with the training data
classifier=DecisionTreeClassifier(random_state=2)
classifier=classifier.fit(pred_train,tar_train)

# predict for the test values and create confusion matrix
predictions=classifier.predict(pred_test)
print("Confusion matrix:")
print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print("Accuracy score:")
print(sklearn.metrics.accuracy_score(tar_test, predictions))

Confusion matrix:
[[5143  359  915]
 [ 357  332  159]
 [ 962  197 3456]]
Accuracy score:
0.7517676767676768


Trying again, to play with parameters, experiment, etc.

In [336]:
# Playing with hyper-parameters. Also have varied the wording of the code from that in the MOOC. 
# (Just to check all in order with exceptionally high result.)
# train/test split
from sklearn.utils import shuffle
df = shuffle(df)
X_train, X_test, y_train, y_test = train_test_split(predictors[:], targets[:], test_size=.2)

# Build model on training data; initiate classifier from sklearn, then fit it with the training data
model = DecisionTreeClassifier(random_state=2)
model.fit(X_train, y_train)

y_predict = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

0.753030303030303

For confusion matrix, N.B. that:

functional = 0

functional_needs_repair = 1

non-functional = 2

In [337]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(
    confusion_matrix(y_test, y_predict),
    columns=['Pred functional', 'Predicted func_need_repair', 'Predicted non_func'],
    index=['True functional', 'True func_need_repair', 'True non_func']
)

Unnamed: 0,Pred functional,Predicted func_need_repair,Predicted non_func
True functional,5129,392,924
True func_need_repair,361,342,167
True non_func,915,175,3475


In [338]:
# reprint for convenience
accuracy_score(y_test, y_predict)

0.753030303030303

Key ingredient: see which features are most important

In [339]:
importance = (dict(zip(X_train, model.feature_importances_)))
sorted_items = sorted(importance.items(), key = lambda x: x[1], reverse=True)
sorted_items

[('gps_height', 0.2082604258545679),
 ('wpt_age', 0.16868874655184754),
 ('quantity_group', 0.15828654323856964),
 ('waterpoint_type_group', 0.07808363349391287),
 ('installer', 0.060031464665566206),
 ('payment_type', 0.040058500800316824),
 ('region_code', 0.037644683802763236),
 ('population', 0.034576043447480574),
 ('construction_year', 0.03334846018836722),
 ('source_type', 0.02906491271719261),
 ('extraction_type_group', 0.029064330152186242),
 ('scheme_management', 0.02843719571344725),
 ('basin', 0.026773943965726868),
 ('season_recorded', 0.016865793793567594),
 ('water_quality', 0.016776162616608612),
 ('permit', 0.01159092730339643),
 ('public_meeting', 0.011505296222643704),
 ('management_group', 0.01094293547183868)]

## Visualise the Decision Tree

It turns out that visualising the decision tree is quite straight forward.

There are two things one can initially do: limit or not limit maximum depth as folling example:
model = DecisionTreeClassifier(max_depth=None)

As below, push the image out to a tree.dot file. Find it on your computer amd then cut and paste the code into a web viewer/converter. See, e.g.:
http://www.ilovefreesoftware.com/03/featured/free-online-dot-to-png-converter-websites.html https://dreampuf.github.io/GraphvizOnline/

Right-clicking the image in GraphvizOnline, I could save the png to desktop.

If you limit maximum depth, your decision tree may be small enough to fit on one page. Else, cut and paste a number of lines of code from the tree.dot file, making sure you finish as follows in the online converter, e.g.:

...
44 -> 54 ; 
}

If you do limit the max depth the tree will be different, of course, than a full tree.

In [340]:
from sklearn import tree

# visualise
tree.export_graphviz(model, out_file='tree.dot') 
# cut and paste the tree.dot file info into webgraphviz.com in a browser. 
# (You can paste as many layers of the tree as you want. See notes above.)

In [341]:
# save to a file
# df.to_csv('/Users/RAhmed/data store/Wesleyan_Capstone/all_numeric201808292240.csv', index=False)

## Quick Aside - do KNN approach

In [329]:
## Import the Classifier.
from sklearn.neighbors import KNeighborsClassifier
## Instantiate the model with 5 neighbors. 
knn_list = []
for i in range(1,15):
    knn = KNeighborsClassifier(n_neighbors=i)
    ## Fit the model on the training data.
    knn.fit(X_train, y_train)
    ## See how the model performs on the test data.
    print("train set", knn.score(X_test, y_test))
    ## Append to list.
    knn_list.append({i: knn.score(X_test, y_test)})
knn_list

train set 0.6723063973063973
train set 0.678030303030303
train set 0.6845959595959596
train set 0.6853535353535354
train set 0.6842592592592592
train set 0.687037037037037
train set 0.6813131313131313
train set 0.6831649831649832
train set 0.6811447811447812
train set 0.67996632996633
train set 0.6786195286195286
train set 0.6792087542087543
train set 0.6773569023569024
train set 0.6744949494949495


[{1: 0.6723063973063973},
 {2: 0.678030303030303},
 {3: 0.6845959595959596},
 {4: 0.6853535353535354},
 {5: 0.6842592592592592},
 {6: 0.687037037037037},
 {7: 0.6813131313131313},
 {8: 0.6831649831649832},
 {9: 0.6811447811447812},
 {10: 0.67996632996633},
 {11: 0.6786195286195286},
 {12: 0.6792087542087543},
 {13: 0.6773569023569024},
 {14: 0.6744949494949495}]

## Do a global cross-validation approach on Decision Tree

https://chrisalbon.com/machine_learning/model_evaluation/cross-validaton/

In [330]:
# With k-fold cross validation. 
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score

model = DecisionTreeClassifier(max_depth=None, random_state=2)

# Create k-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=2)

# Do k-fold cross-validation
cv_results = cross_val_score(model, # Classifier
                             predictors, # Feature matrix
                             targets, # Target vector
                             cv=kf, # Cross-validation technique
                             scoring="accuracy", # Loss function
                             n_jobs=-1) # Use all CPU scores

# Calculate mean
cv_results.mean()

0.7565824915824916

## Do a hold out cross-validation approach on Decision Tree

Two differences: 
> i). do cross-validation on train set only (not against all), and predict against test set<br>
> ii). experiment within the decision tree with various parameters

N.B. cross_validation does not give a fitted model. There is a cross_val_predict, but I think it doesn;t test against a seprate hold out set. (?)

In [342]:
# With k-fold cross validation. 
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score


# train/test split
X_train, X_test, y_train, y_test = train_test_split(predictors, targets, test_size=.2)

# Difference for this 
model = DecisionTreeClassifier(max_depth=20, criterion='gini', random_state=2, splitter='random')

# Create k-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Do k-fold cross-validation
cv_results = cross_val_score(model, # Classifier
                             X_train, # Feature matrix
                             y_train, # Target vector
                             cv=kf, # Cross-validation technique
                             scoring="accuracy", # Loss function
                             n_jobs=-1) # Use all CPU scores

# Calculate mean against train set!
print(cv_results.mean())

# So I have tweaked the 'model' (above) parameters to get the best model, 
# and can now test against the hold out set. ((Tweaked) Model has to be fitted)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)


0.7657617845117846


0.7611952861952862

^ above cross validation tweaking has improved accuracy (against hold out) by some 1.5%, great!

## Do Random Forest approach

In [343]:
# import all necessary libraries
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# feature importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

In [344]:
# run Random Forest
from sklearn.ensemble import RandomForestClassifier
# classifier (sometimes use clf) is initiated
classifier=RandomForestClassifier(n_estimators=25, random_state=2)
# next step trains the model
classifier=classifier.fit(X_train,y_train)
# now we apply the classifier to the test data
predictions=classifier.predict(X_test)

# we look at confusion matrix and accuracy of prediction on test values
print("RANDOM FOREST")
print("Confusion matrix:")
print(sklearn.metrics.confusion_matrix(y_test,predictions))
print("Accuracy score:")
print(sklearn.metrics.accuracy_score(y_test, predictions))
print()
# display the relative importance of each attribute using RandomForestClassifier
# make this more readable by having the names of the predictors and having sorted
zipped = zip(predictors, classifier.feature_importances_)
my_list = list(zipped)
my_list.sort(key=lambda tup: tup[1], reverse=True)
print('RandomForestClassifier relative feature importance:')
for item in my_list:
    print('{0:42} {1:>42}'.format(item[0], item[1]))

# fit an Extra Trees model to the data (instead of Random Forest)
model = ExtraTreesClassifier(random_state=2)
model.fit(X_train,y_train)
predictions=model.predict(X_test)
print()
print("EXTRA TREE")
print("Confusion matrix:")
print(sklearn.metrics.confusion_matrix(y_test,predictions))
print("Accuracy score:")
print(sklearn.metrics.accuracy_score(y_test, predictions))
print()
# make this more readable by having the names of the predictors and having sorted
zipped = zip(predictors, model.feature_importances_)
my_list = list(zipped)
my_list.sort(key=lambda tup: tup[1], reverse=True)
print('ExtraTreesClassifier relative feature importance:')
for item in my_list:
    print('{0:42} {1:>42}'.format(item[0], item[1]))

RANDOM FOREST
Confusion matrix:
[[5582  228  664]
 [ 349  330  137]
 [1011  104 3475]]
Accuracy score:
0.7901515151515152

RandomForestClassifier relative feature importance:
gps_height                                                        0.18180452806021108
wpt_age                                                           0.14987167311128888
quantity_group                                                     0.1361718327627374
construction_year                                                 0.07447677652451679
waterpoint_type_group                                             0.06271687878568638
installer                                                         0.05876835704805633
extraction_type_group                                            0.044161902256080146
payment_type                                                     0.042808397013703585
region_code                                                       0.04245088424662267
population                                         

## Do Random Forest with cross-validation and hold out set

In [345]:
# Random forest with cross-validation and hold out set
model = RandomForestClassifier(n_estimators=50, criterion='entropy', max_depth=18, 
                               bootstrap=True, oob_score=True, random_state=2)

# Create k-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Do k-fold cross-validation
cv_results = cross_val_score(model, # Classifier
                             X_train, # Feature matrix
                             y_train, # Target vector
                             cv=kf, # Cross-validation technique
                             scoring="accuracy", # Loss function
                             n_jobs=-1) # Use all CPU scores

# Calculate mean against train set!
cv_results.mean()

# Calculate mean against test set
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

0.8095117845117845

In [346]:
# Find relative importance
zipped = zip(predictors, model.feature_importances_)
my_list = list(zipped)
my_list.sort(key=lambda tup: tup[1], reverse=True)
print('RandomForestClassifier relative feature importance:')
for item in my_list:
    print('{0:42} {1:>42}'.format(item[0], item[1]))

RandomForestClassifier relative feature importance:
quantity_group                                                     0.1476545938994607
wpt_age                                                           0.13490896915174497
gps_height                                                        0.12553456961357742
construction_year                                                 0.07582540790241538
installer                                                         0.06426181947924077
waterpoint_type_group                                             0.06396458502376022
extraction_type_group                                              0.0545968042456523
region_code                                                       0.05294210843243622
payment_type                                                      0.05004533805214187
population                                                        0.03725701749227348
basin                                                             0.03686161463761816
sc

## Naive Bayes Classifier

In [324]:
df2 = df
df2.head()

Unnamed: 0,id,date_recorded,season_recorded,gps_height,installer,longitude,latitude,basin,region_code,population,...,construction_year,extraction_type_group,management_group,payment_type,water_quality,quantity_group,source_type,source_class,waterpoint_type_group,status_group
34182,16647,2013.158602,2,2,223,33.517407,-3.239933,4,16,0,...,1995,6,4,4,6,0,5,0,5,2
15794,4815,2013.099462,2,-49,197,40.145994,-10.309863,7,8,8,...,1980,6,4,2,7,0,5,0,5,2
35164,25675,2011.252688,1,1,223,33.24041,-8.83259,2,11,0,...,1984,1,4,3,6,0,4,1,1,2
14131,13088,2012.768817,0,633,157,34.021617,-3.955814,0,16,0,...,2009,1,2,6,6,3,3,1,5,2
57386,35092,2012.817204,0,1,672,32.397157,-4.782683,3,13,0,...,0,0,4,2,6,2,5,0,3,0


In [325]:
predictors2 = df2[chosen_predictors]
targets2 = df2['status_group']

X_train, X_test, y_train, y_test = train_test_split(predictors2, targets2, test_size=.2)
X_train.head()

Unnamed: 0,season_recorded,gps_height,installer,basin,region_code,population,public_meeting,scheme_management,permit,wpt_age,construction_year,extraction_type_group,management_group,payment_type,water_quality,quantity_group,source_type,waterpoint_type_group
14670,0,1310,223,4,19,2,2,7,0,1.803763,2011,5,4,6,6,1,5,3
17251,1,2,327,2,11,0,1,7,0,11.271505,2000,5,4,3,6,1,5,3
27952,2,1734,157,6,10,7,1,7,0,3.145161,2008,1,4,2,6,1,6,1
3552,2,1503,774,0,12,3,1,7,2,23.129032,1990,5,4,3,6,2,5,3
19173,1,33,774,8,6,8,2,9,0,23.209677,1990,0,4,2,6,1,5,3


In [326]:
df2['gps_height'].value_counts()[0]

4643

In [327]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)
GaussianNB(priors=None)
y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)

0.4663299663299663

## Results

Random seed set at 2 for all except Naive  Bayes Classifier and KNN (where n/a)

| Method | Cross-validation | Hold-out set | Accuracy |
| ---- | ---- | ---- | ---- |
| Naive Bayes | no | yes | 47.7% |
| KNN | no | yes | 68.1% |
| Decision tree | no | yes | 75.7% |
| Decision tree | yes | no | 75.7% |
| Decision tree | yes | yes | 76.3% |
| Extra tree | no | yes | 79.0% |
| Random forest | yes | yes | 81.2% |

## How results could be improved
> Frequency encoding of variables;<br>
> Nearest neighbour for construction year?;<br>
> Google Elevation API for missing gps_height;<br>
> Smaller train set, perhaps model is over-fitted?  (Look how DTree improved with hold out set, when using cross validation)<br>