# Consolidated workflow

Import the initial libraries and the data file. N.B. I had already added season_recorded to the data as a composite variable.

In [23]:
# import libraries and data
import numpy as np
import pandas as pd
import matplotlib as plot
import seaborn as sns

# check select_data
df = pd.read_csv('/Users/RAhmed/data store/Wesleyan_Capstone/select_data.csv')
df.shape

(59400, 23)

## Tanzania - data analysis

In [3]:
# import libraries and data
import numpy as np
import pandas as pd
import matplotlib as plot
import seaborn as sns

# load the all-numeric version of the data and check its shape
df = pd.read_csv('/Users/RAhmed/data store/Wesleyan_Capstone/all_numeric201808292240.csv')
df.shape

(59400, 24)

In [6]:
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

Sklearn's decision trees and random forests need categorical data changed into integers. See, e.g.: https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree

The above data set has already had this done.
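As a minimal sketch of that kind of encoding (not the preprocessing actually used for this data set, which is not shown here), pandas can map each category to an integer code. The toy column values below are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({"basin": ["Lake Nyasa", "Pangani", "Lake Nyasa", "Rufiji"]})

# factorize assigns integer codes in order of first appearance.
codes, uniques = pd.factorize(toy["basin"])
toy["basin_code"] = codes

print(toy)
print(list(uniques))
```

The codes carry no meaningful order. Tree-based models usually tolerate that better than distance-based ones such as KNN.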

In [7]:
df.columns

Index(['id', 'date_recorded', 'season_recorded', 'gps_height', 'installer',
       'longitude', 'latitude', 'basin', 'region_code', 'population',
       'public_meeting', 'scheme_management', 'permit', 'wpt_age',
       'construction_year', 'extraction_type_group', 'management_group',
       'payment_type', 'water_quality', 'quantity_group', 'source_type',
       'source_class', 'waterpoint_type_group', 'status_group'],
      dtype='object')

In [8]:
chosen_predictors = ['season_recorded', 'gps_height', 'installer', 'basin', 'region_code', 'population',
                     'public_meeting', 'scheme_management', 'permit', 'wpt_age','construction_year', 
                     'extraction_type_group', 'management_group', 'payment_type', 'water_quality', 
                     'quantity_group', 'source_type', 'waterpoint_type_group'
                    ]

predictors = df[chosen_predictors]
targets = df['status_group']

# train/test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.2)

print("pred_train.shape:", pred_train.shape)
print("pred_test.shape:", pred_test.shape)
print("tar_train.shape:", tar_train.shape)
print("tar_test.shape:", tar_test.shape)

pred_train.shape: (47520, 18)
pred_test.shape: (11880, 18)
tar_train.shape: (47520,)
tar_test.shape: (11880,)
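The split above is random on every run, so the scores further down shift slightly each time. A hedged sketch of a reproducible split on synthetic stand-in data follows; the `random_state` value and the use of `stratify` are my assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, imbalanced 3-class target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 60 + [1] * 10 + [2] * 30)

# random_state fixes the shuffle; stratify keeps class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(np.bincount(y_tr), np.bincount(y_te))
```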


Build a classifier model with a decision tree, on train set

In [9]:
# Build model on training data; initiate classifier from sklearn, then fit it with the training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)

# predict for the test values and create confusion matrix
predictions=classifier.predict(pred_test)
print("Confusion matrix:")
print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print("Accuracy score:")
print(sklearn.metrics.accuracy_score(tar_test, predictions))

Confusion matrix:
[[5120  343  891]
 [ 354  338  179]
 [ 987  173 3495]]
Accuracy score:
0.7536195286195286
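The overall accuracy hides how unevenly the classes are predicted. The short calculation below recomputes accuracy and per-class recall from the confusion matrix printed above, assuming rows are true classes in the order 0 = functional, 1 = functional needs repair, 2 = non-functional.

```python
# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm = [
    [5120, 343, 891],
    [354, 338, 179],
    [987, 173, 3495],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall per class: correct predictions divided by the true count in that row.
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print(round(accuracy, 4))
print([round(r, 3) for r in recall])
```

Recall for the middle class comes out far lower than for the other two, so most of the errors sit in the smallest class.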


Trying again, to play with parameters, experiment, etc.

In [10]:
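# Playing with hyper-parameters. I have also varied the wording of the code from that in the MOOC.
# (Just to check that all is in order, given the exceptionally high result.)
# train/test split
from sklearn.utils import shuffle
# N.B. predictors and targets were extracted earlier, so shuffling df here does not affect them;
# train_test_split shuffles the rows by default anyway.
df = shuffle(df)
X_train, X_test, y_train, y_test = train_test_split(predictors, targets, test_size=.2)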

# Build model on training data; initiate classifier from sklearn, then fit it with the training data
model = DecisionTreeClassifier(max_depth=None)
model.fit(X_train, y_train)

y_predict = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

0.7553030303030303

For confusion matrix, N.B. that:

functional = 0

functional_needs_repair = 1

non-functional = 2

In [11]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(
    confusion_matrix(y_test, y_predict),
    columns=['Pred functional', 'Predicted func_need_repair', 'Predicted non_func'],
    index=['True functional', 'True func_need_repair', 'True non_func']
)

| | Pred functional | Predicted func_need_repair | Predicted non_func |
| ---- | ---- | ---- | ---- |
| True functional | 5148 | 373 | 945 |
| True func_need_repair | 353 | 366 | 157 |
| True non_func | 879 | 200 | 3459 |


In [12]:
# reprint for convenience
accuracy_score(y_test, y_predict)

0.7553030303030303

Key ingredient: see which features are most important

In [13]:
importance = (dict(zip(X_train, model.feature_importances_)))
sorted_items = sorted(importance.items(), key = lambda x: x[1], reverse=True)
sorted_items

[('gps_height', 0.21137366969202878),
 ('wpt_age', 0.161596418723216),
 ('quantity_group', 0.15625972345481492),
 ('waterpoint_type_group', 0.07989051614506813),
 ('installer', 0.05551817301162601),
 ('region_code', 0.043381341820464256),
 ('payment_type', 0.041610236404126505),
 ('population', 0.035799532888753094),
 ('extraction_type_group', 0.0334683228144356),
 ('construction_year', 0.0306857678054853),
 ('source_type', 0.029017294952964533),
 ('scheme_management', 0.027798322018767793),
 ('basin', 0.02540525378348438),
 ('water_quality', 0.017607862289897216),
 ('permit', 0.01579272825581607),
 ('public_meeting', 0.012979489291717978),
 ('season_recorded', 0.01139977332904869),
 ('management_group', 0.010415573318284663)]
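To put the ranking in perspective, the snippet below adds up the three largest importances from the output above. Sklearn normalises a fitted tree's importances to sum to 1, so the result can be read as a share of the total.

```python
# Top entries copied from the sorted output above.
sorted_items = [
    ("gps_height", 0.21137366969202878),
    ("wpt_age", 0.161596418723216),
    ("quantity_group", 0.15625972345481492),
    ("waterpoint_type_group", 0.07989051614506813),
]

# Share of total importance carried by the three strongest predictors.
top3_share = sum(value for _, value in sorted_items[:3])
print(round(top3_share, 3))
```

The three strongest predictors carry a little over half of the total importance in this run.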

## Visualise the Decision Tree

It turns out that visualising the decision tree is quite straightforward.

There are two things one can initially do: limit or not limit the maximum depth, as in the following example:
model = DecisionTreeClassifier(max_depth=None)

As below, write the tree out to a tree.dot file. Find it on your computer and then cut and paste its contents into a web viewer/converter. See, e.g.:
http://www.ilovefreesoftware.com/03/featured/free-online-dot-to-png-converter-websites.html
https://dreampuf.github.io/GraphvizOnline/

Right-clicking the image in GraphvizOnline, I could save the png to desktop.

If you limit the maximum depth, your decision tree may be small enough to fit on one page. Otherwise, cut and paste a number of lines from the tree.dot file into the online converter, making sure the pasted text ends with a closing brace, e.g.:

...
44 -> 54 ; 
}

If you do limit the max depth, the tree will of course differ from a full tree.
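If the goal is only a smaller picture, one option is to keep the full tree and truncate the drawing instead, using the `max_depth` argument of `export_graphviz`. The sketch below uses synthetic data, and the feature and class names are placeholders rather than the real column names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in for the waterpoint data (3 classes, 4 features).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None returns the DOT source as a string; max_depth here only
# truncates the drawing, it does not change the fitted tree.
dot_source = export_graphviz(
    full_tree, out_file=None, max_depth=2,
    feature_names=names, class_names=["0", "1", "2"], filled=True
)
print(dot_source[:80])
```

The returned string can be pasted into the same online converters, and passing `feature_names` makes the node labels readable.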

In [14]:
from sklearn import tree

# visualise
tree.export_graphviz(model, out_file='tree.dot') 
# cut and paste the tree.dot file info into webgraphviz.com in a browser. 
# (You can paste as many layers of the tree as you want. See notes above.)

In [167]:
# save to a file
# df.to_csv('/Users/RAhmed/data store/Wesleyan_Capstone/all_numeric201808292240.csv', index=False)

## Quick Aside - do KNN approach

In [23]:
## Import the Classifier.
from sklearn.neighbors import KNeighborsClassifier
## Try n_neighbors from 1 to 14 and record the test accuracy for each.
knn_list = []
for i in range(1,15):
    knn = KNeighborsClassifier(n_neighbors=i)
    ## Fit the model on the training data.
    knn.fit(X_train, y_train)
    ## See how the model performs on the test data and append to list.
    knn_list.append({i: knn.score(X_test, y_test)})
knn_list

[{1: 0.6771043771043771},
 {2: 0.679040404040404},
 {3: 0.6873737373737374},
 {4: 0.6861952861952862},
 {5: 0.6871212121212121},
 {6: 0.6813973063973064},
 {7: 0.682996632996633},
 {8: 0.6821548821548822},
 {9: 0.6824915824915825},
 {10: 0.6788720538720538},
 {11: 0.6771043771043771},
 {12: 0.6781144781144781},
 {13: 0.6758417508417508},
 {14: 0.6734006734006734}]
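The best value of k can be read off programmatically. The snippet below uses the scores copied from the output above, collected into a single dict for convenience.

```python
# Scores copied from the output above, keyed by n_neighbors.
knn_scores = {
    1: 0.6771043771043771, 2: 0.679040404040404, 3: 0.6873737373737374,
    4: 0.6861952861952862, 5: 0.6871212121212121, 6: 0.6813973063973064,
    7: 0.682996632996633, 8: 0.6821548821548822, 9: 0.6824915824915825,
    10: 0.6788720538720538, 11: 0.6771043771043771, 12: 0.6781144781144781,
    13: 0.6758417508417508, 14: 0.6734006734006734,
}

# Key with the highest accuracy.
best_k = max(knn_scores, key=knn_scores.get)
print(best_k, round(knn_scores[best_k], 4))
```

Since k is chosen here using the test set itself, the best score is likely a slightly optimistic estimate.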

## Do a global cross-validation approach on Decision Tree

https://chrisalbon.com/machine_learning/model_evaluation/cross-validaton/

In [28]:
# With k-fold cross validation. 
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score

model = DecisionTreeClassifier(max_depth=None)

# Create k-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Do k-fold cross-validation
cv_results = cross_val_score(model, # Classifier
                             predictors, # Feature matrix
                             targets, # Target vector
                             cv=kf, # Cross-validation technique
                             scoring="accuracy", # Scoring metric
                             n_jobs=-1) # Use all CPU cores

# Calculate mean
cv_results.mean()

0.757979797979798

## Do a hold out cross-validation approach on Decision Tree

Two differences: 
> i). do cross-validation on train set only (not against all), and predict against test set<br>
> ii). experiment within the decision tree with various parameters

N.B. cross_val_score does not give a fitted model. There is also a cross_val_predict, but it returns out-of-fold predictions for the data passed in rather than testing against a separate hold-out set.
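One way to get a fitted, tuned model out of cross-validation is `GridSearchCV`, which cross-validates each parameter combination on the training set and then refits the best one. This is a sketch on synthetic data, not what the cells in this notebook do, and the parameter grid is a made-up example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for predictors/targets.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; the values are illustrative, not tuned for the real data.
param_grid = {"max_depth": [5, 10, 20], "splitter": ["best", "random"]}

search = GridSearchCV(DecisionTreeClassifier(random_state=2), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # cross-validates on the training set only

print(search.best_params_)
# The refitted best model is scored once on the untouched hold-out set.
holdout_accuracy = search.score(X_test, y_test)
print(holdout_accuracy)
```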

In [132]:
# With k-fold cross validation. 
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score


# train/test split
X_train, X_test, y_train, y_test = train_test_split(predictors, targets, test_size=.2)

# Difference for this 
model = DecisionTreeClassifier(max_depth=20, criterion='gini', random_state=2, splitter='random')

# Create k-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Do k-fold cross-validation
cv_results = cross_val_score(model, # Classifier
                             X_train, # Feature matrix
                             y_train, # Target vector
                             cv=kf, # Cross-validation technique
                             scoring="accuracy", # Scoring metric
                             n_jobs=-1) # Use all CPU cores

# Calculate mean against train set!
print(cv_results.mean())

# So I have tweaked the 'model' (above) parameters to get the best model, 
# and can now test against the hold out set. ((Tweaked) Model has to be fitted)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)


0.7652988215488217


0.7651515151515151

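^ The cross-validation tweaking above has improved accuracy against the hold-out set by about 1 percentage point (from roughly 75.5% to 76.5%), great! N.B. the two runs use different random train/test splits, so the comparison is only approximate.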

## Do Random Forest approach

In [29]:
# import all necessary libraries
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# feature importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

In [34]:
# run Random Forest
from sklearn.ensemble import RandomForestClassifier
# classifier (sometimes use clf) is initiated
classifier=RandomForestClassifier(n_estimators=25)
# next step trains the model
classifier=classifier.fit(X_train,y_train)
# now we apply the classifier to the test data
predictions=classifier.predict(X_test)

# we look at confusion matrix and accuracy of prediction on test values
print("RANDOM FOREST")
print("Confusion matrix:")
print(sklearn.metrics.confusion_matrix(y_test,predictions))
print("Accuracy score:")
print(sklearn.metrics.accuracy_score(y_test, predictions))
print()
# display the relative importance of each attribute using RandomForestClassifier
# make this more readable by having the names of the predictors and having sorted
zipped = zip(predictors, classifier.feature_importances_)
my_list = list(zipped)
my_list.sort(key=lambda tup: tup[1], reverse=True)
print('RandomForestClassifier relative feature importance:')
for item in my_list:
    print('{0:42} {1:>42}'.format(item[0], item[1]))

# fit an Extra Trees model to the data (instead of Random Forest)
model = ExtraTreesClassifier()
model.fit(X_train,y_train)
# predict with the Extra Trees model (not the Random Forest classifier fitted above)
predictions=model.predict(X_test)
print()
print("EXTRA TREE")
print("Confusion matrix:")
print(sklearn.metrics.confusion_matrix(y_test,predictions))
print("Accuracy score:")
print(sklearn.metrics.accuracy_score(y_test, predictions))
print()
# make this more readable by having the names of the predictors and having sorted
zipped = zip(predictors, model.feature_importances_)
my_list = list(zipped)
my_list.sort(key=lambda tup: tup[1], reverse=True)
print('ExtraTreesClassifier relative feature importance:')
for item in my_list:
    print('{0:42} {1:>42}'.format(item[0], item[1]))

RANDOM FOREST
Confusion matrix:
[[5619  210  637]
 [ 392  338  146]
 [ 908  125 3505]]
Accuracy score:
0.7964646464646464

RandomForestClassifier relative feature importance:
gps_height                                                         0.1821802097372602
wpt_age                                                           0.15181877573977384
quantity_group                                                     0.1276993841321722
construction_year                                                 0.07191145627623537
installer                                                         0.05964110750070234
waterpoint_type_group                                             0.05698960579470149
extraction_type_group                                              0.0546123771953428
payment_type                                                     0.043689133821211576
region_code                                                      0.042915515609313955
population                                         

## Do random forest with cross-validation and hold out set

In [151]:
# Random forest with cross-validation and hold out set
model = RandomForestClassifier(n_estimators=50, criterion='entropy', max_depth=18, 
                               bootstrap=True, oob_score=True, random_state=5)

# Create k-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Do k-fold cross-validation
cv_results = cross_val_score(model, # Classifier
                             X_train, # Feature matrix
                             y_train, # Target vector
                             cv=kf, # Cross-validation technique
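                             scoring="accuracy", # Scoring metric
                             n_jobs=-1) # Use all CPU cores

# Calculate mean against train set! (N.B. not displayed, since only the last expression in a cell is shown)
cv_results.mean()

# Fit on the full train set and calculate accuracy against the test (hold-out) set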
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

0.8127104377104377
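The model above is created with oob_score=True, but the out-of-bag estimate is never read. As a sketch on synthetic data (not the waterpoint data), it is available as `oob_score_` after fitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, bootstrap=True,
                                oob_score=True, random_state=5)
forest.fit(X, y)

# Accuracy estimated from the samples each tree did not see during bootstrapping.
print(forest.oob_score_)
```

This gives a validation-style accuracy estimate without a separate cross-validation loop, which can be a cheap sanity check alongside the k-fold result.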

## Results

| Method | Cross-validation | Hold-out set | Accuracy |
| ---- | ---- | ---- | ---- |
| KNN | no | yes | 68.7% |
| Decision tree | no | yes | 75.5% |
| Decision tree | yes | no | 75.8% |
| Decision tree | yes | yes | 76.5% |
| Random forest | no | yes | 79.6% |
| Random forest | yes | yes | 81.3% |

In [35]:
# How this could be improved: frequency encoding of variables; nearest neighbour for construction year?
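As a rough illustration of the frequency-encoding idea, each category can be replaced by its relative frequency in the column. The installer names below are invented for the example, and this has not been tried on the real data.

```python
import pandas as pd

# Toy column; the installer names are made up for illustration.
toy = pd.DataFrame({"installer": ["DWE", "DWE", "Government", "DWE", "RWE"]})

# Replace each category by its relative frequency in the column.
freq = toy["installer"].value_counts(normalize=True)
toy["installer_freq"] = toy["installer"].map(freq)

print(toy)
```

To avoid leaking information from the test set, the frequencies would normally be computed on the training split only and then mapped onto the test split.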