# Predicting Functionality of Waterpoints in Tanzania

![img](images/lake_victoria.jpg)

Lake Victoria in Tanzania - photo courtesy of [thepinkbackpack.com](https://www.thepinkbackpack.com/).

# Overview

# Business Problem

About 4 million of Tanzania's 59 million residents lack access to potable (drinking) water; an even greater proportion of the Tanzanian population (nearly half) lack access to what water.org calls "[improved sanitation](https://water.org/our-impact/where-we-work/tanzania/)".

While the majority of Tanzanians *do* have access to clean water, a quick look at the functionality status of about 60,000 waterpoints (wells, pumps, etc.) indicates that nearly **half** of those waterpoints either do not function or need repair to function consistently.

![img](images/target_val_counts.jpg)

# Data

Data used in this classification project comes from an ongoing competition hosted by DrivenData, [*Pump it Up: Data Mining the Water Table*](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).



The repository structure needed to run this notebook is described at the bottom of the [README.md](https://github.com/toastdeini/dsc-ph3-Tanzanian-water-well-status/blob/main/README.md).

Descriptions of each column can be found at [this link](data_dict_basic.txt) within this repository.

## Loadout

In [21]:
# Packages for data cleaning, plotting, and manipulation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# machine learning libraries/functions/classes

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, \
                            classification_report
from sklearn.linear_model import LogisticRegression, RidgeCV
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, \
                             StackingClassifier, ExtraTreesClassifier
from sklearn.dummy import DummyClassifier
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier

In [2]:
# Importing training data
train_val = pd.read_csv('data/training_set_values.csv')

# Only using `status_group` column from label set, to
# avoid duplicating `id` column
train_label = pd.read_csv('data/training_set_labels.csv',
                             usecols = ['status_group'])


# Test set, provided by DrivenData - not to be used
# until models have been trained, internally validated, etc.

test_df = pd.read_csv('data/test_set_values.csv')

In [3]:
# Concatenating separate .csv files
df = pd.concat(objs = [train_val, train_label],
               axis = 1)
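
`pd.concat` with `axis = 1` pairs rows by index position, so the concatenation above relies on both files listing waterpoints in the same order. A minimal sketch of a key-based alternative, assuming both files carry the shared `id` column (the toy frames and their values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the values and labels files (hypothetical data)
values = pd.DataFrame({'id': [3, 1, 2],
                       'amount_tsh': [0.0, 50.0, 25.0]})
labels = pd.DataFrame({'id': [1, 2, 3],
                       'status_group': ['functional',
                                        'non functional',
                                        'functional needs repair']})

# Joining on `id` keeps each label with its own waterpoint,
# even when the two files are sorted differently
merged = values.merge(labels, on='id', how='inner', validate='one_to_one')

print(merged)
```

The `validate = 'one_to_one'` argument makes pandas raise an error if an `id` appears more than once in either frame, which is a cheap sanity check before modeling.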

In [4]:
# Quick readout
df.sample(n = 5,
          random_state = 138)

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27487 | 60684 | 0.0 | 2013-02-15 | Dwe | 1137 | DWE | 37.136702 | -4.075092 | Madukani | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe | non functional |
| 52964 | 457 | 0.0 | 2011-04-06 | | 0 | | 34.368774 | -8.767996 | Kwa Mzee Lusambo | 0 | ... | soft | good | seasonal | seasonal | river | river/lake | surface | communal standpipe | communal standpipe | non functional |
| 26478 | 7855 | 50.0 | 2011-03-24 | Private Individual | -8 | Da | 38.991011 | -6.537616 | Bakari | 0 | ... | soft | good | enough | enough | river | river/lake | surface | communal standpipe | communal standpipe | functional |
| 59113 | 50299 | 0.0 | 2013-03-19 | Government Of Tanzania | 1068 | DWE | 36.806314 | -3.448869 | Kwa Elishirikiamwea Swai | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | other | other | functional |
| 14153 | 70382 | 0.0 | 2011-07-27 | He | 0 | HE | 31.635313 | -1.718094 | Kabubuya | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | hand pump | hand pump | non functional |


## Feature Comprehension and Selection

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

We have **forty** potential features here, in columns `0` through `39`; column `40`, `status_group`, is our target variable. The majority (30) of those columns are currently stored as type `object`, the remainder as either `int64` or `float64`.

Thorough exploratory analysis revealed that a number of these columns contained information that was redundant with other columns, not informative to the modeling process, or otherwise superfluous.

In [6]:
# Dropping columns determined to be either irrelevant or
# superfluous in exploratory analysis

cols_to_drop = [
    'id',  # unique identifier, not useful for modeling
    'date_recorded',  # superfluous information, too many unique records
    'recorded_by',  # no useful information - same value across all 59.4k rows
    'funder',   
    'installer',  # large number of unique values (see also `funder`);
                  # may be added back in later
    'wpt_name',  # identifier column, not useful for modeling
    'num_private',  # data dict. does not provide details for this column
    'subvillage',  # too many unique values - uninformative for modeling
    'region_code',  # redundant information vis-a-vis the simpler `region`
    'district_code',  # may be added back in later
    'lga',
    'ward',  # redundant location data (with `lga`)
    'scheme_management',  # may be added back in later
    'scheme_name',  # large number of nulls, redundant vis-a-vis `scheme_management`
    'extraction_type',
    'extraction_type_group',  # using `extraction_type_class` for generalized info
    'management',
    'management_group',  # may be added back in later
    'payment',  # identical information to `payment_type`
    'water_quality',  # comparable information to `quality_group` - redundant
    'quantity_group',  # identical information to `quantity` - redundant
    'source',  # redundant with other `source_` columns
    'waterpoint_type'  # used `waterpoint_type_group` instead
]

In [7]:
df = df.drop(columns = cols_to_drop).copy()
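
One way to back up a "redundant" label in the list above is to check whether one column is fully determined by another. The sketch below is not the exploratory analysis behind the list; it runs on a hypothetical four-row frame, and `is_determined_by` is a helper name introduced here for illustration:

```python
import pandas as pd

# Hypothetical toy rows; the real columns hold many more values
toy = pd.DataFrame({
    'quantity':       ['enough', 'seasonal', 'dry', 'enough'],
    'quantity_group': ['enough', 'seasonal', 'dry', 'enough'],
    'source':         ['spring', 'river', 'lake', 'river'],
    'source_type':    ['spring', 'river/lake', 'river/lake', 'river/lake'],
})

def is_determined_by(frame, fine, coarse):
    """True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool((frame.groupby(fine)[coarse].nunique() <= 1).all())

# Identical columns determine each other in both directions
same_forward = is_determined_by(toy, 'quantity', 'quantity_group')
same_backward = is_determined_by(toy, 'quantity_group', 'quantity')

# A coarser grouping is determined by the finer column, but not vice versa
coarse_from_fine = is_determined_by(toy, 'source', 'source_type')
fine_from_coarse = is_determined_by(toy, 'source_type', 'source')

print(same_forward, same_backward, coarse_from_fine, fine_from_coarse)
```

When the check holds in both directions, the two columns carry the same information and either can be dropped; when it holds in one direction only, the choice is between a finer and a coarser version of the same feature.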

# Modeling

## Preprocessing

### Pipelines and Transformers

To keep data leakage to a minimum - ideally zero - I first set up pipelines that impute missing or null values, one-hot encode categorical variables, and scale numerical columns.

In [8]:
# Subpipes for imputing median values - to be used for `latitude` and `longitude`
subpipe_lat      = Pipeline(steps=[('num_impute', SimpleImputer(missing_values = -2.000000e-08,
                                                                strategy = 'median')),
                                   ('ss', StandardScaler())])

subpipe_long     = Pipeline(steps=[('num_impute', SimpleImputer(missing_values = 0.000000,
                                                                strategy = 'median')),
                                   ('ss', StandardScaler())])


# Subpipe for imputing median values
subpipe_num      = Pipeline(steps=[('num_impute', SimpleImputer(strategy = 'median')),
                                   ('ss', StandardScaler())])

# Subpipe for `construction_year`
subpipe_year     = Pipeline(steps=[('num_impute', SimpleImputer(missing_values = 0,
                                                                strategy = 'median')),
                                   ('ss', StandardScaler())])

# Subpipe for categorical features including `basin`, `payment_type`
subpipe_cat      = Pipeline(steps=[('freq_imputer_nan', SimpleImputer(strategy = 'most_frequent')),
                                   ('freq_imputer_unk', SimpleImputer(strategy = 'most_frequent',
                                                                      missing_values = 'unknown')),
                                   ('ohe', OneHotEncoder(drop = 'if_binary',
                                                         sparse = False,  # renamed `sparse_output` in scikit-learn 1.2
                                                         handle_unknown = 'ignore'))])
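
The subpipes above treat placeholder values such as `0` as missing. A minimal sketch of that idea on its own, using made-up construction years rather than the real column, shows that the median is learned during `fit` and then reused on new rows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical construction years; 0 stands in for "not recorded"
train_years = np.array([[0.0], [1990.0], [2000.0], [2010.0]])
new_years = np.array([[0.0], [1985.0]])

imputer = SimpleImputer(missing_values=0, strategy='median')
imputer.fit(train_years)             # median is learned from training rows only

learned_median = imputer.statistics_[0]
filled = imputer.transform(new_years)

print(learned_median)   # median of the non-zero training years
print(filled)           # zeros replaced, other values untouched
```

Because the statistic comes from the rows passed to `fit`, wrapping the imputer in a pipeline means held-out rows never influence the value used to fill them.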

In [9]:
# Columns to be passed through numerical pipeline
num_cols = ['amount_tsh',
            'gps_height',
            'population']

# Columns to be passed through categorical pipeline
cat_cols = ['basin',
            'region',
            'payment_type',
            'quantity',
            'quality_group',
            'permit',
            'public_meeting',
            'extraction_type_class',
            'source_type',
            'source_class',
            'waterpoint_type_group']

In [10]:
ct = ColumnTransformer(transformers = [
    ('subpipe_num', subpipe_num, num_cols),
    ('subpipe_year', subpipe_year, ['construction_year']),
    ('subpipe_long', subpipe_long, ['longitude']),
    ('subpipe_lat', subpipe_lat, ['latitude']),
    ('subpipe_cat', subpipe_cat, cat_cols)],
                       remainder = 'passthrough')

# ('subpipe_label', subpipe_label, ['status_group'])
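
To illustrate how a `ColumnTransformer` routes columns, here is a small sketch on a hypothetical three-row frame. The column `extra` and the transformer names are invented for the example, and `sparse_threshold = 0` is used only to force a dense array that is easy to inspect:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical three-row frame with one numeric, one categorical,
# and one column that no transformer claims
toy = pd.DataFrame({'amount_tsh': [0.0, 50.0, 100.0],
                    'basin': ['Pangani', 'Rufiji', 'Pangani'],
                    'extra': [1, 2, 3]})

toy_ct = ColumnTransformer(
    transformers=[('num', StandardScaler(), ['amount_tsh']),
                  ('cat', OneHotEncoder(handle_unknown='ignore'), ['basin'])],
    remainder='passthrough',   # unclaimed columns are appended unchanged
    sparse_threshold=0)        # always return a dense array

out = toy_ct.fit_transform(toy)
print(out.shape)  # 1 scaled + 2 one-hot + 1 passthrough column
```

The output columns follow the order of the `transformers` list, with any passthrough columns placed last.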

### Train/Test Split

In [11]:
# Splitting DataFrame into features/values DataFrame
# (i.e. `X`) and labels series (`y`)

X = df.drop('status_group', axis = 1)
y = df['status_group']

In [12]:
# Splitting internal training data into separate
# training and test sets for (eventual) internal validation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 138)
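
As a quick arithmetic check on the split, the sketch below works out the expected set sizes. It assumes, as I understand scikit-learn's behaviour, that a fractional `test_size` is rounded up to a whole number of rows and the training set takes the rest:

```python
import math

n_rows = 59400      # rows reported by df.info()
test_size = 0.33

# For a float `test_size`, the test set gets the ceiling of the product
# and the training set gets whatever is left
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

These sizes should agree with the `support` totals in the classification reports further down (39,798 training rows and 19,602 test rows).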

## Iterative Modeling

### Dummy model

For comparison, I start with a baseline model - sklearn's `DummyClassifier`.

In [13]:
# Pipeline for dummy model
dummy_model_pipe = Pipeline(steps=[
    ('ct', ct),
    ('dummy', DummyClassifier())
])

In [14]:
# Fit on training data
dummy_model_pipe.fit(X_train, y_train)

# Score on training data
dummy_model_pipe.score(X_train, y_train)

0.5420875420875421

scikit-learn's `DummyClassifier` predicts on the training data with an accuracy score of ~0.542, equal to the proportion of the **most frequent class** (`functional`). This is because, as a dummy model, it predicts `functional` (the most frequent value) every time.

We'll be looking to improve on that 54.2% accuracy in future models.
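
The baseline figure can be reproduced by hand. The sketch below does not call `DummyClassifier`; it simply takes the training-set class counts from the `support` column of the training classification report in this notebook and computes the share of the most frequent class:

```python
from collections import Counter

# Training-set class counts, read from the `support` column of the
# training classification report in this notebook
train_counts = Counter({'functional': 21574,
                        'functional needs repair': 2921,
                        'non functional': 15303})

majority_class, majority_count = train_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(train_counts.values())

print(majority_class, round(baseline_accuracy, 4))
```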

> The full, verbose modeling process is available to view in [this notebook](Jupyter_Notebooks/Modeling_Scratchwork.ipynb) within the repository.

## Final Model

The final model is an ensemble model - it combines a `RandomForestClassifier` and an `XGBClassifier`, using sklearn's `StackingClassifier`.

In [15]:
# RandomForestClassifier pipeline
rfc_pipe_three = Pipeline(steps=[
    ('ct', ct),
    ('rfc', RandomForestClassifier(max_features = 'sqrt',
                                   max_depth = 15,
                                   random_state = 138))
])


# XGBoost classifier pipeline
xgb_model_pipe = Pipeline(steps=[
    ('ct', ct),
    ('xgb', XGBClassifier(random_state = 138))
])

In [16]:
# Setting up estimators for StackingClassifier
final_estimators = [
    ('rfc_model', rfc_pipe_three),
    ('xgb_model', xgb_model_pipe)
]

final_model_pipe = StackingClassifier(final_estimators)
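
For readers unfamiliar with stacking, here is a self-contained sketch on synthetic data. It is not the model above: it swaps `XGBClassifier` for scikit-learn's `GradientBoostingClassifier` so that it runs without xgboost, and the dataset, estimator names, and sizes are all illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the waterpoint features
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=5, n_classes=3,
                                   random_state=138)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=138)

toy_stack = StackingClassifier(estimators=[
    ('rfc', RandomForestClassifier(n_estimators=25, random_state=138)),
    ('gbc', GradientBoostingClassifier(n_estimators=25, random_state=138)),
])
toy_stack.fit(X_tr, y_tr)

# With no `final_estimator` given, a LogisticRegression is fit on the
# base models' cross-validated class probabilities
final_name = type(toy_stack.final_estimator_).__name__
toy_score = toy_stack.score(X_te, y_te)

print(final_name, toy_score)
```

Since `StackingClassifier(final_estimators)` above is also created without a `final_estimator`, the same default meta-learner applies there.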

In [17]:
final_model_pipe.fit(X_train, y_train)

StackingClassifier(estimators=[('rfc_model',
                                Pipeline(steps=[('ct',
                                                 ColumnTransformer(remainder='passthrough',
                                                                   transformers=[('subpipe_num',
                                                                                  Pipeline(steps=[('num_impute',
                                                                                                   SimpleImputer(strategy='median')),
                                                                                                  ('ss',
                                                                                                   StandardScaler())]),
                                                                                  ['amount_tsh',
                                                                                   'gps_height',
                                             

In [18]:
final_model_pipe.score(X_train, y_train)

0.8507965224383135

In [19]:
cross_val_score(estimator = final_model_pipe,
                X = X_train,
                y = y_train)

array([0.79020101, 0.78592965, 0.78756281, 0.78464631, 0.79469783])

The lower scores on cross-validation *vis-a-vis* the simple accuracy score on the training data indicate that this model is likely overfit. Future models will need further hyperparameter tuning and more thorough analysis to reduce that overfitting.
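
To put a number on that gap, the following sketch averages the five cross-validation scores printed above and subtracts the result from the training accuracy (the variable names are introduced here for illustration):

```python
from statistics import mean

train_accuracy = 0.8507965224383135   # final_model_pipe.score(X_train, y_train)
cv_scores = [0.79020101, 0.78592965, 0.78756281, 0.78464631, 0.79469783]

cv_mean = mean(cv_scores)
gap = train_accuracy - cv_mean

print(round(cv_mean, 4), round(gap, 4))
```

A gap of roughly six percentage points between training and cross-validated accuracy is what motivates the overfitting concern.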

In [22]:
print(classification_report(y_true = y_train,
                            y_pred = final_model_pipe.predict(X_train)))

                         precision    recall  f1-score   support

             functional       0.82      0.95      0.88     21574
functional needs repair       0.85      0.36      0.51      2921
         non functional       0.90      0.81      0.85     15303

               accuracy                           0.85     39798
              macro avg       0.86      0.71      0.75     39798
           weighted avg       0.86      0.85      0.84     39798



On the training data, we got an overall accuracy of about `0.85`. The `f1-score` column combines the `precision` and `recall` scores for each target class into a single number. `functional` and `non functional` have F1 scores in the 0.8 - 0.9 range, whereas the `functional needs repair` class has a considerably lower F1 score, 0.51 - our model was *less* effective in predicting this class.

Considering that cross-validation showed our model to be likely overfit, we can anticipate these scores dropping somewhat when the model is evaluated on unseen data.
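
The F1 figures in the training report can be reproduced from its precision and recall columns. A short sketch, using the rounded values as printed (so the last digit can differ from scikit-learn's output in general, although it matches here); `f1_score_from` is a helper defined only for this example:

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the training report above
train_report = {'functional': (0.82, 0.95),
                'functional needs repair': (0.85, 0.36),
                'non functional': (0.90, 0.81)}

train_f1 = {label: round(f1_score_from(p, r), 2)
            for label, (p, r) in train_report.items()}

print(train_f1)
```

The harmonic mean is pulled toward the smaller of its two inputs, which is why the low recall (0.36) for `functional needs repair` drags its F1 score down despite a precision of 0.85.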

### Accuracy on Unseen Data

In [23]:
final_model_pipe.score(X_test, y_test)

0.7909907152331395

In [24]:
print(classification_report(y_true = y_test,
                            y_pred = final_model_pipe.predict(X_test)))

                         precision    recall  f1-score   support

             functional       0.77      0.91      0.83     10685
functional needs repair       0.62      0.22      0.33      1396
         non functional       0.84      0.73      0.78      7521

               accuracy                           0.79     19602
              macro avg       0.74      0.62      0.65     19602
           weighted avg       0.79      0.79      0.78     19602



#### Score summary

- **Accuracy score:** 0.79
- **Precision score for `non functional` predictions:** 0.84
- **Final `f1-score` per status group** (on test data):
    - `functional` - 0.83
    - `functional needs repair` - 0.33
    - `non functional` - 0.78
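
The `macro avg` and `weighted avg` rows of the test report follow from the per-class scores listed above. A sketch of the arithmetic, using the rounded F1 scores and the supports from the report:

```python
# Per-class F1 scores and supports copied from the test-set report
test_f1 = {'functional': 0.83,
           'functional needs repair': 0.33,
           'non functional': 0.78}
support = {'functional': 10685,
           'functional needs repair': 1396,
           'non functional': 7521}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(test_f1.values()) / len(test_f1)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(test_f1[c] * support[c] for c in test_f1) / total

print(total, round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average sits well below the weighted one because the smallest class, `functional needs repair`, is also the one the model handles worst.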

# Conclusions

Our final model's accuracy on unseen data was about `0.791`, or just under **80%**. This score represents the proportion of correct predictions, i.e. instances where our model correctly predicted a waterpoint's functionality status.

With regard to `non functional` wells specifically, our model attained a **precision score** of about **0.84** - intuitively, this means that 84% of waterpoints identified by the model as `non functional` were *true positives*, i.e. wells that were truly non-functional.

The **F1 scores** are the harmonic means of the model's precision and recall for each target class. In descending order, they are 0.83 for the `functional` class, 0.78 for the `non functional` class, and 0.33 for the `functional needs repair` class. While the F1 scores for the functional and non-functional classes are comparable to the overall accuracy, the F1 score for waterpoints that are functional but need repair is considerably lower - the model predicts this class with much less success.

## Next Steps

# References/Other