# Introduction

Access to safe drinking water is an important aspect of health, sanitation and hygiene. According to the World Health Organisation, some 2.2 billion people around the world do not have access to safe drinking water<sup>1</sup>. Investment in water supply and sanitation is important for public health because it prevents the spread of disease, and it can yield a net economic benefit. Technologies that can reliably identify safe drinking water will therefore facilitate the deployment of solutions that alleviate water shortages. This project uses machine learning algorithms to predict the potability of water bodies from labelled data of surveyed water quality metrics.

<sup>1</sup> <sub>https://www.who.int/news/item/18-06-2019-1-in-3-people-globally-do-not-have-access-to-safe-drinking-water-unicef-who</sub>

# Dataset

The `water_potability.csv` dataset contains 10 columns for 3276 different water bodies: 9 water quality metrics expressed as floating point values and a binary potability label. More details on the dataset can be found on [Kaggle](https://www.kaggle.com/adityakadiwal/water-potability).

#### 1. pH value:

pH is an important parameter in evaluating the acid–base balance of water. It also indicates whether the water is acidic or alkaline. WHO recommends a permissible pH range of 6.5 to 8.5. The values in the current investigation ranged from 6.52 to 6.83, which is within the WHO standard.

#### 2. Hardness:

Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, which is caused by calcium and magnesium.

#### 3. Solids (Total dissolved solids - TDS):

Water can dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium and sulfates. These minerals produce an unwanted taste and a diluted color in the water. TDS is an important parameter for the use of water: a high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L and the maximum limit prescribed for drinking purposes is 1000 mg/L.

#### 4. Chloramines:

Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.

#### 5. Sulfate:

Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

#### 6. Conductivity:

Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines its electrical conductivity. Electrical conductivity (EC) measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.

#### 7. Organic_carbon:

Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be below 2 mg/L in treated / drinking water and below 4 mg/L in source water that is used for treatment.

#### 8. Trihalomethanes:

THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppm are considered safe in drinking water.

#### 9. Turbidity:

The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

#### 10. Potability:

Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
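
The guideline limits quoted in the descriptions above can be turned into a quick screening check. The following is a minimal sketch rather than part of the original analysis: the `LIMITS` dictionary, the `within_limits` helper and the two sample rows are made up for illustration, and it assumes the CSV columns use the same units as the limits quoted above.

```python
import pandas as pd

# Guideline limits quoted in the descriptions above as (lower, upper);
# None means no bound. Assumes the CSV columns share these units.
LIMITS = {
    "ph": (6.5, 8.5),
    "Solids": (None, 1000.0),       # total dissolved solids
    "Chloramines": (None, 4.0),
    "Conductivity": (None, 400.0),
    "Turbidity": (None, 5.0),
}

def within_limits(df, limits=LIMITS):
    """Return a boolean DataFrame: True where a value respects its limit.

    Missing values compare as False, so they are flagged as out of range.
    """
    flags = pd.DataFrame(index=df.index)
    for column, (low, high) in limits.items():
        ok = pd.Series(True, index=df.index)
        if low is not None:
            ok &= df[column] >= low
        if high is not None:
            ok &= df[column] <= high
        flags[column] = ok
    return flags

# Two hypothetical samples: the first respects every limit, the second does not.
sample = pd.DataFrame({
    "ph": [7.0, 9.1],
    "Solids": [450.0, 20000.0],
    "Chloramines": [3.5, 7.2],
    "Conductivity": [380.0, 420.0],
    "Turbidity": [3.9, 4.1],
})
flags = within_limits(sample)
print(flags.all(axis=1))
```

A rule-based check like this is only a sanity baseline; the models built later learn their own decision boundaries from the labelled data.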

# Task

The task is to create a machine learning model to determine if the sampled water body is fit for human consumption.

# Strategy

1. Exploratory Data Analysis (EDA)

2. Data Preparation for Machine Learning.
    * Data stratification into training and test set.
    * Upsampling of training set.
    * Building transformation pipeline - dealing with missing values using mean imputation and standard scaling of dataset.

3. Model Building
    * Logistic Regression  
    * Decision Tree Classifier
    * Random Forest Classifier

4. Random Forest hyperparameter tuning
    * Randomised Search Cross Validation
    * Grid Search Cross Validation 

5. Model comparison
    * ROC Curves and accuracy
    

# Data Exploration

All feature columns are stored as `float64`; the only exception is "Potability", which is stored as `int64`. As the non-null counts show, `ph`, `Sulfate` and `Trihalomethanes` contain `NaN` values, which are later imputed with the mean of each column.

In [None]:
# Read in data
import pandas as pd

water_quality = pd.read_csv('../input/water-potability/water_potability.csv')
water_quality.info()

In [None]:
water_quality["Potability"] = water_quality["Potability"].astype("category")
type(water_quality.iloc[:,1].values[1])

In [None]:
water_quality.describe()

In [None]:
water_quality.isna().sum().values

The non-null counts differ between columns: there are 491, 781 and 162 missing values in `ph`, `Sulfate` and `Trihalomethanes` respectively.
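
To see which column each count belongs to, and how large it is relative to the dataset, the counts can be put into a labelled table. This is a small sketch on made-up data; `missing_summary` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(1),
    })
    # Keep only the columns that actually contain missing values
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame standing in for water_quality
toy = pd.DataFrame({
    "ph": [7.1, np.nan, 6.8, np.nan],
    "Sulfate": [330.0, 310.0, np.nan, 350.0],
    "Turbidity": [3.9, 4.1, 2.8, 3.3],
})
print(missing_summary(toy))
```

With 3276 rows, the counts reported above correspond to roughly 15% (`ph`), 24% (`Sulfate`) and 5% (`Trihalomethanes`) of the entries.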

In [None]:
water_quality.loc[water_quality["Potability"] == 1].describe()

In [None]:
water_quality.loc[water_quality["Potability"] == 0].describe()

## Data distribution

In [None]:
import matplotlib.pyplot as plt

# Class balance of the target column
potability = water_quality["Potability"].hist()
potability.set_title("Potability")

# Histograms of the numeric features; set the title before showing the figure
water_quality.hist(bins = 50, figsize = (20, 10))
plt.suptitle("Water Quality Distribution plots", fontsize = 25)
plt.show()

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows = 3, ncols = 3, figsize = (20, 10))
fig.suptitle("Water Quality Box plots", fontsize = 25)
fig.subplots_adjust(hspace=0.5)
counter = 0

for row_idx in range(3):
    for col_idx in range(3):
        axs[row_idx, col_idx].boxplot(water_quality.iloc[:,counter].dropna(), vert = False)
        axs[row_idx, col_idx].set_title(water_quality.columns[counter])
        counter += 1


The `Solids` data are right skewed while the `Sulfate` data are slightly left skewed. Overall, the features are still approximately normally distributed. The histogram of potability indicates an unbalanced dataset, which may bias the model during training.
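
Skewness read off a histogram can be backed up with a number. The sketch below uses pandas' `DataFrame.skew()` on made-up columns: positive values indicate a right tail, negative values a left tail, and values near zero a roughly symmetric distribution.

```python
import pandas as pd

# Made-up columns with a right tail, a left tail and a symmetric shape
toy = pd.DataFrame({
    "right_tail": [1.0, 1.0, 2.0, 2.0, 3.0, 10.0],
    "left_tail": [-10.0, -3.0, -2.0, -2.0, -1.0, -1.0],
    "symmetric": [-2.0, -1.0, 0.0, 0.0, 1.0, 2.0],
})

# Sample skewness per column; missing values are skipped by default
skewness = toy.skew()
print(skewness)
```

On the real data the same call would be `water_quality.skew(numeric_only = True)`, assuming the frame loaded earlier.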

# Data Prep for ML
## Splitting datasets into train and test set

* Stratify the dataset to split into training and testing set.
* Training dataset is up-sampled to address the imbalance in the potability dataset. This should address potential bias in the model as a result of imbalanced training dataset.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.utils import resample
from sklearn.utils import shuffle

# Data stratification
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(water_quality, water_quality["Potability"]):
    strat_train_set = water_quality.loc[train_index]
    strat_test_set = water_quality.loc[test_index]
    
# Upsampling of data: resample the potable (minority) class with replacement
# until it matches the size of the non-potable (majority) class
potable = strat_train_set.loc[strat_train_set["Potability"] == 1]
not_potable  = strat_train_set.loc[strat_train_set["Potability"] == 0]
potable = resample(potable, replace = True, n_samples = len(not_potable), random_state = 42)
strat_train_set = pd.concat([potable, not_potable])
strat_train_set = shuffle(strat_train_set, random_state = 42)

strat_train_set["Potability"].hist()
plt.title("Training data: Potability")
plt.xlabel("Value")
plt.ylabel("Count")

In [None]:
strat_test_set["Potability"].value_counts()/len(strat_test_set)

In [None]:
water_quality["Potability"].value_counts()/len(water_quality)

The data was first stratified into training and test sets, and the training set was then upsampled to balance the Potability classes. The test set was not upsampled, to preserve data quality, and keeps the class proportions of the original dataset: roughly a 60/40 percent split of non-potable vs potable entries.
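
The two properties described above (class proportions preserved by stratification, and balance after upsampling the training part only) can be checked on a small example. This sketch uses made-up data and `train_test_split(..., stratify=...)` as a compact stand-in for `StratifiedShuffleSplit`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data (made up): 60 non-potable (0) and 40 potable (1) samples
labels = np.array([0] * 60 + [1] * 40)
features = np.arange(100, dtype=float).reshape(-1, 1)

# A stratified 80/20 split keeps the 60/40 class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

# Upsample the minority class of the training part only, with replacement,
# until it matches the size of the majority class
n_majority = int((y_train == 0).sum())
X_minority_up = resample(
    X_train[y_train == 1], replace=True, n_samples=n_majority, random_state=42
)
X_balanced = np.vstack([X_train[y_train == 0], X_minority_up])
y_balanced = np.concatenate(
    [np.zeros(n_majority, dtype=int), np.ones(n_majority, dtype=int)]
)

print("test share of potable:", y_test.mean())
print("balanced training counts:", np.bincount(y_balanced))
```

The test part is left untouched, so its class ratio still reflects the original data.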

## Transformation pipeline

1. Mean imputation of missing values.
2. Standard scaling of data.

A transformation pipeline is created to standardise the data transformations for all data used to train the models. Specifically, the pipeline is fitted on the training data set only. The fitted values are then used to transform the test data set in order to compare performance between models.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Create pipeline: mean imputation followed by standard scaling
pipeline = Pipeline([
    ("mean_imputer", SimpleImputer(strategy = "mean")),
    ("std_scaler", StandardScaler())
])

# Predictors: fit the pipeline on the training features and transform them
training_pred = pipeline.fit_transform(strat_train_set.drop("Potability", axis = 1))

# Response: potability labels
training_labels = strat_train_set["Potability"].values


# Inspect the transformed training data
training_pred = pd.DataFrame(training_pred)
training_pred
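
To make the "fit on training data, transform test data" idea concrete, here is a minimal sketch on a single made-up column. The arrays and the name `toy_pipeline` are illustrative only; the point is that the missing test value is filled with the training mean and scaled with the training statistics.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One made-up feature column; the observed training values have mean 2.0
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [4.0]])

toy_pipeline = Pipeline([
    ("mean_imputer", SimpleImputer(strategy="mean")),
    ("std_scaler", StandardScaler()),
])

# Fit on the training data only, then reuse the fitted values on the test data
toy_pipeline.fit(train)
test_scaled = toy_pipeline.transform(test)

# The missing test value becomes the training mean (2.0), which scales to 0
print(test_scaled.ravel())
```

Fitting the pipeline on the test data as well would leak information about the test set into the preprocessing step.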


# Model Building

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn import metrics

def display_scores(scores):
    print("Scores:", scores)
    print("Means:", scores.mean())
    print("Standard Deviation:", scores.std())

logistic_model = LogisticRegression(solver = "liblinear", random_state = 42)
logistic_model.fit(training_pred, training_labels)
log_scores = cross_val_score(logistic_model, 
                             training_pred, 
                             training_labels,
                             scoring = "accuracy", cv = 10)
display_scores(log_scores)

## Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state = 42)
dt.fit(training_pred, training_labels)
dt_scores = cross_val_score(dt, 
                           training_pred,
                           training_labels,
                           scoring = "accuracy", cv = 10)
display_scores(dt_scores)

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_untuned = RandomForestClassifier(random_state = 42)
rf_untuned.fit(training_pred, training_labels)
rf_scores = cross_val_score(rf_untuned, training_pred, training_labels,
                               scoring = "accuracy")
display_scores(rf_scores)

In [None]:
model_scores = {"mean_cv_accuracy":[np.mean(log_scores),
                               np.mean(dt_scores),
                               np.mean(rf_scores)]}
model_scores = pd.DataFrame(data = model_scores, index = ["log", "dt", "rf"])
model_scores

## Random forest hyperparameter tuning by Randomised Search

The following lines of code were adapted from https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

In [None]:
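# The following lines of code were adapted from
# https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74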

from sklearn.model_selection import RandomizedSearchCV

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Create the random grid
random_grid = {'n_estimators': [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
               'criterion': ['gini', 'entropy'],
               'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
               'max_depth': max_depth,
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True, False],
               'random_state' : [42]}
rf = RandomForestClassifier(random_state = 42)
rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid, 
                               n_iter = 100, cv = 3, 
                               verbose=2, n_jobs = -1, random_state = 42)
rf_random.fit(training_pred, training_labels)

In [None]:
rf_random.best_params_

## Random forest hyperparameter tuning by Grid Search

Randomised search resulted in parameters where:

* 'n_estimators': 1800,
* 'min_samples_split': 5,
* 'min_samples_leaf': 1,
* 'max_features': 'sqrt',
* 'max_depth': 70,
* 'criterion': 'gini',
* 'bootstrap': False
 
The best parameters from the randomised search seed a more intensive grid search over ranges centred on those values, in order to find the best performing model.

#### First Grid Search

In [None]:
# Gridsearch for best parameter.
# import joblib

from sklearn.model_selection import GridSearchCV

param_1 = {
    "n_estimators" : [int(x) for x in np.linspace(start = 1600, stop = 2000, num = 5)],
    "min_samples_split" : [5],
    "min_samples_leaf" : [1],
    "max_features" : ['sqrt'],
    "max_depth" : [int(x) for x in np.linspace(start = 50, stop = 100, num = 5)],
    "criterion" : ['gini'],
    "bootstrap" : [False],
    "random_state" : [42]
}

param_grid = [param_1]
rf = RandomForestClassifier(random_state = 42)
rf_grid_search = GridSearchCV(rf, param_grid, cv = 5, scoring = "accuracy",
                             return_train_score = True, n_jobs = -1, verbose = 2)
rf_grid_search.fit(training_pred, training_labels)

rf_grid_search.best_estimator_.get_params()

# Export best estimator into pkl file: Uncomment next line to export model into a file
# joblib.dump(rf_grid_search.best_estimator_, "random_forest_best_estimator.pkl")

#### Second Grid Search

A second grid search is required because `n_estimators` of the best estimator lies at the lower bound of the range defined in the first grid search.
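
Whether a best value sits on the edge of its grid can be checked programmatically instead of by eye. The helper below is a hypothetical sketch (`on_grid_edge` is not a scikit-learn function); the grid mirrors the first grid search, while the `max_depth` entry in `best` is a placeholder rather than a reported result.

```python
def on_grid_edge(best_params, param_grid):
    """Map each numeric parameter whose best value is the grid minimum or
    maximum to 'lower' or 'upper'. Single-valued entries are ignored."""
    edges = {}
    for name, values in param_grid.items():
        numeric = [v for v in values
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        if len(numeric) < 2:
            continue
        best = best_params[name]
        if best == min(numeric):
            edges[name] = "lower"
        elif best == max(numeric):
            edges[name] = "upper"
    return edges

# Grid values of the first grid search
grid = {
    "n_estimators": [1600, 1700, 1800, 1900, 2000],
    "max_depth": [50, 62, 75, 87, 100],
    "bootstrap": [False],
}
# n_estimators at the lower bound as described above; max_depth is a placeholder
best = {"n_estimators": 1600, "max_depth": 75, "bootstrap": False}

print(on_grid_edge(best, grid))
```

In the notebook this could be called as `on_grid_edge(rf_grid_search.best_params_, param_1)`; any parameter it returns is a candidate for a shifted range in the next search.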

In [None]:
# Gridsearch for best parameter.
import joblib
from sklearn.model_selection import GridSearchCV
param_2 = {
    "n_estimators" : [int(x) for x in np.linspace(start = 1300, stop = 1600, num = 5)],
    "min_samples_split" : [5],
    "min_samples_leaf" : [1],
    "max_features" : ['sqrt'],
    "max_depth" : [int(x) for x in np.linspace(start = 30, stop = 55, num = 5)],
    "criterion" : ['gini'],
    "bootstrap" : [False],
    "random_state" : [42]
}

param_grid = [param_2]
rf = RandomForestClassifier(random_state = 42)
rf_grid_search = GridSearchCV(rf, param_grid, cv = 5, scoring = "accuracy",
                             return_train_score = True, n_jobs = -1, verbose = 2)
rf_grid_search.fit(training_pred, training_labels)

rf_grid_search.best_estimator_.get_params()

# Export best estimator into pkl file
#joblib.dump(rf_grid_search.best_estimator_, "random_forest_best_estimator.pkl")

#### Cross validation of tuned random forest model

In [None]:
#rf_best = joblib.load("random_forest_best_estimator.pkl")
rf_best = rf_grid_search.best_estimator_
rf_best.fit(training_pred, training_labels)
forest_scores = cross_val_score(rf_best, training_pred, 
                                training_labels,
                               scoring = "accuracy")
display_scores(forest_scores)

In [None]:
pd.DataFrame([{'Untuned_Rf':rf_scores.mean(),
               'Tuned_Rf':forest_scores.mean()}], 
             index = ["Average Scores"])

Random forest hyperparameter tuning resulted in a small increase in accuracy of 1.2%, as measured by K-fold cross validation on the training dataset. The best performing tuned random forest model reached a final cross-validated accuracy of 85.7%.
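
One caveat when reading these cross-validated scores: the training set was upsampled before cross validation, so copies of the same row can land in both the training and the validation part of a fold, which may make the scores optimistic. A possible alternative, sketched below on synthetic data, is to upsample inside each fold. This is not the procedure used in this notebook; `cv_with_fold_upsampling` is a hypothetical helper, it assumes binary labels, and unlike the upsampling above it keeps every original minority row and only adds resampled copies.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

def cv_with_fold_upsampling(model, X, y, n_splits=5, seed=42):
    """Accuracy per fold, upsampling the minority class in the training part only."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in folds.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        classes, counts = np.unique(y_tr, return_counts=True)
        minority = classes[np.argmin(counts)]
        gap = int(counts.max() - counts.min())
        if gap > 0:
            # Add resampled minority rows until both classes have equal size
            mask = y_tr == minority
            X_extra, y_extra = resample(
                X_tr[mask], y_tr[mask], replace=True, n_samples=gap, random_state=seed
            )
            X_tr = np.vstack([X_tr, X_extra])
            y_tr = np.concatenate([y_tr, y_extra])
        # Fit a fresh copy so the model passed in is left untouched
        fold_model = clone(model).fit(X_tr, y_tr)
        scores.append(accuracy_score(y[valid_idx], fold_model.predict(X[valid_idx])))
    return np.array(scores)

# Synthetic stand-in for the water data: 9 features, roughly 60/40 classes
X_toy, y_toy = make_classification(
    n_samples=300, n_features=9, weights=[0.6, 0.4], random_state=42
)
fold_scores = cv_with_fold_upsampling(
    RandomForestClassifier(n_estimators=50, random_state=42), X_toy, y_toy
)
print(fold_scores.mean())
```

The validation part of each fold is never resampled, so each fold score is measured on the original class ratio.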

# Model Validation (ROC & Confusion Matrix) - Test Set

The models are validated on the stratified test data, which is transformed using the values fitted on the training data. Missing values in the test set are therefore imputed with the column means of the training data, and the test set is scaled with the mean and standard deviation estimated from the training data. Model performance is evaluated by plotting the Receiver Operating Characteristic (ROC) curves and the confusion matrix of the model predictions.


In [None]:
# Transform test data on fitted values of training data.
test_features = pipeline.transform(strat_test_set.drop("Potability", axis = 1))
test_labels = strat_test_set.loc[:,"Potability"]

model_list = {"DecisionTree":dt,
              "Log Reg":logistic_model,
             "RF": rf_untuned,
             "Tuned_RF": rf_best}

# Function to create ROC plot of multiple models.
def plot_multi_roc(models, features, labels):
    
    from sklearn.metrics import roc_curve, roc_auc_score
    from matplotlib import pyplot as plt
    
    if not isinstance(models, dict):
        raise TypeError("models must be a dict mapping a name to a fitted model")
        
    # Iterate over the dict passed in rather than the global model_list
    for name, model in models.items():
        predictions = model.predict_proba(features)[:,1]
        auc = roc_auc_score(labels, predictions)
        fig_label = "%s AUC=%.3f" % (name, auc)
        fpr, tpr, thresholds = roc_curve(labels, predictions)
        plt.plot(fpr, tpr, label = fig_label)
    
    # Diagonal reference line of a random classifier
    plt.plot([0,1],[0,1], linestyle = '--')
    plt.legend()
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    
plot_multi_roc(model_list, test_features, test_labels)

Hyperparameter tuning of the random forest classifier resulted in only marginal gains in performance compared to the default hyperparameters: AUC values of 0.654 and 0.658 were obtained for the default and the tuned random forest models respectively. A different classifier may be better suited to this task. Any feedback on how I could better tune the Random Forest model would be appreciated.
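
For intuition on what these AUC values mean: the AUC equals the probability that a randomly chosen potable sample receives a higher predicted score than a randomly chosen non-potable one, so 0.658 means this happens about 66% of the time. The sketch below checks that reading on made-up labels and scores; `pairwise_auc` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
toy_labels = np.array([0, 0, 1, 0, 1, 1])
toy_scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

print(pairwise_auc(toy_labels, toy_scores))   # 8 of 9 pairs are ranked correctly
print(roc_auc_score(toy_labels, toy_scores))  # same value from scikit-learn
```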

In [None]:
def draw_confusion_matrix(model, features, labels, threshold = 0.5):
    
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score
    
    # Classify as potable (1) when the predicted probability exceeds the threshold
    predicted_prob = model.predict_proba(features)
    predicted_prob = pd.DataFrame(predicted_prob)
    predicted_prob["model_classification"] = np.where(predicted_prob.loc[:,1] > threshold, 1, 0)
    
    # Compute the confusion matrix once and reuse it for the plot and the metrics
    cm = confusion_matrix(labels, predicted_prob["model_classification"])
    tn, fp, fn, tp = cm.ravel()
    disp = ConfusionMatrixDisplay(confusion_matrix = cm,
                             display_labels = model.classes_)
    disp.plot()
    print("True Negative: %s\nFalse Positives: %s\nFalse Negative: %s\nTrue Positives: %s" %(tn, fp, fn, tp))
    print("Sensitivity: %.4f\nSpecificity: %.4f\nPrecision: %.4f" %((tp/(tp+fn)),(tn/(tn+fp)),(tp/(tp+fp))))
    print("F1 Score: %.4f " %f1_score(labels, predicted_prob["model_classification"]))

draw_confusion_matrix(rf_best, test_features, test_labels, threshold = 0.5)
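
The `threshold` argument above is fixed at 0.5, but it can also be chosen to meet a target precision. The sketch below is one possible way to do that with `precision_recall_curve` on made-up labels and scores; `threshold_for_precision` is a hypothetical helper. Note that scikit-learn classifies a sample as positive when its score is greater than or equal to the threshold, whereas `draw_confusion_matrix` uses a strict comparison.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(labels, scores, min_precision):
    """Smallest threshold whose precision reaches min_precision, or None."""
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    # precision and recall have one more entry than thresholds; drop the last point
    ok = precision[:-1] >= min_precision
    if not ok.any():
        return None
    return float(thresholds[ok][0])

# Made-up labels and predicted probabilities of the positive class
toy_labels = np.array([0, 0, 1, 0, 1, 1])
toy_scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

# A threshold of 0.4 yields 3 true positives out of 4 predicted positives
print(threshold_for_precision(toy_labels, toy_scores, 0.75))
# From 0.7 upwards every predicted positive is a true positive
print(threshold_for_precision(toy_labels, toy_scores, 0.9))
```

To avoid tuning on the test set, a threshold like this should be selected on held-out training data (for example cross-validated probabilities) before being passed to `draw_confusion_matrix`.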

# Conclusions

The tuned Random Forest algorithm was the best performing model for predicting water potability, with an AUC of 0.658, a sensitivity of 27.3%, a precision of 73.7%, a specificity of 93.8% and an F1 score of 0.39. In the context of water potability, false positives (predicting that water is potable when it is not) have greater consequences than false negatives (predicting that water is not potable when it is). Precision therefore carries greater weight than sensitivity. Nonetheless, the model falsely classifies a significant proportion (73%) of drinkable water as not potable and will require further tuning. Any feedback on this work will be appreciated!