<a href="https://colab.research.google.com/github/rafaelhora/water-potability/blob/main/EDA_water_potability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

## Context

According to the NGO Water.org, 785 million people lack access to safe water and 2 billion people lack access to a toilet. Nearly 1 million people die each year from water-, sanitation- and hygiene-related diseases. It is a health crisis that receives little attention in mainstream media. 

![image](https://cloudfront-eu-central-1.images.arcpublishing.com/larazon/Q3KPFJKNJVHN7LR5ECYD5C7J4U.jpg)

Source: La Razón

Improving access to potable water can reduce child and maternal mortality rates, increase household income and interrupt one aspect of the cycle of poverty. Increasing water-testing capacity could facilitate this process. Poor water quality is estimated to generate 260mi USD in losses annually (Sadoff et al).
 

## Feature Details


**pH value**: pH is an important parameter in evaluating the acid–base balance of water. **WHO recommends a permissible pH range of 6.5 to 8.5.** The current investigation ranges were 6.52–6.83, which are within the WHO standards.

**Hardness**: Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. 

**Solids** (Total dissolved solids - TDS): Water has the ability to dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc. This is an important parameter for the use of water. **The desirable limit for TDS is 500 mg/L and the maximum limit prescribed for drinking purposes is 1000 mg/L.**

**Chloramines**: Chlorine and chloramine are the major disinfectants used in public water systems. **Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.**

**Sulfate**: Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

**Conductivity**: **Pure water is not a good conductor of electric current; rather, it is a good insulator.** An increase in ion concentration enhances the electrical conductivity of water. **According to WHO standards, the EC value should not exceed 400 μS/cm.**

**Organic_carbon**: Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. **According to the US EPA, TOC should be < 2 mg/L in treated / drinking water and < 4 mg/L in source water used for treatment.**

**Trihalomethanes**: THMs are chemicals which may be found in water treated with chlorine. **THM levels up to 80 ppm are considered safe in drinking water.**

**Turbidity**: The turbidity of water depends on the quantity of solid matter present in the suspended state. **The WHO recommended value is < 5.00 NTU.**

**Potability**: Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
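
As a rough illustration of how the limits quoted above could be applied to a single sample, the sketch below stores them in a dictionary and reports which features fall outside their range. The `LIMITS` table, the `out_of_range` helper and the sample values are hypothetical and exist only for this example; the notebook itself does not use them.

```python
# Limits quoted in the feature descriptions above; keys follow the dataset's column names.
# Each entry is (low, high); None means there is no bound on that side.
LIMITS = {
    "ph": (6.5, 8.5),
    "Solids": (None, 1000.0),        # maximum TDS
    "Chloramines": (None, 4.0),
    "Conductivity": (None, 400.0),
    "Organic_carbon": (None, 2.0),   # treated / drinking water
    "Trihalomethanes": (None, 80.0),
    "Turbidity": (None, 5.0),
}

def out_of_range(sample):
    """Return the names of the features in `sample` that violate LIMITS."""
    violations = []
    for name, (low, high) in LIMITS.items():
        value = sample.get(name)
        if value is None:
            continue  # feature not measured, nothing to check
        if (low is not None and value < low) or (high is not None and value > high):
            violations.append(name)
    return violations

# Hypothetical sample, made up for illustration only
sample = {"ph": 7.1, "Solids": 20000.0, "Chloramines": 7.3, "Turbidity": 3.9}
print(out_of_range(sample))
```

A check like this is only a sanity filter on individual measurements; it says nothing about how the features interact, which is what the classifiers later in the notebook try to learn.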

# Predicting the Potability of Water using Machine Learning

## Data imports and cleaning

In [None]:
!pip install matplotlib==3.4.3 #the default Colab version of matplotlib is outdated and doesn't support some methods used in this notebook

In [None]:
!pip install xgboost==1.4.2 #pin xgboost so that an outdated Colab version is not used

In [None]:
#imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.integrate #import the submodule explicitly; a bare `import scipy` may not load it
from sklearn.metrics import confusion_matrix, classification_report, recall_score
import os

In [None]:
#importing dataset 

pwd = os.getcwd()
dataset = pd.read_csv('../input/water-potability/water_potability.csv')
original_dataset = dataset.copy()
dataset

In [None]:
dataset.info()

We will be dealing with only numerical variables. 

It also can be noted that this dataset has columns with Null values. 

In [None]:
#visualizing with null values
dataset.isnull().sum().reset_index()

We will impute the median value of each column into its Null cells. 

Note, however, that the people responsible for applying this study could ask for samples with unreliable measurements to be dropped entirely, given the risks involved in applying this model. 
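
To make the two options concrete, here is a minimal sketch on a toy frame with made-up values (not the real dataset). It contrasts filling gaps with each column's median against dropping incomplete rows.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "ph": [6.8, np.nan, 7.4, 7.0],
    "Sulfate": [330.0, 310.0, np.nan, 350.0],
})

imputed = toy.fillna(toy.median())   # option 1: fill gaps with each column's median
dropped = toy.dropna()               # option 2: discard rows with any missing value

print(imputed)
print(len(toy), len(dropped))
```

Imputation keeps every row at the cost of inventing values, while dropping keeps only measured values at the cost of a smaller dataset.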

In [None]:
#imputing missing values
dataset.fillna(value=dataset.median(), inplace=True)
dataset

In [None]:
dataset.isnull().sum().reset_index()

No more Null values.

## Exploratory analysis

In [None]:
#defining color palette for this EDA
colors = ['salmon', 'tab:blue', 'tab:purple', 'tab:orange', 'tab:green', 'tab:pink', 'tab:grey', 'tab:olive', 'tab:red', 'tab:cyan']

In [None]:
corr = dataset.corr()

mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(figsize = (10,8))
sns.heatmap(corr, ax =ax, annot = True, mask = mask)


plt.show()

As the correlation matrix shows, no variable has a strong linear correlation with any other. This suggests that linear models may be outperformed by non-linear or probabilistic models in this study. 
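
A low Pearson coefficient only rules out a linear relationship, not a relationship in general. The sketch below uses synthetic data (not the dataset) in which one variable is fully determined by the other, yet the linear correlation is essentially zero.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # values symmetric around zero
y = x ** 2                        # y is fully determined by x, but not linearly

pearson_r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {pearson_r:.3f}")
```

This is why a flat correlation matrix is a reason to try non-linear models rather than a reason to give up on the features.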

In [None]:
dataset.describe()

Some samples have a pH of 0, which according to my sources is an extremely acidic value (similar to the acidity of battery acid). The maximum value among the samples is 14, which is an extremely alkaline value (comparable to bleach). 

Source: https://www.healthline.com/health/ph-of-drinking-water

Let's investigate the values with pH < 5 and pH > 9 as a sanity check.

In [None]:
dataset[dataset.ph < 5].groupby('Potability')['ph'].count()

In [None]:
dataset[dataset.ph > 9].groupby('Potability')['ph'].count()

We can see that 184 samples are classified as potable even though they sit at the extremes of the pH spectrum. This could be the result of input error, measurement error, etc. 

I decided to avoid these instances and will drop them from our dataset.

In [None]:
dataset.head()

In [None]:
dataset = dataset.drop(dataset[dataset.ph > 9].index, axis = 'index')
dataset = dataset.drop(dataset[dataset.ph < 5].index, axis = 'index')

In [None]:
dataset[((dataset.ph < 5) | ((dataset.ph > 9)))].groupby('Potability')['ph'].count()

The query above returns an empty series, which shows that we successfully deleted the rows with unreliable pH values.
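
The same filter can also be written as a single boolean mask. The sketch below uses a toy frame with made-up values; it is an alternative formulation, not the code this notebook ran.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({"ph": [0.2, 5.0, 7.1, 9.0, 13.5], "Potability": [1, 0, 1, 0, 1]})

# Series.between is inclusive on both ends by default, so pH 5 and pH 9 are kept,
# which matches dropping only the rows with ph < 5 or ph > 9.
filtered = toy[toy["ph"].between(5, 9)]
print(filtered)
```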

In [None]:
fig, ax = plt.subplots(figsize = (10,8))

dataset.groupby('Potability').size().plot(kind='pie', ax = ax, labels = ['Not Potable', 'Potable'], autopct = '%1.2f%%', colors = ['salmon', 'tab:blue'], shadow = True, explode = (0.08, 0)) #colors must be a list: a set has no guaranteed order

plt.title('Water Potability')
fig.set_facecolor('white')
plt.tight_layout()
plt.show()

This dataset shows a slight imbalance towards not potable results, but a 6:4 proportion does not require the use of oversampling techniques. 
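
Even without resampling, it can help to preserve the class ratio when the data is split. The sketch below uses made-up labels with a 6:4 ratio and the `stratify` argument of scikit-learn's `train_test_split`; the notebook's own split does not stratify, so this is shown as an option only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels with a 6:4 class ratio
y_toy = pd.Series([0] * 60 + [1] * 40)
X_toy = pd.DataFrame({"feature": range(100)})

# stratify keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print(y_toy.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```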

In [None]:
print(dataset.columns)

In [None]:
fig, ax = plt.subplots(3,3, figsize = (12,24), squeeze = False)
columns = ['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity']

j = 0

for column in columns:
    sns.violinplot(y = dataset[column], ax = ax.flat[j],color = colors[j])
    sns.swarmplot(y = dataset[column], ax = ax.flat[j], size = 2, color = 'black')
    ax.flat[j].set_title(column + ' values:')
    j = j + 1
    

plt.tight_layout()
plt.show()

In [None]:
#checking distribution of data for scaling method selection
fig, axes = plt.subplots(5, 2, squeeze=False, figsize = (16,16))
fig.delaxes(axes.flat[9])
column = 0

for ax in axes.flat:
    p = sns.kdeplot(dataset.iloc[:, column], ax = ax, color = colors[column])
    
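    #plotting median line
    x,y = p.get_lines()[0].get_data()
    cdf = scipy.integrate.cumulative_trapezoid(y, x, initial = 0) #cumtrapz is the older name of this function and was removed from recent SciPy releases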
    nearest_05 = np.abs(cdf - 0.5).argmin()
    x_median = x[nearest_05]
    y_median = y[nearest_05]
    ax.vlines(x_median, 0, y_median, colors = 'black')

    column +=1

plt.suptitle('Graph shows that some variables have skewed distributions', size = 22)
plt.tight_layout()
plt.show()

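Since our data is skewed, a logarithmic transform is a natural candidate. In the code below it is left commented out, and the variables are standardized with StandardScaler instead. 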
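
As a minimal sketch of what a logarithmic transform does to a skewed variable, the example below applies `np.log1p` to synthetic log-normal data (an assumption for illustration, not data from this notebook) and compares the skewness before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed data (log-normal), not taken from the dataset
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

transformed = np.log1p(skewed)   # log(1 + x), which is defined at x = 0

print(f"skewness before: {skewed.skew():.2f}")
print(f"skewness after:  {transformed.skew():.2f}")
```

Using `log1p` rather than a plain logarithm matters here because a variable such as pH can be exactly 0, where the plain logarithm is undefined.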

## Scaling numeric variables


In [None]:
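#scaling variables
from sklearn.preprocessing import StandardScaler
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1] #1-D target avoids column-vector warnings in scikit-learn
#X = np.log(X + 1) #we add 1 because log(0) is undefined and pH can be 0

#separating training and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)

#fit the scaler on the training set only and keep the scaled results
sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(sc.transform(X_test), columns=X.columns, index=X_test.index)
X = pd.DataFrame(sc.transform(X), columns=X.columns, index=X.index) #scaled copy of all samples, so later uses of X match the training data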
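
One way to make sure the scaler is always fitted on training data only is to bundle it with the model in a scikit-learn pipeline. This is a sketch on synthetic data from `make_classification`, with logistic regression chosen arbitrarily; it is an alternative to scaling by hand, not what this notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the water samples
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics when predicting on the test data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
demo_pred = pipe.predict(X_te)
print(demo_pred[:10])
```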

## Testing classifiers with standard hyperparameters

The function below visualizes each classification model's performance with a confusion matrix and returns the recall, which we use as the main metric. 

The costliest error in this problem is a false positive, since recommending the consumption of non-potable water can lead to health problems. Keep in mind that the recall of the positive class, TP / (TP + FN), is lowered by false negatives; false positives lower precision instead, so the precision column of the classification report and the confusion matrix should be read alongside the recall. 

At the end, the model with the best recall score will be chosen for hyperparameter tuning. 
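
To make the relationship between the confusion matrix, recall and precision concrete, here is a small worked example on made-up labels (1 = potable, 0 = not potable). It computes both metrics by hand and checks them against scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = potable, 0 = not potable
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)      # share of truly potable samples that were found
precision = tp / (tp + fp)   # share of "potable" predictions that were right

print(recall, recall_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
```

In this toy case there are three false positives and two false negatives: the false positives pull precision down to 0.4, while recall (0.5) is affected only by the false negatives.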

In [None]:
#code to visualize model performance

def view_performance (y_test, y_pred, model = ""):
    model = str(model)

    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, xticklabels=['Negative' , 'Positive'], yticklabels=['Negative' , 'Positive'], cmap = 'rocket', fmt = 'g')
    plt.ylabel("Label")
    plt.xlabel("Predicted")
    title = f'Confusion Matrix for {model} classifier'
    plt.title(title)
    plt.show() #the call needs parentheses, otherwise nothing is executed
    cls_report = classification_report(y_test, y_pred)
    model_recall = recall_score(y_test, y_pred, average = 'binary')
    print(model, 'classifier results: \n\n\n', cls_report)
    return model_recall

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier_lr = LogisticRegression(random_state = 0)
classifier_lr.fit(X_train, y_train)
y_pred = classifier_lr.predict(X_test)

In [None]:
recall_log= view_performance(y_test, y_pred, model = 'Logistic')

### SVC

In [None]:
# Training the Kernel SVM model on the Training set
from sklearn.svm import SVC
classifier_svm = SVC(kernel = 'rbf', random_state = 1) # degree only applies to the 'poly' kernel, so it is omitted
classifier_svm.fit(X_train, y_train)
y_pred = classifier_svm.predict(X_test)

In [None]:
recall_sv = view_performance(y_test, y_pred, model = 'SVM')

### Naïve Bayes

In [None]:
# Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier_nb = GaussianNB()
classifier_nb.fit(X_train, y_train)
y_pred = classifier_nb.predict(X_test)

In [None]:
recall_nb = view_performance(y_test, y_pred, model = "Naïve Bayes")

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier_rt = RandomForestClassifier(n_estimators=100, random_state=1)
classifier_rt.fit(X_train, y_train)
y_pred = classifier_rt.predict(X_test)

In [None]:
recall_rf = view_performance(y_test, y_pred, model = "Random Forest")

### XGBoost Classifier

In [None]:
import xgboost as xg

classifier_xg = xg.XGBClassifier(use_label_encoder=False)
classifier_xg.fit(X_train, y_train)
y_pred = classifier_xg.predict(X_test)

In [None]:
recall_xg = view_performance(y_test, y_pred, model = 'XGBoost')

## Comparing recall scores 

In [None]:
perf = pd.DataFrame.from_dict({'scores':[recall_nb, recall_xg, recall_log, recall_rf, recall_sv], 'models': ['Naive Bayes', 'XGBoost', 'Logistic', 'Random Forest', 'SVM']})

fig, ax = plt.subplots(1,1, figsize= (12,8), squeeze=False)
plot = sns.barplot(data = perf, y = 'scores', x='models')
plt.bar_label(plot.containers[0], fmt ='%.4f', size = 14)

plt.title('XGBoost has the best recall-score of all models', size = 22)
plt.ylabel('recall-score', size = 12)
plt.xlabel('Classification model', size = 12)
plt.tight_layout()
plt.show()

The logistic and SVM models were largely ineffective for this dataset, which is consistent with our EDA, where the variables showed no strong linear correlation. 
 
**Tree-based models (XGBoost and random forest) gave the best recall scores.**

Also, the confusion matrices of all classification models show a very high rate of false positives. Since we are studying water potability prediction, this is an important issue: a false positive implies that people will consume water that may harm their health, so we should keep this in mind when tuning our model.
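
One common way to trade false positives for false negatives is to raise the probability threshold above which a sample is labeled potable. The sketch below illustrates the idea on synthetic data with a logistic regression; the `false_positives` helper and the thresholds are assumptions made for this example, and the notebook does not apply this technique.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the water samples
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
proba_positive = clf.predict_proba(X_te)[:, 1]   # estimated P(class 1)

def false_positives(threshold):
    """Count false positives when predicting 1 only if P(class 1) >= threshold."""
    pred = (proba_positive >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    return int(fp)

for t in (0.5, 0.7, 0.9):
    print(t, false_positives(t))
```

Raising the threshold can never increase the number of false positives, but it usually lowers recall, so the threshold has to be chosen with both costs in mind.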

## Hyperparameter tuning

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

parameters = { 'eta': np.arange(0.01, 0.4, 0.01),
'min_child_weight': np.arange(1, 10, 1),
'max_depth': np.arange(2, 10, 1),
'gamma': np.arange(0.5, 1, 0.05),
'subsample': np.arange(0.5, 1, 0.05),
'colsample_bytree': np.arange(0.5, 1, 0.05),
'lambda': np.arange(1, 2, 0.1),
}

model = XGBClassifier(verbosity =0)
classifier = RandomizedSearchCV(model, parameters, n_iter = 2000, cv = 5, verbose=0, scoring = 'recall')

In [None]:
classifier.fit(X_train, y_train)
print(classifier.best_estimator_)

In [None]:
#saving model with best parameters
import pickle

filename = 'classifier.sav'

with open(filename, 'wb') as f: #the with-block closes the file after writing
    pickle.dump(classifier, f)
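
A saved model can be loaded back with `pickle.load`. The round trip is sketched below with a small decision tree on synthetic data and a temporary file; the model, data and file name are placeholders for illustration, not the tuned classifier from this notebook.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small placeholder model
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)
demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), "demo_classifier.sav")  # hypothetical file name

with open(path, "wb") as f:      # the with-block closes the file after writing
    pickle.dump(demo_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = (restored.predict(X_demo) == demo_model.predict(X_demo)).all()
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.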

In [None]:
y_pred = classifier.predict(X_test)

recall_xghp = view_performance (y_test, y_pred, model = "XGBoost with hyperparameter optimization")

In [None]:
final_perf = pd.DataFrame.from_dict({'score': [recall_xghp, recall_xg, recall_rf], 'model': ['XGBoost w/ hyperopt', 'XGBoost Default', 'Random Forest']})

improvement = (recall_xghp - recall_xg) / recall_xg * 100 #relative improvement over the default XGBoost recall

fig, ax = plt.subplots(1,1, figsize = (12,8))

plot = sns.barplot(data = final_perf, x = 'model', y='score', ax = ax)
plt.title(f'We achieved an improvement of {improvement:.2f}% in recall-score using Hyperparameter Optimization', size = 18)
plt.ylabel('Recall-score', size=18)
plt.xlabel('Classification model', size=18)
plt.bar_label(plot.containers[0], fmt ='%.6f', size = 14)

plt.tight_layout()
plt.show()

In [None]:
!pip install shap

In [None]:
import shap
# explain the model's predictions using SHAP
explainer = shap.Explainer(classifier.best_estimator_)
shap_values = explainer(X)
shap.plots.bar(shap_values)

# Conclusions

From our exploratory data analysis and the study of our machine learning model, we can draw the following conclusions: 

 - As expected, the average pH of the samples is around 7.0 (neutral).
 - The low correlation of variables in this dataset causes linear classifiers to be very ineffective for this case. 
 - The levels of Sulfate, Hardness and pH of the water are the biggest contributors to the model's potability predictions. So the study of the 
 - Even with hyperparameter optimization, I would not recommend using this model in real situations. Our recall score is less than 50%, and the consequences of consuming samples that resulted in false positives outweigh any gain from the use of ML. 

 - It is possible that adding features more strongly correlated with potability could improve our accuracy. 