
# Water Potability



Access to safe drinking-water is essential to health, a basic human right and a component of effective policy for health protection. This is important as a health and development issue at a national, regional and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions. 

## Data Explanation


1. ph: pH of water (0 to 14).
2. Hardness: Capacity of water to precipitate soap in mg/L.
3. Solids: Total dissolved solids in ppm.
4. Chloramines: Amount of Chloramines in ppm.
5. Sulfate: Amount of Sulfates dissolved in mg/L.
6. Conductivity: Electrical conductivity of water in μS/cm.
7. Organic_carbon: Amount of organic carbon in ppm.
8. Trihalomethanes: Amount of Trihalomethanes in μg/L.
9. Turbidity: Cloudiness of the water, measured by how much light its suspended particles scatter, in NTU.
10. Potability: Indicates whether the water is safe for human consumption (1 = potable, 0 = not potable).
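
The documented ranges and encodings can be checked directly once the data is loaded. The sketch below is only an illustration: `check_schema` is a hypothetical helper, and the three-row frame is made-up data standing in for the real CSV.

```python
import numpy as np
import pandas as pd


def check_schema(frame: pd.DataFrame) -> dict:
    """Run simple sanity checks on the documented columns (missing values are ignored)."""
    ph = frame["ph"].dropna()
    labels = frame["Potability"].dropna()
    return {
        "ph_in_range": bool(ph.between(0, 14).all()),
        "potability_is_binary": bool(labels.isin([0, 1]).all()),
    }


# Tiny stand-in for water_potability.csv (values are invented for illustration).
sample = pd.DataFrame({
    "ph": [7.1, np.nan, 6.4],
    "Potability": [0, 1, 1],
})
checks = check_schema(sample)
print(checks)
```

On the real data, the same call would be `check_schema(df)` after the CSV has been read.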

## Content
The `water_potability.csv` file contains water quality metrics for 3276 different water bodies.
The CSV file comes from the Kaggle dataset [Water Potability](https://www.kaggle.com/adityakadiwal/water-potability).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
from sklearn.impute import SimpleImputer



In [None]:
df = pd.read_csv('/kaggle/input/water-potability/water_potability.csv')
df.head()      

First, check how balanced the two Potability classes are.

In [None]:
sns.set_style('white')
sns.countplot(x=df.Potability)
plt.title('Potability Counts')
#plt.legend(labels=['0: Not Potable','1: Potable'])
plt.xticks([0,1],['Non Potable', 'Potable']);
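
The plot gives a visual impression of the balance; the same information can be read off as counts and shares. A minimal sketch, using a made-up label series in place of `df.Potability`:

```python
import pandas as pd

# Invented labels standing in for df.Potability.
potability = pd.Series([0, 0, 0, 1, 1], name="Potability")

# Absolute counts and relative shares per class, ordered by label.
counts = potability.value_counts().sort_index()
shares = potability.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"count": counts, "share": shares}))
```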

## Overview of our Data
Hue represents Potability.

In [None]:
sns.pairplot(df, hue='Potability')

In [None]:
sns.histplot(x='ph',data=df, hue='Potability', alpha=0.5).set(title='pH Distribution')
plt.legend(labels= ['Potable', 'Non Potable'])

## NaN Values
Some columns contain missing (NaN) values. We will impute them with the column mean in the Modeling section, after the train/test split.

In [None]:
df.isnull().sum()

In [None]:
df.isnull().sum().plot.bar(title='NaN Counts')

# Modeling



In [None]:
X = df.drop('Potability', axis=1)
y = df.Potability 

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
imp = SimpleImputer(strategy='mean')
x_train = imp.fit_transform(x_train)
x_test = imp.transform(x_test)

In [None]:
np.isnan(x_train).sum()
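
Fitting the imputer on the training split only, as above, keeps test-set statistics out of the training step. One way to get the same guarantee inside cross-validation is to wrap the imputer and the classifier in a scikit-learn pipeline. The sketch below assumes random synthetic data in place of the real features, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic features and labels (not the water data).
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries to mimic missing measurements.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# The imputer is refitted on the training part of every fold.
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(cv_scores.mean())
```

With a pipeline, the column means are never computed from rows that the model is later scored on.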

In [None]:
model = RandomForestClassifier(n_jobs=-1)
model.fit(x_train, y_train)

In [None]:
score1 = model.score(x_test, y_test)
score1

In [None]:
from sklearn.metrics import classification_report 


In [None]:
y_pred = model.predict(x_test)
#y_pred, y_test

In [None]:
print(classification_report(y_test, y_pred))

## Feature Scaling 
We will standardize the features and check whether that changes the performance of our RandomForest model.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
model1 = RandomForestClassifier(n_jobs=-1)
model1.fit(x_train, y_train)

In [None]:
model1.score(x_test, y_test)

In [None]:
y_pred = model1.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score


In [None]:
scorer = cross_val_score(model1, x_train, y_train, cv=5)

In [None]:
scorer.mean()

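Scaling did not noticeably change the results. This is expected: tree-based models split on feature thresholds, so rescaling a feature does not change which splits are available.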
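
The effect can be illustrated on synthetic data. In this sketch (made-up data, arbitrary hyperparameters), two forests with the same `random_state` are trained on raw and on standardized features, and their predictions are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data, with the columns deliberately put on very different scales.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 0.1])

X_scaled = StandardScaler().fit_transform(X_demo)

# Same seed for both forests, so only the feature scaling differs.
raw_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
scaled_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y_demo)

agreement = np.mean(raw_forest.predict(X_demo) == scaled_forest.predict(X_scaled))
print(f"share of identical predictions: {agreement:.3f}")
```

Models that rely on distances or gradients, such as logistic regression, usually benefit more from scaling.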

In [None]:
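# Re-split and re-impute to get back to unscaled features
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
imp = SimpleImputer(strategy='mean')
x_train = imp.fit_transform(x_train)
x_test = imp.transform(x_test)

# This is a new random split, so refit the random forest on it; otherwise
# evaluating `model` on x_test later would include rows it was trained on
model.fit(x_train, y_train)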

In [None]:
from xgboost import XGBClassifier

In [None]:
boost = XGBClassifier(n_jobs=-1 )
boost.fit(x_train, y_train)

In [None]:
boost.score(x_test,y_test)

In [None]:
from xgboost import plot_importance

In [None]:
plot_importance(boost)

In [None]:
# Pair each importance with its feature name (X excludes the Potability target)
importances = dict(zip(X.columns, boost.feature_importances_))
importance_df = pd.DataFrame(importances, index=[0])

In [None]:
importance_df.T.plot.bar(legend=False)

In [None]:
feature_names = np.array(df.drop('Potability', axis=1).columns)
feature_names

In [None]:

plt.barh(feature_names, model.feature_importances_)

In [None]:
sorted_idx = list(model.feature_importances_.argsort())
plt.barh(feature_names[sorted_idx], model.feature_importances_[sorted_idx])
plt.suptitle('Feature Importance with RandomForest');

In [None]:
sorted_idx = list(boost.feature_importances_.argsort())
plt.barh(feature_names[sorted_idx], boost.feature_importances_[sorted_idx])
plt.suptitle('Feature Importance with XGBClassifier');
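
The `feature_importances_` values plotted above are derived from the fitted trees themselves. A possible cross-check, not part of the original analysis, is scikit-learn's `permutation_importance`, which measures how much the held-out score drops when one column is shuffled. A minimal sketch on synthetic data where only the first column carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: the label depends on column 0 only.
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set in turn and record the drop in accuracy.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```

For the notebook's models, the call would take `model` or `boost` together with `x_test` and `y_test`.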

In [None]:
# Requires scikit-learn >= 1.0 (the older plot_* helpers were removed in 1.2)
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay, ConfusionMatrixDisplay

In [None]:
RocCurveDisplay.from_estimator(model, x_test, y_test)

In [None]:
PrecisionRecallDisplay.from_estimator(model, x_test, y_test)

In [None]:
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
plt.suptitle('RandomForestClassifier Confusion Matrix');

In [None]:
RocCurveDisplay.from_estimator(boost, x_test, y_test)

In [None]:
PrecisionRecallDisplay.from_estimator(boost, x_test, y_test)

In [None]:
ConfusionMatrixDisplay.from_estimator(boost, x_test, y_test)
plt.suptitle('XGBClassifier Confusion Matrix');

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logist = LogisticRegression(n_jobs=-1)
logist.fit(x_train, y_train)
logist.score(x_test, y_test)

In [None]:
RocCurveDisplay.from_estimator(logist, x_test, y_test)

In [None]:
PrecisionRecallDisplay.from_estimator(logist, x_test, y_test)

In [None]:
ConfusionMatrixDisplay.from_estimator(logist, x_test, y_test)
plt.suptitle('Logistic Regression Confusion Matrix');

## SHAP Values for RandomForestClassifier

In [None]:
%pip install shap

In [None]:
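import shap  # package used to calculate SHAP values

# Wrap the NumPy test array in a DataFrame so the plot shows column names
test_pred = pd.DataFrame(x_test, columns=X.columns)

# Create an object that can calculate SHAP values for the random forest
explainer = shap.TreeExplainer(model)

# Calculate SHAP values for every row of the test set rather than a single row,
# to have more data for the plot
shap_values = explainer.shap_values(test_pred)

# Depending on the installed shap version, the result is either a list with one
# array per class or a single array with the class on the last axis.
# Select class 1 (potable) in both cases.
potable_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Make the summary plot
shap.summary_plot(potable_shap, test_pred)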