# **Water Quality**

## **Introduction**

On Earth, water circulates constantly between several large reservoirs as a result of evaporation, precipitation and runoff. This resource, which is essential to the functioning of the planet, is also vital for the aquatic environments through which it flows. These ecosystems are essential to biodiversity and to human beings for many reasons.<br/>
Water is an important biological constituent, indispensable in its liquid form to all known living organisms.<br/>
Our target will be the potability of water.<br/>

## Table of Contents
* **Importing libraries and loading data**
* **EDA :**
    * Shape / Info
    * NaN values
    * Target Visualization
    * Variables description
    * Variables visualization
    * Variables correlation
    * Variables - Target correlation
* **Pre-Processing :**
    * Outliers extraction
    * Missing data
    * Skewness Correction
    * Normalization
    * Train Test Split
    * SMOTE
* **Modeling :**
    * Testing different classifiers
    * Hyperparameters Tuning
    * Voting Classifier

## Importing libraries and loading data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, boxcox, normaltest, probplot
import plotly.express as px
#import plotly.graph_objects as go
from collections import Counter

# warning library
import warnings
warnings.filterwarnings("ignore")

sns.set_theme(style="darkgrid")

#Loading data
df = pd.read_csv("../input/water-potability/water_potability.csv")
df.head()

In [None]:
df.describe()

## **EDA**

### Basic checklist :
- **Target :** Potability (unbalanced).
- **Shape :** (3276, 10).
- **Type of variables :** All columns are float64, except the target which is int64.
- **NaN values :** There are some NaN values in the columns Sulfate, ph and Trihalomethanes.

In [None]:
print("Shape :",df.shape)
print()
df.info()

All columns are float64, except the target which is int64.

In [None]:
sns.heatmap(df.isnull(), cmap='viridis')

In [None]:
print("NaN values :")
(df.isnull().sum()/df.shape[0]).sort_values(ascending=False)*100

There are some NaN values in the columns Sulfate, ph and Trihalomethanes.

### Target Visualization

In [None]:
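# sort_index() guarantees the order 0 (Not Potable), 1 (Potable) to match the labels
target_counts = df["Potability"].value_counts().sort_index()
# Pass the counts as an array so the call does not depend on the Series name (it differs across pandas versions)
fig = px.pie(values=target_counts.values, names=["Not Potable", "Potable"])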
fig.update_traces(textposition='inside', hole=.4,opacity=0.9, textinfo='percent+label')
fig.update_layout(title=dict(text='Water Potability', x=0.5), legend=dict(x=0.42, y=-0.05, orientation='h'))
fig.add_annotation(text='Potability', x=0.5, y=0.5, showarrow=False, font_size=15, opacity=0.9)
fig.show()

* **This dataset is unbalanced :**
    * Majority : Not Potable (class 0), 1998 rows
    * Minority : Potable (class 1), 1278 rows
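
To put the imbalance in numbers, the two counts above can be turned into class shares and a majority/minority ratio. This is a small arithmetic sketch that uses only the row counts reported in this notebook; the variable names are illustrative.

```python
# Class counts reported above for the Potability target
counts = {"Not Potable (0)": 1998, "Potable (1)": 1278}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts["Not Potable (0)"] / counts["Potable (1)"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
```

The split is roughly 61% / 39%, so the imbalance is moderate rather than extreme.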

In [None]:
df.head()

### Variables description

1. **ph :** pH of water (0 to 14).
2. **Hardness :** Capacity of water to precipitate soap in mg/L.
3. **Solids :** Total dissolved solids in ppm.
4. **Chloramines :** Amount of Chloramines in ppm.
5. **Sulfate :** Amount of Sulfates dissolved in mg/L.
6. **Conductivity :** Electrical conductivity of water in μS/cm.
7. **Organic_carbon :** Amount of organic carbon in ppm.
8. **Trihalomethanes :** Amount of Trihalomethanes in μg/L.
9. **Turbidity :** Measure of the cloudiness of water (how much suspended particles scatter light) in NTU (Nephelometric Turbidity Units).
10. **Potability :** Indicates if water is safe for human consumption.  Potable = 1 and Not potable = 0

### Variables visualization

In [None]:
def var_visualisation(df, target, var, color):
    colors = {"orange":"YlOrRd", "green":"YlGn", "blue":"PuBu", "purple":"BuPu"}
    var_col = df[var]
    
    fig, axs = plt.subplots(2,2, figsize=(12,8))
    fig.suptitle(var, weight='bold', fontsize=16)
    axs[0][0].set_title(f"{var} Histogram")
    axs[0][0].hist(var_col, bins=10, color=color)
    axs[0][1].set_title(f"{var} Displot")
    sns.distplot(var_col, fit=norm, color=color, ax=axs[0][1])
    axs[0][1].legend([f"Skew : {var_col.skew():.2f}\nKurt  : {var_col.kurt():.2f}"])
    axs[1][0].set_title(f"Histplot - {var} by {target}")
    sns.histplot(df, x=var, hue=target, kde=True, palette=colors[color], ax=axs[1][0])
    axs[1][1].set_title(f"Boxplot - {var} by {target}")
    # x and y must be passed as keywords in recent seaborn versions
    sns.boxplot(x=df[target], y=var_col, palette=colors[color], ax=axs[1][1])
    plt.tight_layout()
    plt.show()

**pH**

In [None]:
var_visualisation(df, 'Potability', 'ph', 'orange')

* The distribution looks like a normal distribution.
* The pH does not seem to have much influence on the potability.

**Hardness**

In [None]:
var_visualisation(df, 'Potability', 'Hardness', 'green')

* The distribution looks like a normal distribution.

**Solids**

In [None]:
var_visualisation(df, 'Potability', 'Solids', 'blue')

* There is a right skewness.
* We can see a difference in the distribution between the two classes.

**Chloramines**

In [None]:
var_visualisation(df, 'Potability', 'Chloramines', 'purple')

* The distribution looks like a normal distribution.
* We can see a small difference in the distribution between the two potability classes.

**Sulfate**

In [None]:
var_visualisation(df, 'Potability', 'Sulfate', 'orange')

* The distribution of Sulfate looks like a normal distribution.
* The distribution is flatter when the water is potable.

**Conductivity**

In [None]:
var_visualisation(df, 'Potability', 'Conductivity', 'green')

* There is a right skewness.

**Organic_carbon**

In [None]:
var_visualisation(df, 'Potability', 'Organic_carbon', 'blue')

* The distribution looks like a normal distribution.
* We can see a small difference in the distribution between the two potability classes.

**Trihalomethanes**

In [None]:
var_visualisation(df, 'Potability', 'Trihalomethanes', 'purple')

* The distribution looks like a normal distribution.

**Turbidity**

In [None]:
var_visualisation(df, 'Potability', 'Turbidity', 'orange')

* The distribution looks like a normal distribution.
* The turbidity does not seem to have much influence on the potability.

### Variables correlation

In [None]:
sns.pairplot(df, hue='Potability')

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(df.corr(), annot=True, fmt=".2", cmap='viridis')
plt.show()

The variables appear to be only weakly correlated with one another.

### Variables - Target correlation

In [None]:
corr = df.corr()
corr['Potability'].sort_values(ascending=False)

Solids and Chloramines are the most positively correlated with the target, while Organic_carbon and Sulfate are the most negatively correlated.
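
Because the target is binary, the Pearson coefficient returned by `df.corr()` is just a rescaled difference between the two class means. That is why the per-class means compared in the following cells tell the same story as the correlation column. The sketch below checks this identity on synthetic data (not the notebook's dataset); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one feature and a 0/1 target (not the notebook's data)
y = rng.integers(0, 2, size=1000)
x = rng.normal(loc=0.0, scale=1.0, size=1000) + 0.2 * y  # small class shift

# Pearson correlation, the quantity df.corr() computes
r_pearson = np.corrcoef(x, y)[0, 1]

# Same quantity written as a scaled difference of class means
p = y.mean()                                   # share of class 1
mean_gap = x[y == 1].mean() - x[y == 0].mean()
r_from_means = mean_gap * np.sqrt(p * (1 - p)) / x.std()  # population std (ddof=0)

print(f"pearson={r_pearson:.4f}, from class means={r_from_means:.4f}")
```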

**pH**

In [None]:
df[['ph', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

The difference is not huge: the pH is slightly lower when the water is potable.

**Hardness**

In [None]:
df[['Hardness', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

As before, the difference is small: Hardness is slightly lower when the water is potable.

**Solids**

In [None]:
df[['Solids', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

Solids looks higher when the water is potable.

**Chloramines**

In [None]:
df[['Chloramines', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

Chloramines is slightly higher when the water is potable.

**Sulfate**

In [None]:
df[['Sulfate', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

Sulfate looks lower when the water is potable.

**Conductivity**

In [None]:
df[['Conductivity', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

Conductivity looks lower when the water is potable.

**Organic_carbon**

In [None]:
df[['Organic_carbon', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

Organic_carbon is lower when the water is potable.

**Trihalomethanes**

In [None]:
df[['Trihalomethanes', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

Trihalomethanes is slightly higher when the water is potable.

**Turbidity**

In [None]:
df[['Turbidity', 'Potability']].groupby('Potability').mean().style.background_gradient('Reds')

The difference is not huge: the turbidity looks slightly higher when the water is potable.

## **Pre-Processing**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

### Outliers extraction

*Multiple outliers*

In [None]:
def del_outliers(data, columns):
    indices = []
    for col in columns :
        Q1 = np.nanpercentile(data[col],25)
        Q3 = np.nanpercentile(data[col],75)
        IQR15 = (Q3 - Q1) * 1.5
        outliers_list = data[(data[col] < Q1 - IQR15) | (data[col] > Q3 + IQR15)].index
        indices.extend(outliers_list)
        
    indices = Counter(indices)
    # keep only the rows flagged as outliers in more than one column
    multiple_outliers = [i for i, v in indices.items() if v > 1]
    
    print("Number of multiple outliers : ", len(multiple_outliers))
    
    return data.drop(multiple_outliers, axis=0).reset_index(drop=True)

df = del_outliers(df, df.columns[:-1])
df.shape
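
The rule applied per column above is the usual 1.5 × IQR fence: a value is flagged when it falls below Q1 − 1.5·IQR or above Q3 + 1.5·IQR, and the notebook then drops only the rows flagged in more than one column. The sketch below works through the fence computation on a handful of made-up values; `iqr_fences` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR, ignoring NaN."""
    q1 = np.nanpercentile(values, 25)
    q3 = np.nanpercentile(values, 75)
    margin = k * (q3 - q1)
    return q1 - margin, q3 + margin

# Toy values, not taken from the dataset; NaN never counts as an outlier
values = np.array([6.5, 7.0, 7.1, 7.2, 7.4, 7.5, 8.0, np.nan, 13.5])
low, high = iqr_fences(values)
is_outlier = (values < low) | (values > high)
print(low, high, values[is_outlier])
```

Here Q1 = 7.075 and Q3 = 7.625, so the fences are 6.25 and 8.45 and only 13.5 is flagged.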

### Missing data

*Columns : Sulfate, ph, Trihalomethanes*

In [None]:
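# Assign the result back: calling fillna(inplace=True) on a column selection is unreliable in recent pandas
df['Sulfate'] = df['Sulfate'].fillna(df['Sulfate'].mean())
df['ph'] = df['ph'].fillna(df['ph'].mean())
df['Trihalomethanes'] = df['Trihalomethanes'].fillna(df['Trihalomethanes'].mean())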
df.head()
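
The cell above fills the gaps with means computed on the whole dataset, before the train/test split. A possible variant, which this notebook does not use, is to learn the fill values on the training rows only and reuse them for the test rows, so that no test information leaks into preprocessing. A minimal sketch on a toy frame with made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (hypothetical numbers, not the real dataset)
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.0, np.nan, 7.5],
    "Sulfate": [330.0, 340.0, np.nan, 310.0, 350.0, np.nan],
})
train, test = toy.iloc[:4], toy.iloc[4:]

# Learn the fill values from the training rows only
train_means = train.mean()

train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)   # reuse the training means
print(test_filled)
```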

### Skewness Correction

*Columns :* Solids, Conductivity

**Solids**

In [None]:
fig, ax = plt.subplots(1,2, figsize=(12,4))
fig.suptitle('Solids before boxcox', weight='bold', fontsize=16)
ax[0].set_title('Solids displot')
sns.distplot(df['Solids'], fit=norm, ax=ax[0])
probplot(df['Solids'], plot=plt)
plt.tight_layout()
plt.show()

In [None]:
df['Solids'], _ = boxcox(df['Solids'])

fig, ax = plt.subplots(1,2, figsize=(12,4))
fig.suptitle('Solids after boxcox', weight='bold', fontsize=16)
ax[0].set_title('Solids displot')
sns.distplot(df['Solids'], fit=norm, ax=ax[0])
probplot(df['Solids'], plot=plt)
plt.tight_layout()
plt.show()

**Conductivity**

In [None]:
fig, ax = plt.subplots(1,2, figsize=(12,4))
fig.suptitle('Conductivity before boxcox', weight='bold', fontsize=16)
ax[0].set_title('Conductivity displot')
sns.distplot(df['Conductivity'], fit=norm, ax=ax[0])
probplot(df['Conductivity'], plot=plt)
plt.tight_layout()
plt.show()

In [None]:
df['Conductivity'], _ = boxcox(df['Conductivity'])

fig, ax = plt.subplots(1,2, figsize=(12,4))
fig.suptitle('Conductivity after boxcox', weight='bold', fontsize=16)
ax[0].set_title('Conductivity displot')
sns.distplot(df['Conductivity'], fit=norm, ax=ax[0])
probplot(df['Conductivity'], plot=plt)
plt.tight_layout()
plt.show()
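
For reference, `boxcox` fits a power parameter λ and, for λ ≠ 0, maps each value x to (x^λ − 1) / λ (and to log x when λ = 0). The cells above discard λ with `_`; keeping it makes the transform reversible and reusable on new data. The sketch below illustrates this on a synthetic right-skewed sample, not on the Solids or Conductivity columns.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)

# Right-skewed, strictly positive synthetic sample (Box-Cox requires x > 0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2000)

# boxcox returns the transformed values and the fitted lambda
x_bc, lam = boxcox(x)

# Manual version of the same transform
x_manual = (x**lam - 1) / lam if lam != 0 else np.log(x)

print(f"lambda={lam:.3f}, skew before={skew(x):.2f}, after={skew(x_bc):.2f}")

# Keeping lambda makes the transform reversible
x_back = inv_boxcox(x_bc, lam)
```

With a stored λ, the same transform can be applied to unseen values through `boxcox(x_new, lmbda=lam)`.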

### Normalization

*Standardization*

In [None]:
y = df['Potability']
X = df.drop(['Potability'], axis = 1)

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X.head()
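
`StandardScaler` centers each column on its mean and divides by its standard deviation, z = (x − mean) / std, using the population standard deviation (`ddof=0`). The following sketch reproduces that by hand on a toy matrix with made-up values and compares it with scikit-learn's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values)
X_toy = np.array([[7.0, 200.0],
                  [6.0, 180.0],
                  [8.0, 220.0],
                  [7.5, 210.0]])

# Manual z-score with the population standard deviation (ddof=0)
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X_toy)
print(np.allclose(z_manual, z_sklearn))
```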

### Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
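
The split above is purely random, so the class proportions in the test set can drift a little from those of the full dataset. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class shares the same in both parts. A sketch on synthetic data with roughly the same 61/39 balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with roughly the notebook's 61/39 class balance
rng = np.random.default_rng(0)
y_toy = np.array([0] * 610 + [1] * 390)
X_toy = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(f"train share of class 1: {y_tr.mean():.3f}")
print(f"test share of class 1:  {y_te.mean():.3f}")
```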

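### SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is a technique for handling unbalanced datasets: it oversamples the minority class by generating synthetic samples. It is applied to the training set only, so the test set keeps its original distribution.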

In [None]:
smote = SMOTE(random_state = 0)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Without SMOTE :\n{y_train.value_counts()}")
print(f"\nWith SMOTE :\n{y_train_smote.value_counts()}")
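
The core idea behind SMOTE is interpolation: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. The NumPy sketch below illustrates only that idea on toy 2-D points. It is a simplified stand-in, not imbalanced-learn's implementation, and `smote_like_sample` is a hypothetical helper name.

```python
import numpy as np

def smote_like_sample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)         # random starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                           # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class (hypothetical 2-D points)
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
X_new = smote_like_sample(X_min, n_new=10)
print(X_new.shape)
```

Every synthetic point lies between two existing minority samples, so the new points stay inside the region already occupied by the minority class.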

## **Modeling**

### Testing different classifiers

In [None]:
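from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; this thin wrapper keeps the calls below working
def plot_confusion_matrix(estimator, X, y, ax=None):
    return ConfusionMatrixDisplay.from_estimator(estimator, X, y, ax=ax)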

In [None]:
names = ["LogisticReg", "KNN", "SVC", "SGD", "RandomForest", "GradBoosting"]
models = [LogisticRegression(), KNeighborsClassifier(), SVC(),
          SGDClassifier(), RandomForestClassifier(), GradientBoostingClassifier()]
scores = []

for model in models :
    model.fit(X_train_smote, y_train_smote)
    scores.append(model.score(X_test, y_test))
    
results = pd.DataFrame({"Model":names, "Score": scores}).sort_values(by='Score', ascending=False)
results.style.background_gradient("Blues")

* The 4 best models are SVC, RandomForest, KNN and GradientBoosting.<br/>
* The ranking of the models could change a bit depending on the train/test split.
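
Since a single split makes the ranking somewhat noisy, one way to get a steadier comparison is to average the score over several folds. The sketch below shows the mechanics with `cross_val_score` on synthetic data generated by `make_classification`. The data, the choice of model and the fold count are assumptions made for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the water data (9 features, ~61/39 class balance)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=9, n_informative=4, weights=[0.61], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv)
print(f"accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The standard deviation across folds gives a rough idea of how much a score can move from one split to another.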

### Optimization

**We will try to find the best hyperparameters for our 4 best models.**<br/>
*Models : GradBoosting, KNN, RandomForest and SVC*

**GradientBoosting**

In [None]:
%%time
params = {'learning_rate':[0.01,0.1], 'n_estimators':[500,1000],
            'max_features':['sqrt','log2'],'max_depth':[3,7,10]}
grid = GridSearchCV(GradientBoostingClassifier(), params)
grid.fit(X_train_smote, y_train_smote)
print('Best params :', grid.best_params_)
print('Best CV score :', grid.best_score_)  # mean cross-validated score of the best parameter set
xgb = grid.best_estimator_
print('Test score', xgb.score(X_test, y_test))

pred = xgb.predict(X_test)
N, train_score, val_score = learning_curve(xgb, X_train_smote, y_train_smote, train_sizes=np.linspace(0.2,1,5))

fig, ax = plt.subplots(1,2,figsize=(15,4))
ax[0].set_title("Confusion Matrix")
plot_confusion_matrix(xgb, X_test, y_test, ax=ax[0])
ax[0].grid(False)
ax[1].set_title("Learning Curve")
ax[1].plot(N, train_score.mean(axis=1), label="train")
ax[1].plot(N, val_score.mean(axis=1), label="validation")
ax[1].set_xlabel("train_sizes")
plt.legend()
plt.show()
print(f"\nClassification Report :\n{classification_report(y_test, pred)}")

**KNN**

In [None]:
params = {'n_neighbors':np.arange(1,20,2), 'leaf_size':np.arange(1,10)}
grid = GridSearchCV(KNeighborsClassifier(), params)
grid.fit(X_train_smote, y_train_smote)

print('Best params :', grid.best_params_)
print('Best CV score :', grid.best_score_)  # mean cross-validated score of the best parameter set
knn = grid.best_estimator_
print('Test score', knn.score(X_test, y_test))

pred = knn.predict(X_test)
N, train_score, val_score = learning_curve(knn, X_train_smote, y_train_smote, train_sizes=np.linspace(0.2,1,5))

fig, ax = plt.subplots(1,2,figsize=(15,4))
ax[0].set_title("Confusion Matrix")
plot_confusion_matrix(knn, X_test, y_test, ax=ax[0])
ax[0].grid(False)
ax[1].set_title("Learning Curve")
ax[1].plot(N, train_score.mean(axis=1), label="train")
ax[1].plot(N, val_score.mean(axis=1), label="validation")
ax[1].set_xlabel("train_sizes")
plt.legend()
plt.show()

print(f"\nClassification Report :\n{classification_report(y_test, pred)}")

**RandomForest**

In [None]:
%%time

params = {'n_estimators': [500, 750, 1000, 1500], 'max_features': [2, 3]}
grid = GridSearchCV(RandomForestClassifier(), params)
grid.fit(X_train_smote, y_train_smote)

print('Best params :', grid.best_params_)
print('Best CV score :', grid.best_score_)  # mean cross-validated score of the best parameter set
rf = grid.best_estimator_
print('Test score', rf.score(X_test, y_test))

pred = rf.predict(X_test)
N, train_score, val_score = learning_curve(rf, X_train_smote, y_train_smote, train_sizes=np.linspace(0.2,1,5))

fig, ax = plt.subplots(1,2,figsize=(15,4))
ax[0].set_title("Confusion Matrix")
plot_confusion_matrix(rf, X_test, y_test, ax=ax[0])
ax[0].grid(False)
ax[1].set_title("Learning Curve")
ax[1].plot(N, train_score.mean(axis=1), label="train")
ax[1].plot(N, val_score.mean(axis=1), label="validation")
ax[1].set_xlabel("train_sizes")
plt.legend()
plt.show()

print(f"\nClassification Report :\n{classification_report(y_test, pred)}")

**SVC**

In [None]:
params = {'C':[0.1,1,10,100], 'gamma':[1,0.1,0.01], 'kernel':['rbf', 'sigmoid']}
grid = GridSearchCV(SVC(), params)
grid.fit(X_train_smote, y_train_smote)

print('Best params :', grid.best_params_)
print('Best CV score :', grid.best_score_)  # mean cross-validated score of the best parameter set
svc = grid.best_estimator_
print('Test score', svc.score(X_test, y_test))

pred = svc.predict(X_test)
N, train_score, val_score = learning_curve(svc, X_train_smote, y_train_smote, train_sizes=np.linspace(0.2,1,5))

fig, ax = plt.subplots(1,2,figsize=(15,4))
ax[0].set_title("Confusion Matrix")
plot_confusion_matrix(svc, X_test, y_test, ax=ax[0])
ax[0].grid(False)
ax[1].set_title("Learning Curve")
ax[1].plot(N, train_score.mean(axis=1), label="train")
ax[1].plot(N, val_score.mean(axis=1), label="validation")
ax[1].set_xlabel("train_sizes")
plt.legend()
plt.show()

print(f"\nClassification Report :\n{classification_report(y_test, pred)}")

In [None]:
svc = SVC()
svc.fit(X_train_smote, y_train_smote)
print('Train score', svc.score(X_train_smote, y_train_smote))
print('Test score', svc.score(X_test, y_test))

pred = svc.predict(X_test)
N, train_score, val_score = learning_curve(svc, X_train_smote, y_train_smote, train_sizes=np.linspace(0.2,1,5))

fig, ax = plt.subplots(1,2,figsize=(15,4))
ax[0].set_title("Confusion Matrix")
plot_confusion_matrix(svc, X_test, y_test, ax=ax[0])
ax[0].grid(False)
ax[1].set_title("Learning Curve")
ax[1].plot(N, train_score.mean(axis=1), label="train")
ax[1].plot(N, val_score.mean(axis=1), label="validation")
ax[1].set_xlabel("train_sizes")
plt.legend()
plt.show()

print(f"\nClassification Report :\n{classification_report(y_test, pred)}")

SVC with the default hyperparameters appears to be better than the estimator returned by GridSearch.

### Final Model

**Voting Classifier (Hard Voting)**<br/>
The predicted class is the one that receives the most votes from the individual classifiers.
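
Hard voting can be sketched in a few lines: count the votes for each class per sample and keep the most frequent one. The predictions below are made up for illustration, and `hard_vote` is a hypothetical helper. With four voters a 2-2 tie is possible; here it goes to the lowest class label, which matches the ascending-order tie-breaking documented for scikit-learn's `VotingClassifier`.

```python
import numpy as np

# Hypothetical 0/1 predictions from four classifiers on five samples
preds = np.array([
    [0, 1, 1, 0, 1],   # model A
    [0, 1, 0, 0, 1],   # model B
    [1, 1, 0, 0, 0],   # model C
    [0, 0, 1, 0, 1],   # model D
])

def hard_vote(preds, n_classes=2):
    """Return the most voted class per column; ties go to the lowest label."""
    votes = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return votes.argmax(axis=0)

print(hard_vote(preds))  # -> [0 1 0 0 1]
```

The third sample is a 2-2 tie and is therefore assigned to class 0.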

In [None]:
%%time

model = VotingClassifier(estimators=[("GradBoosting",xgb), ("KNN", knn), ("RandomForest", rf), ("SVC", svc)], voting='hard')
model.fit(X_train_smote, y_train_smote)
print('Test score', model.score(X_test, y_test))

pred = model.predict(X_test)
N, train_score, val_score = learning_curve(model, X_train_smote, y_train_smote, train_sizes=np.linspace(0.2,1,5))

fig, ax = plt.subplots(1,2,figsize=(15,4))
ax[0].set_title("Confusion Matrix")
plot_confusion_matrix(model, X_test, y_test, ax=ax[0])
ax[0].grid(False)
ax[1].set_title("Learning Curve")
ax[1].plot(N, train_score.mean(axis=1), label="train")
ax[1].plot(N, val_score.mean(axis=1), label="validation")
ax[1].set_xlabel("train_sizes")
plt.legend()
plt.show()

print(f"\nClassification Report :\n{classification_report(y_test, pred)}")

* The different models seem to have less trouble with class 0, the majority class, than with class 1.
* The learning curves show that the validation score is still improving as the training set grows.

## **Conclusion**

* Algorithms tend to favor the majority class; SMOTE is one way to deal with an unbalanced dataset.<br/>
  Accuracy does not seem to change much with this technique, but the recall and the f1-score for the minority class are improved.
* Based on the learning curves, our models should perform better with a larger dataset.
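
To make the difference between accuracy and the minority-class metrics concrete, the sketch below derives precision, recall and f1 from the four cells of a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
# Hypothetical confusion-matrix counts, with class 1 (Potable) as the positive class
tp, fn, fp, tn = 120, 130, 90, 310

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this made-up case accuracy is about 0.66 while recall on class 1 is only 0.48, which shows how accuracy alone can hide poor detection of the minority class.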

## Ending Note:
**I want to thank you for reading this notebook. This is my first notebook and I don't have much experience yet, so if you have any comments or suggestions, please don't hesitate to share them!<br/>
I also want to link the notebooks that helped and inspired me :**
* https://www.kaggle.com/rafetcan/red-wine-quality-classification-95-76-acc/notebook
* https://www.kaggle.com/jaykumar1607/water-quality-analysis-plotly-and-modelling

**Thanks again and I hope you enjoyed it.**