# **Water Quality Potability Prediction**

Access to safe drinking-water is essential to health, a basic human right and a component of effective policy for health protection. This is important as a health and development issue at a national, regional and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.

In [None]:
import numpy as np
import pandas as pd 

### Read dataset

In [None]:
df = pd.read_csv('../input/water-potability/water_potability.csv')
df.head()

# Data Explanation
* ph: pH of water (0 to 14).
* Hardness: Capacity of water to precipitate soap in mg/L.
* Solids: Total dissolved solids in ppm.
* Chloramines: Amount of Chloramines in ppm.
* Sulfate: Amount of Sulfates dissolved in mg/L.
* Conductivity: Electrical conductivity of water in μS/cm.
* Organic_carbon: Amount of organic carbon in ppm.
* Trihalomethanes: Amount of Trihalomethanes in μg/L.
* Turbidity: Measure of the water's cloudiness (light scattered by suspended particles) in NTU.
* Potability: Indicates if water is safe for human consumption. Potable - 1 and Not potable - 0

In [None]:
df.info()

# Exploratory Data Analysis

### Check for class imbalance

In [None]:
import matplotlib.pyplot as plt

# Draw one bar per Potability class, with the number of rows as its height
labels = []
for i, (label, group) in enumerate(df.groupby("Potability")):
    labels.append(label)
    plt.bar(i, len(group), label=label)
plt.xticks(range(len(labels)), labels)
plt.legend()
plt.title('Potability')
plt.show()

In [None]:
# Use the row count per class (not the sum of pH) so the pie shows the class share
data = df.groupby("Potability").size()
pie, ax = plt.subplots(figsize=[10,6])
labels = data.keys()
plt.pie(x=data, autopct="%.1f%%", labels=labels, pctdistance=0.5)
plt.title("Potability", fontsize=14)

### Check the histogram of the data

In [None]:
df.hist(figsize=(15,15))
plt.show()

### Check the correlation of each column

In [None]:
df.corr()

In [None]:
import seaborn as sb


plt.figure(figsize=(12,10))
sb.heatmap(df.corr(), annot=True)

In [None]:
# Correlation of every column (including ph) with the target, strongest first
df.corr()['Potability'].sort_values(ascending=False)

### Check for null values in the data

In [None]:
df.isnull().sum()

### Fill the null values with the column **median**

In [None]:
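# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df['ph'] = df['ph'].fillna(df['ph'].median())
df['Sulfate'] = df['Sulfate'].fillna(df['Sulfate'].median())
df['Trihalomethanes'] = df['Trihalomethanes'].fillna(df['Trihalomethanes'].median())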

In [None]:
df.isnull().sum()
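
The median is used here rather than the mean; one common reason is that it is less sensitive to extreme readings. The following plain-Python sketch illustrates that difference. The function `fill_missing` and the sulfate readings are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median


def fill_missing(values, strategy=median):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = strategy(observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical sulfate readings (mg/L) with one gap and one extreme value.
sulfate = [310.0, 320.0, None, 330.0, 900.0]

filled_median = fill_missing(sulfate, median)
filled_mean = fill_missing(sulfate, mean)

print(filled_median[2])  # median of the four observed readings
print(filled_mean[2])    # mean, pulled upward by the 900.0 reading
```

With these numbers the median fill stays close to the typical readings, while the mean fill is dragged toward the outlier.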

### Normalize Data using **MinMaxScaler**

In [None]:
x = df.drop(['Potability'], axis='columns')
y = df.Potability

In [None]:
from sklearn.preprocessing import MinMaxScaler


features_scaler = MinMaxScaler()
features = features_scaler.fit_transform(x)
features
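
With its default settings, `MinMaxScaler` rescales each column independently to the range [0, 1] using `(x - min) / (max - min)`. The plain-Python sketch below shows that formula on a single column. The helper `min_max_scale` and the hardness readings are hypothetical, and mapping a constant column to zeros is a choice made for this sketch.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: return zeros to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical hardness readings in mg/L, for illustration only.
hardness = [100.0, 150.0, 200.0, 300.0]
scaled = min_max_scale(hardness)
print(scaled)
```

The smallest reading maps to 0, the largest to 1, and the rest keep their relative spacing.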

### Cross Validation and Hyperparameter Tuning

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

model_params = {
    'svm': {
        'model': SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20,30,50],
            'kernel': ['rbf','linear','poly']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [10,50,100]
        }
    },
    'logistic_regression' : {
        # multi_class='auto' was the default and is deprecated in recent scikit-learn, so it is omitted
        'model': LogisticRegression(solver='liblinear'),
        'params': {
            'C': [1,5,10]
        }
    },
    'KNN' : {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': [3,7,11,13]
        }
    }
    
}

In [None]:
from sklearn.model_selection import GridSearchCV
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(features, y)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df_score = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df_score
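
To get a feel for how much work the grid search does, it can help to enumerate the candidates by hand. The sketch below expands the SVM grid from `model_params` with `itertools.product` and counts the cross-validation fits for `cv=5`. The helper `expand_grid` is hypothetical and only mimics the enumeration, not the fitting.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of the listed parameter values as a dict."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Same values as the 'svm' entry in model_params.
svm_grid = {'C': [1, 10, 20, 30, 50], 'kernel': ['rbf', 'linear', 'poly']}

candidates = expand_grid(svm_grid)
n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
```

Each candidate is trained once per fold, so the SVM grid alone accounts for 75 fits, plus one final refit on the full data when `refit` is left at its default.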

In [None]:
sb.barplot(x="model", y="best_score", data=df_score)
plt.ylim(0, 1)

### Find the best algorithm and the best parameters

In [None]:
row_score_max = df_score['best_score'].argmax()
df_score.loc[[row_score_max]]

### Display the result of the best algorithm

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, y, test_size=0.25, random_state=101)

In [None]:
model = RandomForestClassifier(n_estimators=100)
model.fit(x_train,y_train)
model.score(x_test,y_test)

### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix


y_predicted = model.predict(x_test)
cm = confusion_matrix(y_test,y_predicted)
plt.figure(figsize = (10,7))
sb.heatmap(cm, annot=True, fmt="d")  # counts are integers
plt.xlabel('Predicted')
plt.ylabel('Actual')  # rows of the confusion matrix are the true labels

### Classification Report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_predicted))
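
The precision, recall and F1-score in the report can be derived directly from the confusion matrix. The sketch below does this for the positive class, assuming the scikit-learn layout for labels 0 and 1, `[[tn, fp], [fn, tp]]`. The function name and the counts are hypothetical and are not results from this notebook.

```python
def metrics_for_positive_class(cm):
    """Compute precision, recall and F1 for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: rows are true labels, columns are predicted labels.
example_cm = [[80, 20],
              [40, 60]]

precision, recall, f1 = metrics_for_positive_class(example_cm)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

F1 is the harmonic mean of precision and recall, so it stays low whenever either of the two is low.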

# Handle the imbalanced dataset to try for a better result

In [None]:
y.value_counts()

### Handle Imbalance using **SMOTE**

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy='minority')
x_sm, y_sm = smote.fit_resample(features, y)

y_sm.value_counts()
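
SMOTE creates new minority-class rows by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in plain Python. The neighbour search is omitted, and the function `synthesize` and the two points are hypothetical.

```python
import random


def synthesize(sample, neighbor, rng):
    """Return a point on the line segment between sample and neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two hypothetical minority-class rows with two scaled features each.
minority_point = [0.2, 0.5]
nearest_neighbor = [0.4, 0.9]

synthetic = synthesize(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Every synthetic row lies between two real minority rows, which is why the resampled class fills in the region the minority class already occupies rather than repeating exact copies.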

### Cross-validation and hyperparameter tuning on the resampled dataset

In [None]:
from sklearn.model_selection import GridSearchCV
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(x_sm, y_sm)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df_score_smote = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df_score_smote

In [None]:
sb.barplot(x="model", y="best_score", data=df_score_smote)
plt.ylim(0, 1)

In [None]:
row_score_max_smote = df_score_smote['best_score'].argmax()
df_score_smote.loc[[row_score_max_smote]]

### Display the results again to compare before and after handling the imbalance

In [None]:
x_train_smote, x_test_smote, y_train_smote, y_test_smote = train_test_split(x_sm, y_sm, test_size=0.25, random_state=101)

In [None]:
model_smote = RandomForestClassifier(n_estimators=100)
model_smote.fit(x_train_smote,y_train_smote)
model_smote.score(x_test_smote,y_test_smote)

In [None]:
y_predicted_smote = model_smote.predict(x_test_smote)
cm = confusion_matrix(y_test_smote,y_predicted_smote)
plt.figure(figsize = (10,7))
sb.heatmap(cm, annot=True, fmt="d")  # counts are integers
plt.xlabel('Predicted')
plt.ylabel('Actual')  # rows of the confusion matrix are the true labels

In [None]:
print(classification_report(y_test_smote,y_predicted_smote))

# **Conclusion**
## Handling the class imbalance with SMOTE raised the F1-score for the potable class ("1") from **0.46** to **0.74**.