# Water Potability Prediction with 65% Accuracy | 78% F1 Score

Access to potable water is a basic human right and essential for humanity. Yet many people today still lack access to it. Various projects explore water sources, but the key question is whether the water they find is potable. A good machine learning model can help answer that question from measurable water-quality features. In this notebook I build such a model on the water potability dataset using the Random Forest Classifier algorithm. If you find this notebook useful, please upvote. If you have questions, ideas or recommendations, leave a comment. I'm excited to read your comments.


<div align="center"><img src="https://api.hub.jhu.edu/factory/sites/default/files/styles/full_width/public/drink-more-water-hub.jpg?itok=nokoCMv6" /></div>


### CONTENT

[1. Libraries](#1) <br/>
[2. Exploratory Data Analysis](#2) <br/>
[3. Data Visualization](#3) <br/>
[4. Data Imputation](#4) <br/>
[5. Train And Test Split](#5) <br/>
[6. Create And Evaluate Model](#6) <br/>
[7. Conclusion](#7)

<a id="1"></a>
## Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

from collections import Counter
from sklearn.impute import KNNImputer

# Train and Test Split
from sklearn.model_selection import train_test_split, GridSearchCV

# Create Model
from sklearn.ensemble import RandomForestClassifier

# Evaluate Model
from sklearn.metrics import classification_report

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="2"></a>
## Exploratory Data Analysis

1. The Water Potability dataset contains 3276 water sources and 9 features.
1. The ph, Sulfate and Trihalomethanes features have missing values (about 15%).
1. In this dataset the non-potable water rate is 60.99% and the potable water rate is 39.01%.
1. The dataset is slightly imbalanced against potable water.

**Features:**

* ph               
* Hardness
* Solids           
* Chloramines      
* Sulfate          
* Conductivity     
* Organic_carbon   
* Trihalomethanes 
* Turbidity        

**Target**

* Potability 

In [None]:
data = pd.read_csv("/kaggle/input/water-potability/water_potability.csv")
data.head(10)

In [None]:
data.info()

In [None]:
print(pd.isnull(data).sum())
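
The raw counts above can be converted into percentages, together with the class balance. The sketch below is a minimal example that runs on a tiny made-up frame; `toy` and the helper `missing_and_balance` are hypothetical names, not part of the original notebook. The same helper should work on `data`, assuming the target column is called `Potability`.

```python
import numpy as np
import pandas as pd


def missing_and_balance(df, target="Potability"):
    """Return (% missing per column, % of rows per target class)."""
    missing_pct = df.isnull().mean() * 100
    class_pct = df[target].value_counts(normalize=True) * 100
    return missing_pct.round(2), class_pct.round(2)


# Tiny made-up frame standing in for the real dataset
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [330.0, 310.0, np.nan, np.nan],
    "Potability": [0, 0, 1, 0],
})

missing_pct, class_pct = missing_and_balance(toy)
print(missing_pct)
print(class_pct)
```

Calling `missing_and_balance(data)` right after loading the CSV would give the per-column missing rate and the potable / non-potable split in one step.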

In [None]:
data.describe()

In [None]:
f, ax = plt.subplots(figsize=(10,10))
sns.heatmap(data.corr(),annot=True, linewidths=.5, ax=ax)
plt.show()

<a id="3"></a>
## Data Visualization

In [None]:
label = ['Potable', 'Not Potable']
counter_labels = Counter(data['Potability'])
size = [counter_labels[1], counter_labels[0]]
colors = ['blue', 'green']

plt.subplots(figsize=(8,8))
plt.pie(size, labels=label, colors=colors,autopct='%.2f')
plt.show()

In [None]:
f, ax = plt.subplots(3, 3, figsize=(15, 15), sharey=True)
# One histogram per feature, filled row by row in the 3x3 grid
feature_columns = data.columns.drop('Potability')
for axis, column in zip(ax.flat, feature_columns):
    sns.histplot(data[column], kde=True, ax=axis)
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10,10))
sns.boxplot(data=data.iloc[:,[0,3,6,8]])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10,10))
sns.boxplot(data=data.iloc[:,[1,4,5,7]])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(5,5))
sns.boxplot(data=data.iloc[:,2])
plt.show()

<a id="4"></a>
## Data Imputation

The dataset is slightly imbalanced against potable water, and the features with missing values are missing about 15% of their entries. Classic imputation methods (mean, zero fill, etc.) may not give the model good performance, so I used the k-nearest neighbors algorithm to impute the missing values. For more information about the KNN algorithm for imputation, click [here](https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/).
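
To make the idea concrete, here is a minimal sketch of KNN imputation on a tiny made-up array (the values and the names `X` and `X_filled` are illustrative assumptions, not taken from the dataset). A missing entry is filled with the mean of that column over the nearest rows, where distance is measured on the columns that are observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up values: the second column of row 2 is missing
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

# With n_neighbors=2 the gap is filled with the mean of that column over
# the two rows closest to row 2 (distances use only the observed columns)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[2])
```

Here the two closest rows by the first column are `[1, 10]` and `[2, 20]`, so the gap is filled with the mean of 10 and 20. Because the method is distance based, features on very different scales can dominate the neighbor search, so scaling before imputation may be worth trying.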

In [None]:
# Fill each missing value with the mean of that feature over the 3 nearest rows
imputer = KNNImputer(n_neighbors=3)
new_data = imputer.fit_transform(data)
# fit_transform returns a NumPy array, so restore the original column names
data = pd.DataFrame(new_data, columns=data.columns)

<a id="5"></a>
## Train And Test Split

I split the dataset into an 80% training set and a 20% test set.
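
Since the classes are imbalanced, one optional variation is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below uses made-up labels and is only an illustration of the `stratify` argument; it is not what this notebook does, which splits without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 60% class 0 and 40% class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# stratify=y keeps the 60/40 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```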

In [None]:
x_train, x_test, y_train, y_test = train_test_split(data.drop(["Potability"],axis=1),data["Potability"], test_size=0.2,random_state=42)
print("X_train",x_train.shape)
print("X_test",x_test.shape)
print("Y_train",y_train.shape)
print("Y_test",y_test.shape)

<a id="6"></a>
## Create And Evaluate Model

I used a Random Forest Classifier as the model and tuned its hyperparameters with grid search cross-validation.
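
As a minimal sketch of the same pattern, the example below runs a small grid search on synthetic data. The data, the reduced grid `small_grid`, and the choice of `scoring="f1_macro"` are assumptions made for illustration; the notebook itself uses the default accuracy scoring and a larger grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced data standing in for the real features
X, y = make_classification(
    n_samples=300, n_features=9, weights=[0.6, 0.4], random_state=42
)

small_grid = {"n_estimators": [20, 40], "max_depth": [2, 4]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring="f1_macro",  # assumption: macro F1 instead of the default accuracy
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))

# With the default refit=True, best_estimator_ is already refit on all the data
best_model = search.best_estimator_
```

Using `best_estimator_` avoids retyping the winning parameters by hand when building the final model.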

In [None]:
rf = RandomForestClassifier(random_state=42)
param_grid = { 
    'n_estimators': [50, 100, 150],
    'max_depth': [2,3,4],
    'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'criterion' :['gini', 'entropy']
}

gscv = GridSearchCV(estimator=rf, param_grid=param_grid, cv= 10)
gscv.fit(x_train, y_train)

In [None]:
gscv.best_params_

In [None]:
rf = RandomForestClassifier(random_state=42, max_features='sqrt', max_depth=4, n_estimators=50, criterion='gini')
rf.fit(x_train,y_train)
prediction = rf.predict(x_test)
print(classification_report(y_test, prediction, target_names=["0","1"]))

<a id="7"></a>
## Conclusion

1. My model's accuracy is 65%.
1. My model's F1 score for the non-potable class is 78%.
1. For that class, precision and recall are 65% and 98%.
1. The dataset is slightly imbalanced against potable water.
1. The model can already predict non-potable water successfully.
1. With more potable water samples, the model could also become successful at predicting potable water.
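
The F1 score follows directly from the precision and recall in the list above, since F1 is their harmonic mean. A quick check, using the rounded values reported here:

```python
# Precision and recall as reported in the list above
precision = 0.65
recall = 0.98

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```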