# *Predicting Water Potability from Physico-Chemical Variables: An Exploratory Data Analysis and Machine Learning Approach*

---




## Problem
This project aims to predict the potability of water based on physico-chemical variables. It involves data collection, preprocessing, and exploratory data analysis (EDA) to understand the data's characteristics. Subsequently, a machine learning model is built and evaluated using appropriate metrics. The ultimate goal is to provide a reliable method for determining whether water is safe to drink based on its chemical properties.

## Loading data and libraries

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
downloaded = drive.CreateFile({'id':'1j6iz1eqa1XHpvH-J4JrkqWgEWdQu-3n2'}) # replace the id with id of file you want to access
downloaded.GetContentFile('water_porability.csv')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
pd.set_option('display.max_rows', 111)     # full option name, not the abbreviated prefix
pd.set_option('display.max_columns', 111)

In [None]:
water = pd.read_csv('water_porability.csv')
water.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|----|----------|--------|-------------|---------|--------------|----------------|-----------------|-----------|------------|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


# 1. Exploratory Data Analysis
## Objective

*   Understand the data
*   Develop a naive model

## Checklist
### Data structure

*   target variable: Potability
*   dimensions: 3276 × 10
*   data types: 9 float, 1 integer (target; yes/no)
*   missing data: three variables contain NaN values
      *   Trihalomethanes: 4.945%
      *   ph: 14.98%
      *   Sulfate: 23.84%

### Deep analysis

*    Target visualization
      *   non-potable 60.989%  (1998)
      *   potable     39.011%  (1278)

* Variable descriptions

1. pH value:

pH indicates whether the water is acidic or alkaline. The WHO recommends a permissible pH range of 6.5 to 8.5. The current investigation ranges were 6.52–6.83, which are within the WHO standards.

2. Hardness:

Hardness is mainly caused by calcium and magnesium salts.

3. Solids (Total dissolved solids - TDS):

Water has the ability to dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc.

4. Chloramines:

Chlorine and chloramine are the major disinfectants used in public water systems.

5. Sulfate:

Sulfates are naturally occurring substances that are found in minerals, soil, and rocks.

6. Conductivity:

Measures the ability of the water to conduct an electric current.

7. Organic_carbon:

Measure of the total amount of carbon in organic compounds in pure water.

8. Trihalomethanes:

THMs are chemicals which may be found in water treated with chlorine.

9. Turbidity:

Measure of the light-scattering properties of water; the test is used to indicate the quality of waste discharge with respect to colloidal matter.

10. Potability:

Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
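As a small illustration of the WHO limits quoted for pH, the share of samples inside the 6.5 to 8.5 range could be computed roughly as below. The five pH values are the ones shown in the `water.head()` output above (the first one is missing); the helper name `share_within_who_ph` and the two constants are hypothetical and only serve this sketch.

```python
import numpy as np
import pandas as pd

# WHO limits as quoted in the pH description (assumed constants for this sketch)
WHO_PH_MIN, WHO_PH_MAX = 6.5, 8.5


def share_within_who_ph(ph: pd.Series) -> float:
    """Return the fraction of non-missing pH values inside the WHO range."""
    observed = ph.dropna()  # ignore missing measurements
    return float(observed.between(WHO_PH_MIN, WHO_PH_MAX).mean())


# pH values of the first five rows of the dataset
ph_head = pd.Series([np.nan, 3.71608, 8.099124, 8.316766, 9.092223], name="ph")
who_share = share_within_who_ph(ph_head)
print(who_share)
```

Two of the four observed values fall inside the range, so many samples in this dataset may lie outside the WHO limits; applying the same helper to `water['ph']` would quantify that.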

### Distribution analysis

- The distributions of the variables seem symmetric and approximately normal, except for Solids, which is asymmetric.

- The variables are measured on different scales, so a normalization step is highly recommended.

- The density curves for pH, Chloramines, Hardness, Solids and Sulfate show a difference between the two sets of water, potable and non-potable (hypothesis).
- The correlation matrix shows that all correlation coefficients are close to 0, which means there is no notable linear correlation between the variables.
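The normalization step recommended above could be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own pipeline: the column names `Solids` and `Turbidity` come from the dataset, but the values, the `demo` frame and the choice of `StandardScaler` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two columns measured on very different scales
# (values are made up for illustration, not taken from the dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Solids": rng.normal(22000, 8000, size=200),
    "Turbidity": rng.normal(4, 0.8, size=200),
})

# Fit the scaler on the training part only, then reuse it on the rest,
# so that no statistics from held-out rows enter the preprocessing.
train, test = demo.iloc[:160], demo.iloc[160:]
scaler = StandardScaler().fit(train)
train_scaled = pd.DataFrame(scaler.transform(train), columns=demo.columns)
test_scaled = pd.DataFrame(scaler.transform(test), columns=demo.columns)

# After scaling, each training column has mean 0 and unit variance
print(train_scaled.describe().loc[["mean", "std"]])
```

Scaling matters most for distance-based models such as k-nearest neighbours, where an unscaled column like Solids would otherwise dominate the distance.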


In [None]:
water.shape

(3276, 10)

In [None]:
water.dtypes.value_counts()

float64    9
int64      1
dtype: int64

In [None]:
for col in water.select_dtypes('float'):
    plt.figure()
    sns.histplot(water[col], kde=True, stat='density')  # distplot is deprecated in recent seaborn

## Handling missing data

In [None]:
(water.isna().sum()/water.shape[0]).sort_values(ascending=True)

In [None]:
water.isna().sum()

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(water.isna(), cbar=False)

In [None]:
water.isna().sum()/water.shape[0]

In [None]:
potable_df = water[water['Potability'] == 1]
nonepotable_df = water[water['Potability'] == 0]

In [None]:
for col in water.select_dtypes('float'):
    plt.figure()
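    # histplot replaces the deprecated distplot; stat='density' keeps both classes comparable
    sns.histplot(potable_df[col], kde=True, stat='density', label='potable')
    sns.histplot(nonepotable_df[col], kde=True, stat='density', label='non potable')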
    plt.legend()

## Target visualization

In [None]:
water['Potability'].value_counts()  # pass normalize=True to get proportions instead of counts
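The percentages quoted in the checklist can be reproduced from the class counts. The sketch below rebuilds a target column from the counts stated there (1998 non-potable, 1278 potable) instead of loading the CSV; the names `target` and `shares` are hypothetical.

```python
import pandas as pd

# Rebuild the target from the counts reported in the checklist
# (0 = not potable, 1 = potable); this is a stand-in for water['Potability'].
target = pd.Series([0] * 1998 + [1] * 1278, name="Potability")

# normalize=True returns proportions; multiply by 100 for percentages
shares = target.value_counts(normalize=True) * 100
print(shares.round(3))
```

With roughly 61% of the samples in the non-potable class, a model that always predicts class 0 would already reach about 61% accuracy, which is a useful baseline when reading the classification reports.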



## Correlation Matrix

In [None]:
sns.clustermap(water.select_dtypes('float').corr())

In [None]:
water.select_dtypes('float').corr()

# 2. Data Preprocessing



In [None]:
water.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
trainset, testset = train_test_split(water, test_size=0.2, random_state=0)

In [None]:
trainset['Potability'].value_counts()

In [None]:
testset['Potability'].value_counts()

In [None]:
testset.shape, trainset.shape

In [None]:
# column means of the training set, used below to impute NaN values
trainset.mean()

In [None]:
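def preprocessor(df, fill_values=None):
    # Impute NaN values with column means. For the test set, pass the
    # training-set means so that no test statistics are used.
    if fill_values is None:
        fill_values = df.mean()
    df = df.fillna(fill_values)
    X = df.drop('Potability', axis=1).reset_index(drop=True)
    y = df['Potability'].reset_index(drop=True)
    print(y.value_counts())
    return X, y

In [None]:
X_train, Y_train = preprocessor(trainset)

In [None]:
X_test, Y_test = preprocessor(testset, fill_values=trainset.mean())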

# 3. Build model

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

nb_model = GaussianNB().fit(X_train, Y_train)
y_pred = nb_model.predict(X_test)
print(classification_report(Y_test, y_pred))  # signature is (y_true, y_pred)

In [None]:
from sklearn import tree
dt_model = tree.DecisionTreeClassifier()  # decision tree, not a random forest
dt_model.fit(X_train, Y_train)
y_pred = dt_model.predict(X_test)

print(classification_report(Y_test, y_pred))  # signature is (y_true, y_pred)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
y_pred = knn.predict(X_test)

print(classification_report(Y_test, y_pred))  # signature is (y_true, y_pred)

In [None]:
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression().fit(X_train,Y_train)
y_pred = reg.predict(X_test)

print(classification_report(Y_test, y_pred))  # signature is (y_true, y_pred)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier().fit(X_train,Y_train)
y_pred = rf.predict(X_test)

print(classification_report(Y_test, y_pred))  # signature is (y_true, y_pred)

In [None]:
from sklearn.model_selection import GridSearchCV
params_nb = {
    'var_smoothing': np.logspace(0, -9, num=100)
}
params_rf = {
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2', None],
    'max_depth' : [4,5,6,7,8],
    'max_leaf_nodes': [3, 6, 9]
}
params_lr = {
    'penalty' : ['l1', 'l2'],
    "C": [0.001, 0.01, 0.1, 1]
}
params_knn = {
    'n_neighbors' : list(range(1,20,2)),
    'weights' : ['uniform', 'distance'],
}
params_dt = {
    'max_depth': list(range(3,15)),
    'criterion': ['gini', 'entropy', 'log_loss']
}

In [None]:
from sklearn.model_selection import GridSearchCV
# Exhaustive grid search; RandomizedSearchCV is a cheaper alternative for large grids
grid_search = GridSearchCV(RandomForestClassifier(),param_grid=params_rf, scoring="accuracy")
grid_search.fit(X_train, Y_train)
print(grid_search.best_estimator_)

RandomForestClassifier(max_depth=8, max_features=None, max_leaf_nodes=9,
                       n_estimators=200)


In [None]:
grid_search = GridSearchCV(
    LogisticRegression(random_state=123, class_weight="balanced", solver="liblinear"),
    param_grid=params_lr,
    scoring="accuracy",
)
grid_search.fit(X_train, Y_train)
print(grid_search.best_estimator_)

In [None]:
reg = LogisticRegression(C=0.01, class_weight='balanced', penalty='l1',
                   random_state=123, solver='liblinear').fit(X_train,Y_train)
y_pred = reg.predict(X_test)

print(classification_report(Y_test, y_pred))  # signature is (y_true, y_pred)