# Water Potability

![Water Potability](./images/pexels-photo-416528.jpeg)


# Table of contents
##### 1. Import libraries
##### 2. Exploratory data analysis
##### 3. Data cleaning
##### 4. Removing outliers
##### 5. Splitting data and model building
##### 6. Predictive system

# 1. Import Libraries

In [None]:
# Standard libraries for data analysis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Modules for data preprocessing
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


# Metrics and models
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

: 

In [None]:
df = pd.read_csv("water_potability.csv")

: 

In [None]:
df.head()

: 

In [None]:
df.columns

: 

## About the dataset
pH:

It measures the acidity or alkalinity of the water. pH values range from 0 to 14, with 7 being neutral.

Hardness:

Unit: mg/L (milligrams per liter). 
It indicates the concentration of calcium and magnesium in the water. Higher values suggest harder water.

Solids:

Unit: ppm (parts per million). 
It represents the total concentration of solids in the water, including both dissolved solids (TDS) and suspended solids.

Chloramines:

Unit: ppm (parts per million). 
It measures the concentration of chloramines, which are disinfectants used to treat water.

Sulfate:

Unit: mg/L (milligrams per liter). 
It indicates the concentration of sulfate ions in the water. High levels can affect the taste and quality of water.

Conductivity:

Unit: µS/cm (microsiemens per centimeter). 
It measures the water’s ability to conduct electricity, which correlates with the ion concentration. Higher values indicate higher ionic content.

Organic_carbon:

Unit: ppm (parts per million).  
It represents the concentration of organic carbon compounds in the water. Organic carbon levels can indicate the presence of organic pollutants.

Trihalomethanes:

Unit: µg/L (micrograms per liter). 
It measures the concentration of trihalomethanes, which are byproducts of water disinfection with chlorine. High levels can be harmful to health.

Turbidity:

Unit: NTU (Nephelometric Turbidity Units). 
It indicates the cloudiness of the water caused by large numbers of individual particles. Higher turbidity can be a sign of contaminants.

Potability:

 Binary (0 or 1). 
It indicates whether the water is potable (safe to drink) or not. 0 represents non-potable, and 1 represents potable.

# 2. Exploratory Data Analysis

In [None]:
df.info()

: 

Potability is the only integer column; the nine feature columns are floating point.

In [None]:
df.describe()

: 

In [None]:
df.shape

: 

#### Check for correlation

In [None]:
plt.figure(figsize = (12,8))
sns.heatmap(df.corr(), annot = True)

: 

All features seem to have a weak (low) correlation with Potability; among them, Solids has the highest correlation with Potability.
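
Pearson correlation, which `df.corr()` computes by default, only measures linear association, so a weak coefficient does not by itself rule out a relationship with the target. The sketch below is a from-scratch illustration on made-up numbers (the helper `pearson_r` and the sample lists are not part of this notebook): a perfectly linear pair scores 1, while a perfectly quadratic pair scores 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data only, not taken from water_potability.csv
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x * x for x in xs]    # y depends on x, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(f"linear: {r_linear:.2f}, quadratic: {r_quadratic:.2f}")
```

This is one reason tree-based models can still be worth trying when the heatmap shows low coefficients.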

In [None]:
# Check target feature

sns.countplot(x='Potability', data= df, color='skyblue', edgecolor='black')

: 

#### Imbalanced data

In [None]:
df['Potability'].value_counts()

: 
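
Class imbalance matters because accuracy alone can look acceptable for a model that never predicts the minority class. The sketch below uses hypothetical label counts (60 and 40, chosen only for illustration) to compute the accuracy of always predicting the majority class, which is a useful baseline to compare the later models against.

```python
from collections import Counter

# Hypothetical labels: 60 non-potable (0) and 40 potable (1) samples
labels = [0] * 60 + [1] * 40

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)
print(f"majority class = {majority_class}, baseline accuracy = {baseline_accuracy:.2f}")
```

On the real data, the same figure could be read from `df['Potability'].value_counts(normalize=True)`.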

In [None]:
# Check distribution of each feature with Hist Plot

plt.figure(figsize=(16,9))
plt.subplots_adjust(wspace=0.3, hspace=0.3)


# Draw one histogram per column on a 4x4 grid of subplots
for i, col in enumerate(df.columns, start=1):
    plt.subplot(4, 4, i)
    sns.histplot(data=df, x=col, kde=True, color='skyblue', edgecolor='black', bins=20)
    plt.title(f'Distribution of {col}')

# Display the plots
plt.tight_layout()
plt.show()


: 

Generally speaking, all features are approximately normally distributed, except the Solids column, which is right-skewed.
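
Skewness can also be checked numerically rather than only by eye. The sketch below computes the (biased) moment-based skewness on two made-up lists; `sample_skewness` is a hypothetical helper, not part of this notebook. A value near 0 suggests symmetry and a positive value indicates a longer right tail.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Illustrative data only
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_skewed)
print(f"symmetric: {skew_symmetric:.2f}, right-skewed: {skew_right:.2f}")
```

On the DataFrame itself, `df.skew()` reports a bias-corrected version of the same statistic, so the numbers differ slightly but the sign reads the same way.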

In [None]:
# Visualise each column with a box plot to check for outliers

fig, ax = plt.subplots(ncols = 5, nrows = 2, figsize = (16,9))
ax = ax.flatten()
# Pair each column with its own subplot axis
for axis, col in zip(ax, df.columns):
    sns.boxplot(y=col, data=df, ax=axis)

: 

Non linear relationship

# 3. Data cleaning

In [None]:
df.isnull().sum()

: 

In [None]:
# Plot the number of missing values per column
isna_sum = df.isna().sum()
plt.figure(figsize=(10, 5))
sns.barplot(x=isna_sum.index, y=isna_sum.values, color='skyblue', edgecolor='black')
plt.title('missing values by column')
plt.xticks(rotation=45)
plt.show()

: 

### ph, Sulfate, and Trihalomethanes contain null values


In [None]:
# Fill missing values with each column's mean
df['ph'] = df['ph'].fillna(df['ph'].mean())
df['Sulfate'] = df['Sulfate'].fillna(df['Sulfate'].mean())
df['Trihalomethanes'] = df['Trihalomethanes'].fillna(df['Trihalomethanes'].mean())

: 

In [None]:
df.isnull().sum()

: 

In [None]:
df.duplicated().sum()

: 

# 4. Removing Outliers

The cell below rounds every feature value to the nearest integer and converts it to int64 (for example, 3.9 becomes 4). Potability is left unchanged.

In [None]:
for col in df.columns:
    if col != "Potability":
        df[col] = df[col].round().astype('int64')

df.head(5)

: 
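
One detail of rounding is easy to miss: pandas rounds through NumPy, which sends exact halves to the nearest even integer, and Python's built-in `round` behaves the same way. The small example below uses made-up values to show this.

```python
# Illustrative values only
values = [3.9, 2.4, 2.5, 3.5]

# Exact halves go to the nearest even integer: 2.5 -> 2, 3.5 -> 4
rounded = [round(v) for v in values]
print(rounded)  # [4, 2, 2, 4]
```

Rounding also discards resolution on narrow-range features: pH readings of 6.6 and 7.4 both become 7, so whether this step helps the models is worth checking rather than assuming.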

The cells below remove outliers from specific columns of the main DataFrame using the interquartile range (IQR) rule.

In [None]:
# Compute the quartiles and the interquartile range (IQR) of Hardness
Q1 = df['Hardness'].quantile(0.25)
Q3 = df['Hardness'].quantile(0.75)
IQR = Q3 - Q1
# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

outliers = df[(df['Hardness'] < lower_bound) | (df['Hardness'] > upper_bound)]

# Drop the outlier rows from the main DataFrame
df = df.drop(outliers.index)
df.head(4)

: 
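
The quartile arithmetic in the cell above can be checked by hand on a small list. The sketch below is a standard-library illustration with made-up hardness values; `iqr_bounds` is a hypothetical helper, and `statistics.quantiles(..., method="inclusive")` is used because it interpolates linearly between data points in the same way as the default of pandas `quantile`.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values only, not taken from water_potability.csv
hardness = [150, 170, 180, 190, 196, 200, 205, 210, 220, 320]

lower, upper = iqr_bounds(hardness)
kept = [v for v in hardness if lower <= v <= upper]
outliers = [v for v in hardness if v < lower or v > upper]
print(f"fences = ({lower}, {upper}), outliers = {outliers}")
```

Here Q1 = 182.5 and Q3 = 208.75, so the fences are 143.125 and 248.125, and only 320 is dropped. Wrapping the same logic in a function that takes a column name would avoid repeating the cell once per feature.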

In [None]:
# Compute the quartiles and the interquartile range (IQR) of Trihalomethanes
Q1 = df['Trihalomethanes'].quantile(0.25)
Q3 = df['Trihalomethanes'].quantile(0.75)
IQR = Q3 - Q1
# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

outliers = df[(df['Trihalomethanes'] < lower_bound) | (df['Trihalomethanes'] > upper_bound)]

# Drop the outlier rows from the main DataFrame
df = df.drop(outliers.index)
df.head(4)


: 

In [None]:
# Compute the quartiles and the interquartile range (IQR) of Solids
Q1 = df['Solids'].quantile(0.25)
Q3 = df['Solids'].quantile(0.75)
IQR = Q3 - Q1
# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

outliers = df[(df['Solids'] < lower_bound) | (df['Solids'] > upper_bound)]

# Drop the outlier rows from the main DataFrame
df = df.drop(outliers.index)
df.head(4)

: 

In [None]:
# Compute the quartiles and the interquartile range (IQR) of Sulfate
Q1 = df['Sulfate'].quantile(0.25)
Q3 = df['Sulfate'].quantile(0.75)
IQR = Q3 - Q1
# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

outliers = df[(df['Sulfate'] < lower_bound) | (df['Sulfate'] > upper_bound)]

# Drop the outlier rows from the main DataFrame
df = df.drop(outliers.index)
df.head(4)

: 

In [None]:
# Compute the quartiles and the interquartile range (IQR) of Chloramines
Q1 = df['Chloramines'].quantile(0.25)
Q3 = df['Chloramines'].quantile(0.75)
IQR = Q3 - Q1
# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

outliers = df[(df['Chloramines'] < lower_bound) | (df['Chloramines'] > upper_bound)]

# Drop the outlier rows from the main DataFrame
df = df.drop(outliers.index)
df.head(4)

: 

In [None]:
fig, ax = plt.subplots(ncols = 5, nrows = 2, figsize = (16,9))
ax = ax.flatten()
# Pair each column with its own subplot axis
for axis, col in zip(ax, df.columns):
    sns.boxplot(y=col, data=df, ax=axis)

: 

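Outliers were removed from Hardness, Trihalomethanes, Solids, Sulfate, and Chloramines. Because each column is filtered once, in sequence, a few points may still fall outside the recomputed whiskers in the box plots.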

In [None]:
df.shape

: 

# 5. Splitting data and model building

In [None]:
X = df.drop('Potability', axis = 1)
y = df['Potability'] 

: 

In [None]:
y.value_counts()


: 

#### Handling imbalanced data

SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is an oversampling technique for imbalanced classification problems. It addresses the imbalance by generating synthetic samples for the minority class instead of duplicating existing ones.

Here the minority class is 1 (potable).
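
The core idea can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The code below is a simplified standard-library illustration on made-up 2-D points, not the imbalanced-learn implementation; `synthetic_sample` and its parameters are hypothetical names.

```python
import random

def synthetic_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

# Illustrative minority-class points only
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
rng = random.Random(42)
synthetic = [synthetic_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Because each synthetic row is built from real rows, a common recommendation is to split the data first and oversample only the training portion, so that no test row contributes to the training set.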

In [None]:


sm = SMOTE(random_state=42)

# Fit and apply the transform
X_resampled, y_resampled = sm.fit_resample(X, y)

: 

In [None]:
y_resampled.value_counts()

: 

#### Training and testing Dataset

In [None]:

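# Fix random_state so the split, and therefore the reported scores, are reproducible
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size = 0.2, random_state = 42)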

: 

In [None]:
X_train.shape, X_test.shape

: 

### Logistic Regression

In [None]:
# creating the object of model and training the model
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

: 

In [None]:
#prediction
pred_lr = model_lr.predict(X_test)
accuracy_lr =accuracy_score(y_test,pred_lr)
print('accuracy_score =', accuracy_lr)
print('confusion_matrix =\n',confusion_matrix(y_test, pred_lr))
print('classification_report =\n', classification_report(y_test, pred_lr))


: 
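
The numbers in the confusion matrix and the classification report are linked by simple arithmetic. The sketch below uses a hypothetical 2x2 matrix (the counts are invented for illustration) in scikit-learn's layout, where rows are true classes and columns are predicted classes, to derive accuracy, precision, recall, and F1 for class 1.

```python
# Hypothetical confusion matrix: [[TN, FP], [FN, TP]]
matrix = [[50, 10],
          [5, 35]]
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of the samples predicted potable, how many are
recall = tp / (tp + fn)                     # of the truly potable samples, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a potability model, precision on class 1 is arguably the figure to watch, since a false positive means calling unsafe water safe.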

### Decision Tree Classifier

In [None]:
# creating the object of model and training the model

model_dt = DecisionTreeClassifier(max_depth = 4)
model_dt.fit(X_train, y_train)

: 

In [None]:
#prediction
pred_dt = model_dt.predict(X_test)
accuracy_dt =accuracy_score(y_test,pred_dt)
print('accuracy_score =', accuracy_dt)
print('confusion_matrix =\n',confusion_matrix(y_test, pred_dt))
print('classification_report =\n', classification_report(y_test, pred_dt))


: 

### Random Forest Classifiers

In [None]:
# creating the object of model and training the model

model_rf = RandomForestClassifier()
model_rf.fit(X_train,y_train)

: 

In [None]:
#prediction

pred_rf = model_rf.predict(X_test)
accuracy_rf =accuracy_score(y_test,pred_rf)
print('accuracy_score =', accuracy_rf)
print('confusion_matrix =\n',confusion_matrix(y_test, pred_rf))
print('classification_report =\n', classification_report(y_test, pred_rf))


: 

### KNeighbors Classifiers

In [None]:
#training and predicting the model
for i in range(4,12):
    model_knn = KNeighborsClassifier(n_neighbors = i)
    model_knn.fit(X_train, y_train)
    pred_knn = model_knn.predict(X_test)
    print('accuracy =',accuracy_score(y_test,pred_knn), 'i =' ,i)

: 
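
KNN ranks neighbours by distance, so features measured on large scales can dominate the result. The sketch below uses hypothetical (ph, Solids) readings and hypothetical standard deviations to show how the nearest neighbour can change once each feature is divided by its spread. This notebook does not scale its features; the example only illustrates why scaling is often tried with distance-based models.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical (ph, Solids) readings
query = (7, 20000)
different_ph = (3, 20010)   # very different ph, almost the same Solids
same_ph = (7, 20100)        # same ph, slightly different Solids

# Hypothetical per-feature standard deviations used for rescaling
stds = (1.5, 8000)

def scale(point):
    return tuple(v / s for v, s in zip(point, stds))

candidates = {"different_ph": different_ph, "same_ph": same_ph}
raw_nearest = min(candidates, key=lambda name: euclidean(query, candidates[name]))
scaled_nearest = min(candidates, key=lambda name: euclidean(scale(query), scale(candidates[name])))
print(f"raw: {raw_nearest}, scaled: {scaled_nearest}")
```

Full standardisation also subtracts the mean, but that shift cancels out in distances, so dividing by the spread is enough for this illustration.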

In [None]:
model_knn = KNeighborsClassifier(n_neighbors = 4)
model_knn.fit(X_train, y_train)
pred_knn = model_knn.predict(X_test)
accuracy_knn =accuracy_score(y_test,pred_knn)
print('accuracy_score =', accuracy_knn)
print('confusion_matrix =\n',confusion_matrix(y_test, pred_knn))
print('classification_report =\n', classification_report(y_test, pred_knn))


: 

### Support Vector Machine

In [None]:
# creating the object of model and training the model
model_svc = SVC(kernel = "rbf")
model_svc.fit(X_train, y_train)

: 

In [None]:
#prediction
pred_svc = model_svc.predict(X_test)
accuracy_svc =accuracy_score(y_test,pred_svc)
print('accuracy_score =', accuracy_svc)
print('confusion_matrix =\n',confusion_matrix(y_test, pred_svc))
print('classification_report =\n', classification_report(y_test, pred_svc))


: 

### Adaboost Classifiers

In [None]:
# creating the object of model and training the model
model_adb = AdaBoostClassifier(n_estimators = 100)
model_adb.fit(X_train,y_train)

: 

In [None]:
#prediction
pred_adb = model_adb.predict(X_test)
accuracy_adb =accuracy_score(y_test,pred_adb)
print('accuracy_score =', accuracy_adb)
print('confusion_matrix =\n',confusion_matrix(y_test, pred_adb))
print('classification_report =\n', classification_report(y_test, pred_adb))


: 

### XGBOOST Classifier

In [None]:
# creating the object of model and training the model
model_xgb = XGBClassifier(n_estimators = 200, learning_rate = 0.03)
model_xgb.fit(X_train, y_train)

: 

In [None]:
#prediction
pred_xgb = model_xgb.predict(X_test)
accuracy_xgb = accuracy_score(y_test, pred_xgb)  # score the XGBoost predictions, not pred_lr
print('accuracy_score =', accuracy_xgb)
print('confusion_matrix =\n',confusion_matrix(y_test, pred_xgb))
print('classification_report =\n', classification_report(y_test, pred_xgb))


: 

In [None]:
models = pd.DataFrame({
    "Model" : ["Logistic Regression",
                 "Decision Tree",
                 "Random Forest",
                 "KNN",
                 "SVM",
                 "Adaboost",
                 "XGboost"],
    "Accuracy_score" : [accuracy_lr, accuracy_dt, accuracy_rf, accuracy_knn,
                          accuracy_svc, accuracy_adb, accuracy_xgb]
})
        

: 

In [None]:
models

: 

In [None]:
# Sort by accuracy first so the bar chart and the table are both ranked
models = models.sort_values(by = "Accuracy_score", ascending = False)
sns.barplot(x = "Accuracy_score", y = "Model", data = models)
models

: 

##### The best model is Random Forest, with the highest accuracy

### Hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold

params_RF = {"min_samples_split": [2,6],
             "min_samples_leaf":  [1,4],
             "n_estimators" :[100,200,300],
             "criterion":["gini", 'entropy']
            }
cv_method = StratifiedKFold(n_splits = 3)
GridSearchCV_RF = GridSearchCV(estimator = RandomForestClassifier(),
                               param_grid = params_RF,
                               cv = cv_method,
                               verbose = 1,
                               n_jobs = 2,
                               scoring = "accuracy",
                               return_train_score = True
                              )
GridSearchCV_RF.fit(X_train,y_train)
best_params_RF = GridSearchCV_RF.best_params_
print("Best hyperparameters for Random Forests are =",best_params_RF)
                               

: 
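
It helps to know how much work a grid search implies before running it. The sketch below copies the `params_RF` grid and the 3-fold setting from the cell above and enumerates the combinations with the standard library; the variable names `candidates` and `cv_fits` are introduced here for illustration.

```python
from itertools import product

# Same grid as in the tuning cell
params_RF = {"min_samples_split": [2, 6],
             "min_samples_leaf": [1, 4],
             "n_estimators": [100, 200, 300],
             "criterion": ["gini", "entropy"]}
n_splits = 3

# Every combination of one value per hyperparameter
candidates = [dict(zip(params_RF, combo)) for combo in product(*params_RF.values())]
cv_fits = len(candidates) * n_splits

print(f"{len(candidates)} candidates x {n_splits} folds = {cv_fits} fits")
```

That is 2 x 2 x 3 x 2 = 24 candidates and 72 cross-validation fits. With the default `refit=True`, `GridSearchCV` then trains the best candidate once more on the whole training set.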

In [None]:
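# GridSearchCV refits the best configuration on the full training set by default (refit=True),
# so best_estimator_ is already trained and needs no further fit
best_estimator = GridSearchCV_RF.best_estimator_
y_pred_best = best_estimator.predict(X_test)
print(classification_report(y_test, y_pred_best))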

: 

In [None]:
print(f"Accuracy of Random Forest Model = {round(accuracy_score(y_test, y_pred_best)*100,2)}%")

: 

# 6. Predictive system

In [None]:
df.columns

: 

In [None]:
# Readings such as 7.4 are valid, so parse as float and round to match the integer-rounded training features
ph = round(float(input("Enter the ph value: ")))
Hardness = round(float(input("Enter the Hardness value: ")))
solids = round(float(input("Enter the solids value: ")))
chloramines = round(float(input("Enter the chloramines value: ")))
sulfate = round(float(input("Enter the sulfate value: ")))
conductivity = round(float(input("Enter the conductivity value: ")))
organic_carbon = round(float(input("Enter the organic_carbon value: ")))
trihalomethanes = round(float(input("Enter the trihalomethanes value: ")))
turbidity = round(float(input("Enter the turbidity value: ")))

: 
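
User-typed readings are worth validating before they reach the model. The sketch below shows one possible approach; `parse_reading` and its range arguments are hypothetical and not part of this notebook. It accepts decimal input, rejects values outside a plausible range (pH is defined on 0 to 14), and rounds to an integer to match the training features.

```python
def parse_reading(text, name, minimum=0.0, maximum=None):
    """Convert typed input to a rounded integer reading, rejecting out-of-range values."""
    value = float(text)  # raises ValueError for non-numeric input
    if value < minimum or (maximum is not None and value > maximum):
        raise ValueError(f"{name} = {value} is outside the expected range")
    return round(value)

# Example inputs, as they might be typed by a user
ph = parse_reading("7.4", "ph", maximum=14)
hardness = parse_reading(" 203.9 ", "Hardness")
print(ph, hardness)  # 7 204
```

In the notebook, each `input(...)` call could be passed through such a helper so that a typo fails early with a clear message.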

In [None]:
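# Build a one-row DataFrame with the training column names so the feature order is explicit
input_data = pd.DataFrame([[ph, Hardness, solids, chloramines, sulfate, conductivity, organic_carbon,
                            trihalomethanes, turbidity]], columns=X.columns)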

: 

In [None]:
model_prediction = best_estimator.predict(input_data)
model_prediction

: 

In [None]:
if model_prediction[0] == 0:
    print("Water is not safe for consumption")
else:
    print("Water is safe for consumption")

: 
