<center><img src="https://p9cdn4static.sharpschool.com/UserFiles/Servers/Server_791028/Image/News-Announcements/water-testing.jpg" height='1' width='0.05'></center>

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Dataset Overview</b></h1>

Safe and readily available water is important for public health, whether it is used for drinking, domestic use, food production or recreational purposes. Improved water supply and sanitation, and better management of water resources, can boost countries’ economic growth and can contribute greatly to poverty reduction.

Contaminated water and poor sanitation are linked to transmission of diseases such as cholera, diarrhoea, dysentery, hepatitis A, typhoid, and polio. Absent, inadequate, or inappropriately managed water and sanitation services expose individuals to preventable health risks. This is particularly the case in health care facilities where both patients and staff are placed at additional risk of infection and disease when water, sanitation, and hygiene services are lacking. Globally, 15% of patients develop an infection during a hospital stay, with the proportion much greater in low-income countries.

# Description of the features
* pH is a measure of how acidic or basic water is. The range goes from 0 to 14, with 7 being neutral. A pH below 7 indicates acidity, whereas a pH above 7 indicates a base. The pH of water is a very important measurement concerning water quality.
* Hardness was originally a measure of the capacity of water to react with soap, where hard water requires more soap to create a lather. Water containing calcium carbonate at concentrations below 60 mg/L is generally considered soft; 60–120 mg/L, moderately hard; 120–180 mg/L, hard; and more than 180 mg/L, very hard.
* Solids (total dissolved solids): the principal constituents are usually calcium, magnesium, sodium, and potassium cations and carbonate, hydrogencarbonate, chloride, sulfate, and nitrate anions. The presence of dissolved solids in water may affect its taste.
* Chloramines (also known as secondary disinfection) are disinfectants used to treat drinking water and they:
    * Are most commonly formed when ammonia is added to chlorine to treat drinking water.
    * Provide longer-lasting disinfection as the water moves through pipes to consumers.
* Sulfate is second to bicarbonate as the major anion in hard water reservoirs. Sulfates (SO₄²⁻) can be naturally occurring or the result of municipal or industrial discharges. In humans, concentrations of 500–750 mg/L cause a temporary laxative effect. However, doses of several thousand mg/L did not cause any long-term ill effects. At very high concentrations sulfates are toxic to cattle. Problems caused by sulfates are most often related to their ability to form strong acids, which change the pH.
* Conductivity of water is important because it indicates how much dissolved substances, chemicals, and minerals are present in the water. Higher amounts of these impurities lead to a higher conductivity.
* Total Organic Carbon (TOC) is a measure of the total amount of carbon in organic compounds in pure water and aqueous systems. ... Unless it is ultrapure, water will naturally contain some organic compounds, so understanding how much is key.
* Trihalomethanes (THMs) are the result of a reaction between the chlorine used for disinfecting tap water and natural organic matter in the water. At elevated levels, THMs have been associated with negative health effects such as cancer and adverse reproductive outcomes.
* Turbidity is a measure of the relative clarity of a liquid. It is an optical characteristic of water: the amount of light that is scattered by material in the water when a light is shone through the water sample.
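
The thresholds quoted in the list above can be turned into small helper functions. The following is only a sketch: the function names are hypothetical, and assigning the shared boundary values (120 and 180 mg/L) to the lower band is an assumption, since the text leaves them ambiguous.

```python
def classify_hardness(caco3_mg_per_l: float) -> str:
    """Map a calcium carbonate concentration (mg/L) to the hardness bands listed above."""
    if caco3_mg_per_l < 60:
        return "soft"
    if caco3_mg_per_l <= 120:   # assumption: 120 itself counts as moderately hard
        return "moderately hard"
    if caco3_mg_per_l <= 180:   # assumption: 180 itself counts as hard
        return "hard"
    return "very hard"


def classify_ph(ph: float) -> str:
    """Label a pH reading as acidic, neutral or basic on the 0-14 scale."""
    if not 0 <= ph <= 14:
        raise ValueError("pH is expected to lie between 0 and 14")
    if ph < 7:
        return "acidic"
    if ph > 7:
        return "basic"
    return "neutral"


print(classify_hardness(150.0), classify_ph(6.2))
```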


<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Importing libraries and Loading Dataset</b></h1>

In [None]:
import warnings

import numpy as np
import pandas as pd
import missingno as msno
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from xgboost import XGBClassifier

warnings.filterwarnings("ignore")

In [None]:
data=pd.read_csv("../input/water-potability/water_potability.csv")
data.head()

In [None]:
data.info()

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Attributes with Missing values</b></h1>

In [None]:
#Plot to visualize the patterns in the missing values
msno.matrix(data)

In [None]:
#Percentage of missing values
columns=data.columns
print("% of Missing values")
for col in columns:
    if data[col].isnull().sum():
        # isnull().mean() is the fraction of missing rows; multiply by 100 for a percentage
        print("      ",col,": ",round(data[col].isnull().mean()*100,2))

In [None]:
data.describe()

In [None]:
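#The mean and median of ph, Sulfate and Trihalomethanes are nearly equal, so either can fill the missing values; the mean is used here
attributes=['ph','Sulfate','Trihalomethanes']
for col in attributes:
    # Assign the result back: fillna(inplace=True) on a column selection may not update the DataFrame in newer pandas versions
    data[col]=data[col].fillna(data[col].mean())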
data.info()
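
The reasoning in the cell above (mean and median are close, so either is a reasonable fill value) can be checked column by column before imputing. A minimal sketch on a made-up toy frame; `toy` and its values are hypothetical and stand in for the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "ph": [6.5, 7.0, np.nan, 7.5, 8.0],
    "Sulfate": [300.0, np.nan, 330.0, 340.0, 350.0],
})

for col in ["ph", "Sulfate"]:
    mean, median = toy[col].mean(), toy[col].median()
    # A small gap between mean and median suggests the column is not strongly skewed
    print(f"{col}: mean={mean:.2f}, median={median:.2f}")
    # Assign the filled column back instead of modifying it in place
    toy[col] = toy[col].fillna(mean)
```

If the two statistics differ noticeably for a column, the median is usually the more robust choice.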

In [None]:
values=[len(data[data['Potability']==0]),len(data[data['Potability']==1])]

# Figure Size
fig, ax = plt.subplots(figsize =(11, 11))

# Horizontal Bar Plot
ax.barh(['Non-Potable','Potable'],values)

# Remove axes spines
for s in ['top', 'bottom', 'left', 'right']:
    ax.spines[s].set_visible(False)

# Remove x, y Ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
 
# Add padding between axes and labels
ax.xaxis.set_tick_params(pad = 5)
ax.yaxis.set_tick_params(pad = 10)
ax.invert_yaxis()

# Add annotation to bars
for i in ax.patches:
    plt.text(i.get_width()+0.2, i.get_y()+0.5,
             str(round((i.get_width()), 2)),
             fontsize = 10, fontweight ='bold',
             color ='grey')
# Add Plot Title (the bars show sample counts, not ratios)
ax.set_title('Number of Potable and Non-Potable Samples',
             loc ='left', )

# Show Plot
plt.show()

In [None]:
data

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Over Sampling Using SMOTE</b></h1>

In [None]:

X = data.loc[:, data.columns != 'Potability']
y = data.loc[:, data.columns == 'Potability']

from imblearn.over_sampling import SMOTE
Over_sample = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns

Over_sample_data_X, Over_sample_data_y = Over_sample.fit_resample(X_train, y_train)
Over_sample_data_X = pd.DataFrame(data=Over_sample_data_X ,columns=columns )
Over_sample_data_y= pd.DataFrame(data=Over_sample_data_y,columns=['Potability'])

# Check the class counts after oversampling
print("Length of oversampled data is ",len(Over_sample_data_X))
print("Number of Non-Potable in oversampled data",len(Over_sample_data_y[Over_sample_data_y['Potability']==0]))

print("Number of Potable in oversampled data",len(Over_sample_data_y[Over_sample_data_y['Potability']==1]))

print("Proportion of non Potable data in oversampled data is ",len(Over_sample_data_y[Over_sample_data_y['Potability']==0])/len(Over_sample_data_X))
print("Proportion of Potable data in oversampled data is ",len(Over_sample_data_y[Over_sample_data_y['Potability']==1])/len(Over_sample_data_X))
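
SMOTE balances the classes by creating synthetic minority rows that lie on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation idea in NumPy; it is not imblearn's implementation, and `smote_like_samples`, its parameters and the toy points are all hypothetical.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating toward one of the k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))            # pick a random minority row
        neighbour = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                             # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic


minority_rows = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_rows = smote_like_samples(minority_rows, n_new=5)
print(new_rows.shape)
```

Because every synthetic row is an interpolation of two real rows, it always stays inside the range spanned by the minority class.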


In [None]:
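# Continue with the oversampled training portion only.
# Note: the later train/test split is drawn from these rows, so the test set will contain synthetic samples.
data=Over_sample_data_X
target=Over_sample_data_y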


<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Distribution of the Features</b></h1>

In [None]:
sns.set(style="dark")
rcParams['figure.figsize'] = 8,5
for col in data.columns:
    if col=='Potability':
        continue
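    # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
    sns.histplot(data[col],kde=True,stat="density",color='red')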
    plt.title("Distribution of {}".format(col))
    plt.show()

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Feature Scaling</b></h1>

In [None]:
#Normalizing the features to the range [0, 1] using sklearn's MinMaxScaler
min_max_scaler = preprocessing.MinMaxScaler(feature_range =(0, 1))
data=min_max_scaler.fit_transform(data)
data=pd.DataFrame(data)
print(data)
data.columns=columns
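
Min-max scaling maps each feature to [0, 1] with x' = (x - min) / (max - min), computed per column. The sketch below, on made-up arrays, checks `MinMaxScaler` against that formula and shows one common variant in which the minimum and maximum are learned from the training rows only and then reused for held-out rows. The arrays and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
held_out = np.array([[2.5, 40.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)    # learns per-column min and max from train
held_out_scaled = scaler.transform(held_out)  # reuses those statistics, no refitting

# The same transformation written out by hand
manual = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled)
print(held_out_scaled)
```

Held-out values outside the training range land outside [0, 1], which is expected when the scaler is not refitted.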

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Relationship Between the Features</b></h1>

In [None]:
rcParams['figure.figsize'] = 11,11
correlation=data.copy()
sns.heatmap(correlation.corr(),xticklabels=correlation.columns,yticklabels=correlation.columns,annot=True,cmap=sns.diverging_palette(220, 10, as_cmap=True),linewidths=2)
plt.title("Correlation Plot")
plt.show()

In [None]:
data

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Train Test Split</b></h1>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2,shuffle=True)
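
The split above is random on every run. A possible variant, shown here on made-up arrays, fixes `random_state` for reproducibility and passes `stratify` so that both parts keep the same class proportions. The toy data and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    shuffle=True,
    stratify=y_toy,    # keep the 50/50 class balance in both parts
    random_state=0,    # same split on every run
)

print(len(y_tr), len(y_te), y_te.sum())
```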

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
Accuracy_={}

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Train and Evaluate</b></h1>

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Logistic Regression</b></h1>

In [None]:
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
Accuracy_["Logistic Regression"]=logreg.score(X_test, y_test)*100
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

In [None]:
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predictions; label 0 (Non-Potable) comes first
cm = confusion_matrix(y_test, y_pred)
rcParams['figure.figsize'] = 10,5
sns.set(font_scale=1.5)
plt.yticks(va='center')
sns.heatmap(cm,xticklabels=['Non-Potable','Potable'],yticklabels=['Non-Potable','Potable'],annot=True,cmap='YlGnBu',linewidths=0.5,fmt='d',annot_kws={"size":15})
plt.title("Logistic Regression Confusion Matrix")
plt.show()
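
Accuracy alone hides how the errors are distributed, and the confusion matrix already contains what is needed for precision and recall. The sketch below uses made-up labels and relies on scikit-learn's layout: rows are true labels, columns are predictions, and labels are sorted, so class 0 (Non-Potable) comes first. The toy lists and variable names are hypothetical.

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()
precision = tp / (tp + fp)   # of the samples predicted Potable, how many really are
recall = tp / (tp + fn)      # of the truly Potable samples, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For a potability classifier, a false positive (unsafe water predicted as Potable) is arguably the costlier error, so precision on the Potable class is worth reporting next to accuracy.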

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>SVM Classifier</b></h1>

In [None]:
svc = svm.SVC(C=2)
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
Accuracy_["SVM Classifier"]=svc.score(X_test, y_test)*100
print('Accuracy of SVM classifier on test set: {:.2f}'.format(svc.score(X_test, y_test)))

In [None]:
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predictions; label 0 (Non-Potable) comes first
cm = confusion_matrix(y_test, y_pred)
rcParams['figure.figsize'] = 10,5
sns.set(font_scale=1.5)
plt.yticks(va='center')
sns.heatmap(cm,xticklabels=['Non-Potable','Potable'],yticklabels=['Non-Potable','Potable'],annot=True,cmap='YlGnBu',linewidths=0.5,fmt='d',annot_kws={"size":15})
plt.title("SVM Classifier Confusion Matrix")
plt.show()



<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b> Decision Tree</b></h1>

In [None]:
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
y_pred = DT.predict(X_test)
Accuracy_["Decision Tree"]=DT.score(X_test, y_test)*100
print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(DT.score(X_test, y_test)))

In [None]:
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predictions; label 0 (Non-Potable) comes first
cm = confusion_matrix(y_test, y_pred)
rcParams['figure.figsize'] = 10,5
sns.set(font_scale=1.5)
plt.yticks(va='center')
sns.heatmap(cm,xticklabels=['Non-Potable','Potable'],yticklabels=['Non-Potable','Potable'],annot=True,cmap='YlGnBu',linewidths=0.5,fmt='d',annot_kws={"size":15})
plt.title("Decision Tree Confusion Matrix")
plt.show()

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Random Forest</b></h1>

In [None]:
model_forest=RandomForestClassifier()
grid_forest=GridSearchCV(model_forest,param_grid={'max_depth':range(6,11)})
grid_forest.fit(X_train,y_train)
y_pred = grid_forest.predict(X_test)
Accuracy_["Random Forest"]=grid_forest.score(X_test, y_test)*100
print('Accuracy of Random Forest on test set: {:.2f}'.format(grid_forest.score(X_test, y_test)))
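
After a grid search it is useful to see which parameter value was selected and how it scored under cross-validation, not only the test accuracy. A minimal sketch on synthetic data; `make_classification`, the reduced grid and the `_demo` names are illustrative assumptions and not part of the notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the water data (9 features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

grid_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": range(2, 5)},
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# best_params_ holds the winning setting, best_score_ its mean cross-validated accuracy
print(grid_demo.best_params_, round(grid_demo.best_score_, 3))
```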

In [None]:
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predictions; label 0 (Non-Potable) comes first
cm = confusion_matrix(y_test, y_pred)
rcParams['figure.figsize'] = 10,5
sns.set(font_scale=1.5)
plt.yticks(va='center')
sns.heatmap(cm,xticklabels=['Non-Potable','Potable'],yticklabels=['Non-Potable','Potable'],annot=True,cmap='YlGnBu',linewidths=0.5,fmt='d',annot_kws={"size":15})
plt.title("Random Forest Confusion Matrix")
plt.show()

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>XGBoost Classifier</b></h1>

In [None]:
model_xgb=XGBClassifier(n_estimators=10)
model_xgb.fit(X_train,y_train)
y_pred = model_xgb.predict(X_test)
Accuracy_["XG Boost"]=model_xgb.score(X_test, y_test)*100
print('Accuracy of XGBoost on test set: {:.2f}'.format(model_xgb.score(X_test, y_test)))


In [None]:
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predictions; label 0 (Non-Potable) comes first
cm = confusion_matrix(y_test, y_pred)
rcParams['figure.figsize'] = 10,5
sns.set(font_scale=1.5)
plt.yticks(va='center')
sns.heatmap(cm,xticklabels=['Non-Potable','Potable'],yticklabels=['Non-Potable','Potable'],annot=True,cmap='YlGnBu',linewidths=0.5,fmt='d',annot_kws={"size":15})
plt.title("XG Boost Confusion Matrix")
plt.show()

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Ada Boost</b></h1>

In [None]:
model_adaboost=AdaBoostClassifier(n_estimators=70)
model_adaboost.fit(X_train,y_train)
y_pred = model_adaboost.predict(X_test)
Accuracy_["Ada Boost"]=model_adaboost.score(X_test, y_test)*100
print('Accuracy of Ada Boost on test set: {:.2f}'.format(model_adaboost.score(X_test, y_test)))


In [None]:
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predictions; label 0 (Non-Potable) comes first
cm = confusion_matrix(y_test, y_pred)
rcParams['figure.figsize'] = 10,5
sns.set(font_scale=1.5)
plt.yticks(va='center')
sns.heatmap(cm,xticklabels=['Non-Potable','Potable'],yticklabels=['Non-Potable','Potable'],annot=True,cmap='YlGnBu',linewidths=0.5,fmt='d',annot_kws={"size":15})
plt.title("Ada Boost Confusion Matrix")
plt.show()

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>KNN Classifier</b></h1>

In [None]:
model_KNN=KNeighborsClassifier()
grid_KNN=GridSearchCV(model_KNN,param_grid={'n_neighbors':range(4,12)})
grid_KNN.fit(X_train,y_train)
y_pred = grid_KNN.predict(X_test)
Accuracy_["KNN Classifier"]=grid_KNN.score(X_test, y_test)*100
print('Accuracy of KNN Classifier on test set: {:.2f}'.format(grid_KNN.score(X_test, y_test)))


In [None]:
from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predictions; label 0 (Non-Potable) comes first
cm = confusion_matrix(y_test, y_pred)
rcParams['figure.figsize'] = 10,5
sns.set(font_scale=1.5)
plt.yticks(va='center')
sns.heatmap(cm,xticklabels=['Non-Potable','Potable'],yticklabels=['Non-Potable','Potable'],annot=True,cmap='YlGnBu',linewidths=0.5,fmt='d',annot_kws={"size":15})
plt.title("KNN Classifier Confusion Matrix")
plt.show()

<h1 style="background-color:MediumSeaGreen;font-family:newtimeroman;font-size:500%;text-align:center;border-radius: 15px 50px;padding: 5px 5px 5px 5px">
    <b>Test Accuracy</b></h1>

In [None]:
print("{:<25} {:<15}".format('Model','Accuracy (%)'))
# Sort models from highest to lowest test accuracy
for k, v in sorted(Accuracy_.items(), key=lambda item: item[1], reverse=True):
    print("{:<25} {:<15.2f}".format(k,v))
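
The same ranking can also be produced with pandas, which takes care of sorting and rounding. A small sketch with made-up numbers; `accuracy_demo` and its values are hypothetical and are not results from this notebook.

```python
import pandas as pd

# Hypothetical accuracies in percent, standing in for the Accuracy_ dictionary
accuracy_demo = {"Model A": 61.234, "Model B": 70.5, "Model C": 65.0}

ranking = (
    pd.Series(accuracy_demo, name="Accuracy (%)")
    .sort_values(ascending=False)
    .round(2)
)
print(ranking)
```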
