<h1 style="background-color:#8ff080;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b><u>Drinking Water Quality Check</u></b></h1>


<center><img src="https://media.istockphoto.com/photos/glass-of-water-on-white-background-picture-id1161576130?k=6&m=1161576130&s=612x612&w=0&h=d0xqvms6VETXkvNyizAbgcSY0z_wmaw1-SG2TXdvD3M=" height="300" width="600"></center>

<h1 style="background-color:#f45123;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>Overview Of Dataset</b></h1>


Access to safe drinking-water is essential to health, a basic human right and a component of effective policy for health protection. This is important as a health and development issue at a national, regional and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.

### About the columns of dataset

1. pH value:
pH is an important parameter for evaluating the acid–base balance of water. The values in the current investigation ranged from 6.52 to 6.83, which is within the range of WHO standards.

2. Hardness:
Hardness is mainly caused by calcium and magnesium salts. It was originally defined as the capacity of water to precipitate soap, an effect caused by calcium and magnesium.

3. Solids (Total dissolved solids - TDS):
Water can dissolve a wide range of inorganic and some organic minerals or salts, such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, and sulfates. The desirable limit for TDS is 500 mg/L and the maximum limit prescribed for drinking purposes is 1000 mg/L.

4. Chloramines:
Chlorine and chloramine are the major disinfectants used in public water systems. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.

5. Sulfate:
Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food.

6. Conductivity:
Pure water is not a good conductor of electric current; rather, it is a good insulator. According to WHO standards, the EC value should not exceed 400 μS/cm.

7. Organic_carbon:
Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be below 2 mg/L in treated drinking water and below 4 mg/L in source water that is used for treatment.

8. Trihalomethanes:
THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 parts per billion (ppb) are considered safe in drinking water.

9. Turbidity:
The turbidity of water depends on the quantity of solid matter present in the suspended state. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

10. Potability:
Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
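
The limits quoted in the column descriptions can be collected into a small lookup table and used to flag individual measurements. The sketch below is only an illustration: the dictionary keys are assumed column names, the pH range of 6.5 to 8.5 is the WHO recommendation and is added here as an assumption (it is not stated above), the chlorine limit is applied to the Chloramines column as a simplification, and the sample values are made up.

```python
# Guideline limits taken from the column descriptions above.
# Each entry is (low, high); None means no bound is quoted on that side.
# Keys are assumed column names; units are assumed to match the dataset.
LIMITS = {
    "ph": (6.5, 8.5),                 # assumption: WHO recommended range
    "Solids": (None, 1000.0),         # maximum TDS
    "Chloramines": (None, 4.0),       # chlorine limit, applied here as a simplification
    "Conductivity": (None, 400.0),    # WHO EC limit
    "Organic_carbon": (None, 2.0),    # US EPA value for treated water
    "Trihalomethanes": (None, 80.0),  # value quoted above
    "Turbidity": (None, 5.0),         # WHO recommended value
}

def out_of_range(sample):
    """Return the names of measurements that fall outside the quoted limits."""
    flagged = []
    for name, (low, high) in LIMITS.items():
        value = sample.get(name)
        if value is None:
            continue  # measurement not provided for this sample
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(name)
    return flagged

# Hypothetical sample, for illustration only.
example = {"ph": 7.1, "Solids": 20791.3, "Chloramines": 7.3, "Turbidity": 2.96}
print(out_of_range(example))
```

A check like this is a sanity tool rather than a model: it only compares each value to a fixed threshold, whereas the classifiers later in the notebook learn from all columns together.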

<h1 style="background-color:#f45123;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>Importing Libraries And Loading Dataset</b></h1>


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import AdaBoostClassifier,RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import xgboost
from xgboost import XGBClassifier
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
data=pd.read_csv('../input/water-potability/water_potability.csv')
data.head()

In [None]:
data.info()

### Checking percentage of missing values in dataset 

In [None]:
for col in data.columns:
    p=(data[col].isnull().sum()/len(data))*100
    print('The column {0} has {1} percent NaN values'.format(col,p.round(2)))
    print()
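
The same percentages can also be computed without a loop. This is a minimal sketch on a small made-up frame (the column names and values are placeholders, not the real data): `isnull()` returns a boolean frame, and the mean of a boolean column is the fraction of missing entries.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the water-potability data.
demo = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [np.nan, np.nan, 330.0, 310.0],
    "Potability": [0, 1, 0, 1],
})

# Fraction of missing entries per column, expressed as a percentage.
missing_pct = (demo.isnull().mean() * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```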

In [None]:
# data.drop(['Sulfate'],axis=1,inplace=True)

### Replacing missing values with the mean of the respective column

In [None]:
def replace_nan_by_mean(info):
    for col in info.columns:
        # assign the result back; inplace=True on a selected column may not update the frame in recent pandas
        info[col]=info[col].fillna(info[col].mean())
    return info
data=replace_nan_by_mean(data)
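
Note that the cell above computes the means on the full dataset, before the train/test split that follows, so test rows contribute to the fill values. One possible alternative, sketched here on made-up frames, is to learn the means from the training rows only with scikit-learn's `SimpleImputer` and reuse them on the test rows. The frames and column values below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train and test frames with missing values.
train = pd.DataFrame({"ph": [6.0, 8.0, np.nan], "Sulfate": [300.0, np.nan, 340.0]})
test = pd.DataFrame({"ph": [np.nan, 7.5], "Sulfate": [np.nan, 310.0]})

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training rows only ...
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# ... and transform reuses those training means for the test rows.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(test_filled)
```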

In [None]:
data.describe()

In [None]:
data.info()

<h1 style="background-color:#f45123;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>Train Test Splitting Of Data</b></h1>


In [None]:
train_data,test_data=train_test_split(data,test_size=0.2,random_state=42)
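
If the two classes are not equally frequent, a plain random split can leave the train and test sets with slightly different class proportions. The `stratify` argument of `train_test_split` keeps them aligned. The sketch below uses a made-up frame with a 60/40 class mix; the real proportions in the dataset are not assumed here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one feature and a 60/40 target split.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Turbidity": rng.normal(4.0, 0.8, size=200),
    "Potability": [0] * 120 + [1] * 80,
})

# stratify keeps the share of each class (almost) identical in both parts.
train_part, test_part = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Potability"]
)

print("train share of class 1:", train_part["Potability"].mean())
print("test share of class 1:", test_part["Potability"].mean())
```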

<h1 style="background-color:#f45123;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>Data Visualizations</b></h1>


In [None]:
plt.figure(figsize=(15,12))
sns.heatmap(train_data.corr(),annot=True,vmin=-1)
plt.show()

In [None]:
# pairplot builds its own figure, so a separate plt.figure() call would only create an empty one
sns.pairplot(train_data)
plt.show()

In [None]:
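plt.figure(figsize=(20,20))
# plot all nine feature columns (the last column, Potability, is the target)
for i in range(9):
    plt.subplot(3,3,i+1)
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(train_data[train_data.columns[i]],kde=True,stat='density')
    plt.title(train_data.columns[i],fontdict={'size':20,'weight':'bold'},pad=3)
plt.show()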

In [None]:
train_inp=train_data.iloc[:,:9]
train_out=train_data.iloc[:,9]
test_inp=test_data.iloc[:,:9]
test_out=test_data.iloc[:,9]
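
The cell above selects features and target by position, which silently breaks if the column order changes. A name-based selection is a possible alternative. This sketch uses a tiny made-up frame and assumes the target column is called `Potability`.

```python
import pandas as pd

# Hypothetical frame with two feature columns and the target.
demo = pd.DataFrame({"ph": [7.0, 6.8], "Turbidity": [3.1, 4.2], "Potability": [0, 1]})

# Everything except the target is a feature; the target is picked by name.
features = demo.drop(columns=["Potability"])
target = demo["Potability"]

print(list(features.columns))
```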

<h1 style="background-color:#f45123;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>Scaling Data</b></h1>


In [None]:
scaler=MinMaxScaler()
train_x_std=scaler.fit_transform(train_inp)

In [None]:
test_x_std=scaler.transform(test_inp)
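
`MinMaxScaler` rescales each column with `(x - min) / (max - min)`, where `min` and `max` come from the data passed to `fit`. Because they are taken from the training data, test values can land outside the 0 to 1 range. The arrays below are made up to show this by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test values for two columns.
train = np.array([[6.0, 100.0], [7.0, 200.0], [8.0, 300.0]])
test = np.array([[9.0, 150.0]])

# Manual min-max scaling using statistics from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual_test = (test - col_min) / (col_max - col_min)

# The scaler fitted on the training data gives the same result.
scaler_demo = MinMaxScaler().fit(train)
sklearn_test = scaler_demo.transform(test)

print(manual_test)   # the first value exceeds 1 because 9.0 is above the training maximum
print(sklearn_test)
```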

In [None]:
models_scores=pd.DataFrame()

<h1 style="background-color:#f45123;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>Models Evaluation</b></h1>

<h4 style="background-color:#f45999;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>Logistic Regression</b></h4>

In [None]:
model_log=LogisticRegression(C=0.1)
model_log.fit(train_x_std,train_out)

In [None]:
log_acc=accuracy_score(test_out,model_log.predict(test_x_std))
model_acc=pd.DataFrame({'Model name':['Logistic Regression'],'Accuracy':[log_acc]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
models_scores=pd.concat([models_scores,model_acc],ignore_index=True)

In [None]:
print('Train report of Logistic Regression \n',classification_report(train_out,model_log.predict(train_x_std)))
print('Test report of Logistic Regression \n',classification_report(test_out,model_log.predict(test_x_std)))

### From the classification report above, the overall accuracy looks acceptable at about 63%, but it is quite clear that the model performs very poorly at predicting class 1. It is therefore the weakest model and we should discard it.
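
Accuracy on its own can hide this kind of failure. A quick way to see why is to compare against a baseline that always predicts the most frequent class. The sketch below uses made-up labels with a 60/40 mix (not the real class counts) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 60 samples of class 0 and 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

baseline_acc = accuracy_score(y, pred)
baseline_recall_1 = recall_score(y, pred, pos_label=1)
print("accuracy:", baseline_acc)
print("recall for class 1:", baseline_recall_1)
```

In this toy case the baseline reaches 60% accuracy while never detecting class 1, so a real model's accuracy is worth reading next to its per-class recall.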

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(confusion_matrix(test_out,model_log.predict(test_x_std)),annot=True)
plt.title('Confusion matrix of test data',fontdict={'size':22,'weight':'bold'})
plt.xlabel('Predicted value')
plt.ylabel('Actual value')
plt.show()
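
The numbers in the classification report can be read directly off a confusion matrix. In scikit-learn's layout, rows are actual classes and columns are predicted classes. A small worked example on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # of the samples predicted as 1, how many really are 1
recall_1 = tp / (tp + fn)     # of the samples that really are 1, how many were found

print(cm)
print("precision:", precision_1, "recall:", recall_1)
```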

<h4 style="background-color:#f45999;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>DecisionTree Classifier</b></h4>

In [None]:
model_tree=DecisionTreeClassifier()
grid_tree=GridSearchCV(model_tree,param_grid={'max_depth':range(6,11)})
grid_tree.fit(train_inp,train_out)

In [None]:
tree_acc=accuracy_score(test_out,grid_tree.predict(test_inp))
model_acc=pd.DataFrame({'Model name':['Decision Tree classifier'],'Accuracy':[tree_acc]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
models_scores=pd.concat([models_scores,model_acc],ignore_index=True)
grid_tree.best_params_
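
`GridSearchCV` fits one model per parameter value and per cross-validation fold (5 folds by default) and keeps the value with the best mean validation score. Looking at `cv_results_` shows how close the candidates are, which `best_params_` alone does not reveal. This is a sketch on synthetic data from `make_classification`, not on the water dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with nine features.
X, y = make_classification(n_samples=300, n_features=9, random_state=42)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": range(6, 11)},
    cv=5,                # the default, written out for clarity
    scoring="accuracy",
)
grid.fit(X, y)

# One row per candidate depth, with mean and spread of the fold scores.
results = pd.DataFrame(grid.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```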

In [None]:
print('Train report of DecisionTreeClassifier \n',classification_report(train_out,grid_tree.predict(train_inp)))
print('Test report of DecisionTreeClassifier \n',classification_report(test_out,grid_tree.predict(test_inp)))

In [None]:

plt.figure(figsize=(10,8))
# the tree was fitted on the unscaled inputs, so predict on test_inp rather than test_x_std
sns.heatmap(confusion_matrix(test_out,grid_tree.predict(test_inp)),annot=True)
plt.title('Confusion matrix of test data',fontdict={'size':22,'weight':'bold'})
plt.xlabel('Predicted value')
plt.ylabel('Actual value')
plt.show()

<h4 style="background-color:#f45999;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>RandomForest Classifier</b></h4>

In [None]:

model_forest=RandomForestClassifier()
grid_forest=GridSearchCV(model_forest,param_grid={'max_depth':range(6,11)})
grid_forest.fit(train_inp,train_out)

In [None]:
forest_acc=accuracy_score(test_out,grid_forest.predict(test_inp))
model_acc=pd.DataFrame({'Model name':['Random Forest Classifier'],'Accuracy':[forest_acc]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
models_scores=pd.concat([models_scores,model_acc],ignore_index=True)
print('best param',grid_forest.best_params_)
print('best score',grid_forest.best_score_)

In [None]:
print('Train report of RandomForestClassifier \n',classification_report(train_out,grid_forest.predict(train_inp)))
print('Test report of RandomForestClassifier \n',classification_report(test_out,grid_forest.predict(test_inp)))

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(confusion_matrix(test_out,grid_forest.predict(test_inp)),annot=True)
plt.title('Confusion matrix of test data',fontdict={'size':22,'weight':'bold'})
plt.xlabel('Predicted value')
plt.ylabel('Actual value')
plt.show()

<h4 style="background-color:#f45999;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>XGBoost Classifier</b></h4>

In [None]:
model_xgb=XGBClassifier(n_estimators=10)
# grid_xgb=GridSearchCV(model_xgb,param_grid={'n_estimators':[25,50,75,100]})
model_xgb.fit(train_x_std,train_out)

In [None]:
xgb_acc=accuracy_score(test_out,model_xgb.predict(test_x_std))
model_acc=pd.DataFrame({'Model name':['XGBoost'],'Accuracy':[xgb_acc]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
models_scores=pd.concat([models_scores,model_acc],ignore_index=True)

In [None]:
print('Train report of XGBClassifier \n',classification_report(train_out,model_xgb.predict(train_x_std)))
print('Test report of XGBClassifier \n',classification_report(test_out,model_xgb.predict(test_x_std)))

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(confusion_matrix(test_out,model_xgb.predict(test_x_std)),annot=True)
plt.title('Confusion matrix of test data',fontdict={'size':22,'weight':'bold'})
plt.xlabel('Predicted value')
plt.ylabel('Actual value')
plt.show()

<h4 style="background-color:#f45999;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>KNeighbors Classifier</b></h4>

In [None]:
model_neighbor=KNeighborsClassifier()
grid_neighbor=GridSearchCV(model_neighbor,param_grid={'n_neighbors':range(4,12)})
grid_neighbor.fit(train_x_std,train_out)

In [None]:
neighbors_acc=accuracy_score(test_out,grid_neighbor.predict(test_x_std))
model_acc=pd.DataFrame({'Model name':['KNeighborsClassifier'],'Accuracy':[neighbors_acc]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
models_scores=pd.concat([models_scores,model_acc],ignore_index=True)
grid_neighbor.best_params_

In [None]:
print('Train report of KneighborsClassifier \n',classification_report(train_out,grid_neighbor.predict(train_x_std)))
print('Test report of KneighborsClassifier \n',classification_report(test_out,grid_neighbor.predict(test_x_std)))

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(confusion_matrix(test_out,grid_neighbor.predict(test_x_std)),annot=True)
plt.title('Confusion matrix of test data',fontdict={'size':22,'weight':'bold'})
plt.xlabel('Predicted value')
plt.ylabel('Actual value')
plt.show()

<h4 style="background-color:#f45999;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>Support Vector Classifier</b></h4>

In [None]:
model_svc=SVC(C=2)
model_svc.fit(train_x_std,train_out)

In [None]:
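# score the SVC itself (the original line evaluated grid_neighbor by mistake)
svc_acc=accuracy_score(test_out,model_svc.predict(test_x_std))
model_acc=pd.DataFrame({'Model name':['SVC'],'Accuracy':[svc_acc]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
models_scores=pd.concat([models_scores,model_acc],ignore_index=True)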

In [None]:
print('Train report of SVClassifier \n',classification_report(train_out,model_svc.predict(train_x_std)))
print('Test report of SVClassifier \n',classification_report(test_out,model_svc.predict(test_x_std)))

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(confusion_matrix(test_out,model_svc.predict(test_x_std)),annot=True)
plt.title('Confusion matrix of test data',fontdict={'size':22,'weight':'bold'})
plt.xlabel('Predicted value')
plt.ylabel('Actual value')
plt.show()

<h4 style="background-color:#f45999;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b>AdaBoost Classifier</b></h4>

In [None]:
model_adaboost=AdaBoostClassifier(n_estimators=70)
model_adaboost.fit(train_x_std,train_out)

In [None]:
adaboost_acc=accuracy_score(test_out,model_adaboost.predict(test_x_std))
model_acc=pd.DataFrame({'Model name':['Adaboost'],'Accuracy':[adaboost_acc]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
models_scores=pd.concat([models_scores,model_acc],ignore_index=True)

In [None]:
print('Train report of AdaboostClassifier \n',classification_report(train_out,model_adaboost.predict(train_x_std)))
print('Test report of AdaboostClassifier \n',classification_report(test_out,model_adaboost.predict(test_x_std)))

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(confusion_matrix(test_out,model_adaboost.predict(test_x_std)),annot=True)
plt.title('Confusion matrix of test data',fontdict={'size':22,'weight':'bold'})
plt.xlabel('Predicted value')
plt.ylabel('Actual value')
plt.show()

<h1 style="background-color:#f45123;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 10px;padding: 5px"><b><u>Models Test Accuracy</u></b></h1>

In [None]:
models_scores.sort_values(by=['Accuracy'],ascending=False,ignore_index=True)
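
A score table like this can also be built in one step from a dictionary, instead of growing a DataFrame one row at a time. The model names and accuracies below are placeholders, not results from this notebook.

```python
import pandas as pd

# Hypothetical accuracies, used only to show the pattern.
scores = {"Model A": 0.62, "Model B": 0.68, "Model C": 0.65}

# Build the table once, then sort from best to worst.
table = (
    pd.DataFrame(list(scores.items()), columns=["Model name", "Accuracy"])
    .sort_values(by="Accuracy", ascending=False, ignore_index=True)
)
print(table)
```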

### I hope you enjoyed this notebook. If you found it useful, please upvote it!
### If you have any questions or suggestions, feel free to ask in the comments section.