# WATER QUALITY ANALYSIS
# Context

Access to safe drinking water is essential to health, a basic human right, and a component of effective policy for health protection. It is important as a health and development issue at the national, regional, and local levels. In some regions, investments in water supply and sanitation have been shown to yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.

# Data Description


| Features | Description |
|--|--|
|**1. pH value**|Should be between 6.5 and 8.5.|
|**2. Hardness**|Water containing calcium carbonate at concentrations below 60 mg/L is generally considered soft; 60–120 mg/L, moderately hard; 120–180 mg/L, hard; and more than 180 mg/L, very hard (McGowan, 2000).|
|**3. Solids (Total dissolved solids - TDS)**|An important parameter for water use. A high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L, and the maximum limit prescribed for drinking purposes is 1000 mg/L.|
|**4. Chloramines**|Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.|
|**5. Sulfate**|Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.|
|**6. Conductivity**|The EC value should not exceed 400 μS/cm.|
|**7. Organic_carbon**|According to the US EPA, total organic carbon (TOC) should be < 2 mg/L in treated drinking water and < 4 mg/L in source water used for treatment.|
|**8. Trihalomethanes**|THM levels up to 80 ppb (μg/L) are considered safe in drinking water.|
|**9. Turbidity**|The WHO recommended value for turbidity is 5.00 NTU. For comparison, the mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than this limit.|
|**10. Potability**|Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.|
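
The hardness bands in the table can be expressed as a small lookup. The following is a minimal sketch, not part of the original analysis: the function name `classify_hardness` is hypothetical, and because the table leaves the shared boundary values (120 and 180 mg/L) ambiguous, the sketch assumes each upper bound is inclusive.

```python
def classify_hardness(caco3_mg_per_l: float) -> str:
    """Map a calcium carbonate concentration (mg/L) to a hardness band.

    Assumption: upper bounds are inclusive, so 120 mg/L counts as
    'moderately hard' and 180 mg/L counts as 'hard'.
    """
    if caco3_mg_per_l < 60:
        return "soft"
    if caco3_mg_per_l <= 120:
        return "moderately hard"
    if caco3_mg_per_l <= 180:
        return "hard"
    return "very hard"


# Illustrative values only; these are not taken from the dataset.
for value in (45.0, 60.0, 150.0, 204.9):
    print(f"{value:>6.1f} mg/L -> {classify_hardness(value)}")
```

A function like this could be applied to the hardness column with `Series.apply` to see how the samples spread across the four bands.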

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

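# matplotlib 3.6 renamed the 'seaborn-ticks' style to 'seaborn-v0_8-ticks'; use whichever is available.
style_name = 'seaborn-v0_8-ticks' if 'seaborn-v0_8-ticks' in plt.style.available else 'seaborn-ticks'
plt.style.use(style_name)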

# Data Analysis

In [None]:
data = pd.read_csv("../input/water-potability/water_potability.csv")
data.head()

In [None]:
#shape
data.shape

In [None]:
# summary
data.describe()

In [None]:
data.info()

In [None]:
# Plot the distribution of the nine feature columns in a 3x3 grid.
fig, axs = plt.subplots(3, 3, figsize=(20, 20))
cols = np.array(data.columns[:9]).reshape(3, 3)


for i in range(3):
    for j in range(3):
        col = cols[i, j]
        # A skewness between -1 and 1 is treated here as acceptably symmetric.
        skewness = f"Skewness: {data[col].skew():.3f}"
        ax = sns.histplot(data[col], ax=axs[i, j])
        ax.set_title(skewness)
        
plt.show()
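
The plot titles report skewness per column, and the code comment treats values between -1 and 1 as acceptable. As a rough illustration of that rule of thumb, the sketch below lists the columns that fall outside the range. The helper name `flag_skewed_columns` and the toy data are assumptions made for this example; they are not part of the original notebook.

```python
import pandas as pd


def flag_skewed_columns(df: pd.DataFrame, limit: float = 1.0) -> list:
    """Return the names of numeric columns whose |skewness| exceeds `limit`."""
    skew = df.skew(numeric_only=True)
    return skew[skew.abs() > limit].index.tolist()


# Toy data (hypothetical): one symmetric column, one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 1, 10],
})

flagged = flag_skewed_columns(toy)
print("Columns outside [-1, 1]:", flagged)
```

On the real data, the same call would be `flag_skewed_columns(data.iloc[:, :9])`.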

In [None]:
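# Fill missing feature values with the column mean.
from sklearn.impute import SimpleImputer

# Features are every column up to and including 'Turbidity'; the target is 'Potability'.
X = data.loc[:, :'Turbidity']
y = data['Potability']

# Create a SimpleImputer that replaces NaNs with the mean of each column.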
imputer = SimpleImputer(strategy='mean')
imputer.fit(X)
x_trans = imputer.transform(X)
x_trans = pd.DataFrame(x_trans, columns=X.columns)
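
The cell above fits the imputer on the full dataset before any train/test split, so the column means include rows that later end up in the test set. One common alternative, sketched here on toy data, is to fit the imputer on the training rows only and reuse those means for the test rows. This is a different technique from the one the notebook uses, and the names `toy`, `X_tr`, and `X_te` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature table with a few missing values (hypothetical data).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0],
    "b": [10.0, 20.0, np.nan, 40.0, 50.0, 60.0, 70.0, 80.0],
})

# Split first, so the test rows never influence the imputation statistics.
X_tr, X_te = train_test_split(toy, train_size=0.75, random_state=0)

imputer = SimpleImputer(strategy="mean")
# Learn the column means from the training rows only.
X_tr_filled = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=toy.columns, index=X_tr.index
)
# Reuse the training means to fill gaps in the test rows.
X_te_filled = pd.DataFrame(
    imputer.transform(X_te), columns=toy.columns, index=X_te.index
)

print("Means learned from the training rows:", imputer.statistics_)
```

The same pattern extends to the scaler and feature selector; wrapping the steps in a scikit-learn `Pipeline` is one way to keep them tied to the training split.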

In [None]:
# Select features with recursive feature elimination (RFE).
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier


# Define RFE to keep the five highest-ranked features.
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
# Fit RFE
rfe.fit(x_trans, y)

# rfe.support_ is a boolean mask over the columns; keep only the selected ones.
feature_cols = x_trans.columns[rfe.support_]
x_trans = x_trans[feature_cols]
x_trans.head()
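
To make the RFE output easier to read, the sketch below runs the same selector on synthetic data and prints both `support_` (the boolean mask of kept features) and `ranking_` (1 for kept features, larger numbers for features dropped earlier). The synthetic data from `make_classification` and the fixed `random_state` values are assumptions for reproducibility; the notebook itself uses an unseeded `DecisionTreeClassifier`, so its selection may vary between runs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the nine water-quality features (hypothetical data).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=9, n_informative=4, n_redundant=0, random_state=0
)

# Seeding the estimator keeps the selection stable across runs.
rfe_toy = RFE(
    estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5
)
rfe_toy.fit(X_toy, y_toy)

selected = [i for i, keep in enumerate(rfe_toy.support_) if keep]
print("Selected column indices:", selected)
print("Ranking (1 = selected):", rfe_toy.ranking_.tolist())
```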

In [None]:
# Normalize the selected features to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

# create an object of scaler
scale = MinMaxScaler()
# fit scaler
scale.fit(x_trans)
scaled_x = scale.transform(x_trans)
scaled_x = pd.DataFrame(scaled_x, columns = x_trans.columns)
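
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column as `(x - min) / (max - min)`. The short sketch below checks that formula against the scaler on a toy array; the array values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix with two columns on very different scales (hypothetical values).
X_toy = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [4.0, 1000.0],
])

# Manual min-max scaling, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation via scikit-learn.
with_sklearn = MinMaxScaler().fit_transform(X_toy)

print(manual)
print("Matches MinMaxScaler:", np.allclose(manual, with_sklearn))
```

Each column ends up with a minimum of 0 and a maximum of 1, which is what `scaled_x.describe()` should show for the real data.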

In [None]:
scaled_x.describe()

In [None]:
scaled_x.isnull().sum()

In [None]:
# Building a model
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_curve, roc_auc_score

x_train, x_test, y_train, y_test = train_test_split(scaled_x, y, train_size=0.7, random_state=123)

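def predict_model(model):
    """Fit the model on the training split and return its test-set classification report."""
    # fit model
    model.fit(x_train, y_train)
    pred_y = model.predict(x_test)

    # report overall accuracy alongside the per-class metrics
    score = accuracy_score(y_test, pred_y)
    print(f"Test accuracy: {score:.3f}")
    return classification_report(y_test, pred_y)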
    

LRclf = LogisticRegression(solver="liblinear")
RFclf = RandomForestClassifier(max_depth = 5, n_estimators = 500)
DTclf = DecisionTreeClassifier()

print("Classification Report: \n", predict_model(RFclf))

## plotting roc-curve

kfold = model_selection.KFold(n_splits=10, random_state=7, shuffle=True)
scoring = 'roc_auc'
results = model_selection.cross_val_score(RFclf, scaled_x, y, cv=kfold, scoring=scoring)
# The scoring metric is ROC AUC, not accuracy, so label the output accordingly.
print("ROC AUC: %.3f (%.3f)" % (results.mean(), results.std()))


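# No-skill baseline: a constant score of 0 for every test sample
neg_prob = [0 for _ in range(len(y_test))]

# Predicted class probabilities from the fitted RandomForestClassifier
rf_prob = RFclf.predict_proba(x_test)

# Keep the probabilities of the positive class, i.e. '1'
rf_prob = rf_prob[:, 1]

# Compute ROC AUC for the no-skill baseline and for the random forest
neg_test_score = roc_auc_score(y_test, neg_prob)
rf_test_score = roc_auc_score(y_test, rf_prob)
print("No skill: ROC AUC = %.3f" % neg_test_score)
print("Random forest: ROC AUC = %.3f" % rf_test_score)

# Calculate ROC curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, neg_prob)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_prob)

# Plot ROC curves
plt.plot(ns_fpr, ns_tpr, linestyle='--', color='r', label='No skill')
plt.plot(rf_fpr, rf_tpr, label='Random forest')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()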
plt.show()
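
The dashed line in the plot comes from scoring every sample with the same constant, which carries no ranking information. The toy example below, with made-up labels and scores, shows why that baseline has an ROC AUC of 0.5 and traces the diagonal from (0, 0) to (1, 1).

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores (hypothetical values, not from the dataset).
y_true = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
constant_scores = [0, 0, 0, 0]

# A constant score cannot separate the classes, so its AUC is 0.5.
baseline_auc = roc_auc_score(y_true, constant_scores)
# The informative scores rank 3 of the 4 positive/negative pairs correctly.
model_auc = roc_auc_score(y_true, model_scores)

# The baseline ROC curve has only its two end points.
fpr, tpr, _ = roc_curve(y_true, constant_scores)

print("Baseline AUC:", baseline_auc)
print("Model AUC:", model_auc)
print("Baseline curve points:", list(zip(fpr, tpr)))
```

A model curve that sits above the diagonal, with an AUC above 0.5, ranks potable samples higher than non-potable ones more often than chance would.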
