<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Introduction</h3>

![](http://cdn.shopify.com/s/files/1/0079/0082/3665/articles/water_1200x1200.jpg?v=1579072340)

Fresh water is the primary source of human health, prosperity, and security. By around 2050 the world's population is expected to reach about nine billion. Assuming that standards of living continue to rise, the requirement of potable water for human consumption will amount to the resources of about three planet Earths. A key United Nations report indicates that water shortages will affect 2.3 billion people or 30% of the world's population in four dozen nations by 2025. Already, the crisis of potable water in most developing countries is creating public health emergencies of staggering proportions. In Bangladesh, for example, it is officially recognized by the government of Bangladesh that 50% of the country's approximately 150 million people, are at risk of arsenic poisoning from groundwater used for drinking. Recently, the government of Bangladesh, in its Action Plan for Poverty Reduction, stated its desire to ensure 100% access to pure drinking water across the region within the shortest possible time frame [3]. This is also consistent with key goals of the Millennium Development Goal “Eradication of extreme poverty and hunger” and “Halving by 2015, the proportion of people without sustainable access to safe drinking water”. Whether this is achievable within the stated time is debatable, but it clearly delineates the state of the world we live in. - Abul Hussam, in Monitoring Water Quality, 2013


This notebook explores the features related to water potability and then models and predicts potability from them.
We will analyze in depth what separates potable from non-potable water, using traditional statistics, Bayesian inference, and machine learning approaches that help us uncover the underlying process.


<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Libraries and Utilities</h3>


In [None]:
import os               
import numpy                   as np
import pandas                  as pd 
import matplotlib.pyplot       as plt
import seaborn                 as sns
import plotly.express          as ex
import plotly.graph_objs       as go
import plotly.offline          as pyo
import scipy.stats             as stats
import pymc3                   as pm
import theano.tensor           as tt
from plotly.subplots           import make_subplots
from sklearn.preprocessing     import StandardScaler
from sklearn.decomposition     import TruncatedSVD,PCA
from sklearn.ensemble          import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.tree              import DecisionTreeClassifier
from sklearn.linear_model      import LinearRegression,LogisticRegressionCV
from sklearn.svm               import SVC
from sklearn.metrics           import mean_squared_error,r2_score
from sklearn.pipeline          import Pipeline
from sklearn.model_selection   import cross_val_score,train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.manifold          import Isomap,TSNE
from sklearn.feature_selection import mutual_info_classif
from tqdm.notebook             import tqdm
from scipy.stats               import ttest_ind

#%pip install tune_sklearn
#from tune_sklearn              import TuneGridSearchCV


sns.set_style('darkgrid')
pyo.init_notebook_mode()
%matplotlib inline


plt.rc('figure',figsize=(18,11))
sns.set_context('paper',font_scale=2)

<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Loading the Data and Imputing Missing Values</h3>


In [None]:
water_df = pd.read_csv('/kaggle/input/water-potability/water_potability.csv')
water_df.head(4)

In [None]:
plt.title('Missing Values Per Feature')
nans = water_df.isna().sum().sort_values(ascending=False).to_frame()
sns.heatmap(nans,annot=True,fmt='d',cmap='vlag')

**Action**: We will impute the missing values in our data using the mean of the matching label, i.e., every missing value in a "potable" sample is replaced with the mean of all non-missing "potable" samples for that feature, and the same is done for "non-potable" samples with missing values.
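
As a minimal sketch of this label-matched imputation, the toy frame below (hypothetical values, not taken from the dataset) fills each missing `ph` with the mean of its own `Potability` group via `groupby(...).transform('mean')`. It is an alternative formulation of the same idea, not necessarily the cell used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing ph per label
toy = pd.DataFrame({
    "ph":         [6.0, np.nan, 8.0, 7.0, np.nan, 9.0],
    "Potability": [0,   0,      0,   1,   1,      1],
})

# Mean of the non-missing values within each label, broadcast back to every row
group_mean = toy.groupby("Potability")["ph"].transform("mean")

# Replace only the missing entries with the mean of their own label
toy["ph_imputed"] = toy["ph"].fillna(group_mean)
print(toy)
```

The missing value in the label-0 group is filled from the label-0 rows only, and likewise for label 1, so the two groups never share information.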

In [None]:
# Impute Missing Values with Label Matching Mean
for col in ['Sulfate','ph','Trihalomethanes']:
    missing_label_0 = water_df.query('Potability == 0')[col][water_df[col].isna()].index
    water_df.loc[missing_label_0,col] = water_df.query('Potability == 0')[col][water_df[col].notna()].mean()

    missing_label_1 = water_df.query('Potability == 1')[col][water_df[col].isna()].index
    water_df.loc[missing_label_1,col] = water_df.query('Potability == 1')[col][water_df[col].notna()].mean()


<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Exploratory Data Analysis</h3>



<h3 style="background-color:orange;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;">General  Data Analysis</h3>



In [None]:
T = water_df.copy()
T.Potability =  T.Potability.map({1:'Potable',0:'Not Potable'})
ex.pie(T,names='Potability',title='Distribution of Target Labels (Drinkability)')

**Observation**: Our data is somewhat imbalanced; we will not apply any upsampling/downsampling, as the proportions are closer to equal than to extremely imbalanced (cases like 90% / 10%, where resampling is crucial).
Also, the more critical label ("Not Potable") is the one with more samples; logically, we would prefer a model that errs toward false negatives (potable water flagged as not potable) rather than false positives (non-potable water flagged as potable).
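
To make the false positive / false negative distinction concrete, here is a small sketch on hypothetical labels (not model output from this notebook) that reads both counts off a confusion matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = potable, 0 = not potable
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives (unsafe water passed as potable): {fp}")
print(f"false negatives (safe water rejected): {fn}")

# Precision penalizes false positives, recall penalizes false negatives
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Under the preference stated above, a model with high precision on the "Potable" class matters more than one with high recall.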

In [None]:
fig = make_subplots(rows=3, cols=1,shared_xaxes=True,subplot_titles=('Pearson Correlation',  'Spearman Correlation','Kendall Correlation'))
colorscale=     [[1.0              , "rgb(165,0,38)"],
                [0.8888888888888888, "rgb(215,48,39)"],
                [0.7777777777777778, "rgb(244,109,67)"],
                [0.6666666666666666, "rgb(253,174,97)"],
                [0.5555555555555556, "rgb(254,224,144)"],
                [0.4444444444444444, "rgb(224,243,248)"],
                [0.3333333333333333, "rgb(171,217,233)"],
                [0.2222222222222222, "rgb(116,173,209)"],
                [0.1111111111111111, "rgb(69,117,180)"],
                [0.0               , "rgb(49,54,149)"]]

s_val =water_df.corr('pearson')
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,name='pearson',showscale=False,xgap=1,ygap=1,colorscale=colorscale),
    row=1, col=1
)


s_val =water_df.corr('spearman')
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,xgap=1,ygap=1,colorscale=colorscale),
    row=2, col=1
)

s_val =water_df.corr('kendall')
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,xgap=1,ygap=1,colorscale=colorscale,showscale=False),
    row=3, col=1
)

fig.update_layout(height=700, width=900, title_text="Different Inner Correlations Coefficients")
fig.show()

**Observation**: There appears to be no linear or rank correlation between our output label and our features. With a binary label and continuous features, traditional correlation coefficients may not tell us the full story about the relationships between our features and the target variable.
Later in this notebook, we will perform a more in-depth analysis to try to uncover some of the relationships hidden in our data.
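
For a binary label, the Pearson coefficient reduces to the point-biserial correlation, which only reflects how far apart the two class means are relative to the overall spread. The sketch below uses synthetic data (an assumption, not the water dataset) to show that the two coincide.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

rng = np.random.default_rng(0)

# Synthetic binary label and a continuous feature whose mean shifts slightly with the label
label = rng.integers(0, 2, size=1000)
feature = rng.normal(loc=0.2 * label, scale=1.0)

r_pearson, _ = pearsonr(label, feature)
r_pb, _ = pointbiserialr(label, feature)

# Both routines compute the same quantity when one variable is binary
print(f"Pearson r = {r_pearson:.4f}, point-biserial r = {r_pb:.4f}")
```

A feature can therefore relate to the label through its shape or spread and still show a coefficient near zero, as long as the class means are close.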

In [None]:
non_potabale = water_df.query('Potability == 0')
potabale     = water_df.query('Potability == 1')

for ax,col in enumerate(water_df.columns[:9]):
    plt.subplot(3,3,ax+1)
    plt.title(f'Distribution of {col}')
    sns.kdeplot(x=non_potabale[col],label='Non Potable')
    sns.kdeplot(x=potabale[col],label='Potable')
    plt.legend(prop=dict(size=10))
    

plt.tight_layout()

**Observation**: Looking at the distributions of our features split by the target label, we see that some of them differ slightly between the labels, which can help us select the features on which to train our models.
A more robust analysis is required to confirm any hypothesis we may form at this point just from looking at the distribution plots.



<h3 style="background-color:orange;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;">Statistical Difference Analysis</h3>



In [None]:
ttest_results_pvalues,ttest_results_statistic = [],[]
for ax,col in enumerate(water_df.columns[:9]):
    statistic,pvalue = ttest_ind(non_potabale[col],potabale[col])
    ttest_results_pvalues.append(pvalue)
    ttest_results_statistic.append(statistic)
    
ttest_res_df = pd.DataFrame({'S':ttest_results_statistic,'P':ttest_results_pvalues,'F':water_df.columns[:9]})
ttest_res_df = ttest_res_df.sort_values(by='P')

**Explanation**: To test for a significant difference between "potable" and "non-potable" water samples, we treat the two labels as separate populations from which we drew 'n' and 'k' samples ('n' = the number of "potable" samples, 'k' = the number of "non-potable" samples).
We perform a two-tailed t-test to check whether the two sample means differ significantly. Note that `ttest_ind`, as called above, uses its default `equal_var=True` (a pooled-variance Student's t-test); accounting for unequal variances requires passing `equal_var=False` (Welch's t-test).
We expect to see low p-values for the features that indeed differ significantly between the labels.
We set our significance level alpha to 0.1.
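
The difference between the pooled-variance and the unequal-variance test can be sketched on synthetic groups (illustrative values only, not the water data) with different sizes and spreads:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Synthetic groups with different sizes and spreads (illustrative values only)
group_a = rng.normal(loc=10.0, scale=1.0, size=200)
group_b = rng.normal(loc=10.5, scale=3.0, size=120)

# Student's t-test pools the variances; Welch's t-test does not
student = ttest_ind(group_a, group_b, equal_var=True)
welch = ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.1
for name, res in [("Student", student), ("Welch", welch)]:
    verdict = "reject H0" if res.pvalue < alpha else "fail to reject H0"
    print(f"{name}: t = {res.statistic:.3f}, p = {res.pvalue:.4f} -> {verdict}")
```

When group sizes and variances both differ, the two variants yield different statistics, so the choice of `equal_var` can change which features fall below alpha.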

In [None]:
tr  = go.Bar(x=ttest_res_df['F'] ,y=ttest_res_df['P'] ,name='T-test P Value')
tr2 = go.Bar(x=ttest_res_df['F'] ,y=ttest_res_df['S'] ,name='T-test Statistic')

data = [tr2,tr]
fig = go.Figure(data=data,layout={'title':'T-test Results For Each Feature in Our Dataset','barmode':'overlay'})
fig.show()


**Observation**: After performing the two-tailed t-test, we see that only "Solids" and "Organic carbon" have p-values below our pre-defined alpha value, although two more features are closer to our alpha level than the rest.
In the modeling stage, we will use the features in the above plot with p-values below 0.18 (the first 4 features in the plot).


In [None]:
mutual_info = []
for i in range(0,9):
    mi = mutual_info_classif(X=water_df.iloc[:,i].to_numpy().reshape(-1, 1),y=water_df.iloc[:,-1],random_state=42)
    mutual_info.append(mi[0])
mutual_info = pd.DataFrame({'Feature':water_df.columns[:9],'MI':mutual_info})
mutual_info = mutual_info.sort_values(by='MI')
tr  = go.Bar(x=mutual_info['Feature'] ,y=mutual_info['MI'] ,name='Mutual Information')

data = [tr]
fig = go.Figure(data=data,layout={'title':'Mutual Information Between Our Features and Potability','barmode':'overlay','yaxis_title':'Mutual Information'})
fig.show()


**Observation**: As an additional metric, we use "Mutual Information" to measure how much knowing each continuous feature reduces our uncertainty about the Bernoulli-distributed target.
We see that some of the worst-scoring features in our t-test have the highest mutual information with our target label. Conceptually, knowing something about "ph" decreases our uncertainty about "Potability"; unfortunately, mutual information doesn't tell us in which direction "ph" pushes that belief. Nonetheless, it is a strong indicator of a relationship, so we will include it in our modeling section as well.
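
A hedged illustration of how a feature can score poorly on a t-test yet carry high mutual information: in the synthetic example below (an assumption, not the water data), the label depends on whether the feature lies in a middle band, so the class means coincide in expectation and the t-test has little to detect.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic feature: the label is 1 only when x lies in a middle band,
# so both classes have the same mean in expectation
x = rng.normal(size=5000)
y = (np.abs(x) < 0.7).astype(int)

# The t-test compares means only
t_stat, p_value = ttest_ind(x[y == 0], x[y == 1], equal_var=False)

# Mutual information is sensitive to any kind of dependence
mi = mutual_info_classif(x.reshape(-1, 1), y, random_state=42)[0]

print(f"t-test p-value: {p_value:.3f}")
print(f"mutual information: {mi:.3f}")
```

This is the same pattern one would expect from a quantity like pH, where both very low and very high values are undesirable.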


<h3 style="background-color:orange;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;">Probabilistic Inference</h3>



In [None]:
plt.title('Potability as A Function of Turbidity')
sns.scatterplot(x=water_df.iloc[:,8],y=water_df.iloc[:,-1])
plt.show()

It is hard to tell whether *the probability* of a water sample being potable increases as the turbidity increases. We are interested in modeling that probability. The best we can do is ask, "At turbidity value $X$, what is the probability of a water sample being potable?". The goal of the following experiment is to answer that question.

We need a function of turbidity, call it $p(X)$, that is bounded between 0 and 1 and changes from 1 to 0 as we increase turbidity. A well-known function with these properties is the *logistic function.*

$$p(X) = \frac{1}{ 1 + e^{ \;\beta X } } $$

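In this model, $\beta$ is the variable we are uncertain about; its sign and magnitude set the direction and steepness of the curve, for example $\beta = -2, 52, 7$. The model in the code below also includes a bias term $\alpha$ that shifts the curve along the turbidity axis:

$$p(X) = \frac{1}{ 1 + e^{ \;\beta X + \alpha } } $$

$$ \text{Sample is Potable, $M_i$} \sim \text{Ber}( \;p(turbidity_i)\; ), \;\; i=1..N$$

where $p(turbidity)$ is our logistic function and $turbidity_i$ are the turbidity values in our dataset.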
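
A minimal sketch of how $\beta$ shapes the logistic curve, using the three values mentioned above and an arbitrary input range (the range is an assumption chosen for illustration, not the turbidity range of the dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

def logistic(x, beta):
    # Same sign convention as p(X) above: p = 1 / (1 + exp(beta * x))
    return 1.0 / (1.0 + np.exp(beta * x))

x = np.linspace(-4, 4, 200)
betas = [-2, 52, 7]

# One curve per beta value
curves = {beta: logistic(x, beta) for beta in betas}
for beta, p in curves.items():
    plt.plot(x, p, label=rf"$\beta = {beta}$")

plt.xlabel("X")
plt.ylabel("p(X)")
plt.title("Logistic function for different values of beta")
plt.legend()
plt.show()
```

With this sign convention a positive $\beta$ gives a decreasing curve, a negative $\beta$ an increasing one, and a larger magnitude a sharper transition.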

In [None]:
with pm.Model() as model:
    beta = pm.Normal("beta", mu=0, tau=0.001, testval=0)
    alpha = pm.Normal("alpha", mu=0, tau=1/water_df.Turbidity.std(), testval=0)
    p = pm.Deterministic("p_parm", 1.0/(1. + tt.exp(beta*water_df.Turbidity + alpha)))

Notice that in the above code we set the initial values (`testval`) of `beta` and `alpha` to 0. The reason is that if `beta` and `alpha` are very large, they make `p` equal to 1 or 0. Unfortunately, `pm.Bernoulli` does not like probabilities of exactly 0 or 1, even though they are mathematically well-defined probabilities. By setting the coefficient values to `0`, we start the variable `p` at a reasonable value.
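
The saturation problem can be reproduced with plain NumPy. In this sketch, `z` is a stand-in for `beta * turbidity + alpha` (the specific values are chosen only to trigger floating-point saturation):

```python
import numpy as np

def logistic(z):
    # p = 1 / (1 + exp(z)), with z standing in for beta * turbidity + alpha
    with np.errstate(over="ignore"):  # silence the overflow warning for huge z
        return 1.0 / (1.0 + np.exp(z))

p_start = logistic(0.0)    # coefficients at 0
p_high = logistic(-40.0)   # exp(-40) is below float64 resolution next to 1
p_low = logistic(750.0)    # exp(750) overflows to inf

print(p_start, p_high, p_low)

# A Bernoulli log-likelihood needs log(p) and log(1 - p); both break at the extremes
with np.errstate(divide="ignore"):
    print(np.log(p_low), np.log(1.0 - p_high))
```

Starting both coefficients at 0 gives `p = 0.5` for every sample, which keeps the initial log-likelihood finite.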

In [None]:
with model:
    observed = pm.Bernoulli("obs", p, observed=water_df.Potability)
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(32000, step=step, start=start)
    burned_trace = trace[20000::2]

In [None]:
alpha_samples = burned_trace["alpha"][:, None]
beta_samples = burned_trace["beta"][:, None]
plt.subplot(211)
plt.title(r"Posterior distributions of the variables $\alpha, \beta$")
sns.histplot(beta_samples, bins=35, alpha=0.85,label=r"posterior of $\beta$", palette=["#7A68A6"],stat='probability')
plt.legend()

plt.subplot(212)
sns.histplot(alpha_samples, bins=35, alpha=0.85,label=r"posterior of $\alpha$", palette=["#A60628"],stat='probability')
plt.legend();

In [None]:
t = np.linspace(water_df.Turbidity.min() - 2, water_df.Turbidity.max()+2, 150)[:, None]
def logistic(x, beta, alpha=0):
    return 1.0 / (1.0 + np.exp(np.dot(beta, x) + alpha))

p_t = logistic(t.T, beta_samples, alpha_samples)

mean_prob_t = p_t.mean(axis=0)

plt.plot(t, mean_prob_t, lw=3, label="average posterior \nprobability of potability")
plt.plot(t, p_t[0, :], ls="--", label="realization from posterior")
plt.plot(t, p_t[-2, :], ls="--", label="realization from posterior")
plt.scatter(water_df.Turbidity, water_df.Potability, color="tab:red", s=50, alpha=0.5)
plt.title("Posterior expected value of probability of a water sample being Potable; \
plus realizations")
plt.legend()
plt.ylim(-0.1, 1.1)
plt.xlim(t.min(), t.max())
plt.ylabel("probability")
plt.xlabel("turbidity");

After modeling potability as a function of turbidity, we see that the posterior of the logistic model does not reveal a classification threshold. Had turbidity been informative about potability, the posterior probability curve would change sharply around some turbidity value; unfortunately, this is not the case.
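
One hedged way to quantify such a conclusion is to check whether a credible interval for the slope contains zero. The sketch below uses synthetic draws as a stand-in for `beta_samples` (an assumption; the real posterior comes from the trace above) and computes a central 95% interval with `np.percentile`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior draws of beta (synthetic, NOT the notebook's trace)
beta_draws = rng.normal(loc=0.0, scale=0.05, size=6000)

# Central 95% credible interval from the empirical quantiles
lower, upper = np.percentile(beta_draws, [2.5, 97.5])
contains_zero = lower <= 0.0 <= upper

print(f"95% credible interval for beta: [{lower:.3f}, {upper:.3f}]")
print("interval contains 0:", contains_zero)
```

Replacing `beta_draws` with `beta_samples.ravel()` would apply the same check to the actual posterior.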

<h3 style="background-color:orange;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;">Domain Analysis via Dimensionality Reduction</h3>



In [None]:
N = 5 
pca_pipeline = Pipeline(steps = [
    ('scale',StandardScaler()),
    ('PCA',PCA(N))
])

tf_data = pca_pipeline.fit_transform(water_df.iloc[:,:9])
tf_data = pd.DataFrame({'PC1':tf_data[:,0],'PC2':tf_data[:,1],'PC3':tf_data[:,2],'PC4':tf_data[:,3],'PC5':tf_data[:,4],
                        'label':water_df.iloc[:,-1].map({0:'Not Potable',1:'Potable'})})



In [None]:
ex.scatter_3d(tf_data,x='PC1',y='PC2',z='PC3',color='label',color_discrete_sequence=['salmon','green'],title=r'$\textit{Data in Reduced Dimension } R^9 \rightarrow R^3$')

**Observation**: After using Principal Component Analysis to reduce the dimensionality of our data from $R^9$ to $R^3$, we see no visible linear/polynomial separation between the labels, which lowers our confidence in models that rely heavily on spatial separation, such as SVM.



In [None]:
components = tf_data[['PC1','PC2','PC3','PC4','PC5']].to_numpy()

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca_pipeline['PCA'].explained_variance_ratio_ * 100)
}

fig = ex.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(N),
    color=tf_data['label'],
    color_discrete_sequence=['salmon','green']
)
fig.update_traces(diagonal_visible=False)
fig.update_layout(title='Data Spread Based on Different 2D Combinations of Principal Components')

fig.show()

In [None]:

evr = pca_pipeline['PCA'].explained_variance_ratio_
total_var = evr.sum() * 100
cumsum_evr = np.cumsum(evr)

trace1 = {
    "name": "individual explained variance", 
    "type": "bar",
    'y':evr}
trace2 = {
    "name": "cumulative explained variance", 
    "type": "scatter", 
     'y':cumsum_evr}
data = [trace1, trace2]
layout = {
    "xaxis": {"title": "Principal components"}, 
    "yaxis": {"title": "Explained variance ratio"},
  }
fig = go.Figure(data=data, layout=layout)
fig.update_layout(title='{:.2f}% of the Original Feature Variance Can Be Explained Using {} Dimensions'.format(np.sum(evr)*100,N))
fig.show()

**Observation**: Using five components (out of the initial 9), we preserve only about 60 percent of the original variance. This tells us that our features are largely uncorrelated with one another, and, after looking at the different pairs of principal components, that no linear combination of them tells a better story regarding the target label.
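
As a point of reference, the sketch below runs the same scale-then-PCA pipeline on 9 mutually independent synthetic features (an assumption used as a baseline, not the water data). With no correlation to exploit, each component explains roughly 1/9 of the variance, so 5 components keep a share close to 5/9.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 9 mutually independent features
X = rng.normal(size=(3000, 9))

pipe = Pipeline(steps=[("scale", StandardScaler()), ("PCA", PCA(5))])
pipe.fit(X)

evr = pipe["PCA"].explained_variance_ratio_
print("per-component ratio:", np.round(evr, 3))
print(f"kept by 5 of 9 components: {evr.sum():.1%} (5/9 = {5/9:.1%})")
```

A retained share near this baseline is what one would expect from features that carry almost no shared linear structure.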



<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Model Selection / Baseline Model Evaluation</h3>



In [None]:
features = ttest_res_df.iloc[:3,:].F.to_list()
features.append('Turbidity')
train_x,test_x,train_y,test_y = train_test_split(water_df[features],water_df.iloc[:,-1],random_state=42,shuffle=True)


**Explanation**: The features we select for the modeling stage are those for which our EDA section found evidence of some difference or relationship with respect to the target label.

In [None]:
RandomForest_Pipeline     = Pipeline(steps = [('scale',StandardScaler()),('RF',RandomForestClassifier(random_state=42))])
AdaBoost_Pipeline         = Pipeline(steps = [('scale',StandardScaler()),('AB',AdaBoostClassifier(random_state=42))])
SVC_Pipeline              = Pipeline(steps = [('scale',StandardScaler()),('SVM',SVC(random_state=42))])

RandomForest_CV_f1     = cross_val_score(RandomForest_Pipeline,water_df[features],water_df.iloc[:,-1],cv=10,scoring='f1')
AdaBoost_CV_f1         = cross_val_score(AdaBoost_Pipeline,water_df[features],water_df.iloc[:,-1],cv=10,scoring='f1')
SVC_CV_f1              = cross_val_score(SVC_Pipeline,water_df[features],water_df.iloc[:,-1],cv=10,scoring='f1')


**Explanation**: Our baseline model will be a Random Forest, as it can partition our domain in a way that is neither linear nor polynomial; we will also see whether AdaBoost, whose decision-tree base learners are conceptually similar to those of a Random Forest, can provide some interesting results.
The third model tested in this baseline section is an SVM-based classifier, used to check the hypothesis stated earlier that our data is not separable in higher dimensions.

In [None]:
fig = make_subplots(rows=3, cols=1,shared_xaxes=True,subplot_titles=('Random Forest Cross Val Scores',
                                                                     'AdaBoost Cross Val Scores',
                                                                     'SVM Cross Val Scores'))

fig.add_trace(
    go.Scatter(x=np.arange(0,len(SVC_CV_f1)),y=RandomForest_CV_f1,name='Random Forest'),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=np.arange(0,len(SVC_CV_f1)),y=AdaBoost_CV_f1,name='AdaBoost'),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(x=np.arange(0,len(SVC_CV_f1)),y=SVC_CV_f1,name='SVM'),
    row=3, col=1
)

fig.update_layout(height=700, width=900, title_text="Different Baseline Models 10 Fold Cross Validation")
fig.update_yaxes(title_text="F1 Score")
fig.update_xaxes(title_text="Fold #")

fig.show()

**Observation**: As expected, we see some fair results both in our baseline Random Forest model and our baseline AdaBoost model.
Considering these results, we will try to create a more accurate and optimized model by using AdaBoost to invoke Random Forest models as its base estimators.

Notice that the SVM classifier indeed does an awful job, confirming our hypothesis.


<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Hyperparameter Tuning</h3>



In [None]:


# RFBE = RandomForestClassifier(random_state=42)

# AdaBoost_Pipeline         = Pipeline(steps = [('scale',StandardScaler()),('AB',AdaBoostClassifier(random_state = 42,
#                                                                                                  base_estimator = RFBE))])

# AdaBoost_Pipeline.fit(train_x,train_y)

# parameters = {'AB__base_estimator__max_depth':[2,3,5],
#               'AB__base_estimator__min_samples_leaf':[2,5,10],
#               'AB__base_estimator__criterion':['entropy','gini'],
#               'AB__base_estimator__bootstrap':[True,False],
#               'AB__n_estimators':[5,10,25],
#               'AB__learning_rate':[0.01,0.1]}

# #ADA_RF_GS  = TuneGridSearchCV(AdaBoost_Pipeline,parameters,cv=3,verbose=1)
# ADA_RF_GS  = GridSearchCV(AdaBoost_Pipeline,parameters,cv=3,verbose=10)
# ADA_RF_GS.fit(water_df[features],water_df.iloc[:,-1])

# print("Best parameter (CV score=%0.3f):" % ADA_RF_GS.best_score_)
# print(ADA_RF_GS.best_params_)

**Notice**: Because the grid search takes a reasonably long time, this block is commented out and provided for any Kaggler who wishes to experiment with different models/parameter values.
The result of the grid search is given below.


<img src="https://i.ibb.co/vh4bJK4/Screenshot-2021-06-27-143926.jpg" width="1200" height="600">

In [None]:
{'AB__base_estimator__bootstrap': True, 'AB__base_estimator__criterion': 'gini', 'AB__base_estimator__max_depth': 5, 'AB__base_estimator__min_samples_leaf': 10, 'AB__learning_rate': 0.01, 'AB__n_estimators': 5}
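
For a rough sense of why the search is slow, the sketch below counts the combinations in the grid from the commented-out cell with `ParameterGrid`. It then shows a cheaper, swapped-in alternative: `RandomizedSearchCV` sampling a handful of settings. The second part runs on synthetic data and tunes a plain Random Forest pipeline (`rf_pipeline` and `rf_space` are hypothetical names), so it is an illustration rather than a replacement for the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The grid from the commented-out cell: count its combinations
parameters = {'AB__base_estimator__max_depth': [2, 3, 5],
              'AB__base_estimator__min_samples_leaf': [2, 5, 10],
              'AB__base_estimator__criterion': ['entropy', 'gini'],
              'AB__base_estimator__bootstrap': [True, False],
              'AB__n_estimators': [5, 10, 25],
              'AB__learning_rate': [0.01, 0.1]}
n_combinations = len(ParameterGrid(parameters))
print(f"{n_combinations} combinations x 3 folds = {n_combinations * 3} fits")

# Cheaper alternative (a swapped-in technique): sample a few settings at random.
# Synthetic data and a plain Random Forest keep the sketch small and fast.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)
rf_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                              ('RF', RandomForestClassifier(n_estimators=20, random_state=42))])
rf_space = {'RF__max_depth': [2, 3, 5],
            'RF__min_samples_leaf': [2, 5, 10],
            'RF__criterion': ['entropy', 'gini'],
            'RF__bootstrap': [True, False]}

search = RandomizedSearchCV(rf_pipeline, rf_space, n_iter=5, cv=3,
                            scoring='f1', random_state=42)
search.fit(X, y)
print(search.best_params_)
```

Each of the grid's fits trains an AdaBoost ensemble of Random Forests, which is why the exhaustive search is expensive; `n_iter` caps the number of sampled settings regardless of grid size.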

<h3 style="background-color:orange;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;">Final Model</h3>



In [None]:
RFBE = RandomForestClassifier(random_state=42,bootstrap=True,criterion='gini',max_depth=5,min_samples_leaf=10)
AdaBoost_Pipeline         = Pipeline(steps = [('scale',StandardScaler()),('AB',AdaBoostClassifier(random_state = 42,
                                                                                                 base_estimator = RFBE,
                                                                                                 learning_rate=0.01,
                                                                                                 n_estimators=5))])

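from sklearn.metrics import f1_score

AdaBoost_Pipeline.fit(train_x,train_y)
# Pipeline.score reports accuracy for classifiers, so compute the F1 score explicitly
f1 = f1_score(test_y,AdaBoost_Pipeline.predict(test_x))
print('F1 - Score of AdaBoost Model with Random Forest Base Estimators and Cross Validation Grid Search -[',np.round(f1,2),']')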