<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Introduction
</div>

<div style='color:#212129;font-family:calibri light;text-align:justify;font-size:1.2em;font-weight:500;padding-left:2em;padding-right:4em'>
<b>Drinking water</b>, also known as potable water, is water that is safe to drink or use for food preparation. The amount of drinking water required to maintain good health varies, and depends on physical activity level, age, health-related issues, and environmental conditions.<sup>[1][2]</sup> For those who work in a hot climate, up to 16 litres (4.2 US gal) a day may be required.<sup>[1]</sup>
<br><br>    
In developed countries, tap water typically meets drinking water quality standards, even though only a small proportion is actually consumed or used in food preparation. Other typical uses include washing, toilets, and irrigation. Greywater may also be used for toilets or irrigation. Its use for irrigation, however, may be associated with risks.<sup>[3]</sup> <b>Water may also be unacceptable</b> due to levels of <b>toxins</b> or suspended <b>solids</b>.<br><br>

Globally, by 2015, <b>89% of people</b> had access to water from a source that is <b>suitable for drinking</b>, a so-called improved water source.<sup>[3]</sup> In <b>Sub-Saharan Africa</b>, access to potable water ranged from <b>40% to 80%</b> of the population. Nearly 4.2 billion people worldwide had access to tap water, while another 2.4 billion had access to wells or public taps.<sup>[3]</sup> The World Health Organization considers access to <b>safe drinking-water a basic human right</b>.<br><br>
        
About <b>1 to 2 billion people lack safe drinking water</b>.<sup>[4]</sup> <b>More people die from unsafe water than from war</b>, then-U.N. Secretary-General Ban Ki-moon said in 2010.<sup>[5]</sup>  
<br><br>
<p style='font-size:0.5em;margin-left:3em'>[1] Ann C. Grandjean (August 2004). "3" (PDF). Water Requirements, Impinging Factors, & Recommended Intakes. World Health Organization. pp. 25–34. Archived (PDF) from the original on 2016-02-22. This 2004 article focuses on the USA context and uses data collected from the US military.</p>
<p style='font-size:0.5em;margin-left:3em'>[2] Exposure Factors Handbook: 2011 Edition (PDF). National Center for Environmental Assessment. September 2011. Archived from the original (PDF) on 24 September 2015. Retrieved 24 May 2015.</p>
<p style='font-size:0.5em;margin-left:3em'>[3] "Water Fact sheet N°391". July 2014. Archived from the original on 5 June 2015. Retrieved 24 May 2015.</p>
<p style='font-size:0.5em;margin-left:3em'>[4] "Drinking-water". World Health Organization. March 2018. Retrieved 23 March 2018.</p>
<p style='font-size:0.5em;margin-left:3em'>[5] "Unsafe water kills more people than war, Ban says on World Day". UN News. 22 March 2010. Retrieved 10 May 2018.</p>
</div>

**Feel free to take a look at my other work:**
* [Netflix - Awesome EDA & Score Prediction](https://www.kaggle.com/mlanhenke/netflix-awesome-eda-score-prediction-wip)
* [Student-Test-Scores - EDA & Score Prediction](https://www.kaggle.com/mlanhenke/test-scores-epic-eda-prediction-cb-xgb-lgbm)

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Dataset Description
</div>

<div style='color:#212129;font-family:calibri light;text-align:justify;font-size:1.2em;font-weight:500;padding-left:2em;padding-right:4em'>
The dataset at hand consists of <b>10 columns</b> in total.<br> 
Our <b>goal</b> is to predict (classify) if a water source is potable or not.<hr/></div>

1. **pH-Value**: Measures the acid-base balance of the water. The WHO recommends a permissible pH range of 6.5 to 8.5.
2. **Hardness**: Caused by calcium and magnesium salts. Defined as the capacity of water to precipitate soap.
3. **Solids**: Indicates how strongly the water is mineralized (Total Dissolved Solids, TDS). The desirable limit for TDS is 500 mg/l and the maximum limit is 1000 mg/l.
4. **Chloramines**: Major disinfectants used in public water systems. Levels of up to 4 milligrams per liter are considered safe.
5. **Sulfate**: Sulfates are naturally occurring substances that are found in minerals, soil, and rocks.
6. **Conductivity**: The amount of dissolved solids in water determines its electrical conductivity. It should not exceed 400 μS/cm.
7. **Organic_carbon**: A measure of the total amount of carbon in organic compounds in the water.
8. **Trihalomethanes**: May be found in water treated with chlorine. THM levels of up to 80 ppm are considered safe.
9. **Turbidity**: Indicates the quality of waste discharge with respect to colloidal matter. The WHO-recommended value is 5.00 NTU.
10. **Potability**: Our target variable. Indicates whether the water is safe for human consumption, where 1 means potable and 0 means not potable.
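
The guideline values in the list above can be turned into a quick sanity check. The following is only a minimal sketch: the column names other than `ph` are assumed to match the CSV, the `GUIDELINES` dictionary and the `within_guidelines` helper are hypothetical, and the sample values are made up for illustration.

```python
import pandas as pd

# Guideline ranges taken from the feature list above: (lower, upper) bounds,
# with None meaning "no bound stated". Column names are assumptions.
GUIDELINES = {
    "ph": (6.5, 8.5),
    "Solids": (None, 1000.0),
    "Chloramines": (None, 4.0),
    "Conductivity": (None, 400.0),
    "Trihalomethanes": (None, 80.0),
    "Turbidity": (None, 5.0),
}

def within_guidelines(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame: True where a value lies inside its guideline range."""
    flags = {}
    for col, (low, high) in GUIDELINES.items():
        if col not in frame.columns:
            continue
        ok = frame[col].notna()  # missing values never count as "inside"
        if low is not None:
            ok &= frame[col] >= low
        if high is not None:
            ok &= frame[col] <= high
        flags[col] = ok
    return pd.DataFrame(flags, index=frame.index)

# Tiny made-up sample (not real measurements) to show the behaviour.
sample = pd.DataFrame({
    "ph": [7.0, 9.2, None],
    "Solids": [800.0, 20000.0, 950.0],
    "Turbidity": [3.9, 4.5, 6.1],
})
flags = within_guidelines(sample)
share_ok = flags.mean()  # share of rows inside the range, per column
print(share_ok)
```

Applied to the real data, `within_guidelines(df).mean()` would show how many samples fall inside each stated range, which can be a useful first plausibility check before modeling.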

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Importing Data
</div>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.lines as lines

from skimage import io
from scipy.stats import skew, kurtosis

from warnings import filterwarnings
filterwarnings('ignore')

sns.set_style('white')
plt.rcParams['font.family'] = 'monospace'

In [None]:
file_path = '../input/water-potability/water_potability.csv'
df = pd.read_csv(file_path)

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Color Palettes
</div>

In [None]:
blues = ['#193f6e','#3b6ba5','#72a5d3','#b1d3e3','#e1ebec']
darks = ['#4e5560','#866a67','#9a9385','#c5bfa7','#e6dbc8']
cmap_blues = sns.color_palette(blues)
cmap_darks = sns.color_palette(darks)
sns.set_palette(cmap_blues)

In [None]:
sns.palplot(cmap_blues)
sns.palplot(cmap_darks)

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Basic Overview
</div>

In [None]:
print(f"Shape: {df.shape}")
print(':'*20)
df.head(5)

In [None]:
df.info()

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Missing Values
</div>

In [None]:
# get percentage of missing values per column
missing_values = df.isna().sum() / len(df) * 100
missing_values = missing_values.sort_values(ascending=False)
missing_values = missing_values[missing_values > 0]
missing_values = pd.DataFrame({'Feature':missing_values.index, 'Ratio':missing_values.values})

fig, ax = plt.subplots(1, 2, figsize=(12,6))

fig.text(0.05,1,'Missing Values', fontsize=16, fontweight='bold')
fig.text(0.05,0.95,'We can tell that missing values occur among three different columns.', fontsize=12, fontweight='light')
fig.subplots_adjust(wspace=0.5, hspace=0.5)

ax0 = sns.barplot(
    data=missing_values,
    y='Feature',
    x='Ratio',
    orient='h',
    ax=ax[0]
)

ax1 = sns.heatmap(
    data=df.isna(),
    cmap=cmap_darks,
    cbar=False,
    ax=ax[1]
)

ax0.set_xlabel('')
ax0.set_xticks([])
ax1.set_yticks([])

# annotations
for idx in range(0,len(missing_values.index)):
    ax0.annotate(
        f"{np.round(missing_values['Ratio'][idx],2)} %",
        xy=(missing_values['Ratio'][idx]-2.5,idx),
        va='center', ha='center', color='#fff'
    )

# despine
for a in ax:
    a.set_ylabel('')
    for spine in ['top','left','right','bottom']:
        a.spines[spine].set_visible(False) 

# separation line
l1 = lines.Line2D([0.5, 0.5], [0, 0.9], transform=fig.transFigure, figure=fig,color='#ccc',lw=1)
fig.lines.extend([l1])

plt.show()

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Target: Potability
</div>

In [None]:
plt.figure(figsize=(12,6))
plt.title('Potability Count', size=16, y=1, x=0.12, fontweight='bold')
plt.grid(color='gray', axis='x', linestyle=':', linewidth=1, alpha=0.5, zorder=0, dashes=(2,10))

ax = sns.countplot(
    data=df, y='Potability', orient='h',
    edgecolor='white', linewidth=1, alpha=0.95
)

ax.imshow(
    io.imread('https://media.istockphoto.com/photos/water-splash-picture-id182812025?k=6&m=182812025&s=170667a&w=0&h=QVsOSW86k5WIVaMrQikXvSCkMKMAVg9FKYkHRprrtVs='),
    aspect=ax.get_aspect(),
    extent=ax.get_xlim() + ax.get_ylim(),
    alpha=0.35,
    zorder=0
)

plt.xlabel('')
plt.ylabel('')
plt.yticks([])

v_counts = df['Potability'].value_counts()
pot_0, pot_1 = v_counts[0], v_counts[1]
plt.figtext(0.7,0.7,f"# Not-Drinkable: {pot_0}",backgroundcolor='#e1ebec')
plt.figtext(0.45,0.3,f"# Drinkable: {pot_1}",backgroundcolor='#e1ebec')

sns.despine(left=True)

In [None]:
# calculate ratio of potability classes
total_counts = pot_0 + pot_1
r0 = pot_0 / total_counts
r1 = pot_1 / total_counts
r_df = pd.DataFrame({'Ratio_0':[r0],'Ratio_1':[r1]})

fig, ax = plt.subplots(1,1, figsize=(12, 4))

ax.barh(r_df.index, r_df['Ratio_0'], color=blues[0], alpha=0.9)
ax.barh(r_df.index, r_df['Ratio_1'], color=blues[1], alpha=0.9, left=r_df['Ratio_0'])

ax.set_xlim(0,1)
ax.set_xticks([])
ax.set_yticks([])

x_0 = r_df['Ratio_0'][0]
x_1 = r_df['Ratio_1'][0]

# Class 0 Annotation
ax.annotate(
    f"{np.round(x_0*100,2)}%",
    xy=(x_0/2, 0),
    va='center', ha='center', fontsize=40, fontweight='light',color='#fff'
)

ax.annotate(
    'Not-Drinkable',
    xy=(x_0/2, -0.15),
    va='center', ha='center', fontsize=15, fontweight='light',color='#eee'
)

# Class 1 Annotation
ax.annotate(
    f"{np.round(x_1*100,2)}%",
    xy=(x_0 + x_1/2, 0),
    va='center', ha='center', fontsize=40, fontweight='light',color='#fff'
)

ax.annotate(
    'Drinkable',
    xy=(x_0 + x_1/2, -0.15),
    va='center', ha='center', fontsize=15, fontweight='light',color='#eee'
)

# Title & Subtitle
fig.text(0.125,1.03,'Potability Distribution',fontsize=16, fontweight='bold')
fig.text(0.125,0.92,'Our dataset is not equally distributed in terms of potability.',fontsize=12)  

for spine in ['top','left','right','bottom']:
    ax.spines[spine].set_visible(False)

plt.show()

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Feature Columns
</div>

In [None]:
# get all features, split df into potability classes
feature_cols = [*df.columns.drop(labels='Potability')]
df_pot_0 = df[df['Potability'] == 0].copy()
df_pot_1 = df[df['Potability'] == 1].copy()

In [None]:
fig = plt.figure(figsize=(15,9))

for idx, feature in enumerate(feature_cols):
    plt.subplot(3,3,idx+1)
    plt.title(f"Distribution of {feature}", size=11, y=1.05)
    plt.grid(color='gray', axis='x', linestyle=':', linewidth=1, alpha=0.5, zorder=0, dashes=(2,10))

    # note: `fill` replaces the deprecated `shade` argument of sns.kdeplot
    sns.kdeplot(
        data=df_pot_0, x=feature, fill=True, color=blues[0],
        edgecolor='black', linewidth=1, alpha=0.8, label='CLASS 0'
    )

    sns.kdeplot(
        data=df_pot_1, x=feature, fill=True, color=blues[3],
        edgecolor='black', linewidth=1, alpha=0.8, label='CLASS 1'
    )

    plt.xlabel('')
    plt.ylabel('')
    plt.yticks([])

    plt.legend(facecolor=blues[4], fontsize=7)
    sns.despine(left=True)

# separation lines
l1 = lines.Line2D([0.07, 0.07], [0.97, 1.12], transform=fig.transFigure, figure=fig,color='#ccc',lw=1)
# l2 = lines.Line2D([0.07, 0.4], [1.12, 1.12], transform=fig.transFigure, figure=fig,color='#ccc',lw=1)
fig.lines.extend([l1])

fig.subplots_adjust(wspace=0.25,hspace=0.5)
fig.text(0.05,1.15,'Feature Distribution', fontsize=16, fontweight='bold')
fig.text(0.08,0.97,'''
By looking at the different feature distributions we can tell that most of 
the features look roughly normally distributed. However, the distribution of 
solids looks right-skewed, with outliers to the right. By comparing both 
classes of potability we can get a first impression of whether a certain 
feature is a determining factor for the upcoming classification task. 
The most important features seem to be: pH-Value, Hardness and Sulfate.
''', fontsize=12, fontweight='light')

plt.show()
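
The visual impression of "roughly normal" versus "right-skewed" can be backed by numbers. The sketch below uses synthetic stand-in columns rather than the real data, and it relies on pandas' built-in `skew()` and `kurt()` methods instead of the `scipy.stats` functions imported earlier; the column names are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the real data): one symmetric, one right-skewed column.
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=7.0, scale=1.5, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.8, size=5000),
})

# Sample skewness is near 0 for symmetric data and positive for a long right tail.
# pandas reports excess kurtosis, which is near 0 for a normal distribution.
summary = pd.DataFrame({"skew": demo.skew(), "excess_kurtosis": demo.kurt()})
print(summary.round(2))
```

On the actual dataset, `df[feature_cols].skew()` should give the same kind of per-feature summary.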

In [None]:
# compare means
print('Mean Comparison by Potability')
mean_group = df.groupby('Potability').mean()
mean_group

In [None]:
# create correlation map
corr_map = df.drop(columns='Potability').corr()

# create mask
mask = np.triu(np.ones_like(corr_map,dtype=bool))

# create correlation heatmap
fig = plt.figure(figsize=(15,9))

ax = sns.heatmap(
    data=corr_map, square=True, center=0, linewidth=1,
#     cmap=sns.diverging_palette(240, 10, s=60, l=40, n=9, center="light", as_cmap=True),
    cmap=cmap_blues,
    cbar=False,
    mask=mask,
    annot=True,
    fmt='.2f',
    cbar_kws={'shrink': 0.82}
)

ax.annotate(
    'A small sign of correlation,\nif any...',
    fontsize=10,fontweight='light',
    xy=(2.3, 4.1), xycoords='data',
    xytext=(0.6, 0.95), textcoords='axes fraction',
    arrowprops=dict(
        facecolor=darks[0], shrink=0.025, 
        connectionstyle='arc3, rad=0.3'),
    horizontalalignment='left', verticalalignment='top'
)

# separation lines
l1 = lines.Line2D([0.65, 0.65], [0.4, 0.7], transform=fig.transFigure, figure=fig,color='#ccc',lw=1)
fig.lines.extend([l1])

# Title & Annotation
fig.text(0.2,0.93,'Feature Correlation',fontsize=16, fontweight='bold')
fig.text(0.66,0.69,'Insights:', fontsize=12, fontweight='bold')
fig.text(0.64,0.52,'''
    Nearly all of the features are uncorrelated, 
    which means there is little to no sign of multicollinearity.
    A small sign of correlation is evident when taking a look 
    at the sulfate level and the dissolved minerals, which makes sense.
    The fewer minerals (solids) are dissolved, the lower the sulfate level
    should be, since sulfate is a mineral itself.
''', fontsize=12, fontweight='light')

plt.show()
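
For readers wondering what the `np.triu` mask in the heatmap cell does, here is a small sketch on a made-up 3x3 correlation matrix. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Made-up symmetric correlation matrix for three features.
corr = np.array([
    [1.0, 0.2, -0.1],
    [0.2, 1.0, 0.3],
    [-0.1, 0.3, 1.0],
])

# True marks the cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Only the strictly lower triangle stays visible, so each feature pair appears once.
visible = corr[~mask]
print(mask)
print(visible)
```

Since a correlation matrix is symmetric and its diagonal is always 1, hiding the upper triangle removes only redundant cells.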

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Data Preprocessing
</div>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [None]:
# prepare dataset for split
X = df.drop(columns='Potability').copy()
y = df['Potability'].copy()

# create train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

# preprocess data
pipe = Pipeline([
    ('impute',SimpleImputer()),
    ('scale',StandardScaler())
])

X_train = pd.DataFrame(data=pipe.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(data=pipe.transform(X_test), columns=X_test.columns)

# check the results, using ph as an example
print(f"Train Shape: {X_train.shape} Train ph Mean: {np.round(np.mean(X_train['ph']),0)} Train ph Std: {np.std(X_train['ph'])}")
print(f"Test Shape: {X_test.shape} Test ph Mean: {np.round(np.mean(X_test['ph']),0)} Test ph Std: {np.std(X_test['ph'])}")
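
The reason for calling `fit_transform` on the training split and only `transform` on the test split is that the imputation mean and the scaling statistics should come from the training data alone. The sketch below illustrates this on synthetic single-feature data. The arrays and the `demo_pipe` name are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic single-feature data with a few missing entries (illustrative only).
X_train_demo = rng.normal(loc=7.0, scale=1.5, size=(200, 1))
X_test_demo = rng.normal(loc=7.0, scale=1.5, size=(80, 1))
X_train_demo[::10] = np.nan
X_test_demo[::10] = np.nan

demo_pipe = Pipeline([
    ("impute", SimpleImputer()),  # default strategy: column mean
    ("scale", StandardScaler()),
])

# Imputation mean and scaling statistics are learned from the training split only ...
train_scaled = demo_pipe.fit_transform(X_train_demo)
# ... and reused unchanged on the test split.
test_scaled = demo_pipe.transform(X_test_demo)

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1
print(test_scaled.mean(), test_scaled.std())    # close to, but not exactly, 0 and 1
```

The small deviation on the test split is expected: it is scaled with the training statistics, which avoids leaking test information into preprocessing.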

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Modeling
</div>

In [None]:
from sklearn.metrics import precision_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier, Perceptron, PassiveAggressiveClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

In [None]:
# test different base models
models = [
    ('LogReg', LogisticRegression(max_iter=1000)),
    ('Ridge', RidgeClassifier()),
    ('SGD', SGDClassifier(max_iter=1000, tol=1e-3)),
    ('SVC', SVC()),
    ('NuSVC', NuSVC()),
    ('DTree', DecisionTreeClassifier()),
    ('GNB', GaussianNB()),
    ('BNB', BernoulliNB()),
    ('Perc', Perceptron()),
    ('NC', NearestCentroid()),
    ('RFC', RandomForestClassifier()),
    ('Ada', AdaBoostClassifier()),
    ('XGB', XGBClassifier(verbosity = 0)),
    ('PAC', PassiveAggressiveClassifier())
]

results = dict()

for name, model in models:
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    score = precision_score(y_test, y_hat, average='macro')
    results[name] = score
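
To make the `average='macro'` setting more tangible, here is a hand-rolled sketch of macro-averaged precision in plain Python. The `macro_precision` helper and the label lists are hypothetical. It treats a class that is never predicted as precision 0, which, as far as I can tell, mirrors scikit-learn's default behaviour apart from the warning.

```python
def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision; a class never predicted scores 0."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for cls in classes:
        # how often was this class predicted, and how often was that correct?
        predicted = sum(1 for p in y_pred if p == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
        per_class.append(correct / predicted if predicted else 0.0)
    return sum(per_class) / len(per_class)

# Made-up labels: class 0 precision is 3/4, class 1 precision is 1/2.
y_true_demo = [0, 0, 0, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 0, 0]
print(macro_precision(y_true_demo, y_pred_demo))
```

Because both classes count equally, a model that only ever predicts the majority class is penalized, which matters for the imbalanced target seen earlier.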

In [None]:
# create dataframe from results
df_results = pd.DataFrame([results])
df_results = df_results.transpose()
df_results = df_results.rename(columns={0:'Precision'}).sort_values(by='Precision',ascending=False)

In [None]:
fig = plt.figure(figsize=(15,9))

ax = sns.barplot(data=df_results, y=df_results.index, x='Precision', orient='h', palette='Blues', saturation=0.4, edgecolor=darks[0], linewidth=0.5)

ax.set_xlabel('')
ax.set_xticks([])

# Bar Annotation
for idx in range(0,len(df_results.index)):
    color='black' 
    if idx > (len(df_results.index)-4): color='white'
        
    # use positional access (.iloc), since df_results is indexed by model name
    precision = df_results['Precision'].iloc[idx]
    ax.annotate(
        f"{np.round(precision*100,2)} %",
        xy=(precision-0.035, idx),
        va='center', ha='center', fontsize=12, color=color
    )

# Title & Annotation
fig.text(0.1,0.93,'Model Evaluation: Precision Score',fontsize=16, fontweight='bold')
fig.text(0.82,0.69,'Insights:', fontsize=12, fontweight='bold')
fig.text(0.8,0.56,'''
    Support Vector or Decision Tree based algorithms 
    perform best, whereas the linear models perform worst.
    This should make sense, since in our EDA we couldn't
    find any linear relation at all.
''', fontsize=12, fontweight='light')

# separation lines
l1 = lines.Line2D([0.8, 0.8], [0.5, 0.7], transform=fig.transFigure, figure=fig,color='#ccc',lw=1)
fig.lines.extend([l1])

# Despine
for spine in ['top','left','right','bottom']:
    ax.spines[spine].set_visible(False)

<div style='background:#72a5d3;padding-left:2em;text-align:left;font-family:monospace;font-weight:bold;font-size:1.65em;color:#fff'>
    Conclusion
</div>

<div style='color:#212129;font-family:calibri light;text-align:justify;font-size:1.2em;font-weight:500;padding-left:2em;padding-right:4em'>
While working with this dataset, I tried to put special <b>emphasis</b> on the <b>exploratory data analysis</b>.
My main goal here was to sharpen my plotting skills and put a little bit more effort into the visuals.

As far as <b>analysis</b> goes, the dataset seems to be synthetically generated, since the feature distributions 
in comparison to the classification don't look quite right. However, for the purpose of practicing, 
this dataset is super easy and fun to handle.
    
Things I'd have to do to <b>improve the model</b> would be hyperparameter tuning (e.g. an Optuna study) and combining the baseline models into an ensemble learner (e.g. a VotingClassifier).
</div>
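
As a rough idea of what the ensemble step mentioned in the conclusion could look like, here is a minimal sketch with scikit-learn's `VotingClassifier`. It runs on synthetic data from `make_classification` as a placeholder for the preprocessed water features. The choice of base models and of hard (majority) voting is an assumption, not a tuned setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data as a placeholder for the preprocessed water features.
X_demo, y_demo = make_classification(n_samples=600, n_features=9, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Hard voting: every base model casts one vote and the majority class wins.
ensemble = VotingClassifier(
    estimators=[
        ("svc", SVC()),
        ("rfc", RandomForestClassifier(random_state=1)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
ensemble.fit(X_tr, y_tr)
y_hat_demo = ensemble.predict(X_te)
print(precision_score(y_te, y_hat_demo, average="macro"))
```

Whether an ensemble like this actually beats the best single model on the water data would still need to be checked, ideally with cross-validation.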

<div style='border-radius:3px;background:#b1d3e3;padding:2em;text-align:left;font-family:monospace;font-weight:light;font-size:1.1em;color:black'>
    <b>Thanks for checking out my notebook!</b><br>
    Feel free to leave a comment, a suggestion, an upvote or just a simple message to say hello :)
</div>