# **LIBRARIES**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# **Dataset import**

In [None]:
water_potability_filepath = '../input/water-potability/water_potability.csv'

full_data = pd.read_csv(water_potability_filepath)

In [None]:
full_data.shape

There are a total of *3276* training examples, having 9 features and 1 target, i.e. *Potability*.

In [None]:
full_data.info()

The *ph*, *Sulfate*, and *Trihalomethanes* columns contain missing values. We will deal with these later.

# FEATURE DESCRIPTIONS

* **ph** - Indicates acidic/alkaline behaviour of the water sample. Recommended pH value as per WHO guidelines is 6.5 to 8.5.

* **Hardness** - The amount of dissolved magnesium and calcium salts in water. According to WHO, hardness should not exceed 120 to 170 mg/L.

* **Solids** - Given by Total Dissolved Solids (TDS), or the amount of solids dissolved in water. TDS value of 50 to 250 ppm is deemed safe.

* **Chloramines** - Formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 mg/L (or 4 parts per million (ppm)) are considered safe in drinking water.

* **Sulfate** - Sulfates are part of naturally occurring minerals and dissolve into groundwater. Sulfate levels above 250 mg/L in water are considered unsafe.

* **Conductivity** - It is the measure of the tendency of water to conduct electricity. According to WHO, Electrical Conductivity (EC) should not exceed 400 μS/cm.

* **Organic_carbon** - It is a measure of the total amount of carbon in organic compounds in pure water. Total Organic Carbon (TOC) below 25 ppm is considered safe.

* **Trihalomethanes** - These may be found in water treated with chlorine. THM values up to 100 μg/L are considered fit for drinking water.

* **Turbidity** - It is a measure of the cloudiness of water, i.e. how strongly suspended particles scatter light. According to WHO, turbidity of drinking water shouldn't be more than 5 NTU, and should ideally be below 1 NTU.

* **Potability** - Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
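
To compare the dataset against the limits quoted above, one can count how many samples fall inside each range. The sketch below is only an illustration: `SAFE_RANGES` and `fraction_within` are hypothetical names, the limits are copied from the list above, and the three-row frame is a made-up stand-in for `full_data`.

```python
import numpy as np
import pandas as pd

# Limits copied from the feature descriptions above: (lower, upper); None = no bound.
SAFE_RANGES = {
    "ph": (6.5, 8.5),
    "Solids": (50, 250),
    "Sulfate": (None, 250),
    "Conductivity": (None, 400),
    "Turbidity": (None, 5),
}

def fraction_within(df, ranges):
    """Return, per column, the share of non-missing values inside its range."""
    out = {}
    for col, (lo, hi) in ranges.items():
        values = df[col].dropna()
        ok = pd.Series(True, index=values.index)
        if lo is not None:
            ok &= values >= lo
        if hi is not None:
            ok &= values <= hi
        out[col] = ok.mean()
    return pd.Series(out)

# Tiny stand-in for full_data, made up for illustration only.
toy = pd.DataFrame({
    "ph": [7.0, 9.2, np.nan],
    "Solids": [20000.0, 180.0, 31000.0],
    "Sulfate": [300.0, 240.0, np.nan],
    "Conductivity": [350.0, 500.0, 420.0],
    "Turbidity": [3.9, 4.2, 6.1],
})
within = fraction_within(toy, SAFE_RANGES)
print(within)
```

Running `fraction_within(full_data, SAFE_RANGES)` on the real frame would give a quick view of which features sit mostly outside their quoted limits.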

In [None]:
full_data.describe()

In [None]:
full_data.head(10)

The data seems to be grouped by *Potability*, as we only see 0 values in head.
Therefore, it is best to shuffle this dataset before proceeding.

In [None]:
# random_state makes the shuffle (and everything downstream) reproducible
full_data = full_data.sample(frac = 1, random_state = 0).reset_index(drop = True)

full_data.head(10)

# Data Analysis

In [None]:
full_data.Potability.value_counts()

In [None]:
plt.figure(figsize = (10,6))
full_data.Potability.value_counts().plot.pie(autopct = "%.1f%%")
plt.title('Potability distribution pie chart', pad = 20,
         fontdict = {'size' : 15, 'color' : 'darkblue', 'weight' : 'bold'})
plt.show()

There seems to be a slight imbalance in target value. However, nothing major to be worried about (**yet**).

Now we will classify the features into numerical and categorical. This will help us to preprocess the data in an appropriate manner.

In [None]:
def col_type(col):
    # Treat low-cardinality or string-typed columns as categorical
    if col.nunique() <= 10 or col.dtype == 'object':
        return 'cat'
    else:
        return 'num'

numerical_col = [col for col in full_data.columns
                if col_type(full_data[col]) == 'num']
categorical_col = [col for col in full_data.columns
                  if col_type(full_data[col]) == 'cat']

print('Numerical features: {}'.format(numerical_col))
print('Categorical features: {}'.format(categorical_col))

Seems like the only categorical column we have is the target feature. Therefore, the entire feature set will consist of numerical type features.

In [None]:
features = numerical_col

In [None]:
sns.pairplot(hue = 'Potability', data = full_data)

There doesn't seem to be any pattern in any of these pair plots. Almost all the points corresponding to Potability 0 and Potability 1 are jumbled together randomly. This might prove to be a big problem, because the plots show no obvious boundary along which the two classes could be separated.

* Let's try finding correlation based on a correlation matrix.

In [None]:
corrMat = full_data.corr()

plt.figure(figsize = (15,10))
sns.heatmap(corrMat, square = True, annot = True)
plt.show()

The heatmap shows no significant correlation between any two features, or between any feature and the target.

This makes me question the credibility of the dataset. Water quality is a sensitive topic, and there are known threshold values for these features that separate safe from unsafe water, so this kind of randomness just doesn't seem genuine.
For example, there are samples with pH exceeding 10.0 (strongly alkaline and way over the safety threshold) that are nevertheless labelled potable.

In [None]:
full_data.sort_values(by = 'ph', ascending = False).head(10)

In the fourth row, *ph* = **13.175402**, *Chloramines* = **8.9**, *Sulfate* = **375**, *Conductivity* = **500**, and the sample is still deemed Potable. This just does not make sense.

Another questionable feature is *Solids*, which I'll elaborate on below:

In [None]:
print('Mean value of TDS (ppm):',full_data.Solids.mean())

plt.figure(figsize = (8,5))
sns.distplot(full_data.Solids)
plt.xlabel('Solids (ppm)')
plt.title('Distribution of Total Dissolved Solids (TDS)')
print()
plt.show()

According to the above distribution, the mean value of TDS in the dataset is about 22000 ppm. However, safe values for TDS range from about 50 to 250 ppm. Thus, nearly every sample in this dataset should be unfit for consumption.

For this reason, it seems highly likely that either this feature was incorrectly extracted or that the unit is messed up. Hence, we could get rid of this column from the dataset.

In [None]:
#full_data.drop('Solids', axis = 1, inplace = True)

On inspecting the correlation matrix, though, it appears that *Potability* has its highest correlation coefficient with *Solids*, however small it may be. Thus, it may not be wise to remove the most significant feature.

In [None]:
corrMat.Potability.abs().sort_values(ascending = False)[1:]

Instead, we will divide the entire *Solids* column by 100 so that the values at least seem believable.

In [None]:
full_data.Solids = full_data.Solids / 100

print('Mean value of TDS (ppm):',full_data.Solids.mean())

plt.figure(figsize = (8,5))
sns.distplot(full_data.Solids)
plt.xlabel('Solids (ppm)')
plt.title('Distribution of Total Dissolved Solids (TDS)')
print()
plt.show()

In [None]:
from scipy.stats import norm

f, axes = plt.subplots(3, 3, figsize = (10, 10))

sns.distplot(full_data.ph, fit = norm, ax = axes[0,0])
sns.distplot(full_data.Hardness, fit = norm, ax = axes[0,1])
sns.distplot(full_data.Solids, fit = norm, ax = axes[0,2])
sns.distplot(full_data.Chloramines, fit = norm, ax = axes[1,0])
sns.distplot(full_data.Sulfate, fit = norm, ax = axes[1,1])
sns.distplot(full_data.Conductivity, fit = norm, ax = axes[1,2])
sns.distplot(full_data.Organic_carbon, fit = norm, ax = axes[2,0])
sns.distplot(full_data.Trihalomethanes, fit = norm, ax = axes[2,1])
sns.distplot(full_data.Turbidity, fit = norm, ax = axes[2,2])

plt.show()

All features appear to be approximately normally distributed. Funny how the features are all balanced, while the dataset is not.
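
A visual impression of normality can be backed up with a couple of numbers. The sketch below is a rough check on synthetic data rather than on this dataset: `shape_summary` is a hypothetical helper, and the two arrays are generated stand-ins for a symmetric and a right-skewed column.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)

# Synthetic stand-ins for dataset columns (assumption: not the real data).
symmetric = rng.normal(loc=7.0, scale=1.5, size=5000)          # bell-shaped
right_skewed = rng.lognormal(mean=0.0, sigma=0.8, size=5000)   # long right tail

def shape_summary(values):
    """Skewness and excess kurtosis; both are close to 0 for normal data."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing entries
    return {"skew": float(skew(values)),
            "excess_kurtosis": float(kurtosis(values))}

normal_stats = shape_summary(symmetric)
skewed_stats = shape_summary(right_skewed)
print(normal_stats)
print(skewed_stats)
```

Applied per column (for example `shape_summary(full_data.Solids)`), values far from zero would flag features whose histogram only looks normal at first glance.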

# **Data Preprocessing**

Now we will handle the missing values as well as scale the features of our dataset.

In [None]:
X = full_data[features]
y = full_data.Potability

It is important to always split the data into train-test sets before applying any transformations on it. Not doing so could result in data leakage in our model and consequently, inaccurate model predictions.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 0)
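
Since the target is somewhat imbalanced, a stratified split is one option for keeping the class ratio identical in both subsets. The following is a sketch on made-up labels with a 60/40 split (an assumption, not the measured ratio of this dataset); `stratify` is a standard `train_test_split` parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: 600 negatives and 400 positives, made up for illustration.
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 600 + [1] * 400)

# stratify=y_toy keeps the share of each class the same in both subsets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("positive share, train:", y_tr.mean())
print("positive share, val:  ", y_va.mean())
```

In this notebook the equivalent call would pass `stratify = y` alongside the existing arguments.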

In [None]:
print(X_train.shape, y_train.shape, sep = '\n')

In [None]:
full_train = pd.concat([X_train, y_train], axis = 1)
full_val = pd.concat([X_val, y_val], axis = 1)

full_train.head()

# Missing Values

We start by identifying which columns have missing values and how many. Then we will choose a suitable strategy to fill the blanks.

In [None]:
missing_val_col = full_train.isnull().sum().sort_values(ascending = False)
missing_val_col = missing_val_col[missing_val_col > 0]
missing_ratio_col = missing_val_col / full_train.shape[0]

missing = pd.concat([missing_val_col, missing_ratio_col * 100], axis = 1,
                   keys = ['total', '%'])
missing

In [None]:
for col in missing_val_col.index:
    plt.figure(figsize = (15,10))
    sns.boxplot(x = 'Potability', y = col, data = full_train)
    plt.show()

The medians of the Potability = 0 and Potability = 1 boxes nearly coincide in all 3 cases, meaning there isn't any significant relation between Potability and the features in question. This is consistent with our earlier correlation matrix analysis.

Imputing the missing values with the median seems okay, as it won't disturb this dataset.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy = 'median')
imputed_X_train = pd.DataFrame(imputer.fit_transform(full_train))
imputed_X_val = pd.DataFrame(imputer.transform(full_val))

imputed_X_train.columns = full_train.columns
imputed_X_val.columns = full_val.columns

In [None]:
imputed_X_train.isnull().any().any()

All missing values have now been dealt with. 
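
To make the train/validation discipline of the imputation step concrete, here is a minimal sketch on a made-up single-column frame. It shows that the median is learned from the training rows only and then reused for the validation rows, and that the median is less affected by an extreme value than the mean. The variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: one extreme reading (14.0) and one missing value in train.
train = pd.DataFrame({"ph": [6.0, 7.0, np.nan, 8.0, 14.0]})
val = pd.DataFrame({"ph": [np.nan, 7.5]})

median_imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from train; transform reuses it on val.
# Passing index= and columns= keeps the original row and column labels.
train_filled = pd.DataFrame(median_imputer.fit_transform(train),
                            columns=train.columns, index=train.index)
val_filled = pd.DataFrame(median_imputer.transform(val),
                          columns=val.columns, index=val.index)

learned_median = median_imputer.statistics_[0]
train_mean = train["ph"].mean()
print("median used for filling:", learned_median)
print("mean, for comparison:   ", train_mean)
```

Keeping `index=` on the rebuilt frames is optional here, but it avoids silently resetting the row labels when the result is later joined with other data.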

Before moving on to feature scaling, it is wise to first take a look at the outliers in our dataset, as their presence might distort the data transformations.

In [None]:
for col in features:
    plt.figure(figsize = (10,8))
    sns.boxplot(y = col, data = imputed_X_train)
    plt.show()

There seem to be just a few major outliers, and I've written the filter code for them in the code cell below. However, for such few outliers, using *RobustScaler()* should be enough.

> the centering and scaling statistics of RobustScaler is based on percentiles and are therefore not influenced by a few number of very large marginal outliers.

In [None]:
#filter = (imputed_X_train['Organic_carbon'] < 25) 
#imputed_X_train = imputed_X_train.loc[filter]

#filter = (imputed_X_train['Hardness'] > 50)
#imputed_X_train =imputed_X_train.loc[filter]

#filter = (imputed_X_train['Conductivity'] < 700)
#imputed_X_train = imputed_X_train.loc[filter]
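
As an alternative to hand-picked cut-offs like the ones commented out above, outliers can be flagged with the common 1.5 × IQR rule, the same rule box plots use for their whiskers. This is a sketch on a made-up frame; `iqr_outlier_mask` is a hypothetical helper, and the multiplier `k` is a convention rather than something derived from this dataset.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one obvious outlier per column.
toy = pd.DataFrame({
    "Hardness": [180, 190, 195, 200, 205, 210, 220, 40],
    "Conductivity": [400, 410, 420, 430, 440, 450, 460, 900],
})

outlier_counts = {col: int(iqr_outlier_mask(toy[col]).sum())
                  for col in toy.columns}
print(outlier_counts)
```

Counting flagged rows per feature before deciding whether to drop them gives a sense of how much training data a filter would cost.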

In [None]:
y_train = imputed_X_train.Potability.round()
imputed_X_train.drop('Potability', axis = 1, inplace = True)

y_val = imputed_X_val.Potability.round()
imputed_X_val.drop('Potability', axis = 1, inplace = True)

In [None]:
print(imputed_X_train.shape, y_train.shape)
print(imputed_X_val.shape, y_val.shape)

# Feature Scaling

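Although *StandardScaler()* alone should do the job, the outliers mentioned above inflate the mean and standard deviation it relies on. The cell below therefore applies *RobustScaler()* first and *StandardScaler()* second. Note, however, that both are per-feature linear rescalings, so the chained result is identical to applying *StandardScaler()* on its own; to actually benefit from the median/IQR statistics, *RobustScaler()* would have to be used by itself.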

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler

scaler = RobustScaler()
robust_X_train = scaler.fit_transform(imputed_X_train)
robust_X_val = scaler.transform(imputed_X_val)

sc = StandardScaler()
scaled_X_train = sc.fit_transform(robust_X_train)
scaled_X_val = sc.transform(robust_X_val)
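
Because both scalers only shift and rescale each feature, it is worth checking what the chain actually produces. The sketch below uses synthetic data with a few injected outliers (an assumption, not this dataset) and compares the chained output against each scaler applied alone.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[:5] *= 25  # inject a few large outliers

# RobustScaler followed by StandardScaler, as in the cell above.
chained = StandardScaler().fit_transform(RobustScaler().fit_transform(X_demo))

# Each scaler on its own, for comparison.
standard_only = StandardScaler().fit_transform(X_demo)
robust_only = RobustScaler().fit_transform(X_demo)

same_as_standard = np.allclose(chained, standard_only)
same_as_robust = np.allclose(chained, robust_only)
print("chain equals StandardScaler alone:", same_as_standard)
print("chain equals RobustScaler alone:  ", same_as_robust)
```

Standardising an already shifted-and-rescaled feature yields the same z-scores as standardising the raw feature, which is why the first comparison holds. If outlier robustness is the goal, using `RobustScaler()` alone is the variant that changes the result.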

On passing through the Scalers, our Data Frame has now been converted to a numpy array. So, for convention, we will convert the array back to a Data Frame.

In [None]:
final_X_train = pd.DataFrame(scaled_X_train, index = imputed_X_train.index, columns = imputed_X_train.columns)
final_X_val = pd.DataFrame(scaled_X_val, index = imputed_X_val.index,
                            columns = imputed_X_val.columns)

final_X_train.describe()

The train and validation sets are now ready for model fitting and evaluation!

# **MODEL**

Now comes the main part. As mentioned earlier, we have a dataset that contains randomness to a very high degree. Thus, we cannot pick a single model and expect it to work the best with this dataset. For this reason, I'm going to use *RandomizedSearchCV* to tune hyperparameters across several Machine Learning models. Hopefully, by the end of this, we will have a model that works decently on this dataset.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
lr = LogisticRegression()
xgb = XGBClassifier(eval_metric = 'logloss')
dt = DecisionTreeClassifier()
rfc = RandomForestClassifier()
ada = AdaBoostClassifier()
knn = KNeighborsClassifier()

Choosing the scoring metric for such a dataset can be tricky. Let's first spell out what we expect from our predictions:

**1.** Given a number of water bodies, we don't want a large number of impotable bodies to be classified as potable, as that could lead to widespread outbreaks of disease. In other words, we don't want a large number of False Positives.

**2.** At the same time, we'd want to classify a fair number of potable water bodies as potable, because that is what we're mostly concerned about. 

**3.** But the given dataset is skewed towards 0 class, i.e. impotable water bodies have majority over potable. This could result in a model that classifies most of the water bodies as impotable. Such a model will, in fact, yield a high accuracy on test sets, but would have very little advantage in real-world applications.

**Conclusion:** We are looking for a model with a high ratio of TP to FP, in other words, high precision (TP / (TP + FP)). This is because, practically speaking, the penalty for classifying an impotable sample as potable should be high, and precision captures exactly that. However, with a dataset skewed in favor of the negative class, a model can reach high precision simply by predicting potable very rarely, which comes with low recall, i.e. a large number of False Negatives.
Keeping all this in mind, Matthews Correlation Coefficient seems suitable.

> The Matthews correlation coefficient (MCC) is a highly reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset.
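
To see why MCC is harder to game than accuracy, the sketch below computes both from raw confusion-matrix counts using the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The counts are invented for illustration and `confusion_scores` is a hypothetical helper, not part of scikit-learn.

```python
import math

def confusion_scores(tp, tn, fp, fn):
    """Accuracy, precision and MCC from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, precision, mcc

# Made-up validation set: 400 non-potable and 250 potable samples.
# A model that always predicts "not potable":
always_negative = confusion_scores(tp=0, tn=400, fp=0, fn=250)
# A model that predicts "potable" sparingly but not never:
cautious_model = confusion_scores(tp=100, tn=360, fp=40, fn=150)

for name, (acc, prec, mcc) in [("always negative", always_negative),
                               ("cautious model", cautious_model)]:
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} MCC={mcc:.3f}")
```

The always-negative model still reaches a majority-class accuracy, yet its MCC is zero, which is the behaviour the third point above is worried about.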

In [None]:
from sklearn.metrics import make_scorer, precision_score, matthews_corrcoef, accuracy_score, confusion_matrix, recall_score, roc_auc_score, f1_score
from scipy.stats import uniform, randint

scorer = make_scorer(matthews_corrcoef)

lr_param = {'C' : uniform(0.01,10),
           'penalty' : ['l2'], 'class_weight' : ['balanced']}
lr_rs = RandomizedSearchCV(lr, param_distributions = lr_param, scoring = scorer, cv = 5)


xgb_param = {'n_estimators' : [50, 100, 200, 300, 400, 500],
            'learning_rate' : uniform(0.03,1.0)}
xgb_rs = RandomizedSearchCV(xgb, param_distributions = xgb_param, scoring = scorer, cv = 5)

dt_param = {'criterion' : ['gini','entropy'], 
            'max_depth' : np.arange(1,50), 
            'min_samples_leaf' : [2, 3, 4, 5, 10, 20, 30, 40, 50]}
dt_rs = RandomizedSearchCV(dt, param_distributions = dt_param, scoring = scorer, cv=5)


rfc_param = {'n_estimators' : [50, 100, 200, 400, 500],
            'max_depth' : np.arange(1,50),
            'min_samples_leaf' : [2, 3, 4, 5, 10, 20, 30, 40, 50],
            'criterion' : ['entropy', 'gini']}
rfc_rs = RandomizedSearchCV(rfc, param_distributions = rfc_param, scoring = scorer, cv = 5)

ada_param = {'n_estimators' : [50, 100, 200, 400, 500],
            'learning_rate' : uniform(0.01, 1.0)}
ada_rs = RandomizedSearchCV(ada, param_distributions = ada_param, scoring = scorer, cv = 5)


knn_param = {'n_neighbors' : np.arange(1,50),
            'weights' : ['uniform', 'distance']}
knn_rs = RandomizedSearchCV(knn, param_distributions = knn_param, scoring = scorer, cv = 5)
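
One detail of the search spaces above is easy to misread: `scipy.stats.uniform` takes `(loc, scale)` rather than `(low, high)`, so a call such as `uniform(0.01, 1.0)` samples from the interval [0.01, 1.01]. The sketch below checks this empirically; the `capped` variant is just one way to keep the upper end at 1.0, should that be the intent.

```python
from scipy.stats import uniform

loc, scale = 0.01, 1.0

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale].
draws = uniform(loc, scale).rvs(size=10_000, random_state=0)
print("min:", draws.min(), "max:", draws.max())

# To cap the range at 1.0, set scale = upper - lower.
capped = uniform(loc, 1.0 - loc).rvs(size=10_000, random_state=0)
print("capped max:", capped.max())
```

For learning rates the difference is small, but it matters for any parameter that has a hard upper bound.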

In [None]:
lr_rs.fit(final_X_train, y_train)
xgb_rs.fit(final_X_train, y_train)
dt_rs.fit(final_X_train, y_train)
rfc_rs.fit(final_X_train, y_train)
ada_rs.fit(final_X_train, y_train)
knn_rs.fit(final_X_train, y_train)

print('Logistic Regression best parameters:', lr_rs.best_params_)
print('XGB best parameters:', xgb_rs.best_params_)
print('Decision Tree best parameters:', dt_rs.best_params_)
print('RFC best parameters:', rfc_rs.best_params_)
print('Ada Boost best parameters:', ada_rs.best_params_)
print('KNN best parameters:', knn_rs.best_params_)

We have the best set of parameters for each model.

Applying these parameters to the respective models:

In [None]:
lr = LogisticRegression(C = lr_rs.best_params_['C'], penalty = lr_rs.best_params_['penalty'], 
                        class_weight = lr_rs.best_params_['class_weight'], 
                        random_state = 42)

xgb = XGBClassifier(n_estimators = xgb_rs.best_params_['n_estimators'], 
                    learning_rate = xgb_rs.best_params_['learning_rate'], 
                    random_state = 42, eval_metric = 'logloss')

dt = DecisionTreeClassifier(min_samples_leaf = dt_rs.best_params_['min_samples_leaf'], 
                            max_depth = dt_rs.best_params_['max_depth'], 
                            criterion = dt_rs.best_params_['criterion'], 
                            random_state = 42)

rfc = RandomForestClassifier(n_estimators = rfc_rs.best_params_['n_estimators'], 
                             criterion = rfc_rs.best_params_['criterion'], 
                             max_depth = rfc_rs.best_params_['max_depth'], 
                             min_samples_leaf = rfc_rs.best_params_['min_samples_leaf'], 
                             random_state = 42)

ada = AdaBoostClassifier(n_estimators = ada_rs.best_params_['n_estimators'], 
                         learning_rate = ada_rs.best_params_['learning_rate'], 
                         random_state = 42)

knn = KNeighborsClassifier(n_neighbors=  knn_rs.best_params_['n_neighbors'], 
                           weights = knn_rs.best_params_['weights'])

models = [(lr, 'Logistic Regression'), (xgb, 'XG Boost'), (dt, 'Decision Tree'), 
          (rfc, 'Random Forest'), (ada, 'Ada Boost'), (knn, 'K Neighbors')]

In [None]:
#dataframe to keep track of scores of various models
evaluations = pd.DataFrame({'Model' : [], 'F1' : [], 'MCC' : [], 'Precision' : [],
                            'Accuracy' : [], 'Recall' : [], 'AUC' : []})

#function that evaluates and returns different scores obtained by a model
def evaluate(actual, preds):
    f1 = f1_score(actual, preds, average = 'binary')
    mcc = matthews_corrcoef(actual, preds)
    precision  = precision_score(actual, preds)
    accuracy = accuracy_score(actual, preds)
    recall = recall_score(actual, preds)
    auc = roc_auc_score(actual, preds)
    confusion = confusion_matrix(actual, preds)
    return (f1, mcc, precision, accuracy, recall, auc, confusion)

for model, model_name in models:
    model.fit(final_X_train, y_train)
    preds = model.predict(final_X_val)
    f1, mcc, precision, accuracy, recall, auc, confusion = evaluate(y_val, preds)
    cur_model = {'Model' : model_name, 'F1' : f1, 'MCC' : mcc, 'Precision' : precision,
                 'Accuracy' : accuracy, 'Recall' : recall, 'AUC' : auc}
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    evaluations = pd.concat([evaluations, pd.DataFrame([cur_model])], ignore_index = True)
    print(model_name, 'Confusion Matrix:')
    print(confusion)
    #print('Model: {} f1: {:.3f} accuracy: {:.3f}'.format(model_name, f1, accuracy))
    
evaluations.set_index('Model', inplace = True)
evaluations
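
A note on the AUC column: `evaluate` passes hard class predictions to `roc_auc_score`. For binary labels that reduces the ROC curve to a single operating point, and the result equals balanced accuracy. The sketch below, on made-up labels and scores, contrasts this with the AUC computed from continuous scores, such as those `predict_proba` would return.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
# Made-up predicted probabilities for the positive class.
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.45, 0.9, 0.2, 0.3, 0.55])
hard = (scores >= 0.5).astype(int)  # thresholded class predictions

auc_from_labels = roc_auc_score(y_true, hard)    # what evaluate() computes
auc_from_scores = roc_auc_score(y_true, scores)  # uses the full ranking
bal_acc = balanced_accuracy_score(y_true, hard)

print("AUC from hard labels:", auc_from_labels)
print("balanced accuracy:   ", bal_acc)
print("AUC from scores:     ", auc_from_scores)
```

For the models in this notebook, passing `model.predict_proba(final_X_val)[:, 1]` to `roc_auc_score` would give the ranking-based AUC; all six classifiers used here expose `predict_proba`.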

**1.** Except for Logistic Regression, the MCC scores of the models turned out to be acceptable.

**2.** Although Decision Tree has a comparatively good F1 score, that score does not reflect its large number of False Positives.

**3.** Ada Boost gave better than random results, but just not good enough.

Based on these results, XG Boost Classifier, Random Forest Classifier, and K Nearest Neighbors Classifier have performed slightly better than the rest of the models. All three gave similar results, so any of them is suitable as our final model.

# Conclusion

Although the stats are not very impressive from a general perspective, they came out better than expected for this particular dataset.

Final Model: **Random Forest Classifier, XGB Classifier, or K Neighbors Classifier**

Expected test set MCC: **~0.20-0.25**

Expected test set Accuracy: **~65-70%**