# Predicting water potability

The dataset for this notebook comes from Kaggle: https://www.kaggle.com/adityakadiwal/water-potability
<br>

The columns in the dataset are:
<br>
1. pH value
2. Hardness
3. Solids (Total dissolved solids - TDS)
4. Chloramines
5. Sulfate
6. Conductivity
7. Organic_carbon
8. Trihalomethanes
9. Turbidity
10. Potability (the target: 0 = not potable, 1 = potable)


The purpose of the following notebook is:
* To analyse the given dataset using various classification models
* To predict the potability of water on the test set using the best model
* To improve the model using exhaustive search by GridSearchCV

## Importing the tools needed

In [None]:
# importing mathematical and analytical tools
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# importing classification models
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.neighbors import KNeighborsClassifier

# importing evaluation tools
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
# plot_roc_curve is not imported: it is unused in this notebook and was removed in scikit-learn 1.2
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

In [None]:
# importing the data
df = pd.read_csv("water_potability.csv")

In [None]:
# shuffling the data
df = df.sample(frac = 1)

## Splitting the data into train and test set
80 percent of the data is used for training and the rest for testing.

In [None]:
x = df.drop("Potability",axis=1)
y = df.Potability
x_train , x_test , y_train , y_test = train_test_split(x,y,test_size=0.2,random_state=42)
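When the two classes are not equally frequent, a plain random split can leave the train and test sets with slightly different class ratios. The following is a minimal sketch, not part of the original notebook, of a stratified split using the `stratify` argument of `train_test_split`. The `demo` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 1000 rows, roughly 40% positive labels
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ph": rng.normal(7, 1.5, 1000),
    "Hardness": rng.normal(196, 33, 1000),
    "Potability": (rng.random(1000) < 0.4).astype(int),
})

x_demo = demo.drop("Potability", axis=1)
y_demo = demo["Potability"]

# stratify=y_demo keeps the class ratio (almost) identical in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"positive share overall: {y_demo.mean():.3f}")
print(f"positive share train:   {y_tr.mean():.3f}")
print(f"positive share test:    {y_te.mean():.3f}")
```

In this notebook the same idea would amount to passing `stratify=y` to the existing call.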

In [None]:
df.info()

## Preprocessing the data
Here we fill the missing values in our dataset. Since the entire dataset is numerical, we do not need to perform any categorical transformations.

In [None]:
# function to fill missing values
def fill_missing(df):
    # work on a copy so that the slice returned by train_test_split is not modified in place
    df = df.copy()
    for column in df.columns:
        if df[column].isna().any():
            # replace the missing values in this column with the column's median
            df[column] = df[column].fillna(df[column].median())
    return df

In [None]:
x_train_filled = fill_missing(x_train)
for column in x_train_filled.columns:
    # convert every column to integers (note that this truncates the decimal part)
    x_train_filled[column] = x_train_filled[column].astype(int)

In [None]:
# checking for missing values in preprocessed data
x_train_filled.isna().sum()
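The `fill_missing` function above fills each frame with its own medians. A common alternative, sketched here as an assumption rather than as the notebook's method, is to learn the medians on the training set only and reuse them for the test set, using scikit-learn's `SimpleImputer`. The two toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for x_train and x_test
train_demo = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0, 7.0],
    "Sulfate": [300.0, 320.0, np.nan, 340.0],
})
test_demo = pd.DataFrame({
    "ph": [np.nan, 9.0],
    "Sulfate": [np.nan, 310.0],
})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the medians from the training data and fills it
train_filled = pd.DataFrame(
    imputer.fit_transform(train_demo),
    columns=train_demo.columns,
    index=train_demo.index,
)

# transform reuses the training medians, so nothing is learned from the test data
test_filled = pd.DataFrame(
    imputer.transform(test_demo),
    columns=test_demo.columns,
    index=test_demo.index,
)

print(imputer.statistics_)  # per-column medians learned from train_demo
print(test_filled)
```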

## Modelling
Now we fit the different models imported in this notebook to our preprocessed training data, i.e.
1. Linear SVC
2. Logistic Regression
3. Random Forest Classifier
4. K Nearest Neighbor Classifier

In [None]:
model_1 = LinearSVC()
model_2 = LogisticRegression()
model_3 = RandomForestClassifier()
model_4 = KNeighborsClassifier()

In [None]:
# fitting and scoring the training set
model_1.fit(x_train_filled,y_train)
score_1 = model_1.score(x_train_filled,y_train)


In [None]:
model_2.fit(x_train_filled,y_train)
score_2 = model_2.score(x_train_filled,y_train)

In [None]:
model_3.fit(x_train_filled,y_train)
score_3 = model_3.score(x_train_filled,y_train)

In [None]:
model_4.fit(x_train_filled,y_train)
score_4 = model_4.score(x_train_filled,y_train)

Plotting a bar graph of the scores...

In [None]:
score_data = {"LinearSVC" : score_1,
              "LogisticRegression" : score_2,
              "RandomForestClassifier" : score_3,
              "KNeighborsClassifier" : score_4}
fig , ax = plt.subplots(figsize = (10,8))
ax.bar(score_data.keys() , score_data.values())
ax.set(ylabel = "Score on training data",
      title = "Comparison of the classification models scores on training data",
      ylim = (0,1.2));

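From the bar graph above, the Random Forest Classifier scores highest on the training data, so we shall use it to predict the target on the test data. Keep in mind that a training-set score is an optimistic estimate, since a random forest can nearly memorise the data it was fitted on.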
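A training score alone cannot tell a model that generalises from one that has memorised the data. As a hedged sketch, and not what this notebook does, the models could also be compared with k-fold cross-validation via `cross_val_score`. The data here is synthetic (`make_classification`), and the `max_iter` and `n_estimators` settings are arbitrary choices for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed training data (9 features, 2 classes)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# Mean accuracy over 5 held-out folds for each candidate
cv_scores = {}
for name, model in candidates.items():
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name}: mean CV accuracy = {cv_scores[name]:.3f}")

# For contrast: the score of the random forest on the data it was fitted on
rf = candidates["RandomForestClassifier"].fit(X_demo, y_demo)
rf_train_score = rf.score(X_demo, y_demo)
print(f"RandomForestClassifier training accuracy = {rf_train_score:.3f}")
```

The gap between the training accuracy and the cross-validated accuracy gives a rough idea of how much the training score overstates performance.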

## Predictions on test data
Now that we have chosen a model, we can preprocess the test data as well: we pass it through the fill_missing function and convert it to integer format.


In [None]:
x_test_filled = fill_missing(x_test)
for column in x_test_filled.columns:
    x_test_filled[column] = x_test_filled[column].astype(int)

Evaluating test data

In [None]:
model_3_before_tuning = model_3.score(x_test_filled , y_test)
model_3_before_tuning

Clearly, the model has performed poorly on the test data, as it correctly predicted only 63 percent of the samples. Therefore, we need to perform hyperparameter tuning on the random forest classifier.
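To judge whether an accuracy figure is poor, it helps to compare it with a trivial baseline that always predicts the most frequent class. Below is a minimal sketch using `DummyClassifier`. The 60/40 class balance and the random features are assumptions made for the demo, not figures from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: 500 samples, 9 features, 60% class 0 and 40% class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 9))
y_demo = np.array([0] * 300 + [1] * 200)

# The baseline ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

In the notebook, the same baseline could be fitted on `x_train_filled`, `y_train` and scored on `x_test_filled`, `y_test`.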

## Hyperparameter tuning
We shall now define a grid of candidate values for the hyperparameters of the random forest model, such as the number of estimators and the maximum depth, which will be used by grid search CV.

In [None]:
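# note: for a classifier "auto" behaves like "sqrt"; it was deprecated in scikit-learn 1.1
# and removed in 1.3, so drop it from the list on newer versions
grid = {"criterion" : ["gini" , "entropy"],
       "n_estimators" : [50,70,90,110,130,150],
       "min_samples_split" : np.arange(2,8,2),
       "max_features" : ["auto", "sqrt", "log2"],
       "max_depth": [None, 5, 10, 20, 30],
       "min_samples_leaf": [1, 2, 4]}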

In [None]:
# defining a new classifier model that uses grid search CV
gs_clf = GridSearchCV(estimator=model_3,
                      param_grid=grid, 
                      cv=5,
                      verbose=2)
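Before launching an exhaustive search, it can be useful to check how large it is. This is a small sketch using `ParameterGrid`, which enumerates the combinations without fitting anything. `grid_demo` repeats the values of the grid above as plain lists.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid defined above
grid_demo = {
    "criterion": ["gini", "entropy"],
    "n_estimators": [50, 70, 90, 110, 130, 150],
    "min_samples_split": [2, 4, 6],  # same values as np.arange(2, 8, 2)
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [None, 5, 10, 20, 30],
    "min_samples_leaf": [1, 2, 4],
}

n_combinations = len(ParameterGrid(grid_demo))  # 2 * 6 * 3 * 3 * 5 * 3
n_fits = n_combinations * 5                     # each combination is fitted once per fold (cv=5)

print(f"{n_combinations} combinations, {n_fits} model fits")
```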

## Warning
The cell below can take hours to run, as it is an exhaustive search over 1,620 hyperparameter combinations, each fitted 5 times for cross-validation (8,100 fits in total). It took almost 2 hours on my system with an 8th-gen i5 processor and a 512 GB SSD. If you wish to avoid the wait, please consider the final parameters given as comments two cells after it.

In [None]:
gs_clf.fit(x_train_filled , y_train)
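If the exhaustive search is too slow, a randomized search is one possible alternative: it samples a fixed number of combinations instead of trying them all. The sketch below uses `RandomizedSearchCV` on synthetic data. It is a different technique from the grid search used in this notebook, and the values of `n_iter` and `cv` are arbitrary choices to keep the demo quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

param_distributions = {
    "criterion": ["gini", "entropy"],
    "n_estimators": [50, 70, 90, 110, 130, 150],
    "min_samples_split": [2, 4, 6],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5, 10, 20, 30],
    "min_samples_leaf": [1, 2, 4],
}

# n_iter=5 samples only 5 combinations from the whole grid
rs_clf = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
rs_clf.fit(X_demo, y_demo)

print(rs_clf.best_params_)
print(f"best mean CV accuracy: {rs_clf.best_score_:.3f}")
```

Both `GridSearchCV` and `RandomizedSearchCV` also accept `n_jobs=-1` to spread the fits over all CPU cores, which can shorten the wait considerably.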

In [None]:
gs_clf.best_params_

In [None]:
#best parameters = {'criterion': 'entropy',
#                   'max_depth': 30,
#                   'max_features': 'log2',
#                   'min_samples_leaf': 4,
#                   'min_samples_split': 4,
#                   'n_estimators': 110}

# This set of parameters will most probably suit your run as well. If it doesn't, sadly you will have to run the exhaustive search above.
# Either way, implement the resulting parameters in the gs_clf model and then proceed further.
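One way to use the parameters reported above without rerunning the search is to pass them straight to a new `RandomForestClassifier`. This is a sketch under assumptions: the data is synthetic, and `tuned_clf` and `random_state=42` are names and values introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Best parameters as reported in the comments above
best_params = {
    "criterion": "entropy",
    "max_depth": 30,
    "max_features": "log2",
    "min_samples_leaf": 4,
    "min_samples_split": 4,
    "n_estimators": 110,
}

# Synthetic stand-in for x_train_filled and y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

# Build the forest directly from the dictionary of parameters and fit it
tuned_clf = RandomForestClassifier(**best_params, random_state=42)
tuned_clf.fit(X_demo, y_demo)

print(tuned_clf.get_params()["n_estimators"], "trees fitted")
```

In the notebook, such a model would be fitted on `x_train_filled`, `y_train` and would then stand in for `gs_clf` in the cells that follow.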

In [None]:
# predicted potability values after tuning
gs_y_preds = gs_clf.predict(x_test_filled)

In [None]:
gs_y_preds

In [None]:
# scoring our model's performance on test data
model_3_after_tuning = gs_clf.score(x_test_filled , y_test)

In [None]:
model_3_after_tuning

## Woohoo!!
As we can see, hyperparameter tuning through grid search CV significantly improved our model's performance on the test data. We can depict this graphically, since a graph makes the comparison easier to see.

In [None]:
data = {"before tuning" : model_3_before_tuning,
        "after tuning" : model_3_after_tuning}
fig , ax = plt.subplots(figsize = (10,8))
ax.bar(data.keys() , data.values())
ax.set(ylabel = "Score on test data",
      title = "Comparison before and after tuning",
      ylim = (0,1.2));
ax.axhline(y = model_3_before_tuning , color = "r", linestyle = "--")
ax.axhline(y = model_3_after_tuning , color = "g", linestyle = "--");

## Checking other metrics
Our last step is to check other metrics such as accuracy, precision, recall and F1 to make sure that our model is classifying the data correctly and is not overfitting or underfitting.

In [None]:
print(f"Accuracy: {accuracy_score(y_test, gs_y_preds)*100:.2f}%")
print(f"Precision: {precision_score(y_test, gs_y_preds)*100:.2f}%")
print(f"Recall: {recall_score(y_test, gs_y_preds)*100:.2f}%")
print(f"F1: {f1_score(y_test, gs_y_preds)*100:.2f}%")
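The four numbers above are all derived from the confusion matrix, so looking at it directly can make them easier to interpret. Here is a minimal sketch with `confusion_matrix` and `classification_report`. The two label lists are made-up toy values, not predictions from this notebook.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions (0 = not potable, 1 = potable)
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision, recall and F1 for each class in one table
print(classification_report(
    y_true_demo, y_pred_demo, target_names=["not potable", "potable"]
))
```

In the notebook, the same calls would take `y_test` and `gs_y_preds`.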

We have got pretty good results for these metrics as well, so now we can rest assured and pat ourselves on the back for drastically improving the accuracy of our model.

# Thank you