# Downloading the Dataset From Kaggle

I am **Alptug Aydin**.

Date: 12.08.2021

Outline:

- Necessary imports

- Data Loading

- Exploring Data:
    - Check for categorical, duplicated, or missing data
    - Deeper look into missing data
    - Imputing missing values with interpolation

- Split, Scale, Balance
    - Scale with MinMaxScaler
    
- Train and test with various models: 
    - Logistic Regression
    - Decision Tree
    - Random Forest
    - KNN
    - SVM
    - Simple NN


Note: This is my first notebook on Kaggle, written after recently finishing a Data Science course. I would be glad to hear your comments and criticism. Thanks in advance for any advice.

# Necessary Imports

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

# Loading and Exploring the Data

In [None]:
df = pd.read_csv("../input/water-potability/water_potability.csv")
df.head()

In [None]:
df.tail()

In [None]:
df.describe().transpose()

In [None]:
plt.figure(figsize = (12,8))

sns.heatmap(df.corr(),
           annot = True,
           cmap = 'magma')

In [None]:
df.select_dtypes(['object']).columns

There is **no categorical data** to handle in the preprocessing phase.



In [None]:
df.duplicated().value_counts()

**No duplicates.**

In [None]:
plt.figure(figsize = (8,6))
sns.countplot(x = df['Potability'])

In [None]:
perc = 100 * df['Potability'].value_counts() / len(df)
perc


The dataset is **not** perfectly balanced:

- Not Potable 60.99%
- Potable 39.01%
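
The same percentages can also be read directly from `value_counts(normalize=True)`. Below is a minimal sketch on a made-up label column; the toy values are only for illustration and are not the real dataset.

```python
import pandas as pd

# Hypothetical toy labels: 3 "not potable" (0) and 2 "potable" (1) samples.
toy = pd.DataFrame({"Potability": [0, 0, 0, 1, 1]})

# normalize=True returns class fractions instead of raw counts.
toy_perc = toy["Potability"].value_counts(normalize=True) * 100
print(toy_perc.round(2))
```

On the real dataframe the call would be the same, only with `df` instead of `toy`.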



In [None]:
len(df)

In [None]:
df.info()

The **ph, Sulfate, and Trihalomethanes** columns contain null values. Let's count and visualize them.

In [None]:
df.isnull().sum()

Missing value counts per column.

In [None]:
null_percentages = df.isnull().sum() * 100 / len(df)
null_percentages.sort_values(ascending=False)

Percentage of missing values in each column.

In [None]:
# visualize null variables as heatmap
plt.figure(figsize = (12, 6))
sns.heatmap(df.isnull(),
            yticklabels = False,
            cbar = False,
            cmap = 'viridis')

As shown, **23.84% of Sulfate, 14.99% of ph, and 4.95% of Trihalomethanes** values are missing. Before processing them, I want to examine them a bit more deeply.
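
A shorter route to per-column missing percentages is to take the mean of the boolean mask, since `True` counts as 1. This is a small sketch on a hypothetical toy frame: the column names mirror the dataset, the values do not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real data.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [np.nan, np.nan, 330.0, 310.0],
    "Hardness": [200.0, 180.0, 210.0, 190.0],
})

# isnull() yields booleans, so the column mean is the missing fraction.
toy_null_pct = toy.isnull().mean() * 100
print(toy_null_pct.sort_values(ascending=False))
```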

In [None]:
all_null_rows = df[df.isnull().any(axis=1)]
all_null_rows

In [None]:
one_null_rows = df.loc[df.isnull().sum(1) == 1]
len(one_null_rows)

Number of rows with exactly one missing cell.

In [None]:
two_null_rows = df.loc[df.isnull().sum(1) == 2]
len(two_null_rows)

Number of rows with exactly two missing cells.

In [None]:
three_null_rows = df.loc[df.isnull().sum(1) == 3]
len(three_null_rows)

Number of rows with exactly three missing cells.

Samples with exactly one missing value make up the majority: **1105 / 1265**, or **~87%** of all samples with missing data.
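
The three counts and the share of each can be computed in one pass by counting missing cells per row. This is a minimal sketch on a hypothetical toy frame (made-up values, same column names).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows have 0, 1, 2, 1 and 3 missing cells.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, np.nan, 8.1, np.nan],
    "Sulfate": [330.0, 320.0, np.nan, 310.0, np.nan],
    "Trihalomethanes": [60.0, 70.0, 80.0, np.nan, np.nan],
})

# Number of missing cells in each row, then how often each number occurs.
missing_per_row = toy.isnull().sum(axis=1)
counts = missing_per_row.value_counts().sort_index()

# Keep only rows that miss something and express each group as a percentage.
rows_with_missing = counts.drop(0, errors="ignore")
share = 100 * rows_with_missing / rows_with_missing.sum()
print(share)
```

With the notebook's numbers, the same division gives 100 * 1105 / 1265 ≈ 87.4 for the one-missing-value group.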

In [None]:
cor_mat = df[['ph', 'Sulfate', 'Trihalomethanes', 'Potability']].corr()
cor_mat

In [None]:
plt.figure(figsize = (10,6))
sns.heatmap(cor_mat,
            annot = True,
            cmap = 'coolwarm')

In [None]:
# see https://www.dummies.com/programming/big-data/data-science/how-to-use-python-to-select-the-right-variables-for-data-science/

X = df.dropna().drop('Potability', axis = 1)
y = df.dropna()['Potability']


from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression

Selector_f = SelectPercentile(f_regression, percentile=25)
Selector_f.fit(X,y)

# pair each score with the feature it was computed for (X has no 'Potability' column)
for n, s in zip(X.columns, Selector_f.scores_):
    print('F-score: %3.2f for feature %s ' % (s, n))

In [None]:
F_df = pd.DataFrame(Selector_f.scores_, df.drop('Potability', axis = 1).columns, columns = ['F-score']) # create F-score df
F_df['cor_potability'] = df.corr()['Potability'].drop('Potability') # Add the correlation column with respect to 'Potability'
F_df.sort_values(by = 'F-score', ascending = False)

The F-scores and the correlations are put together in one dataframe.

The correlation and F-score values with respect to 'Potability' are consistent. The **Solids** attribute is relatively more strongly associated with **Potability** than **ph** and **Trihalomethanes**.
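
The agreement is expected: for a single feature, the F-score from `f_regression` is a function of the Pearson correlation r alone, F = r² / (1 − r²) · (n − 2), so a larger |r| always gives a larger F. Here is a small check on synthetic data. The arrays `X_toy` and `y_toy` are made up for illustration and are not the water data.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data: 3 features, binary target loosely driven by the first one.
rng = np.random.default_rng(0)
n = 200
X_toy = rng.normal(size=(n, 3))
y_toy = (X_toy[:, 0] + rng.normal(size=n) > 0).astype(int)

# F-scores as computed by scikit-learn.
f_scores, _ = f_regression(X_toy, y_toy)

# The same scores rebuilt from the Pearson correlation of each feature with y.
r = np.array([np.corrcoef(X_toy[:, j], y_toy)[0, 1] for j in range(X_toy.shape[1])])
f_from_r = r**2 / (1 - r**2) * (n - 2)

print(np.allclose(f_scores, f_from_r))
```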

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, 
                               sharey = False, 
                               figsize = (14,6))

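# distplot is deprecated in recent seaborn versions; histplot with a KDE curve replaces it
sns.histplot(F_df['F-score'], 
             kde = True,
             stat = 'density',
             ax = ax1)

sns.histplot(F_df['cor_potability'], 
             kde = True,
             stat = 'density',
             ax = ax2)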

In [None]:
null_rows_SO = df.loc[(df.isnull().sum(1) == 1) & (df['Sulfate'].isnull())]

null_rows_PH = df.loc[(df.isnull().sum(1) == 1) & (df['ph'].isnull())]

null_rows_TRI = df.loc[(df.isnull().sum(1) == 1) & (df['Trihalomethanes'].isnull())]


one_miss = pd.DataFrame([len(null_rows_SO), len(null_rows_PH), len(null_rows_TRI)], 
                    ['Sulfate','ph','Trihalomethanes'], 
                    columns = ['Sample Count'])
one_miss

This table shows, for each of these attributes, the number of samples in which it is the only missing value. As a reminder, here is what such samples look like:

In [None]:
df.loc[(df.isnull().sum(1) == 1)].iloc[[0,1,58]]

## Handling Missing Values

In [None]:
df = df.interpolate(limit_direction = 'both') # 'both' also fills NaNs at the start and end of each column
missing_rows = df.loc[(df.isnull().sum(axis = 1) > 0)] # rows that still contain a missing value
print("Missing count: ", len(missing_rows))
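
To see what the default (linear) interpolation does, here is a tiny sketch on a made-up Series. Inner gaps are filled from the neighbouring rows, and with `limit_direction='both'` the leading and trailing gaps take the nearest valid value. Note that the filled values depend on the row order of the file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with gaps at the start, in the middle and at the end.
toy = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Default method is 'linear'; 'both' fills in both directions, including the edges.
filled = toy.interpolate(limit_direction="both")
print(filled.tolist())
```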

## Balancing

In [None]:
df_copy = df.copy() # deep copy

In [None]:
print(df_copy['Potability'].value_counts(), "\n")

plt.figure(figsize = (8,6))
sns.countplot(x = df_copy['Potability'])

Before balancing

In [None]:
zero, one = df['Potability'].value_counts().sort_index() # counts of class 0 and class 1, in that order
df_zero = df[df['Potability'] == 0]
df_one = df[df['Potability'] == 1]

df_sampled = df_one.sample(zero, replace=True)
df_sampled = pd.concat([df_zero, df_sampled], axis=0)

print(df_sampled['Potability'].value_counts(), "\n")

plt.figure(figsize = (8,6))
sns.countplot(x = df_sampled['Potability'])

Random oversampling: Potable samples are drawn with replacement until both classes have the same size.

In [None]:
zero, one = df['Potability'].value_counts().sort_index() # counts of class 0 and class 1, in that order
df_zero = df[df['Potability'] == 0]
df_one = df[df['Potability'] == 1]

df_sampled = df_zero.sample(one)
df_sampled = pd.concat([df_one, df_sampled], axis=0)

print(df_sampled['Potability'].value_counts(), "\n")

plt.figure(figsize = (8,6))
sns.countplot(x = df_sampled['Potability'])

Random undersampling: Not Potable samples are drawn without replacement until both classes have the same size.

In [None]:
balance = False

if balance: 
    df = df_sampled
else:
    df = df_copy

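Set the 'balance' variable to True if you want to continue with a balanced dataset. At least one of the sampling cells above must have been executed first. If both were run, `df_sampled` holds the result of the most recent one (undersampling when the notebook is run top to bottom).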
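
One thing to keep in mind: the sampling cells run before the train/test split, so with oversampling, copies of the same row can land in both the training and the test set. Below is a minimal sketch of an alternative, which is not what this notebook does: split first, then oversample only the training part. The toy frame and the use of `sklearn.utils.resample` are my assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data with a 60/40 class mix.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "Potability": [0] * 60 + [1] * 40,
})

# Split first, so the test set only contains original, unseen rows.
train, test = train_test_split(toy, test_size=0.30, random_state=101,
                               stratify=toy["Potability"])

majority = train[train["Potability"] == 0]
minority = train[train["Potability"] == 1]

# Oversample the minority class inside the training set only.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=101)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Potability"].value_counts())
print(test["Potability"].value_counts())
```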

# Train Test Split and Scale

In [None]:
X = df.drop('Potability', axis = 1)
y = df['Potability']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.30,
                                                    random_state = 101)

In [None]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
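
`MinMaxScaler` maps each feature to (x − min) / (max − min), where min and max come from the data it was fitted on. Because the scaler is fitted on the training set only, test values can fall outside [0, 1]. Here is a tiny sketch with made-up numbers (the `toy_` names are only for this example).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
toy_train = np.array([[1.0], [3.0], [5.0]])
toy_test = np.array([[2.0], [7.0]])

toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns min = 1, max = 5
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training min and max

print(toy_train_scaled.ravel())
print(toy_test_scaled.ravel())
```

The test value 7 lies above the training maximum, so it is scaled to (7 − 1) / (5 − 1) = 1.5.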

# Train and Evaluate

## Logistic Regression

In [None]:
log_model = LogisticRegression() 

log_model.fit(X_train, y_train)

predictions = log_model.predict(X_test)

print(classification_report(y_test, predictions))
print("Confusion Matrix: \n", confusion_matrix(y_test, predictions))
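
As a reminder of how the numbers in the classification report relate to the confusion matrix, here is a small sketch with made-up labels. For binary labels 0/1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 0])

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the samples predicted potable, how many really are
recall = tp / (tp + fn)     # of the really potable samples, how many were found

print(accuracy, precision, recall)
```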

## Decision Tree

In [None]:
dtree = DecisionTreeClassifier()

dtree.fit(X_train, y_train)

predictions = dtree.predict(X_test)

print(classification_report(y_test, predictions))

print("Confusion Matrix: \n", confusion_matrix(y_test, predictions))

## Random Forest

In [None]:
rfc = RandomForestClassifier(n_estimators = 20)

rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)

print(classification_report(y_test, rfc_pred))

print("Confusion Matrix: \n", confusion_matrix(y_test, rfc_pred))

## KNN 

In [None]:
errors = []

for k in range(1, 40):

    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)

    current_error = np.mean(pred_k != y_test) # fraction of misclassified test samples
    errors.append(current_error)

In [None]:
plt.figure(figsize = (10,6))
plt.plot(range(1,40), 
           errors,
           marker = 'o',
           markerfacecolor='red',
           markersize = 10)
plt.title('Error Rate vs K')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
knn = KNeighborsClassifier(n_neighbors = 37) 
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

print(classification_report(y_test, predictions))

print("Confusion Matrix: \n", confusion_matrix(y_test, predictions))
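
The loop above picks k by looking at the test error, which makes the final test score a bit optimistic. One possible alternative is to choose k with cross-validation on the training data only. This is a minimal sketch using synthetic data from `make_classification`; the data, the odd-k grid, and the 5 folds are my assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training set (9 features, like the water data).
X_toy, y_toy = make_classification(n_samples=300, n_features=9, random_state=101)

k_values = list(range(1, 40, 2))
cv_errors = []
for k in k_values:
    knn_cv = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data; the test set is never touched.
    scores = cross_val_score(knn_cv, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_errors.append(1 - scores.mean())

best_k = k_values[int(np.argmin(cv_errors))]
print("k with the lowest cross-validated error:", best_k)
```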

## SVM with GridSearchCV

In [None]:
param_grid = {'C': [0.1, 1, 10],
              'gamma': [1, 0.1, 0.01]}
grid = GridSearchCV(SVC(),
                    param_grid,
                    verbose = 3)
grid.fit(X_train, y_train)

In [None]:
print(grid.best_params_, "\n")
print(grid.best_estimator_)

In [None]:
grid_predictions = grid.predict(X_test)

print(classification_report(y_test, grid_predictions))

print("Confusion Matrix: \n", confusion_matrix(y_test, grid_predictions))
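
To get a feeling for the cost of the grid search: every combination in `param_grid` is fitted once per cross-validation fold. Here is a small sketch that counts the fits, assuming the default of 5 folds that `GridSearchCV` uses when `cv` is not given (scikit-learn 0.22 and later).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated here so the example is self-contained.
param_grid_toy = {"C": [0.1, 1, 10],
                  "gamma": [1, 0.1, 0.01]}

n_candidates = len(ParameterGrid(param_grid_toy))  # 3 values of C x 3 values of gamma
n_folds = 5                                        # assumed default number of folds
total_fits = n_candidates * n_folds

print(n_candidates, "candidates,", total_fits, "fits")
```

After the search, the best combination is fitted once more on the whole training set, which is why `grid.predict` can be used directly.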

## Simple NN

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
X = df.drop('Potability', axis = 1).values
y = df['Potability'].values


X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.30,
                                                    random_state = 101)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
X_train.shape

In [None]:
model = Sequential()

model.add(Dense(units = 9, activation = 'relu'))
model.add(Dropout(0.2))

model.add(Dense(units = 18, activation = 'relu'))
model.add(Dropout(0.2))

model.add(Dense(units = 36, activation = 'relu'))
model.add(Dropout(0.2))

model.add(Dense(units = 1, activation = 'sigmoid'))

model.compile(optimizer = 'adam',
              loss = 'binary_crossentropy')

In [None]:
# monitor the validation loss, which should be minimized (mode = 'min')
early_stop = EarlyStopping(monitor = 'val_loss',
                           mode = 'min',
                           verbose = 1,
                           patience = 25) # stop after 25 epochs without improvement
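
The idea behind `patience` can be sketched in a few lines of plain Python. This is a simplified illustration, not Keras's actual implementation: the helper `stopping_epoch` is hypothetical and ignores options such as `min_delta`.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 5.
toy_losses = [0.70, 0.65, 0.66, 0.67, 0.64, 0.65, 0.66, 0.67]
print(stopping_epoch(toy_losses, patience=3))
```

With a patience of 25, a run of only 8 epochs like this one would never be stopped early.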

In [None]:
model.fit(x = X_train,
          y = y_train,
          epochs = 250,
          validation_data = (X_test, y_test),
          callbacks = [early_stop],
          verbose = 0) # silent

In [None]:
losses = pd.DataFrame(model.history.history)

print(losses.head(), "\n")
print(losses.tail())

In [None]:
losses.plot(figsize = (12, 8))

In [None]:
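# predict_classes is not available in recent TensorFlow releases; threshold the sigmoid output at 0.5 instead
predictions = (model.predict(X_test) > 0.5).astype("int32")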

print(classification_report(y_test, predictions))

print("Confusion Matrix: \n", confusion_matrix(y_test, predictions))

## Comparison plot

In [None]:
model_types = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'KNN', 'SVM', 'Simple NN']
accuracies = [0.61, .58, .67, .66, .67, .67]
pd.DataFrame(accuracies, model_types, columns = ['Accuracy']).sort_values(by = 'Accuracy', ascending = False)

In [None]:
plt.figure(figsize = (12, 6) )
sns.barplot(x = model_types, y = accuracies) # recent seaborn versions require keyword arguments
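
A useful reference point for these accuracies is the majority-class baseline. Since about 61% of the samples are Not Potable, a model that always predicts 0 already scores around 0.61 on data with that class mix. Here is a minimal sketch with `DummyClassifier` on made-up labels that mimic the 61/39 split. The real test split may differ slightly.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with the same class mix as the dataset; the features are irrelevant here.
y_toy = np.array([0] * 61 + [1] * 39)
X_toy = np.zeros((len(y_toy), 1))

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))

print(round(baseline_acc, 2))
```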
