# Assignment -2 
# Meghana Rao 
# BL.EN.U4CSE18071

# <center> Water Potability Prediction</center> 

# Dataset 
https://www.kaggle.com/sharomeethan/water-potability-classifier/

The dataset contains 9 attributes/features:
- pH
- Hardness
- Solids
- Chloramines
- Sulfate
- Conductivity
- Organic Carbon
- Trihalomethanes
- Turbidity

All of these attributes are continuous.

The variable to be predicted is Potability (Discrete)  - that defines the consumption state of that particular sample of water.

0 -> not suitable for consumption.

1 -> suitable for consumption.

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import plotly.express as ex
import plotly.graph_objs as go

from sklearn.model_selection import train_test_split, cross_val_score ,StratifiedKFold, GridSearchCV
from sklearn.preprocessing import MinMaxScaler,StandardScaler #To scale the Dataset
from sklearn.pipeline import Pipeline #to assemble several steps that can be cross-validated together while setting different parameters.
from sklearn.metrics import confusion_matrix,roc_curve,accuracy_score, classification_report, roc_auc_score  #to evaluate best model
from sklearn.decomposition import TruncatedSVD,PCA
#Algorithms 
from sklearn.neighbors import KNeighborsClassifier #to get the KNN classifier 
from sklearn.naive_bayes import GaussianNB #to get the Gaussian Naive Bayes Classifier 
from sklearn.linear_model import LogisticRegression #to get Logistic Regression
from sklearn.svm import SVC #to get support vector machine

## Import data

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df_raw = pd.read_csv("../input/water-potability/water_potability.csv")
df_raw

In [None]:
#Generates the descriptive statistics which includes - mean,  standard deviation, etc.
df_raw.describe()

In [None]:
df_raw.info()

In [None]:
#figuring out how many examples are available per class
sns.countplot(x=df_raw["Potability"])
print(f'{df_raw.Potability[df_raw.Potability==1].count()/df_raw.Potability.count()*100:.2f} % of samples are potable (1)')

Here, you can see that the dataset is imbalanced: only about 39% of the samples are potable.

Therefore, the class imbalance has to be taken into account (for example by balancing the dataset) to get more reliable results.
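
One common way to balance classes is to oversample the minority class with replacement. The snippet below is a minimal sketch on synthetic data, not something this notebook does later; the helper name `oversample_minority` and the toy frame are assumptions made for illustration. In practice this would be applied to the training split only, so that duplicated rows never leak into the test set.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

def oversample_minority(frame, target, random_state=0):
    """Return a copy of `frame` in which every class has as many rows as the largest class."""
    counts = frame[target].value_counts()
    largest = int(counts.max())
    parts = []
    for label, count in counts.items():
        rows = frame[frame[target] == label]
        if count < largest:
            # Draw rows with replacement until this class matches the largest one.
            rows = resample(rows, replace=True, n_samples=largest, random_state=random_state)
        parts.append(rows)
    return pd.concat(parts).reset_index(drop=True)

# Toy frame with roughly the same 61/39 class split as the water data (values are synthetic).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.normal(size=100),
                    "Potability": [0] * 61 + [1] * 39})

balanced = oversample_minority(toy, "Potability")
print(balanced["Potability"].value_counts())
```

An alternative that avoids duplicating rows is to pass `class_weight='balanced'` to estimators that support it, such as `LogisticRegression` and `SVC`.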

In [None]:
# Correlation matrix for dataset
plt.figure(figsize=(15,10))
sns.heatmap(df_raw.corr(), annot=True, cmap="inferno")

Since most feature pairs show low correlation, there is little evidence of pairwise collinearity.

### All the features are required for determining the potability!! (Cannot drop any)
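
A correlation heatmap only shows pairwise relationships. A complementary check is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix. The sketch below uses synthetic data; the helper `vif_table` and the column names are hypothetical. Values near 1 suggest a feature is not explained by the others, while large values flag redundancy.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """Variance inflation factor per column: the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif_values = np.diag(np.linalg.inv(corr))
    return pd.Series(vif_values, index=features.columns)

rng = np.random.default_rng(0)
# Three independent columns plus a noisy copy of the first one.
demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
demo["a_copy"] = demo["a"] + rng.normal(scale=0.1, size=500)

vif = vif_table(demo)
print(vif.round(2))  # "a" and "a_copy" get large values, "b" and "c" stay close to 1
```

The same helper could be called on the nine feature columns of `df_raw` after dropping rows with missing values.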


## Checking Missing Data

In [None]:
df_raw.isna().sum()

We have incomplete data for pH, sulfate, and trihalomethanes. We can fill the NA values in each column with a mean of that column.

## Filling missing values by mean of each column. 
Rather than using one overall mean per column, we compute the mean separately for each potability class: missing values in samples with potability 0 are filled with the class-0 mean, and missing values in samples with potability 1 with the class-1 mean. This keeps the distinction between the two classes so that the model can perform better.

### Therefore we use group by to find the 2 different means (for potability 0 and 1) for each column.

In [None]:
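def fill_nan(df):
    # Fill NaNs in every feature column with that column's mean within the sample's Potability class
    for column in df.columns.drop('Potability'):
        df[column] = df[column].fillna(df.groupby('Potability')[column].transform('mean'))
    return df

df = fill_nan(df_raw.copy())  # work on a copy so that df_raw keeps the original NaNs
df.isna().sum()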
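
To make the class-wise fill concrete, here is a small worked example on a toy frame (the values are invented for illustration). Class 0 has observed pH values 6 and 8, so its missing entry becomes 7; class 1 has 7 and 9, so its missing entry becomes 8.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0, 7.0, np.nan, 9.0],
    "Potability": [0, 0, 0, 1, 1, 1],
})

# transform('mean') returns, for every row, the mean of that row's class
class_means = toy.groupby("Potability")["ph"].transform("mean")
filled = toy["ph"].fillna(class_means)
print(filled.tolist())  # [6.0, 7.0, 8.0, 7.0, 8.0, 9.0]
```

One caveat worth keeping in mind: this fill uses the target label, which is not available for new, unlabeled samples, and computing the means before the train/test split lets test rows influence the fill values.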

In [None]:
df.describe()

## Splitting data

In [None]:
#Dropping last column (output)
X = df.drop(['Potability'], axis = 1)
y = df['Potability']

# Splitting
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1111, stratify=y) #stratify=y

## Inputs for splitting the dataset into train and test sets
### test_size
`test_size = 0.3` means that 70% of the data is used for training and 30% for testing.

### random_state
Controls the shuffling applied to the data before the split. Passing an int gives reproducible output across multiple function calls.

### stratify
Ensures that both the train and test sets have the same proportion of examples in each class as the provided `y` array.
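
The effect of `stratify` can be checked on a toy label array with the same 61/39 class split (the arrays `X_demo` and `y_demo` are made up for this sketch). Both resulting splits should keep roughly 39% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic labels: 610 zeros and 390 ones
y_demo = np.array([0] * 610 + [1] * 390)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1111, stratify=y_demo
)

# Share of positives in each split
print(f"train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```

Without `stratify`, the two shares would vary from one random split to the next.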

In [None]:
pp = sns.pairplot(data=df,
                  y_vars=['Potability'],
                  x_vars=['ph', 'Hardness', 'Solids','Chloramines', 'Sulfate'])

pp = sns.pairplot(data=df,
                  y_vars=['Potability'],
                  x_vars=['Conductivity','Organic_carbon', 'Trihalomethanes', 'Turbidity'])

The plots show that the two classes are barely linearly separable on any single feature, so models that rely on a linear decision boundary (such as an SVM with a linear kernel) are likely to perform poorly!
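
To illustrate why linear separability matters for SVMs, the sketch below compares a linear kernel with an RBF kernel on scikit-learn's synthetic `make_circles` data, where one class sits inside the other. This is a toy illustration under assumed settings, not a statement about the water dataset itself.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them
X_c, y_c = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=0)
X_c_tr, X_c_te, y_c_tr, y_c_te = train_test_split(
    X_c, y_c, test_size=0.3, random_state=0, stratify=y_c
)

linear_acc = SVC(kernel="linear").fit(X_c_tr, y_c_tr).score(X_c_te, y_c_te)
rbf_acc = SVC(kernel="rbf").fit(X_c_tr, y_c_tr).score(X_c_te, y_c_te)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

A non-linear kernel can recover structure that a linear boundary misses, which is why the RBF and polynomial kernels are the ones worth trying on data like this.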

## Scaling the dataset 
Transform features by scaling each feature to a given range.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.


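Below, the dataset is scaled with the Min-Max scaler as the first step of the KNN pipeline.

Each feature is first rescaled to the range [0, 1]:
### X_std = (X - X_min) / (X_max - X_min)

where X_min and X_max are the minimum and maximum of that feature on the training set. The scaled dataset is then given by

### X_scaled = X_std * (max - min) + min

where (min, max) is the desired feature range, which is (0, 1) by default.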
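
A small worked example of the formula, using made-up numbers: the scaler learns the minimum (2) and maximum (10) from the training column and applies the same mapping to new data, so test values outside the training range can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[2.0], [4.0], [10.0]])
test = np.array([[6.0], [12.0]])

scaler = MinMaxScaler().fit(train)  # learns min=2 and max=10 from the training data only

# Same computation by hand: (X - X_min) / (X_max - X_min)
manual_train = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)
print(scaled_train.ravel())  # 0, 0.25 and 1
print(scaled_test.ravel())   # 0.5 and 1.25 (12 is above the training maximum)
```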

# <center> KNN</center> 

In [None]:
k = range(1,20,1)
testing_accuracy = []
training_accuracy = []
score = 0

for i in k:
    knn = KNeighborsClassifier(n_neighbors = i)
    pipe_knn = Pipeline([('scale', MinMaxScaler()), ('knn', knn)])
    pipe_knn.fit(X_train, y_train)
    
    y_pred_train = pipe_knn.predict(X_train)
    training_accuracy.append(accuracy_score(y_train, y_pred_train))
    
    y_pred_test = pipe_knn.predict(X_test)
    acc_score = accuracy_score(y_test,y_pred_test)
    testing_accuracy.append(acc_score)
    
    if score < acc_score:
        score = acc_score
        best_k = i
        
print('Best Accuracy Score', score, 'Best K-Score', best_k)

In [None]:
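# Recent seaborn versions require x and y to be passed as keyword arguments
k_values = list(k)
sns.lineplot(x=k_values, y=testing_accuracy, marker='o', label='testing accuracy')
sns.lineplot(x=k_values, y=training_accuracy, marker='o', label='training accuracy')
plt.xlabel('k (number of neighbours)')
plt.ylabel('accuracy')
plt.legend()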

### Therefore, 14 neighbours is the best choice 
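
Picking k by its accuracy on the test set tends to make the reported test score optimistic. A common alternative is to choose k by cross-validation on the training data only. The sketch below shows that approach on synthetic data from `make_classification`; the variable names and data are assumptions for illustration, not the notebook's own results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 9-feature water data
X_s, y_s = make_classification(n_samples=600, n_features=9, n_informative=5, random_state=0)
X_s_tr, X_s_te, y_s_tr, y_s_te = train_test_split(
    X_s, y_s, test_size=0.3, random_state=0, stratify=y_s
)

pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 20))},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_s_tr, y_s_tr)  # the test split is never seen during the search

best_k_cv = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_s_te, y_s_te)  # scored once, with the refitted best pipeline
print("best k from cross-validation:", best_k_cv)
print("held-out test accuracy:", round(test_acc, 3))
```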

In [None]:
# Train the model again with K = 14 to plot the ROC curve
def model_evaluation(model, metric):
    skfold = StratifiedKFold(n_splits = 5)
    model_cv = cross_val_score(model, X_train, y_train, cv = skfold, scoring = metric)
    return model_cv

knn = KNeighborsClassifier(n_neighbors = 14)
pipe_knn = Pipeline([('scale', MinMaxScaler()), ('knn', knn)])
pipe_knn.fit(X_train, y_train)

pipe_knn_cv = model_evaluation(pipe_knn, 'roc_auc')

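# Evaluate on the held-out test set; use predicted probabilities for both the AUC and the curve
knn_proba = pipe_knn.predict_proba(X_test)[:,1]
KNN_roc_auc = roc_auc_score(y_test, knn_proba)
fpr,tpr,thresholds = roc_curve(y_test, knn_proba)
plt.figure()
plt.plot(fpr,tpr,label="AUC (area = %0.2f)" % KNN_roc_auc)
plt.plot([0,1],[0,1],"r--")
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC")
plt.legend(loc="lower right")  # needed for the AUC label to be displayed
plt.show()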
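
A note on how the AUC is computed: `roc_auc_score` gives the area under the plotted curve only when it receives scores or probabilities. Passing hard 0/1 predictions collapses the curve to a single threshold and generally gives a different number. The four-sample example below (made-up values) shows the gap.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.9])   # every positive is ranked above every negative
hard_labels = (proba >= 0.5).astype(int)  # thresholding gives [0, 0, 0, 1]

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(auc_from_proba)   # 1.0
print(auc_from_labels)  # 0.75
```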

In [None]:
# Confusion matrix
cm5 = confusion_matrix(y_test, pipe_knn.predict(X_test))
sns.heatmap(cm5/np.sum(cm5), annot = True, fmt=  '0.2%', cmap = 'Reds')

# <center> Logistic Regression</center> 

In [None]:
lr_model = LogisticRegression(max_iter=120,random_state=0, n_jobs=20)
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

### Calculating Accuracy Score

In [None]:
lr = accuracy_score(y_test, lr_pred)
print("accuracy - " + str(lr))

In [None]:
print(classification_report(y_test,lr_pred))

In [None]:
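# Evaluate on the held-out test set; use predicted probabilities for both the AUC and the curve
lr_proba = lr_model.predict_proba(X_test)[:,1]
logit_roc_auc = roc_auc_score(y_test, lr_proba)
fpr,tpr,thresholds = roc_curve(y_test, lr_proba)
plt.figure()
plt.plot(fpr,tpr,label="AUC (area = %0.2f)" % logit_roc_auc)
plt.plot([0,1],[0,1],"r--")
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC")
plt.legend(loc="lower right")  # needed for the AUC label to be displayed
plt.show()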

In [None]:
# Confusion matrix
cm1 = confusion_matrix(y_test, lr_pred)
sns.heatmap(cm1/np.sum(cm1), annot = True, fmt=  '0.2%', cmap = 'Reds')

# <center> Naive Bayes</center> 

In [None]:
nb=GaussianNB()
nb_model=nb.fit(X_train,y_train)
y_pred=nb_model.predict(X_test)
accuracy_score(y_test,y_pred)

In [None]:
nb_params={'var_smoothing': np.logspace(0,-9, num=100)}
nb_cv=GridSearchCV(estimator=nb, 
                 param_grid=nb_params, 
                 cv=10, 
                 verbose=1, 
                 scoring='accuracy') 
nb_cv.fit(X_train,y_train)
nb_cv.best_params_

In [None]:
nb=GaussianNB(var_smoothing=1e-9)
nb_tuned=nb.fit(X_train,y_train)

In [None]:
y_pred=nb_tuned.predict(X_test)
accuracy_score(y_test,y_pred)

In [None]:
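# Evaluate on the held-out test set; use predicted probabilities for both the AUC and the curve
nb_proba = nb_tuned.predict_proba(X_test)[:,1]
bayes_roc_auc = roc_auc_score(y_test, nb_proba)
fpr,tpr,thresholds = roc_curve(y_test, nb_proba)
plt.figure()
plt.plot(fpr,tpr,label="AUC (area = %0.2f)" % bayes_roc_auc)
plt.plot([0,1],[0,1],"r--")
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC")
plt.legend(loc="lower right")  # needed for the AUC label to be displayed
plt.show()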

In [None]:
# Confusion matrix
cm1 = confusion_matrix(y_test, y_pred)
sns.heatmap(cm1/np.sum(cm1), annot = True, fmt=  '0.2%', cmap = 'Reds')

# <center> SVM </center> 

### To support the earlier hypothesis that the dataset has very little linear separability, let's analyse the data with PCA

In [None]:
N = 5 
pca_pipeline = Pipeline(steps = [
    ('scale',StandardScaler()),
    ('PCA',PCA(N))
])

tf_data = pca_pipeline.fit_transform(df.iloc[:,:9])
tf_data = pd.DataFrame({'PC1':tf_data[:,0],'PC2':tf_data[:,1],'PC3':tf_data[:,2],'PC4':tf_data[:,3],'PC5':tf_data[:,4],
                        'label':df.iloc[:,-1].map({0:'Not Potable',1:'Potable'})})

In [None]:
ex.scatter_3d(tf_data,x='PC1',y='PC2',z='PC3',color='label',color_discrete_sequence=['salmon','green'],title=r'$\textit{Data in Reduced Dimension } R^9 \rightarrow R^3$')


In [None]:
components = tf_data[['PC1','PC2','PC3','PC4','PC5']].to_numpy()

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca_pipeline['PCA'].explained_variance_ratio_ * 100)
}

fig = ex.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(N),
    color=tf_data['label'],
    color_discrete_sequence=['salmon','green']
)
fig.update_traces(diagonal_visible=False)
fig.update_layout(title='Data Spread Based on Different 2D Combinations of Principal Components')

fig.show()

In [None]:
evr = pca_pipeline['PCA'].explained_variance_ratio_
total_var = evr.sum() * 100
cumsum_evr = np.cumsum(evr)

trace1 = {
    "name": "individual explained variance", 
    "type": "bar",
    'y':evr}
trace2 = {
    "name": "cumulative explained variance", 
    "type": "scatter", 
     'y':cumsum_evr}
data = [trace1, trace2]
layout = {
    "xaxis": {"title": "Principal components"}, 
    "yaxis": {"title": "Explained variance ratio"},
  }
fig = go.Figure(data=data, layout=layout)
fig.update_layout(title='{:.2f}% of the Original Feature Variance Can Be Explained Using {} Dimensions'.format(np.sum(evr)*100,N))
fig.show()

Using five components (out of the initial 9), we preserve only about 60 percent of the original variance. This tells us that the features are largely uncorrelated with each other. Judging from the pairwise plots of the principal components, no linear combination of the features separates the target label any better.
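
As a sanity check on this reading, the sketch below runs the same scale-then-PCA steps on nine synthetic, fully independent features. When features are uncorrelated, each component carries about 1/9 of the variance, so five components keep close to 5/9 (about 56%), which is in the same region as the 60% seen here. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
independent = rng.normal(size=(5000, 9))  # nine uncorrelated synthetic features

scaled = StandardScaler().fit_transform(independent)
pca = PCA(n_components=5).fit(scaled)

kept = pca.explained_variance_ratio_.sum()
print(f"variance kept by 5 of 9 components: {kept:.1%}")
```

With strongly correlated features, the first few components would instead capture most of the variance.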

# Let's try training with 5 components (features) instead of 9

In [None]:
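# Project the 9 features onto the first 5 principal components
# (note: the scaler and PCA are fitted on the full dataset here, before the split)
pca_data = pca_pipeline.fit_transform(df.iloc[:,:9])
pca_data = pd.DataFrame({'PC1':pca_data[:,0],'PC2':pca_data[:,1],'PC3':pca_data[:,2],'PC4':pca_data[:,3],'PC5':pca_data[:,4],
                        'label':df.iloc[:,-1]})

# Dropping last column (output); separate names so the original X and y are not overwritten
X_pca = pca_data.drop(['label'], axis = 1)
y_pca = pca_data['label']

# Splitting
X_pca_train, X_pca_test, y_pca_train, y_pca_test = train_test_split(X_pca,y_pca, test_size=0.3, random_state=1111, stratify=y_pca)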

In [None]:
# Initialize SVM classifier
clf_rbf = SVC(kernel='rbf')
clf_poly = SVC(kernel='poly')

# Fit data
clf_rbf = clf_rbf.fit(X_pca_train, y_pca_train)
clf_poly = clf_poly.fit(X_pca_train, y_pca_train)

y_pca_pred=clf_rbf.predict(X_pca_test)
rbf_acc = accuracy_score(y_pca_test,y_pca_pred)
print(rbf_acc)

y_pca_pred=clf_poly.predict(X_pca_test)
poly_acc= accuracy_score(y_pca_test,y_pca_pred)
print(poly_acc)

In [None]:
clf_rbf.get_params()

In [None]:
# Get support vectors themselves
from mpl_toolkits.mplot3d import Axes3D
support_vectors = clf_rbf.support_vectors_
# Visualize support vectors
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca_train.iloc[0:10,0], X_pca_train.iloc[0:10,1],X_pca_train.iloc[0:10,2], s=60, c ='skyblue')
ax.scatter(support_vectors[0:10,0], support_vectors[0:10,1],support_vectors[0:10,2],s=60, c='red')
plt.show()

In [None]:
support_vectors = clf_poly.support_vectors_
# Visualize support vectors
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca_train.iloc[0:10,0], X_pca_train.iloc[0:10,1],X_pca_train.iloc[0:10,2], s=60, c ='skyblue')
ax.scatter(support_vectors[0:10,0], support_vectors[0:10,1],support_vectors[0:10,2],s=60, c='red')
plt.show()

In [None]:
svc_params={"kernel" : ["rbf"],
                 "gamma": [0.001, 0.01, 0.1, 1,'scale'],
                 "C": [1,10,50,100]}

svc=SVC()
svc_cv_model=GridSearchCV(svc,svc_params,cv=10,n_jobs=-1,verbose=1).fit(X_pca_train,y_pca_train)
svc_cv_model.best_params_

In [None]:
kernel = "rbf"
C = 1
gamma = 'scale'

In [None]:
svc_pca_tuned=SVC(kernel=kernel,C=C,gamma=gamma).fit(X_pca_train,y_pca_train)
y_pca_pred=svc_pca_tuned.predict(X_pca_test)
accuracy_score(y_pca_test,y_pca_pred)

In [None]:
svc_tuned=SVC(kernel=kernel,C=C,gamma=gamma).fit(X_train,y_train)
y_pred=svc_tuned.predict(X_test)
accuracy_score(y_test,y_pred)

In [None]:
# Confusion matrix
cm1 = confusion_matrix(y_test, y_pred)
sns.heatmap(cm1/np.sum(cm1), annot = True, fmt=  '0.2%', cmap = 'Reds')

In [None]:
# Each entry: (display name, fitted model, test features, test labels).
# clf_rbf was trained on the PCA-transformed data, so it is scored on the PCA test split.
models = [
    ('K Nearest Neighbours', pipe_knn, X_test, y_test),
    ('Logistic Regression', lr_model, X_test, y_test),
    ('Support Vector Machines', clf_rbf, X_pca_test, y_pca_test),
    ('Gaussian Naive Bayes', nb_tuned, X_test, y_test),
]

rows = []
for name, model, X_eval, y_eval in models:
    accuracy = accuracy_score(y_eval, model.predict(X_eval))
    rows.append({"Models": name, "Accuracy": accuracy * 100})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
results = pd.DataFrame(rows, columns=["Models", "Accuracy"])
print(results)
sns.barplot(x="Accuracy",y="Models",data=results,color="b")
