# CS 345 Final Project
---
Fynn Crossland, Zoe Lauer

---

We examine water potability using measured water-quality features, and test whether trained models can accurately predict whether a sample is potable. We use a model we learned in class (SVM) and a model we found through research (Naive Bayes), and compare the results of the two to see which performs better.

Data found at: https://www.kaggle.com/datasets/uom190346a/water-quality-and-potability

In [5]:
import numpy as np
import pandas as pd
np.set_printoptions(precision=4)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

# Load the dataset from a local copy of the Kaggle CSV
datas = pd.read_csv("water_potability.csv")

print(datas.head())


         ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0       NaN  204.890455  20791.318981     7.300212  368.516441    564.308654   
1  3.716080  129.422921  18630.057858     6.635246         NaN    592.885359   
2  8.099124  224.236259  19909.541732     9.275884         NaN    418.606213   
3  8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4  9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   

   Organic_carbon  Trihalomethanes  Turbidity  Potability  
0       10.379783        86.990970   2.963135           0  
1       15.180013        56.329076   4.500656           0  
2       16.868637        66.420093   3.055934           0  
3       18.436524       100.341674   4.628771           0  
4       11.558279        31.997993   4.075075           0  


First we have to process the data: remove any rows that are missing values, then split the result into the label vector and the feature matrix.

In [6]:
datas = datas[~np.isnan(datas).any(axis = 1)]
labels = datas.iloc[:,-1]
features = datas.iloc[:,:-1]
print(labels.shape)
print(features.shape)

(2011,)
(2011, 9)
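As a quick check on this cleaning step, the sketch below counts missing values per column and confirms that `dropna()` keeps the same rows as the NaN mask used above. The `toy` table and its values are made up for illustration; they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in table (hypothetical values) with the same kind of gaps as the real data.
toy = pd.DataFrame({
    "ph": [np.nan, 3.7, 8.1, 8.3],
    "Sulfate": [368.5, np.nan, 356.9, 310.1],
    "Potability": [0, 0, 1, 1],
})

# Count the missing entries per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows with no missing values, like the NaN mask above.
clean = toy.dropna()
mask_version = toy[~np.isnan(toy).any(axis=1)]

# Share of each class among the rows that survive the cleaning.
class_share = clean["Potability"].value_counts(normalize=True)
print(class_share)
```

Looking at the class share after dropping rows is worthwhile because removing incomplete rows can shift the balance between potable and non-potable samples.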


Next we split the data into training and test sets: 70% to train the models and 30% held out to evaluate them.

In [7]:
## making of the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=5)
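One optional variation, not used in our split above, is to stratify on the labels so that both parts keep the same class proportions. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names) to show the idea.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 40% positive, to mimic an imbalanced label column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo asks for the same class proportions in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```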

First we are going to look at a model that we learned in class.

# Support Vector Machine (SVM)

Within this model, I want to compare a linear-kernel SVM, a class-weighted linear SVM, and a class-weighted nonlinear SVM by their base accuracy.

To begin, I'll showcase the SVM as a linear classifier with a C value of 10.

In [8]:
from sklearn import svm
from sklearn.metrics import classification_report, accuracy_score

classifier = svm.SVC(kernel = 'linear', C=10)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = (np.mean(y_test == y_pred))
print("Base accuracy for SVM: ", accuracy)
print("Linear and unweighted classification report:")
print(classification_report(y_test, y_pred))

Base accuracy for SVM:  0.5860927152317881
Linear and unweighted classification report:
              precision    recall  f1-score   support

           0       0.59      1.00      0.74       354
           1       0.00      0.00      0.00       250

    accuracy                           0.59       604
   macro avg       0.29      0.50      0.37       604
weighted avg       0.34      0.59      0.43       604





Next is an SVM with a linear kernel, balanced class weights, and a C value of 10. This is intended to probe the class balance of the data set and to show whether weighting changes which samples are misclassified.
Keeping the linear kernel isolates the effect of adding a balanced class_weight.

In [None]:
from sklearn.svm import SVC

weighted_model_linear = SVC(kernel='linear', class_weight='balanced', C=10)
weighted_model_linear.fit(X_train, y_train)
y_weighted_linear = weighted_model_linear.predict(X_test)
accuracy_weighted_linear = (np.mean(y_test == y_weighted_linear))
print("Accuracy for weighted linear SVM: ", accuracy_weighted_linear)
print("Linear with class weights:")
print(classification_report(y_test, y_weighted_linear))

Finally, here is a nonlinear, balanced SVM classifier for consideration. The intention is to give a further point of comparison for the accuracy that SVM can reach and to show what else the SVC class offers.

In [None]:
from sklearn.svm import SVC

weighted_model = SVC(kernel='rbf', class_weight='balanced', C=10)
weighted_model.fit(X_train, y_train)
y_weighted = weighted_model.predict(X_test)
accuracy_weighted = (np.mean(y_test == y_weighted))
print("Accuracy for weighted nonlinear SVM: ", accuracy_weighted)
print("Nonlinear with class weights:")
print(classification_report(y_test, y_weighted))
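The notebook imports `StandardScaler` and `cross_validate` but does not use them. Since SVMs are sensitive to feature scale, one thing that could be tried is scaling inside a pipeline and scoring with cross-validation. The following is a minimal sketch on synthetic data, not the result of our experiments; `X_demo` and `y_demo` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with two features on very different scales (like Solids vs. ph).
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_demo = np.column_stack([
    rng.normal(loc=20000, scale=8000, size=n),      # large-scale feature, no signal
    rng.normal(loc=7 + y_demo, scale=1.0, size=n),  # small-scale feature, carries signal
])

# The scaler is fitted inside each training fold, so no held-out statistics leak in.
scaled_svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", class_weight="balanced", C=10),
)
scores = cross_validate(
    scaled_svm, X_demo, y_demo, cv=5, scoring=["accuracy", "balanced_accuracy"]
)

print("Mean accuracy:         ", scores["test_accuracy"].mean())
print("Mean balanced accuracy:", scores["test_balanced_accuracy"].mean())
```

Reporting balanced accuracy next to plain accuracy makes it easier to spot a model that only predicts the majority class.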

Looking at the base accuracy scores, the non-weighted SVM ended up with fractionally better accuracy, and the weighted linear SVM ended up with the worst base accuracy score. However, the unweighted report shows a recall of 1.00 for class 0 and 0.00 for class 1, meaning that model labeled every test sample as not potable; its 59% accuracy is simply the share of class 0 in the test set (354 of 604). Adding the **`class_weight`** parameter was meant to penalize errors on the minority class (1) more heavily, so that it is represented better than in the non-weighted linear SVM.
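To see why plain accuracy can mislead here, the sketch below rebuilds the test labels from the support counts in the first classification report (354 and 250) and scores a classifier that always answers "not potable". The always-zero predictions are an assumption inferred from the zero precision and recall reported for class 1.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Labels rebuilt from the reported support counts: 354 non-potable, 250 potable.
y_true = np.array([0] * 354 + [1] * 250)
# A classifier that always predicts class 0 (not potable).
y_all_zero = np.zeros_like(y_true)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_all_zero)
plain_accuracy = np.mean(y_true == y_all_zero)
balanced = balanced_accuracy_score(y_true, y_all_zero)

print(cm)
print("Accuracy:         ", plain_accuracy)
print("Balanced accuracy:", balanced)
```

The plain accuracy matches the share of class 0 in the test set, while the balanced accuracy averages the per-class recalls and so exposes that class 1 is never predicted.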

---
Now to look at a model that we didn't experiment with in class, the Naive Bayes model.
---
The Gaussian Naive Bayes model assumes that, within each class, every feature follows a normal distribution and that the features are independent of one another given the class.

In [None]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)
clf_pred = clf.predict(X_test)
g_test_accuracy = (np.mean(y_test == clf_pred))
print("Base accuracy for Gaussian Naive Bayes: ", g_test_accuracy)
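To make the Gaussian assumption concrete, here is a hand-rolled sketch of the prediction rule on synthetic data, compared against `GaussianNB`. The helper `gaussian_nb_predict` is a hypothetical name written for this illustration, and the small variance floor is an assumption meant to mirror the library's default smoothing.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: two classes whose feature means differ.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = rng.normal(loc=y_demo[:, None] * 1.5, scale=1.0, size=(300, 2))

def gaussian_nb_predict(X_fit, y_fit, X_new):
    """Hand-rolled Gaussian Naive Bayes prediction, for illustration only."""
    classes = np.unique(y_fit)
    log_scores = []
    for c in classes:
        Xc = X_fit[y_fit == c]
        mean = Xc.mean(axis=0)
        # Per-class, per-feature variance plus a tiny floor for numerical stability.
        var = Xc.var(axis=0) + 1e-9 * X_fit.var(axis=0).max()
        log_prior = np.log(len(Xc) / len(X_fit))
        # Summing per-feature log densities is the "naive" independence assumption.
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        log_scores.append(log_prior + log_lik)
    # Pick the class with the highest log posterior score for each sample.
    return classes[np.argmax(np.column_stack(log_scores), axis=1)]

manual_pred = gaussian_nb_predict(X_demo, y_demo, X_demo)
sk_pred = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
agreement = np.mean(manual_pred == sk_pred)
print("Agreement with sklearn:", agreement)
```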

Now to look at a different kind of classifier, the Categorical Naive Bayes model, which is designed for features that are categorically distributed and encoded as non-negative integers. Our features are continuous measurements, so this model is not a natural fit for the data, but we include it for comparison with the Gaussian model.

In [None]:
from sklearn.naive_bayes import CategoricalNB
clf = CategoricalNB()
clf.fit(X_train, y_train)
clf_pred = clf.predict(X_test)
c_test_accuracy = (np.mean(y_test == clf_pred))
print("Base accuracy for Categorical Naive Bayes: ", c_test_accuracy)
diff = g_test_accuracy - c_test_accuracy
print("\nDifference between Gaussian model and Categorical model: ", diff)
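Because the Categorical model expects integer category codes, one option we did not try is to bin the continuous features first. The sketch below does this with `KBinsDiscretizer` on synthetic data; the bin count and the data are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic continuous features standing in for measurements like ph or Turbidity.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
X_demo = rng.normal(loc=y_demo[:, None] * 1.0, scale=1.0, size=(400, 3))
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5
)

n_bins = 5  # arbitrary choice for the sketch
binned_nb = make_pipeline(
    # Turn each continuous column into integer bin codes 0..n_bins-1.
    KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform"),
    # min_categories reserves a slot for every bin, even one unseen during fitting.
    CategoricalNB(min_categories=n_bins),
)
binned_nb.fit(X_tr, y_tr)
binned_accuracy = binned_nb.score(X_te, y_te)
print("Accuracy with binned features:", binned_accuracy)
```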

When comparing the two base models, we can see that the Gaussian model is better when predicting with the data that we are using.

However, is there a difference if we make changes to the Categorical model?

In [None]:
alphas = [0.0001, 0.001, 0.01, 0.1, 1., 10., 100., 1000.]
test_accuracy = []
for a in alphas:
  clf = CategoricalNB(alpha = a)
  clf.fit(X_train, y_train)
  clf_pred = clf.predict(X_test)
  accuracy = (np.mean(y_test == clf_pred))
  test_accuracy.append(accuracy)
print("Test accuracy for Categorical Naive Bayes: ", test_accuracy)
print("The base alpha = 1.0 and the best alpha value = ", alphas[int(np.argmax(test_accuracy))])

fig = plt.figure(figsize=(4,4))
plt.semilogx(alphas, test_accuracy, 'ob')
plt.xlabel("Alpha")
plt.ylabel("Test Accuracy");
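The loop above picks alpha by looking at test accuracy. An alternative, sketched here on synthetic integer-coded data, is to choose alpha by cross-validation on the training set and touch the test set only once at the end. The data and the `min_categories=4` setting are assumptions for this illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import CategoricalNB

# Synthetic integer-coded features (categories 0..3) standing in for categorical inputs.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
X_demo = rng.integers(0, 3, size=(400, 4)) + y_demo[:, None]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5
)

alpha_grid = {"alpha": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
# min_categories=4 keeps folds consistent even if a category is missing from one fold.
search = GridSearchCV(
    CategoricalNB(min_categories=4), alpha_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)  # alpha is chosen using the training folds only

best_alpha = search.best_params_["alpha"]
test_score = search.score(X_te, y_te)
print("Alpha chosen by cross-validation:", best_alpha)
print("Held-out test accuracy:", test_score)
```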

In [None]:
best = np.max(test_accuracy)
print("Best test accuracy: ", best, "\nCompared to the base Categorical model: ", best - c_test_accuracy)
print("Difference between the Gaussian model and best of the Categorical model with changed alpha scores: ", g_test_accuracy - best)

The base Categorical Naive Bayes model predicts correctly just over 50% of the time. Trying different alpha values, we find that the bigger the alpha, the better the prediction. However, alpha is a smoothing parameter, so very large values flatten the estimated category probabilities and can lead to underfitting rather than a better model. Because alpha was chosen using test accuracy, the best score is also likely to be optimistic.

Even with the changes to the Categorical model, the Gaussian model was better at predicting the data by about 3%, which isn't the biggest of differences but is still enough to matter when deciding which model to use.

# What does this mean?


---
When looking at the two types of models, SVM and Naive Bayes, we saw that the Gaussian Naive Bayes model has the best accuracy at predicting the labels of the test set: 62% for the Gaussian model compared to 59% for the best SVM model that we tested.

However, the accuracies we found ranged from 50% to 62%. This isn't a large range, and none of it is great: the models we looked at predict correctly only a little more than half the time, and always guessing the majority class would already score about 59% on our test split. The models we used may not have been the best choice for this data. Since knowing whether water is potable is very important, being right only slightly more than half the time isn't ideal, but there is room to grow.

---

# The Future and Contributions

If given more time, we would have explored more models with the data that we were looking at to try and find the best model.

For this project, planning was done by both members. Fynn worked on the Naive Bayes models and all evaluation of them. Zoe worked on the SVM models and all evaluation of them. Conclusions and connections were done by both members.