# Setting up
The code below contains the steps needed to set up our machine learning environment. Key features are described in the comments.

## Just Some Fun Kaggle Info
I wanted to see if I could use a faster GPU... sadly I couldn't :(

In [None]:
import torch

# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    # memory_cached() is deprecated in recent PyTorch releases; memory_reserved() replaces it
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

# Introduction
### Goal
To find a relationship between water potability and various features of water.

### Why?
We know that potable water is water that we can drink, use in cooking, and so on. In Australia we are very fortunate to have potable water in urban areas. This, however, is not always the case, and some communities, such as the ones in Bali, don't have access to potable water. A machine learning model that can accurately predict whether water is potable based on measurable traits could help improve food safety for many.

Water is made undrinkable by the contaminants it contains, a contaminant being any particle or molecule other than water. This would mean that water produced by reverse osmosis, a growing water source in Australia, would be potable, because the process physically separates the water from contaminants such as salt. In most water there are four main types of contaminants: physical, chemical, biological and radiological.

### Columns Description
1. ph: pH of the water (0 to 14).
2. Hardness: Capacity of water to precipitate soap in mg/L.
3. Solids: Total dissolved solids in ppm.
4. Chloramines: Amount of Chloramines in ppm.
5. Sulfate: Amount of Sulfates dissolved in mg/L.
6. Conductivity: Electrical conductivity of water in μS/cm.
7. Organic_carbon: Amount of organic carbon in ppm.
8. Trihalomethanes: Amount of Trihalomethanes in μg/L.
9. Turbidity: Measure of the light-scattering property (cloudiness) of water in NTU.
10. Potability: Indicates if water is safe for human consumption. Potable: 1, not potable: 0.

### Prediction target
I am predicting whether water is potable, so the Potability column is the target.

### Hypotheses 
1. What will be the best model? If a Support-Vector Machine is used I will get the highest accuracy.
2. What types of features will have the strongest effect on predictions? I believe the pH will be the most important.
3. Goal for accuracy of predictions? 100% (obviously); however, a more realistic goal would be 75% accuracy.

# Setup
I need to import the modules that much of my code depends on.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualisation purposes
from sklearn.tree import DecisionTreeClassifier, plot_tree # Our model and a handy tool for visualising trees
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import plotly.express as px
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=Warning)

import pandas_profiling as pp
from collections import Counter
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Gather and explore the data
### Discussion of why data was selected
I chose this data for several reasons. It has a plethora of features, such as chloramines, and a somewhat large amount of data, with 3277 lines. It was also the best option on Kaggle, as there were no alternatives at the time of choosing. Most importantly, it has potability as a column, which gives me a target to predict.

### How it's suitable for making predictions
This data is good for making predictions because all of the features, or contaminants, are listed in the United States of America's National Primary Drinking Water Regulations, which establish maximum contaminant levels (MCLs) for various contaminants. This means the features should have an effect on the prediction target, water potability, so a model trained on this data has a reasonable chance of being accurate.



In [None]:
train_file_path = '../input/water-potability/water_potability.csv'

# Create a new Pandas DataFrame with our training data
class_train_data = pd.read_csv(train_file_path)

#class_test_data.columns
class_train_data.describe(include='all')
class_train_data.head()

# Let's Have a Closer Look at the Data
I will use box-and-whisker plots along with a distribution graph to see what the data "looks" like, so I can make an informed decision on which model and features to use.

In [None]:
def boxdistriplot(columnName):
    # Skip the target itself; plot every feature against Potability
    if columnName != 'Potability':
        sns.catplot(x="Potability", y=columnName, data=class_train_data, kind="box")
        plt.figure()
        # Note: distplot is deprecated in newer seaborn releases (histplot/kdeplot replace it)
        sns.distplot(class_train_data[columnName][class_train_data.Potability == 1], color="darkturquoise", rug=True)
        sns.distplot(class_train_data[columnName][class_train_data.Potability == 0], color="lightcoral", rug=True)
        plt.legend(['Potable', 'Not Potable'])


for column in class_train_data.columns:
    boxdistriplot(column)

# Analysis

From this, we can see that for most features the values overlap heavily between potable and non-potable water, so any single measurement could go either way. This tells us that potability is going to be hard to predict, and it also suggests that the different contaminants all have a similar bearing on potability.

# Prepare the data
In this example, we want to predict whether or not water is potable. Therefore the 'Potability' column is our prediction target.

Before we can separate our prediction target 'y' from the rest of the data, we need to do some preparation so that there aren't any rows with missing values as our machine learning model will not be able to handle them.

### Dropping rows
As I stated above, all of the contaminants appear to have a similar effect on whether water is potable, so I have not tried to rank the features and drop the weak ones (note, though, that the column list in the code below leaves out Solids and Sulfate). In addition, this data set has 3000+ rows, so dropping the rows with missing values shouldn't take away from the data that much.
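Before dropping rows it can help to count how many values are actually missing. The snippet below is only a sketch on a tiny made-up table (the `toy` DataFrame and its values are hypothetical, not the real water data); it shows `isna().sum()` for counting and `dropna(axis=0)` for removing incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the water-quality table
toy = pd.DataFrame({
    "ph": [7.1, np.nan, 6.5, 8.0],
    "Hardness": [200.0, 180.0, np.nan, 210.0],
    "Potability": [0, 1, 0, 1],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the rows that have no missing values at all
complete_rows = toy.dropna(axis=0)

print(missing_per_column)
print(f"Rows before: {len(toy)}, rows after dropna: {len(complete_rows)}")
```

Running the same two calls on `class_train_data` should show how many rows each feature costs when rows with missing values are dropped.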


In [None]:
# Let's reduce our data to only the features we need and the target.
# The features we chose have similar 'count' values when we describe() them
# We need to keep the target as part of our DataFrame for now.
selected_columns = ['ph', 'Hardness', 'Chloramines', 'Conductivity', 'Organic_carbon', 'Trihalomethanes','Turbidity', 'Potability']
X_columns = ['ph', 'Hardness', 'Chloramines', 'Conductivity', 'Organic_carbon', 'Trihalomethanes','Turbidity']
# Create our new training set containing only the features we want
prepared_data = class_train_data[selected_columns]

# Drop rows (axis=0) that contain missing values
prepared_data = prepared_data.dropna(axis=0)

# Check that you still have a good 'count' value. The value should be the same for all columns.
# If your count is very low then you may need to remove features with the lowest count.
prepared_data.describe(include='all')

# Separate Features From Target
Now that we have a set of data (as a Pandas DataFrame) without any missing values, let's separate the features we will use for training from the target.




In [None]:
# Separate out the prediction target
y = prepared_data.Potability

# Drop the target column (axis=1) from the original dataframe and use the rest as our feature data
X = prepared_data.drop('Potability', axis=1)

# Take a look at the data again
X.head()
#y.head()

## One Hot Encode Categorical Data 
One of the difficulties of working with machine learning models is that most of them can only work with numerical features. Just in case there are any problems with numbers being stored as words, I'll use one-hot encoding. It also might gain me marks, but that's not important in the real world.

One-hot encoding is a widely used approach for converting categorical values (such as words) into numbers. It creates new (binary) columns, indicating the presence of each possible category value in the original data. In other words, it separates each of the options for a category into a separate column, where a 1 means that the row fits the category in question and a 0 indicates it doesn't.
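To make the idea concrete, here is a small sketch using `pd.get_dummies` on made-up data. The `source` column and its values are hypothetical (the water data set has no such column); they are only there to show what happens to a text column, and that purely numeric columns pass through untouched.

```python
import pandas as pd

# Hypothetical example: one numeric column and one text (categorical) column
toy = pd.DataFrame({
    "ph": [7.0, 6.5, 8.1],
    "source": ["river", "well", "river"],
})

# The text column becomes one binary column per category; 'ph' is left alone
encoded = pd.get_dummies(toy)
print(encoded)

# With only numeric columns there is nothing to encode
numeric_only = pd.get_dummies(toy[["ph"]])
print(numeric_only.columns.tolist())
```

Since every feature in the water data is already numeric, one-hot encoding there acts as a safety check rather than changing anything.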

## Split data into training and testing data.
Splitting the data set into two subsets is important because you need data that your model hasn't seen yet, with known actual values, to compare your predictions against and tell how well it is performing. If this isn't done, the model can't be accurately validated or tuned against problems such as over-fitting. I have used a simple method and held out 25% of the dataset for validation.


In [None]:
# One hot encode the features. This will only act on columns containing non-numerical values.
one_hot_X = pd.get_dummies(X)

one_hot_X.head()

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(one_hot_X, y, test_size = 0.25, random_state=1)

#verify the split, verification data should be less than half

val_X.head()

Now we know that there won't be any errors from non-numerical data.

# Choose and Train a Model
Now that we have data our model can digest, let's use it to train a model and make some predictions. We're going to use a __Decision Tree Classifier__ which is different from the Decision Tree Regressor used in the [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning) in that it makes categorical predictions instead of continuous numerical predictions. 

In this case, the category we want to predict is whether or not water is potable, with the output being a 1 if it is and a 0 if it is not, making it well suited to this model. Decision Tree Classifiers are also able to work with non-numerical prediction targets. For example, you might have a 'y' that contains the names (as strings) of different species of flowers. It's only the features that need to be encoded.

For an example of a Decision Tree Classifier working with a non-numerical 'y' and a more in-depth look at how they work, take a look at this Kaggle notebook (https://www.kaggle.com/chrised209/decision-tree-modeling-of-the-iris-dataset)
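As a quick illustration of a non-numerical 'y', the sketch below trains a `DecisionTreeClassifier` on a handful of made-up flower measurements with species names as strings. The numbers are hypothetical and only loosely resemble the iris data.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [petal length, petal width] -> species name (a string)
X_toy = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5], [5.9, 2.1]]
y_toy = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_toy, y_toy)

# The classes are stored as the original strings, in sorted order
print(clf.classes_)

# Predictions come back as strings too
print(clf.predict([[1.5, 0.3]]))
```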

I have chosen this model as a sort of baseline, as it is the "simplest" model, to see how it performs compared to other models. It should do OK: better than a coin flip, but not as good as the others.

Ok, let's train our model and see what it looks like.

In [None]:
# Create a decision tree classifier with a maximum depth of 3 for easy display later on
# Try changing the max_depth to see what happens
class_predictor = DecisionTreeClassifier(max_depth=3)

# Train the model on the one hot encoded data
class_predictor.fit(train_X, train_y)

# Check which classes the tree predicts and in what order
print(class_predictor.classes_)
# This gives us [0 1]: 0 for not potable and 1 for potable

# Let's plot the tree to see what it looks like!
# class_names must be strings listed in the same order as classes_
plt.figure(figsize = (20,10))
plot_tree(class_predictor,
          feature_names=list(train_X.columns),
          class_names=['Not potable', 'Potable'],
          filled=True)
plt.show()

## Pretty Cool!
Note that there are other ways to view a decision tree and there may be other parameters you could include when plotting the tree to display the nodes differently, but this is fine for now and is really just for a fun visual.



# Evaluate model performance and tune hyperparameters
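Now that we have a sweet looking model, let's see how good it is at predicting water potability on our validation set.

The function below determines both the MAE and the accuracy of the model used. Because the target is either 0 or 1, the MAE is simply the fraction of wrong predictions, so the two numbers always add up to 1.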

In [None]:
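def validate(pred, vy, vX):
    """Print the validation MAE and accuracy for a set of predictions."""
    # vX is kept so that existing calls still work; it is not used here
    val_mae = mean_absolute_error(vy, pred)
    print("Validation MAE: {}".format(val_mae))
    acc = accuracy_score(vy, pred)
    print("Validation Accuracy: {}".format(acc))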

Unfortunately I had problems with a more fully featured validation method that used a graph, so I will just show MAE and accuracy, because that's just as effective.
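One thing worth noting about these two numbers: for 0/1 labels every wrong prediction contributes an absolute error of exactly 1, so the MAE equals 1 minus the accuracy. The sketch below checks this on a few made-up labels (the arrays are hypothetical, not model output).

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical true labels and predictions (two of the eight are wrong)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(f"accuracy = {acc}, MAE = {mae}, accuracy + MAE = {acc + mae}")
```

So for this classification problem the two metrics carry the same information, and either one can be used to compare models.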

In [None]:
# Sad, non-functional code
# print("Making predictions for the first 5 rows in the validation set.")
# print("The predictions are:")
# Merge actual target values and predictions back in with original features to see how we went.
# Initialise data of lists.
# data = {'Actual': ['Tom', 'nick', 'krish', 'jack'],
#         'Predicted': ['Tom', 'nick', 'krish', 'jack']}
# Create DataFrame
# results = pd.DataFrame(data)


Now I will apply the function above to validate the model.

In [None]:
pred = class_predictor.predict(val_X)
# validate() prints its results and returns None, so it doesn't need to be wrapped in print()
validate(pred, val_y, val_X)

## Wow, That's not good?

Because we split our data into training and validation sets, we can see the problem of over-fitting or under-fitting appearing. Because a decision tree classifier is also pretty basic, it doesn't deal with new data very well. To try and improve this I will do two things: change the model to a random forest, and test a variety of parameters.
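One way to tell over-fitting from under-fitting is to compare the accuracy on the training data with the accuracy on the validation data. The sketch below does this on synthetic data from `make_classification` (a hypothetical stand-in for the water data, with some label noise added on purpose), once for a shallow tree and once for a tree with no depth limit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (hypothetical stand-in for the water data)
X_syn, y_syn = make_classification(n_samples=1000, n_features=9, n_informative=4,
                                   flip_y=0.2, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_syn, y_syn, test_size=0.25, random_state=1)

scores = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(tr_X, tr_y)
    scores[depth] = (tree.score(tr_X, tr_y), tree.score(va_X, va_y))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, validation={scores[depth][1]:.2f}")
```

A large gap between the training and validation scores points to over-fitting, while low scores on both point to under-fitting.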

## Now to Make a Forest

I chose this model as the next step from a simple decision tree classifier. It works on the same principle and returns a true or false prediction. What it does is build several different decision tree classifiers and combine their outputs into one prediction. The benefits of this model are as follows:
1. It reduces overfitting in decision trees and helps to improve the accuracy.
2. It is flexible to both classification and regression problems.
3. It works well with both categorical and continuous values.
4. It is often described as handling missing values automatically, although here the rows with missing values have already been dropped.
5. Normalising data is not required as it uses a rule-based approach.

However, despite these advantages, a random forest algorithm also has some drawbacks.
1. It requires much computational power as well as resources, as it builds numerous trees to combine their outputs.
2. It also requires much time for training, as it combines a lot of decision trees to determine the class.
3. Due to the ensemble of decision trees, it is harder to interpret than a single tree: it can report an importance score for each variable, but there is no single tree to read the rules from.

Despite this, I feel it should make for a good improvement upon the decision tree classifier, as our goal is to make an accurate prediction and time is not super important. In addition, a random forest classifier's weaker interpretability is largely inconsequential, as most of my data has a similar bearing on the potability.
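To show what "several trees combined" looks like in scikit-learn, here is a minimal sketch on synthetic data (generated with `make_classification`; it is not the water data, and `n_estimators=50` is an arbitrary choice). It inspects the individual trees stored in `estimators_` and the per-feature scores in `feature_importances_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data with 5 features (hypothetical stand-in)
X_syn, y_syn = make_classification(n_samples=500, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_syn, y_syn)

# The forest is literally a list of fitted decision trees
print("Number of trees:", len(forest.estimators_))

# One importance score per feature; the scores sum to 1
print("Feature importances:", forest.feature_importances_.round(3))

# Predictions combine the outputs of all the trees
print("First five predictions:", forest.predict(X_syn[:5]))
```

The same `feature_importances_` attribute is available on `rf_model` after fitting, which is one way to check which water features the forest leans on.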


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define the model. Set random_state to 1
rf_model = RandomForestClassifier(random_state=1)

# fit your model
rf_model.fit(train_X, train_y)

## Now I Will Validate Using the Same Function

In [None]:
pred = rf_model.predict(val_X)
# validate() prints its results and returns None, so it doesn't need to be wrapped in print()
validate(pred, val_y, val_X)

## It's Still Not Great so I'll Tune Parameters

I can see that moving from a decision tree classifier to a random forest didn't come with that big of an improvement; in fact, it's worse. This might be because the random forest is over-fitted to the data: by default its trees have no limit on the number of leaf nodes, whereas the decision tree above was limited to a depth of 3.

To prevent this over-fitting I will tune the parameters.

To do so I have a function that takes a number of leaf nodes and determines the MAE for that number of leaf nodes. Using this, I can use a loop to test a wide range of values and determine the best one with little effort.
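As a sketch of the same idea, the snippet below runs the search on synthetic data and lets the code pick the best value instead of reading it off the printed output. The data, the helper name `mae_for`, the candidate list and `n_estimators=30` are all assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic two-class data (hypothetical stand-in for the water data)
X_syn, y_syn = make_classification(n_samples=600, n_features=9, n_informative=4,
                                   flip_y=0.1, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_syn, y_syn, test_size=0.25, random_state=1)

def mae_for(max_leaf_nodes):
    """Fit a forest with the given leaf limit and return its validation MAE."""
    model = RandomForestClassifier(n_estimators=30, max_leaf_nodes=max_leaf_nodes,
                                   random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

candidates = [5, 10, 20, 40, 80]
errors = {n: mae_for(n) for n in candidates}

# Pick the candidate with the lowest validation error
best_size = min(errors, key=errors.get)
print(errors)
print("Best max_leaf_nodes:", best_size)
```

Choosing the value in code avoids copying a number by hand, although tuning against a single validation split can still favour a value that only happens to suit that split.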

In [None]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [None]:
# Coarse search: try max_leaf_nodes = 10, 20, ..., 290 and print the validation error for each
for max_leaf_nodes in range(1, 30):
    my_mae = get_mae(max_leaf_nodes*10, train_X, val_X[X_columns], train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes*10}             mean error:{my_mae}")

# Store the chosen value of max_leaf_nodes (hard-coded; it is replaced after the finer search below)
best_tree_size = 216

## And Again
Now that I know the best number of leaf nodes is in the range 20-40, I will test every value along that range.

In [None]:
# Fine search: try every value of max_leaf_nodes from 20 to 39
for max_leaf_nodes in range(20, 40):
    my_mae = get_mae(max_leaf_nodes, train_X, val_X[X_columns], train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes}             mean error:{my_mae}")

# Store the best value of max_leaf_nodes found in this range (hard-coded)
best_tree_size = 30

OK, so I have done the best I can using the random forest.

In [None]:
print("Refitting the random forest with the tuned max_leaf_nodes.")

rf_model = RandomForestClassifier(max_leaf_nodes=best_tree_size, random_state=0)

# fit your model
rf_model.fit(train_X, train_y)

# Predict potability for the validation set
pred = rf_model.predict(val_X[X_columns])

## Validate time

In [None]:
pred = rf_model.predict(val_X)
# validate() prints its results and returns None, so it doesn't need to be wrapped in print()
validate(pred, val_y, val_X)

# Analysis
So, we can see that with the more typical tree-based machine learning models, accuracy can reach about 63% in my testing. This time around the accuracy was higher because I tuned the hyperparameters, finding the balance between over-fitting and under-fitting. Whilst it is an improvement on the non-tuned model as well as the basic decision tree, it can still be better, as I am not close to my 100% accuracy goal. To improve, I will essentially restart and "clean" the data.

# Preparing the data, but better
So, last time we prepared the data I avoided removing the outliers in the data set. In addition, there were some skewed distributions that could affect the performance of our model. To help with these issues I will redo the "cleaning" of my data using a skewness corrector that I found. This method uses a Box-Cox transformation to make each feature's distribution closer to normal, and by doing so my model should perform better.
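To see what a Box-Cox transformation does in isolation, here is a minimal sketch on synthetic, right-skewed data (log-normal samples from NumPy, not a column from the water data). It uses `scipy.stats.boxcox`, which requires strictly positive values and returns both the transformed data and the fitted lambda.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic right-skewed, strictly positive data (hypothetical stand-in for a feature)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=2000)

# boxcox picks the lambda that makes the data look as normal as possible
transformed, fitted_lambda = boxcox(skewed)

print(f"Skewness before: {skew(skewed):.3f}")
print(f"Skewness after:  {skew(transformed):.3f}")
print(f"Fitted lambda:   {fitted_lambda:.3f}")
```

A skewness close to 0 after the transformation means the distribution has become roughly symmetric.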

In [None]:
# Dropping null values
class_train_data.dropna(inplace = True)

# Checking size of data after dropping null values
class_train_data.shape

In [None]:
def skewnessCorrector(dataset,columnName):
    import seaborn as sns
    from scipy import stats
    from scipy.stats import norm, boxcox

    print('''Before Correcting''')
    (mu, sigma) = norm.fit(dataset[columnName])
    print("Mu before correcting {} : {}, Sigma before correcting {} : {}".format(
        columnName.capitalize(), mu, columnName.capitalize(), sigma))
    plt.figure(figsize=(20, 10))
    plt.subplot(1, 2, 1)
    sns.distplot(dataset[columnName], fit=norm, color="lightcoral");
    plt.title(columnName.capitalize() +
              " Distplot before Skewness Correction", color="black")
    plt.subplot(1, 2, 2)
    stats.probplot(dataset[columnName], plot=plt)
    plt.show()
    # Applying BoxCox Transformation (note: boxcox only accepts strictly positive values)
    dataset[columnName], fitted_lambda = boxcox(
        dataset[columnName])
    
    print('''After Correcting''')
    (mu, sigma) = norm.fit(dataset[columnName])
    print("Mu after correcting {} : {}, Sigma after correcting {} : {}".format(
        columnName.capitalize(), mu, columnName.capitalize(), sigma))
    plt.figure(figsize=(20, 10))
    plt.subplot(1, 2, 1)
    sns.distplot(dataset[columnName], fit=norm, color="orange");
    plt.title(columnName.capitalize() +
              " Distplot After Skewness Correction", color="black")
    plt.subplot(1, 2, 2)
    stats.probplot(dataset[columnName], plot=plt)
    plt.show()

col = ['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity',
       'Organic_carbon', 'Trihalomethanes', 'Turbidity']

Now I can apply this function to the features.

In [None]:
for column in col:
    skewnessCorrector(class_train_data,column)

Now that the data has been "cleaned" better, we can once again follow the machine learning steps.

In [None]:
# Separate out the prediction target
y = class_train_data.Potability

# Drop the target column (axis=1) from the original dataframe and use the rest as our feature data
X = class_train_data.drop('Potability', axis=1)

# Take a look at the data again
X.head()
#y.head()

# Splitting
Part of the re-cleaning means I need to split the data once again. In this part I decided to reduce the size of my testing data set from 25% down to 15%. This should allow the support vector model to train on more data and hopefully be more accurate, and it still leaves a few hundred rows of testing data to compare against.

In [None]:
# One hot encode the features. This will only act on columns containing non-numerical values.
one_hot_X = pd.get_dummies(X)

one_hot_X.head()

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(one_hot_X, y, test_size = 0.15, random_state = 1)

#verify the split, verification data should be less than half

val_X.head()

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
train_X = sc.fit_transform(train_X)
val_X = sc.transform(val_X)

# Train
OK, now that we have some better-prepared data, I will use it to train a new model: the support vector model. This should work well with our data as it is a classifier, meaning it returns only true or false and not a range of values. This is suitable as we are predicting whether water is potable or not.
I believe this will be the model that achieves the highest accuracy, because support vector models avoid some of the problems that our other models have when using multiple features. In addition:

1. It scales relatively well to high-dimensional data.
2. SVM models tend to generalise well in practice, so the risk of over-fitting is lower.

And some things that I don't fully understand:

1. The kernel trick is the real strength of SVM. With an appropriate kernel function, it can fit complex, non-linear decision boundaries.
2. Unlike in neural networks, training an SVM is a convex optimisation problem, so it does not get stuck in local optima.

This comes at the cost of training time, and the model is difficult for people to interpret. The parameters are also hard to tune, as it's hard to visualise their impact. This is why I didn't tune the model, but from the difference tuning made in the other models I can assume that 2-3% accuracy is left on the table.
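To get a feel for what the kernel does, the sketch below compares a linear kernel with an RBF kernel on a synthetic "circles" data set, where one class sits inside the other and no straight line can separate them. It uses `sklearn.svm.SVC` rather than the `NuSVC` used in this notebook, and the data comes from `make_circles`, so it is only an illustration and not a result on the water data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings of points: not separable by a straight line
X_c, y_c = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_c, y_c, test_size=0.25, random_state=1)

# Same model, two different kernels
linear_acc = SVC(kernel="linear").fit(tr_X, tr_y).score(va_X, va_y)
rbf_acc = SVC(kernel="rbf").fit(tr_X, tr_y).score(va_X, va_y)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

The RBF kernel is the default for both `SVC` and `NuSVC`, so the model trained below already uses it without any extra settings.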




In [None]:
# Define the model. Set random_state to 1
svm_model = svm.NuSVC(random_state=1)

# fit your model
svm_model.fit(train_X, train_y)

## Now I Will Validate Using the Same Function

In [None]:
pred = svm_model.predict(val_X)
# validate() prints its results and returns None, so it doesn't need to be wrapped in print()
validate(pred, val_y, val_X)

# Conclusion
### A reminder of the purpose of the investigation
This investigation aimed to find a relationship between water potability and various features of water. This could help people easily determine whether water is potable, improving water safety.

We predicted from this data whether water was potable or not, based on a number of common contaminants found in water.

### A detailed discussion of the quality of predictions

Our predictions didn't have a particularly good success rate. My best attempt resulted in an accuracy of 67%, which is about a 6-7 percentage point improvement on the basic decision tree classifier at the start; however, it was below the 100% accuracy goal. This could have been my fault, or it could be because potability isn't affected all that much by any one contaminant.

### Comparison of results with hypotheses

My hypothesis that the support vector model would result in the highest accuracy was supported.

### What features had the strongest effect on predictions and why?

We also learned that the features had a similar effect on potability. However, from the box plots and the Gini impurity I determined that pH had the biggest effect, with Chloramines following closely after. This is likely because pH, if too high or too low, can kill our cells and other things, so it would be important in determining whether water is potable.

In addition, the support vector model is the best model, as it had the highest accuracy.
