# <a id='toc1_'></a>[Subject: Machine Learning, 19th Feb 2024 - 29th March 2024.](#toc0_)
### <a id='toc1_1_1_'></a>[Topic: Individual Assessment.](#toc0_)
- Research Purpose: Apply two supervised learning algorithms to a real-world data set and evaluate their classification performance.

### <a id='toc1_1_2_'></a>[Learning Outcomes:](#toc0_)
- MO1: Compare and contrast the basic principles and characteristics.

**Table of contents**<a id='toc0_'></a>    
- [Subject: Machine Learning, 19th Feb 2024 - 29th March 2024.](#toc1_)    
    - [Topic: Individual Assessment.](#toc1_1_1_)    
    - [Learning Outcomes:](#toc1_1_2_)    
- [Support Vector Machines](#toc2_)    
    - [](#toc2_1_1_)    
    - [Links:](#toc2_1_2_)    
- [Ensembles?](#toc3_)    
- [Import libraries to run the project:](#toc4_)    
  - [Analysis and treatment of dataset (10%):](#toc4_1_)    
    - [Data collection and processing:](#toc4_1_1_)    
    - [Feature Scaling (Normalisation):](#toc4_1_2_)    
    - [Transform data](#toc4_1_3_)    
    - [Create the test and training sets](#toc4_1_4_)    
  - [Model and Training (40%):](#toc4_2_)    
    - [SVM Classification Model](#toc4_2_1_)    
    - [Training the Model](#toc4_2_2_)    
    - [Grid Search Model](#toc4_2_3_)    
      - [Training the Model](#toc4_2_3_1_)    
  - [Prediction and Evaluation (30%)](#toc4_3_)    
    - [Support Vector Machines (SVM)](#toc4_3_1_)    
    - [Understanding the classification report](#toc4_3_2_)    
    - [Fine-tune the hyperparameters of SVM via Grid Search:](#toc4_3_3_)    
  - [Comparison of Two Models (10%)](#toc4_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[Support Vector Machines](#toc0_)
Support Vector Machines (SVMs) are supervised machine learning algorithms used for classification: they assign each sample to a class based on its features. To separate the classes we look for the optimal decision boundary, which is a hyperplane. The data points closest to the hyperplane are the support vectors, and they determine its position and orientation.

Moreover, SVMs can use different kernels. Common choices are the linear kernel and non-linear kernels such as the radial basis function (RBF). In linear algebra, the transpose $T^T$ of a matrix $T$ is obtained by swapping its rows and columns, so that $(T^T)_{ij} = T_{ji}$. This matters here because $w^Tx$ is the dot product between the weight vector $w$ and the input $x$, and $w$ is perpendicular to the hyperplane. The support vectors define two additional lines parallel to the hyperplane, and the distance between them is the margin we want to maximise.

Hyperplane: $H = \{x \mid w^Tx + b = 0\}$


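From there, to calculate the margin $γ$, we use the unit vector $w/||w||$, which is perpendicular to the hyperplane, where $||w||$ is the Euclidean norm (length) of the weight vector $w$. If $w$ and $b$ are scaled so that the support vectors satisfy $|w^Tx + b| = 1$, each support vector lies at a distance $1/||w||$ from the hyperplane. Therefore the maximised margin is:

$γ = 2 ⋅ 1/||w|| = 2/||w||$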

*Note: $w$ denotes weight, $x$ denotes input, $b$ denotes bias and $^T$ denotes Transpose.*
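
To make the geometry concrete, here is a minimal sketch that evaluates $w^Tx + b$ for a point and computes the margin width $2/||w||$. The weight vector, bias and points are made-up values chosen for illustration; they are not parameters learned from the diabetes data.

```python
import math

# Hypothetical 2-D hyperplane parameters (illustrative values only).
w = [3.0, 4.0]   # weight vector, perpendicular to the hyperplane
b = -5.0         # bias


def decision_value(x):
    """Return w^T x + b; its sign gives the side of the hyperplane x lies on."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def distance_to_hyperplane(x):
    """Perpendicular distance from x to the hyperplane: |w^T x + b| / ||w||."""
    return abs(decision_value(x)) / math.hypot(*w)


# Width of the margin when the support vectors satisfy |w^T x + b| = 1.
margin_width = 2 / math.hypot(*w)

on_plane = [3.0, -1.0]       # 3*3 + 4*(-1) - 5 = 0
positive_side = [3.0, 4.0]   # 3*3 + 4*4 - 5 = 20

print(decision_value(on_plane))                # 0.0
print(distance_to_hyperplane(positive_side))   # 4.0
print(margin_width)                            # 0.4
```

A smaller $||w||$ gives a wider margin, which is why training an SVM amounts to minimising $||w||$ subject to the points being classified correctly.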

### <a id='toc2_1_1_'></a>[Classification metrics](#toc0_)
In our SVM we use binary classification to determine whether an individual is diabetic or not, because the target variable in our dataset has exactly two classes. To measure how well the classifier performs we use the following:

![Logo](accuracy_calc.PNG)

Accuracy is the percentage of correct predictions on a test data set. <br>
Precision is the ratio of True Positives to all predicted Positives (True Positives plus False Positives). <br>
Recall is the ratio of True Positives to all actual Positives (True Positives plus False Negatives).

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$ <br>

$Precision = \frac{TP}{TP + FP}$ <br>

$Recall = \frac{TP}{TP + FN}$
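
As a worked example of the three formulas, the sketch below computes them from confusion-matrix counts. The helper name `classification_metrics` and the counts are hypothetical, chosen only to illustrate the arithmetic; they are not results from this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for a 100-sample test set.
accuracy, precision, recall = classification_metrics(tp=30, fp=10, fn=20, tn=40)
print(f"Accuracy:  {accuracy:.2f}")   # 0.70
print(f"Precision: {precision:.2f}")  # 0.75
print(f"Recall:    {recall:.2f}")     # 0.60
```

Here precision is higher than recall: most positive predictions are correct, but 20 of the 50 actual positives are missed.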


### <a id='toc2_1_2_'></a>[Links:](#toc0_)
"Lecture 3: The Perceptron" (2024) cornell.edu. [online] Available from: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html [Accessed 20th March 2024]

"Lecture 9: SVM" (2024) cornell.edu. [online] Available from: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html [Accessed 20th March 2024]
- https://www.youtube.com/watch?v=NnmKeYUYMPY

# <a id='toc3_'></a>[Ensembles?](#toc0_)
Some text about ensembles

# <a id='toc4_'></a>[Import libraries to run the project:](#toc0_)

In [1]:
"""
This notebook is part of the Individual Assessment.
It contains two supervised learning models with information
about their use-case and the understanding of different
classifications of real-world datasets comparatively.

Originally made by Reece Turner, 22036698.
"""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Active root directory of the project folder (the current working directory),
# kept with a trailing path separator so file names can be appended to it.
current_directory = os.path.join(os.getcwd(), "")

# Constant Variables
C_HYPERPARAMETER = 1.0

## <a id='toc4_1_'></a>[Analysis and treatment of dataset (10%):](#toc0_)


### <a id='toc4_1_1_'></a>[Data collection and processing:](#toc0_)
In this section we want to identify which columns in our dataframe are features and which one is the target variable.
According to our diabetes description our data is as follows:

Number of Instances: 768

Number of Attributes (Features): 8 plus class

For Each Column:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)

9. Outcome (1 is interpreted as "tested positive for diabetes" and 0 as "negative")

*Note: Our Outcome column will be the target variable $y$ as it has 2 classifications (binary classification).*

An analysis of our features suggests:
- The number of times pregnant can legitimately be 0, because an individual can be diabetic or non-diabetic without ever having been pregnant.
- 2-Hour serum insulin 

In [2]:
# Load the dataset
data = pd.read_csv(os.path.join(current_directory, "dataset", "diabetes.csv"))

### <a id='toc4_1_2_'></a>[Feature Scaling (Normalisation):](#toc0_)
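Now that we have loaded our dataset and identified both the features and the outcome, we need to standardise the features so that each has zero mean and unit variance. SVMs are sensitive to feature scale, so this stops features with large numeric ranges from dominating the model and typically improves accuracy.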

In [3]:
X_features = data.drop(columns=["Outcome"])
y_target = data["Outcome"]

scaler = StandardScaler()
scaler.fit(X_features)

### <a id='toc4_1_3_'></a>[Transform data](#toc0_)
After fitting the scaler to our features we need to standardise them, which is performed by the `transform` function. It applies two preprocessing techniques, centering and scaling: the mean of each feature (computed during `fit`) is subtracted, and the result is divided by that feature's standard deviation.
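
The centering and scaling performed by `transform` can be reproduced by hand. A minimal sketch, assuming a handful of made-up glucose readings rather than the real dataset column:

```python
from statistics import fmean, pstdev

# Hypothetical glucose readings for five patients (illustrative values only).
glucose = [85.0, 148.0, 183.0, 89.0, 137.0]

mean = fmean(glucose)
# StandardScaler divides by the population standard deviation (ddof = 0).
std = pstdev(glucose)

# Centre each value on the mean, then scale by the standard deviation.
standardised = [(value - mean) / std for value in glucose]

print([round(value, 3) for value in standardised])
# After standardisation the mean is (numerically) 0 and the deviation is 1.
print(round(fmean(standardised), 6), round(pstdev(standardised), 6))
```

Because every feature ends up on the same scale, a feature measured in large units (such as insulin) no longer outweighs one measured in small units (such as the pedigree function).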

In [4]:
standardised_data = scaler.transform(X_features)
features = standardised_data

# Output the transformed data
print(features, y_target)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]] 0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


### <a id='toc4_1_4_'></a>[Create the test and training sets](#toc0_)
Since this algorithm is supervised, we hold back part of the full dataset as a test set so that we can check how reliably the model performs on data it was not trained on.

![Logo](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/1_train-test-split_0.jpg)

"Understanding Train Test Split" (2022) builtin.com. [online] Available from: https://builtin.com/data-science/train-test-split [Accessed 25th March 2024]

In [5]:
# Split the data and keep random state to 0 so we can produce replicable results
X_train, X_test, y_train, y_test = train_test_split(
    features,  # Use the standardised features rather than the original X_features
    y_target,
    test_size=0.2, # 20 percent of the full dataset
    random_state=0
)

print(
    f"Original Features {features.shape[0]} - 100% of the original set.",  # 768 rows + 8 cols
    f"\nX_train Features {X_train.shape[0]}  - {(X_train.shape[0]/features.shape[0])*100}% of the original set.",
    f"\nX_test Features {X_test.shape[0]}   - {(X_test.shape[0]/features.shape[0])*100}% of the original set."
)

Original Features 768 - 100% of the original set. 
X_train Features 614  - 79.94791666666666% of the original set. 
X_test Features 154   - 20.052083333333336% of the original set.


## <a id='toc4_2_'></a>[Model and Training (40%):](#toc0_)
In this section, we create a model and apply fundamental machine learning practices. Since our approach is supervised, we can measure the accuracy of the model against known labels. Grid Search then helps us find the most appropriate hyperparameters while guarding against overfitting and underfitting.

### <a id='toc4_2_1_'></a>[SVM Classification Model](#toc0_)
Create an SVM Classifier Model to fit our data.

In [6]:
model = svm.SVC(
    kernel="linear",
    random_state=0,
    C=C_HYPERPARAMETER
)

### <a id='toc4_2_2_'></a>[Training the Model](#toc0_)
Fit the SVM model according to the given training data.

In [7]:
model.fit(X_train, y_train)

### <a id='toc4_2_3_'></a>[Grid Search Model](#toc0_)
Exhaustive search over specified parameter values of an estimator. See [Fine-tune the hyperparameters of SVM via Grid Search](#toc4_3_3_) for incremental accuracy.

In [8]:
# Create a grid search
param_grid = {
    "C": [C_HYPERPARAMETER],
    "kernel": ['linear']
}

# The number of subsets (folds) that the training set is split into.
# Each fold is held out once for validation while the model is trained
# on the remaining folds, which gives a more reliable accuracy estimate
# than a single train/validation split.
cross_validation = 5

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=cross_validation,
    scoring="accuracy"  # Rank candidates by mean cross-validated accuracy.
)

#### <a id='toc4_2_3_1_'></a>[Training the Model](#toc0_)
Train the grid search for a single C hyperparameter of the SVM. The grid currently contains one candidate (linear kernel, $C = 1.0$), so the search amounts to 5-fold cross-validation of that configuration with the same random state, kernel and C hyperparameter as the model above.

In [9]:
grid_search.fit(X_train, y_train)
best_svm_classifier = grid_search.best_estimator_
best_svm_params = grid_search.best_params_

# Mean cross-validated ('cv') accuracy of the best candidate.
best_svm_score = grid_search.best_score_

print(
    "****** TRAINING DATA ******",
    "\nSVM Grid Search mean score: %.6f" % (best_svm_score*100) + " %"
)

****** TRAINING DATA ****** 
SVM Grid Search mean score: 75.737705 %
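
The mean score above is an average over five validation folds. The sketch below illustrates how a plain k-fold partition divides the 614 training rows; the helper `kfold_sizes` is hypothetical and written for illustration. Note that for a classifier with an integer `cv`, `GridSearchCV` uses stratified folds, so the real folds additionally preserve the class proportions.

```python
def kfold_sizes(n_samples, n_folds):
    """Sizes of the validation folds when n_samples are split into n_folds parts.

    The first (n_samples % n_folds) folds receive one extra sample so that
    every sample is used for validation exactly once.
    """
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if fold < extra else base for fold in range(n_folds)]


# 614 training rows and cv=5, as in this notebook.
sizes = kfold_sizes(614, 5)
print(sizes)  # [123, 123, 123, 123, 122]

# Each candidate is trained on the other four folds and scored on the held-out one.
for fold, validation_size in enumerate(sizes, start=1):
    print(f"Fold {fold}: train on {614 - validation_size}, validate on {validation_size}")
```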


## <a id='toc4_3_'></a>[Prediction and Evaluation (30%)](#toc0_)
### <a id='toc4_3_1_'></a>[Support Vector Machines (SVM)](#toc0_)
Here we can produce the results for the training and test data. Using sklearn we want to obtain the classification metrics for the predictions $\hat{y} = f(x)$, where $x$ is the input data, by comparing them against the true labels $y$.

In [10]:
# Training data accuracy score
y_train_predict = model.predict(X_train)
training_accuracy = accuracy_score(
    y_train,
    y_train_predict
)

# Training data precision score
training_precision = precision_score(
    y_train,
    y_train_predict
)

In [22]:
# Test data accuracy
y_test_predict = model.predict(X_test)
test_accuracy = accuracy_score(
    y_test,
    y_test_predict
)

# Test data precision score
test_precision = precision_score(
    y_test,
    y_test_predict
)

print(
    "****** TEST DATA ******",
    "\nPrecision score: " + "%.6f" % (test_precision*100) + " %",
    "\nAccuracy score: " + "%.6f" % (test_accuracy*100) + " %"
)

print(
    "\n****** TRAINING DATA ******",
    "\nPrecision score: " + "%.6f" % (training_precision*100) + " %",
    "\nAccuracy score: " + "%.6f" % (training_accuracy*100) + " %"
)

if test_accuracy > training_accuracy:
    print(
        "It seems you may be **UNDERFITTING** by a "
        + "%.6f" % ((test_accuracy - training_accuracy)*100) + " %" + " delta."
    )
elif test_accuracy < training_accuracy:
    print(
        "It seems you may be **OVERFITTING** by a "
        + "%.6f" % ((training_accuracy - test_accuracy)*100) + " %" + " delta."
    )
else:
    print(
        "Your model is operational with these hyperparameters. "
        + "This is because your testing data and training data accuracy match."
    )


****** TEST DATA ****** 
Precision score: 76.315789 % 
Accuracy score: 82.467532 %

****** TRAINING DATA ****** 
Precision score: 71.839080 % 
Accuracy score: 76.384365 %
It seems you may be **UNDERFITTING** by a 6.083168 % delta.


In the code above we obtained the current results for training and test data on an SVM, but what if we want to optimise this and visualise the best hyperparameter data? See [Fine-tune the hyperparameters of SVM via Grid Search](#toc4_3_3_).

### <a id='toc4_3_2_'></a>[Understanding the classification report](#toc0_)
Now that we can perform model predictions, it's important that we consider the possibility of the model overfitting or underfitting. Overfitting is where the model fits the training data too closely, including its noise, so it performs well on the training set but generalises poorly to unseen data. Underfitting is where the model is too simple to capture the underlying pattern, so it performs poorly on both sets. Either way the result is more false positives and false negatives on new data, which lowers the precision and accuracy scores and could mislead the user.

To prevent this we can implement some techniques used in Machine Learning and Algorithms:
- Changing the hyperparameters to obtain the best results.
- Using search algorithms.
- Find missing or invalid data, if applicable.

In our case we used a simple exhaustive search to find the best hyperparameters over a series of cross-validated data. However, there are some issues we need to consider when using this approach. If the grid is too large or the validation set is too small, this can lead to overfitting or underfitting. How do we analyse this? See [Fine-tune the hyperparameters of SVM via Grid Search:](#toc4_3_3_).

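*Note: if the training accuracy is noticeably higher than the test accuracy, this typically means that the model is overfitting. Underfitting typically shows as low accuracy on both sets. A test accuracy above the training accuracy, as in our results, can also simply reflect a favourable split on a small test set (154 rows).*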
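
Beyond accuracy and precision, scikit-learn can summarise all per-class metrics in one table. A minimal sketch using `classification_report` and `confusion_matrix`, assuming made-up labels rather than this notebook's predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels (1 = diabetic, 0 = not diabetic), not notebook results.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=5 FP=1 FN=1 TP=3

# Per-class precision, recall and F1-score in one table.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```

With the notebook's own variables, the same calls could be made on `y_test` and `y_test_predict`.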

### <a id='toc4_3_3_'></a>[Fine-tune the hyperparameters of SVM via Grid Search:](#toc0_)
In this process we can discover which hyperparameters are the most accurate for the given data, enabling us to evaluate the performance of the model. Different datasets result in variations of accuracy, so we need to identify which parameters work best for our data.

Moreover, we are going to demonstrate our model's performance via the use of a Grid Search. The cost of this search algorithm grows linearly, O(n), with the number of candidate hyperparameter combinations n, since each candidate is fitted once per cross-validation fold. Note that n itself is the product of the number of values tried for each hyperparameter. With our current SVM configuration and a data set of under 10,000 cases, this is efficient enough for our needs.
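
To get a feel for the cost of a grid search, the sketch below counts the model fits for a search space. The grid shown is hypothetical; the notebook's own grid contains a single candidate.

```python
from math import prod

# Hypothetical search space (assumption for illustration).
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "kernel": ["linear", "rbf"],
}
n_folds = 5

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in param_grid.values())
# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 8 40
```

Adding a third hyperparameter with five values would multiply both numbers by five, so the grid is best kept to values that are plausible for the data. With `refit=True` (the default), `GridSearchCV` also performs one final fit of the best candidate on the full training set.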

Additionally, to display all the results we can use a library called seaborn to plot our data as a colour-coded grid, visualising which hyperparameters worked best. We can graph it such that $x_1$ is the hyperparameter and $y_1$ is our accuracy/precision/recall.

In [12]:
# Display the grid search via seaborn/matplotlib.

# Title - Validation set accuracy
# X1 - Linear kernel
# Y1 - C

# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
# https://www.analyticsvidhya.com/blog/2020/02/underfitting-overfitting-best-fitting-machine-learning/
# https://www.google.com/search?q=heatmap+fine+tuned+hyperparameters&udm=2
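
One possible way to fill in this cell is sketched below. It runs a grid search over several values of C and two kernels, then arranges the mean validation accuracy as a kernel-by-C matrix ready for plotting. The synthetic data from `make_classification` is an assumed stand-in for the standardised diabetes features, and the candidate values are illustrative rather than tuned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the standardised diabetes features (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

c_values = [0.1, 1.0, 10.0]
kernels = ["linear", "rbf"]

search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": c_values, "kernel": kernels},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# Arrange the mean validation accuracy as a kernel-by-C matrix.
scores = np.zeros((len(kernels), len(c_values)))
for params, mean_score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    row = kernels.index(params["kernel"])
    col = c_values.index(params["C"])
    scores[row, col] = mean_score

print(scores.round(3))
print("Best parameters:", search.best_params_)
```

The `scores` matrix could then be passed to seaborn, for example `seaborn.heatmap(scores, annot=True, xticklabels=c_values, yticklabels=kernels)`, to colour each cell by its validation accuracy.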

## <a id='toc4_4_'></a>[Comparison of Two Models (10%)](#toc0_)

- Summarize the key differences and similarities between SVM and ensemble methods in terms of their algorithmic characteristics and performance.
- Emphasize the importance of considering the specific requirements and constraints of the problem domain when selecting between SVM and ensemble methods.
