# Subject: Machine Learning, 19th Feb 2024 - 29th March 2024.
### Topic: Individual Assessment.
- Research Purpose: Applying two supervised learning algorithms to real-world data sets to test and compare their accuracy.

### Learning Outcomes:
- MO1: Compare and contrast the basic principles and characteristics.

# Support Vector Machines (SVM) ?
SVMs are supervised machine learning algorithms that are commonly used for classification: they learn from labelled examples how to assign a new data point to a class based on its features. To separate the classes we need a decision boundary, and an SVM uses a hyperplane for this, searching for the hyperplane that separates the classes best. The data points closest to the hyperplane are called support vectors, and they determine the position and orientation of the hyperplane.

Moreover, an SVM can use different kernels. The ones considered here are the linear kernel and non-linear kernels, with the radial basis function (RBF) being a common non-linear choice. In linear algebra, the transpose $T^T$ of a matrix $T$ is obtained by swapping its rows and columns, and the transpose of a product reverses the order, $(AB)^T = B^T A^T$. This is important to know because the hyperplane is defined through the product $w^T x$, and the distance between a support vector and the hyperplane is measured perpendicular to the hyperplane. The support vectors are what help visualise the additional lines (the margin boundaries) which can be used to find the maximised margin.
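To make the kernel idea concrete, here is a minimal pure-Python sketch of a linear and an RBF kernel. The function names and the `gamma` parameter value are illustrative assumptions, and `gamma` here is the RBF width parameter, not the margin $γ$ used elsewhere in this section.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the dot product x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - z||^2). gamma is an assumed example value."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical 2-D points
a = [1.0, 2.0]
b = [3.0, 4.0]

print(linear_kernel(a, b))  # 1*3 + 2*4 = 11.0
print(rbf_kernel(a, a))     # identical points give 1.0
print(rbf_kernel(a, b))     # distant points give a value close to 0
```

A kernel value acts as a similarity score: the linear kernel grows with the dot product, while the RBF kernel decays from 1 towards 0 as the points move apart.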

$Hyperplane$ = $H: \{x \mid w^Tx + b = 0\}$


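From there, to calculate the margin $γ$, we use the unit vector $w/||w||$, which is perpendicular to the hyperplane, where $||w||$ is the Euclidean norm (length) of the weight vector $w$. With the support vectors scaled to lie on $w^Tx + b = ±1$, each margin boundary is at a distance of $1/||w||$ from the hyperplane. Therefore the maximised margin is:

$γ$ = $2 ⋅ 1/||w|| = 2/||w||$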

*Note: $w$ denotes weight, $x$ denotes input, $b$ denotes bias and $^T$ denotes Transpose.*
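As a minimal worked sketch of the hyperplane and margin formulas, assume a hypothetical weight vector $w = (3, 4)$ and bias $b = -5$ (illustrative values, not taken from the trained model). The snippet evaluates $w^Tx + b$ for two points and computes the margin $2/||w||$.

```python
import math

def decision_value(w, x, b):
    """Return w^T x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Return the margin 2 / ||w||, with ||w|| the Euclidean norm of w."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical values chosen so that ||w|| = 5
w = [3.0, 4.0]
b = -5.0

print(decision_value(w, [3.0, 4.0], b))  # 3*3 + 4*4 - 5 = 20.0, positive side
print(decision_value(w, [0.0, 0.0], b))  # -5.0, negative side
print(margin_width(w))                   # 2 / 5 = 0.4
```

A larger $||w||$ gives a narrower margin, which is why maximising the margin is the same as minimising $||w||$.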

### Binary Classification
In our SVM we are using binary classification to determine if the individual is diabetic or not. This suits our dataset because its target variable takes only two values. To determine our classification success we use the following:

![Logo](accuracy_calc.PNG)

Accuracy is the percentage of correct predictions on a test data set. <br>
Precision is the ratio of True Positives to all predicted Positives (True Positives plus False Positives). <br>
Recall is the ratio of True Positives to all actual Positives (True Positives plus False Negatives).

$Accuracy$ = $\frac{TP + TN}{TP + FP + FN + TN}$ <br>

$Precision$ = $\frac{TP}{TP + FP}$ <br>

$Recall$ = $\frac{TP}{TP + FN}$
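The three formulas can be checked by hand with a short sketch. The labels below are hypothetical and the helper name `classification_metrics` is an assumption for illustration; scikit-learn offers `accuracy_score`, `precision_score` and `recall_score` in `sklearn.metrics` for the same purpose.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when there are no positives at all
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels: TP = 3, TN = 2, FP = 2, FN = 1
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.625 0.6 0.75
```

The example shows why accuracy alone can mislead: the three measures differ as soon as false positives and false negatives are not balanced.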


### Links:
"Lecture 3: The Perceptron" (2024) cornell.edu. [online] Available from: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html [Accessed 20th March 2024]

"Lecture 9: SVM" (2024) cornell.edu. [online] Available from: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html [Accessed 20th March 2024]
- https://www.youtube.com/watch?v=NnmKeYUYMPY

# Ensembles ?
Some text about ensembles

## Import libraries to run the project:

In [113]:
"""
This notebook is part of the Individual Assessment.
It contains two supervised learning models with information
about their use-case and the understanding of different
classifications of real-world datasets comparatively.

Originally made by Reece Turner, 22036698.
"""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Active root directory of the project folder (the notebook's working directory)
current_directory = os.getcwd() + os.sep

## Analysis and treatment of dataset (10%)


### Data collection and processing:
In this section we want to identify which columns in our dataframe are features and which one is the target variable.
According to our diabetes description our data is as follows:

Number of Instances: 768

Number of Attributes (Features): 8 plus class

For Each Column:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)

9. Outcome (1 is interpreted as "tested positive for diabetes" and 0 as "negative")

*Note: Our Outcome column will be the target variable $y$ as it has 2 classifications (binary classification).*

An analysis of our features suggests:
- Times pregnant can legitimately be 0, because an individual does not need to have been pregnant to be diabetic or non-diabetic.
- 2-Hour serum insulin 

In [114]:
# Load the dataset
data = pd.read_csv(os.path.join(current_directory, "dataset", "diabetes.csv"))

### Feature Scaling (Normalisation):
Now that we have loaded our dataset and identified both the features and the outcome, we need to rescale the features to comparable ranges so that no single feature dominates the model because of its units. Here we standardise each feature to zero mean and unit variance.
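As a rough illustration of what standardisation does to a single column, the sketch below applies $z = (x - mean) / std$ by hand. The column values and the helper name `standardise` are hypothetical; the population standard deviation is used, which is also what `StandardScaler` uses.

```python
import statistics

def standardise(column):
    """Rescale a list of numbers to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

# Hypothetical feature column with mean 5 and standard deviation 2
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

scaled = standardise(column)
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After scaling, a value of 1.0 means "one standard deviation above the column mean", regardless of the original units.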

In [115]:
X_features = data.drop(columns=["Outcome"])
y_target = data["Outcome"]

scaler = StandardScaler()
scaler.fit(X_features)

### Transform data

In [116]:
standardised_data = scaler.transform(X_features)
features = standardised_data
print(features, y_target)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]] 0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


### Create the test and training sets
Since this algorithm is supervised, we hold back part of the full dataset as a test set and train only on the remainder. Scoring the model on data it has not seen during training shows how reliable and consistent its predictions are.
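The idea of a hold-out split can be sketched without any library. The helper `holdout_split` is a hypothetical illustration, not the scikit-learn implementation, and its shuffle order will differ from `train_test_split`. It rounds the test share up, which for 768 rows and a 20% test size gives the same 614/154 sizes.

```python
import math
import random

def holdout_split(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # fixed seed for reproducibility
    n_test = math.ceil(test_size * n_rows)   # test share, rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(768)
print(len(train_idx), len(test_idx))  # 614 154
```

No row appears in both sets, so the test score is measured only on rows the model never saw during training.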

In [117]:
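# Note: this splits the unscaled X_features; pass `features` instead to train
# on the standardised data produced by the scaler above.
X_train, X_test, y_train, y_test = train_test_split(
    X_features, y_target, test_size=0.2, random_state=0
)

print(
    features.shape,  # 768 rows, 8 columns
    X_train.shape,
    X_test.shape
)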

(768, 8) (614, 8) (154, 8)


## Model and Training (40%):

### SVM Classification
Create an SVM classifier model with a linear kernel.

In [118]:
model = svm.SVC(
    kernel="linear",
    random_state=0,
    C=1.0
)

### Training the Model
Fit the SVM model according to the given training data.

In [119]:
model.fit(X_train, y_train)

## Prediction and Evaluation (30%)
### Support Vector Machines (SVM)

In [120]:
# Training data accuracy
X_train_predict = model.predict(X_train)
training_accuracy = accuracy_score(
    y_train,
    X_train_predict
)

print(f"Accuracy of training data: {training_accuracy}")

Accuracy of training data: 0.7654723127035831


In [121]:
# Test data accuracy
X_test_predict = model.predict(X_test)
test_accuracy = accuracy_score(
    y_test,
    X_test_predict
)

print(f"Accuracy of test data: {test_accuracy}")
print(X_train_predict)

Accuracy of test data: 0.8181818181818182
[1 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0
 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0
 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 1
 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 1
 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 1 0 1
 0 0 0 1 1 0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 1 0
 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1
 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 1 0 1 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1
 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1
 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1
 0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0

In [122]:
import numpy as np
import matplotlib.pyplot as plt

def plot_model_decision_boundary(model, X, y):
    # Convert X to numpy array if it's a DataFrame
    if isinstance(X, pd.DataFrame):
        X = X.values

    # Generate a grid of points
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    # Predict the class labels for the grid points
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the decision boundary
    plt.contourf(xx, yy, Z, alpha=0.4)

    # Plot the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')

    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Classification with Hyperplane')
    plt.show()


In [123]:
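# The plotting grid is 2-D, so fit a separate SVC on two features only
# (the column names "Glucose" and "BMI" are assumed from the notes below).
plot_columns = ["Glucose", "BMI"]
model_2d = svm.SVC(kernel="linear", random_state=0, C=1.0)
model_2d.fit(X_train[plot_columns].values, y_train)
plot_model_decision_boundary(model_2d, X_train[plot_columns], y_train)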

### Understanding the accuracy...
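Now that we can make model predictions, it is important to consider whether the model is overfitting or underfitting. Overfitting is where the model fits the training data too closely, including its noise, so it scores well on the training data but generalises poorly to unseen data. Underfitting is where the model is too simple to capture the underlying pattern, so it scores poorly on both the training and the test data.

**Note: if the training accuracy is much higher than the test accuracy, this typically means the model is overfitting. If both accuracies are low, the model is likely underfitting. Here the test accuracy (0.818) is slightly higher than the training accuracy (0.765), so there is no sign of overfitting.**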
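A simple way to apply this check is to compare the two accuracy scores directly. The sketch below is only a rule of thumb: the helper `describe_fit` and its thresholds (`gap_tolerance`, `low_accuracy`) are hypothetical values chosen for illustration, and the inputs are the two accuracies printed earlier in this notebook.

```python
def describe_fit(train_accuracy, test_accuracy, gap_tolerance=0.05, low_accuracy=0.6):
    """Give a rough reading of the train/test accuracy pair (assumed thresholds)."""
    gap = train_accuracy - test_accuracy
    if gap > gap_tolerance:
        return "possible overfitting"
    if train_accuracy < low_accuracy and test_accuracy < low_accuracy:
        return "possible underfitting"
    return "no clear sign of overfitting"

# Accuracies reported by the SVM cells above
training_accuracy = 0.7654723127035831
test_accuracy = 0.8181818181818182

print(describe_fit(training_accuracy, test_accuracy))  # no clear sign of overfitting
```

A single train/test split is noisy on a dataset of this size, so a small gap in either direction should not be over-interpreted.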

## Comparison of Two Models (10%)

- Summarize the key differences and similarities between SVM and ensemble methods in terms of their algorithmic characteristics and performance.
- Emphasize the importance of considering the specific requirements and constraints of the problem domain when selecting between SVM and ensemble methods.


# Notes
- We load the dataset and split it into features (X) and the target variable (y).
- We split the data into training and testing sets.
- We create an SVM classifier with a linear kernel and train it on the training data.
- We make predictions on the test set and calculate the accuracy of the model.
- We visualize the decision boundary of the SVM model on a scatter plot using two features (Glucose and BMI). The points are colored according to the target variable (Outcome).

Data Cleaning:
- Handle missing values: Remove or impute missing values in the dataset.
- Remove outliers: Identify and handle outliers if necessary.

Feature Scaling:
- Scale the features to ensure they have similar ranges. This can be done using techniques like Min-Max scaling or standardization.

Train-Test Split:
- Split the dataset into training and testing sets to evaluate the model's performance.

Encoding Categorical Variables:
- Convert categorical variables into numerical format using techniques like one-hot encoding or label encoding.

Normalization:
- Normalize the data to have zero mean and unit variance to ensure features are on a similar scale.

# TO DO LIST:
- Look up the clinical indicators of diabetes so I can identify missing values in the dataset.
- Dropped BMI, as weight and height do not exist in the dataset and are therefore N/A.
