# Pattern Recognition and Machine Learning
By **Tzanetis Savvas** (10889) and **Zoidis Vasilis** (10652).

## Part D
Once again, the first thing we need to do is import the correct **libraries**.

In [7]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

The **numpy** library is used to export our predictions on the **Test** dataset to a **.npy** file, while **pandas** is used to read the **Test** and **TV** datasets from their **.csv** files. We also import the **SVC** and **StandardScaler** classes and the **cross_val_score** function from **sklearn**. We will explain the use of each of them later.

Next, we need to load the **TV** and **Test** datasets like so:

In [8]:
train_data = pd.read_csv("Datasets/datasetTV.csv", header=None)
test_data = pd.read_csv("Datasets/datasetTest.csv", header=None)

F_train = train_data.iloc[:, :-1].values
L_train = train_data.iloc[:, -1].values

F_test = test_data.values

In the code above, apart from loading the datasets using **pandas**, we also separate the features (every column except the last) from the labels (the last column) of the **TV** dataset, and extract the features of the **Test** dataset, which has no label column.
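
To make the slicing concrete, here is a minimal sketch on a made-up frame. The `toy` data and the `toy_features` / `toy_labels` names are hypothetical and only mimic the layout assumed above (feature columns first, label column last).

```python
import pandas as pd

# Hypothetical toy frame: two feature columns followed by one label column.
toy = pd.DataFrame([[0.1, 0.2, 1],
                    [0.3, 0.4, 2],
                    [0.5, 0.6, 1]])

toy_features = toy.iloc[:, :-1].values  # every column except the last
toy_labels = toy.iloc[:, -1].values     # only the last column

print(toy_features.shape)  # (3, 2)
print(toy_labels.shape)    # (3,)
```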

Next, using the aforementioned **StandardScaler**, we scale our data so that every feature has zero mean and unit variance. This ensures that all features contribute comparably to the model's decisions, since features with larger scales can no longer dominate the distance computations inside the kernel. Scaling also tends to speed up the convergence of the optimizer behind our model of choice, the **SVM (Support Vector Machine)**. Note that the scaler is fitted on the training features only and is then applied unchanged to the **Test** features.

In [9]:
scaler = StandardScaler()
F_train = scaler.fit_transform(F_train)
F_test = scaler.transform(F_test)
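
As a small illustration of what the scaler computes, the sketch below compares **StandardScaler** with the manual formula `(x - mean) / std` on made-up numbers. The `toy_*` arrays are hypothetical and are not part of our datasets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is 100 times larger than the first.
toy_train = np.array([[1.0, 100.0],
                      [2.0, 200.0],
                      [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

# Fit on the training rows only, as in the cell above.
toy_scaler = StandardScaler().fit(toy_train)

# Manual standardization using the per-column mean and standard deviation.
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print(np.allclose(toy_scaler.transform(toy_train), manual))  # True

# The test row is scaled with the *training* statistics.
scaled_test = toy_scaler.transform(toy_test)
print(scaled_test)
```

After scaling, both training columns have the same spread, so neither dominates simply because of its units.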

All that's left now is to train our model, but before doing so we must tune the **hyper-parameters** of our **Support Vector Machine**. We did this using **Grid Search**, falling back to **Random Search** whenever evaluating every combination in the grid was too computationally expensive. With this in mind, the best parameters we found are:

- *kernel = **rbf***
- *C = **4***
- *gamma = **scale***
- *random_state = **42***

In [10]:
svm_model = SVC(kernel='rbf', C=4, gamma='scale', random_state=42)
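
The search itself is not shown in this notebook. The following is only a minimal sketch of how such a **Grid Search** could look with `GridSearchCV`. The synthetic data from `make_classification` and the candidate values in `param_grid` are assumptions made for illustration and are not the grid we actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled TV features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Hypothetical candidate values; every combination is evaluated.
param_grid = {
    "kernel": ["rbf"],
    "C": [1, 4, 10],
    "gamma": ["scale", 0.01],
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.2f}")
```

When the grid becomes too large, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead of trying all of them. Note also that `random_state` is not a tuned parameter: according to the scikit-learn documentation it only affects `SVC` when probability estimates are enabled.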

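We also need to estimate our model's accuracy using **5-fold cross-validation**. This method splits the **TV** dataset into **5** folds, trains on four of them and validates on the remaining one, rotating until every fold has served as the validation set once. It helps us detect **overfitting** and gives a more reliable estimate of performance on *unseen* data than a single train/validation split.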

In [11]:
scores = cross_val_score(svm_model, F_train, L_train, cv=5, scoring='accuracy')
print(f"Cross-validation scores: {scores}")
print(f"Cross-validation accuracy: {scores.mean():.2f}")

Cross-validation scores: [0.864494   0.85877644 0.85134362 0.8506865  0.83981693]
Cross-validation accuracy: 0.85
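
One optional refinement: in the cells above the scaler was fitted on the whole **TV** dataset before cross-validation, so each validation fold has already influenced the scaling statistics. The effect is usually small, but it can be avoided by wrapping the scaler and the classifier in a pipeline, so that the scaler is refitted inside every fold. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo` and `pipeline` are hypothetical names, and with our data the unscaled features and labels would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the *unscaled* TV features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0,
)

# The scaler is fitted on the training part of each fold only.
pipeline = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=4, gamma="scale"),
)

pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean accuracy: {pipeline_scores.mean():.2f} (std {pipeline_scores.std():.2f})")
```

Reporting the standard deviation next to the mean also shows how much the accuracy varies from fold to fold.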


Last but not least, we fit our model on the full training (**TV**) dataset and use it to predict the labels of the given **Test** dataset.

In [12]:
svm_model.fit(F_train, L_train)

labelsX = svm_model.predict(F_test)
np.save("Results/labelsX.npy", labelsX)
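
As a quick sanity check, the exported file can be read back with `np.load`. The sketch below demonstrates the round trip with a hypothetical `demo_labels` array written to a temporary directory rather than to our actual results file. Keep in mind that `np.save` does not create missing directories, so the target folder has to exist beforehand.

```python
import os
import tempfile

import numpy as np

# Hypothetical predictions standing in for labelsX.
demo_labels = np.array([1, 3, 2, 5, 4])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "labels_demo.npy")
    np.save(demo_path, demo_labels)   # write the array in .npy format
    reloaded = np.load(demo_path)     # read it back

# The reloaded array has the same shape, dtype and values.
print(np.array_equal(reloaded, demo_labels))  # True
```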