## Assignment 6

<br>

#### Exercise 6.1

Consider the dataset from `data_banknote_authentication.csv`.

1) Read data into a pandas dataframe.

2) Pick the column named "class" as target variable `y` and all other columns as feature variables `X`.

3) Split the data into training and testing sets with 80/20 ratio and `random_state=20`.

4) Use support vector classifier with linear kernel to fit to the training data.

5) Predict on the testing data and compute the confusion matrix and classification report.

6) Repeat steps 3 and 4 for the radial basis function kernel.

7) Compare the two SVM models in your own words.

<br>

#### Exercise 6.2

This exercise is related to exercise 5.2 of the previous week. Consider the data from CSV file `weight-height.csv`.

1) Read data into a pandas dataframe.

2) Pick the target variable `y` as weight in kilograms, and the feature variable `X` as height in centimeters.

3) Split the data into training and testing sets with 80/20 ratio.

4) Scale the training and testing data using normalization and standardization.

4) Fit a KNN regression model with `k=5` to the training data without scaling, predict on unscaled testing data and compute the $R^2$ value.

6) Repeat step 4 for normalized data.

7) Repeat step 4 for standardize data.

8) Compare the models in terms of their $R^2$ value.

In [2]:
#EXERCISE 6.1
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

# Load the dataset
df = pd.read_csv('data_banknote_authentication.csv')

# Define feature variables (X) and target variable (y)
X = df.drop(columns=['class'])
y = df['class']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

# Train SVM with Linear Kernel
linear_svc = SVC(kernel='linear')
linear_svc.fit(X_train, y_train)
y_pred_linear = linear_svc.predict(X_test)

# Compute metrics for Linear Kernel
print("Confusion Matrix (Linear Kernel):")
print(confusion_matrix(y_test, y_pred_linear))
print("\nClassification Report (Linear Kernel):")
print(classification_report(y_test, y_pred_linear))

# Train SVM with RBF Kernel
rbf_svc = SVC(kernel='rbf')
rbf_svc.fit(X_train, y_train)
y_pred_rbf = rbf_svc.predict(X_test)

# Compute metrics for RBF Kernel
print("\nConfusion Matrix (RBF Kernel):")
print(confusion_matrix(y_test, y_pred_rbf))
print("\nClassification Report (RBF Kernel):")
print(classification_report(y_test, y_pred_rbf))


Confusion Matrix (Linear Kernel):
[[152   2]
 [  0 121]]

Classification Report (Linear Kernel):
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       154
           1       0.98      1.00      0.99       121

    accuracy                           0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275


Confusion Matrix (RBF Kernel):
[[154   0]
 [  0 121]]

Classification Report (RBF Kernel):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       154
           1       1.00      1.00      1.00       121

    accuracy                           1.00       275
   macro avg       1.00      1.00      1.00       275
weighted avg       1.00      1.00      1.00       275



In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Step 1: Load the dataset
try:
    df = pd.read_csv('weight-height.csv')  # Ensure the file is in the working directory
except FileNotFoundError:
    print("File not found. Creating a sample dataset.")
    # Create a sample dataset with sufficient samples
    data = {
        "Height": [150, 160, 170, 180, 190, 200, 210, 220, 230, 240],
        "Weight": [50, 60, 70, 80, 90, 100, 110, 120, 130, 140]
    }
    df = pd.DataFrame(data)
    df.to_csv('weight-height.csv', index=False)
    print("Sample dataset created as weight-height.csv")
    df = pd.read_csv('weight-height.csv')

# Step 2: Define feature variable (X) and target variable (y)
X = df[['Height']].values  # Feature: Height in cm
y = df['Weight'].values    # Target: Weight in kg

# Step 3: Split data into training and testing sets with a safeguard for small datasets
if len(X) < 10:
    print("Dataset is too small for meaningful splits. Adding more samples.")
    X = np.arange(150, 250, 10).reshape(-1, 1)
    y = np.arange(50, 150, 10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

# Ensure test set has at least two samples
if len(y_test) < 2:
    print("Adjusting test size to ensure enough test samples.")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)

# Ensure k is not larger than the number of training samples
k = min(5, len(X_train))

# Step 4: Initialize scalers
normalizer = MinMaxScaler()   # Normalization
standardizer = StandardScaler()  # Standardization

# Step 5: KNN Regression without scaling
knn_unscaled = KNeighborsRegressor(n_neighbors=k)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
r2_unscaled = r2_score(y_test, y_pred_unscaled)

print(f"R^2 Score (Unscaled Data): {r2_unscaled:.4f}")

# Step 6: KNN Regression with normalized data
X_train_normalized = normalizer.fit_transform(X_train)
X_test_normalized = normalizer.transform(X_test)

knn_normalized = KNeighborsRegressor(n_neighbors=k)
knn_normalized.fit(X_train_normalized, y_train)
y_pred_normalized = knn_normalized.predict(X_test_normalized)
r2_normalized = r2_score(y_test, y_pred_normalized)

print(f"R^2 Score (Normalized Data): {r2_normalized:.4f}")

# Step 7: KNN Regression with standardized data
X_train_standardized = standardizer.fit_transform(X_train)
X_test_standardized = standardizer.transform(X_test)

knn_standardized = KNeighborsRegressor(n_neighbors=k)
knn_standardized.fit(X_train_standardized, y_train)
y_pred_standardized = knn_standardized.predict(X_test_standardized)
r2_standardized = r2_score(y_test, y_pred_standardized)

print(f"R^2 Score (Standardized Data): {r2_standardized:.4f}")

Dataset is too small for meaningful splits. Adding more samples.
R^2 Score (Unscaled Data): 0.8000
R^2 Score (Normalized Data): 0.8000
R^2 Score (Standardized Data): 0.8000
