# Baseline models
This notebook takes the preprocessed data, including the extra engineered features, and uses a train, validation, and test split to find the best baseline model.

**To Do list:**
- Load libraries and data
- Implement Train-Validation-Test split
- Evaluation function
- KNN
- SVM (c/g)
- LDA
- Random Forest

## Load libraries and Data

In [1]:
# Libraries
import os
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Constant Variables
RANDOM_SEED = 26
BASE_DIR = os.getcwd()  # Works in Jupyter
print(BASE_DIR)

# Data (the path is relative to the notebook's working directory)
relative_path = "../data/processed/processed_data.pkl"
df = pd.read_pickle(relative_path)

c:\Users\Zita\Repositories\affective-states\notebooks


## Splitting Data

In [13]:
# Split independent and dependent variables
X_labels = ['EDA_quality_idx', 'PPG_quality_idx', 'mean_filt_EDA', 'std_filt_EDA',
       'max_filt_EDA', 'min_filt_EDA', 'mean_filt_EDA_dot',
       'std_filt_EDA_dot', 'max_filt_EDA_dot', 'min_filt_EDA_dot', 'mean_filt_EDA_ddot', 'std_filt_EDA_ddot',
       'max_filt_EDA_ddot', 'min_filt_EDA_ddot', 'mean_EDR', 'std_EDR',
       'max_EDR', 'min_EDR', 'mean_EDR_dot', 'std_EDR_dot',
       'max_EDR_dot', 'min_EDR_dot',  'mean_EDR_ddot',
       'std_EDR_ddot', 'max_EDR_ddot', 'min_EDR_ddot', 'mean_hr', 'std_hr',
       'max_hr', 'min_hr', 'mean_hr_dot', 'std_hr_dot', 'max_hr_dot',
       'min_hr_dot', 'mean_diff_hr_time', 'std_diff_hr_time',
       'max_diff_hr_time', 'min_diff_hr_time', 'SDNN', 'rMSSD', 'n_peaks_EDA',
       'mean_peak_amp_EDA', 'std_peak_amp_EDA']
y_labels = ['ar_seg', 'vl_seg']
remaining_labels = df.columns.difference(X_labels + y_labels)

X = df[X_labels].copy()
y = df[y_labels].copy()
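
Before splitting, it can help to check which feature columns contain missing values, since several scikit-learn estimators reject NaN input. A minimal sketch on a small made-up frame (the helper `missing_report` and the toy values are hypothetical and not part of this notebook):

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return NaN counts and fractions for the columns that contain NaNs."""
    counts = frame.isna().sum()
    report = pd.DataFrame({"n_missing": counts, "frac_missing": counts / len(frame)})
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy stand-in for X; the values are made up for illustration only
toy_X = pd.DataFrame({
    "mean_hr": [70.0, np.nan, 82.0, 75.0],
    "SDNN": [0.05, 0.04, np.nan, np.nan],
    "n_peaks_EDA": [3, 1, 0, 2],
})
report = missing_report(toy_X)
print(report)
```

On the real data the same call would be `missing_report(X)`.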

In [14]:
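# Split train, validation and test data (60% train, 20% validation, 20% test)

# Shuffle X and y with the same seed so their rows stay aligned
X_shuffled = X.sample(frac=1, random_state=RANDOM_SEED)
y_shuffled = y.sample(frac=1, random_state=RANDOM_SEED)
train_end, val_end = int(.6 * len(X)), int(.8 * len(X))

# Slice with iloc (np.split on a DataFrame triggers a pandas FutureWarning) and reset the index
X_train = X_shuffled.iloc[:train_end].reset_index(drop=True)
X_validate = X_shuffled.iloc[train_end:val_end].reset_index(drop=True)
X_test = X_shuffled.iloc[val_end:].reset_index(drop=True)
y_train = y_shuffled.iloc[:train_end].reset_index(drop=True)
y_validate = y_shuffled.iloc[train_end:val_end].reset_index(drop=True)
y_test = y_shuffled.iloc[val_end:].reset_index(drop=True)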
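
A 60/20/20 split can also be produced with scikit-learn's `train_test_split`, applied twice, which shuffles X and y together in one call. A sketch on made-up data (the toy frames are assumptions, and the shuffling will not reproduce the exact rows of the split above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_SEED = 26

# Toy stand-ins for X and y; values are made up for illustration only
rng = np.random.default_rng(RANDOM_SEED)
toy_X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
toy_y = pd.DataFrame({"ar_seg": rng.integers(0, 2, 100), "vl_seg": rng.integers(0, 2, 100)})

# Hold out 20% for testing, then take 25% of the remaining 80% (20% overall) for validation
X_rest, X_test, y_rest, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=RANDOM_SEED)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=RANDOM_SEED)

print(len(X_train), len(X_validate), len(X_test))
```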


## Evaluation function

In [15]:
# Store results
results = {}

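# Helper function to train, predict, and evaluate
def evaluate_model(name, model, X_train, y_train, X_validate, y_validate, X_test, y_test):
    """Fit `model` on the training set and store validation/test accuracy and weighted F1 in `results[name]`."""
    model.fit(X_train, y_train)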
    y_val_pred = model.predict(X_validate)
    y_test_pred = model.predict(X_test)

    results[name] = {
        "val_accuracy": accuracy_score(y_validate, y_val_pred),
        "val_f1": f1_score(y_validate, y_val_pred, average='weighted'),
        "test_accuracy": accuracy_score(y_test, y_test_pred),
        "test_f1": f1_score(y_test, y_test_pred, average='weighted'),
    }
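
Note that `y` holds two target columns (`ar_seg` and `vl_seg`), while `SVC` and `LinearDiscriminantAnalysis` expect a one-dimensional target. One possible way around this, assuming each target is modeled independently, is to fit a fresh copy of the model per column. The helper `evaluate_per_target` and the toy data below are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import SVC

def evaluate_per_target(name, model, X_train, y_train, X_validate, y_validate):
    """Fit one clone of `model` per target column and return validation scores."""
    scores = {}
    for target in y_train.columns:
        fitted = clone(model).fit(X_train, y_train[target])
        pred = fitted.predict(X_validate)
        scores[f"{name} [{target}]"] = {
            "val_accuracy": accuracy_score(y_validate[target], pred),
            "val_f1": f1_score(y_validate[target], pred, average="weighted"),
        }
    return scores

# Toy data: made-up features and binary labels, for illustration only
rng = np.random.default_rng(26)
X_toy = pd.DataFrame(rng.normal(size=(80, 4)))
y_toy = pd.DataFrame({
    "ar_seg": (X_toy[0] > 0).astype(int),
    "vl_seg": (X_toy[1] > 0).astype(int),
})

scores = evaluate_per_target(
    "SVM-c (linear)", SVC(kernel="linear", C=1),
    X_toy.iloc[:60], y_toy.iloc[:60], X_toy.iloc[60:], y_toy.iloc[60:])
print(pd.DataFrame(scores).T)
```

Each target gets its own row in the resulting table, so arousal and valence scores can be compared separately.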

## Model training

In [16]:
# K-Nearest Neighbors (KNN)
knn = KNeighborsClassifier(n_neighbors=5)
evaluate_model("KNN", knn, X_train, y_train, X_validate, y_validate, X_test, y_test)

# Support Vector Machine with linear kernel (SVM-c)
svm_c = SVC(kernel='linear', C=1)
evaluate_model("SVM-c (linear)", svm_c, X_train, y_train, X_validate, y_validate, X_test, y_test)

# Support Vector Machine with RBF kernel (SVM-g)
svm_g = SVC(kernel='rbf', gamma='scale', C=1)
evaluate_model("SVM-g (RBF)", svm_g, X_train, y_train, X_validate, y_validate, X_test, y_test)

# Linear Discriminant Analysis (LDA)
lda = LinearDiscriminantAnalysis()
evaluate_model("LDA", lda, X_train, y_train, X_validate, y_validate, X_test, y_test)

# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_SEED)
evaluate_model("Random Forest", rf, X_train, y_train, X_validate, y_validate, X_test, y_test)

ValueError: Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
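
The `ValueError` above is raised because `X` contains NaN values. One option, assuming median imputation is acceptable for these features, is to wrap each estimator in a pipeline that imputes (and, for distance-based models such as KNN, scales) before fitting, so the imputation statistics are learned from the training set only. A sketch on made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with missing entries; values are made up for illustration only
rng = np.random.default_rng(26)
X_toy = rng.normal(size=(100, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # set roughly 10% of entries to NaN

# Impute with the training-set median, standardize, then classify
knn_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5),
)
knn_pipeline.fit(X_toy[:80], y_toy[:80])
val_pred = knn_pipeline.predict(X_toy[80:])
print(val_pred.shape)
```

A pipeline built this way can be passed to `evaluate_model` in place of the bare estimator. Dropping rows with missing values is the other route mentioned in the error message.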

### KNN

### SVM-c

### SVM-g

### LDA

### Random Forest

## Compare Results

In [None]:
# Show results (one row per model, one column per metric)
results_df = pd.DataFrame(results).T
print(results_df)
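
To pick a baseline from this table, one reasonable convention is to rank models on a validation metric only and read the test score of the chosen model afterwards. A sketch, where the model names and numbers are placeholders made up for illustration and are not results from this notebook:

```python
import pandas as pd

# Placeholder scores, NOT real results
placeholder_results = {
    "model_a": {"val_accuracy": 0.60, "val_f1": 0.58, "test_accuracy": 0.57, "test_f1": 0.55},
    "model_b": {"val_accuracy": 0.66, "val_f1": 0.65, "test_accuracy": 0.61, "test_f1": 0.60},
}
placeholder_df = pd.DataFrame(placeholder_results).T

# Rank on validation F1 only; the test columns are not used for selection
ranked = placeholder_df.sort_values("val_f1", ascending=False)
best_name = ranked.index[0]
print(ranked)
print("Selected on validation F1:", best_name)
```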