# IGP 5 Models

## Preprocessing

1. Import the files into a dataframe
2. Extract 'full' days (1440 rows per date, i.e. one row per minute)
3. Extract the number of days per id that matches `scores.csv`
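
The 'full day' filter in step 2 can be sketched as follows. This is a minimal illustration, not the code in `preprocess.py`: the function name `keep_full_days` and the column names `id` and `activity` are assumptions (only `timestamp` and `date` appear in this notebook).

```python
import pandas as pd

MINUTES_PER_DAY = 1440  # one reading per minute for a complete day


def keep_full_days(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the (id, date) groups that contain exactly 1440 rows."""
    rows_per_day = df.groupby(["id", "date"])["timestamp"].transform("size")
    return df[rows_per_day == MINUTES_PER_DAY].reset_index(drop=True)


# toy data (hypothetical): one complete day and one partial day for the same id
full = pd.DataFrame({
    "id": "condition_1",
    "timestamp": pd.date_range("2003-05-08 00:00", periods=1440, freq="min"),
    "activity": 0,
})
partial = pd.DataFrame({
    "id": "condition_1",
    "timestamp": pd.date_range("2003-05-09 00:00", periods=600, freq="min"),
    "activity": 0,
})
toy = pd.concat([full, partial], ignore_index=True)
toy["date"] = toy["timestamp"].dt.normalize()

result = keep_full_days(toy)
print(result["date"].nunique(), len(result))
```

Only the complete day survives the filter; the partial day is dropped as a whole rather than padded.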

In [1]:
# load functions in python file with magic command
%run ../code/preprocess.py

In [2]:
import pandas as pd
folderpath = '../depresjon'
output_csv_path = '../output/'
scores_csv_path = '../depresjon/scores.csv'

# extract files
df = extract_from_folder(folderpath)

# extract full days (true days)
full_df = preprocess_full_days(df)

# extract days per scores 
final = extract_days_per_scores(full_df, scores_csv_path)

# pivot df to wide format
final_pivot = pivot_dataframe(final)

In [3]:
# save to csv
final_pivot.to_csv(output_csv_path + 'preprocessed-wide.csv', index=False)
final.to_csv(output_csv_path + 'preprocessed-long.csv', index=False)

In [4]:
# list of variable names to delete
var_list = ['df', 'full_df', 'final', 'final_pivot']

# remove each variable from the notebook namespace if it exists;
# globals() is used because writing to locals() is not reliable
for var in var_list:
    globals().pop(var, None)


### Notes

* Kept all (id, date) combinations to maximise the amount of data
* Will split into train, validation, and test sets
* Will keep proportions across the splits
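
A possible way to keep proportions when splitting, sketched with pandas only. It assumes that the proportions refer to the class labels; the function name `stratified_split`, the column name `label`, and the 70/15/15 ratio are all hypothetical and not taken from `model.py`.

```python
import pandas as pd


def stratified_split(df, label_col="label", frac=(0.7, 0.15, 0.15), seed=42):
    """Split into train/validation/test while keeping the label proportions."""
    # sample the training share separately within each label
    train = df.groupby(label_col).sample(frac=frac[0], random_state=seed)
    rest = df.drop(train.index)
    # share of the remaining rows that goes to validation
    val_share = frac[1] / (frac[1] + frac[2])
    val = rest.groupby(label_col).sample(frac=val_share, random_state=seed)
    test = rest.drop(val.index)
    return train, val, test


# toy data (hypothetical): 60 rows of label 0 and 40 rows of label 1
toy = pd.DataFrame({"label": [0] * 60 + [1] * 40, "mean": range(100)})
train, val, test = stratified_split(toy)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), part["label"].mean())
```

Each split keeps the same share of label 1 as the full toy dataset.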



## Import from CSV

1. Import the preprocessed CSV file

In [1]:
import pandas as pd
output_csv_path = '../output/'
scores_csv_path = '../depresjon/scores.csv'

# import from csv
df = pd.read_csv(output_csv_path + 'preprocessed-long.csv', parse_dates=['timestamp', 'date'])

## Features

>All features are computed at row level, therefore there is no data leakage

* inactiveDay
* activeNight
* inactiveLight
* activeDark
* mean
* std
* percentZero
* kurtosis
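
The four statistical features can be sketched as below, to show why row-level features do not leak: each value is computed from its own row only, so nothing crosses a later train/test split. This is an illustration rather than the code in `features.py`. It assumes a wide layout with one row per day and one column per minute, and it assumes `percentZero` is expressed on a 0-100 scale. The day/night and light/dark features are left out because they depend on thresholds and sunlight data not shown here.

```python
import pandas as pd


def basic_row_features(wide: pd.DataFrame) -> pd.DataFrame:
    """Compute per-row statistics; every feature uses values from its own row only."""
    return pd.DataFrame({
        "mean": wide.mean(axis=1),
        "std": wide.std(axis=1),
        "percentZero": (wide == 0).mean(axis=1) * 100,  # assumed 0-100 scale
        "kurtosis": wide.kurt(axis=1),
    })


# toy data (hypothetical): two 'days' with four readings each
toy = pd.DataFrame([[0, 0, 4, 8], [1, 2, 3, 10]])
features = basic_row_features(toy)
print(features)
```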

In [2]:
# load functions in python file with magic command
%run ../code/features.py

In [3]:
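# calculate features
# NOTE: sunlight_df is not created in the cells shown here; it must already be defined before this call
features_full = calculate_all_features(df, sunlight_df)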

In [None]:
features_full

## Split into Female, Male, Both datasets

In [14]:
# load functions in python file with magic command
%run ../code/model.py


In [15]:

male, female, both = split_and_prepare_data(features_full)

# shapes of the datasets 
print(f"Male dataset shape: {male.shape}")
print(f"Female dataset shape: {female.shape}")
print(f"Both genders dataset shape: {both.shape}")


Male dataset shape: (310, 9)
Female dataset shape: (383, 9)
Both genders dataset shape: (693, 9)


## Model

In [None]:
# training and validation sets
X_train, X_validation, y_train, y_validation = validation_data(male)

# evaluate models
results = evaluate_models(models1, X_train, y_train)


In [61]:
#print_top_models(results, metric='accuracy')
print_top_models(results, top_n=3)
#print_top_models(results, metric='mcc', top_n=10)
#print_top_models(results, metric='f1', top_n=10)
#print_top_models(results, metric='training_time', top_n=10)

Top 3 models for training time (fastest to slowest):
1. Naive Bayes: 0.018370437622070312 seconds
2. Decision Tree: 0.022133302688598634 seconds
3. SVC linear: 0.0240386962890625 seconds



* Model selection and evaluation strategy
  * start with many models (Garcia), no hyperparameter tuning
  * choose the best by MCC, F1, accuracy -> top 3 go into the next round
  * then look at feature importance -> rationale
  * then tune the hyperparameters of the final model
  * then look at an ensemble??
  * repeat for the other datasets

* Model evaluation
* Metric selection and reasons
  * `accuracy` - proportion of correct predictions; a good overall performance indicator
  * `recall (sensitivity)` - proportion of actual positives that are correctly identified, i.e. the ability to identify all actual cases of depression. Crucial for minimising false negatives, that is, failing to identify individuals who are depressed.
  * `precision` - proportion of predicted depression cases that are correct (true positives among all positive predictions). Important when false positives must be avoided (unnecessary concern, intervention, medication, treatment).
  * `F1` - harmonic mean of precision and recall; balances the two, especially with an imbalanced class distribution
  * `specificity` - proportion of actual negatives that are correctly identified by the model, i.e. the ability to identify non-depression correctly. Important to ensure healthy individuals are not misclassified.
  * `MCC` - takes into account true and false positives and negatives. A reliable statistic that produces a high score only if the prediction obtained good results in all four confusion matrix categories.
  * `ROC-AUC` (area under the receiver operating characteristic curve) - evaluates the model's ability to discriminate between depressed and non-depressed individuals across different threshold settings. A higher AUC indicates better discrimination.
  * `training time`
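
The threshold-based metrics above can all be derived from the four confusion matrix counts. A minimal sketch in plain Python, with made-up counts and a hypothetical function name (`classification_metrics`); it is not the implementation used by `evaluate_models`.

```python
import math


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics from true/false positive/negative counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# made-up counts, positive class = depressed
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC is the only metric here that uses all four counts in both numerator and denominator, which is why it stays low when any one category is poor.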

* TODO research Matthews Correlation Coefficient and F1 as key metrics - getting the balance right
* TODO add metric maths to slides and their importance (contextual)



* Feature importance analysis - SHAP, permutation feature importance
* Hyperparameter tuning
* Ensemble models
* Validation

Notes on Quadratic Discriminant Analysis (QDA):

* Flexible decision boundary: unlike linear classifiers such as Logistic Regression or Linear Discriminant Analysis (LDA), QDA models quadratic (non-linear) decision boundaries between classes. This lets it capture more complex relationships in the data.
* Unrestricted covariance matrices: QDA allows each class to have its own covariance matrix, whereas LDA assumes a common covariance matrix for all classes. This helps when the classes have different variances.
* Non-normal data: QDA assumes that the data within each class follows a multivariate normal distribution. It can still perform reasonably when this assumption is not strictly met, provided the departure from normality is not severe.
* Small datasets: because QDA estimates a separate covariance matrix for each class, it has more parameters than LDA and needs enough samples in every class. With small datasets it can overfit, so the covariance estimates may need regularising.
* Outliers: the per-class mean and covariance estimates are sensitive to outliers, so QDA should not be assumed to be more robust to outliers than Logistic Regression.
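
The per-class covariance idea can be illustrated with a small NumPy sketch of the QDA decision rule. It is written for illustration on synthetic data and is not the classifier used in `model.py`; the function names `fit_qda` and `predict_qda` are hypothetical.

```python
import numpy as np


def fit_qda(X, y):
    """Estimate a prior, mean and covariance matrix separately for each class."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = {
            "prior": len(Xk) / len(X),
            "mean": Xk.mean(axis=0),
            "cov": np.cov(Xk, rowvar=False),  # class-specific, unlike LDA
        }
    return params


def predict_qda(params, X):
    """Assign each row to the class with the highest quadratic discriminant score."""
    classes = list(params)
    scores = []
    for k in classes:
        p = params[k]
        diff = X - p["mean"]
        inv_cov = np.linalg.inv(p["cov"])
        # squared Mahalanobis distance of every row to the class mean
        maha = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        _, logdet = np.linalg.slogdet(p["cov"])
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(p["prior"]))
    return np.array(classes)[np.argmax(scores, axis=0)]


# synthetic data: class 0 is tightly clustered, class 1 is widely spread
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 1.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_qda(X, y)
accuracy = (predict_qda(params, X) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Each class needs its own covariance estimate, so the number of parameters grows with the number of classes and the square of the number of features. This is where the small-sample caveat comes from.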