# Final Model

Possible explorations:

* Rework Rodriguez - data leakage.
* Male, Female, Full
* Activity / Inactivity
* Feature importance
* Data leakage


## Preprocessing dataset

* extract data
* keep full days only (1440 min per date)
* reduce by 'days' on `scores_csv`, minimum days = 7, exact days = 7
  * this means `condition_8` gets dropped (only 5 days)
  * remaining 54 people have full week's worth of data
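
The "keep full days only" step can be sketched as below. This is a minimal illustration on toy data, not the contents of `preprocess_full_days` (which lives in `final-1-rodriguez.py` and is not shown here). The helper name `keep_full_days`, the toy participant id and the column names `id`, `timestamp`, `date` and `activity` are assumptions made for the example.

```python
import pandas as pd

MINUTES_PER_DAY = 1440

def keep_full_days(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only (id, date) groups with all 1440 minute rows."""
    rows_per_day = df.groupby(["id", "date"])["timestamp"].transform("size")
    return df[rows_per_day == MINUTES_PER_DAY].reset_index(drop=True)

# Toy data: one full day and one partial day for a made-up participant.
full_day = pd.date_range("2003-05-08 00:00", periods=1440, freq="min")
partial_day = pd.date_range("2003-05-09 00:00", periods=600, freq="min")
timestamps = full_day.append(partial_day)

toy = pd.DataFrame({
    "id": "condition_1",
    "timestamp": timestamps,
    "date": timestamps.date,
    "activity": 0,
})

kept = keep_full_days(toy)
print(kept["date"].nunique(), len(kept))
```

Only the complete day survives the filter; the 600-row partial day is dropped as a whole rather than padded.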

In [39]:
import pandas as pd

# load functions in python file with magic command
%run ../code/final-1-rodriguez.py

In [42]:
# paths
folderpath = '../data/depresjon'
output_csv_path = '../data/petter/final.csv'
scores_csv_path = '../data/depresjon/scores.csv'

# full ds, no csv
df = extract_from_folder(folderpath)

# keep full days only
full_df = preprocess_full_days(df, print_info=False)

# reduce to `num_days` in scores.csv
reduce_df = extract_days_per_scores(full_df, scores_csv_path)


In [43]:
# add scores df
final_scores_df = add_scores(reduce_df, scores_df = pd.read_csv(scores_csv_path))

# drop columns at positions 5, 6 and 9-16 (`columns=` already implies axis=1)
final_df = final_scores_df.drop(columns=final_scores_df.columns[[5, 6, *range(9, 17)]])

In [44]:
# save to csv
final_df.to_csv('../data/petter/final-rod-1-all.csv', index=False)

## Exploration - Rodriguez with no `data leakage`

* Avoiding data leakage with 14 features used by Rodriguez
* Split into separate datasets before engineering features


In [45]:
import pandas as pd

# load functions in python file with magic command
%run ../code/final-1-rodriguez.py

In [46]:
# import dataset, timestamp, date as datetime
df_rod = pd.read_csv('../data/petter/final-rod-1-all.csv', parse_dates=['timestamp', 'date'])

# drop cols - gender and age
df_rod = df_rod.drop(['gender', 'age'], axis=1)


Reshape the dataframe - one row per hour, one column per minute

In [47]:
# extract hour and minute from timestamp
df_rod['hour'] = df_rod['timestamp'].dt.hour
df_rod['minute'] = df_rod['timestamp'].dt.minute

# pivot the DataFrame
df_pivot = df_rod.pivot(index=['date', 'id', 'label', 'hour'], columns='minute', values='activity')

# rename columns from the minute values actually present (robust if a minute is missing)
df_pivot.columns = [f'min_{minute:02d}' for minute in df_pivot.columns]

# reset index
df_pivot.reset_index(inplace=True)

In [81]:
#missing = df_pivot[df_pivot.isnull().any(axis=1)]
#print(missing)

New dataframes - day, night, full

In [48]:
#  subsets based on time ranges
day = df_pivot[(df_pivot['hour'] >= 8) & (df_pivot['hour'] < 20)]  # day: 8 am to 8 pm
night = df_pivot[(df_pivot['hour'] >= 21) | (df_pivot['hour'] < 7)]  # night: 9 pm to 7 am
full = df_pivot  # full day:  24 hours

# print shapes
print(day.shape)
print(night.shape)
print(full.shape)

(12348, 64)
(10290, 64)
(24696, 64)
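
As a quick sanity check on the shapes above, the hour filters can be recomputed in plain Python. The sketch assumes each row of `df_pivot` is one (id, date, hour) combination and that every retained day has all 24 hours, which follows from keeping full days only. The names `DAY_HOURS`, `NIGHT_HOURS` and `uncovered` are introduced here for illustration.

```python
# Hours selected by the same conditions used for the `day` and `night` subsets.
DAY_HOURS = {h for h in range(24) if 8 <= h < 20}
NIGHT_HOURS = {h for h in range(24) if h >= 21 or h < 7}

# Hours that fall into neither subset.
uncovered = sorted(set(range(24)) - DAY_HOURS - NIGHT_HOURS)

full_rows = 24696            # rows in `full`, from the printed shape
n_days = full_rows // 24     # person-days, assuming 24 hourly rows per day

print(len(DAY_HOURS), len(NIGHT_HOURS), uncovered)
print(n_days, n_days * len(DAY_HOURS), n_days * len(NIGHT_HOURS))
```

The row counts match the printed shapes (12 and 10 hours per person-day). Hours 7 and 20 belong to neither `day` nor `night`, so the two subsets together do not cover the `full` dataset.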


### Features

* Ensure that split into test/train is done before calculating features
* Functions below split the datasets first and then standardise and calculate 14 features. 
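
The split-then-standardise order can be sketched as follows. This is an illustration on synthetic data, not the body of `preprocess_and_calculate_features`. The 80/20 split, the random data and the three placeholder features (row mean, standard deviation, maximum) are assumptions and are not the 14 Rodriguez features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 60 per-minute columns and a binary label.
X = rng.normal(loc=50.0, scale=10.0, size=(200, 60))
y = rng.integers(0, 2, size=200)

# 1. Split first, before any statistics are computed.
idx = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# 2. Estimate the standardisation parameters on the training rows only,
#    then apply those same parameters to the test rows.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

# 3. Derive features from the standardised rows (placeholder features).
def row_features(X_std: np.ndarray) -> np.ndarray:
    return np.column_stack([X_std.mean(axis=1), X_std.std(axis=1), X_std.max(axis=1)])

F_train, F_test = row_features(X_train_std), row_features(X_test_std)
print(F_train.shape, F_test.shape)
```

Because `mu` and `sigma` never see the test rows, nothing about the test distribution reaches the model during training. Computing them on the full dataset before splitting would be one source of leakage.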

### Features and Random Forest model

* `preprocess_and_calculate_features` - for feature generation
* `fit_and_evaluate` - for model (Random Forest)

In [49]:
dfs = [day, night, full]
df_names = ['"day"', '"night"', '"full"']

# distinct loop variable names, so the `df_names` list is not overwritten and the cell can be re-run
for subset_df, df_name in zip(dfs, df_names):
    print(f'Processing {df_name} dataset')
    X_train, y_train, X_test, y_test = preprocess_and_calculate_features(subset_df)
    print(f'X_train shape: {X_train.shape}')
    print(f'y_train shape: {y_train.shape}')
    print(f'X_test shape: {X_test.shape}')
    print(f'y_test shape: {y_test.shape}')
    print('')

    print(f'Fitting model for {df_name} dataset')
    accuracy, f1, conf_matrix, recall, mcc, precision, roc_auc, specificity, support = fit_and_evaluate(X_train, y_train, X_test, y_test)

    print(f"Accuracy: {accuracy:.4f}, \nF1-score: {f1:.4f}, \nConfusion Matrix:\n{conf_matrix}\nRecall: {recall:.4f}, \nMCC: {mcc:.4f}")
    print("\n")

Processing "day" dataset
X_train shape: (10200, 23)
y_train shape: (10200,)
X_test shape: (2148, 23)
y_test shape: (2148,)

Fitting model for "day" dataset
Accuracy: 0.5605, 
F1-score: 0.4010, 
Confusion Matrix:
[[888 276]
 [668 316]]
Recall: 0.3211, 
MCC: 0.0937


Processing "night" dataset
X_train shape: (8500, 23)
y_train shape: (8500,)
X_test shape: (1790, 23)
y_test shape: (1790,)

Fitting model for "night" dataset
Accuracy: 0.5089, 
F1-score: 0.3644, 
Confusion Matrix:
[[659 311]
 [568 252]]
Recall: 0.3073, 
MCC: -0.0143


Processing "full" dataset
X_train shape: (20400, 23)
y_train shape: (20400,)
X_test shape: (4296, 23)
y_test shape: (4296,)

Fitting model for "full" dataset
Accuracy: 0.5517, 
F1-score: 0.4056, 
Confusion Matrix:
[[1713  615]
 [1311  657]]
Recall: 0.3338, 
MCC: 0.0760




### Interpretation

The model above recreates the Rodriguez model while avoiding data leakage: standardisation and feature generation happen only after the data is split into `train` and `test`.

The data was reduced according to the 'days' column in `scores.csv`, so there is an uneven number of days, but each day is full: it contains 1440 rows.

As a direct comparison to the same process with data leakage: 

**Day**
* Accuracy dropped from 0.7289 to 0.5605
* F1 stayed about the same, at around 0.40
* Recall dropped from 0.45 to 0.32
* MCC dropped from 0.22 to 0.09

**Night**
* Accuracy dropped from 0.7095 to 0.5089
* F1 stayed about the same, at around 0.35
* Recall dropped from 0.39 to 0.31
* MCC dropped from 0.17 to -0.01

**Full**
* Accuracy dropped from 0.71 to 0.55
* F1 improved slightly from 0.37 to 0.406
* Recall dropped from 0.42 to 0.33
* MCC dropped from 0.19 to 0.08
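
The leakage-free figures can be cross-checked directly from the printed confusion matrices. The sketch below uses the "day" matrix and assumes the layout `[[TN, FP], [FN, TP]]`, which is consistent with the reported recall and F1. It also computes the accuracy of always predicting the majority class. That baseline is added here for context and is not part of the original evaluation.

```python
import math

# "day" confusion matrix from the output above, read as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 888, 276, 668, 316
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Accuracy obtained by always predicting the larger class.
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.4f} recall={recall:.4f} f1={f1:.4f} mcc={mcc:.4f}")
print(f"majority baseline={majority_baseline:.4f}")
```

The recomputed accuracy, recall, F1 and MCC agree with the reported "day" values. The majority-class baseline comes out at about 0.54, so the leakage-free accuracy of 0.56 is only slightly better than always predicting the larger class.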


**It is clear that the methodology used by the authors of this study introduced significant data leakage. Given their reported accuracy of 0.99, it was even more leakage than I was able to introduce (accuracy of 0.73 at best).**

The model with data leakage can be found here:

[../petter/Rodriguez-recreation.ipynb](../petter/Rodriguez-recreation.ipynb)