# IGP 5 Models

## Preprocessing

1. import the files into a dataframe
2. extract 'full' days (1440 rows per date, i.e. one row per minute)
3. keep, for each participant, the number of days listed in scores.csv
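
Step 2 can be sketched as below. This is a minimal illustration, not the contents of `preprocess.py`: the function name `keep_full_days` and the column names `id`, `date`, `timestamp` and `activity` are assumptions.

```python
import pandas as pd

def keep_full_days(df: pd.DataFrame, rows_per_day: int = 1440) -> pd.DataFrame:
    """Keep only (id, date) groups that contain exactly `rows_per_day` rows."""
    # Broadcast each group's row count back onto its rows.
    counts = df.groupby(["id", "date"])["timestamp"].transform("size")
    return df[counts == rows_per_day].reset_index(drop=True)

# Toy data (hypothetical): one complete day and one partial day.
full = pd.date_range("2003-05-08 00:00", periods=1440, freq="min")
partial = pd.date_range("2003-05-09 00:00", periods=600, freq="min")
toy = pd.DataFrame({"timestamp": full.append(partial)})
toy["id"] = "p1"
toy["date"] = toy["timestamp"].dt.date
toy["activity"] = 0

full_days = keep_full_days(toy)
print(full_days["date"].nunique(), len(full_days))
```

Only the complete day survives the filter; the 600-row day is dropped.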

In [1]:
# load functions in python file with magic command
%run ../code/preprocess.py

In [2]:
import pandas as pd
folderpath = '../depresjon'
output_csv_path = '../output/'
scores_csv_path = '../depresjon/scores.csv'

# extract files
df = extract_from_folder(folderpath)

# extract full days (true days)
full_df = preprocess_full_days(df)

# extract days per scores 
final = extract_days_per_scores(full_df, scores_csv_path)

# pivot df to wide format
final_pivot = pivot_dataframe(final)

In [3]:
# save to csv
final_pivot.to_csv(output_csv_path + 'preprocessed-wide.csv', index=False)
final.to_csv(output_csv_path+ 'preprocessed-long.csv', index=False)

In [4]:
# list of variable names to delete
var_list = ['df', 'full_df', 'final', 'final_pivot']

# delete each variable from the notebook namespace if it exists
# (globals() is used because modifying locals() is only reliable at module level)
for var in var_list:
    globals().pop(var, None)


### Notes

* Kept all (id, date) combinations to maximise data
* Will split into train, test and validation sets
* Will keep proportions when splitting



## Features



To calculate the features: 

* **Day / Night** - determined by hours, e.g. 08:00-20:00

$\text{{day\_night}} = \begin{cases} 
0 & \text{{if }} \text{{day\_start}} \leq \text{{hour}} < \text{{day\_end}} \\
1 & \text{{otherwise}}
\end{cases}$
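
A minimal sketch of the day/night flag, assuming a `timestamp` column and the example hours 08:00-20:00. The function name `add_day_night` is hypothetical and not taken from `features.py`.

```python
import pandas as pd

def add_day_night(df: pd.DataFrame, day_start: int = 8, day_end: int = 20) -> pd.DataFrame:
    """Flag each row: 0 for day (day_start <= hour < day_end), 1 otherwise."""
    hour = df["timestamp"].dt.hour
    is_day = (hour >= day_start) & (hour < day_end)
    return df.assign(day_night=(~is_day).astype(int))

# Boundary check on hypothetical timestamps.
toy = pd.DataFrame({"timestamp": pd.to_datetime([
    "2003-05-08 07:59", "2003-05-08 08:00",
    "2003-05-08 19:59", "2003-05-08 20:00",
])})
flags = add_day_night(toy)["day_night"].tolist()
print(flags)
```

The lower bound is inclusive and the upper bound exclusive, matching the case definition above.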

* **Light / Dark** - determined by monthly sunset/sunrise times in Norway

$\text{{light\_dark}} = \begin{cases} 
0 & \text{{if }} \text{{sunrise\_time}} \leq \text{{timestamp}} < \text{{sunset\_time}} \\
1 & \text{{otherwise}}
\end{cases}$
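
A sketch of the light/dark flag using a monthly lookup table. The table layout (`month`, `sunrise`, `sunset` as "HH:MM" strings) and the function names are assumptions, and the times below are placeholders, not real sunrise/sunset values for Norway.

```python
import pandas as pd

# Placeholder lookup table: illustrative times only.
sunlight = pd.DataFrame({
    "month": [1, 6],
    "sunrise": ["09:00", "04:00"],
    "sunset": ["15:30", "22:30"],
})

def to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert "HH:MM" strings to minutes since midnight."""
    return hhmm.str[:2].astype(int) * 60 + hhmm.str[3:].astype(int)

def add_light_dark(df: pd.DataFrame, sunlight: pd.DataFrame) -> pd.DataFrame:
    """Flag each row: 0 if sunrise <= time of day < sunset for its month, else 1."""
    out = df.assign(month=df["timestamp"].dt.month).merge(sunlight, on="month", how="left")
    minute = out["timestamp"].dt.hour * 60 + out["timestamp"].dt.minute
    is_light = (minute >= to_minutes(out["sunrise"])) & (minute < to_minutes(out["sunset"]))
    out = out.assign(light_dark=(~is_light).astype(int))
    return out.drop(columns=["month", "sunrise", "sunset"])

toy = pd.DataFrame({"timestamp": pd.to_datetime([
    "2003-01-10 08:00", "2003-01-10 12:00",
    "2003-06-10 05:00", "2003-06-10 23:00",
])})
result = add_light_dark(toy, sunlight)
print(result["light_dark"].tolist())
```

Comparing minutes since midnight keeps the check independent of the calendar date, so one row per month is enough.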


* **Active / Inactive** - a minute is 'active' when its activity is at or above the `activity threshold` (5); a period is active where the rolling sum (window = 11) of active minutes is at or above the `rolling threshold` (2)

$\text{{active\_inactive}} = \begin{cases} 
1 & \text{{if }} \text{{activity}} \geq \text{{activity\_threshold}} \\
0 & \text{{otherwise}}
\end{cases}$

$\text{{rolling\_sum}} = \text{{rolling sum of }} \text{{active\_inactive}} \text{{ over a window of }} \text{{rolling\_window}}$

$\text{{active\_inactive\_period}} = \begin{cases} 
1 & \text{{if }} \text{{rolling\_sum}} \geq \text{{rolling\_threshold}} \\
0 & \text{{otherwise}}
\end{cases}$
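
The three definitions above can be sketched for a single participant-day as follows. The function name is hypothetical. The trailing window and `min_periods=1` are assumptions, since the notebook does not say how the window is aligned or how the first rows are handled.

```python
import pandas as pd

def add_active_period(df: pd.DataFrame, activity_threshold: int = 5,
                      rolling_window: int = 11, rolling_threshold: int = 2) -> pd.DataFrame:
    """Add the per-row active flag and the rolling-sum based active-period flag."""
    active = (df["activity"] >= activity_threshold).astype(int)
    # Trailing window over the current row and the rows before it (assumption).
    rolling_sum = active.rolling(rolling_window, min_periods=1).sum()
    period = (rolling_sum >= rolling_threshold).astype(int)
    return df.assign(active_inactive=active, active_inactive_period=period)

# Toy series: two active minutes (index 3 and 7) among 20 rows.
toy = pd.DataFrame({"activity": [0] * 3 + [10] + [0] * 3 + [8] + [0] * 12})
out = add_active_period(toy)
print(out["active_inactive_period"].tolist())
```

With these settings the period flag switches on at index 7, when the second active minute enters the window, and off at index 14, when the first one leaves it.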



>All features are row level: each one is computed separately for every (id, date) combination, so there is no data leakage or contamination between rows.


* **inactiveDay**: The proportion of time during the day when the participant is inactive.

$\text{{inactiveDay}} = \frac{{\text{{Number of inactive hours during the day}}}}{{\text{{Total number of hours during the day}}}}$


* **activeNight**: The proportion of time during the night when the participant is active.

$\text{{activeNight}} = \frac{{\text{{Number of active hours during the night}}}}{{\text{{Total number of hours during the night}}}}$

* **inactiveLight**: The proportion of time during periods of light (e.g., daytime) when the participant is inactive.

$\text{{inactiveLight}} = \frac{{\text{{Number of inactive hours during periods of light}}}}{{\text{{Total number of hours during periods of light}}}}$


* **activeDark**: The proportion of time during periods of darkness (e.g., nighttime) when the participant is active.

$\text{{activeDark}} = \frac{{\text{{Number of active hours during periods of darkness}}}}{{\text{{Total number of hours during periods of darkness}}}}$
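
A sketch of the four proportions, computed per (id, date) from the flag columns. The function name and column names are assumptions. The ratios are taken over rows, whatever time resolution those rows represent.

```python
import pandas as pd

def proportion_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per (id, date): inactive share of day/light rows, active share of night/dark rows."""
    active = df["active_inactive_period"] == 1
    day = df["day_night"] == 0
    light = df["light_dark"] == 0
    counts = pd.DataFrame({
        "id": df["id"], "date": df["date"],
        "day": day, "inactive_day": day & ~active,
        "night": ~day, "active_night": ~day & active,
        "light": light, "inactive_light": light & ~active,
        "dark": ~light, "active_dark": ~light & active,
    }).groupby(["id", "date"]).sum()  # boolean columns sum to row counts
    return pd.DataFrame({
        "inactiveDay": counts["inactive_day"] / counts["day"],
        "activeNight": counts["active_night"] / counts["night"],
        "inactiveLight": counts["inactive_light"] / counts["light"],
        "activeDark": counts["active_dark"] / counts["dark"],
    }).reset_index()

# Hypothetical participant-day with five rows.
toy = pd.DataFrame({
    "id": "p1", "date": "2003-05-08",
    "day_night": [0, 0, 1, 1, 0],
    "light_dark": [0, 0, 1, 1, 1],
    "active_inactive_period": [1, 0, 1, 0, 0],
})
feats = proportion_features(toy)
print(feats)
```

Because every ratio is computed inside one (id, date) group, no value depends on another participant or day.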


* **mean**: The average value of activity data for each hour of the day. It represents the central tendency of the data.

$\text{{mean}}_{\text{{person-date}}} = \frac{{\sum_{i=1}^{n} \text{{activity}}_{\text{{person-date}}}(i)}}{{n}}$


* **std**: The standard deviation of activity data for each hour of the day. It measures the dispersion or spread of the data around the mean.

$\text{{std}}_{\text{{person-date}}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{{activity}}_{\text{{person-date}}}(i) - \text{{mean}}_{\text{{person-date}}})^2}$


* **percentZero**: The percentage of data points that have a value of zero for each hour of the day.

$\text{{percent\_zero}}_{\text{{person-date}}} = \frac{{\text{{Number of hours with zero activity}}_{\text{{person-date}}}}}{{\text{{Total number of hours}}_{\text{{person-date}}}}} \times 100$


* **kurtosis**: A measure of the "tailedness" or shape of a distribution. The formula below is the raw kurtosis, which equals 3 for a normal distribution and cannot be negative. The signed interpretation applies to excess kurtosis (raw kurtosis minus 3): positive excess kurtosis indicates a relatively peaked, heavy-tailed distribution, while negative excess kurtosis indicates a relatively flat, light-tailed one.

$\text{{kurtosis}}_{\text{{person-date}}} = \frac{{\frac{1}{n} \sum_{i=1}^{n} (\text{{activity}}_{\text{{person-date}}}(i) - \text{{mean}}_{\text{{person-date}}})^4}}{{\left( \frac{1}{n} \sum_{i=1}^{n} (\text{{activity}}_{\text{{person-date}}}(i) - \text{{mean}}_{\text{{person-date}}})^2 \right)^2}}$
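
The mean, std, percentZero and kurtosis formulas above translate directly into code. This sketch follows the formulas as written (population form, dividing by $n$) and is not necessarily what `features.py` does. The function name `summary_stats` is hypothetical.

```python
import numpy as np
import pandas as pd

def summary_stats(activity: pd.Series) -> dict:
    """Population-form statistics for the activity values of one participant-day."""
    x = activity.to_numpy(dtype=float)
    mean = x.mean()
    var = ((x - mean) ** 2).mean()   # 1/n form, as in the std formula
    m4 = ((x - mean) ** 4).mean()    # fourth central moment
    return {
        "mean": mean,
        "std": np.sqrt(var),
        "percentZero": (x == 0).mean() * 100,
        "kurtosis": m4 / var ** 2,   # raw kurtosis; subtract 3 for excess kurtosis
    }

stats = summary_stats(pd.Series([0, 0, 2, 4, 4]))
print(stats)
```

Note that pandas' built-in `Series.std()` defaults to the sample form (`ddof=1`) and `Series.kurt()` returns bias-corrected excess kurtosis, so both give different numbers from the formulas above. A constant series has zero variance, which leaves the kurtosis undefined.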

* **median**: The middle value in the sorted list of values.

$\text{median}_{\text{person-date}} = 
\begin{cases} 
\text{activity}_{\text{person-date}}\left(\frac{n+1}{2}\right) & \text{if } n \text{ is odd} \\
\frac{1}{2} \left( \text{activity}_{\text{person-date}}\left(\frac{n}{2}\right) + \text{activity}_{\text{person-date}}\left(\frac{n}{2} + 1\right) \right) & \text{if } n \text{ is even}
\end{cases}$

* **first quartile (0.25)**: The value below which 25% of values fall.

$\text{{Q1}}_{\text{{person-date}}} = \text{{activity}}_{\text{{person-date}}}\left(\frac{n+1}{4}\right)$

* **third quartile (0.75)**: The value below which 75% of values fall.

$\text{{Q3}}_{\text{{person-date}}} = \text{{activity}}_{\text{{person-date}}}\left(\frac{3(n+1)}{4}\right)$
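
The median and quartile formulas above all pick the value at rank $(n+1)p$ in the sorted data. A minimal sketch of that rule follows. The function name `rank_quantile` is hypothetical, and the linear interpolation for fractional ranks is an assumption.

```python
import math
import pandas as pd

def rank_quantile(activity: pd.Series, p: float) -> float:
    """Quantile at 1-based rank (n + 1) * p, interpolating for fractional ranks."""
    x = activity.sort_values().to_numpy(dtype=float)
    n = len(x)
    rank = min(max((n + 1) * p, 1), n)  # clamp to the valid rank range
    lo = math.floor(rank)
    hi = min(lo + 1, n)
    frac = rank - lo
    return float(x[lo - 1] + frac * (x[hi - 1] - x[lo - 1]))

s = pd.Series([7, 1, 5, 3, 2, 6, 4])
q1, med, q3 = (rank_quantile(s, p) for p in (0.25, 0.5, 0.75))
print(q1, med, q3)
print(s.quantile(0.25), s.quantile(0.75))
```

For an even $n$ the rule reduces to the average of the two middle values, as in the median formula. pandas' `Series.quantile` interpolates at rank $(n-1)p$ by default, so its quartiles can differ slightly from this rule on small samples.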


## Feature Engineering

1. [x] calculate row-level independent features (participant-day) on whole dataset
2. [x] split into male, female, both datasets
3. [x] split each into train and validate datasets
4. [x] normalise male, female, both train sets
5. [x] normalise validation sets with respective parameters from train sets

In [3]:
import pandas as pd
output_csv_path = '../output/'
scores_csv_path = '../depresjon/scores.csv'

# import from csv
df = pd.read_csv(output_csv_path + 'preprocessed-long.csv', parse_dates=['timestamp', 'date'])

# load functions in python file with magic command
%run ../code/features.py
%run ../code/model.py

## Prepare Female, Male, Both datasets

### Row level features

In [4]:
# calculate features
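# sunlight_df is not created in this notebook; it is assumed to be defined by features.py (loaded above via %run)
features_full = calculate_all_features(df, sunlight_df)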
# save to csv
features_full.to_csv(output_csv_path + 'features.csv', index=False)

### Split into Female, Male, Both datasets

In [5]:
male, female, both = split_and_prepare_data(features_full)

# shapes of the datasets 
print(f"Male dataset shape: {male.shape}")
print(f"Female dataset shape: {female.shape}")
print(f"Both genders dataset shape: {both.shape}")

# save to csv
male.to_csv(output_csv_path + 'male.csv', index=False)
female.to_csv(output_csv_path + 'female.csv', index=False)
both.to_csv(output_csv_path + 'both.csv', index=False)

Male dataset shape: (310, 12)
Female dataset shape: (383, 12)
Both genders dataset shape: (693, 12)


### Split into Train and Validate sets

In [7]:
# split into train and validate
male_X_train, male_X_valid, male_y_train, male_y_valid = validation_data(male)
female_X_train, female_X_valid, female_y_train, female_y_valid = validation_data(female)
both_X_train, both_X_valid, both_y_train, both_y_valid = validation_data(both)

# shapes of the datasets
print(f"Male shapes: {male_X_train.shape}, {male_X_valid.shape}, {male_y_train.shape}, {male_y_valid.shape}")
print(f"Female shapes: {female_X_train.shape}, {female_X_valid.shape}, {female_y_train.shape}, {female_y_valid.shape} ")
print(f"Both shapes: {both_X_train.shape}, {both_X_valid.shape}, {both_y_train.shape}, {both_y_valid.shape}")  

# save to csv
male_X_train.to_csv(output_csv_path + 'male_X_train.csv', index=False)
male_X_valid.to_csv(output_csv_path + 'male_X_valid.csv', index=False)
male_y_train.to_csv(output_csv_path + 'male_y_train.csv', index=False)
male_y_valid.to_csv(output_csv_path + 'male_y_valid.csv', index=False)
female_X_train.to_csv(output_csv_path + 'female_X_train.csv', index=False)
female_X_valid.to_csv(output_csv_path + 'female_X_valid.csv', index=False)
female_y_train.to_csv(output_csv_path + 'female_y_train.csv', index=False)
female_y_valid.to_csv(output_csv_path + 'female_y_valid.csv', index=False)
both_X_train.to_csv(output_csv_path + 'both_X_train.csv', index=False)
both_X_valid.to_csv(output_csv_path + 'both_X_valid.csv', index=False)
both_y_train.to_csv(output_csv_path + 'both_y_train.csv', index=False)
both_y_valid.to_csv(output_csv_path + 'both_y_valid.csv', index=False)


Male shapes: (263, 11), (47, 11), (263,), (47,)
Female shapes: (325, 11), (58, 11), (325,), (58,) 
Both shapes: (589, 11), (104, 11), (589,), (104,)
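
The shapes above correspond to a hold-out of roughly 15% of rows in each dataset. A split that keeps label proportions could look like the sketch below. It is an illustration only, not the body of `validation_data`: the function name, the `label` column name and the seed are assumptions.

```python
import numpy as np
import pandas as pd

def stratified_split(df: pd.DataFrame, label_col: str = "label",
                     valid_frac: float = 0.15, seed: int = 0):
    """Hold out `valid_frac` of the rows of each label value for validation."""
    rng = np.random.default_rng(seed)
    valid_idx = []
    for _, group in df.groupby(label_col):
        n_valid = int(round(len(group) * valid_frac))
        valid_idx.extend(rng.choice(group.index.to_numpy(), size=n_valid, replace=False))
    valid = df.loc[valid_idx]
    train = df.drop(index=valid_idx)
    return (train.drop(columns=label_col), valid.drop(columns=label_col),
            train[label_col], valid[label_col])

# Toy data: 60 rows of label 0 and 40 rows of label 1.
toy = pd.DataFrame({"feature": range(100), "label": [0] * 60 + [1] * 40})
X_train, X_valid, y_train, y_valid = stratified_split(toy)
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)
```

Sampling within each label group keeps the 60/40 ratio in both the training and the validation set.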


### Normalise Train and Validate Sets

In [8]:
# fit normalisation parameters on train, apply them to train and validation
male_X_train_scaled, male_X_valid_scaled = normalise_data(male_X_train, male_X_valid)
female_X_train_scaled, female_X_valid_scaled = normalise_data(female_X_train, female_X_valid)
both_X_train_scaled, both_X_valid_scaled = normalise_data(both_X_train, both_X_valid)

# save to csv
male_X_train_scaled.to_csv(output_csv_path + 'male_X_train_scaled.csv', index=False)
male_X_valid_scaled.to_csv(output_csv_path + 'male_X_valid_scaled.csv', index=False)
female_X_train_scaled.to_csv(output_csv_path + 'female_X_train_scaled.csv', index=False)
female_X_valid_scaled.to_csv(output_csv_path + 'female_X_valid_scaled.csv', index=False)
both_X_train_scaled.to_csv(output_csv_path + 'both_X_train_scaled.csv', index=False)
both_X_valid_scaled.to_csv(output_csv_path + 'both_X_valid_scaled.csv', index=False)
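
A minimal sketch of normalising with parameters taken from the training set only. It assumes z-score standardisation; `normalise_data` in `model.py` may use a different scaler. The function name `zscore_fit_apply` is hypothetical.

```python
import pandas as pd

def zscore_fit_apply(X_train: pd.DataFrame, X_valid: pd.DataFrame):
    """Standardise both sets with the mean and std computed on the training set."""
    mean = X_train.mean()
    # Guard against constant columns, whose std of 0 would cause division by zero.
    std = X_train.std(ddof=0).replace(0, 1.0)
    return (X_train - mean) / std, (X_valid - mean) / std

# Hypothetical feature values.
X_train = pd.DataFrame({"mean": [1.0, 2.0, 3.0], "std": [2.0, 4.0, 6.0]})
X_valid = pd.DataFrame({"mean": [2.0], "std": [8.0]})
X_train_scaled, X_valid_scaled = zscore_fit_apply(X_train, X_valid)
print(X_valid_scaled)
```

The validation rows never contribute to the mean or std, so no information from the validation set reaches the training features.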