# Using LSTM Features


## Objectives

* objectives:
  * Fit a RNN (LSTM) model 
  * 
  
* plan: 

1. libraries and functions
2. load data
3. preprocess data
   - [x] full days only
   - [ ] resample
   - [ ] normalise activity
4. feature engineering
   - [ ] `mean`, `std`, `min`, `max`, `sum`
   - [ ] `%0 active`
5. train/test split
6. modelling
   - [ ] import libraries
   - [ ] model selection
   - [ ] model evaluation
7. interpretation / visualisation
   
Try to recreate: 

**Bibliography:** Arora, A., Chakraborty, P. and Bhatia, M.P.S. (2023) Identifying digital biomarkers in actigraph based sequential motor activity data for assessment of depression: a model evaluating SVM in LSTM extracted feature space. International Journal of Information Technology [online]. 15 (2), pp. 797–802. Available from: https://doi.org/10.1007/s41870-023-01162-5 [Accessed 17 February 2024].


## Data Preprocessing

In [1]:
# import libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler

### functions
- [x] extract depresjon from folder
- [x] extract full days (1440 rows) records and minimum full records
- [ ] resample, e.g. to hourly
- [] normalise data (mean = 0, std = 1)


## Extract data from folder

In [3]:

def extract_folder(folderpath, add_scores=False, downsample=None):
    """
    Extract CSV data from folder and subfolders into a dataframe.

    Args:
      folderpath (str): Path to the folder containing CSV files.
      add_scores (bool, optional): Boolean to add scores.csv to the dataframe. Defaults to False.
      downsample (int, optional): Number of rows to downsample CSVs to. Defaults to None.

    Returns:
      pandas.DataFrame: DataFrame of concatenated CSV data.
    """
    import os
    import pandas as pd
    
    # Dict to store dataframes by condition  
    dfs = {'control': [], 'condition': []}

    try:
        # Handle top-level scores CSV
        if add_scores and 'scores.csv' in os.listdir(folderpath):
            scores_path = os.path.join(folderpath, 'scores.csv')  
            dfs['scores'] = pd.read_csv(scores_path)

        # Get subfolders
        subfolders = [f for f in os.listdir(folderpath) if os.path.isdir(os.path.join(folderpath, f))]

        for subfolder in subfolders:
            subfolderpath = os.path.join(folderpath, subfolder)  

            # Get list of CSV files
            files = os.listdir(subfolderpath)

            for file in files:
                filepath = os.path.join(subfolderpath, file)

                # Extract ID from filename 
                id = file.split('.')[0]

                df = pd.read_csv(filepath)

                # Downsample if needed
                if downsample:
                    df = df.sample(downsample)

                # Add ID column - this is the filename without the extension
                df['id'] = id

                # Add 'condition' column
                df['condition'] = subfolder

                # Convert 'timestamp' and 'date' to datetime
                df['timestamp'] = pd.to_datetime(df['timestamp'])
                df['date'] = pd.to_datetime(df['date'])

                # Append to dict by condition
                if subfolder == 'control':
                    dfs['control'].append(df)
                else:  
                    dfs['condition'].append(df)

    except OSError:
        print(f"Error reading folder: {folderpath}")

    # concatenate dfs for each condition
    dfs['control'] = pd.concat(dfs['control'])
    dfs['condition'] = pd.concat(dfs['condition'])

    # Reset index on the final df
    df = pd.concat([dfs['control'], dfs['condition']]).reset_index(drop=True)

    # add label column
    df['label'] = 0
    df.loc[df['condition'] == 'condition', 'label'] = 1
    
    # remove old 'condition' column
    df.drop('condition', axis=1, inplace=True)

    # Final concat
    return df

In [4]:
# load the data
df = extract_folder('../data/depresjon')

## Data preprocessing

### Extract full days only


In [5]:
def extract_full_days(df):
    """
    Extracts full days from a DataFrame.

    Parameters:
    df (DataFrame): The input DataFrame.

    Returns:
    DataFrame: The DataFrame containing only full days (1440 rows per day).

    """
    # group by id and date, count rows, and filter where count equals 1440
    full_days_df = df.groupby(['id', 'date']).filter(lambda x: len(x) == 1440)
    
    # print id and date combinations that don't have 1440 rows
    not_full_days = df.groupby(['id', 'date']).size().reset_index(name='count').query('count != 1440')
    print("id and date combinations that don't have 1440 rows:")
    print(not_full_days)
    
    return full_days_df



In [6]:
full_df = extract_full_days(df)

id and date combinations that don't have 1440 rows:
                id       date  count
0      condition_1 2003-05-07    720
16     condition_1 2003-05-23    924
17    condition_10 2004-08-31    900
32    condition_10 2004-09-15    495
33    condition_11 2004-09-28    870
...            ...        ...    ...
1101     control_7 2003-04-23    610
1102     control_8 2003-11-04    900
1122     control_8 2003-11-24    658
1123     control_9 2003-11-11    900
1143     control_9 2003-12-01    778

[115 rows x 3 columns]


In [7]:
# print shape of the full days dataframe
print(full_df.shape)

# print the first few rows of the full days dataframe
print(full_df.head())

(1481760, 5)
              timestamp       date  activity         id  label
540 2003-03-19 00:00:00 2003-03-19         0  control_1      0
541 2003-03-19 00:01:00 2003-03-19         0  control_1      0
542 2003-03-19 00:02:00 2003-03-19         0  control_1      0
543 2003-03-19 00:03:00 2003-03-19         0  control_1      0
544 2003-03-19 00:04:00 2003-03-19       175  control_1      0


### Downsample to hourly



In [8]:
def resample_data(df, freq, agg_func='mean'):
    """
    Resamples the given DataFrame based on the specified frequency and aggregation function.

    Parameters:
    - df (DataFrame): The input DataFrame.
    - freq (str): The frequency at which to resample the dat. (e.g., H for hourly, D for daily, 5T for every 5 minutes),
    - agg_func (str, optional): The aggregation function to apply during resampling. Defaults to 'mean'.

    Returns:
    - df_resampled (DataFrame): The resampled DataFrame.
    """
    df_resampled = df.set_index('timestamp').groupby('id').resample(freq).agg(agg_func)
    df_resampled.reset_index(inplace=True)
    return df_resampled

FIXME future warning - non-numeric cols

In [9]:
# resample the full days dataframe to hourly frequency
resample_df = resample_data(full_df, 'h')

### Normalise 'activity'

In [10]:
def normalise_data(df, columns_to_normalise):
    """
    Normalise the specified columns in the df using StandardScaler.

    Parameters:
    - df (pandas.DataFrame): The DataFrame to be normalised.
    - columns_to_normalise (list): A list of column names to be normalised.

    Returns:
    - df (pandas.DataFrame): The DataFrame with the specified columns normalised.
    """
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    df[columns_to_normalise] = scaler.fit_transform(df[columns_to_normalise])
    return df

In [11]:
# normalise 'activity'
norm_df = normalise_data(resample_df, ['activity'])

## Sequences

### Create 24 hour sequences

The general idea - I think - is to create 24-hour long sequences from the hourly resampled dataframe so that *daily patterns* can be captured.  Then the `LSTM` model learns from the full day sequences.

The sliding window effectively means taking a fixed-size window and liding across the data in steps: 

* window size = 24; hop-size = 12:
  * 24 rows = first window
  * slide window 12 rows down and take rows 12-36 as the next window
  * etc.


In [12]:
def create_sequences(full_df, window_size, hop_size):
    """
    Create sequences from a given dataframe.

    Args:
        full_df (pandas.DataFrame): df with full days
        window_size (int): The size of each window/sequence.
        hop_size (int): The number of steps to move the window.

    Returns:
        list: A list of sequences/windows created from the dataframe.
    """
    sequences = []

    for i in range(0, len(full_df), hop_size):
        window = full_df.iloc[i:i+window_size]
        sequences.append(window)

    return sequences


In [13]:
# Create sequences for full days df
window_size = 24
hop_size = 12
sequences = create_sequences(norm_df, window_size, hop_size)

In [14]:
# print the number of sequences
print(len(sequences))

# print the shape of the first sequence
print(sequences[0].shape)

# print first sequence
print(sequences[2060])

2068
(24, 5)
              id           timestamp       date  activity  label
24720  control_9 2003-11-27 00:00:00 2003-11-27 -0.704176    0.0
24721  control_9 2003-11-27 01:00:00 2003-11-27 -0.704176    0.0
24722  control_9 2003-11-27 02:00:00 2003-11-27 -0.704176    0.0
24723  control_9 2003-11-27 03:00:00 2003-11-27 -0.704176    0.0
24724  control_9 2003-11-27 04:00:00 2003-11-27 -0.704176    0.0
24725  control_9 2003-11-27 05:00:00 2003-11-27 -0.704176    0.0
24726  control_9 2003-11-27 06:00:00 2003-11-27 -0.704176    0.0
24727  control_9 2003-11-27 07:00:00 2003-11-27 -0.704176    0.0
24728  control_9 2003-11-27 08:00:00 2003-11-27 -0.704176    0.0
24729  control_9 2003-11-27 09:00:00 2003-11-27 -0.704176    0.0
24730  control_9 2003-11-27 10:00:00 2003-11-27 -0.704176    0.0
24731  control_9 2003-11-27 11:00:00 2003-11-27 -0.704176    0.0
24732  control_9 2003-11-27 12:00:00 2003-11-27 -0.681956    0.0
24733  control_9 2003-11-27 13:00:00 2003-11-27 -0.703437    0.0
24734  contr

In [15]:
lengths = [len(seq) for seq in sequences]
#print(set(lengths))

# print sequences with length not equal to window_size
print([seq for seq in sequences if len(seq) != window_size])

[              id           timestamp       date  activity  label
24804  control_9 2003-11-30 12:00:00 2003-11-30  -0.69612    0.0
24805  control_9 2003-11-30 13:00:00 2003-11-30  -0.69612    0.0
24806  control_9 2003-11-30 14:00:00 2003-11-30  -0.69612    0.0
24807  control_9 2003-11-30 15:00:00 2003-11-30  -0.69612    0.0
24808  control_9 2003-11-30 16:00:00 2003-11-30  -0.69612    0.0
24809  control_9 2003-11-30 17:00:00 2003-11-30  -0.69612    0.0
24810  control_9 2003-11-30 18:00:00 2003-11-30  -0.69612    0.0
24811  control_9 2003-11-30 19:00:00 2003-11-30  -0.69612    0.0
24812  control_9 2003-11-30 20:00:00 2003-11-30  -0.69612    0.0
24813  control_9 2003-11-30 21:00:00 2003-11-30  -0.69612    0.0
24814  control_9 2003-11-30 22:00:00 2003-11-30  -0.69612    0.0
24815  control_9 2003-11-30 23:00:00 2003-11-30  -0.69612    0.0]


Remove partial sequence for `control_9`

In [16]:
# remove sequences with length not equal to window_size
full_seqs = [s for s in sequences if len(s) == window_size]

In [17]:
# print length of sequences
print([seq for seq in full_seqs if len(seq) != window_size])

[]


Create data set

In [18]:
def create_dataset(sequences, activity_column='activity', label_column='label'):
  """
  Create a dataset from a list of sequences.

  Parameters:
  sequences (list): A list of sequences.
  activity_column (str): The name of the column containing activity data in each sequence. Default is 'activity'.
  label_column (str): The name of the column containing label data in each sequence. Default is 'label'.

  Returns:
  inputs (numpy.ndarray): An array of input sequences.
  targets (numpy.ndarray): An array of target sequences.
  """

  import numpy as np
  inputs = []
  targets = []

  for seq in sequences:
    # Extract just activity column
    input_arr = seq[activity_column].values

    # Extract label column
    target_arr = seq[label_column].values

    inputs.append(input_arr)
    targets.append(target_arr)

  inputs = np.array(inputs)
  targets = np.array(targets)

  return inputs, targets

In [19]:
# create dataset from full sequences
X_train, y_train = create_dataset(full_seqs, 'label')
# Reshape X_train for the model
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)


In [20]:
# print the shape of X_train and y_train
print(X_train.shape, y_train.shape)

(2067, 24, 1) (2067, 24)


## LSTM

Article (Arora, 2023) states that they extracted high level features from the fifth LSTM layer with this configuration: 

* num_timesteps = 24 
* num_features = 1  
* num_layers = 5 
* units = [125, 100, 75, 50, 25] 
* dropout = 0.1


### side step - set up separate environment - too many issues instlling tensorfloww

1. gpu drivers
2. cuda toolkit, cuDNN
3. running in wsl


In [21]:
import tensorflow as tf

# Check if TensorFlow sees any GPUs
gpus = tf.config.list_physical_devices('GPU')

if gpus:
    # If GPUs are available, print the number of GPUs
    print("Num GPUs Available:", len(gpus))
else:
    print("No GPUs Available")



2024-02-18 19:21:38.303993: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-18 19:21:38.642894: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-18 19:21:38.643180: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-18 19:21:38.698760: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-18 19:21:38.809205: I tensorflow/core/platform/cpu_feature_guar

Num GPUs Available: 1


2024-02-18 19:21:42.184298: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-18 19:21:42.499748: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-18 19:21:42.499797: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.


In [22]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# configuration
num_timesteps = 24 # X data shape 
num_features = 1  # Number of features
num_layers = 5 
units = [125, 100, 75, 50, 25] # Units in each LSTM layer
dropout = 0.1

model = Sequential()

#  LSTM layers  
for i in range(num_layers):
  model.add(LSTM(units[i], return_sequences=True, dropout=dropout)) 

#  connected output layer  
model.add(Dense(25)) 

# compile model
model.compile(loss='mean_squared_error', optimizer='adam') 

# generate features
features = model.predict(X_train)

2024-02-18 19:21:55.410077: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-18 19:21:55.410155: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-18 19:21:55.410173: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-18 19:21:55.794423: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-02-18 19:21:55.795118: I external/local_xla/xla/stream_executor



In [23]:
# print features
print(features)

[[[ 7.91813727e-06 -1.46745761e-05  2.23000934e-05 ... -2.13794374e-05
    1.99195210e-06 -6.26087467e-06]
  [ 2.96882499e-05 -6.89646549e-05  1.03898288e-04 ... -1.01119571e-04
    1.17164618e-05 -3.10999167e-05]
  [ 6.27766058e-05 -1.88223479e-04  2.80895241e-04 ... -2.78581749e-04
    3.94541566e-05 -9.17305224e-05]
  ...
  [-7.48083787e-03 -4.97863395e-03  3.54944705e-03 ... -1.01685189e-02
    1.10634314e-02 -1.96319968e-02]
  [-8.14214442e-03 -4.97901067e-03  3.14453384e-03 ... -1.01871649e-02
    1.17522776e-02 -2.08753459e-02]
  [-8.78571905e-03 -4.98929992e-03  2.74188560e-03 ... -1.01795997e-02
    1.24003105e-02 -2.20377594e-02]]

 [[ 7.91813727e-06 -1.46745761e-05  2.23000934e-05 ... -2.13794374e-05
    1.99195210e-06 -6.26087467e-06]
  [ 2.96882499e-05 -6.89646549e-05  1.03898288e-04 ... -1.01119571e-04
    1.17164618e-05 -3.10999167e-05]
  [ 6.27766058e-05 -1.88223479e-04  2.80895241e-04 ... -2.78581749e-04
    3.94541566e-05 -9.17305224e-05]
  ...
  [-7.48083787e-03 -4.9

TODO

Process:
* ~~Downsample to hourly averages~~
* ~~create 24 hour sequences using overlapping sliding windows with 12 hour hop~~
* ~~each 24 hour sequence will be one data example~~

LSTM features:
* LSTM network with 5 layers (125, 100, 75, 50, 25 nodes)
* use tanh activations and 0.1 dropout
* feed 24 hour sequence and extract features from last layer

Stat features:
* men, std, %zero activities
* use overlapping window

Train SVM classifier
* linear SVM with c=0.1

Eval model
* 10 fold cross validation


## Feature engineering


## Train / Test split

## Models

### Import model libraries

### Define models (broken into sets -not sure of time needed)

## Model evaluation function

## Models

### Results - summary - out of the box models: 

* **Accuracy** - proportion of total predictions correct ->    `gradient boost`

$$\frac{{\text{{True Positive}} + \text{{True Negative}}}}{{\text{{Total Prediction}}}}$$

* **Precision**: proportion of positive prediction that are actually correct (Positive Predictive value) -> `neural network`
  
$$\frac{{\text{{True Positive}}}}{{\text{{True Positives}}+ \text{{False Positives}}}}$$

* **Recall**: proportion of actual positives that are correctly identified (aka Sensitivity) -> `Naive Bayes`

$$\frac{{\text{{True Positive}}}}{{\text{{True Positives}} + \text{{False Negatives}}}}$$

* **F1**: harmonic mean of Precision and Recall -> `gradient boost`

$$\frac{{{{2}} * \text{{(Precision}} * \text{{Recall)}}}}{{\text{{Precision}} + \text{{Recall}}}}$$

* **MCC**: measure of quality of binary classifications - considered a balanced measure ->  `gradient boost`

$$\frac{{\text{{(TP * TN - FP *FN)}}}}{{\text{{sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))}}}}$$

* **Quickest**: time ->  `Naive Bayes`



#### Reminder: 

* `True Positives (TP)`:  model predicted positive, and the truth is also positive.
* `True Negatives (TN)`:  model predicted negative, and the truth is also negative.
* `False Positives (FP)`: model predicted positive, and the truth is negative.
* `False Negatives (FN)`: model predicted negative, and the truth is positive.

## Visualise results
