# Random Forest - baseline

* Referencing: 
> **Bibliography:** Zanella-Calzada, L.A., Galván-Tejada, C.E., Chávez-Lamas, N.M., Gracia-Cortés, M. del C., Magallanes-Quintanar, R., Celaya-Padilla, J.M., Galván-Tejada, J.I. and Gamboa-Rosales, H. (2019) Feature Extraction in Motor Activity Signal: Towards a Depression Episodes Detection in Unipolar and Bipolar Patients. Diagnostics [online]. 9 (1), p. 8. Available from: https://www.mdpi.com/2075-4418/9/1/8 [Accessed 28 November 2023].
* [article notes](../literature/Zanella-FeatureExtraction.md)


## Plan 

1. Load and process `depresjon`
   * load into pandas df
   * select `control` and `condition` -> it seems that they used first 4 control and first 5 condition participants
   * normalise data (mean = 0, std = 1)
   * remove incomplete cases

2. Extract features - 14 features
   * mean
   * standard deviation
   * variance
   * trimmed mean
   * coefficient of variation
   * inversse coefficient of variation
   * kurtosis
   * skewness
   * quantailes (1, 5, 25, 75, 95, 99)

3. 

In [1]:
# libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Load and preprocess data

### Load data files

In [10]:
def extract_folder(folderpath, add_scores=False, downsample=None):
    """
    Extract CSV data from folder and subfolders into a dataframe.

    Args:
      folderpath (str): Path to the folder containing CSV files.
      add_scores (bool, optional): Boolean to add scores.csv to the dataframe. Defaults to False.
      downsample (int, optional): Number of rows to downsample CSVs to. Defaults to None.

    Returns:
      pandas.DataFrame: DataFrame of concatenated CSV data.
    """

    # Dict to store dataframes by condition  
    dfs = {'control': [], 'condition': []}

    try:
        # Handle top-level scores CSV
        if add_scores and 'scores.csv' in os.listdir(folderpath):
            scores_path = os.path.join(folderpath, 'scores.csv')  
            dfs['scores'] = pd.read_csv(scores_path)

        # Get subfolders
        subfolders = [f for f in os.listdir(folderpath) if os.path.isdir(os.path.join(folderpath, f))]

        for subfolder in subfolders:
            subfolderpath = os.path.join(folderpath, subfolder)  

            # Get list of CSV files
            files = os.listdir(subfolderpath)

            for file in files:
                filepath = os.path.join(subfolderpath, file)

                # Extract ID from filename 
                id = file.split('.')[0]

                df = pd.read_csv(filepath)

                # Downsample if needed
                if downsample:
                    df = df.sample(downsample)

                # Add ID column - this is the filename without the extension
                df['id'] = id

                # Add 'condition' column
                df['condition'] = subfolder

                # Convert 'timestamp' and 'date' to datetime
                df['timestamp'] = pd.to_datetime(df['timestamp'])
                df['date'] = pd.to_datetime(df['date'])

                # Append to dict by condition
                if subfolder == 'control':
                    dfs['control'].append(df)
                else:  
                    dfs['condition'].append(df)

    except OSError:
        print(f"Error reading folder: {folderpath}")

    # concatenate dfs for each condition
    dfs['control'] = pd.concat(dfs['control'])
    dfs['condition'] = pd.concat(dfs['condition'])

    # Reset index on the final df
    df = pd.concat([dfs['control'], dfs['condition']]).reset_index(drop=True)

    # add label column
    df['label'] = 0
    df.loc[df['condition'] == 'condition', 'label'] = 1
    
    # remove old 'condition' column
    df.drop('condition', axis=1, inplace=True)

    # Final concat
    return df

In [11]:
# set folder path
folderpath = '../data/depresjon/'
# extract all files
all_files = extract_folder(folderpath)
# print rows 21-24
#print(all_files.iloc[21:25])
print(all_files.head(-5))


                  timestamp       date  activity           id  label
0       2003-03-18 15:00:00 2003-03-18        60    control_1      0
1       2003-03-18 15:01:00 2003-03-18         0    control_1      0
2       2003-03-18 15:02:00 2003-03-18       264    control_1      0
3       2003-03-18 15:03:00 2003-03-18       662    control_1      0
4       2003-03-18 15:04:00 2003-03-18       293    control_1      0
...                     ...        ...       ...          ...    ...
1571696 2004-06-10 14:58:00 2004-06-10         0  condition_9      1
1571697 2004-06-10 14:59:00 2004-06-10         0  condition_9      1
1571698 2004-06-10 15:00:00 2004-06-10         0  condition_9      1
1571699 2004-06-10 15:01:00 2004-06-10         5  condition_9      1
1571700 2004-06-10 15:02:00 2004-06-10         0  condition_9      1

[1571701 rows x 5 columns]


### Select subset

In [21]:
control_subjects = ['control_1', 'control_2', 'control_3', 'control_4']
condition_subjects = ['condition_1', 'condition_2', 'condition_3', 'condition_4', 'condition_5']

# Filter for control subjects
control_df = all_files[all_files['id'].isin(control_subjects)] 

# Filter for condition subjects
condition_df = all_files[all_files['id'].isin(condition_subjects)]

# Concatenate 
df = pd.concat([control_df, condition_df])

# print the first 5 rows
print(df.head(5), '\n')

# print info
print(df.info(), '\n')

# print random 10 rows
print(df.sample(10), '\n')

# print number of rows by 'id'
print(df['id'].value_counts(), '\n')

# print number of rows by 'label'
print(df['label'].value_counts())

# print proportion of 'label' column
print(df['label'].value_counts(normalize=True))

            timestamp       date  activity         id  label
0 2003-03-18 15:00:00 2003-03-18        60  control_1      0
1 2003-03-18 15:01:00 2003-03-18         0  control_1      0
2 2003-03-18 15:02:00 2003-03-18       264  control_1      0
3 2003-03-18 15:03:00 2003-03-18       662  control_1      0
4 2003-03-18 15:04:00 2003-03-18       293  control_1      0 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 306813 entries, 0 to 1488480
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   timestamp  306813 non-null  datetime64[ns]
 1   date       306813 non-null  datetime64[ns]
 2   activity   306813 non-null  int64         
 3   id         306813 non-null  object        
 4   label      306813 non-null  int64         
dtypes: datetime64[ns](2), int64(2), object(1)
memory usage: 14.0+ MB
None 

                  timestamp       date  activity           id  label
1293564 2003-05-13 09:17:00 2003-05-13 

### Normalise subset (z-score)

Could use: 

* `sklearn.preprocessing.scale` - Standardises features by removing the mean and scaling to unit variance (similar to manual z-score normalisation)
* `sklearn.preprocessing.minmax_scale` - Transforms features to a given range (often 0-1 for minmax scaling)
* `sklearn.preprocessing.normalize` - L2 vector normalisation

* pandas -> df.normalize(axis=0)

In [23]:
# calculate mean and standard deviation
mu = df['activity'].mean()
sigma = df['activity'].std()

# normalise
df['activity_norm'] = (df['activity'] - mu)/sigma

# print summary statistics
print(df.describe())

            activity          label  activity_norm
count  306813.000000  306813.000000   3.068130e+05
mean      168.451369       0.413499  -2.501153e-17
std       356.899613       0.492462   1.000000e+00
min         0.000000       0.000000  -4.719853e-01
25%         0.000000       0.000000  -4.719853e-01
50%         3.000000       0.000000  -4.635796e-01
75%       160.000000       1.000000  -2.367996e-02
max      6776.000000       1.000000   1.851375e+01


### Missing values

In [25]:
# Check if dataframe has any NaN 
print(df.isnull().values.any())

# Count number of NaN per column
print(df.isnull().sum())

# See indices of NaN values 
print(df[df.isnull().any(axis=1)].index)

False
timestamp        0
date             0
activity         0
id               0
label            0
activity_norm    0
dtype: int64
Int64Index([], dtype='int64')


## Extract features (14)


TODO - can this be made more efficient
fix the above to take the first value from each hour for each respondent - too much data?

In [31]:
# Calculate features per time window
features = []
for idx, df_window in df.groupby(df.index): # grouped by time window
    features_dict = {
        'mean': df_window['activity_norm'].mean(),
        'std': df_window['activity_norm'].std(),
        'variance': df_window['activity_norm'].var(),
        'variance': df_window['activity_norm'].var(),
        'trimmed_mean': df_window['activity_norm'].quantile(0.05),
        'coef_var': df_window['activity_norm'].std() / df_window['activity_norm'].mean(),
        'inverse_coef_var': df_window['activity_norm'].mean() / df_window['activity_norm'].std(),
        'kurtosis': df_window['activity_norm'].kurtosis(),
        'skewness': df_window['activity_norm'].skew(),
        'quantile_1': df_window['activity_norm'].quantile(0.01),
        'quantile_5': df_window['activity_norm'].quantile(0.05),
        'quantile_25': df_window['activity_norm'].quantile(0.25),
        'quantile_75': df_window['activity_norm'].quantile(0.75),
        'quantile_95': df_window['activity_norm'].quantile(0.95),
        'quantile_99': df_window['activity_norm'].quantile(0.99)
    }
    
    features.append(features_dict) 

features_df = pd.DataFrame(features)

In [30]:
print(features_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306813 entries, 0 to 306812
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   mean      306813 non-null  float64
 1   std       0 non-null       float64
 2   variance  0 non-null       float64
dtypes: float64(3)
memory usage: 7.0 MB
None
