# Evaluate motor activity of depression patients (and healthy control group)

Depression is a severe illness which can lead to suicide. More than 264 million people worldwide suffer from depression. It is one of the main causes of disability, and suicide is the second leading cause of death among 15-29-year-olds (Source: WHO; https://www.who.int/news-room/fact-sheets/detail/depression).

Fortunately, there are effective psychological and pharmacological treatments. **Measuring motor activity** could be one way to provide a diagnostic early-warning system.

***

The **underlying data sets** contain the motor activity of 23 patients with depression (condition group) and 32 healthy controls. The severity of the depression is assessed by experts using the Montgomery-Asberg Depression Rating Scale (MADRS). MADRS scores range from 0 to 60. Values above 30 indicate severe depression, values below 10 indicate a healthy state.
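
As a rough illustration of the two thresholds quoted above, the following sketch maps a MADRS score to a coarse band. The function name `madrs_band` and the label `intermediate` for scores between 10 and 30 are assumptions made for this example; the text above only names the two outer bands.

```python
def madrs_band(score):
    """Map a MADRS total score (0-60) to a coarse band."""
    if not 0 <= score <= 60:
        raise ValueError("MADRS scores range from 0 to 60")
    if score < 10:
        return "healthy"
    if score > 30:
        return "severe"
    # assumed label: the text names no band for scores from 10 to 30
    return "intermediate"


for s in (5, 19, 35):
    print(s, madrs_band(s))
```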

The original paper is available here: https://www.researchgate.net/publication/325021337_Depresjon_A_Motor_Activity_Database_of_Depression_Episodes_in_Unipolar_and_Bipolar_Patients

***


## <font color=blue>Table of Contents </font>
* [Explore Score File](#1)
* [Clean / Explore Condition Table](#2)
* [Explore Control Table](#3)
* [Activity Data - Exploration](#4)
* [Loop over files and extract info](#5)
* [Comparison Condition vs Control](#6)

In [None]:
# packages

# standard
import numpy as np
import pandas as pd
import time

# plots
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

In [None]:
# files
!ls ../input/the-depression-dataset/data

In [None]:
data_path = '../input/the-depression-dataset/data/'

<a id='1'></a>
# Explore Score File

In [None]:
# load score file
df = pd.read_csv(data_path + 'scores.csv')

In [None]:
# show file (it's quite small so it can be displayed at once)
df

In [None]:
# add difference of scores ("after activity recording" minus "before activity recording")
df['DeltaMADRS'] = df.madrs2 - df.madrs1

#### Control rows are empty in all columns except number (id), days, gender and age. We therefore split the table into condition and control observations.

In [None]:
# split in condition and control table
df_condition = df[df.number.str.contains('condition')].copy()
df_control = df[df.number.str.contains('control')].copy()

<a id='2'></a>
# Clean / Explore Condition Table

In [None]:
df_condition

In [None]:
df_condition.shape

In [None]:
# define standard text for missing values
txt_missing = '_MISSING_'

In [None]:
# prep melanch column
df_condition.melanch = df_condition.melanch.fillna(txt_missing)
df_condition.melanch = df_condition.melanch.astype('category') # convert to categorical
df_condition.melanch = df_condition.melanch.cat.rename_categories({-1 : txt_missing,
                                                                   1.0 : '1',
                                                                   2.0 : '2'})

# age, gender => category
df_condition.age = df_condition.age.astype('category')
df_condition.gender = df_condition.gender.astype('category')

# further type conversions (float => int => category)
df_condition.afftype = df_condition.afftype.astype(int).astype('category')
df_condition.inpatient = df_condition.inpatient.astype(int).astype('category')
df_condition.marriage = df_condition.marriage.astype(int).astype('category')
df_condition.work = df_condition.work.astype(int).astype('category')

# edu: convert to category and label blank entries as missing
df_condition.edu = df_condition.edu.astype('category')
df_condition.edu = df_condition.edu.cat.rename_categories({' ' : txt_missing})

In [None]:
# let's check the cleaned data set
df_condition

In [None]:
# define numerical and categorical features
features_num = ['days','madrs1','madrs2','DeltaMADRS']
features_cat = ['age', 'gender', 'afftype', 'melanch', 'inpatient', 'edu', 'marriage', 'work']

## Numerical Features

In [None]:
# basic stats
df_condition[features_num].describe()

#### Development of MADRS score (before activity recording / after activity recording):

In [None]:
# barplot of MADRS scores (before/after)
temp_plot_paras = plt.rcParams['figure.figsize']

plt.rcParams['figure.figsize'] = (14,4)
df_condition.plot(x='number', y=['madrs1','madrs2'], kind='bar')
plt.title('MADRS Development')
plt.grid()
plt.show()

plt.rcParams['figure.figsize'] = temp_plot_paras

In [None]:
# plot distributions of numerical features
for f in features_num:
    df_condition[f].plot(kind='hist')
    plt.title(f)
    plt.grid()
    plt.show()

### Correlations

In [None]:
# scatter plot for each pair incl. regression line
sns.pairplot(df_condition[features_num], kind='reg')
plt.show()

In [None]:
# correlation matrix
df_condition[features_num].corr(method='pearson')

## Categorical Features

In [None]:
# plot distributions of categorical features
for f in features_cat:
    df_condition[f].value_counts().sort_index().plot(kind='bar')
    plt.title(f)
    plt.grid()
    plt.show()

### Impact of categorical features on scores

In [None]:
# impact of feature on score madrs1 (at the beginning of activity measurement)
for f in features_cat:
    plt.figure(figsize=(10,4))
    sns.violinplot(data=df_condition, x=f, y='madrs1')
    plt.title('madrs1 vs ' + f)
    plt.grid()
    plt.show()

In [None]:
# impact of feature on score madrs2 (at the end of activity measurement)
for f in features_cat:
    plt.figure(figsize=(10,4))
    sns.violinplot(data=df_condition, x=f, y='madrs2')
    plt.title('madrs2 vs ' + f)
    plt.grid()
    plt.show()

In [None]:
# impact of feature on score difference DeltaMADRS = madrs2 - madrs1
for f in features_cat:
    plt.figure(figsize=(10,4))
    sns.violinplot(data=df_condition, x=f, y='DeltaMADRS')
    plt.title('DeltaMADRS vs ' + f)
    plt.grid()
    plt.show()

<a id='3'></a>
# Explore Control Table

In [None]:
# distribution of days
df_control.days.plot(kind='hist')
plt.title('days [control]')
plt.grid()
plt.show()

In [None]:
# type conversion
df_control.age = df_control.age.astype('category')
df_control.gender = df_control.gender.astype('category')

In [None]:
# plot distributions of categorical features
df_control.gender.value_counts().sort_index().plot(kind='bar')
plt.title('gender [control]')
plt.grid()
plt.show()

df_control.age.value_counts().sort_index().plot(kind='bar')
plt.title('age [control]')
plt.grid()
plt.show()

<a id='4'></a>
# Activity Data - Exploration

In [None]:
# load a specific file
my_file = data_path + 'condition/condition_1.csv'
df_act = pd.read_csv(my_file)
df_act.head(10)

In [None]:
# dimensions
df_act.shape

In [None]:
# basic stats of activity
df_act.activity.describe(percentiles=[0.01,0.1,0.25,0.5,0.75,0.9,0.99])

In [None]:
# add logarithmic version of activity
df_act['log1_act'] = np.log10(1+df_act.activity)

# add non-zero indicator for activity
df_act['non_zero'] = (df_act.activity>0).astype(int)

### Distribution

In [None]:
# distribution of activity
plt.figure(figsize=(10,4))
df_act.activity.plot(kind='hist', bins=100)
plt.title('Activity - Histogram')
plt.grid()
plt.show()

In [None]:
# distribution of activity - log transformation
plt.figure(figsize=(10,4))
df_act.log1_act.plot(kind='hist', bins=100)
plt.title('log10(1+Activity) - Histogram')
plt.grid()
plt.show()

In [None]:
# distribution of activity - log transformation - non zeroes only
plt.figure(figsize=(10,4))
np.log10(df_act[df_act.non_zero==1].activity).plot(kind='hist', bins=100)
plt.title('log10(Activity|Activity>0) - Histogram')
plt.grid()
plt.show()

### Time Series

In [None]:
# plot full activity time series
my_alpha=0.25
fig, ax = plt.subplots(figsize=(18,6))
ax.scatter(df_act.timestamp, df_act.activity, alpha=my_alpha, label='activity') # label is needed for the legend
ax.xaxis.set_major_locator(plt.MaxNLocator(20)) # reduce number of x-axis labels
plt.title(my_file)
plt.xticks(rotation=90)
plt.grid()
ax.legend(loc='upper left')
plt.show()
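
The plot above passes the timestamp column as read from the CSV. A minimal sketch of an alternative, assuming the timestamps are strings of the form `YYYY-MM-DD HH:MM:SS`: parse them once with `pd.to_datetime` and aggregate to hourly means, which gives a real time axis and far fewer points to draw. The frame `demo` is synthetic stand-in data, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for one activity file: one reading per minute for 2 days
rng = np.random.default_rng(0)
ts = pd.date_range("2003-05-08 00:00:00", periods=2 * 1440, freq="min")
demo = pd.DataFrame({
    "timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"),  # strings, as after read_csv
    "activity": rng.integers(0, 500, size=len(ts)),
})

# parse the strings once so the x-axis becomes a real time axis
demo["timestamp"] = pd.to_datetime(demo["timestamp"])

# hourly means: 48 points instead of 2880
hourly = demo.set_index("timestamp")["activity"].resample("60min").mean()
print(hourly.head())
```

The resulting series can be plotted directly with `hourly.plot()`.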

In [None]:
# zoom into a specific day
df_act_zoom = df_act[df_act.date=='2003-05-12']
my_alpha=0.25
fig, ax = plt.subplots(figsize=(18,6))
ax.scatter(df_act_zoom.timestamp, df_act_zoom.activity, alpha=my_alpha, label='activity') # label is needed for the legend
ax.xaxis.set_major_locator(plt.MaxNLocator(20)) # reduce number of x-axis labels
plt.title(my_file)
plt.xticks(rotation=90)
plt.grid()
ax.legend(loc='upper left')
plt.show()

### Evaluate by Date

In [None]:
# group activity by date
plt.subplots(figsize=(18,6))
sns.boxplot(data=df_act, x='date', y='activity')
plt.xticks(rotation=90)
plt.title(my_file)
plt.grid()
plt.show()

In [None]:
# group by date
df_act_by_date = df_act.groupby(['date'], as_index=False).agg(
    n = pd.NamedAgg(column='activity', aggfunc='count'),
    n_non_zero = pd.NamedAgg(column='non_zero', aggfunc='sum'),
    mean_act = pd.NamedAgg(column='activity', aggfunc='mean'),
    q75_act = pd.NamedAgg(column='activity', aggfunc=lambda x : np.percentile(a=x, q=75)),
    q90_act = pd.NamedAgg(column='activity', aggfunc=lambda x : np.percentile(a=x, q=90)),
    q95_act = pd.NamedAgg(column='activity', aggfunc=lambda x : np.percentile(a=x, q=95)),
    q99_act = pd.NamedAgg(column='activity', aggfunc=lambda x : np.percentile(a=x, q=99)),
    max_act = pd.NamedAgg(column='activity', aggfunc='max'))

df_act_by_date

#### The first and the last day in this example are incomplete. For the sake of comparability we will remove those incomplete days!

In [None]:
# remove incomplete days from stats
df_act_by_date = df_act_by_date[df_act_by_date.n==1440] # 1440 = 24*60 minutes in a day
df_act_by_date

In [None]:
# plot mean activity by day
plt.figure(figsize=(14,4))
plt.scatter(df_act_by_date.date, df_act_by_date.mean_act)
plt.title('Mean Activity by Day')
plt.xticks(rotation=90)
plt.grid()
plt.show()

print('Mean of daily means:', np.round(df_act_by_date.mean_act.mean(),2))
print('Stdev of daily means:', np.round(df_act_by_date.mean_act.std(),2))

In [None]:
# plot 99th percentile of activity by day
plt.figure(figsize=(14,4))
plt.scatter(df_act_by_date.date, df_act_by_date.q99_act)
plt.title('99th Percentile of Activity by Day')
plt.xticks(rotation=90)
plt.grid()
plt.show()

print('Mean of daily 99th percentile:', np.round(df_act_by_date.q99_act.mean(),2))
print('Stdev of daily 99th percentile:', np.round(df_act_by_date.q99_act.std(),2))

<a id='5'></a>
# Loop over files and extract info

## Condition

In [None]:
# show all condition files
!ls ../input/the-depression-dataset/data/condition

### Let's look at another example before automatically evaluating all files:

In [None]:
# load and plot full activity time series
my_file = data_path + 'condition/condition_2.csv'
df_temp = pd.read_csv(my_file)

my_alpha=0.25
fig, ax = plt.subplots(figsize=(18,6))
ax.scatter(df_temp.timestamp, df_temp.activity, alpha=my_alpha, label='activity') # label is needed for the legend
ax.xaxis.set_major_locator(plt.MaxNLocator(20)) # reduce number of x-axis labels
plt.title(my_file)
plt.xticks(rotation=90)
plt.grid()
ax.legend(loc='upper left')
plt.show()

### We observe a period of several days with no or almost no recorded activity. This does not seem plausible (maybe the sensor was offline or not working properly during that phase). In the following we therefore remove days with such extremely low activity.
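
The two day-level filters used in this notebook (complete days only, and a minimum daily mean) can be sketched on a small synthetic frame. The helper name `summarize_days` and the made-up dates `d1` to `d3` are assumptions for this illustration; the default threshold of 10 mirrors the value chosen below.

```python
import numpy as np
import pandas as pd


def summarize_days(df, daily_threshold=10, minutes_per_day=1440):
    """Per-day stats, keeping only complete days above the activity threshold."""
    by_date = df.groupby("date", as_index=False).agg(
        n=("activity", "count"),
        mean_act=("activity", "mean"),
    )
    complete = by_date[by_date.n == minutes_per_day]
    return complete[complete.mean_act > daily_threshold]


# synthetic example: a partial day, an active day, and a "sensor offline" day
demo = pd.DataFrame({
    "date": ["d1"] * 600 + ["d2"] * 1440 + ["d3"] * 1440,
    "activity": np.concatenate([
        np.full(600, 200),   # incomplete day -> dropped
        np.full(1440, 150),  # complete and active -> kept
        np.full(1440, 2),    # complete but near-zero -> dropped
    ]),
})

result = summarize_days(demo)
print(result)
```

Only `d2` survives both filters. Wrapping the logic in one function would also avoid repeating it for the condition and control loops.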

In [None]:
# define (daily mean) threshold below which we consider the data as not usable
daily_threshold = 10

In [None]:
# iterate over all files and extract statistics 
mean_list = []
std_list = []
q99_list = []
std_q99_list = []

for i in range(23):
    j = 1+i
    my_file = data_path + 'condition/condition_' + str(j) + '.csv'
    print('Extracting from:', my_file)
    df_temp = pd.read_csv(my_file)
    
    # group by date
    df_temp_by_date = df_temp.groupby(['date'], as_index=False).agg(
        n = pd.NamedAgg(column='activity', aggfunc='count'),
        mean_act = pd.NamedAgg(column='activity', aggfunc='mean'),
        q99_act = pd.NamedAgg(column='activity', aggfunc=lambda x : np.percentile(a=x, q=99)),
        max_act = pd.NamedAgg(column='activity', aggfunc='max')
    )

    # remove incomplete days (24*60 minutes = 1440)
    df_temp_by_date = df_temp_by_date[df_temp_by_date.n==1440]
    # remove days with unreasonably low average daily activity
    df_temp_by_date = df_temp_by_date[df_temp_by_date.mean_act > daily_threshold]
    
    print(df_temp_by_date)
    print()
    
    # extract statistics
    mean_temp = df_temp_by_date.mean_act.mean() # mean of mean daily activity
    std_temp = df_temp_by_date.mean_act.std() # stdev of mean daily activity
    mean_q99_temp = df_temp_by_date.q99_act.mean() # mean of 99th percentiles of daily activity
    std_q99_temp = df_temp_by_date.q99_act.std() # stdev of 99th percentiles of daily activity
    
    # add results to lists
    mean_list.append(mean_temp)
    std_list.append(std_temp)
    q99_list.append(mean_q99_temp)
    std_q99_list.append(std_q99_temp)

In [None]:
# store results in data frame
condition_stats = pd.DataFrame(zip(df_condition.number, mean_list, q99_list, std_list, std_q99_list), 
                               columns=['number','Mean_MeanAct','Mean_Q99Act','Std_MeanAct','Std_Q99Act'])
# add coefficient of variation (stdev / mean)
condition_stats['CV_MeanAct'] = condition_stats.Std_MeanAct / condition_stats.Mean_MeanAct
condition_stats['CV_Q99Act'] = condition_stats.Std_Q99Act / condition_stats.Mean_Q99Act
condition_stats
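
As a small worked example of the coefficient of variation (stdev / mean) used above: the numbers below are made up, and note that pandas' `.std()` returns the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

daily_means = pd.Series([100.0, 120.0, 80.0])  # made-up daily mean activities

mean = daily_means.mean()  # 100.0
std = daily_means.std()    # sample stdev (ddof=1): sqrt((0 + 400 + 400) / 2) = 20.0
cv = std / mean            # 0.2, i.e. day-to-day spread is 20% of the mean level
print(round(cv, 3))
```

Because the CV is scale-free, it allows comparing day-to-day variability between participants with very different overall activity levels.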

In [None]:
# look at correlation of different metrics (drop the non-numeric id column first)
condition_stats.drop(columns='number').corr()

## Control

In [None]:
# show all control files
!ls ../input/the-depression-dataset/data/control

In [None]:
# iterate over all files and extract statistics 
mean_list_control = []
std_list_control = []
q99_list_control = []
std_q99_list_control = []

for i in range(32):
    j = 1+i
    my_file = data_path + 'control/control_' + str(j) + '.csv'
    print('Extracting from:', my_file)
    df_temp = pd.read_csv(my_file)
    
    # group by date
    df_temp_by_date = df_temp.groupby(['date'], as_index=False).agg(
        n = pd.NamedAgg(column='activity', aggfunc='count'),
        mean_act = pd.NamedAgg(column='activity', aggfunc='mean'),
        q99_act = pd.NamedAgg(column='activity', aggfunc=lambda x : np.percentile(a=x, q=99)),
        max_act = pd.NamedAgg(column='activity', aggfunc='max')
    )

    # remove incomplete days (24*60 minutes = 1440)
    df_temp_by_date = df_temp_by_date[df_temp_by_date.n==1440]
    # remove days with unreasonably low average daily activity
    df_temp_by_date = df_temp_by_date[df_temp_by_date.mean_act > daily_threshold]
    
    print(df_temp_by_date)
    print()
    
    # extract statistics
    mean_temp = df_temp_by_date.mean_act.mean() # mean of mean daily activity
    std_temp = df_temp_by_date.mean_act.std() # stdev of mean daily activity
    mean_q99_temp = df_temp_by_date.q99_act.mean() # mean of 99th percentiles of daily activity
    std_q99_temp = df_temp_by_date.q99_act.std() # stdev of 99th percentiles of daily activity

    # add results to lists
    mean_list_control.append(mean_temp)
    std_list_control.append(std_temp)
    q99_list_control.append(mean_q99_temp)
    std_q99_list_control.append(std_q99_temp)

In [None]:
# store results in data frame
control_stats =  pd.DataFrame(zip(df_control.number, mean_list_control, q99_list_control, std_list_control, std_q99_list_control), 
                               columns=['number','Mean_MeanAct','Mean_Q99Act','Std_MeanAct','Std_Q99Act'])
# add coefficient of variation (stdev / mean)
control_stats['CV_MeanAct'] = control_stats.Std_MeanAct / control_stats.Mean_MeanAct
control_stats['CV_Q99Act'] = control_stats.Std_Q99Act / control_stats.Mean_Q99Act
control_stats

In [None]:
# look at correlation of different metrics (drop the non-numeric id column first)
control_stats.drop(columns='number').corr()

<a id='6'></a>
# Comparison Condition vs Control

In [None]:
# basic stats of condition group
condition_stats.describe()

In [None]:
# basic stats of control group
control_stats.describe()

### => The control group shows higher activity (Mean_MeanAct and Mean_Q99Act) and higher absolute variability (Std_MeanAct and Std_Q99Act); the CVs, however, are on a similar level in both groups.
### Let's visualize:

In [None]:
# combine statistics into one common data frame
condition_stats['Group'] = 'Condition'
control_stats['Group'] = 'Control'
combined_stats = pd.concat([condition_stats, control_stats], ignore_index=True) # avoid duplicate index labels

In [None]:
# compare means of daily means for the two groups
sns.boxplot(data=combined_stats, x='Group', y='Mean_MeanAct')
plt.title('Compare Means of Daily Means')
plt.grid()
plt.show()

# compare means of 99th percentiles
sns.boxplot(data=combined_stats, x='Group', y='Mean_Q99Act')
plt.title('Compare Means of Daily 99th Percentiles')
plt.grid()
plt.show()

# compare stdevs of daily means
sns.boxplot(data=combined_stats, x='Group', y='Std_MeanAct')
plt.title('Compare Stdevs of Daily Means')
plt.grid()
plt.show()

# compare CVs of daily means
sns.boxplot(data=combined_stats, x='Group', y='CV_MeanAct')
plt.title('Compare CVs of Daily Means')
plt.grid()
plt.show()

# compare stdevs of 99th percentiles
sns.boxplot(data=combined_stats, x='Group', y='Std_Q99Act')
plt.title('Compare Stdevs of Daily 99th Percentiles')
plt.grid()
plt.show()

# compare CVs of 99th percentiles
sns.boxplot(data=combined_stats, x='Group', y='CV_Q99Act')
plt.title('Compare CVs of Daily 99th Percentiles')
plt.grid()
plt.show()

### Look at individual observations:

In [None]:
# compare two groups using scatter plot
plt.figure(figsize=(8,6))
plt.scatter(condition_stats.Mean_MeanAct, condition_stats.CV_MeanAct, label='Condition')
plt.scatter(control_stats.Mean_MeanAct, control_stats.CV_MeanAct, label='Control')
plt.legend(loc='lower right')
plt.xlabel('Mean of Daily Means')
plt.ylabel('CV of Daily Means')
plt.title('Compare Groups using Mean and CV of Mean Daily Activity')
plt.grid()
plt.show()

In [None]:
# compare two groups using scatter plot - now use quantile based metrics
plt.figure(figsize=(8,6))
plt.scatter(condition_stats.Mean_Q99Act, condition_stats.CV_Q99Act, label='Condition')
plt.scatter(control_stats.Mean_Q99Act, control_stats.CV_Q99Act, label='Control')
plt.legend(loc='lower right')
plt.xlabel('Mean of Daily 99th Percentiles')
plt.ylabel('CV of Daily 99th Percentiles')
plt.title('Compare Groups using Mean and CV of 99th Percentiles of Daily Activity')
plt.grid()
plt.show()

In [None]:
# interactive plot using additional "quantile" dimension
fig = px.scatter_3d(combined_stats, x='Mean_MeanAct', y='Std_MeanAct', z='CV_Q99Act',
                    color='Group',
                    hover_data=['number'],
                    opacity=0.5)
fig.update_layout(title='Compare Groups using Mean/Stdev of Mean Daily Activity and CV of Daily 99th Perc.')
fig.show()

#### Does afftype (1: bipolar II, 2: unipolar depressive, 3: bipolar I) make a difference within the condition group?
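
To make plot legends easier to read, the numeric afftype codes can be mapped to the labels given in the heading. This is a minimal sketch on made-up rows; the column name `afftype_label` is an assumption for this example.

```python
import pandas as pd

# code-to-label mapping as stated in the heading above
AFFTYPE_LABELS = {1: "bipolar II", 2: "unipolar depressive", 3: "bipolar I"}

demo = pd.DataFrame({"afftype": [2, 1, 3, 2]})  # made-up rows
demo["afftype_label"] = demo["afftype"].map(AFFTYPE_LABELS)
print(demo)
```

Passing such a label column as `hue` would then show the names instead of the codes in the legend.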

In [None]:
# add stats to original data frame (condition group) to get access to all features
df_condition_x = pd.concat([df_condition, condition_stats.drop('number', axis=1)], axis=1)
df_condition_x = df_condition_x.drop('Group', axis=1)
df_condition_x.head()

In [None]:
# scatterplot, show afftype via color
sns.scatterplot(data=df_condition_x,
                x='Mean_MeanAct', y='CV_MeanAct',
                hue='afftype')
plt.grid()
plt.show()

In [None]:
# scatterplot, show afftype via color
sns.scatterplot(data=df_condition_x,
                x='Mean_Q99Act', y='CV_Q99Act',
                hue='afftype')
plt.grid()
plt.show()

#### No obvious separation by afftype is visible in these plots.

## For plots of all time series see the additional notebook https://www.kaggle.com/docxian/depression-and-motor-activity-all-plots