# `DEPRESJON` - Exploring Gender Specific Machine Learning Models


## Depresjon

### Dataset
The [publicly available dataset “Depresjon”](https://datasets.simula.no/depresjon/) consists of motor activity recordings of 23 unipolar and bipolar depressed patients, stored as files labelled `condition`, and 32 healthy controls, stored as files labelled `control`. The data was collected using Actiwatch devices worn on the right wrist. The dataset also contains a file (scores.csv) which includes a limited amount of biodemographical details about the persons, as well as clinical details (MADRS scores) of the `condition` persons. (Garcia-Ceja et al., 2018)

![Actigraphy Day by ID and Condition](../_images/days-id-label.png)
### Actigraphy data
The Actiwatch, developed by Cambridge Neurotechnology Ltd, England (model AW4), measures activity levels by recording movements with a sampling frequency of 32 Hz. Movements exceeding 0.05 g generate a corresponding voltage, stored as activity counts in the device's memory unit. These counts are indicative of movement intensity, with higher counts representing greater activity levels. The data is recorded at one-minute intervals, providing a detailed temporal profile of motor activity (Fogner et al., 2019).
### Depression rating
The Montgomery-Åsberg Depression Rating Scale (MADRS) is a depression scale designed to be particularly sensitive to treatment effects. It was constructed from 17 commonly occurring symptoms of depression by keeping the 10 items that showed the largest “mean change” under treatment and the highest “correlation to total change”. Each item is rated from 0 to 6, and the item ratings are summed to give the overall depression value (Montgomery and Åsberg, 1979).
One key feature of the MADRS is its focus on specific symptoms rather than general mood assessment. This allows for a more nuanced understanding of the patient's depressive experience and facilitates targeted treatment approaches.

## Literature review

### DEPRESJON - Origins

The “Depresjon” dataset was introduced by Garcia-Ceja et al. at the *ACM Multimedia Systems Conference* in Amsterdam in 2018 and has become the de facto dataset for predicting depression state from actigraphy data, so much so that it is often referred to as the ‘depression dataset’.

It had its origins in a 2010 study by Berle et al. investigating motor activity patterns in schizophrenic and depressed patients. This study focused on statistical analyses of differences in motor activity between three groups: controls, patients with schizophrenia and patients diagnosed with depression. It looked at features such as interdaily stability, intradaily variability, relative amplitude, night activity and total activity, and found distinct differences between the groups.

Garcia-Ceja et al., whose team includes several members of Berle’s 2010 team, introduced and described the “Depresjon” dataset in their 2018 article *DEPRESJON: A Motor Activity Database of Depression Episodes in Unipolar and Bipolar Patients*. Their main objective was to share a novel, open dataset which contains motor activity readings from patients diagnosed with depression and healthy controls. This enables researchers to study the association between activity patterns and depression, as well as develop machine learning (ML) approaches for predicting depression and severity of depression. The article includes a baseline evaluation using several ML classifiers as well as suggested evaluation metrics, all in the interest of transparency.

### Articles influencing this study

#### Day and Night

Rodriguez-Ruiz et al. [2020] were among the first to analyse day and night data separately. This was an interesting idea as it indirectly addresses our hypothesis that depressed people show less activity during the day and more activity at night. They concluded that there was a significant difference between the results on day and night data, and went on to suggest that depression states could be classified from night-time actigraphy data alone.

They cropped the data so that every participant had the same number of activity readings. Two subsets were then created by splitting the data into day (08:00 to 20:59) and night (21:00 to 07:59); the entire dataset was also retained. The activity data was standardised, transforming the raw readings into values that reflect the variation of the data from a central value. This process was not fully described, so we assumed that the central value was the mean of the entire dataset.
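
The split and standardisation described above can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's code: the column names `timestamp` and `activity` are assumptions, and scaling by the standard deviation (a z-score) is one possible reading of the under-described standardisation step.

```python
import numpy as np
import pandas as pd


def split_day_night(df):
    """Split minute-level readings into day (08:00-20:59) and night (21:00-07:59)."""
    hour = df["timestamp"].dt.hour
    is_day = (hour >= 8) & (hour <= 20)
    return df[is_day], df[~is_day]


def standardise(activity):
    """Centre on the mean of the whole series and scale by its standard deviation.

    Assumption: the 'central value' is the mean of the entire dataset.
    """
    return (activity - activity.mean()) / activity.std(ddof=0)


# Two synthetic days of minute-level readings (made-up values, for illustration only).
rng = np.random.default_rng(0)
timestamps = pd.date_range("2024-01-01", periods=2 * 1440, freq="min")
readings = pd.DataFrame({
    "timestamp": timestamps,
    "activity": rng.poisson(lam=50, size=len(timestamps)),
})

readings["activity_std"] = standardise(readings["activity"])
day, night = split_day_night(readings)
print(len(day), len(night))
```

Standardising before splitting uses statistics from the whole series; when a separate test set exists, those statistics would need to come from the training portion only to avoid leakage.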

When studying this paper and recreating the model, it became clear that a number of steps were omitted or poorly described, which made the recreation difficult. Most notably, the way the data was cropped so that all participants had the same amount of data was not described in detail. We attempted to replicate the values reported in the paper, but without success.

Activity data was further processed by applying a Fourier transform where only the real values were retained. After this step, 14 features were extracted in hourly increments from the transformed data (the frequency domain); these were supplemented by 10 features from the untransformed data (the time domain). These features were then used to train models with very high levels of accuracy. The feature-creation process as described may introduce data leakage, which could well explain the very high (overfit) accuracies (as high as 99.7%).

#### Sleep

Hu et al. [2024] continued the work on using sleep disturbance as an indicator of depression. Each minute of activity data was classified as either sleep or non-sleep using a voting mechanism: for the minute being classified, the 5 minutes before and the 5 minutes after were extracted, giving 11 readings. If 10 out of the 11 readings held activity values less than 5, the minute was classified as a sleep state. The following nine features were extracted from this sleep data:

* Sleep start time
* Sleep end time
* Total duration of sleep
* Sleep efficiency
* Length of waking during sleep
* Number of wakings during sleep
* Frequency of waking during sleep
* Maximum activity during sleep
* Minimum activity during sleep

Another notable feature of this work is the use of an improved missing-data interpolation method (GAIN) to impute missing values. Data was also standardised using the min-max method, the Z-score method and a self-fitting transformation.

These features were then used to train an ensemble model with high levels of accuracy.
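
The voting rule used to label sleep minutes can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: "10 out of 11" is read as "at least 10", minutes too close to the edges to have a full window are left as non-sleep, and the function and parameter names are hypothetical.

```python
import numpy as np


def classify_sleep(activity, half_window=5, threshold=5, min_votes=10):
    """Label each minute as sleep (True) when at least `min_votes` of the
    readings in its +/- `half_window` minute window are below `threshold`."""
    activity = np.asarray(activity)
    window = 2 * half_window + 1
    is_sleep = np.zeros(len(activity), dtype=bool)
    if len(activity) < window:
        return is_sleep  # not enough data for a single full window

    low = (activity < threshold).astype(int)
    # Moving sum: number of low-activity minutes in every full window.
    votes = np.convolve(low, np.ones(window, dtype=int), mode="valid")
    # Assumption: edge minutes without a full window stay labelled non-sleep.
    is_sleep[half_window:len(activity) - half_window] = votes >= min_votes
    return is_sleep


# 5 active minutes, 20 still minutes, 5 active minutes (made-up values).
example = [100] * 5 + [0] * 20 + [100] * 5
labels = classify_sleep(example)
print(labels.sum())
```

Features such as sleep start time or total sleep duration could then be derived from the first, last and total count of `True` labels in a night.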



## Research Area
When conducting initial research into the dataset and associated literature, we could find no investigation into how gender might affect model construction and performance.

It has been well documented that depression is more prevalent in women [Piccinelli and Wilkinson, 2000]; recent studies have also shown significant differences between genders in symptoms, including lack of energy and psychomotor retardation [Sabic et al., 2021]. It stands to reason that predictive models based on activity data may differ by gender.

![Average Activity by Gender](../_images/mf-average-activity.png)

A two-sample t-test suggests there is a statistically significant difference in mean activity between male and female observations (T-statistic: -32.582, p < 0.01).
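
A test of this kind can be run with `scipy.stats.ttest_ind`. The sketch below uses synthetic samples, so it will not reproduce the statistic reported above; the group means, sample sizes and the choice of the Welch variant (`equal_var=False`) are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic activity values for two groups (made-up numbers, not the Depresjon data).
rng = np.random.default_rng(42)
male_activity = rng.normal(loc=150, scale=60, size=500)
female_activity = rng.normal(loc=190, scale=60, size=500)

# Welch's two-sample t-test: does not assume equal variances (an assumption here).
t_stat, p_value = stats.ttest_ind(male_activity, female_activity, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```

A negative t-statistic indicates that the first sample has the lower mean.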

Based on the evidence from our initial exploration, we decided to focus our work on these three questions:

* What activity features could be extracted from our data?
* Are different machine learning algorithms better suited to the male, female and gender-unknown datasets?
* Is there a difference in feature importance between the male, female and gender-unknown datasets?

## Process

<img src="../_images/IGP process (1).png" width="200"/>


## Preprocessing

1. Import data from the CSV files and store it as a dataframe
2. Extract 'full days', where a unique [participant, date] pair has 1440 rows (minutes in a day)
3. Reduce to the 'number of days' specified in `scores.csv` and import gender
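
Steps 2 and 3 above can be sketched with pandas as follows. This is an illustration on synthetic data, not the notebook code: the column names `participant`, `timestamp` and `activity`, the helper names, and the dictionary standing in for the day counts from `scores.csv` are all assumptions.

```python
import numpy as np
import pandas as pd


def extract_full_days(df, minutes_per_day=1440):
    """Keep only [participant, day] pairs that have a reading for every minute."""
    df = df.assign(day=df["timestamp"].dt.normalize())
    rows_per_day = df.groupby(["participant", "day"])["activity"].transform("count")
    return df[rows_per_day == minutes_per_day]


def keep_first_n_days(df, n_days):
    """Keep each participant's first n_days[participant] days, in date order."""
    days = (df[["participant", "day"]]
            .drop_duplicates()
            .sort_values(["participant", "day"]))
    days["day_rank"] = days.groupby("participant").cumcount() + 1
    days = days[days["day_rank"] <= days["participant"].map(n_days)]
    return df.merge(days[["participant", "day"]], on=["participant", "day"])


# One synthetic participant recorded from midday for four days (made-up values).
timestamps = pd.date_range("2024-01-01 12:00", periods=4 * 1440, freq="min")
raw = pd.DataFrame({
    "participant": "p1",
    "timestamp": timestamps,
    "activity": np.arange(len(timestamps)) % 100,
})

full = extract_full_days(raw)                 # drops the two partial days
trimmed = keep_first_n_days(full, {"p1": 2})  # hypothetical day count for p1
print(full["day"].nunique(), len(trimmed))
```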


## Feature Engineering
Features: 
* Mean
* Median
* Quartiles (Q1 and Q3)
* Kurtosis
* Standard Deviation
* Percentage of Zero Activity
* Inactivity during the day (`InactiveDay`)
* Activity during the night (`ActiveNight`)
* Inactivity during daylight hours (`InactiveLight`)
* Activity during darkness (`ActiveDark`)

For details and code see [01-data-prep.ipynb](./01-data-prep.ipynb)
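
As a rough illustration of how such per-day features might be computed, the sketch below summarises one synthetic participant-day. The exact definitions live in the notebook; here inactivity is assumed to mean a zero activity count, the day window is assumed to be 08:00 to 20:59, and the daylight/darkness variants (which would replace the clock-based mask with sunrise and sunset times) are left out.

```python
import numpy as np
import pandas as pd


def daily_features(day_df):
    """Summarise one participant-day of minute-level activity (hypothetical definitions)."""
    activity = day_df["activity"]
    hour = day_df["timestamp"].dt.hour
    is_day = (hour >= 8) & (hour <= 20)  # assumed clock-based day window
    is_zero = activity == 0              # assumed definition of inactivity
    return pd.Series({
        "mean": activity.mean(),
        "median": activity.median(),
        "q1": activity.quantile(0.25),
        "q3": activity.quantile(0.75),
        "kurtosis": activity.kurt(),
        "std": activity.std(),
        "percent_zero": is_zero.mean(),
        # Share of the day's minutes that are inactive and fall in the day window.
        "inactiveDay": (is_zero & is_day).mean(),
        # Share of the day's minutes that are active and fall in the night window.
        "activeNight": (~is_zero & ~is_day).mean(),
    })


# One synthetic day: no movement before 08:00, constant movement afterwards.
timestamps = pd.date_range("2024-01-01", periods=1440, freq="min")
day_df = pd.DataFrame({
    "timestamp": timestamps,
    "activity": np.where(timestamps.hour < 8, 0, 10),
})
features = daily_features(day_df)
print(features.round(3))
```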

## Baseline

### Approach
Each model was fit using 5-fold cross-validation with default settings, that is, no hyperparameter tuning.

### Models
* Zero R Classifier
* Random Forest
* Linear SVC
* Decision Tree
* Logistic Regression
* KNN
* Naïve Bayes
* Neural Network
* XGBoost
* LightGBM
* AdaBoost
* QDA
* Gradient Boosting
* SVM rbf
* SVM linear
* Gaussian Process

### Metrics 
* Accuracy
* Precision
* Recall
* F1
* MCC
* Specificity
* ROC-AUC
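
A baseline run of this kind can be sketched with scikit-learn's `cross_validate`. The example below is an illustration on synthetic data with only three of the listed models; the Zero R classifier is approximated by `DummyClassifier(strategy="most_frequent")`, and specificity is computed as the recall of the negative class, both of which are assumptions about how the notebook does it.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef, recall_score
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data standing in for the engineered features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "mcc": make_scorer(matthews_corrcoef),
    # Specificity is the recall of the negative class (label 0).
    "specificity": make_scorer(recall_score, pos_label=0),
    "roc_auc": "roc_auc",
}

# Default settings throughout, i.e. no hyperparameter tuning.
models = {
    "Zero R": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    results[name] = {metric: cv[f"test_{metric}"].mean() for metric in scoring}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

The Zero R row gives a floor against which the other classifiers can be compared; it may trigger an undefined-precision warning when it never predicts the positive class.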

### Results

#### Male dataset


| Model | Accuracy | F1 | MCC | Precision | Recall | Specificity | ROC-AUC |
|-------|----------|----|----|-----------|--------|-------------|---------|
| Neural network | 0.908 | 0.912 | 0.821 | 0.919 | 0.912 | 0.904 | 0.963 |
| SVC linear | 0.900 | 0.908 | 0.806 | 0.886 | 0.934 | 0.865 | 0.936 |
| SVM linear | 0.897 | 0.904 | 0.799 | 0.885 | 0.927 | 0.865 | 0.934 |
| QDA | 0.897 | 0.912 | 0.808 | 0.852 | 0.985 | 0.802 | 0.958 |
| Logistic Regression | 0.882 | 0.889 | 0.757 | 0.877 | 0.905 | 0.857 | 0.928 |
| SVM rbf | 0.882 | 0.894 | 0.774 | 0.844 | 0.956 | 0.802 | 0.949 |
| LightGBM | 0.878 | 0.888 | 0.759 | 0.869 | 0.912 | 0.841 | 0.937 |
| Gradient Boosting | 0.874 | 0.885 | 0.753 | 0.851 | 0.926 | 0.818 | 0.929 |
| XGBoost | 0.87 | 0.879 | 0.745 | 0.86 | 0.904 | 0.834 | 0.934 |
| AdaBoost | 0.859 | 0.869 | 0.729 | 0.853 | 0.899 | 0.816 | 0.921 |
| Random Forest | 0.844 | 0.859 | 0.703 | 0.82 | 0.912 | 0.77 | 0.912 |
| KNN | 0.844 | 0.857 | 0.691 | 0.822 | 0.897 | 0.786 | 0.897 |
| Decision Tree | 0.809 | 0.815 | 0.622 | 0.827 | 0.808 | 0.81 | 0.809 |
| Naïve Bayes | 0.756 | 0.774 | 0.518 | 0.759 | 0.802 | 0.708 | 0.826 |

#### Female dataset


| Model | Accuracy | F1 | MCC | Precision | Recall | Specificity | ROC-AUC |
|-------|----------|----|----|-----------|--------|-------------|---------|
| Neural network | 0.907 | 0.863 | 0.796 | 0.876 | 0.856 | 0.935 | 0.951 |
| SVC linear | 0.852 | 0.774 | 0.671 | 0.833 | 0.729 | 0.916 | 0.914 |
| SVM linear | 0.831 | 0.729 | 0.615 | 0.809 | 0.667 | 0.916 | 0.895 |
| QDA | 0.729 | 0.681 | 0.495 | 0.569 | 0.856 | 0.663 | 0.881 |
| Logistic Regression | 0.837 | 0.747 | 0.632 | 0.802 | 0.702 | 0.906 | 0.896 |
| SVM rbf | 0.862 | 0.772 | 0.689 | 0.876 | 0.702 | 0.944 | 0.934 |
| LightGBM | 0.846 | 0.775 | 0.659 | 0.778 | 0.775 | 0.883 | - |
| Gradient Boosting | 0.868 | 0.804 | 0.707 | 0.819 | 0.793 | 0.907 | 0.936 |
| XGBoost | 0.840 | 0.761 | 0.643 | 0.778 | 0.748 | 0.888 | 0.916 |
| AdaBoost | 0.828 | 0.719 | 0.610 | 0.821 | 0.648 | 0.920 | 0.898 |
| Random Forest | 0.846 | 0.767 | 0.658 | 0.806 | 0.738 | 0.902 | 0.920 |
| KNN | 0.855 | 0.782 | 0.675 | 0.799 | 0.766 | 0.902 | 0.902 |
| Decision Tree | 0.822 | 0.741 | 0.609 | 0.743 | 0.747 | 0.859 | 0.803 |
| Naïve Bayes | 0.745 | 0.649 | 0.454 | 0.614 | 0.693 | 0.771 | 0.809 |

#### Both dataset



| Model | Accuracy | F1 | MCC | Precision | Recall | Specificity | ROC-AUC |
|-------|----------|----|----|-----------|--------|-------------|---------|
| Neural network | 0.905 | 0.886 | 0.805 | 0.894 | 0.878 | 0.924 | 0.958 |
| SVC linear | 0.871 | 0.844 | 0.736 | 0.858 | 0.834 | 0.897 | 0.944 |
| SVM linear | 0.874 | 0.849 | 0.743 | 0.855 | 0.846 | 0.895 | 0.943 |
| QDA | 0.845 | 0.831 | 0.700 | 0.781 | 0.895 | 0.809 | 0.928 |
| Logistic Regression | 0.876 | 0.850 | 0.746 | 0.865 | 0.838 | 0.903 | 0.948 |
| SVM rbf | 0.866 | 0.830 | 0.728 | 0.881 | 0.793 | 0.918 | 0.948 |
| LightGBM | 0.867 | 0.837 | 0.729 | 0.866 | 0.814 | 0.906 | 0.947 |
| Gradient Boosting | 0.869 | 0.833 | 0.734 | 0.891 | 0.789 | 0.927 | 0.949 |
| XGBoost | 0.883 | 0.855 | 0.761 | 0.888 | 0.830 | 0.921 | 0.950 |
| AdaBoost | 0.854 | 0.814 | 0.705 | 0.875 | 0.773 | 0.912 | 0.938 |
| Random Forest | 0.862 | 0.828 | 0.791 | 0.867 | 0.797 | 0.909 | 0.935 |
| KNN | 0.833 | 0.791 | 0.658 | 0.835 | 0.757 | 0.889 | 0.894 |
| Decision Tree | 0.801 | 0.766 | 0.597 | 0.759 | 0.777 | 0.819 | 0.798 |
| Naïve Bayes | 0.740 | 0.699 | 0.477 | 0.683 | 0.725 | 0.751 | 0.822 |

In addition to the above metrics, we also captured model training time: 

![Barplot of Model Training Time on Male dataset](../_images/barplot-male-time.png)

Further plots and code can be found here: [02-baseline-models](./02-baseline-models.ipynb)

### Summary

* The `Neural Network` classifier is the strongest performer overall, with the highest accuracy, MCC and ROC-AUC on each of the datasets.
    * It also takes the longest time to train, by some margin.
* `Naïve Bayes` performed the worst for each of the datasets across most metrics.
* The **Male** dataset also had strong performance from `SVC linear`, `SVM linear` and `QDA`.
* The **Female** dataset also had strong performance from `Gradient Boosting` and `SVM rbf`, but the results were more scattered.
* The **Both** dataset also had strong performance from `Logistic Regression` and `XGBoost`.



## Feature Importance Evaluation

We know that there is high multicollinearity in our dataset and that several of the features are very similar in terms of what they measure. For example, `InactiveDay` is very similar to `InactiveLight`, the key difference being that one is based on clock time and the other on actual sunrise and sunset in Norway. We are interested in understanding which of our new feature pairs ([`InactiveDay`, `ActiveNight`] or [`InactiveLight`, `ActiveDark`]) is most effective for predicting depression states. As our goal is to produce a simple, efficient and interpretable model, we are looking at feature removal as opposed to feature-extraction options like `Principal Component Analysis`, which is an effective dimensionality reduction technique but may be less interpretable.

We used several approaches to investigate feature importance:

* **SHAP** (SHapley Additive exPlanations) - this uses 'cooperative game theory' to calculate a feature's contribution to the prediction (Lundberg and Lee, 2017) 
* **VIF** (Variance Inflation Factor) - a measure that quantifies the severity of multicollinearity. Values above the commonly used thresholds of 5 or 10 suggest significant multicollinearity and warrant investigation as possibly redundant or highly correlated features (Wikipedia, 2024).
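
To make the VIF idea concrete, the sketch below computes it with NumPy only, using the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse of the feature correlation matrix. The data is synthetic and the function name is hypothetical; it is not the code used in the notebook.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X.

    VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
    inverse of the correlation matrix of the features.
    """
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))


# x2 is almost a copy of x1, while x3 is independent (made-up data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

vif = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vif.round(2))
```

The two near-duplicate columns receive large values while the independent column stays close to 1, which is the pattern that flags candidate features for removal.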

Three methods were deployed to calculate the top_n features for each classifier - we chose top_n = 5:

* **filter** - uses SelectKBest functionality to evaluate performance of each model after selecting n features with ANOVA F-value as the scoring function; features are selected based on their individual relevance to the target.
* **wrapper** - uses Recursive Feature Elimination (RFE) functionality with cross-validation to evaluate performance of each model after selecting top_n features; features are selected based on their contribution to the model's performance; not available for some models, e.g. QDA.
* **embedded** - evaluates the model performance after training on the entire dataset without feature selection; models are trained with feature selection built into their learning process (where available)
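
The three selection styles can be illustrated with scikit-learn as below. This is a sketch on synthetic data, not the notebook code: the estimators are arbitrary choices, plain `RFE` is shown without the surrounding cross-validation loop, and the embedded case is approximated by ranking a random forest's `feature_importances_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

TOP_N = 5
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# Filter: score each feature on its own with the ANOVA F-value.
filter_mask = SelectKBest(score_func=f_classif, k=TOP_N).fit(X, y).get_support()

# Wrapper: recursively drop the weakest feature according to the fitted model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=TOP_N)
wrapper_mask = rfe.fit(X, y).get_support()

# Embedded: train on all features and read the importances the model learnt.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
embedded_top = forest.feature_importances_.argsort()[::-1][:TOP_N]

print("filter:  ", filter_mask.nonzero()[0])
print("wrapper: ", wrapper_mask.nonzero()[0])
print("embedded:", sorted(embedded_top))
```

`RFE` needs an estimator that exposes `coef_` or `feature_importances_`, which is why the wrapper method is not available for models such as QDA.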

Further information, examples and code can be found here: [03-feature-removal](./03-feature-removal.ipynb)

### Male Dataset

![VIF for Male Dataset](../_images/male-vif.png)
![Feature Importance Heatmap on Male Dataset](../_images/male-feat-imp-hm.png)
![Feature Importance Barplot for Male Dataset](../_images/male-feat-imp-bars.png)
![Feature Importance by Model and Method](../_images/male-feat-model-method.png)

The above plots (and more) are available [here](./03-feature-removal.ipynb) for Female and Both datasets.

### Summary

Multicollinearity is a significant part of this model and all other models using the Depresjon dataset.  The dataset is essentially one column of activity data at minute level and all features are engineered from this column.  It is essential to ensure that there is no data leakage by being very careful in the process.  Multicollinearity can result in unstable coefficients, reduced interpretability, redundancy and model reinforcement/bias by including complementary features.  This is the reason we looked at feature removal and model simplicity, even at the expense of some performance. 

Future work may involve exploring multicollinearity further and through different methods, including regularisation to minimise collinearity (Ridge, Lasso), ensemble models with different features for different models, or calculating new features from the existing ones - e.g. using combination (linear or non-linear), clustering, PCA or feature interaction techniques.

#### Male

Following iterative evaluation of models and subsets of features it was found that: 

* Best three feature model for Male Dataset is: `LightGBM` with Accuracy = 0.863 using `inactiveLight`, `activeDark`, `percent_zero`
* Best two feature model (`inactiveLight`, `activeDark`) is SVM (rbf) with Accuracy = 0.84 and MCC = 0.693 

#### Female

Following iterative evaluation of models and subsets of features it was found that: 

* Best three feature model for Female Dataset is: `Gradient Boosting` with Accuracy = 0.834 using `inactiveLight`, `activeDark`, `std`
* Best two feature model (`inactiveLight`, `activeDark`) is `Gradient Boosting` with Accuracy = 0.806 and MCC = 0.567 

#### Both

Following iterative evaluation of models and subsets of features it was found that: 

* Best three feature model for Both Dataset is: `Random Forest` with Accuracy = 0.820 using `inactiveLight`, `activeDark`, `median`
* Best two feature model (`inactiveLight`, `activeDark`) is `Gradient Boosting` with Accuracy = 0.791 and MCC = 0.574



## Hyperparameter Tuning

Hyperparameter tuning with `RandomizedSearchCV` was performed on the six final models:

### Male Models

**LightGBM**

![Male Two Feature Model Parameters](../_images/male-2-feature-final.png)

**SVC (rbf)**

![Male Three Feature Model Parameters](../_images/male-3-feature-final.png)


| Model | Accuracy | MCC |
|-------|----------|-----|
| Baseline (Neural Network) | 0.908 | 0.821 |
| Three Feature Final (SVC-rbf) | 0.894 | 0.780 |
| Two Feature Final (LightGBM) | 0.871 | 0.752 |

### Female Models

**Gradient Boosting**

![Female Two Feature Model Parameters](../_images/female-2-feature-final.png)

**Gradient Boosting**

![Female Three Feature Model Parameters](../_images/female-3-feature-final.png)


| Model | Accuracy | MCC |
|-------|----------|-----|
| Baseline (Neural Network) | 0.907 | 0.796 |
| Three Feature Final (Gradient Boosting) | 0.858 | 0.683 |
| Two Feature Final (Gradient Boosting) | 0.822 | 0.608 |

### Both Models

**Gradient Boosting**

![Both Two Feature Model Parameters](../_images/both-2-feature-final.png)

**Random Forest**

![Both Three Feature Model Parameters](../_images/both-3-feature-final.png)


| Model | Accuracy | MCC |
|-------|----------|-----|
| Baseline (Neural Network) | 0.905 | 0.805 |
| Three Feature Final (Random Forest) | 0.829 | 0.647 |
| Two Feature Final (Gradient Boosting) | 0.805 | 0.599 |


Further information and code can be found here: [05-final-model](./05-final-model.ipynb)
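
For readers unfamiliar with the tuning step in this section, a minimal `RandomizedSearchCV` sketch is shown below. The data is synthetic, and the parameter grid, number of iterations and the use of MCC as the search metric are assumptions; the tuned parameters actually used are those shown in the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-feature dataset standing in for a final feature subset.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Hypothetical search space; each list is sampled uniformly at random.
param_distributions = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [2, 3, 4, 5],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                                # number of sampled combinations
    cv=5,                                    # 5-fold cross-validation
    scoring=make_scorer(matthews_corrcoef),  # assumed selection metric
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```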

## Validation on Final Models

The moment of truth comes when checking model performance against a completely unseen dataset - the validation dataset.  This data subset was taken at the very beginning and has not been looked at or handled in any way.  The features were generated independently using the same methods and standardisation/normalisation was done using the coefficients from the train datasets.  This is where we learn whether the model has been overfitted to the train sample or if it generalises (at least to unseen examples from the Depresjon dataset).
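
The key point, that scaling coefficients come from the training data only, can be sketched as follows. The data is synthetic and `train_test_split` merely stands in for however the validation subset was originally set aside; the classifier choice is also illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
# Stand-in for the validation subset held out at the very beginning.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Fit the scaler on the training data only, then reuse its coefficients.
scaler = StandardScaler().fit(X_train)
model = SVC(kernel="rbf").fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_val))

scores = {
    "accuracy": accuracy_score(y_val, y_pred),
    "mcc": matthews_corrcoef(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
}
print({name: round(value, 3) for name, value in scores.items()})
```

Fitting the scaler on the combined data instead would let information from the validation set leak into training.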


| Classifier | Accuracy | MCC | F1 |
|------------|----------|-----|----| 
| Male – 2 feature: LightGBM | 0.766 | 0.572 | 0.807 |
| Male – 3 feature: SVC-rbf | 0.851 | 0.717 | 0.868 |
| Female – 2 feature: Gradient Boosting | 0.741 | 0.409 | 0.595 |
| Female – 3 feature: Gradient Boosting | 0.862 | 0.689 | 0.778 |
| Both – 2 feature: Gradient Boosting | 0.731 | 0.449 | 0.682 |
| Both – 3 feature: Random Forest | 0.750 | 0.488 | 0.705 |

As can be seen from the table above, the results on the validation set are generally lower than those achieved on the cross-validated training set. However, the accuracies and F1 values are still mostly respectable. We were particularly interested in the `Matthews Correlation Coefficient` (MCC) and note that these values have dropped from the 0.6-0.8 range to the 0.4-0.7 range.

MCC ranges from -1 to 1, where 1 is perfect prediction, 0 is no better than chance and -1 is complete disagreement between predictions and labels. As a rule of thumb, we treated values greater than 0.5 as good, meaning the model makes meaningful predictions; values between 0 and 0.5 are still better than chance but indicate weaker agreement, which is a concern.
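
For reference, MCC can be computed directly from the four confusion-matrix counts. The sketch below uses only the standard library; the function name and the convention of returning 0 when the denominator is zero are choices made for this illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=50, tn=50, fp=0, fn=0))     # every prediction correct
print(mcc(tp=25, tn=25, fp=25, fn=25))   # predictions unrelated to the labels
print(mcc(tp=0, tn=0, fp=50, fn=50))     # every prediction wrong
```

Because all four counts enter the formula, MCC is less easily inflated by class imbalance than accuracy or F1.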

The female 2-feature model has an MCC of 0.409 which may mean that this model is not adequately performant.  Also of note is that the Both models (i.e. all genders) both have MCC values less than 0.5, while the Male models have MCC values of 0.57 and 0.72 for two and three features, respectively.  This would require further investigation but it adds more credence to the suspicion that it is worthwhile pursuing the gender-based classifier angle.  

## Discussion

### What went well

We are pleased with the outcomes of our project: everyone learnt a new skill, and we all learnt about ourselves and each other. We have listed a number of key achievements below:

* Created a large number of different plots and charts using Python libraries, experimenting with different summaries and sorting methods to enhance the visuals.
* Extracted a number of key features, including our own creations (`ActiveDark` and `InactiveLight`), which performed better than any other feature.
* Created a model that achieved impressive metrics whilst still maintaining strict adherence to our guiding principles (avoid data leakage, split data into sensibly sized subsets, investigate collinearity, reduce features to a minimum level).
* Managed the project well, with every member able to say they contributed something.

### Limitations of the dataset

* Dataset Size: The Depresjon dataset is very small with only 55 participants in total.
All studies split each participant's data up into various date and time denominations in order to increase the size of the train and test sets.
Such a small number of participants does not lend itself well to multi-class prediction (severity categories, type of depression) or regression analysis (severity on the MADRS scale).
It also means that incorporating age, severity, education, marital status, etc. is not viable unless measures are taken (e.g. data synthesis such as SMOTE, augmentation, GANs, adding noise, etc.).
* Generalisability:
The dataset only contains values from 23 hospitalised Norwegians with MDD (Major Depressive Disorder) and is almost certainly not reflective of a global population.
Depression affects individuals in different ways and there are many 'functioning depressives' as well as undiagnosed depressives.  It is unlikely that this limited and specific dataset will produce a model which will predict these depression states well.
It has to be stressed that the Norwegian 'activity' experience for both condition (MDD) and controls will be different to other locations.  Norwegians tend to be very 'active' relative to other nations and treatment of depression will vary significantly between nations.   
* Supplementary data:
The dataset contained some biodemographic data for all individuals, including gender, as well as some additional data for depressed patients only, however it was limited.
The dataset did not contain any additional lifestyle data such as fitness or general health of the participants. 
Without this information and given the small data size, it is very likely that any predictive models are specific to the activity patterns for these 55 persons.  We cannot baseline activity, health, lifestyle and account for these factors, meaning that models will most likely have limited generalisability.  For example, reductions in activity may have been due to fitness of the participant and not an indication of a depressive state.
* Complexity:
Depression is a multifaceted and complex condition; using activity alone to diagnose or classify a depressive state is never going to be a fully reliable way to do this.
That said, investigations and models, as proofs of concept, have shown that activity can feature as part of mental health diagnosis.

### Possible continuations for study

We believe that our initial investigations show that there is scope for further analysis into depression and activity patterns, in the form of correlations as well as predictive models.  Differences between genders can be attributed to natural variation, imbalanced datasets, etc. especially given the small dataset size, but there is also promising evidence to suggest that male and female depressives may display differing daily activity patterns.  Activity in darkness appeared to be a better predictor of depression for female participants in comparison to being inactive in light conditions for male depressives.  A future avenue to explore may be whether depression manifests more as poor sleep for women and lethargic days for men. 

In order for these and other studies to take place, we believe that there is a fundamental need for more data - both more participants (i.e. a bigger, more varied dataset) and more data per participant (biodemographic, baseline health, geographic, etc.).  The Depresjon dataset was released a few years ago with noble intentions from the authors (Garcia-Ceja et al.) but we believe it has run its course.  Virtually every machine learning technique has been applied and many feature engineering approaches have been undertaken - with increasingly good results, e.g. 99.7% accuracy for Rodriguez-Ruiz et al.  What this means, to us at least, is that predicting depression on this dataset has been optimised - in fact, it has been overfit to the extent that generalisability has been sacrificed.  Ultimately, these types of prediction should remain useful in the real world.

Therefore, new studies should focus on collecting more data with a clear study design or alternatively look at combining datasets.  There are other open actigraphy datasets in existence from the US and China in particular, but they have key differences - e.g. different actigraphy collection methods, different depression diagnosis (not MADRS), etc.  That said, it would be interesting to train models on one dataset and validate on a different dataset, for example.  

In terms of progressing the gender-based models, it is also necessary to collect more data in order to take into account the different lived experiences of men and women, especially with depression.  For example, age and other biodemographic markers will be significant as women will uniquely experience certain depressions associated with childbirth (postpartum, postnatal) and menopause.

### Final thoughts

We created a model that can categorise depressed states with over 90% accuracy under cross-validation, while achieving similarly strong precision and F1 scores and an MCC of around 0.8. The value of this model can be seen in two ways:

1. The model can be used in an advisory role, prompting people to seek medical help if a depressed state is classified from their activity data. It could also help monitor the state of an already diagnosed depressive, potentially allowing for a vital interventional conversation with a GP or family member.

2. We believe that we have made a proof-of-concept case for further study into the way activity plays a role in male and female depression states. Further investigation could extract much more valuable information about this debilitating illness.

We have enjoyed working together and will take the skills we have learnt in this project with us into our data science careers.


## References
