# Temple University Dataset

The Temple University Dataset is a large, open-access dataset of clinical EEG data.

Web link: https://www.isip.piconepress.com/projects/tuh_eeg/html/overview.shtml

Paper Overview: http://journal.frontiersin.org/article/10.3389/fnins.2016.00196/full

Ultimately, the goal is to use this dataset, whose scale has rarely if ever been matched among EEG datasets, to characterize and understand brain activity in a large cohort, spanning both 'normal' brain activity and disease.

## TASKS:
The first set of tasks relates to working on a pre-processing pipeline for the data. Regardless of the set of analyses we ultimately choose, the data need to be organized, cleaned and pre-processed in such a way that we have confidence that the analyses we run reflect brain activity, not artifacts. A key goal of setting up this pipeline is that it be automatic and robust - given the number of subjects, we need to have confidence we can run this pipeline without manual supervision, and trust the data that comes out of it. 

Note: since we're using MNE, we don't want to end up with some information inside the MNE objects, and some outside them. As we develop these processes, let's try to keep a tight integration with MNE. 

### First Goal:
- Fully clean the resting-state data of one subject, so that we can, for example, run the PSD over the remaining channels and trust that it reflects brain activity.

### Preprocessing Tasks
Pre-processing pipeline:
  - Sort out the channel names and data types. 
      - Target output: a list of just the standard channel names (example: FP1, Cz, T4)
          - Also: Keep track of channel types. (Another list of ['EEG', 'EEG', 'EMG'])
              - Then: make sure we can put this 'back' into the MNE objects - update the channel info accordingly. 
  - What are the event markers? How do we line them up with the data? Can we epoch into different 'states'?
      - Target output: extracted epoch of resting data. 
  - How can we (automatically?) reject bad channels?
      - We want to figure out MNE's available tools to reject channels, and start applying them.
      - Target output: For example subjects, channels we don't trust (visually) should be able to be 'dropped' (marked as bad so that they are ignored by future processing) by running an automatic process. 
  - How can we (automatically?) reject bad time segments?
      - We want to figure out MNE's available tools to reject time segments, and start applying them.
      - Target output: For example subjects, time segments we don't trust (visually) should be dropped (again, marked bad to be ignored) by running an automatic process. 
  - How can we (automatically?) deal with stereotyped artifacts, such as eye blinks, heartbeats, and saccades?
      - Here I mean artifacts that we can correct for, as opposed to time segments that we decide to avoid altogether.
      - We want to figure out MNE's available tools to correct for artifacts, and start applying them.
      - Target output: This depends slightly more on what is available (in the data, and in MNE), but ideally, for heartbeat and eye blinks (perhaps others), we would like to be able to automatically label them and 'subtract' them out from the data in some way - such that we can keep, and still use, these segments of data. 
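
As a starting point for the bad-channel step, here is a minimal sketch of one simple heuristic: flag channels whose log-variance is a robust outlier relative to the other channels. This is an illustration only, not MNE's method; the function name `flag_bad_channels`, the row-per-channel data layout, and the threshold of 5 are all assumptions that would need tuning against channels we distrust visually.

```python
import math
from statistics import median, pvariance


def flag_bad_channels(data, ch_names, thresh=5.0):
    """Flag channels whose log-variance is a robust outlier across channels.

    data     : sequence of per-channel sample sequences (assumed layout)
    ch_names : channel names, in the same order as the rows of `data`
    thresh   : cutoff on the robust z-score (arbitrary starting value)
    """
    # Log-variance per channel; the small offset keeps flat channels finite
    log_vars = [math.log10(pvariance(ch) + 1e-20) for ch in data]

    # The median and median absolute deviation (MAD) are barely affected by outliers
    med = median(log_vars)
    mad = median(abs(v - med) for v in log_vars)
    if mad == 0:
        return []  # no spread across channels, so nothing can be ranked

    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    return [name for name, v in zip(ch_names, log_vars)
            if 0.6745 * abs(v - med) / mad > thresh]
```

To stay consistent with MNE, the returned names could be added to `eeg_dat.info['bads']`, which is where MNE keeps track of channels marked as bad. For full recordings a NumPy version of the same computation would be much faster than this pure-Python one.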
      
Final outcome: A cleaned segment of resting data, ready for analysis, for 1 or 2 subjects. Working through this process will be very manually supervised, but hopefully by the time it's finished, the whole process can be re-run automatically, without human input, and this process can also be applied to other subjects. 
  
### Database Work (2nd level of task)
We will need utility functions to work with the database, to list subjects, find subjects, get paths, etc. 
- We'll work on this mainly when we want to scale up the number of subjects. Don't start it yet; I have some example code from other projects that can form the basis of this.
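
To give a sense of the kind of utilities meant here, a rough sketch follows. It assumes the layout seen in the example path used in this notebook (`tuh_eeg/v0.2/<subject>/<subject>_<n>.edf`), with one directory per subject directly under the version folder. The function names are hypothetical, and the real versions may look different once they are based on the existing example code.

```python
from pathlib import Path


def list_subjects(db_path):
    """Return sorted subject IDs, assuming one sub-directory per subject."""
    return sorted(p.name for p in Path(db_path).iterdir() if p.is_dir())


def get_subject_files(db_path, subj_id):
    """Return sorted paths of the EDF files found for one subject."""
    return sorted((Path(db_path) / subj_id).glob('*.edf'))
```

Here `db_path` would be something like `os.path.join(base_path, 'tuh_eeg/v0.2')`.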


### Metadata Work (3rd level of task)
Ultimately, we probably want some tools to be able to automatically scrape the clinical notes, and extract key parameters, like age, sex, and diagnoses. 
- This is not an immediate priority, as it will become more relevant as we fine-tune the analyses. We might, for example, start with the clearly labelled 'epilepsy vs non-epilepsy' group that is available, before we try to do large-scale mining of the clinical notes. 
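
For illustration, scraping simple fields could start from regular expressions, as sketched below. The patterns and the function name `extract_age_sex` are guesses: the actual wording of the clinical notes has not been checked here, so they would need to be adapted to the real text. Diagnoses are left out, since they will likely need more than a one-line pattern.

```python
import re

# Hypothetical patterns: the real note format has not been verified
AGE_RE = re.compile(r'(\d{1,3})[\s-]*(?:year|yr|y\.?o)', re.IGNORECASE)
SEX_RE = re.compile(r'\b(male|female|man|woman)\b', re.IGNORECASE)


def extract_age_sex(note):
    """Return (age, sex) found in a free-text note, with None for missing fields."""
    age_match = AGE_RE.search(note)
    sex_match = SEX_RE.search(note)

    age = int(age_match.group(1)) if age_match else None
    sex = None
    if sex_match:
        sex = 'F' if sex_match.group(1).lower() in ('female', 'woman') else 'M'
    return age, sex
```

Fields that are not found come back as `None`, so subjects with missing metadata can be counted rather than silently dropped.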

## Notes:
- This first level of work is very pre-process-y, but it is required to get started with this data. Once we have a couple of subjects set up with something that works at least reasonably well, we'll move on to more analysis stuff, and really get into the data, so that we're not stuck forever on pre-processing.
- The goal is to get a rough version of the full pipeline, allowing us to 'see' the whole project, and find any major hurdles. It also, in practice, lets us try out each part of the project first, without getting bogged down at any particular step. 
- Eventually, we will have to double back and make sure each step is really robust.
- Also, soon, we will work a bit on setting up our codebase as a proper module, for organization and scalability. 

In [6]:
# Imports
%matplotlib qt
import mne
import os

In [2]:
# Set up paths

# This base path will need updating
base_path = '/Users/thomasdonoghue/Documents/Research/1-Projects/z_Ideas/EEG_Database/'

# These should stay the same
#subj_dat_fname = 'tuh_eeg/v0.2/00000164/00000164_0.edf'
subj_dat_fname = 'tuh_eeg/v0.2/00000184/00000184_0.edf'

In [8]:
# Read in an example subject of data (subject chosen randomly)
full_path = os.path.join(base_path, subj_dat_fname)
eeg_dat = mne.io.read_raw_edf(full_path)

Extracting edf Parameters from /Users/thomasdonoghue/Documents/Research/1-Projects/z_Ideas/EEG_Database/tuh_eeg/v0.2/00000184/00000184_0.edf...
Setting channel info structure...
Creating Raw.info structure...
Ready.


In [9]:
# Check out the info object for this subject
eeg_dat.info

<Info | 17 non-empty fields
    bads : 'list | 0 items
    buffer_size_sec : 'float | 1.0
    ch_names : 'list | EEG FP1-REF, EEG FP2-REF, EEG F3-REF, EEG F4-REF, ...
    chs : 'list | 29 items (EEG: 28, STIM: 1)
    comps : 'list | 0 items
    custom_ref_applied : 'bool | False
    dev_head_t : 'mne.transforms.Transform | 3 items
    events : 'list | 0 items
    filename : 'str | /Users/tho.../00000184_0.edf
    highpass : 'float | 0.0 Hz
    hpi_meas : 'list | 0 items
    hpi_results : 'list | 0 items
    lowpass : 'float | 128.0 Hz
    meas_date : 'int | 1357041778
    nchan : 'int | 29
    projs : 'list | 0 items
    sfreq : 'float | 256.0 Hz
    acq_pars : 'NoneType
    acq_stim : 'NoneType
    ctf_head_t : 'NoneType
    description : 'NoneType
    dev_ctf_t : 'NoneType
    dig : 'NoneType
    experimenter : 'NoneType
    file_id : 'NoneType
    hpi_subsystem : 'NoneType
    kit_system_id : 'NoneType
    line_freq : 'NoneType
    meas_id : 'NoneType
    proj_id : 'NoneType
    pro

In [10]:
# Example of one way to start fixing up the channels
#  Priority again is to keep everything consistent with MNE, so first check for MNE functions that can do 
#   things like this, and that can update this information inside the eeg_dat.info object.

for ch in eeg_dat.ch_names:
    # Labels look like 'EEG FP1-REF', so test the prefix by value (not identity with `is`)
    if ch.startswith('EEG'):
        # Keep track that this is an EEG chan
        pass
    elif ch.startswith('EMG'):
        # Keep track that this is an EMG chan
        pass
    
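
One possible way to take this further is sketched below: build two dictionaries from the raw labels, one mapping old names to cleaned names and one mapping cleaned names to channel types. It assumes labels follow the `'<TYPE> <NAME>-REF'` pattern seen in `ch_names` above; the helper name `build_channel_maps` and the set of recognized prefixes are assumptions, and labels that do not fit the pattern are left untouched.

```python
# Prefixes we expect to see, mapped to MNE channel type strings (assumed set)
KNOWN_TYPES = {'EEG': 'eeg', 'EMG': 'emg', 'ECG': 'ecg', 'EOG': 'eog'}


def build_channel_maps(ch_names):
    """Build (rename, types) dicts from labels such as 'EEG FP1-REF'."""
    rename, types = {}, {}
    for label in ch_names:
        # Split 'EEG FP1-REF' into the type prefix 'EEG' and the rest 'FP1-REF'
        prefix, _, rest = label.partition(' ')
        ch_type = KNOWN_TYPES.get(prefix.upper())
        if ch_type is None or not rest:
            continue  # unrecognized label: leave it as it is

        # Drop the reference suffix: 'FP1-REF' -> 'FP1'
        new_name = rest.split('-')[0]
        rename[label] = new_name
        types[new_name] = ch_type
    return rename, types
```

The two dictionaries are shaped so that they can be passed to `eeg_dat.rename_channels(rename)` and then `eeg_dat.set_channel_types(types)`, which keeps the information inside the MNE object. If two labels reduce to the same cleaned name, the renaming would need extra handling.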