# Libraries:
`sequences.py` defines `Measurement` and `MeasurementSequence` classes.

#### Measurement
`Measurement` contains information regarding one patient visit: `study_date`, `oct_path`, `cur_va`. Optionally, clinical features (from OCT segmentation) may be added (`features`).

Event information can also be added to a visit (`injections`, `injection_dates` and `lens_surgery`). These events do not necessarily happen on the same day as the patient visit (whose `study_date` is the date of the OCT); they usually happen before or after it, and you can choose which measurement an event is added to.

In addition, some information about the next visit can be added (`delta_t`, `next_va`).
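
To make the field layout concrete, here is a minimal sketch of what a `Measurement`-like record could look like. It is not the class from `sequences.py`: the name `MeasurementSketch`, the field types and the defaults are assumptions derived only from the attribute names listed above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class MeasurementSketch:
    """Hypothetical stand-in for `sequences.Measurement` (types are assumed)."""
    # core information about one visit
    study_date: date
    oct_path: Optional[str] = None
    cur_va: Optional[float] = None
    # optional clinical features from OCT segmentation
    features: Optional[dict] = None
    # events attached to this visit
    injections: List[int] = field(default_factory=list)
    injection_dates: List[date] = field(default_factory=list)
    lens_surgery: bool = False
    # information about the next visit
    delta_t: Optional[int] = None
    next_va: Optional[float] = None


first = MeasurementSketch(study_date=date(2017, 10, 24), cur_va=0.40)
second = MeasurementSketch(study_date=date(2018, 1, 8), cur_va=0.20)
# fill in the "next visit" fields of the first measurement
first.delta_t = (second.study_date - first.study_date).days
first.next_va = second.cur_va
print(first.delta_t, first.next_va)
```

In this sketch `delta_t` is the number of days until the next visit and `next_va` is the acuity measured at that visit.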

#### MeasurementSequence
`MeasurementSequence` contains the `patient_id`, `diagnosis` (currently DR or AMD), `laterality` and a list of `Measurement`s.

This class contains methods that create a `MeasurementSequence` from a pandas group, and methods that subset a sequence to a desired length or remove single measurements. There are also some convenience functions (e.g. `has_checkup`) that allow you to filter out specific measurements.

I save the MeasurementSequences as dictionaries in pickle files; I found that to be a consistent and safe way to store Python objects.
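
As a rough illustration of this storage approach, the following sketch round-trips a list of plain dictionaries through a pickle file. The dictionary keys, the example values and the helper names `save_dicts_to_pickle` and `load_dicts_from_pickle` are made up for this example; the actual helpers live in `sequences.py` and may work differently.

```python
import os
import pickle
import tempfile
from datetime import date

# hypothetical dictionary representation of one sequence (keys and values are made up)
seq_dict = {
    "patient_id": 1,
    "laterality": "R",
    "diagnosis": "AMD",
    "measurements": [
        {"study_date": date(2017, 10, 24), "cur_va": 0.40},
        {"study_date": date(2018, 1, 8), "cur_va": 0.20},
    ],
}


def save_dicts_to_pickle(path, seq_dicts):
    """Write a list of plain dictionaries to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(seq_dicts, f)


def load_dicts_from_pickle(path):
    """Read the list of dictionaries back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sequences_example.pickle")
    save_dicts_to_pickle(path, [seq_dict])
    restored = load_dicts_from_pickle(path)

print(restored == [seq_dict])
```

One advantage of pickling plain dictionaries rather than class instances is that the file can be loaded without importing the class definition.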

In [1]:
import sys
sys.path.append('../')
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from tqdm.notebook import tqdm  # tqdm.tqdm_notebook is deprecated
import sequences # <- this contains the custom code

# Generate `MeasurementSequence` from pandas tables

In [2]:
workspace_dir = '../../workspace' #'/storage/groups/ml01/workspace/hannah.spitzer/LODE'
# longitudinal data is a merged table from all oct measurements and the cleaned diagnosis table
longitudinal_data = pd.read_csv(os.path.join(workspace_dir, 'longitudinal_data.csv'), index_col=0)

# filter out measurements for which features could not be calculated (from the table that Olle sent me)
octs_to_remove = pd.read_csv('../../data/non_segmented_octs.csv', index_col=0)
longitudinal_data = longitudinal_data[~longitudinal_data.oct_path.isin(octs_to_remove['0'])]

# events is a table containing injections and lens surgery events for each patient
events = pd.read_csv(os.path.join(workspace_dir, 'longitudinal_events.csv'), index_col=0)
events = events.sort_values('study_date')
events.loc[:,'visus?'] = False
events.loc[:,'oct?'] = False

In [5]:
display(longitudinal_data.head())
display(events.head())

Unnamed: 0,patient_id,laterality,study_date,oct_path,fundus_path,thickness_path,visual_acuity,logMAR,oct?,visus?,thickness?,fundus?,diagnosis_raw,diagnosis
0,34537,R,2014-12-16,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/projects/2018161...,0.2,0.69897,True,True,True,True,Irvine-Gass-Syndrom,
1,34537,R,2016-01-26,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/projects/2018161...,0.25,0.60206,True,True,True,True,Irvine-Gass-Syndrom,
2,34537,R,2017-12-13,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/projects/2018161...,0.25,0.60206,True,True,True,True,Irvine-Gass-Syndrom,
3,34537,R,2014-09-16,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/projects/2018161...,0.32,0.49485,True,True,True,True,Irvine-Gass-Syndrom,
4,34537,R,2013-11-05,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/raw/2018_LMUAuge...,/storage/groups/ml01/datasets/projects/2018161...,,,True,False,True,True,Irvine-Gass-Syndrom,


Unnamed: 0,patient_id,laterality,study_date,MED,injection?,iol?,visus?,oct?
3762,115790,L,2003-01-21,,,True,False,False
4281,77064,R,2003-07-12,,,True,False,False
4372,136223,R,2004-02-24,,,True,False,False
4221,49544,R,2005-07-21,,,True,False,False
3806,159760,L,2005-08-09,,,True,False,False


In [6]:
# get grouped patients (sorted by date)
# keep NAN octs and logMARs (can still build sequence from them)
# remove patients without cleaned diagnosis label (currently only AMD and DR have diagnosis label)
filtered = longitudinal_data.dropna(subset=['diagnosis'])  
all_patients = filtered.sort_values('study_date')
# drop all groups that do not have at least one OCT and one logMAR
grouped = all_patients.groupby(['patient_id', 'laterality'])
all_patients = grouped.filter(lambda x: x.oct_path.count()>0 and x.logMAR.count() > 0)

grouped_patients = all_patients.groupby(['patient_id', 'laterality'])
grouped_events = events.groupby(['patient_id', 'laterality'])

## Calculate measurement sequences from pandas tables
In `seq.add_events_from_pandas(group_events, how='next')` I chose to add each event to the next available measurement. For example, if a lens surgery takes place a week before the next OCT, the event is added to the measurement taken a week after the surgery. This makes sense for my use case, because I want to model events that happen even before we have any OCT measurement.
For the statistics you may need to do it differently (`how='previous'`).
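
The difference between the two options can be illustrated with `pd.merge_asof`. This is only a sketch of the idea on made-up dates, not the implementation of `add_events_from_pandas`; the helper `assign_events` and the columns `visit_id`, `visit_date` and `event` are hypothetical.

```python
import pandas as pd

# made-up visits and events for a single eye
example_visits = pd.DataFrame({
    "visit_date": pd.to_datetime(["2018-01-10", "2018-04-12", "2018-07-15"]),
    "visit_id": [0, 1, 2],
})
example_events = pd.DataFrame({
    "study_date": pd.to_datetime(["2018-01-03", "2018-04-05", "2018-05-01"]),
    "event": ["lens_surgery", "injection", "injection"],
})


def assign_events(events, visits, how="next"):
    """Attach each event to the next or the previous visit by date."""
    direction = {"next": "forward", "previous": "backward"}[how]
    # merge_asof requires both tables to be sorted by their key column
    return pd.merge_asof(
        events.sort_values("study_date"),
        visits.sort_values("visit_date"),
        left_on="study_date",
        right_on="visit_date",
        direction=direction,
    )


to_next = assign_events(example_events, example_visits, how="next")
to_previous = assign_events(example_events, example_visits, how="previous")
print(to_next[["study_date", "event", "visit_id"]])
print(to_previous[["study_date", "event", "visit_id"]])
```

In this toy example the lens surgery on 2018-01-03 has no earlier visit, so with `how="previous"` it is left unassigned (`NaN`), whereas with `how="next"` it is attached to the first visit.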

In [7]:
# create sequences with events added to the next measurement (mmt)
seqs = []
for name, group in tqdm(grouped_patients):
    # get events for this group (stays None if no events are recorded for it)
    try:
        group_events = grouped_events.get_group(name)
    except KeyError:
        group_events = None

    seq = sequences.MeasurementSequence.from_pandas(group)
    seq.add_events_from_pandas(group_events, how='next')  # IMPORTANT: ADD EVENTS TO NEXT MEASUREMENT
    seqs.append(seq)

## Subset measurement sequences to valid sequences with 3 month / 12 month checkups
These sequences are structured like this:
`mmt1`, `mmt2`, `mmt3`, ..., `checkup1`, `checkup2` (optional)

`checkup1` is a measurement that was taken at the required interval (e.g. 3 months, within a tolerance) after the last regular measurement.

I keep all the measurements before that, because I am training an LSTM to predict 3/12 months into the future, and it can use the previous measurements as well. For the statistics you probably only want to use the last measurement before the checkup and the checkup measurement itself.
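
The checkup criterion can be sketched as a search for a visit that lies close to the target interval. The helper `find_checkup` below is hypothetical and only mirrors the `checkup_time` and `max_deviation` parameters of `has_checkup`; whether the real method picks the closest or the first matching visit, and whether the tolerance is inclusive, are assumptions.

```python
from datetime import date


def find_checkup(study_dates, start_id, checkup_time=90, max_deviation=20):
    """Return the index of the visit closest to `checkup_time` days after
    `study_dates[start_id]`, or None if no visit lies within `max_deviation` days.

    Hypothetical helper; `study_dates` must be sorted in ascending order.
    """
    best_id, best_dev = None, None
    for idx in range(start_id + 1, len(study_dates)):
        days = (study_dates[idx] - study_dates[start_id]).days
        dev = abs(days - checkup_time)
        if dev <= max_deviation and (best_dev is None or dev < best_dev):
            best_id, best_dev = idx, dev
    return best_id


dates = [date(2017, 10, 24), date(2018, 1, 8), date(2018, 10, 5)]
checkup_3 = find_checkup(dates, 0, checkup_time=90, max_deviation=20)
checkup_12 = find_checkup(dates, 0, checkup_time=360, max_deviation=30)
no_checkup = find_checkup(dates, 1, checkup_time=90, max_deviation=20)
print(checkup_3, checkup_12, no_checkup)
```

With these three dates, the second visit (76 days later) qualifies as a 3 month checkup of the first and the third visit (346 days later) as its 12 month checkup, while the second visit has no 3 month checkup of its own.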

In [8]:
# parameters for sequence generation
# should each measurement in the sequence have an OCT and a VA?
req_sequence_oct = True
req_sequence_va = True # could require VA only for the initial mmt if more measurements are needed
# does the checkup measurement need to have an OCT and a VA?
# Not for my use case, but for statistics you may need to set req_checkup_oct to True
req_checkup_oct = False
req_checkup_va = True

In [9]:
# create sequences with 3 month / 12 month checkup
sequences_checkup_3 = []
sequences_checkup_3_12 = []
for seq in tqdm(seqs):
    # get seq_ids - all mmts fulfilling the criterion
    seq_ids = []
    for seq_id in range(len(seq)):
        if seq.measurements[seq_id].is_valid(req_oct=req_sequence_oct, req_va=req_sequence_va):
            seq_ids.append(seq_id)
    # iterate over all possible end_ids - mmt with has_checkup()
    for i,end_id in enumerate(seq_ids):
        checkup_3_id = seq.has_checkup(end_id, checkup_time=90, max_deviation=20,
                                       req_oct=req_checkup_oct, req_va=req_checkup_va)
        checkup_12_id = seq.has_checkup(end_id, checkup_time=360, max_deviation=30, 
                                    req_oct=req_checkup_oct, req_va=req_checkup_va)
        if checkup_3_id:
            #print(seq_ids[0:i+1]+[checkup_3_id])
            # is valid end_id for 3
            # get new subsetted sequence
            seq_sub = seq.subset(seq_ids[0:i+1]+[checkup_3_id])
            sequences_checkup_3.append(seq_sub)
            if checkup_12_id:
                # is valid end_id for 3-12
                # get new subsetted sequence
                seq_sub = seq.subset(seq_ids[0:i+1]+[checkup_3_id,checkup_12_id])
                sequences_checkup_3_12.append(seq_sub)
            

In [10]:
# save sequences to file to avoid recomputing
sequences.save_sequences_to_pickle(os.path.join(workspace_dir, 'sequences_3.pickle'), sequences_checkup_3)
sequences.save_sequences_to_pickle(os.path.join(workspace_dir, 'sequences_3-12.pickle'), sequences_checkup_3_12)

# Examples on how to calculate statistics

In [12]:
# load sequences
sequences_checkup_3_12 = sequences.load_sequences_from_pickle(os.path.join(workspace_dir, 'sequences_3-12.pickle'))


In [26]:
print('{} measurement sequences computed'.format(len(sequences_checkup_3_12)))
# MeasurementSequences have a print function
print(sequences_checkup_3_12[0])
# Measurements as well
print(sequences_checkup_3_12[0].measurements[0])
# the OCT path is saved in the Measurement
print(sequences_checkup_3_12[0].measurements[0].oct_path)
# patient id etc. are saved in the MeasurementSequence object
print(sequences_checkup_3_12[0].patient_id)

6284 measurement sequences computed
MeasurementSequence 18,R (3): [
Measurement 2017-10-24: oct True , cur_va 0.40, delta_t 0076, next_va 0.20, 0 injections, lens_surgery False
Measurement 2018-01-08: oct True , cur_va 0.20, delta_t 0270, next_va 0.30, 0 injections, lens_surgery False
Measurement 2018-10-05: oct True , cur_va 0.30, delta_t None, next_va None, 0 injections, lens_surgery False
]
Measurement 2017-10-24: oct True , cur_va 0.40, delta_t 0076, next_va 0.20, 0 injections, lens_surgery False
/storage/groups/ml01/datasets/raw/2018_LMUAugenklinik_niklas.koehler/Studies/Optical Coherence Tomography Scanner/18/Right/20171024/1.3.6.1.4.1.33437.10.4.13118731.13153317085.20917.4.1.dcm
18


In [16]:
# some functions that I previously used to calculate statistics
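def calculate_statistics(seqs, num_checkups=1):
    # statistics for the measurements before the checkup(s);
    # the last num_checkups entries of each sequence are the checkups
    stats = {
        'len': np.array([len(seq)-num_checkups for seq in seqs]),
        # days from the first to the last regular measurement; the last regular
        # measurement is excluded because its delta_t points to the checkup
        'dt': np.array([sum([mmt.delta_t for mmt in seq.measurements[:-num_checkups-1]]) for seq in seqs]),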
        'num_inj': np.array([sum([sum(mmt.injections) for mmt in seq.measurements[:-num_checkups]]) for seq in seqs]),
        'va_mean': np.array([np.mean([mmt.cur_va for mmt in seq.measurements[:-num_checkups]]) for seq in seqs]),
        'va_std': np.array([np.std([mmt.cur_va for mmt in seq.measurements[:-num_checkups]]) for seq in seqs]),
        'ls': np.array([np.any([mmt.lens_surgery for mmt in seq.measurements[:-num_checkups]]) for seq in seqs])
    }
    return stats

def calculate_checkup_statistics(seqs, num_checkups=1, checkup_names=('checkup3',)):
    # statistics for the time between the last regular measurement and each checkup
    mmt_id = -num_checkups-1  # index of the last regular measurement
    stats = {}
    for i,name in zip(range(num_checkups), checkup_names):
        checkup_id = -(num_checkups-i)
        print(mmt_id, checkup_id, name)
        res = {
            'dt': np.array([(seq.measurements[checkup_id].study_date - seq.measurements[mmt_id].study_date).days for seq in seqs]),
            'num_inj': np.array([sum(seq.measurements[checkup_id].injections) for seq in seqs]),
            'diff_va': np.array([seq.measurements[checkup_id].cur_va - seq.measurements[mmt_id].cur_va for seq in seqs]),
            'ls': np.array([seq.measurements[checkup_id].lens_surgery for seq in seqs])
        }
        stats[name] = res
    return stats

stats = calculate_statistics(sequences_checkup_3_12, num_checkups=2)
checkup_stats = calculate_checkup_statistics(sequences_checkup_3_12, num_checkups=2, checkup_names=['checkup3', 'checkup12'])

-3 -2 checkup3
-3 -1 checkup12


In [22]:
# stats for all sequences before checkups start
display(pd.DataFrame(stats))
# stats for difference between last measurement and 12 month checkup
display(pd.DataFrame(checkup_stats['checkup12']))

Unnamed: 0,len,dt,num_inj,va_mean,va_std,ls
0,1,0,0,0.397940,0.000000,False
1,3,196,0,0.100552,0.002575,True
2,4,226,0,0.250156,0.259132,True
3,5,259,0,0.240257,0.232619,True
4,6,304,0,0.200214,0.230456,True
...,...,...,...,...,...,...
6279,2,29,0,0.397940,0.000000,True
6280,1,0,0,0.795880,0.000000,False
6281,2,29,0,0.747425,0.048455,True
6282,1,0,0,1.000000,0.000000,False


Unnamed: 0,dt,num_inj,diff_va,ls
0,346,0,-0.096910,False
1,387,0,0.000000,False
2,357,0,-0.596597,False
3,366,0,-0.200659,False
4,363,0,0.000000,False
...,...,...,...,...
6279,351,0,-0.197281,False
6280,352,0,-0.397940,False
6281,351,0,-0.096910,False
6282,332,0,0.000000,False


In [24]:
# statistics for different subgroups (patients that improve / get worse)
measurement_error = 0.15

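def get_summary_stats(stats, mask=None, checkup_stats=None):
    # mask: boolean mask or index array selecting the sequences to summarize
    if mask is None:
        mask = np.ones(len(stats['len'])).astype(bool)
    if checkup_stats is None:
        checkup_stats = {}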
    data = {
        'num ts': len(stats['len'][mask]),
        'len mean': np.mean(stats['len'][mask]), 'len std': np.std(stats['len'][mask]),
        'dt mean': np.mean(stats['dt'][mask]), 'dt std': np.std(stats['dt'][mask]),
        'num_inj mean': np.mean(stats['num_inj'][mask]), 'num_inj std': np.std(stats['num_inj'][mask]),
        'va_mean mean': np.mean(stats['va_mean'][mask]), 'va_std mean': np.mean(stats['va_std'][mask]),
        'ls': np.mean(stats['ls'][mask])
    }
    for name, checkup in checkup_stats.items():
        data[name+' dt mean'] = np.mean(checkup['dt'][mask]) 
        data[name+' dt std'] = np.std(checkup['dt'][mask])
        data[name+' num_inj mean'] = np.mean(checkup['num_inj'][mask])
        data[name+' num_inj std'] = np.std(checkup['num_inj'][mask])
        data[name+' diff_va mean'] = np.mean(np.abs(checkup['diff_va'][mask]))
        data[name+' diff_va std'] = np.std(np.abs(checkup['diff_va'][mask]))
        data[name+' ls'] = np.mean(checkup['ls'][mask])
        
    return data


df = {}
# all data (mask=None selects every sequence)
df['all'] = pd.Series(get_summary_stats(stats, checkup_stats=checkup_stats))
# binned into no change, improvement and worsening
# (changes within +/- measurement_error count as no change)
for name, checkup in checkup_stats.items():
    diff_va = checkup['diff_va']
    mask_nochange = np.where((diff_va >= -measurement_error) & (diff_va <= measurement_error))[0]
    mask_impr = np.where(diff_va < -measurement_error)[0]
    mask_worse = np.where(diff_va > measurement_error)[0]
    df['no change after {}'.format(name)] = pd.Series(get_summary_stats(stats, mask=mask_nochange, checkup_stats=checkup_stats))
    df['improvement after {}'.format(name)] = pd.Series(get_summary_stats(stats, mask=mask_impr, checkup_stats=checkup_stats))
    df['worse after {}'.format(name)] = pd.Series(get_summary_stats(stats, mask=mask_worse, checkup_stats=checkup_stats))


In [25]:
pd.DataFrame(df)

statistic,all,no change after checkup3,improvement after checkup3,worse after checkup3,no change after checkup12,improvement after checkup12,worse after checkup12
num ts,6284.0,4691.0,791.0,802.0,4094.0,903.0,1287.0
len mean,12.494908,12.961629,10.328698,11.901496,12.890327,11.246955,12.112665
len std,8.819543,8.868218,8.265547,8.708011,8.723704,8.701518,9.102057
dt mean,654.650064,676.077169,542.451327,639.98005,679.019541,585.238095,625.831391
dt std,440.673968,435.566771,442.009925,451.541314,436.387557,457.423048,435.313934
num_inj mean,2.768619,2.875506,2.198483,2.705736,2.827308,2.272425,2.93007
num_inj std,3.676347,3.720154,3.399234,3.628055,3.672037,3.466176,3.802599
va_mean mean,0.508025,0.465418,0.676714,0.59086,0.458832,0.656774,0.560141
va_std mean,0.113385,0.104528,0.142562,0.136413,0.105482,0.139611,0.120121
ls,0.446531,0.44191,0.472819,0.447631,0.452858,0.442968,0.428904
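
The binning behind these columns can be sketched with plain boolean masks. The `diff_va` values below are made up, and they are assumed to be logMAR differences (checkup minus last measurement), where a lower value means better acuity; this is why a negative difference counts as an improvement.

```python
import numpy as np

measurement_error = 0.15  # same threshold as in the notebook
# made-up VA changes, assumed to be in logMAR
diff_va = np.array([-0.40, -0.10, 0.00, 0.15, 0.30, -0.15])

improved = diff_va < -measurement_error
worse = diff_va > measurement_error
# everything within +/- measurement_error (inclusive) counts as no change
no_change = ~(improved | worse)

counts = {
    "no change": int(no_change.sum()),
    "improvement": int(improved.sum()),
    "worse": int(worse.sum()),
}
print(counts)
```

Because the three masks are mutually exclusive and cover all sequences, the subgroup counts add up to the total number of sequences.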
