## puffMarker: Exploratory Data Analysis

- This notebook examines how well puffMarker agrees with self-report and random EMA responses
- For every self-report and random EMA with response 'Yes', check
    + What fraction have __NO__ puffMarker triggered within (10, 15, 20, 25, 30) minutes on either side of the recalled smoking time?
    + This is a rough estimate of the false negative rate
    + For a 15-minute window, the fraction is 0.88
- For every random EMA with response 'No', check
    + What fraction have a puffMarker triggered in the (10, 15, 20, 25, 30) minutes before the EMA?
    + This is a rough estimate of the false positive rate
    + For a 15-minute window, the fraction is 0.017
- For every end-of-day EMA where the person reports 0 hours of smoking, check
    + What fraction of those days have a puffMarker triggered at any time in the day?
    + This is a second rough estimate of the false positive rate
    + The fraction is 0.302
- Note: this does not account for data quality
- For the deterministic rule, it is more important to compute $P(\text{event} \mid \text{puffMarker})$
    + We look at all pM events and ask: is there a random EMA or self-report 'Yes' event in the Delta minutes leading up to it?
    + This is a different conditional statement from the ones above!
    + Even for a 15-minute window, the fraction of pM events matched by a SR or random EMA 'Yes' is only 0.086
    + This suggests we should not rely on pM in the deterministic rule
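
The two conditional fractions above differ only in which set of times is treated as the anchor. The following is a minimal sketch on made-up toy data, not the notebook's own implementation; `fraction_with_match` and the toy timestamps are hypothetical names and values introduced for illustration.

```python
from datetime import datetime, timedelta

def fraction_with_match(anchors, candidates, delta_minutes, two_sided=True):
    """Fraction of `anchors` with at least one candidate time in the window.

    two_sided=True  : candidate within delta minutes on either side of the anchor.
    two_sided=False : candidate in the delta minutes before the anchor.
    """
    if not anchors:
        return float("nan")
    window = timedelta(minutes=delta_minutes)
    hits = 0
    for a in anchors:
        for c in candidates:
            diff = a - c
            if two_sided:
                ok = abs(diff) <= window
            else:
                ok = timedelta(0) <= diff <= window
            if ok:
                hits += 1
                break
    return hits / len(anchors)

# Toy data (made up): two reported smoking events, four puffMarker events.
base = datetime(2017, 8, 16, 12, 0)
events = [base, base + timedelta(hours=2)]
puffs = [base + timedelta(minutes=5), base + timedelta(hours=4),
         base + timedelta(hours=5), base + timedelta(hours=6)]

# Anchor on events: fraction of events with no puffMarker nearby.
miss_rate = 1 - fraction_with_match(events, puffs, 15)
# Anchor on puffMarker events: fraction with a reported event nearby.
precision = fraction_with_match(puffs, events, 15)
print(miss_rate, precision)  # 0.5 0.25
```

Swapping the anchor set changes the denominator, which is why the two quantities can differ widely on the same data.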

In [1]:
import pandas as pd
import numpy as np
import datetime
import os
os.getcwd()
dir = "../final-data"

In [2]:
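## Dictionaries for Self-report and Random EMA
# NOTE: the first two labels below do not match the keys of sr_dictionary
# ('15 to 30' vs '15 - 30', and a bare '5 to 15 minutes'). If either label
# appears in the data, the sr_dictionary lookup in cell [4] raises KeyError.
sr_accptresponse = ['Smoking Event(15 to 30 minutes)', '5 to 15 minutes', 'Smoking Event(less than 5 minutes ago)']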
sr_dictionary = {'Smoking Event(less than 5 minutes ago)': 2.5, 
                 'Smoking Event(15 - 30 minutes)': 17.5, 
                 'Smoking Event(5 - 15 minutes)': 10
                } 
random_accptresponse = ['1 - 19 Minutes', '20 - 39 Minutes', '40 - 59 Minutes', 
                    '60 - 79 Minutes', '80 - 100 Minutes']
random_dictionary = {'1 - 19 Minutes': 10, 
                     '20 - 39 Minutes': 30, 
                     '40 - 59 Minutes':50,
                     '60 - 79 Minutes':70, 
                     '80 - 100 Minutes':90 }
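
The dictionaries map each recall bin to its midpoint in minutes, which later cells subtract from the report time to estimate when smoking occurred. A minimal sketch of that step, assuming a small hypothetical copy of the mapping (`recall_midpoints`) and a hypothetical helper name (`estimated_event_time`):

```python
from datetime import datetime, timedelta

# Hypothetical subset of the midpoint mapping, for illustration only.
recall_midpoints = {'1 - 19 Minutes': 10, '20 - 39 Minutes': 30}

def estimated_event_time(report_time, response, midpoints):
    """Shift report_time back by the bin midpoint.

    Returns None when the response is not in the mapping, so the caller
    can drop responses that give no usable recall time.
    """
    minutes = midpoints.get(response)
    if minutes is None:
        return None
    return report_time - timedelta(minutes=minutes)

report = datetime(2017, 8, 16, 14, 0)
print(estimated_event_time(report, '20 - 39 Minutes', recall_midpoints))
# 2017-08-16 13:30:00
```

Using `dict.get` keeps the accepted-response list and the midpoint mapping in one place, so the two cannot drift apart.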

In [3]:
# read data
selfreport = pd.read_csv(os.path.join(os.path.realpath(dir), 'self-report-smoking-final.csv'))
random_ema = pd.read_csv(os.path.join(os.path.realpath(dir), 'random-ema-final.csv'))
puffMarker = pd.read_csv(os.path.join(os.path.realpath(dir), 'puff-episode-final.csv'))

print(selfreport.columns)
print(random_ema.columns)
print(puffMarker.columns)

Index(['message', 'participant_id', 'timestamp', 'date', 'hour', 'minute',
       'day_of_week'],
      dtype='object')
Index(['status', 'smoke', 'when_smoke', 'eat', 'when_eat', 'drink',
       'when_drink', 'urge', 'cheerful', 'happy', 'angry', 'stress', 'sad',
       'see_or_smell', 'access', 'smoking_location', 'participant_id',
       'timestamp', 'date', 'hour', 'minute', 'day_of_week'],
      dtype='object')
Index(['timestamp', 'event', 'participant_id', 'date', 'hour', 'minute',
       'day_of_week'],
      dtype='object')


In [4]:
# Collect estimated smoking times from self-reports between 8AM and 8PM.
# Drop responses that are not in sr_accptresponse (e.g. nan or
# 'More than 30 minutes'), since no meaningful event time can be
# estimated for them.
days_smoked = {}
for index, row in selfreport.iterrows():
    try:
        time = datetime.datetime.strptime(row['date'], '%m/%d/%y %H:%M')
    except ValueError:
        time = datetime.datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S')
    if row['message'] in sr_accptresponse:
        time = time - datetime.timedelta(minutes=sr_dictionary[row['message']])
    date = (time.year, time.month, time.day, time.hour, time.minute)
    if row['participant_id'] not in days_smoked:
        days_smoked[row['participant_id']] = set()
    if 8 <= date[3] < 20 and row['message'] in sr_accptresponse:        
        days_smoked[row['participant_id']].add(time)

# Add all Random EMA times between 8AM and 8PM
for index, row in random_ema.iterrows():
    try:
        time = datetime.datetime.strptime(row['date'], '%m/%d/%y %H:%M')
    except ValueError:
        time = datetime.datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S')
    if row['when_smoke'] in random_accptresponse:
        time = time - datetime.timedelta(minutes=random_dictionary[row['when_smoke']])
    date = (time.year, time.month, time.day, time.hour,time.minute)
    if row['participant_id'] not in days_smoked:
        days_smoked[row['participant_id']] = set()
    if 8 <= date[3] < 20 and row['when_smoke'] in random_accptresponse:        
        days_smoked[row['participant_id']].add(time)
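
The same two-format date parsing is repeated in several cells. One way to factor it out is sketched below; `parse_timestamp` and `TIMESTAMP_FORMATS` are hypothetical names, and the sketch assumes the only formats present are the two used in the cells above.

```python
from datetime import datetime

# The two formats tried by the cells above, in the same order.
TIMESTAMP_FORMATS = ('%m/%d/%y %H:%M', '%Y-%m-%d %H:%M:%S')

def parse_timestamp(text, formats=TIMESTAMP_FORMATS):
    """Try each format in turn and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('Unrecognized timestamp format: %r' % text)

print(parse_timestamp('8/16/17 14:05'))        # 2017-08-16 14:05:00
print(parse_timestamp('2017-08-16 14:05:00'))  # 2017-08-16 14:05:00
```

Catching only `ValueError` means an unexpected input type (for example a missing value) still surfaces as an error instead of being silently retried.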

In [5]:
# Make a list of all puffMarker times between 8AM and 8PM
pM_days_smoked = {}
for index, row in puffMarker.iterrows():
    try:
        time = datetime.datetime.strptime(row['date'], '%m/%d/%y %H:%M')
    except ValueError:
        time = datetime.datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S')
    date = (time.year, time.month, time.day, time.hour, time.minute)
    if row['participant_id'] not in pM_days_smoked:
        pM_days_smoked[row['participant_id']] = set()
    if 8 <= date[3] < 20:        
        pM_days_smoked[row['participant_id']].add(time)



In [6]:
'''
Adjust so if ID is not in one dictionary, then it is added!
'''
for id in days_smoked.keys():
    if id not in pM_days_smoked:
        pM_days_smoked[id] = set()

for id in pM_days_smoked.keys():
    if id not in days_smoked:
        days_smoked[id] = set()


In [7]:
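'''
For each self-report or random EMA 'Yes' event, check whether a
puffMarker fired within delta minutes on either side of the estimated
event time. The function prints 1 minus the matched fraction, i.e. the
fraction of events with NO matching puffMarker (the printed label
still reads 'Fraction agreement').
'''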

def falsenegative(delta):
    matching_counts = []
    for id in set(days_smoked.keys()):
        ema_temp = days_smoked[id]
        pM_temp = pM_days_smoked[id]
        total_count_id = 0
        delta_count_id = 0
        for time in ema_temp:
            total_count_id+=1
            match = 0
            for pM_time in pM_temp:
                time_diff = abs((time - pM_time).total_seconds() / 60.0)
                if time_diff <= delta:
                    match = 1
            if match == 1:
                delta_count_id+=1
        if total_count_id > 0:
            matching_counts.append(np.array([total_count_id, delta_count_id], dtype='f'))

    matching_counts = np.asarray(matching_counts)

    # matching_counts = np.delete(matching_counts, (np.where(matching_counts[:,0] == 0)[0][0]), axis=0)

    fraction_per_delta = np.divide(matching_counts[:,1],matching_counts[:,0])

    aggregate_matching_counts = np.sum(matching_counts, axis=0)

    aggregate_frac_delta = aggregate_matching_counts[1]/aggregate_matching_counts[0]

    print('In window of length: %s' % delta)
    print('Aggregated data, Fraction agreement: %s' % (np.round(1-aggregate_frac_delta,3)))
    print('Mean of Fraction agreement across individuals: %s' % (np.round(np.mean(1-fraction_per_delta),3)))
    print('Standard deviation of Fraction agreement across individuals: %s' %  (np.round(np.std(1-fraction_per_delta),3)))

In [8]:
falsenegative(10)

falsenegative(15)

falsenegative(20)

falsenegative(25)

falsenegative(30)

In window of length: 10
Aggregated data, Fraction agreement: 0.897
Mean of Fraction agreement across individuals: 0.875
Standard deviation of Fraction agreement across individuals: 0.158
In window of length: 15
Aggregated data, Fraction agreement: 0.88
Mean of Fraction agreement across individuals: 0.857
Standard deviation of Fraction agreement across individuals: 0.172
In window of length: 20
Aggregated data, Fraction agreement: 0.871
Mean of Fraction agreement across individuals: 0.848
Standard deviation of Fraction agreement across individuals: 0.181
In window of length: 25
Aggregated data, Fraction agreement: 0.864
Mean of Fraction agreement across individuals: 0.842
Standard deviation of Fraction agreement across individuals: 0.186
In window of length: 30
Aggregated data, Fraction agreement: 0.855
Mean of Fraction agreement across individuals: 0.83
Standard deviation of Fraction agreement across individuals: 0.198
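
The matching step scans every puffMarker time for every event. If the puffMarker times are sorted once, the same two-sided window check can be done with a binary search. This is a sketch of that alternative, not the code that produced the numbers above; `has_match_within` is a hypothetical helper.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def has_match_within(sorted_times, t, delta_minutes):
    """True if any element of sorted_times lies within delta_minutes of t.

    sorted_times must be sorted ascending. The first element at or after
    t - delta is a match exactly when it is also at or before t + delta.
    """
    window = timedelta(minutes=delta_minutes)
    i = bisect_left(sorted_times, t - window)
    return i < len(sorted_times) and sorted_times[i] <= t + window

# Toy data (made up).
pM_sorted = sorted([datetime(2017, 8, 16, 9, 0), datetime(2017, 8, 16, 12, 10)])
print(has_match_within(pM_sorted, datetime(2017, 8, 16, 12, 0), 15))  # True
print(has_match_within(pM_sorted, datetime(2017, 8, 16, 15, 0), 15))  # False
```

For the data sizes in this notebook the nested loop is likely fast enough; the sorted version mainly helps if the window sweep is extended to many values of delta.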


In [9]:
# All Random EMA times where the response was 'No'
no_smoked = {}
for index, row in random_ema.iterrows():
    try:
        time = datetime.datetime.strptime(row['date'], '%m/%d/%y %H:%M')
    except ValueError:
        time = datetime.datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S')
    if row['smoke'] == 'No':
        if row['participant_id'] not in no_smoked:
            no_smoked[row['participant_id']] = set()
        if 8 <= time.hour < 20:        
            no_smoked[row['participant_id']].add(time)

In [10]:
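def falsepositive(delta):
    # Fraction of random EMA 'No' responses that have a puffMarker in the
    # delta minutes before the EMA (a rough false positive rate).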
    matching_counts = []
    for id in set(no_smoked.keys()):
        ema_temp = no_smoked[id]
        pM_temp = pM_days_smoked[id]
        total_count_id = 0
        delta_count_id = 0
        for time in ema_temp:
            total_count_id+=1
            match = 0
            for pM_time in pM_temp:
                time_diff = (time - pM_time).total_seconds() / 60.0
                if 0 <= time_diff <= delta:
                    match = 1
            if match == 1:
                delta_count_id+=1
        if total_count_id > 0:
            matching_counts.append(np.array([total_count_id, delta_count_id], dtype='f'))

    matching_counts = np.asarray(matching_counts)

    # matching_counts = np.delete(matching_counts, (np.where(matching_counts[:,0] == 0)[0][0]), axis=0)

    fraction_per_delta = np.divide(matching_counts[:,1],matching_counts[:,0])

    aggregate_matching_counts = np.sum(matching_counts, axis=0)

    aggregate_frac_delta = aggregate_matching_counts[1]/aggregate_matching_counts[0]

    print('In window of length: %s' % delta)
    print('Aggregated data, Fraction agreement: %s' % (np.round(aggregate_frac_delta,3)))
    print('Mean of Fraction agreement across individuals: %s' % (np.round(np.mean(fraction_per_delta),3)))
    print('Standard deviation of Fraction agreement across individuals: %s' %  (np.round(np.std(fraction_per_delta),3)))

In [11]:
falsepositive(10)

falsepositive(15)

falsepositive(20)

falsepositive(25)

falsepositive(30)

falsepositive(60)

In window of length: 10
Aggregated data, Fraction agreement: 0.009
Mean of Fraction agreement across individuals: 0.008
Standard deviation of Fraction agreement across individuals: 0.033
In window of length: 15
Aggregated data, Fraction agreement: 0.017
Mean of Fraction agreement across individuals: 0.016
Standard deviation of Fraction agreement across individuals: 0.048
In window of length: 20
Aggregated data, Fraction agreement: 0.026
Mean of Fraction agreement across individuals: 0.028
Standard deviation of Fraction agreement across individuals: 0.076
In window of length: 25
Aggregated data, Fraction agreement: 0.029
Mean of Fraction agreement across individuals: 0.031
Standard deviation of Fraction agreement across individuals: 0.078
In window of length: 30
Aggregated data, Fraction agreement: 0.033
Mean of Fraction agreement across individuals: 0.033
Standard deviation of Fraction agreement across individuals: 0.08
In window of length: 60
Aggregated data, Fraction agreement: 0.066

In [12]:
'''
Alternative for false positive rate. 
Take all EOD where no smoking on a day was reported.
How many of such days had a puffMarker go off?
'''

eod_ema = pd.read_csv(os.path.join(os.path.realpath(dir), 'eod-ema-final.csv'))
keys = ['8to9', '9to10', '10to11', '11to12','12to13','13to14','14to15','15to16','16to17','17to18','18to19','19to20']

# List of all dates with all 0s
eod_dates = []
for irow in range(0,eod_ema.shape[0]):
    row = eod_ema.iloc[irow]
    if row['status'] == "MISSED":
        continue
    if np.count_nonzero(row[keys]) == 0:
        try:
            time = datetime.datetime.strptime(row['date'], '%m/%d/%Y %H:%M')
        except ValueError:
            time = datetime.datetime.strptime(row['date'], '%Y-%m-%d %H:%M:%S')
        # EOD EMAs completed shortly after midnight describe the previous day.
        # Subtracting a timedelta rolls back correctly across month boundaries
        # (time.day - 1 would produce day 0 on the first of a month).
        if time.hour == 0 or time.hour == 1:
            time = time - datetime.timedelta(days=1)
        date = np.array([row['participant_id'], time.year, time.month, time.day])
        date = np.append(date, np.array(row[keys]))
        eod_dates.append(date)
    
eod_dates = np.asarray(eod_dates)
eod_dates[0,:]

array([203, 2017, 8, 16, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0], dtype=object)

In [13]:
all_day = 0
bad_day = 0 
for row in range(0,np.shape(eod_dates)[0]):
    all_day += 1
    is_bad_day = 0
    temp = eod_dates[row,:]
    pM_temp = pM_days_smoked[temp[0]]
    for times in pM_temp:
        hour_check = (8 <= times.hour < 20)  # same 8AM to 8PM window used when building pM_days_smoked
        year_check = (times.year == temp[1])
        month_check = (times.month == temp[2])
        day_check = (times.day == temp[3])
        if all((year_check, month_check, day_check, hour_check)):
            is_bad_day = 1
    if is_bad_day == 1:
        bad_day += 1

falsepositive_rate = bad_day/all_day
falsepositive_rate

0.30177514792899407
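
The day-level comparison can also be expressed with sets of (participant, date) pairs. The sketch below uses hypothetical helpers (`reported_day`, `fraction_days_with_puff`) and made-up data. Unlike the loop above, it counts each participant-day once even if several EOD rows exist for it, so it is an illustration of the idea and is not guaranteed to reproduce 0.302.

```python
from datetime import datetime, timedelta

def reported_day(completion_time):
    """Calendar day an end-of-day EMA describes.

    Assumes, as the cell above does, that EMAs completed in hours 0 or 1
    refer to the previous day.
    """
    if completion_time.hour in (0, 1):
        completion_time = completion_time - timedelta(days=1)
    return completion_time.date()

def fraction_days_with_puff(zero_days, puff_times_by_id):
    """Fraction of zero-smoking (participant_id, date) pairs with any puffMarker."""
    puff_days = {(pid, t.date())
                 for pid, times in puff_times_by_id.items()
                 for t in times}
    zero_days = set(zero_days)
    if not zero_days:
        return float('nan')
    return sum(day in puff_days for day in zero_days) / len(zero_days)

# Toy data (made up): one EOD EMA completed at 00:30 on the first of a month.
zero_days = [(203, reported_day(datetime(2017, 9, 1, 0, 30))),
             (203, datetime(2017, 8, 16).date())]
puffs = {203: [datetime(2017, 8, 31, 10, 0)]}
frac = fraction_days_with_puff(zero_days, puffs)
print(frac)  # 0.5
```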

In [21]:
'''
Cycle through pM times.
For each one, check whether a random EMA or self-report 'Yes' event
falls in the delta minutes before it.
'''

def prob_event_given_pM(delta):

    matching_counts = []
    for id in set(pM_days_smoked.keys()):
        ema_temp = days_smoked[id]
        pM_temp = pM_days_smoked[id]
        total_count_id = 0
        delta_count_id = 0
        for time in pM_temp:
            total_count_id+=1
            match = 0
            for ema_time in ema_temp:
                time_diff = (time - ema_time).total_seconds() / 60.0
                if 0 <= time_diff <= delta:
                    match = 1
            if match == 1:
                delta_count_id+=1
        if total_count_id > 0:
            matching_counts.append(np.array([total_count_id, delta_count_id], dtype='f'))

    matching_counts = np.asarray(matching_counts)

    fraction_per_delta = np.divide(matching_counts[:,1],matching_counts[:,0])

    aggregate_matching_counts = np.sum(matching_counts, axis=0)

    aggregate_frac_delta = aggregate_matching_counts[1]/aggregate_matching_counts[0]

    print('In window of length: %s' % delta)
    print('Aggregated data, Fraction agreement: %s' % (np.round(aggregate_frac_delta,3)))
    print('Mean of Fraction agreement across individuals: %s' % (np.round(np.mean(fraction_per_delta),3)))
    print('Standard deviation of Fraction agreement across individuals: %s' %  (np.round(np.std(fraction_per_delta),3)))

In [22]:
prob_event_given_pM(10)

prob_event_given_pM(15)

prob_event_given_pM(20)

prob_event_given_pM(25)

prob_event_given_pM(30)

prob_event_given_pM(60)

In window of length: 10
Aggregated data, Fraction agreement: 0.067
Mean of Fraction agreement across individuals: 0.115
Standard deviation of Fraction agreement across individuals: 0.235
In window of length: 15
Aggregated data, Fraction agreement: 0.086
Mean of Fraction agreement across individuals: 0.131
Standard deviation of Fraction agreement across individuals: 0.239
In window of length: 20
Aggregated data, Fraction agreement: 0.097
Mean of Fraction agreement across individuals: 0.142
Standard deviation of Fraction agreement across individuals: 0.245
In window of length: 25
Aggregated data, Fraction agreement: 0.103
Mean of Fraction agreement across individuals: 0.148
Standard deviation of Fraction agreement across individuals: 0.247
In window of length: 30
Aggregated data, Fraction agreement: 0.109
Mean of Fraction agreement across individuals: 0.152
Standard deviation of Fraction agreement across individuals: 0.252
In window of length: 60
Aggregated data, Fraction agreement: 0.19
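
The aggregated fraction and the mean across individuals differ in the output above (for example 0.086 vs 0.131 at 15 minutes) because they weight participants differently. A small sketch with made-up counts shows the effect; the numbers in `counts` are hypothetical and are not taken from the data.

```python
import numpy as np

# Hypothetical per-participant counts:
# column 0 = total puffMarker events, column 1 = events with a match.
counts = np.array([[100, 5],
                   [10, 4],
                   [2, 1]], dtype=float)

# Pooled fraction: participants with many events dominate.
pooled = counts[:, 1].sum() / counts[:, 0].sum()

# Mean of per-participant fractions: every participant counts equally.
per_person = counts[:, 1] / counts[:, 0]
mean_of_fractions = per_person.mean()

print(round(pooled, 3), round(mean_of_fractions, 3))  # 0.089 0.317
```

When a few participants contribute most puffMarker events, the pooled figure mostly reflects those participants, which is worth keeping in mind when reading the two summary lines side by side.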