# **Feature Engineering**
***


In [24]:
#Config
import pandas as pd
import os
 
DF_PATH = './100D Data/Cleaned Dataframes'
%cd /content/drive/My\ Drive/

/content/drive/My Drive


In [25]:
#Loading all the dataframes
df_code = pd.read_csv(os.path.join(DF_PATH, 'df_code'))
df_exercise = pd.read_csv(os.path.join(DF_PATH, 'df_exercise'))
df_extra = pd.read_csv(os.path.join(DF_PATH, 'df_extra'))
df_smartwatch = pd.read_csv(os.path.join(DF_PATH, 'df_smartwatch'))
df_date = pd.read_csv(os.path.join(DF_PATH, 'df_date'))
df_code.head()


Unnamed: 0,time_span,working_on,stack,position,context,productivity,date
0,5:15 AM - 6:15 AM,Aws,Dev-Ops,Standing,Learn,8.5,23rd August
1,8:00 AM - 10:15 AM,100D,Data Science,Sitting,,7.5,23rd August
2,11:00 AM - 12:00 PM,Leetcode,,Standing,,8.0,23rd August
3,12:30 PM - 1:30 PM,Book Store,Back End,Sitting,Refactoring,7.0,23rd August
4,3:00 PM - 5:00 PM,100D,Data Science,Sitting,Write,7.5,23rd August


## Feature engineering on **df_code**

### **time_span**
* Breaking down the *time_span* column into *starting time* and *duration*
* *Classifying time* into early morning, late morning and evening



In [26]:
#UTILS - some functions to help later
from datetime import datetime

def get_start_N_end(time_range):
    #Split 'start - end' into (start, end); returns None when there is no '-'
    time_range = str(time_range)
    if '-' in time_range:
        time_start, time_end = time_range.split('-')
        if not time_end:
            time_end = time_start
        return time_start.strip(), time_end.strip()
    return None

def duration(start_time, end_time):
    #Hours from start_time to end_time, both formatted like '6:30 AM'
    FMT = '%I:%M %p'
    if not start_time or not end_time:
        return None
    delta = datetime.strptime(end_time, FMT) - datetime.strptime(start_time, FMT)
    #.seconds is never negative, so a span that crosses midnight wraps to the next day
    hours_passed = delta.seconds / 3600
    return hours_passed

Extracting *duration* from *time_span*

Ex. 6:30 AM - 8:00 AM is 1.5 hr
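
As a quick check of the example above, here is a minimal, self-contained sketch of the same calculation. The helper name `hours_between` is hypothetical (the notebook's own helpers are `get_start_N_end` and `duration`), and it assumes both times use the `H:MM AM/PM` format seen in `time_span`.

```python
from datetime import datetime

FMT = '%I:%M %p'  # 12-hour clock, e.g. '6:30 AM'

def hours_between(time_span):
    """Return the length of a 'start - end' span in hours (hypothetical helper)."""
    start, end = (part.strip() for part in time_span.split('-'))
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.seconds / 3600

print(hours_between('6:30 AM - 8:00 AM'))  # 1.5
```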

In [27]:
def get_duration(entry):
    entry = str(entry)
    try:
        start, end = get_start_N_end(entry)
        return duration(start, end)
    except (TypeError, ValueError):
        #Not a 'start - end' range; fall back to entries like '30 min'
        if 'min' in entry:
            minutes = entry[ : entry.find('min')]
            minutes = int(minutes.strip())
            return minutes / 60
        return None

def get_starting_time(entry):
    entry = str(entry)
    try:
        start, _ = get_start_N_end(entry)
        return start
    except TypeError:
        #get_start_N_end returned None, so there is no starting time
        return float('NaN')

df_code['starting_time'] = df_code['time_span'].apply(get_starting_time)
df_code['duration(hr)'] = df_code['time_span'].apply(get_duration)
df_code.head()

Unnamed: 0,time_span,working_on,stack,position,context,productivity,date,starting_time,duration(hr)
0,5:15 AM - 6:15 AM,Aws,Dev-Ops,Standing,Learn,8.5,23rd August,5:15 AM,1.0
1,8:00 AM - 10:15 AM,100D,Data Science,Sitting,,7.5,23rd August,8:00 AM,2.25
2,11:00 AM - 12:00 PM,Leetcode,,Standing,,8.0,23rd August,11:00 AM,1.0
3,12:30 PM - 1:30 PM,Book Store,Back End,Sitting,Refactoring,7.0,23rd August,12:30 PM,1.0
4,3:00 PM - 5:00 PM,100D,Data Science,Sitting,Write,7.5,23rd August,3:00 PM,2.0


In [28]:
def get_part_of_day(entry):
    #Assumption (to verify): I only log time in minutes for early morning sessions
    if 'min' in entry: 
        return 'early morning'
    start, end = get_start_N_end(entry)
    if start and end: 
        is_before_breakfast = duration(start, '10:00 AM') < 10
        is_before_lunch = duration(start, '2:00 PM') < 10
        if is_before_breakfast: 
            return 'early morning'
        if is_before_lunch: 
            return 'late morning'
        else: 
            return 'evening'

In [29]:
df_code['part_of_day'] = df_code.time_span.apply(get_part_of_day)
df_code.head()

Unnamed: 0,time_span,working_on,stack,position,context,productivity,date,starting_time,duration(hr),part_of_day
0,5:15 AM - 6:15 AM,Aws,Dev-Ops,Standing,Learn,8.5,23rd August,5:15 AM,1.0,early morning
1,8:00 AM - 10:15 AM,100D,Data Science,Sitting,,7.5,23rd August,8:00 AM,2.25,early morning
2,11:00 AM - 12:00 PM,Leetcode,,Standing,,8.0,23rd August,11:00 AM,1.0,late morning
3,12:30 PM - 1:30 PM,Book Store,Back End,Sitting,Refactoring,7.0,23rd August,12:30 PM,1.0,late morning
4,3:00 PM - 5:00 PM,100D,Data Science,Sitting,Write,7.5,23rd August,3:00 PM,2.0,evening
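
`get_part_of_day` above gets its answer indirectly, by checking whether `duration(start, ...)` wraps past midnight. A more direct sketch of the same idea compares the start time against the two cut-offs used in that cell (10:00 AM and 2:00 PM). The function name `part_of_day` is hypothetical. It should agree with the notebook's version for the rows shown, though edge cases such as a start at exactly 12:00 AM may be labelled differently.

```python
from datetime import datetime, time

def part_of_day(start, fmt='%I:%M %p'):
    """Classify a start time such as '5:15 AM' (hypothetical helper)."""
    start_time = datetime.strptime(start, fmt).time()
    if start_time <= time(10, 0):   # up to 10:00 AM
        return 'early morning'
    if start_time <= time(14, 0):   # up to 2:00 PM
        return 'late morning'
    return 'evening'

for start in ['5:15 AM', '11:00 AM', '12:30 PM', '3:00 PM']:
    print(start, '->', part_of_day(start))
```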


In [30]:
#Rearranging columns and dropping time_span
cols = ['starting_time', 'duration(hr)', 'working_on', 'stack', 'position',
        'context', 'productivity', 'part_of_day', 'date']
df_code = df_code[cols]

df_code.head()

Unnamed: 0,starting_time,duration(hr),working_on,stack,position,context,productivity,part_of_day,date
0,5:15 AM,1.0,Aws,Dev-Ops,Standing,Learn,8.5,early morning,23rd August
1,8:00 AM,2.25,100D,Data Science,Sitting,,7.5,early morning,23rd August
2,11:00 AM,1.0,Leetcode,,Standing,,8.0,late morning,23rd August
3,12:30 PM,1.0,Book Store,Back End,Sitting,Refactoring,7.0,late morning,23rd August
4,3:00 PM,2.0,100D,Data Science,Sitting,Write,7.5,evening,23rd August


### Feature Engineering on **date**
Every dataframe (df_code, df_exercise, df_smartwatch) has a date associated with it.

And every date has a *week number* and a *weekday* associated with it.

Ex. 26th August is in week number __1__, is day number 3 of the 100 day experiment, and falls on a *Wednesday*.

I know from experience that my productivity varies depending on the weekday.

In the last notebook I created a DataFrame **df_date** mapping a date to its day_number and is_halfday.

In this notebook I will add another column that maps a date to its day of the week.

In [33]:
df_date.rename({'Unnamed: 0' : 'date'}, axis=1, inplace=True)
df_date.set_index('date', inplace=True)
df_date.head(6)

date,day_number,is_halfday
23rd August,0,False
24th August,1,False
25th August,2,False
26th August,3,False
27th August,4,False
28th August,5,True


But... it's quarantine, and Sundays and Mondays are no different from each other for me.

But I still take a break once in a while and I need a way to account for that.

So instead of a weekday like Monday or Tuesday, I will use a different metric - days_after_break.

Ex. If I take a break on 5th September, 7th September is 2 days_after_break, 8th September is 3 days_after_break and so on...

I already have a day number. I go back to the original source, the Time Sheets (remember those?), to get the week number and days_after_break for each date.
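
To make the metric concrete, here is a minimal sketch of the counting rule on its own. It assumes the working days and break days are given as plain day numbers, which is a simplification: the notebook derives them from the Time Sheets folder structure instead. The function name and the sample days are hypothetical.

```python
def days_after_break(working_days, break_days):
    """Number each working day by how many days have passed since the last break."""
    result = {}
    counter = 0
    for day in range(min(working_days), max(working_days) + 1):
        if day in break_days:
            counter = 0          # a break resets the count
        elif day in working_days:
            counter += 1
            result[day] = counter
    return result

# Days of September, with a break on the 5th (the example from the text)
print(days_after_break({3, 4, 6, 7, 8}, {5}))
# {3: 1, 4: 2, 6: 1, 7: 2, 8: 3}
```

The count starts at 1 on the first logged day, so the first day after a break is 1 days_after_break.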

In [43]:
#Creating empty columns
df_date['days_after_break'] = ''
df_date['week'] = ''

TIME_SHEETS_PATH = './100 Days'
week_folders = os.listdir(TIME_SHEETS_PATH)
#Sort by the second word of the folder name (note: compared as a string, not a number)
week_folders.sort(key=lambda x : x.split(' ')[1])
cur_week = 0
for week_folder in week_folders:
    week_path = os.path.join(TIME_SHEETS_PATH, week_folder)
    if os.path.isdir(week_path):
        cur_week += 1
        #File names without their extension are the dates logged in this week
        dates_in_week = [f.split('.')[0] for f in os.listdir(week_path)]
        def sort(x):
            #Names containing 'HD' (presumably half days) lose their last 5 characters
            if 'HD' in x:
                x = x[:-5]
            #REMOVE THIS
            if 'daily' in x.lower() or 'week' in x.lower():
                return -1
            return df_date.loc[x]['day_number']
        dates_in_week.sort(key=sort)
        #Position within the week folder (starting at 1) is days_after_break
        for days_after_break, date in enumerate(dates_in_week):
            if 'HD' in date:
                date = date[:-5]
            df_date.at[date, 'days_after_break'] = days_after_break + 1
            df_date.at[date, 'week'] = cur_week

df_date.head(8)

date,day_number,is_halfday,days_after_break,week
23rd August,0.0,False,1,1
24th August,1.0,False,2,1
25th August,2.0,False,3,1
26th August,3.0,False,4,1
27th August,4.0,False,5,1
28th August,5.0,True,6,1
29th August,6.0,False,1,2
30th August,7.0,True,2,2
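
One thing to watch in the cell above: `week_folders.sort(key=lambda x : x.split(' ')[1])` compares the week numbers as strings, so a folder such as `Week 10` would sort before `Week 2`. Below is a minimal sketch of a numeric sort, assuming the folders are named like `Week 1`, `Week 2` and so on. The actual folder names are not shown in this notebook, so the names here are hypothetical, as is the helper `week_number`.

```python
def week_number(folder_name):
    """Extract the integer after the first space, e.g. 'Week 10' -> 10."""
    return int(folder_name.split(' ')[1])

folders = ['Week 10', 'Week 2', 'Week 1']   # hypothetical folder names

# String comparison: '10' sorts before '2'
print(sorted(folders, key=lambda x: x.split(' ')[1]))  # ['Week 1', 'Week 10', 'Week 2']

# Numeric comparison keeps the weeks in order
print(sorted(folders, key=week_number))                # ['Week 1', 'Week 2', 'Week 10']
```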


## **df_smartwatch**
*df_smartwatch* contains all the health-related metrics, like amount of sleep and average heart rate

In [44]:
df_smartwatch.head()

Unnamed: 0,date,sleep,deep_sleep,steps_walked,km_walked,avg_heart_rate,stress,calories,week
0,23rd August,7:54 PM - 4:50 AM,48,6893,5.4,54,40,3133,1
1,24th August,5:55 PM - 12:23 AM,48,17430,14.62,55,38,3607,1
2,25th August,9:59 PM - 5:48 AM,77,7339,5.94,51,22,2670,1
3,26th August,7:53 PM - 12:53 AM,48,4195,3.55,51,33,2634,1
4,27th August,8:38 PM - 1:38 AM,10,6110,4.93,50,17,2697,1


Breaking down the sleep column into sleep_start, wakeup_time and sleep_duration(hr), the same way as time_span in df_code
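
One detail worth spelling out: sleep spans cross midnight, so the end time is earlier on the clock than the start time. The `duration` helper still works because `timedelta.seconds` is always between 0 and 86399, so a negative difference wraps around to the next day. A small sketch of that behaviour, using the first row of the table above:

```python
from datetime import datetime

FMT = '%I:%M %p'
start = datetime.strptime('7:54 PM', FMT)
end = datetime.strptime('4:50 AM', FMT)

delta = end - start            # negative: -1 day + 8 h 56 min
print(delta.days)              # -1
hours = delta.seconds / 3600   # .seconds holds only the positive remainder
print(round(hours, 2))         # 8.93
```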

In [50]:
def get_sleep_duration(entry):
    entry = str(entry)
    start, end = get_start_N_end(entry)
    sleep_duration =  float(duration(start, end))
    return round(sleep_duration, 2)
df_smartwatch['sleep_start'] = df_smartwatch.sleep.apply(lambda x : get_start_N_end(x)[0])
df_smartwatch['wakeup_time'] = df_smartwatch.sleep.apply(lambda x : get_start_N_end(x)[1])
df_smartwatch['sleep_duration(hr)'] = df_smartwatch.sleep.apply(get_sleep_duration)

#Rearranging columns
cols = df_smartwatch.columns.to_list()
cols = cols[:3] + cols[-3:] + cols[3:-3]
df_smartwatch = df_smartwatch[cols]
df_smartwatch.head(8)

Unnamed: 0,date,sleep,deep_sleep,steps_walked,km_walked,avg_heart_rate,stress,calories,week,sleep_start,wakeup_time,sleep_duration(hr)
0,23rd August,7:54 PM - 4:50 AM,48,6893,5.4,54,40,3133,1,7:54 PM,4:50 AM,8.93
1,24th August,5:55 PM - 12:23 AM,48,17430,14.62,55,38,3607,1,5:55 PM,12:23 AM,6.47
2,25th August,9:59 PM - 5:48 AM,77,7339,5.94,51,22,2670,1,9:59 PM,5:48 AM,7.82
3,26th August,7:53 PM - 12:53 AM,48,4195,3.55,51,33,2634,1,7:53 PM,12:53 AM,5.0
4,27th August,8:38 PM - 1:38 AM,10,6110,4.93,50,17,2697,1,8:38 PM,1:38 AM,5.0
5,28th August,8:55 PM - 6:36 AM,112,7532,6.08,53,29,2869,1,8:55 PM,6:36 AM,9.68
6,29th August,8:21 PM - 5:26 AM,109,8369,7.02,52,20,2800,1,8:21 PM,5:26 AM,9.08
7,30th August,10:14 PM - 4:37 AM,75,7567,8.0,53,31,3133,2,10:14 PM,4:37 AM,6.38


## **df_exercise**

In [46]:
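#Converting time_span to whole minutes (round before int so 31.999... does not truncate to 31)
df_exercise['time_span'] = df_exercise.time_span.apply(lambda x : int(round(get_duration(x) * 60)))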
df_exercise.rename(columns={'time_span' : 'duration(min)'}, inplace=True)
df_exercise.head(8)

Unnamed: 0,duration(min),activity,measurement,bpm,date
0,32,Cycling,9.21 km,121.0,23rd August
1,49,Walking,4.31 km,109.0,23rd August
2,30,Workout,2 x Workout,,23rd August
3,30,Cycling,9.21 km,121.0,24th August
4,52,Walking,4.30 km,109.0,24th August
5,15,Workout,1 set,101.0,24th August
6,53,Walking,4.23 km,99.0,25th August
7,17,Cycling,4.25 km,117.0,27th August


Adding another column - speed for cycling and walking, where speed = distance / time

In [47]:
def get_speed(x):
    if x['activity'] not in ['Cycling', 'Walking']:
        return float('NaN')
    kms = float(x['measurement'].split(' ')[0])
    time = float(x['duration(min)']) 
    return round(kms / time * 60, 2)

#Speed is in km / hr
df_exercise['speed'] = df_exercise.apply(get_speed, axis=1)
df_exercise.head()

Unnamed: 0,duration(min),activity,measurement,bpm,date,speed
0,32,Cycling,9.21 km,121.0,23rd August,17.27
1,49,Walking,4.31 km,109.0,23rd August,5.28
2,30,Workout,2 x Workout,,23rd August,
3,30,Cycling,9.21 km,121.0,24th August,18.42
4,52,Walking,4.30 km,109.0,24th August,4.96


Got the speed in km/hr
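
The speed column relies on the `measurement` string starting with a distance in km. Below is a minimal sketch of a more defensive parser, which returns `None` for entries such as `2 x Workout` or `1 set`. `parse_km` and `speed_kmh` are hypothetical helpers and not part of the notebook.

```python
import re

def parse_km(measurement):
    """Return the distance in km from strings like '9.21 km', else None."""
    match = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*km\s*', str(measurement))
    return float(match.group(1)) if match else None

def speed_kmh(measurement, minutes):
    """Speed in km/hr = distance in km / (duration in minutes / 60)."""
    kms = parse_km(measurement)
    if kms is None or not minutes:
        return None
    return round(kms / minutes * 60, 2)

print(speed_kmh('9.21 km', 32))      # 17.27
print(speed_kmh('2 x Workout', 30))  # None
```

Checking the unit in the string, rather than the activity name, means a walking or cycling row with an unexpected measurement yields a missing value instead of an error.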

Saving all the dataframes

In [48]:
SAVE_PATH = './100D Data/Final Dataframes'
DF = [df_code, df_exercise, df_smartwatch, df_extra]
NAMES = ['df_code', 'df_exercise', 'df_smartwatch', 'df_extra']
for df, name in zip(DF, NAMES):
    path = os.path.join(SAVE_PATH, name)
    df.to_csv(path, index=False)
df_date.to_csv(os.path.join(SAVE_PATH, 'df_date'))