# Savor: Data

> ### Take advantage of your own big data.

#### A data project by [_Tobias Reaper_](https://www.linkedin.com/in/tobias-reaper/)

---
---

## Disclaimer

The data contained within this notebook are private and confidential, and may not be shared with anyone without express permission from me (Tobias Reaper). If you somehow find yourself in possession of the raw data used in this notebook without that permission, please be a good person and delete it from wherever it's been saved. Also, please get in touch with me so I can try to figure out how it slipped through the cracks. Thank you.

---
---

## This Notebook

This notebook is an initial exploration of the latest version of the Savor data model, which was first implemented 2019-12-03.

### Table of Contents

---
---

## TODO

* [ ] Get screen time data from Airtable as better measure of time spent updating the journal

---
---

## Introduction

> Take advantage of your own big data.

Savor is a project based on an idea that I first had in 2016. At the time I was working as a consultant for an enterprise resource planning (ERP) software company. I worked intimately with manufacturers to integrate our system into their business, with the goal of optimizing their manufacturing processes. I became fascinated by the idea of tracking things to such a degree, and began to imagine what it would be like to have a similar type of system that would optimize my life.

That's the core idea: building a system to organize my life as if it were a series of manufacturing processes.

That way of saying it may initially make it seem impersonal. I believe it's the opposite, in fact. The goal is to use software to understand myself better. That's where the tagline comes from — I'd like to take advantage of my own big data to make my life better in whatever ways I see fit at any given time.

Companies like Google and Facebook have been taking advantage of my big data for years, with the primary goal of making money. In the process of segmenting and profiling me, they have likely learned a lot about me. I'd like to have a similar data-driven profile of my life for my own purposes: namely, to learn more about myself and my life so that I can optimize it.

I can already see people rolling their eyes and thinking this is just another productivity app, and words like "optimize" don't help. However, because I have total control over this system, I get to choose exactly how it is used and precisely what is optimized. Sometimes that optimization will mean making myself more productive, but it's equally likely that I'll want to optimize the length of time I spend with family and friends and the quality of those connections.

Imagine that: a system that can help you find time to spend with family and friends, and find mindsets and/or conversation topics that will most likely increase the quality of those connections.

I think that sounds like the opposite of impersonal — getting intimate with oneself on levels and to a degree potentially not possible before.

...

---
---

## Setup

In [1]:
# === General imports === #
import os

import pandas as pd
import numpy as np
import janitor

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# === Configuration === #
%matplotlib inline
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

In [3]:
# === 2019-12-03 - 2020-08-17 === #
asset_path = "../assets/data_/2020-09-07"

# Individual files
project_log = os.path.join(asset_path, "project_log.csv")
engage_log = os.path.join(asset_path, "engage_log.csv")
moment_log = os.path.join(asset_path, "moment_log.csv")

# Load into dataframes
project_1 = pd.read_csv(project_log)
engage_1 = pd.read_csv(engage_log)
moment_1 = pd.read_csv(moment_log)

---

## Data Model

The latest data model I came up with for my journal has three distinct layers (with more to come in subsequent iterations).

1. Project Log
2. Engagement Log
3. Moment Log

The naming was not completely arbitrary, though it could definitely be improved: the names aren't perfectly self-explanatory.

Basically, the reason I broke it up has to do with something I noticed about how my time is spent. At the top level, I try to work on a single "project" for a certain period of time. This helps me stay focused on what I want to work on during that time. Another way to think about it is that this level defines the overarching parts of the day, usually somewhere between 5 and 10, depending on what I'm doing.

Within each block of time (maybe that's a better name right there — "block") I can switch between specific activities, such as between coding, writing, and reading/research. That is the second level, with each individual activity assigned to that higher level block. This second level is where the vast majority of the action is; the level where I spend most of my time.

The third level is for very short activities that aren't necessarily related to the main activity. For example, I could be working on the notebook to explore this data but take short breaks to get water, use the restroom, or clear my head. The previous iteration of the data model didn't account for these little activities: every activity was "created equal", so to speak. To account for that time, I'd have to split up and duplicate the main activity record and intersperse the short breaks between the copies. Simply put, that sometimes caused too much overhead.
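
One way to picture the three layers is as nested records. The sketch below is only an illustration under assumptions: the class names, the `Moment` fields, and the choice to nest moments inside engagements are hypothetical and are not taken from the actual journal schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Moment:
    """Very short activity; both fields are hypothetical."""
    time_in: datetime
    note: str = ""


@dataclass
class Engagement:
    """A specific activity inside a project block."""
    id: str
    time_in: datetime
    time_out: datetime
    moments: List[Moment] = field(default_factory=list)  # assumed nesting


@dataclass
class Project:
    """Top-level block of time with one overarching goal."""
    id: int
    time_in: datetime
    time_out: datetime
    engagements: List[Engagement] = field(default_factory=list)


# Toy example: one project containing one engagement with a short break
block = Project(
    id=1,
    time_in=datetime(2019, 12, 3, 9, 2),
    time_out=datetime(2019, 12, 3, 10, 10),
)
coding = Engagement("example-1", block.time_in, block.time_out)
coding.moments.append(Moment(datetime(2019, 12, 3, 9, 30), "get water"))
block.engagements.append(coding)
print(len(block.engagements), len(block.engagements[0].moments))
```

With this shape, a short break is logged as a `Moment` without splitting or duplicating the surrounding engagement record.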

The goal with this project is to reduce the time and effort required to keep a real-time journal to the point where it doesn't interrupt what I'm actually doing.

### Project Log

The Project Log is the "top level" of my journal. These are blocks of time spent doing a set of related activities with (hopefully) a single goal in mind.

The specific activities, or "engagements" as I call them in this iteration, are linked to the parent "project" in a many-to-one manner. I.e. there are many engage records linked to a single project record.

In [4]:
print(project_1.shape)
project_1.head()

(1768, 10)


| | id | time_in | notes | location | engage_log | time_out | duration | created | modified | who |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019-12-03 07:00 | Very first record on the new journal!! - kinda... | 24hr-Bel | 1-Exe-Pod,2-Exe-Pod,3-Exe-Pod,4-Wal-Pod,5-Dri-Pod | 2019-12-03 08:20 | 1:20 | 2019-11-23 22:49 | 2019-12-03 13:52 | |
| 1 | 2 | 2019-12-03 08:20 | Preparation for another solid day’s work | Casa-Lake | 6-Wal-Pod,7-Sta-Upd,8-Poo-Upd,9-Sho-Bra,10-Dre... | 2019-12-03 09:02 | 0:42 | 2019-11-23 22:49 | 2019-12-03 13:52 | |
| 2 | 3 | 2019-12-03 09:02 | Lambda morning warmup time | Casa-Lake | 13-Sta-Wor,14-Sta-Dat,15-Sta-Wor | 2019-12-03 10:10 | 1:08 | 2019-11-23 22:49 | 2019-12-06 03:35 | |
| 3 | 4 | 2019-12-03 10:10 | 412 Lesson - Vectorization | Casa-Lake | 16-Sit-Lea,17-Foo-Lea,18-Uri-Thi,19-Sit-Upd,20... | 2019-12-03 12:20 | 2:10 | 2019-12-03 10:09 | 2019-12-03 13:52 | |
| 4 | 5 | 2019-12-03 12:20 | Lunchtime | Casa-Lake | 21-Eat-Pod,22-Dis-Pod,23-Uri-Pod,24-Sit-Pod,25... | 2019-12-03 13:00 | 0:40 | 2019-12-03 11:21 | 2019-12-03 13:52 | |
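
The `engage_log` column stores the linked engagement ids as one comma-separated string per project. As a rough sketch of how that many-to-one link could be unpacked into one row per (project, engagement) pair, the snippet below uses a toy frame built from two of the rows above. The names `projects`, `links`, `project_id`, and `engage_id` are made up for the illustration.

```python
import pandas as pd

# Toy stand-in for the Project Log, using two untruncated rows from the preview
projects = pd.DataFrame({
    "id": [1, 3],
    "engage_log": [
        "1-Exe-Pod,2-Exe-Pod,3-Exe-Pod,4-Wal-Pod,5-Dri-Pod",
        "13-Sta-Wor,14-Sta-Dat,15-Sta-Wor",
    ],
})

# Split the comma-separated ids into lists, then give each id its own row
links = (
    projects.rename(columns={"id": "project_id"})
    .assign(engage_id=lambda df: df["engage_log"].str.split(","))
    .explode("engage_id")[["project_id", "engage_id"]]
    .reset_index(drop=True)
)
print(links)
```

A table like `links` could then be merged with the Engage Log on the engagement id to attach project-level fields to each activity.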


### Engage Log

As mentioned above, the Engage Log is where I spend most of my time. This is where I track what I'm doing at any given moment.

I wanted to roughly categorize each engagement or activity (I use them interchangeably here) based on what I'm doing mentally and physically. Those two features, along with their respective note fields, are the most important part of the journal, as they are the most descriptive features of my experience.

---
---

## Wrangling

### To do

* [x] Convert datetimes
* [x] Convert `duration` to minutes
* [ ] Fill in nulls with empty string
* [ ] Break out `mental` and `physical` into columns
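
For the last unchecked item, one possible reading of "break out into columns" is one indicator column per category. The sketch below assumes that interpretation and uses a toy frame with category values taken from the previews in this notebook. It is not necessarily how the step will end up being done.

```python
import pandas as pd

# Toy frame; the category values mirror those visible in the Engage Log preview
toy = pd.DataFrame({
    "mental": ["Podcast", "Podcast", "Podcast"],
    "physical": ["Exercise", "Walk", "Drive"],
})

# One indicator column per distinct category, prefixed with the source column
dummies = pd.get_dummies(toy[["mental", "physical"]], prefix=["mental", "physical"])

# Keep the original columns alongside the new indicator columns
expanded = pd.concat([toy, dummies], axis=1)
print(expanded.columns.tolist())
```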

In [5]:
# === Clean up columns to only what's needed right now === #
print("Initial shape:", engage_1.shape)

engage_keep_cols = [
    "id",
    "time_in",
    "time_out",
    "duration",
    "mental",
    "physical",
    "mental_note",
    "physical_note",
    "task",
    "tags",
    "subloc",
    "project_location",
]

# Create new dataframe
engage_2 = engage_1[engage_keep_cols].copy()
print("After column pruning:", engage_2.shape)
engage_2.head(3)

Initial shape: (11254, 23)
After column pruning: (11254, 12)


| | id | time_in | time_out | duration | mental | physical | mental_note | physical_note | task | tags | subloc | project_location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-Exe-Pod | 2019-12-03 07:00 | 2019-12-03 07:19 | 19:00 | Podcast | Exercise | Full Stack Radio - Evan Yue \\ Vue 3.0 + new e... | Cardio - elliptical | | | Elliptical | 24hr-Bel |
| 1 | 2-Exe-Pod | 2019-12-03 07:19 | 2019-12-03 07:37 | 18:00 | Podcast | Exercise | Full Stack Radio with Evan Yue \\ Vue 3.0 - fi... | Cardio - stairs | | | Stairmaster | 24hr-Bel |
| 2 | 3-Exe-Pod | 2019-12-03 07:37 | 2019-12-03 08:02 | 25:00 | Podcast | Exercise | Django Chat \\ Caching - something to read up ... | Weights - hip abduction in / out (machine) - k... | | | Machines | 24hr-Bel |


Lots of nulls in here, which is to be expected.

In [6]:
# === Deal with dur/time_out nulls right away === #
engage_2 = engage_2.dropna(axis=0, subset=["duration", "time_out"])

In [7]:
# === Nulls === #
engage_2.isnull().sum()

id                      0
time_in                 0
time_out                0
duration                0
mental                 27
physical               11
mental_note          7809
physical_note        9636
task                10868
tags                 8890
subloc               1490
project_location        0
dtype: int64

In [8]:
# === Look at errant data types === #
engage_2.dtypes

id                  object
time_in             object
time_out            object
duration            object
mental              object
physical            object
mental_note         object
physical_note       object
task                object
tags                object
subloc              object
project_location    object
dtype: object

### Convert datetime columns

In [9]:
# === Fix the datetimes === #
date_cols = [
    "time_in",
    "time_out",
]

for col in date_cols:
    engage_2[col] = pd.to_datetime(engage_2[col])
    
engage_2.dtypes

id                          object
time_in             datetime64[ns]
time_out            datetime64[ns]
duration                    object
mental                      object
physical                    object
mental_note                 object
physical_note               object
task                        object
tags                        object
subloc                      object
project_location            object
dtype: object
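
The timestamps in the preview all look like `YYYY-MM-DD HH:MM`. Assuming that holds for every row, a stricter variant of the conversion could pass an explicit format and coerce anything that doesn't match to `NaT`, which makes malformed entries easy to count. This is a sketch on toy values, not the conversion used above.

```python
import pandas as pd

# Toy values: two well-formed timestamps and one deliberately bad entry
raw = pd.Series(["2019-12-03 07:00", "2019-12-03 07:19", "not a date"])

# An explicit format avoids per-value guessing; errors="coerce" turns
# anything that doesn't match into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M", errors="coerce")

print(parsed.isna().sum())  # number of entries that failed to parse
```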

### Convert duration to minutes

The `duration` feature was imported as a string, which makes sense given the format: `[hh:]mm:ss`. To convert this into minutes, I'll split on the colon and extract the hours and minutes, multiplying the hours by 60 and adding them to the minutes. I can leave out the seconds, as I did not capture the timestamps at that level of detail.

Unfortunately, when a record is shorter than an hour, the hour segment is omitted entirely rather than written as zero, so the position of the minutes within the split string varies. I therefore wrote a custom function to both split the string and calculate the minutes.

In [10]:
# === Custom function to split and combine === #

def split_and_calculate_mins(cell):
    """Splits up `duration` based on colon, accounting for missing hours.
    Expects format: [hh:]mm:ss."""
    # Split up cell into component parts
    segments = cell.split(":")
    segments = [int(s) for s in segments]
    # Check length - if more than 2, means hour is present
    if len(segments) > 2:
        # Calculate mins from hours and sum
        return (segments[0] * 60) + segments[1]
    elif len(segments) == 2:  # Case with mins:secs
        # Simply return the minutes
        return segments[0]
    else:
        return 0

In [None]:
# === Use apply to run the function on every cell === #
engage_2["duration"] = engage_2["duration"].apply(split_and_calculate_mins)
engage_2.head()

In [14]:
engage_2.dtypes

id                          object
time_in             datetime64[ns]
time_out            datetime64[ns]
duration                     int64
mental                      object
physical                    object
mental_note                 object
physical_note               object
task                        object
tags                        object
subloc                      object
project_location            object
dtype: object
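
As an alternative to a row-wise `apply`, the same conversion can be sketched in vectorized form with `pd.to_timedelta`. The idea is to prefix `0:` wherever the hour segment is missing, so every value has three segments, and then keep whole minutes. The values below are toy examples in the `[hh:]mm:ss` format, not real journal rows.

```python
import pandas as pd

# Toy values in the [hh:]mm:ss format described above
durations = pd.Series(["19:00", "1:05:00", "25:00"])

# Prefix "0:" where the hour segment is missing so every value is hh:mm:ss
has_hour = durations.str.count(":") == 2
padded = durations.where(has_hour, "0:" + durations)

# Parse as timedeltas and keep whole minutes, dropping the seconds
minutes = (pd.to_timedelta(padded).dt.total_seconds() // 60).astype(int)
print(minutes.tolist())
```

Unlike the custom function, this version raises on values that can't be parsed rather than returning 0, which may or may not be the desired behavior.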

### Fill nulls with empty string

Ideally I would write a note in every single record, but I don't. Doing so would take a little too much time and make the journal harder to keep in real time. Furthermore, not every record has an associated task or tag.

As for `mental` and `physical`, they should be filled in completely, but it seems that I missed some. The same goes for `subloc`, though its null count is understandably larger, as I didn't really start recording the sub-location of where I am at any given moment (e.g. what room I'm in) until a little ways into this version of the journal. Therefore, I'll have to deal with some nulls.

To deal with all of these nulls without losing information, I'm going to fill them all in with an empty string.

In [15]:
engage_2.isnull().sum()

id                      0
time_in                 0
time_out                0
duration                0
mental                 27
physical               11
mental_note          7809
physical_note        9636
task                10868
tags                 8890
subloc               1490
project_location        0
dtype: int64

In [16]:
# === Fill with empty string === #
engage_3 = engage_2.fillna(value="")
engage_3.head()

| | id | time_in | time_out | duration | mental | physical | mental_note | physical_note | task | tags | subloc | project_location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-Exe-Pod | 2019-12-03 07:00:00 | 2019-12-03 07:19:00 | 19 | Podcast | Exercise | Full Stack Radio - Evan Yue \\ Vue 3.0 + new e... | Cardio - elliptical | | | Elliptical | 24hr-Bel |
| 1 | 2-Exe-Pod | 2019-12-03 07:19:00 | 2019-12-03 07:37:00 | 18 | Podcast | Exercise | Full Stack Radio with Evan Yue \\ Vue 3.0 - fi... | Cardio - stairs | | | Stairmaster | 24hr-Bel |
| 2 | 3-Exe-Pod | 2019-12-03 07:37:00 | 2019-12-03 08:02:00 | 25 | Podcast | Exercise | Django Chat \\ Caching - something to read up ... | Weights - hip abduction in / out (machine) - k... | | | Machines | 24hr-Bel |
| 3 | 4-Wal-Pod | 2019-12-03 08:02:00 | 2019-12-03 08:08:00 | 6 | Podcast | Walk | Not so standard deviations \\ misc discussions... | Walked to locker room then to car | | | Outside | 24hr-Bel |
| 4 | 5-Dri-Pod | 2019-12-03 08:08:00 | 2019-12-03 08:20:00 | 12 | Podcast | Drive | SE Daily \\ TIBCO | | | | Trinity | 24hr-Bel |
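
A blanket `fillna("")` touches every column. If numeric or datetime columns ever gain nulls, a more targeted variant may be safer because it leaves their dtypes alone. The sketch below fills only a hand-picked list of text columns on a toy frame. The `text_cols` list is an assumption for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with nulls only in the free-text columns
toy = pd.DataFrame({
    "duration": [19, 12],
    "physical_note": ["Cardio - elliptical", np.nan],
    "subloc": [np.nan, "Trinity"],
})

# Fill only the text columns so numeric/datetime columns keep their dtypes
text_cols = ["physical_note", "subloc"]
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("")

print(filled.isna().sum().sum())  # remaining nulls across the frame
```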


---
---

## Exploration

There are so many interesting questions to ask and avenues to explore with this data that it was almost overwhelming at first. I've been brainstorming casually over the years about how to tackle the exploratory analysis and visualization.

Here are a few ideas to get me started:

* How do I spend my time? And what patterns does this follow on a daily, weekly, monthly, yearly time horizon?
* Sentiment analysis over time
  * Does my mood oscillate according to any discernible pattern?
  * Does my mood correlate with spending time on particular activities?
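
As a starting point for the first question, total minutes per day and activity can be sketched with a `groupby`. The frame below is toy data shaped like the cleaned Engage Log (the second day is invented for the example), and grouping on `physical` is just one of several possible cuts.

```python
import pandas as pd

# Toy data shaped like the cleaned Engage Log; durations are in minutes
toy = pd.DataFrame({
    "time_in": pd.to_datetime([
        "2019-12-03 07:00", "2019-12-03 07:19",
        "2019-12-03 08:02", "2019-12-04 07:00",
    ]),
    "physical": ["Exercise", "Exercise", "Walk", "Exercise"],
    "duration": [19, 18, 6, 30],
})

# Sum minutes per calendar day and activity, one column per activity
daily = (
    toy.groupby([toy["time_in"].dt.date, "physical"])["duration"]
    .sum()
    .unstack(fill_value=0)
)
print(daily)
```

Swapping the date key for `toy["time_in"].dt.dayofweek` or `toy["time_in"].dt.to_period("M")` would give the weekly and monthly views.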