# Savor Data

> Taking advantage of my own big data.

A data-driven project by [Tobias Reaper](https://github.com/tobias-fyi/)

## Part 1: Archives

Working with data from previous versions of the Savor data model.

* Load data from archive CSVs
* Transform, clean, and concatenate
* Insert into local Postgres database

---

## Introduction

Savor is a project based on an idea that I first had in 2016. At the time I was working as a consultant for an enterprise resource planning (ERP) software company. I worked intimately with manufacturers to integrate our system into their business, with the goal of optimizing their manufacturing processes. I became fascinated by the idea of tracking things to such a degree, and began to imagine what it would be like to have a similar type of system that would optimize my life.

I'm also very into journaling, and the two seemed like a great combination. Soon I arrived at the idea of a real-time journal, where I can quickly and easily document my experiences, thoughts, and interactions as they happen (or as close to that as is realistic).

This was before I started my journey into development and data science, so I didn't have the knowledge or skill to build the app myself...yet! I found a web app called Airtable that was perfect for my needs at the time, providing an intuitive interface to a set of relational databases in the cloud.

---

## Setup

In [1]:
# === General imports === #
from os import environ
from pathlib import Path

import pandas as pd
import numpy as np
import janitor

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# === Set up environment variables === #
from dotenv import load_dotenv
from pathlib import Path

env_path = Path.cwd().parents[0] / ".env"
load_dotenv(dotenv_path=env_path)

# Get path to archive data
# Indexing (rather than .get) raises a clear KeyError if ARCHIVE_PATH is unset
archive_path = Path(environ["ARCHIVE_PATH"])

In [3]:
# === Configuration === #
%load_ext autoreload
%autoreload 2
%matplotlib inline
pd.options.display.max_rows = 100

---

## Archived Data Wrangling and Concatenation

I have four separate archived datasets, mainly because Airtable's free tier supposedly caps each base at something like 1,000-2,000 records.

Every time I started a new one, I changed the data model — sometimes a lot, sometimes a little. What that means for me now is I get to go back and figure out a way of combining them all into a single, consistent structure that will be compatible with my newest data model.

The four archives cover the following time periods (and number of records):

* `2018-01-28 - 2018-11-28 (1,209)`
* `2018-10-09 - 2019-02-08 (1,454)`
* `2019-02-08 - 2019-06-07 (2,866)`
* `2019-06-01 - 2019-12-03 (8,406)`

So in total, there are 13,935 records covering a period of ~22 months. However, I believe there are a few gaps where I did not track my time. I'll explore all of this while cleaning the data enough to concatenate it into a single, consistent dataset.
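As a quick sanity check on the figures above, here is a minimal sketch that recomputes the total and the time span from the listed periods. The `archives` list is typed in by hand from the bullets above, not read from the CSVs, and the variable names are only illustrative.

```python
from datetime import date

# (start, end, record count) for each archive, copied from the list above.
archives = [
    (date(2018, 1, 28), date(2018, 11, 28), 1209),
    (date(2018, 10, 9), date(2019, 2, 8), 1454),
    (date(2019, 2, 8), date(2019, 6, 7), 2866),
    (date(2019, 6, 1), date(2019, 12, 3), 8406),
]

total_records = sum(count for _, _, count in archives)

# Whole months between the first start date and the last end date.
first_start, last_end = archives[0][0], archives[-1][1]
span_months = (last_end.year - first_start.year) * 12 + (last_end.month - first_start.month)
if last_end.day < first_start.day:
    span_months -= 1  # the final month is not complete

# Consecutive archives where the later one starts before the earlier one ends.
overlaps = [
    (prev[1], nxt[0])
    for prev, nxt in zip(archives, archives[1:])
    if nxt[0] < prev[1]
]

print(total_records, span_months, len(overlaps))
```

Going by the listed dates, the first and second archives overlap, as do the third and fourth, so the combined data may contain duplicate records that are worth checking for after concatenation.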

The important fields that I want in the final dataframe are:

* time_in
* time_out
* duration (will be re-calculated based on timestamps)
* activity (also called "What")
* notes
* tags (I'll be converting "project" and certain activities into tags, as that is how those items are tracked in the current model)
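To make the target structure concrete, here is a rough sketch of how cleaned archive frames could be combined into those fields. The function name `combine_archives` and the toy rows are hypothetical, and it assumes each frame already has `time_in`, `time_out`, `activity`, `tags`, and `notes` columns with parsed timestamps.

```python
import pandas as pd

# Target columns, taken from the list above.
TARGET_COLUMNS = ["time_in", "time_out", "duration", "activity", "notes", "tags"]


def combine_archives(frames):
    """Concatenate cleaned archive frames into the target column layout."""
    combined = pd.concat(frames, ignore_index=True)
    # Duration is derived from the timestamps rather than taken from the archives.
    combined["duration"] = combined["time_out"] - combined["time_in"]
    return (
        combined[TARGET_COLUMNS]
        .sort_values("time_in")
        .reset_index(drop=True)
    )


# Toy frames standing in for two cleaned archives (values are made up).
older = pd.DataFrame({
    "time_in": pd.to_datetime(["2018-01-28 14:00"]),
    "time_out": pd.to_datetime(["2018-01-28 15:00"]),
    "activity": ["Reading"],
    "tags": ["Personal"],
    "notes": [""],
})
newer = pd.DataFrame({
    "time_in": pd.to_datetime(["2019-06-01 10:30"]),
    "time_out": pd.to_datetime(["2019-06-01 10:39"]),
    "activity": ["Resting"],
    "tags": ["Health"],
    "notes": [""],
})

combined = combine_archives([newer, older])
print(combined)
```

Sorting by `time_in` after concatenation keeps the result chronological regardless of the order in which the archives are passed in.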

Some (potentially more advanced) things to be done to this later on:

* Extract names and places from notes to fill those columns where possible in archived data

#### TODO

* [ ] Look for records where `time_in` > `time_out`
* [ ] Look for records where `duration` is very large
* [ ] Recalculate `duration` after concatenation
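The checks in this list could be sketched roughly as follows. The helper `flag_suspect_records` and its 16-hour default threshold are assumptions for illustration only; what counts as "very large" would need to be tuned against the real data.

```python
import pandas as pd


def flag_suspect_records(df, max_duration=pd.Timedelta(hours=16)):
    """Recalculate duration and flag reversed or unusually long records."""
    checked = df.copy()
    checked["duration"] = checked["time_out"] - checked["time_in"]
    # A record that ends before it starts has its timestamps reversed.
    checked["reversed"] = checked["time_in"] > checked["time_out"]
    # The threshold is a guess; adjust it to whatever counts as too long.
    checked["too_long"] = checked["duration"] > max_duration
    return checked


# Toy records (values are made up): one normal, one reversed, one too long.
sample = pd.DataFrame({
    "time_in": pd.to_datetime(
        ["2018-03-09 03:12", "2018-03-10 21:00", "2018-03-11 09:00"]
    ),
    "time_out": pd.to_datetime(
        ["2018-03-09 11:13", "2018-03-10 20:00", "2018-03-13 09:00"]
    ),
})

flagged = flag_suspect_records(sample)
print(flagged[flagged["reversed"] | flagged["too_long"]])
```

Flagging rather than dropping keeps the suspect rows available for manual review, which matters for records that could be fixed relatively easily.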

---

### 1. 2018-01-28 - 2018-11-28 (AKA the Beginning)

In [35]:
# === 2018-01-28 - 2018-11-28 === #
# This one starts at the very beginning!
asset_path = archive_path / "2018" / "2018_11_Activitybox.Archive.csv"

# Load into dataframe, use pyjanitor to clean column names
df1_18_11 = (pd.read_csv(asset_path)
             .clean_names()[["time_in", "time_out", "what", "project", "notes"]]
             .copy()
             # Drop rows where time_in or time_out is null
             .dropna(axis=0, subset=["time_in", "time_out"])
             # Remaining nulls are in `notes` - fill with empty string
             .fillna(value="")
             # Convert timestamps to datetime
             .change_type("time_in", "datetime64[ns]")
             .change_type("time_out", "datetime64[ns]")
             .rename_columns({"project": "tags", "what": "activity"})
            )

print(df1_18_11.shape)
df1_18_11.head(3)

(1205, 5)


|  | time_in | time_out | activity | tags | notes |
| --- | --- | --- | --- | --- | --- |
| 0 | 2018-01-28 18:00:00 | 2018-01-28 19:08:00 | DJ Practice | General Music | Kaliope Mix approx half done |
| 1 | 2018-01-28 14:00:00 | 2018-01-28 15:00:00 | Practice Instrument(s) | General Music | Acoustic Guitar |
| 2 | 2018-02-03 10:00:00 | 2018-02-03 13:00:00 | Arrangement | Seigyn | Worked on Dark City VIP vocals |


In [36]:
# Look at nulls
df1_18_11.isnull().sum()

time_in     0
time_out    0
activity    0
tags        0
notes       0
dtype: int64

In [37]:
# Take a look at some rows with empty notes (null before the fill) - look fine to me
df1_18_11[df1_18_11["notes"] == ""].head()

|  | time_in | time_out | activity | tags | notes |
| --- | --- | --- | --- | --- | --- |
| 72 | 2018-02-26 10:40:00 | 2018-02-26 11:30:00 | Reading | Personal |  |
| 100 | 2018-03-01 20:15:00 | 2018-03-01 21:33:00 | Exercise | Personal |  |
| 118 | 2018-03-08 19:00:00 | 2018-03-08 20:30:00 | Arrangement | Seigyn |  |
| 120 | 2018-03-09 03:12:00 | 2018-03-09 11:13:00 | Sleep | Personal |  |
| 167 | 2018-03-18 02:45:00 | 2018-03-18 03:17:00 | Brush/Floss | Personal |  |


In [38]:
df1_18_11.dtypes

time_in     datetime64[ns]
time_out    datetime64[ns]
activity            object
tags                object
notes               object
dtype: object

In [None]:
# TODO: Look for records where `time_in` > `time_out`
# I know there's at least one (that could be fixed relatively easily)

---

### 2. 2018-10-09 - 2019-02-08

In [None]:
# === 2018-10-09 - 2019-02-08 === #
asset_path = archive_path / "2019-02" / "2019-02-08_Journal_Complete.csv"

df1_19_02 = (pd.read_csv(asset_path, skiprows=1)
             .clean_names()[["time_in", "time_out", "what", "project", "notes"]]
             .copy()
             .dropna(axis=0, subset=["time_in"])
             # Convert timestamps to datetime
             .change_type("time_in", "datetime64[ns]")
             .change_type("time_out", "datetime64[ns]")
             .rename_columns({"project": "tags", "what": "activity"})
            )

print(df1_19_02.shape)
df1_19_02.head(8)

In [40]:
df1_19_02.isnull().sum()

time_in       0
time_out      8
activity      0
tags          0
notes       378
dtype: int64

In [None]:
# These nulls in `time_out` can be filled in using `time_in`
df1_19_02[df1_19_02["time_out"].isnull()]
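One possible reading of "filled in using `time_in`" is to take each missing `time_out` from the `time_in` of the next record. The sketch below assumes that interpretation and assumes the journal entries are contiguous, which may not hold for every row. The helper name `fill_time_out_from_next` is hypothetical.

```python
import pandas as pd


def fill_time_out_from_next(df):
    """Fill null time_out values with the next record's time_in."""
    ordered = df.sort_values("time_in").reset_index(drop=True)
    # shift(-1) lines each row up with the time_in of the row after it.
    next_time_in = ordered["time_in"].shift(-1)
    ordered["time_out"] = ordered["time_out"].fillna(next_time_in)
    return ordered


# Toy records (values are made up); the middle one is missing time_out.
sample = pd.DataFrame({
    "time_in": pd.to_datetime(
        ["2018-10-09 10:00", "2018-10-09 10:30", "2018-10-09 11:00"]
    ),
    "time_out": pd.to_datetime(
        ["2018-10-09 10:30", None, "2018-10-09 11:45"]
    ),
})

filled = fill_time_out_from_next(sample)
print(filled)
```

A null in the final row would stay null, since there is no following record to borrow from, and any gap in tracking would be absorbed into the filled record's duration.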

---

### 3. 2019-02-08 - 2019-06-07

In [None]:
# === 2019-02-08 - 2019-06-07 === #
asset_path = archive_path / "2019-06" / "SassyJo-Journal-View.csv"

# Load journal into dataframe and take a look
df_19_06 = (pd.read_csv(asset_path)
            .clean_names()[["time_in", "time_out", "activity", "project_lookup", "notes"]]
            .copy()  # To prevent slice warning
            # Convert timestamps to datetime
            .change_type("time_in", "datetime64[ns]")
            .change_type("time_out", "datetime64[ns]")
            .rename_columns({"project_lookup": "tags"})
           )

print(df_19_06.shape)
df_19_06.head()

In [43]:
# Bummer about all those nulls
df_19_06.isnull().sum()

time_in        0
time_out       1
activity       3
tags           6
notes       1863
dtype: int64

In [None]:
# TODO: Fill null `notes` with empty string
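A minimal sketch of this TODO, using a toy frame in place of `df_19_06` (the values are made up). Passing a dict to `fillna` limits the fill to `notes`, so the remaining null in `time_out` is not overwritten with a string.

```python
import pandas as pd

# Toy frame standing in for df_19_06 (values are made up).
journal = pd.DataFrame({
    "time_in": pd.to_datetime(["2019-02-08 09:00", "2019-02-08 09:30"]),
    "time_out": pd.to_datetime(["2019-02-08 09:30", None]),
    "notes": ["Planned the week", None],
})

# Only the `notes` column is filled; the null time_out stays NaT.
journal = journal.fillna({"notes": ""})
print(journal)
```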

---

### 4. 2019-06-01 - 2019-12-03

In [44]:
# === 2019-06-01 - 2019-12-03 === #
# Last archived dataset
asset_path = archive_path / "2019-12" / "19-12-03-journal.csv"

# Load journal and take a look
df1_19_12 = (pd.read_csv(asset_path)
             .clean_names()[["time_in", "time_out", "activity", "project_lookup", "notes"]]
             .copy()  # To prevent slice warning
             # Convert timestamps to datetime
             .change_type("time_in", "datetime64[ns]")
             .change_type("time_out", "datetime64[ns]")
             .rename_columns({"project_lookup": "tags"})
            )
print(df1_19_12.shape)
df1_19_12.head()

(8406, 5)


|  | time_in | time_out | activity | tags | notes |
| --- | --- | --- | --- | --- | --- |
| 0 | 2019-06-01 03:34:00 | 2019-06-01 10:30:00 | Sleep | Health—Physical |  |
| 1 | 2019-06-01 10:30:00 | 2019-06-01 10:39:00 | Resting | Health—Physical |  |
| 2 | 2019-06-01 10:39:00 | 2019-06-01 10:44:00 | Social_Media | Community |  |
| 3 | 2019-06-01 10:44:00 | 2019-06-01 10:54:00 | Shower | Health—Physical |  |
| 4 | 2019-06-01 10:54:00 | 2019-06-01 10:58:00 | Dress | In_Between |  |


In [45]:
df1_19_12.dtypes

time_in     datetime64[ns]
time_out    datetime64[ns]
activity            object
tags                object
notes               object
dtype: object

In [46]:
# LOTS of nulls to deal with here!
df1_19_12.isnull().sum()

time_in       23
time_out    6997
activity      26
tags          26
notes       5078
dtype: int64