# "Exploring the Basics of Data Indexing"

> "A gentle introduction to the concept of data indexing."

- toc: true
- badges: true
- author: Tobias Reaper
- comments: true
- categories: [blog]
- image: 

---

# Introduction

Anyone who has worked with data in even a limited capacity (which, taken literally, is nearly everyone nowadays) is likely familiar with the term _data indexing_.

So, in its most generic form, what is a _data index_? For that matter, what is an _index_? (Really, the two are interchangeable in many situations. However, for the purposes of illustration I'm going to say a data index is specific to relational data.)

There are a few things that come to mind right off the bat.

The first one is the index of a book: the section, typically at the back of reference-type books such as textbooks, that lists specific names and terms and where in the book they can be found. I think this is one of the best ways to illustrate what an index _does_: it allows information to be "looked up" more easily and/or quickly.

For example, an index would be quite useful when looking for a specific term, say, "immediately invoked function expressions" in a JavaScript textbook. If that term were listed in the book's index, it would be easy to find every page where it appears and simply flip to those pages directly. If no index existed, or the term wasn't listed, then one would have to do some searching. An intelligent search algorithm could reduce the time and page-flips needed to find the term, but the effort involved is still much greater than flipping directly to the right pages.

For those with some familiarity with data structures, this is very similar (if not identical in some cases) to looking something up in a hashtable versus an array. There's a reason hashtables were invented: they are _much_ more efficient for many purposes.
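As a rough sketch of that difference, the snippet below looks up a term two ways: by scanning every page, and by consulting an index built ahead of time. The page numbers and terms are invented for illustration, and `find_by_scanning` and `build_index` are hypothetical helper names.

```python
# Invented page contents: page number -> terms that appear on that page
pages = {
    12: ["closure", "scope"],
    47: ["immediately invoked function expression", "closure"],
    48: ["immediately invoked function expression"],
    103: ["prototype"],
}


def find_by_scanning(term):
    """No index: check every page, one after another."""
    return [number for number, terms in pages.items() if term in terms]


def build_index(pages):
    """Build the index once: term -> sorted list of page numbers."""
    index = {}
    for number, terms in pages.items():
        for term in terms:
            index.setdefault(term, []).append(number)
    return {term: sorted(numbers) for term, numbers in index.items()}


book_index = build_index(pages)
term = "immediately invoked function expression"

print(find_by_scanning(term))  # touches every page
print(book_index[term])        # one dictionary lookup
```

Both calls return the same pages. The scan repeats its work on every lookup, while the index pays that cost once, up front.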

Indeed, the book example is used to explain the word "index" in many [needs citation] dictionaries.

As someone who has worked quite a lot with pandas, the next thing that comes to mind when I hear or read the phrase "data indexing" is the index column in a pandas dataframe. If you're familiar with this concept of an index column, that's also not a bad way to think about the more generic form of data index: it is a method of looking up data — you can access rows of the dataframe by "indexing" them.
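A minimal sketch of that idea in pandas, using a two-row dataframe whose values are borrowed from the sample records later in this post. The dataframe itself is made up for illustration.

```python
import pandas as pd

# Tiny illustrative dataframe; the values mirror two records shown later on
df = pd.DataFrame(
    {
        "id_num": [20313, 20314],
        "name": ["20313-Tea-Thi", "20314-Cap-Thi"],
    }
)

# With the default index, rows are looked up by the labels 0, 1, ...
first_row = df.loc[0]

# Setting a column as the index lets rows be looked up by that column's values
by_id = df.set_index("id_num")
row = by_id.loc[20314]

print(row["name"])
```

The data is unchanged by `set_index`; only the key used to reach each row is different.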

---

## Savor Data Reindexing

- [ ] By Sublocation
- Personal Data Dashboard
  - [ ] How long does it take me to do ___ (shower, dress, etc) this month this year vs last year?

For these examples, I'll be retrieving Savor data from Airtable via [airtable-python-wrapper](https://github.com/gtalarico/airtable-python-wrapper) and transforming it into several variations of a dataset that can plug into my Personal Data Dashboard.

In case you haven't read my first post about Savor, here's a very brief overview: it's a real-time journaling system I'm building with the aim of making it easy to gather and utilize rich data about my life as it happens. Thus far, I've been using Airtable as the interface, capturing the data in a relatively simple set of relational tables.

In [1]:
#collapse-hide
# === Imports and config === #
%load_ext autoreload
%autoreload

from os import environ
from pprint import pprint

from airtable import Airtable
import pandas as pd
import janitor

pd.options.display.max_rows = 100
pd.options.display.max_columns = 50

In [2]:
#collapse-hide
# === Connect to airtable === #

# Environment variables for authentication
from dotenv import load_dotenv
from pathlib import Path

load_dotenv(dotenv_path=".env")
base_key = environ.get("AIRTABLE_BASE_KEY")
api_key = environ.get("AIRTABLE_API_KEY")

In [3]:
# Connect to engage_log table
engage_log = Airtable(base_key, "engage_log", api_key=api_key)
print(engage_log)

<Airtable table:engage_log>


### ETL Setup

The data is returned as an array of dictionaries (objects, in JavaScript terms), where each item in the array represents one row in the dataset. In this format, each data point is assigned directly to its column name within the object. Another common format for datasets returned as JSON via an API is an array of arrays, where the column names are held in a separate object and are matched to their respective row data by numerical index. For this example, I'm going to be using the former.
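To make the two shapes concrete, here is a small sketch that converts between them. The rows are cut down to two fields for illustration; real records carry many more.

```python
# Array of dictionaries: every row carries its own column names
rows_as_dicts = [
    {"id_num": 20314, "name": "20314-Cap-Thi"},
    {"id_num": 20313, "name": "20313-Tea-Thi"},
]

# Array of arrays: column names live in a separate list and values are
# matched to them by position
columns = ["id_num", "name"]
rows_as_arrays = [[row[col] for col in columns] for row in rows_as_dicts]

# Numerical indexing: find the column's position, then index into the row
name_position = columns.index("name")
print(rows_as_arrays[0][name_position])

# Converting back pairs each value with its column name again
rebuilt = [dict(zip(columns, values)) for values in rows_as_arrays]
```

The array-of-arrays shape is more compact because column names are stored only once, at the cost of an extra lookup step. The records in this post stay in the array-of-dictionaries shape.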

Here is an example of what a couple rows would look like in this format.

In [4]:
# Get the two most recent engagement records, sorted by time_in (descending)
# engage_fields = [
#     "time_in",
#     "mental",
#     "physical",
#     "tags",
#     "subloc",
#     "mental_note",
#     "physical_note",
#     "who",
#     "dose",
# ]
engages = engage_log.get_all(
#     fields=engage_fields,
    sort=["-time_in"],
    max_records=2
)
engages[:2]

[{'id': 'recNVDgKQ75fbhUjv',
  'fields': {'id_num': 20314,
   'project_log': ['reccjFvHkMmxGrsu3'],
   'subloc': ['recrQNJglSZ5mmZQl'],
   'dose': ['rec74Hi1KbAdEHspV', 'recXaoGTBbrRiVeXt'],
   'time_in': '2021-02-18T13:25:00.000Z',
   'mental': ['recm7RWIWmDQDCWSe'],
   'physical': ['recNcdJGnjhCWe6Eu'],
   'name': '20314-Cap-Thi',
   'modified': '2021-03-10T23:15:13.000Z',
   'created': '2021-02-18T03:48:00.000Z',
   'duration': {'specialValue': 'NaN'},
   'project_location': ['recgaBaPGoewkBgbE']},
  'createdTime': '2021-02-18T03:48:00.000Z'},
 {'id': 'recSubcryeSu1Iwhd',
  'fields': {'id_num': 20313,
   'project_log': ['reccjFvHkMmxGrsu3'],
   'subloc': ['rec92DKYGuA3gGzXd'],
   'time_in': '2021-02-18T13:21:00.000Z',
   'mental': ['recm7RWIWmDQDCWSe'],
   'physical': ['recQsIaiG012c5KoI'],
   'name': '20313-Tea-Thi',
   'modified': '2021-03-10T23:15:13.000Z',
   'created': '2021-02-18T03:48:00.000Z',
   'duration': {'specialValue': 'NaN'},
   'project_location': ['recgaBaPGoewkBgbE

In [5]:
# Same query as a generator, which yields one page of records at a time
engage_log_records = engage_log.get_iter(
    sort=["-time_in"],
    max_records=10
)
for page in engage_log_records:
    for record in page:
        pprint(record)

{'createdTime': '2021-02-18T03:48:00.000Z',
 'fields': {'created': '2021-02-18T03:48:00.000Z',
            'dose': ['rec74Hi1KbAdEHspV', 'recXaoGTBbrRiVeXt'],
            'duration': {'specialValue': 'NaN'},
            'id_num': 20314,
            'mental': ['recm7RWIWmDQDCWSe'],
            'modified': '2021-03-10T23:15:13.000Z',
            'name': '20314-Cap-Thi',
            'physical': ['recNcdJGnjhCWe6Eu'],
            'project_location': ['recgaBaPGoewkBgbE'],
            'project_log': ['reccjFvHkMmxGrsu3'],
            'subloc': ['recrQNJglSZ5mmZQl'],
            'time_in': '2021-02-18T13:25:00.000Z'},
 'id': 'recNVDgKQ75fbhUjv'}
{'createdTime': '2021-02-18T03:48:00.000Z',
 'fields': {'created': '2021-02-18T03:48:00.000Z',
            'duration': {'specialValue': 'NaN'},
            'id_num': 20313,
            'mental': ['recm7RWIWmDQDCWSe'],
            'modified': '2021-03-10T23:15:13.000Z',
            'name': '20313-Tea-Thi',
            'physical': ['recQsIaiG012c5KoI']

#### Primary Keys

As you may have noticed in these records, any time a related table is referenced, the primary key is used, e.g. `rec92DKYGuA3gGzXd`. Part of the ETL process will therefore be matching up those keys to their respective records. This is not only so I know what I'm working with, but also to get human-readable names for these records to display on the dashboard. Technically I could still do all the reindexing without knowing what the records are, but I would be flying mostly blind and wouldn't be able to get much useful insight.

In most cases, I really only need to get the name of the record, though there are a couple of instances where additional information is useful. For example, the `dose` table, which holds data about nutritional and nootropic supplements, has information like what supplement it is and the amount.

Additionally, the way Airtable sets up the relations results in a nice feature that I can use to my advantage here: there is already an index for each relation. When a relation is set up between two tables, a column automatically gets created in both tables listing all references to their respective records in the other table. That's the `engage_log` column in each of the tables below: a list of `engage_log` records that reference the record.
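For comparison, this is roughly what building such a reverse index by hand would involve. It is a sketch, not Airtable's implementation: the records are trimmed to the `mental` field, and `build_reverse_index` is a hypothetical helper.

```python
from collections import defaultdict

# Trimmed-down engage_log records, shaped like the Airtable output above
engage_records = [
    {"id": "recNVDgKQ75fbhUjv", "fields": {"mental": ["recm7RWIWmDQDCWSe"]}},
    {"id": "recSubcryeSu1Iwhd", "fields": {"mental": ["recm7RWIWmDQDCWSe"]}},
]


def build_reverse_index(records, field):
    """Map each related-record id to the ids of the records referencing it."""
    reverse = defaultdict(list)
    for record in records:
        # A record may have no value for the field, so default to an empty list
        for related_id in record["fields"].get(field, []):
            reverse[related_id].append(record["id"])
    return dict(reverse)


mental_to_engage = build_reverse_index(engage_records, "mental")
print(mental_to_engage)
```

Airtable maintains the equivalent mapping automatically, in the `engage_log` column of each related table.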

This saves me a step in case I want to look up — or _index_ — `engage_log` records by a related record. The way this might surface in the dashboard is that I could be looking at a monthly summary of my mental and physical activities and decide I want to drill down into more details about a specific mental activity. I'd ideally be able to click on that activity and it would _reindex_ the data such that it would show me the other data summarized / aggregated according to that activity, such as descriptive statistics about how long is spent on it each time I do it, where it takes place, and maybe (someday) a smart summary and sentiment analysis of my notes.

Indeed, this example can be generalized to much of the logic behind a dashboard like this: a dashboard is a visual way to reindex data.
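The drill-down described above amounts to regrouping the same records under a different key and then summarizing each group. Below is a minimal sketch of that step. The records are simplified, the sublocation names and durations (in minutes) are invented, and `reindex` and `summarize_durations` are hypothetical helpers.

```python
from collections import defaultdict
from statistics import mean, median

# Simplified records; sublocations and durations are invented for illustration
records = [
    {"mental": "Podcast", "subloc": "Kitchen", "duration": 30},
    {"mental": "Podcast", "subloc": "Desk", "duration": 50},
    {"mental": "Shop", "subloc": "Desk", "duration": 15},
]


def reindex(records, key):
    """Group records under the value each one holds for `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


def summarize_durations(groups):
    """Compute simple descriptive statistics for each group."""
    return {
        name: {
            "count": len(rows),
            "mean": mean(row["duration"] for row in rows),
            "median": median(row["duration"] for row in rows),
        }
        for name, rows in groups.items()
    }


# The same data, viewed through two different indexes
by_mental = summarize_durations(reindex(records, "mental"))
by_subloc = summarize_durations(reindex(records, "subloc"))
print(by_mental["Podcast"])
print(by_subloc["Desk"])
```

Switching the view from mental activity to sublocation only changes the `key` argument; the underlying records stay the same.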

In [6]:
# Get relevant related records to match up
table_fields = {
    "mental": ["name", "engage_log"],
    "physical": ["name", "engage_log"],
    "dose": ["name", "engage_log", "supp", "amt", "unit"],
    "who": ["name", "engage_log"],
    "location": ["name", "engage_log", "location", "city", "state"],
    "subloc": ["name", "engage_log"],
    "tag": ["name", "engage_log"],
}

table_data = {
    "mental": {},
    "physical": {},
    "dose": {},
    "who": {},
    "location": {},
    "subloc": {},
    "tag": {},
}

# Loop through tables to retrieve records and save in dictionary
for table in table_fields:
    airtable = Airtable(  # Connect to table
        base_key,
        table,
        api_key=api_key
    )
    records = airtable.get_all(  # Retrieve records
        fields=table_fields[table],
    )
    table_data[table] = records  # Save records to above dict

Now the `table_data` dictionary holds the key-name lookup I'll need later on. I won't go through the entire dataset and replace all of the PKs as that would be something of a process. My method will be to match them up only when I actually need the human-readable name.

In [7]:
pprint(table_data["mental"][:5])

[{'createdTime': '2019-11-24T06:04:43.000Z',
  'fields': {'name': 'Podcast'},
  'id': 'rec04WWDmwUYsOfVR'},
 {'createdTime': '2020-09-20T20:58:30.000Z',
  'fields': {'name': 'Arrange'},
  'id': 'rec1qtyxPApCENwj3'},
 {'createdTime': '2019-12-05T00:13:44.000Z',
  'fields': {'name': 'Troubleshoot'},
  'id': 'rec2ETpho10dvgd3e'},
 {'createdTime': '2019-12-06T21:04:43.000Z',
  'fields': {'name': 'Chat/Text'},
  'id': 'rec47FqiuoEPdjVwc'},
 {'createdTime': '2019-12-07T22:53:48.000Z',
  'fields': {'name': 'Shop'},
  'id': 'rec5saWQlgv6Bemi7'}]
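Matching a primary key to its name on demand could look something like the sketch below. It uses three of the `mental` records shown above; `build_name_lookup` and `resolve_names` are hypothetical helpers, and `recUnknown` is a made-up id included to show the fallback.

```python
# Three records copied from the `mental` table output above
mental_records = [
    {"id": "rec04WWDmwUYsOfVR", "fields": {"name": "Podcast"}},
    {"id": "rec1qtyxPApCENwj3", "fields": {"name": "Arrange"}},
    {"id": "rec2ETpho10dvgd3e", "fields": {"name": "Troubleshoot"}},
]


def build_name_lookup(records):
    """Map each record id (primary key) to its human-readable name."""
    return {record["id"]: record["fields"].get("name") for record in records}


def resolve_names(ids, lookup):
    """Swap ids for names, falling back to the raw id when there is no match."""
    return [lookup.get(record_id) or record_id for record_id in ids]


mental_names = build_name_lookup(mental_records)
print(resolve_names(["rec1qtyxPApCENwj3", "recUnknown"], mental_names))
```

Building the dictionary once per table keeps each later lookup cheap, which suits the approach of resolving names only when they are needed.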


###