# "Exploring the Basics of Data Indexing"

> "A gentle introduction to the concept of data indexing."

- toc: true
- badges: true
- author: Tobias Reaper
- comments: true
- categories: [blog]
- image: 

---

# Introduction

Anyone who has worked with data in even a limited capacity (which, taken literally, is nearly everyone nowadays) is likely familiar with the term _data indexing_. I explicitly won't be getting into SQL or _database_ indexing in this particular piece; that will be the topic of another.

So, in its most generic form, what is a _data index_? For that matter, what is an _index_?

There are a few things that come to mind right off the bat.

The first one is the index of a book: that section, typically at the back of reference-type books such as textbooks, that lists specific names and terms and where in the book they can be found. I think this is one of the best ways to illustrate what an index _does_: it allows information to be "looked up" more easily and quickly.

For example, an index would be quite useful when looking for a specific term, say "immediately invoked function expressions", in a JavaScript textbook. If that term were listed in the book's index, it would be easy to find every page where it appears and flip to those pages directly. If no index existed, or that term wasn't listed, one would have to search for it. An intelligent search algorithm could reduce the time and page-flips needed, but the effort involved would still be much greater than flipping directly to the right pages.

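For those with some familiarity with data structures, this is very similar (if not identical in some cases) to looking something up in a hashtable versus an array. Finding an item in an unsorted array means checking elements one by one, whereas a hashtable lookup by key takes roughly constant time on average. There's a reason hashtables were invented: they are _much_ more efficient for many purposes.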
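To make the comparison concrete, here is a minimal sketch in Python, where a `dict` plays the role of the hashtable. The `page_entries` list, the `find_pages_by_scan` helper, and the page numbers are all hypothetical and exist only for illustration.

```python
# Hypothetical (term, page) pairs in the order they appear in a book.
page_entries = [
    ("closure", 12),
    ("immediately invoked function expression", 48),
    ("hoisting", 73),
    ("immediately invoked function expression", 131),
]


def find_pages_by_scan(entries, term):
    """Without an index: check every entry, one by one."""
    return [page for entry_term, page in entries if entry_term == term]


# With an index: one pass up front builds a dict of term -> list of pages...
book_index = {}
for entry_term, page in page_entries:
    book_index.setdefault(entry_term, []).append(page)

# ...after which each lookup is a single dict access instead of a full scan.
term = "immediately invoked function expression"
scanned_pages = find_pages_by_scan(page_entries, term)
indexed_pages = book_index[term]
```

Both approaches return the same pages. The difference is that the scan repeats its work for every term looked up, while the index is built once and reused.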

As someone who has worked quite a lot with pandas, I next think of the index column in a pandas dataframe when I hear or read the phrase "data indexing". If you're familiar with this concept of an index column, that's also not a bad way to think about the more generic form of a data index: it is a method of looking up data, in that you can access rows of the dataframe by "indexing" them.
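As a small illustration of that idea, the sketch below builds a tiny dataframe and uses one of its columns as the index. The column names and values are made up for this example; `set_index` and `.loc` are the standard pandas calls for setting an index and looking rows up by label.

```python
import pandas as pd

# Hypothetical two-row dataset; the column names are assumptions.
df = pd.DataFrame(
    [
        {"id": "t1", "track_title": "Fairy", "streams": 442906},
        {"id": "t2", "track_title": "Do Not Grill Inside", "streams": 510786},
    ]
)

# Use the "id" column as the index so rows can be looked up by id.
indexed = df.set_index("id")

# Access a whole row, or a single value, by its index label.
row = indexed.loc["t2"]
title = indexed.loc["t1", "track_title"]
```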



## Data Reindexing Example: By Artist

In this example, I'll be using an imaginary API that returns music metadata to build a dataset as an array of objects / dictionaries, where each item (object / dict / hash table) in the array represents one row in the dataset. In this format, each data point is assigned directly to its column name within the object. The other common format for datasets returned as JSON via an API is an array of arrays, where the column names are held in a separate object and are matched to their respective row data by numerical position. For this example, I'm going to be using the former.

Here is an example of what a couple of rows would look like in this format (once the JSON object is converted into a Python dictionary, which is essentially the same thing).
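For comparison, here is a rough sketch of how the array-of-arrays format might look and how it can be converted into the array-of-objects format. The `columns` and `raw_rows` names, and the choice of only three columns, are assumptions made for brevity.

```python
# Hypothetical "array of arrays" payload: column names are stored once,
# and each row is a plain list whose positions line up with those names.
columns = ["id", "track_title", "artists"]
raw_rows = [
    ["V6cb2pnH4vDW15pI", "Fairy", ["Amonita"]],
    ["J2amgp6tQli64sKU", "Do Not Grill Inside", ["Gab Rhome"]],
]

# Reading a value means finding the column's position first.
title_position = columns.index("track_title")
first_title = raw_rows[0][title_position]

# Pairing each row with the column names converts it into the
# "array of objects" format, where values are read by name instead.
rows_as_dicts = [dict(zip(columns, raw_row)) for raw_row in raw_rows]
```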

```python
rows = [
    {
        "id": "V6cb2pnH4vDW15pI",
        "track_title": "Fairy",
        "artists": ["Amonita"],
        "collaborators": "",
        "label": "All Day I Dream",
        "release_title": "Secret of Happiness",
        "release_date": "2019-04-19",
        "genre": "Deep House",
        "duration": 436000,
        "danceability": 0.78,
        "key": 7,
        "valence": 0.81,
        "playlist_count": 382,
        "streams": 442906
    },
    {
        "id": "J2amgp6tQli64sKU",
        "track_title": "Do Not Grill Inside",
        "artists": ["Gab Rhome"],
        "collaborators": "",
        "label": "All Day I Dream",
        "release_title": "Rêveries Éphémères",
        "release_date": "2019-05-03",
        "genre": "Deep House",
        "duration": 433000,
        "danceability": 0.71,
        "key": 3,
        "valence": 0.65,
        "playlist_count": 148,
        "streams": 510786
    },
    # And so on...
]
```

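One of the key features of this format is that every row carries its own column names, so a value can be read by name (for example, `row["artists"]`) rather than by position.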

Say that we need to group the rows by artist in order to chart each artist's releases and visualize trends in popularity over time.

The code will loop through a list of artists and render a set of charts for each. Therefore, the data needs to be efficiently accessible by artist. As alluded to above, it wouldn't be very efficient to run through the entire dataset every single time to collect the rows whose `artists` column contains the artist in question. A much more time-efficient approach is to index the data by artist, in this example using a hash table as the data structure.
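One way to build such an index is sketched below. The `index_rows_by_artist` helper is a hypothetical name, and the rows are trimmed copies of the sample data that keep only the fields needed here. It uses `collections.defaultdict` from the standard library so that each artist's list is created on first use.

```python
from collections import defaultdict

# Trimmed copies of the sample rows, keeping only the fields needed here.
rows = [
    {"track_title": "Fairy", "artists": ["Amonita"], "release_date": "2019-04-19"},
    {"track_title": "Do Not Grill Inside", "artists": ["Gab Rhome"], "release_date": "2019-05-03"},
    {"track_title": "La Maison", "artists": ["Gab Rhome"], "release_date": "2018-12-31"},
]


def index_rows_by_artist(rows):
    """Group rows into a dict keyed by artist name (hypothetical helper)."""
    rows_by_artist = defaultdict(list)
    for row in rows:
        # "artists" is a list, so a track with several artists
        # is filed under each of them.
        for artist in row["artists"]:
            rows_by_artist[artist].append(row)
    # Convert to a plain dict so missing artists raise KeyError
    # instead of silently creating empty entries.
    return dict(rows_by_artist)


rows_by_artist = index_rows_by_artist(rows)
gab_rhome_titles = [row["track_title"] for row in rows_by_artist["Gab Rhome"]]
```

The dataset is traversed once to build the index, and every later lookup by artist is a single dictionary access.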

The final result will look something like this:

```python
rows_by_artist = {
    "Amonita": [
        {
            "id": "V6cb2pnH4vDW15pI",
            "track_title": "Fairy",
            "artists": ["Amonita"],
            "collaborators": "",
            "label": "All Day I Dream",
            "release_title": "Secret of Happiness",
            "release_date": "2019-04-19",
            "genre": "Deep House",
            "duration": 436000,
            "danceability": 0.78,
            "key": 7,
            "valence": 0.81,
            "playlist_count": 382,
            "streams": 442906
        },
        {
            "id": "V6cb2pnH4vDW15pI",
            "track_title": "Azure",
            "artists": ["Amonita"],
            "collaborators": "",
            "label": "All Day I Dream",
            "release_title": "Secret of Happiness",
            "release_date": "2019-04-19",
            "genre": "Deep House",
            "duration": 436000,
            "danceability": 0.78,
            "key": 7,
            "valence": 0.81,
            "playlist_count": 93,
            "streams": 38337
        },
    ],
    "Gab Rhome": [
        {
            "id": "J2amgp6tQli64sKU",
            "track_title": "Do Not Grill Inside",
            "artists": ["Gab Rhome"],
            "collaborators": "",
            "label": "All Day I Dream",
            "release_title": "Rêveries Éphémères",
            "release_date": "2019-05-03",
            "genre": "Deep House",
            "duration": 433000,
            "danceability": 0.71,
            "key": 3,
            "valence": 0.65,
            "playlist_count": 148,
            "streams": 510786
        },
        {
            "id": "J2amgp6tQli64sKU",
            "track_title": "La Maison",
            "artists": ["Gab Rhome"],
            "collaborators": "",
            "label": "Sol Selectas",
            "release_title": "La Maison",
            "release_date": "2018-12-31",
            "genre": "Deep House",
            "duration": 430000,
            "danceability": 0.78,
            "key": 7,
            "valence": 0.68,
            "playlist_count": 520,
            "streams": 2478105
        },
    ]
}
```
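With the rows indexed like this, the per-artist loop becomes a series of direct lookups. The sketch below is one possible way to prepare the data for charting, not a definitive implementation: `artists_to_chart` and `release_timelines` are hypothetical names, the rows are trimmed to the fields used, and the actual chart rendering is left out.

```python
# Trimmed version of the indexed data, keeping only the fields used below.
rows_by_artist = {
    "Amonita": [
        {"track_title": "Fairy", "release_date": "2019-04-19", "streams": 442906},
        {"track_title": "Azure", "release_date": "2019-04-19", "streams": 38337},
    ],
    "Gab Rhome": [
        {"track_title": "Do Not Grill Inside", "release_date": "2019-05-03", "streams": 510786},
        {"track_title": "La Maison", "release_date": "2018-12-31", "streams": 2478105},
    ],
}

# Hypothetical list of artists to chart, including one with no rows.
artists_to_chart = ["Gab Rhome", "Amonita", "Unknown Artist"]

release_timelines = {}
for artist in artists_to_chart:
    # Direct lookup by artist; fall back to an empty list if absent.
    artist_rows = rows_by_artist.get(artist, [])
    # ISO 8601 date strings sort chronologically as plain strings.
    ordered = sorted(artist_rows, key=lambda row: row["release_date"])
    release_timelines[artist] = [
        (row["release_date"], row["streams"]) for row in ordered
    ]
```

Each entry in `release_timelines` is a date-ordered list of `(release_date, streams)` pairs that could be handed to a plotting library of choice.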