# Setup
**Note:** Ensure you have completed the setup instructions from `README.md` first.

Let's get started by running the following cell:

In [None]:
from util import save_readable_titles
save_readable_titles()

You should now have a file called `titles.txt` in the repo root. It lists chats in pairs of lines:
a chat title, followed by the folder in `messages/inbox` that contains that chat's data.
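
If you would rather look up a folder programmatically than search the file by hand, the two-line layout described above can be parsed roughly as follows. This is a minimal sketch: `find_folders` is a hypothetical helper (not part of `util`), and it assumes the title always precedes the folder and that titles are never empty.

```python
from pathlib import Path


def find_folders(query, titles_path="titles.txt"):
    """Return (title, folder) pairs whose title contains `query` (case-insensitive)."""
    text = Path(titles_path).read_text(encoding="utf-8")
    # Assumption: blank lines carry no information and can be dropped.
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumption: lines alternate title, folder, title, folder, ...
    pairs = zip(lines[0::2], lines[1::2])
    return [(title, folder) for title, folder in pairs
            if query.lower() in title.lower()]
```

Calling `find_folders("book club")` from the repo root would then return every matching title together with its folder name.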

## Loading and Munging Data
Pick a chat you're interested in analyzing, and find its folder by searching `titles.txt` for the
chat's title. Then set `folder_name` in `constants.py` to that folder (which is really a
subdirectory of `messages/inbox`); the cell below imports it:

In [None]:
# NOTE: Please change the folder name in `constants.py`.
from constants import folder_name

Now follow the rest of this notebook (or **jump to the end** if you aren't interested in the data
munging process).

In [None]:
import os
import json


relative_prefix = "../messages/inbox"
message_filename = "message_1.json"
data_path = os.path.join(relative_prefix, folder_name, message_filename)

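# Parse the JSON straight from the file handle instead of reading it into a string first.
with open(data_path, "r", encoding="utf-8") as data_file:
    data = json.load(data_file)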

Okay, we've loaded the data. Let's get some quick facts about the chat:

In [None]:
def resolve(text):
    """Repair mojibake: the export stores each UTF-8 byte as its own code point."""
    # Map each code point back to a single byte, then decode the bytes as UTF-8.
    return text.encode('charmap').decode('utf8')

participants = [p["name"] for p in data["participants"]]
messages = data["messages"]

print(f"Chat Title: {resolve(data['title'])}")
print(f"Chat Type: {data['thread_type']}")
print(f"Message Count: {len(messages)}")
print()

for p in participants:
    print(f"{p}: {sum(1 for m in messages if m['sender_name'] == p)}")

Now is a good time to check that every participant has a name and that the chat type starts with
`Regular`. If either check fails, you may run into unexpected behavior.
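
These checks can also be automated. The sketch below is one possible way to do it; `chat_warnings` is a hypothetical helper, and the duplicate-name check is included only because the SQL schema later in this notebook rejects chats with duplicate names.

```python
def chat_warnings(data):
    """Return a list of human-readable warnings for a loaded chat dict."""
    warnings = []
    names = [p.get("name", "") for p in data.get("participants", [])]
    if any(not name.strip() for name in names):
        warnings.append("at least one participant has an empty name")
    if len(set(names)) != len(names):
        warnings.append("two or more participants share a name")
    thread_type = data.get("thread_type", "")
    if not thread_type.startswith("Regular"):
        warnings.append(f"unexpected chat type: {thread_type!r}")
    return warnings
```

An empty list from `chat_warnings(data)` means none of these checks flagged a problem.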

So what does the array of `messages` look like? Facebook doesn't tell us explicitly, but I've done
the legwork for you:

## Message Schema Notes
Each element in `messages` looks something like this:

- `sender_name` (keys into `participants`)
- `timestamp_ms` (timestamp in millis)
- `reactions`? (array, has `reaction` and `actor` keying into `participants`)
- `type` (one of these, value determines other top-level fields)
    - `Generic`
        - `content`? (text associated with message)
        - Top Level Media (zero or one of these, but maybe more?)
            - `photos` (array, has `uri` and `creation_timestamp` in seconds)
            - `videos` (array, has `uri` and `creation_timestamp` in seconds)
                - N.B. `photos` and `videos` also have a thumbnail, which I'm ignoring
            - `audio_files` (array, has `uri` and `creation_timestamp` in seconds)
            - `files` (array, has `uri` and `creation_timestamp` in seconds)
            - `gifs` (array, has `uri`)
            - `sticker` (has `uri`)
    - `Share`
        - `share`? (might not exist if unavailable)
            - `link` (a URI)
        - N.B. ignoring `content`, since it is often auto-generated
    - `Call`
        - `call_duration`
        - `missed`? (`call_duration` is `0` if `missed`)
    - `Subscribe`
        - `users` (array, has `name` keying into `participants`)
    - `Unsubscribe`
        - `users` (array, has `name` keying into `participants`)
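
To make the notes above concrete, here is a minimal sketch of reading one message: it converts `timestamp_ms` to a datetime and collects the media and link URIs. The helpers `sent_at` and `list_assets`, the `example` message, and the mapping of `files` to the label `other` are all assumptions made for illustration; the remaining labels are borrowed from the `assets` table described later in this notebook.

```python
from datetime import datetime, timezone

# Assumed mapping from top-level media keys to asset type labels.
MEDIA_KEYS = {
    "photos": "photo",
    "videos": "video",
    "audio_files": "audio",
    "files": "other",  # assumption: generic attachments have no dedicated label
    "gifs": "gif",
}


def sent_at(message):
    """Convert `timestamp_ms` (milliseconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(message["timestamp_ms"] / 1000, tz=timezone.utc)


def list_assets(message):
    """Return (asset_type, uri) pairs for a single message dict."""
    assets = []
    for key, asset_type in MEDIA_KEYS.items():
        for item in message.get(key, []):
            assets.append((asset_type, item["uri"]))
    if "sticker" in message:  # a single object rather than an array
        assets.append(("sticker", message["sticker"]["uri"]))
    link = message.get("share", {}).get("link")  # `share` may be missing
    if link:
        assets.append(("link", link))
    return assets


# A made-up message that follows the notes above.
example = {
    "sender_name": "Alice",
    "timestamp_ms": 1500000000000,
    "type": "Generic",
    "content": "look at this",
    "photos": [{"uri": "messages/inbox/example/photos/1.jpg",
                "creation_timestamp": 1500000000}],
}
```

Optional fields are read with `.get` throughout, since any field marked with `?` above may be absent.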

This format is relatively straightforward, but we'd prefer to normalize our data and store it
somewhere we can query easily. In this case, we'll load the data into a SQLite DB with the
following schema:

## SQL DB Schema (approximate, see `schema.py` for exact details)
Table: `user` (a unary relation defining the domain of participants; note that chats with duplicate
names have no disambiguating info, so they are rejected by this analysis framework)
- `name`: Text [PKEY]

Table: `messages`
- `id`: Integer (auto-increment) [PKEY]
- `sender`: User [FKEY]
- `timestamp`: Timestamp
- `content`: Text (nullable)
    - Also indexed by an auxiliary table with full-text search capability.
- (`sender`, `timestamp`) (unique)

Table: `reactions`
- `user`: User [FKEY]
- `message`: Message [FKEY]
- `reaction`: String
- (`user`, `message`) [PKEY]

Table: `assets`
- `id`: Integer (auto-increment) [PKEY]
- `message`: Message [FKEY]
- `type`: String
    - `photo`
    - `video`
    - `audio`
    - `gif`
    - `sticker`
    - `link`
    - `other`
- `path`: String (URI or local path)
- `timestamp`: Timestamp (nullable)
    - Specific to the asset.
- (`message`, `path`) (unique)

Table: `events`
- `id`: Integer (auto-increment) [PKEY]
- `actor`: User [FKEY]
- `timestamp`: Timestamp
- `type`: String
    - `call`
    - `subscribe`
    - `unsubscribe`
- `target`: User [FKEY] (nullable)
- `duration`: Integer (nullable)
- (`actor`, `timestamp`, `target`) (unique)

To avoid having composite foreign keys, we use a combination of IDs and unique constraints.
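
As an illustration of that last point, the sketch below builds three of the tables in an in-memory SQLite database. It is only an approximation written for this explanation (`schema.py` remains the source of truth); the column types and the integer timestamp are assumptions. Note how `reactions` references a message through its single-column `id`, while the (`sender`, `timestamp`) pair is still kept unique.

```python
import sqlite3

# Approximate DDL for three of the tables above; not the contents of `schema.py`.
SCHEMA = """
CREATE TABLE user (
    name TEXT PRIMARY KEY
);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    sender TEXT NOT NULL REFERENCES user(name),
    timestamp TIMESTAMP NOT NULL,
    content TEXT,
    UNIQUE (sender, timestamp)
);
CREATE TABLE reactions (
    user TEXT NOT NULL REFERENCES user(name),
    message INTEGER NOT NULL REFERENCES messages(id),
    reaction TEXT NOT NULL,
    PRIMARY KEY (user, message)
);
"""


def create_schema(conn):
    """Create the sketch tables on an open connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO user (name) VALUES (?)", ("Alice",))
cur = conn.execute(
    "INSERT INTO messages (sender, timestamp, content) VALUES (?, ?, ?)",
    ("Alice", 1500000000, "hi"),
)
# The reaction points at the message through its single-column id.
conn.execute(
    "INSERT INTO reactions (user, message, reaction) VALUES (?, ?, ?)",
    ("Alice", cur.lastrowid, "like"),
)
```

Inserting a second message from the same sender with the same timestamp raises `sqlite3.IntegrityError`, which is the unique constraint doing the work a composite key would otherwise do.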

I'll spare you the gory details of storing the data in SQL. Go ahead and run the next cell, which
will create and populate a SQLite file at `data/your_folder_name.db`. (This takes about 2 seconds
per 1000 messages.)

In [None]:
from constants import folder_name
from util import store_chat

# This cell works standalone: none of the cells above need to have run.
store_chat(folder_name)
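
Once the cell has finished, you may want to confirm that the database was written. The sketch below lists the tables in a SQLite file without assuming anything about the exact schema; `list_tables` is a hypothetical helper, and the path you pass in should be adjusted to wherever the `.db` file sits relative to your working directory.

```python
import sqlite3


def list_tables(db_path):
    """Return the sorted names of the tables stored in a SQLite file."""
    # Note: sqlite3.connect creates an empty database if the file does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "  # skip internal tables
            "ORDER BY name"
        )
        return [row[0] for row in rows]
    finally:
        conn.close()
```

For example, `list_tables(f"data/{folder_name}.db")` should include the tables described in the schema section, possibly alongside auxiliary tables used for full-text search.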