In [1]:
%load_ext memory_magics
# imports
import os
import random
from pathlib import Path

import numpy as np
import pandas as pd
import polars as pl

In [2]:
# consts
_data_root = Path(os.sep.join(["/", "Users", "toby.devlin", "dev", "projects", "data-processing", "data"]))
crime_path = (_data_root / "Crime_Data_from_2020_to_Present.csv").resolve()
reddit_path = (_data_root / "reddit_account_data.csv").resolve()
print(crime_path)
print(reddit_path)

np.random.seed(0)
random.seed(0)

/Users/toby.devlin/dev/projects/data-processing/data/Crime_Data_from_2020_to_Present.csv
/Users/toby.devlin/dev/projects/data-processing/data/reddit_account_data.csv


# What we are doing
We have 2 files, as printed above.
They have different schemas, and we want to run the same processing in several tools to compare performance. The tests themselves are contrived, but the data is large enough and "real" enough to reflect a real-world scenario without resorting to randomly generated data.

## Data Sourcing
- [Crime_Data_from_2020_to_Present.csv](Crime_Data_from_2020_to_Present.csv) -> https://catalog.data.gov/dataset/crime-data-from-2020-to-present
- [reddit_account_data.csv](reddit_account_data.csv) -> https://files.pushshift.io/reddit/

## The Test:
1. Read `Crime_Data_from_2020_to_Present.csv` into a frame, converting all datetime columns to the native datetime type: `Date Rptd`, `DATE OCC`
2. Filter this frame such that:
   1. Column `Vict Age` > 15
   2. Use only records in the quantiles 0.05 -> 0.95 for `LAT` and `LON`
3. Read `reddit_account_data.csv` into a frame, converting all datetime columns to the native datetime type: `created_utc`, `updated_on`
4. Filter this frame such that:
   1. None of the account names end with the number 7
5. Create a `group` column in the `account_data` frame, assigning each user a value drawn uniformly at random from the `crime_data.DR_NO` column
6. Join these two frames together on `account_data.group == crime_data.DR_NO`
7. Select only those who are NOT part of the following groups:
    1. `CC`
    2. `AA`
8. Group these by the first 2 letters of their name and calculate:
   1. The average time since reporting the crime
   2. The number of members in the group
   3. The users with the highest and lowest karma scores
   4. The average time since creating an account

## What we are monitoring
- Time taken, using the `timeit` module
- Memory usage, using [rusmux's ipython-memory-magics](https://github.com/rusmux/ipython-memory-magics), which should only measure the memory usage of the cell itself.
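
Outside a notebook, the same two measurements can be approximated with the standard library alone. The sketch below uses `timeit.repeat` for timing and, as a stand-in for the memory magics (not the same tool, and it only tracks allocations made through Python's allocator), `tracemalloc` for peak memory. The `workload` function is a hypothetical placeholder for one of the test steps.

```python
import timeit
import tracemalloc


def workload():
    """Hypothetical placeholder for a step of the test."""
    return sum(i * i for i in range(100_000))


# Time taken: run the workload 5 times, one call per run, and keep the
# fastest run as the least noisy estimate.
timings = timeit.repeat(workload, number=1, repeat=5)
best = min(timings)

# Memory usage: trace allocations made while building a list in memory.
tracemalloc.start()
data = [i * i for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"best of 5: {best:.4f} s")
print(f"current: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```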

# Before we start
We should look at the files themselves first: their size, shape and location on disk. TODO: also stream from S3.

| file                                |                       size |    records | columns |
|-------------------------------------|---------------------------:|-----------:|:--------|
| Crime_Data_from_2020_to_Present.csv |   168,553,714 bytes (161M) |    659,640 | 28      |
| reddit_account_data.csv             | 3,298,086,717 bytes (3.1G) | 69,382,538 | 6       |
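
Figures like those in the table can be gathered with a small helper along the following lines. This is a sketch rather than the method actually used to produce the table: `describe_csv` is a hypothetical name, it assumes a UTF-8 file with a single header row, and it is demonstrated on a throwaway temporary file instead of the real data.

```python
import csv
import tempfile
from pathlib import Path


def describe_csv(path: Path) -> dict:
    """Return size in bytes, record count (excluding header) and column count."""
    size_bytes = path.stat().st_size
    with path.open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)             # first row holds the column names
        records = sum(1 for _ in reader)  # streams rows, so memory stays flat
    return {"size_bytes": size_bytes, "records": records, "columns": len(header)}


# Demonstration on a tiny temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"a,b,c\n1,2,3\n4,5,6\n")
    info = describe_csv(sample)

print(info)
```

Counting rows through `csv.reader` rather than counting raw lines keeps the record count correct when quoted fields contain newlines, at the cost of a slower pass over multi-gigabyte files.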

Both of these files are relatively small, but we hope to show that some extra on-disk processing can speed things up.