# Anonymize JupyterHub Logs

This notebook extracts anonymized, publishable user session information from JupyterHub logs.

## Extract user session information from the log

We only care about server starts & stops, so we extract only the lines related to these from the JupyterHub log. We might pre-filter the log with something like `grep 'seconds to' jupyterhub.log > filtered-jupyterhub.log` to make processing faster - the Berkeley JupyterHub logfile for the Spring 2018 semester was 7 GB without this pre-filtering!

## Anonymize user names

User names should not leak, but we want to establish usage patterns for individual users across time. We accomplish this by hashing each username with an ephemeral secret salt (used as an HMAC key). The key lives only in memory, so a given user maps to the same hash throughout a single run of this notebook, but the hashes can't be correlated with other datasets that might be made in the future, including later runs of this notebook.
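As a minimal sketch of the keyed-hashing idea, assuming a helper named `anonymize` and the made-up user name `alice` (both are illustrative only), the snippet below shows that one key gives a stable pseudonym while a freshly generated key gives an unrelated one:

```python
import hashlib
import hmac
import secrets

# Ephemeral key: generated fresh in this process and never written to disk.
key = secrets.token_bytes(32)

def anonymize(username, key):
    """Return a keyed SHA-512 hash of username (hypothetical helper)."""
    return hmac.new(key, username.encode(), hashlib.sha512).hexdigest()

# Same user and same key give the same pseudonym, so activity can be linked.
first = anonymize('alice', key)
second = anonymize('alice', key)

# A different key, as a separate process would generate, gives an unrelated pseudonym.
other_key = secrets.token_bytes(32)
other = anonymize('alice', other_key)

print(first == second)
print(first == other)
```

Because the key is never persisted, nobody (including the person who ran the notebook) can later recompute which user name produced which hash.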

## Reduce data resolution

User activity timestamps are important for most analysis, but can also be used in attacks to de-anonymize users. To safeguard against this, we only provide timestamps with hourly resolution. This is good enough for most analysis at large scales.
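A small sketch of the resolution reduction, assuming a hypothetical helper `truncate_to_hour` and an invented sample timestamp. Note that this truncates to the start of the hour rather than rounding to the nearest one:

```python
from datetime import datetime

def truncate_to_hour(ts):
    """Drop minutes, seconds and microseconds from a datetime (hypothetical helper)."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Invented example timestamp: 21:47:13.25 on Jan 21, 2018
ts = datetime(2018, 1, 21, 21, 47, 13, 250000)
coarse = truncate_to_hour(ts)

print(coarse.isoformat())  # 2018-01-21T21:00:00
```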

## Eliminate periods of low activity 

If only a small number of user activities happen in any given hour, the risk of those users being de-anonymized becomes higher. For example, if you know via other channels that student A was active on the hub at 9 PM on Jan 21, 2018 (maybe they tweeted about it!), then from this dataset you can find their hashed user id & hence track their activity across time! We try to make this harder by eliminating data for the hours where fewer than `k` user activities happened. We set `k` to 5 by default.
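The thresholding can be sketched on a small in-memory list. The helper name `drop_sparse_hours` and the sample records are invented for illustration, and this version holds everything in memory, which would not suit a multi-gigabyte log:

```python
from collections import Counter

def drop_sparse_hours(records, k=5):
    """Keep only (hour, user) records whose hour has at least k records."""
    counts = Counter(hour for hour, _user in records)
    return [(hour, user) for hour, user in records if counts[hour] >= k]

# Invented sample: five activities at 21:00, only two at 22:00
records = [('2018-01-21T21:00:00', f'user{i}') for i in range(5)]
records += [('2018-01-21T22:00:00', 'user0'), ('2018-01-21T22:00:00', 'user1')]

kept = drop_sparse_hours(records, k=5)
print(len(kept))  # 5
```

As in the text above, the threshold counts activities, not distinct users, so a single busy user could still push an hour over `k`.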

## Ongoing process

This is a best effort in anonymizing usage data, and could use improvements! If you think of any, please let me know!

## Further reading

The Wikipedia article on [k-anonymity](https://en.wikipedia.org/wiki/K-anonymity) is pretty good. The [original paper](https://dataprivacylab.org/dataprivacy/projects/kanonymity/paper3.pdf) is also fairly readable!

In [1]:
import hashlib
import hmac
import json
import dateutil.parser
import secrets

# Generate an HMAC key for salting the username
# This is only kept in memory, so the mapping cannot be recomputed after this process dies
HMAC_KEY = secrets.token_bytes(32)

def parse_activity_line(line):
    """
    Parses a user server start/stop line from JupyterHub logs
    
    Returns a tuple of (timestamp, anonymized_username, action).
    
    timestamp is truncated to the hour for anonymization purposes.
    """
    lineparts = line.split()
    try:
        # Truncate all timestamp info to the hour to make it more anonymous
        ts = dateutil.parser.parse('{} {}'.format(lineparts[1], lineparts[2])).replace(minute=0, second=0, microsecond=0)
        user = lineparts[6].strip()
        userhash = hmac.new(HMAC_KEY, user.encode(), hashlib.sha512).hexdigest()

        action = lineparts[-1].strip()
    except IndexError:
        # Poor person's debugger!
        print(lineparts)
        raise
    return (ts, userhash, action)

In [2]:
def generate_session_data(infile_path, outfile_path, min_entries_per_hour=5):
    """
    Generate user session data from JupyterHub logs in infile_path

    min_entries_per_hour is the minimum number of activity entries that must
    be present in each hour for the hour to be included in the output.

    Assumes the log lines are in chronological order.
    """
    with open(infile_path) as infile, open(outfile_path, 'w') as outfile:
        def flush(hour, entries):
            # Write out a completed hour only if it has enough entries
            if len(entries) >= min_entries_per_hour:
                outfile.write('\n'.join(entries) + '\n')
            else:
                print(f'Skipped entries for {hour}: had fewer than {min_entries_per_hour} actions')

        current_hour_entries = []
        last_hour = None
        for l in infile:
            if 'seconds to' in l:
                timestamp, user, action = parse_activity_line(l)
                if last_hour is not None and timestamp != last_hour:
                    # The hour changed: flush the completed hour, then start a new one
                    flush(last_hour, current_hour_entries)
                    current_hour_entries = []
                last_hour = timestamp
                # Record this entry, including the first entry of a new hour
                current_hour_entries.append(json.dumps({'timestamp': timestamp.isoformat(), 'user': user, 'action': action}))
        # Flush the final hour, which has no later entry to trigger a flush
        if last_hour is not None:
            flush(last_hour, current_hour_entries)

In [3]:
# Generate usage data for Summer 2018
generate_session_data(
    '../data/private/user-starts-stops-june-12-2018.log',
    '../data/processed/user-starts-stops-june-12-2018.jsonl',
    5
)

FileNotFoundError: [Errno 2] No such file or directory: '../data/private/user-starts-stops-june-12-2018.log'

01-anonimize-hub-logs.ipynb
