In [None]:
import datatable as dt
from datatable import f
import pandas as pd

import numpy as np
from sklearn.metrics import roc_auc_score,roc_curve, auc, log_loss
from sklearn.linear_model import LogisticRegression

import gc

Maybe it's a bit late for most people, but I just came up with this trick and it's already significantly improving my life.

If this is helpful, feel free to either pull in the output directly, or tweak the code and rerun to generate what you need.

The training dataset is unwieldy because it's so large. What makes it even more unwieldy is that Pandas tends to take quite a few liberties with memory when processing data, seemingly making lots of copies in RAM during various intermediate steps. This is a problem because the training dataset is around 5 GB and we only get 16 GB of memory, so it fits into memory at most 3 times. Apparently Pandas tries for more than 3, since my notebooks constantly run out of RAM **and restart, losing any intermediate state** when I try to perform some calculation on the whole training dataset. Most of the things I'm trying to do ought to be O(N) time-wise and O(N) or constant space-wise, but again, Pandas seems to have other ideas.

Anyway, the solution is to **split the dataset into mostly-independent chunks**. And because most of the calculation and feature engineering I do is on a per-user basis (and I imagine this is true for most features you might want to engineer for this dataset), it makes sense to split it up such that **each chunk contains all the data for some subset of users**.

This way, I can perform my calculations independently on each chunk, Pandas will create its temporary overhead on just a small fraction of the data, and I can then manually aggregate the per-chunk results however I need to. *(in a way, I'm manually replicating a lot of how Dask does things, except Dask can't split on anything but an index... and can't have multi-index... and makes some assumptions about groupby)*
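
As a rough illustration of that pattern, here is a minimal sketch on a made-up six-row frame (the toy data and the two-way split are assumptions for the example; only the column names come from the competition data). Because every user's rows live in exactly one chunk, a per-user calculation done chunk by chunk should match the same calculation on the whole frame.

```python
import pandas as pd

# Toy stand-in for the training data; values are invented for illustration.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "answered_correctly": [1, 0, 1, 1, 1, 0],
})

# Two chunks, each holding every row for its subset of users.
chunks = [toy[toy["user_id"] <= 2], toy[toy["user_id"] > 2]]

# Per-user accuracy computed independently on each chunk, then stitched together.
per_chunk = [c.groupby("user_id")["answered_correctly"].mean() for c in chunks]
chunked_result = pd.concat(per_chunk)

# The same calculation on the whole frame, for comparison.
full_result = toy.groupby("user_id")["answered_correctly"].mean()
```

For per-user results a plain `pd.concat` is enough; results that span users (like the counts used in the regression example further down) need an explicit sum across chunks instead.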

# Load the training data

In [None]:
%%time
# load data via datatable: 
# much faster to ask datatable to load it, and then to translate to Pandas, than trying to get Pandas to load the .csv
train_dt = dt.fread("../input/riiid-test-answer-prediction/train.csv")
train_df = train_dt.to_pandas()

# Calculate per-user running total of rows

I sort the user ids in ascending order, so that I can select a range of users with `>` and `<=`.

I then get the cumulative number of training data rows for all users so far.

In [None]:
%%time
# get cumulative counts of users
user_counts = train_df.groupby('user_id').count()
user_counts.sort_index(inplace=True)
user_counts['total_rows'] = user_counts['timestamp'].cumsum()
user_counts[['timestamp','total_rows']]

# Find "break points"

I semi-arbitrarily decided I want roughly 500000 rows per chunk - this makes for about 200 chunks.

For each "break point" (multiple of 500000), I find the first user whose `total_rows` reaches or exceeds that break point.
All users after the previous "break point" user, up to and including this one, go into the current chunk.

(and don't forget the last chunk - all users after the last break point)
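
To make the break-point lookup concrete, here is a small sketch on invented numbers (the five users, their row counts and the chunk size of 6 are all hypothetical). It uses `Series.searchsorted`, which on a sorted series returns the position of the first value that is greater than or equal to the one searched for.

```python
import pandas as pd

# Toy per-user row counts, indexed by user_id in ascending order (made-up values).
rows_per_user = pd.Series([3, 4, 2, 5, 1], index=[10, 20, 30, 40, 50])
total_rows = rows_per_user.cumsum()  # running totals: 3, 7, 9, 14, 15

rows_per_chunk = 6  # hypothetical target chunk size for the toy data
break_users = []
for row_break in range(rows_per_chunk, int(total_rows.iloc[-1]), rows_per_chunk):
    # position of the first user whose running total is >= row_break
    pos = total_rows.searchsorted(row_break)
    break_users.append(int(total_rows.index[pos]))
```

With these numbers the break users are 20 and 40, so the chunks are users up to 20 (7 rows), users 30 to 40 (7 rows), and the leftover user 50 (1 row). Chunks overshoot the target a little because a user is never split across two chunks.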

In [None]:
%%time

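rows_per_df = 500000
total_rows = len(train_df)  # total number of training rows
df_split = []
prev_break_user = -1  # lower than any user_id, so the first chunk starts at the first user
for row_break in range(rows_per_df, total_rows, rows_per_df):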
    print(row_break)
    break_i = user_counts['total_rows'].searchsorted(row_break)
    break_user = user_counts.iloc[break_i].name
    %time part = train_df.loc[(train_df['user_id']<=break_user) & (train_df['user_id']>prev_break_user)]
    prev_break_user = break_user
    df_split.append(part)
    gc.collect()
# last bit
print('>',row_break)
part = train_df.loc[train_df['user_id']>break_user]
df_split.append(part)


In [None]:
del train_df
gc.collect()

In [None]:
%%time
for i, df in enumerate(df_split):
    df.to_csv(f'train_{i}.csv')

# Example use case

This example is based loosely on the idea of Learning Factor Analysis: that students get better at a skill as they practice it, and that this learning curve follows a specific power curve. See [this notebook](https://www.kaggle.com/yanamal/learning-factor-analysis-are-tags-skills) for details if you're interested.

For simplicity, in this example I'll just pretend that "answering all kinds of TOEIC questions" is a learnable skill. This is not particularly true in practice, but this is just an illustration of how I might deal with the split-out dataframes.

First, add a column to each dataframe to represent per-user "encounters" with the "skill" of answering questions.

In [None]:
%%time

def add_peruser_encounters(df):
    # 1-based running count of each user's rows so far (adds the column in place)
    df['encounters'] = df.groupby('user_id').cumcount()+1


for df in df_split:
    add_peruser_encounters(df)

Now I want to do logistic regression to predict the `answered_correctly` column, using the `encounters` column as input.

The simpler way to do this would be to pull out the relevant columns from each chunk and then stick them together using `np.concatenate`, e.g. `all_encounters = np.concatenate([df['encounters'] for df in df_split])`.

But in this case, the logistic regression actually runs out of memory! Also it's actually needlessly slow, since what we really have is lots of copies of the same few combinations of `answered_correctly` and `encounters` (there are only so many combinations possible!)

So instead, I count the number of occurrences of each combination; and then use those counts as weights for the regression.
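
The reason this loses nothing is that the log-likelihood a logistic regression optimizes is a sum over rows, so identical rows can be grouped and multiplied by their count. A minimal sketch of that identity, using only the standard library (the toy rows and the model parameters `w` and `b` are made-up numbers, not fitted values):

```python
from collections import Counter
import math

# Toy (encounters, answered_correctly) rows with plenty of repeats.
rows = [(1, 0), (1, 1), (1, 1), (2, 1), (2, 1), (2, 0), (3, 1), (3, 1)]

# Collapse duplicates: one entry per distinct combination, with its count.
counts = Counter(rows)

def p_correct(encounters, w=0.4, b=-0.2):
    """Hypothetical logistic model; w and b are arbitrary illustration values."""
    return 1.0 / (1.0 + math.exp(-(w * encounters + b)))

def nll(encounters, label):
    """Negative log-likelihood of a single row under the toy model."""
    p = p_correct(encounters)
    return -math.log(p if label == 1 else 1.0 - p)

# Total loss over every row vs. count-weighted loss over distinct combinations.
loss_all_rows = sum(nll(e, y) for e, y in rows)
loss_weighted = sum(n * nll(e, y) for (e, y), n in counts.items())
```

The two losses agree up to floating-point rounding, while the weighted version only touches 5 distinct combinations instead of 8 rows. On the real data the reduction is far larger.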

In [None]:
%%time

# sum up all the weights

def count_encounter_and_label_combinations(df):
    qs = df[df['answered_correctly']!=-1]  # skip lectures (I want to classify question answers)
    return qs.groupby(['encounters', 'answered_correctly'])['timestamp'].count()

weights = [count_encounter_and_label_combinations(df) for df in df_split]

# add up all the weights (I could also probably use the fold function, but I'm trying to keep it readable)
sum_weights = weights[0]
for i in range(1, len(weights)):
    # Series.add returns a new Series, so the result has to be assigned back
    sum_weights = sum_weights.add(weights[i], fill_value=0)
sum_weights

In [None]:
# delete intermediate variable, call garbage collection
del weights
gc.collect()

Now we can extract the relevant fields from this multi-index series (it's a bit finicky but doable) and the actual regression takes no time at all:
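
For anyone unfamiliar with multi-index series, here is a small sketch of the extraction on a toy series shaped like the summed counts (the four entries and their values are invented; the level names match the ones used in this notebook).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the summed counts, indexed by (encounters, answered_correctly).
idx = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (2, 0), (2, 1)],
    names=["encounters", "answered_correctly"],
)
toy_weights = pd.Series([5, 3, 2, 6], index=idx)

# One row per distinct combination: the feature and the label both come from the index...
X_toy = np.array(toy_weights.index.get_level_values("encounters")).reshape(-1, 1)
y_toy = np.array(toy_weights.index.get_level_values("answered_correctly"))
# ...while the series values say how many original rows each combination stands for.
w_toy = toy_weights.to_numpy()
```

The `reshape(-1, 1)` is there because scikit-learn estimators expect a 2-D feature matrix, even for a single feature.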

In [None]:
X = np.array(sum_weights.index.get_level_values('encounters')).reshape(-1,1)
y = sum_weights.index.get_level_values('answered_correctly')

In [None]:
%%time
l = LogisticRegression().fit(X,y,sample_weight=sum_weights)


Some sample results of the regression:

In [None]:
probabilities = l.predict_proba(np.array(sum_weights.index.get_level_values('encounters')).reshape(-1,1))
probabilities

In [None]:
# regression score: weighted mean accuracy (more is better)
l.score(X, y, sample_weight=sum_weights)

In [None]:
# another scoring metric (less is better)
# column 1 of predict_proba is the probability of the positive class (answered_correctly == 1)
log_loss(y, probabilities[:,1], sample_weight=sum_weights)

In [None]:
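# score with the positive-class probability, weighting each combination by its row count
fpr, tpr, thresholds = roc_curve(y, probabilities[:,1], sample_weight=sum_weights)
roc_auc = auc(fpr, tpr)
roc_auc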

Well, I mean... it's better than random?..