### Insights from other Notebooks and Discussions:

Important links (References):

Notebooks:
1. Introduction to the API: https://www.kaggle.com/sohier/competition-api-detailed-introduction
2. EDA + LGBM Starter: https://www.kaggle.com/isaienkov/riiid-answer-correctness-prediction-eda-modeling
3. CV benchmark with LGBM: https://www.kaggle.com/sishihara/riiid-lgbm-5cv-benchmark
4. Loading Large datasets: https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets

Discussion:
1. Incremental/online learning relevance: https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/190354
2. Loading dataset faster with cuDF (requires GPU): https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/190288
3. Private dataset questions (unanswered as of now): https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/190200
4. Target Leakage (Past / Future must be kept separate): https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/189437
5. Fancy Paper ideas (Advanced): https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/189398

Misc:
- Load the data as pickle, feather, tfrec (or) use DataTable, cuDF, Dask, Bquery
- The hidden test set contains *new users* but not new questions.
- The test data follows chronologically after the train data. The test iterations give interactions of users chronologically.
- Additional insights follow as I explore them.
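
As a rough illustration of why a binary format helps, the sketch below round-trips a tiny made-up frame through CSV and pickle. The frame and file names are invented for the example; the point is only that pickle restores the compact dtypes while CSV forces them to be re-inferred (or passed again via `dtype=`).

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for a chunk of train.csv (values are made up).
df = pd.DataFrame({
    "user_id": pd.Series([115, 115, 124], dtype="int32"),
    "answered_correctly": pd.Series([1, 0, 1], dtype="int8"),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl")

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)      # dtypes are re-inferred from text
    from_pkl = pd.read_pickle(pkl_path)   # dtypes are restored as saved

print(from_csv.dtypes.to_dict())
print(from_pkl.dtypes.to_dict())
```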


### Load necessary modules and data:

In [None]:
# Loading modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import riiideducation
import gc
import tqdm

%matplotlib inline

# Loading the API
env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
# check the files we have been given
!ls ../input/riiid-test-answer-prediction/

For now we subsample the dataset instead of reading all of it into memory. Let's try to understand the data we have been provided with before training on all of it.

We need to change the dtypes of certain columns and read only a subset of the data so that it fits into memory:

In [None]:
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes, nrows=int(1e6))

ques = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
lectures = pd.read_csv("../input/riiid-test-answer-prediction/lectures.csv")
ex_sub = pd.read_csv('../input/riiid-test-answer-prediction/example_sample_submission.csv')
ex_test = pd.read_csv('../input/riiid-test-answer-prediction/example_test.csv')

data.shape, ques.shape, lectures.shape, ex_sub.shape, ex_test.shape

In [None]:
# returns total bytes consumed
temp = data.memory_usage(deep=True).sum()

# bytes to MB
print (f"{temp / (2**20):.2f}", 'MB')
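
To get a feel for how much the narrower dtypes save, here is a small sketch on synthetic columns. The value ranges are assumptions chosen to fit `int16` and `int8`, not figures taken from the real data.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)

# Synthetic columns; NumPy creates them as 64-bit integers by default.
wide = pd.DataFrame({
    "content_id": rng.integers(0, 13_000, n),
    "answered_correctly": rng.integers(0, 2, n),
})

# Same values stored with the compact dtypes used in the `dtypes` dict.
narrow = wide.astype({"content_id": "int16", "answered_correctly": "int8"})

wide_mb = wide.memory_usage(deep=True).sum() / 2**20
narrow_mb = narrow.memory_usage(deep=True).sum() / 2**20
print(f"{wide_mb:.2f} MB -> {narrow_mb:.2f} MB")
```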

### EDA:

In [None]:
data.sample(5)

In [None]:
# row ID is basically redundant
(data.row_id == data.index).all()

#### From data description and discussions:

`timestamp`: The time in milliseconds between this user interaction (click and submit) and the first event COMPLETION from that user.

`prior_question_elapsed_time`: The average time in milliseconds it took the user to answer each question in the previous bundle. It is null for a user's first question bundle or lecture (NaN if the current question belongs to the first bundle or if the current row is a lecture).

`task_container_id`: Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

`prior_question_had_explanation`: Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.

#### Let's check the data for any one random user to help us understand these features better:

In [None]:
temp = data[data.user_id == np.random.choice(data.user_id.unique())].sort_values("timestamp")
print (temp.shape)
temp.head(10)

Timestamp:
- It's safe to say that timestamp = 0 is the first user interaction (COMPLETION)
- TimeStamp will give us the order in which the student has interacted.

Prior_ques_elapsed:
- Time taken on avg to complete ques on prev bundle
- Specifies how *quickly* the prev bundle questions were answered
- DOES NOT specify the order of attempts.
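
A tiny sketch of the ordering point: completion order (`timestamp`) need not match the order in which bundles were first shown (`task_container_id`). The three rows below are invented for illustration only.

```python
import pandas as pd

# Hypothetical interactions for one user (values invented for illustration).
# Container 1 was shown before container 2 but completed after it.
toy = pd.DataFrame({
    "task_container_id": [0, 1, 2],
    "timestamp": [0, 95_000, 60_000],
})

seen_order = toy.sort_values("task_container_id")["task_container_id"].tolist()
done_order = toy.sort_values("timestamp")["task_container_id"].tolist()

print("first seen:", seen_order)
print("completed :", done_order)
```

Sorting by `timestamp` is therefore the safer choice whenever past and future rows must be kept apart.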

From [here](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/189351):

"In rough terms you might want to look at the timestamp column to check whether a user worked on 10 questions a day or 10 questions per month and the `prior_question_elapsed_time` column to see if they need 1 second to respond or three minutes."

From [here](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/189465):

"`Task_container_id` captures the order that a user *first sees* tasks in, `timestamp` captures the order in which a user actually *completes* tasks. A user can start one task, start a second, and then finish the second before finishing the first -- this results in the later timestamp for the first task."

#### Let's merge the questions and lectures together with our sample subset for better understanding of the whole:

From [this](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/189298) discussion, the `part` and `tags` codes are the same for questions.csv and lectures.csv. There is some apparent confusion, though, since the tags don't seem to overlap between the two CSVs (as of now).

In [None]:
# combining questions and lectures together
ql = pd.concat([ques, lectures.rename({"lecture_id": "question_id"}, axis=1)], axis=0).reset_index(drop=True)

# overlap the tags from both columns
ql.tags = ql.tags.fillna(ql.tag)

# custom type of for questions
ql.type_of = ql.type_of.fillna("question")

# for distinguishing between lectures and questions
ql["content_type_id"] = ql["type_of"] != 'question'

# bundle id and correct ans & tags is filled with -1
# tag is missing for 1 row -> 10033
ql = ql.fillna(-1)

# drop the unneeded tag feature
ql = ql.drop("tag", axis=1)

# rename the column for easy merge
ql = ql.rename({"question_id": "content_id"}, axis=1)

# convert all the tags to list from string
ql.tags = ql.tags.apply(lambda x: [int(x)] if type(x) != str else list(map(int, x.split())))

# note that missing values are filled with -1 though
print ("Columns with missing values:", ql.isna().sum().index[ql.isna().sum() != 0].tolist())

# how does it look?
ql.head()

What does the `part` column mean?

It refers to the relevant section of the [TOEIC](https://www.iibc-global.org/english/toeic/test/lr/about/format.html) test.

1. Listening (Statements will not be printed):
    - Part 1: Photographs: Four short statements regarding a photograph will be spoken only one time. The statements will not be printed. Of these four statements, select the one that best describes the photograph and mark your answer on the answer sheet.
    
    - Part 2: Question-Response: Three responses to one question or statement will be spoken only one time. They will not be printed. Select the best response for the question, and mark your answer on the answer sheet.
    
    - Part 3: Conversations: Conversations between two or three people will be spoken only one time. Listen to each conversation and read the questions printed in the test book (the questions will also be spoken). Some questions may require responses related to information found in diagrams, etc. printed on the test book as well as what you heard in the conversations. There are three questions for each conversation.
    
    - Part 4: Talks: Short talks such as announcements or narrations will be spoken only one time. Listen to each talk and read the questions printed in the test book (the questions will also be spoken), select the best response for the question, and mark your answer on the answer sheet. Some questions may require responses related to information found in diagrams, etc. printed on the test book as well as what you heard in the talks. There are three questions for each talk.

2. Reading:
    - Part 5: Incomplete Sentences: Select the best answer of the four choices to complete the sentence, and mark your answer on the answer sheet. 

    - Part 6: Text Completion: Select the best answer of the four choices (words, phrases, or a sentence) to complete the text, and mark your answer on the answer sheet. There are four questions for each text. 

    - Part 7: Passages: A range of different texts will be printed in the test book. Read the questions, select the best answer of the four choices, and mark your answer on the answer sheet. Some questions may require you to select the best place to insert a sentence within a text. There are multiple questions for each text. 

In [None]:
# some might find reading section easier and listening tougher or vice versa
ql['reading_section'] = ql.part.isin([5, 6, 7]) # (5, 6, 7) => reading section (refer above)

In [None]:
# tags match for lectures and questions? (overlap)
(set(ql[ql.content_type_id].tags.explode().unique()) 
 == 
 set(ql[~ql.content_type_id].tags.explode().unique()))

In [None]:
# Return tags that exist only in one of the CSVs
np.setxor1d(
    ql[ql.content_type_id].tags.explode().unique(),
    ql[~ql.content_type_id].tags.explode().unique()
)

In [None]:
# `part` overlaps?
np.intersect1d(lectures.part.unique(), ques.part.unique()).shape

In [None]:
# content id overlap in lectures and questions
np.intersect1d(lectures.lecture_id.unique(), ques.question_id.unique()).shape

In [None]:
# it would make much more sense if we merged lectures and questions together
# we merge it on content_type_id since we need to differentiate between
# questions and lectures
temp = temp.merge(ql, on=['content_id', 'content_type_id'])
print(temp.shape)
temp.head()

Now that we have an overall understanding of what the features mean, let's switch back to performing EDA on the entire dataset:

In [None]:
# content_type_id denotes whether the interaction 
# is that of a lecture or a question
data.content_type_id.unique()

In [None]:
print (len(data))
data.nunique()

- Far fewer users than questions (which is expected)
- There are many more unique timestamps than unique prior_question_elapsed_time values. This is because the latter is the average time the user took to complete the previous bundle; in other words, it is the same for all questions in a particular question bundle for a particular user (see snippet below)

In [None]:
# at most one distinct prior_question_elapsed_time per bundle (0 when it is NaN)
(data.groupby(['user_id', 'task_container_id'])
 ['prior_question_elapsed_time']
 .nunique().values <= 1).all()

In [None]:
# No overlap between the content id for varying task containers?
data.groupby("task_container_id")['content_id'].nunique().sum() == data['content_id'].nunique()

In [None]:
# is the timestamp increasing monotonically for all users?
data.groupby("user_id")['timestamp'].is_monotonic_increasing.all()

Describe function to see the overall stats:

In [None]:
data.describe()

#### Some visualizations to help us better:

In [None]:
ax = data.timestamp.plot(kind='hist', figsize=(10, 5), title='TimeStamp Distribution (log Scale)')
ax.set(yscale="log", xlabel='Timestamp in milliseconds');

There's an exponentially decreasing trend. However we need per user stats to reliably tell anything about app popularity conclusively.

In [None]:
ax = (data.groupby("user_id")['timestamp'].max()
      .plot(kind='hist', figsize=(10, 5),
            title='User Retention (Since they first start using the app)'))
ax.set(yscale='log', xlabel='Milliseconds (max - min) timestamp');

The number of users still active a given time after their first interaction indeed decreases *exponentially* over time.

Let's now visualize the avg time taken to solve questions (we can use prior_question_elapsed_time since it does the same thing we want):

In [None]:
ax = ((data.groupby("user_id")['prior_question_elapsed_time'].mean() / 1000)
      .plot(kind='hist', figsize=(10, 5), xticks=range(0, 101, 10)))
ax.set(title='Avg time taken to solve questions (User Avg)', xlabel='Time Elapsed in Seconds');

Any insights and correlations between timestamp, prior_question_elapsed_time and correct answer?

In [None]:
temp = data.sample(frac=0.2).merge(ql, on=['content_id', 'content_type_id'], how='left')

(temp.plot(kind='scatter', x='timestamp', y='prior_question_elapsed_time', 
           c='answered_correctly', cmap='jet', figsize=(10, 5), alpha=.8));

There doesn't seem to be any. This somewhat makes sense since each user is unique, and our predictions should be tailored to each user. Some users may do better on `reading_section` questions, others on the listening section, and so on. Let's try to visualise the users individually:

In [None]:
f, ax = plt.subplots(nrows=3, ncols=2, figsize=(20, 20))
ax = ax.ravel()
for i in range(6):
    temp = data[data.user_id == np.random.choice(data.user_id.unique())].sort_values("timestamp")
    temp = temp.merge(ql, on=['content_id', 'content_type_id'])
    sc = temp.plot(kind='scatter', x='timestamp', y='prior_question_elapsed_time', 
              ax=ax[i], s=temp.part*15, c='answered_correctly', cmap='jet', alpha=0.6,
              title=f"ID: {temp.user_id.values[0]}\
\nScore avg: {temp.answered_correctly.mean():.2f} | Count: {len(temp)}")

- There are some users who use the app sporadically at periodic intervals and there are users who use the app on a regular basis
- Maybe we could group users together using an embedding or a simple kmeans clustering (for a later period)

##### We will do much more EDA at a later time.

#### We now try to understand `ex_sub` & `ex_test` csv files:

These files are provided as samples of what `env.iter_test` yields. Each call gives us only a small batch. We need to make predictions on it with our model and submit them with `env.predict` before we can request the next batch. *This is done so as to mimic real life scenarios where the future data is not available for model training.*

From data description:

- `prior_group_responses` (string) provides all of the user_answer entries for previous group in a string representation of a list in the **first row** of the group. **All other rows in each group are null**. If you are using Python, you will likely want to call eval on the non-null rows. Some rows may be null, or empty lists.

- `prior_group_answers_correct` (string) provides all the answered_correctly field for previous group, with the *same format and caveats as prior_group_responses*. Some rows may be null, or empty lists.

A more thorough understanding can be obtained from this post [here](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/190430) by Alex:

     Once you submitted your predictions for a group, you get the next test_df off the iterator and that immediately tells you whether you were right or not. You can use this information to improve your model before continuing with going through the test set, or you can just ignore it.
    As you can't submit predictions for the same group twice, you can't cheat with it. It's just meant to be used for improving your prediction algorithm as you get more information, as is typical for realtime applications.
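
The predict-then-advance rule can be sketched with a toy stand-in. `MockEnv` below is hypothetical: it only mimics the behaviour described in the quote and is not the real `riiideducation` module.

```python
# Toy stand-in for the competition API; MockEnv is hypothetical and only
# mimics the predict-then-advance rule, not the real riiideducation module.
class MockEnv:
    def __init__(self, groups):
        self._groups = groups            # one list of true labels per group
        self._awaiting_predict = False

    def iter_test(self):
        prior = []                       # labels of the previous group
        for labels in self._groups:
            if self._awaiting_predict:
                raise RuntimeError("call predict() before requesting the next group")
            self._awaiting_predict = True
            yield {"n_rows": len(labels), "prior_group_answers_correct": prior}
            prior = labels

    def predict(self, predictions):
        # The real API scores the predictions; here we only unlock the iterator.
        self._awaiting_predict = False


env = MockEnv([[1, 0], [1, 1, 0], [0]])
seen = []
for batch in env.iter_test():
    seen.append(batch["prior_group_answers_correct"])
    env.predict([0.5] * batch["n_rows"])

print(seen)
```

Each batch carries the true answers of the group before it, which is the information an online model can learn from.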

In [None]:
# example input from the API looks like:
ex_test.head()

- Only the first row of each group would contain the answers and scores. The rest of the rows are all null.
- We are *NOT* provided with the `user_answer` for the rows we have to predict. We only receive that information with the next batch, along with whether the user's answers were indeed correct. If this had not been the case, we could simply compare with the `questions.csv` and be able to perfectly predict if the users were correct ;)
- During the prediction time we only have features such as the timestamp, question meta data and info regarding prior groups response.
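
A minimal sketch of reading the prior-group columns from the first row of a batch. The batch is invented for illustration, and `parse_first_row` is a hypothetical helper; it uses `ast.literal_eval` in place of the `eval` mentioned in the data description, since it only accepts Python literals.

```python
import ast

import pandas as pd

# Toy batch shaped like example_test.csv (values invented for illustration).
batch = pd.DataFrame({
    "row_id": [10, 11, 12],
    "prior_group_answers_correct": ["[1, 0]", None, None],
    "prior_group_responses": ["[3, 1]", None, None],
})

def parse_first_row(frame, column):
    """Return the list stored as a string in the first row, or [] if it is null."""
    value = frame[column].iloc[0]
    return ast.literal_eval(value) if isinstance(value, str) else []

correct = parse_first_row(batch, "prior_group_answers_correct")
responses = parse_first_row(batch, "prior_group_responses")
print(correct, responses)
```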

In [None]:
# example prediction to the API must look like:
ex_sub.head()

- For submission, we only pass in row_id, predictions. Group_num although present here is not required for submission. (check 'Making Our Predictions' part)

Some more insights about the time series API testframes:
- Batches are not all the same shape; each batch may have a different number of samples
- From this discussion [here](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/190698): Every test_iter will only have one group number -> The first row is a list in str format (could be an empty list too).

### Data Cleaning / Pipeline

Also from [this](https://www.kaggle.com/dwit392/riiid-challenge-time-since-last-action-for-test) notebook it is said that time elapsed between the last interaction and current one is a good predictor of the answer correctness which would make sense since that time could be used by a student to prepare before taking up the next test. However with data scattered around and us loading only a tiny fraction of the actual dataset, creating this feature would prove really difficult (for a later time).
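
For reference, on data that is fully in memory and sorted per user, the feature itself is a one-liner. The sketch below uses invented rows and a hypothetical column name `time_since_last`; the hard part mentioned above (carrying each user's last timestamp across batches) is not covered here.

```python
import pandas as pd

# Toy interactions, already sorted by user and timestamp (invented values).
toy = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "timestamp": [0, 30_000, 90_000, 0, 5_000],
})

# Milliseconds since the same user's previous interaction; NaN for the first one.
toy["time_since_last"] = toy.groupby("user_id")["timestamp"].diff()
print(toy)
```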

Let's now write a function that when given testframe_1 and testframe_2 returns testframe_1 with `user_answer` and `answered_correctly` columns merged to it:

In [None]:
def post_process(fn0, fn1=None):
    '''
    fn0 is the test dataframe at time t
    fn1 is the test dataframe at time t + 1
    
    If fn1 is not provided (e.g. for the last batch), the answers for fn0
    are unknown, so the two answer columns are filled with NaN.
    '''
    
    # since group_num is set as index, we would lose it otherwise
    base = fn0.iloc[:, :-2].reset_index()
    
    if fn1 is None:
        answers = pd.DataFrame(np.nan, index=base.index, columns=fn0.columns[-2:])
    else:
        # using eval to obtain the list values (including empty lists)
        # np.int was removed from NumPy, so the builtin int is used instead
        answers = fn1.iloc[0, -2:].apply(lambda x: pd.Series(eval(x), dtype=int)).T
    
    return pd.concat([base, answers], axis=1)

#### Let's now load the data from the API to understand it; we will disable this when we wish to make a submission:

It will also help us check that our `post_process` function works as expected.

In [None]:
# Do not set to False if you wish to submit,
# since iter_test can only be called once
SUBMIT = True

In [None]:
if not SUBMIT:
    batches = []
    prev_batch = next(iter_test)
    while True:
        try:
            env.predict(prev_batch[1]) 
            batch = next(iter_test)
            batches.append(post_process(prev_batch[0], batch[0]))
            prev_batch = batch
        except StopIteration:
            batches.append(post_process(prev_batch[0]))
            print (f"All sample test batches exhausted after {len(batches)} iterations")
            break

    print ("Batch sizes for each test sample:", list(map(lambda x: x.shape[0], batches)))

The post-processed frame can then be used to help our model learn about a student's behaviour in real time during submission. And since we are able to learn from the test data as well, it's best to create an *online / incremental model* for this competition.


#### Evaluation metric:

Let's understand the evaluation metric, `roc_auc_score`. It measures how well the predictions rank correctly answered rows above incorrectly answered ones, so random (or constant) predictions yield a score of about 0.5. If our score is less than 0.5, the model does worse than a random guess, which usually means we have made some mistake in our predictions (for example, wrongly labeled the data). A score of 1 means the ranking is perfect; a score of 0 means the ranking is perfectly inverted.
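
To make the ranking view concrete, here is a small pure-Python sketch. `pairwise_auc` is a hypothetical helper written for this illustration (it is not part of scikit-learn): it computes the share of (correct, incorrect) pairs in which the correct row receives the higher score, counting ties as half.

```python
from itertools import product

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

y = [1, 0, 1, 1, 0]
print(pairwise_auc(y, [0.9, 0.1, 0.8, 0.7, 0.2]))  # perfect ranking   -> 1.0
print(pairwise_auc(y, [0.1, 0.9, 0.2, 0.3, 0.8]))  # inverted ranking  -> 0.0
print(pairwise_auc(y, [0.5] * 5))                  # constant scores   -> 0.5
```

Only the ordering of the scores matters, which is why any constant prediction lands at 0.5.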

Let's verify this with some constant guesses:

In [None]:
# Predicting constant values for the entire train dataset.
from sklearn.metrics import roc_auc_score
temp = data.loc[~data.content_type_id, "answered_correctly"]
for value in [0, 1, 0.5, temp.mean()]:
    # np.full (float) instead of np.full_like, which would cast 0.5 to the int8 dtype of temp
    print ("At {:.2f} the score is: {:.3f}".format(value, roc_auc_score(temp, np.full(len(temp), value))))

A highly robust metric indeed. So we have to be a bit smarter in making predictions to beat this. What about random predictions *per question*?

In [None]:
roc_auc_score(temp, np.random.rand(len(temp)))

This does marginally (very marginally) better than the previous predictions. Our next naive idea is to use the per-user mean accuracy as predictions, but before we do that we need to do *something else*.

Let's now write a function to split the data to train/val as reliably as possible to mimic the test case scenario. Further it should also function as a CV generator, given some train ids:

In [None]:
def generate_train_val(Train, train_users=None, tp=.70, vp=None, put=None): 
    '''
    What we already know:
    - Test dataset has new users but no new questions. 
    - Test follows chronologically after train
    
    * Parameters *  
    train_users -> The user IDs used entirely for training (If None, generated using tp)
    tp          -> train users percentage (completely used for training)
    vp          -> Validation users percentage, completely used for val 
                   (splits the remaining users into val users | partial users)
                   If None, randomly chosen
                 
    put         -> timestamp quantile threshold for partial users, above which 
                   the row goes to the validation dataset
                   If None, the value chosen is the same as `vp`
    
    * Output *
    Returns a new dataframe with `train` column added for val/train split
    
    * Note *
    1. Partial users are those users whose data is used for training and validation
    2. Percentage of train data is always greater than tp
    3. Exact percentage -> tp + 
    '''   
    
    # we will be tinkering with it, so it is best to copy it 
    # beware of passing in large sized DataFrames
    train = Train.copy()
    
    total_users = train.user_id.unique()
    
    if train_users is None:
        train_users = np.random.choice(total_users, int(len(total_users) * tp), replace=False)
        
    if vp is None: # validation percent 
        vp = np.random.choice(np.linspace(.20, .80, 13))
    
    if put is None: # threshold for timestamp cutoff
        put = vp
            
    # partial users percentage
    pp = 1 - vp
    
    remaining_users = np.setdiff1d(total_users, train_users)
    val_users = np.random.choice(remaining_users, int(len(remaining_users) * vp), replace=False)
    partial_users = np.setdiff1d(remaining_users, val_users)
    
    user_val_cutoff = (train[train.user_id.isin(partial_users)]
                       .groupby("user_id")['timestamp']
                       .apply(lambda x: np.quantile(x, put).astype(int))
                       .rename("val_cutoff"))
    
    train = train.merge(user_val_cutoff, on='user_id', how='left')
    train['val_cutoff'] = train['val_cutoff'].fillna(0)
    
    train['train'] = (
        train.user_id.isin(train_users) | 
        (train.user_id.isin(partial_users) & (train['timestamp'] < train['val_cutoff']))
    )
    
    train = train.drop(["val_cutoff"], axis=1) 
    
    return train

How does the above function work? Let's see the split via a pie chart:

In [None]:
temp = generate_train_val(data.copy())
print (data.shape, "->", temp.shape)
(temp.train.astype(int).value_counts()
 .plot(kind='pie', autopct=lambda x: f"{int(x)}%", 
       title='Train/Val Split',
       colors=['r', 'g'], explode=[.1, .15],
       labels=["Train", "Validation"]));
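
The per-user cutoff applied to partial users can also be seen in isolation. The sketch below reproduces only that step on an invented frame with a fixed `put`; it is a simplified illustration, not a replacement for `generate_train_val`.

```python
import pandas as pd

# Toy "partial" users (invented values); `put` is the quantile cutoff.
toy = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2, 2, 2],
    "timestamp": [0, 10, 20, 30, 0, 100, 200, 300],
})
put = 0.5

# One cutoff per user: the `put` quantile of that user's timestamps.
cutoff = (toy.groupby("user_id")["timestamp"]
          .quantile(put)
          .rename("val_cutoff")
          .reset_index())
toy = toy.merge(cutoff, on="user_id", how="left")

# Rows before the user's cutoff go to train, the rest to validation.
toy["train"] = toy["timestamp"] < toy["val_cutoff"]
print(toy)
```

Each user's earlier interactions land in train and the later ones in validation, which mirrors the chronological train/test relationship.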

### Data Modelling:

#### Submission Idea 1: 

We create a dataframe containing all the users we see along with their average accuracy. We use this average accuracy as our predictions. 

In [None]:
folds = 5
total_users = data.user_id.unique()
np.random.shuffle(total_users)
val_size = len(total_users) // folds
scores = {}

for i in range(folds):
    train_users = np.setdiff1d(total_users, total_users[(i)*val_size:(i+1)*val_size])
    temp = generate_train_val(data, train_users)
    temp = temp[~temp.content_type_id]
    
    user_df = (temp[temp.train].drop("train", axis=1)
           .groupby("user_id")['answered_correctly']
           .agg(Mean='mean', Count='count', Sum='sum'))

    pred = (temp.loc[(~temp.train)]
     .merge(user_df, on='user_id', how='left'))

    actual = pred['answered_correctly']
    
    scores['Off'] = scores.get("Off", []) + [roc_auc_score(actual, pred['Mean'].fillna(0.5))]

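    # running mean over *earlier* answers only: cumsum() includes the current
    # row, so subtract it to avoid leaking the label being predicted
    y = pred['answered_correctly'].astype('int64')  # guard against int8 overflow
    grp = y.groupby(pred['user_id'])
    pred = (
        (grp.cumsum() - y + pred['Sum'].fillna(0)) / 
        (grp.cumcount() + pred['Count'].fillna(0)) 
    ).fillna(0.5)  # no history yet -> neutral prediction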
    
    scores['On'] = scores.get("On", []) + [roc_auc_score(actual, pred)]

print ("Scores after {} folds (User Mean Acc):\n{}\nWithout Online Learning: {:.2f} @ {:.2f} std\
\nWith Online Learning: {:7.2f} @ {:.2f} std"
       .format(folds, "="*40, np.mean(scores['Off']), np.std(scores['Off']), 
               np.mean(scores['On']), np.std(scores['On'])))

The above idea's score on public LB:
1. Offline Naive Learning: `0.622` (ran for 30 min)
2. Online Naive Learning: `0.634` (ran for 1 hr)

#### Submission Idea 2: 

We create a dataframe containing all the questions we see along with how accurate students usually answer them. We use this average accuracy as our predictions. 

`Note`: Even though the test set is said to contain no new questions, online learning is still worthwhile: a question that starts out tough may, with increased student practice, cease to be difficult over time.

In [None]:
folds = 5
total_users = data.user_id.unique()
np.random.shuffle(total_users)
val_size = len(total_users) // folds
scores = {}

for i in range(folds):
    train_users = np.setdiff1d(total_users, total_users[(i)*val_size:(i+1)*val_size])
    temp = generate_train_val(data, train_users)
    temp = temp[~temp.content_type_id]
    
    content_df = (temp[temp.train].drop("train", axis=1)
           .groupby("content_id")['answered_correctly']
           .agg(Mean='mean', Count='count', Sum='sum'))

    pred = (temp.loc[(~temp.train)]
     .merge(content_df, on='content_id', how='left'))

    actual = pred['answered_correctly']
    
    scores['Off'] = scores.get("Off", []) + [roc_auc_score(actual, pred['Mean'].fillna(0.5))]

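    # running mean over *earlier* answers only: cumsum() includes the current
    # row, so subtract it to avoid leaking the label being predicted
    y = pred['answered_correctly'].astype('int64')  # guard against int8 overflow
    grp = y.groupby(pred['content_id'])
    pred = (
        (grp.cumsum() - y + pred['Sum'].fillna(0)) / 
        (grp.cumcount() + pred['Count'].fillna(0)) 
    ).fillna(0.5)  # no history yet -> neutral prediction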
    
    scores['On'] = scores.get("On", []) + [roc_auc_score(actual, pred)]

print ("Scores after {} folds (Content Mean Acc):\n{}\nWithout Online Learning: {:.2f} @ {:.2f} std\
\nWith Online Learning: {:7.2f} @ {:.2f} std"
       .format(folds, "="*40, np.mean(scores['Off']), np.std(scores['Off']),
               np.mean(scores['On']), np.std(scores['On'])))

The above idea's scores on Public LB:
1. Offline Naive Learning: `0.705` (ran for 20 min)
2. Online Naive Learning: `0.705` (ran for 30 hr)

### Making our predictions:

1. For implementing Idea 1 simply change: `col = 'user_id'` in the snippet below.
2. For implementing Idea 2 simply change: `col = 'content_id'` in the snippet below.

We use Idea 2 for now.

In [None]:
col = 'content_id'

temp = (pd.read_feather(
    "../input/riiid-train-data-multiple-formats/riiid_train.feather", 
    columns=[col, 'content_type_id', 'answered_correctly']))

# filter out only the questions
temp = temp[temp.content_type_id == 0]
temp = temp.drop("content_type_id", axis=1)

temp = (temp.groupby(col)['answered_correctly'].agg(Mean='mean', Count='count', Sum='sum'))

print (f"Memory Usage: {temp.memory_usage(deep=True).sum() / (2**20):.2f}", 'MB')
print ( "Length      :", len(temp))
temp.head()

In [None]:
prev_batch = next(iter_test)
called, predicted = 1, 0

while True:
    try:
        
        ## =========================================================== ##
        ## =================== Make the predictions ================== ##
        ## =========================================================== ##
        
        q_mask = prev_batch[0]['content_type_id'] == 0
        total = prev_batch[0].loc[q_mask, col].unique()
        missing = np.setdiff1d(total, temp.index)
        present = np.setdiff1d(total, missing)
        
        pred = prev_batch[0].loc[q_mask, ['row_id', col]].merge(
            temp.loc[present, 'Mean'],            
            on=col, how='left')
        
        pred['answered_correctly'] = pred['Mean'].fillna(0.5)
        pred = pred[['row_id', 'answered_correctly']]
        
        ## =========================================================== ##
        ## ================ API Calls (Do not Change) ================ ##
        ## =========================================================== ##
        
        env.predict(pred) 
        predicted += 1
        
        batch = next(iter_test)
        called += 1
        
        # Comment the line below for STATIC learning
        processed_batch = post_process(prev_batch[0], batch[0])
        
        prev_batch = batch
        
        ## =========================================================== ##
        ## ========= Update the stats table (Online Learning) ======== ##
        ##  Comment Code block below if you wish for Offline learning  ##
        ## =========================================================== ##
        
        # We retain only those rows that are relevant
        processed_batch = processed_batch[q_mask.reset_index(drop=True)]
        
        pb_grp = (processed_batch.groupby(col)['prior_group_answers_correct']
                  .agg(Mean='mean', Count='count', Sum='sum'))

        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        temp = pd.concat([temp, pb_grp.loc[missing]])
        temp.loc[present] = temp.loc[present] + pb_grp.loc[present]
        temp.loc[present, 'Mean'] = temp.loc[present, 'Sum'] / temp.loc[present, 'Count']
        
        ## =========================================================== ##
        
    except StopIteration:
        print (f"All sample test batches exhausted after {predicted} iterations")
        break
        
assert predicted == called

Offline learning scores lower than online learning in our CVs. We observe the same trend on the LB as well (although the gap is much smaller than the one observed in our CV). 
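
The online update itself boils down to adding newly revealed sums and counts to a running table. The sketch below shows one possible way to do that with `DataFrame.add(fill_value=0)`; the numbers are invented, and this is an alternative to the update block in the loop above, not the code that produced the scores.

```python
import pandas as pd

# Running per-question statistics (invented values), indexed by content_id.
stats = pd.DataFrame({"Sum": [8, 1], "Count": [10, 4]}, index=[100, 200])
stats.index.name = "content_id"

# Answers revealed for the previous group: one known and one unseen question.
revealed = pd.DataFrame({
    "content_id": [100, 100, 300],
    "answered_correctly": [1, 0, 1],
})
update = (revealed.groupby("content_id")["answered_correctly"]
          .agg(Sum="sum", Count="count"))

# Align on content_id; questions missing on either side contribute zero.
stats = stats.add(update, fill_value=0)
stats["Mean"] = stats["Sum"] / stats["Count"]
print(stats)
```

Because the addition aligns on the index, known and previously unseen questions are handled in one step.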

The function to generate CV splits and the prediction pipeline may be buggy (although they execute without any error). If you find any bugs, kindly post them in the comments below. Thank you for reading :)