# Introduction

This notebook presents my final solution for the Riiid competition.

As I was limited in time, I decided to keep things simple and go with a single LightGBM model plus a lot of feature engineering.

Credit to Steve Majou for his notebook [Simple Elo Rating](https://www.kaggle.com/stevemju/riiid-simple-elo-rating), which gave me a very nice boost.

# Plan

## 1. Model and feature importance

## 2. Feature exploration

## 3. Prediction pipeline

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pickle
import riiideducation
from tqdm import tqdm 
import glob
import pandas as pd
import numpy as np

import plotly.graph_objects as go

# Model and feature importance

I wanted to get through the competition with a single model.

The model is a simple LightGBM classifier trained on 22 million rows and tested on 3 million rows.
My final model uses 38 features, but the last few bring very little gain. If I had to make a trade-off, I could probably have kept only 30 features without losing much accuracy.

In [None]:
metrics = [
    'questions_part',
    'questions_difficulty',
    'questions_count',
    'questions_good_answer_reaction_time',
    'questions_bad_answer_reaction_time',
    'questions_batch_size',
    'questions_variance',
    'questions_difficulty_given_prev_explanation',
    'questions_difficulty_given_prev_no_explanation',
    'questions_conditionnal_proba_cluster',
    'lectures_count',
    'student_general_elo_theta',
    'student_general_elo_proba',
    'student_part_elo_proba',
    'student_tags_elo_theta',
    'student_tags_elo_proba',
    'student_part_score',
    'timestamp_student_general_last_batch_average',
    'timestamp_student_general_reactivity_rolling_19',
    'timestamp_student_general_part_reactivity_average',
    'timestamp_student_general_reactivity_good_answer_average',
    'timestamp_last_delta_between_questions',
    'timestamp_delta_between_questions_rolling_10',
    'timestamp_delta_between_questions_rolling_5',
    'timestamp_last_time_between_answers',
    'student_general_good_answers_average',
    'student_general_good_answers_rolling_2',
    'student_general_good_answers_rolling_6',
    'student_part_good_answers_rolling_6',
    'questions_seen_with_explanation',
    'questions_conditionnal_count',
    'questions_conditionnal_proba',
    'lectures_conditionnal_proba',
]

In [None]:
with open('../input/lgb-training/model.pkl', 'rb') as f:
    lgb = pickle.load(f)

## Feature importance, lgb

Below is a plot of the feature importances reported by the LightGBM classifier.

In [None]:
lgb_importance = pd.Series(lgb.feature_importances_, index = metrics).sort_values()

fig = go.Figure(
    go.Bar(
        y = lgb_importance.index,
        x = lgb_importance.values, 
        orientation='h'
    )
)
fig.update_layout(
    title = 'LGB feature importance',
    template = 'presentation', 
    height =1000,
    margin=dict(
        l=400,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
)
fig.update_yaxes(tickfont = dict(size=13))
fig.show()

## Average contribution by theme

In [None]:
grouper = [elmt.split('_')[0] for elmt in lgb_importance.index]
category_importance = lgb_importance.groupby(grouper).mean().sort_values(ascending = False)

fig = go.Figure(
    go.Bar(
        x = category_importance.index,
        y = category_importance.values, 
    )
)
fig.update_layout(
    title = 'LGB feature importance, grouped by category',
    template = 'presentation', 
    height =400,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
)
fig.update_yaxes(tickfont = dict(size=13))
fig.show()

## Average contribution by "Tag, Part, General"

In [None]:
def sub_category(name):
    # Priority tags > part > general, so a feature matching several keywords is counted once
    if 'tag' in name:
        return 'tags'
    if 'part' in name:
        return 'part'
    return 'general' if 'general' in name else 'other'

# Build the labels from lgb_importance.index (sorted), not from metrics, to keep them aligned
sub_cat = [sub_category(elmt) for elmt in lgb_importance.index]
category_importance_2 = lgb_importance.groupby(sub_cat).mean().loc[['general','part','tags']]
category_importance_2 = category_importance_2.sort_values(ascending = False)

fig = go.Figure(
    go.Bar(
        x = category_importance_2.index,
        y = category_importance_2.values, 
    )
)
fig.update_layout(
    title = 'LGB feature importance, grouped by sub category',
    template = 'presentation', 
)
fig.update_yaxes(tickfont = dict(size=13))
fig.show()

## 2 - Feature Creation

In this part I explain in more detail how I calculated some of the features.
I am using a sample of 1 million rows here to keep things light, but for my own inference I used the full dataset.

In [None]:
train = pd.read_csv('../input/riiid-test-answer-prediction/train.csv', nrows = 1000000, index_col = 0)
questions = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
lectures = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')

### General features about questions

Nothing fancy in this part: I used generic pandas functions to compute some general statistics about questions (difficulty, avg_reaction_time, etc.).

I then saved each statistic in its own dictionary, which I could call when needed.

What could have been done better: storing all the statistics in one general dictionary rather than having one dictionary for each.

#### Question difficulty

Question difficulty is simply the average rate of correct answers across users, reversed so that difficult questions have a high score and easy questions a low one. 
I made this inversion because I later use an asymmetric score function that rewards users who answer hard questions correctly and penalizes those who fail easy questions.

Possible improvement: a threshold system for questions whose difficulty is too close to 0 or to 1. Also, drop the score for questions without enough statistical significance (too few people answering them).
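
A minimal sketch of that possible improvement, in plain Python on made-up answers. The function name `question_difficulty`, the minimum answer count `min_count` and the clipping bounds `low` and `high` are assumptions for illustration; none of these values come from the notebook.

```python
from collections import defaultdict

def question_difficulty(answers, min_count=3, low=0.05, high=0.95):
    """answers: iterable of (content_id, answered_correctly) pairs.

    Returns {content_id: difficulty}, where difficulty = 1 - mean(correct),
    clipped to [low, high]. Questions answered fewer than min_count times
    are left out. min_count, low and high are illustrative values only.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for content_id, answered_correctly in answers:
        total[content_id] += 1
        correct[content_id] += answered_correctly

    difficulty = {}
    for content_id, n in total.items():
        if n < min_count:
            continue  # not enough answers to trust the estimate
        raw = 1 - correct[content_id] / n
        difficulty[content_id] = min(max(raw, low), high)
    return difficulty

toy_answers = [(10, 1), (10, 1), (10, 1), (10, 1),   # always right: raw difficulty 0
               (11, 0), (11, 1), (11, 0), (11, 0),   # right 1 time out of 4
               (12, 1)]                              # seen once: dropped
toy_difficulty = question_difficulty(toy_answers)
print(toy_difficulty)
```

Clipping keeps the asymmetric score from collapsing to exactly 0 on questions that everybody gets right or wrong in the sample.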

In [None]:
questions_to_difficulty = (1-train[train.content_type_id == False].groupby('content_id')['answered_correctly'].mean())

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (12,5))
plt.title('Question difficulty')
sns.distplot(questions_to_difficulty.values)
plt.show()

#### Questions reaction times

To compute the average reaction time of users depending on whether they answer correctly or not, we first have to shift prior_question_elapsed_time, as it refers to the previous row.
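
To make the shift concrete, here is a small sketch on toy rows (plain Python dicts rather than the real dataframe, made-up values, and one question per bundle assumed). The helper name `shift_elapsed_time` is hypothetical. It shifts within each user, so the last row of one user never borrows a value from the next user.

```python
def shift_elapsed_time(rows):
    """rows: list of dicts with 'user_id' and 'prior_question_elapsed_time',
    ordered by user and time. Adds 'elapsed_time': the time spent on the
    row's own question, read from the same user's next row (None if absent)."""
    shifted = []
    for i, row in enumerate(rows):
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        same_user = nxt is not None and nxt['user_id'] == row['user_id']
        elapsed = nxt['prior_question_elapsed_time'] if same_user else None
        shifted.append({**row, 'elapsed_time': elapsed})
    return shifted

toy_rows = [
    {'user_id': 1, 'content_id': 100, 'prior_question_elapsed_time': None},
    {'user_id': 1, 'content_id': 101, 'prior_question_elapsed_time': 21000},
    {'user_id': 1, 'content_id': 102, 'prior_question_elapsed_time': 17000},
    {'user_id': 2, 'content_id': 100, 'prior_question_elapsed_time': None},
]
shifted_rows = shift_elapsed_time(toy_rows)
print([r['elapsed_time'] for r in shifted_rows])
```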

In [None]:
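# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
subtrain = train[train.content_type_id == 0].copy()
# Shift within each user so that a user's last row does not pick up the next user's value
subtrain['shift_elapse_time'] = subtrain.groupby('user_id')['prior_question_elapsed_time'].shift(-1)
subtrain = subtrain.dropna(subset=['shift_elapse_time'])
good_answer_reaction_time = subtrain[subtrain.answered_correctly == 1].groupby('content_id')['shift_elapse_time'].mean()
bad_answer_reaction_time = subtrain[subtrain.answered_correctly == 0].groupby('content_id')['shift_elapse_time'].mean()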

In [None]:
plt.figure(figsize = (12,5))
plt.title('Question reaction times')
sns.distplot(good_answer_reaction_time.values, label = 'good answer reaction time')
sns.distplot(bad_answer_reaction_time.values, label = 'bad answer reaction time')
plt.legend()
plt.show()

### Conditional probabilities

This part is very interesting as it includes some of the features that gave me the nicest boost.
The objective was to infer the Bayesian probability of a user answering correctly given their previous answers.

I applied the same approach to the lectures.

#### Selection of questions to use to compute the Bayesian probabilities

Using all questions would have been way too heavy both in computation time and in RAM. I decided to limit this part to the 500 most frequent questions. An optimal value would probably have been around 1000 questions, but I didn't really have time to try it.

In [None]:
# Count by column name: with index_col=0 above, train.values[:,1] is user_id, not content_id
qids = train.loc[train.content_type_id == 0, 'content_id'].value_counts().head(500)
qids.head()

In [None]:
#We just keep the questions id
qids = qids.index

#### Challenge

Now the main challenge here is to deal with all the different possibilities.

We want to build a dictionary of the shape {A1: {B1: P_A1B1_1, B2: P_A1B2_1, ...}, ...}, where P_A1B2_1 is the Bayesian probability P(A1=1|B2=1), for each question A in the questions table and for each question B among the 500 most frequent questions.
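
As an illustration of the target structure, here is a toy sketch with made-up answer logs. It is simplified on purpose: no memory limit, and every question is allowed as B. The function name `conditional_proba` is hypothetical. It counts, for each pair (A, B), how often A was answered after a correct answer to B and how often A was then correct, and only divides at the end.

```python
from collections import defaultdict

def conditional_proba(user_logs):
    """user_logs: {user_id: [(content_id, answered_correctly), ...]} in time order.
    Returns {A: {B: P(A=1 | B=1)}} estimated from simple counts."""
    n_correct = defaultdict(lambda: defaultdict(int))  # A correct, after B correct
    n_seen = defaultdict(lambda: defaultdict(int))     # A answered, after B correct
    for log in user_logs.values():
        correct_before = set()  # questions this user last answered correctly
        for A, answered_correctly in log:
            for B in correct_before:
                n_seen[A][B] += 1
                n_correct[A][B] += answered_correctly
            if answered_correctly == 1:
                correct_before.add(A)
            else:
                correct_before.discard(A)
    return {A: {B: n_correct[A][B] / n for B, n in seen.items()}
            for A, seen in n_seen.items()}

toy_logs = {
    1: [(7, 1), (9, 1)],   # right on 7, then right on 9
    2: [(7, 1), (9, 0)],   # right on 7, then wrong on 9
    3: [(7, 0), (9, 1)],   # wrong on 7: does not count for B=7 answered correctly
}
toy_proba = conditional_proba(toy_logs)
print(toy_proba)
```

Here question 9 was answered twice after a correct answer to question 7 and was correct once, hence P(9=1|7=1) = 0.5.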

Also, I wanted to include a notion of memory: when a user sees too many questions, they should start to forget the impact of the earlier ones. 

Finally, we want the feature creation to be neither time consuming nor RAM consuming, so we have to use as few loops as possible and try to optimize the RAM usage. 

#### Solution

I came up with a solution that requires only a single loop:
- For every new user, I generate a new user_meta dictionary to store the history of that user.
- I also create 4 global dictionaries to count the general occurrences of event A (question seen) and the conditional occurrences. Note: I don't do the division in this part; it is done later based on those counts.
- I then loop once through the training set. Every time a user answers a question, if the question belongs to the frequent questions, I add the answer to the "memory" of this user. For every row, I check which frequent questions the user has already seen and, for those, I add the values to my conditional dictionaries.
- After the update of the dictionaries, I update the memory of the user based on the current question. 
- Finally, I advance the memory clock of each remembered question by 1; once a question has reached the maximum memory, it is reset to 0.

Note:
- You can see I am iterating through the array rather than the pandas dataframe: it reduces the processing time a lot.
- We transform the answered values to -1 or 1; it is then easier to filter on the previous questions seen.

The code makes about 7000 iterations per second, so it takes about 5 hours to run on the full dataset
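
Before the full loop, the memory mechanism can be sketched in isolation. This is a simplified variant, not the exact bookkeeping of the loop: each remembered answer carries a countdown (`memory`, set to 3 here only to keep the trace short) that decreases every time the user answers another question, and the answer is forgotten when the countdown runs out. The helper name `remember` is hypothetical.

```python
def remember(user_memory, content_id, answered_correctly, memory=3):
    """user_memory: {content_id: [answer, remaining]} with answer in {-1, 1}.
    Ages every remembered answer by one step, drops the expired ones,
    then stores the new answer with a fresh countdown."""
    for qid in list(user_memory):
        user_memory[qid][1] -= 1
        if user_memory[qid][1] <= 0:
            del user_memory[qid]
    user_memory[content_id] = [2 * answered_correctly - 1, memory]
    return user_memory

user_memory = {}
for content_id, answered_correctly in [(5, 1), (8, 0), (2, 1), (3, 1)]:
    remember(user_memory, content_id, answered_correctly)
print(user_memory)
```

After four answers with a memory of 3, question 5 has been forgotten while questions 8, 2 and 3 are still remembered with 1, 2 and 3 steps left.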

In [None]:
from tqdm import tqdm

memory = 100
train_arr = train[['user_id', 'content_id', 'content_type_id', 'answered_correctly']].values
# The different metadata dictionnaries used in this part
user_meta = {}
user_meta_count = {}
meta_proba_pos = {}
meta_count_pos = {}
meta_proba_neg = {}
meta_count_neg = {}

#I iterate through
for i in tqdm(range(len(train))):
    
    user_id, content_id, content_type_id, answered_correctly = train_arr[i]
    
    #If a new user is detected, we reset the user cache to save RAM
    if user_id not in user_meta.keys():
        user_meta = {}
        user_meta_count = {}
        user_meta[user_id] = {qid:0 for qid in qids}
        user_meta_count[user_id] = {}
        
    #Then: update the questions dictionnaries based on the user cache
    if content_type_id ==0:
        A = content_id
        X = np.array(list(user_meta[user_id].values()))
        #If a new question is detected, we create a slot in our dictionnaries
        if A not in meta_proba_pos.keys():
            meta_proba_pos[A] = {i:0 for i in range(len(qids))}
            meta_count_pos[A] = {i:0 for i in range(len(qids))}
            meta_proba_neg[A] = {i:0 for i in range(len(qids))}
            meta_count_neg[A] = {i:0 for i in range(len(qids))}
         
        #We check all previous answers of a user and update the respective dictionaries for P(A=1|B=1) and P(A=1|B=0)
        for B in np.where(X!=0)[0]:
            #P(A=1|B=1)
            if (X[B]==1):
                if answered_correctly == 1:
                    meta_proba_pos[A][B] += 1
                meta_count_pos[A][B] += 1
            #P(A=1|B=0)
            if (X[B]==-1):
                if answered_correctly == 1:
                    meta_proba_neg[A][B] += 1
                meta_count_neg[A][B] += 1             
    
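    #update the user memory (questions only: on a lecture row, A would be undefined or stale)
    if content_type_id == 0 and A in qids: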
        
        for k,v in user_meta_count[user_id].copy().items():
            #Reduce the memory clock, when it reach 0, reset the value to 0 for the given question
            user_meta_count[user_id][k] = v-1
            if v == 0:
                del user_meta_count[user_id][k]
                user_meta[user_id][k] = 0
            
        user_meta[user_id][content_id] = 2*answered_correctly-1
        #Add the question to the memory of a user if not seen yet
        if not A in user_meta_count[user_id].keys():
            user_meta_count[user_id][A] = memory


Let's load the full dictionaries!

In [None]:
import numpy as np
import pandas as pd

import pickle

with open('../input/riid-raw-samples/bad_answer_reaction_time', 'rb') as f:
    bad_answer_reaction_time = pickle.load(f)
    
with open('../input/riid-raw-samples/good_answer_reaction_time', 'rb') as f:
    good_answer_reaction_time = pickle.load(f)
    
with open('../input/riid-raw-samples/question_user_count', 'rb') as f:
    question_user_count = pickle.load(f)

with open('../input/riid-raw-samples/questions_to_difficulty', 'rb') as f:
    questions_to_difficulty = pickle.load(f)

with open('../input/riid-raw-samples/questions_to_difficulty_variance', 'rb') as f:
    questions_to_difficulty_variance = pickle.load(f)
    
with open('../input/riid-raw-samples/questions_to_difficulty_with_explanation', 'rb') as f:
    questions_to_difficulty_with_explanation = pickle.load(f)
    
with open('../input/riid-raw-samples/questions_to_difficulty_without_explanation', 'rb') as f:
    questions_to_difficulty_without_explanation = pickle.load(f)
    
with open('../input/riid-probabilities-2/meta_count.pkl', 'rb') as f:
    meta_count = pickle.load(f)
    
with open('../input/riid-probabilities-2/meta_count_neg.pkl', 'rb') as f:
    meta_count_neg = pickle.load(f)
    
with open('../input/riid-probabilities-2/meta_proba.pkl', 'rb') as f:
    meta_proba = pickle.load(f)
    
with open('../input/riid-probabilities-2/meta_proba_neg.pkl', 'rb') as f:
    meta_proba_neg = pickle.load(f)
    
with open('../input/riid-probabilities/question_position.pkl', 'rb') as f:
    question_position = pickle.load(f)   
    
with open('../input/riid-lectures/lecture_position.pkl', 'rb') as f:
    lecture_position = pickle.load(f)
    
with open('../input/riid-lectures/meta_count_lect.pkl', 'rb') as f:
    meta_count_lect = pickle.load(f)
    
with open('../input/riid-lectures/meta_proba_lect.pkl', 'rb') as f:
    meta_proba_lect = pickle.load(f)
    
clust = pd.read_csv('../input/riid-cluster/test_cluster_4',index_col=0)['0']

with open('../input/riid-cache-questions/cache_questions.pck', 'rb') as f:
    question_cache = pickle.load(f)

Given the global occurrence counts and the conditional counts, we can now calculate the Bayesian probabilities.
I am also setting a threshold on the number of occurrences as well as a threshold on significance. Those values are hyperparameters of the problem.
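
A worked instance of the two thresholds, on made-up counts. The helper `conditional_gain` is a hypothetical name used only for this sketch; its defaults mirror THRESH_COUNT and THRESH_PROBA from the cell that follows.

```python
def conditional_gain(n_correct, n_total, base_rate, min_count=50, min_gain=0.01):
    """Gain of P(A=1 | B=1) over the unconditional success rate of A.
    Returns None when the pair is seen too rarely or the gain is too small."""
    if n_total <= min_count:
        return None
    gain = n_correct / n_total - base_rate
    return gain if gain > min_gain else None

# 60 correct answers out of 80 co-occurrences, against a base success rate of 0.70
gain_kept = conditional_gain(60, 80, 0.70)   # 0.75 - 0.70: kept
too_rare = conditional_gain(30, 40, 0.70)    # only 40 co-occurrences: dropped
too_small = conditional_gain(57, 80, 0.71)   # 0.7125 - 0.71: below the gain threshold
print(gain_kept, too_rare, too_small)
```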

In [None]:
proba_pos = {}
proba_neg = {}
THRESH_COUNT = 50
THRESH_PROBA = 0.01
for A,v in tqdm(meta_proba.items()):
    proba_pos[A] = {}
    proba_neg[A] = {}
    for B in v.keys():
        if meta_count[A][B]>THRESH_COUNT:
            proba = meta_proba[A][B]/meta_count[A][B]
            improve = proba - (1-questions_to_difficulty[A])
            if improve > THRESH_PROBA:
                proba_pos[A][question_position[B]] = improve
        if meta_count_neg[A][B]>THRESH_COUNT:
            proba = meta_proba_neg[A][B]/meta_count_neg[A][B]
            decrease = proba - (1-questions_to_difficulty[A])
            if decrease < -THRESH_PROBA:
                proba_neg[A][question_position[B]] = decrease
                
                
#Same job for the lectures !
proba_lec = {}
thresh = 20
for k,v in tqdm(meta_proba_lect.items()):
    proba_lec[k] = {}
    for kk,vv in v.items():
        if meta_count_lect[k][kk]>thresh:
            proba = meta_proba_lect[k][kk]/meta_count_lect[k][kk]
            improve = proba - (1-questions_to_difficulty[k])
            if improve > 0.10:
                proba_lec[k][lecture_position[kk]] = improve

In [None]:
example = pd.Series(proba_pos[7684]).sort_values(ascending = False)
example.index = ['q'+str(elmt) for elmt in example.index]

fig = go.Figure(
    go.Bar(
        x = example.index,
        y = example.values, 
    )
)
fig.update_layout(
    title = 'Example of Bayesian effect - Gain over difficulty metric',
    template = 'presentation', 
)
fig.update_yaxes(tickfont = dict(size=13))
fig.show()

### Clustering

Those Bayesian values can also be used to find similarities between questions.
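
To give an intuition of what "similar" means here, the sketch below compares two questions through the gains they receive from the same influencing questions. Cosine similarity is swapped in only as a simple pairwise measure; it is not what the notebook uses, which relies on K-means over the full matrix. The gains are made up.

```python
import math

def cosine_similarity(u, v):
    """u, v: sparse vectors stored as {position: gain} dicts."""
    dot = sum(value * v[key] for key, value in u.items() if key in v)
    norm_u = math.sqrt(sum(value ** 2 for value in u.values()))
    norm_v = math.sqrt(sum(value ** 2 for value in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

toy_gains = {
    'q1': {0: 0.10, 3: 0.05},
    'q2': {0: 0.20, 3: 0.10},   # same profile as q1, twice as strong
    'q3': {7: 0.08},            # no influencing question in common with q1
}
sim_12 = cosine_similarity(toy_gains['q1'], toy_gains['q2'])
sim_13 = cosine_similarity(toy_gains['q1'], toy_gains['q3'])
print(sim_12, sim_13)
```

Questions influenced by the same earlier questions get a similarity close to 1, and questions with nothing in common get 0.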

In [None]:
all_prob = {}
for k,v in proba_pos.items():
    all_prob[k] = {}
    for kk,vv in v.items():
        all_prob[k][(kk,1)] = vv
for k,v in proba_neg.items():
    if k not in all_prob.keys():
        all_prob[k] = {}
    for kk,vv in v.items():
        all_prob[k][(kk,0)] = vv

all_q = set()
for k,v in all_prob.items():
    for kk in v.keys():
        all_q.add(kk[0])
        
mprob1 = {}
for k,v in tqdm(all_prob.items()):
    mprob1[k] = {}
    kk = [e[0] for e in v.keys()]
    for elmt in all_q:
        mprob1[k][elmt]=0
        if elmt in kk:
            if (elmt,1) in v.keys():
                mprob1[k][elmt]=v[(elmt,1)]
            else:
                mprob1[k][elmt]=v[(elmt,0)]

In [None]:
mprob = pd.DataFrame(mprob1).T
mprob.head()

We use a simple K-means to find groups of related questions.

In [None]:
import umap
from sklearn.cluster import KMeans

kmean = KMeans(n_clusters=20)
c = kmean.fit_predict(mprob)

We use UMAP to project the data and check the clustering made.

The cluster obtained is used in the final features

In [None]:
proj = umap.UMAP(n_components=2)
x = proj.fit_transform(mprob)

plt.figure(figsize = (20,10))
plt.title('Projection of questions similarity by Bayesian Method', size = 20)
plt.scatter(x[:,0],x[:,1], c=c, alpha = 0.1, cmap = 'tab20b')
plt.show()

The same type of clustering could have been done based on the lecture probabilities, but I didn't have time to try...

### Other features

The features presented above could be computed offline. The other features need to be updated continuously to avoid obvious leakage.
More details are given in the next part.

# 3. Production pipeline

## History and memory limitation

All ids in the test set are users already present in the train set.
In order to be efficient, our algorithm needs to keep track of the parameters of each student.

Unfortunately, due to the high number of students, my dictionary was too big to be loaded at once.
To solve that issue, I decided to create a separate cache file for each id. During inference, whenever an id showed up for the first time, I checked whether it already had a saved cache and, if so, added it to my general cache.

It took about 5 hours to construct an up-to-date dictionary for all users.
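
The lazy-loading idea can be sketched as follows, assuming one pickle file per user named after the user id. The helper `get_user_cache`, the temporary directory and the single `questions_count` field are stand-ins for illustration, not the real cache layout.

```python
import os
import pickle
import tempfile

def get_user_cache(cache, user_id, cache_dir, known_ids):
    """Return the state of user_id, loading it from disk at most once.
    cache_dir holds one pickle file per user, named after the user id
    (a hypothetical layout for this sketch)."""
    if user_id not in cache:
        if user_id in known_ids:
            with open(os.path.join(cache_dir, str(user_id)), 'rb') as f:
                cache[user_id] = pickle.load(f)
        else:
            cache[user_id] = {'questions_count': 0}  # blank state for a new user
    return cache[user_id]

with tempfile.TemporaryDirectory() as cache_dir:
    # Offline step: one file per user already seen in the train set
    with open(os.path.join(cache_dir, '115'), 'wb') as f:
        pickle.dump({'questions_count': 46}, f)
    known_ids = {int(name) for name in os.listdir(cache_dir)}

    cache = {}
    known_user = get_user_cache(cache, 115, cache_dir, known_ids)
    new_user = get_user_cache(cache, 999, cache_dir, known_ids)
print(known_user, new_user, sorted(cache))
```

Only the users that actually show up during inference are ever read from disk, which keeps the RAM footprint proportional to the test set rather than to the whole train set.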

In [None]:
# This is the location of all the caches; I build a set of ids to keep track of what I already have
ids = glob.glob('../input/riid-cache-6/content/drive/My Drive/riid/*')
# A set keeps the `user_id in ids` test in create_cache fast
ids = {int(elmt.split('/')[-1]) for elmt in ids}

In [None]:
len(ids)

## Some more dictionaries!

We create references of parts and tags that will later be used to associate content_id with subcategories and create relational features.

In [None]:
questions_to_tag = questions.set_index('question_id')['tags'].fillna('-1').apply(lambda x: [np.int64(elmt) for elmt in x.split(' ')]).to_dict()
questions_to_parts = questions[['question_id','part']].set_index('question_id')['part'].to_dict()
lecture_to_type = lectures[['lecture_id','type_of']].set_index('lecture_id')['type_of'].to_dict()
lecture_to_part = lectures[['lecture_id','part']].set_index('lecture_id')['part'].to_dict()
lecture_to_tag = lectures[['lecture_id','tag']].set_index('lecture_id')['tag'].to_dict()

tags = []
for k,v in questions_to_tag.items():
    tags+= v
tags = {elmt:np.int32(0) for elmt in set(tags)}

parts = np.unique(list(questions_to_parts.values()))
question_list = list(questions_to_difficulty.keys())

### Some useful functions

One of the big challenges for me was to keep the code clean.
Having some utility functions for simple operations was very useful, as it allowed me to avoid a lot of unnecessary lines later.

In [None]:
def qscore(answer, difficulty):
    
    '''Compute an asymmetric score that is high when a user answers a difficult
    question correctly and penalises users who answer simple questions wrong'''
    if answer>0:
        return (2*answer - 1)*difficulty
    else:
        return (2*answer - 1)*(1-difficulty)

def list_feature_average(n, vals):
    '''a simple rolling average over last n elements'''
    if len(vals)>n:
        return np.mean(vals[-n:])
    else:
        return -1
    
def make_feature_average(count, vals):
    '''a simple average over all elements given a parameter and its count'''
    if count>0:
        return vals/count
    else:
        return -1
    
def calculate_time_deltas(cache, timestamp, prior_question_elapsed_time,content_type_id, previous_part, user_id, batch_size):
    
    if timestamp != cache[user_id]['ts-1']:
        if prior_question_elapsed_time != -1:
            cache[user_id]['average_reactivities'].append(prior_question_elapsed_time)
            cache[user_id]['last_ts_list'].append(timestamp)
            if len(cache[user_id]['average_reactivities'])>20:
                cache[user_id]['average_reactivities'].pop(0)
            if len(cache[user_id]['last_ts_list'])>20:
                cache[user_id]['last_ts_list'].pop(0)

            if previous_part in cache[user_id]['part_average_reactivity_tot'].keys():
                cache[user_id]['part_average_reactivity_tot'][previous_part]+=prior_question_elapsed_time

            else:
                if previous_part != -1:
                    cache[user_id]['part_average_reactivity_tot'][previous_part] = prior_question_elapsed_time

            if len(cache[user_id]['last_ts_list'])>3:
                cache[user_id]['time_between_questions'].append(cache[user_id]['last_ts_list'][-2]-cache[user_id]['last_ts_list'][-3]-cache[user_id]['average_reactivities'][-1]*cache[user_id]['batch_size_list'])
                if len(cache[user_id]['time_between_questions'])>20:
                    cache[user_id]['time_between_questions'].pop(0)


        if (cache[user_id]['last_delta_between_questions'] != -1) & (prior_question_elapsed_time != -1):
            if cache[user_id]['average_delta_between_questions'] > 3600:
                cache[user_id]['hour_count'] +=1

            if cache[user_id]['average_delta_between_questions'] > 10000:
                cache[user_id]['day_count'] +=1

            if cache[user_id]['average_delta_between_questions'] < 3600:
                cache[user_id]['average_delta_between_questions'] += cache[user_id]['last_delta_between_questions'] - prior_question_elapsed_time 

                
        if cache[user_id]['ts-1'] != -1:
            if content_type_id == 0:
                cache[user_id]['last_delta_between_questions'] = (timestamp - cache[user_id]['ts-1'])/batch_size
            else:
                cache[user_id]['average_time_on_lecture'] += (timestamp - cache[user_id]['ts-1'])/batch_size
            
    return cache
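
A short usage sketch of the two simplest helpers, on made-up values. Both functions are restated so that the block runs on its own, and the rolling average is written without NumPy; the behaviour is meant to match the definitions above.

```python
def qscore(answer, difficulty):
    # Same formula as above: reward scaled by difficulty, penalty scaled by easiness
    if answer > 0:
        return (2 * answer - 1) * difficulty
    return (2 * answer - 1) * (1 - difficulty)

def list_feature_average(n, vals):
    # Same rule as above, without NumPy: -1 until more than n values exist
    if len(vals) > n:
        return sum(vals[-n:]) / n
    return -1

hard, easy = 0.8, 0.2   # made-up difficulties
print(qscore(1, hard), qscore(0, hard))   # big reward, small penalty
print(qscore(1, easy), qscore(0, easy))   # small reward, big penalty

last_answers = [1, 0, 1, 1]
print(list_feature_average(2, last_answers))   # mean of the last 2 answers
print(list_feature_average(6, last_answers))   # not enough history yet
```

The -1 returned when the history is too short acts as a "missing" marker that the tree model can split on.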

### Elo functions

Credit to Steve Majou for his notebook [Simple Elo Rating](https://www.kaggle.com/stevemju/riiid-simple-elo-rating). These functions work very well to quantify the level of a student.

I later applied the method at the global level, the part level and the tag level.

In [None]:
def get_new_theta(is_good_answer, beta, left_asymptote, theta, nb_previous_answers):
    return theta + learning_rate_theta(nb_previous_answers) * (
        is_good_answer - probability_of_good_answer(theta, beta, left_asymptote)
    )

def get_new_beta(is_good_answer, beta, left_asymptote, theta, nb_previous_answers):
    return beta - learning_rate_beta(nb_previous_answers) * (
        is_good_answer - probability_of_good_answer(theta, beta, left_asymptote)
    )

def learning_rate_theta(nb_answers):
    return max(0.3 / (1 + 0.01 * nb_answers), 0.04)

def learning_rate_beta(nb_answers):
    return 1 / (1 + 0.05 * nb_answers)

def probability_of_good_answer(theta, beta, left_asymptote):
    return left_asymptote + (1 - left_asymptote) * sigmoid(theta - beta)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
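
A worked example of one update, for a brand-new student (theta = 0) answering a brand-new question (beta = 0) correctly. The formulas and learning rates are the ones above, condensed into a single hypothetical helper `elo_update` that uses `math.exp` instead of NumPy so the block runs on its own. The left asymptote of 0.25 corresponds to guessing among four choices.

```python
import math

def probability_of_good_answer(theta, beta, left_asymptote):
    return left_asymptote + (1 - left_asymptote) / (1 + math.exp(-(theta - beta)))

def elo_update(is_good_answer, theta, beta, nb_user, nb_question, left_asymptote=0.25):
    """One update of student ability theta and question difficulty beta,
    using the same learning rates as the functions above."""
    error = is_good_answer - probability_of_good_answer(theta, beta, left_asymptote)
    lr_theta = max(0.3 / (1 + 0.01 * nb_user), 0.04)
    lr_beta = 1 / (1 + 0.05 * nb_question)
    return theta + lr_theta * error, beta - lr_beta * error

p0 = probability_of_good_answer(0, 0, 0.25)   # 0.25 + 0.75 * 0.5
theta1, beta1 = elo_update(1, theta=0, beta=0, nb_user=0, nb_question=0)
print(p0, theta1, beta1)
```

The expected success probability is 0.625, so a correct answer produces an error of 0.375: the student moves up by 0.3 × 0.375 and the question moves down by 1 × 0.375. Both learning rates shrink as more answers are observed.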

### Cache template

This one is very important:

Every time a new user is seen, I need to create a new blank page on which I record all the necessary characteristics "on the fly". 
In order to be memory efficient, I made some decisions such as not storing the full history but rather using truncated lists when needed. 

For global indicators, I iteratively calculate sums and simply produce the averages when needed.

For the final inference, my create_cache function also first checks whether an id already has a history in my database. If this is the case, I load this history directly into the current cache and keep it in RAM (so I don't need to reload it later).
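
The two storage patterns can be sketched on a toy example. The class `UserStats` and its fields are hypothetical, and `collections.deque` with `maxlen` is swapped in for the list-plus-`pop(0)` truncation used in the notebook; the idea (running sums for global averages, a bounded history for rolling features) is the same.

```python
from collections import deque

class UserStats:
    """Hypothetical container for this sketch: running sums for global
    averages, and a bounded deque for rolling features."""
    def __init__(self, max_history=20):
        self.questions_count = 0
        self.good_answers = 0
        self.last_answers = deque(maxlen=max_history)  # old items drop automatically

    def update(self, answered_correctly):
        self.questions_count += 1
        self.good_answers += answered_correctly
        self.last_answers.append(answered_correctly)

    def global_average(self):
        return self.good_answers / self.questions_count if self.questions_count else -1

    def rolling_average(self, n):
        recent = list(self.last_answers)[-n:]
        return sum(recent) / n if len(self.last_answers) > n else -1

stats = UserStats(max_history=3)
for answer in [1, 1, 0, 1, 0]:
    stats.update(answer)
global_avg = stats.global_average()
rolling_avg = stats.rolling_average(2)
print(global_avg, list(stats.last_answers), rolling_avg)
```

The memory used per user stays constant however long the history gets: two counters and at most `max_history` stored answers.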

In [None]:
def create_cache(cache, user_id, ts, ids):
        
    if user_id in ids:
        with open(f'../input/riid-cache-6/content/drive/My Drive/riid/{user_id}', 'rb') as f:
            id_cache = pickle.load(f)[user_id]
            cache[user_id] = id_cache
    else:
        #generate the cache for a new id
        cache[user_id] = {}

        #general
        cache[user_id]['previous_content_type_id'] = -1
        cache[user_id]['previous_part'] = -1
        cache[user_id]['previous_question'] = -1
        cache[user_id]['questions_count'] = 0
        cache[user_id]['part_lectures_count'] = {i:0 for i in range(1,8)}
        cache[user_id]['lectures_part'] = {i:0 for i in range(1,8)}
        cache[user_id]['part_count'] = {}
        cache[user_id]['question_with_explanation'] = set()
        cache[user_id]['question_seen'] = set()
        cache[user_id]['first_bundle'] = -1
        cache[user_id]['batch_size_list'] = -1

        #Timestamps analysis
        cache[user_id]['ts-1'] = -1
        cache[user_id]['last_ts_part'] = {i:0 for i in range(1,8)}
        cache[user_id]['last_delta_between_questions'] = -1
        cache[user_id]['day_count'] = 0
        cache[user_id]['hour_count'] = 0
        cache[user_id]['last_ts_list'] = []
        cache[user_id]['time_between_questions'] = []
        cache[user_id]['average_reactivity_tot'] = 0
        cache[user_id]['average_reaction_good_answer_tot'] = 0
        cache[user_id]['average_reaction_bad_answer_tot'] = 0
        cache[user_id]['average_delta_between_good_answer'] = 0
        cache[user_id]['part_average_delta_between_good_answer'] = {}
        cache[user_id]['average_reactivities'] = []
        cache[user_id]['average_delta_between_questions'] = 0
        cache[user_id]['average_time_on_lecture'] = -1
        cache[user_id]['part_average_reactivity_tot'] = {}

        #QUESTION RELATED
        cache[user_id]['last_questions'] = []
        cache[user_id]['last_answers_correctly'] = []
        cache[user_id]['last_answers_correctly_part'] = {i:[] for i in range(1,8)}
        cache[user_id]['last_lectures'] = []
        cache[user_id]['lectures_count'] = 0


        #GOOD ANSWERS AND SCORE
        cache[user_id]['general_good_answer'] = 0
        cache[user_id]['part_good_answer'] = {}
        cache[user_id]['part_score'] = {i:0 for i in range(1,8)}

        #ELO
        cache[user_id]['student_parameters'] = {"theta": 0, "nb_answers": 0}

        #ELO PART
        cache[user_id]['part_student_parameters'] = {i:{"theta": 0, "nb_answers": 0} for i in range(1,8)}

        #ELO TAGS
        cache[user_id]['tag_student_parameters'] = {i:{"theta": 0, "nb_answers": 0} for i in tags.keys()}

        #TAG RELATED
        cache[user_id]['tag_count'] = {}
        cache[user_id]['tag_good_answers'] = {}

    return cache

### Update the cache

Updating the cache is handled by the function below, using the labeled data once it is available.
The only purpose of the function is to update the full history for a given user.

In [None]:
def update_cache(cache, arr, batch_size, question_cache):
    
    left_asymptote = 1/4
    #unpack array
    row_id, timestamp, user_id, content_id, content_type_id, task_container_id, prior_question_elapsed_time, prior_question_had_explanation, _, _, answered_correctly,user_answer  = arr
    
    if not prior_question_elapsed_time:
        prior_question_elapsed_time=-1
    if not prior_question_had_explanation:
        prior_question_had_explanation =-1
        
    timestamp = int(float(timestamp)/1000)
    user_id = np.uint32(user_id)
    content_id = np.uint16(content_id)
    content_type_id = np.int8(content_type_id)
    task_container_id = np.uint32(task_container_id)
    prior_question_elapsed_time = int(float(prior_question_elapsed_time)/1000)
    # builtin bool: the np.bool alias was removed in NumPy 1.24
    prior_question_had_explanation = bool(prior_question_had_explanation)
    user_answer = np.int8(user_answer)
    answered_correctly = np.int8(answered_correctly)
    previous_content_type_id = cache[user_id]['previous_content_type_id']
    
    part = -1
    #General
    if content_type_id == 0:

        if content_id not in question_cache.keys():
            question_cache[content_id] = {"beta": 0, "nb_answers": 0}

        part = questions_to_parts[content_id]
        tags = questions_to_tag[content_id]

        difficulty = questions_to_difficulty[content_id]
        count = question_user_count[content_id]
        bad_answer_reaction = bad_answer_reaction_time[content_id]
        good_answer_reaction = good_answer_reaction_time[content_id]
        beta = question_cache[content_id]["beta"]
        score = qscore(answered_correctly, difficulty)
        theta = cache[user_id]['student_parameters']["theta"]
        theta_part = cache[user_id]['part_student_parameters'][part]["theta"]

        for tag in tags:
            theta_tag = cache[user_id]['tag_student_parameters'][tag]["theta"]
            cache[user_id]['tag_student_parameters'][tag]["theta"] = get_new_theta(
                answered_correctly, beta, left_asymptote, theta_tag, cache[user_id]['tag_student_parameters'][tag]["nb_answers"],
            )
            cache[user_id]['tag_student_parameters'][tag]["nb_answers"] += 1
        

        cache[user_id]['student_parameters']["theta"] = get_new_theta(
            answered_correctly, beta, left_asymptote, theta, cache[user_id]['student_parameters']["nb_answers"],
        )

        cache[user_id]['part_student_parameters'][part]["theta"] = get_new_theta(
            answered_correctly, beta, left_asymptote, theta_part, cache[user_id]['part_student_parameters'][part]["nb_answers"],
        )

        cache[user_id]['student_parameters']["nb_answers"] += 1
        cache[user_id]['part_student_parameters'][part]["nb_answers"] += 1
        
        cache[user_id]['questions_count'] +=1
        if part in cache[user_id]['part_count'].keys():
            cache[user_id]['part_count'][part] += 1
        else:
            cache[user_id]['part_count'][part] = 1
            
        cache[user_id]['last_answers_correctly'].append(answered_correctly)
        cache[user_id]['last_answers_correctly_part'][part].append(answered_correctly)
        if len(cache[user_id]['last_answers_correctly'])>20:
            cache[user_id]['last_answers_correctly'].pop(0)
        if len(cache[user_id]['last_answers_correctly_part'][part])>20:
            cache[user_id]['last_answers_correctly_part'][part].pop(0)  

        #GENERAL GOOD ANSWERS
        if answered_correctly == 1:
            cache[user_id]['average_reaction_good_answer_tot'] += prior_question_elapsed_time
            if part in cache[user_id]['part_good_answer'].keys():
                cache[user_id]['part_good_answer'][part]+=1
            else:
                cache[user_id]['part_good_answer'][part]=1
                
            cache[user_id]['general_good_answer'] += 1
            score = qscore(answered_correctly, difficulty)
    
        if answered_correctly == 0:
            cache[user_id]['average_reaction_bad_answer_tot'] += prior_question_elapsed_time
            
        cache[user_id]['part_score'][part]+=score

        #TAGS
        for tag in tags:
            if tag in cache[user_id]['tag_count'].keys():
                cache[user_id]['tag_count'][tag] +=1
                cache[user_id]['tag_good_answers'][tag] += answered_correctly
            else:
                cache[user_id]['tag_count'][tag] = 1
                cache[user_id]['tag_good_answers'][tag] = answered_correctly
    
    if (cache[user_id]['ts-1'] != -1) & (timestamp != cache[user_id]['ts-1']):
        cache[user_id]['last_delta_between_questions'] = (timestamp - cache[user_id]['ts-1'])
        cache[user_id]['average_delta_between_questions'] += (timestamp - cache[user_id]['ts-1'])
                
    cache = calculate_time_deltas(cache, timestamp, prior_question_elapsed_time,content_type_id, cache[user_id]['previous_part'], user_id, batch_size)
    
    cache[user_id]['ts-1'] = timestamp
    cache[user_id]['last_ts_part'][part] = timestamp
    cache[user_id]['previous_content_type_id'] = content_type_id
    if part != -1:
        cache[user_id]['previous_part'] = part
        
    if content_type_id == 0:
        cache[user_id]['previous_question'] = content_id
        #keep only the last 100 (question, answer) pairs
        cache[user_id]['last_questions'].append((content_id,answered_correctly))
        if len(cache[user_id]['last_questions'])>100:
            cache[user_id]['last_questions'].pop(0)
            
    if content_type_id == 1:
        part = lecture_to_part[content_id]
        cache[user_id]['lectures_count']+=1
        cache[user_id]['lectures_part'][part]+=1
        cache[user_id]['last_lectures'].append(content_id)
        if len(cache[user_id]['last_lectures'])>100:
            cache[user_id]['last_lectures'].pop(0)
            
    if (prior_question_had_explanation != -1) & (cache[user_id]['previous_question'] != -1):
        cache[user_id]['question_with_explanation'].add(content_id)
    
    cache[user_id]['question_seen'].add(content_id)
    if cache[user_id]['first_bundle'] == -1:
        cache[user_id]['first_bundle'] = content_id    

    cache[user_id]['batch_size_list'] = batch_size
    
    return cache
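
The append-then-`pop(0)` pattern used above for `last_answers_correctly`, `last_questions` and `last_lectures` could also be written with `collections.deque(maxlen=...)`, which drops the oldest element automatically. This is a minimal sketch, not the code used in the competition; `make_history` is a hypothetical helper and the window size of 20 is taken from the code above.

```python
from collections import deque

def make_history(maxlen):
    """Return a bounded FIFO buffer that keeps only the last `maxlen` items."""
    return deque(maxlen=maxlen)

# Hypothetical stream of 25 answers (0 or 1), for illustration only
last_answers = make_history(20)
for i in range(25):
    last_answers.append(i % 2)  # once 20 items are stored, the oldest is dropped

print(len(last_answers))
```

One caveat: a `deque` does not support slicing, so any helper that slices the history would need `list(history)[-n:]` instead.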

### Feature update

The feature update creates the rows used by the model to make predictions. Each row combines information about the current question (`content_id` and the features generated offline) with information from the user's history.

Note: to build the training set while limiting RAM usage, I preferred a simple CSV iterator over loading the whole dataset.

Note: if the current row is a lecture, the function returns an empty list, which is handled in the main loop.
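
One way to iterate over a CSV without loading it entirely is the `chunksize` argument of `pandas.read_csv`. The sketch below is only an illustration of that idea, not necessarily the iterator I used: `iter_rows` is a hypothetical helper, and the in-memory CSV stands in for the real training file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the training file (made-up data)
csv_text = "row_id,user_id,content_id\n0,1,10\n1,1,11\n2,2,10\n3,2,12\n4,3,13\n"

def iter_rows(source, chunksize):
    """Yield rows as NumPy arrays, reading `chunksize` rows at a time."""
    for chunk in pd.read_csv(source, chunksize=chunksize):
        for arr in chunk.values:
            yield arr

rows = list(iter_rows(io.StringIO(csv_text), chunksize=2))
print(len(rows))
```

Because a user's rows can straddle a chunk boundary, the per-user state has to live outside the chunk loop, which is what the `cache` dictionary does here.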

In [None]:
def create_feature(cache, arr, batch_size, question_cache):
    
    features = []
    
    #unpack array
    row_id, timestamp, user_id, content_id, content_type_id, task_container_id, prior_question_elapsed_time, prior_question_had_explanation, _, _ = arr
    
    if not prior_question_elapsed_time:
        prior_question_elapsed_time=-1
    if not prior_question_had_explanation:
        prior_question_had_explanation =-1
        
       
    ##CAST##
    timestamp = float(timestamp)/1000
    prior_question_elapsed_time = float(prior_question_elapsed_time)/1000

    if content_type_id == 0:
        
        part = questions_to_parts[content_id]
        tags = questions_to_tag[content_id]
        difficulty = questions_to_difficulty[content_id]
        count = question_user_count[content_id]
        bad_answer_reaction = bad_answer_reaction_time[content_id]
        good_answer_reaction = good_answer_reaction_time[content_id]
        var = questions_to_difficulty_variance[content_id]

        if content_id in questions_to_difficulty_with_explanation.keys():
            difficulty1bis = questions_to_difficulty_with_explanation[content_id]
        else:
            difficulty1bis = -1
        if content_id in questions_to_difficulty_without_explanation.keys():  
            difficulty2bis = questions_to_difficulty_without_explanation[content_id]
        else:
            difficulty2bis = -1


        features.append(part)
        features.append(difficulty)
        features.append(count)
        features.append(good_answer_reaction)
        features.append(bad_answer_reaction)
        features.append(batch_size)
        features.append(var)
        features.append(difficulty1bis)
        features.append(difficulty2bis)
        features.append(clust[content_id])

        #lecture
        features.append(cache[user_id]['lectures_count'])
            
        #elo
        user_known = user_id in cache
        question_known = content_id in question_cache

        #general theta of the student, -1 if the student is unknown
        if user_known:
            features.append(cache[user_id]['student_parameters']['theta'])
        else:
            features.append(-1)

        #probability of a good answer, only when both student and question are known
        if user_known and question_known:
            features.append(probability_of_good_answer(cache[user_id]['student_parameters']['theta'], question_cache[content_id]['beta'], 1/4))
        else:
            features.append(-1)

        #PART ELO
        if part in cache[user_id]['part_student_parameters'].keys():
            features.append(probability_of_good_answer(cache[user_id]['part_student_parameters'][part]['theta'], question_cache[content_id]['beta'], 1/4))
        else:
            features.append(-1)

        c = 0
        p = 0 
        s = len(tags)
        for tag in tags:
            c+=cache[user_id]['tag_student_parameters'][tag]['theta']
            p+=probability_of_good_answer(cache[user_id]['tag_student_parameters'][tag]['theta'], question_cache[content_id]['beta'], 1/4)
        features.append(c/s)
        features.append(p/s)

        features.append(cache[user_id]['part_score'][part])
        #Timestamp features

        features.append((timestamp - cache[user_id]['ts-1'])/batch_size)
        features.append(list_feature_average(19, cache[user_id]['average_reactivities']))

        if part in cache[user_id]['part_average_reactivity_tot'].keys():
            features.append(make_feature_average(cache[user_id]['part_count'][part], cache[user_id]['part_average_reactivity_tot'][part]))
        else:
            features.append(-1)
    
        features.append(make_feature_average(cache[user_id]['general_good_answer'], cache[user_id]['average_reaction_good_answer_tot']))
        features.append(cache[user_id]['last_delta_between_questions'])

        features.append(list_feature_average(10, (cache[user_id]['time_between_questions'])))
        features.append(list_feature_average(5, (cache[user_id]['time_between_questions'])))
        if len(cache[user_id]['time_between_questions'])>1:
            features.append(cache[user_id]['time_between_questions'][-1])
        else:
            features.append(-1)

        #Score features
        features.append(make_feature_average(cache[user_id]['questions_count'], cache[user_id]['general_good_answer']))
        features.append(list_feature_average(2, cache[user_id]['last_answers_correctly']))
         
        features.append(list_feature_average(6, cache[user_id]['last_answers_correctly']))


        features.append(list_feature_average(6, cache[user_id]['last_answers_correctly_part'][part]))

        #Explanation features
        if content_id in cache[user_id]['question_with_explanation']:
            features.append(1)
        else:
            features.append(0)
            
        ##Conditional proba given the last questions answered
        c1 = 0
        c0 = 0
        p=0
        if content_id in all_prob.keys():
            to_check = set(cache[user_id]['last_questions']).intersection(set(all_prob[content_id].keys()))
            for content, answer in to_check:
                p+=all_prob[content_id][(content,answer)]
                if answer == 1:
                    c1+=1
                else:
                    c0+=1

        features.append(c1)

        if c1+c0:
            features.append(p/(c1+c0))
        else:
            features.append(0)
        
        ##Conditional proba given the last lectures watched
        c = 0
        p=0
        if content_id in proba_lec.keys():
            to_check = set(cache[user_id]['last_lectures']).intersection(set(proba_lec[content_id].keys()))
            for content in to_check:
                p+=proba_lec[content_id][content]
                c+=1
        if c:
            features.append(p/c)
        else:
            features.append(0)

    return features
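
The conditional-probability features at the end of `create_feature` average precomputed probabilities over the recent `(content_id, answer)` pairs that have an entry for the current question. A minimal standalone sketch of that averaging follows; `mean_conditional_proba` is a hypothetical helper and the numbers are made up.

```python
def mean_conditional_proba(history, cond_proba):
    """Average cond_proba over the (content_id, answer) pairs found in history.

    Returns 0.0 when no pair of the history has a known probability,
    mirroring the fallback used in create_feature.
    """
    matches = set(history).intersection(cond_proba)
    if not matches:
        return 0.0
    return sum(cond_proba[pair] for pair in matches) / len(matches)

# Made-up table: P(current question correct | previous question, previous answer)
cond_proba = {(10, 1): 0.8, (10, 0): 0.4, (11, 1): 0.6}
# Made-up user history of (content_id, answered_correctly) pairs
history = [(10, 1), (11, 1), (12, 0)]

print(mean_conditional_proba(history, cond_proba))
```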

### Update cache in batch

Since the data arrive in batches, a second loop is needed to update the cache row by row.

In [None]:
def update_cache_batch(cache, df_arr, question_cache):
    
    batch_array = []
    task_init = -1
    user_id0 = -1 
    
    for arr in df_arr:
        
        user_id = arr[2]
        timestamp = arr[1]
        task_container_id = arr[5]
        if user_id not in cache.keys():
            cache = create_cache(cache, user_id, timestamp, ids)
            
        if (task_container_id == task_init) & (user_id == user_id0):
            batch_array.append(arr) 
        else:
            #When we meet a new task_container_id, we calculate metrics, and update cache.
            batch_size = len(batch_array)
            for arr_ in batch_array:
                cache = update_cache(cache, arr_, batch_size, question_cache)
                
            #Then start a new batch_array with the current row and update task_init and user_id0
            task_init = task_container_id
            user_id0 = user_id
            batch_array = [arr]

    #Finally, update the cache with the last elements left in batch_array
    batch_size = len(batch_array)
    
    #Update cache
    for arr_ in batch_array:
        cache = update_cache(cache, arr_, batch_size, question_cache)
        
    return cache
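
The grouping done by the loop above (consecutive rows sharing the same `user_id` and `task_container_id`) could also be expressed with `itertools.groupby`. This is a minimal sketch under the column positions used in this notebook (index 2 for `user_id`, index 5 for `task_container_id`); `iter_batches` is a hypothetical helper and the rows are made up.

```python
from itertools import groupby

def iter_batches(rows):
    """Yield (batch_size, batch_rows) for consecutive rows sharing user and container."""
    for _, group in groupby(rows, key=lambda arr: (arr[2], arr[5])):
        batch = list(group)
        yield len(batch), batch

# Columns: row_id, timestamp, user_id, content_id, content_type_id, task_container_id
rows = [
    (0, 0, 1, 10, 0, 0),
    (1, 1000, 1, 11, 0, 1),
    (2, 1000, 1, 12, 0, 1),
    (3, 0, 2, 10, 0, 0),
]

sizes = [size for size, _ in iter_batches(rows)]
print(sizes)  # [1, 2, 1]
```

`groupby` only merges adjacent rows, which matches the behaviour of the explicit loop: the same container id appearing later for another user starts a new batch.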

### Calculate features

This is the main part of the loop.

It detects new users, creates their cache if needed, builds the features and returns an array that can be fed directly to the model.

In [None]:
def calculate_features(cache, df_arr, question_cache):

    X = []

    batch_array = []
    task_init = -1
    user_id0 = -1 

    for arr in df_arr:

        #First we check if the id has been seen before. If not, we create a new cache
        user_id = arr[2]
        timestamp = arr[1]
        task_container_id = arr[5]
        if user_id not in cache.keys():
            cache = create_cache(cache, user_id, timestamp, ids)

        #This is the tricky part: we check whether the question belongs to the current batch. If so, we store the row in a
        #temporary list until all the questions of the batch have been collected, so that batch_size can be computed.
        if (task_container_id == task_init) & (user_id == user_id0):
            batch_array.append(arr)

        else:
            #When we meet a new task_container_id, we calculate metrics, and update cache.
            batch_size = len(batch_array)

            #Update metrics
            for arr_ in batch_array:
                feats = create_feature(cache, arr_, batch_size, question_cache)
                if len(feats)>0:
                    X.append(feats)

            #Then start a new batch_array with the current row and update task_init and user_id0
            task_init = task_container_id
            user_id0 = user_id
            batch_array = [arr]

    #Finally, compute the features for the last elements left in batch_array
    batch_size = len(batch_array)

    #Update metrics
    for arr_ in batch_array:
        feats = create_feature(cache, arr_, batch_size, question_cache)
        if len(feats)>0:
            X.append(feats)

    X = np.array(X)
        
    return X

## Final inference

In [None]:
env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
p_test_df = pd.DataFrame()
cache = {}

for idx, (test_df, _) in enumerate(iter_test):
    
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].astype(float)
    test_df = test_df.fillna(-1)


    submit_df = test_df.loc[test_df['content_type_id'] == False, ['row_id']].copy()
    if not p_test_df.empty:

        answered_correctly = [int(elmt) for elmt in test_df.iloc[0]['prior_group_answers_correct'][1:-1].replace(' ','').split(',')]
        prior_group_responses = [int(elmt) for elmt in test_df.iloc[0]['prior_group_responses'][1:-1].replace(' ','').split(',')]
        p_test_df['answered_correctly'] = answered_correctly
        p_test_df['user_response'] = prior_group_responses
        test_arr = p_test_df.values
        update_cache_batch(cache, test_arr, question_cache)



    X = calculate_features(cache, test_df.values, question_cache)
    if len(X):
        preds = 1-lgb.predict_proba(X)[:,0]
    else:
        preds = None

    submit_df['answered_correctly'] = preds
    env.predict(submit_df)    
    p_test_df = test_df
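
The manual string parsing of `prior_group_answers_correct` and `prior_group_responses` in the loop above could also be done with `ast.literal_eval`. This is a minimal sketch, assuming the field is the string representation of a Python list of integers, as the slicing above suggests; `parse_prior_group` is a hypothetical helper.

```python
import ast

def parse_prior_group(raw):
    """Parse a string such as '[1, 0, 1]' into a list of ints; '[]' gives []."""
    return [int(x) for x in ast.literal_eval(raw)]

print(parse_prior_group("[1, 0, 1]"))  # [1, 0, 1]
print(parse_prior_group("[]"))         # []
```

Unlike the split-based version, this also handles an empty list, where `int('')` would raise a `ValueError`.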