# Riiid classification using Naive Bayes

The initial purpose of this notebook was to build a very simple baseline for my own use, to better understand the data and the submission process. Since the score turned out to be fairly decent given the simplicity of the model, I decided to share the notebook in case it helps someone else. The model could be improved by adding more features.

In [None]:
import pandas as pd
import numpy as np
import riiideducation
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
import datetime
import sys

If you want to submit to the competition, the flag <code>TEST_PREDICTION</code> must be set to <code>False</code>; otherwise a smaller train dataset is used.

In [None]:
TEST_PREDICTION = True

## Read the data
<p>I read the data in feather format. Depending on the flag <code>TEST_PREDICTION</code>, I either read the full train dataset from <a href="https://www.kaggle.com/aralai/riiid-feather-dataset">this</a> notebook or read separate train and test datasets from <a href="https://www.kaggle.com/aralai/riiid-creating-a-test-dataset">this</a> notebook. You can check the latter if you want more details on how the test set was created.</p>

In [None]:
%%time
if not TEST_PREDICTION:
    train = pd.read_feather('../input/riiid-feather-dataset/train.feather')
else:
    train = pd.read_feather('../input/riiid-creating-a-test-dataset/train_df.feather')
    test = pd.read_feather('../input/riiid-creating-a-test-dataset/test_df.feather')

## Simple Naive Bayes classifier
<p>When we create the classifier we must provide a dictionary of pandas DataFrames, one per feature. Each DataFrame must be indexed by the id of the feature value and have three columns: <code>total</code> (the number of occurrences of that value in the train dataset), <code>positive</code> (the number of correct answers) and <code>negative</code> (the number of incorrect answers). <p>
We also define a threshold (10 by default). A feature value in the prediction data that does not appear more than <code>threshold</code> times in the training data is ignored for that prediction.<p>
<code>predict</code> receives a dictionary of lists whose keys must match those of the dictionary provided at creation. All lists must have the same length, and each element of a list is one instance that we want to predict.
<p>We can incrementally update the model with the function <code>update_model</code>.
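<p>As a rough illustration of what <code>predict</code> computes for a single row, the sketch below redoes the arithmetic by hand. <code>posterior_correct</code> is a hypothetical helper, not part of the notebook, and the prior and the counts are made-up numbers chosen only for the example. As in the class, each factor is the share of correct (or incorrect) answers recorded for that feature value.

```python
def posterior_correct(prior_pos, feature_counts, threshold=10):
    """Combine a prior with per-feature (total, positive, negative) counts.

    Mirrors the arithmetic of NaiveBayes.predict for one sample:
    feature values seen `threshold` times or fewer are skipped.
    """
    pos, neg = prior_pos, 1.0 - prior_pos
    for total, positive, negative in feature_counts:
        if total > threshold:
            pos *= positive / total
            neg *= negative / total
    # normalise so that the two class scores sum to 1
    return pos / (pos + neg)

# Made-up numbers: 65% of all training answers are correct,
# the question was answered 46 times (32 correct),
# the user answered 200 times (150 correct).
p = posterior_correct(0.65, [(46, 32, 14), (200, 150, 50)])
print(round(p, 3))  # about 0.927

# A feature value seen only 10 times is ignored, so the prior is returned.
p_rare = posterior_correct(0.65, [(10, 7, 3)])
```

Because both the question and the user have an above-average rate of correct answers, the combined score ends up higher than either rate on its own.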

In [None]:
class NaiveBayes:
    def __init__(self, features, threshold=10):
        assert type(features)==dict, 'parameter features is not a dictionary!'
        for f in features.keys():
            assert type(features[f])==pd.core.frame.DataFrame, 'Wrong datatype for {0}. Each entry of the dictionary must contain a pandas DataFrame'.format(f)
            assert list(features[f].columns)==['total', 'positive', 'negative'], 'wrong columns in {0} DataFrame'.format(f)
        self.THRESHOLD = threshold
        self.features = features
        self.prior_probability = {}
        self.one_feature = list(features.keys())[0]
        self.total_answers = features[self.one_feature]['total'].sum()
        self.positive_answers = features[self.one_feature]['positive'].sum()
        self.negative_answers = features[self.one_feature]['negative'].sum()
        self.prior_probability['negative'] = self.negative_answers/self.total_answers
        self.prior_probability['positive'] = self.positive_answers/self.total_answers
        
    def update_model(self, data):
        assert data.keys()==self.features.keys(), "Keys don't match!"
        for f in self.features.keys():
            # DataFrame.add returns a new frame, so the result must be assigned back
            self.features[f] = self.features[f].add(data[f], fill_value=0).astype('uint64')
        self.total_answers += data[self.one_feature]['total'].sum()
        self.positive_answers += data[self.one_feature]['positive'].sum()
        self.negative_answers += data[self.one_feature]['negative'].sum()
        self.prior_probability['negative'] = self.negative_answers/self.total_answers
        self.prior_probability['positive'] = self.positive_answers/self.total_answers
        
        
    def predict(self, data):
        assert data.keys()==self.features.keys(), "Keys don't match!"
        data_len = len(data[list(data.keys())[0]])
        # pos and neg are the priors for positive and negative classes
        pos = np.array([self.prior_probability['positive'] for _ in range(data_len)])
        neg = np.array([self.prior_probability['negative'] for _ in range(data_len)])
        # multiply the prior probability by the likelihood of each feature
        for d in data.keys():
            feature = pd.DataFrame({'id':data[d]})
            counts=pd.merge(feature,self.features[d],left_on='id',right_index=True,how='left').fillna(0).astype('uint64').values
            # counts.shape == (sample_len,4)
            # counts[:,0]==id ; counts[:,1]==total ; counts[:,2]==positive ; counts[:,3]==negative
            # e.g.: counts == array([[115,46,32,14],[124,10,7,3],[115,46,32,14]],dtype=uint64)
            # indices of the samples whose feature value was seen more than THRESHOLD times
            updatable = np.where(counts[:,1]>self.THRESHOLD)[0]
            # e.g.: updatable == array([0,2])
            pos[updatable] *= counts[updatable,2]/counts[updatable,1]
            neg[updatable] *= counts[updatable,3]/counts[updatable,1]
        return pos/(pos+neg)

## Prepare simple features<p>
Given a column name, this function creates a dataframe that we can use for the class NaiveBayes.
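<p>To make the expected shape concrete, here is a minimal sketch of the same aggregation on a toy DataFrame. The rows are invented for illustration; only the column names follow the competition data, where lectures have <code>content_type_id</code> equal to 1 and <code>answered_correctly</code> equal to -1, which is why they are filtered out before counting.

```python
import pandas as pd

# Toy interactions: five question rows and one lecture row (invented values)
toy = pd.DataFrame({
    'content_id':         [7, 7, 7, 9, 9, 3],
    'content_type_id':    [0, 0, 0, 0, 0, 1],
    'answered_correctly': [1, 0, 1, 1, 1, -1],
})

# Keep questions only, then count answers and correct answers per content_id
questions = toy.loc[toy.content_type_id == 0, ['content_id', 'answered_correctly']]
counts = questions.groupby('content_id').agg(['count', 'sum'])
counts.columns = ['total', 'positive']
counts = counts.astype('uint64')
counts['negative'] = counts['total'] - counts['positive']
print(counts)
```

Question 7 ends up with 3 answers (2 correct, 1 incorrect) and question 9 with 2 answers (both correct). The lecture row does not appear in the result.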

In [None]:
def prepare_features(dataset, col_name):
    df = dataset.loc[dataset.content_type_id==0,[col_name,'answered_correctly']].groupby(col_name).agg(['count','sum'])
    df.columns=['total', 'positive']
    df = df.astype('uint64')
    df['negative'] = df['total']-df['positive']
    return df

### Questions<p>
We group by <code>content_id</code> to get the number of times that a question has been asked and how many times it has been answered correctly or incorrectly.

In [None]:
question_df = prepare_features(train,'content_id')
question_df.head()

In [None]:
plt.hist((question_df['positive']/question_df['total']).values, bins =30)
plt.show()

### Users<p>
Grouping by <code>user_id</code>, we get information about how good or bad each student is.

In [None]:
user_df = prepare_features(train,'user_id')
user_df.head()

In [None]:
plt.hist((user_df['positive']/user_df['total']).values, bins =30)
plt.show()

This is just a simple baseline, but more features could be added in a similar way. Naive Bayes makes the "naive" assumption that all the features are independent. If we want it to work well, we should add features that are not correlated with each other.
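<p>A small numeric sketch of why correlated features hurt: if the same evidence is fed in twice, it is counted twice and the score drifts towards the extremes. The helper <code>posterior</code> and the numbers are hypothetical and only mimic the multiplication done in <code>predict</code>.

```python
def posterior(prior_pos, rates):
    """Score of the positive class after multiplying in each feature's correct-answer rate."""
    pos, neg = prior_pos, 1.0 - prior_pos
    for r in rates:
        pos *= r
        neg *= 1.0 - r
    return pos / (pos + neg)

once = posterior(0.6, [0.8])        # one feature with an 80% correct rate
twice = posterior(0.6, [0.8, 0.8])  # a perfectly correlated copy of that feature
print(round(once, 3), round(twice, 3))  # 0.857 0.96
```

The second feature adds no new information, yet the predicted probability rises from about 0.86 to 0.96.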

## Predict on a test dataset
<p>In this case we don't want to submit the solution, just evaluate it on a test dataset (as explained <a href="https://www.kaggle.com/aralai/riiid-creating-a-test-dataset">here</a>).</p>

In [None]:
class TestGenerator:
    def __init__(self, df, grp_size=[50,10]):
        self.df = df
        self.answered_correctly = self.df.answered_correctly[self.df.content_type_id==0].values
        self.predictions = np.zeros(len(self.answered_correctly))
        self.grp_size = grp_size
        self.start_idx=0
        self.last_prediction_idx = 0
        self.prediction_called = True
        self.test_cols = [c for c in df.columns if c not in ['answered_correctly','user_answer']]
        self.current_batch = {'prior_group_answers_correct':[], 'prior_group_responses':[]}

    def iter_test(self):
        while self.start_idx<len(self.df):
            assert self.prediction_called, "You must call `predict()` successfully before you can continue with `iter_test()`"
            self.prediction_called = False
            self.end_idx = int(self.start_idx + max(1,np.random.normal(self.grp_size[0],self.grp_size[1])))
            test_df = self.df.iloc[self.start_idx:self.end_idx]
            answered_correctly_previous_batch = list(test_df['answered_correctly'])
            user_answer_previous_batch = list(test_df['user_answer'])
            # copy so that the column assignments below don't write into a slice of self.df
            test_df = test_df[self.test_cols].copy()
            test_df['prior_group_answers_correct'] = None
            test_df['prior_group_responses'] = None
            test_df.loc[test_df.index[0],'prior_group_answers_correct'] = str(self.current_batch['prior_group_answers_correct'])
            test_df.loc[test_df.index[0],'prior_group_responses'] = str(self.current_batch['prior_group_responses'])
            self.current_batch['prior_group_answers_correct'] = answered_correctly_previous_batch
            self.current_batch['prior_group_responses'] = user_answer_previous_batch
            yield test_df

    def predict(self, prediction_df):
        assert not self.prediction_called, "You must get the next test sample from `iter_test()` first."
        self.predictions[self.last_prediction_idx:self.last_prediction_idx+len(prediction_df)] = prediction_df.answered_correctly
        self.last_prediction_idx += len(prediction_df)
        self.start_idx = self.end_idx
        self.prediction_called = True
        if self.end_idx>=len(self.df):
            print("Final AUC score: {0}".format(roc_auc_score(self.answered_correctly,self.predictions)))

In [None]:
class Chrono:
    def __init__(self):
        self.chrono = {}
        self.acc_time = {}
    
    def start(self, name):
        self.chrono[name] = datetime.datetime.now().timestamp()
    
    def stop(self, name):
        timestop = datetime.datetime.now().timestamp() - self.chrono[name]
        if name in self.acc_time.keys():
            self.acc_time[name] += timestop
        else:
            self.acc_time[name] = timestop

In [None]:
# The model is needed both for the local evaluation and for the submission
nb = NaiveBayes({'question': question_df, 'user':user_df})
if TEST_PREDICTION:
    tg = TestGenerator(test, grp_size=[1000,100])
    chrono = Chrono()

In [None]:
if TEST_PREDICTION:
    chrono.start('total')
    progress=0
    np.random.seed(59)
    for test_batch in tg.iter_test():
        progress += len(test_batch)
        sys.stdout.write("{0}\r".format(progress))

        # HERE we can perform incremental learning, i.e. update the model with data from the previous batch.
        # We must be careful: retraining the model at every step can slow the submission down a lot.
        # In this example I don't use incremental learning.
        chrono.start('update_model')
        #nb.update_model({'question':question_incr_df, 'user':user_incr_df})
        chrono.stop('update_model')

        test_questions = test_batch.loc[test_batch.content_type_id==0,'content_id']
        test_users = test_batch.loc[test_batch.content_type_id==0,'user_id']
        test_rowids = test_batch.loc[test_batch.content_type_id==0,'row_id']
        chrono.start('predict')
        answered_correctly = nb.predict({'question':test_questions, 'user':test_users})
        chrono.stop('predict')

        tg.predict(pd.DataFrame({'row_id':test_rowids, 'answered_correctly': answered_correctly}))
    chrono.stop('total')

The AUC on this test dataset is 0.724. The AUC of the submission is 0.742.

In [None]:
chrono.acc_time

The time statistics registered in <code>chrono</code> can be useful to detect bottlenecks in the submission.
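<p>One possible way to read these numbers is to express each timer as a share of the total. The sketch below assumes a dictionary shaped like <code>chrono.acc_time</code> with a <code>'total'</code> entry; <code>time_shares</code> is a hypothetical helper and the timings are invented, not measurements from this notebook.

```python
def time_shares(acc_time, total_key='total'):
    """Return each timer's fraction of the total accumulated time."""
    total = acc_time[total_key]
    return {name: t / total for name, t in acc_time.items() if name != total_key}

# Invented timings in seconds, only to show the output format
example = {'total': 20.0, 'update_model': 0.5, 'predict': 15.0}
shares = time_shares(example)
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

Whatever is left after summing the shares is time spent outside the timed sections, for example in building the batches.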

## Submission<p>
Here I create the real submission for the competition.

In [None]:
env = riiideducation.make_env()
iter_test = env.iter_test()

For each batch <code>test_df</code> yielded by <code>iter_test</code>, we predict the probability of answering correctly (<code>nb.predict</code>) and then send the resulting data back to the environment with <code>env.predict</code>. Notice that we must not create any <code>submission.csv</code> file ourselves: this is done automatically by <code>env.predict</code>.

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    test_questions = test_df['content_id']
    test_users = test_df['user_id']
    answered_correctly = nb.predict({'question':test_questions, 'user':test_users})
    test_df['answered_correctly'] = answered_correctly
    env.predict(test_df.loc[test_df['content_type_id']==0,['row_id','answered_correctly']])