# Opening Notes 

This competition was really fun for me and I learned so much as the competition progressed. Here are a few of the things I have learnt along the way.

- Manipulating/transforming/loading large datasets with compute limits 
- Advanced data manipulation and analysis using pandas (Pandas is now my top tag on Stack Overflow!)
- Using an automated hyperparameter tuner called Optuna
- Feature engineering with tabular time series data
- and so much more!

### Data Sources
I used 4 different data sources to load into this one notebook, and they are as follows.

1. riiid-test-answer-prediction

This was the actual competition data, which contained a csv file with 100 million rows of users and their interactions with questions. There was also metadata for each question_id and lecture_id, along with example_test data and a sample submission. I go into more detail on the features provided throughout the notebook.


2. riiid-splitting-train-and-test-data

I created a separate notebook to split the data into two separate datasets because I was unable to stay within the notebook CPU and RAM limits. I also wanted to save the dataset to pickle, as otherwise it took forever to load. This enabled me to use one portion of the data to compute user-specific features and statistics on each question type. I go into more detail about the split in that notebook and encourage you to check it out.

3. riiid-content-answers-df-preprocessing

This is essentially a table of question statistics that I put together. I looped through it and updated the question accuracy as I preprocessed the training dataset. I could not build it in this notebook, as doing so alongside training the LGB model would push the notebook over the CPU limits.

4. riiid-train-df

All 100 million rows, pickled so I could load in all the data. I had to do this as some of the dictionaries needed to create features were too large to keep in memory while training the model. A good solution was to delete them before training, then reload all the data at the end of the notebook to rebuild those dictionaries before making predictions on the test dataset.
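
The pickle round trip described for the split and full datasets can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the actual splitting notebook; the file name `demo_train.pkl.zip` and the variable names are assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in for the real 100-million-row training frame (made-up values).
demo_df = pd.DataFrame({
    'user_id': np.array([115, 115, 124], dtype='int32'),
    'content_id': np.array([5692, 5716, 7900], dtype='int16'),
    'answered_correctly': np.array([1, 1, 0], dtype='int8'),
})

# Hypothetical file name; the '.zip' suffix mirrors the datasets used in this notebook.
path = os.path.join(tempfile.mkdtemp(), 'demo_train.pkl.zip')

# Pickle keeps the compact dtypes, so nothing has to be re-parsed or re-cast on load.
demo_df.to_pickle(path, compression='zip')
restored_df = pd.read_pickle(path, compression='zip')
```

Because the dtypes survive the round trip, the reload at the end of the notebook should not need the dtype dictionary again.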

### Helpful Notebooks

Tito's Looping Strategy - [link](https://www.kaggle.com/its7171/lgbm-with-loop-feature-engineering)

Vopani's Notebook on Large Datasets - [link](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets)

Alex Bader's Notebook on Question Tags - [link](https://www.kaggle.com/spacelx/2020-r3id-clustering-question-tags)

Mark Wijkhuizen's Feature Eng - [link](https://www.kaggle.com/markwijkhuizen/riiid-training-and-prediction-using-a-state)

Takamotoki's Data Manipulation Techniques - [link](https://www.kaggle.com/takamotoki/lgbm-iii-part3-adding-lecture-features)





### Imports and Loading Data

In [None]:
#basic
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import gc

#Model imports
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

#optuna
import optuna
from optuna.samplers import TPESampler
from sklearn.metrics import roc_auc_score

# You can only call make_env() once, so don't lose it!
import riiideducation

import os

Specifying the datatypes of the data I was loading really helped keep its size down. I did so in the following cell for the question metadata, and for the training dataframe in the riiid-splitting-train-and-test-data notebook before loading it into this notebook.

In [None]:
%%time

used_data_types_dict = {
    'question_id': 'int16',
    'bundle_id': 'int16',
    'correct_answer': 'int8',
    'part': 'int8',
    'tags': 'str',
}

questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv',
                       usecols = used_data_types_dict.keys(), dtype=used_data_types_dict)

lectures_df = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')
#ex = pd.read_csv('../input/riiid-test-answer-prediction/example_test.csv')
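
As a rough illustration of why the explicit dtypes matter, the sketch below reads a made-up CSV shaped like questions.csv twice, once with pandas' defaults and once with the same dtype dictionary as above, and compares the memory used by the numeric columns. The data and the names `csv_text`, `default_df` and `compact_df` are assumptions for the example only.

```python
import io

import pandas as pd

# Made-up CSV in the same shape as questions.csv (values are illustrative only).
csv_text = "question_id,bundle_id,correct_answer,part,tags\n" + "\n".join(
    f"{i},{i},{i % 4},{i % 7 + 1},{i % 100} {i % 50}" for i in range(1000)
)

compact_dtypes = {
    'question_id': 'int16',
    'bundle_id': 'int16',
    'correct_answer': 'int8',
    'part': 'int8',
    'tags': 'str',
}

default_df = pd.read_csv(io.StringIO(csv_text))
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)

# Compare only the numeric columns, where the dtype choice makes the difference.
numeric_cols = ['question_id', 'bundle_id', 'correct_answer', 'part']
default_bytes = default_df[numeric_cols].memory_usage(index=False).sum()
compact_bytes = compact_df[numeric_cols].memory_usage(index=False).sum()
print(default_bytes, compact_bytes)
```

The default integer dtype is 64-bit, so narrowing to 8-bit and 16-bit integers shrinks these columns several times over. The same idea applies to the 100-million-row training frame.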

In [None]:
%%time
features_df = pd.read_pickle('../input/riiid-splitting-train-and-test-data/features_q_only.pkl.zip')
train_df = pd.read_pickle('../input/riiid-splitting-train-and-test-data/train_q_only.pkl.zip')

### Feature Engineering

I did quite a bit of feature engineering in a separate notebook to stay within the CPU limits, but nonetheless still got most of it done below.

In the next cell I created a feature that tells the model whether this specific user had seen this specific question before, and how many times (capped at 5). It turned out to be a great feature and provided a good boost in my score when I added it in.

In [None]:
def add_seen_before_to_train_df(features_df, train_df):
    train_questions_only_df = features_df[features_df['answered_correctly']!=-1]

    # add user content attempts (both results come back sorted by user_id)
    user_content = train_questions_only_df.groupby('user_id')['content_id'].apply(np.array).apply(np.sort).apply(np.unique)
    user_attempts = train_questions_only_df.groupby(['user_id', 'content_id'])['content_id'].count().astype(np.uint8).groupby('user_id').apply(np.array).values

    # key the state by the groupby index so users and their counts stay aligned
    state = dict()
    for user_id, content, attempt in tqdm(zip(user_content.index, user_content, user_attempts), total=len(user_content)):
        state[user_id] = dict(zip(content, attempt))

    del user_content, user_attempts, train_questions_only_df

    big_list=[]

    for pair in tqdm(train_df[['user_id','content_id']].values):
        if pair[0] in state:
            if pair[1] in state[pair[0]]:
                big_list.append(state[pair[0]][pair[1]])
                state[pair[0]][pair[1]]+=1
            else:
                big_list.append(0)
                state[pair[0]][pair[1]]=1
        else:
            big_list.append(0)
            state[pair[0]]={pair[1]:1}

    del state

    train_df['seen_before']=big_list
    train_df.seen_before = train_df.seen_before.clip(upper=5)

    del big_list
    gc.collect()
    
    return(train_df)

train_df = add_seen_before_to_train_df(features_df, train_df)
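
For a single dataframe that is already in chronological order, the same "how many times has this user seen this question" count can be sketched with a vectorized groupby. This is only an illustration on made-up data; unlike the function above, it does not carry over counts from a separate `features_df`.

```python
import pandas as pd

# Made-up interactions, already in chronological order per user.
demo_df = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 1, 2],
    'content_id': [10, 11, 10, 10, 10, 10],
})

# cumcount() numbers each (user, question) occurrence 0, 1, 2, ...,
# which is "how many times has this user seen this question before".
demo_df['seen_before'] = (
    demo_df.groupby(['user_id', 'content_id']).cumcount().clip(upper=5)
)
```

The dictionary-based loop is still needed when counts must be updated across separate frames, as with the test batches later in this notebook.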

The next cell builds a dataframe that keeps track of which types of lectures each user has seen. This was another great dataframe, and provided numerous features for the model to train on.

In [None]:
#messing with lectures before we remove them from df
lects_df = features_df[features_df['answered_correctly']==-1]

lect_seen_df = pd.DataFrame(data=lects_df.user_id.value_counts())
lect_seen_df.columns=['lectures_seen']
lect_seen_df.lectures_seen = lect_seen_df.lectures_seen.astype(float)

#some new lect feature stuff
lectures_df['type_of'] = lectures_df['type_of'].replace('solving question', 'solving_question')

lectures_df = pd.get_dummies(lectures_df, columns=['part', 'type_of'])

part_lectures_columns = [column for column in lectures_df.columns if column.startswith('part')]
types_of_lectures_columns = [column for column in lectures_df.columns if column.startswith('type_of_')]

lectures_df = lectures_df.set_index('lecture_id')

# merge lecture features to train dataset
train_lectures = features_df[features_df.answered_correctly == -1].merge(lectures_df, left_on='content_id', right_on='lecture_id', how='left')

# collect per user stats
user_lecture_stats_part = train_lectures.groupby('user_id')[part_lectures_columns + types_of_lectures_columns].sum()

user_lecture_stats_part = user_lecture_stats_part.merge(lect_seen_df, left_index=True, right_index=True)
user_lecture_stats_part.index.name = 'user_id'
user_lecture_stats_part['lectures_seen'] = user_lecture_stats_part['lectures_seen'].astype(int)

del lects_df, lect_seen_df, train_lectures

user_lecture_stats_part

Once I had all the information I needed from the lecture interactions, I removed those rows from the data and focused on creating features from the questions. At the same time I merged each question's part and tags into the dataframe, so that I could compute per-part and per-tag statistics and merge them with the training dataset down the line.

In [None]:
#removes rows that are lectures and adds tags and part to each interaction
train_questions_only_df = features_df[features_df['answered_correctly']!=-1]

train_questions_only_df = pd.merge(train_questions_only_df, questions[['part','tags']], 
                                   left_on='content_id', right_index=True, how = 'left')
del features_df

A dataframe with the mean accuracy, the number of questions answered, and the number of questions correct for each user.

In [None]:
#getting the mean accuracy, question count of each user and other math stuff
grouped_by_user_df = train_questions_only_df.groupby('user_id')

user_answers_df = grouped_by_user_df.agg({'answered_correctly': ['mean', 'count', 'sum']}).copy()
user_answers_df.columns = [
    'user_mean_accuracy', 
    'user_questions_answered',
    'user_questions_correct',
]
user_answers_df.user_questions_correct = user_answers_df.user_questions_correct.astype('int32')

user_answers_df

One of the features I created in a different notebook was called lag_time. Each interaction had a timestamp, and I was able to create a feature which was basically the amount of time since they had last answered a question. 

When creating this feature I found that some questions had multiple parts, and the timestamps were identical. If I did not create this dictionary I would have all but one of the connected questions with a lag_time of 0.

Creating this dictionary enabled me to loop through the train and test data and input the proper lag time value.

In [None]:
user_lagtime_max_dict = grouped_by_user_df.agg({'lag_time': ['max']}).copy()
user_lagtime_max_dict.columns = [
    'user_lag_time_max',
]

user_lagtime_max_dict = user_lagtime_max_dict.to_dict('index')

#adds in the last lagtime as sometimes multiple q's have the same timestamp
for pair in grouped_by_user_df.tail(1)[['user_id','lag_time']].values:
    user_lagtime_max_dict[pair[0]]['last_lagtime'] = pair[1]

gc.collect()

A dataframe with the mean accuracy, the number of questions asked, and the standard deviation and skew of the accuracy for each set of question tags.

In [None]:
#grouping by content_id
grouped_by_tags_df = train_questions_only_df.groupby('tags')

tags_answers_df = grouped_by_tags_df.agg({'answered_correctly': ['mean', 'count', 'std', 'skew']}).copy()

tags_answers_df.columns = [
    'tags_mean_accuracy', 
    'tags_question_asked', 
    'tags_std_accuracy', 
    'tags_skew_accuracy'
]

tags_answers_df

A dataframe with the number of questions answered and mean_accuracy grouped by question part. 

I ended up leaving out the idea of a loop updating this stat as it did little to the accuracy, and the part_questions_correct was an underwhelming feature.

In [None]:
grouped_by_part_df = train_questions_only_df.groupby('part')

part_answers_df = grouped_by_part_df.agg({'answered_correctly': ['mean', 'count']}).copy()
part_answers_df.columns = [
    'part_mean_accuracy', 
    'part_questions_answered', 
]

part_answers_df

I then deleted the groupby objects to free up some space before importing the preprocessed statistics of each content by its id. I will note that I was very strict on making sure to get these statistics from the exact same questions as I did in this notebook.

This is important as I would otherwise have had some data leakage!

In [None]:
del grouped_by_user_df
del grouped_by_tags_df
del grouped_by_part_df

In [None]:
content_answers_df = pd.read_pickle('../input/riiid-content-answers-df-preprocessing/content_answers_df.pkl.zip')

content_answers_df

### Features not created in this notebook

**Community**

This feature was really interesting, and I learnt a lot from Alex Bader, who was the brains behind it. It groups the tags into five groups. Some of the tags were "connected" by a specific tag number, others were by themselves, and others were in pairs. He did this using a library called networkx, which provided great visualizations of these connections. I encourage you to check out his analysis linked at the top of this notebook.

**num_in_bundle**

This feature came about when I realized that some questions never appear in the training dataset. I noticed that three consecutive questions were missing, and it occurred to me that some questions are connected and always answered together. I created a feature that says how many parts there are to the question, that is, how many questions sit in its bundle.
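
One way such a feature could be derived is sketched below, assuming it is computed from the `bundle_id` column of the question metadata. The author's own preparation notebook may do this differently; the data and the name `demo_questions` are made up.

```python
import pandas as pd

# Made-up question metadata: questions 0-2 share a bundle, 3 stands alone, 4-5 are a pair.
demo_questions = pd.DataFrame({
    'question_id': [0, 1, 2, 3, 4, 5],
    'bundle_id':   [0, 0, 0, 3, 4, 4],
})

# Size of the bundle each question belongs to.
demo_questions['num_in_bundle'] = (
    demo_questions.groupby('bundle_id')['question_id'].transform('count')
)
```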

**avg_q_time**

This feature was brought in from the content_answers_df and was basically the average amount of time spent on each question. I was able to get this variable by shifting the prior_question_elapsed_time variable and taking the average of all the times. I ran into some difficulty here as there were a number of numpy inf and pandas NaN values which could not be dropped easily.
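
The shift-and-average idea can be sketched as follows. This is an assumption about how the shift might be done (per user, one row back), shown on made-up data; the actual preparation notebook is not reproduced here.

```python
import numpy as np
import pandas as pd

# Made-up interactions in chronological order per user (times in milliseconds).
demo_df = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'content_id': [10, 11, 12, 10, 11],
    'prior_question_elapsed_time': [np.nan, 20000.0, 30000.0, np.nan, 40000.0],
})

# The time recorded on the *next* row of the same user belongs to the current question.
demo_df['elapsed_time'] = (
    demo_df.groupby('user_id')['prior_question_elapsed_time'].shift(-1)
)

# Treat infinities as missing, then average per question; mean() skips NaN by default.
demo_df['elapsed_time'] = demo_df['elapsed_time'].replace([np.inf, -np.inf], np.nan)
avg_q_time = demo_df.groupby('content_id')['elapsed_time'].mean().rename('avg_q_time')
```

Questions whose every observation is missing still come out as NaN, which is why a fill value is needed later in the pipeline.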

**lag_time**

I probably could have created this feature in this notebook, but it took a while to loop through and add it. I mentioned this variable above; it is essentially the time between the current interaction and the previous interaction of the specific user. This was one of the last features I came across, and it bumped up my score by a decent margin.
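
A vectorized sketch of lag_time on made-up data is shown below, including one possible way of handling questions that share a timestamp. The forward-fill step is an assumption for illustration; the notebook itself handles this case with the `last_lagtime` dictionary entry.

```python
import pandas as pd

# Made-up interactions, sorted by user and timestamp (milliseconds).
# Rows 1 and 2 share a timestamp, as questions from the same bundle do.
demo_df = pd.DataFrame({
    'user_id':   [1, 1, 1, 1, 2, 2],
    'timestamp': [0, 5000, 5000, 9000, 0, 7000],
})

# Time since the same user's previous row; the first row of each user has no predecessor.
lag = demo_df.groupby('user_id')['timestamp'].diff()

# Rows that repeat a timestamp would get a lag of 0, so blank them out and
# carry the lag of the first row in the bundle forward instead.
lag = lag.mask(lag == 0)
demo_df['lag_time'] = lag.groupby(demo_df['user_id']).ffill()
```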


NOTE: It also helps a great deal to list the column names of all the features and the column you will be predicting, as it makes the pipeline more readable and easier to debug.



In [None]:
features = [
    'user_mean_accuracy', 
    'user_questions_answered',
    'user_questions_correct',
    'q_mean_accuracy', 
    'q_question_asked',
    'q_question_correct',
    'community',
    'num_in_bundle',
    'tags_mean_accuracy', 
    'tags_question_asked', 
    'tags_std_accuracy', 
    'tags_skew_accuracy',
    'part_mean_accuracy', 
    'part_questions_answered', 
    'prior_question_elapsed_time', 
    #'prior_question_had_explanation',
    'part_1',
    'part_2',
    'part_3',
    'part_4',
    'part_5',
    'part_6',
    'part_7',
    'type_of_concept',
    'type_of_intention',
    'type_of_solving_question',
    #'type_of_starter',
    'lectures_seen',
    'seen_before',
    'avg_q_time',
    'lag_time',
]

target = 'answered_correctly'

Deleting the portion of the dataset which we got our statistics from to free up some memory.

In [None]:
del train_questions_only_df

### Preprocessing functions

The following functions are crucial to this notebook. They allow the data pipeline to go through and provide updated statistics on:

* lectures that users attend
* user statistics
* content statistics

First of all I converted the dataframes to dictionaries and the columns to numpy arrays, then looped through those, as this was much more memory efficient and much quicker. I appended the values I needed to lists and then added these lists as columns at the end.

The final function here stores each user's latest timestamp and lag time, so that if a user happens to be midway through a question set we can assign the correct lag time to that interaction.


In [None]:
def add_and_update_user_lects(user_lecture_stats_part, lectures_df, train_df):
    
    lect_dict = lectures_df.to_dict('index')
    lect_stats_part_dict = user_lecture_stats_part.to_dict('index')
    
    # one-hot lecture columns that are accumulated per user
    lect_cols = ['part_1', 'part_2', 'part_3', 'part_4', 'part_5', 'part_6', 'part_7',
                 'type_of_concept', 'type_of_intention',
                 'type_of_solving_question', 'type_of_starter']
    stat_cols = lect_cols + ['lectures_seen']
    stat_lists = {col: [] for col in stat_cols}
    
    for content_id, user_id, answered_correctly in tqdm(train_df[['content_id','user_id','answered_correctly']].values):
        user_stats = lect_stats_part_dict.get(user_id)
        
        # record the stats as they were before this interaction (0 for unseen users)
        for col in stat_cols:
            stat_lists[col].append(user_stats[col] if user_stats is not None else 0)
        
        # lecture rows (answered_correctly == -1) update the running totals
        if answered_correctly == -1:
            if user_stats is None:
                user_stats = {col: 0 for col in stat_cols}
                lect_stats_part_dict[user_id] = user_stats
            for col in lect_cols:
                user_stats[col] += lect_dict[content_id][col]
            user_stats['lectures_seen'] += 1
    
    for col in stat_cols:
        train_df[col] = stat_lists[col]
    
    lect_stats_part = pd.DataFrame.from_dict(lect_stats_part_dict, orient='index')
    
    return(lect_stats_part, train_df)

In [None]:
def add_and_update_user_stats(user_answers_df, train_df):
    
    my_dict=user_answers_df.to_dict('index')
    user_acc_list=[]
    user_answered_list=[]
    user_correct_list=[]
    
    for pair in tqdm(train_df[['user_id','answered_correctly']].values):
        if pair[0] in my_dict:
            user_acc_list.append(my_dict[pair[0]]['user_mean_accuracy'])
            user_answered_list.append(my_dict[pair[0]]['user_questions_answered'])
            user_correct_list.append(my_dict[pair[0]]['user_questions_correct'])
            my_dict[pair[0]]['user_questions_answered']+=1
            my_dict[pair[0]]['user_questions_correct']+=pair[1]
            my_dict[pair[0]]['user_mean_accuracy'] = my_dict[pair[0]]['user_questions_correct']/my_dict[pair[0]]['user_questions_answered']
            
        else:
            my_dict[pair[0]]={'user_mean_accuracy':0.645,'user_questions_answered': 1, 'user_questions_correct': pair[1]}
            user_acc_list.append(0.645)
            user_answered_list.append(0)
            user_correct_list.append(0)
    
    train_df['user_mean_accuracy']=user_acc_list
    train_df['user_questions_answered']=user_answered_list
    train_df['user_questions_correct']=user_correct_list
    
    user_answers_df = pd.DataFrame.from_dict(my_dict, orient='index')
            
    return(user_answers_df, train_df)

In [None]:
def add_and_update_content_stats(content_answers_df, train_df):
    
    my_dict=content_answers_df.to_dict('index')
    
    q_mean_accuracy_list=[]
    q_question_asked=[]
    q_question_correct=[]
    community_list=[]
    num_in_bundle_list=[]
    avg_q_time_list=[]
    
    for pair in tqdm(train_df[['content_id','answered_correctly']].values):
        q_mean_accuracy_list.append(my_dict[pair[0]]['q_mean_accuracy'])
        q_question_asked.append(my_dict[pair[0]]['q_question_asked'])
        q_question_correct.append(my_dict[pair[0]]['q_question_correct'])
        community_list.append(my_dict[pair[0]]['community'])
        num_in_bundle_list.append(my_dict[pair[0]]['num_in_bundle'])
        avg_q_time_list.append(my_dict[pair[0]]['avg_q_time'])
        
        my_dict[pair[0]]['q_question_asked']+=1
        my_dict[pair[0]]['q_question_correct']+=pair[1]
        my_dict[pair[0]]['q_mean_accuracy'] = my_dict[pair[0]]['q_question_correct']/my_dict[pair[0]]['q_question_asked']
        
    train_df['q_mean_accuracy'] = q_mean_accuracy_list
    train_df['q_question_asked'] = q_question_asked
    train_df['q_question_correct'] = q_question_correct
    train_df['community'] = community_list
    train_df['num_in_bundle'] = num_in_bundle_list
    train_df['avg_q_time'] = avg_q_time_list
    
    content_answers_df = pd.DataFrame.from_dict(my_dict, orient='index')
    
    return(content_answers_df, train_df)

In [None]:
def update_user_max_timestamp(user_lagtime_max_dict, train_df):
    
    for pair in train_df[['user_id','timestamp','lag_time']].values:
        if pair[0] in user_lagtime_max_dict:
            user_lagtime_max_dict[pair[0]]['user_lag_time_max'] = pair[1]
            user_lagtime_max_dict[pair[0]]['last_lagtime'] = pair[2]
        else:
            user_lagtime_max_dict[pair[0]] = {}
            user_lagtime_max_dict[pair[0]]['user_lag_time_max'] = pair[1]
            user_lagtime_max_dict[pair[0]]['last_lagtime'] = pair[2]
   
    return(user_lagtime_max_dict)
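
The pattern shared by the functions above is "read the feature first, then update the running totals", which keeps the current row's answer out of its own features. A stripped-down, pure Python sketch of that pattern on made-up data follows; it is a simplification and does not reproduce every detail of `add_and_update_user_stats` (for example, how the stored mean is initialised for new users).

```python
# (user_id, answered_correctly) pairs in chronological order; values are made up.
interactions = [(1, 1), (1, 0), (2, 1), (1, 1)]

DEFAULT_ACCURACY = 0.645  # same fallback the functions above use for unseen users

user_stats = {}        # user_id -> {'answered': int, 'correct': int}
accuracy_feature = []  # what the model is allowed to see for each row

for user_id, answered_correctly in interactions:
    stats = user_stats.setdefault(user_id, {'answered': 0, 'correct': 0})

    # 1) Read the feature from history only, before looking at the current answer.
    if stats['answered'] == 0:
        accuracy_feature.append(DEFAULT_ACCURACY)
    else:
        accuracy_feature.append(stats['correct'] / stats['answered'])

    # 2) Only then fold the current answer into the running totals.
    stats['answered'] += 1
    stats['correct'] += answered_correctly
```

Swapping steps 1 and 2 would leak the target into the feature, which is the kind of data leakage mentioned earlier.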

### Preprocessing Training DataSet

We start off by adding and updating the user lecture stats, and then removing all of the lecture rows from the data, as we can't make predictions on these rows.

Then come the next three functions specified above:

* update_user_max_timestamp
* add_and_update_user_stats
* add_and_update_content_stats

I then merged the questions metadata on the content_id so that I could add specific part and tag statistics. 

Then came filling null values: the lecture features with zeros, the time features with their respective means, and anything left over with 0.5.

Re-ordered the columns so that the predicted column is found at the end of the dataframe.

Replaced numpy inf values with NaN and filled those the same way: time features with their means, everything else with zeros.


In [None]:
%%time
lect_stats_part, train_df = add_and_update_user_lects(user_lecture_stats_part, lectures_df, train_df)

train_df = train_df[train_df[target] != -1]

user_lagtime_max_dict = update_user_max_timestamp(user_lagtime_max_dict, train_df)

user_answers_df, train_df = add_and_update_user_stats(user_answers_df, train_df)
content_answers_df, train_df = add_and_update_content_stats(content_answers_df, train_df)

train_df = pd.merge(train_df, questions[['part','tags']], left_on='content_id', right_index=True, how = 'left')
train_df = train_df.merge(part_answers_df, how='left', left_on='part', right_index=True)
train_df = train_df.merge(tags_answers_df, how='left', left_on='tags', right_index=True)

# lecture features: a missing value means the user has not seen any such lecture
lecture_feature_columns = part_lectures_columns + types_of_lectures_columns + ['lectures_seen']
train_df[lecture_feature_columns] = train_df[lecture_feature_columns].fillna(0)

# time features: fill with their respective means
train_df = train_df.fillna(value={
    'prior_question_elapsed_time': 25423,
    'avg_q_time': 25423,
    'lag_time': 26161758,
})

train_df[types_of_lectures_columns + part_lectures_columns] = train_df[types_of_lectures_columns + part_lectures_columns].astype(int)
train_df['lectures_seen'] = train_df['lectures_seen'].astype(int)

train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].fillna(value=False).astype(bool)
train_df = train_df.fillna(value=0.5)

train_df = train_df[features + [target]]
train_df = train_df.replace([np.inf, -np.inf], np.nan)
train_df = train_df.fillna(value={
    'prior_question_elapsed_time': 25423,
    'avg_q_time': 25423,
    'lag_time': 26161758,
})

train_df = train_df.fillna(0)
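
The inf handling at the end of the cell above can be seen in isolation in this small sketch on made-up data. Converting infinities to NaN first means one fill strategy covers both kinds of bad value; the column names and the lag_time fill value are taken from the notebook, everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Made-up feature frame: a division by zero has produced an infinity.
demo_df = pd.DataFrame({
    'q_mean_accuracy': [0.5, np.inf, 0.8],
    'lag_time': [1000.0, np.nan, -np.inf],
})

# Step 1: turn +/-inf into NaN so a single fill strategy covers both cases.
demo_df = demo_df.replace([np.inf, -np.inf], np.nan)

# Step 2: fill the time feature with its own value, then everything else with zero.
demo_df = demo_df.fillna({'lag_time': 26161758}).fillna(0)
```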

The train_test_split was only used when tuning hyperparameters and selecting features, as it allowed me to evaluate the model's predictive performance on held-out data.

In [None]:
#train_df, test_df = train_test_split(train_df, random_state=314, test_size=0.2)

The next few cells are another aspect of this competition where I gained some very useful skills. Before this competition I was used to manually altering hyperparameters through loops and grid search methods. Optuna is really helpful as it provides an intuitive way to optimize hyperparameters.

In [None]:
sampler = TPESampler(seed=314)

def create_model(trial):
    num_leaves = trial.suggest_int("num_leaves", 20, 40)
    n_estimators = trial.suggest_int("n_estimators", 50, 400)
    max_depth = trial.suggest_int('max_depth', 3, 8)
    min_child_samples = trial.suggest_int('min_child_samples', 100, 1200)
    # suggest_float is the current name for uniform float sampling in Optuna
    learning_rate = trial.suggest_float('learning_rate', 0.0001, 0.30)
    # note: min_data_in_leaf is a LightGBM alias of min_child_samples
    min_data_in_leaf = trial.suggest_int('min_data_in_leaf', 5, 90)
    bagging_fraction = trial.suggest_float('bagging_fraction', 0.50, 1.0)
    feature_fraction = trial.suggest_float('feature_fraction', 0.50, 1.0)
    
    model = LGBMClassifier(
        num_leaves=num_leaves,
        n_estimators=n_estimators, 
        max_depth=max_depth, 
        min_child_samples=min_child_samples, 
        min_data_in_leaf=min_data_in_leaf,
        learning_rate=learning_rate,
        # bagging_fraction only has an effect when bagging_freq is set above 0
        bagging_fraction=bagging_fraction,
        feature_fraction=feature_fraction,
        random_state=314
    )
    return model

def objective(trial):
    model = create_model(trial)
    model.fit(train_df[features], train_df[target])
    score = roc_auc_score(test_df[target].values, model.predict_proba(test_df[features])[:,1])
    return score

Uncommenting the following cell gets Optuna up and searching for even better parameters. The one thing I could not figure out, but which would be helpful, is how to set the starting parameters that it uses. I could only figure out how to randomize the parameter starting point.

In [None]:
#study = optuna.create_study(direction="maximize", sampler=sampler)
#study.optimize(objective, n_trials=50)
#params = study.best_params
#params['random_state'] = 314

If a train_test_split was performed here you can uncomment the print line to see the roc_auc_score of the model, and evaluate how effective it is.

In [None]:
params = {'num_leaves': 30,
          'n_estimators': 300,
          'max_depth': 5,
          'min_child_samples': 371,
          'learning_rate': 0.28285171125399805,
          'min_data_in_leaf': 23,
          'bagging_fraction': 0.8057106694835638,
          'feature_fraction': 0.5688885590495344,
         }

model = LGBMClassifier(**params)
model.fit(train_df[features], train_df[target])
#print('LGB score: ', roc_auc_score(test_df[target].values, model.predict_proba(test_df[features])[:,1]))

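This is a fairly quick way of checking how important the features are in the model. Note that by default LightGBM's `feature_importances_` uses `importance_type='split'`, which counts how many times each feature is used to split. The amount by which each feature's splits improve the objective is the `'gain'` importance, available by passing `importance_type='gain'` to `LGBMClassifier`. Either seems to be a reasonable first indicator of feature importance for LGB models.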

In [None]:
print(model.feature_importances_)
print(train_df.columns[:-1])

#we can use train_df.columns[:-1] only because the target column is at the end of the dataframe

pd.DataFrame({'importance': model.feature_importances_},
                index=train_df.columns[:-1]).sort_values(by='importance', ascending=False)

As I mentioned at the start of the notebook, I could not keep the dictionaries that I used in preprocessing in memory while I trained the LGB model. This wasn't too much of a problem, as I simply loaded the entire dataset back into memory and rebuilt the necessary dictionaries for use on the test data.

In [None]:
del train_df

all_data = pd.read_pickle('../input/riiid-train-df/train_df.pkl.gzip')
all_data = all_data[all_data[target] != -1]

In [None]:
# fill dictionary with all the questions that users have seen before
def get_me_all_seen_befores(all_data):
    # add user content attempts (both results come back sorted by user_id)
    user_content = all_data.groupby('user_id')['content_id'].apply(np.array).apply(np.sort).apply(np.unique)
    user_attempts = all_data.groupby(['user_id', 'content_id'])['content_id'].count().astype(np.uint8).groupby('user_id').apply(np.array).values

    # key the state by the groupby index so users and their counts stay aligned
    state = dict()
    for user_id, content, attempt in tqdm(zip(user_content.index, user_content, user_attempts), total=len(user_content)):
        state[user_id] = dict(zip(content, attempt))

    del user_content, user_attempts, all_data

    return(state)

state = get_me_all_seen_befores(all_data)
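
The resulting `state` is a nested dictionary: `state[user_id][content_id]` is the number of times that user has attempted that question. Here is a small pure-Python sketch of the same structure, built from made-up interaction pairs rather than the real data (the helper name `build_seen_state` is just for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical (user_id, content_id) interactions, questions only.
toy_interactions = [(1, 10), (1, 11), (1, 10), (2, 10), (1, 10)]

def build_seen_state(interactions):
    """Return {user_id: {content_id: attempts}} from (user, content) pairs."""
    per_user = defaultdict(Counter)
    for user_id, content_id in interactions:
        per_user[user_id][content_id] += 1
    # plain dicts so that later membership checks never create keys by accident
    return {user: dict(counts) for user, counts in per_user.items()}

toy_state = build_seen_state(toy_interactions)
```

Here user 1 has attempted question 10 three times and question 11 once, so `toy_state[1]` is `{10: 3, 11: 1}`.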

### Final Preds on Test Data

Traditionally it is frowned upon to add loops into the test data pipeline, but I agree with Tito's notebook that, in this instance, updating the features as new answers arrive does increase model performance. I tested this, and my score was much better with the loops than without.

The following functions apply essentially the same logic as the training preprocessing functions. A few of them are new here: they update features that, for the training data, were brought in from another notebook.

Functions:

* update_content_stats
* update_user_stats
* update_lect_stats
* add_and_update_seen_before
* add_and_update_lag_time

In [None]:
def update_content_stats(content_answers_df, previous_test_df):
    for row in previous_test_df[['content_id','answered_correctly','content_type_id']].values:
        if row[2] == 0:
            content_answers_df.at[row[0],'q_question_correct'] += row[1]
            content_answers_df.at[row[0],'q_question_asked'] += 1
    content_answers_df['q_mean_accuracy']= content_answers_df['q_question_correct']/content_answers_df['q_question_asked']

In [None]:
def update_user_stats(user_answers_df, previous_test_df):
    for row in previous_test_df[['user_id','answered_correctly','content_type_id']].values:
        if row[2] == 0:
            try:
                user_answers_df.at[row[0],'user_questions_correct'] += row[1]
                user_answers_df.at[row[0],'user_questions_answered'] += 1
            except KeyError:
                # user not seen before: add a new row (values follow the frame's column order)
                user_answers_df.loc[row[0]] = [0, 1, row[1]]
    user_answers_df['user_mean_accuracy']= user_answers_df['user_questions_correct']/user_answers_df['user_questions_answered']
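
Both update functions follow the same pattern: add the newly revealed answers to running counters, then recompute accuracy as correct / answered. Below is a dictionary-based sketch of that pattern. The names `update_running_accuracy` and `toy_stats` are hypothetical and the data is made up; it also covers a user who has not been seen before.

```python
def update_running_accuracy(stats, answers):
    """Update {key: [correct, answered]} in place and return accuracies.

    `answers` is an iterable of (key, answered_correctly, content_type_id);
    lecture rows (content_type_id == 1) are skipped.
    """
    for key, correct, content_type_id in answers:
        if content_type_id != 0:
            continue
        counters = stats.setdefault(key, [0, 0])  # unseen key starts from zero
        counters[0] += correct
        counters[1] += 1
    return {key: c / n for key, (c, n) in stats.items()}

toy_stats = {100: [3, 4]}  # user 100: 3 of 4 correct so far
toy_batch = [(100, 0, 0), (200, 1, 0), (100, -1, 1)]  # last row is a lecture
toy_accuracy = update_running_accuracy(toy_stats, toy_batch)
```

After the update, user 100 is at 3 of 5 correct and the new user 200 starts at 1 of 1, while the lecture row leaves the counters untouched.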

In [None]:
def update_lect_stats(user_lecture_stats_part, lectures_df, previous_test_df):
    for row in previous_test_df[['user_id','content_type_id', 'content_id']].values:
        if row[1] == 1:
            # keep only the non-zero columns describing this lecture
            y = lectures_df.query('lecture_id == {}'.format(row[2])).drop(columns='tag').reset_index(drop=True)
            y = y.loc[:, (y != 0).any(axis=0)]
            # first lecture for this user: start all counters at zero
            if row[0] not in user_lecture_stats_part.index:
                user_lecture_stats_part.loc[row[0]] = 0
            for i in y.columns:
                user_lecture_stats_part.at[row[0], i] +=1
            user_lecture_stats_part.at[row[0], 'lectures_seen'] +=1

In [None]:
def add_and_update_seen_before(train_df, state):
    big_list=[]
    for user_id, content_id, content_type_id in train_df[['user_id','content_id','content_type_id']].values:
        if content_type_id == 0:
            # attempts this user has already made on this question (0 if none)
            seen = state.setdefault(user_id, {})
            big_list.append(seen.get(content_id, 0))
            seen[content_id] = seen.get(content_id, 0) + 1
        else:
            # lectures never count as seen questions
            big_list.append(0)

    train_df['seen_before'] = big_list
    train_df['seen_before'] = train_df['seen_before'].clip(upper=5)
    return train_df, state

In [None]:
def add_and_update_lag_time(test_df, user_lagtime_max_dict):
    
    lag_time_list = []
    
    for pair in test_df[['user_id','timestamp']].values:
        if pair[0] in user_lagtime_max_dict:
            if pair[1] != user_lagtime_max_dict[pair[0]]['user_lag_time_max']:
                lag_time_list.append(pair[1] - (user_lagtime_max_dict[pair[0]]['user_lag_time_max']))
                user_lagtime_max_dict[pair[0]]['last_lagtime'] = pair[1] - (user_lagtime_max_dict[pair[0]]['user_lag_time_max'])
                user_lagtime_max_dict[pair[0]]['user_lag_time_max'] = pair[1]
            else:
                lag_time_list.append(user_lagtime_max_dict[pair[0]]['last_lagtime'])
        else:
            lag_time_list.append(0)
            user_lagtime_max_dict[pair[0]] = {}
            user_lagtime_max_dict[pair[0]]['user_lag_time_max'] = pair[1]
            user_lagtime_max_dict[pair[0]]['last_lagtime'] = 0
            
    test_df['lag_time'] = lag_time_list
            
    return(test_df, user_lagtime_max_dict)
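
Here is a pure-Python sketch of the lag-time logic on made-up timestamps. It assumes, as the function above does, that rows sharing a user's latest timestamp belong to the same bundle and should reuse the lag already computed for it. The names `lag_times` and `last_seen` are illustrative only.

```python
def lag_times(events, last_seen):
    """Return one lag per (user_id, timestamp) event, updating `last_seen`.

    `last_seen` maps user_id -> (latest_timestamp, lag_at_that_timestamp).
    """
    lags = []
    for user_id, timestamp in events:
        if user_id not in last_seen:
            lag = 0  # first interaction: no previous event
        else:
            previous_ts, previous_lag = last_seen[user_id]
            if timestamp == previous_ts:
                lag = previous_lag  # same bundle: reuse its lag
            else:
                lag = timestamp - previous_ts
        last_seen[user_id] = (timestamp, lag)
        lags.append(lag)
    return lags

toy_last_seen = {7: (1000, 200)}
toy_events = [(7, 1500), (7, 1500), (8, 0), (7, 4000)]
toy_lags = lag_times(toy_events, toy_last_seen)
```

For this toy input `toy_lags` comes out as `[500, 500, 0, 2500]`: the two rows at timestamp 1500 share one lag, and the new user 8 starts at 0.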

### Calling environment to Make Preds

In [None]:
env = riiideducation.make_env()
iter_test = env.iter_test()

previous_test_df = None

In [None]:
import ast

# fallback values for features that are missing when a user or question has no history yet
fill_values = {col: 0 for col in ['part_1', 'part_2', 'part_3', 'part_4', 'part_5', 'part_6', 'part_7',
                                  'type_of_concept', 'type_of_intention', 'type_of_solving_question',
                                  'type_of_starter', 'lectures_seen', 'avg_q_time']}
fill_values.update({'prior_question_elapsed_time': 25423, 'lag_time': 26161758})

for (test_df, sample_prediction_df) in iter_test:
    if previous_test_df is not None:
        # the first row of each batch carries the true answers for the previous batch;
        # literal_eval parses the list without executing arbitrary code
        previous_test_df[target] = ast.literal_eval(test_df["prior_group_answers_correct"].iloc[0])
        update_content_stats(content_answers_df, previous_test_df)
        update_user_stats(user_answers_df, previous_test_df)
        update_lect_stats(user_lecture_stats_part, lectures_df, previous_test_df)

    test_df, state = add_and_update_seen_before(test_df, state)
    test_df, user_lagtime_max_dict = add_and_update_lag_time(test_df, user_lagtime_max_dict)

    test_df = pd.merge(test_df, questions[['part','tags']], left_on='content_id', right_index=True, how = 'left')

    test_df = test_df.merge(user_answers_df, how='left', left_on='user_id',right_index=True)
    test_df = test_df.merge(content_answers_df, how='left', left_on='content_id', right_index=True)
    test_df = test_df.merge(part_answers_df, how='left', left_on='part', right_index=True)
    test_df = test_df.merge(tags_answers_df, how='left', left_on='tags', right_index=True)
    test_df = test_df.merge(user_lecture_stats_part,  how='left', left_on='user_id', right_index=True)

    # fill per-column defaults in one call instead of chained inplace fills
    test_df = test_df.fillna(fill_values)
    test_df['lectures_seen'] = test_df['lectures_seen'].astype(int)

    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(value=False).astype(bool)
    # anything still missing (for example accuracy of unseen users) falls back to 0.6
    test_df = test_df.fillna(value = 0.6)

    test_df['answered_correctly'] = model.predict_proba(test_df[features])[:,1]
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
    previous_test_df = test_df.copy()
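
The loop relies on each batch revealing the labels of the previous batch in the first row of `prior_group_answers_correct`, stored as a string such as `'[1, 0]'`. Below is a minimal sketch of that predict-then-update pattern with a mock iterator. The `mock_batches` data and the `predict` stand-in are hypothetical and do not use the competition API.

```python
import ast

# Hypothetical batches: (user_ids, prior_group_answers_correct string).
mock_batches = [
    ([1, 2], "[]"),      # first batch: nothing to learn from yet
    ([1, 3], "[1, 0]"),  # reveals the answers of batch 1
    ([2], "[1, 1]"),     # reveals the answers of batch 2
]

correct, answered = {}, {}

def predict(user_id):
    """Stand-in model: the user's running accuracy, or 0.5 if unseen."""
    if answered.get(user_id, 0) == 0:
        return 0.5
    return correct[user_id] / answered[user_id]

previous_users = None
predictions = []
for user_ids, prior_answers in mock_batches:
    if previous_users is not None:
        # literal_eval parses the list without executing arbitrary code
        labels = ast.literal_eval(prior_answers)
        for user_id, label in zip(previous_users, labels):
            correct[user_id] = correct.get(user_id, 0) + label
            answered[user_id] = answered.get(user_id, 0) + 1
    predictions.append([predict(u) for u in user_ids])
    previous_users = user_ids
```

User 1 answers correctly in the first batch, so their prediction in the second batch rises from 0.5 to 1.0, while user 2's wrong answer pulls their later prediction down to 0.0.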

# Closing Notes

### Things I would do differently

With more time I would have liked to implement a model trained on the entire dataset. I experimented with this a little near the end and found that dropping the low-importance features and increasing the size of the training dataset improved my model score.

I would also have liked to experiment more with deep learning models, as this was such a large dataset and these kinds of models tend to perform well on large datasets. I did attempt to create a deep learning model, but it was not as accurate as the LGB model I trained. I am still unsure whether this was due to hyperparameter tuning, the layers I selected, or data preprocessing.

### Things I would like to improve on

The big takeaway from this competition is that I really need to become more proficient in using deep learning models on tabular datasets. I have spent a good portion of time on this already, and am learning something new every day.

Furthermore, I have been exploring automated layer optimization frameworks in order to give my deep learning models the best structure. Beyond fine hyperparameter tuning, I have not yet found a framework that optimizes the layers themselves. Once I understand a bit more about the topic, maybe I could create a package myself?