In this notebook, I generate four objects that summarize performance on each tag: two for each individual user and two aggregated over all users.

In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm

Loading train.csv

In [2]:
train_dtypes_dict = {
    "row_id": "int64",
    #"timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    #"task_container_id": "int16",
    #"user_answer": "int8",
    "answered_correctly": "int8",
    #"prior_question_elapsed_time": "float32", 
    #"prior_question_had_explanation": "boolean"
}

train_data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv",
                         nrows=10**5,
                         usecols = train_dtypes_dict.keys(),
                         dtype=train_dtypes_dict,
                         #index_col = 0,
                        )
train_data = train_data[train_data.content_type_id == 0]

Loading questions.csv

In [3]:
question_dtype = {
    'question_id':'int16',
    'tags':'object'
}
questions_data = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv',
                             usecols = question_dtype.keys(), 
                             dtype = question_dtype)

One question has no tags. Here I fill it with tag '92'; substitute a different tag if you prefer.

In [4]:
print(questions_data.loc[questions_data.tags.isnull()])
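# Assign the result back: inplace=True on a selected column may not update the DataFrame in recent pandas.
questions_data['tags'] = questions_data['tags'].fillna('92')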

       question_id tags
10033        10033  NaN


How many questions? How many tags?

In [5]:
tags_set = set()
print(len(questions_data))
# Collect every distinct tag id that appears in any question.
for tags in questions_data.tags:
    tags_set.update(tags.split())
print(tags_set)
print(len(tags_set))

13523
{'181', '97', '147', '95', '32', '152', '106', '45', '75', '139', '115', '46', '178', '168', '34', '21', '146', '136', '164', '183', '19', '130', '113', '175', '107', '173', '91', '89', '48', '35', '37', '137', '156', '157', '98', '41', '88', '108', '14', '82', '110', '162', '10', '141', '23', '102', '120', '119', '96', '24', '145', '70', '62', '153', '51', '170', '131', '83', '73', '72', '78', '185', '165', '86', '38', '1', '42', '79', '154', '22', '99', '158', '182', '15', '149', '133', '12', '9', '167', '93', '81', '140', '49', '135', '143', '174', '127', '103', '40', '138', '59', '150', '118', '84', '128', '53', '13', '126', '111', '166', '117', '124', '169', '29', '142', '104', '64', '112', '90', '50', '30', '122', '0', '100', '8', '58', '105', '20', '176', '63', '25', '76', '87', '61', '36', '68', '114', '54', '172', '85', '47', '18', '3', '159', '125', '17', '28', '7', '94', '148', '60', '2', '65', '144', '186', '161', '67', '26', '11', '56', '71', '52', '171', '44', '31',

There are 188 tags, so each question can be represented as a 188-dimensional multi-hot vector: entry i is 1 if the question has tag i and 0 otherwise.

In [6]:
def gen_vec(row):
    row['vec'] = np.zeros(188)
    index_list = row.tags.split()
    for index_ in index_list:
        row.vec[int(index_)] = 1.0
    return row

questions_data = questions_data.apply(gen_vec, axis='columns')
questions_data.head()

   question_id            tags                                                vec
0            0   51 131 162 38  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1            1       131 36 81  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2            2  131 101 162 92  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3            3  131 149 162 29  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4            4    131 5 162 38  [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...
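
As an alternative to storing one array per row, the same encoding can be held in a single 2-D array, which is often easier to index later. The following is only a sketch: build_tag_matrix, N_TAGS and toy_questions are hypothetical names that do not appear in the notebook, and the toy rows simply reuse the first two tag strings shown above.

```python
import numpy as np
import pandas as pd

N_TAGS = 188  # number of distinct tags, as stated above


def build_tag_matrix(tags, n_tags=N_TAGS):
    """Return an (n_questions, n_tags) multi-hot matrix from space-separated tag strings."""
    mat = np.zeros((len(tags), n_tags), dtype=np.float32)
    for i, tag_str in enumerate(tags):
        # Set the columns named by this question's tag ids to 1.
        mat[i, [int(t) for t in tag_str.split()]] = 1.0
    return mat


# Toy stand-in for questions_data (assumption: tags are already filled, no NaN).
toy_questions = pd.DataFrame(
    {"question_id": [0, 1], "tags": ["51 131 162 38", "131 36 81"]}
)
tag_matrix = build_tag_matrix(toy_questions["tags"])
print(tag_matrix.shape)                      # (2, 188)
print(tag_matrix[1].nonzero()[0].tolist())   # [36, 81, 131]
```

Row i of tag_matrix plays the same role as questions_data.vec[i], provided question_id matches the row position.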


From these vectors we can compute each user's performance on every tag, as well as the performance of all users combined.
I create four objects:
1. user_ele_dict is a dictionary that maps each user_id to a vector counting, per tag, the questions that user answered correctly.
2. user_num_dict is a dictionary that maps each user_id to a vector counting, per tag, the questions that user answered.
3. ques_ele_vec is a vector counting, per tag, the questions answered correctly by all users.
4. ques_num_vec is a vector counting, per tag, the questions answered by all users.
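
To make the four objects concrete, here is a toy illustration with made-up data: 3 tags, 2 questions and 2 users. The values in question_vecs and answers are invented for the example and are not taken from train.csv.

```python
import numpy as np

# Hypothetical setup: 3 tags instead of 188.
question_vecs = {
    0: np.array([1.0, 1.0, 0.0]),  # question 0 carries tags 0 and 1
    1: np.array([0.0, 1.0, 1.0]),  # question 1 carries tags 1 and 2
}
# Each tuple is (user_id, content_id, answered_correctly).
answers = [(7, 0, 1), (7, 1, 0), (9, 1, 1)]

user_ele_dict, user_num_dict = {}, {}
ques_ele_vec, ques_num_vec = np.zeros(3), np.zeros(3)

for user_id, content_id, correct in answers:
    vec = question_vecs[content_id]
    # Start a new user from zero counts.
    user_num_dict.setdefault(user_id, np.zeros(3))
    user_ele_dict.setdefault(user_id, np.zeros(3))
    user_num_dict[user_id] += vec            # answered counts per tag
    user_ele_dict[user_id] += vec * correct  # correct counts per tag
    ques_num_vec += vec
    ques_ele_vec += vec * correct

print(user_ele_dict[7].tolist())  # [1.0, 1.0, 0.0]
print(user_num_dict[7].tolist())  # [1.0, 2.0, 1.0]
print(ques_ele_vec.tolist())      # [1.0, 2.0, 1.0]
print(ques_num_vec.tolist())      # [1.0, 3.0, 2.0]
```

User 7 answered two questions that share tag 1 but got only one right, so tag 1 shows 1 correct out of 2 answered.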

In [7]:
def cal_vec(train_row, ele_dict, num_dict, q_data=questions_data):
    # Add this question's tag vector to the user's answered and correct counts.
    vec = q_data.vec[train_row.content_id]
    num_dict[train_row.user_id] += vec
    ele_dict[train_row.user_id] += vec * train_row.answered_correctly

user_ele_dict = dict()
user_num_dict = dict()
ques_ele_vec = np.zeros(188)
ques_num_vec = np.zeros(188)

for index, row in tqdm(train_data.iterrows()):
    vec = questions_data.vec[row.content_id]
    ques_ele_vec += vec * row.answered_correctly
    ques_num_vec += vec
    if row.user_id not in user_ele_dict:
        # First row for this user: start from zero counts.
        user_ele_dict[row.user_id] = np.zeros(188)
        user_num_dict[row.user_id] = np.zeros(188)
    cal_vec(row, user_ele_dict, user_num_dict)

98182it [00:26, 3758.98it/s]


Processing the whole train.csv this way takes hours, so I have uploaded my results to the "pretrained-for-riiid" folder in case you want to use them.
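
One possible way to avoid the per-row Python loop is to look up all tag vectors at once and accumulate them with NumPy. This is a sketch, not the method used to produce the uploaded files: aggregate_by_tag, toy_tags and toy_train are hypothetical names, and the function assumes the tag vectors are stacked in a 2-D array whose row i belongs to question i.

```python
import numpy as np
import pandas as pd


def aggregate_by_tag(train, tag_matrix):
    """Return (user_ele_dict, user_num_dict, ques_ele_vec, ques_num_vec).

    train      : DataFrame with user_id, content_id, answered_correctly
    tag_matrix : (n_questions, n_tags) multi-hot array, row i = question i
    """
    rows = tag_matrix[train["content_id"].to_numpy()]            # (n_rows, n_tags)
    correct = rows * train["answered_correctly"].to_numpy()[:, None]
    users, inverse = np.unique(train["user_id"].to_numpy(), return_inverse=True)
    user_num = np.zeros((len(users), tag_matrix.shape[1]))
    user_ele = np.zeros_like(user_num)
    # np.add.at accumulates correctly when the same user index repeats.
    np.add.at(user_num, inverse, rows)
    np.add.at(user_ele, inverse, correct)
    user_num_dict = dict(zip(users.tolist(), user_num))
    user_ele_dict = dict(zip(users.tolist(), user_ele))
    return user_ele_dict, user_num_dict, correct.sum(axis=0), rows.sum(axis=0)


# Toy data (made up): 3 tags, 2 questions, 2 users.
toy_tags = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
toy_train = pd.DataFrame(
    {"user_id": [7, 7, 9], "content_id": [0, 1, 1], "answered_correctly": [1, 0, 1]}
)
ele_d, num_d, q_ele, q_num = aggregate_by_tag(toy_train, toy_tags)
print(num_d[7].tolist())  # [1.0, 2.0, 1.0]
print(q_num.tolist())     # [1.0, 3.0, 2.0]
```

The intermediate rows array has one 188-wide row per training row, so for the full train.csv it would likely need to be applied chunk by chunk, with the partial results summed.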

In [8]:
import pickle
with open('../input/pretrained-for-riiid/user_ele_dict.pkl', 'rb') as f:
    user_ele_dict = pickle.load(f)

with open('../input/pretrained-for-riiid/user_num_dict.pkl', 'rb') as f:
    user_num_dict = pickle.load(f)

with open('../input/pretrained-for-riiid/ques_ele_vec.pkl', 'rb') as f:
    ques_ele_vec = pickle.load(f)

with open('../input/pretrained-for-riiid/ques_num_vec.pkl', 'rb') as f:
    ques_num_vec = pickle.load(f)  
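
For reference, files like these can be written with pickle.dump. The sketch below uses small stand-in objects and a temporary directory; out_dir and results are hypothetical names, and this is not necessarily how the uploaded files were created.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-ins for the four objects (assumption: same types as in this notebook).
results = {
    "user_ele_dict": {115: np.zeros(188)},
    "user_num_dict": {115: np.zeros(188)},
    "ques_ele_vec": np.zeros(188),
    "ques_num_vec": np.zeros(188),
}

out_dir = tempfile.mkdtemp()  # any writable directory would do
for name, obj in results.items():
    with open(os.path.join(out_dir, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)

# Read one file back to confirm the round trip.
with open(os.path.join(out_dir, "ques_num_vec.pkl"), "rb") as f:
    restored = pickle.load(f)
print(restored.shape)  # (188,)
```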

Check them out.

In [9]:
print(user_ele_dict[115])  # user_id = 115
print(user_num_dict[115])
print(ques_ele_vec)
print(ques_num_vec)

[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  2.  6.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  8.  0.  0.  0.  0.  0.  0.
  1.  0.  3.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  2.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  9.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0. 12.  4.  2.  0.  0.  0.  0.  1.  0.  2.  1.  1.  2.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  2. 20.  0.  0.  0.  0.  2.  0.  0.  0.  1.  0.  0.  1.
  0.  0.  0.  0.  0.  2.  0.  1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.
  8.  0.  1.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  2.]
[ 0.  0.  0.  0.  0.  4.  0.  0.  0.  3.  8.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. 12.  0.  0.  0.  0.  0.  0.
  1.  0.  4.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  2.  0.  0.
  0.  0.  0.  0. 