In [1]:
import pandas as pd


*lectures.csv*: metadata for the lectures watched by users as they progress in their education.  
*lecture_id*: foreign key for the train/test content_id column, when the content type is lecture (1).  
*part*: top level category code for the lecture.  
*tag*: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.  
*type_of*: brief description of the core purpose of the lecture.  

In [2]:
df_l = pd.read_csv('lectures.csv')
df_l.head()

Unnamed: 0,lecture_id,tag,part,type_of
0,89,159,5,concept
1,100,70,1,concept
2,185,45,6,concept
3,192,79,5,solving question
4,317,156,5,solving question


**questions.csv**: metadata for the questions posed to users.  
**question_id**: foreign key for the train/test content_id column, when the content type is question (0).  
**bundle_id**: code for which questions are served together.  
**correct_answer**: the answer to the question. Can be compared with the train user_answer column to check if the user was right.  
**part**: the relevant section of the TOEIC test.  
**tags**: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.

In [3]:
df_q = pd.read_csv('questions.csv')
df_q.head()

Unnamed: 0,question_id,bundle_id,correct_answer,part,tags
0,0,0,0,1,51 131 162 38
1,1,1,1,1,131 36 81
2,2,2,0,1,131 101 162 92
3,3,3,0,1,131 149 162 29
4,4,4,3,1,131 5 162 38
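
Since `tags` holds one or more codes in a single space-separated string, it usually needs to be split before questions can be clustered by tag. The following is a minimal sketch on three rows copied from the `df_q.head()` output above. The names `df_q_demo`, `tag_list` and `tag_long` are illustrative, and the `fillna("")` guard assumes that some rows might have no tags, which the output above does not confirm.

```python
import pandas as pd

# Three rows copied from the df_q.head() output above
df_q_demo = pd.DataFrame({
    "question_id": [0, 1, 2],
    "tags": ["51 131 162 38", "131 36 81", "131 101 162 92"],
})

# Split each space-separated string into a list of integer tag codes.
# fillna("") is a guard for possibly missing tags (an assumption).
df_q_demo["tag_list"] = (
    df_q_demo["tags"]
    .fillna("")
    .str.split()
    .apply(lambda codes: [int(c) for c in codes])
)

# Long format: one row per (question, tag) pair, convenient for grouping by tag
tag_long = df_q_demo.explode("tag_list")[["question_id", "tag_list"]]
```

The long format makes it straightforward to count questions per tag with `tag_long.groupby("tag_list").size()`.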


**row_id**: (int64) ID code for the row.  
**timestamp**: (int64) the time in milliseconds between this user interaction and the first event completion from that user.  
**user_id**: (int32) ID code for the user.  
**content_id**: (int16) ID code for the user interaction.  
**content_type_id**: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.  
**task_container_id**: (int16) ID code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.  
**user_answer**: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.  
**answered_correctly**: (int8) if the user responded correctly. Read -1 as null, for lectures.  
**prior_question_elapsed_time**: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture.  
**prior_question_had_explanation**: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.  

In [4]:
df_t = pd.read_csv('train.csv')
df_t.head()

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,0,115,5692,0,1,3,1,,
1,1,56943,115,5716,0,2,2,1,37000.0,False
2,2,118363,115,128,0,0,0,1,55000.0,False
3,3,131167,115,7860,0,3,0,1,19000.0,False
4,4,137965,115,7922,0,4,1,1,11000.0,False
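
The column descriptions above list a compact dtype for each field, but a plain `pd.read_csv('train.csv')` infers wider types (for example `int64` for every integer column). With roughly 100 million rows, passing the dtypes explicitly can reduce memory use considerably. The sketch below shows the idea on three rows copied from the output above, read from an in-memory string so that it runs without the real file. The name `TRAIN_DTYPES` is illustrative, and using pandas' nullable `"boolean"` dtype for `prior_question_had_explanation` is an assumption made here to keep the missing values.

```python
import io
import pandas as pd

# Dtypes taken from the column descriptions above; "boolean" (nullable) is an assumption
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",
}

# Three rows copied from the df_t.head() output above (index column dropped)
sample_csv = """row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,115,5692,0,1,3,1,,
1,56943,115,5716,0,2,2,1,37000.0,False
2,118363,115,128,0,0,0,1,55000.0,False
"""

# For the real data, replace io.StringIO(sample_csv) with 'train.csv'
df_t_demo = pd.read_csv(io.StringIO(sample_csv), dtype=TRAIN_DTYPES)
```

Calling `df_t_demo.dtypes` confirms the narrower types, and `df_t_demo.memory_usage(deep=True)` gives the per-column footprint.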


In [5]:
df_t.user_answer.value_counts()

 0    28186489
 1    26990007
 3    26084784
 2    18010020
-1     1959032
Name: user_answer, dtype: int64

In [6]:
df_t.answered_correctly.value_counts()

 1    65244627
 0    34026673
-1     1959032
Name: answered_correctly, dtype: int64

In [8]:
df_q.correct_answer.value_counts()

0    3716
3    3544
1    3478
2    2785
Name: correct_answer, dtype: int64
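
The description of `correct_answer` notes that it can be compared with `user_answer` to check whether the user was right. One way to do this is to join the question rows of the training data to the questions table on `content_id` = `question_id`. The sketch below uses small toy frames whose values are made up for illustration and are not taken from the real files. The names `questions`, `train`, `q_rows` and `recomputed` are likewise hypothetical.

```python
import pandas as pd

# Toy values for illustration only; NOT taken from questions.csv or train.csv
questions = pd.DataFrame({"question_id": [10, 11], "correct_answer": [2, 0]})
train = pd.DataFrame({
    "content_id": [10, 11, 10],
    "content_type_id": [0, 0, 1],      # the last row is a lecture
    "user_answer": [2, 3, -1],
    "answered_correctly": [1, 0, -1],
})

# Keep question rows only, then attach the answer key via content_id -> question_id
q_rows = train[train["content_type_id"] == 0].merge(
    questions, left_on="content_id", right_on="question_id", how="left"
)

# Recompute correctness and compare it with the provided label
q_rows["recomputed"] = (q_rows["user_answer"] == q_rows["correct_answer"]).astype("int8")
mismatches = int((q_rows["recomputed"] != q_rows["answered_correctly"]).sum())
```

Filtering on `content_type_id == 0` first matters because `content_id` refers to a lecture when the content type is 1, so an unfiltered join would match lecture rows to unrelated questions.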

In [9]:
df_t.answered_correctly.describe()

count    1.012303e+08
mean     6.251644e-01
std      5.225307e-01
min     -1.000000e+00
25%      0.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
Name: answered_correctly, dtype: float64
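
Note that the mean reported by `describe()` is taken over all rows, including the lecture rows coded as -1, so it is not the share of correctly answered questions. The arithmetic can be checked directly from the `value_counts()` output above. The variable names are illustrative.

```python
# Counts copied from the answered_correctly value_counts() output above
n_correct = 65_244_627
n_wrong = 34_026_673
n_lecture = 1_959_032   # rows coded -1

n_total = n_correct + n_wrong + n_lecture

# Mean over all rows: each lecture row contributes -1 (this is what describe() reports)
mean_all = (n_correct - n_lecture) / n_total

# Share of correct answers among question rows only
accuracy = n_correct / (n_correct + n_wrong)

print(f"mean over all rows: {mean_all:.4f}")
print(f"accuracy on questions: {accuracy:.4f}")
```

The first value reproduces the `mean` of about 0.625 shown above, while the accuracy on question rows alone is about 0.657. On the DataFrame itself, the same figure could be obtained by filtering with `df_t[df_t.content_type_id == 0]` before taking the mean.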