# **Riiid Answer Correctness Prediction - EDA: Summarized Data per User/Question/Tag/Part/Bundle**

#### The ability to predict whether a student will answer a question correctly depends primarily on the student's overall and topic-level ability, and on how simple or difficult the individual question, part, and topic/tag are.

#### In other words, the external influences on a student's performance could be:
* 'Simplicity Level' of the Tag.
* 'Simplicity Level' of the Part.
* 'Simplicity Level' of the Bundle.
* 'Simplicity Level' of the Question.

#### The internal indicators of the student could be:
* Past performance of the student.
* Familiarity/expertise of the user with respect to the Tag.
* Whether the user has been reading the explanations / attending the lectures.
* Average time spent by the user to answer a question.

#### In this notebook we create and view these summarized data and calculate their correlation with user performance.


*Note: This is a memory-sensitive notebook. Running all the cells together may result in "out of memory" errors or a notebook restart. Run one cell at a time and, after each gc operation, wait for the memory to stabilize (around 5.7GB).*
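
One common way to keep memory in check is to inspect per-column usage and store columns in the narrowest dtype that holds their values. The sketch below is illustrative only: the toy frame and the chosen dtypes are assumptions, not the dtypes of the pickled training data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a large table (values are made up)
toy = pd.DataFrame({
    "user_id": np.arange(1000, dtype="int64"),
    "answered_correctly": np.tile(np.array([1, 0], dtype="int64"), 500),
})
before = toy.memory_usage(deep=True).sum()

# Narrower dtypes that still hold the toy values (assumed ranges)
toy = toy.astype({"user_id": "int32", "answered_correctly": "int8"})
after = toy.memory_usage(deep=True).sum()
print("bytes before:", before, "bytes after:", after)
```

Deleting intermediate frames and calling `gc.collect()`, as the cells below do, complements this by releasing memory that is no longer referenced.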

Acknowledgements:
    
    https://www.kaggle.com/erikbruin/riiid-comprehensive-eda-baseline
    https://www.kaggle.com/ilialar/simple-eda-and-baseline
        

Version History

* V1 - Initial Version
* V2 - Converted 'Difficulty Level' to 'Simplicity Level'. 
* V3 - Added Visualizations, Made changes to the weighted levels 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from scipy.stats import pearsonr, spearmanr
import gc
import pyarrow.parquet as pq
import pyarrow as pa

In [None]:
%%time
##
# Read the data from pickle format. Thanks to BIZEN(https://www.kaggle.com/hiro5299834)
##
train_df = pd.read_pickle("/kaggle/input/riid-pickle-file/train.pkl")
print("Train size:", train_df.shape)

In [None]:
##
# Check for missing/incorrect  data and impute
##
summary = train_df.describe().transpose()
summary 

In [None]:
del summary
gc.collect ()

In [None]:
##
# prior_question_elapsed_time has missing values. For now, replace them with the median.
# The column also seems to contain outliers, which could be incorrect values. Replace these with the median as well.
##

#train_df['prior_question_elapsed_time'] = train_df['prior_question_elapsed_time'].fillna(train_df['prior_question_elapsed_time'].median())
val_85_pc = train_df['prior_question_elapsed_time'].quantile(0.85)
median_val = train_df['prior_question_elapsed_time'].median()  # call the method to get the numeric value
# TODO: Out of memory if this line is executed. Needs to be looked at later.
#train_df.loc[train_df['prior_question_elapsed_time'] > val_85_pc,'prior_question_elapsed_time'] = median_val
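
The imputation described in the cell above can be sketched on a toy Series rather than on the full training frame. The helper name `impute_and_cap` and the toy values are assumptions for illustration; the 85th percentile threshold mirrors the cell above.

```python
import numpy as np
import pandas as pd

def impute_and_cap(s: pd.Series, upper_q: float = 0.85) -> pd.Series:
    """Fill missing values with the median, then replace values above the
    `upper_q` quantile with the same median (hypothetical helper)."""
    median_val = s.median()          # a number, because the method is called
    threshold = s.quantile(upper_q)  # NaN values are skipped by default
    filled = s.fillna(median_val)
    # Series.where keeps values where the condition holds, substitutes elsewhere
    return filled.where(filled <= threshold, median_val)

toy = pd.Series([1000.0, 2000.0, np.nan, 3000.0, 4000.0, 500000.0])
cleaned = impute_and_cap(toy)
print(cleaned.tolist())
```

Whether this fits in memory on the full column has not been checked here; it is only meant to show the intended logic.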

In [None]:
del val_85_pc
del median_val
gc.collect ()

## User Level Data Summary

In [None]:
##
# Number of unique users in the train data 
##
len ( train_df['user_id'].unique() )

In [None]:
##
# Number of unique questions/lectures in the train data 
##
len ( train_df['content_id'].unique() )

In [None]:
##
# Mean/Min/Max/Median Number of questions per user. 
##

temp_df = train_df[train_df['content_type_id'] == 0].groupby("user_id")['content_id'].count()
print( " Mean:", temp_df.mean() )
print( " Median:",temp_df.median() )
print( " Min:", temp_df.min() )
print( " Max:", temp_df.max() )

It seems the same question can be presented to a user multiple times.
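
One way to check this across all users, rather than by inspecting a single user, is to count rows per (user_id, content_id) pair. A minimal sketch on made-up rows; the toy frame is an assumption, while the column names follow `train_df`.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "user_id":         [1, 1, 1, 2, 2],
    "content_id":      [7, 7, 9, 7, 8],
    "content_type_id": [0, 0, 0, 0, 0],
})

# Keep question rows only, then count attempts per (user, question) pair
questions_only = toy_train[toy_train["content_type_id"] == 0]
attempts = questions_only.groupby(["user_id", "content_id"]).size()
repeated = attempts[attempts > 1]
print(repeated)
```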

In [None]:
##
# An example of the same question being presented multiple times
##
user_id = temp_df.sort_values(ascending=False).index[0]
user_level_sorted_data = train_df[train_df['user_id'] == user_id].sort_values(by="content_id")
user_level_sorted_data

Content id '2' is presented to user '801103753' multiple times, each time with a different container id.

In [None]:
del user_level_sorted_data
gc.collect()

In [None]:
##
# Calculate the performance of the user based on the correctness of the answers and the time taken. The elapsed time value refers to the previous
# question bundle. But since we are taking the median, we do not have to realign it for the user summary.
##
user_grouped_df = train_df[train_df['content_type_id']==0].groupby("user_id")
user_info = user_grouped_df.agg({"answered_correctly":["sum", "count"],"prior_question_elapsed_time": 'median' })
user_info.columns=['total_correct_ans', 'total_q_attempted', 'prior_question_elapsed_time']
user_info['performance'] = user_info['total_correct_ans'] / user_info['total_q_attempted']


In [None]:
del user_grouped_df
gc.collect ()

In [None]:
##
# Add the number of lectures attended by the user
##
user_lectures = train_df[train_df['content_type_id']==1].groupby("user_id")['content_id'].count()
user_info = user_info.join(user_lectures)
user_info = user_info.rename (columns = { "content_id": "lecture_count" } )
user_info['lecture_count'] = user_info['lecture_count'].fillna(0).astype(int)
user_info

In [None]:
##
# The 'performance' metric unduly favors students who have attempted very few questions.
# Hence, calculate a weighted performance and find the best students.
# TODO: revisit the calculation of weighted_performance
##
correct_answers_max = user_info['total_correct_ans'].max()
user_info['weighted_performance'] = user_info['performance']*0.5 +  (user_info['total_correct_ans']/correct_answers_max)*0.5
user_info_sorted = user_info.sort_values(by='weighted_performance', ascending=False)
user_info_sorted
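
To see why the blend matters, here is a small worked example with three made-up users. The numbers are assumptions; the 50/50 weighting is the one used in the cell above.

```python
import pandas as pd

toy_users = pd.DataFrame(
    {"total_correct_ans": [2, 700, 900], "total_q_attempted": [2, 1000, 1000]},
    index=pd.Index(["few_attempts", "steady", "strong"], name="user_id"),
)
toy_users["performance"] = toy_users["total_correct_ans"] / toy_users["total_q_attempted"]

# Same 50/50 blend as above: accuracy plus volume of correct answers
correct_answers_max = toy_users["total_correct_ans"].max()
toy_users["weighted_performance"] = (
    0.5 * toy_users["performance"]
    + 0.5 * toy_users["total_correct_ans"] / correct_answers_max
)
ranking = list(toy_users.sort_values("weighted_performance", ascending=False).index)
print(toy_users)
print(ranking)
```

The user with two correct answers out of two tops the raw `performance` ranking but drops to last place once the volume term is blended in.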

In [None]:
##
# Remove outlier data: users with fewer than 10 attempts or with no correct answers
##
print("Users who have attempted fewer than 10 questions:", user_info[user_info['total_q_attempted'] < 10].shape[0])
print("Users who have 0 questions answered correctly:", user_info[user_info['performance'] == 0.0].shape[0])
user_info = user_info[user_info['total_q_attempted'] >= 10]
user_info = user_info[user_info['performance'] != 0.0]
user_info

In [None]:
##
# Is there a correlation between weighted performance and the median time spent on questions?
##
ax = sns.scatterplot(x=user_info['weighted_performance'], y=user_info['prior_question_elapsed_time'])
print(pearsonr(user_info['weighted_performance'], user_info['prior_question_elapsed_time']))
print(spearmanr(user_info['weighted_performance'], user_info['prior_question_elapsed_time']))

In [None]:
##
# Is there a correlation between performance and the total questions attempted?
##
ax = sns.scatterplot ( x=user_info['performance'], y=user_info['total_q_attempted']  )
print ( pearsonr(user_info['performance'], user_info['total_q_attempted']) )
print ( spearmanr(user_info['performance'], user_info['total_q_attempted'])) 

In [None]:
##
# Is there a correlation between the lecture count and the weighted_performance 
##

ax = sns.scatterplot ( x=user_info['weighted_performance'], y=user_info['lecture_count'] )
print ( pearsonr(user_info['weighted_performance'], user_info['lecture_count']) )
print ( spearmanr(user_info['weighted_performance'], user_info['lecture_count'])) 

*There seems to be a correlation between the number of lectures attended and the weighted performance; the Pearson and Spearman coefficients printed above quantify its strength.*
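
Pearson measures linear association, while Spearman measures monotonic association (it is Pearson computed on ranks), so the two coefficients can disagree. A minimal sketch on synthetic data, using pandas only; the data are made up for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 11), dtype="float64")
y = x ** 3  # monotonic but strongly non-linear

pearson = x.corr(y)                 # Series.corr defaults to Pearson
spearman = x.rank().corr(y.rank())  # Spearman is Pearson applied to ranks
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

When the two values differ noticeably, the relationship is likely monotonic but not linear, which is worth keeping in mind when reading the scatter plots.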

In [None]:
##
# Plot the top 10 users by weighted_performance, alongside their raw performance
##

plt.figure(figsize = (20, 15))
plt.subplot(1, 2, 1)
plt.xticks(rotation=45)
ax=sns.barplot(x =user_info_sorted[0:10].index, y=user_info_sorted[0:10]['weighted_performance'], order=user_info_sorted[0:10].index )

plt.subplot(1, 2, 2)
plt.xticks(rotation=45)
ax=sns.barplot(x =user_info_sorted[0:10].index, y=user_info_sorted[0:10]['performance'], order=user_info_sorted[0:10].index )




In [None]:


table = pa.Table.from_pandas(user_info, preserve_index=True)
pq.write_table(table, '/kaggle/working/user_info.parquet')

In [None]:
del user_info
del table
del temp_df
gc.collect()

## Question Level Data Summary

In [None]:
##
# Calculate the simplicity level. Note that we are not considering the elapsed time, since that value refers to the previous row
# and hence cannot be consumed as is.
##
question_grouped_df = train_df[train_df['content_type_id']==0].groupby("content_id")
question_info = question_grouped_df.agg({"answered_correctly":["sum", "count"]})
question_info.columns=['total_users_correct', 'total_users_attempted']
question_info['question_simplicity_level'] =  question_info['total_users_correct'] / question_info['total_users_attempted']

question_info

In [None]:
# Some questions have been attempted by very few users. Hence calculate the weighted simplicity level, which blends in the
# number of correct answers relative to the maximum across questions.
# TODO: revisit the calculation of weighted_question_simplicity_level
max_q_user_correct = question_info['total_users_correct'].max()
question_info['weighted_question_simplicity_level'] = question_info['question_simplicity_level'] *0.5 + (question_info['total_users_correct']/max_q_user_correct)*0.5
question_info_sorted = question_info.sort_values(by='weighted_question_simplicity_level', ascending=False )
question_info_sorted

In [None]:
##
# Plot the simplest questions, i.e. the questions that many users have answered correctly
##
plt.figure(figsize=(30,20))
plt.subplot(1,2, 1)
plt.xticks(rotation=45)
ax=sns.barplot(x =question_info_sorted[0:10].index, y=question_info_sorted[0:10]['weighted_question_simplicity_level'], order=question_info_sorted[0:10].index )

plt.subplot(1,2, 2)
plt.xticks(rotation=45)
ax=sns.barplot(x =question_info_sorted[0:10].index, y=question_info_sorted[0:10]['question_simplicity_level'], order=question_info_sorted[0:10].index )



In [None]:
del question_grouped_df
del question_info_sorted


In [None]:
gc.collect()

In [None]:
##
# Create columns for each tag and set if the question has the tag.
##

questions_tag_info = pd.read_csv("/kaggle/input/riiid-test-answer-prediction/questions.csv")
questions_tag_info['tags_new'] = questions_tag_info['tags'].apply( lambda x: str(x).split(" "))

for i in range( 0, 189) :
    questions_tag_info[str(i)] = 0 

for idx, row in questions_tag_info.iterrows():
    tags = row.tags_new
    for sub_tag in tags:
        if(sub_tag == 'nan'):
            continue
        questions_tag_info.loc[idx, str(sub_tag)]=1
questions_tag_info
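
The row-by-row loop above can also be expressed with `Series.str.get_dummies`, which builds one indicator column per token in a single call. A minimal sketch on toy data; the toy frame and `n_tags = 4` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy_questions = pd.DataFrame({
    "question_id": [0, 1, 2],
    "tags": ["1 3", "0", np.nan],  # the last question has no tags
})
n_tags = 4  # assumption for the toy data

# One indicator column per tag; rows with missing tags become all zeros
indicators = toy_questions["tags"].str.get_dummies(sep=" ")
# Make sure every tag has a column, in numeric order, even if unused
indicators = indicators.reindex(columns=[str(i) for i in range(n_tags)], fill_value=0)
toy_tagged = pd.concat([toy_questions, indicators], axis=1)
print(toy_tagged)
```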

In [None]:
question_info = pd.merge(question_info, questions_tag_info, how="outer", left_on= "content_id", right_on = "question_id")
question_info

In [None]:
import pyarrow.parquet as pq
import pyarrow as pa

table = pa.Table.from_pandas(question_info, preserve_index=True)
pq.write_table(table, '/kaggle/working/question_info.parquet')


In [None]:
del questions_tag_info 
del table

In [None]:
gc.collect()

## Tag Summary Data

Tags are "one or more detailed tag codes for the question. The meaning of the tags are not provided, but these codes are sufficient for clustering the questions together." Each tag represents a topic or a subject.

In [None]:
##
# Calculate the number of questions per tag
##
tag_q_info = pd.DataFrame(
    {"count": [question_info[question_info[str(i)] == 1]['question_id'].count() for i in range(0, 188)]},
    index=range(0, 188),
)

In [None]:
##
# Calculate the number of questions answered correctly per tag 
# TODO: This is a time consuming operation - how can we simplify this ?
##


from tqdm import tqdm

tag_rows = []
for i in tqdm(range(0, 188)):  # same tag range as tag_q_info above
    q_list_series = question_info[question_info[str(i)] == 1]['question_id']
    # All train rows whose content_id matches a question carrying tag i
    temp_df = train_df.loc[train_df['content_id'].isin(q_list_series), 'answered_correctly']
    tag_rows.append({"answered_correctly_per_tag": (temp_df == 1).sum(),
                     "answered_incorrectly_per_tag": (temp_df == 0).sum(),
                     # -1 marks lecture rows; they match on numeric content_id only
                     "lectures_per_tag": (temp_df == -1).sum()})
# Build the frame in one step from the collected rows
tag_info = pd.DataFrame(tag_rows, index=range(0, 188))
tag_info['q_count_per_tag'] = tag_q_info['count']
tag_info['total_answers'] = tag_info['answered_correctly_per_tag'] + tag_info['answered_incorrectly_per_tag']
tag_info['tag_simplicity_level'] = tag_info['answered_correctly_per_tag'] / tag_info['total_answers']
tag_info
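
Regarding the TODO above about the loop being time consuming: one possible simplification is to aggregate the large frame once per question and then spread those totals over tags with `explode`. This is a sketch under assumptions (toy frames, question rows only), not a drop-in replacement; it does not reproduce `lectures_per_tag`.

```python
import pandas as pd

toy_questions = pd.DataFrame({"question_id": [0, 1, 2], "tags": ["1 3", "3", "2"]})
toy_train = pd.DataFrame({
    "content_id":         [0, 0, 1, 2, 2, 1],
    "content_type_id":    [0, 0, 0, 0, 0, 0],
    "answered_correctly": [1, 0, 1, 0, 0, 1],
})

# Per-question totals first (a single pass over the large frame)
per_question = (
    toy_train[toy_train["content_type_id"] == 0]
    .groupby("content_id")["answered_correctly"]
    .agg(correct="sum", total="count")
)

# Then one row per (question, tag) pair, joined to the per-question totals
question_tags = (
    toy_questions.assign(tag=toy_questions["tags"].str.split(" "))
    .explode("tag")
    .merge(per_question, left_on="question_id", right_index=True, how="left")
)
tag_summary = question_tags.groupby("tag")[["correct", "total"]].sum()
tag_summary["tag_simplicity_level"] = tag_summary["correct"] / tag_summary["total"]
print(tag_summary)
```

Questions with missing tags would drop out of the final groupby, because their exploded tag is NaN and groupby skips NaN keys by default.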

In [None]:
max_tag_ans_corr = tag_info['answered_correctly_per_tag'].max()
tag_info['weighted_tag_simplicity_level'] = tag_info['tag_simplicity_level'] *0.5 + (tag_info['answered_correctly_per_tag']/max_tag_ans_corr)*0.5
tag_info_sorted = tag_info.sort_values(by=['weighted_tag_simplicity_level'], ascending = False )
tag_info_sorted

In [None]:
##
# Plot the top 10 simple tags
##
plt.figure(figsize=(30,20))

plt.subplot(1,2,1)
plt.xticks(rotation=45)
ax=sns.barplot(x =tag_info_sorted[0:10].index, y=tag_info_sorted[0:10]['weighted_tag_simplicity_level'], order=tag_info_sorted[0:10].index )

plt.subplot(1,2,2)
plt.xticks(rotation=45)
ax=sns.barplot(x =tag_info_sorted[0:10].index, y=tag_info_sorted[0:10]['tag_simplicity_level'], order=tag_info_sorted[0:10].index )

In [None]:
##
# Is there a correlation between the lecture count and the performance of a tag
##
ax = sns.scatterplot ( x=tag_info_sorted['weighted_tag_simplicity_level'], y=tag_info_sorted['lectures_per_tag'] )
print ( pearsonr(tag_info_sorted['weighted_tag_simplicity_level'], tag_info_sorted['lectures_per_tag']) )
print ( spearmanr(tag_info_sorted['weighted_tag_simplicity_level'], tag_info_sorted['lectures_per_tag'])) 

In [None]:
##
# Is there a correlation between the q_count_per_tag and the performance of a tag
##
ax = sns.scatterplot ( x=tag_info_sorted['tag_simplicity_level'], y=tag_info_sorted['q_count_per_tag'] )
print ( pearsonr(tag_info_sorted['tag_simplicity_level'], tag_info_sorted['q_count_per_tag']) )
print ( spearmanr(tag_info_sorted['tag_simplicity_level'], tag_info_sorted['q_count_per_tag'])) 

In [None]:
import pyarrow.parquet as pq
import pyarrow as pa

table = pa.Table.from_pandas(tag_info, preserve_index=True)
pq.write_table(table, '/kaggle/working/tag_info.parquet')

In [None]:
del tag_q_info
del table
del temp_df
del q_list_series
del tag_info_sorted


In [None]:
gc.collect()

## Bundle Summary Data 

#### A bundle represents a group of questions that are presented together. Let's check whether they have consecutive timestamps and similar tag ids.

In [None]:
##
# Read the questions and check the bundles.
##
questions_df = pd.read_csv("/kaggle/input/riiid-test-answer-prediction/questions.csv")
questions_df = questions_df.set_index('question_id')  # the join below matches content_id against this index
train_q_df = train_df.join(questions_df, on="content_id")
train_q_df['bundle_id'].unique()

In [None]:
##
# Calculate the number of distinct bundles
##
train_q_df['bundle_id'].nunique()

In [None]:
##
# Do the questions in same bundle have same tags ?
##
questions_df.sort_values(['tags', 'bundle_id'])

In [None]:
##
# Check how the bundles are presented to the user. Is it in consecutive timestamps? Do they share the same container id?
##
temp_df = train_q_df[0:100000].groupby(['user_id','task_container_id']).agg({'bundle_id': ["min", 'max']} )
temp_df=temp_df.fillna(0)
temp_df_sorted = temp_df.sort_values('user_id')
temp_df_sorted

In [None]:
train_df[train_df['user_id'] == 115].sort_values(by="timestamp")

Questions from the same bundle have similar tags. They have similar timestamps and the same 'task container id' per user.
Since there are too many bundles and not many questions per bundle, this may not be very useful. So this data is not summarized for now.
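
The observations above could be backed by a small per-bundle summary of `questions.csv`: how many questions each bundle holds and how many distinct tag strings appear in it. A minimal sketch on made-up rows; the toy frame and the output column names are assumptions.

```python
import pandas as pd

toy_questions = pd.DataFrame({
    "question_id": [0, 1, 2, 3, 4],
    "bundle_id":   [0, 1, 2, 2, 2],
    "tags":        ["5", "7 9", "3 8", "3 8", "3 1"],
})

bundle_summary = toy_questions.groupby("bundle_id").agg(
    n_questions=("question_id", "count"),
    n_distinct_tag_sets=("tags", "nunique"),
)
# Share of bundles that contain a single question
single_share = (bundle_summary["n_questions"] == 1).mean()
print(bundle_summary)
print("share of single-question bundles:", single_share)
```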


In [None]:
del questions_df
del train_q_df
del temp_df_sorted
del temp_df




In [None]:
gc.collect()

## Part Summary Data 

#### Part represents the type of question, e.g. reading or listening (https://www.iibc-global.org/english/toeic/test/lr/about/format.html). This is significant since some parts may be easier than others.

In [None]:
##
# Calculate the number of parts and the number questions for each 'part' 
##
questions_df = pd.read_csv("/kaggle/input/riiid-test-answer-prediction/questions.csv")
q_part_grouped = questions_df.groupby("part").count().sort_values(by="part")
q_part_grouped

In [None]:
##
# Calculate the number of questions answered correctly per part.
# Also calculate the number of questions per part.
##
from tqdm import tqdm

parts = range(1, 8)
part_q_info = pd.DataFrame(
    {"count": [questions_df[questions_df['part'] == i]['question_id'].count() for i in parts]},
    index=parts,
)

part_rows = []
for i in tqdm(parts):
    q_list_series = questions_df[questions_df['part'] == i]['question_id']
    # All train rows whose content_id matches a question in part i
    temp_df = train_df.loc[train_df['content_id'].isin(q_list_series), 'answered_correctly']
    part_rows.append({"answered_correctly_per_part": (temp_df == 1).sum(),
                      "answered_incorrectly_per_part": (temp_df == 0).sum(),
                      # -1 marks lecture rows; they match on numeric content_id only
                      "lectures_per_part": (temp_df == -1).sum()})
# Build the frame in one step from the collected rows
part_info = pd.DataFrame(part_rows, index=parts)
part_info['q_count_per_part'] = part_q_info['count']
part_info['total_answers_per_part'] = part_info['answered_correctly_per_part'] + part_info['answered_incorrectly_per_part']
part_info['part_simplicity_level'] = part_info['answered_correctly_per_part'] / part_info['total_answers_per_part']
part_info

In [None]:
ax=sns.barplot(x=part_info.index, y= part_info['q_count_per_part'])

In [None]:
ax=sns.barplot(x=part_info.index, y= part_info['part_simplicity_level'])

Part 5 has more questions than the other parts and a lower percentage of correct answers.
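
The per-part simplicity level can also be computed without a loop by mapping each answered question to its part and grouping on the result. A minimal sketch on toy frames; the data are made up, and the column names follow the cell above.

```python
import pandas as pd

toy_questions = pd.DataFrame({"question_id": [0, 1, 2], "part": [1, 5, 5]})
toy_train = pd.DataFrame({
    "content_id":         [0, 1, 1, 2, 2, 2],
    "content_type_id":    [0, 0, 0, 0, 0, 0],
    "answered_correctly": [1, 0, 1, 0, 0, 1],
})

# Lookup table from question id to part
part_of_question = toy_questions.set_index("question_id")["part"]
answers = toy_train[toy_train["content_type_id"] == 0]

# Group the answers by the part of the question they belong to
part_summary = (
    answers.groupby(answers["content_id"].map(part_of_question))["answered_correctly"]
    .agg(answered_correctly_per_part="sum", total_answers_per_part="count")
)
part_summary.index.name = "part"
part_summary["part_simplicity_level"] = (
    part_summary["answered_correctly_per_part"] / part_summary["total_answers_per_part"]
)
print(part_summary)
```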

In [None]:
import pyarrow.parquet as pq
import pyarrow as pa

table = pa.Table.from_pandas(part_info, preserve_index=True)
pq.write_table(table, '/kaggle/working/part_info.parquet')

In [None]:
del part_info
del part_q_info
del table
del q_part_grouped
del questions_df
gc.collect () 

## Other comparisons

In [None]:
##
# Read the summary data 
##
table = pq.read_table("/kaggle/working/question_info.parquet")
question_info = table.to_pandas()
question_info_sorted = question_info.sort_values(by="weighted_question_simplicity_level", ascending=False)  # simplest questions first

In [None]:
table = pq.read_table("/kaggle/working/user_info.parquet")
user_info = table.to_pandas()


In [None]:
##
# Check whether users who have attempted more of the simplest questions perform better than those who have attempted fewer
##

q_list = question_info_sorted[0:200]['question_id']
# Restrict to question rows so that lecture rows with a coinciding content_id are not counted
user_list = train_df[(train_df['content_type_id'] == 0) & train_df['content_id'].isin(q_list)][['user_id', 'content_id']]
user_list = user_list.groupby('user_id').count()

# Group users by the number of these questions attempted: <10, 10-24, 25-49, 50-74, 75-99
user_list_10 = user_list[user_list['content_id'] < 10].index
user_list_25 = user_list[(user_list['content_id'] >= 10) & (user_list['content_id'] < 25)].index
user_list_50 = user_list[(user_list['content_id'] >= 25) & (user_list['content_id'] < 50)].index
user_list_75 = user_list[(user_list['content_id'] >= 50) & (user_list['content_id'] < 75)].index
user_list_100 = user_list[(user_list['content_id'] >= 75) & (user_list['content_id'] < 100)].index

# Get the mean performance of such users 
perf = np.zeros(5)
perf[0] = user_info[user_info.index.isin(user_list_10)]['performance'].mean()
perf[1] = user_info[user_info.index.isin(user_list_25)]['performance'].mean()
perf[2] = user_info[user_info.index.isin(user_list_50)]['performance'].mean()
perf[3] = user_info[user_info.index.isin(user_list_75)]['performance'].mean()
perf[4] = user_info[user_info.index.isin(user_list_100)]['performance'].mean()

x_axis_val = [10, 25, 50, 75, 100]
ax=sns.barplot (x =x_axis_val  , y = perf)

The graph suggests that the more of the easier questions a user attempts, the higher the user's mean performance (share of correct answers) tends to be.
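
The same bucketing can be sketched with `pd.cut`, which assigns every count to exactly one left-closed bin, so counts that fall exactly on a boundary (10, 25, 50, 75) are not dropped. The two Series below are made up for illustration.

```python
import pandas as pd

# Made-up per-user counts of easy questions attempted, and per-user accuracy
easy_q_count = pd.Series([3, 10, 24, 25, 60, 99], index=[1, 2, 3, 4, 5, 6])
performance = pd.Series([0.40, 0.50, 0.55, 0.60, 0.70, 0.80], index=[1, 2, 3, 4, 5, 6])

# Left-closed bins: [0, 10), [10, 25), [25, 50), [50, 75), [75, 100)
bins = [0, 10, 25, 50, 75, 100]
bucket = pd.cut(easy_q_count, bins=bins, right=False)

# Mean accuracy of the users that fall into each bucket
mean_perf = performance.groupby(bucket, observed=False).mean()
print(mean_perf)
```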