# Import required libraries

In [None]:
%matplotlib inline

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import plotly.express as px

# Load data

In [None]:
# Compact dtypes keep memory use low; only the first 10**5 rows are read
dtypes = {
    'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32',
    'content_id': 'int16', 'content_type_id': 'int8', 'task_container_id': 'int16',
    'user_answer': 'int8', 'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32', 'prior_question_had_explanation': 'boolean',
}
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',
                       low_memory=False, nrows=10**5, index_col=0, dtype=dtypes)

# Initial exploration

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df['content_type_id'].value_counts().plot(kind='bar', title='Questions vs Lectures')

In [None]:
train_df['content_type_id'].value_counts(normalize=True)

We will analyze only the questions that were answered, so we filter the records down to question rows (`content_type_id == 0`).

In [None]:
train_df = train_df[train_df['content_type_id'] == 0]

Let's group the records by `user_id` and generate some statistics.

In [None]:
# Mean elapsed time per user, split by whether the answer was correct
train_df.groupby(['user_id', 'answered_correctly'])\
        .agg({'prior_question_elapsed_time': 'mean'}).head(2000)

In [None]:
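# Groups sort with answered_correctly 0 before 1, so the lambda returns
# (mean time when incorrect) - (mean time when correct); users with a
# single outcome keep that outcome's mean instead of a difference.
train_df.groupby(['user_id', 'answered_correctly'])\
        .agg({'prior_question_elapsed_time': 'mean'}).head(2000)\
        .groupby('user_id').agg(
    {'prior_question_elapsed_time': lambda x: x.values[0] - x.values[1] if len(x) == 2 else x.values[0]})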

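Above I have tried to generate a user-level difference in mean `prior_question_elapsed_time` between incorrectly and correctly answered questions. Because the groups are sorted with `answered_correctly` 0 before 1, the value is the incorrect mean minus the correct mean: if a student took less time on the questions answered correctly than on the incorrect ones, the difference is **positive**, and it is negative in the opposite case. For users who have only one of the two outcomes, the value shown is just that outcome's mean, not a difference.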

_(Assuming 0 means incorrect and 1 means correct)_
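The sign convention is easier to see when each outcome gets its own column. Below is a minimal sketch on toy data (the values are made up for illustration, and the names `toy`, `mean_time` and `time_diff` are hypothetical). It computes incorrect minus correct explicitly and leaves `NaN` for users who have only one outcome.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking the columns used above
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'answered_correctly': [0, 0, 1, 1, 1, 1],
    'prior_question_elapsed_time': [30000.0, 20000.0, 10000.0, 14000.0, 9000.0, 11000.0],
})

# One row per user, one column per outcome (0 = incorrect, 1 = correct)
mean_time = (toy.groupby(['user_id', 'answered_correctly'])['prior_question_elapsed_time']
                .mean()
                .unstack('answered_correctly'))

# Make sure both outcome columns exist even if one never occurs in the data
mean_time = mean_time.reindex(columns=[0, 1])

# Incorrect minus correct: positive means the user was faster on correct answers
time_diff = mean_time[0] - mean_time[1]
print(time_diff)
```

Here user 1 gets a positive difference, while user 2, who never answered incorrectly, gets `NaN` rather than a raw mean.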

This may be helpful for finding patterns in the time taken per question. We can see that the elapsed time differs considerably for some users.

Each student may be weak in some topics and strong in others. Identifying this could help predict whether a student will answer a question from a given topic correctly.
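One possible way to look at per-topic strength is to compute each user's share of correct answers per question tag. The sketch below uses toy frames with made-up values. It assumes that `content_id` of a question row matches `question_id` in the questions table and that `tags` is a space-separated string. The names `toy_train`, `toy_questions` and `tag_accuracy` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the interaction and question tables
toy_train = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'content_id': [10, 11, 12, 10],
    'answered_correctly': [1, 0, 1, 0],
})
toy_questions = pd.DataFrame({
    'question_id': [10, 11, 12],
    'tags': ['5 7', '7', '5'],
})

# Attach tags to each answered question (assumes content_id == question_id)
merged = toy_train.merge(toy_questions, left_on='content_id',
                         right_on='question_id', how='left')

# One row per (answer, tag) pair
merged['tag'] = merged['tags'].str.split(' ')
exploded = merged.explode('tag').dropna(subset=['tag'])

# Share of correct answers per user and tag
tag_accuracy = (exploded.groupby(['user_id', 'tag'])['answered_correctly']
                        .mean()
                        .unstack('tag'))
print(tag_accuracy)
```

With only a few answers per tag these ratios are noisy, so a minimum answer count per cell would probably be needed before reading much into them.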

## Looking at the questions

In [None]:
qdf = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
qdf.head()

In [None]:
qdf.info()

Let's check whether the `bundle_id` and `question_id` columns are the same.

In [None]:
(qdf['bundle_id'] == qdf['question_id']).value_counts()

No, they are not. I guess that was a bad assumption, since a `bundle_id` may group questions that are similar to one another or that cover the same topic as a lecture that was viewed.
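To see how questions are grouped, one could count the questions per bundle. This is a minimal sketch on a toy frame with made-up IDs; `toy_q`, `bundle_sizes` and `multi_question_bundles` are hypothetical names.

```python
import pandas as pd

# Toy question table (hypothetical values)
toy_q = pd.DataFrame({
    'question_id': [0, 1, 2, 3, 4],
    'bundle_id':   [0, 1, 2, 2, 2],
})

# Number of questions that share each bundle_id
bundle_sizes = toy_q.groupby('bundle_id')['question_id'].count()

# Bundles that contain more than one question
multi_question_bundles = bundle_sizes[bundle_sizes > 1]
print(multi_question_bundles)
```

Running the same two lines on `qdf` would show how many bundles actually hold more than one question.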

Let's see the different types of tags for the questions

In [None]:
from collections import Counter

In [None]:
tags = []
for tag_str in qdf['tags'].values.tolist():
    if not isinstance(tag_str, float):
        tags.extend(tag_str.split(' '))

counts = dict(Counter(tags))
counts_df = pd.DataFrame.from_dict(counts, orient='index', columns=['Count']).reset_index()
counts_df = counts_df.rename({'index':'tag_id'}, axis='columns')

In [None]:
counts_df.head()
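The same tag counts can probably be obtained without an explicit loop by splitting and exploding the `tags` column. The sketch below runs on a toy frame with made-up tags (`toy_q` and `tag_counts` are hypothetical names) and assumes `tags` holds space-separated strings with missing values for untagged questions.

```python
import pandas as pd

# Toy question table (hypothetical values); None stands in for a missing tag string
toy_q = pd.DataFrame({'tags': ['1 2', '2 3', None, '2']})

tag_counts = (toy_q['tags'].dropna()        # skip questions without tags
                 .str.split(' ')            # list of tag ids per question
                 .explode()                 # one row per tag occurrence
                 .value_counts()            # occurrences per tag
                 .rename_axis('tag_id')
                 .reset_index(name='Count'))
print(tag_counts)
```

The result has the same `tag_id` and `Count` columns as `counts_df`, sorted by descending count.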

In [None]:
fig = px.bar(counts_df, x='tag_id', y='Count', title='Tag counts')
fig.show()

We should probably cluster these tags, but I don't know how.

# Looking at the `answered_correctly` as a time series problem for each user

The competition description clearly states that we need to trace each student's knowledge over time to understand whether they can answer the incoming question correctly.

Let's choose some well-represented `user_id`s and look at the cumulative number of correctly answered questions for each of them over time.

In [None]:
uids = train_df['user_id'].value_counts().index.tolist()[:10]

In [None]:
random_user = train_df.loc[train_df['user_id'].isin(uids), ['user_id', 'timestamp', 'answered_correctly']]

In [None]:
random_user['timestamp'] = random_user['timestamp']/1000

In [None]:
random_user.reset_index(drop=True, inplace=True)

In [None]:
# Running count of correct answers per user, aligned to the original rows
random_user['corr_cs'] = random_user.groupby('user_id')['answered_correctly'].cumsum()

In [None]:
fig = px.line(random_user, x="timestamp", y="corr_cs", color='user_id')
fig.show()

Some users have long gaps. These are probably due to them watching lectures in between before moving on to the next bundle of questions, although the lecture rows were filtered out earlier, so this frame alone cannot confirm it.
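The gaps can be quantified rather than read off the plot. Here is a minimal sketch on toy data (made-up timestamps in seconds; `toy`, `gap` and `longest_gap` are hypothetical names) that takes the per-user difference between consecutive timestamps.

```python
import pandas as pd

# Toy interaction log (hypothetical values), timestamps in seconds
toy = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'timestamp': [0.0, 30.0, 90000.0, 0.0, 45.0],
})

# Time since the same user's previous interaction (NaN for each user's first row)
toy = toy.sort_values(['user_id', 'timestamp'])
toy['gap'] = toy.groupby('user_id')['timestamp'].diff()

# Longest pause per user
longest_gap = toy.groupby('user_id')['gap'].max()
print(longest_gap)
```

Applying the same `diff` to the unfiltered data would show whether lecture rows actually fall inside the long gaps.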