## Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import os

In [None]:
import warnings
warnings.filterwarnings("ignore")

### Check data available

We have four datasets at our disposal.

In [None]:
os.listdir('../input/riiid-test-answer-prediction')

In [None]:
lectures_csv = pd.read_csv("../input/riiid-test-answer-prediction/lectures.csv")
example_test_csv = pd.read_csv("../input/riiid-test-answer-prediction/example_test.csv")
train_csv = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", low_memory=False, nrows=1000000)
questions_csv = pd.read_csv("../input/riiid-test-answer-prediction/questions.csv")

Let's explore each of the datasets!

## 1.1 ```'train.csv'```

Let's take another look at our parameters:

- ```row_id```: (int64) ID code for the row.

- ```timestamp```: (int64) the time in milliseconds between this user interaction and the first event from that user.

- ```user_id```: (int32) ID code for the user.

- ```content_id```: (int16) ID code for the user interaction.

- ```content_type_id```: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

- ```task_container_id```: (int16) ID code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a ```task_container_id```. Monotonically increasing for each user.

- ```user_answer```: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

- ```answered_correctly```: (int8) if the user responded correctly. Read -1 as null, for lectures.

- ```prior_question_elapsed_time```: (float32) How long it took a user to answer their previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Note that the time is the total time a user took to solve all the questions in the previous bundle.

- ```prior_question_had_explanation```: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
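
The dtypes listed above can be passed to `pd.read_csv` directly, which should reduce memory use compared with the default 64-bit types. This is a minimal sketch, assuming the column types from the list; the two-row `sample_csv` string is made up for illustration and only stands in for `train.csv`.

```python
import io
import pandas as pd

# Column dtypes taken from the parameter list above.
train_dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    # Nullable boolean dtype, so missing values survive the cast.
    "prior_question_had_explanation": "boolean",
}

# Hypothetical two-row sample standing in for train.csv.
sample_csv = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
)

typed_sample = pd.read_csv(sample_csv, dtype=train_dtypes)
print(typed_sample.dtypes)
```

The same `dtype=train_dtypes` argument could be added to the real `pd.read_csv` call for `train.csv`.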

In [None]:
train_csv.head()

In [None]:
train_csv.nunique()

We can conclude that ```answered_correctly``` and ```prior_question_had_explanation``` are nominal features; ```user_answer``` is also categorical (the answer options are labels, not ranks); ```timestamp``` and ```prior_question_elapsed_time``` are quantitative.

In [None]:
train_csv.info()

In [None]:
train_csv.describe()[['timestamp', 'user_answer', 'answered_correctly', 'prior_question_elapsed_time']]

Let's check the NaN values.

In [None]:
train_csv.isnull().sum()

The easiest approach is to delete the rows that contain NaN values, but then we may lose important information. Alternatively, each missing value can be replaced by the mean of its group, where the groups are defined by the user's ID and the content's ID.

In [None]:
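# Select the column before transform so that a Series (not a DataFrame) is assigned.
# train_csv["prior_question_elapsed_time"] = train_csv.groupby(["user_id", "content_id"])["prior_question_elapsed_time"].transform(lambda x: x.fillna(x.mean()))
# train_csv["prior_question_had_explanation"] = train_csv.groupby(["user_id", "content_id"])["prior_question_had_explanation"].transform(lambda x: x.fillna(x.astype(float).mean()))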


# train_csv.dropna(inplace=True)
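
The group-mean imputation described above can be sketched on a small frame. The `toy` data is hypothetical; only the column names come from `train.csv`. Note that a group with no observed value at all keeps its NaN, so a fallback (for example the global mean) may still be needed on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names come from train.csv.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "content_id": [10, 10, 10, 20, 20],
    "prior_question_elapsed_time": [np.nan, 20000.0, 30000.0, np.nan, 10000.0],
})

# Mean of the observed values in each (user_id, content_id) group,
# broadcast back to the row level.
group_mean = (
    toy.groupby(["user_id", "content_id"])["prior_question_elapsed_time"]
    .transform("mean")
)

# Fill only the missing entries; observed values are left untouched.
toy["prior_question_elapsed_time"] = (
    toy["prior_question_elapsed_time"].fillna(group_mean)
)
print(toy)
```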

In [None]:
train_csv['timestamp'].hist(bins = 50)

The histogram suggests that many users go through a period of "stagnation".

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.countplot(x=train_csv.groupby('user_id')['user_answer'].count().value_counts(), palette="hls")
plt.title("Count of answers per user", fontsize=12)
plt.xticks(rotation=90, fontsize=13)
plt.ylabel('Number of answers')
plt.xlabel('Count of users')

We can single out one user who answered many more times than the other students. Almost all users answered up to 30 times.

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.countplot(x=train_csv.user_answer)
plt.title("Distribution of user answers", fontsize=12)
plt.xticks(rotation=90, fontsize=13)
plt.ylabel('Frequency')
plt.xlabel('User answer')

Users are roughly equally likely to choose answers 0, 1, and 3. The -1 values correspond to lecture rows, where there is no answer.


In [None]:
plt.figure(figsize=(15, 7))
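# histplot replaces the deprecated distplot; stat="density" keeps the density scale.
ax = sns.histplot(train_csv.groupby('user_id')['answered_correctly'].mean(), kde=True, stat="density")
plt.title("Distribution of the mean share of correct answers per user", fontsize=12)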
plt.xticks(rotation=90, fontsize=13)
plt.ylabel('Frequency')
plt.xlabel('Average correct answer')

In [None]:
train_csv.groupby('user_id')['answered_correctly'].mean().median()

It can be argued that users are more likely to respond **correctly**. Let's draw another plot to check this claim.

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.countplot(train_csv.answered_correctly)
plt.title("Distribution of correct answer", fontsize=12)
plt.xticks(rotation=90, fontsize=13)
plt.ylabel('Frequency')
plt.xlabel('Answer')

Correct answers (1) occur almost twice as often as incorrect ones (0).
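
The ratio read off the plot can also be computed directly. This is a minimal sketch on a made-up `answered` series; on the real data the same steps would apply to `train_csv['answered_correctly']`. Lecture rows (-1) are dropped first so that they do not distort the shares.

```python
import pandas as pd

# Hypothetical toy column; -1 marks lecture rows, as described above.
answered = pd.Series([1, 1, 0, -1, 1, 0, 1, -1], name="answered_correctly")

# Keep question rows only, then compute the share of each outcome.
questions_only = answered[answered != -1]
shares = questions_only.value_counts(normalize=True)

# Ratio of correct (1) to incorrect (0) answers.
correct_ratio = shares.loc[1] / shares.loc[0]
print(shares)
print("correct / incorrect ratio:", correct_ratio)
```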

In [None]:
s = train_csv.groupby('content_id')['user_answer'].count().sort_values(ascending=False)

In [None]:
s[:20]

In [None]:
zz = train_csv.groupby('content_id')['user_answer'].count().sort_values(ascending=False)
plt.figure(figsize=(15, 7))
ax = sns.lineplot(y=zz, x=range(0, len(zz)))
plt.title("Count of answers per content_id", fontsize=12)
plt.locator_params(nbins=12)
plt.ylabel('Number of answers')
plt.xlabel('Rank of content_id (by number of answers)')

Approximately 2000 content items have more than 200 answers each.
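
Instead of reading the number off the curve, the items above a threshold can be counted explicitly. A small sketch on hypothetical data; the threshold of 2 stands in for the 200 used in the text.

```python
import pandas as pd

# Hypothetical toy interactions; only the column names come from train.csv.
toy = pd.DataFrame({
    "content_id": [7, 7, 7, 8, 8, 9],
    "user_answer": [0, 1, 2, 3, 0, 1],
})

# Number of answers recorded for each content_id.
answers_per_content = toy.groupby("content_id")["user_answer"].count()

threshold = 2  # illustrative value; the text above uses 200
n_popular = int((answers_per_content > threshold).sum())
print(f"{n_popular} content item(s) have more than {threshold} answers")
```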

We can find the **most popular content**:

In [None]:
zz[:15]

In [None]:
plt.figure(figsize=(15, 7))
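# histplot replaces the deprecated distplot; stat="density" keeps the density scale.
ax = sns.histplot(train_csv.groupby('user_id')['prior_question_elapsed_time'].mean(), kde=True, stat="density")
plt.title("Distribution of mean prior_question_elapsed_time per user", fontsize=12)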
plt.xticks(rotation=90, fontsize=13)
plt.ylabel('Frequency')
plt.xlabel('Average prior_question_elapsed_time')

On average, a user needs about 20,000 ms (roughly 20 seconds) to answer their previous question bundle, ignoring any lectures in between.

Note: the time is the total time a user took to solve all the questions in the previous bundle.


## 1.2 ```'questions.csv'```

Let's take another look at our parameters:

- ```question_id```: foreign key for the train/test content_id column, when the content type is question (0).

- ```bundle_id```: code for which questions are served together.

- ```correct_answer```: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

- ```part```: top level category code for the question.

- ```tags```: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
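
The comparison mentioned for ```correct_answer``` can be sketched as a join between the two tables. Both frames below are hypothetical toy data; only the column names follow the descriptions above, and `is_correct` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy rows from train.csv (questions only: content_type_id == 0).
train_toy = pd.DataFrame({
    "content_id": [100, 101, 100],
    "content_type_id": [0, 0, 0],
    "user_answer": [2, 0, 1],
})

# Hypothetical toy rows from questions.csv.
questions_toy = pd.DataFrame({
    "question_id": [100, 101],
    "correct_answer": [2, 3],
})

# question_id is the foreign key for content_id when the row is a question.
merged = train_toy[train_toy["content_type_id"] == 0].merge(
    questions_toy, left_on="content_id", right_on="question_id", how="left"
)

# 1 if the user's choice matches the correct answer, 0 otherwise.
merged["is_correct"] = (
    merged["user_answer"] == merged["correct_answer"]
).astype("int8")
print(merged[["content_id", "user_answer", "correct_answer", "is_correct"]])
```

On the real data, `is_correct` should agree with ```answered_correctly``` for question rows.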

In [None]:
questions_csv.head()

In [None]:
questions_csv.nunique()

In [None]:
questions_csv.isnull().sum()

In [None]:
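def split_tags(x):
    """Parse a space-separated tag string into a list of ints; [0] if unparseable (e.g. NaN)."""
    try:
        return [int(i) for i in str(x).split()]
    except ValueError:
        return [0]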

In [None]:
questions_csv.tags = questions_csv.tags.apply(split_tags)

In [None]:
unique, counts = np.unique(questions_csv.tags.sum(), return_counts=True)

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.barplot(x=unique, y=counts)
plt.title("Count of tag", fontsize=12)
plt.tick_params(axis='x',which='both', bottom=False, top=False, labelbottom=False)
plt.ylabel('Count')
plt.xlabel('Tag')

In [None]:
idx = np.argsort(counts)[::-1]
print(f"The most frequent tags are: {unique[idx[:5]]}")
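
An alternative way to count tags, shown here only as a sketch, is `Series.explode` followed by `value_counts`, which avoids concatenating all the lists with `sum()`. The `tags_toy` series is hypothetical and mimics the list-valued ```tags``` column after `split_tags` has been applied.

```python
import pandas as pd

# Hypothetical list-valued column, shaped like questions_csv.tags after parsing.
tags_toy = pd.Series([[1, 2], [2], [2, 3], [0]], name="tags")

# explode() gives one row per tag; value_counts() sorts by frequency (descending).
tag_counts = tags_toy.explode().value_counts()
print(tag_counts.head(5))
```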

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.countplot(x=questions_csv['correct_answer'], palette="hls")
plt.title("Count of correct answer per each choice", fontsize=12)
plt.xticks(rotation=90, fontsize=12)
plt.ylabel('Count')
plt.xlabel('Correct answer')

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.countplot(x=questions_csv.groupby('bundle_id').count()['question_id'], palette="hls")
plt.title("Count of questions per bundle_id", fontsize=12)
plt.xticks(rotation=90, fontsize=12)
plt.ylabel('Number of bundles')
plt.xlabel('Number of questions')

Most bundles contain only 1 question.

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.countplot(x=questions_csv['part'], palette="hls")
plt.title("Distribution of part", fontsize=12)
plt.xticks(rotation=90, fontsize=12)
plt.ylabel('Count')
plt.xlabel('Part')

Part 5 is the most frequent.

In [None]:
questions_csv.groupby(['part', 'correct_answer']).count()['question_id']

The distribution of correct answer options is approximately the same in each part.
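
Raw counts are hard to compare across parts of different sizes, so row-normalised shares can make this check easier. A minimal sketch with `pd.crosstab` on hypothetical data; only the column names come from `questions.csv`.

```python
import pandas as pd

# Hypothetical toy questions; only the column names come from questions.csv.
questions_toy = pd.DataFrame({
    "part": [1, 1, 1, 1, 2, 2],
    "correct_answer": [0, 1, 1, 3, 0, 0],
})

# Share of each correct_answer option within every part (each row sums to 1).
answer_share = pd.crosstab(
    questions_toy["part"], questions_toy["correct_answer"], normalize="index"
)
print(answer_share)
```

Similar rows in `answer_share` would support the claim that the parts share the same answer distribution.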

## 1.3 ```'lectures.csv'```

Let's take another look at our parameters:

- ```lecture_id```: foreign key for the train/test content_id column, when the content type is lecture (1).

- ```part```: top level category code for the lecture.

- ```tag```: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

- ```type_of```: brief description of the core purpose of the lecture.

In [None]:
lectures_csv.head()

In [None]:
lectures_csv.nunique()

In [None]:
lectures_csv.isnull().sum()

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.countplot(x=lectures_csv['part'], palette="hls")
plt.title("Distribution of part", fontsize=12)
plt.xticks(rotation=90, fontsize=12)
plt.ylabel('Count')
plt.xlabel('Part')

In [None]:
plt.figure(figsize=(15, 7))
ax = sns.countplot(x=lectures_csv['type_of'], palette="hls")
plt.title("Distribution of type_of", fontsize=12)
plt.xticks(rotation=90, fontsize=12)
plt.ylabel('Count')
plt.xlabel('Type of lecture')

The plot shows a significant difference in the number of lectures of each type.