# Riiid! Answer Correctness Prediction

## DATA

1. **Train.csv** 

`row_id`: (int64) ID code for the row.

`timestamp`: (int64) the time in milliseconds between this user interaction and the first event completion from that user.

`user_id`: (int32) ID code for the user

`content_id`: (int16) ID code for the user interaction

`content_type_id`: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

`task_container_id`: (int16) ID code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

`user_answer`: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

`answered_correctly`: (int8) if the user responded correctly. Read -1 as null, for lectures.

`prior_question_elapsed_time`: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture.

`prior_question_had_explanation`: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
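
The dtypes listed above can be passed to `pd.read_csv` so that the frame uses much less memory than the default 64-bit columns. The following is a minimal sketch, not the notebook's own loading code: `TRAIN_DTYPES`, `load_train` and the sample rows are hypothetical, and the nullable `"boolean"` dtype assumes a pandas version that supports it (1.0 or later).

```python
import io

import pandas as pd

# Column dtypes as listed in the field descriptions above.
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    # Nullable boolean, because the column is null for a user's first bundle.
    "prior_question_had_explanation": "boolean",
}


def load_train(path_or_buffer, nrows=None):
    """Read train.csv-style data with compact dtypes (hypothetical helper)."""
    return pd.read_csv(path_or_buffer, dtype=TRAIN_DTYPES, nrows=nrows)


# Made-up rows in the train.csv layout, used only to demonstrate the helper.
sample_csv = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
    "2,118363,115,89,1,3,-1,-1,,False\n"
)
sample = load_train(sample_csv)
print(sample.dtypes)
```

On the real file, the same helper would be called with the CSV path and an `nrows` limit in place of the in-memory buffer.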

# Importing Libraries

In [None]:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

In [None]:
#Load data
train_data=pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',low_memory=False, nrows = 1000000)
test_data=pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv',low_memory=False)
question=pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv',low_memory=False)
lecture=pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv',low_memory=False)


In [None]:
#Explore train data
print("No of rows in the training set:", len(train_data))
print("--------------------------")
print("No of attributes:",len(train_data.columns))
print("--------------------------")
print(train_data.dtypes)
print("--------------------------")
print(train_data.head())
print("--------------------------")

In [None]:
#Fraction of missing values in each column.
print('Fraction of missing values per column')
print(train_data.isnull().sum() / len(train_data))
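
Note that `isnull()` does not count the `-1` placeholders that `user_answer` and `answered_correctly` use for lectures. A minimal sketch of a sentinel-aware check follows; the `toy` frame and the `SENTINEL_COLUMNS` name are made up for illustration and stand in for `train_data`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the same null conventions as train.csv.
toy = pd.DataFrame({
    "content_type_id": [0, 0, 1, 0],
    "user_answer": [3, 1, -1, 0],
    "answered_correctly": [1, 0, -1, 1],
    "prior_question_elapsed_time": [np.nan, 21000.0, np.nan, 18000.0],
})

# Columns in which -1 should be read as null (per the field descriptions).
SENTINEL_COLUMNS = ["user_answer", "answered_correctly"]

# Turn -1 into NaN in the sentinel columns only, leaving other columns as-is.
with_nulls = toy.assign(
    **{col: toy[col].where(toy[col] != -1) for col in SENTINEL_COLUMNS}
)

# Mean of the boolean mask equals the fraction of missing values per column.
missing_fraction = with_nulls.isnull().mean()
print(missing_fraction)
```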

In [None]:
train_data.head()

In [None]:
#The number of unique users in the training set.
unique_users=train_data.user_id.nunique()
print("We have",unique_users,"unique users in the training set")

**1. Trying to find the most active users on the platform**

In [None]:
# Count interactions per user and keep the ten most active users.
user = train_data.groupby(train_data.user_id).user_id.count()
user_10 = user.nlargest(10).reset_index(name = 'counts')
# Cast IDs to strings so seaborn treats them as categories, not numbers.
user_10.user_id = user_10.user_id.astype('str')
fig, ax = plt.subplots(figsize=(10,5))
sns.barplot(x="user_id", y="counts", data=user_10, ax=ax)
# Rotate the tick labels so the long user IDs do not overlap.
plt.setp(ax.get_xticklabels(), rotation=40, ha="right")
ax.set_title('Top 10 users')
plt.show()

**2. Comparing how often questions and lectures occur on the platform**

In [None]:
content_type = train_data.groupby(train_data.content_type_id).content_type_id.count()
content_type = content_type.reset_index(name = 'count')
fig = px.pie(content_type, values='count', names='content_type_id', title='Content Type')
fig.show()


`Only about 2% of interactions are lecture views; the other 98% are questions.`
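
The pie chart counts rows, so it shows the share of interactions rather than the share of users. The sketch below, on a made-up `toy` frame standing in for `train_data`, computes both numbers to show how they can differ.

```python
import pandas as pd

# Toy interactions (made-up): user 1 watches one lecture, users 2 and 3 none.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "content_type_id": [0, 0, 1, 0, 0, 0],
})

# Share of rows that are questions (0) or lectures (1).
interaction_share = toy["content_type_id"].value_counts(normalize=True)

# Share of users who watched at least one lecture.
users_with_lecture = toy.loc[toy["content_type_id"] == 1, "user_id"].nunique()
user_share = users_with_lecture / toy["user_id"].nunique()

print(interaction_share)
print("Share of users with at least one lecture:", user_share)
```

In this toy data one row in six is a lecture, yet one user in three has watched a lecture, so the two figures should not be read interchangeably.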

**3. Finding the most recurring content**

In [None]:

cids = train_data.content_id.value_counts()[:30]
fig = plt.figure(figsize=(12,6))
cids.plot.bar()
plt.title("Thirty most used content IDs")
plt.xticks(rotation=90)
plt.show()

**4. Trying to find the percentage of answers given for each option**

In [None]:
user_answer = train_data.groupby(train_data.user_answer).user_answer.count()
user_answer = user_answer.reset_index(name = 'count')
fig = px.pie(user_answer, values='count', names='user_answer', title='User Answer Distribution')
fig.show()

**5. Trying to see how many answers were correct and how many were not**

In [None]:
answered_correctly = train_data.groupby(train_data.answered_correctly).answered_correctly.count()
answered_correctly = answered_correctly.reset_index(name = 'count')
fig = px.pie(answered_correctly, values='count', names='answered_correctly', title='Answered correctly Distribution')
fig.show()
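
The pie chart above includes the `-1` placeholder rows for lectures as a third slice. A minimal sketch of computing the accuracy over question rows only follows; the `toy` frame is made up and stands in for `train_data`.

```python
import pandas as pd

# Toy interactions (made-up): four questions and one lecture.
toy = pd.DataFrame({
    "content_type_id": [0, 0, 1, 0, 0],
    "answered_correctly": [1, 0, -1, 1, 1],
})

# Keep question rows only, so the -1 placeholder for lectures is excluded.
questions_only = toy[toy["content_type_id"] == 0]

# With 0/1 labels, the mean is the fraction of correct answers.
accuracy = questions_only["answered_correctly"].mean()
print("Fraction answered correctly:", accuracy)
```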

## Questions.csv

 **questions.csv**: metadata for the questions posed to users.

`question_id`: foreign key for the train/test content_id column, when the content type is question (0).

`bundle_id`: code for which questions are served together.

`correct_answer`: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

`part`: the relevant section of the TOEIC test.

`tags`: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
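
As the `question_id` and `correct_answer` descriptions suggest, the questions table can be joined to the training rows to recompute correctness. This is a sketch on made-up frames (`toy_train`, `toy_questions`), not a check that was run on the real data.

```python
import pandas as pd

# Made-up training rows: three question interactions and one lecture.
toy_train = pd.DataFrame({
    "content_id": [10, 11, 10, 7],
    "content_type_id": [0, 0, 0, 1],
    "user_answer": [2, 0, 1, -1],
    "answered_correctly": [1, 0, 0, -1],
})

# Made-up question metadata in the questions.csv layout.
toy_questions = pd.DataFrame({
    "question_id": [10, 11],
    "correct_answer": [2, 3],
})

# content_id refers to question_id only when content_type_id is 0.
answered = toy_train[toy_train["content_type_id"] == 0]
joined = answered.merge(
    toy_questions, left_on="content_id", right_on="question_id", how="left"
)

# Recompute correctness and compare it with the provided label.
recomputed = (joined["user_answer"] == joined["correct_answer"]).astype("int8")
labels_match = bool((recomputed == joined["answered_correctly"]).all())
print("Recomputed labels match:", labels_match)
```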

In [None]:
#Explore the questions.csv file
print("No of rows in questions.csv:",len(question))
print("--------------------------")
print("No of attributes:",len(question.columns))
print("--------------------------")
print(question.head())
print("--------------------------")
print(question.dtypes)

In [None]:
#Fraction of missing values in each column.
print('Fraction of missing values per column')
print(question.isnull().sum() / len(question))

**Distribution for correct answers**

In [None]:
# Count how many questions have each option as the correct answer.
fig = plt.figure(figsize=(8,6))
cr = question.groupby("correct_answer")['question_id'].count().reset_index(name = 'counts')
ax = sns.barplot(x="correct_answer", y="counts", data=cr)
ax.set_title("Correct Answers Distribution")

**Most recurring tags**

In [None]:
# Split the space-separated tag strings and put each tag on its own row.
check = question['tags'].str.split(' ').explode()
check = check.value_counts().reset_index()

check.columns = ['tag', 'count']
# Append '-' so plotly treats the tags as category labels rather than numbers.
check['tag'] = check['tag'].astype(str) + '-'
check = check.sort_values(['count'])

fig = px.bar(
    check.tail(10),
    x='count',
    y='tag',
    title='Top 10 most frequent tags'
)
fig.show()
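
To make the split-and-explode step easier to follow, here is the same idea on a made-up `toy_questions` frame, including a question with no tags. The frame and its values are hypothetical.

```python
import pandas as pd

# Made-up questions; tags are space-separated codes and one entry is missing.
toy_questions = pd.DataFrame({
    "question_id": [0, 1, 2, 3],
    "tags": ["51 131", "131", None, "131 36 51"],
})

# Split each tag string into a list, then give every tag its own row.
exploded = toy_questions["tags"].str.split(" ").explode()

# Drop the missing entry explicitly and count how often each tag occurs.
tag_counts = exploded.dropna().value_counts()
print(tag_counts)
```

The tag codes stay as strings here, which is enough for counting and grouping.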

## lectures.csv

 **lectures.csv**: metadata for the lectures watched by users as they progress in their education.

`lecture_id`: foreign key for the train/test content_id column, when the content type is lecture (1).

`part`: top level category code for the lecture.

`tag`: a single tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

`type_of`: brief description of the core purpose of the lecture

**Exploring how many parts there are in this dataset**

In [None]:
#Explore Lecture data
print("No of rows in the lectures dataset:", len(lecture))
print("--------------------------")
print("No of attributes:",len(lecture.columns))
print("--------------------------")
print(lecture.dtypes)
print("--------------------------")
print(lecture.head())
print("--------------------------")

In [None]:
# Count how many lectures belong to each part.
part_count = lecture.groupby("part")['lecture_id'].count().reset_index(name = 'counts')
chart = px.pie(part_count, values='counts', names='part', title='Part Type')
chart.show()

**Different types of lectures**

In [None]:
type_count = lecture.groupby("type_of")['lecture_id'].count().reset_index(name = 'counts')
chart2 = px.pie(type_count, values='counts', names='type_of', title='Lecture Type')
chart2.show()

In [None]:
#The number of unique lecture tags.
unique_tags=lecture.tag.nunique()
print("We have",unique_tags,"unique tags.")

Trying to find which is the most important lecture
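
One possible reading of "most important" is "most watched". The sketch below ranks lectures by view count under that assumption; `toy_train` and `toy_lecture` are made-up stand-ins for `train_data` and `lecture`, and other definitions of importance are equally plausible.

```python
import pandas as pd

# Made-up interactions: lecture 100 is watched three times, lecture 200 twice.
toy_train = pd.DataFrame({
    "content_id": [100, 200, 100, 5, 100, 200],
    "content_type_id": [1, 1, 1, 0, 1, 1],
})

# Made-up lecture metadata in the lectures.csv layout.
toy_lecture = pd.DataFrame({
    "lecture_id": [100, 200],
    "tag": [7, 9],
    "part": [1, 5],
    "type_of": ["concept", "solving question"],
})

# Count views per lecture, using lecture rows (content_type_id == 1) only.
views = (
    toy_train.loc[toy_train["content_type_id"] == 1, "content_id"]
    .value_counts()
    .rename_axis("lecture_id")
    .reset_index(name="views")
)

# Attach the metadata and sort so the most watched lecture comes first.
most_watched = views.merge(toy_lecture, on="lecture_id").sort_values(
    "views", ascending=False
)
print(most_watched.head())
```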

In [None]:
#Exploring example_test csv file
print("No of rows in test file:",len(test_data))
print("--------------------------")
print("No of attributes:",len(test_data.columns))
print("--------------------------")
print(test_data.dtypes)

In [None]:
#Exploring the submission csv file 
submission=dd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_sample_submission.csv',low_memory=False).compute()
print("Number of rows in submission file:",len(submission))
print("--------------------------")
print("No of attributes:",len(submission.columns))
print("--------------------------")
print(submission.head())
print("--------------------------")
print(submission.dtypes)

# **Preprocessing Part**

In [None]:
#Dropping unwanted attribute.
del train_data['row_id']

In [None]:
train_data.head()

In [None]:
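#Convert boolean and string columns to integer codes.

from sklearn.preprocessing import LabelEncoder
# Use one encoder per column so each keeps its own fitted classes_.
explanation_le = LabelEncoder()
type_le = LabelEncoder()
# Cast to str first so True, False and NaN are encoded as three uniform string labels.
train_data['prior_question_had_explanation']=explanation_le.fit_transform(train_data['prior_question_had_explanation'].astype(str))
lecture['type_of']=type_le.fit_transform(lecture['type_of'])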

In [None]:
train_data.head()

In [None]:

lecture.head()

In [None]:
# Selecting only numerical columns
train_corr = train_data.select_dtypes(include='number')

In [None]:
#Correlation heatmap
corr=train_corr.corr()
plt.subplots(figsize=(10,8))
sns.heatmap(corr,annot=True)

In [None]:
# Selecting only numerical columns
lectures_corr = lecture.select_dtypes(include='number')

In [None]:
#Correlation heatmap
corr=lectures_corr.corr()
plt.subplots(figsize=(10,8))
sns.heatmap(corr,annot=True)

In [None]:
import kmeans1d

In [None]:
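# Cluster the lecture tag codes into K groups with 1-D k-means.
x = lecture['tag']
K = 7

clusters, centroids = kmeans1d.cluster(x, K)

print(clusters)   # cluster index assigned to each lecture
print(centroids)  # centre of each cluster on the tag axis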

In [None]:
plt.scatter(lecture['tag'],clusters)