<img src="https://mma.prnewswire.com/media/1200045/Riiid_Labs.jpg?p=publish&w=950" width="800" height="400">

## <center>Riiid! Answer Correctness Prediction</center>
### <center>🧠Track knowledge states of 1M+ students in the wild🧠</center>

# Table of contents <a id='0.1'></a>

* [Introduction](#1)
* [Import Packages](#2)
* [Utility](#3)
* [Data Overview](#4)
    * [Train Data](#4.1)
    * [Questions Data](#4.2)
    * [Lectures Data](#4.3)
    * [Test Data](#4.4)
* [Individual Features](#5)
    * [Continuous Feature Distribution](#5.1)
        * [Train Data Feature Distribution](#5.1.1)
        * [Lectures Data Feature Distribution](#5.1.2)
        * [Test Data Feature Distribution](#5.1.3)
    * [Categorical Feature Distribution](#5.2)
        * [Train Data](#5.2.1)
        * [Questions Data](#5.2.2)
        * [Lectures Data](#5.2.3)
* [Multiple Features](#6)
    * [Train Data Features](#6.1)
    * [Questions MetaData Features](#6.2)
    * [Lectures MetaData Features](#6.3)
    * [Feature Correlation](#6.4)
* [Reference](#7)

# 1. <a id='1'>Introduction📔</a>
[Table of contents](#0.1)

Welcome to this new competition hosted by [Riiid! Labs](https://www.riiid.co/en/main), a leader in AI-based education. Here are some of the [products](https://www.riiid.co/en/product) offered by [Riiid! Labs](https://www.riiid.co/en/main).

In this competition, your challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions. You will pair your machine learning skills with Riiid's EdNet data.

## About Competition Data

The data is in tabular format. It covers each student's historic performance, the performance of other students on the same question, and metadata about the question itself.

**This is a time-series code competition: you will receive test set data and make predictions with Kaggle's time-series API. Please be sure to review the Time-series API Details section closely**.

We are provided with the following **csv** files:

* train.csv - Training features.
* questions.csv - Metadata for the questions posed to users.
* lectures.csv - Metadata for the lectures watched by users as they progress in their education.

Please check these starter kernels for more information.
* [Competition API Detailed Introduction](https://www.kaggle.com/sohier/competition-api-detailed-introduction)
* [Quick Sample Submission](https://www.kaggle.com/sohier/quick-sample-submission/)

## What are we predicting?

 You will predict whether students are able to answer their next questions correctly.
 
## Evaluation Metric: Area Under ROC Curve

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
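
To make the metric concrete, here is a minimal sketch of AUC computed by pairwise comparison: the fraction of (correct, incorrect) pairs in which the correct row received the higher predicted probability, with ties counted as half. The helper name `roc_auc` and the toy labels and scores are assumptions for illustration, not part of the competition code, and the quadratic loop is only practical for small inputs.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)

# A constant prediction ties every pair, so it scores exactly 0.5.
constant_auc = roc_auc(labels, [0.5] * len(labels))
print(constant_auc)
```

Because only the ranking of the scores matters, a constant prediction carries no information and lands at 0.5, the chance level.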

<img src="https://i.ytimg.com/vi/J9l8J1MeCbY/hqdefault.jpg" width="400" height="400" align='left'>

# 2. <a id='2'>Import Packages📚</a>
[Table of contents](#0.1)

In [None]:
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null

In [None]:
# import packages
import os, gc
import warnings
import numpy as np
import pandas as pd
import datatable as dt

# visualization
import seaborn as sns
import matplotlib.pyplot as plt

# riiideducation module
import riiideducation

%matplotlib inline
warnings.filterwarnings('ignore')

# directory
print('Competition Data/Files')
os.listdir('../input/riiid-test-answer-prediction')

# 3. <a id='3'>Utility</a>
[Table of contents](#0.1)


In [None]:
def countplot(column, plot_type='multiple', gridstyle='whitegrid', gs=None,
              palette='Accent', xlab=None, ylab=None, title=None, fontsize=12):
    
    '''
    Make countplots
    -----------------
    
    Arguments:
    column -- column with categorical values
    plot_type -- multiple grid ('multiple/single')
    gridstyle -- seaborn gridstyle
    gs -- gridspec (if using subplots)
    palette -- color palette
    xlab -- x-axis label
    ylab -- y-axis label
    title -- plot title
    fontsize -- fontsize
    
    Returns:
    sns.countplot()
    '''
    if plot_type=='multiple':
        with sns.axes_style(gridstyle):
            ax = plt.gcf().add_subplot(gs)  # use the current figure instead of a global `f`
            aa = sns.countplot(column, palette=palette)
            for p in ax.patches:
                height = p.get_height()
                aa.text(p.get_x()+p.get_width()/2.,
                        height,
                        '{:1.2f}%'.format(height/len(column)*100),
                        ha="center", fontsize=fontsize)
            plt.xlabel(xlab,fontsize=fontsize)
            plt.ylabel(ylab,fontsize=fontsize)
            plt.title(title)
            
    elif plot_type=='single':
        with sns.axes_style(gridstyle):
            aa = sns.countplot(column, palette=palette)
            for p in aa.patches:
                height = p.get_height()
                aa.text(p.get_x()+p.get_width()/2.,
                        height + 3,
                        '{:1.2f}%'.format(height/len(column)*100),
                        ha="center", fontsize=fontsize)
            plt.xlabel(xlab,fontsize=fontsize)
            plt.ylabel(ylab,fontsize=fontsize)
            plt.title(title)

# 4. <a id='4'>Data Overview🔍</a>
[Table of contents](#0.1)

In this section we will develop some intuition about the [competition data](https://www.kaggle.com/c/riiid-test-answer-prediction/data). The train.csv file is huge (around 5.45 GB), so we will use the Python **datatable** package to load it into our notebook. **datatable** is modeled on the [R data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) package and reads large tabular files quickly. Thanks to [Rohan Rao](https://www.kaggle.com/rohanrao) for this [notebook](https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid/notebook).

In [None]:
# root directory
ROOT = '../input/riiid-test-answer-prediction/'

# files
train = dt.fread(f'{ROOT}train.csv').to_pandas()

train = train.astype({
    'row_id': 'int32',
    'timestamp': 'int64',
    'user_id': 'int64',
    'content_id': 'int16',
    'content_type_id': 'int8',
    'task_container_id': 'int16',
    'user_answer': 'int8',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32',
    'prior_question_had_explanation': 'boolean'
})

questions = pd.read_csv(f'{ROOT}questions.csv')
lectures = pd.read_csv(f'{ROOT}lectures.csv')
example_test = pd.read_csv(f'{ROOT}example_test.csv')
example_sample_submission = pd.read_csv(f'{ROOT}example_sample_submission.csv')

## 4.1 <a id='4.1'>Train Data</a>
[Table of contents](#0.1)

In [None]:
train.head()

In [None]:
print(f'We have {train.shape[0]} rows and {train.shape[1]} features in train.csv.')

**📌 Points to note :**

* row_id - ID code for the row.

* timestamp - the time between this user interaction and the first event from that user.

* user_id - ID code for the user.

* content_id - ID code for the user interaction

* content_type_id - 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

* task_container_id - Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id. Monotonically increasing for each user.

* user_answer - the user's answer to the question, if any. Read -1 as null, for lectures.

* answered_correctly - if the user responded correctly. Read -1 as null, for lectures.

* prior_question_elapsed_time - How long it took a user to answer their previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Note that the time is the total time a user took to solve all the questions in the previous bundle.

* prior_question_had_explanation - Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
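
The column notes above can be sketched on a tiny frame. The example below assumes a hypothetical four-row history for one user (the values are made up) and shows how `content_type_id` separates questions from lectures, why the `-1` placeholders should be excluded before computing accuracy, and where nulls appear in `prior_question_elapsed_time`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy interactions for one user; values are made up for illustration.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "content_id": [10, 11, 500, 12],
    "content_type_id": [0, 0, 1, 0],       # 0 = question, 1 = lecture
    "answered_correctly": [1, 0, -1, 1],   # -1 is the null marker on lecture rows
    "prior_question_elapsed_time": [np.nan, 21000.0, np.nan, 18000.0],
})

# Keep question rows only; lecture rows carry -1 placeholders, not real labels.
questions_only = toy[toy["content_type_id"] == 0]

# The mean of the 0/1 target over question rows is the user's accuracy.
accuracy = questions_only["answered_correctly"].mean()

# Nulls in the elapsed-time column mark the first question bundle and lectures.
n_missing = toy["prior_question_elapsed_time"].isna().sum()
print(len(questions_only), round(accuracy, 3), n_missing)
```

Averaging `answered_correctly` without the filter would pull the result down, because each lecture row would contribute `-1`.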

Let's get some more info about the training data.

In [None]:
train.info()

### Missing Values

In [None]:
print(f'Missing values in each column of train.csv:\n{train.isnull().sum()}')

In [None]:
print(f'We have a total of {train.isnull().values.sum()} missing values in train data.')

**📌 Points to note :**
* We have **2744044** missing values in total: **2351538** in column **prior_question_elapsed_time** and **392506** in **prior_question_had_explanation**.

### Unique Values

In [None]:
print('Unique Values in each column of train.csv')
print('##########################################')
for col in train:
    print(f'{col}: {train[col].nunique()}')

**📌 Points to note :**

* We have **393,656 unique users**.
* We have 10,000 unique task containers (batches of questions or lectures). 
* We have 4 categorical features: **content_type_id, user_answer, answered_correctly, prior_question_had_explanation**.

## 4.2 <a id='4.2'>Questions Data (metadata)</a>
[Table of contents](#0.1)

In [None]:
questions.head()

In [None]:
print(f'We have {questions.shape[0]} rows and {questions.shape[1]} features in questions.csv.')

**📌 Points to note :**

* question_id - foreign key for the train/test content_id column, when the content type is question (0).

* bundle_id - code for which questions are served together.

* correct_answer - the answer to the question. Can be compared with the train user_answer column to check if the user was right.

* part - top level category code for the question.

* tags - one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
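
As a rough sketch of how these columns connect to the train data, the example below joins a made-up interactions frame to a made-up questions frame on `content_id` / `question_id`, recomputes correctness by comparing `user_answer` with `correct_answer`, and splits the space-separated `tags` string into a list. The frame names and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up question interactions (content_type_id == 0), for illustration only.
interactions = pd.DataFrame({
    "content_id": [0, 1, 1],
    "content_type_id": [0, 0, 0],
    "user_answer": [2, 0, 3],
})

# Made-up slice of question metadata with the same columns as questions.csv.
question_meta = pd.DataFrame({
    "question_id": [0, 1],
    "correct_answer": [2, 3],
    "tags": ["51 131", "10"],
})

# question_id is the foreign key for content_id on question rows.
merged = interactions.merge(
    question_meta, left_on="content_id", right_on="question_id", how="left"
)

# An answer is correct when it matches the answer key.
merged["is_correct"] = (merged["user_answer"] == merged["correct_answer"]).astype("int8")

# Tags are stored as one space-separated string per question.
tag_lists = question_meta["tags"].str.split()
print(merged[["content_id", "user_answer", "correct_answer", "is_correct"]])
print(tag_lists.tolist())
```

On the real data the join should be restricted to question rows first, since lecture rows use `content_id` as a key into lectures.csv instead.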

### Missing Values in questions.csv

In [None]:
print(f'Missing values in each column of questions.csv:\n{questions.isnull().sum()}')

In [None]:
print(f'We have a total of {questions.isnull().values.sum()} missing values in questions data.')

### Unique Values

In [None]:
print('Unique Values in each column of questions.csv')
print('##########################################')
for col in questions:
    print(f'{col}: {questions[col].nunique()}')


## 4.3 <a id='4.3'>Lectures Data (metadata)</a>
[Table of contents](#0.1)

In [None]:
lectures.head()

In [None]:
print(f'We have {lectures.shape[0]} rows and {lectures.shape[1]} features in lectures.csv.')

**📌 Points to note :**

* lecture_id - foreign key for the train/test content_id column, when the content type is lecture (1).

* part - top level category code for the lecture.

* tag - one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

* type_of - brief description of the core purpose of the lecture

### Missing Values in lectures.csv

In [None]:
print(f'Missing values in each column of lectures.csv:\n{lectures.isnull().sum()}')

In [None]:
print(f'We have a total of {lectures.isnull().values.sum()} missing values in lectures data.')

### Unique Values

In [None]:
print('Unique Values in each column of lectures.csv')
print('##########################################')
for col in lectures:
    print(f'{col}: {lectures[col].nunique()}')


## 4.4 <a id='4.4'>Test Data</a>
[Table of contents](#0.1)

In this competition we have to predict which questions each student can answer correctly. You will loop through a series of batches of questions. Once you make that prediction, you can move on to the next batch.

We need to use the **riiideducation** Python module to work with the test data. For a more detailed explanation, see the [Competition API Detailed Introduction](https://www.kaggle.com/sohier/competition-api-detailed-introduction).

In [None]:
# You can only call make_env() once, so don't lose it!
env = riiideducation.make_env()

# You can only iterate through a result from `env.iter_test()` once
# so be careful not to lose it once you start iterating.
iter_test = env.iter_test()

In [None]:
count = 0
for (test_df, sample_prediction_df) in iter_test:
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
    count += len(test_df)

In [None]:
print(f'We have {count} observations in total and {test_df.shape[1]} features in the test data served by the API.')

In [None]:
test_df.head()

**📌 Points to note :**

The test dataframe is the same as train.csv except for two new columns. 

* prior_group_responses - provides all of the user_answer entries for previous group in a string representation of a list in the first row of the group. All other rows in each group are null. If you are using Python, you will likely want to call eval on the non-null rows. Some rows may be null, or empty lists.

* prior_group_answers_correct - provides all the answered_correctly field for previous group, with the same format and caveats as prior_group_responses. Some rows may be null, or empty lists.
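
A minimal sketch of parsing these two columns follows. The description above suggests `eval`; the sketch swaps in `ast.literal_eval` from the standard library, which only accepts Python literals and is a safer choice for strings. The helper name `parse_prior_group` and the sample cell values are assumptions for illustration.

```python
import ast


def parse_prior_group(value):
    """Turn a prior_group_* cell into a list; null or empty cells give []."""
    # None and NaN both mean "no list on this row" (NaN is the only value != itself).
    if value is None or value != value:
        return []
    text = str(value).strip()
    if not text:
        return []
    return list(ast.literal_eval(text))


# Sample cells: a filled first row, an empty list, and null rows.
print(parse_prior_group("[1, 0, 1]"))
print(parse_prior_group("[]"))
print(parse_prior_group(float("nan")))
print(parse_prior_group(None))
```

Only the first row of each group holds the list, so in practice the helper would be applied to that row and the result shared across the group.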

In [None]:
test_df.info()

# 5. <a id='5'>Individual Features📊</a>
[Table of contents](#0.1)

Now we will visualize the data using the information available in each of the csv files. 

## 5.1 <a id='5.1'>Continuous Feature Distribution</a>
### 5.1.1 <a id='5.1.1'>Train Data Feature Distribution</a>

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    train['timestamp'].hist(bins = 50,color='orange')
    plt.title("Timestamp Distribution")

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    train['user_id'].hist(bins = 50,color='red')
    plt.title("User Id Distribution")

**📌 Points to note :**

* **timestamp** is the time between a user's interaction and that user's first event. The distribution is right-skewed. 
* **User Id** is the unique id assigned to each user.

In [None]:
mean = train['content_id'].mean()
median = train['content_id'].median()
mode = train['content_id'].mode()[0]

mean_2 = train['task_container_id'].mean()
median_2 = train['task_container_id'].median()
mode_2 = train['task_container_id'].mode()[0]

print(f'Content Id (Mean): {mean}')
print(f'Content Id (Median): {median}')
print(f'Content Id (Mode): {mode}\n')
print('######################################\n')
print(f'Task Container Id (Mean): {mean_2}')
print(f'Task Container Id (Median): {median_2}')
print(f'Task Container Id (Mode): {mode_2}')

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    sns.distplot(train['content_id'], color='green')
    ax.axvline(int(mean), color='r', linestyle='--')
    ax.axvline(int(median), color='y', linestyle='-')
    ax.axvline(mode, color='b', linestyle='-')
    plt.legend({'Mean':mean,'Median':median,'Mode':mode})
    plt.title("Content Id Distribution")
    
with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    sns.distplot(train['task_container_id'], color='yellow')
    ax.axvline(int(mean_2), color='r', linestyle='--')
    ax.axvline(int(median_2), color='g', linestyle='-')
    ax.axvline(mode_2, color='b', linestyle='-')
    plt.legend({'Mean':mean_2,'Median':median_2,'Mode':mode_2})
    plt.title("Task Container Id Distribution")

In [None]:
mean_3= train['prior_question_elapsed_time'].mean()
median_3 = train['prior_question_elapsed_time'].median()
mode_3 = train['prior_question_elapsed_time'].mode()[0]

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("whitegrid"):
    sns.distplot(train['prior_question_elapsed_time'], color='olive')
    plt.axvline(int(mean_3), color='c', linestyle='--')
    plt.axvline(int(median_3), color='m', linestyle='-')
    plt.axvline(mode_3, color='k', linestyle='-')
    plt.legend({'Mean':mean_3,'Median':median_3,'Mode':mode_3})
    plt.title("Prior Question Elapsed Time Distribution")
    
print(f'Prior Question Elapsed Time (Mean): {mean_3}')
print(f'Prior Question Elapsed Time (Median): {median_3}')
print(f'Prior Question Elapsed Time (Mode): {mode_3}')

### 5.1.2 <a id='5.1.2'>Lectures Data Feature Distribution</a>
[Table of contents](#0.1)

In [None]:
mean_4 = lectures['tag'].mean()
median_4 = lectures['tag'].median()
mode_4 = lectures['tag'].mode()[0]

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("whitegrid"):
    sns.distplot(lectures['tag'], color='coral', bins=20)
    plt.axvline(int(mean_4), color='r', linestyle='--')
    plt.axvline(int(median_4), color='g', linestyle='-')
    plt.axvline(mode_4, color='b', linestyle='-')
    plt.legend({'Mean':mean_4,'Median':median_4,'Mode':mode_4})
    plt.title("Lecture Tag Distribution")
    
print(f'Tag (Mean): {mean_4}')
print(f'Tag (Median): {median_4}')
print(f'Tag (Mode): {mode_4}')

### 5.1.3 <a id='5.1.3'>Test Data Feature Distribution</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    test_df['timestamp'].hist(bins = 50,color='maroon')
    plt.title("Timestamp Distribution in Test Data")

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    test_df['user_id'].hist(bins = 50,color='gold')
    plt.title("User Id Distribution in Test Data")

In [None]:
mean_5 = test_df['content_id'].mean()
median_5 = test_df['content_id'].median()
mode_5 = test_df['content_id'].mode()[0]

mean_6 = test_df['task_container_id'].mean()
median_6 = test_df['task_container_id'].median()
mode_6 = test_df['task_container_id'].mode()[0]

print(f'Content Id Test(Mean): {mean_5}')
print(f'Content Id Test(Median): {median_5}')
print(f'Content Id Test(Mode): {mode_5}\n')
print('######################################\n')
print(f'Task Container Id Test(Mean): {mean_6}')
print(f'Task Container Id Test(Median): {median_6}')
print(f'Task Container Id Test(Mode): {mode_6}')

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    sns.distplot(test_df['content_id'], color='cyan')
    ax.axvline(int(mean_5), color='r', linestyle='--')
    ax.axvline(int(median_5), color='y', linestyle='-')
    ax.axvline(mode_5, color='b', linestyle='-')
    plt.legend({'Mean':mean_5,'Median':median_5,'Mode':mode_5})
    plt.title("Content Id Distribution in Test Data")

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    sns.distplot(test_df['task_container_id'], color='purple')
    ax.axvline(int(mean_6), color='r', linestyle='--')
    ax.axvline(int(median_6), color='y', linestyle='-')
    ax.axvline(mode_6, color='b', linestyle='-')
    plt.legend({'Mean':mean_6,'Median':median_6,'Mode':mode_6})
    plt.title("Task Container Id Distribution in Test Data")

In [None]:
mean_7 = test_df['prior_question_elapsed_time'].mean()
median_7 = test_df['prior_question_elapsed_time'].median()
mode_7 = test_df['prior_question_elapsed_time'].mode()[0]

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("whitegrid"):
    sns.distplot(test_df['prior_question_elapsed_time'], color='darkgoldenrod')
    plt.axvline(int(mean_7), color='r', linestyle='--')
    plt.axvline(int(median_7), color='g', linestyle='-')
    plt.axvline(mode_7, color='b', linestyle='-')
    plt.legend({'Mean':mean_7,'Median':median_7,'Mode':mode_7})
    plt.title("Prior Question Elapsed Time Distribution in Test Data")
    
print(f'Prior Question Elapsed Time Test(Mean): {mean_7}')
print(f'Prior Question Elapsed Time Test(Median): {median_7}')
print(f'Prior Question Elapsed Time Test(Mode): {mode_7}\n')

## 5.2 <a id='5.2'>Categorical Feature Distribution</a>
[Table of contents](#0.1)

### 5.2.1 <a id='5.2.1'>Train Data</a>

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 3)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    ay = sns.countplot(y = train['user_id'], order=train.user_id.value_counts().index[:10], palette="ocean_r")
    plt.title("Top 10 Active Users")

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    aa = sns.countplot(y = train['content_id'], order=train.content_id.value_counts().index[:10], palette="terrain")
    plt.title("Top 10 Popular Contents Ids")
    
with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 2])
    aa = sns.countplot(y = train['task_container_id'], order=train.task_container_id.value_counts().index[:10], palette="OrRd_r")
    plt.title("Top 10 Tasks")

**📌 Points to note :**
* The first plot shows the **top 10 most active users**. The user with Id **801103753** has the most interactions, around **17,917**.
* **Content** with Id **6116** is the most popular.  
* **Task Id** is the unique Id for a batch of questions/lectures. **Task Id 14** is at the top, followed by **15 and 4**. 

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    ay = sns.countplot(train['user_answer'], palette="Set3")
    for p in ax.patches:
        height = p.get_height()
        ay.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/101230332*100),
                ha="center", fontsize=12)
    plt.xlabel('user answer',fontsize=12)
    plt.ylabel('count',fontsize=12)
    plt.title("User's Answer To Questions")

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    aa = sns.countplot(train['answered_correctly'], palette="pastel")
    for p in ax.patches:
        height = p.get_height()
        aa.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/101230332*100),
                ha="center", fontsize=14)
    plt.xlabel('answered correctly',fontsize=12)
    plt.ylabel('count',fontsize=12)
    plt.title("Correct Answers")

**📌 Points to note :**
* The first plot shows users' answers to the multiple-choice questions. Users **have 4 options to choose from**; **-1** marks lecture rows.  
* **answered_correctly** is our target label. It is a binary variable where **0 means False and 1 means True**; -1 is ignored since it marks lecture rows. It specifies whether the answers chosen in the first plot are correct. We can see that **users tend to answer correctly**. 

In [None]:
f = plt.figure(figsize=(16, 10))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    ay = sns.countplot(train['prior_question_had_explanation'].dropna(), palette="Pastel1")
    for p in ax.patches:
        height = p.get_height()
        ay.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/101230332*100),
                ha="center", fontsize=12)
    plt.xlabel('Prior question had explanation',fontsize=12)
    plt.ylabel('count',fontsize=12)
    plt.title("Users Saw Explanation", fontsize=14)
    
with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    aa = sns.countplot(train['content_type_id'], palette="twilight_r")
    for p in ax.patches:
        height = p.get_height()
        aa.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/101230332*100),
                ha="center", fontsize=12)
    plt.xlabel('Content Type Id',fontsize=12)
    plt.ylabel('count',fontsize=12)
    plt.title("Posed Question/Watching Lecture", fontsize=14)

**📌 Points to note :**

* It seems that in most cases users saw an explanation and the correct response after answering the previous question bundle. The value is **True** if the user saw an explanation and **False** otherwise. 

* Most of the events in the second plot, **around 98%, are questions posed to users**. Only a small share **(~2%) are users watching a lecture**.
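
The percentage labels in the plots above divide each bar height by a hard-coded row count (101230332). As a hedged alternative, the sketch below derives the same kind of percentages with `value_counts(normalize=True)`, so the labels stay correct if the data is subsampled. The toy series stands in for `train['content_type_id']` and its values are made up.

```python
import pandas as pd

# Toy stand-in for train['content_type_id']: three questions and one lecture.
content_type = pd.Series([0, 0, 0, 1], name="content_type_id")

# normalize=True returns fractions that sum to 1, so no row count is hard-coded.
share = content_type.value_counts(normalize=True).mul(100).round(2)
print(share)
```

The resulting series is indexed by category, so `share.loc[0]` gives the percentage of question events directly.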

### 5.2.2 <a id='5.2.2'>Questions Data</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    ay = sns.countplot(questions['correct_answer'], palette="hls")
    for p in ax.patches:
        height = p.get_height()
        ay.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/13523*100),
                ha="center", fontsize=14)
    plt.title("Correct Answer To Questions")

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    aa = sns.countplot(questions['part'], palette="deep")
    for p in ax.patches:
        height = p.get_height()
        aa.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/13523*100),
                ha="center", fontsize=12)
    plt.title("TOEIC English-language Assessment Section Number")

For more information about the part column in questions.csv, visit [here](https://www.iibc-global.org/english/toeic/test/lr/about/format.html).

**📌 Points to note :**
* The first plot shows the correct answers to the multiple-choice questions. Its distribution is almost the same as that of the **user_answer column in train.csv** plotted earlier, except that we **don't have the -1 label for lectures**.  
* The second bar plot shows the parts of the TOEIC English-language assessment. It has 7 parts, as given [here](https://www.iibc-global.org/english/toeic/test/lr/about/format.html). Most of the questions come from **part 5 (Incomplete Sentences)**. The TOEIC assessment has 2 sections, **Listening and Reading**, where the former has **4 parts** and the latter has **3 parts**.
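
The part-to-section split described above can be written as a small lookup: parts 1 to 4 belong to the Listening section and parts 5 to 7 to the Reading section. The helper name `toeic_section` is an assumption for illustration; a column built this way could serve as a coarser grouping of the `part` feature.

```python
def toeic_section(part):
    """Map a TOEIC part number (1-7) to its section name."""
    if 1 <= part <= 4:
        return "Listening"
    if 5 <= part <= 7:
        return "Reading"
    raise ValueError(f"unexpected part number: {part}")


# Section for every part, in order.
sections = [toeic_section(p) for p in range(1, 8)]
print(sections)
```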

In [None]:
questions['tag'] = questions['tags'].str.split(' ')
questions['tag_length'] = questions['tag'].str.len()
# drop questions without tags before casting to a small integer type
tag_len = questions['tag_length'].dropna().astype('int8')

# one row per (question, tag) pair; Series.explode() takes no column argument
top_tags = questions['tag'].explode().reset_index()

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    aa = sns.countplot(tag_len, palette="coolwarm")
    for p in aa.patches:
        height = p.get_height()
        aa.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/13522*100),
                ha="center", fontsize=12)
    plt.xlabel('number of tags',fontsize=14)
    plt.ylabel('count',fontsize=14)
    plt.title("Number Of Tags Per Questions", fontsize=14)
    
with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    sns.countplot(y = top_tags['tag'], order = top_tags.tag.value_counts().index[:10], palette="ocean_r")
    plt.xlabel('count',fontsize=14)
    plt.ylabel('tag',fontsize=14)
    plt.title("Top 10 Tags",fontsize=14)

**📌 Points to note :**

* Each question is assigned one or more tags. Most questions have only **one tag (48%), followed by questions with three tags (29%)**. 
* **92 is the most used tag for questions**.


### 5.2.3 <a id='5.2.3'>Lectures Data</a>
[Table of contents](#0.1)

This tabular data contain metadata for the lectures watched by students. 

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    ay = sns.countplot(lectures['part'], palette='BuPu_r')
    for p in ax.patches:
        height = p.get_height()
        ay.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/418*100),
                ha="center", fontsize=14)
    plt.title("Category code for lecture")

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    aa = sns.countplot(lectures['type_of'], palette="gist_stern_r")
    for p in ax.patches:
        height = p.get_height()
        aa.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/418*100),
                ha="center", fontsize=12)
    plt.title("Lecture Description")

**📌 Points to note :**

* We can see **7 category codes** for lectures. **34%** of lectures have **code 5**. Only **5%** of lectures have **code 3**.
* Most of the lectures seem to describe **theoretical concepts (53%)**, followed by **44%** of lectures on **solving questions**. The **Intention and Starter** categories account for far fewer lectures, **2% and 1% respectively**. 

# 6. <a id='6'>Multiple Features📈</a>
[Table of contents](#0.1)

## 6.1 <a id='6.1'>Train Data Features</a>

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    ay = sns.countplot(train['user_answer'], hue = train['prior_question_had_explanation'], palette="vlag")
    for p in ax.patches:
        height = p.get_height()
        ay.text(p.get_x()+p.get_width()/2.,
                height + 2,
                '{:1.2f}%'.format(height/101230332*100),
                ha="center", fontsize=12)
    plt.xlabel('User Answer',fontsize=14)
    plt.ylabel('count',fontsize=14)
    plt.title("User's Answer With And Without Explanation", fontsize=16)
        
with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    aa = sns.countplot(train['answered_correctly'], hue= train['prior_question_had_explanation'], palette="deep")
    for p in ax.patches:
        height = p.get_height()
        aa.text(p.get_x()+p.get_width()/2.,
                height + 2,
                '{:1.2f}%'.format(height/101230332*100),
                ha="center", fontsize=12)
    plt.legend(loc='upper center')
    plt.xlabel('Answered Correctly',fontsize=14)
    plt.ylabel('count',fontsize=14)
    plt.title("User's Saw explanation and Correct Answers", fontsize=15)

**📌 Points to note :**
* Most of the time, users saw an explanation after giving their answers. Also, **answers** range from **-1 to 3**, where -1 indicates a lecture video. 
* Users tend to answer correctly and to see an explanation after answering the previous question bundle. We can also see **class imbalance**.

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("whitegrid"):
    sns.countplot(train['user_answer'], hue = train['answered_correctly'], palette="husl")
    plt.title("User's Answer vs Answered Correctly")

**📌 Points to note :**

* The plot shows users' responses to the multiple-choice questions, split by whether the answer was correct. 

## 6.2 <a id='6.2'>Questions MetaData Features</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("whitegrid"):
    sns.countplot(questions['correct_answer'], hue = questions['part'], palette="Spectral")
    plt.title("Correct Answer by Part")

**📌 Points to note :**

* The plot shows the distribution of correct answers for each of the 7 parts. There seem to be only three answer choices for part 2.

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("white"):
    sns.catplot(x="part", y="tag_length", kind="box",
                col="correct_answer", aspect=.7, data=questions)

**📌 Points to note :**

* We observe that for most answer choices **there are 3-4 tags associated per question**.
* **Part 7** has the most tags across all available correct answers. 
* For all correct answers, **3 tags** are present in **Parts 1, 3, and 4**.
* For **correct answers (choices 0 and 4)** we see **questions with 4, 5 and 6 tags associated**.  

In [None]:
with sns.axes_style("white"):
    sns.pairplot(questions, hue="correct_answer", palette="gnuplot_r", diag_kind="kde",
                 height=3, corner=True, plot_kws=dict(linewidth=1, alpha=1))

## 6.3 <a id='6.3'>Lectures MetaData Features</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("whitegrid"):
    sns.countplot(lectures['part'], hue = lectures['type_of'], palette="Spectral")
    plt.title("Categories in Parts")

**📌 Points to note :**

* The **Concepts and Solving questions** categories are present in every part. 
* The **Intention and Starter** categories are missing from most parts. There is only **one occurrence of the Intention category** in **part 2**, and the **Starter category is present only in parts 5 and 6**.

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("white"):
    sns.catplot(x="part", y="tag", kind="box",
                col="type_of", aspect=.7, data=lectures)

**📌 Points to note :**

* The plot above **adds the tag variable to show more relationships**. The **Concepts and Solving questions** categories are present in every part, as we also saw in the bar plot above. 

* There are **151 unique tags for lectures**, most of which fall into two categories, **Concepts and Solving questions**.

* **Parts 5 and 6** seem to have more tags, mostly in three categories: **Concepts, Solving questions and Starter**. 

* There is one **outlier in Part 7 for the *Solving Question* category**. 

In [None]:
with sns.axes_style("white"):
    sns.pairplot(lectures, hue="type_of", palette="copper", height=3,
                 corner=True, plot_kws=dict(linewidth=1, alpha=0.6))

In [None]:
gc.collect()

## 6.4 <a id='6.4'>Feature Correlation</a>
[Table of contents](#0.1)

Let's see some correlation using heatmap.

In [None]:
f = plt.figure(figsize=(16, 10))

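# compute the correlation matrix once; it is expensive on the full train data
train_corr = train.corr()
mask = np.triu(np.ones_like(train_corr, dtype=bool))

with sns.axes_style("white"):
    sns.heatmap(train_corr, mask=mask, square=True, cmap = 'YlGnBu', annot=True);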
    plt.title("Train Data Feature Correlation", fontsize=14)

In [None]:
f = plt.figure(figsize=(16, 10))

mask = np.triu(np.ones_like(questions.corr(), dtype=bool))

with sns.axes_style("white"):
    sns.heatmap(questions.corr(), mask=mask, square=True, cmap = 'YlOrBr', annot=True);
    plt.title("Questions Data Feature Correlation", fontsize=14)

In [None]:
f = plt.figure(figsize=(16, 10))

mask = np.triu(np.ones_like(lectures.corr(), dtype=bool))

with sns.axes_style("white"):
    sns.heatmap(lectures.corr(), mask=mask, square=True, cmap = 'icefire', annot=True);
    plt.title("Lectures Data Feature Correlation", fontsize=14)

In [None]:
gc.collect()

# 7. <a id='7'>Reference</a>
[Table of contents](#0.1)
* [Competition API Detailed Introduction](https://www.kaggle.com/sohier/competition-api-detailed-introduction)
* [Unique Values](https://stackoverflow.com/questions/27241253/print-the-unique-values-in-every-column-in-a-pandas-dataframe)
* [Fast Tabular Data Read](https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid/notebook)