# Have A Look
* [TOEIC](https://www.iibc-global.org/english/toeic/test/lr.html) is a very popular English language test in Japan.
* The Listening and Reading test is divided into 7 parts.
* The structure of the TOEIC test can be seen [here](https://www.iibc-global.org/english/toeic/test/lr/about/format.html).
* [Riiid](https://sunryse.co/app/startups/r_6WjOnZ6mV7JPbpY) is the first Korean startup to offer a TOEIC learning app that uses AI to optimize learning for each individual.
* The [Riiid product page](https://www.riiid.co/en/product) links to its TOEIC learning app, [santatoeic](https://santatoeic.jp/intro).
* The TOEIC learning app, [santatoeic](https://santatoeic.jp/intro), allows you to try tests and lectures with free registration.

# Installation

In [None]:
# Regular Libraries
import riiideducation
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from matplotlib.ticker import FuncFormatter
import os
%matplotlib inline
import matplotlib.image as mpimg
from IPython.display import display_html
from PIL import Image
import gc
from scipy.stats import pearsonr
import tqdm
import copy
import re

# You can only call make_env() once, so don't lose it!
env = riiideducation.make_env()

In [None]:
# Color Palette
custom_colors = ['#7400ff', '#a788e4', '#d216d2', '#ffb500', '#36c9dd']

In [None]:
# Rapids Imports
import cupy # CuPy is an open-source array library accelerated with NVIDIA CUDA.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Checking data statistics of training data

In [None]:
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', low_memory=False, nrows=10**5, 
                       dtype={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32', 'content_id': 'int16', 'content_type_id': 'int8',
                              'task_container_id': 'int16', 'user_answer': 'int8', 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             }
                      )

In [None]:
# Set up the paths to the training data in multiple formats
TRAIN_FEATHER_PATH = '../input/riiid-train-data-multiple-formats/riiid_train.feather'
TRAIN_H5_PATH = '../input/riiid-train-data-multiple-formats/riiid_train.h5'
TRAIN_JAY_PATH = '../input/riiid-train-data-multiple-formats/riiid_train.jay'
TRAIN_PARQUET_PATH = '../input/riiid-train-data-multiple-formats/riiid_train.parquet'
TRAIN_PKL_PATH = '../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip'

In [None]:
# Used for changing the color of text in print statements
from colorama import Fore, Back, Style
y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
sr_ = Style.RESET_ALL

In [None]:
# Display some of the training data
train_df.head(5).style.applymap(lambda x: 'background-color:lightsteelblue')

In [None]:
# Count the unique values in each column
train_df.nunique()

In [None]:
# Check the data type
train_df.dtypes

In [None]:
#Monitor memory usage
train_df.memory_usage(deep=True)

In [None]:
#Type conversion for efficient use of memory
from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype

def reduce_mem_usage(df, use_float16=False):
    """
    Iterate through all the columns of a dataframe and modify the data type to reduce memory usage.        
    """
    
    start_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))
    
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]):
            continue
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if use_float16 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype("category")

    end_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))
    print("Decreased by {:.1f}%".format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [None]:
train_df = reduce_mem_usage(train_df)
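
To make the dtype selection concrete, here is a minimal sketch on a made-up three-column frame (the column names and values are illustrative and not taken from train.csv). It uses pandas' built-in `pd.to_numeric(..., downcast=...)`, which is a swapped-in alternative to the hand-written bounds checks in `reduce_mem_usage`, not that function's own logic.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; values chosen to fit small dtypes.
toy = pd.DataFrame({
    "small_int": np.array([0, 1, 2, 3], dtype="int64"),
    "medium_int": np.array([0, 1000, 20000, 30000], dtype="int64"),
    "ratio": np.array([0.25, 0.5, 0.75, 1.0], dtype="float64"),
})

before = toy.memory_usage(index=False).sum()

# Downcast each column to the smallest dtype that can hold its values.
toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
toy["medium_int"] = pd.to_numeric(toy["medium_int"], downcast="integer")
toy["ratio"] = pd.to_numeric(toy["ratio"], downcast="float")

after = toy.memory_usage(index=False).sum()
print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```

The integer columns shrink to the narrowest signed type that covers their range, and the float column drops to float32, which is the smallest float type `downcast="float"` will choose.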

In [None]:
# Check for missing values in the training data
train_df.isnull().sum()

* There are missing values in two columns.

In [None]:
# Display of training data
train_df.info()

* For 'answered_correctly', let's leave out the rows containing -1, because the [Data Description](https://www.kaggle.com/c/riiid-test-answer-prediction/data) says 'Read -1 as null, for lectures'.

In [None]:
len(train_df)

In [None]:
# Display Summary Statistics
train_df.describe().style.applymap(lambda x: 'background-color:lightgreen')

In [None]:
# Display information by user ID 
print(pd.pivot_table(train_df, index='user_id', values=['timestamp', 'prior_question_elapsed_time', 'answered_correctly'], aggfunc='mean'))

In [None]:
# Display information by user ID
train_df_pivot = pd.pivot_table(train_df, index='user_id', columns='answered_correctly')
train_df_pivot.head(5).style.applymap(lambda x: 'background-color:lightgreen')

In [None]:
# Read -1 as null, for lectures: show the share of lecture rows.
print((train_df['answered_correctly'] == -1).mean())
# Exclude the lecture rows, then show the overall share of correct answers.
train_df_questions = train_df[train_df['answered_correctly'] != -1]
train_df_questions['answered_correctly'].mean()
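
Why the -1 rows have to go can be seen on a tiny example. The sketch below uses a hypothetical six-row log (values invented for illustration) and compares the mean with and without the lecture rows.

```python
import pandas as pd

# Hypothetical mini-log: two lecture rows (-1) and four question rows.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "content_type_id": [0, 1, 0, 0, 1, 0],
    "answered_correctly": [1, -1, 0, 1, -1, 1],
})

# Averaging with the -1 rows included understates accuracy.
naive_mean = toy["answered_correctly"].mean()

# Keep question rows only, then average.
questions_only = toy[toy["answered_correctly"] != -1]
true_mean = questions_only["answered_correctly"].mean()

print(naive_mean, true_mean)
```

With the lecture rows left in, the -1 placeholders are treated as real answers and drag the average down, so the filter has to come before any accuracy statistic.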

In [None]:
#Display the average percentage of correct answers per user
train_df_questions.groupby('user_id')['answered_correctly'].mean()

In [None]:
# Display how often each value occurs in a specific column
train_df['user_answer'].value_counts()

* The questions are multiple choice, and the answers are expected to take a value between 0 and 3.
* According to [the data description](https://www.kaggle.com/c/riiid-test-answer-prediction/data), -1 is for lectures and is excluded

# Checking data statistics of test and sample data

In [None]:
# You can only iterate through a result from `env.iter_test()` once
# so be careful not to lose it once you start iterating.
iter_test = env.iter_test()

Let's get the data for the first test batch and check it out.

In [None]:
sample_prediction_df = pd.read_csv('../input/riiid-test-answer-prediction/example_sample_submission.csv')

In [None]:
sample_prediction_df.head(5)

Note that we'll get an error if we try to continue on to the next batch without making our predictions for the current batch.

In [None]:
print('Number of rows in training set: ', train_df.shape[0])
print('Number of columns in training set: ', train_df.shape[1])

In [None]:
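# Fetch the first test batch; each batch is a (test_df, sample_prediction_df) pair
(test_df, sample_prediction_df) = next(iter_test)
test_df.head(5)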

In [None]:
env.predict(sample_prediction_df)

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
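
The two rules of the loop (iterate only once, and predict before asking for the next batch) can be tried offline with a stand-in. The `MockEnv` class below is purely hypothetical: it is not the `riiideducation` implementation, only a rough sketch of the contract described in the notes, fed with made-up batches.

```python
import pandas as pd

class MockEnv:
    """Hypothetical stand-in mimicking the predict-before-next-batch contract."""

    def __init__(self, batches):
        self._batches = batches
        self._awaiting_prediction = False
        self.submitted = []

    def iter_test(self):
        for test_df in self._batches:
            if self._awaiting_prediction:
                raise RuntimeError("call predict() before requesting the next batch")
            self._awaiting_prediction = True
            # Sample predictions cover question rows only (content_type_id == 0).
            sample = test_df.loc[test_df["content_type_id"] == 0, ["row_id"]].copy()
            sample["answered_correctly"] = 0.5
            yield test_df, sample

    def predict(self, prediction_df):
        self.submitted.append(prediction_df)
        self._awaiting_prediction = False

# Made-up batches: the first one contains a lecture row.
batches = [
    pd.DataFrame({"row_id": [0, 1], "content_type_id": [0, 1]}),
    pd.DataFrame({"row_id": [2, 3], "content_type_id": [0, 0]}),
]

env_mock = MockEnv(batches)
for test_df, sample_prediction_df in env_mock.iter_test():
    test_df["answered_correctly"] = 0.5
    env_mock.predict(test_df.loc[test_df["content_type_id"] == 0, ["row_id", "answered_correctly"]])

total_rows = sum(len(df) for df in env_mock.submitted)
print(total_rows)
```

Only the question rows are submitted, so the lecture row in the first made-up batch never reaches `predict`.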

In [None]:
test_df.groupby('user_id').agg(['min', 'max', 'mean']).head(5).style.applymap(lambda x: 'background-color:lightgreen')

# Checking data statistics of question's data

* [questions.csv](https://www.kaggle.com/c/riiid-test-answer-prediction/data) contains metadata for the questions posed to users.

In [None]:
questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')

In [None]:
# Confirmation of the format of question's data
print('Number of rows in question data set: ', questions.shape[0])
print('Number of columns in question data set: ', questions.shape[1])

In [None]:
# Display some of the question's data
questions.head(5).style.applymap(lambda x: 'background-color:lightsteelblue')

In [None]:
len(questions)

In [None]:
# Count the unique values in each column
questions.nunique()

In [None]:
# Check summary statistics of the question data
questions.describe().style.applymap(lambda x: 'background-color:lightgreen')

In [None]:
# Check for missing values in the question's data
questions.isnull().sum()

* There are missing values in one column.

In [None]:
# Display of question's data
questions.info()

In [None]:
questions.groupby('question_id').agg(['min', 'max', 'mean']).head(5).style.applymap(lambda x: 'background-color:lightgreen')

In [None]:
# Let's take a look at the data for bundle_id=7796.
# I think we can assume that bundle_id identifies a question set (the "major" question)
# and question_id an individual question within it (the "minor" question).
# We can see that all the parts match and the tags are almost identical.
questions[questions["bundle_id"] == 7796]
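
The same check can be run for every bundle at once instead of one ID at a time. The sketch below works on a hypothetical five-row frame (IDs and parts are invented) and counts the questions per bundle and the number of distinct parts inside each bundle.

```python
import pandas as pd

# Hypothetical question metadata; IDs and parts are made up for illustration.
toy_questions = pd.DataFrame({
    "question_id": [10, 11, 12, 13, 14],
    "bundle_id": [10, 11, 11, 11, 14],
    "part": [5, 6, 6, 6, 2],
})

# Questions per bundle, and how many distinct parts appear inside each bundle.
bundle_summary = toy_questions.groupby("bundle_id").agg(
    n_questions=("question_id", "count"),
    n_parts=("part", "nunique"),
)
print(bundle_summary)
```

If `n_parts` is 1 for every bundle in the real questions.csv, that would support reading a bundle as one question set belonging to a single part.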

In [None]:
correct = train_df[train_df.answered_correctly != -1].answered_correctly.value_counts(ascending=True)

fig = plt.figure(figsize=(12,4))
correct.plot.barh()
for i, v in zip(correct.index, correct.values):
    plt.text(v, i, '{:,}'.format(v), color='white', fontweight='bold', fontsize=14, ha='right', va='center')
plt.title("Questions answered correctly")
plt.xticks(rotation=0)
plt.show()

# Checking data statistics of lecture's data

* [lectures.csv](https://www.kaggle.com/c/riiid-test-answer-prediction/data) contains metadata for the lectures watched by users as they progress in their education.

In [None]:
lectures = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')

In [None]:
len(lectures)

In [None]:
# Check the data type
lectures.dtypes

In [None]:
# Count the unique values in each column
lectures.nunique()

In [None]:
# Confirmation of the format of lecture's data
print('Number of rows in lecture data set: ', lectures.shape[0])
print('Number of columns in lecture data set: ', lectures.shape[1])

In [None]:
# Check summary statistics of the lecture data
lectures.describe().style.applymap(lambda x: 'background-color:lightgreen')

In [None]:
# Check for missing values in the lecture's data
lectures.isnull().sum()

* There are no columns with missing values.

In [None]:
# Display of the lecture's data
lectures.info()

In [None]:
lectures["type_of"].drop_duplicates()

# Data Visualization

In [None]:
WIDTH = 800

In [None]:
cids = train_df.content_id.value_counts()[:30]

fig = plt.figure(figsize=(12,6))
ax = cids.plot.bar()
plt.title("Thirty most used content id's")
plt.xticks(rotation=90)
ax.get_yaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ','))) #add thousands separator
plt.show()

In [None]:

ds = train_df['content_type_id'].value_counts().reset_index()

ds.columns = [
    'content_type_id', 
    'percent'
]

ds['percent'] /= len(train_df)

fig = px.pie(
    ds, 
    names='content_type_id', 
    values='percent', 
    title='Lectures & questions', 
    width=WIDTH,
    height=500 
)

fig.show()

In [None]:
ds = train_df['user_id'].value_counts().reset_index()
ds.columns = ['user_id', 'count']
ds = ds.sort_values('user_id')

fig = px.line(
    ds, 
    x='user_id', 
    y='count', 
    title='User action distribution', 
    height=600, 
    width=800
)

fig.show()

In [None]:
# Count the non-null observations
n = train_df['prior_question_elapsed_time'].count()
print(n)
import math
# Use Sturges' formula, k = 1 + log2(n), to choose the number of histogram bins
k = 1 + math.log2(n)
# Display a histogram of prior_question_elapsed_time in the training data
# (histplot replaces distplot, which has been deprecated since seaborn 0.11)
sns.histplot(train_df['prior_question_elapsed_time'], kde=True, bins=int(k))
# Graph title
plt.title('ElapsedTime')
# Show the histogram
plt.show()
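
As a quick worked example of Sturges' formula, k = 1 + log2(n), where n in its usual statement is the number of observations: the helper below (a hypothetical name, written only for this illustration) truncates the result to an integer in the same way as `int(k)` in the cell above.

```python
import math

def sturges_bins(n_observations: int) -> int:
    """Bin count from Sturges' formula, truncated to an integer."""
    return int(1 + math.log2(n_observations))

for n_obs in (1_024, 100_000):
    print(n_obs, sturges_bins(n_obs))
```

The bin count grows only logarithmically, so going from about a thousand to a hundred thousand rows adds just a handful of bins.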

In [None]:
# Count how often each elapsed-time value occurs
ds = train_df['prior_question_elapsed_time'].value_counts().reset_index()
ds.columns = ['prior_question_elapsed_time', 'count']
ds = ds.sort_values('prior_question_elapsed_time')

fig = px.line(
    ds, 
    x='prior_question_elapsed_time', 
    y='count', 
    title='Distribution of prior_question_elapsed_time', 
    height=600, 
    width=900
)

fig.show()

In [None]:
# Count the observations
n = train_df['timestamp'].count()
# Use Sturges' formula, k = 1 + log2(n), to choose the number of histogram bins
k = 1 + math.log2(n)
# Graph title
plt.title('Timestamp')
# Show the histogram
train_df['timestamp'].hist(bins=int(k))

* Timestamp is the time in milliseconds between this user interaction and the first event completion from that user.
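
Raw millisecond values are hard to read on an axis, so it can help to convert them first. A minimal sketch, using made-up timestamps rather than rows from train.csv:

```python
import pandas as pd

# Hypothetical timestamps in milliseconds since a user's first event.
toy = pd.DataFrame({"timestamp": [0, 90_000, 3_600_000, 172_800_000]})

# Convert to Timedelta for readable output, and to fractional hours for plotting.
toy["elapsed"] = pd.to_timedelta(toy["timestamp"], unit="ms")
toy["elapsed_hours"] = toy["timestamp"] / (1000 * 60 * 60)
print(toy)
```

The `elapsed` column is convenient for display, while the plain float column `elapsed_hours` is easier to pass to a histogram.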

In [None]:
sns.countplot(y="part", data=questions)

* The structure of the TOEIC test can be seen [here](https://www.iibc-global.org/english/toeic/test/lr/about/format.html).
* Part 5 has the most questions. Part 5 (Incomplete Sentences) tests grammar and vocabulary.

In [None]:
# Display the distribution of each part of the question
sns.countplot(y="part", hue=None, data=questions)


In [None]:
# Display the distribution of lecture types.
sns.countplot(x="part", hue="type_of",data=lectures)

In [None]:
sns.countplot(y="user_answer", hue=None, data=train_df)

* The questions are multiple choice, and the answers are expected to take a value between 0 and 3.
* According to [the data description](https://www.kaggle.com/c/riiid-test-answer-prediction/data), -1 is for lectures and is excluded

In [None]:
# Distribution of correct answers by userID
grouped_by_user_df = train_df_questions.groupby('user_id')
user_answers_df = grouped_by_user_df.agg({'answered_correctly': ['mean', 'count'] })

user_answers_df[('answered_correctly','mean')].hist(bins = int(k))

* Next, we'd like to compare prior_question_elapsed_time between the users with the highest and the lowest percentages of correct answers.
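
One possible way to set up that comparison is sketched below. Everything in it is an assumption made for illustration: the eight-row log is invented, and splitting users at the median accuracy is an arbitrary choice, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical question-only log for four users.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "answered_correctly": [1, 1, 1, 0, 0, 0, 1, 1],
    "prior_question_elapsed_time": [20000.0, 22000.0, 30000.0, 26000.0,
                                    40000.0, 44000.0, 18000.0, 20000.0],
})

# Per-user accuracy and mean elapsed time.
per_user = toy.groupby("user_id").agg(
    accuracy=("answered_correctly", "mean"),
    mean_elapsed=("prior_question_elapsed_time", "mean"),
)

# Split users at the median accuracy (an arbitrary cut-off for this sketch).
cutoff = per_user["accuracy"].median()
per_user["group"] = (per_user["accuracy"] > cutoff).map({True: "high", False: "low"})

# Compare the average elapsed time between the two groups.
comparison = per_user.groupby("group")["mean_elapsed"].mean()
print(comparison)
```

On the real data, `train_df_questions` would take the place of `toy`, and quantile-based cut-offs (for example top and bottom 10%) might be a better fit than a median split.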

In [None]:
# Draw a pie chart of the lecture types.
# Take the labels from the counts themselves so they always match the slices.
type_counts = lectures["type_of"].value_counts()
plt.pie(type_counts, labels=type_counts.index, autopct="%.1f%%")
plt.title("Type of Lectures")
plt.show()

In [None]:
from tqdm import tqdm
import time

# Set the total value and add a description
with tqdm(total=1000, desc='Progress rate') as bar:
    # 40 updates of 25 reach the total of 1000 exactly
    for i in range(40):
        # Advance the progress bar
        bar.update(25)
        time.sleep(1)

# Acknowledgements
* [Competition API Detailed Introduction](https://www.kaggle.com/sohier/competition-api-detailed-introduction)
* [Riiid: Comprehensive EDA + Baseline](https://www.kaggle.com/erikbruin/riiid-comprehensive-eda-baseline)
* [Riiid! Answer Correctness Prediction EDA. Modeling](https://www.kaggle.com/isaienkov/riiid-answer-correctness-prediction-eda-modeling)
* [Riiid - EDA&Baseline](https://www.kaggle.com/yutohisamatsu/riiid-eda-baseline)
* [Answer Correctness - Modeling with RapidsAI](https://www.kaggle.com/andradaolteanu/answer-correctness-modeling-with-rapidsai)
* [Riiid: EDA of full dataset](https://www.kaggle.com/erikbruin/riiid-eda-of-full-dataset#Loading-the-data)
* [日本語EDA for biginner](https://www.kaggle.com/chumajin/eda-for-biginner)
* [【日本語】[Japanese] Riiid コンペに取り組む前の準備・随時更新](https://www.kaggle.com/takiyu/japanese-riiid)
* [Answer Correctness - RAPIDS, XGB, LGBM](https://www.kaggle.com/andradaolteanu/answer-correctness-rapids-xgb-lgbm)
