* In this competition, we predict students' performance.
* Students answer questions or take lectures about the TOEIC test.
* In this notebook, you can learn about the TOEIC test.
* If you like it, feel free to upvote.

# About TOEIC test

* TOEIC is one of the most popular tests of English.
* Listening part (100 questions): Part1~Part4
* Reading part (100 questions): Part5~Part7

# Test Content
* The following explanations are quoted from the [IIBC official website](https://www.iibc-global.org/english/toeic/test/lr/about/format.html).

# Part1: Photographs


> Four short statements regarding a photograph will be spoken only one time. The statements will not be printed. Of these four statements, select the one that best describes the photograph and mark your answer on the answer sheet.

# Part2: Question-response

> Three responses to one question or statement will be spoken only one time. They will not be printed. Select the best response for the question, and mark your answer on the answer sheet.

# Part3: Conversations

> Conversations between two or three people will be spoken only one time. They will not be printed. Listen to each conversation and read the questions printed in the test book (the questions will also be spoken), select the best response for the question, and mark your answer on the answer sheet. Some questions may require responses related to information found in diagrams, etc. printed on the test book as well as what you heard in the conversations. There are three questions for each conversation.

# Part4: Talks

> Short talks such as announcements or narrations will be spoken only one time. They will not be printed. Listen to each talk and read the questions printed in the test book (the questions will also be spoken), select the best response for the question, and mark your answer on the answer sheet. Some questions may require responses related to information found in diagrams, etc. printed on the test book as well as what you heard in the talks. There are three questions for each talk.

# Part5: Incomplete Sentences

> Select the best answer of the four choices to complete the sentence, and mark your answer on the answer sheet.

# Part6: Text Completion

> Select the best answer of the four choices (words, phrases, or a sentence) to complete the text, and mark your answer on the answer sheet. There are four questions for each text.

# Part7: Single, Multiple Passages

> A range of different texts will be printed in the test book. Read the questions, select the best answer of the four choices, and mark your answer on the answer sheet. Some questions may require you to select the best place to insert a sentence within a text. There are multiple questions for each text.

# Merge train data and part of TOEIC

* As you can see above, each part of TOEIC has its own characteristics, and the skills required to answer correctly also differ by part.
* That's why I am wondering whether the TOEIC part can be used as a feature.
* In this notebook, we merge the train data with the TOEIC part.
* By doing this, we can easily grasp the relationship between content_id and part.
* This approach may not be practical on the entire train data because it is very large.
* I am a beginner and struggling to understand the data in this competition.
* If I have made a mistake, please let me know.
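
As a rough sketch of the merge idea above, the lookup can also be done without a Python loop by using `Series.map`. The toy frames below (`question_toy`, `lecture_toy`, `train_toy`) and their values are made up for illustration; only the column names follow the competition files. This is one possible approach, not necessarily the best one for the full data.

```python
import pandas as pd

# Toy stand-ins for questions.csv, lectures.csv and train (values are made up).
question_toy = pd.DataFrame({"question_id": [0, 1, 2], "part": [1, 5, 2]})
lecture_toy = pd.DataFrame({"lecture_id": [1, 89], "part": [6, 5]})
train_toy = pd.DataFrame({
    "content_id": [0, 1, 89, 2, 1],
    "content_type_id": [0, 0, 1, 0, 1],  # 0 = question, 1 = lecture
})

# Lookup Series indexed by id, so that Series.map can use them directly.
question_part = question_toy.set_index("question_id")["part"]
lecture_part = lecture_toy.set_index("lecture_id")["part"]

is_lecture = train_toy["content_type_id"].astype(bool)

# Map every row with both lookups, then keep the one that matches the row type.
part_if_question = train_toy["content_id"].map(question_part)
part_if_lecture = train_toy["content_id"].map(lecture_part)
train_toy["part"] = part_if_lecture.where(is_lecture, part_if_question).astype("int8")

print(train_toy)
```

Storing the part as `int8` is a small design choice that keeps the new column cheap in memory, which matters when the train data is very large.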

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
import gc
import plotly.graph_objects as go
import riiideducation

In [None]:
%%time
cols_to_load = ['row_id', 'user_id', 'answered_correctly', 'content_id', 'prior_question_had_explanation', 'prior_question_elapsed_time','content_type_id']
train = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")[cols_to_load]
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].astype('boolean')

In [None]:
question = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
lecture = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')

# Extracting only a part of the entire train data

In [None]:
train_small = train.head(10000000).copy()  # copy so that adding columns later does not touch a view of train
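
If the data is read from a CSV file instead of the pickle, a subset can be taken while loading. The sketch below uses a tiny in-memory CSV with made-up values in place of the real train file, and the dtypes are my own assumption about what fits each column.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real train file (values are made up).
csv_text = """row_id,user_id,content_id,content_type_id,answered_correctly
0,115,5692,0,1
1,115,5716,0,1
2,124,7900,0,0
3,124,89,1,-1
"""

# Assumed dtypes; smaller integer types reduce memory use.
dtypes = {
    "row_id": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "answered_correctly": "int8",
}

# nrows limits how many rows are parsed; usecols and dtype keep the frame small.
sample = pd.read_csv(io.StringIO(csv_text), usecols=list(dtypes), dtype=dtypes, nrows=3)
print(sample.dtypes)
print(len(sample))
```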

In [None]:
train_small.head()

In [None]:
question

In [None]:
lecture

# Making dictionaries that map question/lecture id to part

In [None]:
keys = question["question_id"]
values = question["part"]
toeic_questions_parts = {key:value for key, value in zip(keys, values)}

In [None]:
keys = lecture["lecture_id"]
values = lecture["part"]
toeic_lectures_parts = {key:value for key, value in zip(keys, values)}

In [None]:
print(list(toeic_questions_parts.items())[0:10])
print("The length is " + str(len(toeic_questions_parts)))

In [None]:
print(list(toeic_lectures_parts.items())[0:10])
print("The length is " + str(len(toeic_lectures_parts)))

# Associating content_id with part of TOEIC

* Next, we look up the TOEIC part for each content_id.

In [None]:
%%time
# First, associate content_id with questions.csv.
# content_type_id == True  → lecture
# content_type_id == False → question
parts = []
for i, k in zip(train_small["content_id"], train_small["content_type_id"]):
    if not k:  # question: look up its part
        parts.append(toeic_questions_parts[i])
    else:  # lecture: placeholder 0, filled in from lectures.csv later
        parts.append(0)

In [None]:
train_small["part"] = parts #making a new feature "part"
train_small.head()

In [None]:
train_small["part"].value_counts()

* According to the above, only about 2% of the rows (part=0) are lectures.
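
The share of lecture rows can also be read off directly with `value_counts(normalize=True)`. This is a small sketch on a made-up Series (`toy_part`), where 0 marks lecture rows as in the cell above.

```python
import pandas as pd

# Toy data: part == 0 marks lecture rows (values are made up).
toy_part = pd.Series([5, 2, 0, 5, 1, 2, 5, 0, 3, 5])

# normalize=True returns fractions instead of raw counts.
share = toy_part.value_counts(normalize=True)

# Fall back to 0.0 if there are no lecture rows at all.
lecture_share = share.get(0, 0.0)
print(f"Lecture rows: {lecture_share:.1%}")
```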

In [None]:
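%%time
# Next, associate content_id with lectures.csv.
# Only rows with part == 0 (lectures) are updated; .loc avoids chained assignment.
is_lecture = train_small["part"] == 0
lecture_parts = train_small.loc[is_lecture, "content_id"].map(toeic_lectures_parts)
# Lectures missing from the dictionary keep the placeholder 0.
train_small.loc[is_lecture, "part"] = lecture_parts.fillna(0).astype(int)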

In [None]:
train_small

In [None]:
train_small["part"].value_counts()

In [None]:
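parts = ["Part1","Part2","Part3","Part4","Part5","Part6","Part7"]
# Counts taken from the value_counts() output above (they sum to 10,000,000, so lecture rows are included).
question_number = [755467,1899847,859404,808760,4098861,1070802,506859]
fig = go.Figure([go.Bar(x=parts, y=question_number)])
fig.update_layout(title_text='The Number of Questions and Lectures by Part')
fig.show()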

* There are many more questions/lectures for Part5 and Part2 than for the other parts!

# Difficulty by TOEIC part

In [None]:
# Count correct/incorrect answers per part (index 0 = Part1, ..., index 6 = Part7).
# Rows where answered_correctly is neither 0 nor 1 (such as lecture rows) are skipped.
correct = [0] * 7
incorrect = [0] * 7
for i, j in zip(train_small["answered_correctly"], train_small["part"]):
    if not 1 <= j <= 7:
        continue
    if i == 1:
        correct[j - 1] += 1
    elif i == 0:
        incorrect[j - 1] += 1
for part_, (i, j) in enumerate(zip(correct, incorrect), start=1):
    print("Part" + str(part_) + ": Correct Answers " + str(i) + " Incorrect Answers " + str(j))

In [None]:
fig = go.Figure([go.Bar(x=parts, y=correct)])
fig.update_layout(title_text='The Number of Correct Answers')
fig.show()

In [None]:
fig = go.Figure([go.Bar(x=parts, y=incorrect)])
fig.update_layout(title_text='The Number of Incorrect Answers')
fig.show()

In [None]:
correct_rates = []
for i,j in zip(correct,incorrect):
    correct_rates.append(100*i/(i+j))
for i,j in enumerate(correct_rates):
    print("Correct Answer Rate of Part" + str(i+1) + " is " + str(j) + "%")

In [None]:
fig = go.Figure([go.Bar(x=parts, y=correct_rates)])
fig.update_layout(title_text='Correct Answer Rate (%)')
fig.show()

* According to the above, the correct answer rates of the 7 parts range from about 60% to 74%.
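
The per-part counts and rates can also be computed in one step with `groupby`. The sketch below runs on a made-up frame (`toy`), and it assumes that lecture rows carry `answered_correctly == -1`, so they are dropped before aggregating.

```python
import pandas as pd

# Toy data (values are made up); -1 is assumed to mark lecture rows.
toy = pd.DataFrame({
    "part": [1, 1, 2, 2, 2, 5, 5, 5, 5, 5],
    "answered_correctly": [1, 0, 1, 1, -1, 1, 0, 1, 1, -1],
})

# Drop lecture rows, then aggregate per part.
answered = toy[toy["answered_correctly"] != -1]
summary = answered.groupby("part")["answered_correctly"].agg(
    correct="sum", total="count", rate="mean"
)
summary["rate_pct"] = 100 * summary["rate"]
print(summary)
```

Because `answered_correctly` is 0 or 1 for questions, its mean per part is the correct answer rate.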