# The first 30 questions are special

I found that almost all users answered the same questions in their first 30 attempts. Since these questions are presented in order from part 1 to part 7, I presume they form a proficiency test that gauges the user's ability. The average correct answer rate for the first 30 questions differs from the rate for the 31st and subsequent questions.

## Table of Contents

1. Data preparation
2. Average correct answer rate for each attempt
3. Content of the first 30 questions
4. Histogram of correct answer rate

# 1. Data preparation

In [None]:
import numpy as np
import pandas as pd

In [None]:
dtypes_train = {
    'row_id': 'int64',
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
    'content_type_id': 'int8',
    'task_container_id': 'int16',
    'user_answer': 'int8',
    'answered_correctly':'int8',
    'prior_question_elapsed_time': 'float32',
    'prior_question_had_explanation': 'boolean'
}

train = pd.read_csv('../input/riiid-test-answer-prediction/train.csv', dtype=dtypes_train)
train = train.loc[train.content_type_id == 0, ['row_id', 'user_id', 'content_id', 'answered_correctly']]

In [None]:
dtypes_questions = {
    "question_id": "int16",
    "bundle_id": "int16",
    "part": "int8",
    "correct_answer": "int8",
    "tags": "object",
}

questions = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv', dtype=dtypes_questions)

# Lookup table from question_id to part (built without inplace edits on a slice,
# which would trigger a SettingWithCopyWarning)
part = questions.set_index('question_id')[['part']]

In [None]:
train = train.merge(part, how='left', left_on='content_id', right_index=True)

In [None]:
# Number each user's questions 1, 2, 3, ... in row order
train['attempt'] = train.groupby('user_id').cumcount() + 1

In [None]:
train.head()
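
To make the `attempt` column concrete, here is a minimal sketch on a made-up interaction log (the values are illustrative, not taken from `train.csv`). It assumes the rows are already in chronological order within each user, because `cumcount` numbers rows in the order they appear.

```python
import pandas as pd

# Toy interaction log (made-up values), already in chronological order per user.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 1, 2],
    'answered_correctly': [1, 0, 1, 1, 0],
})

# cumcount() numbers the rows of each group starting at 0,
# so adding 1 gives a 1-based attempt number per user.
toy['attempt'] = toy.groupby('user_id').cumcount() + 1
print(toy)
```

If the rows were not sorted by time within each user, sorting by `user_id` and `timestamp` first would be needed for the attempt numbers to be meaningful.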

# 2. Average correct answer rate for each attempt

Let's look at the number of users and the correct answer rate for each attempt.

In [None]:
groupby_attempt = train.groupby('attempt')['answered_correctly'].agg(['count', 'mean'])
groupby_attempt

In [None]:
groupby_attempt.describe()

The plot of the correct answer rate shows that the first 30 questions behave differently from the questions that follow.

In [None]:
groupby_attempt['mean'].plot()

In [None]:
groupby_attempt['mean'][:60].plot()
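
The difference seen in the plots can also be summarized as two numbers: the mean correct answer rate up to a cutoff attempt and after it. The sketch below is one way to compute that; `compare_first_n` is a hypothetical helper, and the toy data and the cutoff of 2 are made up so the example stays small. On the real data the cutoff would be 30.

```python
import pandas as pd

def compare_first_n(df, n=30):
    """Return (mean correctness for attempts <= n, mean correctness for attempts > n)."""
    is_early = df['attempt'] <= n
    early_mean = df.loc[is_early, 'answered_correctly'].mean()
    late_mean = df.loc[~is_early, 'answered_correctly'].mean()
    return early_mean, late_mean

# Toy data (made-up): two users with four attempts each.
toy = pd.DataFrame({
    'attempt': [1, 2, 3, 4, 1, 2, 3, 4],
    'answered_correctly': [1, 1, 0, 1, 1, 0, 0, 0],
})

early, late = compare_first_n(toy, n=2)
print(early, late)
```

Note that this pools all rows on each side of the cutoff, so users with many attempts weigh more heavily in the second number.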

# 3. Content of the first 30 questions

Almost all users answered the same content in their first 30 questions. The sequence of parts for these 30 questions is '1 1 1 2 3 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 7 7 7 7'. Some users skip the proficiency test and go straight to the content they want to solve.

In [None]:
for i in range(1,31):
    print('attempt_' + str(i))
    print(train[train['attempt']==i].content_id.value_counts().head(3))
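
The printed top-3 lists can be condensed into one table that shows, for each attempt, the most common `content_id` and the share of users who saw it. This is a sketch under assumptions: `top_content_share` is a hypothetical helper, and the content ids in the toy frame are made up.

```python
import pandas as pd

def top_content_share(df, max_attempt=30):
    """For each attempt, return the most common content_id and the share of rows with it."""
    early = df[df['attempt'] <= max_attempt]
    rows = []
    for attempt, group in early.groupby('attempt'):
        counts = group['content_id'].value_counts()  # sorted by frequency, descending
        rows.append({
            'attempt': attempt,
            'top_content_id': counts.index[0],
            'share': counts.iloc[0] / len(group),
        })
    return pd.DataFrame(rows).set_index('attempt')

# Toy data (made-up ids): four users, two attempts each.
toy = pd.DataFrame({
    'attempt':    [1, 1, 1, 1, 2, 2, 2, 2],
    'content_id': [10, 10, 10, 99, 11, 11, 50, 51],
})

share = top_content_share(toy, max_attempt=2)
print(share)
```

A share close to 1 at every attempt would support the claim that nearly everyone receives the same question at that position.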

In [None]:
pd.set_option('display.max_columns', 30)

pivot_part = train[train['attempt']<=30].pivot(index='user_id', columns='attempt', values=['part'])
pivot_part.head(30)
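
The part sequence can also be derived programmatically instead of being read off the pivot table. The sketch below takes the most frequent part at each attempt position and joins the results into a string; `modal_part_sequence` is a hypothetical helper, and the toy data is made up.

```python
import pandas as pd

def modal_part_sequence(df, max_attempt=30):
    """Return the most frequent part at each attempt position as a space-separated string."""
    early = df[df['attempt'] <= max_attempt]
    # mode() returns its values sorted, so ties resolve to the smallest part number.
    modal = early.groupby('attempt')['part'].agg(lambda s: s.mode().iloc[0])
    return ' '.join(str(p) for p in modal)

# Toy data (made-up): three users, three attempts each.
# Two users follow the common sequence, one starts directly with part 5.
toy = pd.DataFrame({
    'attempt': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'part':    [1, 1, 2, 1, 1, 2, 5, 5, 5],
})

sequence = modal_part_sequence(toy, max_attempt=3)
print(sequence)
```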

In [None]:
# question_ids seen at attempts 1-30, grouped by part
first30_question_ids = [
    7900, 7876, 175,                      # 1-3   part1
    1278,                                 # 4     part2
    2063, 2064, 2065,                     # 5-7   part3
    3363, 3364, 3365,                     # 8-16  part4
    2946, 2947, 2948,
    2593, 2594, 2595,
    4492, 4120, 4696, 6116, 6173, 6370,   # 17-22 part5
    6877, 6878, 6879, 6880,               # 23-26 part6
    7216, 7217, 7218, 7219,               # 27-30 part7
]

questions[questions['question_id'].isin(first30_question_ids)]
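
Section 3 notes that some users skip the proficiency test. One possible way to flag them, sketched below, is to check whether each user's early questions match the most common `content_id` at every attempt position. This matching rule is an assumption made for illustration, not something the data defines. `followed_common_sequence` is a hypothetical helper and the toy ids are made up.

```python
import pandas as pd

def followed_common_sequence(df, max_attempt=30):
    """Return a boolean Series per user: True if all early rows match the common sequence."""
    early = df[df['attempt'] <= max_attempt]
    # Most common content_id at each attempt position.
    common = early.groupby('attempt')['content_id'].agg(lambda s: s.mode().iloc[0])
    # Row-level check: does this row's content match the common content for its attempt?
    matches = early['content_id'] == early['attempt'].map(common)
    # A user counts as following the sequence only if every early row matches.
    # Users with fewer than max_attempt rows pass if all the rows they have match.
    return matches.groupby(early['user_id']).all()

# Toy data (made-up ids): users 1 and 2 follow the common sequence, user 3 does not.
toy = pd.DataFrame({
    'user_id':    [1, 1, 2, 2, 3, 3],
    'attempt':    [1, 2, 1, 2, 1, 2],
    'content_id': [10, 11, 10, 11, 99, 98],
})

followed = followed_common_sequence(toy, max_attempt=2)
print(followed)
```

Requiring an exact match at every position is strict. A looser rule, such as a minimum fraction of matching attempts, may suit the real data better.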

# 4. Histogram of correct answer rate

In [None]:
first30 = train[train['attempt'] <= 30]

# Per-user count, sum, and mean of answered_correctly over all parts,
# over parts 1-4, and over parts 5-7.
subsets = {
    '_all': first30,
    '1234': first30[first30['part'].isin([1, 2, 3, 4])],
    '567': first30[first30['part'].isin([5, 6, 7])],
}

first30_summary = pd.concat(
    [
        subset.groupby('user_id')['answered_correctly']
              .agg(['count', 'sum', 'mean'])
              .add_prefix(f'first30_part{name}_')
        for name, subset in subsets.items()
    ],
    axis=1,
)

In [None]:
first30_summary

In [None]:
import matplotlib.pyplot as plt

for column in first30_summary.columns:
    fig = plt.figure(figsize=(5, 5))
    fig.suptitle(column)
    first30_summary[column].hist()
    plt.show()