### Introduction
I have noticed that users are given the same questions multiple times during the course of their learning. From my analysis, I think that keeping the history of questions previously given to each user can be a powerful feature for modelling. The following analysis supports this claim.

1. [Repeated Questions Detected](#2)
2. [Statistical Analysis on Repeated Questions](#3)
    1. [Probability Distribution of First Attempt Vs Repeated Attempts](#4)
    2. [Hypothesis Testing](#5)
        1. [Normality Test of Populations: First Attempt answer correctness](#6)
        2. [Normality Test of Populations: Repeated Attempt answer correctness](#7)
        3. [t-Test for comparing population mean](#8)
3. [Conclusions](#9)
    


In [None]:
import numpy as np
import dask.dataframe as dd
import pandas as pd
from time import time
from contextlib import contextmanager

@contextmanager
def timer(name):
    t0 = time()
    yield
    print(f'[{name}] done in {time() - t0:.2f} s')

In [None]:
with timer("Data Loading Time"):
    train = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',
                        usecols = [0, 1, 2, 3, 4, 5, 7],
                   dtype={'row_id': 'int32',
                          'timestamp': 'int64',
                          'user_id': 'int64',
                          'content_id': 'int16',
                          'content_type_id': 'int8',
                          'task_container_id': 'int16',
                          'user_answer': 'int8',
                          'answered_correctly':'int8',
                          'prior_question_elapsed_time': 'float32',
                          'prior_question_had_explanation': 'boolean'}
                   )
    questions = pd.read_csv("/kaggle/input/riiid-test-answer-prediction/questions.csv",
                           dtype = {'question_id': 'int16', 
                                    'bundle_id': 'int16', 
                                    'correct_answer': 'int8',
                                    'part': 'int8',
                                   })
    # lectures = pd.read_csv("/kaggle/input/riiid-test-answer-prediction/lectures.csv",)

In [None]:
train = train[train.content_type_id == 0]
#questions['content_type_id'] = 0
#questions['content_type_id'] = questions['content_type_id'].astype('int8')

<a id='2'></a>
## Repeated Questions Detected
I have noticed that the same question (or bundle) is given to a user many times over the course of learning. The examples below, taken from a single user, illustrate this.

In [None]:
train_user = train.loc[train.user_id == 801103753]

assert train_user.content_type_id.nunique() == 1

train_user = pd.merge(left=train_user, 
                       right=questions[['question_id', 'bundle_id']], 
                       left_on=["content_id",],
                       right_on=["question_id",],
                       how="left",
                      validate="m:1")


train_user = train_user[['user_id', 'question_id', 'bundle_id', 'timestamp', 'task_container_id', 'answered_correctly',]]

In [None]:
samples_1 = train_user[train_user.question_id == 853].reset_index(drop=True)
samples_1.loc[:, 'attempt_number'] = range(1, samples_1.shape[0]+1)
samples_1.loc[:, 'attempt_type'] = 'repeated_attempt'
samples_1.loc[0, 'attempt_type'] = 'first_attempt'
samples_1.style.background_gradient()

In [None]:
samples_2 = train_user[train_user.question_id == 3348].reset_index(drop=True)
samples_2.loc[:, 'attempt_number'] = range(1, samples_2.shape[0]+1)
samples_2.loc[:, 'attempt_type'] = 'repeated_attempt'
samples_2.loc[0, 'attempt_type'] = 'first_attempt'
samples_2.style.background_gradient()

In [None]:
samples_3 = train_user[train_user.question_id == 1754].reset_index(drop=True)
samples_3.loc[:, 'attempt_number'] = range(1, samples_3.shape[0]+1)
samples_3.loc[:, 'attempt_type'] = 'repeated_attempt'
samples_3.loc[0, 'attempt_type'] = 'first_attempt'
samples_3.style.background_gradient()
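
The manual attempt numbering in the three cells above can be generalised to every user and question at once. The following is a minimal sketch on a toy log with made-up values (not rows from the dataset); it assumes rows are ordered by timestamp within each user and uses `groupby(...).cumcount()` to number attempts.

```python
import pandas as pd

# Toy interaction log (hypothetical values, not taken from train.csv).
log = pd.DataFrame({
    "user_id":            [1, 1, 1, 1, 2, 2],
    "question_id":        [853, 3348, 853, 853, 853, 1754],
    "timestamp":          [10, 20, 30, 40, 5, 15],
    "answered_correctly": [0, 1, 1, 1, 1, 0],
})

# Attempts must be in chronological order for the numbering to be meaningful.
log = log.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# cumcount() numbers rows within each (user, question) group starting at 0.
log["attempt_number"] = log.groupby(["user_id", "question_id"]).cumcount() + 1
log["attempt_type"] = log["attempt_number"].map(
    lambda n: "first_attempt" if n == 1 else "repeated_attempt"
)
print(log)
```

Rows with `attempt_number` greater than 1 are the repeated attempts; everything else is a first attempt.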

<a id="3"></a>
### Statistical Analysis on Repeated Questions
Based on the examples above, my hypothesis is that the probability of a user answering a question correctly, given that the user has already seen the question before, is higher than the probability of answering it correctly on the first attempt.<br>
I test this hypothesis by plotting the probability density of the two populations (first attempt vs repeated attempt) and by running a hypothesis test on the two population means.

In [None]:
# Note: this samples 50,000 rows, so the number of distinct users is at most 50,000
random_user_ids = train.user_id.sample(n = 50000).unique()

train = train[train.user_id.isin(random_user_ids)]

print(f"number of users considered = {train.user_id.nunique()}")

In [None]:
master_data = train.merge(questions[['question_id', 'bundle_id',]], left_on=['content_id',],
           right_on = ['question_id', ], how='left', validate="m:1", copy=False)

del train

master_data['question_count'] = master_data.groupby(['user_id', 'question_id']).row_id.transform('count')
master_data['question_count'] = master_data['question_count'].astype('int16')  # int16 avoids overflow if a count exceeds 127
master_data = master_data[master_data.timestamp != 0] # first few rows are part of onboarding process
master_data = master_data[master_data.question_count >=2].reset_index(drop=True)

In [None]:
print(f"number of rows being analyzed = {master_data.shape[0]}")

print(f"maximum number of times a question is repeated in the subset = {master_data.question_count.max()}")

# master_data.sort_values(['user_id', 'question_id', 'timestamp'], ascending=True, inplace=True)#
print("master data head")
master_data.head(10).style.background_gradient()

In [None]:
non_repeated_questions = master_data.drop_duplicates(['user_id', 'question_id'], keep='first')

# non_repeated_questions.head(10).style.background_gradient()

assert non_repeated_questions.groupby(['user_id', 'question_id']).row_id.count().max() == 1

repeated_questions = master_data[~master_data.row_id.isin(non_repeated_questions.row_id)]

assert repeated_questions.shape[0] + non_repeated_questions.shape[0] == master_data.shape[0]

#repeated_questions.head(10).style.background_gradient()

assert repeated_questions.groupby(['user_id', 'question_id']).row_id.count().max() >= 1

avg_correctness_non_repeat = non_repeated_questions.groupby(['user_id',], as_index=False).agg({'row_id': 'count',
                                                                                    'timestamp':[min],
                                                                                   'answered_correctly': 'mean'})
avg_correctness_non_repeat.columns = ['_'.join(col).strip() for col in avg_correctness_non_repeat.columns.values]


avg_correctness_repeat = repeated_questions.groupby(['user_id',], as_index=False).agg({'row_id': 'count', 
                                                                          'timestamp':[min, max],
                                                                         'answered_correctly': 'mean'})
avg_correctness_repeat.columns = ['_'.join(col).strip() for col in avg_correctness_repeat.columns.values]


In [None]:
print("Non-repeating questions data sample")
avg_correctness_non_repeat.head()

In [None]:
print(f"number of samples of first attempted questions in population={avg_correctness_non_repeat.shape[0]}")

In [None]:
print("Repeating questions data sample")
avg_correctness_repeat.head()

In [None]:
print(f"number of samples of repeated attempt questions population={avg_correctness_repeat.shape[0]}")

<a id='4'></a>
### Probability Distribution of First Attempt Vs Repeated Attempts

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [15,10]
plt.rcParams['font.size'] = 14
# Labels must match the data: avg_correctness_repeat holds repeated attempts, avg_correctness_non_repeat holds first attempts
sns.kdeplot(avg_correctness_repeat.answered_correctly_mean, label="avg_correctness_repeated_attempt", clip=[0,1])
plt.axvline(avg_correctness_repeat.answered_correctly_mean.mean(), color='blue')
sns.kdeplot(avg_correctness_non_repeat.answered_correctly_mean, label="avg_correctness_first_attempt", clip=[0,1])
plt.axvline(avg_correctness_non_repeat.answered_correctly_mean.mean(), color='orange')

# add text 
plt.text(avg_correctness_non_repeat.answered_correctly_mean.mean()-.3, 3,
         f"first attempt mean={round(avg_correctness_non_repeat.answered_correctly_mean.mean(), 2)}")

plt.text(avg_correctness_repeat.answered_correctly_mean.mean()-.25, 2.5,
         f"repeated attempt mean={round(avg_correctness_repeat.answered_correctly_mean.mean(), 2)}")

plt.title("Probability Density of First attempt Vs repeated attempt answer correctness")
plt.xlabel("average answer correctness per user")
plt.ylabel("pdf")

plt.legend()
plt.show()

<a id='5'></a>
## Hypothesis Testing

<a id='6'></a>
### Normality Test of Populations: First Attempt answer correctness
Null Hypothesis H0: Sample comes from a Normal Distribution

In [None]:
from scipy import stats
np.random.seed(12345678)
k2, p = stats.normaltest(avg_correctness_non_repeat.answered_correctly_mean)
alpha = 1e-3
print("p = {:g}".format(p))

if p < alpha:  # null hypothesis: x comes from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")

<a id='7'></a>
### Normality Test of Populations: Repeated Attempt answer correctness
Null Hypothesis H0: Sample comes from a Normal Distribution

In [None]:
k2, p = stats.normaltest(avg_correctness_repeat.answered_correctly_mean)
alpha = 1e-3
print("p = {:g}".format(p))

if p < alpha:  # null hypothesis: x comes from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")

### Note
Neither population is drawn from a normal distribution. <br>
However, the t-test concerns the distribution of the sample mean, and by the central limit theorem (CLT) the sample mean is still approximately normally distributed when the sample size is large. This Cross Validated post explains it well [here](https://stats.stackexchange.com/questions/9573/t-test-for-non-normal-when-n50)
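
The CLT argument can be checked with a small simulation. This is only an illustrative sketch: the exponential population, the sample size of 50 and the 2,000 repetitions are assumptions chosen for the demo and have nothing to do with the competition data. It compares the skewness of a clearly non-normal population with the skewness of its sample means.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical skewed population: exponential, clearly non-normal.
population = rng.exponential(scale=1.0, size=100_000)

# Draw 2,000 samples of size 50 (with replacement) and record each sample's mean.
sample_means = rng.choice(population, size=(2_000, 50)).mean(axis=1)

pop_skew = stats.skew(population)
means_skew = stats.skew(sample_means)
print(f"skewness of population   = {pop_skew:.2f}")
print(f"skewness of sample means = {means_skew:.2f}")
```

The sample means should be far less skewed than the population itself, which is the property the t-test relies on.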

<a id='8'></a>
### t-Test for comparing population mean
* H0: There is no significant difference in the mean answer correctness of the two populations
* H1: There is a difference


In [None]:
tstat, p = stats.ttest_ind(avg_correctness_repeat.answered_correctly_mean, 
                avg_correctness_non_repeat.answered_correctly_mean, equal_var = False)
alpha = 1e-2  # 1% significance level (99% confidence)
print("t = {:g}, p = {:g}".format(tstat, p))

if p < alpha:  # null hypothesis: the two population means are equal
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")
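
The test above is two-sided, while the hypothesis stated earlier is directional (repeated attempts are answered correctly more often). A one-sided variant is sketched below on synthetic numbers; the means, spread and sample sizes are assumptions made up for the demo, and the `alternative` argument of `stats.ttest_ind` requires SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-user mean correctness (hypothetical numbers, not from the dataset).
first_attempt = rng.normal(loc=0.60, scale=0.10, size=500)
repeated_attempt = rng.normal(loc=0.70, scale=0.10, size=500)

# H1: mean(repeated_attempt) > mean(first_attempt).
# equal_var=False selects Welch's t-test, as in the cell above.
tstat, p_one_sided = stats.ttest_ind(
    repeated_attempt, first_attempt, equal_var=False, alternative="greater"
)
print(f"t = {tstat:.2f}, one-sided p = {p_one_sided:.3g}")
```

On the notebook's data the two arguments would be `avg_correctness_repeat.answered_correctly_mean` and `avg_correctness_non_repeat.answered_correctly_mean`, in that order.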

<a id='9'></a>
### Conclusions
* The mean correctness of users attempting a question for the first time is lower than the mean correctness when the user sees the question again
* This finding can be used as a post-processing technique during modelling.
* Keeping a log of already seen questions, or features engineered from it, may improve the quality of the model.
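
One way to keep such a log is a small dictionary keyed by user and question. The sketch below is hypothetical: the class name `AttemptLog`, its methods and the two features it returns are illustrative choices of mine, not something used elsewhere in this notebook.

```python
class AttemptLog:
    """Hypothetical helper: tracks prior attempts per (user_id, question_id)."""

    def __init__(self):
        self._attempts = {}  # (user_id, question_id) -> number of past attempts
        self._correct = {}   # (user_id, question_id) -> number of past correct answers

    def features(self, user_id, question_id):
        """Return (prior_attempts, prior_correct_rate) before the current attempt."""
        key = (user_id, question_id)
        n = self._attempts.get(key, 0)
        # No history yet: the rate is undefined, so return NaN.
        rate = self._correct.get(key, 0) / n if n else float("nan")
        return n, rate

    def update(self, user_id, question_id, answered_correctly):
        """Record the outcome once the true answer is known."""
        key = (user_id, question_id)
        self._attempts[key] = self._attempts.get(key, 0) + 1
        self._correct[key] = self._correct.get(key, 0) + int(answered_correctly)


seen = AttemptLog()
f1 = seen.features(1, 853)   # first time user 1 sees question 853
seen.update(1, 853, 0)
f2 = seen.features(1, 853)   # one prior attempt, answered incorrectly
seen.update(1, 853, 1)
f3 = seen.features(1, 853)   # two prior attempts, one correct
print(f1, f2, f3)
```

Calling `features` before `update` for each row keeps the features free of leakage from the current answer.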


### Please consider upvoting this kernel if you find this informative