# Riiid! Answer Correctness Prediction

Tailoring education to a student's ability level is one of the many valuable things an AI tutor can do. Your challenge in this competition is a version of that overall task; you will predict whether students are able to answer their next questions correctly. You'll be provided with the same sorts of information a complete education app would have: that student's historic performance, the performance of other students on the same question, metadata about the question itself, and more.


## Files
### train.csv

row_id: (int64) ID code for the row.

timestamp: (int64) the time in milliseconds between this user interaction and the first event completion from that user.

user_id: (int32) ID code for the user.

content_id: (int16) ID code for the user interaction.

content_type_id: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

task_container_id: (int16) ID code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

user_answer: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

answered_correctly: (int8) if the user responded correctly. Read -1 as null, for lectures.

prior_question_elapsed_time: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture.

prior_question_had_explanation: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
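
The dtypes listed above matter when the file is loaded with narrower types to save memory. As a small sketch (the sample values are invented purely for illustration), the snippet below shows why `prior_question_elapsed_time` needs at least float32: float16 cannot represent anything above 65504, so larger elapsed times turn into `inf`.

```python
import numpy as np

# Hypothetical elapsed-time values, chosen only to illustrate the range issue.
elapsed = np.array([17000.0, 65504.0, 300000.0])

# Largest finite value that float16 can hold.
f16_max = float(np.finfo(np.float16).max)

# Silence the overflow warning; the cast itself still produces inf.
with np.errstate(over='ignore'):
    as_f16 = elapsed.astype(np.float16)
as_f32 = elapsed.astype(np.float32)

# True where float16 could not hold the value.
overflowed = np.isinf(as_f16)
print(f16_max, overflowed.tolist(), as_f32.tolist())
```

Keeping the column as float32, as the description specifies, avoids having to clean up infinities later.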

### questions.csv: metadata for the questions posed to users.

question_id: foreign key for the train/test content_id column, when the content type is question (0).

bundle_id: code for which questions are served together.

correct_answer: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

part: the relevant section of the TOEIC test.

tags: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
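
As a rough sketch of the comparison described above, the snippet below joins a few made-up interaction rows to made-up question rows (`content_id` against `question_id`) and checks `user_answer` against `correct_answer`. It also splits `tags` into a list, assuming the codes are stored as one space-separated string per question; that layout is an assumption, not something stated here. The names `interactions`, `questions_demo`, `is_correct` and `tag_list` are illustrative.

```python
import pandas as pd

# Toy stand-ins for train.csv and questions.csv; every value is invented.
interactions = pd.DataFrame({
    'content_id': [0, 1, 1],
    'content_type_id': [0, 0, 0],   # 0 means the row is a question
    'user_answer': [1, 3, 0],
})
questions_demo = pd.DataFrame({
    'question_id': [0, 1],
    'correct_answer': [1, 0],
    'tags': ['51 131', '38'],       # assumed space-separated tag codes
})

# question_id is the foreign key for content_id on question rows.
merged = interactions.merge(
    questions_demo, how='left', left_on='content_id', right_on='question_id'
)

# 1 where the user's answer matches the stored correct answer, else 0.
merged['is_correct'] = (merged['user_answer'] == merged['correct_answer']).astype('int8')

# Turn each tag string into a list of integer codes.
merged['tag_list'] = merged['tags'].str.split().apply(lambda codes: [int(c) for c in codes])
print(merged[['content_id', 'is_correct', 'tag_list']])
```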

### lectures.csv: metadata for the lectures watched by users as they progress in their education.

lecture_id: foreign key for the train/test content_id column, when the content type is lecture (1).

part: top level category code for the lecture.

tag: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

type_of: brief description of the core purpose of the lecture.

### example_test_rows.csv: three sample groups of the test set data as it will be delivered by the time-series API.

The format is largely the same as train.csv. Two additional columns mirror the information the AI tutor actually has available at any given time, with the user interactions grouped together for the sake of API performance rather than strictly showing information for a single user at a time. Some questions in the hidden test set have NOT been presented in the train set, emulating the challenge of quickly adapting to newly introduced questions. Their metadata is still in questions.csv as usual.

prior_group_responses: (string) provides all of the user_answer entries for the previous group, as a string representation of a list, in the first row of the group. All other rows in each group are null. If you are using Python, you will likely want to call eval (or the safer ast.literal_eval) on the non-null rows. Some rows may be null or empty lists.

prior_group_answers_correct: (string) provides all of the answered_correctly values for the previous group, with the same format and caveats as prior_group_responses. Some rows may be null or empty lists.
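
The parsing step mentioned above can be sketched with `ast.literal_eval`, which reads a list literal without executing arbitrary code the way `eval` would. The helper name `parse_prior_group` and the sample strings are hypothetical, and a null cell is represented here as `None` or a float NaN.

```python
import ast

def parse_prior_group(cell):
    """Turn a string such as '[1, 0, 3]' into a list; null or blank cells give []."""
    if cell is None:
        return []
    text = str(cell).strip()
    # A float NaN (how pandas shows a null cell) stringifies to 'nan'.
    if not text or text.lower() == 'nan':
        return []
    return list(ast.literal_eval(text))

first_row = parse_prior_group('[1, 0, 3]')   # first row of a group
empty_row = parse_prior_group('[]')          # an empty list in the data
null_row = parse_prior_group(None)           # any other row of the group
nan_row = parse_prior_group(float('nan'))    # null as read by pandas
print(first_row, empty_row, null_row, nan_row)
```

Returning an empty list for nulls keeps downstream code free of special cases, though a caller that needs to tell "null" from "empty" would want to return `None` instead.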

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt

In [None]:
data_types_dict = {
    'row_id': 'int64',
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
    'content_type_id': 'int8',
#     'task_container_id': 'int16',
#     'user_answer': 'int8',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32',  # float16 overflows to inf above 65504
    'prior_question_had_explanation': 'boolean'
}

sample_size = (10**6)

In [None]:
train = pd.read_csv('../input/riiid-test-answer-prediction/train.csv', nrows=sample_size, 
                    index_col='row_id', usecols = data_types_dict.keys(),
                    dtype=data_types_dict)
train.head()

In [None]:
lecture = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')
lecture.head()

In [None]:
question = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
question.head()

In [None]:
test = pd.read_csv('../input/riiid-test-answer-prediction/example_test.csv', index_col='row_id')
test.head()

In [None]:
submission = pd.read_csv('../input/riiid-test-answer-prediction/example_sample_submission.csv', index_col='row_id')
submission.head()

In [None]:
print(f'train shape: {train.shape}')
print(f'lecture shape: {lecture.shape}')
print(f'question shape: {question.shape}')
print(f'test shape: {test.shape}')
print(f'submission shape: {submission.shape}')

## Feature Engineering

### **Expanding on Simple LGBM** 

ref: https://www.kaggle.com/thebigd8ta/lgbm-ii

In [None]:
train = train[train.answered_correctly != -1]

In [None]:
user_group = train.groupby('user_id')
user_answers = user_group.agg({'answered_correctly': ['mean', 'count'],'timestamp': ['mean']})
user_answers.columns = ['mean_user_accuracy', 'questions_answered', 'mean_user_time']
user_answers.head()

In [None]:
content_group = train.groupby('content_id')
content_answer = content_group.agg({'answered_correctly': ['mean', 'count'],'timestamp': ['mean']})
content_answer.columns = ['mean_content_accuracy', 'question_asked', 'mean_content_time']
content_answer.head()

In [None]:
questions = question.merge(content_answer, left_on = 'question_id', right_on = 'content_id', how = 'left')
bundle_dict = questions['bundle_id'].value_counts().to_dict()
questions['right_answers'] = questions['mean_content_accuracy'] * questions['question_asked']
questions['bundle_size'] = questions['bundle_id'].map(bundle_dict)
questions.tail()

In [None]:
grouped_by_bundle = questions.groupby('bundle_id')
bundle_answers = grouped_by_bundle.agg({'right_answers': 'sum', 'question_asked': 'sum'})
bundle_answers.columns = ['bundle_rignt_answers', 'bundle_questions_asked']
bundle_answers['bundle_accuracy'] = bundle_answers['bundle_rignt_answers'] / bundle_answers['bundle_questions_asked']
bundle_answers.head()

In [None]:
grouped_by_part = questions.groupby('part')
part_answers = grouped_by_part.agg({'right_answers': 'sum', 'question_asked': 'sum'})
part_answers.columns = ['part_rignt_answers', 'part_questions_asked']
part_answers['part_accuracy'] = part_answers['part_rignt_answers'] / part_answers['part_questions_asked']
part_answers.head()

In [None]:
new_train_df = train.merge(user_answers, how = 'left', on = 'user_id')\
                        .merge(questions, how = 'left', left_on = 'content_id', right_on = 'question_id')\
                        .merge(bundle_answers, how = 'left', on = 'bundle_id')\
                        .merge(part_answers, how = 'left', on = 'part')

In [None]:
# The column is a nullable boolean: treat missing as False, then encode as 0/1.
# (Comparing against the string 'True' would map every row to 0.)
new_train_df['prior_question_had_explanation'] = (
    new_train_df['prior_question_had_explanation'].fillna(False).astype('int8')
)
# Replace infinities with NaN, then fill every remaining gap with -1.
new_train_df = new_train_df.replace([np.inf, -np.inf], np.nan)
new_train_df = new_train_df.fillna(-1)

In [None]:
new_train_df.head()

## Split Data into Train and Validation Sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
features = ['timestamp', 'content_type_id', 'prior_question_had_explanation', 
       'prior_question_elapsed_time', 
       'correct_answer', 'part',
       'mean_content_accuracy', 'question_asked', 'mean_content_time',
       'right_answers', 'bundle_size', 'bundle_rignt_answers',
       'bundle_questions_asked', 'bundle_accuracy', 'part_rignt_answers',
       'part_questions_asked', 'part_accuracy']
# 'user_answer', 'prior_question_had_explanation_True', 'mean_user_accuracy', 'questions_answered', 'mean_user_time'
target = 'answered_correctly'

X = new_train_df[features]
y = new_train_df[target]

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=111)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

## Modeling

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

In [None]:
class model_selection(): 
    def __init__(self): 
        self.y_pred_FVC = pd.DataFrame()
        self.best_param = None
        self.scoring_train = 0
        self.scoring_val = 0
        
        
    def LogisticRegression(self, X, y, X_val, y_val): 
        params_logr = {'C': [1], 'solver': ['lbfgs'], 'max_iter': [100]}
        clf = GridSearchCV(LogisticRegression(), 
                           param_grid=params_logr, scoring='roc_auc')
        clf.fit(X, y)
        self.best_param = clf.best_params_
        self.scoring_train = clf.score(X, y)
        self.scoring_val = clf.score(X_val, y_val)
        y_pred_logr_FVC = clf.predict_proba(X_val)

        self.y_pred_FVC = pd.concat([pd.Series(y_pred_logr_FVC[:, 1])], axis=1)
        return self.best_param, self.y_pred_FVC, self.scoring_train, self.scoring_val
    
    
    def SVM(self, X, y, X_val, y_val): 
        params_svm = {'C': [1], 'kernel': ['rbf']}
        clf = GridSearchCV(SVC(probability=True), 
                           param_grid=params_svm, scoring='roc_auc')
        clf.fit(X, y)
        self.best_param = clf.best_params_
        self.scoring_train = clf.score(X, y)
        self.scoring_val = clf.score(X_val, y_val)
        y_pred_svm_FVC = clf.predict_proba(X_val)

        self.y_pred_FVC = pd.concat([pd.Series(y_pred_svm_FVC[:, 1])], axis=1)
        return self.best_param, self.y_pred_FVC, self.scoring_train, self.scoring_val
    
    
    def RandomForest(self, X, y, X_val, y_val): 
        params_rf = {'n_estimators': [100], 'criterion': ['gini']}
        clf = GridSearchCV(RandomForestClassifier(), 
                           param_grid=params_rf, scoring='roc_auc')
        clf.fit(X, y)
        self.best_param = clf.best_params_
        self.scoring_train = clf.score(X, y)
        self.scoring_val = clf.score(X_val, y_val)
        y_pred_rf_FVC = clf.predict_proba(X_val)

        self.y_pred_FVC = pd.concat([pd.Series(y_pred_rf_FVC[:, 1])], axis=1)
        return self.best_param, self.y_pred_FVC, self.scoring_train, self.scoring_val
    
    
    def XGBoost(self, X, y, X_val, y_val): 
        params_xgb = {'max_depth':[6]}
        clf = GridSearchCV(XGBClassifier(), 
                           param_grid=params_xgb, scoring='roc_auc')
        clf.fit(X, y)
        self.best_param = clf.best_params_
        self.scoring_train = clf.score(X, y)
        self.scoring_val = clf.score(X_val, y_val)
        y_pred_xgb_FVC = clf.predict_proba(X_val)

        self.y_pred_FVC = pd.concat([pd.Series(y_pred_xgb_FVC[:, 1])], axis=1)
        return self.best_param, self.y_pred_FVC, self.scoring_train, self.scoring_val
    

    def LightGBM(self, X, y, X_val, y_val): 
        # num_iterations is an alias of n_estimators in LightGBM, so search over one name only.
        params_lgb = {'n_estimators': [100, 300], 'learning_rate': [0.05, 0.1]}
        clf = GridSearchCV(LGBMClassifier(), 
                           param_grid=params_lgb, scoring='roc_auc')
        clf.fit(X, y)
        self.best_param = clf.best_params_
        self.scoring_train = clf.score(X, y)
        self.scoring_val = clf.score(X_val, y_val)
        y_pred_lgb_FVC = clf.predict_proba(X_val)
        
        self.y_pred_FVC = pd.concat([pd.Series(y_pred_lgb_FVC[:, 1])], axis=1)
        return self.best_param, self.y_pred_FVC, self.scoring_train, self.scoring_val

In [None]:
model_sel = model_selection()

### Random Forest

In [None]:
# rf = model_sel.RandomForest(X_train, y_train, X_val, y_val)
# rf

### Logistic Regression

In [None]:
# logr = model_sel.LogisticRegression(X_train, y_train, X_val, y_val)
# logr

### Support Vector Machine

In [None]:
# svm = model_sel.SVM(X_train, y_train, X_val, y_val)
# svm

### XGBoost

In [None]:
# xgb = model_sel.XGBoost(X_train, y_train, X_val, y_val)
# xgb

### LightGBM

In [None]:
lgb = model_sel.LightGBM(X_train, y_train, X_val, y_val)
lgb

In [None]:
model = LGBMClassifier(**lgb[0])
model.fit(X_train, y_train)

## Evaluation
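
The grid search above scores models with `roc_auc`. To make that metric concrete, here is a minimal pure-Python sketch of ROC AUC as the fraction of (positive, negative) pairs that the scores rank in the right order, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are illustrative only. On real data a library routine such as scikit-learn's `roc_auc_score` is the practical choice, since this version is quadratic in the number of rows.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative label')

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities, invented for the example.
auc_demo = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)
```

Because AUC depends only on how rows are ordered, it is unaffected by any monotonic rescaling of the predicted probabilities.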

In [None]:
# # plot AUC
# results = lgb.evals_result_ # evals_result()
# epochs = len(results['auc'])
# x_axis = range(0, epochs)
# fig, ax = plt.subplots(figsize=(8,5))
# ax.plot(x_axis, results['auc'], label='Train')
# ax.plot(x_axis, results['auc'], label='Test')
# ax.legend()
# plt.ylabel('AUC')
# plt.title('XGBoost AUC')
# plt.show()

In [None]:
import riiideducation

In [None]:
env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    test_df = test_df.merge(user_answers, how = 'left', on = 'user_id')
    test_df = test_df.merge(questions, how = 'left', left_on = 'content_id', right_on = 'question_id')
    test_df = test_df.merge(bundle_answers, how = 'left', on = 'bundle_id')
    test_df = test_df.merge(part_answers, how = 'left', on = 'part')
    
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(value = False).astype(bool)
    test_df.fillna(value = -1, inplace = True)

    test_df['answered_correctly'] = model.predict_proba(test_df[features])[:, 1]
#     test_df['answered_correctly'] = model_sel.LightGBM(X_train, y_train, test_df[features], y_val)[1]
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])

In [None]:
new_train_df.head()

In [None]:
test_df.head()