# Two-Feature "Model"

This notebook tests a very simple method: start with the average score for the given question (`content_id`), then add or subtract the amount by which the user has, on average, scored above or below the question averages.

It's not even really a model, but it should establish a solid baseline.
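
Written out, the idea is roughly the following. The symbols here are my own shorthand for illustration, not names used in the code:

$$
\hat{y}_{u,q} = \bar{y}_q + \bar{r}_u, \qquad \bar{r}_u = \frac{1}{n_u} \sum_{i=1}^{n_u} \left( y_{u,i} - \bar{y}_{q_i} \right)
$$

Here $\bar{y}_q$ is the mean of `answered_correctly` for question $q$, $y_{u,i}$ is whether user $u$ got their $i$-th question right, $q_i$ is which question that was, and $n_u$ is how many questions the user has answered so far. The smoothing step later in the notebook divides by $n_u + m$ instead of $n_u$.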

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import gc
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LinearRegression

## Data Processing

Because it's such a simple model, we only need three columns from one dataset. We can also go ahead and drop all the lecture rows, which are the ones where `answered_correctly` is -1.

In [None]:
dtypes = {
    'user_id': 'int32', 
    'content_id': 'int16', 
    'answered_correctly': 'int8', 
}
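# Load only the three columns named in `dtypes` (columns 2, 3 and 7 of train.csv)
train = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', usecols=list(dtypes), dtype=dtypes)
# Lecture rows have answered_correctly == -1, so this keeps question rows only
train = train.loc[train['answered_correctly'] >= 0].copy()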

`user_resid_rolling` is how many more questions a user has gotten right so far than we would expect from an average user answering the same questions.
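
As a small illustration of what that running total looks like, here is a sketch on made-up data. The `toy` frame and its values are hypothetical; only the `cumsum().shift(1)` pattern comes from this notebook:

```python
import pandas as pd

# Toy data (hypothetical values): two users, with per-question residuals already computed
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'resid':   [0.5, -0.25, 0.25, -0.5, 0.5],
})

# Running total of residuals from earlier rows only:
# cumsum includes the current row, so shift(1) pushes it down by one
toy['user_resid_rolling'] = (
    toy.groupby('user_id')['resid']
       .transform(lambda x: x.cumsum().shift(1))
       .fillna(0)  # a user's first row has no history
)
print(toy)
```

Each user's first row gets 0, and every later row sees only the answers that came before it, so the feature never leaks the answer it is being used to predict.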

In [None]:
# Mean correctness per question, and each answer's residual from that mean
train['content_mean'] = train.groupby('content_id')['answered_correctly'].transform('mean')
train['resid'] = train['answered_correctly'] - train['content_mean']

# Number of questions the user answered before this row (0 for their first)
train['user_q_count'] = train.groupby('user_id').cumcount()
# Sum of the user's residuals over earlier rows only (shift(1) excludes the current answer)
train['user_resid_rolling'] = train.groupby('user_id')['resid']\
    .transform(lambda x: x.cumsum().shift(1)).fillna(0)

# Zero-based index of each user's last question (question count minus one),
# used for creating the validation set later
users = train.groupby('user_id')['user_q_count'].last()

train.head(20)

## Testing Model

To create a validation set that somewhat represents the test set, I randomly pick users with short histories (see `thresh` in the code) and put them completely in the validation set, so we know nothing about them ahead of time. The number of users drawn is 5% of the total user count. I then add the last five questions from every other user to the validation set as well.

I used the `shift` method to avoid an expensive `groupby` in selecting the last five rows for each user. This only works if each user's rows sit together in `train`, in time order.
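
Here is a minimal sketch of the `shift` trick on made-up data. The `toy` frame and the value of `k` are hypothetical, and the sketch assumes each user's rows are contiguous:

```python
import pandas as pd

# Toy data (hypothetical): rows are already grouped by user, in time order
toy = pd.DataFrame({'user_id': [1, 1, 1, 1, 2, 2, 2, 3]})

k = 2  # hypothetical: hold out the last 2 rows per user

# A row is among its user's last k rows when the row k positions later
# belongs to a different user, or does not exist (shift then gives NaN)
last_k_mask = toy['user_id'].shift(-k) != toy['user_id']
print(last_k_mask.tolist())

# Cross-check against the slower groupby version
check = toy.groupby('user_id').cumcount(ascending=False) < k
print((last_k_mask == check).all())
```

Users with fewer than `k` rows end up entirely inside the mask, which is the same behavior as the groupby version.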

In [None]:
# Save content means
content_df = train.groupby('content_id')['answered_correctly'].mean()

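# Select a few new users, drawn only from users with short histories
thresh = 30
new_user_num = int(0.05 * len(users))
# `users` holds each user's question count minus one, hence thresh + 1 here
print(f"{100 * (users <= thresh).mean():.2f}% of users have {thresh + 1} or fewer questions")
candidates = users[users <= thresh].index
# replace=False so that no user is drawn twice
new_users = np.random.choice(candidates, min(new_user_num, len(candidates)), replace=False)
new_user_mask = train['user_id'].isin(new_users)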

# Create mask for last few questions of everyone else
num_q_val = 5
late_q_mask = train['user_id'].shift(-1 * num_q_val) != train['user_id']

# Create validation and training
use_cols = ['user_resid_rolling', 'user_q_count', 'content_mean', 'answered_correctly', 'user_id']
val = train.loc[new_user_mask | late_q_mask, use_cols].copy()
val['user_resid_rolling'] = val.groupby('user_id')['user_resid_rolling'].transform('first')
X = train.loc[~(new_user_mask | late_q_mask), use_cols].sample(1000000)
print(val.shape)
print(X.shape)
gc.collect()

How much better or worse a student is than average probably isn't a reliable estimate until we have several observations of them. Here I test the model using [Additive Smoothing](https://en.wikipedia.org/wiki/Additive_smoothing), trying several values of the smoothing parameter `m` to find which is best. 5 and 10 do equally well on the validation set.
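
To see what the smoothing does, here is a small sketch on made-up numbers. The `toy` frame is hypothetical; the formula is the same `user_resid_rolling / (user_q_count + m)` used in this notebook:

```python
import pandas as pd

# Toy data (hypothetical): the same average residual (+0.5 per question),
# observed over 1, 10 and 100 questions
toy = pd.DataFrame({
    'user_resid_rolling': [0.5, 5.0, 50.0],
    'user_q_count':       [1, 10, 100],
})

m = 10  # smoothing strength; larger values pull estimates harder toward 0

toy['raw_mean'] = toy['user_resid_rolling'] / toy['user_q_count']
# Equivalent to adding m pretend observations with a residual of 0
toy['smoothed_mean'] = toy['user_resid_rolling'] / (toy['user_q_count'] + m)
print(toy)
```

The raw mean is identical for all three rows, while the smoothed mean stays close to 0 for the user with one question and approaches the raw mean as the history grows.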

In [None]:
# Try several values of the smoothing parameter m
for m in [1, 5, 10, 20]:
    print(f"m = {m}")
    X['user_mean_resid'] = X['user_resid_rolling'] / (X['user_q_count'] + m)
    val['user_mean_resid'] = val['user_resid_rolling'] / (val['user_q_count'] + m)
    pred_cols = ['user_mean_resid', 'content_mean']
    
    # Fit model
    print("Fitting")
    lr = LinearRegression(fit_intercept=False)
    lr.fit(X[pred_cols], X['answered_correctly'])
    
    # Make predictions
    print("Predicting")
    y_pred_train = lr.predict(X[pred_cols])
    y_pred_val = lr.predict(val[pred_cols])
    print("Scoring")
    auc_train = roc_auc_score(X['answered_correctly'], y_pred_train)
    auc_val = roc_auc_score(val['answered_correctly'], y_pred_val)
    print(f"Train AUC: {auc_train:.3f}")
    print(f"Validation AUC: {auc_val:.3f}")
    print(f"-- Coeff --\nuser_mean_resid: {lr.coef_[0]:.3f}\ncontent_mean: {lr.coef_[1]:.3f}\n\n")
    X['model_resids'] = X['answered_correctly'] - y_pred_train
    del y_pred_train, y_pred_val, lr
    gc.collect()


In [None]:
del X, val
gc.collect()

## Generate Predictions

When I train the model on a 10-million-row sample of the data, the coefficients both come out to almost 1. This is about what we would expect.

Instead of multiplying the features by coefficients that are almost 1, I just use them as they are for prediction. The result is that we start by guessing the average score for a question, then add or subtract how much better or worse than average the student has been in the past.
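
The lookup-and-add step can be sketched on made-up data like this. The two small Series stand in for `content_df` and `user_df`, and all ids and values are hypothetical:

```python
import pandas as pd

# Hypothetical lookup tables standing in for `content_df` and `user_df`
content_mean = pd.Series({10: 0.75, 11: 0.5}, name='content_mean')
user_resid_mean = pd.Series({1: 0.125, 2: -0.25}, name='user_resid_mean')

# Hypothetical test batch: user 3 was never seen in training
batch = pd.DataFrame({'user_id': [1, 2, 3], 'content_id': [10, 11, 10]})

# Start from the question's average, then add the user's offset
# (0 for users we know nothing about)
base = batch['content_id'].map(content_mean)
offset = batch['user_id'].map(user_resid_mean).fillna(0)
batch['answered_correctly'] = base + offset
print(batch)
```

Note that the sum is not guaranteed to stay between 0 and 1. That doesn't affect AUC, which depends only on how the predictions are ranked.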

In [None]:
import riiideducation
env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
m = 10
train['user_resid_mean'] = train['user_resid_rolling'] / (train['user_q_count'] + m)
user_df = train.groupby('user_id')[['user_resid_mean']].last()

In [None]:
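# Fit on a 10M-row sample to inspect the coefficients
# (they are only checked here; the prediction loop below does not use `lr`)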
lr = LinearRegression()
train = train.sample(10000000)
lr.fit(train[['user_resid_mean', 'content_mean']], train['answered_correctly'])
lr.coef_

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    # Keep question rows only; copy so the assignments below don't act on a slice
    test_df = test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'user_id', 'content_id']].copy()
    user_ids = test_df['user_id'].values
    content_ids = test_df['content_id'].values
    # Users seen in training get their mean residual added; new users get only the content mean
    user_id_mask = np.isin(user_ids, user_df.index)
    test_df['answered_correctly'] = content_df.loc[content_ids].values
    if user_id_mask.any():
        test_df.loc[user_id_mask, 'answered_correctly'] += user_df.loc[user_ids[user_id_mask], 'user_resid_mean'].values
    env.predict(test_df[['row_id', 'answered_correctly']])