# A small change to "user_correctness"

Here, 'user_correctness' is a feature that estimates each user's ability. In the published notebooks it is probably computed as the user's mean correctness, roughly as follows.

In [None]:
# Per-user mean of answered_correctly, broadcast back to every row of that user
train_df['user_correctness'] = train_df.groupby('user_id')['answered_correctly'].transform('mean')

However, the following small change improves the score noticeably.



In [None]:
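# Share of correct answers per question (higher = easier question)
content_agg = train_df.groupby('content_id')['answered_correctly'].mean()
train_df['content_correctness'] = train_df['content_id'].map(content_agg)
# How much better (or worse) the user did than the average answer to this question
train_df['score_diff'] = train_df['answered_correctly'] - train_df['content_correctness']
# Per-user mean of score_diff, broadcast back to every row of that user
train_df['user_correctness'] = train_df.groupby('user_id')['score_diff'].transform('mean')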

With this feature, a correct answer to a question that most users get right earns little credit, while a correct answer to a question that few users get right earns a lot. Conversely, a wrong answer costs more on an easy question than on a hard one.


For example, when the user answers correctly (answered_correctly = 1):
* content_correctness : 0.8 -> score_diff : 1 - 0.8 = 0.2
* content_correctness : 0.4 -> score_diff : 1 - 0.4 = 0.6


This gives a per-user average that takes question difficulty into account, rather than a simple mean of binary outcomes.
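To make the difference concrete, here is a minimal sketch on a made-up log of five users and two questions. The data, the `rows` list and the `group_mean` helper are hypothetical and only for illustration; it uses plain Python rather than the notebook's DataFrame code so the arithmetic is easy to follow.

```python
from collections import defaultdict

# Hypothetical toy log: (user_id, content_id, answered_correctly).
rows = [
    (1, 10, 1),
    (2, 11, 1),
    (3, 10, 1), (3, 11, 0),
    (4, 10, 1), (4, 11, 0),
    (5, 10, 0), (5, 11, 0),
]


def group_mean(pairs):
    """Return {key: mean of values} for an iterable of (key, value) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for key, value in pairs:
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Share of correct answers per question: question 10 is easy, question 11 is hard.
content_correctness = group_mean((c, a) for _, c, a in rows)

# Traditional feature: plain mean of the 0/1 answers per user.
raw = group_mean((u, a) for u, _, a in rows)

# Proposed feature: mean of (answer - question correctness) per user.
adjusted = group_mean((u, a - content_correctness[c]) for u, c, a in rows)

for user in sorted(raw):
    print(user, raw[user], adjusted[user])
```

Users 1 and 2 both have a traditional value of 1.0, but user 2, who solved the hard question, gets 0.75 under the proposed feature while user 1 gets only 0.25.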

Let's take a quick look at the difference between the traditional "user_correctness" and the proposed "user_correctness".

Note that the models below also include the other features I was using. The two runs use almost the same features and differ in how user_correctness is computed.

# Preparation

In [None]:
# Import the RAPIDS suite here - takes about 1.5 mins

import sys
!cp ../input/rapids/rapids.0.17.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
import riiideducation
import pandas as pd
import numpy as np
import cudf
import cupy
import gc
import pickle
import xgboost
from cuml.metrics import roc_auc_score

In [None]:
features = [
    'lagtime',
    'lagtime2',
    'lagtime3',
    'content_eqet',
    'user_correctness',
    'user_correct_cumsum',
    'part_user_correctness',
    'part_user_correct_cumcount',
    'part_user_correct_cumsum',
    'content_correctness',
    'content_count',
    'content_sum',
    'attempt_no',
    'part',
    'part_correctness_mean',
    'tags1',
    'tags1_correctness_mean',
    'bundle_id',
    'explanation_mean', 
]

target = 'answered_correctly'

params = {
    'max_depth' : 8,
    'max_leaves' : 350,
    'max_bin':800,
    'eta':0.1,
    'min_child_weight':0.03,
    'lambda':0.6,
    'alpha':0.4,
    'eval_metric': 'auc',
    'tree_method' : 'gpu_hist',
    'objective' : 'binary:logistic',
    'grow_policy' : 'lossguide'
}

# Traditional user_correctness

In [None]:
train_df = cudf.read_csv('../input/riiiddata2/data.csv')
valid_df = cudf.read_csv('../input/riiiddata2/valid.csv')

In [None]:
dtrain = xgboost.DMatrix(train_df[features], label=train_df[target])
dvalid = xgboost.DMatrix(valid_df[features], label=valid_df[target])

# Create & Train the model
model = xgboost.train(params,
                      dtrain = dtrain,
                      evals = [(dtrain, 'train'),(dvalid, 'eval')],
                      verbose_eval = 100,
                      num_boost_round = 10000,
                      early_stopping_rounds = 10,
                     )

In [None]:
roc_auc_score(valid_df[target].astype('int32'),model.predict(dvalid))

### eval-auc:0.783...

In [None]:
del train_df
del valid_df
del dtrain
del dvalid
_=gc.collect()

# Proposed user_correctness

In [None]:
train_df = cudf.read_csv('../input/riiidlastdata/data.csv')
valid_df = cudf.read_csv('../input/riiidlastdata/valid.csv')

In [None]:
dtrain = xgboost.DMatrix(train_df[features], label=train_df[target])
dvalid = xgboost.DMatrix(valid_df[features], label=valid_df[target])

# Create & Train the model
model = xgboost.train(params,
                      dtrain = dtrain,
                      evals = [(dtrain, 'train'),(dvalid, 'eval')],
                      verbose_eval = 100,
                      num_boost_round = 10000,
                      early_stopping_rounds = 10,
                     )


### eval-auc:0.789...

In [None]:
roc_auc_score(valid_df[target].astype('int32'),model.predict(dvalid))

## As a result, the score increased by +0.006! (0.783 -> 0.789)

Thank you for reading.

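NOTE: I wrote the proposed feature with an emphasis on readability, but be careful of leaks in practice. A mean taken over a user's whole history includes the current row's own answer as well as later answers.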
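
One possible way to avoid that leak is to average only over each user's earlier rows. The sketch below is an assumption on my part, not the code used for the scores above: the helper name `add_user_correctness_no_leak` is hypothetical, it uses pandas on a toy frame, and it assumes the rows are already in chronological order within each user and that `score_diff` has been computed beforehand (ideally from a `content_correctness` that does not itself see the current row).

```python
import pandas as pd


def add_user_correctness_no_leak(df):
    """Add user_correctness as the mean score_diff over each user's EARLIER rows.

    Assumes df is sorted chronologically within each user_id.
    The first row of every user has no history and gets NaN.
    """
    df = df.copy()
    # Running sum up to and including the current row, minus the current row.
    prior_sum = df.groupby('user_id')['score_diff'].cumsum() - df['score_diff']
    # Number of earlier rows for the same user: 0, 1, 2, ...
    prior_count = df.groupby('user_id').cumcount()
    # Replace a zero count with NaN so the first row yields NaN, not a division by zero.
    df['user_correctness'] = prior_sum / prior_count.where(prior_count > 0)
    return df


# Hypothetical toy data, already ordered by time within each user.
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'score_diff': [0.25, -0.75, 0.5, 0.75, -0.25],
})

out = add_user_correctness_no_leak(toy)
print(out)
```

XGBoost handles the NaN on each user's first row as a missing value, so no extra imputation is strictly required. At inference time the same running sum and count would have to be kept per user and updated only after the true answers are revealed.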