I want to convey two things in this notebook.
## 1. Don't be hesitant about using loops.
People say "avoid loops!".
But I think using loops is not a bad idea for this competition,
because:
* We have to run inference in small batches through the Time-series API.
* Loops have very little overhead for each batch.
* Loops are more flexible.
* Loops are not even that slow. 3 features are extracted within 10 minutes for the 100M-row train data, as you can see below.
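
As a rough illustration of the loop approach, here is a minimal sketch on a made-up five-row frame. The toy data and the helper name `add_user_feats` are assumptions for illustration, not the competition data or a fixed API. Each row reads the running per-user totals before they are updated with that row's own answer.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def add_user_feats(df, sum_dict, count_dict):
    """Attach per-user running stats, reading each value before updating it."""
    acsu = np.zeros(len(df), dtype=np.int32)  # correct answers seen so far
    cu = np.zeros(len(df), dtype=np.int32)    # questions seen so far
    rows = df[['user_id', 'answered_correctly']].values
    for i, (user_id, answered) in enumerate(rows):
        acsu[i] = sum_dict[user_id]
        cu[i] = count_dict[user_id]
        # Update only after the features for this row have been recorded.
        sum_dict[user_id] += answered
        count_dict[user_id] += 1
    out = df.copy()
    out['answered_correctly_sum_u'] = acsu
    out['count_u'] = cu
    return out


# Toy data, assumed to be sorted by time.
toy = pd.DataFrame({'user_id': [1, 1, 2, 1, 2],
                    'answered_correctly': [1, 0, 1, 1, 0]})
toy_feats = add_user_feats(toy, defaultdict(int), defaultdict(int))
print(toy_feats)
```

The same dictionaries can be carried into inference, where each small batch only needs a handful of lookups.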

## 2. Future information should not be used.
The Time-series API doesn't allow us to use information from the future.
So we should not use it during training either. User statistics computed from the future are especially harmful.
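
To make the leakage concrete, the following toy sketch (made-up data, not from the competition) contrasts a per-user mean computed over the whole history, which includes answers that come later, with a past-only mean built from a cumulative sum that excludes the current row.

```python
import pandas as pd

# Toy interaction log, assumed to be sorted by time within each user.
toy = pd.DataFrame({'user_id': [1, 1, 1, 2, 2],
                    'answered_correctly': [1, 0, 1, 0, 1]})

grouped = toy.groupby('user_id')['answered_correctly']

# Leaky: every row sees the user's full history, including future answers.
avg_leaky = grouped.transform('mean')

# Past-only: answers strictly before the current row.
past_sum = grouped.cumsum() - toy['answered_correctly']
past_count = grouped.cumcount()

toy['avg_leaky'] = avg_leaky
toy['avg_past_only'] = past_sum / past_count  # NaN on a user's first row

print(toy)
```

The past-only column is what a model can actually have at inference time, so it is the one to train on.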

In [None]:
import pandas as pd
import numpy as np
import gc
from sklearn.metrics import roc_auc_score
from collections import defaultdict
from tqdm.notebook import tqdm
import lightgbm as lgb

## setting
CV files are generated by [this notebook](https://www.kaggle.com/its7171/cv-strategy)

In [None]:
train_pickle = '../input/pickle1/cv1_train.pickle'
valid_pickle = '../input/pickle1/cv1_valid.pickle'
question_file = '../input/riiid-test-answer-prediction/questions.csv'
debug = False
validaten_flg = False

In [None]:
train = pd.read_pickle(train_pickle)
valid = pd.read_pickle(valid_pickle)

In [None]:
question_df = pd.read_pickle('../input/questionspickle/question.pickle')

In [None]:
# def cal_method(type_of):
#     return len(str(type_of).split(' '))
# questions=pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
# questions['enhence'] = questions['tags'].apply(cal_method)
# questions['enhence'] = questions['enhence'].astype(np.int8)

In [None]:
question_df_avg = question_df.question_average.mean()

In [None]:
# train = train.join(questions['enhence'],on=['content_id'],rsuffix='_question_average')
# valid = valid.join(questions['enhence'],on=['content_id'],rsuffix='_question_average')


## modeling

In [None]:
TARGET = 'answered_correctly'
FEATS = [
    'answered_correctly_avg_u', 'content_id', 'answered_correctly_sum_u', 'count_u',
    'answered_correctly_avg_c', 'prior_question_had_explanation', 'prior_question_elapsed_time',
]
# Keep only the feature columns; the target is saved separately before dropping.
drop_cols = list(set(train.columns) - set(FEATS))
y_tr = train[TARGET]
y_va = valid[TARGET]
train.drop(drop_cols, axis=1, inplace=True)
valid.drop(drop_cols, axis=1, inplace=True)
_ = gc.collect()

In [None]:
train

In [None]:
prior_question_elapsed_time_mean = train.prior_question_elapsed_time.dropna().values.mean()


In [None]:
lgb_train = lgb.Dataset(train[FEATS], y_tr)
lgb_valid = lgb.Dataset(valid[FEATS], y_va)
del train, y_tr, valid, y_va
_ = gc.collect()

In [None]:
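model = lgb.train(
    {
        'objective': 'binary',
        # 'num_iterations' in params takes precedence over num_boost_round below,
        # so training stops after 10 rounds. Raise it (e.g. to 50) for a longer run.
        'num_iterations': 10,
    },
    lgb_train,
    valid_sets=[lgb_train, lgb_valid],
    verbose_eval=10,
    num_boost_round=10000,
    early_stopping_rounds=10,
)
_ = lgb.plot_importance(model)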

In [None]:
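# Per-user running totals used at inference time; they start out empty here.
answered_correctly_sum_u_dict = defaultdict(int)
count_u_dict = defaultdict(int)
# Reload the validation set (it was deleted above) to score the model.
valid = pd.read_pickle(valid_pickle)
y_va = valid[TARGET]
print('auc:', roc_auc_score(y_va, model.predict(valid[FEATS])))
_ = lgb.plot_importance(model)
del valid, y_va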

## inference

In [None]:
def add_user_feats_without_update(df, answered_correctly_sum_u_dict, count_u_dict):
    """Look up per-user running totals for each row without updating the totals.

    df must have a default 0..n-1 index so that the concat below aligns row by row.
    """
    acsu = np.zeros(len(df), dtype=np.int32)
    cu = np.zeros(len(df), dtype=np.int32)
    for cnt, user_id in enumerate(df['user_id'].values):
        acsu[cnt] = answered_correctly_sum_u_dict[user_id]
        cu[cnt] = count_u_dict[user_id]
    user_feats_df = pd.DataFrame({'answered_correctly_sum_u': acsu, 'count_u': cu})
    # 0 / 0 yields NaN for users with no history.
    user_feats_df['answered_correctly_avg_u'] = user_feats_df['answered_correctly_sum_u'] / user_feats_df['count_u']
    df = pd.concat([df, user_feats_df], axis=1)
    return df
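
Note that this function only reads from the dictionaries. For the user features to carry information at inference time, the dictionaries have to be populated somewhere, for example from the training data and from each previous batch once its answers become available. A minimal sketch of such an update step follows; `update_user_feats` is a hypothetical helper, and the frame is assumed to already carry the true `answered_correctly` values.

```python
from collections import defaultdict

import pandas as pd


def update_user_feats(df, answered_correctly_sum_u_dict, count_u_dict):
    """Add the revealed answers in df to the running per-user totals.

    Hypothetical helper: df is assumed to contain user_id,
    answered_correctly and content_type_id, with true answers filled in.
    """
    cols = ['user_id', 'answered_correctly', 'content_type_id']
    for user_id, answered, content_type_id in df[cols].values:
        if content_type_id == 0:  # count questions only, skip lectures
            answered_correctly_sum_u_dict[user_id] += answered
            count_u_dict[user_id] += 1


# Toy previous batch whose answers have become known (made-up values).
prev_batch = pd.DataFrame({'user_id': [7, 7, 8],
                           'answered_correctly': [1, 0, -1],
                           'content_type_id': [0, 0, 1]})
sum_dict, count_dict = defaultdict(int), defaultdict(int)
update_user_feats(prev_batch, sum_dict, count_dict)
print(dict(sum_dict), dict(count_dict))
```

In the competition API, the answers for the previous batch should arrive with the next one (in the `prior_group_answers_correct` column of `test_df`), so an update like this would likely run at the top of each loop iteration, before the features for the new batch are looked up.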

In [None]:
content_df = pd.read_pickle('../input/pickle1/content.pickle')

In [None]:
# You can debug your inference code to reduce "Submission Scoring Error" with `validaten_flg = True`.
# Please refer to https://www.kaggle.com/its7171/time-series-api-iter-test-emulator for the Time-series API (iter_test) emulator.
import riiideducation
env = riiideducation.make_env()
iter_test = env.iter_test()
set_predict = env.predict
for (test_df, sample_prediction_df) in iter_test:
    previous_test_df = test_df.copy()
    test_df = test_df[test_df['content_type_id'] == 0].reset_index(drop=True)
    test_df = add_user_feats_without_update(test_df, answered_correctly_sum_u_dict, count_u_dict)
    test_df = pd.merge(test_df, content_df, on='content_id',  how="left")
    test_df = pd.merge(test_df, question_df, left_on='content_id', right_on='question_id', how='left')
    #test_df['prior_question_had_explanation'] = test_df.prior_question_had_explanation.fillna(False).astype('int8')
    test_df['prior_question_elapsed_time_mean'] = test_df.prior_question_elapsed_time.fillna(prior_question_elapsed_time_mean)
    test_df[TARGET] =  model.predict(test_df[FEATS])
    set_predict(test_df[['row_id', TARGET]])

Have fun with loops! :)

In [None]:
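# A defaultdict has no to_csv method, so convert it to a Series first.
pd.Series(answered_correctly_sum_u_dict, name='answered_correctly_sum_u').rename_axis('user_id').to_csv('answered_correctly_sum_u_dict.csv')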

In [None]:
pd.Series(count_u_dict, name='count_u').rename_axis('user_id').to_csv('count_u_dict.csv')
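
A `defaultdict` has no `to_csv` method, so one option is to go through a `pandas.Series`. Here is a minimal round-trip sketch with toy values; the temporary file path and the column names are illustrative assumptions.

```python
import os
import tempfile
from collections import defaultdict

import pandas as pd

# Toy state standing in for the real per-user counters.
count_u_dict = defaultdict(int, {101: 3, 202: 5})

path = os.path.join(tempfile.mkdtemp(), 'count_u_dict.csv')

# Save: one row per user, with explicit column names.
saved = pd.Series(count_u_dict, name='count_u').rename_axis('user_id')
saved.to_csv(path)

# Load: rebuild a defaultdict so unseen users still default to 0.
loaded = pd.read_csv(path, index_col='user_id')['count_u']
restored = defaultdict(int, loaded.to_dict())
print(restored[101], restored[999])
```

Rebuilding a `defaultdict(int)` on load keeps the lookup code unchanged for users that never appeared in the saved state.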

In [None]:
model.save_model('modelfeats7.txt')