# Introduction

*This notebook was forked from Andrada Olteanu's notebook: https://www.kaggle.com/andradaolteanu/answer-correctness-rapids-xgb-lgbm/notebook. It was a big help to me, as I was nowhere near understanding how RAPIDS works when I wrote this. Go check out her notebook! (It is much better written and clearer, and it also contains additional information on RAPIDS.)*


My main goal here was to improve the AUC score by at least a little, through feature engineering (FE) or something else. As this is my first competition (and also my first time with RAPIDS), I tried not to expect too much.

Anyway, it was still an interesting experience. Here are my conclusions; maybe they will help some people:
* RAPIDS is really fast! But that doesn't mean you can do anything you want. Always delete the objects you no longer use if you don't want to run into a lot of memory errors. I was a bit too enthusiastic after seeing how fast it was and didn't pay attention to memory management. Now I have nightmares about black rectangles with red, green and `MemoryError: std::bad_alloc: CUDA` in them... And even if you are meticulous, some operations are too costly for Kaggle.
* The feature engineering made by Andrada Olteanu was already really good. The various indicators (sum, count, mean, std, var, etc. by user and question) capture about 71% of the information. No matter what I tried, it didn't seem that I could make better features. The first big improvement (and by big, I mean about +0.006...) came simply from adding 'prior_question_had_explanation' to the features to keep. I also tried to scale the data, but adding other variables and scaling together was too much for the memory. Even when it worked, it didn't really improve the score.
* Other than that, I think the main limit here was that some students have "way too much free time", as I read elsewhere. Some students appear more than 15000 times. I might be wrong, but I think that as they practice a lot, they become more consistent (they give the right answer more often) and thus more predictable. I tried to work around that by removing all rows whose timestamp is above the upper outlier boundary for the timestamp, and it improved the performance! But not on the hidden test data set...


In [None]:
%%time
import sys
!cp ../input/rapids/rapids.0.17.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
import os
import psutil
import gc

import riiideducation

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Color Palette
custom_colors = ['#7400ff', '#a788e4', '#d216d2', '#ffb500', '#36c9dd']
sns.palplot(sns.color_palette(custom_colors))

# Set Style
sns.set_style("whitegrid")
sns.despine(left=True, bottom=True)

# Set tick size
plt.rc('xtick',labelsize=12)
plt.rc('ytick',labelsize=12)

In [None]:
# Rapids Imports
import cudf
import cupy # CuPy is an open-source array library accelerated with NVIDIA CUDA.


from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)
client

In [None]:
cudf.set_allocator("managed")

In [None]:
%%time
# Import the data
train = cudf.read_parquet("../input/riid-competition-rapids-part-i-eda/clean_train.parquet")
questions = cudf.read_parquet("../input/riid-competition-rapids-part-i-eda/questions.parquet", columns=['question_id','bundle_id', 'part'])

In [None]:
del train['row_id']
del train['task_container_id']
gc.collect()

# I) Cleaning

**Removing outliers**

In [None]:
# Select ids to erase: users with fewer than 10 or more than 15000 rows
user_counts = train["user_id"].value_counts().reset_index()
ids_to_erase = user_counts[(user_counts["user_id"] < 10) |
                           (user_counts["user_id"] > 15000)]["index"].values
del user_counts
previous_length = len(train)

# Erase the ids
train = train[~train['user_id'].isin(ids_to_erase)]

print("We erased {} rows meaning {:.3}% of all data.".format(previous_length-len(train), (1 - len(train)/previous_length)*100))
del ids_to_erase, previous_length
gc.collect()
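
The cell above removes every user with fewer than 10 or more than 15000 rows. The sketch below applies the same rule in plain Python, so it can be run without a GPU. The function name `filter_users_by_activity` and the toy `rows` list are hypothetical and only serve as an illustration.

```python
from collections import Counter

def filter_users_by_activity(rows, min_rows=10, max_rows=15000):
    """Drop every row of users with fewer than min_rows or more than max_rows rows.

    `rows` is a list of (user_id, payload) tuples. Both bounds are inclusive,
    which matches removing counts `< 10` or `> 15000`.
    """
    counts = Counter(user_id for user_id, _ in rows)
    keep = {user_id for user_id, n in counts.items() if min_rows <= n <= max_rows}
    return [row for row in rows if row[0] in keep]

# Toy data with small bounds: user 1 has 1 row, user 2 has 3 rows, user 3 has 6 rows.
rows = [(1, "a")] + [(2, "b")] * 3 + [(3, "c")] * 6
kept = filter_users_by_activity(rows, min_rows=2, max_rows=5)
print(sorted({user_id for user_id, _ in kept}))
```

Counting once and reusing the result avoids recomputing the per-user counts for each condition.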

In [None]:
# del train['timestamp'], total, feature, Q1, Q3, Q05, Q95, IQR, upper_outlier_boundry
# gc.collect()

# II) Feature Engineering

In [None]:
train = train.merge(questions, how = 'left', left_on = 'content_id', right_on = 'question_id')

In [None]:
del train['question_id']
gc.collect()

In [None]:
# Parameters
train_percent = 0.1
total_len = len(train)

In [None]:
# Split data into train data & feature engineering data.
# The data is ordered by timestamp and user_id, so the last 10% of the
# observations are the most recent ones and have the biggest chance of
# having had some performance recorded before.
# Looking at the performance in the past, we'll try to predict the performance now.

features_df = train.iloc[ : int(total_len*(1-train_percent))]
train_df = train.iloc[int(total_len*(1-train_percent)) : ]
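
The split above is positional rather than random: the first 90% of the rows feed the feature tables and the last 10% are kept for training the model. A minimal sketch of the same cut on a plain list, with a hypothetical helper name `split_features_train`:

```python
def split_features_train(rows, train_percent=0.1):
    """Split ordered rows: the head builds the features, the tail trains the model."""
    cut = int(len(rows) * (1 - train_percent))
    return rows[:cut], rows[cut:]

# 20 ordered toy rows: 18 go to the feature tables, 2 to training.
rows = list(range(20))
features_rows, train_rows = split_features_train(rows, train_percent=0.1)
print(len(features_rows), len(train_rows))
```

Because no row is shuffled, the aggregated features only ever describe rows that come before the training rows in the ordering.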

In [None]:
# Total rows we started with
total = len(features_df)
feature = "timestamp"

# Compute Outliers
Q1 = cupy.percentile(features_df[feature].values, q = 25).item()
Q3 = cupy.percentile(features_df[feature].values, q = 75).item()
IQR = Q3 - Q1

upper_outlier_boundry = Q3 + 1.5*IQR

print('The upper outlier boundary is {:,}, which means {:,.5} hrs, which means {:,.5} days.'.format(
    upper_outlier_boundry, upper_outlier_boundry / 3.6e+6, (upper_outlier_boundry / 3.6e+6) / 24))

print('Timestamp: around {:.2}% of the data have been erased.'.format((len(features_df[features_df[feature] > upper_outlier_boundry])/total) * 100))


features_df = features_df[features_df['timestamp'] <= upper_outlier_boundry]
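
The rule used above is the usual interquartile-range fence: a value is an upper outlier when it exceeds Q3 + 1.5 * (Q3 - Q1). The sketch below reproduces it in plain Python with linear interpolation between neighbouring values (the default interpolation of NumPy-style `percentile`). The helper names `percentile` and `upper_outlier_boundary` are hypothetical.

```python
def percentile(sorted_values, q):
    """q-th percentile (0-100) of an already sorted list, with linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac

def upper_outlier_boundary(values):
    """Return Q3 + 1.5 * IQR for a list of numbers."""
    s = sorted(values)
    q1, q3 = percentile(s, 25), percentile(s, 75)
    return q3 + 1.5 * (q3 - q1)

timestamps = [0, 1, 2, 3, 4, 100]
boundary = upper_outlier_boundary(timestamps)
kept = [t for t in timestamps if t <= boundary]
print(boundary, kept)

# Timestamps are in milliseconds, so dividing by 3.6e6 converts them to hours.
hours = 7_200_000 / 3.6e6
```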

In [None]:
user_lectures=features_df[features_df['answered_correctly']==-1]
user_lectures['lec_sum']=user_lectures['answered_correctly']*-1
user_lectures=user_lectures[['user_id', 'lec_sum']].groupby('user_id').agg({'lec_sum': 'sum'}).reset_index()

gc.collect()

In [None]:
user_lectures.to_parquet('user_lectures.parquet')

In [None]:
%%time
# Let's exclude all observations where (content_type_id = 1) & (answered_correctly = -1)
features_df = features_df[features_df['content_type_id'] != 1]
features_df = features_df[features_df['answered_correctly'] != -1].reset_index(drop=True)
features_df.head()

In [None]:
%%time
# Let's exclude all observations where (content_type_id = 1) & (answered_correctly = -1)
train_df = train_df[train_df['content_type_id'] != 1]
train_df = train_df[train_df['answered_correctly'] != -1].reset_index(drop=True)
train_df.head()

In [None]:
%%time
# --- STUDENT ANSWERS ---
# Group by student
user_answers = features_df[features_df['answered_correctly']!=-1].\
                            groupby('user_id').\
                            agg({'answered_correctly': ['sum', 'mean', 'count', 'std']}).\
                            reset_index()

user_answers.columns = ['user_id', 'user_sum', 'user_mean', 
                        'user_count', 'user_std']

user_answers['user_percent'] = user_answers['user_sum']/user_answers['user_count']
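
The grouped aggregation above produces one row of statistics per user. The sketch below computes the same quantities in plain Python on a toy list of `(user_id, answered_correctly)` pairs; `user_answer_stats` is a hypothetical name. It also shows that sum divided by count is the mean, so `user_percent` carries the same values as `user_mean`.

```python
from collections import defaultdict
from statistics import mean, stdev

def user_answer_stats(rows):
    """Aggregate (user_id, answered_correctly) pairs into per-user statistics."""
    answers = defaultdict(list)
    for user_id, correct in rows:
        answers[user_id].append(correct)

    stats = {}
    for user_id, values in answers.items():
        stats[user_id] = {
            "user_sum": sum(values),
            "user_count": len(values),
            "user_mean": mean(values),
            # The sample standard deviation is undefined for a single answer.
            "user_std": stdev(values) if len(values) > 1 else None,
            "user_percent": sum(values) / len(values),
        }
    return stats

rows = [(1, 1), (1, 0), (1, 1), (2, 0)]
stats = user_answer_stats(rows)
print(stats[1])
```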

In [None]:
%%time
# --- STUDENT ANSWERS ---
# Group by student and question part
user_part_performance = features_df[features_df['answered_correctly']!=-1].\
                            groupby(['user_id', 'part']).\
                            agg({'answered_correctly': ['sum', 'mean', 'count','std']}).\
                            reset_index()

user_part_performance.columns = ['user_id', 'part', 'user_part_sum', 'user_part_mean', 
                        'user_part_count', 'user_part_std']

user_part_performance['user_part_percent'] = user_part_performance['user_part_sum']/user_part_performance['user_part_count']

In [None]:
%%time
# --- STUDENT ANSWERS ---
# Group by student and bundle
user_bundle_performance = features_df[features_df['answered_correctly']!=-1].\
                            groupby(['user_id', 'bundle_id']).\
                            agg({'answered_correctly': ['sum', 'mean','count', 'std']}).\
                            reset_index()

user_bundle_performance.columns = ['user_id', 'bundle_id' , 'user_bundle_sum', 'user_bundle_mean', 
                                     'user_bundle_count', 'user_bundle_std']


user_bundle_performance['userbundle_percent'] = user_bundle_performance['user_bundle_sum']/user_bundle_performance['user_bundle_count']

In [None]:
%%time
# --- CONTENT ID ANSWERS ---
# Group by content and questions part
question_part_performance = features_df[features_df['answered_correctly']!=-1].\
                            groupby(['content_id', 'part']).\
                            agg({'answered_correctly': ['sum', 'mean','count', 'std']}).\
                            reset_index()

question_part_performance.columns = ['content_id', 'part' , 'question_part_sum', 'question_part_mean', 
                                     'question_part_count', 'question_part_std']


question_part_performance['question_part_percent'] = question_part_performance['question_part_sum']/question_part_performance['question_part_count']

In [None]:
%%time
# --- CONTENT ID ANSWERS ---
# Group by content and bundle
bundle_performance = features_df[features_df['answered_correctly']!=-1].\
                            groupby(['content_id', 'bundle_id']).\
                            agg({'answered_correctly': ['sum', 'mean','count', 'std']}).\
                            reset_index()

bundle_performance.columns = ['content_id', 'bundle_id' , 'bundle_sum', 'bundle_mean', 
                                     'bundle_count', 'bundle_std']


bundle_performance['bundle_percent'] = bundle_performance['bundle_sum']/bundle_performance['bundle_count']

In [None]:
%%time
# --- CONTENT ID ANSWERS ---
# Group by content
content_answers = features_df[features_df['answered_correctly']!=-1].\
                            groupby('content_id').\
                            agg({'answered_correctly': ['sum', 'mean', 'count', 'std']}).\
                            reset_index()

content_answers.columns = ['content_id', 'content_sum', 'content_mean', 'content_count', 'content_std']

content_answers['content_percent'] = content_answers['content_sum']/content_answers['content_count']

In [None]:
user_answers.to_parquet('user_answers.parquet')
user_part_performance.to_parquet('user_part_performance.parquet')
user_bundle_performance.to_parquet('user_bundle_performance.parquet')
question_part_performance.to_parquet('question_part_performance.parquet')
bundle_performance.to_parquet('bundle_performance.parquet')
content_answers.to_parquet('content_answers.parquet')

In [None]:
del train, questions, features_df
gc.collect()

# III) Preprocess

In [None]:
# from sklearn.compose import ColumnTransformer

# from cuml.experimental.preprocessing import MinMaxScaler

In [None]:
# We need to convert True-False variables to integers
def to_bool(x):
    '''Map False to 0 and any other value to 1.'''
    if x == False:
        return 0
    else:
        return 1

    
def combine_features(data = None, add_metadata = False):
    '''Combine the features with the Train/Test data.'''
    
    # Add "past" information
    features_data = data.merge(user_answers, how = 'left', on = 'user_id')
    features_data = features_data.merge(content_answers, how = 'left', on = 'content_id')
    
    if add_metadata==True:
        features_data = features_data.merge(user_lectures, how = 'left', left_on = ['user_id'], right_on = ['user_id'])
        features_data['lec_sum'].fillna(0,inplace=True)
        features_data = features_data.merge(user_part_performance, how = 'left', left_on = ['user_id', 'part'], right_on = ['user_id', 'part'])
        features_data = features_data.merge(user_bundle_performance, how = 'left', left_on = ['user_id', 'bundle_id'], right_on = ['user_id', 'bundle_id'])
        features_data = features_data.merge(question_part_performance, how = 'left', left_on = ['content_id', 'part'], right_on = ['content_id', 'part'])
        features_data = features_data.merge(bundle_performance, how = 'left', left_on = ['content_id', 'bundle_id'], right_on = ['content_id', 'bundle_id'])

    # Apply
    features_data['content_type_id'] = features_data['content_type_id'].applymap(to_bool)
    features_data['prior_question_had_explanation'] = features_data['prior_question_had_explanation'].applymap(to_bool)

    # Fill in missing spots
    features_data.fillna(value = -1, inplace = True)
    
    return features_data


# def scale_data(features_data=None, train=True, columns=None, target=None):
#     '''Scales the provided data - if the data is for training, excludes the target column.
#     It also chooses the features used in the prediction.'''
    
#     column_index = [features_data.columns.get_loc(c) for c in columns if c in features_data]
    
#     ct = ColumnTransformer([('MinMax', MinMaxScaler(), column_index)], remainder='passthrough')
#     matrix = features_data.as_matrix()
#     ct = ct.fit(matrix)
#     scaled_matrix = ct.transform(matrix)
#     del ct, column_index, matrix
    
#     scaled_data = cudf.DataFrame(scaled_matrix)
#     del scaled_matrix
#     scaled_data.columns = features_data.columns
    
#     # We don't want to scale the target also
#     if train:
#         scaled_data[target] = features_data[target]
        
#     return scaled_data



def scale_data(features_data=None, train=True, features_to_keep=None, target=None):
    '''Scales the provided data - if the data is for training, excludes the target column.
    It also chooses the features used in the prediction.
    Note: needs MinMaxScaler, whose import is commented out in the cell above.'''
    
    data_for_standardization = features_data[features_to_keep]
    matrix = data_for_standardization.as_matrix()
    MinMax = MinMaxScaler().fit(matrix)
    scaled_matrix = MinMax.transform(matrix)
    del MinMax, matrix
    
    scaled_data = cudf.DataFrame(scaled_matrix)
    scaled_data.columns = data_for_standardization.columns
    del data_for_standardization
    
    # We don't want to scale the target also
    if train:
        scaled_data[target] = features_data[target]
        
    return scaled_data
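
Min-max scaling, which `scale_data` applies to the selected features, rescales each column to the [0, 1] range with (x - min) / (max - min). A minimal sketch on a plain list follows; `min_max_scale` is a hypothetical helper, and mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; return zeros to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

scaled = min_max_scale([10, 20, 40])
print(scaled)
```

Tree-based models split on thresholds, so a monotonic rescaling like this one is not expected to change their splits much.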



# IV) Training

In [None]:
# RAPIDS roc_auc_score is 16x faster than sklearn. - cdeotte
import cuml
import cupy
from cuml.metrics import roc_auc_score
from cuml.preprocessing.model_selection import train_test_split
import xgboost
from xgboost import XGBClassifier
import pickle


from xgboost import plot_importance
import plotly.express as px
from plotly.subplots import make_subplots
from matplotlib import pyplot
import plotly.graph_objects as go

In [None]:
def print_version(*x):
    for i in x:
        print(i, sys.modules[i].__version__)
        
print_version('xgboost', 'cupy')

In [None]:
def train_xgb_model(X_train, X_test, y_train, y_test, params, num_round=10, details = None, prints=True):
    '''Trains an XGB and returns the trained model + ROC value.'''
    # Create DMatrix - is optimized for both memory efficiency and training speed.
    train_matrix = xgboost.DMatrix(data = X_train, label = y_train)
    
    
    # Create & Train the model
    model = xgboost.train(params, dtrain = train_matrix, 
                          num_boost_round=num_round
                         )

    # Make prediction
    predicts = model.predict(xgboost.DMatrix(X_test))
    roc = roc_auc_score(y_test.astype('int32'), predicts)

    if prints:
        print(details + " - ROC: {:.5}".format(roc))
    
    return model, roc


def param_tuning_graph(param_values, roc_values):
    '''Plots the ROC AUC obtained for each value of the tuned parameter.'''
    
    plt.figure(figsize=(18, 3))
    ax = sns.barplot(x=param_values, y=roc_values, palette=custom_colors)

    for p in ax.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy() 
        ax.annotate(f'{height:.5%}', (x + width/2, y + height*1.02), ha='center')
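
The score returned by `train_xgb_model` is the area under the ROC curve. One way to read it: the probability that a randomly chosen correct answer receives a higher predicted score than a randomly chosen incorrect one. The sketch below computes it by comparing every positive/negative pair; it is quadratic and meant for small illustrations only, not as a replacement for `cuml.metrics.roc_auc_score`. The function name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```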

# IV.1) Baseline

In [None]:
%%time

features_to_keep = ['user_sum', 'user_mean', 'user_count', 'user_std', 'user_percent',
                    'content_sum', 'content_mean', 'content_count', 'content_std', ]



target = 'answered_correctly'
all_features = features_to_keep.copy()
all_features.append(target)



train_df_combined = combine_features(data=train_df)

# Comment this if you're scaling
train_df_combined = train_df_combined[all_features]

print("Observations in train: {:,}".format(len(train_df)))
train_df_combined.head()

In [None]:
# Features, target and train/test split
X = train_df_combined[features_to_keep]
y = train_df_combined[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    shuffle=False, random_state=13, stratify=y)

In [None]:
del X, y, features_to_keep, target, all_features
gc.collect()

In [None]:
%%time

params1 = {
    'max_depth' : 12,
    'tree_method' : 'gpu_hist',
    'objective' : 'binary:logistic',
    'grow_policy' : 'depthwise',
    'eval_metric': 'auc'
}


model1, roc1 = train_xgb_model(X_train, X_test, y_train, y_test, 
                               params1, details="baseline model")

In [None]:
fig, ax = pyplot.subplots(figsize=(12, 8))


model1.get_score(importance_type='gain')
plot_importance(model1, ax=ax)
pyplot.show()

In [None]:
# save model to file
with open("baseline_model.pickle.dat", "wb") as f:
    pickle.dump(model1, f)

In [None]:
del train_df_combined, model1, roc1, X_train, X_test, y_train, y_test
gc.collect()

# IV.2) Adding data

In [None]:
%%time

# Combine with past features
train_df_combined = combine_features(data=train_df, add_metadata = True)

# Features for ML
features_to_keep = ['timestamp', 'prior_question_elapsed_time', 'prior_question_had_explanation', 'part', 'lec_sum',
                    'user_sum', 'user_mean', 'user_count', 'user_std', 'user_percent',
                    'user_part_sum', 'user_part_mean', 'user_part_count', 'user_part_std', 'user_part_percent',
                    'user_bundle_sum', 'user_bundle_mean', 'user_bundle_count', 'user_bundle_std',
                    'question_part_sum', 'question_part_mean', 'question_part_count', 'question_part_std', 'question_part_percent',
                    'bundle_sum', 'bundle_mean', 'bundle_count', 'bundle_std', 'bundle_percent', 
                    'content_sum', 'content_mean', 'content_count', 'content_std', 'question_part_percent']


target = 'answered_correctly'
all_features = features_to_keep.copy()
all_features.append(target)

# Comment this if you're scaling
train_df_combined = train_df_combined[all_features]

print("Observations in train: {:,}".format(len(train_df)))
train_df_combined.head()

In [None]:
# Features, target and train/test split
X = train_df_combined[features_to_keep]
y = train_df_combined[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    shuffle=False, random_state=13, stratify=y)

In [None]:
del X, y, features_to_keep, target, all_features
gc.collect()

In [None]:
%%time

params2 = {
    'max_depth' : 12,
    'tree_method' : 'gpu_hist',
    'objective' : 'binary:logistic',
    'grow_policy' : 'depthwise',
    'eval_metric': 'auc'
}


model2, roc2 = train_xgb_model(X_train, X_test, y_train, y_test, 
                               params2, num_round=10, details="added data model")

In [None]:
fig, ax = pyplot.subplots(figsize=(12, 8))

model2.get_score(importance_type='gain')
plot_importance(model2, ax=ax)
pyplot.show()

In [None]:
# save model to file
with open("model2.pickle.dat", "wb") as f:
    pickle.dump(model2, f)

In [None]:
%%time

# --- ETA ---
# aka learning rate

rocs2 = []
etas2 = [0.001, 0.005, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1]

result_etas = {}

for eta in etas2:
    params2 = {
        'tree_method' : 'gpu_hist',
        'objective' : 'binary:logistic',
        'grow_policy' : 'depthwise',
        'eval_metric': 'auc', 
        'eta' : eta
    }

    _, roc = train_xgb_model(X_train, X_test, y_train, y_test, 
                             params2, details = f"ETA: {eta}")
    rocs2.append(roc)
    result_etas.update({roc: eta})

best_eta = result_etas[max(rocs2)]
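
The tuning cells in this section follow the same pattern: vary one parameter, keep the others at their defaults, and keep the value with the best score. The sketch below shows that pattern with a list of `(value, score)` pairs instead of a dictionary keyed by score, which avoids overwriting an entry when two values reach exactly the same score. `tune_one_parameter` and `toy_score` are hypothetical; `toy_score` stands in for training a model and measuring its AUC.

```python
def tune_one_parameter(name, candidates, base_params, score_fn):
    """Score each candidate value of one parameter and return (best_value, best_score)."""
    results = []
    for value in candidates:
        params = {**base_params, name: value}
        results.append((value, score_fn(params)))
    # max() returns the first pair with the highest score.
    return max(results, key=lambda pair: pair[1])

def toy_score(params):
    """Hypothetical stand-in for a validation AUC; peaks at eta = 0.3."""
    return 1.0 - abs(params["eta"] - 0.3)

best_eta_value, best_eta_score = tune_one_parameter(
    "eta", [0.01, 0.1, 0.3, 0.5, 1], {"max_depth": 6}, toy_score)
print(best_eta_value, best_eta_score)
```

Tuning parameters one at a time is cheap, but it ignores interactions between them, so the combined setting is not guaranteed to be the best joint choice.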

In [None]:
%%time

# --- MAX DEPTH ---
# maximum depth of a tree

rocs2 = []
max_depths = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
result_max_depths = {}

for max_depth in max_depths:
    params2 = {
        'max_depth' : max_depth,
        'tree_method' : 'gpu_hist',
        'objective' : 'binary:logistic',
        'grow_policy' : 'depthwise',
        'eval_metric': 'auc'
    }

    _, roc = train_xgb_model(X_train, X_test, y_train, y_test, 
                             params2, details = f"Max_depth: {max_depth}")
    rocs2.append(roc)
    result_max_depths.update({roc: max_depth})

best_max_depth = result_max_depths[max(rocs2)]

In [None]:
%%time

# --- GAMMA ---
# minimum loss reduction required to make a further split

rocs2 = []
gammas = [ 0.0, 0.2 , 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10, 20, 50, 100, 200, 500, 1000]
result_gamma = {}

for gamma in gammas:
    params2 = {
        'tree_method' : 'gpu_hist',
        'objective' : 'binary:logistic',
        'grow_policy' : 'depthwise',
        'eval_metric': 'auc',
        'gamma': gamma
    }

    _, roc = train_xgb_model(X_train, X_test, y_train, y_test, 
                             params2, details = f"Gamma: {gamma}")
    rocs2.append(roc)
    result_gamma.update({roc: gamma})

best_gamma = result_gamma[max(rocs2)]

In [None]:
%%time

# --- COLSAMPLE BYTREE ---
# fraction of columns sampled for each tree

rocs2 = []
colsample_bytrees = [ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0 ]
result_colsample_bytrees = {}

for colsample_bytree in colsample_bytrees:
    params2 = {
        'tree_method' : 'gpu_hist',
        'objective' : 'binary:logistic',
        'grow_policy' : 'depthwise',
        'eval_metric': 'auc',
        'colsample_bytree': colsample_bytree
    }

    _, roc = train_xgb_model(X_train, X_test, y_train, y_test, 
                             params2, details = f"Colsample_bytree: {colsample_bytree}")
    rocs2.append(roc)
    result_colsample_bytrees.update({roc: colsample_bytree})

best_colsample_bytrees = result_colsample_bytrees[max(rocs2)]


In [None]:
%%time

# --- ALPHA ---
# L1 regularization term on weights

rocs2 = []
alphas = [0.0, 0.2, 0.4, 0.6, 0.8, 1, 5, 10]

result_alpha={}

for alpha in alphas:
    params2 = {
        'tree_method' : 'gpu_hist',
        'objective' : 'binary:logistic',
        'grow_policy' : 'depthwise',
        'eval_metric': 'auc', 
        'alpha' : alpha
    }

    _, roc = train_xgb_model(X_train, X_test, y_train, y_test, 
                             params2, details = f"alpha: {alpha}")
    rocs2.append(roc)
    result_alpha.update({roc: alpha})

best_alpha = result_alpha[max(rocs2)]

In [None]:
%%time

params3 = {
    'max_depth' : best_max_depth,
    'eta' : best_eta,
    'gamma': best_gamma,
    'alpha': best_alpha,
    'colsample_bytree': best_colsample_bytrees,
    'tree_method' : 'gpu_hist',
    'objective' : 'binary:logistic',
    'grow_policy' : 'depthwise',
    'eval_metric': 'auc'
}


model3, roc3 = train_xgb_model(X_train, X_test, y_train, y_test, 
                               params3, num_round=10, details="tuned model")

In [None]:
fig, ax = pyplot.subplots(figsize=(12, 8))

model3.get_score(importance_type='gain')
plot_importance(model3, ax=ax)
pyplot.show()

In [None]:
# save model to file
with open("model3.pickle.dat", "wb") as f:
    pickle.dump(model3, f)

In [None]:
feature_importance = pd.DataFrame.from_dict(data=model3.get_score(importance_type='gain'), orient='index')
feature_importance = feature_importance.sort_values(by=0, ascending=False)
most_important_feature = list(feature_importance[0:12].index)
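
The cell above ranks the features by gain and keeps the first 12. The same selection on a plain dictionary looks like the sketch below; `top_k_features` and the gain values are made up for illustration.

```python
def top_k_features(gain_by_feature, k):
    """Return the names of the k features with the highest gain, best first."""
    ranked = sorted(gain_by_feature.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical gains, shaped like the dictionary returned by get_score(importance_type='gain').
gains = {"user_mean": 120.5, "content_mean": 310.2, "timestamp": 15.0, "part": 42.7}
selected = top_k_features(gains, k=2)
print(selected)
```

Features that were never used in a split do not appear in the dictionary returned by `get_score`, so the list can be shorter than `k`.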

In [None]:
del X_train, X_test, y_train, y_test, model3, roc3
gc.collect()

# IV.3) Feature selection

In [None]:
train_df_combined['answered_correctly'] = train_df['answered_correctly']

In [None]:
%%time
# Features for ML
features_to_keep = most_important_feature

target = 'answered_correctly'

all_features = features_to_keep.copy()
all_features.append(target)


print("Observations in train: {:,}".format(len(train_df)))
train_df_combined.head()

In [None]:
# Features, target and train/test split
X = train_df_combined[features_to_keep]
y = train_df_combined[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    shuffle=False, random_state=13, stratify=y)

In [None]:
del X, y, features_to_keep, target, all_features
gc.collect()

In [None]:
%%time

model4, roc4 = train_xgb_model(X_train, X_test, y_train, y_test, 
                               params3, num_round=10, details="feature selection model")

In [None]:
fig, ax = pyplot.subplots(figsize=(12, 8))

model4.get_score(importance_type='gain')
plot_importance(model4, ax=ax)
pyplot.show()

In [None]:
# save model to file
with open("model4.pickle.dat", "wb") as f:
    pickle.dump(model4, f)

In [None]:
del train_df_combined, model4, roc4, X_train, X_test, y_train, y_test
gc.collect()

# Submissions

In [None]:
with open('./model3.pickle.dat', 'rb') as f:
    final_model = pickle.load(f)

In [None]:
# Import library and create environment
import riiideducation
env = riiideducation.make_env()

In [None]:
# Features for ML
features_to_keep = ['timestamp', 'prior_question_elapsed_time', 'prior_question_had_explanation', 'part', 'lec_sum', 
                    'user_sum', 'user_mean', 'user_count', 'user_std', 'user_percent',
                    'user_part_sum', 'user_part_mean', 'user_part_count', 'user_part_std', 'user_part_percent',
                    'user_bundle_sum', 'user_bundle_mean', 'user_bundle_count', 'user_bundle_std',
                    'question_part_sum', 'question_part_mean', 'question_part_count', 'question_part_std', 'question_part_percent',
                    'bundle_sum', 'bundle_mean', 'bundle_count', 'bundle_std', 'bundle_percent', 
                    'content_sum', 'content_mean', 'content_count', 'content_std', 'question_part_percent']

# features_to_keep = most_important_feature

In [None]:
questions = cudf.read_parquet("../input/riid-competition-rapids-part-i-eda/questions.parquet", columns=['question_id','bundle_id', 'part'])

In [None]:
# Here you would also add your pretrained model
iter_test = env.iter_test()

for (test_df, sample_prediction_df) in iter_test:
    test_df = cudf.from_pandas(test_df)
    
    # --- PREPROCESSING ---
    # Now apply the same preprocessing to test_df
    test_df = test_df.merge(questions, how = 'left', left_on = 'content_id', right_on = 'question_id')
    test_df = combine_features(data = test_df, add_metadata = True)
    
    X = test_df[features_to_keep].to_pandas()
    
    # --- MODEL ---
    test_df['answered_correctly'] = final_model.predict(xgboost.DMatrix(X))
    test_df = test_df.to_pandas()
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
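
Inside the loop, `combine_features` left-merges the saved aggregates onto each test batch and fills the gaps with -1, so a user or question that never appeared in the feature window still gets a complete feature row. A plain-Python sketch of that behaviour for the user table follows; `attach_user_features` and the toy data are hypothetical.

```python
def attach_user_features(rows, user_stats, feature_names, missing=-1):
    """Left-join per-user features onto rows; users without history get `missing`."""
    combined = []
    for row in rows:
        stats = user_stats.get(row["user_id"], {})
        enriched = dict(row)  # copy so the input rows stay untouched
        for name in feature_names:
            enriched[name] = stats.get(name, missing)
        combined.append(enriched)
    return combined

user_stats = {1: {"user_mean": 0.75, "user_count": 4}}
batch = [{"user_id": 1, "content_id": 10}, {"user_id": 99, "content_id": 11}]
combined = attach_user_features(batch, user_stats, ["user_mean", "user_count"])
print(combined)
```

The -1 placeholder lies outside the range of the real means and counts, so the model can treat "no history" as its own case.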


In [None]:
# del questions
# del user_answers, user_part_performance, user_bundle_performance, question_part_performance, bundle_performance, content_answers
# del model3, roc3
# gc.collect()