# RAPIDS Forest Inference Library! TRAIN NOTEBOOK
# Kaggle Toxic Comp 2022 Solution - Silver 52nd Place
Since [RAPIDS Forest Inference Library][1] can run inference on tree-based models at up to 100 million rows per second, we can approach this problem in a way that no other team did. During inference, we need to predict one toxic score for each of the 14,000 test comments. Instead of predicting 14,000 numbers, we will predict 196,000,000 numbers! We create every pair of comments, `14,000 x 14,000 = 196,000,000`, and for each of these 196 million pairs we feed `1792 x 2 = 3584` columns of features to [RAPIDS Forest Inference Library][1] to predict whether the second comment is more toxic than the first comment. After inferring all 196,000,000 pair probabilities, we convert them into 14,000 toxic scores.

[1]: https://medium.com/rapids-ai/rapids-forest-inference-library-prediction-at-100-million-rows-per-second-19558890bc35
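
The pair-then-aggregate idea can be sketched in plain NumPy. This is only an illustration under stated assumptions: `pairwise_scores` and `predict_fn` are hypothetical names, `predict_fn` stands in for the trained forest model, and averaging the pair probabilities is one plausible way to turn them into per-comment scores (the exact conversion is not spelled out above).

```python
import numpy as np

def pairwise_scores(embed, predict_fn, batch_rows=256):
    """Score each comment by its mean probability of being the more toxic one.

    embed      : (n, d) array with one feature row per comment
    predict_fn : hypothetical callable mapping an (m, 2*d) array to m
                 probabilities that the second comment in each row is more
                 toxic than the first (assumed convention)
    """
    n, d = embed.shape
    scores = np.zeros(n)
    # Process the n x n pair grid in chunks of first comments to bound memory
    for start in range(0, n, batch_rows):
        first = embed[start:start + batch_rows]        # (b, d)
        b = len(first)
        left = np.repeat(first, n, axis=0)             # each first comment n times
        right = np.tile(embed, (b, 1))                 # all comments, b times
        probs = predict_fn(np.hstack([left, right])).reshape(b, n)
        # probs[i, j] = P(comment j is more toxic than comment start + i)
        scores += probs.sum(axis=0)
    return scores / n

# Toy check: "toxicity" is feature 0 and the stand-in model compares it directly
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 4))
toy_predict = lambda X: 1 / (1 + np.exp(-(X[:, 4] - X[:, 0])))
toy_scores = pairwise_scores(toy, toy_predict, batch_rows=16)
```

With the toy model, the ranking of `toy_scores` matches the ranking of feature 0, which is the behavior we want from the aggregation step.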

# Load Leak Free Folds
We will load leak free K-folds from the notebook [here][1]. Each row of `validation_data.csv` is a pair of comments, and each of those two comments can also appear in other pairs. When creating folds, we must put all pairs that contain a given comment into the same fold, so that no comment appears in both the train and validation splits.

[1]: https://www.kaggle.com/its7171/jigsaw-cv-strategy
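
One way to build such folds is to treat comments as graph nodes and pairs as edges, find connected components with union-find, and then assign whole components to folds. The sketch below is not the linked notebook's code; the function name `assign_leak_free_folds`, the column names `less_toxic` and `more_toxic`, and the greedy balancing step are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def assign_leak_free_folds(pairs, n_folds):
    """Assign a fold to each pair so that no comment spans two folds."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link the two comments of every pair into one component
    for a, b in zip(pairs['less_toxic'], pairs['more_toxic']):
        parent[find(a)] = find(b)

    # Every pair belongs to the component of either of its comments
    groups = pairs['less_toxic'].map(find)

    # Greedy balancing: largest components first, each into the emptiest fold
    fold_load = np.zeros(n_folds, dtype=int)
    group_to_fold = {}
    for g, size in groups.value_counts().items():
        f = int(fold_load.argmin())
        group_to_fold[g] = f
        fold_load[f] += size

    out = pairs.copy()
    out['fold'] = groups.map(group_to_fold)
    return out

# Toy example: comments a, b, c are linked through shared pairs
toy_pairs = pd.DataFrame({
    'less_toxic': ['a', 'b', 'd', 'f', 'a', 'h'],
    'more_toxic': ['b', 'c', 'e', 'g', 'c', 'i'],
})
toy_folds = assign_leak_free_folds(toy_pairs, n_folds=2)
```

In the toy data, the three pairs built from `a`, `b`, and `c` all land in the same fold, so a model validated on that fold has never seen any of those comments during training.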

In [None]:
import pandas as pd, numpy as np, gc, os
from sklearn.metrics import roc_auc_score, accuracy_score

VER = 100
FOLDS = 11

train = pd.read_csv(f'../input/toxiccompscripts/folds-{FOLDS}.csv')
train = train.drop(['less_id','more_id','less_gid','more_gid'],axis=1)
print( train.shape )
train.head()

# Load Text Features
For each of the 30,108 pairs of comments in the file `validation_data.csv` we extract embeddings from `roBERTa-base` and `roBERTa-large` offline. The `roBERTa-base` embeddings have 768 columns and the `roBERTa-large` embeddings have 1024 columns. Then we load the embeddings here. A pair is two comments, so each row of train data will have `2 x (768 + 1024) = 3584` columns of features.

Note that both `roBERTa` models are first pretrained on data from older competitions to predict those competitions' targets. We then remove the pretraining head and extract only the last hidden layer activations as embeddings.

In [None]:
less_embed = np.load('../input/roberta-7/embed1_v7.npy')
more_embed = np.load('../input/roberta-7/embed2_v7.npy')
print('roBERTa-base embeddings have shape',less_embed.shape, more_embed.shape)

In [None]:
less_embed2 = np.load('../input/roberta-8/embed1_v8.npy')
more_embed2 = np.load('../input/roberta-8/embed2_v8.npy')
print('roBERTa-large embeddings have shape',less_embed2.shape, more_embed2.shape)

In [None]:
less_embed = np.concatenate([less_embed,less_embed2],axis=1)
more_embed = np.concatenate([more_embed,more_embed2],axis=1)
WORDS = less_embed.shape[1]
print('Each comment has',WORDS,'features. These are roBERTa-base and roBERTa-large embeddings')
del less_embed2, more_embed2
gc.collect()

In [None]:
# PUT EMBEDDINGS INTO A DATAFRAME
df_l = pd.DataFrame(less_embed,columns=[f'f_{x}' for x in range(WORDS)])
df_m = pd.DataFrame(more_embed,columns=[f'f_{x}' for x in range(WORDS)])

# Train XGB
We will now train 11 folds of XGB models that accept two comments and predict if the second comment is more toxic. The model receives 1792 columns of features for each comment (3584 columns per pair) and outputs a probability. We will save the 11 fold models to a Kaggle dataset and then load them into a separate inference notebook for submission. We achieve an 11-fold CV accuracy of 0.707.
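
Because every labeled pair is used twice, once in each order with opposite targets, the training set is balanced and the model cannot learn from column position alone. A minimal NumPy sketch of that augmentation follows; `make_symmetric_pairs` is a hypothetical helper, and the block order (more toxic comment first for label 1) is chosen to mirror the column layout in the training loop of this notebook.

```python
import numpy as np

def make_symmetric_pairs(less, more):
    """Build features and targets for both orderings of each pair.

    less, more : (n, d) arrays of features for the less and more toxic comment
    Returns X of shape (2n, 2d) and y of shape (2n,).
    """
    n = len(less)
    X = np.vstack([
        np.hstack([more, less]),   # first n rows: one ordering, target 1
        np.hstack([less, more]),   # last n rows: swapped ordering, target 0
    ])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    return X, y

# Toy example with 5 pairs and 3 features per comment
rng = np.random.default_rng(0)
toy_less = rng.normal(size=(5, 3))
toy_more = rng.normal(size=(5, 3))
X_toy, y_toy = make_symmetric_pairs(toy_less, toy_more)
```

The same split into a first half with target 1 and a second half with target 0 is why the out-of-fold predictions are later stored in two arrays, one per half.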

In [None]:
import xgboost as xgb
print('XGB Version',xgb.__version__)

In [None]:
xgb_parms = { 
    'max_depth':2, 
    'learning_rate':0.01, 
    'subsample':0.4,
    'colsample_bytree':0.3, 
    'eval_metric':'logloss',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist',
    'predictor':'gpu_predictor',
    'random_state':42,
}

In [None]:
#%%time
importances = []
oof1 = np.zeros((len(train)))
oof2 = np.zeros((len(train)))

for fold in range(FOLDS):
    print('#'*25)
    print('### Fold',fold+1)
    print('#'*25)
    
    # TRAIN DATA FROM VALIDATION_DATA.CSV
    t_data = train.loc[train.fold!=fold].copy()
    v_data = train.loc[train.fold==fold].copy()
    
    t_data['y'] = 1 
    v_data['y'] = 1
            
    # WE WILL COPY EVERY PAIR IN VALIDATION DATA AND REVERSE THE ORDER AND TARGET
    t_data2 = t_data.copy()
    t_data2['y'] = 0
    t_data = pd.concat([t_data, t_data2],axis=0, ignore_index=True)
    del t_data2
    
    v_data2 = v_data.copy()
    v_data2['y'] = 0
    v_data = pd.concat([v_data, v_data2],axis=0, ignore_index=True)
    del v_data2    
    
    # MERGE FEATURES FROM ROBERTA-BASE AND ROBERTA-LARGE
    FEATURES = []
    more_train = pd.concat([df_m.loc[train.fold!=fold],df_l.loc[train.fold!=fold]],axis=0,ignore_index=True)
    more_valid = pd.concat([df_m.loc[train.fold==fold],df_l.loc[train.fold==fold]],axis=0,ignore_index=True)
    FEATURES = FEATURES + list( more_train.columns )
    t_data = pd.concat([t_data,more_train],axis=1)
    v_data = pd.concat([v_data,more_valid],axis=1)
    
    more_train = pd.concat([df_l.loc[train.fold!=fold],df_m.loc[train.fold!=fold]],axis=0,ignore_index=True)
    more_train.columns = [f'g_{x}' for x in range(WORDS)]
    more_valid = pd.concat([df_l.loc[train.fold==fold],df_m.loc[train.fold==fold]],axis=0,ignore_index=True)
    more_valid.columns = [f'g_{x}' for x in range(WORDS)]
    FEATURES = FEATURES + list( more_train.columns )
    t_data = pd.concat([t_data,more_train],axis=1)
    v_data = pd.concat([v_data,more_valid],axis=1)
    
    dtrain = xgb.DMatrix(data=t_data[FEATURES], label=t_data.y )
    dvalid = xgb.DMatrix(data=v_data[FEATURES], label=v_data.y )
    
    # TRAIN MODEL FOLD K
    model = xgb.train(xgb_parms, 
                dtrain=dtrain,
                evals=[(dtrain,'train'),(dvalid,'valid')],
                num_boost_round=9999,
                early_stopping_rounds=100,
                verbose_eval=500) 
    model.save_model(f'XGB_v{VER}_f{fold}.xgb')
    
    # GET FEATURE IMPORTANCE FOR FOLD K
    dd = model.get_score(importance_type='weight')
    tmp = pd.DataFrame({'feature':list(dd.keys()),f'importance_{fold}':list(dd.values())})
    importances.append(tmp)
    
    # INFER OOF FOLD K
    oof_preds = model.predict(dvalid)
    auc = roc_auc_score(v_data.y.values, oof_preds)
    acc = accuracy_score(v_data.y.values, (oof_preds>0.5).astype('int32'))
    print('AUC =',auc,'ACC =',acc)
    
    oof1[train.fold.values==fold] = oof_preds[:len(oof_preds)//2]
    oof2[train.fold.values==fold] = oof_preds[len(oof_preds)//2:]
    
print()
oof = np.concatenate([oof1,oof2])
true = np.concatenate([ np.ones(len(train)),np.zeros(len(train)) ])

acc = accuracy_score(true, (oof>0.5).astype('int32'))
auc = roc_auc_score(true, oof)
print('OVERALL AUC =',auc,'ACC =',acc)

# XGB Feature Importances

In [None]:
import matplotlib.pyplot as plt

dff = importances[0].copy()
for k in range(1,FOLDS): dff = dff.merge(importances[k], on='feature', how='left')
dff['importance'] = dff.iloc[:,1:].mean(axis=1)
dff = dff.sort_values('importance',ascending=False)
dff.to_csv(f'xgb_feature_importance_v{VER}_toxic.csv',index=False)

In [None]:
NUM_FEATURES = 20
plt.figure(figsize=(10,5*NUM_FEATURES//10))
plt.barh(np.arange(NUM_FEATURES,0,-1), dff.importance.values[:NUM_FEATURES])
plt.yticks(np.arange(NUM_FEATURES,0,-1), dff.feature.values[:NUM_FEATURES])
plt.title(f'XGB Feature Importance - Top {NUM_FEATURES}')
plt.show()