# RAPIDS Forest Inference Library! INFERENCE NOTEBOOK
# Kaggle Toxic Comp 2022 Solution - Silver 52nd Place
Because the [RAPIDS Forest Inference Library][1] (FIL) can run inference on tree-based models blazing fast, at up to 100 million rows per second, we can approach this problem in a way that no other team did. During inference, we need to predict one toxic score for each of the 14,000 test comments. Instead of predicting 14,000 numbers, we will predict 196,000,000 numbers! We create every ordered pair of comments, `14,000 x 14,000 = 196,000,000`, and for each of these 196 million pairs we feed `1792 x 2 = 3584` feature columns to [RAPIDS Forest Inference Library][1] to predict whether the second comment is more toxic than the first. After inferring all 196,000,000 pair probabilities, we convert them into 14,000 toxic scores.

[1]: https://medium.com/rapids-ai/rapids-forest-inference-library-prediction-at-100-million-rows-per-second-19558890bc35
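
To get a feel for the scale, the arithmetic above can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes `float32` features (4 bytes each) and a batch size of 100,000 pairs, which matches the batch size used later in this notebook. The variable names are illustrative.

```python
# Back-of-the-envelope size of the pairwise problem (assumes float32 features).
N_COMMENTS = 14_000        # test comments
N_FEATURES = 1792          # feature columns per comment
BYTES_PER_VALUE = 4        # float32
BATCH_SIZE = 100_000       # pairs scored per batch

n_pairs = N_COMMENTS * N_COMMENTS
cols_per_pair = 2 * N_FEATURES
full_bytes = n_pairs * cols_per_pair * BYTES_PER_VALUE
batch_bytes = BATCH_SIZE * cols_per_pair * BYTES_PER_VALUE
n_batches = -(-n_pairs // BATCH_SIZE)   # ceiling division

print(f'{n_pairs:,} pairs x {cols_per_pair} columns')
print(f'full feature matrix: {full_bytes / 1e12:.2f} TB')
print(f'one batch: {batch_bytes / 1e9:.2f} GB, {n_batches} batches')
```

The full pair matrix would be roughly 2.8 TB, far more than fits in GPU memory, which is why the pairs are materialized and scored one batch at a time.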

# Extract Features for Test Comments
First we extract `roBERTa-base` and `roBERTa-large` features for each test comment by running the two Python scripts below. The scripts save the features as two NumPy arrays, `embed.npy` and `embed_large.npy`.

In [None]:
%%time
# ROBERTA-BASE EMBEDDINGS
! python -W ignore ../input/toxiccompscripts/roberta-extract-7.py

In [None]:
%%time
# ROBERTA-LARGE EMBEDDINGS
! python -W ignore ../input/toxiccompscripts/roberta-extract-8.py

# Infer Test with RAPIDS Forest Inference Library
After reading the 14,000 unique test comments, we will create each of the `196,000,000 = 14,000 x 14,000` pairs. Then we will put these pairs into a RAPIDS cuDF dataframe. Next we will merge the `roBERTa` features extracted above onto the RAPIDS cuDF dataframe and use [RAPIDS Forest Inference Library (FIL)][1] to infer a probability score for each of the 196 million pairs! Finally we will convert these probabilities into 14,000 toxic scores for `submission.csv`.

[1]: https://medium.com/rapids-ai/rapids-forest-inference-library-prediction-at-100-million-rows-per-second-19558890bc35
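
The pair construction and feature layout can be illustrated on a toy example. The sketch below uses pandas on the CPU as a stand-in for cuDF, three made-up comment IDs, and two made-up feature columns per comment. The `g_` prefix for the first comment and the `f_` prefix for the second follow the naming used in the inference cell.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 3 comments with 2 features each (values are made up).
ids = np.array([10, 20, 30])
feats = pd.DataFrame({'comment_id': ids,
                      'f_0': [0.1, 0.2, 0.3],
                      'f_1': [1.0, 2.0, 3.0]})

# Every ordered pair: id1 repeats each ID, id2 cycles through all IDs.
pairs = pd.DataFrame({'id1': np.repeat(ids, len(ids)),
                      'id2': np.tile(ids, len(ids))})

# Features of the first comment get the g_ prefix, the second keeps f_.
first = feats.rename(columns={'comment_id': 'id1', 'f_0': 'g_0', 'f_1': 'g_1'})
second = feats.rename(columns={'comment_id': 'id2'})
pairs = pairs.merge(first, on='id1', how='left').merge(second, on='id2', how='left')

# Model input: second-comment columns first, then first-comment columns.
X = pairs[['f_0', 'f_1', 'g_0', 'g_1']].to_numpy(dtype='float32')
print(pairs.head(4))
print(X.shape)
```

With `N` comments this produces `N x N` rows, including the `N` pairs of a comment with itself.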

In [None]:
VER = 100
FOLDS = 11

import pandas as pd, numpy as np, cudf, cupy, os
from sklearn.metrics import roc_auc_score, accuracy_score
from scipy.stats import rankdata
print('RAPIDS version',cudf.__version__)

In [None]:
# LOAD COMMENTS TO SCORE
sub = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')
sub.head()

In [None]:
# MERGE EMBEDDINGS ONTO COMMENTS TO SCORE
WORDS = 768 + 1024
embed = np.concatenate( [np.load('embed.npy'), np.load('embed_large.npy')], axis=1 )
df_e = pd.DataFrame(embed,columns=[f'f_{x}' for x in range(WORDS)])
sub = pd.concat([sub,df_e],axis=1)
del sub['text']

In [None]:
sub.iloc[:,:1] = sub.iloc[:,:1].astype('int32')
sub.iloc[:,1:] = sub.iloc[:,1:].astype('float32')
IDS = sub.comment_id.values
print('Test has',len(IDS),'unique comments')

In [None]:
# CREATE RAPIDS CUDF DATAFRAME FOR ALL 196,000,000 PAIRS!
test2 = cudf.DataFrame({'id1':np.repeat(IDS,len(IDS)),
                        'id2':np.tile(IDS,len(IDS))})
features = cudf.DataFrame(sub)
print('We will infer every row of dataframe with shape', test2.shape )
test2.head()

In [None]:
# ONLY INFER ALL PAIRS DURING SUBMIT, NOT DURING COMMIT
# (7537 COMMENTS MEANS COMMIT MODE, SO KEEP ONLY THE FIRST 1,000,000 PAIRS)
if len(IDS)==7537:
    print('Skipping full inference during commit...')
    test2 = test2.iloc[:1_000_000]

In [None]:
FEATURES = []
WORDS = 768 + 1024
FEATURES = FEATURES + [f'f_{x}' for x in range(WORDS)]
FEATURES = FEATURES + [f'g_{x}' for x in range(WORDS)]
print('Each pair has',len(FEATURES),'feature columns')

In [None]:
%%time
from cuml import ForestInference

# INFER TEST ITEM PAIRS IN BATCHES
BATCH_SIZE = 100_000
BATCHES = int( np.ceil( len(test2)/BATCH_SIZE ) )
test_preds = cupy.zeros((len(test2)),dtype='float32')

# LOAD FOLD MODELS
models = []
for fold in range(FOLDS):
    model_path = f'../input/xgbtoxic100/XGB_v{VER}_f{fold}.xgb'
    model = ForestInference.load(model_path, output_class=True, threads_per_tree=16,
                                 storage_type='dense', algo='BATCH_TREE_REORG',
                                 n_items=2, blocks_per_sm=5)
    models.append(model)

# INFER BATCHES, INFER FOLD MODELS
print(f'Inferring {BATCHES} batches...')
for k in range(BATCHES):
    
    start = BATCH_SIZE*k
    end = min(BATCH_SIZE*(k+1),len(test2))
    print(k,', ',end='')
    
    test3 = test2.iloc[start:end].copy()
    test3['order'] = cupy.arange(len(test3)) # maintain order after merge
    # g_ COLUMNS = FEATURES OF FIRST COMMENT (id1)
    features.columns = ['id1'] + [f'g_{x}' for x in range(WORDS)]
    test3 = test3.merge(features,on='id1',how='left')
    # f_ COLUMNS = FEATURES OF SECOND COMMENT (id2)
    features.columns = ['id2'] + [f'f_{x}' for x in range(WORDS)]
    test3 = test3.merge(features,on='id2',how='left')
    test3 = test3.fillna(-1)
    test3 = test3.sort_values('order').drop(['order'],axis=1)
    dtest = cupy.array( test3[FEATURES].values.astype('float32'), order='C' )

    # AVERAGE THE POSITIVE CLASS PROBABILITY OVER ALL FOLD MODELS
    for fold in range(FOLDS):
        test_preds[start:end] += models[fold].predict_proba(dtest)[:,1] / FOLDS
    
print()

# Convert Pair Probabilities into Toxic Scores
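We convert the pair probabilities into toxic scores by simple summation: a comment's score is the sum of the probabilities over all pairs where it is the second comment minus the sum over all pairs where it is the first comment. If we used something more sophisticated like Elo ratings or a Bradley-Terry model, we could probably boost our CV and LB scores.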
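
As a minimal sketch of this conversion, assume the pair probabilities are arranged as an `N x N` matrix `P` in which `P[i, j]` is the predicted probability that comment `j` (the second comment) is more toxic than comment `i` (the first comment). The summation score is then the column sum minus the row sum. The second function is a hypothetical Bradley-Terry alternative using the standard minorization-maximization update with soft wins. It is not what this notebook does, and whether it would improve the score is untested. All names and the toy data are illustrative.

```python
import numpy as np

def sum_scores(P):
    # Sum where the comment is second minus sum where it is first.
    return P.sum(axis=0) - P.sum(axis=1)

def bradley_terry_scores(P, n_iter=200):
    # Hypothetical alternative: fit Bradley-Terry strengths with MM updates.
    # wins[j, i] = soft wins of j over i across both orderings of the pair.
    wins = P.T + (1.0 - P)
    np.fill_diagonal(wins, 0.0)
    total_wins = wins.sum(axis=1)
    s = np.ones(P.shape[0])
    for _ in range(n_iter):
        # Each unordered pair is compared twice (once per ordering).
        denom = 2.0 / (s[:, None] + s[None, :])
        np.fill_diagonal(denom, 0.0)
        s = total_wins / denom.sum(axis=1)
        s /= s.sum()   # strengths are only defined up to scale
    return s

# Toy data: probabilities generated from made-up "true" strengths.
true_s = np.array([0.5, 2.0, 1.0, 4.0, 0.25])
P = true_s[None, :] / (true_s[:, None] + true_s[None, :])

simple = sum_scores(P)
bt = bradley_terry_scores(P)
print(np.argsort(simple), np.argsort(bt))
```

On this consistent toy data both methods recover the same ordering. They can differ when the predicted probabilities are noisy or mutually inconsistent, which is the case a fitted model is meant to handle.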

In [None]:
print('We inferred',len(test_preds),'numbers!')
print('During submit, we will infer 196,000,000 numbers!')
test2['score'] = test_preds

In [None]:
# CONVERT PAIR PROBABILITIES INTO TOXIC SCORES
tmp1 = test2.groupby('id1').score.sum().reset_index()
tmp1.columns = ['comment_id','score1']
tmp2 = test2.groupby('id2').score.sum().reset_index()
tmp2.columns = ['comment_id','score2']
tmp3 = tmp1.merge(tmp2,on='comment_id',how='left')
tmp3['score'] = tmp3.score2 - tmp3.score1
tmp3.head()

# Create Submission.csv
Now we merge the toxic scores onto the comments to score and write `submission.csv`.
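
The final step can be sketched on toy data. The example below uses made-up comment IDs and scores. One comment deliberately has no score so that the mean fill is visible, and `rankdata` with `method='ordinal'` then turns the scores into distinct ranks, breaking ties by order of appearance.

```python
import pandas as pd
from scipy.stats import rankdata

# Toy stand-ins for the comments file and the per-comment scores (made-up values).
comments = pd.DataFrame({'comment_id': [1, 2, 3, 4]})
scores = pd.DataFrame({'comment_id': [1, 2, 4], 'score': [3.0, -1.0, 1.0]})

# Left merge keeps every comment; comment 3 has no score yet.
out = comments.merge(scores, on='comment_id', how='left')
out['score'] = out['score'].fillna(scores['score'].mean())   # fill with the mean score
out['score'] = rankdata(out['score'], method='ordinal')      # distinct integer ranks
print(out)
```

Ranks preserve the ordering of the scores while guaranteeing that every comment receives a distinct value.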

In [None]:
sub = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')
sub.head()

In [None]:
# MERGE TOXIC SCORES ONTO COMMENTS TO SCORE
sub = sub.merge(tmp3[['comment_id','score']].to_pandas(),on='comment_id',how='left')
# FILL COMMENTS WITHOUT A SCORE (E.G. DURING COMMIT) WITH THE MEAN SCORE
sub = sub.fillna(tmp3.score.mean())
# CONVERT SCORES TO RANKS
sub['score']=rankdata( sub['score'], method='ordinal')
sub.sort_values('score')

In [None]:
# WRITE SUBMISSION TO DISK
del sub['text']
sub.to_csv('submission.csv',index=False)