# Overview

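- Score the comments with a pretrained toxicity model from Hugging Face.
- A notebook with internet access enabled cannot be submitted, so the model is taken from a notebook that has already been imported into Kaggle.
- As an experiment, multiply the predicted score by -1 and use the result as the final score.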
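
The second point above can be sketched as two small helpers: one that is run once with internet access to save the model files, and one that loads them from a local folder afterwards. This is only a sketch under stated assumptions: the helper names `export_for_offline_use` and `load_offline` are hypothetical, and the Hub id `unitary/toxic-bert` is inferred from the first reference link rather than confirmed by this notebook.

```python
from pathlib import Path


def export_for_offline_use(model_id: str, out_dir: str) -> Path:
    """Download a Hub model once (internet ON) and save it to a local folder."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    AutoTokenizer.from_pretrained(model_id).save_pretrained(target)
    AutoModelForSequenceClassification.from_pretrained(model_id).save_pretrained(target)
    return target


def load_offline(local_dir: str):
    """Load tokenizer and model from local files only (internet OFF)."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        local_dir, local_files_only=True
    )
    return tokenizer, model


# Hypothetical usage (model id is an assumption, see the lead-in):
#   export_for_offline_use("unitary/toxic-bert", "toxic-bert")   # internet ON
#   tokenizer, model = load_offline("/kaggle/input/toxic-bert")  # internet OFF
```

The saved folder would then be attached to the submission notebook as input data, which is what the `/kaggle/input/toxic-bert` path used below points to.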

References
- [🧩 EDA + Sentiment Analysis + Benchmark Baseline🧩 | Kaggle](https://www.kaggle.com/coldfir3/eda-sentiment-analysis-benchmark-baseline#Baseline-using-unitary/toxic-bert-model)
- [download_huggingface_pretrain_for_kaggle | Kaggle](https://www.kaggle.com/quincyqiang/download-huggingface-pretrain-for-kaggle/notebook)

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
import torch

In [2]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
MODEL_NAME = '/kaggle/input/toxic-bert'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device)

In [3]:
# Load the full set of comments to score (no slicing is applied).
comments_to_score = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')

In [4]:
BS = 8  # batch size

def get_comments_to_score():
    """Yield the comment texts as lists of at most BS items."""
    txts = comments_to_score['text'].values
    for i in range(0, len(txts), BS):
        yield txts[i : i + BS].tolist()
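
The generator above yields one list per batch, and the last list may be shorter than `BS`. The following sketch uses plain Python lists and two hypothetical helpers (`batched`, `num_batches`) rather than the notebook's DataFrame, to show that the number of batches is the ceiling of `n / BS`. The item count of 7537 is taken from the `describe()` output at the end of this notebook; the texts themselves are stand-ins.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def num_batches(n_items, batch_size):
    """Ceiling division: count the final, possibly partial, batch as well."""
    return (n_items + batch_size - 1) // batch_size


texts = [f"comment {i}" for i in range(7537)]  # stand-in data, not the real comments
batches = list(batched(texts, 8))

print(len(batches))          # 943
print(num_batches(7537, 8))  # 943
print(len(batches[-1]))      # 1
```

Floor division (`7537 // 8`) gives 942, one fewer than the number of batches actually produced, which matters when the count is passed to a progress bar.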

In [5]:
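outputs = []
# Ceiling division: the last batch may be partial, so floor division undercounts by one.
n_batches = (len(comments_to_score) + BS - 1) // BS
model.eval()  # inference mode (disables dropout)
with torch.no_grad():  # gradients are not needed for scoring
    for sequences in tqdm(get_comments_to_score(), total=n_batches):
        tokens = tokenizer(sequences,
                           padding=True,
                           truncation=True,
                           add_special_tokens=True,
                           return_tensors="pt").to(device)
        output = model(**tokens)
        outputs.append(output['logits'].cpu().numpy())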


In [6]:
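# Keep the first logit of each row as the toxicity prediction.
predictions = np.concatenate(outputs)[:,0]

# Experiment: negate the prediction to form the submitted score.
comments_to_score['score'] = -1 * predictions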
comments_to_score = comments_to_score.drop('text', axis = 1)

comments_to_score.to_csv('submission.csv', index = False)
comments_to_score

| | comment_id | score |
|---|---|---|
| 0 | 114890 | 7.450975 |
| 1 | 732895 | 5.634093 |
| 2 | 1139051 | 5.636992 |
| 3 | 1434512 | 6.059226 |
| 4 | 2084821 | 0.620955 |
| ... | ... | ... |
| 7532 | 504235362 | -0.007879 |
| 7533 | 504235566 | 2.626713 |
| 7534 | 504308177 | 5.280117 |
| 7535 | 504570375 | -4.360547 |
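
Negating the scores, as done in the cell above, reverses the ranking of the comments. As a minimal sketch, assume the submission is judged by how often the comment labelled more toxic in a pair receives the higher score. Under that assumption, and with no ties, negation turns an agreement rate of `a` into `1 - a`. The `pairwise_agreement` helper, the ids, and the numbers below are all hypothetical and are not competition data.

```python
def pairwise_agreement(scores, pairs):
    """Fraction of (less_toxic, more_toxic) id pairs ranked in the expected order."""
    hits = sum(scores[less] < scores[more] for less, more in pairs)
    return hits / len(pairs)


# Toy scores keyed by a made-up comment id.
scores = {"a": 0.5, "b": 2.0, "c": -1.5, "d": 3.0}
# Each tuple is (less toxic id, more toxic id); also made up.
pairs = [("a", "b"), ("c", "a"), ("d", "b"), ("c", "d")]

# Same scores with the sign flipped, as in the experiment.
negated = {cid: -s for cid, s in scores.items()}

print(pairwise_agreement(scores, pairs))   # 0.75
print(pairwise_agreement(negated, pairs))  # 0.25
```

Under this assumption, a negated submission that scores well would suggest the original sign ranked comments in the wrong direction, and a poor score would suggest the opposite.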


In [7]:
pd.DataFrame(pd.Series(predictions.ravel()).describe()).transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 7537.0 | -0.514425 | 4.21569 | -7.542584 | -4.083736 | -0.695464 | 2.978479 | 7.228723 |
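
The summary above ranges from about -7.5 to 7.2, so `predictions` holds raw logits rather than probabilities. If probabilities are wanted, a sigmoid could be applied. This sketch assumes the model is a multi-label classifier with independent per-label outputs, which is an assumption about the checkpoint and not something this notebook verifies. It uses plain Python and rounded values from the summary table.

```python
import math


def sigmoid(x):
    """Map a logit to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


# Rounded min, median, and max logits from the describe() output.
logits = [-7.54, -0.70, 7.23]
probs = [sigmoid(x) for x in logits]

for logit, prob in zip(logits, probs):
    print(f"{logit:+.2f} -> {prob:.4f}")

# Monotonic: sorting by logit and sorting by probability give the same order.
order_by_logit = sorted(range(len(logits)), key=lambda i: logits[i])
order_by_prob = sorted(range(len(probs)), key=lambda i: probs[i])
print(order_by_logit == order_by_prob)  # True
```

Because the sigmoid is monotonic, applying it would not change how the comments are ranked relative to each other; it only rescales the values.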
