# Overview

- Sanity check of the pretrained Detoxify model.

### References

- [Using Detoxify in offline mode | Kaggle](https://www.kaggle.com/atamazian/using-detoxify-in-offline-mode/notebook)

In [1]:
# Installs
!cp -r ../input/detoxify/detoxify-master detoxify
!pip install -q ./detoxify
!rm -rf ./detoxify



In [2]:
# Parameters
HUGGINGFACE_CONFIG_PATH = '../input/bert-base-uncased'
CHECKPOINT_PATH = '../input/detoxify-models/toxic_original-c1212f89.ckpt'
MODEL_TYPE = 'original'
SUBMISSION_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv'
COMMENTS_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv'

In [3]:
# Modules
import pandas as pd
import torch
from detoxify import Detoxify
from tqdm import tqdm
from transformers import AutoTokenizer

In [4]:
# My functions

In [5]:
# Examples
max_len = 300
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

detox = Detoxify(model_type=MODEL_TYPE,  
                 checkpoint=CHECKPOINT_PATH,
                 device=device,
                 huggingface_config_path=HUGGINGFACE_CONFIG_PATH)

# Swap in a tokenizer loaded from local files with model_max_length=max_len,
# so that long inputs can be truncated to at most max_len tokens.
detox.tokenizer = AutoTokenizer.from_pretrained(HUGGINGFACE_CONFIG_PATH,
                                                local_files_only=True,
                                                model_max_length=max_len)

results = detox.predict('I am not toxic, sorry!')
print(results)

{'toxicity': 0.0018949232, 'severe_toxicity': 9.441939e-05, 'obscene': 0.00022873534, 'threat': 0.00010636789, 'insult': 0.00019287909, 'identity_attack': 0.00014568506}


In [6]:
comments = pd.read_csv(COMMENTS_PATH, index_col='comment_id')
comments

comment_id,text
114890,"""\n \n\nGjalexei, you asked about whether ther..."
732895,"Looks like be have an abuser , can you please ..."
1139051,I confess to having complete (and apparently b...
1434512,"""\n\nFreud's ideas are certainly much discusse..."
2084821,It is not just you. This is a laundry list of ...
...,...
504235362,"Go away, you annoying vandal."
504235566,This user is a vandal.
504308177,""" \n\nSorry to sound like a pain, but one by f..."
504570375,Well it's pretty fucking irrelevant now I'm un...


In [7]:
class_names = ['toxicity', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack']
tqdm.pandas()
for class_name in class_names:
    # Note: detox.predict returns all six scores at once, so this loop runs the model six times per comment.
    comments[class_name] = comments['text'].progress_map(lambda line: detox.predict(line)[class_name])
    # comments[class_name].plot.hist(bins=50, grid=True)

100%|██████████| 7537/7537 [01:20<00:00, 93.24it/s]
100%|██████████| 7537/7537 [01:19<00:00, 94.87it/s] 
100%|██████████| 7537/7537 [01:19<00:00, 95.24it/s] 
100%|██████████| 7537/7537 [01:19<00:00, 94.67it/s] 
100%|██████████| 7537/7537 [01:20<00:00, 94.04it/s]
100%|██████████| 7537/7537 [01:20<00:00, 93.84it/s] 
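
The loop above calls `detox.predict` once per comment for each of the six classes. Detoxify's `predict` also accepts a list of strings and returns a dict mapping each class name to a list of scores, so a single batched pass could fill all six columns. The following is a minimal sketch under that assumption: `predict_in_batches` and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in so the snippet runs without the model.

```python
from typing import Callable, Dict, List, Sequence


def predict_in_batches(
    texts: Sequence[str],
    predict_fn: Callable[[List[str]], Dict[str, List[float]]],
    batch_size: int = 32,
) -> Dict[str, List[float]]:
    """Run predict_fn over texts in chunks and concatenate the per-class score lists."""
    merged: Dict[str, List[float]] = {}
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_scores = predict_fn(batch)
        for name, values in batch_scores.items():
            merged.setdefault(name, []).extend(float(v) for v in values)
    return merged


def fake_predict(batch: List[str]) -> Dict[str, List[float]]:
    """Hypothetical stand-in for detox.predict: scores depend only on text length."""
    return {
        "toxicity": [len(t) / 100 for t in batch],
        "insult": [len(t) / 200 for t in batch],
    }


scores = predict_in_batches(["a" * 10, "b" * 20, "c" * 30], fake_predict, batch_size=2)
print(scores)
```

In the notebook this might look like `scores = predict_in_batches(comments['text'].tolist(), detox.predict)` followed by `comments[name] = scores[name]` for each class. The batch size that fits in GPU memory would need tuning.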


In [8]:
# Sum the six class probabilities, then add severe_toxicity once more so it carries double weight.
comments['score'] = comments[class_names].sum(axis=1)
comments['score'] += comments['severe_toxicity']
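
To make the weighting concrete, the same formula can be applied by hand to the example prediction printed earlier for 'I am not toxic, sorry!'. This is only an illustrative sketch: `combined_score` is a hypothetical helper, not part of the notebook or of Detoxify.

```python
CLASS_NAMES = ['toxicity', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack']


def combined_score(prediction: dict) -> float:
    """Sum all six class scores, then add severe_toxicity a second time."""
    total = sum(prediction[name] for name in CLASS_NAMES)
    return total + prediction['severe_toxicity']


# Values copied from the example output of detox.predict shown earlier.
example = {
    'toxicity': 0.0018949232,
    'severe_toxicity': 9.441939e-05,
    'obscene': 0.00022873534,
    'threat': 0.00010636789,
    'insult': 0.00019287909,
    'identity_attack': 0.00014568506,
}

score = combined_score(example)
print(f"{score:.6f}")  # prints 0.002757
```

Each class score appears to be a probability between 0 and 1, so the combined score would range from 0 to 7, which is consistent with values above 1 in the submission.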

In [9]:
submission = pd.read_csv(SUBMISSION_PATH)
# Assumes sample_submission.csv lists comment_id in the same order as comments_to_score.csv.
submission['score'] = comments['score'].values
submission.to_csv('submission.csv', index=False)
submission

,comment_id,score
0,114890,0.001460
1,732895,0.004574
2,1139051,0.004588
3,1434512,0.001605
4,2084821,0.422560
...,...,...
7532,504235362,0.529902
7533,504235566,0.072544
7534,504308177,0.006299
7535,504570375,2.404573
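
Copying `comments['score'].values` into `submission` matches rows by position, which only works if both files list the comments in the same order. A more defensive variant looks each score up by `comment_id`. The sketch below uses small made-up frames in place of the notebook's `comments` and `submission`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the notebook's `comments` and `submission` frames (made-up values).
comments = pd.DataFrame(
    {"score": [0.9, 0.1, 0.5]},
    index=pd.Index([30, 10, 20], name="comment_id"),
)
submission = pd.DataFrame({"comment_id": [10, 20, 30], "score": [0.0, 0.0, 0.0]})

# Look up each submission row's score by comment_id instead of by position.
submission["score"] = submission["comment_id"].map(comments["score"])

# Fail loudly if any comment_id had no matching score.
assert submission["score"].notna().all(), "some comment_ids have no score"
print(submission)
```

Here the rows of `comments` are deliberately out of order, and the lookup still assigns each `comment_id` its own score.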
