## Context

This notebook uses the Detoxify model:

https://github.com/unitaryai/detoxify

A discussion of the model is available here:

https://www.kaggle.com/c/jigsaw-toxic-severity-rating/discussion/300058

One way to run it without internet access is described in

*Using Detoxify in offline mode*  
https://www.kaggle.com/atamazian/using-detoxify-in-offline-mode

Thanks, @atamazian!

## Clarification

This configuration has a limitation: the underlying BERT model accepts at most 512 tokens per input. Longer inputs fail with an error like

> RuntimeError: The size of tensor a (___) must match the size of tensor b (512)

so I shorten each comment to a fixed number of characters before prediction.

```
def get_predict_data(model: Callable, data: pd.Series) -> pd.Series:
    """ Create toxic data by detoxify model. """
    # BERT accepts at most 512 tokens, so cap the text length in characters
    max_str_len = 500
    data = data.apply(tc.shorten_text, max_len=max_str_len)
    
    [...]
```

I tried limits of 300 (**V5**) and 500 (**V6**) characters; the longer limit gave the higher score.
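The `tc.shorten_text` helper comes from the `toxic_comments_utilities` module, which is not shown in this notebook. As a rough sketch only, a helper of this kind might truncate to `max_len` characters and prefer to cut at a word boundary. The name `shorten_text_sketch` and its exact behavior are assumptions, not the original implementation:

```python
import pandas as pd


def shorten_text_sketch(text: str, max_len: int = 500) -> str:
    """Hypothetical stand-in for tc.shorten_text: cap text at max_len characters."""
    if len(text) <= max_len:
        return text
    cut = text[:max_len]
    # Prefer to cut at the last space so a word is not split in half;
    # fall back to a hard cut when there is no usable space.
    head, _, _ = cut.rpartition(" ")
    return head.rstrip() or cut


comments = pd.Series(["short comment", "word " * 200])
# Series.apply forwards extra keyword arguments to the function.
shortened = comments.apply(shorten_text_sketch, max_len=500)
print(shortened.str.len().tolist())
```

Short comments pass through unchanged; long ones never exceed `max_len` characters.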

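Also, I didn't use all the predictions: the `toxicity` label is excluded (its weight is `None`), and the final score is the median of the remaining five labels.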

```
    
    labels_list = ['toxicity', 'severe_toxicity',
                   'obscene', 'threat', 'insult',
                   'identity_attack']
    
    # None excludes 'toxicity'; the other labels are used unchanged
    weights_list = [None, 1, 1, 1, 1, 1]

    [...]

    return (predict_df * weights_list).median(axis=1)
```
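To make the effect of the weights concrete, here is a small sketch on made-up prediction values (the numbers are illustrative, not model output). It uses `NaN` rather than `None` for the excluded label, on the assumption that both leave a missing value that the median skips by default:

```python
import numpy as np
import pandas as pd

labels = ["toxicity", "severe_toxicity", "obscene",
          "threat", "insult", "identity_attack"]

# Made-up predictions for two comments (illustrative values only).
predict_df = pd.DataFrame(
    [[0.90, 0.10, 0.70, 0.05, 0.60, 0.02],
     [0.20, 0.01, 0.03, 0.01, 0.04, 0.01]],
    columns=labels,
)

# NaN drops 'toxicity'; the other labels keep their values (weight 1).
weights = pd.Series([np.nan, 1, 1, 1, 1, 1], index=labels)

weighted = predict_df * weights      # aligned by column name
scores = weighted.median(axis=1)     # NaN is skipped (skipna=True by default)

# Same result as dropping the column before taking the median.
assert scores.equals(predict_df.drop(columns="toxicity").median(axis=1))
print(scores)
```

With all remaining weights equal to 1, the weighting is just a way of selecting columns; dropping the column explicitly is an equivalent and arguably more readable alternative.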

# 1. Imports, definitions & data loading

In [None]:
%%capture

!cp -r ../input/detoxify/detoxify-master detoxify
!pip install -q ./detoxify
!rm -rf ./detoxify

In [None]:
import pandas as pd

from typing import Callable

from detoxify import Detoxify

import toxic_comments_utilities as tc

In [None]:
def get_predict_data(model: Callable, data: pd.Series) -> pd.Series:
    """ Create toxic data by detoxify model. """
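    # BERT accepts at most 512 tokens, so cap the text length in characters
    max_str_len = 500
    data = data.apply(tc.shorten_text, max_len=max_str_len)
    
    # Labels the weights below are meant to line up with (kept for reference)
    labels_list = ['toxicity', 'severe_toxicity',
                   'obscene', 'threat', 'insult',
                   'identity_attack']
    
    # None excludes 'toxicity'; the other labels are used unchanged
    weights_list = [None, 1, 1, 1, 1, 1]

    result = []

    predict_labels = model.class_names
        
    for text in data.values:
        # predict() returns a dict mapping each label to its score
        result.append(list(model.predict(text).values()))
        
    predict_df = pd.DataFrame.from_records(
        result,
        index=data.index,
        columns=predict_labels)

    # Row-wise median over the labels that were kept
    return (predict_df * weights_list).median(axis=1)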


def get_score(model: Callable, data: pd.DataFrame) -> float:
    """ Score a model on the validation data. """
    data = data.copy()
    
    data['less_toxic'] = get_predict_data(model, data['less_toxic'])
    data['more_toxic'] = get_predict_data(model, data['more_toxic'])
    
    score = data.eval('less_toxic < more_toxic').mean()
    
    return round(score, 4)


def get_submission(model: Callable, data: pd.DataFrame) -> pd.DataFrame:
    """ Get predicted toxicity scores to submit results. """
    data = data.copy()
    
    data['text'] = get_predict_data(model, data['text'])
    
    return data.rename(columns={'text':'score'})

In [None]:
comments_to_score_path = "../input/jigsaw-toxic-severity-rating/comments_to_score.csv"
validation_data_path = "../input/jigsaw-toxic-severity-rating/validation_data.csv"
score_data = pd.read_csv(comments_to_score_path)
valid_data = pd.read_csv(validation_data_path)

In [None]:
%whos DataFrame

# 2. Get & Check detoxify model

In [None]:
detoxify_model = Detoxify(
    model_type='original',  
    checkpoint='../input/detoxify-models/toxic_original-c1212f89.ckpt',
    huggingface_config_path='../input/bert-base-uncased',
    device='cuda'
) 
# Use device='cpu' when no GPU is available

In [None]:
predicts_dict = detoxify_model.predict("I'll tell you about toxicity labels.")

pd.DataFrame.from_dict(predicts_dict, orient='index', columns=['predict'])

In [None]:
%%time
get_score(detoxify_model, valid_data)

```
CPU times: user 10min 36s, sys: 812 ms, total: 10min 37s
Wall time: 10min 38s

0.6946
```
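The value returned by `get_score` is the fraction of validation pairs in which the comment annotated as less toxic receives a lower predicted score than the one annotated as more toxic. A toy sketch with made-up scores (the values are illustrative only) shows the computation:

```python
import pandas as pd

# Made-up predicted scores for four annotated pairs.
pairs = pd.DataFrame({
    "less_toxic": [0.10, 0.40, 0.30, 0.05],
    "more_toxic": [0.60, 0.20, 0.30, 0.50],
})

# A pair counts as correct only when the ordering is strictly right.
correct = pairs.eval("less_toxic < more_toxic")
score = round(correct.mean(), 4)
print(score)
```

Because the comparison is strict, a tie (the third pair here) counts as a miss.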

# 3. Create & Save submission

In [None]:
%%time
submission = get_submission(detoxify_model, score_data)

In [None]:
submission.to_csv("submission.csv", index=False)

In [None]:
submission