# Begin

The prediction function is in [this notebook](https://www.kaggle.com/seraphwedd18/jigsaw-toxic-severity-word-scoring-prediction), with a max LB score of `0.749` as of the creation of this notebook.

In [None]:
import numpy as np
import pandas as pd
import re

We will be using multiple datasets from previous Jigsaw competitions as complementary train data.

In [None]:
train_1 = pd.read_csv("../input/jigsaw-toxic-severity-rating/validation_data.csv")
train_1

For the main train data, we will simply assign a score of `0` to the texts in the `less_toxic` column and a score of `1` to the texts in the `more_toxic` column. Some texts are duplicated, so we keep only the unique texts for each level of toxicity.

In [None]:
train_1_cleaned = pd.concat(
    [pd.DataFrame({"text":train_1.less_toxic.unique(), "score":np.zeros(train_1.less_toxic.nunique())}),
     pd.DataFrame({"text":train_1.more_toxic.unique(), "score":np.ones(train_1.more_toxic.nunique())})],
    axis=0).reset_index(drop=True)
train_1_cleaned
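
As a small illustration of the labelling step, here is a sketch on hypothetical toy pairs (the texts `"a"`, `"b"`, `"c"` are made up). It also shows that taking uniques per column does not stop a text from appearing under both scores when it occurs in both columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy pairs in the same layout as validation_data.csv.
pairs = pd.DataFrame({
    "less_toxic": ["a", "b", "a"],
    "more_toxic": ["b", "c", "c"],
})

less = pairs.less_toxic.unique()
more = pairs.more_toxic.unique()
labelled = pd.concat(
    [pd.DataFrame({"text": less, "score": np.zeros(len(less))}),
     pd.DataFrame({"text": more, "score": np.ones(len(more))})],
    axis=0).reset_index(drop=True)

# Texts that ended up with both a 0 and a 1 label.
both = sorted(set(less) & set(more))
print(labelled)
print(both)
```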

In [None]:
train_2 = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
train_2

For the data from `jigsaw-unintended-bias-in-toxicity-classification` we will only be using the `comment_text` and the `target` columns.

In [None]:
train_2_cleaned = pd.DataFrame({'text':train_2.comment_text, 'score':train_2.target})
train_2_cleaned['score'] = train_2_cleaned['score'].apply(lambda x: 0 if x <= 0.05 else x)
train_2_cleaned

In [None]:
train_3 = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip")
train_3

For the data from `jigsaw-toxic-comment-classification-challenge`, we will score each comment by the average of the six target columns. As with `train_2`, scores at or below `0.05` are set to `0`.

In [None]:
score = np.mean(train_3[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']], axis=1)
train_3_cleaned = pd.DataFrame({'text':train_3.comment_text, 'score':score})
train_3_cleaned['score'] = train_3_cleaned['score'].apply(lambda x: 0 if x <= 0.05 else x)
train_3_cleaned
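
The `0.05` cutoff can also be written without a per-row `apply`. The sketch below uses hypothetical toy scores and compares the element-wise version with `Series.mask`; it is one possible alternative, not a change to the cells above.

```python
import pandas as pd

# Hypothetical toy scores standing in for `target` or the averaged label columns.
scores = pd.Series([0.0, 0.03, 0.05, 0.2, 0.9])

# Element-wise version, as used in the notebook.
via_apply = scores.apply(lambda x: 0 if x <= 0.05 else x)

# Vectorized equivalent: replace values with 0 wherever the condition holds.
via_mask = scores.mask(scores <= 0.05, 0)

print(via_mask.tolist())
```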

Now, we will merge all train data as one.

In [None]:
train_df = pd.concat(
    [train_1_cleaned, train_2_cleaned, train_3_cleaned],
    axis=0
).reset_index(drop=True)
train_df

Let's check one random comment for each unique score (after rounding to 2 decimal places), printing the first 100 characters of the `repr` of the corresponding text.

In [None]:
printed = []
for i in sorted(train_df.score.unique()):
    n = np.round(i, 2) 
    if n in printed:
        continue
    printed.append(n)
    print(f"{len(printed):<3}: {i:.5f}\t{repr(np.random.choice(train_df[train_df.score==i]['text']))[:100]}")

# Data Preprocessing

In [None]:
from tqdm.notebook import tqdm
tqdm.pandas()
import re

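# Pad every non-word character and underscore with spaces so it becomes its own token
add_space_before_punc = lambda x: re.sub(r'(\W|_)', r' \1 ', x)
# Collapse any run of whitespace into a single space
remove_whitespaces = lambda x: re.sub(r'\s+', ' ', x)
# Collapse runs of 3+ identical characters down to 2 (e.g. "soooo" -> "soo")
remove_multiples = lambda x: re.sub(r'(.)\1{2,}', r'\1\1', x)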

train_df['clean_text'] = train_df.text.progress_apply(
    lambda x: remove_whitespaces(remove_multiples(add_space_before_punc(x)))
)
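
To see what the three steps do to a single string, here is a minimal standalone sketch. The helper name `clean` and the sample text are made up for illustration; the regular expressions and their order match the cell above.

```python
import re

def clean(text):
    # Same three steps as the preprocessing cell, in the same order.
    text = re.sub(r'(\W|_)', r' \1 ', text)      # pad non-word chars and underscores
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)   # collapse runs of 3+ identical chars to 2
    return re.sub(r'\s+', ' ', text)             # squeeze whitespace runs to one space

sample = "Sooooo cool!!!"
print(repr(clean(sample)))  # 'Soo cool ! ! ! '
```

Because padding runs first, repeated punctuation ends up as separate tokens instead of being collapsed, and a trailing space can remain. Neither matters for the later `split()`.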

# Word Scoring

We will score each word based on its frequency. Words that appear in non-toxic comments get low scores, and words that appear exclusively in offensive/toxic comments get high scores.

In [None]:
import json
import gc

neutral_words = {}
toxic_words = {}

def get_words(all_words, toxic=False):
    keys = set(neutral_words.keys())
    print("Looping through all words...")
    for word in tqdm(all_words):
        if toxic and not (word in keys):
            toxic_words[word] = toxic_words.get(word, 0) + 1
        elif not toxic:
            neutral_words[word] = neutral_words.get(word, 0) + 1
            
    with open(f"{'neutral' if not toxic else 'toxic'}_word_counts.json", 'w') as f:
        if toxic:
            json.dump(toxic_words, f)
        else:
            json.dump(neutral_words, f)

#Separate neutral from toxic
print("Separating words...")
neutral_df = (' '.join(train_df[train_df.score == 0]['clean_text'].values)).split()
toxic_df = (' '.join(train_df[train_df.score != 0]['clean_text'].values)).split()
print("Neutral words:", len(neutral_df))
print("Toxic words:", len(toxic_df))

print("Creating word count mapping...")
get_words(neutral_df)
get_words(toxic_df, True)
print(len(neutral_words), len(toxic_words))
del neutral_df, toxic_df
gc.collect()
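
The counting rule can be sketched on a toy example with `collections.Counter`. The token lists below are hypothetical; the point is that a word is counted as toxic only if it never occurs in a neutral comment.

```python
from collections import Counter

# Hypothetical toy token lists standing in for neutral_df / toxic_df.
neutral_tokens = "you are nice and you are kind".split()
toxic_tokens = "you are awful awful person".split()

neutral_counts = Counter(neutral_tokens)
# Count only tokens that never occur in the neutral list.
toxic_counts = Counter(t for t in toxic_tokens if t not in neutral_counts)

print(dict(neutral_counts))
print(dict(toxic_counts))
```

Here `you` and `are` stay neutral even though they also occur in the toxic text, so only `awful` and `person` receive toxic counts.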

Now that we have the word-to-count maps, we can score the words. For neutral words, more frequent words get lower scores and rarer words get higher scores, using the equation `word_score = 0.1 / word_count`.

For toxic words the direction is the same: the rarer the word, the higher its score, and more frequent words get lower scores. We will use the equation `word_score = 256 * 0.995 ** word_count`, where `256` is an arbitrary upper limit on the toxicity score.

In [None]:
neutral_words = {word: 0.1/count for word,count in tqdm(neutral_words.items())}
toxic_words = {word: 256 * 0.995 ** count for word,count in tqdm(toxic_words.items())}
print(f"Neutral min/max: {min(neutral_words.values()):.5f}, {max(neutral_words.values()):.5f}")
print(f"Toxic min/max: {min(toxic_words.values()):.5f}, {max(toxic_words.values()):.5f}")
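
To get a feel for the two formulas, the sketch below evaluates them at a few counts. The function names are hypothetical wrappers around the equations stated above.

```python
def neutral_score(count):
    # word_score = 0.1 / word_count
    return 0.1 / count

def toxic_score(count):
    # word_score = 256 * 0.995 ** word_count
    return 256 * 0.995 ** count

for count in (1, 10, 100, 1000):
    print(f"{count:>5}  neutral={neutral_score(count):.5f}  toxic={toxic_score(count):.5f}")
```

A toxic-only word seen once scores `254.72`, while the rarest neutral word scores `0.1`. Under these formulas the toxic score decays exponentially, so a toxic-only word has to occur roughly 1,566 times before it scores below `0.1`.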

In [None]:
with open("neutral_word_scores.json", 'w') as f:
    json.dump(neutral_words, f)
with open("toxic_word_scores.json", 'w') as f:
    json.dump(toxic_words, f)

# Testing

In [None]:
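def score_sentence(text):
    # Sum per-word scores: the neutral score if the word was seen in neutral
    # comments, otherwise its toxic score. Words found in neither map (possible
    # on unseen data) contribute 0 instead of raising a KeyError (assumption).
    all_words = text.split()
    return sum(neutral_words[word] if word in neutral_words else
               toxic_words.get(word, 0) for word in all_words)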

train_df['text_score'] = train_df.clean_text.progress_apply(lambda x: score_sentence(x))
train_df['text_score'].describe()
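
A minimal sketch of the sentence score on toy data. The maps `toy_neutral` and `toy_toxic` and their values are made up, and the fallback of `0` for words found in neither map is an assumption of this sketch.

```python
# Hypothetical toy score maps; the real ones are built in the Word Scoring section.
toy_neutral = {"you": 0.001, "are": 0.002}
toy_toxic = {"awful": 250.0}

def toy_score_sentence(text):
    # Neutral score if known, else toxic score, else 0 for unseen words (assumption).
    return sum(
        toy_neutral[w] if w in toy_neutral else toy_toxic.get(w, 0)
        for w in text.split()
    )

print(toy_score_sentence("you are awful"))
print(toy_score_sentence("you are unseenword"))
```

A single toxic-only word dominates the sum, which is what lets the score separate toxic from neutral sentences.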

# Submission

Best Score on Public LB is: `0.716`

In [None]:
from scipy.stats import rankdata

test = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")
test['clean_text'] = test.text.progress_apply(
    lambda x: remove_whitespaces(remove_multiples(add_space_before_punc(x)))
)

preds = test.clean_text.progress_apply(lambda x: score_sentence(x))
display(preds, preds.min(), preds.max())
sub = pd.DataFrame({'comment_id':test.comment_id.values, 'score':rankdata(preds, method='ordinal')})
sub.to_csv('submission.csv', index=False)
sub
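
The submission uses ranks rather than raw scores. With `method='ordinal'`, tied values receive distinct ranks in their order of occurrence. The pure-Python helper below is a hypothetical sketch of that behaviour, not SciPy's implementation.

```python
def ordinal_rank(values):
    # Stable sort of indices by value: ties keep their original order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

print(ordinal_rank([0.5, 0.1, 0.5, 2.0]))  # [2, 1, 3, 4]
```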