# Overview

- Kaggle jigsaw-toxic-severity-rating competition.
- pipeline
    - data cleaning
    - ridge regression
- learning strategy
    - ~~maxdiff (binary classification)~~
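    - use multiple external datasets
    - weight toxic, severe_toxic, obscene, threat, insult, identity_hate, etc. and convert them into a target variable
        - regression task: count severe_toxic * 2 and sum it with the other labels
        - binary classification: use max({toxic, severe_toxic, obscene, threat, insult, identity_hate}) as the label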
    - blending
- CV strategy:
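    - ~~Recursively trace texts that have appeared together in a comparison pair and give each connected set a single gid.~~
    - ~~GroupKFold on gid~~
    - hold-out (TODO: write up the details, including how it relates to the multiple datasets)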
- training:
    - naive bayes
    - logistic regression
    - ridge regression
    - sgd classifier
    - random forest
- metric:
    - evaluated by the following procedure.
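
A minimal sketch of that procedure, assuming it is the pairwise agreement computed in the Validation cell of this notebook: a pair counts as correct when the comment judged less toxic receives the strictly lower score. The function name `pairwise_accuracy` and the toy scores are hypothetical.

```python
def pairwise_accuracy(less_scores, more_scores):
    """Fraction of pairs where the less toxic comment gets the strictly lower score."""
    if len(less_scores) != len(more_scores):
        raise ValueError("both score lists must have the same length")
    # A tie (equal scores) counts as a miss because the comparison is strict.
    hits = sum(less < more for less, more in zip(less_scores, more_scores))
    return hits / len(less_scores)


# Toy scores (made up for illustration): pair 3 is a tie, pair 4 is ordered wrongly.
less = [0.1, 0.2, 0.5, 0.9]
more = [0.4, 0.3, 0.5, 0.6]
print(pairwise_accuracy(less, more))
```

Because the comparison is strict, a model that outputs many identical scores is penalised on every tied pair.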

Models to build:
- classification: naive bayes, logistic regression, sgd classifier, random forest (4 models)
- regression: lasso, ridge, sgd regressor, random forest (4 models)

TODO
- Check the correlation between local CV and the public LB.
- With the data and the model fixed, try several validation schemes.
- cleaning
    - Apparently there are approaches using NLTK, spaCy, Keras, and torchtext.
- feature extraction:
    - BoW
    - TF-IDF
    - N-gram
    - sentiment and intent analysis
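
One validation scheme worth trying from the TODO list is the struck-out gid grouping from the overview. The sketch below is one possible way to build it, assuming the input is a list of (less_toxic, more_toxic) text pairs. It uses an iterative union-find rather than a recursive search, so texts linked through any chain of comparisons end up with the same group id. The helper `assign_group_ids` and the toy pairs are hypothetical; the ids would then be passed as `groups` to `GroupKFold`.

```python
def assign_group_ids(pairs):
    """Give every text a group id shared by all texts it is transitively paired with."""
    parent = {}

    def find(x):
        # Iterative find with path halving, so long chains do not hit the recursion limit.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for less, more in pairs:
        root_less, root_more = find(less), find(more)
        if root_less != root_more:
            parent[root_more] = root_less  # merge the two components

    # Relabel roots as consecutive integers in order of first appearance.
    gid_of_root = {}
    return {text: gid_of_root.setdefault(find(text), len(gid_of_root)) for text in parent}


pairs = [("a", "b"), ("b", "c"), ("d", "e")]  # toy data
gids = assign_group_ids(pairs)
print(gids)
```

Splitting by these ids keeps a text that appears in several pairs from landing in both the training and validation folds.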

# Directories

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/ruddit-jigsaw-dataset/LICENSE
/kaggle/input/ruddit-jigsaw-dataset/README.md
/kaggle/input/ruddit-jigsaw-dataset/requirements.txt
/kaggle/input/ruddit-jigsaw-dataset/ruddit-comment-extraction.ipynb
/kaggle/input/ruddit-jigsaw-dataset/Dataset/create_dataset_variants.py
/kaggle/input/ruddit-jigsaw-dataset/Dataset/identityterms_group.txt
/kaggle/input/ruddit-jigsaw-dataset/Dataset/Ruddit.csv
/kaggle/input/ruddit-jigsaw-dataset/Dataset/ReadMe.md
/kaggle/input/ruddit-jigsaw-dataset/Dataset/Ruddit_individual_annotations.csv
/kaggle/input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv
/kaggle/input/ruddit-jigsaw-dataset/Dataset/node_dictionary.npy
/kaggle/input/ruddit-jigsaw-dataset/Dataset/post_with_issues.csv
/kaggle/input/ruddit-jigsaw-dataset/Dataset/Thread_structure.txt
/kaggle/input/ruddit-jigsaw-dataset/Dataset/load_node_dictionary.py
/kaggle/input/ruddit-jigsaw-dataset/Dataset/sample_input_file.csv
/kaggle/input/ruddit-jigsaw-dataset/Models/BERT.py
/kaggle/input/ruddi

# Parameters

In [2]:
DEBUG_FLAG = False
VERSION = 'nb04'

SUBMISSION_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv'
VALIDATION_DATA_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv'
COMMENTS_SCORE_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv'
TOXIC3_TRAIN_PATH = '/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv'
RUDDIT_PATH = '/kaggle/input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv'

N_SPLITS = 5 if not DEBUG_FLAG else 2

# Modules

In [3]:
import datetime
import json
import pickle
import random
import re
import sys
import time

import datatable as dt
import gensim
import gensim.downloader as gensim_api
import lightgbm as lgb
import matplotlib as mpl
import matplotlib.pyplot as plt
import nltk
import numpy as np
import optuna
import pandas as pd
import scipy.stats as ss
import seaborn as sns
import transformers

from catboost import CatBoostClassifier
from contextlib import contextmanager
from lime import lime_text
from logging import getLogger, Formatter, FileHandler, StreamHandler, INFO, DEBUG
from matplotlib_venn import venn2
# from optuna.integration import lightgbm as lgb
from scipy.optimize import brute
from sklearn import feature_extraction, model_selection, naive_bayes, pipeline, manifold, preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression, Ridge, SGDClassifier
from sklearn.metrics import confusion_matrix, log_loss, roc_auc_score
from sklearn.model_selection import GroupKFold, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
from tqdm import tqdm

# settings
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

# mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False

# Functions

In [4]:
def reduce_mem_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df


def show_mem_usage():
    """Print every global variable whose shallow size exceeds 1 MiB."""
    print("{}{: >25}{}{: >10}{}".format('|','Variable Name','|','Memory','|'))
    print(" ------------------------------------ ")
    # Snapshot globals() and read the values directly instead of calling eval() on each name.
    for var_name, value in list(globals().items()):
        size = sys.getsizeof(value)
        if not var_name.startswith("_") and size > 1024**2:
            print("{}{: >25}{}{: >6} MiB{}".format('|', var_name, '|', int(size / 1024**2), '|'))


def read_data():
    valid = dt.fread(VALIDATION_DATA_PATH).to_pandas()
    test = dt.fread(COMMENTS_SCORE_PATH).to_pandas()
    submission = dt.fread(SUBMISSION_PATH).to_pandas()
    toxic3 = dt.fread(TOXIC3_TRAIN_PATH).to_pandas()
    ruddit = dt.fread(RUDDIT_PATH).to_pandas()
    
    return valid, test, submission, toxic3, ruddit


def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    """Lowercase, strip punctuation and digits, drop stop words, then stem and/or lemmatise."""
    # clean (lowercase, strip surrounding whitespace, remove punctuation)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    # remove numbers
    text = re.sub(r'\d', '', text)

    # Tokenize (convert from string to list)
    lst_text = text.split()

    # remove stop words (a set gives constant-time membership tests)
    if lst_stopwords is not None:
        stopwords = set(lst_stopwords)
        lst_text = [word for word in lst_text if word not in stopwords]

    # Stemming (remove -ing, -ly, ...)
    if flg_stemm:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    # Lemmatisation (convert the word into its root word)
    if flg_lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    # back to string from list
    text = ' '.join(lst_text)
    return text
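
To see what the regex steps in `utils_preprocess_text` do before stemming or lemmatisation, here is a stdlib-only sketch of the same sequence: lowercase, strip punctuation, drop digits, filter stop words. `basic_clean` is a hypothetical helper, and it omits the NLTK steps so that it runs without downloading any corpora.

```python
import re


def basic_clean(text, stopwords=frozenset()):
    """Lowercase, drop punctuation and digits, then filter stop words."""
    text = re.sub(r"[^\w\s]", "", str(text).lower().strip())  # punctuation
    text = re.sub(r"\d", "", text)                            # digits
    return " ".join(w for w in text.split() if w not in stopwords)


sample = "<p>This article SUCKS!!! 100 times</p>"  # toy input
cleaned = basic_clean(sample, stopwords={"this", "<p>"})
print(cleaned)
```

Because punctuation is removed first, a tag such as `<p>` loses its brackets and fuses with the adjacent word, so neither `<p>` nor `this` matches at the stop-word step in this example.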

# Preparing

In [5]:
valid, test, submission, toxic3, ruddit = read_data()

if DEBUG_FLAG:
    valid = valid.sample(n=1000).reset_index(drop=True)
    toxic3 = toxic3.sample(n=1000).reset_index(drop=True)
    ruddit = ruddit.sample(n=1000).reset_index(drop=True)

print(f'valid shape: {valid.shape}')
print(f'test shape: {test.shape}')
print(f'submission shape: {submission.shape}')
print(f'toxic3 shape: {toxic3.shape}')
print(f'ruddit shape: {ruddit.shape}')

display(valid.head())
display(test.head())
display(submission.head())
display(toxic3.head())
display(ruddit.head())

valid shape: (30108, 3)
test shape: (7537, 2)
submission shape: (7537, 2)
toxic3 shape: (223549, 8)
ruddit shape: (5838, 5)


Unnamed: 0,worker,less_toxic,more_toxic
0,313,This article sucks \n\nwoo woo wooooooo,WHAT!!!!!!!!?!?!!?!?!!?!?!?!?!!!!!!!!!!!!!!!!!...
1,188,"""And yes, people should recognize that but the...",Daphne Guinness \n\nTop of the mornin' my fav...
2,82,"Western Media?\n\nYup, because every crime in...","""Atom you don't believe actual photos of mastu..."
3,347,And you removed it! You numbskull! I don't car...,You seem to have sand in your vagina.\n\nMight...
4,539,smelly vagina \n\nBluerasberry why don't you ...,"hey \n\nway to support nazis, you racist"


Unnamed: 0,comment_id,text
0,114890,"""\n \n\nGjalexei, you asked about whether ther..."
1,732895,"Looks like be have an abuser , can you please ..."
2,1139051,I confess to having complete (and apparently b...
3,1434512,"""\n\nFreud's ideas are certainly much discusse..."
4,2084821,It is not just you. This is a laundry list of ...


Unnamed: 0,comment_id,score
0,114890,0.5
1,732895,0.5
2,1139051,0.5
3,1434512,0.5
4,2084821,0.5


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,False,False,False,False,False,False
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,False,False,False,False,False,False
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",False,False,False,False,False,False
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",False,False,False,False,False,False
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",False,False,False,False,False,False


Unnamed: 0,post_id,comment_id,txt,url,offensiveness_score
0,42g75o,cza1q49,> The difference in average earnings between m...,https://www.reddit.com/r/changemyview/comments...,-0.083
1,42g75o,cza1wdh,"The myth is that the ""gap"" is entirely based o...",https://www.reddit.com/r/changemyview/comments...,-0.022
2,42g75o,cza23qx,[deleted],https://www.reddit.com/r/changemyview/comments...,0.167
3,42g75o,cza2bw8,The assertion is that women get paid less for ...,https://www.reddit.com/r/changemyview/comments...,-0.146
4,42g75o,cza2iji,You said in the OP that's not what they're mea...,https://www.reddit.com/r/changemyview/comments...,-0.083


# Cleaning

In [6]:
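# Stop words
stop_words = list(text.ENGLISH_STOP_WORDS)
# NOTE: utils_preprocess_text removes punctuation before filtering stop words,
# so the tag entries below ('<p>', '/n', ...) and '\n' never match a token.
html_tags = [
    '<p>', '</p>', '<table>', '</table>', '<tr>', '</tr>', '<ul>', '</ul>', '<ol>', '</ol>', '<dl>', '</dl>', 
    '<li>', '</li>', '<dd>', '</dd>', '<dt>', '</dt>', '/n', '\n'
]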
r_buf = [
    'it', 'is', 'are', 'do', 'does', 'did', 'was', 'were', 'will', 'can', 'the', 'a', 'of', 'in', 'and', 'on',
    'what', 'where', 'when', 'which'
]
stop_words = list(set(stop_words + html_tags + r_buf))
print(' '.join(stop_words))

least co we hundred another are yours twenty </ul> due her ltd almost them fifteen interest others at before might here latter himself hers well is do whereby be over an </dd> often system becomes off on become whose already anything sincere not elsewhere whole except etc fifty thence whereafter wherever thin via found sometime everyone bill otherwise someone back it whither each anyway 
 amount whatever seems anyhow two that itself again among never formerly can first nor few un around myself something full four about the which five after seemed he de yet get very move between my next beside into <tr> name beforehand third such also under what him namely please did nowhere would neither give part ie afterwards why up below hereupon cant of eight <p> <li> must whom sixty you everything whoever they towards too whether yourself cannot twelve will became else with couldnt along me top though made only thus anywhere further sometimes while bottom rather since throughout </dl> have fire me

In [7]:
# Cleaning
toxic3 = toxic3.rename(columns={'comment_text': 'text'})
toxic3['text_clean'] = toxic3['text'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))
valid['less_toxic_clean'] = valid['less_toxic'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))
valid['more_toxic_clean'] = valid['more_toxic'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))
# NOTE: the test text is cleaned without stop-word removal (lst_stopwords defaults to None),
# unlike the training and validation text above.
test['text_clean'] = test['text'].apply(lambda x: utils_preprocess_text(x))

# Target variable: sum of the six labels with severe_toxic counted twice, scaled by the observed maximum
toxic3['y'] = toxic3[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1).astype(int) + toxic3['severe_toxic']
toxic3['y'] = toxic3['y'] / toxic3['y'].max()

display(toxic3.head(3))

Unnamed: 0,id,text,toxic,severe_toxic,obscene,threat,insult,identity_hate,text_clean,y
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,False,False,False,False,False,False,explanation edits username hardcore metallica ...,0.0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,False,False,False,False,False,False,daww match background colour im seemingly stuc...,0.0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",False,False,False,False,False,False,hey man im really trying edit war just guy con...,0.0
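
The target construction can be checked by hand on a few toy rows. This is a plain-Python sketch, not the notebook's pandas code: `regression_target` mirrors the sum with severe_toxic counted twice, and `binary_target` is the max-based 2-class label mentioned in the overview. Both function names and the rows are hypothetical.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def regression_target(row):
    """Sum of the six flags with severe_toxic counted twice (unnormalised)."""
    return sum(int(row[label]) for label in LABELS) + int(row["severe_toxic"])


def binary_target(row):
    """1 if any of the six flags is set, else 0 (the max over the flags)."""
    return max(int(row[label]) for label in LABELS)


rows = [  # toy rows, not taken from the dataset
    dict.fromkeys(LABELS, False),
    {**dict.fromkeys(LABELS, False), "toxic": True, "insult": True},
    dict.fromkeys(LABELS, True),
]
raw = [regression_target(r) for r in rows]
y_max = max(raw)
y = [value / y_max for value in raw]  # divide by the observed maximum
binary = [binary_target(r) for r in rows]
print(raw, y, binary)
```

Note that dividing by the observed maximum only yields 1.0 for the largest score present in the data, which may be below the theoretical maximum of 7.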


# Training

In [8]:
%%time

# Training pipeline
features = FeatureUnion([
    ('vect3', 
     TfidfVectorizer(
         min_df=3, 
         max_df=0.5, 
         analyzer='char_wb',
         ngram_range=(3,5)))
])

pipe = Pipeline([
    ('features', features),
    # ('Ridge', Ridge())
    ('RandomForestRegressor', RandomForestRegressor(n_estimators=100, max_depth=32, n_jobs=-1))
])

pipe.fit(toxic3['text_clean'], toxic3['y'])
# get_feature_names_out() needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2
print('Total number of features:', len(pipe['features'].get_feature_names_out()))

Total number of features: 326430
CPU times: user 21h 43min 2s, sys: 41.2 s, total: 21h 43min 43s
Wall time: 5h 32min 49s


In [9]:
# Sort features by weight
if pipe.steps[1][0] == 'Ridge':
    feat = pipe['Ridge'].coef_
elif pipe.steps[1][0] == 'RandomForestRegressor':
    feat = pipe['RandomForestRegressor'].feature_importances_
    
# get_feature_names_out() needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2
feature_weights = sorted(
    list(zip(pipe['features'].get_feature_names_out(), np.round(feat, 4))), 
    key=lambda x:x[1], 
    reverse=True)

display(pd.DataFrame(feature_weights[:20], columns=['feature', 'val']).set_index('feature'))

feature,val
vect3__fuck,0.406
vect3__suck,0.045
vect3__shit,0.0332
vect3__ fag,0.0328
vect3__bitch,0.0265
vect3__ as,0.0167
vect3__nigg,0.0139
vect3__uck,0.0131
vect3__ assh,0.0117
vect3__idiot,0.0106
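
Several feature names in the table start with a space (for example `vect3__ fag` and `vect3__ as`). That follows from `analyzer='char_wb'`, which builds character n-grams only inside word boundaries and pads each word with a space at its edges. The function below is a simplified pure-Python approximation of that behaviour, not scikit-learn's implementation; for instance, the library may handle words shorter than n differently. `char_wb_ngrams` is a hypothetical name.

```python
def char_wb_ngrams(text, min_n=3, max_n=5):
    """Character n-grams taken inside each whitespace-delimited word, padded with one space."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


example = char_wb_ngrams("ass", 3, 4)  # toy input
print(example)
```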


# Prediction

In [10]:
# Inference
valid_preds_less = pipe.predict(valid['less_toxic_clean'])
valid_preds_more = pipe.predict(valid['more_toxic_clean'])
test_preds = pipe.predict(test['text_clean'])

# Validation

In [11]:
# Evaluation: fraction of pairs where the less toxic comment gets the strictly lower score
print(f'validation accuracy is {(valid_preds_less < valid_preds_more).mean()}')

validation accuracy is 0.6012355520127541


# Submit

In [12]:
submission['score'] = test_preds
submission.head()

Unnamed: 0,comment_id,score
0,114890,0.004011
1,732895,0.009816
2,1139051,0.004011
3,1434512,0.00641
4,2084821,0.06423


In [13]:
pd.DataFrame(pd.Series(submission['score'].to_numpy()).describe()).transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0,7537.0,0.107077,0.167398,0.00384,0.009816,0.009816,0.151323,0.822857


In [14]:
submission.to_csv('submission.csv', index=False)