# Overview

- Kaggle jigsaw-toxic-severity-rating competition.
- pipeline
    - data cleaning
    - feature extraction
    - ridge regression
- learning strategy
    - ~~maxdiff (binary classification)~~
    - external dataset: 
        - ruddit
        - 1st toxic comp
        - 3rd toxic comp.
- CV strategy:
    - ~~Recursively trace texts that have appeared together in a comparison pair and assign each connected group a single gid.~~
    - ~~GroupKFold on gid~~
    - **adversarial validation for feature selection**
- training:
    - ~~naive bayes~~
    - ~~logistic regression~~
    - ridge regression
    - ~~sgd classifier~~
    - random forest
- metric:
    - accuracy of comparable annotation
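
The metric in the last bullet can be sketched as follows: a pair counts as correct when the model scores the `more_toxic` text higher than the `less_toxic` one. The function name `pairwise_accuracy`, the toy scores, and the choice to count ties as wrong are assumptions made for illustration, not taken from the notebook.

```python
import numpy as np


def pairwise_accuracy(less_scores, more_scores):
    """Fraction of pairs where the more_toxic text receives the higher score."""
    less = np.asarray(less_scores, dtype=float)
    more = np.asarray(more_scores, dtype=float)
    if less.shape != more.shape:
        raise ValueError("both score arrays must have the same shape")
    # Assumption: ties are counted as wrong (strict inequality).
    return float(np.mean(less < more))


# Hypothetical model scores for four annotated pairs.
less_toxic_scores = [0.1, 0.4, 0.7, 0.2]
more_toxic_scores = [0.9, 0.3, 0.7, 0.5]
acc = pairwise_accuracy(less_toxic_scores, more_toxic_scores)
print(acc)
```

With the validation data, the two arrays would be the model's scores for the `less_toxic` and `more_toxic` columns, row by row.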

TODO
- adversarial validation for feature selection


References
- [Adversarial Validation to Select Validation Data for Evaluating Performance in E-commerce Purchase Intent Prediction | SIGIR eCom 2021](https://sigir-ecom.github.io/ecom21DCPapers/paper3.pdf)
- [Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber | AdKDD 2020](https://arxiv.org/pdf/2004.03045.pdf)

# Directories

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test_labels.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation-processed-seqlen128.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test-processed-seqlen128.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train-processed-seqlen128.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv
/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-processed-seqlen128.csv
/kaggle/input/ruddit-jigsaw-dataset/LICENSE
/kaggle/input/ruddit-jigsaw-data

# Parameters

In [2]:
DEBUG_FLAG = False
VERSION = 'nb05'

SUBMISSION_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv'
VALIDATION_DATA_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv'
COMMENTS_SCORE_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv'
TOXIC3_TRAIN_PATH = '/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv'
RUDDIT_PATH = '/kaggle/input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv'

N_SPLITS = 5 if not DEBUG_FLAG else 2

# Modules

In [3]:
import datetime
import json
import pickle
import random
import re
import sys
import time

import datatable as dt
import gensim
import gensim.downloader as gensim_api
import lightgbm as lgb
import matplotlib as mpl
import matplotlib.pyplot as plt
import nltk
import numpy as np
import optuna
import pandas as pd
import scipy.stats as ss
import seaborn as sns
import transformers

from catboost import CatBoostClassifier
from contextlib import contextmanager
from lime import lime_text
from logging import getLogger, Formatter, FileHandler, StreamHandler, INFO, DEBUG
from matplotlib_venn import venn2
# from optuna.integration import lightgbm as lgb
from scipy.optimize import brute
from sklearn import feature_extraction, model_selection, naive_bayes, pipeline, manifold, preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression, Ridge, SGDClassifier
from sklearn.metrics import confusion_matrix, log_loss, roc_auc_score
from sklearn.model_selection import GroupKFold, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
from tqdm import tqdm

# settings
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

# mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False

# Functions

In [4]:
def reduce_mem_usage(df, verbose=True):
    # Downcast each numeric column to the smallest dtype whose range covers its values.
    numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            'Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)'.format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df


def show_mem_usage():
    # List global variables whose shallow size (sys.getsizeof) exceeds 1 MiB.
    print('|{: >25}|{: >10}|'.format('Variable Name', 'Memory'))
    print(' ------------------------------------ ')
    # Snapshot the items so the dict cannot change size while iterating.
    for var_name, value in list(globals().items()):
        size = sys.getsizeof(value)
        if not var_name.startswith('_') and size > 1024**2:
            print('|{: >25}|{: >6} MiB|'.format(var_name, int(size / 1024**2)))


def read_data():
    valid = dt.fread(VALIDATION_DATA_PATH).to_pandas()
    test = dt.fread(COMMENTS_SCORE_PATH).to_pandas()
    submission = dt.fread(SUBMISSION_PATH).to_pandas()
    toxic3 = dt.fread(TOXIC3_TRAIN_PATH).to_pandas()
    ruddit = dt.fread(RUDDIT_PATH).to_pandas()
    
    return valid, test, submission, toxic3, ruddit


def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    # clean: lowercase, strip surrounding whitespace, and remove punctuation
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())
    
    # remove numbers
    text = re.sub(r'[\d]', '', text)
            
    # Tokenize (convert from string to list)
    lst_text = text.split()
    # remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]
                
    # Stemming (strip suffixes such as -ing, -ly, ...)
    if flg_stemm:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    # Lemmatisation (reduce each word to its dictionary form)
    if flg_lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]
            
    # back to string from list
    text = ' '.join(lst_text)
    return text
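
One caveat about `reduce_mem_usage`, sketched with NumPy only: it picks the dtype from the value range alone, so any float column whose range fits is downcast to `float16`. That type keeps roughly three significant decimal digits and cannot represent every integer above 2048. For a regression target such as `offensiveness_score`, `float32` may therefore be the safer choice. The sample values below are illustrative.

```python
import numpy as np

# float16 stores an 11-bit significand, so values are rounded noticeably.
score = 0.083  # a value similar to offensiveness_score
score16 = float(np.float16(score))
rounding_error = abs(score16 - score)
print(score16, rounding_error)

# Integers above 2048 are no longer all representable in float16.
big16 = float(np.float16(2049))
print(big16)

# float32 round-trips the same score far more closely.
score32 = float(np.float32(score))
error32 = abs(score32 - score)
print(error32)
```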

# Preparing

In [5]:
valid, test, submission, toxic3, ruddit = read_data()

if DEBUG_FLAG:
    valid = valid.sample(n=1000).reset_index(drop=True)
    toxic3 = toxic3.sample(n=1000).reset_index(drop=True)
    ruddit = ruddit.sample(n=1000).reset_index(drop=True)

print(f'valid shape: {valid.shape}')
print(f'test shape: {test.shape}')
print(f'submission shape: {submission.shape}')
print(f'toxic3 shape: {toxic3.shape}')
print(f'ruddit shape: {ruddit.shape}')

display(valid.head())
display(test.head())
display(submission.head())
display(toxic3.head())
display(ruddit.head())

valid shape: (30108, 3)
test shape: (7537, 2)
submission shape: (7537, 2)
toxic3 shape: (223549, 8)
ruddit shape: (5838, 5)


Unnamed: 0,worker,less_toxic,more_toxic
0,313,This article sucks \n\nwoo woo wooooooo,WHAT!!!!!!!!?!?!!?!?!!?!?!?!?!!!!!!!!!!!!!!!!!...
1,188,"""And yes, people should recognize that but the...",Daphne Guinness \n\nTop of the mornin' my fav...
2,82,"Western Media?\n\nYup, because every crime in...","""Atom you don't believe actual photos of mastu..."
3,347,And you removed it! You numbskull! I don't car...,You seem to have sand in your vagina.\n\nMight...
4,539,smelly vagina \n\nBluerasberry why don't you ...,"hey \n\nway to support nazis, you racist"


Unnamed: 0,comment_id,text
0,114890,"""\n \n\nGjalexei, you asked about whether ther..."
1,732895,"Looks like be have an abuser , can you please ..."
2,1139051,I confess to having complete (and apparently b...
3,1434512,"""\n\nFreud's ideas are certainly much discusse..."
4,2084821,It is not just you. This is a laundry list of ...


Unnamed: 0,comment_id,score
0,114890,0.5
1,732895,0.5
2,1139051,0.5
3,1434512,0.5
4,2084821,0.5


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,False,False,False,False,False,False
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,False,False,False,False,False,False
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",False,False,False,False,False,False
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",False,False,False,False,False,False
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",False,False,False,False,False,False


Unnamed: 0,post_id,comment_id,txt,url,offensiveness_score
0,42g75o,cza1q49,> The difference in average earnings between m...,https://www.reddit.com/r/changemyview/comments...,-0.083
1,42g75o,cza1wdh,"The myth is that the ""gap"" is entirely based o...",https://www.reddit.com/r/changemyview/comments...,-0.022
2,42g75o,cza23qx,[deleted],https://www.reddit.com/r/changemyview/comments...,0.167
3,42g75o,cza2bw8,The assertion is that women get paid less for ...,https://www.reddit.com/r/changemyview/comments...,-0.146
4,42g75o,cza2iji,You said in the OP that's not what they're mea...,https://www.reddit.com/r/changemyview/comments...,-0.083


# Cleaning

In [6]:
# stop words
stop_words = list(text.ENGLISH_STOP_WORDS)

# Average AUC: 0.858978 -> 0.746329
adv_val01 = 'wa ha ridiculous le redirect fuck stupid thanks idiot gay sockpuppet damn vandalism stop hell nonsense section crap racist troll useless vandal fucking asshole shit silly moron life jew faggot clown article dick loser arrogant sock people ignorant guy bullshit suck fool help welcome source nazi hate homosexual cunt retarded wtf muslim puppet ignorance nigger fag jerk pathetic block kill issue wikipedia bitch link discussion sick piss thank list brain shut utc retard penis sockpuppets consensus die dont islam reverting vandalizing attack nerd terrorist quit nut arrogance dumb liar lie like wanker sex vandalising youre man arab ugly dumbass black'.split()

# Average AUC: 0.746329 -> 0.724437
adv_val02 = 'piece little blocked white face deleting ive ban cock garbage real head rape bunch hey warning eat god twat date evil wikiproject fat fuckin test jewish idiotic excuse dude friend trolling review pissed porn version oh request delete archive nose template insult grow reference harassing care specific listed shove image material queer joke scum truth listen disgusting reliable july check pig prick book admins sexual sockpuppetry fucker event january banned lame calling crazy dare info currently just bet happy policy lol dirty bastard best douchebag douche blocking added worst wikipediasuspected rubbish fight note personal childish deletion threat fact shoot admin'.split()

# Average AUC: 0.724437 -> 0.705158
adv_val03 = 'hypocrite worked coward mother motherfucker new bag away boy kid citation used lick power cheer february bloody lazy want going nigga goddamn offensive hitler ur abuse wanna error gonna leave freak obviously insane speedy mentioned dickhead fucked addition season copy appreciate ya reverted person copyright published child big enjoy description homosexuality woman insulting subject called sure racism result different lying regard september text wiki rude wow annoying prostitute waste notable main experimenting number biased looking status stick zionist series search according accused company moved stupidity worthless change november abusive fun mouth august primary interested jackass rot afd ip particular explanation'.split()

# Average AUC: 0.705158 -> 0.688806
adv_val04 = 'contributing title notability uploaded criterion yo men screw provide pussy created ga news butt fixed murder player posting discussed reader include banning cuz got computer improvement current proposed talk sit likely know work linked burn state administrator imbecile proposal vandalized improve removed relevant correct cocksucking nasty example vandalise stalker nomination trying blow category merged jimbo user disambiguation update aware lead censorship kindly racial process feedback map june cum dictator appears stalking interesting killed motherfucking term biography deserve space possible hypocritical throw film killing add greek guideline information male geek arse question watch come appropriate field im scumbag requested conflict agree'.split()

# Average AUC: 0.688806 -> 0.678011
adv_val05 = 'tag sourced paragraph dyk notice fuckhead damned discus infobox recent use recently acting creating summary ref buddy style contact crappy placed total cease vote common unfortunately helpful sad merge son appear shame fascist act separate religious changed bully content scared available basement inclusion moment mexican sign general united mom experiment worm specifically messing punk hispanic christian kiss wasting ball accusing threatening lesbian birth chance destroy masturbation somebody checkuser disgrace admit stub mention love rfc yeah authority baby particularly stuff homophobic semiprotected cocksucker fine ruining bad box islamist regarding suspected telling arent slur fing exact draft anal altered live design automatically'.split()

# Average AUC: 0.678011 -> 0.667824
adv_val06 = 'manual requesting file sentence input seriously ask multiple fair tagged mum appreciated updated insulted dog laughable university original independent definition homo unsourced featured petty explaining tell copyrighted accuse speedily signature december laid right easier bother journal wit great release introduction palestinian girl win monkey illiterate create project brief censoring sleep utter free medium photo melmac thing world everybody bisexual important shitty swear uk dipshit tired author village protect moderator controversy violating indicate expand wont station existing freedom rant statement ho point hindu bollock saying pay wording night anus corrected proof convention kind attacked understanding sucking shes today solution whining support'.split()

# Average AUC: 0.667824 -> 0.655644
adv_val07 = 'rationale tutorial pillar meet using biggest context data wikipediafiles expanded hello contribs week trash deleted really username talking mad oppose participate needed turk hesitate bum permission attitude aint district hurt technical action religion solid clever th whore described narrowminded terrorism improved including accurate pompous negro chicken vagina program dad destroying meatpuppet kicked try useful reverts ethnic presented virgin allow society american pat rapist rule wikipediaimages potential pretend tit restored international site ani girlfriend remark mr replaced bbc bias innocent fellow code acceptable war told building able replied audi included believe spelling repeatedly listing massive fanatic previously standard agreed obese music'.split()

# Average AUC: 0.655644 -> 0.644016
adv_val08 = 'entry helpme tilde uploading nation forever belong said censor rat nominated islamic merging nonfree merger bush constantly replace feature future ruin split bullying tool reviewing foul hole dear suggestion explained worse given asswipe rewrite central cite table funny wikipediaquestions october feeling wake plain clue moving secondary disagree censored dislike nerve missing stink need punished similar ahead upload turkish half height item confusion bnp study ludicrous spend corrupt desperate accepted silliness fac followed color eh lede song pushing skin speech situation erasing pedophile election requires killer formatting rewritten objection attacking hair tosser violence coverage various produce lunatic warn document start finding'.split()

# Average AUC: 0.644016 -> 0.630651
adv_val09 = 'place armenian format necessary respect bye detailed excellent wikipediaarticles sue suggested adolf propol hunt confusing character propaganda importance nominate related lmao reported warned web resolved crime poop death communist everyday wprs north thread license truly present reach footnote listas restore gut hang smell sht directly requirement tagging far edition mothjer curse laugh wikipediawikiproject young verifiability significant st engine external wikipedian bit fucken claiming fake tolerated haha usually meant untill bitching soul proud rudeness freaking intellectual friggin let verifiable testing improving shortly impression confirm jump favor chink cougar shared software complaining boot huh wizard development pas covered prose unclear contempt race'.split()

# Average AUC: 0.630651 -> 0.624885
adv_val10 = 'avoid bite cited business inline deserved stfu reached minor newspaper early assert fed db chan additional atheist low prod upset task origin naming inform hide historian foundation deletes press refrain police provides contested vandilism explain encyclopedic script creep addressed spree reviewer complain thousand position maggot delay reflect alternative squad genocide human dishonest shithead propose hateful swallow lulz pornography youve clicking fagget skill iq punch twisted period feel republican pride dead renaming unit referenced progress masturbate named transsexual hatred absurd calm intelligent plan air prison promotional blank filthy substantially typical smear ii appeal ridiculously newsletter job punish reply rollback antisemite copied'.split()

# Average AUC: 0.624885 -> 0.544398
# 1000 words cut
adv_val11 = 'generally contest sake running pull boyfriend chap angry noticed club anon notification interview fan sooooo significance offend gamaliel perverted clarification model run desire ping guide kiddo jun structure wikipediaadministrators internet mail disputed chart quote later translation neutral ilovedirtbikes blatantly good concern wikipeida furry guess destroyed rename area zomg editorial rightwing mar location scientific youll direct talkback blowjob highly devil pack playing lisa specify subsection gallery publication true yall hot delanoy apply hero commented bhaisaab option topic sucked bout caption function report service falsely thug rfa aid located criminal ta redirected gfdl bigoted bigot dummy hebrew leg reporting boo aussie answered reviewed shall lecture talkpage yea pov apr incompetent saint burning sexuality banner refuse stinking route blind giant horrible light wikinazi pretending train civil passed witch march harrassment cleanup hasbara fcuk spreading wipe privilege advertisement sanchez written board goof updating railway sexually translated murderer dork specified principle tc pity nerdy korean ongoing scumbags immature promise lover fred fukin responded portal nov old coupled cool behave slime sweet discovered album accordingly logo cup eye egyptian doing revert wasted organization sufficient coi ryan warring fck tone formal proper short cancer unlike tv padding assistance lgagnon cabal spit earlier incapable terrible quotation jeff ethnicity rediculous inappropriate feces cliff meaning specie werent sep fk dragon smart incest interpretation unless awful weep year facist git attempt intended strange copyvio historical city grave wale hold dheyward neonazi yay whiny frickin advance mod horse throat match raping toilet proven sucker merit throwing naked overall faking jamie involvement territory pronunciation userpages accordance satan decision orientation citing watchlist industry successful deleteing yankee passage regret bull 
verify deaf handle hahahahaha rip prior heck lock outraged cat marked submission horror properly sexy bigd delusional suitable coldplay minimum mommy problematic appeared direction cyber vomit canada shakespeare selfesteem instruction suffer house unreferenced illegal darn jerkoff forward crack perspective spell false sunni bongwarrior jack parameter peice wikipediaimage reality realize require pusher quite israel european critic develop forgot gotta office defending disagreement harrasing gender nationalist wikipeia promotion wiping thief antisemitic fatty mulatto kissing looked discussing following booo selfrighteous cheek helping kidding bos case banana bakutrix mama twit mafia nationalistic width weiner county freakin closed perfectly ashamed cretin subpage horrid ass self liberal email red turner jw unsigned completely photograph growing rid nice doo redirects product userpage expansion antiamerican revolve vain calzaghe fcking paedophile miserable tomorrow desist profanity division couple april value foreign tabtab creation polish jlatondre wall hand mongo cowardly talkthe poland existence fist offended introduce anti browser grandmother kick sandbox archiving cause checking raised turn preference intro dat master titty enemy conclusion gibberish inferior modified understood gunna maybe controversial technology judging certainly semitic justify massacre played chase logging set flame michael imdb anthem resource episode procedure standing participant tip wrong moaning bc wonder bored antic pant country paste court characteristic production cop akbar serf law smug yes directory changing effort border historic party stable reasonable supremacist dreamguy unattractive desk actor idiocy scope concept remember size bombed button donate belive honestly wikipedians art candidate unban contains capacity somewhat hated coming valid hunjan consistency overweight comparison threaten rating dumbhead effect infoboxes integrity build mentioning sensitive dr turd dance fa 
learn paper square edits london channel wannabe open creator bros wik archived wekepedia providing band legitimate clarify ll unblock heterosexual poor hail plot convicted dosent tactic molester falsehood lifted pissing method nude allowed expose politics japanese female french america proceed raise transgender thats brainless raped webpage dictionary holder tedious spare donkey idoit looser nail invitation limited energy language defend reception tree ridicule nuisance pump wikistalking prof aryan fggt wild hiding evula behaviour spic play transcluded recorded indulgent kinky mutha huge ship jackson freud built whats moronic make gwernol park critical eff health syntax removal yur problem working taste previous ok christ provided ballsack tongue roti omnigan vicious appearance adress accusation picking judaism english suppose evidence gray active apple revealed yep proving form zionism recommend jpg finish fix abusing preferred science attempting wikipedias static ensure overview supressing choke wrongfully goon brave vandalsim foreskin happen destructobot dolphin river wank nawlinwiki single shii beware analysis necessarily mess explains conformance unblocked obama connection probably perfect hypocrisy tramp wikimedia screwing indefinitely onorem hee supply flag performance ybm saddo teach natural dildo domain pervert nambla advertising factor faggoty russian holier agenda fukk totally official affect preceding officially blowing bound specifies intimidate compromise romanian definitely confused holocaust making sockpuppeting participated inane animal zhanzhao malta slanderous misleading promote advise despicable impotent grudge traitor swine vegan protecting itll harassment pestering island ghost mah census llama twitter hahaha anymore nad tend testicle mentaly diamond wondering selfpublished broken url david schmuck stats announced missed chriss mutt choose month theory bothering stylewidth produced fantasy geeky allegation semen kurdistan sitush spaniard 
directed nonnotable tyrant cockblaster permitted tandem gee licensed shallow clearly middle disease born hoo scjessey stone independence defense orphaned consider initial obnoxious washing released flow antivandalism supposedly smash marking slant shouldnt interfering renamed slimy incivility heard korn tour powerhungry mf olyeller football fattest mode gnaa imperialist blah stomp overreacted wish ergo horny drunk centre adult correctly baboon professor maniac agrees cooperation opportunity suprise erase yanksox special peer reediting limit learning parent baptist boi colour amusing reset imagine commentary logic classification irrelevant verticalaligntop muhammad cocaine straight politically simply douchebags slutty outdated establish kong deserves economy yamla scotland sooner german sense shameful infact universe selecting ahole thirdparty extremely dufuses quickly expanding trial similarly wider novel commit crock atkins monster terminator aircraft ohnoitsjamie cute shia equate diem hal briefly musician wobbs dumbest saxifrage unknown kite orgy appoloboy instance unique foolish rare generic restriction faced cream cellpadding libel lousy grammar director decide nuff wikilove expected guitar hominem crawl herunar orane homoerotic cuss chairmanofall concur megan auburnpilot required tranny refugee rival representation order average hour foreigner mothafucka doesnt minded management pdf smartest key hoe antihindu whine feb wpspi class shitload jesus biographical portion took sss abused replacement basketball sourcing oneday sadistic bahamut senseless mediation balance freepsbane technique opposition judgment blockd bryan construed oswald tough poll survey nazism frequently thought department unprotect shithole confirmation filth inferiority'.split()

# Average AUC: 0.544398 -> 0.485834
# 1000 words cut
adv_val12 = 'render dropdown backgroundcolorffffa logged backgroundcolor crybaby filed determine research contribution juicy complaint enjoying replaceable recreate view californiaalibaba load fuckdamn assume pile threating dong agreement cracka dream declined victim column bitchy noticeboardincidents goin constitutes wet publisher arbcom deviant encyclopedia harassed submitted type ready extremist prank informative atta threatened animalfucker vehicle nonsensical bro pretentious acorn meatpuppets heading sack dvd seven received letter borderpx soldier blp allahu styleverticalalign default speaker engaging absolutely variant christmas jealous civility john congratualtions janitor community cycle mabye snorting bowel developed pointless defamation insert somone relevance near qualifies click spite teahouse joining cookie password outline bizarre applies horribly track conform guidance center mo oxford documentation determined construction silenced literature raging contentious devoting money rowspan base established talkwikiproject attract presumably verified dawg committee basic introducing saw bankrupt blame stripper hatchet assessment sand theme thrown gang television identical fuckign aug barbarian constructive signing impose breaking gold haunt slut authoritarian balanced paint quick prominent imo harry defacing adopted intelligence super dickface neutrality unike extensive quality questionable prepubescent pm debate website assertion wikipediafair focus impunity genitals blood second jimmy facilitate mel commie hayter lieing ann cover gone japan signpost unworthy unrelated anarchist regularly opening persecution usage laughing afds tower anger wp referencing erect mullet nick jayron receiving odd league screenshot faq readd freaky setting despite sent px hairy samev mm hat henry steal irc cow sore sens normal letting pro sweetheart spam sob allen stage lil harrassing fascism aoluwatoyin concerning sissy tralala spiteful pencil imma tagsfair_use persistant 
carefully dealing sanger answer highest userfactualman probaly oops lived lifestyle core grateful meantime attribution driver geezus lowest undoing hi seen observation smoker worry pulling tolerate dose bowl threerevert malignant storm ndp neo infuriating arrest flying obtained shiteating studio modem indian modify exists credit editwarring recording coulter talklist device voting nearly based recreated winner peaceful reminder pubic comfortable flooding shanes anyways campus talent pumpkin dozen wpv undue graph misunderstanding latino continually communicatorkills id bomb goy mardyks katrina crank oil indefinite gap bugger treat peace summarize douch eventually gd alternate association rereverted peewee kkk spamming robert revolution strict indulge ment paedo cleaning nutcase extension public heart nixer dangerous talkcontribs ciao sorted decency userspace selfhating fking internal dab wave touch reasearch refers family history hermaphrodite semiprotection na satisfaction host europe appearing memory urself melodic gain hesperian implies wikipediasandbox husband font training selected porno slander variety civilian explode primarily paganism mood did went elinord harly closing poo drool announcement conspire disambig decist retarted hoping fanboy scientology kike stylecolor hilarious maintenance ki genre knew incorporated terror inconvenience chuck jumped stay participating fraud enjoys tire consideration cent hospital cave stupidly economic afc unfriendly conservative poopie nangparbat chamber risk unthinking hustler kfc castle canai vandlism ultimatum sitting walter dispute aclu intransigent solved assumed completed respond mass preview farix wikilinks clarified nightmare dross medal involved speed regional drug temperature kleagle root wanted undone cfd loony canadian snide definitive muppet simplified anime drink kingpin wanking pubes capital buy sycophant barry sig exploit carry unorthodox ranking dissident labeling layout otrs spui assassin province drv 
outcome idea numbskull bin deranged originally comprehensive cut influence racially fuckwit targeted neonazis liked equation pole hong republicanjacobite jeni careful hahah happens majorly smarter ca asshats fricken remains chimp ashley contra disagreed unlikely malleus extreme latest chin multiracial effective dig undo futile presume prat xenophobia soon httpwwwwikinfoorgindexphpcategoryprimordiality robs heil scan noticeboard indecent mean vozenilek hindustani opposing advice expressed oi branch technically rd zealand wikitheclown pakistani pe oldest homophobe cellspacing gabsadds pair descriptive ape simpleminded antitruth obscure shiite prove constant stubborn terri journalist lacking finally mix indicated russia nitpicking died taking oohh surly emperor kept cracker ppls waaaaay genus hereill sanction noteworthy tasc wpor hdayejr jfdwolff higher perpetuate annoyed vfd mentally idly kurd programming measure impersonating hit rag fisherqueen road subjectspecific naconkantari annoyance relating sonic kitten se pronounced styrofoam planned emo jdelanoy exist notify unhelpful xxxx background france post meetup brainwashed strictly ending wikipediawhere index decade presentation assignment oct reading comply reputation gibson incivil che styleverticalaligntop didnt gained omg marie chris ding paedophilia robe angered reversed swears coin asserted major broader course obtain selfimportance ray kurt toe compared cult brat kww chipper rabbi continuing kevin viva germany grade nd baldfaced switch nolife chester lose weak photographer molesting supported pira si urine strongly citizenship vendetta spanking usernames engineering instead wqa azari anniversary professional probation pimp readded maddox barrel pal rally denny attributed snip untagged vandelism header museum fundy irish manchu youi line executionstyle greeting soso hiphop trusted fraudulent guise legalese wikipedidiots warden mam cumbey hockey credited ambiguity exchange continue granted politely dumpster 
fucktard youc zoe markup serendipodous disagrees fckin pipe crew disgust roman larger violate roughly blackpearl commited math linkin depends hall straw itd screeching settled borrowed subhuman therapy upcoming ottoman sane creative evolutionist teenager scam tito appriciate precise recognized sadly nutjob diff overboard crossmr totaly arsehole yr handedly bandit nag glorification blamed eric oxymoron orange senate twerp normally hater relax nonscientific nandesuka hunter stabbing random parenthesis cleankeeper redvers damage gnu chump der summer hard stalin fu zero fbi atlan legally momma deskana dickwad reworded remain bail recommendation coordinate puppeteer becuase achieved wad wiktionary component paki temporary fresh entire focused mistranslate crusader library asking footballer jerusalem chriso goodbye uneducated conflicting east sorting spanked stadium hysterical worldwide waggers emailed flower silencing ibanez glad score vulnerable ended rod inactive circus market notified ukraine circumcision ce held downright bektashi verification croat seb polite urban marijuana empire justified threesome managed hehe van madonna psychopath sociopath gtfo knowledgeable possibility tabtabtab farcical notation geting sharing sneeze imho ankhmorpork satori html snobby saudi qualify antiislamic povwarrior widely sikh je offer comic africanamerican manga bark contradict connor distinct fantastic cfred secular concentration progressive beat chess fashion display sexiest matthew evolution bot attached safer popular wessely thuggish dennis aditude delted legal trollop pass fiend formula complex taping vanessa humiliated cmon cyprus ego putz deleated jones colombo enthiran ortiz story sewer dictatorship temple frustration nearest capable pit pest brazil boston whim boob readding disaster disproven motha condom refute hahahahahahahaha cultist americus assist wrote anyones cumming asshat promoted patient damnit turned centric vandalizm quoted stylebackgroundcolor'.split()

stop_words += adv_val01
stop_words += adv_val02
stop_words += adv_val03
stop_words += adv_val04
stop_words += adv_val05
stop_words += adv_val06
stop_words += adv_val07
stop_words += adv_val08
stop_words += adv_val09
stop_words += adv_val10
stop_words += adv_val11
stop_words += adv_val12[:500]

print(' '.join(stop_words))
stop_words = list(set(stop_words))



In [7]:
# Cleaning
toxic3 = toxic3.rename(columns={'comment_text': 'text'})
toxic3['text_clean'] = toxic3['text'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))
valid['less_toxic_clean'] = valid['less_toxic'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))
valid['more_toxic_clean'] = valid['more_toxic'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))
test['text_clean'] = test['text'].apply(lambda x: utils_preprocess_text(x))

display(toxic3.head(3))

Unnamed: 0,id,text,toxic,severe_toxic,obscene,threat,insult,identity_hate,text_clean
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,False,False,False,False,False,False,hardcore metallica vandalism closure gas voted...
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,False,False,False,False,False,False,daww match background seemingly stuck
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",False,False,False,False,False,False,edit removing instead page actual


In [8]:
# adversarial validation data
train = pd.concat([toxic3['text_clean'], test['text_clean']])
target = pd.concat([pd.Series(np.ones(toxic3.shape[0])), pd.Series(np.zeros(test.shape[0]))])
print(target.value_counts())

1.0    223549
0.0      7537
dtype: int64


# Learning

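Feature extraction with TF-IDF, using sklearn.feature_extraction.text.TfidfVectorizer. The main options are listed below.

|Option|Choices|Default|Description|
|--|--|--|--|
|input|{'filename', 'file', 'content'}|'content'|Selects the form of the objects passed in: file names, file objects, or raw content.|
|preprocessor|callable|None|Overrides the preprocessing step. Applies only when analyzer is not callable.|
|tokenizer|callable|None|Overrides the string tokenization step. Applies only when analyzer == 'word'.|
|analyzer|{'word', 'char', 'char_wb'} or callable|'word'|Extracts features from word or character n-grams. With 'char_wb', character n-grams are built only from text inside word boundaries, and n-grams at word edges are padded with spaces.|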
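
To make the `analyzer` option concrete, here is a minimal sketch comparing `'word'` and `'char_wb'`. The two toy documents are made up for illustration and are not from the competition data; the `ngram_range` values are chosen only to keep the vocabularies small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical, not from the competition data).
docs = ["edit war page", "page edit"]

# analyzer='word': each feature is a whole word token.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
word_vec.fit(docs)
word_vocab = sorted(word_vec.vocabulary_)

# analyzer='char_wb': character 3-grams are built word by word, with a space
# padded at each word edge, so no n-gram spans two words.
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
char_vec.fit(docs)
char_vocab = sorted(char_vec.vocabulary_)

print(word_vocab)
print(char_vocab)
```

Character n-grams can be more tolerant of the misspellings common in toxic comments, at the cost of a larger vocabulary. The cells below use `analyzer='word'`.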

In [9]:
sample_text = train.head(5)
display(sample_text)

vec = TfidfVectorizer(
    stop_words=stop_words,
    min_df=1, 
    max_df=0.5, 
    max_features=1000,
    analyzer='word',
    ngram_range=(1, 2)
)
vec.fit_transform(sample_text)

print(f'extracted feature size: {len(vec.get_feature_names())}')
print(f'extracted features: {" ".join(vec.get_feature_names())}')

0    hardcore metallica vandalism closure gas voted...
1                daww match background seemingly stuck
2                    edit removing instead page actual
3    suggestion wondered statistic type accident th...
4                                             sir page
Name: text_clean, dtype: object

extracted feature size: 50
extracted features: accident accident think actual background background seemingly backlog backlog wikipediagood_article_nominationstransport closure closure gas daww daww background doe doe backlog doll doll remove edit edit removing gas gas voted hardcore hardcore metallica instead instead page metallica metallica closure page actual page retired remove remove page removing removing instead retired seemingly seemingly stuck sir sir page statistic statistic accident stuck think think tidying tidying tidying doe voted voted york wikipediagood_article_nominationstransport wondered wondered statistic york york doll


In [10]:
%%time

models = []
oof_train = np.zeros((len(train),))
tst_preds = []
scores = []

stkf = StratifiedKFold(
    n_splits=N_SPLITS, 
    shuffle=True, 
    random_state=None
)

# Feature extraction
print(f'feature extracting ...')
vectorizer = TfidfVectorizer(
    stop_words=stop_words,
    min_df=3, 
    max_df=0.5, 
    max_features=100_000,
    analyzer='word',
    ngram_range=(1, 1)
)
train_tfidf = vectorizer.fit_transform(train)
train_vocab = vectorizer.vocabulary_
print('Total number of features:', train_tfidf.shape[1])

# Cross-validation
for fold_id, (train_idx, valid_idx) in enumerate(stkf.split(train, target)):
    start = time.time()
    print(f'* ' * 40)
    print(f'fold_id: {fold_id}')
    
    # Prepare the training, validation, and test data
    print(f'preprocessing ...')
    X_trn = train.iloc[train_idx].reset_index(drop=True)
    X_val = train.iloc[valid_idx].reset_index(drop=True)
    y_trn = target.iloc[train_idx].reset_index(drop=True)
    y_val = target.iloc[valid_idx].reset_index(drop=True)
    
    X_trn_tfidf = vectorizer.transform(X_trn)
    X_val_tfidf = vectorizer.transform(X_val)
    X_tst_tfidf = vectorizer.transform(test['text_clean'])
    
    print('Total number of train samples:', X_trn_tfidf.shape[0])
    print('Total number of valid samples:', X_val_tfidf.shape[0])
    print('Total number of test samples:', X_tst_tfidf.shape[0])
    
    # Training
    print(f'training ...')
    # clf = LogisticRegression()
    # clf = RandomForestClassifier(
    #     n_estimators=100, 
    #     # max_depth=32, 
    #     n_jobs=-1
    # )
    clf = SGDClassifier(
        loss='log', 
        class_weight='balanced',
        max_iter=1000, 
        tol=1e-3, 
        n_jobs=-1
    )
    
    clf.fit(X_trn_tfidf, y_trn)
    
    # Inference
    print(f'predicting ...')
    val_pred = clf.predict_proba(X_val_tfidf)[:, 1]
    tst_pred = clf.predict_proba(X_tst_tfidf)[:, 1]
    # val_pred = clf.predict(X_val_tfidf)
    # tst_pred = clf.predict(X_tst_tfidf)
    oof_train[valid_idx] = val_pred
    tst_preds.append(tst_pred)
    models.append(clf)

    # Evaluation
    print(f'validation ...')
    score_auc = roc_auc_score(y_val, val_pred)
    scores.append(score_auc)
    elapsed = time.time() - start
    print(f'fold {fold_id} - score: {score_auc:.6f}, elapsed time: {elapsed:.2f} [sec]')

print(f'* ' * 40)
print(f'Average AUC: {sum(scores)/N_SPLITS:.6f}')

feature extracting ...
Total number of features: 60213
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 
fold_id: 0
preprocessing ...
Total number of train samples: 184868
Total number of valid samples: 46218
Total number of test samples: 7537
training ...
predicting ...
validation ...
fold 0 - score: 0.512919, elapsed time: 7.94 [sec]
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 
fold_id: 1
preprocessing ...
Total number of train samples: 184869
Total number of valid samples: 46217
Total number of test samples: 7537
training ...
predicting ...
validation ...
fold 1 - score: 0.497427, elapsed time: 8.33 [sec]
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 
fold_id: 2
preprocessing ...
Total number of train samples: 184869
Total number of valid samples: 46217
Total number of test samples: 7537
training ...
predicting ...
validation ...
fold 2 - score: 0.521655, elapsed time: 8.30 [sec]
* *

In [11]:
feat = np.array([clf.coef_[0] for clf in models])
# feat = np.array([clf.feature_importances_ for clf in models]) #if clf == 'RandomForestClassifier'
feat = feat.sum(axis=0) / N_SPLITS

feature_weights = sorted(
    list(zip(vectorizer.get_feature_names(), np.abs(feat))), 
    key=lambda x:x[1], 
    reverse=True)

feature_weights = pd.DataFrame(feature_weights, columns=['feature', 'val']).set_index('feature')
feature_weights.head(100).style.bar(subset=['val'])

feature,val
styleverticalaligntop,0.559302
stylebackgroundcolor,0.54653
cellspacing,0.491212
lacking,0.48047
untagged,0.453231
classmainpagebg,0.440442
robe,0.43739
stylefontsize,0.427771
futile,0.426167
contributor,0.422176


In [12]:
' '.join(feature_weights.index[:100])

'styleverticalaligntop stylebackgroundcolor cellspacing lacking untagged classmainpagebg robe stylefontsize futile contributor goodbye smarter createdtook gibson dying fdffe byrd soon fuckwit noticeboard turned finally surly stylebackgroundcolorffffa fired ffffff subjectspecific exist oi sicken background racially fu resolution careful remains homie flatter expressed comprehensive normally disgust molesting ops ding wikipediawhere influence bin sadly compared wpor lose kept wrote presume rd chimp beat psychopath wash reading tito heil glad advice stalin latest branch dig course taking bend rag strongly presentation scan asking pira pakistani annoyance vt hysterical professional markup descriptive index audio obscure comply unlicensed assface pair fisherqueen measure relating che liked equation vozenilek indicated'
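
The top-weighted tokens above also appear in the `adv_val` lists that are appended to `stop_words`, which suggests the selection is run repeatedly: fit the train-vs-test classifier, move the most discriminative tokens into the stop words, and refit. The following is a minimal sketch of that loop under stated assumptions, not the notebook's exact procedure. The toy corpora and the helper `top_adversarial_features` are hypothetical, and `LogisticRegression` is swapped in for `SGDClassifier` to keep the toy run deterministic.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpora (hypothetical): markup tokens occur only on the "train" side.
train_texts = [
    "cellspacing stylefontsize idiot page",
    "cellspacing stupid edit",
    "cellspacing stylefontsize idiot edit",
    "cellspacing stupid page",
]
test_texts = ["idiot page", "stupid edit", "idiot edit", "stupid page"]

texts = train_texts + test_texts
# Adversarial target: 1 for train rows, 0 for test rows.
is_train = np.array([1] * len(train_texts) + [0] * len(test_texts))


def top_adversarial_features(texts, is_train, stop_words, top_k=1):
    """Return the top_k tokens whose weights best separate train from test."""
    vec = TfidfVectorizer(stop_words=list(stop_words))
    X = vec.fit_transform(texts)
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X, is_train)
    # Invert vocabulary_ (token -> column) to look tokens up by column index.
    index_to_token = {idx: tok for tok, idx in vec.vocabulary_.items()}
    order = np.argsort(-np.abs(clf.coef_[0]))[:top_k]
    return [index_to_token[i] for i in order]


stop_words = []
for round_id in range(2):
    found = top_adversarial_features(texts, is_train, stop_words)
    print(f"round {round_id}: adding {found} to stop words")
    stop_words += found
```

Each round removes the tokens that make the two datasets easiest to tell apart. A natural stopping point is when the adversarial AUC settles near 0.5, as in the fold scores reported above.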