# Overview
- Kaggle competition: jigsaw-toxic-severity-rating.
- EDA
    - Dataset overview
- data cleaning
- learning strategy
    - maxdiff (binary classification)
- CV strategy:
    - Recursively trace texts that have appeared together as a comparison pair and give each connected set a single gid.
    - GroupKFold on gid
- training:
    - naive bayes
    - logistic regression
    - sgd classifier
    - random forest
- metric:
    - Evaluated with the following procedure.

```py
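import numpy as np

# test_pairs is assumed to yield (score_less_toxic_comment, score_more_toxic_comment)
scores = []
for score_less_toxic_comment, score_more_toxic_comment in test_pairs:
    if score_less_toxic_comment < score_more_toxic_comment: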
        scores.append(1)
    else:
        scores.append(0)

total_score = np.mean(scores)
```
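
The same procedure can be written as a single vectorized comparison. The sketch below is illustrative only: `pairwise_agreement` and the four score pairs are made-up names and values, not part of the competition code. Note that a tie counts as a miss because the comparison is strict.

```python
import numpy as np


def pairwise_agreement(less_scores, more_scores):
    """Fraction of pairs where the less toxic comment gets the strictly lower score."""
    less_scores = np.asarray(less_scores, dtype=float)
    more_scores = np.asarray(more_scores, dtype=float)
    return float(np.mean(less_scores < more_scores))


# Hypothetical model scores for four annotated pairs.
less = [0.10, 0.40, 0.30, 0.50]
more = [0.80, 0.20, 0.90, 0.50]  # the last pair is a tie and counts as a miss

print(pairwise_agreement(less, more))
```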

TODO
- cleaning
    - NLTK, spaCy, Keras, and torchtext can apparently be used for this.
- feature extraction:
    - BoW
    - TF-IDF
    - N-gram
    - sentiment and intent analysis
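
As a starting point for the BoW, TF-IDF, and N-gram items above, here is a minimal sketch with scikit-learn. The three toy documents are invented for illustration, and the settings (`ngram_range=(1, 2)`, default tokenization) are assumptions rather than tuned choices.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, made up for this example.
docs = [
    "you are a vandal",
    "this user is a vandal",
    "thanks for the edit",
]

# Bag of words: raw unigram counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# TF-IDF over unigrams and bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

print(X_bow.shape, X_tfidf.shape)
print(sorted(tfidf.vocabulary_))
```

The default token pattern keeps only tokens of two or more characters, so "a" is dropped before n-grams are formed; that is why "is vandal" shows up as a bigram.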


### References
- [Jiggsaw Toxic Comments EDA & Twitch Stream | Kaggle](https://www.kaggle.com/robikscube/jiggsaw-toxic-comments-eda-twitch-stream)
- data cleaning and kfold [Fold is Gold 🌟 [ cleaned data ] | Kaggle](https://www.kaggle.com/kishalmandal/fold-is-gold-cleaned-data)
- data cleaning and vectorization (TF-IDF, word2vec, BERT) [Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT | Towards Data Science](https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794)
- [Jigsaw CV strategy | Kaggle](https://www.kaggle.com/its7171/jigsaw-cv-strategy)
- [Metric understanding: perfect score is not 1 | Kaggle](https://www.kaggle.com/c/jigsaw-toxic-severity-rating/discussion/287350)
- [MaxDiffのスコア計算ロジックを教えて下さい。 | Marketing Technology](https://m-te.com/maxdiff-score-logic-200625/)

# Directories

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv
/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv
/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv


# Parameters

In [2]:
DEBUG_FLAG = False
VERSION = 'nb02'

SUBMISSION_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv'
VALIDATION_DATA_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv'
COMMENTS_SCORE_PATH = '/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv'

N_SPLITS = 5 if not DEBUG_FLAG else 2

# Installs & imports

In [3]:
import datetime
import json
import pickle
import random
import re
import sys
import time

import datatable as dt
import gensim
import gensim.downloader as gensim_api
import lightgbm as lgb
import matplotlib as mpl
import matplotlib.pyplot as plt
import nltk
import numpy as np
import optuna
import pandas as pd
import scipy.stats as ss
import seaborn as sns
import transformers


from catboost import CatBoostClassifier
from contextlib import contextmanager
from lime import lime_text
from logging import getLogger, Formatter, FileHandler, StreamHandler, INFO, DEBUG
from matplotlib_venn import venn2
# from optuna.integration import lightgbm as lgb
from sklearn import feature_extraction, model_selection, naive_bayes, pipeline, manifold, preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import confusion_matrix, log_loss, roc_auc_score
from sklearn.model_selection import GroupKFold, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
from tqdm import tqdm


# settings
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

# mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False

# Functions

In [4]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df


def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df


def create_logger(exp_version):
    log_file = (f'{exp_version}.log')

    logger_ = getLogger(exp_version)
    logger_.setLevel(DEBUG)

    # formatter
    # fmr = Formatter('[%(levelname)s] %(asctime)s >>\t%(message)s')
    fmr = Formatter(
        '%(asctime)s %(name)s %(lineno)d'
        ' [%(levelname)s][%(funcName)s] %(message)s'
    )

    # file handler
    fh = FileHandler(log_file)
    fh.setLevel(DEBUG)
    fh.setFormatter(fmr)

    # stream handler
    ch = StreamHandler()
    ch.setLevel(INFO)
    ch.setFormatter(fmr)

    logger_.addHandler(fh)
    logger_.addHandler(ch)


def get_logger(exp_version):
    return getLogger(exp_version)


def get_args_of_func(f):
    return f.__code__.co_varnames[:f.__code__.co_argcount]


def show_mem_usage():
    print("{}{: >25}{}{: >10}{}".format('|','Variable Name','|','Memory','|'))
    print(" ------------------------------------ ")
    for var_name in globals():
        if not var_name.startswith("_") and sys.getsizeof(eval(var_name)) > 1024**2:
            print("{}{: >25}{}{: >6} MiB{}".format('|',var_name,'|', int(sys.getsizeof(eval(var_name))/1024**2),'|'))


def read_data():
    valid_data = dt.fread(VALIDATION_DATA_PATH).to_pandas()
    score_data = dt.fread(COMMENTS_SCORE_PATH).to_pandas()
    submission = dt.fread(SUBMISSION_PATH).to_pandas()
    return valid_data, score_data, submission


def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    # clean (lowercase, strip surrounding whitespace, and remove punctuation)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())
    
    # remove numbers
    text = re.sub(r'[\d]', '', text)
            
    # Tokenize (convert from string to list)
    lst_text = text.split()
    # remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]
                
    # Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]
                
    # Lemmatisation (convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]
            
    # back to string from list
    text = ' '.join(lst_text)
    return text
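
To make the order of the cleaning steps concrete, the sketch below reproduces only the regex and stop-word parts of `utils_preprocess_text`. It leaves out stemming and lemmatisation so that it runs without NLTK data. `basic_clean` is a hypothetical helper name, and the sample strings are made up.

```python
import re


def basic_clean(text, stopwords=None):
    """Lowercase, drop punctuation and digits, then remove stop words."""
    text = re.sub(r"[^\w\s]", "", str(text).lower().strip())
    text = re.sub(r"\d", "", text)
    tokens = text.split()
    if stopwords is not None:
        tokens = [t for t in tokens if t not in stopwords]
    return " ".join(tokens)


print(basic_clean("Don't revert my 3 edits, please!", stopwords={"my", "please"}))

# Punctuation is stripped before the stop-word filter runs, so a stop word
# that contains punctuation (for example an HTML tag) can never match.
print(basic_clean("<p>hi</p>", stopwords={"<p>", "</p>"}))
```

Because apostrophes are removed first, "Don't" becomes "dont", which matches the cleaned columns shown later in the notebook.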

# Preparing

In [5]:
create_logger(VERSION)

In [6]:
valid_data, comments_score, submission = read_data()

if DEBUG_FLAG:
    valid_data = valid_data.sample(n=1000).reset_index(drop=True)

print(f'validation data shape: {valid_data.shape}')
print(f'comments score shape: {comments_score.shape}')
print(f'submission sample shape: {submission.shape}')

validation data shape: (30108, 3)
comments score shape: (7537, 2)
submission sample shape: (7537, 2)


In [7]:
display(valid_data.head())
display(comments_score.head())
display(submission.head())

Unnamed: 0,worker,less_toxic,more_toxic
0,313,This article sucks \n\nwoo woo wooooooo,WHAT!!!!!!!!?!?!!?!?!!?!?!?!?!!!!!!!!!!!!!!!!!...
1,188,"""And yes, people should recognize that but the...",Daphne Guinness \n\nTop of the mornin' my fav...
2,82,"Western Media?\n\nYup, because every crime in...","""Atom you don't believe actual photos of mastu..."
3,347,And you removed it! You numbskull! I don't car...,You seem to have sand in your vagina.\n\nMight...
4,539,smelly vagina \n\nBluerasberry why don't you ...,"hey \n\nway to support nazis, you racist"


Unnamed: 0,comment_id,text
0,114890,"""\n \n\nGjalexei, you asked about whether ther..."
1,732895,"Looks like be have an abuser , can you please ..."
2,1139051,I confess to having complete (and apparently b...
3,1434512,"""\n\nFreud's ideas are certainly much discusse..."
4,2084821,It is not just you. This is a laundry list of ...


Unnamed: 0,comment_id,score
0,114890,0.5
1,732895,0.5
2,1139051,0.5
3,1434512,0.5
4,2084821,0.5


# CV strategy

In [8]:
# assign a unique number to each text
texts = set(valid_data.less_toxic.to_list() + valid_data.more_toxic.to_list())
text2id = {t:id for id,t in enumerate(texts)}
valid_data['less_id'] = valid_data['less_toxic'].map(text2id)
valid_data['more_id'] = valid_data['more_toxic'].map(text2id)
valid_data

Unnamed: 0,worker,less_toxic,more_toxic,less_id,more_id
0,313,This article sucks \n\nwoo woo wooooooo,WHAT!!!!!!!!?!?!!?!?!!?!?!?!?!!!!!!!!!!!!!!!!!...,11081,217
1,188,"""And yes, people should recognize that but the...",Daphne Guinness \n\nTop of the mornin' my fav...,8738,1205
2,82,"Western Media?\n\nYup, because every crime in...","""Atom you don't believe actual photos of mastu...",12854,4252
3,347,And you removed it! You numbskull! I don't car...,You seem to have sand in your vagina.\n\nMight...,8804,2106
4,539,smelly vagina \n\nBluerasberry why don't you ...,"hey \n\nway to support nazis, you racist",653,459
...,...,...,...,...,...
30103,461,I'm sorry. I'm not an admin. I will give you t...,get out my large penis,2446,11544
30104,527,I'm sorry. I'm not an admin. I will give you t...,get out my large penis,2446,11544
30105,352,"wow...\nare you out of your mind, how was my e...",Piss off you slant eyed-gook,5783,471
30106,311,"wow...\nare you out of your mind, how was my e...",Piss off you slant eyed-gook,5783,471


In [9]:
# flag each pair of texts that appears in the validation data
len_ids = len(text2id)
idarr = np.zeros((len_ids,len_ids), dtype=bool)

for lid, mid in valid_data[['less_id', 'more_id']].values:
    min_id = min(lid, mid)
    max_id = max(lid, mid)
    idarr[max_id, min_id] = True

In [10]:
# Recursively retrieve the texts that are paired with the text whose id is i,
# and store their ids in this_list.
# Each visited entry idarr[i, j] is then set to False.
def add_ids(i, this_list):
    for j in range(len_ids):
        if idarr[i, j]:
            idarr[i, j] = False
            this_list.append(j)
            this_list = add_ids(j,this_list)
            #print(j,i)
    for j in range(i+1,len_ids):
        if idarr[j, i]:
            idarr[j, i] = False
            this_list.append(j)
            this_list = add_ids(j,this_list)
            #print(j,i)
    return this_list


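# The pairs of unique texts are stored in the lower-triangular matrix idarr.
# Recursively trace texts that appear together and give each connected set a single group id.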
group_list = []
for i in tqdm(range(len_ids)):
    for j in range(i+1,len_ids):
        if idarr[j, i]:
            this_list = add_ids(i,[i])
            #print(this_list)
            group_list.append(this_list)

id2groupid = {}
for gid,ids in enumerate(group_list):
    for id in ids:
        id2groupid[id] = gid

valid_data['less_gid'] = valid_data['less_id'].map(id2groupid)
valid_data['more_gid'] = valid_data['more_id'].map(id2groupid)
valid_data

100%|██████████| 14250/14250 [01:00<00:00, 233.78it/s] 


Unnamed: 0,worker,less_toxic,more_toxic,less_id,more_id,less_gid,more_gid
0,313,This article sucks \n\nwoo woo wooooooo,WHAT!!!!!!!!?!?!!?!?!!?!?!?!?!!!!!!!!!!!!!!!!!...,11081,217,203,203
1,188,"""And yes, people should recognize that but the...",Daphne Guinness \n\nTop of the mornin' my fav...,8738,1205,30,30
2,82,"Western Media?\n\nYup, because every crime in...","""Atom you don't believe actual photos of mastu...",12854,4252,2609,2609
3,347,And you removed it! You numbskull! I don't car...,You seem to have sand in your vagina.\n\nMight...,8804,2106,1594,1594
4,539,smelly vagina \n\nBluerasberry why don't you ...,"hey \n\nway to support nazis, you racist",653,459,93,93
...,...,...,...,...,...,...,...
30103,461,I'm sorry. I'm not an admin. I will give you t...,get out my large penis,2446,11544,1779,1779
30104,527,I'm sorry. I'm not an admin. I will give you t...,get out my large penis,2446,11544,1779,1779
30105,352,"wow...\nare you out of your mind, how was my e...",Piss off you slant eyed-gook,5783,471,422,422
30106,311,"wow...\nare you out of your mind, how was my e...",Piss off you slant eyed-gook,5783,471,422,422


In [11]:
print('unique text counts:', len_ids)
print('grouped text counts:', len(group_list))

unique text counts: 14250
grouped text counts: 4142
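
The grouping above amounts to finding connected components of the graph whose edges are the (less_id, more_id) pairs. As an alternative sketch (not the method used in this notebook), a union-find structure gives the same partition without the dense `len_ids` x `len_ids` matrix or deep recursion. `group_ids` is a hypothetical helper, the toy pairs are made up, and the group numbering may differ from the one produced above.

```python
def group_ids(pairs, n_ids):
    """Map each id to a group id; ids linked through any chain of pairs share a group."""
    parent = list(range(n_ids))

    def find(x):
        # Path halving keeps the trees shallow without recursion.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Relabel component roots as consecutive group ids.
    root2gid = {}
    id2gid = {}
    for i in range(n_ids):
        id2gid[i] = root2gid.setdefault(find(i), len(root2gid))
    return id2gid


# Toy pairs: 0-1-2 form one chain, 3-4 another, 5 never appears in a pair.
pairs = [(0, 1), (1, 2), (3, 4)]
id2gid = group_ids(pairs, n_ids=6)
print(id2gid)
```

In this notebook the call would presumably look like `group_ids(valid_data[['less_id', 'more_id']].values, len(text2id))`.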


In [12]:
# now we can use GroupKFold with group id
group_kfold = GroupKFold(n_splits=N_SPLITS)

# Since df.less_gid and df.more_gid are the same, let's use df.less_gid here.
for fold, (trn, val) in enumerate(group_kfold.split(valid_data, valid_data, valid_data.less_gid)): 
    valid_data.loc[val , "fold"] = fold

valid_data["fold"] = valid_data["fold"].astype(int)
valid_data

Unnamed: 0,worker,less_toxic,more_toxic,less_id,more_id,less_gid,more_gid,fold
0,313,This article sucks \n\nwoo woo wooooooo,WHAT!!!!!!!!?!?!!?!?!!?!?!?!?!!!!!!!!!!!!!!!!!...,11081,217,203,203,4
1,188,"""And yes, people should recognize that but the...",Daphne Guinness \n\nTop of the mornin' my fav...,8738,1205,30,30,2
2,82,"Western Media?\n\nYup, because every crime in...","""Atom you don't believe actual photos of mastu...",12854,4252,2609,2609,1
3,347,And you removed it! You numbskull! I don't car...,You seem to have sand in your vagina.\n\nMight...,8804,2106,1594,1594,2
4,539,smelly vagina \n\nBluerasberry why don't you ...,"hey \n\nway to support nazis, you racist",653,459,93,93,0
...,...,...,...,...,...,...,...,...
30103,461,I'm sorry. I'm not an admin. I will give you t...,get out my large penis,2446,11544,1779,1779,0
30104,527,I'm sorry. I'm not an admin. I will give you t...,get out my large penis,2446,11544,1779,1779,0
30105,352,"wow...\nare you out of your mind, how was my e...",Piss off you slant eyed-gook,5783,471,422,422,2
30106,311,"wow...\nare you out of your mind, how was my e...",Piss off you slant eyed-gook,5783,471,422,422,2
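
A quick sanity check for this kind of split is to confirm that no group id lands on both sides of any fold. The sketch below uses a toy `groups` array as a stand-in for `valid_data.less_gid`; the array and the fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for valid_data.less_gid: 12 rows in 6 groups.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
X = np.zeros((len(groups), 1))  # features do not affect the split itself

leaks = 0
fold_sizes = []
for trn_idx, val_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    # Any group id present in both index sets would be a leak.
    shared = set(groups[trn_idx]) & set(groups[val_idx])
    leaks += len(shared)
    fold_sizes.append(len(val_idx))

print("groups shared between train and validation:", leaks)
print("validation fold sizes:", fold_sizes)
```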


# Cleaning

In [13]:
stop_words = list(text.ENGLISH_STOP_WORDS)
html_tags = [
    '<p>', '</p>', '<table>', '</table>', '<tr>', '</tr>', '<ul>', '</ul>', '<ol>', '</ol>', '<dl>', '</dl>', 
    '<li>', '</li>', '<dd>', '</dd>', '<dt>', '</dt>', '/n', '\n'
]
r_buf = [
    'it', 'is', 'are', 'do', 'does', 'did', 'was', 'were', 'will', 'can', 'the', 'a', 'of', 'in', 'and', 'on',
    'what', 'where', 'when', 'which'
]
stop_words = list(set(stop_words + html_tags + r_buf))
print(' '.join(stop_words))

becomes but however who after yourselves </dt> </li> these fifteen seems none full with whose whatever why whence behind every of many else whereupon although via all up his fill now here though whither latterly ltd they throughout towards besides found yourself formerly wherever name about ourselves my namely themselves therefore mill further could at between i most then two upon <ul> less several such whereby </dl> bill call also thereby myself and few yet whereafter top what their mine anyone ie while herself nobody other seem that enough noone nowhere around put <dl> it thence front still are whom serious etc whole cant too </table> he ten would along you by last since ours yours she own above wherein before nothing no </p> please three beforehand five anywhere sixty always when whether off below something through least on became must ever find hereby again anyway because amongst somehow among itself an <tr> hers moreover himself eight this we go four each forty see was become seem

In [14]:
# clean the texts
valid_data['less_toxic_clean'] = valid_data['less_toxic'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))
valid_data['more_toxic_clean'] = valid_data['more_toxic'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))
comments_score['text_clean'] = comments_score['text'].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=stop_words))

display(valid_data)
display(comments_score)

Unnamed: 0,worker,less_toxic,more_toxic,less_id,more_id,less_gid,more_gid,fold,less_toxic_clean,more_toxic_clean
0,313,This article sucks \n\nwoo woo wooooooo,WHAT!!!!!!!!?!?!!?!?!!?!?!?!?!!!!!!!!!!!!!!!!!...,11081,217,203,203,4,article suck woo woo wooooooo,wher sexy pic gone main page
1,188,"""And yes, people should recognize that but the...",Daphne Guinness \n\nTop of the mornin' my fav...,8738,1205,30,30,2,yes people recognize usually dont objection he...,daphne guinness mornin favourite alien life fo...
2,82,"Western Media?\n\nYup, because every crime in...","""Atom you don't believe actual photos of mastu...",12854,4252,2609,2609,1,western medium yup crime entire world caused w...,atom dont believe actual photo masturbation se...
3,347,And you removed it! You numbskull! I don't car...,You seem to have sand in your vagina.\n\nMight...,8804,2106,1594,1594,2,removed numbskull dont care say anymore life a...,sand vagina want wash stinking hole bitch
4,539,smelly vagina \n\nBluerasberry why don't you ...,"hey \n\nway to support nazis, you racist",653,459,93,93,0,smelly vagina bluerasberry dont model cheesy b...,hey way support nazi racist
...,...,...,...,...,...,...,...,...,...,...
30103,461,I'm sorry. I'm not an admin. I will give you t...,get out my large penis,2446,11544,1779,1779,0,im sorry im admin piece advice edit page away ...,large penis
30104,527,I'm sorry. I'm not an admin. I will give you t...,get out my large penis,2446,11544,1779,1779,0,im sorry im admin piece advice edit page away ...,large penis
30105,352,"wow...\nare you out of your mind, how was my e...",Piss off you slant eyed-gook,5783,471,422,422,2,wow mind edit talk page vandalism simply expla...,piss slant eyedgook
30106,311,"wow...\nare you out of your mind, how was my e...",Piss off you slant eyed-gook,5783,471,422,422,2,wow mind edit talk page vandalism simply expla...,piss slant eyedgook


Unnamed: 0,comment_id,text,text_clean
0,114890,"""\n \n\nGjalexei, you asked about whether ther...",gjalexei asked antieditorializing policy calle...
1,732895,"Looks like be have an abuser , can you please ...",look like abuser look thanks
2,1139051,I confess to having complete (and apparently b...,confess having complete apparently blissful ig...
3,1434512,"""\n\nFreud's ideas are certainly much discusse...",freud idea certainly discussed today agree gra...
4,2084821,It is not just you. This is a laundry list of ...,just laundry list stupid allegation scooped go...
...,...,...,...
7532,504235362,"Go away, you annoying vandal.",away annoying vandal
7533,504235566,This user is a vandal.,user vandal
7534,504308177,""" \n\nSorry to sound like a pain, but one by f...",sorry sound like pain following tad stalking h...
7535,504570375,Well it's pretty fucking irrelevant now I'm un...,pretty fucking irrelevant im unblocked aint


# Training & prediction

In [15]:
show_mem_usage()

if 'idarr' in dir():
    del idarr

|            Variable Name|    Memory|
 ------------------------------------ 
|               valid_data|    48 MiB|
|           comments_score|     5 MiB|
|                    idarr|   193 MiB|


In [16]:
%%time

# classifiers
models = [
    (MultinomialNB(), 'naive bayes'),
    (LogisticRegression(solver='liblinear'), 'logistic regression'),
    (SGDClassifier(loss='modified_huber', max_iter=1000, tol=1e-3, n_jobs=-1), 'SGDClassifier'),
    (RandomForestClassifier(n_estimators=100, max_depth=32, n_jobs=-1), 'random forest')
]
model_test_pred = {}

for model, name in models:
    print(f'{model}')
    train_models = []
    valid_scores = []
    test_preds = []
    
    for fold_id in range(N_SPLITS):
        start = time.time()
        
        # build the training, validation, and test data
        train = valid_data.query('fold != @fold_id')
        valid = valid_data.query('fold == @fold_id')
        
        X_train = pd.concat([train['less_toxic_clean'], train['more_toxic_clean']])
        y_train = np.vstack([np.zeros(shape=(len(train['less_toxic_clean']), 1)),
                             np.ones(shape=(len(train['more_toxic_clean']), 1))])
        X_valid = pd.concat([valid['less_toxic_clean'], valid['more_toxic_clean']])
        y_valid = np.vstack([np.zeros(shape=(len(valid['less_toxic_clean']), 1)),
                             np.ones(shape=(len(valid['more_toxic_clean']), 1))])
        X_test = comments_score['text_clean']
        
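        # feature extraction
        # NOTE: the vocabulary is fitted on the validation fold here; fitting on X_train
        # is the usual choice, since it keeps validation text out of the feature definition.
        cnt_vec = CountVectorizer(min_df=3)
        X_valid = cnt_vec.fit_transform(X_valid).toarray()
        X_train = cnt_vec.transform(X_train).toarray()
        X_test = cnt_vec.transform(X_test).toarray()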
        
        print(f'train data size: {X_train.shape}')
        print(f'valid data size: {X_valid.shape}')
        print(f'test data size: {X_test.shape}')
        
        print(f'fitting ...') 
        model.fit(X_train, y_train.ravel())
        
        # inference
        print(f'predicting ...')
        valid_pred = model.predict_proba(X_valid)[:, 1]
        test_pred = model.predict_proba(X_test)[:, 1]
        
        # evaluation
        print(f'evaluating ...')
        valid_score = np.mean(valid_pred[:len(valid)] < valid_pred[len(valid):])
        elapsed = time.time() - start
        print(f'{name} - fold {fold_id} - score: {valid_score:.6f}, elapsed time: {elapsed:.2f} [sec]')
        
        train_models.append(model)
        valid_scores.append(valid_score)
        test_preds.append(test_pred)
    
    # valid_pred = np.array(valid_preds).T.sum(axis=1) / len(valid_preds)
    # display(pd.DataFrame(pd.Series(valid_pred.ravel()).describe()).transpose())
    total_score = np.mean(valid_scores)
    print(f'{name} - Local CV score: {total_score:.6f}\n')
    
    test_pred = np.array(test_preds).T.sum(axis=1) / len(test_preds)
    model_test_pred[name] = test_pred

MultinomialNB()
train data size: (48172, 15758)
valid data size: (12044, 15758)
test data size: (7537, 15758)
fitting ...
predicting ...
evaluating ...
naive bayes - fold 0 - score: 0.615576, elapsed time: 37.02 [sec]
train data size: (48172, 15987)
valid data size: (12044, 15987)
test data size: (7537, 15987)
fitting ...
predicting ...
evaluating ...
naive bayes - fold 1 - score: 0.631186, elapsed time: 37.36 [sec]
train data size: (48172, 15576)
valid data size: (12044, 15576)
test data size: (7537, 15576)
fitting ...
predicting ...
evaluating ...
naive bayes - fold 2 - score: 0.629359, elapsed time: 36.31 [sec]
train data size: (48174, 15947)
valid data size: (12042, 15947)
test data size: (7537, 15947)
fitting ...
predicting ...
evaluating ...
naive bayes - fold 3 - score: 0.615180, elapsed time: 37.13 [sec]
train data size: (48174, 15278)
valid data size: (12042, 15278)
test data size: (7537, 15278)
fitting ...
predicting ...
evaluating ...
naive bayes - fold 4 - score: 0.628799, 



predicting ...
evaluating ...
logistic regression - fold 1 - score: 0.580870, elapsed time: 28.89 [sec]
train data size: (48172, 15576)
valid data size: (12044, 15576)
test data size: (7537, 15576)
fitting ...




predicting ...
evaluating ...
logistic regression - fold 2 - score: 0.605447, elapsed time: 23.67 [sec]
train data size: (48174, 15947)
valid data size: (12042, 15947)
test data size: (7537, 15947)
fitting ...




predicting ...
evaluating ...
logistic regression - fold 3 - score: 0.591264, elapsed time: 27.86 [sec]
train data size: (48174, 15278)
valid data size: (12042, 15278)
test data size: (7537, 15278)
fitting ...




predicting ...
evaluating ...
logistic regression - fold 4 - score: 0.596413, elapsed time: 21.26 [sec]
logistic regression - Local CV score: 0.592168

SGDClassifier(loss='modified_huber', n_jobs=-1)
train data size: (48172, 15758)
valid data size: (12044, 15758)
test data size: (7537, 15758)
fitting ...
predicting ...
evaluating ...
SGDClassifier - fold 0 - score: 0.507639, elapsed time: 105.37 [sec]
train data size: (48172, 15987)
valid data size: (12044, 15987)
test data size: (7537, 15987)
fitting ...
predicting ...
evaluating ...
SGDClassifier - fold 1 - score: 0.442046, elapsed time: 61.22 [sec]
train data size: (48172, 15576)
valid data size: (12044, 15576)
test data size: (7537, 15576)
fitting ...
predicting ...
evaluating ...
SGDClassifier - fold 2 - score: 0.423946, elapsed time: 52.53 [sec]
train data size: (48174, 15947)
valid data size: (12042, 15947)
test data size: (7537, 15947)
fitting ...
predicting ...
evaluating ...
SGDClassifier - fold 3 - score: 0.495599, elapsed t
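
Two details of the loop above are easy to misread: the validation predictions are stacked as all less-toxic texts followed by all more-toxic texts, and the test prediction is the plain mean over folds. The sketch below replays both steps on made-up numbers; the arrays are hypothetical and do not come from the run above.

```python
import numpy as np

# Hypothetical predictions for 3 pairs, stacked as [less_0..less_2, more_0..more_2],
# the same layout produced by pd.concat([less_toxic_clean, more_toxic_clean]).
n_pairs = 3
valid_pred = np.array([0.2, 0.6, 0.4, 0.7, 0.5, 0.9])
less_pred, more_pred = valid_pred[:n_pairs], valid_pred[n_pairs:]
valid_score = np.mean(less_pred < more_pred)

# Hypothetical per-fold test predictions (2 folds x 4 test comments).
test_preds = [np.array([0.1, 0.4, 0.6, 0.8]), np.array([0.3, 0.2, 0.6, 1.0])]
avg_notebook = np.array(test_preds).T.sum(axis=1) / len(test_preds)  # form used above
avg_short = np.mean(test_preds, axis=0)                              # equivalent shorthand

print(valid_score)
print(avg_notebook, avg_short)
```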

# Submit

In [17]:
for model, name in models:
    print(model, name)

MultinomialNB() naive bayes
LogisticRegression(solver='liblinear') logistic regression
SGDClassifier(loss='modified_huber', n_jobs=-1) SGDClassifier
RandomForestClassifier(max_depth=32, n_jobs=-1) random forest


In [18]:
submission['score'] = model_test_pred['random forest']
submission.head()

Unnamed: 0,comment_id,score
0,114890,0.423435
1,732895,0.479773
2,1139051,0.454011
3,1434512,0.304729
4,2084821,0.514779


In [19]:
submission.to_csv('submission.csv', index=False)