# About this notebook

## Goal
* Make a first submission to get a feel for the task

## Workflow
* Import libraries
* Load the data used for training and prediction
* Feature engineering (text cleaning)
* Train, predict, and submit

## References

* Data analysis and cleaning
    * jagan, Stop the S@#$ - Toxic Comments EDA, https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda
        * Most of the approach here is borrowed from this kernel
* Model building
    * Bojan Tunguz, Logistic regression with words and char n-grams, https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams/code
    
    

# Import libraries

In [106]:
# The usual
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

# NLP
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem.wordnet import WordNetLemmatizer 

# Feature engineering and models
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Data paths, etc.
home_dir=os.path.dirname(os.getcwd())
data_dir=os.path.join(home_dir, "data")
result_dir=os.path.join(home_dir, "result")

pd.set_option("display.max_rows", 30)

## Download corpora
* wordnet is needed for lemmatization and stopwords for stop-word removal

In [3]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Load the data used for training and prediction

In [16]:
train_path = os.path.join(data_dir, "train.csv") 
train_df = pd.read_csv(train_path)
print(train_df.isnull().sum())
train_df[:10]

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


In [15]:
test_path = os.path.join(data_dir, "test.csv") 
test_df = pd.read_csv(test_path)
print(test_df.isnull().sum())
test_df[:10]

id              0
comment_text    0
dtype: int64


Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.
5,0001ea8717f6de06,Thank you for understanding. I think very high...
6,00024115d4cbde0f,Please do not add nonsense to Wikipedia. Such ...
7,000247e83dcc1211,:Dear god this site is horrible.
8,00025358d4737918,""" \n Only a fool can believe in such numbers. ..."
9,00026d1092fe71cc,== Double Redirects == \n\n When fixing double...


In [17]:
submit_path = os.path.join(data_dir, "sample_submission.csv")
submit_df = pd.read_csv(submit_path)
submit_df[:10]

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.5,0.5,0.5,0.5,0.5,0.5
1,0000247867823ef7,0.5,0.5,0.5,0.5,0.5,0.5
2,00013b17ad220c46,0.5,0.5,0.5,0.5,0.5,0.5
3,00017563c3f7919a,0.5,0.5,0.5,0.5,0.5,0.5
4,00017695ad8997eb,0.5,0.5,0.5,0.5,0.5,0.5
5,0001ea8717f6de06,0.5,0.5,0.5,0.5,0.5,0.5
6,00024115d4cbde0f,0.5,0.5,0.5,0.5,0.5,0.5
7,000247e83dcc1211,0.5,0.5,0.5,0.5,0.5,0.5
8,00025358d4737918,0.5,0.5,0.5,0.5,0.5,0.5
9,00026d1092fe71cc,0.5,0.5,0.5,0.5,0.5,0.5


# Feature engineering
## Text cleaning

In [65]:
#https://drive.google.com/file/d/0B1yuv8YaUVlZZ1RzMFJmc1ZsQmM/view
# Apostrophe (contraction) lookup dict
APPO = {
    "aren't" : "are not",
    "can't" : "cannot",
    "couldn't" : "could not",
    "didn't" : "did not",
    "doesn't" : "does not",
    "don't" : "do not",
    "hadn't" : "had not",
    "hasn't" : "has not",
    "haven't" : "have not",
    "he'd" : "he would",
    "he'll" : "he will",
    "he's" : "he is",
    "i'd" : "I would",
    "i'll" : "I will",
    "i'm" : "I am",
    "isn't" : "is not",
    "it's" : "it is",
    "it'll" : "it will",
    "i've" : "I have",
    "let's" : "let us",
    "mightn't" : "might not",
    "mustn't" : "must not",
    "shan't" : "shall not",
    "she'd" : "she would",
    "she'll" : "she will",
    "she's" : "she is",
    "shouldn't" : "should not",
    "that's" : "that is",
    "there's" : "there is",
    "they'd" : "they would",
    "they'll" : "they will",
    "they're" : "they are",
    "they've" : "they have",
    "we'd" : "we would",
    "we're" : "we are",
    "weren't" : "were not",
    "we've" : "we have",
    "what'll" : "what will",
    "what're" : "what are",
    "what's" : "what is",
    "what've" : "what have",
    "where's" : "where is",
    "who'd" : "who would",
    "who'll" : "who will",
    "who're" : "who are",
    "who's" : "who is",
    "who've" : "who have",
    "won't" : "will not",
    "wouldn't" : "would not",
    "you'd" : "you would",
    "you'll" : "you will",
    "you're" : "you are",
    "you've" : "you have",
    "'re" : " are",
    "wasn't" : "was not",
    "we'll" : "we will",
    "tryin'" : "trying",
    "&" : " and ",
    "@" : " at ",
    "0" : " zero ",
    "1" : " one ",
    "2" : " two ",
    "3" : " three ",
    "4" : " four ",
    "5" : " five ",
    "6" : " six ",
    "7" : " seven ",
    "8" : " eight ",
    "9" : " nine "
}
tokenizer=TweetTokenizer()
lem = WordNetLemmatizer()
eng_stopwords = set(stopwords.words("english"))

In [74]:
def clean(comment):
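    # Convert to lowercase
    comment = comment.lower()
    # Remove IP addresses and user names
    # (as written, the last group `d{,3}` matches literal "d" characters, so the final octet of an IP is kept,
    #  and `\[\[.\]]` only matches a single character between the double brackets)
    comment = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.d{,3}", "", comment)
    comment = re.sub(r"\[\[.\]]", "", comment)
    # Separate punctuation from adjacent characters ("!!" -> " !  ! ")
    comment = re.sub(r'([\'\"\.\(\)\!\?\-\\\/\,])', r' \1 ', comment)
    # Separate digits and asterisks ("2018" -> " 2  0  1  8 ")
    comment = re.sub(r'([0-9*])', r' \1 ', comment)
    # Replace special symbols and newlines with spaces
    comment = re.sub(r'([\;\:\|•«=\n])', ' ', comment)
    # Split the text into a list of tokens
    words = tokenizer.tokenize(comment)
    # Expand contractions and spell out digits and symbols via the lookup table
    words = [APPO[word] if word in APPO else word for word in words]
    # Lemmatize as verbs (e.g. made -> make, reverted -> revert)
    words = [lem.lemmatize(word, "v") for word in words]
    # Remove stop words (e.g. are, by)
    words = [w for w in words if w not in eng_stopwords]
    # Join the tokens back into a single string
    clean_sent = " ".join(words)
    # Replace any remaining apostrophes with spaces
    clean_sent = re.sub(r'([\'])', ' ', clean_sent)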
    return clean_sent
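
The first two substitutions in `clean` are meant to strip IP addresses and user names. Below is a minimal sketch of patterns that would cover a full dotted quad and a `[[...]]` wiki-style link, assuming those are the intended targets. The names `IP_PATTERN`, `WIKI_LINK_PATTERN`, and `strip_ip_and_links`, the non-greedy link pattern, and the sample strings are illustrative and not part of the original notebook.

```python
import re

# Assumed patterns: a full dotted quad, and a [[...]] link matched non-greedily.
IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
WIKI_LINK_PATTERN = re.compile(r"\[\[.*?\]\]")


def strip_ip_and_links(comment):
    """Remove IPv4-looking strings and [[...]] links from a comment."""
    comment = IP_PATTERN.sub("", comment)
    comment = WIKI_LINK_PATTERN.sub("", comment)
    return comment


# Hypothetical inputs for illustration.
print(strip_ip_and_links("retired now.89.205.38.27"))
print(strip_ip_and_links("see [[User:Example]] for details"))
```

Swapping these into `clean` would change the cleaned corpus, so any recorded outputs would need to be regenerated.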

In [75]:
comment_idx = 0
print("----before----")
print(train_df["comment_text"][comment_idx])
print("----after----")
print(clean(train_df["comment_text"][comment_idx]))

----before----
Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
----after----
explanation edit make username hardcore metallica fan revert ?   vandalisms , closure gas vote new york dolls fac . please   remove template talk page since   retire .  two   seven 


'test  two   zero   zero   nine '
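
In the sample above, `weren't`, `don't`, and `I'm` leave only blanks instead of their expanded forms. A likely reason is that apostrophes are padded with spaces before tokenization, so the resulting tokens no longer match the contraction keys in `APPO`. One possible alternative, sketched here under that assumption, is to expand contractions while the apostrophes are still attached and only then split punctuation. The table `CONTRACTIONS` is a hypothetical three-entry subset, and the function names are illustrative.

```python
import re

# Hypothetical three-entry subset of a contraction table, for illustration only.
CONTRACTIONS = {
    "weren't": "were not",
    "don't": "do not",
    "i'm": "i am",
}

# Longest keys first so a longer contraction is never shadowed by a shorter one.
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b"
)


def expand_contractions(comment):
    """Lowercase the text and expand known contractions before any punctuation splitting."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(1)], comment.lower())


def split_punctuation(comment):
    """Pad punctuation with spaces, mirroring the later cleaning step."""
    return re.sub(r"([\'\"\.\(\)\!\?\-\\\/\,])", r" \1 ", comment)


text = "They weren't vandalisms. Please don't remove it since I'm retired."
print(split_punctuation(expand_contractions(text)))
```

Because the expanded words ("were", "not", "do", "am") are themselves stop words, the effect on the final features may be small. The ordering matters more for contractions whose expansion carries meaning.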

In [76]:
corpus_train = train_df["comment_text"]
corpus_train = corpus_train.apply(lambda x:clean(x))
corpus_train

0         explanation edit make username hardcore metall...
1           aww ! match background colour   seemingly st...
2         hey man ,   really try edit war .   guy consta...
3         "   make real suggestions improvement - wonder...
4                   , sir , hero . chance remember page   ?
5         " congratulations well , use tool well . · talk "
6                               cocksucker piss around work
7         vandalism matt shirvington article revert . pl...
8         sorry word   nonsense   offensive . anyway ,  ...
9                      alignment subject contrary dulithgow
10        " fair use rationale image wonju . jpg thank u...
11                      bbq man let discuss - maybe phone ?
12        hey .  .  . .  .  at  talk . .  .  . exclusive...
13        start throw accusations warn , let review edit...
14        oh , girl start arguments . stick nose   belon...
                                ...                        
159556                           irc , ,

In [54]:
corpus_test = test_df["comment_text"]
corpus_test = corpus_test.apply(lambda x:clean(x))

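## Convert text to bag-of-words
### TfidfVectorizer
* strip_accents ... unicode: strip accent marks
* ngram_range=(n_min, n_max) ... range of n-gram sizes to extract
* max_features ... keep only this many vocabulary terms, ranked by term frequency across the corpus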
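
To make the weighting behind `TfidfVectorizer` more concrete, here is a plain-Python sketch of tf-idf with a sublinear term frequency. It follows the formulas given in the scikit-learn documentation for `sublinear_tf=True` with the default smoothed idf and L2 normalisation: tf becomes 1 + ln(tf) and idf is ln((1 + n) / (1 + df)) + 1. It is not the library's implementation. Whitespace tokenization, the function name `sublinear_tfidf`, and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def sublinear_tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {}
        for term, count in counts.items():
            tf = 1.0 + math.log(count)                             # sublinear tf
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0    # smoothed idf
            weights[term] = tf * idf
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


# Hypothetical toy corpus.
toy_corpus = ["edit war edit edit", "talk page edit", "talk page vandalism"]
for row in sublinear_tfidf(toy_corpus):
    print({term: round(w, 3) for term, w in sorted(row.items())})
```

The logarithm dampens the effect of a word repeated many times within one comment, which can help with comments that repeat a single insult many times.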

In [96]:
#vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    ngram_range=(1, 1),
    max_features=15000)
X_train = vectorizer.fit_transform(corpus_train)

In [103]:
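# On scikit-learn 1.2 and later, get_feature_names() is no longer available; use get_feature_names_out() instead
print(len(vectorizer.get_feature_names()))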

vectorizer.get_feature_names()[:30]

15000


['TM',
 '__',
 '___',
 '_noticeboard',
 'aa',
 'aaron',
 'ab',
 'abandon',
 'abbreviation',
 'abc',
 'abide',
 'abilities',
 'ability',
 'able',
 'able edit',
 'able find',
 'able get',
 'able help',
 'able make',
 'able see',
 'abolish',
 'abortion',
 'abraham',
 'abroad',
 'absence',
 'absent',
 'absolute',
 'absolutely',
 'absolutely nothing',
 'absorb']

In [98]:
X_test = vectorizer.transform(corpus_test)

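# Training, prediction, and submission
## Model
* Logistic regression (outputs a probability between 0 and 1)
* Train a separate model for each label ('toxic', 'severe_toxic', etc.), six in total, and predict with each
* Run cross-validation as a sanity check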

In [None]:
class_names = train_df.columns[2:].tolist()
class_names

In [108]:
for class_name in class_names:
    y_train = train_df[class_name]
    
    classifier = LogisticRegression()
    #classifier = RandomForestClassifier()
    #classifier = SVC()

    cv_score = np.mean(cross_val_score(classifier, X_train, y_train, cv=5, scoring='roc_auc'))
    print('CV score for class {} is {}'.format(class_name, cv_score))

    classifier.fit(X_train, y_train)
    submit_df[class_name] = classifier.predict_proba(X_test)[:, 1]

CV score for class toxic is 0.9689649816587739
CV score for class severe_toxic is 0.9845102846644188
CV score for class obscene is 0.9842854110524597
CV score for class threat is 0.9834109963254125
CV score for class insult is 0.9751851202723028
CV score for class identity_hate is 0.9735204222232172
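
The scores above are ROC AUC values, since the loop passes `scoring='roc_auc'`. As a rough illustration of what that number measures, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher predicted probability, then averages over classes. The function `roc_auc` and the labels and probabilities are hypothetical. The pairwise loop is quadratic and meant only for small examples, not as a replacement for scikit-learn's scorer.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for two classes.
y_true = {"toxic": [1, 0, 0, 1], "threat": [0, 0, 1, 0]}
y_prob = {"toxic": [0.9, 0.2, 0.6, 0.4], "threat": [0.1, 0.3, 0.2, 0.05]}

per_class = {name: roc_auc(y_true[name], y_prob[name]) for name in y_true}
mean_auc = sum(per_class.values()) / len(per_class)
print(per_class, mean_auc)
```

Because AUC depends only on the ranking of the scores, a rare label such as `threat` can still reach a high value even when its predicted probabilities are all small.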


In [101]:
submit_df.to_csv(os.path.join(result_dir, "tdidf_logistic.csv"), index=False) 

submit_df[:10]

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.999177,0.239051,0.996915,0.051012,0.958278,0.338641
1,0000247867823ef7,0.009816,0.003505,0.004842,0.001986,0.008483,0.003472
2,00013b17ad220c46,0.005898,0.001192,0.003339,0.000509,0.00328,0.0008
3,00017563c3f7919a,0.00591,0.002618,0.003964,0.00125,0.004519,0.001265
4,00017695ad8997eb,0.030232,0.002802,0.008956,0.001256,0.01477,0.001863
5,0001ea8717f6de06,0.009118,0.00153,0.00462,0.00102,0.008135,0.001481
6,00024115d4cbde0f,0.006794,0.000948,0.004256,0.000615,0.006288,0.001653
7,000247e83dcc1211,0.540047,0.00459,0.041849,0.003581,0.117467,0.006196
8,00025358d4737918,0.02341,0.003529,0.011891,0.002299,0.015518,0.005727
9,00026d1092fe71cc,0.004735,0.001052,0.003505,0.000798,0.004109,0.001385


# Following this tutorial should land you around 1600th to 1700th place (out of 2600 teams)!

## Next steps

* Extract a wider variety of features
    * Comment length (character count), emoticons, id
* Try feature extraction methods other than BoW
    * http://catindog.hatenablog.com/entry/2017/03/31/221644
    * fasttext https://www.kaggle.com/mschumacher/using-fasttext-models-for-robust-embeddings
* Reduce the dimensionality of the features
    * PCA, Topic Model, ...
    * https://www.kaggle.com/jagangupta/understanding-the-topic-of-toxicity
* Try a variety of learning models
    * https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout
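
As a starting point for the first item in the list above, here is a minimal sketch of surface features computed from the raw comment text, before cleaning. Comment length and emoticon count come from the list. The uppercase ratio, exclamation count, the emoticon regular expression, and the function name `surface_features` are assumptions added for illustration.

```python
import re

# Assumed emoticon pattern covering a few common ASCII smileys such as :) ;-( :D
EMOTICON_PATTERN = re.compile(r"[:;=]-?[\)\(DPp]")


def surface_features(comment):
    """Compute a few length- and style-based features from the raw (uncleaned) comment."""
    n_chars = len(comment)
    n_upper = sum(1 for ch in comment if ch.isupper())
    return {
        "n_chars": n_chars,
        "n_words": len(comment.split()),
        "upper_ratio": n_upper / n_chars if n_chars else 0.0,
        "n_exclamations": comment.count("!"),
        "n_emoticons": len(EMOTICON_PATTERN.findall(comment)),
    }


# Hypothetical comment for illustration.
print(surface_features("STOP editing my page!!! :("))
```

These features need to be computed on the original text, because lowercasing and punctuation splitting in `clean` discard the information they rely on.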