# 第8章: 機械学習

本章では，Bo Pang氏とLillian Lee氏が公開しているMovie Review Dataのsentence polarity dataset v1.0を用い，文を肯定的（ポジティブ）もしくは否定的（ネガティブ）に分類するタスク（極性分析）に取り組む．

In [89]:
import random

# 70. データの入手・整形
文に関する極性分析の正解データを用い，以下の要領で正解データ（sentiment.txt）を作成せよ．

1. rt-polarity.posの各行の先頭に"+1 "という文字列を追加する（極性ラベル"+1"とスペースに続けて肯定的な文の内容が続く）
2. rt-polarity.negの各行の先頭に"-1 "という文字列を追加する（極性ラベル"-1"とスペースに続けて否定的な文の内容が続く）
3. 上述1と2の内容を結合（concatenate）し，行をランダムに並び替える

sentiment.txtを作成したら，正例（肯定的な文）の数と負例（否定的な文）の数を確認せよ

In [90]:
# nkfコマンドでバイナリからutf8に変換
with open('rt-polaritydata/pos', 'r') as f:
    sentiment_pos = ["+1 " + x[:-1] for x in f.readlines()]
with open('rt-polaritydata/neg', 'r') as f:
    sentiment_neg = ["-1 " + x[:-1] for x in f.readlines()]
sentiment_pos[0] = sentiment_pos[0].replace("\ufeff","")
sentiment_neg[0] = sentiment_neg[0].replace("\ufeff","")
sentiment = sentiment_pos + sentiment_neg
random.shuffle(sentiment)
with open('sentiment.txt', 'w') as f:
    f.writelines([x + "\n" for x in sentiment])

In [91]:
with open('sentiment.txt', 'r') as f:
    sentiment = [x[:-1] for x in f.readlines()]
    count = len(sentiment)
    count_pos = len([x for x in sentiment if x[:2] == "+1"])
    count_neg = len([x for x in sentiment if x[:2] == "-1"])
count, count_pos, count_neg

(10661, 5330, 5331)

# 71. ストップワード
英語のストップワードのリスト（ストップリスト）を適当に作成せよ．さらに，引数に与えられた単語（文字列）がストップリストに含まれている場合は真，それ以外は偽を返す関数を実装せよ．さらに，その関数に対するテストを記述せよ．

In [92]:
with open('stopwords.csv', 'r') as f:
    stopwords = [x[:-1] for x in f.readlines()]
def validate(word):
    return word in stopwords
# 数字、括弧、文字化け（漢字入り）なども外したい

In [93]:
validate('you')

True

In [94]:
validate('aaaaa')

False

# 72. 素性抽出
極性分析に有用そうな素性を各自で設計し，学習データから素性を抽出せよ．素性としては，レビューからストップワードを除去し，各単語をステミング処理したものが最低限のベースラインとなるであろう．

In [95]:
def get_bow(wordlist):
    row = {}
    for word in wordlist:
        if not validate(word):
            w = stem(word)
            if w in row.keys():
                row[w] += 1
            else:
                row[w] = 1
    return row

In [96]:
from stemming.porter2 import stem
wordlist = []
label = []
bow = []
for line in sentiment:
    line_list = line[:-1].split(" ")
    label.append(line_list[0])
    bow.append(get_bow(line_list[1:]))

In [97]:
bow_df = pd.DataFrame(bow).fillna(0)

# 73. 学習
72で抽出した素性を用いて，ロジスティック回帰モデルを学習せよ．

In [98]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(bow_df, label)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

# 74. 予測
73で学習したロジスティック回帰モデルを用い，与えられた文の極性ラベル（正例なら"+1"，負例なら"-1"）と，その予測確率を計算するプログラムを実装せよ

In [99]:
logreg.predict(bow_df[:5])

array(['-1', '+1', '+1', '-1', '+1'], 
      dtype='<U2')

In [100]:
bow_df[:5]

Unnamed: 0,!,"""",#3,#9,$1,$100,$20,$40,$50-million,$7,...,佖ico,層arm,層hat,灣,疎n,疳ice,租irect-to-video,粗m,駘an,imo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [101]:
label[:5]

['-1', '+1', '+1', '-1', '+1']

In [109]:
def get_bow_df(text):
    text_list = [x for x in text.split(" ") if x in bow_df.columns]
    bow_dict = get_bow(text_list)
    return pd.DataFrame([bow_dict], columns=bow_df.columns).fillna(0)

In [110]:
logreg.predict(get_bow_df("i have a pen"))

array(['-1'], 
      dtype='<U2')

# 75. 素性の重み
73で学習したロジスティック回帰モデルの中で，重みの高い素性トップ10と，重みの低い素性トップ10を確認せよ．

In [134]:
weight = pd.DataFrame(logreg.coef_[0], index=bow_df.columns, columns=["weight"])

In [137]:
# 重みの高い素性トップ10
weight.sort_values("weight", ascending=False).head(10)

Unnamed: 0,weight
bore,2.27105
dull,1.914285
wast,1.895045
mediocr,1.891142
fail,1.889047
routin,1.816378
suppos,1.740451
flat,1.726293
plod,1.667226
disguis,1.657306


In [138]:
# 重みの低い素性トップ10
weight.sort_values("weight").head(10)

Unnamed: 0,weight
refresh,-2.192161
engross,-2.06322
unexpect,-1.988156
glorious,-1.823523
wonder,-1.755877
remark,-1.66494
examin,-1.62454
smarter,-1.620871
beauti,-1.61393
resist,-1.565932


# 76. ラベル付け
学習データに対してロジスティック回帰モデルを適用し，正解のラベル，予測されたラベル，予測確率をタブ区切り形式で出力せよ．

In [164]:
predict_proba = logreg.predict_proba(bow_df)
predict = logreg.predict(bow_df)
with open('76.csv', 'w') as f:
    for i, l in enumerate(label):
        line = "\t".join([str(l), predict[i], str(max(predict_proba[i]))])
        f.write(line + "\n")

# 77. 正解率の計測
76の出力を受け取り，予測の正解率，正例に関する適合率，再現率，F1スコアを求めるプログラムを作成せよ．

In [192]:
result77 = pd.read_csv('76.txt', header=None, sep='\t', names=("label", "predict", "predict_proba"))

In [198]:
def evaluate(result):
    TP = len(result[(result['label'] == 1) & (result['predict'] == 1)])
    FP = len(result[(result['label'] == -1) & (result['predict'] == 1)])
    TN = len(result[(result['label'] == -1) & (result['predict'] == -1)])
    FN = len(result[(result['label'] == 1) & (result['predict'] == -1)])
    # print(len(result) == (TP+FP+TN+FN))
    # 正解率
    rate = (TP + TN) / len(result)
    # 適合率
    precision = TP/(TP + FP)
    # 再現率
    recall=TP/(TP + FN)
    # F1
    F = 2 * recall * precision / (recall + precision)
    return {"rate":rate, "precision":precision, "recall":recall, "F":F}

In [199]:
value = evaluate(result77)

In [200]:
value["rate"]

0.9462526967451459

In [201]:
value["precision"]

0.9502176793488548

In [202]:
value["recall"]

0.9418386491557224

In [203]:
value["F"]

0.9460096108546123

# 78. 5分割交差検定
76-77の実験では，学習に用いた事例を評価にも用いたため，正当な評価とは言えない．すなわち，分類器が訓練事例を丸暗記する際の性能を評価しており，モデルの汎化性能を測定していない．そこで，5分割交差検定により，極性分類の正解率，適合率，再現率，F1スコアを求めよ．