# Chapter 8: Machine Learning
In this chapter, we use sentence polarity dataset v1.0 from the Movie Review Data published by Bo Pang and Lillian Lee to work on the task of classifying sentences as positive or negative (polarity analysis).

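## 70. Obtaining and formatting the data
Using the gold-standard data for sentence polarity analysis, create the labeled data file (sentiment.txt) as follows.

1. Prepend the string "+1 " to each line of rt-polarity.pos (the polarity label "+1", then a space, then the positive sentence)
2. Prepend the string "-1 " to each line of rt-polarity.neg (the polarity label "-1", then a space, then the negative sentence)
3. Concatenate the results of steps 1 and 2 and shuffle the lines randomly

After creating sentiment.txt, check the number of positive examples (positive sentences) and negative examples (negative sentences).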

In [1]:
import codecs
import pandas as pd
import numpy as np

def knock_70():
    # The dataset files are read as cp1252 (Windows-1252).
    with codecs.open('../data/rt-polaritydata/rt-polarity.pos', 'r', 'cp1252') as f_pos:
        rows_pos = [['+1', sentence.rstrip('\n')] for sentence in f_pos]
    with codecs.open('../data/rt-polaritydata/rt-polarity.neg', 'r', 'cp1252') as f_neg:
        rows_neg = [['-1', sentence.rstrip('\n')] for sentence in f_neg]
    df = pd.DataFrame(rows_pos + rows_neg, columns=['Sentiment', 'Review'])

    # Shuffle the rows with a fixed seed so the result is reproducible.
    np.random.seed(1)
    df = df.reindex(np.random.permutation(df.index))
    # to_csv writes a header row and quotes each review, so the file is
    # space-separated CSV rather than the bare "label sentence" format.
    df.to_csv('../work/sentiment.txt', sep=' ', index=False)

    print('Positive examples: ' + str((df['Sentiment'] == '+1').sum()))
    print('Negative examples: ' + str((df['Sentiment'] == '-1').sum()))

knock_70()

Positive examples: 5331
Negative examples: 5331


In [2]:
!head -n 20 ../work/sentiment.txt

Sentiment Review
-1 "to portray modern women the way director davis has done is just unthinkable . "
-1 "kenneth branagh's energetic sweet-and-sour performance as a curmudgeonly british playwright grounds this overstuffed , erratic dramedy in which he and his improbably forbearing wife contend with craziness and child-rearing in los angeles . "
+1 " . . . with "" the bourne identity "" we return to the more traditional action genre . "
+1 "you can watch , giggle and get an adrenaline boost without feeling like you've completely lowered your entertainment standards . "
+1 "fun , flip and terribly hip bit of cinematic entertainment . "
+1 "fisher has bared his soul and confronted his own shortcomings here in a way . . . that feels very human and very true to life . "
+1 "while the plot follows a predictable connect-the-dots course . . . director john schultz colors the picture in some evocative shades . "
-1 "the impact of the armenian genocide is diluted by too much stage busine
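
The file shown above carries a header row and CSV quoting, whereas the task statement asks for lines that begin with the label followed directly by the sentence. A minimal sketch of the literal format, using only the standard library, might look like the following. The function names `build_sentiment_lines` and `count_labels` and the toy sentences are assumptions made for illustration; they are not part of the notebook.

```python
import random

def build_sentiment_lines(pos_lines, neg_lines, seed=1):
    """Prefix each sentence with its label and shuffle the combined list."""
    labeled = ['+1 ' + s.rstrip('\n') for s in pos_lines]
    labeled += ['-1 ' + s.rstrip('\n') for s in neg_lines]
    rng = random.Random(seed)  # fixed seed keeps the shuffle reproducible
    rng.shuffle(labeled)
    return labeled

def count_labels(lines):
    """Return (number of positive lines, number of negative lines)."""
    n_pos = sum(1 for line in lines if line.startswith('+1 '))
    n_neg = sum(1 for line in lines if line.startswith('-1 '))
    return n_pos, n_neg

# Toy input standing in for rt-polarity.pos / rt-polarity.neg (hypothetical sentences)
lines = build_sentiment_lines(['a fine film . \n', 'great fun . \n'],
                              ['a dull mess . \n'])
print(count_labels(lines))  # (2, 1)
```

Note that a file written this way would need a different reader than the `pd.read_csv(..., sep=' ')` call used in the later cells, which relies on the quoted CSV layout.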

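## 71. Stop words
Create a list of English stop words (a stop list) as you see fit. Then implement a function that returns true if the word (string) given as its argument is in the stop list and false otherwise. Also write tests for that function.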

In [3]:
# Build the stop list once, rather than on every call.
STOP_WORDS = set("""
, . a an the at to on of for in by with above under
this that i you it he she they am are is was were
and but though although then so as
" ' - – ( ) *
""".split())

def is_stop_word(word):
    # None and the empty string carry no information.
    if not word:
        return True
    # Treat whitespace-only and single-character tokens as stop words.
    if len(word.rstrip()) <= 1:
        return True
    return word.lower() in STOP_WORDS

assert is_stop_word('a')
assert is_stop_word('the')
assert is_stop_word('i')
assert is_stop_word('I')
assert is_stop_word('YOU')
assert is_stop_word('"')
assert is_stop_word('*')
assert is_stop_word('')
assert is_stop_word('　')
assert is_stop_word(None)
assert is_stop_word('e')

assert not is_stop_word('good')
assert not is_stop_word('bad')


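## 72. Feature extraction
Design features that seem useful for polarity analysis and extract them from the training data. Removing stop words from each review and stemming the remaining words would be the minimal baseline.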

In [7]:
import pandas as pd
from stemming.porter2 import stem
from sklearn.feature_extraction.text import TfidfVectorizer

def stem_tokenizer(sentence):
    # Split on spaces, drop stop words, and stem what remains.
    return [stem(word) for word in sentence.split(' ') if not is_stop_word(word)]

def extract_feature(text, min_df=1):
    # TF-IDF over the stemmed tokens; returns the fitted vectorizer and a dense matrix.
    vectorizer = TfidfVectorizer(tokenizer=stem_tokenizer, min_df=min_df)
    vector = vectorizer.fit_transform(text)
    return vectorizer, vector.toarray()

def knock_72():
    df = pd.read_csv('../work/sentiment.txt', sep=' ')
    text = df['Review'].tolist()[:2]  # first two reviews only, to keep the output readable
    vectorizer, feature = extract_feature(text)
    print('Original sentences')
    print(text)
    print('\nMapping')
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    print(vectorizer.get_feature_names_out().tolist())
    print('\nFeatures')
    print(feature)

knock_72()

Original sentences
['to portray modern women the way director davis has done is just unthinkable . ', "kenneth branagh's energetic sweet-and-sour performance as a curmudgeonly british playwright grounds this overstuffed , erratic dramedy in which he and his improbably forbearing wife contend with craziness and child-rearing in los angeles . "]

Mapping
['angel', 'branagh', 'british', 'child-rear', 'contend', 'crazi', 'curmudgeon', 'davi', 'director', 'done', 'dramedi', 'energet', 'errat', 'forbear', 'ground', 'has', 'his', 'improb', 'just', 'kenneth', 'los', 'modern', 'overstuf', 'perform', 'playwright', 'portray', 'sweet-and-sour', 'unthink', 'way', 'which', 'wife', 'women']

Features
[[ 0.          0.          0.          0.          0.          0.          0.
   0.31622777  0.31622777  0.31622777  0.          0.          0.          0.
   0.          0.31622777  0.          0.          0.31622777  0.          0.
   0.31622777  0.          0.          0.          0.31622777  0.
   0.31622777  0.31622777  0.

## 73. Training
Train a logistic regression model using the features extracted in exercise 72.

In [8]:
import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def learn_by_logistic_regression():
    df = pd.read_csv('../work/sentiment.txt', sep=' ')
    X = df['Review'].tolist()
    y = df['Sentiment'].tolist()  # read_csv parses the labels as the integers 1 and -1
    # Note: the vectorizer is fitted on all rows, so its IDF statistics also see the test split.
    vectorizer, feature = extract_feature(X)
    # sentiment.txt is already shuffled, so a plain 70/30 split without reshuffling is enough.
    X_train, X_test, y_train, y_test = train_test_split(feature, y, test_size=0.3, shuffle=False)

    lr = LogisticRegression(solver='liblinear', random_state=0)
    model = lr.fit(X_train, y_train)
    print('Accuracy(train): ' + str(lr.score(X_train, y_train)))
    print('Accuracy(test): ' + str(lr.score(X_test, y_test)))

    # Persist both the model and the vectorizer for the later exercises.
    joblib.dump(model, '../work/model.pkl')
    joblib.dump(vectorizer, '../work/vectorizer.pkl')

learn_by_logistic_regression()

Accuracy(train): 0.894144445933
Accuracy(test): 0.764926539544


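## 74. Prediction
Using the logistic regression model trained in exercise 73, implement a program that computes the polarity label of a given sentence ("+1" for a positive example, "-1" for a negative example) together with its predicted probability.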

In [9]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

def predict_by_logistic_regression(text):
    # Load the model and vectorizer saved in exercise 73.
    lr = joblib.load('../work/model.pkl')
    vectorizer = joblib.load('../work/vectorizer.pkl')
    feature = vectorizer.transform([text]).toarray()
    # predict_proba columns follow lr.classes_, i.e. [-1, 1].
    return lr.predict(feature), lr.predict_proba(feature)

label, prob = predict_by_logistic_regression("to portray modern women the way director davis has done is just unthinkable . ")
print(str(label) + ' ' + str(prob))

[1] [[ 0.3663977  0.6336023]]


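## 75. Feature weights
In the logistic regression model trained in exercise 73, check the 10 features with the highest weights and the 10 features with the lowest weights.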
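
One way to approach this is to pair each feature name with its weight and sort. The sketch below assumes the names and weights are already available as plain sequences; with the scikit-learn objects from exercise 73 they could plausibly come from `vectorizer.get_feature_names_out()` and `lr.coef_[0]`. The helper `top_and_bottom_features` and the toy weights are hypothetical.

```python
def top_and_bottom_features(feature_names, weights, k=10):
    """Return the k highest-weighted and k lowest-weighted (name, weight) pairs."""
    ranked = sorted(zip(feature_names, weights),
                    key=lambda pair: pair[1], reverse=True)
    # ranked[::-1] lists the pairs from lowest weight upward.
    return ranked[:k], ranked[::-1][:k]

# Hypothetical names and weights, for illustration only
names = ['good', 'bad', 'dull', 'fun', 'film']
weights = [1.8, -2.1, -1.5, 1.2, 0.1]
top, bottom = top_and_bottom_features(names, weights, k=2)
print(top)     # [('good', 1.8), ('fun', 1.2)]
print(bottom)  # [('bad', -2.1), ('dull', -1.5)]
```

For a binary model whose classes are -1 and 1, large positive weights push a sentence toward the positive label and large negative weights toward the negative label.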

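## 76. Labeling
Apply the logistic regression model to the training data and output the gold label, the predicted label, and the predicted probability in tab-separated format.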
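
A minimal sketch of the output step, assuming the probability of the positive class is already known for each example (with scikit-learn it would likely be the second column of `predict_proba`, since the columns follow `classes_`). The function `format_predictions` and the sample values are hypothetical, and the 0.5 threshold is an assumption that mirrors the usual default decision rule.

```python
def format_predictions(gold_labels, positive_probs, threshold=0.5):
    """Yield 'gold<TAB>predicted<TAB>probability' lines.

    positive_probs holds P(label = +1) for each example; the reported
    probability is that of the predicted label.
    """
    for gold, p_pos in zip(gold_labels, positive_probs):
        if p_pos >= threshold:
            predicted, prob = '+1', p_pos
        else:
            predicted, prob = '-1', 1.0 - p_pos
        yield '{}\t{}\t{:.4f}'.format(gold, predicted, prob)

# Hypothetical gold labels and positive-class probabilities
rows = list(format_predictions(['+1', '-1', '-1'], [0.9, 0.2, 0.7]))
print('\n'.join(rows))
```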

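## 77. Measuring accuracy
Write a program that takes the output of exercise 76 and computes the accuracy of the predictions, along with the precision, recall, and F1 score for the positive class.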
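
The four metrics follow from the counts of true positives, false positives, false negatives, and true negatives. The sketch below assumes each input line has the form `gold<TAB>predicted<TAB>probability` with labels written as `+1` and `-1`; the function name `evaluate` and the sample lines are hypothetical.

```python
def evaluate(lines):
    """Return (accuracy, precision, recall, f1) for the '+1' class."""
    tp = fp = fn = tn = 0
    for line in lines:
        gold, predicted = line.rstrip('\n').split('\t')[:2]
        if predicted == '+1':
            if gold == '+1':
                tp += 1
            else:
                fp += 1
        else:
            if gold == '+1':
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labeled output: one TP, one TN, one FP, one FN
sample = ['+1\t+1\t0.9', '-1\t-1\t0.8', '-1\t+1\t0.7', '+1\t-1\t0.6']
print(evaluate(sample))  # (0.5, 0.5, 0.5, 0.5)
```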

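## 78. 5-fold cross-validation
In the experiments of exercises 76-77, the examples used for training were also used for evaluation, so the evaluation is not valid. In other words, it measures how well the classifier memorizes its training examples, not how well the model generalizes. Therefore, use 5-fold cross-validation to compute the accuracy, precision, recall, and F1 score of the polarity classifier.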
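
The core of the procedure is to split the examples into five folds, train on four, and evaluate on the held-out one, rotating through all five. The sketch below shows only that rotation in plain Python and reports accuracy per fold; precision, recall, and F1 could be computed from the same held-out predictions. The majority-label classifier is a stand-in chosen so the example stays self-contained, not the logistic regression model from exercise 73, and all function names here are hypothetical.

```python
from collections import Counter

def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate_accuracy(X, y, train_fn, predict_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, return per-fold accuracy."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(1 for i in test_idx if predict_fn(model, X[i]) == y[i])
        accuracies.append(correct / len(test_idx))
    return accuracies

# Stand-in classifier (assumption): always predict the majority training label.
def train_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Toy data: 10 examples, 7 positive and 3 negative
X = list(range(10))
y = ['+1', '-1', '+1', '+1', '-1', '+1', '+1', '-1', '+1', '+1']
accuracies = cross_validate_accuracy(X, y, train_majority, predict_majority)
print(accuracies)  # [0.5, 1.0, 0.5, 0.5, 1.0]
```

Contiguous folds assume the data is already shuffled, as sentiment.txt is after exercise 70.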

## 79. Plotting a precision-recall graph
Plot a precision-recall graph by varying the classification threshold of the logistic regression model.
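
Each threshold yields one (precision, recall) point: an example is predicted positive when its positive-class probability is at least the threshold. The sketch below computes those points in plain Python under that assumption; the function `precision_recall_points` and the toy probabilities are hypothetical, and precision is defined as 1.0 when nothing is predicted positive, which is a convention rather than a requirement.

```python
def precision_recall_points(gold_labels, positive_probs, thresholds):
    """Return (threshold, precision, recall) for the '+1' class at each threshold."""
    n_pos = sum(1 for g in gold_labels if g == '+1')
    points = []
    for t in thresholds:
        # Gold labels of the examples predicted positive at this threshold
        predicted_pos = [g for g, p in zip(gold_labels, positive_probs) if p >= t]
        tp = sum(1 for g in predicted_pos if g == '+1')
        precision = tp / len(predicted_pos) if predicted_pos else 1.0
        recall = tp / n_pos if n_pos else 0.0
        points.append((t, precision, recall))
    return points

# Hypothetical gold labels and positive-class probabilities
gold = ['+1', '+1', '-1', '+1', '-1']
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
points = precision_recall_points(gold, probs, [0.3, 0.5, 0.7])
for t, precision, recall in points:
    print('threshold={:.1f} precision={:.3f} recall={:.3f}'.format(t, precision, recall))
```

The resulting recall and precision values can then be passed as the x and y series to a plotting library such as matplotlib.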