# Sprint21 Natural Language Processing

# Data preparation

Download the compressed archive from the URL below.

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

Extract the downloaded archive and place the resulting directory at the same level as this sprint21.ipynb.
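The download and extraction steps can also be scripted. The following is a minimal sketch using only the standard library; the helper names `download` and `extract` and the local file name `aclImdb_v1.tar.gz` are assumptions made for illustration, not part of the original notebook.

```python
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download(url, archive_path):
    """Download url to archive_path unless that file already exists (hypothetical helper)."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir (hypothetical helper)."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        tar.extractall(str(dest_dir))
```

Calling `extract(download(DATA_URL, "aclImdb_v1.tar.gz"), ".")` from the notebook's directory should leave an `aclImdb` folder next to the notebook, which is the layout the loading code expects.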

# Importing libraries

In [52]:
from sklearn.datasets import load_files
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import itertools
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
import lightgbm as lgb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from gensim.models import word2vec
import numpy as np
from sklearn.preprocessing import normalize
import re

# Loading the data

In [11]:
train_review = load_files('./aclImdb/train/', encoding='utf-8')
x_train, y_train = train_review.data, train_review.target
test_review = load_files('./aclImdb/test/', encoding='utf-8')
x_test, y_test = test_review.data, test_review.target

In [12]:
# Test output
print("x : {}".format(x_train[0]))
print(np.array(x_train).shape,np.array(x_test).shape,np.array(y_train).shape,np.array(y_test).shape)

x : Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty.
(25000,) (25000,) (25000,) (25000,)


# Problem 1: Scratch implementation of BoW

First, compute the BoW with sklearn.

In [13]:
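# Toy data
mini_dataset = ['This movie is very good.','This film is a good','Very bad. Very, very bad.']

# Instantiate (this token_pattern also keeps one-character words such as "a")
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
# Fit and transform
bow = (vectorizer.fit_transform(mini_dataset)).toarray()
# Convert to a DataFrame
# (on scikit-learn 1.0 and later, get_feature_names_out() replaces get_feature_names())
df = pd.DataFrame(bow, columns=vectorizer.get_feature_names())
# Display
display(df)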

Unnamed: 0,a,bad,film,good,is,movie,this,very
0,0,0,0,1,1,1,1,1
1,1,0,1,1,1,0,1,0
2,0,2,0,0,0,0,0,3


Next, build it from scratch.

In [14]:
def bow(data):
    """Compute the BoW.
    Parameters
    -----------
    data : list of documents (strings)
    """
    ## Build the word list
    # Convert to lowercase
    # Remove "!"
    # Split each string on single spaces into a list
    row_data = xxxxxxxxxxxxxxxxxxxxxxxxxxxx
    # Flatten into a one-dimensional list (the word list)
    feature_names = xxxxxxxxxxxxxxxxxxxxxxxxxxxx

    ## Compute the BoW
    bow = []
    # Loop over the documents one at a time
    for index,row in enumerate(data):
        bow.append([])
        # Loop over the word list
        for feature_name in feature_names:
            # Count how many times the word occurs
            num = xxxxxxxxxxxxxxxxxxxxxxxxxxxx
            # Append the count
            bow[index].append(num)
    return feature_names,bow

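# Define the toy data
mini_dataset = ['This movie is SOOOO funny!!!','What a movie! I never','best movie ever!!!!! this movie']
# Run the bow function (the result gets its own name so the function is not overwritten)
feature_names, bow_matrix = bow(mini_dataset)
# Convert to a DataFrame
df = pd.DataFrame(bow_matrix, columns=feature_names)
# Display
display(df)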

Unnamed: 0,i,never,funny,best,is,movie,ever,soooo,what,a,this
0,0,0,1,0,1,1,0,1,0,0,1
1,1,1,0,0,0,1,0,0,1,1,0
2,0,0,0,1,0,2,1,0,0,0,1
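The scratch function above has its key expressions elided, so the following is only one possible way to carry out the steps its comments describe (lowercase, remove `!`, split on whitespace, count). The name `bow_counts` is hypothetical, and this sketch orders columns by first appearance, whereas the recorded output shows a different column order.

```python
def bow_counts(documents):
    """Return (feature_names, count_matrix) for a list of strings."""
    # Lowercase, drop "!", and split on whitespace
    tokenized = [doc.lower().replace("!", "").split() for doc in documents]
    # dict.fromkeys keeps first-seen order, so the columns are deterministic
    feature_names = list(dict.fromkeys(tok for doc in tokenized for tok in doc))
    # Count each vocabulary word in each document
    matrix = [[doc.count(name) for name in feature_names] for doc in tokenized]
    return feature_names, matrix


docs = ['This movie is SOOOO funny!!!',
        'What a movie! I never',
        'best movie ever!!!!! this movie']
names, counts = bow_counts(docs)
for row in counts:
    print(dict(zip(names, row)))
```

Counting on the token list rather than on the raw string avoids substring matches, for example `a` being counted inside `what`.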


# Problem 2: Computing TF-IDF

In [15]:
# Use the stopwords from the nltk library
nltk.download('stopwords')
stop_words = stopwords.words('english')
print("stop word : {}".format(stop_words))

stop word : ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\root\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
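# Compute TF-IDF
vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=5000)
X_train = vectorizer.fit_transform(x_train)
# Reuse the vocabulary and IDF weights fitted on the training set.
# The results recorded below came from a run that called fit_transform on x_test,
# which refits the vocabulary so that test columns no longer match the training columns.
X_test = vectorizer.transform(x_test)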

In [17]:
# Test output
print(X_train.shape, X_test.shape)

(25000, 5000) (25000, 5000)
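Matching shapes do not by themselves guarantee that a given column refers to the same word in both matrices. A model trained on `X_train` reads each column index as one specific word, so the test features need to be built with the vocabulary learned from the training data. The toy sketch below uses plain Python with hypothetical helpers `fit_vocabulary` and `transform` (it is not scikit-learn's implementation) to show how fitting a vocabulary separately on the test documents changes what each column means.

```python
def fit_vocabulary(documents):
    """Map each distinct lowercase token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Count vocabulary words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


train_docs = ["good movie", "bad movie"]
test_docs = ["awful film", "good film"]

train_vocab = fit_vocabulary(train_docs)  # columns: bad, good, movie
refit_vocab = fit_vocabulary(test_docs)   # columns: awful, film, good

shared = transform(test_docs, train_vocab)  # column 1 means "good", as in training
refit = transform(test_docs, refit_vocab)   # column 1 now means "film"
print(shared)
print(refit)
```

In this toy case both matrices have three columns, yet the word `good` sits in column 1 under the training vocabulary and in column 2 under the refit one.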


# Problem 3: Training with TF-IDF

In [18]:
# Train with LightGBM
# (named clf so the lightgbm module imported as lgb is not overwritten)
clf = lgb.LGBMClassifier().fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)

In [19]:
# Print the results
print("{}".format(clf.score(X_test, y_test)))
print(confusion_matrix(y_test, y_pred))

0.57248
[[6871 5629]
 [5059 7441]]
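The notebook imports `precision_score`, `recall_score`, and `f1_score` but only prints accuracy and the confusion matrix. As a rough sketch, the same metrics can be derived by hand from the recorded matrix, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, and treating label 1 as the positive class.

```python
# Confusion matrix copied from the recorded output above
confusion = [[6871, 5629],
             [5059, 7441]]

# Row = true label, column = predicted label
(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy : {:.5f}".format(accuracy))
print("precision: {:.5f}".format(precision))
print("recall   : {:.5f}".format(recall))
print("f1       : {:.5f}".format(f1))
```

The accuracy computed this way agrees with the score printed by the classifier, which is a quick consistency check on how the matrix is being read.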


# Problem 4: Scratch implementation of TF-IDF

First, compute TF-IDF with sklearn.

In [33]:
# Toy data
mini_dataset = ['This movie is SOOOO funny!!!','What a movie! I never','best movie ever!!!!! this movie']

In [45]:
# Instantiate
tfidf_model = TfidfVectorizer()
# Fit and transform
tfidf = tfidf_model.fit_transform(mini_dataset)
# Convert to a DataFrame
tfidf = pd.DataFrame(tfidf.toarray(), columns=tfidf_model.get_feature_names())
# Display
tfidf

Unnamed: 0,best,ever,funny,is,movie,never,soooo,this,what
0,0.0,0.0,0.504611,0.504611,0.298032,0.0,0.504611,0.38377,0.0
1,0.0,0.0,0.0,0.0,0.385372,0.652491,0.0,0.0,0.652491
2,0.501651,0.501651,0.0,0.0,0.592567,0.0,0.0,0.381519,0.0


Next, build it from scratch.

In [46]:
# Instantiate
cv_model = CountVectorizer()
# Fit and transform
cv = cv_model.fit_transform(mini_dataset)
# Convert to a dense array for easier handling
cv_array = cv.toarray()

# Compute the TF values
N = cv_array.shape[0]
tf = xxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Compute the IDF values
df = np.count_nonzero(cv_array, axis=0)
idf = xxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Normalize
tfidf = normalize(xxxxxxxxxxxxxxxxxxxxxxxxxxxx)
tfidf = pd.DataFrame(tfidf, columns=cv_model.get_feature_names())
tfidf

Unnamed: 0,best,ever,funny,is,movie,never,soooo,this,what
0,0.0,0.0,0.504611,0.504611,0.298032,0.0,0.504611,0.38377,0.0
1,0.0,0.0,0.0,0.0,0.385372,0.652491,0.0,0.0,0.652491
2,0.501651,0.501651,0.0,0.0,0.592567,0.0,0.0,0.381519,0.0
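The TF, IDF, and normalization expressions are elided in the cell above, so the following is a sketch of one formulation that reproduces the values in the table, not necessarily the author's code. It assumes raw term counts for TF, the smoothed IDF `ln((1 + N) / (1 + df)) + 1`, L2 normalization per row, and tokens of two or more word characters. The function name `tfidf_matrix` is hypothetical.

```python
import math
import re


def tfidf_matrix(documents):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    # Lowercase and keep tokens of at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocabulary}
    # Smoothed inverse document frequency (assumed formulation)
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
           for term in vocabulary}
    rows = []
    for doc in tokenized:
        weights = [doc.count(term) * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


mini_dataset = ['This movie is SOOOO funny!!!',
                'What a movie! I never',
                'best movie ever!!!!! this movie']
vocabulary, rows = tfidf_matrix(mini_dataset)
for row in rows:
    print([round(w, 6) for w in row])
```

Because `movie` appears in all three documents, its IDF under this formulation is exactly 1, which is why it carries the smallest weight in the first two rows.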


# Problem 5: Corpus preprocessing

From this problem onward, only a single review text is used, for simplicity.

In [50]:
# For simplicity, pick one review that appears to contain a URL
with_url = 0
for i, s in enumerate(x_train):
    if 'www' in s:
        with_url = i
        print('-----before processing')
        print(s)
        break
no_preprocessing = x_train[with_url]

# Remove URLs
after_preprocessing1 = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", no_preprocessing)
# Remove HTML tags (replaced with a space)
after_preprocessing2 = re.sub(r'<[^>]+>', " ", after_preprocessing1)
# Remove everything except digits, ASCII letters, and spaces
after_preprocessing3 = re.sub(r"[^0-9a-zA-Z ]", "", after_preprocessing2)
# Convert to lowercase
after_preprocessing = after_preprocessing3.lower()

print('-----after processing')
print(after_preprocessing)

-----before processing
I don't hand out "ones" often, but if there was ever a film that deserved this sort of attention, it's "Gas!" This is self-indulgent crap that reaches for some of the ambiance of M*A*S*H and falls completely flat on its face in the attempt.<br /><br />I see what Corman was going for - Malcolm Marmorstein and Elliott Gould tried to reproduce Gould's deathless role in the original movie version of M*A*S*H with a similar plot (in the movie "Whiffs" - look it up here in IMDb, http://www.imdb.com/title/tt0073891/ for more information).<br /><br />Marmorstein and Gould got closer to the brass ring with "Whiffs" than Corman did with "Gas!" but didn't quite get there. Neither one of those films even got close to the success of M*A*S*H.<br /><br />What's wrong with "Gas!"? What isn't? No one comes close to really acting at a level above junior high school theatrics. The production values stink. Someone else here mentioned the magically regenerating headlights on a getaway
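The four substitutions above can be wrapped in a reusable function, for example to apply them to every review rather than to one. The sketch below reuses the notebook's regular expressions unchanged; the function name `preprocess` and the shortened sample string are assumptions for illustration.

```python
import re

# Same patterns as in the cell above, compiled once
URL_PATTERN = re.compile(r"https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+")
TAG_PATTERN = re.compile(r"<[^>]+>")
NON_ALNUM_PATTERN = re.compile(r"[^0-9a-zA-Z ]")


def preprocess(text):
    """Remove URLs, HTML tags, and non-alphanumeric characters, then lowercase."""
    text = URL_PATTERN.sub("", text)        # remove URLs first, while ":" and "/" are intact
    text = TAG_PATTERN.sub(" ", text)       # tags become a space so words do not merge
    text = NON_ALNUM_PATTERN.sub("", text)  # keep digits, ASCII letters, and spaces
    return text.lower()


sample = ('look it up here in IMDb, http://www.imdb.com/title/tt0073891/ '
          'for more information).<br /><br />"Gas!"')
tokens = preprocess(sample).split()
print(tokens)
```

The order of the steps matters: removing punctuation before the URL step would strip `:` and `/`, leaving the URL's letters behind as spurious tokens.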

# Problem 6: Training Word2Vec

In [124]:
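# Word list (a single tokenized document)
word_list = [after_preprocessing.split(' ')]

# vector_size: number of dimensions of the word vectors
# min_count: drop words that occur fewer times than this
# window: how many words before and after the target word are used as context
# epochs: number of training passes (default: 5); adjust this when training is insufficient
model = word2vec.Word2Vec(word_list,min_count=1)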
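To make the `window` parameter concrete, the sketch below lists the context words that fall within a fixed window around each target word. It is a plain-Python illustration with a hypothetical helper `context_pairs`, not gensim's implementation, and real implementations may sample a smaller effective window at each position.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Opening words of the preprocessed review used in this notebook
tokens = "i dont hand out ones often".split()
pairs = context_pairs(tokens, window=2)
hand_context = [context for target, context in pairs if target == "hand"]
print(hand_context)
```

With `window=2`, the word `hand` is paired with the two words on either side of it, while words at the edges of the text get fewer context words.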

In [125]:
# Check the learned vector
model.wv['hand']

array([-0.00904435,  0.00534277,  0.00375054, -0.00605336,  0.00606624,
       -0.00113982,  0.00532705, -0.00198728,  0.00572351,  0.00578795,
       -0.00270007, -0.00911711, -0.00086151,  0.0028268 , -0.00730402,
       -0.00822396,  0.00099289,  0.00080993, -0.00464603, -0.00557673,
        0.00827145,  0.00953451, -0.00814784, -0.00735056,  0.0004159 ,
       -0.00356807, -0.00062897, -0.00580837, -0.00767971, -0.0048091 ,
        0.00368572,  0.00361012, -0.00872424, -0.00864171, -0.00645147,
        0.00244312, -0.00828653,  0.0004232 ,  0.00879883, -0.00906996,
       -0.00346835,  0.00891898,  0.00697609, -0.00029563, -0.00428081,
       -0.00887511, -0.00466706,  0.00085721,  0.0078073 , -0.00229957,
        0.00929993, -0.00791849,  0.00531931,  0.00555262, -0.00503432,
        0.00774953, -0.00636481,  0.00402099, -0.00960674, -0.000822  ,
       -0.00196953,  0.00656306,  0.00407827, -0.00955887, -0.0024127 ,
       -0.00562746, -0.00759491,  0.00170983,  0.0072332 , -0.00