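## Text Classification with a Neural Network Using BoW Input
- A simple example using BoW (count-based encoding: each review becomes a vector of per-token counts over the vocabulary, rather than a strict one-hot vector)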
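
To make the count-based BoW encoding concrete, here is a minimal pure-Python sketch. The toy documents and the helper names `build_vocabulary` and `count_vector` are hypothetical and only illustrate the idea; the notebook itself relies on scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def count_vector(tokens, vocabulary):
    """Count-based BoW: one slot per vocabulary entry, holding that token's count."""
    counts = Counter(tokens)
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical, already-tokenized toy documents.
docs = [["good", "app", "good"], ["bad", "app"]]
vocabulary = build_vocabulary(docs)
vectors = [count_vector(doc, vocabulary) for doc in docs]

# A binary (presence/absence) variant clips every count to 1.
binary = [[min(v, 1) for v in vec] for vec in vectors]

print(vocabulary)
print(vectors)
print(binary)
```

The count vector keeps how often a token occurs, while the binary variant only records whether it occurs at all.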

References:
- [機械学習・深層学習による自然言語処理入門 ~scikit-learnとTensorFlowを使った実践プログラミング](https://www.amazon.co.jp/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%BB%E6%B7%B1%E5%B1%A4%E5%AD%A6%E7%BF%92%E3%81%AB%E3%82%88%E3%82%8B%E8%87%AA%E7%84%B6%E8%A8%80%E8%AA%9E%E5%87%A6%E7%90%86%E5%85%A5%E9%96%80-scikit-learn%E3%81%A8TensorFlow%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%9F%E5%AE%9F%E8%B7%B5%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%9F%E3%83%B3%E3%82%B0-Compass-Data-Science/dp/4839966605/)
- [07_simple_neural_network.ipynb](https://colab.research.google.com/drive/1GtFEsTloBKvDD6W_y2F2iNWgCUA7vvpO)

In [2]:
import string

import pandas as pd
#import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import load_model, Sequential

In [5]:
%%time
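## Load and clean the data, then select columns (review_body, star_rating)

def filter_by_ascii_rate(text, threshold=0.9):
    """Return True when at most `threshold` of the characters are printable ASCII."""
    if not text:  # guard against empty strings (avoids division by zero)
        return False
    printable_chars = set(string.printable)
    rate = sum(c in printable_chars for c in text) / len(text)
    return rate <= threshold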

def load_dataset(filename, n=5000, state=6):
    df = pd.read_csv(filename, sep='\t')

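    # Convert the 5-level rating to binary labels: 1-2 stars -> 0 (negative), 4-5 stars -> 1 (positive).
    mapping = {1: 0, 2: 0, 4: 1, 5: 1}
    df = df[df.star_rating != 3].copy()  # drop neutral reviews; copy avoids SettingWithCopyWarning
    df.star_rating = df.star_rating.map(mapping)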

    # Keep reviews that are mostly non-ASCII (a rough proxy for Japanese text).
    is_jp = df.review_body.apply(filter_by_ascii_rate)
    df = df[is_jp]

    # Shuffle, then keep up to n reviews per class (a balanced sample).
    df = df.sample(frac=1, random_state=state)  # shuffle
    grouped = df.groupby('star_rating')
    df = grouped.head(n=n)
    return df.review_body.values, df.star_rating.values

def clean_html(html, strip=False):
    soup = BeautifulSoup(html, 'html.parser')
    # Strip HTML tags
    text = soup.get_text(strip=strip)
    return text

url = 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz'
x, y = load_dataset(url)
x = [clean_html(text, strip=True) for text in x]

CPU times: user 12.3 s, sys: 731 ms, total: 13.1 s
Wall time: 24.4 s
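
The two filtering rules used in `load_dataset` can be checked on a few hand-made inputs. The sketch below re-implements the ASCII-rate test and the rating mapping in plain Python; the sample strings and the helper name `ascii_rate` are hypothetical and are not taken from the dataset.

```python
import string

PRINTABLE = set(string.printable)

def ascii_rate(text):
    """Fraction of characters in `text` that are printable ASCII."""
    return sum(c in PRINTABLE for c in text) / len(text)

# Hypothetical sample reviews.
samples = ["Great product, works well.", "この商品は最高です", "USBケーブル"]

# Same rule as filter_by_ascii_rate: keep a text when its ASCII rate is <= 0.9.
kept = [s for s in samples if ascii_rate(s) <= 0.9]

# Rating mapping: 3-star reviews are dropped, the rest become binary labels.
mapping = {1: 0, 2: 0, 4: 1, 5: 1}
ratings = [1, 2, 3, 4, 5]
labels = [mapping[r] for r in ratings if r != 3]

print(kept)
print(labels)
```

Mixed strings such as product names with a few Latin letters still pass the filter, because only texts that are more than 90% ASCII are discarded.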


In [6]:
## The resulting data
df = pd.DataFrame({'review_body':x, 'star_rating':y})
print(df.shape)
df.head()

(10000, 2)


| | review_body | star_rating |
|---|---|---|
| 0 | 現在、地球温暖化の悪影響が、ここまで顕在化しているとは想像していませんでした。特に、このまま... | 1 |
| 1 | このアクション映画ほど、男気を感じたものはあったのだろうか。シンプル構成で時間をたっぷりと使... | 1 |
| 2 | このアプリを入れて以来、かなりお世話になりました。私の場合、PCで作成したデータや画像を出先... | 0 |
| 3 | 取り出してさっと撮影することが必要な旅行用に不可欠だと思います。 | 1 |
| 4 | Kindleで使用しています。複数のCloudが管理できたり、ワードやエクセルが使えたりと素... | 1 |


In [7]:
## Train/Test Split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print(len(x_train), len(x_test), len(y_train), len(y_test))

8000 2000 8000 2000


In [8]:
%%time
## Tokenize x_train and x_test

t = Tokenizer(wakati=True)

def tokenize(text):
    return t.tokenize(text)

# Vectorizing dataset.
vectorizer = CountVectorizer(tokenizer=tokenize)  # count-based vectorizer
x_train = vectorizer.fit_transform(x_train)    # Learn the vocabulary dictionary and return document-term matrix.
x_test = vectorizer.transform(x_test)            # Transform documents to document-term matrix.
x_train = x_train.toarray()
x_test = x_test.toarray()

CPU times: user 2min 7s, sys: 552 ms, total: 2min 7s
Wall time: 2min 7s


In [9]:
print(x_train.shape, x_test.shape)    # 40980 is the number of unique tokens (vocabulary size)

(8000, 40980) (2000, 40980)
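
Calling `toarray()` turns the sparse document-term matrix into a dense one, so its memory footprint is worth estimating. The arithmetic below is a rough sketch that assumes 8-byte integer entries (`CountVectorizer` defaults to `int64`); the actual dtype in a given environment is an assumption here.

```python
rows, cols = 8000, 40980     # shape of x_train reported above
bytes_per_value = 8          # assumption: int64 entries

dense_bytes = rows * cols * bytes_per_value
dense_gib = dense_bytes / 2**30

print(f"{dense_bytes:,} bytes = {dense_gib:.2f} GiB")
```

Under that assumption the training matrix alone takes roughly 2.4 GiB, almost all of it zeros, which is why larger vocabularies usually call for keeping the matrix sparse or limiting the vocabulary.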


In [10]:
x_train    # this is the training data that is ultimately fed to the model

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [11]:
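# Tokens extracted from x_train (first 10 vocabulary entries)
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
vectorizer.get_feature_names()[:10]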

['\x03',
 '\x08',
 '\x1a',
 '\x1a\x1a',
 ' ',
 '  ',
 '   ',
 '    ',
 '     ',
 '      ']
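
The first vocabulary entries above are control characters and whitespace runs, which carry no sentiment. One possible way to drop them before vectorizing is sketched below. The helpers `is_noise_token` and `drop_noise_tokens` are hypothetical and are not part of the original notebook; the rule (discard tokens made up only of Unicode "C" and "Z" category characters) is an assumption about what counts as noise.

```python
import unicodedata

def is_noise_token(token):
    """True if every character is a control/format/private-use (C*) or separator (Z*) character."""
    return all(unicodedata.category(ch)[0] in ("C", "Z") for ch in token)

def drop_noise_tokens(tokens):
    """Keep only tokens that contain at least one visible character."""
    return [tok for tok in tokens if not is_noise_token(tok)]

# Hypothetical token stream mixing noise with real tokens.
sample = ["\x03", "\x1a\x1a", " ", "   ", "アプリ", "最高", "!"]
cleaned = drop_noise_tokens(sample)
print(cleaned)
```

A filter like this could be applied inside the `tokenize` function passed to `CountVectorizer`, which would also shrink the vocabulary slightly.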

In [12]:
# x_train represented as a DataFrame
pd.DataFrame(x_train, columns=vectorizer.get_feature_names()).head()

Unnamed: 0,,,,,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,...,￣)＿,￣;）,￣▽￣),￥,👀,💢,💦,😞,😢,󾭛
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
## Setting hyperparameters
vocab_size = len(vectorizer.vocabulary_)  # number of unique tokens -> the model's input size
label_size = len(set(y_train))            # number of target classes (binary)
print(vocab_size, label_size)

40980 2


In [18]:
## Model definition
model = Sequential()
model.add(Dense(units=16, activation='relu', input_shape=(vocab_size,)))
model.add(Dense(units=label_size, activation='softmax'))
model

<tensorflow.python.keras.engine.sequential.Sequential at 0x7f9d42ac2550>

In [19]:
model.summary()

# 40980 -> 16 -> 2

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 16)                655696    
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 34        
Total params: 655,730
Trainable params: 655,730
Non-trainable params: 0
_________________________________________________________________


In [20]:
(40980+1) * 16   # (inputs + bias) * hidden units

655696
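
The same rule reproduces every number in the model summary. The sketch below applies it to both layers; `dense_params` is a hypothetical helper name, and the layer sizes are the ones reported above (40980 -> 16 -> 2).

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input plus one bias, per unit."""
    return (n_inputs + 1) * n_units

layer_sizes = [40980, 16, 2]  # input -> hidden -> output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

Nearly all parameters sit in the first layer, so the vocabulary size dominates the model size.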

In [43]:
%%time
## Train the model

epochs = 100
batch_size = 32
save_path = '/tmp/model'
log_dir = 'logs'

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    x_train, 
    y_train,
    validation_split=0.2,
    epochs=epochs,
    batch_size=batch_size,
    callbacks=[
               EarlyStopping(monitor='val_loss', patience=3),
               ModelCheckpoint(
                   filepath=save_path,
                   monitor='val_loss',
                   save_best_only=True,
                   mode='min'
               ),
               TensorBoard(log_dir=log_dir)
    ]
)

Epoch 1/100
INFO:tensorflow:Assets written to: /tmp/model/assets
Epoch 2/100
INFO:tensorflow:Assets written to: /tmp/model/assets
Epoch 3/100
Epoch 4/100
Epoch 5/100
CPU times: user 43.6 s, sys: 8.52 s, total: 52.1 s
Wall time: 9.69 s
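
The log is consistent with how `EarlyStopping(monitor='val_loss', patience=3)` behaves: checkpoints were written after epochs 1 and 2, and training stopped after three further epochs without improvement. The following is a simplified pure-Python sketch of that patience rule, not the Keras implementation; the validation losses are hypothetical values chosen only to mimic the log.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed, under a simple patience rule."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improving for two epochs, then getting worse.
hypothetical_val_losses = [0.45, 0.40, 0.42, 0.44, 0.47, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(hypothetical_val_losses)
print(stop_epoch, best_epoch)
```

With these values the best epoch is 2 and training stops at epoch 5, so the weights left in memory at the end of training are not the best ones unless they are restored from the checkpoint.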


In [46]:
history

<tensorflow.python.keras.callbacks.History at 0x7ff451202fa0>

In [47]:
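## Accuracy on the test data
# Note: this evaluates the last-epoch weights; EarlyStopping does not restore the best
# weights unless restore_best_weights=True. The best checkpoint is at save_path (see load_model).
y_pred = model.predict(x_test)
accuracy_score(y_test, y_pred.argmax(axis=1))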

0.831
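
The accuracy computation has two steps: take the most probable class for each row, then count the matches against the true labels. A minimal pure-Python sketch follows; the probabilities and labels are hypothetical, and `argmax` and `accuracy` are illustrative helpers rather than the NumPy and scikit-learn functions used in the notebook.

```python
def argmax(values):
    """Index of the largest value (the predicted class)."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical softmax outputs for four reviews, and their true labels.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
y_true = [0, 1, 1, 1]

y_hat = [argmax(p) for p in probs]
acc = accuracy(y_true, y_hat)
print(y_hat, acc)
```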

In [45]:
## Prediction
text = 'このアプリ超最高！'  # "This app is absolutely the best!"
vec = vectorizer.transform([text])
model.predict(vec.toarray())

array([[0.17112611, 0.8288739 ]], dtype=float32)
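
The two numbers are the softmax probabilities for class 0 (negative, 1-2 stars) and class 1 (positive, 4-5 stars). The sketch below turns them into a label and shows what the sparse categorical cross-entropy would be for each possible true label; the label strings are only illustrative names.

```python
import math

probs = [0.17112611, 0.8288739]  # model output shown above
label_names = {0: "negative (1-2 stars)", 1: "positive (4-5 stars)"}

# Predicted class: the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)

# Sparse categorical cross-entropy for one sample is -log(p[true_label]).
loss_if_positive = -math.log(probs[1])
loss_if_negative = -math.log(probs[0])

print(label_names[predicted])
print(round(loss_if_positive, 4), round(loss_if_negative, 4))
```

The review is classified as positive with about 83% probability, and the loss would be small if the true label is positive and much larger if it is negative.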