# [Document Classification] Text Classification using LSTM with Chainer

![title](https://miro.medium.com/max/960/1*HgXA9v1EsqlrRDaC_iORhQ.png)

## ---------------------------------------

## Step 0. Task Definition

### Goal  
Build an AI that accurately classifies text into categories  
### Business Impact  
High impact in the media domain  
### Algorithm  
CNN, RNN, LSTM   
### Framework  
Chainer  

![title](https://www.preferred-networks.jp/wp-content/uploads/2017/02/chainer_red_h.png)

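## Chainer's Basic Building Blocks
  
### Chain  
Groups a combination of Links (a neural network) under a single Chain object, so the whole network can be managed as one unit  
### Links    
Functions that hold learnable parameters to be optimized, such as the weights w and the bias b in y = x1w1 + x2w2 + b  
### Function  
Where Links are functions that hold learned parameters (W, b, and so on), Functions are functions that hold no parameters  
### Variable  
A variable: the data that flows through Links and Functions  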
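
The relationship between these pieces can be sketched in plain Python. This is only an analogy: the classes `TinyLinear` and `TinyChain` and the function `relu` below are hypothetical stand-ins written for illustration, not Chainer's API.

```python
# Plain-Python analogy; these classes are illustrative and are not Chainer's API.

class TinyLinear:
    """Link-like object: it owns learnable parameters (weights w and bias b)."""

    def __init__(self, w, b):
        self.w = list(w)  # learnable weights
        self.b = b        # learnable bias

    def __call__(self, x):
        # y = x1*w1 + x2*w2 + b
        return sum(xi * wi for xi, wi in zip(x, self.w)) + self.b


def relu(v):
    """Function-like object: a pure computation with no parameters."""
    return max(0.0, v)


class TinyChain:
    """Chain-like object: it groups links so they are managed as one model."""

    def __init__(self):
        self.l1 = TinyLinear([0.5, -1.0], 0.25)

    def __call__(self, x):
        return relu(self.l1(x))


model = TinyChain()
y = model([2.0, 1.0])  # 2.0*0.5 + 1.0*(-1.0) + 0.25
print(y)
```

Only `TinyLinear` carries state that training would change; `relu` can be reused anywhere because it has nothing to learn.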
  

## ---------------------------------------

## Step 1. Data Preparation

In [46]:
data = [
    ["Could I exchange business cards, if you don’t mind?", 1],
    ["I'm calling regarding the position advertised in the newspaper.", 0],
    ["I'd like to apply for the programmer position.", 0],
    ["Could you tell me what an applicant needs to submit?", 1],
    ["Could you tell me what skills are required?", 1],
    ["We will assist employees with training and skill development.", 0],
    ["What kind of in-house training system do you have for your new recruits?", 1],
    ["For office equipment I think rental is better.", 0],
    ["Is promotion based on the seniority system?", 1],
    ["What's still pending from February?", 1],
    ["Which is better, rental or outright purchase?", 1],
    ["General Administration should do all the preparations for stockholder meetings.", 0],
    ["One of the elevators is out of order. When do you think you can have it fixed?", 1],
    ["General Administration is in charge of office building maintenance.", 0],
    ["Receptionists at the entrance hall belong to General Administration.", 0],
    ["Who is managing the office supplies inventory?", 1],
    ["Is there any difference in pay between males and females?", 1],
    ["The General Administration Dept. is in charge of office maintenance.", 0],
    ["Have you issued the meeting notice to shareholders?", 1],
    ["What is an average annual income in Japan?", 1],
    ["Many Japanese companies introduced the early retirement system.", 0],
    ["How much did you pay for the office equipment?", 1],
    ["Is the employee training very popular here?", 1],
    ["What kind of amount do you have in mind?", 1],
    ["We must prepare our financial statement by next Monday.", 0],
    ["Would it be possible if we check the draft?", 1],
    ["The depreciation of fixed assets amounts to $5 million this year.", 0],
    ["Please expedite the completion of the balance sheet.", 0],
    ["Could you increase the maximum lending limit for us?", 1],
    ["We should cut down on unnecessary expenses to improve our profit ratio.", 0],
    ["What percentage of revenue are we spending for ads?", 1],
    ["One of the objectives of internal auditing is to improve business efficiency.", 0],
    ["Did you have any problems finding us?", 1],
    ["How is your business going?", 1],
    ["Not really well. I might just sell the business.", 0],
    ["What line of business are you in?", 1],
    ["He has been a valued client of our bank for many years.", 0],
    ["Would you like for me to show you around our office?", 1],
    ["It's the second door on your left down this hall.", 0],
    ["This is the … I was telling you about earlier.", 0],
    ["We would like to take you out to dinner tonight.", 0],
    ["Could you reschedule my appointment for next Wednesday?", 1],
    ["Would you like Japanese, Chinese, Italian, French or American?", 1],
    ["Is there anything you prefer not to have?", 1],
    ["Please give my regards to the staff back in San Francisco.", 0],
    ["This is a little expression of our thanks.", 0],
    ["Why don’t you come along with us to the party this evening?", 1],
    ["Unfortunately, I have a prior engagement on that day.", 0],
    ["I am very happy to see all of you today.", 0],
    ["It is a great honor to be given this opportunity to present here.", 0],
    ["The purpose of this presentation is to show you the new direction our business is taking in 2009.", 0],
    ["Could you please elaborate on that?", 1],
    ["What's your proposal?", 1],
    ["That's exactly the point at issue here.", 0],
    ["What happens if our goods arrive after the delivery dates?", 1],
    ["I'm afraid that's not accpetable to us.", 0],
    ["Does that mean you can deliver the parts within three months?", 1],
    ["We can deliver parts in as little as 5 to 10 business days.", 0],
    ["We've considered all the points you've put forward and our final offer is $900.", 0],
    ["Excuse me but, could I have your name again, please?", 1],
    ["It's interesting that you'd say that.", 0],
    ["The pleasure's all ours. Thank you for coimng today.", 0],
    ["Could you spare me a little of your time？", 1],
    ["That's more your area of expertise than mine, so I'd like to hear more.", 0],
    ["I'd like to talk to you about the new project.", 0],
    ["What time is convenient for you?", 1],
    ["How’s 3:30 on Tuesday the 25th?", 1],
    ["Could you inform us of the most convenient dates for our visit?", 1],
    ["Fortunately, I was able to return to my office in time for the appointment.", 0],
    ["I am sorry, but we have to postpone our appointment until next month.", 0],
    ["Great, see you tomorrow then.", 0],
    ["Great, see you tomorrow then.", 1],
    ["I would like to call on you sometime in the morning.", 0],
    ["I'm terribly sorry for being late for the appointment.", 0],
    ["Could we reschedule it for next week?", 1],
    ["I have to fly to New York tomorrow, can we reschedule our meeting when I get back?", 1],
    ["I'm looking forward to seeing you then.", 0],
    ["Would you mind writing down your name and contact information?", 1],
    ["I'm sorry for keeping you waiting.", 0],
    ["Did you find your way to our office wit no problem?", 1],
    ["I need to discuss this with my superior. I'll get back to you with our answer next week.", 0],
    ["I'll get back to you with our answer next week.", 0],
    ["Thank you for your time seeing me.", 0],
    ["What does your company do?", 1],
    ["Could I ask you to make three more copies of this?", 1],
    ["We have appreciated your business.", 0],
    ["When can I have the contract signed?", 1],
    ["His secretary is coming down now.", 0],
    ["Please take the elevator on your right to the 10th floor.", 0],
    ["Would you like to leave a message?", 1],
    ["It's downstairs in the basement.", 0],
    ["Your meeting will be held at the main conference room on the 15th floor of the next building.", 0],
    ["Actually, it is a bit higher than expected. Could you lower it?", 1],
    ["We offer the best price anywhere.", 0],
    ["All products come with a 10-year warranty.", 0],
    ["It sounds good, however, is made to still think; seem to have a problem.", 0],
    ["Why do you need to change the unit price?", 1],
    ["Could you please tell me the gist of the article you are writing?", 1],
    ["Would you mind sending or faxing your request to me?", 1],
    ["About when are you publishing this book?", 1],
    ["May I record the interview?", 1]
]

len(data)

101

## ---------------------------------------

## Step 2. Build Neural Network Architecture

![title](https://cdn-images-1.medium.com/max/2600/1*sO-SP58T4brE9EHazHSeGA.png)

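### Steps for Building a Classifier

1. Decide the network shape
2. Put arbitrary numbers into the parameters (Weight & Bias)
3. Feed in a dataset and check the output
4. Adjust the parameters appropriately, based on the deviation from the correct labels   
*Then repeat until training seems to be going reasonably well
  
  
In practice, Chainer takes care of everything except "1. Decide the network shape." The programmer's job is to:

1. Decide the network shape
2. Prepare the labeled data
3. Choose an optimization method
4. Decide the optimization settings (number of epochs, mini-batch size, and so on)  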
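
The generic loop described in the first list (set the shape, initialize the parameters, run the data forward, adjust from the error, repeat) can be sketched without any framework. The example below is a minimal illustration on a toy problem: the data, the learning rate `lr`, and the single linear unit are assumptions chosen for the sketch, not part of the original notebook.

```python
# Step 1: network shape. Here, a single linear unit y = w * x + b.
# Step 2: put arbitrary numbers into the parameters.
w, b = 0.0, 0.0

# Toy data generated by t = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]
lr = 0.05  # learning rate

for epoch in range(2000):
    grad_w = 0.0
    grad_b = 0.0
    for x, t in zip(xs, ts):
        y = w * x + b        # Step 3: feed data in and look at the output
        err = y - t          # Step 4: deviation from the correct answer
        grad_w += 2 * err * x / len(xs)  # gradient of the mean squared error
        grad_b += 2 * err / len(xs)
    # Adjust the parameters against the gradient, then repeat.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 4), round(b, 4))
```

Chainer's optimizer and trainer automate the gradient computation and the update lines, which is why the programmer's list is mostly about choices rather than arithmetic.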

### A) Define Network Architecture

In [112]:
import chainer
from chainer import Chain
import chainer.links as L
import chainer.functions as F

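# Define the model class
class LSTM_TextClassifier(Chain):
    
    """Neural network that takes word IDs as input and classifies the category of the text."""
    # vocab_size: number of distinct words, vector_size: word-vector dimension,
    # hidden_size: hidden-layer dimension, out_size: number of categories
    def __init__(self, vocab_size, vector_size, hidden_size, out_size):      
        
        super(LSTM_TextClassifier, self).__init__(
            # EmbedID acts like Linear on a one-hot input: pass the ID of the active element instead of the vector.
            # ignore_label=-1 matches the padding ID (-1), so padding is embedded as a zero vector.
            wv = L.EmbedID(vocab_size, vector_size, ignore_label=-1),
            # Recurrent layer (LSTM)
            vh = L.LSTM(vector_size, hidden_size),
            # Fully connected layers (a weight matrix plus a bias), so use Linear
            hh = L.Linear(hidden_size, hidden_size),  # defined but not used in the forward pass
            hy = L.Linear(hidden_size, out_size)
        )
    
    # Forward pass: x is a mini-batch of padded word-ID sequences; returns the class scores
    def __call__(self, x):
        
        # Encode: feed the words to the LSTM one time step at a time
        x = F.transpose_sequence(x)
        self.vh.reset_state()
        for word in x:
            e = self.wv(word)
            h = self.vh(e)
        
        # Classify from the final hidden state, once the whole sentence has been read
        # (the return must sit outside the loop, otherwise only the first time step is used)
        y = self.hy(h)
        return y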
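
The remark that an embedding layer is a linear layer applied to a one-hot input can be checked by hand. The sketch below uses a hypothetical 4-word, 3-dimensional weight table `W` and plain lists; the helper names are made up for illustration. It also mimics how an ignored ID (such as a padding ID) maps to a zero vector.

```python
# Hypothetical 4-word vocabulary embedded into 3 dimensions.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def row_times_matrix(v, M):
    """Row vector v multiplied by matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]


def embed(word_id, ignore_label=-1):
    """Look up a row of W; the ignored ID maps to a zero vector."""
    if word_id == ignore_label:
        return [0.0] * len(W[0])
    return W[word_id]


word_id = 2
via_one_hot = row_times_matrix(one_hot(word_id, len(W)), W)
via_lookup = embed(word_id)
print(via_one_hot, via_lookup, embed(-1))
```

Both routes give the same vector; the lookup simply skips the multiplications by zero.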
        

\* Chainer assumes mini-batch processing, so the input data has one extra (batch) dimension. The model code itself does not split the data into batches.
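
To make the extra dimension concrete: a mini-batch arrives as one row per sentence, while the forward loop wants one row per time step. For sequences of equal length, that reordering is a plain transpose. The sketch below shows the idea with nested lists; the two tiny sequences are made-up values, and this is not a reimplementation of `F.transpose_sequence`.

```python
# Batch-major layout: one row per sentence (values are made-up word IDs).
batch = [
    [-1, 0, 1],  # sample 0, pre-padded with -1
    [2, 3, 4],   # sample 1
]

# Time-major layout: one row per time step, holding one word ID per sample.
time_major = [list(step) for step in zip(*batch)]

for t, words in enumerate(time_major):
    print("step", t, "->", words)
```

Each iteration of the forward loop then processes the t-th word of every sentence in the batch at once.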

###### Split the data into texts and labels

In [113]:
N = len(data)
texts = []
labels = []
for e in data:
    texts.append(e[0])
    labels.append(e[1])

###### Create a regex-based preprocessing function: normalize the text, tokenize it, remove symbols and numbers, and exclude stopwords

In [114]:
import re

def seq2word(text):
    stopwords = ["i", "a", "an", "the", "and", "or", "if", "is", "are", "am", "it", "this", "that", "of", "from", "in", "on"]
    # Lowercase
    text = text.lower()
    # Remove newlines
    text = text.replace("\n", "")
    # Replace ASCII punctuation and symbols with spaces (re.sub() substitutes every match)
    text = re.sub(re.compile(r"[!-\/:-@\[-`{-~]"), " ", text)
    # Split on single spaces into a list; consecutive or trailing spaces yield empty-string tokens
    text = text.split(" ")
    
    words = []
    for word in text:
        if ( re.compile(r"^.*[0-9]+.*$").fullmatch(word) is not None ):
            continue
        if word in stopwords:
            continue
        words.append(word)
    return words
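
Because `seq2word` splits on a single space, every replaced punctuation mark can leave an empty-string token behind. One possible variant, sketched below under the assumption that those empty tokens are unwanted, splits on runs of whitespace instead. The name `tokenize` is hypothetical, and stopword and number filtering are left out to keep the comparison short.

```python
import re

# Same ASCII punctuation ranges as the original pattern.
PUNCT = re.compile(r"[!-/:-@\[-`{-~]")


def tokenize(text):
    """Lowercase, blank out ASCII punctuation, split on runs of whitespace."""
    text = PUNCT.sub(" ", text.lower())
    return text.split()  # no argument: never produces empty strings


sentence = "What's your proposal?"
with_empty = PUNCT.sub(" ", sentence.lower()).split(" ")  # original splitting style
without_empty = tokenize(sentence)
print(with_empty)
print(without_empty)
```

Adopting such a variant would change the vocabulary size and the word IDs shown in the outputs of this notebook.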

###### Run the text data through the function above to build the word dictionary

In [115]:
corpus = {}
for text in texts:
    words = seq2word(text)
    for word in words:
        if word not in corpus:
            corpus[word] = len(corpus)

In [116]:
import pandas as pd

print("Number of words:", len(corpus))
inside_corpus = pd.Series(corpus)
inside_corpus.head(6)

Number of words: 370


                    4
able              286
about             183
accpetable        240
actually          343
administration     60
dtype: int64

###### Convert each sentence into an array of word IDs

In [117]:
texts_vec = []
for text in texts:
    words = seq2word(text)
    words_id = []
    for word in words:
        words_id.append(corpus[word])
    texts_vec.append(words_id)

In [118]:
print("Number of documents represented as word-ID arrays:", len(texts_vec))
inside_texts_vec = pd.Series(texts_vec)
inside_texts_vec.head(6)

Number of documents represented as word-ID arrays: 101


0              [0, 1, 2, 3, 4, 5, 6, 7, 4]
1                [8, 9, 10, 11, 12, 13, 4]
2          [14, 15, 16, 17, 18, 19, 11, 4]
3    [0, 5, 20, 21, 22, 23, 24, 16, 25, 4]
4            [0, 5, 20, 21, 22, 26, 27, 4]
5      [28, 29, 30, 31, 32, 33, 34, 35, 4]
dtype: object
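
To sanity-check an ID array, it helps to map the IDs back to words. The sketch below inverts a small slice of the dictionary; the entries are the ones that the second sentence ("I'm calling regarding the position advertised in the newspaper.") produces under `seq2word`, and the name `id2word` is introduced here for illustration.

```python
# Slice of the word dictionary covering the second sentence (row 1 in the output above).
corpus_slice = {
    "": 4,
    "m": 8,
    "calling": 9,
    "regarding": 10,
    "position": 11,
    "advertised": 12,
    "newspaper": 13,
}

# Invert the mapping: ID -> word.
id2word = {word_id: word for word, word_id in corpus_slice.items()}

decoded = [id2word[i] for i in [8, 9, 10, 11, 12, 13, 4]]
print(decoded)
```

The decoded list makes two preprocessing side effects visible: "I'm" is reduced to the fragment "m", and the final period becomes an empty-string token with ID 4.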

###### [Normalization] Align the sentence lengths (pre-padding)

In [119]:
max_text_size = 0

for text_vec in texts_vec:
    if max_text_size < len(text_vec):
        max_text_size = len(text_vec)
for words_id in texts_vec:
    while len(words_id) < max_text_size:
        words_id.insert(0, -1)
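
The same pre-padding can be written as a small function that returns new lists instead of modifying them in place. This is a sketch with a hypothetical name, `pad_front`, and made-up input; for the padding to be embedded as zeros by `L.EmbedID`, its `ignore_label` would need to equal the padding ID used here.

```python
def pad_front(seqs, pad_id=-1):
    """Left-pad every sequence with pad_id up to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [[pad_id] * (width - len(s)) + list(s) for s in seqs]


padded = pad_front([[0, 1, 2], [3], [4, 5]])
print(padded)
```

Padding at the front keeps the real words at the end of the sequence, closest to the final LSTM state that the classifier reads.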

###### [Speed-up] Improve computational efficiency by converting to NumPy arrays

In [120]:
import numpy as np

texts_vec = np.array(texts_vec, dtype="int32")
labels = np.array(labels, dtype="int32")
dataset = []

# Pair each vectorized sentence with its label
for v, l in zip(texts_vec, labels):
    dataset.append((v, l))

###### [Hyperparameters] Define the number of epochs, the batch size, and the layer sizes
![title](https://cdn-images-1.medium.com/max/800/1*i_lp_hUFyUD_Sq4pLer28g.png)

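#### Number of epochs: a counter that increases by one each time the training data has been passed through once   
#### Batch size: the number of samples used in one training step; the input data is split into bundles (mini-batches) of this size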

In [121]:
EPOCH_NUM = 10
BATCH_SIZE = 5
EMBED_SIZE = 200
HIDDEN_SIZE = 100
OUT_SIZE = 2
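
As a rough worked example of how these two settings interact, the arithmetic below assumes the 81-sample training split that Step 3 creates (`N - 20` with `N = 101`). It is generic arithmetic rather than a description of Chainer's iterator, whose exact update count depends on how it handles the final partial batch.

```python
import math

EPOCH_NUM = 10
BATCH_SIZE = 5
n_train = 101 - 20  # training samples after holding out 20 for validation

# Mini-batches needed to see every training sample once.
batches_per_epoch = math.ceil(n_train / BATCH_SIZE)

# Upper bound on parameter updates over the whole run.
total_updates = batches_per_epoch * EPOCH_NUM

print(n_train, batches_per_epoch, total_updates)
```

With so few updates on so little data, the results are sensitive to both settings.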

###### [Optimization]: Choose the gradient-descent optimizer (Adam this time)

![title](https://camo.qiitausercontent.com/2d025ad02dd4676ecc34a095954b08b7457c7ddc/687474703a2f2f73656261737469616e72756465722e636f6d2f636f6e74656e742f696d616765732f323031362f30312f736164646c655f706f696e745f6576616c756174696f6e5f6f7074696d697a6572732e676966)

In [122]:
from chainer import optimizers

model = L.Classifier(LSTM_TextClassifier(
    vocab_size=len(corpus),
    vector_size=EMBED_SIZE,
    hidden_size=HIDDEN_SIZE,
    out_size=OUT_SIZE
))

optimizer = optimizers.Adam()
optimizer.setup(model)

<chainer.optimizers.adam.Adam at 0x11f4f1908>
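
For intuition about what Adam does on each update, here is a minimal sketch of the standard Adam rule applied to a one-dimensional toy function, f(x) = x². The function name `adam_minimize`, the toy objective, and the step size `lr=0.1` are assumptions chosen for the illustration; `optimizers.Adam()` in the cell above runs with the library's own defaults.

```python
import math


def adam_minimize(grad, x, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a 1-D function with the Adam update rule, given its gradient."""
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# f(x) = x^2 has gradient 2x and its minimum at x = 0.
x_final = adam_minimize(lambda x: 2 * x, x=1.0, steps=500)
print(x_final)
```

The division by the square root of the second moment is what gives each parameter its own effective step size.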

## Step 3. Model Evaluation and Tuning Parameters
Chainer's architecture

![title](https://camo.qiitausercontent.com/d3af5d369c038fed989ea7839b1566ef49a6c331/68747470733a2f2f71696974612d696d6167652d73746f72652e73332e616d617a6f6e6177732e636f6d2f302f31373933342f61373531646633312d623939392d663639322d643833392d3438386332366231633438612e706e67)

In [123]:
from chainer import training
from chainer.training import extensions

train, test = chainer.datasets.split_dataset_random(dataset, N-20)
train_iter = chainer.iterators.SerialIterator(train, BATCH_SIZE)
test_iter = chainer.iterators.SerialIterator(test, BATCH_SIZE, repeat=False, shuffle=False)
updater = training.StandardUpdater(train_iter, optimizer, device=-1)
trainer = training.Trainer(updater, (EPOCH_NUM, "epoch"), out="result")
trainer.extend(extensions.Evaluator(test_iter, model, device=-1))
trainer.extend(extensions.LogReport(trigger=(1, "epoch")))
# Print the epoch, training loss, validation loss, training accuracy, validation accuracy, and elapsed time
trainer.extend(extensions.PrintReport(["epoch", "main/loss", "validation/main/loss", "main/accuracy", "validation/main/accuracy", "elapsed_time"]))
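
The call `split_dataset_random(dataset, N-20)` places `N - 20 = 81` randomly chosen examples in `train` and the remaining 20 in `test`. The sketch below shows the same idea in plain Python. It is not Chainer's implementation: the function `split_random` and its `seed` parameter are hypothetical, and integers stand in for the (vector, label) pairs.

```python
import random


def split_random(items, first_size, seed=0):
    """Shuffle the indices once, then cut them into two disjoint subsets."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    first = [items[i] for i in order[:first_size]]
    second = [items[i] for i in order[first_size:]]
    return first, second


N = 101
examples = list(range(N))  # stand-ins for the (vector, label) pairs
train, test = split_random(examples, N - 20)
print(len(train), len(test))
```

With only 20 validation examples, each one accounts for 5 percentage points of validation accuracy, so the metric moves in coarse steps.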

In [124]:
trainer.run()

epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           0.73926     0.774251              0.411765       0.35                      0.324076
2           0.711038    0.711952              0.5375         0.35                      0.486369
3           0.688513    0.694492              0.45           0.35                      0.635327
4           0.685435    0.728399              0.5375         0.35                      0.781089
5           0.692864    0.713429              0.5375         0.35                      0.924365
6           0.682774    0.745042              0.564706       0.35                      1.09401
7           0.693282    0.702431              0.5375         0.35                      1.24367
8           0.686603    0.716328              0.525          0.35                      1.41009
9           0.682935    0.725888              0.55

# [Conclusion]
Validation accuracy is only 35% and does not improve across epochs.  
The model needs tuning.
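
One quick diagnostic before tuning is to compare the score with what a constant prediction would achieve. On 20 validation examples, 35% corresponds to 7 correct answers, which is what a model would score if it always predicted a class that happens to occur 7 times. The sketch below uses hypothetical validation labels to show the comparison; the real label counts in the validation split are not known from the log, so this is one possible explanation rather than a finding.

```python
from collections import Counter


def constant_prediction_accuracy(labels, predicted_class):
    """Accuracy of a model that outputs predicted_class for every example."""
    return sum(1 for label in labels if label == predicted_class) / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical validation labels: 7 of class 0 and 13 of class 1.
val_labels = [0] * 7 + [1] * 13

acc_always_0 = constant_prediction_accuracy(val_labels, 0)
baseline = majority_baseline_accuracy(val_labels)
print(acc_always_0, baseline)
```

If the real validation labels show a similar imbalance, a flat 35% would suggest the network is ignoring its input, and the forward pass and padding handling are worth checking before the hyperparameters.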