# Document classification

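- An attempt at a CNN model that classifies PubMed abstracts related to pediatrics (ped), internal medicine (im), and surgery (surg)
- This version does not use pretrained Word2Vec embeddings
- The model is built with a Text CNN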

In [1]:
from keras.layers import Input, Dense, Embedding, Conv2D, MaxPool2D
from keras.layers import Reshape, Flatten, Dropout, Concatenate
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.models import Model
from sklearn.model_selection import train_test_split
from data_helpers import load_data

Using TensorFlow backend.


### Read the data

In [2]:
x, y, vocabulary, vocabulary_inv = load_data()

# x.shape -> (2509, 815)
# y.shape -> (2509, 3)
# len(vocabulary) -> 19924
# len(vocabulary_inv) -> 19924

In [3]:
x.shape

(2509, 815)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=1234)

# X_train.shape -> (2007, 815)
# y_train.shape -> (2007, 3)
# X_test.shape -> (502, 815)
# y_test.shape -> (502, 3)
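
`stratify=y` asks `train_test_split` to keep the class proportions roughly the same in the training and test splits. The pure-Python sketch below illustrates the idea on toy labels. The function name `stratified_split` and the class sizes are hypothetical, and this is a simplified illustration, not scikit-learn's actual implementation.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=1234):
    """Return (train_idx, test_idx), sampling the test set class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with hypothetical class sizes (not the real dataset)
labels = ["ped"] * 50 + ["im"] * 30 + ["surg"] * 20
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 20% of its samples to the test set, so a rare class cannot end up missing from either split by chance.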

In [5]:
X_train[0]

array([13025,  4625,  9916,  9219,  4388,  6189,  2956,  8938, 11734,
        7110,  9182,  9880,  5403,  8400, 18178, 19784,  9916, 13865,
        7269, 11743,  2191, 13891,  2192, 18237, 17552,  5152,  4391,
       14646,  2956, 14471,  3508, 13106, 18178, 13891, 19784, 19735,
       18178, 14768,  3383,  8560,  1097, 11882,  2897,  7005, 11438,
       17815, 19580, 16670, 18355, 11783, 13106, 18178,  2867,  2287,
       13106, 13869,  2191,  2208,  2192, 16568, 13167,  7240,  2956,
       18355, 12759,  2208, 11783,  4008,  4579,  9916, 13891, 10130,
       13167,  6202,   549, 14511,  4643,  2956, 14842,  2433,   549,
        4391, 14646,   549,  8623, 14217,   549,  2956,  4210,  2191,
       19174, 18742, 19239, 16523, 15258,  2192, 19580,  2930, 19174,
       17325,  6297, 17373, 16009, 13106,  1110,   549,   789, 17819,
       11439,   549,  2083, 15974, 19656, 15479,  2191,  1501,  1999,
       15973,  2192, 13469,  9103,  2191,  1687,  1895,  2192, 13106,
       15967, 19656,

In [6]:
vocabulary_inv[2190]

'<PAD/>'
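
The `data_helpers` module is not shown here, so the following is only a sketch of how padded index sequences like `X_train[0]` could be produced and decoded again. The helper names, the toy corpus, and the alphabetical vocabulary ordering are all assumptions. The real `load_data` may order the vocabulary differently.

```python
PAD = "<PAD/>"


def pad_sentences(sentences, pad_token=PAD):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]


def build_vocab(sentences):
    """Return (word -> index dict, index -> word list), ordered alphabetically."""
    vocab_inv = sorted({tok for s in sentences for tok in s})
    vocab = {tok: i for i, tok in enumerate(vocab_inv)}
    return vocab, vocab_inv


# Toy corpus (hypothetical, not the PubMed data)
sentences = [["kidney", "transplant", "outcomes"], ["surgery"]]
padded = pad_sentences(sentences)
toy_vocab, toy_vocab_inv = build_vocab(padded)

# Encode tokens as integer indices, then decode them while skipping padding
encoded = [[toy_vocab[tok] for tok in s] for s in padded]
decoded = [
    " ".join(toy_vocab_inv[i] for i in row if toy_vocab_inv[i] != PAD)
    for row in encoded
]
print(encoded)
print(decoded)
```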

### Hyperparameters

In [7]:
sequence_length = x.shape[1] # 815
vocabulary_size = len(vocabulary_inv) # 19924
embedding_dim = 200
filter_sizes = [3,4,5]
num_filters = 64
drop = 0.5

epochs = 20
batch_size = 32

### Model design
Keras offers two ways to build a model.

1. Sequential Models
2. Functional Models

The **Sequential model API** provides a very simple interface for building deep learning models, but it can only stack layers in a single linear direction. It is therefore hard to use in cases such as the following.

1. Models with multiple input sources
2. Models with multiple output layers
3. Models that share layers across several paths, and so on.

The alternative is the **Functional model API**, which allows more flexible model design.
It is not hard to use: a model is created with `keras.models.Model`, and you only need to define the **Input** and **Output** properly.

For a detailed guide to the **Functional model API**, see the official Keras documentation (https://keras.io/getting-started/functional-api-guide/).

The model below is built with the **Functional model API**.

In [8]:
inputs = Input(shape=(sequence_length,), dtype='int32')
embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim, input_length=sequence_length)(inputs)
reshape = Reshape((sequence_length, embedding_dim, 1))(embedding)

conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)

maxpool_0 = MaxPool2D(pool_size=(sequence_length - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(sequence_length - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(sequence_length - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_2)

concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)
dropout = Dropout(drop)(flatten)
output = Dense(units=y_train.shape[1], activation='softmax')(dropout)

# Create the model by specifying its input and output tensors
model = Model(inputs=inputs, outputs=output)
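
The layer shapes and parameter counts follow directly from the hyperparameters. The sketch below redoes that arithmetic in plain Python, assuming `'valid'` padding with stride 1 for the convolutions and three output classes. Because each pooling window is as long as the convolution output, every feature map is reduced to a single value.

```python
sequence_length = 815
vocabulary_size = 19924
embedding_dim = 200
filter_sizes = [3, 4, 5]
num_filters = 64
num_classes = 3

# Embedding: one embedding_dim-sized vector per vocabulary entry
embedding_params = vocabulary_size * embedding_dim

# 'valid' convolution over the token axis: output length = L - k + 1
conv_out_lengths = [sequence_length - fs + 1 for fs in filter_sizes]

# Each filter covers (fs x embedding_dim x 1 channel) weights plus one bias
conv_params = [fs * embedding_dim * 1 * num_filters + num_filters
               for fs in filter_sizes]

# After pooling, each branch contributes num_filters features
dense_inputs = num_filters * len(filter_sizes)
dense_params = dense_inputs * num_classes + num_classes

total_params = embedding_params + sum(conv_params) + dense_params
print(embedding_params, conv_out_lengths, conv_params, dense_params, total_params)
```

The embedding count (3984800) and the first convolution (output length 813, 38464 parameters) agree with the `model.summary()` output.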

In [9]:
print(inputs)
print(embedding)
print(reshape)

Tensor("input_1:0", shape=(?, 815), dtype=int32)
Tensor("embedding_1/embedding_lookup/Identity:0", shape=(?, 815, 200), dtype=float32)
Tensor("reshape_1/Reshape:0", shape=(?, 815, 200, 1), dtype=float32)


In [10]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 815)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 815, 200)     3984800     input_1[0][0]                    
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 815, 200, 1)  0           embedding_1[0][0]                
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 813, 1, 64)   38464       reshape_1[0][0]                  
__________________________________________________________________________________________________
conv2d_2 (

In [None]:
# # Saves the weights whenever validation performance improves during training
# # The weight files are large, so free up some disk space on your machine before running this
# checkpoint = ModelCheckpoint('weights.{epoch:03d}-{val_acc:.4f}.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
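
The checkpoint file name is a format template that `ModelCheckpoint` fills in with the epoch number and the logged metrics. As a rough illustration of the resulting names, the template can be formatted by hand. The epoch and `val_acc` values below are made up.

```python
template = 'weights.{epoch:03d}-{val_acc:.4f}.hdf5'

# Hypothetical values, for illustration only
filename = template.format(epoch=3, val_acc=0.91234)
print(filename)
```

With `save_best_only=True`, a new file is written only when the monitored metric improves, so each saved file name records the epoch and the validation accuracy at that point.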

In [11]:
adam = Adam(lr=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

# The output is a 3-way softmax over one-hot labels, so use categorical cross-entropy
model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
print("Training Model...")

# # With a checkpoint callback:
# model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, callbacks=[checkpoint], validation_data=(X_test, y_test))  # starts training

# Without a checkpoint callback:
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(X_test, y_test))  # starts training

Training Model...
Train on 2007 samples, validate on 502 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20

KeyboardInterrupt: 
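
The model ends in a softmax over three classes and the labels are one-hot vectors. As a reminder of what a softmax output and a categorical cross-entropy value look like for a single sample, here is a small pure-Python sketch. The logits and the label are hypothetical, and Keras computes this internally in its own way.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


logits = [2.0, 1.0, 0.1]   # hypothetical scores for the three classes
one_hot = [1, 0, 0]        # hypothetical label: the first class is the true one
probs = softmax(logits)
loss = categorical_crossentropy(one_hot, probs)
print(probs, loss)
```

The loss depends only on the probability given to the true class, which is what makes it a natural fit for mutually exclusive labels.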

In [12]:
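import numpy as np

# Spot-check five random test abstracts.
# Note: only the first class column is compared here: the predicted probability
# (rounded by %.0f) against the first element of the one-hot label.
for i in range(5):
    idx = np.random.randint(len(X_test))
    x_test = X_test[idx].reshape(1, X_test.shape[1])
    y_label = y_test[idx][0]
    y_pred = model.predict(x_test)[0][0]
    # Map token indices back to words; '<PAD/>' tokens are stripped from the string below
    sent = " ".join([vocabulary_inv[token_id] for token_id in x_test[0].tolist() if token_id != 0])
    print("%.0f\t%d\n%s" % (y_pred, y_label, sent.replace('<PAD/>', '').strip()))
    print("\n")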

0	0
organ donation after circulatory death \( dcd \) has experienced a revival worldwide over the past 20 years , and is now widely practiced for kidney transplantation some previous concerns about these organs such as the high incidence of delayed graft function have been alleviated through evidence from adult studies there are now a number of large adult cohorts reporting favorable 5 year outcomes for dcd kidney transplants , comparable to kidneys donated after brain death \( dbd \) this has resulted in a marked increase in the use of dcd kidneys for adult recipients in some countries and an increase in the overall number of kidney transplants in contrast , the uptake of dcd kidneys for pediatric recipients is still low and concerns still exist over the longer term outcomes of dcd organs in view of the data from adult practice and the poor outcomes for children who stay on dialysis , dcd kidney transplantation should be offered as an option for children on the kidney transplant waiti