# 2D CNN for Text Classification: 5 Topics BBC news
#### Text classification with 2D convolutional layers
![](https://cdn-images-1.medium.com/max/1200/1*h_L7fSoQhipTHFULgXmHyQ.png)

## 1. Prepare Data

In [2]:
import pandas as pd
import string

In [3]:
# Number of characters shown per column (the default was 50)
pd.set_option("display.max_colwidth", 90)
# Number of rows shown
pd.set_option("display.max_rows", 101)

In [37]:
df = pd.read_csv("bbc-text.csv")
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home theatre systems plasma high-definition tv...
1,business,worldcom boss left books alone former worldcom boss bernie ebbers who is accused of...
2,sport,tigers wary of farrell gamble leicester say they will not be rushed into making a bi...
3,sport,yeading face newcastle in fa cup premiership side newcastle united face a trip to ryma...
4,entertainment,ocean s twelve raids box office ocean s twelve the crime caper sequel starring george...


In [38]:
df.text[2]

'tigers wary of farrell  gamble  leicester say they will not be rushed into making a bid for andy farrell should the great britain rugby league captain decide to switch codes.   we and anybody else involved in the process are still some way away from going to the next stage   tigers boss john wells told bbc radio leicester.  at the moment  there are still a lot of unknowns about andy farrell  not least his medical situation.  whoever does take him on is going to take a big  big gamble.  farrell  who has had persistent knee problems  had an operation on his knee five weeks ago and is expected to be out for another three months. leicester and saracens are believed to head the list of rugby union clubs interested in signing farrell if he decides to move to the 15-man game.  if he does move across to union  wells believes he would better off playing in the backs  at least initially.  i m sure he could make the step between league and union by being involved in the centre   said wells.  i t

In [39]:
df.category.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

In [40]:
df.isnull().sum()

category    0
text        0
dtype: int64

In [41]:
category_encoder = {
    "sport": 1,
    "business": 2,
    "politics": 3,
    "tech": 4,
    "entertainment": 5 }

In [42]:
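# Assign the encoded column back instead of relying on an in-place replace through attribute access.
df["category"] = df["category"].map(category_encoder)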

In [43]:
df.category.value_counts()

1    511
2    510
3    417
4    401
5    386
Name: category, dtype: int64

In [44]:
df["news_id"] = df.index

In [46]:
df.set_index("news_id", inplace=True)
df.head()

news_id,category,text
0,4,tv future in the hands of viewers with home theatre systems plasma high-definition tv...
1,2,worldcom boss left books alone former worldcom boss bernie ebbers who is accused of...
2,1,tigers wary of farrell gamble leicester say they will not be rushed into making a bi...
3,1,yeading face newcastle in fa cup premiership side newcastle united face a trip to ryma...
4,5,ocean s twelve raids box office ocean s twelve the crime caper sequel starring george...


## Clean Data
  
#### Goal
 - Create Vocabulary and Find Vocabulary Size
 - Find Max Text Length
 - Clean Data
 - Split Train and Test data

In [47]:
def clean_text(text):
    # Translation table that deletes every ASCII punctuation character.
    punct = str.maketrans("", "", string.punctuation)
    # split() with no arguments already collapses runs of whitespace,
    # so the repeated spaces in the raw text need no separate handling.
    words = text.split()
    cleaned_words = []
    for word in words:
        word = word.translate(punct)
        if len(word) < 2:  # drop single letters such as "a" or a stray "s"
            continue
        if not word.isalpha():  # drop tokens that contain digits, e.g. "15man"
            continue
        cleaned_words.append(word)
    cleaned_text = " ".join(cleaned_words)
    return cleaned_text

In [48]:
df["text"] = df["text"].apply(clean_text)
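
To make the effect of the cleaning rules concrete, here is a small self-contained sketch that applies the same three rules (strip punctuation, drop tokens shorter than two characters, drop non-alphabetic tokens) to one sentence. The sample sentence is made up from fragments of the article shown above; it is not a row of the dataset.

```python
import string

def clean_text(text):
    """Strip punctuation, then keep only alphabetic tokens of length >= 2."""
    punct = str.maketrans("", "", string.punctuation)
    kept = []
    for word in text.split():
        word = word.translate(punct)
        if len(word) >= 2 and word.isalpha():
            kept.append(word)
    return " ".join(kept)

# Hypothetical sample sentence (assumption, for illustration only).
sample = "i m sure he could move to the 15-man game  at least initially."
cleaned = clean_text(sample)
print(cleaned)
```

Note that "15-man" disappears entirely: removing the hyphen leaves "15man", which fails `isalpha()`. The same happens to any token that mixes letters and digits.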

In [49]:
df.text[2]

'tigers wary of farrell gamble leicester say they will not be rushed into making bid for andy farrell should the great britain rugby league captain decide to switch codes we and anybody else involved in the process are still some way away from going to the next stage tigers boss john wells told bbc radio leicester at the moment there are still lot of unknowns about andy farrell not least his medical situation whoever does take him on is going to take big big gamble farrell who has had persistent knee problems had an operation on his knee five weeks ago and is expected to be out for another three months leicester and saracens are believed to head the list of rugby union clubs interested in signing farrell if he decides to move to the game if he does move across to union wells believes he would better off playing in the backs at least initially sure he could make the step between league and union by being involved in the centre said wells think england would prefer him to progress to pos

In [50]:
# Length, in words, of the longest cleaned article.
MAX_TEXT_LEN = max(len(text.split()) for text in df.text)

In [51]:
MAX_TEXT_LEN

4265

In [110]:
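# Tokenize every article and keep at most the first 2000 words,
# well below the longest article (4265 words).
words = []
for text in df.text:
    tokens = text.split()
    words.append(tokens[:2000])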
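
Before fixing a cutoff such as 2000, it can help to check how much text the cutoff discards. The sketch below is one possible way to do that; `truncation_report` is a hypothetical helper, and the token counts are toy values, not statistics of the real dataset.

```python
def truncation_report(token_counts, cutoff):
    """Return (share of documents left intact, share of tokens kept)."""
    intact = sum(1 for n in token_counts if n <= cutoff)
    kept_tokens = sum(min(n, cutoff) for n in token_counts)
    return intact / len(token_counts), kept_tokens / sum(token_counts)

# Toy counts (assumption). In the notebook they would come from
# [len(text.split()) for text in df.text].
toy_counts = [717, 282, 237, 4265]
doc_share, token_share = truncation_report(toy_counts, cutoff=2000)
print(doc_share, round(token_share, 3))
```

If nearly all documents are left intact, a smaller cutoff would shrink the padded input considerably at little cost.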

In [112]:
words[:5]

[['tv',
  'future',
  'in',
  'the',
  'hands',
  'of',
  'viewers',
  'with',
  'home',
  'theatre',
  'systems',
  'plasma',
  'highdefinition',
  'tvs',
  'and',
  'digital',
  'video',
  'recorders',
  'moving',
  'into',
  'the',
  'living',
  'room',
  'the',
  'way',
  'people',
  'watch',
  'tv',
  'will',
  'be',
  'radically',
  'different',
  'in',
  'five',
  'years',
  'time',
  'that',
  'is',
  'according',
  'to',
  'an',
  'expert',
  'panel',
  'which',
  'gathered',
  'at',
  'the',
  'annual',
  'consumer',
  'electronics',
  'show',
  'in',
  'las',
  'vegas',
  'to',
  'discuss',
  'how',
  'these',
  'new',
  'technologies',
  'will',
  'impact',
  'one',
  'of',
  'our',
  'favourite',
  'pastimes',
  'with',
  'the',
  'us',
  'leading',
  'the',
  'trend',
  'programmes',
  'and',
  'other',
  'content',
  'will',
  'be',
  'delivered',
  'to',
  'viewers',
  'via',
  'home',
  'networks',
  'through',
  'cable',
  'satellite',
  'telecoms',
  'companies',
  '

### Create Vocabulary

In [52]:
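# NOTE: `dictionary` maps index -> word, so `word not in dictionary` tests the
# integer keys and is always true. Every token, duplicates included, gets its
# own entry, and the outputs below reflect that. The Keras Tokenizer used later
# builds the deduplicated vocabulary that the model relies on.
dictionary = dict()
idx = 0
for i in range(0, len(df)):
    text = df.iloc[i, 1]  # column 1 is "text"
    words = text.split()
    for word in words:
        if word not in dictionary:
            dictionary[idx] = word
            idx += 1
        else:
            continue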

In [55]:
idx2word = dictionary
idx2word

{0: 'tv',
 1: 'future',
 2: 'in',
 3: 'the',
 4: 'hands',
 5: 'of',
 6: 'viewers',
 7: 'with',
 8: 'home',
 9: 'theatre',
 10: 'systems',
 11: 'plasma',
 12: 'highdefinition',
 13: 'tvs',
 14: 'and',
 15: 'digital',
 16: 'video',
 17: 'recorders',
 18: 'moving',
 19: 'into',
 20: 'the',
 21: 'living',
 22: 'room',
 23: 'the',
 24: 'way',
 25: 'people',
 26: 'watch',
 27: 'tv',
 28: 'will',
 29: 'be',
 30: 'radically',
 31: 'different',
 32: 'in',
 33: 'five',
 34: 'years',
 35: 'time',
 36: 'that',
 37: 'is',
 38: 'according',
 39: 'to',
 40: 'an',
 41: 'expert',
 42: 'panel',
 43: 'which',
 44: 'gathered',
 45: 'at',
 46: 'the',
 47: 'annual',
 48: 'consumer',
 49: 'electronics',
 50: 'show',
 51: 'in',
 52: 'las',
 53: 'vegas',
 54: 'to',
 55: 'discuss',
 56: 'how',
 57: 'these',
 58: 'new',
 59: 'technologies',
 60: 'will',
 61: 'impact',
 62: 'one',
 63: 'of',
 64: 'our',
 65: 'favourite',
 66: 'pastimes',
 67: 'with',
 68: 'the',
 69: 'us',
 70: 'leading',
 71: 'the',
 72: 'trend'

In [56]:
VOCAB_SIZE = len(idx2word)
VOCAB_SIZE

812148
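
In the `idx2word` output above, 'the' appears at indices 3, 20 and 23, so 812148 looks like a count of all tokens rather than of distinct words. A minimal sketch of a deduplicated build follows, assuming the goal is one index per distinct word in order of first appearance. `build_vocab` and the two toy texts are illustrative and not part of the original notebook.

```python
def build_vocab(texts):
    """Map each distinct word to an index in order of first appearance."""
    word2idx = {}
    for text in texts:
        for word in text.split():
            if word not in word2idx:  # membership test on words, not on indices
                word2idx[word] = len(word2idx)
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word

# Toy texts (assumption); the notebook would pass df.text instead.
toy_texts = ["tv future in the hands of viewers", "the way people watch tv"]
word2idx, idx2word = build_vocab(toy_texts)
print(len(word2idx), word2idx["the"], idx2word[0])
```

Building `word2idx` first and inverting it afterwards keeps the two mappings consistent by construction.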

In [57]:
word2idx = {}
for idx, word in idx2word.items():
    word2idx[word] = idx

In [58]:
word2idx

{'dowell': 206146,
 'robben': 728928,
 'induction': 667081,
 'countryside': 730565,
 'innate': 98866,
 'timeshifting': 161931,
 'similar': 809825,
 'robs': 598036,
 'cigarettes': 462577,
 'nonmarket': 316492,
 'laughing': 678299,
 'streaming': 585982,
 'mentioned': 786216,
 'kok': 34290,
 'runins': 789775,
 'zillion': 566148,
 'politicians': 797754,
 'extracting': 699065,
 'displayed': 748255,
 'norwegians': 770660,
 'brilliant': 779813,
 'bombings': 196481,
 'scandinavians': 793292,
 'rocksolid': 184115,
 'malnourishment': 372780,
 'forcibly': 661024,
 'millionaire': 553725,
 'retrieved': 340264,
 'mindset': 372461,
 'cd': 788815,
 'fixtures': 771505,
 'hotmail': 180541,
 'tasting': 796553,
 'accepted': 791844,
 'csiny': 702417,
 'par': 396275,
 'ginsberg': 310333,
 'grassroots': 192136,
 'certain': 790488,
 'lanka': 789271,
 'zambians': 361060,
 'danaher': 284747,
 'pronouncement': 743187,
 'hughes': 798728,
 'hunters': 765367,
 'crumble': 474151,
 'ravaged': 697275,
 'urbanbased': 6

In [60]:
len(df)

2225

### Vectorize Sentence

In [109]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=100000)

In [114]:
tokenizer.fit_on_texts(words)
sequences = tokenizer.texts_to_sequences(words)

In [115]:
sequences[0]

[171,
 234,
 5,
 1,
 1259,
 3,
 1210,
 15,
 119,
 1126,
 774,
 5023,
 1260,
 4108,
 4,
 208,
 253,
 3972,
 1407,
 69,
 1,
 1357,
 1641,
 1,
 110,
 42,
 961,
 171,
 21,
 14,
 6309,
 397,
 5,
 178,
 75,
 65,
 8,
 7,
 204,
 2,
 31,
 2997,
 1342,
 32,
 2426,
 19,
 1,
 622,
 480,
 1276,
 138,
 5,
 2846,
 2998,
 2,
 1719,
 120,
 170,
 44,
 910,
 21,
 797,
 49,
 3,
 117,
 968,
 19361,
 15,
 1,
 45,
 703,
 1,
 1556,
 1065,
 4,
 66,
 470,
 21,
 14,
 1949,
 2,
 1210,
 801,
 119,
 550,
 164,
 1476,
 2051,
 1621,
 202,
 4,
 388,
 188,
 2646,
 2,
 949,
 5024,
 4,
 1096,
 649,
 49,
 3,
 1,
 107,
 19362,
 910,
 3,
 2479,
 17,
 35,
 208,
 4,
 411,
 253,
 3972,
 8572,
 4,
 5582,
 170,
 4261,
 3369,
 92,
 1,
 45,
 5583,
 4,
 1,
 73,
 1477,
 207,
 511,
 42,
 2,
 214,
 1422,
 148,
 4425,
 4,
 554,
 5584,
 171,
 1065,
 61,
 26,
 150,
 6310,
 1,
 151,
 2052,
 6,
 134,
 37,
 8573,
 171,
 26,
 23,
 40,
 85,
 4426,
 2,
 1260,
 171,
 1950,
 32,
 23,
 216,
 229,
 5,
 475,
 4,
 1,
 45,
 22,
 2925,
 2,
 109,
 147,

In [117]:
print(len(sequences[0]))
print(len(sequences[1]))
print(len(sequences[2]))

717
282
237


### Padding

In [118]:
from keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences, maxlen=2000, dtype='int32', padding='post', truncating='post', value=0)
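
The call above uses `padding='post'` and `truncating='post'`, so both padding and truncation happen at the end of each sequence. The plain-Python sketch below imitates that behaviour on toy data to show the resulting layout; `pad_post` is a hypothetical helper and not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then right-pad with value."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # like truncating='post'
        padded.append(seq + [value] * (maxlen - len(seq)))  # like padding='post'
    return padded

# Toy sequences (assumption); the token ids are borrowed from the outputs above.
toy = [[171, 234, 5], [1591, 678, 359, 1623, 1479, 193]]
result = pad_post(toy, maxlen=5)
print(result)
```

The Keras `Tokenizer` starts word indices at 1, which leaves 0 free to act as the padding value.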

In [119]:
import numpy as np
len(np.unique(padded_sequences))

30102

In [120]:
max(np.unique(padded_sequences))

30101

In [121]:
df["seqs"] = pd.Series(list(padded_sequences))

In [122]:
df.head()

news_id,category,text,seqs
0,4,tv future in the hands of viewers with home theatre systems plasma highdefinition tvs ...,"[171, 234, 5, 1, 1259, 3, 1210, 15, 119, 1126, 774, 5023, 1260, 4108, 4, 208, 253, 397..."
1,2,worldcom boss left books alone former worldcom boss bernie ebbers who is accused of ov...,"[1591, 678, 359, 1623, 1479, 193, 1591, 678, 5585, 1624, 41, 7, 704, 3, 6743, 31, 830,..."
2,1,tigers wary of farrell gamble leicester say they will not be rushed into making bid fo...,"[4818, 6747, 3, 3691, 6313, 1261, 142, 26, 21, 25, 14, 5026, 69, 305, 489, 6, 904, 369..."
3,1,yeading face newcastle in fa cup premiership side newcastle united face trip to ryman ...,"[10670, 390, 1106, 5, 1643, 308, 1041, 277, 1106, 261, 390, 1923, 2, 19377, 2481, 536,..."
4,5,ocean twelve raids box office ocean twelve the crime caper sequel starring george cloo...,"[3693, 4822, 5589, 680, 254, 3693, 4822, 1, 1042, 14717, 2140, 1343, 1167, 7853, 5939,..."


### One-Hot Vectorized Labels

In [94]:
df.category.unique()

array([4, 2, 1, 5, 3])

In [98]:
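# to_categorical expects class ids starting at 0, so remap label 5
# (entertainment) to 0; the five classes then occupy indices 0-4.
category = []
for label in df.category:
    if label == 5:
        label = 0
    category.append(label)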

In [99]:
from keras.utils import to_categorical
onehot_category = to_categorical(category, dtype='int32')

In [100]:
onehot_category

array([[0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0],
       ...,
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0]], dtype=int32)
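
Because entertainment was remapped from 5 to 0, the position of the 1 in each row no longer matches the original encoder directly. The sketch below shows one way to turn a one-hot row (or a row of predicted probabilities) back into a category name. `INDEX_TO_CATEGORY` and `decode_row` are hypothetical names; the mapping itself follows `category_encoder` and the remapping step.

```python
# Class index -> category name, following the notebook's encoding
# (entertainment was remapped from 5 to 0 before one-hot encoding).
INDEX_TO_CATEGORY = {
    0: "entertainment",
    1: "sport",
    2: "business",
    3: "politics",
    4: "tech",
}

def decode_row(row):
    """Return the category name at the position of the largest value."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return INDEX_TO_CATEGORY[best_index]

# The first three one-hot rows printed above.
first_rows = [[0, 0, 0, 0, 1], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
decoded = [decode_row(row) for row in first_rows]
print(decoded)
```

The decoded names agree with the first three rows of the dataframe (tech, business, sport).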

### Split Train and Test Data

In [123]:
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, onehot_category, test_size=0.2)

In [124]:
X_train

array([[  140,  2310,  3235, ...,     0,     0,     0],
       [ 4578,  2894,  3793, ...,     0,     0,     0],
       [ 2017,  7932,  2004, ...,     0,     0,     0],
       ...,
       [ 1440, 18768, 27009, ...,     0,     0,     0],
       [   37,   422,     2, ...,     0,     0,     0],
       [ 3324,   241,    19, ...,     0,     0,     0]], dtype=int32)

## 2. Modeling : 2D CNN for text

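Input shape:  
With `data_format='channels_first'`, the input is a 4D tensor of shape (batch_size, channels, rows, cols). The model below uses the Keras default, `channels_last`, so its input is (batch_size, rows, cols, channels).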

In [31]:
from keras.layers import Input, Dense, Embedding, Conv2D, MaxPool2D
from keras.layers import Reshape, Flatten, Dropout, Concatenate
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.models import Model
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [129]:
# Hyperparams
DOC_EMBEDDING_DIM = 256

FILTER_SIZES = [2, 3, 4]
NUM_FILTERS = 512
DROPOUT_RATE = 0.2

EPOCHS = 5
BATCH_SIZE = 32

In [130]:
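MAX_TEXT_LEN = 2000  # truncation length used when padding the sequences
VOCAB_SIZE = 30102   # largest token index (30101) + 1; index 0 is the padding value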

adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

In [131]:
# ====================================================================
# model
# ====================================================================

inputs = Input(shape=(MAX_TEXT_LEN,), dtype='int32')
embedding_layer = Embedding(VOCAB_SIZE, DOC_EMBEDDING_DIM, input_length=MAX_TEXT_LEN)(inputs)
reshape = Reshape((MAX_TEXT_LEN, DOC_EMBEDDING_DIM, 1))(embedding_layer) # key step: add a channel axis so Conv2D can be applied

conv_0 = Conv2D(filters=NUM_FILTERS, kernel_size=(FILTER_SIZES[0], DOC_EMBEDDING_DIM), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(filters=NUM_FILTERS, kernel_size=(FILTER_SIZES[1], DOC_EMBEDDING_DIM), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(filters=NUM_FILTERS, kernel_size=(FILTER_SIZES[2], DOC_EMBEDDING_DIM), padding='valid', kernel_initializer='normal', activation='relu')(reshape)

maxpool_0 = MaxPool2D(pool_size=(MAX_TEXT_LEN - FILTER_SIZES[0] + 1, 1), strides=(1,1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(MAX_TEXT_LEN - FILTER_SIZES[1] + 1, 1), strides=(1,1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(MAX_TEXT_LEN - FILTER_SIZES[2] + 1, 1), strides=(1,1), padding='valid')(conv_2)

concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)
dropout = Dropout(rate=DROPOUT_RATE)(flatten)
outputs = Dense(units=5, activation="softmax")(dropout)

model = Model(inputs, outputs)

model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, 2000)         0                                            
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 2000, 256)    7706112     input_5[0][0]                    
__________________________________________________________________________________________________
reshape_5 (Reshape)             (None, 2000, 256, 1) 0           embedding_5[0][0]                
__________________________________________________________________________________________________
conv2d_13 (Conv2D)              (None, 1999, 1, 512) 262656      reshape_5[0][0]                  
__________________________________________________________________________________________________
conv2d_14 
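
The visible rows of the summary can be checked by hand. A 'valid' convolution with stride 1 leaves `input_rows - kernel_rows + 1` rows, and each kernel spans the full embedding width, so the column dimension collapses to 1. The sketch below redoes that arithmetic with the notebook's hyperparameters. The helper names are hypothetical, and the values for filter sizes 3 and 4 are computed here rather than read from the truncated summary.

```python
MAX_TEXT_LEN = 2000
VOCAB_SIZE = 30102
DOC_EMBEDDING_DIM = 256
NUM_FILTERS = 512
FILTER_SIZES = [2, 3, 4]

def conv_output_rows(input_rows, kernel_rows):
    """Rows left after a stride-1 'valid' convolution."""
    return input_rows - kernel_rows + 1

def conv_params(kernel_rows, kernel_cols, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_rows * kernel_cols * in_channels * filters + filters

embedding_params = VOCAB_SIZE * DOC_EMBEDDING_DIM
rows = [conv_output_rows(MAX_TEXT_LEN, k) for k in FILTER_SIZES]
params = [conv_params(k, DOC_EMBEDDING_DIM, 1, NUM_FILTERS) for k in FILTER_SIZES]
# Each branch is max-pooled down to one value per filter before concatenation.
flattened_units = NUM_FILTERS * len(FILTER_SIZES)
print(embedding_params, rows, params, flattened_units)
```

The first two numbers reproduce the summary: 7706112 parameters for the embedding, and 1999 rows with 262656 parameters for the size-2 convolution.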

## 3. Training CNN

In [None]:
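checkpoint = ModelCheckpoint("cnn_text.weights.h5", monitor="val_loss", save_best_only=True, save_weights_only=True)  # not defined earlier; the file name is a placeholder
model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=1, callbacks=[checkpoint], validation_data=(X_test, y_test))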

## 4. Validating CNN

In [None]:
loss, accuracy = model.evaluate(X_test, y_test)
print(loss, accuracy)