**INTRODUCTION**

Hi all, this kernel is an introduction to text classification using deep learning. Deep learning approaches took some time to make their mark on textual data, but since then their impact on NLP has grown steeply.

In this kernel we will get our hands dirty with a much-in-demand problem: text/document classification. Around 2014, Yoon Kim started to experiment with CNNs for NLP, and there has been no looking back since. In the paper "[Convolutional Neural Networks for Sentence Classification](http://arxiv.org/pdf/1408.5882.pdf)", Kim experiments with multiple CNN models (single channel, multiple channel) on top of word embeddings for text classification.

For the sake of simplicity we will start off with a single-channel model with pretrained GloVe embeddings. The dataset used is the famous [20_newsgroup dataset](http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html) (original dataset link).

In this kernel we will first learn how to process the dataset, followed by a Keras implementation of text classification using the pretrained GloVe embeddings.


**THE APPROACH**

The approach follows this flow:
<a href="https://imgur.com/xLrP6IM"><img src="https://i.imgur.com/xLrP6IM.png" title="source: imgur.com" style="width:400px;height:600px;"/></a>


We add parallel convolution layers with filter sizes [3, 4, 5]. The filter size is the number of consecutive words each filter is applied to, so the three branches loosely act like detectors for 3-, 4- and 5-word n-grams.
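To make the role of the filter size concrete, here is a minimal NumPy sketch of one filter sliding over a toy sentence, followed by max-over-time pooling. The toy sizes (10 words, 4-dimensional embeddings), the random weights and the helper name `conv_max_over_time` are assumptions for illustration only; the Keras model later in this kernel uses 512 learned filters per size and adds a bias term.

```python
import numpy as np

def conv_max_over_time(x, filter_size, rng):
    """Slide one filter spanning `filter_size` words over x, apply ReLU, take the max."""
    # the filter covers the full embedding width, so it only moves along the word axis
    w = rng.normal(size=(filter_size, x.shape[1]))
    n_windows = x.shape[0] - filter_size + 1  # 'valid' padding
    feature_map = np.array([
        max(0.0, float(np.sum(x[i:i + filter_size] * w)))  # ReLU(window * filter)
        for i in range(n_windows)
    ])
    return feature_map, feature_map.max()

rng = np.random.default_rng(0)
sentence = rng.normal(size=(10, 4))  # one toy "embedded sentence": 10 words x 4 dims

for h in [3, 4, 5]:
    feature_map, pooled = conv_max_over_time(sentence, h, rng)
    print(h, feature_map.shape, pooled)
```

Each filter size yields a feature map of length `sequence_length - filter_size + 1`, and pooling reduces it to a single number per filter, which is why the pooled outputs of the three branches can be concatenated regardless of filter size.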

In [1]:
import os
import sys
import numpy as np
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Activation, Conv2D, Input, Embedding, Reshape, MaxPool2D, Concatenate, Flatten, Dropout, Dense, Conv1D
from keras.layers import MaxPool1D
from keras.models import Model
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam

Using TensorFlow backend.


In [2]:
# just to make sure the dataset is added properly 
!ls '../input/20-newsgroup-original/20_newsgroup/20_newsgroup/'


alt.atheism		  rec.autos	      sci.space
comp.graphics		  rec.motorcycles     soc.religion.christian
comp.os.ms-windows.misc   rec.sport.baseball  talk.politics.guns
comp.sys.ibm.pc.hardware  rec.sport.hockey    talk.politics.mideast
comp.sys.mac.hardware	  sci.crypt	      talk.politics.misc
comp.windows.x		  sci.electronics     talk.religion.misc
misc.forsale		  sci.med


In [22]:
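# the dataset path
TEXT_DATA_DIR = r'../input/20-newsgroup-original/20_newsgroup/20_newsgroup/'
# the path for GloVe embeddings
GLOVE_DIR = r'../input/glove6b/'
# vocabulary size limit passed to the Tokenizer (the most frequent words are kept)
MAX_WORDS = 10000
# every sequence is padded or truncated to this many tokens
MAX_SEQUENCE_LENGTH = 1000
# fraction of the data held out for validation
VALIDATION_SPLIT = 0.20
# the dimension of the word vectors to be used
EMBEDDING_DIM = 100
# filter sizes of the different conv layers
filter_sizes = [3,4,5]
# number of filters per conv layer
num_filters = 512
# same value as EMBEDDING_DIM; used as the conv kernel width
embedding_dim = 100
# dropout probability
drop = 0.5
batch_size = 30
# note: the training cell passes epochs=20 directly, so this value is not used there
epochs = 2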

**DATASET STRUCTURE**

The dataset has a hierarchical structure: all files of a given class are located in that class's folder, and each datapoint is stored in its own file, named with a numeric ID.

* First we go through the entire dataset to build our text list and label list.
* Next we tokenize the entire data using Tokenizer, which is part of keras.preprocessing.text.
* We then pad the sequences to make them a uniform length.
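The tokenize-and-pad steps above can be sketched in plain Python on two toy sentences. This is only an approximation of what the Keras utilities do, under these assumptions: words are split on whitespace (the real Tokenizer also strips punctuation), ids are assigned by frequency rank starting at 1, ids at or above `max_words` are dropped, and padding and truncation happen at the front of the sequence, which matches the `pad_sequences` defaults. The helper names are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: id 1 is the most frequent, id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, max_words, max_len):
    """Map words to ids, drop ids >= max_words, then truncate/pad at the front."""
    ids = [word_index[w] for w in text.lower().split()
           if w in word_index and word_index[w] < max_words]
    ids = ids[-max_len:]                     # keep the last max_len tokens
    return [0] * (max_len - len(ids)) + ids  # left-pad with zeros

toy_texts = ["the cat sat", "the dog sat on the mat"]
toy_word_index = build_word_index(toy_texts)
padded = [to_padded_sequence(t, toy_word_index, max_words=5, max_len=4)
          for t in toy_texts]
print(toy_word_index)
print(padded)
```

Note that the word index still contains every word seen; the vocabulary limit only takes effect when texts are converted to id sequences.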

In [23]:
## preparing dataset


texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                # Python 3 needs an explicit encoding; latin-1 can decode any byte
                open_kwargs = {} if sys.version_info < (3,) else {'encoding': 'latin-1'}
                with open(fpath, **open_kwargs) as f:
                    t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                labels.append(label_id)
print(labels_index)

print('Found %s texts.' % len(texts))

{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5, 'misc.forsale': 6, 'rec.autos': 7, 'rec.motorcycles': 8, 'rec.sport.baseball': 9, 'rec.sport.hockey': 10, 'sci.crypt': 11, 'sci.electronics': 12, 'sci.med': 13, 'sci.space': 14, 'soc.religion.christian': 15, 'talk.politics.guns': 16, 'talk.politics.mideast': 17, 'talk.politics.misc': 18, 'talk.religion.misc': 19}
Found 19997 texts.


In [87]:
print(texts[1000])




			CALL FOR PRESENTATIONS
	
      NAVY SCIENTIFIC VISUALIZATION AND VIRTUAL REALITY SEMINAR

			Tuesday, June 22, 1993

	    Carderock Division, Naval Surface Warfare Center
	      (formerly the David Taylor Research Center)
			  Bethesda, Maryland

SPONSOR: NESS (Navy Engineering Software System) is sponsoring a 
one-day Navy Scientific Visualization and Virtual Reality Seminar.  
The purpose of the seminar is to present and exchange information for
Navy-related scientific visualization and virtual reality programs, 
research, developments, and applications.

PRESENTATIONS: Presentations are solicited on all aspects of 
Navy-related scientific visualization and virtual reality.  All 
current work, works-in-progress, and proposed work by Navy 
organizations will be considered.  Four types of presentations are 
available.

     1. Regular presentation: 20-30 minutes in length
     2. Short presentation: 10 minutes in length
     3. Video presentation: a stand-alone videotape (author 

In [25]:
tokenizer  = Tokenizer(num_words = MAX_WORDS)
tokenizer.fit_on_texts(texts)
sequences =  tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print("unique words : {}".format(len(word_index)))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
print(labels)

unique words : 174074
Shape of data tensor: (19997, 1000)
Shape of label tensor: (19997, 20)
[[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]


In [81]:
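# keep sample 1000 for a quick prediction check later; it is taken before shuffling
# and is not removed from `data`, so it may also end up in the training set
x_test=data[1000]
y_test=labels[1000]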
print(x_test.shape, y_test.shape)

(1000,) (20,)


In [80]:
labels[1000]

array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)

In [6]:
# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
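The split above uses NumPy's global random state, so the train/validation membership changes from run to run. A minimal sketch of a reproducible variant is shown below on toy arrays; the helper name `shuffle_split` and the seed value are assumptions, not part of the original kernel.

```python
import numpy as np

def shuffle_split(x, y, validation_split, seed):
    """Shuffle x and y with the same permutation, then hold out the tail for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    # assumes validation_split * len(x) >= 1, otherwise the slices below misbehave
    n_val = int(validation_split * len(x))
    return x[:-n_val], y[:-n_val], x[-n_val:], y[-n_val:]

toy_x = np.arange(10).reshape(10, 1)  # 10 toy samples
toy_y = np.arange(10)                 # label i belongs to sample i
x_tr, y_tr, x_va, y_va = shuffle_split(toy_x, toy_y, validation_split=0.2, seed=42)
print(x_tr.shape, x_va.shape)
```

Using one permutation for both arrays keeps every sample aligned with its label, and fixing the seed makes accuracy numbers comparable between runs.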

Now that our train-validation split is ready, the next step is to create an embedding matrix from the precomputed GloVe embeddings.
For convenience we freeze the embedding layer, i.e., we will not fine-tune the word embeddings. Feel free to try fine-tuning; it may improve accuracy on domain-specific text. GloVe embeddings are general-purpose features and tend to perform well across tasks.
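As a small illustration of how the embedding matrix is assembled, the sketch below parses two made-up lines in GloVe's text format (a word followed by its vector components) and fills a matrix indexed by tokenizer id. The 3-dimensional vectors, the toy word index and the out-of-vocabulary word are all invented for this example.

```python
import numpy as np

# made-up 3-dimensional vectors in GloVe's text format: word, then floats
toy_glove_lines = [
    "the 0.1 0.2 0.3",
    "cat 0.4 0.5 0.6",
]
# hypothetical tokenizer output; "zyzzyva" has no pretrained vector
toy_word_index = {"the": 1, "cat": 2, "zyzzyva": 3}
dim = 3

vectors = {}
for line in toy_glove_lines:
    word, *coefs = line.split()
    vectors[word] = np.asarray(coefs, dtype="float32")

# row 0 is reserved for padding; rows of words without a vector stay all-zero
matrix = np.zeros((len(toy_word_index) + 1, dim), dtype="float32")
for word, i in toy_word_index.items():
    vec = vectors.get(word)
    if vec is not None:
        matrix[i] = vec

print(matrix)
```

Because the layer is frozen, the all-zero rows never change during training, so words missing from GloVe contribute nothing to the convolution outputs.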

In [7]:
embeddings_index = {}
# GloVe files are UTF-8 text: one word per line followed by its vector components
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [8]:
# row 0 is reserved for padding; words not found in the embedding index stay all-zeros
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [9]:
from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

In [17]:
inputs = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding = embedding_layer(inputs)

print(embedding.shape)
reshape = Reshape((MAX_SEQUENCE_LENGTH,EMBEDDING_DIM,1))(embedding)
print(reshape.shape)

conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)

maxpool_0 = MaxPool2D(pool_size=(MAX_SEQUENCE_LENGTH - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(MAX_SEQUENCE_LENGTH - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(MAX_SEQUENCE_LENGTH - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_2)

concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)
dropout = Dropout(drop)(flatten)
output = Dense(units=20, activation='softmax')(dropout)

# build the model from the input tensor to the softmax output
model = Model(inputs=inputs, outputs=output)

checkpoint = ModelCheckpoint('weights_cnn_sentece.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()


(?, 1000, 100)
(?, 1000, 100, 1)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, 1000)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 1000, 100)    17407500    input_4[0][0]                    
__________________________________________________________________________________________________
reshape_4 (Reshape)             (None, 1000, 100, 1) 0           embedding_1[3][0]                
__________________________________________________________________________________________________
conv2d_10 (Conv2D)              (None, 998, 1, 512)  154112      reshape_4[0][0]                  
____________________________________________________________________________
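
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from values that appear in this kernel (174074 unique words, 100-dimensional embeddings, 512 filters, 20 classes). The dense-layer count is derived from the architecture rather than read from the truncated summary, so treat it as an expectation rather than a reported number.

```python
vocab_size = 174074 + 1  # unique words reported by the tokenizer, plus the padding row
embedding_dim = 100
num_filters = 512
num_classes = 20

# one embedding vector per vocabulary row (frozen, so non-trainable)
embedding_params = vocab_size * embedding_dim

# each Conv2D filter has height * width * input_channels weights plus one bias
conv_params = {h: (h * embedding_dim * 1 + 1) * num_filters for h in (3, 4, 5)}

# after pooling, each branch contributes num_filters values to the dense layer
dense_inputs = num_filters * 3
dense_params = (dense_inputs + 1) * num_classes

print(embedding_params)  # 17407500, matches embedding_1 in the summary
print(conv_params[3])    # 154112, matches conv2d_10 in the summary
print(conv_params[4], conv_params[5], dense_params)
```

The embedding layer dominates the total, which is one practical reason for keeping it frozen: only the convolution and dense weights need to be trained.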

In [18]:
print("Training Model...")
model.fit(x_train, y_train, batch_size=batch_size, epochs=20, verbose=1, callbacks=[checkpoint], validation_data=(x_val, y_val))


Training Model...
Train on 15998 samples, validate on 3999 samples
Epoch 1/20

Epoch 00001: val_acc improved from -inf to 0.47562, saving model to weights_cnn_sentece.hdf5
Epoch 2/20

Epoch 00002: val_acc improved from 0.47562 to 0.58490, saving model to weights_cnn_sentece.hdf5
Epoch 3/20

Epoch 00003: val_acc improved from 0.58490 to 0.67367, saving model to weights_cnn_sentece.hdf5
Epoch 4/20

Epoch 00004: val_acc improved from 0.67367 to 0.70193, saving model to weights_cnn_sentece.hdf5
Epoch 5/20

Epoch 00005: val_acc improved from 0.70193 to 0.72168, saving model to weights_cnn_sentece.hdf5
Epoch 6/20

Epoch 00006: val_acc improved from 0.72168 to 0.74444, saving model to weights_cnn_sentece.hdf5
Epoch 7/20

Epoch 00007: val_acc did not improve from 0.74444
Epoch 8/20

Epoch 00008: val_acc improved from 0.74444 to 0.75219, saving model to weights_cnn_sentece.hdf5
Epoch 9/20

Epoch 00009: val_acc improved from 0.75219 to 0.76594, saving model to weights_cnn_sentece.hdf5
Epoch 10/20

<keras.callbacks.History at 0x7f20281019e8>

In [35]:
score, acc = model.evaluate(x_val, y_val)
print("Loss: ", score)
print("Accuracy: ", acc*100)

Loss:  0.6367613789468266
Accuracy:  79.61990498071553


In [88]:
x_test=x_test.reshape(1, 1000)
pred=model.predict(x_test).argmax()

In [90]:
print("Actual label: ", y_test.argmax())
print("Predicted label: ", pred)

Actual label:  1
Predicted label:  1


In [102]:
print(labels[1000].argmax())

1


In [103]:
labels_index

{'alt.atheism': 0,
 'comp.graphics': 1,
 'comp.os.ms-windows.misc': 2,
 'comp.sys.ibm.pc.hardware': 3,
 'comp.sys.mac.hardware': 4,
 'comp.windows.x': 5,
 'misc.forsale': 6,
 'rec.autos': 7,
 'rec.motorcycles': 8,
 'rec.sport.baseball': 9,
 'rec.sport.hockey': 10,
 'sci.crypt': 11,
 'sci.electronics': 12,
 'sci.med': 13,
 'sci.space': 14,
 'soc.religion.christian': 15,
 'talk.politics.guns': 16,
 'talk.politics.mideast': 17,
 'talk.politics.misc': 18,
 'talk.religion.misc': 19}
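
Rather than reading the class name off the dictionary by eye, the mapping can be inverted so a predicted id is translated directly. This is a small sketch that uses only the first three entries of `labels_index` copied from the output above; the names `toy_labels_index` and `index_to_label` are introduced here for illustration.

```python
# first three entries of labels_index, copied from the output above
toy_labels_index = {'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2}

# invert name -> id into id -> name
index_to_label = {idx: name for name, idx in toy_labels_index.items()}

pred = 1  # the class id predicted for the sample above
print(index_to_label[pred])  # comp.graphics
```

With the full dictionary the same one-liner covers all 20 classes.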

In [94]:
print(texts[1000])




			CALL FOR PRESENTATIONS
	
      NAVY SCIENTIFIC VISUALIZATION AND VIRTUAL REALITY SEMINAR

			Tuesday, June 22, 1993

	    Carderock Division, Naval Surface Warfare Center
	      (formerly the David Taylor Research Center)
			  Bethesda, Maryland

SPONSOR: NESS (Navy Engineering Software System) is sponsoring a 
one-day Navy Scientific Visualization and Virtual Reality Seminar.  
The purpose of the seminar is to present and exchange information for
Navy-related scientific visualization and virtual reality programs, 
research, developments, and applications.

PRESENTATIONS: Presentations are solicited on all aspects of 
Navy-related scientific visualization and virtual reality.  All 
current work, works-in-progress, and proposed work by Navy 
organizations will be considered.  Four types of presentations are 
available.

     1. Regular presentation: 20-30 minutes in length
     2. Short presentation: 10 minutes in length
     3. Video presentation: a stand-alone videotape (author 

I hope this kernel was helpful. Feedback and comments of any sort are appreciated, and feel free to reach out if something is unclear.
The entire code is also available on my GitHub: https://github.com/au1206/Convolutional-Neural-Networks-for-Sentence-Classification


Until next time, happy learning :)