## Deep Learning for Text Classification
### Vincent Wang

This project uses the 20 newsgroups text dataset.

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

The sklearn.datasets module contains two loaders for this dataset. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors such as sklearn.feature_extraction.text.CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.



### Data Loading 

In [37]:
import os
import numpy as np
from keras.layers import Activation, Conv1D, Dense, Embedding, Flatten, Input, MaxPooling1D
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import get_data_home  # the sklearn.datasets.base path is gone in newer scikit-learn releases

data_home = get_data_home()
twenty_home = os.path.join(data_home, "20news_home")

if not os.path.exists(data_home):
    os.makedirs(data_home)
    
if not os.path.exists(twenty_home):
    os.makedirs(twenty_home)
    
!cp ../input/20-newsgroup-sklearn/20news-bydate_py3* /tmp/scikit_learn_data

cp: ../input/20-newsgroup-sklearn/20news-bydate_py3*: No such file or directory


In [38]:
data_home

'/Users/wangziwen/scikit_learn_data'

In [39]:
twenty_home

'/Users/wangziwen/scikit_learn_data/20news_home'

In [40]:
# http://qwone.com/~jason/20Newsgroups/
dataset = fetch_20newsgroups(subset='all', shuffle=True, download_if_missing=False)
texts = dataset.data # Extract text
target = dataset.target # Extract target

### Preprocessing the data 

We have to tokenize the text before we can feed it into a neural network. Tokenization also discards some features of the original text, such as punctuation and less common words.

In [41]:
texts[0]

"From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

In [42]:
target[0]

10

In [43]:
print(len(texts))
print(len(target))
print(len(texts[0].split()))
print(texts[0])
print(target[0])
print(dataset.target_names[target[0]])

18846
18846
157
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


10
rec.sport.hockey


We have to specify the size of our vocabulary. Less frequent words are removed. In this case we want to retain roughly the 20,000 most common words (with num_words=20000, the Keras Tokenizer keeps the top num_words - 1 words, i.e. indices 1 to 19,999).

In [18]:
vocab_size = 20000

tokenizer = Tokenizer(num_words=vocab_size) # Setup tokenizer
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts) # Generate sequences

In [20]:
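# Encode a sample sentence with the fitted tokenizer
print(tokenizer.texts_to_sequences(['hello world, how are you? I am fine, how about you?']))

print(len(sequences))     # number of documents
print(len(sequences[0]))  # number of tokens kept for the first document
print(sequences[0])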

[[1595, 151, 82, 20, 13, 7, 122, 706, 82, 50, 13]]
18846
160
[14, 19415, 455, 559, 15, 29, 2552, 1240, 5609, 33, 322, 767, 2175, 2121, 871, 1343, 32, 251, 88, 77, 84, 12087, 455, 559, 15, 7, 122, 228, 63, 3, 2552, 1240, 20, 517, 3490, 50, 1, 1393, 3, 61, 437, 3, 1507, 50, 1, 1302, 2552, 3027, 3, 1, 2701, 309, 7, 122, 243, 16334, 175, 5, 4, 243, 19416, 268, 7, 122, 194, 2, 296, 37, 337, 2, 369, 4389, 22, 4, 243, 3, 7286, 12, 1, 2552, 349, 30, 20, 1502, 137, 2701, 1382, 90, 7, 397, 5987, 74, 2025, 13, 130, 56, 8, 140, 215, 90, 93, 1457, 770, 1963, 56, 8, 97, 4, 308, 9186, 1857, 2, 1306, 6, 1, 2327, 6760, 115, 348, 5987, 21, 4, 308, 3, 1857, 6, 1, 365, 658, 3, 467, 185, 1, 2552, 20, 194, 2, 1985, 1, 66, 3, 3215, 608, 7, 26, 132, 8755, 19, 2, 131, 1, 3280, 2000, 1, 1151, 1457, 770, 283, 2552, 1222]


In [21]:
word_index = tokenizer.word_index
print('Found {:,} unique words.'.format(len(word_index)))

Found 179,209 unique words.
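
To make the effect of the vocabulary cap more concrete, here is a minimal pure-Python sketch of frequency-ranked tokenization. It is not the Keras implementation: the helper names (`build_word_index`, `to_sequence`), the regular expression and the alphabetical tie-breaking are assumptions made for illustration. It only mirrors the idea that frequent words get small indices and that words at or above the cap are dropped when texts are converted to sequences.

```python
import re
from collections import Counter

# Crude stand-in for the Tokenizer's lowercasing and punctuation filtering (assumption)
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(w for t in texts for w in TOKEN_PATTERN.findall(t.lower()))
    # Sort by descending count; ties are broken alphabetically (an arbitrary choice here)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [word_index[w] for w in tokens if word_index.get(w, num_words) < num_words]

toy_texts = ["the pens beat the devils", "the pens rule"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequence("The Pens beat the Islanders!", toy_index, num_words=3))  # [1, 2, 1]
```

Note that the full word index still contains every word, which is why the cell above reports far more unique words than the 20,000 cap; the cap only applies when sequences are generated.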


Our text is now converted to sequences of numbers. It makes sense to convert some of those sequences back into text to check what the tokenization did to it. To this end, we create an inverse index that maps numbers to words, whereas the tokenizer maps words to numbers. Words that fall outside the vocabulary simply disappear from the reconstructed text.

In [22]:
# Create inverse index mapping numbers to words
inv_index = {v: k for k, v in tokenizer.word_index.items()}

# Print out text again
for w in sequences[0]:
    x = inv_index.get(w)
    print(x,end = ' ')

from ratnam andrew cmu edu subject pens fans reactions organization post office carnegie mellon pittsburgh pa lines 12 nntp posting host po4 andrew cmu edu i am sure some of pens fans are pretty confused about the lack of any kind of posts about the recent pens massacre of the devils actually i am bit puzzled too and a bit relieved however i am going to put an end to non relief with a bit of praise for the pens man they are killing those devils worse than i thought jagr just showed you why he is much better than his regular season stats he is also a lot fo fun to watch in the playoffs bowman should let jagr have a lot of fun in the next couple of games since the pens are going to beat the out of jersey anyway i was very disappointed not to see the islanders lose the final regular season game pens rule 

## Measuring text length
### Let's ensure all sequences have the same length.

In [23]:
# Get the average length of a text
avg = sum(map(len, sequences)) / len(sequences)

# Get the standard deviation of the sequence length
std = np.sqrt(sum(map(lambda x: (len(x) - avg)**2, sequences)) / len(sequences))

avg,std

(292.4769712405816, 666.93290630508761)
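
When choosing a maximum length, it can help to look at how many texts a given cut-off would cover, not only at the mean and standard deviation. The following is a small sketch using only the standard library; the lengths in `toy_lengths` are made up for illustration and `coverage` is a hypothetical helper, not part of the notebook.

```python
import statistics

toy_lengths = [40, 80, 120, 300, 5000]  # made-up sequence lengths with one very long outlier

mean_length = statistics.mean(toy_lengths)
std_length = statistics.pstdev(toy_lengths)  # population standard deviation, as in the cell above

def coverage(lengths, max_length):
    """Share of sequences that fit into max_length without truncation."""
    return sum(length <= max_length for length in lengths) / len(lengths)

print(mean_length, std_length)
print(coverage(toy_lengths, 100))  # 0.4
print(coverage(toy_lengths, 300))  # 0.8
```

A single outlier pulls the mean far above the typical length, which is the same pattern the real dataset shows.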

As you can see, the average text is about 300 words long. However, the standard deviation is quite large, which indicates that some texts are much longer. If some user decided to write an epic novel in the newsgroup, it would massively slow down training. So, for speed, we restrict the sequence length to 100 words. By default, pad_sequences pads and truncates at the front: shorter texts are left-padded with zeros and longer texts keep only their last 100 tokens. You should try out different sequence lengths and compare processing time and accuracy.

In [24]:
pad_sequences([[1,2,3]], maxlen=5)

array([[0, 0, 1, 2, 3]], dtype=int32)

In [25]:
max_length = 100
data = pad_sequences(sequences, maxlen=max_length)

In [26]:
data[0]

array([19416,   268,     7,   122,   194,     2,   296,    37,   337,
           2,   369,  4389,    22,     4,   243,     3,  7286,    12,
           1,  2552,   349,    30,    20,  1502,   137,  2701,  1382,
          90,     7,   397,  5987,    74,  2025,    13,   130,    56,
           8,   140,   215,    90,    93,  1457,   770,  1963,    56,
           8,    97,     4,   308,  9186,  1857,     2,  1306,     6,
           1,  2327,  6760,   115,   348,  5987,    21,     4,   308,
           3,  1857,     6,     1,   365,   658,     3,   467,   185,
           1,  2552,    20,   194,     2,  1985,     1,    66,     3,
        3215,   608,     7,    26,   132,  8755,    19,     2,   131,
           1,  3280,  2000,     1,  1151,  1457,   770,   283,  2552,  1222], dtype=int32)
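
The behaviour seen above (zeros added on the left, long texts cut from the front) can be sketched in a few lines of plain Python. This is only an approximation of the default `pad_sequences` settings (`padding='pre'`, `truncating='pre'`); `pad_or_truncate` is a hypothetical helper and assumes `maxlen` is positive.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq` and left-pad with `value` if it is shorter."""
    kept = seq[-maxlen:]  # drops items from the front when seq is too long
    return [value] * (maxlen - len(kept)) + kept

print(pad_or_truncate([1, 2, 3], 5))               # [0, 0, 1, 2, 3]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5))   # [3, 4, 5, 6, 7]
```

This is why `data[0]` holds the end of the first post rather than its header lines.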

## Turning labels into One-Hot encodings
### Labels can quickly be encoded into one-hot vectors with Keras:

In [27]:
from keras.utils import to_categorical
labels = to_categorical(np.asarray(target))
print('Shape of data:', data.shape)
print('Shape of labels:', labels.shape)

Shape of data: (18846, 100)
Shape of labels: (18846, 20)
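
For readers unfamiliar with one-hot encoding, the transformation can be sketched without Keras. The `one_hot` function below is a hypothetical stand-in that returns plain lists instead of a NumPy array; it illustrates the idea rather than reproducing `to_categorical` exactly.

```python
def one_hot(class_ids, num_classes):
    """Turn integer class ids into rows with a single 1.0 at the class position."""
    rows = []
    for class_id in class_ids:
        row = [0.0] * num_classes
        row[class_id] = 1.0
        rows.append(row)
    return rows

print(one_hot([2, 0], num_classes=3))  # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

With 18,846 posts and 20 topics, this yields the (18846, 20) label shape printed above.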


### Loading GloVe embeddings

In [28]:
# Pre-trained GloVe embeddings can be downloaded from: https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation/downloads/glove-global-vectors-for-word-representation.zip/1
glove_dir = 'glove-global-vectors-for-word-representation' # This is the folder with the dataset

embeddings_index = {} # We create a dictionary of word -> embedding

with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:  # explicit encoding avoids platform-dependent decode errors
    for line in f:
        values = line.split()
        word = values[0] # The first value is the word, the rest are the values of the embedding
        embedding = np.asarray(values[1:], dtype='float32') # Load embedding
        embeddings_index[word] = embedding # Add embedding to our embedding dictionary

print('Found {:,} word vectors in GloVe.'.format(len(embeddings_index)))

Found 400,000 word vectors in GloVe.


In [29]:
print(embeddings_index['frog'])
print(len(embeddings_index['frog']))

[ 0.043084    0.53232998  0.54254001 -0.076952   -0.29673001  0.52986002
  0.21379     0.15789001 -0.39520001 -0.91889    -0.65850002  0.68706
  0.10821    -0.10694    -0.34009999  1.04400003  0.12774999  0.51156998
  0.60314     0.71366    -0.53740001  0.37737     0.12186     0.60891002
  0.50107002  2.02150011 -0.47318     0.46952999  0.12542     0.60206997
  0.11007     0.37586999  1.01370001 -0.24779999  0.65748     0.12801
 -0.57647002 -0.25753999  0.62426001  0.010864   -0.40680999  0.16173001
 -0.84694999 -0.24603     0.29078001  0.85460001 -0.067021    0.69331002
 -0.71544999 -0.25184    -0.74741    -0.26506999  0.48730001  0.41991001
 -0.86741    -0.52350003 -0.44773999 -0.044584    0.033836    0.29909
  0.73754001  0.81651002  0.69431001  0.80453002  0.29276001 -0.025244
 -0.30452999 -0.34329     0.11933    -0.29655001  0.1072     -0.18945999
  0.18501    -0.75480002 -0.25628     0.34437999 -0.016743    0.0040503
  0.39342001  0.99404001 -0.32159001 -0.49434     0.41707999 -0

In [30]:
# https://nlp.stanford.edu/projects/glove/
# Euclidean distance between word vectors: related words should be closer together
print(np.linalg.norm(embeddings_index['frog'] - embeddings_index['toad']))
print(np.linalg.norm(embeddings_index['frog'] - embeddings_index['man']))

4.12497
6.79435
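
The comparison above uses Euclidean distance. Cosine similarity is another common way to compare word vectors, since it ignores vector length. The sketch below uses made-up three-dimensional vectors (not real GloVe values) purely to show both measures side by side.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
frog = [1.0, 2.0, 0.0]
toad = [1.0, 2.0, 1.0]
man = [-2.0, 0.0, 3.0]

print(euclidean(frog, toad), euclidean(frog, man))
print(cosine_similarity(frog, toad), cosine_similarity(frog, man))
```

In both measures the toy "frog" vector ends up closer to "toad" than to "man", mirroring the pattern in the real embeddings.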


In [31]:
embedding_dim = 100 # We use 100 dimensional glove vectors

word_index = tokenizer.word_index
# Number of rows in the matrix. Token ids start at 1 (row 0 is reserved for padding),
# so a vocabulary smaller than vocab_size needs len(word_index) + 1 rows
nb_words = min(vocab_size, len(word_index) + 1)

embedding_matrix = np.zeros((nb_words, embedding_dim))

# The vectors need to be in the same position as their index.
# Meaning a word with token 1 needs to be in the second row (rows start with zero) and so on

# Loop over all words in the word index
for word, i in word_index.items():
    # Skip words whose index falls outside the embedding matrix
    if i >= nb_words:
        continue
    # Get the embedding vector for the word
    embedding_vector = embeddings_index.get(word)
    # If there is an embedding vector, put it in the embedding matrix
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector
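
The row layout of the embedding matrix is easy to get wrong, so here is a toy version of the same loop using plain lists. The word index, the two-dimensional vectors and the helper name `build_embedding_matrix` are all invented for illustration; the sketch also counts how many words were found, which is a quick way to check pre-trained coverage.

```python
def build_embedding_matrix(word_index, vectors, num_rows, dim):
    """Place each known word's vector in the row that matches its token id."""
    matrix = [[0.0] * dim for _ in range(num_rows)]  # row 0 stays zero for padding
    hits = 0
    for word, i in word_index.items():
        if i >= num_rows:  # token id outside the matrix: word is not used by the model
            continue
        vector = vectors.get(word)
        if vector is not None:  # words without a pre-trained vector keep an all-zero row
            matrix[i] = list(vector)
            hits += 1
    return matrix, hits

toy_word_index = {"the": 1, "pens": 2, "jagr": 3, "rare": 4}
toy_vectors = {"the": [0.1, 0.2], "pens": [0.3, 0.4], "rare": [0.9, 0.9]}

toy_matrix, toy_hits = build_embedding_matrix(toy_word_index, toy_vectors, num_rows=4, dim=2)
print(toy_matrix)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
print(toy_hits)    # 2
```

Here "jagr" has no pre-trained vector and "rare" lies beyond the row limit, so neither contributes a non-zero row.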

In [32]:
model = Sequential()
model.add(Embedding(vocab_size, 
                    embedding_dim, 
                    input_length=max_length, 
                    weights = [embedding_matrix], 
                    trainable = False))
model.add(Conv1D(128, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(20, activation='softmax'))
model.summary()

Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          2000000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 98, 128)           38528     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 32, 128)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 30, 128)           49280     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 10, 128)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 8, 128)            49280     
___________________________________________________________
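
The output shapes and parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer definitions, assuming 'valid' padding with stride 1 for Conv1D and a pooling stride equal to the pool size, which are the Keras defaults. The helper names are hypothetical, and the values for the layers after conv1d_3 are derived from the model code rather than read from the output.

```python
def conv1d_out(length, kernel_size):
    """Output length of a 'valid' 1D convolution with stride 1."""
    return length - kernel_size + 1

def pool1d_out(length, pool_size):
    """Output length of 'valid' max pooling whose stride equals the pool size."""
    return (length - pool_size) // pool_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units

length = 100  # max_length
lengths = []
for _ in range(3):  # three Conv1D(128, 3) + MaxPooling1D(3) pairs
    length = conv1d_out(length, 3)
    lengths.append(length)
    length = pool1d_out(length, 3)
    lengths.append(length)

print(lengths)                       # [98, 32, 30, 10, 8, 2]
print(20000 * 100)                   # embedding: 2000000
print(conv1d_params(100, 128, 3))    # first convolution: 38528
print(conv1d_params(128, 128, 3))    # second and third convolution: 49280
flat_units = lengths[-1] * 128       # size after Flatten
print(flat_units, dense_params(flat_units, 128), dense_params(128, 20))
```

The first four printed values match the summary lines shown above.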

In [46]:
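# The labels are one-hot vectors over 20 mutually exclusive classes, so we use categorical cross-entropy.
# With 'binary_crossentropy', Keras would report binary accuracy, which looks far higher:
# https://stackoverflow.com/questions/42081257/keras-binary-crossentropy-vs-categorical-crossentropy-performance
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])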

model.fit(data, labels, validation_split=0.2, epochs=20)

Train on 15076 samples, validate on 3770 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x13a6e4e80>
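
The choice of loss affects which accuracy Keras reports. As a rough sketch (pure Python, with a made-up prediction), the snippet below contrasts binary accuracy, which is what `'accuracy'` resolves to under `binary_crossentropy`, with categorical accuracy, which checks whether the top-scoring class is the right one. The function names are hypothetical simplifications of the Keras metrics.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of individual outputs that match the label after thresholding."""
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            correct += int((p > threshold) == (t == 1))
            total += 1
    return correct / total

def categorical_accuracy(y_true, y_pred):
    """Share of samples whose highest-scoring class equals the labelled class."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

num_classes = 20
label = [0] * num_classes
label[3] = 1                      # true class is 3
prediction = [0.04] * num_classes
prediction[7] = 0.24              # model favours the wrong class 7

print(binary_accuracy([label], [prediction]))       # 0.95
print(categorical_accuracy([label], [prediction]))  # 0.0
```

A completely wrong prediction still scores 95% binary accuracy here, because 19 of the 20 outputs are correctly "off".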

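Trained with the binary_crossentropy loss, this model reports more than 95% accuracy on the validation set after only 2 epochs. That number should be read with care: under that loss, Keras computes binary accuracy over all 20 outputs of each sample, so a model that predicts no class at all would already score 95%. With categorical_crossentropy, the reported accuracy is the share of posts whose top-scoring class is correct. Systems like these can be used to assign emails in customer support centers, suggest responses, or classify other forms of text, such as invoices that need to be assigned to a department. Let's take a look at how our model classified one of the texts: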

In [47]:
example = data[450] # get the tokens

# Print tokens as text
for w in example:
    x = inv_index.get(w)
    print(x,end = ' ')

30 i'm posting this for a friend but you can e mail questions to me at cc bellcore com however the best way to get your questions answered is to call the phone number listed for sale 1991 corrado 2 2 coupe low mileage approx 28 000 miles 5 speed manual 7 speaker factory stereo system new all weather 205 sun roof ac red speed activated extra set of tires equipped with factory winter package heated seats mirrors and security system with 2 all records documentation service car mint condition must sacrifice at 11 000 or best offer call 908 

In [48]:
# Get prediction
pred = model.predict(example.reshape(1, max_length))  # add a batch dimension: shape (1, 100)

In [49]:
# Output predicted category
dataset.target_names[np.argmax(pred)]

'misc.forsale'
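
Looking only at the arg-max hides how confident the model is. A small sketch for listing the top few classes with their probabilities is shown below; the probabilities are invented, the three labels are a subset of the real topic names, and `top_k` is a hypothetical helper.

```python
def top_k(probs, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(names[i], probs[i]) for i in order[:k]]

toy_names = ['misc.forsale', 'rec.autos', 'rec.sport.hockey']
toy_probs = [0.70, 0.20, 0.10]  # made-up softmax output

print(top_k(toy_probs, toy_names, k=2))  # [('misc.forsale', 0.7), ('rec.autos', 0.2)]
```

With the real model, the same idea would apply to `pred[0]` and `dataset.target_names`.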