# Text Data

Constructing models that understand text falls under the field of *natural language processing*. This is a field of enormous practical importance: chatbots, automated translation and generated news articles are a few notable applications. In this notebook we will look into some basic ways of processing text data.

Below is what you might get in a typical dataset of review data:

In [1]:
#Text data
corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad."
]

ratings = [
    1,
    0,
    1,
    0,
    1,
    0
]

When analyzing review data, the typical goal is to predict a single value, the rating, from the written text. In the case of chatbots and automated translation, where a single value is not sufficient to represent the meaning of the text, the model outputs a vector instead.

### A. N-gram

Let us count the number of times each word appears in each sample. Single words counted this way are called *unigrams* in natural language processing. We will do the counting with scikit-learn's ```CountVectorizer```:

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())

[[0 1 1 0 1 0]
 [1 0 1 0 1 0]
 [0 1 1 0 1 1]
 [0 1 1 1 1 0]
 [1 0 1 1 1 0]
 [1 0 2 0 1 0]]


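Use ```get_feature_names()``` to see which word each column represents (in scikit-learn 1.0 and later the method is called ```get_feature_names_out()``` and returns an array instead of a list):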

In [3]:
vectorizer.get_feature_names()

['bad', 'good', 'is', 'not', 'this', 'very']
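
To make the counting concrete, the sketch below rebuilds the same matrix using only the standard library. It assumes that ```CountVectorizer``` lowercases the text and keeps tokens of two or more word characters by default; `tokenize` is a hypothetical helper name, not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad.",
]

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


# One column per distinct word, in alphabetical order.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, holding the count of each vocabulary word.
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The last row has a 2 in the 'is' column because "This is...is bad." contains that word twice.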

The word-count vectors can now be fed into any suitable model. Here we will simply use a logistic regression (logit) model:

In [4]:
y = ratings

#Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X,y)
print(model.score(X,y))
print(model.predict(X))

0.6666666666666666
[1 0 1 1 0 0]


Which phrases does our model have difficulty understanding? Why might that be the case?
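
One way to answer the first question is to line the predictions up against the true ratings. The sketch below hard-codes the predictions printed above rather than refitting the model, so it only needs the standard library.

```python
corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad.",
]
ratings = [1, 0, 1, 0, 1, 0]

# Predictions copied from the output of model.predict(X) above.
predictions = [1, 0, 1, 1, 0, 0]

# Collect every sample whose predicted rating differs from the true one.
misclassified = [
    (text, actual, predicted)
    for text, actual, predicted in zip(corpus, ratings, predictions)
    if actual != predicted
]

# Share of correct predictions; should agree with model.score(X, y).
accuracy = sum(a == p for a, p in zip(ratings, predictions)) / len(ratings)

print(accuracy)
for text, actual, predicted in misclassified:
    print(f"{text!r}: actual={actual}, predicted={predicted}")
```

Both misclassified phrases contain "not". A plausible reading is that a unigram model cannot tell what "not" is negating, since it sees each word in isolation.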

Now let us take a look at the estimated coefficients:

In [5]:
print(model.coef_)

[[-0.20256028  0.28162527 -0.27606588  0.02590234  0.079065    0.36973116]]


Take a look at the coefficient of each word. Can you see what is wrong with our model? One thing you might notice is that 'is' has the most negative coefficient while 'very' has the most positive one, even though neither word carries such a connotation by itself.
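
Reading the coefficients is easier once each one is paired with its word. The sketch below does this with the values printed above, copied by hand; with a fitted model you would pass the vectorizer's feature names and `model.coef_[0]` instead.

```python
# Feature names and coefficients copied from the outputs above.
feature_names = ['bad', 'good', 'is', 'not', 'this', 'very']
coefficients = [-0.20256028, 0.28162527, -0.27606588,
                0.02590234, 0.079065, 0.36973116]

# Sort the (word, coefficient) pairs from most negative to most positive.
ranked = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])

for word, coef in ranked:
    print(f"{word:>5} {coef:+.3f}")
```

Note that 'not' ends up close to zero: it appears once in a negative review and once in a positive one, so on its own it tells the model very little.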

When we count combinations of consecutive words instead of individual words, what we have are *n-grams*. ```CountVectorizer``` allows us to specify the range of n-gram lengths we wish to consider via the option ```ngram_range```; setting it to ```(2,2)``` counts pairs of words (bigrams) only:

In [5]:
vectorizer = CountVectorizer(ngram_range=(2,2))
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names())

[[0 1 0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0 1 0]
 [0 0 0 0 1 0 0 1 1]
 [0 0 0 1 0 0 1 1 0]
 [0 0 0 1 0 1 0 1 0]
 [1 0 1 0 0 0 0 1 0]]
['is bad', 'is good', 'is is', 'is not', 'is very', 'not bad', 'not good', 'this is', 'very good']
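
For intuition, bigrams can be produced by pairing each token with the one that follows it. The sketch below uses only the standard library and assumes the same lowercasing and two-or-more-character tokenization as before; `bigrams` is a hypothetical helper name.

```python
import re


def bigrams(text):
    """Return the consecutive word pairs in a text, in order."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # zip(tokens, tokens[1:]) pairs every token with its successor.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad.",
]

# Distinct bigrams across the corpus, in alphabetical order.
bigram_vocabulary = sorted({pair for doc in corpus for pair in bigrams(doc)})

print(bigrams("This is not good."))
print(bigram_vocabulary)
```

Unlike the unigram features, "not good" and "not bad" are now separate columns, so the model can give them opposite signs.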


Now let us try running the logistic regression again:

In [6]:
model = LogisticRegression()
model.fit(X,y)
print(model.score(X,y))
print(model.coef_)

1.0
[[-0.65382919  0.38656707 -0.29461512 -0.02100773  0.32587553  0.39061201
  -0.41161974  0.03760567  0.32587553]]


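Much better! The model now classifies every training sample correctly, and the coefficients make more sense: 'not good' is negative (-0.41) while 'not bad' is positive (0.39).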

### B. IMDB Movie Review

Now let us try something real. We will analyse a sample of <a href="https://www.imdb.com/">IMDB</a> movie reviews, trying to predict the rating a user gives based on their written review. We will be using a pre-processed version of the data, but the original text data can be found <a href="http://ai.stanford.edu/~amaas/data/sentiment/">here</a>.

First let us import the data:

In [7]:
import numpy as np
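# The reviews are stored as Python lists inside object arrays,
# which recent versions of NumPy only load with allow_pickle=True
imdb = np.load("../Data/imdb.npz", allow_pickle=True)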

Now take a look at what is inside this file:

In [8]:
for d in imdb:
    print(d)

x_test
x_train
y_train
y_test


How many samples do we have?

In [9]:
print(imdb["x_train"].shape)
print(imdb["x_test"].shape)

(25000,)
(25000,)


What is inside each X sample?

In [10]:
imdb["x_train"][0]

[23022,
 309,
 6,
 3,
 1069,
 209,
 9,
 2175,
 30,
 1,
 169,
 55,
 14,
 46,
 82,
 5869,
 41,
 393,
 110,
 138,
 14,
 5359,
 58,
 4477,
 150,
 8,
 1,
 5032,
 5948,
 482,
 69,
 5,
 261,
 12,
 23022,
 73935,
 2003,
 6,
 73,
 2436,
 5,
 632,
 71,
 6,
 5359,
 1,
 25279,
 5,
 2004,
 10471,
 1,
 5941,
 1534,
 34,
 67,
 64,
 205,
 140,
 65,
 1232,
 63526,
 21145,
 1,
 49265,
 4,
 1,
 223,
 901,
 29,
 3024,
 69,
 4,
 1,
 5863,
 10,
 694,
 2,
 65,
 1534,
 51,
 10,
 216,
 1,
 387,
 8,
 60,
 3,
 1472,
 3724,
 802,
 5,
 3521,
 177,
 1,
 393,
 10,
 1238,
 14030,
 30,
 309,
 3,
 353,
 344,
 2989,
 143,
 130,
 5,
 7804,
 28,
 4,
 126,
 5359,
 1472,
 2375,
 5,
 23022,
 309,
 10,
 532,
 12,
 108,
 1470,
 4,
 58,
 556,
 101,
 12,
 23022,
 309,
 6,
 227,
 4187,
 48,
 3,
 2237,
 12,
 9,
 215]

Words are encoded by their frequency-of-appearance ranking in the data, so a small number means a common word. This allows us to easily delete words that either
- appear frequently but add little to the meaning of the text (e.g. articles, conjunctions and prepositions), or
- appear too infrequently to be of use.
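
The sketch below illustrates the idea on a toy example. It is not the preprocessing actually used to build the IMDB file; `build_rank_index`, `encode`, `skip_top` and `max_rank` are hypothetical names chosen for illustration.

```python
from collections import Counter


def build_rank_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, rank_index, skip_top=0, max_rank=None):
    """Encode a text as a list of ranks.

    Drops the skip_top most frequent words, words rarer than max_rank,
    and words that are not in the index at all.
    """
    encoded = []
    for word in text.lower().split():
        rank = rank_index.get(word)
        if rank is None or rank <= skip_top:
            continue
        if max_rank is not None and rank > max_rank:
            continue
        encoded.append(rank)
    return encoded


texts = ["the movie was good", "the movie was bad", "the movie", "the"]
rank_index = build_rank_index(texts)

# Full encoding, then the same text with the most frequent word and
# everything beyond rank 3 filtered out.
print(encode("the movie was good", rank_index))
print(encode("the movie was good", rank_index, skip_top=1, max_rank=3))
```

Because the code of a word is its rank, both filters reduce to simple comparisons on the integers, with no need to look at the words themselves.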

What is inside each y sample?

In [11]:
imdb["y_train"][0]

1

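We will now repeat what we did previously: turn each review into a vector of word counts and fit a logistic regression. To keep the computation light, we draw a random sample of 1,000 reviews each from the training and test sets: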

In [21]:
from sklearn.utils import resample

# Draw 1,000 training and 1,000 test reviews
# (resample samples with replacement by default)
x_raw_train,y_train,x_raw_test,y_test = resample(imdb["x_train"],imdb["y_train"],
                                         imdb["x_test"],imdb["y_test"],
                                         n_samples=1000)

# Join the word indices into space-separated strings so that
# CountVectorizer can treat each index as a "word"
x_train = [' '.join(str(e) for e in x) for x in x_raw_train]
x_test = [' '.join(str(e) for e in x) for x in x_raw_test]

# Learn the vocabulary from the training sample only,
# then count words in both samples
vectorizer = CountVectorizer()
vectorizer.fit(x_train)
x_train = vectorizer.transform(x_train)
x_test = vectorizer.transform(x_test)
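
One detail of this step is worth checking. Assuming ```CountVectorizer``` uses its documented default `token_pattern` of `(?u)\b\w\w+\b`, only tokens of two or more characters are counted, so the single-digit indices 1 to 9 (the nine most frequent words) would be silently dropped. The sketch below applies that assumed pattern with the standard `re` module to the first few indices of the review shown earlier.

```python
import re

# Assumed default token pattern of CountVectorizer: two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens.
KEEP_ALL_PATTERN = r"(?u)\b\w+\b"

# First ten word indices of the review printed earlier.
review = [23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1]
text = " ".join(str(index) for index in review)

kept = re.findall(DEFAULT_TOKEN_PATTERN, text)
all_tokens = re.findall(KEEP_ALL_PATTERN, text)
dropped = [token for token in all_tokens if token not in kept]

print("kept:   ", kept)
print("dropped:", dropped)
```

If you want those words counted as well, passing `token_pattern=r"(?u)\b\w+\b"` to ```CountVectorizer``` should keep them. Whether that improves accuracy is an empirical question, since the dropped indices correspond to the most common words.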

How well does our model do?

In [23]:
model = LogisticRegression()
model.fit(x_train,y_train)
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))

1.0
0.798


### C. Neural Network

Below is a simple LSTM neural network model to do the same thing:

In [30]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

max_features = 20000
maxlen = 80  # keep at most this many words per review (among top max_features most common words); by default pad_sequences keeps the last maxlen words
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train,y_train,x_test,y_test = resample(x_train,y_train,x_test,y_test,
                                         n_samples=1000)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2))
model.add(Dense(128))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Loading data...
1000 train sequences
1000 test sequences
Pad sequences (samples x time)
x_train shape: (1000, 80)
x_test shape: (1000, 80)
Build model...
Train...
Train on 1000 samples, validate on 1000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 1.52359737206
Test accuracy: 0.719
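
The padding step in the script can be pictured with a few lines of plain Python. The sketch assumes the Keras defaults `padding='pre'` and `truncating='pre'`, which add zeros at the front of short reviews and keep the last `maxlen` entries of long ones; `pad_sequence` is a hypothetical helper, not the Keras function itself.

```python
def pad_sequence(seq, maxlen, value=0):
    """Return a copy of seq with exactly maxlen entries (maxlen >= 1).

    Long sequences keep their last maxlen entries; short ones are
    padded with `value` at the front.
    """
    truncated = list(seq[-maxlen:])
    return [value] * (maxlen - len(truncated)) + truncated


short_review = [12, 7, 45]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_sequence(short_review, maxlen=5))
print(pad_sequence(long_review, maxlen=5))
```

After this step every review has the same length, which is why `x_train` can be stored as a single array of shape (1000, 80).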


You should notice that training a neural network is several orders of magnitude slower than fitting an n-gram model. Furthermore, the neural network above is no more accurate than our simple n-gram model: it scores 0.719 on its test sample, against 0.798 for the logistic regression. One reason is that, with so many parameters, neural network models need far more than a thousand samples to achieve good results. You can try running the same script with more data on a computer with a GPU and see whether you get better results.
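
To get a feel for the number of parameters involved, the sketch below counts the trainable weights of the network defined above. It uses the standard formulas for embedding, LSTM and dense layers and assumes each layer has a bias term; calling `model.summary()` on the Keras model is the way to check the result.

```python
vocab_size = 20000   # max_features in the script above
embed_dim = 128      # size of each word vector
lstm_units = 128
dense_units = 128

# Embedding: one vector of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: a weight matrix plus a bias vector.
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params + output_params
print(f"{total_params:,} parameters for 1,000 training reviews")
```

Most of the parameters sit in the embedding layer, which has to learn a 128-dimensional vector for each of the 20,000 words from only a thousand reviews.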

### Further Readings
- <a href="https://github.com/dipanjanS/text-analytics-with-python">Text Analytics with Python</a> (or the <a href="https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72">free tutorial</a> by the same author on Towards Data Science.)