# Text Data

Building models that understand text falls under the field of *natural language processing*. This is a field of enormous practical importance: chatbots, automated translation, and generated news articles are a few notable applications. In this notebook we will look into some basic ways of processing text data.

Below is what you might get in a typical dataset of review data:

In [1]:
#Text data
corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad."
]

ratings = [
    1,
    0,
    1,
    0,
    1,
    0
]

When analyzing review data, the typical goal is to predict a single value, the rating, from the written text. This is a form of *sentiment analysis*. In the case of chatbots and automated translation, where a single value is not sufficient to represent the meaning of the text, the model outputs a vector instead.

### A. N-gram

Let us count the number of times each word appears in a sample. In natural language processing, this is called a *unigram* model. To do so, we will use ```CountVectorizer``` from scikit-learn:

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())

[[0 1 1 0 1 0]
 [1 0 1 0 1 0]
 [0 1 1 0 1 1]
 [0 1 1 1 1 0]
 [1 0 1 1 1 0]
 [1 0 2 0 1 0]]


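Use ```get_feature_names()``` to see which word each column represents (in scikit-learn 1.0 and later the method is called ```get_feature_names_out()```, and ```get_feature_names()``` was removed in version 1.2):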

In [3]:
vectorizer.get_feature_names()

['bad', 'good', 'is', 'not', 'this', 'very']
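
To make the counting concrete, here is a minimal standard-library sketch of the unigram step. It assumes the same tokenization rule as the defaults of ```CountVectorizer``` (lowercase the text, keep tokens of two or more word characters). The helper names `tokenize` and `count_matrix` are hypothetical, and the real class returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad.",
]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def count_matrix(docs):
    # Hypothetical helper: returns (sorted vocabulary, list of count rows)
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set(w for c in counts for w in c))
    rows = [[c[w] for w in vocab] for c in counts]
    return vocab, rows

vocab, rows = count_matrix(corpus)
print(vocab)
for row in rows:
    print(row)
```

Each row holds the counts for one sample, with columns in alphabetical order of the vocabulary.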

The word-count vectors can now be used with a suitable model to conduct language processing. Here we will simply use a logistic regression (logit) model:

In [4]:
y = ratings

#Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X,y)
print(model.score(X,y))
print(model.predict(X))

0.6666666666666666
[1 0 1 1 0 0]


Which phrases does our model have difficulty understanding? Why might that be the case?
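
One way to answer the first question is to line up the predictions with the true ratings. The sketch below copies the predictions from the output above instead of refitting the model, so it only illustrates the comparison.

```python
corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad.",
]
ratings = [1, 0, 1, 0, 1, 0]
predictions = [1, 0, 1, 1, 0, 0]  # copied from the output of model.predict(X) above

# Keep the samples whose predicted rating differs from the true rating
wrong = [text for text, y, p in zip(corpus, ratings, predictions) if y != p]
accuracy = 1 - len(wrong) / len(corpus)

print(wrong)
print(accuracy)
```

Both misclassified samples contain "not", which a model that counts words individually cannot attach to the word that follows it.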

Now let us take a look at the estimated coefficients:

In [5]:
print(model.coef_)

[[-2.37601408e-01  2.37590746e-01 -3.55854219e-01  1.81681559e-06
  -1.06620709e-05  3.55804904e-01]]


Take a look at the coefficient of each word. Can you see what is wrong with our model? One thing you might notice is that 'is' has a very negative coefficient while 'very' has a very positive coefficient, even though these words do not carry such connotations themselves.
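
The coefficients are easier to read when paired with their words. The sketch below hard-codes the feature names and coefficient values printed above (an assumption made so that it runs without refitting the model) and sorts them from most negative to most positive.

```python
feature_names = ['bad', 'good', 'is', 'not', 'this', 'very']
# Values copied from the printed model.coef_ above
coefficients = [-2.37601408e-01, 2.37590746e-01, -3.55854219e-01,
                1.81681559e-06, -1.06620709e-05, 3.55804904e-01]

# Sort (word, coefficient) pairs by coefficient, most negative first
ranked = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])

for word, coef in ranked:
    print(f"{word:>5s} {coef:+.3f}")
```

With a fitted model, the same pairing can be obtained by zipping the vectorizer's feature names with `model.coef_[0]`.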

When we start counting combinations of words instead of individual words, what we have is an *n-gram* model. ```CountVectorizer``` allows us to specify the range of n-gram lengths we wish to consider via the option ```ngram_range```:

In [6]:
vectorizer = CountVectorizer(ngram_range=(2,2))
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names())

[[0 1 0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0 1 0]
 [0 0 0 0 1 0 0 1 1]
 [0 0 0 1 0 0 1 1 0]
 [0 0 0 1 0 1 0 1 0]
 [1 0 1 0 0 0 0 1 0]]
['is bad', 'is good', 'is is', 'is not', 'is very', 'not bad', 'not good', 'this is', 'very good']
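
For intuition, the n-grams of a single sentence can be produced with a few lines of plain Python. This is a sketch only: the helper `ngrams` is hypothetical, and its tokenization rule (lowercase, tokens of two or more word characters) is assumed to mirror the defaults of ```CountVectorizer```.

```python
import re

def ngrams(text, n):
    # Lowercase and keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Join every run of n adjacent tokens into one feature
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("This is not good.", 2))
print(ngrams("This is...is bad.", 2))
```

Because 'not good' and 'not bad' are now features of their own, the model can give them different weights.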


Now let us try running the logistic regression again:

In [7]:
model = LogisticRegression()
model.fit(X,y)
print(model.score(X,y))
print(model.coef_)

1.0
[[-6.65709533e-01  3.78667208e-01 -2.99691500e-01 -3.25311016e-02
   3.19576438e-01  3.84884164e-01 -4.17415266e-01  3.01182347e-06
   3.19576438e-01]]


Much better!

### B. IMDB Movie Review

Now let us try something real. We will analyse a sample of <a href="https://www.imdb.com/">IMDB</a> movie reviews, trying to predict the rating a user gives based on their written review. We will be using a pre-processed version of the data, but the original text data can be found <a href="http://ai.stanford.edu/~amaas/data/sentiment/">here</a>.

First let us import the data:

In [15]:
import numpy as np
imdb = np.load("../Data/imdb.npz",allow_pickle=True) 

Now take a look at what is inside this file:

In [16]:
for d in imdb:
    print(d)

x_test
x_train
y_train
y_test


How many samples do we have?

In [17]:
print(imdb["x_train"].shape)
print(imdb["x_test"].shape)

(25000,)
(25000,)


What is inside each X sample?

In [18]:
imdb["x_train"][0]

[23022,
 309,
 6,
 3,
 1069,
 209,
 9,
 2175,
 30,
 1,
 169,
 55,
 14,
 46,
 82,
 5869,
 41,
 393,
 110,
 138,
 14,
 5359,
 58,
 4477,
 150,
 8,
 1,
 5032,
 5948,
 482,
 69,
 5,
 261,
 12,
 23022,
 73935,
 2003,
 6,
 73,
 2436,
 5,
 632,
 71,
 6,
 5359,
 1,
 25279,
 5,
 2004,
 10471,
 1,
 5941,
 1534,
 34,
 67,
 64,
 205,
 140,
 65,
 1232,
 63526,
 21145,
 1,
 49265,
 4,
 1,
 223,
 901,
 29,
 3024,
 69,
 4,
 1,
 5863,
 10,
 694,
 2,
 65,
 1534,
 51,
 10,
 216,
 1,
 387,
 8,
 60,
 3,
 1472,
 3724,
 802,
 5,
 3521,
 177,
 1,
 393,
 10,
 1238,
 14030,
 30,
 309,
 3,
 353,
 344,
 2989,
 143,
 130,
 5,
 7804,
 28,
 4,
 126,
 5359,
 1472,
 2375,
 5,
 23022,
 309,
 10,
 532,
 12,
 108,
 1470,
 4,
 58,
 556,
 101,
 12,
 23022,
 309,
 6,
 227,
 4187,
 48,
 3,
 2237,
 12,
 9,
 215]

Words are encoded by their frequency-of-appearance ranking in the data. This allows us to easily delete words that either
- appear frequently but add little to the meaning of the text (e.g. articles, conjunctions and prepositions), or
- appear too infrequently to be of use.
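
The encoding and the filtering it enables can be sketched on a toy example. The three reviews, the tie-breaking rule (alphabetical) and the helper `keep_ranks` are all assumptions for illustration; the actual IMDB pre-processing may differ in its details.

```python
from collections import Counter

# Toy reviews (hypothetical), already lowercased and free of punctuation
reviews = [
    "the movie was good",
    "the movie was bad",
    "the plot was thin but the acting was good",
]
tokens = [r.split() for r in reviews]
counts = Counter(w for doc in tokens for w in doc)

# Rank 1 is the most frequent word; ties are broken alphabetically
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {w: i for i, w in enumerate(ranked, start=1)}
encoded = [[word_index[w] for w in doc] for doc in tokens]

def keep_ranks(doc, skip_top, max_rank):
    # Drop the skip_top most frequent words and anything rarer than max_rank
    return [i for i in doc if skip_top < i <= max_rank]

print(word_index)
print(encoded[0])
print(keep_ranks(encoded[0], skip_top=2, max_rank=4))
```

Filtering becomes a simple comparison on the integer codes, with no need to look at the words themselves.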

What is inside each y sample?

In [19]:
imdb["y_train"][0]

1

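We will now repeat what we did previously on a random sample of 1,000 reviews. Each review is first converted into a space-separated string of word indices so that ```CountVectorizer``` can count them: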

In [26]:
from sklearn.utils import resample

# Draw 1000 reviews at random (with replacement) from the training and test sets
x_raw_train,y_train,x_raw_test,y_test = resample(imdb["x_train"],imdb["y_train"],
                                                 imdb["x_test"],imdb["y_test"],
                                                 n_samples=1000)

x_train = [' '.join(str(e) for e in x) for x in x_raw_train]
x_test = [' '.join(str(e) for e in x) for x in x_raw_test]
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)
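
One detail of the cell above is worth checking. The default token pattern of ```CountVectorizer``` is `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters, so single-digit indices are dropped once the reviews are turned into strings. The sketch below applies that pattern directly to the first few indices of the sample shown earlier.

```python
import re

review = [23022, 309, 6, 3, 1069, 209, 9]  # first few indices of the sample shown earlier
as_text = " ".join(str(e) for e in review)

# Same regular expression as CountVectorizer's default token_pattern
kept = re.findall(r"(?u)\b\w\w+\b", as_text)

print(as_text)
print(kept)
```

Since the indices are frequency ranks, this quietly removes the most frequent words. If they should be kept, a pattern such as `r"(?u)\b\w+\b"` can be passed as `token_pattern`.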

How well does our model do?

In [27]:
model = LogisticRegression()
model.fit(x_train,y_train)
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))

1.0
0.806


### C. Lemmatization

Consider the following corpus of text, modified from the original one:

In [8]:
# Text data
corpus2 = [
    "Apple is good.",
    "Apple was bad.",
    "Apples are good.",
    "Apples were not good.",
    "Apple is not bad.",
    "Apples were...are bad."
]

Having plurals complicates our analysis: `CountVectorizer` will treat 'Apple' and 'Apples' as two distinct words, unnecessarily splitting the samples for apples. Similarly, 'is' and 'are' are both forms of the verb 'to be', so they should be considered as one word. What we need is *lemmatization*, which is the process of grouping together the inflected forms of a word for use in analysis.

We will be using <a href="https://textblob.readthedocs.io/en/dev/index.html">TextBlob</a>, a library for processing textual data. TextBlob in turn relies on <a href="http://www.nltk.org/">NLTK</a> (short for *Natural Language ToolKit*) to do some of the heavy lifting. Since NLTK does not come with all packages installed, we will need to first download the ones we need:

In [None]:
import nltk
nltk.download('punkt') 
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

The process goes as follows:
1. First convert each string to a `TextBlob` object. 
2. Split each string into sentences with the `.sentences` property if needed.
3. Split each string (or sentence) into words with the `.words` property.
4. Lemmatize each word with the `lemmatize()` method. 

Note that `lemmatize()` expects words to be in lowercase.

In [9]:
# Use TextBlob to lemmatize the corpus
from textblob import TextBlob

tb = [TextBlob(c.lower()) for c in corpus2]
sentences = [t.words for t in tb]
data = [s.lemmatize() for s in sentences]
data

[WordList(['apple', 'is', 'good']),
 WordList(['apple', 'wa', 'bad']),
 WordList(['apple', 'are', 'good']),
 WordList(['apple', 'were', 'not', 'good']),
 WordList(['apple', 'is', 'not', 'bad']),
 WordList(['apple', 'were', 'are', 'bad'])]

The code above successfully grouped 'apples' with 'apple', but it failed to group 'is' and 'are'. The second sample gives us a hint as to what went wrong: 'was' was converted to 'wa'. This happened because `lemmatize()` treats all words as nouns by default. To ensure proper conversion, we need to provide it with each word's part of speech (POS).

First, we generate part-of-speech tags by using the `.tags` property of the `TextBlob` object:


In [10]:
# Extract Penn Treebank POS
tags = [t.tags for t in tb]
tags

[[('apple', 'NN'), ('is', 'VBZ'), ('good', 'JJ')],
 [('apple', 'NN'), ('was', 'VBD'), ('bad', 'JJ')],
 [('apples', 'NNS'), ('are', 'VBP'), ('good', 'JJ')],
 [('apples', 'NNS'), ('were', 'VBD'), ('not', 'RB'), ('good', 'JJ')],
 [('apple', 'NN'), ('is', 'VBZ'), ('not', 'RB'), ('bad', 'JJ')],
 [('apples', 'NNS'), ('were', 'VBD'), ('are', 'VBP'), ('bad', 'JJ')]]

We can then provide `lemmatize()` with part-of-speech tags. Unfortunately, it is not as simple as passing the POS tags from above. The reason is that NLTK generates tags based on the <a href="https://catalog.ldc.upenn.edu/LDC99T42">Penn Treebank</a> corpus, which uses different <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">POS</a> tags than the <a href="https://wordnet.princeton.edu/documentation/wndb5wn">Wordnet</a> corpus that `lemmatize()` is based on.

We therefore need to map Penn Treebank tags to Wordnet tags before lemmatization:

In [13]:
# Function to map Penn Treebank POS to Wordnet POS
def pos_conv(pos):
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}    
    return tag_dict.get(pos[0], 'n')

# Convert Penn Treebank POS to Wordnet POS
wordnet_tags = [[[w, pos_conv(pos)] for w, pos in t] for t in tags]

# Lemmatize with POS
data = [[w.lemmatize(t) for w,t in s] for s in wordnet_tags]
data

[['apple', 'be', 'good'],
 ['apple', 'be', 'bad'],
 ['apple', 'be', 'good'],
 ['apple', 'be', 'not', 'good'],
 ['apple', 'be', 'not', 'bad'],
 ['apple', 'be', 'be', 'bad']]

TextBlob and NLTK have many other useful features such as spelling correction and translation that you can explore on your own. One particularly useful feature is pre-trained sentiment analysis:

In [12]:
# Sentiment analysis with TextBlob
sentiment =  [t.sentiment for t in tb]
sentiment

[Sentiment(polarity=0.7, subjectivity=0.6000000000000001),
 Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666),
 Sentiment(polarity=0.7, subjectivity=0.6000000000000001),
 Sentiment(polarity=-0.35, subjectivity=0.6000000000000001),
 Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666),
 Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)]
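
The polarity scores can be turned into ratings with a simple threshold. The sketch below copies the polarity values from the output above and assumes that the modified corpus carries the same ratings as the original one (1 for positive, 0 for negative). The threshold of zero is likewise an assumption.

```python
# Polarity values copied from the TextBlob output above
polarity = [0.7, -0.6999999999999998, 0.7, -0.35,
            0.3499999999999999, -0.6999999999999998]
ratings = [1, 0, 1, 0, 1, 0]  # assumption: same labels as the original corpus

# Positive polarity -> rating 1, otherwise rating 0
predicted = [1 if p > 0 else 0 for p in polarity]
agreement = sum(p == r for p, r in zip(predicted, ratings)) / len(ratings)

print(predicted)
print(agreement)
```

On these six sentences the pre-trained scores agree with every rating, including the two negated samples.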

### D. Chinese Text

One major issue with Chinese text is that there are no spaces between words. Unsurprisingly then, word segmentation is a major focus of Chinese natural language processing research.

There are multiple libraries for Chinese NLP. Here we will try out `jieba` and `pkuseg`.

In [1]:
text = '我愛吃北京餃子。'

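# jieba default
import jieba
seg_list = jieba.cut(text)
print([w for w in seg_list])

# jieba cut all mode
seg_list = jieba.cut(text, cut_all=True)
print([w for w in seg_list])

# jieba paddle mode
# Note: as written, this call is identical to the default mode; true paddle mode
# needs jieba.enable_paddle() and jieba.cut(text, use_paddle=True)
import paddle
paddle.enable_static()
seg_list = jieba.cut(text)
print([w for w in seg_list])

# pkuseg
import pkuseg
seg = pkuseg.pkuseg()
words = seg.cut(text)  # keep `text` as a string so that later cells can reuse it
print(words)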

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.441 seconds.
Prefix dict has been built successfully.


['我愛吃', '北京', '餃子', '。']
['我', '愛', '吃', '北京', '餃', '子', '。']
['我愛吃', '北京', '餃子', '。']
['我', '愛', '吃', '北京', '餃子', '。']


Things are much easier once we have the individual words. For example, we could immediately build n-grams from the text.
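
As a small illustration, the sketch below builds bigrams from the `pkuseg` segmentation shown above. The word list is copied from that output so that the code runs without either library installed, and dropping the full stop is an assumption about how punctuation should be handled.

```python
# pkuseg segmentation copied from the output above
segmented = ['我', '愛', '吃', '北京', '餃子', '。']

# Drop the sentence-final punctuation before pairing adjacent words
words = [w for w in segmented if w != '。']
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams)
```

If the space-joined words are passed to ```CountVectorizer``` instead, note that its default token pattern discards single-character tokens such as '我', so a custom `token_pattern` or `tokenizer` would be needed.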

We can also fetch POS:

In [2]:
text = '我愛吃北京餃子。'  # the cut functions expect a string, not a list

# jieba
import jieba
import jieba.posseg as pseg
jieba.enable_paddle()  # must be called before use_paddle=True
words = pseg.cut(text,use_paddle=True)
print([(w,f) for w,f in words])

# pkuseg
import pkuseg
seg = pkuseg.pkuseg(postag=True)
tagged = seg.cut(text)
print(tagged)

POS tags for `pkuseg`:
https://github.com/lancopku/pkuseg-python/blob/master/tags.txt

For `jieba`:
https://github.com/fxsjy/jieba

### E. Neural Network

Below is a simple LSTM neural network model that runs sentiment analysis on the IMDB data:

In [30]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
from sklearn.utils import resample  # repeated here so that this cell runs on its own

max_features = 20000
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train,y_train,x_test,y_test = resample(x_train,y_train,x_test,y_test,
                                         n_samples=1000)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2))
model.add(Dense(128))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Loading data...
1000 train sequences
1000 test sequences
Pad sequences (samples x time)
x_train shape: (1000, 80)
x_test shape: (1000, 80)
Build model...
Train...
Train on 1000 samples, validate on 1000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 1.52359737206
Test accuracy: 0.719


You should notice that training a neural network is several orders of magnitude slower than training an n-gram model. Furthermore, the neural network model above is not more accurate than our simple n-gram model. One reason is that, with so many parameters, neural network models need more than a thousand samples to achieve good results. You can try running the same script with more data on a computer with a GPU and see whether you get better results.
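
To put "so many parameters" into numbers, the sketch below counts the trainable parameters of the network defined above by hand. It assumes the standard layer definitions with bias terms (an LSTM has four gates, each with input weights, recurrent weights and a bias); `model.summary()` should report the same figures.

```python
max_features = 20000  # vocabulary size
embed_dim = 128       # embedding output dimension
lstm_units = 128
dense_units = 128

# Embedding: one vector of length embed_dim per word
embedding = max_features * embed_dim
# LSTM: four gates, each with input weights, recurrent weights and a bias
lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# Dense layers: weights plus one bias per output unit
dense = lstm_units * dense_units + dense_units
output = dense_units * 1 + 1

total = embedding + lstm + dense + output
print(total)
print(total / 1000)  # parameters per training sample
```

With 1,000 training samples, the model has a few thousand parameters per sample, most of them in the embedding layer.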

### Further Readings
- <a href="https://github.com/dipanjanS/text-analytics-with-python">Text Analytics with Python</a> (or the <a href="https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72">free tutorial</a> by the same author on Towards Data Science.)