# Text Data

Version: 2024-10-14

Building models that understand text falls under the field of *natural language processing*. This is a field of enormous practical importance: chatbots, automated translation and generated news articles are a few notable applications. In this notebook we will look into some basic ways of processing text data.

Below is what you might get in a typical dataset of review data:

In [None]:
#Text data
corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad."
]

ratings = [
    1,
    0,
    1,
    0,
    1,
    0
]

When analyzing review data, the typical goal is to predict a single value, the rating, from the written text. In the case of chatbots and automated translation, where a single value is not sufficient to represent the meaning of the text, the model outputs a vector instead.

### A. N-gram

Let us count the number of times each word appears in a sample. Single-word counts like these are called *unigrams* in natural language processing. To do so, we will use ```CountVectorizer``` from scikit-learn:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


Use ```get_feature_names_out()``` to see which word each column represents:

The word-count vector can now be used with a suitable model to conduct language processing. Here we will simply use a logit model:

In [None]:
#Logistic regression
from sklearn.linear_model import LogisticRegression
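One way this step might look, offered as a sketch rather than the notebook's official solution. It assumes the `corpus` and `ratings` lists from above (repeated so the block is self-contained) and fits on unigram counts with scikit-learn's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

corpus = [
    "This is good.",
    "This is bad.",
    "This is very good.",
    "This is not good.",
    "This is not bad.",
    "This is...is bad."
]
ratings = [1, 0, 1, 0, 1, 0]

# Unigram counts as features
x = CountVectorizer().fit_transform(corpus)

# Fit the logit model on the word counts
model = LogisticRegression()
model.fit(x, ratings)

# Compare the predicted rating with the actual one for every sample
predictions = model.predict(x)
for text, rating, predicted in zip(corpus, ratings, predictions):
    print(f"{text!r}: actual={rating}, predicted={predicted}")

print("Training accuracy:", model.score(x, ratings))
```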


Which phrases does our model have difficulty understanding? Why might that be the case?

Now let us take a look at the estimated coefficients:

Can you see what is wrong with our model? One thing you might notice is that 'is' has a very negative coefficient while 'very' has a very positive coefficient, even though neither word carries such a connotation on its own.

When we start counting combinations of words instead of individual words, what we have are *n-grams*. ```CountVectorizer``` allows us to specify the range of n-gram lengths we wish to consider via the option ```ngram_range```:

Now let us try running the logistic regression again:

Much better!

### B. IMDB Movie Review

Now let us try something real. We will analyse a sample of <a href="https://www.imdb.com/">IMDB</a> movie reviews, trying to predict the rating a user gives based on their written review. For speed reasons we will be using a subsample, but the original text data can be found <a href="http://ai.stanford.edu/~amaas/data/sentiment/">here</a>.

First let us import the data:

In [None]:
import pandas as pd
imdb_train = pd.read_csv("../Data/imdb_train.csv",
                         names=['label','text'])
imdb_test = pd.read_csv("../Data/imdb_test.csv",
                         names=['label','text'])

How many samples do we have?

In [None]:
print(imdb_train.shape)
print(imdb_test.shape)

What is inside each sample?

In [None]:
imdb_train.head(5)

<!--Words are encoded by their frequency-of-appearance ranking in the data. This allows us to easily delete words that either
- appear frequently but add little to the meaning of the text (e.g. articles, conjunctions and prepositions), or
- appear too infrequently to be of use.-->

We will now repeat what we have done previously:

In [None]:
y_train = imdb_train['label']
y_test = imdb_test['label']

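# N-gram: learn the vocabulary from the training text only, then apply it to the test text
# (the (1, 2) range is an assumed choice; try other ranges as well)
vectorizer = CountVectorizer(ngram_range=(1, 2))
x_train = vectorizer.fit_transform(imdb_train['text'])
x_test = vectorizer.transform(imdb_test['text'])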


How well does our model do?

In [None]:
model = LogisticRegression()
model.fit(x_train,y_train)
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))

### C. Lemmatization

Consider the following corpus of text, modified from the original one:

In [None]:
# Text data
corpus2 = [
    "Apple is good.",
    "Apple was bad.",
    "Apples are good.",
    "Apples were not good.",
    "Apple is not bad.",
    "Apples were...are bad."
]

Having plurals complicates our analysis: `CountVectorizer` will treat 'Apple' and 'Apples' as two distinct words, unnecessarily splitting the samples for apples. Similarly, 'is' and 'are' are both forms of the verb 'to be', so they should be considered as one word. What we need is *lemmatization*, which is the process of grouping together the inflected forms of a word for use in analysis.

We will be using <a href="https://textblob.readthedocs.io/en/dev/index.html">TextBlob</a>, a library for processing textual data. TextBlob in turn relies on <a href="http://www.nltk.org/">NLTK</a> (short for *Natural Language ToolKit*) to do some of the heavy lifting. Since NLTK does not come with all packages installed, we will need to first download the ones we need:

In [None]:
import nltk
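nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
# NLTK 3.9 and later look for these renamed resources instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')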

The process goes as follows:
1. First convert each string to a `TextBlob` object. 
2. Split each string into sentences with the `.sentences` property if needed.
3. Split each string (or sentence) into words with the `.words` property.
4. Lemmatize each word with the `lemmatize()` method. 

Note that `lemmatize()` expects words to be in lowercase.

In [None]:
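# Use TextBlob to lemmatize the corpus
from textblob import TextBlob

for text in corpus2:
    # lemmatize() expects lowercase words and treats each one as a noun by default
    print(TextBlob(text.lower()).words.lemmatize())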


The code above successfully grouped 'apples' with 'apple', but it failed to group 'is' and 'are'. The second sample gives us a hint as to what went wrong: 'was' was somehow converted to 'wa'. What happened is that `lemmatize()` by default treats all words as nouns. To ensure proper conversion, we need to provide it with each word's part of speech (POS).

First, we generate part-of-speech tags by using the `.tags` property of the `TextBlob` object:


In [None]:
# Extract Penn Treebank POS
print([TextBlob(text.lower()).tags for text in corpus2])


We can then provide `lemmatize()` with part-of-speech tags. Unfortunately, it is not as simple as passing in the POS tags from above. The reason is that NLTK generates tags based on the <a href="https://catalog.ldc.upenn.edu/LDC99T42">Penn Treebank</a> corpus, which uses different <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">POS</a> tags than the <a href="https://wordnet.princeton.edu/documentation/wndb5wn">Wordnet</a> corpus that `lemmatize()` is based on.

We therefore need to map the two POS systems before lemmatization:

In [None]:
# Function to map Penn Treebank POS to Wordnet POS

# Convert Penn Treebank POS to Wordnet POS

# Lemmatize with POS
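The mapping step can be sketched in plain Python. The function name `penn_to_wordnet` is hypothetical, and the `(word, tag)` pairs below are written by hand in the format that `.tags` returns, not produced by the tagger. The codes `'n'`, `'v'`, `'a'` and `'r'` are WordNet's noun, verb, adjective and adverb labels.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code."""
    if tag.startswith('J'):    # JJ, JJR, JJS
        return 'a'             # adjective
    if tag.startswith('V'):    # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'             # verb
    if tag.startswith('RB'):   # RB, RBR, RBS
        return 'r'             # adverb
    # Nouns, plus everything else: fall back to noun, the default of lemmatize()
    return 'n'

# Hand-written example pairs (illustrative only, not actual tagger output)
tagged = [('apples', 'NNS'), ('were', 'VBD'), ('not', 'RB'), ('good', 'JJ')]

converted = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(converted)
```

Each converted pair could then be passed to TextBlob as `Word(word).lemmatize(pos)`, with `Word` imported from `textblob`. That call is not run here because it needs the NLTK corpora downloaded earlier.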


TextBlob and NLTK have many other useful features such as spelling correction and translation that you can explore on your own. One particularly useful feature is pre-trained sentiment analysis:

In [None]:
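# Sentiment analysis with TextBlob: polarity runs from -1 (negative) to 1 (positive)
print([TextBlob(text).sentiment.polarity for text in corpus2])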


### D. Chinese Text

One major issue with Chinese text is that there is no space between words. Unsurprisingly then, this is a major focus for Chinese natural language processing research.

There are multiple libraries for Chinese NLP. Here we will try out `jieba` and `pkuseg`.

In [None]:
text = '我愛吃北京餃子。'

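# jieba default
import jieba
print(jieba.lcut(text))

# jieba + paddle (needs the optional paddlepaddle-tiny package)
jieba.enable_paddle()
print(jieba.lcut(text, use_paddle=True))

# pkuseg
import pkuseg
print(pkuseg.pkuseg().cut(text))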


Things are much easier once we have the individual words. For example, we could immediately build n-grams from the text.

We can also fetch POS:

In [None]:
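# jieba
import jieba
import jieba.posseg as pseg
print([(w.word, w.flag) for w in pseg.cut(text)])

# jieba + paddle (needs the optional paddlepaddle-tiny package)
jieba.enable_paddle()
print([(w.word, w.flag) for w in pseg.cut(text, use_paddle=True)])

# pkuseg
import pkuseg
print(pkuseg.pkuseg(postag=True).cut(text))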


### E. Neural Network

Below is a simple LSTM neural network model that runs sentiment analysis on the IMDB data:

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
from keras.utils import pad_sequences
from sklearn.utils import resample

max_features = 20000
maxlen = 80  # keep at most this many words per review (among top max_features most common words)
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
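# Draw the same 1,000 random positions from the training and test sets, without duplicates
x_train,y_train,x_test,y_test = resample(x_train,y_train,x_test,y_test,
                                         n_samples=1000, replace=False)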
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2))
model.add(Dense(128))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
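To make the padding step concrete, here is a plain-Python sketch that mimics what `pad_sequences(..., maxlen=...)` does to a single sequence under its default settings, which pad and truncate at the front. The helper `pad_or_truncate` is a hypothetical name written for illustration and is not part of Keras.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic pad_sequences' default behaviour for one sequence."""
    # Truncate at the front: keep only the last `maxlen` items
    trimmed = sequence[-maxlen:]
    # Pad at the front so every sequence ends up with length `maxlen`
    return [value] * (maxlen - len(trimmed)) + trimmed

short_review = [5, 8, 2]
long_review = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_review, 4))  # [0, 5, 8, 2]
print(pad_or_truncate(long_review, 4))   # [3, 4, 5, 6]
```

With `maxlen = 80`, this means the model sees the last 80 words of each long review rather than the first 80.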

You should notice that training a neural network is several orders of magnitude slower than training an n-gram model. Furthermore, the neural network model above is not more accurate than our simple n-gram model. One reason is that, with so many parameters, a neural network trained from scratch needs more than a thousand samples to achieve good results. You can try running the same script with more data on a computer with a GPU and see whether you get better results.

### F. Large Language Models

A much better way to incorporate a neural network is to use a *pre-trained* language model,
which has been trained to understand language on an enormous amount of text data.
The main reason the models we have tried so far do not work well is that they have to learn
English from scratch from the relatively small number of samples we provide.
Using a pre-trained model circumvents this issue.

We will go into the details of such pre-trained models in a later lecture.
Here we will simply run a demo. As running these models is very computationally intensive,
you should run the following code with access to a GPU; otherwise it will be very slow.

In [None]:
# Run with GPU! This is very slow on CPU.

# Data
x_test = imdb_test['text'].tolist()
y_test = imdb_test['label'].tolist()

# Use a pre-trained text classifier through the Hugging Face Transformers library
# (no model is named, so pipeline() falls back to its default sentiment model)
from transformers import pipeline
classifier = pipeline('text-classification', device=0) # device=0 means use (first) GPU
tokenizer_kwargs = {'truncation':True,'max_length':512} # Truncate to 512 tokens
results = classifier(x_test,**tokenizer_kwargs) # Returns a text label
y_predicted = [ 1 if x['label']=='POSITIVE' else 0 for x in results] # Convert to 1 or 0

# Measure accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:",accuracy_score(y_test,y_predicted))

### G. Accessing Language Models through API

Instead of running your own copy of a pre-trained language model,
you can defer the task to a remotely hosted model. The most well-known
remotely hosted models are OpenAI's GPT models and Anthropic's Claude.
Here we will use Meta's Llama 3.1, hosted on the CUHK Department of Economics'
servers.

Note that while this is the most powerful and convenient method of conducting
text classification, it is also the slowest: you pay a price for being able
to give the model instructions in plain language.

In [None]:
# Text classification through API

# Need to supply your own API key
api_key = 'your-key-here'

from openai import OpenAI
import numpy as np

# Data - using only 30 samples for speed reasons
x_test = imdb_test['text'].head(30).tolist()
y_test = imdb_test['label'].head(30).tolist()

# OpenAI API
client = OpenAI(
    base_url = 'https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key=api_key,
)

# Function for running the inference
def f(x):
    response = client.chat.completions.create(
      model="llama3.1:8b-instruct-q5_K_M",
      messages=[
        {"role": "system", 
         "content": """Please classify if the given text's sentiment is positive or negative.
                       If it is positive, return 1. Otherwise return 0. 
                       Show only the final answer, do not show your reasoning.
                    """},        
        {"role": "user", "content": x}
      ],
      temperature=0  
    )
    return int(response.choices[0].message.content)

# Loop through data
y_predicted = [f(x) for x in x_test]

# Measure accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:",accuracy_score(y_test,y_predicted))
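The `int(...)` conversion inside `f` raises a `ValueError` whenever the model replies with anything other than a bare digit, for example `"1."` or `"Answer: 0"`. A more forgiving parser could look like the sketch below; `parse_label` is a hypothetical helper, and returning `None` for unclear replies is one possible design choice among several.

```python
import re

def parse_label(reply):
    """Return 0 or 1 from a model reply, or None if no clear label is found."""
    # Look for a standalone 0 or 1, ignoring surrounding text and punctuation
    match = re.search(r'\b[01]\b', reply)
    return int(match.group()) if match else None

print(parse_label("1"))
print(parse_label(" 0\n"))
print(parse_label("Answer: 1."))
print(parse_label("I think it is positive"))
```

Replies that come back as `None` should be dropped or retried before calling `accuracy_score`, since that function expects a label for every sample.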

### Further Readings
- <a href="https://github.com/dipanjanS/text-analytics-with-python">Text Analytics with Python</a> (or the <a href="https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72">free tutorial</a> by the same author on Towards Data Science.)