# NLP Seminar 2 - Normalization and Simple Word Embeddings

# 0. Preliminaries

In this seminar, we focus on classical text normalization methods (stemming, lemmatization) and simple word embedding/vectorization techniques (bag of words, TF-IDF) that will allow us to train machine learning models on text data. We again use the `nltk` (Natural Language Toolkit) Python package, as well as `scikit-learn`.

In [None]:
# Module imports
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearnex import patch_sklearn
patch_sklearn(verbose=False)

In [None]:
# First download some nltk resources
#nltk.download("stopwords")
#nltk.download("wordnet")
#nltk.download("omw-1.4")
#nltk.download("punkt")
#nltk.download("averaged_perceptron_tagger")
# alternatively, they are all part of:
nltk.download('popular', quiet=True)
nltk.download('universal_tagset', quiet=True)

# 1. Download the data


We will work with the famous `20newsgroups` dataset. It consists of a large collection of news posts across 20 topics. We will use it to test some basic NLP techniques and to train a multi-class classification model that predicts the most likely topic for unseen news posts.
For more information, check [the dataset description](https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset) and the [import function helper](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html).

To make the task a bit harder, we import the data without the headers, footers and quotes.

We also restrict the dataset to only 4 of the categories, for presentation simplicity.

In [None]:
from sklearn.datasets import fetch_20newsgroups
# we restrict the data to the following response categories:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

data_train = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42,
                                remove=("headers", "footers", "quotes"))

data_test = fetch_20newsgroups(subset="test", categories=categories, shuffle=True, random_state=42,
                               remove=("headers", "footers", "quotes"))

df_train = pd.DataFrame({"text": data_train.data, "class": data_train.target})
df_test = pd.DataFrame({"text": data_test.data, "class": data_test.target})

In [None]:
# Inspect the data
df_train.head()

In [None]:
# How many classes are there to identify?
df_train["class"].unique().shape

In [None]:
# What do these class labels correspond to?
target_names = data_train.target_names
print(target_names)

In [None]:
print(f"Train: {df_train.shape}")
print(f"Test: {df_test.shape}")

# 2. Natural Language Processing: Text Normalization

For classic NLP techniques, text normalization can be crucial to good model performance. The main aim of normalization is to decrease the vocabulary size, by reducing similar words to common roots or removing uninformative words. Normalization techniques include word tokenization, stopword removal, lemmatization, and more. We'll try a few of them here to see their impact on our news classification model.

We begin by testing those text normalization methods on a single news post example, before applying them to the entire dataset.

In [None]:
# Take an example entry in the training set for demonstration purposes here
text = df_train["text"].iloc[11]
print(text)

### 2.1. Tokenization
Before any type of normalization, the first step is to tokenize each document in our corpus. We again rely on the NLTK tokenizer to convert a string to a list of words. Here we tokenize the selected `text` observation from above.

In [None]:
text_tokens = nltk.word_tokenize(text)
print(text_tokens)
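
To see why a dedicated tokenizer is useful, the sketch below compares naive whitespace splitting with a very simplified regex tokenizer. The regex is only an illustrative stand-in and is not the algorithm used by `nltk.word_tokenize`; the sample sentence is made up.

```python
import re

sample = "Hi there, how are you?"

# Naive approach: split on whitespace, so punctuation stays glued to the words
whitespace_tokens = sample.split()

# Simplified tokenizer: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `there,` and `there` would count as two different vocabulary entries, which is exactly what tokenization is meant to avoid.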

### 2.2. Stopwords and punctuation

Stopwords correspond to unimportant words which might safely be ignored for the task at hand. For the specific task of news classification, we might for example rely on the default stopwords provided by NLTK.

In [None]:
# Let's use the nltk stopwords available for the English language
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords)

One might also want to remove punctuation tokens. Here is a list of standard punctuation.

In [None]:
import string
string.punctuation
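
As a minimal sketch of how such lists can be applied, the cell below filters stopwords and punctuation out of a token list. To stay self-contained it uses a tiny hand-written stopword set (`toy_stopwords`, a hypothetical name) and a made-up token list; in the seminar you would use the NLTK `stopwords` list instead.

```python
import string

# Hypothetical mini stopword set, for illustration only
toy_stopwords = {"the", "is", "a", "of", "and", "to", "in"}
punctuation_chars = set(string.punctuation)

tokens = ["The", "heart", "is", "a", "muscle", ",", "and", "it", "pumps", "blood", "."]

filtered = [
    tok for tok in tokens
    # Compare in lowercase, since stopword lists are usually lowercase
    if tok.lower() not in toy_stopwords
    # Drop tokens made up only of punctuation characters
    and not all(ch in punctuation_chars for ch in tok)
]
print(filtered)
```

Note the lowercase comparison: without it, a capitalized `The` would slip through the filter.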

### 2.3. Stemming

When it comes to reducing similar words to common roots, stemming is the simplest and fastest approach. Stemming is a rule-based process that strips the last few characters from a word, without looking at its meaning. As a downside, it sometimes produces stems with an incorrect meaning or spelling.
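
To make the idea of rule-based suffix stripping concrete, here is a deliberately crude toy stemmer. It is only a sketch with a hypothetical four-suffix rule set, and it is not the Porter or Snowball algorithm, which use many more rules and conditions.

```python
def toy_stem(word: str) -> str:
    """Strip the first matching suffix; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so that short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


examples = ["jumping", "jumped", "jumps", "studies", "running"]
stems = [toy_stem(w) for w in examples]
print(list(zip(examples, stems)))
```

The first three words collapse to the same stem, which is the desired effect, while `studies` and `running` end up as non-words: the kind of spelling error that stemming can introduce.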

Let's try two different stemmers: "PorterStemmer" and the slightly more recent "SnowballStemmer"

In [None]:
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [None]:
ps = PorterStemmer()
text_stems = [ps.stem(word) for word in text_tokens]
print(text_stems)

In [None]:
snowstem = SnowballStemmer("english")
text_sstems = [snowstem.stem(word) for word in text_tokens]
print(text_sstems)

In [None]:
# before/after?

In [None]:
[(a,b) for a, b in zip(text_stems, text_sstems) if a!=b]

### 2.4. Part-of-speech (POS) tagging
Part-of-speech tagging assigns each word a part of speech (e.g. verb, noun, etc.), using a predictive model based on the word itself and its surrounding context. These tags can be useful for several types of analyses. In particular, they are used for lemmatization.

Tagsets of varying granularity exist for each language.

In [None]:
# We can use the default pos-tagger provided by NLTK
text_PoS_p = nltk.pos_tag(text_tokens)
print(text_PoS_p)

The default tagger returns Penn Treebank tags, which are described here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
# Or the "universal" tags
text_PoS_u = nltk.pos_tag(text_tokens, tagset="universal")
print(text_PoS_u)

The universal tagset ([paper](http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf), [repository](https://github.com/slavpetrov/universal-pos-tags)) contains the following 12 tags:

- VERB - verbs (all tenses and modes)
- NOUN - nouns (common and proper)
- PRON - pronouns 
- ADJ - adjectives
- ADV - adverbs
- ADP - adpositions (prepositions and postpositions)
- CONJ - conjunctions
- DET - determiners
- NUM - cardinal numbers
- PRT - particles or other function words
- X - other: foreign words, typos, abbreviations
- . - punctuation

### 2.5. Lemmatization
Lemmatization considers the context and converts the word to its meaningful base form, which is called the "lemma".
It uses language-specific lookup tables to find the root forms of words. This makes it more computationally expensive than stemming.
Here we use the `WordNetLemmatizer` (https://www.nltk.org/api/nltk.stem.wordnet.html).
It uses the WordNet database (https://wordnet.princeton.edu/) to find the lemma of a word, given the word AND its PoS tag!

In [None]:
# We can use the English language lemmatizer provided by NLTK
wnl = nltk.stem.WordNetLemmatizer()

In [None]:
?wnl.lemmatize

In [None]:
for w, pos in text_PoS_u:
    # Map the universal POS tags to the WordNet POS codes expected by the lemmatizer
    if pos in ["VERB"]:
        pos = "v"
    elif pos in ["ADJ"]:
        pos = "a"
    elif pos in ["ADV"]:
        pos = "r"
    elif pos in ["NOUN"]:
        pos = "n"
    else:
        # Fall back to noun, which is also the lemmatizer's default
        pos = "n"
    # print the original tokens and their lemma
    print(f"{w} -> {wnl.lemmatize(w, pos=pos)}")

In [None]:
# To illustrate the difference:
diff_ex = ["run","ran",
           "universal", "university", "universe",
           "alumnus","alumni"]

for w, pos in nltk.pos_tag(diff_ex, tagset="universal"):
    # Map the detailed set of POS-tags to nouns/verbs for the lemmatizer
    if pos in ["VERB"]:
        pos = "v"
    elif pos in ["ADJ"]:
        pos = "a"
    elif pos in ["ADV"]:
        pos = "r"
    elif pos in ["NOUN"]:
        pos = "n"
    else:
        pos = "n"
    print(f"{w} - stem: {ps.stem(w)} - lemma: {wnl.lemmatize(w, pos=pos)}")
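
The tag mapping repeated in the cells above could be factored into a small helper. The sketch below is one possible way to do it; the names `UNIVERSAL_TO_WORDNET` and `to_wordnet_pos` are hypothetical, and the tagged words are written by hand so the cell does not depend on any NLTK resource.

```python
# Universal POS tag -> WordNet POS code ("n" noun, "v" verb, "a" adjective, "r" adverb)
UNIVERSAL_TO_WORDNET = {"VERB": "v", "ADJ": "a", "ADV": "r", "NOUN": "n"}


def to_wordnet_pos(universal_tag: str) -> str:
    """Return the WordNet POS code, falling back to noun for all other tags."""
    return UNIVERSAL_TO_WORDNET.get(universal_tag, "n")


# Hand-written (word, universal tag) pairs, standing in for nltk.pos_tag output
tagged = [("dogs", "NOUN"), ("ran", "VERB"), ("quickly", "ADV"), ("the", "DET")]
mapped = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)
```

A dictionary lookup with a default keeps the mapping in one place, so a new tag only requires one new entry.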

### 2.6. Putting it all together

Let's combine all of the above techniques into a single "tokenize + normalize" function which will output a list of normalized words given a string of text (document) as input.

In [None]:
# Write your custom tokenizer method here:
def custom_tokenizer(text: str):
    text_tokens = nltk.word_tokenize(text)
    output = []
    for w, pos in nltk.pos_tag(text_tokens, tagset="universal"):
        if pos in ["VERB"]:
            pos = "v"
        elif pos in ["ADJ"]:
            pos = "a"
        elif pos in ["ADV"]:
            pos = "r"
        elif pos in ["NOUN"]:
            pos = "n"
        else:
            pos = "n"

        # Lemmatized form accounting for POS-tag
        l = wnl.lemmatize(w, pos=pos)

        # Filter out stopwords
        if l not in stopwords:
            output.append(l)

    return output

In [None]:
# Try it out on the text
print(custom_tokenizer(text))

# 3. Word embeddings and text classification

Using the NLP normalization techniques from above, including our custom tokenization method, we want to train a simple multi-class ML classifier to predict the news topic. In order to do so, we must first vectorize the text data.

### 3.1. Bag-of-words (BOW)

Bag-of-words is the simplest *embedding* technique: each document is represented as a numeric vector of word counts, with one dimension per word of the vocabulary (a single word on its own corresponds to a one-hot vector). The length of these vectors corresponds to the size of the (reduced) vocabulary of the training corpus.
That is, each column $j$ in the resulting sparse matrix represents a word, each row $i$ represents a different observation (i.e. document), and the corresponding entry in the matrix counts how many times word $j$ appears in document $i$.
See [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for the scikit-learn implementation of the bag-of-words embedding and its many options.

We start by training the bag-of-words on our train corpus, and compare the vocabulary size with and without our `'custom_tokenizer'` normalization procedure. We can ignore tokens that appear in fewer than 0.1% of documents.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
simple_vectorizer = CountVectorizer(lowercase=True, min_df=1e-3)

bow_train_simple = simple_vectorizer.fit_transform(df_train["text"])

In [None]:
bow_train_simple.shape

In [None]:
vectorizer = CountVectorizer(tokenizer=custom_tokenizer, lowercase=True, min_df=1e-3)

bow_train = vectorizer.fit_transform(df_train["text"])

In [None]:
bow_train.shape

### 3.2. Bag-of-words (BOW) with Term-frequency Inverse-document-frequency (TF-IDF)

TF-IDF uses the **same** one-hot encoding as traditional BOW, but replaces the simple counts with term frequencies weighted by the inverse document frequency, which down-weights words that occur in many documents.

The intuition is that a word represents a document well not only if it is frequent in that document, but also if it is rare in the other documents.
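
To make the weighting concrete, the sketch below recomputes TF-IDF by hand on a small made-up count matrix and compares it with scikit-learn. It assumes scikit-learn's default settings, i.e. the smoothed variant $\text{idf}(t) = \ln\frac{1+n}{1+\text{df}(t)} + 1$ followed by an L2 normalization of each row; other libraries and textbooks use slightly different formulas.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 6 documents (rows) x 3 words (columns), made up for illustration
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency

weighted = counts * idf                          # term frequency times idf
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

tfidf_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

The first word appears in every document, so it receives the smallest idf weight, while the rarer second and third words are weighted up.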

We now compute the TF-IDF transformation of the BoW, using our `'custom_tokenizer'`. See the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) scikit-learn classes for help.

#### Way 1: two-step bag of words and TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

#using the BoW from above:
tfidf_train0 = tfidf_transformer.fit_transform(bow_train)

In [None]:
tfidf_train0.shape

#### Way 2: both at once

Here we repeat the two steps above in a single equivalent step, using TfidfVectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, lowercase=True, min_df=1e-3)

tfidf_train = tfidf_vectorizer.fit_transform(df_train["text"])

In [None]:
tfidf_train.shape
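
The equivalence of the two ways can be checked quickly on a toy corpus. This is only a sanity-check sketch with default vectorizer settings and made-up sentences (`toy_corpus` is a hypothetical name), so that it runs in a fraction of a second.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

toy_corpus = ["the cat sat on the mat", "the dog barked", "cats and dogs"]

# Way 1: counts first, then the TF-IDF re-weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(toy_corpus))

# Way 2: both steps in a single estimator
one_step = TfidfVectorizer().fit_transform(toy_corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

In practice the one-step version is more convenient inside a `Pipeline`, while the two-step version lets you reuse an existing count matrix.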

# 4. Topic Classification using Simple Machine Learning

### 4.1. Lemmatization and TF-IDF impacts

In this section we want to study the impact of lemmatization and of TF-IDF on the classification accuracy of a simple ML model. We use a multinomial logistic regression with default hyper-parameters.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Let's check the test accuracy using BoW without lemmatization.

In [None]:
clf = LogisticRegression(max_iter=int(1e4))
clf.fit(bow_train_simple, df_train["class"])

# Run the inference on test data
pred_bow_simple = clf.predict(simple_vectorizer.transform(df_test["text"]))

In [None]:
# Report the score
accuracy_score(df_test["class"], pred_bow_simple)

Does lemmatization improve the test accuracy?

In [None]:
clf = LogisticRegression(max_iter=int(1e4))
clf.fit(bow_train, df_train["class"])

# Run the inference on test data
pred_bow = clf.predict(vectorizer.transform(df_test["text"]))

In [None]:
# Report the score
accuracy_score(df_test["class"], pred_bow)

Let's see if using TF-IDF weights improves the score further.

In [None]:
clf = LogisticRegression(max_iter=int(1e4))
clf.fit(tfidf_train, df_train["class"])

# Run the inference on test data
pred_tfidf = clf.predict(tfidf_vectorizer.transform(df_test["text"]))

In [None]:
# Report the score
accuracy_score(df_test["class"], pred_tfidf)

This already performs better than a simple BOW model. You can try changing the vectorizer parameters or using a different ML model for the classification. We will investigate more advanced methods in later labs.

### 4.2. Hyper-Parameter Tuning

We can perform a cross-validated grid search to select the best hyper-parameter values for both the TF-IDF vectorizer and the logistic regression model at the same time, using a sklearn `Pipeline`. For the TF-IDF vectorizer, let's check, for example, whether including bigrams in the vocabulary helps the classifier. For the logistic regression, the tuning parameter is `'C'`, the inverse of the regularization strength (smaller values mean stronger regularization).
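
Grid keys for a `Pipeline` follow the pattern `<step name>__<parameter name>`. The sketch below builds a throwaway pipeline (`toy_pipe`, a hypothetical name, with default estimators) only to show how these names can be inspected and set; nothing is fitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logistic", LogisticRegression()),
])

# All tunable parameters are exposed under "<step>__<parameter>" keys
available = toy_pipe.get_params()
print("logistic__C" in available, "tfidf__ngram_range" in available)

# The same keys can be used to set values, which is what the grid search does internally
toy_pipe.set_params(logistic__C=10, tfidf__ngram_range=(1, 2))
print(toy_pipe.named_steps["logistic"].C, toy_pipe.named_steps["tfidf"].ngram_range)
```

Checking `get_params()` first is a cheap way to catch a misspelled grid key before launching a long cross-validation run.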

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV

In [None]:
logistic_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=custom_tokenizer, lowercase=True, min_df=1e-3, max_df=0.999)),
    ("logistic", LogisticRegression(max_iter=int(1e4)))
])

In [76]:
# Define parameter grid
my_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    "logistic__C": [10, 100, 1000],
}

# Define folds
folds = KFold(n_splits=3, shuffle=True, random_state=42)

# Define grid search CV
tfidf_logistic_cv = GridSearchCV(estimator=logistic_pipe, param_grid=my_grid, scoring="accuracy", cv=folds, n_jobs=-2)

# Run CV
tfidf_logistic_cv.fit(df_train["text"], df_train["class"])



In [77]:
pd.DataFrame(tfidf_logistic_cv.cv_results_).sort_values("rank_test_score")

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logistic__C | param_tfidf__ngram_range | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 59.497286 | 7.672235 | 29.446819 | 2.391476 | 100 | (1, 1) | {'logistic__C': 100, 'tfidf__ngram_range': (1,... | 0.868526 | 0.859043 | 0.851064 | 0.859544 | 0.007138 | 1 |
| 0 | 45.952068 | 1.381878 | 31.013361 | 1.099879 | 10 | (1, 1) | {'logistic__C': 10, 'tfidf__ngram_range': (1, 1)} | 0.864542 | 0.857713 | 0.849734 | 0.85733 | 0.006051 | 2 |
| 4 | 60.970096 | 13.564977 | 24.709396 | 3.421494 | 1000 | (1, 1) | {'logistic__C': 1000, 'tfidf__ngram_range': (1... | 0.861886 | 0.852394 | 0.845745 | 0.853341 | 0.006624 | 3 |
| 1 | 52.562788 | 0.741009 | 33.600293 | 1.090875 | 10 | (1, 2) | {'logistic__C': 10, 'tfidf__ngram_range': (1, 2)} | 0.868526 | 0.847074 | 0.841755 | 0.852452 | 0.011572 | 4 |
| 3 | 74.983177 | 2.083764 | 27.871013 | 1.014869 | 100 | (1, 2) | {'logistic__C': 100, 'tfidf__ngram_range': (1,... | 0.864542 | 0.847074 | 0.843085 | 0.851567 | 0.009318 | 5 |
| 5 | 50.044275 | 2.023628 | 14.360304 | 1.448081 | 1000 | (1, 2) | {'logistic__C': 1000, 'tfidf__ngram_range': (1... | 0.860558 | 0.845745 | 0.844415 | 0.850239 | 0.007317 | 6 |


In [79]:
tfidf_logistic_cv.best_params_

{'logistic__C': 100, 'tfidf__ngram_range': (1, 1)}

In [80]:
best_model = tfidf_logistic_cv.best_estimator_
y_hat = best_model.predict(df_test["text"])
accuracy_score(df_test["class"], y_hat)

0.8015978695073236

### 4.3. Diagnostics

We can run some classification accuracy diagnostics to understand in more detail how our model is performing.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

In [None]:
cm = confusion_matrix(df_test["class"], y_hat)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data_test.target_names).plot()
plt.show()

In the plot, rows correspond to the actual category and columns to the predicted category.

In [None]:
print(classification_report(df_test["class"], y_hat, target_names=data_test.target_names))

precision = $TP/(TP+FP)$ = when the model predicts this class, how often it is right

recall = $TP/(TP+FN)$ = how well the model performs (accuracy) within each (true) class

F1 = harmonic mean of precision and recall = $\frac{2}{(1/precision)+(1/recall)}$ (1 = best, perfect precision and recall; 0 = worst, either precision or recall is zero)

support = number of occurrences of the class in the test set
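
These definitions can be computed directly from a confusion matrix. The sketch below uses a made-up 3-class matrix (`toy_cm`, hypothetical numbers, rows = actual class, columns = predicted class) rather than the seminar results.

```python
import numpy as np

# Made-up confusion matrix: rows = actual class, columns = predicted class
toy_cm = np.array([[8, 2, 0],
                   [1, 7, 2],
                   [0, 1, 9]])

true_pos = np.diag(toy_cm)                # correctly classified documents per class
precision = true_pos / toy_cm.sum(axis=0)  # column sums = TP + FP
recall = true_pos / toy_cm.sum(axis=1)     # row sums = TP + FN
f1 = 2 * precision * recall / (precision + recall)
support = toy_cm.sum(axis=1)               # number of true documents per class
accuracy = true_pos.sum() / toy_cm.sum()

print(precision.round(3), recall.round(3), f1.round(3), support, accuracy)
```

Precision is computed along columns (everything predicted as a class) and recall along rows (everything truly in a class), which is why the two can differ a lot for the same class.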

Predict the topic of a few sentences:

In [None]:
docs_new = ['Lungs and heart health', "CPU or GPU?", 'God is love']

predicted = best_model.predict(docs_new)

for doc, category in zip(docs_new, predicted):
    print(doc, "=>", data_test.target_names[category])

## Appendix: Text wrangling and preprocessing

In practice, textual datasets are rarely clean or well structured, and often need some wrangling and preprocessing to be used effectively.

Furthermore, depending on the specific task and context at hand, there are often other tailor-made transformations that can prove useful as an addition to, or a replacement for, normalization. (E.g. the way you would like to handle the `@` symbol might differ between e-mail and social media data. Or there might be specific groups of words that have a similar meaning in general, but whose differentiation is important in a specific context.)

In addition to processing and normalizing the text for vectorization, manual feature extraction can also prove useful. For example, the number of exclamation marks, the number of ALL CAPS WORDS, or the average number of words per sentence might give additional information on the tone or sentiment of written text, depending on the context and model.

Here are a few basic string methods that can come in handy for those scenarios.

In [None]:
import string

In [None]:
text = "Hi there!"
text

In [None]:
text.replace("Hi","Hello")

In [None]:
text.replace("!"," ! ")

In [None]:
text.replace("e","")

In [None]:
text.split(" ")

In [None]:
text.lower()

In [None]:
string.punctuation

The few examples above are only meant to give you some ideas. There are many things you could think of to analyze and extract informative summaries from text data. The pandas `<pd.Series>.apply()` method can come in very handy with custom user-defined functions.

`pd.Series` also exposes a set of string methods for text data through its `.str` accessor. Here are a few dummy examples.
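
As a sketch of such a user-defined function, the cell below computes the hand-crafted features mentioned earlier in this appendix (exclamation marks, all-caps words, words per sentence) and applies it to a small made-up `pd.Series`. The function name `tone_features` and the sentence-splitting rule are assumptions chosen for illustration, not a reference implementation.

```python
import re

import pandas as pd


def tone_features(text: str) -> pd.Series:
    """Compute a few simple hand-crafted features for one document."""
    words = text.split()
    # Crude sentence split on ., ! and ? (an assumption; real text needs a proper splitter)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return pd.Series({
        "n_exclamations": text.count("!"),
        # Ignore one-letter words such as "I"
        "n_all_caps": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "words_per_sentence": len(words) / max(len(sentences), 1),
    })


posts = pd.Series(["Hi there! THIS IS GREAT!", "My dog is cute. I lost my wallet."])
# Applying a function that returns a Series yields one column per feature
features = posts.apply(tone_features)
print(features)
```

The resulting columns could be concatenated with the TF-IDF features before training a classifier.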

In [None]:
import pandas as pd
str_text = pd.DataFrame({"my_text":["Hi there!","My dog is cute.","i lost my wallet"]})
str_text

In [None]:
str_text.my_text.str.capitalize()

In [None]:
str_text.my_text.str.lower()

In [None]:
str_text.my_text.str.contains("y")

In [None]:
str_text.my_text.str.contains("my")

In [None]:
str_text.my_text.str.count("e")

In [None]:
str_text.my_text.str.replace("e","")

Many more examples can be found in the pandas documentation.

For more complicated text processing procedures, one would usually turn to [**regular expressions**](https://en.wikipedia.org/wiki/Regular_expression), a much more powerful tool. The [`re` module](https://docs.python.org/3/library/re.html) provides the base tools to work with regular expressions in Python. Some of the pandas `Series.str` methods above also accept regular expressions.

This goes beyond the scope of this seminar, but if you are interested:
- [Interactive RegEx tutorials](https://regexr.com/)
- [Another tutorial](https://www.w3schools.com/python/python_regex.asp)
- And many more...

In [None]:
import re
re_str = "Hello, how are you Luis?"
# Use a raw string so that "\w" is passed to the regex engine unchanged
re.findall(r"[A-Z]\w+", re_str)
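
Coming back to the `@` example mentioned earlier in this appendix, the sketch below uses `re.sub` to replace e-mail addresses and social-media mentions with two different placeholder tokens. Both patterns and the placeholder names are simplified assumptions for illustration; real e-mail addresses and handles are messier.

```python
import re

message = "Mail me at jane.doe@example.com or ping @jane_doe!"

# Simplified e-mail pattern: local part, "@", then a dotted domain name
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
# Simplified mention pattern: "@" not preceded by a word character, then a handle
mention_pattern = r"(?<!\w)@\w+"

# Replace e-mails first, so that their "@" is not mistaken for a mention
cleaned = re.sub(email_pattern, "<EMAIL>", message)
cleaned = re.sub(mention_pattern, "<USER>", cleaned)
print(cleaned)
```

The order of the two substitutions matters: handling the more specific pattern first keeps the more general one from matching part of it.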

In [None]:
# Refit the best configuration found in section 4.2 directly, without rerunning the grid search
best_model = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=custom_tokenizer, ngram_range=(1,1), lowercase=True, min_df=1e-3, max_df=0.999)),
    ("logistic", LogisticRegression(C=100, multi_class="multinomial", max_iter=int(1e4)))
])
best_model.fit(df_train["text"], df_train["class"])