# Python Workshop
# Session 5: Text Processing and Machine Learning

Stefan Scholz

In this fifth session we will get an overview of **natural language processing**. We will borrow techniques such as **linear regression**, **classification**, **pre-processing**, **tokenization**, **n-grams**, **vectorization** and **word embeddings**. These will enable us to train and evaluate a **text classifier**.

## 5.1 Natural Language Processing

Natural language processing (NLP) is about programming computers to process and analyze natural language data (text and speech).

For Python, there are two main NLP modules:
- [spaCy](https://spacy.io/)
- [NLTK](https://www.nltk.org/)

Both modules implement the following NLP applications (and more), at least, for some languages:
- Named entity recognition (NER)
- Sentiment detection
- Tokenization: splitting a text into words (aka tokens)
- Part-of-speech tagging (POS)
- Lemmatization: mapping a word in text to its base form (aka lemma)
- Syntax parsing
- Semantic representation of words

We will first look into spaCy to explore NLP applications. Later, during text classification, we'll also touch on some aspects of NLP, namely tokenization and the semantic representation of words in a vector space (word embeddings).

### Text Corpus

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Load the full texts we collected in the previous session as the text corpus. Store them in a list called <code>articles</code>.
</div>

### Find the Most Commonly Used Words in Articles

We will now look into the articles themselves: we split the text into words, count word occurrences and generate a [word cloud](https://en.wikipedia.org/wiki/Tag_cloud) to visualize word frequencies or the "importance" of words.

In [None]:
from collections import Counter

In [None]:
# define counter
words = Counter()

# loop over articles
for article in articles:
    # loop over words after splitting articles by spaces
    for word in article.split():
        word = word.lower()
        words[word] += 1

# print most common words
print(words.most_common()[0:20])

This initial attempt shows that we need to skip over the most common function words, which are called [stop words](https://en.wikipedia.org/wiki/Stop_word) in text processing.

Let us count word occurrences again but skip over stop words.

In [None]:
from stop_words import get_stop_words

In [None]:
# get english stop words
stop_words = set(get_stop_words('en'))

def word_counts(articles):
    words = Counter()
    for article in articles:
        for word in article.split():
            word = word.lower()
            if word in stop_words:
                continue
            words[word] += 1
    return words

# print most common words again
print(word_counts(articles).most_common()[0:20])
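
Splitting on whitespace leaves punctuation attached to words, so that "data," and "data" are counted separately. The following is a minimal sketch of a stricter approach, assuming a regular expression that keeps only runs of ASCII letters and apostrophes. The function `count_words`, the sample sentences and the tiny stop word set are made up for illustration and are not the list returned by `get_stop_words`.

```python
import re
from collections import Counter

# hypothetical, tiny stop word set for illustration only
sample_stop_words = {"the", "is", "a", "of", "and"}

def count_words(texts, stop_words):
    """Count lower-cased words, ignoring punctuation and stop words."""
    counts = Counter()
    for text in texts:
        # keep runs of ASCII letters (and apostrophes), drop everything else
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts

# made-up sample sentences
sample_texts = ["The data is big.", "Big data, and the analysis of data!"]
sample_counts = count_words(sample_texts, sample_stop_words)
print(sample_counts.most_common(3))
```

For texts in other languages the pattern would have to be adapted, since it drops letters outside the ASCII range.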

### Word Clouds

Word clouds are generated using the [wordcloud package](https://pypi.org/project/wordcloud/), see also:
- [API docs of the WordCloud class](https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html#wordcloud.WordCloud)
- more [examples](https://amueller.github.io/word_cloud/auto_examples/index.html)

In [None]:
from wordcloud import WordCloud

In [None]:
# create word cloud
wordcloud = WordCloud(width=400, height=400, background_color="lightgrey").generate_from_frequencies(word_counts(articles))
wordcloud.to_image()

### Tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer

Next, let us look into spaCy to explore NLP applications. Besides importing the spaCy library, we also have to download the core modules for the language we want to process. These modules have to be downloaded only once.

 - [en_core_web_sm: English core module](https://spacy.io/models/en)
 - [de_core_news_sm: German core module](https://spacy.io/models/de)

To download the module `en_core_web_sm` you can run the command `python -m spacy download en_core_web_sm`.


In [None]:
import spacy

In [None]:
# load trained pipelines for English
nlp = spacy.load("en_core_web_sm")

# apply pipeline to article
doc = nlp(articles[3])

# inspect resulting article
doc.to_json()

In [None]:
# filter tokens tagged as nouns
list(filter(lambda t: t.pos_ == 'NOUN', doc))

For some NLP applications, spaCy provides nice visualizations, for example, for named entities or syntax trees of dependency parsing.

In [None]:
from spacy import displacy

In [None]:
# display named entities
displacy.render(doc, style="ent")

In [None]:
# display dependencies
displacy.render(doc, style="dep")

## 5.2 Machine Learning

The field of machine learning is too broad to be fully introduced here. Please see [Google's machine learning crash course](https://developers.google.com/machine-learning/crash-course/ml-intro). We'll focus on a couple of examples and introduce ML libraries written in or providing a Python API.

- [scikit-learn](https://scikit-learn.org/): popular Python ML framework covering regression, classification and clustering using various approaches
- [fastText](https://fasttext.cc/): a library for text classification and word representation learning with Python bindings
- [TensorFlow](https://www.tensorflow.org/): ML framework with Python bindings focused on deep neural networks
- [Keras](https://keras.io/): high-level API to TensorFlow
- [PyTorch](https://pytorch.org/): competitor of TensorFlow
- [Transformers](https://huggingface.co/transformers/): library to use, train and adapt [transformer deep learning models](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model))

Before we begin to look into Python ML examples, here are a few key ML terms:
- label: something we want to predict
- feature: variable in the input (e.g. numeric value, words)
- example: data to learn from during training (labeled example) or to predict the label for using a learned model
- model: a model is trained on labeled input data and later used to make predictions ("infer" labels) for unlabeled examples
- regression vs. classification: labels are continuous vs. categorical values
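
To make these terms concrete, here is a minimal sketch in plain Python. The measurements are made up, and a 1-nearest-neighbour rule stands in for the "model"; it is not the classifier used later in this session. Each labeled example pairs two features (height, trunk perimeter) with a categorical label (the species), so this is a classification task.

```python
# hypothetical labeled examples: (height in m, trunk perimeter in cm) -> species label
labeled_examples = [
    ((4.0, 40.0), "apple"),
    ((5.0, 55.0), "apple"),
    ((18.0, 90.0), "birch"),
    ((20.0, 110.0), "birch"),
]

def predict_label(features, examples):
    """Infer the label of the closest labeled example (1-nearest neighbour)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(examples, key=lambda ex: squared_distance(ex[0], features))
    return nearest[1]

# an unlabeled example: only the features are known, the label is inferred
print(predict_label((17.0, 95.0), labeled_examples))
```

If the label were a continuous value instead, for example the trunk perimeter predicted from the height, the task would be a regression.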

### Linear Regression and Classification with Scikit-Learn

As an example for linear regression we take a few trees from the tree cadastre used in [session 2](./2_structured_data.ipynb). We select a small subset of tree species to work with: three species that differ considerably in shape, namely birch (tall, thinner trunk), lime tree (broad, thicker trunk) and apple tree (small).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# load trees as dataframe
trees = pd.read_csv("data/KN_Baumkataster_2020.csv")

# rename columns in dataframe
trees.rename(columns={"hoeheM": "height (m)", "kronendurchmesserM": "treetop diameter (m)", "stammumfangCM": "trunk perimeter (cm)"}, inplace=True)

# define species
species = ["Betula pendula", "Tilia cordata", "Malus domestica"]

# define columns
columns = ["Name_lat", "trunk perimeter (cm)", "treetop diameter (m)", "height (m)"]

# select trees
trees_selected = trees.loc[trees["Name_lat"].isin(species), columns]

# print selected trees
trees_selected

In [None]:
# prepare a 3D plot to show how the trees are placed given the 3 metrics
fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# for loop over each species of trees
for name, idx in trees_selected.groupby("Name_lat").groups.items():
    ax.scatter(*trees_selected.loc[idx, ["trunk perimeter (cm)", "treetop diameter (m)", "height (m)"]].T.values, label=name)

ax.set_xlabel("trunk perimeter (cm)")
ax.set_ylabel("treetop diameter (m)")
ax.set_zlabel("height (m)")

ax.legend()
plt.show()

Let us predict the trunk perimeter and treetop diameter given the height using a linear regression.

See also the scikit-learn documentation about [linear models](https://scikit-learn.org/stable/modules/linear_model.html).

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

In [None]:
# loop over species
for sp in species:
    # select rows by species
    trees_sp = trees_selected.loc[trees_selected["Name_lat"]==sp].dropna()

    # convert metric cells to numpy arrays
    height = trees_sp.loc[:, "height (m)"].values.reshape(-1,1)
    treetop_trunk = trees_sp.loc[:, ["trunk perimeter (cm)", "treetop diameter (m)"]].values.reshape(-1,2)

    # run linear regression
    rgr = LinearRegression()
    rgr.fit(height, treetop_trunk)

    # print the predicted trunk perimeter and treetop diameter for selected heights
    print(sp)
    for h in [2, 5, 10, 15, 20]:
        print(h, rgr.predict(np.array([[h]])))
    print()
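
For a single input feature and a single target, a least-squares linear regression reduces to a closed-form slope and intercept. The sketch below shows this calculation in plain Python. The function `fit_line` is a hypothetical helper and the numbers are made up rather than taken from the tree cadastre.

```python
def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # the fitted line passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept

# made-up heights (m) and trunk perimeters (cm), not taken from the cadastre
toy_heights = [2.0, 5.0, 10.0, 15.0]
toy_perimeters = [25.0, 55.0, 105.0, 155.0]

slope, intercept = fit_line(toy_heights, toy_perimeters)
print(slope, intercept)

# predict the trunk perimeter of a tree that is 20 m high
print(slope * 20.0 + intercept)
```

With two targets, as in the cell above, the same calculation is carried out once per target column.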

In the following, we will use a [neural network classifier](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification) which takes the 3 metric columns as input features and tries to predict the species of a tree. How good are the results, given that only 3 tree species are used? What if we use more or even all species?

In [None]:
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
# split data into train and test data (80% and 20% of the data, respectively)
train, test = train_test_split(trees_selected.dropna(), test_size=0.2)

# create multi-layer perceptron classifier
cls = MLPClassifier(alpha=1, max_iter=1000)

# prepare training data
x_train = train[["trunk perimeter (cm)", "treetop diameter (m)", "height (m)"]].values.reshape(-1,3)
y_train = train[["Name_lat"]].values.reshape(-1,1).ravel()

# prepare testing data
x_test = test[["trunk perimeter (cm)", "treetop diameter (m)", "height (m)"]].values.reshape(-1,3)
y_test = test[["Name_lat"]].values.reshape(-1,1).ravel()

# fit classifier
cls.fit(x_train, y_train)

# print results for predictions on test data
y_predicted = cls.predict(x_test)
print(f"Classification report for classifier {cls}:\n", f"{sklearn.metrics.classification_report(y_test, y_predicted)}\n")

In [None]:
# print a confusion matrix: which tree species are predicted better? which ones are confused more often?
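# use the classifier's own (sorted) label order so that the axis labels match the matrix rows
cm = confusion_matrix(y_test, y_predicted, labels=cls.classes_, normalize="true")
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=cls.classes_)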
disp.plot()
plt.show()
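
With `normalize="true"`, every row of the confusion matrix is divided by the number of true examples of that class, so the diagonal shows the fraction of each species that was recognized correctly. The sketch below reproduces this calculation in plain Python. The function `normalized_confusion` is a hypothetical helper and the labels are made up.

```python
from collections import Counter

def normalized_confusion(y_true, y_pred):
    """Return the sorted class labels and a row-normalized confusion matrix."""
    class_labels = sorted(set(y_true) | set(y_pred))
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = []
    for true_label in class_labels:
        row = [pair_counts[(true_label, pred_label)] for pred_label in class_labels]
        total = sum(row)
        # divide each row by the number of true examples of that class
        matrix.append([count / total if total else 0.0 for count in row])
    return class_labels, matrix

# made-up true and predicted labels
toy_true = ["birch", "birch", "birch", "birch", "apple", "apple"]
toy_pred = ["birch", "birch", "birch", "apple", "apple", "birch"]

toy_labels, toy_matrix = normalized_confusion(toy_true, toy_pred)
print(toy_labels)
for row in toy_matrix:
    print(row)
```

Rows correspond to the true class and columns to the predicted class, both in sorted label order.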

### Text Classification with fastText

[fastText](https://fasttext.cc/) is a software library for text classification and word representation learning. See the fastText tutorials for

- [Text classification](https://fasttext.cc/docs/en/supervised-tutorial.html)
- [Word representation learning](https://fasttext.cc/docs/en/unsupervised-tutorial.html)

We will now follow the [fastText text classification](https://fasttext.cc/docs/en/supervised-tutorial.html) tutorial (cf. documentation of the [Python module "fasttext"](https://pypi.org/project/fasttext/)) to train and apply a text classifier.


We will use the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview) data set. In order to download the data set, you need to register at [Kaggle.com](https://www.kaggle.com/). Note: Kaggle is a good place to look and learn how other researchers and engineers tried to solve various ML problems.

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Sign up for <a href="https://www.kaggle.com/">Kaggle</a>.
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Download the <a href="https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data">Toxic Comment Classification Challenge</a> data set and unpack it into the folder <code>data/kaggle-jigsaw-toxic</code>. You should then see the three files <code>train.csv</code>, <code>test.csv</code> and <code>test_labels.csv</code> in that folder.
</div>

Let us load the data and inspect the variables inside the dataset.

In [None]:
import pandas as pd

In [None]:
# load training comments
comments_train = pd.read_csv("data/kaggle-jigsaw-toxic/train.csv")

# select labels
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# get means of each type of toxicity
comments_train[labels].mean()

In the next step we have to tokenize the comments, similar to what we have already done with the articles. Then we write them into a new file in a format that fastText can read.

In [None]:
import string
from nltk.tokenize import TweetTokenizer

In [None]:
# initialize tokenizer
tweet_tokenizer = TweetTokenizer(reduce_len=True)

def tokenize(text):
    global tweet_tokenizer
    words = tweet_tokenizer.tokenize(text)
    words = filter(lambda w: w != "" and w not in string.punctuation, words)
    words = map(lambda w: w.lower(), words)
    return ' '.join(words)

tokenize("You're a hero! http://example.com/index.html")

In [None]:
# write data to fastText train file
train_file = "data/kaggle-jigsaw-toxic/train.txt"

def write_line_fasttext(fp, row):
    global labels
    line = ''
    for label in labels:
        if row[label] == 1:
            if line:
                line += ' '
            line += '__label__' + label
    if line:
        line += ' '
    else:
        line += '__label__none '
    line += tokenize(row['comment_text'])
    fp.write(line)
    fp.write('\n')

with open(train_file, 'w') as fp:
    comments_train.apply(lambda row: write_line_fasttext(fp, row), axis=1)
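
Each line of the resulting file holds one comment: first all labels that apply, each prefixed with `__label__`, then the tokenized text. The sketch below builds such lines for two made-up comments. The function `format_fasttext_line` is a hypothetical helper that mirrors the logic of `write_line_fasttext` without writing to a file.

```python
def format_fasttext_line(active_labels, text):
    """Build one training line: label prefixes first, then the text."""
    # fall back to a catch-all label when no label applies
    names = active_labels if active_labels else ["none"]
    prefixes = " ".join("__label__" + name for name in names)
    return prefixes + " " + text

# made-up, already tokenized comments
line_with_labels = format_fasttext_line(["toxic", "insult"], "you are a fool")
line_without_labels = format_fasttext_line([], "thanks for the edit")

print(line_with_labels)
print(line_without_labels)
```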

In the next step we can train our own text classifier.

In [None]:
import fasttext

In [None]:
# define train file
train_file = "data/kaggle-jigsaw-toxic/train.txt"

# create classifier with a maximum word n-gram length of 2 and a minimum number of word occurrences of 2
model = fasttext.train_supervised(input=train_file, wordNgrams=2, minCount=2)
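
Setting `wordNgrams=2` lets the classifier use pairs of neighbouring words (bigrams) in addition to single words, which helps with phrases such as "not good". fastText builds its n-grams internally; the sketch below only illustrates what word n-grams are. The function `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# made-up sample sentence, split into tokens
sample_tokens = "this is not good".split()

unigrams = word_ngrams(sample_tokens, 1)
bigrams = word_ngrams(sample_tokens, 2)

print(unigrams)
print(bigrams)
```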

In [None]:
# predict sample comment 1
model.predict(tokenize("This is a well-written article."))

In [None]:
# predict sample comment 2
model.predict(tokenize("Fuck you!"))

In [None]:
# looking into the underlying word embeddings
model.get_nearest_neighbors("idiot", k=20)
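
Nearest neighbours in an embedding space are the words whose vectors point in a similar direction. The sketch below ranks words under the assumption that cosine similarity is the measure of closeness. The three-dimensional vectors are made up; real embeddings have many more dimensions and are learned during training.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up 3-dimensional word vectors for illustration only
toy_vectors = {
    "idiot": [0.9, 0.1, 0.0],
    "moron": [0.8, 0.2, 0.1],
    "article": [0.0, 0.2, 0.9],
}

def nearest_neighbors(word, vectors, k=1):
    """Return the k most similar other words as (similarity, word) pairs."""
    query = vectors[word]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scored, reverse=True)[:k]

toy_neighbors = nearest_neighbors("idiot", toy_vectors, k=2)
print(toy_neighbors)
```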

In [None]:
# save the model
model_file = "data/kaggle-jigsaw-toxic/model.bin"
model.save_model(model_file)

Now that we have trained our model, let us evaluate it on the testing set.

In [None]:
# read test files as data frames
comments_test = pd.read_csv("data/kaggle-jigsaw-toxic/test.csv")
comments_test_labels = pd.read_csv("data/kaggle-jigsaw-toxic/test_labels.csv")

# join both tables
comments_test = comments_test.merge(comments_test_labels, on="id")

# skip rows not labelled / not used
comments_test = comments_test.loc[comments_test['toxic'] != -1]

# write test set for fastText
test_file = "data/kaggle-jigsaw-toxic/test.txt"
with open(test_file, 'w') as fp:
    comments_test.apply(lambda row: write_line_fasttext(fp, row), axis=1)

In [None]:
# test model on test file (returns support, precision, recall)
model.test(test_file)

In [None]:
# test model per label
res_per_label = model.test_label(test_file)

for label in res_per_label.items():
    print(label)
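
Precision is the share of predictions of a label that are correct, and recall is the share of true occurrences of a label that were found. The sketch below computes both for one label from made-up predictions. It assumes exactly one label per example, whereas the Kaggle comments can carry several labels, so it illustrates the definitions rather than fastText's exact evaluation.

```python
def precision_recall(y_true, y_pred, label):
    """Return precision and recall of the predictions for one label."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if t == label and p == label)
    false_positives = sum(1 for t, p in pairs if t != label and p == label)
    false_negatives = sum(1 for t, p in pairs if t == label and p != label)
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# made-up true and predicted labels
toy_true_labels = ["toxic", "none", "toxic", "none", "toxic"]
toy_pred_labels = ["toxic", "toxic", "none", "none", "toxic"]

toxic_precision, toxic_recall = precision_recall(toy_true_labels, toy_pred_labels, "toxic")
print(toxic_precision, toxic_recall)
```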

### Transformer Language Models and the Transformers Library

[Transformer language models](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) are used to address a range of NLP tasks, including text classification, text generation and translation. [Hugging Face's transformers library](https://huggingface.co/transformers/) provides a powerful and easy-to-learn interface to use transformers. Hugging Face also offers a large repository of transformer models shared by a growing community of researchers and organizations. For details beyond the examples below, see the [transformers course](https://huggingface.co/course).
  
Transformers can be "fine-tuned" to a specific task, see [training of transformers](https://huggingface.co/transformers/training.html). Adding a task-specific head to a transformer pre-trained on large amounts of training data (usually 100 GBs or even TBs of text) saves resources spent on training and can overcome the problem of not having enough training data. Manually labelling training data is expensive and naturally puts a limit on the amount of training data. But even if the vocabulary in the training data is limited, there's a good chance that the pre-trained transformer has seen the unknown words in the huge data used for pre-training.

In [None]:
from transformers import pipeline

In [None]:
# create pipeline
p = pipeline("fill-mask", model="bert-base-cased")

In [None]:
# print sequences (with filled mask)
for s in p("He works as a [MASK] in a clinic."):
    print(s)

In [None]:
# print sequences (with filled mask)
for s in p("He works as a [MASK] at the Zeppelin university."):
    print(s)

To see which other tasks the `pipeline` function supports, just call the function `help` on it.

In [None]:
help(pipeline)

Let us use a pipeline to perform a sentiment analysis.

In [None]:
# create a pipeline for sentiment analysis
p = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# get the sentiment for a sample sentence
p("I'm happy.")

In [None]:
# get the sentiment for a sample sentence 2
p("I'm sad.")

Let us use a pipeline to translate a text from German to English.

In [None]:
# create pipeline for translation
p = pipeline("translation", model="facebook/wmt19-de-en")

# get the translation
p("""Nicht nur unterschiedliche Berechnungen bereiten Kopfzerbrechen.
  Bei der Eigenwahrnehmung zeigt sich: In Deutschland gibt es massive
  Missverständnisse über Ausmaß und Art von Ungleichheit.""")