# Basics of text processing

### Natural Language Processing and Information Extraction, 2025WS
10/10/2025

Gábor Recski

## In this lecture
- Regular Expressions (SLP 2.7)
- Text segmentation and normalization (SLP 2.2, 2.5, 2.7, old SLP)
   - sentence segmentation (SLP 2.7)
   - tokenization (SLP 2.5)
   - lemmatization, stemming (old SLP)
   - decompounding, morphology (SLP 2.2, old SLP)
   - the CoNLL format (old SLP)
   
[SLP Ch. 2](https://web.stanford.edu/~jurafsky/slp3/2.pdf), [SLP 2025 Jan](https://web.stanford.edu/~jurafsky/slp3/old_jan25/), [SLP 2024 Aug](https://web.stanford.edu/~jurafsky/slp3/old_aug24/)

## Import dependencies

In [None]:
import json
import re
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import stanza

## Download models

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data expected by recent NLTK releases
nltk.download('stopwords')
stanza.download('en')
stanza.download('de')

## Regular expressions

- Pattern matching
- Substitution and grouping

### Pattern matching

We use a dataset of ca. 24K Wikipedia articles about movies from after 2000 (created for the [GIR exercise](https://github.com/TUW-GIR/exercise-2023WS-template)).

In [None]:
!wget -nc -O data/wp_movie_data.jsonl https://tucloud.tuwien.ac.at/public.php/dav/files/A4YFbg3PD4pXMs4/?accept=zip

In [None]:
with open("data/wp_movie_data.jsonl") as f:
    movies = {item['title']: item['text'] for item in (json.loads(line) for line in f)}

In [None]:
len(movies)

In [None]:
def search_title(pattern, data, n=10):
    return sorted(title for title in data.keys() if re.match(pattern, title))[:n]

#### Which movies have the number 7 in their titles?

In [None]:
def search_title(pattern, data):
    # re.search finds the pattern anywhere in the title; re.match only at its start
    return sorted(title for title in data if re.search(pattern, title))

In [None]:
search_title('7', movies)
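
The choice between `re.match` and `re.search` matters for a question like this one. Below is a minimal sketch of the difference on a few made-up titles (the list `toy_titles` is an assumption for illustration, not part of the dataset):

```python
import re

# made-up titles for illustration, not taken from the dataset
toy_titles = ["7 Days in Entebbe", "Se7en", "The Trial of the Chicago 7", "Up"]

# re.match only succeeds if the pattern matches at the very start of the string
match_hits = [t for t in toy_titles if re.match("7", t)]
# re.search scans the whole string and succeeds on the first match anywhere
search_hits = [t for t in toy_titles if re.search("7", t)]

print(match_hits)
print(search_hits)
```

Only the title that begins with `7` survives `re.match`, while `re.search` also finds the digit inside or at the end of a title.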

#### Limit it to those with 7 as a word

In [None]:
search_title(r'(\s|^)7(\s|$)', movies)

In [None]:
search_title(r'(\s|^)(7|[sS]even)(\s|$)', movies)
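
An alternative worth knowing is the word-boundary anchor `\b`, which also accepts punctuation next to the word. The sketch below compares the two variants on made-up titles (`toy_titles` and both pattern names are assumptions for illustration):

```python
import re

# made-up titles for illustration, not taken from the dataset
toy_titles = ["Se7en", "7 Days", "Area 7: Uprising", "The Magnificent Seven", "Seventeen Again"]

whitespace_patt = r"(\s|^)(7|[sS]even)(\s|$)"  # the pattern used above
boundary_patt = r"\b(7|[sS]even)\b"            # same idea with word boundaries

whitespace_hits = [t for t in toy_titles if re.search(whitespace_patt, t)]
boundary_hits = [t for t in toy_titles if re.search(boundary_patt, t)]

print(whitespace_hits)
print(boundary_hits)
```

The whitespace version misses `Area 7: Uprising` because the digit is followed by a colon; which behaviour is preferable depends on the task.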

#### Let's try to find movies involving Aaron Sorkin

In [None]:
def search_text(pattern, data, r=50):
    for title, text in data.items():
        match = re.search(pattern, text)
        if match is None:
            continue
        i, j = match.span()
        # show r characters of context on both sides of the match
        start = max(i - r, 0)
        end = j + r
        print(f"{title}\n\n...{text[start:end]}...\n\n")


In [None]:
search_text('Aaron Sorkin', movies)

#### Could we find all names in all texts?

In [None]:
def count_patterns(pattern, data):
    return Counter(match for title, text in data.items() for match in re.findall(pattern, text)).most_common()

In [None]:
name_pattern = '[A-Z][a-z]+(?: [A-Z][a-z]+)+'

In [None]:
count_patterns(name_pattern, movies)
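
The `(?: ... )` in the name pattern is a non-capturing group, and it matters here because `re.findall` returns the contents of capturing groups whenever a pattern has any. A small sketch on a made-up sentence (the variable names are assumptions for illustration):

```python
import re

# made-up sentence for illustration
sample = "The film was written by Aaron Sorkin and stars Sacha Baron Cohen."

capturing = "[A-Z][a-z]+( [A-Z][a-z]+)+"
non_capturing = "[A-Z][a-z]+(?: [A-Z][a-z]+)+"

# with a capturing group, findall returns only the last repetition of the group
capturing_result = re.findall(capturing, sample)
# with a non-capturing group, findall returns the full matched strings
non_capturing_result = re.findall(non_capturing, sample)

print(capturing_result)
print(non_capturing_result)
```

Note that the pattern is only a rough heuristic: it matches any sequence of two or more capitalized words, not just personal names.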

#### Let's reuse this pattern

In [None]:
count_patterns('starring ' + name_pattern, movies)

In [None]:
count_patterns(name_pattern+' franchise', movies)

In [None]:
count_patterns('Academy Award for ' + name_pattern, movies)

### Substitution and groups

Regexes are not just for pattern matching; they are also a powerful tool for text manipulation.

In [None]:
with open('data/tww_s1_e1.txt') as f:
    text = f.read()

In [None]:
print(text)

Let's get the structure of this document, step by step

In [None]:
match = re.search('(.*)\nACT ONE', text, re.S)
print(match)

In [None]:
header = match.group(1).strip()
print(header)

In [None]:
footer = re.search(r'THE END\n\* \* \*(.*)', text, re.S).group(1).strip()
print(footer)

We can do all this with a single regex

In [None]:
header, body, footer = re.search(r'(.*)\n(ACT ONE.*THE END)\n\* \* \*(.*)', text, re.S).groups()

In [None]:
print(header)

In [None]:
print(footer)

In [None]:
print(body)
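
The `re.S` flag is what makes the single-regex version work. The sketch below shows its effect on a made-up miniature script (`toy_script` and the `toy_` variables are assumptions for illustration, chosen so that the notebook's own `header`, `body` and `footer` are not overwritten):

```python
import re

# a made-up miniature "script" with the same overall layout
toy_script = "TITLE\nby Someone\nACT ONE\nA scene.\nTHE END\n* * *\nTranscribed by a fan."
pattern = r"(.*)\n(ACT ONE.*THE END)\n\* \* \*(.*)"

# without re.S the dot stops at newlines, so the pattern cannot span several lines
no_flag = re.search(pattern, toy_script)
print(no_flag)

# with re.S the three groups capture header, body and footer
toy_header, toy_body, toy_footer = re.search(pattern, toy_script, re.S).groups()
print(repr(toy_header), repr(toy_body), repr(toy_footer))
```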

Now let's get the scenes!

In [None]:
SCENE_SEP_PATT = r"\n(?:CUT TO:|ACT [A-Z]*)"

In [None]:
scenes = re.split(SCENE_SEP_PATT, body)

In [None]:
len(scenes)

In [None]:
print('\n\n***\n\n'.join(f'Scene {i}:\n{scenes[i].strip()[:50]}...' for i in range(5)))
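
One detail of `re.split` is easy to miss: separators are discarded only as long as the pattern has no capturing group. A minimal sketch on a made-up body (`toy_body` is an assumption for illustration):

```python
import re

# made-up body text for illustration
toy_body = "ACT ONE\nINT. OFFICE\nHello.\nCUT TO:\nEXT. STREET\nBye.\nACT TWO\nINT. BAR\nCheers."

# non-capturing group: the separators are dropped from the result
dropped = re.split(r"\n(?:CUT TO:|ACT [A-Z]*)", toy_body)
# capturing group: the separators are returned between the pieces
kept = re.split(r"\n(CUT TO:|ACT [A-Z]*)", toy_body)

print(len(dropped), len(kept))
print(kept[1], kept[3])
```

Keeping the separators can be useful if the act boundaries should be preserved in the structured output.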

Now let's get the structure of the dialogue!

In [None]:
print(scenes[2])

In [None]:
LINE_PATT = r"\n([A-Z.\[\] ]+)\n(.*?)\n"

In [None]:
utterances = re.findall(LINE_PATT, scenes[0], re.S)

In [None]:
utterances[:10]

In [None]:
script = {
    "header": header,
    "scenes": [
        {"lines": [
            {
                "char": character,
                "text": text
            }
            for character, text in re.findall(LINE_PATT, scene)
        ]
        }
        for scene in re.split(SCENE_SEP_PATT, body)
        ],
    "footer": footer
}

In [None]:
script['scenes'][2]

Let's use this data for something. Let's get a list of characters by frequency.

In [None]:
Counter(line['char'] for scene in script['scenes'] for line in scene['lines']).most_common(10)

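Regular expressions are surprisingly powerful. With the right implementation they are also very fast: a pattern can be compiled into a [finite state automaton (FSA)](https://en.wikipedia.org/wiki/Finite-state_machine) that reads each input character exactly once, so matching time grows only linearly with the length of the text. This works because regular expressions and FSAs are equivalent: every regular expression defines a [regular language](https://en.wikipedia.org/wiki/Regular_language), and the same language can also be described by a [regular grammar](https://en.wikipedia.org/wiki/Regular_grammar). Note that Python's `re` module uses a backtracking matcher and supports extensions such as backreferences, so it does not always offer this guarantee.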

![re_xkcd](media/re_xkcd.png)([XKCD #208](https://xkcd.com/208/))
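
To make the link to finite state automata concrete, here is a minimal sketch of a hand-written deterministic automaton for the toy pattern `(ab)+`. The pattern, the state numbering and the helper `dfa_accepts` are all assumptions chosen for illustration; real regex engines construct such tables automatically.

```python
import re

# Transition table of a DFA for the regular expression (ab)+
# states: 0 = start, 1 = just read 'a', 2 = just read 'ab' (accepting)
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
ACCEPTING = {2}

def dfa_accepts(string):
    """Run the automaton over the string, looking at each character once."""
    state = 0
    for char in string:
        state = TRANSITIONS.get((state, char))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING

# the automaton and the regex agree on every test string
dfa_results = {s: dfa_accepts(s) for s in ["ab", "abab", "", "aba", "abba"]}
print(dfa_results)
```

The loop does a constant amount of work per character, which is where the linear running time comes from.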

## Text segmentation

### Splitting text into sentences

In [None]:
text2 = "'Of course it's only because Tom isn't home,' said Mrs. Parsons vaguely."

#### Naive: split on `.`, `!`, `?`, etc.

In [None]:
re.split('[.!?]', text2)

#### Better: use language-specific list of abbreviation words, collocations, etc.

In [None]:
nltk.sent_tokenize(text2)

Custom lists of patterns are often necessary for **special domains**. 

_An die Stelle der Landesgesetze vom 17. Jänner 1883, n.ö.L.G. u. V.Bl. Nr. 35, vom 26. Dezember 1890, n.ö.L.G. u. V.Bl. Nr. 48, vom 17. Juni 1920 n.ö.L.G. u. V.Bl. Nr. 547, vom 4. November 1920 n.ö.L.G. u. V.Bl. Nr. 808, und vom 9. Dezember 1927, L.G.Bl. für Wien Nr. 1 ex 1928, die, soweit dieses Gesetz nichts anderes bestimmt, zugleich ihre Wirksamkeit verlieren, hat die nachfolgende Bauordnung zu treten._

[Bauordnung für Wien](https://www.ris.bka.gv.at/Dokumente/Landesnormen/LWI40000064/LWI40000064.html)

In [None]:
text3 = "An die Stelle der Landesgesetze vom 17. Jänner 1883, n.ö.L.G. u. V.Bl. Nr. 35, vom 26. Dezember 1890, n.ö.L.G. u. V.Bl. Nr. 48, vom 17. Juni 1920 n.ö.L.G. u. V.Bl. Nr. 547, vom 4. November 1920 n.ö.L.G. u. V.Bl. Nr. 808, und vom 9. Dezember 1927, L.G.Bl. für Wien Nr. 1 ex 1928, die, soweit dieses Gesetz nichts anderes bestimmt, zugleich ihre Wirksamkeit verlieren, hat die nachfolgende Bauordnung zu treten."

In [None]:
print(text3)

In [None]:
nltk.sent_tokenize(text3, language='german')

In [None]:
nltk.sent_tokenize("17. Jänner", language='german')

In [None]:
nltk.sent_tokenize("17. Januar", language='german')
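
The idea of a custom abbreviation list can be sketched without any library. The code below is not how NLTK works internally; it is a deliberately simple splitter, and the abbreviation set, the helper `split_sentences` and the example sentence are all assumptions made up for illustration.

```python
import re

# hypothetical, hand-picked abbreviation list; a real one would be much longer
TOY_ABBREVIATIONS = {"L.G.Bl.", "V.Bl.", "Nr.", "u."}

def split_sentences(text, abbreviations=TOY_ABBREVIATIONS):
    """Split on ., ! or ? followed by whitespace, unless the last token is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        if candidate.split()[-1] in abbreviations:
            continue  # not a sentence boundary, keep scanning
        sentences.append(candidate)
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # whatever follows the last boundary
    return sentences

# made-up example in the style of the legal text above
toy_law = "Siehe L.G.Bl. Nr. 1 ex 1928. Die Bauordnung tritt in Kraft."
toy_sentences = split_sentences(toy_law)
print(toy_sentences)
```

Dates such as `17. Jänner` would still be split by this sketch, since `17.` is not in the list; handling ordinals would need an additional rule.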

**NB: most real-world NLP applications are in special domains!**

### Tokenization - splitting sentences into words

#### Naive approach: split on whitespace

In [None]:
text2.split()

#### Better: separate punctuation marks

In [None]:
re.findall(r'(\w+|[^\w\s]+)', text2)[:30]

#### Best: add some language-specific conventions:

In [None]:
nltk.word_tokenize(text2)

In [None]:
nltk.word_tokenize("O'Brian")
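
A convention such as "keep apostrophes inside a word" can also be expressed as a regex. The following is a minimal sketch, not a replacement for `nltk.word_tokenize`, which follows different conventions (for example, it splits `isn't` into `is` and `n't`). The pattern name `TOKEN_PATT` and the example sentence are assumptions for illustration.

```python
import re

# word characters, optionally with internal apostrophes, or a single punctuation mark
TOKEN_PATT = r"\w+(?:'\w+)*|[^\w\s]"

toy_tokens = re.findall(TOKEN_PATT, "O'Brian isn't home, is he?")
print(toy_tokens)
```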

## Text normalization

#### What are the most common words in some sample of text?

In [None]:
movie_sample = {title: text for i, (title, text) in enumerate(movies.items()) if i % 100 == 0}

In [None]:
sorted(movie_sample.keys())

In [None]:
words = [word for text in movie_sample.values() for word in nltk.word_tokenize(text)]

In [None]:
words[:10]

In [None]:
len(words)

In [None]:
Counter(words).most_common(10)

Let's get rid of punctuation

In [None]:
words = [word for word in words if re.match(r'\w', word)]

In [None]:
len(words)

In [None]:
Counter(words).most_common(10)

Filtering common function words is called **stopword removal**.

In [None]:
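from nltk.corpus import stopwords as nltk_stopwords
# keep the corpus reader under its own name so that re-running this cell still works
stopwords = set(nltk_stopwords.words('english'))
print(stopwords)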

In [None]:
words = [word for word in words if word.lower() not in stopwords]

In [None]:
Counter(words).most_common(20)
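
The effect of stopword removal is easy to see on a single sentence. A minimal sketch with a tiny hand-made stopword list (the list, the sentence and all `toy_` names are assumptions for illustration; NLTK's English list is much longer):

```python
import re
from collections import Counter

# tiny hand-made stopword list, for illustration only
toy_stopwords = {"the", "a", "of", "and", "is", "in"}
toy_sentence = "The trial of the Chicago Seven is the subject of the film."

toy_words = re.findall(r"\w+", toy_sentence)
toy_content = [w for w in toy_words if w.lower() not in toy_stopwords]

# before filtering, the most frequent item is a function word
top_before = Counter(w.lower() for w in toy_words).most_common(1)
print(top_before)
print(toy_content)
```

Lowercasing before the lookup matters: without it, the sentence-initial `The` would slip through the filter.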

In [None]:
char_counter = Counter(line['char'] for scene in script['scenes'] for line in scene['lines'])

In [None]:
char_counter.most_common(5)

In [None]:
stopwords.add("n't")

In [None]:
from collections import defaultdict
word_counter = defaultdict(Counter)
for scene in script['scenes']:
    for line in scene['lines']:
        for word in nltk.word_tokenize(line['text']):
            if re.match(r'\w', word) and word.lower() not in stopwords:
                word_counter[line['char']][word.lower()] += 1

In [None]:
for char, _ in char_counter.most_common(5):
    print(char)
    print(word_counter[char].most_common(10))

### Lemmatization and stemming

Words like _say_, _says_, and _said_ are all different **word forms** of the same **lemma**. Grouping them together can be useful in many applications. 

**Stemming** is the reduction of words to a common prefix, using simple rules that only work some of the time:

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
for word in ('dogs', 'foxes', 'jumps'):
    print(stemmer.stem(word))

In [None]:
for word in ('say', 'says', 'said'):
    print(stemmer.stem(word))

In [None]:
for word in ('he', 'his', 'him'):
    print(stemmer.stem(word))

In [None]:
stemmer.stem('dogs')
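
To see why simple rules only work some of the time, consider a toy suffix stripper. This is not the Porter algorithm, which uses a larger and more careful rule set; the suffix list and the helper `toy_stem` are assumptions made up for illustration.

```python
# hypothetical suffix list, tried in this order
TOY_SUFFIXES = ("es", "s", "ing", "ed")

def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for toy_word in ("dogs", "foxes", "says", "said", "horses"):
    print(toy_word, "->", toy_stem(toy_word))
```

The rules handle `dogs`, `foxes` and `says`, but leave the irregular form `said` untouched and cut `horses` down to `hors`.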

**Lemmatization** is the mapping of word forms to their lemma, using either a dictionary of word forms, a grammar of how words are formed (a **morphology**), or both.
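
The combination of a dictionary and a rule can be sketched in a few lines. The dictionary entries, the single fallback rule and the helper `toy_lemmatize` are assumptions for illustration; a real lemmatizer such as the one used below is trained on annotated data and also takes the part of speech into account.

```python
# hypothetical dictionary of irregular word forms
TOY_LEMMA_DICT = {"said": "say", "says": "say", "was": "be", "were": "be", "mice": "mouse"}

def toy_lemmatize(word_form):
    """Look the form up in the dictionary, else fall back to one simple rule."""
    form = word_form.lower()
    if form in TOY_LEMMA_DICT:
        return TOY_LEMMA_DICT[form]
    # fallback "morphology": strip a final -s from longer words
    if form.endswith("s") and len(form) > 3:
        return form[:-1]
    return form

for toy_form in ("Said", "were", "mice", "dogs", "bus"):
    print(toy_form, "->", toy_lemmatize(toy_form))
```

Unlike the stemmer, the dictionary lookup maps irregular forms such as `said` and `mice` to their lemma, at the cost of having to list them.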

In [None]:
nlp = stanza.Pipeline('en', processors='tokenize,lemma,pos')

In [None]:
text = movies["The Trial of the Chicago 7"]

In [None]:
doc = nlp(text)

In [None]:
for sentence in doc.sentences[:5]:
    for word in sentence.words:
        print(word.text + '\t' + word.lemma)
    print()

**QUESTION: Consider lemmas that could be reduced further, e.g. _historical_ or _protester_. Why aren't they?**

Now we can count lemmas

In [None]:
Counter(
    word.lemma for sentence in doc.sentences for word in sentence.words
    if word.lemma.lower() not in stopwords and re.match(r'\w', word.lemma)).most_common(20)

The full analysis of how a word form is built from its lemma is known as **morphological analysis**.

In [None]:
for sentence in doc.sentences[:5]:
    for word in sentence.words:
        print('\t'.join([word.text, word.lemma, word.upos, word.feats if word.feats else '']))
    print()

A special case of lemmatization is **decompounding**: recognizing multiple lemmas in a single word.

In [None]:
nlp('roller-coaster')

In [None]:
nlp('wastebasket')

In [None]:
nlp('anti-Vietnam')

In [None]:
nlp('underrated')

In [None]:
nlp('overwhelmed')

For English you might say that this is good enough... but _some languages_ allow forming compounds on the fly...

In [None]:
nlp_de = stanza.Pipeline('de', processors='tokenize,lemma,pos')

In [None]:
nlp_de('Kraftfahrzeug-Haftpflichtversicherung')

In [None]:
nlp_de('Nahrungsmittelunverträglichkeit')

In [None]:
nlp_de('Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz')

See also [https://de.wikipedia.org/wiki/Rindfleischetikettierungs%C3%BCberwachungsaufgaben%C3%BCbertragungsgesetz](https://de.wikipedia.org/wiki/Rindfleischetikettierungs%C3%BCberwachungsaufgaben%C3%BCbertragungsgesetz).

In [None]:
nlp_de('Kassenidentifikationsnummer')

In [None]:
nlp_de('Klimabonus')

There is no good generic solution and no standard tool. There are some unsupervised approaches like [SECOS](https://github.com/riedlma/SECOS) and [CharSplit](https://github.com/dtuggener/CharSplit), and there are also full-fledged morphological analyzers that might work, like [SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/) and its extensions [zmorge](https://pub.cl.uzh.ch/users/sennrich/zmorge/) and [SMORLemma](https://github.com/rsennrich/SMORLemma).
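
To illustrate what a decompounder has to do, here is a minimal lexicon-based sketch. It is not the method of any of the tools above: the helper `split_compound`, the tiny lexicon and the list of linking elements are assumptions for illustration, and the function simply returns the first split it finds, preferring long first parts.

```python
def split_compound(word, lexicon, linkers=("", "s", "n")):
    """Return one split of `word` into lexicon entries, or None if there is none."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # try the longest known prefix first, optionally followed by a linking element
    for end in range(len(word) - 1, 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        for linker in linkers:
            if word.startswith(linker, end):
                rest = split_compound(word[end + len(linker):], lexicon, linkers)
                if rest is not None:
                    return [head] + rest
    return None

# tiny hand-made lexicon, for illustration only
toy_lexicon = {"klima", "bonus", "kasse", "identifikation", "nummer"}

print(split_compound("Klimabonus", toy_lexicon))
print(split_compound("Kassenidentifikationsnummer", toy_lexicon))
print(split_compound("Wastebasket", toy_lexicon))
```

With a realistic lexicon many words have several possible splits, and choosing among them is the hard part that such a greedy sketch ignores.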

## Text preprocessing in NLP: best practices

Text preprocessing steps such as those above are critical components of most NLP applications. Very often they are also a main bottleneck.

**Preprocessing for segmentation and normalization should be a separate component in almost any NLP application.**

When storing preprocessed text, the format should ensure **reproducibility** and be **platform-independent**. It should also be easy to **inspect** and allow for **version control**.

### The CoNLL format

In [None]:
from stanza.utils.conll import CoNLL

CoNLL.write_doc2conll(doc, "data/output.conllu")

In [None]:
with open('data/output.conllu') as f:
    print(f.read())

This format can be processed by several NLP libraries (Stanza, spaCy, NLTK, etc.).

In [None]:
!spacy convert data/output.conllu -c conllu data/

There is also a Python library, `conllu`, for reading such files.

In [None]:
import conllu

In [None]:
with open("data/output.conllu") as f:
    data = conllu.parse(f.read())

In [None]:
data[0][4]
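
Because CoNLL-U is plain text, it can also be read without any library: each token is one line with ten tab-separated fields, comment lines start with `#`, and sentences are separated by blank lines. The sketch below is a simplified reader for illustration only. The helper `parse_conllu`, the lowercase field names and the hand-written sample are assumptions, and special cases such as multiword token ranges are not treated separately.

```python
# the ten CoNLL-U columns, in order (lowercase names chosen for this sketch)
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(conllu_text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if line.startswith("#"):  # sentence-level comment, e.g. "# text = ..."
            continue
        if not line.strip():      # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

# hand-written sample, not produced by the pipeline above
toy_conllu = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

toy_parsed = parse_conllu(toy_conllu)
print(toy_parsed[0][0])
```

The one-token-per-line layout is also what makes the files easy to inspect and to diff under version control.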

**For Milestone 1 of the Project exercise your team should gather the dataset(s) they are planning to use, perform standard preprocessing steps and INSPECT THE RESULTS to uncover potential issues that need to be handled. Finally, datasets should be stored in CoNLL-U format and pushed to the repository together with a short documentation of how the data was created.**