# Basics of text processing

### Natural Language Processing and Information Extraction, 2025WS
10/10/2025

Gábor Recski

## In this lecture
- Regular Expressions (SLP 2.7)
- Text segmentation and normalization (SLP 2.2, 2.5, 2.7, old SLP)
   - sentence segmentation (SLP 2.7)
   - tokenization (SLP 2.5)
   - lemmatization, stemming (old SLP)
   - decompounding, morphology (SLP 2.2, old SLP)
   - the CoNLL format (old SLP)
   
[SLP Ch. 2](https://web.stanford.edu/~jurafsky/slp3/2.pdf), [SLP 2025 Jan](https://web.stanford.edu/~jurafsky/slp3/old_jan25/), [SLP 2024 Aug](https://web.stanford.edu/~jurafsky/slp3/old_aug24/)

## Import dependencies

In [1]:
import json
import re
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import stanza

## Download models

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
stanza.download('en')
stanza.download('de')

[nltk_data] Downloading package punkt to /home/recski/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/recski/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



2025-10-09 09:07:50 INFO: Downloading default packages for language: en (English) ...





2025-10-09 09:07:51 INFO: File exists: /home/recski/stanza_resources/en/default.zip
2025-10-09 09:07:56 INFO: Finished downloading models and saved to /home/recski/stanza_resources.



2025-10-09 09:07:56 INFO: Downloading default packages for language: de (German) ...





2025-10-09 09:07:57 INFO: File exists: /home/recski/stanza_resources/de/default.zip
2025-10-09 09:08:02 INFO: Finished downloading models and saved to /home/recski/stanza_resources.


## Regular expressions

- Pattern matching
- Substitution and grouping

### Pattern matching

We use a dataset of ca. 24K Wikipedia articles about movies released after 2000 (created for the [GIR exercise](https://github.com/TUW-GIR/exercise-2023WS-template)).

In [3]:
!wget -nc -O data/wp_movie_data.jsonl https://tucloud.tuwien.ac.at/public.php/dav/files/A4YFbg3PD4pXMs4/?accept=zip

File ‘data/wp_movie_data.jsonl’ already there; not retrieving.


In [4]:
with open("data/wp_movie_data.jsonl") as f:
    movies = {item['title']: item['text'] for item in (json.loads(line) for line in f)}

In [5]:
len(movies)

24378

In [6]:
def search_title(pattern, data, n=10):
    return sorted(title for title in data.keys() if re.match(pattern, title))[:n]

#### Which movies have the number 7 in their titles?

In [7]:
def search_title(pattern, data):
    # re.search finds the pattern anywhere in the title, not only at the start
    return sorted(title for title in data if re.search(pattern, title))
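The two definitions of `search_title` differ in one call: `re.match` versus `re.search`. A minimal sketch of the difference, using three titles from the dataset as a hand-picked sample:

```python
import re

titles = ["7 Days in Syria", "Furious 7", "127 Hours"]

# re.match only succeeds if the pattern matches at the very start of the string
starts_with_7 = [t for t in titles if re.match("7", t)]

# re.search succeeds if the pattern matches anywhere in the string
contains_7 = [t for t in titles if re.search("7", t)]

print(starts_with_7)
print(contains_7)
```

With `re.match`, only the title that begins with `7` is kept, which is why the search-based version is the one used below.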

In [8]:
search_title('7', movies)

["'71 (film)",
 "'76 (film)",
 '127 Hours',
 '17 Again (film)',
 '17 Blocks',
 '17 Miracles',
 '1745 (film)',
 '1917 (2019 film)',
 '1922 (2017 film)',
 '1971 (2014 film)',
 '2067 (film)',
 '247°F',
 '27 Dresses',
 '27 Guns',
 '27, Memory Lane',
 '2:22 (2017 film)',
 '2:37',
 '3:10 to Yuma (2007 film)',
 '4 (2007 film)',
 '41 (2007 film)',
 '47 Meters Down',
 '47 Meters Down: Uncaged',
 '47 Ronin (2013 film)',
 '5 to 7',
 '5-25-77',
 '537 Votes',
 '6 Days (2017 film)',
 '7 Chinese Brothers',
 '7 Days (2021 film)',
 '7 Days in Syria',
 '7 Days to Vegas',
 '7 Girls',
 '7 Letters',
 '7 Lives',
 '7 Minutes',
 '7 Notes to Infinity',
 '7 Seconds (film)',
 '7 Splinters in Time',
 '7/7 Ripple Effect',
 '700 Sundays (film)',
 '7500 (film)',
 '759: Boy Scouts of Harlem',
 '9/11 (2017 film)',
 "A Midsummer Night's Dream (2017 film)",
 'Academy (2007 film)',
 'After Sex (2007 film)',
 'Aftermath (2017 film)',
 'Allure (2017 film)',
 'Alter Ego (2017 film)',
 'America the Beautiful (2007 film)',
 '

#### Limit it to those with 7 as a word

In [9]:
search_title(r'(\s|^)7(\s|$)', movies)

['5 to 7',
 '7 Chinese Brothers',
 '7 Days (2021 film)',
 '7 Days in Syria',
 '7 Days to Vegas',
 '7 Girls',
 '7 Letters',
 '7 Lives',
 '7 Minutes',
 '7 Notes to Infinity',
 '7 Seconds (film)',
 '7 Splinters in Time',
 'Fested: A Journey to Fest 7',
 'Furious 7',
 'MS Slavic 7',
 'The 7 Adventures of Sinbad',
 'The Magic 7',
 'The Trial of the Chicago 7']
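An alternative to delimiting the word with whitespace is the word-boundary anchor `\b`. The sketch below compares the two on a few titles from the dataset; it is only meant to show that the choice changes the results, since `\b` also treats punctuation such as `/` as a boundary.

```python
import re

titles = ["5 to 7", "Furious 7", "7/7 Ripple Effect", "127 Hours", "5-25-77"]

# 7 must be surrounded by whitespace or the string edges
whitespace_delimited = [t for t in titles if re.search(r"(\s|^)7(\s|$)", t)]

# 7 must be surrounded by word boundaries (any non-word character counts)
word_boundary = [t for t in titles if re.search(r"\b7\b", t)]

print(whitespace_delimited)
print(word_boundary)
```

Which behaviour is preferable depends on whether a title like `7/7 Ripple Effect` should count as containing the word 7.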

In [10]:
search_title(r'(\s|^)(7|[sS]even)(\s|$)', movies)

['5 to 7',
 '7 Chinese Brothers',
 '7 Days (2021 film)',
 '7 Days in Syria',
 '7 Days to Vegas',
 '7 Girls',
 '7 Letters',
 '7 Lives',
 '7 Minutes',
 '7 Notes to Infinity',
 '7 Seconds (film)',
 '7 Splinters in Time',
 'Feast of the Seven Fishes (film)',
 'Fested: A Journey to Fest 7',
 'Furious 7',
 'Gathering of Heroes: Legend of the Seven Swords',
 'MS Slavic 7',
 'Original Sin – The Seven Sins',
 'Patient Seven',
 'Red Shoes and the Seven Dwarfs',
 'Seven (2019 Nigerian film)',
 'Seven Below',
 'Seven Days in Utopia',
 'Seven Days of Grace (2006 film)',
 'Seven Pounds',
 'Seven Psychopaths',
 'Seven Stages to Achieve Eternal Bliss',
 'Seven Times Lucky',
 'Seven and a Match',
 'Seven in Heaven',
 'Sinbad: Legend of the Seven Seas',
 'The 7 Adventures of Sinbad',
 'The Last Seven',
 'The Magic 7',
 'The Magnificent Seven (2016 film)',
 'The Seven Faces of Jane',
 'The Seven Five',
 'The Seven of Daran: Battle of Pareo Rock',
 'The Trial of the Chicago 7',
 'ZR-7 :The Red House Seven

#### Let's try to find movies involving Aaron Sorkin

In [11]:
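def search_text(pattern, data, r=50):
    """Print each title whose text matches, with r characters of context around the match start."""
    for title, text in data.items():
        match = re.search(pattern, text)
        if match is None:
            continue
        i = match.start()
        start = max(i - r, 0)
        end = i + r  # the window is measured from the start of the match
        print(f"{title}\n\n...{text[start:end]}...\n\n")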


In [12]:
search_text('Aaron Sorkin', movies)

Charlie Wilson's War (film)

...d by Mike Nichols (his final film) and written by Aaron Sorkin, who adapted George Crile III's 2003 ...


What Lies Beneath

...n with production for rewrites, he had to decline Aaron Sorkin's offer to read for a major role in S...


Gambit (2012 film)

...oved it. He initially sent the original script to Aaron Sorkin to rewrite it; however, despite being...


ISteve

...ie also bested a third Jobs movie in the works by Aaron Sorkin adapted from Steve Jobs by Walter Isa...


Jobs (film)

...ere 'abhorred' by it. Wozniak was a consultant on Aaron Sorkin's 2015 Steve Jobs film. When asked wh...


Molly's Game

...raphical crime drama film written and directed by Aaron Sorkin (in his directorial debut), based on ...


Moneyball (film)

...Bennett Miller and written by Steven Zaillian and Aaron Sorkin. The film is based on the 2003 nonfic...


Seven Psychopaths

... play like a combination of Quentin Tarantino and Aaron Sorkin." About the film itself, he wro

#### Could we find all names in all texts?

In [13]:
def count_patterns(pattern, data):
    return Counter(match for title, text in data.items() for match in re.findall(pattern, text)).most_common()

In [14]:
name_pattern = '[A-Z][a-z]+(?: [A-Z][a-z]+)+'
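The group in this pattern is written as `(?: ...)`, a non-capturing group. This matters for `re.findall`: when a pattern contains a capturing group, `findall` returns the group contents instead of the full match. A small sketch on a made-up sentence:

```python
import re

sentence = "Directed by Aaron Sorkin in New York City."

# non-capturing group: findall returns the full matches
full_matches = re.findall("[A-Z][a-z]+(?: [A-Z][a-z]+)+", sentence)

# capturing group: findall returns only the last repetition of the group
group_only = re.findall("[A-Z][a-z]+( [A-Z][a-z]+)+", sentence)

print(full_matches)
print(group_only)
```

Note that `Directed` is not matched: the pattern requires at least two consecutive capitalized words.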

In [15]:
count_patterns(name_pattern, movies)

[('Rotten Tomatoes', 16740),
 ('United States', 12625),
 ('Los Angeles', 4487),
 ('On Metacritic', 4271),
 ('Box Office Mojo', 3593),
 ('United Kingdom', 3517),
 ('New York', 3424),
 ('New York City', 3002),
 ('The Hollywood Reporter', 2801),
 ('The New York Times', 2691),
 ('On Rotten Tomatoes', 2567),
 ('Warner Bros', 2383),
 ('Sundance Film Festival', 2224),
 ('Toronto International Film Festival', 2078),
 ('North America', 2022),
 ('The Guardian', 1875),
 ('Roger Ebert', 1776),
 ('Los Angeles Times', 1578),
 ('Academy Award', 1227),
 ('Entertainment Weekly', 1222),
 ('Cannes Film Festival', 1208),
 ('Academy Awards', 1185),
 ('In May', 1094),
 ('New Zealand', 1088),
 ('Chicago Sun', 1058),
 ('Best Actress', 1047),
 ('In October', 1040),
 ('In March', 1037),
 ('In April', 1009),
 ('Tribeca Film Festival', 976),
 ('In February', 972),
 ('Best Actor', 965),
 ('North American', 955),
 ('In June', 929),
 ('In January', 910),
 ('In September', 905),
 ('Rolling Stone', 903),
 ('In July', 

#### Let's reuse this pattern

In [16]:
count_patterns('starring ' + name_pattern, movies)

[('starring Nicolas Cage', 24),
 ('starring Steven Seagal', 20),
 ('starring Robert De Niro', 14),
 ('starring John Travolta', 14),
 ('starring Cuba Gooding Jr', 14),
 ('starring Jason Statham', 13),
 ('starring Dean Cain', 12),
 ('starring John Cusack', 12),
 ('starring Ryan Reynolds', 12),
 ('starring James Franco', 12),
 ('starring Billy Zane', 12),
 ('starring Val Kilmer', 12),
 ('starring Danny Trejo', 12),
 ('starring Katherine Heigl', 11),
 ('starring Adam Sandler', 11),
 ('starring Heather Graham', 11),
 ('starring Tom Berenger', 11),
 ('starring Jackie Chan', 11),
 ('starring Tom Selleck', 11),
 ('starring Bruce Willis', 10),
 ('starring Dennis Quaid', 10),
 ('starring Ben Stiller', 10),
 ('starring Willem Dafoe', 10),
 ('starring Tom Cruise', 10),
 ('starring Brendan Fraser', 10),
 ('starring Keanu Reeves', 10),
 ('starring Jim Carrey', 10),
 ('starring Dwayne Johnson', 10),
 ('starring Majid Michel', 10),
 ('starring Arnold Schwarzenegger', 9),
 ('starring Johnny Depp', 9),


In [17]:
count_patterns(name_pattern+' franchise', movies)

[('Star Wars franchise', 24),
 ('Super Hero Girls franchise', 12),
 ('Harry Potter franchise', 11),
 ('Toy Story franchise', 10),
 ('Jurassic Park franchise', 10),
 ('Star Trek franchise', 10),
 ('Wizarding World franchise', 10),
 ('Evil Dead franchise', 8),
 ('Elm Street franchise', 8),
 ('Kung Fu Panda franchise', 8),
 ('The Conjuring Universe franchise', 7),
 ('John Wick franchise', 7),
 ('American Pie franchise', 6),
 ('Mad Max franchise', 6),
 ('The Texas Chainsaw Massacre franchise', 6),
 ('Ice Age franchise', 6),
 ('My Little Pony franchise', 6),
 ('Disney Fairies franchise', 6),
 ('The Snow Queen franchise', 6),
 ('Cliff Beasts franchise', 6),
 ('James Bond franchise', 5),
 ('The Mummy franchise', 5),
 ('Indiana Jones franchise', 5),
 ('King Kong franchise', 5),
 ('Spy Kids franchise', 5),
 ('The Expendables franchise', 5),
 ('Despicable Me franchise', 5),
 ('Resident Evil franchise', 4),
 ('Air Buddies franchise', 4),
 ('Bad Boys franchise', 4),
 ('The Conjuring franchise', 4)

In [18]:
count_patterns('Academy Award for ' + name_pattern, movies)

[('Academy Award for Best Documentary Feature', 91),
 ('Academy Award for Best Original Song', 52),
 ('Academy Award for Best Foreign Language Film', 52),
 ('Academy Award for Best Animated Short Film', 47),
 ('Academy Award for Best Actress', 45),
 ('Academy Award for Best Animated Feature', 39),
 ('Academy Award for Best Actor', 33),
 ('Academy Award for Best Picture', 30),
 ('Academy Award for Best Live Action Short Film', 30),
 ('Academy Award for Best Documentary', 28),
 ('Academy Award for Best Original Screenplay', 24),
 ('Academy Award for Best Visual Effects', 22),
 ('Academy Award for Best Original Score', 22),
 ('Academy Award for Best Supporting Actor', 21),
 ('Academy Award for Best International Feature Film', 21),
 ('Academy Award for Best Supporting Actress', 19),
 ('Academy Award for Best Makeup', 17),
 ('Academy Award for Best Documentary Short Subject', 17),
 ('Academy Award for Best Adapted Screenplay', 16),
 ('Academy Award for Best Documentary Short', 13),
 ('Acad

### Substitution and groups

Regexes are not just for pattern matching, they are also a powerful tool for text manipulation.

In [19]:
with open('data/tww_s1_e1.txt') as f:
    text = f.read()

In [20]:
print(text)

THE WEST WING
"PILOT"
WRITTEN BY: AARON SORKIN
DIRECTED BY: THOMAS SCHLAMME


ACT ONE

WAITER [VO]
Two Absolut Martinis up; another Dewars rocks.

FADE IN: INT. FOUR SEASONS HOTEL - GEORGETOWN - NIGHT
SAM SEABORN is sitting with a reporter, BILLY KENWORTHY, in the bar.

SAM SEABORN
I don't think we're going to run the table, if that's what you're asking.

BILLY KENWORTHY [OS]
It's not.

SAM
I know.

BILLY [OS]
Deep background. I'm not going to come close to using your name.

SAM
[laughs] You're not going to come close to getting a quote, either.

BILLY
Why are we sitting here?

SAM
[taking a drink] You sat down.

BILLY
Is Josh on his way out?

SAM
No.

BILLY
Is he?

SAM
No.

BILLY
I know he's your friend.

SAM
He is.

BILLY
Did Caldwell say...?

SAM
Billy, I'm not talking about this.

BILLY
Who do I call?

SAM
No one.

BILLY
Just tell me who to call.

SAM
Well, you could call 1-800-BITE-ME.

BILLY
Sam.

SAM
He's not going anywhere, Billy. It's a non-story.

BILLY
Okay. You're lying now

Let's get the structure of this document, step by step.

In [21]:
match = re.search('(.*)\nACT ONE', text, re.S)
print(match)

<re.Match object; span=(0, 85), match='THE WEST WING\n"PILOT"\nWRITTEN BY: AARON SORKIN\>


In [22]:
header = match.group(1).strip()
print(header)

THE WEST WING
"PILOT"
WRITTEN BY: AARON SORKIN
DIRECTED BY: THOMAS SCHLAMME


In [23]:
footer = re.search(r'THE END\n\* \* \*(.*)', text, re.S).group(1).strip()
print(footer)

The West Wing and all its characters are properties of Aaron Sorkin, John Wells
Production, Warner Brothers Television, and NBC. No copyright infringement
is intended.

Episode 1.1 -- 'Pilot'
Original Airdate: September 22, 1999, 9;00 EST


We can do all this with a single regex.

In [24]:
header, body, footer = re.search(r'(.*)\n(ACT ONE.*THE END)\n\* \* \*(.*)', text, re.S).groups()
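When a regex has several groups, named groups can make the result easier to read than positional unpacking. The following is a sketch of the same idea on a tiny made-up script (the `sample` string and the group names are illustrative, not part of the lecture data):

```python
import re

# a made-up miniature script with the same overall layout
sample = (
    "TITLE\nWRITTEN BY: SOMEONE\n\n"
    "ACT ONE\n\nSPEAKER\nA line.\n\nTHE END\n"
    "* * *\n\nA closing note.\n"
)

pattern = r"(?P<header>.*)\n(?P<body>ACT ONE.*THE END)\n\* \* \*(?P<footer>.*)"
match = re.search(pattern, sample, re.S)

# groupdict() maps each group name to the text it matched
parts = {name: value.strip() for name, value in match.groupdict().items()}

print(parts["header"])
print(parts["footer"])
```

Stripping each part here also removes the extra blank lines that positional groups would otherwise keep.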

In [25]:
print(header)

THE WEST WING
"PILOT"
WRITTEN BY: AARON SORKIN
DIRECTED BY: THOMAS SCHLAMME




In [26]:
print(footer)



The West Wing and all its characters are properties of Aaron Sorkin, John Wells
Production, Warner Brothers Television, and NBC. No copyright infringement
is intended.

Episode 1.1 -- 'Pilot'
Original Airdate: September 22, 1999, 9;00 EST



In [27]:
print(body)

ACT ONE

WAITER [VO]
Two Absolut Martinis up; another Dewars rocks.

FADE IN: INT. FOUR SEASONS HOTEL - GEORGETOWN - NIGHT
SAM SEABORN is sitting with a reporter, BILLY KENWORTHY, in the bar.

SAM SEABORN
I don't think we're going to run the table, if that's what you're asking.

BILLY KENWORTHY [OS]
It's not.

SAM
I know.

BILLY [OS]
Deep background. I'm not going to come close to using your name.

SAM
[laughs] You're not going to come close to getting a quote, either.

BILLY
Why are we sitting here?

SAM
[taking a drink] You sat down.

BILLY
Is Josh on his way out?

SAM
No.

BILLY
Is he?

SAM
No.

BILLY
I know he's your friend.

SAM
He is.

BILLY
Did Caldwell say...?

SAM
Billy, I'm not talking about this.

BILLY
Who do I call?

SAM
No one.

BILLY
Just tell me who to call.

SAM
Well, you could call 1-800-BITE-ME.

BILLY
Sam.

SAM
He's not going anywhere, Billy. It's a non-story.

BILLY
Okay. You're lying now, aren't you?

SAM
That hurts, Billy. Why would I lie to a journalist of all p

Now let's get the scenes!

In [28]:
SCENE_SEP_PATT = "\n(?:CUT TO:|ACT [A-Z]*)"

In [29]:
scenes = re.split(SCENE_SEP_PATT, body)

In [30]:
len(scenes)

21

In [31]:
print('\n\n***\n\n'.join(f'Scene {i}:\n{scenes[i].strip()[:50]}...' for i in range(5)))

Scene 0:
ACT ONE

WAITER [VO]
Two Absolut Martinis up; anot...

***

Scene 1:
EXT. DAWN RISING OVER LARGE TUDOR STYLE HOUSE - DA...

***

Scene 2:
INT. DINING ROOM - CONTINUOUS
LEO McGARRY is doing...

***

Scene 3:
INT. HEALTH CLUB - DAY

C.J. CREGG is running on a...

***

Scene 4:
INT. JOSH LYMAN'S OFFICE - DARK
In the dark office...


Now let's get the structure of the dialogue!

In [32]:
print(scenes[2])

 INT. DINING ROOM - CONTINUOUS
LEO McGARRY is doing a crossword puzzle while eating breakfast. A television is
turned on to the news.

LEO McGARRY
17 across is wrong. It's just wrong. Do you believe that Ruth?

RUTH
You should call them.

LEO
I will call them.

WOMAN [OS]
Telephone, Leo.

LEO
I'm in the shower.

WOMAN [OS]
It's POTUS.

LEO
[sits down and picks up the phone] Yeah.



In [33]:
LINE_PATT = r"\n([A-Z.\[\] ]+)\n(.*?)\n"

In [34]:
utterances = re.findall(LINE_PATT, scenes[0], re.S)

In [35]:
utterances[:10]

[('WAITER [VO]', 'Two Absolut Martinis up; another Dewars rocks.'),
 ('SAM SEABORN',
  "I don't think we're going to run the table, if that's what you're asking."),
 ('BILLY KENWORTHY [OS]', "It's not."),
 ('SAM', 'I know.'),
 ('BILLY [OS]',
  "Deep background. I'm not going to come close to using your name."),
 ('SAM',
  "[laughs] You're not going to come close to getting a quote, either."),
 ('BILLY', 'Why are we sitting here?'),
 ('SAM', '[taking a drink] You sat down.'),
 ('BILLY', 'Is Josh on his way out?'),
 ('SAM', 'No.')]

In [36]:
script = {
    "header": header,
    "scenes": [
        {"lines": [
            {
                "char": character,
                "text": text
            }
            for character, text in re.findall(LINE_PATT, scene)
        ]
        }
        for scene in re.split(SCENE_SEP_PATT, body)
        ],
    "footer": footer
}

In [37]:
script['scenes'][2]

{'lines': [{'char': 'RUTH', 'text': 'You should call them.'},
  {'char': 'LEO', 'text': 'I will call them.'},
  {'char': 'WOMAN [OS]', 'text': 'Telephone, Leo.'},
  {'char': 'LEO', 'text': "I'm in the shower."},
  {'char': 'WOMAN [OS]', 'text': "It's POTUS."},
  {'char': 'LEO', 'text': '[sits down and picks up the phone] Yeah.'}]}
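Compared with the printed scene, the first utterance (by `LEO McGARRY`) is missing from this output: the speaker pattern allows only uppercase letters, dots, brackets and spaces, so the lowercase `c` blocks the match. One possible workaround, shown here only as a sketch, is to allow the prefix `Mc` explicitly (the name `LINE_PATT_MC` and the shortened sample are assumptions made for illustration):

```python
import re

LINE_PATT = r"\n([A-Z.\[\] ]+)\n(.*?)\n"
# hypothetical variant that additionally accepts the prefix "Mc" in speaker names
LINE_PATT_MC = r"\n((?:[A-Z.\[\] ]|Mc)+)\n(.*?)\n"

# shortened excerpt of the scene shown above
sample = (
    "\nLEO McGARRY\n17 across is wrong.\n"
    "\nRUTH\nYou should call them.\n"
    "\nLEO\nI will call them.\n"
)

original = re.findall(LINE_PATT, sample)
extended = re.findall(LINE_PATT_MC, sample)

print([speaker for speaker, _ in original])
print([speaker for speaker, _ in extended])
```

Simply adding `a-z` to the character class would be too permissive, since any short line of dialogue consisting of letters would then look like a speaker name.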

Let's use this data for something. Let's get a list of characters by frequency.

In [38]:
Counter(line['char'] for scene in script['scenes'] for line in scene['lines']).most_common(10)

[('SAM', 117),
 ('JOSH', 117),
 ('LEO', 101),
 ('TOBY', 56),
 ('C.J.', 36),
 ('LAURIE', 28),
 ('DONNA', 26),
 ('MANDY', 26),
 ('CALDWELL', 23),
 ('BILLY', 22)]

Regular expressions are surprisingly powerful. With the right implementation they are also very fast, because they are equivalent to [finite state automata (FSAs)](https://en.wikipedia.org/wiki/Finite-state_machine), which can process a string in time linear in its length. In fact, every regular expression can be converted to an equivalent [regular grammar](https://en.wikipedia.org/wiki/Regular_grammar), and both define a [regular language](https://en.wikipedia.org/wiki/Regular_language).

![re_xkcd](media/re_xkcd.png)([XKCD #208](https://xkcd.com/208/))
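To make the equivalence with finite state automata concrete, here is a hand-built deterministic automaton for the pattern `[A-Z][a-z]+` (one capitalized word, ASCII only). The state numbering and helper names are choices made for this sketch; it reads each character exactly once.

```python
import re

def char_class(ch):
    """Map a character to the symbol class the automaton works with."""
    if "A" <= ch <= "Z":
        return "upper"
    if "a" <= ch <= "z":
        return "lower"
    return "other"

# state 0: start, state 1: seen the capital letter, state 2: seen at least one lowercase letter
TRANSITIONS = {(0, "upper"): 1, (1, "lower"): 2, (2, "lower"): 2}
ACCEPTING = {2}

def dfa_accepts(string):
    state = 0
    for ch in string:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING

for word in ["Aaron", "aaron", "A", "Ab", "ABc", ""]:
    print(repr(word), dfa_accepts(word), bool(re.fullmatch("[A-Z][a-z]+", word)))
```

On these inputs the automaton and `re.fullmatch` agree. Python's `re` module itself uses a backtracking matcher rather than such an automaton, which is one reason the "right implementation" caveat matters.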

## Text segmentation

### Splitting text into sentences

In [39]:
text2 = "'Of course it's only because Tom isn't home,' said Mrs. Parsons vaguely."

#### Naive: split on `.`, `!`, `?`, etc.

In [40]:
re.split('[.!?]', text2)

["'Of course it's only because Tom isn't home,' said Mrs",
 ' Parsons vaguely',
 '']

#### Better: use language-specific list of abbreviation words, collocations, etc.

In [41]:
nltk.sent_tokenize(text2)

["'Of course it's only because Tom isn't home,' said Mrs. Parsons vaguely."]

Custom lists of patterns are often necessary for **special domains**. 

_An die Stelle der Landesgesetze vom 17. Jänner 1883, n.ö.L.G. u. V.Bl. Nr. 35, vom 26. Dezember 1890, n.ö.L.G. u. V.Bl. Nr. 48, vom 17. Juni 1920 n.ö.L.G. u. V.Bl. Nr. 547, vom 4. November 1920 n.ö.L.G. u. V.Bl. Nr. 808, und vom 9. Dezember 1927, L.G.Bl. für Wien Nr. 1 ex 1928, die, soweit dieses Gesetz nichts anderes bestimmt, zugleich ihre Wirksamkeit verlieren, hat die nachfolgende Bauordnung zu treten._

[Bauordnung für Wien](https://www.ris.bka.gv.at/Dokumente/Landesnormen/LWI40000064/LWI40000064.html)

In [42]:
text3 = "An die Stelle der Landesgesetze vom 17. Jänner 1883, n.ö.L.G. u. V.Bl. Nr. 35, vom 26. Dezember 1890, n.ö.L.G. u. V.Bl. Nr. 48, vom 17. Juni 1920 n.ö.L.G. u. V.Bl. Nr. 547, vom 4. November 1920 n.ö.L.G. u. V.Bl. Nr. 808, und vom 9. Dezember 1927, L.G.Bl. für Wien Nr. 1 ex 1928, die, soweit dieses Gesetz nichts anderes bestimmt, zugleich ihre Wirksamkeit verlieren, hat die nachfolgende Bauordnung zu treten."

In [43]:
print(text3)

An die Stelle der Landesgesetze vom 17. Jänner 1883, n.ö.L.G. u. V.Bl. Nr. 35, vom 26. Dezember 1890, n.ö.L.G. u. V.Bl. Nr. 48, vom 17. Juni 1920 n.ö.L.G. u. V.Bl. Nr. 547, vom 4. November 1920 n.ö.L.G. u. V.Bl. Nr. 808, und vom 9. Dezember 1927, L.G.Bl. für Wien Nr. 1 ex 1928, die, soweit dieses Gesetz nichts anderes bestimmt, zugleich ihre Wirksamkeit verlieren, hat die nachfolgende Bauordnung zu treten.


In [44]:
nltk.sent_tokenize(text3, language='german')

['An die Stelle der Landesgesetze vom 17.',
 'Jänner 1883, n.ö.L.G.',
 'u. V.Bl.',
 'Nr. 35, vom 26. Dezember 1890, n.ö.L.G.',
 'u. V.Bl.',
 'Nr. 48, vom 17. Juni 1920 n.ö.L.G.',
 'u. V.Bl.',
 'Nr. 547, vom 4. November 1920 n.ö.L.G.',
 'u. V.Bl.',
 'Nr. 808, und vom 9. Dezember 1927, L.G.Bl.',
 'für Wien Nr. 1 ex 1928, die, soweit dieses Gesetz nichts anderes bestimmt, zugleich ihre Wirksamkeit verlieren, hat die nachfolgende Bauordnung zu treten.']

In [45]:
nltk.sent_tokenize("17. Jänner", language='german')

['17.', 'Jänner']

In [46]:
nltk.sent_tokenize("17. Januar", language='german')

['17. Januar']

**NB: most real-world NLP applications are in special domains!**
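As a rough illustration of what a custom pattern list can look like, the sketch below splits on sentence-final punctuation unless the token is a known abbreviation or matches a "never sentence-final" pattern such as a German ordinal number. The function name, the abbreviation sets and the abridged German sentence are assumptions made for this example; this is not how NLTK's Punkt tokenizer works internally.

```python
import re

def split_sentences(text, abbreviations, non_final_pattern=None):
    """Split after ., ! or ? followed by whitespace, except after listed tokens."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s|$)", text):
        token = text[start:m.end()].split()[-1]  # the token ending at this mark
        if token in abbreviations:
            continue
        if non_final_pattern and re.fullmatch(non_final_pattern, token):
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

english = "'Of course it's only because Tom isn't home,' said Mrs. Parsons vaguely."
# abridged version of the legal sentence above
german = ("An die Stelle der Landesgesetze vom 17. Jänner 1883, n.ö.L.G. u. V.Bl. "
          "Nr. 35, hat die nachfolgende Bauordnung zu treten.")

print(split_sentences(english, set()))       # naive: splits after "Mrs."
print(split_sentences(english, {"Mrs."}))    # with an abbreviation list
print(split_sentences(german, {"n.ö.L.G.", "u.", "V.Bl.", "Nr."}, r"\d+\."))
```

The ordinal rule has an obvious cost: a sentence that genuinely ends in a number followed by a period would no longer be split, so such rules need to be checked against the target domain.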

### Tokenization - splitting sentences into words

#### Naive approach: split on whitespace

In [47]:
text2.split()

["'Of",
 'course',
 "it's",
 'only',
 'because',
 'Tom',
 "isn't",
 "home,'",
 'said',
 'Mrs.',
 'Parsons',
 'vaguely.']

#### Better: separate punctuation marks

In [48]:
re.findall(r'(\w+|[^\w\s]+)', text2)[:30]

["'",
 'Of',
 'course',
 'it',
 "'",
 's',
 'only',
 'because',
 'Tom',
 'isn',
 "'",
 't',
 'home',
 ",'",
 'said',
 'Mrs',
 '.',
 'Parsons',
 'vaguely',
 '.']

#### Best: add some language-specific conventions:

In [49]:
nltk.word_tokenize(text2)

["'Of",
 'course',
 'it',
 "'s",
 'only',
 'because',
 'Tom',
 'is',
 "n't",
 'home',
 ',',
 "'",
 'said',
 'Mrs.',
 'Parsons',
 'vaguely',
 '.']

In [50]:
nltk.word_tokenize("O'Brian")

["O'Brian"]
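Some of these conventions can be approximated with an ordered regex alternation: earlier alternatives take priority, so abbreviations and clitics are tried before the generic word and punctuation rules. This is a simplified sketch with a single hard-coded abbreviation, not the rule set NLTK uses, and it does not reproduce all of NLTK's output (it would split `O'Brian`, for example).

```python
import re

TOKEN_PATTERN = re.compile(
    r"Mrs\."          # a known abbreviation keeps its period (illustrative list of one)
    r"|\w+(?=n't)"    # the stem before the clitic n't, e.g. "is" in "isn't"
    r"|n't"           # the clitic n't itself
    r"|'\w+"          # other clitics such as 's
    r"|\w+"           # ordinary words
    r"|[^\w\s]"       # any single punctuation mark
)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Tom isn't home, said Mrs. Parsons."))
print(tokenize("it's"))
```

The order of the alternatives is the design choice here: moving `\w+` to the front would make the clitic rules unreachable.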

## Text normalization

#### What are the most common words in some sample of text?

In [51]:
movie_sample = {title: text for i, (title, text) in enumerate(movies.items()) if i % 100 == 0}

In [52]:
sorted(movie_sample.keys())

['.45 (film)',
 '1 Day',
 '12 Strong',
 '3 Day Test',
 '388 Arletta Avenue',
 'A Billion Lives',
 'A Boy Called Sailboat',
 'A Haunting in Cawdor',
 'A Healing Art',
 'A Lonely Place for Dying',
 'A Poet in New York',
 "Adina's Deck",
 'Age of Kill',
 'Alice Upside Down',
 'All My Friends Are Funeral Singers (film)',
 "America's Sweethearts",
 'American Anarchist',
 'An Act of War',
 'An Everlasting Piece',
 'An Inspector Calls (2015 TV film)',
 'Andover (film)',
 'Angel Eyes (film)',
 'Apaye',
 'Aquarium of the Dead',
 'Article VI (film)',
 'Awakening (2013 film)',
 'Aztec Rex',
 'Bait (2019 film)',
 'Barbie as Rapunzel',
 'Before It Had a Name',
 'Behind the Scenes of Total Hell: The Jamie Gunn Chronicles',
 'Better Nate Than Ever (film)',
 'Better Watch Out',
 'Black Water (2007 film)',
 'Blair Witch (film)',
 'Bobby (2006 film)',
 'Bobby Sands: 66 Days',
 'Bring Your Own Brigade',
 'Britney Ever After',
 'Brooklyn Dodgers: Ghosts of Flatbush',
 'Cabin Fever 2: Spring Fever',
 'Cane

In [53]:
words = [word for text in movie_sample.values() for word in nltk.word_tokenize(text)]

In [54]:
words[:10]

['1', 'Day', 'is', 'a', '2009', 'British', 'crime', 'film', 'about', 'gangs']

In [55]:
len(words)

237817

In [56]:
Counter(words).most_common(10)

[(',', 12184),
 ('the', 10869),
 ('.', 8521),
 ('and', 6268),
 ('of', 4973),
 ('to', 4847),
 ('a', 4789),
 ('in', 3306),
 ('as', 2797),
 ('film', 2621)]

Let's get rid of punctuation.

In [57]:
words = [word for word in words if re.match(r'\w', word)]

In [58]:
len(words)

206432

In [59]:
Counter(words).most_common(10)

[('the', 10869),
 ('and', 6268),
 ('of', 4973),
 ('to', 4847),
 ('a', 4789),
 ('in', 3306),
 ('as', 2797),
 ('film', 2621),
 ('The', 2320),
 ('is', 2041)]

Filtering common function words is called __stopword removal__.

In [60]:
from nltk.corpus import stopwords as nltk_stopwords
# keep the corpus reader under its own name so that this cell can be re-run
stopwords = set(nltk_stopwords.words('english'))
print(stopwords)

{'between', 'having', 'll', 'of', 'its', "you'll", 're', "he's", 'during', 'him', 'that', 'but', 'or', 'there', 'itself', 'own', 'themselves', 'yours', 'shan', 'any', "i'd", 'he', 'it', 'mustn', 'again', "hadn't", 'were', 'no', "they'll", "they'd", "didn't", 'such', 'when', 'ours', 'hers', 'which', 'their', 'your', 'other', 'them', 'can', 'needn', 'm', 'further', 'hasn', 'why', 'into', "i'll", "she's", 'o', 'theirs', 'aren', 'have', "we'll", 'herself', 'in', 'y', 'few', "needn't", 'am', 'won', 'on', 'ain', 'himself', 'doesn', 've', 'has', 't', 'with', 'down', 'me', "mustn't", 'doing', "it'll", 'her', 'do', 'than', 'been', 'those', 'nor', "i've", 'to', 'haven', 'out', "won't", "wouldn't", 'against', 'weren', "should've", 'the', 'under', 'isn', 'whom', 'was', 'until', 'just', 'only', "wasn't", 'too', 'very', "i'm", 'same', "hasn't", "they're", 'where', 'from', 'should', "you're", 'had', 'below', 'yourself', 'does', "you've", 'we', 'not', "couldn't", "that'll", 's', 'off', "you'd", 'don',

In [61]:
words = [word for word in words if word.lower() not in stopwords]

In [62]:
Counter(words).most_common(20)

[('film', 2621),
 ('also', 409),
 ('Film', 398),
 ('released', 347),
 ('movie', 302),
 ('one', 299),
 ('Festival', 293),
 ('reviews', 275),
 ('million', 255),
 ('first', 250),
 ('New', 248),
 ('links', 240),
 ('based', 239),
 ('External', 238),
 ('References', 235),
 ('IMDb', 235),
 ('directed', 235),
 ('would', 233),
 ('gave', 215),
 ('time', 213)]
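In the list above, `film` and `Film` are counted separately because only the stopword check is case-insensitive. A minimal sketch of counting lowercased tokens instead, using a made-up token list and a tiny illustrative stopword set:

```python
from collections import Counter

# made-up tokens and a tiny stopword set, for illustration only
tokens = ["The", "film", "is", "a", "Film", "about", "the", "film", "Festival"]
STOPWORDS = {"the", "is", "a", "about"}

# lowercase once, then both filter and count on the normalized form
normalized = [token.lower() for token in tokens]
counts = Counter(token for token in normalized if token not in STOPWORDS)

print(counts.most_common(2))
```

Whether case folding is desirable depends on the task: it merges `Film` with `film`, but it would also merge a proper noun such as `New` with the adjective.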

In [63]:
char_counter = Counter(line['char'] for scene in script['scenes'] for line in scene['lines'])

In [64]:
char_counter.most_common(5)

[('SAM', 117), ('JOSH', 117), ('LEO', 101), ('TOBY', 56), ('C.J.', 36)]

In [65]:
stopwords.add("n't")

In [66]:
from collections import defaultdict
word_counter = defaultdict(Counter)
for scene in script['scenes']:
    for line in scene['lines']:
        for word in nltk.word_tokenize(line['text']):
            if re.match(r'\w', word) and word.lower() not in stopwords:
                word_counter[line['char']][word.lower()] += 1

In [67]:
for char, _ in char_counter.most_common(5):
    print(char)
    print(word_counter[char].most_common(10))

SAM
[('know', 14), ('phone', 11), ('right', 7), ('really', 6), ('yeah', 6), ('well', 5), ('could', 5), ('call', 5), ('would', 5), ('good', 5)]
JOSH
[('know', 9), ('yeah', 8), ('get', 8), ('lloyd', 8), ('russell', 7), ('leo', 5), ('donna', 5), ('al', 5), ('caldwell', 5), ('think', 5)]
LEO
[('josh', 8), ('get', 7), ('know', 6), ('president', 6), ('think', 6), ('oh', 5), ('one', 5), ('call', 4), ('yeah', 4), ('hell', 4)]
TOBY
[('josh', 6), ('said', 4), ('come', 4), ('sit', 4), ('right', 3), ('agree', 3), ('make', 3), ('morning', 3), ('raising', 3), ('voice', 3)]
C.J.
[('president', 3), ('leo', 3), ('say', 2), ('seriously', 2), ('hard', 2), ('josh', 2), ('walks', 2), ('press', 2), ('chris', 2), ('get', 2)]


### Lemmatization and stemming

Words like _say_, _says_, and _said_ are all different **word forms** of the same **lemma**. Grouping them together can be useful in many applications. 

**Stemming** is the reduction of words to a common prefix, using simple rules that only work some of the time:

In [68]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [69]:
for word in ('dogs', 'foxes', 'jumps'):
    print(stemmer.stem(word))

dog
fox
jump


In [70]:
for word in ('say', 'says', 'said'):
    print(stemmer.stem(word))

say
say
said


In [71]:
for word in ('he', 'his', 'him'):
    print(stemmer.stem(word))

he
hi
him


In [72]:
stemmer.stem('dogs')

'dog'
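To see why such rules only work some of the time, here is a toy suffix stripper with two hypothetical rules. It is far simpler than the Porter stemmer and is not a reimplementation of it, but it shows the same kind of behaviour on the examples above.

```python
def toy_stem(word):
    """Strip plural-like endings using two naive rules (illustration only)."""
    if word.endswith(("xes", "ses", "ches", "shes")):
        return word[:-2]   # foxes -> fox
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]   # dogs -> dog, but also his -> hi
    return word

for word in ("dogs", "foxes", "jumps", "says", "said", "his"):
    print(word, toy_stem(word))
```

The rules handle regular plurals, leave the irregular form `said` untouched, and over-apply to `his`, which is not a plural at all.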

**Lemmatization** is the mapping of word forms to their lemma, using either a dictionary of word forms, a grammar of how words are formed (a **morphology**), or both.
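The combination of a dictionary and rules can be sketched in a few lines: irregular forms are looked up, everything else goes through suffix rules. The dictionary entries, the rules and the length threshold below are invented for this sketch and are much cruder than what a trained lemmatizer such as the one in Stanza does.

```python
# hypothetical dictionary of irregular word forms
IRREGULAR = {"said": "say", "is": "be", "written": "write"}

# hypothetical suffix rules, tried in order: (suffix, replacement)
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_lemmatize(word):
    w = word.lower()
    if w in IRREGULAR:                      # dictionary lookup first
        return IRREGULAR[w]
    for suffix, replacement in RULES:       # then fall back to the rules
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)] + replacement
    return w

for word in ("said", "directed", "riots", "inciting"):
    print(word, toy_lemmatize(word))
```

The last example comes out as `incit` rather than `incite`: restoring the dropped `e` needs knowledge about the lexicon that simple suffix rules do not have.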

In [73]:
nlp = stanza.Pipeline('en', processors='tokenize,lemma,pos')

2025-10-09 09:08:10 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES






2025-10-09 09:08:11 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2025-10-09 09:08:11 INFO: Using device: cpu
2025-10-09 09:08:11 INFO: Loading: tokenize
2025-10-09 09:08:11 INFO: Loading: pos
2025-10-09 09:08:11 INFO: Loading: lemma
2025-10-09 09:08:11 INFO: Done loading processors!


In [74]:
text = movies["The Trial of the Chicago 7"]

In [75]:
doc = nlp(text)

In [76]:
for sentence in doc.sentences[:5]:
    for word in sentence.words:
        print(word.text + '\t' + word.lemma)
    print()

The	the
Trial	Trial
of	of
the	the
Chicago	Chicago
7	7
is	be
a	a
2020	2020
American	American
historical	historical
legal	legal
drama	drama
film	film
written	write
and	and
directed	direct
by	by
Aaron	Aaron
Sorkin	Sorkin
.	.

The	the
film	film
follows	follow
the	the
Chicago	Chicago
Seven	Seven
,	,
a	a
group	group
of	of
anti–Vietnam	anti–Vietnam
War	War
protesters	protester
charged	charge
with	with
conspiracy	conspiracy
and	and
crossing	cross
state	state
lines	line
with	with
the	the
intention	intention
of	of
inciting	incite
riots	riot
at	at
the	the
1968	1968
Democratic	Democratic
National	National
Convention	Convention
in	in
Chicago	Chicago
.	.

It	it
features	feature
an	a
ensemble	ensemble
cast	cast
including	include
Yahya	Yahya
Abdul	Abdul
-	-
Mateen	Mateen
II	II
,	,
Sacha	Sacha
Baron	Baron
Cohen	Cohen
,	,
Daniel	Daniel
Flaherty	Flaherty
,	,
Joseph	Joseph
Gordon	Gordon
-	-
Levitt	Levitt
,	,
Michael	Michael
Keaton	Keaton
,	,
Frank	Frank
Langella	Langella
,	,
John	John
Carroll	Carroll
Lync

**QUESTION: Consider lemmas that could be reduced further, e.g. _historical_ or _protester_. Why aren't they?**

Now we can count lemmas.

In [77]:
Counter(
    word.lemma for sentence in doc.sentences for word in sentence.words
    if word.lemma.lower() not in stopwords and re.match(r'\w', word.lemma)).most_common(20)

[('film', 43),
 ('Chicago', 37),
 ('Sorkin', 26),
 ('Trial', 21),
 ('7', 20),
 ('2020', 20),
 ('Hoffman', 15),
 ('good', 14),
 ('Hayden', 14),
 ('police', 14),
 ('cast', 12),
 ('release', 12),
 ('October', 11),
 ('Seale', 11),
 ('write', 10),
 ('Spielberg', 10),
 ('include', 9),
 ('New', 9),
 ('Netflix', 9),
 ('Award', 9)]

The full analysis of how a word form is built from its lemma is known as **morphological analysis**.

In [78]:
for sentence in doc.sentences[:5]:
    for word in sentence.words:
        print('\t'.join([word.text, word.lemma, word.upos, word.feats if word.feats else '']))
    print()

The	the	DET	Definite=Def|PronType=Art
Trial	Trial	PROPN	Number=Sing
of	of	ADP	
the	the	DET	Definite=Def|PronType=Art
Chicago	Chicago	PROPN	Number=Sing
7	7	NUM	NumForm=Digit|NumType=Card
is	be	AUX	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
a	a	DET	Definite=Ind|PronType=Art
2020	2020	NUM	NumForm=Digit|NumType=Card
American	American	ADJ	Degree=Pos
historical	historical	ADJ	Degree=Pos
legal	legal	ADJ	Degree=Pos
drama	drama	NOUN	Number=Sing
film	film	NOUN	Number=Sing
written	write	VERB	Tense=Past|VerbForm=Part
and	and	CCONJ	
directed	direct	VERB	Tense=Past|VerbForm=Part
by	by	ADP	
Aaron	Aaron	PROPN	Number=Sing
Sorkin	Sorkin	PROPN	Number=Sing
.	.	PUNCT	

The	the	DET	Definite=Def|PronType=Art
film	film	NOUN	Number=Sing
follows	follow	VERB	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
the	the	DET	Definite=Def|PronType=Art
Chicago	Chicago	PROPN	Number=Sing
Seven	Seven	PROPN	Number=Sing
,	,	PUNCT	
a	a	DET	Definite=Ind|PronType=Art
group	group	NOUN	Number=Sing
of	of	ADP	
anti–V
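The feature strings in the last column follow a simple `Name=Value|Name=Value` layout, so they are easy to turn into dictionaries for filtering. A small sketch (the helper name `parse_feats` is made up here), using two feature strings from the output above:

```python
def parse_feats(feats):
    """Turn 'Tense=Past|VerbForm=Part' into {'Tense': 'Past', 'VerbForm': 'Part'}."""
    if not feats:  # words without features have an empty or missing value
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

participle = parse_feats("Tense=Past|VerbForm=Part")
finite_verb = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")

print(participle)
print(finite_verb["Tense"], finite_verb["Person"])
print(parse_feats(None))
```

With the features in this form, selecting for example all past participles becomes a plain dictionary comparison.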

A special case of lemmatization is **decompounding**, recognizing multiple lemmas in a word.

In [79]:
nlp('roller-coaster')

[
  [
    {
      "id": 1,
      "text": "roller",
      "lemma": "roller",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Number=Sing",
      "start_char": 0,
      "end_char": 6
    },
    {
      "id": 2,
      "text": "-",
      "lemma": "-",
      "upos": "PUNCT",
      "xpos": "HYPH",
      "start_char": 6,
      "end_char": 7
    },
    {
      "id": 3,
      "text": "coaster",
      "lemma": "coaster",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Number=Sing",
      "start_char": 7,
      "end_char": 14
    }
  ]
]

In [80]:
nlp('wastebasket')

[
  [
    {
      "id": 1,
      "text": "wastebasket",
      "lemma": "wastebasket",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Number=Sing",
      "start_char": 0,
      "end_char": 11
    }
  ]
]

In [81]:
nlp('anti-Vietnam')

[
  [
    {
      "id": 1,
      "text": "anti-Vietnam",
      "lemma": "anti-Vietnam",
      "upos": "PROPN",
      "xpos": "NNP",
      "feats": "Number=Sing",
      "start_char": 0,
      "end_char": 12
    }
  ]
]

In [82]:
nlp('underrated')

[
  [
    {
      "id": 1,
      "text": "underrated",
      "lemma": "underrated",
      "upos": "ADJ",
      "xpos": "JJ",
      "feats": "Degree=Pos",
      "start_char": 0,
      "end_char": 10
    }
  ]
]

In [83]:
nlp('overwhelmed')

[
  [
    {
      "id": 1,
      "text": "overwhelmed",
      "lemma": "overwhelm",
      "upos": "VERB",
      "xpos": "VBN",
      "feats": "Tense=Past|VerbForm=Part",
      "start_char": 0,
      "end_char": 11
    }
  ]
]

For English you might say that this is good enough... but _some languages_ allow forming compounds on the fly...

In [84]:
nlp_de = stanza.Pipeline('de', processors='tokenize,lemma,pos')

2025-10-09 09:08:24 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-10-09 09:08:25 INFO: Loading these models for language: de (German):
| Processor | Package      |
----------------------------
| tokenize  | gsd          |
| mwt       | gsd          |
| pos       | gsd_charlm   |
| lemma     | gsd_nocharlm |

2025-10-09 09:08:25 INFO: Using device: cpu
2025-10-09 09:08:25 INFO: Loading: tokenize
2025-10-09 09:08:25 INFO: Loading: mwt
2025-10-09 09:08:25 INFO: Loading: pos
2025-10-09 09:08:25 INFO: Loading: lemma
2025-10-09 09:08:25 INFO: Done loading processors!


In [85]:
nlp_de('Kraftfahrzeug-Haftpflichtversicherung')

[
  [
    {
      "id": 1,
      "text": "Kraftfahrzeug",
      "lemma": "Kraftfahrzeug",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Case=Nom|Gender=Fem|Number=Sing",
      "start_char": 0,
      "end_char": 13
    },
    {
      "id": 2,
      "text": "-",
      "lemma": "-",
      "upos": "PUNCT",
      "xpos": "$(",
      "start_char": 13,
      "end_char": 14
    },
    {
      "id": 3,
      "text": "Haftpflichtversicherung",
      "lemma": "Haftpflichtversicherung",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Case=Nom|Gender=Fem|Number=Sing",
      "start_char": 14,
      "end_char": 37
    }
  ]
]

In [86]:
nlp_de('Nahrungsmittelunverträglichkeit')

[
  [
    {
      "id": 1,
      "text": "Nahrungsmittelunverträglichkeit",
      "lemma": "Nahrungsmittelunverträglichkeit",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Case=Nom|Gender=Fem|Number=Sing",
      "start_char": 0,
      "end_char": 31
    }
  ]
]

In [87]:
nlp_de('Rindfleischetikettierungsüberwachungsaufgabenübertragunsgesetz')

[
  [
    {
      "id": 1,
      "text": "Rindfleischetikettierungsüberwachungsaufgabenübertragunsgesetz",
      "lemma": "Rindfleischetikettierungsüberwachungsaufgabenübertragunsgesetz",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Case=Nom|Gender=Neut|Number=Sing",
      "start_char": 0,
      "end_char": 62
    }
  ]
]

See also [Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz](https://de.wikipedia.org/wiki/Rindfleischetikettierungs%C3%BCberwachungsaufgaben%C3%BCbertragungsgesetz).

In [88]:
nlp_de('Kassenidentifikationsnummer')

[
  [
    {
      "id": 1,
      "text": "Kassenidentifikationsnummer",
      "lemma": "Kassenidentifikationsnummer",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Case=Nom|Gender=Fem|Number=Sing",
      "start_char": 0,
      "end_char": 27
    }
  ]
]

In [89]:
nlp_de('Klimabonus')

[
  [
    {
      "id": 1,
      "text": "Klimabonus",
      "lemma": "Klimabonus",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Case=Nom|Gender=Masc|Number=Sing",
      "start_char": 0,
      "end_char": 10
    }
  ]
]

There is no good generic solution and no standard tool. There are some unsupervised approaches like [SECOS](https://github.com/riedlma/SECOS) and [CharSplit](https://github.com/dtuggener/CharSplit), and there are also full-fledged morphological analyzers that might work, like [SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/) and its extensions [zmorge](https://pub.cl.uzh.ch/users/sennrich/zmorge/) and [SMORLemma](https://github.com/rsennrich/SMORLemma).
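
To make the problem concrete, here is a minimal sketch of the simplest possible approach: a dictionary-based splitter that recursively looks for known parts, optionally separated by a linking element. It is not the method used by any of the tools above. The lexicon, the list of linking elements, and the function name `split_compound` are all assumptions made for illustration.

```python
# Toy lexicon of lowercase German stems. A real system would need a large
# vocabulary or corpus statistics; these entries are assumptions chosen
# for the examples below.
LEXICON = {'kasse', 'identifikation', 'nummer', 'klima', 'bonus',
           'nahrung', 'mittel', 'unverträglichkeit'}
# Optional linking elements between parts (only two of several in German)
LINKERS = ('', 's', 'n')


def split_compound(word, lexicon=LEXICON):
    """Return a list of lexicon entries covering `word`, or None if there is no full split."""
    word = word.lower()
    if not word:
        return []
    # Try longer prefixes first so that whole words are preferred over splits
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            remainder = rest[len(linker):]
            if linker and not remainder:
                continue  # a linking element must be followed by another part
            tail = split_compound(remainder, lexicon)
            if tail is not None:
                return [head] + tail
    return None


for w in ['Kassenidentifikationsnummer', 'Klimabonus',
          'Nahrungsmittelunverträglichkeit']:
    print(w, split_compound(w))
```

The sketch also shows why no generic solution exists: the result depends entirely on the lexicon, any word with an unknown part yields `None`, and nothing prevents spurious splits once the lexicon grows large enough to contain short strings that happen to occur inside longer words.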

## Text preprocessing in NLP: best practices

Text preprocessing steps such as those above are critical components of most NLP applications. Very often they are also a major bottleneck.

**Preprocessing for segmentation and normalization should be a separate component in almost any NLP application.**

When storing preprocessed text, the format should ensure **reproducibility** and it should be **platform-independent**. It should also be easy to **inspect** and allow for **version control**.

### The CoNLL format

In [90]:
from stanza.utils.conll import CoNLL

CoNLL.write_doc2conll(doc, "data/output.conllu")

In [91]:
# Read the file back as UTF-8 text to inspect what was written
with open('data/output.conllu', encoding='utf-8') as f:
    print(f.read())

# text = The Trial of the Chicago 7 is a 2020 American historical legal drama film written and directed by Aaron Sorkin.
# sent_id = 0
1	The	the	DET	DT	Definite=Def|PronType=Art	0	_	_	start_char=0|end_char=3
2	Trial	Trial	PROPN	NNP	Number=Sing	1	_	_	start_char=4|end_char=9
3	of	of	ADP	IN	_	2	_	_	start_char=10|end_char=12
4	the	the	DET	DT	Definite=Def|PronType=Art	3	_	_	start_char=13|end_char=16
5	Chicago	Chicago	PROPN	NNP	Number=Sing	4	_	_	start_char=17|end_char=24
6	7	7	NUM	CD	NumForm=Digit|NumType=Card	5	_	_	start_char=25|end_char=26
7	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	6	_	_	start_char=27|end_char=29
8	a	a	DET	DT	Definite=Ind|PronType=Art	7	_	_	start_char=30|end_char=31
9	2020	2020	NUM	CD	NumForm=Digit|NumType=Card	8	_	_	start_char=32|end_char=36
10	American	American	ADJ	JJ	Degree=Pos	9	_	_	start_char=37|end_char=45
11	historical	historical	ADJ	JJ	Degree=Pos	10	_	_	start_char=46|end_char=56
12	legal	legal	ADJ	JJ	Degree=Pos	11	_	_	start_char=57|end_ch

This format can be processed by several NLP libraries (Stanza, spaCy, NLTK, etc.).

In [92]:
!spacy convert data/output.conllu -c conllu data/

ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (135 documents): data/output.spacy


There is also a Python library, `conllu`, for reading files in this format.

In [93]:
import conllu

In [94]:
with open("data/output.conllu") as f:
    data = conllu.parse(f.read())

In [95]:
data[0][4]

{'id': 5,
 'form': 'Chicago',
 'lemma': 'Chicago',
 'upos': 'PROPN',
 'xpos': 'NNP',
 'feats': {'Number': 'Sing'},
 'head': 4,
 'deprel': '_',
 'deps': None,
 'misc': {'start_char': '17', 'end_char': '24'}}
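
Because CoNLL-U is plain tab-separated text, it can also be read without any library, which is part of what makes it easy to inspect and to keep under version control. The following is a minimal sketch assuming the ten-column layout seen above; the function name `read_conllu` is hypothetical, and the sample is a short excerpt copied from the file printed earlier.

```python
# Column names of the CoNLL-U format, in order
FIELDS = ['id', 'form', 'lemma', 'upos', 'xpos',
          'feats', 'head', 'deprel', 'deps', 'misc']

# A comment line and two token lines copied from the file printed above
sample = (
    "# sent_id = 0\n"
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t0\t_\t_\tstart_char=0|end_char=3\n"
    "2\tTrial\tTrial\tPROPN\tNNP\tNumber=Sing\t1\t_\t_\tstart_char=4|end_char=9\n"
    "\n"
)


def read_conllu(text):
    """Yield one list of token dicts per sentence; comment lines are skipped."""
    sentence = []
    for line in text.splitlines():
        if line.startswith('#'):
            continue  # sentence-level metadata such as '# sent_id = 0'
        if not line.strip():
            # A blank line marks the end of a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(FIELDS, line.split('\t'))))
    if sentence:
        yield sentence  # the input did not end with a blank line


sentences = list(read_conllu(sample))
print(sentences[0][1]['form'], sentences[0][1]['upos'])
```

Unlike the `conllu` library, this sketch leaves every value as a string: `feats` and `misc` are not split into dictionaries and `_` is not converted to `None`. For real projects the library is the safer choice, since it also handles cases such as multiword tokens.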

**For Milestone 1 of the project exercise, your team should gather the dataset(s) it is planning to use, perform standard preprocessing steps, and INSPECT THE RESULTS to uncover potential issues that need to be handled. Finally, datasets should be stored in CoNLL-U format and pushed to the repository together with short documentation of how the data was created.**