# Text Preprocessing
Preprocessing of text data has long been a key enabler for natural language processing. As the focus was on "pure" text data, side information such as formatting was not considered relevant.

With the advent of large language models and their application to essentially any form of text, including HTML markup and Python programs, preprocessing has lost much of its former importance. The expectation today is that a large language model handles elements such as HTML tags appropriately: it should identify them and either suppress them (when asked to extract the text) or place them correctly (when prompted to produce a correctly coded website).

For this reason, we will treat preprocessing rather briefly and just highlight a few examples. We will work with spaCy.

## Preparation

In [1]:
import spacy

In [2]:
# if necessary, install spacy and language models via anaconda or similar
# en_core_web_sm is called spacy-model-en_core_web_sm in Anaconda
# Or, directly from the notebook, you can install it with the following command:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
# Load a language model - here we choose a rather small one for English, trained on data scraped from the web.
nlp = spacy.load("en_core_web_sm")

`nlp` is a 'traditional' language model, i.e., it contains the components of the classical NLP pipeline.

For a visual presentation, we can use the `displacy` component of `spaCy`.

**TECHNICAL HINT**: In a Jupyter notebook, calling `displacy.serve(...)` keeps the cell busy (you will see a `*` on the left side, and you cannot run any other cell). To stop the cell and continue with other parts of the notebook, interrupt it with the "stop" button in the top ribbon.
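
If blocking the notebook is undesirable, `displacy.render` is a possible alternative: it returns the markup instead of starting a web server. The following is a minimal sketch and not part of the original notebook. It assumes a blank English pipeline (tokenizer only) and sets one entity by hand, so that the example does not depend on the predictions of a trained model; the names `blank_nlp`, `demo_doc` and `html` are chosen for illustration only.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained components (assumption for this sketch).
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("It is offered by ETH Zurich")

# Mark tokens 4 and 5 ("ETH Zurich") as an organisation by hand.
demo_doc.ents = [Span(demo_doc, 4, 6, label="ORG")]

# jupyter=False makes render() return the HTML as a string instead of displaying it.
html = displacy.render(demo_doc, style="ent", jupyter=False)
print(html[:200])
```

Inside a notebook, leaving out `jupyter=False` lets spaCy detect the notebook environment and display the visualization inline, without blocking the cell.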

In [4]:
from spacy import displacy

In [5]:
text = "The CAS AIS provides a targeted education in software, machine learning (ML) and artificial intelligence (AI)"

We can directly run this text through the model:

In [6]:
nlp(text)

The CAS AIS provides a targeted education in software, machine learning (ML) and artificial intelligence (AI)

## Tokenization

We can now access the individual tokens of the text as a list:

In [7]:
print([str(token) for token in nlp(text)])

['The', 'CAS', 'AIS', 'provides', 'a', 'targeted', 'education', 'in', 'software', ',', 'machine', 'learning', '(', 'ML', ')', 'and', 'artificial', 'intelligence', '(', 'AI', ')']


In [8]:
print([str(token) for token in nlp(text.lower())])

['the', 'cas', 'ais', 'provides', 'a', 'targeted', 'education', 'in', 'software', ',', 'machine', 'learning', '(', 'ml', ')', 'and', 'artificial', 'intelligence', '(', 'ai', ')']


Our default sentence is rather simple in this regard. We move on to a more complicated one:

In [9]:
text = "Mary, don’t slap the green witch"
doc = nlp(text.lower())
print([str(token) for token in doc])

['mary', ',', 'do', 'n’t', 'slap', 'the', 'green', 'witch']


Here we see that "don’t" has been split into 'do' and 'n’t' (for 'not'), separating the two words that were contracted into one.
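
spaCy's tokenizer is rule-based, so splits like this one do not rely on the trained components of the model. The sketch below illustrates this under the assumption that a blank English pipeline (created with `spacy.blank("en")`) is an acceptable stand-in for the tokenizer of `en_core_web_sm`; the variable names are chosen for illustration. It also shows that tokenization is non-destructive: each token keeps its trailing whitespace, so the original text can be rebuilt.

```python
import spacy

# Tokenizer-only pipeline; no model download needed (assumption for this sketch).
tokenizer_nlp = spacy.blank("en")

sample = "Mary, don't slap the green witch"
sample_doc = tokenizer_nlp(sample)

tokens = [token.text for token in sample_doc]
print(tokens)

# text_with_ws is the token text plus any whitespace that followed it.
rebuilt = "".join(token.text_with_ws for token in sample_doc)
print(rebuilt == sample)
```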

## Lemmatization and Morphology
spaCy returns much more information about the tokens in the text, such as their lemmas and morphological features:

In [10]:
# doc = nlp(u"he was running late")
for token in doc:
    print('{} -> {}: {}'.format(token, token.lemma_, token.morph))

mary -> mary: Number=Sing
, -> ,: PunctType=Comm
do -> do: Mood=Ind|Tense=Pres|VerbForm=Fin
n’t -> not: Polarity=Neg
slap -> slap: VerbForm=Inf
the -> the: Definite=Def|PronType=Art
green -> green: Number=Sing
witch -> witch: Number=Sing


In [11]:
doc = nlp(u"Andreas Streich was running late")
for token in doc:
    print('{} -> {}: {}'.format(token, token.lemma_, token.morph))

Andreas -> Andreas: Number=Sing
Streich -> Streich: Number=Sing
was -> be: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
running -> run: Aspect=Prog|Tense=Pres|VerbForm=Part
late -> late: 
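
The morphological features are printed as `Key=Value` pairs separated by `|`. To work with individual features, spaCy's `token.morph` offers `get(...)` and `to_dict()`. The sketch below shows the same idea in plain Python on one of the strings printed above; `parse_morph` is a hypothetical helper written for this illustration, not a spaCy function.

```python
def parse_morph(features: str) -> dict:
    """Turn 'Key=Value|Key=Value' into a dict; an empty string gives {}."""
    if not features:
        return {}
    return dict(pair.split("=", 1) for pair in features.split("|"))


# Feature string copied from the output for "was" above.
was = parse_morph("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
print(was["Tense"])     # Past
print(parse_morph(""))  # {} (as for "late", which has no features)
```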


## Sentence Parsing
Next, we can identify the part of speech (PoS) of each token in the sentence:

In [12]:
doc = nlp("Mary slapped the green witch.")
for token in doc:
    print('{} - {}'.format(token, token.pos_))

Mary - PROPN
slapped - VERB
the - DET
green - PROPN
witch - PROPN
. - PUNCT


In [13]:
doc = nlp("he was running late.")
for token in doc:
    print('{} - {}'.format(token, token.pos_))

he - PRON
was - AUX
running - VERB
late - ADV
. - PUNCT


In [14]:
doc = nlp("The CAS AIS provides a targeted education in software, machine learning (ML) \
and artificial intelligence (AI). It is offered by ETH Zurich")

for token in doc:
    print('{} - {}'.format(token, token.pos_))

The - DET
CAS - PROPN
AIS - PROPN
provides - VERB
a - DET
targeted - ADJ
education - NOUN
in - ADP
software - NOUN
, - PUNCT
machine - NOUN
learning - NOUN
( - PUNCT
ML - PROPN
) - PUNCT
and - CCONJ
artificial - ADJ
intelligence - NOUN
( - PUNCT
AI - PROPN
) - PUNCT
. - PUNCT
It - PRON
is - AUX
offered - VERB
by - ADP
ETH - PROPN
Zurich - PROPN
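
The tag abbreviations in these outputs can be looked up with `spacy.explain`, which returns a short description for labels in spaCy's glossary (and `None` for labels it does not know). A small sketch, with the list of tags picked by hand from the outputs above:

```python
import spacy

# Tags taken from the outputs above; the selection is arbitrary.
tags = ["PROPN", "AUX", "ADP", "CCONJ", "DET"]

descriptions = {tag: spacy.explain(tag) for tag in tags}
for tag, description in descriptions.items():
    print(tag, "-", description)
```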


In [15]:
displacy.serve(doc, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Stop Words
Stop words are very common words that are considered to be uninformative and therefore often removed in classical NLP approaches.

In [16]:
for token in doc:
    print('{} - {}'.format(token.text, token.is_stop))

The - True
CAS - False
AIS - False
provides - False
a - True
targeted - False
education - False
in - True
software - False
, - False
machine - False
learning - False
( - False
ML - False
) - False
and - True
artificial - False
intelligence - False
( - False
AI - False
) - False
. - False
It - True
is - True
offered - False
by - True
ETH - False
Zurich - False
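
In a classical pipeline, the `is_stop` flag is typically used to filter tokens. The sketch below shows one way this could look; it assumes a blank English pipeline, which is sufficient here because `is_stop` and `is_punct` are lexical attributes and do not depend on trained components. The variable names are chosen for illustration.

```python
import spacy

# Tokenizer-only pipeline (assumption for this sketch).
stop_nlp = spacy.blank("en")
stop_doc = stop_nlp("The CAS AIS provides a targeted education in software.")

# Keep only tokens that are neither stop words nor punctuation.
content_tokens = [t.text for t in stop_doc if not t.is_stop and not t.is_punct]
print(content_tokens)

# The underlying stop word list is available as a set on the language defaults.
print(len(stop_nlp.Defaults.stop_words))
```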


## Noun Chunks and Named Entities
`spaCy` can also identify different noun chunks, i.e., base noun phrases:

In [17]:
for chunk in doc.noun_chunks:
    print('{} - {}'.format(chunk, chunk.label_))

The CAS AIS - NP
a targeted education - NP
software - NP
machine learning - NP
ML - NP
artificial intelligence - NP
AI - NP
It - NP
ETH Zurich - NP


Next, we want to look at named entities, i.e., persons, organisations, etc. These are often of particular interest (in the sense of information extraction: who is this text about?), and they need to be handled specially when processing the text: their names can consist of several words, and there is typically no translation:

In [18]:
mydoc = nlp("The CAS AIS provides a targeted education in software, machine learning (ML) \
and artificial intelligence (AI). It is offered by ETH Zurich")

In [19]:
for ent in mydoc.ents:
    print(ent.text, ent.label_)

CAS ORG
AIS ORG
ML ORG
ETH Zurich ORG


In [20]:
mydoc = nlp("My name is Andreas Streich. I work at ETH Zurich. \
             Last year, I was travelling to the United States of America")
for ent in mydoc.ents:
    print(ent.text, '-->', ent.label_)

Andreas Streich --> PERSON
ETH Zurich --> ORG
Last year --> DATE
the United States of America --> GPE
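
In the first output above, the statistical model splits "CAS AIS" into two entities and also labels "ML" as an organisation. One possible way to handle such domain-specific names is to add rule-based patterns with spaCy's `entity_ruler` component (spaCy v3 API). The following is only a sketch: it uses a blank pipeline so that it runs without a trained model, and the label `PROGRAM` is a hypothetical label invented for this example.

```python
import spacy

# Blank pipeline plus a rule-based entity component (assumption for this sketch).
ruler_nlp = spacy.blank("en")
ruler = ruler_nlp.add_pipe("entity_ruler")

# Exact phrase patterns; "PROGRAM" is a made-up label for illustration.
ruler.add_patterns([
    {"label": "PROGRAM", "pattern": "CAS AIS"},
    {"label": "ORG", "pattern": "ETH Zurich"},
])

ruler_doc = ruler_nlp("The CAS AIS is offered by ETH Zurich")
found = [(ent.text, ent.label_) for ent in ruler_doc.ents]
print(found)
```

With a trained pipeline, the same component could be added before the statistical recognizer, e.g. `nlp.add_pipe("entity_ruler", before="ner")`, so that the hand-written patterns take precedence.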


In [21]:
displacy.serve(mydoc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Dependency Parsing
Furthermore, we can identify which part of the sentence depends on which other part, and in which role (e.g., subject, object, etc.):

In [22]:
mydoc = nlp("The CAS AIS provides a targeted education in software, machine learning (ML) \
and artificial intelligence (AI). It is offered by ETH Zurich")

In [23]:
for chunk in mydoc.noun_chunks:
    print(chunk.text, "-", chunk.root.text, "-", chunk.root.dep_, "-", chunk.root.head.text)

The CAS AIS - AIS - nsubj - provides
a targeted education - education - dobj - provides
software - software - pobj - in
machine learning - learning - conj - software
ML - ML - appos - learning
artificial intelligence - intelligence - conj - learning
AI - AI - appos - intelligence
It - It - nsubjpass - offered
ETH Zurich - Zurich - pobj - by


In [24]:
# doc = nlp("Andreas Streich was running late.")
displacy.serve(mydoc, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [25]:
mydoc = nlp("Andreas Streich was running late.")
displacy.serve(mydoc, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


**Exercise:** Try out some sentences of your own. Can you find some sentences that the model does not handle well?