# Text Preprocessing
Preprocessing of text data has long been a key enabler for natural language processing. As the focus was on "pure" text data, side information such as formatting was not considered relevant.

With the advent of large language models and their application to essentially any form of text, including HTML markup and Python programs, preprocessing has lost much of its former importance. The expectation today is that a large language model handles elements such as HTML tags appropriately: it should identify them and either suppress them (when asked to extract the text) or place them correctly (when prompted to produce a well-formed website).
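
To make the classical step concrete: removing markup before any further processing used to be a typical preprocessing task. The following is only a minimal sketch using Python's standard library; the class name `TextExtractor` and the sample HTML string are made up for illustration and are not part of the material used later in this notebook.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML snippet and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)

    def get_text(self):
        # Join the collected pieces and normalise whitespace
        return " ".join("".join(self.parts).split())


html_snippet = "<p>Preprocessing <b>removes</b> markup.</p>"  # made-up example input
extractor = TextExtractor()
extractor.feed(html_snippet)
extractor.close()
plain_text = extractor.get_text()
print(plain_text)
```

Note that this simple version also keeps the contents of `<script>` and `<style>` elements, which a real cleaning step would usually want to drop.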

For this reason, we will treat preprocessing rather briefly and just highlight a few examples. We will work with spaCy.

## Preparation

In [None]:
import spacy

In [None]:
# if necessary, install spacy and language models via anaconda or similar
# en_core_web_sm is called spacy-model-en_core_web_sm in Anaconda
# Or, directly from the notebook, you can install it with the following command:
# !python -m spacy download en_core_web_sm

In [None]:
# Load a language model - here we choose a rather small one for English, trained on data scraped from the web.
nlp = spacy.load("en_core_web_sm")

`nlp` is a 'traditional' language model, i.e., it contains the components of the classical NLP pipeline.

In [None]:
text = "The CAS AIS provides a targeted education in software, machine learning (ML) and artificial intelligence (AI)"

We can directly run this text through the model:

In [None]:
nlp(text)

## Tokenization

We can now access the individual tokens of the text as a list:

In [None]:
print([str(token) for token in nlp(text)])

In [None]:
print([str(token) for token in nlp(text.lower())])

Our default sentence is rather simple in this regard. We move on to a more complicated one:

In [None]:
text = "Mary, don’t slap the green witch"
doc = nlp(text.lower())
print([str(token) for token in doc])

Here we see that "don't" has been split into 'do' and "n't" (for 'not'), which separates the two words that were contracted into one.
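
This split is produced by the rule-based tokenizer, which needs no trained components. As a minimal sketch (the variable name `blank_nlp` is introduced here only for illustration), a blank English pipeline created with `spacy.blank` should be enough to reproduce it:

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no trained model has to be downloaded for this example.
blank_nlp = spacy.blank("en")

tokens = [token.text for token in blank_nlp("Mary, don't slap the green witch")]
print(tokens)

# token.norm_ holds a normalised form of the token text
for token in blank_nlp("don't"):
    print(repr(token.text), "->", repr(token.norm_))
```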

## Lemmatization and Morphology
spaCy provides much more information about the tokens in the text, such as each token's lemma (base form) and its morphological features:

In [None]:
# doc = nlp(u"he was running late")
for token in doc:
    print('{} -> {}: {}'.format(token, token.lemma_, token.morph))

In [None]:
doc = nlp("Andreas Streich was running late")
for token in doc:
    print('{} -> {}: {}'.format(token, token.lemma_, token.morph))

## Sentence Parsing
Next, we can identify the part of speech (PoS) of each token, i.e., its word class, such as noun, verb, or adjective:

In [None]:
doc = nlp("Mary slapped the green witch.")
for token in doc:
    print('{} - {}'.format(token, token.pos_))

In [None]:
doc = nlp("he was running late.")
for token in doc:
    print('{} - {}'.format(token, token.pos_))

In [None]:
doc = nlp("The CAS AIS provides a targeted education in software, machine learning (ML) \
and artificial intelligence (AI). It is offered by ETH Zurich")

for token in doc:
    print('{} - {}'.format(token, token.pos_))
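
The tag abbreviations printed above are not always self-explanatory. As a small sketch, `spacy.explain` looks them up in spaCy's built-in glossary; the selection of labels below is arbitrary.

```python
import spacy

# spacy.explain returns a short description of a label,
# or None if the label is not in spaCy's glossary.
for label in ["PROPN", "NOUN", "ADP", "AUX"]:
    print(label, "-", spacy.explain(label))
```

The same helper can be used for entity labels such as `ORG` and dependency labels such as `nsubj`, which appear in the sections further down.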

## Stop Words
Stop words are very common words that are considered to be uninformative and are therefore often removed in classical NLP approaches.

In [None]:
for token in doc:
    print('{} - {}'.format(token.text, token.is_stop))
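
Removing the stop words then amounts to a simple filter on `is_stop`. Since this flag is a lexical attribute, the sketch below uses a blank English pipeline (the names `blank_nlp` and `content_words` are chosen here for illustration) and needs no trained model.

```python
import spacy

blank_nlp = spacy.blank("en")  # tokenizer and lexical attributes only, no trained model
doc = blank_nlp("The CAS AIS provides a targeted education in software")

# Keep only the tokens that are not flagged as stop words
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)

# The underlying stop word list is a plain Python set of strings
stop_words = blank_nlp.Defaults.stop_words
print(len(stop_words), "stop words, e.g.", sorted(stop_words)[:5])
```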

## Noun Chunks and Named Entities
`spaCy` can also identify different noun chunks, i.e., base noun phrases:

In [None]:
for chunk in doc.noun_chunks:
    print('{} - {}'.format(chunk, chunk.label_))

Next, we look at named entities, i.e., persons, organisations, etc. These are often of particular interest for information extraction (who is this text about?), and they need special handling when processing the text: their names can consist of several words, and they are typically not translated:

In [None]:
mydoc = nlp("The CAS AIS provides a targeted education in software, machine learning (ML) \
and artificial intelligence (AI). It is offered by ETH Zurich")

In [None]:
for ent in mydoc.ents:
    print(ent.text, ent.label_)

In [None]:
mydoc = nlp("My name is Andreas Streich. I work at ETH Zurich. "
            "Last year, I was travelling to the United States of America")
for ent in mydoc.ents:
    print(ent.text, '-->', ent.label_)

## Dependency Parsing
Furthermore, we can identify how the parts of the sentence depend on each other (e.g., which noun phrase is the subject or the object of which verb):

In [None]:
mydoc = nlp("The CAS AIS provides a targeted education in software, machine learning (ML) \
and artificial intelligence (AI). It is offered by ETH Zurich")

In [None]:
for chunk in mydoc.noun_chunks:
    print(chunk.text, "-", chunk.root.text, "-", chunk.root.dep_, "-", chunk.root.head.text)

For a visual presentation, we can use the `displacy` component of `spaCy`.

**A technical hint**: if you are running this as a Jupyter notebook, calling `displacy.serve(...)` will keep the cell busy (you will see a `*` on the left side, and you cannot run any other cell), because it starts a local web server. To stop the cell and continue with other parts of the notebook, interrupt it with the "stop" button in the top ribbon.

In [None]:
from spacy import displacy

In [None]:
# nlp = spacy.load("en_core_web_sm")
# doc = nlp("Andreas Streich was running late.")
displacy.serve(mydoc, style="dep")

In [None]:
mydoc = nlp("Andreas Streich was running late.")
displacy.serve(mydoc, style="dep")