![](https://drive.google.com/uc?export=view&id=1L9JLLQHPZoMRwzYfmKcyM9VME_SHeZrr)

# TP 1: Text preprocessing

In this practical session, we will look at a few basic steps for processing textual data.

Within a computer, text is encoded as a string of characters.
In order to analyze textual data within NLP applications, we first need to properly preprocess it.
An NLP preprocessing pipeline generally consists of the following steps:
* sentence segmentation
* tokenisation
* normalization: lower-casing, lemmatization, optionally removing stop-words and punctuation
* pos-tagging
* named entity recognition
* parsing

The first two steps are necessary, while the others are optional.

For these exercises, we will mainly use the module **spacy** (already installed on Google Colab, but some additional libraries might be missing).

Spacy provides ways to carry out tasks such as segmentation, tokenization, lemmatization and pos-tagging.


We will also extract information from Wikipedia pages as an example.

## 0- Upload and read the text files

At first, we're going to use a text written in English. Then, we'll try to apply the tools to French.
We'll use the wikipedia library to extract pages from Wikipedia.

In [1]:
! pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=77134c60d82db71fcc2e7af620c34483715833a50285d93e2075b85e8011f3e4
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [2]:
import wikipedia
wikipedia.set_lang('en')
text_en = wikipedia.page("AdaLovelace")
print(text_en.content[:1000])

Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852), also known as Ada Lovelace, was an English mathematician and writer chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Lovelace was the only legitimate child of poet Lord Byron and reformer Anne Isabella Milbanke. All her half-siblings, Lord Byron's other children, were born out of wedlock to other women. Lord Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when she was eight. Lady Byron was anxious about her daughter's upbringing and promoted Lovelace's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Lovelace remained interested in her father, naming her two sons Byron and Gordon. Upon her death, she was buried next 

In [3]:
wikipedia.set_lang('fr')
text_fr = wikipedia.page("Lovelace")
print(text_fr.content[:1000])

Ada Lovelace, de son nom complet Augusta Ada King, comtesse de Lovelace, née Ada Byron le 10 décembre 1815 à Londres et morte le 27 novembre 1852 à Marylebone dans la même ville, est une pionnière de la science informatique.
Elle est principalement connue pour avoir réalisé le premier véritable programme informatique, lors de son travail sur un ancêtre de l'ordinateur : la machine analytique de Charles Babbage. Dans ses notes, on trouve en effet le premier programme publié, destiné à être exécuté par une machine, ce qui fait d'Ada Lovelace la première personne à avoir programmé au monde. Elle a également entrevu et décrit certaines possibilités offertes par les calculateurs universels, allant bien au-delà du calcul numérique et de ce qu'imaginaient Babbage et ses contemporains.


== Biographie ==


=== Environnement familial ===

Ada était la seule fille légitime du poète George Gordon Byron et de son épouse Annabella Milbanke, une femme intelligente et cultivée, cousine de Caroline La

## 1- Using Spacy

All info about Spacy: https://spacy.io/ ; More info on the pipelines: https://spacy.io/usage/processing-pipelines

Spacy is geared more towards real-world applications than NLTK, and generally performs better on the basic processing steps.

Spacy can be used to directly tokenize any text.
With spacy, we build a pipeline that does everything at once.
To make it work, you need to load a model specific to the target language, for example 'en_core_web_sm' for English (there are also some domain-specific models).


The model corresponds to a processing 'pipeline':
  by default, it includes tokenization, lemmatization and POS tagging, as well as the named entity recognition and dependency parsing used in the later sections

Using spacy:
- import the spacy module into Python
- load all the necessary models, e.g. for English


In [4]:
import spacy
nlp = spacy.load('en_core_web_sm')


Then process a text with the pipeline:



In [5]:
doc = nlp(text_en.content)

Try to see what is stored with the document, especially doc.vocab and doc.sents, and see what you get when you iterate over a document or its parts.

Hint: You can either access Spacy's manual on the internet to find out how to access the information, or look at the built-in help by typing help(doc). https://spacy.io/api/doc
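
As a minimal sketch of what iterating over a document yields, the snippet below builds a blank English pipeline with only a rule-based `sentencizer`, so that no trained model is needed. The names `toy_nlp` and `toy_doc` and the sample text are illustrative assumptions and are not part of the exercise; a trained model such as 'en_core_web_sm' segments sentences with its parser instead.

```python
import spacy

# Assumption: a blank pipeline (tokenizer only) plus a rule-based sentence splitter,
# used here so that the example runs without downloading a trained model
toy_nlp = spacy.blank("en")
toy_nlp.add_pipe("sentencizer")

toy_doc = toy_nlp("Ada Lovelace was born in London. She wrote the first program.")

# Iterating over a Doc yields Token objects
toy_tokens = [token.text for token in toy_doc]
print(toy_tokens)

# doc.sents yields one Span per sentence; iterating over a Span yields Tokens again
toy_sentences = [sent.text for sent in toy_doc.sents]
print(toy_sentences)

# doc.vocab holds the lexical entries known to the pipeline
print(len(toy_doc.vocab))
```

Using separate `toy_` names keeps the `nlp` and `doc` variables of the notebook untouched.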

In [35]:
len(doc.vocab)
v = [w.text for w in doc.vocab]
print(v[:10])



['Moque-t-on', 'ü.', 'Turing', 'symbole', 'Perfection', 'commencé', 'illustré', 'Cette', 'c.', 'qui']


In [37]:
for w in doc.sents:
  print(w[:2])

Ada Lovelace
Ada King
, est
Elle est
Dans ses
Elle a
==
==
=
Environnement familial
Ada
était la


Byron recherchait
nécessaire].
Lady
Melbourne (
lui suggère
Cette union
Ada naît
Le premier
Le prénom
Le 21
Il ne
Annabella adorait
Byron l’
Annabella fit
En 1832
Le 5
Ils deviennent
Parmi ses
Elle se
Ils auront
William était
La famille
Son titre
Elle est
La santé
À cette
Les études
Ada prend
Le 6
Je crois
Elle mentionne
énergie inépuisable
En 1841
Elle tourne
==
= Mémoire
sur la
=


En octobre
Charles Wheatstone
Elle passe
Babbage lui-même
n'intervient
Il demande
Babbage propose
Elle ajoute
La note
Le programme
De plus
On ne
dans quelle
Ce qui
Selon Betty
En revanche
Cela ne
Babbage,
Analytical Engine
, tous
Ada Lovelace
Dans d'
Babbage avait
Mais il
En revanche
Ada Lovelace
Un autre
Ceci est
La Machine
en fait
»

 
==
= Ruine
Dans l’
Elle travailla
Elle mourut
Elle laissait
Cette dernière
Elle fut
Notoriété posthume
Et c'
L'idée
Ada est
nécessaire].

Ada
Lovelace est
On peut
L'entrepris

### 1.1 Tokenisation

**Exercise 1:** Tokenize the text in French
* Find a model for French and tokenize the text in the file. Hint: you will need to download the model first, which can be done in the notebook using: *spacy.cli.download( model_name )*
* What does the *doc* variable contain? Hint: You can either access Spacy's manual on the internet to find out how to access the information, or look at the built-in help by typing help(doc). https://spacy.io/api/doc
* Print the individual tokens. Do you see any errors?
* How many tokens do we have?
* How many words / types / unique tokens do we have (i.e. vocabulary size)?
* Use Pandas to better visualize the results
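
The difference between the number of tokens and the number of types can be sketched with the standard library alone. The list `sample_tokens` below is a short hand-written sample standing in for the output of `[w.text for w in doc]`; its name and content are assumptions made for illustration.

```python
from collections import Counter

# Assumption: a small hand-written sample standing in for [w.text for w in doc]
sample_tokens = ['Ada', 'Lovelace', ',', 'de', 'son', 'nom', 'complet',
                 'Augusta', 'Ada', 'King', ',', 'comtesse', 'de', 'Lovelace', ',']

# Number of tokens: every occurrence counts
n_tokens = len(sample_tokens)

# Number of types (vocabulary size): each distinct form counts once
token_counts = Counter(sample_tokens)
n_types = len(token_counts)

print(n_tokens, n_types)
print(token_counts.most_common(3))
```

Whether punctuation marks or case variants should count as separate types is a design choice; filtering or lower-casing the tokens first changes the vocabulary size.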

In [109]:
import spacy.cli
spacy.cli.download("fr_core_news_sm")
spacy.cli.download("fr_core_news_lg")

✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [110]:
import spacy
import numpy as np

# Load the large French model
nlp = spacy.load('fr_core_news_lg')

# Process the text: the result is stored in our doc variable,
# and we can access its different annotations by looping through it
doc = nlp(text_fr.content)
print('Preprocessing done')

# Collect the text of each token and show the first 30
tokens = [w.text for w in doc]
print(tokens[:30])



Preprocessing done
['Ada', 'Lovelace', ',', 'de', 'son', 'nom', 'complet', 'Augusta', 'Ada', 'King', ',', 'comtesse', 'de', 'Lovelace', ',', 'née', 'Ada', 'Byron', 'le', '10', 'décembre', '1815', 'à', 'Londres', 'et', 'morte', 'le', '27', 'novembre', '1852']


Do you see any issues with tokenisation?


**Are there English words?**

#### Getting some help

Hint: You can either access Spacy's manual on the internet to find out how to access the information, or look at the built-in help by typing *help(doc)*.

https://spacy.io/api/doc

#### Pandas

You can use Pandas to better visualize the results

In [111]:
# Using pandas for a better visualization
import pandas as pd

spacy_tokens = [w.text for w in doc]
data = pd.DataFrame(spacy_tokens,
             columns=['Word'])

In [112]:
data["Word"].value_counts().sort_values(ascending=False)

| Word | count |
|---|---|
| , | 213 |
| de | 145 |
| . | 108 |
| Ada | 79 |
| et | 71 |
| ... | ... |
| Toute | 1 |
| 59 | 1 |
| Articles | 1 |
| connexes | 1 |


### 1.2 Sentence segmentation

**Exercise 5:**
Apart from token segmentation, Spacy has also automatically segmented our document into sentences.
* **(a)** Print out the different sentences of the document.
Hint: Look at the "Data descriptors" in the help page for 'doc'.



In [113]:
for sent in doc.sents:
  print(sent[:10])

Ada Lovelace, de son nom complet Augusta Ada King
Elle est principalement connue pour avoir réalisé le premier véritable
Dans ses notes, on trouve en effet le premier
Elle a également entrevu et décrit certaines possibilités offertes par
nécessaire]. Lady Melbourne (en) lui suggère sa propre
Cette union est ensuite encouragée par Augusta Leigh, la
Ada naît en décembre de cette même année.
Le premier prénom d'Ada, Augusta, aurait été
Le prénom Ada aurait été choisi par Byron lui-même,
Le 21 avril, Byron signe l'acte de séparation
Il ne les revit jamais.
Annabella adorait les
Byron l’appelait même parfois « la princesse des parallélogrammes
Annabella fit en sorte que les tuteurs d'Ada lui
En 1832, Ada rencontre Mary Somerville, éminente chercheuse
Le 5 juin 1833, Mary lui présente Charles Babbage
Ils deviennent très proches, Ada semblant trouver en Babbage
Parmi ses autres connaissances, on compte David Brewster,
Elle se marie en 1835 avec William King, 1er
Ils auront trois enfants : Byr

## 3- POS tagging

Remember that the model corresponds to a processing 'pipeline' in Spacy:
  - by default, it includes the tokenisation, the lemmatization and the POS tagging

**Exercise**
- print each individual token, together with its lemmatized form and part of speech tag
- Use Pandas to better visualize the results
- Look at the results: do you see any errors?
- You can use the function 'spacy.explain' to get information about an annotation, for example the POS tags. Apply it to each POS tag to get a more detailed label.


In [51]:
tokens = [w.text for w in doc]
len(tokens)

3556

In [71]:
tokens = [w for w in doc]
for token in tokens[:10]:
  # token.tag is the integer ID of the fine-grained tag; token.tag_ gives its string label
  print(token, token.lemma_, token.pos_, token.tag)

Ada Ada PROPN 96
Lovelace Lovelace PROPN 96
, , PUNCT 97
de de ADP 85
son son DET 90
nom nom NOUN 92
complet complet ADJ 84
Augusta Augusta PROPN 96
Ada Ada PROPN 96
King King PROPN 96


#### Pandas

You can use Pandas to better visualize the results

In [114]:
# Using pandas for a better visualization
import pandas as pd

# Store the token text (w.text) rather than the Token object itself
spacy_pos_tagged = [(w.text, w.lemma_, w.tag_, w.pos_) for w in doc]
pd.DataFrame(spacy_pos_tagged,
             columns=['Word', 'Lemma', 'POS tag', 'Tag type'])

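| | Word | Lemma | POS tag | Tag type |
|---|---|---|---|---|
| 0 | Ada | Ada | PROPN | PROPN |
| 1 | Lovelace | Lovelace | PROPN | PROPN |
| 2 | , | , | PUNCT | PUNCT |
| 3 | de | de | ADP | ADP |
| 4 | son | son | DET | DET |
| ... | ... | ... | ... | ... |
| 3551 | de | de | ADP | ADP |
| 3552 | l’ | l’ | DET | DET |
| 3553 | histoire | histoire | NOUN | NOUN |
| 3554 | des | de | ADP | ADP |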


#### Look at the results: do you see any issues?


#### Notes on POS tags:

* You can use the function 'spacy.explain' to get information about an annotation, for example the POS tags, see the code below.
* Here we used a small, coarse-grained set of POS tags (vs e.g. 36 in the PTB: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [115]:
# Inspect POS tags
all_tags = set()
for token in doc:
  all_tags.add(token.pos_)
for tag in all_tags:
  print( tag, spacy.explain(tag)) # explain each label

X other
SPACE space
NOUN noun
PROPN proper noun
PRON pronoun
ADV adverb
CCONJ coordinating conjunction
SYM symbol
SCONJ subordinating conjunction
ADJ adjective
VERB verb
PUNCT punctuation
NUM numeral
DET determiner
AUX auxiliary
ADP adposition


## 5- Named entity recognition

As part of the preprocessing pipeline, Spacy has also carried out named entity recognition.

**Exercise 8:**
* print out each named entity, together with the label assigned to it
* what do the labels stand for?
* Use the module called 'displacy' to visualize the Named Entities directly in the text.

In [116]:
for w in doc.ents:
  print(w.text, w.label_)

Ada Lovelace PER
Augusta Ada King MISC
comtesse de Lovelace PER
Ada Byron PER
Londres LOC
Marylebone LOC
Charles Babbage PER
Ada Lovelace PER
Babbage PER
Environnement LOC
Ada PER
George Gordon Byron PER
Annabella Milbanke PER
Caroline Lamb PER
Byron PER
Byron PER
Lady Melbourne PER
miss Milbanke PER
Augusta Leigh PER
Byron PER
Byron PER
Annabella PER
Ada PER
Ada MISC
Augusta LOC
Augusta PER
Byron PER
Ada MISC
Byron PER
Byron PER
Annabella PER
Byron PER
Ada MISC
Byron PER
acte de séparation MISC
Royaume-Uni LOC
Annabella PER
Byron PER
Annabella PER
Ada MISC
Ada PER
Mary Somerville PER
Mary PER
Charles Babbage PER
Ada MISC
Ada MISC
Babbage PER
David Brewster PER
Charles Wheatstone PER
Charles Dickens PER
Michael Faraday PER
William King PER
comte de PER
Byron PER
Annabella PER
Anne Blunt PER
Ralph Gordon PER
William PER
Ada MISC
Ada MISC
Ockham Park LOC
Ockham PER
La très honorable MISC
Augusta Ada PER
comtesse de Lovelace PER
Ada Lovelace PER
Lady Lovelace PER
Ada MISC
Babbage PER
Augu

In [117]:
# Explain the labels
for w in doc.ents[:10]:
  print(w.text, w.label_, spacy.explain(w.label_))

Ada Lovelace PER Named person or family.
Augusta Ada King MISC Miscellaneous entities, e.g. events, nationalities, products or works of art
comtesse de Lovelace PER Named person or family.
Ada Byron PER Named person or family.
Londres LOC Non-GPE locations, mountain ranges, bodies of water
Marylebone LOC Non-GPE locations, mountain ranges, bodies of water
Charles Babbage PER Named person or family.
Ada Lovelace PER Named person or family.
Babbage PER Named person or family.
Environnement LOC Non-GPE locations, mountain ranges, bodies of water


In [118]:
from spacy import displacy

# Try with style="dep" instead of "ent" too.
displacy.render(doc, style="ent", jupyter=True)


Note on Named Entity Recognition: do you see any issues?
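
One way to look for issues is to tally the labels and to check which surface forms receive more than one label. The sketch below does this with the standard library on the first ten (text, label) pairs printed above, copied by hand into a list; on the full document the pairs would instead come from `[(ent.text, ent.label_) for ent in doc.ents]`. The variable names are illustrative assumptions.

```python
from collections import Counter, defaultdict

# First ten (text, label) pairs, copied by hand from the output printed above
entity_pairs = [
    ("Ada Lovelace", "PER"), ("Augusta Ada King", "MISC"),
    ("comtesse de Lovelace", "PER"), ("Ada Byron", "PER"),
    ("Londres", "LOC"), ("Marylebone", "LOC"),
    ("Charles Babbage", "PER"), ("Ada Lovelace", "PER"),
    ("Babbage", "PER"), ("Environnement", "LOC"),
]

# How often is each label assigned?
label_counts = Counter(label for _, label in entity_pairs)
print(label_counts)

# Which labels does each surface form receive? More than one label hints at inconsistency.
labels_by_text = defaultdict(set)
for text, label in entity_pairs:
    labels_by_text[text].add(label)
ambiguous = {text: labels for text, labels in labels_by_text.items() if len(labels) > 1}
print(ambiguous)
```

On this small sample no form has two labels, but in the longer listing above 'Ada' appears both as PER and as MISC, which the same check would surface.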


## 6- Parsing

Finally, as part of the pipeline, Spacy has also performed dependency parsing (note that each module can be disabled if not needed).

**Exercise 9:** Extract all Noun Phrases from the file

* Retrieve the information from the dependency parses: dependent and head of each token
* Use displacy to visualize a parse tree: first try with a simple sentence (e.g. *La petite brise la glace.*) then use the first sentence of the document.
* Navigating the parse tree. Each element of the tree is associated to attributes: you can use them to inspect the different elements of the trees:
  * Define a Pandas dataframe with each token id associated with its head and with the relation between them. Also include the children of the current token, if there are any.
* Print all the adjectives and the noun they modify



In [98]:
sentences = [w.text for w in doc.sents]
print(sentences[0])

Ada Lovelace, de son nom complet Augusta


In [121]:
# Visualize the dependency tree of a simple sentence
doc2 = nlp("la petite fille brise la glace")
displacy.render(doc2, style="dep", jupyter=True)

# First sentence of the document, to visualize next
phrase = sentences[0]



In [None]:
# Visualization using displacy



#### Navigating the parse tree

In [None]:
# Navigating the parse tree


In [None]:
# Print the NOUNS modified by an ADJECTIVE


## 7- Putting it all together

Now we are going to use the skills practiced in the preceding exercises to build a simple question-answering system on a toy dataset (in French).

We will focus on specific questions of the form "Qui a peint X ?".
We will define patterns based on different ways of formulating this question, and use them to extract the answer from a small toy corpus based on Wikipedia pages on paintings.

When you're done with this exercise, try to answer other types of questions, such as "Où est exposé X ?", "Quand a été peinte X ?".

Below, we reload the spacy French model adding specific options to merge named entities containing multiple tokens.

In [None]:
# Load the model
nlp = spacy.load('fr_core_news_sm')

nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

Here is the list of questions we will consider.
You also need a corpus of source documents, which you can find on Moodle (corpus_qa.txt).

In [None]:
question_list = [
    'La Joconde est un tableau de qui ?',
    'Le radeau de la méduse est une peinture réalisée par qui ?'
]
corpus = 'corpus_qa.txt'

**Exercise 10:** In this part, we focus on the first question. This question is designed to be similar to the document containing the answer. We can thus define a pattern based on its structure to extract the answer from the document.

- Process the question using the spacy nlp pipeline
- display its parse tree and / or print a Pandas dataframe containing information from the parse tree
- Now, to define a lexico-syntactic pattern, you are going to use spacy's *DependencyMatcher*: https://spacy.io/usage/rule-based-matching#dependencymatcher
  * Look at the documentation to understand how it works
  * Define a pattern that should match 'qui' in the question
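
To get a feel for the pattern format before tackling the question itself, here is a minimal sketch on a different, hand-annotated sentence. It matches a verb together with its subject; the sentence, its annotations and all `toy_` names are assumptions made for illustration, and the pattern is deliberately not the one asked for in the exercise.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# Assumption: a hand-annotated sentence, so that no trained model is needed
toy_nlp = spacy.blank("fr")
toy_doc = Doc(
    toy_nlp.vocab,
    words=["Ada", "écrit", "un", "programme"],
    pos=["PROPN", "VERB", "DET", "NOUN"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "det", "obj"],
)

# One dict per node: the first is the anchor, the others are attached to an
# already-defined node (LEFT_ID) through a relation operator (REL_OP)
toy_pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},  # ">" means: "verb" is the direct head of "subject"
]

toy_matcher = DependencyMatcher(toy_nlp.vocab)
toy_matcher.add("SUBJECT_OF_VERB", [toy_pattern])

toy_matches = toy_matcher(toy_doc)
match_id, token_ids = toy_matches[0]

# token_ids follows the order of the dicts in the pattern
matched = {node["RIGHT_ID"]: toy_doc[i].text for node, i in zip(toy_pattern, token_ids)}
print(matched)
```

Mapping each `RIGHT_ID` to its matched token makes it easy to pick out the node that carries the answer.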

In [None]:
# Display the parse tree of the first question


In [None]:
from spacy.matcher import DependencyMatcher
# Define a lexico-syntactic pattern that allows to retrieve the answer

pattern_tableau = [

                   # ...

]

# If you match the pattern to the original question, it should output 'qui'

matcher = DependencyMatcher(nlp.vocab)
matcher.add("tableau", [pattern_tableau])

matches = matcher(q1_prep)

print(matches) # [(4851363122962674176, [6, 0, 10, 9])]
# Each token_id corresponds to one pattern dict
match_id, token_ids = matches[0]

for i in range(len(token_ids)):
  if pattern_tableau[i]["RIGHT_ID"] == 'mod':
    print(q1_prep[token_ids[i]].text)

**Exercise**:
Retrieve the documents that are relevant to the question, i.e. the ones containing the keyword 'La Joconde'.

It is recommended to define a function, that could be used for the next exercises. It could work by:

* first indexing all the documents using the named entities present in each document (i.e. build a dictionary mapping a named entity to all documents where it is present)
* the *retrieve_documents(...)* function should then try to match the input keyword with a named entity and return the matching documents.
* Test the function with 'Joconde':
  * Does it work with the method based on named entities?
  * Add a backup solution: if no document is found, simply try to find the string corresponding to the keyword in the document
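
The indexing-plus-fallback idea can be sketched in plain Python. In the snippet below, the two documents and their entity lists are a made-up mini-corpus (an assumption: in the exercise, the texts come from corpus_qa.txt and the entities from `doc.ents`), and the helper `build_entity_index` as well as the exact signature of `retrieve_documents` are hypothetical choices.

```python
from collections import defaultdict

# Assumption: a made-up mini-corpus with hand-written entity lists
toy_documents = [
    {"text": "La Joconde est un tableau de Léonard de Vinci.",
     "entities": ["La Joconde", "Léonard de Vinci"]},
    {"text": "Le Cri est une œuvre d'Edvard Munch.",
     "entities": ["Edvard Munch"]},
]

def build_entity_index(documents):
    """Map each lower-cased named entity to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, document in enumerate(documents):
        for entity in document["entities"]:
            index[entity.lower()].add(doc_id)
    return index

def retrieve_documents(keyword, documents, index):
    """Look the keyword up in the entity index; fall back to substring search."""
    doc_ids = index.get(keyword.lower(), set())
    if not doc_ids:
        # Backup solution: plain string matching on the document text
        doc_ids = {i for i, d in enumerate(documents)
                   if keyword.lower() in d["text"].lower()}
    return [documents[i]["text"] for i in sorted(doc_ids)]

toy_index = build_entity_index(toy_documents)
print(retrieve_documents("La Joconde", toy_documents, toy_index))  # found through the index
print(retrieve_documents("Joconde", toy_documents, toy_index))     # found through the fallback
```

Lower-casing both the index keys and the keyword is a simple normalization choice; exact matching on entity strings is what makes 'Joconde' miss the index entry 'La Joconde' here.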

In [None]:
# Retrieve the document that is relevant to the question, i.e. the one
# containing 'La Joconde'
# Indexation of the documents based on named entities
# + if no NE found, simple pattern matching


#

**Exercise 12:** Test the pattern

- Apply the pattern to each sentence of the retrieved document, do you find the right answer?

In [None]:
#  Apply the pattern to each sentence of this document, do you find the right answer?




**Exercise 13:** Now define a pattern to answer the second question, and find the answer!

Be careful, there is a little issue here:
- Display the parse tree for the question and the matching document: what do you observe?
- Build a pattern to match the question and the answer (Hint: you can use a list of dependency relations in the pattern, using e.g. *"RIGHT_ATTRS": {"DEP": {"IN": ["acl", "advcl"]}}*)
- Finally, retrieve the answer

In [None]:
# Display the parse tree of the second question


In [None]:
# Display the parse tree of the matching document (first and unique sentence)


In [None]:
# Define a pattern that matches both the question and answer
# here we have a slight difference between the dep rel label for quest and answer
from spacy.matcher import DependencyMatcher
pattern_peinture = [
  #...

]



In [None]:
#  Apply the pattern to each sentence of this document, do you find the right answer?


**Exercise** Use the patterns defined above to find the answers to the questions below. Here, we know that we're looking for the painter: we don't want to match the question to the answer, but to test known patterns to find the right answer.

- find a way to extract the name of the painting from the question
- retrieve the relevant document
- test the patterns defined previously to extract the correct answer

In [None]:
new_questions = [
    'Qui a peint American Gothic ?',
    'Qui est l\'auteur de la peinture La Nuit étoilée',
    'Qui a réalisé Un Garrochista ?',
]

In [None]:
# Search entities in the question (there should be only one..)


In [None]:
# Retrieve document


In [None]:
# Test the documents against several patterns

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PAINTER", [pattern_tableau, pattern_peinture])






**Exercise** To go further

Try to:
- find who painted 'Le Cri' and 'Arearea'
- find where 'La Joconde' and 'Arearea' are located