<h1 style="font-size:21px; font-weight:bold; background:#DDEEEE;padding: 15px;">References</h1>

### Install & configuration

- Select models: [Spacy Trained Models and Pipelines](https://spacy.io/models)
- Explore components in [French Models](https://spacy.io/models/fr)
- Install [Spacy transformers](https://spacy.io/universe/project/spacy-transformers)
  - Wraps Hugging Face's [transformers](https://github.com/huggingface/transformers) package
  - BERT, GPT-2, XLNet, etc.
- Corpora: https://spacy.io/api/corpus

### Linguistic Features

- [`Token` attributes](https://spacy.io/api/token#attributes)
- Parts of speeach: [Uniserval POS tagset](https://universaldependencies.org/u/pos/)
- Syntactic dependencies: [Stanford typed dependencies manual](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf)
- [NER 101](https://spacy.io/usage/linguistic-features#named-entities)
- French stop words:   
  `/Users/macbook/anaconda3/lib/python3.7/site-packages/spacy/lang/fr/stop_words.py`

### Library data structures

- [Language object](https://spacy.io/api/language) (`nlp`)
- [Vocab object](https://spacy.io/api/vocab)
- [Doc object](https://spacy.io/api/doc) (output of `nlp` applied to text)
- [Token object](https://spacy.io/api/token) (tokens and linguistic features)

<h1 style="font-size:21px; font-weight:bold; background:#DDEEEE;padding: 15px;">Install and Configuration</h1>

In [None]:
# Create a virtual environment
! python3 -m venv myenv
! source ./myenv/bin/activate

# Install Spacy
! pip install spacy

# Install models and data per language
! python -m spacy download fr_core_news_sm

! pip install spacy-transformers
! pip install fr_dep_news_trf

# Install other tools
! pip install textacy

 <a class="anchor" id="section1"></a>

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2022.53.13.png" width="900px">

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2022.57.15.png" width="900px">

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2021.53.48.png" width="600">

 <a class="anchor" id="section2"></a>

<h1 style="font-size:21px; font-weight:bold; background:#DDEEEE;padding: 15px;">Pipeline</h1>

### Read input

In [24]:
import spacy
nlp = spacy.load('fr_core_news_sm')

filename = 'test_files/introduction.txt'
text = open(filename).read()
doc = nlp(text)
print('Pipeline: ', nlp.pipe_names)

Pipeline:  ['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### Sentences

In [25]:
sentences = list(doc.sents)
print(f'Found {len(sentences)} sentences.')

#for sentence in sentences:
#    print(f'{sentence.start:<4}, {sentence.end:<4}, {sentence}')

Found 6 sentences.


### Stop words

In [26]:
spacy_stopwords = spacy.lang.fr.stop_words.STOP_WORDS
print(len(spacy_stopwords), list(spacy_stopwords)[:8])

507 ['s’', 'différent', 'derrière', 'la', 'lors', 'nous-mêmes', 'pourrait', 'bas']


In [27]:
doc_stop_words = [token for token in doc if token.is_stop]
print(f'\nFound {len(doc_stop_words)} stop words in this document.')


Found 16 stop words in this document.


### Word frequency

In [28]:
words = [token.text for token in doc
         if not token.is_stop and not token.is_punct]

from collections import Counter
word_freq = Counter(words)

common_words = word_freq.most_common(5)
unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print(len(common_words), len(unique_words))

5 17


### Tokens, morphological and part-of-speech features

In [29]:
import pandas as pd

cols = ('i', 'idx', 'token', 'lemma', 'tag', 'POS', 'explain', 'morph')
rows = []

for token in doc[:3]:
    row = [token.i, token.idx, 
           token.text, token.lemma_,
           token.tag_, token.pos_, spacy.explain(token.tag_),
           token.morph]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
print(df)

   i  idx     token     lemma    tag    POS      explain  \
0  0    0        Ce        ce    DET    DET   determiner   
1  1    3  tutoriel  tutoriel   NOUN   NOUN         noun   
2  2   12     Spacy     Spacy  PROPN  PROPN  proper noun   

                                      morph  
0  (Gender=Masc, Number=Sing, PronType=Dem)  
1                (Gender=Masc, Number=Sing)  
2                                        ()  


### Syntactic dependency features

In [30]:
cols = ('i', 'idx', 'token', 'lemma', 'syn_head', 'syn_dep', 'left_edge', 'right_edge')
rows = []

for token in doc[:3]:
    row = [token.i, token.idx, 
           token.text, token.lemma_,
           token.head.text, token.dep_,
           token.left_edge, token.right_edge] # (left|right)most token in descendants
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
print(df)

   i  idx     token     lemma  syn_head    syn_dep left_edge right_edge
0  0    0        Ce        ce  tutoriel        det        Ce         Ce
1  1    3  tutoriel  tutoriel  explique      nsubj        Ce      Spacy
2  2   12     Spacy     Spacy  tutoriel  flat:name     Spacy      Spacy


### Navigating the tree and subtree

In [31]:
token = doc[1]   # 'tutoriel'

print([tok.text for tok in token.children])   # syntactic dependents
print(token.nbor(-1))                         # previous neighboring node in subtree
print(token.nbor())                           # next neighboring node in subtree
print([tok.text for tok in token.lefts])      # tokens in the left subtree
print([tok.text for tok in token.rights])     # tokens in the right subtree
print(list(token.subtree))                    # subtree

['Ce', 'Spacy']
Ce
Spacy
['Ce']
['Spacy']
[Ce, tutoriel, Spacy]


In [32]:
def flatten_tree(tree):
    return ''.join(token.text_with_ws for token in list(tree)).strip()

print(flatten_tree(token.subtree))

Ce tutoriel Spacy


### Shallow parsing

In [33]:
#-- Noun phrase detection

for chunk in doc.noun_chunks:
    print(chunk)

Ce tutoriel Spacy
Traitement Automatique
Langage Naturel
tu peux
J'
ce
je

Angela Merkel présidente de l'Allemagne.


In [34]:
#-- Verb phrase detection (not built-in)
# Use Matcher with rules, or textacy

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)      # Initialize matcher with shared vocab

patterns = [
    [{'POS': 'VERB', 'OP': '?'}, {'POS': 'ADV', 'OP': '?'}, {'POS': 'VERB'}],
]
matcher.add("VERB_PATTERN", [patterns[0]])

matches = matcher(doc)            # Call the matcher on the doc
for id, start, end in matches:
    print(f'Match found: {doc[start:end].text}')

Match found: explique
Match found: explique comment faire
Match found: comment faire
Match found: faire
Match found: Laisse
Match found: Laisse tomber
Match found: tomber
Match found: oublié
Match found: voulais
Match found: voulais dire
Match found: dire


### Named Entity Recognition

In [35]:
cols = ('token', 'start_char', 'end_char', 'label_')
rows = []

for ent in doc.ents:
    # spacy.explain(ent.label_)
    row = [ent.text, ent.start_char, ent.end_char, ent.label_]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
print(df)

                                       token  start_char  end_char label_
0                                      Spacy          12        17    PER
1  Traitement Automatique du Langage Naturel          44        85   MISC
2                              Angela Merkel         162       175    PER
3                                  Allemagne         192       201    LOC


In [43]:
#-- Redact personal info: person's names

def replace_person_names(token):
    if token.ent_iob != 0 and token.ent_type_ == 'PER':
        return '[REDACTED]'
    return token.text

def redact_names(doc):
    tokens = map(replace_person_names, doc)
    return ' '.join(tokens)

redact_names(doc)

"Ce tutoriel [REDACTED] explique comment faire \n du Traitement Automatique du Langage Naturel . \n Est -ce que tu peux ... Laisse tomber . \n J' ai oublié ce que je voulais dire ! \n [REDACTED] [REDACTED] présidente de l' Allemagne . \n"

### Visualizing named entities

In [15]:
from spacy import displacy
displacy.render(doc, style='ent')  # jupyter: displacy.serve(doc, style="ent")

### Visualization the syntactic structure

In [16]:
from spacy import displacy
displacy.render(doc, style='dep')  # jupyter: displacy.serve(doc, style="dep")

 # Custom pipeline component: sentence delimiter

 #### Create new component

In [3]:
import spacy
from spacy.language import Language

@Language.component("set_custom_boundary")
def set_custom_boundary(doc):
    ''' Recognize '...' as sentence delimiter.'''
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i+1].is_sent_start = True
    return doc

 #### Add it to pipeline

In [4]:
custom_nlp = spacy.load('fr_core_news_sm')
custom_nlp.add_pipe('set_custom_boundary', before='parser')
# last=True | first=True | before='component' | after='component'
print('Pipeline: ', nlp.pipe_names)

<function __main__.set_custom_boundary(doc)>

 #### Use it

In [7]:
intro_filename = 'test_files/introduction.txt'
intro_text = open(intro_filename).read()
intro_doc = nlp(intro_text)
sentences = list(intro_doc.sents)

for sentence in sentences:
    print(sentence)

Ce tutoriel Spacy explique comment faire
du Traitement Automatique du Langage Naturel.

Est-ce que tu peux ...
Laisse tomber.

J'ai oublié ce que je voulais dire !




 <a class="anchor" id="section4"></a>

 <a class="anchor" id="section5"></a>

 <a class="anchor" id="section6"></a>

#  Customize the `nlp.tokenizer`

Pass various parameters to the `Tokenizer` class:

- `nlp.vocab`: 
  - Storage container for special cases
  - Ex: contractions, emoticons
  
  
- `prefix_search`:
  - Function used to handle preceding punctuation
  - Ex: opening parentheses
  
  
- `suffix_search`:
  - Function used to handle succeeding punctuation
  - Ex: closing parentheses
  
  
- `infix_finditer`:
  - Function used to handle non-whitespace separators
  - Ex: hyphens
  
  
- `token_match`:
  - Optional `Boolean` function used to match strings that should never be split.   
  - Ex: entities like URLs or numbers   
  - **Overrides** the previous rules.

In [41]:
# See example code