**Reference**: [NLP with Spacy](https://realpython.com/natural-language-processing-spacy-python/)

<h1 style="font-size:21px; font-weight:bold; background:#eeeeee;padding: 15px;">Contents</h1>

- [Install Spacy](#section1)
- [Read input](#section2)
- [Sentence detection](#section3)
- [Sentences, Tokens, Lemmas, Tagging & Parsing](#section4)
- [Word frequency](#section5)
- [Visualization with displacy](#section6)

 <a class="anchor" id="section1"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Install Spacy</h1>

# Create a virtual environment

In [None]:
# Terminal commands: create and activate the virtual environment
$ python3 -m venv myenv
$ source ./myenv/bin/activate
# Notebook command: install spaCy
!pip install spacy

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2022.53.13.png" width="800px">

# Download models and data per language

- French Models: [fr_core_news_sm](https://spacy.io/models/fr)
- Install [Spacy transformers](https://spacy.io/universe/project/spacy-transformers)
- Select models: [Spacy Trained Models and Pipelines](https://spacy.io/models)

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2022.57.15.png" width="800px">

In [None]:
!python -m spacy download fr_core_news_sm

In [60]:
import spacy
from spacy.lang.fr.examples import sentences

nlp = spacy.load('fr_core_news_sm')
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(f'{token.text:<21} {token.pos_:<8} {token.dep_}')

Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars
Apple                 NOUN     ROOT
cherche               VERB     amod
à                     ADP      mark
acheter               VERB     xcomp
une                   DET      det
start                 NOUN     obj
-                     PUNCT    punct
up                    DET      det
anglaise              NOUN     ROOT
pour                  ADP      case
1                     NUM      nummod
milliard              NOUN     nmod
de                    ADP      case
dollars               NOUN     nmod


<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2021.53.48.png" width="700">

 <a class="anchor" id="section2"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Read input</h1>

 ## Read a String

In [6]:
import spacy
nlp = spacy.load('fr_core_news_sm')

intro_text = ('Ce tutoriel Spacy explique comment faire'
              ' du Traitement Automatique du Langage Naturel.')
intro_doc = nlp(intro_text)
print([token.text for token in intro_doc])

['Ce', 'tutoriel', 'Spacy', 'explique', 'comment', 'faire', 'du', 'Traitement', 'Automatique', 'du', 'Langage', 'Naturel', '.']


## Read a file

In [32]:
import spacy
nlp = spacy.load('fr_core_news_sm')

intro_filename = 'test_files/introduction.txt'
# Read the whole file, closing it automatically afterwards
with open(intro_filename, encoding='utf-8') as f:
    intro_text = f.read()
intro_doc = nlp(intro_text)
print([token.text for token in intro_doc])

['Ce', 'tutoriel', 'Spacy', 'explique', 'comment', 'faire', '\n', 'du', 'Traitement', 'Automatique', 'du', 'Langage', 'Naturel', '.', '\n', 'Est', '-ce', 'que', 'tu', 'peux', '...', 'Laisse', 'tomber', '.', '\n', "J'", 'ai', 'oublié', 'ce', 'que', 'je', 'voulais', 'dire', '!', '\n']
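
The `'\n'` entries in the output above are whitespace tokens: spaCy keeps the line breaks of the file as tokens. On a `Doc`, the `token.is_space` attribute flags them. As a minimal sketch on plain strings (so that it runs without a model), the same filtering could look like this:

```python
# Token texts copied from the start of the output above.
tokens = ['Ce', 'tutoriel', 'Spacy', 'explique', 'comment', 'faire', '\n',
          'du', 'Traitement', 'Automatique', 'du', 'Langage', 'Naturel', '.', '\n']

# Keep only the tokens that are not pure whitespace.
words = [tok for tok in tokens if not tok.isspace()]
print(words)
```

With a real `Doc`, the equivalent filter would be `[token.text for token in intro_doc if not token.is_space]`.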


 <a class="anchor" id="section3"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Sentence detection</h1>

In [33]:
sentences = list(intro_doc.sents)
for sentence in sentences:
    print(sentence)

Ce tutoriel Spacy explique comment faire
du Traitement Automatique du Langage Naturel.

Est-ce que tu peux ...
Laisse tomber.

J'ai oublié ce que je voulais dire !




 # Custom pipeline component: sentence delimiter

 #### Create new component

In [3]:
import spacy
from spacy.language import Language

@Language.component("set_custom_boundary")
def set_custom_boundary(doc):
    ''' Recognize '...' as sentence delimiter.'''
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i+1].is_sent_start = True
    return doc
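
The rule inside `set_custom_boundary` can be tried out without loading a model. The sketch below is a pure-Python imitation on a list of token strings, not spaCy code, and `split_after_ellipsis` is a hypothetical helper name used only for illustration.

```python
def split_after_ellipsis(tokens):
    """Group token strings into sentences, starting a new one after each '...'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == '...':
            # The token following '...' starts a new sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = ['Est', '-ce', 'que', 'tu', 'peux', '...', 'Laisse', 'tomber', '.']
for sentence in split_after_ellipsis(tokens):
    print(' '.join(sentence))
```

Unlike this sketch, the spaCy component only adds boundaries. Because it runs before the parser, the parser still decides every other sentence break.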

 #### Add it to pipeline

In [4]:
custom_nlp = spacy.load('fr_core_news_sm')
custom_nlp.add_pipe('set_custom_boundary', before='parser')

<function __main__.set_custom_boundary(doc)>

 #### Use it

In [7]:
intro_filename = 'test_files/introduction.txt'
intro_text = open(intro_filename).read()
intro_doc = custom_nlp(intro_text)  # use the pipeline that contains the custom component
sentences = list(intro_doc.sents)

for sentence in sentences:
    print(sentence)

Ce tutoriel Spacy explique comment faire
du Traitement Automatique du Langage Naturel.

Est-ce que tu peux ...
Laisse tomber.

J'ai oublié ce que je voulais dire !




 <a class="anchor" id="section4"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Sentences, Tokens, Lemmas, Tagging & Parsing</h1>

## Linguistic features documentation

- [`Token` attributes](https://spacy.io/api/token#attributes)
- Parts of speech: [Universal POS tagset](https://universaldependencies.org/u/pos/)
- Syntactic dependencies: [Stanford typed dependencies manual](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf)
- French stop words:   
  `/Users/macbook/anaconda3/lib/python3.7/site-packages/spacy/lang/fr/stop_words.py`

In [None]:
import spacy
import pandas as pd

nlp = spacy.load('fr_core_news_sm')

intro_filename = 'test_files/introduction.txt'
intro_text = open(intro_filename).read()
intro_doc = nlp(intro_text)

 ### Sentences

In [41]:
sentences = list(intro_doc.sents)
print(f'Found {len(sentences)} sentences.')

for i, sentence in enumerate(sentences[:2]):
    print(f'Sentence {i}: {sentence.start} to {sentence.end}')

Found 5 sentences.
Sentence 0: 0 to 14
Sentence 1: 14 to 21


 ### Linguistic features

In [42]:
cols = ('i', 'idx', 'token', 'lemma', 'tag', 'POS', 'explain', 'morph', 'syn_head', 'syn_dep')
rows = []

for token in intro_doc[:5]:
    row = [token.i, token.idx, 
           token.text, token.lemma_, 
           token.tag_, token.pos_, spacy.explain(token.tag_), 
           token.morph,
           token.head.text, token.dep_]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
print(df)

   i  idx     token     lemma    tag    POS      explain  \
0  0    0        Ce        ce    DET    DET   determiner   
1  1    3  tutoriel  tutoriel   NOUN   NOUN         noun   
2  2   12     Spacy     Spacy  PROPN  PROPN  proper noun   
3  3   18  explique  explique   VERB   VERB         verb   
4  4   27   comment   comment    ADV    ADV       adverb   

                                               morph  syn_head    syn_dep  
0           (Gender=Masc, Number=Sing, PronType=Dem)  tutoriel        det  
1                         (Gender=Masc, Number=Sing)  explique      nsubj  
2                                                 ()  tutoriel  flat:name  
3  (Mood=Ind, Number=Sing, Person=3, Tense=Pres, ...  explique       ROOT  
4                                     (PronType=Int)     faire     advmod  


 ### Stop words

Stop words are usually removed because they are so frequent that they distort word frequency analysis.

In [38]:
spacy_stopwords = spacy.lang.fr.stop_words.STOP_WORDS
print(len(spacy_stopwords))
print(list(spacy_stopwords)[:11])

507
['tend', 'trois', 'restent', 'auquel', 'plus', 'en', 'douzième', 'parlent', 'mon', 'hors', 'quatre']


In [40]:
doc_stop_words = [token for token in intro_doc if token.is_stop]
print(f'\nFound {len(doc_stop_words)} stop words in this document.')


Found 14 stop words in this document.
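
To illustrate how stop words can distort a frequency count, here is a minimal sketch using only the standard library. The token list and the two-word stop list are made up for the example; they are not taken from spaCy's `STOP_WORDS`.

```python
from collections import Counter

# Hypothetical tokens and a tiny stop-word subset, for illustration only.
tokens = ['le', 'chat', 'et', 'le', 'chien', 'et', 'le', 'chat']
stop_words = {'le', 'et'}

# Counting everything: the most frequent word is a stop word.
raw_counts = Counter(tokens)

# Counting after removing stop words: content words come first.
filtered_counts = Counter(tok for tok in tokens if tok not in stop_words)

print(raw_counts.most_common(1))
print(filtered_counts.most_common(1))
```

Without filtering, the article `le` tops the count. Once the stop words are removed, `chat` comes first.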


 <a class="anchor" id="section5"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Word frequency</h1>

In [100]:
from collections import Counter

# `doc` is the Apple example sentence parsed in the install section
words = [token.text for token in doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)

In [101]:
common_words = word_freq.most_common(5)
print(common_words)

[('Apple', 1), ('cherche', 1), ('acheter', 1), ('start', 1), ('up', 1)]


In [102]:
unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print(unique_words)

['Apple', 'cherche', 'acheter', 'start', 'up', 'anglaise', '1', 'milliard', 'dollars']


 <a class="anchor" id="section6"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Visualization with displacy</h1>

- Visualize **dependency parsing** and **named entities**

In [None]:
from spacy import displacy
displacy.serve(doc, style='dep')  # jupyter: displacy.render(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



# Customize the `nlp.tokenizer`

Pass various parameters to the `Tokenizer` class:

- `nlp.vocab`: 
  - Storage container for special cases
  - Ex: contractions, emoticons
  
  
- `prefix_search`:
  - Function used to handle preceding punctuation
  - Ex: opening parentheses
  
  
- `suffix_search`:
  - Function used to handle succeeding punctuation
  - Ex: closing parentheses
  
  
- `infix_finditer`:
  - Function used to handle non-whitespace separators
  - Ex: hyphens
  
  
- `token_match`:
  - Optional function returning a Boolean, used to match strings that should never be split.
  - Ex: entities like URLs or numbers
  - **Overrides** the previous rules.

In [41]:
# See example code
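
To make the role of each parameter concrete, here is a simplified pure-Python imitation of how prefix, suffix, infix, special-case and `token_match` rules interact. It is a sketch under stated assumptions, not spaCy's `Tokenizer`: the patterns are deliberately tiny, and `simple_tokenize`, `PREFIX_RE`, `SUFFIX_RE`, `INFIX_RE` and `URL_RE` are hypothetical names invented for this example.

```python
import re

# Illustrative patterns only; spaCy's real defaults are much larger.
PREFIX_RE = re.compile(r'^[\(\["«]')      # plays the role of prefix_search
SUFFIX_RE = re.compile(r'[\)\]"»,.!?]$')  # plays the role of suffix_search
INFIX_RE = re.compile(r'-')               # plays the role of infix_finditer
URL_RE = re.compile(r'^https?://\S+$')    # plays the role of token_match


def simple_tokenize(text, special_cases=None):
    """Split on whitespace, then peel prefixes, suffixes and infixes."""
    special_cases = special_cases or {}
    tokens = []
    for chunk in text.split():
        suffixes = []
        while chunk:
            # token_match and special cases win over the affix rules.
            if URL_RE.match(chunk):
                tokens.append(chunk)
                chunk = ''
                break
            if chunk in special_cases:
                tokens.extend(special_cases[chunk])
                chunk = ''
                break
            match = PREFIX_RE.search(chunk)
            if match:
                tokens.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIX_RE.search(chunk)
            if match:
                suffixes.insert(0, match.group())
                chunk = chunk[:match.start()]
                continue
            break
        if chunk:
            # Split what is left on infixes, keeping the separators.
            start = 0
            for match in INFIX_RE.finditer(chunk):
                if match.start() > start:
                    tokens.append(chunk[start:match.start()])
                tokens.append(match.group())
                start = match.end()
            if start < len(chunk):
                tokens.append(chunk[start:])
        tokens.extend(suffixes)
    return tokens


url = 'https://realpython.com/natural-language-processing-spacy-python/'
print(simple_tokenize('(Une start-up) ' + url))
print(simple_tokenize("J'ai oublié !", special_cases={"J'ai": ["J'", 'ai']}))
```

The hyphen in `start-up` is split by the infix rule, while the hyphens inside the URL are left alone because the `token_match` stand-in accepts the whole string first. With spaCy itself, the corresponding step would be to build a `spacy.tokenizer.Tokenizer` from `nlp.vocab` and the functions listed above, then assign it to `nlp.tokenizer`.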