**Reference**: [NLP with Spacy](https://realpython.com/natural-language-processing-spacy-python/)

<h1 style="font-size:21px; font-weight:bold; background:#eeeeee;padding: 15px;">Contents</h1>

- [Install Spacy](#section1)
- [Read input](#section2)
- [Sentence detection](#section3)
- [Tokenization](#section4)
- [Lemmatization](#section5)
- [Word frequency](#section6)
- [Part-of-speech Tagging](#section7)

 <a class="anchor" id="section1"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Install Spacy</h1>

### Create a virtual environment

In [None]:
$ python3 -m venv myenv
$ source ./myenv/bin/activate
$ pip install spacy

### Download models and data per language

- French Models: [fr_core_news_sm](https://spacy.io/models/fr)
- Install [Spacy transformers](https://spacy.io/universe/project/spacy-transformers)
- Select models: [Spacy Trained Models and Pipelines](https://spacy.io/models)

In [None]:
$ python -m spacy download fr_core_news_sm

In [55]:
import spacy
from spacy.lang.fr.examples import sentences

nlp = spacy.load('fr_core_news_sm')
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(f'{token.text:<21} {token.pos_:<8} {token.dep_}')

Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars
Apple                 NOUN     ROOT
cherche               VERB     amod
à                     ADP      mark
acheter               VERB     xcomp
une                   DET      det
start                 NOUN     obj
-                     PUNCT    punct
up                    DET      det
anglaise              NOUN     ROOT
pour                  ADP      case
1                     NUM      nummod
milliard              NOUN     nmod
de                    ADP      case
dollars               NOUN     nmod


In [59]:
custom_nlp = spacy.load('fr_dep_news_trf')

ValueError: [E002] Can't find factory for 'transformer' for language French (fr). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, future_entity_ruler, span_ruler, textcat_multilabel, fr.lemmatizer
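The error above says that transformer pipelines need the `spacy-transformers` package. A minimal sketch of a pre-flight check, assuming the package is importable under the module name `spacy_transformers`; the helper name `has_module` is made up for this note:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# Assumption: spacy-transformers installs a module named `spacy_transformers`.
if has_module("spacy_transformers"):
    print("spacy-transformers found: transformer pipelines can be loaded.")
else:
    print("Missing dependency: run `pip install spacy-transformers` first.")
```

The check only confirms that the package is present; the pipeline itself (here `fr_dep_news_trf`) still has to be downloaded separately.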

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2021.53.48.png" width="800">

 <a class="anchor" id="section2"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Read input</h1>

 ### Read a String

In [6]:
import spacy
nlp = spacy.load('fr_core_news_sm')

intro_text = ('Ce tutoriel Spacy explique comment faire'
              ' du Traitement Automatique du Langage Naturel.')
intro_doc = nlp(intro_text)
print([token.text for token in intro_doc])

['Ce', 'tutoriel', 'Spacy', 'explique', 'comment', 'faire', 'du', 'Traitement', 'Automatique', 'du', 'Langage', 'Naturel', '.']


 ### Read a file

In [32]:
import spacy
nlp = spacy.load('fr_core_news_sm')

intro_filename = 'test_files/introduction.txt'
# Read the whole file; the context manager closes it afterwards
with open(intro_filename, encoding='utf-8') as f:
    intro_text = f.read()
intro_doc = nlp(intro_text)
print([token.text for token in intro_doc])

['Ce', 'tutoriel', 'Spacy', 'explique', 'comment', 'faire', '\n', 'du', 'Traitement', 'Automatique', 'du', 'Langage', 'Naturel', '.', '\n', 'Est', '-ce', 'que', 'tu', 'peux', '...', 'Laisse', 'tomber', '.', '\n', "J'", 'ai', 'oublié', 'ce', 'que', 'je', 'voulais', 'dire', '!', '\n']


 <a class="anchor" id="section3"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Sentence detection</h1>

In [33]:
sentences = list(intro_doc.sents)
for sentence in sentences:
    print(sentence)

Ce tutoriel Spacy explique comment faire
du Traitement Automatique du Langage Naturel.

Est-ce que tu peux ...
Laisse tomber.

J'ai oublié ce que je voulais dire !




 # Custom pipeline component: sentence delimiter

In [43]:
import spacy

 #### Create new component

In [47]:
from spacy.language import Language

@Language.component("set_custom_boundary")
def set_custom_boundary(doc):
    ''' Recognize '...' as sentence delimiter.'''
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i+1].is_sent_start = True
    return doc

 #### Add it to pipeline

In [58]:
# Use the small French model installed above
# ('fr_dep_news_trf' needs spacy-transformers)
custom_nlp = spacy.load('fr_core_news_sm')
# Run the custom component before the parser so its boundaries are kept
custom_nlp.add_pipe('set_custom_boundary', before='parser')

 #### Use it

In [31]:
intro_filename = 'test_files/introduction.txt'
with open(intro_filename, encoding='utf-8') as f:
    intro_text = f.read()
# Process the text with the customized pipeline
intro_doc = custom_nlp(intro_text)
intro_sentences = list(intro_doc.sents)

for sentence in intro_sentences:
    print(sentence)

Ce tutoriel Spacy explique comment faire
du Traitement Automatique du Langage Naturel.

Est-ce que tu peux ...
Laisse tomber.

J'ai oublié ce que je voulais dire !




 <a class="anchor" id="section4"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Tokenization</h1>

In [40]:
for token in intro_doc:
    print(f'{token.i:<5}{token.idx:<5}{token.text:<34}')

0    0    Ce                                
1    3    tutoriel                          
2    12   Spacy                             
3    18   explique                          
4    27   comment                           
5    35   faire                             
6    40   
                                 
7    41   du                                
8    44   Traitement                        
9    55   Automatique                       
10   67   du                                
11   70   Langage                           
12   78   Naturel                           
13   85   .                                 
14   86   
                                 
15   87   Est                               
16   90   -ce                               
17   94   que                               
18   98   tu                                
19   101  peux                              
20   106  ...                               
21   110  Laisse                            
22   117  

# `Token` attributes

See: [`Token` attributes](https://spacy.io/api/token#attributes)

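- `token.i`: index of the token within the parent document
- `token.idx`: character offset of the token within the parent document
  
  
- `token.text_with_ws`: token text with trailing whitespace, if present
- `token.is_alpha`: the token consists of alphabetic characters
- `token.is_punct`: the token is punctuation
- `token.is_space`: the token consists of whitespace characters
- `token.is_stop`: the token is in the stop word list
  
  
- `token.shape_`: orthographic shape of the token.  
  Lowercase alphabetic characters are replaced by `x`, uppercase ones by `X`,  
  and numeric characters by `d`.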
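The attributes listed above can be inspected without downloading a model. A minimal sketch, assuming a blank French pipeline (`spacy.blank("fr")`, tokenizer only) and a made-up example sentence:

```python
import spacy

# A blank pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("fr")
doc = nlp("Le chat mange 2 souris.")  # illustrative sentence, not from the test file

# One row per token: document index, character offset, text, flags, shape
rows = [
    (t.i, t.idx, t.text, t.is_alpha, t.is_punct, t.is_stop, t.shape_)
    for t in doc
]
for row in rows:
    print(row)
```

All of these are lexical attributes, which is why they are available without a tagger or parser.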

#  Customize the `nlp.tokenizer`

Pass various parameters to the `Tokenizer` class:

- `nlp.vocab`: 
  - Storage container for special cases
  - Ex: contractions, emoticons
  
  
- `prefix_search`:
  - Function used to handle preceding punctuation
  - Ex: opening parentheses
  
  
- `suffix_search`:
  - Function used to handle succeeding punctuation
  - Ex: closing parentheses
  
  
- `infix_finditer`:
  - Function used to handle non-whitespace separators
  - Ex: hyphens
  
  
- `token_match`:
  - Optional `Boolean` function used to match strings that should never be split.   
  - Ex: entities like URLs or numbers   
  - **Overrides** the previous rules.

In [41]:
# See example code
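A possible version of that example, as a sketch only: it reuses the language defaults for prefixes and suffixes and swaps in a hypothetical infix rule that also splits on `@`. The blank French pipeline and the sample string are assumptions made for this note.

```python
import re

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = spacy.blank("fr")

# Reuse the language defaults for leading and trailing punctuation.
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
# Hypothetical infix rule: split inside a token on '-', '~' or '@'.
infix_re = re.compile(r"[-~@]")

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=None,  # no strings are protected from splitting
)

tokens = [t.text for t in nlp("lundi@mardi")]
print(tokens)
```

Because no `rules` argument is passed, this tokenizer has no special cases, so the language's default tokenizer exceptions are lost. Pass them explicitly if they matter.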

# Stop words

- Usually removed because they distort the word frequency analysis

#### Stop words in French:
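A minimal sketch for listing the French stop words and filtering them out of a text, assuming a blank French pipeline and a made-up sentence:

```python
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of strings.
print(len(STOP_WORDS))
print(sorted(STOP_WORDS)[:10])

nlp = spacy.blank("fr")
doc = nlp("Le chat mange la souris.")  # illustrative sentence

# Keep only tokens that are neither stop words nor punctuation.
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_words)
```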


 <a class="anchor" id="section5"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Lemmatization</h1>



 <a class="anchor" id="section6"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Word frequency</h1>

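A minimal sketch of a word frequency count, assuming "word" means a lowercased token that is not a stop word, punctuation or whitespace. The blank French pipeline and the sample text are assumptions made for this note.

```python
from collections import Counter

import spacy

nlp = spacy.blank("fr")
doc = nlp("Le chat mange la souris. Le chat dort.")  # illustrative text

# Drop stop words, punctuation and whitespace before counting.
words = [
    t.text.lower()
    for t in doc
    if not t.is_stop and not t.is_punct and not t.is_space
]
freq = Counter(words)
print(freq.most_common(3))
```

Counting `token.lemma_` instead of `token.text` would merge inflected forms, but that requires a pipeline with a lemmatizer such as `fr_core_news_sm`.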


 <a class="anchor" id="section7"></a>

<h1 style="font-size:30px; font-weight:bold; background:#DDEEEE;padding: 15px;">Part-of-speech Tagging</h1>

