<h1 style="font-size:21px; font-weight:bold; background:#eeeeee;padding: 15px;">Contents</h1>


Chapter 1: [Finding words, phrases, names and concepts](#chapter1)
- [Central data structures](#section1-1)
- [Documents, Tokens, Spans, Lexical Attributes](#section1-2)
- [Trained pipelines](#section1-3)
- [Rule-based matching](#section1-4)

Chapter 2: [Processing pipelines](#chapter2)

<h1 style="font-size:21px; font-weight:bold; background:#eeeeee;padding: 15px;">Resources</h1>


- [API Reference](https://spacy.io/api)
  
  
- WOLF (Wordnet Libre du Français, Free French Wordnet)
   Free Semantic Lexical resource (wordnet) for French


  <a class="anchor" id="chapter1"></a>

<h1 style="font-size:44px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 1</h1>

# Central data structures: <a class="anchor" id="section1-1"></a>

### `Language` object

- **Reference**: [Language](https://spacy.io/api/language)
- Usually called `nlp`
- Output of `spacy.load()`
- Tokenizes a text and turns it into a `Doc` object

### `Vocab` object

- **Reference**: [Vocab](https://spacy.io/api/vocab)
- Lookup table to access [Lexeme](https://spacy.io/api/lexeme) objects
- Strings, word vectors and lexical attributes are **centralized** here   
  -> to avoid multiple copies of the data    
  -> save memory   
  -> single source of truth   
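
The idea of a shared lookup table can be sketched as follows. This is a minimal example, assuming spaCy v3 and a blank English pipeline (the sample sentence is made up for illustration): each string is stored once and referenced by a hash ID, and a `Lexeme` is the context-independent entry for a word.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for this sketch
nlp = spacy.blank("en")
doc = nlp("I love coffee")  # processing the text adds its strings to the vocab

# The string store maps a string to its hash ID and back
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_string)

# A Lexeme holds the context-independent attributes of a word
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)
```

The token in the `Doc` and the `Lexeme` share the same hash ID, so the string itself lives in one place only.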

### `Doc` object

- **Reference**: [Doc](https://spacy.io/api/doc)
- The sequence of [Token](https://spacy.io/api/token) objects and their annotations
- Text annotations **centralized** here   
  -> `Token` and `Span` are views that point to it   
  -> single source of truth
- Constructed by the [Tokenizer](https://spacy.io/api/tokenizer)
- And then modified in place by the components of the pipeline
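
As a rough illustration of "views that point to it", the sketch below builds a `Doc` by hand from words and spaces (normally the tokenizer does this) and checks that a `Token` and a `Span` refer back to the same `Doc`. The word list is an assumption chosen for the example.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hypothetical input: the words and whether each is followed by a space
words = ["Hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

token = doc[1]   # a Token is a view of one position in the Doc
span = doc[0:2]  # a Span is a view of a slice of the Doc

print(doc.text)
print(token.text, token.doc is doc)
print(span.text, span.doc is doc)
```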

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2018.33.41.png" width="500">

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2018.35.08.png" width="600">

# Documents, Tokens, Spans, Lexical Attributes <a class="anchor" id="section1-2"></a>

In [26]:
import spacy

In [50]:
nlp = spacy.blank("en")                       #-- Language object:

print(nlp)
print(type(nlp))
#print(dir(nlp))

<spacy.lang.en.English object at 0x6518c21d0>
<class 'spacy.lang.en.English'>


In [51]:
doc = nlp("Hello world! How are you?")        #-- Doc object:

print(doc)
print(doc.text)
print(type(doc))
#print(dir(doc))

Hello world! How are you?
Hello world! How are you?
<class 'spacy.tokens.doc.Doc'>


In [57]:
for i, token in enumerate(doc):               #-- Token object:
    print(f'{i}:{token.text}', end=', ')
print()
    
token = doc[1]   # token.i = token index
print(token)
print(type(token))
#print(dir(token))

0:Hello, 1:world, 2:!, 3:How, 4:are, 5:you, 6:?, 
world
<class 'spacy.tokens.token.Token'>


In [53]:
span = doc[3:6]                               #-- Span object:

print(span, ' - ', span.text)
print(type(span))
#print(dir(span))

How are you  -  How are you
<class 'spacy.tokens.span.Span'>


In [54]:
doc = nlp('This costs $5.')                   #-- Lexical Attributes

print("Tokens: ", [(token.i, token.text) for token in doc])
print(doc[0].is_alpha)
print(doc[-1].is_punct)
print(doc[-2].like_num)

Tokens:  [(0, 'This'), (1, 'costs'), (2, '$'), (3, '5'), (4, '.')]
True
True
True
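
Lexical attributes need no trained model, so they can be combined with token indices for simple lookups. The sketch below, assuming a blank English pipeline and a made-up sentence, collects every number-like token that is directly followed by a percent sign.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

percentages = []
for token in doc:
    # like_num is a lexical attribute: it looks only at the token text
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            percentages.append(token.text)

print(percentages)
```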


# Trained pipelines <a class="anchor" id="section1-3"></a>

- Models that enable spaCy to predict linguistic attributes *in context*:
  - Part-of-speech tags
  - Syntactic dependencies
  - Named entities
  
  
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

### Loading pipeline packages

A pipeline package contains:
- Binary weights that enable spaCy to make predictions
- Vocabulary
- Meta information about the pipeline
- Configuration file used to train it

It tells spaCy:
- which language class to use
- and how to configure the processing pipeline
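
The meta information and configuration can be inspected on any `Language` object. The sketch below assumes spaCy v3 and uses a blank English pipeline so that it runs without a downloaded package; a trained package such as `en_core_web_sm` exposes the same attributes with more entries.

```python
import spacy

# Assumption: a blank pipeline stands in for a loaded package here
nlp = spacy.blank("en")

# Language class selected for this pipeline
print(type(nlp).__name__)

# Meta information (language, name, version, ...)
print(nlp.meta["lang"], nlp.meta["name"])

# Configuration: language code and the list of pipeline components
print(nlp.config["nlp"]["lang"], nlp.config["nlp"]["pipeline"])

# Components currently in the processing pipeline (empty for a blank pipeline)
print(nlp.pipe_names)
```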

In [58]:
import spacy
nlp = spacy.load('en_core_web_sm')

### Predicting linguistic annotations



In [73]:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"
doc = nlp(text)

In [74]:
for token in doc:
    print(f'{token.text:<12}:{token.pos:<10}:{token.dep:<10}')
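    # Columns: text, part-of-speech ID, dependency ID
    # (token.pos and token.dep are integer IDs; token.pos_ and token.dep_ give the string labels)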

Upcoming    :84        :402       
iPhone      :96        :7037928807040764755
X           :92        :7037928807040764755
release     :92        :7037928807040764755
date        :92        :8206900633647566924
leaked      :100       :451       
as          :98        :423       
Apple       :96        :429       
reveals     :100       :399       
pre         :84        :416       
-           :92        :416       
orders      :92        :416       
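
The difference between the integer attributes and their underscore variants can be sketched without a trained pipeline. In the example below the annotations are set by hand on a three-word `Doc` (the words, tags and heads are assumptions for illustration, not model predictions).

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations, standing in for what a trained pipeline would predict
words = ["Apple", "reveals", "orders"]
pos = ["PROPN", "VERB", "NOUN"]
heads = [1, 1, 1]                # index of each token's syntactic head
deps = ["nsubj", "ROOT", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

for token in doc:
    # Attributes without underscore are integer IDs, with underscore they are strings
    print(f"{token.text:<10}{token.pos:<6}{token.pos_:<8}{token.dep:<6}{token.dep_:<8}")

# The string store converts a dependency ID back to its label
first_dep_label = nlp.vocab.strings[doc[0].dep]
```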


In [75]:
for entity in doc.ents:
    print(f'{entity.text}:{entity.label_}')

Apple:ORG


### Predicting named entities in context

- 'iPhone X' has not been recognized as a named entity

In [78]:
# Create a Span for it by slicing the Doc (the Matcher in the next section finds it without manual indices)
iphone_x = doc[1:3]
print(iphone_x)

iPhone X
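
A plain slice has no label. One possible way to record the missing entity is to build a labeled `Span` and add it to `doc.ents`. The sketch below assumes a blank English pipeline (so `doc.ents` starts empty), and the `PRODUCT` label is chosen only for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Span over tokens 1 and 2 ("iPhone X") with an illustrative label
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Keep existing entities and append the new one
doc.ents = list(doc.ents) + [iphone_x]

for ent in doc.ents:
    print(f"{ent.text}:{ent.label_}")
```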


 # `Matcher`: Rule-based matching <a class="anchor" id="section1-4"></a>


- **Reference**: [Rule-based matching](https://spacy.io/api/matcher)  
  Token, Phrase Matcher, Dependency Matcher, Entity Ruler, Span Ruler, Models & Rules
- **Usage guide**: [Rule-based matching](https://spacy.io/usage/rule-based-matching)
- **Returns**: a list of `(match_id, start, end)` tuples
- Supported attributes for rule-based matching: [Available token attributes](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)    
  See also: Operators, Regular expressions...

   
- Match on `Doc` objects, not just strings
- Use a model's predictions
- Lets you find words and phrases using rules describing their token attributes
- Rules can refer to **token annotations** (like the text or part-of-speech tags) as well as **lexical attributes** (like Token.is_punct)   
  
Match patterns:

- Lists of dictionaries, one per token
- Match exact token texts: `[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]`
- Match lexical attributes: `[{'LOWER': 'iphone'}, {'LOWER': 'x'}]`
- Match any token attributes: `[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]`
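
Patterns that rely only on the token text and lexical attributes work without a trained pipeline. The sketch below assumes a blank English pipeline and a made-up sentence. The pattern names are arbitrary, and the match ID is turned back into its name through the string store.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each pattern is a list of dictionaries, one dictionary per token
matcher.add("IPHONE", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])
matcher.add("IOS_VERSION", [[{"TEXT": "iOS"}, {"IS_DIGIT": True}]])

doc = nlp("The iphone X ships with iOS 11 instead of iOS 10 today")

found = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the string store gives back the pattern name
    found.append((nlp.vocab.strings[match_id], doc[start:end].text))

print(found)
```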

In [107]:
#-- Using the Matcher
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# Initialize matcher with shared vocab
matcher = Matcher(nlp.vocab)

# Process some text
doc = nlp(
    "Upcoming iPhone X release date leaked"
    
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
    
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
    
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

In [108]:
# Add patterns to the matcher

# "iPhone X"
patterns = [
    [{'TEXT': 'iPhone'}, {'TEXT': 'X'}],
    [{'TEXT': 'iPhone'}, {'TEXT': '10'}]
]
# Pass the list of patterns defined above
matcher.add("IPHONE_PATTERN", patterns)

# iOS versions: "iOS 7", "iOS 11"
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]
matcher.add("IOS_VERSION_PATTERN", [pattern])

# 'download' + PROPN (proper noun)
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]
matcher.add('DOWNLOAD_THINGS_PATTERN', [pattern])

# 'ADJ' + one or two 'NOUN'
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]
matcher.add('ADJ_NOUN_PATTERN', [pattern])


In [113]:
# Call the matcher on the doc
matches = matcher(doc)
for match_id, start, end in matches:
    print(f'Match found: {doc[start:end].text}')

Match found: iPhone X
Match found: radical system
Match found: wide redesign
Match found: aesthetic upheaval
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
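
Because of the optional `'OP': '?'` token, the ADJ + NOUN pattern reports both "optional voice" and "optional voice responses". If only the longest span is wanted, one option is `spacy.util.filter_spans`. The sketch below sets the part-of-speech tags by hand on a three-word `Doc` (an assumption that replaces the trained pipeline) to show the effect.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp = spacy.blank("en")

# Hand-annotated tags, standing in for model predictions
doc = Doc(
    nlp.vocab,
    words=["optional", "voice", "responses"],
    pos=["ADJ", "NOUN", "NOUN"],
)

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN_PATTERN", [pattern])

# All matches, including the overlapping shorter one
spans = [doc[start:end] for _, start, end in matcher(doc)]
print([span.text for span in spans])

# Keep only the longest span among overlapping candidates
longest = filter_spans(spans)
print([span.text for span in longest])
```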


  <a class="anchor" id="chapter2"></a>

<h1 style="font-size:44px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 2</h1>

 # Processing Pipelines <a class="anchor" id="section2-1"></a>


 <a class="anchor" id="section1"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">NLP for Chatbots</h1>

![Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2017.23.08.png](attachment:Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2017.23.08.png)

![Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2017.25.11.png](attachment:Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2017.25.11.png)

![Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2017.38.06.png](attachment:Capture%20d%E2%80%99e%CC%81cran%202022-07-10%20a%CC%80%2017.38.06.png)