**Notes from:** Duygu ALTINOK - *Mastering spaCy*   

[Section 1](#section1): Getting started with spaCy

- [Chapter 1](#chapter1) – Getting started with spaCy
- [Chapter 2](#chapter2) – Core operations with spaCy

[Section 2](#section2): spaCy features

- [Chapter 3](#chapter3) – Linguistic features
- [Chapter 4](#chapter4) – Rule-based matching
- [Chapter 5](#chapter5) – Word Vectors and Semantic Similarity
- [Chapter 6](#chapter6) – Putting everything together: Semantic Parsing with spaCy

[Section 3](#section3): Machine Learning with spaCy

- [Chapter 7](#chapter7) – Customizing spaCy models
- [Chapter 8](#chapter8) – Text Classification with spaCy
- [Chapter 9](#chapter9) – spaCy and transformers
- [Chapter 10](#chapter10) – Putting everything together: Designing your chatbot with spaCy

 <a class="anchor" id="section1"></a>

<h1 style="font-size:13px; font-weight:bold; background:#eebbcc;padding: 15px;">SECTION 1 – Getting started with spaCy</h1>

<a class="anchor" id="chapter1"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 1 – Getting started with spaCy</h1>

Chatbots:
- Entity Extraction
- Intent recognition
- Context handling
- (= text classification)

Entity Linking = 
- 'Diana Spencer' <-> 'Lady Diana'
- Semantic relations + General knowledge (See: Semantic Web on Wikipedia)
- available in spaCy

spaCy and Deep Learning:
- ML library `Thinc`
- Wrappers for: PyTorch, TensorFlow, MXNet, and Hugging Face transformers
- **46** state-of-the-art models for **16** languages

Pretrained models:
- `fr_core_news_sm`
- `fr`: language code
- `core`: model capability
- `news`: corpus type (`web`, `twitter`...)
- `sm`: model size, large `lg`, medium `md`, small `sm`
- **Important**: match model genre to your text type
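
The naming convention above can be unpacked mechanically. This is a minimal sketch, assuming the four-part `lang_capability_genre_size` pattern from the list; `parse_model_name` is a hypothetical helper, not part of spaCy.

```python
def parse_model_name(name):
    """Split a spaCy package name into the four parts of its naming convention."""
    lang, capability, genre, size = name.split("_")
    return {"lang": lang, "capability": capability, "genre": genre, "size": size}


parts = parse_model_name("en_core_web_md")
print(parts)
```

Names that do not follow the four-part pattern would raise a `ValueError` here, which is acceptable for a sketch.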

# Install spaCy

In [None]:
pip install spacy

# if you have multiple Python versions 
# and you want to use spacy with a specific version
pip3.5 install spacy

In [None]:
# check your spacy version
python -m spacy info

# upgrade spacy
pip install -U spacy

In [None]:
# install with conda
conda install -c conda-forge spacy

In [None]:
# install spacy on macOS/OS X
# - install Xcode IDE
# - then install command-line development tools:
xcode-select --install
# - then install spacy

In [None]:
# install other Python modules:
! pip install numpy
! pip install scikit-learn
! pip install matplotlib
! pip install pandas

! pip install tensorflow   # TensorFlow >= 2.2.0

# Install Jupyter notebooks: https://jupyter.org/install

# Install language models

spaCy's `download` command: 
- selects and downloads the most compatible version of this model for your local spaCy version
- deploys `pip` behind the scenes,    
  and `pip` installs the package and places it in your `site-packages` directory (just like any other Python package)

See book (p. 22, zoom lecture 33%): download a model via `pip` and `import` it as a module

In [None]:
python -m spacy download fr_core_news_md

# download a specific version
python -m spacy download fr_core_news_md-2.0.0 --direct

In [None]:
import spacy 
nlp = spacy.load('fr_core_news_md')  # then load the package
doc = nlp('Hello world')

# Visualization tool

See interactive demos:
- POS tags + Syntactic dependencies: https://explosion.ai/demos/displacy
- Named entities: https://explosion.ai/demos/displacy-ent

In [None]:
import spacy
from spacy import displacy
nlp = spacy.load('fr_core_news_md')
doc = nlp('Hello world')

# start the displacy web server:
displacy.serve(doc, style='dep')

# response with a link: http://0.0.0.0:5000
# = local address where displacy renders your graphics
# see p. 28 how to use another port
# click the link and navigate to the web page (localhost)
# Ctrl+C to shut down the displacy server and go back to Python shell

# style=ent for named entities

In [2]:
# see book p. 29:
#   displacy in Jupyter Notebook
#   displacy.render(doc, style='dep')

In [None]:
# see book p. 30:
#   export displacy renders in image files with python

# Install other tools

- Data annotation tools
  - Prodigy: https://prodi.gy/demo
    - annotate entities
  - Brat: https://brat.nlplab.org/introduction.html
    - demo website: https://brat.nlplab.org/examples.html   
    - Brat can also annotate relations

 <a class="anchor" id="chapter2"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 2 – Core operations with spaCy</h1>

# Processing steps

1. We create a spaCy pipeline object: `Language`, output of `spacy.load`
  
  
2. We apply the `nlp` pipeline on a text:
  - tokenizer -> `Doc` -> tagger -> `Doc` -> parser -> `Doc` -> entity recognizer -> `Doc`
  - each component returns a `Doc` and passes it to the next component   
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.42.08.png" width="600"> 

### Components

Each component corresponds to a spaCy class   
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.48.22.png" width="600">

### Containers

Classes contain information about text data   
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.51.04.png" width="600">
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.52.39.png" width="600">

### Global architecture

- Processing pipeline **Components** (actions)
- Data **Containers** (inputs and outputs)
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.54.18.png" width="600">


In [None]:
import spacy
nlp = spacy.load('fr_core_news_md')
doc = nlp('Bonjour le monde !')

# Pipeline components

1. Tokenizer
  - [`Token`](https://spacy.io/api/token) -> `[token.text for token in doc]`
  - Based on language-specific rules
  - Can be customized
  - Debugging the tokenizer
  
  
2. Sentence segmentation
  - `doc.sents` -> `Token.is_sent_start`
  - Done by the dependency parser
  
  
3. Lemmatization
  - `token.lemma_`
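
The tokenizer and sentence segmentation steps above can be tried without downloading a model. This is a sketch assuming spaCy v3; it uses a blank English pipeline and the rule-based `sentencizer` as a stand-in for the dependency parser, so it is not the parser-based segmentation described in the notes. Lemmatization is left out because it needs a trained pipeline.

```python
import spacy

# Blank English pipeline: language-specific tokenizer rules only, no trained model.
nlp = spacy.blank("en")
# Rule-based sentence splitter, swapped in here for the dependency parser.
nlp.add_pipe("sentencizer")

doc = nlp("Hello world. This is spaCy.")

tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]
sentence_starts = [token.text for token in doc if token.is_sent_start]

print(tokens)
print(sentences)
print(sentence_starts)
```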

# Container classes

**[`Doc`](https://spacy.io/api/doc)**
  - `doc.text`: Unicode representation of the text
  - `doc.sents`: sentences
  - `doc.ents`: named entities
  - `doc.lang`: language id  
    `doc.lang_`: language Unicode string
  
  
**[`Token`](https://spacy.io/api/token)**
  - `token.is_sent_start`: sentence start
  - `token.is_stop`: is a stop word
  - `token.lemma` + `token.lemma_`
  - `token.ent_type_`
  - `token.dep_` + `token.head`
  - `token.is_oov`: out of vocabulary
  
  
**[`Span`](https://spacy.io/api/span)**

Use: `dir(doc)`, `dir(token)`, `dir(span)` to see all object attributes

 <a class="anchor" id="section2"></a>

<h1 style="font-size:13px; font-weight:bold; background:#eebbcc;padding: 15px;">SECTION 2 – spaCy features</h1>

 <a class="anchor" id="chapter3"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 3 – Linguistic features</h1>

# POS Tagging

- `token.pos` (int)   
  `token.pos_`: Unicode universal tags
- `token.tag` (int)   
  `token.tag_`: fine-grained tags
- `spacy.explain(tag_name)`
- `lang/<language_code>/tag_map.py` under each language submodule
  
  
- Verb, Noun, Pronoun, Determiner, Adjective, Adverb, Preposition, Conjunction, Interjection
- Same language can support different tagsets
- See:
  - http://partofspeech.org/,
  - [The eight parts of speech](http://www.butte.edu/departments/cas/tipsheets/grammar/parts_of_speech.html)
  
  
- POS taggers: sequential models, **Seq2seq**, spaCy uses an **LSTM** variation  
  (see state-of-the-art of POS Tagging on ACL website)

### Word Sense Disambiguation (WSD)

- Sometimes, POS tags can help identify a specific sense of a word
- Example: 
  - beat – strike someone – V
  - beat – defeat someone – V
  - beat – rhythm in music and poetry – N
  - beat – bird wing movement – N
  - beat – completely exhausted – ADJ

### Natural Language Understanding (NLU)

- Verb **tense** and **aspect** can help identify intent
- Example:
  - I flew to Rome 3 days ago. I still didn't get the bill, please send it ASAP.
  - I need to fly to Rome
  - I will fly to Rome next week. Check availabilities please.
  - I'm flying to Rome next week. Check flights on next Tuesday.

### Numbers, Symbols and Punctuation

- `NUM`, `SYM`, `PUNCT` (cf. book p. 78 fine-grained punctuation tags)
- Can be used in rule-based matching to recognize financial info...

# Dependency Parsing

- `ROOT`: sentence head
- `token.dep_`: dependency label in Unicode / `token.dep`: int
- `token.head`: syntactic head (a `Token`)
- `token.children`: immediate syntactic dependents (`token.subtree` gives the full subtree)
  
  
- spaCy's English dependency labels:
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2023.03.06.png" width="450">


# NER

- One of the key components for understanding text topic (financial, medical, movies...)   
  named entities usually belong to a **semantic category**   
  Example: *BTS* = music, *Salma Hayek* = movies
  
  
- `doc.ents`: list of `Span` objects
- `token.ent_type` (int)   
  `token.ent_type_` (Unicode string) (empty if token is not an entity)   
  `spacy.explain(token.ent_type_)`
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2023.08.37.png" width="450">
  
  
- State of the art:
  - First modern NER tagger was a CRF (Conditional Random Field)   
    See implementation details [here](https://homepages.inf.ed.ac.uk/csutton/publications/crftutv2.pdf)
  - Current state of the art: LSTM (+ CRF)


### Merging and splitting NERs

- multiword expressions, multiword named entities, typos
- `doc.retokenize`: tool for merging and splitting the spans
  - See p. 101, zoom lecture 33%
  - assigning new linguistic features to the merge/split spans (syntactic, pos...)
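
Merging a multiword entity with `doc.retokenize` might look like the sketch below. It assumes spaCy v3 and a blank English pipeline; the sentence and the `LEMMA` value are illustrative choices, not taken from the book.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in New York")
before = [token.text for token in doc]

# Merge the span "New York" into a single token and assign it a lemma.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "New York"})

after = [token.text for token in doc]
print(before)
print(after)
```

The changes are applied when the `with` block exits, so the `Doc` should only be inspected afterwards.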

 <a class="anchor" id="chapter4"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 4 – Rule-based matching</h1>

- Entities **specific to your domain** (times, dates, phone numbers, IBAN, account numbers...)
  
  
- Can be recognized with rules   
  without having to train statistical models

# Token-based matching

- `Matcher` object:
  - Based on morphological features, POS tags, regex and other spaCy features
  - Can be applied to `Doc` and `Span`
  - Attributes the `Matcher` recognizes:    
    Examples for each in the book!
    <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2023.28.36.png" width="700">
  - Extended syntax:
    <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2023.31.46.png" width="650">
  - Regex-like operators:
    <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2023.33.08.png" width="400">
  - Regular expressions support: [regex101](https://regex101.com)
    <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2023.36.11.png" width="400">


- spaCy's `Matcher` [demo page](https://explosion.ai/demos/matcher):  
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2023.36.59.png" width="500">
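
A token-based pattern using an attribute, a boolean flag and a regex-like operator could be sketched as follows. This assumes the spaCy v3 `Matcher.add(name, [patterns])` signature; the pattern name `HELLO_WORLD` and the text are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "hello", then an optional punctuation token, then "world" (case-insensitive).
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "world"},
]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world")
# Each match is a (match_id, start, end) triple of token indices.
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```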


# `PhraseMatcher`

- `PhraseMatcher` class scans long dictionaries
- When you have a long list of domain-specific phrases (legal, medical...)
- `Matcher` not handy: too manual
- See usage p. 123 (zoom lecture 42%)
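
For a term list, a `PhraseMatcher` sketch might look like this. It assumes spaCy v3; the two medical terms and the label `MEDICAL` are hypothetical examples, and `attr="LOWER"` is chosen here to make matching case-insensitive.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# Compare on the lowercase form so that "Aspirin" matches "aspirin".
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["myocardial infarction", "aspirin"]  # illustrative term list
matcher.add("MEDICAL", [nlp.make_doc(term) for term in terms])

doc = nlp("The patient took Aspirin after a myocardial infarction.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Using `nlp.make_doc` only runs the tokenizer, which keeps pattern creation cheap for long dictionaries.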

# `EntityRuler`

- Component used to add rules on top of the statistical model   
  = even more powerful NER model
- it's not a `matcher`, it's a pipeline component: `nlp.add_pipe`
- appends its matches to `doc.ents`
- see usage in the book
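
A minimal `EntityRuler` sketch, assuming the spaCy v3 string-name form `nlp.add_pipe("entity_ruler")`. It runs on a blank pipeline, so the rules are the only source of entities here; with a trained pipeline the ruler would sit alongside the statistical NER. Labels and patterns are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns([
    # Phrase pattern: exact string match.
    {"label": "ORG", "pattern": "Explosion AI"},
    # Token pattern: same syntax as the Matcher.
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("Explosion AI opened an office in San Francisco")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```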

# Combining spaCy models and matchers

Examples of NER extraction models:
- IBAN and account numbers
- Phone numbers
- Mentions in online comments
- Hashtags and emojis
- Expanding NERs (add title to person names...)
- Combining linguistic features (POS tags, dependencies) and named entities

 <a class="anchor" id="chapter5"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 5 – Word Vectors and Semantic Similarity</h1>

*Similar words occur in similar contexts*

- **distributional semantics**: information about the context of the target word
- **semantic similarity** methods   
  word vector computations: distance calculation, analogy calculation, visualization
- **text vectorization**
  
  
- Technical requirements: `NumPy`, `scikit-learn`, `Matplotlib`

# Understanding word vectors

- Sentence can be represented as a `(N, V)` matrix (`N` number of words, `V` vocabulary size)
  
  
- Simplest possible implementation
  - **one-hot encoding**
  - very sparse vectors: only one '1' and all others are '0's
  - solution: see next
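
The `(N, V)` one-hot matrix described above can be built in a few lines of NumPy. This is a toy sketch: the five-word vocabulary is invented for illustration.

```python
import numpy as np

sentence = ["i", "visited", "england"]
# Toy vocabulary: the sentence words plus two extra words.
vocab = sorted(set(sentence) | {"london", "went"})
word_to_id = {word: idx for idx, word in enumerate(vocab)}

# One row per word (N), one column per vocabulary entry (V).
one_hot = np.zeros((len(sentence), len(vocab)), dtype=int)
for row, word in enumerate(sentence):
    one_hot[row, word_to_id[word]] = 1

print(one_hot)
```

Each row contains a single `1`, which is why the representation becomes very sparse as the vocabulary grows.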

### Word vectors

- fixed-size, dense, real-valued
- vector = learned representation of the text    
    semantically similar words have similar vectors
- = **distributional semantics**
- ex: GloVe
- [word vector visualizer at TensorFlow](https://projector.tensorflow.org/)
  
  
- can capture synonyms, antonyms, semantic categories (animals, places, plants...)

### Analogies and vector operations

- addition, subtraction
- word analogy = semantic relationship between pairs of words (synonyms, antonyms, whole-part relations)
  - ex: Queen - woman VS King - man
  
- Example operation: `king - man + woman = queen`
  - subtract `man` and add `woman`
  
<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2001.44.57.png" width="500">
  

### How word vectors are produced

Most popular pre-trained vectors and how they are produced:

- **word2vec**: Google
  - download [here](https://developer.syn.co.in/tutorial/bot/oscova/pretrained-vectors.html#word2vec-and-glove-models)
  - read details about [algorithm and data preparation steps](https://jalammar.github.io/illustrated-word2vec/)
  
  
- **GloVe**: Stanford NLP group
  - depends on Singular Value Decomposition (SVD) applied to word co-occurrences matrix
  - [comprehensive guide](https://www.youtube.com/watch?v=Fn_U2OG1uqI)
  - [download pretrained vectors](https://nlp.stanford.edu/projects/glove/)
  
  
- **FastText**: Facebook Research
  - similar to word2vec but offers more: also predicts vectors for each subword,   
    so it is more robust for misspelled words, rare words, words not in a proper lexicon
  - 157 languages
  - [download](https://fasttext.cc/docs/en/crawl-vectors.html)
  
  
Most are trained on a **huge corpus** such as Wikipedia, news or Twitter

# spaCy's pretrained vectors

- Part of many of spaCy's language models
- `sm` (small) models: 
  - no word vectors included, context-sensitive tensors instead
  - can still make semantic calculations, but not as accurate as word vector computations
  
  
- `token.vector`:
  - returns a NumPy `ndarray`
  - can be used with NumPy methods
  
  
- `doc.vector` & `span.vector`:
  - vector is an average of its word vectors
  - `doc[1:3].vector`
  
  
- Only words in the model's vocabulary have vectors: `token.has_vector` VS `token.is_oov`

### The `similarity` method

- In spaCy, every Container object has a `similarity()` method
- to calculate semantic similarity with other Container objects
- by comparing their word vectors
- even if different types of Containers
  - ex: compare `Token` and `Doc`, `Doc` and `Span`

In [None]:
doc1 = nlp('I visited England.')
doc2 = nlp('I went to London.')

doc1[1:3].similarity(doc2[1:4])    # compare Span objects
doc1[2].similarity(doc2[3])        # compare Token objects

doc1.similarity(doc1)              # compare object to itself -> returns 1.0

- **Visualization** of words similarity:
  - See code with `matplotlib` (p. 152, zoom lecture 49%)
  - 2 word groups

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2002.13.53.png" width="500">

# Using third party word vectors

- Import a third-party vector package into spaCy
- See book: example with `fastText`

# Advanced semantic similarity methods

### Understanding semantic similarity

- What do the similarity scores mean?
- Differences between distance metrics: **Euclidean** VS **Cosine**
  - Vector orientation
  - Vector magnitude
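
The difference between the two metrics shows up on small toy vectors. In this sketch the vectors `a`, `b` and `c` are made-up 2D examples: `b` points in the same direction as `a` but is twice as long, while `c` has a similar length but a different direction.

```python
import numpy as np


def cosine_similarity(u, v):
    """Depends on orientation only: the cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def euclidean_distance(u, v):
    """Depends on both orientation and magnitude."""
    return float(np.linalg.norm(u - v))


a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])  # same orientation as a, larger magnitude
c = np.array([2.0, 1.0])  # different orientation, same magnitude as a

print(cosine_similarity(a, b), euclidean_distance(a, b))
print(cosine_similarity(a, c), euclidean_distance(a, c))
```

By cosine similarity `b` is the closest to `a`, while by Euclidean distance `c` is closer, which is the orientation versus magnitude trade-off listed above.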

### Categorizing text using semantic similarity

- Categorize into different topics, categories
- Or spot relevant texts
  
  
- Example: 
  - eCommerce, search comments about 'perfume'
  - compare 'perfume' vector with comment vectors
  - problem: texts can be very long
  - **solution**: extract key phrases in the sentence

### Method 1: Extracting key phrases

- Extract only important words and phrases
- And compare them to the search key

### Method 2: Extracting and comparing named entities

- Extract only proper nouns
- Can help determine sentence topic

 <a class="anchor" id="chapter6"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 6 – Putting everything together: Semantic Parsing with spaCy</h1>

NLU system:
  
  
1. NER: 
  - method 1: spaCy Matcher
  - method 2: walking the dependency tree
  
  
2. Determine intent: based on syntactic relations
  - method 1: extract verb and direct objects
  - method 2: walking the dependency tree, recognizing multiple intents
  
  
3. Keywords matching: semantic similarity
  - method 1: compare keywords to synonyms list, to detect semantic similarity
  - method 2: compare keywords using vector-based semantic similarity methods
  
  
- Technical requirements: `pandas`

# Extracting named entities

### Get to know your corpus

- Named Entities play a key role in understanding a user utterance
  
  
- **Your corpus**:
  - What kind of utterances? (long texts, short)
  - What kinds of entities? (times and dates, people names, city, country names, organizations...)
  - How is punctuation? (correctly punctuated, no punctuation...)
  - How are grammatical rules followed? (capitalization correct, misspelled words, correct syntax)
  - More: number of utterances per intent
  - See book for code example
  
  
- Integrate observations from the corpus into your code
  
  
- **Example:** ATIS dataset (benchmark dataset for intent classification):
  - book a flight
  - get info about a flight: flight costs, destinations, timetables


### Extract entities with `Matcher`

- See book for code example

### Extract entities using the dependency tree

- example: 'to' + GPE
- See book for code example

# Using dependency relations for intent recognition

- Each intent = VERB + OBJECT
- E.g. *book flight, purchase meal*
  
  
1. Extract transitive verbs and their direct objects from utterances
2. Detect intent:
  - by finding synonyms of verbs and their direct objects
    - e.g. *book a flight*
  - by using wordlists: sometimes intent info is not just in the verb phrase
    - e.g. *make a reservation for a flight*, *would like to*, *need to*
  - by using semantic similarity methods
  - multiple intent: conjunctions, more than one verb in the sentence
  
  
`findFlight`, `bookFlight`, `cancelFlight`, `bookMeal`
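
The VERB + OBJECT idea can be sketched without a trained parser by hand-annotating a `Doc`. This assumes spaCy v3, where `Doc` accepts `heads`, `deps` and `pos` and the head values are absolute token indices. The annotations below are written by hand for illustration; a real pipeline would predict them.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written parse of "I want to book a flight" (assumed, not model output).
words = ["I", "want", "to", "book", "a", "flight"]
heads = [1, 1, 3, 1, 5, 3]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "aux", "xcomp", "det", "dobj"]
pos = ["PRON", "VERB", "PART", "VERB", "DET", "NOUN"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)

# Intent = verb + its direct object, e.g. "book" + "flight" -> "bookFlight".
intents = [
    token.head.text + token.text.capitalize()
    for token in doc
    if token.dep_ == "dobj" and token.head.pos_ == "VERB"
]
print(intents)
```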

# Semantic similarity methods for semantic parsing

- Users use a fairly wide set of phrases and expressions for each intent


- Chatbot building platforms:
  - RASA (https://rasa.com/) 
  - Dialogflow (https://dialogflow.cloud.google.com/)
  
  
- 2 ways to recognize semantic similarity:
  - with a synonyms dictionary for semantic similarity
  - with word-vector based semantic similarity methods

### Using synonyms list for semantic similarity

- Semantic groups: different verbs express the same action
  - E.g. landing, arriving, flying to / departing, leaving, flying from
  - book, make a reservation, buy, reserve
   
   
- In most cases: intent = TRANSITIVE VERB + DIRECT OBJECT
  - Check if verb and nouns are synonyms between 2 utterances
  
  
- Each synonym set = **synset** = set of synonyms for our domain
  - language general synonyms (e.g. *plane*, *airplane*)
  - domain-specific synonyms (e.g. *buy*, *book*)
 
 
- Synonym lists are very handy for very specific domains   
  - synonyms list fairly small
  - but can become inefficient for big synsets   
    have to look up the whole table each time
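
A synonym table for this kind of comparison could be sketched in plain Python. The synsets and the helper names `canonical` and `same_intent` are hypothetical, and the linear scan deliberately mirrors the "look up the whole table" cost mentioned above.

```python
# Toy domain synsets: canonical word -> set of synonyms (illustrative only).
SYNSETS = {
    "book": {"book", "reserve", "buy", "purchase"},
    "flight": {"flight", "plane", "airplane"},
}


def canonical(word):
    """Return the synset head for a word, or None if the word is unknown."""
    for head, synonyms in SYNSETS.items():  # scans the whole table
        if word.lower() in synonyms:
            return head
    return None


def same_intent(pair_a, pair_b):
    """Compare two (verb, object) pairs after mapping words to their synset heads."""
    heads_a = [canonical(word) for word in pair_a]
    heads_b = [canonical(word) for word in pair_b]
    return None not in heads_a and heads_a == heads_b


print(same_intent(("reserve", "plane"), ("book", "flight")))
print(same_intent(("cancel", "flight"), ("book", "flight")))
```

For large synsets, inverting the table once into a `word -> head` dictionary would replace the scan with a single lookup.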
    
    
**SEE:** book for example code

### Using word vectors to recognize semantic similarity

- synonym lists work well for very specific domains,   
  where the synonyms list stays fairly small
- for bigger synsets, calculate the similarity between the verbs' vectors instead
  
  
**SEE:** book for example code

# Putting it all together

**SEE:** book for example code

 <a class="anchor" id="section3"></a>

<h1 style="font-size:13px; font-weight:bold; background:#eebbcc;padding: 15px;">SECTION 3 – Machine Learning with spaCy</h1>

 <a class="anchor" id="chapter7"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 7 – Customizing spaCy models</h1>

- train, store and use custom pipeline components
- when we should perform custom model training
  
  
#### Fundamental steps of model training:
1. **Collect** data
  
  
2. **Preprocess** data into a format recognized by spaCy
  
  
3. **Label** your own data   
   using a data annotation tool **Prodigy**
   
   
4. **Model**: build a statistical component:
   - either update an existing statistical pipeline component with your own data    
     an example with spaCy's NER component
   - or create a statistical pipeline component from scratch,   
     with your own data and labels


# Getting started with data preparation

- Customize statistical models for our own custom domain and data
- spaCy models are good for general NLP: understanding sentence syntax, extracting some entities
- but the pretrained models didn't see some very specific domains during training
  
  
- Examples: 
  - Twitter texts, with hashtags, mentions and emoticons, phrases instead of full sentences
    - spaCy's POS tagger wouldn't perform as well, since it was trained on grammatically correct sentences
  - Medical domain
    - many domain-specific named entities (drugs, diseases, chemical compound names)

- When should you do custom training?
  - Do spaCy models perform well enough on your data?
  - Does your domain include many labels that are absent in spaCy models?
  - Is there a pre-trained model/application in GitHub or elsewhere already?   
    (don't reinvent the wheel!)

### Do spaCy models perform well enough on your data?

No need for new model from scratch if:
- above **0.75** accuracy?   
  -> train the NER model further to recognize entities as you want
- you need 1 or 2 new entity types?   
  -> add them to the model with `EntityRuler`

### Does your domain include many new labels?

- Ex: medical domain, very specialized and long list of entities   
  -> custom model training: update existing one or new from scratch
  
  
1. Collect data:
  - how much you need:
    - depends on complexity of task and domain
    - start with an acceptable amount, train your model, and see how it performs
    - then add more data if necessary, and retrain
    
    
2. Annotate data: in spaCy input format
  
  
3. Update training of an existing model:   
   - if entities in the existing model but bad performance on your data   
     -> update model with your own data
   - if entities not in the model at all   
     -> train custom model from scratch

# Annotating and preparing data

Collect:
  - collect user logs
  - save `JSON` format (spaCy input format)
  
  
Annotate:
  - intents
  - entities
  - POS tags
  - etc.
  - see book: `JSON` annotated example
  - Use Prodigy (spaCy's tool, not free) or Brat (free)
  
  
spaCy training data format:
  - annotate raw data
  - then convert each utterance into an `Example` object
  - see book: example code
  - (different training format for the dependency parser)
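
Converting an annotated utterance into an `Example` object might look like the sketch below. It assumes spaCy v3 (`spacy.training.Example`); the utterance, the character offsets and the `LOC` label are illustrative and not the book's data.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

text = "Navigate to Oxford Street"
# Entities are (start_char, end_char, label) triples over the raw text.
annotations = {"entities": [(12, 25, "LOC")]}

# make_doc only tokenizes; the annotations become the reference (gold) Doc.
example = Example.from_dict(nlp.make_doc(text), annotations)
gold_entities = [(ent.text, ent.label_) for ent in example.reference.ents]
print(gold_entities)
```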

# Updating an existing pipeline component

- Train spaCy's NER model
  - with our own examples from **navigation** domain
  - because it doesn't label some of them with the correct entity type
  - we want to recognize some locations (street names, district names, 'home', 'work', 'office'...)
  
  
#### 3 steps:
  - "code block":  
    - disable other statistical components in the pipeline (including POS tagger and dependency parser)   
    - to train **only** the intended component
  - feed our domain examples to the training procedure
  - evaluate the new NER model
  
  
- also:
  - save updated NER model to disk
  - load updated NER model when need to use it

### 1. Disable other statistical models

- see book for disabling code

### 2. Model training procedure

- spaCy's NER model is a neural network model:
  - need to configure some **parameters** 
  - and provide **training examples**
  
  
- Each prediction of the neural network is a **sum of its weight** values;   
  hence, the training procedure **adjusts** the weights of the neural network with our examples
  
  
#### `epochs` parameter:

- Shuffling:
  - we go over the training *several times*
  - because showing each training example once is not enough
  - at each iteration, we shuffle the training data, so that their order does not matter
  - shuffling helps training the neural network thoroughly
  
  
- In each `epoch`, training code updates the weights of the nn with a small number
  - then a **loss** value is calculated
  - by comparing the actual label with the nn's current output
  - then **optimizer** function updates nn's weight
  - with respect to this loss value
  - see book for code example: 
    - with SGD as optimizer (Stochastic Gradient Descent, iterative algorithm, used to minimize a function)
    - starts from a random point on the loss function,   
      and travels down its slope in steps,   
      until it reaches the lowest point of that function

### 3. Evaluate updated NER model

- Test if model recognizes our domain utterances,
- and didn't just memorize them

See book for code example

### 4. Save and load custom models

See book for code example

# Training a pipeline component from scratch

- Start with a small dataset to understand the training procedure
- Then work with real-world dataset

See book for code example

 <a class="anchor" id="chapter8"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 8 – Text Classification with spaCy</h1>

# The basics of text classification

- Text classification = 
  - supervised machine learning task
  - assign set of predefined class labels to texts
  - training dataset = text-class label pairs
  
  
- Classes: defined based on your data
  - e.g. customer reviews, classes = [positive, negative, neutral]
  
  
- Class types:
  - class labels: **categorical** (strings) or **numerical** (numbers)
  
  
- Three categories of text classification depending on number of classes:
  - **Binary**: 2 classes
  - **Multiclass**: more than 2 classes, mutually exclusive (each text only one class)
  - **Multilabel**: more than 2 classes, each text can be assigned more than one   
    (e.g. levels of toxicity, insult+threat+obscene)
  
  
- Used to **understand customer intent**
  
  
- Most common types of text classification and their use cases:
  - Topic detection
  - Language detection
  - Sentiment analysis (see example image)
  
<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2010.16.08.png" width="400">

# Training the spaCy text classifier

- spaCy's text classifier: `TextCategorizer` class, based on a neural network
- Optional and trainable pipeline component, with text-label pairs as training dataset
  
  
- First, add `TextCategorizer` to the NLP pipeline
  - comes after the essential components
- Then do the **training** procedure

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2010.26.20.png" width="450">


### Getting to know the `TextCategorizer` class

- Available as:
  - **single-label** classifier (predicts one class per example),
  - or **multilabel** classifier (predicts more than one class per example)
  
  
- Classifier's parameters:
  - **threshold** value: class assigned to a text if its probability is higher than the threshold (0.5, or more for more confidence)
  - **model name**
  
  
See book for example code
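
Adding a multilabel classifier and applying a threshold could be sketched like this. It assumes the spaCy v3 component name `textcat_multilabel` and its `threshold` config key; the labels come from the toxicity example earlier in the chapter, and the scores passed to the hypothetical helper `labels_above` are invented, since the component here is untrained.

```python
import spacy

nlp = spacy.blank("en")
# "textcat" would give the single-label variant instead.
textcat = nlp.add_pipe("textcat_multilabel", config={"threshold": 0.5})
for label in ("insult", "threat", "obscene"):
    textcat.add_label(label)


def labels_above(scores, threshold=0.5):
    """Keep every class whose probability is higher than the threshold."""
    return sorted(label for label, prob in scores.items() if prob > threshold)


# Made-up probabilities standing in for doc.cats after training.
predicted = labels_above({"insult": 0.91, "threat": 0.12, "obscene": 0.67})
print(nlp.pipe_names, predicted)
```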

### Formatting training data for the `TextCategorizer`

- E.g. label is `sentiment`, values: `0` or `1`
- See book for training dataset format, how to add labels, and code example
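
Text-label pairs for the classifier can be wrapped in `Example` objects with a `cats` dictionary. This sketch assumes spaCy v3 and the single `sentiment` label with values `0` or `1` mentioned above; the two review texts are invented.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# Illustrative raw data: (text, sentiment) with sentiment 0 or 1.
raw_data = [
    ("This movie was great", 1),
    ("Terrible plot and acting", 0),
]

examples = []
for text, sentiment in raw_data:
    annotations = {"cats": {"sentiment": float(sentiment)}}
    examples.append(Example.from_dict(nlp.make_doc(text), annotations))

print([example.reference.cats for example in examples])
```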

### Defining the training loop

- First, disable other pipe components, so that only new text classifier is trained
- Then ...

### Testing the new component

- See book for example code

### Training `TextCategorizer` for multilabel classification

- See book for example code

# Sentiment analysis with spaCy

- Exploring the dataset: plotting the class distribution, class imbalance!, etc.
- Training the `TextCategorizer` component

See book for example code

# Text classification with spaCy and Keras

- Blending **spaCy** with **TensorFlow** and its high-level API **Keras**
  
  
- **Keras** :
  - high-level deep learning API that can run on top of ML libraries such as TensorFlow, Theano and CNTK
  - Very popular in research and development world because:
    - supports rapid prototyping
    - provides user-friendly API to neural network architectures
    
    
- **TensorFlow 2** integrates **tf.keras** high-level API
  - developers can take advantage of Keras' user-friendliness + TensorFlow's low-level methods
    
    
- Layers:
  - **input layer**: first one
  - **hidden layers**
  - **output layer**: last one

### What is a layer?

- Each layer transforms the input vectors   
  and feeds them to the next layer   
  to get a final vector
  
  
- Keras provides all sorts of layers:
  - **input** layers:
    - send input data to the rest of the network
  - dense layers: 
    - transform the input of a given shape to the output shape we want   
    - e.g. 5-dimensional input into 1-dimensional output
  - **recurrent** layers: 
    - RNN, GRU, LSTM cells fully supported in Keras
  - **dropout** layers: 
    - dropout is a technique to prevent overfitting (when neurons memorize data instead of learning it)
    - dropout layers randomly select a given number of neurons   
      and set their weights to zero for the forward and backward passes   
      for one iteration
    - usually placed after dense layers
  - **embedding** layers
  - **activation** layers
  - and so on
  
  
- See book for example code
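
What a dense layer and a dropout layer compute can be illustrated in NumPy. This is a stand-in for the Keras layers, not Keras code: `dense` and `dropout` are hypothetical helper functions, and the dropout shown is the common "inverted" variant that zeroes activations and rescales the rest.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, weights, bias):
    """Map inputs of shape (batch, in_dim) to outputs of shape (batch, out_dim)."""
    return x @ weights + bias


def dropout(x, rate, rng):
    """Zero a random subset of activations and rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


x = np.ones((4, 5))                # batch of 4 five-dimensional inputs
weights = rng.normal(size=(5, 1))  # 5-dimensional input -> 1-dimensional output
bias = np.zeros(1)

hidden = dropout(x, rate=0.5, rng=rng)
output = dense(hidden, weights, bias)
print(hidden)
print(output.shape)
```

Dropout is only active during training; at inference time the layer passes its input through unchanged.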

### Sequential modeling with LSTMs

- LSTM = an RNN variation
  
  
- RNN: 
  - can process sequential data in steps
  - and capture info about the past sequence of elements,
  - by holding a **memory**, called **hidden state** (see figure)
  - in text data, inputs and outputs are not independent of each other
  - words depend on neighbor words
  - e.g. machine translation, word translation depends on what we predicted before
  
<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2014.59.33.png" width="450">
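
The hidden state update of a plain RNN can be written in a few lines of NumPy. This sketch assumes the standard recurrence `h_t = tanh(x_t W_x + h_{t-1} W_h + b)`; the dimensions and the random weights are arbitrary and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5

W_x = rng.normal(scale=0.1, size=(input_dim, hidden_dim))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b = np.zeros(hidden_dim)

inputs = rng.normal(size=(steps, input_dim))  # one input vector per time step
h = np.zeros(hidden_dim)                      # initial hidden state (the memory)

states = []
for x_t in inputs:
    # The new state depends on the current input and the previous state.
    h = np.tanh(x_t @ W_x + h @ W_h + b)
    states.append(h)

print(len(states), states[-1].shape)
```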
  
  
- **RNN** issues:
  - they forget some data back in the sequence
  - they have numerical stability issues due to chain multiplications, called **vanishing** and **exploding gradients** (see *Colah's* blog)
  
  
- **LSTMs** were invented to fix some computational problems of RNNs:
  - LSTM cells (see figure)
    - are slightly more complicated than RNN cells
    - but the logic of the computation is the same:
      - we feed one input word at each time step: `xi`
      - and LSTM outputs an output value at each time step: `hi`
      - input steps and output steps are the same as RNN counterparts
  - strong support in Keras for RNN variations: GRU and LSTM, + simple API for training RNNs
  - RNN variations: 
    - crucial for NLP,
    - language data is sequential by nature
      - e.g. text is a sequence of words or characters, speech is a sequence of sounds

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2015.04.19.png" width="450">
  

### Keras Tokenizer

- Step 1:
  - Tokenize each sentence   
  - and turn sentences into a sequence of words
- Step 2:
  - Create a vocabulary from the set of words   
  - words that are supposed to be recognized by the nn
- Step 3:
  - Vocab should assign an ID to each word
- Step 4:
  - Map word IDs to word vectors
  
  
- See book for example code
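
Steps 1 to 3 above can be mimicked in plain Python to see what a tokenizer produces. This is not the Keras `Tokenizer`: whitespace splitting, first-seen ID assignment and right-padding with `0` are simplifying assumptions made for this sketch.

```python
sentences = ["I love spaCy", "spaCy is fast and I love it"]

# Step 1: turn each sentence into a sequence of (lowercased) words.
tokenized = [sentence.lower().split() for sentence in sentences]

# Steps 2 and 3: build the vocabulary and assign an ID to each word.
# ID 0 is reserved for padding.
word_to_id = {}
for tokens in tokenized:
    for word in tokens:
        word_to_id.setdefault(word, len(word_to_id) + 1)

sequences = [[word_to_id[word] for word in tokens] for tokens in tokenized]

# Pad every sequence to the same length so they can be batched.
max_len = max(len(seq) for seq in sequences)
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]

print(word_to_id)
print(padded)
```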

### Embedding words

- We can now transform words into vectors
- **Embedding table**: a lookup table
  - each row: a word's word vector
  - row index: the word's ID
  
  
1. word -> word-ID
2. word-ID -> word vector

<img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2015.27.49.png" width="450">
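
The two lookups can be sketched with a plain list of rows. The vocabulary, the vector size and the random values are all made up for illustration; in a trained network the rows are learned weights.

```python
import random

random.seed(0)  # reproducible toy vectors

vocab = {"i": 1, "like": 2, "pizza": 3}  # word -> word ID (hypothetical toy vocabulary)
embedding_dim = 4

# One row per ID; row 0 is kept as an all-zero vector for padding.
embedding_table = [[0.0] * embedding_dim]
for _ in vocab:
    embedding_table.append([random.uniform(-1, 1) for _ in range(embedding_dim)])

def embed(sentence):
    """word -> word ID -> word vector, for every word in the sentence."""
    ids = [vocab[word] for word in sentence.lower().split()]
    return [embedding_table[i] for i in ids]

vectors = embed("I like pizza")
print(len(vectors), len(vectors[0]))  # 3 4
```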

### Neural Network architecture for Text Classification

- Step 1:
  - preprocess, tokenize, pad the sentences
  - output: list of sequences
  
  
- Step 2:
  - feed the list of sequences to the neural network, through the input layer
  
  
- Step 3:
  - vectorize each word by looking up its word ID in the embedding layer
  - a sentence is now a sequence of word vectors, each corresponding to a word
  
  
- Step 4:
  - feed the sequence of word vectors to LSTM
  
  
- Step 5: 
  - squash the LSTM output with a sigmoid layer
  - output: class probabilities
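
Two of the steps above are easy to try in isolation: padding (step 1) and the sigmoid squash (step 5). A minimal sketch, assuming right-padding with ID 0; note that Keras' own `pad_sequences` pads on the left by default, so `pad_to_length` here is a simplified, hypothetical helper:

```python
import math

def pad_to_length(sequences, max_len, pad_id=0):
    """Right-pad (or truncate) each list of word IDs to exactly `max_len` items."""
    return [seq[:max_len] + [pad_id] * (max_len - len(seq)) for seq in sequences]

def sigmoid(z):
    """Squash a real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

padded = pad_to_length([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], max_len=4)
print(padded)        # [[1, 2, 3, 0], [4, 5, 0, 0], [6, 7, 8, 9]]
print(sigmoid(0.0))  # 0.5
```

Padding gives every sentence the same length, which the input layer requires; the sigmoid turns the final score into a probability for the positive class.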
  
  
See book for code example:
- Dataset
- Data and vocabulary preparation
- input layer
- embedding layer
- LSTM layer
- compiling the model
- fitting the model and experiment evaluation
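
For orientation, the layer stack listed above might look roughly like this in Keras. This is a sketch, not the book's code: `vocab_size`, `max_len` and the layer widths are hypothetical, and a binary (two-class) task is assumed.

```python
from tensorflow.keras import Model, layers

vocab_size = 1000  # hypothetical: number of word IDs, including 0 for padding
max_len = 20       # hypothetical: padded sentence length

inputs = layers.Input(shape=(max_len,))                           # input layer: word IDs
x = layers.Embedding(input_dim=vocab_size, output_dim=64)(inputs) # word ID -> word vector
x = layers.LSTM(32)(x)                                            # reads the vector sequence
outputs = layers.Dense(1, activation="sigmoid")(x)                # class probability

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Fitting would then be a call to `model.fit` on the padded sequences and their labels.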

 <a class="anchor" id="chapter9"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 9 – spaCy and transformers</h1>

- Transformers: the latest hot topic in NLP
- How to use them with TensorFlow and spaCy
  
  
**1.** Transformers and transfer learning    
**2.** Understanding **BERT** (Bidirectional Encoder Representations from Transformers), architecture details used for transformers    
**3.** How **BERT Tokenizer** and **WordPiece** algorithms work    
**4.** How to quickly get started with pre-trained transformer models from **HuggingFace** library    
**5.** Transformers with TensorFlow and Keras: practice fine-tuning HuggingFace   
**6.** Transformers and spaCy: how spaCy integrates transformer models as pre-trained pipelines    
  
  
- Build state-of-the-art NLP models with just a few lines of code
- with the power of Transformer models and transfer learning
   
   
#### Technical requirements

- Install `transformers` and `tensorflow` Python libraries

```
pip install transformers
pip install "tensorflow>=2.0.0"
```


# Transformers and Transfer Learning

- **2017**: NLP milestone, release of research paper *Attention is all you need*, by Vaswani et al.
- introduced a new machine learning idea and architecture: transformers
- transformers aim to solve sequential modeling tasks   
  and target some problems of the LSTM architecture
   
   
- **Transfer learning**:
  - import knowledge from pre-trained word vectors or pre-trained statistical models
  - GloVe and fastText word vectors come pre-trained on large corpora such as Wikipedia
  - we used them directly for our semantic similarity calculations
  
  
- The HuggingFace Transformers library offers thousands of pre-trained models to perform NLP tasks (text classification, summarization, question answering, machine translation, NLG...)
  - in more than 100 languages
  - aim to make state-of-the-art NLP accessible to everyone
  - select a model suitable for your task
  
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2015.53.21.png" width="550">
  
  
- LSTM:
  - has difficulties with learning statistical dependencies in long text
    - as the steps pass, LSTM forgets about earlier time steps
  - is sequential: processes one word at each time step
    - parallelizing is not possible
    - performance bottleneck
    
    
- Transformers address these problems:
  - by not using recurrent layers at all
  - architecture in **2 parts**:
    - **Encoder**: processes the input (left in the figure)
    - **Decoder**: produces the output (right in the figure)
  - key building blocks:
    - **Multi-Head Attention** block
    - **Self-attention** mechanism
    - etc.
    <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2015.57.27.png" width="450">
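
A stripped-down sketch of the self-attention idea, assuming queries, keys and values are all the raw word vectors. A real transformer first applies learned projection matrices and runs several attention heads in parallel; both are omitted here. Each word's output is computed from the whole sentence at once, with no dependence on a previous time step:

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = `vectors`."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Compare this word with every word in the sentence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The output is a weighted average of all word vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return outputs

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy word vectors
attended = self_attention(sentence)
print(attended)
```

Because the loop body for one word never reads the result of another, the per-word computations can run in parallel, unlike the LSTM's step-by-step processing.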

# Understanding BERT

- **Bidirectional**: 
  - training on text data is bi-directional
  - meaning each word is encoded with context from both directions (left to right + right to left)
  
  
- **Encoder**:
  - encodes the input sentence
  
  
- **Representations**:
  - a word vector
  
  
- **Transformers**:
  - architecture is transformer-based
  
  
- **Input**: a sentence
- **Output**: a sequence of word vectors, contextual (word vector assigned to a word based on the input sentence)
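
Before BERT sees a sentence, the tokenizer breaks rare words into subword pieces (the WordPiece step listed in the chapter outline). A minimal sketch of the greedy longest-match-first splitting, assuming a tiny hypothetical vocabulary; how the vocabulary itself is learned is not covered here:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "token", "##ize", "##r"}  # hypothetical vocabulary
print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece_tokenize("tokenizer", toy_vocab))  # ['token', '##ize', '##r']
```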
  
  
ETC. ... ... see book

# Transformers and TensorFlow

etc.

# Transformers and spaCy

- spaCy v3.0 new feature: transformer-based pipelines

    <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-13%20a%CC%80%2016.04.19.png" width="450">

 <a class="anchor" id="chapter10"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 10 – Putting everything together: Designing your chatbot with spaCy</h1>

- entity extraction
- intent recognition
- context handling
- syntactic and semantic parsing
- text classification
  
  
1. Explore **dataset** used to collect linguistic information about utterances
2. Perform **NER** by combining spaCy **NER model** and spaCy's **`Matcher`**
3. Perform **intent recognition**: 2 different techniques:
    - pattern-based method
    - statistical text classification with TensorFlow and Keras:    
      train a character-level LSTM to classify utterance intents
4. Sentence- and dialog-level semantics:
    - anaphora resolution
    - grammatical question types
    - differentiating subjects from objects
    
    
Chapter goal: design a real **chatbot NLU pipeline**
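
As a taste of the pattern-based method for intent recognition, here is a small sketch with spaCy's `Matcher` on a blank English pipeline. The intent name `BOOK_TABLE`, its pattern and the helper `recognize_intents` are hypothetical examples, not taken from the book's dataset:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no pre-trained model download needed
matcher = Matcher(nlp.vocab)

# Hypothetical intent: "book a table", "book table", "Book the table", ...
pattern = [{"LOWER": "book"}, {"IS_ALPHA": True, "OP": "?"}, {"LOWER": "table"}]
matcher.add("BOOK_TABLE", [pattern])

def recognize_intents(utterance):
    """Return the names of all intent patterns found in the utterance."""
    doc = nlp(utterance)
    return sorted({nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)})

print(recognize_intents("I want to book a table for two"))  # ['BOOK_TABLE']
print(recognize_intents("What is the weather today?"))      # []
```

Patterns like this are quick to write but brittle, which is why the chapter pairs them with a statistical classifier.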

# Introduction to Conversational AI

d

# Entity Extraction

d

# Intent recognition

d