**Notes from:** Duygu ALTINOK - *Mastering spaCy*   

[Section 1](#section1): Getting started with spaCy

- [Chapter 1](#chapter1) – Getting started with spaCy
- [Chapter 2](#chapter2) – Core operations with spaCy

[Section 2](#section2): spaCy features

- [Chapter 3](#chapter3) – Linguistic features
- [Chapter 4](#chapter4) – Rule-based matching
- [Chapter 5](#chapter5) – Word Vectors and Semantic Similarity
- [Chapter 6](#chapter6) – Putting everything together: Semantic Parsing with spaCy

[Section 3](#section3): Machine Learning with spaCy

- [Chapter 7](#chapter7) – Customizing spaCy models
- [Chapter 8](#chapter8) – Text Classification with spaCy
- [Chapter 9](#chapter9) – spaCy and transformers
- [Chapter 10](#chapter10) – Putting everything together: Designing your chatbot with spaCy

 <a class="anchor" id="section1"></a>

<h1 style="font-size:13px; font-weight:bold; background:#eebbcc;padding: 15px;">SECTION 1 – Getting started with spaCy</h1>

<a class="anchor" id="chapter1"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 1 – Getting started with spaCy</h1>

Chatbots:
- Entity Extraction
- Intent recognition
- Context handling
- (= text classification)

Entity Linking:
- 'Diana Spencer' <-> 'Lady Diana'
- Semantic relations + General knowledge (See: Semantic Web on Wikipedia)
- Available in spaCy

spaCy and Deep Learning:
- ML library `Thinc`
- Wrappers for: PyTorch, TensorFlow, MXNet, and Hugging Face transformers
- **46** state-of-the-art models for **16** languages

Pretrained models:
- `fr_core_news_sm`
- `fr`: language code
- `core`: model capability
- `news`: corpus type (`web`, `twitter`...)
- `sm`: model size, large `lg`, medium `md`, small `sm`
- **Important**: match model genre to your text type
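The naming convention above can be unpacked mechanically. This is a minimal sketch that assumes the four-part `lang_capability_genre_size` pattern described in these notes; `ModelName` and `parse_model_name` are hypothetical helpers, not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """The four parts of a pretrained model name (hypothetical helper type)."""
    lang: str        # language code, e.g. "en"
    capability: str  # model capability, e.g. "core"
    genre: str       # corpus type, e.g. "web" or "news"
    size: str        # "sm", "md" or "lg"


def parse_model_name(name: str) -> ModelName:
    """Split a model name on underscores and label each part."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated parts, got {name!r}")
    return ModelName(*parts)


info = parse_model_name("en_core_web_sm")
print(info)
```

Checking `info.genre` against the kind of text you plan to process is one simple way to apply the "match model genre to your text type" advice.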

# Install spaCy

In [None]:
pip install spacy

# if you have multiple Python versions 
# and you want to use spacy with a specific version
pip3.5 install spacy

In [None]:
# check your spacy version
python -m spacy info

# upgrade spacy
pip install -U spacy

In [None]:
# install with conda
conda install -c conda-forge spacy

In [None]:
# install spacy on macOS/OS X
# - install Xcode IDE
# - then install command-line development tools:
xcode-select --install
# - then install spacy

## Install language models

spaCy's `download` command:
- selects and downloads the most compatible version of the model for your local spaCy version
- uses `pip` behind the scenes;   
  `pip` installs the package and places it in your `site-packages` directory (just like any other Python package)

See book p. 22: download a model via `pip` and `import` it as a module

In [None]:
python -m spacy download fr_core_news_md

# download a specific version
python -m spacy download fr_core_news_md-2.0.0 --direct

In [None]:
import spacy 
nlp = spacy.load('fr_core_news_md')  # then load the package
doc = nlp('Hello world')

# Visualization

See interactive demos:
- POS tags + Syntactic dependencies: https://explosion.ai/demos/displacy
- Named entities: https://explosion.ai/demos/displacy-ent

In [None]:
import spacy
from spacy import displacy
nlp = spacy.load('fr_core_news_md')
doc = nlp('Hello world')

# start the displacy web server:
displacy.serve(doc, style='dep')

# response with a link: http://0.0.0.0:5000
# = local address where displacy renders your graphics
# see p. 28 how to use another port
# click the link and navigate to the web page (localhost)
# Ctrl+C to shut down the displacy server and go back to Python shell

# use style='ent' for named entities

In [2]:
# see book p. 29:
#   displacy in Jupyter Notebook
#   displacy.render(doc, style='dep')

In [None]:
# see book p. 30:
#   export displacy renders in image files with python
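Exporting a render to a file can be sketched as follows. Assumptions: spaCy v3 is installed; the `Doc` is built by hand with an illustrative parse so that no model download is needed; the output file name `hello_dep.svg` is hypothetical. With `jupyter=False`, `displacy.render` returns the markup as a string instead of displaying it.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no pretrained model

# Hand-annotated dependency parse (illustrative assumption, not model output):
# "world" is the root and "Hello" attaches to it.
doc = Doc(nlp.vocab, words=["Hello", "world"], heads=[1, 1], deps=["intj", "ROOT"])

# Render to an SVG string rather than starting a server or displaying inline
svg = displacy.render(doc, style="dep", jupyter=False)

# Write the SVG markup to disk (hypothetical file name, in the temp directory)
output_path = Path(tempfile.gettempdir()) / "hello_dep.svg"
output_path.write_text(svg, encoding="utf-8")
print(output_path)
```

With a pretrained pipeline, the hand-built `Doc` would simply be replaced by `nlp(text)`.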

 <a class="anchor" id="chapter2"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 2 – Core operations with spaCy</h1>

# Processing steps

1. We create a spaCy pipeline object: `Language`, output of `spacy.load`
  
  
2. We apply the `nlp` pipeline on a text:
  - tokenizer -> `Doc` -> tagger -> `Doc` -> parser -> `Doc` -> entity recognizer -> `Doc`
  - each component returns a `Doc` and passes it to the next component   
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.42.08.png" width="600"> 
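The two steps above can be sketched without downloading a model. Assumption: `spacy.blank("en")` stands in for `spacy.load(...)`, so the pipeline starts with only a tokenizer, and the built-in `sentencizer` stands in for the tagger, parser and entity recognizer of a pretrained pipeline.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Step 1: create the pipeline object (a Language instance)
nlp = spacy.blank("en")
print(isinstance(nlp, Language), nlp.pipe_names)

# Add one built-in component; a pretrained pipeline ships with several
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# Step 2: apply the pipeline; the tokenizer creates the Doc,
# then each component receives it and returns it in turn
doc = nlp("Hello world. Bonjour le monde.")
print(isinstance(doc, Doc), len(doc))
```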

### Components

Each component corresponds to a spaCy class   
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.48.22.png" width="600">

### Containers

Container classes hold information about the text data   
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.51.04.png" width="600">
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.52.39.png" width="600">

### Global architecture

Processing pipeline **Components** (actions) + Data **Containers** (inputs and outputs)
  <img src="images/Capture%20d%E2%80%99e%CC%81cran%202022-07-12%20a%CC%80%2020.54.18.png" width="600">


In [None]:
import spacy
nlp = spacy.load('fr_core_news_md')
doc = nlp('Bonjour le monde !')

# Pipeline components

1. Tokenizer
  - [`Token`](https://spacy.io/api/token) -> `[token.text for token in doc]`
  - Based on language-specific rules
  - Can be customized
  - Debugging the tokenizer
  
  
2. Sentence segmentation
  - `doc.sents` -> `Token.is_sent_start`
  - Done by the dependency parser
  
  
3. Lemmatization
  - `token.lemma_`
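A minimal sketch of tokenization, tokenizer customization, tokenizer debugging and sentence segmentation. Assumptions: a blank English pipeline is used so no model is required; the rule-based `sentencizer` is a stand-in for the parser-based segmentation mentioned above; the special case for "lemme" is only an illustration. Lemmatization is left out because `token.lemma_` needs a pipeline with a lemmatizer.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based stand-in for parser-based segmentation

# Customizing the tokenizer: split "lemme" into two tokens
nlp.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])

text = "Hello world. lemme see!"
doc = nlp(text)

tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]
print(tokens)
print(sentences)

# Debugging the tokenizer: each tuple names the rule that produced a token
explained = nlp.tokenizer.explain(text)
for rule, substring in explained:
    print(rule, repr(substring))
```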

# Container classes

**[`Doc`](https://spacy.io/api/doc)**
  - `doc.text`: Unicode representation of the text
  - `doc.sents`: sentences
  - `doc.ents`: named entities
  - `doc.lang`: language id  
    `doc.lang_`: language Unicode string
  
  
**[`Token`](https://spacy.io/api/token)**
  - `token.is_sent_start`: sentence start
  - `token.is_stop`: is a stop word
  - `token.lemma` + `token.lemma_`
  - `token.ent_type_`
  - `token.dep_` + `token.head`
  - `token.is_oov`: out of vocabulary
  
  
**[`Span`](https://spacy.io/api/span)**

Use: `dir(doc)`, `dir(token)`, `dir(span)` to see all object attributes
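A short sketch of the three containers, assuming a blank English pipeline. Attributes that depend on trained components (`lemma_`, `ent_type_`, `dep_`) would be empty here, so only attributes available from the tokenizer and vocabulary are shown.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I need to fly to Rome")

# Doc: the whole text
print(doc.text, doc.lang_)

# Span: a slice of the Doc (tokens 4 and 5)
span = doc[4:6]
print(type(span).__name__, span.text, span.start, span.end)

# Token: one element of the Doc
for token in doc:
    print(token.i, token.text, token.is_stop, token.is_alpha)

# Listing the public attributes of a token, as suggested above
public_attrs = [name for name in dir(doc[0]) if not name.startswith("_")]
print(len(public_attrs))
```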

 <a class="anchor" id="section2"></a>

<h1 style="font-size:13px; font-weight:bold; background:#eebbcc;padding: 15px;">SECTION 2 – spaCy features</h1>

 <a class="anchor" id="chapter3"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 3 – Linguistic features</h1>

# POS Tagging

- `token.pos` (int)   
  `token.pos_`: Unicode universal tags
- `token.tag` (int)   
  `token.tag_`: fine-grained tags
- `spacy.explain(tag_name)`
- `lang/<language_code>/tag_map.py` under each language submodule
  
  
- Verb, Noun, Pronoun, Determiner, Adjective, Adverb, Preposition, Conjunction, Interjection
- Same language can support different tagsets
- See:
  - http://partofspeech.org/,
  - [The eight parts of speech](http://www.butte.edu/departments/cas/tipsheets/grammar/parts_of_speech.html)
  
  
- POS taggers: sequential models, **Seq2seq**, spaCy uses an **LSTM** variation  
  (see state-of-the-art of POS Tagging on ACL website)
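The tag attributes can be inspected on a small hand-tagged example. Assumptions: spaCy v3, where `Doc` accepts `pos` and `tags` at construction time; the tags below are written by hand for illustration and are not tagger output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-assigned tags (assumption: a trained tagger would normally set these)
words = ["I", "flew", "to", "Rome"]
pos = ["PRON", "VERB", "ADP", "PROPN"]   # universal tags
tags = ["PRP", "VBD", "IN", "NNP"]       # fine-grained tags
doc = Doc(nlp.vocab, words=words, pos=pos, tags=tags)

for token in doc:
    # token.pos / token.tag are integer IDs; the underscore versions are strings
    print(token.text, token.pos, token.pos_, token.tag_, spacy.explain(token.tag_))
```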

### Word Sense Disambiguation (WSD)

- Sometimes, POS tags can help identify a specific sense of a word
- Example: 
  - beat – strike someone – V
  - beat – defeat someone – V
  - beat – rhythm in music and poetry – N
  - beat – bird wing movement – N
  - beat – completely exhausted – ADJ
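One way to picture this is a lookup keyed by lemma and POS tag. The sketch below is a toy: `SENSES` is a hypothetical sense inventory built from the list above, and `candidate_senses` is a hypothetical helper. The POS tag narrows the candidates but does not always pick a single sense.

```python
# Hypothetical sense inventory, keyed by (lemma, universal POS tag)
SENSES = {
    ("beat", "VERB"): ["strike someone", "defeat someone"],
    ("beat", "NOUN"): ["rhythm in music and poetry", "bird wing movement"],
    ("beat", "ADJ"): ["completely exhausted"],
}


def candidate_senses(lemma: str, pos: str) -> list:
    """Return the senses still possible once the POS tag is known."""
    return SENSES.get((lemma, pos), [])


print(candidate_senses("beat", "ADJ"))   # one sense left: fully disambiguated
print(candidate_senses("beat", "VERB"))  # two senses left: more context needed
```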

### Natural Language Understanding (NLU)

- Verb **tense** and **aspect** can help identify intent
- Example:
  - I flew to Rome 3 days ago. I still didn't get the bill, please send it ASAP.
  - I need to fly to Rome
  - I will fly to Rome next week. Check availabilities please.
  - I'm flying to Rome next week. Check flights on next Tuesday.
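A rough sketch of how tense and aspect could feed intent recognition. Assumptions: the sentences are tagged by hand with Penn Treebank style tags rather than by a tagger, and `guess_time_frame` with its three rules is a hypothetical heuristic, not a method from the book.

```python
def guess_time_frame(tagged_tokens):
    """Map verb tags to a coarse time frame (hypothetical heuristic)."""
    tags = [tag for _, tag in tagged_tokens]
    has_will = any(
        tag == "MD" and word.lower() in {"will", "'ll"}
        for word, tag in tagged_tokens
    )
    if "VBD" in tags:   # past tense: the trip already happened
        return "past"
    if has_will:        # modal "will": the trip is in the future
        return "future"
    if "VBG" in tags:   # progressive aspect: the trip is already planned
        return "planned"
    return "unspecified"


# Hand-tagged examples (assumption: a POS tagger would normally supply the tags)
flew = [("I", "PRP"), ("flew", "VBD"), ("to", "IN"), ("Rome", "NNP")]
will_fly = [("I", "PRP"), ("will", "MD"), ("fly", "VB"), ("to", "IN"), ("Rome", "NNP")]
flying = [("I", "PRP"), ("'m", "VBP"), ("flying", "VBG"), ("to", "IN"), ("Rome", "NNP")]
need = [("I", "PRP"), ("need", "VBP"), ("to", "TO"), ("fly", "VB"), ("to", "IN"), ("Rome", "NNP")]

for sentence in (flew, will_fly, flying, need):
    print(guess_time_frame(sentence))
```

A chatbot could route "past" to billing questions and "future" or "planned" to booking, which is the distinction the examples above illustrate.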

### Numbers, Symbols and Punctuation

- `NUM`, `SYM`, `PUNCT` (cf. book p. 78 fine-grained punctuation tags)
- Can be used in rule-based matching to recognize financial info...
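A minimal sketch of matching a money amount through `SYM` and `NUM` tags. Assumptions: spaCy v3 (`Matcher.add` takes a list of patterns); the `Doc` is tagged by hand so no model is needed; the pattern name `AMOUNT` is arbitrary.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-tagged sentence (assumption: a tagger would normally set the POS tags)
words = ["It", "costs", "$", "25", "."]
spaces = [True, True, False, False, False]  # whether a space follows each token
pos = ["PRON", "VERB", "SYM", "NUM", "PUNCT"]
doc = Doc(nlp.vocab, words=words, spaces=spaces, pos=pos)

# A symbol immediately followed by a number
matcher = Matcher(nlp.vocab)
matcher.add("AMOUNT", [[{"POS": "SYM"}, {"POS": "NUM"}]])

amounts = [doc[start:end].text for _, start, end in matcher(doc)]
print(amounts)
```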

# Dependency Parsing

d

# NER

d

 <a class="anchor" id="chapter4"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 4 – Rule-based matching</h1>

# Token-based matching

d

# PhraseMatcher

d

# EntityRuler

d

# Combining spaCy models and matchers

d

 <a class="anchor" id="chapter5"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 5 – Word Vectors and Semantic Similarity</h1>

# Word vectors

d

# spaCy's pretrained vectors

d

# Third party word vectors

d

# Advanced semantic similarity methods

d

 <a class="anchor" id="chapter6"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 6 – Putting everything together: Semantic Parsing with spaCy</h1>

# Extracting named entities

d

# Using dependency relations for intent recognition

d

# Semantic similarity methods for semantic parsing

d

# Putting it all together

d

 <a class="anchor" id="section3"></a>

<h1 style="font-size:13px; font-weight:bold; background:#eebbcc;padding: 15px;">SECTION 3 – Machine Learning with spaCy</h1>

 <a class="anchor" id="chapter7"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 7 – Customizing spaCy models</h1>

# Getting started with data preparation

d

# Annotating and preparing data

d

# Updating an existing pipeline component

d

# Training a pipeline component from scratch

d

 <a class="anchor" id="chapter8"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 8 – Text Classification with spaCy</h1>

# The basics of text classification

d

# Training the spaCy text classifier

d

# Sentiment analysis with spaCy

d

# Text classification with spaCy and Keras

d

 <a class="anchor" id="chapter9"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 9 – spaCy and transformers</h1>

# Transformers and Transfer Learning

d

# Understanding BERT

d

# Transformers and TensorFlow

d

# Transformers and spaCy

d

 <a class="anchor" id="chapter10"></a>

<h1 style="font-size:34px; font-weight:bold; background:#DDEEEE;padding: 15px;">Chapter 10 – Putting everything together: Designing your chatbot with spaCy</h1>

# Introduction to Conversational AI

d

# Entity Extraction

d

# Intent recognition

d