# Introduction to NLP

## What is NLP?

> Natural language processing (NLP) is a subfield [...] concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data
>
> Patel, A. & Arasanipalai, A. (2021). _Applied Natural Language Processing in the Enterprise._ O'Reilly.

Major applications include:
  - Machine translation
  - Speech recognition
  - Question answering
  - Text summarization
  - Chatbots
  - Sentiment analysis

Multiple approaches:
  - Rule-based
  - Classical statistics
  - Neural networks

## Major Concepts

### Tokenization

Tokenization is the process of splitting text into its atoms: words, punctuation, symbols, numbers, etc. For example:

![](assets/sentence.png)

can be tokenized as

![](assets/tokens.png)
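As a rough illustration of the idea, here is a minimal tokenizer sketch built on a regular expression. This is not how spaCy tokenizes (its tokenizer uses language-specific rules and exceptions); `simple_tokenize` is a hypothetical helper that only separates runs of word characters from individual punctuation marks.

```python
import re


def simple_tokenize(text):
    """Split text into word-like tokens and single punctuation marks."""
    # \w+ grabs runs of letters/digits; [^\w\s] grabs one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("I like trees, kangaroos and narwhals.")
print(tokens)
```

Even this toy version shows why tokenization is more than splitting on whitespace: the comma and the period have to be peeled off the words they are attached to.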

### Part-of-Speech Tagging

Part-of-speech (POS) tagging is, as the name suggests, the process of assigning a part of speech to each token: noun, verb, adverb, adjective, etc.

![](assets/pos.jpg)

### Dependency Parsing

Dependency parsing involves labeling the relationships between individual tokens within a span. In the example above, we might specify that "the apartment" is the object of the verb "robbed".

### Chunking

Chunking groups multiple related tokens into a single unit. For example, "New York City" could be treated as a single token despite being three separate words.
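To make the "New York City" case concrete, here is a small sketch that merges a known phrase into one token. The `merge_phrase` helper is hypothetical and assumes the phrase list is known in advance; real chunkers typically rely on POS tags or a parser rather than a fixed lookup.

```python
def merge_phrase(tokens, phrase):
    """Merge every occurrence of `phrase` (a list of tokens) into one token."""
    merged = []
    i = 0
    n = len(phrase)
    while i < len(tokens):
        if tokens[i:i + n] == phrase:
            # Found the phrase: emit it as a single token and skip past it
            merged.append(" ".join(phrase))
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "moved", "to", "New", "York", "City", "last", "year"]
chunked = merge_phrase(tokens, ["New", "York", "City"])
print(chunked)
```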

### Named Entity Recognition

Named entity recognition is the process of assigning labels to known entities: names, locations, dates, currencies, and so forth.

### Lemmatization

Language is weird! It's fun. Except for NLP. Because we need to understand that "am", "will be", "was", and "have been" are all actually just different conjugations of the same verb.

Lemmatization refers to this process: instead of having to work with all word variants, we can just work directly with the word's base to simplify things a bit.
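One simple way to picture this is a lookup table from word forms to their base form. The sketch below is only an illustration: `LEMMA_TABLE` is a tiny hand-written table (an assumption, not real linguistic data), and it handles single tokens only, so multi-word forms such as "will be" would need extra logic.

```python
# Hypothetical, hand-written table covering a few forms of "to be"
LEMMA_TABLE = {
    "am": "be",
    "is": "be",
    "are": "be",
    "was": "be",
    "were": "be",
    "been": "be",
}


def lemmatize(token):
    """Return the base form if we know it, otherwise the lowercased token."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


lemmas = [lemmatize(t) for t in ["I", "was", "happy", "and", "am", "happy"]]
print(lemmas)
```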

### Stemming

Stemming is a similar process to lemmatization, but a little bit simpler. Here, we're just reducing words to their stems (so creative in what we name these things). For example, "big", "bigger", and "biggest" would all be stemmed to "big-".

We use stemming as a bit of a shortcut since it's substantially cheaper than lemmatization: we don't need to understand the part of speech, or any other metadata about the word or chunk. We can just take a hammer to things and break off all the bits at the end of the word that we don't like.
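The "hammer" approach can be sketched in a few lines. This is a deliberately crude suffix stripper, not the Porter algorithm or any other published stemmer; the suffix list and the doubled-consonant rule are assumptions chosen just to make the "big" example work.

```python
# Hypothetical suffix list, checked in order (longest comparative first)
SUFFIXES = ("est", "er", "ing", "ed", "s")


def crude_stem(word):
    """Chop one known suffix off the end of a word, with no grammar knowledge."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(stem) >= 3:
            # Collapse a doubled final consonant left by the cut ("bigg" -> "big")
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


stems = [crude_stem(w) for w in ["big", "bigger", "biggest"]]
print(stems)
```

Notice that nothing here looks at part of speech or context, which is exactly why stemming is cheap and also why it makes mistakes a lemmatizer would avoid.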

# NLP with spaCy

In [None]:
# Import spaCy and create the English nlp object
import spacy

nlp = spacy.blank("en")

# Process the text
doc = nlp("I like trees, kangaroos and narwhals.")

print([elem for elem in doc])

In [None]:
print(doc[2:5].text)

## Lexical Attributes

In [None]:
import spacy

nlp = spacy.blank("en")

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia "
    "were in extreme poverty. Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

In [None]:
print("is_alpha:", [token for token in doc if token.is_alpha])
print("is_punct:", [token for token in doc if token.is_punct])
print("like_num:", [token for token in doc if token.like_num])
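Lexical attributes such as `like_num` don't need a trained pipeline, so they can be explored with a blank one. The sketch below assumes spaCy is installed and contrasts `like_num` (which, for English, also recognizes spelled-out numbers) with `is_digit` (digits only); the sentence is made up for illustration.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only
nlp = spacy.blank("en")
doc = nlp("More than 60% paid ten dollars.")

# like_num is a looser check than is_digit
like_num = [token.text for token in doc if token.like_num]
is_digit = [token.text for token in doc if token.is_digit]

print("like_num:", like_num)
print("is_digit:", is_digit)
```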

## Trained Pipelines

- Models that enable spaCy to predict linguistic attributes in context
  - Part-of-speech tags
  - Syntactic dependencies
  - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

We can also predict syntactic dependencies with this pipeline:

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

![](assets/dependency-labels.png)

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

In [None]:
print(f'A GPE entity includes {spacy.explain("GPE")}')

## Rule-Based Matching

### Rules vs regular expressions

Rules let us:
  - Match on Doc objects, not just strings
  - Match on tokens and token attributes
  - Use a model's predictions
    - Example: "duck" (verb) vs. "duck" (noun)

### Match patterns

  - Lists of dictionaries, one per token
  - Match exact token texts. E.g.,

```
[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
```

  - Match lexical attributes. E.g.,

```
[{'LOWER': 'iphone'}, {'LOWER': 'x'}]
```

  - Match any token attributes. E.g.,

```
[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
```
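To see the difference between matching exact text and matching a lexical attribute, here is a small sketch that runs both kinds of pattern over the same text. It assumes spaCy is installed and uses a blank pipeline, which is enough because `TEXT` and `LOWER` don't depend on model predictions. The pattern names and the sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# Exact text: only the literal tokens "iPhone" followed by "X" match
exact = Matcher(nlp.vocab)
exact.add("IPHONE_TEXT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

# Lowercase form: any casing of "iphone" followed by "x" matches
loose = Matcher(nlp.vocab)
loose.add("IPHONE_LOWER", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("The IPHONE X and the iPhone x both appear here.")
exact_hits = [doc[start:end].text for _, start, end in exact(doc)]
loose_hits = [doc[start:end].text for _, start, end in loose(doc)]

print("TEXT matches:", exact_hits)
print("LOWER matches:", loose_hits)
```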

In [None]:
from spacy.matcher import Matcher

# Load a pipeline and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)
print(matches)

Each tuple in the output above contains:
  - `match_id`: hash value of the pattern name
  - `start`: start index of the matched span
  - `end`: end index of the matched span (exclusive)
In [None]:
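def print_matches(matches):
    """Return the text of each matched span.

    Relies on the notebook-level `doc` that the matcher was called on.
    """
    match_text = []
    for match_id, start, end in matches:
        # Get the matched span
        matched_span = doc[start:end]
        match_text.append(matched_span.text)
    return match_text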

In [None]:
print_matches(matches)

### Matching lexical attributes

In [None]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

doc = nlp("2018 FIFA World Cup: France won!")

matcher = Matcher(nlp.vocab)
matcher.add("FIFA_PATTERN", [pattern])

matches = matcher(doc)
print_matches(matches)

### Matching other token attributes

In [None]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

doc = nlp("I loved dogs but now I love cats more.")
matcher = Matcher(nlp.vocab)
matcher.add("LOVE_PATTERN", [pattern])

matches = matcher(doc)
print_matches(matches)

### Operators and quantifiers

In [None]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

# print(spacy.explain("DET"))
doc = nlp("I bought a smartphone. Now I'm buying apps.")
matcher = Matcher(nlp.vocab)
matcher.add("BUY_PATTERN", [pattern])

matches = matcher(doc)
print_matches(matches)

![](assets/operators.png)
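Operators also work with purely lexical patterns, so the `?` operator can be tried without a trained pipeline. This sketch assumes spaCy is installed; the pattern name and sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "buy", then an optional "a", then "phone"
pattern = [
    {"LOWER": "buy"},
    {"LOWER": "a", "OP": "?"},  # optional: match 0 or 1 times
    {"LOWER": "phone"},
]
matcher.add("BUY_PHONE", [pattern])

doc = nlp("You can buy a phone or buy phone cases.")
optional_hits = [doc[start:end].text for _, start, end in matcher(doc)]
print(optional_hits)
```

The same pattern covers both the phrase with the article and the phrase without it, which would otherwise take two separate patterns.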

# Large-Scale Data Analysis

## Data Structures

### Shared vocab and string store

  - `Vocab`: stores data shared across multiple documents
  - To save memory, spaCy encodes all strings to **hash values**
  - Strings are only stored once in the StringStore via `nlp.vocab.strings`
  - String store: bidirectional **lookup table**

```python
nlp.vocab.strings.add("coffee")
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]
```

  - Hashes can't be reversed – that's why we need to provide the shared vocab
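To illustrate why a hash is useless without the store that produced it, here is a toy string store in plain Python. `TinyStringStore` is a hypothetical class, and SHA-256 is used only as a convenient deterministic hash for the sketch; it is not the hash function spaCy uses.

```python
import hashlib


class TinyStringStore:
    """Toy lookup table: string -> hash is computed, hash -> string is stored."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(string):
        # Deterministic 64-bit value derived from the string
        digest = hashlib.sha256(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, string):
        key = self._hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._hash(key)
        # Raises KeyError if this store never saw the string
        return self._by_hash[key]


store = TinyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])  # the string comes back because this store saw it

other_store = TinyStringStore()
try:
    other_store[coffee_hash]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("Lookup in an empty store failed:", lookup_failed)
```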

### Lexemes

- A `Lexeme` object is an entry in the vocabulary

In [None]:
doc = nlp("I love coffee")
lexeme = nlp.vocab["coffee"]

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

  - Contains the context-independent information about a word:
      - Word text
      - Lexical attributes
  - Does not contain context-dependent POS tags, dependencies, or entity labels

## Word Vectors

### Word embeddings

Prebuilt pipelines will have taken large corpora of text and created **embeddings**: by looking at which words do or don't co-occur, and how frequently, we can create proximity maps of how near or far other tokens in the corpus are to a given word.

![](assets/word-cooccurrence.png)

This process gives us word embeddings that we can use to measure how similar or dissimilar two arbitrary words are:

![](assets/word-embeddings.png)
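A common way to compare two embeddings is cosine similarity, which measures the angle between the vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have far more dimensions and are learned from data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not taken from any trained model
vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "soap": [0.0, 0.1, 0.9],
}

food_similarity = cosine_similarity(vectors["pizza"], vectors["pasta"])
soap_similarity = cosine_similarity(vectors["pizza"], vectors["soap"])
print("pizza vs pasta:", round(food_similarity, 3))
print("pizza vs soap:", round(soap_similarity, 3))
```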

### Semantic similarity in spaCy

  - spaCy can compare two objects and predict similarity
  - `Doc.similarity()`, `Span.similarity()`, and `Token.similarity()`
  - Take another object and return a similarity score (higher means more similar, typically between 0 and 1)
  - Important: needs a pipeline that has word vectors included, for example:
    - ✅ en_core_web_md (medium)
    - ✅ en_core_web_lg (large)
    - 🚫 NOT en_core_web_sm (small)

In [None]:
# Load a larger pipeline with vectors
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

In [None]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

## Combining Predictions and Rules

![](assets/statistical-vs-rule.png)

In [None]:
matcher = Matcher(nlp.vocab)
matcher.add("DOG", [[{"LOWER": "golden"}, {"LOWER": "retriever"}]])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span:", span.text)
    print("Root token:", span.root.text)
    print("Root head token:", span.root.head.text)
    print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)

  - `PhraseMatcher` works like regular expressions or keyword search – but with access to the tokens!
  - Takes `Doc` objects as patterns
  - More efficient and faster than the Matcher
  - Great for matching large word lists

In [None]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", [pattern])
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print("Matched span:", span.text)
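For a longer word list, the patterns can be built in bulk. This sketch assumes spaCy is installed and uses a blank pipeline with `nlp.make_doc`, which only tokenizes; `attr="LOWER"` makes the match case-insensitive. The breed list and sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Compare tokens on their lowercase form instead of their exact text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

breeds = ["Golden Retriever", "Border Collie", "Pug"]
# make_doc only runs the tokenizer, which is all the patterns need here
matcher.add("DOG", [nlp.make_doc(breed) for breed in breeds])

doc = nlp("My golden retriever chased a border collie past a pug.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```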

## Training and Updating Models

  - The entity recognizer tags words and phrases in context
  - Each token can only be part of one entity
  - Examples need to come with context

```python
from spacy.tokens import Span

doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]
```
  
  - Texts with no entities are also important

```python
doc = nlp("I need a new phone! Any tips?")
doc.ents = []
```

**Goal: teach the model to generalize**

### Training data

  - Examples of what we want the model to predict in context
  - Update an existing model: a few hundred to a few thousand examples
  - Train a new category: a few thousand to a million examples
    - spaCy's English models: 2 million words
  - Usually created manually by human annotators
  - Can be semi-automated – for example, using spaCy's `Matcher`!
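As a rough sketch of the semi-automated route, the `Matcher` can propose entity spans that are then written to `doc.ents`. This assumes spaCy is installed; the pattern and texts are illustrative, and in practice the proposed annotations would still need human review.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("GADGET", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

texts = ["iPhone X is coming", "I need a new phone! Any tips?"]

docs = []
for doc in nlp.pipe(texts):
    # Turn each match into a labeled span
    spans = [Span(doc, start, end, label="GADGET") for _, start, end in matcher(doc)]
    # Drop overlapping spans, since each token can belong to only one entity
    doc.ents = filter_spans(spans)
    docs.append(doc)

annotations = [[(ent.text, ent.label_) for ent in doc.ents] for doc in docs]
print(annotations)
```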

### Generate training corpus

In [None]:
from spacy.tokens import Span

nlp = spacy.blank("en")

# Create a Doc with entity spans
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
# Create another doc without entity spans
doc2 = nlp("I need a new phone! Any tips?")

docs = [doc1, doc2]  # and so on...

In [None]:
import random

random.shuffle(docs)
train_docs = docs[:len(docs) // 2]
dev_docs = docs[len(docs) // 2:]
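With a realistic number of examples, a reproducible split with a smaller evaluation share is often more convenient than a 50/50 cut. The helper below is a hypothetical sketch in plain Python; the 20% default and the fixed seed are assumptions, not recommendations from spaCy.

```python
import random


def split_train_dev(items, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, dev) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split identical between runs
    random.Random(seed).shuffle(shuffled)
    # Keep at least one item for evaluation
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


train_items, dev_items = split_train_dev(range(10))
print(len(train_items), "training items,", len(dev_items), "evaluation items")
```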

In [None]:
from spacy.tokens import DocBin

# Create and save a collection of training docs
train_docbin = DocBin(docs=train_docs)
train_docbin.to_disk("./train.spacy")

# Create and save a collection of evaluation docs
dev_docbin = DocBin(docs=dev_docs)
dev_docbin.to_disk("./dev.spacy")
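It can be worth reading a saved file back to confirm the annotations survived the round trip. This sketch assumes spaCy is installed and writes to a temporary directory rather than the paths used above.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    # Serialize the annotated doc to disk
    DocBin(docs=[doc]).to_disk(path)
    # Load it again; get_docs needs a vocab to rebuild the Doc objects
    loaded_docs = list(DocBin().from_disk(path).get_docs(nlp.vocab))

restored = [(ent.text, ent.label_) for ent in loaded_docs[0].ents]
print(restored)
```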

### Training the model

  - spaCy can auto-generate a default config file for you, using either:
    - the interactive quickstart widget in the docs
    - the `init config` command on the CLI

```bash
$ python -m spacy init config \
    ./config.cfg \
    --lang en \
    --pipeline ner
```

  - `init config`: the command to run
  - `config.cfg`: output path for the generated config
  - `--lang`: language class of the pipeline, e.g. en for English
  - `--pipeline`: comma-separated names of components to include

In [None]:
!python -m spacy init config \
    ./config.cfg \
    --lang en \
    --pipeline ner

In [None]:
!python -m spacy train \
    ./config.cfg \
    --output ./output \
    --paths.train train.spacy \
    --paths.dev dev.spacy

### Loading a trained pipeline

  - Output after training is a regular loadable spaCy pipeline
    - `model-last`: last trained pipeline
    - `model-best`: best trained pipeline
  - Load it with `spacy.load`

In [None]:
nlp = spacy.load("./output/model-best")
doc = nlp("iPhone 11 vs iPhone 8: What's the difference?")
print(doc.ents)