# Natural Language Processing Part 2

### Yahia Chammami - William James Mattingly Ph.D.

### Technical Review - FreeCodecamp

## I- SpaCy
### 1. What is SpaCy?

**SpaCy** is an open-source **natural language processing (NLP) library** released under the MIT license. It was created by **Explosion AI**.

The **SpaCy** framework is widely used for various text processing and language understanding tasks. In a world where vast amounts of textual information are generated every day, understanding and extracting insights from text has become a critical and necessary skill.

**SpaCy** also integrates seamlessly with **machine learning** algorithms and models, providing a powerful solution for **text classification** tasks. This type of capability has been very successful in reducing email spam in recent years and has largely replaced the previous practice of looking for predefined patterns of text and content.

Finally, **SpaCy** provides pre-trained word vectors (word embeddings) that capture semantic information about words, making it easier to work with semantics in text by understanding the relationships between words.

**Natural Language Processing (NLP) plays a pivotal role in every sector of industry, from academics who leverage it to aid in research to financial analysts who try to predict the stock market. Lawyers use NLP to help analyze thousands of legal documents in seconds to target their research, and medical doctors use it to parse patient charts.**


### 2. Install spaCy

In [1]:
# Install spaCy library
!pip install spacy

In [2]:
# Download the English language model for spaCy
!python -m spacy download en_core_web_sm

In [3]:
# Import the spaCy library
import spacy

In [4]:
# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

## II- Linguistic Annotations
In the context of NLP (Natural Language Processing), linguistic annotations are used to enhance the understanding of the structure, meaning, and relationships within a given text.

In [5]:
# Open and read the content of the "wiki_us.txt" file
with open("data/wiki_us.txt", "r") as f:
    text = f.read()

# Print the content of the file
print(text)


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

### 1.  Doc Container

Containers are spaCy objects that contain a large quantity of data about a text. When we analyze texts with the spaCy framework, we create different container objects to do that. spaCy provides several containers; we will be focusing on three: Doc, Span, and Token.

In [6]:
# Process the text using the spaCy language model
doc = nlp(text)

# Print the spaCy Doc object
print(doc)


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [7]:
print (len(doc))
print (len(text))

652
3525


**The Doc container, unlike the text object, contains a lot of valuable metadata, or attributes, hidden behind it.**


In [8]:
# Compare what iteration yields for the text object and the doc object.
# Iterate over the first 10 items of the text object (a plain string, so these are characters)
for token in text[:10]:
    print (token)

T
h
e
 
U
n
i
t
e
d


In [9]:
# Iterate over the first 10 tokens in the doc object
for token in doc[:10]:
    print (token)

The
United
States
of
America
(
U.S.A.
or
USA
)


**The open and close parentheses are also considered an item in the container. These are all known as tokens.**

**Tokens** are a fundamental building block of spaCy or any NLP framework. They can be words or punctuation marks. A token is a self-contained unit that has a syntactic purpose in a sentence.

In [10]:
# Split the text into words and iterate over the first 10 words
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


**The parentheses are not removed or handled individually.**

**To see this more clearly, let's print the tokens at indices 5 through 7 from both the text and doc objects.**

In [11]:
# Split the text into words and create a list containing the first 10 words
words = text.split()[:10]

In [12]:
# Set an initial value for i
i = 5

# Iterate over a subset of tokens in the 'doc' object
for token in doc[i:8]:
    # Print information about each token and the corresponding word from the 'words' list
    print(f"SpaCy Token {i}:\n{token}\nWord Split {i}:\n{words[i]}\n\n")    
    # Increment the value of i
    i = i + 1

SpaCy Token 5:
(
Word Split 5:
(U.S.A.


SpaCy Token 6:
U.S.A.
Word Split 6:
or


SpaCy Token 7:
or
Word Split 7:
USA),
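The difference comes down to how punctuation is handled. As a rough illustration in plain Python, the sketch below uses a regular expression (an assumption for demonstration, not spaCy's actual tokenizer rules) that separates punctuation from words while keeping dotted abbreviations such as U.S.A. together.

```python
import re

# Assumption: this regex is a rough illustration, not spaCy's tokenizer.
# It tries, in order: dotted abbreviations (U.S.A.), runs of word characters,
# then any single non-space punctuation mark.
TOKEN_RE = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")


def rough_tokenize(text):
    """Split text into word-like and punctuation tokens."""
    return TOKEN_RE.findall(text)


sample = "The United States of America (U.S.A. or USA), commonly known"

whitespace_tokens = sample.split()
regex_tokens = rough_tokenize(sample)

# Whitespace splitting leaves the parenthesis attached to the word
print(whitespace_tokens[5:8])
# The regex separates "(" into its own item, much like the Doc container does
print(regex_tokens[5:8])
```

A real tokenizer needs many more rules and exceptions than this, which is why the Doc container is the more dependable choice.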




### 2. Sentence Boundary Detection (SBD)
Sentence boundary detection, or SBD, is the natural language processing task of determining the boundaries between sentences in a given text. The goal of SBD is to identify the positions in the text where one sentence ends and the next one begins.
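To see why this is harder than splitting on periods, here is a minimal sketch of a naive splitter in plain Python. The rule and the sample sentence are illustrative assumptions; the point is only that abbreviations such as "U.S." fool a punctuation-based rule.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Hypothetical sample with two sentences and one abbreviation
sample = "The U.S. Senate met today. It adjourned early."
pieces = naive_sentences(sample)

# The abbreviation "U.S." is wrongly treated as the end of a sentence
for piece in pieces:
    print(repr(piece))
```

The naive rule yields three pieces for two sentences, which is the kind of error a trained pipeline is meant to avoid.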


In [13]:
# To access the sentences in the Doc container, we can use the attribute sents, like so:
# Iterate over sentences in the processed document and print each sentence
for sent in doc.sents:
    print(sent)
    print()

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]

At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]

The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]

With a population of more than 331 million people, it is the third most populous country in the world.

The national capital is Washington, D.C., and the most populous city is New York.



Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.

The United States emerged from the thirteen British colo

In [14]:
# Extract the first sentence from the processed document
sentence1 = list(doc.sents)[0]

# Print the first sentence
print(sentence1)


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


### 3. Token Attributes
The tokens are the building blocks of processed text, and each token has various attributes that provide information about its properties and linguistic features.

**Here are some common token attributes in spaCy:**

* .text
* .head
* .left_edge
* .right_edge
* .ent_type_
* .ent_iob_
* .lemma_
* .morph
* .pos_
* .dep_
* .lang_

These attributes allow you to access information about the lexical, syntactic, and semantic properties of each token in a processed document.

**Here's an example of accessing these attributes for a token:**

In [15]:
# Access the third token in the first sentence and assign it to 'token2'
token2 = sentence1[2]

# Print the third token
print(token2)


States


In [16]:
# Accessing token attributes
print(f"Text: {token2.text}")                  # Original text of the token
print(f"Heading Token: {token2.head.text}")     # Text of the token's syntactic head
print(f"Leftmost Token: {token2.left_edge.text}")  # Text of the leftmost token in the syntactic span
print(f"Rightmost Token: {token2.right_edge.text}")  # Text of the rightmost token in the syntactic span
print(f"Named Entity Type: {token2.ent_type_}")  # Named entity type of the token
print(f"IOB Tag: {token2.ent_iob_}")            # Inside, Outside, Beginning tag for named entities
print(f"Lemma: {token2.lemma_}")                # Base or root form of the token
print(f"Morphological Analysis: {token2.morph}") # Morphological analysis of the token
print(f"Part-of-Speech Tag: {token2.pos_}")     # Part-of-speech tag of the token
print(f"Dependency Relation: {token2.dep_}")    # Syntactic dependency relation to the token's head
print(f"Language: {token2.lang_}")              # Language of the token


Text: States
Heading Token: is
Leftmost Token: The
Rightmost Token: America
Named Entity Type: GPE
IOB Tag: I
Lemma: States
Morphological Analysis: Number=Sing
Part-of-Speech Tag: PROPN
Dependency Relation: nsubj
Language: en
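The IOB tag above is "I" because "States" sits inside a multi-token entity rather than at its beginning. The sketch below shows the scheme in plain Python; the helper `iob_tags` and the token offsets are hypothetical, and it writes the combined `B-GPE` / `I-GPE` form that is common in NER datasets, whereas spaCy exposes the letter (`ent_iob_`) and the type (`ent_type_`) as separate attributes.

```python
def iob_tags(tokens, entities):
    """Return one IOB tag per token.

    entities: list of (start, end, label) token offsets, end exclusive.
    """
    tags = ["O"] * len(tokens)  # "O" means outside any entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens are inside it
    return tags


tokens = ["The", "United", "States", "of", "America", "("]
# Assumed offsets for the entity "The United States of America"
entities = [(0, 5, "GPE")]

tags = iob_tags(tokens, entities)
for token, tag in zip(tokens, tags):
    print(f"{token:<10}{tag}")
```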


### 4. Part of Speech Tagging (POS)

Part of Speech Tagging (POS) is a natural language processing (NLP) task that involves assigning a grammatical category or part-of-speech tag to each word in a given text based on its syntactic function and role within a sentence. The goal of POS tagging is to categorize words into specific classes, such as nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and more.

In [17]:
# Iterate over tokens in 'sentence1' and print the text, POS tag, and dependency relation on separate lines
for token in sentence1:
    print(f"Text: {token.text}")              # Print the original text of the token
    print(f"POS Tag: {token.pos_}")           # Print the part-of-speech (POS) tag of the token
    print(f"Dependency Relation: {token.dep_}") # Print the syntactic dependency relation of the token
    print()  # Add an empty line for better separation between tokens

Text: The
POS Tag: DET
Dependency Relation: det

Text: United
POS Tag: PROPN
Dependency Relation: compound

Text: States
POS Tag: PROPN
Dependency Relation: nsubj

Text: of
POS Tag: ADP
Dependency Relation: prep

Text: America
POS Tag: PROPN
Dependency Relation: pobj

Text: (
POS Tag: PUNCT
Dependency Relation: punct

Text: U.S.A.
POS Tag: PROPN
Dependency Relation: appos

Text: or
POS Tag: CCONJ
Dependency Relation: cc

Text: USA
POS Tag: PROPN
Dependency Relation: conj

Text: )
POS Tag: PUNCT
Dependency Relation: punct

Text: ,
POS Tag: PUNCT
Dependency Relation: punct

Text: commonly
POS Tag: ADV
Dependency Relation: advmod

Text: known
POS Tag: VERB
Dependency Relation: acl

Text: as
POS Tag: ADP
Dependency Relation: prep

Text: the
POS Tag: DET
Dependency Relation: det

Text: United
POS Tag: PROPN
Dependency Relation: compound

Text: States
POS Tag: PROPN
Dependency Relation: pobj

Text: (
POS Tag: PUNCT
Dependency Relation: punct

Text: U.S.
POS Tag: PROPN
Dependency Relation: ap

The **displacy.render** function from spaCy is used to visualize the syntactic dependency structure of a sentence. It generates a graphical representation using a dependency tree. 

In [18]:
# Import the 'displacy' module from spaCy
from spacy import displacy

# Render the syntactic dependency structure of 'sentence1'
displacy.render(sentence1, style="dep")


### 5. Named Entity Recognition
Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying named entities (real-world objects such as persons, organizations, locations, dates, monetary values, percentages, etc.) in unstructured text. The goal of NER is to extract structured information from text and classify entities into predefined categories.

In NER, each named entity is assigned a specific label or category based on its type. Common types of named entities include:

- Person: Individual names of people.
- Organization: Names of companies, institutions, or other organized groups.
- Location: Geographical locations, such as countries, cities, or landmarks.
- Date: Temporal expressions indicating dates or durations.
- Time: Expressions indicating specific points or periods of time.
- Money: Monetary values or currency expressions.
- Percentage: Percentage values.
- Product: Names of products or goods.

In [19]:
# Iterate over entities and print their text and label
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Entity: The United States of America, Label: GPE
Entity: U.S.A., Label: GPE
Entity: USA, Label: GPE
Entity: the United States, Label: GPE
Entity: U.S., Label: GPE
Entity: US, Label: GPE
Entity: America, Label: GPE
Entity: North America, Label: LOC
Entity: 50, Label: CARDINAL
Entity: five, Label: CARDINAL
Entity: 326, Label: CARDINAL
Entity: Indian, Label: NORP
Entity: 3.8 million square miles, Label: QUANTITY
Entity: 9.8 million square kilometers, Label: QUANTITY
Entity: fourth, Label: ORDINAL
Entity: The United States, Label: GPE
Entity: Canada, Label: GPE
Entity: Mexico, Label: GPE
Entity: Bahamas, Label: GPE
Entity: Cuba, Label: GPE
Entity: more than 331 million, Label: CARDINAL
Entity: third, Label: ORDINAL
Entity: Washington, Label: GPE
Entity: D.C., Label: GPE
Entity: New York, Label: GPE
Entity: Paleo-Indians, Label: NORP
Entity: Siberia, Label: LOC
Entity: North American, Label: NORP
Entity: at least 12,000 years ago, Label: DATE
Entity: European, Label: NORP
Entity: the 16th c

**Sometimes it can be difficult to read this output as raw data. In this case, we can again leverage spaCy's displaCy feature. Notice that this time we are altering the keyword argument, style, with the string "ent".**

In [20]:
# Visualize named entities
displacy.render(doc, style="ent")
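As a text-only complement to the visualization, the entity and label pairs can also be grouped by label. This is a small sketch using a hand-copied sample of the pairs printed earlier; with a live Doc, the same pairs could be built from `ent.text` and `ent.label_` while iterating over `doc.ents`.

```python
from collections import defaultdict

# A few (text, label) pairs copied by hand from the entity output above
pairs = [
    ("The United States of America", "GPE"),
    ("North America", "LOC"),
    ("50", "CARDINAL"),
    ("Indian", "NORP"),
    ("Canada", "GPE"),
    ("Mexico", "GPE"),
    ("Siberia", "LOC"),
]

# Collect entity texts under their label, preserving document order
by_label = defaultdict(list)
for entity_text, label in pairs:
    by_label[label].append(entity_text)

for label in sorted(by_label):
    print(f"{label}: {', '.join(by_label[label])}")
```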

### 6. Introducing Complex Rules and Variance to the EntityRuler (Advanced)
In some instances, labels may have a set type of variance that follows a distinct pattern or set of patterns. One such example (included in the spaCy documentation) is phone numbers. In the United States, phone numbers have a few forms. The standard formal method is (xxx)-xxx-xxxx, but it is not uncommon to see xxx-xxx-xxxx or xxxxxxxxxx. If the owner of the phone number is giving that same number to someone outside the US, it becomes +1(xxx)-xxx-xxxx.

If you are working within a United States domain, you can pass token patterns or RegEx formulas to the pattern matcher to grab these instances. The pattern below covers the parenthesized form.

In [21]:
# Sample text
text = "This is a sample number (555) 555-5555."

# Create a blank English pipeline (no pretrained components)
nlp = spacy.blank("en")

# Create the EntityRuler and add it to the pipeline
ruler = nlp.add_pipe("entity_ruler")

# List of entities and patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
    {
        "label": "PHONE_NUMBER",
        "pattern": [
            {"ORTH": "("},
            {"SHAPE": "ddd"},
            {"ORTH": ")"},
            {"SHAPE": "ddd"},
            {"ORTH": "-", "OP": "?"},
            {"SHAPE": "dddd"},
        ],
    }
]

# Add the patterns to the ruler
ruler.add_patterns(patterns)

# Create the doc
doc = nlp(text)

# Extract the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

(555) 555-5555 PHONE_NUMBER
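The token pattern above only covers the parenthesized form. As a sketch of how the other forms mentioned earlier might be covered with a single regular expression, here is a plain `re` example. The pattern `PHONE_RE` and the sample sentence are assumptions written for this illustration, not taken from the spaCy documentation.

```python
import re

# Assumed pattern: optional +1, area code with or without parentheses,
# optional hyphen or space separators, and no digits directly before or after.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+1\s?)?(?:\(\d{3}\)|\d{3})[-\s]?\d{3}[-\s]?\d{4}(?!\d)"
)

# Hypothetical sample text containing each form described above
sample = "Call (555)-555-5555, 555-555-5555, 5555555555 or +1(555)-555-5555."

found = PHONE_RE.findall(sample)
for number in found:
    print(number)
```

Because the pattern contains only non-capturing groups, `findall` returns each full match as a plain string.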


## III - Word Vectors

Word vectors, or word embeddings, are numerical representations of words in multidimensional space, stored as matrices. The purpose of a word vector is to let a computer system work with the meaning of a word. Computers cannot process raw text efficiently. They can, however, process numbers quickly and well. For this reason, it is important to convert a word into a sequence of numbers.

In [22]:
# Load the spaCy model with medium-sized word vectors for English
nlp = spacy.load("en_core_web_md")
# Read the content of the file "wiki_us.txt"
with open("data/wiki_us.txt", "r") as f:
    text = f.read()
# Process the text using the spaCy pipeline
doc = nlp(text)
# Extract the first sentence from the processed document
sentence1 = list(doc.sents)[0]
# Access the first token in the first sentence
sentence1[0]

The

In [23]:
# Access the vector representation of the first token in 'sentence1'
sentence1[0].vector

array([-7.2681e+00, -8.5717e-01,  5.8105e+00,  1.9771e+00,  8.8147e+00,
       -5.8579e+00,  3.7143e+00,  3.5850e+00,  4.7987e+00, -4.4251e+00,
        1.7461e+00, -3.7296e+00, -5.1407e+00, -1.0792e+00, -2.5555e+00,
        3.0755e+00,  5.0141e+00,  5.8525e+00,  7.3378e+00, -2.7689e+00,
       -5.1641e+00, -1.9879e+00,  2.9782e+00,  2.1024e+00,  4.4306e+00,
        8.4355e-01, -6.8742e+00, -4.2949e+00, -1.7294e-01,  3.6074e+00,
        8.4379e-01,  3.3419e-01, -4.8147e+00,  3.5683e-02, -1.3721e+01,
       -4.6528e+00, -1.4021e+00,  4.8342e-01,  1.2549e+00, -4.0644e+00,
        3.3278e+00, -2.1590e-01, -5.1786e+00,  3.5360e+00, -3.1575e+00,
       -3.5273e+00, -3.6753e+00,  1.5863e+00, -8.1594e+00, -3.4657e+00,
        1.5262e+00,  4.8135e+00, -3.8428e+00, -3.9082e+00,  6.7549e-01,
       -3.5787e-01, -1.7806e+00,  3.5284e+00, -5.1114e-02, -9.7150e-01,
       -9.0553e-01, -1.5570e+00,  1.2038e+00,  4.7708e+00,  9.8561e-01,
       -2.3186e+00, -7.4899e+00, -9.5389e+00,  8.5572e+00,  2.74

**Once a word vector model is trained, we can do similarity matches very quickly and very reliably. Let's explore some vectors from our medium-sized model. Let's specifically try to find the words most closely related to the word dog.**

### 1. Doc Similarity
Document similarity refers to the measurement of how alike two documents are in terms of their content, meaning, or structure. 

Document similarity is often used in various applications, such as document clustering, recommendation systems, and information retrieval.

In [24]:
nlp = spacy.load("en_core_web_md")  # make sure to use larger package!
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1.similarity(doc2))
print(doc2.similarity(doc1))

0.691649353055761
0.691649353055761
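The score is the same in both directions because, as far as the spaCy documentation describes the default behaviour, document similarity is the cosine similarity between the two documents' vectors, each being the average of its token vectors. The sketch below reproduces that idea in plain Python with tiny made-up vectors; the numbers and helper names are hypothetical and chosen only for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "fries": [0.9, 0.1, 0.0],
    "hamburgers": [0.8, 0.2, 0.1],
    "food": [0.7, 0.3, 0.1],
    "laws": [0.0, 0.2, 0.9],
}

doc_a = mean_vector([toy_vectors["fries"], toy_vectors["hamburgers"]])
doc_b = mean_vector([toy_vectors["food"]])
doc_c = mean_vector([toy_vectors["laws"]])

print(round(cosine(doc_a, doc_b), 3))  # related topics score higher
print(round(cosine(doc_a, doc_c), 3))  # unrelated topics score lower
print(cosine(doc_a, doc_b) == cosine(doc_b, doc_a))  # the measure is symmetric
```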


### 2. Word Similarity
We can also calculate the similarity between two given words.

In [25]:
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries)
print(burgers)
print(french_fries.similarity(burgers))

salty fries
hamburgers
0.6938489079475403


### 3. Find Similar Words

In [26]:
# Your target word
your_word = "dog"

In [27]:
import numpy as np

# Look up the vector for the target word in spaCy's vocabulary
word_vector = nlp.vocab.vectors[nlp.vocab.strings[your_word]]
# Find the 10 most similar entries in the vocabulary vectors
ms = nlp.vocab.vectors.most_similar(np.asarray([word_vector]), n=10)
# Extract the most similar words (ms[0] holds the keys) and their scores (ms[2])
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
# Print the most similar words
print(words)


['dogsbody', 'wolfdogs', 'Baeg', 'duppy', 'pet(s', 'postcanine', 'Kebira', 'uppies', 'Toropets', 'moggie']
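Conceptually, a nearest-neighbour lookup like this compares the query vector against every other vector in the table and keeps the best scores, so the results depend heavily on what the vocabulary table contains. The sketch below shows the idea in plain Python over a tiny invented table; the vectors and the helper `most_similar` are hypothetical and are not spaCy's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_similar(word, table, n=2):
    """Rank every other word in the table by cosine similarity to `word`."""
    target = table[word]
    scored = [
        (other, cosine(target, vector))
        for other, vector in table.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical vectors, invented for illustration only
toy_table = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "cat": [0.7, 0.6, 0.3],
    "car": [0.1, 0.2, 0.95],
}

for other, score in most_similar("dog", toy_table, n=2):
    print(other, round(score, 3))
```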


### 4. Other Rules-Based Matching Techniques in spaCy
There are two other rules-based methods in spaCy: Matcher and PhraseMatcher. We have already met the Matcher in **01.03: Rules-Based Matching**. We will be meeting other more complex rules-based matching methods in the next few notebooks.

#### How to use the spaCy Matcher

In [28]:
# Import the Matcher class from the spacy.matcher module
from spacy.matcher import Matcher

### Basic Example

In [29]:
# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")
# Create a Matcher object with the vocabulary from the loaded model
matcher = Matcher(nlp.vocab)
# Define a pattern for the matcher
pattern = [{"LIKE_EMAIL": True}]
# Add the pattern to the matcher
matcher.add("EMAIL_ADDRESS", [pattern])
# Process the text with spaCy
doc = nlp("This is an email address: wmattingly@aol.com")
# Apply the matcher on the processed text (doc)
matches = matcher(doc)
# Print the matches
print (matches)

[(16571425990740197027, 6, 7)]


In [30]:
print (nlp.vocab[matches[0][0]].text)

EMAIL_ADDRESS
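Each match is a tuple of three integers: the hash of the rule name, the start token index, and the end token index (exclusive, like a Python slice). The sketch below illustrates the offsets using a hand-written token list that mirrors the sentence above; the list is an assumption inferred from the offsets 6 and 7 in the output. With a real Doc, `doc[start:end]` would return the matching Span instead.

```python
# Hand-written token list mirroring how the sentence was tokenized above
# (an assumption based on the match offsets 6 and 7 in the output)
tokens = ["This", "is", "an", "email", "address", ":", "wmattingly@aol.com"]

# Each match is (match_id, start, end); `end` is exclusive
matches = [(16571425990740197027, 6, 7)]

for match_id, start, end in matches:
    matched_tokens = tokens[start:end]
    print(f"rule hash={match_id} start={start} end={end} -> {matched_tokens}")
```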


### Applied Matcher

In [31]:
# Open the file "wiki_mlk.txt" in read mode ("r")
with open("data/wiki_mlk.txt", "r") as f:
    # Read the contents of the file and store it in the variable 'text'
    text = f.read()

In [32]:
# Import the spaCy library
import spacy

# Load the English language model "en_core_web_sm" provided by spaCy
nlp = spacy.load("en_core_web_sm")

In [33]:
# Create a Matcher object using the spaCy vocabulary
matcher = Matcher(nlp.vocab)
# Define a pattern for the Matcher to find proper nouns (POS: PROPN)
pattern = [{"POS": "PROPN"}]
# Add the pattern to the Matcher with the label "PROPER_NOUNS"
matcher.add("PROPER_NOUNS", [pattern])
# Process the text into a spaCy Doc object
doc = nlp(text)
# Use the Matcher to find matches in the document
matches = matcher(doc)
# Print the number of matches found
print(len(matches))
# Print the details of the first 10 matches
for match_id, start, end in matches[:10]:
    # Print the match ID and the span of text that matched the pattern
    print(match_id, doc[start:end])



102
3232560085755078826 Martin
3232560085755078826 Luther
3232560085755078826 King
3232560085755078826 Jr.
3232560085755078826 Michael
3232560085755078826 King
3232560085755078826 Jr.
3232560085755078826 January
3232560085755078826 April
3232560085755078826 Baptist


### Improving it with Multi-Word Tokens

In [34]:
# Create a Matcher object using the spaCy vocabulary
matcher = Matcher(nlp.vocab)
# Define a pattern for the Matcher to find one or more proper nouns (POS: PROPN)
pattern = [{"POS": "PROPN", "OP": "+"}]
# Add the pattern to the Matcher with the label "PROPER_NOUNS"
matcher.add("PROPER_NOUNS", [pattern])
# Process the text into a spaCy Doc object
doc = nlp(text)
# Use the Matcher to find matches in the document
matches = matcher(doc)
# Print the number of matches found
print(len(matches))
# Print the details of the first 10 matches
for match_id, start, end in matches[:10]:
    # Print the match ID and the span of text that matched the pattern
    print(match_id, doc[start:end])

175
3232560085755078826 Martin
3232560085755078826 Martin Luther
3232560085755078826 Luther
3232560085755078826 Martin Luther King
3232560085755078826 Luther King
3232560085755078826 King
3232560085755078826 Martin Luther King Jr.
3232560085755078826 Luther King Jr.
3232560085755078826 King Jr.
3232560085755078826 Jr.
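With the "+" operator, every sub-sequence of consecutive proper nouns is returned, which is why "Martin Luther King Jr." appears ten times in different lengths. spaCy has built-in ways to keep only the longest match (the `greedy="LONGEST"` argument to `matcher.add` and the `spacy.util.filter_spans` helper are worth checking against your installed version). The underlying idea can be sketched in plain Python; the helper `keep_longest` and the token offsets below are hypothetical.

```python
def keep_longest(spans):
    """Keep the longest spans, dropping any span that overlaps one already kept.

    spans: iterable of (start, end) token offsets, end exclusive.
    """
    kept = []
    # Visit the longest spans first; ties are broken by the earliest start
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


# Offsets for the ten overlapping matches printed above, assuming
# "Martin Luther King Jr." occupies tokens 0 to 3 (a hypothetical layout)
overlapping = [
    (0, 1), (0, 2), (1, 2), (0, 3), (1, 3),
    (2, 3), (0, 4), (1, 4), (2, 4), (3, 4),
]

print(keep_longest(overlapping))
```

Only the span covering all four tokens survives, so the full name would be reported once.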


## RegEx with spaCy

### What are Regular Expressions (RegEx)?
Regex, short for regular expression, is a powerful and flexible tool used for pattern matching in strings. It's a sequence of characters that defines a search pattern. This search pattern can be used to match, locate, and manipulate text based on the patterns you define.

In spaCy, RegEx can be leveraged in a few different pipes (depending on the task at hand, as we shall see) to identify entities or to match patterns.
### 1. Examples of using RegEx

In [57]:
# Import the 're' module, which provides support for regular expressions
import re

# Define a regex pattern to match Social Security Numbers (SSN)
# The pattern specifies a word boundary (\b), followed by three digits, a hyphen,
# two digits, another hyphen, and four digits. The \b at the end ensures a word boundary.
pattern = r'\b\d{3}-\d{2}-\d{4}\b'

# Sample text containing Social Security Numbers
text = "Social Security Numbers: 123-45-6789, 987-65-4321"

# Use the 'findall' function from the 're' module to find all occurrences
# of the specified pattern in the given text
matches = re.findall(pattern, text)
matches

['123-45-6789', '987-65-4321']
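The word boundaries matter more than they might seem. The following sketch compares the same pattern with and without `\b` on a made-up string in which an SSN-like sequence is embedded inside a longer run of digits; the sample text is an assumption for illustration.

```python
import re

with_boundaries = r"\b\d{3}-\d{2}-\d{4}\b"
without_boundaries = r"\d{3}-\d{2}-\d{4}"

# Hypothetical text: one well-formed number and one embedded in longer digit runs
sample = "Valid: 123-45-6789. Not an SSN: 9123-45-67890."

strict = re.findall(with_boundaries, sample)
loose = re.findall(without_boundaries, sample)

print(strict)  # only the standalone number
print(loose)   # also picks a fragment out of the longer sequence
```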

### The Strengths of RegEx
There are several strengths to RegEx.

1) Its compact syntax allows programmers to write robust rules in very little space.<br>
2) It allows the researcher to find all types of variance in strings.<br>
3) It can perform remarkably quickly when compared to other methods.<br>
4) It is supported in nearly every programming language.

### The Weaknesses of RegEx
Despite these strengths, there are a few weaknesses to RegEx.
1) Its syntax is quite difficult for beginners. (I still find myself looking up how to do certain things.)<br>
2) In order to work well, it requires a domain expert to work alongside the programmer to think of all the ways a pattern may vary in texts.


In [58]:
# Import the 're' module, which provides support for regular expressions
import re

Now that we have it imported, we can begin to write out some RegEx rules. Let's say we want to find an occurrence of a date in a text. As noted in an earlier notebook, there are a finite number of ways this can be represented. Let's try to grab all instances of a day followed by a month first.

In [59]:
# Define a regex pattern to match date expressions in the format "day month"
# The pattern contains three capturing groups:
# 1. The outer parentheses capture the full "day month" expression.
# 2. (\d){1,2}: Matches one or two digits representing the day. Because the
#    quantifier sits outside the group, only the last digit matched is captured.
# 3. (January|February|March|April|May|June|July|August|September|October|November|December):
#    Matches the name of the month (case-sensitive).
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

# Sample text containing date expressions
text = "This is a date 2 February. Another date would be 14 August."

# Use the 'findall' function from the 're' module to find all occurrences
# of the specified pattern in the given text
matches = re.findall(pattern, text)

# 'matches' now contains a list of tuples, where each tuple represents a match.
# The first element of the tuple is the full match, the second is the last digit
# of the day (so "14" yields "4"), and the third is the month.
print(matches)


[('2 February', '2', 'February'), ('14 August', '4', 'August')]
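Notice that the day for "14 August" comes back as "4". When a quantifier is placed outside a capturing group, as in `(\d){1,2}`, the group is overwritten on each repetition and keeps only the last digit. The sketch below contrasts this with `(\d{1,2})`, where the quantifier sits inside the group; the variable names are chosen for this illustration.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Quantifier outside the group: the group keeps only its last repetition
repeated_group = r"(\d){1,2} (" + MONTHS + ")"
# Quantifier inside the group: the group captures both digits
quantifier_inside = r"(\d{1,2}) (" + MONTHS + ")"

text = "This is a date 2 February. Another date would be 14 August."

last_digit_only = re.findall(repeated_group, text)
full_day = re.findall(quantifier_inside, text)

print(last_digit_only)
print(full_day)
```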


This pattern matches one or two digits followed by a space and a month name. What happens when we try to do this with a date that is formed the opposite way?

In [60]:
# Define a regex pattern to match date expressions in the format "month day"
# The pattern contains three capturing groups:
# 1. The outer parentheses capture the full "month day" expression.
# 2. (January|February|March|April|May|June|July|August|September|October|November|December):
#    Matches the name of the month (case-sensitive).
# 3. (\d){1,2}: Matches one or two digits representing the day (only the last digit is captured).
pattern = r"((January|February|March|April|May|June|July|August|September|October|November|December) (\d){1,2})"

# Sample text containing date expressions
text = "This is a date February 2. Another date would be 14 August."

# Use the 'findall' function from the 're' module to find all occurrences
# of the specified pattern in the given text
matches = re.findall(pattern, text)

# 'matches' now contains a list of tuples, where each tuple represents a match.
# The first element of the tuple is the full match, the second element is the
# month, and the third element is the day.
print(matches)


[('February 2', 'February', '2')]


It fails to find 14 August. But this is no fault of RegEx: our pattern cannot accommodate that variation. Nevertheless, we can account for it by adding it as a possible variation. Alternatives are separated with a pipe (|).

In [61]:
# Define a regex pattern to match date expressions in the formats:
# 1. "day month" (e.g., 2 February)
# 2. "month day" (e.g., February 2)
# The pattern consists of two main alternatives separated by a pipe (|):
# Alternative 1: ((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))
#   - (\d){1,2}: Matches one or two digits representing the day.
#   - ( (January|February|March|April|May|June|July|August|September|October|November|December)):
#     Matches a space followed by the name of the month (case-sensitive).
# Alternative 2: (((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2})
#   - ((January|February|March|April|May|June|July|August|September|October|November|December) ):
#     Matches the name of the month followed by a space.
#   - (\d){1,2}: Matches one or two digits representing the day.
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

# Sample text containing date expressions
text = "This is a date February 2. Another date would be 14 August."

# Use the 'findall' function from the 're' module to find all occurrences
# of the specified pattern in the given text
matches = re.findall(pattern, text)

# 'matches' now contains a list of tuples, one per match. Each tuple holds all
# nine capturing groups in order; groups that belong to the alternative that
# did not match are returned as empty strings.
print(matches)


[('February 2', '', '', '', '', 'February 2', 'February ', 'February', '2'), ('14 August', '14 August', '4', ' August', 'August', '', '', '', '')]


However, we now have a lot of superfluous information for each match. These are the capturing groups that make up the pattern. There are several ways we can remove them. One way is to use finditer, rather than findall.

In [62]:
# Reuse the pattern from the previous cell, which matches either format:
# 1. "day month" (e.g., 2 February)
# 2. "month day" (e.g., February 2)
# Capturing groups are numbered by their opening parenthesis, left to right:
#   group 1: the full date, whichever alternative matched
#   group 2: the whole "day month" alternative (None if "month day" matched)
#   group 3: the last digit of the day in "day month"
#   group 4: the month in "day month", including its leading space
#   group 5: the month in "day month"
#   group 6: the whole "month day" alternative (None if "day month" matched)
#   groups 7-9: the month with trailing space, the month, and the last day digit
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

# Sample text containing date expressions
text = "This is a date February 2. Another date would be 14 August."

# Use the 'finditer' function from the 're' module to find matches and get an iterator
iter_matches = re.finditer(pattern, text)

# Iterate through the matches and print relevant information
for match in iter_matches:
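    # 'match.group()' returns the entire matched string
    # 'match.group(2)' returns the whole "day month" alternative, not just the day;
    #   it is None when the "month day" alternative matched
    # 'match.group(4)' returns the month with its leading space (or None)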
    print(f"Full match: {match.group()}, Day: {match.group(2)}, Month: {match.group(4)}")


Full match: February 2, Day: None, Month: None
Full match: 14 August, Day: 14 August, Month:  August


In [63]:
# Same pattern as in the previous cell; this time each piece of information
# is printed on its own line. The pattern matches either format:
# 1. "day month" (e.g., 2 February)
# 2. "month day" (e.g., February 2)
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

# Sample text containing date expressions
text = "This is a date February 2. Another date would be 14 August."

# Use the 'finditer' function from the 're' module to find matches and get an iterator
iter_matches = re.finditer(pattern, text)

# Iterate through the matches and print relevant information
for match in iter_matches:
    # 'match.group()' returns the entire matched string
    print(f"Full match: {match.group()}")
    # 'match.group(2)' returns the day
    print(f"Day: {match.group(2)}")
    # 'match.group(4)' returns the month
    print(f"Month: {match.group(4)}")
    # Add a separator line for better readability
    print("-" * 30)


Full match: February 2
Day: None
Month: None
------------------------------
Full match: 14 August
Day: 14 August
Month:  August
------------------------------
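
Counting parentheses to find the right group number is easy to get wrong. As a minimal sketch of one alternative, the snippet below uses named groups instead. The group names `day1`, `month1`, `month2`, and `day2` and the `MONTHS` helper string are assumptions introduced here for illustration; Python requires each group name to be unique, hence the numeric suffixes.

```python
import re

# Hypothetical helper: the month names joined into one alternation
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Named groups replace numbered groups; {{1,2}} is an escaped {1,2} inside an f-string
pattern = re.compile(
    rf"(?P<day1>\d{{1,2}}) (?P<month1>{MONTHS})"
    rf"|(?P<month2>{MONTHS}) (?P<day2>\d{{1,2}})"
)

text = "This is a date February 2. Another date would be 14 August."

dates = []
for match in pattern.finditer(text):
    # Only the groups of the alternative that matched are set; the others are None
    day = match.group("day1") or match.group("day2")
    month = match.group("month1") or match.group("month2")
    dates.append((day, month))
    print(f"Full match: {match.group()}, Day: {day}, Month: {month}")
```

Because the groups are looked up by name, adding or removing parentheses elsewhere in the pattern does not shift the indices the loop depends on.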


Each match object holds some very salient information, such as the start and end location (the span) and the matched text itself (match). We can use the start and end locations to slice the matched text out of the string.

In [64]:
# Define a regex pattern to match date expressions in the formats:
# 1. "day month" (e.g., 2 February)
# 2. "month day" (e.g., February 2)
# The pattern consists of two main alternatives separated by a pipe (|):
# Alternative 1: (\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December))
#   - (\d){1,2}: Matches one or two digits representing the day.
#   - ( (January|February|March|April|May|June|July|August|September|October|November|December)):
#     Matches a space followed by the name of the month (case-sensitive).
# Alternative 2: ( (January|February|March|April|May|June|July|August|September|October|November|December)) (\d){1,2}
#   - ( (January|February|March|April|May|June|July|August|September|October|November|December)):
#     Matches the name of the month followed by a space.
#   - (\d){1,2}: Matches one or two digits representing the day.
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

# Sample text containing date expressions
text = "This is a date February 2. Another date would be 14 August."

# Use the 'finditer' function from the 're' module to find matches and get an iterator
iter_matches = re.finditer(pattern, text)

# Iterate through the matches and print the matched substrings
for match in iter_matches:
    # 'match.group()' returns the entire matched string
    matched_substring = match.group()
    print(matched_substring)


February 2
14 August
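
The start and end offsets stored in each match can also be used directly to slice the original string. The following is a minimal sketch under the assumption of a shortened pattern built from a hypothetical `MONTHS` helper string; it is not the exact pattern used in the cells above.

```python
import re

# Hypothetical helper: the month names joined into one alternation
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Non-capturing groups (?:...) are enough here because only the span is needed
pattern = rf"\d{{1,2}} (?:{MONTHS})|(?:{MONTHS}) \d{{1,2}}"

text = "This is a date February 2. Another date would be 14 August."

sliced = []
for match in re.finditer(pattern, text):
    # 'match.span()' returns the (start, end) character offsets of the match
    start, end = match.span()
    # Slicing the original string with those offsets yields the matched text
    sliced.append(text[start:end])
    print(start, end, text[start:end])
```

The sliced text is identical to `match.group()`; the offsets become useful later, when the same positions have to be mapped onto spaCy tokens.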


### Advanced RegEx in spaCy
Things like dates, times, and IP addresses, which have consistent or fairly consistent structures, are excellent candidates for RegEx. Fortunately, spaCy has easy ways to implement RegEx in three components: Matcher, PhraseMatcher, and EntityRuler. One of the major drawbacks to the Matcher and PhraseMatcher is that they do not store their matches in doc.ents. Because this textbook is about NER and our goal is to store the entities in doc.ents, we will focus on using RegEx with the EntityRuler. In the next notebook, we will examine other methods.

In the previous notebook, we saw how the code below allowed us to capture the phone number in the string. I have modified it a bit here for reasons that will become clearer below.

In [65]:
#Sample text
text = "This is a sample number 555-5555."

#Create a blank English pipeline (tokenizer only)
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
                {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]}
            ]
#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

555-5555 PHONE_NUMBER


This method worked well for grabbing the phone number. But what if we wanted to use RegEx as opposed to linguistic features, such as shape? First, let's write some RegEx to capture 555-5555.

In [66]:
import re

# Define a regex pattern to match sequences in the format "###-####"
pattern = r"((\d){3}-(\d){4})"

# Sample text containing sequences
text = "This is a sample number 555-5555."

# Use the 'findall' function from the 're' module to find all occurrences
# of the specified pattern in the given text
matches = re.findall(pattern, text)

# Print the list of matches
print(matches)


[('555-5555', '5', '5')]
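
The output above is a list of tuples because `re.findall` returns one item per capturing group when the pattern contains groups, and a repeated group such as `(\d){3}` keeps only its last repetition. As a small sketch of the difference, the same number can be matched without any capturing groups:

```python
import re

text = "This is a sample number 555-5555."

# With capturing groups: each result is a tuple of the groups
capturing = re.findall(r"((\d){3}-(\d){4})", text)
print(capturing)

# Without capturing groups: each result is simply the matched string
plain = re.findall(r"\d{3}-\d{4}", text)
print(plain)
```

The second form is usually easier to work with when only the full match is of interest.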


Okay. Now we know that we have a RegEx pattern that works. Let's try to implement it in the spaCy EntityRuler with the code below. When we execute it, we get no output.

In [67]:
#Sample text
text = "This is a sample number (555) 555-5555."

#Create a blank English pipeline (tokenizer only)
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}
                                                        ]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

There is one very important reason for this: spaCy's EntityRuler applies a RegEx to one token at a time, so it cannot pattern match across tokens. The tokenizer splits the phone number at the dash, so no single token contains the whole number. So, what are we to do in this scenario? We have a few different options that we will explore in the next notebook. Before we get to that, let's use RegEx to capture a phone number with no hyphen.

In [68]:
#Sample text
text = "This is a sample number 5555555."
#Create a blank English pipeline (tokenizer only)
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"^\d{7}$"}}
                                                        ]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

5555555 PHONE_NUMBER
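
To see why the hyphenated number was missed earlier, it can help to print the tokens the tokenizer produces. The sketch below does that and then tries one possible workaround: a multi-token pattern in which each RegEx is applied to a single token. It assumes a blank English pipeline, and the per-token expressions `^\d{3}$` and `^\d{4}$` are illustrative choices rather than the approach taken in the next notebook.

```python
import spacy

text = "This is a sample number (555) 555-5555."

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# Inspect how the tokenizer splits the phone number
tokens = [token.text for token in nlp.make_doc(text)]
print(tokens)

# One dictionary per token: a RegEx for the digits, a literal match for the dash
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {
        "label": "PHONE_NUMBER",
        "pattern": [
            {"TEXT": {"REGEX": r"^\d{3}$"}},
            {"ORTH": "-"},
            {"TEXT": {"REGEX": r"^\d{4}$"}},
        ],
    }
])

doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The trade-off is that the pattern has to mirror the tokenization exactly, so it depends on how the tokenizer treats the surrounding punctuation.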


### Extract Multi-Word Tokens
In this notebook, we are going to grab a multi-word token: in this case, a person whose first name is Paul. In the RegEx below, we look for any string that starts with "Paul", followed by a space and a capital letter. We then tell it to grab the rest of that second word.

In [69]:
import re

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

# Define a regex pattern to match occurrences of the name "Paul" followed by an uppercase word
pattern = r"Paul [A-Z]\w+"

# Use the 'finditer' function from the 're' module to find matches and get an iterator
matches = re.finditer(pattern, text)

# Iterate through the matches and print the match objects
for match in matches:
    print(match)


<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


Note that we have not grabbed the final "Paul", which is not followed by a last name. In this case, we are not interested in that Paul. Now that we know how to grab the multi-word tokens, we need a way to parse them in spaCy.

### Reconstruct Spans

This next stage is a bit more complicated, but works quite well once you understand the process. First, we need to import the libraries we will need. Note that we are also adding Span from spacy.tokens.

In [47]:
import re
import spacy
from spacy.tokens import Span

In [70]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
pattern = r"Paul [A-Z]\w+"

Here, we will create a blank spaCy English model and create the doc object of the text. It will have no entities in it because we are working with a blank model that does not have an "ner" component.

In [71]:
nlp = spacy.blank("en")
doc = nlp(text)

In [72]:
original_ents = list(doc.ents)

Now, let's iterate over the results from re.finditer(). In this cell, we are going to grab the start and end from each match. We will then create a temporary span from the characters where the match starts and ends in the doc object. This is important because tokens and characters do not always align correctly; when they do not, doc.char_span() returns None, which is why we check for it. Finally, we append the span's start and end (as token indices) and its text to mwt_ents. The text is not necessary, but it will help with debugging.

In [73]:
# Assuming 'doc' is a spaCy Doc object and 'pattern' is a pre-defined regex pattern
mwt_ents = []

# Iterate through the matches of the regular expression in the spaCy processed text
for match in re.finditer(pattern, doc.text):
    start, end = match.span()

    # Create a spaCy Span object using char_span
    span = doc.char_span(start, end)

    # If the span is not None, append information to the 'mwt_ents' list
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))
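
The `None` check matters whenever a character range ends in the middle of a token. As a minimal sketch, assuming a blank English pipeline and a deliberately misaligned range, the snippet below shows the default strict behavior next to the `alignment_mode="expand"` option of `doc.char_span()`, which widens the range to the nearest token boundaries. It also passes `label=` so that the label is set in the same call.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor.")

# Characters 0-8 cover "Paul New", which ends in the middle of the token "Newman"
strict_span = doc.char_span(0, 8)
print(strict_span)

# "expand" snaps the range outward to full tokens instead of returning None
expanded_span = doc.char_span(0, 8, label="PERSON", alignment_mode="expand")
print(expanded_span.text, expanded_span.label_)
```

Whether expanding is appropriate depends on the data; keeping the strict default and skipping misaligned matches, as the cell above does, is the more conservative choice.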

### Inject the Spans into the doc.ents
With that data, we can iterate over each entity and identify where it begins and ends in spaCy. Note, we are using the spaCy Span class. This allows us to create a span object and assign it a custom label. With this data, we can append each Span to original_ents.

In [74]:
# Assuming 'doc' is a spaCy Doc object and 'mwt_ents' is a list of tuples with start, end, and text
original_ents = []

# Iterate through the list of tuples in 'mwt_ents'
for ent in mwt_ents:
    start, end, name = ent

    # Create a spaCy Span object with label "PERSON"
    per_ent = Span(doc, start, end, label="PERSON")

    # Append the spaCy Span object to the 'original_ents' list
    original_ents.append(per_ent)


And finally, we set doc.ents equal to original_ents. This effectively loads the spans back into the spaCy doc.ents.

In [75]:
doc.ents = original_ents

Let's iterate over the ents as we normally would.

In [76]:
for ent in doc.ents:
    print (ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON


### Give priority to Longer Spans
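Sometimes, the situation is not so neat. Sometimes our custom RegEx entities will overlap with the entities found by spaCy's model. In the example below, the pattern "Hollywood" falls inside "Paul Hollywood", which the model already labels as a PERSON.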

In [77]:
import re
import spacy

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
pattern = r"Hollywood"

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
for ent in doc.ents:
    print (ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
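
One way to give priority to the longer span is `filter_spans` from `spacy.util`, which drops overlapping spans and prefers the longest one. The sketch below is written under a few assumptions: it uses a blank pipeline so that it runs without downloading a model, it rebuilds the two PERSON entities with a RegEx as a stand-in for the model's predictions, and the `CINEMA` label for the "Hollywood" match is a hypothetical choice.

```python
import re
import spacy
from spacy.util import filter_spans

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
pattern = r"Hollywood"

nlp = spacy.blank("en")
doc = nlp(text)

# Stand-in for the model's PERSON entities (assumption: built with a RegEx here)
model_ents = []
for match in re.finditer(r"Paul [A-Z]\w+", doc.text):
    span = doc.char_span(*match.span(), label="PERSON")
    if span is not None:
        model_ents.append(span)

# Custom RegEx entities; "Hollywood" overlaps with "Paul Hollywood"
regex_ents = []
for match in re.finditer(pattern, doc.text):
    span = doc.char_span(*match.span(), label="CINEMA")
    if span is not None:
        regex_ents.append(span)

# filter_spans keeps the longest span wherever two spans overlap
doc.ents = filter_spans(model_ents + regex_ents)

for ent in doc.ents:
    print(ent.text, ent.label_)
```

Assigning the unfiltered list to `doc.ents` would raise an error, because entities in `doc.ents` are not allowed to overlap; filtering first avoids that.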


## Thanks For Your Time 

### Yahia Chammami

#### Stay Tuned For The NLP Projects Implementation With Python ♥