# Text Pre-processing

In this notebook, you will learn how to pre-process text data using two key techniques: **regular expressions** and **SpaCy**.

## Objectives
1. **Regular Expressions:**
   - Understand the basics of regular expressions (regex) and their syntax.
   - Learn how to use regex for tasks such as extracting patterns, validating formats, and cleaning text data.

2. **SpaCy:**
   - Get familiar with SpaCy, a powerful Natural Language Processing (NLP) library in Python.
   - Discover how to perform various text pre-processing tasks with SpaCy, including:
     - **Tokenization:** Splitting text into individual words or tokens.
     - **Lemmatization:** Reducing words to their base or root forms.
     - **Stop Word Removal:** Identifying and removing common words that may not carry significant meaning.
     - **Lowercasing:** Converting all text to lowercase for consistency.


## Part 0: Web Scraping


### `requests` package

In [1]:
import requests

# Example: fetch and parse JSON from an API
# JSONPlaceholder API URL that returns a list of posts
URL = 'https://jsonplaceholder.typicode.com/posts'

# Make a GET request
response = requests.get(URL)
# If authentication is needed:
# data = {'username': 'user', 'password': 'pass'}
# response = requests.post(URL + '/login', data=data)

# Check if the request was successful
posts = None
if response.status_code == 200:
    posts = response.json()  # Parse the JSON response
else:
    # Any non-200 status lands here (for example 404 or 500)
    print('Error:', response.status_code)
print(posts)
print(type(posts))  # <class 'list'>; each item is a dict such as:
# {'userId': 4, 'id': 32, 'title': 'doloremque illum aliquid sunt', 'body': 'value'}
if posts:
    for post in posts[:5]:  # Print the first 5 posts
        print(f"Post ID: {post['id']}, Title: {post['title']}")

[{'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}, {'userId': 1, 'id': 2, 'title': 'qui est esse', 'body': 'est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla'}, {'userId': 1, 'id': 3, 'title': 'ea molestias quasi exercitationem repellat qui ipsa sit aut', 'body': 'et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut'}, {'userId': 1, 'id': 4, 'title': 'eum et est occaecati', 'body': 'ullam et saepe reiciendis voluptatem adipisci\nsit amet autem assumenda provident rerum culpa\nquis hic c

**Session Management**

The `requests` library supports session management, which lets you keep state (such as cookies and default headers) across multiple requests.
This is useful for logging in to websites or interacting with web applications that require authentication.
Example:
```
# create a session object
session = requests.Session()

# set default headers (optional)
session.headers.update({'User-Agent': 'my-app'})

# make the first request
response1 = session.get('https://jsonplaceholder.typicode.com/posts/1')
print('First Post Title:', response1.json()['title'])

# make another request using the same session
response2 = session.get('https://jsonplaceholder.typicode.com/posts/2')
print('Second Post Title:', response2.json()['title'])

#close the session when done
session.close()
```
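
A session can also be used as a context manager, which closes it automatically. The sketch below additionally passes a `timeout` and calls `raise_for_status()` so that HTTP errors are not silently ignored. The helper names `fetch_json` and `fetch_post_titles` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_json(session, url, timeout=10):
    """Fetch `url` with the given session and return the decoded JSON body.

    `fetch_json` is a hypothetical helper written for this example.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.json()


def fetch_post_titles(post_ids, base_url='https://jsonplaceholder.typicode.com/posts'):
    """Return the titles of the requested posts, reusing one session."""
    # The `with` block closes the session even if a request fails
    with requests.Session() as session:
        session.headers.update({'User-Agent': 'my-app'})
        return [fetch_json(session, f'{base_url}/{post_id}')['title']
                for post_id in post_ids]
```

Calling `fetch_post_titles([1, 2])` would send two GET requests through the same session.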

### `beautifulsoup4` package

In [2]:
# Save the URL of the webpage we want to scrape to a variable
url = 'https://docs.python.org/3/library/random.html#module-random'
# Send a get request and assign the response to a variable
response = requests.get(url)


In [3]:
from bs4 import BeautifulSoup
# Parse the raw response bytes into a BeautifulSoup object; naming the parser
# explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(response.content, 'html.parser')
type(soup)

In [4]:
# You can print the prettified version of the soup object for better readability
# print(soup.prettify())

In [5]:
# 1. Extracting the Title of the Page
title = soup.title.string
print('Page Title:', title)

# 2. Finding Specific Elements
# For example, let's find all function definitions in the random module
function_definitions = soup.find_all('dl')

# 3. Printing the Function Definitions
print('\nFunction Definitions:')
for func in function_definitions[:3]:
    print(func.get_text())  # Print the text of each function definition

# 4. Extracting Links
# Find all links on the page
links = soup.find_all('a')
print('\nLinks on the Page:')
for link in links[:3]:
    href = link.get('href')  # Get the href attribute
    text = link.string  # Get the link text
    print(f'Text: {text}, URL: {href}')

Page Title: random — Generate pseudo-random numbers — Python 3.13.0 documentation

Function Definitions:


random.seed(a=None, version=2)¶
Initialize the random number generator.
If a is omitted or None, the current system time is used.  If
randomness sources are provided by the operating system, they are used
instead of the system time (see the os.urandom() function for details
on availability).
If a is an int, it is used directly.
With version 2 (the default), a str, bytes, or bytearray
object gets converted to an int and all of its bits are used.
With version 1 (provided for reproducing random sequences from older versions
of Python), the algorithm for str and bytes generates a
narrower range of seeds.

Changed in version 3.2: Moved to the version 2 scheme which uses all of the bits in a string seed.


Changed in version 3.11: The seed must be one of the following types:
None, int, float, str,
bytes, or bytearray.




random.getstate()¶
Return an object capturing the current interna

In [6]:
# 5. Extracting Descriptions
# Find all descriptions (the first <dt> element followed by <dd>)
# <dt>=definition term. <dd>=definition description
print('\nDescriptions of Functions:')
for func in function_definitions[:3]:
    dt_tags = func.find_all('dt')
    dd_tags = func.find_all('dd')
    for dt, dd in zip(dt_tags, dd_tags):
        print(f'{dt.get_text()}: {dd.get_text()}')


Descriptions of Functions:

random.seed(a=None, version=2)¶: Initialize the random number generator.
If a is omitted or None, the current system time is used.  If
randomness sources are provided by the operating system, they are used
instead of the system time (see the os.urandom() function for details
on availability).
If a is an int, it is used directly.
With version 2 (the default), a str, bytes, or bytearray
object gets converted to an int and all of its bits are used.
With version 1 (provided for reproducing random sequences from older versions
of Python), the algorithm for str and bytes generates a
narrower range of seeds.

Changed in version 3.2: Moved to the version 2 scheme which uses all of the bits in a string seed.


Changed in version 3.11: The seed must be one of the following types:
None, int, float, str,
bytes, or bytearray.



random.getstate()¶: Return an object capturing the current internal state of the generator.  This
object can be passed to setstate() to restore
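
To make the `<dl>`/`<dt>`/`<dd>` structure easier to see, here is a minimal offline sketch of the same pairing logic. The HTML snippet is hand-written for illustration and is not copied from docs.python.org.

```python
from bs4 import BeautifulSoup

# A small hand-written snippet that mimics the <dl>/<dt>/<dd> layout
# (illustrative content only)
html = """
<dl>
  <dt>seed(a)</dt>
  <dd>Initialize the generator.</dd>
  <dt>getstate()</dt>
  <dd>Return the current state.</dd>
</dl>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# Pair each term (<dt>) with its description (<dd>) in document order
definitions = {
    dt.get_text(strip=True): dd.get_text(strip=True)
    for dt, dd in zip(demo_soup.find_all('dt'), demo_soup.find_all('dd'))
}
print(definitions)
```

Pairing with `zip` assumes that every `<dt>` has exactly one `<dd>`; nested definition lists on a real page can break that assumption.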

## Part 1: Regular Expressions


Regular expressions are sequences of characters that define search patterns, primarily used for string matching and manipulation. They enable powerful text processing by allowing users to define rules for finding, replacing, or extracting specific parts of a text. Commonly used for tasks like validating input (e.g., email addresses), searching for specific patterns (e.g., dates), or extracting substrings, regular expressions operate based on a series of predefined symbols and syntax.

The `re` module has several key functions:

- `re.findall` (returns all matches as a list),
- `re.match` (matches the pattern only at the beginning of the string),
- `re.search` (scans the string and returns the first match found anywhere in it),
- `re.sub` (replaces matches in the string with something else),
- `re.split` (splits the string based on matches with the pattern).

A **pattern** is the regular expression itself. It describes, using a special language, what we want to find in the string. It will be clearer with examples.
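
As a quick orientation, the sketch below calls each of the five functions on one short string. The string `sample` is made up for this illustration.

```python
import re

sample = 'cat bat, cat hat'

print(re.findall('cat', sample))         # every match, as a list of strings
print(re.match('cat', sample))           # a Match object: the string starts with 'cat'
print(re.match('bat', sample))           # None: 'bat' is not at the beginning
print(re.search('bat', sample).start())  # index of the first 'bat' anywhere in the string
print(re.sub('cat', 'dog', sample))      # replace every match
print(re.split(', ', sample))            # split on a comma followed by a space
```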


In [7]:
import re
text = """It is only with the heart that one can see rightly; what is essential is invisible to the eye.

The little prince, who was very curious, traveled from planet to planet, meeting different characters. He learned that true beauty comes from the heart, and the most important things are often simple and pure.

One day, he met a fox who told him, “You become responsible, forever, for what you have tamed. You are responsible for your rose.”

As he continued his journey, he realized that friendship and love are what make life truly worthwhile.
"""

The simplest pattern is the substring we want to find. For example, to find all occurrences of the uppercase letter "I", the pattern is simply `I`.

In [8]:
re.findall('I', text), re.findall('Y', text)

(['I'], ['Y', 'Y'])

In [9]:
re.findall('y', 'YYY')  # case matters: 'y' does not match 'Y'

[]

In [10]:
re.sub('O', 'a', 'OOO')# substitute all O with a

'aaa'

You can also use ranges to avoid listing all the characters, such as all letters or all digits.

### Common Ranges:

- `[a-z]` - all lowercase English letters
- `[A-Z]` - all uppercase English letters
- `[0-9]` - all digits
- `[A-z]` - avoid this range: it raises no error, but it also matches unwanted characters. Ranges follow the order of the Unicode table (https://unicode-table.com/en/), and several punctuation characters (such as `[`, `^` and `_`) sit between the uppercase and lowercase English letters.
- `[a-zA-Z]`- all English letters
- `[a-zàâçéèêëîïôùûü]` - all lowercase French letters
- `[A-ZÀÂÇÉÈÊËÎÏÔÙÛÜ]` - all uppercase French letters
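
The difference between `[A-z]` and `[a-zA-Z]` is easy to check on a short probe string (the string is made up for this illustration):

```python
import re

probe = 'a_Z^[m]'

wide = re.findall(r'[A-z]', probe)       # also picks up '_', '^', '[' and ']'
letters = re.findall(r'[a-zA-Z]', probe)  # letters only

print(wide)
print(letters)
```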


  

In [11]:
re.sub("[a-zA-Z]", '', text)# remove all letters from our text

'          ;        .\n\n  ,    ,     ,   .         ,          .\n\n ,       , “  , ,     .      .”\n\n    ,            .\n'

The symbol `^` inside brackets means negation.


In [12]:
re.findall('[^D-Z]', text)# matches everything besides [D-Z]

['t',
 ' ',
 'i',
 's',
 ' ',
 'o',
 'n',
 'l',
 'y',
 ' ',
 'w',
 'i',
 't',
 'h',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'h',
 'e',
 'a',
 'r',
 't',
 ' ',
 't',
 'h',
 'a',
 't',
 ' ',
 'o',
 'n',
 'e',
 ' ',
 'c',
 'a',
 'n',
 ' ',
 's',
 'e',
 'e',
 ' ',
 'r',
 'i',
 'g',
 'h',
 't',
 'l',
 'y',
 ';',
 ' ',
 'w',
 'h',
 'a',
 't',
 ' ',
 'i',
 's',
 ' ',
 'e',
 's',
 's',
 'e',
 'n',
 't',
 'i',
 'a',
 'l',
 ' ',
 'i',
 's',
 ' ',
 'i',
 'n',
 'v',
 'i',
 's',
 'i',
 'b',
 'l',
 'e',
 ' ',
 't',
 'o',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'e',
 'y',
 'e',
 '.',
 '\n',
 '\n',
 'h',
 'e',
 ' ',
 'l',
 'i',
 't',
 't',
 'l',
 'e',
 ' ',
 'p',
 'r',
 'i',
 'n',
 'c',
 'e',
 ',',
 ' ',
 'w',
 'h',
 'o',
 ' ',
 'w',
 'a',
 's',
 ' ',
 'v',
 'e',
 'r',
 'y',
 ' ',
 'c',
 'u',
 'r',
 'i',
 'o',
 'u',
 's',
 ',',
 ' ',
 't',
 'r',
 'a',
 'v',
 'e',
 'l',
 'e',
 'd',
 ' ',
 'f',
 'r',
 'o',
 'm',
 ' ',
 'p',
 'l',
 'a',
 'n',
 'e',
 't',
 ' ',
 't',
 'o',
 ' ',
 'p',
 'l',
 'a',
 'n',
 'e',
 't',
 ',',
 '

There are also shorthand operators that stand for whole groups of characters.

### The main operators are:

- `\w \W` - any word character (letter, digit, or underscore) and any character that is not a word character
- `\d \D` - any digit and any character that is not a digit
- `.` - any character except for a newline
- `\s \S` - any whitespace character (space, tab, newline) and any character that is not whitespace


In [13]:
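# Word characters are removed; only whitespace and punctuation remain
# (the r'' prefix makes a raw string, so the backslash reaches the regex engine unchanged)
re.sub(r'\w', '', text)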

'          ;        .\n\n  ,    ,     ,   .         ,          .\n\n ,       , “  , ,     .      .”\n\n    ,            .\n'

In [14]:
# Whitespace and punctuation are removed; only word characters (here, letters) remain
re.sub(r'\W', '', text)

'ItisonlywiththeheartthatonecanseerightlywhatisessentialisinvisibletotheeyeThelittleprincewhowasverycurioustraveledfromplanettoplanetmeetingdifferentcharactersHelearnedthattruebeautycomesfromtheheartandthemostimportantthingsareoftensimpleandpureOnedayhemetafoxwhotoldhimYoubecomeresponsibleforeverforwhatyouhavetamedYouareresponsibleforyourroseAshecontinuedhisjourneyherealizedthatfriendshipandlovearewhatmakelifetrulyworthwhile'

In [15]:
# \d - the text contains no digits, so the result is empty
re.findall(r'\d', text)

[]

In [16]:
# \D - matches any non-digit; the text has no digits, so every character is replaced
re.sub(r'\D', '_', text)

'______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________'

In [17]:
# \s - splitting on whitespace gives a simple tokenizer
# (consecutive newlines produce empty strings in the result)
re.split(r"\s", text)

['It',
 'is',
 'only',
 'with',
 'the',
 'heart',
 'that',
 'one',
 'can',
 'see',
 'rightly;',
 'what',
 'is',
 'essential',
 'is',
 'invisible',
 'to',
 'the',
 'eye.',
 '',
 'The',
 'little',
 'prince,',
 'who',
 'was',
 'very',
 'curious,',
 'traveled',
 'from',
 'planet',
 'to',
 'planet,',
 'meeting',
 'different',
 'characters.',
 'He',
 'learned',
 'that',
 'true',
 'beauty',
 'comes',
 'from',
 'the',
 'heart,',
 'and',
 'the',
 'most',
 'important',
 'things',
 'are',
 'often',
 'simple',
 'and',
 'pure.',
 '',
 'One',
 'day,',
 'he',
 'met',
 'a',
 'fox',
 'who',
 'told',
 'him,',
 '“You',
 'become',
 'responsible,',
 'forever,',
 'for',
 'what',
 'you',
 'have',
 'tamed.',
 'You',
 'are',
 'responsible',
 'for',
 'your',
 'rose.”',
 '',
 'As',
 'he',
 'continued',
 'his',
 'journey,',
 'he',
 'realized',
 'that',
 'friendship',
 'and',
 'love',
 'are',
 'what',
 'make',
 'life',
 'truly',
 'worthwhile.',
 '']
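
The empty strings in the output come from consecutive whitespace characters. A small sketch of one way to avoid them, using the `+` quantifier (one or more repetitions) on a made-up string:

```python
import re

sample = 'one two\n\nthree\n'

per_char = re.split(r'\s', sample)           # splits at every single whitespace character
per_run = re.split(r'\s+', sample.strip())   # splits at runs of whitespace, after trimming the ends

print(per_char)
print(per_run)
print(sample.split())  # the built-in str.split() gives the same tokens
```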

To repeat the same character, you can place a `+` after it (indicating 1 or more repetitions) or a `*` (indicating 0 or more repetitions).
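
The difference between the two quantifiers shows up when the repeated character is absent. A minimal sketch on a made-up string:

```python
import re

sample = 'a ab abb b'

one_or_more = re.findall(r'ab+', sample)   # 'a' followed by one or more 'b'
zero_or_more = re.findall(r'ab*', sample)  # 'a' followed by zero or more 'b'

print(one_or_more)
print(zero_or_more)
```

Only the `*` version matches the lone `a`, because `b*` is allowed to match nothing.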


In [18]:
re.findall(r'\w+', text)  # find all runs of word characters (letters, digits, underscore)

['It',
 'is',
 'only',
 'with',
 'the',
 'heart',
 'that',
 'one',
 'can',
 'see',
 'rightly',
 'what',
 'is',
 'essential',
 'is',
 'invisible',
 'to',
 'the',
 'eye',
 'The',
 'little',
 'prince',
 'who',
 'was',
 'very',
 'curious',
 'traveled',
 'from',
 'planet',
 'to',
 'planet',
 'meeting',
 'different',
 'characters',
 'He',
 'learned',
 'that',
 'true',
 'beauty',
 'comes',
 'from',
 'the',
 'heart',
 'and',
 'the',
 'most',
 'important',
 'things',
 'are',
 'often',
 'simple',
 'and',
 'pure',
 'One',
 'day',
 'he',
 'met',
 'a',
 'fox',
 'who',
 'told',
 'him',
 'You',
 'become',
 'responsible',
 'forever',
 'for',
 'what',
 'you',
 'have',
 'tamed',
 'You',
 'are',
 'responsible',
 'for',
 'your',
 'rose',
 'As',
 'he',
 'continued',
 'his',
 'journey',
 'he',
 'realized',
 'that',
 'friendship',
 'and',
 'love',
 'are',
 'what',
 'make',
 'life',
 'truly',
 'worthwhile']

To make a character optional, place a `?` after it (0 or 1 occurrence).


In [19]:
# The trailing comma is required
re.findall(r'\w+,', text)

['prince,',
 'curious,',
 'planet,',
 'heart,',
 'day,',
 'him,',
 'responsible,',
 'forever,',
 'journey,']

In [20]:
# The trailing comma is optional
re.findall(r'\w+,?', text)

['It',
 'is',
 'only',
 'with',
 'the',
 'heart',
 'that',
 'one',
 'can',
 'see',
 'rightly',
 'what',
 'is',
 'essential',
 'is',
 'invisible',
 'to',
 'the',
 'eye',
 'The',
 'little',
 'prince,',
 'who',
 'was',
 'very',
 'curious,',
 'traveled',
 'from',
 'planet',
 'to',
 'planet,',
 'meeting',
 'different',
 'characters',
 'He',
 'learned',
 'that',
 'true',
 'beauty',
 'comes',
 'from',
 'the',
 'heart,',
 'and',
 'the',
 'most',
 'important',
 'things',
 'are',
 'often',
 'simple',
 'and',
 'pure',
 'One',
 'day,',
 'he',
 'met',
 'a',
 'fox',
 'who',
 'told',
 'him,',
 'You',
 'become',
 'responsible,',
 'forever,',
 'for',
 'what',
 'you',
 'have',
 'tamed',
 'You',
 'are',
 'responsible',
 'for',
 'your',
 'rose',
 'As',
 'he',
 'continued',
 'his',
 'journey,',
 'he',
 'realized',
 'that',
 'friendship',
 'and',
 'love',
 'are',
 'what',
 'make',
 'life',
 'truly',
 'worthwhile']

In [21]:
m = re.search(r'\s(.+?)\s', text)  # .search returns the first match; parentheses capture part of it
m.group(1)  # the text captured between the first two whitespace characters

'is'
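
The `?` after `.+` in the pattern above makes the quantifier lazy: it stops at the earliest point where the rest of the pattern can match. Without it, `.+` is greedy and grabs as much as possible. A short sketch on a made-up string:

```python
import re

sample = 'one two three four'

greedy = re.search(r'\s(.+)\s', sample).group(1)  # extends to the last whitespace
lazy = re.search(r'\s(.+?)\s', sample).group(1)   # stops at the next whitespace

print(repr(greedy))
print(repr(lazy))
```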

Here’s a breakdown of the basic components used in regular expressions (regex):

1. **Literal Characters**  
   These are the exact characters you want to search for.  
   **Example:** The pattern `cat` will match the string "cat" exactly.

2. **Metacharacters**  
   These characters have special meanings in regex patterns:  
   - `.`: Matches any single character (except newline).  
     **Example:** `c.t` matches "cat", "cut", "cot".  
   - `^`: Matches the start of a string.  
     **Example:** `^a` matches "apple", but not "banana".  
   - `$`: Matches the end of a string.  
     **Example:** `e$` matches "apple", but not "banana".  
   - `*`: Matches 0 or more occurrences of the preceding element.  
     **Example:** `a*` matches "", "a", "aa", "aaa", etc.  
   - `+`: Matches 1 or more occurrences of the preceding element.  
     **Example:** `a+` matches "a", "aa", "aaa", but not "".  
   - `?`: Matches 0 or 1 occurrence of the preceding element.  
     **Example:** `a?` matches "" or "a".  
   - `\`: Escapes a special character, so it can be matched as a literal.  
     **Example:** `\.` matches a period (.) rather than any character.

3. **Character Classes**  
   These define sets of characters you want to match:  
   - `[abc]`: Matches any single character that is a, b, or c.  
     **Example:** `b[aeiou]t` matches "bat", "bet", "bit", "bot", "but".  
   - `[^abc]`: Matches any character except a, b, or c.  
     **Example:** `[^0-9]` matches any non-digit character.  
   - `[a-z]`: Matches any lowercase letter from a to z.  
     **Example:** `[a-z]` matches any single lowercase letter.  
   - `[A-Z]`: Matches any uppercase letter from A to Z.  
     **Example:** `[A-Z]` matches any single uppercase letter.  
   - `[0-9]`: Matches any digit from 0 to 9.  
     **Example:** `[0-9]` matches any single digit.

4. **Quantifiers**  
   These specify how many times a preceding character or group should occur:  
   - `{n}`: Exactly n occurrences.  
     **Example:** `a{3}` matches "aaa", but not "aa".  
   - `{n,}`: At least n occurrences.  
     **Example:** `a{2,}` matches "aa", "aaa", "aaaa", etc.  
   - `{n,m}`: Between n and m occurrences.  
     **Example:** `a{2,4}` matches "aa", "aaa", "aaaa".

5. **Anchors**  
   These match specific positions in the text:  
   - `^`: Start of a string.  
     **Example:** `^Hello` matches "Hello world", but not "Hi, Hello".  
   - `$`: End of a string.  
     **Example:** `world$` matches "Hello world", but not "world, hello".

6. **Groups and Capturing**  
   - `(...)`: Groups multiple tokens together. Captures the matched text inside parentheses.  
     **Example:** `(abc)+` matches "abc", "abcabc", etc.  
   - Non-capturing group `(?:...)`: Groups tokens without capturing.  
     **Example:** `(?:abc)+` matches "abc", "abcabc", but won’t capture it.

7. **Alternation**  
   - `|`: Represents a logical OR between patterns.  
     **Example:** `cat|dog` matches either "cat" or "dog".

8. **Escaped Characters**  
   - `\d`: Matches any digit ([0-9]).  
     **Example:** `\d` matches "1", "2", "3", etc.  
   - `\D`: Matches any non-digit.  
     **Example:** `\D` matches "a", "b", "#", etc.  
   - `\w`: Matches any word character (letters, digits, underscore).  
     **Example:** `\w` matches "a", "1", "_", etc.  
   - `\W`: Matches any non-word character.  
     **Example:** `\W` matches "!", "@", etc.  
   - `\s`: Matches any whitespace character (spaces, tabs, newlines).  
     **Example:** `\s` matches a space or tab.  
   - `\S`: Matches any non-whitespace character.  
     **Example:** `\S` matches "a", "1", etc.
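
Several of these components are often combined. The sketch below uses quantifiers, capturing and non-capturing groups, and alternation on a made-up sentence; the dates in it are illustrative only.

```python
import re

log = 'Released 2019-10-24, updated 2020-02-03.'

# Quantifiers {n} fix the number of digits; parentheses capture each part,
# so findall returns one tuple of groups per match
date_parts = re.findall(r'(\d{4})-(\d{2})-(\d{2})', log)

# With a non-capturing group, findall returns the whole match instead
full_dates = re.findall(r'\d{4}(?:-\d{2}){2}', log)

# Alternation matches either word
verbs = re.findall(r'Released|updated', log)

print(date_parts)
print(full_dates)
print(verbs)
```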

### Example Summary  
**Pattern:** `^a.*z$`  
**Explanation:**  
- `^`: Start of the string.  
- `a`: The letter "a".  
- `.*`: Any characters (0 or more).  
- `z`: The letter "z".  
- `$`: End of the string.  

**Matches:** Strings that start with "a" and end with "z", like "abcz", "amaz", etc.
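
A quick way to check the pattern is to try it on a few candidate strings. The candidates below are chosen for illustration.

```python
import re

pattern = re.compile(r'^a.*z$')

candidates = ['abcz', 'amaz', 'az', 'abc', 'baz']
results = {c: bool(pattern.search(c)) for c in candidates}

for candidate, matched in results.items():
    print(candidate, matched)
```

Note that `az` also matches, because `.*` may match zero characters.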


### Exercise 1

Define a regular expression pattern that matches valid email addresses and apply it with `re.findall()`.
The output should contain all email addresses in the text.


In [22]:
html_content = """
<html>
<head><title>Email Extraction Example</title></head>
<body>
    <p>Contact us at: john.doe@example.com</p>
    <p>Support: jane.smith@website.org</p>
    <p>Invalid email: invalid-email@</p>
    <p>Info: info@company.com</p>
    <p>Phone: +1-800-555-0199</p>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text()
print(text)
# Simplified pattern: local part, '@', domain, then a dot and a top-level domain of 2+ letters
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find all matching email addresses in the extracted text
email_matches = re.findall(email_pattern, text)

# Output the valid email addresses
print("Valid Email Addresses:")
for email in email_matches:
    print(email)



Email Extraction Example

Contact us at: john.doe@example.com
Support: jane.smith@website.org
Invalid email: invalid-email@
Info: info@company.com
Phone: +1-800-555-0199



Valid Email Addresses:
john.doe@example.com
jane.smith@website.org
info@company.com
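
`re.findall()` extracts addresses from a longer text. To validate a single string instead, one option is `re.fullmatch()`, which succeeds only if the entire string fits the pattern. The helper name `is_valid_email` is an assumption made for this sketch, and the pattern is the same simplified one as above, not a complete email validator.

```python
import re

# Same simplified pattern as in the exercise; it does not cover every valid address
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')


def is_valid_email(candidate):
    """Return True if the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(candidate) is not None


for candidate in ['john.doe@example.com', 'invalid-email@', 'see info@company.com now']:
    print(candidate, '->', is_valid_email(candidate))
```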


## Part 2: SpaCy Tutorial

In [23]:
# !pip install spacy==3.0.5

In [24]:
import spacy

In [25]:
!python -m spacy download fr_core_news_sm
!python -m spacy download en_core_web_sm

Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

**If the cell below raises an error**  

From the top menu, select Runtime->Restart Runtime (or press **Ctrl+M .**).  
Then run the cell _below_; the cells above it do not need to be re-run.



In [26]:
import spacy # in case of restarting
spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x7a33953b1810>

In [27]:
import pandas as pd
from spacy.matcher import DependencyMatcher
from spacy.matcher import Matcher

### Text Processing with Pretrained Pipelines

To process text, you first need to load a pretrained pipeline: an object of the `Language` class that bundles a vocabulary, language data (rules used for tokenization, lemmatization, lexical attribute extraction, etc.), and the trained model weights.

## Language Models

Language models for the same language differ in the size of the vocabulary and the accuracy of components (pipes).

### Components (Pipes):
- `transformer`
- `tagger`
- `parser`
- `ner`
- `attribute_ruler`
- `lemmatizer`
- `senter`
- `tok2vec`
- Trainable pipes (non-default)

## Viewing Components and Analysis

You can view the list of available components and analyze the results of their operation using the following command:

```python
nlp.analyze_pipes(pretty=True)
```


In [28]:
nlp = spacy.load('en_core_web_sm')

In [29]:
nlp.component_names

['tok2vec',
 'tagger',
 'parser',
 'senter',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [30]:
spacy.load('en_core_web_sm').component_names

['tok2vec',
 'tagger',
 'parser',
 'senter',
 'attribute_ruler',
 'lemmatizer',
 'ner']

You can view the accuracy parameters of various English language models [here](https://spacy.io/models/en) (Accuracy Evaluation).


### Processing Text

To process text, you need to pass it through the pipeline, which will return an instance of the [Doc](https://spacy.io/api/doc) class.


In [31]:
text = 'The Leonardo da Vinci exhibition is held under the high patronage of French President Emmanuel Macron. \
The year 2019 marks the 500-year anniversary of the death of Leonardo da Vinci in France, of particular importance for the Louvre, \
which holds the largest collection in the world of da Vinci’s paintings, as well as 22 drawings.'
text

'The Leonardo da Vinci exhibition is held under the high patronage of French President Emmanuel Macron. The year 2019 marks the 500-year anniversary of the death of Leonardo da Vinci in France, of particular importance for the Louvre, which holds the largest collection in the world of da Vinci’s paintings, as well as 22 drawings.'

In [32]:
# input text to nlp model and get SpaCy Document object
doc = nlp(text)
type(doc)

spacy.tokens.doc.Doc

Complete list of attributes and methods available in the `Doc` class: https://spacy.io/api/doc.


In [33]:
print(*dir(doc), sep ='\n')

_
__bytes__
__class__
__delattr__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__getitem__
__gt__
__hash__
__init__
__init_subclass__
__iter__
__le__
__len__
__lt__
__ne__
__new__
__pyx_vtable__
__reduce__
__reduce_ex__
__repr__
__setattr__
__setstate__
__sizeof__
__str__
__subclasshook__
__unicode__
_bulk_merge
_context
_get_array_attrs
_realloc
_vector
_vector_norm
cats
char_span
copy
count_by
doc
ents
extend_tensor
from_array
from_bytes
from_dict
from_disk
from_docs
from_json
get_extension
get_lca_matrix
has_annotation
has_extension
has_unknown_spaces
has_vector
is_nered
is_parsed
is_sentenced
is_tagged
lang
lang_
mem
noun_chunks
noun_chunks_iterator
remove_extension
retokenize
sentiment
sents
set_ents
set_extension
similarity
spans
tensor
text
text_with_ws
to_array
to_bytes
to_dict
to_disk
to_json
to_utf8_array
user_data
user_hooks
user_span_hooks
user_token_hooks
vector
vector_norm
vocab


In this task, only the **sents** property will be used.

(From the SpaCy documentation):  
`sents`: **YIELDS** Sentences in the document.

It is a generator that yields one sentence at a time, following the sentence boundaries set by the pipeline.  
Element type: `spacy.tokens.span.Span`. Many attributes of this class overlap with those of `spacy.tokens.doc.Doc`.

\* _The values yielded by the generator can be collected into a list with `list(doc.sents)`._
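
The generator behaviour can be tried without downloading a model. The sketch below assumes a blank English pipeline with SpaCy's rule-based `sentencizer` component instead of `en_core_web_sm`, so the sentence boundaries may differ from those of a trained pipeline. The names `nlp_blank` and `demo_doc` are chosen so that the notebook's `nlp` and `doc` are not overwritten.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based `sentencizer`,
# so no model download is needed; a trained pipeline may split differently
nlp_blank = spacy.blank('en')
nlp_blank.add_pipe('sentencizer')

demo_doc = nlp_blank('This is the first sentence. Here is the second one.')
sentences = list(demo_doc.sents)  # exhaust the generator into a list

print(len(sentences))
for sentence in sentences:
    print(type(sentence).__name__, '->', sentence.text)
```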


In [34]:
doc.sents

<generator at 0x7a32ed8f34c0>

In [35]:
for sent in doc.sents:
  print(sent)

The Leonardo da Vinci exhibition is held under the high patronage of French President Emmanuel Macron.
The year 2019 marks the 500-year anniversary of the death of Leonardo da Vinci in France, of particular importance for the Louvre, which holds the largest collection in the world of da Vinci’s paintings, as well as 22 drawings.


In [36]:
print(*list(doc.sents), sep='\n\n')

The Leonardo da Vinci exhibition is held under the high patronage of French President Emmanuel Macron.

The year 2019 marks the 500-year anniversary of the death of Leonardo da Vinci in France, of particular importance for the Louvre, which holds the largest collection in the world of da Vinci’s paintings, as well as 22 drawings.


In [37]:
for sent in doc.sents:
  print(type(sent))

<class 'spacy.tokens.span.Span'>
<class 'spacy.tokens.span.Span'>


### Span and Token attributes

In [38]:
dir(spacy.tokens.span.Span)

['_',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_fix_dep_copy',
 '_vector',
 '_vector_norm',
 'as_doc',
 'char_span',
 'conjuncts',
 'doc',
 'end',
 'end_char',
 'ent_id',
 'ent_id_',
 'ents',
 'get_extension',
 'get_lca_matrix',
 'has_extension',
 'has_vector',
 'id',
 'id_',
 'kb_id',
 'kb_id_',
 'label',
 'label_',
 'lefts',
 'lemma_',
 'n_lefts',
 'n_rights',
 'noun_chunks',
 'orth_',
 'remove_extension',
 'rights',
 'root',
 'sent',
 'sentiment',
 'sents',
 'set_extension',
 'similarity',
 'start',
 'start_char',
 'subtree',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'vector',
 'vector_norm',
 'vocab']

In [39]:
# Main Span attributes (each sentence is a Span)
for sent in doc.sents:
    print('Sentence:', sent)
    print('Length:', sent.end - sent.start)  # Number of words (tokens) in the sentence (including punctuation)
    print('Noun chunks:', set(sent.noun_chunks))  # Base noun phrases
    print('Ents:', sent.ents)  # Named entities in the span
    print()

Sentence: The Leonardo da Vinci exhibition is held under the high patronage of French President Emmanuel Macron.
Length: 17
Noun chunks: {the high patronage, The Leonardo da Vinci exhibition, French President Emmanuel Macron}
Ents: [Leonardo da Vinci, French, Emmanuel Macron]

Sentence: The year 2019 marks the 500-year anniversary of the death of Leonardo da Vinci in France, of particular importance for the Louvre, which holds the largest collection in the world of da Vinci’s paintings, as well as 22 drawings.
Length: 46
Noun chunks: {Leonardo da Vinci, the largest collection, the world, the 500-year anniversary, da Vinci’s paintings, which, 22 drawings, The year, particular importance, the Louvre, the death, France}
Ents: [The year 2019, 500-year, Leonardo da Vinci, France, da Vinci, 22]



**Noun chunks** are base noun phrases.  
**Ents** are named entities: the words or phrases in the sentence that were recognized through [Named Entity Recognition](https://machinelearningknowledge.ai/named-entity-recognition-ner-in-spacy-library/?ysclid=m21epwuc6o95221647).  
Let's examine the types of named entities in the text.

In [40]:
for sent in doc.sents:
  entities = zip(sent.ents, [(s.label_, spacy.explain(str(s.label_))) for s in sent.ents])
  print('named entities:',*entities, sep ='\n', end = '\n\n')

named entities:
(Leonardo da Vinci, ('PERSON', 'People, including fictional'))
(French, ('NORP', 'Nationalities or religious or political groups'))
(Emmanuel Macron, ('PERSON', 'People, including fictional'))

named entities:
(The year 2019, ('DATE', 'Absolute or relative dates or periods'))
(500-year, ('DATE', 'Absolute or relative dates or periods'))
(Leonardo da Vinci, ('PERSON', 'People, including fictional'))
(France, ('GPE', 'Countries, cities, states'))
(da Vinci, ('PERSON', 'People, including fictional'))
(22, ('CARDINAL', 'Numerals that do not fall under another type'))



Full list of named entity labels:

- `PERSON`: People, including fictional
- `NORP`: Nationalities or religious or political groups
- `FACILITY`: Buildings, airports, highways, bridges, etc.
- `FAC`: Buildings, airports, highways, bridges, etc.
- `ORG`: Companies, agencies, institutions, etc.
- `GPE`: Countries, cities, states
- `LOC`: Non-GPE locations, mountain ranges, bodies of water
- `PRODUCT`: Objects, vehicles, foods, etc. (not services)
- `EVENT`: Named hurricanes, battles, wars, sports events, etc.
- `WORK_OF_ART`: Titles of books, songs, etc.
- `LAW`: Named documents made into laws.
- `LANGUAGE`: Any named language
- `DATE`: Absolute or relative dates or periods
- `TIME`: Times smaller than a day
- `PERCENT`: Percentage, including "%"
- `MONEY`: Monetary values, including unit
- `QUANTITY`: Measurements, as of weight or distance
- `ORDINAL`: "first", "second", etc.
- `CARDINAL`: Numerals that do not fall under another type

When iterating over a sentence, the objects returned are instances of the **Token** class.


In [41]:
for token in list(doc.sents)[0]:
    print(token)

The
Leonardo
da
Vinci
exhibition
is
held
under
the
high
patronage
of
French
President
Emmanuel
Macron
.


In [42]:
print(type(list(doc.sents)[0][0]))
print(*dir(list(doc.sents)[0][0]), sep ='\n')

<class 'spacy.tokens.token.Token'>
_
__bytes__
__class__
__delattr__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__gt__
__hash__
__init__
__init_subclass__
__le__
__len__
__lt__
__ne__
__new__
__pyx_vtable__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__str__
__subclasshook__
__unicode__
ancestors
check_flag
children
cluster
conjuncts
dep
dep_
doc
ent_id
ent_id_
ent_iob
ent_iob_
ent_kb_id
ent_kb_id_
ent_type
ent_type_
get_extension
has_dep
has_extension
has_head
has_morph
has_vector
head
i
idx
iob_strings
is_alpha
is_ancestor
is_ascii
is_bracket
is_currency
is_digit
is_left_punct
is_lower
is_oov
is_punct
is_quote
is_right_punct
is_sent_end
is_sent_start
is_space
is_stop
is_title
is_upper
lang
lang_
left_edge
lefts
lemma
lemma_
lex
lex_id
like_email
like_num
like_url
lower
lower_
morph
n_lefts
n_rights
nbor
norm
norm_
orth
orth_
pos
pos_
prefix
prefix_
prob
rank
remove_extension
right_edge
rights
sent
sent_start
sentiment
set_extension
set_morph
shape
shape_
s

That is, using the attributes mentioned above, you can:

- Obtain the tokens related to the current one, such as its coordinated tokens (`conjuncts`), as well as the token's text (`text`), prefix, suffix, part-of-speech tag (`pos_`), detailed part-of-speech tag (`tag_`), dependency label (`dep_`), etc.

Many `Token` attributes come in two variants: one with a trailing underscore and one without. The variant with the underscore returns a string value, while the one without returns an integer ID.  
Let's focus on the attributes that will be useful for completing the task.


In [43]:
print(f"{'Token' : <13}{'.i' : <5}{'.head' : <10}{'.norm_' : <13}\
    {'.lemma_' : <13}{'.pos_' : <10}{'.tag_ ': <10}{'.dep_': <10}")
for token in list(doc.sents)[0]:
    #print(*[token, token.text, token.lemma_, token.tag_, token.pos_, token.dep_], sep = '\t')
    print(f"{str(token) : <13}{str(token.i) : <5}{str(token.head) : <10}{str(token.norm_) : <13}\
    {str(token.lemma_) : <13}{str(token.pos_) : <10}{str(token.tag_ ): <10}{str(token.dep_): <10}")

Token        .i   .head     .norm_           .lemma_      .pos_     .tag_     .dep_     
The          0    exhibitionthe              the          DET       DT        det       
Leonardo     1    Vinci     leonardo         Leonardo     PROPN     NNP       compound  
da           2    Vinci     da               da           PROPN     NNP       compound  
Vinci        3    exhibitionvinci            Vinci        PROPN     NNP       compound  
exhibition   4    held      exhibition       exhibition   NOUN      NN        nsubjpass 
is           5    held      is               be           AUX       VBZ       auxpass   
held         6    held      held             hold         VERB      VBN       ROOT      
under        7    held      under            under        ADP       IN        prep      
the          8    patronage the              the          DET       DT        det       
high         9    patronage high             high         ADJ       JJ        amod      
patronage    10   und

You can view the complete list of tags [here](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py).

For convenience, the tag dictionary for English language models is presented below.


In [44]:
GLOSSARY = {
    # POS tags
    # Universal POS Tags
    # http://universaldependencies.org/u/pos/
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CONJ": "conjunction",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
    "EOL": "end of line",
    "SPACE": "space",
    # POS tags (English)
    # OntoNotes 5 / Penn Treebank
    # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    # https://universaldependencies.org/docs/en/pos/
    ".": "punctuation mark, sentence closer",
    ",": "punctuation mark, comma",
    "-LRB-": "left round bracket",
    "-RRB-": "right round bracket",
    "``": "opening quotation mark",
    '""': "closing quotation mark",
    "''": "closing quotation mark",
    ":": "punctuation mark, colon or ellipsis",
    "$": "symbol, currency",
    "#": "symbol, number sign",
    "AFX": "affix",
    "CC": "conjunction, coordinating",
    "CD": "cardinal number",
    "DT": "determiner",
    "EX": "existential there",
    "FW": "foreign word",
    "HYPH": "punctuation mark, hyphen",
    "IN": "conjunction, subordinating or preposition",
    "JJ": "adjective",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "LS": "list item marker",
    "MD": "verb, modal auxiliary",
    "NIL": "missing tag",
    "NN": "noun, singular or mass",
    "NNP": "noun, proper singular",
    "NNPS": "noun, proper plural",
    "NNS": "noun, plural",
    "PDT": "predeterminer",
    "POS": "possessive ending",
    "PRP": "pronoun, personal",
    "PRP$": "pronoun, possessive",
    "RB": "adverb",
    "RBR": "adverb, comparative",
    "RBS": "adverb, superlative",
    "RP": "adverb, particle",
    "TO": 'infinitival "to"',
    "UH": "interjection",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner",
    "WP": "wh-pronoun, personal",
    "WP$": "wh-pronoun, possessive",
    "WRB": "wh-adverb",
    "SP": "space",
    "ADD": "email",
    "NFP": "superfluous punctuation",
    "GW": "additional word in multi-word expression",
    "XX": "unknown",
    "BES": 'auxiliary "be"',
    "HVS": 'forms of "have"',
    # Noun chunks
    "NP": "noun phrase",
    "PP": "prepositional phrase",
    "VP": "verb phrase",
    "ADVP": "adverb phrase",
    "ADJP": "adjective phrase",
    "SBAR": "subordinating conjunction",
    "PRT": "particle",
    "PNP": "prepositional noun phrase",
    # Dependency Labels (English)
    # ClearNLP / Universal Dependencies
    # https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
    # https://universaldependencies.org/docs/en/dep/
    "acl": "clausal modifier of noun (adjectival clause)",
    "acomp": "adjectival complement",
    "advcl": "adverbial clause modifier",
    "advmod": "adverbial modifier",
    "agent": "agent",
    "amod": "adjectival modifier",
    "appos": "appositional modifier",
    "attr": "attribute",
    "aux": "auxiliary",
    "auxpass": "auxiliary (passive)",
    "case": "case marking",
    "cc": "coordinating conjunction",
    "ccomp": "clausal complement",
    "clf": "classifier",
    "complm": "complementizer",
    "compound": "compound",
    "conj": "conjunct",
    "cop": "copula",
    "csubj": "clausal subject",
    "csubjpass": "clausal subject (passive)",
    "dative": "dative",
    "dep": "unclassified dependent",
    "det": "determiner",
    "discourse": "discourse element",
    "dislocated": "dislocated elements",
    "dobj": "direct object",
    "expl": "expletive",
    "fixed": "fixed multiword expression",
    "flat": "flat multiword expression",
    "goeswith": "goes with",
    "hmod": "modifier in hyphenation",
    "hyph": "hyphen",
    "infmod": "infinitival modifier",
    "intj": "interjection",
    "iobj": "indirect object",
    "list": "list",
    "mark": "marker",
    "meta": "meta modifier",
    "neg": "negation modifier",
    "nmod": "modifier of nominal",
    "nn": "noun compound modifier",
    "npadvmod": "noun phrase as adverbial modifier",
    "nsubj": "nominal subject",
    "nsubjpass": "nominal subject (passive)",
    "nounmod": "modifier of nominal",
    "npmod": "noun phrase as adverbial modifier",
    "num": "number modifier",
    "number": "number compound modifier",
    "nummod": "numeric modifier",
    "oprd": "object predicate",
    "obj": "object",
    "obl": "oblique nominal",
    "orphan": "orphan",
    "parataxis": "parataxis",
    "partmod": "participal modifier",
    "pcomp": "complement of preposition",
    "pobj": "object of preposition",
    "poss": "possession modifier",
    "possessive": "possessive modifier",
    "preconj": "pre-correlative conjunction",
    "prep": "prepositional modifier",
    "prt": "particle",
    "punct": "punctuation",
    "quantmod": "modifier of quantifier",
    "rcmod": "relative clause modifier",
    "relcl": "relative clause modifier",
    "reparandum": "overridden disfluency",
    "root": "root",
    "vocative": "vocative",
    "xcomp": "open clausal complement",
    # Named Entity Recognition
    # OntoNotes 5
    # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "FACILITY": "Buildings, airports, highways, bridges, etc.",
    "FAC": "Buildings, airports, highways, bridges, etc.",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "PRODUCT": "Objects, vehicles, foods, etc. (not services)",
    "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
    "WORK_OF_ART": "Titles of books, songs, etc.",
    "LAW": "Named documents made into laws.",
    "LANGUAGE": "Any named language",
    "DATE": "Absolute or relative dates or periods",
    "TIME": "Times smaller than a day",
    "PERCENT": 'Percentage, including "%"',
    "MONEY": "Monetary values, including unit",
    "QUANTITY": "Measurements, as of weight or distance",
    "ORDINAL": '"first", "second", etc.',
    "CARDINAL": "Numerals that do not fall under another type"}

In [45]:
GLOSSARY["nsubj"] # Examples of sentences: https://universaldependencies.org/docs/en/dep/

'nominal subject'
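
Indexing the dictionary directly raises a `KeyError` for a label that is not listed. A minimal sketch of a more forgiving lookup is shown below; it uses a small excerpt of the dictionary above, and the names `GLOSSARY_EXCERPT` and `describe` are made up for this illustration. The labels looked up are those of the token "is" from the earlier table, plus one deliberately unknown label.

```python
# A small excerpt of the GLOSSARY dictionary shown above.
GLOSSARY_EXCERPT = {
    "AUX": "auxiliary",
    "VBZ": "verb, 3rd person singular present",
    "auxpass": "auxiliary (passive)",
    "nsubjpass": "nominal subject (passive)",
}


def describe(label, glossary=GLOSSARY_EXCERPT):
    """Return the description of a label, or a placeholder if it is not listed."""
    return glossary.get(label, "(no description)")


# pos_, tag_ and dep_ of the token "is", followed by an unknown label.
for label in ("AUX", "VBZ", "auxpass", "UNKNOWN_TAG"):
    print(f"{label:<12}{describe(label)}")
```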

### Token.morph



In [46]:
print(f"{'Token' : <12}{'.morph' : <12}")
for token in list(doc.sents)[0]:
    print(f"{str(token) : <12}{str(token.morph) : <12}")

Token       .morph      
The         Definite=Def|PronType=Art
Leonardo    Number=Sing 
da          Number=Sing 
Vinci       Number=Sing 
exhibition  Number=Sing 
is          Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
held        Aspect=Perf|Tense=Past|VerbForm=Part
under                   
the         Definite=Def|PronType=Art
high        Degree=Pos  
patronage   Number=Sing 
of                      
French      Degree=Pos  
President   Number=Sing 
Emmanuel    Number=Sing 
Macron      Number=Sing 
.           PunctType=Peri
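
The morphology strings above follow a `Feature=Value|Feature=Value` layout, which is easy to turn into a dictionary when you need to test a single feature. The helper below is a plain-Python sketch working on the printed strings; `parse_morph` is a hypothetical name. In spaCy v3 a similar conversion should be available directly as `token.morph.to_dict()`.

```python
def parse_morph(morph_string):
    """Turn 'Feature=Value|Feature=Value' into a dict; an empty string gives {}."""
    if not morph_string:
        return {}
    return dict(pair.split("=", 1) for pair in morph_string.split("|"))


# Morphology of the token "is", copied from the output above.
is_morph = parse_morph("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(is_morph["Tense"])   # Pres
print(parse_morph(""))     # {} (tokens such as "under" have no features)
```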


### Text pre-processing for French data

For stemming you can use SnowballStemmer: https://www.nltk.org/api/nltk.stem.SnowballStemmer.html?highlight=stopwords

In [47]:
from nltk.stem import SnowballStemmer

# Load the French SpaCy model
nlp_french = spacy.load('fr_core_news_sm')

# Sample French text
text = "SpaCy est une bibliothèque NLP incroyable. Elle aide au traitement de texte !"

# Process the text
doc = nlp_french(text)

# 1 Sentence segmentation
sentences = list(doc.sents)
print("Sentences:")
for sentence in sentences:
    print(sentence.text)

# 2 Tokenization
tokens = [token.text for token in doc]
print("\nTokens:", tokens)

# 3 Lowercasing
lowercase_tokens = [token.text.lower() for token in doc]
print("Lowercase Tokens:", lowercase_tokens)

# 4 Check for stop words
stop_words = [token.text for token in doc if token.is_stop]
print("Stop Words:", stop_words)

# 5 Lemmatization
lemmatized_tokens = [token.lemma_ for token in doc]
print("Lemmatized Tokens:", lemmatized_tokens)

# 6 Stemming
stemmer = SnowballStemmer("french")
stemmed_tokens = [stemmer.stem(token.text) for token in doc]
print("Stemmed Tokens:", stemmed_tokens)

Sentences:
SpaCy est une bibliothèque NLP incroyable.
Elle aide au traitement de texte !

Tokens: ['SpaCy', 'est', 'une', 'bibliothèque', 'NLP', 'incroyable', '.', 'Elle', 'aide', 'au', 'traitement', 'de', 'texte', '!']
Lowercase Tokens: ['spacy', 'est', 'une', 'bibliothèque', 'nlp', 'incroyable', '.', 'elle', 'aide', 'au', 'traitement', 'de', 'texte', '!']
Stop Words: ['est', 'une', 'Elle', 'au', 'de']
Lemmatized Tokens: ['spacy', 'être', 'un', 'bibliothèque', 'NLP', 'incroyable', '.', 'lui', 'aide', 'au', 'traitement', 'de', 'texte', '!']
Stemmed Tokens: ['spacy', 'est', 'une', 'bibliothequ', 'nlp', 'incroi', '.', 'elle', 'aid', 'au', 'trait', 'de', 'text', '!']
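
To see how lemmatization and stemming differ on this sample, the lists printed above can be compared side by side. The sketch below only re-uses the printed output (no model or stemmer is loaded) and keeps the tokens whose lowercased lemma differs from the stem; the variable names are illustrative.

```python
# Lists copied from the output above.
tokens = ['SpaCy', 'est', 'une', 'bibliothèque', 'NLP', 'incroyable', '.',
          'Elle', 'aide', 'au', 'traitement', 'de', 'texte', '!']
lemmas = ['spacy', 'être', 'un', 'bibliothèque', 'NLP', 'incroyable', '.',
          'lui', 'aide', 'au', 'traitement', 'de', 'texte', '!']
stems = ['spacy', 'est', 'une', 'bibliothequ', 'nlp', 'incroi', '.',
         'elle', 'aid', 'au', 'trait', 'de', 'text', '!']

# Keep only the tokens for which the two techniques disagree.
differing = [
    (token, lemma, stem)
    for token, lemma, stem in zip(tokens, lemmas, stems)
    if lemma.lower() != stem
]

print(f"{'Token':<14}{'Lemma':<14}{'Stem':<14}")
for token, lemma, stem in differing:
    print(f"{token:<14}{lemma:<14}{stem:<14}")
```

The lemmas are dictionary forms ("est" becomes "être"), while the stems are truncated strings that need not be real words ("incroyable" becomes "incroi").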


### Matcher patterns



To check if a sentence matches a specified pattern, you can use one of three SpaCy classes:

1) **Matcher** - Patterns are built from token attributes, including the part-of-speech attributes `POS` and `TAG`.

2) **DependencyMatcher** - Patterns are built from syntactic dependencies (e.g., "is a sibling of") and are matched against the syntactic tree.

3) **PhraseMatcher** - Matches specific phrases and word combinations.

### Matcher class
First, you need to create a `pattern`, then instantiate `matcher = Matcher(vocab=nlp.vocab)` and pass the `pattern` to the matcher using `matcher.add("pattern name", patterns=[pattern])`.

You can view the results of text matches for each `doc` with `matcher(doc, as_spans=True)`.


Creating a pattern is the primary task.  
In the case of **Matcher**, the `pattern` must be a list of dictionaries. Each dictionary corresponds to one token, with the keys representing the token attributes. You can also use the key `OP`, which controls how many times the token pattern may match. Possible values for `OP` are:

1. `!`  Negate the pattern by requiring it to match exactly 0 times.
2. `?`  Make the pattern optional by allowing it to match 0 or 1 times.
3. `+`  Require the pattern to match 1 or more times.
4. `*`  Allow the pattern to match 0 or more times.

Other available keys and their descriptions can be found at [Pattern format](https://spacy.io/api/matcher).
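
The `?`, `+` and `*` values behave much like the regex quantifiers from the first part of this notebook. As a rough analogy only (this is not how SpaCy is implemented), the sketch below translates a token pattern into a regular expression over a string of POS tags. The helper `pattern_to_regex`, the sample pattern and the tag sequence are all hypothetical, and `!` is deliberately not covered.

```python
import re

# Regex quantifier playing the same role as each OP value ("!" is not covered).
OP_TO_QUANTIFIER = {None: "", "?": "?", "+": "+", "*": "*"}


def pattern_to_regex(pattern):
    """Translate [{'POS': ..., 'OP': ...}, ...] into a regex over 'TAG TAG ' strings."""
    parts = []
    for spec in pattern:
        tag = re.escape(spec["POS"])
        quantifier = OP_TO_QUANTIFIER[spec.get("OP")]
        # Each token is written as its tag followed by a single space.
        parts.append(rf"(?:\b{tag} ){quantifier}")
    return re.compile("".join(parts))


# Zero or more adjectives followed by one or more nouns.
adjectives_then_nouns = [{"POS": "ADJ", "OP": "*"}, {"POS": "NOUN", "OP": "+"}]
tag_regex = pattern_to_regex(adjectives_then_nouns)

# Hypothetical POS sequence: one tag per token, each followed by a space.
pos_sequence = "DET ADJ ADJ NOUN VERB "
found = tag_regex.search(pos_sequence)
print(found.group().split())  # ['ADJ', 'ADJ', 'NOUN']
```

One difference to keep in mind: `re.search` returns only the leftmost, longest candidate, whereas SpaCy's `Matcher` by default reports every span that satisfies the pattern, including shorter overlapping ones.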


In [48]:
from spacy.matcher import Matcher

matcher = Matcher(vocab=nlp.vocab)
matcher

<spacy.matcher.matcher.Matcher at 0x7a32e50072e0>

In [49]:
noun_phrase_verb = [{'POS': 'NOUN', 'OP': '+'}, {'POS': 'VERB', 'OP': '+'}]
matcher.add("noun_phrase+verb", patterns=[noun_phrase_verb])

In [50]:
# load text corpus
!wget -O input.txt 'https://drive.google.com/uc?export=download&id=1cK8FATtEGxC2kWv3FCthdrZZWM01c2sw'

--2024-10-09 05:37:02--  https://drive.google.com/uc?export=download&id=1cK8FATtEGxC2kWv3FCthdrZZWM01c2sw
Resolving drive.google.com (drive.google.com)... 64.233.181.102, 64.233.181.100, 64.233.181.101, ...
Connecting to drive.google.com (drive.google.com)|64.233.181.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1cK8FATtEGxC2kWv3FCthdrZZWM01c2sw&export=download [following]
--2024-10-09 05:37:02--  https://drive.usercontent.google.com/download?id=1cK8FATtEGxC2kWv3FCthdrZZWM01c2sw&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 172.217.214.132, 2607:f8b0:4001:c05::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|172.217.214.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2146822 (2.0M) [application/octet-stream]
Saving to: ‘input.txt’


2024-10-09 05:37:05 (154 MB/s) - ‘input.txt’ saved [2146822/214682

In [51]:
with open('input.txt', encoding='utf-8') as file:
    for text in file.readlines()[:2]:
        doc = nlp(text)
        # Run the matcher once per document.
        # as_spans=True returns Span objects labeled with the pattern name
        for result in matcher(doc, as_spans=True):
            print(result.label_, '\t', result)

noun_phrase+verb 	 complainant wants
noun_phrase+verb 	 woman named
noun_phrase+verb 	 judge acquit
noun_phrase+verb 	 prosecutor revealed


The found matches correspond to the pattern.  
However, some searches are awkward to express this way. For instance, a pattern for a noun and a verb separated by words of other parts of speech quickly becomes overly complicated. For such tasks, or for finding the subject and predicate of a sentence, it is better to use the syntactic tree, i.e., to look for a noun whose head is the root (in the case of a simple sentence). This is what **DependencyMatcher** is for.
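
The idea of "a noun whose head is the root" can be sketched in plain Python on the first eight rows of the token table printed earlier for the Leonardo da Vinci sentence. The head indices were read off the `.head` column of that table, and the names `rows`, `root_index` and `subject_nouns` are illustrative; with a parsed `doc`, the same information would come from `token.head.i`, `token.pos_` and `token.dep_`.

```python
# (text, index of head, pos_, dep_) for tokens 0-7 of the earlier table.
rows = [
    ("The", 4, "DET", "det"),
    ("Leonardo", 3, "PROPN", "compound"),
    ("da", 3, "PROPN", "compound"),
    ("Vinci", 4, "PROPN", "compound"),
    ("exhibition", 6, "NOUN", "nsubjpass"),
    ("is", 6, "AUX", "auxpass"),
    ("held", 6, "VERB", "ROOT"),
    ("under", 6, "ADP", "prep"),
]

# The root is the token whose dependency label is ROOT.
root_index = next(i for i, row in enumerate(rows) if row[3] == "ROOT")

# Nouns attached directly to the root with a subject relation.
subject_nouns = [
    text
    for text, head, pos, dep in rows
    if head == root_index and pos == "NOUN" and dep.startswith("nsubj")
]

print(rows[root_index][0], subject_nouns)  # held ['exhibition']
```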


### Exercise 2

Write code to use the matcher to extract the following phrases from the text:

- **Pattern 1:** Match any noun followed by a verb (e.g., "cat sleeps").
- **Pattern 2:** Match the phrase "not [adjective]" (e.g., "not happy").
- **Pattern 3:** Match a proper noun followed by a verb (e.g., "Alice runs").


In [52]:
# Your code
# matcher = Matcher(vocab=nlp.vocab)
# pattern = []
# matcher.add("pattern", patterns=[pattern])
# with open('input.txt', encoding='utf-8') as file:
#     for text in file.readlines()[:2]:
#         doc = nlp(text)
#         # as_spans=True - for text representation of results
#         for result in matcher(doc, as_spans=True):
#             print(nlp.vocab[result.label].text, '\t', result)

### DependencyMatcher class

The principle of creating a **DependencyMatcher** is similar to that of the **Matcher**, but it differs significantly in pattern construction.

A description of the pattern keys can be found in the documentation: [Pattern format](https://spacy.io/api/dependencymatcher).

### Pattern for **DependencyMatcher**

The pattern must be a list, where each element specifies the match for a token and is a dictionary.

___
**Dictionary for the First Element**
___
The dictionary for the first element must include the following keys:
- `RIGHT_ID`: the name of the token (can be any name);
- `RIGHT_ATTRS`: a dictionary of the token's attributes.

The attributes can include any properties of the token.

___
**Dictionaries for the Other Elements**
___
Dictionaries for matching other tokens must include the following four keys: `LEFT_ID`, `REL_OP`, `RIGHT_ID`, and `RIGHT_ATTRS`.

- `LEFT_ID`: the name of the left-hand node in the relation; it must already have been defined as a `RIGHT_ID` in an earlier dictionary;
- `REL_OP`: the operator that describes how the node `LEFT_ID` is related to the token described by this dictionary;
- `RIGHT_ID`: the name of the token (also specified at the creator's discretion);
- `RIGHT_ATTRS`: a dictionary where the keys are the attributes of the token.

Possible `REL_OP` values can be found in the [Pattern format](https://spacy.io/api/dependencymatcher):

- **A < B:** A is the immediate dependent of B.
- **A > B:** A is the immediate head of B.
- **A << B:** A is the dependent in a chain to B following dep → head paths.
- **A >> B:** A is the head in a chain to B following head → dep paths.
- **A . B:** A immediately precedes B, i.e., A.i == B.i - 1, and both are within the same dependency tree.
- **A .* B:** A precedes B, i.e., A.i < B.i, and both are within the same dependency tree (not in Semgrex).
- **A ; B:** A immediately follows B, i.e., A.i == B.i + 1, and both are within the same dependency tree (not in Semgrex).
- **A ;* B:** A follows B, i.e., A.i > B.i, and both are within the same dependency tree (not in Semgrex).
- **A \$+ B:** B is a right immediate sibling of A, i.e., A and B have the same parent and A.i == B.i - 1.
- **A \$- B:** B is a left immediate sibling of A, i.e., A and B have the same parent and A.i == B.i + 1.
- **A \$++ B:** B is a right sibling of A, i.e., A and B have the same parent and A.i < B.i.
- **A \$-- B:** B is a left sibling of A, i.e., A and B have the same parent and A.i > B.i.
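
To make the first four operators concrete, here is a plain-Python sketch of what they check. It uses tokens 0-8 of the sample sentence whose table is printed in cell `In [56]` further down, with head indices read off that table's `.head` column. The function names are made up for this illustration and are not SpaCy API.

```python
# Tokens 0-8 and the index of each token's head (the ROOT is its own head).
words = ["When", "a", "few", "completely", "different", "people", "have", "to", "share"]
heads = [6, 2, 5, 4, 5, 6, 6, 8, 6]


def is_immediate_head(a, b):
    """A > B: token a is the immediate head of token b."""
    return a != b and heads[b] == a


def is_immediate_dependent(a, b):
    """A < B: token a is an immediate dependent of token b."""
    return is_immediate_head(b, a)


def is_ancestor(a, b):
    """A >> B: a is reached from b by following dep -> head links."""
    current = b
    while heads[current] != current:  # stop once the root is reached
        current = heads[current]
        if current == a:
            return True
    return False


def is_descendant(a, b):
    """A << B: a reaches b by following dep -> head links."""
    return is_ancestor(b, a)


print(is_immediate_head(6, 5))       # True:  "have" > "people"
print(is_immediate_dependent(2, 5))  # True:  "few" < "people"
print(is_ancestor(6, 1))             # True:  "have" >> "a" (via "few" and "people")
print(is_immediate_head(6, 1))       # False: "a" is not a direct child of "have"
```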


To choose a `REL_OP`, it is advisable to first look at the values of the attributes `token.head` and `token.dep_` in a specific sentence.


In [53]:
text = ''

In [54]:
with open('input.txt', encoding='utf-8') as file:
    # Strip a byte order mark (BOM) if the line starts with one
    text = file.readlines()[3].lstrip('\ufeff')

In [55]:
doc = nlp(text)

In [56]:
print(f"{'Token' : <13}{'.i' : <5}{'.head' : <10}{'.norm_' : <13}\
{'.pos_' : <10}{'.tag_ ': <10}{'.dep_': <10}")
for token in list(doc.sents)[0]:
    #print(*[token, token.text, token.lemma_, token.tag_, token.pos_, token.dep_], sep = '\t')
    print(f"{str(token) : <13}{str(token.i) : <5}{str(token.head) : <10}{str(token.norm_) : <13}{str(token.pos_) : <10}{str(token.tag_ ): <10}{str(token.dep_): <10}")

Token        .i   .head     .norm_       .pos_     .tag_     .dep_     
When         0    have      when         ADV       RB        advmod    
a            1    few       a            DET       DT        quantmod  
few          2    people    few          ADJ       JJ        nummod    
completely   3    different completely   ADV       RB        advmod    
different    4    people    different    ADJ       JJ        amod      
people       5    have      people       NOUN      NNS       nsubj     
have         6    have      have         VERB      VBP       ROOT      
to           7    share     to           PART      TO        aux       
share        8    have      share        VERB      VB        xcomp     
a            9    number    a            DET       DT        det       
flat         10   number    flat         ADJ       JJ        amod      
,            11   number    ,            PUNCT     ,         punct     
very         12   often     very         ADV       RB        adv

Let's try to create a pattern to search for: **number(s) of + noun**.

In the above sentence, **number** is the head for **of**, and **of** is the head for **difficulties**.

The operator corresponding to "A is the immediate head of B" is **A > B**.


In [57]:
from spacy.matcher import DependencyMatcher

dep_matcher = DependencyMatcher(vocab=nlp.vocab)

In [58]:
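dep_pattern = [# Anchor node: the token whose normalized form is "number"
               {'RIGHT_ID': 'number', 'RIGHT_ATTRS': {'NORM': 'number'}},
               # "number" is the immediate head of the preposition "of"
               {'LEFT_ID': 'number', 'REL_OP': '>', 'RIGHT_ID': 'prep', 'RIGHT_ATTRS': {'NORM': 'of'}},
               # "of" is the immediate head of a noun that is its prepositional object
               {'LEFT_ID': 'prep', 'REL_OP': '>', 'RIGHT_ID': 'noun', 'RIGHT_ATTRS': {'TAG': {'IN': ['NNS', 'NN']}, 'DEP': 'pobj'}}]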

In [59]:
dep_matcher.add('number_of_pattern', patterns=[dep_pattern])
dep_matches = dep_matcher(doc)
dep_matches

[(8480868639249277778, [15, 16, 17])]

In [60]:
for pattern_id, token_ids in dep_matches:
    # token_ids follow the order of the dictionaries in the pattern
    number, prep, noun = (doc[i] for i in token_ids)
    print(nlp.vocab[pattern_id].text, '\t', number, '...', prep, noun)

number_of_pattern 	 number ... of difficulties
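
Each match is a pair of a pattern ID and a list of token indices, and the indices should follow the order of the dictionaries in the pattern. A small sketch of how to label them with the `RIGHT_ID` names is shown below; it re-uses the pattern and the match printed above, so it runs without a model, and `name_tokens` is a hypothetical helper name.

```python
# Pattern and match copied from the cells above.
dep_pattern = [
    {'RIGHT_ID': 'number', 'RIGHT_ATTRS': {'NORM': 'number'}},
    {'LEFT_ID': 'number', 'REL_OP': '>', 'RIGHT_ID': 'prep',
     'RIGHT_ATTRS': {'NORM': 'of'}},
    {'LEFT_ID': 'prep', 'REL_OP': '>', 'RIGHT_ID': 'noun',
     'RIGHT_ATTRS': {'TAG': {'IN': ['NNS', 'NN']}, 'DEP': 'pobj'}},
]
dep_matches = [(8480868639249277778, [15, 16, 17])]


def name_tokens(pattern, token_ids):
    """Pair each RIGHT_ID in the pattern with the index of the token it matched."""
    return {node['RIGHT_ID']: token_id for node, token_id in zip(pattern, token_ids)}


for match_id, token_ids in dep_matches:
    print(name_tokens(dep_pattern, token_ids))
    # {'number': 15, 'prep': 16, 'noun': 17}
```

With a parsed `doc` at hand, `doc[named["noun"]]` would then give the matched noun without relying on its position in the list.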
