# Natural Language Processing

## Phrase Matching and Vocabulary

In [22]:
# Import

import spacy # import the spacy Library

In [23]:
# Load the spaCy pipeline

nlp = spacy.load('en_core_web_sm') # small English pipeline; the larger _md and _lg packages add static word vectors

## A. Rule-Based Matching (Token)

spaCy offers a rule-based matching tool called `Matcher` that lets you build a library of token patterns, then match those patterns against a `Doc` object to return a list of found matches. A pattern can match on the token text as well as on token attributes and annotations (lowercase form, punctuation flag, part-of-speech tag, and so on), and you can register multiple patterns under the same matcher rule.

In [24]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

In [25]:
# SolarPower
pattern1 = [{'LOWER': 'solarpower'}]

# Solar-power
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

# Solar power
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

In [26]:
# Add to matcher

matcher.add('SolarPower', [pattern1, pattern2, pattern3])

In [27]:
# Test for a sample document

doc = nlp(u"The Solar Power industry continues to grow as solarpower increases. Solar-Power is amazing.")

In [28]:
# Find the matches

found_matches = matcher(doc)
print(found_matches)

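# Each match is (match_id, start, end): match_id is the hash of the rule name,
# start and end are token indices (end is exclusive)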

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 11, 14)]


In [29]:
# Helper Loop

for match_id, start, end in found_matches:

  string_id = nlp.vocab.strings[match_id]
  span = doc[start : end]
  print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-Power
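
The helper loop can also be shortened by asking the matcher for `Span` objects directly. The following is a minimal sketch, assuming spaCy v3 (where the `as_spans` argument is available). The names `demo_nlp`, `demo_matcher`, and `demo_doc` are illustrative, and a blank English pipeline is used so the snippet runs without a downloaded model; `LOWER` and `IS_PUNCT` only need the tokenizer.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline (assumption for this sketch): tokenizer only, no trained components.
demo_nlp = spacy.blank("en")
demo_matcher = Matcher(demo_nlp.vocab)
demo_matcher.add("SolarPower", [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
])

demo_doc = demo_nlp(
    "The Solar Power industry continues to grow as solarpower increases. "
    "Solar-Power is amazing."
)

# as_spans=True returns Span objects; span.label_ holds the rule name.
spans = demo_matcher(demo_doc, as_spans=True)
for span in spans:
    print(span.label_, span.start, span.end, span.text)
```

With `as_spans=True` there is no need to look up the rule name in `nlp.vocab.strings` or to slice the `Doc` by hand.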


In [30]:
# Remove the 'SolarPower' rule (this drops every pattern registered under that key)

matcher.remove('SolarPower')
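
To confirm what `remove` does, the matcher can be queried before and after the call. This is a small sketch under the assumption of spaCy v3; `demo_matcher` is an illustrative name, and the matcher is built on a blank vocabulary so the notebook's own `matcher` is left untouched.

```python
import spacy
from spacy.matcher import Matcher

demo_matcher = Matcher(spacy.blank("en").vocab)
demo_matcher.add("SolarPower", [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
])

# len() counts rules (keys), not individual patterns.
before = ("SolarPower" in demo_matcher, len(demo_matcher))
demo_matcher.remove("SolarPower")  # removes all patterns stored under the key
after = ("SolarPower" in demo_matcher, len(demo_matcher))

print(before, after)
```

Because the whole rule is gone, the same key can be reused for a new set of patterns.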

#### New Set of Patterns

In [31]:
# New pattern

# solarpower, SolarPower
pattern1 = [{'LOWER': 'solarpower'}]

# solar power, solar-power, solar--power, etc. ('OP': '*' allows zero or more punctuation tokens)
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]

In [32]:
# Add to matcher

matcher.add('SolarPower', [pattern1, pattern2])

In [33]:
# Sample Doc

doc2 = nlp(u"Solar--power is solarpower yehey!")

In [34]:
# Find the matches

found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]
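
The `'OP': '*'` quantifier means "zero or more", so the pattern also covers the plain two-token form with no punctuation in between. The sketch below illustrates this on a few short strings; it assumes spaCy v3 and a blank English pipeline, and `demo_nlp`, `demo_matcher`, and `matched_texts` are hypothetical helper names introduced only for this example.

```python
import spacy
from spacy.matcher import Matcher

demo_nlp = spacy.blank("en")
demo_matcher = Matcher(demo_nlp.vocab)
demo_matcher.add("SolarPower", [
    [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}],
])

def matched_texts(text):
    """Return the text of every span the demo matcher finds in `text`."""
    doc = demo_nlp(text)
    return [doc[start:end].text for _, start, end in demo_matcher(doc)]

for sample in ["solar power", "Solar-power", "Solar--power", "solar panel"]:
    print(repr(sample), "->", matched_texts(sample))
```

The other quantifiers follow the same idea: `'?'` for zero or one, `'+'` for one or more, and `'!'` to negate a token pattern.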


## B. Phrase Matching (Predefined phrases or substrings)

An alternative, and often more efficient, method is to match against a terminology list. In this case each phrase in the list is converted into a `Doc` object with `nlp`, and those `Doc` objects are added to a `PhraseMatcher` as patterns, in place of the token-level dictionaries used with `Matcher`.

In [35]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

In [36]:
# Request
# Open File

import requests

# URL of the raw .txt file on GitHub
# https://github.com/renatomaaliw3/public_files/blob/master/Data%20Sets/NLP/reaganomics.txt

url = "https://raw.githubusercontent.com/renatomaaliw3/public_files/master/Data%20Sets/NLP/reaganomics.txt"

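# Fetch the file content; fail early on an HTTP error instead of parsing an error page
response = requests.get(url, timeout=30)
response.raise_for_status()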

doc3 = nlp(response.text)
doc3

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.

The four pillars of Reagan's economic policy were to reduce the growth of government spending, reduce the federal income tax and capital gains tax, reduce government regulation, and tighten the money supply in order to reduce inflation.[2]

The results of Reaganomics are still debated. Supporters point to the end of stagflation, stronger GDP growth, and an entrepreneur revolution in the decades that followed.[3][4] Critics point to the widening income gap, an atmosphere of greed, and the national debt tripling in eight years which ultimately reversed the pos

In [37]:
# Phrase List

phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [38]:
# Convert each phrase into a document object

phrase_patterns = [nlp(text) for text in phrase_list]

In [39]:
# Pass into matcher

matcher.add('EconMatcher', phrase_patterns)
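
Running the full pipeline on every phrase does more work than the matcher needs: by default `PhraseMatcher` compares the verbatim token text, so tokenizing the phrases is enough. The following is a sketch of that shortcut, assuming spaCy v3. It uses a blank English pipeline and the illustrative names `demo_nlp`, `demo_matcher`, and `demo_doc`; the sample sentence is made up for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

demo_nlp = spacy.blank("en")
phrase_list = ['voodoo economics', 'supply-side economics',
               'trickle-down economics', 'free-market economics']

# tokenizer.pipe only tokenizes; no tagger, parser, or NER runs on the phrases.
phrase_patterns = list(demo_nlp.tokenizer.pipe(phrase_list))

demo_matcher = PhraseMatcher(demo_nlp.vocab)
demo_matcher.add('EconMatcher', phrase_patterns)

demo_doc = demo_nlp("Critics called supply-side economics voodoo economics.")
found = [demo_doc[start:end].text for _, start, end in demo_matcher(demo_doc)]
print(found)
```

For a handful of phrases the difference is negligible, but for terminology lists with thousands of entries skipping the trained components saves noticeable time.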

In [40]:
# Build list of matches

found_matches = matcher(doc3)
found_matches

[(3680293220734633682, 41, 45),
 (3680293220734633682, 49, 53),
 (3680293220734633682, 54, 56),
 (3680293220734633682, 61, 65),
 (3680293220734633682, 673, 677),
 (3680293220734633682, 2987, 2991)]

In [41]:
# Helper Loop

for match_id, start, end in found_matches:

  string_id = nlp.vocab.strings[match_id]
  span = doc3[start : end]
  print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2987 2991 trickle-down economics
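
The matches above are case-sensitive, so a phrase that starts a sentence ("Supply-side economics") would be missed by the lowercase phrase list. One way to handle this, sketched here under the assumption of spaCy v3, is to build the `PhraseMatcher` with `attr="LOWER"`. The names `demo_nlp`, `ci_matcher`, and `demo_doc` are illustrative, and the sample sentence is invented for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

demo_nlp = spacy.blank("en")

# attr="LOWER" compares the lowercase form of each token instead of the exact text.
ci_matcher = PhraseMatcher(demo_nlp.vocab, attr="LOWER")
phrase_list = ['voodoo economics', 'supply-side economics',
               'trickle-down economics', 'free-market economics']
ci_matcher.add('EconMatcher', [demo_nlp.make_doc(text) for text in phrase_list])

demo_doc = demo_nlp("Supply-side economics is sometimes called Voodoo Economics.")
ci_found = [demo_doc[start:end].text for _, start, end in ci_matcher(demo_doc)]
print(ci_found)
```

The returned spans keep the original casing of the document; only the comparison is done on lowercase forms.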


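### Part-of-Speech Patterns

Syntactic pattern matching with the token-based `Matcher`: the patterns below match on part-of-speech tags (`POS`). spaCy's separate `DependencyMatcher` class, which matches on the dependency parse, is not used in this cell.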

In [42]:
# Create a Matcher
matcher = Matcher(nlp.vocab)

# Define patterns
pattern1 = [
    {"POS": "ADJ"},  # Match an adjective
    {"POS": "NOUN"}  # Followed by a noun
]

pattern2 = [
    {"POS": "NOUN"},  # Match a noun
    {"POS": "AUX"}    # Followed by an auxiliary verb
]

# Add patterns to the matcher
matcher.add("AdjNoun", [pattern1])  # Pass patterns as a list of lists
matcher.add("NounAux", [pattern2])

# Process text
doc4 = nlp(
    "She eats delicious food and enjoys reading books. It's a beautiful sunset. "
    "The dog is barking loudly. Cats were sleeping peacefully."
)

# Apply matcher to the Doc
matches = matcher(doc4)

# Print matched spans
for match_id, start, end in matches:

    span = doc4[start:end]
    print(f"Matched Text: '{span.text}' (Pattern: {nlp.vocab.strings[match_id]})")


Matched Text: 'delicious food' (Pattern: AdjNoun)
Matched Text: 'beautiful sunset' (Pattern: AdjNoun)
Matched Text: 'dog is' (Pattern: NounAux)
Matched Text: 'Cats were' (Pattern: NounAux)
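
The cell above matches adjacent tokens by part-of-speech tag. For comparison, spaCy's `DependencyMatcher` matches tokens by their relation in the dependency tree, so the matched tokens do not have to be adjacent. The sketch below assumes spaCy v3. To stay runnable without a trained model, it builds the `Doc` by hand with hypothetical annotations (`words`, `pos`, `heads`, `deps`) for one sentence; in the notebook these would come from `nlp(...)` instead. All names (`demo_nlp`, `dep_matcher`, `demo_doc`) are illustrative.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

demo_nlp = spacy.blank("en")

# Hand-written annotations (assumption): a trained parser would normally supply these.
words = ["The", "dog", "is", "barking", "loudly", "."]
pos = ["DET", "NOUN", "AUX", "VERB", "ADV", "PUNCT"]
heads = [1, 3, 3, 3, 3, 3]  # absolute index of each token's head
deps = ["det", "nsubj", "aux", "ROOT", "advmod", "punct"]
demo_doc = Doc(demo_nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# Anchor on a verb, then require a direct child with the nsubj relation.
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]

dep_matcher = DependencyMatcher(demo_nlp.vocab)
dep_matcher.add("VerbSubject", [pattern])

# Each match is (match_id, token_ids), with token_ids in pattern order.
dep_matches = dep_matcher(demo_doc)
for match_id, token_ids in dep_matches:
    verb, subject = (demo_doc[i].text for i in token_ids)
    print(demo_nlp.vocab.strings[match_id], "verb:", verb, "subject:", subject)
```

Here the subject and the verb are separated by the auxiliary "is", which a purely sequential `NOUN` followed by `VERB` token pattern would not match.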
