# Rule-based Matching

- Compared with regular expressions, the matcher works with `Doc` and `Token` objects instead of only strings.
- It is also more flexible: you can search for text but also for other lexical attributes.



## Match patterns

Match patterns are lists of dictionaries, where each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

### Match exact token texts:

```python
[{'ORTH': 'phone'}, {'ORTH': 'samsung'}]
```
In this example we are looking for two tokens with the texts `phone` and `samsung`.

### Match lexical attributes:
We can also match on other token attributes. Here we are looking for two tokens whose lowercase forms equal `iphone` and `x`:

```python
[{'LOWER': 'iphone'}, {'LOWER': 'x'}]
```
### Match any token attributes
We can even write patterns using attributes predicted by the model. Here we are matching a token with the lemma `buy` followed by a noun:
```python
[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
```
The lemma is the base form, so this pattern would match phrases like "buying milk" or "buy flowers".
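
To make the pattern format concrete, here is a minimal pure-Python sketch of the idea: slide a window over the tokens and compare each dictionary with one token. It is only an illustration and not how spaCy implements matching. The `find_matches` helper and the hand-written token dictionaries are hypothetical stand-ins for a processed `Doc`.

```python
# Toy illustration of token patterns (not spaCy's implementation).
# Each token is a hand-written dict standing in for a spaCy Token.
tokens = [
    {"ORTH": "I", "LOWER": "i", "LEMMA": "I", "POS": "PRON"},
    {"ORTH": "am", "LOWER": "am", "LEMMA": "be", "POS": "AUX"},
    {"ORTH": "buying", "LOWER": "buying", "LEMMA": "buy", "POS": "VERB"},
    {"ORTH": "milk", "LOWER": "milk", "LEMMA": "milk", "POS": "NOUN"},
]


def find_matches(tokens, pattern):
    """Return (start, end) pairs where each pattern dict fits one consecutive token."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(
            all(tok.get(attr) == value for attr, value in spec.items())
            for tok, spec in zip(window, pattern)
        ):
            matches.append((start, start + width))
    return matches


pattern = [{"LEMMA": "buy"}, {"POS": "NOUN"}]
matches = find_matches(tokens, pattern)
for start, end in matches:
    print(" ".join(tok["ORTH"] for tok in tokens[start:end]))
```

Every key in a pattern dictionary must agree with the token, which is why a dictionary with several keys is more restrictive than one with a single key.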

## Using the Matcher 


```python
import spacy

# Import the Matcher
from spacy.matcher import Matcher
```

We also load the model and create the nlp object:

```python
# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')
```

The matcher is initialized with the shared vocabulary `nlp.vocab`:

```python
matcher = Matcher(nlp.vocab)
```

The `matcher.add` method lets you add the pattern:

```python
# Add the pattern to the matcher
pattern = [{'ORTH': 'samsung'}, {'ORTH': 'phone'}]
matcher.add('PHONE_PATTERN', None, pattern)
```

The first argument is a unique ID that identifies which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to `None`. The third argument is the pattern.

To match the pattern on a text, we can call the matcher on any `Doc`:

```python
# Process some text
doc = nlp("New samsung phone release date out now")

# Call the matcher on the doc
matches = matcher(doc)
```

Calling the matcher on a `Doc` returns a list of tuples. Each tuple consists of three values:

- the `match_id`: hash value of the pattern name
- the `start`: start index of the matched span
- the `end`: end index of the matched span (exclusive)

This means that we can iterate over the matches and create a `Span` object, which is a slice of the `Doc` from the start index to the end index.
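
Because `match_id` is a hash, it is not readable on its own. The sketch below shows one way to map it back to the pattern name through the string store of the shared vocabulary. It makes two assumptions that differ from the cells in this notebook: it uses a blank English pipeline (`spacy.blank('en')`) so that no model download is needed, and it passes the pattern as `[pattern]`, the list form that spaCy v3 expects and that spaCy 2.2.2 and later should also accept.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, because ORTH
# only needs the tokenizer and no statistical model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [{"ORTH": "samsung"}, {"ORTH": "phone"}]
# List form of add(); the notebook cells use add(name, None, pattern).
matcher.add("PHONE_PATTERN", [pattern])

doc = nlp("New samsung phone release date out now")
results = []
for match_id, start, end in matcher(doc):
    name = nlp.vocab.strings[match_id]  # hash -> pattern name
    span = doc[start:end]               # the end index is exclusive
    results.append((name, start, end, span.text))
    print(name, start, end, span.text)
```

Looking up the name is mostly useful once several patterns are registered on the same matcher and the matches need to be told apart.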

```python
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
```

```
samsung phone
```


## Matching for lexical attributes

Here is an example of a more complex pattern using lexical attributes. We are looking for five tokens: a token consisting of only digits, three case-insensitive tokens for `fifa`, `world` and `cup`, and a token that consists of punctuation.

```python
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
```

```python
doc = nlp("2018 FIFA World Cup: France won!")
```

```python
# Add the pattern to the matcher
matcher.add('FIFA_PATTERN', None, pattern)
```

```python
# Call the matcher on the doc
matches = matcher(doc)
```

```python
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
```

```
2018 FIFA World Cup:
```

The pattern matches the tokens `2018 FIFA World Cup:`.
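
Lexical attributes such as `IS_DIGIT`, `LOWER` and `IS_PUNCT` depend only on the token text, not on model predictions. The sketch below mimics them with plain Python to show what each entry of the pattern checks. The `lexical_attrs` helper is hypothetical, the token list is split by hand, and the punctuation test is an approximation of spaCy's behaviour rather than its actual code.

```python
import unicodedata


def lexical_attrs(text):
    """Approximate a few context-independent attributes for one token text."""
    return {
        "IS_DIGIT": text.isdigit(),
        "LOWER": text.lower(),
        # Assumption: a token counts as punctuation if every character
        # belongs to a Unicode punctuation category.
        "IS_PUNCT": all(unicodedata.category(ch).startswith("P") for ch in text),
    }


# Hand-tokenized start of the example sentence
texts = ["2018", "FIFA", "World", "Cup", ":"]
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]

# Compare each token text with the pattern entry at the same position
checks = [
    all(lexical_attrs(text)[attr] == value for attr, value in spec.items())
    for text, spec in zip(texts, pattern)
]
print(checks)
```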

## Matching other token attributes

In this example, we are looking for two tokens: a verb with the lemma `love`, followed by a noun.

```python
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
```

```python
doc = nlp("I loved dogs but now I love cats more.")
```

```python
# Add the pattern to the matcher
matcher.add('LOVE', None, pattern)
```

```python
# Call the matcher on the doc
matches = matcher(doc)
```

```python
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
```

```
loved dogs
```


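This pattern is written to match both `loved dogs` and `love cats`. The run above returned only `loved dogs`: because `LEMMA` and `POS` are predicted by the model, the result depends on how the model tags each token.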

## Using operators and quantifiers (1)

Operators and quantifiers let you define how often the token should be matched. They can be added using the `OP` key. 

```python
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
```

Here, the question mark operator makes the determiner token optional, so the pattern will match a token with the lemma `buy`, an optional article and a noun.

```python
doc = nlp("I bought a smartphone. Now I'm buying accessories for it.")
```

```python
# Add the pattern to the matcher
matcher.add('BUY', None, pattern)
```

```python
# Call the matcher on the doc
matches = matcher(doc)
```

```python
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
```

```
bought a smartphone
buying accessories
```


## Using operators and quantifiers (2)

`OP` can have one of four values:

| Operator        | Description                  |
|-----------------|------------------------------|
| `{'OP': '!'}`   | Negation: match 0 times      |
| `{'OP': '?'}`   | Optional: match 0 or 1 times |
| `{'OP': '+'}`   | Match 1 or more times        |
| `{'OP': '*'}`   | Match 0 or more times        |

- An exclamation mark negates the token, so it is matched zero times.
- The question mark makes the token optional and matches it zero or one times.
- A plus matches the token one or more times.
- An asterisk matches the token zero or more times.

Operators can make your patterns a lot more powerful, but they also add complexity, so use them wisely.
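
The effect of `?`, `+` and `*` can be illustrated with a small backtracking matcher in plain Python. This is a simplified sketch and not spaCy's algorithm: the `fits` and `end_positions` helpers and the hand-tagged tokens are hypothetical, the `!` operator is left out, and every possible end position is reported.

```python
def fits(token, spec):
    """Check one token against one pattern dict, ignoring the OP key."""
    return all(token.get(k) == v for k, v in spec.items() if k != "OP")


def end_positions(tokens, pattern, i=0, p=0):
    """Return the set of end indices where pattern[p:] can finish, starting at token i."""
    if p == len(pattern):
        return {i}
    spec = pattern[p]
    op = spec.get("OP", "1")  # "1" means: match exactly once
    ends = set()
    if op in ("?", "*"):
        # Zero occurrences are allowed: skip this pattern entry
        ends |= end_positions(tokens, pattern, i, p + 1)
    if i < len(tokens) and fits(tokens[i], spec):
        # Consume one token and move on to the next pattern entry
        ends |= end_positions(tokens, pattern, i + 1, p + 1)
        if op in ("+", "*"):
            # Repetition is allowed: stay on the same pattern entry
            ends |= end_positions(tokens, pattern, i + 1, p)
    return ends


# Hand-tagged tokens standing in for a processed Doc
tokens = [
    {"TEXT": "bought", "LEMMA": "buy", "POS": "VERB"},
    {"TEXT": "a", "LEMMA": "a", "POS": "DET"},
    {"TEXT": "smartphone", "LEMMA": "smartphone", "POS": "NOUN"},
    {"TEXT": "buying", "LEMMA": "buy", "POS": "VERB"},
    {"TEXT": "accessories", "LEMMA": "accessory", "POS": "NOUN"},
]
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

spans = []
for start in range(len(tokens)):
    for end in sorted(end_positions(tokens, pattern, start)):
        spans.append(" ".join(t["TEXT"] for t in tokens[start:end]))
print(spans)
```

With `+` or `*`, the same start position can yield several end positions, which is one reason quantified patterns tend to return more matches than expected.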

Token-based matching opens up lots of new possibilities for information extraction. Let's see some examples:

## Using the Matcher

Let's try spaCy's rule-based `Matcher`. You'll write a pattern that can match the phrase "samsung phone" in the text. The `nlp` object is already available.

- Import the `Matcher` from `spacy.matcher`.
- Initialize it with the `nlp` object's shared `vocab`.


```python
# Import the Matcher
from spacy.matcher import Matcher

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Process some text
doc = nlp("New samsung phone release date leaked")
```

- Create a pattern that matches the `TEXT` values of two tokens: "samsung" and "phone".
- Use the `matcher.add` method to add the pattern to the matcher.

## Writing match patterns
In this exercise, you'll practice writing more complex match patterns using different token attributes and operators. A matcher is already initialized and available as the variable `matcher`.

- Write **one** pattern that only matches mentions of the *full* iOS versions: "iOS 7", "iOS 11" and "iOS 10".

```python
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', None, pattern)

matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)
```

```
Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10
```


- Write **one** pattern that only matches forms of "download" (tokens with the lemma "download"), followed by a proper noun.

```python
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)
```

```
Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip
```


- Write **one** pattern that matches adjectives followed by one or two nouns (one noun and one optional noun).

```python
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches an adjective followed by one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)
```

```
Total matches found: 4
Match found: downloaded Fortnite
Match found: my laptop
Match found: downloading Minecraft
Match found: download Winzip
```

The three `download` spans come from `DOWNLOAD_THINGS_PATTERN`, which is still registered on the matcher from the previous cell; only `my laptop` is contributed by the new pattern.


## References

- [Rule-based matching examples on spaCy](https://spacy.io/usage/linguistic-features#section-rule-based-matching)

- [Writing Your Own Resume Parser](https://www.omkarpathak.in/2018/12/18/writing-your-own-resume-parser/#sixth-step)