# Introduction

As we have already seen, this is a flights dataset. Hence, we expect to see city/country names, airport names, and airline names.

Also, the `atis_abbreviation` intent contains utterances that ask what an abbreviation means. Flight abbreviations can be fare codes (for example, M = Economy), airline codes (for example, United Airlines = UA), airport codes (for example, Berlin Airport = BER), and so on. Examples include the following:

```
what does the abbreviation ua mean
what does restriction ap 57 mean
explain restriction ap please
what's fare code yn
```

Let's visualize some utterances from the dataset:

In [38]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")
docs = [nlp("show me the flights from montreal to chicago"),
        nlp("does american airlines fly from boston to san francisco"),
        nlp("show me flights from minneapolis to seattle on july second"),
        nlp("what flights leave after 7 pm from philadelphia to boston")]
displacy.render(docs, style="ent")

Next, let's see all the entity types and their frequencies more systematically:

In [8]:
from collections import Counter
import spacy

nlp = spacy.load('en_core_web_md')
# Read one utterance per line and close the file afterwards
with open('data/atis_utterances.txt', 'r') as f:
    corpus = f.read().split('\n')

all_ent_labels = []
for sentence in corpus:
    doc = nlp(sentence.strip())
    ents = doc.ents
    all_ent_labels += [ent.label_ for ent in ents]

Counter(all_ent_labels)

Counter({'GPE': 8888,
         'TIME': 1006,
         'DATE': 1440,
         'CARDINAL': 281,
         'MONEY': 48,
         'ORDINAL': 193,
         'ORG': 412,
         'LOC': 24,
         'FAC': 67,
         'NORP': 98,
         'PRODUCT': 9,
         'PERSON': 10,
         'LANGUAGE': 1,
         'EVENT': 1})

We observe that the most frequent entity labels are `GPE` (geopolitical entities such as cities and countries), `DATE`, `TIME`, and `ORG` (organizations). The `GPE` entities refer to source and destination cities and countries, so they play a very important role in the overall semantic success of our application.
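The ranking is easier to read when the counts are sorted. The following minimal sketch copies the counts from the output above into a `Counter` and uses `most_common` to list the four most frequent labels together with their share of all entities. The variable names `label_counts`, `total`, and `top4` are illustrative only.

```python
from collections import Counter

# Counts copied from the Counter output shown above
label_counts = Counter({
    'GPE': 8888, 'TIME': 1006, 'DATE': 1440, 'CARDINAL': 281,
    'MONEY': 48, 'ORDINAL': 193, 'ORG': 412, 'LOC': 24,
    'FAC': 67, 'NORP': 98, 'PRODUCT': 9, 'PERSON': 10,
    'LANGUAGE': 1, 'EVENT': 1,
})

total = sum(label_counts.values())

# most_common(n) returns the n highest counts in descending order
top4 = label_counts.most_common(4)
for label, count in top4:
    print(f"{label:<8} {count:>5}  {count / total:.1%}")
```

The labels come out in the order `GPE`, `DATE`, `TIME`, `ORG`, which matches the observation above.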

# Extracting named entities with Matcher

## Locations

We'll first extract the location entities with the spaCy `Matcher` by searching for patterns of the form `preposition location_name`. The following code extracts location entities preceded by a preposition:

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADP"}, {"ENT_TYPE": "GPE"}]
matcher.add("prepositionLocation", [pattern])
doc = nlp("show me flights from denver to boston on tuesday")
matches = matcher(doc)
for mid, start, end in matches:
    print(doc[start:end])

from denver
to boston


Although the `from` and `to` prepositions dominate in this dataset, verbs about leaving and arriving can be used with a variety of prepositions. Here are some more example sentences from the dataset:

```
i'm looking for a flight that goes from ontario to westchester and stops in chicago
what flights arrive in chicago on sunday on continental
yes i'd like a flight from long beach to st. louis by way of dallas
what are the evening flights flying out of dallas
```

We see verbs that take a fixed preposition, such as `arrive in`, as well as verb and preposition combinations such as `stop in` and `fly out of`. `by way of dallas` does not include a verb at all; the user indicates that they want to make a stop in Dallas. `to`, `from`, `in`, `out`, and `of` are common prepositions in a travel context.
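One possible way to cover multiword prepositions such as `out of` and `by way of` is to add literal token patterns next to the single-preposition pattern. The sketch below is an assumption-laden illustration rather than the approach used in this notebook: it uses a blank English pipeline (so no POS tags or NER are available), matches prepositions by their lowercase text, and marks `dallas` as a `GPE` by hand so that the `ENT_TYPE` part of the pattern has something to match.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Blank pipeline: tokenizer only, no statistical model required
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

single_prep = {"LOWER": {"IN": ["from", "to", "in", "at"]}}
gpe = {"ENT_TYPE": "GPE", "OP": "+"}

matcher.add("prepositionLocation", [
    [single_prep, gpe],
    [{"LOWER": "out"}, {"LOWER": "of"}, gpe],
    [{"LOWER": "by"}, {"LOWER": "way"}, {"LOWER": "of"}, gpe],
])

doc = nlp("what are the evening flights flying out of dallas")

# Assumption: the entity is set manually here; with a trained
# pipeline the NER component would provide it instead.
doc.ents = [Span(doc, 8, 9, label="GPE")]

spans = [doc[start:end] for _, start, end in matcher(doc)]
longest = filter_spans(spans)
print([span.text for span in longest])
```

With a trained pipeline, the manual `doc.ents` assignment would be dropped and the `{"POS": "ADP"}` check from the cell above could replace the explicit word list.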

## Airline information

The `ORG` entity label marks an organization, which corresponds to airline company names in our dataset. The following code segment extracts the organization names, including multiword names:

In [41]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
pattern = [{"ENT_TYPE": "ORG", "OP": "+"}]
matcher.add("AirlineName", [pattern])
doc = nlp("what is the earliest united airlines flight flying from denver")
matches = matcher(doc)
spans = [doc[start:end] for mid, start, end in matches]
longest_spans = spacy.util.filter_spans(spans)
for span in longest_spans:
    print(span)

united airlines
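The call to `spacy.util.filter_spans` matters here because a pattern with `"OP": "+"` can report every sub-span of a multiword name (`united`, `airlines`, and `united airlines`). The following spaCy-free sketch mimics the idea on plain `(start, end)` token offsets: prefer longer spans and drop anything that overlaps a span already kept. The function `keep_longest` and the hand-written `candidates` list are illustrative stand-ins, not part of spaCy.

```python
def keep_longest(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Longer spans first; ties are broken in favour of the earlier start
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end in ordered:
        # Skip a span if any of its tokens is already covered
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)

tokens = "what is the earliest united airlines flight flying from denver".split()

# Hypothetical raw matches for the two-token airline name
candidates = [(4, 5), (4, 6), (5, 6)]

for start, end in keep_longest(candidates):
    print(" ".join(tokens[start:end]))  # united airlines
```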


## Dates and times

We can extract dates and times very similarly:

In [103]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
date_pattern = [{"POS": "ADP", "OP": "?"}, {"ENT_TYPE": "DATE", "OP": "+"}, {"POS": "NOUN", "OP": "?"}]
time_pattern = [{"POS": "ADP", "OP": "?"}, {"POS": "DET", "OP": "?"}, {"ENT_TYPE": "TIME", "OP": "+"}]
matcher.add("FlightTime", [date_pattern, time_pattern])

def match_and_print(sentence):
    doc = nlp(sentence)
    matches = matcher(doc)
    spans = [doc[start:end] for mid, start, end in matches]
    longest_spans = spacy.util.filter_spans(spans)
    print(longest_spans)

match_and_print("show me all flights from boston to pittsburgh on wednesday of next week which leave boston after 2 o'clock pm")
match_and_print("show me all flights from atlanta to denver which leave after 5 o'clock pm the day after tomorrow")
match_and_print("show me the flights from boston to pittsburgh next wednesday night after 6 o'clock")
match_and_print("show me all the delta flights leaving or arriving at pittsburgh between 12 and 4 in the afternoon")
match_and_print("show me all the flights before 11 am on august second from boston to denver on delta")


[on wednesday of next week, after 2 o'clock pm]
[after 5 o'clock pm, the day after tomorrow]
[next wednesday night, after 6 o'clock]
[between 12 and 4, in the afternoon]
[before 11 am, on august second]


## Abbreviations

Extracting the abbreviation entities is a bit trickier. First, we will have a look at how the abbreviations appear:
```
what does restriction ap 57 mean
what does the abbreviation co mean
what does fare code qo mean
what is the abbreviation d10
what does code y mean
what does the fare code f and fn mean
what is booking class c
```

In [2]:
import spacy
from spacy import displacy
import warnings

warnings.filterwarnings('ignore')

nlp = spacy.load("en_core_web_md")
docs = [nlp("what does restriction ap 57 mean"),
        nlp("what does the abbreviation co mean"),
        nlp("what does fare code qo mean"),
        nlp("what is the abbreviation d10"),
        nlp("what does code y mean"),
        nlp("what does the fare code f and fn mean"),
        nlp("what is booking class c")]
displacy.render(docs, style="ent")

Only one of these sentences includes an entity: the first example sentence includes a `CARDINAL` entity, `57`. Other than that, the abbreviations are not marked with any entity type at all. In this case, we have to provide some custom rules to the `Matcher`. Let's make some observations first, and then form a `Matcher` pattern:

1) An abbreviation can be broken into two parts: letters, and digits.
2) The letter part can be 1-2 characters long.
3) The digit part is also 1-2 characters long.
4) The presence of digits indicates an abbreviation entity.
5) The presence of the following words indicates an abbreviation entity: `class`, `code`, `abbreviation`.
6) The POS tag of an abbreviation is a noun. If the candidate word is a 1-letter or 2-letter word, then we can look at the POS tag and see whether it's a noun. This approach eliminates the false positives, such as `us` (pronoun), `me` (pronoun), `a` (determiner), and `an` (determiner).
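Observations 1 to 3 can be tried out on single tokens with a plain regular expression before turning them into `Matcher` patterns. The sketch below is only an illustration: the helper `looks_like_code` and the test tokens other than `d10`, `y`, and `57` are made up, and the check covers only the fused letters-plus-digits shape, not the context words or POS tags from observations 5 and 6.

```python
import re

# 1-2 lowercase letters followed by 1-2 digits, nothing else
ABBREV_SHAPE = re.compile(r"[a-z]{1,2}\d{1,2}")

def looks_like_code(token):
    """Return True if the whole token has the letters+digits shape."""
    return ABBREV_SHAPE.fullmatch(token) is not None

for token in ["d10", "ap57", "y", "57", "flight", "abc123"]:
    print(token, looks_like_code(token))
# d10 True
# ap57 True
# y False
# 57 False
# flight False
# abc123 False
```

Using `fullmatch` rather than `search` keeps longer tokens such as `abc123` from matching on a substring.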

In [5]:
import spacy
from spacy.matcher import Matcher

# Raw string avoids invalid-escape warnings; ^...$ restricts the match to the whole token
pattern1 = [{"TEXT": {"REGEX": r"^[a-z]{1,2}\d{1,2}$"}}]
pattern2 = [{"SHAPE": { "IN": ["x", "xx"]}}, {"SHAPE": {"IN": ["d", "dd"]}}]
pattern3 = [{"TEXT": {"IN": ["class", "code", "abbrev", "abbreviation"]}}, {"SHAPE": { "IN": ["x", "xx"]}}]
pattern4 = [{"POS": "NOUN", "SHAPE": { "IN": ["x", "xx"]}}]

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
matcher.add("abbrevEntities", [pattern1, pattern2, pattern3, pattern4])

sentences = [
    'what does restriction ap 57 mean',
    'what does the abbreviation co mean',
    'what does fare code qo mean',
    'what is the abbreviation d10',
    'what does code y mean',
    'what does the fare code f and fn mean',
    'what is booking class c'
]

for sent in sentences:
    doc = nlp(sent)
    matches = matcher(doc)
    spans = [doc[start:end] for mid, start, end in matches]
    longest_spans = spacy.util.filter_spans(spans)
    for span in longest_spans:
        print(span)

ap 57
abbreviation co
code qo
d10
code y
code f
fn
class c


spaCy's `Matcher` makes life easy for us by letting us use token shape, context clues, and token POS tags. We have now extracted locations, airline names, dates, times, and abbreviations.

# Using dependency trees for extracting entities

In the previous subsection, we extracted entities where the context provides obvious clues. Extracting the destination city from the following sentence is easy. We can look for the `to + GPE` pattern:
```
I want to fly to Munich tomorrow.
```

But suppose the user provides one of the following sentences instead:
```
I'm going to a conference in Munich. I need an air ticket.
My sister's wedding will be held in Munich. I'd like to book a flight.
```

Here, the preposition `to` is attached to `conference`, not `Munich`, in the first sentence. To handle this sentence, we need a pattern such as `to + ... + GPE`. We then have to be careful about which words may appear between `to` and the city name and which may not. For instance, the following sentence carries a completely different meaning and shouldn't match:
```
I want to book a flight to my conference without stopping at Berlin.
```

In the second sentence, there's no `to` at all. As these examples show, we need to examine the syntactic relations between words. This can be achieved by walking the dependency trees.
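To make the limitation of a purely linear pattern concrete, the sketch below writes `to + ... + city` as a regular expression over the raw text. This is an illustrative stand-in for a `Matcher` pattern with wildcard tokens, and the two-city list is a made-up substitute for the `GPE` entity type. The pattern accepts both the sentence we want and the one that shouldn't match.

```python
import re

# "to", then any number of words, then a city name
LINEAR = re.compile(r"\bto\b(?:\s+\w+)*?\s+(munich|berlin)\b", re.IGNORECASE)

wanted = "I'm going to a conference in Munich."
unwanted = "I want to book a flight to my conference without stopping at Berlin."

results = []
for sentence in (wanted, unwanted):
    match = LINEAR.search(sentence)
    results.append(match.group(1) if match else None)

print(results)  # ['Munich', 'Berlin']
```

Both sentences produce a city, so word order alone cannot tell a destination from a place to avoid.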

Walking a dependency tree means visiting the tokens in a custom order, not necessarily from left to right. Usually, we stop iterating over the dependency tree once we find what we're looking for.

Every word in a sentence takes part in at least one relation, which guarantees that every word can be reached while walking the tree. `ROOT` is a special dependency label assigned to the head of the whole sentence, which is usually the main verb. In every relation, one of the tokens is the syntactic parent (called the `HEAD`) and the other is its dependent (called the `CHILD`).

Coming back to our examples, we'll iterate the utterance dependency trees to find out whether the preposition `to` is syntactically related to the location entity, `Munich`. First of all, let's see the dependency parse of our example sentence `I'm going to a conference in Munich`:

In [5]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I'm going to a conference in Munich")
displacy.render(doc, style="dep", options={"compact": True})

There are no incoming arcs into the verb `going`, so `going` is the `ROOT` of the dependency tree. If we follow the arc to its immediate right, we encounter `to`; following the arcs further to the right, we reach `Munich`. This shows that there's a syntactic relation between `to` and `Munich`.

There are two possible ways to connect `to` and `Munich`:
- Left to right: we start from `to` and try to reach `Munich` by visiting `to`'s syntactic children. This approach may not be a very good idea, because if `to` has more than one child, then we need to check each child and keep track of all the possible paths.
- Right to left: we start from `Munich`, jump onto its head, and follow the head's head, and so on. Since each word has exactly one head, it's guaranteed that there will be only one path. Then we determine whether `to` is on this path or not.

In [6]:
import spacy

nlp = spacy.load("en_core_web_md")

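def reach_parent(source_token, dest_token):
    """Climb from source_token towards the root; return dest_token if it is an ancestor, else None."""
    source_token = source_token.head
    while source_token != dest_token:
        # The root is its own head, so reaching it means dest_token is not on the path
        if source_token.head == source_token:
            return None
        source_token = source_token.head
    return source_token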

doc = nlp("I'm going to a conference in Munich.")
reach_parent(doc[-2], doc[3])

to
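To see why the right-to-left walk yields exactly one path, the following spaCy-free sketch stores one head index per word and collects every ancestor of `Munich`. The `heads` list is a hand-written parse assumed for illustration (chosen to agree with the result above); a real pipeline would supply it through `token.head`. The function `path_to_root` is a hypothetical helper.

```python
words = ["I", "'m", "going", "to", "a", "conference", "in", "Munich", "."]

# heads[i] is the index of the head of words[i]; the root points to itself
heads = [2, 2, 2, 2, 5, 3, 5, 6, 2]

def path_to_root(index):
    """Return the indices of all ancestors of words[index], nearest first."""
    path = []
    while heads[index] != index:
        index = heads[index]
        path.append(index)
    return path

path = path_to_root(words.index("Munich"))
print([words[i] for i in path])      # ['in', 'conference', 'to', 'going']
print(words.index("to") in path)     # True
```

Because every word has a single head, the loop never branches; checking whether `to` is an ancestor reduces to a membership test on this one path.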

Dependency parsing is necessary for intent recognition, which is the subject of the next chapter.