# Introduction

As we have already seen, this is a flights dataset. Hence, we expect to see city/country names, airport names, and airline names.

Also, the `atis_abbreviation` intent contains utterances that ask what an abbreviation means. Flight abbreviations include fare codes (for example, M = Economy), airline codes (for example, UA = United Airlines), and airport codes (for example, BER = Berlin Airport). Examples include the following:

```
what does the abbreviation ua mean
what does restriction ap 57 mean
explain restriction ap please
what's fare code yn
```

Let's visualize the named entities in some utterances from the dataset:

In [38]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")
docs = [nlp("show me the flights from montreal to chicago"),
        nlp("does american airlines fly from boston to san francisco"),
        nlp("show me flights from minneapolis to seattle on july second"),
        nlp("what flights leave after 7 pm from philadelphia to boston")]
displacy.render(docs, style="ent")

Next, let's see all the entity types and their frequencies more systematically:

In [34]:
from collections import Counter
import spacy
import pprint

nlp = spacy.load('en_core_web_md')
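with open('../data/atis_utterances.txt', 'r') as f:
    # One utterance per line; skip blank lines such as a trailing newline.
    corpus = [line.strip() for line in f if line.strip()]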

all_ent_labels = []
for sentence in corpus:
    doc = nlp(sentence.strip())
    ents = doc.ents
    all_ent_labels += [ent.label_ for ent in ents]
c = Counter(all_ent_labels)

pprint.pprint(c)

Counter({'GPE': 8888,
         'DATE': 1440,
         'TIME': 1006,
         'ORG': 412,
         'CARDINAL': 281,
         'ORDINAL': 193,
         'NORP': 98,
         'FAC': 67,
         'MONEY': 48,
         'LOC': 24,
         'PERSON': 10,
         'PRODUCT': 9,
         'LANGUAGE': 1,
         'EVENT': 1})


We observe that the most frequent entity labels are `GPE` (geopolitical entities such as cities, states, and countries), `DATE`, `TIME`, and `ORG` (organizations). In this dataset, the `GPE` entities are the origin and destination cities or countries of a flight, so extracting them correctly is essential for understanding what the user is asking for.
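To put the counts into proportion, the following minimal sketch turns them into relative frequencies. The numbers are copied from the output above; the names `label_counts` and `label_share` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Counts copied from the Counter output above.
label_counts = Counter({
    "GPE": 8888, "DATE": 1440, "TIME": 1006, "ORG": 412,
    "CARDINAL": 281, "ORDINAL": 193, "NORP": 98, "FAC": 67,
    "MONEY": 48, "LOC": 24, "PERSON": 10, "PRODUCT": 9,
    "LANGUAGE": 1, "EVENT": 1,
})

total = sum(label_counts.values())


def label_share(label):
    """Return the fraction of all entity mentions that carry `label`."""
    return label_counts[label] / total


# Print the four most frequent labels with their share of all mentions.
for label, count in label_counts.most_common(4):
    print(f"{label:<5} {count:>5} {label_share(label):.1%}")
```

By this calculation, `GPE` alone accounts for roughly 71% of all recognized entity mentions, and the four most frequent labels together account for about 94%.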

# Extracting named entities with Matcher

## Locations

We'll first extract the location entities with spaCy's `Matcher` by searching for patterns of the form preposition + location name. The following code extracts `GPE` entities that are preceded by a preposition:

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADP"}, {"ENT_TYPE": "GPE"}]
matcher.add("prepositionLocation", [pattern])
doc = nlp("show me flights from denver to boston on tuesday")
matches = matcher(doc)
for mid, start, end in matches:
    print(doc[start:end])

from denver
to boston


Although the `from` and `to` prepositions dominate in this dataset, verbs about leaving and arriving can be used with a variety of prepositions. Here are some more example sentences from the dataset:

```
i'm looking for a flight that goes from ontario to westchester and stops in chicago
what flights arrive in chicago on sunday on continental
yes i'd like a flight from long beach to st. louis by way of dallas
what are the evening flights flying out of dallas
```

We see verb and preposition combinations such as `arrive in`, `stop in`, and `fly out of`. `by way of dallas` does not include a verb at all; here the user indicates that they want to make a stop in Dallas. `to`, `from`, `in`, `out`, and `of` are common prepositions in a travel context.
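Multi-word prepositions such as `out of` and `by way of` can be covered by adding further patterns to the same `Matcher`. The following is a minimal sketch, not the notebook's own code. To keep it runnable without a model download, it assumes a blank English pipeline in which an entity ruler with a hand-written city list stands in for the `GPE` predictions of `en_core_web_md`. For the same reason it matches prepositions by their lowercase text rather than by `POS`. The rule names and the helper `extract_location_phrases` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: a blank pipeline plus an entity ruler replaces the statistical
# NER of en_core_web_md; only the three listed cities are recognized as GPE.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": city}
    for city in ["dallas", "long beach", "st. louis"]
])

matcher = Matcher(nlp.vocab)
location = {"ENT_TYPE": "GPE", "OP": "+"}  # one or more GPE tokens
# Hypothetical rule names; the preposition lists are illustrative, not exhaustive.
matcher.add("PrepLocation", [[{"LOWER": {"IN": ["from", "to", "in"]}}, location]])
matcher.add("OutOfLocation", [[{"LOWER": "out"}, {"LOWER": "of"}, location]])
matcher.add("ByWayOfLocation", [[{"LOWER": "by"}, {"LOWER": "way"}, {"LOWER": "of"}, location]])


def extract_location_phrases(text):
    """Return the longest preposition + location phrases, in sentence order."""
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # filter_spans drops shorter matches that overlap a longer one.
    return [span.text for span in filter_spans(spans)]


print(extract_location_phrases("what are the evening flights flying out of dallas"))
print(extract_location_phrases(
    "yes i'd like a flight from long beach to st. louis by way of dallas"
))
```

With these stand-in entities, the second call should yield `from long beach`, `to st. louis`, and `by way of dallas`. With the full model, the result depends on which tokens the NER component actually labels as `GPE`.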

## Airline information

The `ORG` entity label denotes an organization, and in our dataset it corresponds to airline names. The following code segment extracts organization names, including multi-word names:

In [41]:
matcher = Matcher(nlp.vocab)
pattern = [{"ENT_TYPE": "ORG", "OP": "+"}]
matcher.add("AirlineName", [pattern])
doc = nlp("what is the earliest united airlines flight flying from denver")
matches = matcher(doc)
spans = [doc[start:end] for mid, start, end in matches]
longest_spans = spacy.util.filter_spans(spans)
for span in longest_spans:
    print(span)

united airlines
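The call to `spacy.util.filter_spans` matters because a pattern with `"OP": "+"` typically matches every sub-span of a multi-token entity as well as the entity itself. The sketch below compares the raw matches with the filtered result. It assumes a blank pipeline in which an entity ruler labels `united airlines` as `ORG`, standing in for the predictions of `en_core_web_md`.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: the entity ruler stands in for the model's ORG predictions.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "united airlines"}])

matcher = Matcher(nlp.vocab)
matcher.add("AirlineName", [[{"ENT_TYPE": "ORG", "OP": "+"}]])

doc = nlp("what is the earliest united airlines flight flying from denver")
raw_spans = [doc[start:end] for _, start, end in matcher(doc)]

# All raw matches, including overlapping sub-spans of the airline name.
raw_texts = sorted(span.text for span in raw_spans)
# Only the longest non-overlapping spans survive the filter.
longest_texts = [span.text for span in filter_spans(raw_spans)]

print(raw_texts)
print(longest_texts)
```

When spans overlap, `filter_spans` prefers the longest one, which is why only the full airline name remains.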


## Dates and times

We can extract dates and times very similarly:

In [103]:
matcher = Matcher(nlp.vocab)
date_pattern = [{"POS": "ADP", "OP": "?"}, {"ENT_TYPE": "DATE", "OP": "+"}, {"POS": "NOUN", "OP": "?"}]
time_pattern = [{"POS": "ADP", "OP": "?"}, {"POS": "DET", "OP": "?"}, {"ENT_TYPE": "TIME", "OP": "+"}]
matcher.add("FlightTime", [date_pattern, time_pattern])

def match_and_print(sentence):
    doc = nlp(sentence)
    matches = matcher(doc)
    spans = [doc[start:end] for mid, start, end in matches]
    longest_spans = spacy.util.filter_spans(spans)
    print(longest_spans)

match_and_print("show me all flights from boston to pittsburgh on wednesday of next week which leave boston after 2 o'clock pm")
match_and_print("show me all flights from atlanta to denver which leave after 5 o'clock pm the day after tomorrow")
match_and_print("show me the flights from boston to pittsburgh next wednesday night after 6 o'clock")
match_and_print("show me all the delta flights leaving or arriving at pittsburgh between 12 and 4 in the afternoon")
match_and_print("show me all the flights before 11 am on august second from boston to denver on delta")


[on wednesday of next week, after 2 o'clock pm]
[after 5 o'clock pm, the day after tomorrow]
[next wednesday night, after 6 o'clock]
[between 12 and 4, in the afternoon]
[before 11 am, on august second]
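The printed spans do not show whether a match is a date or a time, because both patterns are registered under the single rule name `FlightTime`. One possible way to keep that information is to register the patterns under separate rule names and attach the match ID to each span as its label. This is a sketch under assumptions: the rule name `FlightDate` and the helper `match_with_labels` are hypothetical, an entity ruler stands in for the `DATE` and `TIME` predictions of `en_core_web_md`, and prepositions are matched by text instead of `POS` so that no trained model is needed.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: hand-written entities replace the model's DATE/TIME predictions.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "TIME", "pattern": "11 am"},
    {"label": "DATE", "pattern": "august second"},
])

# Optional leading preposition, matched by lowercase text (illustrative list).
preposition = {"LOWER": {"IN": ["on", "before", "after", "in"]}, "OP": "?"}

matcher = Matcher(nlp.vocab)
matcher.add("FlightDate", [[preposition, {"ENT_TYPE": "DATE", "OP": "+"}]])
matcher.add("FlightTime", [[preposition, {"ENT_TYPE": "TIME", "OP": "+"}]])


def match_with_labels(sentence):
    """Return (rule name, text) pairs for the longest date and time matches."""
    doc = nlp(sentence)
    # Storing the match ID as the span label keeps the rule name after filtering.
    spans = [Span(doc, start, end, label=match_id)
             for match_id, start, end in matcher(doc)]
    return [(span.label_, span.text) for span in filter_spans(spans)]


print(match_with_labels(
    "show me all the flights before 11 am on august second from boston to denver on delta"
))
```

Each returned pair then carries the rule name, so downstream code can fill a date slot and a time slot separately.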
