# Part 1 - Information Extraction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. Information Extraction may be presented in three subtasks:

* **Named Entity Recognition**, retrieve entities (like persons, locations, etc.) in the text.
* **Relation Extraction**, find the relation between two entities in the text.
* **Template Filling**, find the correct entities to fill the slots of a predefined template.

In this BLU we are going to learn some of the basic techniques to extract specific (pre-specified) information from textual sources. Of the three tasks above, we are going to **focus on named-entity recognition (NER)**, where our objective is to **retrieve all the mentions** of entities like persons, locations and times, among others. The other two are mentioned for the sake of completeness and you should definitely research more about them, especially if you're eager to learn more about NLP.

![robot entities](./media/robot_entities.jpg)

In [1]:
import re
import json

import pandas as pd
import spacy

We are going to work with a corpus of forum discussions, sampled from Reddit. For more interesting examples, you may find more textual data available at https://files.pushshift.io/reddit/

In [2]:
docs = []
with open('./datasets/sample_data.json') as fp:
    for line in fp:
        entry = json.loads(line)
        docs.append(entry['body'])
        
print('I read {} documents'.format(len(docs)))

I read 1000 documents


### Information Extraction with Regular Expressions

In BLU7, we became pros of regular expressions. Let's try to apply them to our task of recognizing entities. Take a moment to think about all the kinds of entities that we can find in a text. Do you think such a task is achievable using only regular expressions?

![regex](./media/regex.gif "regex")

As a refresher, let's say that your boss asked you to retrieve all the **dates** mentioned in our sample corpus. We learned in BLU7 that it is easy to use a regular expression for that.

In [3]:
# Let's find all possible dates in the format xx/xx/xx or xx/xx/xxxx
data = ' '.join(docs)
# raw string (r'...') so that \d reaches the regex engine untouched
re.findall(r'\d{1,2}/\d{1,2}/\d{2,4}', data)

['14/09/30', '7/12/2007', '4/16/2007', '3/27/2007', '2/28/2007']
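
One caveat before moving on: the pattern above has no word boundaries, so it can also pick up a date-like slice from the middle of a longer number. The sketch below compares it with a slightly stricter variant on a made-up sentence. The sample text and the names `loose` and `strict_re` are illustrative assumptions, not part of the corpus or of BLU7.

```python
import re

# Made-up sample text (not taken from the Reddit corpus).
text = 'Posted 7/12/2007, edited 14/09/30, build 2021/10/123456.'

# The pattern used above: no boundaries, year of 2 to 4 digits.
loose = re.findall(r'\d{1,2}/\d{1,2}/\d{2,4}', text)

# A stricter sketch: word boundaries on both sides and a year of
# exactly 4 or exactly 2 digits.
strict_re = re.compile(r'\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b')
strict = strict_re.findall(text)

print(loose)   # ['7/12/2007', '14/09/30', '21/10/1234']
print(strict)  # ['7/12/2007', '14/09/30']
```

Neither pattern checks that the day and month are valid, nor which one comes first, so the matches are still only date candidates.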

Ok, this looks like it's going to be a breeze. However, now your boss decides to ask you to retrieve all the **country names** which appear in the corpus instead. 

One possible approach is to retrieve a list of all countries that exist and look for occurrences of those names in the corpus. Let's try that, shall we?

![alt text](./media/countries_meme.jpg)

In [4]:
countries = []
with open('./datasets/countries.txt') as fp:
    for line in fp:
        countries.append(line.rstrip())
countries

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctic Lands',
 'Antarctica',
 'Antigua',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Ashmore Islands',
 'Ashmore and Cartier Islands',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Baker Island',
 'Bangladesh',
 'Barbados',
 'Barbuda',
 'Bassas da India',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Borneo',
 'Bosnia',
 'Bosnia Herzegovina',
 'Bosnia and Herzegovina',
 'Botswana',
 'Bouvet Island',
 'Brazil',
 'Britain',
 'British Indian Ocean Territory',
 'British Virgin Islands',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Burundi',
 'Caicos Islands',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cartier Islands',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Christmas Island',
 'Clipperton',
 'Clipperton Island',
 'Cocos',
 'Colombia',
 'Comoros',
 'Congo',
 '

We could again use regular expressions for this. Let's see how:

In [5]:
# Sort country list by length. This is important to match longer spans before short 
# ones (like in 'Papua New Guinea' vs. 'Papua')
countries.sort(key=len, reverse=True)

# Make a regex to recognize all possible names.
# '|' creates the or operation in regex
# \b matches a word boundary (the edge between a word character and a non-word character, e.g. a space or punctuation)
# re.escape is used to escape regex operators like '.'    
countries_regex = r'\b(' + '|'.join([re.escape(c) for c in countries]) + r')\b'

# finditer is similar to findall
# the flag re.I means to ignore casing (accept both lowercase and uppercase letters as the same)
for i, m in enumerate(re.finditer(countries_regex, data, flags=re.I)):
    print( (m.group(), m.start(), m.end()) )
    # stop after the first 22 matches (i runs from 0 to 21)
    if i > 20:
        break    

('us', 763, 765)
('United States', 827, 840)
('UK', 6971, 6973)
('US', 7000, 7002)
('Puerto rico', 8026, 8037)
('us', 8638, 8640)
('France', 19815, 19821)
('us', 21563, 21565)
('Puerto Rico', 27659, 27670)
('Puerto Rico', 27754, 27765)
('US', 28101, 28103)
('Canada', 29439, 29445)
('USA', 32880, 32883)
('Norway', 34749, 34755)
('Korea', 34837, 34842)
('USA', 35738, 35741)
('United States', 41060, 41073)
('us', 42290, 42292)
('us', 42403, 42405)
('Soviet', 44563, 44569)
('us', 49625, 49627)
('Chad', 51352, 51356)
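
To see why the country list was sorted by length before building the regex, here is a minimal sketch on a two-name list. The helper `build_regex` and the sample sentence are assumptions made for illustration; the construction of the pattern mirrors the cell above.

```python
import re

def build_regex(names):
    # Same construction as above: escaped names joined with '|'.
    return r'\b(' + '|'.join(re.escape(n) for n in names) + r')\b'

names = ['Papua', 'Papua New Guinea']
text = 'She moved to Papua New Guinea last year.'

# Alternatives are tried left to right, so the short name wins if it comes first.
unsorted_hits = re.findall(build_regex(names), text)

# Longest first: the full name gets a chance before its prefix.
longest_first = sorted(names, key=len, reverse=True)
sorted_hits = re.findall(build_regex(longest_first), text)

print(unsorted_hits)  # ['Papua']
print(sorted_hits)    # ['Papua New Guinea']
```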


**Is this approach working?**

It seems like the word **'us'**, for example, has caused some confusion. It could be the country _U.S._, or just the pronoun _us_. By comparing word forms alone, and ignoring case, we cannot disambiguate the two. We will need either more **context** or more **linguistic information**, and regular expressions won't give us any of that.
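
The trade-off can be reproduced on a tiny made-up example. In the sketch below (sample sentence and variable names are assumptions), ignoring case finds the misspelled _Puerto rico_ but also the pronoun, while case-sensitive matching avoids the pronoun but loses _Puerto rico_.

```python
import re

names = sorted(['US', 'Puerto Rico'], key=len, reverse=True)
pattern = r'\b(' + '|'.join(re.escape(n) for n in names) + r')\b'

text = 'Tell us why the US and Puerto rico differ.'

# Ignoring case: the pronoun 'us' is reported as a country.
case_insensitive = re.findall(pattern, text, flags=re.I)

# Respecting case: no pronoun, but 'Puerto rico' is now missed.
case_sensitive = re.findall(pattern, text)

print(case_insensitive)  # ['us', 'US', 'Puerto rico']
print(case_sensitive)    # ['US']
```

Either way, the regex only sees characters, which is why the next sections bring in linguistic annotations.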

Luckily, you already know an NLP library which can provide you the correct information to disambiguate the word 'us'. In the next examples, we will use SpaCy as our NLP toolkit to give us just that.

## Deeper look in information extraction using SpaCy
![Spacy](./media/spacy.jpg)

If you remember BLU8, we used SpaCy to understand word vectors (aka word embeddings). We will make use of the medium-sized SpaCy English model once again. In case you haven't downloaded it yet, here's the command:

```
python -m spacy download en_core_web_md
```
    
But of course we could have used any English model (en_core_web_sm, en_core_web_md, en_core_web_lg) provided by SpaCy.

In [6]:
# Here we are disabling the syntactic parser in the pipeline to improve speed.
nlp = spacy.load('en_core_web_md', disable=['parser'])

With SpaCy, we will process the documents using [pipe](https://spacy.io/usage/processing-pipelines). `pipe` streams our texts through the NLP pipeline in batches, which is faster than calling `nlp` on each text one by one. Concretely, each text is tokenized, Part-of-Speech tagged (more on that later) and run through the entity recognizer. The syntactic parser is skipped, because we disabled it when loading the model.

We won't get into details on how SpaCy does this -- what matters is that it uses fast machine learning models with good enough accuracy.

In [7]:
# We are going to use the function pipe to process all documents.
# pipe handles the texts in batches, which is faster than calling nlp() on each one.
# In this step, SpaCy runs the NLP pipeline on all the docs, so it may take a while.
docs = list(nlp.pipe(docs))
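
As a small, self-contained sketch of how `pipe` behaves, the example below uses `spacy.blank('en')`, a tokenizer-only pipeline that needs no model download, together with three made-up texts. Both are assumptions chosen so the snippet runs anywhere SpaCy is installed; the notebook itself uses `en_core_web_md`.

```python
import spacy

# Tokenizer-only English pipeline (assumption: used here instead of
# en_core_web_md so that no model download is required).
blank_nlp = spacy.blank('en')

texts = ['I live in Portland.', 'She moved to Puerto Rico.', 'Tell us more!']

# pipe returns a generator of Doc objects, in the same order as the input.
# batch_size controls how many texts are buffered at a time.
# Recent SpaCy versions also accept n_process to use several processes.
processed = list(blank_nlp.pipe(texts, batch_size=2))

for doc in processed:
    print(len(doc), doc.text)
```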

Let's say that we want to do NER (Named Entity Recognition) on a piece of text. We can get an example sentence from our corpus:

In [8]:
example = docs[631]
print(example)

JRR Tolkien. Gandalf, Aragorn, Frodo, Bilbo Baggins, Gollum...


In SpaCy, it's really easy to extract entities - we can simply use `.ents` in our previously processed text, and SpaCy will use its built-in model to get the entities present in the text!

In [9]:
for ent in example.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

JRR Tolkien 0 11 PERSON
Gandalf 13 20 PERSON
Aragorn 22 29 PERSON
Frodo 31 36 PERSON
Bilbo Baggins 38 51 PERSON
Gollum 53 59 PERSON


In our example sentence, SpaCy correctly labels all these LOTR characters with the Person entity. You could further argue that Gandalf is a wizard and Frodo/Bilbo are hobbits, but let's not penalize SpaCy on that one!

Now that our text is processed and we know how to get entities, let's build a `Matcher` in SpaCy.

A `Matcher` is SpaCy's version of a regular expression - it searches for patterns in your text, according to the rules you give it. However, it is much more powerful since it has access to the outputs of the aforementioned NLP pipeline. That means we can search patterns that include certain entities or Part-of-Speech tags. 

In this `Matcher` we will define templates which we will use later to match elements in the text (thus using it to do information extraction). The `Matcher` is initialized using the vocabulary object, which must be shared with the documents the matcher will operate on.

In [10]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab) # Pass the vocabulary object to Matcher.__init__()

Let's build a matcher similar to the one we built above with regular expressions. We are going to take each country name and add it as a pattern to the `matcher`. To add a pattern, we can simply use `.add()`. In the SpaCy v2 API used in this notebook, it receives:

- an ID (the name we want to give our pattern)
- a callable function that is called when there is a match (we're not going to use anything)
- the pattern itself

In [11]:
for country in countries:
    # Build a pattern from the country name. For example: United States -> [{'LOWER': 'united'}, {'LOWER': 'states'}]
    # LOWER means to match the words in the lowercased token.
    pattern = [{'LOWER': c.lower()} for c in country.split()]
    matcher.add(country, None, pattern)
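
Note that `matcher.add(country, None, pattern)` is the SpaCy v2 call style. In SpaCy v3 the patterns are passed as a list in the second argument and the callback became an optional keyword. The sketch below branches on the installed version so it runs on either; it uses a blank English pipeline and a made-up sentence, both assumptions for illustration.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank('en')  # tokenizer only, enough for LOWER patterns
demo_matcher = Matcher(blank_nlp.vocab)

pattern = [{'LOWER': 'united'}, {'LOWER': 'states'}]

if int(spacy.__version__.split('.')[0]) >= 3:
    # v3 style: a list of patterns, callback is an optional keyword
    demo_matcher.add('United States', [pattern])
else:
    # v2 style, as used in this notebook: callback (None) comes second
    demo_matcher.add('United States', None, pattern)

doc = blank_nlp('I moved to the united states last year.')
matches = [doc[start:end].text for _, start, end in demo_matcher(doc)]
print(matches)  # ['united states']
```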

In [12]:
# to save screen space, let's just show the matches for the first 400 documents.
for i, doc in enumerate(docs[:400]):
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # the matched span
        print(i, start, end, span)

10 1 2 us
12 4 6 United States
58 22 23 UK
58 28 29 US
64 18 20 Puerto rico
69 50 51 us
146 4 5 France
167 29 30 us
213 99 101 Puerto Rico
213 121 123 Puerto Rico
213 198 199 US
229 4 5 Canada
255 86 87 USA
263 78 79 Norway
263 101 102 Korea
267 2 3 USA
312 4 6 United States
320 35 36 us
320 58 59 us
335 38 39 Soviet
335 38 39 Soviet
349 4 5 us
367 7 8 Chad
369 11 12 Chad
369 18 19 Chad
369 41 42 Chad
386 4 6 United States


As we mentioned, in order to disambiguate the retrieval of 'U.S.' vs 'us' we need to add more linguistic information to the `matcher`. Let's play with Part of Speech (PoS).

## But what is Part-of-Speech?

If you remember from your language classes, you could categorize words in a sentence according to the role they have in it. In NLP, we call this Part of Speech tags. For the English language, common PoS tags are: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection.

SpaCy adopts the Universal PoS tagset where any language has a common subset of PoS defined. The list of all possible values can be consulted [here](https://spacy.io/api/annotation#pos-tagging).

In this case, we are interested in matching the country names that were tagged as **Proper Nouns** ('PROPN' tag obtained from the tagset list).

![Pronoun meme](./media/pronoun.jpg)

In SpaCy, just as entities of a document are inside `doc.ents`, for each token of a document we can find its assigned POS tag by using `.pos_`.

But `Matcher` is pretty smart, so we only need to add a `'POS'` key to the pattern dictionary, with the tag we are looking for as its value.

In [13]:
# new matcher instance
matcher = Matcher(nlp.vocab)

for country in countries:
    # same as before, but now with one more restriction: the Part-of-Speech tag must be a proper noun (PROPN).
    pattern = [{'LOWER': c.lower(), 'POS': 'PROPN'} for c in country.split()]    
    matcher.add(country, None, pattern)

In [14]:
for token in example:
    print(token.text, token.pos_)

JRR PROPN
Tolkien PROPN
. PUNCT
Gandalf PROPN
, PUNCT
Aragorn PROPN
, PUNCT
Frodo PROPN
, PUNCT
Bilbo PROPN
Baggins PROPN
, PUNCT
Gollum PROPN
... PUNCT


In [15]:
for i, doc in enumerate(docs[:400]):
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id] 
        span = doc[start:end]
        print(i, start, end, span)

12 4 6 United States
58 22 23 UK
58 28 29 US
146 4 5 France
213 99 101 Puerto Rico
213 121 123 Puerto Rico
213 198 199 US
229 4 5 Canada
255 86 87 USA
263 78 79 Norway
263 101 102 Korea
267 2 3 USA
312 4 6 United States
367 7 8 Chad
369 11 12 Chad
369 18 19 Chad
369 41 42 Chad
386 4 6 United States


Unfortunately, the PoS tagger is a machine learning model, so it is prone to errors. Notice that _Puerto rico_ from document 64 no longer appears in the list: at least one of its tokens (most likely the lowercase _rico_) was not tagged as a proper noun.

### Extracting using complex patterns

Let's now look into other types of information extraction methods which use complex structures. For example, let's say we want to extract places. Usually, places come up in text in structures similar to:

* go to xx
* went from xxx
* going to xx

**Note**: Notice that such patterns could be interesting to the task of relation extraction we mentioned in the intro. But that's something we will leave up to you to look further into.

To build a SpaCy pattern for this sentence structure, we are going to use the lemma 'go', which is the same for all inflections of the verb (remember lemmatization from BLU7? SpaCy does it for us pretty easily as well!), followed by a preposition (POS tag ADP) and a proper noun (POS tag PROPN).

In [16]:
matcher = Matcher(nlp.vocab)
pattern = [{'LEMMA': 'go'}, {'POS': 'ADP'}, {'POS': 'PROPN'}]
matcher.add('LOC', None, pattern)

In [17]:
for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # the matched span
        span_text = span.text  # the span as a string
        print(start, end, span_text)

24 27 goes to GTA
246 249 going to Osaka
81 84 gone to Irvine
91 94 going with Robbie


These sure aren't all the locations that are present in our corpus! Not what we expected then :( 

Once again, we are finding out that it is very difficult to build patterns that match these types of occurrences in the text. Addressing all possible patterns for person, location, etc. this way is very inefficient and difficult.

Another option is to annotate examples in a corpus and train a machine learning system to learn the patterns from the annotated data. This class of machine learning methods is known as sequence labeling, and the most famous approaches are [CRFs](https://people.cs.umass.edu/~wallach/technical_reports/wallach04conditional.pdf) and [Seq2seq](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf).
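
To make the idea of an annotated corpus more concrete, here is a minimal sketch of one common annotation scheme, BIO tagging, where each token is labeled as the Beginning of an entity, Inside one, or Outside any entity. The scheme is not named in this BLU, and the helper `bio_to_spans`, the sentence and the tags are all made up for illustration. A sequence labeling model would be trained to predict such tags; the function below only decodes tags back into entity spans.

```python
def bio_to_spans(tokens, tags):
    """Turn parallel lists of tokens and BIO tags into (text, label) spans."""
    spans = []
    current_tokens, current_label = [], None

    def close_span():
        # Store the open span, if there is one.
        if current_tokens:
            spans.append((' '.join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A new entity starts here.
            close_span()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith('I-') and current_label == tag[2:]:
            # The open entity continues.
            current_tokens.append(token)
        else:
            # 'O', or an I- tag that does not continue the open entity.
            close_span()
            current_tokens, current_label = [], None
    close_span()
    return spans

tokens = ['Bilbo', 'Baggins', 'went', 'to', 'Puerto', 'Rico', '.']
tags = ['B-PER', 'I-PER', 'O', 'O', 'B-GPE', 'I-GPE', 'O']
spans = bio_to_spans(tokens, tags)
print(spans)  # [('Bilbo Baggins', 'PER'), ('Puerto Rico', 'GPE')]
```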

Fortunately, as explained above, SpaCy already ships pre-trained models for standard named entities. Besides _Person_ (PERSON) entities like _Bilbo_ and _Organization_ (ORG) entities like _PayPal_, we can also extract location-like entities: the label GPE (geopolitical entity) covers countries, cities and states!

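Let's print the entities that the built-in model finds in the first 600 documents. Note that the loop below prints every entity type; to keep only locations, filter on `e.label_ == 'GPE'`.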

In [18]:
for i, doc in enumerate(docs[:600]):
    for e in doc.ents:
        print(i, e.text, e.start_char, e.end_char)

1 November 3 – 5 0 14
1 the Portland Expo Center 18 42
3 New York 0 8
3 North America 57 70
11 Soraka 10 16
12 143413934| &gt 0 14
12 United States Anonymous 16 39
12 the Democratic Party 95 115
12 SJW 175 178
15 Nova 0 4
20 8 months or so 268 282
23 Isoprop 0 7
25 Russians 71 79
25 Wikileaks 92 101
27 UCCI 35 39
27 a week 97 103
27 days 363 367
27 24-48 hour 481 491
30 Portland 5 13
30 the weekends 151 163
30 Portland 402 410
30 Some days 412 421
33 Cod BO2 53 60
34 0-3 0 3
35 1](/r/AskReddit/wiki 72 92
37 Shangela 7 15
38 two 0 3
40 CL 22 24
40 Accord 265 271
40 first 300 305
47 two 22 25
49 Breanne at Stone Salon 0 22
49 Hoover 26 32
49 Facebook 161 169
50 Brad 0 4
50 Lonzo Ball's 44 56
50 6 57 58
50 Lakers 88 94
54 RHONJ 15 20
54 Teresa 70 76
58 UK 117 119
58 US 146 148
58 Phoebe Tonkins 173 187
58 The Secret Circle 194 211
59 200 29 32
60 Brees 50 55
61 two 263 266
62 2 hours 0 7
63 Heatwaffle 0 10
63 280 11 14
64 Puerto rico 81 92
65 next week 47 56
65 Brees 83 88
65 next week 10
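
The listing above mixes dates, persons, organizations and more. Keeping only locations comes down to filtering on the entity label. The sketch below builds a `Doc` by hand with manually assigned entities, so it runs without a trained model. The sentence and its labels are assumptions for illustration, not real model output.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Hand-built document with hand-assigned entities (not model predictions).
words = ['Brees', 'flew', 'from', 'Portland', 'to', 'Puerto', 'Rico', 'next', 'week']
doc = Doc(Vocab(), words=words)
doc.ents = [
    Span(doc, 0, 1, label='PERSON'),
    Span(doc, 3, 4, label='GPE'),
    Span(doc, 5, 7, label='GPE'),
    Span(doc, 7, 9, label='DATE'),
]

# Keep only geopolitical entities (countries, cities, states).
locations = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
print(locations)  # ['Portland', 'Puerto Rico']
```

On the processed corpus, the same `if ent.label_ == 'GPE'` condition can be added inside the loop over `doc.ents`.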

Still, let's not forget that, as with any machine learning model, the predictions are prone to errors.

Could we train a better model? Sure! Given a good corpus for training and the right tools we could achieve a very high accuracy. However, as this is not the objective of this BLU we are going to leave you some links if you want to learn more about this.

https://spacy.io/usage/training

Here's another handy link - https://spacy.io/usage/linguistic-features - where you can find all kinds of features SpaCy can extract for you!