# Machine Learning & Computational Text Analysis

[John McLevey](johnmclevey.com)   
Winter 2018

Time to get started with computational text analysis! Along the way, we will learn about applications of machine learning and natural language processing more broadly. Like other methods, these are best learned by doing, so let's work through some examples. Today we are going to focus mostly on processing text at the *beginning* of a text analysis workflow. We will go from unstructured text > processed text > bag of words > feature matrix for training machine learning models.

I'll begin with a brief overview of how computational text analysis relates to other methods of content analysis -- quantitative and qualitative -- with long histories in the social sciences. Then, we will get into the specifics of computational text analysis.

Please create more cells (or add to this one) for your notes as I lecture. Once we are sufficiently on the same page about the high-level overview, we will start working through specific examples.

In the next few classes, we will cover:

- TODAY: processing text  
  - tokenization, cleaning, stemming, lemmatization, bag of words / vector space model
  - computing TF-IDF weights
  - part of speech tagging
- NEXT CLASS: three different types of unsupervised methods
  - $k$-means clustering of documents based on cosine similarity
  - LDA topic model
  - topic detection in semantic networks
- NEXT CLASS: one supervised method
  - Naïve Bayes
  - a framework for integrating both unsupervised and supervised methods in the same text analysis project

Start by reading through the flowchart below. I created it to help you get a high-level overview of the process(es) of computational text analysis, from unstructured text, to numbers, to (validated) models, to enlightenment. Or at least reasonable conclusions...


![](img/text_as_data.png)

And here is another view on the motivations behind supervised and unsupervised learning in theory-driven social research, and on when and how they might be combined. From James Evans and Pedro Aceves (2016) "Machine Translation: Mining Text for Social Theory."

![](img/text_theory.png)

And finally, a view from Laura Nelson's "computational grounded theory." We will talk more about this, and about Evans and Aceves (2016), in a couple of classes.

![](img/nelson.png)

Let's get started.

# Processing Text

### Garbage in ---> Garbage Out

It is not possible to do high-quality computational text analysis without initial text processing. To illustrate some basic concepts, we will start with a very small text dataset: political party manifestos from the 2015 Canadian federal election. I've downloaded the data for the Liberals, Conservatives, New Democrats, and Greens from the [Manifesto Project](https://manifesto-project.wzb.eu). You can easily obtain data for other years, countries, and parties.

The data is stored as 4 `csv` files. Let's peek at one before we do anything.

If you have not already installed [nltk](https://www.nltk.org/install.html), please do it now.

In [27]:
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shruti/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/shruti/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [28]:
base_path = "/Users/shruti/Documents/Studies/2018/INTEG 475/Machine Learning"

LIB = pd.read_csv(base_path + '/manifesto_project_data/liberal_2015.csv', encoding = 'utf-8')
LIB[:10]

Unnamed: 0,text,cmp_code
0,PLATEFORME ELECTORALE PARTI LIBERAL DU CANADA ...,
1,ECONOMIC SECURITY FOR THE MIDDLE CLASS,
2,A strong economy starts with a strong middle c...,408.0
3,Our plan offers real help to Canada's middle c...,704.0
4,When our middle class has more money in their ...,704.0
5,We will give families more money to help with ...,504.0
6,We will cancel tax breaks and benefits for the...,503.0
7,and introduce a new Canada Child Benefit to gi...,504.0
8,"With the Canada Child Benefit, nine out of ten...",305.0
9,"For the typical family of four, that means an ...",305.0


The data from the Manifesto Project includes codes (`cmp_code`) that we do not need for this particular analysis (although you should look into them, as they have been used in applications of supervised text analysis).

Let's just grab the text.

In [29]:
def get_content(csvfile):
    df = pd.read_csv(csvfile, encoding = 'utf-8')
    text = [row['text'] for index, row in df.iterrows()]
    text = " ".join(text)
    return text

In [30]:
lib = get_content(base_path + '/manifesto_project_data/liberal_2015.csv')
con = get_content(base_path + '/manifesto_project_data/conservative_2015.csv')
ndp = get_content(base_path + '/manifesto_project_data/ndp_2015.csv')
gre = get_content(base_path + '/manifesto_project_data/green_2015.csv')

You can print out each of our party strings and check to see if the function did what it was supposed to do. I don't want to print a lot to the screen in this notebook right now, so I will just grab a small slice of characters from partway through the NDP document. (I did check the others as well...)

In [31]:
ndp[420:850]

' age, my parents taught me and my brothers and sisters the importance of living by your values.   These values are at the heart of who I am as a husband, a father and grandfather.   As your Prime Minister, I will wake up every day focused on building the country of our dreams.   With the NDP, that means:   Bringing in quality, affordable childcare.   Strengthening our public health care system.   Ensuring a cleaner environment'

Our unit of analysis in this toy analysis is the manifesto from each party in 2015. Even though we have that data, we are not ready to go. We still need to break each document down to a set of tokens, process those tokens, and then construct a bag of words.

We will use the word tokenizer from NLTK. Let's start with the Liberals and, once we have a workflow in place, process the other parties as well.

In [32]:
lib_tok = word_tokenize(lib)
lib_tok[:10]

['PLATEFORME',
 'ELECTORALE',
 'PARTI',
 'LIBERAL',
 'DU',
 'CANADA',
 '2015',
 'ECONOMIC',
 'SECURITY',
 'FOR']

As you can see, we have broken the Liberal document down into individual tokens. Since we used a word tokenizer, you shouldn't be surprised to see a list of individual words! (If we had used a sentence tokenizer, it would have split the document into sentences...)
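To make the difference between word and sentence tokenization concrete, here is a rough sketch that uses only regular expressions from the standard library. It is a simplified stand-in, not how NLTK's `word_tokenize` or `sent_tokenize` work internally, and the example sentence is made up for illustration.

```python
import re

# Made-up example text (not from the manifestos).
text = "We will invest in jobs. We will also cut taxes for the middle class."

# Word-level tokens: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence-level tokens: split on whitespace that follows ., !, or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens[:6])
print(sentence_tokens)
```

The same text yields 16 word-level tokens (punctuation included) but only 2 sentence-level tokens, which is why the choice of tokenizer determines what you end up counting.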

From here, it's easy to quantify our text, turning it into a simple "[bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model)."

**Spoiler**: we haven't done enough processing yet, so this bag of words is going to suck. Garbage in, garbage out.

I'm not going to bother assigning the results to a variable. Let's just see what it looks like by peeking at the top _n_ with the `most_common` method for counter objects.

In [33]:
Counter(lib_tok).most_common(40)

[(',', 857),
 ('and', 754),
 ('.', 749),
 ('to', 648),
 ('the', 608),
 ('will', 557),
 ('We', 361),
 ('of', 322),
 ('in', 271),
 ('for', 228),
 ('a', 227),
 ('that', 207),
 ('Canada', 167),
 ('our', 163),
 ('more', 151),
 ('with', 150),
 ('Canadians', 136),
 ("'s", 112),
 ('we', 109),
 ('make', 94),
 ('is', 91),
 ('on', 91),
 ('new', 88),
 ('$', 85),
 ('by', 85),
 ('are', 78),
 ('help', 74),
 ('work', 72),
 ('their', 71),
 ('also', 71),
 ('government', 70),
 ('it', 64),
 ('be', 63),
 ('support', 62),
 ('an', 61),
 ('not', 60),
 ('year', 59),
 ('Canadian', 57),
 ('all', 55),
 ('million', 53)]

We didn't learn much about the Liberal document from that peek at the bag of words, did we?

There are a lot of things that can and should be different. Word counts are split if a word is sometimes capitalized and sometimes not. Many of these words are not useful when they are removed from their semantic context (e.g. "and", "the", "a"). We don't need punctuation. Etc.

Fortunately, a bit of pre-processing will get us a _much_ better bag of words.

Here, with a bit of list comprehension to make things more concise, is some processing that keeps only alphabetic tokens, drops any that show up in NLTK's list of English [stop words](https://en.wikipedia.org/wiki/Stop_words), and lowercases the rest. We could have used another list, but the NLTK list is good.

Sometimes you should keep stop words, sometimes you shouldn't. As always: it depends.

In [34]:
without_stops = [t.lower() for t in lib_tok if t.isalpha()
                 and t not in stopwords.words('english')]
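One detail of the cell above is worth a closer look: the stop word check happens *before* the token is lowercased, and NLTK's stop word list is all lower case, so capitalized tokens such as "We" pass the check and are only lowercased afterwards. The sketch below reproduces the effect with a tiny, hypothetical stop word set (an assumption standing in for `stopwords.words('english')`) and made-up tokens. Building the set once also avoids calling `stopwords.words('english')` again for every token.

```python
# Hypothetical mini stop word set, standing in for stopwords.words('english').
STOP_WORDS = {"we", "will", "the", "to", "and", "a"}

# Made-up tokens for illustration.
tokens = ["We", "will", "help", "the", "middle", "class", "and", "we", "will", "invest"]

# Check first, lowercase second: "We" is not in the set, so it slips through.
check_then_lower = [t.lower() for t in tokens
                    if t.isalpha() and t not in STOP_WORDS]

# Lowercase first, check second: every casing of a stop word is removed.
lower_then_check = [t.lower() for t in tokens
                    if t.isalpha() and t.lower() not in STOP_WORDS]

print(check_then_lower)
print(lower_then_check)
```

Whether this matters depends on your data, but it explains any stop words that still show up in a bag of words built this way.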

Let's take a quick peek at the top _n_ words in our new bag of words.

In [35]:
Counter(without_stops).most_common(40)

[('we', 362),
 ('canada', 178),
 ('canadians', 142),
 ('new', 96),
 ('make', 94),
 ('government', 79),
 ('help', 76),
 ('work', 72),
 ('also', 71),
 ('support', 65),
 ('canadian', 60),
 ('year', 59),
 ('harper', 53),
 ('million', 53),
 ('ensure', 53),
 ('communities', 52),
 ('stephen', 48),
 ('to', 48),
 ('invest', 47),
 ('families', 43),
 ('provide', 43),
 ('national', 43),
 ('federal', 41),
 ('infrastructure', 40),
 ('this', 39),
 ('need', 37),
 ('funding', 37),
 ('investments', 36),
 ('veterans', 36),
 ('provinces', 35),
 ('first', 35),
 ('benefit', 34),
 ('including', 34),
 ('nations', 34),
 ('economy', 32),
 ('care', 32),
 ('service', 32),
 ('security', 31),
 ('territories', 31),
 ('public', 31)]

Much better! Still, we could benefit from [stemming](https://en.wikipedia.org/wiki/Stemming) or [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation). Stemming strips affixes to reduce words to a stem, while lemmatization maps each word to its dictionary form (its lemma). Either one lets you group together inflected forms of the same word (e.g. 'run', 'runs', and 'running', or 'community' and 'communities'). Neither will merge synonyms such as 'sick' and 'ill', which share a meaning but not a root. Grouping word forms this way usually improves the quality of any models we train.

Stemmed words are a little less readable because word stems are often not words we actually use. Lemmas are readable because they are valid words. The tradeoff is that stems are faster to compute, and in practice the results are often comparable to those from lemmatization. Both methods have been shown to work very well for computational text analysis. Unless you have a good reason not to (e.g. variations on a word carry a lot of meaning in your particular case), you should just pick one.
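To see why stems are less readable than lemmas, here is a deliberately crude sketch. `toy_stem` is a made-up suffix stripper (it is *not* the Porter algorithm or any NLTK stemmer), and `TOY_LEMMAS` is a hypothetical hand-written lookup table standing in for a real dictionary such as WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping for illustration only (not the Porter algorithm)."""
    for suffix in ("ies", "ing", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma table, standing in for a real dictionary such as WordNet.
TOY_LEMMAS = {"communities": "community", "families": "family", "investing": "invest"}


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


words = ["communities", "families", "investing"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)
print(lemmas)
```

The stemmer needs no dictionary and is cheap to run, but it produces fragments like 'communit' and 'famil'. The lemmatizer returns real words, at the cost of needing a lookup resource.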

Let's try lemmatization for now. We will add it to a new processing text function that combines our tokenization and processing steps from previous cells. We will design it so that it accepts text directly from our `get_content` function.

In [36]:
def process_text(text):
    tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
    no_stop_words = [t for t in tokens if t not in stopwords.words('english')]
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stop_words]
    bag_of_words = Counter(lemmatized)
    return bag_of_words

Good news, everyone! We now have an easy way to get a bag of words from these political parties. Let's take a look at the most frequently occurring words from each party document.

![](https://media.giphy.com/media/3o7abA4a0QCXtSxGN2/giphy.gif)

In [37]:
lib_bow = process_text(lib)
con_bow = process_text(con)
ndp_bow = process_text(ndp)
gre_bow = process_text(gre)

In [45]:
print(lib_bow.most_common(100))

[('canadian', 202), ('canada', 178), ('make', 101), ('new', 96), ('year', 87), ('government', 86), ('help', 78), ('work', 77), ('also', 71), ('support', 70), ('community', 63), ('investment', 63), ('family', 62), ('benefit', 60), ('service', 57), ('million', 56), ('harper', 53), ('ensure', 53), ('need', 50), ('stephen', 48), ('invest', 47), ('nation', 45), ('provide', 43), ('national', 43), ('program', 41), ('federal', 41), ('job', 40), ('infrastructure', 40), ('veteran', 40), ('economy', 37), ('funding', 37), ('province', 35), ('first', 35), ('including', 34), ('child', 34), ('tax', 33), ('care', 32), ('territory', 32), ('security', 31), ('plan', 31), ('public', 31), ('access', 30), ('student', 30), ('create', 29), ('change', 29), ('health', 29), ('protect', 29), ('better', 26), ('time', 26), ('increase', 26), ('housing', 26), ('election', 26), ('give', 25), ('one', 25), ('people', 25), ('economic', 24), ('income', 24), ('affordable', 24), ('right', 24), ('class', 23), ('opportunity',

In [42]:
print(con_bow.most_common(100))

[('canadian', 244), ('canada', 165), ('government', 163), ('conservative', 121), ('tax', 114), ('new', 108), ('family', 101), ('job', 85), ('year', 80), ('support', 70), ('economy', 69), ('business', 69), ('plan', 68), ('continue', 58), ('program', 54), ('country', 49), ('help', 48), ('child', 48), ('million', 47), ('world', 46), ('budget', 40), ('senior', 40), ('investment', 40), ('market', 38), ('also', 38), ('one', 37), ('make', 36), ('protect', 34), ('balanced', 33), ('economic', 32), ('small', 32), ('work', 31), ('industry', 31), ('worker', 30), ('home', 30), ('development', 30), ('need', 28), ('product', 28), ('system', 27), ('benefit', 26), ('good', 26), ('billion', 26), ('saving', 26), ('create', 25), ('trade', 25), ('including', 25), ('security', 25), ('income', 24), ('relief', 23), ('infrastructure', 23), ('sector', 23), ('credit', 23), ('community', 22), ('percent', 22), ('research', 22), ('project', 22), ('federal', 21), ('service', 21), ('opportunity', 21), ('part', 21), (

In [43]:
print(ndp_bow.most_common(100))

[('canada', 167), ('ndp', 163), ('canadian', 162), ('government', 99), ('ensure', 73), ('health', 63), ('plan', 62), ('care', 60), ('support', 60), ('community', 59), ('year', 58), ('work', 57), ('make', 57), ('harper', 56), ('conservative', 55), ('new', 53), ('family', 52), ('stephen', 51), ('woman', 50), ('program', 48), ('help', 48), ('need', 48), ('indigenous', 44), ('national', 44), ('change', 41), ('federal', 41), ('access', 39), ('province', 39), ('job', 37), ('tax', 37), ('funding', 36), ('also', 36), ('public', 35), ('benefit', 35), ('country', 34), ('million', 34), ('investment', 34), ('ensuring', 32), ('home', 32), ('first', 31), ('service', 31), ('improve', 30), ('infrastructure', 29), ('act', 29), ('working', 29), ('cut', 28), ('veteran', 28), ('including', 28), ('child', 27), ('provide', 27), ('living', 26), ('affordable', 26), ('economy', 26), ('create', 26), ('protect', 26), ('economic', 26), ('take', 25), ('right', 25), ('worker', 25), ('get', 24), ('youth', 24), ('man

In [44]:
print(gre_bow.most_common(100))

[('canadian', 135), ('canada', 113), ('government', 46), ('work', 44), ('green', 44), ('community', 43), ('ensure', 42), ('public', 37), ('first', 37), ('need', 35), ('care', 34), ('time', 34), ('party', 33), ('health', 32), ('system', 31), ('national', 30), ('job', 29), ('new', 29), ('economy', 28), ('climate', 28), ('make', 27), ('nation', 27), ('plan', 25), ('must', 24), ('bill', 24), ('people', 23), ('local', 23), ('service', 22), ('parliament', 22), ('u', 20), ('year', 20), ('sustainable', 20), ('energy', 20), ('together', 19), ('country', 19), ('funding', 19), ('create', 19), ('mp', 19), ('good', 18), ('harper', 18), ('infrastructure', 18), ('support', 17), ('strategy', 17), ('economic', 16), ('education', 16), ('access', 16), ('rail', 16), ('every', 15), ('policy', 15), ('business', 15), ('tax', 15), ('housing', 15), ('federal', 15), ('law', 15), ('oil', 15), ('building', 14), ('vote', 14), ('family', 14), ('resource', 14), ('percent', 14), ('change', 14), ('right', 14), ('one',

If you know anything about those four political parties and the campaigns they ran in 2015, you can probably see that our processed text is already showing us some important differences. But we are _just getting started..._

If we wanted to, we could now transform our bags of words into a document-term feature matrix. In this matrix, each document in our analysis (currently only 4!) would be a row, and every feature would be a column. In this and other computational text analyses, the features are unique words from the corpus. If we were to continue with this toy example, how many features would there be in our matrix? However many unique tokens there are in our entire corpus! If there are 1 million unique tokens, then our matrix would have 1 million features.

Unless we compute some statistic, the cells in our matrix will just be raw counts of the number of times a given token appears in a given document. For example, if the token "education" appears in the 'liberal' document 6 times, then that cell in the matrix will be 6.
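As a minimal sketch of that idea, the snippet below builds a small document-term matrix of raw counts from two toy bags of words. The document names and counts are made up for illustration; in the notebook, `lib_bow`, `con_bow`, and the others would play the role of the toy `Counter` objects.

```python
from collections import Counter

# Toy bags of words (made-up counts), standing in for lib_bow, con_bow, etc.
bags = {
    "doc_a": Counter({"education": 6, "tax": 2}),
    "doc_b": Counter({"tax": 5, "job": 3}),
}

# Columns: every unique token in the corpus, sorted for a stable order.
vocabulary = sorted(set().union(*bags.values()))

# Rows: one per document. A Counter returns 0 for tokens it has never seen.
matrix = {name: [bow[term] for term in vocabulary] for name, bow in bags.items()}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Most cells in a real matrix like this are zero, because any single document uses only a small share of the corpus vocabulary.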

This kind of matrix is very common, and very useful. The basic idea, as you might have guessed, is that words that appear more frequently in a document are more important to that document. However, some words (like "canada" or "government") are common but not especially meaningful because they appear frequently across all documents. Instead of raw counts, we may want the cells of our matrix to be some sort of weight that better measures how important that word is to the document. That's where TF-IDF comes in.

# Measuring Word Importance with TF-IDF

One thing you may have noticed in the bags of words above is that some words (e.g. "canada", 'government') are so common across the documents in our corpus that they don't help us understand important differences in what is talked about in each document. For the same reasons that we removed stop words, we may want to pay less attention to words that are very common across documents in our corpus, and pay more attention to words that appear frequently within documents and less frequently across the corpus. That's what [term frequency - inverse document frequency](https://en.wikipedia.org/wiki/Tf–idf), or TF-IDF, is for.

The TF-IDF statistic measures how important a word is to a document within a larger corpus. Formally, it looks like this:

$$w_{i,j} = tf_{i,j} \times \log\Big(\frac{N}{df_i}\Big)$$

Where the tf-idf weight $w$ for word $i$ in document $j$ is equal to the term frequency multiplied by the inverse document frequency. Clear as mud?

The term frequency $tf_{i,j}$ is just the number of times the word $i$ appears in document $j$. The inverse document frequency is the log of the total number of documents in the corpus, $N$, divided by the number of documents that the word $i$ appears in, $df_i$. TF-IDF is just the product of those two numbers.
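A small worked example may help. The sketch below computes the weight as term frequency times the log of (number of documents / number of documents containing the word), using three toy documents with made-up counts. It assumes the natural log, and the helper name `tf_idf` is my own. Libraries typically add smoothing or normalization on top of this, so their numbers will not match this hand calculation exactly.

```python
import math
from collections import Counter

# Toy bags of words with made-up counts.
docs = {
    "doc_a": Counter({"canada": 4, "veteran": 3}),
    "doc_b": Counter({"canada": 5, "tax": 6}),
    "doc_c": Counter({"canada": 2, "tax": 1, "climate": 4}),
}

N = len(docs)  # total number of documents in the corpus

# Document frequency: in how many documents does each term appear?
df = Counter(term for bow in docs.values() for term in bow)


def tf_idf(term, doc_name):
    """Raw term frequency times log(N / df); the term must occur in the corpus."""
    return docs[doc_name][term] * math.log(N / df[term])


print(tf_idf("canada", "doc_a"))   # appears in every document
print(tf_idf("veteran", "doc_a"))  # appears in only one document
```

Because "canada" appears in all three documents, its inverse document frequency is log(3/3) = 0, so its weight is zero no matter how often it occurs. "veteran" appears in only one document, so it gets a positive weight of 3 × log(3).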

TF-IDF can be computed easily using Python packages like [gensim](https://radimrehurek.com/gensim/) and [scikit-learn](http://scikit-learn.org/stable/). Let's take a look at both using a more serious example instead of this toy example. We will do so in another notebook ('unsupervised.ipynb').

Before we do that, let's take a quick look at two more text processing methods: parts-of-speech tagging and named entity recognition.

# Part-of-speech Tagging

Computational linguists have developed highly accurate probabilistic approaches to tagging each word in a sentence with its part of speech, such as noun, proper noun, adjective, verb, adverb, etc.
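Once every token carries a tag, the tags can be counted or used as a filter. The sketch below does both for a handful of (token, tag) pairs copied from the tagger output for the Liberal manifesto shown further down. Hard-coding the pairs keeps the example self-contained, and keeping only nouns is just one possible choice, assumed here for illustration.

```python
from collections import Counter

# (token, tag) pairs copied from the tagger output for the Liberal manifesto.
tagged = [
    ("A", "DET"), ("strong", "ADJ"), ("economy", "NOUN"), ("starts", "VERB"),
    ("with", "ADP"), ("a", "DET"), ("strong", "ADJ"), ("middle", "ADJ"),
    ("class", "NOUN"), (".", "PUNCT"),
]

# How often does each part of speech occur?
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the nouns, lowercased, as one possible filtering step.
nouns = [token.lower() for token, tag in tagged if tag == "NOUN"]

print(tag_counts.most_common())
print(nouns)
```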

![](img/evans_aceves.png)

In [49]:
import spacy
nlp = spacy.load('en')

In [50]:
c_nlp = nlp(con)
g_nlp = nlp(gre)
l_nlp = nlp(lib)
n_nlp = nlp(ndp)

In [51]:
for word in l_nlp:
    print(word, word.pos_)

PLATEFORME ADP
ELECTORALE PROPN
PARTI PROPN
LIBERAL ADJ
DU PROPN
CANADA PROPN
2015 NUM
ECONOMIC PROPN
SECURITY NOUN
FOR ADP
THE DET
MIDDLE ADJ
CLASS NOUN
A DET
strong ADJ
economy NOUN
starts VERB
with ADP
a DET
strong ADJ
middle ADJ
class NOUN
. PUNCT
Our ADJ
plan NOUN
offers VERB
real ADJ
help NOUN
to ADP
Canada PROPN
's PART
middle ADJ
class NOUN
and CCONJ
all ADJ
those DET
working VERB
hard ADV
to PART
join VERB
it PRON
. PUNCT
When ADV
our ADJ
middle ADJ
class NOUN
has VERB
more ADJ
money NOUN
in ADP
their ADJ
pockets NOUN
to PART
save VERB
, PUNCT
invest VERB
, PUNCT
and CCONJ
grow VERB
the DET
economy NOUN
, PUNCT
we PRON
all DET
benefit VERB
. PUNCT
We PRON
will VERB
give VERB
families NOUN
more ADJ
money NOUN
to PART
help VERB
with ADP
the DET
high ADJ
cost NOUN
of ADP
raising VERB
their ADJ
kids NOUN
. PUNCT
We PRON
will VERB
cancel VERB
tax NOUN
breaks NOUN
and CCONJ
benefits NOUN
for ADP
the DET
wealthy ADJ
including VERB
the DET
Universal PROPN
Child PROPN
Care PROPN
Benefi

burden NOUN
on ADP
Canadians PROPN
facing VERB
job NOUN
relocation NOUN
, PUNCT
the DET
death NOUN
of ADP
a DET
spouse NOUN
, PUNCT
marital ADJ
breakdown NOUN
, PUNCT
or CCONJ
a DET
decision NOUN
to PART
accommodate VERB
an DET
elderly ADJ
family NOUN
member NOUN
. PUNCT
We PRON
will VERB
direct VERB
the DET
Canada PROPN
Mortgage PROPN
and CCONJ
Housing PROPN
Corporation PROPN
and CCONJ
the DET
new ADJ
Canada PROPN
Infrastructure PROPN
Bank PROPN
to PART
provide VERB
financing NOUN
to PART
support VERB
the DET
construction NOUN
of ADP
new ADJ
, PUNCT
affordable ADJ
rental ADJ
housing NOUN
for ADP
middle- NOUN
and CCONJ
low ADJ
- PUNCT
income NOUN
Canadians PROPN
. PUNCT
We PRON
will VERB
conduct VERB
an DET
inventory NOUN
of ADP
all DET
available ADJ
federal ADJ
lands NOUN
and CCONJ
buildings NOUN
that ADJ
could VERB
be VERB
repurposed VERB
, PUNCT
and CCONJ
make VERB
some DET
of ADP
these DET
lands NOUN
available ADJ
at ADP
low ADJ
cost NOUN
for ADP
affordable ADJ
housing NOUN
in ADP


Canada PROPN
's PART
economy NOUN
has VERB
faltered VERB
, PUNCT
and CCONJ
our ADJ
middle ADJ
class NOUN
now ADV
finds VERB
it PRON
harder ADV
and CCONJ
harder ADV
to PART
make VERB
ends NOUN
meet VERB
It PRON
is VERB
time NOUN
for ADP
smart ADJ
, PUNCT
strategic ADJ
investments NOUN
that ADJ
will VERB
turn VERB
our ADJ
economy NOUN
around ADV
and CCONJ
get VERB
it PRON
growing VERB
again ADV
. PUNCT
Our ADJ
plan NOUN
will VERB
deliver VERB
the DET
services NOUN
we PRON
need VERB
, PUNCT
create VERB
jobs NOUN
, PUNCT
and CCONJ
restore VERB
economic ADJ
security NOUN
to ADP
the DET
middle ADJ
class NOUN
. PUNCT
We PRON
will VERB
invest VERB
now ADV
in ADP
the DET
projects NOUN
our ADJ
country NOUN
needs VERB
and CCONJ
the DET
people NOUN
who NOUN
can VERB
build VERB
them PRON
. PUNCT
Interest NOUN
rates NOUN
are VERB
at ADP
historic ADJ
lows NOUN
, PUNCT
our ADJ
current ADJ
infrastructure NOUN
is VERB
aging VERB
rapidly ADV
, PUNCT
and CCONJ
our ADJ
economy NOUN
is VERB
stuck VERB
in AD

Trois PROPN
- PUNCT
Rivi¨res NOUN
, PUNCT
this DET
means VERB
improving VERB
the DET
Maples PROPN
and CCONJ
Cardinal PROPN
- PUNCT
Roy PROPN
reservoirs NOUN
, PUNCT
and CCONJ
finding VERB
ways NOUN
to PART
mitigate VERB
the DET
regular ADJ
flooding NOUN
of ADP
the DET
Millette PROPN
, PUNCT
Bettez PROPN
, PUNCT
and CCONJ
Lacerte PROPN
rivers NOUN
. PUNCT
In ADP
Calgary PROPN
and CCONJ
Southern PROPN
Alberta PROPN
, PUNCT
this DET
means VERB
investments NOUN
in ADP
flood NOUN
mitigation NOUN
projects NOUN
, PUNCT
to PART
help VERB
protect VERB
local ADJ
families NOUN
and CCONJ
businesses NOUN
. PUNCT
NEW ADJ
BUILDING PROPN
CANADA PROPN
FUND PROPN
We PRON
will VERB
make VERB
the DET
New PROPN
Building PROPN
Canada PROPN
Fund PROPN
more ADV
focused VERB
and CCONJ
more ADV
transparent ADJ
. PUNCT
The DET
New PROPN
Building PROPN
Canada PROPN
Fund PROPN
is VERB
an DET
important ADJ
source NOUN
of ADP
infrastructure NOUN
funding NOUN
for ADP
Canadian ADJ
communities NOUN
, PUNCT
but CCONJ
it

. PUNCT
LABOUR PROPN
- PUNCT
SPONSORED VERB
FUNDS NOUN
We PRON
will VERB
reinstate VERB
the DET
tax NOUN
credit NOUN
for ADP
contributions NOUN
made VERB
to ADP
labour NOUN
- PUNCT
sponsored VERB
funds NOUN
, PUNCT
to PART
help VERB
support VERB
economic ADJ
growth NOUN
and CCONJ
help VERB
Canadians PROPN
save VERB
for ADP
their ADJ
retirement NOUN
. PUNCT
In ADP
many ADJ
parts NOUN
of ADP
the DET
country NOUN
, PUNCT
labour NOUN
- PUNCT
sponsored VERB
funds NOUN
are VERB
used VERB
to PART
help VERB
small- ADJ
and CCONJ
medium ADJ
- PUNCT
sized ADJ
businesses NOUN
get VERB
off ADP
the DET
ground NOUN
, PUNCT
creating VERB
jobs NOUN
and CCONJ
economic ADJ
growth NOUN
. PUNCT
In ADP
Quebec PROPN
, PUNCT
they PRON
also ADV
serve VERB
as ADP
an DET
important ADJ
retirement NOUN
savings NOUN
vehicle NOUN
. PUNCT
There ADV
, PUNCT
labour NOUN
- PUNCT
sponsored VERB
funds NOUN
help VERB
650,000 NUM
workers NOUN
save VERB
for ADP
retirement NOUN
, PUNCT
while ADP
investing VERB
in ADP
Canada P

added VERB
sugars NOUN
and CCONJ
artificial ADJ
dyes NOUN
in ADP
processed VERB
foods NOUN
. PUNCT
To PART
help VERB
Canadian ADJ
children NOUN
avoid VERB
and CCONJ
manage VERB
known ADJ
health NOUN
risks NOUN
, PUNCT
we PRON
will VERB
increase VERB
funding NOUN
to ADP
the DET
Public PROPN
Health PROPN
Agency PROPN
of ADP
Canada PROPN
by ADP
$ SYM
15 NUM
million NUM
in ADP
each DET
of ADP
the DET
next ADJ
two NUM
years NOUN
, PUNCT
to PART
support VERB
a DET
national ADJ
strategy NOUN
to PART
increase VERB
vaccination NOUN
rates NOUN
and CCONJ
raise VERB
awareness NOUN
for ADP
parents NOUN
, PUNCT
coaches NOUN
, PUNCT
and CCONJ
athletes NOUN
on ADP
concussion NOUN
treatment NOUN
. PUNCT
This DET
will VERB
be VERB
based VERB
on ADP
the DET
best ADJ
science NOUN
and CCONJ
will VERB
support VERB
existing VERB
provincial ADJ
and CCONJ
territorial ADJ
efforts NOUN
. PUNCT
We PRON
will VERB
introduce VERB
plain ADJ
packaging NOUN
requirements NOUN
for ADP
tobacco NOUN
products NOUN
, PUNCT
s

limits NOUN
as ADV
well ADV
. PUNCT
LEADER NOUN
'S PART
DEBATE NOUN
We PRON
will VERB
establish VERB
an DET
independent ADJ
commission NOUN
to PART
organize VERB
leaders NOUN
' PART
debates NOUN
. PUNCT
Elections NOUN
are VERB
a DET
time NOUN
for ADP
Canadians PROPN
to PART
learn VERB
more ADJ
about ADP
political ADJ
parties NOUN
, PUNCT
their ADJ
leaders NOUN
, PUNCT
and CCONJ
their ADJ
policies NOUN
. PUNCT
When ADV
it PRON
comes VERB
to ADP
leaders NOUN
' PART
debates NOUN
, PUNCT
the DET
focus NOUN
should VERB
be VERB
on ADP
educating VERB
and CCONJ
engaging ADJ
Canadians PROPN
, PUNCT
not ADV
on ADP
twisting VERB
the DET
rules NOUN
for ADP
political ADJ
advantage NOUN
. PUNCT
We PRON
will VERB
establish VERB
an DET
independent ADJ
commission NOUN
to PART
organize VERB
leaders NOUN
' PART
debates NOUN
and CCONJ
bring VERB
an DET
end NOUN
to ADP
partisan ADJ
gamesmanship NOUN
. PUNCT
ELECTORAL ADJ
REFORM NOUN
We PRON
will VERB
make VERB
every DET
vote NOUN
count NOUN
. PUNCT
We PRON

bring VERB
in PART
expert ADJ
witnesses NOUN
, PUNCT
and CCONJ
are VERB
sufficiently ADV
staffed VERB
to PART
continue VERB
to PART
provide VERB
reliable ADJ
, PUNCT
non ADJ
- PUNCT
partisan ADJ
research NOUN
. PUNCT
To PART
increase VERB
accountability NOUN
, PUNCT
we PRON
will VERB
strengthen VERB
the DET
role NOUN
of ADP
Parliamentary ADJ
committee NOUN
chairs NOUN
, PUNCT
including VERB
elections NOUN
by ADP
secret ADJ
ballot NOUN
. PUNCT
We PRON
will VERB
also ADV
change VERB
the DET
rules NOUN
so ADP
that ADP
Ministers PROPN
and CCONJ
Parliamentary PROPN
Secretaries PROPN
no ADV
longer ADV
have VERB
a DET
vote NOUN
on ADP
committees NOUN
. PUNCT
BETTER ADJ
SERVICE PROPN
FOR ADP
CANADIANS PROPN
In ADP
a DET
digital ADJ
era NOUN
, PUNCT
Canadians PROPN
have VERB
high ADJ
standards NOUN
for ADP
the DET
service NOUN
they PRON
receive VERB
. PUNCT
Dealing VERB
with ADP
the DET
government NOUN
should VERB
be VERB
no DET
exception NOUN
. PUNCT
Better ADJ
service NOUN
for ADP
Canadians P

to ADP
our ADJ
children NOUN
and CCONJ
grandchildren NOUN
a DET
country NOUN
even ADV
more ADV
beautiful ADJ
, PUNCT
more ADV
sustainable ADJ
, PUNCT
and CCONJ
more ADV
prosperous ADJ
than ADP
the DET
one NOUN
we PRON
have VERB
now ADV
. PUNCT
CLIMATE NOUN
CHANGE NOUN
We PRON
will VERB
provide VERB
national ADJ
leadership NOUN
and CCONJ
join VERB
with ADP
the DET
provinces NOUN
and CCONJ
territories NOUN
to PART
take VERB
action NOUN
on ADP
climate NOUN
change NOUN
, PUNCT
put VERB
a DET
price NOUN
on ADP
carbon NOUN
, PUNCT
and CCONJ
reduce VERB
carbon NOUN
pollution NOUN
. PUNCT
Climate NOUN
change NOUN
is VERB
an DET
immediate ADJ
and CCONJ
significant ADJ
threat NOUN
to ADP
our ADJ
communities NOUN
and CCONJ
our ADJ
economy NOUN
. PUNCT
Stephen PROPN
Harper PROPN
has VERB
had VERB
nearly ADV
a DET
decade NOUN
to PART
take VERB
action NOUN
on ADP
climate NOUN
change NOUN
but CCONJ
has VERB
failed VERB
to PART
do VERB
so ADV
. PUNCT
His ADJ
lack NOUN
of ADP
leadership NOUN
has VERB
t

that ADP
its ADJ
composition NOUN
reflects VERB
regional ADJ
views NOUN
and CCONJ
has VERB
sufficient ADJ
expertise NOUN
in ADP
fields NOUN
like ADP
environmental ADJ
science NOUN
, PUNCT
community NOUN
development NOUN
, PUNCT
and CCONJ
Indigenous ADJ
traditional ADJ
knowledge NOUN
. PUNCT
We PRON
will VERB
end VERB
the DET
practice NOUN
of ADP
having VERB
federal ADJ
Ministers PROPN
interfere VERB
in ADP
the DET
environmental ADJ
assessment NOUN
process NOUN
. PUNCT
We PRON
will VERB
also ADV
ensure VERB
that ADP
environmental ADJ
assessments NOUN
include VERB
an DET
analysis NOUN
of ADP
upstream ADJ
impacts NOUN
and CCONJ
greenhouse NOUN
gas NOUN
emissions NOUN
resulting VERB
from ADP
projects NOUN
under ADP
review NOUN
. PUNCT
We PRON
will VERB
undertake VERB
, PUNCT
in ADP
full ADJ
partnership NOUN
and CCONJ
consultation NOUN
with ADP
First PROPN
Nations PROPN
, PUNCT
Inuit PROPN
, PUNCT
and CCONJ
the DET
Mtis PROPN
Nation PROPN
, PUNCT
a DET
full ADJ
review NOUN
of ADP
laws NOUN


Elections PROPN
Act PROPN
make VERB
it PRON
harder ADJ
for ADP
Indigenous PROPN
Peoples PROPN
to PART
exercise VERB
their ADJ
right NOUN
to PART
vote VERB
. PUNCT
We PRON
will VERB
repeal VERB
those DET
changes NOUN
. PUNCT
Finally ADV
, PUNCT
we PRON
will VERB
ensure VERB
that ADP
the DET
Kelowna PROPN
Accord PROPN
and CCONJ
the DET
spirit NOUN
of ADP
reconciliation NOUN
that ADJ
drove VERB
it PRON
is VERB
embraced VERB
, PUNCT
and CCONJ
its ADJ
objectives NOUN
implemented VERB
in ADP
a DET
manner NOUN
that ADJ
meets VERB
today NOUN
's PART
challenges NOUN
. PUNCT
A DET
NEW ADJ
FISCAL ADJ
RELATIONSHIP NOUN
We PRON
will VERB
expand VERB
investment NOUN
in ADP
First PROPN
Nations PROPN
communities NOUN
and CCONJ
work VERB
toward ADP
forging VERB
a DET
new ADJ
fiscal ADJ
relationship NOUN
with ADP
First PROPN
Nations PROPN
. PUNCT
For ADP
nearly ADV
20 NUM
years NOUN
, PUNCT
investments NOUN
in ADP
First PROPN
Nations PROPN
programs NOUN
have VERB
been VERB
subject ADJ
to ADP
a DET
two N

injury NOUN
, PUNCT
we PRON
will VERB
invest VERB
$ SYM
25 NUM
million NUM
each DET
year NOUN
to PART
expand VERB
access NOUN
to ADP
the DET
Permanent PROPN
Impairment PROPN
Allowance PROPN
. PUNCT
We PRON
will VERB
invest VERB
a DET
further ADV
$ SYM
40 NUM
million NUM
each DET
year NOUN
to PART
provide VERB
injured ADJ
veterans NOUN
with ADP
90 NUM
percent NOUN
of ADP
their ADJ
pre ADJ
- PUNCT
release NOUN
salary NOUN
, PUNCT
and CCONJ
will VERB
index VERB
this DET
benefit NOUN
so ADP
that ADP
it PRON
keeps VERB
pace NOUN
with ADP
inflation NOUN
. PUNCT
EDUCATION NOUN
AND CCONJ
TRAINING PROPN
We PRON
will VERB
honour VERB
the DET
service NOUN
of ADP
our ADJ
veterans NOUN
and CCONJ
provide VERB
new ADJ
career NOUN
opportunities NOUN
through ADP
a DET
new ADJ
Veterans PROPN
Education PROPN
Benefit PROPN
. PUNCT
Veterans NOUN
have VERB
proven VERB
their ADJ
ability NOUN
to PART
work VERB
hard ADV
and CCONJ
succeed VERB
, PUNCT
but CCONJ
many ADJ
find VERB
it PRON
difficult ADJ
to PART
b

investment NOUN
in ADP
border NOUN
infrastructure NOUN
, PUNCT
invest VERB
in ADP
technologies NOUN
to PART
enhance VERB
our ADJ
border NOUN
guards NOUN
' PART
ability NOUN
to PART
detect VERB
and CCONJ
halt VERB
illegal ADJ
guns NOUN
from ADP
the DET
United PROPN
States PROPN
entering VERB
into ADP
Canada PROPN
. PUNCT
We PRON
will VERB
not ADV
create VERB
a DET
new ADJ
national ADJ
long ADJ
- PUNCT
gun NOUN
registry NOUN
to PART
replace VERB
the DET
one NOUN
that ADJ
has VERB
been VERB
dismantled VERB
. PUNCT
We PRON
will VERB
ensure VERB
that ADP
Canada PROPN
becomes VERB
a DET
party NOUN
to ADP
the DET
international ADJ
Arms PROPN
Trade PROPN
Treaty PROPN
. PUNCT
MARIJUANA NOUN
We PRON
will VERB
legalize VERB
, PUNCT
regulate VERB
, PUNCT
and CCONJ
restrict VERB
access NOUN
to ADP
marijuana NOUN
. PUNCT
Canada PROPN
's PART
current ADJ
system NOUN
of ADP
marijuana NOUN
prohibition NOUN
does VERB
not ADV
work VERB
. PUNCT
It PRON
does VERB
not ADV
prevent VERB
young ADJ
people NOUN


new ADJ
plan NOUN
to PART
address VERB
PTSD PROPN
. PUNCT
We PRON
ask VERB
public ADJ
safety NOUN
officers NOUN
to PART
stand VERB
in ADP
harm NOUN
's PART
way NOUN
to PART
protect VERB
Canadians PROPN
and CCONJ
keep VERB
us PRON
safe ADJ
, PUNCT
and CCONJ
they PRON
deserve VERB
the DET
highest ADJ
level NOUN
of ADP
support NOUN
and CCONJ
care VERB
when ADV
the DET
unthinkable ADJ
happens VERB
. PUNCT
We PRON
will VERB
introduce VERB
a DET
public ADJ
safety NOUN
officer NOUN
compensation NOUN
benefit NOUN
to PART
be VERB
paid VERB
to ADP
the DET
families NOUN
of ADP
fire NOUN
fighters NOUN
, PUNCT
police NOUN
officers NOUN
, PUNCT
and CCONJ
paramedics NOUN
killed VERB
or CCONJ
permanently ADV
disabled ADJ
in ADP
the DET
line NOUN
of ADP
duty NOUN
. PUNCT
This DET
$ SYM
300,000 NUM
benefit NOUN
will VERB
offer VERB
a DET
measure NOUN
of ADP
financial ADJ
security NOUN
to ADP
families NOUN
who NOUN
are VERB
struggling VERB
with ADP
permanently ADV
changed VERB
life NOUN
circumstances NOU

expertise NOUN
to ADP
Canada PROPN
's PART
Immigration PROPN
and CCONJ
Refugee PROPN
Board PROPN
. PUNCT
HELP PROPN
FOR ADP
THE DET
WORLD PROPN
'S PART
POOR NOUN
We PRON
will VERB
refocus VERB
our ADJ
development NOUN
assistance NOUN
on ADP
helping VERB
the DET
poorest ADJ
and CCONJ
most ADV
vulnerable ADJ
. PUNCT
Over ADP
the DET
past ADJ
ten NUM
years NOUN
, PUNCT
Stephen PROPN
Harper PROPN
has VERB
steadily ADV
shifted VERB
aid NOUN
away ADV
from ADP
the DET
world NOUN
's PART
poorest ADJ
countries NOUN
, PUNCT
particularly ADV
in ADP
Africa PROPN
. PUNCT
We PRON
will VERB
consult VERB
with ADP
Canadian ADJ
and CCONJ
international ADJ
aid NOUN
organizations NOUN
to PART
review VERB
current ADJ
policies NOUN
and CCONJ
funding NOUN
frameworks NOUN
that ADJ
will VERB
refocus VERB
our ADJ
aid NOUN
priorities NOUN
on ADP
poverty NOUN
reduction NOUN
. PUNCT
As ADP
part NOUN
of ADP
rebalancing VERB
our ADJ
priorities NOUN
, PUNCT
we PRON
will VERB
ensure VERB
that ADP
every DET
dollar NOUN

stage NOUN
has VERB
steadily ADV
diminished VERB
. PUNCT
Instead ADV
of ADP
working VERB
with ADP
other ADJ
countries NOUN
constructively ADV
at ADP
the DET
United PROPN
Nations PROPN
, PUNCT
the DET
Harper PROPN
Conservatives PROPN
have VERB
turned VERB
their ADJ
backs NOUN
on ADP
the DET
UN PROPN
and CCONJ
other ADJ
multilateral ADJ
institutions NOUN
, PUNCT
while ADP
also ADV
weakening VERB
Canada PROPN
's PART
military NOUN
, PUNCT
our ADJ
diplomatic ADJ
service NOUN
, PUNCT
and CCONJ
our ADJ
development NOUN
programs NOUN
. PUNCT
Whether ADP
confronting VERB
climate NOUN
change NOUN
, PUNCT
terrorism NOUN
and CCONJ
radicalization NOUN
, PUNCT
or CCONJ
international ADJ
conflicts NOUN
, PUNCT
the DET
need NOUN
for ADP
effective ADJ
Canadian ADJ
diplomacy NOUN
has VERB
never ADV
been VERB
greater ADJ
than ADP
it PRON
is VERB
today NOUN
. PUNCT
Our ADJ
plan NOUN
will VERB
restore VERB
Canada PROPN
as ADP
a DET
leader NOUN
in ADP
the DET
world NOUN
. PUNCT
Not ADV
only ADV
to PART
pro

refocus VERB
Canada PROPN
's PART
military ADJ
contribution NOUN
in ADP
the DET
region NOUN
on ADP
the DET
training NOUN
of ADP
local ADJ
forces NOUN
, PUNCT
while ADP
providing VERB
more ADJ
humanitarian ADJ
support NOUN
and CCONJ
immediately ADV
welcoming VERB
25,000 NUM
more ADJ
refugees NOUN
from ADP
Syria PROPN
. PUNCT
CENTRAL PROPN
AND CCONJ
EASTERN PROPN
EUROPE PROPN
We PRON
will VERB
remain VERB
fully ADV
committed ADJ
to ADP
Canada PROPN
's PART
existing VERB
military ADJ
contributions NOUN
in ADP
Central ADJ
and CCONJ
Eastern PROPN
Europe PROPN
. PUNCT
This DET
includes VERB
Canada PROPN
's PART
participation NOUN
in ADP
NATO PROPN
assurance NOUN
measures NOUN
in ADP
Central ADJ
and CCONJ
Eastern PROPN
Europe PROPN
( PUNCT
Operation PROPN
REASSURANCE PROPN
) PUNCT
and CCONJ
the DET
multinational ADJ
training NOUN
mission NOUN
in ADP
Ukraine PROPN
( PUNCT
Operation PROPN
UNIFIER PROPN
) PUNCT
. PUNCT


In [52]:
[(i, i.tag_) for i in l_nlp]

[(PLATEFORME, 'IN'),
 (ELECTORALE, 'NNP'),
 (PARTI, 'NNP'),
 (LIBERAL, 'JJ'),
 (DU, 'NNP'),
 (CANADA, 'NNP'),
 (2015, 'CD'),
 (ECONOMIC, 'NNP'),
 (SECURITY, 'NN'),
 (FOR, 'IN'),
 (THE, 'DT'),
 (MIDDLE, 'JJ'),
 (CLASS, 'NN'),
 (A, 'DT'),
 (strong, 'JJ'),
 (economy, 'NN'),
 (starts, 'VBZ'),
 (with, 'IN'),
 (a, 'DT'),
 (strong, 'JJ'),
 (middle, 'JJ'),
 (class, 'NN'),
 (., '.'),
 (Our, 'PRP$'),
 (plan, 'NN'),
 (offers, 'VBZ'),
 (real, 'JJ'),
 (help, 'NN'),
 (to, 'IN'),
 (Canada, 'NNP'),
 ('s, 'POS'),
 (middle, 'JJ'),
 (class, 'NN'),
 (and, 'CC'),
 (all, 'PDT'),
 (those, 'DT'),
 (working, 'VBG'),
 (hard, 'RB'),
 (to, 'TO'),
 (join, 'VB'),
 (it, 'PRP'),
 (., '.'),
 (When, 'WRB'),
 (our, 'PRP$'),
 (middle, 'JJ'),
 (class, 'NN'),
 (has, 'VBZ'),
 (more, 'JJR'),
 (money, 'NN'),
 (in, 'IN'),
 (their, 'PRP$'),
 (pockets, 'NNS'),
 (to, 'TO'),
 (save, 'VB'),
 (,, ','),
 (invest, 'VB'),
 (,, ','),
 (and, 'CC'),
 (grow, 'VB'),
 (the, 'DT'),
 (economy, 'NN'),
 (,, ','),
 (we, 'PRP'),
 (all, 'DT'),
 (be

In [53]:
lib_noun_chunks = [chunk.text for chunk in l_nlp.noun_chunks]
con_noun_chunks = [chunk.text for chunk in c_nlp.noun_chunks]
gre_noun_chunks = [chunk.text for chunk in g_nlp.noun_chunks]
ndp_noun_chunks = [chunk.text for chunk in n_nlp.noun_chunks]

In [54]:
Counter(ndp_noun_chunks).most_common(25)

[('The NDP', 91),
 ('Canada', 70),
 ('We', 66),
 ('Canadians', 61),
 ('we', 49),
 ('it', 48),
 ('they', 45),
 ('the NDP', 34),
 ('who', 26),
 ('Stephen Harper', 25),
 ('women', 24),
 ('provinces', 19),
 ('funding', 16),
 ('territories', 16),
 ('Indigenous communities', 16),
 ('access', 15),
 ('them', 15),
 ('disabilities', 15),
 ('It', 14),
 ('the Conservatives', 14),
 ('you', 13),
 ('the provinces', 13),
 ('Parliament', 13),
 ('life', 12),
 ('health care', 12)]

In [55]:
Counter(gre_noun_chunks).most_common(25)

[('We', 109),
 ('we', 69),
 ('Canada', 54),
 ('Canadians', 32),
 ('it', 29),
 ('It', 27),
 ('they', 23),
 ('us', 19),
 ('who', 19),
 ('Parliament', 13),
 ('government', 12),
 ('time', 12),
 ('our communities', 10),
 ('people', 10),
 ('all Canadians', 10),
 ('education', 9),
 ('you', 8),
 ('access', 8),
 ('Green MPs', 8),
 ('I', 7),
 ('our economy', 7),
 ('the world', 7),
 ('They', 7),
 ('health care', 7),
 ('First Nations', 7)]

In [56]:
Counter(con_noun_chunks).most_common(25)

[('we', 127),
 ('We', 90),
 ('it', 64),
 ('Canadians', 62),
 ('Canada', 62),
 ('A re-elected Conservative Government', 62),
 ('It', 36),
 ('who', 35),
 ('they', 32),
 ('the world', 31),
 ('them', 29),
 ('jobs', 27),
 ('taxes', 24),
 ('our economy', 20),
 ('families', 20),
 ('Our Conservative Government', 19),
 ('children', 19),
 ('small businesses', 17),
 ('the country', 16),
 ('the next four years', 15),
 ('what', 15),
 ('Canadian families', 15),
 ("Canada's economy", 14),
 ('Canadian businesses', 14),
 ('life', 13)]

In [57]:
Counter(lib_noun_chunks).most_common(25)

[('We', 361),
 ('we', 109),
 ('Canadians', 77),
 ('it', 64),
 ('Canada', 58),
 ('they', 49),
 ('who', 44),
 ('Stephen Harper', 35),
 ('territories', 31),
 ('year', 22),
 ('government', 20),
 ('them', 19),
 ('our economy', 19),
 ('It', 18),
 ('the provinces', 16),
 ('provinces', 15),
 ('investments', 15),
 ('time', 14),
 ('Parliament', 14),
 ('part', 12),
 ('our communities', 12),
 ('funding', 11),
 ('communities', 11),
 ('support', 11),
 ('Our plan', 10)]
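
The counts above are dominated by pronouns, and case variants such as 'We' / 'we' or 'The NDP' / 'the NDP' are tallied separately. Here is a minimal sketch of one way to tidy the chunks before counting, assuming plain strings like those in `lib_noun_chunks`. The `PRONOUNS` set and the `clean_chunks` helper are hypothetical names that are not part of the notebook, and the pronoun list is deliberately short.

```python
from collections import Counter

# Hypothetical, deliberately short stop list; extend as needed.
PRONOUNS = {"we", "it", "they", "them", "us", "you", "i", "who", "what"}

def clean_chunks(chunks):
    """Lowercase noun chunks and drop bare pronouns."""
    cleaned = []
    for chunk in chunks:
        text = chunk.strip().lower()
        if text and text not in PRONOUNS:
            cleaned.append(text)
    return cleaned

# Toy input standing in for a list such as ndp_noun_chunks.
sample = ["The NDP", "the NDP", "We", "we", "Canada", "it", "health care"]
chunk_counts = Counter(clean_chunks(sample))
print(chunk_counts.most_common(3))
```

Lowercasing is a blunt choice: it merges 'The NDP' with 'the NDP', but it would also merge an all-caps section heading with ordinary running text, so inspect the results before relying on them.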

In [58]:
print(set(gre_noun_chunks) & set(lib_noun_chunks))

{'health', 'omnibus bills', 'the resources', 'good jobs', 'terrorism', 'the burden', 'entrepreneurs', 'others', 'the rights', 'China', 'First Nations', 'provisions', 'research', 'its budget', 'what', 'our veterans', 'prorogation', 'WE', 'energy', 'injury', 'Canadians', 'our part', 'an election', 'cleaner air', 'benefits', 'Mtis', 'any Canadian', 'science', 'work', 'the same standard', 'kids', 'the Harper Conservatives', 'a world leader', 'carbon emissions', 'reserves', 'Canada Post', 'all parties', 'CANADA', 'affordable rental housing', 'more jobs', 'Calgary', 'the deficit', 'anyone', 'place', 'information', 'pocket', 'the federal government', 'courts', 'our country', 'the United States', 'Canadian workers', 'cooperation', 'decisions', 'the long-form census', 'Bill C-51', 'retirement', 'affordable housing', 'Cabinet', 'parents', 'billions', 'They', 'new Canadians', 'our communities', 'Ottawa', 'school', 'young Canadians', 'our sovereignty', 'subsidies', 'new funding', 'part', 'the prov
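
The set intersection shows which chunks two platforms share, but not how often each one appears. As a rough sketch, `Counter` supports an `&` operator that keeps, for every key present in both counters, the smaller of the two counts. The toy lists and the `shared_chunks` helper below are hypothetical stand-ins for `gre_noun_chunks` and `lib_noun_chunks`.

```python
from collections import Counter

def shared_chunks(chunks_a, chunks_b):
    """Return chunks present in both lists, with the smaller of the two counts."""
    return Counter(chunks_a) & Counter(chunks_b)

# Toy lists standing in for gre_noun_chunks and lib_noun_chunks.
green = ["First Nations", "First Nations", "Parliament", "Green MPs"]
liberal = ["First Nations", "Parliament", "Parliament", "Our plan"]

overlap = shared_chunks(green, liberal)
print(overlap.most_common())
```

Ranking the overlap with `most_common()` puts the chunks that both parties use repeatedly at the top, which is usually easier to read than an unordered set.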

# Named Entity Recognition

In [59]:
nlp.entity

toy = nlp('My cats Lando and Dorothy are sleeping next to me on the couch in our house in Kitchener.')
toy.ents

(Lando, Dorothy, Kitchener)

In [61]:
l_nlp.ents[:20]

(Canada,
 Universal Child Care Benefit,
 Canada,
 Child Benefit,
 Canadian,
 the Canada Child Benefit,
 nine,
 ten,
 Canadian,
 Stephen Harper's,
 four,
 an additional $2,500,
 Child Benefit,
 315,000,
 Canadian,
 Stephen Harper,
 MIDDLE CLASS,
 Canadians,
 Canadians,
 20.5 percent)

In [62]:
for ent in l_nlp.ents:
    print(ent, ent.label_)

for ent in l_nlp.ents:
    if ent.label_ == "NORP":
        print(ent)

Canada GPE
Universal Child Care Benefit ORG
Canada GPE
Child Benefit PERSON
Canadian NORP
the Canada Child Benefit ORG
nine CARDINAL
ten CARDINAL
Canadian NORP
Stephen Harper's PERSON
four CARDINAL
an additional $2,500 MONEY
Child Benefit PERSON
315,000 CARDINAL
Canadian NORP
Stephen Harper PERSON
MIDDLE CLASS PERSON
Canadians NORP
Canadians NORP
20.5 percent PERCENT
22 percent PERCENT
seven percent PERCENT
Canadians NORP
annual DATE
between $44,700 and $89,401 MONEY
up to $670 MONEY
1,340 MONEY
two CARDINAL
one percent PERCENT
Canadians NORP
33 percent PERCENT
more than $200,000 MONEY
each year DATE
$2 billion MONEY
one CARDINAL
Canadians NORP
15 percent PERCENT
Canadian NORP
Stephen Harper's PERSON
Canadian NORP
Canadians NORP
ten years DATE
Stephen Harper PERSON
Canadians NORP
Canadians NORP
40,000 CARDINAL
5,000 CARDINAL
the next three years DATE
$300 million MONEY
Youth Employment PERSON
11,000 CARDINAL
Canadians NORP
Skills Link each year LOC
Canadians NORP
Aboriginal WORK_OF_ART

Canada GPE
71% PERCENT
the harper decade DATE
Resource ORG
Canadians NORP
Indigenous PERSON
Canada GPE
Canadians NORP
  NORP
the National Energy Board ORG
First Nations ORG
Inuit NORP
the Mtis Nation ORG
Crown ORG
Aboriginal and Treaty LAW
the United Nations Declaration on the Rights of Indigenous Peoples LAW
Indigenous Peoples PERSON
Stephen Harper's PERSON
the Fisheries Act LAW
the Navigable Waters Protection Act ORG
Canada GPE
Canada GPE
nearly $40 billion MONEY
each year DATE
Canadian NORP
Canada GPE
the Great Lakes LOC
the St. Lawrence River Basin LOC
the Lake Winnipeg Basin LOC
the Cohen Commission ORG
the Fraser River LOC
$1.5 million MONEY
annual DATE
Conservatives NORP
Canada GPE
Experimental Lakes Area LOC
Stephen Harper's PERSON
Canada GPE
five percent PERCENT
2017 DATE
ten percent PERCENT
2020 DATE
$8 million MONEY
Stephen Harper PERSON
$40 million MONEY
Indigenous Peoples PERSON
PARKS ORG
Canada GPE
National Parks GPE
Canada GPE
National Parks GPE
Canadians NORP
nearly $5 

NORAD ORG
the North Atlantic Treaty Organization ORG
Canadian NORP
Arctic LOC
the Canadian Rangers ORG
the United Nations ORG
Canadians NORP
Canadian Armed Forces ORG
zero CARDINAL
the Canadian Forces' Report on Transformation ORG
The Canadian Armed Forces' ORG
Canada GPE
The Report on Transformation WORK_OF_ART
the Canadian Armed Forces ORG
IRAQ GPE
Canada GPE
Iraq GPE
Canada GPE
25,000 CARDINAL
Syria GPE
CENTRAL ORG
EASTERN ORG
EUROPE LOC
Canada GPE
Central and Eastern Europe LOC
Canada GPE
NATO ORG
Central and Eastern Europe LOC
Ukraine GPE
Canadian
Canadian
Canadian
Canadians
Canadians
Canadians
Canadians
Canadians
Canadian
Canadian
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadian
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadian
Canadians
Canadians
Canadian
Canadians
Canadian
Canadian
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
Canadians
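
Rather than writing one loop per label, the entities can be grouped by label in a single pass. The sketch below works on plain `(text, label)` pairs copied from the output above, so it runs without a language model. The `group_by_label` helper is a hypothetical name and not part of the notebook.

```python
from collections import defaultdict

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# A few (text, label) pairs copied from the output above. With a spaCy Doc,
# the pairs could be built as [(ent.text, ent.label_) for ent in doc.ents].
pairs = [
    ("Canada", "GPE"),
    ("Universal Child Care Benefit", "ORG"),
    ("Canadian", "NORP"),
    ("Stephen Harper", "PERSON"),
    ("Canadians", "NORP"),
    ("Child Benefit", "PERSON"),
]

by_label = group_by_label(pairs)
print(by_label["NORP"])
print(by_label["PERSON"])
```

Grouping this way also makes mislabelled entities easier to spot, such as 'Child Benefit' tagged as PERSON in the output above.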


In [63]:
for ent in l_nlp.ents:
    if ent.label_ == "GPE":
        print(ent)

Canada
Canada
12-month
Canada
Toronto
Vancouver
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
British Columbia's
Canada
Toronto
Alexandra Park
Alexandra Park
Canada
St. John's
Millette
Calgary
Canada
Canada
Canada
Canada
Canada
Canada
Canada
C-525
Canada
Canada
Quebec
Canada
Canada
the United Kingdom
Canada
Quebec
U.S.
Canada
Australia
the United Kingdom
CANADA
NORTH
Canada
North
Canada
North
Canada
Canada
Parliamentary
OTTAWA
Canada
Parliamentary
Canada
Canada
Canada
Canada
Canada
Canada
Paris
Canada
the United States
Mexico
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
National Parks
Canada
National Parks
Canada
Canada
Ontario
Canada
Canada
Canada
Ottawa
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
United States
Canada
Canada
MARIJUANA
Canada
Canada
Canada
Targeted
Canada
Canada
Canada
RESCUE
Canada
Canada
St. John's
Vancouver
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Canada
Ca

In [66]:
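def get_ents(party):
    """Return the GPE (geopolitical) entities spaCy finds in a platform's text."""
    mod = nlp(party)  # run the spaCy pipeline on the raw platform text
    # keep only entities labelled as countries, cities, provinces, and so on
    return [ent for ent in mod.ents if ent.label_ == "GPE"]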

In [67]:
nd = get_ents(ndp)
co = get_ents(con)
gr = get_ents(gre)
li = get_ents(lib)
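
A natural next step is to tally how often each place is mentioned. The lists returned by `get_ents` hold spaCy spans rather than strings, and counting the spans directly would likely treat every mention as a separate item, so this sketch counts by each entity's `.text` instead. The `count_mentions` helper and the `SimpleNamespace` stand-ins are assumptions for illustration only. With the real data, the call would be something like `count_mentions(li)`.

```python
from collections import Counter
from types import SimpleNamespace

def count_mentions(ents):
    """Tally entities by their text; each item must expose a .text attribute."""
    return Counter(ent.text for ent in ents)

# Stand-ins for spaCy spans so the sketch runs without a language model.
fake_ents = [SimpleNamespace(text=t) for t in
             ["Canada", "Toronto", "Canada", "Vancouver", "Canada", "Toronto"]]

place_counts = count_mentions(fake_ents)
print(place_counts.most_common(2))
```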