# Natural Language Processing

### The problem?

- Endless amounts of unstructured data found in emails, tweets, letters, memos, etc.
- Even in transcripts
- How can we make sense of all this data?
- How can we 'easily' find relevant information for our reporting?

### The solution?
- Computer programming to process all that text using **natural language processing**!
- <a href="https://machinelearningmastery.com/natural-language-processing/">Learn more</a> about the complexity and the history of NLP.

### Journalism examples

- <a href="http://doctors.ajc.com/part_1_license_to_betray/">License to betray</a> – Finding word stems and roots to uncover abuse.
- <a href="https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/">Federal judge’s rulings favored companies in which he owned stock</a> – Finding all stock owned by judges in disclosure forms and comparing to caseloads.
- <a href="https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html">LAPD underreported serious assaults, skewing crime stats for 8 years</a> – Text classification analysis.

### The tools

- Spacy vs. NLTK
- NLTK launched in 2001, Spacy in 2015
- NLTK has grown large and complex over the years, and common tasks often take several steps.
- Spacy is lean and modern, and can process some text 4x to 20x faster than NLTK.
- Spacy does **nearly** everything that NLTK does, but better.
- NLTK, however, is still the library of choice for sentiment analysis.

# Working with Spacy

## Step 1. Install Spacy

If this is the first time you are using spacy on this computer, you must first run either the ```!conda install``` or the ```!pip install``` command:

In [None]:
## Conda install or...
# !conda install -c conda-forge spacy


In [1]:
!pip install -U spacy


Requirement already up-to-date: spacy in /Users/sandeep.junnarkar/opt/anaconda3/lib/python3.8/site-packages (2.3.2)


In [2]:
## import library.

import spacy

#### Troubleshoot here if problems with setup:
https://github.com/explosion/spacy-models

## Step 2. Install language model


In [3]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


### Place English library into a ```nlp``` pipeline

In [4]:
## build nlp pipeline (a function that will tokenize, parse and run NER for us)
nlp = spacy.load("en_core_web_sm")

In [5]:
## what type of object is nlp
type(nlp)

spacy.lang.en.English

## Step 3. Text analysis

In [51]:
### Sample English text:
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. \
Mount Washington, which is really Agiocochook, is the highest peak in the Northeastern United States \
at 6,288.2 ft and the most topographically prominent mountain east \
of the Mississippi River. It's not in Mississippi.
'''

### Tokenize our text

- Tokenizing is always the first step in text analysis. 
- It breaks all text into isolated but related units (including spaces, symbols, punctuation, numbers, words etc.)
- However, it retains the connection between all the words, sentences, and paragraphs.
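To get a feel for what a tokenizer does, here is a minimal sketch that uses only Python's `re` module. The pattern and the `rough_tokenize` helper are assumptions made up for illustration; they are not Spacy's rules, and Spacy handles many more cases (contractions such as `he'd`, for example). Keeping each token's starting offset is one simple way to retain the connection back to the original text.

```python
import re

# Rough pattern (an illustrative assumption, not Spacy's tokenizer rules):
# multi-digit numbers such as 8.5 or 6,288.2, then words, then any single symbol.
TOKEN_PATTERN = re.compile(r"\d[\d,.]*\d|\w+|[^\w\s]")

def rough_tokenize(text):
    """Return (token, start_offset) pairs so each token can be traced back to the text."""
    return [(match.group(), match.start()) for match in TOKEN_PATTERN.finditer(text)]

sample = "Microsoft announced its acquisition of Skype for $8.5 billion."
pairs = rough_tokenize(sample)
tokens = [tok for tok, _ in pairs]
print(tokens)
```

Note how `$`, `8.5` and the final period come out as separate units, much like the Spacy token list in this notebook.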

In [52]:
## let's run the nlp function and create a spacy doc
doc = nlp(text)

In [25]:
## what type of data is it?
type(doc)

spacy.tokens.doc.Doc

In [26]:
## show each token
for token in doc:
    print(token)

On
May
10
,
2011
,
Microsoft
announced
its
acquisition
of
 
Skype
Technologies
,
creator
of
the
 
VoIP
 
service
 
Skype
,
for
$
8.5
billion
.
Microsoft
is
headquartered
near
Seattle
Washington
while
Skype
remains
in
Palo
Alto
,
California
.
Sandeep
Junnarkar
got
this
from
Wikipedia
.
But
he
'd
rather
head
to
Paris
,
France
to
see
the
Mona
Lisa
at
the
Louvre
.
Mount
Washington
,
which
is
really
Agiocochook
,
is
the
highest
peak
in
the
Northeastern
United
States
at
6,288.2
ft
and
the
most
topographically
prominent
mountain
east
of
the
Mississippi
River
.
It
's
not
in
Mississippi
.




### Stop Words

- These are common words that add no additional meaning to our analysis.
- Words like ```the```, ```and``` and ```any```.
- Spacy has just over 320 ```stop words``` in its default library.
- Read more on <a href="https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47">stop words</a>
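Before looking at Spacy's own list, the idea can be sketched in plain Python. The stop set below is a tiny hand-picked sample (an assumption for illustration; Spacy's default list is far larger), and the tokens are typed in by hand rather than produced by a model.

```python
# A tiny hand-picked stop set, for illustration only (not Spacy's full list).
SAMPLE_STOP_WORDS = {"is", "while", "in", "the", "of", "its"}

tokens = ["Microsoft", "is", "headquartered", "near", "Seattle", "Washington",
          "while", "Skype", "remains", "in", "Palo", "Alto"]

# Compare in lowercase so "The" and "the" are treated alike.
kept = [tok for tok in tokens if tok.lower() not in SAMPLE_STOP_WORDS]
print(kept)
```

What remains are the words that carry the reporting value: the companies, the places and the verbs that connect them.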

In [28]:
## show all default stop words
print(nlp.Defaults.stop_words)

{'of', 'such', 'might', 'sixty', 'its', 'third', 'yourself', 'former', 'however', 'others', 'ever', 'top', 'because', "'ll", 'upon', 'used', 'he', 'less', 'behind', 'ca', 'anyway', 'serious', 'everything', 'i', '‘ve', "'s", '‘ll', 'how', 'keep', 'us', '‘m', 'her', '’m', 'themselves', 'seems', 'beyond', 'whereas', 'had', "'m", "'ve", 'formerly', 'somewhere', 'too', 'this', '’re', 'his', 'since', 'move', 'within', 'do', 'still', 'doing', 'cannot', 'am', 'using', 'beside', 'be', 'must', 'about', 'me', 'whoever', 'only', 'off', 'whole', '’ll', '’ve', 'in', 'hundred', 'whereupon', 'get', 'ourselves', 'himself', 'five', 'further', 'has', 'fifty', 'noone', 'latter', 'back', 'as', 'whose', 'here', 'did', 'many', 'eleven', 'during', 'through', 'these', 'who', 'hereupon', 'regarding', 'via', 'any', 'until', 'besides', 'a', 'on', 'thus', 'hereby', 'four', 'throughout', 'last', 'together', '’s', 'take', 'so', 'either', 'alone', 'may', 'and', 'whenever', 'among', 'make', 'for', 'becomes', 'nine', '

In [33]:
## check if a word (have, near, be) is a stop word 
nlp.vocab["have"].is_stop

True

In [34]:
## how many stop words do we have?
len(nlp.Defaults.stop_words)

326

In [35]:
## Add your own stop word
nlp.Defaults.stop_words.add("lol")
nlp.vocab["lol"].is_stop = True

In [36]:
## CHECK IF 'lol' is a stop word
nlp.vocab['lol'].is_stop

True

In [37]:
## how many stop words do we have now?
len(nlp.Defaults.stop_words)

327

In [39]:
## Remove a stop word from list because it is relevant.
## notice the word "empty" is a stop word.
# nlp.vocab['empty'].is_stop
nlp.Defaults.stop_words.remove("empty")
nlp.vocab["empty"].is_stop = False


In [40]:
## CHECK IF 'empty' is a stop word
nlp.vocab["empty"].is_stop

False

### Parts of speech



In [42]:
## print each token with its numeric part-of-speech code and its part-of-speech tag
for token in doc:
    print(f"{token.text:{15}} {token.pos:{10}} {token.pos_:{5}}")

On                      85 ADP  
May                     96 PROPN
10                      93 NUM  
,                       97 PUNCT
2011                    93 NUM  
,                       97 PUNCT
Microsoft               96 PROPN
announced              100 VERB 
its                     90 DET  
acquisition             92 NOUN 
of                      85 ADP  
                       103 SPACE
Skype                   96 PROPN
Technologies            96 PROPN
,                       97 PUNCT
creator                 92 NOUN 
of                      85 ADP  
the                     90 DET  
                       103 SPACE
VoIP                    92 NOUN 
                       103 SPACE
service                 92 NOUN 
                       103 SPACE
Skype                   96 PROPN
,                       97 PUNCT
for                     85 ADP  
$                       99 SYM  
8.5                     93 NUM  
billion                 93 NUM  
.                       97 PUNCT
Microsoft 

## Step 4. Named Entity Recognition (NER)

#### Spacy easily returns the words that matter to us like names of companies, people, places, art works, numbers, etc.

- ```.ents``` ------------> Finds all entities in doc spacy object.

- ```ent.text``` ------------> The actual text.

- ```ent.label``` ------------> A numeric code for the entity.

- ```ent.label_``` ------------> The word's entity category.

- ```spacy.explain(ent.label_)``` ---------> A description of the category.




In [50]:
## find all entities

for word in doc.ents:
    print(spacy.explain(word.label_))

Absolute or relative dates or periods
Companies, agencies, institutions, etc.
Companies, agencies, institutions, etc.
Objects, vehicles, foods, etc. (not services)
Companies, agencies, institutions, etc.
Monetary values, including unit
Companies, agencies, institutions, etc.
Countries, cities, states
Countries, cities, states
Companies, agencies, institutions, etc.
Countries, cities, states
Countries, cities, states
People, including fictional
Countries, cities, states
Countries, cities, states
Titles of books, songs, etc.
Non-GPE locations, mountain ranges, bodies of water
People, including fictional
Non-GPE locations, mountain ranges, bodies of water
Measurements, as of weight or distance
Non-GPE locations, mountain ranges, bodies of water
Non-GPE locations, mountain ranges, bodies of water
Countries, cities, states


In [45]:
## find all entities with their label
for word in doc.ents:
    print(f"{word.text} ---->{word.label_}")
    

May 10, 2011 ---->DATE
Microsoft ---->ORG
Skype Technologies ---->ORG
VoIP  ---->PRODUCT
Skype ---->ORG
$8.5 billion ---->MONEY
Microsoft ---->ORG
Seattle ---->GPE
Washington ---->GPE
Skype ---->ORG
Palo Alto ---->GPE
California ---->GPE
Sandeep Junnarkar ---->PERSON
Paris ---->GPE
France ---->GPE
the Mona Lisa ---->WORK_OF_ART
Mount Washington ---->LOC
Agiocochook ---->PERSON
the Northeastern United States ---->LOC
6,288.2 ft ---->QUANTITY
mountain east ---->LOC
the Mississippi River ---->LOC
Mississippi ---->GPE


In [48]:
## find all entities with their label and label descriptors
for word in doc.ents:
    print(f"{word.text} ---->{word.label} ----->{word.label_} ---->{spacy.explain(word.label_)}")

May 10, 2011 ---->391 ----->DATE ---->Absolute or relative dates or periods
Microsoft ---->383 ----->ORG ---->Companies, agencies, institutions, etc.
Skype Technologies ---->383 ----->ORG ---->Companies, agencies, institutions, etc.
VoIP  ---->386 ----->PRODUCT ---->Objects, vehicles, foods, etc. (not services)
Skype ---->383 ----->ORG ---->Companies, agencies, institutions, etc.
$8.5 billion ---->394 ----->MONEY ---->Monetary values, including unit
Microsoft ---->383 ----->ORG ---->Companies, agencies, institutions, etc.
Seattle ---->384 ----->GPE ---->Countries, cities, states
Washington ---->384 ----->GPE ---->Countries, cities, states
Skype ---->383 ----->ORG ---->Companies, agencies, institutions, etc.
Palo Alto ---->384 ----->GPE ---->Countries, cities, states
California ---->384 ----->GPE ---->Countries, cities, states
Sandeep Junnarkar ---->380 ----->PERSON ---->People, including fictional
Paris ---->384 ----->GPE ---->Countries, cities, states
France ---->384 ----->GPE ---->Co

### Create a CSV that holds all the organizations/companies in a document

In [53]:
## find all entities and place in a list using list comprehension
entities = [word.text for word in doc.ents]
ent_labels = [word.label_ for word in doc.ents]

# entities = []
# for word in doc.ents:
#     entities.append(word.text)

In [57]:
print(entities)
print(ent_labels)

['May 10, 2011', 'Microsoft', 'Skype Technologies', 'VoIP\xa0', 'Skype', '$8.5 billion', 'Microsoft', 'Seattle', 'Washington', 'Skype', 'Palo Alto', 'California', 'Sandeep Junnarkar', 'Paris', 'France', 'the Mona Lisa', 'Mount Washington', 'Agiocochook', 'the Northeastern United States', '6,288.2 ft', 'mountain east', 'the Mississippi River', 'Mississippi']
['DATE', 'ORG', 'ORG', 'PRODUCT', 'ORG', 'MONEY', 'ORG', 'GPE', 'GPE', 'ORG', 'GPE', 'GPE', 'PERSON', 'GPE', 'GPE', 'WORK_OF_ART', 'LOC', 'PERSON', 'LOC', 'QUANTITY', 'LOC', 'LOC', 'GPE']
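A quick tally of the labels can help spot patterns (or oddities) before building anything more elaborate. This is a minimal sketch that assumes the `ent_labels` list printed above has been copied in by hand, so it runs without Spacy.

```python
from collections import Counter

# Labels copied from the printed ent_labels list above.
ent_labels = ['DATE', 'ORG', 'ORG', 'PRODUCT', 'ORG', 'MONEY', 'ORG', 'GPE', 'GPE',
              'ORG', 'GPE', 'GPE', 'PERSON', 'GPE', 'GPE', 'WORK_OF_ART', 'LOC',
              'PERSON', 'LOC', 'QUANTITY', 'LOC', 'LOC', 'GPE']

label_counts = Counter(ent_labels)

# Print the labels from most to least frequent.
for label, count in label_counts.most_common():
    print(f"{label:{15}} {count}")
```

Two `PERSON` entities in a text that names only one person is a hint to look closer: here the model tagged `Agiocochook`, a mountain, as a person.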


In [59]:
### Turn the two lists into a list of dictionaries using a for loop
my_entities_fl = []
for (key, value) in zip(ent_labels, entities):
    mydict = {key: value}
    my_entities_fl.append(mydict)

my_entities_fl

[{'DATE': 'May 10, 2011'},
 {'ORG': 'Microsoft'},
 {'ORG': 'Skype Technologies'},
 {'PRODUCT': 'VoIP\xa0'},
 {'ORG': 'Skype'},
 {'MONEY': '$8.5 billion'},
 {'ORG': 'Microsoft'},
 {'GPE': 'Seattle'},
 {'GPE': 'Washington'},
 {'ORG': 'Skype'},
 {'GPE': 'Palo Alto'},
 {'GPE': 'California'},
 {'PERSON': 'Sandeep Junnarkar'},
 {'GPE': 'Paris'},
 {'GPE': 'France'},
 {'WORK_OF_ART': 'the Mona Lisa'},
 {'LOC': 'Mount Washington'},
 {'PERSON': 'Agiocochook'},
 {'LOC': 'the Northeastern United States'},
 {'QUANTITY': '6,288.2 ft'},
 {'LOC': 'mountain east'},
 {'LOC': 'the Mississippi River'},
 {'GPE': 'Mississippi'}]

In [60]:
### Turn the two lists into a list of dictionaries using a list comprehension
my_entities = [{key: value} for (key, value) in zip(ent_labels, entities)]
my_entities

[{'DATE': 'May 10, 2011'},
 {'ORG': 'Microsoft'},
 {'ORG': 'Skype Technologies'},
 {'PRODUCT': 'VoIP\xa0'},
 {'ORG': 'Skype'},
 {'MONEY': '$8.5 billion'},
 {'ORG': 'Microsoft'},
 {'GPE': 'Seattle'},
 {'GPE': 'Washington'},
 {'ORG': 'Skype'},
 {'GPE': 'Palo Alto'},
 {'GPE': 'California'},
 {'PERSON': 'Sandeep Junnarkar'},
 {'GPE': 'Paris'},
 {'GPE': 'France'},
 {'WORK_OF_ART': 'the Mona Lisa'},
 {'LOC': 'Mount Washington'},
 {'PERSON': 'Agiocochook'},
 {'LOC': 'the Northeastern United States'},
 {'QUANTITY': '6,288.2 ft'},
 {'LOC': 'mountain east'},
 {'LOC': 'the Mississippi River'},
 {'GPE': 'Mississippi'}]
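A list of single-key dictionaries works, but looking up "all the ORGs" means scanning the whole list. One alternative, sketched here as an assumption rather than the notebook's method, is to group the text under each label with `defaultdict`. The two lists are a shortened hand copy of the ones printed earlier, and the stray non-breaking space after `VoIP` is cleaned up along the way.

```python
from collections import defaultdict

# Shortened copies of the entities and ent_labels lists from this notebook.
entities = ['May 10, 2011', 'Microsoft', 'Skype Technologies', 'VoIP\xa0',
            'Skype', '$8.5 billion', 'Microsoft']
ent_labels = ['DATE', 'ORG', 'ORG', 'PRODUCT', 'ORG', 'MONEY', 'ORG']

by_label = defaultdict(list)
for label, text in zip(ent_labels, entities):
    # Replace non-breaking spaces and trim so 'VoIP\xa0' becomes 'VoIP'.
    by_label[label].append(text.replace('\xa0', ' ').strip())

print(dict(by_label))
```

With this layout, `by_label["ORG"]` returns every organization in one step.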

In [71]:
## the previous lists hold all entities. 
## let's narrow them down to the orgs/companies
all_orgs = [{key: value} for (key, value) in zip(ent_labels, entities) if key == "ORG"]
all_orgs

[{'ORG': 'Microsoft'},
 {'ORG': 'Skype Technologies'},
 {'ORG': 'Skype'},
 {'ORG': 'Microsoft'},
 {'ORG': 'Skype'}]

In [72]:
## What data types are these?
for thing in all_orgs:
    for key, value in thing.items():
        print(f"{type(value)}")

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


In [None]:
## Let's make sure all the key and value pairs are strings 
## instead of spacy objects so we can move them into a df and csv


In [None]:
## confirm key, value are both strings


### Let's deduplicate

In [75]:
## deduplicate the list of dictionaries
orgs_only = {frozenset(item.items()) : item for item in all_orgs}.values()
list(orgs_only)

[{'ORG': 'Microsoft'}, {'ORG': 'Skype Technologies'}, {'ORG': 'Skype'}]
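The `frozenset` trick is needed because dictionaries are not hashable. If you only need the names, a simpler route (sketched here with a hand-typed list) is to deduplicate plain strings with `dict.fromkeys`, which keeps the first occurrence of each name in its original order on Python 3.7 and later.

```python
# Organization names as plain strings, in the order they appear in the text.
org_names = ["Microsoft", "Skype Technologies", "Skype", "Microsoft", "Skype"]

# Dictionary keys are unique and keep insertion order, so this drops repeats
# while preserving the order of first appearance.
unique_orgs = list(dict.fromkeys(org_names))
print(unique_orgs)
```

A plain `set(org_names)` would also remove duplicates, but it does not guarantee the order.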

In [76]:
## import pandas
import pandas as pd

In [77]:
## use pandas to write to csv file
filename = "test_entities.csv"
df = pd.DataFrame(list(orgs_only))  ## turn our list of dicts into a dataframe, which we'll call df
df.to_csv(filename, encoding='utf-8', index=False)

print(f"{filename} is in your project folder!")

test_entities.csv is in your project folder!
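The pandas cell produces a CSV with a single `ORG` column. If you later want other entity types in the same file, a two-column layout may be easier to filter. Here is a minimal sketch using Python's built-in `csv` module; the column names `label` and `text` are assumptions chosen for illustration, and the rows are written to an in-memory buffer so the example leaves no file behind.

```python
import csv
import io

# One row per entity, with the label and the text in separate columns.
rows = [
    {"label": "ORG", "text": "Microsoft"},
    {"label": "ORG", "text": "Skype Technologies"},
    {"label": "ORG", "text": "Skype"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["label", "text"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, swap the buffer for `open("some_name.csv", "w", newline="", encoding="utf-8")` inside a `with` block.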


In [78]:

## function to find entities
def show_entities(my_text):
    each_token = "Token"
    entity_type = "Entity"
    entity_def = "Entity Defined"
    print(f"{each_token:{30}}{entity_type:{15}}{entity_def}")
    if my_text.ents:
        for word in my_text.ents:
            print(f"{word.text:{30}} {word.label_:{15}} {str(spacy.explain(word.label_))}")
    else:
        print("There are no entities in this text")


In [79]:
words = [token.text.replace(u'\xa0', ' ') for token in doc if not token.is_stop and not token.is_punct]
print(words)

['10', '2011', 'Microsoft', 'announced', 'acquisition', ' ', 'Skype', 'Technologies', 'creator', ' ', 'VoIP', ' ', 'service', ' ', 'Skype', '$', '8.5', 'billion', 'Microsoft', 'headquartered', 'near', 'Seattle', 'Washington', 'Skype', 'remains', 'Palo', 'Alto', 'California', 'Sandeep', 'Junnarkar', 'got', 'Wikipedia', 'head', 'Paris', 'France', 'Mona', 'Lisa', 'Louvre', 'Mount', 'Washington', 'Agiocochook', 'highest', 'peak', 'Northeastern', 'United', 'States', '6,288.2', 'ft', 'topographically', 'prominent', 'mountain', 'east', 'Mississippi', 'River', 'Mississippi', '\n']


In [80]:
## show entities in my english sentence
show_entities(doc)

Token                         Entity         Entity Defined
May 10, 2011                   DATE            Absolute or relative dates or periods
Microsoft                      ORG             Companies, agencies, institutions, etc.
Skype Technologies             ORG             Companies, agencies, institutions, etc.
VoIP                           PRODUCT         Objects, vehicles, foods, etc. (not services)
Skype                          ORG             Companies, agencies, institutions, etc.
$8.5 billion                   MONEY           Monetary values, including unit
Microsoft                      ORG             Companies, agencies, institutions, etc.
Seattle                        GPE             Countries, cities, states
Washington                     GPE             Countries, cities, states
Skype                          ORG             Companies, agencies, institutions, etc.
Palo Alto                      GPE             Countries, cities, states
California                   

## Word Frequency

In [81]:
from collections import Counter  ## a class that counts how often each item appears
## Counter(some_variable)
## variable_name.most_common(some_number)

## remove stop words, punctuation and non-breaking spaces
words = [token.text for token in doc if not token.is_stop and not token.is_punct and token.text != '\xa0']
word_freq = Counter(words)
common_words = word_freq.most_common(25)  ## use most_common()
print(common_words)

[('Skype', 3), ('Microsoft', 2), ('Washington', 2), ('Mississippi', 2), ('10', 1), ('2011', 1), ('announced', 1), ('acquisition', 1), ('Technologies', 1), ('creator', 1), ('VoIP', 1), ('service', 1), ('$', 1), ('8.5', 1), ('billion', 1), ('headquartered', 1), ('near', 1), ('Seattle', 1), ('remains', 1), ('Palo', 1), ('Alto', 1), ('California', 1), ('Sandeep', 1), ('Junnarkar', 1), ('got', 1)]
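Raw counts treat `Skype` and `skype` as different words, and whitespace tokens can sneak in. As a hedged sketch (using a hand-copied sample of words rather than a live Spacy doc), lowercasing each word and skipping whitespace-only tokens before counting gives a cleaner tally.

```python
from collections import Counter

# A hand-copied sample of tokens, including whitespace tokens like '\xa0' and '\n'.
words = ["Microsoft", "announced", "acquisition", "\xa0", "Skype", "Technologies",
         "creator", "\xa0", "VoIP", "service", "Skype", "Microsoft", "Skype", "\n"]

# w.strip() is empty for whitespace-only tokens, so they are filtered out.
cleaned = [w.lower() for w in words if w.strip()]

word_freq = Counter(cleaned)
top_two = word_freq.most_common(2)
print(top_two)
```

Lowercasing is a trade-off: it merges variants of the same word, but it also hides the capitalization that distinguishes a proper noun from a common one.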


## Install other languages
#### Other languages can be found at https://spacy.io/usage/models

#### Disclaimer: Language models are built by open source communities. English and German are the most advanced language models.

### Spanish language model

In [82]:
## download the Spanish language model
!python -m spacy download es_core_news_sm

✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')


In [83]:
## import the library and create nlp pipeline
import es_core_news_sm
nlp = es_core_news_sm.load()

In [84]:
### Sample Spanish Text (sorry!)
stext = """
El 10 de mayo de 2011, Microsoft anunció la adquisición de Skype Technologies, \
creador del servicio de VoIP Skype, por $ 8.5 mil millones. \
Microsoft tiene su sede cerca de Seattle Washington, mientras que Skype permanece en Palo Alto, California. \
Sandeep Junnarkar obtuvo esto de Wikipedia. \
Pero preferiría ir a París, Francia, para ver la Mona Lisa en el Louvre. \
Mount Washington, que en realidad es Agiocochook, es el pico más alto del noreste de Estados Unidos.
a 6.288,2 pies y la montaña más prominente topográficamente al este \
del río Mississippi.
"""

In [85]:
## run the nlp pipeline on the Spanish text to create a spacy doc
doc = nlp(stext)

In [88]:
type(doc)

spacy.tokens.doc.Doc

In [86]:
## show the tokens
for token in doc:
    print(token)



El
10
de
mayo
de
2011
,
Microsoft
anunció
la
adquisición
de
Skype
Technologies
,
creador
del
servicio
de
VoIP
Skype
,
por
$
8.5
mil
millones
.
Microsoft
tiene
su
sede
cerca
de
Seattle
Washington
,
mientras
que
Skype
permanece
en
Palo
Alto
,
California
.
Sandeep
Junnarkar
obtuvo
esto
de
Wikipedia
.
Pero
preferiría
ir
a
París
,
Francia
,
para
ver
la
Mona
Lisa
en
el
Louvre
.
Mount
Washington
,
que
en
realidad
es
Agiocochook
,
es
el
pico
más
alto
del
noreste
de
Estados
Unidos
.


a
6.288,2
pies
y
la
montaña
más
prominente
topográficamente
al
este
del
río
Mississippi
.




In [87]:
## show entities
show_entities(doc)

Token                         Entity         Entity Defined
Microsoft                      ORG             Companies, agencies, institutions, etc.
Skype Technologies             ORG             Companies, agencies, institutions, etc.
VoIP Skype                     MISC            Miscellaneous entities, e.g. events, nationalities, products or works of art
Microsoft                      ORG             Companies, agencies, institutions, etc.
Seattle Washington             LOC             Non-GPE locations, mountain ranges, bodies of water
Skype                          MISC            Miscellaneous entities, e.g. events, nationalities, products or works of art
Palo Alto                      LOC             Non-GPE locations, mountain ranges, bodies of water
California                     LOC             Non-GPE locations, mountain ranges, bodies of water
Sandeep Junnarkar              LOC             Non-GPE locations, mountain ranges, bodies of water
Wikipedia                      MISC
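The Spanish output uses a smaller set of labels than the English one: cities show up as `LOC` rather than `GPE`, and there is no `DATE` entity at all. To compare results across the two models, one option is to collapse the English labels into the coarser set. The mapping below and the `to_coarse` helper are hypothetical, written for illustration; they are not part of Spacy, and the `PER` label is assumed from the Spanish model's label scheme since it does not appear in the output shown.

```python
# Hypothetical mapping from English-model labels to the coarser label set.
COARSE_LABELS = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "PRODUCT": "MISC",
    "WORK_OF_ART": "MISC",
}

def to_coarse(label):
    """Return the coarse label, or None when there is no counterpart (e.g. DATE)."""
    return COARSE_LABELS.get(label)

english_entities = [("May 10, 2011", "DATE"), ("Microsoft", "ORG"), ("Seattle", "GPE"),
                    ("Sandeep Junnarkar", "PERSON"), ("the Mona Lisa", "WORK_OF_ART")]

# Keep only the entities that have a counterpart in the coarse scheme.
harmonized = [(text, to_coarse(label)) for text, label in english_entities
              if to_coarse(label) is not None]
print(harmonized)
```

Even after harmonizing, the two models will disagree; the Spanish output tags `Sandeep Junnarkar` as a location, so treat any comparison as a starting point for checking by hand.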

### Chinese language model

In [None]:
## !python install the library


In [None]:
## import the library and create nlp pipeline


In [None]:
### Sample Chinese Text (sorry!)
ctext = '''
2011年5月10日，微軟宣布收購Skype Technologies，
VoIP服務的創造者，價格為85億美元。
微軟總部位於華盛頓州西雅圖市附近，而Skype仍位於加利福尼亞州帕洛阿爾托。\
Sandeep Junnarkar從Wikipedia獲得了此信息。\
但他寧願前往法國巴黎在羅浮宮看《蒙娜麗莎》。\
華盛頓山（實際上是Agiocochook）是美國東北部的最高峰\
位於6,288.2英尺，是東面地形最突出的山脈\
密西西比河。
'''

In [None]:
## create a spacy doc object


In [None]:
## run our function!
