# Natural Language Processing

### The problem?

- Endless amounts of unstructured data found in emails, tweets, letters, memos, etc.
- Even in transcripts
- How can we make sense of all this data?
- How can we 'easily' find relevant information for our reporting?

### The solution?
- Computer programming to process all that text using **natural language processing**!
- <a href="https://machinelearningmastery.com/natural-language-processing/">Learn more</a> about the complexity and the history of NLP.

### Journalism examples

- <a href="http://doctors.ajc.com/part_1_license_to_betray/">License to betray</a> – Finding word stems and roots to uncover abuse. (<a href="http://doctors.ajc.com/about_this_investigation/?ecmp=doctorssexabuse_microsite_stories">More info</a>)
- <a href="https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/">Federal judge’s rulings favored companies in which he owned stock</a> – Finding all stock owned by judges in disclosure forms and comparing to caseloads.
- <a href="https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html">LAPD underreported serious assaults, skewing crime stats for 8 years</a> – Text classification analysis.

### The tools

- Spacy v. NLTK
- NLTK launched in 2001, Spacy in 2015
- NLTK has grown large and complex, and common tasks often take many separate steps.
- Spacy is lean and modern, and can compute some text 4x to 20x faster than NLTK.
- Spacy does **nearly** everything that NLTK does, but better.
- NLTK, however, is still the library of choice for sentiment analysis.

However, sentiment analysis in journalism can be problematic. Be extra wary of NLP's use for news analysis. AI can easily misinterpret the sentiment in this sentence:

"It is a great movie if you have the taste and sensibilities of a five-year-old boy."

It's best to stick to the following types of analysis:

- Mentions of a word or concept (who said something...when and how many times?)
- Frequency of target terms or topics (how often were keywords used in speeches, transcripts, etc)
- Words over time (a timeline that shows frequency of words over time)
- Missing words (really a flip of words over time to show how people stopped using certain concepts or terms)
- Key people, places, companies (identify proper nouns and places for reporting)
- Comparisons (for example financial disclosures over time...which stocks were added or removed over the years)
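
Several of these analyses come down to counting. As a minimal sketch using only the Python standard library (not Spacy), the snippet below counts how often a few target terms appear in a piece of text. The sample sentence, the term list and the `count_terms` helper are all made up for illustration.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each target term."""
    # lowercase the text and split it into words (letters, digits, apostrophes)
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    # look up only the terms we care about; a Counter returns 0 for missing terms
    return {term: counts[term.lower()] for term in terms}

# hypothetical sample text and target terms
sample = "Taxes went up. The mayor blamed taxes on the council, and the council blamed the mayor."
print(count_terms(sample, ["taxes", "mayor", "council", "budget"]))
```

This only matches single words exactly as typed, so "tax" and "taxes" are counted separately.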

# Working with Spacy

## Step 1. Install Spacy

If this is the first time you are using spacy on this computer, you must first run either ```!conda install``` or ```!pip install```:

### TURN OFF FOR COLAB
Run for ANACONDA

In [None]:
conda install -c conda-forge spacy

### TURN OFF FOR ANACONDA
Run for Colab

In [None]:
## COLAB pip install
# !pip install -U spacy


In [None]:
## import library

import spacy

#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

## Step 2. Install language model


### ANACONDA ONLY

In [None]:
conda install -c conda-forge spacy-model-en_core_web_sm

### COLAB ONLY

In [None]:
# !python -m spacy download en_core_web_sm

In [None]:
## import that language model
import en_core_web_sm

### Load the English model into an ```nlp``` pipeline

In [None]:
## build the nlp pipeline (a callable that tokenizes, parses and runs NER for us)
nlp = en_core_web_sm.load()

In [None]:
## what type of object is nlp
type(nlp)

## Step 3. Text analysis

In [78]:
### Sample English text:
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
We're beaming you aboard Captain Kirk. Or should i call you William Shatner.\
But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. \
The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." \
Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.
'''

In [79]:
## CALL the text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. We\'re beaming you aboard Captain Kirk. Or should i call you William Shatner.But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'

In [80]:
## PRINT the text
print(text)

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. We're beaming you aboard Captain Kirk. Or should i call you William Shatner.But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.



### Tokenize our text

- Tokenizing is always the first step in text analysis. 
- It breaks all text into isolated but related units (including spaces, symbols, punctuation, numbers, words etc.)
- However, it retains the connection between all the words, sentences, and paragraphs.
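
To make the idea concrete, here is a rough sketch of tokenizing with a regular expression from the standard library. It is not how Spacy tokenizes (Spacy applies language-specific rules); the `simple_tokenize` helper is hypothetical and only shows that each token can keep its position in the original text.

```python
import re

def simple_tokenize(text):
    """Return (token, start_offset) pairs for words and punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark or symbol
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]

tokens = simple_tokenize("Microsoft paid $8.5 billion.")
for tok, start in tokens:
    print(f"{tok:{12}} starts at character {start}")
```

This naive pattern breaks `8.5` into three pieces, which is one reason to rely on a purpose-built tokenizer instead.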

In [81]:
## let's run the nlp function and create a spacy doc
doc = nlp(text)

In [None]:
## CALL doc
doc

In [None]:
## what type of data is it?
type(doc)

In [82]:
## show each token
for token in doc:
    print(token)
    print("***********")

On
***********
May
***********
10
***********
,
***********
2011
***********
,
***********
Microsoft
***********
announced
***********
its
***********
acquisition
***********
of
***********
 
***********
Skype
***********
Technologies
***********
,
***********
creator
***********
of
***********
the
***********
 
***********
VoIP
***********
 
***********
service
***********
 
***********
Skype
***********
,
***********
for
***********
$
***********
8.5
***********
billion
***********
.
***********
Microsoft
***********
is
***********
headquartered
***********
near
***********
Seattle
***********
Washington
***********
while
***********
Skype
***********
remains
***********
in
***********
Palo
***********
Alto
***********
,
***********
California
***********
.
***********
Sandeep
***********
Junnarkar
***********
got
***********
this
***********
from
***********
Wikipedia
***********
.
***********
We
***********
're
***********
beaming
***********
you
***********
aboard
***********
Capt

### Parts of speech



In [None]:
## print each token with its part-of-speech numeric code (pos) and label (pos_)
for token in doc:
    print(f"{token.text:{15}} {token.pos:{10}} {token.pos_:{5}}")

## Step 4. Named Entity Recognition (NER)

#### Spacy easily returns the words that matter to us like names of companies, people, places, art works, numbers, etc.

- ```.ents``` ------------> All entities found in a spacy doc object.

- ```ent.text``` ------------> The entity's actual text.

- ```ent.label``` ------------> A numeric ID for the entity's category.

- ```ent.label_``` ------------> The entity's category as a readable string (for example ORG or PERSON).

- ```spacy.explain(ent.label_)``` ---------> A description of the category.




In [None]:
### call text
text

In [None]:
## find all entities
for word in doc.ents:
    print(word)
    

In [None]:
## find all entities with their label

for word in doc.ents:
    print(f"{word}------->{word.label_}")

In [83]:
## find all entities with their label and label descriptors
for word in doc.ents:
    print(f"{word}------->{word.label_}------>{spacy.explain(word.label_)}")

May 10, 2011------->DATE------>Absolute or relative dates or periods
Microsoft------->ORG------>Companies, agencies, institutions, etc.
Skype Technologies------->ORG------>Companies, agencies, institutions, etc.
Skype------->ORG------>Companies, agencies, institutions, etc.
$8.5 billion------->MONEY------>Monetary values, including unit
Microsoft------->ORG------>Companies, agencies, institutions, etc.
Seattle------->GPE------>Countries, cities, states
Washington------->GPE------>Countries, cities, states
Skype------->ORG------>Companies, agencies, institutions, etc.
Palo Alto------->GPE------>Countries, cities, states
California------->GPE------>Countries, cities, states
Sandeep Junnarkar------->PERSON------>People, including fictional
Wikipedia------->ORG------>Companies, agencies, institutions, etc.
Kirk------->PERSON------>People, including fictional
William Shatner------->PERSON------>People, including fictional
Paris------->GPE------>Countries, cities, states
France------->GPE---

### Create a CSV that holds all the organizations/companies in a document

In [None]:
## find all entities and place in a list using list comprehension

entities = [word.text for word in doc.ents]  ## find all entities
ent_labels = [word.label_ for word in doc.ents]  ## find all entity labels

# ent_labels
# entities


In [None]:
### Turn the two lists into a list of dictionaries using a for loop
my_entities_fl = []
for (key, value) in zip(ent_labels, entities):
    mydict = {key: value}
    my_entities_fl.append(mydict)
    
my_entities_fl

In [None]:
### Turn the two lists into a list of dictionaries using
### a list comprehension (one small dictionary per entity)

my_ents_lc = [{k:v} for (k, v) in zip(ent_labels, entities)]
my_ents_lc

In [None]:
## the previous lists hold all entities. 
## let's narrow them down to the orgs/companies
orgs_only = [{k:v} for (k, v) in zip(ent_labels, entities) if k == "ORG"]
orgs_only
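
A list of one-key dictionaries works, but it can be awkward to query. As an alternative sketch, the entities can be grouped into a single dictionary keyed by label. The `labels` and `names` lists below are hand-copied stand-ins for `ent_labels` and `entities` so the snippet runs without Spacy; `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

def group_by_label(labels, names):
    """Build {label: [entity, entity, ...]} from two parallel lists."""
    grouped = defaultdict(list)
    for label, name in zip(labels, names):
        grouped[label].append(name)
    return dict(grouped)

# stand-ins for ent_labels and entities (first few entities from the sample text)
labels = ["DATE", "ORG", "ORG", "ORG", "MONEY", "ORG", "GPE"]
names = ["May 10, 2011", "Microsoft", "Skype Technologies", "Skype",
         "$8.5 billion", "Microsoft", "Seattle"]

by_label = group_by_label(labels, names)
print(by_label["ORG"])
```

With this shape, `by_label["ORG"]` gives every organization in one step, duplicates included.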

In [None]:
## What data types are these?


### Let's deduplicate

We could wait and use unique in Pandas.

In [None]:
## deduplicate a list of dictionaries
# orgs_only = {frozenset(thing.items()) : thing for thing in orgs_only}.values()
# list(orgs_only)
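
If you would rather deduplicate before reaching Pandas, one option is to track which (label, entity) pairs have already been seen. This sketch keeps the first occurrence of each dictionary and preserves the original order; `orgs_sample` is a hand-written stand-in for `orgs_only` and `dedupe_dicts` is a hypothetical helper.

```python
def dedupe_dicts(records):
    """Drop repeated dictionaries, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for record in records:
        # dictionaries are not hashable, so use a tuple of their sorted items as the key
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

orgs_sample = [{"ORG": "Microsoft"}, {"ORG": "Skype Technologies"}, {"ORG": "Skype"},
               {"ORG": "Microsoft"}, {"ORG": "Skype"}, {"ORG": "Wikipedia"}]
print(dedupe_dicts(orgs_sample))
```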

In [None]:
## import pandas
import pandas as pd

In [84]:

## function to find entities
def show_entities(my_text):
  '''
  my_text must be a spacy doc tokenized object; already run through nlp pipeline

  '''
  each_token = "Token"
  entity_type = "Entity"
  entity_def = "Entity Defined"
  print(f"{each_token:{30}}{entity_type:{15}}{entity_def}")
  if my_text.ents:
      for word in my_text.ents:
          print(f"{word.text:{30}} {word.label_:{15}} {str(spacy.explain(word.label_))}")
  else:
      print("There are no entities in this text")


In [85]:
## show entities in my english sentence
show_entities(doc)

Token                         Entity         Entity Defined
May 10, 2011                   DATE            Absolute or relative dates or periods
Microsoft                      ORG             Companies, agencies, institutions, etc.
Skype Technologies             ORG             Companies, agencies, institutions, etc.
Skype                          ORG             Companies, agencies, institutions, etc.
$8.5 billion                   MONEY           Monetary values, including unit
Microsoft                      ORG             Companies, agencies, institutions, etc.
Seattle                        GPE             Countries, cities, states
Washington                     GPE             Countries, cities, states
Skype                          ORG             Companies, agencies, institutions, etc.
Palo Alto                      GPE             Countries, cities, states
California                     GPE             Countries, cities, states
Sandeep Junnarkar              PERSON          Pe

## Specialized function to capture entity types

In [86]:
## create function to return list of dictionaries of entities and entity labels
def find_ner(doc, ent_type):
    '''
    doc: a spacy doc object (text already run through the nlp pipeline)
    ent_type: entity label to keep, e.g. "ORG", "PERSON", "DATE"; must be in quotes
    (named ent_type so it does not shadow Python's built-in type function)
    '''
    ent_labels = [ent.label_ for ent in doc.ents]
    entities = [ent.text for ent in doc.ents]
    return [{key: value} for (key, value) in zip(ent_labels, entities) if key == ent_type]

In [87]:
type(doc)

spacy.tokens.doc.Doc

In [88]:
## test it to find places (GPE)
info_list = find_ner(doc, "GPE")
info_list

[{'GPE': 'Seattle'},
 {'GPE': 'Washington'},
 {'GPE': 'Palo Alto'},
 {'GPE': 'California'},
 {'GPE': 'Paris'},
 {'GPE': 'France'},
 {'GPE': 'New York'}]
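
The notebook set out to produce a CSV, so here is one way to get there with the standard library's `csv` module (Pandas would also work). It reshapes each one-key dictionary into a row with `label` and `entity` columns. The column names, the `entities_to_csv` helper and the output filename are assumptions chosen for illustration; `places` repeats the output above.

```python
import csv

def entities_to_csv(records, path):
    """Write a list of {label: entity} dictionaries to a two-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=["label", "entity"])
        writer.writeheader()
        for record in records:
            # each record holds a single label/entity pair
            for label, entity in record.items():
                writer.writerow({"label": label, "entity": entity})

places = [{"GPE": "Seattle"}, {"GPE": "Washington"}, {"GPE": "Palo Alto"},
          {"GPE": "California"}, {"GPE": "Paris"}, {"GPE": "France"}, {"GPE": "New York"}]
entities_to_csv(places, "places.csv")  # hypothetical filename
```

In the notebook, the same call could take `info_list` directly, for example `entities_to_csv(info_list, "places.csv")`.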

## Install other languages
#### Other languages can be found at https://spacy.io/usage/models

#### Disclaimer: Language models are built by open source communities. English and German are the most advanced language models.

### Spanish language model

### ANACONDA ONLY

In [None]:
conda update -n base -c conda-forge conda

### COLAB & ANACONDA

In [None]:
# !python -m spacy download es_core_news_sm


In [None]:
## import the library and create a separate Spanish nlp pipeline
## (named nlp_es so it does not overwrite the English nlp pipeline)
import es_core_news_sm
nlp_es = es_core_news_sm.load()

In [None]:
### Sample Spanish Text (sorry!)
stext = """
El 10 de mayo de 2011, Microsoft anunció la adquisición de Skype Technologies, \
creador del servicio de VoIP Skype, por 8.500 millones de dólares. Microsoft tiene \
su sede cerca de Seattle, Washington, mientras que Skype permanece en Palo Alto, \
California. Sandeep Junnarkar obtuvo esto de Wikipedia. Pero preferiría ir a París, \
Francia, a ver la Mona Lisa en el Louvre. El río Hudson realmente debería llamarse por \
su nombre nativo original, Mahicantuck, que significa "el río \
que fluye en dos direcciones". Mahicantuck fluye por 315 millas hacia el Océano Atlántico \
desde su origen en Mt. Mercy, el pico más alto del estado de Nueva York.
"""

In [None]:
## tokenize the Spanish text with the Spanish pipeline
doc_s = nlp_es(stext)

In [None]:
## what type of object is it?
type(doc_s)

In [None]:
## show entities
show_entities(doc_s)

## Import file(s)

Unzip and place <a href="https://drive.google.com/file/d/1mWpUK819KlOjsLPe4l1c5dsmXqX2QqVS/view?usp=share_link">this folder</a> at the same level as this notebook. It contains two files.

In [89]:
## import package
import glob

In [93]:
## import demo text
myfiles = glob.glob("nlp-files/*.txt")
myfiles

['nlp-files/biden-africa.txt', 'nlp-files/russia-oil.txt']

In [94]:
## read each document and run it through the nlp pipeline
all_text = []
for file in myfiles:
    print(file)
    with open(file, "r", encoding="utf-8") as some_text:
        demo_text = some_text.read()
        doc = nlp(demo_text)
        all_text.append(doc)
        

nlp-files/biden-africa.txt
nlp-files/russia-oil.txt
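
Appending each doc to a list loses track of which file it came from. A small variation, sketched below with `pathlib`, keeps the filename as a dictionary key. It only reads the raw text; each value could then be passed through `nlp`. The `read_texts` helper is hypothetical, and the folder name is taken from the cell above.

```python
from pathlib import Path

def read_texts(folder, pattern="*.txt"):
    """Return {filename: file contents} for every matching file in folder."""
    folder = Path(folder)
    texts = {}
    if not folder.is_dir():
        # nothing to read if the folder has not been unzipped yet
        return texts
    # sorted() gives a stable order regardless of the operating system
    for path in sorted(folder.glob(pattern)):
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts

raw_texts = read_texts("nlp-files")
print(list(raw_texts))  # filenames found (empty list if the folder is missing)
```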


In [None]:
all_text[0]

In [None]:
## show entities
show_entities(all_text[0])

In [None]:
## show specific entities in my demo_text
find_ner(all_text[0], "PERSON")

In [95]:
## look for all people via list comprehension
all_people = [find_ner(each_doc, "PERSON") for each_doc in all_text]
all_people

[[{'PERSON': 'Xi Jinping'},
  {'PERSON': 'Biden'},
  {'PERSON': 'Biden'},
  {'PERSON': 'Biden'},
  {'PERSON': 'Murithi Mutiga'},
  {'PERSON': 'Macky Sall'},
  {'PERSON': 'Barack Obama'},
  {'PERSON': 'Obama'},
  {'PERSON': 'Antony J. Blinken'},
  {'PERSON': 'Biden'},
  {'PERSON': 'Biden'},
  {'PERSON': 'George W. Bush'},
  {'PERSON': 'Obama'},
  {'PERSON': 'Biden'},
  {'PERSON': 'Judd Devermont'},
  {'PERSON': 'Devermont'},
  {'PERSON': 'Michelle D. Gavin'},
  {'PERSON': 'Clinton'},
  {'PERSON': 'Cameron Hudson'},
  {'PERSON': 'Biden'},
  {'PERSON': 'Abiji Mary Immaculate'},
  {'PERSON': 'Sithembile Mbete'},
  {'PERSON': 'Biden'}],
 [{'PERSON': 'Maciej Onoszko'},
  {'PERSON': 'Slav Okov'},
  {'PERSON': 'Vladimir Putin'},
  {'PERSON': 'Putin'},
  {'PERSON': 'Piotr Naimski'},
  {'PERSON': 'Katja Yafimava'},
  {'PERSON': 'Simone Tagliapietra'},
  {'PERSON': 'Gazprombank'},
  {'PERSON': 'Yafimava'},
  {'PERSON': 'Piotr Skolimowski'},
  {'PERSON': 'Piotr Bujnicki'},
  {'PERSON': 'Ewa Krukow

In [96]:
## package to flatten nested lists
import itertools

In [97]:
## flatten the list of lists into one list
people_list = list(itertools.chain(*all_people))
people_list

[{'PERSON': 'Xi Jinping'},
 {'PERSON': 'Biden'},
 {'PERSON': 'Biden'},
 {'PERSON': 'Biden'},
 {'PERSON': 'Murithi Mutiga'},
 {'PERSON': 'Macky Sall'},
 {'PERSON': 'Barack Obama'},
 {'PERSON': 'Obama'},
 {'PERSON': 'Antony J. Blinken'},
 {'PERSON': 'Biden'},
 {'PERSON': 'Biden'},
 {'PERSON': 'George W. Bush'},
 {'PERSON': 'Obama'},
 {'PERSON': 'Biden'},
 {'PERSON': 'Judd Devermont'},
 {'PERSON': 'Devermont'},
 {'PERSON': 'Michelle D. Gavin'},
 {'PERSON': 'Clinton'},
 {'PERSON': 'Cameron Hudson'},
 {'PERSON': 'Biden'},
 {'PERSON': 'Abiji Mary Immaculate'},
 {'PERSON': 'Sithembile Mbete'},
 {'PERSON': 'Biden'},
 {'PERSON': 'Maciej Onoszko'},
 {'PERSON': 'Slav Okov'},
 {'PERSON': 'Vladimir Putin'},
 {'PERSON': 'Putin'},
 {'PERSON': 'Piotr Naimski'},
 {'PERSON': 'Katja Yafimava'},
 {'PERSON': 'Simone Tagliapietra'},
 {'PERSON': 'Gazprombank'},
 {'PERSON': 'Yafimava'},
 {'PERSON': 'Piotr Skolimowski'},
 {'PERSON': 'Piotr Bujnicki'},
 {'PERSON': 'Ewa Krukowska'}]
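
With a flat list, counting mentions is straightforward. The sketch below tallies how many times each name appears using `collections.Counter`. `people_sample` is a hand-copied subset of the list above so the snippet runs on its own; in the notebook you would loop over `people_list` instead.

```python
from collections import Counter

# hand-copied subset of people_list
people_sample = [{"PERSON": "Xi Jinping"}, {"PERSON": "Biden"}, {"PERSON": "Biden"},
                 {"PERSON": "Barack Obama"}, {"PERSON": "Obama"}, {"PERSON": "Biden"},
                 {"PERSON": "Vladimir Putin"}, {"PERSON": "Putin"}]

# pull the name out of each one-key dictionary, then count
mention_counts = Counter(person["PERSON"] for person in people_sample)

for name, count in mention_counts.most_common(3):
    print(f"{name:{20}} {count}")
```

Nothing here merges name variants, so "Obama" and "Barack Obama" are counted as two different people. The list also inherits any mistakes from the model, such as Gazprombank being labeled a PERSON above.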

## Next week:

- Word frequency
- Context around words
- Any questions you may have...