# AI - Natural Language Processing

### The problem?

- Endless amounts of unstructured data found in emails, tweets, letters, memos and even transcripts.
- How can we make sense of all this data?
- How can we 'easily' find relevant information for our reporting?

### The solution?
- Artificial Intelligence to process all that text using **natural language processing**!
- <a href="https://machinelearningmastery.com/natural-language-processing/">Learn more</a> about the complexity and the history of NLP.
- The use of ```large language models```!

### Journalism examples

- <a href="http://doctors.ajc.com/part_1_license_to_betray/">License to betray</a> – Finding word stems and roots to uncover abuse. (<a href="http://doctors.ajc.com/about_this_investigation/?ecmp=doctorssexabuse_microsite_stories">More info</a>)
- <a href="https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/">Federal judge’s rulings favored companies in which he owned stock</a> – Finding all stock owned by judges in disclosure forms and comparing to caseloads.
- <a href="https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html">LAPD underreported serious assaults, skewing crime stats for 8 years</a> – Text classification analysis.

### The tools

- spaCy vs. NLTK
- NLTK launched in 2001, spaCy in 2015.
- NLTK has become bloated and complex, often requiring many steps for a single task.
- spaCy is lean and modern, and can process some text 4x to 20x faster than NLTK.
- spaCy does **nearly** everything that NLTK does, but better.
- NLTK, however, is still the library of choice for sentiment analysis.

However, sentiment analysis in journalism can be problematic. Be extra wary of NLP's use for news analysis. AI can easily misinterpret the sentiment in this sentence:

"It is a great movie if you have the taste and sensibilities of a five-year-old boy."

It's best to stick to the following types of analysis:

- Mentions of a word or concept (who said something...when and how many times?)
- Frequency of target terms or topics (how often were keywords used in speeches, transcripts, etc.)
- Words over time (a timeline that shows frequency of words over time)
- Missing words (really a flip of words over time to show how people stopped using certain concepts or terms)
- Key people, places, companies (identify proper nouns and places for reporting)
- Comparisons (for example financial disclosures over time...which stocks were added or removed over the years)
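
Several of the analyses above come down to counting. As a minimal sketch using only Python's standard library (the mini-corpus and the `count_terms` helper are made up for illustration and are not part of this notebook's data), here is one way to tally target terms by year:

```python
import re
from collections import Counter

# Hypothetical mini-corpus: (year, text) pairs invented for illustration.
speeches = [
    (2021, "Water infrastructure needs funding. Infrastructure is aging."),
    (2022, "Funding for clean water arrived."),
    (2023, "Clean water, clean air, and jobs."),
]

def count_terms(text, targets):
    """Count how often each target term appears (case-insensitive, whole words)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Counter returns 0 for words it never saw, which is useful for "missing words"
    return {term: counts[term] for term in targets}

targets = ["infrastructure", "water", "clean"]
timeline = {year: count_terms(text, targets) for year, text in speeches}

for year, counts in timeline.items():
    print(year, counts)
```

A term whose count drops to zero in later years is the "missing words" signal described above.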

# Working with spaCy

## Step 1. Install spaCy

If this is the first time you are using spaCy on this computer, you must first run either ```!conda install``` or ```!pip install```:

In [1]:
pip install -U spacy

Note: you may need to restart the kernel to use updated packages.


#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

## Step 2. Install language model


### Download that language model

In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 102.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
## import libraries
import pandas as pd
import glob
import spacy

### Load the English model into an ```nlp``` pipeline

In [7]:
## build nlp pipeline (a function that will tokenize, parse and run NER for us)
nlp = spacy.load("en_core_web_sm")

In [9]:
## what type of object is nlp?
type(nlp)

spacy.lang.en.English

## Step 3. Text analysis

In [11]:
### Sample English text:
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. \
The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." \
Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.
'''

In [13]:
## CALL the text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'

In [15]:
## what type of object is text?
type(text)

str

### Tokenize our text

- Tokenizing is always the first step in text analysis. 
- It breaks all text into isolated but related units (including spaces, symbols, punctuation, numbers, words etc.)
- However, it retains the connection between all the words, sentences, and paragraphs.
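
To make these points concrete, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`), which contains only the tokenizer and needs no model download; the sample sentence and the variable names `nlp_blank`, `sample` and `sample_doc` are chosen for illustration so they do not overwrite the notebook's own `nlp`, `text` and `doc`.

```python
import spacy

# A blank English pipeline has only a tokenizer, so no model download is needed.
nlp_blank = spacy.blank("en")
sample = "Microsoft bought Skype for $8.5 billion."
sample_doc = nlp_blank(sample)

# Punctuation and symbols become their own tokens
tokens = [token.text for token in sample_doc]
print(tokens)

# Each token remembers its position (token.i), its character offset in the
# original string (token.idx) and any whitespace that followed it
for token in sample_doc:
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))

# Because nothing is thrown away, the original text can be rebuilt exactly
rebuilt = "".join(token.text_with_ws for token in sample_doc)
print(rebuilt == sample)
```

The last line is what "retains the connection" means in practice: tokenization in spaCy is non-destructive, so you can always get back from tokens to the exact source text.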

In [18]:
## let's run the nlp function and create a spaCy doc
doc = nlp(text)

In [20]:
## CALL doc
doc

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.

In [22]:
## what type of data is it?
type(doc)

spacy.tokens.doc.Doc

In [24]:
## show each token
for token in doc:
    print(token.text)
    print("********")

On
********
May
********
10
********
,
********
2011
********
,
********
Microsoft
********
announced
********
its
********
acquisition
********
of
********
 
********
Skype
********
Technologies
********
,
********
creator
********
of
********
the
********
 
********
VoIP
********
 
********
service
********
 
********
Skype
********
,
********
for
********
$
********
8.5
********
billion
********
.
********
Microsoft
********
is
********
headquartered
********
near
********
Seattle
********
Washington
********
while
********
Skype
********
remains
********
in
********
Palo
********
Alto
********
,
********
California
********
.
********
Sandeep
********
Junnarkar
********
got
********
this
********
from
********
Wikipedia
********
.
********
But
********
he
********
'd
********
rather
********
head
********
to
********
Paris
********
,
********
France
********
to
********
see
********
the
********
Mona
********
Lisa
********
at
********
the
********
Louvre
********
.
********
The
***

### Parts of speech



In [27]:
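## print each word with its part-of-speech code and label
for token in doc:
    print(f"{token.text}--->{token.pos}--->{token.pos_}")
    print("********")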

On--->85--->ADP
********
May--->96--->PROPN
********
10--->93--->NUM
********
,--->97--->PUNCT
********
2011--->93--->NUM
********
,--->97--->PUNCT
********
Microsoft--->96--->PROPN
********
announced--->100--->VERB
********
its--->95--->PRON
********
acquisition--->92--->NOUN
********
of--->85--->ADP
********
 --->103--->SPACE
********
Skype--->96--->PROPN
********
Technologies--->96--->PROPN
********
,--->97--->PUNCT
********
creator--->92--->NOUN
********
of--->85--->ADP
********
the--->90--->DET
********
 --->103--->SPACE
********
VoIP--->96--->PROPN
********
 --->103--->SPACE
********
service--->92--->NOUN
********
 --->103--->SPACE
********
Skype--->96--->PROPN
********
,--->97--->PUNCT
********
for--->85--->ADP
********
$--->99--->SYM
********
8.5--->93--->NUM
********
billion--->93--->NUM
********
.--->97--->PUNCT
********
Microsoft--->96--->PROPN
********
is--->87--->AUX
********
headquartered--->100--->VERB
********
near--->85--->ADP
********
Seattle--->96--->PROPN
********
W

## Step 4. Named Entity Recognition (NER)

#### spaCy easily returns the words that matter to us, like names of companies, people, places, artworks, numbers, etc.

- ```.ents``` ------------> All entities found in the spaCy doc object.

- ```ent.text``` ------------> The actual text.

- ```ent.label``` ------------> A numeric code for the entity.

- ```ent.label_``` ------------> The word's entity category.

- ```spacy.explain(ent.label_)``` ---------> A description of the category.
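
The attributes above can be tried out in a minimal sketch. As an assumption made purely so the example runs without a model download, it uses a blank pipeline with spaCy's rule-based `entity_ruler` and two hand-written patterns instead of the statistical `en_core_web_sm` model; the names `nlp_rules` and `demo_doc` are illustrative.

```python
import spacy

# Assumption: a blank pipeline plus a rule-based EntityRuler stands in for the
# statistical model, so these entities are hand-written rules, not predictions.
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Microsoft"},
    {"label": "GPE", "pattern": "Seattle"},
])

demo_doc = nlp_rules("Microsoft is headquartered near Seattle.")

for ent in demo_doc.ents:
    # ent.label is an integer ID; ent.label_ is the readable category string
    print(ent.text, ent.label, ent.label_, spacy.explain(ent.label_))
```

The same loop works unchanged on a doc produced by `en_core_web_sm`; only the source of the entities differs.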




In [30]:
### call text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'

In [32]:
## find all entities
for ent in doc.ents:
    print(ent.text)

May 10, 2011
Microsoft
Skype Technologies
VoIP
Skype
$8.5 billion
Microsoft
Seattle
Washington
Skype
Palo Alto
California
Sandeep Junnarkar
Wikipedia
Paris
France
Louvre
The Hudson River
Mahicantuck
two
315 miles
the Atlantic Ocean
Mt. Mercy
New York


In [34]:
## find all entities with their label
for ent in doc.ents:
    print(f"{ent.text}---->{ent.label_}")

May 10, 2011---->DATE
Microsoft---->ORG
Skype Technologies---->ORG
VoIP---->PERSON
Skype---->ORG
$8.5 billion---->MONEY
Microsoft---->ORG
Seattle---->GPE
Washington---->GPE
Skype---->ORG
Palo Alto---->GPE
California---->GPE
Sandeep Junnarkar---->PERSON
Wikipedia---->ORG
Paris---->GPE
France---->GPE
Louvre---->PERSON
The Hudson River---->LOC
Mahicantuck---->PERSON
two---->CARDINAL
315 miles---->QUANTITY
the Atlantic Ocean---->LOC
Mt. Mercy---->LOC
New York---->GPE


In [36]:
## find all entities with their label and label descriptors
for ent in doc.ents:
    print(f"{ent.text}---->{ent.label_}---->{spacy.explain(ent.label_)}")

May 10, 2011---->DATE---->Absolute or relative dates or periods
Microsoft---->ORG---->Companies, agencies, institutions, etc.
Skype Technologies---->ORG---->Companies, agencies, institutions, etc.
VoIP---->PERSON---->People, including fictional
Skype---->ORG---->Companies, agencies, institutions, etc.
$8.5 billion---->MONEY---->Monetary values, including unit
Microsoft---->ORG---->Companies, agencies, institutions, etc.
Seattle---->GPE---->Countries, cities, states
Washington---->GPE---->Countries, cities, states
Skype---->ORG---->Companies, agencies, institutions, etc.
Palo Alto---->GPE---->Countries, cities, states
California---->GPE---->Countries, cities, states
Sandeep Junnarkar---->PERSON---->People, including fictional
Wikipedia---->ORG---->Companies, agencies, institutions, etc.
Paris---->GPE---->Countries, cities, states
France---->GPE---->Countries, cities, states
Louvre---->PERSON---->People, including fictional
The Hudson River---->LOC---->Non-GPE locations, mountain ranges,

## Import hearings
Download <a href="https://drive.google.com/file/d/1EUYLeHpHAAW2MGsrT6_jov9cJ-IuDLg-/view?usp=sharing">this Senate hearing</a> and turn it into a spaCy doc.

In [39]:
## pull hearing into notebook


['senate-hearing.txt']

In [41]:
## read into a variable


<class '_io.TextIOWrapper'>
<class 'str'>


### FUNCTION TO TOKENIZE

In [44]:
## create function to read a globbed list
# define function
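
One possible shape for such a function, as a sketch rather than the notebook's actual solution: the helper name `read_files` is hypothetical, and the glob pattern `senate-hearing*.txt` is an assumption based on the file name shown in the output above.

```python
import glob

def read_files(paths):
    """Return a dict mapping each file path to that file's full text."""
    texts = {}
    for path in paths:
        # the with-block closes the file even if reading fails
        with open(path, encoding="utf-8") as f:
            texts[path] = f.read()
    return texts

# glob.glob returns a list of matching paths (an empty list if nothing matches)
hearing_paths = sorted(glob.glob("senate-hearing*.txt"))
hearings = read_files(hearing_paths)
print(list(hearings))
```

Each value in `hearings` is a plain string, so it can be passed straight to `nlp(...)` to create a doc.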


In [46]:
## save senate hearing as nlp doc


[Senate Hearing 118-22]
[From the U.S. Government Publishing Office]


                                                        S. Hrg. 118-22

                   IMPLEMENTING IIJA: PERSPECTIVES ON
          THE DRINKING WATER AND WASTEWATER INFRASTRUCTURE ACT


                                HEARING

                               BEFORE THE

                              COMMITTEE ON
                      ENVIRONMENT AND PUBLIC WORKS

                          UNITED STATES SENATE

                    ONE HUNDRED EIGHTEENTH CONGRESS

                             FIRST SESSION

                               __________

                             MARCH 15, 2023

                               __________

  Printed for the use of the Committee on Environment and Public Works
  
[GRAPHIC NOT AVAILABLE IN TIFF FORMAT]  


        Available via the World Wide Web: http://www.govinfo.gov
        
                               __________

                                
                

### Named Entity Recognition (NER)





In [48]:
for word in doc.ents[:10]:
    print(word.text, word.label_, spacy.explain(word.label_))
    print("*******")

Senate ORG Companies, agencies, institutions, etc.
*******
118-22 CARDINAL Numerals that do not fall under another type
*******
the U.S. Government Publishing Office ORG Companies, agencies, institutions, etc.
*******
118-22 CARDINAL Numerals that do not fall under another type
*******
PERSPECTIVES CARDINAL Numerals that do not fall under another type
*******
THE DRINKING WATER AND WASTEWATER INFRASTRUCTURE ACT ORG Companies, agencies, institutions, etc.
*******
ONE HUNDRED CARDINAL Numerals that do not fall under another type
*******
FIRST ORDINAL "first", "second", etc.
*******
MARCH 15, 2023 DATE Absolute or relative dates or periods
*******
the Committee on Environment and Public Works ORG Companies, agencies, institutions, etc.
*******


## Specialized function to capture entity types

In [50]:
## create function to return list of dictionaries of entities and entity labels
## function to find entities
## def function


In [51]:
## test it to find orgs


Unnamed: 0,word,label,meaning
0,Senate,ORG,"Companies, agencies, institutions, etc."
1,118-22,CARDINAL,Numerals that do not fall under another type
2,the U.S. Government Publishing Office,ORG,"Companies, agencies, institutions, etc."
3,118-22,CARDINAL,Numerals that do not fall under another type
4,PERSPECTIVES,CARDINAL,Numerals that do not fall under another type
...,...,...,...
1470,that spring,DATE,Absolute or relative dates or periods
1471,the years,DATE,Absolute or relative dates or periods
1472,Green \nBay,LOC,"Non-GPE locations, mountain ranges, bodies of ..."
1473,Whereupon,ORG,"Companies, agencies, institutions, etc."


In [52]:
## search for people only


Unnamed: 0,word,label,meaning
19,BENJAMIN L. CARDIN,PERSON,"People, including fictional"
21,KEVIN CRAMER,PERSON,"People, including fictional"
29,EDWARD J. MARKEY,PERSON,"People, including fictional"
41,Courtney Taylor,PERSON,"People, including fictional"
43,Adam Tomlinson,PERSON,"People, including fictional"
...,...,...,...
1433,Aaron,PERSON,"People, including fictional"
1446,Carper,PERSON,"People, including fictional"
1461,Billy Graham,PERSON,"People, including fictional"
1463,Sheila,PERSON,"People, including fictional"


In [53]:
## search for orgs only


Unnamed: 0,word,label,meaning
0,Senate,ORG,"Companies, agencies, institutions, etc."
2,the U.S. Government Publishing Office,ORG,"Companies, agencies, institutions, etc."
5,THE DRINKING WATER AND WASTEWATER INFRASTRUCTU...,ORG,"Companies, agencies, institutions, etc."
9,the Committee on Environment and Public Works,ORG,"Companies, agencies, institutions, etc."
11,U.S. GOVERNMENT PUBLISHING,ORG,"Companies, agencies, institutions, etc."
...,...,...,...
1435,Committee,ORG,"Companies, agencies, institutions, etc."
1436,the Federal Government,ORG,"Companies, agencies, institutions, etc."
1437,State,ORG,"Companies, agencies, institutions, etc."
1438,Wastewater,ORG,"Companies, agencies, institutions, etc."


In [54]:
## CALL SAMPLE 30


Unnamed: 0,word,label,meaning
834,State,ORG,"Companies, agencies, institutions, etc."
1414,the last 15 years,DATE,Absolute or relative dates or periods
1427,today,DATE,Absolute or relative dates or periods
820,three,CARDINAL,Numerals that do not fall under another type
231,Committee,ORG,"Companies, agencies, institutions, etc."
475,First,ORDINAL,"""first"", ""second"", etc."
1032,first,ORDINAL,"""first"", ""second"", etc."
54,3,CARDINAL,Numerals that do not fall under another type
1341,Philadelphia,GPE,"Countries, cities, states"
599,12,CARDINAL,Numerals that do not fall under another type
