# AI - Natural Language Processing

### The problem?

- Endless amounts of unstructured data found in emails, tweets, letters, memos, etc.
- Even in transcripts
- How can we make sense of all this data?
- How can we 'easily' find relevant information for our reporting?

### The solution?
- Artificial Intelligence to process all that text using **natural language processing**!
- <a href="https://machinelearningmastery.com/natural-language-processing/">Learn more</a> about the complexity and the history of NLP.
- The use of ```large language models```!

### Journalism examples

- <a href="http://doctors.ajc.com/part_1_license_to_betray/">License to betray</a> – Finding word stems and roots to uncover abuse. (<a href="http://doctors.ajc.com/about_this_investigation/?ecmp=doctorssexabuse_microsite_stories">More info</a>)
- <a href="https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/">Federal judge’s rulings favored companies in which he owned stock</a> – Finding all stock owned by judges in disclosure forms and comparing to caseloads.
- <a href="https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html">LAPD underreported serious assaults, skewing crime stats for 8 years</a> – Text classification analysis.

### The tools

- Spacy v. NLTK
- NLTK launched in 2001, Spacy in 2015
- NLTK has grown large and complex, and many common tasks take several separate steps.
- Spacy is lean and modern, and can process some text 4x to 20x faster than NLTK.
- Spacy does **nearly** everything that NLTK does, but better.
- NLTK, however, is still the library of choice for sentiment analysis.

However, sentiment analysis in journalism can be problematic. Be extra wary of NLP's use for news analysis. AI can easily misinterpret the sentiment in this sentence:

"It is a great movie if you have the taste and sensibilities of a five-year-old boy."

It's best to stick to the following types of analysis:

- Mentions of a word or concept (who said something...when and how many times?)
- Frequency of target terms or topics (how often were keywords used in speeches, transcripts, etc.)
- Words over time (a timeline that shows frequency of words over time)
- Missing words (really a flip of words over time to show how people stopped using certain concepts or terms)
- Key people, places, companies (identify proper nouns and places for reporting)
- Comparisons (for example financial disclosures over time...which stocks were added or removed over the years)
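Several of these analyses need nothing more than counting. The sketch below is one possible way to do "frequency of target terms", "words over time" and "comparisons" with only the Python standard library. The function names, the speeches and the ticker symbols are all invented for illustration; they are not from any real dataset.

```python
import re
from collections import Counter, defaultdict


def term_counts_by_year(documents, terms):
    """Count whole-word, case-insensitive mentions of each term per year.

    documents: iterable of (year, text) pairs.
    Returns {year: Counter({term: count})}.
    """
    patterns = {
        term: re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        for term in terms
    }
    counts = defaultdict(Counter)
    for year, text in documents:
        for term, pattern in patterns.items():
            counts[year][term] += len(pattern.findall(text))
    return dict(counts)


def compare_holdings(earlier, later):
    """Return which items were added or removed between two lists."""
    earlier, later = set(earlier), set(later)
    return {"added": sorted(later - earlier), "removed": sorted(earlier - later)}


# Hypothetical speeches, invented for illustration.
speeches = [
    (2019, "Jobs, jobs, jobs. We will also talk about climate."),
    (2020, "The climate crisis is here. Climate policy cannot wait."),
    (2021, "Our focus this year is housing."),
]

by_year = term_counts_by_year(speeches, ["jobs", "climate"])
for year in sorted(by_year):
    print(year, dict(by_year[year]))

# Hypothetical ticker symbols from two disclosure years.
print(compare_holdings(["AAA", "BBB"], ["BBB", "CCC"]))
```

A year where every count is zero (2021 here) is the "missing words" case: the term simply stops appearing.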

# Working with Spacy

## Step 1. Install Spacy

If this is the first time you are using spacy on this computer, you must first run either ```!conda install``` or ```!pip install```:

### TURN OFF FOR COLAB
Run for ANACONDA

In [1]:
conda install -c conda-forge spacy

done
Solving environment: done


  current version: 22.9.0
  latest version: 23.9.0

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Retrieving notices: ...working... done

Note: you may need to restart the kernel to use updated packages.


### TURN OFF FOR ANACONDA
Run for Colab

In [None]:
## COLAB pip install
# !pip install -U spacy
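If you are not sure whether an install is needed, a quick check like the one below can tell you. It is a minimal sketch using only the standard library; `is_installed` is a helper name made up for this example, and it only checks that Python can find the package, not that the version is current.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the package (without importing it)."""
    return importlib.util.find_spec(package_name) is not None


# The library itself and the small English model used in this notebook.
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line prints "missing", run the matching install cell for your environment.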


In [53]:
## import libraries
import pandas as pd
import spacy

#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

## Step 2. Install language model


### ANACONDA ONLY

In [23]:
conda install -c conda-forge spacy-model-en_core_web_sm

done
Solving environment: done


  current version: 22.9.0
  latest version: 23.9.0

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Retrieving notices: ...working... done

Note: you may need to restart the kernel to use updated packages.


### COLAB ONLY

In [24]:
# !python -m spacy download en_core_web_sm

In [25]:
## import that language model
import en_core_web_sm

### Place English library into a ```nlp``` pipeline

In [28]:
## build nlp pipeline (a function that will tokenize, parse and run NER for us)
nlp = en_core_web_sm.load()

In [29]:
## what type of object is nlp
type(nlp)

spacy.lang.en.English

## Step 3. Text analysis

In [85]:
### Sample English text:
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. \
The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." \
Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.
'''

In [31]:
## CALL the text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'
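The `\xa0` sequences in the output above are non-breaking spaces, which often come along when text is copied from a web page. Further down they show up as separate whitespace tokens. If that is unwanted, one option is to replace them before analysis. The helper below is a minimal sketch (the name `normalize_spaces` is made up here). Running the rest of the notebook on cleaned text would likely give slightly different output from what is shown.

```python
def normalize_spaces(raw_text):
    """Replace non-breaking spaces (U+00A0) with ordinary spaces."""
    return raw_text.replace("\xa0", " ")


# A fragment of the sample text, including its non-breaking spaces.
sample = "acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype"
cleaned = normalize_spaces(sample)
print(repr(cleaned))
```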

In [32]:
## what type of object is text?
type(text)

str

### Tokenize our text

- Tokenizing is always the first step in text analysis. 
- It breaks all text into isolated but related units (including spaces, symbols, punctuation, numbers, words etc.)
- However, it retains the connection between all the words, sentences, and paragraphs.
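To make the idea concrete, here is a toy tokenizer built on a regular expression. It is a simplified stand-in and not how Spacy tokenizes (Spacy uses language-specific rules), but it shows the two points above: the text is split into small units, and each unit keeps its position in the original string.

```python
import re

# Numbers (with optional decimals), then words, then any single symbol.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def toy_tokenize(raw_text):
    """Return a list of (token_text, start_offset) pairs."""
    return [(m.group(), m.start()) for m in TOKEN_PATTERN.finditer(raw_text)]


sentence = "Skype, for $8.5 billion."
tokens = toy_tokenize(sentence)

print([token_text for token_text, _ in tokens])
# ['Skype', ',', 'for', '$', '8.5', 'billion', '.']

# Each offset points back into the original sentence.
for token_text, start in tokens:
    print(start, repr(sentence[start:start + len(token_text)]))
```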

In [33]:
## let's run the nlp function and create a spacy doc
doc = nlp(text)

In [34]:
## CALL doc
doc

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.

In [35]:
## what type of data is it?
type(doc)

spacy.tokens.doc.Doc

In [36]:
## show each token
for item in doc:
    print(item)
    print("********")

On
********
May
********
10
********
,
********
2011
********
,
********
Microsoft
********
announced
********
its
********
acquisition
********
of
********
 
********
Skype
********
Technologies
********
,
********
creator
********
of
********
the
********
 
********
VoIP
********
 
********
service
********
 
********
Skype
********
,
********
for
********
$
********
8.5
********
billion
********
.
********
Microsoft
********
is
********
headquartered
********
near
********
Seattle
********
Washington
********
while
********
Skype
********
remains
********
in
********
Palo
********
Alto
********
,
********
California
********
.
********
Sandeep
********
Junnarkar
********
got
********
this
********
from
********
Wikipedia
********
.
********
But
********
he
********
'd
********
rather
********
head
********
to
********
Paris
********
,
********
France
********
to
********
see
********
the
********
Mona
********
Lisa
********
at
********
the
********
Louvre
********
.
********
The
***

### Parts of speech



In [37]:
## print each token with its part-of-speech ID and label
for token in doc:
    print(f"{token.text}--->{token.pos}--->{token.pos_}")
    print("********")

On--->85--->ADP
********
May--->96--->PROPN
********
10--->93--->NUM
********
,--->97--->PUNCT
********
2011--->93--->NUM
********
,--->97--->PUNCT
********
Microsoft--->96--->PROPN
********
announced--->100--->VERB
********
its--->95--->PRON
********
acquisition--->92--->NOUN
********
of--->85--->ADP
********
 --->103--->SPACE
********
Skype--->96--->PROPN
********
Technologies--->96--->PROPN
********
,--->97--->PUNCT
********
creator--->92--->NOUN
********
of--->85--->ADP
********
the--->90--->DET
********
 --->103--->SPACE
********
VoIP--->96--->PROPN
********
 --->103--->SPACE
********
service--->92--->NOUN
********
 --->103--->SPACE
********
Skype--->96--->PROPN
********
,--->97--->PUNCT
********
for--->85--->ADP
********
$--->99--->SYM
********
8.5--->93--->NUM
********
billion--->93--->NUM
********
.--->97--->PUNCT
********
Microsoft--->96--->PROPN
********
is--->87--->AUX
********
headquartered--->100--->VERB
********
near--->85--->ADP
********
Seattle--->96--->PROPN
********
W
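In the output above, `token.pos` is a numeric ID and `token.pos_` is the readable label. Once every token has a label, a tally gives a quick profile of the text, and a filter pulls out one category such as proper nouns. The sketch below assumes objects with `.text` and `.pos_` attributes. `FakeToken` is a hypothetical stand-in so the example runs without a language model; the tags in the sample are copied from the output above.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spacy token (only the two attributes used here).
FakeToken = namedtuple("FakeToken", ["text", "pos_"])


def pos_profile(tokens):
    """Count how many tokens carry each part-of-speech label."""
    return Counter(token.pos_ for token in tokens)


def words_with_pos(tokens, pos_label):
    """Return the text of every token with the given label."""
    return [token.text for token in tokens if token.pos_ == pos_label]


sample = [
    FakeToken("Microsoft", "PROPN"),
    FakeToken("announced", "VERB"),
    FakeToken("its", "PRON"),
    FakeToken("acquisition", "NOUN"),
    FakeToken("of", "ADP"),
    FakeToken("Skype", "PROPN"),
]

print(pos_profile(sample).most_common(1))  # [('PROPN', 2)]
print(words_with_pos(sample, "PROPN"))     # ['Microsoft', 'Skype']
```

In the notebook itself, `pos_profile(doc)` should work the same way, since looping over `doc` yields tokens with these attributes.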

## Step 4. Named Entity Recognition (NER)

#### Spacy easily returns the words that matter to us like names of companies, people, places, art works, numbers, etc.

- ```.ents``` ------------> All entities found in the spacy doc object.

- ```ent.text``` ------------> The actual text.

- ```ent.label``` ------------> A numeric code for the entity's category.

- ```ent.label_``` ------------> The entity's category as a readable label.

- ```spacy.explain(ent.label_)``` ---------> A description of the category.
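For reporting, it often helps to group entities by category so that all the people, places or organizations can be read at a glance. The sketch below assumes objects with `.text` and `.label_` attributes, as described in the list above. `FakeEnt` and `group_entities` are names invented for this example so it runs without a language model.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spacy entity (only the two attributes used here).
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])


def group_entities(ents):
    """Group unique entity texts under their label, keeping first-seen order."""
    grouped = defaultdict(list)
    for ent in ents:
        if ent.text not in grouped[ent.label_]:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


sample_ents = [
    FakeEnt("Microsoft", "ORG"),
    FakeEnt("Skype Technologies", "ORG"),
    FakeEnt("Microsoft", "ORG"),
    FakeEnt("Seattle", "GPE"),
    FakeEnt("Sandeep Junnarkar", "PERSON"),
]

for label, names in group_entities(sample_ents).items():
    print(label, names)
```

With a real document, `group_entities(doc.ents)` should give the same kind of result.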




In [38]:
### call text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'

In [39]:
## find all entities
for word in doc.ents:
    print(word)
    

May 10, 2011
Microsoft
Skype Technologies
Skype
$8.5 billion
Microsoft
Seattle
Washington
Skype
Palo Alto
California
Sandeep Junnarkar
Wikipedia
Paris
France
the Mona Lisa
Louvre
The Hudson River
Mahicantuck
two
Mahicantuck
315 miles
the Atlantic Ocean
Mt. Mercy
New York


In [40]:
## find all entities with their label

for word in doc.ents:
    print(f"{word}---->{word.label_}")  

May 10, 2011---->DATE
Microsoft---->ORG
Skype Technologies---->ORG
Skype---->ORG
$8.5 billion---->MONEY
Microsoft---->ORG
Seattle---->GPE
Washington---->GPE
Skype---->ORG
Palo Alto---->GPE
California---->GPE
Sandeep Junnarkar---->PERSON
Wikipedia---->ORG
Paris---->GPE
France---->GPE
the Mona Lisa---->WORK_OF_ART
Louvre---->LOC
The Hudson River---->LOC
Mahicantuck---->ORG
two---->CARDINAL
Mahicantuck---->WORK_OF_ART
315 miles---->QUANTITY
the Atlantic Ocean---->LOC
Mt. Mercy---->LOC
New York---->GPE
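Notice that the model is not always right: in the output above, Mahicantuck is labeled once as ORG and once as WORK_OF_ART, and neither is correct for a river. One simple check before trusting the labels is to flag any entity that received more than one label. The function below is a minimal sketch (`conflicting_labels` is a made-up name); the pairs are a subset copied from the output above.

```python
from collections import defaultdict


def conflicting_labels(pairs):
    """Return {entity_text: sorted labels} for texts given more than one label.

    pairs: iterable of (entity_text, label) tuples.
    """
    labels_by_text = defaultdict(set)
    for text, label in pairs:
        labels_by_text[text].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


pairs = [
    ("Microsoft", "ORG"),
    ("Skype", "ORG"),
    ("Microsoft", "ORG"),
    ("Mahicantuck", "ORG"),
    ("Mahicantuck", "WORK_OF_ART"),
    ("New York", "GPE"),
]

print(conflicting_labels(pairs))
# {'Mahicantuck': ['ORG', 'WORK_OF_ART']}
```

In the notebook, the same check could be run with `conflicting_labels((ent.text, ent.label_) for ent in doc.ents)`. Consistent labels can still be wrong, so this only catches one kind of error.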


In [41]:
## find all entities with their label and label descriptors
for word in doc.ents:
    print(f"{word}---->{word.label_}---->{spacy.explain(word.label_)}") 

May 10, 2011---->DATE---->Absolute or relative dates or periods
Microsoft---->ORG---->Companies, agencies, institutions, etc.
Skype Technologies---->ORG---->Companies, agencies, institutions, etc.
Skype---->ORG---->Companies, agencies, institutions, etc.
$8.5 billion---->MONEY---->Monetary values, including unit
Microsoft---->ORG---->Companies, agencies, institutions, etc.
Seattle---->GPE---->Countries, cities, states
Washington---->GPE---->Countries, cities, states
Skype---->ORG---->Companies, agencies, institutions, etc.
Palo Alto---->GPE---->Countries, cities, states
California---->GPE---->Countries, cities, states
Sandeep Junnarkar---->PERSON---->People, including fictional
Wikipedia---->ORG---->Companies, agencies, institutions, etc.
Paris---->GPE---->Countries, cities, states
France---->GPE---->Countries, cities, states
the Mona Lisa---->WORK_OF_ART---->Titles of books, songs, etc.
Louvre---->LOC---->Non-GPE locations, mountain ranges, bodies of water
The Hudson River---->LOC----

## More NLP:

- Text summarization
- Word frequency
- Context around words
- Surprise ending?
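As a taste of "context around words", here is a minimal keyword-in-context sketch using only the standard library. The function name and the `window` parameter are invented for this example; it returns a slice of the surrounding characters for every whole-word match.

```python
import re


def keyword_in_context(raw_text, keyword, window=30):
    """Return a snippet of surrounding text for each whole-word match."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(raw_text):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(raw_text))
        snippets.append(raw_text[start:end].strip())
    return snippets


sample = (
    "Microsoft announced its acquisition of Skype Technologies, "
    "creator of the VoIP service Skype, for $8.5 billion."
)

for snippet in keyword_in_context(sample, "Skype", window=20):
    print(snippet)
```

Each printed snippet shows one mention with up to 20 characters on either side, which is often enough to judge how the word was used.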