# AI - Natural Language Processing

### The problem?

- Endless amounts of unstructured data found in emails, tweets, letters, memos, etc.
- Even in transcripts
- How can we make sense of all this data?
- How can we 'easily' find relevant information for our reporting?

### The solution?
- Artificial Intelligence to process all that text using **natural language processing**!
- <a href="https://machinelearningmastery.com/natural-language-processing/">Learn more</a> about the complexity and the history of NLP.
- The use of ```large language models```!

### Journalism examples

- <a href="http://doctors.ajc.com/part_1_license_to_betray/">License to betray</a> – Finding word stems and roots to uncover abuse. (<a href="http://doctors.ajc.com/about_this_investigation/?ecmp=doctorssexabuse_microsite_stories">More info</a>)
- <a href="https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/">Federal judge’s rulings favored companies in which he owned stock</a> – Finding all stock owned by judges in disclosure forms and comparing to caseloads.
- <a href="https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html">LAPD underreported serious assaults, skewing crime stats for 8 years</a> – Text classification analysis.

### The tools

- spaCy vs. NLTK
- NLTK launched in 2001, spaCy in 2015.
- NLTK has grown bloated and complex after two decades of changes, and common tasks often take many steps.
- spaCy is lean and modern, and can process some text 4x to 20x faster than NLTK.
- spaCy does **nearly** everything that NLTK does, but better.
- NLTK, however, is still the library of choice for sentiment analysis.

However, sentiment analysis in journalism can be problematic. Be extra wary of NLP's use for news analysis. AI can easily misinterpret the sentiment in this sentence:

"It is a great movie if you have the taste and sensibilities of a five-year-old boy."

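To see why, consider a toy word-counting scorer. This is a deliberately naive sketch, not how NLTK or any real sentiment library works; the word lists and the `naive_sentiment` function are invented for illustration. It spots "great", finds nothing negative, and rates the sarcastic review as positive.

```python
import re

# Hypothetical mini-lexicons, invented for this illustration only.
POSITIVE = {"great", "good", "wonderful", "love"}
NEGATIVE = {"bad", "awful", "terrible", "hate"}

def naive_sentiment(sentence):
    """Count positive words minus negative words (a toy scorer)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

review = ("It is a great movie if you have the taste and "
          "sensibilities of a five-year-old boy.")
score = naive_sentiment(review)
print(score)  # a score above zero means the toy scorer calls this "positive"
```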
It's best to stick to the following types of analysis:

- Mentions of a word or concept (who said something...when and how many times?)
- Frequency of target terms or topics (how often were keywords used in speeches, transcripts, etc)
- Words over time (a timeline that shows frequency of words over time)
- Missing words (really a flip of words over time to show how people stopped using certain concepts or terms)
- Key people, places, companies (identify proper nouns and places for reporting)
- Comparisons (for example financial disclosures over time...which stocks were added or removed over the years)
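
Several of these analyses come down to counting. As a rough sketch using only Python's standard library (the two `speeches` and the `keywords` list are made up for illustration), here is one way to count target terms per document and line the counts up by year:

```python
import re
from collections import Counter

# Made-up sample documents keyed by year; stand-ins for real transcripts.
speeches = {
    2019: "Jobs, jobs, jobs. The economy is strong and taxes are low.",
    2023: "Inflation hurts families. The economy needs help, not new taxes.",
}
keywords = ["economy", "taxes", "jobs", "inflation"]

def keyword_counts(text, terms):
    """Return how many times each target term appears in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {term: words[term] for term in terms}

# One row per year gives a simple "words over time" table.
timeline = {year: keyword_counts(text, keywords) for year, text in speeches.items()}
for year, counts in timeline.items():
    print(year, counts)
```

A term whose count drops to zero in a later year is the "missing words" case from the list above.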

# Working with spaCy

## Step 1. Install spaCy

If this is the first time you are using spaCy on this computer, you must first run either ```!conda install``` or ```!pip install```:

### TURN OFF FOR COLAB
Run for ANACONDA

In [1]:
conda install -c conda-forge spacy

done
Solving environment: done


  current version: 23.1.0
  latest version: 23.9.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.9.0



## Package Plan ##

  environment location: /Users/sajinashrestha/opt/anaconda3

  added / updated specs:
    - spacy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    catalogue-2.0.10           |   py39h6e9494a_0          35 KB  conda-forge
    commonmark-0.9.1           |             py_0          46 KB  conda-forge
    confection-0.1.3           |   py39hbd905a8_1          66 KB  conda-forge
    cymem-2.0.6                |   py39he9d5cce_0          34 KB
    cython-blis-0.7.9          |   py39hacda100_0         4.1 MB
    langcodes-3.3.0            |     pyhd8ed1ab_0         156 KB  conda-forge
    murmurhash-1.0.7     


### TURN OFF FOR ANACONDA
Run for Colab

In [None]:
## COLAB pip install
# !pip install -U spacy


In [4]:
## import library.

import spacy

#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

## Step 2. Install language model


### ANACONDA ONLY

In [7]:
conda install -c conda-forge spacy-model-en_core_web_sm

done
Solving environment: done


  current version: 23.1.0
  latest version: 23.9.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.9.0



## Package Plan ##

  environment location: /Users/sajinashrestha/opt/anaconda3

  added / updated specs:
    - spacy-model-en_core_web_sm


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    spacy-model-en_core_web_sm-3.5.0|     pyhd8ed1ab_0        12.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        12.1 MB

The following NEW packages will be INSTALLED:

  spacy-model-en_co~ conda-forge/noarch::spacy-model-en_core_web_sm-3.5.0-pyhd8ed1ab_0 



Downloading and Extracting Packages
                                                          

### COLAB ONLY

In [None]:
# !python -m spacy download en_core_web_sm

In [9]:
## import that language model
import en_core_web_sm

### Place the English library into an ```nlp``` pipeline

In [10]:
## build the nlp pipeline (a function that will tokenize, parse and run NER for us)
nlp = en_core_web_sm.load()

In [11]:
## what type of object is nlp
type(nlp)

spacy.lang.en.English

## Step 3. Text analysis

In [12]:
### Sample English text:
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. \
The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." \
Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.
'''

In [13]:
## CALL the text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'

In [14]:
## what type of object is text?
type(text)

str

### Tokenize our text

- Tokenizing is always the first step in text analysis. 
- It breaks all text into isolated but related units (including spaces, symbols, punctuation, numbers, words etc.)
- However, it retains the connection between all the words, sentences, and paragraphs.
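
To make the idea concrete, here is a deliberately simplified tokenizer built on a regular expression. It is not spaCy's algorithm (spaCy applies language-specific rules, which is why it splits "he'd" into "he" and "'d"); it only shows how text can be cut into units while each unit remembers its position in the original string. The function name `simple_tokenize` is made up for this sketch.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens, keeping each start offset."""
    # \w+ grabs runs of letters/digits; [^\w\s] grabs one punctuation mark.
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]

sample = "Microsoft bought Skype for $8.5 billion."
tokens = simple_tokenize(sample)
for tok, start in tokens:
    # The offset ties every token back to its place in the source text.
    print(f"{start:>3}  {tok}")
```

Note that this crude pattern breaks "8.5" into three pieces, whereas spaCy keeps it as a single token.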

In [15]:
## let's run the nlp function and create a spacy doc
doc = nlp(text)

In [16]:
## CALL doc
doc

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.

In [17]:
## what type of data is it?
type(doc)

spacy.tokens.doc.Doc

In [21]:
## show each token
for token in doc:
    print(token)
    print("***********")

On
***********
May
***********
10
***********
,
***********
2011
***********
,
***********
Microsoft
***********
announced
***********
its
***********
acquisition
***********
of
***********
 
***********
Skype
***********
Technologies
***********
,
***********
creator
***********
of
***********
the
***********
 
***********
VoIP
***********
 
***********
service
***********
 
***********
Skype
***********
,
***********
for
***********
$
***********
8.5
***********
billion
***********
.
***********
Microsoft
***********
is
***********
headquartered
***********
near
***********
Seattle
***********
Washington
***********
while
***********
Skype
***********
remains
***********
in
***********
Palo
***********
Alto
***********
,
***********
California
***********
.
***********
Sandeep
***********
Junnarkar
***********
got
***********
this
***********
from
***********
Wikipedia
***********
.
***********
But
***********
he
***********
'd
***********
rather
***********
head
***********
to
*****

### Parts of speech



In [22]:
## print each token with its part of speech (numeric code and label)
for token in doc:
    print(f"{token.text}---> {token.pos}--->{token.pos_}")
    print("*********")

On---> 85--->ADP
*********
May---> 96--->PROPN
*********
10---> 93--->NUM
*********
,---> 97--->PUNCT
*********
2011---> 93--->NUM
*********
,---> 97--->PUNCT
*********
Microsoft---> 96--->PROPN
*********
announced---> 100--->VERB
*********
its---> 95--->PRON
*********
acquisition---> 92--->NOUN
*********
of---> 85--->ADP
*********
 ---> 103--->SPACE
*********
Skype---> 96--->PROPN
*********
Technologies---> 96--->PROPN
*********
,---> 97--->PUNCT
*********
creator---> 92--->NOUN
*********
of---> 85--->ADP
*********
the---> 90--->DET
*********
 ---> 103--->SPACE
*********
VoIP---> 96--->PROPN
*********
 ---> 103--->SPACE
*********
service---> 92--->NOUN
*********
 ---> 103--->SPACE
*********
Skype---> 96--->PROPN
*********
,---> 97--->PUNCT
*********
for---> 85--->ADP
*********
$---> 99--->SYM
*********
8.5---> 93--->NUM
*********
billion---> 93--->NUM
*********
.---> 97--->PUNCT
*********
Microsoft---> 96--->PROPN
*********
is---> 87--->AUX
*********
headquartered---> 100--->VERB
****

## Step 4. Named Entity Recognition (NER)

#### spaCy easily returns the words that matter to us, like names of companies, people, places, artworks, numbers, etc.

- ```.ents``` ------------> All the entities found in a spaCy doc object.

- ```ent.text``` ------------> The actual text.

- ```ent.label``` ------------> A numeric code for the entity.

- ```ent.label_``` ------------> The word's entity category.

- ```spacy.explain(ent.label_)``` ---------> A description of the category.




In [23]:
### call text
text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'

In [24]:
## find all entities

for word in doc.ents:
    print(word)

May 10, 2011
Microsoft
Skype Technologies
VoIP
Skype
$8.5 billion
Microsoft
Seattle
Washington
Skype
Palo Alto
California
Sandeep Junnarkar
Wikipedia
Paris
France
The Hudson River
Mahicantuck
two
315 miles
the Atlantic Ocean
Mt. Mercy
New York


In [26]:
## find all entities with their label

for word in doc.ents:
    print(f"{word}---->{word.label_}")

May 10, 2011---->DATE
Microsoft---->ORG
Skype Technologies---->ORG
VoIP---->LOC
Skype---->ORG
$8.5 billion---->MONEY
Microsoft---->ORG
Seattle---->GPE
Washington---->GPE
Skype---->ORG
Palo Alto---->GPE
California---->GPE
Sandeep Junnarkar---->PERSON
Wikipedia---->GPE
Paris---->GPE
France---->GPE
The Hudson River---->LOC
Mahicantuck---->PERSON
two---->CARDINAL
315 miles---->QUANTITY
the Atlantic Ocean---->LOC
Mt. Mercy---->LOC
New York---->GPE


In [28]:
## find all entities with their label and label descriptors
for word in doc.ents:
    print(f"{word}---->{word.label_}---->{spacy.explain(word.label_)}")

May 10, 2011---->DATE---->Absolute or relative dates or periods
Microsoft---->ORG---->Companies, agencies, institutions, etc.
Skype Technologies---->ORG---->Companies, agencies, institutions, etc.
VoIP---->LOC---->Non-GPE locations, mountain ranges, bodies of water
Skype---->ORG---->Companies, agencies, institutions, etc.
$8.5 billion---->MONEY---->Monetary values, including unit
Microsoft---->ORG---->Companies, agencies, institutions, etc.
Seattle---->GPE---->Countries, cities, states
Washington---->GPE---->Countries, cities, states
Skype---->ORG---->Companies, agencies, institutions, etc.
Palo Alto---->GPE---->Countries, cities, states
California---->GPE---->Countries, cities, states
Sandeep Junnarkar---->PERSON---->People, including fictional
Wikipedia---->GPE---->Countries, cities, states
Paris---->GPE---->Countries, cities, states
France---->GPE---->Countries, cities, states
The Hudson River---->LOC---->Non-GPE locations, mountain ranges, bodies of water
Mahicantuck---->PERSON----

### Create a CSV that holds all the organizations/companies in a document

In [None]:
## find all entities and place in a list using list comprehension

## find all entities
 ## find all entity labels
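
One possible way to fill in this cell is sketched below. So that the snippet runs on its own, `Ent` and `sample_ents` are stand-ins that mimic the `.text` and `.label_` attributes of spaCy entities (a few values copied from the output above); in the notebook you would loop over `doc.ents` instead.

```python
from collections import namedtuple

# Stand-in for spaCy entity spans; in the notebook, use doc.ents instead.
Ent = namedtuple("Ent", ["text", "label_"])
sample_ents = [
    Ent("May 10, 2011", "DATE"),
    Ent("Microsoft", "ORG"),
    Ent("Skype Technologies", "ORG"),
    Ent("$8.5 billion", "MONEY"),
    Ent("Seattle", "GPE"),
    Ent("Microsoft", "ORG"),
]

# find all entities
entities = [ent.text for ent in sample_ents]
# find all entity labels
labels = [ent.label_ for ent in sample_ents]

print(entities)
print(labels)
```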



In [None]:
### Turn the two lists into a dictionary using a for loop
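
A sketch of the for-loop version, assuming the goal is a list of dictionaries (one per entity), which is the shape pandas turns into rows in the export step. The short `entities` and `labels` lists are stand-ins for the ones built from `doc.ents`, and the name `all_entities` is a hypothetical choice.

```python
# Stand-ins for the two lists built from doc.ents.
entities = ["May 10, 2011", "Microsoft", "Skype Technologies", "Seattle"]
labels = ["DATE", "ORG", "ORG", "GPE"]

all_entities = []
# zip() pairs each entity with the label at the same position.
for entity, label in zip(entities, labels):
    all_entities.append({"entity": entity, "label": label})

print(all_entities)
```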


In [None]:
### Turn the two lists into a dictionary using 
### dictionary comprehension within list comprehension
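
The same result can be sketched in a single expression. As before, `entities` and `labels` are short stand-ins for the real lists, and `keys` and `all_entities` are hypothetical names.

```python
# Stand-ins for the two lists built from doc.ents.
entities = ["May 10, 2011", "Microsoft", "Skype Technologies", "Seattle"]
labels = ["DATE", "ORG", "ORG", "GPE"]
keys = ("entity", "label")

# Outer list comprehension: one pass per (entity, label) pair.
# Inner dictionary comprehension: match each key with its value.
all_entities = [
    {key: value for key, value in zip(keys, pair)}
    for pair in zip(entities, labels)
]

print(all_entities)
```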


In [None]:
## the previous lists hold all entities. 
## let's narrow them down to the orgs/companies
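
A minimal sketch of the filtering step. The `all_entities` list here is a stand-in for the list of dictionaries described above; the result is named `all_orgs` to match the variable used in the export cell further down.

```python
# Stand-in for the list of dictionaries holding every entity.
all_entities = [
    {"entity": "May 10, 2011", "label": "DATE"},
    {"entity": "Microsoft", "label": "ORG"},
    {"entity": "Skype Technologies", "label": "ORG"},
    {"entity": "Seattle", "label": "GPE"},
    {"entity": "Microsoft", "label": "ORG"},
]

# Keep only the dictionaries whose label is ORG (companies, agencies, etc.).
all_orgs = [item for item in all_entities if item["label"] == "ORG"]

print(all_orgs)
```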


In [None]:
## What data types are these?


### Deduplicate?

If you need to deduplicate the results you can do so by using ```unique()``` in Pandas.

But perhaps you want to uncover a pattern in how often terms are used and when.
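
Both options can be sketched without pandas, using only the standard library. The `orgs` list is a stand-in based on the entity output earlier; `dict.fromkeys` drops repeats while keeping first-seen order, and `Counter` keeps the repeats as counts.

```python
from collections import Counter

# Stand-in list of organization names, with repeats, as NER returned them.
orgs = ["Microsoft", "Skype Technologies", "Skype", "Microsoft", "Skype"]

# Deduplicate while preserving the order of first appearance.
unique_orgs = list(dict.fromkeys(orgs))
print(unique_orgs)

# Or keep the duplicates as information: how often was each one mentioned?
mentions = Counter(orgs)
print(mentions.most_common())
```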


### Export instead

In [None]:
## import pandas
import pandas as pd

In [None]:
# ## use pandas to write to csv file
filename = "test-entities-1.csv"
df = pd.DataFrame(all_orgs) ## we turn our list of dictionaries into a DataFrame, which we'll call df
df.to_csv(filename, encoding='utf-8', index=False)


### Create a function to process entities

In [None]:

## function to find entities
def show_entities(my_text):
  '''
  my_text must be a spacy doc tokenized object; already run through nlp pipeline

  '''
  each_token = "Token"
  entity_type = "Entity"
  entity_def = "Entity Defined"
  print(f"{each_token:{30}} {entity_type:{15}} {entity_def}")
  if my_text.ents:
      for word in my_text.ents:
          print(f"{word.text:{30}} {word.label_:{15}} {str(spacy.explain(word.label_))}")
  else:
      print("There are no entities in this text")


In [None]:
## show entities in my english sentence


## Specialized function to capture entity types

In [None]:
## create function to return list of dictionaries of entities and entity labels
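
One way such a function might look. The name `get_entities`, its `entity_type` parameter, and the `Ent` stand-in are all assumptions made for this sketch; inside the notebook you would pass `doc.ents` from a real spaCy doc rather than `sample_ents`.

```python
from collections import namedtuple

def get_entities(ents, entity_type=None):
    """Return [{"entity": ..., "label": ...}], optionally filtered by label.

    ents: an iterable of objects with .text and .label_ (e.g. doc.ents).
    entity_type: a label such as "ORG" or "GPE"; None keeps everything.
    """
    return [
        {"entity": ent.text, "label": ent.label_}
        for ent in ents
        if entity_type is None or ent.label_ == entity_type
    ]

# Stand-in for doc.ents so the sketch runs without a language model.
Ent = namedtuple("Ent", ["text", "label_"])
sample_ents = [Ent("Microsoft", "ORG"), Ent("Seattle", "GPE"), Ent("Skype", "ORG")]

orgs = get_entities(sample_ents, "ORG")
print(orgs)
```

Returning a list of dictionaries keeps the output ready for `pd.DataFrame(...)`, the same shape used in the export step.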


In [None]:
## test it to find orgs


## Install other languages
#### Other languages can be found at https://spacy.io/usage/models

#### Disclaimer: Language models are built by open source communities. English and German are the most advanced language models.

### Spanish language model

### ANACONDA ONLY

In [None]:
conda update -n base -c conda-forge conda

### COLAB ONLY

In [None]:
# !python -m spacy download es_core_news_sm


In [None]:
## import the library and create nlp pipeline
import es_core_news_sm
nlp = es_core_news_sm.load()

In [None]:
### Sample Spanish Text (sorry!)
stext = """
El 10 de mayo de 2011, Microsoft anunció la adquisición de Skype Technologies, \
creador del servicio de VoIP Skype, por 8.500 millones de dólares. Microsoft tiene \
su sede cerca de Seattle, Washington, mientras que Skype permanece en Palo Alto, \
California. Sandeep Junnarkar obtuvo esto de Wikipedia. Pero preferiría ir a París, \
Francia, a ver la Mona Lisa en el Louvre. El río Hudson realmente debería llamarse por \
su nombre nativo original, Mahicantuck, que significa "el río \
que fluye en dos direcciones". Mahicantuck fluye por 315 millas hacia el Océano Atlántico \
desde su origen en Mt. Mercy, el pico más alto del estado de Nueva York.
"""

In [None]:
## tokenize and show parts of speech for each token


In [None]:
## show the tokens


In [None]:
## show entities


## More NLP:

- Text summarization
- Word frequency
- Context around words
- Surprise ending?