# Natural Language Processing

### The problem?

- Endless amounts of unstructured data in emails, tweets, letters, memos and even transcripts.
- How can we make sense of all this data?
- How can we 'easily' find relevant information for our reporting?

### The solution?
- Computer programming to process all that text using **natural language processing**!
- <a href="https://machinelearningmastery.com/natural-language-processing/">Learn more</a> about the complexity and the history of NLP.

### Journalism examples

- <a href="http://doctors.ajc.com/part_1_license_to_betray/">License to betray</a> – Finding word stems and roots to uncover abuse.
- <a href="https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/">Federal judge’s rulings favored companies in which he owned stock</a> – Finding all stock owned by judges in disclosure forms and comparing to caseloads.
- <a href="https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html">LAPD underreported serious assaults, skewing crime stats for 8 years</a> – Text classification analysis.

### The tools

- spaCy vs. NLTK
- NLTK launched in 2001, spaCy in 2015.
- NLTK has grown large and complex; common tasks often take several manual steps.
- spaCy is lean and modern, and can process some text 4x to 20x faster than NLTK.
- spaCy does **nearly** everything that NLTK does, but better.
- NLTK, however, is still the library of choice for sentiment analysis.

# Working with spaCy

## Step 1. Install spaCy

If this is the first time you are using spaCy on this computer, you must first run either the ```!conda install``` or the ```!pip install``` command:

In [None]:
## Conda install or...
# !conda install -c conda-forge spacy


In [None]:
## pip install
!pip install -U spacy


In [None]:
## import library.

import spacy

#### Troubleshoot here if problems with setup:
https://github.com/explosion/spacy-models

## Step 2. Install language model


In [None]:
!python -m spacy download en_core_web_sm

### Load the English model into an ```nlp``` pipeline
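
A minimal sketch of this step, assuming spaCy v3 and the `en_core_web_sm` model from Step 2. The fallback to `spacy.blank("en")` is an addition for illustration only: it keeps the cell running when the model is missing, but a blank pipeline only tokenizes (no tagging, parsing or NER).

```python
import spacy

try:
    # Load the small English pipeline downloaded in Step 2.
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # spacy.load raises OSError when the model package is not installed.
    # Assumption for this sketch: fall back to a tokenizer-only pipeline.
    nlp = spacy.blank("en")

print(type(nlp))       # a spaCy Language subclass for English
print(nlp.pipe_names)  # component names; an empty list for a blank pipeline
```

Calling `nlp(some_text)` on a string then returns a spaCy doc object.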

In [None]:
## build nlp pipeline (an object that will tokenize, parse and run NER for us)


In [None]:
## what type of object is nlp


## Step 3. Text analysis

In [None]:
### Sample English text:
## each line ends with a space before the backslash so joined lines don't run words together
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. \
Mount Washington, which is really Agiocochook, is the highest peak in the Northeastern United States \
at 6,288.2 ft and the most topographically prominent mountain east \
of the Mississippi River. It's not in Mississippi.
'''

### Tokenize our text

- Tokenizing is always the first step in text analysis. 
- It breaks all text into isolated but related units (including spaces, symbols, punctuation, numbers, words etc.)
- However, it retains the connection between all the words, sentences, and paragraphs.
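
The points above can be sketched on a single sentence. This example assumes a blank English pipeline (`spacy.blank("en")`), which needs no model download and contains only the tokenizer; the names `blank_nlp`, `sample` and `sample_doc` are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no model download).
blank_nlp = spacy.blank("en")

sample = "Microsoft bought Skype for $8.5 billion."
sample_doc = blank_nlp(sample)

# Each token is one unit: a word, number, symbol or punctuation mark.
tokens = [token.text for token in sample_doc]
print(tokens)

# Tokenization is non-destructive: each token keeps its trailing
# whitespace, so the original text can be rebuilt exactly.
rebuilt = "".join(token.text_with_ws for token in sample_doc)
print(rebuilt == sample)

# Tokens also keep their position: index in the doc and character offset.
for token in sample_doc:
    print(token.i, token.idx, repr(token.text))
```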

In [None]:
## let's run the nlp function and create a spacy doc


In [None]:
## what type of data is it?


In [None]:
## show each token


### Stop Words

- These are common words that add no additional meaning to our analysis.
- Words like ```the```, ```and``` and ```any```.
- spaCy has just over 320 ```stop words``` in its default library.
- Read more on <a href="https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47">stop words</a>
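
One possible way to inspect and edit the stop-word list, sketched on a blank English pipeline (an assumption; the default list ships with spaCy's language data, so no model is needed). The name `stop_nlp` is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough to reach the default list.
stop_nlp = spacy.blank("en")
stop_words = stop_nlp.Defaults.stop_words  # a Python set of strings

print(len(stop_words))                # size of the default list
print(stop_nlp.vocab["the"].is_stop)  # check a single word

# Add a custom stop word: update the set and flag the vocab entry.
stop_words.add("lol")
stop_nlp.vocab["lol"].is_stop = True

# Remove a word that is relevant to the analysis.
stop_words.discard("empty")
stop_nlp.vocab["empty"].is_stop = False

print(stop_nlp.vocab["lol"].is_stop, stop_nlp.vocab["empty"].is_stop)
```

Both steps are used for each change because the set drives the default list while the vocab flag is what `token.is_stop` reads. Note that the set is shared by the language class, so edits last for the rest of the Python session.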

In [None]:
## show all default stop words


In [None]:
## check if a word (have, near, be) is a stop word 


In [None]:
## how many stop words do we have?


In [None]:
## Add your own stop word


In [None]:
## CHECK IF 'lol' is a stop word


In [None]:
## how many stop words do we have now?


In [None]:
## Remove a stop word from list because it is relevant.
## notice the word "empty" is a stop word.


In [None]:
## CHECK IF 'empty' is a stop word


### Parts of speech



In [None]:
## print each token with its part of speech


## Step 4. Named Entity Recognition (NER)

#### spaCy easily returns the words that matter to us, like names of companies, people, places, art works, numbers, etc.

- ```.ents``` ------------> All entities found in the spaCy doc object.

- ```ent.text``` ------------> The actual text.

- ```ent.label``` ------------> A numeric code for the entity.

- ```ent.label_``` ------------> The word's entity category.

- ```spacy.explain(ent.label_)``` ---------> A description of the category.
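
The attributes listed above can be tried on a small example. To stay runnable without a downloaded model, this sketch assumes spaCy v3 and uses a rule-based `entity_ruler` on a blank pipeline as a stand-in for the trained NER component; the patterns and the names `ner_nlp` and `ner_doc` are made up for illustration.

```python
import spacy

# Assumptions: spaCy v3, with a rule-based entity_ruler standing in
# for the statistical NER model.
ner_nlp = spacy.blank("en")
ruler = ner_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Microsoft"},
    {"label": "ORG", "pattern": "Skype"},
    {"label": "GPE", "pattern": "Seattle"},
])

ner_doc = ner_nlp("Microsoft, based near Seattle, bought Skype.")

for ent in ner_doc.ents:
    # text, numeric ID, readable label, and the label's description
    print(ent.text, ent.label, ent.label_, spacy.explain(ent.label_))

found = [(ent.text, ent.label_) for ent in ner_doc.ents]
```

With `en_core_web_sm` loaded instead, the same attributes are available on the entities the model predicts.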




In [None]:
## find all entities

    

In [None]:
## find all entities with their label

    

In [None]:
## find all entities with their label and label descriptors


### Create a CSV that holds all the organizations/companies in a document
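
A rough sketch of the steps in this section, using plain Python lists in place of spaCy output so it runs on its own. The sample values and the `"entity"`/`"label"` keys are assumptions for illustration; `orgs_only` matches the name used in the CSV cell further down.

```python
# Hypothetical sample data standing in for
# [ent.text for ent in doc.ents] and [ent.label_ for ent in doc.ents].
entities = ["Microsoft", "Skype Technologies", "Seattle", "Skype"]
labels = ["ORG", "ORG", "GPE", "ORG"]

# Version 1: pair the two lists up with a for loop.
rows_loop = []
for entity, label in zip(entities, labels):
    rows_loop.append({"entity": entity, "label": label})

# Version 2: the same result with a list comprehension.
rows = [{"entity": ent, "label": lab} for ent, lab in zip(entities, labels)]

# Keep only organizations, and force plain strings for pandas/CSV.
orgs_only = [
    {"entity": str(row["entity"]), "label": str(row["label"])}
    for row in rows
    if row["label"] == "ORG"
]
print(orgs_only)
```

A list of dictionaries with the same keys maps directly onto DataFrame rows, which is why this shape is convenient for the CSV step.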

In [None]:
## find all entities and place in a list using list comprehension


In [None]:
### Turn the two lists into a dictionary using a for loop


In [None]:
### Turn the two lists into a dictionary using a list comprehension


In [None]:
## the previous lists hold all entities. 
## let's narrow them down to the orgs/companies


In [None]:
## What data types are these?


In [None]:
## Let's make sure all the key and value pairs are strings 
## instead of spacy objects so we can move them into a df and csv


In [None]:
## confirm key, value are both strings


### Let's deduplicate
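
One way to deduplicate, assuming the entities are held as a list of dictionaries (the notebook's exact data shape is not shown, so `org_rows` and its keys are hypothetical). The loop keeps the first occurrence of each entity/label pair and preserves the original order.

```python
# Hypothetical input: entity dictionaries with one repeated entry.
org_rows = [
    {"entity": "Microsoft", "label": "ORG"},
    {"entity": "Skype", "label": "ORG"},
    {"entity": "Microsoft", "label": "ORG"},
]

seen = set()
unique_rows = []
for row in org_rows:
    key = (row["entity"], row["label"])
    if key not in seen:  # first time we meet this pair
        seen.add(key)
        unique_rows.append(row)

print(unique_rows)
```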

In [None]:
## deduplicate a dictionary


In [None]:
## import pandas
import pandas as pd

In [None]:
## use pandas to write to a csv file
filename = "test_entities.csv"
df = pd.DataFrame(orgs_only) ## turn our entity data into a dataframe, which we'll call df
df.to_csv(filename, encoding='utf-8', index=False)

print(f"{filename} is in your project folder!")

In [None]:

## function to find entities
def show_entities(my_text):
    """Print each entity in a spaCy doc with its label and description."""
    each_token = "Token"
    entity_type = "Entity"
    entity_def = "Entity Defined"
    print(f"{each_token:30} {entity_type:15} {entity_def}")
    if my_text.ents:
        ## loop over the doc passed in, not the global doc
        for word in my_text.ents:
            print(f"{word.text:30} {word.label_:15} {spacy.explain(word.label_)}")
    else:
        print("There are no entities in this text")


In [None]:
words = [token.text.replace('\xa0', ' ') for token in doc if not token.is_stop and not token.is_punct]
print(words)

In [None]:
## show entities in my english sentence
show_entities(doc)

## Word Frequency

In [None]:
from collections import Counter  ## a class that helps us count frequencies
## Counter(some_variable)
## variable_name.most_common(some_number)

## remove stop words and punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct and token.text != '\xa0']
word_freq = Counter(words)
common_words = word_freq.most_common(25)  ## the 25 most frequent words
print(common_words)

## Install other languages
#### Other languages can be found at https://spacy.io/usage/models

#### Disclaimer: Language models are built by open source communities. English and German are the most advanced language models.
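
A hedged sketch of setting up pipelines for other languages. It assumes spaCy v3 and the small Spanish and Chinese models named `es_core_news_sm` and `zh_core_web_sm`; the fallback to a blank, tokenizer-only pipeline and the names `MODELS` and `pipelines` are additions for illustration.

```python
import spacy
from spacy.util import is_package

# Small pipelines listed in the spaCy models directory.
MODELS = {"es": "es_core_news_sm", "zh": "zh_core_web_sm"}

pipelines = {}
for lang_code, model_name in MODELS.items():
    if is_package(model_name):
        pipelines[lang_code] = spacy.load(model_name)
    else:
        # Assumption for this sketch: fall back to a tokenizer-only pipeline.
        print(f"Missing {model_name}; run: python -m spacy download {model_name}")
        pipelines[lang_code] = spacy.blank(lang_code)

print({code: type(pipe).__name__ for code, pipe in pipelines.items()})
```

A blank pipeline can still tokenize, but parts of speech and entities need the downloaded model.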

### Spanish language model

In [None]:
## download the Spanish model (e.g. es_core_news_sm) with !python -m spacy download


In [None]:
## load the model and create the nlp pipeline


In [None]:
### Sample Spanish Text (sorry!)
stext = """
El 10 de mayo de 2011, Microsoft anunció la adquisición de Skype Technologies, \
creador del servicio de VoIP Skype, por $ 8.5 mil millones. \
Microsoft tiene su sede cerca de Seattle Washington, mientras que Skype permanece en Palo Alto, California. \
Sandeep Junnarkar obtuvo esto de Wikipedia. \
Pero preferiría ir a París, Francia, para ver la Mona Lisa en el Louvre. \
Mount Washington, que en realidad es Agiocochook, es el pico más alto del noreste de Estados Unidos \
a 6.288,2 pies y la montaña más prominente topográficamente al este \
del río Mississippi.
"""

In [None]:
## tokenize and show parts of speech for each token


In [None]:
## show the tokens


In [None]:
## show entities


### Chinese language model

In [None]:
## download the Chinese model (e.g. zh_core_web_sm) with !python -m spacy download


In [None]:
## load the model and create the nlp pipeline


In [None]:
### Sample Chinese Text (sorry!)
ctext = '''
2011年5月10日，微軟宣布收購Skype Technologies，
VoIP服務的創造者，價格為85億美元。
微軟總部位於華盛頓州西雅圖市附近，而Skype仍位於加利福尼亞州帕洛阿爾托。\
Sandeep Junnarkar從Wikipedia獲得了此信息。\
但他寧願前往法國巴黎在羅浮宮看《蒙娜麗莎》。\
華盛頓山（實際上是Agiocochook）是美國東北部的最高峰\
位於6,288.2英尺，是東面地形最突出的山脈\
密西西比河。
'''

In [None]:
## create a spacy doc object


In [None]:
## run our function!
