# AI - Natural Language Processing
## Part 2 - Functionalize NLP for entities


# ONLY IF NEEDED

## Step 1. Install Spacy

If this is the first time you are using spaCy on this computer, you must first install it with either ```!conda install``` or ```!pip install```:

### SKIP FOR COLAB
Run this cell only for ANACONDA

In [None]:
conda install -c conda-forge spacy

#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

## Step 2. Install language model


### ANACONDA ONLY

In [None]:
pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
conda install -c conda-forge spacy-model-en_core_web_sm
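
Before moving on, it can help to confirm that spaCy and the language model are visible to the kernel you are running. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and not part of spaCy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the current interpreter can find the package."""
    return importlib.util.find_spec(package_name) is not None


# Check spaCy itself and the small English model used in this notebook.
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, rerun the matching install cell above and restart the kernel.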

# Import libs

In [None]:
## import libraries
import pandas as pd
import spacy
import glob

In [None]:
## import the language model
import en_core_web_sm

In [None]:
## build the nlp pipeline (a callable that tokenizes, parses, and runs named entity recognition for us)
nlp = en_core_web_sm.load()

## Import hearings
Download <a href="https://drive.google.com/file/d/1EUYLeHpHAAW2MGsrT6_jov9cJ-IuDLg-/view?usp=sharing">this Senate hearing</a> and turn it into a spaCy doc.

Create a spreadsheet with columns for the entity, the label, and its meaning.

(Remember, you will also have to draw on elements from previous weeks' lessons to accomplish this.)

In [15]:
## pull hearing into notebook
hearing = glob.glob("*hearing*")
hearing

['senate-hearing.txt']

In [16]:
## read file and place in text
with open(hearing[0], "r") as f:
    print(type(f))
    all_text = f.read()
    print(type(all_text))

<class '_io.TextIOWrapper'>
<class 'str'>


In [17]:
## create a reusable function to read a globbed list
def tokenize_file(file):
    '''
    provide the list returned by glob.glob; only the first file is read
    returns the text as a spaCy doc
    '''
    # open the first file in the list
    with open(file[0], "r") as text:
        # turn io object into string
        all_text = text.read()
    # run through pipeline to tokenize and return the doc
    return nlp(all_text)
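
The function above expects the list that `glob.glob` returns and always reads its first element. As a hedged alternative, the sketch below takes a single path, reads it with an explicit encoding, and receives the pipeline as an argument rather than relying on a global. The name `read_and_process`, the `pipeline` parameter, and the UTF-8 default are assumptions made for illustration.

```python
from pathlib import Path


def read_and_process(path, pipeline, encoding="utf-8"):
    """Read one text file and pass its contents through `pipeline`.

    `pipeline` is any callable that accepts a string, such as the
    `nlp` object built earlier in this notebook.
    """
    text = Path(path).read_text(encoding=encoding)
    return pipeline(text)
```

In this notebook the call would look like `doc = read_and_process(hearing[0], nlp)`. Passing the pipeline in makes it easy to try the function with a simple stand-in such as `str.upper` before loading a model.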

In [18]:
## save senate hearing as nlp doc
doc = tokenize_file(hearing)
doc

[Senate Hearing 118-22]
[From the U.S. Government Publishing Office]


                                                        S. Hrg. 118-22

                   IMPLEMENTING IIJA: PERSPECTIVES ON
          THE DRINKING WATER AND WASTEWATER INFRASTRUCTURE ACT


                                HEARING

                               BEFORE THE

                              COMMITTEE ON
                      ENVIRONMENT AND PUBLIC WORKS

                          UNITED STATES SENATE

                    ONE HUNDRED EIGHTEENTH CONGRESS

                             FIRST SESSION

                               __________

                             MARCH 15, 2023

                               __________

  Printed for the use of the Committee on Environment and Public Works
  
[GRAPHIC NOT AVAILABLE IN TIFF FORMAT]  


        Available via the World Wide Web: http://www.govinfo.gov
        
                               __________

                                
                

In [None]:
## create function to return a dataframe of entities, entity labels, and label meanings
def cat_entities(doc):
    '''
    provide text as a spaCy doc object
    returns a df with one row per entity: word, label, meaning
    '''
    ent_list = []
    if doc.ents:
        for ent in doc.ents:
            temp_dict = {"word": ent.text,
                         "label": ent.label_,
                         "meaning": spacy.explain(ent.label_)}
            ent_list.append(temp_dict)
    else:
        print("There are no entities in this text")

    ## name the columns explicitly so an empty result still has them
    return pd.DataFrame(ent_list, columns=["word", "label", "meaning"])
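
The row-building step in `cat_entities` can be tried on its own, without loading a model. The sketch below is an illustration under stated assumptions: `FakeEntity` is a hypothetical stand-in that models only the `text` and `label_` attributes the function reads, `entity_rows` is a hypothetical helper, and the `meanings` dictionary plays the role that `spacy.explain` plays in the notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity: only the two attributes
# that cat_entities reads are modelled here.
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])


def entity_rows(entities, explain):
    """Build one dict per entity, mirroring the columns of cat_entities."""
    return [
        {"word": ent.text, "label": ent.label_, "meaning": explain(ent.label_)}
        for ent in entities
    ]


# Illustrative label descriptions; in the notebook spacy.explain supplies these.
meanings = {
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
}

sample = [FakeEntity("Senate", "ORG"), FakeEntity("Philadelphia", "GPE")]
rows = entity_rows(sample, meanings.get)
for row in rows:
    print(row)
```

A list of dictionaries that share the same keys is a shape `pd.DataFrame` accepts directly, which is why `cat_entities` collects its rows this way.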

In [19]:
## test it: find all entities in the hearing
df = cat_entities(doc)
df

Unnamed: 0,word,label,meaning
0,Senate,ORG,"Companies, agencies, institutions, etc."
1,118,CARDINAL,Numerals that do not fall under another type
2,the U.S. Government Publishing Office,ORG,"Companies, agencies, institutions, etc."
3,118,CARDINAL,Numerals that do not fall under another type
4,UNITED STATES,ORG,"Companies, agencies, institutions, etc."
...,...,...,...
1454,Philadelphia,GPE,"Countries, cities, states"
1455,West Virginia,GPE,"Countries, cities, states"
1456,the years,DATE,Absolute or relative dates or periods
1457,Whereupon,ORG,"Companies, agencies, institutions, etc."
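
The exercise asks for a spreadsheet with the entity, the label, and its meaning. With the dataframe in hand, `df.to_csv("entities.csv", index=False)` is one way to save it. The sketch below shows the same idea with only the standard library, so it runs without pandas; the helper `write_entity_csv` and the file name `entities.csv` are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_entity_csv(rows, path):
    """Write entity dicts to a CSV file with word, label and meaning columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["word", "label", "meaning"])
        writer.writeheader()
        writer.writerows(rows)


# Two rows copied from the output above, used here only as sample input.
sample_rows = [
    {"word": "Senate", "label": "ORG",
     "meaning": "Companies, agencies, institutions, etc."},
    {"word": "Philadelphia", "label": "GPE",
     "meaning": "Countries, cities, states"},
]

# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    out_path = Path(folder) / "entities.csv"  # hypothetical file name
    write_entity_csv(sample_rows, out_path)
    print(out_path.read_text(encoding="utf-8"))
```

Meanings that contain commas are quoted automatically by the `csv` module, so the file opens cleanly in a spreadsheet program.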


In [20]:
## search for people only
df.query("label == 'PERSON'")

Unnamed: 0,word,label,meaning
16,THOMAS R. CARPER,PERSON,"People, including fictional"
23,CYNTHIA M. LUMMIS,PERSON,"People, including fictional"
26,MARKWAYNE MULLIN,PERSON,"People, including fictional"
31,EDWARD J. MARKEY,PERSON,"People, including fictional"
37,ROGER WICKER,PERSON,"People, including fictional"
...,...,...,...
1421,Aaron,PERSON,"People, including fictional"
1432,Carper,PERSON,"People, including fictional"
1448,Billy Graham,PERSON,"People, including fictional"
1450,Sheila,PERSON,"People, including fictional"
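
The filtered table lists one row per mention, so the same person can appear many times. A natural next step is to count mentions per name. On the dataframe, `df.query("label == 'PERSON'")["word"].value_counts()` would do this. The sketch below shows the same logic with only the standard library; `count_by_label` is a hypothetical helper, and the rows are made-up sample input, not results from the hearing.

```python
from collections import Counter


def count_by_label(rows, label):
    """Count how often each entity text appears under the given label."""
    return Counter(row["word"] for row in rows if row["label"] == label)


# Made-up sample rows for illustration only.
rows = [
    {"word": "Carper", "label": "PERSON"},
    {"word": "Philadelphia", "label": "GPE"},
    {"word": "Carper", "label": "PERSON"},
    {"word": "Sheila", "label": "PERSON"},
]

people = count_by_label(rows, "PERSON")
print(people.most_common())
```

Note that exact-text counting treats "THOMAS R. CARPER" and "Carper" as different entries, so names may need to be normalized before counts are meaningful.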
