# (5B) Named Entity Recognition

In this second installment of the NLP module, let's look at a more advanced problem: how can we detect places and people in text?

Then, we'll practice mapping geographic patterns in text.

-----

## How can I identify “named entities” (e.g. proper nouns of places and people)?


In [1]:
# Manzanar loves Named Entities in Chapter 35

manzanar = """Despite everything, every sports event, concert, and whatnot was happening at the same time. L.A.
marathoners slouched by the droves across the finish line at the Coliseum. At the Rose Bowl: UCLA versus
USC; the Bruin mascot had been carried off the field with heat stroke, and the Trojan horse was tied up
after throwing its sweaty rider. The Clippers were attempting a comeback in overtime at the Sports Arena.
It was the end of the seventhinning stretch, and Nomo fans at Chavez Ravine hunkered down with their cold
beers and Dodger dogs. Scottie Pippen fouled Shaq who sank a free throw for the Lakers at the Forum in the
last seconds. The Trekkie convention warped into five at the L.A. Convention Center. Bud Girls paraded
between boxing matches at the Olympic Auditorium. Plácido Domingo belted Rossini at the Dorothy Chandler
under the improbable abstract/minimal/baroque direction of Peter Sellers. At the Shrine, executive
producer Richard Sakai accepted an Oscar for the movie version of The Simpsons. The helicopter landed for
the 944th time on the set of Miss Saigon at the Ahmanson, and Beauty smacked the Beast at the Shubert.
Chinese housewives went for the big stakes in pai gow in the Asian room at the Bicycle Club. Live-laughter
sitcom audiences and boisterous crowds for the daytime and nighttime talks filled every available studio
in Hollywood and Burbank. Thousands of fans melted away with Julio Iglesias at the Universal Amphitheater."""

In [2]:
# ha ha

nasa = """Ryan Heuser cannot wait until he graduates from Stanford University.
He will take up position as Head Engineer of NASA's secret "Send Literary Critics to Mars" mission."""

In [3]:
# do some imports
import os
import nltk
import pandas as pd
pd.set_option('display.max_colwidth', 0)

### (1) NLTK

We've used NLTK before for word tokenizing, sentence tokenizing, and part-of-speech tagging. Now let's use it again for named entity recognition, via its function `ne_chunk()`, which takes the part-of-speech-tagged words of a sentence as its input.

In [4]:
### Step 1: Split sentences

# tokenize sentences
sentences = nltk.sent_tokenize(nasa)
sentences

['Ryan Heuser cannot wait until he graduates from Stanford University.',
 'He will take up position as Head Engineer of NASA\'s secret "Send Literary Critics to Mars" mission.']

In [5]:
### Step 2: Get words from each sentence

# loop over sentences
for sent in sentences:
    # get words
    sent_words = nltk.word_tokenize(sent)
    print(sent_words)
    
    # empty line
    print()

['Ryan', 'Heuser', 'can', 'not', 'wait', 'until', 'he', 'graduates', 'from', 'Stanford', 'University', '.']

['He', 'will', 'take', 'up', 'position', 'as', 'Head', 'Engineer', 'of', 'NASA', "'s", 'secret', '``', 'Send', 'Literary', 'Critics', 'to', 'Mars', "''", 'mission', '.']



In [8]:
### Step 3: Get POS for each sentence

# loop over sentences
for sent in sentences:
    # get words
    sent_words = nltk.word_tokenize(sent)
    
    # get POS
    sent_pos = nltk.pos_tag(sent_words)
    print(sent_pos)
    
    # empty line
    print()

[('Ryan', 'NNP'), ('Heuser', 'NNP'), ('can', 'MD'), ('not', 'RB'), ('wait', 'VB'), ('until', 'IN'), ('he', 'PRP'), ('graduates', 'VBZ'), ('from', 'IN'), ('Stanford', 'NNP'), ('University', 'NNP'), ('.', '.')]

[('He', 'PRP'), ('will', 'MD'), ('take', 'VB'), ('up', 'RP'), ('position', 'NN'), ('as', 'IN'), ('Head', 'NNP'), ('Engineer', 'NNP'), ('of', 'IN'), ('NASA', 'NNP'), ("'s", 'POS'), ('secret', 'JJ'), ('``', '``'), ('Send', 'NNP'), ('Literary', 'NNP'), ('Critics', 'NNPS'), ('to', 'TO'), ('Mars', 'NNP'), ("''", "''"), ('mission', 'NN'), ('.', '.')]



In [9]:
### Step 4: Get named entity 'chunks' for each sentence

# loop over sentences
for sent in sentences:
    # get words
    sent_words = nltk.word_tokenize(sent)
    
    # get POS
    sent_pos = nltk.pos_tag(sent_words)
    
    # get named entity 'chunks'
    chunks = nltk.ne_chunk(sent_pos)
    print(chunks)
    
    # empty line
    print()


(S
  (PERSON Ryan/NNP)
  (PERSON Heuser/NNP)
  can/MD
  not/RB
  wait/VB
  until/IN
  he/PRP
  graduates/VBZ
  from/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  ./.)

(S
  He/PRP
  will/MD
  take/VB
  up/RP
  position/NN
  as/IN
  (PERSON Head/NNP Engineer/NNP)
  of/IN
  (ORGANIZATION NASA/NNP)
  's/POS
  secret/JJ
  ``/``
  Send/NNP
  Literary/NNP
  Critics/NNPS
  to/TO
  (PERSON Mars/NNP)
  ''/''
  mission/NN
  ./.)



In [11]:
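import nltk
# If any step above raises a LookupError, NLTK is missing a data resource.
# Download the resource named in the error message, for example:
#nltk.download('averaged_perceptron_tagger')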

In [17]:
### Step 5: Get information about the named entity chunks

# loop over sentences
for sent in sentences:
    # get words
    sent_words = nltk.word_tokenize(sent)
    
    # get POS
    sent_pos = nltk.pos_tag(sent_words)
    
    # get named entity 'chunks'
    chunks = nltk.ne_chunk(sent_pos)

    # loop over each one
    for chunk in chunks:
        #print(chunk)
        #continue

        # if the chunk has a 'label' attribute (is a named entity)
        if hasattr(chunk,'label'):

            # get the label
            label = chunk.label()

            # print
            print(label)
            print(list(chunk))
            print()

PERSON
[('Ryan', 'NNP')]

PERSON
[('Heuser', 'NNP')]

ORGANIZATION
[('Stanford', 'NNP'), ('University', 'NNP')]

PERSON
[('Head', 'NNP'), ('Engineer', 'NNP')]

ORGANIZATION
[('NASA', 'NNP')]

PERSON
[('Mars', 'NNP')]
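
Why does the `hasattr(chunk, 'label')` check work? `ne_chunk()` hands back a tree whose children are either plain `(word, tag)` tuples or nested `Tree` objects, and only the `Tree` objects have a `.label()`. Here's a minimal sketch using a hand-built tree, so no model is needed. The name `toy_chunks` and its contents are made up for illustration; they are not real `ne_chunk()` output.

```python
from nltk.tree import Tree

# A hand-built stand-in for the kind of tree nltk.ne_chunk() returns (illustrative only)
toy_chunks = Tree('S', [
    Tree('PERSON', [('Ryan', 'NNP')]),
    ('graduates', 'VBZ'),
    ('from', 'IN'),
    Tree('ORGANIZATION', [('Stanford', 'NNP'), ('University', 'NNP')]),
])

toy_entities = []
for chunk in toy_chunks:
    # plain tokens are tuples (no .label); named entities are Trees (they have .label)
    if hasattr(chunk, 'label'):
        words = [word for word, tag in chunk]
        toy_entities.append((chunk.label(), ' '.join(words)))

print(toy_entities)
```

The tuples for 'graduates' and 'from' are skipped; only the two labelled subtrees end up in `toy_entities`.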



In [22]:
x = [('Stanford', 'NNP'), ('University', 'NNP')]

l = []
for word,tag in x:
    l.append(word)
    
" ".join(l)

'Stanford University'

In [26]:
l = ['Hello','this','is','Ryan']

'\n'.join(l)

'Hello\nthis\nis\nRyan'

In [27]:
### Step 6: Get the label and the words of each named entity chunk

# loop over sentences
for sent in sentences:
    # get words
    sent_words = nltk.word_tokenize(sent)
    
    # get POS
    sent_pos = nltk.pos_tag(sent_words)
    
    # get named entity 'chunks'
    chunks = nltk.ne_chunk(sent_pos)

    # loop over each one
    for chunk in chunks:

        # if the chunk has a 'label' attribute (is a named entity)
        if hasattr(chunk,'label'):

            # get the label
            label = chunk.label()

            # get the words
            chunk_words = []
            for word,tag in chunk:
                chunk_words.append(word)

            # make a string version
            chunk_words_str = ' '.join(chunk_words)
            
            print(label,':',chunk_words_str)

PERSON : Ryan
PERSON : Heuser
ORGANIZATION : Stanford University
PERSON : Head Engineer
ORGANIZATION : NASA
PERSON : Mars


In [28]:

# let's make a function for this
def ner_nltk(string):
    """
    Using NLTK, this function takes any string, identifies the named entities in it,
    and returns a list of dictionaries, with one dictionary per named entity,
    where each dictionary looks like this:
    
    {
        'type': 'PERSON',
        'entity': 'Ryan',
        '_sent_num': 1,
        '_sent': 'Ryan Heuser cannot wait until he graduates from Stanford University.'
    }
    """
    
    # clean string
    string = string.strip().replace('\n',' ')
    
    # sentence tokenize the string
    sentences = nltk.sent_tokenize(string)
    
    # set empty list for output
    output_list = []
    
    # loop over each sentence
    sent_num = 0
    for sent in sentences:
        # add 1 to sent num
        sent_num+=1
        
        # default this to False (see why below)
        added_sent_already = False
        
        # we need to get the words
        sent_words = nltk.word_tokenize(sent)
        
        # parts of speech
        sent_pos = nltk.pos_tag(sent_words)
        
        # then "chunk"
        chunks = nltk.ne_chunk(sent_pos)
        
        # loop over chunks...
        for chunk in chunks:
            # if the chunk has a 'label' attribute (is a named entity)
            if hasattr(chunk,'label'):
                
                # get the label
                label = chunk.label()
                
                # get the words in the chunk
                chunk_words = []
                for word,pos in chunk:
                    chunk_words.append(word)
                
                # make a string version
                chunk_words_str = ' '.join(chunk_words)
                
                # make a result dictionary
                result_dict = {}
                
                # add NER info
                result_dict['type'] = label
                result_dict['entity'] = chunk_words_str
                
                ### optional: add sent info
                result_dict['_sent_num'] = sent_num
                # add a string of the sentence, but only once per sentence
                if not added_sent_already:
                    result_dict['_sent'] = sent
                    added_sent_already = True
                else:
                    result_dict['_sent'] = ''
                ###
                
                # add result dictionary to output list
                output_list.append(result_dict)
    
    # return list of dictionaries
    return output_list

In [29]:
help(ner_nltk)

Help on function ner_nltk in module __main__:

ner_nltk(string)
    Using NLTK, this function takes any string, identifies the named entities in it,
    and returns a list of dictionaries, with one dictionary per named entity,
    where each dictionary looks like this:
    
    {
        'type': 'PERSON',
        'entity': 'Ryan',
        '_sent_num': 1,
        '_sent': 'Ryan Heuser cannot wait until he graduates from Stanford University.'
    }



In [32]:
nltk_ner_ld = ner_nltk(nasa)
nltk_ner_ld

[{'type': 'PERSON',
  'entity': 'Ryan',
  '_sent_num': 1,
  '_sent': 'Ryan Heuser cannot wait until he graduates from Stanford University.'},
 {'type': 'PERSON', 'entity': 'Heuser', '_sent_num': 1, '_sent': ''},
 {'type': 'ORGANIZATION',
  'entity': 'Stanford University',
  '_sent_num': 1,
  '_sent': ''},
 {'type': 'PERSON',
  'entity': 'Head Engineer',
  '_sent_num': 2,
  '_sent': 'He will take up position as Head Engineer of NASA\'s secret "Send Literary Critics to Mars" mission.'},
 {'type': 'ORGANIZATION', 'entity': 'NASA', '_sent_num': 2, '_sent': ''},
 {'type': 'PERSON', 'entity': 'Mars', '_sent_num': 2, '_sent': ''}]

In [33]:
nltk_ner_df = pd.DataFrame(nltk_ner_ld)
nltk_ner_df

| | _sent | _sent_num | entity | type |
|---|---|---|---|---|
| 0 | Ryan Heuser cannot wait until he graduates from Stanford University. | 1 | Ryan | PERSON |
| 1 | | 1 | Heuser | PERSON |
| 2 | | 1 | Stanford University | ORGANIZATION |
| 3 | He will take up position as Head Engineer of NASA's secret "Send Literary Critics to Mars" mission. | 2 | Head Engineer | PERSON |
| 4 | | 2 | NASA | ORGANIZATION |
| 5 | | 2 | Mars | PERSON |


In [34]:
nltk_ner_ld = ner_nltk(manzanar)
nltk_ner_ld[0]

{'type': 'GPE',
 'entity': 'Coliseum',
 '_sent_num': 2,
 '_sent': 'L.A. marathoners slouched by the droves across the finish line at the Coliseum.'}

In [35]:
nltk_ner_df = pd.DataFrame(nltk_ner_ld)
nltk_ner_df

| | _sent | _sent_num | entity | type |
|---|---|---|---|---|
| 0 | L.A. marathoners slouched by the droves across the finish line at the Coliseum. | 2 | Coliseum | GPE |
| 1 | At the Rose Bowl: UCLA versus USC; the Bruin mascot had been carried off the field with heat stroke, and the Trojan horse was tied up after throwing its sweaty rider. | 3 | Rose | ORGANIZATION |
| 2 | | 3 | USC | ORGANIZATION |
| 3 | | 3 | Bruin | GPE |
| 4 | | 3 | Trojan | GPE |
| 5 | The Clippers were attempting a comeback in overtime at the Sports Arena. | 4 | Clippers | ORGANIZATION |
| 6 | | 4 | Sports Arena | ORGANIZATION |
| 7 | It was the end of the seventhinning stretch, and Nomo fans at Chavez Ravine hunkered down with their cold beers and Dodger dogs. | 5 | Nomo | ORGANIZATION |
| 8 | | 5 | Chavez Ravine | ORGANIZATION |
| 9 | | 5 | Dodger | GPE |


In [36]:
# Save this data!
nltk_ner_df.to_excel('data.ner_nltk.xlsx')  # .xlsx, because recent pandas versions can no longer write legacy .xls files
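
If writing Excel files gives you trouble, the same list of dictionaries can be saved as plain CSV, which Excel, Google Sheets, and Tableau can all open. Here's a minimal sketch using only Python's standard library. The helper name `dicts_to_csv` and the sample rows are assumptions made for illustration; they aren't part of pandas or NLTK.

```python
import csv
import io

def dicts_to_csv(rows, fieldnames):
    """Turn a list of dictionaries into CSV text (hypothetical helper for this notebook)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()       # first line: the column names
    writer.writerows(rows)     # one line per dictionary
    return buffer.getvalue()

# two rows in the same shape that ner_nltk() returns (trimmed for the example)
example_rows = [
    {'type': 'PERSON', 'entity': 'Ryan', '_sent_num': 1},
    {'type': 'ORGANIZATION', 'entity': 'Stanford University', '_sent_num': 1},
]

csv_text = dicts_to_csv(example_rows, ['_sent_num', 'type', 'entity'])
print(csv_text)

# to save to disk instead of printing:
# with open('data.ner_nltk.csv', 'w', newline='', encoding='utf-8') as f:
#     f.write(csv_text)
```

If you already have a dataframe, pandas' own `to_csv()` method does the same job in one line.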

### (2) Polyglot (for non-English, non-French, non-German text)

[Polyglot](https://polyglot.readthedocs.io/) is a really cool package, with an interface modeled on TextBlob, which supports up to 140 different languages depending on the NLP task. This increase in linguistic range comes at a cost in accuracy, however. The tool was trained using Wikipedia as a Rosetta Stone, calibrating languages' models against each other by using the "same" articles in those languages. The other costs of using Polyglot: the documentation isn't that great, and it doesn't seem to be actively updated.

*Installation is also kind of a pain in the neck.* I recommend installing this only if you are planning to work with non-English, non-French, non-German text. To do so, paste the following into Terminal:

    conda install -c conda-forge pyicu
    pip install pycld2
    pip install morfessor
    pip install polyglot
    polyglot download LANG:en   # for english
    polyglot download LANG:es   # for spanish (optional)
    polyglot download LANG:xx   # where xx is the two-letter language code
   
See [the website](https://polyglot.readthedocs.io/) for more details.

In [37]:
def ner_polyglot(string):
    """
    Using polyglot, this function takes any string, identifies the named entities in it,
    and returns a list of dictionaries, with one dictionary per named entity,
    where each dictionary looks like this:
    
    {
        'type': 'PERSON',
        'entity': 'Ryan',
        '_sent_num': 1,
        '_sent': 'Ryan Heuser cannot wait until he graduates from Stanford University.'
    }
    """    
    
    # let's try this...
    try:
        # to use polyglot, import its "Text" object:
        from polyglot.text import Text
    except ImportError:
        print('Polyglot not installed! To do so, follow the instructions above.')
        return
    # from here on we can assume that polyglot is imported
    
    # wrap that Text object around any string
    pg_text = Text(string)

    # make an output list
    output_list = []
    
    # loop over sentences (each sentence exposes its own .entities)
    sent_num = 0
    for sent in pg_text.sentences:
        sent_num+=1

        # loop over the entities
        added_sent_already = False
        for ent in sent.entities:
            # get the type
            ent_type = ent.tag

            # get the words
            ent_words = list(ent)

            # skip bogus entities whose first character is not a letter (e.g. punctuation)
            if not ent_words[0][0].isalpha():
                continue

            # make a string version
            ent_words_str = ' '.join(ent_words)

            # make a results dict
            result_dict = {}
            result_dict['_sent_num'] = sent_num
            if not added_sent_already:
                result_dict['_sent'] = str(sent)
                added_sent_already = True
            else:
                result_dict['_sent'] = ''
            
            result_dict['type']=ent_type
            result_dict['entity']=ent_words_str

            # add to output
            output_list.append(result_dict)
        
    return output_list

In [38]:
#pd.DataFrame(ner_polyglot(nasa))

In [39]:
# Run on Manzanar's paragraph
#pd.DataFrame(ner_polyglot(manzanar))

### (3) spaCy

[spaCy](http://spacy.io) bills itself as industrial-strength NLP. Of the three libraries covered here, it's the fastest, most powerful, and most accurate. It can also work on [several languages besides English](https://spacy.io/models). But it's also kinda ugly and confusing to use. I recommend using this only if you are working on hundreds of texts and feel extremely comfortable with all the things we've been doing so far.

To install:

    pip install spacy
    python -m spacy download en_core_web_sm

Here's an NER implementation.

In [48]:
def ner_spacy(string):
    """
    Using spacy, this function takes any string, identifies the named entities in it,
    and returns a list of dictionaries, with one dictionary per named entity,
    where each dictionary looks like this:
    
    {
        'type': 'PERSON',
        'entity': 'Ryan',
        '_sent_num': 1,
        '_sent': 'Ryan Heuser cannot wait until he graduates from Stanford University.'
    }
    """
    
    try:
        # import spacy
        import spacy
    except ImportError:
        print("spacy not installed. Please follow directions above.")
        return

    # clean string
    string = string.strip().replace('\n',' ').replace("’","'").replace("‘","'")
    
    # load its default English model
    nlp = spacy.load("en_core_web_sm")

    # create a spacy text object
    doc = nlp(string)
    
    # make an output list
    output_list = []

    # loop over sentences
    sent_num=0
    for sent in doc.sents:
        sent_num+=1
        added_sent_already = False

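        # loop over sentence's entities
        # (re-running nlp on the sentence alone makes start_char/end_char
        # offsets within this sentence, not within the whole string)
        sent_doc = nlp(str(sent))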
        for ent in sent_doc.ents:
            
            # make a result dict
            result_dict = {}
            
            # set sentence number
            result_dict['_sent_num'] = sent_num
            
            # store text too
            if not added_sent_already:
                result_dict['_sent'] = sent.text
                added_sent_already = True
            else:
                result_dict['_sent'] = ''
            
            # get type
            result_dict['type'] = ent.label_
            
            # get entity
            result_dict['entity'] = ent.text
            
            # get start char
            result_dict['start_char'] = ent.start_char
            
            # get end char
            result_dict['end_char'] = ent.end_char
            
            # add result_dict to output_list
            output_list.append(result_dict)
            
    # return output
    return output_list
            


In [49]:
pd.DataFrame(ner_spacy(nasa))

| | _sent | _sent_num | end_char | entity | start_char | type |
|---|---|---|---|---|---|---|
| 0 | Ryan Heuser cannot wait until he graduates from Stanford University. | 1 | 11 | Ryan Heuser | 0 | PERSON |
| 1 | | 1 | 67 | Stanford University | 48 | ORG |
| 2 | He will take up position as Head Engineer of NASA's secret "Send Literary Critics to Mars" mission. | 2 | 41 | Engineer | 33 | PERSON |
| 3 | | 2 | 49 | NASA | 45 | ORG |
| 4 | | 2 | 89 | Send Literary Critics to Mars | 60 | WORK_OF_ART |


In [50]:
spacy_ner_ld = ner_spacy(manzanar)
spacy_ner_ld[0]

{'_sent_num': 2,
 '_sent': 'L.A. marathoners slouched by the droves across the finish line at the Coliseum.',
 'type': 'GPE',
 'entity': 'Coliseum',
 'start_char': 70,
 'end_char': 78}
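
The `start_char` and `end_char` values are character offsets into the sentence stored in `_sent`, so you can slice with them to recover the entity or to show it in context. Here's a minimal sketch with the numbers from the result above. The helper name `entity_in_context` and its `width` parameter are made up for illustration.

```python
def entity_in_context(sentence, start_char, end_char, width=15):
    """Return the entity in [brackets] with up to `width` characters of context on each side."""
    left = sentence[max(0, start_char - width):start_char]
    entity = sentence[start_char:end_char]
    right = sentence[end_char:end_char + width]
    return f"{left}[{entity}]{right}"

sent = 'L.A. marathoners slouched by the droves across the finish line at the Coliseum.'

# slicing with the offsets gives back the entity text
print(sent[70:78])

# and a little surrounding context, handy when checking results by eye
print(entity_in_context(sent, 70, 78))
```

This only works against the sentence string itself: the offsets would not line up with the full `manzanar` text.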

In [51]:
spacy_ner_df = pd.DataFrame(spacy_ner_ld)
spacy_ner_df

| | _sent | _sent_num | end_char | entity | start_char | type |
|---|---|---|---|---|---|---|
| 0 | L.A. marathoners slouched by the droves across the finish line at the Coliseum. | 2 | 78 | Coliseum | 70 | GPE |
| 1 | At the Rose Bowl: UCLA versus USC; the Bruin mascot had been carried off the field with heat stroke, and the Trojan horse was tied up after throwing its sweaty rider. | 3 | 16 | the Rose Bowl | 3 | FAC |
| 2 | | 3 | 33 | USC | 30 | ORG |
| 3 | | 3 | 44 | Bruin | 39 | NORP |
| 4 | | 3 | 115 | Trojan | 109 | GPE |
| 5 | The Clippers were attempting a comeback in overtime at the Sports Arena. | 4 | 12 | Clippers | 4 | ORG |
| 6 | | 4 | 71 | the Sports Arena | 55 | FAC |
| 7 | It was the end of the seventhinning stretch, and Nomo fans at Chavez Ravine hunkered down with their cold beers and Dodger dogs. | 5 | 53 | Nomo | 49 | ORG |
| 8 | | 5 | 75 | Chavez Ravine | 62 | PERSON |
| 9 | | 5 | 122 | Dodger | 116 | NORP |


In [52]:
# Save results!
spacy_ner_df.to_excel('data.ner_spacy.xlsx')  # .xlsx, because recent pandas versions can no longer write legacy .xls files

### Getting counts from the data

Use `value_counts()` to count the values in any column of a pandas dataframe.


In [53]:
# Get the counts for the column 'type'
val_counts_type = nltk_ner_df['type'].value_counts()
val_counts_type

ORGANIZATION    15
PERSON          12
GPE             9 
Name: type, dtype: int64

In [54]:
# Get the counts for the column 'entity'
val_counts_entity = nltk_ner_df['entity'].value_counts()
val_counts_entity

Universal Amphitheater    1
Plácido                   1
Dodger                    1
Girls                     1
USC                       1
Chinese                   1
Shaq                      1
L.A. Convention Center    1
Scottie                   1
Simpsons                  1
Peter Sellers             1
Rossini                   1
Richard Sakai             1
Nomo                      1
Trekkie                   1
Coliseum                  1
Bud                       1
Ahmanson                  1
Sports Arena              1
Hollywood                 1
Dorothy                   1
Miss Saigon               1
Julio Iglesias            1
Shrine                    1
Asian                     1
Burbank                   1
Chavez Ravine             1
Rose                      1
Beauty                    1
Bruin                     1
Domingo                   1
Shubert                   1
Trojan                    1
Pippen                    1
Clippers                  1
Bicycle Club              1

In [55]:
# You can save any of these val_counts as an excel file itself
val_counts_entity.to_excel('data.ner_nltk_entity_counts.xlsx')  # .xlsx, because recent pandas versions can no longer write legacy .xls files

In [57]:
# You can also convert these to dictionaries
dict(val_counts_type)

{'ORGANIZATION': 15, 'PERSON': 12, 'GPE': 9}
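
If you'd rather skip pandas for a quick count, the same idea can be sketched with `collections.Counter` from the standard library, working directly on the list of dictionaries. The rows below are a handful copied from the NLTK results earlier in this notebook; `sample_results` is just an illustrative name.

```python
from collections import Counter

# a few rows in the same shape ner_nltk() returns (only the keys we need here)
sample_results = [
    {'type': 'GPE', 'entity': 'Coliseum'},
    {'type': 'ORGANIZATION', 'entity': 'USC'},
    {'type': 'GPE', 'entity': 'Bruin'},
    {'type': 'ORGANIZATION', 'entity': 'Clippers'},
    {'type': 'ORGANIZATION', 'entity': 'Sports Arena'},
]

# count one "column": how many entities of each type?
type_counts = Counter(row['type'] for row in sample_results)
print(type_counts.most_common())

# count a combination of "columns" by using tuples as keys
pair_counts = Counter((row['entity'], row['type']) for row in sample_results)
print(pair_counts[('USC', 'ORGANIZATION')])
```

`most_common()` sorts from most to least frequent, much like `value_counts()` does.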

#### Counting in multiple columns

In [60]:
# To count a combination of multiple columns, use .groupby() followed by .size()

val_counts_entity_type = nltk_ner_df.groupby(['entity','type']).size()
val_counts_entity_type

entity                  type        
Ahmanson                ORGANIZATION    1
Asian                   GPE             1
Beauty                  PERSON          1
Bicycle Club            ORGANIZATION    1
Bruin                   GPE             1
Bud                     PERSON          1
Burbank                 GPE             1
Chavez Ravine           ORGANIZATION    1
Chinese                 GPE             1
Clippers                ORGANIZATION    1
Coliseum                GPE             1
Dodger                  GPE             1
Domingo                 PERSON          1
Dorothy                 ORGANIZATION    1
Girls                   PERSON          1
Hollywood               GPE             1
Julio Iglesias          PERSON          1
L.A. Convention Center  ORGANIZATION    1
Miss Saigon             ORGANIZATION    1
Nomo                    ORGANIZATION    1
Peter Sellers           PERSON          1
Pippen                  PERSON          1
Plácido                 PERSON          1

In [61]:
# You can save this to an excel file too
val_counts_entity_type.to_excel('data.ner_nltk_entity_type_counts.xlsx')  # .xlsx, because recent pandas versions can no longer write legacy .xls files

In [62]:
# You can convert this to a dictionary too
dict(val_counts_entity_type)

{('Ahmanson', 'ORGANIZATION'): 1,
 ('Asian', 'GPE'): 1,
 ('Beauty', 'PERSON'): 1,
 ('Bicycle Club', 'ORGANIZATION'): 1,
 ('Bruin', 'GPE'): 1,
 ('Bud', 'PERSON'): 1,
 ('Burbank', 'GPE'): 1,
 ('Chavez Ravine', 'ORGANIZATION'): 1,
 ('Chinese', 'GPE'): 1,
 ('Clippers', 'ORGANIZATION'): 1,
 ('Coliseum', 'GPE'): 1,
 ('Dodger', 'GPE'): 1,
 ('Domingo', 'PERSON'): 1,
 ('Dorothy', 'ORGANIZATION'): 1,
 ('Girls', 'PERSON'): 1,
 ('Hollywood', 'GPE'): 1,
 ('Julio Iglesias', 'PERSON'): 1,
 ('L.A. Convention Center', 'ORGANIZATION'): 1,
 ('Miss Saigon', 'ORGANIZATION'): 1,
 ('Nomo', 'ORGANIZATION'): 1,
 ('Peter Sellers', 'PERSON'): 1,
 ('Pippen', 'PERSON'): 1,
 ('Plácido', 'PERSON'): 1,
 ('Richard Sakai', 'PERSON'): 1,
 ('Rose', 'ORGANIZATION'): 1,
 ('Rossini', 'PERSON'): 1,
 ('Scottie', 'PERSON'): 1,
 ('Shaq', 'PERSON'): 1,
 ('Shrine', 'GPE'): 1,
 ('Shubert', 'ORGANIZATION'): 1,
 ('Simpsons', 'ORGANIZATION'): 1,
 ('Sports Arena', 'ORGANIZATION'): 1,
 ('Trekkie', 'ORGANIZATION'): 1,
 ('Trojan', 'GPE')

### Practice

**@TODO: Make a map of all mentioned places in *Tropic of Orange***

Follow the steps below:

In [None]:
## @TODO: Get the named entities for the entire Tropic of Orange text
#

# Load the dataframe for Tropic of Orange
df_tropic = pd.read_excel('../corpora/tropic_of_orange/metadata.xls')

# make an empty list for all results in the book
all_results = []

# set a variable to the text folder
text_folder = '../corpora/tropic_of_orange/texts'


# loop over the filename column in df_tropic...      

    # print filename
    

    # get full path
    
    
    # open text
    
        
    # call one of the NER functions and get back the list of results
    
    
    # for each NER result dictionary
    
        # add the filename to the result dictionary
        
        # append the result dictionary to all_results
        

# make a data frame from all of the results


In [None]:
# @TODO: Merge the dataframe you just made with df_tropic,
# and save the merged dataframe to an excel file



In [None]:
# @TODO: Investigate the counts of 'entity' from the results



In [None]:
# @TODO: Filter the dataframe to show only places,
# and then investigate the counts of 'entity' from the results



In [None]:
# @TODO: Save to an excel file the counts for the place entities



Remaining steps:
* Upload the excel file of place counts to a Google Drive spreadsheet
* Geocode that Google spreadsheet ([see Won-Gi's advice here](https://github.com/quadrismegistus/literarytextmining/issues/2))
* Download the excel file to your computer
* Open Tableau and connect to the excel file of place *instances*
* Then click "add" on top left and connect to the excel file of place *counts* (with geocoding)
* In Tableau, select "fn" as the field to merge the two files on
* Click Sheet1 and make a map

## For geography research team


* Map the *non* proper nouns in Tropic of Orange:
    * Geocode the nouns:
        * Generate list of most frequent nouns in book
        * Upload that to a google spreadsheet, share the link with a group of people
        * See if you can "geocode" these places. Where is Pepsi produced? Where are oranges imported from? Keep a notes column to explain your interpretive decisions!
        * Save this spreadsheet to an excel file
    * Count the nouns in the texts:
        * Get the list of nouns you geocoded from the saved excel file
        * Go through the chapters and count those nouns in the chapters
        * Package these results into a dataframe of the form:
            |fn|word|count|
        * Merge that dataframe to the metadata dataframe
        * Save the merged form
        * Explore the data in Tableau
   

* See [5C sentiment analysis](5C_sentiment_analysis.ipynb) for a research problem involving sentiment and geography.


* Work more with the Tableau file we generated last time and try to answer some research questions.
    * Which is the narrator with the greatest geographic range?
    * Can we track globalization?
    * What are the sentences like that are mentioning far-flung places?
    

* (Advanced) See if you can figure out how to visualize the networked connections between places in [Palladio](http://hdlab.stanford.edu/palladio-app)
    * I've never done this, but I'm pretty sure you'll need to:
    * Make a dataframe (and save as CSV) of places, their lat/longs, and their counts
    * Make a dataframe (and save as CSV) of places that are mentioned in the same paragraph
    * Add both to Palladio
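
For the "places mentioned in the same paragraph" step in the Palladio item above, here's one possible sketch of counting co-occurring pairs with the standard library. It assumes you already have, for each paragraph, a list of the place names found in it. The function name `cooccurrence_counts` and the toy data are made up for illustration, and what columns Palladio expects is not covered here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(places_by_paragraph):
    """Count how often each unordered pair of places shares a paragraph."""
    pair_counts = Counter()
    for places in places_by_paragraph:
        # drop repeats within a paragraph and sort, so (A, B) and (B, A) are one pair
        unique_places = sorted(set(places))
        for pair in combinations(unique_places, 2):
            pair_counts[pair] += 1
    return pair_counts

# toy input: one list of place names per paragraph (not real results)
toy_paragraphs = [
    ['Hollywood', 'Burbank'],
    ['Burbank', 'Hollywood', 'Hollywood'],
    ['Coliseum'],
]

edges = cooccurrence_counts(toy_paragraphs)
for (place_a, place_b), weight in edges.items():
    print(place_a, place_b, weight)
```

Each pair and its count could then become one row of the second dataframe, with the count acting as the weight of the connection.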