# (5B) NLP Cookbook, part 2: Advanced NLP

In this second installment of NLP, let's look at some of the more advanced questions:

* Detect proper nouns of places and people?
* Identify noun phrases? ("natural language processing", "education department", etc)
* Sentiment analysis? (estimating positive/negative sentiment of text)
* Describe sentence syntax? (clauses, subject-object and other "dependencies", etc)s

-----

## How can I identify places and people?


In [46]:
# Manzanar loves Named Entity Recognition

manzanar = """Despite everything, every sports event, concert, and whatnot was happening at the same time. L.A.
marathoners slouched by the droves across the finish line at the Coliseum. At the Rose Bowl: UCLA versus
USC; the Bruin mascot had been carried off the field with heat stroke, and the Trojan horse was tied up
after throwing its sweaty rider. The Clippers were attempting a comeback in overtime at the Sports Arena.
It was the end of the seventhinning stretch, and Nomo fans at Chavez Ravine hunkered down with their cold
beers and Dodger dogs. Scottie Pippen fouled Shaq who sank a free throw for the Lakers at the Forum in the
last seconds. The Trekkie convention warped into five at the L.A. Convention Center. Bud Girls paraded
between boxing matches at the Olympic Auditorium. Plácido Domingo belted Rossini at the Dorothy Chandler
under the improbable abstract/minimal/baroque direction of Peter Sellers. At the Shrine, executive
producer Richard Sakai accepted an Oscar for the movie version of The Simpsons. The helicopter landed for
the 944th time on the set of Miss Saigon at the Ahmanson, and Beauty smacked the Beast at the Shubert.
Chinese housewives went for the big stakes in pai gow in the Asian room at the Bicycle Club. Live-laughter
sitcom audiences and boisterous crowds for the daytime and nighttime talks filled every available studio
in Hollywood and Burbank. Thousands of fans melted away with Julio Iglesias at the Universal Amphitheater."""

In [98]:
# do some imports
import os
import nltk
import pandas as pd
pd.set_option('display.max_colwidth', 0)

### (1) NLTK

In [102]:

# let's make a function for this
def ner_nltk(string):
    # clean string
    string = string.strip().replace('\n',' ')
    
    # sentence tokenize the string
    sentences = nltk.sent_tokenize(string)
    
    # set empty list for output
    output_list = []
    
    # loop over each sentence
    sent_num = 0
    for sent in sentences:
        # add 1
        sent_num+=1
        added_sent_already = False
        
        # we need to get the words
        sent_words = nltk.word_tokenize(sent)
        
        # parts of speech
        sent_pos = nltk.pos_tag(sent_words)
        
        # then "chunk":
        chunks = nltk.ne_chunk(sent_pos)
        
        # loop over chunks...
        for chunk in chunks:
            #print(chunk)
            
            # if the chunk has a 'label' attribute (is a named entity)
            if hasattr(chunk,'label'):
                
                # get the label
                label = chunk.label()
                
                # print
                #print(label,list(chunk))
                
                # get the words
                chunk_words = []
                for word,pos in chunk:
                    chunk_words.append(word)
                
                # make a string version
                chunk_words_str = ' '.join(chunk_words)
                
                # make a result dictionary
                result_dict = {}
                
                # add NER info
                result_dict['type'] = label
                result_dict['entity'] = chunk_words_str
                
                # add sent info
                result_dict['_sent_num'] = sent_num
                
                # add a string of the sentence, but only once per sentence
                if not added_sent_already:
                    result_dict['_sent'] = sent
                    added_sent_already = True
                else:
                    result_dict['_sent'] = ''
                
                # add result dictionary to output list
                output_list.append(result_dict)
    
    # return output
    #output_df=pd.DataFrame(output_list)
    #return output_df
    return output_list


In [110]:
nltk_ner_ld = ner_nltk("""Ryan Heuser cannot wait until he graduates from Stanford University.
He will take up position as Head Engineer of NASA's secret "Send Literary Critics to Mars" mission.""")
nltk_ner_ld[0]

{'type': 'PERSON',
 'entity': 'Ryan',
 '_sent_num': 1,
 '_sent': 'Ryan Heuser cannot wait until he graduates from Stanford University.'}

In [111]:
nltk_ner_df = pd.DataFrame(nltk_ner_ld)
nltk_ner_df

Unnamed: 0,_sent,_sent_num,entity,type
0,Ryan Heuser cannot wait until he graduates from Stanford University.,1,Ryan,PERSON
1,,1,Heuser,PERSON
2,,1,Stanford University,ORGANIZATION
3,"He will take up position as Head Engineer of NASA's secret ""Send Literary Critics to Mars"" mission.",2,Head Engineer,PERSON
4,,2,NASA,ORGANIZATION
5,,2,Mars,PERSON


In [113]:
nltk_ner_ld = ner_nltk(manzanar)
nltk_ner_ld[0]

{'type': 'GPE',
 'entity': 'Coliseum',
 '_sent_num': 2,
 '_sent': 'L.A. marathoners slouched by the droves across the finish line at the Coliseum.'}

In [114]:
nltk_ner_df = pd.DataFrame(nltk_ner_ld)
nltk_ner_df

Unnamed: 0,_sent,_sent_num,entity,type
0,L.A. marathoners slouched by the droves across the finish line at the Coliseum.,2,Coliseum,GPE
1,"At the Rose Bowl: UCLA versus USC; the Bruin mascot had been carried off the field with heat stroke, and the Trojan horse was tied up after throwing its sweaty rider.",3,Rose,ORGANIZATION
2,,3,USC,ORGANIZATION
3,,3,Bruin,GPE
4,,3,Trojan,GPE
5,The Clippers were attempting a comeback in overtime at the Sports Arena.,4,Clippers,ORGANIZATION
6,,4,Sports Arena,ORGANIZATION
7,"It was the end of the seventhinning stretch, and Nomo fans at Chavez Ravine hunkered down with their cold beers and Dodger dogs.",5,Nomo,ORGANIZATION
8,,5,Chavez Ravine,ORGANIZATION
9,,5,Dodger,GPE


In [115]:
# Save this data!
nltk_ner_df.to_excel('data.ner_nltk.xls')

### (3) Polyglot (for non-English, non-French, non-German text)

[Polyglot](https://polyglot.readthedocs.io/) is a really cool package, built on top of TextBlob, which supports up to 140 different languages depending on the NLP task. This increase in linguistic range comes at a cost of accuracy, however. The tools was trained using Wikipedia as a Rosetta Stone, calibrating languages' models against each other by using the "same" articles in those languages. The other costs of using Polyglot: the documentation isn't that great, and it doesn't seem to be actively updated.

*Installation is also kind of a pain in the neck.* I recommend installing this only if you are planning to work with non-English, non-French, non-German text. To do so, paste the following into Terminal:

    conda install -c conda-forge pyicu
    pip install pycld2
    pip install morfessor
    pip install polyglot
    polyglot download LANG:en   # for english
    polyglot download LANG:es   # for spanish (optional)
    polyglot download LANG:xx   # where xx is the two-letter language code
   
See [the website](https://polyglot.readthedocs.io/) for more details.

In [116]:
def ner_polyglot(string):
    # let's try this...
    try:
        # to use polyglot, import its "Text" object:
        from polyglot.text import Text
    except ImportError:
        print('Polyglot not installed! To do so, follow the instructions above.')
        return
    # from here on we can assume that polyglot is imported
    
    # wrap that Text object around any string
    pg_text = Text(string)

    # make an output list
    output_list = []
    
    # get the entities
    entities = pg_text.entities
    # loop over sentences
    sent_num = 0
    for sent in pg_text.sentences:
        sent_num+=1

        # loop over the entities
        added_sent_already = False
        for ent in sent.entities:
            # get the type
            ent_type = ent.tag

            # get the words
            ent_words = list(ent)

            # bogus if the first letter of the first word is not alphabetic (not punctuation)
            if not ent_words[0][0].isalpha(): continue

            # make a string version
            ent_words_str = ' '.join(ent_words)

            # make a results dict
            result_dict = {}
            result_dict['_sent_num'] = sent_num
            if not added_sent_already:
                result_dict['_sent'] = str(sent)
                added_sent_already = True
            else:
                result_dict['_sent'] = ''
            
            result_dict['type']=ent_type
            result_dict['entity']=ent_words_str

            # add to output
            output_list.append(result_dict)
        
    return output_list

In [117]:
ner_polyglot("""Ryan Heuser cannot wait until he graduates from Stanford University.
He will take up position as Head Engineer of NASA's secret "Send Literary Critics to Mars" mission.""")

[{'_sent_num': 1,
  '_sent': 'Ryan Heuser cannot wait until he graduates from Stanford University.',
  'type': 'I-PER',
  'entity': 'Ryan Heuser'},
 {'_sent_num': 1,
  '_sent': '',
  'type': 'I-ORG',
  'entity': 'Stanford University'}]

In [118]:
# Run on Manzanar's paragraph
pd.DataFrame(ner_polyglot(manzanar))

Unnamed: 0,_sent,_sent_num,entity,type
0,marathoners slouched by the droves across the finish line at the Coliseum.,3,Coliseum,I-ORG
1,At the Rose Bowl: UCLA versus,4,Rose Bowl,I-ORG
2,,4,UCLA,I-ORG
3,"USC; the Bruin mascot had been carried off the field with heat stroke, and the Trojan horse was tied up",5,USC,I-ORG
4,,5,Bruin,I-ORG
5,The Clippers were attempting a comeback in overtime at the Sports Arena.,7,Clippers,I-ORG
6,,7,Sports Arena,I-ORG
7,"It was the end of the seventhinning stretch, and Nomo fans at Chavez Ravine hunkered down with their cold",8,Chavez,I-PER
8,Scottie Pippen fouled Shaq who sank a free throw for the Lakers at the Forum in the,10,Scottie Pippen,I-PER
9,,10,Shaq,I-PER


### (4) Spacy

[Spacy](http://spacy.io) is industrial-strength NLP. It's the fastest, most powerful, and most accurate. It can also work on [several languages besides English](https://spacy.io/models). But it's also kinda ugly and confusing to use. I recommend using this only if you are working on hundreds of texts and feel extremely comfortable with all the things we've been doing so far.

To install:

    pip install spacy
    python -m spacy download en_core_web_sm

Here's an NER implementation.

In [137]:
def ner_spacy(string):
    try:
        # import spacy
        import spacy
    except ImportError:
        print("spacy not installed. Please follow directions above.")
        return

    # clean string
    string = string.strip().replace('\n',' ').replace("’","'").replace("‘","'")
    
    # load its default English model
    nlp = spacy.load("en_core_web_sm")

    # create a spacy text object
    doc = nlp(string)
    
    # make an output list
    output_list = []

    # loop over sentences
    sent_num=0
    for sent in doc.sents:
        sent_num+=1
        added_sent_already = False

        # loop over sentence's entities
        for ent in sent.ents:
            
            # make a result dict
            result_dict = {}
            
            # set sentence number
            result_dict['_sent_num'] = sent_num
            
            # store text too
            if not added_sent_already:
                result_dict['_sent'] = sent.text
                added_sent_already = True
            else:
                result_dict['_sent'] = ''
            
            # get type
            result_dict['type'] = ent.label_
            
            # get entity
            result_dict['entity'] = ent.text
            
            # get start char
            result_dict['star_char'] = ent.start_char
            
            # get end char
            result_dict['star_char'] = ent.end_char
            
            # add result_dict to output_list
            output_list.append(result_dict)
            
    # return output
    return output_list
            


In [138]:
pd.DataFrame(ner_spacy("""Ryan Heuser cannot wait until he graduates from Stanford University.
He will take up position as Head Engineer of NASA's secret "Send Literary Critics to Mars" mission."""))

Unnamed: 0,_sent,_sent_num,entity,star_char,type
0,Ryan Heuser cannot wait until he graduates from Stanford University.,1,Ryan Heuser,11,PERSON
1,,1,Stanford University,67,ORG
2,"He will take up position as Head Engineer of NASA's secret ""Send Literary Critics to Mars"" mission.",2,Engineer,110,PERSON
3,,2,NASA,118,ORG
4,,2,Send Literary Critics to Mars,158,WORK_OF_ART


In [139]:
spacy_ner_ld = ner_spacy(manzanar)
spacy_ner_ld[0]

{'_sent_num': 2,
 '_sent': 'L.A. marathoners slouched by the droves across the finish line at the Coliseum.',
 'type': 'GPE',
 'entity': 'L.A.',
 'star_char': 97}

In [140]:
spacy_ner_df = pd.DataFrame(spacy_ner_ld)
spacy_ner_df

Unnamed: 0,_sent,_sent_num,entity,star_char,type
0,L.A. marathoners slouched by the droves across the finish line at the Coliseum.,2,L.A.,97,GPE
1,,2,Coliseum,171,PERSON
2,"At the Rose Bowl: UCLA versus USC; the Bruin mascot had been carried off the field with heat stroke, and the Trojan horse was tied up after throwing its sweaty rider.",3,the Rose Bowl,189,FAC
3,,3,USC,206,ORG
4,,3,Bruin,217,NORP
5,,3,Trojan,288,GPE
6,The Clippers were attempting a comeback in overtime at the Sports Arena.,4,Clippers,352,ORG
7,,4,the Sports Arena,411,FAC
8,"It was the end of the seventhinning stretch, and Nomo fans at Chavez Ravine hunkered down with their cold beers and Dodger dogs.",5,Nomo,466,ORG
9,,5,Chavez Ravine,488,PERSON


In [141]:
# Save results!
spacy_ner_df.to_excel('data.ner_spacy.xls')

### Getting counts from the data

Use `value_counts()` to count the values in any column of a pandas dataframe.

#### Counts for `type`

In [142]:
# Get the counts for the column 'type'
nltk_ner_df['type'].value_counts()

ORGANIZATION    15
PERSON          12
GPE             9 
Name: type, dtype: int64

In [143]:
# If you have spacy working
#spacy_ner_df['type'].value_counts()

In [144]:
# We can also convert this data into a dictionary:
type_counts = dict( nltk_ner_df['type'].value_counts() )
type_counts

{'ORGANIZATION': 15, 'PERSON': 12, 'GPE': 9}

In [145]:
# And we can loop over the (key,value) pairs of any dictionary like this:

for key,value in type_counts.items():
    print(key,':',value)

ORGANIZATION : 15
PERSON : 12
GPE : 9


#### Counts for `entity`

In [146]:
# Count the number of unique values under entity
nltk_ner_df['entity'].value_counts()

Julio Iglesias            1
Dodger                    1
Universal Amphitheater    1
Shrine                    1
Simpsons                  1
Miss Saigon               1
Richard Sakai             1
Bud                       1
Chinese                   1
Scottie                   1
Rose                      1
Clippers                  1
Ahmanson                  1
Girls                     1
Nomo                      1
Domingo                   1
Rossini                   1
Plácido                   1
USC                       1
Coliseum                  1
Trekkie                   1
Bruin                     1
L.A. Convention Center    1
Chavez Ravine             1
Bicycle Club              1
Pippen                    1
Dorothy                   1
Hollywood                 1
Beauty                    1
Asian                     1
Sports Arena              1
Shubert                   1
Burbank                   1
Trojan                    1
Shaq                      1
Peter Sellers       

In [147]:
# Get these results as a dictionary
entity_counts = dict( results_df['entity'].value_counts() )
entity_counts

{'Sin': 1,
 'Los Angeles': 1,
 'Tut': 1,
 'America': 1,
 'Hollywood Bowl': 1,
 'Qris': 1,
 'Sports Arena': 1,
 'Asian': 1,
 'Bud': 1,
 'Santa Monica': 1,
 'Universal Studios': 1,
 'Pippen': 1,
 'Berry Farm': 1,
 'Knott': 1,
 'The Lollapalooza': 1,
 'Plaza': 1,
 'Hollywood Park': 1,
 'L.A. Convention Center': 1,
 'Ahmanson': 1,
 'Robert': 1,
 'Disneyland': 1,
 'Americans': 1,
 'Jubilee Choir': 1,
 'Beauty': 1,
 'Hollywood Riviera': 1,
 'César Chávez': 1,
 'Chicanos': 1,
 'Rossini': 1,
 'Domingo': 1,
 'Bicycle Club': 1,
 'Pomona Raceway': 1,
 'Codrescu': 1,
 'Clippers': 1,
 'Rose': 1,
 'USC': 1,
 'Trojan': 1,
 'Chinese': 1,
 'Stomp': 1,
 'Andrei': 1,
 'LACMA': 1,
 'Universal Amphitheater': 1,
 'Pizzicato': 1,
 'Plácido': 1,
 'West Hollywood': 1,
 'Shrine': 1,
 'Nomo': 1,
 'Trekkie': 1,
 'Burbank': 1,
 'Volleyball': 1,
 'Santa Anita Racetrack': 1,
 'McNeil': 1,
 'Republican': 1,
 'Hollywood': 1,
 'Soleil': 1,
 'Drag': 1,
 'Dorothy': 1,
 'Shaq': 1,
 'Wadsworth': 1,
 'Japan': 1,
 'Bonaventu

In [148]:
# Loop over these results:
for entity,count in entity_count_dict.items():
    print(entity,count)

NameError: name 'entity_count_dict' is not defined

#### Counts for `entity+type`

In [149]:
# For both together
results_df.groupby(['entity','type']).size()

entity                  type        
AIDS                    ORGANIZATION    1
Ahmanson                ORGANIZATION    1
America                 ORGANIZATION    1
Americans               ORGANIZATION    1
Andrei                  PERSON          1
Andy Warhol             PERSON          1
Asian                   GPE             1
Beauty                  PERSON          1
Berry Farm              PERSON          1
Bicycle Club            ORGANIZATION    1
Bonaventure             ORGANIZATION    1
Bruin                   GPE             1
Bud                     PERSON          1
Burbank                 GPE             1
Central Library         ORGANIZATION    1
Chavez Ravine           ORGANIZATION    1
Chicanos                GPE             1
Chinese                 GPE             1
Chris                   GPE             1
Cirque                  GPE             1
Clippers                ORGANIZATION    1
Codrescu                ORGANIZATION    1
Coliseum                GPE            

In [150]:
# Get these results as a dictionary
entity_type_counts = dict( results_df.groupby(['entity','type']).size() )
entity_type_counts

{('AIDS', 'ORGANIZATION'): 1,
 ('Ahmanson', 'ORGANIZATION'): 1,
 ('America', 'ORGANIZATION'): 1,
 ('Americans', 'ORGANIZATION'): 1,
 ('Andrei', 'PERSON'): 1,
 ('Andy Warhol', 'PERSON'): 1,
 ('Asian', 'GPE'): 1,
 ('Beauty', 'PERSON'): 1,
 ('Berry Farm', 'PERSON'): 1,
 ('Bicycle Club', 'ORGANIZATION'): 1,
 ('Bonaventure', 'ORGANIZATION'): 1,
 ('Bruin', 'GPE'): 1,
 ('Bud', 'PERSON'): 1,
 ('Burbank', 'GPE'): 1,
 ('Central Library', 'ORGANIZATION'): 1,
 ('Chavez Ravine', 'ORGANIZATION'): 1,
 ('Chicanos', 'GPE'): 1,
 ('Chinese', 'GPE'): 1,
 ('Chris', 'GPE'): 1,
 ('Cirque', 'GPE'): 1,
 ('Clippers', 'ORGANIZATION'): 1,
 ('Codrescu', 'ORGANIZATION'): 1,
 ('Coliseum', 'GPE'): 1,
 ('César Chávez', 'PERSON'): 1,
 ('Disneyland', 'GPE'): 1,
 ('Dodger', 'GPE'): 1,
 ('Domingo', 'PERSON'): 1,
 ('Dorothy', 'ORGANIZATION'): 1,
 ('Drag', 'GPE'): 1,
 ('Endless', 'GPE'): 1,
 ('Girls', 'PERSON'): 1,
 ('Greek', 'GPE'): 1,
 ('Hilton', 'GPE'): 1,
 ('Hollywood', 'GPE'): 1,
 ('Hollywood Bowl', 'ORGANIZATION'): 1,

In [151]:
# Loop over these results:
for (entity,ent_type),count in entity_type_counts.items():
    print(entity,ent_type,count)

AIDS ORGANIZATION 1
Ahmanson ORGANIZATION 1
America ORGANIZATION 1
Americans ORGANIZATION 1
Andrei PERSON 1
Andy Warhol PERSON 1
Asian GPE 1
Beauty PERSON 1
Berry Farm PERSON 1
Bicycle Club ORGANIZATION 1
Bonaventure ORGANIZATION 1
Bruin GPE 1
Bud PERSON 1
Burbank GPE 1
Central Library ORGANIZATION 1
Chavez Ravine ORGANIZATION 1
Chicanos GPE 1
Chinese GPE 1
Chris GPE 1
Cirque GPE 1
Clippers ORGANIZATION 1
Codrescu ORGANIZATION 1
Coliseum GPE 1
César Chávez PERSON 1
Disneyland GPE 1
Dodger GPE 1
Domingo PERSON 1
Dorothy ORGANIZATION 1
Drag GPE 1
Endless GPE 1
Girls PERSON 1
Greek GPE 1
Hilton GPE 1
Hollywood GPE 1
Hollywood Bowl ORGANIZATION 1
Hollywood Park FACILITY 1
Hollywood Riviera ORGANIZATION 1
Japan GPE 1
Japanese GPE 1
John Mauceri PERSON 1
Jubilee Choir ORGANIZATION 1
Julio Iglesias PERSON 1
Knott PERSON 1
L.A. Convention Center ORGANIZATION 1
LACMA GPE 1
Los Angeles GPE 1
MOCA ORGANIZATION 1
Magic Mountain PERSON 1
Malibu GPE 1
McNeil ORGANIZATION 1
Miss Saigon ORGANIZATION 1

### Practice

In [152]:
df_tropic = pd.read_excel('../corpora/tropic_of_orange/metadata.xls')
df_tropic

Unnamed: 0,fn,part,part_day,part_title,chapter,chapter_title,setting,narrator
0,ch01.txt,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
1,ch02.txt,1,Monday,Summer Solstice,2,Benefits,Koreatown,Bobby Ngu
2,ch03.txt,1,Monday,Summer Solstice,3,Weather Report,Westside,Emi
3,ch04.txt,1,Monday,Summer Solstice,4,Station ID,Jefferson & Normandie,Buzzworm
4,ch05.txt,1,Monday,Summer Solstice,5,Traffic Window,Harbor Freeway,Manzanar Murakami
5,ch06.txt,1,Monday,Summer Solstice,6,Coffee Break,Downtown,Gabriel Balboa
6,ch07.txt,1,Monday,Summer Solstice,7,To Wake,The Marketplace,Arcangel
7,ch08.txt,2,Tuesday,Diamond Lane,8,Rideshare,Downtown Interchange,Manzanar Murakami
8,ch09.txt,2,Tuesday,Diamond Lane,9,NewsNow,Hollywood South,Emi
9,ch10.txt,2,Tuesday,Diamond Lane,10,Morning,En México,Rafaela Cortes


In [153]:
## @TODO:

# make an empty list for all results in the book
all_results = []

# loop over each row...
for index,row in df_tropic.iterrows():
    # values for each row are available as keys on the 'row' dictionary
    fn=row['fn']
    chapter_num=row['chapter']
    
    # no spoilers!
    if chapter_num>35: break
        
    # print filename
    print(fn,'...')

    # get full path
    path=os.path.join('../corpora/tropic_of_orange/texts', fn)
    
    # open text
    with open(path) as file:
        txt=file.read()
        
    # parse NER's
    ner_results_ld = ner_nltk(txt)
    
    # add to all results
    for result_dict in ner_results_ld:
        result_dict['fn']=fn
        all_results.append(result_dict)

# make a data frame
all_results_df=pd.DataFrame(all_results)

ch01.txt ...
ch02.txt ...
ch03.txt ...
ch04.txt ...
ch05.txt ...
ch06.txt ...
ch07.txt ...
ch08.txt ...
ch09.txt ...
ch10.txt ...
ch11.txt ...
ch12.txt ...
ch13.txt ...
ch14.txt ...
ch15.txt ...
ch16.txt ...
ch17.txt ...
ch18.txt ...
ch19.txt ...
ch20.txt ...
ch21.txt ...
ch22.txt ...
ch23.txt ...
ch24.txt ...
ch25.txt ...
ch26.txt ...
ch27.txt ...
ch28.txt ...
ch29.txt ...
ch30.txt ...
ch31.txt ...
ch32.txt ...
ch33.txt ...
ch34.txt ...
ch35.txt ...
ch36.txt ...


In [161]:
all_results_df

Unnamed: 0,_sent,_sent_num,entity,fn,star_char,type
0,"Rafaela Cortes spent the morning barefoot, sweeping both dead and living things from over and under beds, from behind doors and shutters, through archways, along the veranda—sweeping them all across the deep shadows and luminous sunlight carpeting the cool tile floors.",1,Rafaela Cortes,ch01.txt,14,PERSON
1,,1,the morning,ch01.txt,32,TIME
2,"Every morning, a small pile of assorted insects and tiny animals—moths and spiders, lizards and beetles",3,Every morning,ch01.txt,483,TIME
3,"And the snake that slithered away at the urging of her broom—probably not poisonous, but one never knew.",8,one,ch01.txt,896,CARDINAL
4,Every morning it was the same.,9,Every morning,ch01.txt,922,TIME
5,"Every morning, she swept this mound of dead and wiggling things to the door and off the side of the veranda and into the dark green undergrowth with the same flourish.",10,Every morning,ch01.txt,953,TIME
6,"Occasionally, there was more of one species or the other, but each somehow always made its way back into the house.",11,one,ch01.txt,1143,CARDINAL
7,"On some days, it seemed to twirl before her broom communicating a kind of dance that seemed to send a visceral message up the broom to her fingertips.",15,some days,ch01.txt,1431,DATE
8,It made no difference if she closed the doors and shutters at the first sign of dusk or if she left the house unoccupied and tightly shut for several days.,17,first,ch01.txt,1681,ORDINAL
9,,17,dusk,ch01.txt,1694,TIME


In [162]:
all_results_df.merge(df_tropic,on='fn')

Unnamed: 0,_sent,_sent_num,entity,fn,star_char,type,part,part_day,part_title,chapter,chapter_title,setting,narrator
0,"Rafaela Cortes spent the morning barefoot, sweeping both dead and living things from over and under beds, from behind doors and shutters, through archways, along the veranda—sweeping them all across the deep shadows and luminous sunlight carpeting the cool tile floors.",1,Rafaela Cortes,ch01.txt,14,PERSON,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
1,,1,the morning,ch01.txt,32,TIME,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
2,"Every morning, a small pile of assorted insects and tiny animals—moths and spiders, lizards and beetles",3,Every morning,ch01.txt,483,TIME,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
3,"And the snake that slithered away at the urging of her broom—probably not poisonous, but one never knew.",8,one,ch01.txt,896,CARDINAL,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
4,Every morning it was the same.,9,Every morning,ch01.txt,922,TIME,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
5,"Every morning, she swept this mound of dead and wiggling things to the door and off the side of the veranda and into the dark green undergrowth with the same flourish.",10,Every morning,ch01.txt,953,TIME,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
6,"Occasionally, there was more of one species or the other, but each somehow always made its way back into the house.",11,one,ch01.txt,1143,CARDINAL,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
7,"On some days, it seemed to twirl before her broom communicating a kind of dance that seemed to send a visceral message up the broom to her fingertips.",15,some days,ch01.txt,1431,DATE,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
8,It made no difference if she closed the doors and shutters at the first sign of dusk or if she left the house unoccupied and tightly shut for several days.,17,first,ch01.txt,1681,ORDINAL,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes
9,,17,dusk,ch01.txt,1694,TIME,1,Monday,Summer Solstice,1,Midday,Not Too Far From Mazatlán,Rafaela Cortes


In [157]:
all_results_df.entity.value_counts()

Rafaela                                       185
Bobby                                         154
Gabriel                                       138
Buzzworm                                      107
Sol                                           90 
one                                           77 
Arcangel                                      67 
Emi                                           67 
two                                           53 
Manzanar                                      43 
Doña Maria                                    40 
first                                         38 
today                                         38 
L.A.                                          32 
”                                             28 
Buzz                                          26 
Chinese                                       24 
Mexican                                       23 
Margarita                                     21 
Singapore                                     21 


## How can I get the sentiment of sentences?

### (2) textblob (recommended)

In [247]:
from textblob import TextBlob

def sentiment_analysis_textblob(string):
    # first make a blob
    blob = TextBlob(string)

    # make output dictionary
    output_list = []
    
    # for each sentence
    sent_num=0
    for sent in blob.sentences:
        sent_num+=1
        
        # make an empty results dictionary
        result_dict={}
        result_dict['_sent_num'] = sent_num
        result_dict['_sent'] = str(sent)
        
        result_dict['polarity'] = sent.sentiment.polarity
        result_dict['subjectivity'] = sent.sentiment.subjectivity
        
        output_list.append(result_dict)
    
    return pd.DataFrame(output_list)
        

In [248]:
sentiment_analysis_textblob(manzanar)

Unnamed: 0,_sent,_sent_num,polarity,subjectivity
0,"Despite everything, every sports event, concert, and whatnot was happening at the same time.",1,0.0,0.125
1,L.A.\nmarathoners slouched by the droves across the finish line at the Coliseum.,2,0.0,0.0
2,"At the Rose Bowl: UCLA versus\nUSC; the Bruin mascot had been carried off the field with heat stroke, and the Trojan horse was tied up\nafter throwing its sweaty rider.",3,0.6,0.95
3,The Clippers were attempting a comeback in overtime at the Sports Arena.,4,0.0,0.0
4,"It was the end of the seventhinning stretch, and Nomo fans at Chavez Ravine hunkered down with their cold\nbeers and Dodger dogs.",5,-0.377778,0.644444
5,Scottie Pippen fouled Shaq who sank a free throw for the Lakers at the Forum in the\nlast seconds.,6,0.2,0.433333
6,The Trekkie convention warped into five at the L.A. Convention Center.,7,-0.1,0.1
7,Bud Girls paraded\nbetween boxing matches at the Olympic Auditorium.,8,0.0,0.0
8,Plácido Domingo belted Rossini at the Dorothy Chandler\nunder the improbable abstract/minimal/baroque direction of Peter Sellers.,9,0.0,0.0
9,"At the Shrine, executive\nproducer Richard Sakai accepted an Oscar for the movie version of The Simpsons.",10,0.0,0.0
