<b>

<p>
<center>
<font size="5">
Natural Language Processing
</font>
</center>
</p>

<p>
<center>
<font size="4">
Entity finder in news articles
</font>
</center>
</p>

<p>
<center>
<font size="3">
Author: Sandra Valdés Salas
</font>
</center>
</p>


<p>
<center>
<font size="3">
March 2020
</font>
</center>
</p>

</b>

## _Table of Contents_

* [Step 1) Import necessary libraries](#1)
* [Step 2) Function to extract the entity, document id, and relevant sentence text](#2)
* [Step 3) Call the entity extraction function and add the entity records to a combined dictionary](#3)
* [Step 4) Function to determine the most popular entity based on number of mentions](#4)
* [Step 5) Invoke the top entity mention finder](#5)
* [Step 6) Analyze the most popular entities to determine what words they most frequently occur with](#6)
* [Step 7) Results](#7)


## _Introduction_

This exercise extracts entities (persons, organizations, locations, geopolitical entities) from Reuters news articles and finds the most frequently mentioned entities in each category. NLTK provides the Reuters corpus and spaCy performs the entity recognition.

## Step 1) Import necessary libraries <a class="anchor" id="1"></a>

In [1]:
#import nltk and spacy
import nltk.data
from nltk.corpus import reuters
import spacy
nlp = spacy.load("en_core_web_sm")

## Step 2) Function to extract the entity, document id, and relevant sentence text <a class="anchor" id="2"></a>

The `extract_entities` function returns one dictionary per entity type, containing all the entities of that type extracted from the input text. The value of each entity key is a tuple, in which the first element is the document id and the second element is a list of all sentences in that document that mention the entity.
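As a minimal sketch of that structure, the snippet below builds one such dictionary by hand. The document id and sentences are made up for illustration and do not come from the corpus.

```python
# Hypothetical illustration of one per-label dictionary returned by extract_entities.
# Key: entity text. Value: (document id, [sentences in that document mentioning the entity]).
doc_gpe_example = {
    "Japan": ("test/00000", ["Japan raised tariffs.", "Exports to Japan fell."]),
}

doc_id, sentences = doc_gpe_example["Japan"]
print(doc_id)          # the document the mentions come from
print(len(sentences))  # number of sentences mentioning the entity in that document
```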

In [2]:
#define function to process all labels and dictionaries

def process_dict(entity, relevant_sentence, label, dictionary):
    '''In case an entity occurs multiple times in the same document, 
    this function creates a key,value pair where the key = entity and
    value = tuple with document id and a list of all sentences where the entity is mentioned'''
    
    if entity.label_==label and entity.text.strip() not in dictionary:
        dictionary[entity.text.strip()]=relevant_sentence 
    elif entity.label_==label and entity.text.strip() in dictionary: 
        dictionary[entity.text.strip()][1].append(entity.sent.text.strip())
        
    return dictionary

In [3]:
def extract_entities(doc_id, doc_text):
    analyzed_doc = nlp(doc_text)
    doc_persons = {}
    doc_organizations = {}
    doc_locations = {}
    doc_gpe = {}
    
    for entity in analyzed_doc.ents:
        if entity.text.strip() != "":
            #print(" -> ", entity.label_)
            #print("->", entity.text.strip(), "<-")
            #print("->", entity.sent.text.strip(), "<-")
            #Note: lower-casing (to avoid repeated entities like "FED" and "Fed") is applied to the input text in Step 3
            relevant_sentence = (doc_id, [entity.sent.text.strip()])
            #Use process_dict function to create the dictionaries for each entity
            #The value corresponds to a tuple = (document_id, list of all entity mentions in the document)
            process_dict(entity,relevant_sentence,"PERSON",doc_persons)
            process_dict(entity,relevant_sentence,"ORG",doc_organizations)
            process_dict(entity,relevant_sentence,"LOC",doc_locations)
            process_dict(entity,relevant_sentence,"GPE",doc_gpe)
            
    return doc_persons, doc_organizations, doc_locations, doc_gpe

A small sample shows that the function works: for GPE entities, "U.S." is mentioned in several sentences of the same document, and all of them are collected under one key.

In [4]:
sample_doc = reuters.open('test/14826').read()
sample_per, sample_org, sample_loc, sample_gpe = extract_entities('test/14826',sample_doc)
#sample_per  #uncomment to check that the previous functions work
sample_gpe["U.S."]

('test/14826',
 ['Mounting trade friction between the\n  U.S.',
  'They told Reuter correspondents in Asian capitals a U.S.',
  'Move against Japan might boost protectionist sentiment in the\n  U.S. And lead to curbs on American imports of their products.',
  "The U.S. Has said it will impose 300 mln dlrs of tariffs on\n  imports of Japanese electronics goods on April 17, in\n  retaliation for Japan's alleged failure to stick to a pact not\n  to sell semiconductors on world markets at below cost.",
  '"If the tariffs remain in place for any length of time\n  beyond a few months it will mean the complete erosion of\n  exports (of goods subject to tariffs) to the U.S.," said Tom\n  Murtha, a stock analyst at the Tokyo office of broker &lt;James\n  Capel and Co>.',
  'Taiwan had a trade trade surplus of 15.6 billion dlrs last\n  year, 95 pct of it with the U.S.',
  '"We must quickly open our markets, remove trade barriers and\n  cut import tariffs to allow imports of U.S. Products, if we\

## Step 3) Call the entity extraction function and add the entity records to a combined dictionary <a class="anchor" id="3"></a>

In [5]:
num_docs = len(reuters.fileids())
num_docs

10788

In [6]:
#Get files from reuters
reuters_files = reuters.fileids()
#reuters_files = reuters.fileids()[:25]

In [7]:
def process_combined_dict(doc_dictionary, dictionary):
    '''Add the entity records of one document to the combined dictionary.
    Each key maps to a list of (document id, sentences) tuples; if the key
    already exists, the new tuple is appended to its list.'''
    #doc_dictionary was previously named "key", which the loop variable shadowed
    for key, value in doc_dictionary.items():
        if key not in dictionary:
            dictionary[key] = [value]
        else:
            dictionary[key].append(value)
    return dictionary

In [8]:
combined_persons = {}
combined_organizations = {}
combined_locations = {}
#combined_gpe = {}

for doc_id in reuters_files:
    #extract the input and use lower case to avoid repetition in entities like "FED" and "Fed"
    persons, organizations, locations, gpe=extract_entities(doc_id, reuters.open(doc_id).read().replace('\n', '').lower())
    #Use process_combined_dict function to add new items to combined dictionaries
    process_combined_dict(persons,combined_persons)
    process_combined_dict(locations,combined_locations)
    process_combined_dict(organizations,combined_organizations)
    #process_combined_dict(gpe,combined_gpe)

In [9]:
len(combined_persons.keys())

4070

In [10]:
len(combined_locations.keys())

399

In [11]:
len(combined_organizations.keys())

15697

---

In [None]:
#Keep in mind the structure of the combined dictionary

#entity : [list of tuples]
#each tuple = (document_id, [sentence 1, sentence 2, etc...])
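The difference between "documents" and "mentions" in this structure can be sketched on a toy dictionary. The entries below are placeholders with the same shape as `combined_persons`, not real corpus data.

```python
# Hypothetical combined dictionary with the same shape as combined_persons.
combined_example = {
    "reagan": [
        ("doc_a", ["sentence 1", "sentence 2"]),
        ("doc_b", ["sentence 3"]),
    ],
}

records = combined_example["reagan"]
n_documents = len(records)                                    # one tuple per document
n_mentions = sum(len(sentences) for _, sentences in records)  # sentences across all documents
print(n_documents, n_mentions)
```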

## Step 4) Function to determine the most popular entity based on number of mentions <a class="anchor" id="4"></a>

### 4.1) Function to get a list of tuples with the most mentioned entities and their total mentions

In [12]:
def entity_mentions_tuples(entity_dictionary):
    most_common=[]
    
    #iterate dictionary key
    for key in entity_dictionary.keys():
        total_entity_mentions=0
        #iterate each tuple with info about (doc_id, sentences)
        for t in range(len(entity_dictionary[key])):
            #total_entity_mentions += len(entity_dictionary[key][t]) -> get documents
            total_entity_mentions += len(entity_dictionary[key][t][1]) #get all mentions per document
        #get tuples -> (entity, total mentions in all documents)
        most_common.append((key, total_entity_mentions))
    
    #sort entities based on total number of mentions
    most_common= sorted(most_common, key = lambda x:x[1], reverse=True)
    return most_common

In [13]:
#Top 3 entities - persons mentioned in the corpus, with the frequencies of their mentions
tuples_persons = entity_mentions_tuples(combined_persons)
tuples_persons[:3]

[('reagan', 409), ('baker', 278), ('lawson', 123)]
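The same ranking could also be computed with `collections.Counter` from the standard library. This is only an alternative sketch: `entity_mentions_counter` and the toy dictionary are names introduced here for illustration.

```python
from collections import Counter

def entity_mentions_counter(entity_dictionary):
    """Count total sentence mentions per entity across all documents."""
    counts = Counter()
    for entity, records in entity_dictionary.items():
        counts[entity] = sum(len(sentences) for _, sentences in records)
    return counts

# Toy input with the same shape as the combined dictionaries.
toy = {
    "a": [("d1", ["s1", "s2"]), ("d2", ["s3"])],
    "b": [("d1", ["s1"])],
}
top = entity_mentions_counter(toy).most_common(1)
print(top)  # [('a', 3)]
```

`most_common(n)` returns the `n` highest counts already sorted, so no separate `sorted` call is needed.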

### 4.2) Function to get a list of top 500 entities with the most mentions

In [14]:
def find_most_popular_entities(entity_dictionary):
    #Get list of tuples (entity, total mentions)
    tuples_with_most_mentions= entity_mentions_tuples(entity_dictionary)
    #get top 500
    return [t[0] for t in tuples_with_most_mentions[:500]]   

## Step 5) Invoke the top entity mention finder <a class="anchor" id="5"></a>

In [15]:
top_persons = find_most_popular_entities(combined_persons)
top_locations = find_most_popular_entities(combined_locations)
top_organizations = find_most_popular_entities(combined_organizations)
#top_gpe = find_most_popular_entities(combined_gpe)

## Step 6) Analyze the most popular entities to determine what words they most frequently occur with <a class="anchor" id="6"></a>

Each combined dictionary maps an entity (key) to a list of tuples (value). Each tuple has: 1) the document id; 2) a list of all sentences in that document that mention the entity. Because an entity may appear many times in the same document and in many documents, the method takes the following steps:

1. Extracts all sentences from the tuple.
2. Creates a single string with all the sentences and stores it in a variable called `text`.
3. Tokenizes `text`.
4. Keeps only tokens that are alphabetic and are not stop words.
5. Discards tokens such as "says" or "said" (which I assume have a high frequency and are irrelevant for the analysis).
6. Finally, only tokens that are different from the entity text are considered.
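The steps above can be sketched without spaCy, using only the standard library. This is a simplified stand-in, not the notebook's method: the regular-expression tokenizer and the tiny stop-word set are assumptions made for illustration, whereas the notebook relies on spaCy's `is_stop` and `is_alpha`. Joining the sentences with a space, as done here, may also avoid fused tokens such as 'sharesdonald'.

```python
import re

# Tiny illustrative stop-word list (an assumption; spaCy's list is much longer).
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "on"}
IGNORED = {"says", "said"}

def filter_tokens(sentences, entity):
    """Join sentences, tokenize, and drop numeric, stop-word, ignored and entity tokens."""
    text = " ".join(sentences)                    # steps 1-2: one string per entity
    tokens = re.findall(r"[a-z]+", text.lower())  # step 3: alphabetic tokens only
    return [t for t in tokens                     # steps 4-6: apply the filters
            if t not in STOP_WORDS and t not in IGNORED and t != entity]

sentences = ["reagan said the tariffs start in 1987", "reagan signed the trade bill"]
tokens = filter_tokens(sentences, "reagan")
print(tokens)  # ['tariffs', 'start', 'signed', 'trade', 'bill']
```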

In [16]:
#function to get all tokens of an entity
def get_tokens(dictionary):
    '''Join all sentences (mentions of the entity) and extract the most relevant tokens:
    not a stop word, alphabetic, not equal to the entity, not filler words like "says" or "said".
    Note: relies on the global variable `ent` set by the calling loop.'''
    text=""
    for item in range(len(dictionary[ent])):
        mentions = dictionary[ent][item][1]
        all_mentions="".join(mentions)
        text += all_mentions
    doc = nlp(text)
    #return [t.text for t in doc if not t.is_stop if t.is_alpha if t.pos_!="VERB" if t.text!=ent]
    return [t.text for t in doc if not t.is_stop if t.is_alpha if t.text!=ent if t.text!="says" if t.text!="said"]

In [17]:
#main dictionary -> keys are entities and values are lists of their top 50 words (sorted by frequency)
person_most_popular_terms = {}

for ent in top_persons:
    person_token_dictionary = {}
    #get all tokens 
    person_words = get_tokens(combined_persons)
    #get frequencies
    for w in person_words:
        if w not in person_token_dictionary.keys():
            person_token_dictionary[w]=1
        else:
            person_token_dictionary[w]+=1
    
    #sort the token-frequency dictionary by value
    person_sorted_words= sorted(person_token_dictionary.items(), key=lambda x:x[1], reverse=True)
    
    #person_top_words = dict([t for t in person_sorted_words[:25]]) #store words,freq as dictionary
    person_top_words = [t[0] for t in person_sorted_words[:50]] #store words as list
    
    #store results as values in main dictionary
    person_most_popular_terms[ent] = person_top_words

#### Organizations

In [18]:
organization_most_popular_terms = {}

for ent in top_organizations:
    
    organization_token_dictionary = {}
    
    org_words = get_tokens(combined_organizations)
    for w in org_words:
        if w not in organization_token_dictionary.keys():
            organization_token_dictionary[w]=1
        else:
            organization_token_dictionary[w]+=1
            
    org_sorted_words= sorted(organization_token_dictionary.items(), key=lambda x:x[1], reverse=True)
    org_top_words = [t[0] for t in org_sorted_words[:50]] 
    organization_most_popular_terms[ent] = org_top_words

#### Locations

In [19]:
location_most_popular_terms = {}

for ent in top_locations:
    
    location_token_dictionary = {}
    
    loc_words = get_tokens(combined_locations)
    for w in loc_words:
        if w not in location_token_dictionary.keys():
            location_token_dictionary[w]=1
        else:
            location_token_dictionary[w]+=1
            
    loc_sorted_words= sorted(location_token_dictionary.items(), key=lambda x:x[1], reverse=True)
    loc_top_words = [t[0] for t in loc_sorted_words[:50]] 
    location_most_popular_terms[ent] = loc_top_words

#### GPE

In [None]:
#gpe_most_popular_terms = {}
#for ent in top_gpe:    
    #gpe_token_dictionary = {}
    #gpe_words = get_tokens(combined_gpe)
    #for w in gpe_words:
       # if w not in gpe_token_dictionary.keys():
          #  gpe_token_dictionary[w]=1
        #else:
          #  gpe_token_dictionary[w]+=1       
    #gpe_sorted_words= sorted(gpe_token_dictionary.items(), key=lambda x:x[1], reverse=True)
    #gpe_top_words = [t[0] for t in gpe_sorted_words[:50]] 
    #gpe_most_popular_terms[ent] = gpe_top_words

## Step 7) Results  <a class="anchor" id="7"></a>

### 7.1) Persons

Most common entities with their frequencies (total of mentions in all documents of Reuters corpus).

In [20]:
tuples_persons = entity_mentions_tuples(combined_persons)
tuples_persons[:10]

[('reagan', 409),
 ('baker', 278),
 ('lawson', 123),
 ('yeutter', 121),
 ('james baker', 94),
 ('baldrige', 66),
 ('johnson', 64),
 ('mln', 63),
 ('poehl', 54),
 ('stoltenberg', 53)]

In [22]:
#Documents with the entity mentioned
len(combined_persons['reagan'])

221

In [56]:
#Documents with the entity mentioned
len(combined_persons['baker'])

93

In [21]:
#Documents that mention the entity, as a percentage of all documents in the corpus
len(combined_persons['reagan'])/len(reuters_files)*100

2.0485724879495737

- **'reagan'** is the most popular entity in the persons dictionary. It is mentioned 409 times in the Reuters corpus (409 sentences contain "reagan"). It appears in 221 documents, which is about 2% of the documents in the Reuters corpus.
- **'baker'** is the second most mentioned entity. It is mentioned 278 times in the Reuters corpus, and it appears in 93 documents of the entire corpus.

From the **top 500 entities** with the largest number of mentions in the **persons dictionary**, let's take a look at the first 100 entities.

In [30]:
print(top_persons[:100])

['reagan', 'baker', 'lawson', 'yeutter', 'james baker', 'baldrige', 'johnson', 'mln', 'poehl', 'stoltenberg', '1986/87', 'inra', 'clayton yeutter', 'kiichi miyazawa', 'herrington', 'williams', 'heller', 'yasuhiro nakasone', 'richard lyng', 'satoshi sumita', 'nigel lawson', 'chirac', 'clark', 'edouard balladur', 'clayton', 'jardine matheson', 'paul volcker', 'malcolm baldrige', 'caspar weinberger', 'twa', 'nazer', 'donald trump', '1987/88', 'bass', 'gerhard stoltenberg', 'howard', 'microchips', 'simon', 'amc', 'mulroney', 'greenspan', 'george shultz', 'de clercq', 'ali', 'jorio dauster', 'richard  lyng', 'rotterdam', 'kim', 'karl otto poehl', 'heyman', 'marlin fitzwater', 'margaret thatcher', 'darman', 'leigh-pemberton', 'subroto', 'fitzwater', 'bangemann', 'paul  volcker', 'silas', 'karl otto', 'harris', 'jordan', 'nspa', 'gao', 'john herrington', 'kato', 'corazon aquino', 'gerhard  stoltenberg', 'morgan', 'gelco', 'barber', 'sama', 'marshall', 'brown', 'romero', 'lee', 'jim wright', '

- Overall, it seems that the method worked. However, some of these entities are repeated. For example, we have 2 different entities related to James Baker: **"baker"** and **"james baker"**. It is important to figure out a method that treats both entities as the same person. This is a key point because it changes the total count of sentences associated with these entities and, thus, the rankings.
- Also, we should exclude entities such as '1987/88', which is clearly not a person.
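One possible heuristic for both points is sketched below. It folds a single-word name into a full name when exactly one full name ends with that word, and it rejects names that contain digits. Both helper functions are hypothetical, and the surname rule is an assumption that can be wrong (a bare "baker" may refer to someone other than James Baker). The counts are copied from the top-10 persons output above.

```python
def looks_like_person(name):
    """Heuristic filter: reject names that contain any digit (e.g. '1987/88')."""
    return not any(ch.isdigit() for ch in name)

def merge_surname_aliases(counts):
    """Fold a single-word name into the unique full name that ends with it, if any."""
    names = [n for n in counts if looks_like_person(n)]
    merged = {}
    for name in names:
        candidates = [full for full in names
                      if full != name and " " in full and full.split()[-1] == name]
        target = candidates[0] if len(candidates) == 1 else name
        merged[target] = merged.get(target, 0) + counts[name]
    return merged

counts = {"reagan": 409, "baker": 278, "lawson": 123, "james baker": 94}
merged = merge_surname_aliases(counts)
print(merged)  # {'reagan': 409, 'james baker': 372, 'lawson': 123}
print([n for n in ["reagan", "1986/87", "1987/88"] if looks_like_person(n)])  # ['reagan']
```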

We can take a look at the most frequent tokens associated with each entity in the following section.

In [32]:
#The top 50 tokens associated with the entity "reagan"
print(person_most_popular_terms['reagan'])

['administration', 'trade', 'president', 'japan', 'oil', 'economic', 'bill', 'japanese', 'house', 'foreign', 'congress', 'gulf', 'tariffs', 'united', 'states', 'policy', 'officials', 'agreement', 'tax', 'secretary', 'year', 'official', 'dlrs', 'legislation', 'wheat', 'decision', 'markets', 'unfair', 'action', 'soviet', 'new', 'today', 'retaliate', 'senate', 'union', 'sanctions', 'offer', 'semiconductor', 'retaliation', 'exports', 'mln', 'open', 'week', 'industry', 'impose', 'help', 'opposition', 'countries', 'american', 'washington']


- **administration**, **trade** and **president** are the most frequent words associated with **'reagan'** (in that order). 
- Overall, the rest of the words are related to political economy (tariffs, sanctions, exports), legislation (tax, policy), domestic politics (congress, opposition, union) and international politics (gulf, soviet, foreign). 
- Some words have a negative connotation: retaliation, unfair.

In [33]:
#The top 50 tokens associated with the entity "baker"
print(person_most_popular_terms['baker'])

['west', 'trade', 'hughes', 'treasury', 'agreement', 'dollar', 'louvre', 'meeting', 'rate', 'exchange', 'german', 'merger', 'paris', 'policy', 'interest', 'currency', 'accord', 'economic', 'agreed', 'germany', 'department', 'sees', 'rates', 'international', 'comment', 'deficit', 'secretary', 'told', 'monetary', 'weekend', 'stoltenberg', 'billion', 'consent', 'reduction', 'currencies', 'dlr', 'justice', 'decree', 'japanese', 'james', 'remarks', 'today', 'interview', 'declined', 'market', 'markets', 'official', 'proposed', 'imbalances', 'finance']


- **west**, **trade** and **hughes** are the most frequent words associated with **'baker'** (in that order). 
- Overall, the rest of the words are related to economic jargon.

In [37]:
print(person_most_popular_terms['donald trump'])

['trump', 'donald', 'estate', 'resorts', 'dlrs', 'real', 'developer', 'inc', 'stock', 'ual', 'crosby', 'shares', 'mln', 'new', 'york', 'international', 'casino', 'class', 'b', 'common', 'agreed', 'purchase', 'hotel', 'month', 'held', 'bid', 'acquire', 'control', 'buy', 'interested', 'interstate', 'alexanders', 'discussions', 'takeover', 'recently', 'apparently', 'unsuccessful', 'spokesman', 'sharesdonald', 'family', 'chairman', 'james', 'manufacturing', 'charge', 'earnings', 'quarter', 'result', 'deal', 'bally', 'according']


- Some entities are written differently but refer to the same person, as is the case with **"Trump"** and **"Donald Trump"**. 
- We can find the union of the 2 lists (although the result will not be a list of words ordered by frequency). 

As a result, these are the most frequent words (unordered) associated with "Trump" and "Donald Trump":

In [47]:
a = person_most_popular_terms['donald trump']
b = person_most_popular_terms['trump']
print(list(set().union(a,b)))

['takeover', 'class', 'alexanders', 'new', 'buy', 'resorts', 'quarter', 'international', 'offered', 'stock', 'sharesdonald', 'held', 'real', 'makes', 'agreed', 'unsuccessful', 'common', 'bid', 'discussions', 'james', 'month', 'family', 'inc', 'deal', 'developer', 'shares', 'spokesman', 'charge', 'bally', 'seek', 'price', 'crosby', 'york', 'donald', 'estate', 'hotel', 'b', 'owner', 'acquire', 'remaining', 'level', 'dlrs', 'casino', 'mln', 'rival', 'recently', 'earnings', 'result', 'purchase', 'pratt', 'reach', 'manufacturing', 'requires', 'interested', 'according', 'trump', 'control', 'apparently', 'ual', 'february', 'chairman', 'investment', 'offer', 'try', 'corp', 'share', 'beat', 'interstate']
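If the per-entity token counts were kept instead of only the top words, the two aliases could be merged while preserving a frequency ranking. The sketch below assumes such counts are available as `Counter` objects; the numbers are invented for illustration and are not results from the corpus.

```python
from collections import Counter

# Hypothetical per-alias token counts (made-up numbers, for illustration only).
counts_full = Counter({"estate": 5, "resorts": 2, "casino": 1})
counts_short = Counter({"resorts": 4, "offer": 3, "casino": 1})

combined = counts_full + counts_short                  # adds counts word by word
ranked = [word for word, _ in combined.most_common()]  # highest combined count first
print(ranked)  # ['resorts', 'estate', 'offer', 'casino']
```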


### 7.2) Organizations

Most common entities with their frequencies (total of mentions in all documents of Reuters corpus).

In [38]:
#Most common entities with their frequencies (total of mentions in all documents from entire corpus)
tuples_organizations = entity_mentions_tuples(combined_organizations)
tuples_organizations[:10]

[('mln', 5984),
 ('cts', 4936),
 ('pct', 2477),
 ('ec', 871),
 ('fed', 663),
 ('opec', 524),
 ('reuters', 442),
 ('treasury', 442),
 ('usda', 367),
 ('bundesbank', 304)]

- **'mln'** is the most frequent entity in the organizations dictionary, followed by 'cts' and 'pct'. These entities are probably not valid organizations (for example, 'mln' most likely stands for million). Their very high frequencies also suggest that they refer to concepts other than *organizations*. 
- By contrast, 'fed', 'opec', 'reuters' and the rest of the entities are known organizations. Note that **'fed'** is mentioned 663 times in the entire corpus, followed by **'opec'**, which is mentioned 524 times.
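A simple way to drop such false positives is a blocklist applied after counting, sketched here. The list `NOT_ORGANIZATIONS` is a hypothetical, hand-picked set of unit abbreviations; the counts are copied from the output above.

```python
# Top organization counts copied from the output above.
tuples_organizations_top = [
    ("mln", 5984), ("cts", 4936), ("pct", 2477), ("ec", 871), ("fed", 663),
    ("opec", 524), ("reuters", 442), ("treasury", 442), ("usda", 367), ("bundesbank", 304),
]

# Hypothetical blocklist of unit abbreviations that are not organizations.
NOT_ORGANIZATIONS = {"mln", "cts", "pct", "dlrs"}

filtered = [(name, n) for name, n in tuples_organizations_top
            if name not in NOT_ORGANIZATIONS]
print(filtered[:3])  # [('ec', 871), ('fed', 663), ('opec', 524)]
```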


In [39]:
print(organization_most_popular_terms['opec'])

['oil', 'prices', 'mln', 'bpd', 'production', 'output', 'price', 'dlrs', 'official', 'december', 'market', 'barrel', 'crude', 'quota', 'members', 'barrels', 'meeting', 'ceiling', 'february', 'day', 'arabia', 'minister', 'saudi', 'countries', 'agreed', 'year', 'agreement', 'world', 'demand', 'al', 'member', 'fixed', 'president', 'ecuador', 'conference', 'producing', 'dlr', 'sources', 'spot', 'levels', 'traders', 'new', 'march', 'energy', 'quotas', 'group', 'quarter', 'lukman', 'current', 'level']


- As expected, the most frequent words related to the **OPEC** organization are words related to oil production and oil market: oil, prices, production, barrel, crude, arabia, energy 

In [40]:
print(organization_most_popular_terms['bundesbank'])

['billion', 'rates', 'marks', 'money', 'market', 'central', 'bank', 'rate', 'interest', 'president', 'pct', 'liquidity', 'poehl', 'german', 'dealers', 'west', 'cut', 'banks', 'repurchase', 'credit', 'monetary', 'week', 'karl', 'otto', 'term', 'february', 'schlesinger', 'policy', 'dollar', 'council', 'meeting', 'securities', 'policies', 'tender', 'short', 'mark', 'currency', 'today', 'fixed', 'net', 'growth', 'lower', 'germany', 'yen', 'january', 'pact', 'spokesman', 'funds', 'month', 'supply']


- As expected, the most frequent words associated with **Bundesbank** are related to finance and economics, such as: rate, money, market, credit, germany.

### 7.3) Locations

Most common entities with their frequencies (total of mentions in all documents of Reuters corpus).

In [48]:
#Most common entities with their frequencies (total of mentions in all documents from entire corpus)
tuples_locations = entity_mentions_tuples(combined_locations)
tuples_locations[:10]

[('gulf', 248),
 ('europe', 169),
 ('west texas', 49),
 ('north sea', 41),
 ('asia', 40),
 ('africa', 40),
 ('the middle east', 32),
 ('north america', 30),
 ('west', 30),
 ('atlantic', 23)]

In [51]:
#Documents with the entity mentioned
len(combined_locations['gulf'])

109

In [55]:
#Documents with the entity mentioned
len(combined_locations['europe'])

124

- **'gulf'** is the most popular entity: there are 248 sentences that mention 'gulf' in the corpus. Also, it is mentioned in 109 documents. 
- **'europe'** also has a high frequency: there are 169 sentences that mention 'europe' in the corpus. However, it appears in more documents (124 documents) than 'gulf'.
- The rest of the locations have a frequency lower than 50.

From the **top 500 entities** with the largest number of mentions in the **locations dictionary**, let's take a look at the first 100 entities.

In [50]:
print(top_locations[:100])

['gulf', 'europe', 'west texas', 'north sea', 'asia', 'africa', 'the middle east', 'north america', 'west', 'atlantic', 'midwest', 'the  gulf', 'pacific', 'mideast', '1986/87', 'middle east', 'western europe', 'mediterranean', 'the far east', 'the north sea', 'latin america', 'south america', 'west coast', 'northeast', 'the gulf of mexico', 'nova', '4th qtr', 'mississippi river', 'new england', 'southern gulf', 'the aegean sea', 'eastern europe', 'the mideast gulf', 'east', 'the pacific coast', 'east coast', 'the red river', 'the east coast', 'south', 'southeast asia', 'valley', 'southern california', 'the west coast', 'kharg island', 'continental europe', 'la pampa', 'highland valley', 'east bloc', 'east europe', 'mideast gulf', 'caribbean', 'scandinavia', 'the persian gulf', 'western', 'the southern gulf', 'south louisiana', 'the black sea', 'eastern bloc', 'lake ontario', 'the north china', 'delta', 'midmississippi river', 'illinois river', 'the mississippi river', 'persian gulf', '

- Overall, many locations are extracted in 2 ways, for example "the east coast" and "east coast". These variants are counted separately, although they refer to the same location. As a possible solution, we could delete "the" from the documents.
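Instead of editing the documents, the entity names themselves could be normalized before counting. The sketch below uses a hypothetical helper, `normalize_location`, that drops a leading "the" and collapses repeated spaces (as in 'the  gulf'). The names are taken from the locations output above.

```python
import re

def normalize_location(name):
    """Collapse repeated whitespace and drop a leading 'the'."""
    name = re.sub(r"\s+", " ", name.strip())
    return re.sub(r"^the\s+", "", name)

names = ["the east coast", "east coast", "the  gulf", "gulf", "the middle east"]
normalized = sorted({normalize_location(n) for n in names})
print(normalized)  # ['east coast', 'gulf', 'middle east']
```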

In [67]:
print(location_most_popular_terms['gulf'])

['oil', 'iranian', 'states', 'iran', 'united', 'attack', 'shipping', 'reagan', 'military', 'mln', 'dlrs', 'missiles', 'prices', 'kuwaiti', 'american', 'price', 'warned', 'new', 'told', 'tehran', 'minister', 'forces', 'ships', 'tankers', 'ship', 'protect', 'use', 'monday', 'platform', 'attacks', 'tension', 'president', 'kuwait', 'arab', 'action', 'officials', 'fob', 'foreign', 'soviet', 'rates', 'week', 'hormuz', 'corn', 'usda', 'barge', 'freight', 'near', 'anti', 'region', 'iraq']


- The **'gulf'** location has words related to oil production, international politics (specifically, U.S.-Middle East politics), and military jargon (attacks, tension, tankers, forces).

In [68]:
print(location_most_popular_terms['europe'])

['mln', 'oil', 'japan', 'market', 'year', 'japanese', 'exports', 'currency', 'pct', 'european', 'growth', 'trade', 'export', 'united', 'states', 'company', 'imports', 'prices', 'west', 'america', 'south', 'east', 'sales', 'demand', 'dlrs', 'largest', 'world', 'rose', 'sold', 'rate', 'billion', 'far', 'chairman', 'interest', 'report', 'sharply', 'added', 'firms', 'domestic', 'officials', 'ec', 'yen', 'rise', 'high', 'dollar', 'crude', 'economic', 'terms', 'cost', 'help']


- **'europe'** has words related to macroeconomics (exports, growth, trade, imports); others refer to European trade relations with other states (specifically Japan and the U.S.).

## Extras

Some additional coding to improve the method.

#### 1) Determine which persons, organizations and locations most frequently occur in the same sentence

One way to approach this problem is to count how often each entity occurs in a sentence and store the result in a dictionary (entity as key, frequency as value). We can then sort the dictionary items by value. The same method can be applied to the combined dictionaries of persons, organizations and locations.

The following is an example:

In [69]:
#Count entities with the largest number of mentions in the same sentence, for example:
sample="A U.S. move against Japan might boost protectionist sentiment in the U.S. against Japan and lead to curbs on american imports of their products, said the chairman of the Committee of Commerce in the U.S., Roger F. Wicker."
doc = nlp(sample)
doc

A U.S. move against Japan might boost protectionist sentiment in the U.S. against Japan and lead to curbs on american imports of their products, said the chairman of the Committee of Commerce in the U.S., Roger F. Wicker.

In [70]:
#Print all entities of the sentence with its label
for ent in doc.ents:
    print(ent, ent.label_)

U.S. GPE
Japan GPE
U.S. GPE
Japan GPE
american NORP
the Committee of Commerce ORG
U.S. GPE
Roger F. Wicker PERSON


In [71]:
#Count the number of times the entity appears in the sentence and store it in a dictionary
ent_counter={}
for ent in doc.ents:
    if ent.text not in ent_counter:
        ent_counter[ent.text]=1
    else: 
        ent_counter[ent.text]+=1
ent_counter

{'U.S.': 3,
 'Japan': 2,
 'american': 1,
 'the Committee of Commerce': 1,
 'Roger F. Wicker': 1}

In [72]:
#sort the dictionary based on the values of the entities 
#and return a list of tuples with the (entity, frequency)
sorted_entities = sorted(ent_counter.items(), key=lambda x:x[1], reverse=True)
sorted_entities

[('U.S.', 3),
 ('Japan', 2),
 ('american', 1),
 ('the Committee of Commerce', 1),
 ('Roger F. Wicker', 1)]

In [73]:
#Or we can simply get the most repeated element in the sentence
sorted_entities[0][0]

'U.S.'

In this example, the U.S. GPE entity has the largest number of mentions in the sentence. 
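Another reading of "occur in the same sentence" is pairwise co-occurrence: how often two different entities share a sentence. A minimal sketch under that assumption follows. The input is a hand-written list of entity texts per sentence (in practice these might be collected from spaCy sentences), and `count_pairs` is a name introduced here.

```python
from collections import Counter
from itertools import combinations

def count_pairs(sentences_entities):
    """Count how often two distinct entities appear in the same sentence."""
    pair_counts = Counter()
    for entities in sentences_entities:
        unique = sorted(set(entities))  # each entity counts once per sentence
        pair_counts.update(combinations(unique, 2))
    return pair_counts

# Hypothetical input: one list of entity texts per sentence.
sentences_entities = [
    ["U.S.", "Japan", "U.S.", "Roger F. Wicker"],
    ["Japan", "U.S."],
    ["Japan"],
]
pairs = count_pairs(sentences_entities)
print(pairs.most_common(1))  # [(('Japan', 'U.S.'), 2)]
```

Sorting the entities before pairing keeps each pair in one canonical order, so ('Japan', 'U.S.') and ('U.S.', 'Japan') are not counted separately.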