# Homework 2

In this homework you will be performing some analysis with entity extraction. In particular, you will be looking at the Reuters corpus and trying to construct entity profiles of persons, organizations, and locations. This will require you to iterate through the documents in the Reuters corpus, parse them appropriately, extract entities, and then store the entities along with some surrounding text. Additionally, you will be looking for mechanisms to identify potential relationships between persons and locations.

You will use NLTK to access the corpus, and you will also need an entity extraction system. You can use either NLTK or spaCy for this; I strongly suggest spaCy for the entity extraction portion of this assignment.

The basic idea is to build a knowledge base around the entities you extract from the Reuters corpus. Normally, this would be a first step toward modeling tasks such as entity resolution across documents. You could also use it as a first step toward analyzing the sentiment expressed about particular entities, for example, people expressing dissatisfaction with a restaurant or brand.

Follow the steps below and read the comments carefully to see what tasks your code needs to perform.

I would expect that some of you might be able to reuse parts of this code for your project...

## Step 1) Import necessary libraries 

In [1]:
# This will be the corpus we work from
from nltk.corpus import reuters

In [2]:
# I will assume you are using Spacy as a default entity recognizer.
import spacy
# note: depending on your spaCy version, the model may need to be loaded by its full name or its short name.
# if you run into issues here, check the spaCy model page at https://spacy.io/usage/models
nlp = spacy.load("en_core_web_sm")

## Step 2) Fill in the following function to extract the entity, document id, and relevant sentence text from the input

In [3]:
def extract_entities(doc_id, doc_text):
    analyzed_doc = nlp(doc_text)
    
    # these three dictionaries will hold all the persons, organizations, and locations you find in a document.
    # Add each entity you encounter to the matching dictionary: use the entity text as the key
    # and the document id plus the sentence text as the value.
    # An entity can occur several times in a document, so each value should be a list
    # of (document id, sentence text) pairs (or something similar).
    doc_persons = {}
    doc_organizations = {}
    doc_locations = {}
    
    for entity in analyzed_doc.ents:
        if entity.text.strip() != "":
            # one way to represent the document id and the sentence text would be with a tuple
            # thus, you could do:
            relevant_sentence = (doc_id, entity.sent.text)
            
            # add the relevant document id and sentence to the entity record
            if entity.label_ == 'PERSON':
                if entity.text not in doc_persons:
                    doc_persons[entity.text] = []
                doc_persons[entity.text].append(relevant_sentence)
            elif entity.label_ == 'ORG':
                if entity.text not in doc_organizations:
                    doc_organizations[entity.text] = []
                doc_organizations[entity.text].append(relevant_sentence)
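            elif entity.label_ == 'LOC':
                # note: spaCy tags countries, cities, and states as 'GPE'; 'LOC' covers other places such as regions and seas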
                if entity.text not in doc_locations:
                    doc_locations[entity.text] = []
                doc_locations[entity.text].append(relevant_sentence)
            
    # Extra credit: resolve different forms of the same name for each person and location inside the same document.
    # This runs once, after all entities are collected: a shorter form (e.g. "Bush") is merged
    # into a longer form that contains it (e.g. "George Bush").
    for entity_dict in (doc_persons, doc_locations):
        for short_name in list(entity_dict):
            for long_name in list(entity_dict):
                # long_name may already have been merged into another name, so check it still exists
                if short_name != long_name and short_name in long_name and long_name in entity_dict:
                    entity_dict[long_name] += entity_dict.pop(short_name)
                    break
   
    return doc_persons, doc_organizations, doc_locations
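
The substring test used for name resolution also matches partial words, for example "Ann" inside "Joanna Smith". As a minimal sketch of a stricter alternative, the hypothetical helper `merge_name_variants` below (not part of the assignment template) merges a shorter name into a longer one only when every whitespace-separated token of the shorter name appears in the longer one. It assumes the same `{name: [(doc_id, sentence), ...]}` layout that `extract_entities` builds; the sample data is made up.

```python
def merge_name_variants(entities):
    """Merge shorter name forms into a longer form that contains all their tokens.

    `entities` maps an entity string to a list of (doc_id, sentence) tuples.
    Returns a new dictionary; the input is left unchanged.
    """
    # Process names with more tokens first so they become the canonical forms.
    names = sorted(entities, key=lambda name: len(name.split()), reverse=True)
    merged = {}
    for name in names:
        tokens = set(name.split())
        # Find an already-kept name whose tokens strictly include this name's tokens.
        target = next(
            (kept for kept in merged if tokens < set(kept.split())),
            None,
        )
        if target is None:
            merged[name] = list(entities[name])
        else:
            merged[target].extend(entities[name])
    return merged


# Illustrative data only
doc_persons = {
    "Bush": [("test/1", "Bush spoke on Monday.")],
    "George Bush": [("test/1", "George Bush arrived in Tokyo.")],
    "Ann": [("test/1", "Ann declined to comment.")],
    "Joanna Smith": [("test/1", "Joanna Smith chaired the meeting.")],
}
resolved = merge_name_variants(doc_persons)
```

Here "Bush" is folded into "George Bush", while "Ann" stays separate from "Joanna Smith". Token matching still cannot tell two different people with the same surname apart, so it remains a heuristic.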
        

## Step 3) Adjust the following code to run the document entity extraction function
Also, add the entity records you construct to your master list of entities.
Note: for the full submission, run across all the Reuters documents.

In [4]:
num_docs = len(reuters.fileids())
#  this has a large number of files... 
# you might wish to limit the number of documents you use while developing your technique 
# ex. reuters.fileids()[0:25]

# these three dictionaries will collect all the references to each entity across the corpus
combined_persons = {}
combined_organizations = {}
combined_locations = {}

# this iterates over every document; while developing, slice the list (e.g. reuters.fileids()[0:25]) to run faster
for doc_id in reuters.fileids(): 
    # reuters.open(doc_id).read() returns the raw text of the news article, which extract_entities passes to spaCy
    persons, organizations, locations = extract_entities(doc_id, reuters.open(doc_id).read())
    
    # you will need to write something here to put the persons and locations found in a document into the 
    # combined_persons, combined_organizations, and combined_locations dictionaries.
    # here you will need to consider how to extend the values already in the dictionaries
    # maybe something like:
    for person in persons.keys():
        if person not in combined_persons.keys():
            # add a person key to the combined persons list
            combined_persons[person] = []
            # now here you can add the person's document ids and sentence texts to the dictionary value
        combined_persons[person].append(persons[person])
    for org in organizations.keys():
        if org not in combined_organizations.keys():
            combined_organizations[org] = []
        combined_organizations[org].append(organizations[org])
    for loc in locations.keys():
        if loc not in combined_locations.keys():
            combined_locations[loc] = []
        combined_locations[loc].append(locations[loc])       
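
The loops in this step append each document's list as a single element, so every value in the combined dictionaries is a list of per-document lists. One alternative, sketched here under the assumption that a flat list of `(doc_id, sentence)` tuples is preferred, uses `collections.defaultdict` with `extend`. The helper name `merge_into` and the sample records are illustrative only.

```python
from collections import defaultdict


def merge_into(combined, doc_entities):
    """Add one document's {entity: [(doc_id, sentence), ...]} records to `combined`."""
    for entity, mentions in doc_entities.items():
        # extend keeps each value a flat list of (doc_id, sentence) tuples
        combined[entity].extend(mentions)


# defaultdict(list) creates the empty list on first access, so no key check is needed
combined_persons_flat = defaultdict(list)
merge_into(
    combined_persons_flat,
    {"Reagan": [("test/1", "Reagan spoke."), ("test/1", "Reagan left.")]},
)
merge_into(
    combined_persons_flat,
    {"Reagan": [("test/2", "Reagan signed the bill.")]},
)
```

With a flat layout, `len(combined_persons_flat[name])` counts individual mentions directly, but any code that expects the nested per-document lists would need a single loop instead of two.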
            

## Step 4) Fill in the following method to look through the content of an entity dictionary to determine the most popular based on number of mentions

In [5]:
# now that we have the text associated with the entities, 
# you will want to focus on the 500 top entities in each category
# Identify the top 500 entities by the count of their occurrences
def find_most_popular_entities(entity_dictionary):
    # sort the entities by the length of their value list; each document contributes one element,
    # so this ranks entities by the number of documents that mention them
    sorted_dic_keys = [k for k, v in sorted(entity_dictionary.items(), reverse=True, key=lambda item: len(item[1]))]
    list_of_dictionary_keys_with_most_mentions = sorted_dic_keys[:500]
    return list_of_dictionary_keys_with_most_mentions
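
Because each value in the combined dictionaries holds one element per document, `len(item[1])` ranks entities by document count. If ranking by individual sentence mentions is wanted instead, a sketch along these lines sums the lengths of the per-document lists. The function name `find_most_mentioned_entities`, the `top_n` parameter, and the sample data are assumptions for illustration.

```python
def find_most_mentioned_entities(entity_dictionary, top_n=500):
    """Rank entities by total sentence mentions across all documents."""

    def mention_count(item):
        _, per_document_mentions = item
        # each element is one document's list of (doc_id, sentence) tuples
        return sum(len(mentions) for mentions in per_document_mentions)

    ranked = sorted(entity_dictionary.items(), key=mention_count, reverse=True)
    return [entity for entity, _ in ranked[:top_n]]


# Illustrative data only
sample = {
    # two documents, two mentions in total
    "Reagan": [[("test/1", "s1")], [("test/2", "s2")]],
    # one document, three mentions in total
    "Baker": [[("test/3", "s3"), ("test/3", "s4"), ("test/3", "s5")]],
}
top_by_mentions = find_most_mentioned_entities(sample, top_n=1)
```

In this sample "Baker" ranks first by mentions, whereas a document-count ranking would put "Reagan" first.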


## Step 5) Now invoke your top entity mention finder

In [6]:
# simply get the top persons, locations, and organizations
top_persons = find_most_popular_entities(combined_persons)
top_locations = find_most_popular_entities(combined_locations)
top_organizations = find_most_popular_entities(combined_organizations)

## Step 6) Analyze the most popular entities to determine what words they most frequently occur with

In [7]:
# use these dictionaries to store the most frequent terms associated with the entities
person_most_popular_terms = {}
organization_most_popular_terms = {}
location_most_popular_terms = {}

# finally, find the most frequent tokens associated with the entities
def most_frequent_context_terms(entity, combined_entities, top_n=5):
    # count the lemmas of the words that share a sentence with the entity
    token_counts = {}
    # each value in combined_entities is a list of per-document lists of (doc_id, sentence) tuples
    for doc_mentions in combined_entities[entity]:
        for _, sentence in doc_mentions:
            for token in nlp(sentence):
                # skip the entity's own words, non-alphabetic tokens, and stop words
                if token.text not in entity and token.is_alpha and not token.is_stop:
                    token_counts[token.lemma_] = token_counts.get(token.lemma_, 0) + 1
    # sort once, after every sentence has been counted
    sorted_tokens = sorted(token_counts, key=token_counts.get, reverse=True)
    return sorted_tokens[:top_n]

for person in top_persons:
    person_most_popular_terms[person] = most_frequent_context_terms(person, combined_persons)

for organization in top_organizations:
    organization_most_popular_terms[organization] = most_frequent_context_terms(organization, combined_organizations)

for location in top_locations:
    location_most_popular_terms[location] = most_frequent_context_terms(location, combined_locations)
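
The counting in this step can also be expressed with `collections.Counter`. The sketch below is independent of spaCy: it assumes the tokens have already been filtered and lemmatized, and the helper name `top_terms` and the sample lemmas are hypothetical.

```python
from collections import Counter


def top_terms(token_lists, n=5):
    """Return the n most frequent terms across several lists of tokens."""
    counts = Counter()
    for tokens in token_lists:
        # update adds one count per token occurrence
        counts.update(tokens)
    # most_common returns (term, count) pairs, highest count first
    return [term for term, _ in counts.most_common(n)]


# Illustrative data only: one list of lemmas per sentence
sentences_as_lemmas = [
    ["say", "trade", "Japan"],
    ["say", "trade", "administration"],
    ["say", "President"],
]
terms = top_terms(sentences_as_lemmas, n=2)
```

In practice each inner list could be built from `token.lemma_` values after the stop-word and `is_alpha` filters have been applied.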

## Step 7) Present your results of the most popular entities and their associated terms

In [8]:
# present your results
person_most_popular_terms

{'Avg': ['vs', 'shrs', 'mln', 'note', 'mth'],
 'Reagan': ['say', 'President', 'trade', 'administration', 'Japan'],
 'James Baker': ['say', 'Treasury', 'Secretary', 'rate', 'dollar'],
 'Clayton Yeutter': ['say', 'trade', 'Trade', 'Representative', 'Japan'],
 'Kiichi Miyazawa': ['Finance', 'Minister', 'say', 'japanese', 'dollar'],
 'Ecus': ['tonne', 'rebate', 'export', 'maximum', 'kilo'],
 'Yasuhiro Nakasone': ['Prime', 'Minister', 'say', 'Japan', 'Washington'],
 'Rotterdam': ['say', 'port', 'union', 'general', 'cargo'],
 '1986/87': ['tonne', 'mln', 'say', 'crop', 'wheat'],
 'Satoshi Sumita': ['say', 'Japan', 'Bank', 'rate', 'governor'],
 'Richard Lyng': ['Agriculture', 'say', 'Secretary', 'Japan', 'farm'],
 'Louvre': ['say', 'accord', 'agreement', 'Baker', 'currency'],
 'Nigel Lawson': ['say', 'Chancellor', 'rate', 'Exchequer', 'market'],
 'Brown': ['Wagner', 'AFG', 'Inc', 'Industries', 'offer'],
 'Banks': ['billion', 'mark', 'reserve', 'Bundesbank', 'bid'],
 'Edouard Balladur': ['Minis

In [9]:
organization_most_popular_terms

{'mln': ['dlrs', 'say', 'share', 'year', 'net'],
 'cts': ['vs', 'loss', 'Shr', 'profit', 'shr'],
 'pct': ['say', 'rise', 'year', 'January', 'February'],
 'QTR': ['net', 'loss', 'SHR', 'dlrs', 'CTS'],
 'Reuters': ['tell', 'official', 'say', 'spokesman', 'year'],
 'EC': ['say', 'export', 'trade', 'European', 'Community'],
 'MLN': ['dlrs', 'DLRS', 'stg', 'say', 'WEEK'],
 'USDA': ['say', 'mln', 'tonne', 'official', 'corn'],
 'QTR NET\n  ': ['INC', 'CORP', 'INDUSTRIES', 'dlrs', 'vs'],
 'Treasury': ['say', 'billion', 'Secretary', 'dlrs', 'Baker'],
 'FED': ['say', 'set', 'billion', 'CUSTOMER', 'REPURCHASE'],
 'the Securities and Exchange Commission': ['share',
  'say',
  'filing',
  'dlrs',
  'common'],
 'PCT': ['rate', 'BANK', 'prime', 'rise', 'cuts'],
 'QTR NET': ['dlrs', 'vs', 'shr', 'ct', 'CORP'],
 'Fed': ['say', 'pct', 'dlrs', 'reserve', 'rate'],
 'Congress': ['say', 'trade', 'bill', 'Reagan', 'administration'],
 'The Bank of England': ['say', 'stg', 'mln', 'market', 'money'],
 'OPEC': [

In [10]:
location_most_popular_terms

{'Europe': ['say', 'mln', 'export', 'Japan', 'market'],
 'Gulf': ['say', 'oil', 'iranian', 'price', 'Iran'],
 'Africa': ['say', 'East', 'Middle', 'year', 'country'],
 'North America': ['say', 'pct', 'Europe', 'investment', 'local'],
 'Asia': ['say', 'Europe', 'Oceania', 'South', 'billion'],
 'the Middle East': ['say', 'oil', 'Africa', 'increase', 'fall'],
 'the\n  Gulf': ['oil', 'say', 'soviet', 'Iran', 'tanker'],
 'West': ['say', 'german', 'year', 'sale', 'commodity'],
 'North Sea': ['oil', 'say', 'price', 'dlrs', 'cut'],
 '1986/87': ['mln', 'tonne', 'forecast', 'export', 'vs'],
 'Western Europe': ['mln', 'dlrs', 'say', 'billion', 'Japan'],
 'Latin America': ['dlrs', 'pct', 'billion', 'say', 'mln'],
 'Mediterranean': ['say', 'price', 'trader', 'country', 'Greece'],
 'South America': ['say', 'production', 'import', 'soybean', 'likely'],
 'the Far East': ['mln', 'say', 'pct', 'Europe', 'sale'],
 'Middle East': ['oil', 'mln', 'official', 'supply', 'war'],
 'Mideast': ['crude', 'oil', 'dl

## Extra Credit

There are several extra credit options for this assignment. 
* The first would be to determine which persons, organizations, and locations most frequently occur in the same sentences.
* Another task would be to attempt to resolve different forms of the same name for each person and location. For example, George Bush and Bush inside the same document.
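
For the first option, one possible approach is to count (person, organization, location) triples per sentence. The sketch below assumes the entities of each sentence have already been extracted into three sets; `count_cooccurrences` and the sample data are illustrative and not part of the assignment.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(sentence_entities):
    """Count (person, organization, location) triples that share a sentence.

    `sentence_entities` is an iterable of (persons, organizations, locations)
    tuples, where each element is a set of entity strings for one sentence.
    """
    counts = Counter()
    for persons, organizations, locations in sentence_entities:
        # product yields every combination and nothing if any set is empty
        counts.update(product(persons, organizations, locations))
    return counts


# Illustrative data only
sample = [
    ({"Reagan"}, {"Congress"}, {"Mideast Gulf"}),
    ({"Reagan"}, {"Congress", "House"}, {"Mideast Gulf"}),
    ({"Reagan"}, set(), {"Persian Gulf"}),
]
triples = count_cooccurrences(sample)
most_common_triple = triples.most_common(1)[0]
```

A flat `Counter` keyed by tuples avoids nested dictionaries, and `most_common` returns the sorted result directly.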

In [11]:
# count how often each (location, person, organization) triple shares a sentence
loc_per_org = {}

for location in top_locations:
    loc_per_org[location] = {}
    # gather every sentence that mentions this location
    sents = []
    for lst in combined_locations[location]:
        for tup in lst:
            sents.append(tup[1])
    for sent in sents:
        pers = set()
        orgs = set()
        analyzed = nlp(sent)
        for entity in analyzed.ents:
            if entity.text.strip() != "":
                if entity.label_ == 'PERSON':
                    pers.add(entity.text)
                elif entity.label_ == 'ORG':
                    orgs.add(entity.text)
        for per in pers:
            # create the person's counter on first sight, then count each organization once per sentence
            org_counts = loc_per_org[location].setdefault(per, {})
            for org in orgs:
                org_counts[org] = org_counts.get(org, 0) + 1

# flatten into [location, person, organization, count] rows and sort by count
nested_list = []
for loc, persons_in_loc in loc_per_org.items():
    for per, org_counts in persons_in_loc.items():
        for org, count in org_counts.items():
            nested_list.append([loc, per, org, count])
sorted_list = sorted(nested_list, reverse=True, key=lambda x: x[3])
sorted_list[:13]

[['Europe', 'Stora', 'Papyrus', 2],
 ['Europe', 'Stora', 'Reed International Plc\n  &lt;REED L', 2],
 ['Gulf', 'Mir-Hossein', 'U.S. Navy', 2],
 ['Gulf', 'Avaj', 'Lloyds', 2],
 ['the\n  Gulf', 'Alexander Ivanov', 'Foreign Ministry', 2],
 ['the\n  Gulf', 'Al-Rai al-Aam', 'Foreign Ministry', 2],
 ['Persian Gulf', 'Reagan', 'Kuwaiti', 2],
 ['Arabian Sea', 'Kitty Hawk', 'Pentagon', 2],
 ['the Pacific coast', 'Balao', 'Lago Agrio', 2],
 ['Mideast Gulf', 'Reagan', 'House', 2],
 ['Mideast Gulf', 'Reagan', 'Congress', 2],
 ['the\n  Mark', 'Lira', 'Bank', 2],
 ['Admiralty Island', 'Hecla', 'pct', 2]]