## Entity Extraction from Reuters Corpus using spacy and NLTK

Performed some analysis with entity extraction. In particular,I looked at the Reuters corpus and tried to construct entity profiles of persons and locations. This will require you to iterate through the documents in the Reuters corpus, parse them appropriately, extract entities, and then store the entities along with some surrounding text. Additionally, you will be looking for mechanisms to identify potential relationships between persons and locations.

Throughout this you will need to use NLTK to access the corpus. At the same time, you will need to use an entity extraction system. You can choose to use either NLTK or Spacy. I would strongly suggest using Spacy for the entity extraction portion of this assignment.

The basic idea is to build a knowledge base around the entities you will extract in the Reuters corpus. Normally, this would be a first step to trying to model such things as entity resolution across documents. You could also use this as a first step to analyzing the sentiment towards particular entities. For example, people expressing dissatistfaction at a restaurant or brand.


## Step 1) Import necessary libraries 

In [1]:
# This will be the corpus we work from
from nltk.corpus import reuters
from nltk.tokenize import word_tokenize
import spacy
import heapq

In [2]:
nlp = spacy.load('en_core_web_sm')

## Step 2) FIll in the following function to extract the entity, document id, and relevant sentence text from the input

In [3]:
def extract_entities(doc_id, doc_text):
    analyzed_doc = nlp(doc_text)
    
    # these two dictionaries will include all the persons and locations you find in a document.
    # You will need to add each person or location you encounter in the document to them
    # for the key you can use the text of the entity, for the value you will want to use the document_id and the
    # text of the sentence one challenge could be that an entity might occur multiple times in the document, 
    # thus the value should really be a document id and a list of the text of the sentences ( or something such as that)
    doc_persons = {}
    doc_locations = {}
    
    for entity in analyzed_doc.ents:
        if entity.text.strip() != "":
            # The .label_ property will provide information on the type of entity tagged
            #print(" -> ", entity.label_)
            # The .text property will display the actual text of the entity in the text
            #print("->", entity.text.strip(), "<-")
            # You can also access the sentence that the entity is contained in by using the .sent property
            # inside the sentence you can then use the .text property
            #print("->", entity.sent.text, "<-")
            
            
            # one way to represent the document id and the sentence text would be with a tuple
            # thus, you could do:
            relevant_sentence = (doc_id, entity.sent.text)
            
            if entity.label_ == "LOC":
                if entity.text.strip() not in doc_locations.keys():
                    doc_locations[entity.text.strip()] = []
                    doc_locations[entity.text.strip()].append(relevant_sentence)
            
            elif entity.label_ == "PERSON":
                if entity.text.strip() not in doc_persons.keys():
                    doc_persons[entity.text.strip()] = []
                    doc_persons[entity.text.strip()].append(relevant_sentence)
            
            # add the relevant document id and sentence to the entity record
            
            
            
    return doc_persons, doc_locations
        

## Step 3) Adjust the following code to run the document entity extraction function
## Also, add the entity records you are constructing to your master list of entities
## Note: for the full subission run across all the Reuters documents

In [4]:
num_docs = len(reuters.fileids()[0:25])
#  this has a large number of files... 
# you might wish to limit the number of documents you use while developing your technique 
# ex. reuters.fileids()[0:25]

# these two dictionaries will incorporate all the referneces to 
combined_persons = {}
combined_locations = {}

# this will only iterate over the first 25 documents, for the real submission you will need to run across all documents
for doc_id in reuters.fileids()[0:25]:
    # this doc_text variable will give you a text version of the news article. This could be tokenized.
    #wordtokens = sent_tokenize(reuters.open(doc_id).read())
    persons, locations = extract_entities(doc_id, reuters.open(doc_id).read())
    # you will need to write something here to put the persons and locations found in a document into the 
    # combined_persons and combined_locations dictionaries.
                                          
    combined_persons.update(persons)
    combined_locations.update(locations)
    
   
    # here you will need to consider how to extend the values already in the dictionaries
    # maybe something like:
    #         --- add a person key to the combined persons list
    #     now here you can add the person's document ids and sentence texts to the dictionary value
    
    
    for person in persons.keys():
        if person not in combined_persons.keys():
            combined_persons[person].append(person.value)
    
    for location in locations.keys():
        if location not in combined_locations.keys():
            combined_locations[location].append(locations.value)
            


## Step 4) Fill in the following method to look through the content of an entity dictionary to determine the most popular based on number of mentions

In [5]:
combined_persons

{'FEAR DAMAGE': [('test/14826',
   'ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT\n  Mounting trade friction between the\n  U.S.')],
 'Reuter': [('test/14826',
   'They told Reuter correspondents in Asian capitals a U.S.\n  Move against Japan might boost protectionist sentiment in the\n  U.S. And lead to curbs on American imports of their products.\n      ')],
 'Tom\n  Murtha': [('test/14826',
   '"If the tariffs remain in place for any length of time\n  beyond a few months it will mean the complete erosion of\n  exports (of goods subject to tariffs) to the U.S.," said Tom\n  Murtha, a stock analyst at the Tokyo office of broker &lt;James\n  Capel and Co>.\n      ')],
 'Paul Sheen': [('test/14826',
   '"We must quickly open our markets, remove trade barriers and\n  cut import tariffs to allow imports of U.S. Products, if we\n  want to defuse problems from possible U.S. Retaliation," said\n  Paul Sheen, chairman of textile exporters &lt;Taiwan Safe Group>.\n      ')],
 'Lawrence Mill

In [6]:
combined_locations

{'Asia': [('test/14862',
   "The surplus enabled the\n  country to reduce its foreign debt last year for the first\n  time.\n      South Korea's foreign debt, which fell to 44.5 billion dlrs\n  in 1986 from 46.8 billion in 1985, is still among the largest\n  in Asia.\n      ")],
 'Europe': [('test/14872',
   'Profit from Australia and the Far East showed the greatest\n  percentage rise, jumping 55.0 pct to 15.5 mln from 10.0 mln,\n  while the profit from U.K. Operations rose 30.7 pct to 24.7\n  mln, and Europe, 42.9 pct to 11.0 mln.\n  \n\n')],
 'Pacific Northwest': [('test/14841',
   'SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE\n  Food Department officials said the U.S.\n  Department of Agriculture approved the Continental Grain Co\n  sale of 52,500 tonnes of soft wheat at 89 U.S. Dlrs a tonne C\n  and F from Pacific Northwest to Colombo.\n      ')],
 'the Northern Territory': [('test/14842',
   'WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRALIA\n  Western Mining Corp Holdings Ltd\n 

In [7]:
# now that we have the text associated with the entities, 
# you will want to focus on the 500 top entities in each category
# Identify the top 500 entities by the count of their occurrences
def find_most_popular_entities(entity_dictionary):
    # sort through the entities in the dictionary by the number of sentences
    popular_entities = heapq.nlargest(500,entity_dictionary, key=entity_dictionary.get)        
    most_popular_entity ={}
    for key,value in entity_dictionary.items():
        for key1 in popular_entities:
            if key1 == key:
                most_popular_entity[key1]=value
                
    return most_popular_entity

## Step 5) Now invoke your top entity mention finder

In [8]:
# simply get the top persons and locations
top_persons = find_most_popular_entities(combined_persons)
top_locations = find_most_popular_entities(combined_locations)

## Step 6) Analyze the most popular entities to determine what words they most frequently occur with

In [9]:
top_persons

{'FEAR DAMAGE': [('test/14826',
   'ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT\n  Mounting trade friction between the\n  U.S.')],
 'Reuter': [('test/14826',
   'They told Reuter correspondents in Asian capitals a U.S.\n  Move against Japan might boost protectionist sentiment in the\n  U.S. And lead to curbs on American imports of their products.\n      ')],
 'Tom\n  Murtha': [('test/14826',
   '"If the tariffs remain in place for any length of time\n  beyond a few months it will mean the complete erosion of\n  exports (of goods subject to tariffs) to the U.S.," said Tom\n  Murtha, a stock analyst at the Tokyo office of broker &lt;James\n  Capel and Co>.\n      ')],
 'Paul Sheen': [('test/14826',
   '"We must quickly open our markets, remove trade barriers and\n  cut import tariffs to allow imports of U.S. Products, if we\n  want to defuse problems from possible U.S. Retaliation," said\n  Paul Sheen, chairman of textile exporters &lt;Taiwan Safe Group>.\n      ')],
 'Lawrence Mill

In [44]:
# use these two dictionaries to store the most frequent terms associated with the entities
person_most_popular_terms = {}
location_most_popular_terms = {}


# finally, now find the most frequent tokens associated with the entities
for key,value in top_persons.items():
    sentences=value[0]
    person_most_popular_terms[key]=word_tokenize(sentences[1])
    
    
#for location in top_locations:
    # fill this dictionary with all the words in the context of the location entity
for key1,value1 in top_locations.items():
    sentences1=value1[0]
    location_most_popular_terms[key1]=word_tokenize(sentences1[1])
    
    

## Step 7) Present your results of the most popular entities and their associated terms

In [45]:
location_most_popular_terms

{'Asia': ['The',
  'surplus',
  'enabled',
  'the',
  'country',
  'to',
  'reduce',
  'its',
  'foreign',
  'debt',
  'last',
  'year',
  'for',
  'the',
  'first',
  'time',
  '.',
  'South',
  'Korea',
  "'s",
  'foreign',
  'debt',
  ',',
  'which',
  'fell',
  'to',
  '44.5',
  'billion',
  'dlrs',
  'in',
  '1986',
  'from',
  '46.8',
  'billion',
  'in',
  '1985',
  ',',
  'is',
  'still',
  'among',
  'the',
  'largest',
  'in',
  'Asia',
  '.'],
 'Europe': ['Profit',
  'from',
  'Australia',
  'and',
  'the',
  'Far',
  'East',
  'showed',
  'the',
  'greatest',
  'percentage',
  'rise',
  ',',
  'jumping',
  '55.0',
  'pct',
  'to',
  '15.5',
  'mln',
  'from',
  '10.0',
  'mln',
  ',',
  'while',
  'the',
  'profit',
  'from',
  'U.K.',
  'Operations',
  'rose',
  '30.7',
  'pct',
  'to',
  '24.7',
  'mln',
  ',',
  'and',
  'Europe',
  ',',
  '42.9',
  'pct',
  'to',
  '11.0',
  'mln',
  '.'],
 'Pacific Northwest': ['SRI',
  'LANKA',
  'GETS',
  'USDA',
  'APPROVAL',
  'FOR

In [43]:
person_most_popular_terms

{'FEAR DAMAGE': ['ASIAN',
  'EXPORTERS',
  'FEAR',
  'DAMAGE',
  'FROM',
  'U.S.-JAPAN',
  'RIFT',
  'Mounting',
  'trade',
  'friction',
  'between',
  'the',
  'U.S',
  '.'],
 'Reuter': ['They',
  'told',
  'Reuter',
  'correspondents',
  'in',
  'Asian',
  'capitals',
  'a',
  'U.S.',
  'Move',
  'against',
  'Japan',
  'might',
  'boost',
  'protectionist',
  'sentiment',
  'in',
  'the',
  'U.S.',
  'And',
  'lead',
  'to',
  'curbs',
  'on',
  'American',
  'imports',
  'of',
  'their',
  'products',
  '.'],
 'Tom\n  Murtha': ['``',
  'If',
  'the',
  'tariffs',
  'remain',
  'in',
  'place',
  'for',
  'any',
  'length',
  'of',
  'time',
  'beyond',
  'a',
  'few',
  'months',
  'it',
  'will',
  'mean',
  'the',
  'complete',
  'erosion',
  'of',
  'exports',
  '(',
  'of',
  'goods',
  'subject',
  'to',
  'tariffs',
  ')',
  'to',
  'the',
  'U.S.',
  ',',
  "''",
  'said',
  'Tom',
  'Murtha',
  ',',
  'a',
  'stock',
  'analyst',
  'at',
  'the',
  'Tokyo',
  'office',
  '