# Entity Tagging Task

### Jesus I. Ramirez Franco

In this notebook, I explain the steps to extract, from the two given text files, the sentences that contain a word tagged as 'MONEY'. I created a function whose input is a list of paths to the text files and whose output is a dictionary in which each key is the path to a file and each value is a list of all the sentences in that file that mention money.

To do so, I used several auxiliary functions that I explain in the following sections. Those functions are located in the Python file "auxiliary.py".

## Reading the files
To read the files and get the raw text for the entity tagging, I created a function that opens the files whose paths are given in a list and returns a dictionary with the path of every file as key and a string with its text as value:
```
def files_to_raw(file_names):
	'''
	Opens text files and creates a dictionary with the name of
	each file as key and its raw text as value.
	Inputs:
		file_names(list): list of paths to files to be analyzed.
	Returns: a dictionary.
	'''
	text_dict = {}
	for name in file_names:
		# 'ANSI' is an alias of the Windows-only 'mbcs' codec;
		# the with-block closes each file after reading it
		with open(name, encoding='ANSI') as f:
			text_dict[name] = f.read()

	return text_dict
```
It is important to point out that the raw text is not cleaned at this stage, because here we care about some special characters, such as "$" and digits, and cleaning the text could lose that information.
A critical point in this step is identifying the encoding of the files; in this case, it is 'ANSI' (the Windows system code page).
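
The 'Â' characters that show up in the results later in this notebook are typical of UTF-8 text decoded with a single-byte Windows code page, so the files may in fact be UTF-8. As a minimal sketch (the helper name `read_text` and the fallback order are assumptions, not part of "auxiliary.py"), one way to make the encoding choice explicit is to try a list of candidate encodings in order:

```python
import os
import tempfile

def read_text(path, encodings=('utf-8', 'cp1252')):
    """Return the text of a file, trying each candidate encoding in order."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f'none of {encodings} could decode {path}')

# Demo: a UTF-8 file with a non-breaking space between two words.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('maximum\u00a0impact for $100')
    as_utf8 = read_text(demo_path)                           # decoded correctly
    as_cp1252 = read_text(demo_path, encodings=('cp1252',))  # shows a stray 'Â'
```

Trying UTF-8 first is a design choice: it fails loudly on bytes that are not valid UTF-8, whereas a single-byte code page accepts almost any input and silently produces wrong characters.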

## Classifying the words
Once we have the raw text, we can classify all the words in the files. I do this with the Stanford NER classifier, using its 7-class model. Specifically, I used 'StanfordNERTagger' with the 2018-02-27 release, which will be deprecated soon but is still supported. Other classifiers, such as CoreNLPNERTagger, can be used instead; in general, the methodology is the same.

In this step it is necessary to define the paths to the model, the jar file and the Java executable before creating the tagger object:
```
import os
from nltk.tag import StanfordNERTagger

model_path = 'C:/Users/jesus/OneDrive/Documentos/GitHub/NLP_test/stanford-ner-2018-02-27/classifier/english.muc.7class.distsim.crf.ser.gz'
jar_path = 'C:/Users/jesus/OneDrive/Documentos/GitHub/NLP_test/stanford-ner-2018-02-27/stanford-ner.jar'
java_path = 'C:/Program Files/Java/jdk-11/bin/java.exe'
os.environ['JAVAHOME'] = java_path

# Tagger object
nert = StanfordNERTagger(model_path, jar_path)
```

Then, I created a function that not only classifies the words but also records the location (index) of the words with a given tag ('MONEY', for example). These indices are what allow the next steps to identify the sentences that contain those words without running the classifier again.
```
def tagger(text, tag, tagger_=nert):
	'''
	Tags the words of a text with the Stanford NER classifier and keeps
	only the words with the desired tag; for instance, 'MONEY'.
	Inputs:
		text (string): raw text to be analyzed.
		tag (string): tag of interest.
		tagger_: classifier object used to tag the tokens.
	Returns: list of the words with the tag and list of their indices in
	the tokenized text.
	'''
	tokenized_text = nltk.word_tokenize(text)
	tagged_words = tagger_.tag(tokenized_text)
	words = []
	indices = []

	# tagged_words is a list of (word, label) pairs
	for i, (word, label) in enumerate(tagged_words):
		if label == tag:
			words.append(word)
			indices.append(i)

	return words, indices
```
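
To see what the index tracking produces without needing Java or the Stanford model, here is a small sketch of the same filtering step applied to hand-written (token, label) pairs. The labels below are made up for illustration and are not real classifier output, and `positions_with_tag` is a hypothetical helper name:

```python
def positions_with_tag(tagged_words, tag):
    """Return (words, indices) for the tokens whose label equals `tag`."""
    words, indices = [], []
    for i, (word, label) in enumerate(tagged_words):
        if label == tag:
            words.append(word)
            indices.append(i)
    return words, indices

# Hand-written pairs standing in for the output of a tagger's tag() call.
sample = [('We', 'O'), ('need', 'O'), ('a', 'O'), ('gift', 'O'), ('of', 'O'),
          ('$', 'MONEY'), ('100', 'MONEY'), ('.', 'O')]

money_words, money_indices = positions_with_tag(sample, 'MONEY')
print(money_words, money_indices)  # ['$', '100'] [5, 6]
```

The indices refer to positions in the word-tokenized text, which is why the next step needs a mapping from token positions to sentences.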

## Identifying the sentences
Once we have identified the indices of the words with the given tag ('MONEY' in this case), the next step is to use this information to identify the complete sentences that contain these words.
To do so, I use two auxiliary functions: one that relates the indices of the tokenized words to the indices of the tokenized sentences ("token_intervals"), and another that returns the sentences containing the tagged words ("find_sentences"):
```
def token_intervals(sentences):
	'''
	Identifies the intervals of indices in the word-tokenized text
	that correspond to each index in the sentence-tokenized text.
	Inputs:
		sentences(list): list of strings with the text divided into
		sentences.
	Returns: a dictionary that maps the half-open interval (low, high)
	of word indices covered by each sentence to the sentence index.
	'''
	intervals = {}
	low = 0
	for i, sentence in enumerate(sentences):
		# the sentence covers the word indices low, ..., high - 1
		high = low + len(nltk.word_tokenize(sentence))
		intervals[(low, high)] = i
		low = high

	return intervals
```
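
As a worked example of the interval bookkeeping, the sketch below applies the same logic to two short sentences. To stay self-contained it uses a plain whitespace split in place of `nltk.word_tokenize`, which is an assumption made only for this illustration; `intervals_from_lengths` is a hypothetical name:

```python
def intervals_from_lengths(token_counts):
    """Map half-open (low, high) token ranges to sentence indices."""
    intervals = {}
    low = 0
    for sent_idx, n in enumerate(token_counts):
        intervals[(low, low + n)] = sent_idx
        low += n
    return intervals

# Pre-spaced sentences so that str.split() yields one item per token.
sentences = ['We need 1,000 donors .', 'Each gift is $ 100 .']
counts = [len(s.split()) for s in sentences]   # [5, 6]
intervals = intervals_from_lengths(counts)
print(intervals)  # {(0, 5): 0, (5, 11): 1}
```

Tokens 0 to 4 belong to the first sentence and tokens 5 to 10 to the second, so a 'MONEY' token at index 8 or 9 maps to sentence 1.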

```
def find_sentences(text, tag='MONEY'):
	'''
	Identifies the sentences that contain at least one word classified
	with the tag of interest; for instance, 'MONEY'.
	Inputs:
		text(string): raw text to be analyzed.
		tag(string): tag of interest.
	Returns: a list of sentences.
	'''
	sentences = nltk.sent_tokenize(text)
	words, indices = tagger(text, tag)
	intervals = token_intervals(sentences)

	tagged_sentences = []
	for i in indices:
		for (low, high), v in intervals.items():
			# each interval is half-open: low <= i < high
			if low <= i < high:
				if sentences[v] not in tagged_sentences:
					tagged_sentences.append(sentences[v])

	return tagged_sentences
```
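
The lookup above scans every interval for every tagged index. For long texts, one possible alternative (a sketch of a different technique, not what "auxiliary.py" does) is a binary search over the cumulative token counts with the standard `bisect` module. The function name `tagged_sentence_ids` is hypothetical:

```python
from bisect import bisect_right
from itertools import accumulate

def tagged_sentence_ids(indices, token_counts):
    """Return the sentence indices that contain any of the token indices."""
    # ends[j] is the exclusive end index of sentence j in the token stream.
    ends = list(accumulate(token_counts))
    found = []
    for i in indices:
        s = bisect_right(ends, i)          # first sentence whose end is > i
        if s < len(ends) and s not in found:
            found.append(s)
    return found

# Two sentences with 5 and 6 tokens; tokens 8 and 9 are tagged.
print(tagged_sentence_ids([8, 9], [5, 6]))     # [1]
print(tagged_sentence_ids([2, 8, 9], [5, 6]))  # [0, 1]
```

Both approaches return each sentence once, in the order in which its first tagged token appears.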

## Putting it all together
Finally, I created a function that performs the whole task:
1. Reads the files.
2. Creates the classifier and, for every file:
    - Classifies the words.
    - Creates a relation between the indices of the words and the indices of the sentences.
    - Identifies the sentences with the desired tag.
3. Creates the dictionary with the paths of the files as keys and the sentences with at least one word classified as 'MONEY' as values.

```
def get_entity_sentences(file_names, tag='MONEY'):
	'''
	Analyzes each text and returns the sentences of that text that have at
	least one word classified with the tag of interest; for instance, 'MONEY'.
	Inputs:
		file_names(list): list of paths to files to be analyzed.
		tag(string): tag of interest.
	Returns: a dictionary with the name of the file as key and a list of
	sentences as value.
	'''
	files_dict = files_to_raw(file_names)
	results_dict = {}

	for k, v in files_dict.items():
		# pass the requested tag through instead of relying on the default
		results_dict[k] = find_sentences(v, tag)

	return results_dict
```

### Now, let's see how it works:

In [2]:
import auxiliary

In [3]:
paths = ['data/results.txt','data/results1.txt']

In [4]:
# The task takes only seconds
final_results = auxiliary.get_entity_sentences(paths)

### The results look like this:
The auxiliary function "show_sentences" just prints the results in a readable form.

In [5]:
auxiliary.show_sentences(final_results, 'data/results.txt')

Sentences found in data/results.txt
-----------------------------------------------------------------
Â  Support our new building This challenge from the Irving Harris Foundation encourages Harris Public Policy alumni and friendsÂ to come together for maximumÂ impact, securing a broad base of philanthropic support between now and 2019, when the University campaign concludes.Â We need 1,000 donors to either make their first gift of $100 or more, or to increase their largest previous gift by at least $100.
*************************************************************
Founded in 1890, the University has nearly 6,000Â undergraduate students;Â 10,000 graduate, professional, and other students; about 2,300 faculty members; and more than 160,000 alumni worldwide.Â q$X,   https://harris.uchicago.edu/about/who-we-areq%X‡
  Get a jump start now by creating an account with our simple online tool.
*************************************************************
Â  University Trustee Dennis J. Keller,

In [6]:
auxiliary.show_sentences(final_results, 'data/results1.txt')

Sentences found in data/results1.txt
-----------------------------------------------------------------
Future of Diplomacy Project Project on Europe and the Transatlantic Relationship U.S.- Russia Initiative to Prevent Nuclear Terrorism Vietnam Programq$X'   https://www.hks.harvard.edu/centers/cidq%X´  Researching new strategies and tactics to build the capability of public organizations.
*************************************************************
They are now building to 3 million.
*************************************************************
But women are stillÂ 37 percent less likely than men to graduate with degrees in science or engineering and only account for 25 percent of the science, technology, engineering, and math (STEM) workforce.q\X{   https://www.hks.harvard.edu/research-insights/policy-topics/public-leadership-management/performance-specialists-governmentq]X(  Governor Gina Raimondo had just taken office, inheriting a troubled child welfare agency $16 million over b

As we can see, the sentences found look dirty. A normal text-mining pipeline would include steps to clean the strings to be analyzed. The cleaning process includes removing unknown characters, converting all capital letters to lowercase and removing stop words. We perform these tasks with the following auxiliary functions:
```
def clean_sentence(sentence):
	'''
	Removes unknown characters, changes capital letters to lowercase and
	removes English stop words.
	Inputs:
		sentence (string): a string to be cleaned.
	Returns: a string.
	'''
	# keep only letters, digits, underscores and whitespace
	new = ''.join(l for l in sentence if re.match(r'[a-zA-Z0-9_\s]', l))

	tokens = [t.lower() for t in nltk.word_tokenize(new)]

	# build the stop-word set once instead of once per token
	stop_words = set(stopwords.words('english'))
	new_tokens = [t for t in tokens if t not in stop_words]

	return ' '.join(new_tokens)
```

```
def clean_results(results):
	'''
	Creates a new results dictionary with cleaned sentences.
	Inputs:
		results (dictionary): Dictionary with results to be cleaned.
	Returns: a cleaned dictionary.
	'''
	new_dict = {}
	for k, v in results.items():
		sentences = v
		new_dict[k] = [clean_sentence(s) for s in sentences]

	return new_dict
```
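
Because the cleaning step deletes punctuation character by character, a URL embedded in a sentence collapses into one long token. A possible refinement, sketched below under stated assumptions, is to remove URLs before stripping punctuation. The name `clean_without_urls`, the URL pattern and the tiny stop-word set are all stand-ins chosen for this illustration; the real functions use NLTK's English stop-word list:

```python
import re

URL_RE = re.compile(r'https?://\S+')
# Tiny stand-in for nltk's English stop-word list, for illustration only.
SMALL_STOPWORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'with', 'our', 'by', 'now'}

def clean_without_urls(sentence, stop_words=SMALL_STOPWORDS):
    """Drop URLs, strip non-alphanumeric characters, lowercase, drop stop words."""
    no_urls = URL_RE.sub(' ', sentence)
    kept = re.sub(r'[^a-zA-Z0-9_\s]', '', no_urls)
    tokens = [t for t in kept.lower().split() if t not in stop_words]
    return ' '.join(tokens)

raw = 'alumni worldwide. https://harris.uchicago.edu/about/who-we-are Get a jump start now'
cleaned = clean_without_urls(raw)
print(cleaned)  # alumni worldwide get jump start
```

This only removes the URL itself; any stray characters that surround it in the raw files would still need separate handling.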

Once we do so, the text is ready to be analyzed in different ways.
Observe the difference between the dirty results and the cleaned results:

In [7]:
cleaned_results = auxiliary.clean_results(final_results)

In [8]:
auxiliary.show_sentences(cleaned_results, 'data/results.txt')

Sentences found in data/results.txt
-----------------------------------------------------------------
support new building challenge irving harris foundation encourages harris public policy alumni friends come together maximum impact securing broad base philanthropic support 2019 university campaign concludes need 1000 donors either make first gift 100 increase largest previous gift least 100
*************************************************************
founded 1890 university nearly 6000 undergraduate students 10000 graduate professional students 2300 faculty members 160000 alumni worldwide qxhttpsharrisuchicagoeduaboutwhoweareqx get jump start creating account simple online tool
*************************************************************
university trustee dennis j keller mba68 cofounder retired chairman ceo devry education group committed 20 million harris public policy currently secondlargest gift schools history university trustee king harris chairman harris holdings inc board c

In [9]:
auxiliary.show_sentences(cleaned_results, 'data/results1.txt')

Sentences found in data/results1.txt
-----------------------------------------------------------------
future diplomacy project project europe transatlantic relationship us russia initiative prevent nuclear terrorism vietnam programqxhttpswwwhksharvardeducenterscidqxresearching new strategies tactics build capability public organizations
*************************************************************
building 3 million
*************************************************************
women still 37 percent less likely men graduate degrees science engineering account 25 percent science technology engineering math stem workforceqxhttpswwwhksharvardeduresearchinsightspolicytopicspublicleadershipmanagementperformancespecialistsgovernmentqx governor gina raimondo taken office inheriting troubled child welfare agency 16 million budget higher percentage children group care settings almost state country
*************************************************************
audit would reveal agency signing a

The cleaned text is the base for more advanced analysis. For example, let's find the most common bi-grams in each group of sentences:

In [10]:
corpus = ''
for s in cleaned_results['data/results.txt']:
    corpus += s
auxiliary.tokens_freq(corpus, 2)

| | 2-gram | Frequency |
|---|---|---|
| 8 | harris,public | 4 |
| 9 | public,policy | 4 |
| 117 | masters,degree | 3 |
| 86 | university,trustee | 3 |
| 58 | jump,start | 2 |
| 65 | trustee,dennis | 2 |
| 265 | candidates,receive | 2 |
| 296 | degree,program | 2 |
| 295 | admitted,masters | 2 |
| 294 | students,admitted | 2 |


In [11]:
corpus1 = ''
for s in cleaned_results['data/results1.txt']:
    corpus1 += s
auxiliary.tokens_freq(corpus1, 2)

| | 2-gram | Frequency |
|---|---|---|
| 131 | million,homelessness | 2 |
| 80 | children,families | 2 |
| 0 | future,diplomacy | 1 |
| 194 | mpp,2017 | 1 |
| 200 | children,servicestwo | 1 |
| 199 | families,children | 1 |
| 198 | connects,families | 1 |
| 197 | way,connects | 1 |
| 196 | transform,way | 1 |
| 195 | 2017,transform | 1 |
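
The source of "tokens_freq" is not shown in this notebook, so the following is only a sketch of how bi-gram frequencies could be counted with the standard library, not the function's actual implementation. The name `bigram_counts` is hypothetical. It counts bi-grams sentence by sentence, which also avoids a side effect of concatenating the cleaned sentences with no separator: the last word of one sentence and the first word of the next would otherwise fuse into a single token.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count adjacent word pairs within each sentence (never across sentences)."""
    counts = Counter()
    for s in sentences:
        tokens = s.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Two short cleaned sentences made up for this example.
cleaned = ['harris public policy alumni', 'gift harris public policy']
counts = bigram_counts(cleaned)
for pair, freq in counts.most_common(2):
    print(','.join(pair), freq)
```

In this toy input, "harris,public" and "public,policy" each occur twice, and no bi-gram spans the boundary between the two sentences.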
