# Homework #1

In this homework you will be analyzing a number of articles. These articles are derived from a number of different news sections. For example, some will be from the politics section and some will be from entertainment, etc. Some documents will be somewhat duplicative.

What we would like to do is get an idea of the type of vocabulary employed across all the documents. The hope is that by knowing the vocabulary across these documents we might identify certain words that are more strongly associated with particular topics. Thus, we might expect that entertainment articles contain the word "film" whereas politics articles contain the word "vote."

So we will want to generate several pieces of data. Your code should produce this information:
- we want a list of all the unique words in this document corpus (you will want to consider lowercasing, stemming, and potentially lemmatization)
- we want a list of the words in this document corpus ranked by their frequency
- we want a list of each word's average frequency in documents (this indicates, on average, how many times the term occurs in a document)
- we want to identify which documents contain words that are unique in the corpus (imagine a document mentioning a person with a distinct last name. It's possible that the name occurs only once, in one document. We want to find all those documents.)
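
To make the four outputs concrete, here is a minimal sketch on a toy corpus. It is only an illustration, not the required solution: the `toy_corpus` data and all variable names are hypothetical, the tokens are assumed to be already lowercased and lemmatized, the average is taken over the documents that contain the word, and "unique" is read as "appears in exactly one document".

```python
from collections import Counter

# Hypothetical toy corpus: each document is already a list of normalized tokens.
toy_corpus = {
    "doc_a": ["film", "actor", "film", "award"],
    "doc_b": ["vote", "senate", "vote", "film"],
    "doc_c": ["vote", "ballot"],
}

# Per-document term counts.
doc_counts = {name: Counter(tokens) for name, tokens in toy_corpus.items()}

# 1. All unique words in the corpus.
vocabulary = sorted(set().union(*toy_corpus.values()))

# 2. Words ranked by total frequency across the corpus.
corpus_counts = Counter()
for counts in doc_counts.values():
    corpus_counts.update(counts)
ranked = corpus_counts.most_common()

# 3. Average occurrences per document, over the documents containing the word
#    (assumption: documents without the word are left out of the average).
doc_freq = Counter(word for counts in doc_counts.values() for word in counts)
average_per_doc = {w: corpus_counts[w] / doc_freq[w] for w in vocabulary}

# 4. Documents that contain words found in no other document
#    (assumption: "unique" means a document frequency of exactly 1).
docs_with_unique = {}
for name, counts in doc_counts.items():
    unique_words = sorted(w for w in counts if doc_freq[w] == 1)
    if unique_words:
        docs_with_unique[name] = unique_words
```

Counting how many documents each word appears in (`doc_freq`) answers the last question in a single pass over the corpus, which avoids comparing every pair of documents.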


In [1]:
# we will use spacy
import spacy
nlp = spacy.load("en_core_web_sm")


In [2]:
import os

dir_base = "/Users/teacher/repos/s21_ds_nlp/homeworks/homework_1/data" 
# point this to the data directory

# you can use the below code to read all of the text files and then have them available in a list

def read_file(filename):
    # the context manager closes the file handle once the text has been read
    with open(filename, encoding='utf-8') as input_file:
        return input_file.read()

    
def read_directory_files(directory):
    file_texts = []
    files = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]
    for f in files:
        file_text = read_file(os.path.join(directory, f))
        file_texts.append({"file":f, "content": file_text })
    return file_texts
    
# here we will generate the list that contains all the files and their contents
text_corpus = read_directory_files(dir_base)


In [3]:
# function to extract all the words in a document
def document_words(document):
    word_list = []
    analyzed = nlp(document)
    for token in analyzed:
        #print(token)
        # you will want to look at spacy's documentation on token properties
        # https://spacy.io/api/token/
        # This has the properties you can use in handling the individual tokens
        # check to see if the token is alphabetical. 
        # Also, make sure to check that it isn't a stop word.
        if token.is_alpha and not token.is_stop:
            possible_add = token.lemma_.lower()
            if possible_add not in word_list:
                word_list.append(possible_add)
        # here add to the word_list list
    return word_list


In [4]:
# function to count how often each word occurs in a document
def word_frequency(document):
    document_word_frequency_list = []
    doc_word_frequency_dict = {}
    analyzed = nlp(document)
    for token in analyzed:
        # you will want to look at spacy's documentation on token properties
        # https://spacy.io/api/token/
        # This has the properties you can use in handling the individual tokens
        
        # here add to the word_list list
        # you will want to have something to capture the frequency of the word inside the document here.
        # you want the document word frequency list as you will want to be able to get document level count info
        
        
        if token.is_alpha and not token.is_stop:
            possible_add = token.lemma_.lower()
            document_word_frequency_list.append(possible_add)

    # now iterate over the list of kept tokens and count them
    for token in document_word_frequency_list:
        doc_word_frequency_dict[token] = doc_word_frequency_dict.get(token, 0) + 1
    
    token_count_tuples = [(doc_word_frequency_dict[token], token) for token in doc_word_frequency_dict]
    sorted_frequency = sorted(token_count_tuples, reverse=True)
    # the result is a list of (count, token) tuples, highest count first
    
    return sorted_frequency



In [None]:
# Iteration example. This will iterate over every document.

# This will be the place where we store every unique word seen across the corpus
all_unique_words = []

all_doc_terms = []

all_doc_frequency = {}

for doc in text_corpus:
    #print(doc["content"]) # each corpus entry is a dict. You can get the text by looking at the content key
    doc_unique_words = document_words(doc["content"])
    
    # here we can add the unique word list so that we can reuse it later without having to reprocess all docs
    all_doc_terms.append(doc_unique_words)
    
    # Iterate over doc unique words, see if we should add them to master list
    for term in doc_unique_words:
        if term not in all_unique_words:
            all_unique_words.append(term)
    
    doc_frequency = word_frequency(doc["content"])
    for freq, token in doc_frequency:
        if token not in all_doc_frequency.keys():
            all_doc_frequency[token] = []
        all_doc_frequency[token].append(freq)

# now we have a dictionary of the count that each term has across all documents
term_average_frequency = []
for term in all_doc_frequency.keys():
    num_docs = len(all_doc_frequency[term])
    average_times = sum(all_doc_frequency[term])/num_docs
    term_info = (term, num_docs, average_times)
    # here we just have a tuple of the term, the number of docs, and the average times
    term_average_frequency.append(term_info)

In [None]:
# Comparison Iteration Example.
# We don't want to iterate over every combination twice

all_doc_unique_words = []

for index_i, doc_i in enumerate(text_corpus):
    temp_row = [] # this will be how we accumulate the upper portion of the matrix
    doc_i_unique_words = all_doc_terms[index_i]
    doc_i_set = set(doc_i_unique_words)
    
    for index_j, doc_j in enumerate(text_corpus):
        if index_j == index_i:
            continue
        # here we will want to get the word list from doc_i and compare it to the word list from doc_j
        # This should tell us which words are unique between the two
        # We can then accumulate the unique words per document and at the end of iterating through all
        # documents we should be able to get which words are actually unique in the document.
        doc_j_unique_words = all_doc_terms[index_j]
        doc_j_set = set(doc_j_unique_words)
        # here we see what words are unique for doc i
        # we can then take the doc_i_set and assign it the unique words between these two
        # gradually we will find which terms are unique to the input document
        doc_i_set = doc_i_set - doc_j_set
        # record the words still unique to doc i after this comparison
        temp_row.append(doc_i_set)
    all_doc_unique_words.append(doc_i_set)

#print(all_doc_unique_words)


Put a brief report of your approach, your results, and any conclusions you can draw from this here.

Extra credit: 

Try to identify potential irregular verbs automatically. (Exploit the fact that the lemma and the surface form of an irregular verb are typically different in a way regular endings do not explain. For example, "eat" and "ate" share the lemma "eat" but have different surface forms.)
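
One possible starting point is sketched below. It is a heuristic under stated assumptions, not a complete solution: the helper names `looks_regular` and `irregular_candidates` are hypothetical, the input is assumed to be `(surface, lemma, tag)` triples such as those you could build from spaCy's `token.text`, `token.lemma_`, and `token.tag_`, and past forms are assumed to carry the Penn Treebank tags `VBD` and `VBN`. A past form is flagged when it cannot be produced from its lemma by the usual "-ed" spelling rules.

```python
def looks_regular(surface, lemma):
    """Return True if `surface` is a regular -ed past form of `lemma`."""
    surface, lemma = surface.lower(), lemma.lower()
    if not lemma:
        return False
    candidates = {lemma + "ed"}                     # walk -> walked
    if lemma.endswith("e"):
        candidates.add(lemma + "d")                 # love -> loved
    if lemma.endswith("y") and len(lemma) > 1 and lemma[-2] not in "aeiou":
        candidates.add(lemma[:-1] + "ied")          # try -> tried
    candidates.add(lemma + lemma[-1] + "ed")        # stop -> stopped
    return surface in candidates


def irregular_candidates(triples):
    """Collect (lemma, surface) pairs for past forms that do not look regular."""
    found = set()
    for surface, lemma, tag in triples:
        # Only past tense (VBD) and past participle (VBN) forms are checked.
        if tag in {"VBD", "VBN"} and not looks_regular(surface, lemma):
            found.add((lemma.lower(), surface.lower()))
    return sorted(found)


# Hypothetical hand-written triples standing in for spaCy output.
example_triples = [
    ("ate", "eat", "VBD"),
    ("walked", "walk", "VBD"),
    ("stopped", "stop", "VBD"),
    ("tried", "try", "VBN"),
    ("went", "go", "VBD"),
    ("eats", "eat", "VBZ"),
]
candidates = irregular_candidates(example_triples)
```

Because the result depends on the quality of the lemmatizer and tagger, treat the output as a list of candidates to inspect by hand rather than a definitive list of irregular verbs.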