# Homework 1

The assignment requires you to compute the TF-IDF measures for words in a collection of documents (a corpus). TF-IDF stands for Term Frequency-Inverse Document Frequency and measures the relevance of a term <i><b>t</b></i> (t can be a word or a sequence of words called an n-gram) to a document in a collection. It is a combination of two measures as follows:

1. Term Frequency (TF) measures the frequency of the term t in a document d. If the term occurs n times in a document, then TF is usually the normalized frequency and is defined as:

	TF(t, d) = n/(number of terms in d)

If the TF measure is high, then the term occurs frequently in the document and is considered relevant for that document. 

2. Inverse Document Frequency (IDF) is defined as:

	IDF(t) = log(total number of documents/number of documents containing t)

IDF measures how important a term is in the collection. Terms such as "and" and "the" may occur across all documents and are not considered relevant. So, the IDF value for such terms will be low. 

Finally, TF-IDF(t, d) = TF(t,d) * IDF(t)
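
As a quick numeric check of the three formulas, the sketch below plugs in made-up counts: a term that occurs once in a 7-term document and appears in 1 of 2 documents. The helper names `tf` and `idf` are hypothetical, and the base-10 logarithm is an assumption, since the IDF formula above does not fix the base.

```python
import math

def tf(count_in_doc, terms_in_doc):
    # Normalized term frequency: n / (number of terms in d)
    return count_in_doc / terms_in_doc

def idf(total_docs, docs_with_term):
    # Base-10 logarithm is an assumption made for this illustration
    return math.log10(total_docs / docs_with_term)

term_tf = tf(1, 7)                  # term occurs once in a 7-term document
term_idf = idf(2, 1)                # term appears in 1 of 2 documents
term_tfidf = term_tf * term_idf
common_idf = idf(2, 2)              # a term present in every document

print(round(term_tfidf, 8))         # 0.04300429
print(common_idf)                   # 0.0
```

A term that appears in every document gets an IDF of 0, so its TF-IDF score is 0 no matter how often it occurs.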

For a worked example, see the example section at http://www.tfidf.com

Your corpus could be a single file where each line or paragraph is a document, or it could be a directory with multiple files where each file is a document. See the attached hw1-input.txt for a trivial example of two documents. Here each line is a document, and a newline character marks the end of a document. We will use this example to illustrate the computation of TF-IDF.

You may choose to represent the document collection as a Python list of strings in your program. See the code cell below for a representation of the two documents in hw1-input.txt. We will refer to this document collection as the <i><b>sample document collection</b></i> in this homework. 

IMPORTANT:

(1) You need to strip out ALL punctuation and convert words to lowercase in your program.

(2) <b>You cannot use any Python library that computes the TF-IDF scores.</b>

(3) <b>You are required to use NumPy arrays and the functions/techniques discussed in class to find the denominators in the TF and IDF measures in Parts III and IV</b> (number of terms in a document and number of documents containing t). If you use naive loops to count these measures, you will lose points. Both of these measures require you to compute something across the rows or columns. Think about which aggregate functions you could use in NumPy to do this easily.

In [4]:
sample_document_collection = ['The car is driven on the road', 'The truck is driven on the highway']

In [5]:
print(sample_document_collection) 

['The car is driven on the road', 'The truck is driven on the highway']
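
Requirement (1) above can be met in several ways. The sketch below shows one of them, using `str.translate` with `string.punctuation`; the helper name `normalize` is hypothetical, and the approach assumes the punctuation in your corpus is ASCII (curly quotes and similar characters would need extra handling).

```python
import string

def normalize(document):
    # Delete every ASCII punctuation character, lowercase, then split on whitespace
    table = str.maketrans("", "", string.punctuation)
    return document.translate(table).lower().split()

tokens = normalize("The car, is DRIVEN on the road!")
print(tokens)
# ['the', 'car', 'is', 'driven', 'on', 'the', 'road']
```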


## Part I  - 10 Points

Write a function called doc_vocab that takes a document corpus and returns (1) a vocabulary of words for the corpus and (2) the document collection (a list of documents). The corpus could be a filename or the name of a directory with multiple files in it; you can assume the former for simplicity.

You can implement the vocabulary as a dictionary where the key is the word and the value (to which the key is mapped) is a unique index assigned to the word. 

The dictionary for the sample document collection could be something like 
{'the': 0, 'car': 1, 'is': 2, 'driven': 3, 'on': 4, 'road': 5, 'truck': 6, 'highway': 7}

Note that the actual indices you have could be different. They just have to be unique and consecutive i.e. for n words in the dictionary, the indices should range from 0 to n-1. Also note that words "the" and "The" are assumed to be the same.  
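
One possible way to get unique, consecutive indices is to assign each new word the current size of the dictionary. The sketch below assumes the documents are already lowercased, punctuation-free token lists held in memory (it does not read a file), and the helper name `build_vocabulary` is hypothetical.

```python
def build_vocabulary(token_lists):
    # Assign each unseen word the next free index: 0, 1, 2, ...
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

token_lists = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
]
vocabulary = build_vocabulary(token_lists)
print(vocabulary)
# {'the': 0, 'car': 1, 'is': 2, 'driven': 3, 'on': 4, 'road': 5, 'truck': 6, 'highway': 7}
```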


In [6]:
# PART I: build lexicon and document collection
def doc_vocab(filename):
    ## fill in your code here
    # you may import libraries and define additional helper functions outside this function
    # lexicon is a synonym for vocabulary
    
    return lexicon, collection

## Part II - 20 Points

Write a function called <i><b>doc_term_matrix</b></i> that takes a vocabulary and a document collection as parameters and returns a document term matrix. This matrix should be a two-dimensional NumPy array with the following shape: (number of documents in collection, number of terms in vocabulary). For the sample document collection, the array's shape will be (2, 8). If the array is M, then M[i, j] will contain the number of times term j occurs in document i. For the sample document collection, you may get something like 

[[2 1 1 1 1 1 0 0]

 [2 0 1 1 1 0 1 1]] . 

Note that the columns may be arranged differently based on the indices for the terms in the vocabulary. 
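
To make the M[i, j] layout concrete, here is a minimal sketch on a made-up three-word vocabulary (not the sample document collection). The names `toy_vocabulary`, `toy_documents`, and `counts` are hypothetical, and `np.zeros` with an integer dtype is just one reasonable way to allocate the array.

```python
import collections
import numpy as np

toy_vocabulary = {"red": 0, "green": 1, "blue": 2}
toy_documents = [["red", "red", "blue"], ["green"]]

# One row per document, one column per vocabulary term
counts = np.zeros((len(toy_documents), len(toy_vocabulary)), dtype=int)
for i, tokens in enumerate(toy_documents):
    for term, n in collections.Counter(tokens).items():
        counts[i, toy_vocabulary[term]] = n

print(counts.shape)   # (2, 3)
print(counts)
# [[2 0 1]
#  [0 1 0]]
```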

In [7]:
# You will need to initialize a NumPy array to store the document term matrix. 
# Which NumPy function will you use for this? What are the dimensions of this matrix?
# Fill this NumPy array with the appropriate values and return it as the value for this function
# You may use the collections package to count the number of times an element occurs in a list.
# Note the use of collections below. You can modify it for use in Part II.
import collections
dictfreq = collections.Counter("the key is in the car".split())
for key, count in dictfreq.items():
    print(f"{key}: {count}")

the: 2
key: 1
is: 1
in: 1
car: 1


In [8]:
# PART II: document term matrix

def doc_term_matrix(vocabulary, collection):
    # fill in your code here
    
    return freq

## Part III - 20 points

Write a function called <i><b>tf_matrix</b></i> that takes a document term matrix and returns a normalized frequency matrix that represents the TF scores for each term in each document. 
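
The number of terms in a document is a sum along one axis of the count matrix. The sketch below illustrates that idea on a made-up 2x3 array; the variable names are hypothetical, and `keepdims=True` is one way (not the only one) to keep the totals shaped so that broadcasting divides each row by its own total.

```python
import numpy as np

toy_counts = np.array([[2, 0, 1],
                       [0, 1, 0]])

# Sum across the columns of each row; keepdims gives shape (2, 1) instead of (2,)
row_totals = toy_counts.sum(axis=1, keepdims=True)

# Broadcasting divides every entry in row i by row_totals[i]
toy_tf = toy_counts / row_totals
print(toy_tf)
```

Each row of `toy_tf` sums to 1, which is a handy sanity check for a normalized frequency matrix.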


In [9]:
# PART III: normalized document term matrix
# Again, initialize a NumPy array with the appropriate dimensions and 
# then fill it with the appropriate values.
# You must use a NumPy aggregation function for computing the TF scores
# Think about making your code succinct.

def tf_matrix(document_term_matrix):
    # fill in your code here
    
    return norm_freq

## Part IV - 20 points

Write a function called <i><b>tf_idf_matrix</b></i> that takes a document term matrix and a normalized frequency matrix and returns a matrix that represents the TF-IDF scores for each term in each document. You can use np.log10 to compute the logarithm here. For the sample document collection, the function could return something like 

[[0.0 0.04300429 0.0 0.0 0.0 0.04300429 0.0 0.0]

 [0.0 0.0 0.0 0.0 0.0 0.0 0.04300429 0.04300429]]
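
The number of documents containing a term is a count of nonzero entries down each column. The sketch below shows this on a made-up 2x3 count matrix; the variable names are hypothetical, and `np.count_nonzero` is one of several aggregates that would work (a boolean comparison followed by a sum is another). It assumes every vocabulary term occurs in at least one document, so the division never hits zero.

```python
import numpy as np

toy_counts = np.array([[2, 0, 1],
                       [3, 1, 0]])
n_docs = toy_counts.shape[0]

# For each column (term), count the rows (documents) where it occurs
docs_with_term = np.count_nonzero(toy_counts, axis=0)

toy_idf = np.log10(n_docs / docs_with_term)

# TF as in Part III, then scale each column by its IDF via broadcasting
toy_tf = toy_counts / toy_counts.sum(axis=1, keepdims=True)
toy_tf_idf = toy_tf * toy_idf
print(docs_with_term)   # [2 1 1]
print(toy_tf_idf)
```

The first column appears in both documents, so its IDF and therefore its whole TF-IDF column is 0.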


In [10]:
# PART IV: TF_IDF matrix

def tf_idf_matrix(document_term_matrix, tf_matrix):
    # fill in your code here
    
    return tf_idf

## Part V - 10 points

Write code to test your functions for the trivial test case: hw1-input.txt. 
You must print out 

(1) the vocabulary

(2) the document term matrix

(3) the normalized frequency matrix

(4) the tf-idf matrix


Label each output with what it is, e.g. VOCABULARY, DOCUMENT TERM MATRIX, and so on.

## Part VI - 20 points

Now test your code with a more complex input file or set of files. You may use <b><i>beatles_biography.txt</i></b> taken from https://www.notablebiographies.com/Ba-Be/Beatles.html. In this case, the documents could be defined by the newlines in the file. You are free to collapse the paragraphs to create fewer documents. You may also use other documents about the Beatles to do a broader analysis or you can choose a set of documents about a different topic. This part of the assignment is open-ended. 

Using the aggregate functions in NumPy, print out the maximum tf-idf scores for each document and their corresponding words. Note that it may be interesting to pick a few top words (words that have tf-idf scores close to the maximum). 
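
As a sketch of how the per-document maxima and their words might be recovered, the example below uses `max`, `argmax`, and `argsort` along axis 1 on a made-up score matrix. The vocabulary, scores, and variable names are all hypothetical, and picking the top 2 terms is an arbitrary choice for illustration.

```python
import numpy as np

toy_vocabulary = {"red": 0, "green": 1, "blue": 2}
toy_scores = np.array([[0.0, 0.02, 0.10],
                       [0.07, 0.0, 0.01]])

# Invert the vocabulary so column indices map back to words
index_to_word = {index: word for word, index in toy_vocabulary.items()}

best_scores = toy_scores.max(axis=1)       # highest score in each document
best_columns = toy_scores.argmax(axis=1)   # column index of that score
best_words = [index_to_word[int(j)] for j in best_columns]
print(best_words)   # ['blue', 'red']

# Top-2 columns per document: sort ascending, reverse, keep the first two
top2_columns = np.argsort(toy_scores, axis=1)[:, ::-1][:, :2]
top2_words = [[index_to_word[int(j)] for j in row] for row in top2_columns]
print(top2_words)   # [['blue', 'green'], ['red', 'blue']]
```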

You can choose to (1) have a list of stop words to filter the documents and (2) have a list of words (say, 1-3 words i.e. n-grams) for each term instead of a single word. You may have to modify your function definitions in this case. Show your modified functions in this section. This part is optional and for extra credit only.
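
For the optional extension, the sketch below shows one way to filter stop words and then generate n-gram terms from a token list. The stop-word set is a made-up three-word example, the helper name `ngrams` is hypothetical, and filtering before forming n-grams is a design choice (it joins words that were not adjacent in the original text), not a requirement.

```python
def ngrams(tokens, max_n=3):
    # Return all contiguous word sequences of length 1 through max_n
    terms = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[start:start + n]))
    return terms

STOP_WORDS = {"the", "is", "on"}   # hypothetical, deliberately tiny list

tokens = "the car is driven on the road".split()
filtered = [t for t in tokens if t not in STOP_WORDS]
terms = ngrams(filtered, max_n=2)
print(terms)
# ['car', 'driven', 'road', 'car driven', 'driven road']
```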

Explain your methods clearly and write your observations from your experiments.

### IMPORTANT: Your submission must be a Jupyter Notebook (.ipynb) file that is organized exactly like this notebook file. You can insert cells (or use the cells provided) after the instructions for each section in this file and then insert your code into the cells. To write your observations for Part VI, change the type of the cell to Markdown. 

This is a Markdown cell unlike the code cell above.