# Assignment 1: Working with Terms and Documents

This first homework assignment starts off with term statistics computations and graphing. In the final section (for CS6200 students), you collect new documents to experiment with.

Read through this Jupyter notebook and fill in the parts marked with `TODO`.

## Download and Unzip the Sample Data


Your first task is to download a dataset containing counts of terms in documents from "https://github.com/dasmiq/cs6200-hw1/blob/main/ap201001.json.gz?raw=true". The dataset covers the first one million tokens in the collection.

In [None]:
# TODO: Download a zipped JSON file from Github. (1 points)
# Hints: use "!wget"
#need to do this

!wsl wget "https://github.com/dasmiq/cs6200-hw1/blob/main/ap201001.json.gz?raw=true" -O ap201001.json.gz  

In [None]:
# TODO: Unzip the file to access the JSON data. Make sure you have a file named ap201001.json after unzipping. (1 points)
# Hints: use "!gunzip"

!wsl gunzip ap201001.json.gz

In [None]:
import json
# TODO: Convert ap201001.json file with one JSON record on each line to a list of dictionaries.(1 points)

# Use a context manager so the file is closed once all lines are parsed.
with open('ap201001.json') as rawfile:
    terms = [json.loads(line) for line in rawfile]

Now that you've successfully downloaded and unzipped the data, let's dig deeper. Your task is to explore some basic statistics about the terms in our dataset. This will give you a better understanding of the data you're working with.

Find the first 10 terms from the document. In this dataset, field only takes the values `body` or `title`.

In [None]:
# TODO: Find first 10 terms. (2 points)
# Hints: use "terms"

terms[:10]

Your answer should look like:
"[{'id': 'APW_ENG_20100101.0001', 'field': 'body', 'term': 'about', 'count': 1},
 {'id': 'APW_ENG_20100101.0001', 'field': 'body', 'term': 'abuse', 'count': 1},
 ...}]"

Each record has four fields:
* `id`, with the identifier for the document;
* `field`, with the region of the document containing a given term;
* `term`, with the lower-cased term; and
* `count`, with the number of times each term occurred in that field and document.
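As a minimal sketch of this record layout, the snippet below parses two made-up lines in the same one-record-per-line JSON format and reads the four fields. The document id `DOC_A` and the counts are hypothetical values chosen for illustration, not taken from the dataset.

```python
import json

# Two hypothetical lines in the same one-record-per-line layout as ap201001.json.
sample_lines = [
    '{"id": "DOC_A", "field": "title", "term": "storm", "count": 1}',
    '{"id": "DOC_A", "field": "body", "term": "storm", "count": 3}',
]

records = [json.loads(line) for line in sample_lines]

# The same term can appear once per field, so a per-document total
# has to sum over the title and body records.
totals = {}
for r in records:
    key = (r["id"], r["term"])
    totals[key] = totals.get(key, 0) + r["count"]

title_terms = [r["term"] for r in records if r["field"] == "title"]

print(totals)
print(title_terms)
```

Note that a (document, term) pair is not unique on its own; the `field` value is needed to identify a record.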

## Computing Term Statistics


If we look at the most frequent terms for a given document, we mostly see common function words, such as `the`, `and`, and `of`. Start exploring the dataset by computing some of these basic term statistics. You can make your life easier using data frame libraries such as `pandas`, core python libraries such as `collections`, or just simple list comprehensions.

Feel free to define helper functions in your code before computing the statistics we're looking for.
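One possible helper, sketched here on a tiny hypothetical list of records rather than the real `terms` list, uses `collections.Counter` to sum counts per term. The names `toy_records` and `corpus_counts` are illustrative only.

```python
from collections import Counter

# Hypothetical records in the same shape as the dataset.
toy_records = [
    {"id": "d1", "field": "body", "term": "the", "count": 4},
    {"id": "d1", "field": "body", "term": "storm", "count": 2},
    {"id": "d2", "field": "title", "term": "the", "count": 1},
    {"id": "d2", "field": "body", "term": "the", "count": 3},
]

# Sum the count field per term across all documents and fields.
corpus_counts = Counter()
for r in toy_records:
    corpus_counts[r["term"]] += r["count"]

# most_common(k) returns the k (term, total) pairs with the largest totals.
top_terms = corpus_counts.most_common(2)
print(top_terms)
```

`Counter` treats missing keys as zero, which removes the need for a separate membership check before incrementing.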

In [None]:
# TODO: Print the 10 terms from document APW_ENG_20100101.0001 with the highest count. (5 points)

import pandas as pd

filtered_dicts = [d for d in terms if d['id'] == 'APW_ENG_20100101.0001']
df = pd.DataFrame(filtered_dicts)
top_10_terms = df.sort_values(by='count', ascending=False).head(10)
top_10_terms

In [None]:
# TODO: Print the 10 terms with the highest total count in the corpus. (5 points)

frequency = []
terms_set = set()
distinct_terms_to_frequency_dict = {}

for d in terms:

    if d['term'] in terms_set:

        distinct_terms_to_frequency_dict[d['term']] += d['count']

    else:

        distinct_terms_to_frequency_dict[d['term']] = d['count']
        terms_set.add(d['term'])

frequency = [(term, count) for term, count in distinct_terms_to_frequency_dict.items()]
sorted_frequency = sorted(frequency, key = lambda x : x[1], reverse=True)
print(sorted_frequency[:10])

Raw counts may not be the most informative statistic. One common improvement is to use *inverse document frequency*, the inverse of the proportion of documents that contain a given term.
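To make the definition concrete, here is a small sketch on hypothetical data: four toy documents, with the inverse document frequency computed as the number of documents divided by the number of documents containing the term. The names `toy_records`, `docs_with_term`, and `inverse_df` are assumptions made for this example.

```python
# Hypothetical (document, term) records; counts are omitted because
# document frequency only depends on whether a term occurs at all.
toy_records = [
    {"id": "d1", "term": "the"},
    {"id": "d1", "term": "storm"},
    {"id": "d2", "term": "the"},
    {"id": "d3", "term": "the"},
    {"id": "d3", "term": "rain"},
    {"id": "d4", "term": "storm"},
]

n_docs = len({r["id"] for r in toy_records})

# Collect the set of distinct documents each term appears in.
docs_with_term = {}
for r in toy_records:
    docs_with_term.setdefault(r["term"], set()).add(r["id"])

# Inverse of the proportion of documents that contain the term.
inverse_df = {t: n_docs / len(ids) for t, ids in docs_with_term.items()}
print(inverse_df)
```

The rarest term (`rain`, in one of four documents) gets the largest value, while a term found in most documents stays close to 1.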

In [None]:
# TODO: Compute the number of distinct documents in the collection. (5 points)
N = 0

def number_of_distinct_documents():

    distinct_doc_ids = set()

    for record in terms:
        distinct_doc_ids.add(record['id'])
    
    return len(distinct_doc_ids)

N = number_of_distinct_documents()

print(N)

In [None]:
# TODO: Compute the number of distinct documents each term appears in 
# and store in a dictionary. (5 points)

def compute_term_document_counts():

    term_to_distinct_documents_dict = {}

    for d in terms:

        term = d['term']
        doc_id = d['id']

        if term in term_to_distinct_documents_dict:

            if doc_id not in term_to_distinct_documents_dict[term]:

                term_to_distinct_documents_dict[term].add(doc_id)
            
        else:

            term_to_distinct_documents_dict[term] = {doc_id}
        
    return term_to_distinct_documents_dict
        

df = compute_term_document_counts()

for key, value in df.items():
    df[key] = len(value)
      
print(df)

In [None]:
# TODO: Print the relative document frequency of 'the', (5 points)
# i.e., the number of documents that contain 'the' divided by N.

print(df['the'] / N)

Empirically, we usually see better retrieval results if we rescale term frequency (within documents) and inverse document frequency (across documents) with the log function. Let the `tfidf` of term _t_ in document _d_ be:
```
tfidf(t, d) = log(count(t, d) + 1) * log(N / df(t))
```

Later in the course, we will show a probabilistic derivation of this quantity based on smoothing language models.
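The formula can be checked on a few hand-picked numbers. The sketch below is illustrative only: the function names `tfidf` and `tfidf_base10` and all input values are assumptions, not part of the assignment. It also shows that the choice of log base only rescales every score by the same constant, so rankings are unaffected.

```python
import math

def tfidf(count, n_docs, doc_freq):
    # Natural-log version of: log(count(t, d) + 1) * log(N / df(t))
    return math.log(count + 1) * math.log(n_docs / doc_freq)

def tfidf_base10(count, n_docs, doc_freq):
    # Same formula with base-10 logs.
    return math.log10(count + 1) * math.log10(n_docs / doc_freq)

# A term that occurs 3 times in a document and appears in 10 of 1000 documents.
rare = tfidf(3, 1000, 10)

# A term that appears in every document scores zero, however often it occurs,
# because log(N / df) = log(1) = 0.
everywhere = tfidf(50, 1000, 1000)

# Each of the two log factors differs by ln(10), so the ratio is ln(10) squared.
ratio = rare / tfidf_base10(3, 1000, 10)

print(rare, everywhere, ratio)
```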

In [None]:
# TODO: Compute the tf-idf value for each term in each document. (10 points)
# Take the raw term data and add a tfidf field to each record.

import math

tfidf_terms = None

for d in terms:
    # Base-10 logs; a different base rescales every score by the same constant.
    tf = math.log(d['count'] + 1, 10)
    idf = math.log(N / df[d['term']], 10)
    d['tfidf'] = tf * idf

tfidf_terms = terms

print(tfidf_terms[:10])

In [None]:
# TODO: Print the 20 term-document pairs with the highest tf-idf values. (5 points)

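# Use a separate name so the document-frequency dictionary `df` is not overwritten
# and the tf-idf cell above can be rerun.
tfidf_df = pd.DataFrame(tfidf_terms)

top_20_term_document_pairs_list = tfidf_df.sort_values(by='tfidf', ascending=False).head(20)

top_20_term_document_pairs_list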

## Plotting Term Distributions

Besides frequencies and tf-idf values within documents, it is often helpful to look at the distributions of word frequencies in the whole collection. In class, we talk about the Zipf distribution of word rank versus frequency and Heaps' Law relating the number of distinct words to the number of tokens.

We might examine these distributions to see, for instance, if an unexpectedly large number of very rare terms occurs, which might indicate noise added to our data.
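One way to quantify such a check, sketched here on synthetic data rather than the AP collection, is to fit a straight line to log frequency against log rank. Under an idealized Zipf law where frequency is proportional to 1/rank, the fitted slope is close to -1; the constant 50000 and the vocabulary size of 1000 below are arbitrary assumptions.

```python
import numpy as np

# Synthetic frequencies that follow an exact Zipf law: f(r) = C / r.
ranks = np.arange(1, 1001)
freqs = 50000 / ranks

# Least-squares line through the log-log points; for degree 1,
# np.polyfit returns the slope first and the intercept second.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)

print(slope, np.exp(intercept))
```

On real text the points rarely fall on a perfect line, so a fit like this is a rough summary rather than a test of the law.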

In [None]:
# TODO: Compute a list of the distinct words in this collection and sort it in descending order of frequency.
# Thus frequency[0] should contain the word "the" and the count 62216. (5 points)

frequency = []

terms_set = set()

distinct_terms_to_frequency_dict = {}

for d in terms:

    if d['term'] in terms_set:

        distinct_terms_to_frequency_dict[d['term']] += d['count']

    else:

        distinct_terms_to_frequency_dict[d['term']] = d['count']
        terms_set.add(d['term'])


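frequency = [(term, count) for term, count in distinct_terms_to_frequency_dict.items()]

# Sort in place so that frequency[0] is the most frequent term, as the TODO requires.
frequency.sort(key=lambda x: x[1], reverse=True)
sorted_frequency = frequency

# Print only the head of the list; the full vocabulary is very long.
print(sorted_frequency[:10])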

In [None]:
# TODO: Plot a graph of the log of the rank (starting at 1) on the x-axis,
# against the log of the frequency on the y-axis. You may use the matplotlib
# or other library. (5 points)

import matplotlib.pyplot as plt
import numpy as np

x_val = list(range(1, len(sorted_frequency) + 1))

y_val = [element[1] for element in sorted_frequency]

log_rank = np.log(x_val)
log_frequency = np.log(y_val)

plt.plot(log_rank, log_frequency)
plt.xlabel('log(rank)')
plt.ylabel('log(frequency)')
plt.show()

In [None]:
# TODO: Compute the number of tokens in the corpus. (5 points)
# Remember to count each occurrence of each word.

ntokens = 0

for element in sorted_frequency:

    ntokens += element[1]

print(ntokens)


In [None]:
# TODO: Compute the proportion of tokens made up by the top 10 most
# frequent words. (5 points)

top_10_most_frequent_words_total = 0

for i in range(10):

    top_10_most_frequent_words_total += sorted_frequency[i][1]

print(top_10_most_frequent_words_total / ntokens)



In [None]:
# TODO: Compute the proportion of tokens made up by the words that occur
# exactly once in this collection. (5 points)

words_that_occur_exactly_once = 0

for element in sorted_frequency:

    if element[1] == 1:

        words_that_occur_exactly_once += 1

print(words_that_occur_exactly_once / ntokens)


For this assignment so far, you've worked with data that's already been extracted, tokenized, and counted. In this final section, you'll briefly explore acquiring new data.

Find a collection of documents that you're interested in. For the statistics to be meaningful, this collection should have at least 1,000 words.

The format could be anything you can extract text from: HTML, PDF, MS PowerPoint, chat logs, etc.

The collection should be in a natural language, not mostly code or numerical data. It could be in English or in any other language.

The final project for this course will involve designing an information retrieval task on some dataset. You could use this exercise to think about what kind of data you might be interested in, although that is not required.

In [None]:
# TODO: Write code to download and extract the text from the collection.  (5 points)
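# Note: wget on the dataset page URL only fetches the HTML landing page, not the data.
# The Kaggle CLI downloads the files themselves; it requires an API token
# in ~/.kaggle/kaggle.json.
!kaggle datasets download -d jensenbaxter/10dataset-text-document-classification --unzip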

**TODO**: Describe choices you make about what contents to keep. (2 points)

I want to keep every word and remove all punctuation, such as commas and periods.



In [None]:
# TODO: Data acquisition code here. (5 points)

import os
import glob

directory_path = r"C:\Users\vijay\Desktop\6200 Projects\Assignment 1\sport"

file_paths = glob.glob(os.path.join(directory_path, '*'))

list_of_dicts = []
# Per-document term counts, keyed by (document id, token).
term_to_frequency_in_document_dict = {}

for file_path in file_paths:
    doc_id = os.path.basename(file_path)
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            for token in line.strip().split():
                # One record per token occurrence; the count is filled in below.
                list_of_dicts.append({'id': doc_id, 'term': token})
                key = (doc_id, token)
                term_to_frequency_in_document_dict[key] = term_to_frequency_in_document_dict.get(key, 0) + 1

# Attach the count of the term within its own document, not across the whole collection.
for element in list_of_dicts:
    element['count'] = term_to_frequency_in_document_dict[(element['id'], element['term'])]

print(list_of_dicts[:100])

**TODO**: Write code to tokenize the text and count the resulting terms in each document. Describe your tokenization approach here.

Each term may also be associated with a field, such as `body` and `title` in the newswire collection above. Describe the different fields in your data.
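As one possible approach, not the only reasonable one, the sketch below lowercases the text, keeps runs of letters, digits, and apostrophes as tokens, and emits records in the same shape as the newswire data. The function names `tokenize` and `count_terms`, the regular expression, and the file name `sport_demo.txt` are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes;
    # everything else (spaces, commas, periods) acts as a separator.
    return re.findall(r"[a-z0-9']+", text.lower())

def count_terms(doc_id, field, text):
    # One record per distinct term in this document and field.
    counts = Counter(tokenize(text))
    return [
        {"id": doc_id, "field": field, "term": term, "count": count}
        for term, count in sorted(counts.items())
    ]

records = count_terms("sport_demo.txt", "body", "The match ended. The crowd cheered!")
for record in records:
    print(record)
```

A collection of plain text files with no separate headline would have a single field, such as `body`; richer formats like HTML could map the page title to a `title` field.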

In [None]:
# TODO: Tokenization code here. (5 points)

import string

translator = str.maketrans('', '', string.punctuation)

cleaned_tokens = []

for d in list_of_dicts:

    cleaned_token = d['term'].translate(translator)

    if cleaned_token:

        cleaned_tokens.append(cleaned_token)

print(cleaned_tokens)

**TODO**: Plot a graph of the log rank against log frequency for your collection, as you did for the sample collection above. What do you observe about the differences between the distributions in these two collections? (2 points)


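In [None]:
from collections import Counter

# Count term frequencies in the new collection, not the AP sample corpus.
my_frequency = sorted(Counter(cleaned_tokens).values(), reverse=True)

log_rank = np.log(np.arange(1, len(my_frequency) + 1))
log_frequency = np.log(my_frequency)

plt.plot(log_rank, log_frequency)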



## Inverted Index

Create an inverted index of the data in the corpus.
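For orientation, here is a minimal sketch of the data structure on two hypothetical documents: each term maps to a dictionary from document id to the number of occurrences in that document. The names `toy_docs`, `index`, and `docs_containing_all` are illustrative and not part of the assignment.

```python
from collections import defaultdict

# Hypothetical tokenized documents.
toy_docs = {
    "d1": ["goal", "match", "goal"],
    "d2": ["match", "coach"],
}

# term -> {document id -> occurrences of the term in that document}
index = defaultdict(dict)
for doc_id, tokens in toy_docs.items():
    for token in tokens:
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def docs_containing_all(query_terms):
    # Conjunctive (AND) lookup: intersect the document sets of every query term.
    postings = [set(index.get(t, {})) for t in query_terms]
    return set.intersection(*postings) if postings else set()

print(dict(index))
print(docs_containing_all(["goal", "match"]))
```

The lookup touches only the postings of the query terms, which is what makes an inverted index faster than scanning every document.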

In [None]:
# TODO: Create an inverted index of the corpus extracted in the 1st part. (10 points)

def create_inverted_index():
  # Map each term to a {document id: occurrence count} dictionary.
  inverted_index = {}

  for element in list_of_dicts:
    # setdefault creates the postings dictionary on first sight of a term,
    # so the first occurrence is counted as well.
    postings = inverted_index.setdefault(element['term'], {})
    postings[element['id']] = postings.get(element['id'], 0) + 1

  return inverted_index


inverted_index = create_inverted_index()

print(inverted_index)


In [None]:
# TODO: Write unit tests to validate your inverted index. (5 points)

def validate_index(t, ev):
    """Check that the postings for term t equal the expected value ev."""
    # .get avoids a KeyError when the term is missing from the index.
    result = inverted_index.get(t)

    if result == ev:
        print("Good Job, test passed!")
    else:
        print("Sorry, test failed")

In [None]:
validate_index('Claxton', {'sport_1.txt': 5, 'sport_58.txt': 1, 'sport_60.txt': 1, 'sport_68.txt': 1})
validate_index('first', {'sport_1.txt': 2, 'sport_14.txt': 1, 'sport_2.txt': 1, 'sport_22.txt': 1, 'sport_25.txt': 1, 'sport_26.txt': 1, 'sport_34.txt': 1, 'sport_35.txt': 1, 'sport_40.txt': 1, 'sport_42.txt': 1, 'sport_44.txt': 2, 'sport_52.txt': 1, 'sport_53.txt': 1, 'sport_57.txt': 2, 'sport_58.txt': 3, 'sport_60.txt': 1, 'sport_61.txt': 4, 'sport_66.txt': 5, 'sport_68.txt': 2, 'sport_69.txt': 1, 'sport_71.txt': 1, 'sport_75.txt': 3, 'sport_76.txt': 2, 'sport_83.txt': 1, 'sport_84.txt': 1, 'sport_85.txt': 1, 'sport_86.txt': 1, 'sport_89.txt': 1, 'sport_9.txt': 1, 'sport_93.txt': 1, 'sport_95.txt': 1, 'sport_96.txt': 1})