# Working with Terms and Documents

This first homework assignment starts with computing and graphing term statistics. In the final section (for CS6200 students), you will collect new documents to experiment with.

Read through this Jupyter notebook and fill in the parts marked with `TODO`.

## Sample Data

Start by looking at some sample data. We download the counts of terms in documents for the first one million tokens of a newswire collection.

In [1]:
!wget -O ap201001.json.gz https://github.com/dasmiq/cs6200-hw1/blob/main/ap201001.json.gz?raw=true
!gunzip ap201001.json.gz

--2022-09-30 21:56:10--  https://github.com/dasmiq/cs6200-hw1/blob/main/ap201001.json.gz?raw=true
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/dasmiq/cs6200-hw1/raw/main/ap201001.json.gz [following]
--2022-09-30 21:56:10--  https://github.com/dasmiq/cs6200-hw1/raw/main/ap201001.json.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dasmiq/cs6200-hw1/main/ap201001.json.gz [following]
--2022-09-30 21:56:10--  https://raw.githubusercontent.com/dasmiq/cs6200-hw1/main/ap201001.json.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 20

We convert this file with one JSON record on each line to a list of dictionaries.

In [2]:
import json
# Use a context manager so the file is closed once all lines are parsed.
with open('ap201001.json') as rawfile:
    terms = [json.loads(line) for line in rawfile]

Here are the first ten records, showing the count of each term for each document and field. In this dataset, `field` takes only the values `body` or `title`.

In [3]:
terms[:10]

[{'id': 'APW_ENG_20100101.0001', 'field': 'body', 'term': 'a', 'count': 16},
 {'id': 'APW_ENG_20100101.0001', 'field': 'body', 'term': 'about', 'count': 1},
 {'id': 'APW_ENG_20100101.0001', 'field': 'body', 'term': 'abuse', 'count': 1},
 {'id': 'APW_ENG_20100101.0001',
  'field': 'body',
  'term': 'academy',
  'count': 1},
 {'id': 'APW_ENG_20100101.0001',
  'field': 'body',
  'term': 'accused',
  'count': 2},
 {'id': 'APW_ENG_20100101.0001',
  'field': 'body',
  'term': 'actress',
  'count': 1},
 {'id': 'APW_ENG_20100101.0001', 'field': 'body', 'term': 'ad', 'count': 1},
 {'id': 'APW_ENG_20100101.0001', 'field': 'body', 'term': 'after', 'count': 1},
 {'id': 'APW_ENG_20100101.0001',
  'field': 'body',
  'term': 'agenda',
  'count': 1},
 {'id': 'APW_ENG_20100101.0001',
  'field': 'body',
  'term': 'agreed',
  'count': 1}]

Each record has four fields:
* `id`, with the identifier for the document;
* `field`, with the region of the document containing a given term;
* `term`, with the lower-cased term; and
* `count`, with the number of times the term occurred in that field and document.
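
As a minimal sketch of how records in this shape can be combined, the following sums counts over both fields for each document. The sample records and the helper name `counts_by_document` are hypothetical and only mirror the four-field layout described above.

```python
from collections import Counter, defaultdict

# Hypothetical sample records in the same four-field shape as the dataset.
sample_records = [
    {'id': 'doc1', 'field': 'title', 'term': 'storm', 'count': 1},
    {'id': 'doc1', 'field': 'body', 'term': 'storm', 'count': 3},
    {'id': 'doc1', 'field': 'body', 'term': 'the', 'count': 5},
    {'id': 'doc2', 'field': 'body', 'term': 'the', 'count': 2},
]

def counts_by_document(records):
    """Sum term counts over all fields, keyed by document id."""
    per_doc = defaultdict(Counter)
    for rec in records:
        per_doc[rec['id']][rec['term']] += rec['count']
    return per_doc

per_doc = counts_by_document(sample_records)
print(per_doc['doc1'].most_common(2))
```

Whether to merge `title` and `body` counts like this, or to keep them separate, depends on the statistic you want; the exercises below state which one they expect.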

## Computing Term Statistics


If we look at the most frequent terms for a given document, we mostly see common function words, such as `the`, `and`, and `of`. Start exploring the dataset by computing some of these basic term statistics. You can make your life easier by using data frame libraries such as `pandas`, core Python libraries such as `collections`, or just simple list comprehensions.

Feel free to define helper functions in your code before computing the statistics we're looking for.

In [4]:
# TODO: Print the 6 terms from document APW_ENG_20100101.0001 with the highest count.
import pandas as pd

data = pd.DataFrame(terms)
#data.head()
data_APW_ENG_20100101_0001 = data[data['id'] == 'APW_ENG_20100101.0001']
data_APW_ENG_20100101_0001.nlargest(6, 'count')

| | id | field | term | count |
|---|---|---|---|---|
| 0 | APW_ENG_20100101.0001 | body | a | 16 |
| 192 | APW_ENG_20100101.0001 | body | the | 11 |
| 15 | APW_ENG_20100101.0001 | body | and | 10 |
| 34 | APW_ENG_20100101.0001 | body | brooks | 10 |
| 133 | APW_ENG_20100101.0001 | body | of | 10 |
| 198 | APW_ENG_20100101.0001 | body | to | 10 |


In [5]:
# TODO: Print the 10 terms from all fields of document APW_ENG_20100102.0077 with the highest count.

data_APW_ENG_20100102_0077 = data[data['id'] == 'APW_ENG_20100102.0077']
data_APW_ENG_20100102_0077.nlargest(10, 'count')


In [6]:
# TODO: Print the 10 terms with the highest total count in the corpus.

# Sum only the count column; the id and field columns are not numeric.
data_count_aggregate = data.groupby('term')['count'].sum().reset_index()
data_count_aggregate.nlargest(10, 'count')

| | term | count |
|---|---|---|
| 24646 | the | 62216 |
| 24900 | to | 26931 |
| 11865 | in | 25659 |
| 0 | a | 23383 |
| 16991 | of | 22326 |
| 944 | and | 22125 |
| 21244 | said | 10888 |
| 9269 | for | 9716 |
| 17109 | on | 9382 |
| 24639 | that | 8942 |


Raw counts may not be the most informative statistic. One common improvement is to use *inverse document frequency*, the inverse of the proportion of documents that contain a given term.
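
To make the definition concrete, here is a small sketch on made-up records. The names `toy_records` and `document_frequencies` are illustrative assumptions, not part of the assignment; the point is that a term counts once per document, no matter how often or in which field it occurs there.

```python
from collections import defaultdict

# Hypothetical records: 'the' appears in three of four documents.
toy_records = [
    {'id': 'd1', 'field': 'body', 'term': 'the', 'count': 4},
    {'id': 'd1', 'field': 'title', 'term': 'the', 'count': 1},
    {'id': 'd1', 'field': 'body', 'term': 'methane', 'count': 2},
    {'id': 'd2', 'field': 'body', 'term': 'the', 'count': 3},
    {'id': 'd3', 'field': 'body', 'term': 'the', 'count': 1},
    {'id': 'd4', 'field': 'body', 'term': 'guild', 'count': 1},
]

def document_frequencies(records):
    """Map each term to the number of distinct documents containing it."""
    docs_with_term = defaultdict(set)
    for rec in records:
        docs_with_term[rec['term']].add(rec['id'])
    return {term: len(ids) for term, ids in docs_with_term.items()}

n_docs = len({rec['id'] for rec in toy_records})
df = document_frequencies(toy_records)

# Inverse document frequency: N divided by the document frequency.
idf = {term: n_docs / df[term] for term in df}
print(idf)
```

In this toy collection the common term `the` gets a low value and the rare terms `methane` and `guild` get the highest possible value, which is the behavior the statistic is meant to capture.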

In [7]:
# TODO: Compute the number of distinct documents in the collection.
N = len(data.id.unique())
print('Number of distinct documents: ', N)

# TODO: Compute the number of distinct documents each term appears in
# and store in a dictionary.
# nunique() counts the distinct document ids within each term group.
term_document_dict = data.groupby('term')['id'].nunique().to_dict()
print('\nTerm-DocumentCount Dict:\n', term_document_dict)

Number of distinct documents:  2778

Term-DocumentCount Dict:


In [8]:
# TODO: Print the relative document frequency of 'the',
# i.e., the number of documents that contain 'the' divided by N.

relative_frequency_the = term_document_dict['the']/N
print('Relative frequency of term "the" is- ', relative_frequency_the)

Relative frequency of term "the" is-  0.9704823614110871


Empirically, we usually see better retrieval results if we rescale term frequency (within documents) and inverse document frequency (across documents) with the log function. Let the `tfidf` of term _t_ in document _d_ be:
```
tfidf(t, d) = log(count(t, d) + 1) * log(N / df(t))
```

Later in the course, we will show a probabilistic derivation of this quantity based on smoothing language models.
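
As a quick check of the formula, the sketch below evaluates it for a few made-up inputs. The function name `tfidf` and its arguments are illustrative; the logarithm is the natural log, matching `math.log`.

```python
import math

def tfidf(count, n_docs, doc_freq):
    """tf-idf as defined above: log(count + 1) * log(N / df)."""
    return math.log(count + 1) * math.log(n_docs / doc_freq)

# Hypothetical collection of 4 documents.
rare = tfidf(count=2, n_docs=4, doc_freq=1)        # rare term, used twice
common = tfidf(count=2, n_docs=4, doc_freq=3)      # common term, same count
everywhere = tfidf(count=2, n_docs=4, doc_freq=4)  # term found in every document

print(rare, common, everywhere)
```

A term that occurs in every document scores exactly zero regardless of its count, because `log(N / df)` is `log(1)`.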

In [9]:
# TODO: Compute the tf-idf value for each term in each document.
# Take the raw term data and add a tfidf field to each record.
import numpy as np

# Vectorized over the whole column: look up each term's document frequency,
# then apply the tf-idf formula to every row at once.
doc_freq = data['term'].map(term_document_dict)
data['tfidf'] = np.log(data['count'] + 1) * np.log(N / doc_freq)
data.head()

| | id | field | term | count | tfidf |
|---|---|---|---|---|---|
| 0 | APW_ENG_20100101.0001 | body | a | 16 | 0.219394 |
| 1 | APW_ENG_20100101.0001 | body | about | 1 | 0.656793 |
| 2 | APW_ENG_20100101.0001 | body | abuse | 1 | 3.237961 |
| 3 | APW_ENG_20100101.0001 | body | academy | 1 | 3.419818 |
| 4 | APW_ENG_20100101.0001 | body | accused | 2 | 2.885155 |


In [10]:
# TODO: Print the 20 term-document pairs with the highest tf-idf values.

data.nlargest(20, 'tfidf')

| | id | field | term | count | tfidf |
|---|---|---|---|---|---|
| 52727 | APW_ENG_20100103.0028 | body | guarani | 24 | 23.292878 |
| 199263 | APW_ENG_20100105.0061 | body | nomination | 95 | 22.519372 |
| 234566 | APW_ENG_20100105.0446 | body | methane | 15 | 21.985205 |
| 48925 | APW_ENG_20100103.0015 | body | kheire | 14 | 21.473448 |
| 192483 | APW_ENG_20100105.0014 | body | greyhound | 14 | 21.473448 |
| 433679 | APW_ENG_20100107.0036 | body | shakespeare | 18 | 21.30696 |
| 199179 | APW_ENG_20100105.0061 | body | guild | 28 | 20.667543 |
| 342740 | APW_ENG_20100106.0428 | body | shakespeare | 16 | 20.502093 |
| 21195 | APW_ENG_20100102.0197 | body | elkhart | 12 | 20.338731 |
| 305769 | APW_ENG_20100106.0075 | body | magna | 12 | 20.338731 |


## Plotting Term Distributions

Besides frequencies and tf-idf values within documents, it is often helpful to look at the distributions of word frequencies in the whole collection. In class, we talk about the Zipf distribution of word rank versus frequency and Heaps' Law relating the number of distinct words to the number of tokens.

We might examine these distributions to see, for instance, if an unexpectedly large number of very rare terms occurs, which might indicate noise added to our data.
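
The rank-versus-frequency relationship can be sketched without any plotting library. The example below uses made-up counts; `log_rank_frequency` and `fit_slope` are hypothetical helper names, and the least-squares slope is just one simple way to summarize the log-log curve. Under an idealized Zipf distribution, where frequency is proportional to 1/rank, the slope is -1.

```python
import math
from collections import Counter

def log_rank_frequency(counts):
    """Return (log rank, log frequency) pairs, with rank starting at 1."""
    ordered = sorted(counts.values(), reverse=True)
    return [(math.log(rank), math.log(freq))
            for rank, freq in enumerate(ordered, start=1)]

def fit_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var

# Hypothetical counts that follow frequency = 60 / rank exactly.
ideal = Counter({'the': 60, 'of': 30, 'and': 20, 'storm': 15})
ideal_slope = fit_slope(log_rank_frequency(ideal))

# The same counts plus a tail of terms seen once, as noisy data might add.
noisy = ideal + Counter({'xq': 1, 'zzt': 1, 'qpw': 1})
noisy_slope = fit_slope(log_rank_frequency(noisy))

print(ideal_slope, noisy_slope)
```

Adding the tail of singleton terms makes the fitted slope steeper, which is one way an excess of very rare terms shows up in the distribution.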

In [11]:
# TODO: Compute a list of the distinct words in this collection and sort it in descending order of frequency.
# Thus frequency[0] should contain the word "the" and the count 62216.
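# Sort the per-term totals computed above in descending order of count,
# then convert them to a list of [term, count] pairs.
data_count_aggregate = data_count_aggregate.sort_values(by='count', ascending=False)
frequency = data_count_aggregate.values.tolist()
frequency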

[['the', 62216],
 ['to', 26931],
 ['in', 25659],
 ['a', 23383],
 ['of', 22326],
 ['and', 22125],
 ['said', 10888],
 ['for', 9716],
 ['on', 9382],
 ['that', 8942],
 ['was', 7791],
 ['is', 6317],
 ['with', 6283],
 ['at', 6078],
 ['he', 5874],
 ['it', 5357],
 ['from', 5094],
 ['as', 4746],
 ['by', 4641],
 ['has', 4355],
 ['an', 4087],
 ['have', 4012],
 ['his', 3904],
 ['be', 3693],
 ['but', 3638],
 ['s', 3567],
 ['u', 3518],
 ['were', 3478],
 ['not', 3320],
 ['are', 3218],
 ['will', 3063],
 ['its', 2955],
 ['who', 2904],
 ['had', 2900],
 ['after', 2876],
 ['year', 2816],
 ['they', 2423],
 ['this', 2331],
 ['new', 2263],
 ['been', 2241],
 ['more', 2225],
 ['two', 2136],
 ['security', 2097],
 ['or', 2076],
 ['which', 2059],
 ['about', 2003],
 ['percent', 1966],
 ['up', 1917],
 ['their', 1898],
 ['al', 1897],
 ['would', 1851],
 ['also', 1826],
 ['last', 1807],
 ['first', 1761],
 ['than', 1748],
 ['i', 1722],
 ['one', 1717],
 ['other', 1680],
 ['people', 1678],
 ['out', 1677],
 ['government',

In [12]:
# TODO: Plot a graph of the log of the rank (starting at 1) on the x-axis,
# against the log of the frequency on the y-axis. You may use the matplotlib
# or other library.

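import numpy as np
import matplotlib.pyplot as plt

# frequency is sorted in descending order of count, so rank = position + 1.
counts = [count for _, count in frequency]
ranks = np.arange(1, len(counts) + 1)

plt.plot(np.log(ranks), np.log(counts))
plt.xlabel('log(rank)')
plt.ylabel('log(frequency)')
plt.show()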

In [None]:
# TODO: Compute the number of tokens in the corpus.
# Remember to count each occurrence of each word. For instance, the 62,216
# instances of "the" will all count here.
ntokens = 0

In [None]:
# TODO: Compute the proportion of tokens made up by the top 10 most
# frequent words.

In [None]:
# TODO: Compute the proportion of tokens made up by the words that occur
# exactly once in this collection.
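
The three quantities asked for above can be sketched on a toy vocabulary as follows. The counts and the helper names `token_share_top_k` and `token_share_singletons` are hypothetical; on the real data you would start from the per-term totals computed earlier.

```python
from collections import Counter

# Hypothetical per-term totals for a tiny corpus.
toy_counts = Counter({'the': 6, 'of': 3, 'storm': 2,
                      'guild': 1, 'methane': 1, 'elkhart': 1})

# Number of tokens: every occurrence of every term counts.
ntokens = sum(toy_counts.values())

def token_share_top_k(counts, k):
    """Proportion of all tokens made up by the k most frequent terms."""
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / total

def token_share_singletons(counts):
    """Proportion of all tokens made up by terms that occur exactly once."""
    total = sum(counts.values())
    return sum(1 for c in counts.values() if c == 1) / total

print(ntokens, token_share_top_k(toy_counts, 2), token_share_singletons(toy_counts))
```

Each singleton term contributes exactly one token, so the singleton share is the number of such terms divided by the token total.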

## Acquiring New Documents (for CS6200)

For this assignment so far, you've worked with data that's already been extracted, tokenized, and counted. In this final section, you'll explore acquiring new data.

One common way of acquiring data is through application programming interfaces (APIs) to various databases. The Library of Congress's [_Chronicling America_](https://chroniclingamerica.loc.gov/) site aggregates digitized US newspapers from the past two hundred years, such as the [_Seattle Star_](https://chroniclingamerica.loc.gov/lccn/sn87093407/1922-09-19/ed-1/seq-1/) from 100 years ago.

You can use [the API](https://chroniclingamerica.loc.gov/about/api/) to retrieve JSON data listing all issues of the _Seattle Star_: https://chroniclingamerica.loc.gov/lccn/sn87093407.json

Note the list in the `issues` field. For example, here is the record for the September 19, 1922, issue: https://chroniclingamerica.loc.gov/lccn/sn87093407/1922-09-19/ed-1.json

In that issue record, you'll see records for each page, e.g.: https://chroniclingamerica.loc.gov/lccn/sn87093407/1922-09-19/ed-1/seq-1.json

And inside that page record, you'll see links to data about that page in various data formats, such as JPEG, PDF, and plain text, which is what we want here: https://chroniclingamerica.loc.gov/lccn/sn87093407/1922-09-19/ed-1/seq-1/ocr.txt

This plain text was transcribed from the old page images using optical character recognition (OCR) models, and so contains errors.

Your task is to acquire and analyze the issues of the _Seattle Star_ from September 1922, i.e., the issues with a date field that starts with `1922-09`. This should be about the same amount of data as the million words from the Associated Press that you analyzed in the previous sections.

**TODO**: Write code that calls the _Chronicling America_ API to download and extract the text from the _Seattle Star_ from September 1922. You can use the `json` library from above and any other libraries you wish to fetch data from URLs. As you would when working with any production API, you may need to limit your rate of requests.
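
One possible shape for the acquisition helpers is sketched below, using only the standard library. It is an outline under stated assumptions rather than a reference solution: the key name `date_issued` and the layout of `sample_title_record` are assumptions you should check against the actual JSON, the one-second delay is an arbitrary choice, and the `ocr.txt` URL is derived from the page URL pattern shown in the examples above.

```python
import json
import time
import urllib.request

BASE = 'https://chroniclingamerica.loc.gov/lccn/sn87093407'

def fetch_json(url, delay_seconds=1.0):
    """Fetch and parse a JSON document, sleeping first to limit the request rate."""
    time.sleep(delay_seconds)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def issues_in_month(title_record, prefix, date_key='date_issued'):
    """Keep the issues whose date starts with prefix (date_key is an assumed name)."""
    return [issue for issue in title_record['issues']
            if issue[date_key].startswith(prefix)]

def ocr_text_url(page_json_url):
    """Map .../seq-1.json to .../seq-1/ocr.txt, following the example URLs."""
    if not page_json_url.endswith('.json'):
        raise ValueError('expected a page URL ending in .json')
    return page_json_url[:-len('.json')] + '/ocr.txt'

# Hypothetical stand-in for the title record, so no request is made here.
sample_title_record = {'issues': [
    {'url': BASE + '/1922-08-31/ed-1.json', 'date_issued': '1922-08-31'},
    {'url': BASE + '/1922-09-19/ed-1.json', 'date_issued': '1922-09-19'},
]}

september = issues_in_month(sample_title_record, '1922-09')
print([issue['url'] for issue in september])
print(ocr_text_url(BASE + '/1922-09-19/ed-1/seq-1.json'))
```

In real use, you would call `fetch_json(BASE + '.json')` to get the title record, then fetch each selected issue record and each of its page records in turn, keeping the delay between requests.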

In [None]:
# TODO: Data acquisition code here.

**TODO**: Write code to tokenize the text and count the resulting terms in each document. Since this data comes from automatically transcribing printed pages, some words may be hyphenated across line breaks. There is more than one right way to tokenize this data, so add comments to your code documenting your choices.
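
As one illustration of such choices, the sketch below joins words hyphenated across line breaks and then extracts lower-cased alphabetic tokens. The regular expressions, the function names, and the sample text are all assumptions made for this example; other tokenizations are equally defensible.

```python
import re
from collections import Counter

def dehyphenate(text):
    """Join a word split by a hyphen at a line break, e.g. 'news-' + 'paper'."""
    return re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)

def tokenize(text):
    """Lower-case the text and keep runs of letters, allowing internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", dehyphenate(text).lower())

# Hypothetical OCR-like sample with two words broken across lines.
sample_page = "The Seattle news-\npaper re-\nported rain.\nIt's 1922."

tokens = tokenize(sample_page)
term_counts = Counter(tokens)
print(tokens)
```

Two consequences of these choices are worth documenting in your own code: numbers such as `1922` are dropped entirely, and a genuinely hyphenated compound that happens to break across lines is merged into a single token.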

In [None]:
# TODO: Tokenization code here.

**TODO**: Plot a graph of the log rank against log frequency for your collection, as you did for the sample collection above.

In [None]:
# TODO: Plotting code here.

**TODO**: What do you observe about the differences between the distributions of the Associated Press and Seattle Star collections? In this text box, give some possible reasons for these differences.