<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Finding Significant Words Using TF/IDF

**Description:**
This [notebook](https://docs.constellate.org/key-terms/#jupyter-notebook) shows how to discover significant words. The method for finding significant terms is [tf-idf](https://docs.constellate.org/key-terms/#tf-idf).  The following processes are described:

* An educational overview of TF-IDF, including how it is calculated
* Using the `tdm_client` to retrieve a dataset
* Filtering based on a pre-processed ID list
* Filtering based on a [stop words list](https://docs.constellate.org/key-terms/#stop-words)
* Cleaning the tokens in the dataset
* Creating a [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary)
* Creating a [gensim](https://docs.constellate.org/key-terms/#gensim) [bag of words](https://docs.constellate.org/key-terms/#bag-of-words) [corpus](https://docs.constellate.org/key-terms/#corpus)
* Computing the most significant words in your [corpus](https://docs.constellate.org/key-terms/#corpus) using [gensim](https://docs.constellate.org/key-terms/#gensim) implementation of [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf)

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

[Take me to the **Research Version** of this notebook ->](./finding-significant-terms-for-research.ipynb)

**Difficulty:** Intermediate

**Completion time:** 60 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:**
* [Exploring Metadata](./metadata.ipynb)
* [Working with Dataset Files](./working-with-dataset-files.ipynb)
* [Pandas I](./pandas-1.ipynb)
* [Creating a Stopwords List](./creating-stopwords-list.ipynb)
* A familiarity with [gensim](https://docs.constellate.org/key-terms/#gensim) is helpful but not required.

**Data Format:** [JSON Lines (.jsonl)](https://docs.constellate.org/key-terms/#jsonl)

**Libraries Used:**
* `pandas` to load a preprocessing list
* `csv` to load a custom stopwords list
* [gensim](https://docs.constellate.org/key-terms/#gensim) to help compute the [tf-idf](https://docs.constellate.org/key-terms/#tf-idf) calculations
* [NLTK](https://docs.constellate.org/key-terms/#nltk) to create a stopwords list (if no list is supplied)

**Research Pipeline:**

1. Build a dataset
2. Create a "Pre-Processing CSV" with [Exploring Metadata](./exploring-metadata.ipynb) (Optional)
3. Create a "Custom Stopwords List" with [Creating a Stopwords List](./creating-stopwords-list.ipynb) (Optional)
4. Complete the TF-IDF analysis with this notebook
____

## What is "Term Frequency- Inverse Document Frequency" (TF-IDF)?

[TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) is used in [machine learning](https://docs.constellate.org/key-terms/#machine-learning) and [natural language processing](https://docs.constellate.org/key-terms/#nlp) for measuring the significance of particular terms for a given document. It consists of two parts that are multiplied together:

1. Term Frequency- A measure of how many times a given word appears in a document
2. Inverse Document Frequency- A measure of how rare the same word is across the corpus, based on how many documents contain it

If we were to merely consider [word frequency](https://docs.constellate.org/key-terms/#word-frequency), the most frequent words would be common [function words](https://docs.constellate.org/key-terms/#function-words) like: "the", "and", "of". We could use a [stopwords list](https://docs.constellate.org/key-terms/#stop-words) to remove the common [function words](https://docs.constellate.org/key-terms/#function-words), but that still may not give us results that describe the unique terms in the document since the uniqueness of terms depends on the context of a larger body of documents. In other words, the same term could be significant or insignificant depending on the context. Consider these examples:

* Given a set of scientific journal articles in biology, the term "lab" may not be significant since biologists often rely on and mention labs in their research. However, if the term "lab" were to occur frequently in a history or English article, then it is likely to be significant since humanities articles rarely discuss labs. 
* If we were to look at thousands of articles in literary studies, then the term "postcolonial" may be significant for any given article. However, if we were to look at a few hundred articles on the topic of "the global south," then the term "postcolonial" may occur so frequently that it is not a significant way to differentiate between the articles.

The [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) calculation reveals the words that are frequent in this document **yet rare in other documents**. The goal is to find out what is unique or remarkable about a document given the context (and *the given context* can change the results of the analysis).

Here is how the calculation is mathematically written:

$$tfidf_{t,d} = tf_{t,d} \cdot idf_{t,D}$$

In plain English, this means: **The value of [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) is the product of a given term's frequency and its inverse document frequency.** Let's unpack these terms one at a time.

### Term Frequency Function

$$tf_{t,d}$$
The number of times a term (t) occurs in a given document (d)

### Inverse Document Frequency Function

$$idf_{t,D} = \mbox{log} \frac{N}{|\{d \in D : t \in d\}|}$$
The inverse document frequency can be expanded to the calculation above. In plain English, this means: **The log of the total number of documents (N) divided by the number of documents (d) in the corpus (D) that contain the term (t)**

### TF-IDF Calculation in Plain English

$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$

There are variations on the [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) formula, but this is the most widely-used version.
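
As a minimal sketch, the plain English formula can be written as a small Python function. The function name `tf_idf` and its parameter names are our own hypothetical choices, and the sketch assumes a base-10 logarithm, which matches the worked examples in this notebook.

```python
import math

def tf_idf(term_count, total_docs, docs_with_term):
    """Score one term in one document (base-10 log is an assumption)."""
    if docs_with_term == 0:
        return 0.0  # the term never occurs anywhere, so it carries no weight
    return term_count * math.log10(total_docs / docs_with_term)

# A word that occurs twice in one document and in no other of four documents
score = tf_idf(term_count=2, total_docs=4, docs_with_term=1)
print(round(score, 4))

# A word that occurs in every document always scores zero
print(tf_idf(term_count=3, total_docs=4, docs_with_term=4))
```

Keeping the three inputs as separate arguments makes it easy to see which part of the score comes from the document itself (`term_count`) and which part comes from the rest of the corpus.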

### An Example Calculation of TF-IDF

Let's take a look at an example to illustrate the fundamentals of [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf). First, we need several texts to compare. Our texts will be very simple.

* text1 = 'The grass was green and spread out the distance like the sea.'
* text2 = 'Green eggs and ham were spread out like the book.'
* text3 = 'Green sailors were met like the sea met troubles.'
* text4 = 'The grass was green.'

The first step is to list the unique words in each text.

|text1|text2|text3|text4|
| --- | --- | --- | --- |
|the|green|green|the|
|grass|eggs|sailors|grass|
|was|and|were|was|
|green|ham|met|green|
|and|were|like| |
|spread|spread|the| |
|out|out|sea| |
|distance|like|troubles| |
|like|the| | |
|sea|book| | |


Our four texts share some similar words. Next, we create a single list of unique words that occur across all four texts. (When we use the [gensim](https://docs.constellate.org/key-terms/#gensim) library later, we will call this list a [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary).)

|id|Unique Words|
|---| --- |
|0|and|
|1|book|
|2|distance|
|3|eggs|
|4|grass|
|5|green|
|6|ham|
|7|like|
|8|met|
|9|out|
|10|sailors|
|11|sea|
|12|spread|
|13|the|
|14|troubles|
|15|was|
|16|were|

Now let's count the occurrences of each unique word in each text.

|id|word|text1|text2|text3|text4|
|---|---|---|---|---|---|
|0|and|1|1|0|0|
|1|book|0|1|0|0|
|2|distance|1|0|0|0|
|3|eggs|0|1|0|0|
|4|grass|1|0|0|1|
|5|green|1|1|1|1|
|6|ham|0|1|0|0|
|7|like|1|1|1|0|
|8|met|0|0|2|0|
|9|out|1|1|0|0|
|10|sailors|0|0|1|0|
|11|sea|1|0|1|0|
|12|spread|1|1|0|0|
|13|the|3|1|1|1|
|14|troubles|0|0|1|0|
|15|was|1|0|0|1|
|16|were|0|1|1|0|
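
The counting step above could be reproduced with a few lines of Python. This is only a sketch: the `tokenize` helper is a hypothetical name, and it assumes that lowercasing the text and dropping the period is enough cleaning for these four short sentences.

```python
from collections import Counter

texts = {
    "text1": "The grass was green and spread out the distance like the sea.",
    "text2": "Green eggs and ham were spread out like the book.",
    "text3": "Green sailors were met like the sea met troubles.",
    "text4": "The grass was green.",
}

def tokenize(text):
    """Lowercase the text, drop the period, and split on whitespace."""
    return text.lower().replace(".", "").split()

# One Counter of word frequencies per text
counts = {name: Counter(tokenize(text)) for name, text in texts.items()}

# The sorted list of unique words across all four texts
vocabulary = sorted(set().union(*counts.values()))

# Print one row per word, with a count for each text
for word in vocabulary:
    print(word, [counts[name][word] for name in texts])
```

A `Counter` returns 0 for a word it has never seen, which is why the rows can be built without checking whether each word is present in each text.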

### Computing TF-IDF (Example 1)

We have enough information now to compute [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) for every word in our corpus. Recall the plain English formula.

$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$

We can use the formula to compute [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) for the most common word in our corpus: 'the'. In total, we will compute [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) four times (once for each of our texts).

|id|word|text1|text2|text3|text4|
|---|---|---|---|---|---|
|13|the|3|1|1|1|

text1: $$ tf-idf = 3 \cdot \mbox{log} \frac{4}{(4)} = 3 \cdot \mbox{log} 1 = 3 \cdot 0 = 0$$
text2: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
text3: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
text4: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$

The results of our analysis suggest 'the' has a weight of 0 in every document. The word 'the' exists in all of our documents, and therefore it is not a significant term to differentiate one document from another.

Given that idf is

$$\mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$

and 

$$\mbox{log} 1 = 0$$
we can see that [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) will be 0 for any word that occurs in every document. That is, if a word occurs in every document, then it is not a significant term for any individual document.

### Computing TF-IDF (Example 2)

Let's try a second example with the word 'out'. Recall the plain English formula.

$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$

We will compute [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) four times, once for each of our texts.

|id|word|text1|text2|text3|text4|
|---|---|---|---|---|---|
|9|out|1|1|0|0|

text1: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(2)} = 1 \cdot \mbox{log} 2 = 1 \cdot .3010 = .3010$$
text2: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(2)} = 1 \cdot \mbox{log} 2 = 1 \cdot .3010 = .3010$$
text3: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(2)} = 0 \cdot \mbox{log} 2 = 0 \cdot .3010 = 0$$
text4: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(2)} = 0 \cdot \mbox{log} 2 = 0 \cdot .3010 = 0$$

The results of our analysis suggest 'out' has some significance in text1 and text2, but no significance for text3 and text4 where the word does not occur.

### Computing TF-IDF (Example 3)

Let's try one last example with the word 'met'. Here's the [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) formula again:

$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$

And here's how many times the word 'met' occurs in each text.

|id|word|text1|text2|text3|text4|
|---|---|---|---|---|---|
|8|met|0|0|2|0|

text1: $$ tf-idf = 0 \cdot \log \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$
text2: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$
text3: $$ tf-idf = 2 \cdot \mbox{log} \frac{4}{(1)} = 2 \cdot \mbox{log} 4 = 2 \cdot .6021 = 1.2042$$
text4: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$

As should be expected, we can see that the word 'met' is very significant in text3 but not significant in any other text since it does not occur in any other text. 

### The Full TF-IDF Example Table

Here are the original sentences for each text:

* text1 = 'The grass was green and spread out the distance like the sea.'
* text2 = 'Green eggs and ham were spread out like the book.'
* text3 = 'Green sailors were met like the sea met troubles.'
* text4 = 'The grass was green.'

And here are the corresponding [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) scores for each word in each text:

|id|word|text1|text2|text3|text4|
|---|---|---|---|---|---|
|0|and|.3010|.3010|0|0|
|1|book|0|.6021|0|0|
|2|distance|.6021|0|0|0|
|3|eggs|0|.6021|0|0|
|4|grass|.3010|0|0|.3010|
|5|green|0|0|0|0|
|6|ham|0|.6021|0|0|
|7|like|.1249|.1249|.1249|0|
|8|met|0|0|1.2042|0|
|9|out|.3010|.3010|0|0|
|10|sailors|0|0|.6021|0|
|11|sea|.3010|0|.3010|0|
|12|spread|.3010|.3010|0|0|
|13|the|0|0|0|0|
|14|troubles|0|0|.6021|0|
|15|was|.3010|0|0|.3010|
|16|were|0|.3010|.3010|0|

There are a few noteworthy things in this data. 

* The [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) score for any word that does not occur in a text is 0.
* The scores in text4 are low since it is a shorter version of text1. There are no unique words in text4 since text1 contains all the same words. It is also a short text, which means that there are only four words to consider. The words 'the' and 'green' occur in every text and score 0, leaving only 'was' and 'grass', which are also found in text1 and therefore score only .3010.
* The words 'book', 'eggs', and 'ham' are significant in text2 since they only occur in that text.
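
For readers who want to check the table, here is a hedged sketch that recomputes every score with the base-10 formula used in the examples. The helper name `score` is hypothetical, and the cleaning (lowercasing and removing the period) is a simplification that happens to be sufficient for these sentences.

```python
import math
from collections import Counter

texts = [
    "The grass was green and spread out the distance like the sea.",
    "Green eggs and ham were spread out like the book.",
    "Green sailors were met like the sea met troubles.",
    "The grass was green.",
]

# Word frequencies for each text
docs = [Counter(t.lower().replace(".", "").split()) for t in texts]
total_docs = len(docs)

# Document frequency: in how many texts does each word appear?
doc_freq = Counter(word for doc in docs for word in doc)

def score(word, doc):
    """Term frequency times the base-10 log of N over document frequency."""
    return doc[word] * math.log10(total_docs / doc_freq[word])

for word in sorted(doc_freq):
    print(word, [round(score(word, doc), 4) for doc in docs])
```

Small last-digit differences from the table (for example 1.2041 rather than 1.2042 for 'met') come from the hand calculation rounding the logarithm before multiplying.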

Now that you have a basic understanding of how [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) is computed at a small scale, let's try computing [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) on a [corpus](https://docs.constellate.org/key-terms/#corpus) which could contain millions of words.

---

## Computing TF-IDF with your Dataset

We'll use the `tdm_client` library to automatically retrieve the dataset in the JSON Lines file format.

Enter a [dataset ID](https://docs.constellate.org/key-terms/#dataset-ID) in the next code cell.

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://constellate.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://constellate.org/dataset/dashboard)

In [1]:
# Dataset is "geometric group theory," 1980-present
dataset_id = "a2b5c15a-8de3-e161-077b-8afd47713826"

Next, import `tdm_client` and pass the `dataset_id` as an argument to its `get_dataset` function.

In [2]:
# Importing your dataset with a dataset ID
import tdm_client
# Pull in the dataset that matches `dataset_id`
# in the form of a gzipped JSON lines file.
dataset_file = tdm_client.get_dataset(dataset_id)

INFO:root:Downloading a2b5c15a-8de3-e161-077b-8afd47713826 metadata to /home/jovyan/data/a2b5c15a-8de3-e161-077b-8afd47713826-sampled-jsonl.jsonl.gz


100% |########################################################################|


## Apply Pre-Processing Filters (if available)
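If you completed pre-processing with the "Exploring Metadata and Pre-processing" notebook, you can use your CSV file of dataset IDs to automatically filter the dataset. Your pre-processed CSV file must be in the `data` folder and be named `pre-processed_` followed by your dataset ID, since that is where the next code cell looks for it.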

In [3]:
# Import a pre-processed CSV file of filtered dataset IDs.
# If you do not have a pre-processed CSV file, the analysis
# will run on the full dataset and may take longer to complete.
import pandas as pd
import os

pre_processed_file_name = f'data/pre-processed_{dataset_id}.csv'

if os.path.exists(pre_processed_file_name):
    df = pd.read_csv(pre_processed_file_name)
    filtered_id_list = df["id"].tolist()
    use_filtered_list = True
    print('Pre-Processed CSV found. Successfully read in ' + str(len(df)) + ' documents.')
else: 
    use_filtered_list = False
    print('No pre-processed CSV file found. Full dataset will be used.')

INFO:numexpr.utils:NumExpr defaulting to 4 threads.
Pre-Processed CSV found. Successfully read in 634 documents.
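
The filtering idea can also be sketched with only the standard library. Everything below is illustrative: the CSV text, the `ark://example/...` IDs, and the variable names are hypothetical stand-ins, and the sketch stores the IDs in a set (rather than a list) on the assumption that faster membership tests are useful for large datasets.

```python
import csv
import io

# Hypothetical stand-in for a pre-processed CSV file with an "id" column
csv_text = "id,title\nark://example/1,First\nark://example/3,Third\n"

reader = csv.DictReader(io.StringIO(csv_text))
filtered_ids = {row["id"] for row in reader}  # a set gives fast lookups

# Hypothetical documents, shaped like records with an "id" field
all_documents = [
    {"id": "ark://example/1"},
    {"id": "ark://example/2"},
    {"id": "ark://example/3"},
]

# Keep only the documents whose ID appears in the filtered set
kept = [doc for doc in all_documents if doc["id"] in filtered_ids]
print(len(kept), "of", len(all_documents), "documents kept")
```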


## Load Stopwords List

If you have created a stopword list in the stopwords notebook, we will import it here. (You can always modify the CSV file to add or subtract words then reload the list.) Otherwise, we'll load the NLTK [stopwords](https://docs.constellate.org/key-terms/#stop-words) list automatically.

In [4]:
# Load a custom data/stop_words.csv if available
# Otherwise, load the nltk stopwords list in English

# Create an empty Python list to hold the stopwords
stop_words = []

# The filename of the custom data/stop_words.csv file
stopwords_list_filename = 'data/stop_words.csv'

if os.path.exists(stopwords_list_filename):
    import csv
    with open(stopwords_list_filename, 'r') as f:
        stop_words = list(csv.reader(f))[0]
    print('Custom stopwords list loaded from CSV')
else:
    # Load the NLTK stopwords list
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    print('NLTK stopwords list loaded')

NLTK stopwords list loaded
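
To make the expected CSV layout concrete, the sketch below reads a stopwords list from an in-memory string instead of a file. It assumes, as the cell above does, that all the stopwords sit on the first row of the CSV; the sample words and the `stop_word_set` name are hypothetical.

```python
import csv
import io

# Hypothetical file contents: every stopword on one comma-separated row
csv_text = "the,and,of,with\n"

rows = list(csv.reader(io.StringIO(csv_text)))
stop_words = rows[0]             # the first (and only) row holds the words
stop_word_set = set(stop_words)  # optional: a set makes lookups faster

print(stop_words)
print("and" in stop_word_set)
```

Converting the list to a set is a design choice rather than a requirement. It matters mostly when the same lookup is repeated for every token in a large corpus.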


## Define a Unigram Processing Function
In this step, we gather the unigrams. If there is a Pre-Processing Filter, we will only analyze documents from the filtered ID list. We will also process each unigram, assessing them individually. We will complete the following tasks:

* Lowercase all tokens
* Remove tokens in stopwords list
* Strip leading and trailing punctuation from tokens
* Remove tokens with fewer than 4 characters
* Remove tokens with non-alphabetic characters

We can define this process in a function.

In [52]:
# Define a function that will process individual tokens
# Only a token that passes through all three `if` 
# statements will be returned. A `True` result for
# any `if` statement does not return the token. 

import string

def process_token(token):
    token = token.lower()
    if token in stop_words: # If True, do not return token
        return
    token = token.strip(string.punctuation)
    if len(token) < 4: # If True, do not return token
        return
    if not(token.isalpha()): # If True, do not return token
        return
    return token # If all are False, return the lowercased token

In [79]:
'blue-green'.isalpha()

False
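
To see how the three checks behave together, here is a self-contained sketch that runs the same logic on a handful of sample tokens. The three-word `stop_words` list and the sample tokens are hypothetical and chosen only to trigger each branch.

```python
import string

# Hypothetical, tiny stopwords list used only for this illustration
stop_words = ["the", "with", "from"]

def process_token(token):
    token = token.lower()
    if token in stop_words:      # stopwords are dropped
        return None
    token = token.strip(string.punctuation)
    if len(token) < 4:           # short tokens are dropped
        return None
    if not token.isalpha():      # tokens with digits or hyphens are dropped
        return None
    return token

samples = ["Graph,", "with", "cat", "blue-green", "1984", "Theorem"]
results = {sample: process_token(sample) for sample in samples}
print(results)
```

Only 'Graph,' and 'Theorem' survive (as 'graph' and 'theorem'). The hyphen in 'blue-green' is inside the token, so stripping punctuation from the ends does not remove it and the `isalpha()` check rejects the token.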

## Collect lists of Document IDs, Titles, and Unigrams

Next, we process all the unigrams into a list called `documents`. For demonstration purposes, this code runs on a limit of 10,000 documents (set by the `limit` variable), but we can change this to process all the documents. We are also collecting the document titles and ids so we can reference them later.

In [53]:
# Collecting the unigrams and processing them into `documents`

limit = 10000 # Change number of documents being analyzed. Set to `None` to do all documents.
n = 0
documents = []
document_ids = []
document_titles = []
    
for document in tdm_client.dataset_reader(dataset_file):
    processed_document = []
    document_id = document['id']
    document_title = document['title']
    if use_filtered_list is True:
        # Skip documents not in our filtered_id_list
        if document_id not in filtered_id_list:
            continue
    document_ids.append(document_id)
    document_titles.append(document_title) 
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = process_token(gram)
        if clean_gram is None:
            continue
        processed_document.append(clean_gram)
    if len(processed_document) > 0:
        documents.append(processed_document)
    n += 1
    if (limit is not None) and (n >= limit):
        break
print('Unigrams collected and processed.')

Unigrams collected and processed.
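
We have not confirmed how `tdm_client.dataset_reader` works internally, so the following is only a sketch of the general idea of reading a gzipped JSON Lines file with the standard library. The two records and the `read_jsonl_gz` helper are hypothetical; the field names `id`, `title`, and `unigramCount` are the ones used in the cell above.

```python
import gzip
import io
import json

# Two hypothetical records, shaped like the fields used in this notebook
records = [
    {"id": "doc-1", "title": "First", "unigramCount": {"Graph": 2, "the": 9}},
    {"id": "doc-2", "title": "Second"},  # this record has no unigramCount
]

# Build a gzipped JSON Lines payload in memory (one JSON object per line)
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
payload = gzip.compress(raw)

def read_jsonl_gz(data):
    """Yield one dictionary per non-empty line of a gzipped JSON Lines payload."""
    with gzip.open(io.BytesIO(data), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

for document in read_jsonl_gz(payload):
    # An empty dict is a safe default because .items() still works on it
    unigrams = document.get("unigramCount", {})
    print(document["id"], sorted(unigrams.items()))
```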


At this point, we have unigrams collected for all our documents inside the `documents` list variable. Each index of our list is a single document, starting with `documents[0]`. Each document is, in turn, a list with a single string for each unigram.

**Note:** As we collect the unigrams for each document, we are simply including them in a list of strings. This is not the same as collecting them into word counts, and we are not using a Counter() object here like the Word Frequencies notebook. 

The next cell shows the contents of one item in our `documents` list.

In [54]:
# Show the unigrams collected for a particular document
documents[0]

['graph',
 'fact',
 'using',
 'myself',
 'rely',
 'acts',
 'isomorphic',
 'points',
 'chapter',
 'comes',
 'proof',
 'domain',
 'actually',
 'matrices',
 'referring',
 'state',
 'discrete',
 'method',
 'delay',
 'focus',
 'result',
 'outline',
 'previous',
 'graham',
 'many',
 'product',
 'many',
 'also',
 'hence',
 'prefer',
 'rational',
 'binary',
 'prove',
 'power',
 'proper',
 'previously',
 'terminology',
 'form',
 'generality',
 'isomorphism',
 'graph',
 'interval',
 'express',
 'following',
 'includes',
 'small',
 'baumslag',
 'used',
 'known',
 'image',
 'unlike',
 'free',
 'closed',
 'basis',
 'acting',
 'slightly',
 'considered',
 'solitar',
 'expressed',
 'however',
 'therefore',
 'otherwise',
 'neither',
 'form',
 'prove',
 'useful',
 'rationals',
 'like',
 'takes',
 'groups',
 'space',
 'given',
 'orbit',
 'endpoint',
 'sees',
 'every',
 'line',
 'exercises',
 'function',
 'homomorphism',
 'quotation',
 'proposition',
 'doomed',
 'attributed',
 'freely',
 'elements',
 'thi

If we wanted to see word frequencies, we could convert the lists at this point into `Counter()` objects. The next cell demonstrates that operation.

In [55]:
# Convert a given document into a Counter object to determine
# word frequencies count

# Import counter to help count word frequencies
from collections import Counter

word_freq = Counter(documents[0]) # Change documents index to see a different document
word_freq.most_common(25) 

[('proof', 3),
 ('discrete', 3),
 ('prove', 3),
 ('proper', 3),
 ('form', 3),
 ('graph', 2),
 ('fact', 2),
 ('acts', 2),
 ('chapter', 2),
 ('actually', 2),
 ('state', 2),
 ('outline', 2),
 ('many', 2),
 ('hence', 2),
 ('rational', 2),
 ('interval', 2),
 ('considered', 2),
 ('otherwise', 2),
 ('rationals', 2),
 ('groups', 2),
 ('line', 2),
 ('group', 2),
 ('integer', 2),
 ('positive', 2),
 ('example', 2)]

Now that we have all the cleaned unigrams for every document in a list called `documents`, we can use Gensim to compute TF/IDF.

---
## Using Gensim to Compute "Term Frequency- Inverse Document Frequency"

It will be helpful to remember the basic steps we did in the explanatory [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) example:

1. Create a list of the frequency of every word in every document
2. Create a list of every word in the [corpus](https://docs.constellate.org/key-terms/#corpus)
3. Compute [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) based on that data

So far, we have gathered the cleaned tokens for every document, which is the raw material for the first item. Now we need to create a list of every word in the corpus. In [gensim](https://docs.constellate.org/key-terms/#gensim), this is called a "dictionary". A [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) is similar to a [Python dictionary](https://docs.constellate.org/key-terms/#python-dictionary), but here it is called a [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) to show it is a specialized kind of dictionary.

### Creating a Gensim Dictionary

Let's create our [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary). A [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) is a kind of masterlist of all the words across all the documents in our corpus. Each unique word is assigned an ID in the gensim dictionary. The result is a set of key/value pairs of unique tokens and their unique IDs.

In [74]:
import gensim
dictionary = gensim.corpora.Dictionary(documents)

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(43912 unique tokens: ['abelian', 'acting', 'action', 'actions', 'acts']...) from 634 documents (total 722668 corpus positions)


Now that we have a [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary), we can get a preview that displays the number of unique tokens across all of our texts.

In [75]:
print(dictionary)

Dictionary(43912 unique tokens: ['abelian', 'acting', 'action', 'actions', 'acts']...)


The [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) stores a unique identifier (starting with 0) for every unique token in the corpus. The [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) does not contain information on word frequencies; it only catalogs all the unique words in the corpus. You can see the unique ID for each token in the text using the `token2id` attribute, which maps each token to its ID.

In [58]:
list(dictionary.token2id.items())

[('abelian', 0),
 ('acting', 1),
 ('action', 2),
 ('actions', 3),
 ('acts', 4),
 ('actually', 5),
 ('again', 6),
 ('also', 7),
 ('alternative', 8),
 ('analog', 9),
 ('anyway', 10),
 ('arguing', 11),
 ('around', 12),
 ('asks', 13),
 ('attributed', 14),
 ('basis', 15),
 ('baumslag', 16),
 ('bijection', 17),
 ('binary', 18),
 ('campaign', 19),
 ('case', 20),
 ('cayley', 21),
 ('chapter', 22),
 ('chapters', 23),
 ('closed', 24),
 ('coefficient', 25),
 ('comes', 26),
 ('commonly', 27),
 ('complain', 28),
 ('composed', 29),
 ('composition', 30),
 ('consider', 31),
 ('considered', 32),
 ('constructing', 33),
 ('construction', 34),
 ('contains', 35),
 ('corollary', 36),
 ('cyclic', 37),
 ('defined', 38),
 ('delay', 39),
 ('denoted', 40),
 ('described', 41),
 ('description', 42),
 ('determined', 43),
 ('developed', 44),
 ('different', 45),
 ('dihedral', 46),
 ('discrete', 47),
 ('discussing', 48),
 ('domain', 49),
 ('donald', 50),
 ('doomed', 51),
 ('draw', 52),
 ('drawing', 53),
 ('dyadic', 54

We could also look up the corresponding ID for a token using the ``.get`` method.

In [80]:
# Get the value for the key 'hyperbolic'. Return the string 'None' if there is no token matching 'hyperbolic'.
# The number returned is the gensim dictionary ID for the token.

dictionary.token2id.get('hyperbolic', 'None') 

418

For the sake of example, we could also discover a particular token using just the ID number. This is not something likely to happen in practice, but it serves here as a demonstration of the connection between tokens and their ID number.

Normally, [Python dictionaries](https://docs.constellate.org/key-terms/#python-dictionary) only map from keys to values (not from values to keys). However, we can write a quick for loop to go the other direction. This cell is simply to demonstrate how the [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) is connected to the list entries in the [gensim](https://docs.constellate.org/key-terms/#gensim) ``bow_corpus``.

In [77]:
# Find the token associated with a token id number
token_id = 17

# If the token id matches, print out the associated token
for dict_id, token in dictionary.items():
    if dict_id == token_id:
        print(token)

bijection
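
The two-way relationship between tokens and IDs can be sketched with plain Python dictionaries. This is a simplified stand-in and not gensim's implementation: the toy `documents` are hypothetical, and the order in which gensim hands out IDs may differ from the first-seen order used here.

```python
# Hypothetical, already-cleaned documents
documents = [
    ["graph", "group", "proof"],
    ["group", "action", "proof", "proof"],
]

# Assign the next free integer to each token the first time it is seen
token2id = {}
for doc in documents:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

# Invert the mapping so that an ID leads back to its token
id2token = {token_id: token for token, token_id in token2id.items()}

print(token2id)
print(id2token[2])
```

Building the inverted dictionary once avoids looping over every entry each time a token needs to be looked up by its ID.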


## Creating a Bag of Words Corpus

The next step is to connect our word frequency data found within ``documents`` to our [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) token IDs. For every document, we want to know how many times a word (notated by its ID) occurs. We will create a [Python list](https://docs.constellate.org/key-terms/#python-list) called ``bow_corpus`` that will turn our word counts into a series of [tuples](https://docs.constellate.org/key-terms/#tuple) where the first number is the [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) token ID and the second number is the word frequency.

![Combining Gensim dictionary with documents list to create Bag of Words Corpus](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/bag-of-words-creation.png)

In [61]:
# Create a bag of words corpus

bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

print('Bag of words corpus created successfully.')

Bag of words corpus created successfully.


In [83]:
# Examine the bag of words corpus for a specific document

list(bow_corpus[5][:10]) # List out a slice of the first ten items

[(0, 3),
 (2, 1),
 (3, 2),
 (4, 1),
 (5, 1),
 (7, 1),
 (8, 1),
 (20, 2),
 (22, 1),
 (31, 1)]

Using IDs can seem a little abstract, but we can discover the word associated with a particular ID. For demonstration purposes, the following code will replace the token IDs in the last example with the actual tokens.

In [93]:
word_counts = [[(dictionary[id], count) for id, count in line] for line in bow_corpus]
list(word_counts[5][:10])

[('abelian', 3),
 ('action', 1),
 ('actions', 2),
 ('acts', 1),
 ('actually', 1),
 ('also', 1),
 ('alternative', 1),
 ('case', 2),
 ('chapter', 1),
 ('consider', 1)]
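
Conceptually, turning a document into a bag of words means counting its tokens and swapping each token for its ID. The sketch below illustrates that idea in plain Python; `to_bag_of_words` is a hypothetical helper, the mapping and document are made up, and the code is not meant to mirror how gensim's `doc2bow` is written.

```python
from collections import Counter

# Hypothetical token-to-ID mapping and one cleaned document
token2id = {"action": 0, "graph": 1, "group": 2, "proof": 3}
document = ["group", "proof", "action", "proof"]

def to_bag_of_words(tokens, mapping):
    """Return sorted (token_id, count) pairs for the tokens found in the mapping."""
    counts = Counter(mapping[t] for t in tokens if t in mapping)
    return sorted(counts.items())

bow = to_bag_of_words(document, token2id)
print(bow)
```

Tokens that do not occur in the document ('graph' here) simply have no pair in the result, which keeps each document's representation short even when the dictionary is large.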

## Create the `TfidfModel`

The next step is to create the [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) model which will set the parameters for our implementation of [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf). In our [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) example, the formula for [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) was:

$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$

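In [gensim](https://docs.constellate.org/key-terms/#gensim), the default formula for measuring [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) uses log base 2 instead of log base 10, as shown below. By default, gensim also normalizes the scores in each document to unit length, so the values it reports are smaller than a hand calculation with this formula would give.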

$$(Times-the-word-occurs-in-given-document) \cdot \log_{2} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-the-word)}$$

If you would like to use a different formula for your [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) calculation, there is a description of [parameters you can pass](https://radimrehurek.com/gensim/models/tfidfmodel.html).

In [64]:
# Create our gensim TF-IDF model
model = gensim.models.TfidfModel(bow_corpus) 

INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:calculating IDF weights for 634 documents and 43912 features (552236 matrix non-zeros)


Now, we apply our model to the ``bow_corpus`` to create our results in ``corpus_tfidf``. The ``corpus_tfidf`` holds one entry per document, similar to ``bow_corpus``. Instead of listing the frequency next to the [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) ID, however, it contains the [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) score for the associated token. Below, we display the first ten scores for the document at index 5 in ``corpus_tfidf``.

In [97]:
# Create TF-IDF scores for the ``bow_corpus`` using our model
# Also create TF-IDF scores keyed off unigram string instead of gensim dictionary id

corpus_tfidf = model[bow_corpus]
example_tfidf_scores = [[(dictionary[id], count) for id, count in line] for line in corpus_tfidf]

In [98]:
# List out the TF-IDF scores for the first 10 tokens of the document at index 5 in the corpus
list(corpus_tfidf[5][:10])

[(0, 0.027287195813687024),
 (2, 0.006193903565093521),
 (3, 0.020086563594622107),
 (4, 0.008133639715420712),
 (5, 0.011504563435100174),
 (7, 0.0004264193484837412),
 (8, 0.017667706535831728),
 (20, 0.0016444891754621865),
 (22, 0.016030620320733695),
 (31, 0.0020857063030258974)]

Let's display the tokens instead of the [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary) IDs.

In [99]:
# List out the TF-IDF scores for the first 10 tokens of the document at index 5 in the corpus
list(example_tfidf_scores[5][:10]) 

[('abelian', 0.027287195813687024),
 ('action', 0.006193903565093521),
 ('actions', 0.020086563594622107),
 ('acts', 0.008133639715420712),
 ('actually', 0.011504563435100174),
 ('also', 0.0004264193484837412),
 ('alternative', 0.017667706535831728),
 ('case', 0.0016444891754621865),
 ('chapter', 0.016030620320733695),
 ('consider', 0.0020857063030258974)]

## Find Top Terms in a Single Document
Finally, let's sort the terms by their [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) weights to find the most significant terms in the document.

In [102]:
# Sort the tuples in our TF-IDF scores list

# Choose a document by its index number
# Change n to see a different document
n = 5

def sort_by_score(tfidf_tuples):
    """Return the (token, score) tuples sorted from highest to lowest score."""
    return sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)

# Print the document title and ID
print('Title: ', document_titles[n])
print('ID: ', document_ids[n])

# List the top twenty tokens in our example document by their TF-IDF scores
sort_by_score(example_tfidf_scores[n])[:20]

Title:  Homological Techniques for Strongly Graded Rings: A Survey
ID:  ark://27927/pbd6g0c40h


[('aljadeff', 0.17166508111260184),
 ('chouinard', 0.17166508111260184),
 ('ginosar', 0.17166508111260184),
 ('rxmodule', 0.17166508111260184),
 ('resolutions', 0.15890941803360442),
 ('evens', 0.14243508472272443),
 ('graded', 0.13733357409216954),
 ('rings', 0.1298546079795905),
 ('projectivity', 0.12884391039075085),
 ('cornick', 0.12399301030356452),
 ('moore', 0.10929106250837356),
 ('eckmann', 0.10786598575369791),
 ('tensoring', 0.10786598575369791),
 ('quinn', 0.10555093588440462),
 ('mislin', 0.10144955829986066),
 ('module', 0.10100663012125463),
 ('aljadeffand', 0.08583254055630092),
 ('comtek', 0.08583254055630092),
 ('dade', 0.08583254055630092),
 ('dimkpn', 0.08583254055630092)]
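
When only the top few terms are needed, a full sort is not required. The sketch below uses ``heapq.nlargest`` from the Python standard library as one alternative; the helper name ``top_terms`` and the toy scores (rounded values loosely based on the output above) are illustrative assumptions.

```python
import heapq

def top_terms(tfidf_tuples, k=20):
    """Return the k highest-scoring (token, score) pairs without changing the input list."""
    return heapq.nlargest(k, tfidf_tuples, key=lambda pair: pair[1])

# Hypothetical toy scores for a single document
toy_scores = [('graded', 0.137), ('also', 0.0004), ('rings', 0.130), ('aljadeff', 0.172)]

print(top_terms(toy_scores, k=2))
```

Because the helper returns a new list, the original order of the document's scores is preserved for any later cells that rely on it.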

We could also analyze the entire corpus to find the most distinctive terms. These are terms that appear in a particular text but rarely or never appear in other texts. Often, these will be proper names, since a particular article may mention a name often while the name rarely appears in other articles. There is also a fairly good chance that some of these terms are typos or errors in optical character recognition.

In [68]:
# Build a dictionary ``td`` that maps each term to a TF-IDF score
# If a term occurs in several documents, the score from the last one is kept
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}

# Sort the items of ``td`` into a new variable ``sorted_td``
# ``reverse=True`` orders the items from highest to lowest score
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

In [69]:
for term, weight in sorted_td[:25]: # Print the top 25 terms in the entire corpus
    print(term, weight)

divk 0.49389123037412147
tmax 0.4639313507523522
pseudocharacters 0.4440562717163367
hyperspherical 0.44214591896725747
leafages 0.4367000988486141
xiaoman 0.43140437956890487
hyperbolization 0.42879725479274416
cann 0.4163130878867191
nakaoka 0.4113357107314544
obraztsov 0.40852483782114507
hemidiscrete 0.40186646565791384
lasheras 0.39312064611316916
antoniuk 0.3813156513482767
scwols 0.38073238405578
hillman 0.3797701539803256
pegs 0.3792514165752814
sncx 0.3771170076222464
enfngn 0.3765251699202019
ymnθ 0.3735617215149732
cobracket 0.37199218225447134
subnegative 0.3655715945855486
protree 0.36217815560411404
landscapes 0.3588602549124525
quasipolynomials 0.3556034971182082
dismantlability 0.3554600581843127
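
A dictionary comprehension keeps only the last value assigned to a repeated key, so a term that occurs in several documents ends up with the score from whichever document was processed last. If the highest score for each term is wanted instead, one possible variation is sketched below. The function name ``max_score_per_term`` and the toy data are hypothetical, and this is an alternative rather than the approach used in the cell above.

```python
def max_score_per_term(corpus_scores):
    """Map each term to the highest TF-IDF score it receives in any document."""
    best = {}
    for doc in corpus_scores:
        for term, score in doc:
            # Keep the score only if it beats the best one seen so far
            if score > best.get(term, float('-inf')):
                best[term] = score
    return best

# Hypothetical toy data: one list of (term, score) pairs per document
toy_corpus_scores = [
    [('hyperbolic', 0.08), ('group', 0.02)],
    [('hyperbolic', 0.03), ('landscapes', 0.36)],
]

best_scores = max_score_per_term(toy_corpus_scores)
for term, weight in sorted(best_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(term, weight)
```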


## Display Most Significant Term for each Document
We can also find the most significant term in each document. The loop below prints the result for the first 11 documents.

In [70]:
# For each document, print the ID, most significant/unique word, and TF-IDF score

for n, doc in enumerate(corpus_tfidf):
    # Stop after the first 11 documents
    if n > 10:
        break
    # Skip documents that have no scored terms
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(document_ids[n], dictionary.get(word_id), score)

ark://27927/pbd6j9j9ng waged 0.229536326662481
http://www.jstor.org/stable/118602 incoherent 0.49416190715474884
http://www.jstor.org/stable/23927282 transseries 0.2240610414704663
ark://27927/phx7x24r0rb chevie 0.19076751138165948
ark://27927/phx5wq0m2v3 robot 0.17128959234197738
ark://27927/pbd6g0c40h aljadeff 0.17166508111260184
ark://27927/phx4h1622s5 semistable 0.24202992966776743
ark://27927/phx4f4ks6tb dead 0.16471522883819062
http://www.jstor.org/stable/3844991 exth 0.13967556367522585
http://www.jstor.org/stable/20721717 higes 0.3257865465150734
ark://27927/pbd6fxfg01 spiky 0.29802887392527994
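
If the results are needed for later analysis rather than only printed, they can be collected into a dictionary. The following is a minimal sketch with hypothetical document IDs and scores; ``top_term_per_document`` is an illustrative helper, not a function from gensim or this notebook.

```python
def top_term_per_document(doc_ids, corpus_scores):
    """Map each document ID to its highest-scoring (term, score) pair."""
    top_terms_by_doc = {}
    for doc_id, doc in zip(doc_ids, corpus_scores):
        # Skip documents that have no scored terms
        if not doc:
            continue
        top_terms_by_doc[doc_id] = max(doc, key=lambda pair: pair[1])
    return top_terms_by_doc

# Hypothetical toy data: the second document has no scored terms
toy_ids = ['doc-a', 'doc-b', 'doc-c']
toy_doc_scores = [
    [('robot', 0.17), ('also', 0.01)],
    [],
    [('spiky', 0.30), ('dead', 0.16)],
]

top_by_doc = top_term_per_document(toy_ids, toy_doc_scores)
print(top_by_doc)
```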


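## Ranking Documents by TF-IDF Score for a Search Word
We can also work in the other direction: choose a search word and rank the documents by that word's [TF-IDF](https://docs.constellate.org/key-terms/#tf-idf) score. First, we build an index that maps each term to the documents it appears in, along with its score in each document.
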
In [71]:
from collections import defaultdict

# Map each term to a list of (document index, TF-IDF score) pairs
terms_to_docs = defaultdict(list)
for doc_id, doc in enumerate(corpus_tfidf):
    for term_id, value in doc:
        term = dictionary.get(term_id)
        terms_to_docs[term].append((doc_id, value))
    # Stop after the first 501 documents
    if doc_id >= 500:
        break


In [72]:
# Pick a unigram to discover its score across documents
search_term = 'coriolanus'

# Display a list of documents and scores for the search term
matching = terms_to_docs.get(search_term)
for doc_id, score in sorted(matching, key=lambda x: x[1], reverse=True):
    print(document_ids[doc_id], score)

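TypeError: 'NoneType' object is not iterable

The term 'coriolanus' is not in ``terms_to_docs``, so ``terms_to_docs.get(search_term)`` returns ``None`` and the loop raises a ``TypeError``. Searching for a term that does occur in the indexed documents, such as 'hyperbolic', returns a ranked list of documents.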

In [73]:
# Pick a unigram to discover its score across documents
search_term = 'hyperbolic'

# Display a list of documents and scores for the search term
matching = terms_to_docs.get(search_term)
for doc_id, score in sorted(matching, key=lambda x: x[1], reverse=True):
    print(document_ids[doc_id], score)

http://www.jstor.org/stable/4097604 0.07509385913233951
http://www.jstor.org/stable/40067916 0.05192341649998301
ark://27927/phx4f4kr592 0.04839515731333261
ark://27927/phw4tcbxsf 0.040104965019452044
http://www.jstor.org/stable/20752266 0.038485874941003265
ark://27927/phw4t4rrt1 0.037421455794926256
ark://27927/phx4f149zgc 0.03695599945551828
ark://27927/pgk26wgn37h 0.03615009584621512
http://www.jstor.org/stable/117937 0.03562805588278146
ark://27927/pbd98f3bns 0.03521884569446371
ark://27927/phx4dk54q3z 0.03488696635507774
ark://27927/pbd6fxfg01 0.034560760077270436
http://www.jstor.org/stable/24477626 0.03380584027898839
ark://27927/phx4h15t47j 0.032330890511672525
ark://27927/phzdsrqbnpp 0.03221739308450433
http://www.jstor.org/stable/118035 0.03012414817562989
ark://27927/phz6cqq36b1 0.029440981557024737
ark://27927/phx4f4ks74t 0.02915478233105314
ark://27927/pghjmcv47z 0.028689542430261922
ark://27927/pgk1w0r2b27 0.028080602402389302
ark://27927/phzgkvwqvzd 0.027312268598735995
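
A search for a term that is missing from the index, such as 'coriolanus' above, fails because ``.get()`` returns ``None`` for unknown keys. One way to guard against this is to pass an empty list as the default value. The sketch below shows the idea on a toy index; the data and the helper name ``rank_documents`` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: one list of (term, score) pairs per document
toy_tfidf_docs = [
    [('hyperbolic', 0.08), ('group', 0.02)],
    [('group', 0.05)],
    [('hyperbolic', 0.03)],
]

# Build the same kind of index as ``terms_to_docs``
toy_terms_to_docs = defaultdict(list)
for doc_index, doc in enumerate(toy_tfidf_docs):
    for term, score in doc:
        toy_terms_to_docs[term].append((doc_index, score))

def rank_documents(index, search_term):
    """Return (document index, score) pairs sorted from highest to lowest score.

    Returns an empty list when the search term is not in the index.
    """
    matching = index.get(search_term, [])
    return sorted(matching, key=lambda pair: pair[1], reverse=True)

for term in ['hyperbolic', 'coriolanus']:
    ranked = rank_documents(toy_terms_to_docs, term)
    if not ranked:
        print(term, 'does not appear in the indexed documents')
    for doc_index, score in ranked:
        print(term, doc_index, score)
```

Using ``.get()`` with a default, rather than indexing with square brackets, also avoids adding an empty entry to the ``defaultdict`` for every unsuccessful search.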