# Lesson 2: Themes and Topics

Given a set of documents, how can we determine what each one is about? In this lab we will explore two methods for doing this.

The first uses statistics such as mutual information and chi-squared to find the words that are most distinctive of each document.

The second uses a more advanced method called Latent Dirichlet Allocation (LDA) to determine which topics are present in each document.

## Corpus

The dataset is a text file with two columns. The first column contains one of three document names:
* "Quran"
* "OT"
* "NT"

where OT is the Old Testament and NT is the New Testament.
The second column contains the document text, in this case a single verse from the named document.

**The document name and the document are separated by a tab character '\t'**.
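
As a minimal sketch of what this format implies for parsing, a line can be split on its first tab to recover the two columns. The helper name `split_line` is hypothetical and is not part of the lab code.

```python
def split_line(line: str) -> tuple[str, str]:
    # Hypothetical helper: split one corpus line into (document name, verse text).
    # maxsplit=1 keeps any later tab characters inside the verse text.
    doc_name, text = line.rstrip("\n").split("\t", 1)
    return doc_name, text


doc_name, text = split_line("Quran\tPraise be to Allah, Lord of the Worlds,\n")
print(doc_name)
print(text)
```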

Run the following code to output the first five lines of the dataset.

In [1]:
with open("datasets/comparing_train_and_dev.tsv") as f:
    for i in range(5):
        print(f.readline())

Quran	Praise be to Allah, Lord of the Worlds,

Quran	the Merciful, the Most Merciful,

Quran	Owner of the Day of Recompense.

Quran	Guide us to the Straight Path,

Quran	the Path of those upon whom You have favored, not those upon whom is the anger, nor the astray. (Amen please answer)



## Preprocessing



In [2]:
import re
import Stemmer
import os
from collections import Counter
import unittest
%reload_ext ipython_unittest

The cell below contains the preprocessing code for English text from Lab 1.

In [3]:
# The file 'datasets/stopwords_en.txt' contains a list of stopwords - one per line.
with open("datasets/stopwords_en.txt") as f:
    enStopWords = set(f.read().splitlines())

#      stopWords.extend(['thy', 'ye', 'thou', 'thee', 'shalt', 'hath'])

# Initialize the Snowball stemmer
enStemmer = Stemmer.Stemmer('english')

def preprocess_line_en(line: str) -> list[str]:
    # Convert to lower case
    tokens = line.lower()
    
    # Split into tokens with no punctuation
    tokens = re.split(r"[^\w]", tokens)
    
    # Remove empty strings and stop words and apply the stemmer
    tokens = [enStemmer.stemWord(x) for x in tokens if x and x not in enStopWords]
    
    # Return the tokens
    return tokens
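
To see why the list comprehension has to filter out empty strings, here is a small sketch of the splitting step on its own. It uses only the standard library; the sample sentence comes from the corpus, and the stop word and stemming steps are deliberately left out.

```python
import re

sample = "Guide us to the Straight Path,"

# Splitting on every non-word character leaves an empty string wherever
# two separators are adjacent or the text ends with punctuation.
raw_tokens = re.split(r"[^\w]", sample.lower())
print(raw_tokens)

# Dropping the empty strings gives the clean token list
tokens = [t for t in raw_tokens if t]
print(tokens)
```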

## Exercise 1

Given a doc_name of "Quran", "NT", or "OT", implement the following function, which reads each line of the corpus and preprocesses the documents that belong to the selected doc_name.

In [None]:
def preprocess_dataset(doc_name: str, dataset: list[str]) -> list[list[str]]:
    """Preprocesses and returns the documents from the dataset with the given doc name"""
    
    # 1. Loop over each line in the dataset and split it on the tab character '\t'. 
    # The first part is the document name the second part is the document.
    
    # 2. If the document name is the one specified in the parameter doc_name then preprocess it using 
    # preprocess_line_en.
    
    # 3. Return all the preprocessed documents.
    # Example return value: [['owner', 'day', 'recompens'], ['guid', 'straight', 'path'], ...]
    
    raise Exception("Implement me!")

Run the following test to verify that your preprocessing function works as expected. If it passes, you should see the output `Success`.

In [5]:
test_corpus = ['Quran\tPraise be to Allah, Lord of the Worlds,',
 'Quran\tthe Merciful, the Most Merciful,',
 'Quran\tOwner of the Day of Recompense.',
 'Quran\tGuide us to the Straight Path,',
 'NT\tNow the birth of Jesus Christ was on this wise: When as his mother Mary was espoused to Joseph, before they came together, she was found with child of the Holy Ghost.',
 'NT\tThen Joseph her husband, being a just man, and not willing to make her a publick example, was minded to put her away privily.',
 'NT\tBut while he thought on these things, behold, the angel of the LORD appeared unto him in a dream, saying, Joseph, thou son of David, fear not to take unto thee Mary thy wife: for that which is conceived in her is of the Holy Ghost.',
 'NT\tAnd she shall bring forth a son, and thou shalt call his name JESUS: for he shall save his people from their sins.',
 'OT\tAnd the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.',
 'OT\tAnd God said, Let there be light: and there was light.',
 'OT\tAnd God saw the light, that it was good: and God divided the light from the darkness.',
 'OT\tAnd God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.',
 'OT\tAnd God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.']

In [6]:
%%unittest
"Quran documents are correctly processed"
result = preprocess_dataset("Quran", test_corpus)
assert result == [['prais', 'allah', 'lord', 'world'], ['merci', 'merci'], ['owner', 'day', 'recompens'], ['guid', 'straight', 'path']]

"Old Testament documents are correctly processed"
result = preprocess_dataset("OT", test_corpus)
assert result == [['earth', 'form', 'void', 'dark', 'face', 'deep', 'spirit', 'god', 'move', 'face', 'water'], ['god', 'light', 'light'], ['god', 'light', 'good', 'god', 'divid', 'light', 'dark'], ['god', 'call', 'light', 'day', 'dark', 'call', 'night', 'even', 'morn', 'day'], ['god', 'firmament', 'midst', 'water', 'divid', 'water', 'water']]

"New Testament documents are correctly processed"
result = preprocess_dataset("NT", test_corpus)
assert result == [['birth', 'jesus', 'christ', 'wise', 'mother', 'mari', 'espous', 'joseph', 'found', 'child', 'holi', 'ghost'], ['joseph', 'husband', 'man', 'make', 'publick', 'mind', 'put', 'privili'], ['thought', 'thing', 'behold', 'angel', 'lord', 'appear', 'dream', 'joseph', 'thou', 'son', 'david', 'fear', 'thee', 'mari', 'thi', 'wife', 'conceiv', 'holi', 'ghost'], ['bring', 'son', 'thou', 'shalt', 'call', 'jesus', 'save', 'peopl', 'sin']]



Success

...
----------------------------------------------------------------------
Ran 3 tests in 0.001s

OK


<unittest.runner.TextTestResult run=3 errors=0 failures=0>

In [4]:
# Answer
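def preprocess_dataset(doc_name: str, dataset: list[str]) -> list[list[str]]:
    """Preprocesses and returns the documents from the dataset with the given doc name"""
    docs = []

    for line in dataset:
        # Skip blank lines
        line = line.strip()
        if not line:
            continue

        # Split on the first tab only, in case the text itself contains a tab
        doc, text = line.split('\t', 1)

        # Only preprocess the documents that belong to the requested doc name
        if doc == doc_name:
            docs.append(preprocess_line_en(text))

    return docs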

## Mutual Information & Chi Squared

The functions in the cell below calculate mutual information and chi squared.

In [7]:
from math import log2

In [8]:
def compute_term_doc_freqs(corpus):
    # For each term find the number of docs in the corpus that contain it
    freqs = Counter()

    for doc in corpus:
        terms = set(doc)
        freqs.update(terms)
        
    return freqs 

def xlogy(x, y):
    # Compute x * log2(y). If x == 0 then return zero without
    # evaluating log2(y)
    if x == 0:
        return 0
    else:
        return x * log2(y)

def compute_mi(n00, n10, n01, n11):
    # Compute mutual information
    N = n00 + n10 + n01 + n11
    return xlogy((n11 / N), (N * n11) / ((n11 + n10) * (n11 + n01))) + \
            xlogy((n01 / N), (N * n01) / ((n01 + n00) * (n11 + n01))) + \
            xlogy((n10 / N), (N * n10) / ((n11 + n10) * (n10 + n00))) + \
            xlogy((n00 / N), (N * n00) / ((n01 + n00) * (n10 + n00)))

def compute_chi_sq(n00, n10, n01, n11):
    # Compute chi squared
    return ((n11 + n10 + n01 + n00) * (n11 * n00 - n10 * n01)**2) / \
            ((n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00))

def compute_mi_and_chi_sq(corpus, other_corpus, vocab):
    # Compute the term doc frequencies for the corpus and the other corpus
    corpus_freqs = compute_term_doc_freqs(corpus)
    other_corpus_freqs = compute_term_doc_freqs(other_corpus)

    # For each term in the vocab compute the MI and Chi squared
    mut_inf = {}
    chi_sq = {}

    for term in vocab:

        # Number of docs in corpus containing term
        n11 = corpus_freqs.get(term, 0)

        # Number of docs in corpus not containing term
        n10 = len(corpus) - n11 

        # Number of docs not in corpus containing term
        n01 = other_corpus_freqs.get(term, 0)

        # Number of docs not in corpus not containing term
        n00 = len(other_corpus) - n01 

        # Compute MI
        mut_inf[term] = compute_mi(n00, n10, n01, n11)

        # Compute Chi squared
        chi_sq[term] = compute_chi_sq(n00, n10, n01, n11)
    
    return mut_inf, chi_sq 
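
To get a feel for how the two statistics behave, the sketch below evaluates them on made-up contingency counts. The two functions restate the formulas from the cell above so that the example runs on its own; the counts themselves are invented for illustration and do not come from the corpus.

```python
from math import isclose, log2


def xlogy(x, y):
    # x * log2(y), defined as 0 when x == 0
    return 0 if x == 0 else x * log2(y)


def compute_mi(n00, n10, n01, n11):
    # Same formula as in the cell above, restated so this sketch is self-contained
    N = n00 + n10 + n01 + n11
    return (xlogy(n11 / N, (N * n11) / ((n11 + n10) * (n11 + n01)))
            + xlogy(n01 / N, (N * n01) / ((n01 + n00) * (n11 + n01)))
            + xlogy(n10 / N, (N * n10) / ((n11 + n10) * (n10 + n00)))
            + xlogy(n00 / N, (N * n00) / ((n01 + n00) * (n10 + n00))))


def compute_chi_sq(n00, n10, n01, n11):
    N = n00 + n10 + n01 + n11
    return (N * (n11 * n00 - n10 * n01) ** 2) / \
        ((n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00))


# Made-up counts: the term appears in half of the corpus docs and half of the
# other docs, so knowing the corpus tells us nothing about the term.
independent = dict(n11=10, n10=10, n01=20, n00=20)
print(compute_mi(**independent), compute_chi_sq(**independent))

# Made-up counts: the term appears in most corpus docs and almost no others.
distinctive = dict(n11=18, n10=2, n01=1, n00=39)
print(compute_mi(**distinctive), compute_chi_sq(**distinctive))

# Swapping the roles of "corpus" and "other corpus" gives the same scores:
# both statistics measure the strength of association, not its direction.
swapped = dict(n11=1, n10=39, n01=18, n00=2)
print(isclose(compute_mi(**distinctive), compute_mi(**swapped)))
print(isclose(compute_chi_sq(**distinctive), compute_chi_sq(**swapped)))
```

Because neither score records the direction of the association, a term that is conspicuously absent from a corpus can score as highly as one that is conspicuously present. This is worth keeping in mind when reading the top-10 lists.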

## Exercise 2

Write code following the steps below to calculate mutual information and chi squared for the corpus.

In [9]:
def read_corpus() -> list[str]:
    with open("datasets/comparing_train_and_dev.tsv") as f:
        return f.readlines()

In [None]:
# 1. Read the corpus by calling the function read_corpus defined above. 

# 2. Preprocess the datasets "Quran", "OT", and "NT" using the preprocess_dataset function.

# 3. Generate the vocabulary (a list of all the unique words) in the entire preprocessed corpus.

# 4. By calling the function compute_mi_and_chi_sq calculate the mutual information 
# and chi squared values for the Quran. The parameter corpus should be the preprocessed Quran documents and
# the 'other_corpus' is the combination of the preprocessed old and new testaments.

# 5. Sort the statistics from highest to lowest and print the top 10 mutual information words 
# and the top 10 chi squared words for the Quran.

# 6. Repeat steps 4 and 5 to calculate the mutual information and chi squared for the Old Testament and 
# the New Testament.


In [10]:
### Answer
def compute_vocab(corpus):
    # Compute the vocab for the corpus
    return set([term for doc in corpus for term in doc])

def get_top_10_stats(stats):
    return [(k, round(stats[k], 3)) for k in sorted(stats, key=stats.get, reverse=True)][:10]

def format_list(l):
    return "\n".join([str(i) for i in l])

def print_top_10_stats(corpus_name, mi, chi_sq):
    # Print the top 10 MI topics and scores
    mi_top_10 = get_top_10_stats(mi)
    print(f"Top 10 MI for {corpus_name}:\n{format_list(mi_top_10)}\n")

    # Print the top 10 Chi Sq topics and scores
    chi_sq_top_10 = get_top_10_stats(chi_sq)
    print(f"Top 10 Chi Sq for {corpus_name}:\n{format_list(chi_sq_top_10)}\n")  
    
corpus = read_corpus()
quran = preprocess_dataset("Quran", corpus)
new_testament = preprocess_dataset("NT", corpus)
old_testament = preprocess_dataset("OT", corpus)

# Combine all the preprocessed corpora together
corpus_documents = quran + new_testament + old_testament

# Get the vocab for the corpus
vocab = compute_vocab(corpus_documents)

# Compute stats for the Quran
quran_mi, quran_chi_sq = compute_mi_and_chi_sq(quran, new_testament + old_testament, vocab)
print_top_10_stats("Quran", quran_mi, quran_chi_sq)

# Compute stats for the New Testament
nt_mi, nt_chi_sq = compute_mi_and_chi_sq(new_testament, quran + old_testament, vocab)
print_top_10_stats("New Testament", nt_mi, nt_chi_sq)

# Compute stats for the Old Testament
ot_mi, ot_chi_sq = compute_mi_and_chi_sq(old_testament, quran + new_testament, vocab)
print_top_10_stats("Old Testament", ot_mi, ot_chi_sq)

Top 10 MI for Quran:
('allah', 0.153)
('thou', 0.039)
('thi', 0.031)
('ye', 0.028)
('thee', 0.028)
('god', 0.025)
('man', 0.02)
('king', 0.019)
('hath', 0.019)
('punish', 0.018)

Top 10 Chi Sq for Quran:
('allah', 7058.784)
('punish', 917.837)
('thou', 889.245)
('believ', 856.012)
('unbeliev', 811.822)
('messeng', 769.741)
('god', 704.642)
('thi', 699.436)
('beli', 683.328)
('guid', 677.282)

Top 10 MI for New Testament:
('jesus', 0.065)
('christ', 0.037)
('allah', 0.019)
('discipl', 0.018)
('lord', 0.016)
('ye', 0.013)
('israel', 0.013)
('faith', 0.013)
('paul', 0.012)
('peter', 0.011)

Top 10 Chi Sq for New Testament:
('jesus', 3268.989)
('christ', 1795.001)
('discipl', 909.8)
('faith', 669.145)
('paul', 588.945)
('ye', 586.429)
('peter', 560.751)
('lord', 538.896)
('thing', 525.05)
('receiv', 490.809)

Top 10 MI for Old Testament:
('allah', 0.087)
('jesus', 0.041)
('israel', 0.036)
('lord', 0.031)
('thi', 0.03)
('king', 0.029)
('thou', 0.023)
('christ', 0.021)
('thee', 0.019)
('beli

### Analysis

1. What differences do you observe between the top Mutual Information words and the top Chi Squared words?

2. What can you learn about the three documents, the Quran, Old Testament, and New Testament from these results?

## Extension exercises

1. Notice that words like "thou", "thi", "ye" appear in the output - what do you think these words mean in English?
2. It turns out these are archaic English words for "you" and "your" ("thi" is the stemmed form of "thy"), so they should be stop words! Modify the preprocessing code to exclude these words.
3. Recalculate the Mutual Information and Chi Squared scores. Do you get better results?
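
One possible approach to step 2 is sketched below. The set `en_stop_words` is a small stand-in for the one loaded from the stop word file, and the list of archaic words is illustrative rather than complete. Since `preprocess_line_en` checks stop words before stemming, the unstemmed forms (for example "thy", not "thi") are the ones to add.

```python
# Stand-in for the set loaded from datasets/stopwords_en.txt (assumed contents)
en_stop_words = {"the", "of", "and", "to"}

# Illustrative list of archaic pronouns and verb forms; extend it as needed.
# The unstemmed forms are used because the stop word check runs before stemming.
archaic_words = ["thy", "ye", "thou", "thee", "shalt", "hath"]

# A set has no extend() method, so use update() instead
en_stop_words.update(archaic_words)

tokens = ["thou", "shalt", "call", "his", "name", "jesus"]
kept = [t for t in tokens if t not in en_stop_words]
print(kept)
```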

## LDA

In [11]:
import numpy as np
from gensim import models
from gensim.corpora.dictionary import Dictionary

In [12]:
corpus = read_corpus()
quran = preprocess_dataset("Quran", corpus)
new_testament = preprocess_dataset("NT", corpus)
old_testament = preprocess_dataset("OT", corpus)

# Combine all the preprocessed corpora together
common_texts = quran + new_testament + old_testament

# Create dictionary 
corpus_dictionary = Dictionary(common_texts)
corpus = [corpus_dictionary.doc2bow(d) for d in common_texts]

In [13]:
# Train the LDA model with 20 topics
lda = models.LdaModel(corpus, num_topics=20, id2word=corpus_dictionary)

In [14]:
# Inspect the first 5 docs from the Quran
for i in range(5):
    doc_bow = corpus_dictionary.doc2bow(quran[i])
    topics = lda.get_document_topics(doc_bow)
    print('Document: \n', quran[i])
    print('Topics: \n', topics)

Document: 
 ['prais', 'allah', 'lord', 'world']
Topics: 
 [(0, 0.010001853), (1, 0.010001853), (2, 0.010001853), (3, 0.010001853), (4, 0.010001853), (5, 0.010001853), (6, 0.010001853), (7, 0.010001853), (8, 0.010001853), (9, 0.010001856), (10, 0.80996484), (11, 0.010001853), (12, 0.010001854), (13, 0.010001853), (14, 0.010001853), (15, 0.010001853), (16, 0.010001853), (17, 0.010001853), (18, 0.010001853), (19, 0.010001853)]
Document: 
 ['merci', 'merci']
Topics: 
 [(0, 0.01666682), (1, 0.01666682), (2, 0.6833304), (3, 0.01666682), (4, 0.01666682), (5, 0.01666682), (6, 0.01666682), (7, 0.01666682), (8, 0.01666682), (9, 0.01666682), (10, 0.01666682), (11, 0.01666682), (12, 0.01666682), (13, 0.01666682), (14, 0.01666682), (15, 0.01666682), (16, 0.01666682), (17, 0.01666682), (18, 0.01666682), (19, 0.01666682)]
Document: 
 ['owner', 'day', 'recompens']
Topics: 
 [(0, 0.012502207), (1, 0.2371263), (2, 0.28051636), (3, 0.012502207), (4, 0.012502207), (5, 0.012502207), (6, 0.26981983), (7, 0.
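
The near-uniform values in this output, such as 0.0100 and 0.0167, can be reproduced by hand. The sketch below assumes that gensim's default symmetric prior is `alpha = 1 / num_topics` and that every token of a very short document ends up assigned to a single topic. Under those assumptions the expected topic proportion is `(alpha + count) / (num_tokens + num_topics * alpha)`. This approximates what the inference converges to; it is not gensim's actual code, and `expected_topic_weights` is a hypothetical name.

```python
def expected_topic_weights(num_tokens: int, num_topics: int = 20) -> tuple[float, float]:
    # Assumed symmetric Dirichlet prior (believed to be gensim's default)
    alpha = 1.0 / num_topics
    total = num_tokens + num_topics * alpha
    dominant = (alpha + num_tokens) / total  # topic that received every token
    other = alpha / total                    # each topic that received no tokens
    return dominant, other


# ['prais', 'allah', 'lord', 'world'] has 4 tokens
print(expected_topic_weights(4))

# ['merci', 'merci'] has 2 tokens
print(expected_topic_weights(2))
```

Under these assumptions the results agree with the printed weights for the first two documents to about three decimal places, which suggests that for very short documents the small weights mostly reflect the prior rather than the text.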

In [15]:
# Print out the top 5 words for each topic
print("The top five words for each topic:")
topics = lda.print_topics(num_words=5)
for topic_id, words in topics:
    print(topic_id, ': ', words)

The top five words for each topic:
0 :  0.123*"mine" + 0.085*"midst" + 0.075*"work" + 0.031*"fruit" + 0.031*"drink"
1 :  0.128*"offer" + 0.069*"day" + 0.053*"captiv" + 0.049*"sin" + 0.037*"sacrific"
2 :  0.207*"lord" + 0.034*"evil" + 0.032*"heart" + 0.027*"head" + 0.024*"righteous"
3 :  0.092*"deliv" + 0.062*"blood" + 0.051*"flesh" + 0.045*"reign" + 0.039*"troubl"
4 :  0.072*"set" + 0.052*"mountain" + 0.052*"face" + 0.037*"day" + 0.036*"lift"
5 :  0.141*"men" + 0.059*"thousand" + 0.043*"gold" + 0.041*"hundr" + 0.033*"twenti"
6 :  0.140*"man" + 0.081*"david" + 0.038*"jacob" + 0.032*"round" + 0.030*"lord"
7 :  0.087*"gate" + 0.064*"cast" + 0.056*"tree" + 0.054*"spirit" + 0.051*"god"
8 :  0.078*"father" + 0.062*"son" + 0.057*"daughter" + 0.044*"cut" + 0.040*"field"
9 :  0.152*"god" + 0.135*"lord" + 0.054*"word" + 0.034*"israel" + 0.024*"thing"
10 :  0.118*"hous" + 0.100*"behold" + 0.071*"lord" + 0.046*"earth" + 0.045*"egypt"
11 :  0.082*"pass" + 0.080*"citi" + 0.050*"judgment" + 0.032*"wa

## Exercise 3

In this exercise we'll write some code to interpret the topics identified by the LDA model.

1. For each of the three corpora (Quran, Old Testament, and New Testament), compute the average score for each topic: sum the document-topic probabilities over every document in that corpus and divide by the number of documents in the corpus.
2. Then, for each corpus, identify the topic with the highest average score. For each of those three top topics, print the 10 tokens with the highest probability of belonging to that topic. Hint: use `lda.print_topic`.
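
The averaging in step 1 can be sketched without gensim by treating each document's topics as a list of `(topic_id, probability)` pairs, which is the shape returned by `lda.get_document_topics`. The toy values and the name `average_topic_scores` are made up for illustration.

```python
def average_topic_scores(doc_topics: list[list[tuple[int, float]]], num_topics: int) -> list[float]:
    # Sum each topic's probability over all documents
    sums = [0.0] * num_topics
    for topics in doc_topics:
        for topic_id, p in topics:
            sums[topic_id] += p
    # Divide by the number of documents to get the average
    return [s / len(doc_topics) for s in sums]


# Two toy documents over three topics; topics missing from a list count as zero
toy = [
    [(0, 0.5), (2, 0.5)],
    [(0, 1.0)],
]
scores = average_topic_scores(toy, num_topics=3)
print(scores)

# The top topic is the index of the largest average
top_topic = max(range(len(scores)), key=scores.__getitem__)
print(top_topic)
```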

In [16]:
# Answer
# For each corpus compute the average score for each topic
def compute_average_topic_score(corpus, lda):
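    # Create an array to accumulate the probability sum for each topic
    topic_sums = np.zeros((lda.num_topics))
    
    for doc in corpus:
        # Get the topics for this document. Note that get_document_topics omits
        # topics whose probability falls below the model's minimum_probability.
        doc_bow = corpus_dictionary.doc2bow(doc)
        doc_topics = lda.get_document_topics(doc_bow)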
        
        # Sum the probabilities for each topic
        for topic_id, p in doc_topics:
            topic_sums[topic_id] += p
    
    # Return the average probabilities for each topic
    return topic_sums / len(corpus)

def print_top_topic(corpus_name, scores, lda):
    # Find the id of the topic with the highest probability
    top_topic_id = np.argmax(scores)

    # Get the top 10 tokens corresponding to that topic
    topic_words = lda.print_topic(top_topic_id, topn=10)

    # Print the results
    print("Top topic id for ", corpus_name, " is: ", top_topic_id)
    print("Top 10 words:\n", topic_words, "\n")

def print_all_topics(quran_scores, nt_scores, ot_scores, lda):
    print("LDA topics:")
    all_topics = lda.print_topics(num_topics=20, num_words=10)
    for topic_id, keywords in all_topics:
        print(topic_id, ': ', keywords)

    # Sort the topics
    sorted_quran_topic_ids = np.flip(np.argsort(quran_scores))
    sorted_nt_topic_ids = np.flip(np.argsort(nt_scores))
    sorted_ot_topic_ids = np.flip(np.argsort(ot_scores))
    print("Quran, NT, OT")
    print(np.vstack([sorted_quran_topic_ids, quran_scores[sorted_quran_topic_ids], sorted_nt_topic_ids, nt_scores[sorted_nt_topic_ids], sorted_ot_topic_ids, ot_scores[sorted_ot_topic_ids]]).T)
    
# Get the average topic scores for each corpus
quran_avg_topic_scores = compute_average_topic_score(quran, lda)
nt_avg_topic_scores = compute_average_topic_score(new_testament, lda)
ot_avg_topic_scores = compute_average_topic_score(old_testament, lda)

# Print out the top topic and top 10 words for that topic
print_top_topic("Quran", quran_avg_topic_scores, lda)
print_top_topic("New Testament", nt_avg_topic_scores, lda)
print_top_topic("Old Testament", ot_avg_topic_scores, lda)

# Print out the list of all topics for each corpus
# print_all_topics(quran_avg_topic_scores, nt_avg_topic_scores, ot_avg_topic_scores, lda)

Top topic id for  Quran  is:  2
Top 10 words:
 0.207*"lord" + 0.034*"evil" + 0.032*"heart" + 0.027*"head" + 0.024*"righteous" + 0.024*"great" + 0.024*"anger" + 0.023*"mouth" + 0.023*"speak" + 0.022*"word" 

Top topic id for  New Testament  is:  9
Top 10 words:
 0.152*"god" + 0.135*"lord" + 0.054*"word" + 0.034*"israel" + 0.024*"thing" + 0.020*"command" + 0.019*"prais" + 0.018*"peopl" + 0.016*"measur" + 0.016*"spoken" 

Top topic id for  Old Testament  is:  9
Top 10 words:
 0.152*"god" + 0.135*"lord" + 0.054*"word" + 0.034*"israel" + 0.024*"thing" + 0.020*"command" + 0.019*"prais" + 0.018*"peopl" + 0.016*"measur" + 0.016*"spoken" 



### Analysis

1. For each of the three top topics what title in 1-3 words would you give to that topic?

2. What does the LDA model tell you about the corpus? Consider the top three topics you have found. 

### Extension Exercises

1. Print out a list of all the topics for each corpus. Are there any topics that appear to be common in 2 corpora but not in the other? What are they and what are some examples of high probability words from these topics?

2. How do these results differ from what you learned using mutual information and chi squared?