The log-odds ratio with an informative (and uninformative) Dirichlet prior (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)) is a common method for finding distinctive terms in two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument). This method for finding distinguishing words combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds) for the ratio of two probabilities
* it can incorporate prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.
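
To make the pseudocount idea in the list above concrete, here is a minimal sketch of the two kinds of prior. The helper names `uninformative_prior` and `informative_prior`, the `prior_mass` and `floor` parameters, and the toy corpus are all hypothetical, chosen only for illustration; scaling the pseudocounts by background frequency is one plausible reading of an informative prior, not necessarily the exact recipe used in the paper.

```python
from collections import Counter


def uninformative_prior(vocab, alpha_w=0.01):
    """Give every word the same small pseudocount (pure smoothing)."""
    return {word: alpha_w for word in vocab}


def informative_prior(background_tokens, vocab, prior_mass=100.0, floor=0.01):
    """Spread `prior_mass` pseudocounts over the vocab in proportion to
    how often each word occurs in a background corpus (assumed choice)."""
    counts = Counter(background_tokens)
    total = sum(counts[word] for word in vocab) or 1  # avoid dividing by zero
    return {word: max(floor, prior_mass * counts[word] / total) for word in vocab}


# toy data, made up for this example
vocab = ["the", "space", "love"]
background = ["the"] * 8 + ["space"] + ["love"]
print(uninformative_prior(vocab))
print(informative_prior(background, vocab))
```

With the uninformative prior every word starts from the same tiny count, whereas the informative prior gives a frequent background word such as `the` a much larger pseudocount, so it needs more evidence before it looks distinctive.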

In this homework you will implement this ratio for two datasets of your choice to characterize the words that distinguish each one from the other.

Your first job is to find two datasets with some interesting opposition -- e.g., news articles from CNN vs. FoxNews, books written by Charles Dickens vs. James Joyce, screenplays of dramas vs. comedies. Be creative -- this should be driven by what interests you and should reflect your own originality. **This dataset cannot come from Kaggle**. Feel free to use web scraping (see [here](https://github.com/CU-ITSS/Web-Data-Scraping-S2023) for a great tutorial) or manually copying/pasting text. Aim for more than 10,000 tokens for each dataset.
   
Save those datasets in two files: "class1_dataset.txt" and "class2_dataset.txt" 

Q1. Describe each of those datasets and their source in 100-200 words.

I'm interested in word choices in different movie genres, so I selected one representative movie each from the Sci-Fi (2001: A Space Odyssey) and Romance (Jerry Maguire) genres. The scripts were sourced from [SimplyScripts](https://www.simplyscripts.com/), a database of open-source screenplays. Both movies have comparable screen time, and on closer inspection the scripts also contain a similar number of tokens (~25,000). I expect words such as `space`, `machine`, and `time` to be most characteristic of the Sci-Fi script, while the Romance script is more likely to feature words related to emotions and feelings, mentions of the main characters, and descriptions of the scenarios.

Q2. Tokenize those texts by filling out the `read_and_tokenize` function below (your choice of tokenizer). The input is a filename and the output should be a list of tokens.

In [1]:
import sys
import nltk
import spacy
import numpy as np
import string
from collections import Counter

In [2]:
def read_and_tokenize(filename):
    
    # read in the script and lowercase all tokens
    with open(filename, encoding="utf-8") as file:
        text = file.read().lower()  
    
    # sentence tokenization
    sents = nltk.sent_tokenize(text)
    
    # tokenization
    words = []
    for sent in sents:
        words.append(nltk.word_tokenize(sent))
    
    # flatten the words list to return a list of tokens
    # also drop tokens that are substrings of string.punctuation (in practice,
    # single punctuation marks); multi-character tokens such as "--" survive
    alltokens = [word for sent in words for word in sent if word not in string.punctuation]
    
    # if not remove punctuation, use the code below
    # alltokens = [word for sent in words for word in sent]
    
    return alltokens

In [3]:
# change these file paths to wherever the datasets you created above live.
class1_tokens=read_and_tokenize("../data/scripts/scifiscripts.com_scripts_2001.txt")
class2_tokens=read_and_tokenize("../data/scripts/awesomefilm.com_script_jerryMaguire.txt")

In [4]:
# check token size for each class
print("The number of tokens in class 1 is: %.f" % len(class1_tokens))
print("The number of tokens in class 2 is: %.f" % len(class2_tokens))

The number of tokens in class 1 is: 25195
The number of tokens in class 2 is: 28505


Q3.  Now let's find the words that characterize each of those sources (with respect to the other). Implement the log-odds ratio with an uninformative Dirichlet prior.  This value, $\hat\zeta_w^{(i-j)}$ for word $w$ reflecting the difference in usage between corpus $i$ and corpus $j$, is given by the following equation:

$$
\hat\zeta_w^{(i-j)}= {\hat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\hat{d}_w^{(i-j)}\right)}}
$$

Where: 

$$
\hat{d}_w^{(i-j)} = \log \left( \frac{y_w^i + \alpha_w}{n^i + \alpha_0 - y_w^i - \alpha_w} \right) - \log \left( \frac{y_w^j + \alpha_w}{n^j + \alpha_0 - y_w^j - \alpha_w} \right)
$$

$$
\sigma^2\left(\hat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V \cdot \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)

In this example, the two corpora are your class1 dataset (e.g., $i$ = your class1) and your class2 dataset (e.g., $j$ = class2). Using this metric, print out the 25 words most strongly aligned with class1, and 25 words most strongly aligned with class2.  Again, consult [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) for more detail.
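
As a quick check of the equations above, the sketch below evaluates $\hat\zeta_w^{(i-j)}$ for a single word. The function name `zeta` and all the counts passed to it are hypothetical; only the formula itself comes from the definitions above.

```python
import math


def zeta(y_i, y_j, n_i, n_j, vocab_size, alpha_w=0.01):
    """z-scored log-odds ratio of one word under an uninformative prior."""
    alpha_0 = vocab_size * alpha_w
    # smoothed log-odds of the word in each corpus
    log_odds_i = math.log((y_i + alpha_w) / (n_i + alpha_0 - y_i - alpha_w))
    log_odds_j = math.log((y_j + alpha_w) / (n_j + alpha_0 - y_j - alpha_w))
    # approximate variance of the difference
    variance = 1 / (y_i + alpha_w) + 1 / (y_j + alpha_w)
    return (log_odds_i - log_odds_j) / math.sqrt(variance)


# made-up counts: two corpora of 1,000 tokens each, 500 word types overall
print(zeta(10, 2, 1000, 1000, 500))  # word favoured by corpus i: positive
print(zeta(2, 10, 1000, 1000, 500))  # same word, roles swapped: negative
print(zeta(0, 0, 1000, 1000, 500))   # equal counts: no preference
```

A positive score means the word leans toward corpus $i$, a negative score toward corpus $j$, and swapping the two corpora of equal size simply flips the sign.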

In [5]:
def logodds_with_uninformative_prior(one_tokens, two_tokens, display=25):
    
    # get unique tokens for each class
    class1_v = set(one_tokens)
    class2_v = set(two_tokens)
    
    # get the vocab list - distinct word types from two classes
    vocab = list(class1_v | class2_v)
    
    # get word count for each token in each class
    # Counter tallies each corpus in a single pass and returns 0 for unseen tokens
    class1_all, class2_all = Counter(one_tokens), Counter(two_tokens)
    class1_cnt = {word: class1_all[word] for word in vocab}
    class2_cnt = {word: class2_all[word] for word in vocab}
    
    assert len(class1_cnt) == len(class2_cnt) == len(vocab)
    
    # get the total number of tokens in each class (n^i and n^j)
    class1_num = len(one_tokens)
    class2_num = len(two_tokens)
    
    # alpha setting
    alpha_w = 0.01
    alpha_0 = len(vocab) * alpha_w 
    
    # initialize a dict to store log odds ratios
    lor = {}
    
    # loop over the vocab to get the z-scored log-odds ratio of each word
    for word in vocab:
        # smoothed odds of the word in each class
        I = (class1_cnt[word] + alpha_w) / (class1_num + alpha_0 - class1_cnt[word] - alpha_w)
        J = (class2_cnt[word] + alpha_w) / (class2_num + alpha_0 - class2_cnt[word] - alpha_w)
        # approximate variance of the log-odds difference
        var = (1 / (class1_cnt[word] + alpha_w)) + (1 / (class2_cnt[word] + alpha_w))
        ratio = (np.log(I) - np.log(J)) / np.sqrt(var)
        lor[word] = ratio
    
    # sort the (token, score) pairs by score, highest first
    scores = sorted(lor.items(), key=lambda item: item[1], reverse=True)
    
    # return the first and last `display` tokens; both lists are in descending
    # order, so the class-2 list ends with its strongest token
    return scores[:display], scores[-display:]

In [6]:
print("Words most aligned with the Sci-Fi movie--2001: A Space Odyssey:\n")
logodds_with_uninformative_prior(class1_tokens, class2_tokens)[0]

Words most aligned with the Sci-Fi movie--2001: A Space Odyssey:



[('--', 68.82342605357024),
 ('had', 5.586309810975934),
 ('space', 4.626690273113128),
 ('earth', 4.45198097545501),
 ('its', 4.310250655343011),
 ('any', 4.238909992584407),
 ('control', 4.221181072603766),
 ('would', 3.6724336732334875),
 ('been', 3.604349770236324),
 ('well', 3.588526684311813),
 ('doors', 3.4575805734136678),
 ('others', 3.2732512611545554),
 ('yes', 3.1842696557593575),
 ('slowly', 3.1705022245375485),
 ('command', 3.1150770706352646),
 ('mission', 3.111298161286163),
 ('area', 3.0716941110192413),
 ('their', 3.03416884626844),
 ('michaels', 2.841531970932135),
 ('true', 2.8258075947089156),
 ('please', 2.8258075947089156),
 ('hope', 2.7005519067672092),
 ('tv', 2.7005519067672092),
 ('child', 2.672553360644189),
 ('inside', 2.6716649065882927)]

The list above gives the 25 words that most characterize the writing style of the Sci-Fi movie *2001: A Space Odyssey*. A number of them are intuitive: `space`, `earth`, `control`, `command`, `mission`, `area`, and character names such as `michaels`. Others are less so, most notably the top-ranked token `--`, which the original script uses as a line separator between scenes. Tokens like this could be removed during preprocessing to make the list more informative.
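
One way to drop such tokens is sketched below. The helper name `drop_punctuation_tokens` and the sample token list are hypothetical. The sketch also shows why `--` gets past a `word not in string.punctuation` test: that expression is a substring check against one long string, so only tokens that appear contiguously in it (in practice, single punctuation marks) are removed.

```python
import string


def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# substring test: "!" is found inside string.punctuation, "--" is not
print("!" in string.punctuation)
print("--" in string.punctuation)

# made-up sample mixing words, clitics, and punctuation-only tokens
sample = ["--", "space", "...", "n't", "``", "'s", "!"]
print(drop_punctuation_tokens(sample))
```

This keeps clitics such as `n't` and `'s`, which carry a letter, while discarding separators like `--`, `...`, and the quote tokens.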

In [7]:
print("Words most aligned with the Romance movie--Jerry Maguire:\n")
logodds_with_uninformative_prior(class1_tokens, class2_tokens)[1]

Words most aligned with the Romance movie--Jerry Maguire:



[("n't", -5.428547395403892),
 ('in', -5.542443736374114),
 ('is', -5.694981912898374),
 ('day', -5.750810208778898),
 ('his', -5.809561882296956),
 ('do', -5.922333833699507),
 ('room', -5.924146925486849),
 ("'m", -5.942145767086766),
 ('int', -6.322948129128177),
 ('night', -6.47454817747116),
 ('with', -6.677140054541707),
 ('``', -6.758807274835594),
 ('her', -7.159767774987964),
 ('my', -7.165257175308368),
 ('me', -7.496562226880489),
 ('him', -7.608793271750172),
 ('on', -7.662279649651411),
 ("''", -8.148035251120648),
 ('she', -8.248173887980224),
 ('...', -8.931941782113322),
 ('he', -9.376475351759723),
 ('a', -10.067238918738651),
 ('you', -10.12467992561834),
 ('i', -10.75946397829889),
 ("'s", -11.549952273385347)]

The last 25 tokens of the sorted list are those best aligned with the Romance movie *Jerry Maguire*; because the list is in descending order, the strongest associations appear at the bottom. It is interesting that many of these tokens are pronouns: `his`, `her`, `he`, `she`, `me`, and `you`. Pronouns occur frequently because screenplays depict the actions and thoughts of characters from a third-person point of view. There are also words that describe the time and location of scenes, such as `day`, `night`, and `room`, which are also relevant to the genre.
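
Because the function returns the tail of a descending sort, the class-2 list above runs from its weakest token to its strongest. A small sketch for showing either list strongest-first follows; the helper name `strongest_first` is hypothetical, and the three sample scores are rounded from the output above.

```python
def strongest_first(scores):
    """Order (token, score) pairs by absolute score, largest first."""
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# three entries rounded from the class-2 output above
class2_tail = [("n't", -5.43), ("you", -10.12), ("'s", -11.55)]
for token, score in strongest_first(class2_tail):
    print(token, score)
```

Sorting by absolute value works for both lists, since the class-1 scores are positive and the class-2 scores are negative.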