This notebook examines the words that distinguish the [2020 Democratic Party platform](https://democrats.org/wp-content/uploads/2020/08/2020-Democratic-Party-Platform.pdf) from the [2020/2016 Republican Party platform](https://prod-cdn-static.gop.com/docs/Resolution_Platform_2020.pdf) (both OCR'd with Abbyy FineReader), using the $\chi^2$ and Mann-Whitney tests.

In [1]:
import sys
import json
import nltk
import math
import operator
from collections import Counter
from scipy.stats import mannwhitneyu

In [2]:
def read(filename):
    with open(filename, encoding="utf-8") as file:
        # lowercase text
        return file.read().lower()

In [3]:
democratText=read("../data/democrat_platform_2020.txt")

In [4]:
republicanText=read("../data/republican_platform_2020.txt")

Explore your assumptions about the words you think will most distinguish the Democratic and Republican platforms.  Before looking at the results of the tests, which words do you expect to be comparatively distinctive of each?  (If you're not familiar with either, scan the platforms linked above.)

Hypotheses: 

Democrats are likely to be characterized by keywords such as freedom, liberty, women, and right, while Republicans are likely to feature keywords such as justice, people, and budget.

In [6]:
def tokenize(data):
    return nltk.word_tokenize(data)

In [7]:
def get_counts(tokens):
    counts=Counter()
    for token in tokens:
        counts[token]+=1
    return counts

The $\chi^2$ test, as used in comparing texts, measures how statistically significant the distribution of counts in a 2x2 contingency table is.  Use the following function to analyze the difference between the platforms.  How do the most distinct terms comport with your assumptions?
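
For reference, the statistic computed by the function below can be written out as follows. The cell labels follow the variable names in the code and are not necessarily the notation of the cited text: $O_{11}$ and $O_{12}$ are the counts of the word in the first and second corpus, $O_{21}$ and $O_{22}$ are the counts of all other tokens in each corpus, and $N$ is the total number of tokens in both.

$$\chi^2 = \frac{N\,(O_{11}O_{22} - O_{12}O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})}$$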

In [8]:
def chi_square(one_counts, two_counts):

    one_sum=0.
    two_sum=0.
    vocab={}
    for word in one_counts:
        one_sum+=one_counts[word]
        vocab[word]=1
    for word in two_counts:
        vocab[word]=1
        two_sum+=two_counts[word]

    N=one_sum+two_sum
    vals={}
    
    for word in vocab:
        O11=one_counts[word]
        O12=two_counts[word]
        O21=one_sum-one_counts[word]
        O22=two_sum-two_counts[word]
        
        # We'll use the simpler form given in Manning and Schuetze (1999) 
        # for 2x2 contingency tables: 
        # https://nlp.stanford.edu/fsnlp/promo/colloc.pdf, equation 5.7
        
        vals[word]=(N*(O11*O22 - O12*O21)**2)/((O11 + O12)*(O11+O21)*(O12+O22)*(O21+O22))
        
    sorted_chi = sorted(vals.items(), key=operator.itemgetter(1), reverse=True)
    one=[]
    two=[]
    for k,v in sorted_chi:
        if one_counts[k]/one_sum > two_counts[k]/two_sum:
            one.append(k)
        else:
            two.append(k)
    
    print ("Democrat:\n")
    for k in one[:20]:
        print("%s\t%s" % (k,vals[k]))

    print ("\n\nRepublican:\n")
    for k in two[:20]:
        print("%s\t%s" % (k,vals[k]))

In [9]:
democrat_tokens=tokenize(democratText)
democrat_counts=get_counts(democrat_tokens)

In [10]:
republican_tokens=tokenize(republicanText)
republican_counts=get_counts(republican_tokens)

In [11]:
chi_square(democrat_counts, republican_counts)

Democrat:

democrats	363.5351093621297
will	359.5218076288936
and	216.96130188397586
health	136.8219687920415
trump	103.78980852228422
including	98.64837668848244
care	72.23926563658084
workers	69.87282904669634
pandemic	68.87152165109498
believe	59.449120769409085
covid-19	49.67983456368194
color	47.93560592228924
ensure	40.59885319041926
,	40.28339268660182
native	39.006436354265425
housing	38.63232476674289
investments	38.22087286926605
affordable	36.83400407434361
expand	34.39374416675337
clean	33.908189816758096


Republican:

the	140.22224554283582
of	105.2173628369195
—	95.72655635923344
government	69.32917787095215
republican	64.14154427012217
current	63.01371009014764
is	57.36040628425552
a	56.96910698056252
it	55.4479146995238
congress	45.29329534988732
their	44.404098458032365
be	43.98452976297621
urge	42.89292820199374
★	42.49664909272419
healthcare	40.19863033601007
its	39.91111642081513
:	39.34546458484308
call	38.92239613060992
republicans	36.61108000787439
must	35.36099

Are these results surprising? Examine specific words to check their frequency in both datasets.

Words that are frequent in only one corpus are emphasized most, e.g. `democrats`.

In [12]:
print("Totals: R: %s, D: %s" % (len(republican_tokens), len(democrat_tokens)))

Totals: R: 41484, D: 47627


In [13]:
word="the"
print("%s -- R: %s, D: %s" % (word, republican_counts[word], democrat_counts[word]))

the -- R: 2369, D: 1910
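
As a sanity check, the $\chi^2$ score for a single word can be recomputed by hand from the counts printed above. The sketch below plugs the token totals and the counts for "the" into the same 2x2 formula used in `chi_square`; the helper name `chi_square_2x2` is introduced here only for illustration.

```python
def chi_square_2x2(count_one, total_one, count_two, total_two):
    """Chi-square for one word from a 2x2 contingency table (hypothetical helper)."""
    O11 = count_one
    O12 = count_two
    O21 = total_one - count_one
    O22 = total_two - count_two
    N = total_one + total_two
    return (N * (O11 * O22 - O12 * O21) ** 2) / (
        (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22)
    )

# Counts taken from the cell outputs above (D = Democrat, R = Republican).
d_total, r_total = 47627, 41484
d_the, r_the = 1910, 2369

print("relative frequency -- R: %.4f, D: %.4f" % (r_the / r_total, d_the / d_total))
score_the = chi_square_2x2(d_the, d_total, r_the, r_total)
print("chi-square for 'the': %.2f" % score_the)
```

The result should agree with the score listed for "the" in the Republican column (about 140.2), and the relative frequencies show why it lands there: "the" makes up a larger share of the Republican text.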


We saw earlier that $\chi^2$ is not a perfect estimator since it doesn't account for the "burstiness" of language: if we see the word "Dracula" in a text, we're probably going to see it again in that same text. Occurrences of words are not independent random events; they are tightly coupled with each other. If we're trying to understand the robust differences between two corpora, we might prefer to prioritize words that show up more frequently everywhere in corpus A (but not in corpus B) over those that show up very frequently only within a narrow slice of A (such as one text in a genre, one chapter in a book, or one speaker when measuring the differences between political parties). Use the following functions to run the Mann-Whitney test, which accounts for this phenomenon while finding distinctive terms.

In [14]:
def count_differences(one_tokens, two_tokens):
    one_N=len(one_tokens)
    two_N=len(two_tokens)
    
    one_counts=Counter()
    two_counts=Counter()
    
    vocab={}
    for token in one_tokens:
        one_counts[token]+=1
        vocab[token]=1
        
    for token in two_tokens:
        two_counts[token]+=1    
        vocab[token]=1
        
    differences={}
    for word in vocab:
        freq1=one_counts[word]/one_N
        freq2=two_counts[word]/two_N
        
        diff=freq1-freq2
        differences[word]=diff
        
    return differences

# convert a sequence of tokens into counts for each chunkLength-word window
def get_chunk_counts(tokens, chunkLength):
    chunks=[]
    for i in range(0, len(tokens), chunkLength):
        counts=Counter()
        for j in range(chunkLength):
            if i+j < len(tokens):
                counts[tokens[i+j]]+=1
        chunks.append(counts)
    return chunks

# calculate mann-whitney test for each word in vocabulary
def mann_whitney(one_tokens, two_tokens):

    chunkLength=500
    one_chunks=get_chunk_counts(one_tokens, chunkLength)
    two_chunks=get_chunk_counts(two_tokens, chunkLength)
    
    # vocab is the union of terms in both sets
    vocab={}
    
    for chunk in one_chunks:
        for word in chunk:
            vocab[word]=1
    for chunk in two_chunks:
        for word in chunk:
            vocab[word]=1
    
    pvals={}
    
    for word in vocab:
        
        a=[]
        b=[]
        
        # Note a and b can be different lengths (i.e., different sample sizes)
        # 
        # See Mann and Whitney (1947), "On a Test of Whether one of Two Random 
        # Variables is Stochastically Larger than the Other"
        # https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730491
        
        # (This is part of their innovation over the case of equal sample sizes in Wilcoxon 1945)
        
        for chunk in one_chunks:
            a.append(chunk[word])
        for chunk in two_chunks:
            b.append(chunk[word])

        statistic,pval=mannwhitneyu(a,b, alternative="two-sided")
        
        # We'll use the p-value as our quantity of interest.  [Note in the normal approximation
        # that Mann-Whitney uses to assess significance for large sample sizes, the significance 
        # of the raw statistic depends on the number of ties in the data, so the statistic itself
        # isn't exactly comparable across different words]
        pvals[word]=pval

    return pvals
    
# calculate mann-whitney for each word in vocabulary and present the top 10 terms for each group
def mann_whitney_analysis(one_tokens, two_tokens):
    
    pvals=mann_whitney(one_tokens, two_tokens)
    
    # Mann-Whitney tells us the significance of a term's difference in two groups, but we also 
    # need the directionality of that difference (whether it's used more by group A or group B).
    
    # Let's use our difference-in-proportions function above to check the directionality.  
    # [Note we could also measure directionality by checking whether the Mann-Whitney statistic
    # is greater or less than the mean=len(one_chunks)*len(two_chunks)*0.5.]

    differences=count_differences(one_tokens, two_tokens)
    
    one_terms={k : pvals[k] for k in pvals if differences[k] <= 0}
    two_terms={k : pvals[k] for k in pvals if differences[k] > 0}
    
    sorted_pvals = sorted(one_terms.items(), key=operator.itemgetter(1))
    print("More Republican:\n")
    for k,v in sorted_pvals[:10]:
        print("%s\t%.15f" % (k,v))

    print("\nMore Democrat:\n")
    sorted_pvals = sorted(two_terms.items(), key=operator.itemgetter(1))
    for k,v in sorted_pvals[:10]:
        print("%s\t%.15f" % (k,v))
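
To make the chunking step concrete, here is a small sketch of what `get_chunk_counts` produces on a toy token list. It uses a slice-based variant (named `chunk_counts` here purely for illustration) that should behave the same way; the toy tokens are made up.

```python
from collections import Counter

def chunk_counts(tokens, chunk_length):
    # One Counter per consecutive window of chunk_length tokens;
    # the last window may contain fewer tokens.
    return [Counter(tokens[i:i + chunk_length])
            for i in range(0, len(tokens), chunk_length)]

toy_tokens = "a b a c a b d".split()
toy_chunks = chunk_counts(toy_tokens, 3)

# Per-chunk counts of "a" are the kind of sample that Mann-Whitney compares.
a_samples = [chunk["a"] for chunk in toy_chunks]
print(a_samples)
```

Note that the final chunk is shorter whenever the token count is not a multiple of the chunk length. With `chunkLength=500` and the token totals printed earlier, the last chunk of each platform is partial, so its counts are slightly deflated relative to full chunks.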

In [15]:
mann_whitney_analysis(democrat_tokens, republican_tokens)

More Republican:

the	0.000000000000028
—	0.000000000000037
of	0.000000000002035
republican	0.000000000003047
.	0.000000000006887
current	0.000000000007479
a	0.000000000138855
:	0.000000002872820
republicans	0.000000003159717
urge	0.000000008079836

More Democrat:

democrats	0.000000000000000
and	0.000000000000000
will	0.000000000000000
trump	0.000000000000000
including	0.000000000001298
believe	0.000000000002003
pandemic	0.000000000082672
covid-19	0.000000000295251
ensure	0.000000015554296
color	0.000000056605160
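
A toy example may help show why chunk-level testing and $\chi^2$ can disagree in general. The sketch below computes the Mann-Whitney $U$ statistic directly from its pairwise definition (pure Python, with ties counted as one half) rather than calling `scipy`; the function name `u_statistic` and the toy counts are assumptions made for illustration. Two words each occur 20 times in corpus A and never in corpus B, so a test based only on total counts scores them identically, but one of them is concentrated in a single chunk.

```python
def u_statistic(a, b):
    # U for sample a: number of (x, y) pairs with x > y, ties counted as 0.5.
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Per-chunk counts over 10 chunks of corpus A and 10 chunks of corpus B.
bursty_a = [20, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # all 20 uses in one chunk
spread_a = [2] * 10                           # 2 uses in every chunk
absent_b = [0] * 10                           # never used in corpus B

# Value of U expected when the two samples are indistinguishable.
null_mean = len(bursty_a) * len(absent_b) * 0.5

u_bursty = u_statistic(bursty_a, absent_b)
u_spread = u_statistic(spread_a, absent_b)
print("null mean: %.1f, bursty: %.1f, spread: %.1f" % (null_mean, u_bursty, u_spread))
```

The bursty word sits close to the null mean, while the evenly spread word reaches the maximum possible $U$, which is the behavior the chunked test is meant to reward.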


How are the differences identified by Mann-Whitney similar to, and different from, those identified by $\chi^2$?  What conclusions would you draw from the differences between these platforms?

The words identified by the Mann-Whitney test are very similar to those identified by the $\chi^2$ test. One plausible explanation is that these salient words occur consistently across the chunks of each platform rather than being concentrated in a few passages.