# Homework 4: Applying Dunning's log-likelihood to 19c poetry

I've put my "solutions" to in-class exercises on the Moodle, except for Exercise 2, which has become our homework assignment. 

## Problem 1.

Find 25 words that are overrepresented in poetry reviewed by elite 19c magazines, as compared to other works of poetry that didn't get reviewed in those venues. Also list 25 words that are overrepresented in poetry that didn't get reviewed.

To do this, you'll need to copy over some of the functions from our Week 4 exercises, and also copy over the code from our in-class Exercise #1, editing it so that it divides the corpus.

Here's some code to get us started. I load some modules we're likely to need, and then load the ```poefic``` corpus.

Then I filter the ```poefic``` DataFrame to have only poetry. I'm doing this for two reasons. The first is that I'm a little concerned that the size of the data is posing a problem on some computers. The other, more immediate, reason is that this dataset only has an even distribution of the "reception" variable in poetry. (Almost all the fiction I gave you was reviewed in elite venues.)

In [3]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

def tokenize(astring):
    ''' Breaks a string into words, and counts them.
    Designed so it strips punctuation and lowercases everything,
    but doesn't separate hashtags and at-signs.
    '''
    wordcounts = Counter()
    # create a counter to hold the counts
    
    tokens = astring.split()
    for t in tokens:
        word = t.strip(',.!?:;-—()<>[]/"\'').lower()
        wordcounts[word] += 1
        
    return wordcounts

def addcounters(counter2add, countersum):
    ''' Adds all the counts in counter2add to countersum.
    Because Counters(like dictionaries) are mutable, it
    doesn't need to return anything.
    '''
    
    for key, value in counter2add.items():
        countersum[key] += value

def create_vocab(seq_of_strings, n):
    ''' Given a sequence of text snippets, this function
    returns the n most common words. We'll use this to
    create a limited 'vocabulary'.
    '''
    vocab = Counter()
    for astring in seq_of_strings:
        counts = tokenize(astring)
        addcounters(counts, vocab)
    topn = [x[0] for x in vocab.most_common(n)]
    return topn

# Let's test the vocabulary function.
# vocab = create_vocab(trump['text'], 4000)
# vocab[0:10]

def logodds(countsA, countsB, word):
    ''' Straightforward.
    '''
    
    odds = (countsA[word] + 1) / (countsB[word] + 1)
    
    # Why do we add 1 on both sides? Two reasons. The hacky one is 
    # that otherwise we'll get a division-by-zero error whenever
    # word isn't present in countsB. The more principled reason
    # is that this technique (called Laplacian smoothing) tends
    # to reduce the dramatic disproportion likely to be found in
    # very rare words.
    
    return math.log(odds)

def signed_dunnings(countsA, totalA, countsB, totalB, word):
    ''' Less straightforward. This function calculates a signed (+1 / -1)
    version of Dunning's log likelihood. Intuitively, this is a number 
    that gets larger as the frequency of the word in our two corpora
    diverges from its EXPECTED frequency -- i.e., the frequency it would
    have if it were equally distributed over both. But it also tends to get
    larger as the raw frequency of the word increases.
    
    Note that this function requires two additional arguments:
    the total number of words in A and B. We could calculate that inside
    the function, but it's faster to calculate it just once, outside the function.
    
    Also note: the strict definition of Dunnings has no 'sign': it gets bigger
    whether a word is overrepresented in A or B. I've edited that so that Dunnings
    is positive if overrepresented in A, and negative if overrepresented in B.
    '''
    if word not in countsA and word not in countsB:
        return 0
    
    # the raw frequencies of this word in our two corpora
    # still doing a little Laplacian smoothing here
    a = countsA[word] + 0.1
    b = countsB[word] + 0.1
    
    # now let's calculate the expected number of times this
    # word would occur in both if the frequency were constant
    # across both
    overallfreq = (a + b) / (totalA + totalB)
    expectedA = totalA * overallfreq
    expectedB = totalB * overallfreq
    
    # and now the Dunning's formula
    dunning = 2 * ((a * math.log(a / expectedA)) + (b * math.log(b / expectedB)))
    
    if a < expectedA:
        return -dunning
    else:   
        return dunning

# a set of common words is often useful
stopwords = {'a', 'an', 'are', 'and', 'but', 'or', 'that', 'this', 'so', 
             'all', 'at', 'if', 'in', 'i', 'is', 'was', 'by', 'of', 'to', 
             'the', 'be', 'you', 'were'}

# finally, one more function: given a list of tuples like
# testlist = [(10, 'ten'), (2000, 'two thousand'), (0, 'zero'), (-1, 'neg one'), (8, 'eight')]
# we're going to want to sort them and print the top n and bottom n

def headandtail(tuplelist, n):
    tuplelist.sort(reverse = True)
    print("TOP VALUES:")
    for i in range(n):
        print(tuplelist[i][1], tuplelist[i][0])
    
    print()
    print("BOTTOM VALUES:")
    lastindex = len(tuplelist) - 1
    for i in range(lastindex, lastindex - n, -1):
        print(tuplelist[i][1], tuplelist[i][0])
        
# headandtail(testlist, 2)

In [2]:
import os, csv, math
import pandas as pd
import numpy as np

from collections import Counter

cwd = os.getcwd()
print('Current working directory: ' + cwd + '\n')
      
relativepath = os.path.join('..', 'data', 'weekfour', 'poefic.csv')
poefic = pd.read_csv(relativepath)

# FILTERING BY ROW TO GET ONLY THE POETRY
poefic = poefic[poefic['genre'] == 'poetry']
# equivalent to
    # poefic = poefic.loc[poefic['genre] == poetry, : ]
poefic.head()

# poefic_noelite = poefic[poefic['reception'] != 'elite']
# poefic_noelite.shape

Current working directory: /Users/rdubnic2/Documents/lis590dsh/Code



Unnamed: 0,date,author,title,genre,reception,text
359,1835,"Browning, Robert,",Paracelsus,poetry,remove,Paracelsus. We 154 PARACELSUS [BOOK III Not ea...
360,1833,"Browning, Robert,",Pauline,poetry,remove,"all, I sought How best life’s end might be att..."
361,1855,"Arnold, Matthew,",Poems,poetry,elite,"grace, and Wisdom be too proud To halve a lodg..."
362,1867,"Arnold, Matthew,",New poems,poetry,elite,from the West was then in shade. Ah ! now 'tis...
363,1861,"Mangum, A. W.",The holy shield,poetry,vulgar,happy hgme which he had exchange d for the ten...


**A small digression about the code above**

It's worth dwelling for a moment on the statement that does filtering by row. Notice that if you index a pandas DataFrame with a single string, like ```poefic['genre']```, you get a column. But if you generate a series of Boolean values, and use *that* to index the DataFrame, like so,

```poefic[poefic['genre'] == 'poetry']```

You'll be selecting *rows* where the series has a value ```True.```

If it's not clear what I mean by "generating a series of Boolean values," look at the result of the cell below. (You can delete the cell below when you're working on the homework; this is all a digression.) Notice also, in the code above, that you can also use the ```.loc``` method to specify rows and columns at the same time if you want to. In this case I haven't specified a column for ```.loc``` to select, the ``` : ``` after the comma is a way of saying "all the columns."

In [4]:
poefic['reception'] == 'elite'

359     False
360     False
361      True
362      True
363     False
364     False
365     False
366     False
367      True
368      True
369     False
370     False
371      True
372      True
373      True
374     False
375      True
376     False
377      True
378      True
379     False
380      True
381     False
382     False
383      True
384      True
385      True
386      True
387      True
388     False
        ...  
997     False
998      True
999      True
1000     True
1001     True
1002    False
1003     True
1004    False
1005    False
1006     True
1007     True
1008     True
1009     True
1010     True
1011     True
1012    False
1013    False
1014    False
1015    False
1016     True
1017    False
1018    False
1019    False
1020    False
1021    False
1022     True
1023    False
1024    False
1025    False
1026    False
Name: reception, dtype: bool

In [4]:
# CODE FOR PROBLEM 1

# You'll need to copy over the functions you need: things like "tokenize" will 
# certainly be necessary.

# I recommend removing stopwords, but test, and see what happens if you don't.

# The column 'reception' has several possible values, including 'elite' (was
# reviewed in elite journals), and 'vulgar' (which doesn't mean the poetry was
# obscene, but is just a wry way of saying it didn't turn up in our sample of 
# reviews). You want to contrast these two groups. Leave out other rows, where
# 'reception' has a value like 'remove.'

# After you've run code to produce the top 25 and bottom 25 words, sorted by 
# signed Dunnings, write a few sentences of commentary below.

poefic.shape

poefic.index = range(poefic.shape[0])
# print(poefic.index)

vocab = create_vocab(poefic['text'], 50000)
print('Total vocab = ', len(vocab))

unique_vocab = set(vocab)
print('Total unique words in vocab =', len(unique_vocab))

# An optional step: removing stopwords
vocab_no_stop = list(set(vocab) - stopwords)

print('Total vocab minus stop words = ', len(vocab_no_stop))

# Create counters for the review and not reviewed corpora.

review = Counter()
noreview = Counter()

num_rows = poefic.shape[0]
print('Total number of rows in Poefic dataset = ', num_rows)

ignored = 0

for i in range(num_rows):
    counts = tokenize(poefic['text'][i])
    if 'elite' in poefic['reception'][i]:
        addcounters(counts, review)
    if 'vulgar' in poefic['reception'][i]:
        addcounters(counts, noreview)
    else:
        ignored += 1

print('Words in ignored texts: '+ str(ignored))

review_count = len(review)
noreview_count = len(noreview)

print('Total words in poems reviewed = ', review_count)
print('Total words in poems not reviewed = ', noreview_count)

total_poefic_words = Counter()
addcounters(review, total_poefic_words)
addcounters(noreview, total_poefic_words)

print('Total word count = ', len(total_poefic_words))

rep_review_words_dunnings = []
rep_review_words_nostop_dunnings = []

for wrd in vocab:
    x = signed_dunnings(review, review_count, noreview, noreview_count, wrd)
    rep_review_words_dunnings.append((x, wrd))

# print(len(rep_review_words_dunnings))
    
for wrd in vocab_no_stop:
    y = signed_dunnings(review, review_count, noreview, noreview_count, wrd)
    rep_review_words_nostop_dunnings.append((y, wrd))

# print(len(rep_review_words_nostop_dunnings))

print('    '+'FOR DUNNINGS:')
headandtail(rep_review_words_dunnings, 25)

print('    '+'FOR DUNNINGS, STOP WORDS REMOVED:')
headandtail(rep_review_words_nostop_dunnings, 25)

# I went ahead and did log analysis as well, as I didn't catch that it wasn't 
#  asked for specifically until after finishing.

# rep_review_words_log = []
# rep_review_words_nostop_log = []

# for word in vocab:
#     z = logodds(review, noreview, word)
#     rep_review_words_log.append((z, word))

# print(len(rep_review_words_log))
    
# for word in vocab_no_stop:
#     w = logodds(review, noreview, word)
#     rep_review_words_nostop_log.append((w, word))

# print(len(rep_review_words_nostop_log))
    
# print('    '+'FOR LOG ODDS:')
# headandtail(rep_review_words_log, 25)

# print('    '+'FOR LOG ODDS, STOP WORDS REMOVED:')
# headandtail(rep_review_words_nostop_log, 25)

Total vocab =  38254
Total unique words in vocab = 38254
Total vocab minus stop words =  38231
Total number of rows in Poefic dataset =  668
Words in ignored texts: 365
Total words in poems reviewed =  24967
Total words in poems not reviewed =  25192
Total word count =  36116
    FOR DUNNINGS:
TOP VALUES:
i 167.76726952723084
 135.90025363219797
and 134.8437370701729
her 123.63473157251235
she 113.56243768397769
the 107.96082971148235
a 78.53690586579944
wind 58.45393239811642
pb 53.26838289299076
isis 44.96097587392186
face 43.95188980741705
me 41.36231391262953
old 39.284928591681904
out 38.38242829901128
osiris 38.017244905451996
o 35.806851198561844
grey 35.41520615776817
white 35.32557111412548
man 35.23672856824297
eyes 34.25347474466608
you 31.480921256959988
into 31.280510012192934
down 30.225223816065693
my 28.67224996707995
ye 28.631183360844503

BOTTOM VALUES:
jesus -59.69241425151195
our -43.113445711577924
emma -38.71232907582554
reign -32.34866408346143
tis -29.6479991419

#### Brief commentary on results.

This isn't a class on 19th-century poetry, so I don't expect you to fully
interpret the results. (As Clarice was rightly suggesting in class, it's
necessary to actually read a few documents before we're in a position to
interpret quantitative patterns.) But you might be able to speculate or
form tentative hypotheses based on a selection of distinctive words.

## Shallow Analysis
Analysis might suggest that Christian texts were not reviewed by elite journals as much as other texts, many of which include characters with non-Christian (or in the case of Dante, non-mainstream Christian) religious characters/themes. In the least likely to appear list we have "jesus," "saviour" (admittedly an ambiguous term, with regard to religion), "prayer," and "herodias." This might possibly be one long text throwing off the results, but it's hard to say without an in-depth look at the data. It might make sense to break out this way, if the poetry in the set belongs to hymn or prayer books, which aren't meant for critical review, and are maybe only tenuously classified as fiction, in the traditional sense, at all.

Strangely (and interestingly), the white space character is in the top 2 with and without stop words. Perhaps this is a character that should be added to the stop words (possibly, depending on your research questions), but it does seem to imply that poetry with more "experimental" line structure--such as poetry with fewer words per line--is overwhelmingly more often reviewed than poetry without. With a closer look at dates and the history of poetry, it might make a lot of sense, if this sort of line structure was more popular at the time. Reviewers tend to put more focus on the artists/work that is in the current trend, or by artists who have worked in that trend, than they do for that outside of current trends.