## Analyze Descriptive Passages

* setup
* content
* parts of speech based
* time series
* topic modeling
* neural

### Setup

In [100]:
# data
import numpy as np
import pandas as pd

# POS
import spacy

# nltk for wordnet and tokenization
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import WordNetError
from nltk import sent_tokenize
from nltk import word_tokenize

In [32]:
def read_as_df(filename):
    '''
    Read the .csv of descriptive passages into a pandas DataFrame.

    The header row supplies the column names:
    [blank], passage, book, left_claim, left_claim_keywords,
    right_claim, right_claim_keywords, claim_id, passage_id,
    passage_size, match_output
    '''
    # read data
    df = pd.read_csv(filename)
    # keep only rows whose match_output is not 'Right' (i.e. Left/Both)
    df = df[df['match_output'] != 'Right']
    # drop the unnamed first column of row numbers
    # (reassign rather than use inplace=True on a filtered slice)
    df = df.drop(columns=['Unnamed: 0'])
    return df

In [105]:
def five_number(data):
    '''
    Helper for reporting the 5-number summary of a list, array, or Series
    '''
    # convert first so that plain Python lists work too
    data = np.asarray(data)
    # calculate quartiles
    quartiles = np.percentile(data, [25, 50, 75])
    # calculate min/max
    data_min, data_max = data.min(), data.max()
    # print 5-number summary
    print('Min: %.3f' % data_min)
    print('Q1: %.3f' % quartiles[0])
    print('Median: %.3f' % quartiles[1])
    print('Q3: %.3f' % quartiles[2])
    print('Max: %.3f' % data_max)

In [33]:
descriptive_df = read_as_df('data/descriptive_claims_subset.csv')

In [34]:
descriptive_df.shape

(2383, 10)

In [93]:
descriptive_df.columns

Index(['passage', 'book', 'left_claim', 'left_claim_keywords', 'right_claim',
       'right_claim_keywords', 'claim_id', 'passage_id', 'passage_size',
       'match_output'],
      dtype='object')

### Content Work
* descriptive words / total words (quite pessimistic)
* words per unique thing (Tenen) -- in just these descriptive passages; aka Unique Clutter Distance
* words per thing (Tenen) -- in just these descriptive passages (self-selecting sample); aka Clutter Distance
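
The two words-per-thing ratios above could be computed roughly as follows. This is a minimal sketch under assumptions the notes do not spell out: a "thing" is taken to be any token tagged `NOUN`, and uniqueness is judged by lemma, which may not match Tenen's exact definition. The function name `clutter_distances` and the `(lemma, pos)` input format are hypothetical; in this notebook the pairs would come from spaCy tokens.

```python
def clutter_distances(tagged_tokens):
    """Return (words per thing, words per unique thing) for one passage.

    tagged_tokens: list of (lemma, pos) pairs with punctuation removed.
    Assumption: a "thing" is any token whose pos is "NOUN".
    """
    total_words = len(tagged_tokens)
    things = [lemma for lemma, pos in tagged_tokens if pos == "NOUN"]
    if not things:
        # no things at all: both ratios are undefined
        return None, None
    return total_words / len(things), total_words / len(set(things))


# toy passage: "the cat watched the other cat and the dog"
toy = [("the", "DET"), ("cat", "NOUN"), ("watch", "VERB"), ("the", "DET"),
       ("other", "ADJ"), ("cat", "NOUN"), ("and", "CCONJ"), ("the", "DET"),
       ("dog", "NOUN")]
per_thing, per_unique_thing = clutter_distances(toy)
```

Here the toy passage has nine words, three nouns, and two unique noun lemmas, so a repeated noun raises the unique ratio but not the plain one.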

### Parts of Speech Based Analysis

* spaCy on each description
    * general counts of adjectives, prepositions, and pronouns, which can be reused for later analysis
* column view a la Bal, Tenen
* specificity (Nelson 2020)
    * per descriptive passage, calculate specificity rating

#### Column View (Bal, Tenen)

In [157]:
def noun_inventory(sample, passage_id):
    '''
    Build three parallel columns for the nouns in a passage:
    the noun, its grammatical category (typed dependency),
    and its WordNet supersense.
    '''
    tagged_sample = nlp(sample)
    nouns = []
    grammatical_categories = []
    supersenses = []
    for word in tagged_sample:
        if word.pos_ == "NOUN":
            synset = word.lemma_ + ".n.01"
            # occasionally, an adjective will be misidentified as a noun;
            # in that case, catch the exception (since the word won't have a true synset)
            try:
                synset_def = wn.synset(synset)
            except WordNetError:
                continue
            # see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L364
            # store the lexicographer filename - https://wordnet.princeton.edu/documentation/lexnames5wn
            # (use the public accessor rather than the private _lexname attribute)
            lex_name = synset_def.lexname()
            # store the word
            nouns.append(word)
            # store the dependency parse
            # see https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf
            grammatical_categories.append(word.dep_)
            supersenses.append(lex_name)
    return nouns, grammatical_categories, supersenses

In [158]:
noun_inventory("The cat jumped over the dog.", 1234)

The : det
cat : nsubj
jumped : ROOT
over : prep
the : det
dog : pobj
. : punct


([cat, dog], ['nsubj', 'pobj'], ['noun.animal', 'noun.animal'])
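
The three parallel lists returned by `noun_inventory` can be lined up as one row per noun to get the column view. The sketch below is only an illustration: `column_view` is a hypothetical helper, and plain strings stand in for the spaCy tokens so that it runs without a language model.

```python
from collections import Counter


def column_view(passage_id, nouns, grammatical_categories, supersenses):
    """Zip the three parallel lists into one row (dict) per noun."""
    return [
        {"passage_id": passage_id, "noun": str(n), "dep": d, "supersense": s}
        for n, d, s in zip(nouns, grammatical_categories, supersenses)
    ]


# values copied from the example output above (tokens replaced by strings)
rows = column_view(1234, ["cat", "dog"], ["nsubj", "pobj"],
                   ["noun.animal", "noun.animal"])

# tally how often each supersense fills each grammatical slot
slot_counts = Counter((r["dep"], r["supersense"]) for r in rows)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)` if a tabular view is wanted.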

#### Specificity (Nelson)

In [156]:
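# load the small English pipeline; named-entity recognition is not needed here
nlp = spacy.load('en_core_web_sm', disable=['ner'])
# disable= only switches the component off; remove_pipe drops it from the pipeline
nlp.remove_pipe('ner')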

('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fbb1c920700>)

In [36]:
# Helper from http://www.nltk.org/howto/wordnet.html; passed to Synset.closure() to get *all* of a synset's hypernyms
hyper = lambda s: s.hypernyms()

In [120]:
def specificity(sample, passage_id):
    '''
    Consult WordNet for the position of each noun and verb in the hypernym hierarchy.
    Based on the current state of the art, it is acceptable to simply take the first (.01) synset.

    Args:
        sample: the passage text (a string, parsed here with spaCy)
        passage_id: identifier of the passage (currently unused)

    Return:
        specificity: mean number of hypernyms per noun/verb, conveying the
            "specificity" of the input, via Nelson (2020); 0 if nothing could be scored
    '''
    tagged_sample = nlp(sample)
    hyper_sum = 0
    noun_and_verb_count = 0
    for word in tagged_sample:
        if word.pos_ == "NOUN" or word.pos_ == "VERB":
            # if it's a verb, get the most common verb hypernym chain
            # else, get the most common noun hypernym chain
            tag = "n" if word.pos_ == "NOUN" else "v"
            synset = word.lemma_ + "." + tag + ".01"
            # occasionally, an adjective will be misidentified as a noun;
            # in that case, catch the exception (since the word won't have a true synset)
            # and leave the word out of the count
            try:
                synset_def = wn.synset(synset)
            except WordNetError:
                continue
            noun_and_verb_count += 1
            hyper_sum += len(list(synset_def.closure(hyper)))

    # a few 'descriptive' passages lack any nouns or verbs with a WordNet synset
    if noun_and_verb_count == 0:
        return 0
    return hyper_sum / noun_and_verb_count

In [121]:
specificity_scores = descriptive_df.apply(lambda x: specificity(x.passage, x.passage_id), axis=1)

In [122]:
five_number(specificity_scores)

Min: 0.000
Q1: 3.444
Median: 4.333
Q3: 5.200
Max: 10.667
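
To make the score concrete, the arithmetic behind `specificity` can be traced on a toy hierarchy. This is a sketch only: `TOY_HYPERNYMS`, `hypernym_count`, and `toy_specificity` are hypothetical names, and the hand-made chain stands in for WordNet, whose real chains are longer and can branch.

```python
# Hypothetical single-parent hierarchy (child -> hypernym); not WordNet's real data.
TOY_HYPERNYMS = {
    "poodle": "dog",
    "dog": "animal",
    "animal": "entity",
    "thing": "entity",
}


def hypernym_count(word):
    """Count the ancestors reached by following hypernym links to the root."""
    count = 0
    while word in TOY_HYPERNYMS:
        word = TOY_HYPERNYMS[word]
        count += 1
    return count


def toy_specificity(words):
    """Mean hypernym count over the words; 0 for an empty list."""
    if not words:
        return 0
    return sum(hypernym_count(w) for w in words) / len(words)


specific = toy_specificity(["poodle", "dog"])   # (3 + 2) / 2
generic = toy_specificity(["thing", "animal"])  # (1 + 1) / 2
```

Words that sit deeper in the hierarchy have more hypernyms above them, so a passage built from them scores higher.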


### Time Series

Would need:
* number of fragments total
* number of descriptive fragments
* publish years for each work
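
Once those three inputs exist, one possible series is the share of descriptive fragments per publication year. The sketch below assumes that reading of the plan; the record layout, the function name `descriptive_share_by_year`, and all numbers are made up for illustration.

```python
from collections import defaultdict

# (publish_year, total_fragments, descriptive_fragments) per work; made-up values
works = [
    (1850, 200, 30),
    (1850, 100, 30),
    (1900, 400, 40),
]


def descriptive_share_by_year(records):
    """Return {year: descriptive fragments / total fragments}, ordered by year."""
    totals = defaultdict(lambda: [0, 0])
    for year, total, descriptive in records:
        totals[year][0] += total
        totals[year][1] += descriptive
    return {
        year: descriptive / total
        for year, (total, descriptive) in sorted(totals.items())
        if total > 0
    }


share = descriptive_share_by_year(works)
```

Pooling the counts per year before dividing weights long works more heavily than short ones; averaging per-work ratios instead would be an equally defensible choice.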

### Topic Model

* what is each description/claim talking about

### Embeddings
* encode each description with the Universal Sentence Encoder, then cluster the embeddings?
* goal: find different authors writing similar descriptions
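
One simple way to look for similar descriptions by different authors, once each description has an embedding, is a pairwise cosine-similarity check. This is a sketch of that simpler technique rather than of clustering proper. The three-dimensional vectors are made up and only stand in for real encoder output; `cosine`, `similar_cross_author_pairs`, and the threshold value are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# (author, embedding); tiny made-up vectors standing in for encoder output
embedded = [
    ("author_a", [1.0, 0.0, 0.0]),
    ("author_b", [0.9, 0.1, 0.0]),
    ("author_a", [0.0, 1.0, 0.0]),
]


def similar_cross_author_pairs(items, threshold=0.9):
    """Index pairs of descriptions by different authors with cosine >= threshold."""
    return [
        (i, j)
        for (i, (a1, v1)), (j, (a2, v2)) in combinations(enumerate(items), 2)
        if a1 != a2 and cosine(v1, v2) >= threshold
    ]


pairs = similar_cross_author_pairs(embedded)
```

Comparing every pair is quadratic in the number of descriptions, which is workable for a few thousand passages but would need an approximate nearest-neighbour index at larger scale.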