# Generating BookNLP-like data for all nouns

## Intro

I'm working on my personification chapter now. So far I've measured an ["agency index"](https://twitter.com/quadrismegistus/status/1059305496211931136), trying to find trends in how words gain/lose (syntactic) "agency" over time. This has been interesting, but what's missing is that personifications don't just do things (as the agency index captures), they do *human* things ("let not Ambition *mock*") and have *human* things ("Honour's voice").

So I'm trying to move on from this 'syntax of agency' to a broader 'grammar of personhood'. I took a failed stopover at the 'semantics of personhood' via word2vec: my 'human words' vector didn't end up being that interesting an index for other words, i.e. didn't seem to capture personification effects. Now I'm moving back to the syntactic BookNLP-style data (subject-verb, modifier-noun, etc) I've collected about the nouns for the Chadwyck Healey poetry collections.

I'm wondering whether I could train a classifier to separate human from non-human (or perhaps human from object) words by the distribution of the other words they are syntactically 'collocated' with, and then use that classifier to estimate the 'humanness' of all words, not just those in the cross-validation experiment. I did this in a smaller related project on animal stories: according to the cross-validation results, the model found it easier to separate humans and animals in novels than in these anthropomorphic animal stories, which seemed right. Here, though, I want to use a classifier to estimate the humanness of all words, including non-human/object words like abstract nouns, and to see whether that changes over time.
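
To make the idea concrete, here is a minimal sketch of "train on labelled words, then score every word". It is not the classifier used in the animal stories project: it swaps in a small hand-rolled multinomial naive Bayes so the example needs only the standard library. The rows, the seed labels, and the names `features`, `train`, and `humanness` are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy (word, rel, head) rows, invented for illustration; not project data.
rows = [
    ("king", "nsubj", "speaks"), ("king", "poss", "voice"),
    ("mother", "nsubj", "weeps"), ("mother", "poss", "voice"),
    ("stone", "dobj", "throws"), ("stone", "amod", "heavy"),
    ("table", "dobj", "moves"), ("table", "amod", "heavy"),
    ("ambition", "nsubj", "speaks"), ("ambition", "poss", "voice"),
    ("dust", "dobj", "throws"), ("dust", "amod", "heavy"),
]

# Hypothetical seed labels; unlabelled words are scored but never trained on.
labels = {"king": "human", "mother": "human",
          "stone": "object", "table": "object"}

# word -> Counter of (rel, head) features
features = defaultdict(Counter)
for word, rel, head in rows:
    features[word][(rel, head)] += 1


def train(features, labels):
    """Pool the feature counts of the labelled words by class."""
    class_counts = defaultdict(Counter)
    for word, label in labels.items():
        class_counts[label].update(features[word])
    vocab = {feat for counts in features.values() for feat in counts}
    n_words = Counter(labels.values())  # labelled words per class
    return class_counts, vocab, n_words


def humanness(word, features, class_counts, vocab, n_words):
    """Return P(human | the word's features) with add-one smoothing."""
    log_scores = {}
    for label, counts in class_counts.items():
        total = sum(counts.values()) + len(vocab)
        # class prior, then one smoothed log-likelihood term per feature
        score = math.log(n_words[label]) - math.log(sum(n_words.values()))
        for feat, n in features[word].items():
            score += n * (math.log(counts[feat] + 1.0) - math.log(total))
        log_scores[label] = score
    # turn the two log scores into a probability for the 'human' class
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    return exp["human"] / sum(exp.values())


class_counts, vocab, n_words = train(features, labels)
scores = {word: humanness(word, features, class_counts, vocab, n_words)
          for word in features}
```

In this toy setup the unlabelled "ambition" shares its collocates with the human seed words and so scores above 0.5, while "dust" scores below it. With real data the same per-word score could be computed separately for each period to look for change over time.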

## Decide an initial set of words

In [1]:
# Load words from the project-wide 25K
import pandas as pd
from lit.tools import read_ld
all_words = {d['word'] for d in read_ld('data.worddb.txt')}
len(all_words),list(all_words)[:10]

>> streaming as tsv: data.worddb.txt
   done [0.6 seconds]


(25000,
 [u'fawn',
  u'nunnery',
  u'woods',
  u'spiders',
  u'hanging',
  u'woody',
  u'disobeying',
  u'canes',
  u'scold',
  u'originality'])

## Transform slingshot results into BookNLP-like data

Code adapted from the classification work in the [Wild Animal Stories notebook](http://localhost:8888/lab/tree/workspace%2Fwildanimalstories%2Fexperiments.ipynb).

In [2]:
import os,pandas as pd,numpy as np,itertools
from lit import tools
rels = {
        'poss':'Possessive',
        'nsubj':'Subject',
        'nsubjpass':'Subject (passive)',
        'dobj':'Object',
        'amod':'Modifier',
        'compound':'Modifier',
        'appos':'Modifier',
        'attr':'Modifier',
        'dative':'Object'
       }

PATH_TO_SLINGSHOT_RESULT_DATA = '../syntax/results_slingshot/spacy_syntax/parse_path2/cache/'

In [3]:
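def transform_results(fn=PATH_TO_SLINGSHOT_RESULT_DATA):
    import os
    from mpi_slingshot import stream_results

    def sent_rows(sent_ld,num_sent,text_fn):
        # Tag each kept noun with its sentence number and source file
        old=get_booknlp_like_data(sent_ld)
        for odx in old:
            odx['num_sent']=num_sent
            odx['fn']=text_fn
        return old

    for path,data in stream_results(fn):
        if '.ipynb' in path: continue
        sent_ld=[]
        num_sent=0
        text_fn=os.path.split(path)[-1]  # separate name, so the fn argument is not shadowed
        for dx in data:
            # A change in sent_start marks the start of a new sentence
            if sent_ld and dx['sent_start']!=sent_ld[-1]['sent_start']:
                num_sent+=1
                for odx in sent_rows(sent_ld,num_sent,text_fn): yield odx
                sent_ld=[]  # reset once per sentence, even if it held no nouns
            sent_ld+=[dx]
        # Emit the last sentence of the text, which no later boundary would trigger
        if sent_ld:
            num_sent+=1
            for odx in sent_rows(sent_ld,num_sent,text_fn): yield odx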

In [4]:
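def get_booknlp_like_data(sent_ld,pos_only={'NOUN'},lemma=False):
    """
    Return one {'word','head','rel'} dict for each token in the sentence
    whose word is in all_words and whose part of speech is in pos_only.

    The BookNLP-style relations of interest are:
    - nouns possessed by characters (poss)
    - adjectives modifying characters (amod)
    - verbs of which the character is a subject (nsubj)
    - verbs of which the character is an object (dobj)

    'rel' is kept as the raw dependency label; it is not mapped through rels here.
    """

    old=[]
    for dx in sent_ld:
        # Use lemmas instead of surface forms if requested
        word=dx['lemma'] if lemma else dx['word']
        rel=dx['dep']
        head=dx['head_lemma'] if lemma else dx['head']
        pos=dx['pos']
        word,head=word.lower(),head.lower()
        # Keep only words from the project-wide list with an allowed part of speech
        if word not in all_words or pos not in pos_only: continue
        word_dx={'head':head,'word':word,'rel':rel}
        old+=[word_dx]
    return old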
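
As a sanity check on the sentence grouping and noun filtering above, here is a small self-contained sketch on hand-written tokens. It is not the slingshot pipeline: `tokens` is made up rather than real parser output, `vocab` stands in for `all_words`, `noun_rows` is a hypothetical helper, and it assumes `sent_start` is an identifier shared by all tokens of a sentence. It uses `itertools.groupby` in place of the manual boundary check.

```python
from itertools import groupby

# Hand-written tokens with the keys the functions above read
# ('word', 'dep', 'head', 'pos', 'sent_start'); not real parser output.
tokens = [
    {"word": "The", "dep": "det", "head": "hills", "pos": "DET", "sent_start": 0},
    {"word": "hills", "dep": "nsubj", "head": "are", "pos": "NOUN", "sent_start": 0},
    {"word": "are", "dep": "ROOT", "head": "are", "pos": "VERB", "sent_start": 0},
    {"word": "old", "dep": "acomp", "head": "are", "pos": "ADJ", "sent_start": 0},
    {"word": "She", "dep": "nsubj", "head": "knows", "pos": "PRON", "sent_start": 4},
    {"word": "knows", "dep": "ROOT", "head": "knows", "pos": "VERB", "sent_start": 4},
    {"word": "the", "dep": "det", "head": "roads", "pos": "DET", "sent_start": 4},
    {"word": "roads", "dep": "dobj", "head": "knows", "pos": "NOUN", "sent_start": 4},
]

vocab = {"hills", "roads"}  # stand-in for all_words


def noun_rows(tokens, vocab, pos_only=("NOUN",)):
    """Group consecutive tokens by sent_start and keep the in-vocabulary nouns."""
    rows = []
    sentences = groupby(tokens, key=lambda tok: tok["sent_start"])
    for num_sent, (_, sent) in enumerate(sentences, start=1):
        for tok in sent:
            word = tok["word"].lower()
            if word in vocab and tok["pos"] in pos_only:
                rows.append({"num_sent": num_sent, "word": word,
                             "head": tok["head"].lower(), "rel": tok["dep"]})
    return rows


rows = noun_rows(tokens, vocab)
```

`groupby` only merges adjacent tokens with the same key, which matches the streaming order assumed here. Every sentence is counted whether or not it contains a kept noun, so `num_sent` stays aligned with sentence position in the text.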

#### Ran this on Sherlock:

In [5]:
# Create iterator
transformer = transform_results(PATH_TO_SLINGSHOT_RESULT_DATA)
#pd.DataFrame(list(itertools.islice(transformer,10)))

In [6]:
#tools.writegen('./data.booknlp_like_data.chadwyck_poetry.txt', transform_results)
# last run: 2/3/2019 13:49 PST

#### Downloaded this data (*data.booknlp_like_data.chadwyck_poetry.txt.gz*) to *data_booknlp/*

Data appears as:

| fn             | head     | num_sent | rel      | word      |
|----------------|----------|----------|----------|-----------|
| Z400605772.xml | are      | 1        | nsubj    | hills     |
| Z400605772.xml | knows    | 2        | dobj     | roads     |
| Z400605772.xml | knows    | 2        | conj     | moves     |
| Z400605772.xml | in       | 3        | pobj     | circles   |
| Z400605772.xml | within   | 3        | pobj     | head      |
| Z400605772.xml | has      | 4        | dobj     | say       |
| Z400605772.xml | is       | 5        | nsubj    | river     |
| Z400605772.xml | lie      | 5        | nsubj    | winds     |
| Z400605772.xml | at       | 6        | pobj     | dawn      |
| Z400605772.xml | sees     | 6        | dobj     | skies     |
| Z400605772.xml | feels    | 7        | nsubj    | shadows   |
| Z400605772.xml | of       | 7        | pobj     | night     |
| Z400605772.xml | recline  | 7        | dobj     | fingers   |
| Z400605772.xml | on       | 7        | pobj     | eyes      |
| Z400605772.xml | welcomes | 8        | dobj     | sun       |
| Z400605772.xml | sun      | 8        | conj     | rain      |
| Z400605772.xml | has      | 9        | nsubj    | landscape |
| Z400605772.xml | has      | 9        | dobj     | depth     |
| Z400605772.xml | depth    | 9        | conj     | height    |
| Z400605772.xml | city     | 10       | ROOT     | city      |
| Z400605772.xml | burns    | 10       | compound | passion   |
| Z400605772.xml | like     | 10       | pobj     | burns     |
| Z400605772.xml | walks    | 11       | compound | morning   |
| Z400605772.xml | of       | 11       | pobj     | walks     |
| Z400605772.xml | on       | 11       | pobj     | wave      |
| Z400605772.xml | of       | 11       | pobj     | sand      |
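
Rows like these can be collapsed into per-word collocate counts, which is the kind of feature distribution a classifier would work from. The sketch below is one possible way to do that, not a step that has been run: it copies a few rows from the table above, reuses the `rels` mapping defined earlier, and assumes that relations missing from `rels` (such as `pobj`, `conj`, and `ROOT`) should be dropped. `collocate_counts` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Same mapping as the rels dict defined earlier in this notebook
rels = {
    "poss": "Possessive",
    "nsubj": "Subject",
    "nsubjpass": "Subject (passive)",
    "dobj": "Object",
    "amod": "Modifier",
    "compound": "Modifier",
    "appos": "Modifier",
    "attr": "Modifier",
    "dative": "Object",
}

# A few rows copied from the table above
rows = [
    {"fn": "Z400605772.xml", "head": "are", "num_sent": 1, "rel": "nsubj", "word": "hills"},
    {"fn": "Z400605772.xml", "head": "knows", "num_sent": 2, "rel": "dobj", "word": "roads"},
    {"fn": "Z400605772.xml", "head": "in", "num_sent": 3, "rel": "pobj", "word": "circles"},
    {"fn": "Z400605772.xml", "head": "is", "num_sent": 5, "rel": "nsubj", "word": "river"},
    {"fn": "Z400605772.xml", "head": "has", "num_sent": 9, "rel": "nsubj", "word": "landscape"},
    {"fn": "Z400605772.xml", "head": "burns", "num_sent": 10, "rel": "compound", "word": "passion"},
]


def collocate_counts(rows, rels):
    """Count (category, head) pairs per word, skipping relations not in rels."""
    counts = defaultdict(Counter)
    for row in rows:
        category = rels.get(row["rel"])
        if category is None:
            continue  # assumption: unmapped relations are dropped
        counts[row["word"]][(category, row["head"])] += 1
    return counts


counts = collocate_counts(rows, rels)
```

Here "hills" ends up with one `("Subject", "are")` feature, while "circles", which only appears as a `pobj`, gets no entry at all. Whether prepositional objects should really be discarded is an open choice; keeping them would only require adding `pobj` to the mapping.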