# Semantics of personhood

Classification experiment for nouns: are they human or not? Can a noun be identified as human from its grammar, that is, from its modifiers, possessives, and verbs?

## Step 1: Gather seed lists of human/nonhuman words

In [None]:
# BYU function
import pytxt
def get_byu_dd(fn='/Users/ryan/DH/TOOLS/words/byu/worddb.byu.txt'):
    return pytxt.ld2dd(pytxt.read_ld(fn),'word')

In [None]:
#get_byu_dd()['summon']

In [None]:
###
# Semantic lists, pruned from Harvard General Inquirer
# originally for Wild Animal Stories project, pruned again (less strictly) for human nouns
# downloaded from v2, http://localhost:8888/lab/tree/workspace%2Fwildanimalstories%2Fparsing.ipynb
###

def get_field_words(fnfn='wordlists/Word Lists.xlsx',only_fields={'Human','Animal','Object'},remove_col='notstrict',byu_pos='nn1'):
    pos_d=None
    if byu_pos:
        pos_d=dict( (w,wd['pos']) for w,wd in get_byu_dd().items() )
    
    import pytxt
    from collections import defaultdict,Counter
    field2words=defaultdict(set)
    field2count=Counter()
    field2count_removed=Counter()
    for d in pytxt.read_ld(fnfn):
        field=d['field'].replace('HGI.','').replace('Person','Human')
        if only_fields and field not in only_fields: continue
        word=d['word'].lower().strip()
        field2count[field]+=1
        toremove=d[remove_col].strip()
        if toremove and toremove.lower()=='y': continue
        if not word: continue
            
        if byu_pos and pos_d and pos_d.get(word)!=byu_pos: continue
            
        field2count_removed[field]+=1
        field2words[field]|={word}  # store the normalized (lowercased, stripped) form that was POS-checked
    #statld=[{'fieldname':name, 'num_words':count,'num_words_after_pruning':field2count_removed[name], 'words':' '.join(field2words[name])} for name,count in field2count.items()]
    #print pytxt.tabify(statld)
    return field2words
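
The pruning logic can be tried without the spreadsheet. The sketch below is an illustration only: the rows, the `prune_rows` helper, and the toy part-of-speech lookup are hypothetical stand-ins for what `pytxt.read_ld()` and the BYU word database would supply.

```python
from collections import defaultdict

def prune_rows(rows, only_fields, remove_col, pos_lookup=None, wanted_pos='nn1'):
    """Group words by field, skipping flagged rows and unwanted parts of speech."""
    field2words = defaultdict(set)
    for row in rows:
        field = row['field'].replace('HGI.', '').replace('Person', 'Human')
        if only_fields and field not in only_fields:
            continue
        word = row['word'].lower().strip()
        if not word:
            continue
        # a 'y' in the chosen column marks the word for removal
        if row.get(remove_col, '').strip().lower() == 'y':
            continue
        if pos_lookup is not None and pos_lookup.get(word) != wanted_pos:
            continue
        field2words[field].add(word)
    return field2words

# Hypothetical rows in the shape of the word-list spreadsheet
rows = [
    {'field': 'HGI.Person', 'word': 'Teacher ', 'strict': '', 'notstrict': ''},
    {'field': 'HGI.Person', 'word': 'crowd', 'strict': 'y', 'notstrict': ''},
    {'field': 'Animal', 'word': 'fox', 'strict': '', 'notstrict': ''},
    {'field': 'Object', 'word': 'run', 'strict': '', 'notstrict': ''},
    {'field': 'Place', 'word': 'town', 'strict': '', 'notstrict': ''},
]
# Toy part-of-speech tags (made up for this example)
toy_pos = {'teacher': 'nn1', 'crowd': 'nn1', 'fox': 'nn1', 'run': 'vv0', 'town': 'nn1'}
wanted_fields = {'Human', 'Animal', 'Object'}

strict = prune_rows(rows, wanted_fields, 'strict', toy_pos)
loose = prune_rows(rows, wanted_fields, 'notstrict', toy_pos)
print(dict(strict))
print(dict(loose))
```

With these toy rows, the strict pass drops "crowd" from Human while the looser pass keeps it, which is the difference between the two Human lists built in `get_all_fields()`.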

In [None]:
def get_all_fields():
    D={}
    for k,v in get_field_words(remove_col='strict',byu_pos='nn1').items():
        D[k.strip()]=v
    D['Human (V2)']=get_field_words(remove_col='notstrict',byu_pos='nn1')['Human']
    return D

In [None]:
#field2words=get_all_fields()

In [None]:
# import random
# for fld in sorted(field2words):
#     print fld,'-->',random.sample(field2words[fld],10)

## Step 2: Storing these in the big list of semantic fields

Added the function get_field_words() above to the [Semantic Fields Notebook](http://localhost:8888/lab/tree/Dissertation%2Fabstraction%2Fwords%2Fsemantic_fields.ipynb), pointing to the same location (wordlists/Word Lists.xlsx).

Fields Animal, Human, and Object are stored with the prefix "VG": VG.Animal, VG.Human, VG.Object.
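
As a minimal sketch of that naming scheme (the `add_prefixed_fields` helper and the example words are hypothetical; the Semantic Fields Notebook may do this differently):

```python
def add_prefixed_fields(all_fields, new_fields, prefix='VG'):
    """Return a copy of all_fields with new_fields added under 'prefix.name' keys."""
    merged = dict(all_fields)
    for name, words in new_fields.items():
        merged['{}.{}'.format(prefix, name)] = set(words)
    return merged

existing = {'HGI.Abstract': {'idea', 'virtue'}}
new = {'Human': {'teacher'}, 'Animal': {'fox'}, 'Object': {'table'}}
all_fields = add_prefixed_fields(existing, new)
print(sorted(all_fields))
# → ['HGI.Abstract', 'VG.Animal', 'VG.Human', 'VG.Object']
```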

This saves a new version of */Users/ryan/DH/Dissertation/abstraction/words/data.fields.txt* and */Users/ryan/DH/Dissertation/abstraction/words/data.field_words.txt*.

In [None]:
# # Now available at
# from lit.tools.freqs import get_fields
# fields=get_fields() # which gets /Users/ryan/DH/Dissertation/abstraction/words/data.fields.txt

In [None]:
# for fld in ['Human','Animal','Object']:
#     print fld,'-->',random.sample(fields['VG.'+fld],10)

## Step 3: Calculating humanness vectors

Changed word2vec.py's abstract_vectors():

    def abstract_vectors(self,only_major=True,include_social=False):
            model = self.gensim
            vd={} # vector dictionary

            from lit.tools.freqs import get_fields
            fields = get_fields()

            vd['Complex Substance (Locke) <> Mixed Modes (Locke)'] = self.centroid(fields['Locke.MixedMode']) - self.centroid(fields['Locke.ComplexIdeaOfSubstance'])
            vd['Concrete (HGI) <> Abstract (HGI)'] = self.centroid(fields['HGI.Abstract']) - self.centroid(fields['HGI.Concrete'])
            vd['Human (VG)'] = self.centroid(fields['VG.Human'])
            vd['Object (VG) <> Human (VG)'] = self.centroid(fields['VG.Human']) - self.centroid(fields['VG.Object'])
            vd['Animal (VG) <> Human (VG)'] = self.centroid(fields['VG.Human']) - self.centroid(fields['VG.Animal'])
            vd['Vice (HGI) <> Virtue (HGI)']=self.centroid(fields['HGI.Moral.Virtue']) - self.centroid(fields['HGI.Moral.Vice'])

            return vd
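
To make the geometry concrete, here is a toy sketch of how an "A <> B" vector can score a word. It assumes that a centroid is the component-wise mean of the seed words' vectors and that a word is scored by cosine similarity with the difference vector. Neither `self.centroid()` nor the scoring inside `model_words()` is shown in this notebook, so both are assumptions, and the 2-d embeddings are made up.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def subtract(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-d embeddings (hypothetical values, not from any trained model)
toy = {
    'man': [0.9, 0.1], 'woman': [0.8, 0.2],
    'table': [0.1, 0.9], 'chair': [0.2, 0.8],
    'servant': [0.7, 0.4], 'spoon': [0.2, 0.7],
}
human = centroid([toy['man'], toy['woman']])
obj = centroid([toy['table'], toy['chair']])
object_to_human = subtract(human, obj)  # points from Object toward Human

for word in ('servant', 'spoon'):
    print(word, round(cosine(toy[word], object_to_human), 3))
```

Under these assumptions, a positive score puts a word on the Human side of the axis and a negative score on the Object side.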

## Step 4: Examining vectors in Chadwyck Poetry

In [None]:
import lit
chadwyck_poetry = lit.load_corpus('ChadwyckPoetry')

In [None]:
cp_w2v = chadwyck_poetry.word2vec_by_period()
cp_w2v.models[0].fnfn, cp_w2v.models[-1].fnfn

In [None]:
# Save partial models?
#maxrank=10000
#cp_w2v.limit_vocab_and_save('/Users/ryan/DH/corpora/chadwyck_poetry/word2vec_models_partial_10K', n=maxrank,fpm_cutoff=None)

### Step 4A. model_words(): Historicizing the humanness vector for words

* V1 (2/2/2019): with full vocab models
* V2 (2/3/2019): with vocab-limited models

**Update (2/3/19):** The vocab-limited models made no difference to the results below. The data files were overwritten (whoops), but the visualizations are from the full-vocab model results.

In [None]:
#cp_w2v.model_words(abstract_vectors=True,odir='data_word2vec/chadwyck_poetry')
# last run: 2/1/2019 22:06

# with limited models:
#cp_w2v.model_words(abstract_vectors=True,odir='data_word2vec/chadwyck_poetry_models_partial_10K')
# last run: 2/3/2019 with V2, vocab limited models

In [None]:
#cp_w2v.consolidate_model_words(idir='data_word2vec/chadwyck_poetry/')
# last run: 2/1/2019

The primary output of consolidate_model_words() is analyzed in [this Tableau file](data_word2vec/data.word2vec.consolidated.words.ChadwyckPoetry.by_period.Noneyears.twb).

#### Some of the figures and results

##### How humanness vectors relate to each other

* The standard humanness vector is difficult to interpret: words like "ax", "sop", and "sling" sit close to it. Here it is contrasted with the (for me) more legible vector V(Object-Human) (R^2=0.13):

<center><img src="figures/HumanObject vs Human.png" width=500></center>

* The V(Animal-Human) vector correlates even less with V(Human) (R^2=0.05):

<center><img src="figures/HumanAnimal vs Human.png" width=500></center>

* By contrast, the V(Animal-Human) correlates much better with V(Object-Human), although interesting differences remain (R^2=0.67):

<center><img src="figures/HumanObjectAnimal.png" width=500></center>
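
The R^2 values above come from the Tableau workbook. Assuming they are from a simple linear trend line (an assumption; the workbook settings are not recorded here), the same figure can be recomputed from two columns of scores as below. The score lists are made up.

```python
def linear_r_squared(xs, ys):
    """R^2 of the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical scores of four words on two vectors
scores_a = [1.0, 2.0, 3.0, 4.0]
scores_b = [1.0, 3.0, 2.0, 4.0]
print(round(linear_r_squared(scores_a, scores_b), 2))
```

For a one-predictor linear fit this equals the squared Pearson correlation, so it does not matter which vector goes on which axis.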

##### How humanness relates to abstractness

* The Human-Object vector and V(Conc-Abs)[Locke] don't show much correlation (R^2=0.25):

<center><img src="figures/ConcAbsHumanObject.png" width=500></center>

* The Human-Object vector and V(Conc-Abs)[HGI] show more correlation (R^2=0.55):

<center><img src="figures/ConcAbs (HGI) vs HumanObject.png" width=500></center>

##### How humanness relates to virtue and vice

* The Vice-Virtue vector correlates with the **Object**-Human vector (R^2=0.24):

<center><img src="figures/HumanObjectViceVirtue.png" width=500></center>

* The Vice-Virtue vector correlates with the Animal-Human vector (R^2=0.45):

<center><img src="figures/HumanAnimalViceVirtue.png" width=500></center>



### Step 4B. Correlate historical humanness vector results

#### Work with results of Step 4A in pandas

In [None]:
import pandas as pd

In [None]:
df=pd.read_csv('data_word2vec/data.word2vec.consolidated.words.ChadwyckPoetry.by_period.Noneyears.txt',sep='\t')

In [None]:
min_period = 1600
max_period = 1900
min_count = 100
df['period_int']=[int(x.split('-')[0]) for x in df.period]
df=df.loc[df.period_int>=min_period]
df=df.loc[df.period_int<max_period]
df=df.loc[df.model_count>=min_count]

In [None]:
df.iloc[1000:1010]

In [None]:
df['period_word']=[(x,y) for x,y in zip(df.period,df.word)]
groups=df.groupby('period_word')

In [None]:
import numpy as np
#df.groupby('period')
meandf=df.groupby('period_word').mean(numeric_only=True)  # average only the numeric score columns
#meandf

In [None]:
meandf['period'],meandf['word'] = zip(*meandf.index)
#pivotdf=meandf.pivot(index='word',columns='period',values='Human (VG)')
pivotdf=meandf.pivot_table(index='word',columns='period',values='Object (VG) <> Human (VG)', aggfunc='mean')

In [None]:
#list(pivotdf.dropna().index)
datadf=pivotdf.dropna()

# standardize each period (column) across words; apply() works column by column
from scipy.stats import zscore
datadf=datadf.apply(zscore)
datadf
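
What the standardization does, in a toy sketch: `DataFrame.apply()` passes each column (here, each period) to `scipy.stats.zscore`, which by default uses the population standard deviation (`ddof=0`). The table below is made up, and plain dictionaries stand in for the DataFrame.

```python
import statistics

def zscore_column(values):
    """Standardize one column using the population SD (ddof=0)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Toy table: one dict per period (column), keyed by word (row); values are hypothetical
table = {
    '1600-1624': {'king': 0.30, 'stone': -0.10, 'world': 0.10},
    '1700-1724': {'king': 0.50, 'stone': 0.10, 'world': 0.30},
}

standardized = {}
for period, scores in table.items():
    words = sorted(scores)
    z = zscore_column([scores[w] for w in words])
    standardized[period] = dict(zip(words, z))

for period in sorted(standardized):
    print(period, {w: round(z, 3) for w, z in standardized[period].items()})
```

In the toy data the second period is the first shifted up by 0.2, and the z-scores come out the same in both. Standardizing per period removes period-wide shifts, so what remains is a word's position relative to the other words in that period.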

#### Analyze the data matrix

In [None]:
# from http://localhost:8888/lab/tree/workspace%2Fsyntax%2Fcorrelations.ipynb

import statsmodels.formula.api as smf
def newpolyfit(X,Y):
    newdf=pd.DataFrame({'X':X, 'Y':Y})
    results = smf.ols('Y ~ X + I(X**2)', data=newdf).fit()
    return results.rsquared, results.f_pvalue

def regressions(df):
    word2rp={}
    df=df.T
    for word in df.columns:
        Y=list(df[word])
        X=list(range(len(Y)))
        word2rp[word]=newpolyfit(X,Y)
    return word2rp

#%load_ext rpy2.ipython

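def dist(df):
    # Despite the name, this returns a similarity matrix:
    # 1 - correlation distance = Pearson correlation between rows
    from scipy.spatial.distance import squareform, pdist
    distmatrix=pdist(df,metric='correlation')
    return 1-pd.DataFrame(squareform(distmatrix), columns=df.index, index=df.index)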

def kmeans(datadf,n_kmeans=5):
    df_dist=dist(datadf)
    m_dist=df_dist.values
    from sklearn.cluster import KMeans
    model_kclust = KMeans(n_clusters=n_kmeans)
    model_kclust.fit(m_dist)
    labels = model_kclust.labels_
    word2label = dict(zip(datadf.index, labels))
    return word2label

def corr_with_cluster(df,word2cluster):
    # cluster2words
    from collections import defaultdict
    cluster2words=defaultdict(list)
    for w,c in word2cluster.items():
        cluster2words[c]+=[w]
    
    # get median profile per cluster (median across the cluster's words, per period)
    cluster_avg={}
    for clust,words in cluster2words.items():
        cluster_avg[clust]=list(df.loc[words].median(axis=0))
    
    # corr with each word
    from scipy.stats import pearsonr
    word2clustcorr={}
    for word,clust in word2cluster.items():
        word_avgs=list(df.loc[word])
        clust_avgs=cluster_avg[clust]
        word2clustcorr[word]=pearsonr(word_avgs,clust_avgs)
    return word2clustcorr

def tsne(datadf,n_components=2):
    df_dist=dist(datadf)
    m_dist=df_dist.values
    from sklearn.manifold import TSNE
    model = TSNE(n_components=n_components, random_state=0)
    fit = model.fit_transform(m_dist)
    from collections import defaultdict
    newcols=defaultdict(list)
    for i,word in enumerate(datadf.index):
        for ii,xx in enumerate(fit[i]):
            newcols['tsne_V'+str(ii+1)] += [xx]
    for k,v in newcols.items(): datadf[k]=v
    return datadf
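
Why the regression includes a squared term: a word that rises and then falls has almost no linear trend, but a quadratic fit captures it. The sketch below refits `Y ~ X + X**2` in plain Python, with X as the period index as in `regressions()`. The `solve_linear` and `quadratic_r2` helpers and the series are hypothetical, and no p-value is computed here.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def quadratic_r2(ys):
    """R^2 of the least-squares fit y ~ 1 + x + x^2, with x = 0, 1, 2, ..."""
    rows = [[1.0, float(x), float(x * x)] for x in range(len(ys))]
    # normal equations: (A^T A) coefs = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    coefs = solve_linear(ata, aty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

rise_and_fall = [0.0, 3.0, 4.0, 3.0, 0.0]     # hypothetical z-scores over five periods
zigzag = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]       # no smooth trend
print(round(quadratic_r2(rise_and_fall), 3))
print(round(quadratic_r2(zigzag), 3))
```

A high `polyfit_r2` therefore flags words with a smooth trajectory (monotone or single-arched), not only words that move in one direction.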

In [None]:
#datadf.loc[['virtue','vice']].median(axis=1)

In [None]:
word2cluster=kmeans(datadf,n_kmeans=3)
word2clustcorr=corr_with_cluster(datadf,word2cluster)
word2rp=regressions(datadf)
datadf=tsne(datadf)  # run last: tsne() appends tsne_V* columns to datadf in place
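
A note on what `kmeans()` and `tsne()` actually see: the matrix returned by `dist()` holds the Pearson correlation between every pair of word trajectories, so each word is represented by its correlations with all the other words. A toy version in plain Python (the trajectories and the `correlation_matrix` helper are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(profiles):
    """Word-by-word matrix of trajectory correlations, as nested dicts."""
    words = sorted(profiles)
    return {a: {b: pearson_r(profiles[a], profiles[b]) for b in words} for a in words}

# Hypothetical standardized trajectories over five periods
profiles = {
    'rising': [-1.0, -0.5, 0.0, 0.5, 1.0],
    'also_rising': [-0.8, -0.6, 0.1, 0.4, 0.9],
    'falling': [1.0, 0.5, 0.0, -0.5, -1.0],
}
sim = correlation_matrix(profiles)
for word in sorted(sim):
    print(word, [round(sim[word][other], 2) for other in sorted(sim)])
```

Words with similar trajectories get similar rows (close to 1 with each other, close to -1 with the opposite trend), which is what lets the clustering group them into shared historical movements.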

In [None]:
df_out = datadf.copy()
df_out['kmeans_cluster'] = [word2cluster.get(w,'') for w in df_out.index]
df_out['polyfit_r2'] = [word2rp[w][0] if w in word2rp else '' for w in df_out.index]
df_out['polyfit_p'] = [word2rp[w][1] if w in word2rp else '' for w in df_out.index]
df_out['corr_w_clust_r'] = [word2clustcorr[w][0] if w in word2clustcorr else '' for w in df_out.index]
df_out['corr_w_clust_p'] = [word2clustcorr[w][1] if w in word2clustcorr else '' for w in df_out.index]
df_out['word']=df_out.index
df_out

In [None]:
df_out.to_csv('data_word2vec/data.word2vec.consolidated.words.ChadwyckPoetry.by_period.Noneyears.CORRELATIONS.txt', sep='\t')

#### Visualize the results

Some results are:
    
* K-means with 3 clusters divides history into three movements:

<center><img src="figures/Cluster Aggregate Trends.png" width=400></center>
    
* Here are the individual words, broken down into 1K-word buckets by frequency rank (BYU):

<center>
    <img src="figures/Pages by Rank - 0K-1K.png" width=800>
    <br/>
    <img src="figures/Pages by Rank - 1K-2K.png" width=800>
    <br/>
    <img src="figures/Pages by Rank - 2K-3K.png" width=800>
</center>

#### A few words to notice

* The word "fall" falls on V(Human-Object): a turn away from its sense as The (Christian) Fall?

<center><img src="figures/vhumanobj-fall.png" width=400></center>

* The word "world" rises and falls on V(Human-Object): the social-ization of the world? The public sphere?

<center><img src="figures/vhumanobj-world.png" width=400></center>

* The word "nature" falls and rises on V(Human-Object): I don't know if I trust this... isn't Nature constantly personified in the eighteenth century?

<center><img src="figures/vhumanobj-nature.png" width=400></center>

* Domestic objects:

<center><img src="figures/vhumanobj-domesticobjs.png" width=400></center>

* Body parts:

<center><img src="figures/vhumanobj-bodypart.png" width=800></center>

## Appendix

### Tests

In [1]:
from lit import load_corpus
CP = load_corpus('ChadwyckPoetry')
CPw2v = CP.word2vec_by_period()

>> reading config files...


In [2]:
periods = ['1600-1624', '1700-1724', '1800-1824', '1900-1924']
# first model listed for each period
models = m1600,m1700,m1800,m1900 = [CPw2v.period2models[p][0] for p in periods]

In [3]:
from lit.tools.freqs import get_fields
fields=get_fields()
#fields['VG.Human']

>> streaming as tsv: /Users/ryan/DH/Dissertation/abstraction/words/data.fields.txt
   done [0.1 seconds]


In [30]:
def show_similarity(words):
    # print() with a single argument behaves the same in Python 2 and 3
    for m in models:
        print(m.period)
        m.limit_by_rank(10000)
        print(' '.join(x for x,y in m.similar(words,10)))
        print('')

In [31]:
def test():
    # print() with a single argument behaves the same in Python 2 and 3
    for m in models:
        print(m.name)
        print(' '.join(x for x,y in m.analogy('man','woman','king')))
        print('')

In [32]:
test()

1600-1624.run=01.txt.gz
princess queen daughter prince empress henry philip iames anne duke

1700-1724.run=01.txt.gz
goddess queen jupiter sister dow'r priestess prophetess defender earl nero

1800-1824.run=01.txt.gz
queen prince princess sybil queen's monarch mistress naples pharaoh daughter

1900-1924.run=01.txt.gz
queen solomon menelaus redivagate bowing porch mycenae rashumba priam cassandra
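
For reference, a toy version of the analogy test. It assumes `m.analogy('man','woman','king')` follows the usual word2vec offset formulation, ranking words by cosine similarity to vec(king) - vec(man) + vec(woman). The method's implementation is not shown in this notebook, so that is an assumption, and the 3-d vectors are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, a, b, c, topn=1):
    """Rank words by similarity to vec(c) - vec(a) + vec(b), excluding the inputs."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked[:topn]

# Toy embeddings: axis 0 ~ male, axis 1 ~ female, axis 2 ~ royalty (hypothetical)
toy = {
    'man': [1.0, 0.0, 0.1],
    'woman': [0.0, 1.0, 0.1],
    'king': [1.0, 0.0, 0.9],
    'queen': [0.0, 1.0, 0.9],
    'porch': [0.5, 0.5, 0.0],
}
print(analogy(toy, 'man', 'woman', 'king'))
```

In this toy space "queen" comes out on top, which is the sanity check the real models are being put through above.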



In [36]:
#show_similarity(fields['VG.Human'])
show_similarity(['man','woman','girl','boy','child','parent'])

1600-1624
maid girle matron lone iphis helen mother wench acis cuckold

1700-1724
creature shepherdess orphan hermit nymph's doting zelinda jilt myra cloe

1800-1824
mother babe creature maid father maiden daughter boy's son nurse

1900-1924
lover beggar absalom seer thief madman son weeps aunt lad

