# Can we really map the space of genre?

Maps of genre are very popular. Some of them are fanciful.

<img src="files/markdown.png">

How do we know there is *any* universal "distance metric" for cultural phenomena? Mathematically, distance metrics are supposed to obey certain rules. For instance, they should be symmetric and should satisfy the triangle inequality:

    A -> B == B -> A   and
    (A -> B) + (B -> C) >= A -> C

But maybe the distances between genres A, B, and C depend on whether you're standing at A, B, or C? Or at any rate, maybe the criteria of similarity vary depending on the pair of genres you're comparing.
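To make those two rules concrete, here is a minimal sketch of how one might check them on a small matrix of pairwise distances. The helper name `metric_violations` and the toy distances are invented for illustration; nothing here comes from the data used later.

```python
import numpy as np

def metric_violations(dist, tol=1e-9):
    """Count violations of symmetry and of the triangle inequality
    in a square matrix of pairwise distances (hypothetical helper)."""
    dist = np.asarray(dist, dtype=float)
    # Symmetry: d(A, B) should equal d(B, A). Each bad pair shows up twice.
    asymmetric = int(np.sum(np.abs(dist - dist.T) > tol)) // 2
    # via[a, b, c] = d(a, b) + d(b, c), the length of the detour through b
    via = dist[:, :, None] + dist[None, :, :]
    # Triangle inequality: the detour should never be shorter than d(a, c).
    # This counts ordered triples (a, b, c) where it is.
    broken = int(np.sum(via < dist[:, None, :] - tol))
    return asymmetric, broken

# Invented distances between three genres A, B, C.
# A -> C is longer than A -> B -> C, which a true metric forbids.
toy = [[0, 1, 5],
       [1, 0, 1],
       [5, 1, 0]]
print(metric_violations(toy))
```

A matrix that passes both checks could still fail to reflect a single consistent notion of similarity, so this is only a necessary condition, not a sufficient one.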

How can we know whether this is true? What would we do if it *were* true?

In [1]:
import csv, random
import pandas as pd
import numpy as np
import math, sys
from collections import Counter
from scipy import spatial
from matplotlib import pyplot as plt
from scipy.stats import pearsonr
%matplotlib inline

## 1. Plan of the experiment.

We want to test different measures of textual similarity, and see which ones correspond most closely to other intuitions about the proximity of genres. So we need a measure of proximity that isn't based on textual comparison. The measure we choose can't be infallible, but it should at least be independent of textual measures.

**Social measures of proximity**

For instance, we could use library metadata. If two genre/form/subject headings are often assigned to the same books, we might infer that they're related.

That data will allow us to measure the overlap between genres using the pointwise mutual information of their co-association with volumes:
 
    log(p(x, y) / (p(x) p(y)))

Here p(x) is, for instance, the probability that a volume is labeled "Humor," p(y) the probability that a volume is labeled "Horror," and p(x, y) the probability that the same volume is labeled both "Humor" and "Horror."
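As a worked example of that formula, here is a small sketch that computes PMI from raw counts. The function name `pmi` and all of the counts are invented for illustration; they are not drawn from the library metadata.

```python
import math

def pmi(n_x, n_y, n_xy, n_total):
    """Pointwise mutual information from raw volume counts
    (hypothetical helper, no smoothing)."""
    p_x = n_x / n_total
    p_y = n_y / n_total
    p_xy = n_xy / n_total
    return math.log(p_xy / (p_x * p_y))

# Invented counts: 1000 volumes, 100 labeled "Humor," 50 labeled "Horror,"
# and 10 labeled both. Independence would predict only 5 shared volumes.
humor_horror = pmi(100, 50, 10, 1000)

# With exactly the 5 shared volumes that independence predicts, PMI is zero.
independent = pmi(100, 50, 5, 1000)
print(humor_horror, independent)
```

With these numbers the first value is log(2), about 0.69: the two labels co-occur twice as often as chance would predict.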

However, we also want to acknowledge that the probability of two categories overlapping declines when they have largely disjoint chronological distributions. The "subjects" we are considering were, by and large, assigned earlier than the "genres," and they will tend not to be associated for that reason.

So I'm proposing a conditional version of PMI, where p(x), p(y), and p(x, y) are all measured within a sample of volumes defined by the chronological distribution of x and y. This gives a boost to PMI when the chronological distributions are spread over a wide swath of the timeline: p(x), p(y), and p(x, y) are all lowered by the same factor, but that factor is effectively *squared* in the denominator, p(x)p(y). This conforms to the intuition that a coincidence of x and y is less likely when the events are rare within the frame of reference (say, 100 years), and more likely when "the relevant framework" is just the (say) 30 years where the genres substantially coincide.
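The arithmetic behind that boost can be sketched with a toy calculation. The counts and the helper `pmi_in_frame` are invented for illustration; the only thing that changes between the two calls is the size of the frame of reference.

```python
import math

def pmi_in_frame(n_x, n_y, n_xy, frame_size):
    """PMI where all three probabilities are measured against the same
    frame of reference (hypothetical helper, no smoothing)."""
    p_x = n_x / frame_size
    p_y = n_y / frame_size
    p_xy = n_xy / frame_size
    return math.log(p_xy / (p_x * p_y))

# Same invented label counts, measured in a narrow and a wide frame.
narrow = pmi_in_frame(100, 50, 10, 1000)   # frame of 1000 volumes
wide = pmi_in_frame(100, 50, 10, 3000)     # frame three times as large

# Tripling the frame divides the numerator by 3 and the denominator by 9,
# so PMI rises by log(3).
boost = wide - narrow
print(narrow, wide, boost)
```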

**Priors**

It also makes sense to hard-code certain priors about the relationships of genres, where we can see that two categories are close to synonymous. Catalogers may not assign near-synonymous headings, so this kind of overlap would otherwise be missed.

In [2]:
# The list of priors

priors = {'Subj: Horror': {'Horror': 0.8},
          'Subj: Humor': {'Humor': 0.8},
          'Subj: History': {'Historical': 0.8},
          'Subj: Man-woman': {'Love': 0.6}, 
          'Subj: Short stories, American': {'Short stories': 0.8, 'Subj: Short stories, Other': 0.8},
          'Subj: Short stories, Other': {'Short stories': 0.8, 'Subj: Short stories, American': 0.8},
          'Subj: SF, American': {'SF': 0.8, 'Subj: SF, Other' : 0.8},
          'Subj: SF, Other': {'SF': 0.8, 'Subj: SF, American': 0.8},
          'Subj: Fantasy': {'Fantasy': 0.8}, 
          'Subj: Juvenile': {'Juvenile': 0.8},
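          # NOTE: the category is named 'Subj: Detective' everywhere else in this
          # notebook, so this key never matches and the Mystery prior is not applied.
          'Subject: Detective': {'Mystery': 0.8}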
         }

# Let's turn that into a symmetric dictionary where A -> B
# also implies B -> A
tuplelist = []
for key1, matches in priors.items():
    for match, realnumber in matches.items():
        tuplelist.append((key1, match, realnumber))
        
for t in tuplelist:
    key1, match, realnumber = t
    if match not in priors:
        priors[match] = dict()
    priors[match][key1] = realnumber
    
priors

{'Fantasy': {'Subj: Fantasy': 0.8},
 'Historical': {'Subj: History': 0.8},
 'Horror': {'Subj: Horror': 0.8},
 'Humor': {'Subj: Humor': 0.8},
 'Juvenile': {'Subj: Juvenile': 0.8},
 'Love': {'Subj: Man-woman': 0.6},
 'Mystery': {'Subject: Detective': 0.8},
 'SF': {'Subj: SF, American': 0.8, 'Subj: SF, Other': 0.8},
 'Short stories': {'Subj: Short stories, American': 0.8,
  'Subj: Short stories, Other': 0.8},
 'Subj: Fantasy': {'Fantasy': 0.8},
 'Subj: History': {'Historical': 0.8},
 'Subj: Horror': {'Horror': 0.8},
 'Subj: Humor': {'Humor': 0.8},
 'Subj: Juvenile': {'Juvenile': 0.8},
 'Subj: Man-woman': {'Love': 0.6},
 'Subj: SF, American': {'SF': 0.8, 'Subj: SF, Other': 0.8},
 'Subj: SF, Other': {'SF': 0.8, 'Subj: SF, American': 0.8},
 'Subj: Short stories, American': {'Short stories': 0.8,
  'Subj: Short stories, Other': 0.8},
 'Subj: Short stories, Other': {'Short stories': 0.8,
  'Subj: Short stories, American': 0.8},
 'Subject: Detective': {'Mystery': 0.8}}
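The same symmetrization can also be sketched as a small function that returns a new dictionary instead of modifying its input, which makes it easy to test. The names `symmetrize` and `is_symmetric` and the toy dictionary are invented for illustration.

```python
from copy import deepcopy

def symmetrize(priors):
    """Return a copy of a nested dict in which A -> B also implies B -> A
    (hypothetical helper; the input dict is left unchanged)."""
    symmetric = deepcopy(priors)
    for key, matches in priors.items():
        for match, weight in matches.items():
            symmetric.setdefault(match, {})[key] = weight
    return symmetric

def is_symmetric(nested):
    """True if every A -> B entry has a B -> A entry with the same weight."""
    return all(nested.get(b, {}).get(a) == w
               for a, matches in nested.items()
               for b, w in matches.items())

# A two-entry toy version of the priors.
toy = {'Subj: Horror': {'Horror': 0.8},
       'Subj: Man-woman': {'Love': 0.6}}
both_ways = symmetrize(toy)
print(both_ways)
```

If A -> B and B -> A were given different weights in the input, this sketch would let one overwrite the other, so the weights should agree to begin with.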

#### Read in library metadata; assemble dictionary of Library of Congress headings

We read in a table of metadata associated with fiction titles from HathiTrust. How exactly these were selected is a story for another day.

In [30]:
meta = pd.read_csv('experiment_meta.csv', index_col = 'docid', low_memory = False)
meta.shape

(22448, 30)

In [31]:
meta.head()

docid,oldauthor,author,authordate,inferreddate,latestcomp,datetype,startdate,enddate,imprint,imprintdate,...,copiesin25yrs,enumcron,volnum,title,parttitle,earlyedition,shorttitle,nonficprob,juvenileprob,experimentcat
mdp.39015018477995,"Paredes, Américo","Paredes, Américo",,1990,1990,s,1990,,Houston|Arte Publico Press|1990.,1990,...,1.0,,,George Washington Gómez / | $c: Américo Paredes.,,True,George Washington Gómez,0.080192,0.168683,Subj: History
mdp.39015064741138,"Drury, Tom","Drury, Tom",,2006,2006,s,2006,,New York|Atlantic Monthly Press|Distributed by...,2006,...,1.0,,,The driftless area / | $c: Tom Drury.,,True,The driftless area,0.166502,0.0577,Bildungsroman
mdp.39015059250020,"Wolitzer, Meg","Wolitzer, Meg",,2005,2005,s,2005,,New York|Scribner|c2005.,2005,...,1.0,,,The position : | a novel / | $c: Meg Wolitzer.,,True,The position : a novel,0.142588,0.027991,Domestic|Novel|Psychological
mdp.49015000715624,"Major, Clarence","Major, Clarence",,1975,1975,s,1975,,New York|Fiction Collective : distributed by G...,1975,...,4.0,,,Reflex and bone structure / | $c: by Clarence ...,,True,Reflex and bone structure,0.0,0.0,Mystery|Subj: Detective|Psychological
njp.32101065597542,"Whitaker, Evelyn","Whitaker, Evelyn",,1898,1898,t,1899,1898.0,"Boston;Little, Brown;1899;c1898;(Cambridge, U....",1899,...,1.0,,,"Rob and Kit / | $c: by the author of ""Tip-cat,...",,True,Rob and Kit,0.1335,0.737,Subj: Juvenile|Juvenile


In [12]:
genre_categories = {'Novel': {'Novel'},
              'Juvenile': {'Juvenile audience'},
              'Short stories': {'Short stories'},
              'Domestic': {'Domestic fiction'},
              'Love': {'Love stories'},
              'Historical': {'Historical fiction'},
              'Psychological': {'Psychological fiction'},
              'Mystery': {'Detective and mystery stories', 'Mystery and detective fiction', 'Mystery fiction'},
              'Suspense': {'Suspense fiction'},
              'SF': {'Science fiction'},
              'Bildungsroman': {'Bildungsromans', 'Bildungsromane'},
              'Biographical': {'Biographical fiction'},
              'Humor': {'Humorous stories', 'Humorous fiction'},
              'Fantasy': {'Fantasy fiction', 'Fantastic fiction'},
              'Horror': {'Horror tales', 'Horror fiction', 'Occult fiction'},
              'Western': {'Western stories'},
              'Political': {'Political fiction'},
              'War': {'War stories'},
              'Adventure': {'Adventure stories', 'Adventure fiction'},
              'Christian': {'Christian fiction'}
             }

In [13]:
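def hasgenre(row, genre2match):
    ''' Returns the docid of a row if one of its genre headings
    exactly matches genre2match and its date falls in 1700-2010;
    otherwise returns None.
    '''
    genres = row.genres
    date = row.inferreddate
    if pd.isnull(genres):
        return None
    elif pd.isnull(date) or int(date) < 1700 or int(date) > 2010:
        return None
    else:
        genres = genres.split('|')
        for g in genres:
            if g == genre2match:
                return row.docid
        return None

def gathergenre(genreset):
    ''' Collects the docids of all volumes that carry any
    of the headings in genreset.
    '''
    global meta
    allmatches = set()
    for g in genreset:
        thisset = set(meta.apply(hasgenre, args = (g,), axis = 1))
        # discard rather than remove: None is absent if every row matched
        thisset.discard(None)
        allmatches = allmatches.union(thisset)
    return allmatches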

category_dict = dict()

for name, category in genre_categories.items(): 
    examples = gathergenre(category)
    category_dict[name] = examples
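Row-wise `apply` gets slow on a table of this size. As a hedged alternative, the same matching can be sketched with vectorized pandas operations. The function name `gather_genre_vectorized` and the toy table are invented for illustration, and the sketch assumes that docid is the index of the frame rather than a column.

```python
import pandas as pd

# Invented toy metadata, indexed by docid.
toy = pd.DataFrame(
    {'genres': ['Novel|Love stories', 'Horror tales', None, 'Love stories'],
     'inferreddate': [1850, 1650, 1900, 1990]},
    index=pd.Index(['vol1', 'vol2', 'vol3', 'vol4'], name='docid'))

def gather_genre_vectorized(frame, genreset, earliest=1700, latest=2010):
    """Return the set of docids whose pipe-delimited genres include
    any heading in genreset (hypothetical helper)."""
    # Keep only volumes dated inside the window; missing dates drop out.
    dated = frame[frame['inferreddate'].between(earliest, latest)]
    # One row per (docid, heading); volumes without genres drop out.
    headings = dated['genres'].dropna().str.split('|').explode()
    return set(headings[headings.isin(genreset)].index)

love = gather_genre_vectorized(toy, {'Love stories'})
horror = gather_genre_vectorized(toy, {'Horror tales'})
print(love, horror)
```

In the toy table the only horror volume is dated 1650, so it falls outside the window and the second set comes back empty.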

In [14]:
subjects2find = {'Subj: Detective': {'Women detectives', 'Detective and mystery stories', 'Detective stories', 
                                     'Mystery and detective stories', 'Detective and mystery stories, French', 
                                     'Detective and mystery stories, New Zealand', 'Detective and mystery stories, Danish', 
                                     'Detective and mystery stories, Scottish', 'Detective and mystery stories, English', 
                                     'Detective and mystery stories, American'},
                 'Subj: Man-woman': {'Man-woman relationships', 'Marriage'}, 
                 'Subj: SF, Other': {'Science fiction, French', 'Science fiction, Canadian', 
                              'Science fiction, Polish', 'Science fiction, Australian', 
                              'Science fiction, Russian', 'Science fiction, English', 
                              'Science fiction'},
                 'Subj: SF, American': {'Science fiction, American'},
                 'Subj: Short stories, Other': {'Short stories, Icelandic', 'Short stories, Nepali',
                                                'Short stories, Serbian', 'Short stories, Turkish', 
                                                'Short stories, Ghanaian (English)', 'Short stories, Indic (English)',
                                                'Short stories in Russian, 1917-1945 - English texts', 
                                                'Short stories in French, 1900-1945 - English texts', 
                                                'Short stories, Ukrainian', 'Short stories, Caribbean (English)', 
                                                'Short stories, Sindhi', 'Short stories, Urdu', 'Short stories, Arabic', 
                                                'Short stories, Vietnamese', 'Short stories, Lithuanian', 
                                                'Short stories, African (English)', 'Short stories, Malaysian', 
                                                'Short stories, Australian', 'Short stories, Pakistani', 
                                                'Short stories, Irish', 'Short stories, South African (English)', 
                                                'Short stories, Bulgarian', 'Short stories, Panjabi', 
                                                'Short stories, Canadian', 'Short stories, New Zealand', 
                                                'Short stories, Norwegian', 'Short stories, Dutch', 
                                                'Short stories, Malay', 'Short stories, Tarascan', 
                                                'Short stories, English', 'Short stories, South African'},
                 'Subj: Short stories, American': {'Short stories, American'},
                 'Subj: Fairy tales': {'Fairy tales, American', 'Fairy tales, Scottish',
                                       'Fairy tales.', 'Fairy tales, Japanese', 
                                       'Fairy tales, German', 'Fairy tales, English', 
                                       'Fairy tales, French', 'Fairy tales'}, 
                 "Subj: Fantasy": {'Fantasy fiction, Scottish', 'Fantasy games', 'Fantasy fiction, Chinese',
                                   'Fantasy', 'Fantasy fiction, Yiddish', 'Fantasy fiction, French', 
                                   'Fantasy fiction', 'Fantasy fiction, Russian', 'Fantasy fiction, English',
                                   'Fantasy fiction, American', 'Fantasy fiction, Romanian'},
                 'Subj: Horror': {'Horror short stories in English, 1837-1945 - Anthologies', 
                                  'Horror tales, Singaporean', 'Horror & ghost stories', 'Horror tales',
                                  'Horror tales, Irish',  'Horror tales, American', 'Horror tales, Scottish',
                                  'Horror stories', 'Horror tales, Canadian', 'Horror tales, English',
                                  'Ghost stories', 'Ghosts'},
                 'Subj: History': {'History'},
                 'Subj: Humor': {'English wit and humor', 'Humor',
                          'American wit and humor', 'Humorous stories, American', 'Humorous stories'},
                 'Subj: Juvenile': {"Juvenile literature", "Juvenile fiction", "Children's stories",
                                   "Children's stories, American"}
                }

In [15]:
def hassubject(row, subj2match):
    ''' A little different from hasgenre, because subj2match is
    a *set* of headings: the row matches if any one of its
    subject headings is a member of that set.
    '''
    subjects = row.subjects
    if pd.isnull(subjects) or pd.isnull(row.inferreddate):
        return None
    elif int(row.inferreddate) < 1700 or int(row.inferreddate) > 2010:
        return None
    else:
        subjects = subjects.split('|')
        for s in subjects:
            if s in subj2match:
                return row.docid
        
    return None

def gathersubject(subj):
    global meta
    thisset = set(meta.apply(hassubject, args = ([subj]), axis = 1))
    return thisset

for name, category in subjects2find.items():
    print(name)
    examples = gathersubject(category)
    # discard rather than remove: None is absent if every row matched
    examples.discard(None)
    category_dict[name] = examples

Subj: Short stories, Other
Subj: History
Subj: Humor
Subj: Man-woman
Subj: SF, Other
Subj: Fantasy
Subj: Fairy tales
Subj: SF, American
Subj: Short stories, American
Subj: Detective
Subj: Juvenile
Subj: Horror


In [16]:
len(category_dict)

32

In [17]:
keepset = set()
for category, aset in category_dict.items():
    keepset = keepset.union(aset)
print(len(keepset))

22448


In [24]:
meta = meta.loc[list(keepset), :]
meta.shape

(22448, 29)

In [25]:
experimentcats = []
for docid, row in meta.iterrows():
    belongsto = []
    for cat, aset in category_dict.items():
        if docid in aset:
            belongsto.append(cat)
    if len(belongsto) < 1:
        print('error')
    ecat = '|'.join(belongsto)
    experimentcats.append(ecat)
            

In [26]:
meta = meta.assign(experimentcat = experimentcats)

In [27]:
meta.to_csv('experiment_meta.csv', index_label = 'docid')

In [32]:
category_dict = dict()

for docid, row in meta.iterrows():
    categories = row['experimentcat'].split('|')
    for c in categories:
        if c not in category_dict:
            category_dict[c] = []
        category_dict[c].append(docid)
len(category_dict)

32

In [34]:
allcategories = list(category_dict.keys())
allcategories

['Subj: History',
 'Mystery',
 'Subj: Juvenile',
 'Biographical',
 'Subj: Man-woman',
 'Love',
 'War',
 'Domestic',
 'Short stories',
 'Subj: Fantasy',
 'Horror',
 'Humor',
 'Subj: Fairy tales',
 'Western',
 'Fantasy',
 'SF',
 'Suspense',
 'Subj: Short stories, Other',
 'Adventure',
 'Psychological',
 'Novel',
 'Political',
 'Christian',
 'Subj: SF, American',
 'Subj: Short stories, American',
 'Juvenile',
 'Subj: Detective',
 'Subj: SF, Other',
 'Historical',
 'Bildungsroman',
 'Subj: Humor',
 'Subj: Horror']

In [36]:
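def sample_bag_of_years(bag_of_years, yeardict):
    ''' Draws up to 25000 volumes whose dates follow the distribution
    of years in bag_of_years. Null years are skipped, so the sample
    can be slightly shorter than 25000.
    '''
    sample = []
    for i in range(25000):
        b = random.choice(bag_of_years)
        if not pd.isnull(b):
            vol = random.choice(yeardict[b])
            sample.append(vol)
    return sample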

def labelprob(sample, label, theprior, otherlabel):
    global category_dict
    count = 0
    bothcount = 0
    
    for s in sample:
        if s in category_dict[label]:
            count += 1
            bothcount += 1
        elif theprior > 0 and theprior < 1 and s in category_dict[otherlabel]:
            bothcount += 1
    
    if theprior > 0 and theprior < 1:
        count = (count + (bothcount * theprior)) / 2
        
    # adding 0.1 is a form of additive smoothing (Laplace smoothing
    # proper would add 1); it keeps the probability above zero
    return (count + 0.1) / len(sample)

def labelxyprob(sample, labelx, labely, theprior):
    global category_dict
    count = 0
    candidates = 0
    for s in sample:
        if s in category_dict[labelx] and s in category_dict[labely]:
            count += 1
            candidates += 1
        elif s in category_dict[labelx]:
            candidates += 1
        elif s in category_dict[labely]:
            candidates += 1
    
    if theprior > 0 and theprior < 1:
        priorcount = candidates * theprior
        count = (count + priorcount) / 2
        
    # This prior is informative only in a small number of cases
    # where we expect a strong match, and know that the data
    # will underrepresent the match, because of the difference
    # of "subjects" and "genres."
    
    # Most comparisons will have
    # no prior, because we don't actually let theprior == 0
    # drag a comparison down below observed evidence.
        
    # adding 0.1 is a form of additive smoothing (Laplace smoothing
    # proper would add 1); it keeps the log in PMI defined when count is 0
    return (count + 0.1) / len(sample)

yeardict = dict()
for yr in range(1700, 2100):
    yeardict[yr] = meta.index[meta['inferreddate'] == yr].tolist()

pmidict = dict()

for name1, ex1 in category_dict.items():
    print(name1)
    for name2, ex2 in category_dict.items():
        
        if name1 not in pmidict:
            pmidict[name1] = dict()
            
        if name2 in pmidict and name1 in pmidict[name2]:
            pmidict[name1][name2] = pmidict[name2][name1]
            
        else:
            bag1 = meta.loc[ex1, 'inferreddate']
            bag2 = meta.loc[ex2, 'inferreddate']
            jointbag = list(bag1) + list(bag2)
            sample_of_docids = sample_bag_of_years(jointbag, yeardict)
            
            if name1 == name2:
                theprior = 1
                # Where two categories perfectly coincide, PMI reduces to
                # -log p(x). Setting theprior to 1 skips the prior adjustment,
                # so we get exactly that: small categories come out more
                # self-similar than large ones.
                
            elif name1 in priors and name2 in priors[name1]:
                theprior = priors[name1][name2]
            else:
                theprior = 0
            
            
            prob2 = labelprob(sample_of_docids, name2, theprior, name1)
            prob1 = labelprob(sample_of_docids, name1, theprior, name2) 
            jointprob = labelxyprob(sample_of_docids, name1, name2, theprior)
            pmidict[name1][name2] = math.log(jointprob / (prob1 * prob2))

Subj: History
Mystery
Subj: Juvenile
Biographical
Subj: Man-woman
Love
War
Domestic
Short stories
Subj: Fantasy
Horror
Humor
Subj: Fairy tales
Western
Fantasy
SF
Suspense
Subj: Short stories, Other
Adventure
Psychological
Novel
Political
Christian
Subj: SF, American
Subj: Short stories, American
Juvenile
Subj: Detective
Subj: SF, Other
Historical
Bildungsroman
Subj: Humor
Subj: Horror
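The logic of that loop is easier to see on a toy corpus. The sketch below leaves out the priors and keeps only the year-matched sampling and the smoothed PMI. Everything in it is invented for illustration: the corpus, the labels 'X', 'Y', 'Z', and the function name `conditional_pmi`.

```python
import math
import random

# Invented toy corpus: three volumes per year, 1900-1999.
# docid -> (year, set of labels)
volumes = {}
for i in range(300):
    year = 1900 + i // 3
    labels = set()
    if year >= 1970 and i % 3 == 0:
        labels.add('X')
        if year % 2 == 0:
            labels.add('Y')      # every 'Y' volume is also an 'X' volume
    if year < 1930 and i % 3 == 1:
        labels.add('Z')          # 'Z' never shares a volume (or decade) with 'X'
    volumes['vol%d' % i] = (year, labels)

def conditional_pmi(volumes, label_x, label_y, rng, draws=5000, smooth=0.1):
    """PMI measured in a sample whose years follow the combined
    chronological distribution of the two labels (no priors)."""
    by_year = {}
    for docid, (year, labels) in volumes.items():
        by_year.setdefault(year, []).append(docid)
    # Bag of years: one entry per volume carrying each label.
    bag = [year for year, labels in volumes.values() if label_x in labels]
    bag += [year for year, labels in volumes.values() if label_y in labels]
    # Pick a year from the bag, then any volume published in that year.
    sample = [rng.choice(by_year[rng.choice(bag)]) for _ in range(draws)]
    n_x = sum(label_x in volumes[d][1] for d in sample)
    n_y = sum(label_y in volumes[d][1] for d in sample)
    n_xy = sum(label_x in volumes[d][1] and label_y in volumes[d][1]
               for d in sample)
    p_x = (n_x + smooth) / draws
    p_y = (n_y + smooth) / draws
    p_xy = (n_xy + smooth) / draws
    return math.log(p_xy / (p_x * p_y))

rng = random.Random(0)
pmi_xy = conditional_pmi(volumes, 'X', 'Y', rng)   # nested categories
pmi_xz = conditional_pmi(volumes, 'X', 'Z', rng)   # disjoint categories
print(pmi_xy, pmi_xz)
```

Nested categories come out clearly positive, while the disjoint pair is held at a large negative value by the smoothing term alone, which is the same pattern as the strongly negative entries in the real results.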


In [37]:
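pmidf = pd.DataFrame(pmidict)
# Make sure each category is at least as similar to itself
# as it is to any other category: where some other entry in
# the row exceeds the diagonal, raise the diagonal to match.
for idx in pmidf.index:
    maximum = max(pmidf.loc[idx, : ])
    if maximum > pmidf.loc[idx, idx]:
        print(idx, maximum)
        pmidf.loc[idx, idx] = maximum
pmidf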

Subj: History 1.25516761144
Subj: SF, American 2.35951916418


,Adventure,Bildungsroman,Biographical,Christian,Domestic,Fantasy,Historical,Horror,Humor,Juvenile,...,Subj: Humor,Subj: Juvenile,Subj: Man-woman,"Subj: SF, American","Subj: SF, Other","Subj: Short stories, American","Subj: Short stories, Other",Suspense,War,Western
Adventure,3.210808,-0.137137,-0.831807,-1.742076,-1.053973,-0.03828,0.314062,-5.367066,0.111353,-0.824162,...,-5.459706,-0.617602,-0.902073,-5.817529,-1.116793,-5.459034,-6.044068,-1.83777,-0.343853,-1.114986
Bildungsroman,-0.137137,2.992458,-5.261986,-0.365743,0.722973,-2.702175,0.003912,-1.405142,1.382844,-2.163213,...,-5.593229,-6.494692,-0.232881,-6.086143,-5.755195,-5.988591,-6.409714,-1.33021,1.566013,0.330802
Biographical,-0.831807,-5.261986,3.756559,2.096735,-1.67118,-4.873258,1.350132,-4.727007,-5.031379,-5.285394,...,-4.705113,-5.534253,-5.220972,-5.677493,-5.088522,-5.189215,-5.681046,-5.381418,-0.283552,1.687193
Christian,-1.742076,-0.365743,2.096735,3.986531,0.433622,1.074846,-0.276719,-1.404781,-0.551646,-5.953711,...,-4.54667,-5.351224,-0.4746,-4.88289,-4.68934,-4.559862,-5.073565,-1.430505,-4.195581,-3.671069
Domestic,-1.053973,0.722973,-1.67118,0.433622,2.271823,-3.491713,0.215012,-1.638215,0.048558,-3.701644,...,-2.178758,-5.025454,0.162706,-6.656919,-6.127699,-2.772482,-7.195721,-1.11297,-0.31366,-0.531564
Fantasy,-0.03828,-2.702175,-4.873258,1.074846,-3.491713,3.360324,-0.060898,0.94116,-0.703737,-1.622516,...,-5.190966,-1.983523,-1.151852,-1.493576,-1.401766,-3.119902,-6.064025,-0.716878,-1.071065,-4.371867
Historical,0.314062,0.003912,1.350132,-0.276719,0.215012,-0.060898,2.602556,-1.159305,-1.221403,-3.506019,...,-3.545615,-4.123531,-0.139345,-6.458985,-6.062024,-6.431076,-6.919996,-0.541948,2.003924,0.857573
Horror,-5.367066,-1.405142,-4.727007,-1.404781,-1.638215,0.94116,-1.159305,3.217776,-5.390445,-5.878844,...,-5.025702,-5.59421,-5.395814,-2.677827,-2.478933,-1.337562,-5.71361,-1.209797,-4.957712,-0.243971
Humor,0.111353,1.382844,-5.031379,-0.551646,0.048558,-0.703737,-1.221403,-5.390445,3.223788,-2.718562,...,2.802078,-6.220726,0.085492,-5.961779,-5.630952,-2.661365,-6.237945,-1.297421,-5.191267,-4.550342
Juvenile,-0.824162,-2.163213,-5.285394,-5.953711,-3.701644,-1.622516,-3.506019,-5.878844,-2.718562,0.904056,...,-5.8562,0.849397,-2.863982,-7.127108,-6.880446,-3.260566,-7.985101,-6.536959,-2.67261,-6.358438
