Try WordNet for creating the specificity manipulation

In [2]:
from nltk.corpus import wordnet as wn

NLTK's WordNet interface can find the hypernyms (more general terms) and hyponyms (more specific terms) of a synset.

In [17]:
wn.synsets('dog', pos=wn.NOUN)[0].hyponyms()

[Synset('basenji.n.01'),
 Synset('corgi.n.01'),
 Synset('cur.n.01'),
 Synset('dalmatian.n.02'),
 Synset('great_pyrenees.n.01'),
 Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'),
 Synset('lapdog.n.01'),
 Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'),
 Synset('newfoundland.n.01'),
 Synset('pooch.n.01'),
 Synset('poodle.n.01'),
 Synset('pug.n.01'),
 Synset('puppy.n.01'),
 Synset('spitz.n.01'),
 Synset('toy_dog.n.01'),
 Synset('working_dog.n.01')]

One idea:
1. We have three levels of specificity (low, mid, high). For the 'mid' level, sample nouns that vary in frequency, age of acquisition, and length.
2. For each of those words, find a corresponding hypernym and hyponym, which gives us a triad.
3. The three levels should then have distributions whose locations are incrementally higher (low < mid < high), but on the whole we would at least have 'low' words that are low frequency and 'high' words that are high frequency.
Would this work?
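
Step 2 above could be sketched roughly as follows. This is only an illustration: `make_triad` and `lemma` are hypothetical helper names, and taking the first sense and the first hypernym/hyponym is an assumption, not a decision made in these notes. The `lookup` argument is passed in so the function does not depend on how synsets are retrieved.

```python
def lemma(synset):
    """Strip the POS and sense suffix, e.g. 'great_pyrenees.n.01' -> 'great_pyrenees'."""
    return synset.name().split('.')[0]

def make_triad(word, lookup):
    """Return {'low': hypernym, 'mid': word, 'high': hyponym}, or None.

    `lookup` maps a word to a list of synset-like objects exposing
    .name(), .hypernyms() and .hyponyms(), for example
    lambda w: wn.synsets(w, pos=wn.NOUN) with NLTK's WordNet.
    """
    synsets = lookup(word)
    if not synsets:
        return None
    sense = synsets[0]  # assumption: use the first listed sense
    hypernyms, hyponyms = sense.hypernyms(), sense.hyponyms()
    if not hypernyms or not hyponyms:
        return None  # a triad needs both a more general and a more specific word
    return {'low': lemma(hypernyms[0]), 'mid': word, 'high': lemma(hyponyms[0])}

# Possible usage with NLTK (needs the WordNet corpus to be downloaded):
# from nltk.corpus import wordnet as wn
# make_triad('dog', lambda w: wn.synsets(w, pos=wn.NOUN))
```

Words without both a hypernym and a hyponym return None, so the number of usable triads may be smaller than the number of sampled 'mid' words.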

Actually, use LexOps to find pairs that differ in frequency, perhaps with distribution locations that are quite close, whilst matching on concreteness, age of acquisition, and lexical decision response time. Then use the low-frequency side to find hypernyms and the high-frequency side to find hyponyms.
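
A rough sketch of that second step, choosing the WordNet relation from the LexOps condition. The names `RELATION_BY_CONDITION`, `related_word` and `add_related_words` are hypothetical; the condition labels and the 'string' and 'condition' keys are assumed to match the LexOps output columns.

```python
# Which WordNet relation to follow for each LexOps condition
RELATION_BY_CONDITION = {'LowFreq': 'hypernyms', 'HighFreq': 'hyponyms'}

def related_word(word, condition, lookup):
    """Return the first related lemma in the direction given by `condition`.

    `lookup` maps a word to a list of synset-like objects, for example
    lambda w: wn.synsets(w, pos=wn.NOUN) with NLTK's WordNet.
    Returns None when the word or the relation has no entry.
    """
    synsets = lookup(word)
    if not synsets:
        return None
    # Call .hypernyms() or .hyponyms() on the first listed sense
    related = getattr(synsets[0], RELATION_BY_CONDITION[condition])()
    if not related:
        return None
    return related[0].name().split('.')[0]

def add_related_words(rows, lookup):
    """Copy each row (a dict with 'string' and 'condition') and add a 'related' key."""
    return [dict(row, related=related_word(row['string'], row['condition'], lookup))
            for row in rows]

# Possible usage with the LexOps table loaded as a DataFrame:
# rows = lexops.to_dict('records')
# add_related_words(rows, lambda w: wn.synsets(w, pos=wn.NOUN))
```

Keeping the condition-to-relation mapping in one dictionary makes it easy to flip the direction if the opposite assignment turns out to work better.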

In [18]:
import pandas as pd

lexops = pd.read_csv('lexops_output_frequency_manipulation.csv')
lexops

Unnamed: 0,item_nr,condition,match_null,string,Zipf.SUBTLEX_UK,AoA.Kuperman,CNC.Brysbaert,RT.BLP
0,1,LowFreq,A1,earphone,1.649978,9.45,4.74,613.184210
1,1,HighFreq,A1,cocoon,2.841864,8.95,4.83,598.351351
2,2,LowFreq,A1,tuft,2.451611,9.87,3.85,699.962963
3,2,HighFreq,A1,pauper,2.897133,9.44,3.80,680.580645
4,3,LowFreq,A2,goggle,2.348948,9.44,4.42,623.083333
...,...,...,...,...,...,...,...,...
235,118,HighFreq,A1,uproar,2.913220,9.55,3.00,666.277778
236,119,LowFreq,A1,thinness,2.158134,7.78,2.76,622.454545
237,119,HighFreq,A1,coolness,2.625155,7.90,2.67,625.405405
238,120,LowFreq,A2,floorboard,2.411739,9.35,4.89,671.285714


In [119]:
import numpy as np

def first_hypernym(word):
    """Return the lemma of the first hypernym of the word's first noun synset, or NaN."""
    wordnet_entries = wn.synsets(word, pos=wn.NOUN)
    if not wordnet_entries:
        return np.nan
    hypernym_entries = wordnet_entries[0].hypernyms()
    if not hypernym_entries:
        return np.nan
    # Synset names look like 'pain.n.01'; keep only the lemma part
    return hypernym_entries[0].name().split('.')[0]

lexops['hypernym'] = lexops['string'].apply(first_hypernym)
# .copy() so that columns added later do not modify a slice of lexops
lexops_hypernym_available = lexops[~lexops['hypernym'].isna()].copy()
print(lexops_hypernym_available[['string','hypernym']].sample(50))

         string                hypernym
51         grit               sandstone
71      plywood                laminate
75        chime   percussion_instrument
217     transit    surveying_instrument
139        gall          animal_disease
182    molehill                   knoll
145      martyr                  victim
142    soreness                    pain
175      mantel                   shelf
107      damper                   plate
181    preacher               clergyman
134    bootlace                    lace
190       colic                    pain
232     checkup             examination
31       seesaw               plaything
114     slobber                  saliva
88   sculptress                sculptor
27        broom      cleaning_implement
66    nursemaid                  keeper
136      stanza                    text
213     showman                  person
60      corsage      flower_arrangement
118    conclave                 meeting
94       cutlet                   piece


In [121]:
resources = [('zipf','../resources/subtlex_uk.csv','LogFreq(Zipf)'),
             ('CNC_M','../resources/brysbaert_etal_2014.csv','Conc_M'),
             ('CNC_SD','../resources/brysbaert_etal_2014.csv','Conc_SD'),
             ('imageability','../resources/scott_etal_2019.csv','IMAG'),
             ('valence','../resources/scott_etal_2019.csv','VAL'),
             ('AoA','../resources/kuperman_etal_2012.csv','Rating.Mean'),
             ('RT','../resources/blp-items.xls','rt')]

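def get_stimulus_property(resource, dataframe):
    """Merge one lexical property of each hypernym into `dataframe`.

    `resource` is a (property name, file path, source column) tuple.
    """
    stim_property, resource_fname, col_name = resource
    # The BLP items file is an Excel sheet; the other resources are CSVs
    if stim_property == 'RT':
        resource_df = pd.read_excel(resource_fname, index_col=0)
    else:
        resource_df = pd.read_csv(resource_fname, index_col=0)
    # Look up each hypernym in the resource index; unmatched hypernyms get NaN
    dataframe = dataframe.merge(resource_df[col_name].rename(stim_property), left_on='hypernym', right_index=True, how='left')
    dataframe = dataframe.astype({stim_property:'float'})
    return dataframe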

# count word length in number of letters
lexops_hypernym_available['length'] = lexops_hypernym_available['string'].str.len()

for resource in resources:
    lexops_hypernym_available = get_stimulus_property(resource=resource, dataframe=lexops_hypernym_available)

lexops_hypernym_available

  resource_df = pd.read_csv(resource_fname, index_col=0)


KeyError: "Only a column name can be used for the key in a dtype mappings argument. 'zipf' not found in columns."
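
The KeyError says the merged frame had no 'zipf' column when `astype` ran. One plausible cause (an assumption; the output above does not confirm it) is that the cell was re-run on a frame that already had a 'zipf' column. In that case `merge` renames the clashing columns to 'zipf_x' and 'zipf_y'. Below is a minimal sketch of a variant that can be re-run safely. `add_property` is a hypothetical helper name, and the small in-memory frames with toy values stand in for the real resource files.

```python
import pandas as pd

def add_property(dataframe, values, name, key='hypernym'):
    """Attach `values` (a Series indexed by word) to `dataframe` as column `name`."""
    # Drop any column left over from an earlier run, so merge cannot
    # produce suffixed duplicates such as name_x / name_y.
    dataframe = dataframe.drop(columns=[name], errors='ignore')
    # Keep one value per word; duplicated index entries would multiply rows.
    values = values[~values.index.duplicated(keep='first')]
    merged = dataframe.merge(values.rename(name), left_on=key,
                             right_index=True, how='left')
    # Coerce to float; entries that cannot be parsed become NaN.
    merged[name] = pd.to_numeric(merged[name], errors='coerce')
    return merged

# Toy stand-ins for the LexOps frame and a frequency resource (invented values).
stimuli = pd.DataFrame({'string': ['colic', 'mantel'],
                        'hypernym': ['pain', 'shelf']})
toy_zipf = pd.Series({'pain': '5.0', 'shelf': '4.0'})

once = add_property(stimuli, toy_zipf, 'zipf')
twice = add_property(once, toy_zipf, 'zipf')  # running it again is harmless
print(list(twice.columns))
```

If the real cause turns out to be different, checking `dataframe.columns` right after the merge should show which column names were actually produced.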