# Wordnet words

Here's how we made the base dataset. The steps are:
* Get a list of the most frequent (English) words
* Get embeddings for each of these words
* Get planar projections of these embeddings
* Link the words in various ways (i.e., make the link data)

In [1]:
# Set project rootdir here
rootdir = ""

# ... or, if you'll be sharing this notebook, make it so that the rootdir will be entered by the user
# and placed in a config file...
if not rootdir:
    from config2py import config_getter  # pip install config2py

    # If the env variable is not set, running this will ask the user to enter the rootdir
    # and it will save it for them for future use
    rootdir = config_getter('WORDNET_WORDS_PROJECT_ROOTDIR') 


In [2]:
from imbed_data_prep.wordnet_words import WordsDacc

dacc = WordsDacc(rootdir)

## Peeking at the WordsDacc data accessor

`WordsDacc` is your entry point to all dataset items. 
You instantiate it with a rootdir and then ask for data items. 
If an item is already stored in your cache (under the rootdir), it is loaded from there; 
if not, it is computed (downloaded, prepared, etc.) and stored for future use. 
You can refresh (re-download, re-compute, etc.) any data item simply by deleting its file in your rootdir. 
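
The load-or-compute behavior described above can be sketched as follows. This is only an illustration of the general pattern, not the actual `WordsDacc` implementation: the `get_or_compute` helper, the JSON storage format, and the file name are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def get_or_compute(rootdir, name, compute):
    """Return the item stored under rootdir/name, computing and saving it if absent."""
    path = Path(rootdir) / name
    if path.exists():
        return json.loads(path.read_text())  # cache hit: read from disk
    value = compute()  # cache miss: build the item (download, prepare, etc.)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(value))
    return value


calls = []  # records how many times the expensive step really ran


def expensive_word_list():
    calls.append(1)
    return ["a", "aa", "aaa"]  # toy data standing in for a real download


demo_root = tempfile.mkdtemp()  # hypothetical rootdir for the demo
first = get_or_compute(demo_root, "word_list.json", expensive_word_list)
second = get_or_compute(demo_root, "word_list.json", expensive_word_list)  # served from the file
(Path(demo_root) / "word_list.json").unlink()  # deleting the file forces a refresh
third = get_or_compute(demo_root, "word_list.json", expensive_word_list)
```

In this sketch the expensive function runs only on the first and third calls, which mirrors the "delete the file to refresh" workflow.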

In [3]:
from imbed_data_prep.wordnet_words import *

dacc = WordsDacc(rootdir)

We start with two raw source data items. 
The first is the list of WordNet words (chosen because WordNet provides many linguistic features for them):

In [4]:
print(f"{len(dacc.wordnet_words)=}")
dacc.wordnet_words[1000:1005]

len(dacc.wordnet_words)=147306


['acceptation', 'accepted', 'accepting', 'acceptive', 'acceptor']

The second is a word frequency dataset (counts of words in a very large corpus):

In [5]:
print(f"{dacc.word_counts.shape}")
dacc.word_counts.head()

(333333,)


word
the    23135851162
of     13151942776
and    12997637966
to     12136980858
a       9081174698
Name: count, dtype: int64

We take the intersection of both datasets as our `word_list`. 
Note that you could also specify your own `word_list` via an argument of that name when 
making a `dacc = WordsDacc(..., word_list=[...])` instance.

In [6]:
print(f"{len(dacc.word_list)=}")

len(dacc.word_list)=52078


For each word in `word_list`, `wordnet_metadata` (a large dataframe) holds the information WordNet has about it. 

Each row is indexed by a "lemma" id. Without going too deep into linguistic theory, we should at least mention this: 
a "word" (or "lemma name") is a string of characters that can have several meanings (each indexed here by a "synset"). 
Each (word, meaning) combination (indexed by a "lemma") therefore has its own characteristics (definition, pos ("part of speech"), etc.).
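
The lemma ids shown in this notebook (for example `cast.v.03.casting`) appear to be a synset name followed by a dot and the word. Here is a minimal sketch of splitting one apart, assuming that `<synset>.<word>` layout and the usual WordNet one-letter POS codes. `split_lemma_id` is a hypothetical helper written for illustration, not part of `imbed_data_prep`.

```python
import re

# Assumed layout: "<synset name>.<word>", where a synset name ends in
# ".<pos letter>.<two-digit sense number>" (e.g. "cast.v.03").
LEMMA_ID = re.compile(r"^(?P<synset>.+?\.(?P<pos>[nvasr])\.\d{2})\.(?P<word>.+)$")

# WordNet POS letters mapped to the pos names used in the metadata
POS_NAMES = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective_satellite",
    "r": "adverb",
}


def split_lemma_id(lemma_id):
    """Split a lemma id into its synset name, pos name, and word."""
    match = LEMMA_ID.match(lemma_id)
    if match is None:
        raise ValueError(f"Not a recognized lemma id: {lemma_id!r}")
    return {
        "synset": match["synset"],
        "pos": POS_NAMES[match["pos"]],
        "word": match["word"],
    }


parts = split_lemma_id("cast.v.03.casting")
```

The same word can therefore appear in many lemma ids, one per synset (meaning) it belongs to.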

In [7]:
meta = dacc.wordnet_metadata
print(f"{meta.shape}")
meta.loc['cast.v.03.casting']

(123587, 29)


word                                                            casting
synset                                                        cast.v.03
example                  He cast a young woman in the role of Desdemona
definition            select to play,sing, or dance a part in a play...
lexname                                                   verb.creation
name                                                          cast.v.03
pos                                                                verb
attributes                                                           []
causes                                                               []
entailments                       [film.v.02, perform.v.01, stage.v.01]
hypernyms                                               [delegate.v.02]
hyponyms                     [recast.v.01, typecast.v.01, miscast.v.01]
in_region_domains                                                    []
in_topic_domains                                                

Let's have a look at what different POS ([part-of-speech](https://www.geeksforgeeks.org/nlp-part-of-speech-default-tagging/))
categories we have in this data:

In [8]:
meta.pos.value_counts()

pos
noun                   64228
verb                   34733
adjective_satellite    14108
adjective               6989
adverb                  3529
Name: count, dtype: int64

We used an OpenAI embeddings model to compute the embeddings of each (individual) word:

In [9]:
print(f"{dacc.words_embeddings.shape=}")
print(f"Vector size: {len(dacc.words_embeddings.iloc[0].embedding)}")
dacc.words_embeddings.iloc[0]

dacc.words_embeddings.shape=(52078, 1)
Vector size: 1536


embedding    [0.02674213983118534, 0.008698769845068455, -0...
Name: a, dtype: object

We can then project these high-dimensional vectors onto the plane using any of a number of methods.

Here we use UMAP.

In [10]:
dacc.umap_embeddings.iloc[100:105]

word,umap_x,umap_y
abnormal,0.814811,9.603489
abnormality,0.899243,9.588635
abnormally,0.872207,9.655293
abo,2.871932,3.976475
aboard,0.326998,5.151422


The `wordnet_feature_meta` dataframe joins several of these data aspects together.

In [22]:
print(f"{dacc.wordnet_feature_meta.shape=}")
dacc.wordnet_feature_meta.iloc[0]

dacc.wordnet_feature_meta.shape=(123587, 8)


word                                                          a
frequency                                              0.015441
definition    a metric unit of length equal to one ten billi...
lexname                                           noun.quantity
name                                              angstrom.n.01
pos                                                        noun
umap_x                                                 3.027916
umap_y                                                 3.760965
Name: angstrom.n.01.a, dtype: object
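
A table like the one above combines lemma-level rows with word-level features (frequency and planar coordinates). The following is a rough sketch of such a join using plain dictionaries. All values are made up, `join_features` is a hypothetical helper, and treating `frequency` as a word's count divided by the total count is an assumption, not something this notebook confirms.

```python
# Toy stand-ins for the real tables (all values are invented for illustration)
lemma_rows = {
    "angstrom.n.01.a": {"word": "a", "pos": "noun"},
    "cast.v.03.casting": {"word": "casting", "pos": "verb"},
}
word_counts = {"a": 900, "casting": 100}
umap_xy = {"a": (3.0, 3.8), "casting": (1.0, 2.0)}


def join_features(lemma_rows, word_counts, umap_xy):
    """Attach word-level features to every lemma-level row, keyed by lemma id."""
    total = sum(word_counts.values())
    joined = {}
    for lemma_id, row in lemma_rows.items():
        word = row["word"]
        x, y = umap_xy[word]
        joined[lemma_id] = {
            **row,
            "frequency": word_counts[word] / total,  # assumed: relative frequency
            "umap_x": x,
            "umap_y": y,
        }
    return joined


features = join_features(lemma_rows, word_counts, umap_xy)
```

Because the word-level features are looked up by word, every lemma of the same word receives the same frequency and coordinates.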

# Making link data

Many of the metadata items we have are lists of synsets (often empty). 
These are available via the `wordnet_collection_meta` dataframe
and can be used to connect synsets to other synsets, and therefore words to other words.

In WordNet, a **synset** (short for "synonym set") is a group of words that share the same meaning or concept. Think of it as a cluster of synonyms that can be used interchangeably in certain contexts without changing the overall meaning. For example, the words "happy," "joyful," and "elated" might belong to the same synset because they convey similar emotions.

The synset relationships are as follows:

* **attributes**: This links a noun naming an attribute to the adjectives that express its values. *Example*: The noun "weight" is an attribute for which the adjectives "light" and "heavy" express values.
* **causes**: This relationship indicates that one synset brings about or results in another. *Example*: "Tickling" (synset) causes "laughter" (synset).
* **entailments**: Primarily used for verbs, this relationship means that one action logically necessitates another. *Example*: If someone is "snoring," it entails that they are "sleeping."
* **hypernyms**: A hypernym is a more general term that encompasses more specific instances. *Example*: "Vehicle" is a hypernym of "car."
* **hyponyms**: A hyponym is a more specific term within a broader category. *Example*: "Poodle" is a hyponym of "dog."
* **in_region_domains**: The inverse of `region_domains`: the synsets that belong to the regional domain this synset denotes. *Example*: A synset for "Australia" might list Australian terms such as "G'day."
* **in_topic_domains**: The inverse of `topic_domains`: the synsets that belong to the subject area this synset denotes. *Example*: A synset for "physics" might list terms such as "quantum."
* **in_usage_domains**: The inverse of `usage_domains`: the synsets whose usage falls under the category this synset denotes. *Example*: A synset for "slang" might list informal terms.
* **instance_hypernyms**: This relationship links a specific instance to its general category. *Example*: "Einstein" is an instance of the hypernym "physicist."
* **instance_hyponyms**: This connects a general category to its specific instances. *Example*: "Physicist" has instance hyponyms like "Einstein" and "Newton."
* **member_holonyms**: This indicates the whole to which a member belongs. *Example*: A "tree" is a member of the holonym "forest."
* **member_meronyms**: This shows the members that constitute a collective whole. *Example*: "Player" is a member meronym of "team."
* **part_holonyms**: This denotes the whole object that a part belongs to. *Example*: A "wheel" is part of the holonym "car."
* **part_meronyms**: This indicates the parts that make up a whole object. *Example*: "Keyboard" is a part meronym of "computer."
* **region_domains**: This specifies the geographical area where a term is used. *Example*: "G'day" is used in the region domain of Australia.
* **root_hypernyms**: This refers to the most general term in a hierarchy. *Example*: For "poodle," the root hypernym might be "entity."
* **similar_tos**: This indicates synsets that are similar in meaning. *Example*: "Big" is similar to "large."
* **substance_holonyms**: This shows the whole that a substance is part of. *Example*: "Bread" is a substance holonym of "flour."
* **substance_meronyms**: This indicates the substances that make up a whole. *Example*: "Alcohol" is a substance meronym of "wine."
* **topic_domains**: This denotes the subject area a term is associated with. *Example*: "Molecule" belongs to the topic domain of chemistry.
* **usage_domains**: This specifies the context in which a term is appropriately used. *Example*: "Thou" is used in archaic or poetic contexts.
* **verb_groups**: This links verbs that are similar in meaning or usage. *Example*: "Run" and "jog" might be in the same verb group.
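
To illustrate how a synset-to-synset relationship can become word-to-word links, here is a minimal sketch that expands each source lemma's related synsets into the lemmas of those synsets. It is an illustration only: `lemma_edges` is a hypothetical function and not necessarily how `dacc.lemma_graph_edges` works, and the toy data is limited to a few hypernym links of the kind shown later in this notebook.

```python
def lemma_edges(lemma_to_related_synsets, synset_to_lemmas):
    """Yield one source/target edge per (source lemma, lemma of a related synset) pair."""
    for source_lemma, related_synsets in lemma_to_related_synsets.items():
        for synset in related_synsets:
            # A related synset with no known lemmas contributes no edges
            for target_lemma in synset_to_lemmas.get(synset, []):
                yield {"source": source_lemma, "target": target_lemma}


# Toy adjacency data: lemma id -> list of hypernym synsets
hypernyms = {
    "deoxyadenosine_monophosphate.n.01.a": ["nucleotide.n.01"],
    "adenine.n.01.a": ["purine.n.01"],
    "purine.n.01.purine": [],  # empty lists are common
}

# Toy lookup: synset name -> lemma ids belonging to that synset
synset_to_lemmas = {
    "nucleotide.n.01": ["nucleotide.n.01.nucleotide", "nucleotide.n.01.base"],
    "purine.n.01": ["purine.n.01.purine"],
}

edges = list(lemma_edges(hypernyms, synset_to_lemmas))
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to obtain a two-column `source`/`target` table.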


In [45]:
from imbed_data_prep.wordnet_words import *

In [59]:
collection_meta = dacc.wordnet_collection_meta
collection_meta.loc['cast.v.03.casting']

attributes                                                    []
causes                                                        []
entailments                [film.v.02, perform.v.01, stage.v.01]
hypernyms                                        [delegate.v.02]
hyponyms              [recast.v.01, typecast.v.01, miscast.v.01]
in_region_domains                                             []
in_topic_domains                                              []
in_usage_domains                                              []
instance_hypernyms                                            []
instance_hyponyms                                             []
member_holonyms                                               []
member_meronyms                                               []
part_holonyms                                                 []
part_meronyms                                                 []
region_domains                                                []
root_hypernyms           

In [60]:
for relationship_name in dacc.wordnet_collection_meta.columns:
    adjacencies = dacc.wordnet_collection_meta[relationship_name]
    link_data = pd.DataFrame(dacc.lemma_graph_edges(adjacencies))
    dacc.df_files[f"link_data/{relationship_name}.parquet"] = link_data


# Visualize this data

In [1]:
# TODO: Get data from remote source

from imbed_data_prep.wordnet_words import WordsDacc

rootdir = __import__('config2py').config_getter('WORDNET_WORDS_PROJECT_ROOTDIR') 

dacc = WordsDacc(rootdir)

In [23]:
from cosmograph import cosmo

# help(cosmo)

In [24]:
print(dacc.wordnet_collection_meta.columns)

Index(['attributes', 'causes', 'entailments', 'hypernyms', 'hyponyms',
       'in_region_domains', 'in_topic_domains', 'in_usage_domains',
       'instance_hypernyms', 'instance_hyponyms', 'member_holonyms',
       'member_meronyms', 'part_holonyms', 'part_meronyms', 'region_domains',
       'root_hypernyms', 'similar_tos', 'substance_holonyms',
       'substance_meronyms', 'topic_domains', 'usage_domains', 'verb_groups'],
      dtype='object')


In [28]:
relationship_name = 'hypernyms'
df = dacc.df_files[f"link_data/{relationship_name}.parquet"]
df.head(3)

Unnamed: 0,source,target
0,deoxyadenosine_monophosphate.n.01.a,nucleotide.n.01.nucleotide
1,deoxyadenosine_monophosphate.n.01.a,nucleotide.n.01.base
2,adenine.n.01.a,purine.n.01.purine


In [None]:
cosmo(links=df, link_source_by='source', link_target_by='target')

In [29]:
from importlib.metadata import version, PackageNotFoundError


version("cosmograph")

PackageNotFoundError: No package metadata was found for cosmograph

In [30]:
t = dacc.wordnet_feature_meta
t.iloc[0]

definition    a metric unit of length equal to one ten billi...
lexname                                           noun.quantity
name                                              angstrom.n.01
pos                                                        noun
Name: angstrom.n.01.a, dtype: object

In [24]:
# Merge the wordnet metadata with the word counts.
# reset_index() turns each named index ('word' and 'lemma', respectively) into a regular column,
# so no renaming is needed before merging on 'word'.
word_counts = dacc.word_counts.reset_index()
wordnet_metadata = dacc.wordnet_metadata.reset_index()
t = pd.merge(word_counts, wordnet_metadata, on='word')
t.iloc[0]

word                                                                  a
count                                                        9081174698
lemma                                                   angstrom.n.01.a
synset                                                    angstrom.n.01
example                                                                
definition            a metric unit of length equal to one ten billi...
lexname                                                   noun.quantity
name                                                      angstrom.n.01
pos                                                                noun
attributes                                                           []
causes                                                               []
entailments                                                          []
hypernyms                                     [metric_linear_unit.n.01]
hyponyms                                                        

In [3]:
from imbed_data_prep.wordnet_words import *
wordnet_feature_attr_names


['word', 'definition', 'lexname', 'name', 'pos']

In [31]:
from cosmograph import cosmo

help(cosmo)

Help on function cosmo in module cosmograph.base:

cosmo(data=None, *, disable_simulation: bool = False, simulation_decay: float = 1000, simulation_gravity: float = 0, simulation_center: float = 0, simulation_repulsion: float = 0.1, simulation_repulsion_theta: float = 1.7, simulation_repulsion_quadtree_levels: float = 12, simulation_link_spring: float = 1, simulation_link_distance: float = 2, simulation_link_dist_random_variation_range: list[typing.Any] = [1, 1.2], simulation_repulsion_from_mouse: float = 2, simulation_friction: float = 0.85, simulation_cluster: float = None, background_color: Union[str, list[float]] = '#222222', space_size: int = 4096, point_color: Union[str, list[float]] = '#b3b3b3', point_greyout_opacity: float = 0.1, point_size: float = 4, point_size_scale: float = 1, hovered_point_cursor: str = None, render_hovered_point_ring: bool = 0.7, hovered_point_ring_color: Union[str, list[float]] = 'white', focused_point_ring_color: Union[str, list[float]] = 0.95, focused_

In [36]:
cosmo(points=pd.DataFrame([[1,2], [3,4], [5,6]]))

Cosmograph(background_color=None, focused_point_ring_color=None, hovered_point_ring_color=None, link_color=Non…

In [5]:
import pandas as pd
from cosmograph import cosmo

In [3]:
df = pd.read_parquet('https://www.dropbox.com/scl/fi/4mnk1e2wx31j9mdsjzecy/wordnet_feature_meta.parquet?rlkey=ixjiiso80s1uk4yhx1v38ekhm&dl=1')
print(f"{df.shape=}")
df.iloc[0]

df.shape=(123587, 8)


word                                                          a
frequency                                              0.015441
definition    a metric unit of length equal to one ten billi...
lexname                                           noun.quantity
name                                              angstrom.n.01
pos                                                        noun
umap_x                                                 3.027916
umap_y                                                 3.760965
Name: angstrom.n.01.a, dtype: object

In [None]:
hyponyms = pd.read_parquet('https://www.dropbox.com/scl/fi/pl72ixv34soo1o8zanfrz/hyponyms.parquet?rlkey=t4d606fmq1uinn29qmli7bx6r&dl=1')
print(f"{hyponyms.shape=}")
hyponyms.iloc[0]

hyponyms.shape=(258896, 2)


source           vitamin_a.n.01.a
target    vitamin_a1.n.01.retinol
Name: 0, dtype: object

In [6]:
g = cosmo(
    df,
    point_id_by='lemma',
    point_label_by='word',
    point_x_by='umap_x',
    point_y_by='umap_y',
    point_color_by='pos',
    point_size_by='frequency',
    point_size_scale=0.01,  # often have to play with this number to get the size right
)
g

Cosmograph(background_color=None, focused_point_ring_color=None, hovered_point_ring_color=None, link_color=Non…

In [7]:
h = cosmo(
    points=df,
    links=hyponyms,
    link_source_by='source',
    link_target_by='target',
    point_id_by='lemma',
    point_label_by='word',
    # point_x_by='umap_x',
    # point_y_by='umap_y',
    point_color_by='pos',
    point_size_by='frequency',
    point_size_scale=0.01,  # often have to play with this number to get the size right
)
h

Cosmograph(background_color=None, focused_point_ring_color=None, hovered_point_ring_color=None, link_color=Non…

# Appendix: WIP and scrap

### Word frequencies

In [2]:
word_frequency_data_url = 'https://github.com/thorwhalen/content/raw/refs/heads/master/tables/csv/zip/english-word-frequency.csv.zip'

# Note: The (..., keep_default_na=False, na_values=[]) is to avoid the words "null" and "nan" being interpreted as NaN
#    see https://www.skytowner.com/explore/preventing_strings_from_getting_parsed_as_nan_for_read_csv_in_pandas
word_counts = pd.read_csv(word_frequency_data_url, keep_default_na=False, na_values=[])
word_counts

Unnamed: 0,word,count
0,the,23135851162
1,of,13151942776
2,and,12997637966
3,to,12136980858
4,a,9081174698
...,...,...
333328,gooek,12711
333329,gooddg,12711
333330,gooblle,12711
333331,gollgo,12711


### Word properties (definition, type, etc.)

In [79]:
from nltk.corpus import wordnet as wn

len(list(wn.all_lemma_names()))

147306

In [46]:
import lexis  # pip install lexis

lemmas = lexis.Lemmas()
len(lemmas)

147306

In [4]:
# The words that are both in the lemmas and in the word_counts
word_list = sorted(set(lemmas) & set(word_counts.word))
len(word_list)

52078

In [124]:
t = lemmas['body']
tt = t['body.n.01']
list(tt)

['examples',
 '__slots__',
 'in_topic_domains',
 'attributes',
 '_examples',
 'frame_ids',
 'part_holonyms',
 'mst',
 '_lemma_pointers',
 'substance_meronyms',
 '__module__',
 '_lexname',
 '_iter_hypernym_lists',
 '__dir__',
 '_instance_hypernyms',
 'name',
 '__hash__',
 '_min_depth',
 'topic_domains',
 'acyclic_tree',
 'in_usage_domains',
 '__doc__',
 'verb_groups',
 '__dict__',
 'entailments',
 'hyponyms',
 'substance_holonyms',
 'hypernym_distances',
 '__weakref__',
 'root_hypernyms',
 'min_depth',
 'member_holonyms',
 '_definition',
 'hypernym_paths',
 'in_region_domains',
 'lemma_names',
 '_name',
 '_pointers',
 'similar_tos',
 'definition',
 '_lemmas',
 '_pos',
 'max_depth',
 'causes',
 'instance_hypernyms',
 'part_meronyms',
 '_max_depth',
 '_needs_root',
 '_hypernyms',
 'instance_hyponyms',
 '_frame_ids',
 'lexname',
 'usage_domains',
 'also_sees',
 'member_meronyms',
 '__sizeof__',
 'hypernyms',
 '__reduce__',
 '_lemma_names',
 '__repr__',
 'lemmas',
 '__str__',
 'offset',
 'r

In [125]:
tt['definition']

'the entire structure of an organism (an animal, plant, or human being)'

In [126]:
dir(tt.store.store)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_all_hypernyms',
 '_definition',
 '_doc',
 '_examples',
 '_frame_ids',
 '_hypernyms',
 '_instance_hypernyms',
 '_iter_hypernym_lists',
 '_lemma_names',
 '_lemma_pointers',
 '_lemmas',
 '_lexname',
 '_max_depth',
 '_min_depth',
 '_name',
 '_needs_root',
 '_offset',
 '_pointers',
 '_pos',
 '_related',
 '_shortest_hypernym_paths',
 '_wordnet_corpus_reader',
 'acyclic_tree',
 'also_sees',
 'attributes',
 'causes',
 'closure',
 'common_hypernyms',
 'definition',
 'entailments',
 'examples',
 'frame_ids',
 'hypernym_distances',
 'hypernym_paths',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains

In [127]:
tt.store.store.lemma_names()

['body', 'organic_structure', 'physical_structure']

In [128]:
tt.store.store.examples()

['he felt as if his whole body were on fire']

In [68]:
lexis.lemma_methods_returning_lemmas
lemma = lemmas['go']
lemma

{'go.n.01': WordnetElement('go.n.01'),
 'adam.n.03': WordnetElement('adam.n.03'),
 'crack.n.09': WordnetElement('crack.n.09'),
 'go.n.04': WordnetElement('go.n.04'),
 'travel.v.01': WordnetElement('travel.v.01'),
 'go.v.02': WordnetElement('go.v.02'),
 'go.v.03': WordnetElement('go.v.03'),
 'become.v.01': WordnetElement('become.v.01'),
 'go.v.05': WordnetElement('go.v.05'),
 'run.v.05': WordnetElement('run.v.05'),
 'run.v.03': WordnetElement('run.v.03'),
 'proceed.v.04': WordnetElement('proceed.v.04'),
 'go.v.09': WordnetElement('go.v.09'),
 'go.v.10': WordnetElement('go.v.10'),
 'sound.v.02': WordnetElement('sound.v.02'),
 'function.v.01': WordnetElement('function.v.01'),
 'run_low.v.01': WordnetElement('run_low.v.01'),
 'move.v.13': WordnetElement('move.v.13'),
 'survive.v.01': WordnetElement('survive.v.01'),
 'go.v.16': WordnetElement('go.v.16'),
 'die.v.01': WordnetElement('die.v.01'),
 'belong.v.03': WordnetElement('belong.v.03'),
 'go.v.19': WordnetElement('go.v.19'),
 'start.v.0

In [67]:
w = lemma['a.n.06']
dir(w)
w.verb_groups()

[]

In [62]:
lexis.lemma_methods_returning_lemmas


Synset('None')

In [None]:
t = lemmas['a']['a.n.06']
type(t)

lexis.KvSynset

In [None]:
from nltk.corpus import wordnet as wn

wn.lemma('salt.n.03.saltiness')

AttributeError: 'NoneType' object has no attribute 'end'

In [None]:

test_words = ['body', 'head', 'hand']

t = pd.DataFrame(wordnet_details(test_words))
print(f"{t.shape}")
t

(70, 22)


Unnamed: 0,word,synset,definition,example,pos,hypernyms,hyponyms,member_holonyms,substance_holonyms,part_holonyms,...,part_meronyms,attributes,also_sees,verb_groups,entailments,causes,similar_tos,domain_topic,domain_region,domain_usage
0,body,body.n.01,the entire structure of an organism (an animal...,he felt as if his whole body were on fire,n,[natural_object.n.01],"[human_body.n.01, life_form.n.01, live_body.n.01]",[],[],[],...,"[arm.n.01, articulatory_system.n.01, body_subs...",[],[],[],[],[],[],"[animal.n.01, homo.n.02]",[],[]
1,body,body.n.02,a group of persons associated by some common t...,the whole body filed out of the auditorium,n,[social_group.n.01],"[administration.n.02, christendom.n.01, church...",[],[],[],...,[],[],[],[],[],[],[],[],[],[]
2,body,body.n.03,a natural object consisting of a dead animal o...,they found the body in the lake,n,[natural_object.n.01],"[cadaver.n.01, carcase.n.01, carrion.n.01, mum...",[],[],[],...,[],[],[],[],[],[],[],[],[],[]
3,body,body.n.04,an individual 3-dimensional object that has ma...,heavenly body,n,[natural_object.n.01],"[chromosome.n.01, inclusion_body.n.01, mass.n....",[],[],[],...,[],[],[],[],[],[],[],[],[],[]
4,body,torso.n.01,the body excluding the head and neck and limbs,they moved their arms and legs and bodies,n,[body_part.n.01],[],[],[],[body.n.01],...,"[abdomen.n.01, back.n.01, belly.n.02, buttock....",[],[],[],[],[],[],[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,hand,hand.n.12,a round of applause to signify approval,give the little lady a great big hand,n,[applause.n.01],[],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
66,hand,hand.n.13,terminal part of the forelimb in certain verte...,the kangaroo's forearms seem undeveloped but t...,n,[forepaw.n.01],[],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
67,hand,hand.n.14,physical assistance,give me a hand with the chores,n,[aid.n.02],[],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
68,hand,pass.v.05,place into the hands or custody of,"hand me the spoon, please",v,[transfer.v.05],"[deal.v.12, entrust.v.01, entrust.v.02, give.v...",[],[],[],...,[],[],[],[],[],[],[],[],[],[]


In [None]:
import pandas as pd
import lexis
lemmas = lexis.Lemmas()

def extract_synset_features(word):
    if word in lemmas:
        for synset_key, synset in lemmas[word].items():
            row = {
                "word": word,
                "synset": synset_key,
                "definition": synset.get('definition', ''),
                "examples": synset.get('examples', []),
                "pos": synset.get('pos', ''),
                "lemma_names": synset.get('lemma_names', []),
                "hypernyms": [h.name() for h in synset.get('hypernyms', [])],
                "hyponyms": [h.name() for h in synset.get('hyponyms', [])],
                "holonyms": [h.name() for h in synset.get('member_holonyms', [])],
                "meronyms": [h.name() for h in synset.get('member_meronyms', [])],
            }
            yield row

# Combine data for all words

rows = []
for word in ['body', 'head', 'hand']:
    rows.extend(extract_synset_features(word))

# Create the dataframe
df = pd.DataFrame(rows)

# Display the dataframe
print(df)

    word      synset                                         definition  \
0   body   body.n.01  the entire structure of an organism (an animal...   
1   body   body.n.02  a group of persons associated by some common t...   
2   body   body.n.03  a natural object consisting of a dead animal o...   
3   body   body.n.04  an individual 3-dimensional object that has ma...   
4   body  torso.n.01     the body excluding the head and neck and limbs   
..   ...         ...                                                ...   
65  hand   hand.n.12            a round of applause to signify approval   
66  hand   hand.n.13  terminal part of the forelimb in certain verte...   
67  hand   hand.n.14                                physical assistance   
68  hand   pass.v.05                 place into the hands or custody of   
69  hand   hand.v.02                guide or conduct or usher somewhere   

                                             examples pos  \
0         [he felt as if his whole bod

In [137]:
word = 'body'
t = lemmas[word]
t['body.n.01']['hypernyms'][0].lemmas()[0]

Lemma('natural_object.n.01.natural_object')


In [93]:
word_counts.word[word_counts.word.isna()]

Series([], Name: word, dtype: object)

In [80]:
import graze as gz
b = gz.graze(word_frequency_data_url)
from dol import zip_decompress
b = zip_decompress(b)
import io
bi = io.BytesIO(b)

In [81]:
t = pd.read_csv(bi, na_values=[])
t.word.nunique()

333331

In [70]:
it = bi.readlines()

In [75]:
it[12819 + 1]

b'nan,3398089\n'

## Getting embeddings of our words

In [4]:
rootdir = '/Users/thorwhalen/Dropbox/_odata/figiri/english_words'

from tabled import DfFiles

df_files = DfFiles(rootdir)

In [140]:
len(word_list)

52078

In [27]:
if 'words_embeddings.parquet' not in df_files:
    import oa

    assert len(word_list) == len(set(word_list)), "Words not unique"
    word_embeddings = oa.embeddings(word_list)
    df = pd.DataFrame(index=word_list, data=map(lambda x: [x], word_embeddings))
    df.columns = ['embedding']
    df_files['words_embeddings.parquet'] = df
else:
    df = df_files['words_embeddings.parquet']


In [41]:
if 'umap_embeddings.parquet' not in df_files:
    import imbed
    umap_planar_embeddings = imbed.umap_2d_embeddings(df.embedding)
    umap_embeddings = imbed.planar_embeddings_dict_to_df(umap_planar_embeddings, index_name='word')
    df_files['umap_embeddings.parquet'] = umap_embeddings
else:
    umap_embeddings = df_files['umap_embeddings.parquet']

In [43]:
umap_embeddings.reset_index(inplace=True)
umap_embeddings

Unnamed: 0,word,x,y
0,a,3.027916,3.760965
1,aa,3.009660,3.677792
2,aaa,2.991489,3.758253
3,aachen,1.601711,1.289808
4,aah,2.878280,3.509061
...,...,...,...
52073,zygomatic,-2.601842,2.954538
52074,zygote,-2.934901,1.785373
52075,zygotic,-3.039186,1.812226
52076,zyloprim,-3.420418,0.476265


In [44]:
df_files['umap_embeddings.csv'] = umap_embeddings

In [3]:
w = lemmas['body']['body.n.01']
w.lemmas()

NameError: name 'lemmas' is not defined

In [147]:
w.lemmas()[0].count()

113

In [154]:
w = next(iter(lemmas['go'].values()))
[x.count() for x in w.lemmas()]

[0, 1, 1, 0]

In [None]:
dir(w.lemmas()[0])

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_frame_ids',
 '_frame_strings',
 '_hypernyms',
 '_instance_hypernyms',
 '_key',
 '_lang',
 '_lex_id',
 '_lexname_index',
 '_name',
 '_related',
 '_synset',
 '_syntactic_marker',
 '_wordnet_corpus_reader',
 'also_sees',
 'antonyms',
 'attributes',
 'causes',
 'count',
 'derivationally_related_forms',
 'entailments',
 'frame_ids',
 'frame_strings',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains',
 'in_usage_domains',
 'instance_hypernyms',
 'instance_hyponyms',
 'key',
 'lang',
 'member_holonyms',
 'member_meronyms',
 'name',
 'part_holonyms',
 'part_meronyms',
 'pertainyms',
 'region_dom

In [None]:
ww = w.lemmas()[0]
ww.count()

0

In [163]:
ww.pertainyms()

[]