## Reverse dictionary in Spanish: Part 1

Every once in a while I'm trying to remember a word and all I have is a vague memory of its definition. Googling that definition to try to find the word is usually not very useful. I thought a reverse dictionary, where you search for a definition and get the words that match it, would be useful in this case. Turns out "reverse dictionary" is the common term for this kind of task. With the rise of LLMs and general NLP advances, it seems like something like this should be easy nowadays, but as tends to happen, doing it in Spanish is not as easy. In this first part, I will build the most basic reverse dictionary using available word embeddings. In the next parts I will venture into fine-tuning a new neural net, or straight up using something like the newly fine-tuned Llama 2 in Spanish to try to make it work for this.

In [1]:
from pathlib import Path
import fastbook
import fastai
from gensim.models.keyedvectors import KeyedVectors

I wasn't sure exactly which embeddings to use, so I picked the vector-formatted version of the Word2Vec embeddings from SBWC from [here](https://github.com/dccuchile/spanish-word-embeddings#word2vec-embeddings-from-sbwc)


In [None]:
# Unzip it. This works with WSL/linux etc. Won't work (I think) if you're just running jupyter from windows cmd
# this will also delete the bz2. If you want to keep it use -dk instead of -d
# !bzip2 -d data/SBW-vectors-300-min5.txt.bz2

The unzipped file is almost 3 GB, so we load only the first 10,000 vectors to check it out

In [2]:
# Use gensim KeyedVectors to read the embeddings
wordvectors_file_vec = 'data/SBW-vectors-300-min5.txt'
cantidad = 10000
wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)
# wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec)

Embeddings like this really aren't anything more than a mapping between keys and vectors. Basically each word has a bunch of numbers that correspond to it, and that we can interpret as a sort of meaning: semantically similar words should have similar numbers, very different words should have very different numbers.
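
To make "similar numbers" a bit more concrete, here is a minimal sketch with made-up 3-dimensional vectors (the real ones have 300 dimensions). The values and the `cosine_similarity` helper are invented for illustration; cosine similarity is just one common way to compare embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example (not taken from the SBWC embeddings)
toy_vectors = {
    "rey":   [0.9, 0.8, 0.1],
    "reina": [0.8, 0.9, 0.2],
    "mesa":  [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(toy_vectors["rey"], toy_vectors["reina"])
sim_table = cosine_similarity(toy_vectors["rey"], toy_vectors["mesa"])
print(sim_royal > sim_table)  # True
```

With real embeddings the idea is the same, only the vectors are learned from a corpus instead of typed by hand.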

It can be a bit confusing how to index into this object. What words are in this sample? What do the numbers look like?

In [11]:
print(wordvectors[0].shape) # So, a vector of 300 numbers for each word
print(wordvectors[0][:10]) # The first 10 numbers of whatever word 0 is

print(wordvectors.index_to_key[0]) # index_to_key is a list: this gives the word stored at index 0
print(wordvectors.key_to_index['reina']) # key_to_index is a dict: this gives the index of a particular word

(300,)
[-0.029648  0.011336  0.019949 -0.088832 -0.025225  0.056844  0.025473  0.014068  0.163694 -0.067154]
de
3278


Since I wanted to eventually turn this into a small Gradio app to share, I figured using the whole 3 GB of embeddings is overkill, and surely there are a lot of words in there that I don't really need (like places, names, etc.). Ideally I would have a database of words useful for spellchecking in Spanish so I could filter with it, but I couldn't find one. I suppose I could straight up scrape RAE or WordReference, but I don't think they allow it and it might take quite a while to get that going.

I ended up deciding to use [Wikcionario](https://es.wiktionary.org/wiki/Wikcionario:Portada), the only really open dictionary in Spanish I could find. Kind of a problem, because they don't just provide a list of words ready to use, so I had to go through the Wikcionario dump and see what files I could use. I went for eswiktionary-20230820-pages-articles.xml.bz2, which you can find [here](https://dumps.wikimedia.org/eswiki/20230820/). Other options might be:

* [Spanish Wordnet](http://grial.edu.es/web/es/descargas/): Might be good too, and honestly I didn't find it when I was first doing this, sooooo I'll check it out later. Not sure of their terms, and it also seems like it's just a translation of WordNet.
* [Freeling](https://nlp.lsi.upc.edu/freeling/index.php/node/10): Could not find where to get the dictionary, which has 500k+ words. I think it's just [here](https://github.com/TALP-UPC/FreeLing/tree/master/data/es/dictionary/entries) but it seemed like it was going to take a while to figure out.
* [Apertium](https://github.com/apertium): This might be the best alternative really. The file is [here](https://repositori.upf.edu/handle/10230/17123), though unzipping it and figuring out where exactly the part I want lives is also an issue.

Anyways, I went ahead and downloaded the full XML, hoping to parse it and extract the title of each article to use for this.

In [None]:
# !bzip2 -d data/eswiktionary-20230820-pages-articles.xml.bz2

The problem, again, is that I don't really need the whole site's XML data, just the titles of the articles in there. So load the XML and filter for the appropriate tag

In [12]:
import xml.etree.ElementTree as ET

tree = ET.parse('data/eswiktionary-20230820-pages-articles.xml')
root = tree.getroot()

# Had to do this because I wasn't getting any of the titles; turns out I needed the namespace for it
namespaces = {elem.tag.split('}')[0].strip('{') for elem in root.iter() if '}' in elem.tag}
print(namespaces)

{'http://www.mediawiki.org/xml/export-0.10/'}
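
The namespace problem is easy to reproduce on a tiny document. This is only a sketch: the XML string below is hand-written for the example and merely mimics the shape of the dump, using the namespace printed above.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written document in the same namespace the dump reported
sample_xml = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>hoy</title></page>
  <page><title>MediaWiki:Helppage</title></page>
</mediawiki>"""

sample_root = ET.fromstring(sample_xml)

# Without the namespace, the search silently matches nothing
without_ns = sample_root.findall('.//title')

# With a prefix -> URI mapping, the same search finds the titles
ns = {'ns': 'http://www.mediawiki.org/xml/export-0.10/'}
with_ns = [elem.text for elem in sample_root.findall('.//ns:title', ns)]

print(without_ns)  # []
print(with_ns)     # ['hoy', 'MediaWiki:Helppage']
```

The prefix name (`ns` here) is arbitrary; only the URI has to match the one in the file.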


In [13]:
namespaces = {'ns': 'http://www.mediawiki.org/xml/export-0.10/'}  # the namespace mapping
titles = [elem.text for elem in root.findall('.//ns:title', namespaces)]

It works!

In [16]:
titles[:30]

['MediaWiki:Category',
 'MediaWiki:Helppage',
 'MediaWiki:Wikititlesuffix',
 'MediaWiki:Bugreportspage',
 'Plantilla:Sitesupportpage',
 'MediaWiki:Qbspecialpages',
 'Plantilla:Fromwikipedia',
 'MediaWiki:Postcomment',
 'Plantilla:Gnunote',
 'MediaWiki:Developertitle',
 'MediaWiki:Developertext',
 'MediaWiki:Sitesubtitle',
 'MediaWiki:Noconnect',
 'MediaWiki:Missingarticle',
 'MediaWiki:Perfdisabled',
 'MediaWiki:Perfdisabledsub',
 'MediaWiki:Whitelistedittitle',
 'MediaWiki:Whitelistreadtitle',
 'MediaWiki:Whitelistreadtext',
 'MediaWiki:Whitelistacctitle',
 'MediaWiki:Whitelistacctext',
 'MediaWiki:Newarticletext',
 'MediaWiki:Anontalkpagetext',
 'MediaWiki:Noarticletext',
 'Plantilla:Sectionedit',
 'Plantilla:Commentedit',
 'MediaWiki:Revhistory',
 'MediaWiki:Loadhist',
 'Plantilla:Searchhelppage']

It sucks. Obviously a lot of titles are stuff that I don't need. Filtering:

In [17]:
filtered_titles = [title for title in titles if ':' not in title] # this to eliminate wiktionary specific titles
filtered_titles = [title for title in filtered_titles if title.islower()] # upper case means mostly places, names and such
filtered_titles = [title for title in filtered_titles if len(title.split()) == 1] # only care for words
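
To see what each of the three filters removes, here is a small sketch on a hand-picked list. The sample titles and the `keep_title` helper are made up for illustration; the checks themselves are the same ones used in the cell above.

```python
sample_titles = ['MediaWiki:Helppage', 'Plantilla:Gnunote', 'Chile', 'hoy', 'a veces', 'gurú']

def keep_title(title):
    """Apply the same three checks as the filtering cell above."""
    is_regular_page = ':' not in title         # drops MediaWiki: and Plantilla: pages
    is_lowercase = title.islower()             # drops most proper nouns
    is_single_word = len(title.split()) == 1   # drops multi-word expressions
    return is_regular_page and is_lowercase and is_single_word

kept = [title for title in sample_titles if keep_title(title)]
print(kept)  # ['hoy', 'gurú']
```

One side effect worth knowing: `str.islower()` returns False for strings with no letters at all, so titles such as '2023' are dropped as well.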

In [18]:
filtered_titles[:10]

['japonés',
 'hiragana',
 'katakana',
 'alemán',
 'catalán',
 'mayo',
 'domingo',
 'hoy',
 'gurú',
 'francés']

Much better. I think. The next issue is restricting the words in the word vectors to those that exist in the filtered titles from Wikcionario. So, read the whole embedding file and then filter. This is really slow on my local PC.

In [19]:
wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec)
words = wordvectors.index_to_key[0:len(wordvectors)]

In [20]:
# aer = [word for word in words if word in filtered_titles] <- this takes forever, don't try
filtered_titles_set = set(filtered_titles)
aer = [word for word in words if word in filtered_titles_set]
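
The reason the list version takes forever is that `word in some_list` scans the list element by element, while `word in some_set` is a hash lookup. A rough sketch with synthetic data (the word lists and sizes are invented just for the demo, and are far smaller than the real ones):

```python
import timeit

# Synthetic stand-ins for the real lists (sizes chosen only for the demo)
vocabulary = [f"palabra{i}" for i in range(2000)]
titles_list = [f"palabra{i}" for i in range(0, 4000, 2)]
titles_set = set(titles_list)

def filter_with_list():
    return [w for w in vocabulary if w in titles_list]  # scans the list each time

def filter_with_set():
    return [w for w in vocabulary if w in titles_set]   # hash lookup each time

# Both approaches select exactly the same words
same_result = filter_with_list() == filter_with_set()
print(same_result)  # True

# Timings vary by machine, so no expected numbers are shown here
print(timeit.timeit(filter_with_list, number=3))
print(timeit.timeit(filter_with_set, number=3))
```

With about a million words on one side and several hundred thousand titles on the other, that gap is what turns the list version into an effectively endless loop.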

In [23]:
print(len(words)) # number of words in the embedding
print(len(filtered_titles_set)) # number of words in the final filter of the wiktionary
print(len(aer)) # final number of words in the smaller embedding matrix I'll keep

1000653
856904
173274


Not sure if it makes sense that the difference is that big, but for now I don't care. Filter and save the smaller model

In [25]:
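# Keep only the vectors of the words that survived the filter
words_to_keep = aer
smaller_model = wordvectors.vectors_for_all(words_to_keep)
# Saved under data/ so it matches the path used when loading it back later
smaller_model.save_word2vec_format("data/smaller_model_spa.txt")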

### Examples!

So now we can use some of the methods and try this first basic version of a Spanish reverse dictionary

In [None]:
smaller_model.most_similar_cosmul(positive=['angustia','esperar'])

#A type of vehicle that goes under the water

In [26]:
# Example 1: distress because you forgot something
smaller_model.most_similar_cosmul(positive=['angustia', 'porque', 'se', 'te', 'olvidó', 'algo'])

[('reconcome', 0.2180447280406952),
 ('apachurra', 0.2174457609653473),
 ('empezás', 0.21651658415794373),
 ('causaste', 0.2164955586194992),
 ('exploté', 0.2155492752790451),
 ('comentabas', 0.2149752825498581),
 ('desconfié', 0.21489164233207703),
 ('agüitado', 0.21468758583068848),
 ('recordá', 0.213822603225708),
 ('aborrecerse', 0.21345187723636627)]

Didn't know the word "reconcome", but it turns out RAE says [reconcomer](https://dle.rae.es/reconcomer) means "Dicho de un problema, una preocupación" (said of a problem, a worry). It actually works pretty well

In [27]:
# Example 2: fear of heights
smaller_model.most_similar_cosmul(positive=['miedo', 'a', 'las', 'alturas'])

[('cuarteando', 0.3146556615829468),
 ('cachamos', 0.31217625737190247),
 ('basculen', 0.3110094964504242),
 ('erran', 0.3109343945980072),
 ('que', 0.3109077513217926),
 ('avecinarían', 0.3102854788303375),
 ('enfermarían', 0.3097950518131256),
 ('acongojantes', 0.30800893902778625),
 ('destronadas', 0.30754488706588745),
 ('baqueteadas', 0.3073866665363312)]

Not even close, but without stopwords "acrofobia", which means fear of heights, actually appears in the list. Didn't know that word either; I usually think of "vértigo" for this

In [29]:
# Example 3
smaller_model.most_similar_cosmul(positive=['miedo', 'alturas'])

[('basculen', 0.5672526359558105),
 ('enervadas', 0.5599164962768555),
 ('encrespe', 0.5534656047821045),
 ('ensuciarán', 0.5530100464820862),
 ('apeteciendo', 0.5520198345184326),
 ('topábamos', 0.5518502593040466),
 ('acrofobia', 0.5518409609794617),
 ('importunase', 0.5515409708023071),
 ('sobresaltando', 0.5505372881889343),
 ('solazo', 0.549772322177887)]

Makes sense really. An embedding like this means each word really only gets one meaning (one vector), when we know a lot of them can have multiple meanings. So adding connectors, stopwords and such will probably make the search worse. Something to think about when wrapping it all in the Gradio app
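
There is possibly a second, more mechanical reason. As far as I understand it, gensim's `most_similar_cosmul` follows the multiplicative combination objective from Levy and Goldberg: each positive word contributes a factor (1 + cos) / 2 and the factors are multiplied together. The sketch below reimplements that idea on invented 2-dimensional vectors; the helper names and values are hypothetical, and gensim's real implementation is vectorized and also handles negative words.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosmul_score(candidate, positive_vectors):
    """Multiply (1 + cos) / 2 over all positive words; each factor lies in [0, 1]."""
    score = 1.0
    for vec in positive_vectors:
        score *= (1 + cosine(candidate, vec)) / 2
    return score

# Invented 2-d vectors: two content words and one unrelated "stopword"
miedo, alturas, stopword = [1.0, 0.2], [0.8, 0.6], [-0.5, 1.0]
candidate = [0.9, 0.4]  # a word close to both content words

content_only = cosmul_score(candidate, [miedo, alturas])
with_stopword = cosmul_score(candidate, [miedo, alturas, stopword])
print(with_stopword < content_only)  # True
```

Every extra word multiplies the score by something no larger than 1, which would be consistent with the six-word queries above topping out around 0.2 or 0.3 while the two-word query reaches about 0.55.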

The next example is kind of funny. I was hoping it would find submarine ("vehicle that goes under water") but it found something else:

In [33]:
# Example 4
smaller_model.most_similar_cosmul(positive=['vehiculo', 'que', 'anda', 'bajo', 'el', 'agua'])

[('desbarrancada', 0.16990652680397034),
 ('enfilen', 0.1675097644329071),
 ('gritadera', 0.1669069081544876),
 ('cremaron', 0.16604605317115784),
 ('encostalado', 0.1657562106847763),
 ('servicialmente', 0.16530342400074005),
 ('arroyar', 0.1648721992969513),
 ('desenmascaro', 0.16470250487327576),
 ('escaldó', 0.1644110232591629),
 ('citroneta', 0.16413670778274536)]

Desbarrancada means fallen, as in, off a cliff. So that vehicle is under water but for sure not functioning. Finally, an example that really didn't work ("group of wolves that roam together")

In [34]:
# Example 5
smaller_model.most_similar_cosmul(positive=['grupo', 'de', 'lobos', 'que', 'andan', 'juntos'])

[('albertosaurios', 0.17380031943321228),
 ('camionetita', 0.17278659343719482),
 ('destazando', 0.1716463416814804),
 ('acicalaban', 0.17020738124847412),
 ('desenmascaro', 0.16754880547523499),
 ('rechistara', 0.16714678704738617),
 ('atónicos', 0.16713115572929382),
 ('toreamos', 0.16500285267829895),
 ('convalecían', 0.16470967233181),
 ('flipados', 0.16468177735805511)]

Is albertosaurio an old dude named Alberto or an actual dinosaur? Don't know, but for sure it has nothing to do with a pack of wolves. No combination of those words actually gets "manada" into the list. Though it does turn out that if you use "rebaño" as a positive word, it does find "manada".

In [36]:
smaller_model.most_similar_cosmul(positive=['grupo', 'lobos'])

[('montescos', 0.5416086912155151),
 ('lobo', 0.5249671936035156),
 ('andariegos', 0.5227840542793274),
 ('albertosaurios', 0.5224950909614563),
 ('cuellilargos', 0.5204229950904846),
 ('azotadores', 0.5189481973648071),
 ('destazando', 0.5181227326393127),
 ('boleando', 0.5153393745422363),
 ('arponeando', 0.5151293277740479),
 ('arponeados', 0.5130795836448669)]

In [46]:
smaller_model.most_similar_cosmul(positive=['lobos', 'rebaño'])

[('manada', 0.650048553943634),
 ('rebaños', 0.643172025680542),
 ('cabras', 0.6415314674377441),
 ('ovejas', 0.6396838426589966),
 ('lobo', 0.6271942853927612),
 ('lobeznos', 0.6143544912338257),
 ('manadas', 0.6129717826843262),
 ('corderillos', 0.6069217324256897),
 ('salvajes', 0.6065280437469482),
 ('zorros', 0.6042017936706543)]

### Closing thoughts

So this works pretty OK out of the box. If you just wanted to run it locally you could use the whole embedding matrix and do nothing more than call its methods. The more annoying part was finding open resources for Spanish, really. Still, a lot of improvements could be made in later parts:

* Using actual definitions: As of now, we used word meanings estimated for a different task (Word2Vec learns them by predicting words from their neighbors), but we aren't using the fact that the mapping we really want to take advantage of is definition -> word. So using the embeddings to train for that task seems like the next obvious step
* Multiple word meanings: The fact that each word is mapped to one vector means that multiple meanings get lost, which I believe is part of the reason why removing some words helps the search in this example. Context gets lost

So next up should be figuring out how to use context (multi-sense embeddings maybe?) and then some architecture to train the task we actually care about. For now, the most immediate thing to do is wrap the method in a function that strips the mostly useless words from the string, so I can put up the Gradio app

### Removing stopwords and filtering for the Gradio app

In [1]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/imauriacaf/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/imauriacaf/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
#tokenizer = nltk.data.load('nltk:reverse_dictionary/tokenizers/punkt/spanish.pickle')

In [2]:
example_sent = "grupo de lobos que andan juntos"
stop_words = set(stopwords.words('spanish'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

In [4]:
import pickle 

with open('stop_words.pkl', 'wb') as f:
    pickle.dump(stop_words, f)

In [4]:
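def filter_words(x):
    # Tokenize the definition and drop Spanish stopwords
    word_tokens = word_tokenize(x)
    filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
    return filtered_sentence

def reverse_dictionary(definition):
    # Keep only words the model knows; an unknown key would raise a KeyError
    words = [w for w in filter_words(definition) if w in smaller_model.key_to_index]
    list_similar = smaller_model.most_similar_cosmul(positive=words)
    return list_similar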

In [5]:
wordvectors_file_vec = 'data/smaller_model_spa.txt'
smaller_model = KeyedVectors.load_word2vec_format(wordvectors_file_vec)

In [16]:
reverse_dictionary("vehiculo que anda en el agua")

[('arroyar', 0.41619226336479187),
 ('despapaye', 0.41503772139549255),
 ('pepenando', 0.4141307771205902),
 ('parapete', 0.41403526067733765),
 ('mareta', 0.4119090437889099),
 ('rentábamos', 0.41015368700027466),
 ('servicialmente', 0.40943971276283264),
 ('empoza', 0.4043053090572357),
 ('fregó', 0.4042477607727051),
 ('parquea', 0.4041784405708313)]
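
Since the Gradio interface in the next cell uses a plain text output, the list of tuples would presumably be displayed in its raw form. A small formatting helper could make that easier to read; `format_results` is a hypothetical name, not something from gensim or Gradio, and this is just one way to lay the results out.

```python
def format_results(similar_words, decimals=3):
    """Turn [(word, score), ...] into one 'word (score)' line per result."""
    lines = [f"{word} ({score:.{decimals}f})" for word, score in similar_words]
    return "\n".join(lines)

# First two results copied from the output above
sample_results = [('arroyar', 0.41619226336479187), ('despapaye', 0.41503772139549255)]
formatted = format_results(sample_results)
print(formatted)
# arroyar (0.416)
# despapaye (0.415)
```

To use it, `reverse_dictionary` could return `format_results(list_similar)` instead of the raw list.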

In [8]:
import gradio as gr

# Text box in, plain text out; share=True also creates a temporary public link
gr.Interface(fn=reverse_dictionary,
             inputs=gr.Textbox(lines=5, placeholder="Enter your text here..."),
             outputs="text").launch(share=True)



Running on local URL:  http://127.0.0.1:7861
Running on public URL: https://924c3b97bcee8ca03b.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


