<a href="https://colab.research.google.com/github/mnbpdx/code_switched_next_word_predictor/blob/main/synonym_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steps, from [Refined Project Proposal](https://docs.google.com/document/d/1NRUdfsiXkgPQW7Mg-CSVs8FDzghQjYWcvZ9JPUFfLW4/edit?usp=sharing)

1. Pass the predicted next word to a pre-trained language recognition model to **determine the language** of the word (one of the two languages in our mixed, code-switched corpus).
2. **Translate** the predicted next word into the other language by passing it into an appropriate translation model.
3. **Get n synonyms** of the predicted next word in both languages using cosine distance between word embeddings from two vector embedding models, one per language. This could also be done by a GPT-3 model.
4. **Score the model** based on whether or not the actual next word is in the list of bilingual synonyms of the predicted next word.
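
As a rough illustration of step 3, here is a minimal sketch of picking the n nearest words by cosine similarity. The vectors in `TOY_EMBEDDINGS` are made up for illustration and the helper names are hypothetical; real vectors would come from a trained embedding model for each language. Note that the pipeline further down uses WordNet lookups rather than embeddings.

```python
import math

# Hypothetical toy embeddings; real vectors would come from a trained model.
TOY_EMBEDDINGS = {
  "book":   [0.9, 0.1, 0.0],
  "volume": [0.8, 0.2, 0.1],
  "ledger": [0.7, 0.3, 0.0],
  "tiger":  [0.0, 0.1, 0.9],
  "whale":  [0.1, 0.0, 0.8],
}

def cosine_similarity(u, v):
  """Cosine of the angle between two equal-length vectors."""
  dot = sum(a * b for a, b in zip(u, v))
  norm_u = math.sqrt(sum(a * a for a in u))
  norm_v = math.sqrt(sum(b * b for b in v))
  return dot / (norm_u * norm_v)

def nearest_words(word, embeddings, n):
  """Return the n words whose vectors are most similar to the vector of `word`."""
  target = embeddings[word]
  scored = [(cosine_similarity(target, vec), other)
            for other, vec in embeddings.items() if other != word]
  scored.sort(reverse=True)  # highest similarity first
  return [other for _, other in scored[:n]]

print(nearest_words("book", TOY_EMBEDDINGS, 2))
```

Running the same lookup against a second embedding table for the translated word would give the other half of the bilingual synonym list.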


# Full Synonym Pipeline

## Imports/Downloads

In [6]:
!pip install -U easynmt
!pip install sacremoses # was told to install this by a warning

from easynmt import EasyNMT
import nltk
from nltk.corpus import wordnet as wn

nltk.download('omw-1.4')
nltk.download('wordnet')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Fake Data

In [7]:
# Dummy Data
predictions = ['pizza', 'turkey', 'baño', 'book', 'manzana', 'tiger']
actual =    ['whale', 'chicken', 'bathroom', 'libro', 'plátano', 'tiger']

## Utility Functions

In [8]:
# I think I should be raising the exceptions here and catching them in
# the pipeline function actually, but I don't wanna waste time on that rn.
def langs_are_valid(lang1, lang2):
  if lang1 == lang2:
    print('Languages must not be the same.')
    return False

  if lang1 not in ('en', 'es') or lang2 not in ('en', 'es'):
    print('Language must be either Spanish or English.')
    return False

  return True
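
The comment above mentions raising exceptions instead of printing and returning `False`. A minimal sketch of that variant follows. The names `WORDNET_CODES`, `validate_langs`, and `to_wordnet_code` are hypothetical, and the mapping only covers the two languages used in this notebook; other languages would need to be added after checking what `wn.langs()` reports.

```python
# Assumed mapping from the two-letter codes passed to the pipeline to the
# three-letter codes WordNet expects. Only en/es are listed here.
WORDNET_CODES = {"en": "eng", "es": "spa"}

def validate_langs(lang1, lang2):
  """Raise ValueError unless lang1 and lang2 are two different supported codes."""
  if lang1 == lang2:
    raise ValueError("Languages must not be the same.")
  for lang in (lang1, lang2):
    if lang not in WORDNET_CODES:
      raise ValueError(
        f"Unsupported language {lang!r}; expected one of {sorted(WORDNET_CODES)}."
      )

def to_wordnet_code(lang):
  """Translate a two-letter code into its WordNet equivalent."""
  return WORDNET_CODES[lang]

# The caller decides how to react, instead of the validator printing.
try:
  validate_langs("en", "fr")
except ValueError as err:
  print(f"Invalid languages: {err}")
```

Keeping the supported codes in one dictionary means validation and the WordNet code conversion cannot drift apart.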

## Pipeline

In [21]:
# -- CITE ME -- Look at Mark's scratchpad notebook for everything we gotta cite
# -- POSSIBLE ISSUE -- 2x check my old comments for issues
# -- POSSIBLE ISSUE -- -- DISCUSSION -- When I run id_model.language_detection on "manzana" it returns 'en'.
# "manzana" IS a word in English, that means like a unit of land or something. That
# wouldn't be a problem (hey, maybe the person speaking Spanglish here DID mean "a unit of land")
# but in this case, it's SO obvious that the Spanish word "manzana" (apple) is the
# much more common one. I feel like the ID model should have caught that, by comparing
# the frequency of the English and Spanish versions. My guess is that this model is
# more used to translating sentences in context rather than single words. We can sub it out
# with a different language detection/identification model later, if we notice this problem at a large scale.
# That shouldn't be hard, there are many. I also wonder if what's happening here is that
# the model is capable of interpreting semi-code-switched sentences and translating them.
# It has to ID them first, so it would interpret an English sentence with a few Spanish words
# as "English", then do the translation step. For that to work, it'd have to have some sense
# of "manzana" as an English word. IOW, could be a NN problem, or a problem of a model that
# is intended to do translation too. Same easy solution though, just sub it out. We
# could even sub it out with an API call to Dictionary.com or something, doesn't
# have to be a NN model.
# Okay now I'm getting different languages detected than English or Spanish. Which
# is a tougher problem to solve. We can switch out the model, but idk if I'm gonna
# be able to find a model that allows us to specify: "Only return one of these
# two languages". The big question is, what do we do when our detection model returns
# null, or returns an unknown language? Throw out the prediction from the score?


# Right now, this function only ACTUALLY works for Spanish and English, gotta
# figure out how to convert from the EasyNMT language strings to the WordNet
# strings. Or I could just hack it. I'm not 100% sure yet that WN works for 
# Mandarin anyway, which would be an issue.
def full_synonym_pipeline_rename_me(predictions, actual, lang1, lang2):
  if not langs_are_valid(lang1, lang2):
    raise ValueError('Please use valid EasyNMT language strings.')

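  # language_detection and translate are both methods of the same EasyNMT object,
  # so load the opus-mt model once and share it
  id_model = EasyNMT('opus-mt')
  translation_model = id_model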

  synonym_lists = []
  scores = []

  for word in predictions:

    # Determine the language of word
    detected_lang = id_model.language_detection(word)
    if detected_lang == lang1:
      other_lang = lang2
    elif detected_lang == lang2:
      other_lang = lang1
    else:
      # f-string so this also works if the detector returns None
      raise ValueError(f'EasyNMT detected a language other than lang1 or lang2 for {word!r}. Detected Language -> {detected_lang}')

    # Translate word into the OTHER language
    translated_word = translation_model.translate(word, source_lang=detected_lang, target_lang=other_lang) # or swap the languages for the reverse, obv. 

    # Convert EasyNMT language codes to WordNet language codes.
    # Hardcoded for en/es; need to update this step to make this function generic with respect to languages
    if detected_lang == 'en':
      detected_lang = 'eng'
      other_lang = 'spa'
    else:
      detected_lang = 'spa'
      other_lang = 'eng'

    # -- POSSIBLE ISSUE -- Should probably limit the number of words, right now there's roughly 0 - 20 for each.
    # Get synonym lists for word and translated_word

    # -- POSSIBLE ISSUE -- I'm combining the word and translated_word synonyms, we can split later if needed
    synonyms = []

    # add the words themselves to the synonym list
    synonyms.append(word)
    synonyms.append(translated_word)

    # add synonyms of word
    for synset in wn.synsets(word, lang=detected_lang):
      for lemma in synset.lemmas(detected_lang):
        if lemma.name() != word:
          synonyms.append(lemma.name())

    # add synonyms of translated_word
    for synset in wn.synsets(translated_word, lang=other_lang):
      for lemma in synset.lemmas(other_lang):
        if lemma.name() != translated_word:
          synonyms.append(lemma.name())

    # Throw all the discovered synonyms (and word & translated_word) onto synonym_lists
    synonym_lists.append(synonyms)

  print()
  print()
  print("PREDICTIONS: " + str(predictions))
  print("ACTUAL: " + str(actual))

  # Compare synonym lists to actual
  for word, synonyms in zip(actual, synonym_lists):
    print()
    print(word)
    print(synonyms)
    scores.append(word in synonyms)

  # Calculate and return metric
  return sum(scores) / len(scores)
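
The comments at the top of this cell ask what to do when the detector returns a language other than the two we care about. One possible policy, sketched here under assumptions, is to treat those predictions as undetected and leave them out of the score rather than raising. `detect_restricted`, `hit_rate`, and `stub_detect` are hypothetical names; `detect` can be any callable that maps a word to a language code, and the stub's outputs are invented for illustration.

```python
def detect_restricted(word, detect, allowed):
  """Return detect(word) if it is one of `allowed`, otherwise None."""
  lang = detect(word)
  return lang if lang in allowed else None

def hit_rate(hits):
  """Average a list of True/False/None values, ignoring the None entries."""
  counted = [hit for hit in hits if hit is not None]
  return sum(counted) / len(counted) if counted else 0.0

# Stub detector standing in for a real model; these outputs are made up.
def stub_detect(word):
  return {"pizza": "it", "book": "en", "baño": "es"}.get(word)

langs = [detect_restricted(w, stub_detect, {"en", "es"})
         for w in ["pizza", "book", "baño"]]
print(langs)

# A prediction with no usable language gets None instead of True/False.
print(hit_rate([True, None, False, True]))
```

Whether skipped predictions should instead count as misses is a scoring decision; skipping them makes the metric depend on detector quality, so it may be worth reporting how many were skipped alongside the score.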

In [22]:
full_synonym_pipeline_rename_me(predictions, actual, 'en', 'es')





PREDICTIONS: ['pizza', 'turkey', 'baño', 'book', 'manzana', 'tiger']
ACTUAL: ['whale', 'chicken', 'bathroom', 'libro', 'plátano', 'tiger']

whale
['pizza', 'Pizza', 'pizza_pie']

chicken
['turkey', 'pavo', 'Meleagris_gallopavo', 'Turkey', 'Republic_of_Turkey', 'joker', 'bomb', 'dud', 'Pavo', 'género_Pavo', 'Pavo']

bathroom
['baño', 'bathroom', 'aplicación', 'capa', 'cobertura', 'mano', 'revestimiento', 'bañera', 'capa', 'mano', 'recubrimiento', 'inodoro', 'bath', 'toilet', 'lavatory', 'lav', 'can', 'john', 'privy']

libro
['book', 'libro', 'volume', 'record', 'record_book', 'script', 'playscript', 'ledger', 'leger', 'account_book', 'book_of_account', 'rule_book', 'Koran', 'Quran', "al-Qur'an", 'Book', 'Bible', 'Christian_Bible', 'Book', 'Good_Book', 'Holy_Scripture', 'Holy_Writ', 'Scripture', 'Word_of_God', 'Word', 'reserve', 'hold', 'ejemplar', 'volumen', 'Biblia', 'Biblia_cristiana', 'Libro', 'biblia', 'el_buen_libro', 'la_palabra', 'las_escrituras', 'las_sagradas_escrituras', 'li

0.5
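
The output above contains repeated entries (for example `'capa'`, `'mano'`, and `'Pavo'`), and the pipeline comments flag that the lists may need a size limit. A minimal sketch of one way to handle both is below; `normalize` and `dedupe_and_cap` are hypothetical helpers, and treating case and underscores as insignificant is an assumption, not something the pipeline currently does.

```python
def normalize(word):
  """Lowercase and replace WordNet's underscores so near-duplicates compare equal."""
  return word.lower().replace("_", " ")

def dedupe_and_cap(synonyms, max_items):
  """Drop duplicates (ignoring case and underscores), keep order, and keep at most max_items."""
  seen = set()
  result = []
  for synonym in synonyms:
    if len(result) >= max_items:
      break
    key = normalize(synonym)
    if key not in seen:
      seen.add(key)
      result.append(synonym)
  return result

# Synonym list for 'turkey', copied from the output above.
raw = ['turkey', 'pavo', 'Meleagris_gallopavo', 'Turkey', 'Republic_of_Turkey',
       'joker', 'bomb', 'dud', 'Pavo', 'género_Pavo', 'Pavo']
print(dedupe_and_cap(raw, 5))
```

One trade-off: ignoring case merges `'Turkey'` (the country) with `'turkey'` (the bird), which is fine for a membership check but loses the distinction between senses.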