<a href="https://colab.research.google.com/github/mnbpdx/code_switched_next_word_predictor/blob/main/synonym_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steps, from [Refined Project Proposal](https://docs.google.com/document/d/1NRUdfsiXkgPQW7Mg-CSVs8FDzghQjYWcvZ9JPUFfLW4/edit?usp=sharing)

1. Pass the predicted next word to a pre-trained language recognition model to **determine the language** of the word (one of the two languages in our mixed, code-switched corpus).
2. **Translate** the predicted next word into the other language by passing it into an appropriate translation model.
3. **Get n synonyms** of the predicted next word in both languages using cosine distance between word embeddings, gathered from two vector embedding models, one in each language. This could also be done by a GPT-3 model.
4. **Score the model** based on whether or not the actual next word is in the list of bilingual synonyms of the predicted next word.
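
Step 3 mentions ranking synonym candidates by cosine distance between word embeddings. A minimal sketch of that idea follows, assuming a tiny hand-written embedding table (`toy_embeddings`) and a hypothetical helper `nearest_words`. Neither appears in this notebook, and a real run would load vectors from a pre-trained embedding model for each language instead.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_words(word, embeddings, n=2):
    """Return the n words whose vectors are most similar to `word`'s vector."""
    target = embeddings[word]
    ranked = sorted(
        (other for other in embeddings if other != word),
        key=lambda other: cosine_similarity(target, embeddings[other]),
        reverse=True,
    )
    return ranked[:n]

# Hand-written 3-dimensional vectors, for illustration only (an assumption).
toy_embeddings = {
    "book": [0.9, 0.1, 0.0],
    "novel": [0.8, 0.2, 0.1],
    "volume": [0.7, 0.3, 0.0],
    "tiger": [0.0, 0.1, 0.9],
}

print(nearest_words("book", toy_embeddings, n=2))  # ['novel', 'volume']
```

Highest cosine similarity is the same as lowest cosine distance, so sorting in descending similarity gives the closest candidates first.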


# Full Synonym Pipeline

## Imports/Downloads

In [17]:
!pip install -U easynmt
!pip install sacremoses # was told to install this by a warning

from easynmt import EasyNMT
import nltk
from nltk.corpus import wordnet as wn

nltk.download('omw-1.4')
nltk.download('wordnet')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Fake Data

In [18]:
# Fake Data
# predictions = ['pizza', 'turkey', 'baño', 'book', 'manzana', 'tiger']
# actual =    ['whale', 'chicken', 'bathroom', 'libro', 'plátano', 'tiger']

## Utility Functions

In [19]:
# These checks should probably raise exceptions that the pipeline function
# catches, but printing and returning False is good enough for now.
def langs_are_valid(lang1, lang2):
  if lang1 == lang2:
    print('Languages must not be the same.')
    return False

  if lang1 != 'en' and lang1 != 'es':
    print('Language must be either Spanish or English.')
    return False

  if lang2 != 'en' and lang2 != 'es':
    print('Language must be either Spanish or English.')
    return False

  return True
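
The comment above suggests raising exceptions in the validator and catching them in the caller. One way that could look is sketched below. `validate_langs` and `SUPPORTED_LANGS` are hypothetical names that are not part of this notebook, and the supported set is assumed to be just English and Spanish.

```python
SUPPORTED_LANGS = {"en", "es"}  # assumption: only English and Spanish for now

def validate_langs(lang1, lang2):
    """Raise ValueError if the language pair is unusable; return None otherwise."""
    for lang in (lang1, lang2):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(
                f"Unsupported language {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}."
            )
    if lang1 == lang2:
        raise ValueError("Languages must not be the same.")

# The caller decides how to react to a bad pair.
try:
    validate_langs("en", "fr")
except ValueError as error:
    print(f"Skipping run: {error}")
```

Note that the monolingual runs later in this notebook pass the same code twice, so the identical-language check would need to be relaxed before wiring this into the pipeline.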

In [20]:
def strip_formatting(word: str) -> str:
    word = word.lower()
    word = word.strip()
    word = word.replace('.', '')
    word = word.replace(',', '')
    word = word.replace('-', '')
    word = word.replace('_', '')
    word = word.replace('!', '')
    word = word.replace('?', '')
    return word

## Pipeline

In [39]:
# -- CITE ME -- Look at Mark's scratchpad notebook for everything we need to cite
# -- POSSIBLE ISSUE -- Double-check my old comments for issues
# -- POSSIBLE ISSUE -- -- DISCUSSION -- When I run id_model.language_detection on "manzana" it returns 'en'.
# "manzana" IS a word in English; it means something like a unit of land. That
# wouldn't be a problem on its own (hey, maybe the person speaking Spanglish DID mean
# "a unit of land"), but the Spanish reading of "manzana" is so much more common that
# the ID model should have caught it by comparing how frequent the word is in
# English versus Spanish. My guess is that this model is more used to translating
# sentences in context than single words. If we see this problem at a large scale,
# we can swap in a different language detection/identification model later. That
# shouldn't be hard; there are many. I also wonder whether the model is built to
# interpret semi-code-switched sentences and translate them. It has to ID them first,
# so it would label an English sentence with a few Spanish words as "English", then
# do the translation step. For that to work, it would need some sense of "manzana"
# as an English word. In other words, this could be a NN problem, or a side effect of
# a model that is also intended to do translation. Same easy solution either way:
# swap it out. The replacement doesn't have to be a NN model; it could even be an
# API call to Dictionary.com or something.
# Now I'm also getting detected languages other than English or Spanish, which is a
# tougher problem. We can switch out the model, but I don't know if I'll be able to
# find one that lets us specify "only return one of these two languages". The big
# question is: what do we do when our detection model returns null or an unknown
# language? Drop that prediction from the score?

# -- TODO -- Get rid of all the print statements
# -- TODO -- Make the comments better
# -- TODO -- Dump data to CSV


# -- TODO -- Right now, this function only ACTUALLY works for Spanish and English, gotta
# figure out how to convert from the EasyNMT language strings to the WordNet
# strings. Or I could just hack it. I'm not 100% sure yet that WN works for 
# Mandarin anyway, which would be an issue.
def score_with_synonym_list(predictions, actual, lang1, lang2):
  # Note: langs_are_valid rejects identical languages, while the monolingual
  # runs below pass the same language twice, so this check stays disabled.
  # if not langs_are_valid(lang1, lang2):
  #   raise ValueError('Please use valid EasyNMT language strings.')

  translation_model = EasyNMT('m2m_100_418M')

  synonym_lists = []
  scores = []

  for word in predictions:

    if lang1 != lang2:
      # Translate word into first language
      word_in_lang1 = translation_model.translate(word, target_lang=lang1)

      # Translate word into the OTHER language
      word_in_lang2 = translation_model.translate(word, target_lang=lang2)
    else:
      # Monolingual run: use the word as-is
      word_in_lang1 = word

    # Map EasyNMT language codes to WordNet language codes.
    # Only English and Spanish are covered so far; extend this table to add languages.
    wn_codes = {'en': 'eng', 'es': 'spa'}
    wn_lang1 = wn_codes[lang1]
    wn_lang2 = wn_codes[lang2]

    # -- POSSIBLE ISSUE -- Should probably limit the number of words, right now there's roughly 0 - 20 for each.
    # Get synonym lists for word and translated_word

    # I'm combining the word and translated_word synonyms, we can split later if needed
    synonyms = []

    # add the words themselves to the synonym list
    synonyms.append(strip_formatting(word_in_lang1))
    if (lang1 != lang2):
      synonyms.append(strip_formatting(word_in_lang2))

    # add synonyms of word in lang1
    # (multi-word lemmas such as "make_out" are cut down to their first word)
    for synset in wn.synsets(word_in_lang1, lang=wn_lang1):
      for lemma in synset.lemmas(wn_lang1):
        if lemma.name() != word_in_lang1:
          unformatted = lemma.name()
          formatted = strip_formatting(unformatted.split('_')[0])
          synonyms.append(formatted)
          print("Unformatted: " + unformatted)
          print("Formatted: " + formatted)

    if lang1 != lang2:
      # add synonyms of word in lang2
      for synset in wn.synsets(word_in_lang2, lang=wn_lang2):
        for lemma in synset.lemmas(wn_lang2):
          if lemma.name() != word_in_lang2:
            unformatted = lemma.name()
            formatted = strip_formatting(unformatted.split('_')[0])
            synonyms.append(formatted)
            print("Unformatted: " + unformatted)
            print("Formatted: " + formatted)

    # Throw all the discovered synonyms (and word & translated_word) onto synonym_lists
    synonym_lists.append(synonyms)

  stripped_actual = []

  for a_word in actual:
    stripped_actual.append(strip_formatting(a_word))

  print()
  print()
  print("PREDICTIONS: " + str(predictions))
  print("ACTUAL: " + str(stripped_actual))

  # Compare synonym lists to actual
  for predicted_word, actual_word, synonyms in zip(predictions, stripped_actual, synonym_lists):
    print()
    print("Predicted: " + predicted_word)
    print("Prediction Synonyms:")
    print(synonyms)
    print("Actual: " + actual_word)
    match_found = actual_word in synonyms
    if match_found:
      print("!!!!!!!!!MATCH FOUND!!!!!!!!!")
    scores.append(match_found)


  # -- TODO -- Dump data to JSON

  # Calculate and return metric
  return sum(scores) / len(scores)
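
The discussion at the top of this cell asks what to do when the detector returns a language outside the corpus pair. One option is to take the higher-scoring of the two allowed languages when the detector exposes per-language scores, and to drop the prediction from the metric when it does not score either one. The sketch below assumes a hypothetical detector output (a dict of language code to confidence) and hypothetical helpers `pick_allowed_language` and `score_skipping_unknown`; it does not reflect the EasyNMT API.

```python
def pick_allowed_language(lang_scores, allowed=("en", "es")):
    """Return the best-scoring allowed language, or None if none was scored."""
    candidates = {lang: score for lang, score in lang_scores.items() if lang in allowed}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def score_skipping_unknown(matches, languages):
    """Average the match flags, ignoring predictions whose language is None."""
    kept = [match for match, lang in zip(matches, languages) if lang is not None]
    if not kept:
        return None  # nothing left to score
    return sum(kept) / len(kept)

# Made-up detector outputs, one dict per predicted word (an assumption).
detector_outputs = [
    {"en": 0.55, "es": 0.40, "pt": 0.05},
    {"pt": 0.70, "es": 0.30},
    {"fr": 1.00},
]

languages = [pick_allowed_language(scores) for scores in detector_outputs]
print(languages)  # ['en', 'es', None]
print(score_skipping_unknown([True, False, True], languages))  # 0.5
```

Dropping unscored predictions shrinks the denominator, so it is worth reporting how many were skipped alongside the score.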

In [22]:
# This is just the basic metric, i.e., whether or not the predicted word is equal
# to the actual next word.
def score_with_actual_word(predictions, actual):
  scores = []

  # Use a distinct loop variable so the `actual` list is not shadowed
  for prediction, actual_word in zip(predictions, actual):
    scores.append(prediction == actual_word)

  return sum(scores) / len(scores)


In [23]:
# score_with_synonym_list(predictions, actual, 'en', 'es')

In [24]:
# score_with_actual_word(predictions, actual)
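
To illustrate how the two metrics differ, the sketch below scores the commented-out fake data from earlier with an exact-match metric and with a synonym-aware metric. The synonym lists are hand-written stand-ins (an assumption) rather than output from EasyNMT or WordNet, and `exact_match_score` and `score_with_lists` are hypothetical helpers.

```python
fake_predictions = ['pizza', 'turkey', 'baño', 'book', 'manzana', 'tiger']
fake_actual = ['whale', 'chicken', 'bathroom', 'libro', 'plátano', 'tiger']

# Hand-written bilingual synonym lists, one per prediction (illustrative only).
hand_written_synonyms = [
    ['pizza'],
    ['turkey', 'pavo'],
    ['baño', 'bathroom', 'bath'],
    ['book', 'libro', 'volume'],
    ['manzana', 'apple'],
    ['tiger', 'tigre'],
]

def exact_match_score(predictions, actual):
    """Fraction of predictions identical to the actual next word."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(predictions)

def score_with_lists(synonym_lists, actual):
    """Fraction of actual words found in the matching prediction's synonym list."""
    return sum(a in syns for syns, a in zip(synonym_lists, actual)) / len(actual)

print(exact_match_score(fake_predictions, fake_actual))       # 1 of 6 matches
print(score_with_lists(hand_written_synonyms, fake_actual))   # 3 of 6 matches
```

Only `tiger` matches exactly, while the synonym-aware metric also credits `baño`/`bathroom` and `book`/`libro`.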

## Score Vince's GPT responses

The most recent results were uploaded manually. We can streamline this later, either by moving this pipeline into the other notebook or by integrating this notebook into the project more tightly.

In [25]:
import json

standard_results = dict()

standard_results_path = '/content/standard_perfromance.json'
with open(standard_results_path) as json_file:
    standard_results = json.load(json_file)

# pprint(standard_results)

if standard_results:
    print('Standard results loaded')
else:
    print('Standard results not loaded')



Standard results loaded


In [26]:
prediction_lists = []
actual_lists = []

for corpus in range(3):
  predictions = []
  actual = []
  for input_and_response in standard_results[corpus][1]:
    predictions.append(input_and_response[1][1])
    actual.append(input_and_response[0][4])
  prediction_lists.append(predictions)
  actual_lists.append(actual)
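
The indexing in the cell above is easier to follow against a small example. The sketch below uses a toy stand-in for the loaded JSON whose nesting is inferred only from the indices used (`[corpus][1]`, `[0][4]`, `[1][1]`). Every value, the meaning of the other fields, and the helper name `extract_word_lists` are assumptions and may not match the real file.

```python
# Toy stand-in for the loaded JSON. Only the nesting depth mirrors the
# indexing used in the notebook; all values here are placeholders.
toy_results = [
    ["corpus_0", [
        [["ctx", "ctx", "ctx", "ctx", "tiger"], ["raw response", "tiger"]],
        [["ctx", "ctx", "ctx", "ctx", "libro"], ["raw response", "book"]],
    ]],
    ["corpus_1", [
        [["ctx", "ctx", "ctx", "ctx", "plátano"], ["raw response", "manzana"]],
    ]],
]

def extract_word_lists(results):
    """Collect predicted and actual words per corpus, using the same indices as the notebook."""
    prediction_lists = []
    actual_lists = []
    for entry in results:
        pairs = entry[1]  # the list of input/response pairs for this corpus
        prediction_lists.append([pair[1][1] for pair in pairs])
        actual_lists.append([pair[0][4] for pair in pairs])
    return prediction_lists, actual_lists

toy_predictions, toy_actual = extract_word_lists(toy_results)
print(toy_predictions)  # [['tiger', 'book'], ['manzana']]
print(toy_actual)       # [['tiger', 'libro'], ['plátano']]
```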



In [27]:
# # Standard Metric Scores

# scores = []

# for corpus in range(3):
#   scores.append(score_with_actual_word(prediction_lists[corpus], actual_lists[corpus]))

In [40]:
# Synonym Metric Scores

synonym_scores = []

synonym_scores.append(score_with_synonym_list(prediction_lists[0], actual_lists[0], 'en', 'en'))
synonym_scores.append(score_with_synonym_list(prediction_lists[1], actual_lists[1], 'es', 'es'))
synonym_scores.append(score_with_synonym_list(prediction_lists[2], actual_lists[2], 'en', 'es'))

Unformatted: iodine
Formatted: iodine
Unformatted: iodin
Formatted: iodin
Unformatted: atomic_number_53
Formatted: atomic
Unformatted: one
Formatted: one
Unformatted: 1
Formatted: 1
Unformatted: ace
Formatted: ace
Unformatted: single
Formatted: single
Unformatted: unity
Formatted: unity
Unformatted: i
Formatted: i
Unformatted: one
Formatted: one
Unformatted: 1
Formatted: 1
Unformatted: i
Formatted: i
Unformatted: ane
Formatted: ane
Unformatted: bash
Formatted: bash
Unformatted: brawl
Formatted: brawl
Unformatted: doh
Formatted: doh
Unformatted: ut
Formatted: ut
Unformatted: Doctor_of_Osteopathy
Formatted: doctor
Unformatted: DO
Formatted: do
Unformatted: make
Formatted: make
Unformatted: perform
Formatted: perform
Unformatted: execute
Formatted: execute
Unformatted: perform
Formatted: perform
Unformatted: fare
Formatted: fare
Unformatted: make_out
Formatted: make
Unformatted: come
Formatted: come
Unformatted: get_along
Formatted: get
Unformatted: cause
Formatted: cause
Unformatted: mak



Unformatted: Y
Formatted: y
Unformatted: Associate_in_Nursing
Formatted: associate
Unformatted: AN
Formatted: an
Unformatted: affect
Formatted: affect
Unformatted: impact
Formatted: impact
Unformatted: bear_upon
Formatted: bear
Unformatted: bear_on
Formatted: bear
Unformatted: touch_on
Formatted: touch
Unformatted: touch
Formatted: touch
Unformatted: affect
Formatted: affect
Unformatted: involve
Formatted: involve
Unformatted: affect
Formatted: affect
Unformatted: regard
Formatted: regard
Unformatted: feign
Formatted: feign
Unformatted: sham
Formatted: sham
Unformatted: pretend
Formatted: pretend
Unformatted: affect
Formatted: affect
Unformatted: dissemble
Formatted: dissemble
Unformatted: affect
Formatted: affect
Unformatted: impress
Formatted: impress
Unformatted: move
Formatted: move
Unformatted: strike
Formatted: strike
Unformatted: unnatural
Formatted: unnatural
Unformatted: moved
Formatted: moved
Unformatted: stirred
Formatted: stirred
Unformatted: touched
Formatted: touched
Unfo

In [41]:
# print("Standard Metric Scores:")

# for score in scores:
#   print(score)
# print()

In [42]:
print("Synonym Metric Scores:")

for synonym_score in synonym_scores:
  print(synonym_score)
print()

Synonym Metric Scores:
0.21739130434782608
0.12987012987012986
0.1839080459770115

