<a href="https://colab.research.google.com/github/mnbpdx/code_switched_next_word_predictor/blob/main/synonym_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Full Synonym Pipeline

## Synonym Pipeline Steps, Explained

1. Pass the predicted next word to a pre-trained language recognition model to **determine the language** of the word (one of the two languages in our mixed, code-switched corpus).
2. **Translate** the predicted next word into the other language by passing it into an appropriate translation model.
3. **Get n synonyms** of the predicted next word in both languages, for example using cosine distance between word embeddings gathered from two vector embedding models, one per language. The code below uses WordNet lemmas (via NLTK) instead. This could also be done by a GPT-3 model.
4. **Score the model** based on whether or not the actual next word is in the list of bilingual synonyms of the predicted next word.
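
Step 3 mentions ranking candidate synonyms by cosine distance between word embeddings. A minimal sketch of that idea follows, using tiny hand-written vectors in place of a real embedding model. The vocabulary, the vector values, and the helper names `cosine_similarity` and `nearest_words` are illustrative assumptions, not part of this notebook.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(word, embeddings, n):
    # Rank every other word by similarity to `word`, most similar first
    query = embeddings[word]
    ranked = sorted(
        (other for other in embeddings if other != word),
        key=lambda other: cosine_similarity(query, embeddings[other]),
        reverse=True,
    )
    return ranked[:n]

# Hypothetical toy embeddings; a real model would supply these vectors
toy_embeddings = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.85, 0.15, 0.05],
    "huge": [0.8, 0.2, 0.1],
    "tiny": [-0.9, 0.1, 0.0],
    "banana": [0.0, 0.1, 0.95],
}

print(nearest_words("big", toy_embeddings, n=2))  # ['large', 'huge']
```

Cosine distance is one minus this similarity, so sorting by descending similarity is the same as sorting by ascending distance.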


## Imports/Downloads

In [None]:
!pip install -U easynmt
!pip install sacremoses # was told to install this by a warning

from easynmt import EasyNMT
import nltk
from nltk.corpus import wordnet as wn

nltk.download('omw-1.4')
nltk.download('wordnet')


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Utility Functions

In [None]:
def langs_are_valid(lang1, lang2):
  # Only English ('en') and Spanish ('es') are supported.
  for lang in (lang1, lang2):
    if lang not in ('en', 'es'):
      print('Language must be either Spanish or English.')
      return False

  return True

In [None]:
def strip_formatting(word: str) -> str:
    word = word.lower()
    word = word.strip()
    word = word.replace('.', '')
    word = word.replace(',', '')
    word = word.replace('-', '')
    word = word.replace('_', '')
    word = word.replace('!', '')
    word = word.replace('?', '')
    return word
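
As a quick illustration of what `strip_formatting` does to typical tokens, the sketch below reimplements the same steps with `str.translate`. The name `strip_formatting_compact` and the sample inputs are assumptions made for this example.

```python
# The six characters that strip_formatting removes
_PUNCTUATION_TO_DROP = str.maketrans('', '', '.,-_!?')

def strip_formatting_compact(word: str) -> str:
    # Lowercase, trim surrounding whitespace, then delete the listed characters
    return word.lower().strip().translate(_PUNCTUATION_TO_DROP)

for raw in ['  Hello! ', 'ice-cream', '¿Qué?']:
    print(repr(raw), '->', repr(strip_formatting_compact(raw)))
# '  Hello! ' -> 'hello'
# 'ice-cream' -> 'icecream'
# '¿Qué?' -> '¿qué'
```

Note that characters outside the list, such as `¿`, `(`, `)`, `:` and `@`, pass through unchanged, so tokens containing them can only match if the actual word carries the same characters.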

## Pipeline

### Synonym Evaluation Metric

In [None]:
# Synonym gatherer code is modified from: https://github.com/johnbumgarner/synonyms_discovery_aggregation

def score_with_synonym_list(predictions, actual, lang1, lang2):
  if not langs_are_valid(lang1, lang2):
    raise ValueError("Please use valid EasyNMT language strings ('en' or 'es').")

  translation_model = EasyNMT('m2m_100_418M')

  synonym_lists = []
  scores = []

  predictions = [strip_formatting(prediction) for prediction in predictions]
  actual = [strip_formatting(a) for a in actual]

  for word in predictions:

    if (lang1 != lang2):
      # Translate word into first language
      word_in_lang1 = translation_model.translate(word, target_lang=lang1)
      word_in_lang1 = strip_formatting(word_in_lang1)

      # Translate word into the OTHER language
      word_in_lang2 = translation_model.translate(word, target_lang=lang2)
      word_in_lang2 = strip_formatting(word_in_lang2)
    else:
      # Monolingual case: no translation needed
      word_in_lang1 = word

    if (lang1 == 'en'):
      wn_lang1 = 'eng'
      wn_lang2 = 'spa'
    else:
      wn_lang1 = 'spa'
      wn_lang2 = 'eng'

    # Get synonym lists for word and translated_word

    synonyms = []

    # add the predicted word (and translation) to the synonym list
    synonyms.append(word_in_lang1)
    if (lang1 != lang2):
      synonyms.append(word_in_lang2)

    # add synonyms of word in lang1
    for synonym in wn.synsets(word_in_lang1, lang=wn_lang1):
      for item in synonym.lemmas(wn_lang1):
        if item.name() != word_in_lang1:
          formatted = strip_formatting(item.name().split('_')[0])
          synonyms.append(formatted)

    if (lang1 != lang2):
      # add synonyms of word in lang2
      for synonym in wn.synsets(word_in_lang2, lang=wn_lang2):
        for item in synonym.lemmas(wn_lang2):
          if item.name() != word_in_lang2:
            formatted = strip_formatting(item.name().split('_')[0])
            synonyms.append(formatted)

    # Throw all the discovered synonyms (and word & translated_word) onto synonym_lists
    synonym_lists.append(synonyms)

  # Compare synonym lists to actual
  for predicted_word, actual_word, synonyms in zip(predictions, actual, synonym_lists):
    print()
    print("Predicted Word: " + predicted_word)
    print("Prediction Synonyms:")
    print(synonyms)
    print("Actual Word: " + actual_word)
    # A prediction counts as correct if the actual word appears among its synonyms
    match_found = actual_word in synonyms
    if match_found:
      print("MATCH FOUND!")
    scores.append(match_found)

  # Calculate and return metric
  return sum(scores) / len(scores)

### Standard Evaluation Metric

Exact-match accuracy, for comparison with the synonym metric.

In [None]:
# Standard Evaluation Metric
def score_with_actual_word(predictions, actual):
  scores = []

  # Use a distinct loop variable so the `actual` parameter is not shadowed
  for prediction, actual_word in zip(predictions, actual):
    scores.append(strip_formatting(prediction) == strip_formatting(actual_word))

  # Fraction of predictions that exactly match the actual next word
  return sum(scores) / len(scores)


In [None]:
# Sample Prediction

# predictions = ['pizza', 'turkey', 'baño', 'book', 'manzana', 'tiger']
# actual =    ['whale', 'chicken', 'bathroom', 'libro', 'plátano', 'tiger']

# score_with_synonym_list(predictions, actual, 'en', 'es')
# score_with_actual_word(predictions, actual)
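
To see how the two metrics can diverge without downloading the translation model, the sketch below scores the sample lists above against a small hand-written synonym table. The table `toy_synonyms` and the helpers `exact_match_score` and `synonym_match_score` are hypothetical stand-ins for the translation and WordNet lookups, so the numbers only illustrate the mechanics.

```python
predictions = ['pizza', 'turkey', 'baño', 'book', 'manzana', 'tiger']
actual = ['whale', 'chicken', 'bathroom', 'libro', 'plátano', 'tiger']

# Hypothetical bilingual synonym table (stands in for EasyNMT + WordNet)
toy_synonyms = {
    'baño': ['bathroom', 'bath'],
    'book': ['libro', 'volume'],
    'manzana': ['apple'],
}

def exact_match_score(predictions, actual):
    # Fraction of positions where prediction and actual word are identical
    hits = [p == a for p, a in zip(predictions, actual)]
    return sum(hits) / len(hits)

def synonym_match_score(predictions, actual, synonyms):
    # Fraction of positions where the actual word is the prediction
    # itself or one of its listed synonyms
    hits = []
    for p, a in zip(predictions, actual):
        candidates = [p] + synonyms.get(p, [])
        hits.append(a in candidates)
    return sum(hits) / len(hits)

exact = exact_match_score(predictions, actual)
relaxed = synonym_match_score(predictions, actual, toy_synonyms)
print(exact, relaxed)
```

With this toy table, only `tiger` matches exactly (1 of 6), while the synonym metric also credits `baño`/`bathroom` and `book`/`libro` (3 of 6).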

## Score GPT-3 Responses

To run this on the GPT-3 responses from `code_switched_language_modeling_performance_evaluation.ipynb`, set `standard_results_path` to the path of `standard_perfromance.json`.

Alternatively, if you're running in Google Colab, simply drop `standard_perfromance.json` into the "Files" section of Google Colab and leave `standard_results_path` as '/content/standard_perfromance.json'.

In [None]:
import json

standard_results = dict()

standard_results_path = '/content/standard_perfromance.json'
with open(standard_results_path) as json_file:
    standard_results = json.load(json_file)

# pprint(standard_results)

if standard_results:
    print('Standard results loaded')
else:
    print('Standard results not loaded')



Standard results loaded


In [None]:
prediction_lists = []
actual_lists = []

for corpus in range(3):
  predictions = []
  actual = []
  for input_and_response in standard_results[corpus][1]:
    predictions.append(input_and_response[1][1])
    actual.append(input_and_response[0][4])
  prediction_lists.append(predictions)
  actual_lists.append(actual)
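
The index expressions above imply a nested layout for `standard_results`. The sketch below builds a tiny hypothetical structure with that shape to make the indexing concrete. The field meanings and every value in `toy_results` are guesses for illustration, since the notebook does not show the JSON file itself.

```python
# Hypothetical shape: one entry per corpus, each [label, records];
# each record is [input_fields, response_fields].
toy_results = [
    ['corpus_a', [
        [['ctx', 'ctx', 'ctx', 'ctx', 'dog'], ['raw response', 'dog']],
        [['ctx', 'ctx', 'ctx', 'ctx', 'cat'], ['raw response', 'bird']],
    ]],
    ['corpus_b', [
        [['ctx', 'ctx', 'ctx', 'ctx', 'casa'], ['raw response', 'casa']],
    ]],
]

def extract_pairs(results):
    # Mirrors the loop above: prediction at [1][1], actual word at [0][4]
    prediction_lists, actual_lists = [], []
    for entry in results:
        records = entry[1]
        prediction_lists.append([record[1][1] for record in records])
        actual_lists.append([record[0][4] for record in records])
    return prediction_lists, actual_lists

toy_predictions, toy_actual = extract_pairs(toy_results)
print(toy_predictions)  # [['dog', 'bird'], ['casa']]
print(toy_actual)       # [['dog', 'cat'], ['casa']]
```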



In [None]:
len(prediction_lists[0])

92

### Scores

In [None]:
# Standard Metric Scores

scores = []

for corpus in range(3):
  scores.append(score_with_actual_word(prediction_lists[corpus], actual_lists[corpus]))

In [None]:
# Synonym Metric Scores

synonym_scores = []

synonym_scores.append(score_with_synonym_list(prediction_lists[0], actual_lists[0], 'en', 'en'))
synonym_scores.append(score_with_synonym_list(prediction_lists[1], actual_lists[1], 'es', 'es'))
synonym_scores.append(score_with_synonym_list(prediction_lists[2], actual_lists[2], 'en', 'es'))


Predicted Word: i
Prediction Synonyms:
['i', 'iodine', 'iodin', 'i', 'atomic', 'one', '1', 'i', 'ace', 'single', 'unity', 'i', 'one', '1', 'ane']
Actual Word: i
MATCH FOUND!

Predicted Word: the
Prediction Synonyms:
['the']
Actual Word: @slackerradio

Predicted Word: do
Prediction Synonyms:
['do', 'bash', 'brawl', 'doh', 'ut', 'doctor', 'do', 'make', 'perform', 'execute', 'perform', 'fare', 'make', 'come', 'get', 'cause', 'make', 'practice', 'practise', 'exercise', 'suffice', 'answer', 'serve', 'make', 'act', 'behave', 'serve', 'manage', 'dress', 'arrange', 'set', 'coif', 'coiffe', 'coiffure']
Actual Word: rock

Predicted Word: (ps
Prediction Synonyms:
['(ps']
Actual Word: i

Predicted Word: be
Prediction Synonyms:
['be', 'beryllium', 'be', 'glucinium', 'atomic', 'exist', 'equal', 'constitute', 'represent', 'make', 'comprise', 'follow', 'embody', 'personify', 'live', 'cost']
Actual Word: bee

Predicted Word: for
Prediction Synonyms:
['for']
Actual Word: for
MATCH FOUND!

Predicted Wor




Predicted Word: )
Prediction Synonyms:
[')', ')']
Actual Word: y

Predicted Word: y
Prediction Synonyms:
['and', 'y', 'y']
Actual Word: lea

Predicted Word: ie:
Prediction Synonyms:
['it is:', 'es que:']
Actual Word: cuñadas

Predicted Word: an
Prediction Synonyms:
['an', 'una', 'associate', 'an']
Actual Word: the

Predicted Word: we
Prediction Synonyms:
['we', 'nosotros']
Actual Word: we
MATCH FOUND!

Predicted Word: what
Prediction Synonyms:
['what', '¿qué']
Actual Word: it's

Predicted Word: afectan
Prediction Synonyms:
['affected', 'afectan', 'affect', 'impact', 'bear', 'bear', 'touch', 'touch', 'affect', 'involve', 'affect', 'regard', 'feign', 'sham', 'pretend', 'affect', 'dissemble', 'affect', 'impress', 'move', 'strike', 'unnatural', 'moved', 'stirred', 'touched']
Actual Word: te

Predicted Word: 
Prediction Synonyms:
['', '']
Actual Word: fuck

Predicted Word: ien
Prediction Synonyms:
['the nine', 'el neno']
Actual Word: ay

Predicted Word: is
Prediction Synonyms:
['is', 'es',

In [None]:
print("Standard Metric Scores:")

for score in scores:
  print(score)
print()

Standard Metric Scores:
0.21739130434782608
0.12987012987012986
0.1839080459770115



In [None]:
print("Synonym Metric Scores:")

for synonym_score in synonym_scores:
  print(synonym_score)
print()

Synonym Metric Scores:
0.22826086956521738
0.12987012987012986
0.1839080459770115
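
Because each score is a fraction of matches over corpus size, the first-corpus values can be read back as counts. The earlier cell output reports 92 items in `prediction_lists[0]`; the match counts computed below are inferred from the printed scores rather than shown anywhere in the notebook.

```python
corpus_size = 92  # len(prediction_lists[0]) as printed earlier

standard_score = 0.21739130434782608  # first Standard Metric Score
synonym_score = 0.22826086956521738   # first Synonym Metric Score

# Recover the number of matches behind each fraction
standard_matches = round(standard_score * corpus_size)
synonym_matches = round(synonym_score * corpus_size)

print(standard_matches, synonym_matches)    # 20 21
print(synonym_matches - standard_matches)   # 1
```

Under this reading, the synonym metric credits one additional prediction on the first corpus, while the scores for the other two corpora are identical under both metrics.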

