# Language Technology: An Ongoing Journey for NLP

#### by Samantha Rigor — December 24, 2023 — CS505

## Abstract

In this project, I aimed to create a chatbot that would be a valuable resource to bilingual language learners that speak English and Spanish. While I initially attempted to train a ChatBot from the `chatterbot` package, I ultimately created a simplistic chatbot that takes menu options as inputs and returns different functions, such as dictionary lookup and vocabulary units. Though the chatbot has 3 functions, these functions are not fully implemented for both languages and require much more development before it could be released for practical use.

## Background/Description

Although artificial intelligence has proven itself capable of generating art, essays, and scripts, many have yet to use natural language processing to develop new language technologies. Though improving existing tools like Google Translate and ChatGPT are helpful, these agents are not the most helpful or accessible for new language learners. As the Internet continues to connect users all over the world, it becomes more imperative that knowledge is accessible to all, and giving people the opportunity to learn a new language can aid in this endeavor.

Outside of my computer science degree, I also study linguistics, in which I have learned of the importance of multilingualism and the impact that language has on our society. Between learning a new language in college and trying to speak to my parents in their native language I have experienced language barriers in its many forms, and I have longed for language learning tools that are both easy to use and easy to access. As I near graduation, I look forward to a future where I can use my knowledge of computer science and linguistics to aid in the development of language technology. With the love I have for both programming and language learning, I hope that a project like this can serve as a step on the way to providing better online resources for learners all over the world.

## Method & Results

Originally, I only wanted to use `chatterbot` feature, but as I played around with the package, I found that the ChatBot returned strange responses, even when I changed the training data. Thus, I reverted to a simpler chatbot with menu options to choose from.

To build this chatbot, I relied heavily on the Open Multilingual Wordnet package (`wordnet` and `omw`) to go back and forth between English and Spanish. Within WordNet, there are only 4 main categories of words: nouns, verbs, adjectives, and adverbs. With that in mind, I had to make the language lessons within the ChatBot simple enough for language learners to understand as well as simple enough for WordNet to have an entry.

To start, I looked up basic categories of words that people would expect to learn within the first few months of learning a second language. I then boiled down [this set of categories](https://www.towerofbabelfish.com/the-method/vocabulary/base-vocabulary-list/) down to colors, days of the week, months of the year, basic verbs, family, seasons, and travel.

I then thought of different tools that would be best suited for new language learners, which resulted in the 3 main functions of the program: dictionary lookup, simple vocabulary units, and ChatBot practice.

With the dictionary lookup, I utilized the existing functionality of WordNet— primarily the synset class and its methods that take in languages as parameters.

For the vocabulary units, I attempted to use the same functionality, but given that synsets give a long list of synonyms as translations, I wanted to make it simple for language learners by developing a function that only returns 1 translation per word. Thus, I tried to build a Spanish corpus using the online `.txt` version of _Don Quixote_ by cleaning the text of extraneous data and punctuation as well as ordering the words by usage. Ideally, the vocabulary function would have returned the most used translation for the given word.

Finally, I gave one last effort at using `chatterbot` by creating a separate function outside of the `main()` program. If the user selects the third option (practice mode), the function will initiate a ChatBot trained on the target language. This involved creating a virtual environment with an older version of Python (3.7) to run `chatterbot`, creating a ChatBot based on the user's "target language" (the language being learned), and then using [existing chatterbox corpora](https://github.com/gunthercox/chatterbot-corpus) to train the bot. For both English and Spanish, I limited the data to more common topics like greetings, conversations, and sports.

#### Future Directions
Unfortunately, I could not get `chatterbot` to work well. Initially, I tried to use a `ListTrainer` on the ChatBot and gave the trainer my own conversations to work off of. However, the chatbot was slow to learn these responses, if it even began learning them at all. Even when I used the standard `ChatterBotCorpusTrainer`, the chatbot would return strange responses, regardless of whether or not I removed those datapoints from its training data. In the future, it would be very worthwhile to upgrade `chatterbot` or develop another ChatBot package compatible with Jupyter Notebook as this package only works with Python 3.7 or older.

Within the existing options, I did not have enough time to fully implement the Vocabulary Mode (mode 2) for both English and Spanish. If I had more time, I would have tried to further expand the "most used" translation idea. To get the best Spanish translation, I would have built a larger Spanish corpus (rather than just one novel). As for the English translation, I would have liked to use the Brown corpus as a model for most common English words and phrases.

Additionally, I think a language learning tool like this would benefit from some sort of sentence generator. Because we learned a simpler way of sentences in HW03, I would have liked to create some sort of N-grams model for sentence generation given the basic words in the vocabulary sets. Learners could practice translating these sentences (similar to existing Duolingo activities) to practice their fluency. Alternatively, these sentences (and their translations) could be used as examples in the dictionary to help learners develop reading comprehension and understand words using context clues.

Lastly, a tool like this would be much more useful if I had expanded its reach to more languages in the WordNet database. With my own limited language knowledge, I was only able to create this project for English and Spanish— the prompts and menu options for each step of the way were translated by me in both languages, and I do not have the proficiency in any other language to do the same for more language options. If I knew a way to properly translate my prompts from English to every other language in the WordNet database, a tool like this would be revolutionary for NLTK and the low-resource language corpora.

## Conclusion/Reflection

While I enjoyed the work I made on this project, I believe there is much more to be accomplished before this tool could be beta-tested. This tool has some functionality for both word lookup and conversation practice, but it ultimately does not meet the standards of other resources online. That being said, though, this project was innovative in its attempt to combine multiple resources in one application. While most language learners would need to reference several different types of media to learn a given vocabulary unit, a finalized version of this project would make language  learning much simpler for people who do not know where to start. If I had the time and resources to explore the further directions mentioned above, I believe that this project would be extremely helpful for both language learners as well as fellow computer scientists and linguists working in natural language processing. The version I have now is just a start to the types of language technology I (and the rest of the NLP, MT, and computational linguistics communities) have yet to discover.

## Code/Data Repository

In [1]:
import nltk
import numpy as np
import math
from tqdm import tqdm
from collections import defaultdict

In [2]:
from nltk.corpus import wordnet as wn

nltk.download("wordnet")
nltk.download("omw")
nltk.download("omw-1.4")
nltk.download("extended_omw")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data]   Package omw is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package extended_omw to /root/nltk_data...
[nltk_data]   Package extended_omw is already up-to-date!


True

In [3]:
lang_opts = ["eng", "spa"]

In [4]:
def lang_set():
    user_choice = "tbd"
    while (user_choice != 1 and user_choice != 2):
        user_choice = int(input("Please choose an option:\n \
                       (1) I speak English, and I want to learn Spanish.\n \
                       (2) Hablo español y quiero aprender inglés.\n"))
    if user_choice == 1:
        native = "eng"
        target = "spa"
    else:
        native = "spa"
        target = "eng"
    return (native, target)

In [5]:
def eng_mode_set():
    mode = ""
    while mode not in range(1, 5):
        mode = int(input("Please choose a mode:\n \
                           (1) Dictionary Mode\n \
                           (2) Learning Mode\n \
                           (3) Practice Mode\n \
                           (4) Quit\n"))
    return mode

In [6]:
def esp_mode_set():
    mode = ""
    while mode not in range(1, 5):
        mode = int(input("Escoge una opción por favor:\n \
                           (1) Quiero usar un diccionario.\n \
                           (2) Quiero aprender vocabulario.\n \
                           (3) Quiero practicar.\n \
                           (4) Salir\n"))
    return mode

In [7]:
def eng_dict_lookup(mode):
  if mode == 1:
    word = str(input("Which English word would you like to look up in Spanish?\t"))
    synsets = wn.synsets(word, lang="eng")
    for s in synsets:
      print(f' - {s.pos()}. {s.definition()}')
      syns = [w for w in s.lemma_names(lang="spa") if w != word]
      if len(syns) > 0:
        print('  Possible Spanish equivalent(s): {}'.format(syns))
    if len(synsets) == 0:
      print("Sorry, that word is not in the dictionary.")
  else:
    word = str(input("Which Spanish word would you like to look up?\t"))
    synsets = wn.synsets(word, lang="spa")
    for s in synsets:
      print(f' - {s.pos()}. {s.definition()}')
      syns = [w for w in s.lemma_names(lang="spa") if w != word]
      if len(syns) > 0:
        print('  Syn: {}'.format(syns))
    if len(synsets) == 0:
      print("Sorry, that word is not in the dictionary.")

In [8]:
def spa_dict_lookup(mode):
  native = "spa"
  target = "eng"
  if mode == 1:
    word = str(input("¿Qué es el significado de esta palabra?\t"))
    synsets = wn.synsets(word, lang="eng")
    for i in range(len(synsets)):
      print(f"Contexto {i+1}: {synsets[i].lemma_names(lang=native)}")
    if len(synsets) == 0:
      print("Lo siento, no puedo traducir esa palabra.")
  else:
    word = str(input("¿Cómo se dice esta palabra en inglés?\t"))
    synsets = wn.synsets(word, lang="spa")
    for i in range(len(synsets)):
      print(f"Contexto {i+1}: {synsets[i].lemma_names(lang=target)}")
    if len(synsets) == 0:
      print("Lo siento, no puedo traducir esa palabra.")

In [9]:
def dictionary_mode(native):
    choice = ""
    while choice not in range(1, 4):
      if native == "eng":
        choice = int(input("Please choose an option:\n \
                        (1) English to Spanish\n \
                        (2) Spanish to English\n \
                        (3) Exit\n"))
      else:
        choice = int(input("Escoge una opción por favor:\n \
                        (1) Inglés—Español\n \
                        (2) Español—Inglés\n \
                        (3) Salir\n"))
    if choice == 3:
        return
    else:
      while True:
            if native == "eng":
              eng_dict_lookup(choice)
              done = input("Would you like to look up another word? (Yes / No)\t")
            else:
              spa_dict_lookup(choice)
              done = input("¿Quieres traducir otra palabra? (Sí / No)\t")
            if done.lower() == "no":
                  break

In [10]:
basic_categories = {}
basic_categories["color"] = ["red","green","blue","yellow","green","brown","pink","orange","black","white","gray"]
basic_categories["week"] = ["sunday","monday","tuesday","wednesday","thursday","friday","saturday"]
basic_categories["month"] = ["january","february","march","april","may","june","july","august","september","october","november","december"]
basic_categories["verb"] = ["run", "walk", "think", "speak", "smile", "eat", "drink", "sleep", "cook"]
basic_categories["person"] = ["man", "woman", "person", "mother", "father", "son", "daughter", "sister", "brother", "family", "boy", "girl", "husband", "wife"]
basic_categories["season"] = ["winter", "spring", "summer", "fall"]
basic_categories["travel"] = ["hotel", "restaurant", "place", "airport", "car", "plane", "bicycle", "ticket", "boat"]

basic_pos = {}
basic_pos["color"] = "a"
basic_pos["week"] = "n"
basic_pos["month"] = "n"
basic_pos["verb"] = "v"
basic_pos["person"] = "n"
basic_pos["season"] = "n"
basic_pos["travel"] = "n"

In [11]:
def learning_category(native):
    choice = ""
    while choice not in range(1, 8):
        if native == "eng":
            choice = int(input("What subject would you like to learn?:\n \
                           (1) Colors\n \
                           (2) Days of the Week\n \
                           (3) Months of the Year\n \
                           (4) Basic Verbs\n \
                           (5) Family\n \
                           (6) Seasons\n \
                           (7) Travel\n"))
        else:
            choice = int(input("¿Qué te gustaría aprender?:\n \
                           (1) Los colores\n \
                           (2) Los días de la semana\n \
                           (3) Los meses del año\n \
                           (4) Unos verbos fáciles\n \
                           (5) La familia\n \
                           (6) Las estaciones del año\n \
                           (7) El viaje\n"))
    return choice-1

In [12]:
# building a Spanish corpora based off of Don Quixote (as provided by Project Gutenberg)
# using this method that I learned in CAS LX496 (Computational Linguistics, Prof. Hagstrom)
from urllib import request
dq_url = "https://www.gutenberg.org/cache/epub/2000/pg2000.txt"
dq_response = request.urlopen(dq_url)
dq_raw = dq_response.read().decode('utf8')

byline = "por Miguel de Cervantes Saavedra"
start_index = dq_raw.find("por Miguel de Cervantes Saavedra")
end_index = dq_raw.find("*** END OF THE PROJECT GUTENBERG EBOOK DON QUIJOTE ***")

dq_raw = dq_raw[start_index+len(byline):end_index]
from string import punctuation
clean_dq = (''.join([c for c in dq_raw if c not in punctuation])).lower()
dq_words = clean_dq.split()

from collections import Counter
dq_mc = Counter(dq_words).most_common()
esp_mc_nocounts = []
for i in tqdm(range(len(dq_mc))):
  esp_mc_nocounts.append(dq_mc[i][0])

esp_mc_nocounts = list(set(esp_mc_nocounts))

100%|██████████| 23772/23772 [00:00<00:00, 770759.53it/s]


In [13]:
def learn(native, target, choice):
  categories = list(basic_categories.keys())
  topic = categories[choice]
  word_pairs = []
  for term in basic_categories[topic]:
    if native == "eng":
      synsets = wn.synsets(term, lang=native)
    else:
      synsets = wn.synsets(term, lang=target)
    for i in range(len(synsets)):
      name_pos_num = synsets[i].name().split(".")
      if (name_pos_num[1] == basic_pos[topic]) \
      or (name_pos_num[1] == "s" and basic_pos[topic] == "a"):
        if native == "eng":
          poss_transls = synsets[i].lemma_names(lang=target)
        else:
          poss_transls = synsets[i].lemma_names(lang=native)
    pt_ranks = []
    for j in range(len(esp_mc_nocounts)):
      for word in poss_transls:
        if esp_mc_nocounts[j] == word:
          pt_ranks.append(j)
    if len(pt_ranks) == 0:
      best_match = poss_transls[0]
    else:
      best_match = poss_transls[pt_ranks.index(min(pt_ranks))]
    word_pairs.append((term, best_match))
  for (nat_word, tar_word) in word_pairs:
    print(f"{tar_word} - ({basic_pos[topic]}.) {nat_word}")

In [14]:
def start_chatbot(native):
  if native == "eng":
    print("Please wait...")
  else:
    print("Espera por favor...")
  !pip install -q condacolab
  import condacolab
  condacolab.install()
  !conda create --name myenv python=3.7
  !conda activate myenv
  !pip install chatterbot==1.0.4
  !pip install -U PyYaml==3.12
  import collections.abc
  collections.Hashable = collections.abc.Hashable
  from chatterbot import ChatBot
  from chatterbot.trainers import ChatterBotCorpusTrainer
  if native == "eng":
    esp_chatbot = ChatBot("Español")
    esp_trainer = ChatterBotCorpusTrainer(esp_chatbot)
    esp_trainer.train("chatterbot.corpus.spanish.greetings",
                      "chatterbot.corpus.spanish.conversations")
    while True:
      print("To quit, type \"Goodbye\" or \"Adiós\".")
      user_in = input("¿Qué quieres decir al ChatBot?\t")
      if user_in.lower() == "goodbye" or user_in.lower() == "adiós":
        break
      response = esp_chatbot.get_response(str(input))
      print(response)
  else:
    eng_chatbot = ChatBot("English")
    eng_trainer = ChatterBotCorpusTrainer(eng_chatbot)
    eng_trainer.train("chatterbot.corpus.english.conversations",
                      "chatterbot.corpus.english.greetings",
                      "chatterbot.corpus.english.health",
                      "chatterbot.corpus.english.sports",
                      "chatterbot.corpus.english.humor")
    while True:
      print("Para salir, escribe \"Goodbye\" o \"Adiós\".")
      user_in = input("What do you want to say to the ChatBot?\t")
      if user_in.lower() == "goodbye" or user_in.lower() == "adiós":
        break
      response = eng_chatbot.get_response(str(input))
      print(response)
  print("Goodbye! Good luck on studying!")
  print("¡Adiós! ¡Buena suerte por estudiar!")
  return

In [15]:
def chatbot(native):
  if native == "eng":
    esp_chatbot = ChatBot("Español")
    esp_trainer = ChatterBotCorpusTrainer(esp_chatbot)
    esp_trainer.train("chatterbot.corpus.spanish.greetings",
                      "chatterbot.corpus.spanish.conversations")
    while True:
      print("To quit, type \"Goodbye\" or \"Adiós\".")
      user_in = input("¿Qué quieres decir al ChatBot?\t")
      if user_in.lower() == "goodbye" or user_in.lower() == "adiós":
        break
      response = esp_chatbot.get_response(str(input))
      print(response)
  else:
    eng_chatbot = ChatBot("English")
    eng_trainer = ChatterBotCorpusTrainer(eng_chatbot)
    eng_trainer.train("chatterbot.corpus.english.conversations",
                      "chatterbot.corpus.english.greetings",
                      "chatterbot.corpus.english.health",
                      "chatterbot.corpus.english.sports",
                      "chatterbot.corpus.english.humor")
    while True:
      print("Para salir, escribe \"Goodbye\" o \"Adiós\".")
      user_in = input("What do you want to say to the ChatBot?\t")
      if user_in.lower() == "goodbye" or user_in.lower() == "adiós":
        break
      response = eng_chatbot.get_response(str(input))
      print(response)
  print("Goodbye! Good luck on studying!")
  print("¡Adiós! ¡Buena suerte por estudiar!")
  return

In [16]:
def main():
  (native, target) = lang_set()
  typed_three_before = False
  while True:
    if native == "eng":
      mode = eng_mode_set()
    else:
      mode = esp_mode_set()
    if mode == 1:
      dictionary_mode(native)
    elif mode == 2:
      learn(native, target, learning_category(native))
    elif mode == 3:
      if not typed_three_before:
        start_chatbot(native)
      else:
        chatbot(native)
    elif mode == 4:
      print("Goodbye! Good luck on studying!")
      print("¡Adiós! ¡Buena suerte por estudiar!")
      break

In [17]:
# for L1 English
main()

Please choose an option:
                        (1) I speak English, and I want to learn Spanish.
                        (2) Hablo español y quiero aprender inglés.
1
Please choose a mode:
                            (1) Dictionary Mode
                            (2) Learning Mode
                            (3) Practice Mode
                            (4) Quit
1
Please choose an option:
                         (1) English to Spanish
                         (2) Spanish to English
                         (3) Exit
1
Which English word would you like to look up in Spanish?	car
 - n. a motor vehicle with four wheels; usually propelled by an internal combustion engine
  Possible Spanish equivalent(s): ['auto', 'automóvil', 'carro', 'coche', 'máquina', 'turismo', 'vehículo']
 - n. a wheeled vehicle adapted to the rails of railroad
  Possible Spanish equivalent(s): ['automotor', 'coche', 'vagón']
 - n. the compartment that is suspended from an airship and that carries personnel and the

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Training greetings.yml: [####################] 100%
Training conversations.yml: [####################] 100%
To quit, type "Goodbye" or "Adiós".
¿Qué quieres decir al ChatBot?	¡Hola!
Los casos especiales no son lo suficientemente especiales como para romper las reglas.
To quit, type "Goodbye" or "Adiós".
¿Qué quieres decir al ChatBot?	¿Cómo estás?
Los errores nunca debe pasar en silencio.
To quit, type "Goodbye" or "Adiós".
¿Qué quieres decir al ChatBot?	Estoy bien
Significa que sólo se vive una vez. ¿Dónde has oído eso?
To quit, type "Goodbye" or "Adiós".
¿Qué quieres decir al ChatBot?	goodbye
Goodbye! Good luck on studying!
¡Adiós! ¡Buena suerte por estudiar!
Please choose a mode:
                            (1) Dictionary Mode
                            (2) Learning Mode
                            (3) Practice Mode
                            (4) Quit
4
Goodbye! Good luck on studying!
¡Adiós! ¡Buena suerte por estudiar!


In [18]:
# for L1 Spanish
main()

Please choose an option:
                        (1) I speak English, and I want to learn Spanish.
                        (2) Hablo español y quiero aprender inglés.
2
Escoge una opción por favor:
                            (1) Quiero usar un diccionario.
                            (2) Quiero aprender vocabulario.
                            (3) Quiero practicar.
                            (4) Salir
1
Escoge una opción por favor:
                         (1) Inglés—Español
                         (2) Español—Inglés
                         (3) Salir
1
¿Qué es el significado de esta palabra?	man
Contexto 1: ['hombre', 'varón']
Contexto 2: ['militar']
Contexto 3: ['hombre']
Contexto 4: ['hombre']
Contexto 5: []
Contexto 6: []
Contexto 7: ['ayuda_de_cámara']
Contexto 8: ['hombre']
Contexto 9: ['Isla_de_Man', 'Man']
Contexto 10: ['pieza']
Contexto 11: ['hombre', 'humanidad', 'mundo']
Contexto 12: []
Contexto 13: []
¿Quieres traducir otra palabra? (Sí / No)	no
Escoge una opción por fav

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Training conversations.yml: [####################] 100%
Training greetings.yml: [####################] 100%
Training health.yml: [####################] 100%
Training sports.yml: [####################] 100%
Training humor.yml: [####################] 100%
Para salir, escribe "Goodbye" o "Adiós".
What do you want to say to the ChatBot?	Hi
¿Qué pasa?
Para salir, escribe "Goodbye" o "Adiós".
What do you want to say to the ChatBot?	What's up
Hola, ¿cómo estás?
Para salir, escribe "Goodbye" o "Adiós".
What do you want to say to the ChatBot?	How are things going for you?
¿Puedo ayudarte en algo?
Para salir, escribe "Goodbye" o "Adiós".
What do you want to say to the ChatBot?	What food do you like?
The sky's up but I'm fine thanks. What about you?
Para salir, escribe "Goodbye" o "Adiós".
What do you want to say to the ChatBot?	goodbye
Goodbye! Good luck on studying!
¡Adiós! ¡Buena suerte por estudiar!
Escoge una opción por favor:
                            (1) Quiero usar un diccionario.
     

In [19]:
# to see how the vocabulary function doesn't quite function:
main()

Please choose an option:
                        (1) I speak English, and I want to learn Spanish.
                        (2) Hablo español y quiero aprender inglés.
1
Please choose a mode:
                            (1) Dictionary Mode
                            (2) Learning Mode
                            (3) Practice Mode
                            (4) Quit
2
What subject would you like to learn?:
                            (1) Colors
                            (2) Days of the Week
                            (3) Months of the Year
                            (4) Basic Verbs
                            (5) Family
                            (6) Seasons
                            (7) Travel
5


IndexError: ignored