# Final Project: A Literary "Translator"
Samantha Rigor

CAS LX496

December 12, 2022


### Introduction and Goals


Because I have an interest in working in natural language processing and machine translation in the future, I wanted to do some sort of translation-related project for our final assignment. I was inspired by the many novelty translators I saw online as a kid, such as [the English to Shakespearean translator](https://lingojam.com/EnglishtoShakespearean). After learning about the many modules and packages in NLTK and Python, I became inspired to create a similar "literary translator" that would take user input and turn it into text that would model a famous author.

In developing this project, I hope my translator will be able to achieve a style similar to that of Jane Austen, an English author who wrote from the end of the eighteenth century to the beginning of the nineteenth century. Even if the code is not able to emulate her writing, I aim to create a translator that will be able to develop a grammatical sentence that carries the same meaning as the user's input phrase.

Using one of Austen's most famous works, _Emma_, I intend to use [NLTK](https://www.nltk.org), [Project Gutenberg](https://gutenberg.org), and [WordNet](https://wordnet.princeton.edu/) to create a corpus from the novel and develop functions that will pull from the words in _Emma_ to develop an Austen-esque phrase with the same semantic meaning as the user's input.

### Data Cleaning

To start this project, I looked at the [Project Gutenberg](https://www.gutenberg.org) website and arbitrarily chose a famous literary work, which, as stated before, is [_Emma_ by Jane Austen](https://www.gutenberg.org/ebooks/158). At first, I tried to use the code we used in the [Federalist Papers lesson](https://colab.research.google.com/drive/1k4BIE5b3Lf3QZgeELedKVsXw9SHY9pur?usp=sharing) to pull the text from the .txt file online, but as it turns out, the `gutenberg` package that already exists in NLTK has a mostly cleaned version of _Emma_. Thus, I began by using what we learned from Homework 1 and the Federalist Papers lesson to clean the data. All I had to do was remove all of the metadata at the beginning that introduces what the book is, the words "CHAPTER" and "VOLUME," and the Roman numerals representing each chapter/volume number. I also made all of the words lowercase to make capitalized and non-capitalized instances of the same word indistinguishable.

In [None]:
# importing the NLTK Toolkit
import nltk

# downloading/importing the Gutenberg package
nltk.download('gutenberg')
from nltk.corpus import gutenberg

corpus = nltk.corpus.gutenberg.words('austen-emma.txt') # importing "Emma" by Jane Austen
corpus = corpus[7:] # removing the introductory parts of the file

words = [] # making a list for the cleaned corpus
import string # to get the punctuation
lw_chapvol = False # was the last word "CHAPTER" or "VOLUME"?

for i in range(len(corpus)):
  if not all(ch in string.punctuation for ch in corpus[i]): # if the token isn't made up entirely of punctuation (this also catches tokens like "--"),
    if (corpus[i] == "CHAPTER") or (corpus[i] == "VOLUME"): # is the word "CHAPTER" or "VOLUME" (table of contents words)?
      lw_chapvol = True # if it is, set this to true so we don't include the Roman numeral following.
    elif (lw_chapvol == True): # it's probably a Roman numeral, don't want to confuse this with real words
      lw_chapvol = False # set to false so we can get the next word
    else:
      words.append(corpus[i].lower()) # making all of the words lowercase

words = words[:-1] # removing "FINIS" (the end)

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
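
The same cleaning steps could also be packaged as a reusable function. The sketch below is one possible way to do that, not the code the rest of this notebook relies on: `clean_tokens` and `HEADING_WORDS` are hypothetical names, the sample list is a toy stand-in for the real token list, and it assumes that tokens made up entirely of punctuation (such as `;--`) should be dropped too.

```python
import string

HEADING_WORDS = {"CHAPTER", "VOLUME"}  # heading tokens that are followed by a Roman numeral


def clean_tokens(tokens):
    """Lowercase the tokens, dropping punctuation, headings, and heading numerals."""
    cleaned = []
    skip_next = False  # True right after a heading word such as "CHAPTER"
    for token in tokens:
        # Drop tokens made up only of punctuation, including multi-character ones.
        if all(ch in string.punctuation for ch in token):
            continue
        if token in HEADING_WORDS:
            skip_next = True
        elif skip_next:
            skip_next = False  # this token is the Roman numeral after the heading
        else:
            cleaned.append(token.lower())
    return cleaned


# Toy stand-in for the token list that NLTK returns for Emma.
sample = ["VOLUME", "I", "CHAPTER", "I", "Emma", "Woodhouse", ",", "handsome", ";--", "clever"]
print(clean_tokens(sample))  # ['emma', 'woodhouse', 'handsome', 'clever']
```

Writing it as a function makes it easy to try the cleaning on a short list before running it on the whole novel.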


### Developing a Synonym Picker

In the [WordNet lesson on November 7th](https://colab.research.google.com/drive/19DVqYIIWgjCzBE0hvnabn5XT0n8yNxt5?usp=sharing), we saw that [WordNet](https://wordnet.princeton.edu) could give us a list of synonyms for most words.

In [None]:
# importing the wordnet package 

from nltk.corpus import wordnet as wn
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

For my own reference and as an example, I brought the Webster function here as a model to see what types of functionality come with WordNet. In this example, you can see that we can make use of the `synsets()` function and the `lemma_names()` method to look up words and find synonyms, respectively.

In [None]:
# taking the webster function from the wordnet lesson

def webster(word):
  synsets = wn.synsets(word) # this method looks up the word parameter
  for s in synsets: # for each item in the list
    print(' - {}. {}'.format(s.pos(), s.definition())) # formats the POS + def
    syns = [w for w in s.lemma_names() if w != word] # defines synonyms
    if len(syns) > 0: # if there are synonyms
      print('  Syn: {}'.format(syns)) # list them here
    if len(s.examples()) > 0: # if the word has example sentences/phrases
      print('  Exx: {}'.format(s.examples())) # list it here

webster('broadcast')
# when you call this command in the shell, you'll see this for each entry:
# POS + definition(s)
# synonyms
# examples

 - n. message that is transmitted by radio or television
 - n. a radio or television show
  Syn: ['program', 'programme']
  Exx: ['did you see his program last night?']
 - v. broadcast over the airwaves, as in radio or television
  Syn: ['air', 'send', 'beam', 'transmit']
  Exx: ['We cannot air this X-rated song']
 - v. sow over a wide area, especially by hand
  Exx: ['broadcast seeds']
 - v. cause to become widely known
  Syn: ['circulate', 'circularize', 'circularise', 'distribute', 'disseminate', 'propagate', 'spread', 'diffuse', 'disperse', 'pass_around']
  Exx: ['spread information', 'circulate a rumor', 'broadcast the news']


With these capabilities in mind, I wrote `pick_from_emma()`, a function that takes a `word` as a parameter, looks it up in WordNet, and returns a randomly chosen synonym that also appears in _Emma_.

In [None]:
from numpy import random

def pick_from_emma(word):
  synsets = wn.synsets(word) # look up the word
  if len(synsets) == 0: # if the word isn't in WordNet's database,
    return word # then just return the word as is
  syns = [w for w in synsets[0].lemma_names() if w != word] # otherwise, find the synonyms for the first entry for the word
  if len(syns) > 0: # if there are synonyms:
    options = [s for s in syns if s in words] # only keep the synonyms that are in the words from Emma
    if len(options) > 0: 
      num_entries = len(options)
      choice = random.randint(0, (num_entries))
      return options[choice]
    return word # if there are no synonyms that are in Emma, return the word as is
  return word # if there are no synonyms at all, return the word as is

As you can see, the functionality isn't the best. At the time, I wasn't sure what the best way to choose a word entry was without having to print out or look through all of them manually, so I decided to just choose the first entry WordNet provided, since most dictionaries list the most common definition first. From there, if any of the first entry's synonyms are in _Emma_, I use `randint()` from `numpy.random` to randomly choose a word from the list of `options`. (NumPy's `randint()` excludes its upper bound, so `randint(0, num_entries)` always gives a valid index.) If the word isn't in WordNet's database, if it doesn't have any synonyms in _Emma_, or if it doesn't have any synonyms _at all_, `pick_from_emma()` will just return the `word` it was given.
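
To make the three fallback cases easier to see, here is a minimal sketch of the same picking logic with the WordNet lookup taken out. Everything in it is an assumption for illustration: `pick_from_vocabulary` is a hypothetical name, the synonym list is a toy stand-in for WordNet's lemma names, and the vocabulary is a small `set` standing in for the words of _Emma_. It also uses `random.choice()` from the standard library instead of `randint()`, which avoids the index arithmetic.

```python
import random


def pick_from_vocabulary(word, candidates, vocabulary, rng=random):
    """Return a random candidate found in the vocabulary, or the word itself."""
    options = [c for c in candidates if c != word and c in vocabulary]
    if not options:  # no candidates at all, or none that appear in the corpus
        return word
    return rng.choice(options)


emma_vocab = {"fate", "lot", "good", "on"}  # toy stand-in for the set of words in Emma
luck_synonyms = ["fortune", "fate", "lot", "portion"]  # toy stand-in for WordNet lemma names

print(pick_from_vocabulary("luck", luck_synonyms, emma_vocab))  # either "fate" or "lot"
print(pick_from_vocabulary("finals", [], emma_vocab))  # no candidates, so "finals"
```

Checking membership in a `set` is also much faster than checking membership in a list as long as a novel.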

### Version 1 (The Presentation Version)

With `pick_from_emma()` coded, we can create an `emmatize_v1()` function that will "translate" our user's input into Jane Austen's style! The function has been commented throughout to make its steps clear, but to elaborate a bit more:
*   The function prints an input prompt that takes in the user's desired word, phrase, or sentence.
*   From there, the input is split on whitespace, turned into all lowercase letters, and depunctuated to be put into a list of `cleaned` words.
*   For each word in `cleaned`, `emmatize_v1()` will call `pick_from_emma()` to pick a synonym that Jane Austen used in _Emma._ This chosen synonym (as well as a space) will get appended to the `output` string.
*   The function then removes the last trailing space after the final word of output, then prints the "Austen-ized," "translated" `output` to the shell.



In [None]:
def emmatize_v1():
  take_in = input('What do you want to Austen-ize today? ') # input prompt
  input_words = take_in.split() # turns string into a list of strings containing each word

  cleaned = [] # making a list for the cleaned words
  for w in input_words: # for each word
    w = w.lower() # make the chars all lowercase
    word = "" # empty string for the word
    for i in range(len(w)):
      if w[i] not in list(string.punctuation): # for each character, if it isn't a punctuation mark
        word += w[i] # add it to the cleaned word
    cleaned.append(word) # append the cleaned word to the list

  output = "" # empty output string
  for word in cleaned: # for each input word
    output += pick_from_emma(word) # pick its synonym from Emma and add it to the output string
    output += " " # add a space before the next word

  output = output[:-1] # remove trailing space

  print(output) # print the "translated" output

In theory, now that we've created our functions, this should work out! Let's translate a couple of phrases/sentences and see...

In [None]:
emmatize_v1()

What do you want to Austen-ize today? good luck on finals
good lot on final


If you try the translator yourself, you'll find that it's really not great. I spent thirty to forty-five minutes testing different sentences, but regardless of the content, the translations don't really make sense.

Here are some examples of what I tried to translate:


*   `I want something to eat.`  ➡  `i privation something to eat`
*   `Have you seen my keys anywhere?`   ➡  `have you see my key anywhere`
*   `I miss you. It's been a long time since we've seen each other!`   ➡  `i girl you its be a long time since weve see each other`
*   `I went home and read a book.`   ➡  `i travel place and read a book`
*   `Good luck on finals!`   ➡  `good fate on final`



### Remarks on the Presentation Version of the Code

As you can tell, the translator currently has no concept of morphology or semantics at all. Instead of providing a sensical choice for a synonym, it may return the uninflected base form of a word (`seen` becomes `see`, `keys` becomes `key`) or choose a synonym that belongs to the wrong sense of the original word (`miss` becomes `girl`). The code always takes the first synset that WordNet returns, and the random choice from the list of synonyms certainly doesn't help either. I really want to fine-tune this code a little bit more to see if it'll be able to accommodate parts of speech as well, but for right now, it's more of an Austen thesaurus than an Austen translator.

### Improvements Using Comments from the Presentation

During the presentation, the main concerns with my project were being able to determine the sense in which Austen used her words in _Emma_ and using that data to inform the translator on how to properly choose a synonym from the text. Unfortunately, the best way to do this would be manually going through _Emma_ and tagging the senses, parts of speech, and morphological statuses myself. The input text would also have to be tagged with the same information, but this would be much harder since there is no way to predict what the user might write.

At first, I thought about using Word2Vec to achieve this, but since Word2Vec provides a "network" of related terms as opposed to actual _synonyms_, I figured it may not be that useful. Neither the computer nor I would be able to reasonably tag _all_ of the aforementioned information at this point in time, but a great recommendation the class gave me was to use a part of speech tagger to get a better read on how the words in _Emma_ are actually used. Fortunately, NLTK comes with the function `pos_tag()`, which takes in a list of words and returns a list of tuples in the format (word, part of speech). [Here is a list of the abbreviations](https://www.geeksforgeeks.org/part-speech-tagging-stop-words-using-nltk-python/) that `nltk.pos_tag()` uses in its part of speech tagging method. As you can see, this function goes a bit further than simple parts of speech: its tags also carry information on tense, number (singular vs. plural), degree (comparative vs. superlative), and more. This will be a bit more helpful in understanding the senses in both _Emma_ and the input the user will give us.

In [None]:
nltk.download('averaged_perceptron_tagger')

emma_pos = nltk.pos_tag(words) # using a POS tagger on the words in the corpus
# returns a list of tuples where the tuples have the format (word, POS)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
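
`emma_pos` is a flat list of (word, tag) pairs, so finding out how a particular word was tagged means scanning the whole list. One possible way to make it easier to query is to index it by word, as sketched below. `index_tags` is a hypothetical helper, and the tagged list is a toy stand-in whose tags are illustrative rather than actual tagger output.

```python
from collections import defaultdict


def index_tags(tagged_words):
    """Map each word to the set of POS tags it received in the tagged corpus."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_words:
        tags_by_word[word].add(tag)
    return tags_by_word


# Toy stand-in for emma_pos; these tags are made up for the example.
toy_emma_pos = [("the", "DT"), ("want", "NN"), ("of", "IN"), ("i", "PRP"), ("want", "VBP")]
tags = index_tags(toy_emma_pos)
print(sorted(tags["want"]))  # ['NN', 'VBP']
print("VBD" in tags["want"])  # False
```

With an index like this, a synonym picker could check whether a candidate was ever tagged in _Emma_ with the part of speech it needs, instead of only checking whether the word appears at all.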


Cool! Now we've tagged all of the words in _Emma_ with their parts of speech. We can use this information to make a more informed decision in our synonym picking process. Let's make a new function to choose synonyms— we'll call it `better_pick_from_emma()`.

As with the previous synonym-picking function, I've commented throughout `better_pick_from_emma()` to give you a good sense of what's going on here. But to explain a bit further, this is what `better_pick_from_emma()` does:

*   The function passes the `word` parameter to `nltk.pos_tag()` to get the part of speech of the word of interest. (You'll see below that we passed in `[word]`, which is a list. `nltk.pos_tag()` only takes in lists of strings, so we'll just pass in the `word` parameter as a list with one element.) This gives us a list with one tuple whose format is (input word, POS).
*   Second, it searches for the synsets of `word` in WordNet using the `synsets()` function. If the `word` isn't in WordNet's database, we return the word as is.
*   Otherwise, we've found at least 1 entry (synset) for `word`! Using a list comprehension and the `lemma_names()` function, the function then compiles the synonyms for each entry and stores them in the variable `syns`.
*   If there are no synonyms, the function returns `word`. If there _are_ synonyms, we tag all of the `syns` using `nltk.pos_tag()`.
*   Now that all of the synonyms have been tagged, we use another list comprehension to make a list called `emma_options`. Words are only added to this list if they are in _Emma_ **and** have the same part of speech/morphological status as the original `word`.
*   If there are no synonyms that fit those criteria, we return `word`. If there _are_ synonyms, we randomly choose a synonym from `emma_options` and return that!



In [None]:
def better_pick_from_emma(word):
  tagged_pos = nltk.pos_tag([word]) # [(input word, POS)]
  synsets = wn.synsets(word) # look up the word

  if len(synsets) == 0: # if the word isn't in WordNet's database,
    return word # then just return the word as is

  syns = []
  for i in range(len(synsets)):
    syns += [w for w in synsets[i].lemma_names() if w != word] # otherwise, find the synonyms for each entry for the word

  if len(syns) > 0: # if there are synonyms:
    tagged_syns = nltk.pos_tag(syns) # returns a list of tuples in the format (word, POS)
    # only add the synonym to the list of options if it is in Emma AND it has the same POS as the original word
    emma_options = [t[0] for t in tagged_syns if t[0] in words and t[1] == tagged_pos[0][1]]

    if len(emma_options) > 0: # if there are options
      num_entries = len(emma_options)
      choice = random.randint(0, (num_entries)) # randomly choose one of them
      return emma_options[choice] # and return it
    return word # if there are no synonyms that are in Emma, return the word as is

  return word # if there are no synonyms at all, return the word as is

After reading through the function above, you may wonder why I didn't use the part of speech that already comes with each WordNet synset. After all, we saw in the Webster function that WordNet can give us the part of speech of any synset we get. However, [WordNet only contains "open-class words," so it only provides 4 parts of speech](https://wordnet.princeton.edu/frequently-asked-questions): nouns, verbs, adjectives, and adverbs. Furthermore, WordNet cannot produce inflected forms. In fact, when given an inflected word, it just follows simple rules until it finds a valid word form in the database, regardless of whether the "inflected form" is itself valid.

Though `nltk.pos_tag()` is not a perfect tagger, it allows us to get more specific with our categorization by providing more parts of speech as well as storing some morphological information.
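
The two sources of part of speech information do not have to be an either/or choice. As a minimal sketch, the tagger's Penn Treebank tags can be collapsed to the single-letter codes WordNet uses for its four categories (`n`, `v`, `a`, `r`). `penn_to_wordnet` is a hypothetical helper, not part of NLTK, and this notebook does not use it.

```python
def penn_to_wordnet(tag):
    """Collapse a Penn Treebank tag to a WordNet POS letter, or None if there is none."""
    if tag.startswith("JJ"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("VB"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("NN"):
        return "n"  # noun (NN, NNS, NNP, NNPS)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return None  # closed-class tags such as DT, IN, or PRP have no WordNet category


for tag in ["VBD", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The resulting letter could then be passed as the `pos` argument of `wn.synsets()` to restrict the lookup to synsets of the matching category, while the full tag is still available for the finer-grained comparison.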

### Additional Changes to Increase Accuracy/Faithfulness

With `better_pick_from_emma()`, the translator now has information on morphology and part of speech to better guide its word choice, so I can now make a better translator function! The new version has the same functionality as the first version, except I've made a couple of changes to make the translation a bit more faithful to the original input the user gives:

*   As mentioned above, `better_pick_from_emma()` allows the translator to take part of speech and morphological data into account when choosing its Austen synonym.
*   Before cleaning the input, the translator uses a list to store the indices of the capitalized words and the word-ending punctuation. Once the input has been "Austen-ized," the translator will use the indices it stored to add the capitalization and punctuation back in to make the sentence(s) look a bit more like the original.
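
The idea in the second bullet can be sketched for a single word. This is a simplified illustration under stated assumptions, not the index-based bookkeeping that `emmatize()` uses: `restore_surface` is a hypothetical helper, it only handles punctuation at the end of a word, and it treats a one-letter word like `I` as capitalized rather than as all caps.

```python
import string


def restore_surface(original, replacement):
    """Copy the casing and trailing punctuation of `original` onto `replacement`."""
    core = original.rstrip(string.punctuation)  # the word without trailing punctuation
    trailing = original[len(core):]  # e.g. "!" or "?!"
    if len(core) > 1 and core.isupper():  # fully capitalized word
        replacement = replacement.upper()
    elif core[:1].isupper():  # only the first letter is capitalized
        replacement = replacement[:1].upper() + replacement[1:]
    return replacement + trailing


print(restore_surface("Good", "estimable"))  # Estimable
print(restore_surface("LUCK?!", "chance"))  # CHANCE?!
print(restore_surface("finals!", "finals"))  # finals!
```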

Additionally, to make the translation a bit closer to Jane Austen's literary choices, I've changed the way the translator deals with apostrophes, slang, and contractions during the input cleaning process. While skimming through some paragraphs of _Emma_, I found that Jane Austen doesn't really use contractions. In fact, using Command-F on the [.txt version of the book](https://www.gutenberg.org/cache/epub/158/pg158.txt) shows that she used contractions fewer than thirty times throughout the entire novel. Because of this, I have installed the `contractions` package. [Originally developed by Pascal van Kooten](https://github.com/kootenpv/contractions), this package takes slang and contractions and expands them to their full (prescriptivist) form. Simply put, `emmatize()` will expand the input words before adding them to the `cleaned` list of words, which will then be passed to `better_pick_from_emma()`. I've included an example of how `contractions` works to give you a better idea of what's happening behind the scenes:

In [None]:
# this package allows us to expand slang and contractions to their full form

!pip install contractions
import contractions

# for example:
example = "Ur so cool. Do u wanna be friends?"
print(example)
ex_words = example.split()

prescriptivist = ""
for w in ex_words:
  # this function will expand the "misspelled"/"slang-y" words to their "proper" form
  prescriptivist += contractions.fix(w)
  prescriptivist += " "

prescriptivist = prescriptivist[:-1]

print(prescriptivist)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Ur so cool. Do u wanna be friends?
You Are so cool. Do you want to be friends?


### Final Version and Results

With all of the changes mentioned in the last two sections, I've finally created a much better version of `emmatize()`! As with the other functions in this notebook, I have commented the code thoroughly to make each step as clear as possible. To put it simply, what `emmatize()` does is:
1.   Take in input from the user.
2.   Record where the capital letters and punctuation are.
3.   Clean the input.
4.   Use `better_pick_from_emma()` to find an Austen synonym for every word.
5.   Add back the capitalization and punctuation.
6.   String the Austen-ized words together.
7.   Print out the translation.



In [None]:
def emmatize():
  take_in = input('What do you want to Austen-ize today? ') # input prompt

  input_words = take_in.split() # turns string into a list of strings containing each word

  uppercase_letters = [] # list of where all of the capitalized words are
  all_caps = [] # list of where all of the fully capitalized words are
  punctuation = {} # dictionary of where and what all sentence-ending punctuation is

  for i in range(len(input_words)): # for each word in the input string
    if input_words[i][0].isupper() and (not (input_words[i].isupper())): # if the first letter is capitalized but NOT the full word,
      uppercase_letters.append(i) # append the index to the list of loc's where capitalized words are
    elif (input_words[i].isupper()): # else if the entire word is capitalized,
      all_caps.append(i) # append the index to the list of loc's where ALL CAPS words are
    for c_ind in range(len(input_words[i])): # and for each character in the input word,
        # if the character is a punctuation mark (and NOT an apostrophe),
      if (input_words[i][c_ind] in list(string.punctuation)) and (input_words[i][c_ind] != "'"):
        punct_mark = input_words[i][c_ind]
        # checking if there are multiple punctuation marks at the end of the word...
        # if there are no letters or numbers from here until the end:
        if (not (any(c.isalpha() for c in input_words[i][c_ind:]))) and (not (any(c.isdigit() for c in input_words[i][c_ind:]))):
          # just call the entire thing a punctuation mark; don't loop through the rest of it
          punct_mark = input_words[i][c_ind:]
          c_ind = -1 # this will allow us to just attach it all onto the end
          if i not in punctuation.keys():
            punctuation[i] = [(c_ind, punct_mark)]
            break # need to break out and stop loop
          else:
            punctuation[i] += [(c_ind, punct_mark)]
            break # need to break out and stop loop
        if i not in punctuation.keys():
          punctuation[i] = [(c_ind, punct_mark)] # add the WORD index to the dictionary as a key, and make the CHARACTER tuple its value
        else: 
          punctuation[i] += [(c_ind, punct_mark)] # append the CHARACTER tuple to the dict of loc's where punct. marks are

  cleaned = [] # making a list for the cleaned words
  for w in input_words: # for each word,
    w = contractions.fix(w) # if it's slang/a contraction, we'll expand it to its orthographically correct form
    w = w.lower() # make the chars all lowercase
    word = "" # empty string for the word
    for i in range(len(w)):
      if (w[i] not in list(string.punctuation)) or (w[i] == "'"): # if the character isn't a punctuation mark OR if it is an apostrophe
        word += w[i] # add the character to the string
    cleaned.append(word) # append the cleaned word to the list

  output = [] # list of output words
  for word in cleaned: # for each input word
    output.append(better_pick_from_emma(word)) # pick its synonym from Emma and add it to the output list

  # the length of the output list should be the same length as the input list,
  # so the synonyms will be capitalized in the same place as their original capitalized counterparts
  for n in uppercase_letters:
    output[n] = output[n][0].upper() + output[n][1:]

  # same principle applies for fully capitalized words
  for n in all_caps:
     output[n] = output[n].upper()

  # finally, add back the punctuation to their appropriate places using the dictionary we made earlier
  for word_index in punctuation.keys():
    for t in punctuation[word_index]: # for each tuple
      punct_ind = t[0] # first elem is index
      punct = t[1] # second elem is mark
      if punct_ind != -1: # if it isn't at the end,
        # put it in the middle where it was before
        output[word_index] = output[word_index][:punct_ind] + punct + output[word_index][punct_ind:]
      else:
        # otherwise, add it onto the end
        output[word_index] += punct

  output_string = "" # create empty string
  for word in output:
    output_string += word # add the words to the output string
    output_string += " " # and put spaces in between!

  output_string = output_string[:-1] # remove trailing space after last word

  print(output_string) # print the "translated" output

Amazing! Now we can test _this_ version of the translator and see how it works.

In [None]:
emmatize()

What do you want to Austen-ize today? I want something to eat.
I privation something to feed.


Here are the example sentences from before with their *new translations*:


*   `I want something to eat.`  ➡  `I privation something to feed.`
*   `Have you seen my keys anywhere?`   ➡  `Get you seen my keys anywhere?`
*   `I miss you. It's been a long time since we've seen each other!`   ➡  `I drop you. It is cost a long clock since we have seen each former!`
*   `I went home and read a book.`   ➡  `I went base and interpret a reserve.`
*   `Good luck on finals!`   ➡  `Estimable chance along finals!`



From running the translator a couple of times, some of the translations of the example sentences are better, but they aren't exactly grammatical, let alone sensical. On a positive note, though, the capitalization and punctuation rules implemented into the final code make the translations appear more polished and more loyal to the user's input sentence.

### Conclusion and Next Steps

As it turns out, it is quite difficult to model a translator after a famous author's literary style. Even with the part of speech and morphological data added in, the finalized version of the Jane Austen translator is, once again, a Jane Austen thesaurus. Thanks to several revisions, the translator is relatively faithful to both the user's input and _Emma._ However, at its core, the translator relies solely on picking synonyms and constructing meaning word by word as opposed to getting the semantics of the input phrase and developing an "Austen phrase" with the same overall semantic meaning. Although the translator is capable of producing words that Jane Austen has used in _Emma_, the translator is incapable of producing sensical— let alone eloquent or stylistic— sentences, and it is a far cry from the literary classic _Emma_ is.

Once again, it seems as though the main issue with developing a literary translator like this is being able to grasp sense and semantic meaning from both the input source and the corpus. Even though using the part of speech tagger from NLTK in conjunction with the WordNet synsets made some word choices better (ex: `I drop you` rather than `i girl you` in example sentence 3), it did not completely fix the true problem, which was the translator changing some of the word-level and sentence-level meanings from those of its input.

As mentioned in class, the next steps to take from here would be finding some sort of way to incorporate semantic meaning into the literary translator. Though I doubted its usefulness during the project development stages, I would be open to implementing Word2Vec in some way to see if it would be a better way of getting synonyms than WordNet. Additionally, I wanted to try using the bigrams/pairs of words from the haiku assignment to see if preceding or following context would be able to help better inform the translator's word choices when picking Austen synonyms from the corpus. However, these methods may still require some sort of human hard-coding to get the translator to understand the senses both Austen and the user convey.
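
To make the bigram idea a little more concrete, here is a rough sketch of how preceding context might inform the choice. It is an untested idea rather than part of the translator: `count_bigrams` and `pick_with_context` are hypothetical helpers, and the word list is a toy stand-in for the cleaned words from _Emma_.

```python
from collections import Counter


def count_bigrams(corpus_words):
    """Count each pair of adjacent words in the corpus."""
    return Counter(zip(corpus_words, corpus_words[1:]))


def pick_with_context(previous_word, candidates, bigram_counts, fallback):
    """Pick the candidate seen most often after `previous_word`, else the fallback."""
    best = max(candidates, key=lambda c: bigram_counts[(previous_word, c)], default=None)
    if best is None or bigram_counts[(previous_word, best)] == 0:
        return fallback  # no candidate ever follows previous_word in the corpus
    return best


# Toy stand-in for the cleaned list of words from Emma.
toy_words = "a good chance of a good match and a good chance too".split()
counts = count_bigrams(toy_words)
print(pick_with_context("good", ["fate", "chance", "match"], counts, "luck"))  # chance
print(pick_with_context("bad", ["fate", "chance"], counts, "luck"))  # luck
```

This only looks at the word before each candidate, so it would still miss the sentence-level meaning, but it would at least favor word pairs that Austen actually wrote.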

Additionally, I wonder if choosing a different author would cause the translator to be a bit better at imitating literary style. Though Jane Austen's writing is from about 200 years ago, much of the author's diction is quite similar to that of present-day authors. If WordNet could support older or different forms of English, I would have loved to see if I could make an English to Shakespearean translator or English to Pirate Speak translator similar to those on LingoJam. Having that direct comparison to another "literary/stylistic translator" could have helped me understand the types of rules that govern a literary translator and further fine-tune my code to make the translations much more grammatical and sensical.

Regardless, I greatly enjoyed being able to develop this translator, even if its functionality is not as I had hoped. Being able to work through issues and further implement packages I had learned about in and out of class allowed this translator to be the best Austen translator/thesaurus it could be, and it is certainly an interesting project to play around with and input sentences into!

In [None]:
emmatize()

What do you want to Austen-ize today? I hope you enjoyed this project!
I desire you enjoyed this figure!
