# Find Synonyms project
- Anna Chen
- October 2024  
#### This program processes text data and calculates semantic similarity between words based on their co-occurrence in sentences, using a bag-of-words model and cosine similarity.

Key features include:

- Text preprocessing: expanding contractions, stemming, lemmatization, and removing stop words.
- Building semantic descriptors: creating word co-occurrence contexts.
- Calculating cosine similarity: measuring word relations.
- Running similarity tests: checking program accuracy using a test file.

synonyms.py builds on "Semantic Similarity" starter code.<br>

Starter Code<br>
Original Author: Michael Guerzhoy, University of Toronto, October 2014.<br>
Modified with permission by Marcus Gubanyi, Concordia University-Nebraska, October 2024.

In [1]:
# run the program 
%run synonyms.py


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\s9602\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\s9602\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\s9602\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


----------------
Find synonym program started...

Finished getting words and sentences from file (training data)
----------------
Finished getting word_context
test file: test.txt
The percentage of correct guesses is 0.0 %
The training data used: ['Swann’s Way by Marcel Proust.txt']
----------------
----------------
Find synonym program started...

Finished getting words and sentences from file (training data)
----------------
Finished getting word_context
test file: test.txt
The percentage of correct guesses is 0.0 %
The training data used: ['War and Peace by Leo Tolstoy.txt']
----------------


----------------
Find synonym program started...

Finished getting words and sentences from file (training data)
----------------
Finished getting word_context
test file: test altered.txt
The percentage of correct guesses is 25.0 %
The training data used: ['Swann’s Way by Marcel Proust.txt']
----------------
----------------
Find synonym program started...

Finished getting words and sentences

In [2]:
# import the functions to test from the find synonyms program
from synonyms import expand_contractions, get_sentence_lists, get_sentence_lists_from_files
from synonyms import build_semantic_descriptors, most_similar_word, run_similarity_test, run_program


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\s9602\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\s9602\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\s9602\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Test the functions I wrote to verify they work correctly

In [3]:
def test_expand_contractions():
    text = "I don't think it's possible. He hasn't decided yet."
    print("Text:" + text)
    expanded_text = expand_contractions(text, contractions)
    print(f"Text after expand contraction: {expanded_text}")
    expected_output = "I do not think it is possible. He has not decided yet."

test_expand_contractions()


Text:I don't think it's possible. He hasn't decided yet.
Text after expand contraction: I do not think it's possible. He has not decided yet.
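
The output above leaves `it's` unexpanded, so it differs from `expected_output`, which the test defines but never checks. A minimal sketch of a dictionary-based expander with that check added is shown below. `SAMPLE_CONTRACTIONS` and `expand_contractions_sketch` are hypothetical names used for illustration; the real `contractions` table and `expand_contractions` in synonyms.py may work differently.

```python
import re

# Hypothetical table for illustration only; the real `contractions`
# mapping used by synonyms.py is not shown here and may differ.
SAMPLE_CONTRACTIONS = {
    "don't": "do not",
    "it's": "it is",
    "hasn't": "has not",
}


def expand_contractions_sketch(text, mapping):
    """Replace each contraction found in `text` using `mapping`.

    Matching is case-insensitive, and the replacement is always the
    lowercase expansion, so "It's" would become "it is".
    """
    # Longest keys first so a longer contraction wins over a shorter one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(key) for key in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: mapping[match.group(0).lower()], text)


text = "I don't think it's possible. He hasn't decided yet."
expected_output = "I do not think it is possible. He has not decided yet."

result = expand_contractions_sketch(text, SAMPLE_CONTRACTIONS)
# Comparing against the expected text makes a missed contraction fail loudly.
assert result == expected_output, f"unexpected result: {result!r}"
print(result)
```

Adding an `assert` like this to `test_expand_contractions` would turn a missing entry such as `it's` into a visible failure instead of a line of output that has to be read by eye.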


In [4]:
def test_get_sentence_lists():
    text = "I don't agree with that. However, it's his decision."
    print("Text: " + text)
    
    sentence_lists = get_sentence_lists(text)
    print(f"result: {sentence_lists}")

    
test_get_sentence_lists()


Text: I don't agree with that. However, it's his decision.
result: [['not', 'agre'], ['howev', 'decis']]


In [5]:
def test_get_sentence_lists_from_files():

    # "test n.txt" does not exist; it checks how a missing file is handled
    filenames = ["test 2.txt", "test 3.txt", "test n.txt"]
    print(f"file names: {filenames}\n")
    
    all_sentences = get_sentence_lists_from_files(filenames)
    print(f"\nresult: {all_sentences}")

test_get_sentence_lists_from_files()

file names: ['test 2.txt', 'test 3.txt', 'test n.txt']

Error: File test n.txt not found.

result: [['test', 'file', 'hope', 'python', 'function', 'work'], ['today', 'work', 'dicid', 'learn', 'python'], ['actual', 'review', 'python'], ['like', 'cat', 'dog'], ['think', 'dog', 'better', 'easier', 'keep', 'healthi'], ['not', 'agre', 'either'], ['like', 'plant', 'snake'], ['dragon', 'also', 'cool', 'expens', 'keep'], ['cannot', 'one', 'even', 'abl', 'find', 'wild', 'dragon']]


In [6]:
def test_build_semantic_descriptors():
    
    sentences = [['cat', 'dog'], ['cat', 'fish'], ['dog', 'fish']]
    print(f"sentences: {sentences}\n")
    
    descriptors = build_semantic_descriptors(sentences)
    print(f"word context: {descriptors}")
    
test_build_semantic_descriptors()


sentences: [['cat', 'dog'], ['cat', 'fish'], ['dog', 'fish']]

word context: {'dog': {'cat': 1, 'fish': 1}, 'cat': {'dog': 1, 'fish': 1}, 'fish': {'cat': 1, 'dog': 1}}
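
For reference, one way to build co-occurrence descriptors like the ones printed above is sketched below. It assumes that each word is counted at most once per sentence and that a word is not counted as its own neighbour; `build_descriptors_sketch` is a hypothetical name, and `build_semantic_descriptors` in synonyms.py may be implemented differently.

```python
from collections import defaultdict


def build_descriptors_sketch(sentences):
    """Map each word to a dict counting the words it shares a sentence with."""
    descriptors = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Assumption: repeated words within one sentence count only once.
        unique_words = set(sentence)
        for word in unique_words:
            for other in unique_words:
                if other != word:
                    descriptors[word][other] += 1
    # Convert the nested defaultdicts to plain dicts for clean printing.
    return {word: dict(counts) for word, counts in descriptors.items()}


sentences = [['cat', 'dog'], ['cat', 'fish'], ['dog', 'fish']]
sketch_descriptors = build_descriptors_sketch(sentences)
print(sketch_descriptors)
```

Under these assumptions the sketch yields the same word-to-count mapping as the output above, although the key order may differ because sets are unordered.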


In [7]:
def test_most_similar_word():
    descriptors = {
        'feline': {'cat': 3, 'lion': 1, 'pet': 1},
        'cat': {'feline': 3, 'lion': 1, 'pet': 2},
        'dog': {'pet': 1},
        'horse': {'animal': 2}
    }
    word = 'feline'
    choices = ['cat', 'dog', 'horse']
    best_choice = most_similar_word(word, choices, descriptors)
    print(f"the best choice picked: {best_choice}")
    
test_most_similar_word()

the best choice picked: dog
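
Picking `dog` over `cat` can look surprising, so the sketch below recomputes the scores by hand. It assumes `most_similar_word` uses plain cosine similarity between the sparse descriptor dictionaries; `cosine_similarity` here is a hypothetical helper, not necessarily the one in synonyms.py.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(count * vec2.get(key, 0) for key, count in vec1.items())
    norm1 = math.sqrt(sum(count * count for count in vec1.values()))
    norm2 = math.sqrt(sum(count * count for count in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty descriptor has no direction to compare
    return dot / (norm1 * norm2)


descriptors = {
    'feline': {'cat': 3, 'lion': 1, 'pet': 1},
    'cat': {'feline': 3, 'lion': 1, 'pet': 2},
    'dog': {'pet': 1},
    'horse': {'animal': 2},
}

scores = {
    choice: cosine_similarity(descriptors['feline'], descriptors[choice])
    for choice in ['cat', 'dog', 'horse']
}
for choice, score in scores.items():
    print(f"{choice}: {score:.3f}")
# cat: 0.242
# dog: 0.302
# horse: 0.000
```

Under this assumption `dog` does score highest. The large `'feline': 3` entry in the `cat` descriptor has no matching key in the `feline` descriptor, so it inflates the norm of `cat` without adding to the dot product, while the single-entry `dog` descriptor lines up with `pet` and has a norm of 1. In other words, direct co-occurrence between two words does not raise their cosine similarity; only shared neighbours do.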


def test_run_program():
    # train with Swann’s Way
    run_program(["Swann’s Way by Marcel Proust.txt"], "test.txt")

    # train with War and Peace
    run_program(["War and Peace by Leo Tolstoy.txt"], "test.txt")

    # train with both
    run_program(["Swann’s Way by Marcel Proust.txt", "War and Peace by Leo Tolstoy.txt"], "test.txt")

    print("\n============== ⬇️⬇️ Use the altered test file ⬇️⬇️ ============================\n")
    # I altered the test.txt file so that its format matches the one
    # described in the assignment instructions.

    # train with Swann’s Way
    run_program(["Swann’s Way by Marcel Proust.txt"], "test altered.txt")

    # train with War and Peace
    run_program(["War and Peace by Leo Tolstoy.txt"], "test altered.txt")


test_run_program()

### Observation and Analysis
The model trained on *War and Peace* outperformed the one trained on *Swann’s Way*. My assumption is that *War and Peace* provided a significantly larger dataset: its text file is approximately three times the size of *Swann’s Way*. Larger datasets generally improve performance because each word appears in more varied contexts, which gives the model richer co-occurrence counts to compare.

However, even with this improvement, accuracy remains low: the model identified the correct synonym only 37.5% of the time, which is far from ideal. I believe that training on a much larger and more diverse dataset would improve this substantially by letting the model capture word relationships more effectively.
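
As a rough illustration of how a percentage like 37.5% arises, the sketch below scores a list of multiple-choice questions. `accuracy_percent`, the question tuples, and the stand-in picker are all hypothetical; `run_similarity_test` in synonyms.py may read and score the test file differently. The count of eight questions is also an assumption, chosen so that three correct guesses give 37.5%.

```python
def accuracy_percent(questions, pick_answer):
    """Return the percentage of questions where pick_answer finds the answer."""
    if not questions:
        return 0.0
    correct = sum(
        1
        for word, answer, choices in questions
        if pick_answer(word, choices) == answer
    )
    return 100.0 * correct / len(questions)


# Hypothetical data: 8 questions of the form (word, answer, choices).
# The correct answer is listed first in only the first 3 questions.
questions = [
    (f"word{i}", "right", ["right", "wrong"] if i < 3 else ["wrong", "right"])
    for i in range(8)
]


def first_choice_picker(word, choices):
    """Stand-in for a real model: always guesses the first choice."""
    return choices[0]


print(f"{accuracy_percent(questions, first_choice_picker)} %")  # 37.5 %
```

With so few questions in this example, each one moves the score by 12.5 percentage points, so small differences between training texts should be read with caution.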

#### Note on AI usage

Most of the code in synonyms.py was written with help from ChatGPT. I worked back and forth with it to improve the code, making sure I understood each part. I added comments and docstrings to explain what the code does and wrote the run_program and main functions myself. 

For the testing part, ChatGPT gave me a basic version, and I used it as inspiration to come up with the final version. The longer paragraphs in the documentation were first written by me, then I used ChatGPT to help with grammar and flow to make everything easier to read.