# Assignment 3: Vector Representations & LSA

**Please do not consult external resources for this assignment.**

Make sure you have done the required reading for this homework (which was also required reading for class):

- [Jurafsky, D., & Martin, J. H. (2020). Speech and Language Processing, Chapter 6: Vector Semantics and Embeddings (forthcoming 3rd edition). Prentice-Hall. [pp. 1-11]](https://web.stanford.edu/~jurafsky/slp3/6.pdf)

In this assignment, we are going to use a pretrained LSA model. Specifically, we will be using the `EN_100k_lsa` model by Fritz Günther [1]. It will be useful for this assignment to read more about the model and understand the data it was trained on. The model is large, so downloading it may take some time.

[1] [Günther, F., Dudschig, C., & Kaup, B. (2015). LSAfun – An R package for computations based on Latent Semantic Analysis. Behavior Research Methods, 47, 930-944.](https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces)

## Submission guidelines

Please upload your Jupyter notebook to the Canvas Assignment. Please do not include any system-specific configuration, such as the installation of dependencies, as code in the notebook (you can comment it out).


## Retrieving the LSA Model

The `EN_100k_lsa` dataset we're using for this assignment is distributed by the author for the R programming language. We have converted it to the plain-text format typically used for vector representations, which is what we will use for this assignment. You will need a Georgia Tech account to download it from [here.](https://gatech.instructure.com/files/54010573/download?download_frd=1)

In the file, each line is a quoted word, followed by 300 floating-point numbers that constitute the vector representation of that word. Here are a few examples from the file:

```
"is" 38.576145560208 21.6394607567694 19.7566635160741 -10.8594517804468 7.51268972210512 {295 others} 
"was" 69.1618505390472 -34.1040373913203 41.9157366516055 -14.2414103748296 -5.62593149303477 {295 others} 
"be" 24.9892859457854 16.3246671002679 3.70758479872187 -12.302839896438 2.23646336543604 {295 others}
```
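To make the line format concrete, here is a minimal sketch of how a single line breaks down into a word and a vector. The `sample_line` below is a shortened, hypothetical stand-in with only three values (a real line carries 300), and the variable names are illustrative rather than part of the provided code.

```python
import numpy as np

# Hypothetical, shortened line: a real line in the file carries 300 values.
sample_line = '"is" 38.576145560208 21.6394607567694 19.7566635160741 \n'

# split() with no argument discards the trailing space and the newline.
token, *values = sample_line.split()

# The word is wrapped in double quotes in the file, so strip them.
word = token.strip('"')
vector = np.array(values, dtype=np.float32)

print(word, vector.shape)
```

Checking that every parsed vector has the same length is a cheap way to confirm that the download finished and the file is intact.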


## Parsing the File

We have provided the code to parse the file into a dictionary of word vectors. 

In [2]:
import numpy as np
import pandas as pd

def parse_file(file_name):
    word_vectors = {}
    with open(file_name) as f:
        for line in f:
            first_whitespace = line.index(" ")
            word = line[:first_whitespace].strip('"')
            # split() with no argument also drops the trailing space and newline
            vector = np.array(line[first_whitespace + 1:].split(), dtype=np.float32)
            word_vectors[word] = vector
    return word_vectors

In [3]:
dict_word_vectors = parse_file("files/EN_100k_lsa.txt")

For example, use `dict_word_vectors["is"]` to access the word vector for the word "is".

In [4]:
dict_word_vectors["is"]

array([ 4.21761322e+02, -7.53212280e+01,  1.96545593e+02,  1.54491928e+02,
        5.50510178e+01, -2.55067123e+02,  8.02171478e+01, -8.99195480e+01,
        2.91481934e+02,  3.79438019e+02,  9.17247925e+01, -1.30209885e+02,
        1.65959106e+02, -2.96785278e+02,  4.34712563e+01, -5.82672806e+01,
        1.69251709e+01, -8.79788113e+00, -1.97886444e+02, -9.18110809e+01,
       -8.26303329e+01,  9.52104874e+01,  2.03846970e+02, -2.85616882e+02,
        1.03230957e+02, -2.49806076e+02,  1.70082809e+02, -1.14347286e+01,
        7.39332123e+01,  4.62236595e+01, -6.16472321e+01,  1.24699478e+01,
        1.16835470e+01,  2.07751816e+02,  3.11458569e+01, -2.89486053e+02,
        1.32502594e+02, -1.45918457e+02,  1.14770508e+01, -8.10860901e+01,
       -4.93820839e+01,  2.18805885e+01,  9.85299587e+00, -1.80958881e+01,
       -2.29861088e+01,  1.07338896e+01,  1.49447527e+01, -2.02187519e+01,
       -7.29695587e+01,  5.11481285e+00,  1.07671593e+02,  1.54620132e+02,
       -4.15899353e+01, -

## Implementation Guidelines & Tips

The following parts of the assignment should be analyzed and answered by computing similarities and/or differences between two words. You may use the [`scipy.spatial.distance.cosine`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html) function. Keep in mind that the function returns a **distance**. To find the **similarity**, subtract the returned distance from 1.
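The relationship between cosine distance and cosine similarity can be checked on small vectors. The sketch below uses made-up three-dimensional vectors (not LSA vectors), and the helper name `cosine_similarity` is an assumption chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Made-up toy vectors, used only to illustrate the arithmetic.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a

def cosine_similarity(u, v):
    """Convert SciPy's cosine distance into a similarity."""
    return 1.0 - cosine(u, v)

# The same quantity computed directly from the definition.
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(a, b))  # parallel vectors
print(cosine_similarity(a, c))  # anti-parallel vectors
print(manual)
```

Because cosine similarity depends only on direction, scaling a vector (as with `a` and `b`) leaves the similarity unchanged.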

## TOEFL Synonym questions

We're going to use the LSA model to complete TOEFL synonym questions. Let's see how it fares.

[You can see an example question here.](https://aclweb.org/aclwiki/TOEFL_Synonym_Questions_(State_of_the_art))

Use the `files/syntest.csv` dataset to complete this section.



Please use the LSA word vectors to predict the answers for each of the 20 questions.  

Print a table with columns \['Question', 'Option1', 'Option2', 'Option3', 'Option4', 'Correct', 'Type', 'model_answers', 'accuracy'\]

Then answer the following questions.

1. How well does the model do?
2. What kind of items does the model do well on? What kind of items does the model fail on? Why does it fail on those items? 
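One possible way to approach these questions is to summarize accuracy overall and per item type. The sketch below runs on a small hypothetical table: the column names follow the ones requested above, but the rows and the `Type` labels are invented for illustration and are not taken from `files/syntest.csv`.

```python
import pandas as pd

# Hypothetical results table; rows and Type labels are invented.
results = pd.DataFrame({
    "Correct": ["close", "like", "road", "day"],
    "model_answers": ["close", "appreciate", "road", "day"],
    "Type": ["adjective", "verb", "noun", "noun"],
})

# True where the model's pick matches the answer key.
results["is_correct"] = results["model_answers"] == results["Correct"]

# The mean of a boolean column is the fraction of correct answers.
overall_accuracy = results["is_correct"].mean()
accuracy_by_type = results.groupby("Type")["is_correct"].mean()

print(f"Overall accuracy: {overall_accuracy:.2f}")
print(accuracy_by_type)
```

Grouping by `Type` makes it easier to see whether errors cluster in one kind of item rather than being spread evenly.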

In [5]:
from scipy.spatial.distance import cosine

def compute_cosine_similarity(vector1, vector2):
    return (1 - cosine(vector1, vector2))

In [6]:
## TODO
# pass

syntest = pd.read_csv("files/syntest.csv")

def predict_answers(syntest, dict_word_vectors):
    predicted_answers = []
    
    for _, row in syntest.iterrows():
        question = row['Question']
        options = [row['Answer1'], row['Answer2'], row['Answer3'], row['Answer4']]
        target_vector = dict_word_vectors[question]
        max_similarity = float('-inf')
        model_answer = ""
        
        for option in options:
            if option in dict_word_vectors:
                option_vector = dict_word_vectors[option]
                similarity = compute_cosine_similarity(target_vector, option_vector)

                if similarity > max_similarity:
                    max_similarity = similarity
                    model_answer = option
            else:
                print(f"Word '{option}' not found in the LSA vectors provided by the TAs")
        
        predicted_answers.append({
            'Question': row['Question'], 'Answer1': row['Answer1'],'Answer2': row['Answer2'],
            'Answer3': row['Answer3'],'Answer4': row['Answer4'],'Correct': row['Correct'],
            'Type': row['Type'],'model_answers': model_answer,'accuracy': "100%" if model_answer == row['Correct'] else "0%"
        })
    
    return pd.DataFrame(predicted_answers)

In [7]:
print(predict_answers(syntest, dict_word_vectors))

       Question     Answer1      Answer2    Answer3     Answer4    Correct  \
0         large        wide      massive       tall         far       tall   
1          near       small        close       open     similar      close   
2         enjoy  appreciate    celebrate       like       claim       like   
3         lucky   fortunate        happy     tricky      sneaky  fortunate   
4        pretty      bright       joyful    popular   beautiful  beautiful   
5        street      gutter         road    railway    building       road   
6       apology       guilt  forgiveness     excuse       grief     excuse   
7         quick         shy        hasty      small        fast       fast   
8           sad   desperate        angry    unhappy   disgusted    unhappy   
9   intelligent    educated        smart     active  successful      smart   
10        night        dawn          day        sky        dusk        day   
11       friend      father        enemy  colleague     partner 

## Part 2: SAT analogy questions

We're now going to use LSA to complete SAT analogy questions (which were discontinued after the year 2005).

[You can find an example question here.](https://aclweb.org/aclwiki/SAT_Analogy_Questions_(State_of_the_art))

Pick 5 analogy questions from the SAT practice test book that can be [found here.](https://dbgyan.files.wordpress.com/2013/02/501_word_analogy.pdf) Pick items that you think you will find interesting or useful to discuss when answering the questions.

Use the LSA model to predict answers on the example question, and your 5 selected questions. Here's one way to perform a comparison: To represent a word pair, you can add the two word vectors together. You can then compare the word pair vectors to each other to answer the questions. However, you can take other paths if you think they will be interesting to discuss.
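The pair-sum comparison described above could be sketched as follows. Everything here is an assumption for illustration: the three-dimensional `toy_vectors` are invented and say nothing about the real LSA space, and `pair_vector` and `best_pair` are hypothetical helper names.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Invented 3-d vectors standing in for the 300-d LSA vectors.
toy_vectors = {
    "warm": np.array([1.0, 0.0, 0.2]),
    "hot": np.array([2.0, 0.0, 0.2]),
    "amusing": np.array([0.9, 0.1, 0.3]),
    "hilarious": np.array([1.9, 0.1, 0.3]),
    "humid": np.array([0.0, 1.0, 0.1]),
    "summer": np.array([0.2, 1.5, 0.0]),
}

def pair_vector(first, second, vectors):
    """Represent a word pair as the sum of its two word vectors."""
    return vectors[first] + vectors[second]

def best_pair(stem, candidates, vectors):
    """Return the candidate pair most cosine-similar to the stem pair."""
    stem_vec = pair_vector(*stem, vectors)
    scores = {
        pair: 1.0 - cosine(stem_vec, pair_vector(*pair, vectors))
        for pair in candidates
    }
    return max(scores, key=scores.get), scores

best, scores = best_pair(
    ("warm", "hot"),
    [("amusing", "hilarious"), ("humid", "summer")],
    toy_vectors,
)
print(best)
```

Note that summing discards word order, so this representation cannot tell `warm : hot` apart from `hot : warm`. Comparing difference vectors instead is one alternative that preserves direction.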

Then answer the following questions:

3. Why did you pick the items you picked? How well did the model do on them?
4. What kind of items does the model do well on? What kind of items does the model fail on? Why does it fail on those items? 

In [8]:
def solve_analogy(Question, dict_word_vectors):
    predicted_answers = []
    
    question = Question['Question'][2]
    vec1 = dict_word_vectors[Question['Question'][0]]
    vec2 = dict_word_vectors[Question['Question'][1]]
    options = [Question['Answer1'], Question['Answer2'], Question['Answer3'], Question['Answer4']]
    target_vector = dict_word_vectors[question]
    analogy_vector = vec2 - vec1
    max_similarity = float('-inf')
    model_answer = ""
    
    for option in options:
        if option in dict_word_vectors:
            option_vector = target_vector - dict_word_vectors[option]
            similarity = np.dot(analogy_vector, option_vector)

            if similarity > max_similarity:
                max_similarity = similarity
                model_answer = option
        else:
            print(f"Word '{option}' not found in the LSA vectors provided by the TAs")
    
    predicted_answers.append({
        'Question': Question['Question'], 'Answer1': Question['Answer1'],'Answer2': Question['Answer2'],
        'Answer3': Question['Answer3'],'Answer4': Question['Answer4'],'Correct': Question['Correct'],
        'model_answers': model_answer,'accuracy': 100 if model_answer == Question['Correct'] else 0
    })
    
    return pd.DataFrame(predicted_answers)
    # return model_answer

In [9]:
example_question = {'Question': ['about', 'bout', 'mend'],
            'Answer1': 'amend',
            'Answer2': 'near',
            'Answer3': 'tear',
            'Answer4': 'dismiss',
            'Correct': 'amend'}

predicted_word = solve_analogy(example_question, dict_word_vectors)
print("The answer is:",  predicted_word)

example_question = {'Question': ['warm', 'hot', 'hilarious'],
            'Answer1': 'humid',
            'Answer2': 'raucous',
            'Answer3': 'summer',
            'Answer4': 'amusing',
            'Correct': 'amusing'}

predicted_word = solve_analogy(example_question, dict_word_vectors)
print("The answer is:",  predicted_word)


question = {'Question': ['trail', 'grain', 'grail'],
             'Answer1': 'train',
             'Answer2': 'path',
             'Answer3': 'wheat',
             'Answer4': 'holy',
             'Correct': 'train'}

predicted_word = solve_analogy(question, dict_word_vectors)
print("The answer is:",  predicted_word)

question = {'Question': ['particular', 'fussy', 'subservient'],
             'Answer1': 'meek',
             'Answer2': 'above',
             'Answer3': 'cranky',
             'Answer4': 'uptight',
             'Correct': 'uptight'}


predicted_word = solve_analogy(question, dict_word_vectors)
print("The answer is:",  predicted_word)

question = {'Question': ['horse', 'board', 'train'],
             'Answer1': 'stable',
             'Answer2': 'shoe',
             'Answer3': 'ride',
             'Answer4': 'mount',
             'Correct': 'shoe'}

predicted_word = solve_analogy(question, dict_word_vectors)
print("The answer is:",  predicted_word)

question = {'Question': ['son', 'nuclear', 'extended'],
             'Answer1': 'father',
             'Answer2': 'mother',
             'Answer3': 'cousin',
             'Answer4': 'daughters',
             'Correct': 'cousin'}

predicted_word = solve_analogy(question, dict_word_vectors)
print("The answer is:",  predicted_word)

question = {'Question': ['zenith', 'fear', 'composure'],
             'Answer1': 'apex',
             'Answer2': 'heaven',
             'Answer3': 'heights',
             'Answer4': 'nadir',
             'Correct': 'nadir'}

predicted_word = solve_analogy(question, dict_word_vectors)
print("The answer is:",  predicted_word)

question = {'Question': ['bog', 'slumber', 'sleep'],
             'Answer1': 'dream',
             'Answer2': 'foray',
             'Answer3': 'marsh',
             'Answer4': 'night',
             'Correct': 'marsh'}

predicted_word = solve_analogy(question, dict_word_vectors)
print("The answer is:",  predicted_word)


The answer is:               Question Answer1 Answer2 Answer3  Answer4 Correct model_answers  \
0  [about, bout, mend]   amend    near    tear  dismiss   amend          near   

   accuracy  
0         0  
The answer is:                  Question Answer1  Answer2 Answer3  Answer4  Correct  \
0  [warm, hot, hilarious]   humid  raucous  summer  amusing  amusing   

  model_answers  accuracy  
0       amusing       100  
The answer is:                 Question Answer1 Answer2 Answer3 Answer4 Correct  \
0  [trail, grain, grail]   train    path   wheat    holy   train   

  model_answers  accuracy  
0          path         0  
The answer is:                            Question Answer1 Answer2 Answer3  Answer4  Correct  \
0  [particular, fussy, subservient]    meek   above  cranky  uptight  uptight   

  model_answers  accuracy  
0         above         0  
The answer is:                 Question Answer1 Answer2 Answer3 Answer4 Correct  \
0  [horse, board, train]  stable    shoe    ride   mo

## Category Typicality

We're now going to attempt to replicate human typicality ratings of the categories `color` and `flower` from [Castro et al. (2022)](https://psyarxiv.com/4gzn6/). Castro et al. (2022) asked participants to generate as many examples as they could for the categories and used the data to calculate human typicality ratings.

Retrieve the list of category types (responses) and typicality scores from `files/castro_et_al_typicality.csv` associated with `color` and `flower`.

* Compute how similar each response for category color is to the "color" (the word) vector representation
* Compute how similar each response for category flower is to the "flower" (the word) vector representation
* Compare the results to the human data (typicality scores in the csv)

This function might help with comparisons with human data: [`scipy.stats.spearmanr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html?highlight=spearman).

NOTE: If the model does not have a vector for a particular word, remove that word from the analysis.
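As a rough illustration of the workflow above (drop out-of-vocabulary words, compute similarities to the category word, then rank-correlate with the human scores), here is a minimal sketch. The two-dimensional `toy_vectors` are invented, "chartreuse" and its score are hypothetical, and the variable names are assumptions. Only the `spearmanr` call is the real SciPy function.

```python
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

# Invented 2-d vectors; "chartreuse" is deliberately missing.
toy_vectors = {
    "color": np.array([1.0, 0.0]),
    "blue": np.array([0.9, 0.1]),
    "red": np.array([0.8, 0.3]),
    "teal": np.array([0.2, 0.9]),
}

# Typicality scores keyed by response; the "chartreuse" entry is hypothetical.
human_scores = {"blue": 0.97, "red": 0.96, "teal": 0.11, "chartreuse": 0.02}

# Keep only the responses the model has a vector for.
kept = [word for word in human_scores if word in toy_vectors]

model_sims = [1.0 - cosine(toy_vectors[w], toy_vectors["color"]) for w in kept]
human = [human_scores[w] for w in kept]

# spearmanr compares rank orderings, not raw values.
rho, p_value = spearmanr(human, model_sims)
print(kept, round(rho, 2))
```

Because Spearman's coefficient depends only on ranks, the model's similarities do not need to be on the same scale as the typicality scores.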

1. What kind of items does the model do well on? What kind of items does the model fail on? Why does it fail on those items? 
2. How well does the model do compared to human data? Why does the model have higher correlation with human data for one category and negative correlation for the other category?

In [15]:
castro_et_al_typicality = pd.read_csv("files/castro_et_al_typicality.csv")
## TODO
# pass

from scipy.stats import spearmanr

color_data = castro_et_al_typicality[castro_et_al_typicality['Category'] == 'A color']
flower_data = castro_et_al_typicality[castro_et_al_typicality['Category'] == 'A flower']

color_data = color_data.dropna(subset=['Typicality_Scores'])
flower_data = flower_data.dropna(subset=['Typicality_Scores'])

def compute_similarity(response_list, category_word):
    # Build fresh lists on every call so the color and flower results stay separate.
    similarities, valid_responses = [], []
    category_vector = dict_word_vectors[category_word]

    for response in response_list:
        # .get() returns None for out-of-vocabulary words instead of raising KeyError.
        response_vector = dict_word_vectors.get(response.lower())

        if response_vector is not None:
            similarity = compute_cosine_similarity(response_vector, category_vector)
            similarities.append(similarity)
            valid_responses.append(response)

    return similarities, valid_responses

color_similarities, color_responses = compute_similarity(color_data['Response'].tolist(), 'color')
flower_similarities, flower_responses = compute_similarity(flower_data['Response'].tolist(), 'flower')

def compare_with_human_data(human_data, model_similarities, valid_responses):
    human_data = human_data[human_data['Response'].isin(valid_responses)]
    response_to_score = dict(zip(human_data['Response'], human_data['Typicality_Scores']))
    human_scores = [response_to_score.get(response, np.nan) for response in valid_responses]
    valid_indices = ~np.isnan(human_scores)
    human_scores = np.array(human_scores)[valid_indices]
    model_similarities = np.array(model_similarities)[valid_indices]

    return spearmanr(human_scores, model_similarities)

color_correlation = compare_with_human_data(color_data, color_similarities, color_responses)
flower_correlation = compare_with_human_data(flower_data, flower_similarities, flower_responses)

print(f"Color Category Spearman Correlation: {color_correlation.correlation:.2f}")
print(f"Flower Category Spearman Correlation: {flower_correlation.correlation:.2f}")

# Color Category Spearman Correlation: -0.43
# Flower Category Spearman Correlation: 0.47

# Color Category Spearman Correlation: 0.58
# Flower Category Spearman Correlation: -0.28

    Category   Response  Typicality_Scores
178  A color       Blue               0.97
179  A color        Red               0.96
180  A color      Green               0.92
181  A color     Yellow               0.85
182  A color     Orange               0.79
183  A color     Purple               0.77
184  A color      Black               0.72
185  A color      White               0.58
186  A color       Pink               0.55
187  A color      Brown               0.38
188  A color       Gray               0.21
189  A color     Violet               0.19
190  A color    Magenta               0.13
191  A color       Teal               0.11
192  A color     Indigo               0.10
193  A color        Tan               0.09
194  A color  Turquoise               0.09
195  A color     Silver               0.08
196  A color   Lavender               0.07
197  A color       Gold               0.06
198  A color      Beige               0.06
     Category   Response  Typicality_Scores
450  A flo

## Questions on LSA

1. LSA uses 300 dimensions for vector representation. Imagine you repeat this assignment but use only the first 50 dimensions of the vector representations. How do you think this would affect the results on the different tasks?
2. Do you think humans have a vector representation model for words? If not, why not? If so, how do humans learn these vector representations?
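For the first question, one way to explore the effect empirically is to slice each vector down to its leading dimensions and rerun the comparisons. The sketch below uses random stand-in vectors instead of the real LSA vectors, and `truncate` is a hypothetical helper name, so the printed similarities carry no linguistic meaning.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Random stand-ins for 300-d LSA vectors (seeded for repeatability).
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=300) for w in ("street", "road", "railway")}

def truncate(vectors, k):
    """Keep only the first k dimensions of every vector."""
    return {word: vec[:k] for word, vec in vectors.items()}

truncated = truncate(toy_vectors, 50)

sim_full = 1.0 - cosine(toy_vectors["street"], toy_vectors["road"])
sim_50 = 1.0 - cosine(truncated["street"], truncated["road"])

print(truncated["street"].shape)
print(f"300 dims: {sim_full:.3f}, first 50 dims: {sim_50:.3f}")
```

Applying `truncate` to the parsed word vectors and rerunning each task would give a direct comparison against the 300-dimensional results.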

You're done now! Submit the assignment to Canvas.

## References

Castro, N., Curley, T. M., & Hertzog, C. (2020, April 21). Category norms with a cross-sectional sample of adults in the United States: Consideration of cohort, age, and historical effects on semantic categories. https://doi.org/10.31234/osf.io/4gzn6

[Günther, F., Dudschig, C., & Kaup, B. (2015). LSAfun – An R package for computations based on Latent Semantic Analysis. Behavior Research Methods, 47, 930-944.](https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces)


