# Lab 1: Comparing Embeddings

In this lab, you'll explore word embeddings for the first time. You'll investigate word similarities, analogies, and vector operations using travel-related concepts. In a larger project, you would typically generate these embeddings yourself using a tool like Word2Vec. For simplicity, we've provided pre-trained vectors. In the next lab, you'll learn how to generate embeddings from your own dataset.

## Reading our Embeddings

In [None]:
with open('vectors.txt', 'r') as f:
    vectors = f.read()
    print(vectors)

## A Caveat about our Embeddings

Normally, word embeddings involve tens or hundreds of thousands of vectors, each with 50 to 300 dimensions or more. For this exercise, we've selected a small subset of travel-related embeddings with relatively low dimensionality (50). This setup is great for learning and prototyping. In real-world applications, however, you'd typically work with a much larger vocabulary and higher-dimensional vectors, which requires significantly more compute power and memory.
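
To get a rough sense of that difference in scale, the sketch below estimates the raw memory needed to hold an embedding table as 64-bit floats. The helper name `embedding_memory_mb` and both vocabulary sizes are illustrative assumptions, not measurements of this lab's file, and the estimate ignores the overhead of Python dictionaries and strings.

```python
def embedding_memory_mb(vocab_size, dimensions, bytes_per_value=8):
    """Estimate raw storage for an embedding table, in megabytes (10**6 bytes)."""
    return vocab_size * dimensions * bytes_per_value / 1_000_000

# Hypothetical sizes: a toy vocabulary versus a large pre-trained one.
small = embedding_memory_mb(100, 50)
large = embedding_memory_mb(100_000, 300)

print(f"small: {small} MB")   # small: 0.04 MB
print(f"large: {large} MB")   # large: 240.0 MB
```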

## Loading Embeddings

Let's load our embeddings into a Python dictionary. The key will be the word, and the value will be its corresponding embedding. For the values, we'll use a NumPy array, which allows us to efficiently perform vector math functions.

In [None]:
import numpy

# Function to load our simple vector file
def load_word_vectors(file_path):
    word_vectors = {}
    with open(file_path) as f:
        for line in f:
            values = line.strip().split()
            word = values[0]
            vector = numpy.array([float(val) for val in values[1:]])
            word_vectors[word] = vector
    return word_vectors

word_vectors = load_word_vectors('./vectors.txt')

print('Vector for "paris":')
print(word_vectors['paris'])

## Part 1: Defining Cosine Similarity

First, we need to define a cosine similarity function. Remember:

- a similarity of 1 means the vectors are identical in direction
- a similarity of 0 means they are orthogonal (neither similar nor opposite)
- a similarity of -1 means they are opposites

Cosine Similarity Formula:

`cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)`

Where:

`A · B` is the dot product of vectors A and B.

`||A||` and `||B||` are the magnitudes (Euclidean norms) of vectors A and B.
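
As a quick hand-worked check, take two made-up 2-D vectors, $A = (1, 0)$ and $B = (1, 1)$. These are for illustration only and do not come from `vectors.txt`.

$$
\text{cosine\_similarity}(A, B)
= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 0^2} \, \sqrt{1^2 + 1^2}}
= \frac{1}{\sqrt{2}} \approx 0.707
$$

The angle between these two vectors is 45°, so the score falls between 0 (orthogonal) and 1 (same direction).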

We've provided the expected cosine similarity for two pairs, so you'll know when you've done it correctly.

In [None]:
def cosine_similarity(vec1, vec2):

    # TODO

    pass


# Test our cosine_similarity
print('Similarity between vectors for "palace" and "temple":')
print(cosine_similarity(word_vectors['palace'], word_vectors['temple'])) # 0.7437538473269875
print()
print('Similarity between vectors for "dock" and "italy":')
print(cosine_similarity(word_vectors['dock'], word_vectors['italy']))    # 0.055356290110446675



## Part 2: Exploring Cosine Similarity

1. Create a few more tests with words from our set. Here's the list of words we have embeddings for:

'paris', 'france', 'rome', 'italy', 'tokyo', 'japan', 'barcelona', 'spain', 'london', 'england', 'berlin', 'germany', 'amsterdam', 'netherlands', 'vienna', 'austria', 'lisbon', 'portugal', 'athens', 'greece', 'hotel', 'hostel', 'resort', 'airbnb', 'motel', 'apartment', 'guesthouse', 'cabin', 'flight', 'train', 'bus', 'taxi', 'subway', 'ferry', 'cruise', 'bicycle', 'beach', 'mountain', 'desert', 'jungle', 'island', 'lake', 'river', 'forest', 'museum', 'restaurant', 'cafe', 'bar', 'market', 'mall', 'temple', 'church', 'mosque', 'castle', 'palace', 'airport', 'station', 'terminal', 'port', 'dock', 'passport', 'visa', 'ticket', 'boarding', 'customs', 'immigration', 'hiking', 'swimming', 'surfing', 'skiing', 'diving', 'camping', 'climbing', 'kayaking', 'sightseeing', 'shopping', 'dining', 'photography', 'touring', 'relaxing', 'backpack', 'suitcase', 'luggage', 'camera', 'map', 'guidebook', 'sunscreen', 'umbrella', 'wallet', 'charger', 'adapter', 'pillow'

2. Make note of any pairs that surprise you. What similarity score did you get vs what you expected? Can you explain the difference?

3. Using your intuition, try to find a pair that gives us a **negative** similarity score. Could you find one? Why do you think it's difficult to find a negative similarity score, given this set of words?


In [None]:
# TODO: Compare similarity scores for several pairs.

## Part 3: Define `find_nearest_word`

Write a function, `find_nearest_word`, that takes the following parameters:

- `target_vector`: A NumPy array representing the target vector.

- `word_vectors`: A dictionary where the keys are words (strings), and the values are the corresponding word vectors (NumPy arrays).

- `exclude` (optional): A list of words to exclude from the search. The default value should be an empty list.

The function should return a tuple: the word with the highest cosine similarity to the target vector (excluding any words in the `exclude` list), and its corresponding similarity score.

Write some tests that showcase your function and print out the results.

Hint: Use the `cosine_similarity` function that was defined in the previous code cell.
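
A note on the `exclude` default: Python evaluates a default value once, when the function is defined, so a list default is shared across calls. That is harmless as long as your function only reads `exclude`, but it can cause surprises if the function ever modifies it. The sketch below uses two hypothetical helpers, `remember_shared` and `remember_fresh` (neither is part of this lab), to show the difference. Using `None` as a sentinel is one common alternative.

```python
def remember_shared(word, seen=[]):
    # The same list object is reused on every call that omits `seen`.
    seen.append(word)
    return seen

def remember_fresh(word, seen=None):
    # A new list is created for each call that omits `seen`.
    if seen is None:
        seen = []
    seen.append(word)
    return seen

# Copy the shared results so each snapshot reflects the state at that call.
first_shared = list(remember_shared('paris'))
second_shared = list(remember_shared('rome'))
first_fresh = remember_fresh('paris')
second_fresh = remember_fresh('rome')

print(first_shared, second_shared)  # ['paris'] ['paris', 'rome']
print(first_fresh, second_fresh)    # ['paris'] ['rome']
```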

In [None]:
# Function to find the nearest word to a vector
def find_nearest_word(target_vector, word_vectors, exclude=[]):

    # TODO

    pass

print('--- Testing `find_nearest_word` ---\n')

# TODO: Tests showing functionality of `find_nearest_word`


## Part 4: Define `find_nearest_words`

Finding the single nearest word is useful, but it would be more informative to see several of the closest words. Write a function, `find_nearest_words`, that does just that.

Parameters:

- `target_vector`: A NumPy array representing the target vector.

- `word_vectors`: A dictionary where the keys are words (strings), and the values are the corresponding word vectors (NumPy arrays).

- `exclude` (optional): A list of words to exclude from the search. The default value should be an empty list.

- `top_n` (optional): An integer representing the number of nearest words we'd like to collect. The default value should be 3.

The function should return a list of `(word, similarity)` tuples for the `top_n` words with the highest cosine similarity to the target vector (excluding any words in the `exclude` list). The list should be sorted from highest similarity score to lowest.

Write some tests that showcase your function and print out the results.
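
To make the expected return shape concrete, the sketch below ranks a handful of made-up scores and keeps the top 3 after dropping excluded words. The `toy_scores` values are invented for illustration and are not real similarities from `vectors.txt`; your function would compute the scores with `cosine_similarity` instead.

```python
# Made-up similarity scores, standing in for real cosine similarities.
toy_scores = {'hotel': 0.81, 'hostel': 0.92, 'beach': 0.15, 'motel': 0.77, 'cabin': 0.40}
exclude = ['hostel']
top_n = 3

# Drop excluded words, then sort (word, score) pairs from highest score to lowest.
candidates = [(word, score) for word, score in toy_scores.items() if word not in exclude]
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
nearest = ranked[:top_n]

print(nearest)  # [('hotel', 0.81), ('motel', 0.77), ('cabin', 0.4)]
```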

In [None]:
def find_nearest_words(target_vector, word_vectors, exclude=[], top_n=3):

    # TODO

    pass


print('--- Testing `find_nearest_words` ---\n')

# TODO: Demonstrate `find_nearest_words`


## Part 5: Define `find_farthest_words`

For fun, let's write a function to see which words are farthest from a given word. You don't need to worry about keeping your code DRY; our goal is to explore our embeddings.

In [None]:
def find_farthest_words(target_vector, word_vectors, exclude=[], last_n=3):

    # TODO

    pass


print('--- Testing `find_farthest_words` ---\n')

# TODO: Demonstrate `find_farthest_words`


## Part 6: Vector Math

Remember this example that demonstrated relations between vectors?

king - man + woman = queen

The idea is that if we take the vector for "king", subtract the vector for "man", and add the vector for "woman", we'll end up close to the vector for "queen".

Let's try that out with our sample set. We don't have kings and queens in our travel-related words, so let's try with countries and capitals.

Let's see if "paris" - "france" + "portugal" = "lisbon".

Write some code to calculate the vector resulting from the above calculation, and then print out the 3 nearest words to that vector.

Afterwards, come up with your own analogies and see if you get the expected results.
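
The arithmetic itself is ordinary element-wise addition and subtraction on NumPy arrays. The sketch below uses made-up 2-D vectors (the `toy` dictionary is an assumption for illustration, not the lab's 50-dimensional embeddings) in which each capital sits at the same offset from its country, so the analogy works out exactly. Real embeddings only approximate this.

```python
import numpy

# Made-up vectors: each capital is its country shifted by [0, 1].
toy = {
    'france': numpy.array([1.0, 0.0]),
    'paris': numpy.array([1.0, 1.0]),
    'portugal': numpy.array([3.0, 0.0]),
    'lisbon': numpy.array([3.0, 1.0]),
}

# paris - france isolates the "capital of" offset; adding portugal applies it.
result = toy['paris'] - toy['france'] + toy['portugal']

print(result)                                    # [3. 1.]
print(numpy.array_equal(result, toy['lisbon']))  # True
```

When you search for the nearest words to a result like this, consider passing the input words through `exclude`, since the result often stays close to one of the vectors it was built from.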

In [None]:
# Paris - France + Portugal = Lisbon

# TODO: Find 3 nearest words to "Paris - France + Portugal"


# TODO: Try to demonstrate at least one more analogy

