# Exploring Embeddings using spaCy

<div align="left">
  <a href="https://colab.research.google.com/github/simonguest/dp-applied-genai/blob/main/src/01/embeddings_using_spacy.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>
</div>

## What are Vector Embeddings?

Vector embeddings are numerical representations that map complex data—such as words, sentences, or images—into a continuous, high-dimensional vector space. This transformation enables machines to capture and process semantic relationships and contextual meanings inherent in the data. For instance, in natural language processing (NLP), word embeddings position semantically similar words closer together in the vector space, facilitating tasks like sentiment analysis, machine translation, and information retrieval.

In this notebook, we'll use **spaCy**, a popular Python library for NLP whose medium and large English pipelines include pre-trained word vectors. These are static embeddings in the same family as word2vec and GloVe: each word maps to one fixed vector learned from large text corpora, so words that appear in similar contexts end up with similar vectors.

## spaCy Dependencies

Run the following cell to download the English pipeline (medium) optimized for CPU. This model includes:
- **Word vectors**: 300-dimensional embeddings for several hundred thousand vocabulary entries (the exact count varies by model version)
- **Part-of-speech tagging**: Grammatical information
- **Named entity recognition**: Identifying people, places, organizations, etc.
- **Dependency parsing**: Understanding sentence structure

You can find more information here: https://spacy.io/models/en#en_core_web_md

In [None]:
!python -m spacy download en_core_web_md
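If the notebook is re-run often, it may be convenient to skip the download when the model is already present. spaCy pipelines install as regular Python packages, so one possible check uses the standard library's `importlib`. This is a minimal sketch; the helper name `is_model_installed` is hypothetical and not part of spaCy.

```python
import importlib.util


def is_model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


model = "en_core_web_md"
if is_model_installed(model):
    print(f"{model} is already installed")
else:
    print(f"{model} not found; run: python -m spacy download {model}")
```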

## Generate Embeddings

Let's start by generating embeddings for a simple sentence. spaCy automatically creates document-level embeddings by averaging the word vectors in the text.

**Try this**: Run the code below, then try with other words, sentences, or paragraphs. What do you notice about the embedding dimensions and values?

In [None]:
import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_md")

# Process the word/sentence
doc = nlp("The cat sat on the mat.")

# Display information about the embeddings
print(f"Text: '{doc.text}'")
print(f"Embedding dimensions: {len(doc.vector)}")
print(f"First 10 values: {doc.vector[:10]}")
print(f"\nFull embedding: {doc.vector}")

## Understanding Word vs Document Embeddings

spaCy provides embeddings at different levels:
- **Token (word) level**: Each individual word has its own embedding
- **Document level**: The entire text gets a single embedding (average of word embeddings)

Let's explore both:

In [None]:
# Process a sentence
doc = nlp("The cat sat on the mat.")

print("Word-level embeddings:")
for token in doc:
    if token.has_vector:  # Check if the word has an embedding
        print(f"'{token.text}': {len(token.vector)} dimensions, first 5 values: {token.vector[:5]}")
    else:
        print(f"'{token.text}': No embedding available")

print(f"\nDocument-level embedding: {len(doc.vector)} dimensions")
print(f"Document vector (first 10): {doc.vector[:10]}")
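The relationship between the two levels can be checked by hand. The sketch below averages a few toy vectors element by element, which is the same operation that produces a document vector from token vectors. The numbers are made up for illustration, and `mean_vector` is a hypothetical helper rather than a spaCy function.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy 3-dimensional "token vectors" (made-up numbers, not real spaCy output)
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
]

doc_vector = mean_vector(token_vectors)
print(doc_vector)  # [2.0, 2.0, 2.0]
```

With the notebook's `doc`, the equivalent comparison would be `mean_vector([t.vector for t in doc])` against `doc.vector`; small floating-point differences are to be expected.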

## Investigate Similarity

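We can use spaCy's built-in `similarity` method to generate a similarity score between two pieces of text. By default, this method calculates the cosine similarity between the two embeddings.

**Cosine similarity ranges from -1 to 1, although scores for ordinary text with these vectors usually fall between 0 and 1:**
- 1 = the vectors point in the same direction (the texts are treated as equivalent)
- 0 = no measurable relationship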
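For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so mathematically the value can fall anywhere between -1 and 1. The sketch below is a plain-Python version for illustration, not spaCy's implementation. Returning 0.0 for an all-zero vector is an assumption made here to avoid dividing by zero.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat "no vector" as "no similarity"
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal, 0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction, close to -1
```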

In [None]:
import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_md")

# Process the sentences
doc1 = nlp("The cat sat on the mat.")
doc2 = nlp("A feline rested on a rug.")

# Compute similarity
similarity_score = doc1.similarity(doc2)

print(f"Text 1: '{doc1.text}'")
print(f"Text 2: '{doc2.text}'")
print(f"Similarity score: {similarity_score:.4f}")

## Syntactic vs Semantic Similarity

What do you notice about the similarity score? Is it capturing syntactic similarity (word order and sentence structure) or semantic similarity (meaning)?

**Experiment**: Re-run the previous cell with different sentence pairs to find out. Try these examples:

1. **Same words, different order**: 
   - "The dog chased the cat" vs "The cat chased the dog"
   
2. **Synonyms**:
   - "The car is fast" vs "The automobile is quick"
   
3. **Different topics**:
   - "I love programming" vs "The weather is nice"

**Questions to consider**:
- Do sentences with similar meanings but different words get high similarity scores?
- Do sentences with the same words but different meanings get high scores?
- What does this tell you about what the embeddings are capturing?
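One property worth keeping in mind for the first experiment: an averaged embedding adds up the word vectors before dividing, so it does not depend on word order. The toy sketch below illustrates this with made-up 2-dimensional vectors; `toy_vectors` and `average_embedding` are illustrative names, not spaCy APIs.

```python
from typing import Dict, List

# Made-up 2-dimensional vectors for illustration only (not real spaCy values)
toy_vectors: Dict[str, List[float]] = {
    "the": [0.0, 1.0],
    "dog": [4.0, 2.0],
    "cat": [3.0, 5.0],
    "chased": [1.0, 0.0],
}


def average_embedding(sentence: str) -> List[float]:
    """Average the toy vectors of the lowercased, whitespace-split words."""
    vectors = [toy_vectors[word] for word in sentence.lower().split()]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


a = average_embedding("The dog chased the cat")
b = average_embedding("The cat chased the dog")
print(a, b, a == b)  # [1.6, 1.8] [1.6, 1.8] True
```

Two sentences built from the same words therefore get the same averaged vector, so a very high score for such a pair reflects shared vocabulary rather than shared meaning.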

## Comparing Multiple Texts

Let's create a more comprehensive comparison by looking at multiple texts and their pairwise similarities:

In [None]:
# Define multiple texts to compare
texts = [
    "The cat sat on the mat.",
    "A feline rested on a rug.",
    "Dogs are loyal pets.",
    "The weather is sunny today.",
    "Programming is fun and challenging."
]

# Process all texts
docs = [nlp(text) for text in texts]

# Create similarity matrix
print("Pairwise Similarity Scores:")
print("=" * 50)

for i, doc1 in enumerate(docs):
    for j, doc2 in enumerate(docs):
        if i < j:  # Only show upper triangle to avoid duplicates
            similarity = doc1.similarity(doc2)
            print(f"Text {i+1} vs Text {j+1}: {similarity:.4f}")
            print(f"  '{doc1.text}' vs '{doc2.text}'")
            print()

print("\nText Reference:")
for i, text in enumerate(texts):
    print(f"Text {i+1}: {text}")
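The upper-triangle loop above visits each unordered pair once, which is what `itertools.combinations` does. As a possible variation, the sketch below scores every pair and sorts the results so the most similar pairs come first. `rank_pairs` and the toy data are hypothetical; the similarity function is passed in so the same helper could work with spaCy documents.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rank_pairs(
    items: Sequence[T], similarity: Callable[[T, T], float]
) -> List[Tuple[float, int, int]]:
    """Score every unordered pair once and sort from most to least similar."""
    scored = [
        (similarity(items[i], items[j]), i, j)
        for i, j in combinations(range(len(items)), 2)
    ]
    return sorted(scored, key=lambda entry: entry[0], reverse=True)


# Toy stand-ins for documents: unit-length 2-D vectors (made-up values)
toy_items = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]


def dot(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return a[0] * b[0] + a[1] * b[1]


for score, i, j in rank_pairs(toy_items, dot):
    print(f"Item {i + 1} vs Item {j + 1}: {score:.4f}")
```

With the notebook's `docs` list, the call would be `rank_pairs(docs, lambda a, b: a.similarity(b))`.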

## Word-Level Similarity Analysis

Let's also explore similarity at the word level to understand how individual words relate to each other:

In [None]:
# Words to compare
words = ["cat", "feline", "dog", "canine", "car", "automobile", "happy", "joyful", "sad"]

# Process each word once and look the results up by word
word_docs = {word: nlp(word) for word in words}

print("Word Similarity Examples:")
print("=" * 30)

# Compare some interesting word pairs
pairs = [
    ("cat", "feline"),
    ("dog", "canine"),
    ("car", "automobile"),
    ("happy", "joyful"),
    ("happy", "sad"),
    ("cat", "dog"),
    ("cat", "car")
]

for word1, word2 in pairs:
    # Reuse the processed docs instead of running the pipeline again
    similarity = word_docs[word1].similarity(word_docs[word2])
    print(f"'{word1}' vs '{word2}': {similarity:.4f}")

## Key Takeaways

From this exploration with spaCy embeddings, you should understand:

1. **spaCy's `en_core_web_md` pipeline provides 300-dimensional word embeddings** trained on large text corpora
2. **Document embeddings are averages** of the individual word embeddings
3. **Similarity scores capture semantic relationships** between words and texts
4. **Embeddings work at multiple levels** - individual words, sentences, and documents
5. **Pre-trained models** like spaCy's make it easy to get started with embeddings

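**Comparison with other approaches**:
- The static word vectors used here come from an older family of techniques (word2vec, GloVe) but are fast and work well for many tasks
- Modern transformer-based models (like OpenAI's embeddings) often provide better semantic understanding because they produce contextual embeddings that take word order and surrounding words into account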
- The choice depends on your specific use case, computational resources, and accuracy requirements

**Next steps**: Try experimenting with different types of text (technical documents, creative writing, news articles) to see how well the embeddings capture domain-specific relationships!