# Exploring Embeddings using spaCy

<div align="left">
  <a href="https://colab.research.google.com/github/simonguest/dp-applied-genai/blob/main/src/01/embeddings_using_spacy.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>
</div>

## What are Vector Embeddings?

Vector embeddings are numerical representations that map complex data (such as words, sentences, or images) into a continuous, high-dimensional vector space. The mapping is built so that distance and direction in that space reflect semantic relationships in the data. For instance, in natural language processing (NLP), word embeddings place semantically similar words closer together in the vector space, which supports tasks like sentiment analysis, machine translation, and information retrieval.
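
To make "closer together" concrete, here is a minimal sketch that ranks words by cosine similarity. The three-dimensional vectors and the helper names (`cosine_similarity`, `most_similar`) are invented for illustration; real embeddings have hundreds of dimensions and are learned from data, not written by hand.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

def most_similar(word, embeddings):
    """Return the other word whose vector has the highest cosine similarity."""
    target = embeddings[word]
    candidates = {w: v for w, v in embeddings.items() if w != word}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

print(most_similar("cat", toy_embeddings))  # kitten
```

In this toy space "cat" and "kitten" point in almost the same direction, while "car" points elsewhere, so the nearest neighbour of "cat" is "kitten".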

In this notebook, we'll explore embeddings with spaCy, a popular open-source Python library for NLP.

## spaCy Dependencies

Run the following cell to download the medium-sized English pipeline, which is optimized for CPU and ships with the word vectors used below. You can find more information here: https://spacy.io/models/en#en_core_web_md

In [1]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## Generate Embeddings

Run the following code to generate the embeddings for a simple sentence. Try with other words/sentences/paragraphs. What do you notice?

In [None]:
import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_md")

# Process the word/sentence
doc = nlp("The cat sat on the mat.")

# Display the embeddings
print(doc.vector)

[-0.72483    0.42538    0.025489  -0.39807    0.037463  -0.29811
 -0.28279    0.29333    0.57775    1.2205    -0.27903    0.80879
 -0.71291    0.045808  -0.46751    0.55944    0.42745    0.58238
  0.20854   -0.42718   -0.40284   -0.048941   0.1149    -0.6963
 -0.03338    0.052596  -0.22572   -0.35996    0.47961   -0.38386
 -0.73837    0.1718     0.52188    0.45584   -0.026621   0.48831
  0.67996   -0.73345   -0.27078    0.41739    0.1947     0.27389
 -0.70931   -0.45317   -0.22574   -0.12617    0.03268    0.142
  0.53923   -0.61285   -0.5322     0.19479    0.13889   -0.020284
  0.088162   0.85337    0.039407   0.11529   -0.42646    0.74832
  0.34421   -0.59462    0.0040537  0.027203  -0.063394   0.26538
  0.34757    0.21395   -0.39799   -0.027067  -0.36132    0.31979
  0.55813   -0.5652     0.55382    0.03928   -0.26933   -0.14705
  0.74032   -0.50566    0.023765   0.62273   -0.79388   -0.25165
  0.11992   -0.43056    1.0614     0.58571    0.8856    -0.056054
  0.055826   0.30485    0.
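
The printed array is a single vector for the whole sentence. By default, spaCy computes `doc.vector` as the average of the document's token vectors. The sketch below reproduces that averaging step in plain Python; the four-dimensional token vectors and the helper name `mean_vector` are hypothetical and stand in for the model's real word vectors.

```python
def mean_vector(token_vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(token_vectors)
    return [sum(components) / count for components in zip(*token_vectors)]

# Hypothetical 4-dimensional token vectors, invented for illustration only.
token_vectors = [
    [1.0, 0.0, 2.0, -1.0],  # stands in for "the"
    [3.0, 2.0, 0.0, 1.0],   # stands in for "cat"
    [2.0, 4.0, 1.0, 0.0],   # stands in for "sat"
]

doc_vector = mean_vector(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0, 0.0]
```

Because the result is an average, word order does not affect it: shuffling the tokens of a sentence leaves the document vector unchanged.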

## Investigate Similarity

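We can use spaCy's `similarity` method to score how similar two sentences are. By default, it returns the cosine similarity of the two documents' vectors, so scores close to 1 mean the vectors point in nearly the same direction.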

In [4]:
import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_md")

# Process the sentences
doc1 = nlp("The cat sat on the mat.")
doc2 = nlp("A feline rested on a rug.")

# Compute similarity
similarity_score = doc1.similarity(doc2)

print(f"Similarity score: {similarity_score}")

Similarity score: 0.7523308992385864


## Syntactic or Semantic Similarity?

What do you notice about the similarity scores? Are they syntactic or semantic? Re-run the previous cell with different sentences to find out.
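
One way to approach the question is to compare the spaCy score with a purely surface-level measure. The sketch below computes the Jaccard overlap of the words in two sentences. This is not something spaCy does; it is a baseline added here for contrast, and the helper names `token_set` and `jaccard_overlap` are hypothetical.

```python
import re

def token_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard_overlap(text_a, text_b):
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    a, b = token_set(text_a), token_set(text_b)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)

overlap = jaccard_overlap("The cat sat on the mat.", "A feline rested on a rug.")
print(f"Word overlap: {overlap:.3f}")  # Word overlap: 0.111
```

The two example sentences share only the word "on", yet the similarity score reported above is roughly 0.75. Try pairs of sentences that reuse the same words with a different meaning and see how the two measures diverge.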