# Document Embedding

### 1. Embed

First, we create a few sample documents to embed.

In [1]:
documents = [
    "The quick brown fox jumps over the lazy dog",
    "My dog is quick and can jump over fences",
    "I love my dog",
    "The dog is lazy but the fox is quick",
    "Uniqueness can help us find vectors, I hope this works!"
]

Next, we use scikit-learn's `TfidfVectorizer` to turn the documents into TF-IDF vectors.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
embeddings = X.toarray()

### 2. Upload

After creating the embeddings, we publish them to a Pinecone index.

In [3]:
import pinecone

# Load environment variables
from dotenv import load_dotenv
import os
load_dotenv()
pinecone_key = os.getenv("PINECONE_KEY")

from tqdm.autonotebook import tqdm


In [4]:
# Determine dimensionality of embeddings (the index must have been created with this dimension)
dimension = embeddings.shape[1]

# Get index name
index_name = "document-embeddings"

In [5]:
# Initialize pinecone session (pinecone-client v2 API; `pinecone.init` was removed in v3)
pinecone.init(api_key=pinecone_key, environment='gcp-starter')

In [6]:
# Set index to be written to and read from (the index must already exist)
index = pinecone.Index(index_name)

In [7]:
# Prepare data for uploading to pinecone
vectors = [
    {'id': str(i), 'values': [float(value) for value in row]}
    for i, row in enumerate(embeddings)
]
print(f"Number of vectors: {len(vectors)}")

Number of vectors: 5


In [8]:
# Upload embeddings to pinecone
index.upsert(vectors=vectors)

{'upserted_count': 5}
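
For five vectors a single `upsert` call is enough. For larger collections, the records are usually sent in batches instead. The following is a minimal sketch of that idea: the helper name `chunked`, the dummy records, and the default batch size of 100 are illustrative assumptions and are not part of the original notebook.

```python
from typing import Dict, Iterator, List


def chunked(items: List[Dict], batch_size: int = 100) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in records shaped like the `vectors` list above (the values are dummies)
demo_vectors = [{'id': str(i), 'values': [0.0, 1.0]} for i in range(5)]

batches = list(chunked(demo_vectors, batch_size=2))
for batch in batches:
    # With a live index, each batch would be sent with: index.upsert(vectors=batch)
    print([record['id'] for record in batch])
```

Each batch keeps the same `{'id': ..., 'values': ...}` record shape, so the upsert call itself does not change.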

Now we have uploaded the original embedded documents to the index. Below, we will add a few more.

In [9]:
import numpy as np

# New documents
new_documents = [
    "New document text here",
    "Another new document text here",
    "Uniqueness is very important, I will use it to find this vector!"
    # ... more documents
]

# Vectorize the new documents
new_X = vectorizer.transform(new_documents)  # Note: use transform, not fit_transform, to keep the same vocabulary

# Convert to dense array
new_embeddings = new_X.toarray()
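
One consequence of reusing the fitted vocabulary is that words the vectorizer has never seen are silently ignored. The sketch below, which repeats the fit so that it runs on its own, checks that the new embeddings have the same number of columns as the original ones and lists any rows that came out entirely zero. Cosine similarity is undefined for a zero vector, so such documents cannot be ranked meaningfully.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The quick brown fox jumps over the lazy dog",
    "My dog is quick and can jump over fences",
    "I love my dog",
    "The dog is lazy but the fox is quick",
    "Uniqueness can help us find vectors, I hope this works!"
]
new_documents = [
    "New document text here",
    "Another new document text here",
    "Uniqueness is very important, I will use it to find this vector!"
]

vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(documents).toarray()
new_embeddings = vectorizer.transform(new_documents).toarray()

# transform() keeps the column layout learned by fit_transform()
print(embeddings.shape[1], new_embeddings.shape[1])

# Rows that are entirely zero: no token matched the fitted vocabulary
zero_rows = np.flatnonzero(~new_embeddings.any(axis=1))
print(zero_rows.tolist())  # → [0, 1]
```

The first two new documents share no word with the original five, so only the third one carries any signal.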

In [10]:
# Connect to the Existing Pinecone Index
index = pinecone.Index(index_name)

In [11]:
# Prepare the new data for uploading to Pinecone
start_id = len(embeddings)
new_vectors = [
    {'id': str(start_id + i), 'values': [float(value) for value in row]}
    for i, row in enumerate(new_embeddings)
]

In [12]:
upsert_response = index.upsert(vectors=new_vectors)

### 3. Compare and Query

We have now uploaded more vectors to the index. Next, we retrieve the ones most similar to the sentence written below.

In [13]:
# Vectorize new sentence
new_sentence = "I want to find the unique vector, it is important!"
new_vector = vectorizer.transform([new_sentence]).toarray()[0]  # Convert to dense array and get the first (and only) vector

In [14]:
# Convert the new vector to a list of standard Python float values
new_vector_list = [float(value) for value in new_vector]

# Query Pinecone for the most similar vector(s)
query_response = index.query(
    top_k=2,
    vector=new_vector_list,
    include_metadata=True,
    include_values=False
)

In [15]:
query_response['matches']

[{'id': '3', 'score': 0.593876541, 'values': []},
 {'id': '7', 'score': 0.567467749, 'values': []}]

We can see above that the sentence I wanted to find (id 7, the one containing "Uniqueness") comes back with the second-highest cosine similarity, 0.5675, just behind id 3 at 0.5939.

My query sentence: "I want to find the unique vector, it is important!"

The sentences returned:
1. The dog is lazy but the fox is quick (id 3, score 0.5939)
2. Uniqueness is very important, I will use it to find this vector! (id 7, score 0.5675)
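
The ranking can be cross-checked locally without Pinecone. The sketch below rebuilds all eight vectors, lists which words of the query survive vectorization, and computes cosine similarity with NumPy. It assumes the index uses the cosine metric. Exact scores returned by Pinecone may differ slightly from the local values, so only the ordering is compared here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The quick brown fox jumps over the lazy dog",
    "My dog is quick and can jump over fences",
    "I love my dog",
    "The dog is lazy but the fox is quick",
    "Uniqueness can help us find vectors, I hope this works!"
]
new_documents = [
    "New document text here",
    "Another new document text here",
    "Uniqueness is very important, I will use it to find this vector!"
]
new_sentence = "I want to find the unique vector, it is important!"

vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(documents).toarray()
new_embeddings = vectorizer.transform(new_documents).toarray()
all_embeddings = np.vstack([embeddings, new_embeddings])  # row i corresponds to id str(i)

query = vectorizer.transform([new_sentence]).toarray()[0]

# Query tokens that exist in the fitted vocabulary; all others are ignored
analyzer = vectorizer.build_analyzer()
known_terms = sorted({t for t in analyzer(new_sentence) if t in vectorizer.vocabulary_})
print(known_terms)  # → ['find', 'is', 'the']

# Cosine similarity against every stored vector; all-zero rows get a score of 0
norms = np.linalg.norm(all_embeddings, axis=1) * np.linalg.norm(query)
scores = np.divide(
    all_embeddings @ query,
    norms,
    out=np.zeros(len(all_embeddings)),
    where=norms > 0,
)

top_ids = [str(i) for i in np.argsort(-scores)[:2]]
print(top_ids)  # → ['3', '7']
```

Only "find", "is", and "the" from the query are in the vocabulary. "unique" does not match "uniqueness", and "vector" does not match "vectors", because TF-IDF compares exact tokens. That is why the sentence about the dog and the fox, which contains "the" and "is" twice each, edges out the sentence we were actually looking for.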