# Document Embedding

### 1. Embed

First, we will create sample documents to be embedded using the model

In [11]:
import os

folder_path = "../06_Data/"
documents = []

for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        with open(os.path.join(folder_path, filename), 'r') as f:
            content = f.read()
            documents.append(content)

In [12]:
documents[0]

'Apple announces the release of the iPad, a new device that is a combination of a laptop and a smartphone. The iPad is thinner and lighter than a laptop and has a touchscreen display. It can be used for browsing the web, email, viewing photos and videos, listening to music, and reading ebooks. The iPad also has a new app called iBooks, which allows users to purchase and read ebooks. The iPad will be available in both Wi-Fi and 3G models, with prices starting at $499. Accessories such as a dock, keyboard dock, and case are also available. The iPad will be available for purchase in 60 days.'

Next, we will use OpenAI to vectorize the documents

In [16]:
import openai

# Load environment variables
from dotenv import load_dotenv
import os
load_dotenv()
pinecone_key = os.getenv("PINECONE_KEY")
openai.api_key = os.getenv("OPENAI_KEY")

In [19]:
def get_embedding(text, model="text-embedding-ada-002"):
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

In [24]:
print("Testing API")
try:
    print(f"{len(get_embedding(documents[0]))} dims returned!")
except:
    print("API Call Failed")

Testing API
1536 dims returned!


In [27]:
# Get embeddings
embeddings = [get_embedding(doc) for doc in documents]

### 2. Upload

After creating the embeddings, we will publish these to a pinecone index

In [34]:
import pinecone

# Set index name
index_name = "document-embeddings"

# Initialize pinecone session
pinecone.init(api_key=pinecone_key, environment='gcp-starter')
index = pinecone.Index(index_name)

In [29]:
# Prepare data for uploading to pinecone
vectors = [{'id': str(i), 'values': [float(value) for value in embeddings[i]]} for i in range(len(embeddings))]
print(f"Number of vectors: {len(vectors)}")

Number of vectors: 1


In [32]:
# Upload embeddings to pinecone
index.upsert(vectors=vectors)

{'upserted_count': 1}

Now we have uploaded the origional embedded documents to the index. Below, we will add a few more.

In [9]:
import numpy as np

# New documents
new_documents = [
    "New document text here",
    "Another new document text here",
    "Uniqueness is very important, I will use it to find this vector!"
    # ... more documents
]

# Vectorize the new documents
new_X = vectorizer.transform(new_documents)  # Note: use transform, not fit_transform, to keep the same vocabulary

# Convert to dense array
new_embeddings = new_X.toarray()

In [10]:
# Connect to the Existing Pinecone Index
index = pinecone.Index(index_name)

In [11]:
# Prepare the new data for uploading to Pinecone
start_id = len(embeddings)
new_vectors = [{'id': str(start_id + i), 'values': [float(value) for value in new_embeddings[i]]} for i in range(len(new_embeddings))]

In [12]:
upsert_response = index.upsert(vectors=new_vectors)

### 3. Compare and Query

We have uploaded even more vectors to the index. Now, we need to retrieve some of them that are similar to a sentence I will write below.

In [51]:
# Vectorize new sentence
new_sentence = "I am going to buy a new iphone!"
new_vector = get_embedding(new_sentence)

In [52]:
# Convert the new vector to a list of standard Python float values
new_vector_list = [float(value) for value in new_vector]

# Query Pinecone for the most similar vector(s)
query_response = index.query(
    top_k=2,
    vector=new_vector_list,
    include_metadata=True,
    include_values=False
)

In [53]:
query_response['matches']

[{'id': '0', 'score': 0.802503169, 'values': []}]

We can see above that the cosine similarity for the sentence I wrote and the one I wanted to find (the sentence with unique in it) is almost the highest at 0.7072.

My query sentence: "I want to find the unique vector, it is important!"

The sentences returned:
1. The dog is lazy but the fox is quick
2. Uniqueness is very important, I will use it to find this vector!

In [134]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI

vector_db = Pinecone.from_existing_index(index_name=index_name, embedding=OpenAIEmbeddings(openai_api_key=openai.api_key))

In [136]:
# the vector_db is often called an "index" in the documentation, then you use a "retriever interface" to access said index 
vector_db_retriever = vector_db.as_retriever()

In [147]:
# Define answer template
demo_template = """Use the following pieces of context to answer the question at the end.

Pay attention to the tone of the question and use it to determine the technical familiarity of the user with the product, and then adjust your answer accordingly.

If you do not know the answer, advise the user to seek help via king wencheng support.

{context}

Question: {question}

Answer tailored to the technical familiarity of the user:"""

In [148]:
# Variable for prompt template
prompt_for_chain = PromptTemplate(template = demo_template, input_variables = ["context", "question"])

In [149]:
# choose an OpenAI LLM
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", openai_api_key=openai.api_key)

In [150]:
# the chatbot building part... it feels too simple, but I think this will be enough to get it to work
# chain type can be anything, "stuff" seems to be the most basic
# we use the retriever from earlier
# chain argument will just be our prompt template from earlier
assistant = RetrievalQA.from_chain_type(llm = llm,
                                        retriever = vector_db_retriever,
                                        chain_type = "stuff",
                                        chain_type_kwargs = {"prompt": prompt_for_chain})

In [151]:
# actually giving the prompt, run this to talk to the assistant
assistant.run("My IPad is not turning on, help!")

"I'm sorry to hear that your iPad is not turning on. Here are a few troubleshooting steps you can try:\n\n1. Make sure your iPad is charged. Connect it to a power source using the charging cable and let it charge for at least 15 minutes. Then try turning it on again.\n\n2. If your iPad is still not turning on, try a hard reset. Press and hold both the Home button and the Power button at the same time for about 10 seconds until you see the Apple logo. Then release the buttons and wait for your iPad to restart.\n\n3. If the above steps don't work, it's possible that there may be a hardware issue with your iPad. I recommend contacting Apple Support or visiting an Apple Store for further assistance. They will be able to help you diagnose and fix the problem.\n\nIf you need further assistance, I recommend reaching out to King Wencheng support for additional help."