# Indexing: Inspecting and Managing Documents in a Vectorstore

In [None]:
# Run the line of code below to check the version of langchain in the current environment.
# Substitute "langchain" with any other package name to check its version.

In [None]:
pip show langchain

In [1]:
%load_ext dotenv
%dotenv

In [2]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [3]:
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')

In [4]:
vectorstore_from_directory = Chroma(persist_directory = "./intro-to-ds-lectures", 
                                    embedding_function = embedding)

In [5]:
vectorstore_from_directory.get()

{'ids': ['291e42e5-cf7a-450b-9027-188ba4df300e',
  '55463110-2515-4189-beb5-a33087cd8933',
  '63bdd151-0c54-4689-be90-1c1273948386',
  '1607d9e9-e656-4c1f-bb97-3fd91431bfdf',
  '29927a1c-2359-49a1-9fe4-b751601d68a2',
  'cf329c08-c5f3-4c8f-9bd4-7d02a32e56ab',
  '911fbe77-61c2-4e32-8771-fbc3250c1ed4',
  'b3a8352a-49af-4a80-99f0-33b1ad909cb5',
  '6c438000-b120-4c1e-b14c-08cc32454412',
  '12b51d79-c204-4a36-a5d7-429bac3aa9d0',
  '9a15f8a4-3215-4a09-bb88-ad17c404b8fd',
  '574ac2c5-6074-4653-8f9e-d57d3a094279',
  '156281da-cc43-4b05-be9f-3cb1222ba050',
  'da7c555e-4884-4339-ab60-121cfe014f0b',
  'b68fcbde-6873-49d3-94ce-22c5cf3e70d1',
  'e71f1a46-5739-496f-b919-a5092c14cf6a',
  '8780b9f3-b125-4381-b674-4b21c2557942',
  '14199141-ad22-4798-9395-41ebeb91a560',
  '14c4532a-9461-4d22-84c8-837052af4559',
  'ea2f5bd1-2e76-4e5a-a3c1-758bbf7edf78',
  '81fbfec2-562e-4136-b71b-a71e7c33bccc',
  'cd0cfa52-1377-4358-993b-45561d9a8b05',
  '949a66a4-539c-4e74-b1a8-2cc575045537',
  '96bf0b57-e22f-4893-95ff-

In [6]:
vectorstore_from_directory.get(ids = "00829d7d-4ad7-4f83-93c2-1c606f3c14bf", 
                               include = ["embeddings"])

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': None,
 'uris': None,
 'data': None,
 'metadatas': None,
 'included': [<IncludeEnum.embeddings: 'embeddings'>]}

In [7]:
added_document = Document(page_content='Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both. So, let’s clear this up, shall we? First, we will start with analysis', 
                          metadata={'Course Title': 'Introduction to Data and Data Science', 
                                    'Lecture Title': 'Analysis vs Analytics'})

In [8]:
vectorstore_from_directory.add_documents([added_document])

['2175823a-9a5e-4efa-8e7d-3a2993d2efbc']

In [9]:
vectorstore_from_directory.get("55409552-1943-4892-949a-3b475ff9c840")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [10]:
updated_document = Document(page_content='Great! We hope we gave you a good idea about the level of applicability of the most frequently used programming and software tools in the field of data science. Thank you for watching!', 
                            metadata={'Course Title': 'Introduction to Data and Data Science', 
                                     'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'})

In [11]:
vectorstore_from_directory.update_document(document_id = "55409552-1943-4892-949a-3b475ff9c840", 
                                           document = updated_document)

Update of nonexisting embedding ID: 55409552-1943-4892-949a-3b475ff9c840

In [12]:
vectorstore_from_directory.get("55409552-1943-4892-949a-3b475ff9c840")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [13]:
vectorstore_from_directory.delete("55409552-1943-4892-949a-3b475ff9c840")

Delete of nonexisting embedding ID: 55409552-1943-4892-949a-3b475ff9c840

In [14]:
vectorstore_from_directory.get("55409552-1943-4892-949a-3b475ff9c840")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}
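The update and delete calls above only log "nonexisting embedding ID" and leave the store unchanged, because the ID passed in is not present in the collection. A minimal sketch of a guard is shown below. It assumes only that the store exposes a Chroma-style `get(ids=...)` returning a dict with an `'ids'` list, as the outputs above show. The names `id_exists` and `FakeStore` are hypothetical and exist only so the example runs without an API key.

```python
from typing import Any


def id_exists(store: Any, document_id: str) -> bool:
    """Return True if the store holds a record with this ID."""
    # A Chroma-style get() returns an empty 'ids' list for unknown IDs.
    return len(store.get(ids=[document_id])["ids"]) > 0


class FakeStore:
    """Hypothetical in-memory stand-in used only to exercise id_exists."""

    def __init__(self, records: dict) -> None:
        self._records = records

    def get(self, ids: Any = None) -> dict:
        # Return every record when no IDs are given, else only the known ones.
        if ids is None:
            wanted = list(self._records)
        else:
            wanted = [i for i in ids if i in self._records]
        return {"ids": wanted, "documents": [self._records[i] for i in wanted]}


store = FakeStore({"a1": "first chunk"})
print(id_exists(store, "a1"))
print(id_exists(store, "missing"))
```

With the real vector store, the same check could wrap the call, for example running `update_document` only when `id_exists(vectorstore_from_directory, some_id)` is true, where `some_id` is an ID returned by `add_documents` or listed by `get()`.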

# Retrieval: Similarity Search

In [None]:
# Our goal in this lesson is the following.
# First, define a question related to data science.
# Second, use the embedding function of the vector store to create a vector representation of this question.
# Third, retrieve a pre-selected number of documents relevant to this question.
# In the next step, these retrieved documents will be fed to an LLM to generate a response.
# So let's create a variable called question (defined below, once the vector store is loaded).
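The three steps above can be sketched without any external service. This is a toy illustration, not how Chroma or OpenAI embeddings work internally: `embed` is a hypothetical stand-in that counts words, and `similarity_search` here is a plain function that ranks strings by cosine similarity. The sample documents are made up for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query: str, documents: list, k: int = 4) -> list:
    """Embed the query, score every document, and return the k best."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


docs = [
    "python and r are popular programming languages",
    "tableau is used for visualizations",
    "python and r are popular programming languages",  # duplicate entry
    "hadoop handles big data",
]
top = similarity_search("which programming languages are popular", docs, k=2)
print(top)
```

Because ranking looks only at similarity to the query, a duplicated document takes up two of the `k` slots. Here both results are the same sentence.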

In [17]:
vectorstore = Chroma(persist_directory = "./intro-to-ds-lectures", 
                     embedding_function = embedding)

In [18]:
added_document = Document(page_content='Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action', 
                          metadata={'Course Title': 'Introduction to Data and Data Science', 
                                    'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'})

In [19]:
vectorstore.add_documents([added_document])

['e723440e-a268-4c60-9591-eff9f778a887']

In [20]:
question = "What programming languages do data scientists use?"

In [21]:
retrieved_docs = vectorstore.similarity_search(query = question, 
                                               k = 5)

In [None]:
# The first parameter we must pass is query: our question.
# The second parameter to set is k, the number of documents to retrieve.
# It defaults to four, but let's set it to five, which lets us identify a pain point in the similarity search algorithm.

In [22]:
retrieved_docs

[Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}, page_content='What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data'),
 Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}, page_content='What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statis

In [None]:
# Displaying the variable, we find that retrieved_docs is a list of five documents.
# I find this a bit difficult to read, so let me use a for loop that displays only the page content and the lecture title of each document.
# Much better.

In [23]:
for i in retrieved_docs:
    print(f"Page Content: {i.page_content}\n----------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page Content: What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
----------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page Content: What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
----------
Lecture Title:Programming Langua

# Retrieval: Maximal Marginal Relevance Search

In [24]:
question = "What software do data scientists use?"

In [25]:
retrieved_docs = vectorstore.max_marginal_relevance_search(
    query=question, 
    k=3, 
    lambda_mult = 1, 
    filter = {"Lecture Title": "Programming Languages & Software Employed in Data Science - All the Tools You Need"}
)

In [26]:
for i in retrieved_docs:
    print(f"Page Content: {i.page_content}\n----------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page Content: As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end
----------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page Content: As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related 
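
The search above uses `lambda_mult = 1`, and the same chunk comes back more than once. A rough sketch of the usual maximal marginal relevance scoring helps explain why. Each candidate is scored as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, so at 1 the redundancy penalty disappears. The function `mmr` and the two-dimensional vectors below are hypothetical and chosen for illustration only. They are not the library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the closest match among already selected documents.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = (1.0, 0.0)
vectors = [(1.0, 0.5), (1.0, 0.5), (1.0, -1.0)]  # index 1 duplicates index 0

relevance_only = mmr(query, vectors, k=2, lambda_mult=1.0)
balanced = mmr(query, vectors, k=2, lambda_mult=0.5)
print(relevance_only, balanced)
```

With `lambda_mult=1.0` the duplicate at index 1 is picked second. With `lambda_mult=0.5` the penalty pushes the selection to the less similar vector at index 2.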

# Generation: Stuffing Documents

In [27]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain_core.output_parsers import StrOutputParser

In [28]:
vectorstore = Chroma(persist_directory = "./intro-to-ds-lectures", 
                     embedding_function = OpenAIEmbeddings(model='text-embedding-ada-002'))

In [29]:
len(vectorstore.get()['documents'])

42

In [30]:
retriever = vectorstore.as_retriever(search_type = 'mmr', 
                                     search_kwargs = {'k':3, 
                                                      'lambda_mult':0.7})

In [31]:
TEMPLATE = '''
Answer the following question:
{question}

To answer the question, use only the following context:
{context}

At the end of the response, specify the name of the lecture this context is taken from in the format:
Resources: *Lecture Title*
where *Lecture Title* should be substituted with the title of all resource lectures.
'''

prompt_template = PromptTemplate.from_template(TEMPLATE)

In [32]:
chat = ChatOpenAI(model_name = 'gpt-4', 
                  seed = 365,
                  max_tokens = 250)

In [33]:
question = "What software do data scientists use?"

In [34]:
chain = {'context': retriever, 
         'question': RunnablePassthrough()} | prompt_template

In [35]:
chain.invoke(question)

StringPromptValue(text="\nAnswer the following question:\nWhat software do data scientists use?\n\nTo answer the question, use only the following context:\n[Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}, page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end'), Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}, page_content='It’

In [36]:
print("\nAnswer the following question:\nWhat software do data scientists use?\n\nTo answer the question, use only the following context:\n[Document(page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}), Document(page_content='It’s actually a software framework which was designed to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}), Document(page_content='Great! We hope we gave you a good idea about the level of applicability of the most frequently used programming and software tools in the field of data science. Thank you for watching!', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'})]\n\nAt the end of the response, specify the name of the lecture this context is taken from in the format:\nResources: *Lecture Title*\nwhere *Lecture Title* should be substituted with the title of all resource lectures.\n")


Answer the following question:
What software do data scientists use?

To answer the question, use only the following context:
[Document(page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}), Document(page_content='It’s actually a software framework which was designed to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers
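
The prompt above receives the raw `repr` of the `Document` list, brackets and metadata dictionaries included. One possible refinement, sketched here under assumptions, is to join the documents into plain text before they reach the template. `format_docs` is a hypothetical helper, and `Doc` is a stand-in with the same two attributes as a LangChain `Document` so that the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in exposing page_content and metadata like a Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join retrieved documents into a plain-text context block."""
    return "\n\n".join(
        f"Lecture Title: {d.metadata.get('Lecture Title', 'unknown')}\n{d.page_content}"
        for d in docs
    )


sample = [
    Doc("R and Python are adaptable.", {"Lecture Title": "Programming Languages"}),
    Doc("Tableau is used for visualizations.", {"Lecture Title": "Programming Languages"}),
]
context = format_docs(sample)
print(context)
```

In the chain, `'context': retriever | format_docs` should work in place of the bare retriever, since piping a plain function after a runnable wraps it as a step. Treat that as a suggestion to try rather than part of the original notebook.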

# Generation: Generating a Response

In [37]:
chain = ({'context': retriever, 
         'question': RunnablePassthrough()} 
         | prompt_template 
         | chat 
         | StrOutputParser())

In [38]:
chain.invoke(question)

'Data scientists use a variety of software and programming languages. R and Python are two of the most popular tools due to their ability to manipulate data and their integration within multiple data and data science software platforms. They can also solve a wide range of business and data-related problems. In addition to these, Hadoop, a software framework, is used to address the complexity of big data and its computational intensity. Furthermore, software like Power BI, SaS, Qlik, and Tableau are used for business intelligence visualizations.\n\nResources: Programming Languages & Software Employed in Data Science - All the Tools You Need'

In [39]:
print('Data scientists use a variety of software tools. R and Python are the two most popular tools as they can manipulate data and are integrated within multiple data and data science software platforms. They are adaptable and can solve a wide range of business and data-related problems. Hadoop is a software framework designed to handle the complexity and computational intensity of big data by distributing computational tasks on multiple computers. Additionally, Power BI, SaS, Qlik, and Tableau are top-notch examples of software designed for business intelligence visualizations.\n\nResources: Programming Languages & Software Employed in Data Science - All the Tools You Need')

Data scientists use a variety of software tools. R and Python are the two most popular tools as they can manipulate data and are integrated within multiple data and data science software platforms. They are adaptable and can solve a wide range of business and data-related problems. Hadoop is a software framework designed to handle the complexity and computational intensity of big data by distributing computational tasks on multiple computers. Additionally, Power BI, SaS, Qlik, and Tableau are top-notch examples of software designed for business intelligence visualizations.

Resources: Programming Languages & Software Employed in Data Science - All the Tools You Need
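
To make the `|` composition in the final chain less opaque, here is a toy model of the idea. It is a simplified sketch and not LangChain's implementation: `Step` and `parallel` are hypothetical names, and the lambdas only stand in for the retriever, `RunnablePassthrough`, the prompt template, and the chat model.

```python
class Step:
    """Toy runnable: wraps a function and supports | composition."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Run this step first, then feed its output to the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**branches):
    """Feed the same input to every branch and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


retrieve = Step(lambda q: f"[context for: {q}]")  # stands in for the retriever
passthrough = Step(lambda q: q)  # stands in for RunnablePassthrough
fill_prompt = Step(lambda d: f"Q: {d['question']}\nC: {d['context']}")
fake_llm = Step(lambda prompt: prompt.upper())  # stands in for the chat model

toy_chain = parallel(context=retrieve, question=passthrough) | fill_prompt | fake_llm
result = toy_chain.invoke("what tools?")
print(result)
```

The dictionary at the head of the real chain plays the role of `parallel` here: the question goes to both the retriever and the passthrough, and the resulting dict fills the two template variables.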
