### RAG

Three major parts:

1. Indexing

2. Retrieval

3. Generation

#### Indexing

##### Document Loading

In [1]:
# load a PDF file (PyPDFLoader returns one Document per page)
from langchain_community.document_loaders import PyPDFLoader
import copy

loader_pdf=PyPDFLoader(r'Resources\Introduction_to_Data_and_Data_Science.pdf')

pages_pdf=loader_pdf.load()

# collapse newlines and repeated whitespace into single spaces, working on a copy
pages_pdf_cut=copy.deepcopy(pages_pdf)

for i in pages_pdf_cut:
    i.page_content=' '.join(i.page_content.split())

In [2]:
# for docx extension
from langchain_community.document_loaders import Docx2txtLoader

In [3]:
loader_docx=Docx2txtLoader(r'Resources\Introduction_to_Data_and_Data_Science.docx')

#### Document Splitting

##### Character text splitter

In [4]:
from langchain_text_splitters.character import CharacterTextSplitter

In [5]:
pages=loader_docx.load()

In [6]:
for i in pages:
    i.page_content=' '.join(i.page_content.split())

In [7]:
char_splitter=CharacterTextSplitter(separator='',chunk_size=500,chunk_overlap=0)

pages_char_split=char_splitter.split_documents(pages)

len(pages_char_split)

17

In [8]:
# let's introduce an overlap and a separator
char_splitter=CharacterTextSplitter(separator='.',chunk_size=500,chunk_overlap=50)

pages_char_split=char_splitter.split_documents(pages)

len(pages_char_split)

21
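
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is only a simplified sketch: `CharacterTextSplitter` first splits on the separator and then merges the pieces, so its chunk boundaries differ. The helper name `chunk_with_overlap` is hypothetical. It shows one reason the chunk count can rise once an overlap is introduced: each chunk repeats the tail of the previous one, so more chunks are needed to cover the same text.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# toy input, not the course document
text = "abcdefghijklmnopqrstuvwxyz"
no_overlap = chunk_with_overlap(text, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)

print(len(no_overlap), no_overlap)
print(len(with_overlap), with_overlap)
```

With the overlap, the first three characters of each chunk repeat the last three of the previous chunk, which keeps a sentence that straddles a boundary visible in both chunks.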

##### using Markdown header text splitter

In [9]:
from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter

In [10]:
loader_docx=Docx2txtLoader(r'Resources\Introduction_to_Data_and_Data_Science_markdown.docx')
pages=loader_docx.load()

In [11]:
md_splitter=MarkdownHeaderTextSplitter([
    ('#','course Title'),
    ('##','Lecture Title')
])

pages_md_split=md_splitter.split_text(pages[0].page_content)

In [12]:
for i in pages_md_split:
    i.page_content=' '.join(i.page_content.split())

In [13]:
char_splitter=CharacterTextSplitter(separator='.',chunk_size=500,chunk_overlap=50)

In [14]:
pages_char_split=char_splitter.split_documents(pages_md_split)
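
To see what header-based splitting does conceptually, here is a minimal pure-Python sketch. It is not the `MarkdownHeaderTextSplitter` implementation; `split_by_headers` is a hypothetical helper and the sample text is made up. The idea it illustrates: every header line becomes metadata, and the text beneath it becomes the content of a section that carries that metadata.

```python
def split_by_headers(text, headers):
    """headers: list of (prefix, metadata_key), e.g. [('#', 'course Title')]."""
    # check longer prefixes first so '##' is not mistaken for '#'
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    sections, current_meta, buffer = [], {}, []

    def flush():
        # store the text collected so far together with a copy of the metadata
        body = "\n".join(buffer).strip()
        if body:
            sections.append({"metadata": dict(current_meta), "page_content": body})
        buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for prefix, key in ordered:
            if stripped.startswith(prefix + " "):
                flush()
                # a new header invalidates any deeper header levels
                for other_prefix, other_key in headers:
                    if len(other_prefix) > len(prefix):
                        current_meta.pop(other_key, None)
                current_meta[key] = stripped[len(prefix):].strip()
                break
        else:
            buffer.append(line)
    flush()
    return sections


# toy markdown, not the course document
sample = """# Course X
## Lecture 1
First lecture text.
## Lecture 2
Second lecture text."""

sections = split_by_headers(sample, [("#", "course Title"), ("##", "Lecture Title")])
for section in sections:
    print(section)
```

Because each section keeps its header metadata, the chunks produced by the character splitter afterwards can still be filtered by lecture title at retrieval time.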

#### Embedding

In [15]:
from langchain_openai.embeddings import OpenAIEmbeddings
import os
from dotenv import load_dotenv
import numpy as np

In [16]:
# load environment variables from the .env file (raw string so backslashes are not treated as escapes)
load_dotenv(dotenv_path=r'D:\Project\.env.txt')

True

In [17]:
embedding=OpenAIEmbeddings( model="text-embedding-ada-002")

In [18]:
vector1=embedding.embed_query(pages_char_split[3].page_content)
vector2=embedding.embed_query(pages_char_split[5].page_content)
vector3=embedding.embed_query(pages_char_split[18].page_content)

In [19]:
np.dot(vector1,vector2)

0.8791284497943928
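
The dot product above works as a similarity score because OpenAI documents its embeddings as normalized to length 1, in which case the dot product equals the cosine similarity. For vectors that are not unit length the two differ, as this small sketch with toy vectors (not real embeddings) shows. The helper names `dot` and `cosine_similarity` are illustrative only.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# toy vectors of length 5, so the dot product overstates the similarity
v1 = [3.0, 4.0]
v2 = [4.0, 3.0]
print(dot(v1, v2), cosine_similarity(v1, v2))

# the same directions scaled to unit length: both measures agree
u1 = [0.6, 0.8]
u2 = [0.8, 0.6]
print(dot(u1, u2), cosine_similarity(u1, u2))
```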

##### Storing

In [25]:
from langchain_community.vectorstores import Chroma

In [27]:
vectorestore= Chroma.from_documents(documents=pages_char_split,
                                    embedding=embedding,
                                    persist_directory="./vector_Store")

In [31]:
# load an existing vector store from its persist directory
vectorestore_from_directory=Chroma(embedding_function=embedding,
                                    persist_directory="./vector_Store")

In [32]:
vectorestore_from_directory.get()

{'ids': ['868c4d89-e5d7-4728-afc0-ab3d26f94585',
  'c4d76855-be40-4ed0-9dae-98a960f743a9',
  'f8b80cc3-ba19-4c52-86f4-bec5e14685d7',
  '4b7516f0-687a-4c18-b41b-21c8f18d9552',
  '05721762-f3d2-443f-9f97-555b054e8a9d',
  '00d6255a-cd3f-4a28-b1ee-93c30bcae4a1',
  '5d4c3fd2-ede5-4389-b800-c706291f8afd',
  '91b00f24-7f4d-41e6-9ab4-01a9b06db6fb',
  'b3ba41a3-43e0-4741-aa29-1faf4f0a5119',
  '1e3e61f8-3e83-48f6-85cd-cd27ea18eb39',
  '36af7a5e-c36b-40c5-8b99-28c3e02767b6',
  'b7aae5f1-3a8c-4b76-860e-ac267cc2a157',
  '96a2fe04-da6c-4940-a0b9-f7e4d980f065',
  'c8121ba3-4272-4404-92e2-17e62265ead5',
  'd00d6be7-4a30-40cd-97cf-aadec35ee8e1',
  '0d3d1daf-2721-4a04-b60e-c7c5476a7f34',
  'cf203599-f18b-4f10-b1fa-bfcd02fc6105',
  '540c555c-b5f6-4f20-b743-0fcceb9c14d9',
  'dd4acbd8-fae6-41bb-9ad8-982fd5d5d45e',
  'e421c7f5-e384-4b66-b5e9-a9901d5573ad'],
 'embeddings': None,
 'documents': ['Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the simi

In [34]:
# add a new document to the vector store
from langchain_core.documents import Document
added_document=Document(page_content='This is a new document',
                        metadata={'course Title':'Introduction_to_Data_and_Data_Science',
                                  'Lecture Title': 'New'})

In [36]:
vectorestore_from_directory.add_documents([added_document])

['ac3586ca-2db6-4060-9ee2-4b4aceb58dde']

In [37]:
vectorestore_from_directory.get('ac3586ca-2db6-4060-9ee2-4b4aceb58dde')

{'ids': ['ac3586ca-2db6-4060-9ee2-4b4aceb58dde'],
 'embeddings': None,
 'documents': ['This is a new document'],
 'uris': None,
 'data': None,
 'metadatas': [{'Lecture Title': 'New',
   'course Title': 'Introduction_to_Data_and_Data_Science'}],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [39]:
# update the document
updated_document=Document(page_content='This is a new updated document',
                        metadata={'course Title':'Introduction_to_Data_and_Data_Science',
                                  'Lecture Title': 'New'})

In [40]:
vectorestore_from_directory.update_document(document_id='ac3586ca-2db6-4060-9ee2-4b4aceb58dde',
                                            document=updated_document)

In [41]:
vectorestore_from_directory.get('ac3586ca-2db6-4060-9ee2-4b4aceb58dde')

{'ids': ['ac3586ca-2db6-4060-9ee2-4b4aceb58dde'],
 'embeddings': None,
 'documents': ['This is a new updated document'],
 'uris': None,
 'data': None,
 'metadatas': [{'Lecture Title': 'New',
   'course Title': 'Introduction_to_Data_and_Data_Science'}],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [45]:
#deleting the document
vectorestore_from_directory.delete('ac3586ca-2db6-4060-9ee2-4b4aceb58dde')

Delete of nonexisting embedding ID: ac3586ca-2db6-4060-9ee2-4b4aceb58dde


In [44]:
#verify
vectorestore_from_directory.get('ac3586ca-2db6-4060-9ee2-4b4aceb58dde')

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

#### Retrieval

In [49]:
#similarity search

question="What programming languages data scientists use?"

retrieve_documnets=vectorestore_from_directory.similarity_search(query=question,k=5)

retrieve_documnets

[Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data'),
 Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='Thus, we need a lot of computational power, and we can expect people to use the languages similar to those in the big data column. Apart from R, Python, and MATLAB, other, faster languages are used

In [51]:
# maximal marginal relevance (MMR) search
question="What software data scientists use?"

In [53]:
retrieve_documnets=vectorestore_from_directory.similarity_search(query=question,k=3)

retrieve_documnets
#not very helpful

[Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end'),
 Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. 

In [55]:
retrieve_documnets=vectorestore_from_directory.max_marginal_relevance_search(query=question,k=3,lambda_mult=0.1)

retrieve_documnets

[Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end'),
 Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='Great! We hope we gave you a good idea about the level of applicability of the most frequently used programming and software tools in the field of data scienc

In [57]:
retrieve_documnets=vectorestore_from_directory.max_marginal_relevance_search(query=question,
                                                                             k=3,
                                                                             lambda_mult=0.1,
                                                                            filter={'Lecture Title':'Programming Languages & Software Employed in Data Science - All the Tools You Need'}
                                                                            )

retrieve_documnets

[Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end'),
 Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='Among the many applications we have plotted, we can say there is an increasing amount of software designed for working with big data such as Apache Hadoop, Ap

In [58]:
retrieve_documnets=vectorestore_from_directory.max_marginal_relevance_search(query=question,
                                                                             k=3,
                                                                             lambda_mult=0.7,
                                                                            filter={'Lecture Title':'Programming Languages & Software Employed in Data Science - All the Tools You Need'}
                                                                            )

retrieve_documnets

[Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end'),
 Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='It’s actually a software framework which was designed to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes 

In [59]:
retrieve_documnets=vectorestore_from_directory.max_marginal_relevance_search(query=question,
                                                                             k=3,
                                                                             lambda_mult=1,
                                                                            filter={'Lecture Title':'Programming Languages & Software Employed in Data Science - All the Tools You Need'}
                                                                            )

retrieve_documnets

[Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end'),
 Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. 
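
The role of `lambda_mult` in the calls above can be sketched in plain Python. This is a minimal illustration of the MMR idea with toy 2-D vectors, not the Chroma or LangChain implementation; `cosine` and `mmr` are hypothetical helpers. Each step picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, so `lambda_mult=1` reduces to plain similarity ranking and small values favor diversity.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k, lambda_mult):
    """Return indices of k candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # similarity to the closest already-selected candidate (0 if none yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# toy vectors, not real embeddings
query = [1.0, 0.0]
candidates = [
    [1.0, 0.10],  # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of candidate 0
    [0.6, 0.80],  # 2: less relevant, but different
]

relevance_only = mmr(query, candidates, k=2, lambda_mult=1.0)
diverse = mmr(query, candidates, k=2, lambda_mult=0.1)
print(relevance_only)
print(diverse)
```

With `lambda_mult=1.0` the near-duplicate is returned second; with `lambda_mult=0.1` it is skipped in favor of the less similar but different candidate.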

In [60]:
retriever=vectorestore_from_directory.as_retriever(search_type='mmr',
                                                   search_kwargs={'k':3,'lambda_mult':0.7})

In [61]:
retrieve_docs=retriever.invoke(question)

In [62]:
retrieve_docs

[Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end'),
 Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='Among the many applications we have plotted, we can say there is an increasing amount of software designed for working with big data such as Apache Hadoop, Ap

#### Generation

In [87]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough,RunnableParallel
from langchain_core.output_parsers import StrOutputParser

In [71]:
vectorestore=Chroma(embedding_function=OpenAIEmbeddings(model='text-embedding-ada-002'),
                                    persist_directory="./vector_Store")

In [72]:
retriever=vectorestore.as_retriever(search_type='mmr',
                                                   search_kwargs={'k':3,'lambda_mult':0.7})

In [74]:
TEMPLATE='''
Answer the following question:
{question}

To answer the question, use only the following context:
{context}

At the end of the response, specify the name of the lecture this context is taken from in the format:
Resources: *lecture Title*
Where *lecture Title* should be substituted with the title of all resource lectures.
'''
prompt_template=PromptTemplate.from_template(TEMPLATE)

In [75]:
chat=ChatOpenAI(model_name='gpt-4',seed=365,temperature=0,max_tokens=300)



In [76]:
question="what software does datascientist use?"

In [77]:
chain= RunnableParallel({'context':retriever,'question':RunnablePassthrough()})

In [78]:
chain.invoke(question)

{'context': [Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='It’s actually a software framework which was designed to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations'),
  Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='Great! We hope we gave you a good idea about the level of applicability of the most frequently used programming and software tools in the field of data science. Thank you for watching!'),
  Document(metada

In [80]:
chain= {'context':retriever,'question':RunnablePassthrough()}|prompt_template

In [81]:
chain.invoke(question)

StringPromptValue(text="\nAnswer the following question:\nwhat software does datascientist use?\n\nTo answer the question, use only the following context:\n[Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='It’s actually a software framework which was designed to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations'), Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'course Title': 'Introduction_to_Data_and_Data_Science'}, page_content='Great! We hope we gave you a good idea about the level of appli
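
The prompt above contains the raw `Document(...)` representation, including metadata syntax the model has to read through. One possible refinement, sketched here under assumptions, is to format the retrieved documents into plain text before they reach the prompt. `format_docs` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `page_content` and `metadata` attributes of a LangChain `Document`.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Render retrieved documents as plain text, one block per chunk."""
    blocks = []
    for doc in docs:
        title = doc.metadata.get("Lecture Title", "unknown")
        blocks.append(f"Lecture Title: {title}\n{doc.page_content}")
    return "\n\n".join(blocks)


# toy stand-ins for retrieved Document objects
docs = [
    SimpleNamespace(page_content="first chunk text",
                    metadata={"Lecture Title": "Lecture A"}),
    SimpleNamespace(page_content="second chunk text",
                    metadata={"Lecture Title": "Lecture B"}),
]

context = format_docs(docs)
print(context)
```

In a chain, a helper like this would typically be piped after the retriever, for example `{'context': retriever | format_docs, 'question': RunnablePassthrough()}`; treat that wiring as an untested suggestion for this notebook.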

In [82]:
chain= ({'context':retriever,'question':RunnablePassthrough()}|prompt_template|chat)

In [83]:
chain.invoke(question)

AIMessage(content="Data scientists use a variety of software and programming languages. Hadoop is a software framework designed to handle the complexity and computational intensity of big data by distributing computational tasks on multiple computers. Other software tools used for business intelligence visualizations include Power BI, SaS, Qlik, and Tableau. In terms of programming languages, R and Python are the most popular tools in the field of data science. They are capable of manipulating data and are integrated within multiple data and data science software platforms. They are adaptable and can solve a wide variety of business and data-related problems.\n\nResources: 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'Analysis vs Analytics'", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 138, 'prompt_tokens': 492, 'total_tokens': 630, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'a

In [89]:
chain= ({'context':retriever,'question':RunnablePassthrough()}
        |prompt_template
        |chat
        | StrOutputParser())

In [91]:
print(chain.invoke(question))

Data scientists use a variety of software for their work. This includes Hadoop, a software framework designed to handle the complexity and computational intensity of big data by distributing computational tasks on multiple computers. Other software used by data scientists for business intelligence visualizations include Power BI, SaS, Qlik, and Tableau. Additionally, R and Python are popular tools in the field of data science due to their ability to manipulate data and their integration within multiple data and data science software platforms. They are adaptable and can solve a wide variety of business and data-related problems.

Resources: 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'Analysis vs Analytics'
