## Installing Dependencies

In [None]:
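# pinecone.init() (used below) only exists in pinecone-client releases before 3.0, so the client is pinned here
!pip install langchain langchain-community langchain-openai unstructured==0.7.12 "pinecone-client<3" openai tiktoken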

## Importing Dependencies

In [None]:
# Standard library
import os

# Third-party clients
import nltk
import openai
import pinecone

# LangChain components (duplicate and deprecated import paths consolidated)
from langchain.chains import RetrievalQA
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredPDFLoader, UnstructuredURLLoader
from langchain_community.vectorstores import Pinecone
from langchain_openai import ChatOpenAI, OpenAI, OpenAIEmbeddings



## User Input for the URLs the Copilot Is Built On

In [None]:
urls = []
n_weblinks = int(input("How many web links should the copilot refer to for response generation and insights? Enter here: "))
print("Enter your links below: ")
for _ in range(n_weblinks):
    # Read one link per line and append it to the list
    urls.append(input().strip())

How many web links should the copilot refer to for response generation and insights? Enter here: 1
Enter your links below: 
https://milvus.io/
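
The input loop above accepts any string, so a typo would only surface later when the loader tries to fetch the page. As a minimal sketch, the links could be checked before loading. The helper names `is_valid_url` and `clean_urls` are hypothetical and not part of LangChain; only the standard library is used.

```python
from urllib.parse import urlparse


def is_valid_url(candidate: str) -> bool:
    """Return True if the string looks like an absolute http(s) URL."""
    parsed = urlparse(candidate.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def clean_urls(raw_urls):
    """Keep only valid URLs, dropping duplicates while preserving order."""
    seen = set()
    cleaned = []
    for raw in raw_urls:
        url = raw.strip()
        if is_valid_url(url) and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned
```

Calling `urls = clean_urls(urls)` before building the loader would then drop malformed or repeated entries.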


## Loading URL through LangChain's UnstructuredURLLoader

In [None]:
# Load the web pages; note that `urls` now holds the loaded Document objects, not the raw link strings
loader = UnstructuredURLLoader(urls=urls)
urls = loader.load()

# Load the PDF from the local working directory
loader = UnstructuredPDFLoader("A Systematic Review of Transformer-Based Pre-Trained Language Models through Self Supervised Learning.pdf")
pdf = loader.load()



In [None]:
# Merge the URL documents and the PDF documents into a single list
documents = []
documents.extend(urls)
documents.extend(pdf)

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
texts = text_splitter.split_documents(documents)
texts
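
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries to split on separators such as paragraph and line breaks first. It is a plain fixed-width sliding window, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 20):
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=500` and `chunk_overlap=20`, each window starts 480 characters after the previous one, so neighbouring chunks share 20 characters of context.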

## OpenAI API Key Setting



In [None]:
# Set the OpenAI API key as an environment variable
os.environ["OPENAI_API_KEY"] = "sk-xxxx"
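
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the key from the environment and falls back to a hidden prompt. The helper `require_api_key` and its `env` and `prompt_fn` parameters are hypothetical, introduced here only for illustration.

```python
import os
from getpass import getpass


def require_api_key(var_name: str, env=os.environ, prompt_fn=getpass) -> str:
    """Return the key stored under var_name, prompting for it if it is missing."""
    key = env.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value so it does not end up in the cell output
        key = prompt_fn(f"Enter {var_name}: ").strip()
    if not key:
        raise ValueError(f"{var_name} is not set")
    env[var_name] = key  # make the key visible to libraries that read the environment
    return key
```

In the notebook this would be called as `require_api_key("OPENAI_API_KEY")` in place of the assignment above.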

## Pinecone API Setting

In [None]:
# Set the Pinecone API key and environment
pinecone.init(api_key="71ed08b1-xxxx", environment="gcp-starter")

In [None]:
# Create the Pinecone index automatically if it doesn't exist yet
embeddings = OpenAIEmbeddings()
index_name = 'myindex'  # index name
if index_name not in pinecone.list_indexes():
    # The OpenAI embedding model 'text-embedding-ada-002' produces 1536-dimensional vectors,
    # so the index dimension must match
    pinecone.create_index(name=index_name, metric="cosine", dimension=1536)
docsearch = Pinecone.from_documents(texts, embeddings, index_name=index_name)
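
The index above is created with `metric="cosine"`. As a rough illustration of what that metric measures, the sketch below computes cosine similarity in plain Python and ranks a few toy vectors against a query. It is not how Pinecone implements search internally; the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=4):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]
```

Because the score depends only on direction, not on vector length, the dimension mismatch check matters: every stored vector and every query must have the same number of components as the index.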





## Copilot: URL QA engine using LangChain

In [None]:
llm = ChatOpenAI(temperature=0.1, model_name="gpt-4", max_tokens=256)
qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=docsearch.as_retriever(),
        verbose=True,
        return_source_documents=True)
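
The `chain_type="stuff"` setting means that all retrieved passages are placed ("stuffed") into a single prompt alongside the question. The sketch below illustrates the idea in plain Python. The function `stuff_prompt` is hypothetical, and the template wording is an illustrative assumption rather than LangChain's exact prompt.

```python
def stuff_prompt(question, passages, separator="\n\n"):
    """Concatenate all passages into one context block and append the question."""
    context = separator.join(p.strip() for p in passages if p.strip())
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )
```

Since every passage goes into one prompt, the total length of the retrieved chunks has to fit within the model's context window, which is one reason the documents were split into 500-character chunks earlier.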

In [None]:
initial_prompt = "You are a researcher who is going to search the web links, summarize them and share insights as asked"

In [None]:
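query = "What is a RAG framework?"
# Note: RetrievalQA reads only the "query" key; the extra "prompt" entry is not inserted into the LLM prompt
result = qa.invoke({"query": query, "prompt": initial_prompt})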





> Entering new RetrievalQA chain...

> Finished chain.


In [None]:
print(result['result'])

The RAG framework refers to a research paradigm in the field of technology. It has evolved over time and is categorized into three types: Naive RAG, Advanced RAG, and Modular RAG. These categories were developed to address specific shortcomings in the initial RAG model. The framework involves various technologies that work together to form a cohesive and effective system. It is used for information retrieval and context-aware generation. The performance of RAG models is assessed using certain metrics and benchmarks.


In [None]:
result['source_documents']

[Document(page_content='3 RAG Framework The RAG research paradigm is continuously evolving, and this section primarily delineates its progression. We cate- gorize it into three types: Naive RAG, Advanced RAG, and Modular RAG. While RAG were cost-effective and surpassed the performance of the native LLM, they also exhibited sev- eral limitations. The development of Advanced RAG and Modular RAG was a response to these specific shortcomings in Naive RAG.', metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf'}),
 Document(page_content='technologies embedded in each of these critical components, providing a profound understanding of the advancements in RAG systems. Furthermore, this paper introduces the metrics and benchmarks for assessing RAG models, along with the most up-to-date evaluation In conclusion, the paper delineates framework. including the prospective avenues for research, identification of challenges, the expansion of multi-modalities, and the progression of the RAG infr