# Adaptive RAG -- Dev 0.1

## Goal:
- Build a query-analysis step that routes each question to either RAG or web search
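
As a rough illustration of the routing goal, here is a minimal sketch. It assumes a hypothetical keyword-based `route_question` function and a made-up `INDEXED_TOPICS` list as stand-ins for the real query analyzer, which would more likely be an LLM call.

```python
from typing import Literal

Route = Literal["vectorstore", "web_search"]

# Hypothetical keyword list standing in for an LLM-based query analyzer.
INDEXED_TOPICS = ("approach", "airspace", "notam", "clearance")


def route_question(question: str, indexed_topics=INDEXED_TOPICS) -> Route:
    """Send questions mentioning an indexed topic to RAG, everything else to web search."""
    text = question.lower()
    if any(topic in text for topic in indexed_topics):
        return "vectorstore"
    return "web_search"


print(route_question("What is an instrument approach?"))
print(route_question("Who won the game last night?"))
```

The two return values name the two branches from the goal above; the labels themselves are illustrative.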

## Tools used:
- OpenAI Embeddings
- ChatOpenAI
- Pinecone

These hosted tools were chosen to get the system running online quickly.

Running embeddings locally would make development take longer than necessary.

In [5]:
# dotenv
import os
import dotenv
dotenv.load_dotenv()

True
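
Since `load_dotenv()` returns `True` whenever a `.env` file is found, it does not confirm that the needed keys are present. A small check like the following sketch could catch that early. `PINECONE_API_KEY` is read later in this notebook; `OPENAI_API_KEY` is assumed to be the variable the OpenAI client reads, and `missing_keys` is a hypothetical helper.

```python
import os

# PINECONE_API_KEY appears later in the notebook; OPENAI_API_KEY is an assumption.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in `env`."""
    return [key for key in required if not env.get(key)]


missing = missing_keys()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```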

In [10]:
llm_model = "gpt-3.5-turbo"

## Embedding Documents

Documents are loaded from the AIM website (for now). We will use LangChain's document loaders and text splitters to parse and chunk the HTML pages.

### Indexing

In [1]:
### Build Index

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings

# Set the embedding model (library default)
embd = OpenAIEmbeddings()

Collect the AIM chapter and appendix page URLs that will be loaded into the notebook.

In [3]:
import requests
from bs4 import BeautifulSoup
import re

base_url = "https://www.faa.gov/air_traffic/publications/atpubs/aim_html/"

response = requests.get(base_url)
soup = BeautifulSoup(response.content, "html.parser")

links = soup.find_all("a", href=True)
subpages = set()

for link in links:
    href = link['href']
    # Keep only links whose href contains 'chap' or 'appendix' (case-insensitive)
    if re.search(r'(chap|appendix)', href, re.IGNORECASE):
        full_url = f"{base_url.rstrip('/')}/{href.lstrip('/')}"
        subpages.add(full_url)

subpages

{'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_1.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_2.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_3.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_4.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_5.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_cfr.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_chap0_policy.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_faa_desc.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_info_eoc.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_subscription_info.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap10_section_1.html',
 'https://www.faa.gov/air_traffic/publications/atpubs
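
The URLs above contain a `/./` segment because the relative hrefs start with `./`. One possible way to avoid that is `urllib.parse.urljoin`, shown here as a sketch on a few sample hrefs. `collect_subpages` is a hypothetical helper, and `./index.html` is a made-up href included only to show the filter.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.faa.gov/air_traffic/publications/atpubs/aim_html/"


def collect_subpages(hrefs, base_url=BASE_URL):
    """Resolve chapter/appendix hrefs against the base URL, dropping './' segments."""
    pattern = re.compile(r"(chap|appendix)", re.IGNORECASE)
    return {urljoin(base_url, href) for href in hrefs if pattern.search(href)}


# './index.html' is a hypothetical href that the filter should skip.
sample_hrefs = ["./appendix_1.html", "./chap0_cfr.html", "./index.html"]
print(sorted(collect_subpages(sample_hrefs)))
```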

Create the Pinecone index.

In [7]:
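from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

index_name = "adaptive-rag"

# Create the index only if it does not exist yet, so the cell can be re-run safely
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # must match the embedding vector size
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )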
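
The `dimension=1536` value has to equal the length of the vectors the embedding model produces. The sketch below assumes `OpenAIEmbeddings()` resolves to a 1536-dimensional model such as `text-embedding-ada-002`, and uses a hypothetical lookup helper with the default output sizes of a few OpenAI embedding models.

```python
# Default output sizes of some OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def index_dimension_for(model_name: str) -> int:
    """Return the vector size to pass as `dimension` when creating the index."""
    try:
        return EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None


print(index_dimension_for("text-embedding-ada-002"))
```

If the embedding model is ever swapped, the index would need to be recreated with the matching dimension.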

In [9]:

# Load docs

from langchain_pinecone import PineconeVectorStore

docs = [WebBaseLoader(url).load() for url in subpages] # pulls pages from subpages set
docs_list = [item for sublist in docs for item in sublist] # flattens the list

# Split docs
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=0,
)
doc_splits = text_splitter.split_documents(docs_list)

# Initialize vectorstore and add documents simultaneously
vectorstore_from_docs = PineconeVectorStore.from_documents(
    doc_splits,
    index_name=index_name,
    embedding=embd,
)

retriever = vectorstore_from_docs.as_retriever()
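
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that windows over whitespace-separated words. It is not the `RecursiveCharacterTextSplitter` algorithm (which counts tiktoken tokens and splits on separators); `chunk_words` is a hypothetical helper for illustration only.

```python
def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of `chunk_size` words, sharing `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven"
print(chunk_words(sample, chunk_size=3, chunk_overlap=0))
print(chunk_words(sample, chunk_size=3, chunk_overlap=1))
```

With `chunk_overlap=0`, as in the cell above, neighbouring chunks share no text, so a sentence that straddles a boundary is split between two chunks.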

Adding more records

Once the vector store is initialized, you can add more documents to the index with `add_documents` or `add_texts`.
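
A sketch of how later additions might be batched is shown below. `add_in_batches` and `InMemoryStore` are hypothetical; the stand-in store only mimics an `add_texts` method so the example runs without a Pinecone index.

```python
from typing import Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_in_batches(store, texts: list[str], batch_size: int = 100) -> int:
    """Add texts to any store exposing `add_texts`; return the number of batches sent."""
    count = 0
    for batch in batched(texts, batch_size):
        store.add_texts(batch)
        count += 1
    return count


class InMemoryStore:
    """Stand-in for a vector store so the sketch runs offline."""

    def __init__(self):
        self.texts = []

    def add_texts(self, texts):
        self.texts.extend(texts)


store = InMemoryStore()
print(add_in_batches(store, ["a", "b", "c"], batch_size=2))
```

With the objects from this notebook, the call would presumably look like `add_in_batches(vectorstore_from_docs, new_texts)`, where `new_texts` is a list of strings you supply.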


**THIS NEEDS TO BE CONTINUED**

In [11]:
### Retrieval Grader

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# LLM
# ChatOpenAI has no `format` argument (that is an Ollama-style option); passing
# format="json" fails with:
#   TypeError: Completions.create() got an unexpected keyword argument 'format'
# JSON mode is requested through `response_format` instead. This assumes the
# selected model supports JSON mode.
llm = ChatOpenAI(
    model=llm_model,
    temperature=0,
    model_kwargs={"response_format": {"type": "json_object"}},
)

# Prompt instructing the LLM how to act as a grader
prompt = PromptTemplate(
    template="""You are a grader assessing the relevance of a retrieved document to a user question.
    If the document contains keywords related to the user question, grade it as relevant.
    It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    Give a binary score of 'yes' or 'no' to indicate whether the document is relevant to the question. \n
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation. \n
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n
    """,
    input_variables=["question", "document"],
)

retrieval_grader = prompt | llm | JsonOutputParser()  # retrieval grader chain

question = "instrument approach"

docs = retriever.invoke(question)
doc_txt = docs[1].page_content  # grade the second retrieved document

print(retrieval_grader.invoke({"question": question, "document": doc_txt}))
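
Once the grader returns a JSON object with a single `score` key, as the prompt requests, the grades could be used to drop irrelevant documents. The following is a minimal sketch under that assumption; `is_relevant` and `filter_relevant` are hypothetical helpers, and the sample documents and grades are made up.

```python
import json


def is_relevant(grade) -> bool:
    """Accept a parsed dict or a raw JSON string and check for score == 'yes'."""
    if isinstance(grade, str):
        try:
            grade = json.loads(grade)
        except json.JSONDecodeError:
            return False  # treat unparseable grader output as not relevant
    if not isinstance(grade, dict):
        return False
    return str(grade.get("score", "")).strip().lower() == "yes"


def filter_relevant(documents, grades):
    """Keep only the documents whose matching grade marks them as relevant."""
    return [doc for doc, grade in zip(documents, grades) if is_relevant(grade)]


# Made-up sample data for illustration.
sample_docs = ["approach chart notes", "unrelated text"]
sample_grades = [{"score": "yes"}, '{"score": "no"}']
print(filter_relevant(sample_docs, sample_grades))
```

Treating malformed grader output as "not relevant" is one design choice; retrying the grader call would be another.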