# QA over unstructured data
This notebook walks through question answering over unstructured data such as PDFs and website content.

Our goal is to build an NBA assistant bot that can answer questions about the NBA: its rules, its players, and so on.

In [None]:
import bs4
import dotenv
import tiktoken

from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader, WikipediaLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

In [None]:
dotenv.load_dotenv()

## Document loading
The first step of indexing is loading the documents. This is the data that we want the LLM to see when answering questions.

In our example we use two data sources: the official NBA rulebook and a small set of Wikipedia articles about the NBA.

In [None]:
nba_rules_urls = [
    "https://official.nba.com/rule-no-1-court-dimensions-equipment/",
    "https://official.nba.com/rule-no-2-duties-of-the-officials/",
    "https://official.nba.com/rule-no-3-players-substitutes-and-coaches/",
    "https://official.nba.com/rule-no-4-definitions/",
    "https://official.nba.com/rule-no-5-scoring-and-timing/",
    "https://official.nba.com/rule-no-6-putting-ball-in-play-live-dead-ball/",
    "https://official.nba.com/rule-no-7-24-second-clock/",
    "https://official.nba.com/rule-no-8-out-of-bounds-and-throw-in/",
    "https://official.nba.com/rule-no-9-free-throws-and-penalties/",
    "https://official.nba.com/rule-no-10-violations-and-penalties/",
    "https://official.nba.com/rule-no-11-basket-interference-goaltending/",
    "https://official.nba.com/rule-no-12-fouls-and-penalties/",
    "https://official.nba.com/rule-no-13-instant-replay/",
    "https://official.nba.com/rule-no-14-coaches-challenge/"
]

To load the data, LangChain provides *DocumentLoaders*. There are different *DocumentLoaders* available for different input sources, e.g. PDFs, HTML, Markdown, etc.
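As a rough illustration of what a loader does under the hood, the sketch below turns raw HTML into a record with `page_content` and `metadata`, using only the standard library. It is not LangChain's implementation: `TextExtractor` and `load_html` are hypothetical names, and the `source` metadata key is an assumption made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def load_html(html, source):
    """Hypothetical loader: returns one document-like dict per HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"page_content": "\n".join(parser.parts), "metadata": {"source": source}}


sample_html = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Rule 1</h1><p>Court <b>dimensions</b></p>"
    "<script>x=1</script></body></html>"
)
doc = load_html(sample_html, source="https://example.com/rule-1")
print(doc["page_content"])
```

A real loader also handles fetching, encodings, and content selection (as the `SoupStrainer` filter does in the next cell), which this sketch leaves out.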

In [None]:
loader = WebBaseLoader(
    web_paths=nba_rules_urls,
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="col-xs-12 col-md-9")),
)

nba_rules_docs = loader.load()

In [None]:
nba_rules_docs[0]

Each document contains *page_content* (the text) and *metadata* (attributes such as the source URL). Metadata can matter in a retrieval pipeline, for example to improve results by filtering on attributes, either manually or with SelfQueryRetriever.
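To make manual metadata filtering concrete, here is a minimal sketch using only the standard library. `Doc` is a hypothetical stand-in for a LangChain document, `filter_by_metadata` is an illustrative helper, and the `source` and `rule` keys and the sample texts are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and attributes."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **required):
    """Keep only documents whose metadata matches every required key/value pair."""
    return [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in required.items())
    ]


# Toy data: placeholder texts, not real rulebook content.
toy_docs = [
    Doc("Placeholder text about scoring.", {"source": "rulebook", "rule": 5}),
    Doc("Placeholder text about league history.", {"source": "wikipedia"}),
    Doc("Placeholder text about fouls.", {"source": "rulebook", "rule": 12}),
]

rulebook_only = filter_by_metadata(toy_docs, source="rulebook")
print([doc.metadata["rule"] for doc in rulebook_only])  # [5, 12]
```

Narrowing the candidate set this way before similarity search keeps a rules question from being answered out of a Wikipedia chunk.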

In [None]:
nba_wiki_docs = WikipediaLoader(query="NBA").load()
nba_wiki_docs

In [None]:
docs = nba_rules_docs + nba_wiki_docs
for player in ["Luka Doncic", "Nikola Jokic"]:
    docs.append(WikipediaLoader(query=player, load_max_docs=1).load()[0])

In [None]:
len(docs)

## Chunking / splitting
Chunking is the process of splitting the original documents into smaller segments that are more manageable for the LLM, both for embedding and for fitting into the prompt's context window.

In [None]:
# Size in characters of the 12th rules page (index 11, fouls and penalties)
len(nba_rules_docs[11].page_content)

In [None]:
# The same page measured in tokens, the unit that model context limits are expressed in
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
len(encoding.encode(nba_rules_docs[11].page_content))

In [None]:
# Sizes are measured in characters (the splitter's default length function), not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=400)
split_docs = splitter.split_documents(docs)

In [None]:
len(split_docs)
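To show what `chunk_size` and `chunk_overlap` mean, the sketch below implements a plain fixed-window splitter. It is a simplification, not the algorithm used above: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries, whereas the hypothetical `split_with_overlap` cuts at fixed character offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(alphabet, chunk_size=10, chunk_overlap=4)
print(toy_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one. The overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary, at the cost of storing and embedding more text.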

## Embed & store

In [None]:
vectorstore = FAISS.from_documents(documents=split_docs, embedding=OpenAIEmbeddings())

## Retrieve
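This is where the LLM application logic starts: given a question, the retriever returns the `k` most similar chunks from the vector store.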

In [None]:
retriever = vectorstore.as_retriever(search_kwargs={'k': 4})

In [None]:
retriever.get_relevant_documents("How old is Nikola Jokic?")
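Conceptually, retrieval embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below imitates that with a toy word-count "embedding" and cosine similarity. Both are assumptions chosen for illustration: `OpenAIEmbeddings` returns dense vectors from a model, and the distance metric FAISS uses depends on how the index is configured. `embed`, `cosine_similarity`, and `top_k` are hypothetical helpers.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts instead of a learned dense vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


toy_store = [
    "each team has five players on the court",
    "the shot clock is 24 seconds",
    "a game has four periods of twelve minutes",
]
hits = top_k("how long is the shot clock", toy_store, k=2)
print(hits[0])  # the shot clock is 24 seconds
```

Word counts only match exact words, which is why real pipelines use learned embeddings: they can match a question to a chunk that uses different wording.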

## Putting it all together

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [None]:
prompt = """You are a helpful assistant that answers NBA-related questions.
Use the following pieces of context to answer the user question at the end.
Your answers should be concise and to the point. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}
Answer:"""
rag_prompt_template = PromptTemplate.from_template(prompt)

In [None]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [None]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt_template
    | llm
    | StrOutputParser()
)
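The pipe syntax above can be read as ordinary function composition. The sketch below spells out the same data flow with plain Python callables and stub components. `build_rag_chain`, `stub_retrieve`, and `echo_llm` are hypothetical names, and `str(...)` only approximates what `StrOutputParser` does with a chat model's message.

```python
def build_rag_chain(retrieve, llm, template):
    """Compose retrieval, prompt filling, and generation into one callable."""

    def chain(question):
        # retriever | format_docs: fetch chunks and join them into one context string
        context = "\n\n".join(retrieve(question))
        # RunnablePassthrough: the question is forwarded unchanged into the prompt
        prompt = template.format(context=context, question=question)
        # llm | StrOutputParser: call the model and return plain text
        return str(llm(prompt))

    return chain


def stub_retrieve(question):
    """Stub retriever that ignores the question and returns fixed chunks."""
    return ["chunk A", "chunk B"]


def echo_llm(prompt):
    """Stub model that returns its prompt, so the filled template is visible."""
    return prompt


toy_template = "Context: {context}\nQuestion: {question}\nAnswer:"
toy_chain = build_rag_chain(stub_retrieve, echo_llm, toy_template)
print(toy_chain("Q?"))
```

Because the stub model echoes its input, the printed text is exactly the prompt the real model would receive, which is a handy way to debug retrieval and formatting before spending API calls.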

In [None]:
print(rag_chain.invoke("How old is Nikola Jokic?"))

In [None]:
print(rag_chain.invoke("How many referees are in a game?"))

In [None]:
print(rag_chain.invoke("Which team has won the most NBA titles?"))

In [None]:
print(rag_chain.invoke("What is the meaning of life?"))

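Our RAG chain is stateless: each call is answered independently, so we would need to add memory to support follow-up questions. In the next cell, the second question gives the chain no way to know who "he" refers to.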

In [None]:
print(rag_chain.invoke("How old is Nikola Jokic?"))
print(rag_chain.invoke("And how tall is he?"))

In [None]:
print(rag_chain.invoke("How tall is Nikola Jokic?"))
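One common way to add conversational ability is to keep the recent turns and ask the model to rewrite a follow-up into a standalone question before retrieval. The sketch below only builds that rewrite prompt. `ConversationBuffer` and `build_condense_prompt` are hypothetical helpers rather than LangChain classes, and the stored answer is a placeholder, not real model output.

```python
class ConversationBuffer:
    """Keeps the most recent question/answer pairs of a conversation."""

    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.turns = []

    def add(self, question, answer):
        self.turns.append((question, answer))
        # Drop the oldest turns so the prompt does not grow without bound.
        self.turns = self.turns[-self.max_turns:]

    def render(self):
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)


def build_condense_prompt(history, follow_up):
    """Build a prompt asking the model to make the follow-up self-contained."""
    return (
        "Given the conversation so far, rewrite the follow-up as a standalone question.\n\n"
        f"Conversation:\n{history.render()}\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


history = ConversationBuffer(max_turns=2)
history.add("How old is Nikola Jokic?", "(answer returned by the chain)")
condense_prompt = build_condense_prompt(history, "And how tall is he?")
print(condense_prompt)
```

The model's rewrite (ideally something like the explicit question in the last cell) would then be passed to `rag_chain.invoke` in place of the raw follow-up, and the new question/answer pair added to the buffer.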