# Question Answering over Unstructured Data
This notebook introduces the task of question answering over unstructured data (such as PDFs, web pages, ...).

Our goal is to build an NBA assistant that can answer questions about NBA rules and players.

In [None]:
%%capture
!pip install langchain tiktoken beautifulsoup4 python-dotenv wikipedia openai faiss-cpu

In [2]:
import os

import bs4
import dotenv
import tiktoken

from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader, WikipediaLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

In [3]:
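# Load OPENAI_API_KEY from a .env file; if none is found, fall back to a
# placeholder value that must be replaced with a real key.
if not dotenv.load_dotenv():
    os.environ["OPENAI_API_KEY"] = "openai-api-key"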

## Loading Documents
The first step of indexing is loading the documents, that is, the data we want the LLM to use when answering.

In our case we use two sources: the official NBA rulebook and a small collection of Wikipedia articles.

In [4]:
nba_rules_urls = [
    "https://official.nba.com/rule-no-1-court-dimensions-equipment/",
    "https://official.nba.com/rule-no-2-duties-of-the-officials/",
    "https://official.nba.com/rule-no-3-players-substitutes-and-coaches/",
    "https://official.nba.com/rule-no-4-definitions/",
    "https://official.nba.com/rule-no-5-scoring-and-timing/",
    "https://official.nba.com/rule-no-6-putting-ball-in-play-live-dead-ball/",
    "https://official.nba.com/rule-no-7-24-second-clock/",
    "https://official.nba.com/rule-no-8-out-of-bounds-and-throw-in/",
    "https://official.nba.com/rule-no-9-free-throws-and-penalties/",
    "https://official.nba.com/rule-no-10-violations-and-penalties/",
    "https://official.nba.com/rule-no-11-basket-interference-goaltending/",
    "https://official.nba.com/rule-no-12-fouls-and-penalties/",
    "https://official.nba.com/rule-no-13-instant-replay/",
    "https://official.nba.com/rule-no-14-coaches-challenge/"
]

For loading documents, LangChain offers *DocumentLoaders*. There are several loaders for different file types, such as PDF, HTML, Markdown, ...

In [5]:
loader = WebBaseLoader(
    web_paths=nba_rules_urls,
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="col-xs-12 col-md-9")),
)

nba_rules_docs = loader.load()

In [6]:
nba_rules_docs[0]

Document(page_content='\n\nRULE NO. 1: Court Dimensions – Equipment\n\n\nSection I—Court and Dimensions\n\nThe playing court shall be measured and marked as shown in the court (See below)\nA free throw lane shall be marked at each end of the court with dimensions and markings as shown on the court diagram.\xa0 \xa0All boundary lines are part of the lane; lane space marks and neutral zone marks are not. The areas identified by the lane space markings are 2” by 6” inches.\nA free throw line shall be drawn (2” wide) across each of the circles indicated in the court diagram.\xa0 It shall be parallel to the end line and shall be 15’ from the plane of the face of the backboard.\nThe three-point field goal area has parallel lines 3’ from the sidelines, extending from the baseline and an arc of 23’9” from the middle of the basket which intersects the parallel lines.\nFour hash marks shall be drawn (2” wide) perpendicular to the sideline on each side of the court and 28’ from the baseline.\xa0 

Every document has *page_content* and *metadata*. (The metadata is important for RAG systems because it provides additional information for filtering or for the LLM.)
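As a rough illustration of why metadata matters, the sketch below uses a small stand-in class (`SimpleDoc`, a hypothetical name, not LangChain's `Document`) with the same two fields and filters documents by their `source` entry. The example texts are placeholders, and `filter_by_source` is a helper invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in mirroring the two fields of a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_source(docs, prefix):
    """Keep only documents whose metadata 'source' starts with the given prefix."""
    return [d for d in docs if d.metadata.get("source", "").startswith(prefix)]


example_docs = [
    SimpleDoc("placeholder rule text",
              {"source": "https://official.nba.com/rule-no-4-definitions/"}),
    SimpleDoc("placeholder article text",
              {"source": "https://en.wikipedia.org/wiki/NBA_draft"}),
]

# Restrict the collection to rulebook pages only.
rules_only = filter_by_source(example_docs, "https://official.nba.com/")
print([d.metadata["source"] for d in rules_only])
```

The same idea could be used to answer rule questions only from the rulebook pages, or to show the user which page an answer came from.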

In [7]:
nba_wiki_docs = WikipediaLoader(query="NBA").load()
[doc.metadata["source"] for doc in nba_wiki_docs]

['https://en.wikipedia.org/wiki/National_Basketball_Association',
 'https://en.wikipedia.org/wiki/List_of_NBA_champions',
 'https://en.wikipedia.org/wiki/NBA_Finals',
 'https://en.wikipedia.org/wiki/NBA_play-in_tournament',
 'https://en.wikipedia.org/wiki/2024_NBA_playoffs',
 'https://en.wikipedia.org/wiki/LeBron_James',
 'https://en.wikipedia.org/wiki/NBA_playoffs',
 'https://en.wikipedia.org/wiki/2023%E2%80%9324_NBA_season',
 'https://en.wikipedia.org/wiki/2023_NBA_playoffs',
 'https://en.wikipedia.org/wiki/NBA_TV',
 'https://en.wikipedia.org/wiki/NBA_G_League',
 'https://en.wikipedia.org/wiki/NBA_draft',
 'https://en.wikipedia.org/wiki/Kobe_Bryant',
 'https://en.wikipedia.org/wiki/Stephen_Curry',
 'https://en.wikipedia.org/wiki/Michael_Jordan',
 'https://en.wikipedia.org/wiki/Women%27s_National_Basketball_Association',
 'https://en.wikipedia.org/wiki/List_of_people_banned_or_suspended_by_the_NBA',
 'https://en.wikipedia.org/wiki/List_of_current_NBA_team_rosters',
 'https://en.wikipe

In [8]:
docs = nba_rules_docs + nba_wiki_docs
for player in ["Luka Doncic", "Nikola Jokic"]:
    docs.append(WikipediaLoader(query=player, load_max_docs=1).load()[0])

In [9]:
len(docs)

41

## Splitting
Splitting is the process of dividing the original documents into smaller, more manageable units.

In [10]:
len(nba_rules_docs[11].page_content)

31958

In [11]:
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
len(encoding.encode(nba_rules_docs[11].page_content))

6716
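The two numbers above give a rough characters-per-token ratio, which helps translate a character-based chunk size into an approximate token count. The sketch below does that arithmetic for a 1,000-character chunk (the size used in the next cell). It assumes the ratio measured on this one page holds roughly for the other documents, which may not be exact.

```python
# Figures copied from the two cells above (the Rule No. 12 page).
n_chars = 31958
n_tokens = 6716

chars_per_token = n_chars / n_tokens

# Assumption: the same ratio applies to every chunk of every document.
chunk_size_chars = 1000
approx_tokens_per_chunk = chunk_size_chars / chars_per_token

print(round(chars_per_token, 2))       # 4.76
print(round(approx_tokens_per_chunk))  # 210
```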

In [12]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=400)
split_docs = splitter.split_documents(docs)

In [13]:
len(split_docs)

411
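To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); the function name `split_with_overlap` is made up for this sketch and only shows how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window repeats the last
    chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A large overlap, like the 400 out of 1000 characters used above, makes it less likely that a sentence relevant to a question is cut in half at a chunk boundary, at the cost of storing more chunks.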

## Embed & Store

In [14]:
vectorstore = FAISS.from_documents(documents=split_docs, embedding=OpenAIEmbeddings())


Save the index for later use.

In [15]:
vectorstore.save_local("../Data/nba_rules_faiss")

## Retrieval
This is where the LLM logic begins.

In [16]:
retriever = vectorstore.as_retriever(search_kwargs={'k': 4})
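Conceptually, a retriever with `k=4` embeds the query and returns the four stored chunks whose vectors lie closest to it. The sketch below shows that idea with toy two-dimensional vectors and Euclidean distance. Which distance metric the real index uses depends on how it was built, so the metric here is an assumption, and `top_k` and `l2_distance` are names made up for the illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k):
    """Return the texts of the k entries closest to query_vec, nearest first."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "embeddings"; real embedding vectors have far more dimensions.
toy_index = [
    ("chunk about fouls", [0.9, 0.1]),
    ("chunk about the draft", [0.1, 0.9]),
    ("chunk about free throws", [0.8, 0.3]),
]

print(top_k([1.0, 0.0], toy_index, k=2))
# ['chunk about fouls', 'chunk about free throws']
```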

In [17]:
retriever.get_relevant_documents("How old is Luka Doncic?")

[Document(page_content="Luka Dončić ( DON-chich; Slovene: [ˈlùːka ˈdòːntʃitʃ]; born 28 February 1999) is a Slovenian professional basketball player for the Dallas Mavericks of the National Basketball Association (NBA). Nicknamed “Luka Magic”, he also plays for the Slovenia national team and is regarded as one of the greatest European players of all time.\nBorn in Ljubljana, Dončić shone as a youth player for Union Olimpija before joining the youth academy of Real Madrid. In 2015 he made his debut for the academy's senior team at age 16, becoming the youngest in club history. He led Madrid to the 2018 EuroLeague title, winning the EuroLeague MVP and the Final Four MVP. Dončić was named the ACB Most Valuable Player and won back-to-back EuroLeague Rising Star and ACB Best Young Player awards. In addition, he was selected to the EuroLeague 2010–20 All-Decade Team.", metadata={'title': 'Luka Dončić', 'summary': "Luka Dončić ( DON-chich; Slovene: [ˈlùːka ˈdòːntʃitʃ]; born 28 February 1999) i

## Putting it all together

In [18]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [19]:
prompt = """You are a helpful assistant that answers NBA related question.
Use the following pieces of context to answer the user question at the end.
Your answers should be concise and to the point. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}
Answer:"""
rag_prompt_template = PromptTemplate.from_template(prompt)
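To see what the model eventually receives, the sketch below fills a shortened stand-in template with plain `str.format`. The placeholder chunk texts are invented, and the joining rule copies `format_docs` from the cell above. This only approximates what the prompt template does with its `{context}` and `{question}` placeholders.

```python
# Shortened stand-in for the prompt above; only the placeholders matter here.
template = "Context: {context}\nQuestion: {question}\nAnswer:"

# Placeholder texts standing in for the page_content of retrieved chunks.
retrieved_texts = ["first retrieved chunk", "second retrieved chunk"]

# Same joining rule as format_docs: chunks separated by a blank line.
context = "\n\n".join(retrieved_texts)

filled = template.format(
    context=context,
    question="How many referees are in a game?",
)
print(filled)
```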

In [20]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


In [21]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt_template
    | llm
    | StrOutputParser()
)
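The pipe syntax can be read as ordinary function composition: the dict at the head of the chain sends the question both to the retriever (for the context) and straight through (for the question), and each `|` passes the previous result to the next step. The plain-Python sketch below mirrors that flow with stand-ins. `fake_retriever` and `fake_llm` are hypothetical placeholders that make no model or index calls.

```python
def fake_retriever(question):
    # Stand-in: a real retriever would return the k nearest chunks.
    return ["chunk A", "chunk B"]


def format_texts(texts):
    # Same joining rule as format_docs, applied to plain strings.
    return "\n\n".join(texts)


def fill_prompt(inputs):
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}\nAnswer:"


def fake_llm(prompt_text):
    # Stand-in: reports the prompt length instead of calling a model.
    return f"(model reply to a {len(prompt_text)}-character prompt)"


def run_chain(question):
    # Step 1: the question is fanned out to both keys of the input dict.
    inputs = {
        "context": format_texts(fake_retriever(question)),
        "question": question,
    }
    # Remaining steps: each stage consumes the output of the previous one.
    return fake_llm(fill_prompt(inputs))


print(run_chain("How many referees are in a game?"))
```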

In [22]:
print(rag_chain.invoke("How old is Nikola Jokic?"))

Nikola Jokić was born on February 19, 1995, so he is currently 27 years old.


In [23]:
print(rag_chain.invoke("How many referees are in a game?"))

There are three referees in an NBA game.


In [24]:
print(rag_chain.invoke("Which team has won the most NBA titles?"))

The Los Angeles Lakers and the Boston Celtics have both won the most NBA titles, with 17 championships each.


In [25]:
print(rag_chain.invoke("What is the meaning of life?"))

I'm sorry, I am an assistant focused on answering NBA related questions. I do not have information on the meaning of life.


Our RAG chain is stateless: each call sees only the current question. To make it conversational, we would need to add memory.

In [26]:
print(rag_chain.invoke("How old is Nikola Jokic?"))
print(rag_chain.invoke("And how tall is he?"))

Nikola Jokić was born on February 19, 1995, so he is currently 27 years old.
Luka Dončić is 6 feet 7 inches tall.
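One simple way to add memory, sketched here under assumptions, is to keep the earlier turns and prepend them to each new question before it enters the chain. `ConversationalWrapper` is a hypothetical helper, not a LangChain class, and `recording_chain` is a stand-in for `rag_chain.invoke` that just records what it receives.

```python
class ConversationalWrapper:
    """Hypothetical helper: keeps past turns and adds them to each new question."""

    def __init__(self, chain_fn):
        self.chain_fn = chain_fn  # any callable that takes a question string
        self.history = []         # list of (question, answer) pairs

    def ask(self, question):
        if self.history:
            past = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.history)
            full_question = (
                f"Conversation so far:\n{past}\n\nFollow-up question: {question}"
            )
        else:
            full_question = question
        answer = self.chain_fn(full_question)
        self.history.append((question, answer))
        return answer


received = []


def recording_chain(question):
    # Stand-in for rag_chain.invoke: stores the input and returns a fixed reply.
    received.append(question)
    return "stand-in answer"


assistant = ConversationalWrapper(recording_chain)
assistant.ask("How old is Nikola Jokic?")
assistant.ask("And how tall is he?")
print(received[1])
```

Because the whole text is also used as the retrieval query, long histories can dilute the search. A common alternative is to have an LLM first rewrite the follow-up into a standalone question and retrieve with that.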


In [27]:
print(rag_chain.invoke("How old are Nikola Jokic and Luka Doncic?"))

Nikola Jokić was born on February 19, 1995, making him 27 years old. Luka Dončić was born on February 28, 1999, making him 23 years old.
