# Your data + LLM

- Many LLMs are released with pre-trained weights and fixed architectures.
- They are state of the art, and training your own would require an enormous amount of compute and storage.



### Let us look at how we can connect our database to an LLM

Set up your OpenAI API key by creating an account and adding billing details.

In [34]:
import os
os.environ['OPENAI_API_KEY'] = "YOUR_KEY" # replace with your own API Key
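
A key written directly into a notebook is easy to leak when the notebook is shared. As a minimal sketch of an alternative, assuming the key was exported in the shell before the notebook was started, it can be read from the environment instead. `require_env` is a hypothetical helper written for this example, not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes OPENAI_API_KEY was exported before starting the notebook):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early with a named variable tends to be easier to debug than an authentication error raised later by the API client.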

In [35]:
from langchain import OpenAI, ConversationChain, LLMChain, PromptTemplate
from langchain.memory import ConversationBufferWindowMemory
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

Let us define a prompt template with:

- What the AI does:     
It tells us how many goals a player has scored in the Premier League for a given season
- Context/Background Information:    
Keep the answer relevant only to what the LLM knows
- Question:    
User input, for example: Mohamed Salah 17/18, Eden Hazard 14/15
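
The idea of a prompt template can be sketched with plain string formatting: a fixed instruction plus a named slot that is filled with the user's input. This is only an illustration of the concept, not how LangChain implements it; `demo_template` and `fill` are hypothetical names used for this example:

```python
# Stand-in for a prompt template: a fixed instruction plus a {question} slot.
demo_template = (
    "Tell me how many goals this player has scored in the Premier League "
    "for a given season\n\n"
    "Question: {question}\n"
)


def fill(template: str, **values: str) -> str:
    """Substitute the named slots, as a prompt template does before the model is called."""
    return template.format(**values)


filled = fill(demo_template, question="Mohamed Salah 17/18")
print(filled)
```

The instruction stays the same for every call; only the text in the `{question}` slot changes.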

In [36]:
prompt_template = """Tell me how many goals this player has scored in the Premier League for a given season

Please answer the question by using your own knowledge about the topic

Question: {question}
"""

In [37]:
prompt = PromptTemplate(input_variables=["question"], template=prompt_template)

chatgpt_chain = LLMChain(
    llm=ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0),
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=2),
)

while True:
    # Check the user's input (not the model's reply) for the stop word
    human_input = input('Enter player name and the season')
    if human_input == 'stop':
        break
    print(chatgpt_chain.predict(question=human_input))

Enter player name and the season Salah 17/18

> Entering new LLMChain chain...
Prompt after formatting:
Tell me how many goals this player has scored in the Premier League for a given season

Please answer the question by using your own knowledge about the topic

Question: Salah 17/18

> Finished chain.
In the 2017/2018 Premier League season, Mohamed Salah scored a total of 32 goals.

Enter player name and the season Erling Haaland 22/23

> Entering new LLMChain chain...
Prompt after formatting:
Tell me how many goals this player has scored in the Premier League for a given season

Please answer the question by using your own knowledge about the topic

Question: Erling Haaland 22/23

> Finished chain.
As an AI language model, I don't have real-time information or access to current statistics. Therefore, I cannot provide you with the exact number of goals Erling Haaland has scored in the Premier League for the 2022/2023 season. To find the accurate and up-to-date information, I recommend checking reliable sports websites, news outlets, or official Premier League sources.

Enter player name and the season stop

The gpt-3.5-turbo model used here was trained on data up to September 2021, so it has no knowledge of later events. For questions about anything after that, it may decline to answer, as with the 22/23 question above, or give an inaccurate answer. How can we solve this problem?

## Retrieval Augmentation

Let us connect an external database with stats from the 22/23 season

In [38]:
import pandas as pd

df = pd.read_csv('stats.csv')
# Build one sentence per player so that each row can be embedded as text
df['Goals'] = df['Goals'].astype(str)
df['text'] = df['Player'] + ' scored ' + df['Goals'] + ' goals in 22/23 season'
df_text = df['text']
df.head()

|   | Player | Goals | text |
|---|--------|-------|------|
| 0 | Erling Haaland | 36 | Erling Haaland scored 36 goals in 22/23 season |
| 1 | Harry Kane | 30 | Harry Kane scored 30 goals in 22/23 season |
| 2 | Ivan Toney | 20 | Ivan Toney scored 20 goals in 22/23 season |
| 3 | Mohamed Salah | 19 | Mohamed Salah scored 19 goals in 22/23 season |
| 4 | Callum Wilson | 18 | Callum Wilson scored 18 goals in 22/23 season |


In [39]:
chunks = list(df_text)
chunks[:3]

['Erling Haaland scored 36 goals in 22/23 season',
 'Harry Kane scored 30 goals in 22/23 season',
 'Ivan Toney scored 20 goals in 22/23 season']

Let us convert each text chunk into an embedding, a vector of numbers, so that the chunks most relevant to a question can be found by similarity search and passed to our LLM.

In [10]:
from langchain.embeddings.openai import OpenAIEmbeddings
model_name = 'text-embedding-ada-002'
embed = OpenAIEmbeddings(
    model=model_name,
)
res = embed.embed_documents(chunks)

In [16]:
len(res[0]), res[0][:4]

(1536,
 [-0.014846178470043785,
  0.011162775393497753,
  0.028516670241503065,
  0.013382822996481001])
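
To see why vectors are useful here, the sketch below replaces the real embedding model with a toy one (word counts over a tiny, hand-picked vocabulary) and compares texts by cosine similarity. The 1536-dimensional vectors above are produced by a learned model, not by word counts; `toy_embed`, `cosine`, and the vocabulary are assumptions made purely for illustration:

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Toy stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vocab = ["haaland", "kane", "scored", "goals", "season"]
docs = [
    "Erling Haaland scored 36 goals in 22/23 season",
    "Harry Kane scored 30 goals in 22/23 season",
]
query_vector = toy_embed("Haaland goals", vocab)

# Score every document against the query and keep the closest one
scores = [cosine(query_vector, toy_embed(doc, vocab)) for doc in docs]
best = docs[scores.index(max(scores))]
print(best)
```

The document that shares more of the query's words ends up with the higher cosine score, which is the same principle the vector database applies to the real embeddings.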

### Vector Database - Pinecone

- Vector databases store and manage embeddings efficiently.
- They can also search or cluster embeddings using a similarity metric such as cosine similarity.
- Let us store all of our embeddings in a Pinecone database.
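
Conceptually, a vector database offers two operations: upsert (insert or overwrite a vector by ID) and query (return the stored vectors nearest to a given one). The following is a brute-force, in-memory sketch of that idea under the assumption of cosine similarity. `ToyVectorStore` is a hypothetical class written for this example; it is not Pinecone's API, and a real service uses approximate search that scales far better:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._items = {}  # id -> (vector, metadata)

    def upsert(self, vectors):
        """Insert or overwrite (id, vector, metadata) triples."""
        for vec_id, vector, metadata in vectors:
            self._items[vec_id] = (list(vector), metadata)

    def query(self, vector, top_k=1):
        """Return the top_k (score, id, metadata) triples by cosine similarity."""
        scored = [
            (cosine(vector, stored), vec_id, metadata)
            for vec_id, (stored, metadata) in self._items.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.upsert([
    ("a", [1.0, 0.0], {"text": "Erling Haaland scored 36 goals in 22/23 season"}),
    ("b", [0.0, 1.0], {"text": "Harry Kane scored 30 goals in 22/23 season"}),
])
hits = store.query([0.9, 0.1], top_k=1)
print(hits[0][2]["text"])
```

The two-dimensional vectors are made up so the result is easy to follow by eye; the metadata dictionary plays the same role as the `text` field stored alongside each embedding.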

In [21]:
import pinecone
index_name = 'premstats'

pinecone.init(
    api_key='YOUR_PINECONE_API_KEY',  # replace with your own Pinecone API key
    environment='us-west4-gcp-free'
)

In [22]:
pinecone.whoami()

WhoAmIResponse(username='a3a1ead', user_label='epl', projectname='b99002e')

In [23]:
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )

In [40]:
index = pinecone.GRPCIndex(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 20}},
 'total_vector_count': 20}

In [26]:
player_name = list(df['Player'])
text = list(df['text'])
chunks = text

In [27]:
from uuid import uuid4

# One metadata record per chunk; the "text" field is what LangChain reads back later
texts = list(text)
metadatas = [{"chunk": j, "text": t} for j, t in enumerate(texts)]

if texts:
    # Random IDs: re-running this cell adds duplicate vectors instead of overwriting
    ids = [str(uuid4()) for _ in texts]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
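
With only 20 rows, everything fits in a single upsert call. For larger datasets the texts are usually embedded and upserted in batches. A minimal sketch of the batching logic alone, where `batched` and `demo_texts` are hypothetical names introduced for this example:

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


demo_texts = [f"chunk {i}" for i in range(25)]
batch_sizes = [len(batch) for batch in batched(demo_texts, 10)]
print(batch_sizes)  # [10, 10, 5]
```

In the notebook, each batch would be passed to `embed.embed_documents` and then to `index.upsert`, in the same way as the single call above.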

In [41]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 20}},
 'total_vector_count': 20}

In [42]:
from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

In [43]:
prompt_template = """Tell me how many goals this player scored in the Premier League for a given season

If the context below does not contain the answer, please answer the question by using your own knowledge about the English Premier League.
If the player does not have stats, maybe you could check when they joined and left the club.

{context}
Question: {question}
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}

# completion llm
llm = ChatOpenAI(
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs=chain_type_kwargs,
)
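
With `chain_type="stuff"`, the texts returned by the retriever are placed ("stuffed") into the `{context}` slot of the prompt before the model is called. A rough sketch of that step in plain Python, not LangChain's actual implementation; `build_stuffed_prompt`, `demo_template`, and the choice of separator are assumptions made for illustration:

```python
def build_stuffed_prompt(template, docs, question, separator="\n\n"):
    """Join the retrieved texts and place them in the {context} slot of the template."""
    context = separator.join(docs)
    return template.format(context=context, question=question)


demo_template = (
    "Tell me how many goals this player scored in the Premier League\n\n"
    "{context}\n"
    "Question: {question}\n"
)

# Pretend these two texts came back from the vector store for this question
retrieved = [
    "Erling Haaland scored 36 goals in 22/23 season",
    "Harry Kane scored 30 goals in 22/23 season",
]

prompt_text = build_stuffed_prompt(demo_template, retrieved, "Erling Haaland 22/23")
print(prompt_text)
```

Because the answer now sits inside the prompt itself, the model can respond correctly even though the 22/23 season is not in its training data.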


In [44]:
while True:
    human_input = input("Enter player name and season")
    if(human_input == 'stop'):
        break
    print(qa.run(human_input))

Enter player name and season Salah 17/18


Mohamed Salah scored 32 goals in the Premier League for the 17/18 season.


Enter player name and season Erling Haaland 22/23


Erling Haaland scored 36 goals in the Premier League for the 22/23 season.


Enter player name and season stop


#### Conclusion

- LLMs are powerful.
- An LLM only knows what it was trained on, so it may hallucinate when asked about anything else.
- It can give an inaccurate answer or no answer at all.
- Connect the LLM to an external or domain-specific dataset to leverage it fully.