# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

---
**Dataset Choice**

I choose to build a dataset from the wikipedia page about Sporting Clube the Portugal, the club I support, because I wanted a chatbot that was able to respond to questions about the club. A typical use case for this would be for the club to have a chatbot supported by a LLM in their website.


## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
# Install the open AI library compatibale with the legacy version and also tiketoken
!pip install openai==0.28
!pip install tiktoken

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.5/76.5 kB[0m [31m1.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: openai
Successfully installed openai-0.28.0
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [2]:
# Import libraries
import requests
import pandas as pd
import string
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import numpy as np
import tiktoken


In [7]:
openai.api_key = ""

In [3]:
# create a pandas dataframe
df = pd.DataFrame()

# Get the Wikipedia page for "Sporting_CP" ##since OpenAI's models stop in 2021
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": "Sporting_CP",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

# Add the column "text" with information form the page to the dataframe
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
#response_dict["query"]["pages"][0]["extract"].split("\n")
df


Unnamed: 0,text
0,Sporting Clube de Portugal (Portuguese pronunc...
1,"Founded on 1 July 1906, Sporting is one of the..."
2,Sporting is the third most decorated Portugues...
3,
4,
...,...
389,== External links ==
390,
391,Official website (in Portuguese and English)
392,Sporting CP at LPFP (in English and Portuguese)


In [4]:
# Clean the data

# delelete the empty lines
df = df[df["text"].str.len() > 0]


# Remove the lines starting with "=="
df = df[~df["text"].str.startswith("==")]
df



Unnamed: 0,text
0,Sporting Clube de Portugal (Portuguese pronunc...
1,"Founded on 1 July 1906, Sporting is one of the..."
2,Sporting is the third most decorated Portugues...
10,Sporting Clube de Portugal has its origins in ...
12,The club also organized parties and picnics. E...
...,...
379,"Secretaries: Miguel de Castro, Luís Pereira, T..."
380,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui..."
391,Official website (in Portuguese and English)
392,Sporting CP at LPFP (in English and Portuguese)


In [5]:
# reset the dataframe index
df.reset_index(inplace=True,drop=True)
df

Unnamed: 0,text
0,Sporting Clube de Portugal (Portuguese pronunc...
1,"Founded on 1 July 1906, Sporting is one of the..."
2,Sporting is the third most decorated Portugues...
3,Sporting Clube de Portugal has its origins in ...
4,The club also organized parties and picnics. E...
...,...
161,"Secretaries: Miguel de Castro, Luís Pereira, T..."
162,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui..."
163,Official website (in Portuguese and English)
164,Sporting CP at LPFP (in English and Portuguese)


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [8]:
# Create the embeddings Index
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input = df["text"].tolist(),
    model = EMBEDDING_MODEL_NAME)

embeddings = [data["embedding"] for data in response["data"]]
embeddings
# check the type and len of embeddings
#type(embeddings)
#len(embeddings)

[[-0.0077106766402721405,
  0.014129783026874065,
  0.019334813579916954,
  -0.019618958234786987,
  -0.029731957241892815,
  0.043655090034008026,
  0.0035970243625342846,
  0.010016130283474922,
  0.01037777028977871,
  -0.04554077982902527,
  0.019438138231635094,
  0.01258635614067316,
  -0.023222440853714943,
  -0.012386162765324116,
  0.01477556861937046,
  -0.013522745110094547,
  0.01646752655506134,
  -0.0215950608253479,
  0.009254103526473045,
  -0.025056470185518265,
  -0.006800119765102863,
  -0.004000640008598566,
  -0.008020653389394283,
  -0.016286706551909447,
  -0.0015264750691130757,
  -0.02965446189045906,
  0.011482062749564648,
  -0.005721658002585173,
  0.019722284749150276,
  -0.016338368877768517,
  -0.0044526900164783,
  -0.0008750391425564885,
  0.004723919555544853,
  0.016687093302607536,
  -0.016299622133374214,
  -0.008052943274378777,
  -0.006422335281968117,
  -0.00494994455948472,
  0.009589912369847298,
  -0.02249916084110737,
  0.018546955659985542,


In [9]:
#add the embeddings to the datframe

df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,Sporting Clube de Portugal (Portuguese pronunc...,"[-0.0077106766402721405, 0.014129783026874065,..."
1,"Founded on 1 July 1906, Sporting is one of the...","[-0.00791094359010458, 0.019953303039073944, 0..."
2,Sporting is the third most decorated Portugues...,"[-0.01409248635172844, 0.009561830200254917, 0..."
3,Sporting Clube de Portugal has its origins in ...,"[-0.011116445064544678, 0.0033065220341086388,..."
4,The club also organized parties and picnics. E...,"[-0.014874421991407871, 0.010217789560556412, ..."
...,...,...
161,"Secretaries: Miguel de Castro, Luís Pereira, T...","[-0.0013719497947022319, -0.002953538205474615..."
162,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui...","[-0.010881505906581879, 0.002510081510990858, ..."
163,Official website (in Portuguese and English),"[-0.0049140783958137035, 0.0027945355977863073..."
164,Sporting CP at LPFP (in English and Portuguese),"[0.0027724900282919407, 0.03402586281299591, 0..."


In [19]:
# save the dataframe for future use and every user question
df.to_csv("sporting_embeddings.csv")

In [20]:
# load the dataframe and transform the text of the embeddings column in nparray's
df = pd.read_csv("sporting_embeddings.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
df

Unnamed: 0,text,embeddings
0,Sporting Clube de Portugal (Portuguese pronunc...,"[-0.0077106766402721405, 0.014129783026874065,..."
1,"Founded on 1 July 1906, Sporting is one of the...","[-0.00791094359010458, 0.019953303039073944, 0..."
2,Sporting is the third most decorated Portugues...,"[-0.01409248635172844, 0.009561830200254917, 0..."
3,Sporting Clube de Portugal has its origins in ...,"[-0.011116445064544678, 0.0033065220341086388,..."
4,The club also organized parties and picnics. E...,"[-0.014874421991407871, 0.010217789560556412, ..."
...,...,...
161,"Secretaries: Miguel de Castro, Luís Pereira, T...","[-0.0013719497947022319, -0.002953538205474615..."
162,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui...","[-0.010881505906581879, 0.002510081510990858, ..."
163,Official website (in Portuguese and English),"[-0.0049140783958137035, 0.0027945355977863073..."
164,Sporting CP at LPFP (in English and Portuguese),"[0.0027724900282919407, 0.03402586281299591, 0..."


In [21]:
#define our first question
question1 = "When were the golden years of Sporting?"

# get the embeddings for the question
question1_embeddings = get_embedding(question1,engine=EMBEDDING_MODEL_NAME)
question1_embeddings

[-0.001195858814753592,
 -0.003209763905033469,
 0.012615653686225414,
 -0.016492338851094246,
 -0.03214363381266594,
 0.008371012285351753,
 0.009849408641457558,
 0.004293921869248152,
 -0.01653176359832287,
 -0.033300068229436874,
 0.012090001255273819,
 0.02035588212311268,
 -0.004816288594156504,
 0.010184512473642826,
 0.015598730184137821,
 0.008962370455265045,
 0.02335210144519806,
 -0.007431408390402794,
 -0.00812789797782898,
 -0.0013905144296586514,
 -0.022103676572442055,
 0.02156488224864006,
 0.0038931118324398994,
 -0.013719523325562477,
 -0.0026676850393414497,
 0.012517093680799007,
 0.013193870894610882,
 -0.016676317900419235,
 0.012977039441466331,
 -0.022011687979102135,
 0.008508995175361633,
 -0.009941398166120052,
 -0.019396567717194557,
 0.01897604577243328,
 -0.03319494053721428,
 -0.038241200149059296,
 -0.005525919143110514,
 0.01666317507624626,
 0.00927119143307209,
 -0.004468044266104698,
 -0.006311112083494663,
 0.01671574078500271,
 -0.0088046751916408

In [22]:
# get the distances from our question to the dataframe embeddings using the cosine similarity

distances = distances_from_embeddings(question1_embeddings, df["embeddings"].tolist(), distance_metric="cosine")
distances

[0.19418981393162804,
 0.1668128193360051,
 0.1710763770652064,
 0.1701259641326348,
 0.17490915298507848,
 0.16461123893188756,
 0.16332478706984577,
 0.16447409484638598,
 0.16151509415621035,
 0.17455537775976204,
 0.1699999164528485,
 0.1993444003649768,
 0.21119494005737305,
 0.16096025987771279,
 0.1505745764530041,
 0.17922675455049975,
 0.1500212376464406,
 0.2207179149759182,
 0.17797204391047583,
 0.14899673401934466,
 0.17756997624158521,
 0.19103729381027956,
 0.17617037783951373,
 0.18032936002802435,
 0.17386025085814572,
 0.1911263076608587,
 0.1833058655297527,
 0.21478632910845696,
 0.20158825711896378,
 0.2159330668220586,
 0.18081615919353677,
 0.16993153253225124,
 0.18322661761821069,
 0.18455332837061433,
 0.1833622661625729,
 0.19542382807678949,
 0.20536509129842062,
 0.2233885279987693,
 0.18841039412578442,
 0.21222030293790783,
 0.21581841930877443,
 0.19563109522004984,
 0.18745682704698596,
 0.16656778256042404,
 0.1789869570218633,
 0.19670083352884038,
 0

In [23]:
# add the distances to the dataframe

df["distances"] = distances
df

Unnamed: 0,text,embeddings,distances
0,Sporting Clube de Portugal (Portuguese pronunc...,"[-0.0077106766402721405, 0.014129783026874065,...",0.194190
1,"Founded on 1 July 1906, Sporting is one of the...","[-0.00791094359010458, 0.019953303039073944, 0...",0.166813
2,Sporting is the third most decorated Portugues...,"[-0.01409248635172844, 0.009561830200254917, 0...",0.171076
3,Sporting Clube de Portugal has its origins in ...,"[-0.011116445064544678, 0.0033065220341086388,...",0.170126
4,The club also organized parties and picnics. E...,"[-0.014874421991407871, 0.010217789560556412, ...",0.174909
...,...,...,...
161,"Secretaries: Miguel de Castro, Luís Pereira, T...","[-0.0013719497947022319, -0.002953538205474615...",0.241377
162,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui...","[-0.010881505906581879, 0.002510081510990858, ...",0.251405
163,Official website (in Portuguese and English),"[-0.0049140783958137035, 0.0027945355977863073...",0.264371
164,Sporting CP at LPFP (in English and Portuguese),"[0.0027724900282919407, 0.03402586281299591, 0...",0.203241


In [24]:
# sort the dataframe by distances
df = df.sort_values(by="distances")

# save the dataframe for future use
df.to_csv("sporting_distances_sorted.csv")
df

Unnamed: 0,text,embeddings,distances
19,"Domestically, Sporting had back-to-back wins i...","[-0.02016436494886875, 0.004032873082906008, 0...",0.148997
16,"In 2000, Sporting, led by manager Augusto Inác...","[-0.0054290262050926685, 0.014657382853329182,...",0.150021
14,Highlights of this period of time also include...,"[-0.015408716164529324, 0.0008742476929910481,...",0.150575
13,English manager Malcolm Allison arrived at Spo...,"[-0.011266198940575123, 0.000316598336212337, ...",0.160960
8,The football team had their height during the ...,"[-0.02185836434364319, 0.01282205805182457, 0....",0.161515
...,...,...,...
51,) net income was €25.2 million for a record-br...,"[-0.024319594725966454, -0.018804073333740234,...",0.272438
77,Ivaylo Yordanov – 1998,"[-0.00696224719285965, -0.016699116677045822, ...",0.273836
76,Krasimir Balakov – 1995,"[0.001955274725332856, -0.003372971899807453, ...",0.278328
117,Krasimir Balakov – 1994 United States,"[-0.0007707463228143752, -0.019931858405470848...",0.280139


In [25]:
# Define the base tokenizer and max number of tokens to use
tokenizer = tiktoken.get_encoding("cl100k_base")
max_token_count = 1000

In [26]:
# compose a custom prompt
prompt_template = """
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{}

---

Question: {}
Answer:"""

prompt = prompt_template.format("context",question1)
print(prompt)


Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

context

---

Question: When were the golden years of Sporting?
Answer:


In [27]:
# create the context and make sure it does not exceeds the max token count

current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question1))
current_token_count
context = []
for text in df["text"].values:
    text_token_count = len(tokenizer.encode(text))
    current_token_count += text_token_count

    if current_token_count <= max_token_count:
        #print(current_token_count)
        context.append(text)
    else:
        break


In [28]:
#check the content of context
context

['Domestically, Sporting had back-to-back wins in the Portuguese Cup in 2007 and 2008 (led by coach Paulo Bento). Sporting also reached, for the first time, the knockout phase of UEFA Champions League, in the 2008–09 season, but were roundly defeated by Bayern Munich, with an aggregate loss of 12–1. This is widely regarded as one of the lowest points in the history of the club. The club almost reached another European final in 2012, but were dropped out of the competition by Athletic Bilbao, in the semi-finals of the 2011–12 Europa League.',
 'In 2000, Sporting, led by manager Augusto Inácio (a former Sporting player, who replaced Giuseppe Materazzi at the beginning of the season), won the league title on the last match day, with a 4–0 victory over Salgueiros, ending an 18-year drought. In the following season, Sporting conquered the 2000 Super Cup but came third in the league. In the 2001–02 season, led by coach László Bölöni, Sporting conquered their 18th league title, the Portuguese

In [29]:
# define a function to reveice a prompt template, a context and a text and return the complete custom prompt
def create_prompt(prompt_template,context, question):
  return prompt_template.format("\n".join(context),question)


In [30]:
#lets text our first custom promt text
my_prompt = create_prompt(prompt_template,context,question1)
print(my_prompt)


Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

Domestically, Sporting had back-to-back wins in the Portuguese Cup in 2007 and 2008 (led by coach Paulo Bento). Sporting also reached, for the first time, the knockout phase of UEFA Champions League, in the 2008–09 season, but were roundly defeated by Bayern Munich, with an aggregate loss of 12–1. This is widely regarded as one of the lowest points in the history of the club. The club almost reached another European final in 2012, but were dropped out of the competition by Athletic Bilbao, in the semi-finals of the 2011–12 Europa League.
In 2000, Sporting, led by manager Augusto Inácio (a former Sporting player, who replaced Giuseppe Materazzi at the beginning of the season), won the league title on the last match day, with a 4–0 victory over Salgueiros, ending an 18-year drought. In the following season, Sporting conquered the 2000 Super Cup but ca

In [32]:
# test our first custom question
final_response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt = my_prompt
    )

# the response is in the dictionary in the key "choices", let's get the text form the index 0
final_response["choices"][0]["text"]

' The golden years of Sporting were between the 1940s and 1950'

**text in the wiki page under "Golden years and fading (1946–1982)" : The football team had their height during the 1940s and 1950s.**

In [33]:
#now lets text the same question without the context dataframe
sporting_prompt = """
Question: "When were the golden years of Sporting?"
Answer:
"""
initial_sporting_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=sporting_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_sporting_answer)

The golden years of Sporting, or Sporting Clube de Portugal, are generally considered to be from the 1940s to the 1960s. During this time period, the club won numerous trophies and titles, including 8 national championships and 5 Portuguese Cups. Several notable players, such as José Travassos, João Azevedo, and Fernando Peyroteo, were part of the team during this era. Additionally, Sporting had a strong presence in European competitions, reaching the finals of the UEFA Cup Winners' Cup in 1964.


# **RESULT FROM 1st TEST**

***Response with context***: The golden years of Sporting were during the 1940s and 1950 - CORRECT


***Response without the context***: The golden years of Sporting, or Sporting Clube de Portugal, are generally considered to be from the 1940s to the 1960s. During this time period, the club won numerous trophies and titles, including 8 national championships and 5 Portuguese Cups. Several notable players, such as José Travassos, João Azevedo, and Fernando Peyroteo, were part of the team during this era. Additionally, Sporting had a strong presence in European competitions, reaching the finals of the UEFA Cup Winners' Cup in 1964. - INCORRECT


**As we can see from the wikipedia page the correct response is the 1st one, and so the performace with the context included performed correctly.**

**LET'S MAKE SOME REUSABLE CODE**

Until now I have been running the code almost line by line and with just a few helper functions.

But there is a big problem, if we want to test the model with other questions we need to create the embeddings for the new question and calculate the new distances, and there is no reusable code for that, so let's make it.

In [36]:
# Funtions that finds the pieces of text for a given question
def get_relevante_text(question, df):
  """
  Function that takes in a question string and a dataframe containing
  rows of text and associated embeddings, and returns that dataframe
  sorted from least to most relevant for that question
  """

  # get the embeddings
  question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

  # add the distances column to the dataframe and make
  # based on the cosine similarity
  df_copy = df.copy()
  df_copy["distances"] = distances_from_embeddings(
      question_embeddings,
      df_copy["embeddings"].tolist(),
      distance_metric="cosine"
  )

  # sort the dataframe by the distances column
  df_copy = df_copy.sort_values(by="distances")
  return df_copy

In [63]:
# lets load the sporting embeding and test two questions

df = pd.read_csv("sporting_embeddings.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)

sporting_question_1 = "When were the golden years of Sporting?"
sporting_question_2 = "Who was the first president of Sporting?"

In [65]:
#check results for question 1
get_relevante_text(sporting_question_1, df)


Unnamed: 0,text,embeddings,distances
19,"Domestically, Sporting had back-to-back wins i...","[-0.02016436494886875, 0.004032873082906008, 0...",0.148997
16,"In 2000, Sporting, led by manager Augusto Inác...","[-0.0054290262050926685, 0.014657382853329182,...",0.150021
14,Highlights of this period of time also include...,"[-0.015408716164529324, 0.0008742476929910481,...",0.150575
13,English manager Malcolm Allison arrived at Spo...,"[-0.011266198940575123, 0.000316598336212337, ...",0.160960
8,The football team had their height during the ...,"[-0.02185836434364319, 0.01282205805182457, 0....",0.161515
...,...,...,...
51,) net income was €25.2 million for a record-br...,"[-0.024319594725966454, -0.018804073333740234,...",0.272438
77,Ivaylo Yordanov – 1998,"[-0.00696224719285965, -0.016699116677045822, ...",0.273836
76,Krasimir Balakov – 1995,"[0.001955274725332856, -0.003372971899807453, ...",0.278328
117,Krasimir Balakov – 1994 United States,"[-0.0007707463228143752, -0.019931858405470848...",0.280139


In [64]:
#check results for question 2
get_relevante_text(sporting_question_2, df)

Unnamed: 0,text,embeddings,distances
4,The club also organized parties and picnics. E...,"[-0.014874421991407871, 0.010217789560556412, ...",0.134886
3,Sporting Clube de Portugal has its origins in ...,"[-0.011116445064544678, 0.0033065220341086388,...",0.140602
5,"The year 1907 marked some ""firsts"" for the clu...","[-0.015808388590812683, 0.004282540176063776, ...",0.141500
7,Sporting played their first Primeira Liga game...,"[-0.006334827747195959, 0.004593885038048029, ...",0.141996
9,Sporting and the Yugoslavian team Partizan bot...,"[-0.012343442998826504, -0.009965997189283371,...",0.145729
...,...,...,...
79,Islam Slimani – 2013,"[-0.014562901109457016, 0.009879589080810547, ...",0.286726
122,Hilário (474),"[-0.04050804302096367, -0.010751191526651382, ...",0.288979
77,Ivaylo Yordanov – 1998,"[-0.00696224719285965, -0.016699116677045822, ...",0.293735
76,Krasimir Balakov – 1995,"[0.001955274725332856, -0.003372971899807453, ...",0.297089


In [66]:
# crate a function that builds a text prompt and allows to define the maximum number of tokens to use in the prompt
def create_final_prompt(question, df, max_token_count):
  """
  Given a question and a dataframe containing rows of text and their
  embeddings, return a text prompt to send to a completion model
  """

  # create the base tokenizer aligned with our embeddings
  tokenizer = tiktoken.get_encoding("cl100k_base")

  # Count the number of tokens in the prompt template and question
  prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

{}

---

Question: {}
Answer:"""

  current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
  context=[]
  for text in get_relevante_text(question, df)["text"].values:
     text_token_count = len(tokenizer.encode(text))
     current_token_count += text_token_count

     # Add the row of text to the list if we haven't exceeded the max
     if current_token_count <= max_token_count:
          context.append(text)
     else:
          break

  return prompt_template.format("\n".join(context), question)

In [67]:
# now let's test it with question 1
print(create_final_prompt(sporting_question_1, df, 1000))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

Domestically, Sporting had back-to-back wins in the Portuguese Cup in 2007 and 2008 (led by coach Paulo Bento). Sporting also reached, for the first time, the knockout phase of UEFA Champions League, in the 2008–09 season, but were roundly defeated by Bayern Munich, with an aggregate loss of 12–1. This is widely regarded as one of the lowest points in the history of the club. The club almost reached another European final in 2012, but were dropped out of the competition by Athletic Bilbao, in the semi-finals of the 2011–12 Europa League.
In 2000, Sporting, led by manager Augusto Inácio (a former Sporting player, who replaced Giuseppe Materazzi at the beginning of the season), won the league title on the last match day, with a 4–0 victory over Salgueiros, ending an 18-year drought. In the following season, Sporting conquered the 2000 Super Cup but ca

In [68]:
# now let's test it with question 2
print(create_final_prompt(sporting_question_2, df,1000))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

The club also organized parties and picnics. Eventually, during one picnic, on 12 April 1906, discussions erupted, as some members defended that the club should only be focused on organizing picnics and social events, with another group defending that the club should be focused on the practising of sports instead. Some time later, José Gavazzo, José Alvalade and 17 other members left the club, with José Alvalade saying: "I'll go to my grandad and he'll give me money to make another club." As such, a new club, without a name, was founded on 8 May 1906, and on 26 May, it was named "Campo Grande Sporting Clube". The Viscount of Alvalade, whose money and land helped found the club, was the first president of Sporting. José Alvalade, as one of the main founders and first club member (sócio), uttered on behalf of himself and his fellow co-founders: "We wa

In [69]:
#create a function to anwer questions

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(question, df, max_prompt_tokens=1800, max_answer_tokens = 150):
  """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """
  prompt = create_prompt(question, df, max_prompt_tokens)

  try:
    response = openai.Completion.create(
            model = COMPLETION_MODEL_NAME,
            prompt = prompt,
            max_tokens = max_answer_tokens
        )
    return response["choices"][0]["text"].strip()
  except Exception as e:
      print(e)
      return ""


In [70]:
# with all the reusable code built let's text it
custom_sporting_question_1 = sporting_question_1
# "When were the golden years of Sporting?"
custom_sporting_question_2 = sporting_question_2
#"Who was the first president of Sporting?"



In [74]:
custom_sporting_answer_1 = answer_question(custom_sporting_question_1, df)
print(custom_sporting_answer_1)

The golden years of Sporting may vary depending on the sport or country in question. Generally, the phrase refers to a period of sustained success or dominance in a specific sport.

For example, in football (soccer), the golden years of Sporting may be considered to be the 1940s and 1950s when the club won multiple league titles and competed in European competitions. In basketball, the golden years may refer to the period from the late 1970s to the early 1990s when the team won multiple league championships. In athletics, the golden years may refer to the 1980s and 1990s when athletes from Sporting won numerous Olympic medals.

Overall, the golden years of Sporting can be seen as a


In [75]:
custom_sporting_answer_2 = answer_question(custom_sporting_question_2, df)
print(custom_sporting_answer_2)

The first president of Sporting was José Alvalade, who founded the club in 1906.


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [51]:
sporting_prompt_1 = """
Question: "When were the golden years of Sporting?"
Answer:
"""
initial_sporting_answer_1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=sporting_prompt_1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_sporting_answer_1)

custom_sporting_answer_1 = answer_question(custom_sporting_question_1, df)
print(custom_sporting_answer_1)

The golden years of Sporting can be considered to have started in the early 2000s and continued until around 2012. During this time, Sporting achieved great success in competitions such as the Primeira Liga, the Portuguese Cup, and the UEFA Europa League. They also achieved their highest ever league finish in 2007, coming in second place. This period can be considered the "golden years" due to the consistent success and strong performances of the team.
The golden years of Sporting are considered to be between the late 1940s and early 1960s. During this time, the club won several league titles and national cups, including the Primeira Liga title in 1950, 1951, 1952, 1953, 1958, and 1962, and the Taça de Portugal in 1945, 1946, 1948, 1954, and 1963. Additionally, during this period, Sporting had some of their greatest players, such as Vasques, Albano, and Peyroteo, and was considered one of the top teams in Europe.


### Question 2

In [76]:
sporting_prompt_2 = """
Question: "Who was the first president of Sporting?"
Answer:
"""
initial_sporting_answer_2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=sporting_prompt_2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_sporting_answer_2)

custom_sporting_answer_2 = answer_question(custom_sporting_question_2, df)
print(custom_sporting_answer_2)

The first president of Sporting Clube de Portugal was his Royal Highness Prince Dom Carlos I of Braganza. He served as president of the club from 1906 until his assassination in 1908.
The first president of Sporting Club de Portugal was José Alvalade, who founded the club in 1906. He served as president from 1906 to 1910.


In [79]:
print(f"""
When were the golden years of Sporting??

Original Answer: {initial_sporting_answer_1}
Custom Answer:   {custom_sporting_answer_1}

Who founded Sporting Clube the Portugal?

Original Answer: {initial_sporting_answer_2}
Custom Answer:   {custom_sporting_answer_2}
""")


When were the golden years of Sporting??

Original Answer: The golden years of Sporting can be considered to have started in the early 2000s and continued until around 2012. During this time, Sporting achieved great success in competitions such as the Primeira Liga, the Portuguese Cup, and the UEFA Europa League. They also achieved their highest ever league finish in 2007, coming in second place. This period can be considered the "golden years" due to the consistent success and strong performances of the team.
Custom Answer:   The golden years of Sporting may vary depending on the sport or country in question. Generally, the phrase refers to a period of sustained success or dominance in a specific sport.

For example, in football (soccer), the golden years of Sporting may be considered to be the 1940s and 1950s when the club won multiple league titles and competed in European competitions. In basketball, the golden years may refer to the period from the late 1970s to the early 1990s w

# CONCLUSIONS

The model with the custom prompt responds correctly to the questions and without the context given it's not able to respond correctly, so we can conclude that by providing the context based on the dataframe we built based on the wikipedia page of the club the model is much more accurate.



---
**Notes to the reviwer:**



1.   This code was built and run in google colab.
2.   The answer to the questions changes everytime we run the code. This is normal because questions (especially question nº2) are a little bit open given the context that is possible to collect from the wikipedia page. And of course due to the model we are using - gpt-3.5-turbo-instruct. But in 90% of the times i have run it the custom model responds correctly and non customised ones are much more random.


