# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

---
**Dataset Choice**

I chose to build a dataset from the Wikipedia page about Sporting Clube de Portugal, the club I support, because I wanted a chatbot that can answer questions about the club. A typical use case would be for the club to offer an LLM-backed chatbot on its website.


## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
# Install the legacy version of the OpenAI library (0.28) and also tiktoken
!pip install openai==0.28
!pip install tiktoken

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [2]:
# Import libraries
import requests
import pandas as pd
import string
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import numpy as np
import tiktoken


In [4]:
openai.api_key = ""

In [5]:
# create a pandas dataframe
df = pd.DataFrame()

# Get the Wikipedia page for "Sporting_CP", since the OpenAI model's training data stops in 2021
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": "Sporting_CP",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

# Add a "text" column with one row per line of the page extract
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
df


Unnamed: 0,text
0,Sporting Clube de Portugal (Portuguese pronunc...
1,"Founded on 1 July 1906, Sporting is one of the..."
2,Sporting is the third most decorated Portugues...
3,
4,
...,...
389,== External links ==
390,
391,Official website (in Portuguese and English)
392,Sporting CP at LPFP (in English and Portuguese)
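
The indexing `response_dict["query"]["pages"][0]["extract"]` assumes the reply has exactly that shape. A minimal offline sketch of the same step, using a hand-written dictionary in place of the real API reply (the helper name `extract_lines` and the sample data are illustrative assumptions, not part of the notebook):

```python
def extract_lines(response_dict):
    """Return the plain-text extract of the first page, split into lines."""
    # With formatversion=2 the "pages" entry is a list of page dicts
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages:
        return []
    # "extract" can be absent, for example when the title does not exist
    extract = pages[0].get("extract", "")
    if not extract:
        return []
    return extract.split("\n")


# Hypothetical reply with the same structure as the one used above
sample_response = {
    "query": {
        "pages": [
            {
                "title": "Sporting CP",
                "extract": "First paragraph.\n\n== History ==\nSecond paragraph.",
            }
        ]
    }
}

lines = extract_lines(sample_response)
print(lines)
```

The empty strings and the `== History ==` heading in the result are the kind of rows that the cleaning step removes.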


In [6]:
# Clean the data

# Delete the empty lines
df = df[df["text"].str.len() > 0]


# Remove the lines starting with "=="
df = df[~df["text"].str.startswith("==")]
df



Unnamed: 0,text
0,Sporting Clube de Portugal (Portuguese pronunc...
1,"Founded on 1 July 1906, Sporting is one of the..."
2,Sporting is the third most decorated Portugues...
10,Sporting Clube de Portugal has its origins in ...
12,The club also organized parties and picnics. E...
...,...
379,"Secretaries: Miguel de Castro, Luís Pereira, T..."
380,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui..."
391,Official website (in Portuguese and English)
392,Sporting CP at LPFP (in English and Portuguese)


In [7]:
# reset the dataframe index
df.reset_index(inplace=True,drop=True)
df

Unnamed: 0,text
0,Sporting Clube de Portugal (Portuguese pronunc...
1,"Founded on 1 July 1906, Sporting is one of the..."
2,Sporting is the third most decorated Portugues...
3,Sporting Clube de Portugal has its origins in ...
4,The club also organized parties and picnics. E...
...,...
161,"Secretaries: Miguel de Castro, Luís Pereira, T..."
162,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui..."
163,Official website (in Portuguese and English)
164,Sporting CP at LPFP (in English and Portuguese)
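
The cleaning steps above can also be expressed on a plain Python list, which makes the filtering easy to check on a small sample. This is only a sketch: the helper `clean_lines` is hypothetical, and unlike the notebook's `str.len() > 0` filter it also drops lines that contain only whitespace, which is an assumption about what is wanted.

```python
def clean_lines(lines):
    """Drop blank lines and wiki section headings, keeping the original order."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # empty or whitespace-only line
        if stripped.startswith("=="):
            continue  # section heading such as "== History =="
        cleaned.append(stripped)
    return cleaned


raw_lines = [
    "Intro paragraph.",
    "",
    "== History ==",
    "History paragraph.",
    "   ",
    "=== Early years ===",
    "More text.",
]

cleaned = clean_lines(raw_lines)
# enumerate() plays the role of reset_index(drop=True): positions restart at 0
for position, line in enumerate(cleaned):
    print(position, line)
```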


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [None]:
# Create the embeddings Index
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input = df["text"].tolist(),
    model = EMBEDDING_MODEL_NAME)

embeddings = [data["embedding"] for data in response["data"]]
embeddings
# check the type and len of embeddings
#type(embeddings)
#len(embeddings)

**Output removed to reduce the size of the file**

In [49]:
# Add the embeddings to the dataframe

df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,Sporting Clube de Portugal (Portuguese pronunc...,"[-0.0077106766402721405, 0.014129783026874065,..."
1,"Founded on 1 July 1906, Sporting is one of the...","[-0.007837342098355293, 0.019974472001194954, ..."
2,Sporting is the third most decorated Portugues...,"[-0.014131806790828705, 0.00956854596734047, 0..."
3,Sporting Clube de Portugal has its origins in ...,"[-0.011024683713912964, 0.003270503832027316, ..."
4,The club also organized parties and picnics. E...,"[-0.014927705749869347, 0.010244372300803661, ..."
...,...,...
161,"Secretaries: Miguel de Castro, Luís Pereira, T...","[-0.001424439251422882, -0.002896219026297331,..."
162,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui...","[-0.010861829854547977, 0.0025186119601130486,..."
163,Official website (in Portuguese and English),"[-0.0049142902716994286, 0.002890567295253277,..."
164,Sporting CP at LPFP (in English and Portuguese),"[0.0028089697007089853, 0.034062452614307404, ..."


In [10]:
# Save the dataframe so the embeddings can be reused for every user question
df.to_csv("sporting_embeddings.csv")

In [11]:
# Load the dataframe and convert the embeddings column from strings into NumPy arrays
df = pd.read_csv("sporting_embeddings.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
df

Unnamed: 0,text,embeddings
0,Sporting Clube de Portugal (Portuguese pronunc...,"[-0.0077106766402721405, 0.014129783026874065,..."
1,"Founded on 1 July 1906, Sporting is one of the...","[-0.007837342098355293, 0.019974472001194954, ..."
2,Sporting is the third most decorated Portugues...,"[-0.014131806790828705, 0.00956854596734047, 0..."
3,Sporting Clube de Portugal has its origins in ...,"[-0.011024683713912964, 0.003270503832027316, ..."
4,The club also organized parties and picnics. E...,"[-0.014927705749869347, 0.010244372300803661, ..."
...,...,...
161,"Secretaries: Miguel de Castro, Luís Pereira, T...","[-0.001424439251422882, -0.002896219026297331,..."
162,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui...","[-0.010861829854547977, 0.0025186119601130486,..."
163,Official website (in Portuguese and English),"[-0.0049142902716994286, 0.002890567295253277,..."
164,Sporting CP at LPFP (in English and Portuguese),"[0.0028089697007089853, 0.034062452614307404, ..."
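
Saving a list to CSV stores it as text, which is why the embeddings column has to be parsed after loading. The sketch below shows that round trip with the standard library only. It uses `ast.literal_eval` instead of `eval`; that swap is a suggestion rather than what the notebook does, since `literal_eval` only accepts Python literals and will not execute arbitrary code found in the file. The two sample rows are made up.

```python
import ast
import csv
import io

# Hypothetical rows standing in for the real text and embeddings
rows = [
    {"text": "first row", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "second row", "embeddings": [0.0, 0.5, -0.5]},
]

# Write to an in-memory CSV; each list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the embeddings are now plain strings
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["embeddings"]))

# Parse each string back into a list of floats
parsed = [ast.literal_eval(row["embeddings"]) for row in loaded]
print(parsed[0])
```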


In [None]:
#define our first question
question1 = "When were the golden years of Sporting?"

# get the embeddings for the question
question1_embeddings = get_embedding(question1,engine=EMBEDDING_MODEL_NAME)
question1_embeddings

**Output removed to reduce the size of the file**

In [None]:
# Get the cosine distance from the question embedding to each row embedding (smaller means more similar)

distances = distances_from_embeddings(question1_embeddings, df["embeddings"].tolist(), distance_metric="cosine")
distances

**Output removed to reduce the size of the file**
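
To make the distance values easier to interpret, here is a small pure-Python sketch of cosine distance, assuming (as I understand the helper) that `distance_metric="cosine"` means one minus the cosine similarity. The function `cosine_distance` and the toy two-dimensional vectors are illustrative only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors: one question and three candidate rows
question_vec = [1.0, 0.0]
row_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

toy_distances = {name: cosine_distance(question_vec, vec) for name, vec in row_vecs.items()}

# Sorting by ascending distance puts the most similar row first
ranked = sorted(toy_distances, key=toy_distances.get)
print(ranked)
```

Only the direction of the vectors matters here, not their length, so a row pointing the same way as the question gets a distance of 0 regardless of its magnitude.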

In [14]:
# add the distances to the dataframe

df["distances"] = distances
df

Unnamed: 0,text,embeddings,distances
0,Sporting Clube de Portugal (Portuguese pronunc...,"[-0.0077106766402721405, 0.014129783026874065,...",0.194190
1,"Founded on 1 July 1906, Sporting is one of the...","[-0.007837342098355293, 0.019974472001194954, ...",0.166783
2,Sporting is the third most decorated Portugues...,"[-0.014131806790828705, 0.00956854596734047, 0...",0.171046
3,Sporting Clube de Portugal has its origins in ...,"[-0.011024683713912964, 0.003270503832027316, ...",0.170200
4,The club also organized parties and picnics. E...,"[-0.014927705749869347, 0.010244372300803661, ...",0.174893
...,...,...,...
161,"Secretaries: Miguel de Castro, Luís Pereira, T...","[-0.001424439251422882, -0.002896219026297331,...",0.241348
162,"Substitutes: Diogo Orvalho, Manuel Mendes, Rui...","[-0.010861829854547977, 0.0025186119601130486,...",0.251353
163,Official website (in Portuguese and English),"[-0.0049142902716994286, 0.002890567295253277,...",0.264433
164,Sporting CP at LPFP (in English and Portuguese),"[0.0028089697007089853, 0.034062452614307404, ...",0.203166


In [15]:
# sort the dataframe by distances
df = df.sort_values(by="distances")

# save the dataframe for future use
df.to_csv("sporting_distances_sorted.csv")
df

Unnamed: 0,text,embeddings,distances
19,"Domestically, Sporting had back-to-back wins i...","[-0.02016436494886875, 0.004032873082906008, 0...",0.148997
16,"In 2000, Sporting, led by manager Augusto Inác...","[-0.005464556161314249, 0.014642377384006977, ...",0.150014
14,Highlights of this period of time also include...,"[-0.015408716164529324, 0.0008742476929910481,...",0.150575
13,English manager Malcolm Allison arrived at Spo...,"[-0.011236773803830147, 0.00034831243101507425...",0.160921
8,The football team had their height during the ...,"[-0.021699093282222748, 0.012809421867132187, ...",0.161489
...,...,...,...
51,) net income was €25.2 million for a record-br...,"[-0.024319594725966454, -0.018804073333740234,...",0.272438
77,Ivaylo Yordanov – 1998,"[-0.00691758980974555, -0.016648462042212486, ...",0.273847
76,Krasimir Balakov – 1995,"[0.00198308564722538, -0.0033269948326051235, ...",0.278395
117,Krasimir Balakov – 1994 United States,"[-0.0007893025758676231, -0.019976649433374405...",0.280160


In [16]:
# Define the base tokenizer and max number of tokens to use
tokenizer = tiktoken.get_encoding("cl100k_base")
max_token_count = 1000

In [17]:
# compose a custom prompt
prompt_template = """
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{}

---

Question: {}
Answer:"""

prompt = prompt_template.format("context",question1)
print(prompt)


Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

context

---

Question: When were the golden years of Sporting?
Answer:


In [18]:
# Build the context, making sure the prompt does not exceed the max token count

current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question1))
context = []
for text in df["text"].values:
    text_token_count = len(tokenizer.encode(text))
    current_token_count += text_token_count

    if current_token_count <= max_token_count:
        #print(current_token_count)
        context.append(text)
    else:
        break
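
The loop above is a greedy selection under a token budget: rows are taken in order of relevance until the next one would push the prompt over the limit. A minimal offline sketch of the same logic follows, with a whitespace word count standing in for the tiktoken tokenizer (an assumption made only so the example runs without extra packages); `select_context` and `count_words` are hypothetical names.

```python
def select_context(texts, fixed_token_count, max_token_count, count_tokens):
    """Greedily keep texts, in order, while the running total fits the budget."""
    context = []
    current = fixed_token_count  # tokens already used by the template and question
    for text in texts:
        current += count_tokens(text)
        if current > max_token_count:
            break  # stop at the first text that does not fit
        context.append(text)
    return context


def count_words(text):
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())


texts = ["one two three", "four five", "six seven eight nine", "ten"]
selected = select_context(
    texts, fixed_token_count=2, max_token_count=8, count_tokens=count_words
)
print(selected)
```

Note that the loop stops at the first text that overflows, so the short last entry is never considered even though it would fit; this mirrors the `break` in the notebook cell.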


In [19]:
#check the content of context
context

['Domestically, Sporting had back-to-back wins in the Portuguese Cup in 2007 and 2008 (led by coach Paulo Bento). Sporting also reached, for the first time, the knockout phase of UEFA Champions League, in the 2008–09 season, but were roundly defeated by Bayern Munich, with an aggregate loss of 12–1. This is widely regarded as one of the lowest points in the history of the club. The club almost reached another European final in 2012, but were dropped out of the competition by Athletic Bilbao, in the semi-finals of the 2011–12 Europa League.',
 'In 2000, Sporting, led by manager Augusto Inácio (a former Sporting player, who replaced Giuseppe Materazzi at the beginning of the season), won the league title on the last match day, with a 4–0 victory over Salgueiros, ending an 18-year drought. In the following season, Sporting conquered the 2000 Super Cup but came third in the league. In the 2001–02 season, led by coach László Bölöni, Sporting conquered their 18th league title, the Portuguese

In [20]:
# Define a function that receives a prompt template, a context and a question and returns the complete custom prompt
def create_prompt(prompt_template,context, question):
  return prompt_template.format("\n".join(context),question)


In [21]:
# Let's test our first custom prompt
my_prompt = create_prompt(prompt_template,context,question1)
print(my_prompt)


Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

Domestically, Sporting had back-to-back wins in the Portuguese Cup in 2007 and 2008 (led by coach Paulo Bento). Sporting also reached, for the first time, the knockout phase of UEFA Champions League, in the 2008–09 season, but were roundly defeated by Bayern Munich, with an aggregate loss of 12–1. This is widely regarded as one of the lowest points in the history of the club. The club almost reached another European final in 2012, but were dropped out of the competition by Athletic Bilbao, in the semi-finals of the 2011–12 Europa League.
In 2000, Sporting, led by manager Augusto Inácio (a former Sporting player, who replaced Giuseppe Materazzi at the beginning of the season), won the league title on the last match day, with a 4–0 victory over Salgueiros, ending an 18-year drought. In the following season, Sporting conquered the 2000 Super Cup but ca

In [22]:
# Test our first custom question (no max_tokens is set, so the API default applies and the answer below is cut short)
final_response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt = my_prompt
    )

# The response is a dictionary; the answer text is under the key "choices", at index 0
final_response["choices"][0]["text"]

' The golden years of Sporting occurred during the 1940s and 1950'

**text in the wiki page under "Golden years and fading (1946–1982)" : The football team had their height during the 1940s and 1950s.**

In [23]:
# Now let's test the same question without the context from the dataframe
sporting_prompt = """
Question: "When were the golden years of Sporting?"
Answer:
"""
initial_sporting_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=sporting_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_sporting_answer)

The golden years of Sporting may vary depending on personal opinions, but a popular period of success for the club was between 1994 and 1996, when they won back-to-back Portuguese Liga titles and the Portuguese Cup. Other successful years for Sporting include the 1950s and 1960s, when they won multiple league titles, and the early 2000s, when they won another league title and multiple domestic cups. Overall, Sporting has had various golden years throughout its history, depending on its accomplishments and achievements.


# **RESULT FROM 1st TEST**

***Response with context***: The golden years of Sporting occurred during the 1940s and 1950 (answer cut off by the token limit) - CORRECT


***Response without the context***: the golden years of Sporting may vary depending on personal opinions, but a popular period of success for the club was between 1994 and 1996, when they won back-to-back Portuguese Liga titles and the Portuguese Cup. Other successful years for Sporting include the 1950s and 1960s, when they won multiple league titles, and the early 2000s, when they won another league title and multiple domestic cups. Overall, Sporting has had various golden years throughout its history, depending on its accomplishments and achievements. - INCORRECT


**As the Wikipedia page shows, the first response is the correct one, so the model answered correctly when the context was included.**

**LET'S MAKE SOME REUSABLE CODE**

Until now I have been running the code almost line by line, with just a few helper functions.

The problem is that testing the model with other questions means creating the embedding for each new question and recalculating the distances, and there is no reusable code for that yet. Let's write it.

In [24]:
# Function that finds the most relevant pieces of text for a given question
def get_relevante_text(question, df):
  """
  Function that takes in a question string and a dataframe containing
  rows of text and associated embeddings, and returns that dataframe
  sorted from most to least relevant for that question
  """

  # get the embeddings
  question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

  # add a distances column to a copy of the dataframe,
  # based on the cosine distance to the question embedding
  df_copy = df.copy()
  df_copy["distances"] = distances_from_embeddings(
      question_embeddings,
      df_copy["embeddings"].tolist(),
      distance_metric="cosine"
  )

  # sort the dataframe by the distances column
  df_copy = df_copy.sort_values(by="distances")
  return df_copy

In [25]:
# Let's load the Sporting embeddings and test two questions

df = pd.read_csv("sporting_embeddings.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)

sporting_question_1 = "When were the golden years of Sporting?"
sporting_question_2 = "Who was the first president of Sporting?"

In [26]:
#check results for question 1
get_relevante_text(sporting_question_1, df)


Unnamed: 0,text,embeddings,distances
19,"Domestically, Sporting had back-to-back wins i...","[-0.02016436494886875, 0.004032873082906008, 0...",0.148997
16,"In 2000, Sporting, led by manager Augusto Inác...","[-0.005464556161314249, 0.014642377384006977, ...",0.150014
14,Highlights of this period of time also include...,"[-0.015408716164529324, 0.0008742476929910481,...",0.150575
13,English manager Malcolm Allison arrived at Spo...,"[-0.011236773803830147, 0.00034831243101507425...",0.160921
8,The football team had their height during the ...,"[-0.021699093282222748, 0.012809421867132187, ...",0.161489
...,...,...,...
51,) net income was €25.2 million for a record-br...,"[-0.024319594725966454, -0.018804073333740234,...",0.272438
77,Ivaylo Yordanov – 1998,"[-0.00691758980974555, -0.016648462042212486, ...",0.273847
76,Krasimir Balakov – 1995,"[0.00198308564722538, -0.0033269948326051235, ...",0.278395
117,Krasimir Balakov – 1994 United States,"[-0.0007893025758676231, -0.019976649433374405...",0.280160


In [27]:
#check results for question 2
get_relevante_text(sporting_question_2, df)

Unnamed: 0,text,embeddings,distances
4,The club also organized parties and picnics. E...,"[-0.014927705749869347, 0.010244372300803661, ...",0.134779
3,Sporting Clube de Portugal has its origins in ...,"[-0.011024683713912964, 0.003270503832027316, ...",0.140690
5,"The year 1907 marked some ""firsts"" for the clu...","[-0.015821464359760284, 0.004305632784962654, ...",0.141565
7,Sporting played their first Primeira Liga game...,"[-0.006375958677381277, 0.004544531926512718, ...",0.142007
9,Sporting and the Yugoslavian team Partizan bot...,"[-0.01237823162227869, -0.00989230815321207, 0...",0.145727
...,...,...,...
129,Viktor Gyokeres (21 million euros as of Februa...,"[-0.013638434000313282, -0.020718703046441078,...",0.286750
122,Hilário (474),"[-0.0405082143843174, -0.010725759901106358, 0...",0.288856
77,Ivaylo Yordanov – 1998,"[-0.00691758980974555, -0.016648462042212486, ...",0.293774
76,Krasimir Balakov – 1995,"[0.00198308564722538, -0.0033269948326051235, ...",0.297109


In [28]:
# Create a function that builds a text prompt and lets us set the maximum number of tokens in the prompt
def create_final_prompt(question, df, max_token_count):
  """
  Given a question and a dataframe containing rows of text and their
  embeddings, return a text prompt to send to a completion model
  """

  # create the base tokenizer aligned with our embeddings
  tokenizer = tiktoken.get_encoding("cl100k_base")

  # Count the number of tokens in the prompt template and question
  prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

{}

---

Question: {}
Answer:"""

  current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
  context=[]
  for text in get_relevante_text(question, df)["text"].values:
     text_token_count = len(tokenizer.encode(text))
     current_token_count += text_token_count

     # Add the row of text to the list if we haven't exceeded the max
     if current_token_count <= max_token_count:
          context.append(text)
     else:
          break

  return prompt_template.format("\n".join(context), question)

In [29]:
# now let's test it with question 1
print(create_final_prompt(sporting_question_1, df, 1000))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

Domestically, Sporting had back-to-back wins in the Portuguese Cup in 2007 and 2008 (led by coach Paulo Bento). Sporting also reached, for the first time, the knockout phase of UEFA Champions League, in the 2008–09 season, but were roundly defeated by Bayern Munich, with an aggregate loss of 12–1. This is widely regarded as one of the lowest points in the history of the club. The club almost reached another European final in 2012, but were dropped out of the competition by Athletic Bilbao, in the semi-finals of the 2011–12 Europa League.
In 2000, Sporting, led by manager Augusto Inácio (a former Sporting player, who replaced Giuseppe Materazzi at the beginning of the season), won the league title on the last match day, with a 4–0 victory over Salgueiros, ending an 18-year drought. In the following season, Sporting conquered the 2000 Super Cup but ca

In [30]:
# now let's test it with question 2
print(create_final_prompt(sporting_question_2, df,1000))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

The club also organized parties and picnics. Eventually, during one picnic, on 12 April 1906, discussions erupted, as some members defended that the club should only be focused on organizing picnics and social events, with another group defending that the club should be focused on the practising of sports instead. Some time later, José Gavazzo, José Alvalade and 17 other members left the club, with José Alvalade saying: "I'll go to my grandad and he'll give me money to make another club." As such, a new club, without a name, was founded on 8 May 1906, and on 26 May, it was named "Campo Grande Sporting Clube". The Viscount of Alvalade, whose money and land helped found the club, was the first president of Sporting. José Alvalade, as one of the main founders and first club member (sócio), uttered on behalf of himself and his fellow co-founders: "We wa

In [31]:
# Create a function to answer questions

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(question, df, max_prompt_tokens=1800, max_answer_tokens = 150):
  """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """
  # Build the context-augmented prompt; create_final_prompt retrieves the relevant text itself
  prompt = create_final_prompt(question, df, max_prompt_tokens)

  try:
    response = openai.Completion.create(
            model = COMPLETION_MODEL_NAME,
            prompt = prompt,
            max_tokens = max_answer_tokens
        )
    return response["choices"][0]["text"].strip()
  except Exception as e:
      print(e)
      return ""
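
The error handling in `answer_question` can be exercised without calling the API by passing the completion call in as a function. The sketch below is a hypothetical variant, not the notebook's code: `answer_with`, `fake_complete` and `failing_complete` are made-up names, and the canned answer is sample data.

```python
def answer_with(complete, prompt):
    """Call complete(prompt) and return the stripped text, or "" on any error."""
    try:
        return complete(prompt).strip()
    except Exception as error:
        print(f"Completion failed: {error}")
        return ""


def fake_complete(prompt):
    """Stand-in for the completion call: returns a canned answer with padding."""
    return "  The 1940s and 1950s. "


def failing_complete(prompt):
    """Stand-in that simulates an API failure."""
    raise RuntimeError("simulated API error")


good_answer = answer_with(fake_complete, "When were the golden years of Sporting?")
bad_answer = answer_with(failing_complete, "When were the golden years of Sporting?")
print(repr(good_answer))
print(repr(bad_answer))
```

Returning an empty string keeps the notebook running after a failed request, at the cost of making a failure look like an empty answer, so callers may want to check for `""` explicitly.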


In [32]:
# With all the reusable code built, let's test it
custom_sporting_question_1 = sporting_question_1
# "When were the golden years of Sporting?"
custom_sporting_question_2 = sporting_question_2
#"Who was the first president of Sporting?"



In [34]:
custom_sporting_answer_1 = answer_question(custom_sporting_question_1, df)
print(custom_sporting_answer_1)

The golden years of Sporting can be considered to be between the 1940s and 1960s. During this time, the club won numerous league titles and domestic cups, as well as the European Cup Winners' Cup in 1964. The team also had a strong generation of players, including legends such as Fernando Peyroteo and Hilário.


In [41]:
custom_sporting_answer_2 = answer_question(custom_sporting_question_2, df)
print(custom_sporting_answer_2)

The first president of Sporting was José Holtreman Roquette, also known as José Alvalade. He served as president from the club's founding in 1906 until his death in 1918.


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [42]:
sporting_prompt_1 = """
Question: "When were the golden years of Sporting?"
Answer:
"""
initial_sporting_answer_1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=sporting_prompt_1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_sporting_answer_1)

custom_sporting_answer_1 = answer_question(custom_sporting_question_1, df)
print(custom_sporting_answer_1)

The golden years of Sporting can be generally considered to be the period from the 1940s to the 1970s. During this time, the team achieved great success both domestically and internationally. They won numerous league titles and cups, as well as European competitions. Some of the most iconic players in the club's history, such as Peyroteo, Fernandes, and Yazalde, played during this era, cementing Sporting's reputation as one of the top teams in Europe.
The golden years of Sporting can be debated, as the club has seen success throughout its history. However, many would consider the late 1940s and early 1950s to be the golden years, as the club won five Primeira Liga titles in a span of six years (1947-48, 1948-49, 1950-51, 1951-52, 1952-53) and also won the Taça de Portugal (Portuguese Cup) twice (1941, 1945).


### Question 2

In [47]:
sporting_prompt_2 = """
Question: "Who was the first president of Sporting?"
Answer:
"""
initial_sporting_answer_2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=sporting_prompt_2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_sporting_answer_2)

custom_sporting_answer_2 = answer_question(custom_sporting_question_2, df)
print(custom_sporting_answer_2)

There is not one specific "Sporting" organization, but if you are referring to Sporting Lisbon, then the first president was José Alvalade. If you are referring to Sporting Clube de Portugal, then the first president was Bernardino Rebelo.
The first president of Sporting was Jose de Alvalade, elected in 1906 when the club was founded.


In [48]:
print(f"""
When were the golden years of Sporting?

Original Answer: {initial_sporting_answer_1}
Custom Answer:   {custom_sporting_answer_1}

Who was the first president of Sporting?

Original Answer: {initial_sporting_answer_2}
Custom Answer:   {custom_sporting_answer_2}
""")


When were the golden years of Sporting?

Original Answer: The golden years of Sporting can be generally considered to be the period from the 1940s to the 1970s. During this time, the team achieved great success both domestically and internationally. They won numerous league titles and cups, as well as European competitions. Some of the most iconic players in the club's history, such as Peyroteo, Fernandes, and Yazalde, played during this era, cementing Sporting's reputation as one of the top teams in Europe.
Custom Answer:   The golden years of Sporting can be debated, as the club has seen success throughout its history. However, many would consider the late 1940s and early 1950s to be the golden years, as the club won five Primeira Liga titles in a span of six years (1947-48, 1948-49, 1950-51, 1951-52, 1952-53) and also won the Taça de Portugal (Portuguese Cup) twice (1941, 1945).

Who was the first president of Sporting?

Original Answer: There is not one specific "Sporting" organi

# CONCLUSIONS

With the custom prompt, the model answers the questions correctly; without the context, it does not. We can conclude that supplying context from the dataframe built from the club's Wikipedia page makes the model much more accurate.



---
**Notes to the reviewer:**



1.   This code was built and run in Google Colab.
2.   The answers change every time the code is run. This is expected, because the questions (especially question 2) are somewhat open-ended given the context available from the Wikipedia page, and because of the model used, gpt-3.5-turbo-instruct. In about 90% of my runs the custom model answered correctly, while the non-customised answers were much more random.
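
Because the answers vary between runs, a success rate such as the one quoted above can be estimated by asking the same question several times and counting how often the answer contains an expected phrase. The sketch below shows one way to do that, under stated assumptions: `success_rate` is a hypothetical helper, keyword matching is a crude stand-in for reading the answers, and `stub_answer` replaces the real API call with canned sample answers so the example runs offline.

```python
import itertools


def success_rate(answer_fn, question, expected_keywords, runs=10):
    """Fraction of runs whose answer contains at least one expected keyword."""
    hits = 0
    for _ in range(runs):
        answer = answer_fn(question).lower()
        if any(keyword.lower() in answer for keyword in expected_keywords):
            hits += 1
    return hits / runs


# Stub that cycles through canned answers instead of calling the API
canned_answers = itertools.cycle(
    [
        "The 1940s and 1950s.",
        "Between 1994 and 1996.",
        "During the 1940s.",
        "The 1950s.",
    ]
)


def stub_answer(question):
    return next(canned_answers)


rate = success_rate(
    stub_answer,
    "When were the golden years of Sporting?",
    expected_keywords=["1940s", "1950s"],
    runs=8,
)
print(f"Success rate: {rate:.0%}")
```

With the real model, `stub_answer` would be replaced by a function that calls `answer_question(question, df)`.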


