# Custom Chatbot Project

The dataset is the text of the Wikipedia article on Project 2025. The project was first released in 2022, after GPT-3.5 was trained, so the base model is unlikely to know about it.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [None]:
import requests
import pandas as pd
import tiktoken
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

from openai.embeddings_utils import get_embedding, distances_from_embeddings

print(openai.__version__)

0.28.0


In [2]:
overview_req_params = {
    "action": "query",
    "prop": "extracts",
    "titles": "Project 2025",
    "formatversion": 2,
    "exlimit": 1,
    "explaintext": 1,
    "format": "json"
}

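overview_resp = requests.get("https://en.wikipedia.org/w/api.php", params=overview_req_params, timeout=30)
overview_resp.raise_for_status()  # fail early on an HTTP error instead of parsing an error page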

In [3]:
overview_resp_dict = overview_resp.json()

overview_extract = overview_resp_dict["query"]["pages"][0]["extract"]
extracts = [line for line in overview_extract.split("\n") if len(line)>1 and not line.startswith("==")]
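
The list comprehension above keeps non-trivial lines and drops section headings, which the plain-text extract marks with a leading `==`. A minimal sketch of the same filter on a made-up sample string follows; the sample text and the `keep_paragraph_lines` name are illustrative and not part of the notebook.

```python
def keep_paragraph_lines(extract):
    """Split a plain-text extract into lines, dropping blanks and '==' headings."""
    return [
        line
        for line in extract.split("\n")
        if len(line) > 1 and not line.startswith("==")
    ]


# Hypothetical sample shaped like a plain-text article extract.
sample_extract = (
    "Intro paragraph.\n\n\n== History ==\n"
    "First paragraph of history.\n=== Details ===\nMore text."
)
sample_rows = keep_paragraph_lines(sample_extract)
print(sample_rows)
```

Short list items such as "Official website" pass this filter, which is why they appear as rows in the dataframe output further down.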


In [4]:
df = pd.DataFrame()
df['text'] = extracts
df

Unnamed: 0,text
0,Project 2025 (also known as the 2025 President...
1,The ninth iteration of the Heritage Foundation...
2,The project calls for merit-based federal civi...
3,Most of Project 2025's writers and contributor...
4,"The Heritage Foundation, a conservative think ..."
...,...
159,Project Esther
160,Official website
161,"""Admin 2025"" a presentation of the Project 202..."
162,"""Top Project 2025 architect talks conservative..."


In [5]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
embedding = openai.Embedding.create(model=EMBEDDING_MODEL_NAME, input=df["text"].tolist())

In [6]:
len(embedding["data"][0]["embedding"])

1536

In [7]:
embeddings = [data["embedding"] for data in embedding["data"]]
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,Project 2025 (also known as the 2025 President...,"[-0.030147729441523552, -0.03871603310108185, ..."
1,The ninth iteration of the Heritage Foundation...,"[-0.020986000075936317, -0.021598828956484795,..."
2,The project calls for merit-based federal civi...,"[-0.024278998374938965, -0.010739050805568695,..."
3,Most of Project 2025's writers and contributor...,"[-0.02568434365093708, -0.05058874562382698, -..."
4,"The Heritage Foundation, a conservative think ...","[-0.03339548408985138, -0.020173756405711174, ..."
...,...,...
159,Project Esther,"[-0.0025224597193300724, -0.026541264727711678..."
160,Official website,"[-0.010165003128349781, -0.005637652240693569,..."
161,"""Admin 2025"" a presentation of the Project 202...","[-0.015709342435002327, -0.02042486146092415, ..."
162,"""Top Project 2025 architect talks conservative...","[-0.023287400603294373, -0.01814079098403454, ..."


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [8]:
prompt_template = """
Answer the question based on the context below, and if the question can't be answered based on the context, say "I don't know".

Context:

{}

---

Question: {}
Answer:"""
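
The template has two positional `{}` placeholders: the first receives the retrieved context and the second receives the user's question. Here is a small sketch of how `str.format` fills them, using a shortened copy of the template and made-up strings (all names and values are illustrative):

```python
demo_template = """Context:

{}

---

Question: {}
Answer:"""

# Hypothetical context rows; in the notebook these would come from the dataframe.
demo_context = ["First relevant paragraph.", "Second relevant paragraph."]
demo_question = "What is this about?"

# Placeholders are filled in order: context first, then the question.
demo_prompt = demo_template.format("\n\n".join(demo_context), demo_question)
print(demo_prompt)
```

Ending the prompt with `Answer:` leaves the completion model to continue from that point.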

In [9]:
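def get_rows_sorted_by_relevance(prompt, df):
    """Return a copy of df sorted by cosine distance to the prompt, closest first."""
    prompt_embeddings = get_embedding(prompt, engine=EMBEDDING_MODEL_NAME)

    # Work on a copy so the caller's dataframe is left unchanged.
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(prompt_embeddings, df_copy["embeddings"].values, distance_metric="cosine")

    # Smaller distance means more similar, so an ascending sort puts the best matches first.
    df_copy = df_copy.sort_values("distances")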
    return df_copy
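
For intuition, cosine distance is conventionally defined as one minus the cosine similarity of two vectors, so smaller values mean closer matches. The sketch below reproduces that ranking in plain Python on toy 3-dimensional vectors. It is an illustration only, not the code behind `distances_from_embeddings`, and the vectors and helper names are made up.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for real 1536-dimensional embeddings.
query_vec = [1.0, 0.0, 0.0]
row_vecs = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "partly aligned": [1.0, 1.0, 0.0],
}

# Sort row labels by ascending distance to the query vector.
ranked = sorted(row_vecs, key=lambda name: cosine_distance(query_vec, row_vecs[name]))
print(ranked)
```

Vector length does not affect the result: `[2.0, 0.0, 0.0]` is at distance 0 from the query even though it is twice as long.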

In [10]:
get_rows_sorted_by_relevance("Donald Trump", df)

Unnamed: 0,text,embeddings,distances
154,Donald Trump and fascism,"[-0.03350825980305672, -0.016422364860773087, ...",0.086432
155,Hiring and personnel concerns about Donald Trump,"[-0.028792042285203934, -0.022012826055288315,...",0.122461
130,Project 2025 seems to be full of a whole array...,"[-0.019907109439373016, -0.01649446226656437, ...",0.155392
134,Spencer Ackerman and John Nichols in The Natio...,"[-0.02807612717151642, -0.028156574815511703, ...",0.160186
118,Aspects of the project implemented in the firs...,"[-0.023910224437713623, -0.028847601264715195,...",0.163993
...,...,...,...
106,"To prevent teenage pregnancy, Project 2025 adv...","[-0.027568044140934944, -0.00808256957679987, ...",0.260280
4,"The Heritage Foundation, a conservative think ...","[-0.03339548408985138, -0.020173756405711174, ...",0.261955
97,Project 2025 recommends curtailing the Biparti...,"[-0.006110614165663719, -0.004481117241084576,...",0.265858
122,"On February 7, 2025, the National Institutes o...","[-0.006858889013528824, -0.002144216326996684,...",0.276925


In [11]:
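def create_prompt(prompt, df):
    """Build a prompt from the most relevant rows that fit the token budget."""
    max_token_count = 2000

    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Start with the tokens used by the template and the question itself.
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(prompt))

    # Add rows in order of relevance and stop at the first row that exceeds the budget.
    context = []
    for text in get_rows_sorted_by_relevance(prompt, df)["text"].values:
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break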

    return prompt_template.format("\n\n###\n\n".join(context), prompt)
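
The context selection in `create_prompt` is a greedy packing loop: rows are taken in relevance order until one no longer fits. The sketch below isolates that idea with a crude whitespace word count standing in for the `tiktoken` encoder. The stand-in tokenizer, the function names and the sample rows are all assumptions made for illustration.

```python
def pack_context(rows, budget, count_tokens):
    """Greedily keep rows, in order, until the next one would exceed the budget."""
    used = 0
    kept = []
    for row in rows:
        cost = count_tokens(row)
        if used + cost > budget:
            break  # stop at the first row that does not fit
        used += cost
        kept.append(row)
    return kept


def whitespace_token_count(text):
    """Crude stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Rows are assumed to be pre-sorted from most to least relevant.
ranked_rows = ["one two three", "four five", "six seven eight nine", "ten"]
selected = pack_context(ranked_rows, budget=6, count_tokens=whitespace_token_count)
print(selected)
```

Because the loop breaks rather than skips, a later short row (here `"ten"`) is left out even though it would fit. This keeps the context limited to the top-ranked rows, at the cost of some unused budget.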

In [12]:
def send_prompt(prompt):
    return openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=create_prompt(prompt, df),
        max_tokens=2000
    )

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
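
One possible way to keep the before/after comparison uniform is a small helper that takes the two completion functions as arguments. The sketch below uses stub functions in place of real API calls, so every name in it is hypothetical and nothing here contacts the OpenAI service.

```python
def compare_answers(question, basic_fn, custom_fn):
    """Return both answers for one question so they can be shown side by side."""
    return {
        "question": question,
        "basic": basic_fn(question).strip(),
        "custom": custom_fn(question).strip(),
    }


# Stubs standing in for the real completion calls (no network access needed).
def stub_basic(question):
    return " I am not sure. "


def stub_custom(question):
    return " Answer drawn from the supplied context. "


result = compare_answers("Example question?", stub_basic, stub_custom)
print(result)
```

In the notebook, the stubs would be replaced by functions that wrap the plain `Completion` call and `send_prompt` respectively and return the text of the first choice.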

### Question 1

In [13]:
prompt1 = "Which organization is responsible for Project 2025?"

In [14]:
answer1_before = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt1,
        max_tokens=2000
    )["choices"][0]["text"].strip()

answer1_before

'It is not possible to determine which organization is responsible for Project 2025 without more context. Project 2025 could be a project name used by multiple organizations, or it could be a fictitious project.'

In [15]:
answer1_after = send_prompt(prompt1)["choices"][0]["text"].strip()

answer1_after

'The Heritage Foundation.'

### Question 2

In [16]:
prompt2 = "What is the main ideology behind Project 2025?"

In [17]:
answer2_before = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt2,
        max_tokens=2000
    )["choices"][0]["text"].strip()

answer2_before

'The main ideology behind Project 2025 is to actively work towards creating a better and more sustainable future for all individuals, with a particular focus on economic development, social progress, and environmental stewardship. This ideology is based on the belief that by setting specific goals and implementing targeted strategies, significant progress can be made towards achieving a more prosperous and equitable society by the year 2025. The project aims to mobilize and empower individuals, organizations, and governments to take action and make meaningful changes to address pressing global issues and shape a more positive future for the world.'

In [18]:
answer2_after = send_prompt(prompt2)["choices"][0]["text"].strip()

answer2_after

'The main ideology behind Project 2025 is right-wing conservative policies.'