# Custom Chatbot Project

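I used the Wikipedia API to fetch the plain-text extract of the English Wikipedia page for the year 2024, which lists that year's events by date. The chatbot uses this dataset as retrieval context, so it can answer questions about events that may fall outside the completion model's training data. This demonstrates how retrieval-augmented generation (RAG) lets a large language model draw on an external news source when answering.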

In [29]:
!pip show openai

Name: openai
Version: 0.26.1
Summary: Python client library for the OpenAI API
Home-page: https://github.com/openai/openai-python
Author: OpenAI
Author-email: support@openai.com
License: 
Location: /opt/venv/lib/python3.9/site-packages
Requires: aiohttp, requests, tqdm
Required-by: 


In [52]:
import openai
import pandas as pd
import requests
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken
from dateutil.parser import parse
import re


In [116]:
OPENAI_API_KEY = "your key"
openai.api_key = OPENAI_API_KEY

In [59]:
wiki_2024 = "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2024&explaintext=1&formatversion=2&format=json"
CSV_EMBEDDINGS = './wikipedia_with_embeddings.csv'

# OpenAI Setup
EMBEDDING_MODEL_NAME = 'text-embedding-3-small'
COMPLETION_MODEL_NAME = 'gpt-3.5-turbo-instruct'

# Embedding request setup (number of rows sent per API call)
batch_size = 100

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [60]:
resp = requests.get(wiki_2024)
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

df

Unnamed: 0,text
0,"2024 (MMXXIV) is the current year, and is a le..."
1,"So far, this year has witnessed the continuati..."
2,"Approximately 79 countries, representing aroun..."
3,
4,
...,...
136,
137,== Notes ==
138,
139,


In [61]:
# Clean up: drop empty rows and section headings (lines starting with "==")
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

In [62]:
prefix = ""
for i, row in df.iterrows():
    text = row['text'].strip()
    # Try to determine if the current row is a date
    try:
        # If the text can be parsed as a date, update the prefix
        parse(text, fuzzy=False)
        prefix = text
    except ValueError:
        # If it's not a date, check if we need to prepend the stored date
        if " – " not in text and prefix:
            # Concatenate the date prefix with the event description
            df.at[i, 'text'] = prefix + " – " + text

# Filter out any rows that still do not contain " – " (indicating missing dates or improperly formatted rows)
df = df[df["text"].str.contains(" – ")].copy()
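
The date-prefixing step above can be tried on a few lines without the dataframe. This is a minimal sketch, not the notebook's code: it swaps `dateutil.parser.parse` for the standard library's `datetime.strptime` with an assumed "month name plus day" format, and `is_date_line`, `prefix_dates`, and the sample lines are made up for illustration.

```python
from datetime import datetime

SEP = " – "  # en dash separator used on the page


def is_date_line(text):
    """Return True if text looks like 'January 1' (month name plus day)."""
    try:
        # Append a year so strptime does not have to guess one
        datetime.strptime(text + " 2024", "%B %d %Y")
        return True
    except ValueError:
        return False


def prefix_dates(lines):
    """Prepend the most recent standalone date to undated event lines."""
    prefix = ""
    result = []
    for raw in lines:
        text = raw.strip()
        if is_date_line(text):
            prefix = text          # remember the date, drop the bare date row
        elif SEP in text:
            result.append(text)    # already carries its own date
        elif prefix:
            result.append(prefix + SEP + text)
    return result


# Made-up sample lines, not taken from the Wikipedia page
sample_lines = [
    "January 1 – First sample event.",
    "January 3",
    "Second sample event.",
    "Third sample event.",
]
dated = prefix_dates(sample_lines)
for line in dated:
    print(line)
```

Undated lines that appear before any date line are dropped, which matches the final filter in the cell above.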

In [64]:
df

Unnamed: 0,text
10,"January 1 – Egypt, Ethiopia, Iran and the Unit..."
11,January 1 – The Republic of Artsakh is formall...
12,January 1 – A 7.5 Mww earthquake strikes the w...
13,January 1 – Ethiopia announces an agreement wi...
14,January 2 – 2023 Marshallese general election:...
...,...
126,September – 2024 ICC Women's T20 World Cup.
127,September or October – 2024 Austrian legislati...
129,October – 2024 Botswana general election.
130,October – 2024 Georgian presidential election.


### Create Embedding

In [65]:

embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
10,"January 1 – Egypt, Ethiopia, Iran and the Unit...","[-0.05981713533401489, -0.02182900160551071, 0..."
11,January 1 – The Republic of Artsakh is formall...,"[-0.006478521972894669, -0.01933402195572853, ..."
12,January 1 – A 7.5 Mww earthquake strikes the w...,"[-0.007353253662586212, 0.03705304116010666, 0..."
13,January 1 – Ethiopia announces an agreement wi...,"[0.01613444834947586, -0.05987289920449257, 0...."
14,January 2 – 2023 Marshallese general election:...,"[0.023038113489747047, -0.020961811766028404, ..."
...,...,...
126,September – 2024 ICC Women's T20 World Cup.,"[-0.014973701909184456, -0.022901620715856552,..."
127,September or October – 2024 Austrian legislati...,"[-0.07156816869974136, -0.026048362255096436, ..."
129,October – 2024 Botswana general election.,"[0.007074444554746151, -0.0002971009525936097,..."
130,October – 2024 Georgian presidential election.,"[0.0004317939456086606, -0.03775555640459061, ..."


In [12]:
df.to_csv(CSV_EMBEDDINGS)
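
One thing to watch when reusing the saved file: CSV has no list type, so each embedding is written as the string form of a Python list and comes back as a string. The sketch below illustrates this with the standard library only, using an in-memory buffer and made-up three-dimensional vectors in place of the real file and embeddings.

```python
import ast
import csv
import io

# Made-up rows standing in for the real dataframe (tiny 3-dimensional vectors)
rows = [
    {"text": "January 1 – Sample event A.", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "January 2 – Sample event B.", "embeddings": [0.0, 0.5, -0.5]},
]

# Write to an in-memory CSV; each list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the embeddings column now holds strings, not lists
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["embeddings"]).__name__)

# Parse each string back into a list of floats
for row in loaded:
    row["embeddings"] = ast.literal_eval(row["embeddings"])
print(loaded[0]["embeddings"])
```

With pandas, the same idea would likely be `pd.read_csv(CSV_EMBEDDINGS, index_col=0)` followed by `df["embeddings"].apply(ast.literal_eval)`, assuming the file was written by the cell above.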

#### Getting relevant data from vector storage

In [66]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
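
To make the ranking concrete, here is a small pure-Python sketch of the same idea: compute the cosine distance (commonly defined as 1 minus cosine similarity) between a query vector and each row, then sort ascending. The helper names and the two-dimensional toy vectors are illustrative assumptions; real embeddings have far more dimensions and the notebook relies on `distances_from_embeddings` instead.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, rows):
    """Return (distance, text) pairs sorted from most to least relevant."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[0])


# Toy vectors invented for the example
toy_rows = [
    ("about elections", [1.0, 0.0]),
    ("about earthquakes", [0.0, 1.0]),
    ("mixed topic", [1.0, 1.0]),
]
ranked = rank_by_distance([1.0, 0.1], toy_rows)
for distance, text in ranked:
    print(f"{distance:.3f}  {text}")
```

The query vector points mostly along the first axis, so the row aligned with that axis comes first and the orthogonal row comes last.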

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [71]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know".
Give a concise answer without adding unnecessary information.
Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
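
The context-packing loop in `create_prompt` can be isolated as follows. This is a sketch under a stated assumption: `count_tokens` counts whitespace-separated words as a stand-in for the `tiktoken` encoder, and `pack_context` is a hypothetical helper name.

```python
def count_tokens(text):
    """Stand-in token counter: whitespace-separated words (an assumption)."""
    return len(text.split())


def pack_context(ranked_texts, base_tokens, max_tokens):
    """Add texts in ranked order until the next one would exceed max_tokens."""
    used = base_tokens
    context = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_tokens:
            break  # stop at the first row that does not fit
        context.append(text)
    return context


ranked_texts = [
    "one two three",            # 3 tokens
    "four five six seven",      # 4 tokens
    "eight",                    # 1 token, but never reached
]
selected = pack_context(ranked_texts, base_tokens=5, max_tokens=10)
print(selected)
```

Because the loop breaks at the first row that overflows the budget, a shorter, less relevant row further down is never considered, even if it would fit. That keeps the context strictly in relevance order at the cost of sometimes leaving part of the budget unused.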
    

In [109]:
def answer_question(
    question, df=None, max_prompt_tokens=1800, max_answer_tokens=150, custom=True
):
    """
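    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model.

    If `custom` is True, the prompt includes context retrieved from `df`;
    otherwise the bare question is sent, as a baseline for comparison.
    If the model produces an error, return an empty string.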
    """
    if custom: 
        prompt = create_prompt(question, df, max_prompt_tokens)
    
        try:
            response = openai.Completion.create(
                model=COMPLETION_MODEL_NAME,
                prompt=prompt,
                max_tokens=max_answer_tokens
            )
            return response["choices"][0]["text"].strip()
        except Exception as e:
            print(e)
            return ""
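    else:
        # Baseline: send the bare question without any retrieved context
        try:
            response = openai.Completion.create(
                model=COMPLETION_MODEL_NAME,
                prompt=f"Answer the following question: {question}",
                max_tokens=max_answer_tokens
            )
            return response["choices"][0]["text"].strip()
        except Exception as e:
            print(e)
            return ""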
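
The two branches of `answer_question` differ only in how the prompt is built, so one possible restructuring is to separate prompt construction from the guarded API call. The sketch below shows that shape offline: `build_prompt`, `safe_answer`, `fake_complete`, and `broken_complete` are hypothetical names, and the fake functions stand in for the real `openai.Completion.create` call so the example runs without a key.

```python
def build_prompt(question, context=None):
    """Return a context-grounded prompt, or a bare prompt when context is None."""
    if context is None:
        return f"Answer the following question: {question}"
    joined = "\n\n###\n\n".join(context)
    return f"Context:\n\n{joined}\n\n---\n\nQuestion: {question}\nAnswer:"


def safe_answer(complete, prompt):
    """Call complete(prompt); return its stripped text, or "" on any error."""
    try:
        return complete(prompt).strip()
    except Exception as error:
        print(error)
        return ""


def fake_complete(prompt):
    """Hypothetical stand-in for the API call: reports the prompt length."""
    return f"  received {len(prompt)} characters  "


def broken_complete(prompt):
    """Hypothetical stand-in that always fails, to exercise the error path."""
    raise RuntimeError("simulated API failure")


question = "Sample question?"
print(safe_answer(fake_complete, build_prompt(question)))
print(safe_answer(fake_complete, build_prompt(question, ["Sample context row."])))
print(repr(safe_answer(broken_complete, build_prompt(question))))
```

With a single guarded call site, both the custom and the baseline path return a string in every case, which avoids one branch silently returning `None`.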

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [104]:
df.empty

False

In [105]:
question1 = "What is the exceeded number of palestinian casualties ?"

In [106]:
# Answer from the custom query (with retrieved context)
ans1 = answer_question(question1, df)
print(ans1)

More than 30,000.


In [107]:
# Answer from the basic model (no retrieved context)
ans1 = answer_question(question1, custom=False)
print(ans1)

Object of type set is not JSON serializable



### Question 2

In [79]:
question2 = "Who is elected as prime minister of Tuvaluan"

In [80]:
ans2 = answer_question(question2, df)
print(ans2)

Feleti Teo.
