# **GPT Multi-turn Chatbot**

**Chatbot: https://huggingface.co/spaces/tclints/GPT-Multi-Turn-Chatbot**

## **Overview**
This script uses OpenAI's GPT-3.5 API (the gpt-3.5-turbo chat model) to generate concise answers from a given context and question. User and assistant messages can be stored in the conversation_history list through add_to_conversation to support multi-turn conversations; in the code shown here, generate_gpt3_answer sends each question as a fresh single-turn request. The generate_gpt3_answer function builds a prompt instructing the model to answer as concisely as possible (the prompt asks specifically for a person's name), calls the API, and truncates the response to its first sentence. This removes extra details and keeps the output short and focused.
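
As a minimal sketch of how the stored history could feed a multi-turn request, the helper below prepends a system prompt, replays the stored turns, and appends the new user message. The name `build_messages` and the follow-up question are hypothetical and not part of the original script; whether the deployed chatbot assembles its requests this way is an assumption.

```python
from typing import Dict, List

# Same system prompt string as used in this script
SYSTEM_PROMPT = "You are a helpful assistant."

def build_messages(history: List[Dict[str, str]], user_message: str) -> List[Dict[str, str]]:
    """Hypothetical helper: system prompt + prior turns + the new user turn."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history)  # replay earlier user/assistant turns in order
    messages.append({"role": "user", "content": user_message})
    return messages

# Earlier turns, in the same {"role", "content"} shape as conversation_history
history = [
    {"role": "user", "content": "What is in front of the Notre Dame Main Building?"},
    {"role": "assistant", "content": "A copper statue of Christ."},
]

messages = build_messages(history, "What is it made of?")
print([m["role"] for m in messages])  # ['system', 'user', 'assistant', 'user']
```

A list built this way has the shape expected by the `messages` argument of a chat completion call, so the model sees the earlier turns when answering the follow-up.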

In [None]:
import openai
import os
import re
from datasets import load_dataset

openai.api_key = os.getenv('OPENAI_API_KEY')

conversation_history = []

def add_to_conversation(role, content):
    """Function to add user/assistant messages to the conversation history."""
    conversation_history.append({"role": role, "content": content})

def generate_gpt3_answer(context, question):
    """Generate a concise answer using GPT-3 based on context and question."""
    prompt_message = (
        f"Answer the following question as concisely as possible, providing only the name of the person:\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:"
    )

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt_message}
        ],
        max_tokens=50,
        temperature=0.3  # Lower temperature for more consistent, less random responses
    )

    answer = response.choices[0].message.content.strip()

    # Post-processing: keep only the text before the first period to drop extra details.
    # Note: this also cuts at periods inside abbreviations (e.g. "St.").
    processed_answer = answer.split('.')[0].strip() + '.'

    return processed_answer
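
To make the post-processing step concrete, the sketch below isolates the same truncation rule in a standalone function and runs it without any API call. The function name `truncate_to_first_sentence` and the sample strings are illustrative assumptions.

```python
def truncate_to_first_sentence(answer: str) -> str:
    """Same rule as the post-processing step: text before the first period, plus a period."""
    return answer.split('.')[0].strip() + '.'

# A verbose answer is cut down to its first sentence
print(truncate_to_first_sentence("Saint Bernadette Soubirous. Extra detail follows here."))
# Saint Bernadette Soubirous.

# Edge case: a period inside an abbreviation ends the answer early
print(truncate_to_first_sentence("St. Bernadette Soubirous"))
# St.

# Edge case: an empty response becomes a lone period
print(truncate_to_first_sentence(""))
# .
```

The two edge cases suggest the rule works best when the model's answer contains no abbreviations and is never empty.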

This function, **normalize_answer(s)**, is designed to clean and standardize text for comparison purposes by applying several normalization techniques. Specifically, it performs the following steps:

**Convert to Lowercase:** The input text is converted to lowercase to ensure case-insensitive comparisons.

**Remove Punctuation:** All punctuation is removed from the text, leaving only words and spaces.

**Remove Articles:** Common articles like "a," "an," and "the" are removed to avoid unnecessary distinctions.

**Fix Whitespace:** Any extra spaces are removed, and the text is reformatted with a single space between words.

This normalization process ensures that text is simplified and consistent, making it easier to compare answers in a standardized format.

In [None]:
def normalize_answer(s):
    """Lower text and remove punctuation, articles, and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punctuation(text):
        return re.sub(r'[^\w\s]', '', text)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punctuation(lower(s))))
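
As an illustration of the four stages, the sketch below applies the same regular expressions one at a time and returns each intermediate string. The helper name `normalization_steps` is an assumption made for this example.

```python
import re

def normalization_steps(s: str):
    """Return the intermediate strings produced by each normalization stage."""
    lowered = s.lower()                                      # 1. lowercase
    no_punct = re.sub(r'[^\w\s]', '', lowered)               # 2. remove punctuation
    no_articles = re.sub(r'\b(a|an|the)\b', ' ', no_punct)   # 3. replace articles with a space
    collapsed = ' '.join(no_articles.split())                # 4. collapse whitespace
    return [lowered, no_punct, no_articles, collapsed]

for step in normalization_steps("A Copper Statue of Christ."):
    print(repr(step))
# 'a copper statue of christ.'
# 'a copper statue of christ'
# '  copper statue of christ'
# 'copper statue of christ'
```

The third stage leaves stray spaces where the article was, which is why the whitespace fix runs last.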

**compute_exact_match(prediction, truth):**

**Purpose:** Checks whether the predicted answer exactly matches the true answer.

**How it works:** Both the prediction and the truth are cleaned with normalize_answer (lowercasing and removing punctuation, articles, and extra spaces). If the cleaned prediction is identical to the cleaned truth, the function returns 1 (an exact match); otherwise it returns 0.

**compute_f1(prediction, truth):**

**Purpose:** Calculates the F1 score, which measures how well the predicted answer overlaps with the true answer.

**How it works:** Both the prediction and the truth are cleaned and split into individual words (tokens). The score is based on how many words the two share. It returns 0 if no words match; otherwise it returns the harmonic mean of precision (the fraction of predicted words that appear in the truth) and recall (the fraction of true words that appear in the prediction).
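
Written out as formulas, with $P$ for precision and $R$ for recall, the token-level score described above is:

$$
P = \frac{\text{shared tokens}}{\text{predicted tokens}}, \qquad
R = \frac{\text{shared tokens}}{\text{true tokens}}, \qquad
F_1 = \frac{2PR}{P + R}
$$

As a worked example with a hypothetical prediction, take "Bernadette Soubirous" against the true answer "Saint Bernadette Soubirous". Two tokens are shared, the prediction has two tokens, and the truth has three:

$$
P = \frac{2}{2} = 1, \qquad
R = \frac{2}{3}, \qquad
F_1 = \frac{2 \cdot 1 \cdot \frac{2}{3}}{1 + \frac{2}{3}} = \frac{4/3}{5/3} = 0.8
$$

The exact match score for this pair would be 0, which is why F1 is reported alongside it: it gives partial credit for partially correct answers.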

In [None]:
from collections import Counter

def compute_exact_match(prediction, truth):
    """Return 1 if the normalized prediction equals the normalized truth, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))

def compute_f1(prediction, truth):
    """Compute token-level F1 between the normalized prediction and truth."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Count shared tokens with multiplicity so repeated words are not undercounted
    # (a plain set intersection would score identical answers with repeats below 1.0)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return (2 * precision * recall) / (precision + recall)

**test_gpt3_on_squad** evaluates the model on SQuAD (the Stanford Question Answering Dataset) by comparing each generated answer to the example's first reference answer and averaging the Exact Match and F1 scores.
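
The averaging logic can be checked without an API key or a dataset download. The sketch below is a minimal offline harness under stated assumptions: `evaluate`, `stub_answer`, and `simple_exact_match` are hypothetical names, the stub stands in for the model call, and the simplified metric stands in for the notebook's own scoring functions.

```python
from typing import Callable, Dict, Iterable, Tuple

Example = Tuple[str, str, str]  # (context, question, true_answer)

def evaluate(
    examples: Iterable[Example],
    answer_fn: Callable[[str, str], str],
    metrics: Dict[str, Callable[[str, str], float]],
) -> Dict[str, float]:
    """Average each metric over the examples, using answer_fn to produce predictions."""
    totals = {name: 0.0 for name in metrics}
    count = 0
    for context, question, truth in examples:
        prediction = answer_fn(context, question)
        for name, metric in metrics.items():
            totals[name] += metric(prediction, truth)
        count += 1
    if count == 0:
        # Avoid dividing by zero when no examples are given
        return {name: 0.0 for name in metrics}
    return {name: total / count for name, total in totals.items()}

def stub_answer(context: str, question: str) -> str:
    """Stand-in for the model call: always returns the same answer."""
    return "Saint Bernadette Soubirous"

def simple_exact_match(prediction: str, truth: str) -> float:
    """Simplified metric for the demo: case-insensitive string equality."""
    return float(prediction.strip().lower() == truth.strip().lower())

examples = [
    ("(context omitted)", "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?", "Saint Bernadette Soubirous"),
    ("(context omitted)", "What is in front of the Notre Dame Main Building?", "a copper statue of Christ"),
]

print(evaluate(examples, stub_answer, {"exact_match": simple_exact_match}))
# {'exact_match': 0.5}
```

Passing the answer function and metrics as arguments keeps the loop testable, and the real model call and scoring functions could be swapped in for the stubs.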

In [None]:
def test_gpt3_on_squad(num_examples=5):
    """Test GPT-3 on a number of SQuAD examples and compute F1 and Exact Match scores."""

    # Load the SQuAD dataset
    squad_dataset = load_dataset("squad")

    total_f1 = 0
    total_exact_match = 0

    for i in range(num_examples):
        example = squad_dataset['train'][i]
        context = example['context']
        question = example['question']
        true_answer = example['answers']['text'][0]

        generated_answer = generate_gpt3_answer(context, question)

        exact_match = compute_exact_match(generated_answer, true_answer)
        f1 = compute_f1(generated_answer, true_answer)

        print(f"Example {i+1}:")
        print(f"Question: {question}")
        print(f"Generated Answer: {generated_answer}")
        print(f"True Answer: {true_answer}")
        print(f"Exact Match: {exact_match}, F1 Score: {f1}")
        print("-" * 50)

        total_exact_match += exact_match
        total_f1 += f1

    avg_f1 = total_f1 / num_examples
    avg_exact_match = total_exact_match / num_examples
    print(f"Average Exact Match: {avg_exact_match * 100:.2f}%")
    print(f"Average F1 Score: {avg_f1 * 100:.2f}%")

# Run the test on the first 30 examples
test_gpt3_on_squad(num_examples=30)

Example 1:
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Generated Answer: Saint Bernadette Soubirous.
True Answer: Saint Bernadette Soubirous
Exact Match: 1, F1 Score: 1.0
--------------------------------------------------
Example 2:
Question: What is in front of the Notre Dame Main Building?
Generated Answer: Copper statue of Christ.
True Answer: a copper statue of Christ
Exact Match: 1, F1 Score: 1.0
--------------------------------------------------
Example 3:
Question: The Basilica of the Sacred heart at Notre Dame is beside to which structure?
Generated Answer: The Grotto.
True Answer: the Main Building
Exact Match: 0, F1 Score: 0
--------------------------------------------------
Example 4:
Question: What is the Grotto at Notre Dame?
Generated Answer: A Marian place of prayer and reflection.
True Answer: a Marian place of prayer and reflection
Exact Match: 1, F1 Score: 1.0
--------------------------------------------------
Example 5:
Question:

## **Findings**
The gpt-3.5-turbo model performed well on this SQuAD sample, achieving an average Exact Match score of 80.00% and an average F1 Score of 84.39% across the first 30 examples of the training split. The model excelled at generating concise, factual answers to straightforward questions but had more difficulty with complex or nuanced queries. Overall, the results suggest strong question-answering capability on this small sample. More detailed analysis and insights can be found in the supporting documentation.