# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

In this project I use the **2023 Turkey–Syria earthquakes** Wikipedia page. It describes an event that had not yet happened when the GPT model was trained (training data up to 2021), so the bot cannot answer questions about this event without the additional context.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
# Some imports required for this notebook
import pandas as pd
import numpy as np
import requests

from utils import *

In [2]:
# Set the OpenAI API key by providing a valid key or a text file containing the key.
set_api_key("openaikey.txt")  #  <=== adapt this argument!!!

# Create a configuration object
config = create_config({
    'EMBEDDING_MODEL_NAME': 'text-embedding-ada-002',
    'COMPLETION_MODEL_NAME': 'text-davinci-003',
    'ENCODING': 'cl100k_base',
    'MAX_PROMPT_TOKENS': 2000,
    'MAX_RESPONSE_TOKENS': 150,
    'BATCH_SIZE': 100,
    'FROM_SCRATCH': False,  # create every intermediate result from scratch (true) or use stored data (false)
})
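
`create_config` comes from the local `utils` module, which is not shown in this notebook. The following is a minimal sketch of what such a helper could look like, assuming it only needs to expose the dictionary keys as attributes (as in `config.FROM_SCRATCH` below). The name `create_config_sketch` and the `DEFAULTS` dictionary are hypothetical and not taken from `utils`.

```python
from types import SimpleNamespace

# Hypothetical defaults; the real utils.create_config may behave differently.
DEFAULTS = {
    'MAX_PROMPT_TOKENS': 2000,
    'MAX_RESPONSE_TOKENS': 150,
    'BATCH_SIZE': 100,
    'FROM_SCRATCH': False,
}

def create_config_sketch(overrides):
    """Merge overrides into the defaults and expose the keys as attributes."""
    return SimpleNamespace(**{**DEFAULTS, **overrides})

cfg = create_config_sketch({'FROM_SCRATCH': True, 'ENCODING': 'cl100k_base'})
print(cfg.FROM_SCRATCH, cfg.BATCH_SIZE, cfg.ENCODING)
```

Attribute access keeps the later cells short (`config.FROM_SCRATCH` instead of `config['FROM_SCRATCH']`), at the cost of no validation of the key names.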

In [3]:
################
### CLEANING ###
################

if config.FROM_SCRATCH:
    # Load data from Wikipedia
    params = {
        "action": "query", 
        "prop": "extracts",
        "exlimit": 1,
        "titles": "2023_Turkey–Syria_earthquakes",
        "explaintext": 1,
        "formatversion": 2,
        "format": "json"
    }
    response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
    
    # Create dataframe and load data from Wikipedia into column "text"
    df = pd.DataFrame()
    df['text'] = response.json()['query']['pages'][0]['extract'].split('\n')

    # Perform some data cleaning steps to prepare the data appropriately
    # -- remove empty lines and headings
    # -- ...
    df = clean_data(df)
    # Store the result in a csv file
    df.to_csv('./data/results/df_preprocessed.csv')
else:
    # Load cleaned data from csv file
    df = pd.read_csv('./data/results/df_preprocessed.csv', index_col = 0)    
    
df.head()

| | text |
|---|---|
| 0 | On 6 February 2023, at 04:17 TRT (01:17 UTC), ... |
| 1 | 8 earthquake struck southern and central Turke... |
| 2 | The epicenter was 37 km (23 mi) west–northwest... |
| 3 | The earthquake had a maximum Mercalli intensit... |
| 4 | It was followed by a Mw 7. |
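
`clean_data` also lives in `utils` and is not shown here. Rows 1 and 4 of the output above ("8 earthquake struck ..." and "It was followed by a Mw 7.") suggest that the text is split into sentences at every period, which also cuts decimal numbers such as `7.8` in two. The sketch below shows one way to remove empty lines and headings and to split only at sentence-ending punctuation that is followed by whitespace. It assumes that headings in the plain-text extract look like `== Title ==`; the function name `clean_lines_sketch` is hypothetical.

```python
import re

# Assumption: headings in the plain-text extract look like "== Title ==".
HEADING = re.compile(r'^=+.*=+$')
# Split after ., ! or ? only when whitespace follows, so "7.8" stays intact.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def clean_lines_sketch(lines):
    """Drop empty lines and headings, then split the rest into sentences."""
    sentences = []
    for line in lines:
        line = line.strip()
        if not line or HEADING.match(line):
            continue
        sentences.extend(s for s in SENTENCE_END.split(line) if s)
    return sentences

raw = [
    "== Earthquake ==",
    "",
    "A Mw 7.8 earthquake struck. It was followed by a Mw 7.7 earthquake.",
]
cleaned = clean_lines_sketch(raw)
print(cleaned)
```

This rule still splits after abbreviations such as "e.g. " and would need a sentence tokenizer for anything more robust.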


In [4]:
##################
### EMBEDDINGS ###
##################

if config.FROM_SCRATCH:
    # Get embeddings for all text rows from openai and store in csv file
    df['embeddings'] = get_embeddings(df, config)
    df.to_csv('./data/results/df_embeddings.csv')
    df['embeddings'] = df['embeddings'].apply(np.array)
else:
    # Load preprocessed data with embeddings from csv file
    df = pd.read_csv('./data/results/df_embeddings.csv', index_col = 0)
    df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)    
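
The `else` branch above uses `eval` to turn the stored strings back into lists. `eval` executes arbitrary Python expressions, so when the CSV file may come from somewhere else, `ast.literal_eval` is a safer alternative because it only accepts literals. A minimal sketch follows; the helper name `parse_embedding` is hypothetical.

```python
import ast

def parse_embedding(cell):
    """Turn the string form of a list of floats back into a Python list."""
    return ast.literal_eval(cell)

vector = parse_embedding("[-0.0076, -0.0104, 0.5]")
print(vector)
```

In the notebook this helper could be passed to `.apply(...)` in place of `eval`.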

In [5]:
df.head()

| | text | embeddings |
|---|---|---|
| 0 | On 6 February 2023, at 04:17 TRT (01:17 UTC), ... | [-0.007678510621190071, -0.01049796398729086, ... |
| 1 | 8 earthquake struck southern and central Turke... | [-0.007633099798113108, -0.02849690616130829, ... |
| 2 | The epicenter was 37 km (23 mi) west–northwest... | [0.0034406911581754684, 0.013554828241467476, ... |
| 3 | The earthquake had a maximum Mercalli intensit... | [0.0018920789007097483, -0.0034654918126761913... |
| 4 | It was followed by a Mw 7. | [-0.004148250445723534, 0.0018140365136787295,... |
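
`get_embeddings` is defined in `utils` and presumably sends the rows to the embedding model in chunks of `BATCH_SIZE`. The sketch below illustrates that batching pattern only. The actual API call is replaced by an injected function (`embed_fn`) so the example runs offline; both `embed_in_batches` and `fake_embed` are hypothetical names.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on slices of at most batch_size texts and collect the vectors."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in for the API call: returns one 2-dimensional vector per text.
def fake_embed(batch):
    return [[float(len(text)), 1.0] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

Batching reduces the number of requests, and the order of the returned vectors matches the order of the input rows as long as `embed_fn` preserves the order within each batch.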


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [6]:
answer = answer_question("When was the last earthquake in Turkey?", df, config, custom=True)
print(answer)

The last earthquake in Turkey was the 8.2 magnitude earthquake that occurred in February 2023.


In [7]:
# Example question and answer
answer = answer_question("Did the earthquake in Turkey in 2023 also have an impact on Syria?", df, config, custom=True)
print(answer)

Yes, the earthquake in Turkey in 2023 had an impact on Syria, with the Syrian Ministry of Health recording over 2,248 earthquake-related deaths and 2,950 injuries in government held areas, most of which were in the governorates of Aleppo and Latakia. Additionally, the World Health Organization said up to 26 million people may have been affected; 15 million in Turkey and 11 million in Syria.
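
`answer_question` is part of `utils`, so the notebook does not show how the custom prompt is assembled. A common approach, sketched below under assumptions, is to rank the rows by cosine distance between their embedding and the question embedding, then add rows to the context until the token budget (`MAX_PROMPT_TOKENS`) is used up. The prompt wording, the function names, and the word-count token estimate are all hypothetical; a real implementation would count tokens with the tokenizer named in `ENCODING`.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def build_prompt(question, question_vec, rows, max_tokens, count_tokens):
    """Build a prompt from the rows closest to the question.

    rows is a list of (text, embedding) pairs.
    """
    ranked = sorted(rows, key=lambda row: cosine_distance(question_vec, row[1]))
    header = "Answer based on the context below.\n\nContext:\n"
    footer = f"\n\nQuestion: {question}\nAnswer:"
    budget = max_tokens - count_tokens(header) - count_tokens(footer)
    context = []
    for text, _ in ranked:
        cost = count_tokens(text)
        if cost > budget:
            break  # stop at the first row that no longer fits
        context.append(text)
        budget -= cost
    return header + "\n".join(context) + footer

# Toy data with 2-dimensional "embeddings".
rows = [
    ("Cats purr.", [0.0, 1.0]),
    ("The earthquake struck on 6 February 2023.", [1.0, 0.0]),
    ("Aftershocks followed for weeks.", [0.9, 0.1]),
]
# Crude stand-in for a tokenizer: one token per whitespace-separated word.
count_words = lambda s: len(s.split())

prompt = build_prompt("When did it strike?", [1.0, 0.0], rows, 25, count_words)
print(prompt)
```

With a budget of 25 "tokens", the two earthquake-related rows fit into the context and the unrelated row is left out.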


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [8]:
# Question with context
print(answer_question('Did any earthquake happen in Turkey in 2023?', df, config, custom=True))

Yes, two mainshocks reaching above Mw 7 occurred between 6 and 17 February 2023.


In [9]:
# Same question without context
print(answer_question('Did any earthquake happen in Turkey in 2023?', df, config, custom=False))

I don't know.


### Question 2

In [10]:
print(answer_question('Did any people die due to the earthquake in Turkey in 2023?', df, config, custom=True))

Yes, at least 50,783 people died due to the earthquake in Turkey in 2023.


In [11]:
print(answer_question('Did any people die due to the earthquake in Turkey in 2023?', df, config, custom=False))

I don't know.


## Chat Bot

In [12]:
print('Hello, what do you want to know?\n')
while True:
    question = input('You: ')
    if len(question) > 0:
        print(f'\nBot: {answer_question(question, df, config, custom=True)}', end='\n\n')
    else:
        print('\nGood bye!')
        break

Hello, what do you want to know?

You: What happend in Turkey in 2023?

Bot: In 2023, a destructive earthquake struck İzmir in Turkey; President Erdogan declared a 3-month state of emergency in 10 affected provinces; serious actions were taken to address the issue; an estimated 14 million people, or 16% of Turkey's population, were affected; Turkey sent an official request to NATO and allies for assistance; over 53,000 Turkish emergency workers were deployed to the regions affected; more than 20% of Turkey's agriculture production was affected; these earthquakes were the largest Turkish earthquakes in over 2,000 years; the Turkish government announced plans to construct 200,000 homes in the 11 affected provinces and a further 70,000 in villages; the governing alliance between the MHP and the AKP approved a state of emergency in

You: Was only Turkey impacted by this earthquake?

Bot: No, the European-Mediterranean Seismological Centre said shaking was felt in Armenia, Egypt, Palestine,