# Step 1: Preparing a Dataset with Embeddings

To build a custom chatbot, we first need to collect source text, parse it, and prepare a dataset from it. Here the source of information is Wikipedia.

You can read more about the Wikipedia API in the [MediaWiki API documentation](https://www.mediawiki.org/wiki/API:Main_page).

### Loading The Data

We are using the `requests` library ([documentation here](https://requests.readthedocs.io/en/latest/user/quickstart/)) to get the text of a page from Wikipedia using the `extracts` API feature ([documentation here](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bextracts)). You can ignore the details of the `params` being sent; the important takeaway is that **`response_dict` is a Python dictionary containing the response to our query**.
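To make the shape of `response_dict` concrete, here is a minimal sketch that walks a hand-written stand-in for the response. The name `sample_response` and its values (page id, extract text) are made up for illustration; only the nesting `query` → `pages` → first item → `extract` mirrors how the notebook indexes the real response.

```python
# Hand-written stand-in for the JSON returned when "formatversion": 2 is set.
# The pageid and the extract text are invented for this illustration.
sample_response = {
    "batchcomplete": True,
    "query": {
        "pages": [
            {
                "pageid": 1,
                "ns": 0,
                "title": "2022",
                "extract": "Intro paragraph.\n\n\n== Events ==\nJanuary 1 – An event.",
            }
        ]
    },
}

# Walk down the nesting: "query" -> "pages" (a list) -> first page -> "extract".
extract = sample_response["query"]["pages"][0]["extract"]

# Splitting on newlines gives one string per line, including the empty ones.
lines = extract.split("\n")
print(lines)
```

The empty strings and the `== Events ==` heading in `lines` are the kind of rows that get filtered out later.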

Run the cell below as-is.

In [1]:
import requests

In [61]:
# Get the Wikipedia page for "2022", since the OpenAI models' training data stops in 2021
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2022",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}

In [62]:
# Send the request and inspect the extract, split into lines
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()
response_dict["query"]["pages"][0]["extract"].split("\n")

['2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd  year of the 3rd millennium and the 21st century, and the  3rd   year of the 2020s decade.  ',
 '2022 saw the removal of nearly all COVID-19 restrictions and the reopening of international borders in most countries, and the global rollout of COVID-19 vaccines continued. The global economic recovery from the pandemic continued, though many countries experienced an ongoing inflation surge; in response, many central banks raised their interest rates to landmark levels. The world population reached eight billion people in 2022, though the year also witnessed numerous natural disasters, including two devastating Atlantic hurricanes (Fiona and Ian), and the most powerful volcano eruption of the century so far. The later part of the year also saw the first public release of ChatGPT by OpenAI starting an arms race in artificial int

### Importing into a DataFrame

Next we will load the text into a pandas DataFrame, which makes it practical to inspect and clean before generating embeddings.

The raw extract contains empty lines and section headings that should not be sent to the model, so we filter those rows out.
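The two filters used in the following cells can be previewed on a plain Python list. This is only a sketch of the idea, not the notebook's pandas code; `raw_lines` is a made-up sample.

```python
# Made-up sample lines in the same style as the Wikipedia extract.
raw_lines = [
    "2022 was a common year.",
    "",
    "== Events ==",
    "=== January ===",
    "January 1 – An example event.",
]

# Step 1: drop empty strings (the blank lines in the extract).
non_empty = [line for line in raw_lines if len(line) > 0]

# Step 2: drop wiki section headings, which start with "==".
cleaned = [line for line in non_empty if not line.startswith("==")]

print(cleaned)
```

The pandas versions below do the same thing with boolean masks (`df.text.str.len() > 0` and `~df.text.str.startswith('==')`).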

In [63]:
import pandas as pd

In [64]:
df = pd.DataFrame(
    data = response_dict["query"]["pages"][0]["extract"].split("\n"),
    columns = ['text']
)

df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,2022 saw the removal of nearly all COVID-19 re...
2,2022 was also dominated by wars and armed conf...
3,
4,
...,...
247,
248,== Nobel Prizes ==
249,
250,


In [65]:
df = df[df.text.str.len() > 0] # drop rows whose text is an empty string
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,2022 saw the removal of nearly all COVID-19 re...
2,2022 was also dominated by wars and armed conf...
5,== Events ==
8,=== January ===
...,...
241,December 24 – 2022 Fijian general election: Th...
242,December 29 – Brazilian football legend Pelé d...
245,== Deaths ==
248,== Nobel Prizes ==


In [66]:
df = df[~df.text.str.startswith('==')] # drop section-heading rows such as '== Events =='
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,2022 saw the removal of nearly all COVID-19 re...
2,2022 was also dominated by wars and armed conf...
10,January 1 – The Regional Comprehensive Econom...
11,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
238,December 17 – Leo Varadkar succeeds Micheál Ma...
239,December 19 – At the UN Biodiversity Conferenc...
240,December 21–26 – A major winter storm hits the...
241,December 24 – 2022 Fijian general election: Th...


In [67]:
df.tail(20)

Unnamed: 0,text
220,November 15
221,The world population reaches 8 billion.
222,"The 2022 G20 Bali summit in Bali, Indonesia ta..."
223,"November 16 – NASA launches Artemis 1, the fir..."
224,November 19 – The 2022 Malaysian general elect...
225,November 19–26 – The 2022 Central American and...
226,November 20–December 18 – The 2022 FIFA World ...
227,November 20 – 2022 Nepalese general election: ...
228,November 21 – A 5.6 earthquake strikes near Ci...
229,"November 30 – OpenAI releases ChatGPT, an arti..."


### Parse the rows into dated text samples

The long block of text stored under the `"extract"` key of `response_dict` was already split on the `"\n"` separator when we built the DataFrame. Some events, however, sit on their own line beneath a bare date (for example `November 15` followed by `The world population reaches 8 billion.`). The cell below carries that date forward as a prefix so that every event row reads `date – event`, and then keeps only the rows that contain ` – `.

In [70]:
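from dateutil.parser import parse

df = df.copy() # work on an independent copy so the assignments below take effect
prefix = ''
for n, row in df.iterrows():
    if ' – ' not in row['text']:
        try: # if the row's text is a date, set it as the new prefix
            parse(row['text'])
            prefix = row['text']
        except (ValueError, OverflowError): # not a date: prepend the current prefix
            # write through df.at, because the row yielded by iterrows may be a copy
            df.at[n, 'text'] = prefix + ' – ' + row['text']
df = df[df['text'].str.contains(' – ')] # keep only rows of the form "prefix – text"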
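The date-prefix logic in the cell above can also be sketched on a plain list of strings. Note the assumptions: this version swaps `dateutil` for the standard library's `datetime.strptime` with a `%B %d %Y` format, so it only recognises lines like `November 15`, and the helper names `is_bare_date` and `prefix_events` are hypothetical.

```python
from datetime import datetime

SEPARATOR = " – "  # space, en dash, space


def is_bare_date(text: str) -> bool:
    """Return True for lines such as 'November 15' (month name plus day)."""
    try:
        # A year is appended so that strptime does not fall back to 1900.
        datetime.strptime(text + " 2022", "%B %d %Y")
        return True
    except ValueError:
        return False


def prefix_events(lines):
    """Prepend the most recent bare date to lines that lack a separator."""
    result = []
    prefix = ""
    for line in lines:
        if SEPARATOR in line:
            result.append(line)  # already in "date – event" form
        elif is_bare_date(line):
            prefix = line  # remember the date; the bare date line is dropped
        else:
            result.append(prefix + SEPARATOR + line)
    return result


sample = [
    "Intro paragraph.",
    "November 15",
    "The world population reaches 8 billion.",
    "November 16 – NASA launches Artemis 1.",
]
print(prefix_events(sample))
```

Lines that appear before any date, such as the introductory paragraphs, end up with a bare separator in front of them, which is why a later cleanup step is needed.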

In [71]:
df.tail(10)

Unnamed: 0,text
229,"November 30 – OpenAI releases ChatGPT, an arti..."
233,December 2 – The G7 and Australia join the EU ...
234,December 5 – The National Ignition Facility ac...
236,December 7 – The Congress of Peru removes Pres...
237,December 7 – After substantial protests agains...
238,December 17 – Leo Varadkar succeeds Micheál Ma...
239,December 19 – At the UN Biodiversity Conferenc...
240,December 21–26 – A major winter storm hits the...
241,December 24 – 2022 Fijian general election: Th...
242,December 29 – Brazilian football legend Pelé d...


The introductory rows had no date to use as a prefix, so they now start with a bare ` – `. Strip that leading separator:

In [74]:
# Rows that start with a bare ' – ' are the ones that had no date prefix.
index_filter = df.text.str.startswith(' – ')
# ' – ' is three characters long (space, en dash, space), so slice off three.
df.loc[index_filter, 'text'] = df[index_filter].text.apply(lambda x: x[3:])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[index_filter, 'text'] = df[index_filter].text.apply(lambda x: x[3:])
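Whichever way the prefix is stripped, the slice has to cover the whole separator: ` – ` is three characters (space, en dash, space), so removing only two leaves a leading space behind. A minimal sketch in plain Python, assuming Python 3.9 or later for `str.removeprefix`:

```python
SEPARATOR = " – "  # space, en dash, space
print(len(SEPARATOR))

text = SEPARATOR + "2022 was a common year."

two_off = text[2:]                    # removes only the space and the dash
clean = text.removeprefix(SEPARATOR)  # removes the whole separator

print(repr(two_off))
print(repr(clean))
```

Using the separator's own length, or `removeprefix`, avoids hard-coding a number that is easy to get wrong.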


In [75]:
df.head()

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on S...
1,2022 saw the removal of nearly all COVID-19 r...
2,2022 was also dominated by wars and armed con...
10,January 1 – The Regional Comprehensive Econom...
11,January 2 – Abdalla Hamdok resigns as Prime Mi...


In [76]:
df.tail()

Unnamed: 0,text
238,December 17 – Leo Varadkar succeeds Micheál Ma...
239,December 19 – At the UN Biodiversity Conferenc...
240,December 21–26 – A major winter storm hits the...
241,December 24 – 2022 Fijian general election: Th...
242,December 29 – Brazilian football legend Pelé d...


In [77]:
df.reset_index(drop=True).to_csv('text.csv', index=False)

### Numeric Representation of Text

Now that the dataset is clean, we need to convert the text into numbers. We will use embeddings, which represent each piece of text as a vector of floating-point values.
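One common reason for turning text into vectors is that vectors can be compared numerically. The sketch below computes cosine similarity between made-up three-dimensional vectors; real embeddings from the model used later have 1536 dimensions, and the variable names and values here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors standing in for the embeddings of three texts.
vec_election = [0.9, 0.1, 0.0]
vec_vote = [0.8, 0.2, 0.1]
vec_volcano = [0.0, 0.2, 0.9]

print(cosine_similarity(vec_election, vec_vote))
print(cosine_similarity(vec_election, vec_volcano))
```

In this toy setup the first pair scores much higher than the second, which is the property a chatbot can use to find the rows most relevant to a question.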

In [80]:
df = pd.read_csv('text.csv')
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on S...
1,2022 saw the removal of nearly all COVID-19 r...
2,2022 was also dominated by wars and armed con...
3,January 1 – The Regional Comprehensive Econom...
4,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
174,December 17 – Leo Varadkar succeeds Micheál Ma...
175,December 19 – At the UN Biodiversity Conferenc...
176,December 21–26 – A major winter storm hits the...
177,December 24 – 2022 Fijian general election: Th...


## Creating the Embeddings Index

The code below sends every row of `df.text` to the OpenAI embeddings endpoint in a single request and receives one embedding per row.

In [82]:
import openai

openai.api_key = '' # set your OpenAI API key here

In [83]:
response = openai.Embedding.create(
    model='text-embedding-ada-002', 
    input=df.text.tolist()
)
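The cell above uses the module-level `openai.Embedding.create` interface from `openai` package versions before 1.0. In version 1.0 and later the equivalent call is made on a client object, `client.embeddings.create(...)`, and the result is read through attributes rather than dictionary keys. The sketch below wraps that call in a hypothetical helper, `embed_texts`, and exercises it with a fake client (`FakeEmbeddings`, also hypothetical) so that it runs without a network connection or API key.

```python
from types import SimpleNamespace


def embed_texts(client, texts, model="text-embedding-ada-002"):
    """Return one embedding (a list of floats) per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


class FakeEmbeddings:
    """Offline stand-in that imitates the shape of the real response."""

    def create(self, model, input):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text)), 0.0])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data, model=model)


fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vectors = embed_texts(fake_client, ["abc", "hello"])
print(vectors)

# With the real package (1.0 or later) the client would be created like this:
#   from openai import OpenAI
#   client = OpenAI()  # reads the OPENAI_API_KEY environment variable
#   vectors = embed_texts(client, df.text.tolist())
```

The remaining cells in this notebook keep the dictionary-style access of the older interface.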

In [84]:
type(response)

openai.openai_object.OpenAIObject

In [85]:
response.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [87]:
response['data'][0]

<OpenAIObject embedding at 0x1b7e8aaab80> JSON: {
  "object": "embedding",
  "index": 0,
  "embedding": [
    -0.0029914826154708862,
    -0.019716661423444748,
    -0.01627466268837452,
    -0.016673734411597252,
    -0.012246527709066868,
    -0.006522336043417454,
    -0.006120146252214909,
    0.01449130941182375,
    -0.021138355135917664,
    -0.0030008358880877495,
    0.030778443440794945,
    0.02121318131685257,
    -0.023408077657222748,
    -0.011174021288752556,
    -0.0019267704337835312,
    0.003187900874763727,
    0.01947971247136593,
    -0.009571497328579426,
    0.015850648283958435,
    0.013718106783926487,
    0.005019580014050007,
    0.004988402593880892,
    -0.0036976533010601997,
    -0.010101514868438244,
    0.010625297203660011,
    0.019292647019028664,
    0.007788143120706081,
    -0.0016602027462795377,
    0.033372413367033005,
    -0.020028436556458473,
    0.00030651394627057016,
    -0.012645600363612175,
    -0.01641184464097023,
    -0.01935500

In [88]:
response['data'][0]['embedding']

[-0.0029914826154708862,
 -0.019716661423444748,
 -0.01627466268837452,
 -0.016673734411597252,
 -0.012246527709066868,
 -0.006522336043417454,
 -0.006120146252214909,
 0.01449130941182375,
 -0.021138355135917664,
 -0.0030008358880877495,
 0.030778443440794945,
 0.02121318131685257,
 -0.023408077657222748,
 -0.011174021288752556,
 -0.0019267704337835312,
 0.003187900874763727,
 0.01947971247136593,
 -0.009571497328579426,
 0.015850648283958435,
 0.013718106783926487,
 0.005019580014050007,
 0.004988402593880892,
 -0.0036976533010601997,
 -0.010101514868438244,
 0.010625297203660011,
 0.019292647019028664,
 0.007788143120706081,
 -0.0016602027462795377,
 0.033372413367033005,
 -0.020028436556458473,
 0.00030651394627057016,
 -0.012645600363612175,
 -0.01641184464097023,
 -0.01935500092804432,
 0.0028558603953570127,
 -0.03020477667450905,
 -0.011841220781207085,
 -0.001498079625889659,
 0.005509066861122847,
 -0.01437907014042139,
 0.01079365611076355,
 0.021113412454724312,
 0.01140473

In [89]:
len(response['data'][0]['embedding'])

1536

### Creating a list of embeddings

The request above already sent all of the data from `df.text.tolist()` to the `openai.Embedding.create` function. The code below extracts the resulting vectors from `response['data']` into a list called `embeddings`, with one entry per row of the DataFrame.

In [92]:
embeddings = list(map(lambda x: x['embedding'], response['data']))
type(embeddings)

list

In [93]:
len(embeddings)

179
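The `map`/`lambda` extraction is equivalent to a list comprehension. Each item in `response['data']` also carries an `index` field, so a defensive variant can sort by it and check the count before the vectors are attached to the DataFrame. The sketch below uses a made-up `mock_response` and made-up `texts` rather than a real API result.

```python
# Made-up response in the same shape as response['data'], deliberately out of order.
mock_response = {
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ]
}
texts = ["first text", "second text"]

# Sort by the index field so that position i matches input text i.
ordered = sorted(mock_response["data"], key=lambda item: item["index"])
embeddings = [item["embedding"] for item in ordered]

# One embedding per input text is what the DataFrame assignment relies on.
if len(embeddings) != len(texts):
    raise ValueError("number of embeddings does not match number of texts")

print(embeddings)
```

The length check corresponds to the `179` shown above matching the number of rows in `df`.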

### Adding Embeddings to DataFrame and Saving as CSV

Run the cell below as-is.

In [94]:
df['embeddings'] = embeddings
df.to_csv('embeddings.csv', index=False)
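One thing to keep in mind when this file is read back: CSV has no list type, so each embedding is stored as text and comes back as a string such as `"[0.1, -0.2, 0.3]"`. The sketch below shows the round trip with the standard library's `csv` module and `ast.literal_eval`; it assumes pandas writes a list-valued cell as its string representation in the same way, and the sample row is made up.

```python
import ast
import csv
import io

# Made-up row with a short stand-in embedding.
rows = [
    {"text": "January 1 – An example event.", "embeddings": [0.1, -0.2, 0.3]},
]

# Write to an in-memory buffer; the list is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["embeddings"]))

# Parse the string back into a list of floats.
restored = ast.literal_eval(loaded[0]["embeddings"])
print(restored)
```

With pandas, the same parsing can be applied to the `embeddings` column after `pd.read_csv`, for example with `df['embeddings'].apply(ast.literal_eval)`.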