# Step 1: Preparing a Dataset with Embeddings

Add your API key to the cell below, then run it.

In [None]:
import openai
openai.api_key = "YOUR API KEY"
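
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The following is a minimal sketch, assuming the key has been exported as `OPENAI_API_KEY`. Both that variable name and the `load_api_key` helper are illustrative and not part of this exercise.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    `load_api_key` and the default variable name are assumptions made
    for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage in the cell above would then be (left commented out here):
# openai.api_key = load_api_key()
```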

## Loading the Data

We use the `requests` library ([documentation here](https://requests.readthedocs.io/en/latest/user/quickstart/)) to fetch the text of a Wikipedia page through the `extracts` feature of the MediaWiki API ([documentation here](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bextracts)). You can ignore the details of the `params` being sent. The important takeaway is that **`response_dict` is a Python dictionary containing the response to our query**.

Run the cell below as-is.

In [None]:
import requests

# Get the Wikipedia page for the 2023 Turkey–Syria earthquakes
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023_Turkey–Syria_earthquakes",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

In [None]:
response_dict

### TODO: Parse `response_dict` to get a list of text data samples

Look at the nested data structure of `response_dict` and find the key-value pair whose key is `"extract"`. Its value is a string containing a long block of text. Split this text into a list of strings using the `"\n"` separator and assign the result to the variable `text_data`.

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
text_data = response_dict["query"]["pages"][0]["extract"].split("\n")
```

</details>
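
The one-line solution assumes the page exists and contains an extract. As a hedged sketch, the same lookup can be wrapped in a helper that fails with a clearer message when it does not. The `extract_lines` name and the `sample_response` dictionary below are made up for illustration and only mimic the shape of a `formatversion=2` response.

```python
def extract_lines(response_dict):
    """Return the page extract split into lines.

    Hypothetical helper: assumes the `formatversion=2` layout used above,
    where `response_dict["query"]["pages"]` is a list of page dictionaries.
    """
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages or "extract" not in pages[0]:
        raise ValueError("No extract found in the response")
    return pages[0]["extract"].split("\n")


# Made-up stand-in for a real API response, for illustration only
sample_response = {
    "query": {
        "pages": [
            {"title": "Example", "extract": "Intro line.\n\n== Heading ==\nBody line."}
        ]
    }
}
print(extract_lines(sample_response))
```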

In [None]:
text_data = 

### Adding the Text Data to a DataFrame

Run the cell below as-is.

In [None]:
import pandas as pd

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = text_data

# Clean up dataframe to remove empty lines and headings
df = df[(
    (df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))
)].reset_index(drop=True)
df.head()
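
To see what the filter does, here is a small sketch on made-up lines (the sample strings are not taken from the Wikipedia page). It keeps rows that are non-empty and do not start with `==`, then renumbers the index.

```python
import pandas as pd

# Made-up sample lines: a paragraph, an empty line, a heading, another paragraph
toy = pd.DataFrame(
    {"text": ["First paragraph.", "", "== Background ==", "Second paragraph."]}
)

# Same filter as above: keep non-empty rows that do not start with "=="
keep = (toy["text"].str.len() > 0) & (~toy["text"].str.startswith("=="))
toy_clean = toy[keep].reset_index(drop=True)
print(toy_clean["text"].tolist())
```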

## Creating the Embeddings Index

Here is the text from the first row of our dataset. Run the cell below as-is.

In [None]:
df["text"][0]

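This code creates an embedding for that text sample and prints its first 20 numbers. It uses the `openai.Embedding.create` interface from versions of the `openai` package before 1.0. Run the cell below as-is.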

In [None]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=[df["text"][0]],
    engine=EMBEDDING_MODEL_NAME
)

# Extract and print the first 20 numbers in the embedding
response_list = response["data"]
first_item = response_list[0]
first_item_embedding = first_item["embedding"]
print(first_item_embedding[:20])

### Creating a List of Embeddings

This code sends every row of `df["text"]` to the `openai.Embedding.create` function in a single request, then extracts the returned embeddings into a list called `embeddings`.

Run the cell below as-is.

In [None]:
# Send text data to the model
response = openai.Embedding.create(
    input=df["text"].tolist(),
    engine=EMBEDDING_MODEL_NAME
)

# Extract embeddings
embeddings = [data["embedding"] for data in response["data"]]
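
A single request works for a page of this size. For larger datasets, one common approach is to send the text in batches. The sketch below is only an illustration: `embed_fn` is a hypothetical callable that takes a list of strings and returns a list of embeddings (for example, a thin wrapper around the request above), and the default `batch_size` is an arbitrary choice.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed `texts` in chunks and return one embedding per input, in order.

    `embed_fn` is a hypothetical callable (list of str -> list of embeddings).
    `batch_size=100` is an arbitrary value chosen for this sketch.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedding function so the sketch runs without an API call:
# it maps each string to a one-element "embedding" holding its length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```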

### Adding Embeddings to DataFrame and Saving as CSV

Run the cell below as-is.

In [None]:
# Add embeddings list to dataframe
df["embeddings"] = embeddings

# Save the dataframe, including the embeddings column, to a CSV file
df.to_csv("embeddings.csv")
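
One thing to keep in mind is that CSV stores each embedding as text, so after reading the file back the `embeddings` column holds strings rather than lists. Here is a minimal sketch of the round trip, using made-up two-number embeddings and an in-memory buffer in place of `embeddings.csv`.

```python
import ast
import io

import pandas as pd

# Made-up data standing in for the real text and embeddings
demo = pd.DataFrame({"text": ["a", "b"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the row index that to_csv wrote as the first column
loaded = pd.read_csv(buffer, index_col=0)
print(type(loaded["embeddings"][0]))  # each cell is a str at this point

# Parse each string back into a Python list of floats
loaded["embeddings"] = loaded["embeddings"].apply(ast.literal_eval)
print(loaded["embeddings"][0])
```

Using `ast.literal_eval` here is one option among several; it only evaluates Python literals, which is enough for a list of numbers.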

## Conclusion

You have now created and saved an embeddings index!