# Codealong Notebook

Use this notebook as your "scratch pad" as you go through the course contents. Feel free to copy any example code and tweak it to get a better understanding of how it works!

Use the **+** button or `Insert` menu to add additional code cells as needed.

In [21]:
import openai
import os

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.getenv("VOCAREUM_OPENAI_API_KEY")

## Step 0: Inspecting Non-Customized Results

Before we perform any prompt engineering, **let's ask the OpenAI model some questions and see how it answers**.

(If you encounter an `AuthenticationError` when running this code, make sure that you have added a valid API key to the cell above and executed it.)
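
One way to catch a missing key before any request is sent is to check the environment variable up front. The sketch below is an illustration, not part of the course code: `require_api_key` is a hypothetical helper name, and it assumes the key lives in `VOCAREUM_OPENAI_API_KEY` as in the cell above.

```python
import os


def require_api_key(var_name="VOCAREUM_OPENAI_API_KEY"):
    """Return the API key stored in `var_name`, or raise a clear error.

    Hypothetical helper for illustration; it is not part of the openai package.
    """
    key = os.getenv(var_name)
    # Treat an unset variable and a blank value the same way
    if not key or not key.strip():
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "add a valid API key before running the cells below."
        )
    return key.strip()
```

With this in place, the setup cell could use `openai.api_key = require_api_key()` so that a missing key fails immediately with a readable message.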

In [22]:
ukraine_prompt = """
Question: "When did Russia invade Ukraine?"
Answer:
"""
initial_ukraine_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=ukraine_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_ukraine_answer)

Russia invaded Ukraine in February 2014. This invasion led to the annexation of Crimea and ongoing conflict in eastern Ukraine.


In [23]:
twitter_prompt = """
Question: "Who owns Twitter?"
Answer:
"""
initial_twitter_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=twitter_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_twitter_answer)

Twitter is a publicly traded company and its ownership is divided among its shareholders. The majority shareholder is CEO and co-founder Jack Dorsey, who owns about 2% of the company's stock. Other major shareholders include venture capital firm Spark Capital, mutual fund companies Vanguard and BlackRock, and other technology companies like Google and Microsoft. Ultimately, the ownership of Twitter is constantly changing as shares are bought and sold on the stock market.


## Step 1

### Scrape Data from Wikipedia

In [7]:
import requests

params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2022",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
for event in response.json()['query']['pages'][0]['extract'].split('\n'):
    print(event)

2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd year of the 3rd millennium and the 21st century, and the 3rd year of the 2020s decade.
The year began with another wave in the COVID-19 pandemic, with Omicron spreading rapidly and becoming the dominant variant of the SARS-CoV-2 virus worldwide. Tracking a decrease in cases and deaths, 2022 saw the removal of most COVID-19 restrictions and the reopening of international borders in the vast majority of countries, while the global rollout of COVID-19 vaccines continued. The global economic recovery from the pandemic continued, though many countries experienced an ongoing inflation surge; in response, many central banks raised their interest rates to landmark levels. The world population reached eight billion people in 2022. The year also witnessed numerous natural disasters, including two devastating Atlantic hurricanes (

### Loading the Data with `pandas`

In [15]:
import pandas as pd

df = pd.DataFrame()
df['text'] = response.json()['query']['pages'][0]['extract'].split('\n')
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
df.tail(20)

Unnamed: 0,text
243,November 19 – The 2022 Malaysian general elect...
244,November 19–November 26 – The 2022 Central Ame...
245,November 20–December 18 – The 2022 FIFA World ...
246,November 20 – 2022 Nepalese general election: ...
247,November 21 – A 5.6 earthquake strikes near Ci...
248,"November 30 – OpenAI releases ChatGPT, an arti..."
252,December 2 – The G7 and Australia join the EU ...
253,December 5 – The National Ignition Facility ac...
254,December 7
255,The Congress of Peru removes President Pedro C...


In [18]:
from dateutil.parser import parse

prefix = ""
for i, row in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix.
            # Write through df.at: assigning to `row` may only change a copy.
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)
df.tail(20)

Unnamed: 0,text
178,"November 16 – NASA launches Artemis 1, the fir..."
179,November 19 – The 2022 Malaysian general elect...
180,November 19–November 26 – The 2022 Central Ame...
181,November 20–December 18 – The 2022 FIFA World ...
182,November 20 – 2022 Nepalese general election: ...
183,November 21 – A 5.6 earthquake strikes near Ci...
184,"November 30 – OpenAI releases ChatGPT, an arti..."
185,December 2 – The G7 and Australia join the EU ...
186,December 5 – The National Ignition Facility ac...
187,December 7 – The Congress of Peru removes Pres...
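
The prefix logic above can be tried on a small hand-made list without `pandas`. This is a minimal sketch under stated assumptions: `is_date_line` and `add_date_prefixes` are hypothetical helpers, the date check uses `datetime.strptime` with a fixed year of 2022 instead of `dateutil`, and the sample lines ("Event A" and so on) are placeholders rather than real Wikipedia text.

```python
from datetime import datetime

SEPARATOR = " – "  # en dash with spaces, as in the Wikipedia extract


def is_date_line(text):
    """Return True if `text` looks like 'Month day', e.g. 'December 7'."""
    try:
        # Append the year so that parsing does not depend on a default year
        datetime.strptime(text.strip() + " 2022", "%B %d %Y")
        return True
    except ValueError:
        return False


def add_date_prefixes(lines):
    """Attach the most recent standalone date to the undated lines after it."""
    prefix = ""
    result = []
    for line in lines:
        if SEPARATOR in line:
            result.append(line)  # already has its date prefix
        elif is_date_line(line):
            prefix = line  # remember the date; the bare date row is dropped
        else:
            result.append(prefix + SEPARATOR + line)
    return result


# Placeholder lines for illustration only
sample = [
    "December 5 – Event A",
    "December 7",
    "Event B",
    "Event C",
]
for line in add_date_prefixes(sample):
    print(line)
```

As in the notebook cell, rows that contain only a date disappear from the result, and every remaining row carries its own date.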


### Creating an Embeddings Index with `openai.Embedding`

In [30]:
response = openai.Embedding.create(
    input=df["text"].tolist(),
    model="text-embedding-ada-002",
)
print(type(response))
response.keys()

<class 'openai.openai_object.OpenAIObject'>


dict_keys(['object', 'data', 'model', 'usage'])

In [35]:
len(response['data'][0]['embedding'])

1536

In [40]:
embeddings = [ data['embedding'] for data in response['data']]
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,– 2022 (MMXXII) was a common year starting on...,"[4.6367837057914585e-05, -0.017940208315849304..."
1,– The year began with another wave in the COV...,"[-0.004722667392343283, -0.019994843751192093,..."
2,– 2022 was also dominated by wars and armed c...,"[-0.009606238454580307, -0.015301033854484558,..."
3,– The ongoing Russian invasion of Ukraine esc...,"[-0.014721134677529335, -0.007632689084857702,..."
4,January 1 – The Regional Comprehensive Econom...,"[-0.0005679309833794832, -0.02413112111389637,..."
...,...,...
193,December 24 – 2022 Fijian general election: Th...,"[-0.011724342592060566, -0.009384616278111935,..."
194,December 29 – Brazilian football legend Pelé d...,"[-0.007616756483912468, 0.004072672221809626, ..."
195,December 31 – Former Pope Benedict XVI dies at...,"[0.023532262071967125, 0.007705941330641508, -..."
196,December 7 – The world population was estimate...,"[-0.004041583277285099, -0.014363067224621773,..."


In [41]:
df.to_csv("embeddings.csv")

## Step 2

### Finding Relevant Data with Cosine Similarity
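
Cosine distance is usually defined as one minus the cosine of the angle between two vectors, so a smaller value means the two embeddings point in a more similar direction. The pure-Python sketch below illustrates that definition on toy vectors; `cosine_distance` is a hypothetical helper written for this note, not the library's implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D vectors, chosen only to show the range of the measure
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
perpendicular = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, perpendicular, opposite)
```

Vectors pointing the same way give a distance of 0, perpendicular vectors give 1, and opposite vectors give 2; vector length does not matter.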

In [42]:
import ast

import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv")
# Each embedding was saved as the text of a list; parse it back safely
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
df

Unnamed: 0.1,Unnamed: 0,text,embeddings
0,0,– 2022 (MMXXII) was a common year starting on...,"[4.6367837057914585e-05, -0.017940208315849304..."
1,1,– The year began with another wave in the COV...,"[-0.004722667392343283, -0.019994843751192093,..."
2,2,– 2022 was also dominated by wars and armed c...,"[-0.009606238454580307, -0.015301033854484558,..."
3,3,– The ongoing Russian invasion of Ukraine esc...,"[-0.014721134677529335, -0.007632689084857702,..."
4,4,January 1 – The Regional Comprehensive Econom...,"[-0.0005679309833794832, -0.02413112111389637,..."
...,...,...,...
193,193,December 24 – 2022 Fijian general election: Th...,"[-0.011724342592060566, -0.009384616278111935,..."
194,194,December 29 – Brazilian football legend Pelé d...,"[-0.007616756483912468, 0.004072672221809626, ..."
195,195,December 31 – Former Pope Benedict XVI dies at...,"[0.023532262071967125, 0.007705941330641508, -..."
196,196,December 7 – The world population was estimate...,"[-0.004041583277285099, -0.014363067224621773,..."
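
Saving to CSV turns each embedding list into its text representation, which is why the cell above has to parse the column before converting it to arrays. The following standard-library sketch shows that round trip with `ast.literal_eval`, which accepts only Python literals. The three-number vector is made up for illustration, and the `csv` module stands in for `pandas` here.

```python
import ast
import csv
import io

embedding = [0.25, -0.5, 1.0]  # made-up vector, far shorter than a real embedding

# Write one row to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["example row", embedding])

# Read it back: the embedding is now a string, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embeddings"]
print(type(raw).__name__, raw)

# literal_eval parses literals such as lists and numbers, and rejects other code
restored = ast.literal_eval(raw)
print(type(restored).__name__, restored)
```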


In [54]:
from openai.embeddings_utils import get_embedding

question = "When did Russia invade Ukraine?"
question_embeddings = get_embedding(question, engine='text-embedding-ada-002')
question_embeddings

[0.0016044961521402001,
 -0.019282648339867592,
 0.0034815892577171326,
 -0.013990121893584728,
 -0.02527659200131893,
 0.001979914726689458,
 -0.013735060580074787,
 -0.024689950048923492,
 -0.013352468609809875,
 -0.02120836079120636,
 0.022394398227334023,
 0.02462618611752987,
 -0.009048305451869965,
 -0.011873112060129642,
 -0.006309583317488432,
 -0.010489403270184994,
 0.010610557161271572,
 -0.003959829453378916,
 0.03336204215884209,
 -0.01868325285613537,
 -0.01435996126383543,
 -0.01624741591513157,
 0.0033349287696182728,
 0.0013040017802268267,
 -0.014831825159490108,
 0.006733622867614031,
 0.013735060580074787,
 -0.029102513566613197,
 0.015431219711899757,
 -0.014449232257902622,
 -0.01113981008529663,
 -0.022815249860286713,
 -0.020494189113378525,
 -0.016158144921064377,
 -0.035121966153383255,
 -0.03239280730485916,
 0.009150330908596516,
 -0.009367132559418678,
 0.015048626810312271,
 -0.004061853978782892,
 0.009507416747510433,
 0.017892561852931976,
 -0.004600671

In [55]:
from openai.embeddings_utils import distances_from_embeddings

distances = distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")
distances

[np.float64(0.29090909661562714),
 np.float64(0.28040051586260084),
 np.float64(0.1852374402882785),
 np.float64(0.13050339291125068),
 np.float64(0.2896189252189425),
 np.float64(0.2615814398449129),
 np.float64(0.23044793898777793),
 np.float64(0.21798277484102024),
 np.float64(0.17768780859353805),
 np.float64(0.2620326331892231),
 np.float64(0.2627021441424128),
 np.float64(0.2868424457643298),
 np.float64(0.27513885933329385),
 np.float64(0.25393750587907993),
 np.float64(0.26444585906224705),
 np.float64(0.23332109111578614),
 np.float64(0.2624089523129116),
 np.float64(0.23041193302820318),
 np.float64(0.27159767448924477),
 np.float64(0.2555990201834101),
 np.float64(0.24665716337806087),
 np.float64(0.20020380984369113),
 np.float64(0.26136517179444285),
 np.float64(0.250155613816749),
 np.float64(0.25322965719484214),
 np.float64(0.2830258742219385),
 np.float64(0.1213398615247856),
 np.float64(0.24856493192781903),
 np.float64(0.14702339834089473),
 np.float64(0.137522337102

In [56]:
df["distances"] = distances
df

Unnamed: 0.1,Unnamed: 0,text,embeddings,distances
0,0,– 2022 (MMXXII) was a common year starting on...,"[4.6367837057914585e-05, -0.017940208315849304...",0.290909
1,1,– The year began with another wave in the COV...,"[-0.004722667392343283, -0.019994843751192093,...",0.280401
2,2,– 2022 was also dominated by wars and armed c...,"[-0.009606238454580307, -0.015301033854484558,...",0.185237
3,3,– The ongoing Russian invasion of Ukraine esc...,"[-0.014721134677529335, -0.007632689084857702,...",0.130503
4,4,January 1 – The Regional Comprehensive Econom...,"[-0.0005679309833794832, -0.02413112111389637,...",0.289619
...,...,...,...,...
193,193,December 24 – 2022 Fijian general election: Th...,"[-0.011724342592060566, -0.009384616278111935,...",0.248067
194,194,December 29 – Brazilian football legend Pelé d...,"[-0.007616756483912468, 0.004072672221809626, ...",0.287816
195,195,December 31 – Former Pope Benedict XVI dies at...,"[0.023532262071967125, 0.007705941330641508, -...",0.293179
196,196,December 7 – The world population was estimate...,"[-0.004041583277285099, -0.014363067224621773,...",0.264046


In [60]:
shortest_distance = df.loc[df["distances"].idxmin()]
shortest_distance['text']

'March 2 – Russian invasion of Ukraine: Russia captures its first large city, the Black Sea port of Kherson, as shelling intensifies across many parts of Ukraine, including civilian areas.'

In [61]:
df.sort_values("distances").head(10)

Unnamed: 0.1,Unnamed: 0,text,embeddings,distances
38,38,March 2 – Russian invasion of Ukraine: Russia ...,"[0.0007103696698322892, -0.018340742215514183,...",0.109154
61,61,April 3 – Russian invasion of Ukraine: As Russ...,"[-0.012136607430875301, -0.012402704916894436,...",0.111381
174,174,November 11 – Russian invasion of Ukraine: Ukr...,"[-0.012362104840576649, -0.014023439958691597,...",0.115558
145,145,September 21 – Russian invasion of Ukraine: Fo...,"[-0.025618210434913635, -0.022040605545043945,...",0.116926
87,87,May 16 – Russian invasion of Ukraine: The Sieg...,"[-0.018358226865530014, -0.006472041830420494,...",0.119257
157,157,October 8 – Russian invasion of Ukraine: An ex...,"[-0.013957943767309189, -0.008272825740277767,...",0.121102
26,26,February 21 – February 24 – Russian President ...,"[-0.011927724815905094, -0.006975589320063591,...",0.12134
33,33,February 28 – Russian invasion of Ukraine: Rus...,"[-0.011818074621260166, -0.008877783082425594,...",0.126049
37,37,March 1 – Russian invasion of Ukraine: In an e...,"[-0.007628186140209436, -0.01614471897482872, ...",0.126218
49,49,March 9 – Russian invasion of Ukraine: Russia ...,"[0.0010949264978989959, -0.010443422012031078,...",0.126793
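
In the table above, row 26, which describes the events of February 21 to February 24, is only the seventh-closest match. A prompt built from the single nearest row would therefore miss it. The sketch below ranks the ten (row, distance) pairs copied from that table; `top_rows` is a hypothetical helper, and `heapq.nsmallest` is one of several ways to take the k closest rows.

```python
import heapq

# Row index -> cosine distance, copied from the sorted table above
distances_by_row = {
    38: 0.109154,
    61: 0.111381,
    174: 0.115558,
    145: 0.116926,
    87: 0.119257,
    157: 0.121102,
    26: 0.12134,
    33: 0.126049,
    37: 0.126218,
    49: 0.126793,
}


def top_rows(k):
    """Return the indices of the k rows with the smallest distance."""
    closest = heapq.nsmallest(k, distances_by_row.items(), key=lambda item: item[1])
    return [row for row, _distance in closest]


print(top_rows(1))
print(26 in top_rows(6), 26 in top_rows(7))
```

Taking several of the closest rows, rather than only the closest one, gives the model a better chance of seeing the row that actually contains the answer.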


In [62]:
df.iloc[26]["text"]

"February 21 – February 24 – Russian President Vladimir Putin signs a decree declaring the Luhansk People's Republic and Donetsk People's Republic as independent from Ukraine, and, despite international condemnation and sanctions, begins a full-scale invasion of Ukraine; at dawn on 24 February missiles strike Kyiv. Ukraine severs diplomatic relations with Russia, followed by the Federated States of Micronesia on 25 February."

In [63]:
df.sort_values("distances").to_csv("distances_sorted.csv")

## Step 3

### Tokenizing with `tiktoken`

### Composing a Custom Text Prompt
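
A custom prompt can be assembled by adding the closest rows as context until a token budget is used up. The sketch below is one possible shape for that step, not the course's reference solution: `build_prompt`, `count_tokens_roughly`, and the prompt template are all assumptions made for illustration, and the whitespace word count is a crude stand-in for a real tokenizer such as `tiktoken`.

```python
def count_tokens_roughly(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Hypothetical template; the wording is an assumption, not from the course
PROMPT_TEMPLATE = """Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't know"

Context:

{context}

---

Question: {question}
Answer:"""


def build_prompt(question, ranked_texts, max_tokens, count_tokens=count_tokens_roughly):
    """Fill the template with as many ranked texts as fit within max_tokens.

    `ranked_texts` is expected to be ordered from most to least relevant.
    """
    # Tokens already used by the template and the question themselves
    used = count_tokens(PROMPT_TEMPLATE.format(context="", question=question))
    budget = max_tokens - used
    chosen = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if cost > budget:
            break  # stop at the first text that does not fit
        chosen.append(text)
        budget -= cost
    return PROMPT_TEMPLATE.format(context="\n\n".join(chosen), question=question)
```

In the notebook, `ranked_texts` could come from `df.sort_values("distances")["text"]`, and `count_tokens` could be swapped for a function backed by a real tokenizer.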

## Step 4

### Getting a Custom Q&A Response with `openai.Completion`