# Step 2: Finding Relevant Data

### Import the Libraries and the Embeddings from Step 1

In [1]:
import numpy as np
import pandas as pd

In [22]:
df = pd.read_csv('embeddings.csv')  # pass index_col=0 to use the saved 'Unnamed: 0' column as the index

In [6]:
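import ast
# The CSV stores each embedding as a string; parse it without eval, then convert to a NumPy array.
df['embeddings'] = df['embeddings'].apply(ast.literal_eval).apply(np.array)
df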

Unnamed: 0,text,embeddings
0,2022 (MMXXII) was a common year starting on S...,"[-0.0029914826154708862, -0.019716661423444748..."
1,2022 saw the removal of nearly all COVID-19 r...,"[-0.010901058092713356, -0.02552184648811817, ..."
2,2022 was also dominated by wars and armed con...,"[-0.010935657657682896, -0.015258442610502243,..."
3,January 1 – The Regional Comprehensive Econom...,"[-0.000557306339032948, -0.02417675219476223, ..."
4,January 2 – Abdalla Hamdok resigns as Prime Mi...,"[-0.014919677749276161, 0.0010699920821934938,..."
...,...,...
174,December 17 – Leo Varadkar succeeds Micheál Ma...,"[0.008571368642151356, -0.022190673276782036, ..."
175,December 19 – At the UN Biodiversity Conferenc...,"[0.0017108714673668146, -0.009228819981217384,..."
176,December 21–26 – A major winter storm hits the...,"[-0.02456742338836193, -0.021020671352744102, ..."
177,December 24 – 2022 Fijian general election: Th...,"[-0.011603529565036297, -0.0093703493475914, -..."
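
Embeddings written to CSV come back as plain strings, which is why each value is parsed before being converted to an array. The sketch below reproduces that round trip with an in-memory buffer and two made-up three-dimensional vectors (the names `original`, `buffer`, and `loaded` are illustrative, not part of the notebook). It uses `ast.literal_eval`, which only accepts Python literals and is a safer choice than `eval` for this kind of data.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame with made-up vectors standing in for real embeddings.
original = pd.DataFrame({
    "text": ["first", "second"],
    "embeddings": [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],
})

# Write to an in-memory CSV and read it back, as a file on disk would behave.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, each cell is a string such as "[0.1, -0.2, 0.3]".
raw_type = type(loaded["embeddings"][0])

# Parse each string into a list, then convert the list to a NumPy array.
loaded["embeddings"] = loaded["embeddings"].apply(ast.literal_eval).apply(np.array)
```

The same parsing step would be needed again after reloading any CSV that contains the `embeddings` column.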


### Create Embeddings for the User's Question

We created embeddings previously for the dataset. Now we will create an embedding for just one string: the user's question.

We will assign the result to the variable `question_embeddings`. This variable should contain a list of 1,536 floating-point numbers, and the provided code displays the values so we can check that they were created correctly.
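
Reading the printed values is only a visual check. A minimal sketch of a programmatic check follows, assuming a hypothetical helper named `looks_like_embedding` and a placeholder vector of zeros in place of a real API response.

```python
import numbers


def looks_like_embedding(vector, expected_dim=1536):
    """Return True if `vector` is a list of `expected_dim` real numbers."""
    return (
        isinstance(vector, list)
        and len(vector) == expected_dim
        and all(
            isinstance(x, numbers.Real) and not isinstance(x, bool)
            for x in vector
        )
    )


# Placeholder standing in for a real embedding (hypothetical data).
placeholder = [0.0] * 1536

print(looks_like_embedding(placeholder))       # True
print(looks_like_embedding(placeholder[:10]))  # False: wrong length
```

In the notebook, the same helper could be called on `question_embeddings` once it has been created.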

In [7]:
question = 'When did Russia invade Ukraine?'

In [9]:
import openai

openai.api_key = ''  # insert your OpenAI API key here; avoid committing it to version control

In [15]:
# Note: openai.embeddings_utils ships with the legacy SDK (openai<1.0).
from openai.embeddings_utils import get_embedding

question_embeddings = get_embedding(question, engine='text-embedding-ada-002')
question_embeddings

[0.001530271372757852,
 -0.019294172525405884,
 0.003417606232687831,
 -0.013976478949189186,
 -0.025249479338526726,
 0.0020084811840206385,
 -0.013746938668191433,
 -0.02468837983906269,
 -0.013402626849710941,
 -0.021232515573501587,
 0.022354714572429657,
 0.024637369439005852,
 -0.009034977294504642,
 -0.011834098957479,
 -0.006395259406417608,
 -0.010488735511898994,
 0.010571625083684921,
 -0.003943637013435364,
 0.033385422080755234,
 -0.018643807619810104,
 -0.014244276098906994,
 -0.016322895884513855,
 0.0033506567124277353,
 0.0013134829932823777,
 -0.014818128198385239,
 0.0067076897248625755,
 0.013746938668191433,
 -0.02915167063474655,
 0.015391980297863483,
 -0.014346295036375523,
 -0.01106258761137724,
 -0.022903062403202057,
 -0.020505636930465698,
 -0.016093354672193527,
 -0.03514523431658745,
 -0.03241625055670738,
 0.009162500500679016,
 -0.009430297650396824,
 0.015085926279425621,
 -0.004020150750875473,
 0.009481307119131088,
 0.017776653170585632,
 -0.00463544

### Find the Cosine Distances

Next we will create a list, `distances`, that holds the cosine distance between `question_embeddings` and each value in the `'embeddings'` column of `df`.
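
For intuition, cosine distance is commonly defined as one minus the cosine of the angle between two vectors, so vectors pointing the same way score 0 and opposite vectors score 2. The sketch below implements that definition with NumPy on toy two-dimensional vectors. The function names `cosine_distance` and `cosine_distances` are illustrative, and this is an assumption about the usual definition, not a copy of the library's implementation.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def cosine_distances(query, candidates):
    """Distance from one query vector to each candidate vector."""
    return [cosine_distance(query, c) for c in candidates]


# Toy vectors: same direction, orthogonal, and opposite to the query.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

toy_distances = cosine_distances(query, candidates)
print(toy_distances)  # approximately [0.0, 1.0, 2.0]
```

Because only the angle matters, scaling a vector does not change its cosine distance to the query.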

In [16]:
from openai.embeddings_utils import distances_from_embeddings

distances = distances_from_embeddings(question_embeddings, df['embeddings'].tolist(), distance_metric='cosine')
distances

[0.2913450498228335,
 0.27054409757910913,
 0.1851863782454587,
 0.2891506394014042,
 0.26141364552829327,
 0.23012193385865476,
 0.21749372666396405,
 0.1775066830482639,
 0.26195003934013705,
 0.26448812941822286,
 0.2862077262079489,
 0.2535415862920277,
 0.2641409370705752,
 0.26215839021775444,
 0.23009876678848296,
 0.27268949771440465,
 0.2550207112605807,
 0.2461250520937982,
 0.19976869991525947,
 0.25930687618960824,
 0.24972941414277805,
 0.2527826427405171,
 0.2824213923157325,
 0.12146975210770361,
 0.14652538919230373,
 0.13718795583905208,
 0.13820905534040118,
 0.17933638513473316,
 0.270936814988335,
 0.12575203019520875,
 0.15583250323587594,
 0.19380166729757298,
 0.16520348593383405,
 0.1259905076459663,
 0.10720408482598554,
 0.13470400182093512,
 0.15436054391406684,
 0.1469194407552209,
 0.24168368147259234,
 0.26376856675141713,
 0.21779994302884365,
 0.2672562151673086,
 0.26990914169957647,
 0.1934276122209494,
 0.2538121542083097,
 0.12661100319996077,
 0.235

In [26]:
df['distances'] = distances
df

Unnamed: 0,text,embeddings,distances
0,2022 (MMXXII) was a common year starting on S...,"[-0.0029914826154708862, -0.019716661423444748...",0.291345
1,2022 saw the removal of nearly all COVID-19 r...,"[-0.010901058092713356, -0.02552184648811817, ...",0.270544
2,2022 was also dominated by wars and armed con...,"[-0.010935657657682896, -0.015258442610502243,...",0.185186
3,January 1 – The Regional Comprehensive Econom...,"[-0.000557306339032948, -0.02417675219476223, ...",0.289151
4,January 2 – Abdalla Hamdok resigns as Prime Mi...,"[-0.014919677749276161, 0.0010699920821934938,...",0.261414
...,...,...,...
174,December 17 – Leo Varadkar succeeds Micheál Ma...,"[0.008571368642151356, -0.022190673276782036, ...",0.268644
175,December 19 – At the UN Biodiversity Conferenc...,"[0.0017108714673668146, -0.009228819981217384,...",0.274309
176,December 21–26 – A major winter storm hits the...,"[-0.02456742338836193, -0.021020671352744102, ...",0.256902
177,December 24 – 2022 Fijian general election: Th...,"[-0.011603529565036297, -0.0093703493475914, -...",0.247695


In [27]:
df.to_csv('distances.csv', index=False)

In [28]:
pd.read_csv('distances.csv')

Unnamed: 0,text,embeddings,distances
0,2022 (MMXXII) was a common year starting on S...,"[-0.0029914826154708862, -0.019716661423444748...",0.291345
1,2022 saw the removal of nearly all COVID-19 r...,"[-0.010901058092713356, -0.02552184648811817, ...",0.270544
2,2022 was also dominated by wars and armed con...,"[-0.010935657657682896, -0.015258442610502243,...",0.185186
3,January 1 – The Regional Comprehensive Econom...,"[-0.000557306339032948, -0.02417675219476223, ...",0.289151
4,January 2 – Abdalla Hamdok resigns as Prime Mi...,"[-0.014919677749276161, 0.0010699920821934938,...",0.261414
...,...,...,...
174,December 17 – Leo Varadkar succeeds Micheál Ma...,"[0.008571368642151356, -0.022190673276782036, ...",0.268644
175,December 19 – At the UN Biodiversity Conferenc...,"[0.0017108714673668146, -0.009228819981217384,...",0.274309
176,December 21–26 – A major winter storm hits the...,"[-0.02456742338836193, -0.021020671352744102, ...",0.256902
177,December 24 – 2022 Fijian general election: Th...,"[-0.011603529565036297, -0.0093703493475914, -...",0.247695


### Sorting by Distance

With the distances stored in the `distances` column of `df`, the code below sorts `df` to find the most related rows.

A shorter distance means greater similarity, so we sort in ascending order. Run the cell below as-is.

In [32]:
sorted_distances = df.sort_values(by='distances', ascending=True)
sorted_distances

Unnamed: 0,text,embeddings,distances
34,March 2 – Russian invasion of Ukraine: Russia ...,"[-5.313744259183295e-05, -0.019540982320904732...",0.107204
56,April 3 – Russian invasion of Ukraine: As Russ...,"[-0.012207494117319584, -0.012519340962171555,...",0.111251
160,November 11 – Russian invasion of Ukraine: Ukr...,"[-0.012295315973460674, -0.014077062718570232,...",0.115467
132,September 21 – Russian invasion of Ukraine: Fo...,"[-0.025522246956825256, -0.022120986133813858,...",0.116897
152,October 29 – Russian invasion of Ukraine: In r...,"[-0.00995244737714529, -0.030325081199407578, ...",0.117591
...,...,...,...
0,2022 (MMXXII) was a common year starting on S...,"[-0.0029914826154708862, -0.019716661423444748...",0.291345
55,March 31 – Expo 2020 closes in Dubai after a 6...,"[-0.0032101301476359367, -0.04666922986507416,...",0.292565
159,"November 11 – The cryptocurrency exchange FTX,...","[0.002234421204775572, -0.025721479207277298, ...",0.293966
167,November 20 – 2022 Nepalese general election: ...,"[-0.00431521050632, -0.0008002328686416149, -0...",0.294986


In [33]:
sorted_distances.to_csv('distances_sorted.csv', index=False)
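
Because an ascending sort puts the closest rows first, the most related rows are simply the head of the sorted frame. The sketch below shows this on a toy frame with made-up texts and distances (the names `toy`, `toy_sorted`, and `top_two` are illustrative).

```python
import pandas as pd

# Toy data: made-up texts with made-up distances to some question.
toy = pd.DataFrame({
    "text": ["row a", "row b", "row c", "row d"],
    "distances": [0.29, 0.11, 0.18, 0.12],
})

# Ascending order places the smallest (most similar) distances first.
toy_sorted = toy.sort_values(by="distances", ascending=True)

# The two most related rows are the first two rows after sorting.
top_two = toy_sorted.head(2)["text"].tolist()
print(top_two)  # ['row b', 'row d']
```

Note that sorting keeps the original index labels, which is why the sorted output above still shows row numbers such as 34 and 56.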