<a href="https://colab.research.google.com/github/wangzuohao/ipynb/blob/main/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI Word Embeddings, Semantic Search

Word embeddings are a way of representing words and phrases as vectors. They can be used for a variety of tasks, including semantic search, anomaly detection, and classification. In the video on OpenAI Whisper, I mentioned how words whose vectors are numerically similar are also similar in semantic meaning. In this tutorial, we will learn how to implement semantic search using OpenAI embeddings. Understanding the Embeddings concept will be crucial to the next several videos in this series since we will use it to build several practical applications.

To get started, we will need to install and import OpenAI and input an API Key. We learned how to do this in [Video 3 of this series](https://www.youtube.com/watch?v=LWYgjcZye1c).

In [None]:
# openai.embeddings_utils (imported below) ships only with openai releases before 1.0
!pip install "openai<1" -q

In [None]:
import openai
import pandas as pd
import numpy as np
from getpass import getpass

# Prompt for the key at runtime so it is never stored in the notebook
openai.api_key = getpass("Enter your OpenAI API key: ")

# Read Data File Containing Words

Now that we have configured OpenAI, let's start with a simple CSV file with familiar words. From here we'll build up to a more complex semantic search using sentences from the Fed speech. [Save the linked "words.csv" as a CSV](https://gist.github.com/hackingthemarkets/25240a55e463822d221539e79d91a8d0) and upload it to Google Colab. Once the file is uploaded, let's read it into a pandas dataframe using the code below:

In [None]:
df = pd.read_csv('words.csv')
print(df)

            text
0            red
1       potatoes
2           soda
3         cheese
4          water
5           blue
6         crispy
7      hamburger
8         coffee
9          green
10          milk
11      la croix
12        yellow
13     chocolate
14  french fries
15         latte
16          cake
17         brown
18  cheeseburger
19      espresso
20    cheesecake
21         black
22         mocha
23         fizzy
24        carbon
25        banana


# Calculate Word Embeddings

To use word embeddings for semantic search, you first compute the embeddings for a corpus of text using a word embedding algorithm. What does this mean? We are going to create a numerical representation of each of these words. To perform this computation, we'll use OpenAI's 'get_embedding' function. 

Since we have our words in a pandas dataframe, we can use "apply" to apply the get_embedding function to each row in the dataframe. We then store the calculated word embeddings in a new text file called "word_embeddings.csv" so that we don't have to call OpenAI again to perform these calculations.

In [None]:
from openai.embeddings_utils import get_embedding
get_embedding("the fox crossed the road", engine='text-embedding-ada-002')

[-0.0005114655359648168,
 0.00039769697468727827,
 -0.020240942016243935,
 0.0070919012650847435,
 -0.01388094574213028,
 0.02533903531730175,
 -0.022348321974277496,
 -0.022335704416036606,
 0.013868327252566814,
 -0.03318807855248451,
 0.029124747961759567,
 0.008833329193294048,
 0.032960936427116394,
 -0.015836898237466812,
 0.0048425570130348206,
 -0.0010726185282692313,
 0.01957213319838047,
 -0.0009803418070077896,
 0.01833546720445156,
 -0.030664270743727684,
 -0.00983654335141182,
 0.029503319412469864,
 0.002295088255777955,
 -0.023698560893535614,
 -0.003937141038477421,
 0.006423092447221279,
 0.015294278971850872,
 -0.013779993169009686,
 -0.002787230769172311,
 -0.011615827679634094,
 0.008101425133645535,
 0.001911784871481359,
 -0.02882189117372036,
 0.0005173807148821652,
 -0.013855707831680775,
 -0.01738903857767582,
 -0.0018849693005904555,
 -0.00869451928883791,
 0.011294042691588402,
 -0.027358083054423332,
 0.027156177908182144,
 0.005277913995087147,
 -0.01663189

In [None]:


df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
df.to_csv('word_embeddings.csv')
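
The save-once, reload-later idea described above can be wrapped in a small helper. This is a minimal sketch rather than part of the original notebook: `load_or_compute_embeddings` and its `embed_fn` parameter are hypothetical names, and `embed_fn` stands in for whatever function turns a string into a list of floats (for example the `get_embedding` lambda used above). A fake embedding function is used here so the example runs without calling OpenAI.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd


def load_or_compute_embeddings(df, cache_path, embed_fn):
    """Return df with an 'embedding' column, reusing cache_path if it exists.

    embed_fn is a hypothetical stand-in for the embedding call; it must map
    a string to a sequence of floats.
    """
    if os.path.exists(cache_path):
        cached = pd.read_csv(cache_path)
        # Embeddings were written in their string form, e.g. "[0.1, 0.2]".
        cached["embedding"] = cached["embedding"].apply(
            lambda s: np.array(json.loads(s))
        )
        return cached
    result = df.copy()
    result["embedding"] = result["text"].apply(
        lambda t: [float(v) for v in embed_fn(t)]
    )
    result.to_csv(cache_path, index=False)
    result["embedding"] = result["embedding"].apply(lambda v: np.array(v))
    return result


calls = []  # records every time the (fake) embedding function is invoked


def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 1.0]


words = pd.DataFrame({"text": ["red", "soda"]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_embeddings.csv")
    first = load_or_compute_embeddings(words, path, fake_embed)   # computes
    second = load_or_compute_embeddings(words, path, fake_embed)  # reads cache

print(len(calls))
```

The second call reads the CSV instead of invoking `fake_embed` again, which is the behavior we want when each embedding call costs an API request.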

# Semantic Search

Now that we have our word embeddings stored, let's load them into a new dataframe and use it for semantic search. Since the 'embedding' column in the CSV is stored as a string, we'll use apply() with eval to interpret each string as Python code, then convert the resulting list to a numpy array so that we can perform calculations on it.

In [None]:
df = pd.read_csv('word_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df

Unnamed: 0.1,Unnamed: 0,text,embedding
0,0,red,"[1.8579006791696884e-05, -0.024676261469721794..."
1,1,potatoes,"[0.005025846417993307, -0.031079445034265518, ..."
2,2,soda,"[0.025859493762254715, -0.007452284451574087, ..."
3,3,cheese,"[-0.003942061681300402, -0.009351087734103203,..."
4,4,water,"[0.019031280651688576, -0.01257743313908577, 0..."
5,5,blue,"[0.005434895399957895, -0.0072994716465473175,..."
6,6,crispy,"[-0.0010056837927550077, -0.005415474995970726..."
7,7,hamburger,"[-0.013204505667090416, -0.0018326942808926105..."
8,8,coffee,"[-0.0007566261338070035, -0.0194522924721241, ..."
9,9,green,"[0.01538460049778223, -0.010931522585451603, 0..."
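
The conversion above uses `eval`, which runs whatever code the string contains. That is fine for a file we just wrote ourselves. For a CSV from an unknown source, one alternative is `ast.literal_eval`, which only accepts Python literals. The sketch below uses a tiny in-memory CSV (`csv_text` and `toy_df` are made-up names) in place of word_embeddings.csv.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy stand-in for word_embeddings.csv: the embedding column holds strings.
csv_text = 'text,embedding\nred,"[0.1, -0.2, 0.3]"\nblue,"[0.0, 0.5, -0.5]"\n'
toy_df = pd.read_csv(io.StringIO(csv_text))

# literal_eval parses lists, numbers, strings and other literals only,
# so a cell containing a function call raises an error instead of running.
toy_df["embedding"] = toy_df["embedding"].apply(
    lambda s: np.array(ast.literal_eval(s))
)

print(type(toy_df["embedding"][0]), toy_df["embedding"][0].shape)
```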


Let's now prompt ourselves for a search term that isn't in the dataframe. We'll use word embeddings to perform a semantic search for the words that are most similar to the word we entered. I'll first try the word "hot dog". Then we'll come back and try the word "yellow".

In [None]:
search_term = input('Enter a search term: ')


Enter a search term: hot dogs


Now that we have a search term, let's calculate an embedding or vector for that search term using the OpenAI get_embedding function.

In [None]:
# semantic search
search_term_vector = get_embedding(search_term, engine="text-embedding-ada-002")
search_term_vector

[-0.009838630445301533,
 -0.0025951541028916836,
 -0.01048901304602623,
 -0.032621148973703384,
 -0.014984305016696453,
 0.015264862217009068,
 -0.032493624836206436,
 -0.007147832307964563,
 -0.004839611705392599,
 -0.02823425643146038,
 0.02831077203154564,
 0.03468707203865051,
 -0.021590150892734528,
 0.0016737787518650293,
 -0.012401903048157692,
 0.015392388217151165,
 0.034125957638025284,
 -0.005607955623418093,
 0.021284088492393494,
 -0.00023950976901687682,
 -0.010106435045599937,
 0.0032519130036234856,
 0.018197959288954735,
 -0.017777124419808388,
 0.009379536844789982,
 -0.0035930450540035963,
 0.00856974720954895,
 -0.013466745615005493,
 -0.011260545812547207,
 -0.010055424645543098,
 0.03690602257847786,
 -0.004772660322487354,
 -0.021845202893018723,
 -0.0028055720031261444,
 0.006357171107083559,
 -0.009436924010515213,
 -0.005914018023759127,
 -0.0018889788771048188,
 0.01840200088918209,
 0.005101040005683899,
 -0.0075941733084619045,
 0.003838532604277134,
 0.003

Once we have a vector representing that word, we can see how similar it is to other words in our dataframe by calculating the cosine similarity of our search term's word vector to each word embedding in our dataframe.

In [None]:
from openai.embeddings_utils import cosine_similarity

df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))

df

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,red,"[1.8579006791696884e-05, -0.024676261469721794...",0.798395
1,1,potatoes,"[0.005025846417993307, -0.031079445034265518, ...",0.848055
2,2,soda,"[0.025859493762254715, -0.007452284451574087, ...",0.822913
3,3,cheese,"[-0.003942061681300402, -0.009351087734103203,...",0.840856
4,4,water,"[0.019031280651688576, -0.01257743313908577, 0...",0.798258
5,5,blue,"[0.005434895399957895, -0.0072994716465473175,...",0.779419
6,6,crispy,"[-0.0010056837927550077, -0.005415474995970726...",0.814716
7,7,hamburger,"[-0.013204505667090416, -0.0018326942808926105...",0.881941
8,8,coffee,"[-0.0007566261338070035, -0.0194522924721241, ...",0.796697
9,9,green,"[0.01538460049778223, -0.010931522585451603, 0...",0.778072


# Sorting By Similarity

Now that we have calculated the similarities to each term in our dataframe, we simply sort the similarity values to find the terms that are most similar to the term we searched for. Notice how the foods rank as most similar to "hot dogs". Not only that, fast food items (hamburger, french fries, cheeseburger) rank closest of all. Some colors are also ranked closer to "hot dogs" than others. Let's go back and try the word "yellow" and walk through the results.

In [None]:
df.sort_values("similarities", ascending=False).head(20)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
7,7,hamburger,"[-0.013204505667090416, -0.0018326942808926105...",0.881941
14,14,french fries,"[0.0014476682990789413, -0.016491735354065895,...",0.863672
18,18,cheeseburger,"[-0.018216600641608238, 0.005054354667663574, ...",0.859472
1,1,potatoes,"[0.005025846417993307, -0.031079445034265518, ...",0.848055
3,3,cheese,"[-0.003942061681300402, -0.009351087734103203,...",0.840856
13,13,chocolate,"[0.0015591585543006659, -0.013005273416638374,...",0.836868
2,2,soda,"[0.025859493762254715, -0.007452284451574087, ...",0.822913
10,10,milk,"[0.0009292512550018728, -0.019319288432598114,...",0.818677
6,6,crispy,"[-0.0010056837927550077, -0.005415474995970726...",0.814716
20,20,cheesecake,"[0.011245746165513992, -0.012743037194013596, ...",0.812682
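
The apply-then-sort approach above scores one row at a time. With many rows, the same ranking can be computed in one step by stacking the embeddings into a matrix. This is a rough sketch on made-up three-dimensional vectors (real text-embedding-ada-002 vectors have 1536 dimensions); `words`, `matrix` and `query` are illustrative names, not part of the notebook.

```python
import numpy as np

# Toy 3-dimensional "embeddings", one row per word.
words = ["hamburger", "french fries", "blue"]
matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])
query = np.array([1.0, 0.2, 0.0])  # stands in for the search term vector

# Cosine similarity of the query against every row at once:
# dot products divided by the product of the vector lengths.
scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))

# argsort is ascending, so reverse it to list the most similar word first.
ranking = [words[i] for i in np.argsort(scores)[::-1]]
print(ranking)
```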


# Adding Words Together

What's even more interesting is that we can add word vectors together. What happens when we add the numbers for milk and espresso, then search for the word vector most similar to milk + espresso? Let's make a copy of the original dataframe and call it food_df, and operate on this copy. We'll add milk + espresso and store the result in milk_espresso_vector.

In [None]:
food_df = df.copy()

milk_vector = food_df['embedding'][10]
espresso_vector = food_df['embedding'][19]

milk_espresso_vector = milk_vector + espresso_vector
milk_espresso_vector

array([-0.02157659, -0.03206679, -0.01620988, ..., -0.00423221,
        0.00078145, -0.02898556])

Now let's find the words most similar to milk + espresso. If you have never done this before, it's pretty surprising that you can add words together like this and find similar words using numbers.

In [None]:
food_df["similarities"] = food_df['embedding'].apply(lambda x: cosine_similarity(x, milk_espresso_vector))
food_df.sort_values("similarities", ascending=False)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
19,19,espresso,"[-0.02250584401190281, -0.012747502885758877, ...",0.960501
10,10,milk,"[0.0009292512550018728, -0.019319288432598114,...",0.960501
15,15,latte,"[-0.015634099021553993, -0.003942839801311493,...",0.922975
22,22,mocha,"[-0.012473775073885918, -0.026152553036808968,...",0.899301
8,8,coffee,"[-0.0007566261338070035, -0.0194522924721241, ...",0.895382
3,3,cheese,"[-0.003942061681300402, -0.009351087734103203,...",0.885276
13,13,chocolate,"[0.0015591585543006659, -0.013005273416638374,...",0.88344
2,2,soda,"[0.025859493762254715, -0.007452284451574087, ...",0.874156
4,4,water,"[0.019031280651688576, -0.01257743313908577, 0...",0.866049
7,7,hamburger,"[-0.013204505667090416, -0.0018326942808926105...",0.852628
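
Notice that espresso and milk tie at exactly the same similarity to their own sum. That is what we would expect when both vectors have the same length: cos(a, a+b) works out to (|a|² + a·b) / (|a| |a+b|), and swapping a and b gives the same number when |a| = |b|. The sketch below checks this on random unit vectors. It assumes the embeddings are unit length, which is how OpenAI describes its embeddings but is not something this notebook verifies. The helper names `unit` and `cos_sim` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)


def cos_sim(a, b):
    """Cosine similarity: dot product over the product of lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Two random unit vectors standing in for the milk and espresso embeddings.
a = unit(rng.normal(size=8))
b = unit(rng.normal(size=8))
total = a + b

sim_a = cos_sim(a, total)
sim_b = cos_sim(b, total)
print(sim_a, sim_b)
```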


# Microsoft Earnings Call Transcript

Let's tie this back to finance. I have attached some text from a recent [Microsoft earnings call here](https://gist.github.com/hackingthemarkets/1c827a7750384fcf52c84594ef216a2d). Click on "raw" and save the file as a CSV. Upload it to Google Colab as microsoft-earnings.csv. Let's use what we just learned to perform a semantic search on sentences in the Microsoft earnings call. We'll start by reading the paragraphs into a pandas dataframe.

In [None]:
earnings_df = pd.read_csv('microsoft-earnings.csv')
earnings_df

Unnamed: 0,text
0,"Thank you, Brett. To start, I want to outline ..."
1,"With that context, this quarter, the Microsoft..."
2,It helps them align their spend with demand an...
3,We are the platform of choice for customers' S...
4,Now to data and AI. With our Microsoft Intelli...
5,"Cosmos DB now supports postscript SQL, making ..."
6,"All of, Azure ML revenue has increased more th..."
7,And GitHub's developer-first ethos has never b...
8,Now on to Power Platform. We are helping custo...
9,Power Automate has more than seven million mon...


Once we have the dataframe, we'll once again compute the embeddings for each line in our CSV file.

In [None]:
earnings_df['embedding'] = earnings_df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
earnings_df.to_csv('earnings-embeddings.csv')

If you download the earnings-embeddings.csv file locally and open it up, you'll see that our embeddings are for entire paragraphs, not just words. This means that we'll be able to find similar sentences even if there isn't an exact match for the string we search for. We are searching on meaning.

In [None]:
earnings_search = input("Search earnings for a sentence:")

Search earnings for a sentence:I am a AI engineer, would you please share some AI knowledge?


In [None]:

earnings_search_vector = get_embedding(earnings_search, engine="text-embedding-ada-002")
earnings_search_vector

[-0.00698842853307724,
 -0.01678837090730667,
 0.012907405383884907,
 -0.01580635830760002,
 0.0014864703407511115,
 0.02163117006421089,
 -0.01494541671127081,
 0.013627098873257637,
 -0.03112843818962574,
 -0.015671836212277412,
 0.023837333545088768,
 -0.0035312077961862087,
 0.018644776195287704,
 -0.009322388097643852,
 -0.00040398698183707893,
 0.00174710713326931,
 0.023312697187066078,
 -0.001457884325645864,
 0.005875257309526205,
 0.001779056154191494,
 -0.011387304402887821,
 0.014380423352122307,
 0.008777573704719543,
 -0.007876275107264519,
 -0.0013494258746504784,
 0.0050277672708034515,
 0.029810119420289993,
 -0.014783989638090134,
 0.0069278934970498085,
 -0.03656313568353653,
 0.048885367810726166,
 0.0077215745113790035,
 -0.01405757013708353,
 0.004563665483146906,
 -0.004193729721009731,
 0.02484625019133091,
 0.016748014837503433,
 -2.2037206690583844e-06,
 0.03155890852212906,
 -0.007479434367269278,
 0.0359981395304203,
 0.019586432725191116,
 0.013156271539628

In [None]:

earnings_df["similarities"] = earnings_df['embedding'].apply(lambda x: cosine_similarity(x, earnings_search_vector))

earnings_df


Unnamed: 0,text,embedding,similarities
0,"Thank you, Brett. To start, I want to outline ...","[-0.009504559449851513, -0.003731543431058526,...",0.723151
1,"With that context, this quarter, the Microsoft...","[-0.0016425022622570395, -0.028921114280819893...",0.710276
2,It helps them align their spend with demand an...,"[0.008828130550682545, -0.03199512138962746, 0...",0.717725
3,We are the platform of choice for customers' S...,"[0.011994918808341026, -0.024179909378290176, ...",0.728586
4,Now to data and AI. With our Microsoft Intelli...,"[-0.004754434805363417, 0.0038801338523626328,...",0.76151
5,"Cosmos DB now supports postscript SQL, making ...","[-0.004492022562772036, -0.005987092852592468,...",0.774238
6,"All of, Azure ML revenue has increased more th...","[-0.012585409916937351, -0.02354775369167328, ...",0.737878
7,And GitHub's developer-first ethos has never b...,"[-0.0005771665018983185, -0.019794421270489693...",0.73724
8,Now on to Power Platform. We are helping custo...,"[-0.0214706901460886, -0.013261307962238789, 0...",0.730131
9,Power Automate has more than seven million mon...,"[-0.025379547849297523, -0.03403877094388008, ...",0.767758


In [None]:
earnings_df.sort_values("similarities", ascending=False)

Unnamed: 0,text,embedding,similarities
5,"Cosmos DB now supports postscript SQL, making ...","[-0.004492022562772036, -0.005987092852592468,...",0.774238
9,Power Automate has more than seven million mon...,"[-0.025379547849297523, -0.03403877094388008, ...",0.767758
4,Now to data and AI. With our Microsoft Intelli...,"[-0.004754434805363417, 0.0038801338523626328,...",0.76151
12,Our cloud for sustainability is off to a fast ...,"[0.008903194218873978, -0.01629571244120598, 0...",0.742022
6,"All of, Azure ML revenue has increased more th...","[-0.012585409916937351, -0.02354775369167328, ...",0.737878
7,And GitHub's developer-first ethos has never b...,"[-0.0005771665018983185, -0.019794421270489693...",0.73724
23,"And with our acquisition of EduBrite, they wil...","[-0.03107352927327156, -0.014467155560851097, ...",0.735756
19,"Accenture, for example, has deployed Windows 1...","[0.01380225084722042, -0.027254054322838783, 0...",0.734656
26,And with PromoteIQ we offer an omnichannel med...,"[-0.017536817118525505, -0.010471336543560028,...",0.733878
22,We once again saw record engagement among our ...,"[-0.04005807638168335, -0.01202403288334608, 0...",0.731632
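
The score-and-sort steps we keep repeating can be folded into one helper. The following is only a sketch: `semantic_search` is a hypothetical function name, the toy dataframe uses made-up two-dimensional vectors, and in the notebook the query vector would come from `get_embedding`.

```python
import numpy as np
import pandas as pd


def semantic_search(df, query_vector, top_n=3):
    """Return the top_n rows of df ranked by cosine similarity to query_vector.

    Assumes df has an 'embedding' column of equal-length numeric vectors.
    """
    query = np.asarray(query_vector, dtype=float)
    matrix = np.vstack(df["embedding"].tolist())
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return df.assign(similarities=scores).nlargest(top_n, "similarities")


# Made-up 2-dimensional vectors standing in for paragraph embeddings.
toy = pd.DataFrame({
    "text": ["cloud revenue grew", "new AI features", "weather was nice"],
    "embedding": [[1.0, 0.1], [0.2, 1.0], [-1.0, 0.0]],
})
top = semantic_search(toy, [0.1, 1.0], top_n=2)
print(top[["text", "similarities"]])
```

Returning a new dataframe through `assign` leaves the original untouched, so the same embeddings can be searched repeatedly with different queries.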


# Sentences of the Fed Speech

Let's use the Fed speech example once more. We'll calculate embeddings for each sentence in the November 2nd speech that we discussed in the OpenAI Whisper tutorial. Then we'll take a new sentence that isn't in our dataset and find the most similar sentences in the speech. Here is the sentence we will use to search for similarity:

"the inflation is too damn high"

As we did previously, take [the linked CSV file](https://gist.github.com/hackingthemarkets/9b55ea8b73c7f4e04b42a9f8eddb8393) and upload it to Google Colab as fed-speech.csv. We'll once again read it into a pandas dataframe.

In [None]:
fed_df = pd.read_csv('fed-speech.csv')
fed_df

Unnamed: 0,text
0,Good afternoon
1,My colleagues and I are strongly committed to ...
2,We have both the tools that we need and the re...
3,Price stability is the responsibility of the F...
4,"Without price stability, the economy does not ..."
5,"In particular, without price stability, we wil..."
6,"Today, the FOMC raised our policy interest rat..."
7,We are moving our policy stance purposefully t...
8,"In addition, we are continuing the process of ..."
9,Restoring price stability will likely require ...


We'll once again calculate the embeddings and save them in a new CSV file.

In [None]:
fed_df['embedding'] = fed_df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
fed_df.to_csv('fed-embeddings.csv')

We'll then enter the sentence that we want to find similar sentences for. The run below uses "the inflation is too damn high"; another sentence worth trying is:

"We will continue to increase interest rates and tighten monetary policy"

In [None]:
fed_sentence = input('Enter something Jerome Powell said: ')


Enter something Jerome Powell said: the inflation is too damn high


Again we'll get the vector for this sentence, find the cosine similarity, and sort by most similar.

In [None]:
fed_sentence_vector = get_embedding(fed_sentence, engine="text-embedding-ada-002")
fed_sentence_vector

[-0.00413066940382123,
 -0.011251280084252357,
 -0.005313646513968706,
 -0.02224256657063961,
 -0.012122263200581074,
 0.0024195776786655188,
 -0.03860924765467644,
 -0.005732887890189886,
 -0.016691673547029495,
 -0.0204096008092165,
 0.022372564300894737,
 0.006987363565713167,
 0.023464541882276535,
 0.006652620155364275,
 0.014026726596057415,
 0.011277279816567898,
 0.0338253416121006,
 0.007643850985914469,
 0.02031860314309597,
 -0.015677694231271744,
 0.0025706999003887177,
 0.011101783253252506,
 -0.0122522609308362,
 -0.0034319330006837845,
 -0.020214606076478958,
 -0.0012877873377874494,
 0.016340680420398712,
 -0.02594749443233013,
 -0.0051089003682136536,
 -0.002343204338103533,
 0.007513853255659342,
 -0.0077023496851325035,
 -0.03166738152503967,
 -0.0024634518194943666,
 -0.020019609481096268,
 -0.03564530611038208,
 -0.013870729133486748,
 -0.016990669071674347,
 -0.0031215641647577286,
 -0.00859933253377676,
 0.026168489828705788,
 -0.010932786390185356,
 0.0133507391

In [None]:
fed_df = pd.read_csv('fed-embeddings.csv')
fed_df['embedding'] = fed_df['embedding'].apply(eval).apply(np.array)
fed_df


Unnamed: 0.1,Unnamed: 0,text,embedding
0,0,Good afternoon,"[-0.017524775117635727, 0.02069251798093319, -..."
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,..."
2,2,We have both the tools that we need and the re...,"[0.003941578324884176, -0.015006175264716148, ..."
3,3,Price stability is the responsibility of the F...,"[0.009378707036376, -0.016561055555939674, -0...."
4,4,"Without price stability, the economy does not ...","[-0.003026996273547411, -0.014454687014222145,..."
5,5,"In particular, without price stability, we wil...","[-0.03618694841861725, -0.008898851461708546, ..."
6,6,"Today, the FOMC raised our policy interest rat...","[-0.024621201679110527, -0.02114815264940262, ..."
7,7,We are moving our policy stance purposefully t...,"[-0.025701606646180153, -0.012234759517014027,..."
8,8,"In addition, we are continuing the process of ...","[-0.03149143233895302, 0.0019273122306913137, ..."
9,9,Restoring price stability will likely require ...,"[-0.010953230783343315, -0.020290518179535866,..."


In [None]:

fed_df["similarities"] = fed_df['embedding'].apply(lambda x: cosine_similarity(x, fed_sentence_vector))

fed_df


Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,Good afternoon,"[-0.017524775117635727, 0.02069251798093319, -...",0.750047
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,...",0.826724
2,2,We have both the tools that we need and the re...,"[0.003941578324884176, -0.015006175264716148, ...",0.770154
3,3,Price stability is the responsibility of the F...,"[0.009378707036376, -0.016561055555939674, -0....",0.775339
4,4,"Without price stability, the economy does not ...","[-0.003026996273547411, -0.014454687014222145,...",0.80408
5,5,"In particular, without price stability, we wil...","[-0.03618694841861725, -0.008898851461708546, ...",0.775005
6,6,"Today, the FOMC raised our policy interest rat...","[-0.024621201679110527, -0.02114815264940262, ...",0.787081
7,7,We are moving our policy stance purposefully t...,"[-0.025701606646180153, -0.012234759517014027,...",0.812895
8,8,"In addition, we are continuing the process of ...","[-0.03149143233895302, 0.0019273122306913137, ...",0.745955
9,9,Restoring price stability will likely require ...,"[-0.010953230783343315, -0.020290518179535866,...",0.790525


In [None]:

fed_df.sort_values("similarities", ascending=False)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
24,24,The recent inflation data again have come in h...,"[-0.021040253341197968, -0.009753845632076263,...",0.871317
22,22,Inflation remains well above our longer run go...,"[-0.023937253281474113, -0.0032772799022495747...",0.869225
31,31,My colleagues and I are acutely aware that hig...,"[-0.011414038017392159, -0.01515731681138277, ...",0.847498
29,29,The longer the current amount of high inflatio...,"[-0.018355058506131172, -0.012731979601085186,...",0.847374
32,32,We are highly attentive to the risks that high...,"[-0.025864068418741226, -0.015762366354465485,...",0.833747
27,27,"Despite elevated inflation, longer term inflat...","[-0.023557519540190697, -0.024205774068832397,...",0.828953
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,...",0.826724
37,37,"It will take time, however, for the full effec...","[-0.02066067047417164, -0.018034202978014946, ...",0.826675
46,46,Reducing inflation is likely to require a sust...,"[-0.03423553332686424, -0.014666956849396229, ...",0.823727
26,26,Russia's war against Ukraine has boosted price...,"[-0.009621184319257736, -0.019101163372397423,...",0.818095


# Calculating Cosine Similarity

We used the cosine_similarity function, but how does it actually work? Cosine similarity measures how closely two vectors point in the same direction: it is the cosine of the angle between them, computed as their dot product divided by the product of their lengths. The formula is:

![](https://drive.google.com/uc?export=view&id=1cehvtx7LKuFeq_LqfnLi-gzIz1D1wSf9)
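
In case the image does not render, here is the standard cosine similarity formula written out, followed by the arithmetic for the two example vectors used in the cells below, v1 = (1, 2, 3) and v2 = (4, 5, 6):

$$
\cos\theta = \frac{\mathbf{A}\cdot\mathbf{B}}{\lVert\mathbf{A}\rVert\,\lVert\mathbf{B}\rVert}
= \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}
$$

$$
\cos\theta = \frac{1\cdot 4 + 2\cdot 5 + 3\cdot 6}{\sqrt{1+4+9}\,\sqrt{16+25+36}}
= \frac{32}{\sqrt{14}\,\sqrt{77}} \approx 0.9746
$$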

In [None]:
v1 = np.array([1,2,3])
v2 = np.array([4,5,6])

# (1 * 4) + (2 * 5) + (3 * 6)
dot_product = np.dot(v1, v2)
dot_product

32

In [None]:
# square root of (1^2 + 2^2 + 3^2) = square root of (1+4+9) = square root of 14
np.linalg.norm(v1)

3.7416573867739413

In [None]:
# square root of (4^2 + 5^2 + 6^2) = square root of (16+25+36) = square root of 77
np.linalg.norm(v2)

8.774964387392123

In [None]:
magnitude = np.linalg.norm(v1) * np.linalg.norm(v2)
magnitude

32.83291031876401

In [None]:
dot_product / magnitude

0.9746318461970762

In [None]:
from scipy import spatial

result = 1 - spatial.distance.cosine(v1, v2)

result

0.9746318461970761
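
The steps above can be collected into one small function. This is a minimal sketch with a made-up name, `cosine_sim`, and it is not necessarily how `openai.embeddings_utils.cosine_similarity` is implemented. The extra calls show two boundary cases: perpendicular vectors score about 0 and opposite vectors score about -1.

```python
import numpy as np


def cosine_sim(a, b):
    """Dot product divided by the product of the two vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


same_example = cosine_sim([1, 2, 3], [4, 5, 6])  # the v1, v2 pair from above
perpendicular = cosine_sim([1, 0], [0, 1])       # vectors at a right angle
opposite = cosine_sim([1, 2], [-1, -2])          # vectors pointing opposite ways

print(same_example, perpendicular, opposite)
```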

# Lilly China Earnings Call Transcript

Let's try the same approach on non-English text. Upload the Chinese-language version of the Lilly earnings text to Google Colab as "Lilly-earnings -cn.csv". We'll use what we just learned to perform a semantic search on its sentences, starting by reading the paragraphs into a pandas dataframe.

In [None]:
earnings_df = pd.read_csv('Lilly-earnings -cn.csv',encoding='utf-8')

earnings_df

Unnamed: 0,text
0,2?2?????2022??????????????1%??285.41??????????...
1,???????66.30?????????77%??????71.91?????????25...
2,"?????????????Verzenio??????, Mounjaro??????, J..."
3,?????????????????????????????????16%?20.61????...
4,??????????????????????????????????????????????...


Once we have the dataframe, we'll once again compute the embeddings for each line in our CSV file.

In [None]:
from openai.embeddings_utils import get_embedding
earnings_df['embedding'] = earnings_df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
earnings_df.to_csv('earnings-embeddings-cn.csv')

If you download the earnings-embeddings-cn.csv file locally and open it up, you'll see that our embeddings are for entire paragraphs, not just words. This means that we'll be able to find similar sentences even if there isn't an exact match for the string we search for. We are searching on meaning.

In [None]:
earnings_search = input("Search earnings for a sentence:")

Search earnings for a sentence:when the report was released?


In [None]:

earnings_search_vector = get_embedding(earnings_search, engine="text-embedding-ada-002")
earnings_search_vector

[-0.02198362722992897,
 -0.02233108878135681,
 -0.002921684179455042,
 -0.026714451611042023,
 -0.005111694801598787,
 -0.0008544548181816936,
 -0.003377727698534727,
 0.0028515236917883158,
 -0.026540720835328102,
 -0.0062743546441197395,
 0.033543407917022705,
 0.01377818826586008,
 -0.030255885794758797,
 -0.005953620653599501,
 -0.012421752326190472,
 0.00616076122969389,
 0.01778736151754856,
 -0.01572931930422783,
 0.0016028336249291897,
 0.0027730108704417944,
 0.0012896170374006033,
 0.006167443469166756,
 -0.009882609359920025,
 -0.000896217068657279,
 0.004096037708222866,
 -0.0008356618345715106,
 0.013644549064338207,
 -0.02867894619703293,
 -0.00700268754735589,
 -0.008486080914735794,
 -0.015101215802133083,
 -0.013450772501528263,
 -0.040412455797195435,
 0.010691125877201557,
 -0.04514328017830849,
 -0.0033226015511900187,
 -0.008165347389876842,
 -0.008439307101070881,
 0.022758733481168747,
 0.01686525158584118,
 0.010864856652915478,
 0.0013388964580371976,
 0.002051

In [None]:
from openai.embeddings_utils import cosine_similarity

earnings_df["similarities"] = earnings_df['embedding'].apply(lambda x: cosine_similarity(x, earnings_search_vector))

earnings_df


Unnamed: 0,text,embedding,similarities
0,Revenue in Q4 2022 decreased 9%. Excluding COV...,"[-0.004673336632549763, -0.020571749657392502,...",0.733356
1,Pipeline advancements included FDA approval of...,"[-0.010662214830517769, -0.0006312928162515163...",0.730091
2,"Key growth products - consisting of Verzenio, ...","[-6.474502151831985e-05, -0.02099028415977955,...",0.72726
3,Q4 2022 EPS increased 13% to $2.14 on a report...,"[-0.013158386573195457, -0.01244228333234787, ...",0.758426
4,2023 EPS guidance updated to be in the range o...,"[-0.01687677949666977, -0.026842594146728516, ...",0.75031
5,"INDIANAPOLIS, Feb. 2, 2023 /PRNewswire/ -- Eli...","[-0.020822124555706978, -0.01763373613357544, ...",0.752954
6,2023 is an inflection point for Lilly - a chan...,"[-0.021301742643117905, -0.008894026279449463,...",0.733448
7,"Anat Ashkenazi, Lilly's executive vice preside...","[-0.024574697017669678, -0.018498240038752556,...",0.746573
8,Lilly shared numerous updates recently on key ...,"[-0.02845693752169609, -0.012535544112324715, ...",0.75251
9,The U.S. Food and Drug Administration (FDA) ap...,"[-0.01572350598871708, -0.004544011317193508, ...",0.735274


In [None]:
earnings_df.sort_values("similarities", ascending=False)

Unnamed: 0,text,embedding,similarities
3,Q4 2022 EPS increased 13% to $2.14 on a report...,"[-0.013158386573195457, -0.01244228333234787, ...",0.758426
10,FDA issuance of a complete response letter for...,"[-0.0265525970607996, 0.001474871882237494, -0...",0.75432
5,"INDIANAPOLIS, Feb. 2, 2023 /PRNewswire/ -- Eli...","[-0.020822124555706978, -0.01763373613357544, ...",0.752954
8,Lilly shared numerous updates recently on key ...,"[-0.02845693752169609, -0.012535544112324715, ...",0.75251
4,2023 EPS guidance updated to be in the range o...,"[-0.01687677949666977, -0.026842594146728516, ...",0.75031
14,Positive donanemab data from the first Phase 3...,"[-0.027630707249045372, -0.01595093309879303, ...",0.750019
7,"Anat Ashkenazi, Lilly's executive vice preside...","[-0.024574697017669678, -0.018498240038752556,...",0.746573
20,For additional information on these and other ...,"[-0.018457798287272453, 0.0007494832389056683,...",0.744683
15,Plans to invest an additional $450 million and...,"[-0.012064642272889614, -0.03562004864215851, ...",0.743732
17,The fifth consecutive 15% annual increase in L...,"[-0.03042193129658699, 0.0021608711685985327, ...",0.74118


# Lilly Earnings Call Transcript

Now let's run the same search on the English text of the Lilly Q4 2022 earnings release. Upload it to Google Colab as Lilly-earnings.csv. We'll use what we just learned to perform a semantic search on its sentences, starting by reading the paragraphs into a pandas dataframe.

In [None]:
earnings_df = pd.read_csv('Lilly-earnings.csv',encoding='latin1')

earnings_df

Unnamed: 0,text
0,Revenue in Q4 2022 decreased 9%. Excluding COV...
1,Pipeline advancements included FDA approval of...
2,"Key growth products - consisting of Verzenio, ..."
3,Q4 2022 EPS increased 13% to $2.14 on a report...
4,2023 EPS guidance updated to be in the range o...
5,"INDIANAPOLIS, Feb. 2, 2023 /PRNewswire/ -- Eli..."
6,2023 is an inflection point for Lilly - a chan...
7,"Anat Ashkenazi, Lilly's executive vice preside..."
8,Lilly shared numerous updates recently on key ...
9,The U.S. Food and Drug Administration (FDA) ap...


Once we have the dataframe, we'll once again compute the embeddings for each line in our CSV file.

In [None]:
from openai.embeddings_utils import get_embedding
earnings_df['embedding'] = earnings_df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
earnings_df.to_csv('earnings-embeddings.csv')

If you download the earnings-embeddings.csv file locally and open it up, you'll see that our embeddings are for entire paragraphs, not just words. This means that we'll be able to find similar sentences even if there isn't an exact match for the string we search for. We are searching on meaning.

In [None]:
earnings_search = input("Search earnings for a sentence:")

Search earnings for a sentence:when the report was released?


In [None]:

earnings_search_vector = get_embedding(earnings_search, engine="text-embedding-ada-002")
earnings_search_vector

[-0.02198362722992897,
 -0.02233108878135681,
 -0.002921684179455042,
 -0.026714451611042023,
 -0.005111694801598787,
 -0.0008544548181816936,
 -0.003377727698534727,
 0.0028515236917883158,
 -0.026540720835328102,
 -0.0062743546441197395,
 0.033543407917022705,
 0.01377818826586008,
 -0.030255885794758797,
 -0.005953620653599501,
 -0.012421752326190472,
 0.00616076122969389,
 0.01778736151754856,
 -0.01572931930422783,
 0.0016028336249291897,
 0.0027730108704417944,
 0.0012896170374006033,
 0.006167443469166756,
 -0.009882609359920025,
 -0.000896217068657279,
 0.004096037708222866,
 -0.0008356618345715106,
 0.013644549064338207,
 -0.02867894619703293,
 -0.00700268754735589,
 -0.008486080914735794,
 -0.015101215802133083,
 -0.013450772501528263,
 -0.040412455797195435,
 0.010691125877201557,
 -0.04514328017830849,
 -0.0033226015511900187,
 -0.008165347389876842,
 -0.008439307101070881,
 0.022758733481168747,
 0.01686525158584118,
 0.010864856652915478,
 0.0013388964580371976,
 0.002051

In [None]:
from openai.embeddings_utils import cosine_similarity

earnings_df["similarities"] = earnings_df['embedding'].apply(lambda x: cosine_similarity(x, earnings_search_vector))

earnings_df


Unnamed: 0,text,embedding,similarities
0,Revenue in Q4 2022 decreased 9%. Excluding COV...,"[-0.004673336632549763, -0.020571749657392502,...",0.733356
1,Pipeline advancements included FDA approval of...,"[-0.010662214830517769, -0.0006312928162515163...",0.730091
2,"Key growth products - consisting of Verzenio, ...","[-6.474502151831985e-05, -0.02099028415977955,...",0.72726
3,Q4 2022 EPS increased 13% to $2.14 on a report...,"[-0.013158386573195457, -0.01244228333234787, ...",0.758426
4,2023 EPS guidance updated to be in the range o...,"[-0.01687677949666977, -0.026842594146728516, ...",0.75031
5,"INDIANAPOLIS, Feb. 2, 2023 /PRNewswire/ -- Eli...","[-0.020822124555706978, -0.01763373613357544, ...",0.752954
6,2023 is an inflection point for Lilly - a chan...,"[-0.021301742643117905, -0.008894026279449463,...",0.733448
7,"Anat Ashkenazi, Lilly's executive vice preside...","[-0.024574697017669678, -0.018498240038752556,...",0.746573
8,Lilly shared numerous updates recently on key ...,"[-0.02845693752169609, -0.012535544112324715, ...",0.75251
9,The U.S. Food and Drug Administration (FDA) ap...,"[-0.01572350598871708, -0.004544011317193508, ...",0.735274


In [None]:
earnings_df.sort_values("similarities", ascending=False)

Unnamed: 0,text,embedding,similarities
3,Q4 2022 EPS increased 13% to $2.14 on a report...,"[-0.013158386573195457, -0.01244228333234787, ...",0.758426
10,FDA issuance of a complete response letter for...,"[-0.0265525970607996, 0.001474871882237494, -0...",0.75432
5,"INDIANAPOLIS, Feb. 2, 2023 /PRNewswire/ -- Eli...","[-0.020822124555706978, -0.01763373613357544, ...",0.752954
8,Lilly shared numerous updates recently on key ...,"[-0.02845693752169609, -0.012535544112324715, ...",0.75251
4,2023 EPS guidance updated to be in the range o...,"[-0.01687677949666977, -0.026842594146728516, ...",0.75031
14,Positive donanemab data from the first Phase 3...,"[-0.027630707249045372, -0.01595093309879303, ...",0.750019
7,"Anat Ashkenazi, Lilly's executive vice preside...","[-0.024574697017669678, -0.018498240038752556,...",0.746573
20,For additional information on these and other ...,"[-0.018457798287272453, 0.0007494832389056683,...",0.744683
15,Plans to invest an additional $450 million and...,"[-0.012064642272889614, -0.03562004864215851, ...",0.743732
17,The fifth consecutive 15% annual increase in L...,"[-0.03042193129658699, 0.0021608711685985327, ...",0.74118
