<font color='green'>
`pip install` is the command used to install Python packages with pip, Python's package manager.
<br><br>Installing the LangChain package
</font>

In [5]:
!pip install langchain



<font color='green'>
Installing the OpenAI package, which includes the classes that we can use to communicate with OpenAI services
</font>

In [6]:
!pip install openai



## Let's use OpenAI

<font color='green'>
The Python built-in module "os" provides a way to interact with the operating system, such as accessing environment variables, working with files and directories, and executing shell commands.
<br><br>
Its environ attribute is a dictionary-like object that contains the environment variables of the current process.
<br><br>
By accessing os.environ, you can retrieve and manipulate environment variables within your Python program. For example, os.environ['VARIABLE_NAME'] returns the value of the environment variable named "VARIABLE_NAME".
<br><br>
The cell below calls load_dotenv() from the python-dotenv package, which reads key-value pairs from a .env file and adds them to os.environ. This is typically how the OpenAI API key (OPENAI_API_KEY) is made available without hard-coding it in the notebook. The call returns True when at least one variable was set.
</font>
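
<font color='green'>
As a minimal sketch of reading an environment variable safely, the helper below looks a variable up and fails with a clear message when it is missing. The helper name get_required_env and the variable EXAMPLE_SETTING are illustrative assumptions, not part of this notebook.
</font>

```python
import os


def get_required_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)  # .get returns None instead of raising KeyError
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Set a throwaway variable so the example runs anywhere (hypothetical name).
os.environ["EXAMPLE_SETTING"] = "demo-value"
print(get_required_env("EXAMPLE_SETTING"))
```

<font color='green'>
The same helper could be called with "OPENAI_API_KEY" after load_dotenv() to confirm that the key was actually loaded before any API call is made.
</font>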

In [7]:

from dotenv import load_dotenv
load_dotenv()

True

<font color='green'>
LangChain provides wrappers around the OpenAI APIs, through which we can access the services OpenAI offers.
<br>
The code snippet below imports the 'OpenAIEmbeddings' class (a wrapper around OpenAI's embedding models) from the 'embeddings' module of the 'langchain' library.

</font>

In [8]:
from langchain.embeddings import OpenAIEmbeddings

<font color='green'>
Initialize the OpenAIEmbeddings object
</font>

In [9]:
embeddings = OpenAIEmbeddings()

  warn_deprecated(


<font color='green'>
Let's read our input data and get its embedding representation, so that we can use it in later tasks
</font>

In [10]:
import pandas as pd
df = pd.read_csv('Data.csv')
print(df)

         Words
0     Elephant
1         Lion
2        Tiger
3          Dog
4      Cricket
5      Footbal
6       Tennis
7   Basketball
8        Apple
9       Orange
10      Banana


<font color='green'>
    Because our words are stored in a pandas dataframe, we can use "apply" to call embeddings.embed_query on each row. We then save the calculated word embeddings to a new CSV file called "word_embeddings.csv", so that we can reload them later rather than calling OpenAI again to repeat these computations.
    </font>

In [11]:
df['embedding'] = df['Words'].apply(lambda x: embeddings.embed_query(x))
df.to_csv('word_embeddings.csv')
print(df)

         Words                                          embedding
0     Elephant  [-0.017855134320424067, -0.008739002273680945,...
1         Lion  [-0.0015144460887108187, -0.010011047775235732...
2        Tiger  [-0.013417221828848816, -0.009594874215361256,...
3          Dog  [-0.0009933243881749653, -0.015114395874422863...
4      Cricket  [0.003939178751371586, -0.007197194694541304, ...
5      Footbal  [-0.011442505362835326, -0.008127146122306165,...
6       Tennis  [-0.0229627046214536, 0.001620174408225852, 0....
7   Basketball  [-0.012779986743709604, -0.013293189227440116,...
8        Apple  [0.014477049403261352, -0.003934278727982006, ...
9       Orange  [0.02067122263312988, -0.02922207528370551, 9....
10      Banana  [-0.012999765903696864, -0.01998321619391984, ...
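
<font color='green'>
The "compute once, then reuse the saved file" idea can be sketched with the standard library alone. In this sketch, load_or_compute_embeddings and fake_embed are hypothetical names; fake_embed stands in for embeddings.embed_query so that the example runs without calling OpenAI.
</font>

```python
import csv
import json
import os
import tempfile


def load_or_compute_embeddings(words, embed_fn, cache_path):
    """Return {word: vector}; read cache_path if it exists, else compute and save."""
    if os.path.exists(cache_path):
        with open(cache_path, newline="", encoding="utf-8") as f:
            return {row["Words"]: json.loads(row["embedding"])
                    for row in csv.DictReader(f)}

    # Cache miss: call the (possibly paid) embedding function once per word.
    vectors = {word: embed_fn(word) for word in words}
    with open(cache_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Words", "embedding"])
        writer.writeheader()
        for word, vector in vectors.items():
            writer.writerow({"Words": word, "embedding": json.dumps(vector)})
    return vectors


calls = []


def fake_embed(word):
    """Stand-in for a real embedding call; records each invocation."""
    calls.append(word)
    return [float(len(word)), 1.0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_embeddings.csv")
    first = load_or_compute_embeddings(["Apple", "Banana"], fake_embed, path)
    second = load_or_compute_embeddings(["Apple", "Banana"], fake_embed, path)

print(len(calls))       # the embed function ran only during the first pass
print(first == second)  # the cached vectors match the computed ones
```

<font color='green'>
Note that this sketch trusts the cache file as-is: if the word list changes, the file has to be deleted or the function extended to embed only the missing words.
</font>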


<font color='green'>
    Let's load the existing file, which contains the embeddings, so that we can save on charges by not hitting the API repeatedly
    </font>

In [12]:
new_df = pd.read_csv('word_embeddings.csv')
print(new_df)

    Unnamed: 0       Words                                          embedding
0            0    Elephant  [-0.017855134320424067, -0.008739002273680945,...
1            1        Lion  [-0.0015144460887108187, -0.010011047775235732...
2            2       Tiger  [-0.013417221828848816, -0.009594874215361256,...
3            3         Dog  [-0.0009933243881749653, -0.015114395874422863...
4            4     Cricket  [0.003939178751371586, -0.007197194694541304, ...
5            5     Footbal  [-0.011442505362835326, -0.008127146122306165,...
6            6      Tennis  [-0.0229627046214536, 0.001620174408225852, 0....
7            7  Basketball  [-0.012779986743709604, -0.013293189227440116,...
8            8       Apple  [0.014477049403261352, -0.003934278727982006, ...
9            9      Orange  [0.02067122263312988, -0.02922207528370551, 9....
10          10      Banana  [-0.012999765903696864, -0.01998321619391984, ...
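
<font color='green'>
One thing to watch when reloading: a CSV file stores each embedding as text, so the "embedding" column comes back as strings such as "[0.1, -0.2, ...]" rather than lists of floats. A minimal sketch of converting such a string back, using a hypothetical helper parse_embedding and a made-up toy vector:
</font>

```python
import ast


def parse_embedding(text):
    """Convert a string like "[0.1, -0.2]" back into a list of floats."""
    values = ast.literal_eval(text)  # safely evaluates a Python literal
    return [float(v) for v in values]


raw = "[0.1, -0.2, 0.3]"  # toy value, not a real embedding
vector = parse_embedding(raw)
print(type(raw).__name__, type(vector).__name__, len(vector))
```

<font color='green'>
With pandas, the same helper could be applied to the whole column, for example new_df['embedding'] = new_df['embedding'].apply(parse_embedding), before computing any similarity on the reloaded data.
</font>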


<font color='green'>
Let's get the embedding for our text
</font>

In [13]:
our_Text = "Mango"

In [14]:
text_embedding = embeddings.embed_query(our_Text)

In [15]:
print(f"Our embedding is {text_embedding}")

Our embedding is [-0.003366075032926374, -0.019877087575502057, 0.010494233241386205, -0.016172489355289033, 0.0062690750554954735, 0.008526964220638825, -0.025204046015862437, -0.014230768375730658, 0.001090621007388965, -0.03298370209204527, -0.0011105812294751384, 0.0036726624702348265, 0.0019449142518763543, 0.01682398675735074, -0.021576093781861604, 0.004784042075997121, 0.03267711442190617, -0.008220376550499725, 0.009408403306627441, -0.015750931774509064, -0.015916999630173116, 0.002847111819822978, 0.0014938154679030426, -0.01239124328621865, 0.004633941746076197, 0.00949782424475672, 0.024501448806132378, -0.013630367987369548, 0.011548128124658712, 0.0030243576419991597, 0.047827646739376714, -0.014971688578566815, -0.004388033002067452, -0.009868283880513504, -0.017616005836532667, 0.00460519926841598, -0.010328165385722154, -0.006336140991923078, -0.006316979495370031, -0.018344151087451733, 0.01444793510774048, -0.008380057395866527, -0.015750931774509064, -0.01803756341

<font color='green'>
    Once we have a vector representing our search term, we can determine how similar it is to the other words in our dataframe.
    <br>
We do this by computing the cosine similarity between the vector for our search term and each word embedding in our dataframe.
    </font>
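
<font color='green'>
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. The plain-Python sketch below illustrates this on tiny made-up vectors; the name cosine_similarity_plain is an assumption chosen so that it does not clash with any function defined in the notebook.
</font>

```python
import math


def cosine_similarity_plain(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (not real embeddings):
print(cosine_similarity_plain([1.0, 2.0], [2.0, 4.0]))   # same direction, close to 1
print(cosine_similarity_plain([1.0, 0.0], [0.0, 1.0]))   # perpendicular, 0
print(cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0]))  # opposite direction, -1
```

<font color='green'>
Scores closer to 1 mean the two vectors point in nearly the same direction, which is why a higher score is read as "more similar".
</font>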

In [20]:
print(text_embedding)

[-0.003366075032926374, -0.019877087575502057, 0.010494233241386205, -0.016172489355289033, 0.0062690750554954735, 0.008526964220638825, -0.025204046015862437, -0.014230768375730658, 0.001090621007388965, -0.03298370209204527, -0.0011105812294751384, 0.0036726624702348265, 0.0019449142518763543, 0.01682398675735074, -0.021576093781861604, 0.004784042075997121, 0.03267711442190617, -0.008220376550499725, 0.009408403306627441, -0.015750931774509064, -0.015916999630173116, 0.002847111819822978, 0.0014938154679030426, -0.01239124328621865, 0.004633941746076197, 0.00949782424475672, 0.024501448806132378, -0.013630367987369548, 0.011548128124658712, 0.0030243576419991597, 0.047827646739376714, -0.014971688578566815, -0.004388033002067452, -0.009868283880513504, -0.017616005836532667, 0.00460519926841598, -0.010328165385722154, -0.006336140991923078, -0.006316979495370031, -0.018344151087451733, 0.01444793510774048, -0.008380057395866527, -0.015750931774509064, -0.018037563417312632, -0.01548

In [21]:
print(df)

         Words                                          embedding
0     Elephant  [-0.017855134320424067, -0.008739002273680945,...
1         Lion  [-0.0015144460887108187, -0.010011047775235732...
2        Tiger  [-0.013417221828848816, -0.009594874215361256,...
3          Dog  [-0.0009933243881749653, -0.015114395874422863...
4      Cricket  [0.003939178751371586, -0.007197194694541304, ...
5      Footbal  [-0.011442505362835326, -0.008127146122306165,...
6       Tennis  [-0.0229627046214536, 0.001620174408225852, 0....
7   Basketball  [-0.012779986743709604, -0.013293189227440116,...
8        Apple  [0.014477049403261352, -0.003934278727982006, ...
9       Orange  [0.02067122263312988, -0.02922207528370551, 9....
10      Banana  [-0.012999765903696864, -0.01998321619391984, ...


In [30]:
import numpy as np
# https://github.com/openai/openai-python/issues/676
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

df["similarity score"] = df['embedding'].apply(lambda x: cosine_similarity(x, text_embedding))

df

Unnamed: 0,Words,embedding,similarity score
0,Elephant,"[-0.017855134320424067, -0.008739002273680945,...",0.830492
1,Lion,"[-0.0015144460887108187, -0.010011047775235732...",0.827272
2,Tiger,"[-0.013417221828848816, -0.009594874215361256,...",0.852028
3,Dog,"[-0.0009933243881749653, -0.015114395874422863...",0.773127
4,Cricket,"[0.003939178751371586, -0.007197194694541304, ...",0.817989
5,Footbal,"[-0.011442505362835326, -0.008127146122306165,...",0.777491
6,Tennis,"[-0.0229627046214536, 0.001620174408225852, 0....",0.805602
7,Basketball,"[-0.012779986743709604, -0.013293189227440116,...",0.794313
8,Apple,"[0.014477049403261352, -0.003934278727982006, ...",0.81402
9,Orange,"[0.02067122263312988, -0.02922207528370551, 9....",0.843934
10,Banana,"[-0.012999765903696864, -0.01998321619391984, ...",0.898706


<font color='green'>
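    Sorting the dataframe by similarity score shows that Banana is the closest match to our search term, Mango, with Orange also near the top. The ranking is not perfect, though: Tiger scores higher than Orange, and Apple ranks below several animals and sports.
    </font>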

In [31]:
df.sort_values("similarity score", ascending=False).head(10)

Unnamed: 0,Words,embedding,similarity score
10,Banana,"[-0.012999765903696864, -0.01998321619391984, ...",0.898706
2,Tiger,"[-0.013417221828848816, -0.009594874215361256,...",0.852028
9,Orange,"[0.02067122263312988, -0.02922207528370551, 9....",0.843934
0,Elephant,"[-0.017855134320424067, -0.008739002273680945,...",0.830492
1,Lion,"[-0.0015144460887108187, -0.010011047775235732...",0.827272
4,Cricket,"[0.003939178751371586, -0.007197194694541304, ...",0.817989
8,Apple,"[0.014477049403261352, -0.003934278727982006, ...",0.81402
6,Tennis,"[-0.0229627046214536, 0.001620174408225852, 0....",0.805602
7,Basketball,"[-0.012779986743709604, -0.013293189227440116,...",0.794313
5,Footbal,"[-0.011442505362835326, -0.008127146122306165,...",0.777491
