<a href="https://colab.research.google.com/github/sudama-inc/llm_finetuning/blob/main/gpt3_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://openai.com/blog/introducing-text-and-code-embeddings/

In [None]:
!pip install "openai<1.0"  # this notebook uses the pre-1.0 API (openai.Embedding, openai.embeddings_utils)

An **embedding is an information-dense representation of the semantic meaning of a piece of text**. <br>
Each embedding is a vector of floating-point numbers, such that the **distance between two embeddings in the vector space correlates with the semantic similarity of the two inputs** in their original format. <br>
For example, if two texts are similar, their vector representations should also lie close together.
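
To make the idea of "distance correlates with similarity" concrete, here is a minimal sketch of cosine similarity on hand-made vectors. The three-dimensional vectors and the helper name `cosine_sim` are illustrative assumptions only; real embeddings from the API have far more dimensions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for real embeddings
food = [0.9, 0.1, 0.0]
hungry = [0.8, 0.2, 0.1]
travel = [0.1, 0.9, 0.2]

sim_food_hungry = cosine_sim(food, hungry)
sim_food_travel = cosine_sim(food, travel)

# Vectors pointing in similar directions score closer to 1
print(sim_food_hungry > sim_food_travel)
```

A score near 1 means the vectors point in nearly the same direction, and a score near 0 means they are nearly orthogonal.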

**Use cases:**

*   Text Similarity
*   Semantic Search
*   Classification
*   Clustering




**Embedding model types used in this notebook:**

1.   **Similarity embeddings**: These models are good at capturing semantic similarity between two or more pieces of text.
2.   **Text search embeddings**: These models help measure whether long documents are relevant to a short search query. There are two types: one for ***embedding the documents*** to be retrieved, and one for ***embedding the search query***.
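
The document/query split can be sketched as a ranking step: each document vector is scored against the query vector and the best matches are returned. This is only a sketch under stated assumptions; the two-dimensional vectors, the document names, and the helper `rank_documents` are all made up for illustration. In practice the document vectors would come from the document-side model and the query vector from the query-side model.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_documents(doc_vectors, query_vector, n=2):
    """Return the n (name, score) pairs most similar to the query."""
    scored = [(name, cosine_sim(vec, query_vector)) for name, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Made-up stand-ins for document-side embeddings
doc_vectors = {
    "refried beans review": [0.9, 0.1],
    "dog shampoo review": [0.1, 0.9],
    "coffee beans review": [0.7, 0.4],
}
# Made-up stand-in for a query-side embedding of a short search phrase
query_vector = [1.0, 0.2]

top_matches = rank_documents(doc_vectors, query_vector, n=2)
print([name for name, _ in top_matches])
```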




In [None]:
import numpy as np
import openai
import pandas as pd
from openai.embeddings_utils import get_embedding, cosine_similarity

In [None]:
import os
# Read the key from the environment instead of hard-coding a secret in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

<h3>Text Similarity</h3>

In [None]:
resp = openai.Embedding.create(
    input=["eating food", "I am hungry", "I am traveling", "exploring new places"],
    engine="text-similarity-davinci-001")

In [None]:
type(resp['data'])

list

In [None]:
len(resp['data'])

4

In [None]:
type(resp['data'][0])

openai.openai_object.OpenAIObject

In [None]:
resp['data'][0].keys()

dict_keys(['object', 'index', 'embedding'])

In [None]:
resp['data'][0]['embedding']

In [None]:
embedding_a = resp['data'][0]['embedding']
embedding_b = resp['data'][1]['embedding']
embedding_c = resp['data'][2]['embedding']
embedding_d = resp['data'][3]['embedding']

In [None]:
np.dot(embedding_a, embedding_b)

0.8724587274814626

In [None]:
np.dot(embedding_a, embedding_c)

0.7891928072645734

In [None]:
np.dot(embedding_c, embedding_d)

0.8543601927339739
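
The cells above use `np.dot` as the similarity score. A dot product equals cosine similarity only when both vectors have unit length, so it is worth confirming that the returned embeddings are normalized (for example, that `np.linalg.norm(embedding_a)` is close to 1; this is an assumption to check, not something shown in this notebook). A minimal sketch of the relationship, using made-up two-dimensional vectors and hypothetical helper names:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

# Made-up vectors that are NOT unit length
a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)  # depends on vector length as well as direction
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
unit_dot = dot(normalize(a), normalize(b))  # dot product after normalizing

print(raw_dot, cosine, unit_dot)
```

After normalization the dot product and the cosine similarity agree, which is why a plain dot product is a reasonable shortcut for unit-length embeddings.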

In [None]:
# https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
datafile_path = "https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv"  # for your convenience, we precomputed the embeddings
df = pd.read_csv(datafile_path)
df.head()

Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens,babbage_similarity,babbage_search
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,51,"[-0.01274053193628788, 0.010849879123270512, -...","[-0.01880764216184616, 0.019457539543509483, -..."
1,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...,35,"[-0.024154752492904663, 0.0024838377721607685,...","[-0.03571609780192375, 0.010356518439948559, -..."
2,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ....",277,"[0.0032693513203412294, 0.017815979197621346, ...","[-0.010433986783027649, 0.024620095267891884, ..."
3,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...,246,"[-0.03584608808159828, 0.03424076735973358, -0...","[-0.040209852159023285, 0.03804996609687805, -..."
4,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...,87,"[0.005218076519668102, 0.018165964633226395, -...","[0.010450801812112331, 0.022801749408245087, -..."


In [None]:
type(df.loc[0]['babbage_search'])

str

In [None]:
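import ast
# The embedding columns are stored as strings; ast.literal_eval parses them without executing arbitrary code
df["babbage_search"] = df.babbage_search.apply(ast.literal_eval).apply(np.array)
df["babbage_similarity"] = df.babbage_similarity.apply(ast.literal_eval).apply(np.array)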


In [None]:
type(df.loc[0]['babbage_search'])

numpy.ndarray

<h3>Semantic Search</h3>

In [None]:
# Search the reviews for the entries most similar to a free-text query
def search_reviews(df, search_query, n=3):
    # Embed the query with the query-side search model
    embedding = get_embedding(
        search_query,
        engine="text-search-babbage-query-001"
    )
    # Score each review's precomputed document embedding against the query
    # (note: this adds a "similarities" column to df in place)
    df["similarities"] = df.babbage_search.apply(lambda x: cosine_similarity(x, embedding))

    top_n = df.sort_values("similarities", ascending=False).head(n)
    # res = top_n.combined.str.replace("Title: ", "").str.replace("; Content:", ": ")
    return top_n

In [None]:
res = search_reviews(df, "delicious beans", n=3)
res['combined'].to_list()

['Title: Fantastic Instant Refried beans; Content: Fantastic Instant Refried Beans have been a staple for my family now for nearly 20 years.  All 7 of us love it and my grown kids are passing on the tradition.',
 'Title: Jamaican Blue beans; Content: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and before any oil appears on the bean itself (455F @ 17 minutes).',
 "Title: Delicious!; Content: I enjoy this white beans seasoning, it gives a rich flavor to the beans I just love it, my mother in law didn't know about this Zatarain's brand and now she is traying different seasoning and she likes it very much.<br />Thank you Amazon for having it because now I can't find it in stores, I like to have this 12 boxes because I can made it whenever my family want it."]

<h3>Classification</h3>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    list(df.babbage_similarity.values),
    df.Score,
    test_size = 0.2,
    random_state=42
)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

In [None]:
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           1       0.62      0.72      0.67        18
           2       1.00      0.35      0.52        17
           3       0.50      0.12      0.20         8
           4       0.62      0.38      0.48        26
           5       0.83      0.98      0.90       131

    accuracy                           0.80       200
   macro avg       0.72      0.51      0.55       200
weighted avg       0.79      0.80      0.77       200



In [None]:
len(df)

1000

In [None]:
df['Score'].value_counts(normalize=True)

5    0.651
4    0.138
1    0.087
3    0.075
2    0.049
Name: Score, dtype: float64
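
The label distribution is heavily skewed toward 5-star reviews, so overall accuracy can flatter the classifier. A rough sketch of a sanity check: compare the reported accuracy with a baseline that always predicts the most common class. The support counts and the 0.80 accuracy are copied from the classification report above; the variable names are illustrative.

```python
# Per-class support copied from the classification report (test split of 200 reviews)
support = {1: 18, 2: 17, 3: 8, 4: 26, 5: 131}
reported_accuracy = 0.80  # accuracy row of the same report

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = support[majority_class] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"gain over baseline: {reported_accuracy - baseline_accuracy:.3f}")
```

Read this way, the macro-averaged scores in the report are a better indicator of how well the minority classes are handled than the overall accuracy.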

<h3>Clustering</h3>

In [None]:
# source: https://stackoverflow.com/questions/55619176/how-to-cluster-similar-sentences-using-bert
from sklearn.cluster import KMeans
# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'Horse is eating grass.',
          'A man is eating pasta.',
          'A Woman is eating Biryani.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.',
          'The cheetah is chasing a man who is riding the horse.',
          'man and women with their baby are watching cheetah in zoo'
          ]

In [None]:
response = openai.Embedding.create(
    input=corpus,
    model="text-similarity-babbage-001"
)

In [None]:
type(response['data'])

list

In [None]:
# response['data'][0]['embedding']

In [None]:
corpus_embeddings = np.array([d['embedding'] for d in response['data']])
# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
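
Normalizing to unit length matters here because KMeans groups points by Euclidean distance. For unit vectors, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so clustering on normalized embeddings orders neighbors the same way cosine similarity would. A minimal numeric check of that identity, using made-up three-dimensional vectors and a hypothetical `normalize` helper:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# Made-up vectors standing in for two sentence embeddings
u = normalize([1.0, 2.0, 2.0])
v = normalize([2.0, 0.0, 1.0])

cos_uv = sum(x * y for x, y in zip(u, v))
sq_dist = sum((x - y) ** 2 for x, y in zip(u, v))

# For unit vectors: ||u - v||^2 == 2 - 2 * cos(u, v)
print(sq_dist, 2 - 2 * cos_uv)
```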

In [None]:
clustering_model = KMeans(n_clusters=3)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

[1 1 1 1 2 2 2 1 1 0 0 0 0 0 0]


In [None]:
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(corpus[sentence_id])
clustered_sentences

{0: ['A monkey is playing drums.',
  'Someone in a gorilla costume is playing a set of drums.',
  'A cheetah is running behind its prey.',
  'A cheetah chases prey on across a field.',
  'The cheetah is chasing a man who is riding the horse.',
  'man and women with their baby are watching cheetah in zoo'],
 1: ['A man is eating food.',
  'A man is eating a piece of bread.',
  'Horse is eating grass.',
  'A man is eating pasta.',
  'A man is riding a horse.',
  'A man is riding a white horse on an enclosed ground.'],
 2: ['A Woman is eating Biryani.',
  'The girl is carrying a baby.',
  'The baby is carried by the woman']}

In [None]:
clustering_model = KMeans(n_clusters=4)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

[1 1 1 1 2 2 2 1 1 0 0 3 3 3 3]


In [None]:
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(corpus[sentence_id])
clustered_sentences

{0: ['A monkey is playing drums.',
  'Someone in a gorilla costume is playing a set of drums.'],
 1: ['A man is eating food.',
  'A man is eating a piece of bread.',
  'Horse is eating grass.',
  'A man is eating pasta.',
  'A man is riding a horse.',
  'A man is riding a white horse on an enclosed ground.'],
 2: ['A Woman is eating Biryani.',
  'The girl is carrying a baby.',
  'The baby is carried by the woman'],
 3: ['A cheetah is running behind its prey.',
  'A cheetah chases prey on across a field.',
  'The cheetah is chasing a man who is riding the horse.',
  'man and women with their baby are watching cheetah in zoo']}
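
The grouping loop is repeated for each value of `n_clusters`, so it could be factored into a small helper. The sketch below is one possible way to do that; `group_by_cluster` is a hypothetical name, and the three sentences and labels are a shortened, hand-picked sample rather than real KMeans output.

```python
from collections import defaultdict

def group_by_cluster(sentences, labels):
    """Map each cluster id to the list of sentences assigned to it."""
    groups = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        # int() turns NumPy integer labels into plain Python ints for cleaner keys
        groups[int(label)].append(sentence)
    return dict(groups)

# Shortened sample for illustration only
demo_sentences = [
    "A man is eating food.",
    "A man is riding a horse.",
    "A monkey is playing drums.",
]
demo_labels = [1, 1, 0]

grouped = group_by_cluster(demo_sentences, demo_labels)
print(grouped)
```

With the notebook's variables, the equivalent call would be `group_by_cluster(corpus, cluster_assignment)`.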