## 1. Load the dataset

The dataset used in this example is a collection of patient-doctor dialogues stored as a tab-separated file (`../2-Data/dialogues.csv`). Each row has a short question title (Description), the patient's message (Patient), and the doctor's reply (Doctor). For illustration purposes, we work with only the first five complete rows.

We will combine the description and the patient's message into a single combined text. The model encodes this combined text and outputs a single vector embedding.

To run this notebook, you will need to install: pandas, openai, tiktoken, sentence-transformers, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy.

In [20]:
# imports
import pandas as pd
import tiktoken
from openai.embeddings_utils import get_embedding
import time

In [21]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

In [22]:
# load & inspect dataset
df = pd.read_csv("../2-Data/dialogues.csv", sep = '\t')

In [23]:
df = df.dropna().head()

In [24]:
df

| Unnamed: 0 | Description | Patient | Doctor |
| --- | --- | --- | --- |
| 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... |
| 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... |
| 2 | Q. I have started to get lots of acne on my fa... | Hi doctor! I used to have clear skin but since... | Hi there Acne has multifactorial etiology. Onl... |
| 3 | Q. Why do I have uncomfortable feeling between... | Hello doctor,I am having an uncomfortable feel... | Hello. The popping and discomfort what you fel... |
| 4 | Q. My symptoms after intercourse threatns me e... | Hello doctor,Before two years had sex with a c... | Hello. The HIV test uses a finger prick blood ... |


In [25]:
df["combined"] = (
    "Description: " + df.Description.str.strip() + "; Patient: " + df.Patient.str.strip()
)
df.head(2)

| Unnamed: 0 | Description | Patient | Doctor | combined |
| --- | --- | --- | --- | --- |
| 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... | Description: Q. What does abutment of the nerv... |
| 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... | Description: Q. What should I do to reduce my ... |


In [26]:
# keep at most the last 1k rows and remove samples that are too long
top_n = 1000
df = df.tail(top_n * 2)  # first cut to the last 2k entries, assuming less than half will be filtered out

In [27]:
df

| Unnamed: 0 | Description | Patient | Doctor | combined |
| --- | --- | --- | --- | --- |
| 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... | Description: Q. What does abutment of the nerv... |
| 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... | Description: Q. What should I do to reduce my ... |
| 2 | Q. I have started to get lots of acne on my fa... | Hi doctor! I used to have clear skin but since... | Hi there Acne has multifactorial etiology. Onl... | Description: Q. I have started to get lots of ... |
| 3 | Q. Why do I have uncomfortable feeling between... | Hello doctor,I am having an uncomfortable feel... | Hello. The popping and discomfort what you fel... | Description: Q. Why do I have uncomfortable fe... |
| 4 | Q. My symptoms after intercourse threatns me e... | Hello doctor,Before two years had sex with a c... | Hello. The HIV test uses a finger prick blood ... | Description: Q. My symptoms after intercourse ... |


In [28]:
encoding = tiktoken.get_encoding(embedding_encoding)
# omit rows that are too long to embed
df = df.copy()  # work on an explicit copy so the column assignment below does not trigger SettingWithCopyWarning
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)

5
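The cell above drops any row whose combined text exceeds `max_tokens`. An alternative, shown here only as a minimal sketch, is to truncate long texts instead of discarding them. The helper name `truncate_to_max_tokens` is hypothetical and not part of the notebook; it assumes `encoding` is an object with `encode()` and `decode()` methods, such as the one returned by `tiktoken.get_encoding`.

```python
def truncate_to_max_tokens(text, encoding, max_tokens=8000):
    """Return text cut down to at most max_tokens tokens.

    `encoding` is any object with encode()/decode() methods, for example
    tiktoken.get_encoding("cl100k_base") (assumed, as used in the notebook).
    """
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text  # already short enough, leave unchanged
    # keep only the first max_tokens tokens and turn them back into text
    return encoding.decode(tokens[:max_tokens])


# Possible use in the notebook (hypothetical, requires tiktoken):
# encoding = tiktoken.get_encoding(embedding_encoding)
# df["combined"] = df.combined.apply(
#     lambda x: truncate_to_max_tokens(x, encoding, max_tokens)
# )
```

Truncation keeps every row at the cost of losing the tail of long texts, so whether it is preferable to filtering depends on how much of the meaning sits at the end of each dialogue.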

In [29]:
df

| Unnamed: 0 | Description | Patient | Doctor | combined | n_tokens |
| --- | --- | --- | --- | --- | --- |
| 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... | Description: Q. What does abutment of the nerv... | 58 |
| 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... | Description: Q. What should I do to reduce my ... | 173 |
| 2 | Q. I have started to get lots of acne on my fa... | Hi doctor! I used to have clear skin but since... | Hi there Acne has multifactorial etiology. Onl... | Description: Q. I have started to get lots of ... | 174 |
| 3 | Q. Why do I have uncomfortable feeling between... | Hello doctor,I am having an uncomfortable feel... | Hello. The popping and discomfort what you fel... | Description: Q. Why do I have uncomfortable fe... | 189 |
| 4 | Q. My symptoms after intercourse threatns me e... | Hello doctor,Before two years had sex with a c... | Hello. The HIV test uses a finger prick blood ... | Description: Q. My symptoms after intercourse ... | 295 |


There are different ways to convert text into a vector, that is, into an embedding.

Hosted embedding APIs such as OpenAI's are paid services, but open-source libraries such as SentenceTransformers can be run locally for free.


## 2. Get embeddings using SentenceTransformers

Let us use SentenceTransformers, a Python framework for state-of-the-art sentence, text, and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

First, we verify that PyTorch can see a CUDA-capable GPU.

In [47]:
import torch
torch.cuda.is_available()

True
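If the check returns `False`, the models can still run on the CPU, only more slowly. The following minimal sketch picks a device name based on that check; the variable name `device` is an assumption for illustration. `SentenceTransformer` accepts a `device` argument, shown here in a comment so that the snippet runs without downloading a model.

```python
try:
    import torch

    # use the GPU when PyTorch can see one, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # torch is not installed; keep a default so the snippet itself still runs
    device = "cpu"

print(f"Using device: {device}")

# The device can then be passed explicitly when loading a model:
# model = SentenceTransformer("paraphrase-MiniLM-L6-v2", device=device)
```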

We can install SentenceTransformers using:
`!pip install sentence-transformers`

Next, we define the sentences to encode. It is best to pass them as a list, which makes it easy to process each sentence; the list can be as large as needed.



Step 1: We will then load the pre-trained BERT model. There are many other pre-trained models available.

In [None]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

We proceed to test the embedding creation.

In [48]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']
# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)

In [49]:
sentence

['This framework generates embeddings for each input sentence']

In [50]:
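from functools import lru_cache

@lru_cache(maxsize=None)
def load_model(transformer):
    # Load each model only once and reuse it across calls
    return SentenceTransformer(transformer)

def get_embeddings(x, transformer='paraphrase-MiniLM-L6-v2'):
    # x is the sentence (or list of sentences) we want to encode
    model = load_model(transformer)
    # Sentences are encoded by calling model.encode()
    return model.encode(x)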

In [51]:
# This may take a few minutes
embedding_mod = 'paraphrase-MiniLM-L6-v2'
df["embedding"] = df.combined.apply(lambda x: get_embeddings(x, transformer=embedding_mod))

In [52]:
df['embedding']

0    [-0.1678172, 0.25504777, 0.32995197, -0.019847...
1    [-0.20603478, 0.1933242, 0.13318594, 0.0191498...
2    [-0.29600272, 0.13250932, -0.12288458, 0.32883...
3    [-0.13440488, -0.18298218, -0.12564877, -0.101...
4    [-0.06823079, 0.23422238, 0.04620348, -0.27123...
Name: embedding, dtype: object
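Calling the encoder once per row works for five rows, but `model.encode()` also accepts a list of sentences and processes it in batches, which is usually faster on larger data. The sketch below illustrates that idea; the helper name `embed_texts` is hypothetical, and `model` is assumed to be any object exposing `encode(sentences, batch_size=...)`, as `SentenceTransformer` does.

```python
def embed_texts(texts, model, batch_size=32):
    """Encode all texts with a single encode() call.

    Returns a list with one embedding vector per input text, in order.
    """
    embeddings = model.encode(list(texts), batch_size=batch_size)
    return list(embeddings)  # split the 2-D result into one vector per row


# Possible use in the notebook (hypothetical):
# model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
# df["embedding"] = embed_texts(df.combined, model)
```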

In [55]:
df.to_csv("../2-Data/dialogues_embededd.csv", sep='\t', encoding='utf-8', index=False)
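One caveat when saving to CSV: each embedding is written as the text form of an array, which needs to be parsed again when the file is read back. A possible approach, sketched here with only the standard library and made-up example vectors, is to serialize each vector as JSON before writing and decode it after reading. The helper names `embedding_to_text` and `text_to_embedding` are hypothetical.

```python
import csv
import io
import json


def embedding_to_text(vector):
    # store the vector as a JSON list of plain Python floats
    return json.dumps([float(v) for v in vector])


def text_to_embedding(text):
    # inverse of embedding_to_text
    return json.loads(text)


# Illustrative rows (made-up values, not real embeddings)
rows = [("first text", [0.1, -0.2, 0.3]), ("second text", [0.0, 0.5, -1.0])]

# Write a tab-separated file into an in-memory buffer
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["combined", "embedding"])
for combined, vector in rows:
    writer.writerow([combined, embedding_to_text(vector)])

# Read it back and decode the embedding column
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t")
restored = [(r["combined"], text_to_embedding(r["embedding"])) for r in reader]
print(restored[0][1])
```

With pandas, the same helpers could be applied to the column before `to_csv` (`df["embedding"].apply(embedding_to_text)`) and after `read_csv` (`df["embedding"].apply(text_to_embedding)`).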

## 3. Get embeddings using OpenAI (optional)
If you have an OpenAI subscription, you can follow the steps below.
This step is optional; we are going to use the previous method.

In [24]:
# Read the credentials from a JSON file
import json
# The with-statement closes the file automatically
with open('./credentials/api.json') as f:
    # json.load returns the JSON object as a dictionary
    data = json.load(f)

In [28]:
# Set the API key loaded from the credentials file (see the README: https://github.com/openai/openai-python#usage)
import openai
openai.api_key = data['OPENAI_API_KEY']
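The README linked above suggests keeping the key in an environment variable, while this notebook reads it from a JSON file. A minimal sketch that supports both, preferring the environment variable, could look like this. The function name `load_api_key` and its defaults are assumptions for illustration.

```python
import json
import os


def load_api_key(path="./credentials/api.json", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or from a JSON file as fallback."""
    key = os.environ.get(env_var)
    if key:
        return key
    # fall back to the credentials file used in this notebook
    with open(path) as f:
        return json.load(f)[env_var]


# Possible use in the notebook (hypothetical):
# openai.api_key = load_api_key()
```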

In [42]:
# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))
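Note that `openai.embeddings_utils` ships only with versions of the `openai` package before 1.0. With version 1.0 or later, an equivalent helper can be sketched on top of the client's `embeddings.create` call. The function name `get_embedding_v1` is hypothetical, and `client` is assumed to be an `openai.OpenAI` instance.

```python
def get_embedding_v1(client, text, model="text-embedding-ada-002"):
    """Return the embedding of one text using an openai>=1.0 style client."""
    # newlines are replaced by spaces, as the older helper did
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding


# Possible use in the notebook (hypothetical, requires openai>=1.0):
# from openai import OpenAI
# client = OpenAI(api_key=data["OPENAI_API_KEY"])
# df["embedding"] = df.combined.apply(
#     lambda x: get_embedding_v1(client, x, model=embedding_model)
# )
```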

In [None]:
df.to_csv("../2-Data/dialogues_embededd_openai.csv", sep='\t', encoding='utf-8', index=False)

## Additional Notes (not needed)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["I am doga.",
       "I am a dog"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
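To make the printed vocabulary and IDF values easier to interpret, the following sketch recomputes them by hand with the standard library. It assumes the `TfidfVectorizer` defaults: lowercasing, a token pattern that keeps only words of two or more characters, and the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term t.

```python
import math
import re

docs = ["I am doga.", "I am a dog"]

# default token pattern: words of two or more characters, so "I" and "a" are dropped
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(d.lower()) for d in docs]

# vocabulary: terms sorted alphabetically and mapped to column indices
terms = sorted({t for doc in tokenized for t in doc})
vocabulary = {term: i for i, term in enumerate(terms)}

# smoothed inverse document frequency for each term
n_docs = len(docs)
idf = {}
for term in vocabulary:
    doc_freq = sum(term in doc for doc in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(vocabulary)  # {'am': 0, 'dog': 1, 'doga': 2}
print(idf)
```

The term "am" appears in both documents and gets the minimum weight of 1.0, while "dog" and "doga" each appear in one document and get ln(1.5) + 1, roughly 1.405.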

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["I am  doc.", "I am dog"]
# create the transform
vectorizer = HashingVectorizer(n_features=20)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())