## 1. Load the dataset
We will combine the Description, Patient, and Doctor text into a single combined text. The model encodes this combined text and outputs a single embedding vector.

To run this notebook, you will need to install: pandas, openai, tiktoken, sentence-transformers, transformers, plotly, matplotlib, scikit-learn, torch (a dependency of transformers), torchvision, and scipy.

In [1]:
# imports
import pandas as pd
import tiktoken
from openai.embeddings_utils import get_embedding
import time

In [2]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

In [3]:
# load & inspect dataset
df = pd.read_csv("../2-Data/dialogues.csv", sep = '\t')

In [4]:
df = df.dropna()#.head(1000)

In [7]:
df["combined"] = (
    "Description: " + df.Description.str.strip() + "; Patient: " + df.Patient.str.strip()+ "; Doctor: " + df.Doctor.str.strip()
)
df.head(2)

Unnamed: 0,Description,Patient,Doctor,combined
0,Q. What does abutment of the nerve root mean?,"Hi doctor,I am just wondering what is abutting...",Hi. I have gone through your query with dilige...,Description: Q. What does abutment of the nerv...
1,Q. What should I do to reduce my weight gained...,"Hi doctor, I am a 22-year-old female who was d...",Hi. You have really done well with the hypothy...,Description: Q. What should I do to reduce my ...
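
To make the format of the `combined` column concrete, here is a minimal plain-Python sketch of the same concatenation for a single row. The helper name `build_combined` and the sample strings are illustrative assumptions, not part of the notebook or the dataset.

```python
def build_combined(description, patient, doctor):
    """Mirror the pandas expression above for one row (hypothetical helper)."""
    # Strip surrounding whitespace from each field, as .str.strip() does
    parts = [
        "Description: " + description.strip(),
        "Patient: " + patient.strip(),
        "Doctor: " + doctor.strip(),
    ]
    # Fields are joined with "; ", matching the separators used in the notebook
    return "; ".join(parts)


# Made-up sample values, for illustration only
example = build_combined("  Q. Sample question?  ", "Hi doctor, ... ", " Hello. ...")
print(example)
```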


In [7]:
#df["combined"] = ( "Description: " + df.Description.str.strip() + "; Patient: " + df.Patient.str.strip())
#df.head(2)

In [8]:
# keep every row for now (top_n is the full dataset size); rows that are too long are removed in the next cell
top_n = df.shape[0]
#df = df.tail(top_n * 2)  # optional pre-cut to the last top_n * 2 rows, assuming less than half will be filtered out

In [9]:
encoding = tiktoken.get_encoding(embedding_encoding)
# omit dialogues that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)

256916
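
The filter above drops every row whose token count exceeds `max_tokens`. The sketch below shows the same logic on a plain list. It uses a whitespace split as a stand-in for `tiktoken` (an assumption made so the example runs without extra dependencies), so the counts differ from real `cl100k_base` counts.

```python
def filter_by_token_budget(texts, count_tokens, max_tokens):
    """Keep only texts whose token count is within the budget."""
    return [t for t in texts if count_tokens(t) <= max_tokens]


def whitespace_token_count(text):
    # Stand-in tokenizer (assumption): real counts come from encoding.encode(text)
    return len(text.split())


texts = ["short text", "a much longer text that goes over the budget"]
kept = filter_by_token_budget(texts, whitespace_token_count, max_tokens=4)
print(kept)
```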

In [10]:
df

Unnamed: 0,Description,Patient,Doctor,combined,n_tokens
0,Q. What does abutment of the nerve root mean?,"Hi doctor,I am just wondering what is abutting...",Hi. I have gone through your query with dilige...,Description: Q. What does abutment of the nerv...,95
1,Q. What should I do to reduce my weight gained...,"Hi doctor, I am a 22-year-old female who was d...",Hi. You have really done well with the hypothy...,Description: Q. What should I do to reduce my ...,519
2,Q. I have started to get lots of acne on my fa...,Hi doctor! I used to have clear skin but since...,Hi there Acne has multifactorial etiology. Onl...,Description: Q. I have started to get lots of ...,285
3,Q. Why do I have uncomfortable feeling between...,"Hello doctor,I am having an uncomfortable feel...",Hello. The popping and discomfort what you fel...,Description: Q. Why do I have uncomfortable fe...,324
4,Q. My symptoms after intercourse threatns me e...,"Hello doctor,Before two years had sex with a c...",Hello. The HIV test uses a finger prick blood ...,Description: Q. My symptoms after intercourse ...,442
...,...,...,...,...,...
256911,Why is hair fall increasing while using Bontre...,I am suffering from excessive hairfall. My doc...,"Hello Dear Thanks for writing to us, we are he...",Description: Why is hair fall increasing while...,211
256912,Why was I asked to discontinue Androanagen whi...,"Hi Doctor, I have been having severe hair fall...","hello, hair4u is combination of minoxid...",Description: Why was I asked to discontinue An...,154
256913,Can Mintop 5% Lotion be used by women for seve...,Hi..i hav sever hair loss problem so consulted...,HI I have evaluated your query thoroughly you...,Description: Can Mintop 5% Lotion be used by w...,191
256914,Is Minoxin 5% lotion advisable instead of Foli...,"Hi, i am 25 year old girl, i am having massive...",Hello and Welcome to ‘Ask A Doctor’ service.I ...,Description: Is Minoxin 5% lotion advisable in...,232


There are several ways to convert text into a vector (an embedding).

Hosted embedding APIs such as OpenAI's are paid services, so we first use an open-source alternative.


## 2. Get embeddings using SentenceTransformers

Let us use SentenceTransformers, a Python framework for state-of-the-art sentence, text, and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

First we verify that Torch is CUDA capable

In [11]:
import torch
torch.cuda.is_available()

True
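
When CUDA is not available, the same code can still run on the CPU, only more slowly. A small sketch of picking the device at runtime follows. `pick_device` is a hypothetical helper name, and the fallback for a missing `torch` install is an assumption added so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda" when a CUDA-capable GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is not installed; fall back to CPU (assumption for portability)
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result can be passed to the model constructor, for example `SentenceTransformer('paraphrase-MiniLM-L6-v2', device=device)`.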

We define our list of sentences. `model.encode()` accepts a list, so it is best to pass the sentences as a list, even when there is only one.

We can install Sentence-BERT using:
`!pip install sentence-transformers`



Step 1: We will then load the pre-trained BERT model. There are many other pre-trained models available.

In [12]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

We proceed to test the embedding creation.

In [13]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
#Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']
#Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)

In [14]:
sentence

['This framework generates embeddings for each input sentence']
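
Embeddings are mostly useful for comparing texts, and a common measure for that is cosine similarity. The following is a minimal plain-Python sketch. The short vectors are made-up stand-ins for real embeddings, and `cosine_similarity` here is a hand-written helper, not a library function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumption); real embeddings come from model.encode()
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]
print(cosine_similarity(v1, v2))  # identical direction, close to 1
print(cosine_similarity(v1, v3))  # orthogonal vectors
```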

In [15]:
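from functools import lru_cache

@lru_cache(maxsize=None)
def load_model(transformer):
    # Load each model only once; reloading it for every row is very slow
    return SentenceTransformer(transformer)

def get_embeddings(x, transformer='paraphrase-MiniLM-L6-v2'):
    # x is the sentence (or list of sentences) we want to encode
    model = load_model(transformer)
    # Sentences are encoded by calling model.encode()
    return model.encode(x)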

In [16]:
# This may take a few minutes
embedding_mod='paraphrase-MiniLM-L6-v2'
#df["embedding"] = df.combined.apply(lambda x: get_embeddings(x, transformer=embedding_mod))

In [None]:
#embedding_doctor
# This may take a few minutes
df["embedding"] = df.Doctor.apply(lambda x: get_embeddings(x, transformer=embedding_mod))
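
Calling the encoder once per row is slow for a dataset of this size, and `model.encode()` also accepts a list of sentences. One possible alternative is to encode in batches. In the sketch below, `encode_in_batches` and `fake_encode` are hypothetical names; `fake_encode` stands in for `model.encode` so the example runs without downloading a model.

```python
def encode_in_batches(texts, encode_fn, batch_size=64):
    """Encode texts in fixed-size chunks and return one vector per text."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # encode_fn receives a list and must return one vector per item
        vectors.extend(encode_fn(batch))
    return vectors


def fake_encode(batch):
    # Stand-in encoder (assumption): maps each text to a 1-dimensional vector
    return [[float(len(text))] for text in batch]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = encode_in_batches(texts, fake_encode, batch_size=2)
print(vectors)
```

With a real model, `encode_fn` could be `model.encode`, and the resulting list could be assigned back to the `embedding` column.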

In [None]:
df

In [None]:
from ast import literal_eval
import numpy as np

In [None]:
df["embedding"] = df.embedding.apply(np.array)  # make sure each embedding is stored as a numpy array

In [None]:
#df["embedding_doctor"] = df.embedding_doctor.apply(np.array)  # convert string to numpy array

In [None]:
df.to_pickle("../2-Data/dialogues_embededd.pkl")

In [None]:
#df.to_csv("../2-Data/dialogues_embededd.csv", sep = '\t', encoding='utf-8', index=False)
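
The choice between pickle and CSV matters for the embedding column: CSV stores each vector as text, so it has to be parsed back after loading, which is what `literal_eval` is for. A minimal sketch of that round trip, using a made-up vector:

```python
from ast import literal_eval

# Toy embedding (assumption); a CSV file would store it as the string below
vector = [0.12, -0.5, 0.33]
as_text = str(vector)

# After reading the CSV, the text must be parsed back into a list of floats
restored = literal_eval(as_text)
print(restored == vector)
```

This works for Python lists. NumPy arrays are printed without commas, so they would need `.tolist()` before saving for the text to parse this way.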

## 3. Get embeddings using OpenAI (optional)
If you have an OpenAI subscription, you can follow the steps below.
This is optional; we are going to use the previous method.

In [24]:
# Read the API key from a JSON credentials file
import json
# The context manager closes the file automatically
with open('./credentials/api.json') as f:
    # json.load returns the JSON object as a dictionary
    data = json.load(f)

In [28]:
# The key comes from the credentials file loaded above; see the README for other options: https://github.com/openai/openai-python#usage
import openai
openai.api_key = data['OPENAI_API_KEY']

In [42]:
# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))
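
API calls can fail transiently, for example when a rate limit is hit, and a loop over this many rows may run into that. A hedged sketch of a retry wrapper with exponential backoff follows. `with_retries` and `flaky_call` are hypothetical names, and `flaky_call` simulates a failing request so nothing is sent to OpenAI.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries
            time.sleep(base_delay * (2 ** attempt))


calls = {"count": 0}


def flaky_call():
    # Simulated request (assumption): fails twice, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return [0.1, 0.2]


result = with_retries(flaky_call, base_delay=0.01)
print(result, calls["count"])
```

In the notebook, `fn` could be `lambda: get_embedding(x, engine=embedding_model)`.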

In [None]:
df.to_csv("../2-Data/dialogues_embededd_openai.csv", sep='\t', encoding='utf-8', index=False)

## Additional Notes (not needed)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["I am doga.",
       "I am a dog"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
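
To see where the printed `idf_` values come from, the sketch below recomputes them by hand. It assumes scikit-learn's defaults: lowercasing, a token pattern that keeps only tokens of two or more word characters (so "I" and "a" are dropped), and the smoothed formula idf = ln((1 + n) / (1 + df)) + 1.

```python
import math
import re

docs = ["I am doga.", "I am a dog"]

# Tokenize like the default pattern: lowercase, words of 2+ characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# Vocabulary is sorted alphabetically, one index per term
vocabulary = sorted({term for tokens in tokenized for term in tokens})

n_docs = len(docs)
idf = {}
for term in vocabulary:
    # Document frequency: number of documents containing the term
    doc_freq = sum(term in tokens for tokens in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(vocabulary)
print(idf)
```

A term that appears in every document ("am") gets the minimum weight, while terms that appear in only one document get a higher one.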

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["I am  doc.", "I am dog"]
# create the transform
vectorizer = HashingVectorizer(n_features=20)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
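
The idea behind the hashing vectorizer is that each token is hashed straight to a column index, so no vocabulary needs to be stored. The sketch below illustrates this with `hashlib.md5` as a stand-in hash. scikit-learn uses a different hash function plus sign and normalization steps, so the actual numbers will not match.

```python
import hashlib
import re


def hashing_vector(text, n_features=20):
    """Count tokens into a fixed-size vector using the hashing trick."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        # Stable hash of the token, reduced to a column index
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector


vec = hashing_vector("I am dog")
print(len(vec), sum(vec))
```

Because the index depends only on the token, the same text always maps to the same vector, at the cost of occasional collisions between different tokens.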