# Qualitative Analysis with GPT-3 Embeddings

In [9]:
import openai
import pandas as pd
import sklearn
import numpy as np
from openai.embeddings_utils import cosine_similarity, get_embedding
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

Setup:

In [11]:
# Read API key from file:
with open("../openai_key.txt") as f:
    openai.api_key = f.read().strip()  # strip the trailing newline, if any
# Read data from file:
all_dat = pd.read_excel("../data/Changestoself_Analysis_Comments_2_R_2013_12_05.xlsx", sheet_name = "Changestoself_Analysis_Comments")
# Take out comment column for analysis:
#     ASSUMING - demographic data etc. is not being used for coding
comments = all_dat["Comment"]
assert comments.isna().sum() == 0, "There are missing values in comments"
# note - will need to preprocess coding columns for scoring/comparison 

#Model to be used:
MODEL = "text-embedding-ada-002"

View sample of comments:

In [8]:
comments.head()

0     He is already sociable he doesn't need the pill.
1    John didn't have a problem - he was just fine ...
2    If John was quite comfortable with how social ...
3    It seems like he's a pretty well-adjusted guy ...
4    John is healthy and is not suffering from any ...
Name: Comment, dtype: object

### Using embeddings 

We can treat this as a classification problem using embeddings: compute the similarity between each comment and each code description, then assign to a comment every code whose similarity exceeds a chosen threshold.

This follows the description of embeddings and the code in the OpenAI cookbook:
https://github.com/openai/openai-cookbook/blob/main/examples/Zero-shot_classification_with_embeddings.ipynb
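
The thresholding idea described above can be sketched without calling the API. This is a minimal illustration, not the notebook's actual pipeline: the vectors are toy 3-dimensional stand-ins for real embeddings, and `cosine_sim`, `assign_codes`, the code names, and the threshold of 0.7 are all hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_codes(comment_vec, code_vecs, threshold):
    """Return the codes scoring at least `threshold`, plus all scores."""
    scores = {code: cosine_sim(comment_vec, vec) for code, vec in code_vecs.items()}
    assigned = [code for code, s in scores.items() if s >= threshold]
    return assigned, scores

# Toy vectors standing in for real embeddings (illustrative only).
toy_comment = [1.0, 0.0, 0.0]
toy_codes = {
    "code A": [1.0, 0.1, 0.0],  # nearly parallel to the comment
    "code B": [0.0, 1.0, 0.0],  # orthogonal to the comment
    "code C": [1.0, 1.0, 0.0],  # 45 degrees from the comment
}

assigned, scores = assign_codes(toy_comment, toy_codes, threshold=0.7)
print(assigned)  # ['code A', 'code C']
```

With real embeddings the threshold would have to be tuned against the human coding, since a comment can receive zero, one, or several codes under this rule.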

**Preparation of codes:**

The qualitative code descriptions used are shown below. To build them, each of the three broad categories 'Not comfortable with', 'Comfortable with', and 'Other' was combined with the descriptions of its sub-categories to create a single list of descriptions.

In [4]:
codes = pd.read_excel("../data/Changestoself_Analysis_Comments_2_R_2013_12_05.xlsx", sheet_name = "gpt_codes"
)["Descriptions"]

pd.DataFrame(codes)

Unnamed: 0,Descriptions
0,"Not comfortable with, no specific reason given"
1,"Not comfortable with, changing something funda..."
2,"Not comfortable with, changing the brain"
3,"Not comfortable with, side effect"
4,"Not comfortable with, general concerns about p..."
5,"Not comfortable with, No need, nor medically n..."
6,"Not comfortable with, naturalistic moral concerns"
7,"Not comfortable with, risks outweight the bene..."
8,"Not comfortable with, dependence, addiction"
9,"Not comfortable with, social pressure to fit in"


**Testing with smaller subsection of comments and codes**

In [5]:
comments_test = [comments[0]]

In [6]:
codes_test = codes.head()

In [7]:
# Embed each test comment once, then score it against every code description.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
# Note: this loops over the full `codes` series, not `codes_test`.
model = MODEL
rows = []
for m in comments_test:
    comment_embedding = get_embedding(m, engine=model)
    for c in codes:
        code_embedding = get_embedding(c, engine=model)
        score = cosine_similarity(comment_embedding, code_embedding)
        rows.append({"Comment": m, "Code": c, "Similarity score": score})
df_scores = pd.DataFrame(rows, columns=["Comment", "Code", "Similarity score"])
        

In [9]:
df_scores

Unnamed: 0,Comment,Code,Similarity score
0,He is already sociable he doesn't need the pill.,"Not comfortable with, no specific reason given",0.753178
1,He is already sociable he doesn't need the pill.,"Not comfortable with, changing something funda...",0.738462
2,He is already sociable he doesn't need the pill.,"Not comfortable with, changing the brain",0.765766
3,He is already sociable he doesn't need the pill.,"Not comfortable with, side effect",0.774335
4,He is already sociable he doesn't need the pill.,"Not comfortable with, general concerns about p...",0.739862
5,He is already sociable he doesn't need the pill.,"Not comfortable with, No need, nor medically n...",0.790427
6,He is already sociable he doesn't need the pill.,"Not comfortable with, naturalistic moral concerns",0.752322
7,He is already sociable he doesn't need the pill.,"Not comfortable with, risks outweight the bene...",0.760949
8,He is already sociable he doesn't need the pill.,"Not comfortable with, dependence, addiction",0.783955
9,He is already sociable he doesn't need the pill.,"Not comfortable with, social pressure to fit in",0.781162
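
As a rough check on the scores in the table above, the sketch below ranks them and measures their spread. The values are copied from the table and keyed by row number; the variable names are only illustrative.

```python
import pandas as pd

# Similarity scores copied from the table above, indexed by row number (0-9).
scores = pd.Series([
    0.753178, 0.738462, 0.765766, 0.774335, 0.739862,
    0.790427, 0.752322, 0.760949, 0.783955, 0.781162,
])

# Rank the codes from most to least similar to the comment.
ranked = scores.sort_values(ascending=False)
top_rows = list(ranked.index[:3])

# Difference between the best and worst score.
spread = ranked.iloc[0] - ranked.iloc[-1]

print(top_rows)                  # [5, 8, 9]
print(f"spread = {spread:.3f}")  # spread = 0.052
```

The highest and lowest scores differ by only about 0.05, which suggests that a single absolute threshold would separate the codes poorly for this comment.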


From the above, the similarity scores do not appear to correspond well with common sense or with the human coding.
This may be because the code descriptions are not descriptive enough. Below, a more detailed version of the code descriptions is used.

In [10]:
codes_detailed = pd.read_excel("../data/Changestoself_Analysis_Comments_2_R_2013_12_05.xlsx", sheet_name = "gpt_codes_updated"
)["Descriptions"]

pd.DataFrame(codes_detailed)

Unnamed: 0,Descriptions
0,Not comfortable with him taking the pill - no ...
1,Not comfortable with him taking the pill - cha...
2,Not comfortable with him taking the pill - cha...
3,Not comfortable with him taking the pill - sid...
4,Not comfortable with him taking the pill - gen...
5,Not comfortable with him taking the pill – no ...
6,Not comfortable with him taking the pill - nat...
7,Not comfortable with him taking the pill - ris...
8,Not comfortable with him taking the pill - dep...
9,Not comfortable with him taking the pill - soc...


In [11]:
# Same scoring loop as before, now using the detailed code descriptions.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
model = MODEL
rows = []
for m in comments_test:
    comment_embedding = get_embedding(m, engine=model)
    for c in codes_detailed:
        code_embedding = get_embedding(c, engine=model)
        score = cosine_similarity(comment_embedding, code_embedding)
        rows.append({"Comment": m, "Code": c, "Similarity score": score})
df_scores_2 = pd.DataFrame(rows, columns=["Comment", "Code", "Similarity score"])

In [12]:
df_scores_2

Unnamed: 0,Comment,Code,Similarity score
0,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - no ...,0.836325
1,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - cha...,0.822031
2,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - cha...,0.833908
3,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - sid...,0.835907
4,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - gen...,0.834913
5,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill – no ...,0.851079
6,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - nat...,0.827917
7,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - ris...,0.840452
8,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - dep...,0.834816
9,He is already sociable he doesn't need the pill.,Not comfortable with him taking the pill - soc...,0.859029


This may have improved things a little, but there are still many false positives (codes with a high score that the human coders did not assign). Another issue may be that each code has two parts: a label ('Comfortable with', 'Not comfortable with', or 'Other') plus a detail such as 'no specific reason' or 'his choice'.
Below, we first pick whichever of the three labels is most similar to the comment, and then check which detail under that label is most similar.
As 'Other' has little meaning by itself, it has been expanded to 'Other than "comfortable with" or "not comfortable with"'.
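
The two-step procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-dimensional vectors are toy stand-ins for real embeddings, only a few details per label are included, and `cosine_sim` and `two_stage` are hypothetical helper names.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_stage(comment_vec, label_vecs, detail_vecs):
    """Pick the closest top-level label, then the closest detail under it."""
    labels = list(label_vecs)
    label_scores = [cosine_sim(comment_vec, label_vecs[name]) for name in labels]
    best_label = labels[int(np.argmax(label_scores))]

    # Only the details belonging to the winning label are scored.
    details = detail_vecs[best_label]
    names = list(details)
    detail_scores = [cosine_sim(comment_vec, details[name]) for name in names]
    best_detail = names[int(np.argmax(detail_scores))]
    return best_label, best_detail

# Toy vectors standing in for real embeddings (illustrative only).
toy_labels = {
    "Comfortable with": [1.0, 0.0],
    "Not comfortable with": [0.0, 1.0],
    "Other": [-1.0, -1.0],
}
toy_details = {
    "Comfortable with": {"his choice": [1.0, 0.2], "positive benefits": [1.0, -1.0]},
    "Not comfortable with": {"side effect": [0.2, 1.0], "changing the brain": [-1.0, 1.0]},
    "Other": {"would take the pill": [-1.0, -0.5]},
}
toy_comment = [0.1, 1.0]

result = two_stage(toy_comment, toy_labels, toy_details)
print(result)  # ('Not comfortable with', 'side effect')
```

One consequence of this design is that an error in the first step cannot be recovered in the second, because details under the other labels are never scored.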

In [6]:
codes_separate = pd.read_excel("../data/Changestoself_Analysis_Comments_2_R_2013_12_05.xlsx", sheet_name = "gpt_codes_separate"
)
pd.DataFrame(codes_separate)

Unnamed: 0,Not Comfortable with,Comfortable with,Other
0,no specific reason given,no specific reason,offered a vague or irrelevant response
1,changing something fundamental,not changing anything fundamental,would take the pill
2,changing the brain,his choice,would not take the pill
3,side effect,minor or no side effects,
4,general concerns about pills,comfortable with pills and medication,
5,"No need, nor medically necessary",positive benefits,
6,naturalistic moral concerns,nothing wrong with using science or other mean...,
7,risks outweight the benefits,benefits outweight the risks,
8,"dependend,a ddiction",consent/informed decision,
9,social pressure to fit in,no pressure to take it,


In [7]:
codes_separate.columns

Index(['Not Comfortable with ', 'Comfortable with ', 'Other'], dtype='object')

In [13]:
model = MODEL 
comfort_embedding = get_embedding('Comfortable with pill', engine=model)
not_comfort_embedding = get_embedding('Not comfortable with pill', engine=model)
other_embedding = get_embedding('Other than "comfortable with pill" or "not comfortable with pill"', engine=model)

comfort_codes = codes_separate['Comfortable with '].dropna()
not_comfort_codes = codes_separate['Not Comfortable with '].dropna()
other_codes = codes_separate['Other'].dropna()


In [16]:
# Step 1: pick the closest top-level label. Step 2: score only that label's sub-codes.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
labels = ["Comfortable with", "Not comfortable with", "Other"]
label_codes = [comfort_codes, not_comfort_codes, other_codes]
rows = []
for m in comments_test:
    comment_embedding = get_embedding(m, engine=model)
    comfort_simil = cosine_similarity(comment_embedding, comfort_embedding)
    not_comfort_simil = cosine_similarity(comment_embedding, not_comfort_embedding)
    other_simil = cosine_similarity(comment_embedding, other_embedding)
    similarities = [comfort_simil, not_comfort_simil, other_simil]
    best = int(np.argmax(similarities))
    for c in label_codes[best]:
        code_embedding = get_embedding(c, engine=model)
        score = cosine_similarity(comment_embedding, code_embedding)
        rows.append({"Comment": m, "Comfort": labels[best],
                     "Code": c, "Similarity score": score})
df_scores_3 = pd.DataFrame(rows, columns=["Comment", "Comfort", "Code", "Similarity score"])

In [17]:
similarities

[0.8280262786083373, 0.8248459645112428, 0.8061247309737856]

In [18]:
df_scores_3

Unnamed: 0,Comment,Comfort,Code,Similarity score
0,He is already sociable he doesn't need the pill.,Comfortable with,no specific reason,0.743782
1,He is already sociable he doesn't need the pill.,Comfortable with,not changing anything fundamental,0.730096
2,He is already sociable he doesn't need the pill.,Comfortable with,his choice,0.786154
3,He is already sociable he doesn't need the pill.,Comfortable with,minor or no side effects,0.769459
4,He is already sociable he doesn't need the pill.,Comfortable with,comfortable with pills and medication,0.816261
5,He is already sociable he doesn't need the pill.,Comfortable with,positive benefits,0.768501
6,He is already sociable he doesn't need the pill.,Comfortable with,nothing wrong with using science or other mean...,0.758062
7,He is already sociable he doesn't need the pill.,Comfortable with,benefits outweight the risks,0.766489
8,He is already sociable he doesn't need the pill.,Comfortable with,consent/informed decision,0.726901
9,He is already sociable he doesn't need the pill.,Comfortable with,no pressure to take it,0.791502


The comment tested above is quite short. Similarity scores for a longer comment are shown below.

In [5]:
comments_test_long = [comments[4]]
comments_test_long

['John is healthy and is not suffering from any sort of problem. He is already sociable and it seems odd that he would seek to make himself even more sociable. I would be worried about side effects as well, though ultimately it is up to John to make decisions in his own life.']

In [15]:
# Same two-step scoring as for the short comment, applied to the longer comment.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
labels = ["Comfortable with", "Not comfortable with", "Other"]
label_codes = [comfort_codes, not_comfort_codes, other_codes]
rows = []
for m in comments_test_long:
    comment_embedding = get_embedding(m, engine=model)
    comfort_simil = cosine_similarity(comment_embedding, comfort_embedding)
    not_comfort_simil = cosine_similarity(comment_embedding, not_comfort_embedding)
    other_simil = cosine_similarity(comment_embedding, other_embedding)
    similarities = [comfort_simil, not_comfort_simil, other_simil]
    best = int(np.argmax(similarities))
    for c in label_codes[best]:
        code_embedding = get_embedding(c, engine=model)
        score = cosine_similarity(comment_embedding, code_embedding)
        rows.append({"Comment": m, "Comfort": labels[best],
                     "Code": c, "Similarity score": score})
df_scores_4 = pd.DataFrame(rows, columns=["Comment", "Comfort", "Code", "Similarity score"])

In [16]:
df_scores_4

Unnamed: 0,Comment,Comfort,Code,Similarity score
0,John is healthy and is not suffering from any ...,Other,offered a vague or irrelevant response,0.718729
1,John is healthy and is not suffering from any ...,Other,would take the pill,0.77501
2,John is healthy and is not suffering from any ...,Other,would not take the pill,0.773944


**Overall conclusion:**
    
This approach may not be able to distinguish between codes that differ only subtly. Embeddings have been useful for sentiment analysis into broad categories such as 'negative' and 'positive', but categories such as 'Comfortable with pill' and 'Not comfortable with pill' may be too subtle to separate with embeddings alone.