### Adding biographies
In this notebook I repeat the earlier ChatGPT experiments, first with fewer examples and clearer definitions. I then compare the results against the previous experiments and finally add the biographies as context to see how much they influence the classification. I might also try embedding variations. <br>
Exact list of planned baseline comparisons:
1. Embedding of post baseline
2. Embedding of post with simple instruction (no CoT)
3. Embedding of post with instruction & example (no CoT)
4. Embedding of post with instruction, example and CoT
5. ChatGPT completions for all of the above instead of embeddings
6. All of the above with biographies

I'll do this with the 200 double-checked posts or a similarly sized sample.<br>
To stay flexible with the prompts, I probably do not want to do this in a modular fashion.
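
To make the biography variant concrete, here is a minimal sketch of how a user message could be assembled with and without the biography. The function name `build_user_message` and the `Biography:` field are assumptions for illustration only; the `Post: '...'. User: @...` pattern follows the prompts used in this notebook. Missing biographies (for example `NaN` after a left merge) are simply skipped.

```python
def build_user_message(caption, username, biography=None):
    """Return a chat message dict for one post; the biography is optional context."""
    content = f"Post: '{caption}'. User: @{username}"
    # Hypothetical extension: append the profile biography only if it is a non-empty string.
    # NaN values from a left merge are floats and are therefore ignored.
    if isinstance(biography, str) and biography.strip():
        content += f" \n Biography: {biography.strip()}"
    return {"role": "user", "content": content}


without_bio = build_user_message("Back at the office", "some_user")
with_bio = build_user_message("Back at the office", "some_user", "Fashion blogger")
print(without_bio["content"])
print(with_bio["content"])
```

Keeping the biography as an optional argument would let the same few-shot prompt be run with and without it, so that any difference can be attributed to the added context.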

In [1]:
import pandas as pd
import sys
import openai
from helpers.GPTclassifier import gptclassifier
openai.api_key = ""
df = pd.read_pickle("data/df_sampled.pkl")

In [24]:
# (skip execution) Generating a sample to work with

# import instagram profiles
profiles = pd.read_pickle("data/df_profiles.pkl")

# import posts
sys.path.append('../7-Self-Labelled-Data')
df = pd.read_pickle("../7-Self-Labelled-Data/data/annotated_test_explantions_gpt3.pkl")

# remove extremely long posts
df = df[df["caption"].apply(lambda x: len(x)<=500)]
# keep only the 20 influencers with the most posts; this retains 70 % of the data
df = df[df["username"].isin(df['username'].value_counts().nlargest(20).index)]
# sample 10 posts per influencer
df_sampled = df.groupby('username', group_keys=False).apply(lambda x: x.sample(min(len(x), 10)))
# add biography & co from profiles
df_sampled = df_sampled.merge(profiles[['username', 'full_name', 'edge_followed_by', 'biography']], on='username', how='left')

In [7]:
messages = [{"role": "system", "content": "You are an assistant helping an academic to reason about whether a post contains (potentially non-commercial) promotional activity or even is potentially sponsored. I will provide you with the caption of an Instagram post as well as maybe the caption of other posts from the same user, so you have more context. You give me a short and concise reasoning why or why not the main post might be an ad, i.e. the result of a financial contract. For later classification there are four labels available, 'Potentially sponsored', 'Self advertisement', 'Ambiguous' and 'Likely not sponsored'. Be concise in your reasoning and always strictly adhere to the pattern from the examples, i.e. always decide for one and only one label and finish your response with it and a dot after. If you are uncertain, err strongly towards 'Potentially sponsored'. Also strongly prefer 'Self advertisement' over 'Ambiguous'. Always keep responses short and concise."},
{"role": "user", "content": "Main Post: ''I DO NOT OWN THE RIGHTS TO THIS SONG. Upload attempt #2.... I COULD NOT STOP playing this song over Christmas break for some odd reason. It’s my favorite joint off of @badgalriri ‘s #anti album. Listening repeatedly made wonder what it would sound like with drums... 🤔😏 #thepocketqueen 👸🏾♥️🤦🏾\u200d♀️ #practicemakespocket #jamesjoint #groovewithme #drummerbae\n\nHair: @hairbylucylomuro_ \nThreads: @truequeenbrand'. Author: @thepocketqueen \n Context Post 1: @erinelijah this is amazing. \nI appreciate it so much ♥️♠️ \n#thepocketqueen #pocketqlub♦️♣️ #fanart. \n Context Post 2: 2 days ago ♠️𝗣𝗼𝗰𝗸𝗲𝘁 𝗤𝘂𝗲𝗲𝗻 & 𝗧𝗵𝗲 𝗥𝗼𝘆𝗮𝗹 𝗙𝗹𝘂𝘀𝗵 ♥️ made its debut performance after placing as a runner up in @nprmusic ‘s #TinyDeskContest  \nIt’s one thing to do cool stuff. It’s a whole other blessing to be able to do it with your friends. :) BIG thanks to @ryckjane and @iammattrose for coming through and hopping on the mic. \n📸: @farahstop."},
{"role": "assistant", "content": "Key indicators: 'of @badgalriri ‘s #anti album', 'Threads: @truequeenbrand', 'Hair: @hairbylucylomuro'.\nReasoning: The post clearly promotes a song, another artist @badgalriri. Additionally there are several businesses featured in the. Each of those four aspects by itself is some indication of sponsoring, so all together clearly potentially sponsored. Label: Potentially sponsored."},
{"role": "user", "content": "Post: 'I love cheeseburgers so much!😱 @barneysburgershop'. Author: @stevietheking. \n Context Post 1: [...] \n Context Post 2: [...]"},
{"role": "assistant", "content": "Key indicators: '@barneysburgershop'.\nReasoning: The post clearly promotes a restaurant called barneysburgershop. However it is also extremely common that people feature restaurants because they genuinely enjoy their food or want to show off with it. Lacking further evidence, it is rather ambiguous than a paid partnership. Label: Ambiguous."},
{"role": "user", "content": "Post: 'She drives me INSANE every other hour, but i don’t know what i would do without her crazy ass! #sisters'. User: @thestilettomeup \n Context Post 1: [...] \n Context Post 2: [...]"},
{"role": "assistant", "content": "Key indicators: '#sisters'.\nReasoning: Clearly a personal post about the author's sister. Label: Likely not sponsored."},
{"role": "user", "content": "Post: 'weworewhat swim spring collection home designs and prints inspired by elements from my home and favorite interior design what you can expect this silhouette print inspired by the many female body sculptures that can be found in my apartment marbles cowhide florals and more @shopweworewhat'. Author: weworewhat \n Context Post 1: [...] \n Context Post 2: [...]"},
{"role": "assistant", "content": "Key indicators: 'weworewhat', '@shopweworewhat'.\nReasoning: This post clearly advertises swim suits. However the shop @shopweworewhat is clearly a shop of the author weworewhat herself, so it's very unlikely a paid partnership but clearly self advertisement. Label: Self advertisement."},
{"role": "user", "content": "Post: 'A night in San Francisco 💋 I am so excited to meet all of the @createcultivate attendees and share more of my story... #moreofwhatmatters \nTop: @storets \nPants : @jacquemus \nShoes: @gianvitorossi \nStyled By: @monicarosestyle'. Author: iamcattsadler. \n Context Post 1: She’s becoming such a young lady and it’s all happening so fast! \nMy sweet heart! 💚 \nMatching outfits from @byegreis + @ivygreis. \n Context Post 2: Legs or shoes? Take your pick! 😜 \nNew booties @aminamuaddiofficial \nI love me a good platform!"},
{"role": "assistant", "content": "Key indicators: 'Top: @storets', 'Pants : @jacquemus', 'Shoes: @gianvitorossi', 'Styled By: @monicarosestyle'.\nReasoning: This post promotes various fashion brands and stylers. There is no evidence suggesting it is not paid. Label: Potentially sponsored."}]

In [14]:
# caption lengths (in characters) of posts 30 to 39
for _, row in df[30:40].iterrows():
    print(len(row['caption']))

356
1018
932
628
897
662
798
287
1110
294


In [5]:
# execute classification: 3 per minute
completions_classic = []
results = gptclassifier(df.reset_index(drop=True),messages,completions_classic)

Counter at 2
Counter at 7
Counter at 12
Counter at 17
Counter at 22
Counter at 27
Counter at 32
Counter at 37
Waiting for 65s InvalidRequestError
-----------------
This model's maximum context length is 4097 tokens. However, your messages resulted in 4185 tokens. Please reduce the length of the messages.


KeyboardInterrupt: 
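
The error above shows that the few-shot prompt plus a long caption can exceed the model's 4097-token context. One possible guard is to check the prompt size before sending it and to skip or shorten posts that do not fit. The sketch below uses a crude estimate of about 4 characters per token, which is only a rule of thumb for English text and will likely undercount emoji-heavy captions; an exact count would need the model's tokenizer. The names `estimate_tokens` and `fits_context` and the reserve of 300 tokens for the reply are assumptions.

```python
MAX_CONTEXT_TOKENS = 4097  # limit reported in the error message above
CHARS_PER_TOKEN = 4        # rough rule of thumb, NOT an exact tokenizer


def estimate_tokens(messages):
    """Crude token estimate: total characters of all message contents divided by 4."""
    total_chars = sum(len(m["content"]) for m in messages)
    return total_chars // CHARS_PER_TOKEN + 1


def fits_context(messages, reply_reserve=300):
    """Return True if the estimated prompt size leaves room for the reply."""
    return estimate_tokens(messages) + reply_reserve <= MAX_CONTEXT_TOKENS


short_prompt = [{"role": "user", "content": "a" * 400}]
long_prompt = [{"role": "user", "content": "a" * 20000}]
print(fits_context(short_prompt))
print(fits_context(long_prompt))
```

Because the estimate is approximate, a conservative reserve is safer than trying to fill the context exactly.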

In [22]:

df["caption"].str.len()

0       30
1      130
2       49
3      339
4      101
      ... 
195    453
196    342
197    206
198    363
199    551
Name: caption, Length: 200, dtype: int64

In [42]:
import pandas as pd
import openai
import time

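def gptclassifier(df, base_messages, completions, timer_frequency=5):
    """Debug version: builds the prompt for each post but does not call the API."""
    rows = df.loc[:, ["caption", "username"]].iterrows()
    for i, (_, row) in enumerate(rows, start=1):

        # progress indicator
        if i % timer_frequency == 2:
            print(f"Counter at {i}")

        # shallow copy, so the few-shot messages themselves stay unchanged
        messages = base_messages.copy()
        messages.append({"role": "user", "content": f"Post: '{row['caption']}'. User: @{row['username']}"})

        # the API call (in a try/except to handle OpenAI's limits) would go here;
        # for now, print the new post and the last few-shot message to inspect the prompt
        print(messages[-1])
        print(base_messages[-1])
        if i == 3:
            return  # stop after three posts while debugging

    return completions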
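
The earlier run printed `Waiting for 65s` after an API error, which suggests a wait-and-retry guard around each request. A minimal sketch of such a guard follows. It takes a generic callable instead of the real API so that it runs offline; `call_with_retry`, `flaky_call` and all parameter names are hypothetical, and this is not the code from `helpers.GPTclassifier`.

```python
import time


def call_with_retry(request_fn, max_attempts=3, wait_seconds=65, sleep=time.sleep):
    """Call request_fn(); on an exception, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception as err:  # in practice, catch the API's specific error types
            if attempt == max_attempts:
                raise  # give up and surface the last error
            print(f"Waiting for {wait_seconds}s {type(err).__name__}")
            sleep(wait_seconds)


# Stand-in for the API call: fails once, then succeeds.
attempts = []


def flaky_call():
    attempts.append(1)
    if len(attempts) < 2:
        raise RuntimeError("simulated rate limit")
    return "Label: Ambiguous."


# The sleep function is replaced so the example does not actually wait.
result = call_with_retry(flaky_call, sleep=lambda seconds: None)
print(result)
```

Retrying only helps with transient failures such as rate limits. A request that exceeds the context length fails the same way every time, so it should be skipped or shortened instead of retried.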

In [43]:
gptclassifier(df.reset_index(drop=True),messages,completions_classic)

{'role': 'user', 'content': "Post: 'Cook-off w/ friends dropped @6pm click the link in bio to watch now || sorry for the late notice I was slumped😭😂'. User: @cleopatraa_duess"}
{'role': 'assistant', 'content': "Key indicators: 'Top: @storets', 'Pants : @jacquemus', 'Shoes: @gianvitorossi', 'Styled By: @monicarosestyle'.\nReasoning: This post promotes various fashion brands and stylers. There is no evidence suggesting it is not paid. Label: Potentially sponsored."}
Counter at 2
{'role': 'user', 'content': "Post: 'Back at the office 💛\n\nYellow is my absolute favorite color! There’s nothing like it! \n\nOutfit @byegreis \nBag @dior \nShoes @hermes'. User: @thestilettomeup"}
{'role': 'assistant', 'content': "Key indicators: 'Top: @storets', 'Pants : @jacquemus', 'Shoes: @gianvitorossi', 'Styled By: @monicarosestyle'.\nReasoning: This post promotes various fashion brands and stylers. There is no evidence suggesting it is not paid. Label: Potentially sponsored."}
{'role': 'user', 'content':

In [None]:
# Next steps:
# 1. study failure cases
# 2. adjust the prompt
# 3. expand experiments
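
For studying failure cases, the final label first has to be pulled out of each completion. Since the prompt asks the model to finish its response with the label and a dot, and the examples end in `Label: <label>.`, a regular expression is one way to do this. In the sketch below, `extract_label` and the fallback value `None` are assumptions.

```python
import re

LABELS = [
    "Potentially sponsored",
    "Self advertisement",
    "Ambiguous",
    "Likely not sponsored",
]
# Match "Label: <one of the four labels>" at the very end, with an optional final dot.
LABEL_PATTERN = re.compile(
    r"Label:\s*(" + "|".join(map(re.escape, LABELS)) + r")\.?\s*$"
)


def extract_label(completion):
    """Return the label at the end of a completion, or None if none is found."""
    match = LABEL_PATTERN.search(completion.strip())
    return match.group(1) if match else None


example = (
    "Key indicators: '#sisters'.\n"
    "Reasoning: Clearly a personal post about the author's sister. "
    "Label: Likely not sponsored."
)
print(extract_label(example))
print(extract_label("No label here"))
```

Completions for which this returns `None` did not follow the requested pattern and are worth inspecting separately from ordinary misclassifications.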

In [None]:
# Useful for later: this version already includes the context-post hints
messages = [{"role": "system", "content": "You are an assistant helping an academic to reason about whether a post contains (potentially non-commercial) promotional activity or even is potentially sponsored. I will provide you with the caption of an Instagram post as well as maybe the caption of other posts from the same user, so you have more context. You give me a short and concise reasoning why or why not the main post might be an ad, i.e. the result of a financial contract. For later classification there are four labels available, 'Potentially sponsored', 'Self advertisement', 'Ambiguous' and 'Likely not sponsored'. Be concise in your reasoning and always strictly adhere to the pattern from the examples, i.e. always decide for one and only one label and finish your response with it and a dot after. If you are uncertain, err strongly towards 'Potentially sponsored'. Also strongly prefer 'Self advertisement' over 'Ambiguous'. Always keep responses short and concise."},
{"role": "user", "content": "Main Post: ''I DO NOT OWN THE RIGHTS TO THIS SONG. Upload attempt #2.... I COULD NOT STOP playing this song over Christmas break for some odd reason. It’s my favorite joint off of @badgalriri ‘s #anti album. Listening repeatedly made wonder what it would sound like with drums... 🤔😏 #thepocketqueen 👸🏾♥️🤦🏾\u200d♀️ #practicemakespocket #jamesjoint #groovewithme #drummerbae\n\nHair: @hairbylucylomuro_ \nThreads: @truequeenbrand'. Author: @thepocketqueen \n Context Post 1: @erinelijah this is amazing. \nI appreciate it so much ♥️♠️ \n#thepocketqueen #pocketqlub♦️♣️ #fanart. \n Context Post 2: 2 days ago ♠️𝗣𝗼𝗰𝗸𝗲𝘁 𝗤𝘂𝗲𝗲𝗻 & 𝗧𝗵𝗲 𝗥𝗼𝘆𝗮𝗹 𝗙𝗹𝘂𝘀𝗵 ♥️ made its debut performance after placing as a runner up in @nprmusic ‘s #TinyDeskContest  \nIt’s one thing to do cool stuff. It’s a whole other blessing to be able to do it with your friends. :) BIG thanks to @ryckjane and @iammattrose for coming through and hopping on the mic. \n📸: @farahstop."},
{"role": "assistant", "content": "Key indicators: 'of @badgalriri ‘s #anti album', 'Threads: @truequeenbrand', 'Hair: @hairbylucylomuro'.\nReasoning: The post clearly promotes a song, another artist @badgalriri. Additionally there are several businesses featured in the. Each of those four aspects by itself is some indication of sponsoring, so all together clearly potentially sponsored. Label: Potentially sponsored."},
{"role": "user", "content": "Post: 'I love cheeseburgers so much!😱 @barneysburgershop'. Author: @stevietheking. \n Context Post 1: [...] \n Context Post 2: [...]"},
{"role": "assistant", "content": "Key indicators: '@barneysburgershop'.\nReasoning: The post clearly promotes a restaurant called barneysburgershop. However it is also extremely common that people feature restaurants because they genuinely enjoy their food or want to show off with it. Lacking further evidence, it is rather ambiguous than a paid partnership. Label: Ambiguous."},
{"role": "user", "content": "Post: 'She drives me INSANE every other hour, but i don’t know what i would do without her crazy ass! #sisters'. User: @thestilettomeup \n Context Post 1: [...] \n Context Post 2: [...]"},
{"role": "assistant", "content": "Key indicators: '#sisters'.\nReasoning: Clearly a personal post about the author's sister. Label: Likely not sponsored."},
{"role": "user", "content": "Post: 'weworewhat swim spring collection home designs and prints inspired by elements from my home and favorite interior design what you can expect this silhouette print inspired by the many female body sculptures that can be found in my apartment marbles cowhide florals and more @shopweworewhat'. Author: weworewhat \n Context Post 1: [...] \n Context Post 2: [...]"},
{"role": "assistant", "content": "Key indicators: 'weworewhat', '@shopweworewhat'.\nReasoning: This post clearly advertises swim suits. However the shop @shopweworewhat is clearly a shop of the author weworewhat herself, so it's very unlikely a paid partnership but clearly self advertisement. Label: Self advertisement."},
{"role": "user", "content": "Post: 'A night in San Francisco 💋 I am so excited to meet all of the @createcultivate attendees and share more of my story... #moreofwhatmatters \nTop: @storets \nPants : @jacquemus \nShoes: @gianvitorossi \nStyled By: @monicarosestyle'. Author: iamcattsadler. \n Context Post 1: She’s becoming such a young lady and it’s all happening so fast! \nMy sweet heart! 💚 \nMatching outfits from @byegreis + @ivygreis. \n Context Post 2: Legs or shoes? Take your pick! 😜 \nNew booties @aminamuaddiofficial \nI love me a good platform!"},
{"role": "assistant", "content": "Key indicators: 'Top: @storets', 'Pants : @jacquemus', 'Shoes: @gianvitorossi', 'Styled By: @monicarosestyle'.\nReasoning: This post promotes various fashion brands and stylers. There is no evidence suggesting it is not paid. Label: Potentially sponsored."}]

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
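
The imported `confusion_matrix` compares true and predicted labels, with rows for the true label and columns for the predicted one. To illustrate what such a matrix contains for the four labels used here, the following dependency-free sketch tallies (true, predicted) pairs. The function `confusion_counts` is hypothetical and the example labels are invented; they are not results from these experiments.

```python
from collections import Counter

LABELS = [
    "Potentially sponsored",
    "Self advertisement",
    "Ambiguous",
    "Likely not sponsored",
]


def confusion_counts(y_true, y_pred, labels=LABELS):
    """Return a matrix where row i, column j counts true labels[i] predicted as labels[j]."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


# Invented toy data: one 'Ambiguous' post is misclassified as 'Potentially sponsored'.
y_true = ["Ambiguous", "Ambiguous", "Self advertisement", "Likely not sponsored"]
y_pred = ["Ambiguous", "Potentially sponsored", "Self advertisement", "Likely not sponsored"]

matrix = confusion_counts(y_true, y_pred)
for label, row in zip(LABELS, matrix):
    print(f"{label:>22}: {row}")
```

Since the prompt tells the model to err towards 'Potentially sponsored' when uncertain, the first column is the natural place to look for systematic over-prediction.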
