This file details our efforts to compare different prompts and parameters when using a pretrained model for few-shot learning, with the goal of generating a fake profile that reads as indistinguishable from a user's other top matches.

Different avenues explored include using:
    - Full essays 
        - Example prompt:
            This is the input essay: 
            This is a good match:
            This is a good match:
            Write a new good match:
 
    - The first x characters of each essay

    - Different parameters of the generate function
    
    - Different or larger versions of the model (distilgpt2, gpt2)
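
The example prompt listed above can be sketched as a small helper. This is a minimal illustration rather than the function used later in this file: the name `build_few_shot_prompt`, the `num_char` argument, and the newline separators are assumptions made for the sketch.

```python
def build_few_shot_prompt(input_essay, match_essays, num_char=None):
    """Assemble a few-shot prompt following the template listed above.

    num_char, if given, keeps only the first num_char characters of each match essay.
    """
    lines = [f"This is the input essay: {input_essay}"]
    for essay in match_essays:
        if num_char is not None:
            essay = essay[:num_char]  # "first x characters" variant
        lines.append(f"This is a good match: {essay}")
    # The final line asks the model to continue with a new profile.
    lines.append("Write a new good match:")
    return "\n".join(lines)


example = build_few_shot_prompt(
    "I love hiking and data.",
    ["i enjoy long walks and good wine.", "big teddy bear who likes cooking."],
    num_char=12,
)
print(example)
```

Passing `num_char=None` gives the full-essay variant, so both avenues can be compared with the same helper.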



This file loads the GPT-2 model from the HuggingFace library, takes most of an individual's essays, calculates their top matches, and constructs a few-shot prompt so that the model produces a fake profile in the same style and tone that should also be a good match for them. It then evaluates the result by checking where the fake profile would rank relative to the rest of the input person's matches.

In [1]:
!pip install sentence_transformers
from ast import literal_eval  # ast is part of the standard library; no pip install is needed

import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer
from transformers import pipeline






In [2]:
embedding_series = pd.read_csv('./embedding_series.csv').set_index('Unnamed: 0')
df_all = pd.read_csv('./okcupid_profiles.csv')
df_all = df_all.loc[embedding_series.index,:] 
matches = pd.read_csv('./okcupid_matches.csv').set_index('Unnamed: 0')
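
Because the embeddings are stored in the CSV as stringified lists, they have to be parsed before use. The sketch below shows one way to parse them once into a NumPy matrix instead of on every ranking call. It runs on a toy frame; `toy_embedding_series`, `embedding_matrix`, and `row_of` are hypothetical names, and the column name `embedding` is taken from the code later in this file.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Toy stand-in for embedding_series.csv: one stringified vector per profile index.
toy_embedding_series = pd.DataFrame(
    {"embedding": ["[1.0, 0.0, 0.0]", "[0.0, 1.0, 0.0]", "[0.6, 0.8, 0.0]"]},
    index=[10, 20, 30],
)

# Parse each string once and stack the vectors into an (n_profiles, dim) matrix.
embedding_matrix = np.array(
    [literal_eval(s) for s in toy_embedding_series["embedding"]], dtype=float
)

# Map profile index -> row position so filtered subsets can be looked up quickly.
row_of = {idx: pos for pos, idx in enumerate(toy_embedding_series.index)}

subset = [30, 10]  # e.g. the profiles left after gender/age filtering
subset_vectors = embedding_matrix[[row_of[i] for i in subset]]
print(subset_vectors.shape)
```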

In [3]:
model = SentenceTransformer('all-MiniLM-L6-v2')



In [4]:
def rank_new_input(input_str, eval_fake=False, pref_gender=False, pref_age_lower=False,
                   pref_age_higher=False, min_similarity_score=0.5):
    """Rank candidate profiles by cosine similarity to input_str.

    If eval_fake is a string, it is embedded and added as candidate 99999.
    Returns a list of (profile_index, similarity) tuples, best match first.
    """
    df_possible = df_all.copy()
    # Apply the optional gender and age-range preferences
    if pref_gender:
        df_possible = df_possible.loc[df_possible.loc[:, 'sex'] == pref_gender, :]
    if pref_age_higher:
        df_possible = df_possible[df_possible.loc[:, "age"] <= pref_age_higher]
    if pref_age_lower:
        df_possible = df_possible[df_possible.loc[:, "age"] >= pref_age_lower]
    user_embeddings = model.encode(input_str)
    if eval_fake:
        # Add the fake profile to embedding_series and df_possible so that it is ranked with the real candidates.
        # Note: this writes row 99999 into the global embedding_series.
        embedding_series.loc[99999, 'embedding'] = str(model.encode(eval_fake).tolist())
        # Placeholder row with the same columns as df_possible; only its index is used below
        fake_profile_row = pd.DataFrame(np.nan, index=[99999], columns=df_possible.columns)
        df_possible = pd.concat([df_possible, fake_profile_row])

    other_embeddings = [literal_eval(embedding_series.loc[i, 'embedding']) for i in df_possible.index]
    # Compute the cosine similarity between the user's embedding vector and all possible matches
    cosine_similarities = compute_cosine_similarity(user_embeddings, other_embeddings)
    # Recover index to match back to the original dataframe; a score of exactly 1 (an identical profile) is dropped
    similarity_scores = [(df_possible.index[index], score) for index, score in enumerate(cosine_similarities)
                         if score >= min_similarity_score and score != 1]
    # Sort by similarity, highest first
    ranked_similarity = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    return ranked_similarity

def compute_cosine_similarity(target_vector, vectors):
    similarities = []
    for vector in vectors:
        similarity = 1 - cosine(target_vector, vector)  # 1 - cosine distance to get cosine similarity
        similarities.append(similarity)
    return similarities
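
The per-vector loop above can also be written as a single matrix operation. The following is a sketch of an equivalent vectorized computation, not the code used in this notebook; the name `cosine_similarity_matrix` is hypothetical, and it assumes none of the vectors has zero length.

```python
import numpy as np


def cosine_similarity_matrix(target_vector, vectors):
    """Cosine similarity between one target vector and each row of vectors."""
    target = np.asarray(target_vector, dtype=float)
    matrix = np.asarray(vectors, dtype=float)
    # Dot product of every row with the target, divided by the product of norms.
    dots = matrix @ target
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    return dots / norms


sims = cosine_similarity_matrix([1.0, 0.0], [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
print(sims)
```

For the toy input, the three rows are parallel, orthogonal, and at an angle to the target, which gives similarities of 1, 0, and 0.6 respectively.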

In [6]:
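def construct_prompt(input_string, matches, num_char=False):
    """Build a few-shot prompt from the input essay and the top two matches.

    matches is a ranked list of (profile_index, similarity) tuples; num_char,
    if set, keeps only the first num_char characters of each match's essays.
    """
    slice_len = min(2, len(matches))
    top_matches_slice = matches[:slice_len]
    essays_to_use = ["essay0", "essay1", "essay2", "essay3", "essay4",
                     "essay5", "essay6", "essay7", "essay8"]
    prompt = 'Write a dating profile that would be a good match for the input person ' + "Input: " + input_string
    for i, _ in top_matches_slice:
        essays_subset = df_all.loc[i, essays_to_use]
        # Concatenate the essays with no separator (missing essays are skipped)
        output = essays_subset.str.cat()
        if num_char:
            output = output[:num_char]
        prompt = prompt + " This is a good match: " + output
    # The input is repeated with no leading space, so it directly follows the last match's text
    return prompt + f"Input: {input_string}" + " Here is a new good match: "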
    

In [9]:
# Testing with an essay input.
# new_input_matches is produced by the rank_new_input cell shown below (In [21]); run that cell first.
input_str = """Hey there! I'm Jimmy, but you can call me Jim. 
I'm a curious soul with a zest for life and a passion for adventure. By day, I'm a data scientist, 
but by night, I'm a dreamer exploring the wonders of the world, both near and far."""

prompt = construct_prompt(input_str, new_input_matches, 200)  # first 200 characters of each match
prompt_full = construct_prompt(input_str, new_input_matches)  # full essays
prompt

"Write a dating profile that would be a good match for the input person Input: Hey there! I'm Jimmy, but you can call me Jim. \nI'm a curious soul with a zest for life and a passion for adventure. By day, I'm a data scientist, \nbut by night, I'm a dreamer exploring the wonders of the world, both near and far. This is a good match: down to earth, mellow while being witty and sarcastic. i can have just as much fun staying in and cooking a great meal as going out. i like good wine, anything to do with art and design, non judgmenta This is a good match: i have been called a marshmallow one that likes to be made all warm and gewie, a big teddy bear, amongst other things....haha. im a total sweetheart and would like to fall in love again someday.enjoyiInput: Hey there! I'm Jimmy, but you can call me Jim. \nI'm a curious soul with a zest for life and a passion for adventure. By day, I'm a data scientist, \nbut by night, I'm a dreamer exploring the wonders of the world, both near and far. Her

In [21]:
new_input_matches = rank_new_input(input_str, False, 'f', 30, 40, 0.2)
# Returns a list of (profile_index, similarity) tuples: the top matches in the OKCupid dataframe for the new person.
new_input_matches
# We want our model to write a profile that ranks high in this list.

[(54354, 0.4546233157722148),
 (31601, 0.4204412581113921),
 (78, 0.4032632290748377),
 (43058, 0.3957084927049437),
 (23322, 0.3905148653583399),
 (22507, 0.38899890242766744),
 (53475, 0.387499888668837),
 (9240, 0.38738268014657873),
 (26859, 0.383487825844818),
 (37274, 0.38270580103481067),
 (19336, 0.3820109219521257),
 (51521, 0.3799436577992559),
 (11609, 0.3797088509301725),
 (22292, 0.37954910628075944),
 (889, 0.37927279462245234),
 (57675, 0.37553829840625297),
 (40885, 0.3741728768502228),
 (16263, 0.370865703242099),
 (16577, 0.36753423881976),
 (21302, 0.367109021357096),
 (25636, 0.367023060033584),
 (48617, 0.3660413768437507),
 (8638, 0.36531512861417603),
 (57359, 0.3651334388315668),
 (31066, 0.3615338375156606),
 (42440, 0.361525175541485),
 (32404, 0.3614639209230699),
 (24683, 0.35759606996206306),
 (48559, 0.35623860736085855),
 (54615, 0.35568946229763854),
 (28969, 0.3555036823318347),
 (30953, 0.3552727980628845),
 (49643, 0.35468942716777163),
 (29198, 0.354

In [22]:
generator = pipeline('text-generation', model='gpt2')
all_returned = generator(prompt, do_sample=True, temperature = 0.9, truncation = True,
                         min_length=200, max_length = 1000, num_return_sequences=1)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
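
One parameter variation worth sketching: `max_length` counts the prompt tokens as well as the generated ones, whereas `max_new_tokens` budgets only the continuation, and `return_full_text=False` asks the text-generation pipeline to leave the prompt out of its output. The helper below is an illustration under those assumptions and was not part of the runs reported here; `generate_continuation` and `fake_generator` are hypothetical names, and the stand-in generator exists only so the sketch runs without downloading a model. With GPT-2, the prompt plus the new tokens still has to fit in the 1024-token context window.

```python
def generate_continuation(generator, prompt, max_new_tokens=200, temperature=0.9):
    """Return only the newly generated text for prompt.

    generator is assumed to be a transformers text-generation pipeline
    (or any callable with the same call signature and return shape).
    """
    outputs = generator(
        prompt,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,  # budget for new tokens only
        return_full_text=False,         # omit the prompt from the returned text
        num_return_sequences=1,
    )
    return outputs[0]["generated_text"]


# Stand-in generator (assumption for illustration) mimicking the pipeline's return shape.
def fake_generator(prompt, **kwargs):
    return [{"generated_text": " i like hiking and good coffee."}]


continuation = generate_continuation(fake_generator, "Here is a new good match:")
print(continuation)
```

In the notebook, the real `generator` object would be passed in place of `fake_generator`.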


In [11]:
all_returned2 = generator(prompt_full, do_sample=True, temperature = 0.9, truncation = True,
                         min_length=200, max_length = 1000, num_return_sequences=1)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [23]:
print(all_returned)


[{'generated_text': 'Write a dating profile that would be a good match for the input person Input: Hey there! I\'m Jimmy, but you can call me Jim. \nI\'m a curious soul with a zest for life and a passion for adventure. By day, I\'m a data scientist, \nbut by night, I\'m a dreamer exploring the wonders of the world, both near and far. This is a good match: down to earth, mellow while being witty and sarcastic. i can have just as much fun staying in and cooking a great meal as going out. i like good wine, anything to do with art and design, non judgmenta This is a good match: i have been called a marshmallow one that likes to be made all warm and gewie, a big teddy bear, amongst other things....haha. im a total sweetheart and would like to fall in love again someday.enjoyiInput: Hey there! I\'m Jimmy, but you can call me Jim. \nI\'m a curious soul with a zest for life and a passion for adventure. By day, I\'m a data scientist, \nbut by night, I\'m a dreamer exploring the wonders of the w

In [13]:
print(all_returned2)

[{'generated_text': "Write a dating profile that would be a good match for the input person Input: Hey there! I'm Jimmy, but you can call me Jim. \nI'm a curious soul with a zest for life and a passion for adventure. By day, I'm a data scientist, \nbut by night, I'm a dreamer exploring the wonders of the world, both near and far. This is a good match: down to earth, mellow while being witty and sarcastic. i can have just as much fun staying in and cooking a great meal as going out. i like good wine, anything to do with art and design, non judgmental people, laughing and working out to name a few...baking.on this website: you may notice i'm not sure it is for me... thought i might give it a shot. putting my toes in the water... This is a good match: i have been called a marshmallow one that likes to be made all warm and gewie, a big teddy bear, amongst other things....haha. im a total sweetheart and would like to fall in love again someday.enjoying it!!!!being normali have been told sev

In [26]:
fake_profile = all_returned[0]['generated_text'].replace(prompt, "")
# This is just the generated essay, with the prompt removed.
# This is the text that needs to be checked for toxicity.
fake_profile

'\xa0I like to eat chocolate every day; not a chocolate cake for the poor and the hungry It looks like a good match: it\'s one of the best things I\'ve ever tasted from the folks at Jim\'s. \nI love to cook with a hot chocolate bar, my favorite place in town to do so, and there\'s a lot of people that like to make chocolate cake for dessert in the evenings. \nI\'ve never been to any other bar or restaurant I would feel the need to get out and ask for a cookie...and so do many of my friends. \n"You\'re so sweet. Look out my window." -Iggy Pop\nHere is a new good match: \xa0Just as cute is the sweet looking little one, I guess it\'s something I can\'t quite remember. I\'ve had a lot of bad luck lately, but you get the picture. \nIf you want to make your own, the only thing better than a few of the top dishes in the bar is to be able to sample a batch of the delicious ingredients and make the whole menu (and your party) super special. I did not start with it, but when I first started maki
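
The generated text above contains non-breaking spaces (`\xa0`) and repeats the prompt marker "Here is a new good match:" partway through. One possible cleanup, sketched here as an assumption rather than a step this notebook performs, is to keep only the text before the first marker and normalise whitespace. The name `clean_generated_profile` and the choice of stop markers are hypothetical.

```python
def clean_generated_profile(
    text,
    stop_markers=("Here is a new good match:", "This is a good match:", "Input:"),
):
    """Keep the text before the first stop marker and normalise whitespace."""
    cut = len(text)
    for marker in stop_markers:
        pos = text.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    # str.split() with no arguments also splits on non-breaking spaces (\xa0).
    return " ".join(text[:cut].split())


raw = "\xa0I like chocolate.\nI love to cook. Here is a new good match: \xa0Just as cute"
cleaned = clean_generated_profile(raw)
print(cleaned)
```

Whether to truncate at the first marker is a design choice: it discards any later text, which may or may not be desirable when scoring the profile.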

In [27]:
# Evaluation methodology
new_output_matches = rank_new_input(input_str, fake_profile, 'f', 30, 40, 0)
# Ideally, our fake profile (index 99999) should rank near the top of this list.
# Note that count is zero-based, so the best possible rank is 0.
for count, (i, sim) in enumerate(new_output_matches):
    if i == 99999:
        print(f"Our fake profile ranked {count}, out of {len(new_output_matches)} with a similarity score of {sim}")

Our fake profile ranked 531, out of 1350 with a similarity score of 0.2630414187331882
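
To compare runs across prompts and parameters, the rank can be turned into a fraction of the candidate list. Since the printed rank is zero-based, 531 corresponds to position 532 of 1350, or roughly the top 39%. The helper below is a sketch of that bookkeeping on a toy ranked list; `rank_summary` is a hypothetical name, and the toy scores are made up purely for illustration.

```python
def rank_summary(ranked, target_index):
    """Summarise where target_index falls in a ranked list of (index, similarity)."""
    for position, (idx, sim) in enumerate(ranked, start=1):  # one-based position
        if idx == target_index:
            return {
                "rank": position,
                "total": len(ranked),
                "top_fraction": position / len(ranked),
                "similarity": sim,
            }
    return None  # target was filtered out, e.g. by min_similarity_score


# Toy ranked list (illustrative values only), with 99999 as the fake profile.
toy_ranked = [(1, 0.9), (99999, 0.5), (2, 0.4), (3, 0.1)]
summary = rank_summary(toy_ranked, 99999)
print(summary)
```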
