# Output of LLMs with Rewriting Prompt - Gemini and Gemma

### Dataset Overview
**LLM Prompt Recovery Data for Gemini and Gemma**: This is a dataset containing text data from gemini and gemma models. Each of them contains original text, a prompt to require rewriting the text in a specified direction, and the rewritten text.  
https://www.kaggle.com/datasets/newtonbaba12345/llm-prompt-recovery-data-gemini-and-gemma.   

### Project Goal
My idea is to analyze the relationship between original text and rewritten text, prompt and rewritten text, as well as (original text + prompt) and rewritten text. I use cosine similarity to calculate semantic similarity for each of the three pairs, and do some comparison.  

# Install / Import packages

In [184]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import nltk
import string
nltk.download('stopwords')
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import seaborn as sns

# I've checked that we can appropriately ignore some warnings here
import warnings
warnings.filterwarnings('ignore') # Ignore warnings

/kaggle/input/experts-bert/tensorflow2/wiki-books-qqp/2/saved_model.pb
/kaggle/input/experts-bert/tensorflow2/wiki-books-qqp/2/assets/vocab.txt
/kaggle/input/experts-bert/tensorflow2/wiki-books-qqp/2/variables/variables.index
/kaggle/input/experts-bert/tensorflow2/wiki-books-qqp/2/variables/variables.data-00000-of-00001
/kaggle/input/experts-bert/tensorflow2/pubmed/2/saved_model.pb
/kaggle/input/experts-bert/tensorflow2/pubmed/2/assets/vocab.txt
/kaggle/input/experts-bert/tensorflow2/pubmed/2/variables/variables.index
/kaggle/input/experts-bert/tensorflow2/pubmed/2/variables/variables.data-00000-of-00001
/kaggle/input/llm-prompt-recovery-data-gemini-and-gemma/gemma_data_set_prompt_recover_1.csv
/kaggle/input/llm-prompt-recovery-data-gemini-and-gemma/gemma_data_set_prompt_recover_2.csv
/kaggle/input/llm-prompt-recovery-data-gemini-and-gemma/gemini_data_set_prompt_recover_3.csv
/kaggle/input/bert/tensorflow2/en-uncased-preprocess/3/saved_model.pb
/kaggle/input/bert/tensorflow2/en-uncased

In [185]:
pd.set_option('display.max_colwidth', None) # to display full content in a row 

# Read in Data
There are two files for the gemma model, and one file is already enough for me to do the analysis, so I read in its first file.  

In [198]:
gemini = pd.read_csv('/kaggle/input/llm-prompt-recovery-data-gemini-and-gemma/gemini_data_set_prompt_recover_3.csv')
gemma = pd.read_csv('/kaggle/input/llm-prompt-recovery-data-gemini-and-gemma/gemma_data_set_prompt_recover_1.csv')

In [188]:
# check columns 
gemini.columns

Index(['Model_used', 'original_text', 'prompt', 'rewritten_text'], dtype='object')

In [189]:
# check columns 
gemma.columns

Index(['original_text', 'prompt', 'rewrite_prompt', 'rewritten_text'], dtype='object')

In [111]:
# check dimension 
gemini.shape

(1802, 4)

In [112]:
# check dimension 
gemma.shape

(1000, 4)

# Text Processing 

### Assign the content to corresponding variables

In [199]:
# gemini

org_text = gemini.original_text # original text, every 5 rows are the same inputs 
prompt = gemini.prompt # required adjustments, each is different 
rew_text = gemini.rewritten_text # rewritten text, each is different  

In [200]:
# gemma

org_text_gemma = gemma.original_text # original text each is different 
prompt_gemma = gemma.prompt # required adjustments, each is different 
rew_text_gemma = gemma.rewritten_text # rewritten text, each is different  

### Process org_text, save as array in **org_text_list**

In [201]:
# gemini

org_text_list = [[] for _ in range(len(rew_text))]

for i in range(len(org_text)): 
    org_text[i] = org_text[i].replace("\n", " ")
    org_text[i] = org_text[i].replace("\\n", " ")
    org_text_list[i].append(org_text[i])

In [202]:
# gemma

org_text_list_gemma = [[] for _ in range(len(rew_text_gemma))]

for i in range(len(org_text_gemma)): 
    org_text_gemma[i] = org_text_gemma[i].replace("\n", " ")
    org_text_gemma[i] = org_text_gemma[i].replace("`` ", "")
    org_text_gemma[i] = org_text_gemma[i].replace(" ``", "")
    org_text_gemma[i] = org_text_gemma[i].replace("*", "")
    org_text_list_gemma[i].append(org_text_gemma[i])

In [203]:
# randomly check 
org_text_list[0]

['Memo  To: All Staff From: Keith Subject: Fish Tank Maintenance  Please be reminded to clean the fish tanks in the office every Friday. This ensures a healthy environment for our beloved fish. Thank you for your cooperation.']

In [204]:
# randomly check 
org_text_list_gemma[0]

["It'd been three days since Baldr Alwisson had seen signs of pursuit, yet he could n't stop. The damned snow had stopped falling, leaving his tracks painfully clear. Between that and the hounds, the soldiers could n't fail to catch him eventually. What he truly needed was a blizzard, but the pale skies offered not the promise of such.     It was his own fault, in truth. He should n't have grabbed the jeweled plaque from that temple, gold inlay and sapphires or no. The priests and parishioners were out for his life, he knew because the plaque told him. It was now giving Baldr directions to thwart those following. Somewhere up ahead, there was a cave that was always warm inside, a cave the army could n't hope to find him in.    ... stuck. Thought I had something for a second there, it escaped. "]

### Combine org_text and prompt as the X for regression, save as array in **X_list**

In [205]:
# gemini
X_list = []

for i in range(len(prompt)): 
    prompt_string = str(prompt[i]).split('. ')[1: ]
    X = org_text[i] + str(prompt_string) # combine org_text and prompt as the X for regression
    X = X.replace("['", " ")
    X = X.replace("']", "")
    X = X.replace("\n", "")
    X = np.array(X)
    X_list.append(X)

In [206]:
# gemma
X_list_gemma = []

for i in range(len(prompt_gemma)): 
    prompt_string = prompt_gemma[i]
    X = org_text_gemma[i] + str(prompt_string) # combine org_text and prompt as the X for regression
    X = X.replace("['", " ")
    X = X.replace("']", "")
    X = X.replace("\n", "")
    X = X.replace("`` ", "")
    X = np.array(X)
    X_list_gemma.append(X)

In [207]:
# randomly check 
X_list[5]

array('Hey Suzann! Just wanted to let you know that Grandma and Grandpa are coming over for dinner tonight. Can you pick up some dessert on your way home? Thanks! Rewrite the text to make it sound less demanding.',
      dtype='<U205')

In [208]:
# randomly check 
X_list_gemma[0]

array("It'd been three days since Baldr Alwisson had seen signs of pursuit, yet he could n't stop. The damned snow had stopped falling, leaving his tracks painfully clear. Between that and the hounds, the soldiers could n't fail to catch him eventually. What he truly needed was a blizzard, but the pale skies offered not the promise of such.     It was his own fault, in truth. He should n't have grabbed the jeweled plaque from that temple, gold inlay and sapphires or no. The priests and parishioners were out for his life, he knew because the plaque told him. It was now giving Baldr directions to thwart those following. Somewhere up ahead, there was a cave that was always warm inside, a cave the army could n't hope to find him in.    ... stuck. Thought I had something for a second there, it escaped. Rewrite this text in the style of a Victorian gentleman futuristic AI.",
      dtype='<U872')

### Process prompt, save as array in **prompt_list**

In [209]:
# gemini
prompt_list = []

for i in range(len(prompt)): 
    prompt_string = str(prompt[i]).split('. ')[1: ]
    prompt_list.append(prompt_string)

In [210]:
#gemma 
prompt_list_gemma = []

for i in range(len(prompt_gemma)): 
    prompt_string = prompt_gemma[i]
    prompt_list_gemma.append(prompt_string)

In [211]:
# randomly check 
prompt_list[0]

['Rewrite this paragraph to make it more engaging and fun.']

In [212]:
# randomly check 
prompt_list_gemma[3]

'Translate the essence of this text into a the Roaring Twenties narrative.'

### Process rew_text, save as array in **rew_text_list**

In [213]:
# gemini
rew_text_list = [[] for _ in range(len(rew_text))]

for i in range(len(rew_text)): 
    rew_text[i] = rew_text[i].replace("\n", "")
    rew_text_list[i].append(rew_text[i])

In [214]:
#gemma
rew_text_list_gemma = [[] for _ in range(len(rew_text_gemma))]

for i in range(len(rew_text_gemma)): 
    rew_text_gemma[i] = rew_text_gemma[i].replace("\n", "")
    rew_text_gemma[i] = rew_text_gemma[i].replace("\'", "'")
    rew_text_gemma[i] = rew_text_gemma[i].replace('*', "")
    rew_text_list_gemma[i].append(rew_text_gemma[i])

In [215]:
# randomly check 
rew_text_list[5]

["Hi Suzann,Just wanted to ask you a favor. My grandparents are visiting for dinner tonight, and it would be great if you could help me with dessert. If you could pick something up on your way home, I'd really appreciate it. Thanks!"]

In [216]:
# randomly check 
rew_text_list_gemma[0]

['Here is the text rewritten in the style of a Victorian gentleman futuristic AI:"My dear Baldr Alwisson, it has been three days since I have encountered the infernal signs of pursuit, yet I am, alas, unable to desist. The wretched snow has ceased its reign of fall, leaving my tracks bare and discernible. Between this infernal precipitation and the hounds of war, your capture is inevitable. What I truly require, my dear Baldr, is a blizzard, a tempest to mask my presence, but the pale heavens offer not the promise of such a windstorm.It is an admission of fault, I must say, to have partaken of the jeweled plaque from that temple of antiquity, adorned with gold inlay and sapphires. The priests and parishioners are clearly out for my blood, their wrath imminent. I know their intentions well, for the plaque bestowed upon me has revealed the path I must traverse in order to outsmart and out']

## 8 lists are prepared: 
Until here, I have prepared org_text_list, X_list, prompt_list, rew_text_list, for gemini and gemma, respectively. 

# Using pretrained model: google/experts-bert
This is a collection of such BERT "expert" models that were trained on a diversity of datasets and tasks to improve performance on downstream tasks like question answering, tasks that require natural language inference skills, NLP tasks in the medical text domain, and more.  

There are different specialties for different versions. For calculating semantic similarity, I use the wiki_books/qqp version:  
This model was initialized from the base Wikipedia + BooksCorpus BERT model and was fine-tuned on the Quora Question Pairs dataset (QQP), a dataset for the semantic similarity of question pairs.  

The reason I did semantic similarity analysis is that, I suppose that if **the semantic similarity between original text combined with prompt and rewritten text** is higher than **the semantic similarity between original text and rewritten text**, it indicates that the semantic meaning of prompt is well captured and added in the rewritten text. This could mean the model performs well. Although very simple, I want to investigate the performance of models through this way and see what I can find.  

More information can be found at https://www.kaggle.com/models/google/experts-bert/tensorFlow2/wiki-books-qqp.  

In [217]:
# !pip install --quiet tensorflow-text
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # Imports TF ops for preprocessing

# compute cosine similarity
from sklearn.metrics import pairwise

# load the BERT encoder and preprocessing models
preprocess = hub.load('https://kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3')
bert_qqp = hub.load('https://www.kaggle.com/models/google/experts-bert/TensorFlow2/wiki-books-qqp/2')
# /qqp: this model was initialized from the base Wikipedia + BooksCorpus BERT model and was fine-tuned on the Quora Question Pairs dataset (QQP), 
# a dataset for the semantic similarity of question pairs.

Attaching model 'tensorflow/bert/tensorflow2/en-uncased-preprocess/3' to your Kaggle notebook...
Attaching model 'google/experts-bert/tensorflow2/wiki-books-qqp/2' to your Kaggle notebook...


In [219]:
# since the number of rows of gemini and gemma are different, define a number of iteration  
num = 500

# Calculate Embeddings
For gemini and gemma, I construct two loops to iterate through `num` rows of data, respectively, and simultaneously calculate the cosine similarity between lists.  

In [220]:
# gemini

# create lists to record output embedding
pooled_output_org_list = []
sequence_output_org_list = []

pooled_output_prompt_list = []
sequence_output_prompt_list = []

pooled_output_X_list = []
sequence_output_X_list = []

pooled_output_rew_list = []
sequence_output_rew_list = []

# lists to record similarity between lists and rewritten 
cos_sim_list_org_rew = []
cos_sim_list_prompt_rew = []
cos_sim_list_X_rew = []


for i in range(num): 
    # ======================================
    # org 
    org_sentences = org_text_list[i]
    
    # convert the sentences to bert inputs
    bert_inputs_org = preprocess(org_sentences)
    
    # feed the inputs to the model to get the pooled and sequence outputs
    bert_outputs_org = bert_qqp(bert_inputs_org, training=False)
    
    # get the pooled embedding and sequence embedding 
    pooled_output_org = bert_outputs_org['pooled_output'] # sentence-level embedding = pooled embedding
    pooled_output_org_list.append(pooled_output_org)
    sequence_output_org = bert_outputs_org['sequence_output'] # token-level embedding = per-token embedding = sequence embedding 
    sequence_output_org_list.append(sequence_output_org)
    
    # ======================================
    # prompt
    prompt_sentences = prompt_list[i]
    
    # convert the sentences to bert inputs
    bert_inputs_prompt = preprocess(prompt_sentences)

    # feed the inputs to the model to get the pooled and sequence outputs
    bert_outputs_prompt = bert_qqp(bert_inputs_prompt, training=False)
    
    # get the pooled embedding and sequence embedding 
    pooled_output_prompt = bert_outputs_prompt['pooled_output'] # sentence-level embedding = pooled embedding
    pooled_output_prompt_list.append(pooled_output_prompt)
    sequence_output_prompt = bert_outputs_prompt['sequence_output'] # token-level embedding = per-token embedding = sequence embedding 
    sequence_output_prompt_list.append(sequence_output_prompt)
    
    # ======================================
    # X (org + prompt)
    X_sentences = [X_list[i].item()]

    # convert the sentences to bert inputs
    bert_inputs_X = preprocess(X_sentences)

    # feed the inputs to the model to get the pooled and sequence outputs
    bert_outputs_X = bert_qqp(bert_inputs_X, training=False)
    
    # get the pooled embedding and sequence embedding 
    pooled_output_X = bert_outputs_X['pooled_output'] # sentence-level embedding = pooled embedding
    pooled_output_X_list.append(pooled_output_X)
    sequence_output_X = bert_outputs_X['sequence_output'] # token-level embedding = per-token embedding = sequence embedding 
    sequence_output_X_list.append(sequence_output_X)
    
    # ======================================
    # rewritten
    rew_sentences = rew_text_list[i]
    
    # convert the sentences to bert inputs
    bert_inputs_rew = preprocess(rew_sentences)

    # feed the inputs to the model to get the pooled and sequence outputs
    bert_outputs_rew = bert_qqp(bert_inputs_rew, training=False)
    
    # get the pooled embedding and sequence embedding 
    pooled_output_rew = bert_outputs_rew['pooled_output'] # sentence-level embedding = pooled embedding
    pooled_output_rew_list.append(pooled_output_prompt)
    sequence_output_rew = bert_outputs_rew['sequence_output'] # token-level embedding = per-token embedding = sequence embedding 
    sequence_output_rew_list.append(sequence_output_prompt)
    
    # ======================================
    # ======================================
    # compute similarity 
    cos_sim = pairwise.cosine_similarity(pooled_output_org_list[i], pooled_output_rew_list[i])
    cos_sim_list_org_rew.append(cos_sim)
    
    cos_sim = pairwise.cosine_similarity(pooled_output_prompt_list[i], pooled_output_rew_list[i])
    cos_sim_list_prompt_rew.append(cos_sim)
    
    cos_sim = pairwise.cosine_similarity(pooled_output_X_list[i], pooled_output_rew_list[i])
    cos_sim_list_X_rew.append(cos_sim)

In [248]:
# similarity mean
print('gemini - mean of similarity between org, rew: ', cos_sim_list_org_rew[0:10])
print('\n')
print('gemini - mean of similarity between prompt, rew: ', cos_sim_list_prompt_rew[0:10])
print('\n')
print('gemini - mean of similarity between org+prompt, rew: ', cos_sim_list_X_rew[0:10])

gemini - mean of similarity between org, rew:  [array([[297.5181]], dtype=float32), array([[0.8752649]], dtype=float32), array([[0.9845989]], dtype=float32), array([[0.94869226]], dtype=float32), array([[0.96205]], dtype=float32), array([[0.9763001]], dtype=float32), array([[0.9561322]], dtype=float32), array([[0.15212032]], dtype=float32), array([[0.12496223]], dtype=float32), array([[0.4855021]], dtype=float32)]


gemini - mean of similarity between prompt, rew:  [array([[330.]], dtype=float32), array([[0.9999999]], dtype=float32), array([[1.0000001]], dtype=float32), array([[1.]], dtype=float32), array([[1.]], dtype=float32), array([[0.9999999]], dtype=float32), array([[1.]], dtype=float32), array([[0.9999998]], dtype=float32), array([[0.9999999]], dtype=float32), array([[1.]], dtype=float32)]


gemini - mean of similarity between org+prompt, rew:  [array([[299.89655]], dtype=float32), array([[0.8549682]], dtype=float32), array([[0.98939437]], dtype=float32), array([[0.8950499]], dt

In [259]:
np.std(cos_sim_list_org_rew[0:100])

29.51406

In [258]:
np.std(cos_sim_list_prompt_rew[0:100])

32.735077

In [260]:
np.std(cos_sim_list_X_rew[0:100])

29.75065

In [221]:
# gemma

# create lists to record output embedding
pooled_output_org_list_gemma = []
sequence_output_org_list_gemma = []

pooled_output_prompt_list_gemma = []
sequence_output_prompt_list_gemma = []

pooled_output_X_list_gemma = []
sequence_output_X_list_gemma = []

pooled_output_rew_list_gemma = []
sequence_output_rew_list_gemma = []

# lists to record similarity between lists and rewritten 
cos_sim_list_org_rew_gemma = []
cos_sim_list_prompt_rew_gemma = []
cos_sim_list_X_rew_gemma = []


for i in range(num): 
    # ======================================
    # org 
    org_sentences = org_text_list_gemma[i]
    
    # convert the sentences to bert inputs
    bert_inputs_org = preprocess(org_sentences)
    
    # feed the inputs to the model to get the pooled and sequence outputs
    bert_outputs_org_gemma = bert_qqp(bert_inputs_org, training=False)
    
    # get the pooled embedding and sequence embedding 
    pooled_output_org_gemma = bert_outputs_org_gemma['pooled_output'] # sentence-level embedding = pooled embedding
    pooled_output_org_list_gemma.append(pooled_output_org_gemma)
    sequence_output_org_gemma = bert_outputs_org_gemma['sequence_output'] # token-level embedding = per-token embedding = sequence embedding 
    sequence_output_org_list_gemma.append(sequence_output_org_gemma)
    
    # ======================================
    # prompt
    prompt_sentences = [prompt_list_gemma[i]]
    
    # convert the sentences to bert inputs
    bert_inputs_prompt = preprocess(prompt_sentences)

    # feed the inputs to the model to get the pooled and sequence outputs
    bert_outputs_prompt = bert_qqp(bert_inputs_prompt, training=False)
    
    # get the pooled embedding and sequence embedding 
    pooled_output_prompt = bert_outputs_prompt['pooled_output'] # sentence-level embedding = pooled embedding
    pooled_output_prompt_list_gemma.append(pooled_output_prompt)
    sequence_output_prompt = bert_outputs_prompt['sequence_output'] # token-level embedding = per-token embedding = sequence embedding 
    sequence_output_prompt_list_gemma.append(sequence_output_prompt)
    
    # ======================================
    # X (org + prompt)
    X_sentences = [X_list_gemma[i].item()]

    # convert the sentences to bert inputs
    bert_inputs_X = preprocess(X_sentences)

    # feed the inputs to the model to get the pooled and sequence outputs
    bert_outputs_X = bert_qqp(bert_inputs_X, training=False)
    
    # get the pooled embedding and sequence embedding 
    pooled_output_X = bert_outputs_X['pooled_output'] # sentence-level embedding = pooled embedding
    pooled_output_X_list_gemma.append(pooled_output_X)
    sequence_output_X = bert_outputs_X['sequence_output'] # token-level embedding = per-token embedding = sequence embedding 
    sequence_output_X_list_gemma.append(sequence_output_X)
    
    # ======================================
    # rewritten
    rew_sentences = rew_text_list_gemma[i]
    
    # convert the sentences to bert inputs
    bert_inputs_rew = preprocess(rew_sentences)

    # feed the inputs to the model to get the pooled and sequence outputs
    bert_outputs_rew = bert_qqp(bert_inputs_rew, training=False)
    
    # get the pooled embedding and sequence embedding 
    pooled_output_rew = bert_outputs_rew['pooled_output'] # sentence-level embedding = pooled embedding
    pooled_output_rew_list_gemma.append(pooled_output_prompt)
    sequence_output_rew = bert_outputs_rew['sequence_output'] # token-level embedding = per-token embedding = sequence embedding 
    sequence_output_rew_list_gemma.append(sequence_output_prompt)
    
    # ======================================
    # ======================================
    # compute similarity 
    cos_sim = pairwise.cosine_similarity(pooled_output_org_list_gemma[i], pooled_output_rew_list_gemma[i])
    cos_sim_list_org_rew_gemma.append(cos_sim)
    
    cos_sim = pairwise.cosine_similarity(pooled_output_prompt_list_gemma[i], pooled_output_rew_list_gemma[i])
    cos_sim_list_prompt_rew_gemma.append(cos_sim)
    
    cos_sim = pairwise.cosine_similarity(pooled_output_X_list_gemma[i], pooled_output_rew_list_gemma[i])
    cos_sim_list_X_rew_gemma.append(cos_sim)

In [249]:
# similarity mean
print('gemma - mean of similarity between org, rew: ', cos_sim_list_org_rew_gemma[0:10])
print('gemma - mean of similarity between prompt, rew: ', cos_sim_list_prompt_rew_gemma[0:10])
print('gemma - mean of similarity between org+prompt, rew: ', cos_sim_list_X_rew_gemma[0:10])

gemma - mean of similarity between org, rew:  [array([[0.77171415]], dtype=float32), array([[0.9896351]], dtype=float32), array([[0.9870037]], dtype=float32), array([[0.9735609]], dtype=float32), array([[0.99080646]], dtype=float32), array([[0.8313054]], dtype=float32), array([[0.8892137]], dtype=float32), array([[0.91462356]], dtype=float32), array([[0.9869556]], dtype=float32), array([[0.9705299]], dtype=float32)]
gemma - mean of similarity between prompt, rew:  [array([[1.0000001]], dtype=float32), array([[1.]], dtype=float32), array([[1.]], dtype=float32), array([[0.9999998]], dtype=float32), array([[0.99999976]], dtype=float32), array([[0.99999994]], dtype=float32), array([[0.99999976]], dtype=float32), array([[0.9999999]], dtype=float32), array([[0.9999999]], dtype=float32), array([[1.]], dtype=float32)]
gemma - mean of similarity between org+prompt, rew:  [array([[0.7717141]], dtype=float32), array([[0.9896353]], dtype=float32), array([[0.98700386]], dtype=float32), array([[0.97

In [257]:
np.std(cos_sim_list_org_rew_gemma[0:100])

0.11395453

In [256]:
np.std(cos_sim_list_prompt_rew_gemma[0:100])

1.4781472e-07

In [255]:
np.std(cos_sim_list_prompt_rew_gemma[0:100])

1.4781472e-07

# Conclusion 

From the result above, there are not much difference in mean of similarity.   
However, in terms of standard deviation, gemma model has an overall smaller sd, indicating more stable performance. 