Persona Hub generates sample personas from documents in the RedPajama v2 dataset: the authors give GPT-4o a document from the dataset together with the following prompt:

*Who is likely to read/like/dislike the following text? Be very detailed in your description of this persona*

They then use those personas to generate 'diverse' instructions. **But are the synthesized instructions actually that diverse?** This notebook examines that question by computing text diversity metrics on several instruction datasets.
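As a rough illustration of the persona-elicitation step described above, the sketch below assembles the prompt for one document. Only the instruction text comes from the description; the names `PERSONA_INSTRUCTION` and `build_persona_prompt`, the sample document, and the way the document is appended after the instruction are assumptions made for illustration, and the call to the model itself is left out.

```python
# The instruction wording is quoted from the description above. Everything
# else (names, separator, sample document) is a hypothetical illustration.
PERSONA_INSTRUCTION = (
    "Who is likely to read/like/dislike the following text? "
    "Be very detailed in your description of this persona"
)


def build_persona_prompt(document: str) -> str:
    """Combine the persona instruction with one source document."""
    return f"{PERSONA_INSTRUCTION}\n\n{document.strip()}"


prompt = build_persona_prompt("A beginner's guide to sourdough baking ...")
print(prompt)
```

The returned string would then be sent to the language model, one document at a time, to obtain one persona description per document.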

In [1]:
import pandas as pd
import random
import numpy as np
from tqdm import tqdm
import ipdb
import re
from datasets import load_dataset
import json

import matplotlib.pyplot as plt
# import mplcursors
import seaborn as sns
%matplotlib inline
sns.set(style='darkgrid', context='notebook', rc={'figure.figsize':(14,10)}, font_scale=2)

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('chained_assignment',None)

# Set random seeds so that sampling is reproducible on a given machine
random.seed(1)
np.random.seed(1)
np.set_printoptions(precision=3)

## Persona hub synthesized instructions versus Dolly

In [2]:
from datasets import load_dataset
import numpy as np

In [6]:
personahub = load_dataset("proj-persona/PersonaHub", "instruction")['train'].to_pandas()

In [6]:
from diversity import compression_ratio, homogenization_score, ngram_diversity_score, extract_patterns, get_pos, pos_patterns, token_patterns, self_repetition_score

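def diversity_metrics(sents):
    """Print four diversity metrics for a list of strings.

    Higher CR, CR-POS, and self-repetition indicate more redundancy (less
    diversity); a higher NDS indicates more diverse text.
    """
    # Compression ratio of the raw text
    cr = compression_ratio(sents, 'gzip')
    print(f"CR: {np.round(cr,2)}")

    # Compression ratio of the part-of-speech tag sequences (syntactic redundancy)
    joined_pos, _ = get_pos(sents)
    cr_pos = compression_ratio(joined_pos, 'gzip')
    print(f"CR-POS: {np.round(cr_pos,2)}")

    # N-gram diversity score over n-grams up to length 4
    nds = ngram_diversity_score(sents, 4)
    print(f"NDS: {np.round(nds,2)}")

    # Self-repetition score: how often n-grams recur across texts
    srep = self_repetition_score(sents, verbose=True)
    print(f"Self-rep:{np.round(srep,2)}")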
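To make the direction of two of these metrics concrete, here is a minimal standard-library sketch of a compression ratio and an n-gram diversity score. It is not the `diversity` package's implementation: the function names `sketch_compression_ratio` and `sketch_ngram_diversity`, the whitespace tokenization, and the toy inputs are all assumptions chosen for illustration, so the absolute values will not match the numbers reported in this notebook.

```python
import gzip


def sketch_compression_ratio(texts):
    """Raw size divided by gzip-compressed size; higher means more redundancy."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


def sketch_ngram_diversity(texts, max_n=4):
    """Sum over n = 1..max_n of unique n-grams divided by total n-grams."""
    tokens = " ".join(texts).split()  # naive whitespace tokenization (assumption)
    score = 0.0
    for n in range(1, max_n + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:
            score += len(set(ngrams)) / len(ngrams)
    return score


# Toy data: one instruction repeated verbatim versus lightly varied instructions.
repetitive = ["Write a short story about a cat."] * 50
varied = [f"Task {i}: describe item number {i * 7} in your own words." for i in range(50)]

for name, texts in [("repetitive", repetitive), ("varied", varied)]:
    cr = sketch_compression_ratio(texts)
    nds = sketch_ngram_diversity(texts)
    print(f"{name}: CR={cr:.2f} NDS={nds:.2f}")
```

On these toy inputs the repeated instruction compresses far better (higher CR) and has fewer unique n-grams (lower NDS) than the varied set, which is the pattern to look for when comparing datasets below.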

In [8]:
dolly = load_dataset("databricks/databricks-dolly-15k")["train"].to_pandas()
# dolly = dolly[dolly.category.isin(['brainstorming', 'open_qa', 'creative_writing'])]
# data = dolly.instruction.sample(500, random_state=1).values.tolist()

In [9]:
dolly.shape

(15011, 4)

In [10]:
personahub.shape

(50000, 3)

In [9]:
diversity_metrics(dolly.instruction.values.tolist())

CR: 2.58
CR-POS: 5.82
NDS: 2.33
Self-rep:1.95

In [13]:
diversity_metrics(personahub['synthesized text'].sample(15000).values.tolist())

CR: 2.84
CR-POS: 6.2
NDS: 2.0
Self-rep:5.13

In [11]:
nr = load_dataset("HuggingFaceH4/no_robots")['train'].to_pandas()
nr.shape

(9500, 4)

In [15]:
np.median([len(x) for x in nr['prompt'].values.tolist()])

np.float64(157.0)

In [16]:
diversity_metrics(nr['prompt'].values.tolist())

CR: 2.47
CR-POS: 5.13
NDS: 2.44
Self-rep:2.62

In [12]:
tulu = load_dataset("allenai/tulu-3-sft-personas-instruction-following")['train'].to_pandas()
tulu.shape

(29980, 4)

In [18]:
diversity_metrics(tulu['prompt'].values.tolist())

CR: 3.2
CR-POS: 6.3
NDS: 1.51
Self-rep:6.79

## Math

In [13]:
gsm8k = load_dataset("openai/gsm8k", "main")['train'].to_pandas()
gsm8k.shape

(7473, 2)

In [31]:
diversity_metrics(gsm8k['question'].values.tolist())

CR: 2.87
CR-POS: 5.94
NDS: 2.19
Self-rep:2.88

In [24]:
tulu_math = load_dataset("allenai/tulu-3-sft-personas-math-grade")['train'].to_pandas()
tulu_math.shape

(49980, 3)

In [25]:
diversity_metrics(tulu_math['prompt'].sample(10000).values.tolist())

CR: 3.11
CR-POS: 6.26
NDS: 1.87
Self-rep:5.68

In [32]:
persona_math = load_dataset("proj-persona/PersonaHub", "math")['train'].to_pandas()
persona_math.shape

(50000, 3)

In [33]:
diversity_metrics(persona_math['synthesized text'].sample(10000).values.tolist())

CR: 3.11
CR-POS: 5.92
NDS: 1.75
Self-rep:6.38
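For a side-by-side view, the sketch below collects the scores printed in this notebook into one table. The numbers are copied from the outputs above; the short dataset labels and the `persona_based` flag (marking the PersonaHub and Tulu 3 persona datasets) are labels added here for illustration. Several rows come from random samples, so a rerun may give slightly different values.

```python
# (label, persona_based, CR, CR-POS, NDS, Self-rep), copied from the outputs above.
results = [
    ("dolly", False, 2.58, 5.82, 2.33, 1.95),
    ("no_robots", False, 2.47, 5.13, 2.44, 2.62),
    ("gsm8k", False, 2.87, 5.94, 2.19, 2.88),
    ("personahub-instruction", True, 2.84, 6.2, 2.0, 5.13),
    ("tulu-3-personas-if", True, 3.2, 6.3, 1.51, 6.79),
    ("tulu-3-personas-math-grade", True, 3.11, 6.26, 1.87, 5.68),
    ("personahub-math", True, 3.11, 5.92, 1.75, 6.38),
]

header = f"{'dataset':<28}{'CR':>6}{'CR-POS':>8}{'NDS':>6}{'Self-rep':>10}"
print(header)
# Sort by self-repetition so the most repetitive datasets appear last.
for name, persona_based, cr, cr_pos, nds, srep in sorted(results, key=lambda r: r[5]):
    label = name + (" *" if persona_based else "")
    print(f"{label:<28}{cr:>6}{cr_pos:>8}{nds:>6}{srep:>10}")
print("* persona-based synthetic dataset")

# Range summaries for the two groups.
max_other_srep = max(r[5] for r in results if not r[1])
min_persona_srep = min(r[5] for r in results if r[1])
min_other_nds = min(r[4] for r in results if not r[1])
max_persona_nds = max(r[4] for r in results if r[1])
```

In these runs every persona-based dataset has a higher self-repetition score and a lower NDS than every other dataset measured, while the CR values overlap (GSM8K at 2.87 versus the PersonaHub instructions at 2.84).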



