PersonaHub generates personas from documents in the RedPajama v2 dataset: the authors give GPT-4o a document from the dataset together with the following prompt:

*Who is likely to read/like/dislike the following text? Be very detailed in your description of this persona*

They then use those personas to generate 'diverse' instructions. **But are the synthesized instructions actually that diverse?** This notebook analyzes that question.
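As a rough illustration of the persona-generation step described above, the quoted question can be wrapped around a source document as shown here. The function name `build_persona_prompt`, the template layout, and the sample document are assumptions made for this sketch; PersonaHub's actual pipeline may format the request differently.

```python
# Hypothetical helper: the question text is quoted from above, the layout is assumed.
PERSONA_QUESTION = (
    "Who is likely to read/like/dislike the following text? "
    "Be very detailed in your description of this persona"
)

def build_persona_prompt(document: str) -> str:
    """Return the question followed by the (whitespace-trimmed) source document."""
    return f"{PERSONA_QUESTION}\n\nText:\n{document.strip()}"

prompt = build_persona_prompt("  Seamounts are underwater volcanoes on the ocean floor.  ")
print(prompt)
```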

In [1]:
import pandas as pd
import random
import numpy as np
from tqdm import tqdm
import ipdb
import re
from datasets import load_dataset
from diversity import compression_ratio, homogenization_score, ngram_diversity_score, extract_patterns, get_pos, pos_patterns, token_patterns
import json

import matplotlib.pyplot as plt
# import mplcursors
import seaborn as sns
%matplotlib inline
sns.set(style='darkgrid', context='notebook', rc={'figure.figsize':(14,10)}, font_scale=2)

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('chained_assignment',None)

# Set random seeds for reproducibility on a specific machine
random.seed(1)
np.random.seed(1)
np.set_printoptions(precision=3)

In [2]:
df = pd.read_json("hf://datasets/proj-persona/PersonaHub/knowledge.jsonl", lines=True)

In [3]:
df.shape

(10000, 3)

In [11]:
# The first paragraph is "Title: ..."; drop the 7-character "Title: " prefix
df['title'] = df['synthesized text'].apply(lambda x: x.split('\n\n')[0][7:])
df.head()

Unnamed: 0,input persona,synthesized text,description,title
0,A high school teacher of social studies preparing lessons on the development of political parties in the United States.,"Title: Unpacking the Evolution of Political Parties in the United States: A Guide for High School Students and Teachers\n\nAs a high school social studies teacher, I've often found that teaching the development of political parties in the United States can be a daunting task. The complex web of ideologies, factions, and historical events can be overwhelming for students to grasp. However, understanding the evolution of political parties is crucial for students to appreciate the intricacies of American politics and civic engagement. In this article, I'll provide a comprehensive guide to help teachers and students navigate the development of political parties in the United States.\n\n**The Founding Era: The Emergence of Factions (1780s-1820s)**\n\nThe United States' first political parties emerged during the Founding Era, with the Federalists and Anti-Federalists taking center stage. The Federalists, led by Alexander Hamilton, advocated for a strong central government, while the Anti-Federalists, led by Thomas Jefferson, preferred a more limited federal authority. This initial factionalism laid the groundwork for the development of modern political parties.\n\n**The First Party System: Federalists and Democratic-Republicans (1790s-1820s)**\n\nThe First Party System saw the rise of the Federalist Party and the Democratic-Republican Party. The Federalists, who dominated the early years of the republic, supported a strong central bank, tariffs, and a loose interpretation of the Constitution. In contrast, the Democratic-Republicans, who eventually became the dominant party, advocated for states' rights, limited government, and agrarian interests.\n\n**The Second Party System: Whigs and Democrats (1830s-1850s)**\n\nThe Second Party System witnessed the emergence of the Whig Party and the continued dominance of the Democratic Party. 
The Whigs, led by Henry Clay, supported economic development, internal improvements, and a strong federal government. The Democrats, led by Andrew Jackson, championed states' rights, limited government, and the interests of the common man.\n\n**The Civil War and the Rise of the Republican Party (1850s-1860s)**\n\nThe Kansas-Nebraska Act (1854) and the Dred Scott decision (1857) polarized the nation, leading to the formation of the Republican Party. The Republicans, who opposed the expansion of slavery, eventually became the dominant party in the North. The Democratic Party, which had previously been the dominant party, began to fracture along sectional lines.\n\n**The Third Party System: Republicans and Democrats (1870s-1890s)**\n\nThe Third Party System saw the Republicans and Democrats solidify their positions as the two major parties in the United States. The Republicans, who dominated the North, supported business interests, tariffs, and civil rights. The Democrats, who dominated the South, advocated for states' rights, limited government, and agrarian interests.\n\n**The Fourth Party System: The Progressive Era and the Rise of Interest Groups (1890s-1930s)**\n\nThe Fourth Party System witnessed the rise of progressive reform and the emergence of interest groups. The Republicans, who dominated the era, supported business regulation, social reform, and internationalism. The Democrats, who began to rebrand themselves as the party of the common man, advocated for labor rights, agricultural interests, and limited government.\n\n**The Fifth Party System: The New Deal and the Rise of Ideological Parties (1930s-1980s)**\n\nThe Fifth Party System saw the Democrats, under Franklin D. Roosevelt, implement a series of reforms known as the New Deal. The Republicans, who opposed the New Deal, began to coalesce around a more conservative ideology. 
This era marked the beginning of the modern ideological divide between the two parties.\n\n**The Sixth Party System: The Era of Polarization (1990s-Present)**\n\nThe Sixth Party System has been characterized by increasing polarization and ideological entrenchment. The Democrats have become more liberal, while the Republicans have become more conservative. This era has seen the rise of partisan gridlock, gerrymandering, and the erosion of civil discourse.\n\n**Teaching Strategies and Resources**\n\nTo help students grasp the development of political parties in the United States, I recommend the following strategies and resources:\n\n1. **Timelines**: Create a timeline with students to visualize the evolution of political parties and their key events.\n2. **Primary Sources**: Use primary sources, such as party platforms, speeches, and newspapers, to provide students with a firsthand account of historical events.\n3. **Debates**: Host debates or mock conventions to engage students in the ideologies and issues of each party system.\n4. **Interactive Maps**: Utilize interactive maps to illustrate the geographic distribution of political parties and their electoral successes.\n5. **Graphic Organizers**: Use graphic organizers to help students organize and compare the key features of each party system.\n\nBy following this guide, teachers and students can gain a deeper understanding of the complex and fascinating development of political parties in the United States. As we navigate the complexities of modern American politics, it is essential that we appreciate the historical context that has shaped our political landscape.",knowledge-rich text synthesis,Unpacking the Evolution of Political Parties in the United States: A Guide for High School Students and Teachers
1,A geologist studying sunken volcanoes and the geological history of Pacific islands.,"Title: Unveiling the Secrets of Sunken Volcanoes: A Window into the Geological History of Pacific Islands\n\nAs a geologist specializing in sunken volcanoes and the geological history of Pacific islands, I'm thrilled to share my insights on this fascinating topic. The Pacific Ocean is home to numerous sunken volcanoes, many of which remain unexplored and shrouded in mystery. In this article, I'll delve into the world of sunken volcanoes, exploring their formation, characteristics, and the valuable information they provide about the geological history of Pacific islands.\n\n**What are Sunken Volcanoes?**\n\nSunken volcanoes, also known as seamounts, are underwater volcanoes that have formed on the ocean floor. They can be found in various parts of the world, but the Pacific Ocean is particularly renowned for its high concentration of these geological wonders. Sunken volcanoes can be hundreds or even thousands of meters tall, with some rising above the ocean surface to form islands.\n\n**Formation of Sunken Volcanoes**\n\nThe formation of sunken volcanoes is closely tied to plate tectonics and the movement of the Earth's crust. As tectonic plates move, they interact with each other, resulting in the creation of volcanic hotspots. At these hotspots, magma from the Earth's mantle rises to the surface, producing volcanic eruptions. Over time, the eruptions build up a volcanic cone, which can eventually become a sunken volcano.\n\n**Characteristics of Sunken Volcanoes**\n\nSunken volcanoes exhibit unique characteristics that distinguish them from their terrestrial counterparts. Some of the key features include:\n\n1. **Shape**: Sunken volcanoes often have a gently sloping shape, which is influenced by the weight of the overlying water and the rate of sedimentation.\n2. **Size**: Sunken volcanoes can be enormous, with some exceeding 1,000 km² in area.\n3. 
**Composition**: The rocks that make up sunken volcanoes are typically basaltic in composition, which is indicative of their origin from the Earth's mantle.\n4. **Hydrothermal Activity**: Sunken volcanoes are often characterized by hydrothermal activity, which involves the interaction of seawater with hot rocks, producing mineral-rich fluids and unique ecosystems.\n\n**What Do Sunken Volcanoes Reveal About Pacific Island Geology?**\n\nSunken volcanoes provide a unique window into the geological history of Pacific islands. By studying these underwater volcanoes, scientists can gain insights into:\n\n1. **Island Formation**: Sunken volcanoes help us understand the processes that formed Pacific islands, including the role of volcanic activity and tectonic plate movement.\n2. **Magma Evolution**: The rocks from sunken volcanoes offer clues about the evolution of magma chambers and the geochemical processes that shape the Earth's crust.\n3. **Climate Change**: Sunken volcanoes can provide valuable information about past climate conditions, including changes in sea level and ocean chemistry.\n4. **Biodiversity**: The unique ecosystems surrounding sunken volcanoes support a diverse range of marine life, which can inform our understanding of the evolution of life on Earth.\n\n**Conclusion**\n\nSunken volcanoes are fascinating geological features that offer a glimpse into the dynamic and complex processes that shape our planet. By exploring these underwater wonders, we can gain a deeper understanding of the geological history of Pacific islands and the Earth's crust as a whole. As a geologist, I'm excited to continue uncovering the secrets of sunken volcanoes and sharing my findings with the scientific community and the public alike.",knowledge-rich text synthesis,Unveiling the Secrets of Sunken Volcanoes: A Window into the Geological History of Pacific Islands
2,A manuscript studies scholar or paleographer with expertise in early medieval European texts and handwriting. This expert would be intrigued by the ninth-century manuscript that contains copies of Theuthild's letters.,"Title: Unraveling the Secrets of the Ninth-Century Manuscript: A Paleographer's Insights into Theuthild's Letters\n\nAs a paleographer specializing in early medieval European texts and handwriting, I am thrilled to share my expertise on a fascinating ninth-century manuscript that contains copies of Theuthild's letters. This rare and valuable artifact offers a unique window into the literary and cultural landscape of the early Middle Ages. In this article, I will delve into the significance of this manuscript, its historical context, and the paleographic features that make it a treasure trove for scholars.\n\n**The Historical Context of Theuthild's Letters**\n\nTheuthild, a Frankish abbess, lived during the Carolingian Renaissance (8th-9th centuries), a period marked by a resurgence of interest in classical learning and a flourishing of literary and artistic production. Her letters, likely written in the late 8th or early 9th century, provide valuable insights into the intellectual and spiritual currents of the time. Theuthild's correspondence with prominent figures, including bishops and monks, offers a glimpse into the networks of scholars, clergy, and nobility that shaped the cultural and political landscape of early medieval Europe.\n\n**The Manuscript: A Paleographic Analysis**\n\nThe ninth-century manuscript containing Theuthild's letters is a remarkable example of Carolingian book production. Written in a clear and elegant script, the manuscript showcases the characteristic features of Carolingian minuscule, a style that emerged during the reign of Charlemagne. The script is notable for its:\n\n1. **Rounded letterforms**: The letters are written with a rounded, cursive quality, reflecting the influence of ancient Roman scripts.\n2. 
**Consistent spacing**: The text is arranged in a consistent, well-spaced layout, making it easier to read and understand.\n3. **Decorative initials**: The manuscript features ornate, zoomorphic initials, typical of Carolingian illumination, which add to the overall aesthetic appeal of the text.\n\n**Paleographic Clues to the Manuscript's Provenance**\n\nA closer examination of the manuscript reveals several paleographic clues that shed light on its origins and transmission. For instance:\n\n1. **Scriptorial habits**: The scribe's handwriting, particularly the way they form certain letters, suggests a connection to the scriptorium of a specific monastery or abbey.\n2. **Correcting hands**: The presence of correcting hands, evident in the manuscript's marginalia and interlinear corrections, indicates that the text was carefully reviewed and edited over time.\n3. **Ink and parchment quality**: The quality of the ink and parchment used in the manuscript provides insight into the resources and craftsmanship available to the scribe and their patron.\n\n**Conclusion**\n\nThe ninth-century manuscript containing Theuthild's letters is a precious artifact that offers a unique glimpse into the literary, cultural, and artistic achievements of the early Middle Ages. As a paleographer, I am struck by the manuscript's paleographic features, which provide a window into the world of Carolingian book production and the networks of scholars, clergy, and nobility that shaped early medieval Europe. This manuscript is a testament to the enduring power of written communication and the importance of preserving our cultural heritage for future generations.",knowledge-rich text synthesis,Unraveling the Secrets of the Ninth-Century Manuscript: A Paleographer's Insights into Theuthild's Letters
3,"I'm a legal scholar specializing in attorney conduct, courtroom behavior, and the First Amendment implications of disciplinary actions against lawyers. I'm interested in analyzing the Snyder case and its impact on attorneys' freedom of speech in their communications with court personnel.","Title: The Snyder Case: A Landmark Ruling on Attorneys' Freedom of Speech in Communications with Court Personnel\n\nAs a legal scholar specializing in attorney conduct, courtroom behavior, and the First Amendment implications of disciplinary actions against lawyers, I am excited to delve into the significance of the Snyder case and its far-reaching impact on attorneys' freedom of speech in their communications with court personnel.\n\nIntroduction\n\nThe Snyder case, also known as Snyder v. Phelps (2011), is a landmark Supreme Court decision that has had a profound influence on the legal profession. While the case primarily revolves around the Westboro Baptist Church's controversial protests, its implications extend to the realm of attorney speech and the boundaries of permissible expression in communications with court personnel. In this article, we will explore the Snyder case, its key holdings, and the subsequent impact on attorneys' freedom of speech.\n\nBackground of the Snyder Case\n\nThe Snyder case originated from a highly publicized incident in which the Westboro Baptist Church, known for its anti-LGBTQ+ and anti-military views, picketed the funeral of Marine Lance Corporal Matthew Snyder, who died in Iraq in 2006. 
The church's protest, led by Fred Phelps, included signs with inflammatory messages, such as ""Thank God for Dead Soldiers"" and ""You're Going to Hell."" The Snyder family sued Phelps and the church, alleging invasion of privacy, emotional distress, and defamation.\n\nThe Supreme Court's Ruling\n\nIn a unanimous decision, the Supreme Court held that the Westboro Baptist Church's protests were protected by the First Amendment, as they constituted speech on a matter of public concern. The Court emphasized that the protests, although offensive and hurtful, did not target a private individual or entity, but rather addressed a broader public issue. The ruling established that speech on public concerns, even if offensive or provocative, is entitled to robust protection under the First Amendment.\n\nImplications for Attorneys' Freedom of Speech\n\nWhile the Snyder case did not directly involve attorneys, its implications for the legal profession are significant. The ruling has been cited in various cases involving attorney speech, particularly in the context of communications with court personnel. Here are some key takeaways:\n\n1. **Protection for Criticism of Judicial Decisions**: The Snyder case reinforces the notion that attorneys have a constitutional right to criticize judicial decisions, even if their language is strong or provocative. This protection extends to written and oral communications with court personnel, including judges, clerks, and other staff members.\n2. **Boundaries of Permissible Expression**: The ruling clarifies that attorneys' speech is not categorically exempt from First Amendment protection simply because it is directed at court personnel. However, the Court also emphasized that speech that targets individual judges or court staff, rather than addressing a broader public concern, may not receive the same level of protection.\n3. 
**Disciplinary Actions and the First Amendment**: The Snyder case has implications for disciplinary actions taken against attorneys for their speech. While attorney discipline is generally subject to a more relaxed standard of review, the First Amendment still applies. Disciplinary bodies must carefully consider the content and context of an attorney's speech to ensure that it does not infringe upon their constitutional rights.\n4. **Ethical Considerations**: The ruling serves as a reminder that attorneys must balance their right to free speech with their ethical obligations to maintain a respectful and professional demeanor in their interactions with court personnel. While attorneys are entitled to express themselves on matters of public concern, they must avoid language that is abusive, threatening, or harassing.\n\nConclusion\n\nThe Snyder case has had a profound impact on the legal profession, particularly in the realm of attorney speech and the First Amendment. By understanding the implications of this landmark ruling, attorneys can better navigate the complexities of communicating with court personnel while exercising their constitutional right to free speech. As the legal profession continues to evolve, it is essential to remain mindful of the delicate balance between protecting attorneys' freedom of expression and maintaining the integrity of the judicial process.",knowledge-rich text synthesis,The Snyder Case: A Landmark Ruling on Attorneys' Freedom of Speech in Communications with Court Personnel
4,"A linguist specialized in the Khoekhoe language, also known as naq, with interest in endangered languages and language documentation.","Title: Preserving the Richness of Khoekhoe: Strategies for Documenting and Revitalizing an Endangered Language\n\nAs a linguist specializing in the Khoekhoe language, also known as Nama or Naq, I have dedicated my career to understanding and preserving this unique and valuable part of human cultural heritage. Spoken in southern Africa, Khoekhoe is one of the many endangered languages facing the threat of extinction. In this article, I will share my insights and expertise on the importance of language documentation and revitalization, with a focus on Khoekhoe, and provide strategies for preserving this rich linguistic tradition.\n\n**The Importance of Language Documentation**\n\nLanguage documentation is the process of recording and analyzing the linguistic structure, vocabulary, and cultural context of a language. In the case of Khoekhoe, documentation is crucial for several reasons:\n\n1. **Preservation of Cultural Heritage**: Khoekhoe is an integral part of the cultural identity of the Khoekhoe people, and its loss would result in the erosion of their cultural heritage.\n2. **Language Revitalization**: Documentation provides a foundation for language revitalization efforts, allowing for the development of language education programs, language planning, and community engagement.\n3. **Linguistic Diversity**: Khoekhoe is a member of the Tuu language family, which is distinct from other language families in Africa. Its documentation contributes to our understanding of linguistic diversity and the complexity of human language.\n\n**Challenges in Documenting Khoekhoe**\n\nDespite its importance, documenting Khoekhoe faces several challenges:\n\n1. **Limited Speaker Base**: Khoekhoe is spoken by a small number of elderly speakers, making it essential to work with them while they are still able to share their knowledge.\n2. 
**Lack of Written Resources**: Khoekhoe has limited written resources, making it difficult to develop language education materials and dictionaries.\n3. **Language Shift**: Many young Khoekhoe people are shifting to more dominant languages, such as Afrikaans or English, leading to a decline in the use of Khoekhoe in daily life.\n\n**Strategies for Documenting and Revitalizing Khoekhoe**\n\nTo address these challenges, I recommend the following strategies:\n\n1. **Community-Based Language Documentation**: Collaborate with Khoekhoe speakers and communities to develop language documentation projects that prioritize their needs and perspectives.\n2. **Multimedia Documentation**: Use audio and video recordings, as well as photographs, to capture the linguistic and cultural context of Khoekhoe.\n3. **Language Education and Literacy**: Develop language education programs and materials that promote literacy in Khoekhoe, targeting both children and adults.\n4. **Language Planning**: Work with Khoekhoe communities to develop language plans that promote the use of Khoekhoe in official domains, such as education and government.\n5. **Digital Language Resources**: Create digital language resources, such as online dictionaries and language learning apps, to increase access to Khoekhoe language and cultural materials.\n\n**Conclusion**\n\nThe documentation and revitalization of Khoekhoe are urgent tasks that require collaboration between linguists, community members, and policymakers. By working together, we can preserve the richness of Khoekhoe and ensure its continued relevance in the lives of its speakers. As a linguist, I am committed to contributing to this effort and hope that this article will inspire others to join me in this important work.",knowledge-rich text synthesis,Preserving the Richness of Khoekhoe: Strategies for Documenting and Revitalizing an Endangered Language


# How diverse are the titles?

In [18]:
titles = df['title'].values.tolist()
cr = compression_ratio(titles, 'gzip')
# hs = homogenization_score(titles, 'rougel')
# hs = homogenization_score(titles, 'bertscore') 
nds = ngram_diversity_score(titles, 4)

print(cr, nds)

3.238 2.204


Let's compare it to a standard dataset, perhaps CNN/DailyMail?

The essays from Visakh/He He's paper showing that GPT-2 text is homogeneous might be a good start, but that dataset is too narrow and not big enough.

In [20]:
events = []
with open('../data/essays_solo.jsonl', 'r') as f:
    for event in f:
        events.append(json.loads(event))

In [23]:
essay_titles = [x['title'] for x in events]
cr = compression_ratio(essay_titles, 'gzip')
# hs = homogenization_score(titles, 'rougel')
# hs = homogenization_score(titles, 'bertscore') 
nds = ngram_diversity_score(essay_titles, 4)

print(cr, nds)

11.49 0.437


# How diverse are the generated knowledge base articles?

In [25]:
def clean_up(text):
    """Drop the leading 'Title: ...' paragraph and join the remaining paragraphs."""
    return " ".join(text.split('\n\n')[1:])

df['content'] = df['synthesized text'].apply(clean_up)

In [28]:
contents = df['content'].values.tolist()
cr = compression_ratio(contents, 'gzip')
# hs = homogenization_score(data_example, 'rougel')
# hs = homogenization_score(data_example, 'bertscore') 
nds = ngram_diversity_score(contents, 4)

print(cr, nds)

3.06 1.72


In [30]:
essay_bodies = [x['essay'] for x in events]
cr = compression_ratio(essay_bodies, 'gzip')
# hs = homogenization_score(data_example, 'rougel')
# hs = homogenization_score(data_example, 'bertscore') 
nds = ngram_diversity_score(essay_bodies, 4)

print(cr, nds)

2.901 2.738


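Compared to the human-written essays, the PersonaHub contents compress more easily (CR 3.06 vs. 2.90) and have a lower n-gram diversity score (1.72 vs. 2.74). Both differences point to more repetition in the synthesized text.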
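To make the two metrics concrete, here is a minimal sketch of how a gzip compression ratio and an n-gram diversity score can be computed. It assumes CR is the raw byte size divided by the gzip-compressed size, and NDS is the sum over n = 1..4 of unique n-grams divided by total n-grams, with whitespace tokenization. The function names are hypothetical, and the `diversity` package may differ in tokenization and other details.

```python
import gzip

def compression_ratio_sketch(texts):
    """Raw byte size of the joined texts divided by their gzip-compressed size."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def ngram_diversity_sketch(texts, max_n=4):
    """Sum over n = 1..max_n of (unique n-grams / total n-grams), whitespace tokens."""
    tokens = " ".join(texts).split()
    score = 0.0
    for n in range(1, max_n + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:
            score += len(set(ngrams)) / len(ngrams)
    return score

# Toy corpora (made up for this sketch): one highly templated, one more varied.
repetitive = ["In this article, I'll share my insights on topic A."] * 50
varied = [
    f"Note {i}: item {i * 7919 % 1013} differs from item {i * 104729 % 997}."
    for i in range(50)
]

print(compression_ratio_sketch(repetitive), ngram_diversity_sketch(repetitive))
print(compression_ratio_sketch(varied), ngram_diversity_sketch(varied))
```

Under these definitions a higher CR means more redundancy and a higher NDS means less, so the two metrics move in opposite directions on repetitive text.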

**What about POS CR?**

In [32]:
synt_patterns = token_patterns(contents, 5)

In [34]:
synt_patterns

[("In this article, I'll share", 3221),
 ("I've had the privilege of", 2548),
 ("this article, I'll share my", 2483),
 ("this article, I'll delve into", 1510),
 ("In this article, I'll delve", 1499),
 ('In this article, I will', 1485),
 ("article, I'll delve into the", 1404),
 ("article, I'll share my insights", 1254),
 ('I hope this article has', 1234),
 ('a deeper understanding of the', 1209)]
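The list above shows the most frequent 5-token sequences in the generated articles. A minimal sketch of how such counts could be produced is given below, assuming plain whitespace tokenization; `top_token_ngrams` is a hypothetical helper, and the `token_patterns` function from the `diversity` package may tokenize and count differently.

```python
from collections import Counter

def top_token_ngrams(texts, n=5, top_k=10):
    """Count every n-token window in each text and return the top_k most frequent."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(top_k)

# Toy documents (made up for this sketch) that share a templated opening.
docs = [
    "In this article, I'll share my insights on volcanoes.",
    "In this article, I'll share my insights on manuscripts.",
    "In this article, I will examine the Snyder case.",
]
top = top_token_ngrams(docs, n=5, top_k=10)
for pattern, count in top:
    print(count, pattern)
```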

In [35]:
# get the POS patterns 
joined_pos, tuples = get_pos(contents)
ngrams_pos = token_patterns(joined_pos, 5, 10)

In [38]:
ngrams_pos

[('. LS LS NFP NFP', 38990),
 ('NN . LS LS NFP', 18479),
 ('LS LS NFP NFP NNP', 15855),
 ('NNS . LS LS NFP', 15820),
 ('NN , NN , CC', 15106),
 ('NNS , NNS , CC', 13133),
 ('NN , PRP MD VB', 12115),
 ('DT JJ NN IN DT', 12035),
 ('LS LS NFP NFP NN', 11149),
 ('IN DT NN , PRP', 10743)]

In [39]:
cr = compression_ratio(joined_pos, 'gzip')
# hs = homogenization_score(data_example, 'rougel')
# hs = homogenization_score(data_example, 'bertscore') 
nds = ngram_diversity_score(joined_pos, 4)

print(cr, nds)

6.089 0.02


In [41]:
# Now for essays

# get the POS patterns 
joined_pos, tuples = get_pos(essay_bodies)
ngrams_pos = token_patterns(joined_pos, 5, 10)

In [42]:
cr = compression_ratio(joined_pos, 'gzip')
# hs = homogenization_score(data_example, 'rougel')
# hs = homogenization_score(data_example, 'bertscore') 
nds = ngram_diversity_score(joined_pos, 4)

print(cr, nds)

5.072 0.518


| Source | NDS (Lexical) | CR-Lexical | CR-POS |
| --- | --- | --- | --- |
| Human solo essays | 2.74 | 2.90 | 5.07 |
| PersonaHub essays | 1.72 | 3.06 | 6.09 |
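To put the gap between the two sources in relative terms, the sketch below computes the percentage change of each metric from the human essays to the PersonaHub articles. The numbers are copied from the printed cell outputs above; the dictionary layout and the `relative_change` helper are assumptions made for illustration.

```python
# Values copied from the cell outputs above (lexical NDS, lexical CR, POS CR).
human = {"NDS": 2.738, "CR-Lexical": 2.901, "CR-POS": 5.072}
personahub = {"NDS": 1.72, "CR-Lexical": 3.06, "CR-POS": 6.089}

def relative_change(new, old):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old

changes = {k: round(relative_change(personahub[k], human[k]), 1) for k in human}
for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
```

On these figures the n-gram diversity drops by roughly 37%, while the POS-level compression ratio rises by roughly 20%, a larger shift than the lexical compression ratio (about 5%).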

## PersonaHub synthesized instructions versus Dolly

In [2]:
from datasets import load_dataset
import numpy as np

In [3]:
personahub = load_dataset("proj-persona/PersonaHub", "instruction")['train'].to_pandas()

In [12]:
from diversity import compression_ratio, homogenization_score, ngram_diversity_score, extract_patterns, get_pos, pos_patterns, token_patterns, self_repetition_score

def diversity_metrics(sents):
    """Print lexical CR, POS-level CR, 4-gram diversity, and self-repetition for a list of texts."""
    cr = compression_ratio(sents, 'gzip')
    print(f"CR: {np.round(cr,2)}")

    joined_pos, _ = get_pos(sents)
    cr_pos = compression_ratio(joined_pos, 'gzip')
    print(f"CR-POS: {np.round(cr_pos,2)}")
    
    nds = ngram_diversity_score(sents, 4)
    print(f"NDS: {np.round(nds,2)}")
    
    srep = self_repetition_score(sents, verbose=True)
    print(f"Self-rep:{np.round(srep,2)}")

In [5]:
personahub

Unnamed: 0,input persona,synthesized text,description
0,"A theater manager or events coordinator interested in understanding the operational aspects, facilities, and programming of performing arts centers, such as the Broward Center.",What are the key considerations for scheduling and logistics when hosting a multi-show festival at a performing arts center like the Broward Center?,2-shot instruction data synthesis
1,An urban planner looking to understand the distribution and organization of public services in the Halifax Regional Municipality.,"Compare and contrast the distribution of public services such as libraries, community centers, and public transportation in different neighborhoods of Halifax, and discuss how the municipal government's urban planning strategies impact access to these services for residents of varying socioeconomic backgrounds.",2-shot instruction data synthesis
2,A high school literature teacher looking for supplementary materials to enrich their curriculum and provide students with a deeper understanding of their state's cultural heritage.,"Can you provide me with a list of 5-7 educational resources, including videos, articles, and interactive websites, that explore the literary works and cultural significance of Native American authors from my state, such as Louise Erdrich or Sherman Alexie, to incorporate into my 11th-grade American Literature curriculum and help my students better understand the cultural heritage of our region?",0-shot instruction data synthesis
3,"A science fiction writer, exploring the use of secret trials as a theme in stories that examine surveillance, privacy, and power dynamics in society.","In a dystopian future, a young journalist named Maya discovers a hidden underground network of secret trials, where the government uses advanced surveillance technology to manipulate and control citizens. As she delves deeper into the truth, she realizes that her own sister was a victim of these trials and begins to question the true extent of the government's power. How can Maya use her skills as a journalist to expose the secret trials and bring justice to those affected, while navigating the treacherous landscape of government surveillance and propaganda?",2-shot instruction data synthesis
4,"A financial analyst specializing in Asian markets and wealthy individuals, interested in tracking the investments and philanthropic activities of billionaires like Gerald Chan.","Gerald Chan's investment firm has recently acquired a significant stake in a Chinese renewable energy company. How might this investment impact the firm's overall portfolio, and what potential philanthropic opportunities might arise from this acquisition, particularly in the context of China's growing focus on sustainable development?",2-shot instruction data synthesis
...,...,...,...
49995,An events planner seeking unique and engaging entertainment options for corporate gatherings and private parties.,"Can you suggest 5 unconventional entertainment ideas that can be tailored to a tech company's annual gala, with a focus on interactive experiences that encourage team building and networking among 200 guests?",0-shot instruction data synthesis
49996,"An music historian specializing in the analysis of librettos and their cultural significance, with a focus on 20th century dramatic oratorios.","Can you provide a comparative analysis of the librettos of William Walton's ""Belshazzar's Feast"" (1931) and Michael Tippett's ""A Child of Our Time"" (1939-1941), exploring how their use of biblical narratives reflected the cultural and political anxieties of interwar Britain?",0-shot instruction data synthesis
49997,"A science communicator who enjoys sharing stories about space exploration and the people behind it, aiming to inspire interest in science and technology among general audiences.","Can you help me write a script for a 5-minute video about the life and achievements of NASA astronaut Sally Ride, highlighting her contributions to space exploration and her legacy as a role model for women in STEM fields?",0-shot instruction data synthesis
49998,"I'm an molecular biologist specializing in gene regulation and localization in Drosophila embryogenesis. I'm particularly interested in the interplay between translational repression, mRNA localization, and protein-protein interactions during early development.","Can you summarize the current understanding of how the RNA-binding protein Staufen interacts with oskar mRNA to regulate its localization and translation during Drosophila oogenesis, and what are the implications for axis formation in the early embryo?",0-shot instruction data synthesis


In [6]:
dolly = load_dataset("databricks/databricks-dolly-15k")["train"].to_pandas()

In [7]:
dolly.shape

(15011, 4)

In [8]:
personahub.shape

(50000, 3)

In [9]:
diversity_metrics(dolly.instruction.values.tolist())


CR: 2.58
CR-POS: 5.82
NDS: 2.33
Self-rep:1.95





In [13]:
diversity_metrics(personahub['synthesized text'].sample(15000).values.tolist())

CR: 2.84
CR-POS: 6.2
NDS: 2.0
Self-rep:5.13





In [14]:
nr = load_dataset("HuggingFaceH4/no_robots")['train'].to_pandas()
nr.shape

(9500, 4)

In [15]:
np.median([len(x) for x in nr['prompt'].values.tolist()])

np.float64(157.0)

In [16]:
diversity_metrics(nr['prompt'].values.tolist())

CR: 2.47
CR-POS: 5.13
NDS: 2.44
Self-rep:2.62





In [17]:
tulu = load_dataset("allenai/tulu-3-sft-personas-instruction-following")['train'].to_pandas()
tulu.shape

(29980, 4)

In [18]:
diversity_metrics(tulu['prompt'].values.tolist())

CR: 3.2
CR-POS: 6.3
NDS: 1.51
Self-rep:6.79


## Math

In [30]:
gsm8k = load_dataset("openai/gsm8k", "main")['train'].to_pandas()
gsm8k.shape

(7473, 2)

In [31]:
diversity_metrics(gsm8k['question'].values.tolist())

CR: 2.87
CR-POS: 5.94
NDS: 2.19
Self-rep:2.88





In [24]:
tulu_math = load_dataset("allenai/tulu-3-sft-personas-math-grade")['train'].to_pandas()
tulu_math.shape

(49980, 3)

In [25]:
diversity_metrics(tulu_math['prompt'].sample(10000).values.tolist())

CR: 3.11
CR-POS: 6.26
NDS: 1.87
Self-rep:5.68





In [32]:
persona_math = load_dataset("proj-persona/PersonaHub", "math")['train'].to_pandas()
persona_math.shape

(50000, 3)

In [33]:
diversity_metrics(persona_math['synthesized text'].sample(10000).values.tolist())

CR: 3.11
CR-POS: 5.92
NDS: 1.75
Self-rep:6.38
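To compare the runs above side by side, the printed metrics can be gathered into one structure and averaged by source type. The numbers are copied from the outputs above; the human/synthetic labels and the short dataset names are annotations added for this sketch, not part of the original runs.

```python
from statistics import mean

# (name, kind, CR, CR-POS, NDS, Self-rep), copied from the printed outputs above.
# The "human"/"synthetic" labels are annotations added for this sketch.
RESULTS = [
    ("dolly-15k", "human", 2.58, 5.82, 2.33, 1.95),
    ("PersonaHub instruction", "synthetic", 2.84, 6.20, 2.00, 5.13),
    ("no_robots", "human", 2.47, 5.13, 2.44, 2.62),
    ("tulu-3 personas IF", "synthetic", 3.20, 6.30, 1.51, 6.79),
    ("gsm8k", "human", 2.87, 5.94, 2.19, 2.88),
    ("tulu-3 personas math-grade", "synthetic", 3.11, 6.26, 1.87, 5.68),
    ("PersonaHub math", "synthetic", 3.11, 5.92, 1.75, 6.38),
]
METRICS = ("CR", "CR-POS", "NDS", "Self-rep")

def average_by_kind(results):
    """Mean of each metric per source kind, rounded to two decimals."""
    out = {}
    for kind in sorted({row[1] for row in results}):
        rows = [row[2:] for row in results if row[1] == kind]
        out[kind] = {m: round(mean(col), 2) for m, col in zip(METRICS, zip(*rows))}
    return out

averages = average_by_kind(RESULTS)
for kind, vals in averages.items():
    print(kind, vals)
```

Corpus sizes differ across runs (7,473 to 29,980 texts), so these averages are only a rough guide rather than a controlled comparison.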



