<a href="https://colab.research.google.com/github/vg-55/btc_dataset/blob/main/tabllm_reference_improving_dolly_w_synthetic_examples_from_textbooks_are_all_you_need.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating synthetic examples for training an LLM, inspired by "Textbooks Are All You Need" 🚀👨‍💻

In this notebook, we use [Gretel's Tabular LLM](https://gretel.ai/tabular-llm) to generate diverse, high-quality training examples, building on the techniques described in the Background section. Our goal is to demonstrate how to get started creating high-quality synthetic data for LLM training, and to facilitate further research into safeguards for completion models.

## Background
Recent research has shown that training small, efficient language models on high-quality, diverse data can achieve state-of-the-art results, as demonstrated by models like Microsoft's phi-1 (from the paper "[Textbooks Are All You Need](https://arxiv.org/pdf/2306.11644.pdf)") and its successor phi-1.5, [Orca2](https://huggingface.co/microsoft/Orca-2-7b), and [IBM's Granite](https://www.ibm.com/blog/watsonx-tailored-generative-ai/). Using similar techniques, we'll demonstrate ways to inject randomness into the prompt so that the model generates diverse datasets.

Creating diverse training data is challenging, but vital to reduce overfitting and improve generalization. This notebook uses one technique from [TinyStories](https://arxiv.org/abs/2305.07759): including a random subset of words in each prompt.
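
The random-word idea can be sketched in a few lines of plain Python. This is a minimal illustration rather than the TinyStories implementation: the `VOCABULARY` list, the `TEMPLATE` string, and the `build_prompt` helper are all hypothetical names chosen for this example.

```python
import random

# Hypothetical vocabulary and prompt template, for illustration only.
VOCABULARY = ["river", "lantern", "whisper", "bridge", "curious", "storm", "garden", "promise"]
TEMPLATE = "Write a short question and answer pair that uses all of these words: {words}."

def build_prompt(vocabulary, num_words, rng):
    """Return a prompt that embeds a random subset of the vocabulary."""
    # sample() draws without replacement, so no word repeats within one prompt.
    words = rng.sample(vocabulary, min(num_words, len(vocabulary)))
    return TEMPLATE.format(words=", ".join(words))

rng = random.Random(0)  # seeded so the generated prompts are reproducible
prompts = [build_prompt(VOCABULARY, 3, rng) for _ in range(5)]
for prompt in prompts:
    print(prompt)
```

Because each prompt carries a different word subset, a model asked to satisfy it is nudged toward different content on every call, even when the rest of the prompt is identical.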

"[Textbooks Are All You Need II](https://arxiv.org/abs/2309.05463)" highlights additional advantages of textbook-like data over web data: "the model seems to store and access the knowledge more efficiently" and it has an "attenuating effect on toxic content generation." However, as the authors note, "although phi-1.5 has a lower propensity for generating toxic content...it is not immune." They posit phi-1.5's reliance on synthetic data "provide[s] a useful platform for exploring these challenges further."

## Prerequisites

Before diving into the notebook, there are a few prerequisites:

1. **Gretel API Key**: You'll need an API key from Gretel. If you don't have one already, you can obtain it from [Gretel's console](https://console.gretel.ai). This key will enable us to use Gretel's services for generating our synthetic datasets.

2. **Access to Gretel's Tabular LLM**: To utilize the specific features of the Tabular LLM, you need to have access to the early preview. If you're not already signed up, you can request early access at [Gretel's Tabular LLM page](https://gretel.ai/tabular-llm).

3. **Domain-specific training data**: To try this approach with your own data, you'll need an LLM training dataset in a standard input / output format, like you might load from HuggingFace or use to train your model. Or, get started quickly with the example below using the [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset.
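
As a rough illustration of what an input / output format can look like, the sketch below parses JSON Lines text and keeps only records that carry the three text fields this notebook reads from Dolly (`instruction`, `context`, `response`). The `parse_examples` helper and the sample records are hypothetical and exist only for this example.

```python
import json

# Field names mirror the Dolly columns used later in this notebook.
REQUIRED_FIELDS = ("instruction", "context", "response")

def parse_examples(jsonl_text):
    """Parse JSON Lines text and keep records where every required field is a string."""
    examples = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if all(isinstance(record.get(field), str) for field in REQUIRED_FIELDS):
            examples.append({field: record[field] for field in REQUIRED_FIELDS})
    return examples

# Made-up sample data: one complete record and one that lacks required fields.
sample = "\n".join([
    json.dumps({"instruction": "What is TF-IDF?", "context": "", "response": "A term weighting scheme."}),
    json.dumps({"instruction": "Missing fields"}),
])
examples = parse_examples(sample)
print(len(examples))
```

Records that lack a field are dropped here for simplicity; depending on your data, filling missing fields with an empty string may be the better choice.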


Let's get started!


In [None]:
!pip install -qq gretel-client datasets nltk scikit-learn pyyaml pandas

In [None]:
import json
import random
import yaml
import nltk
import pandas as pd
import pathlib
from datasets import load_dataset
from IPython.display import HTML
from sklearn.feature_extraction.text import TfidfVectorizer
from gretel_client import configure_session, create_or_get_unique_project
from gretel_client.helpers import poll

# Constants and Configurations
PROJECT_NAME = "synthetic-dolly" # @param {type:"string"}
DATASET_NAME = "databricks/databricks-dolly-15k" # @param {type:"string"}
MAX_ROWS = 10 # @param {type:"integer"}
UPSAMPLE = 1 # @param {type:"integer"}

INPUT_COLUMNS = ["instruction", "context", "response"]
MODEL_CONFIG = """
schema_version: 1.0
models:
  - tabllm:
      model_id: "gretelai/tabular-v0b"
      output_format: "jsonl"
"""

PROMPT = """
For each example in the Dataset, please act as a tutor and create high quality,
detailed synthetic question and answers of higher quality than the provided example.
Use every word from the provided 'selected_words' column in your response.
Ensure the data teaches concepts step-by-step and focuses on improving reasoning skills.
Focus on generating questions and answers about under-represented topics and knowledge gaps.

Add two new columns to the Dataset:
1. 'synthetic_instruction':
  * Introduce the topic from the example briefly in 1-2 sentences
  * Ask a clear question related to the topic that requires logical thinking or common sense reasoning
  * Provide any necessary context to set up the reasoning problem
  * Do not repeat the instruction from the Dataset example
2. 'synthetic_response':
  * Respond to the synthetically generated instruction thoroughly in a step-by-step manner
  * Provide the complete reasoning needed to arrive at the answer
  * Ensure the explanation is textbook quality with all details needed to learn the concept
  * Answer in 3-5 sentences.
"""

In [None]:
# Helper functions

def setup_nltk():
    """ Downloads necessary NLTK resources. """
    nltk.download('punkt')
    nltk.download('stopwords')

def display_all_data(df):
    # Style DataFrame for better visibility and word-wrap
    styled = df.style.set_properties(**{
        'text-align': 'left',
        'white-space': 'normal',
        'height': 'auto'
    })

    # Display the styled DataFrame
    display(styled)

setup_nltk()

In [None]:
def configure_gretel():
    """ Configures the Gretel session and creates a project. """
    configure_session(api_key="prompt", cache="yes")
    project = create_or_get_unique_project(name=PROJECT_NAME)
    print(f"Project URL: {project.get_console_url()}")
    return project

def initialize_gretel_model(project):
    """ Initialize the Gretel Tabular LLM model with the provided configuration. """
    model_config = yaml.safe_load(MODEL_CONFIG)
    model = project.create_model_obj(model_config)
    model.data_source = None
    model.submit_cloud()
    poll(model, verbose=False)
    return model

project = configure_gretel()
model = initialize_gretel_model(project)

In [None]:
def load_and_clean_dataset(dataset_name, n_rows):
    """ Load and clean a dataset from HuggingFace. """
    dataset = load_dataset(dataset_name, split='train').select(range(n_rows))
    df = pd.DataFrame(dataset)
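    # Flatten newlines and drop non-ASCII characters; leave non-string values untouched.
    # Note: applymap is deprecated in pandas >= 2.1 in favor of DataFrame.map.
    df = df.applymap(
        lambda x: x.replace('\n', ' ').replace('\r', ' ').encode('ascii', 'ignore').decode('ascii')
        if isinstance(x, str) else x
    )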
    return df

df = load_and_clean_dataset(DATASET_NAME, MAX_ROWS)

In [None]:
def diversify_and_upsample(df, num_words, columns=None, new_column='selected_words', tfidf_threshold=0.2, upsample_multiplier=1):
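    """
    Add a new column to a DataFrame with randomly selected interesting words based on TF-IDF scores.
    Optionally upsample the provided dataframe to create additional synthetic examples.

    Args:
        df (pd.DataFrame): Input examples.
        num_words (int): Maximum number of words to select per row.
        columns (list, optional): Text columns to score. Defaults to all columns.
        new_column (str): Name of the column that stores the selected words.
        tfidf_threshold (float): Minimum TF-IDF score for a word to be eligible.
        upsample_multiplier (int): Number of copies of each row, each with its own word sample.

    Returns:
        pd.DataFrame: The input DataFrame with the new column added and upsampled rows with different selected words.
    """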
    if not isinstance(upsample_multiplier, int) or upsample_multiplier <= 0:
        raise ValueError("Upsample multiplier must be a positive integer")

    # If columns not specified, use all columns
    if columns is None:
        columns = df.columns

    # Combine all text data into a single string per row
    combined_text = df[columns].apply(lambda row: ' '.join(row.astype(str)), axis=1)

    # Calculate TF-IDF scores with stop words filtering
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(combined_text)
    feature_names = vectorizer.get_feature_names_out()

    # Function to process each row and select different words for each upsampled row
    def process_row(row, row_index):
        # Get TF-IDF scores for the current row and filter words based on the threshold
        row_tfidf = tfidf_matrix[row_index].toarray().flatten()
        interesting_words = [feature_names[i] for i in range(len(row_tfidf)) if row_tfidf[i] > tfidf_threshold]

        # Randomly choose up to num_words without duplicates
        sampled_words = random.sample(interesting_words, min(num_words, len(interesting_words)))

        return ", ".join(sampled_words)

    upsampled_rows = []

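    # Apply the function to each row and upsample rows with different selected words.
    # enumerate() yields the positional index that tfidf_matrix expects, even when
    # the DataFrame index is not a default RangeIndex.
    for row_pos, (_, row) in enumerate(df.iterrows()):
        sampled_words = [process_row(row, row_pos) for _ in range(upsample_multiplier)]
        upsampled_rows.extend(sampled_words)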

    # Reset the index to match the length of upsampled rows
    df = df.loc[df.index.repeat(upsample_multiplier)].reset_index(drop=True)

    df[new_column] = upsampled_rows

    return df

processed_df = diversify_and_upsample(df, num_words=3, upsample_multiplier=UPSAMPLE)
display_all_data(processed_df)
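
To build intuition for the `tfidf_threshold` parameter, here is a dependency-free sketch of TF-IDF scoring on a toy corpus. It assumes smoothed IDF and L2-normalized rows, which to my understanding matches the default weighting of `TfidfVectorizer`, but it simplifies tokenization to whitespace splitting and skips stop-word filtering. The `tfidf_rows` helper, the two documents, and the threshold of 0.35 are hypothetical choices for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF weights per document, using smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = [
    "solar panels convert sunlight into electricity",
    "wind turbines convert wind into electricity",
]
threshold = 0.35  # hypothetical value chosen for this toy corpus
for row in tfidf_rows(docs):
    # Keep only the terms whose score clears the threshold.
    print(sorted(term for term, weight in row.items() if weight > threshold))
```

Terms shared by both documents ("convert", "into", "electricity") score lowest and fall below the threshold, so only the words that distinguish each document remain as candidates. Because rows are L2-normalized, scores lie between 0 and 1, and longer texts spread that weight over more terms, so a suitable threshold depends on the length of your examples.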

In [None]:
def create_synthetic_data(model, df, prompt):
    """ Generates synthetic data using the Gretel model. """
    prompt_file = pathlib.Path("prompt.jsonl")
    prompt_file.write_text(json.dumps({"prompt": prompt}) + "\n")

    data_path = 'data.csv'
    df.to_csv(data_path, index=False)
    generator = model.create_record_handler_obj(data_source=str(prompt_file),
                                                ref_data={"data": data_path},
                                                params={"num_records": len(df), "temperature": 0.8})
    generator.submit_cloud()
    poll(generator, verbose=True)
    return pd.read_json(generator.get_artifact_link("data"), lines=True, compression="gzip")

# Create synthetic records
synthetic = create_synthetic_data(model, processed_df, PROMPT)

# Compare the original examples with the synthetic textbook-style instructions and responses
synthetic = synthetic[['instruction', 'response', 'synthetic_instruction', 'synthetic_response']]
synthetic.to_csv('synthetic_data.csv', index=False)
display_all_data(synthetic)
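
The prompt asks the model to use every word from `selected_words`, but nothing in the pipeline verifies that it did. One possible spot check is sketched below on made-up rows; the `missing_words` helper and the sample data are hypothetical, and the check assumes the comma-separated format that `diversify_and_upsample` writes.

```python
import re

def missing_words(selected_words, text):
    """Return the selected words that do not appear as whole words in the text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    wanted = [word.strip().lower() for word in selected_words.split(",") if word.strip()]
    return [word for word in wanted if word not in tokens]

# Made-up rows for illustration; real rows would come from the generated dataset.
rows = [
    {"selected_words": "photosynthesis, chlorophyll",
     "synthetic_response": "Chlorophyll absorbs light during photosynthesis."},
    {"selected_words": "gravity, orbit",
     "synthetic_response": "Gravity pulls objects toward each other."},
]
report = [missing_words(row["selected_words"], row["synthetic_response"]) for row in rows]
print(report)
```

This is an exact-match check, so an inflected form such as "orbits" would be reported as missing; treat the result as a rough signal rather than a strict filter.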

## Citations

@misc{li2023textbooks,
      title={Textbooks Are All You Need II: phi-1.5 technical report},
      author={Yuanzhi Li and Sébastien Bubeck and Ronen Eldan and Allie Del Giorno and Suriya Gunasekar and Yin Tat Lee},
      year={2023},
      eprint={2309.05463},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{eldan2023tinystories,
      title={TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
      author={Ronen Eldan and Yuanzhi Li},
      year={2023},
      eprint={2305.07759},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{gunasekar2023textbooks,
      title={Textbooks Are All You Need},
      author={Suriya Gunasekar and Yi Zhang and Jyoti Aneja and Caio César Teodoro Mendes and Allie Del Giorno and Sivakanth Gopi and Mojan Javaheripi and Piero Kauffmann and Gustavo de Rosa and Olli Saarikivi and Adil Salim and Shital Shah and Harkirat Singh Behl and Xin Wang and Sébastien Bubeck and Ronen Eldan and Adam Tauman Kalai and Yin Tat Lee and Yuanzhi Li},
      year={2023},
      eprint={2306.11644},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}