<a href="https://colab.research.google.com/github/schumbar/CMPE297/blob/main/assignment_05/ShawnChumbar_Assignment05_PartF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 05 Part F - Synthetic Data Generation

## Assignment Description
Create a Colab that generates synthetic data for a real dataset using Tabula. Include explanations for the data generation process and how it compares to the original data.

### References
1. [Tabula_on_insurance_dataset.ipynb](https://github.com/zhao-zilong/Tabula/blob/main/Tabula_on_insurance_dataset.ipynb)


### Synthetic Data Generation

In [None]:
!git clone https://github.com/zhao-zilong/Tabula.git




In [None]:
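# Note: `!cd` runs in a temporary subshell, so the notebook's working directory stays /content.
# Use the `%cd Tabula` magic instead if a persistent change is needed.
!cd Tabula
!pwd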

/content


In [None]:
# Quote each requirement so the shell does not treat `>=` as an output redirection
!pip install "datasets>=2.5.2"
!pip install "numpy>=1.24.2"
!pip install "tqdm>=4.64.1"
!pip install "transformers>=4.22.1"
!pip install "pandas>=1.4.4"
!pip install "scikit_learn>=1.1.1"
!pip install "torch>=1.10.2"

In [None]:
# Add the Tabula directory to the Python path
import sys
sys.path.append('/content/Tabula')  # Adjust the path if necessary

### Import Necessary Libraries

In [None]:
# Import necessary libraries
import os
import warnings
import json
import typing as tp
import logging
import random

import numpy as np
import pandas as pd
from sklearn import preprocessing
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    AutoConfig,
    DataCollatorWithPadding,
    Trainer
)
from datasets import Dataset
from dataclasses import dataclass

### Load Dataset

In [None]:
# Load the dataset
data_url = "https://raw.githubusercontent.com/zhao-zilong/Tabula/main/Real_Datasets/Insurance_compressed.csv"
data = pd.read_csv(data_url)


### Define Utility Functions

In [None]:
# Define utility functions
def _array_to_dataframe(data: tp.Union[pd.DataFrame, np.ndarray], columns=None) -> pd.DataFrame:
    if isinstance(data, pd.DataFrame):
        return data
    assert isinstance(data, np.ndarray), "Input needs to be a Pandas DataFrame or a Numpy NDArray"
    assert columns, "To convert the data into a Pandas DataFrame, a list of column names has to be given!"
    assert len(columns) == len(data[0]), \
        "%d column names are given, but array has %d columns!" % (len(columns), len(data[0]))
    return pd.DataFrame(data=data, columns=columns)

def _get_column_distribution(df: pd.DataFrame, col: str) -> tp.Union[list, dict]:
    if df[col].dtype == "float" or df[col].dtype == "int":
        col_dist = df[col].to_list()
    else:
        col_dist = df[col].value_counts(normalize=True).to_dict()
    return col_dist

def _convert_tokens_to_text(tokens: tp.List[torch.Tensor], tokenizer: AutoTokenizer) -> tp.List[str]:
    text_data = [tokenizer.decode(t, skip_special_tokens=True) for t in tokens]
    text_data = [d.replace("\n", " ").replace("\r", "").strip() for d in text_data]
    return text_data

def _convert_text_to_tabular_data(text: tp.List[str], df_gen: pd.DataFrame) -> pd.DataFrame:
    columns = df_gen.columns.to_list()
    result_list = []
    for t in text:
        features = t.split(",")
        td = dict.fromkeys(columns)
        for f in features:
            values = f.strip().split(" ")
            if len(values) >= 2 and values[0] in columns and not td[values[0]]:
                td[values[0]] = [values[1]]
        result_list.append(pd.DataFrame(td))
    generated_df = pd.concat(result_list, ignore_index=True, axis=0)
    df_gen = pd.concat([df_gen, generated_df], ignore_index=True, axis=0)
    return df_gen

def _pad(x, length: int, pad_value=50256):
    return [pad_value] * (length - len(x)) + x

def _pad_tokens(tokens):
    max_length = len(max(tokens, key=len))
    tokens = [_pad(t, max_length) for t in tokens]
    return tokens

def _seed_worker(_):
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)
    torch.manual_seed(worker_seed)
    torch.cuda.manual_seed_all(worker_seed)
    # Shawn Chumbar Created this assignment
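
The text-to-table helper above expects every generated sample to look like `column value, column value, ...`. The following is a minimal standard-library sketch of that parsing rule, not the notebook's actual helper: `parse_row` is a hypothetical name, the column names and sample string are made up for illustration, and it checks `is None` where the helper checks truthiness.

```python
import typing as tp


def parse_row(text: str, columns: tp.List[str]) -> tp.Dict[str, tp.Optional[str]]:
    """Parse 'col value, col value' text into a dict keyed by the expected columns."""
    row = dict.fromkeys(columns)  # every expected column starts as None
    for feature in text.split(","):
        parts = feature.strip().split(" ")
        # Keep only the first occurrence of a known column; ignore everything else.
        if len(parts) >= 2 and parts[0] in columns and row[parts[0]] is None:
            row[parts[0]] = parts[1]
    return row


columns = ["age", "sex", "bmi"]  # illustrative column names
sample = "bmi 27.9, age 19, sex 0, age 99, foo 1"  # made-up model output
print(parse_row(sample, columns))
# {'age': '19', 'sex': '0', 'bmi': '27.9'}
```

Columns the model never emitted stay `None`, which is why the sampling step later drops rows with missing values.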


### Define TabulaStart Classes

In [None]:

# Define TabulaStart classes
class TabulaStart:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def get_start_tokens(self, n_samples: int) -> tp.List[tp.List[int]]:
        raise NotImplementedError("This has to be overwritten by the subclasses")

class CategoricalStart(TabulaStart):
    def __init__(self, tokenizer, start_col: str, start_col_dist: dict):
        super().__init__(tokenizer)
        assert isinstance(start_col, str), "Start column name must be a string"
        assert isinstance(start_col_dist, dict), "Start column distribution must be a dict"
        self.start_col = start_col
        self.population = list(start_col_dist.keys())
        self.weights = list(start_col_dist.values())

    def get_start_tokens(self, n_samples):
        start_words = random.choices(self.population, self.weights, k=n_samples)
        start_text = [self.start_col + " " + str(s) + "," for s in start_words]
        start_tokens = _pad_tokens(self.tokenizer(start_text)["input_ids"])
        return start_tokens

class ContinuousStart(TabulaStart):
    def __init__(self, tokenizer, start_col: str, start_col_dist: tp.List[float],
                 noise: float = .01, decimal_places: int = 5):
        super().__init__(tokenizer)
        assert isinstance(start_col, str), "Start column name must be a string"
        assert isinstance(start_col_dist, list), "Start column distribution must be a list"
        self.start_col = start_col
        self.start_col_dist = start_col_dist
        self.noise = noise
        self.decimal_places = decimal_places

    def get_start_tokens(self, n_samples):
        start_words = random.choices(self.start_col_dist, k=n_samples)
        start_words = [s + random.uniform(-self.noise, self.noise) for s in start_words]
        start_text = [self.start_col + " " + format(s, f".{self.decimal_places}f") + "," for s in start_words]
        start_tokens = _pad_tokens(self.tokenizer(start_text)["input_ids"])
        return start_tokens

class RandomStart(TabulaStart):
    def __init__(self, tokenizer, all_columns: tp.List[str]):
        super().__init__(tokenizer)
        self.all_columns = all_columns

    def get_start_tokens(self, n_samples):
        start_words = random.choices(self.all_columns, k=n_samples)
        start_text = [s + " " for s in start_words]
        start_tokens = _pad_tokens(self.tokenizer(start_text)["input_ids"])
        return start_tokens
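
The start classes above build a short prompt such as `smoker yes,` for each sample and left-pad the token ids so that all prompts in a batch have the same length. Below is a rough sketch of those two steps using only the standard library. The function names, the example distribution, and the toy token ids are assumptions for illustration; no real tokenizer is involved.

```python
import random
import typing as tp


def sample_start_texts(col: str, dist: tp.Dict[str, float], n: int,
                       rng: random.Random) -> tp.List[str]:
    """Draw n start values according to their relative frequencies and format them as prompts."""
    values = rng.choices(list(dist.keys()), weights=list(dist.values()), k=n)
    return [f"{col} {v}," for v in values]


def left_pad(ids: tp.List[int], length: int, pad_value: int = 50256) -> tp.List[int]:
    """Pad on the left so the real tokens sit at the end of the sequence."""
    return [pad_value] * (length - len(ids)) + ids


rng = random.Random(0)
prompts = sample_start_texts("smoker", {"yes": 0.2, "no": 0.8}, n=5, rng=rng)
print(prompts)

toy_ids = [[11, 12, 13], [21, 22]]  # stand-ins for tokenizer output
max_len = max(len(t) for t in toy_ids)
print([left_pad(t, max_len) for t in toy_ids])
# [[11, 12, 13], [50256, 21, 22]]
```

Padding on the left matters for a decoder-only model: generation continues from the last position, so the prompt has to end where the new tokens begin. The default pad value 50256 is the GPT-2 end-of-text token id.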

### Define Tabula Dataset

In [None]:

# Define TabulaDataset
class TabulaDataset(Dataset):
    def set_tokenizer(self, tokenizer):
        self.tokenizer = tokenizer

    def _getitem(self, key: tp.Union[int, slice, str], decoded: bool = True, **kwargs) -> tp.Union[tp.Dict, tp.List]:
        row = self._data.fast_slice(key, 1)
        shuffle_idx = list(range(row.num_columns))
        random.shuffle(shuffle_idx)
        shuffled_text = ", ".join(
            ["%s %s" % (row.column_names[i], str(row.columns[i].to_pylist()[0]).strip()) for i in shuffle_idx]
        )
        tokenized_text = self.tokenizer(shuffled_text)
        return tokenized_text

    def __getitems__(self, keys: tp.Union[int, slice, str, list]):
        if isinstance(keys, list):
            return [self._getitem(key) for key in keys]
        else:
            return self._getitem(keys)

# Define Data Collator
@dataclass
class TabulaDataCollator(DataCollatorWithPadding):
    tokenizer: AutoTokenizer

    def __call__(self, features: tp.List[tp.Dict[str, tp.Any]]):
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch["labels"] = batch["input_ids"].clone()
        return batch
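
`TabulaDataset._getitem` turns each table row into one line of text and shuffles the column order every time the row is read, so the model does not learn to depend on a fixed feature order. A minimal sketch of that serialization step follows; `serialize_row` and the example row are hypothetical, and the tokenizer call is left out.

```python
import random
import typing as tp


def serialize_row(row: tp.Dict[str, tp.Any], rng: random.Random) -> str:
    """Render a row as 'col value, col value, ...' with the columns in random order."""
    items = list(row.items())
    rng.shuffle(items)  # a fresh permutation on every call
    return ", ".join(f"{name} {str(value).strip()}" for name, value in items)


row = {"age": 19, "sex": 0, "bmi": 27.9}  # illustrative values
rng = random.Random(0)
for _ in range(3):
    print(serialize_row(row, rng))
```

Each printed line contains the same three `name value` pairs; only their order may differ between calls.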


### Define TabulaTrainer

In [None]:
# Define TabulaTrainer
class TabulaTrainer(Trainer):
    def get_train_dataloader(self) -> DataLoader:
        if self.train_dataset is None:
            raise ValueError("Trainer: training requires a train_dataset.")

        data_collator = self.data_collator
        train_dataset = self.train_dataset  # Do not remove unused columns

        train_sampler = self._get_train_sampler()

        return DataLoader(
            train_dataset,
            batch_size=self._train_batch_size,
            sampler=train_sampler,
            collate_fn=data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
            pin_memory=self.args.dataloader_pin_memory,
            worker_init_fn=_seed_worker,
        )


### Define Main Tabula Class

In [None]:

# Define the main Tabula class
class Tabula:
    def __init__(self, llm: str, experiment_dir: str = "trainer_tabula", epochs: int = 100,
                 batch_size: int = 8, categorical_columns: tp.Optional[tp.List[str]] = None, **train_kwargs):
        self.llm = llm
        self.tokenizer = AutoTokenizer.from_pretrained(self.llm)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.config = AutoConfig.from_pretrained(self.llm)
        self.model = AutoModelForCausalLM.from_pretrained(self.llm)

        self.experiment_dir = experiment_dir
        self.epochs = epochs
        self.batch_size = batch_size
        # Avoid a shared mutable default argument: fall back to a fresh empty list
        self.categorical_columns = categorical_columns if categorical_columns is not None else []
        self.train_hyperparameters = train_kwargs

        self.columns = None
        self.num_cols = None
        self.conditional_col = None
        self.conditional_col_dist = None
        self.label_encoder_list = []

    def encode_categorical_column(self, data: pd.DataFrame):
        self.label_encoder_list = []
        for column in data.columns:
            if column in self.categorical_columns:
                label_encoder = preprocessing.LabelEncoder()
                data[column] = data[column].astype(str)
                label_encoder.fit(data[column])
                data[column] = label_encoder.transform(data[column])
                self.label_encoder_list.append({
                    'column': column,
                    'label_encoder': label_encoder
                })
        return data

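    def decode_categorical_column(self, data: pd.DataFrame):
        # Work on an explicit copy so the assignments below never write into a slice
        # of the caller's DataFrame (this avoids pandas' SettingWithCopyWarning).
        data = data.copy()
        for le_dict in self.label_encoder_list:
            column = le_dict['column']
            le = le_dict['label_encoder']
            allowed_values = list(range(len(le.classes_)))
            data[column] = pd.to_numeric(data[column], errors='coerce')
            # Drop rows whose generated code is not numeric or not a known class index
            data = data.dropna(subset=[column]).copy()
            data[column] = data[column].astype(int)
            data = data[data[column].isin(allowed_values)].copy()
            data[column] = le.inverse_transform(data[column])
        return data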

    def fit(self, data: tp.Union[pd.DataFrame, np.ndarray], column_names: tp.Optional[tp.List[str]] = None,
            conditional_col: tp.Optional[str] = None, resume_from_checkpoint: tp.Union[bool, str] = False) \
            -> TabulaTrainer:
        df = _array_to_dataframe(data, columns=column_names)
        self._update_column_information(df)
        self._update_conditional_information(df, conditional_col)

        if self.categorical_columns:
            df = self.encode_categorical_column(df)

        logging.info("Convert data into HuggingFace dataset object...")
        tabula_ds = TabulaDataset.from_pandas(df)
        tabula_ds.set_tokenizer(self.tokenizer)

        logging.info("Create Tabula Trainer...")
        training_args = TrainingArguments(
            self.experiment_dir,
            num_train_epochs=self.epochs,
            per_device_train_batch_size=self.batch_size,
            save_strategy="no",
            **self.train_hyperparameters
        )
        tabula_trainer = TabulaTrainer(
            self.model, training_args, train_dataset=tabula_ds, tokenizer=self.tokenizer,
            data_collator=TabulaDataCollator(tokenizer=self.tokenizer)
        )

        logging.info("Start training...")
        tabula_trainer.train(resume_from_checkpoint=resume_from_checkpoint)
        return tabula_trainer

    def sample(self, n_samples: int,
               start_col: tp.Optional[str] = "", start_col_dist: tp.Optional[tp.Union[dict, list]] = None,
               temperature: float = 0.7, k: int = 100, max_length: int = 100, device: str = "cuda") -> pd.DataFrame:
        tabula_start = self._get_start_sampler(start_col, start_col_dist)

        self.model.to(device)

        df_gen = pd.DataFrame(columns=self.columns)

        with tqdm(total=n_samples) as pbar:
            already_generated = 0
            while n_samples > df_gen.shape[0]:
                start_tokens = tabula_start.get_start_tokens(k)
                start_tokens = torch.tensor(start_tokens).to(device)

                tokens = self.model.generate(
                    input_ids=start_tokens,
                    max_length=max_length,
                    do_sample=True,
                    temperature=temperature,
                    pad_token_id=self.tokenizer.eos_token_id
                )

                text_data = _convert_tokens_to_text(tokens, self.tokenizer)
                df_gen = _convert_text_to_tabular_data(text_data, df_gen)

                for i_num_cols in self.num_cols:
                    df_gen = df_gen[pd.to_numeric(df_gen[i_num_cols], errors='coerce').notnull()]

                df_gen[self.num_cols] = df_gen[self.num_cols].astype(float)

                df_gen = df_gen.dropna(subset=self.columns)

                pbar.update(df_gen.shape[0] - already_generated)
                already_generated = df_gen.shape[0]

        df_gen = df_gen.reset_index(drop=True)

        if self.categorical_columns:
            df_inversed = self.decode_categorical_column(df_gen.head(n_samples))
            return df_inversed
        else:
            return df_gen.head(n_samples)

    def _update_column_information(self, df: pd.DataFrame):
        self.columns = df.columns.to_list()
        self.num_cols = df.select_dtypes(include=np.number).columns.to_list()

    def _update_conditional_information(self, df: pd.DataFrame, conditional_col: tp.Optional[str] = None):
        assert conditional_col is None or conditional_col in df.columns, \
            f"The column name {conditional_col} is not in the feature names of the given dataset"

        self.conditional_col = conditional_col if conditional_col else df.columns[-1]
        self.conditional_col_dist = _get_column_distribution(df, self.conditional_col)

    def _get_start_sampler(self, start_col: tp.Optional[str],
                           start_col_dist: tp.Optional[tp.Union[tp.Dict, tp.List]]) -> TabulaStart:
        if start_col and start_col_dist is None:
            raise ValueError(f"Start column {start_col} was given, but no corresponding distribution.")
        if start_col_dist is not None and not start_col:
            raise ValueError(f"Start column distribution {start_col_dist} was given, the column name is missing.")

        start_col = start_col if start_col else self.conditional_col
        start_col_dist = start_col_dist if start_col_dist else self.conditional_col_dist

        if isinstance(start_col_dist, dict):
            return CategoricalStart(self.tokenizer, start_col, start_col_dist)
        elif isinstance(start_col_dist, list):
            return ContinuousStart(self.tokenizer, start_col, start_col_dist)
        else:
            return RandomStart(self.tokenizer, self.columns)
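
The `sample` method above works by rejection: it generates a batch of `k` candidate rows, discards rows with missing or non-numeric values, and repeats until `n_samples` valid rows are available. The sketch below illustrates that loop with a fake generator in place of the language model. All names here (`is_valid`, `fake_generate`, `sample_valid`) and the `max_batches` cap are assumptions added for illustration; the method above has no such cap.

```python
import random
import typing as tp

Row = tp.Dict[str, tp.Optional[str]]


def is_valid(row: Row, numeric_cols: tp.List[str]) -> bool:
    """A row is kept only if no value is missing and numeric columns parse as numbers."""
    if any(v is None for v in row.values()):
        return False
    for col in numeric_cols:
        try:
            float(row[col])
        except ValueError:
            return False
    return True


def fake_generate(k: int, rng: random.Random) -> tp.List[Row]:
    """Hypothetical stand-in for model.generate: some rows are malformed on purpose."""
    return [{"age": rng.choice(["19", "42", "abc", None]), "bmi": "27.9"} for _ in range(k)]


def sample_valid(n_samples: int, k: int = 8, seed: int = 0, max_batches: int = 1000) -> tp.List[Row]:
    rng = random.Random(seed)
    kept: tp.List[Row] = []
    for _ in range(max_batches):  # bounded, so a badly trained model cannot loop forever
        kept.extend(r for r in fake_generate(k, rng) if is_valid(r, ["age", "bmi"]))
        if len(kept) >= n_samples:
            return kept[:n_samples]
    raise RuntimeError("too few valid rows were generated")


rows = sample_valid(20)
print(len(rows), rows[0])
```

Because batches are generated whole, the loop can overshoot the target, which is why the result is trimmed to `n_samples` at the end.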


### Adjust Categorical Columns to Match Dataset

In [None]:
# Adjust the categorical columns to match your dataset
categorical_columns = ["sex", "children", "smoker", "region"]
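
The columns listed above are label-encoded to integer codes before training and decoded again after sampling; generated codes that are not numeric or fall outside the known range are dropped. The following is a simplified standard-library stand-in for that round trip, not scikit-learn's `LabelEncoder` itself. The function names and the example region values are illustrative assumptions.

```python
import typing as tp


def fit_classes(values: tp.Iterable[tp.Any]) -> tp.List[str]:
    """Sorted unique string labels; the position in this list is the integer code."""
    return sorted({str(v) for v in values})


def encode(values: tp.Iterable[tp.Any], classes: tp.List[str]) -> tp.List[int]:
    index = {c: i for i, c in enumerate(classes)}
    return [index[str(v)] for v in values]


def decode(codes: tp.Iterable[tp.Any], classes: tp.List[str]) -> tp.List[str]:
    """Map generated codes back to labels, skipping invalid or out-of-range codes."""
    decoded = []
    for code in codes:
        try:
            i = int(float(code))
        except (TypeError, ValueError, OverflowError):
            continue  # not a usable number
        if 0 <= i < len(classes):
            decoded.append(classes[i])
    return decoded


regions = ["southwest", "southeast", "northwest", "northeast", "southeast"]
classes = fit_classes(regions)
print(classes)                      # ['northeast', 'northwest', 'southeast', 'southwest']
print(encode(regions, classes))     # [3, 2, 1, 0, 2]
print(decode(["3", "1.0", "7", "oops"], classes))  # ['southwest', 'northwest']
```

Note that values are cast to strings before encoding, so a numeric column such as `children` comes back from decoding as strings rather than integers.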

### Instantiate Tabula Model with Correct Class Name and Parameters

In [None]:
# Instantiate the Tabula model with the correct class name and parameters
model = Tabula(
    llm='distilgpt2',
    experiment_dir="insurance_training",
    batch_size=32,
    epochs=400,
    categorical_columns=categorical_columns
)


In [None]:
model.fit(data)

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 1.8201 |
| 1000 | 1.5808 |
| 1500 | 1.4395 |
| 2000 | 1.31   |
| 2500 | 1.1891 |
| 3000 | 1.0894 |
| 3500 | 1.0252 |
| 4000 | 0.9803 |
| 4500 | 0.9461 |
| 5000 | 0.9134 |


<__main__.TabulaTrainer at 0x7ed91154f340>

### Save Trained Model State

In [None]:
# Save the trained model state
torch.save(model.model.state_dict(), "insurance_training/model_400epoch.pt")

### Generate Synthetic Data

In [None]:
# Generate synthetic data
synthetic_data = model.sample(n_samples=1338)
synthetic_data.to_csv("insurance_400epoch.csv", index=False)

1344it [00:02, 473.11it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column] = pd.to_numeric(data[column], errors='coerce')
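
The assignment also asks how the synthetic data compares to the original. One simple first check is to put summary statistics of numeric columns and category frequencies side by side. The sketch below shows one way to do that with pandas; the helper names and the two tiny DataFrames are made up for illustration and are not results from this notebook.

```python
import pandas as pd


def compare_numeric(real: pd.DataFrame, synth: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Side-by-side mean and standard deviation for the given numeric columns."""
    return pd.DataFrame({
        "real_mean": real[cols].mean(),
        "synth_mean": synth[cols].mean(),
        "real_std": real[cols].std(),
        "synth_std": synth[cols].std(),
    })


def compare_categorical(real: pd.DataFrame, synth: pd.DataFrame, col: str) -> pd.DataFrame:
    """Side-by-side relative frequencies for one categorical column."""
    return pd.DataFrame({
        "real": real[col].value_counts(normalize=True),
        "synth": synth[col].value_counts(normalize=True),
    }).fillna(0.0)  # a category missing from one side gets frequency 0


# Toy stand-ins for the original and generated tables
real = pd.DataFrame({"age": [19, 33, 45, 60], "smoker": ["yes", "no", "no", "no"]})
synth = pd.DataFrame({"age": [21, 30, 47, 58], "smoker": ["no", "no", "yes", "yes"]})

numeric_report = compare_numeric(real, synth, ["age"])
categorical_report = compare_categorical(real, synth, "smoker")
print(numeric_report)
print(categorical_report)
```

In this notebook the same helpers could be called with the original table and `synthetic_data`. Two caveats follow from the code above: `fit` label-encodes the DataFrame it receives in place, so the original CSV should be reloaded (or a copy passed to `fit`) before comparing, and decoded categorical columns are strings, so dtypes may need to be aligned first.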
