# This notebook goes over data prep for SQL model fine-tuning

## Datasets:

- [Anyscale](https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications): Used [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset from Hugging Face, which is a combination of the [WikiSQL](https://huggingface.co/datasets/wikisql) and [Spider](https://huggingface.co/datasets/spider) datasets.
- [NumbersStation](https://www.numbersstation.ai/post/nsql-llama-2-7b):
  - [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): pretraining data
  - [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL): fine-tuning dataset

### Following the NumbersStation training methodology

- To train NSQL, NumbersStation created two training datasets:
  - a pre-training dataset composed of general SQL queries, and
  - a fine-tuning dataset composed of text-to-SQL pairs.

## NSText2SQL Dataset

## Load tokenizer

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-llama-2-7B")
# model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-llama-2-7B", load_in_8bit=True, torch_dtype=torch.bfloat1

In [2]:
from datasets import load_dataset
from torch.utils.data import Dataset
import copy

class NSText2SQLDataset(Dataset):
    def __init__(self, size=None, max_seq_length=2048):
        self.dataset = load_dataset("NumbersStation/NSText2SQL", split="train")
        if size:
            self.dataset = self.dataset.select(range(size))
        self.max_seq_length = max_seq_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        row = self.dataset[index]
        # Token count of the prompt alone; these positions are excluded from the loss.
        prompt_len = len(tokenizer.encode(row["instruction"]))
        # Tokenize prompt + target SQL, terminate with EOS, and truncate if too long.
        example = tokenizer.encode(row["instruction"] + row["output"])
        example.append(tokenizer.eos_token_id)
        example = torch.tensor(example[: self.max_seq_length], dtype=torch.int64)
        n_tokens = len(example)

        padding = self.max_seq_length - n_tokens
        if padding > 0:
            # Right-pad with token id 0 up to max_seq_length.
            example = torch.cat((example, torch.zeros(padding, dtype=torch.int64)))

        labels = example.clone()
        labels[:prompt_len] = -100  # do not train on the prompt tokens
        labels[n_tokens:] = -100    # do not train on the padding tokens

        return {"input_ids": example, "labels": labels}
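
To make the label masking in `__getitem__` easier to inspect without downloading the tokenizer or the dataset, here is a minimal pure-Python sketch of the same idea. Everything in it is hypothetical and for illustration only: `toy_encode` is a stand-in character-level tokenizer, and `BOS_ID`, `EOS_ID`, `PAD_ID` and `build_example` are made-up names, not part of `transformers` or of the NSQL code. Prompt positions get the label `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss, and padded positions are set to `-100` here as well.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0  # hypothetical special-token ids


def toy_encode(text):
    """Hypothetical tokenizer: BOS followed by one id per character."""
    return [BOS_ID] + [ord(ch) for ch in text]


def build_example(instruction, output, max_seq_length):
    """Return fixed-length input_ids and labels for one text-to-SQL pair."""
    prompt_len = len(toy_encode(instruction))
    ids = toy_encode(instruction + output) + [EOS_ID]
    ids = ids[:max_seq_length]          # truncate sequences that are too long
    n_real = len(ids)                   # number of non-padding tokens
    ids = ids + [PAD_ID] * (max_seq_length - n_real)

    labels = list(ids)
    for i in range(max_seq_length):
        # Only the SQL answer (and EOS) should contribute to the loss.
        if i < prompt_len or i >= n_real:
            labels[i] = IGNORE_INDEX
    return {"input_ids": ids, "labels": labels}


sample = build_example("-- count users\n", "SELECT COUNT(*) FROM users", max_seq_length=64)
```

In this sketch only the positions holding the SQL answer and the trailing EOS keep their token ids in `labels`; every other position is `-100`.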

In [3]:
dataset = NSText2SQLDataset(size=1000, max_seq_length=1024)

In [12]:
dataset[10]['input_ids'].shape

torch.Size([1024])

In [4]:
# specific language: SQL
pretrain_sql_stack = load_dataset("bigcode/the-stack-dedup", data_dir="data/sql", split="train")
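
The notebook does not show how the raw SQL files are turned into pretraining sequences. One common approach for causal language model pretraining, assumed here rather than taken from the NumbersStation write-up, is to tokenize each file, join the files with an EOS token, and cut the stream into fixed-length blocks. The sketch below works on plain lists of token ids; `pack_into_blocks` and its parameters are hypothetical names chosen for illustration.

```python
def pack_into_blocks(token_streams, block_size, eos_id):
    """Concatenate tokenized files (EOS after each) and yield equal-size blocks.

    A trailing remainder shorter than block_size is dropped.
    """
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the file boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]


# Three already-tokenized "files" packed into blocks of 4 tokens, with EOS id 0.
blocks = list(pack_into_blocks([[5, 6, 7], [8, 9], [10, 11, 12, 13]], block_size=4, eos_id=0))
```

Packing avoids spending compute on padding, at the cost of blocks that may span file boundaries.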

In [None]:
load_dataset("NumbersStation/NSText2SQL", split="train")