# Part 1: Download data
This notebook covers downloading the data for SQL model [Pretraining](#pretraining-dataset) and [Fine-tuning](#fine-tuning-dataset).

If you plan to use an instruct model such as Llama-7b-Instruct-hf as the base model, we recommend going directly to [Fine-tuning](#fine-tuning-dataset).

## Pretraining dataset
This section covers two datasets for pretraining. The stack-smol dataset is a small subset suited to testing; for a full pretraining job, use the Full squad Stack-dedup dataset.

### Stack smol dataset for SQL test Pretraining
- [bigcode/the-stack-smol](https://huggingface.co/datasets/bigcode/the-stack-smol)
- Total samples: 155813472
- Total dataset size: 902MB
- SQL dataset size: 158MB
- Each line of the jsonl file contains one ```prompt``` and ```completion``` pair. For pretraining, the ```prompt``` field is empty.

To download the dataset, you first need to request access on Hugging Face: [bigcode/the-stack-smol](https://huggingface.co/datasets/bigcode/the-stack-smol)
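
Once access has been granted, the download cells must run as an authenticated Hugging Face user. The sketch below shows one way to do that, assuming the access token is stored in the `HF_TOKEN` environment variable. The helper names `read_hf_token` and `login_to_hub` are illustrative and not part of this kit.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is missing or blank."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a Hugging Face access token."
        )
    return token


def login_to_hub():
    """Log in with the token from the environment (assumes `huggingface_hub` is installed)."""
    from huggingface_hub import login  # imported lazily so the helper above has no dependencies

    login(token=read_hf_token())


# Example call (needs network access):
# login_to_hub()
```

Running `huggingface-cli login` once in a terminal is an alternative that stores the token on disk instead.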

In [None]:
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

In [None]:
from datasets import load_dataset

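# Load only the SQL subset of the dataset
pretrain_dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/sql", split="train")

# Pretraining samples have no prompt: keep the file content as the completion and add an empty prompt
prompt = [""] * len(pretrain_dataset)
pretrain_dataset = pretrain_dataset.rename_columns({'content': 'completion'}).select_columns('completion').add_column("prompt", prompt)

# Make sure the output folder exists before writing the jsonl file
output_path = os.path.join(kit_dir, "data", "pre-training", "pretrain-squad-smol-sql.jsonl")
os.makedirs(os.path.dirname(output_path), exist_ok=True)
pretrain_dataset.to_json(output_path)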

### Full squad Stack-dedup dataset for SQL Pretraining

- [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup)
- Total samples: 13669625293
- Total dataset size: 996GB
- SQL dataset size: 4.3GB
- Each line of the jsonl file contains one ```prompt``` and ```completion``` pair. For pretraining, the ```prompt``` field is empty.

To download the dataset, you first need to request access on Hugging Face: [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup)

In [None]:
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

In [None]:
from datasets import load_dataset

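# Load only the SQL subset of the dataset
pretrain_dataset = load_dataset("bigcode/the-stack-dedup", data_dir="data/sql", split="train")

# Pretraining samples have no prompt: keep the file content as the completion and add an empty prompt
prompt = [""] * len(pretrain_dataset)
pretrain_dataset = pretrain_dataset.rename_columns({'content': 'completion'}).select_columns('completion').add_column("prompt", prompt)

# Make sure the output folder exists before writing the jsonl file
output_path = os.path.join(kit_dir, "data", "pre-training", "pretrain-the-stack-dedup.jsonl")
os.makedirs(os.path.dirname(output_path), exist_ok=True)
pretrain_dataset.to_json(output_path)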
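
The cell above loads the whole SQL subset (about 4.3GB according to the list above) before writing it out. If memory or disk space is tight, one option is to stream the samples and write them one at a time. The sketch below assumes the `streaming=True` option of `load_dataset`. The helper names `write_pretraining_jsonl` and `stream_stack_dedup_sql` are hypothetical and not part of this kit.

```python
import json
import os


def write_pretraining_jsonl(samples, output_path):
    """Write each sample as a {"prompt": "", "completion": <content>} line.

    `samples` is any iterable of dicts with a "content" key, so the whole
    dataset never has to be held in memory at once. Returns the line count.
    """
    os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
    count = 0
    with open(output_path, "w", encoding="utf-8") as f:
        for sample in samples:
            record = {"prompt": "", "completion": sample["content"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def stream_stack_dedup_sql():
    """Assumed usage: iterate over the SQL subset without downloading it up front."""
    from datasets import load_dataset  # lazy import; needs `datasets` and dataset access

    return load_dataset(
        "bigcode/the-stack-dedup", data_dir="data/sql", split="train", streaming=True
    )


# Example call (needs network access and an approved access request):
# write_pretraining_jsonl(stream_stack_dedup_sql(), "pretrain-the-stack-dedup.jsonl")
```

The writer is kept separate from the loader so it can be tried on a handful of in-memory samples before starting the full download.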

## Fine-tuning dataset


### NSText2SQL dataset for SQL Fine-tuning
- [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL)
- Total samples: 289288
- Dataset size: 433MB
- Each line of the jsonl file contains one ```prompt``` and ```completion``` pair. For fine-tuning, both the ```prompt``` and ```completion``` fields are populated.
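
To make the record layout concrete, the sketch below builds one pretraining-style record (empty prompt) and one fine-tuning-style record, then round-trips them through JSON Lines using only the standard library. The sample question and SQL are invented for illustration. The field names `prompt` and `completion` follow the columns written by the cells in this notebook.

```python
import json

# Hypothetical sample records; the text is made up for illustration only.
pretraining_record = {"prompt": "", "completion": "SELECT id, name FROM users;"}
fine_tuning_record = {
    "prompt": "List the id and name of every user.",
    "completion": "SELECT id, name FROM users;",
}


def to_jsonl(records):
    """Serialize an iterable of dicts as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


jsonl_text = to_jsonl([pretraining_record, fine_tuning_record])
records = from_jsonl(jsonl_text)
```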

In [None]:
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

In [None]:
from datasets import load_dataset

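# Keep only the instruction/output pair, renamed to prompt/completion
fine_tune_dataset = load_dataset("NumbersStation/NSText2SQL", split="train")
fine_tune_dataset = fine_tune_dataset.rename_columns({'instruction': 'prompt', 'output': 'completion'}).select_columns(['prompt', 'completion'])
output_path = os.path.join(kit_dir, "data", "fine-tuning", "fine-tune-nstext2sql.jsonl")
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # create the output folder if needed
fine_tune_dataset.to_json(output_path)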
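
Before moving on to training, it can be worth checking that a written file has the expected shape. The following is a minimal sketch of such a check. The helper name `summarize_jsonl` is hypothetical, and the required keys are an assumption based on the columns written above.

```python
import json


def summarize_jsonl(path, required_keys=("prompt", "completion")):
    """Count records and empty prompts, raising if a record lacks a required key."""
    total = 0
    empty_prompts = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"line {line_number} is missing keys: {missing}")
            total += 1
            if record.get("prompt") == "":
                empty_prompts += 1
    return {"records": total, "empty_prompts": empty_prompts}
```

For a pretraining file, `empty_prompts` should equal `records`; for the fine-tuning file, a large number of empty prompts would suggest that something went wrong during the column renaming.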

## Summary

After this stage, the [data folder](../data/) should contain at least two jsonl files: one for pretraining (under ```pre-training/```) and one for fine-tuning (under ```fine-tuning/```).
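
A quick way to confirm which outputs are present is sketched below. The relative paths are taken from the `to_json` calls in this notebook. The helper name `find_outputs` is hypothetical.

```python
import os

# Relative output paths, as written by the to_json calls in this notebook.
EXPECTED_OUTPUTS = {
    "pretraining (stack-smol)": os.path.join("pre-training", "pretrain-squad-smol-sql.jsonl"),
    "pretraining (stack-dedup)": os.path.join("pre-training", "pretrain-the-stack-dedup.jsonl"),
    "fine-tuning (NSText2SQL)": os.path.join("fine-tuning", "fine-tune-nstext2sql.jsonl"),
}


def find_outputs(data_dir):
    """Map each expected file to its size in bytes, or None if it is missing."""
    sizes = {}
    for label, relative_path in EXPECTED_OUTPUTS.items():
        path = os.path.join(data_dir, relative_path)
        sizes[label] = os.path.getsize(path) if os.path.isfile(path) else None
    return sizes
```

In this notebook it could be called as `find_outputs(os.path.join(kit_dir, "data"))`. Only one of the two pretraining files is needed, so a single `None` among the pretraining entries is expected.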

## Other recipes

- [Anyscale](https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications): uses the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset from Hugging Face, which combines the [WikiSQL](https://huggingface.co/datasets/wikisql) and [Spider](https://huggingface.co/datasets/spider) datasets.