In [3]:
!huggingface-cli login --token <token>

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [8]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011

Downloading readme: 100%|██████████| 8.20k/8.20k [00:00<00:00, 8.58MB/s]


Downloading and preparing dataset json/databricks--databricks-dolly-15k to /root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...



Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.
dataset size: 15011
{'instruction': 'What types of bikes are there?', 'context': '', 'response': 'There are gravel bikes, road bikes, mountain bikes, BMX bikes, recumbent bikes, unicycles, hybrid bikes, electric bikes, cruiser bikes, trail bikes, CX bikes, enduro bikes, touring bikes, fixed gear bikes, kids bikes, fat bikes, tandem bikes, folding bikes, trikes and low rider bikes.', 'category': 'classification'}




In [9]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

In [10]:
from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))

### Instruction
Make me a list of the oldest board games that I might not know about, and where they were invented.

### Answer
Here is a list of some of the oldest board games. Senet (Egypt), The Royal Game of Ur (Iraq), The Lewis Chessmen (Scotland), Mahjong (China), Game of Goose (Scotland) and Pachisi (India).
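
The random sample above has an empty `context`, so the `### Context` section was skipped. As a minimal sketch of both branches, the block below repeats `format_dolly` from the earlier cell and applies it to two hand-written records. The records are hypothetical and are not taken from the dataset.

```python
def format_dolly(sample):
    # Same logic as the cell above: the context section is optional.
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # Join only the parts that are present, separated by a blank line.
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)


# Hypothetical records, for illustration only.
with_context = {
    "instruction": "What color is the sky?",
    "context": "On a clear day the sky looks blue.",
    "response": "Blue.",
}
without_context = {"instruction": "Name a primary color.", "context": "", "response": "Red."}

print(format_dolly(with_context))
print("-" * 20)
print(format_dolly(without_context))
```

The first prompt has three sections and the second has two, which is the shape the model sees during fine-tuning.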


In [11]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token



In [12]:
from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample
# randint is inclusive on both ends, so the upper bound must be len(dataset) - 1
print(dataset[randint(0, len(dataset) - 1)]["text"])

# buffer holding the leftover tokens of one batch so they can be prepended to the next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in the batch
    # (0 when the batch is shorter than one chunk, so all tokens carry over)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

                                                                    

### Instruction
What are the general rules of the Baseball?

### Context
Baseball is played between two teams with nine players in the field from the team that is not batting at that point (the batting team would have one batter in play at "home plate" on the field). On a baseball field, the game is under the authority of several umpires. There are usually four umpires in major league games; up to six (and as few as one) may officiate depending on the league and the importance of the game. There are three bases. Numbered counterclockwise, first, second, and third bases are cushions (sometimes informally referred to as bags) shaped as 15 in (38 cm) squares which are raised a short distance above the ground; together with home plate, the fourth "base", they form a square with sides of 90 ft (27.4 m) called the diamond. Home plate is a pentagonal rubber slab 17 in (43.2 cm) wide. The playing field is divided into three main sections:

The infield, containing the four bases, is for general

                                                                    

Total number of samples: 1581
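
To make the packing step easier to follow, here is a minimal pure-Python sketch of the same idea on toy token lists. It is a simplified illustration and not the notebook's `chunk` function: `pack_batches` is a hypothetical helper that keeps the remainder in a local variable instead of a global and handles a single list of token ids rather than a dict of columns.

```python
from itertools import chain


def pack_batches(batches, chunk_length):
    """Concatenate token lists across batches and cut fixed-size chunks."""
    remainder = []  # tokens left over from the previous batch
    chunks = []
    for batch in batches:
        # Flatten the batch and prepend what the previous batch left behind.
        tokens = remainder + list(chain(*batch))
        # Largest multiple of chunk_length that fits (0 if the batch is too short).
        usable = (len(tokens) // chunk_length) * chunk_length
        chunks.extend(tokens[i : i + chunk_length] for i in range(0, usable, chunk_length))
        # Carry the tail over to the next batch.
        remainder = tokens[usable:]
    return chunks, remainder


batches = [
    [[1, 2, 3], [4, 5]],       # 5 tokens
    [[6], [7, 8, 9, 10, 11]],  # 6 tokens
]
chunks, leftover = pack_batches(batches, chunk_length=4)
print(chunks)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(leftover)  # [9, 10, 11]
```

Every emitted chunk has exactly `chunk_length` tokens, and sample boundaries are ignored. Tokens still sitting in the remainder after the last batch are never emitted. Applied to the run above, 1581 chunks of 2048 tokens amount to 1581 × 2048 = 3,237,888 training tokens.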




In [13]:
lm_dataset.save_to_disk("/project/data")

                                                                                              

In [None]:
!python im-a-llama.py 

Logging into the Hugging Face Hub with token hf_oaWwQlP...
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Downloading (…)lve/main/config.json: 100%|█████| 610/610 [00:00<00:00, 8.00MB/s]
Downloading (…)fetensors.index.json: 100%|██| 33.4k/33.4k [00:00<00:00, 249MB/s]
Downloading shards:   0%|                                 | 0/3 [00:00<?, ?it/s]
Downloading (…)of-00003.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s][A
Downloading (…)of-00003.safetensors:   0%|  | 31.5M/9.95G [00:00<00:36, 275MB/s][A
Downloading (…)of-00003.safetensors:   1%|  | 73.4M/9.95G [00:00<00:29, 335MB/s][A
Downloading (…)of-00003.safetensors:   1%|   | 115M/9.95G [00:00<00:27, 354MB/s][A
Downloading (…)of-00003.safetensors:   2%|   | 157M/9.95G [00:00<00:27, 353MB/s][A
Downloading (…)of-00003.safetenso