In [2]:
!pip install "transformers==4.31.0" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.3.2 which is incompatible.
spyder 5.3.3 requires ipython<8.0.0,>=7.31.1, but you have ipython 8.14.0 which is incompatible.
spyder 5.3.3 requires pylint<3.0,>=2.5.0, but you have pylint 3.0.0a6 which is incompatible.

In [3]:
!huggingface-cli login --token <token>

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
import sagemaker
import boto3
sess = sagemaker.Session()
# SageMaker session bucket: used for uploading data, models, and logs.
# SageMaker creates this bucket automatically if it does not exist.
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")





sagemaker role arn: arn:aws:iam::236478807796:role/service-role/AmazonSageMaker-ExecutionRole-20230710T121858
sagemaker bucket: sagemaker-us-east-2-236478807796
sagemaker session region: us-east-2


In [5]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011



Found cached dataset json (/root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)


dataset size: 15011
{'instruction': 'What are the key points that can be extracted regarding antisemitism from the below text?', 'context': 'In 1998, Ignatz Bubis, a leader of the German Jewish community, pointed to a "spreading intellectual nationalism" that made him fear a revival of German antisemitism. Others point to Germany\'s growing Muslim population, both the Turkish "guest workers" who began to arrive in the 1950s, and the large wave of migrants from the Muslim countries who arrive during the European migrant crisis that began in 2015. In 2002, the historian Julius Schoeps said that "resolutions by the German parliament to reject antisemitism are drivel of the worst kind" and "all those ineffective actions are presented to the world as a strong defense against the charge of antisemitism. The truth is: no one is really interested in these matters. No one really cares."', 'response': '1. A prominent member of the German Jewish community, Ignatz Bubis, expressed his concern abou

In [6]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt



In [7]:
from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))


### Instruction
How do you get a six pack?

### Answer
You can get a six pack through regular exercise to keep your weight in control first.  If you want a six pack a daily regime of sit ups will be necessary to keep your six pack strong and consistency trained.  Diet is the most important thing if you want a six pack because it will keep your mid-section lean to show the six pack definition.  It has also been shown that without diet your six pack will never be as cut as it could be had you dieted.
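
The sample printed above has no `### Context` section because its `context` field is empty. A minimal, self-contained sketch of that behaviour follows; the two records are made up for illustration and are not taken from the dataset.

```python
def format_dolly(sample):
    # Same logic as the notebook's format_dolly, repeated so this block runs on its own.
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # Sections that are None (here: an empty context) are dropped before joining.
    return "\n\n".join([part for part in [instruction, context, response] if part is not None])


# Hypothetical records, for illustration only.
with_context = {
    "instruction": "Summarize the text.",
    "context": "Noble gases are inert.",
    "response": "They rarely react.",
}
without_context = {
    "instruction": "Name a noble gas.",
    "context": "",
    "response": "Helium.",
}

prompt_with = format_dolly(with_context)
prompt_without = format_dolly(without_context)

print(prompt_with)
print("---")
print(prompt_without)
```

Only the first prompt contains a `### Context` block; the second goes straight from the instruction to the answer.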


In [8]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id,use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [9]:
from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample (randint is inclusive at both ends, hence the - 1)
print(dataset[randint(0, len(dataset) - 1)]["text"])

# per-key buffers holding leftover tokens from one batch for use in the next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in this batch
    # (0 when the batch is shorter than chunk_length, so no chunk is emitted yet)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")


Map:   0%|          | 0/15011 [00:00<?, ? examples/s]

### Instruction
How are noble gases obtained?

### Context
The noble gases (historically also the inert gases; sometimes referred to as aerogens) make up a class of chemical elements with similar properties; under standard conditions, they are all odorless, colorless, monatomic gases with very low chemical reactivity. The six naturally occurring noble gases are helium (He), neon (Ne), argon (Ar), krypton (Kr), xenon (Xe), and the radioactive radon (Rn).
Oganesson (Og) is a synthetically produced highly radioactive element. Although IUPAC has used the term "noble gas" interchangeably with "group 18" and thus included oganesson, it may not be significantly chemically noble and is predicted to break the trend and be reactive due to relativistic effects. Because of the extremely short 0.7 ms half-life of its only known isotope, its chemistry has not yet been investigated.
For the first six periods of the periodic table, the noble gases are exactly the members of group 18. Noble gases are t

Map:   0%|          | 0/15011 [00:00<?, ? examples/s]

Map:   0%|          | 0/15011 [00:00<?, ? examples/s]

Total number of samples: 1581
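
To make the packing step easier to follow, here is a toy sketch of the same idea on plain integer lists with a chunk length of 4. It is an illustration, not the notebook's exact function: `make_chunker` is a hypothetical helper that keeps the remainder in a closure instead of a global variable, and the token IDs are invented.

```python
from itertools import chain


def make_chunker(chunk_length):
    # Leftover tokens carried from one batch to the next (hypothetical helper state).
    remainder = {"input_ids": [], "attention_mask": []}

    def chunk(batch):
        # Flatten every column and prepend what was left over from the previous batch.
        concatenated = {k: remainder[k] + list(chain(*batch[k])) for k in batch}
        total = len(concatenated["input_ids"])
        # Largest multiple of chunk_length that fits; 0 if the batch is too short.
        usable = (total // chunk_length) * chunk_length
        result = {
            k: [t[i : i + chunk_length] for i in range(0, usable, chunk_length)]
            for k, t in concatenated.items()
        }
        # Keep the tail for the next call.
        for k in concatenated:
            remainder[k] = concatenated[k][usable:]
        # For causal language modelling the labels are a copy of the inputs.
        result["labels"] = result["input_ids"].copy()
        return result

    return chunk, remainder


chunk, remainder = make_chunker(chunk_length=4)

# Three invented batches of tokenized samples.
out1 = chunk({"input_ids": [[1, 2, 3], [4, 5, 6]], "attention_mask": [[1, 1, 1], [1, 1, 1]]})
out2 = chunk({"input_ids": [[7], [8, 9]], "attention_mask": [[1], [1, 1]]})
out3 = chunk({"input_ids": [[10]], "attention_mask": [[1]]})

print(out1["input_ids"])       # chunks produced by the first batch
print(out2["input_ids"])       # chunks produced by the second batch
print(out3["input_ids"])       # third batch is too short to fill a chunk
print(remainder["input_ids"])  # tokens still waiting for a later batch
```

Tokens left in the remainder after the final batch are never emitted. With `chunk_length=2048`, the 1581 samples reported above therefore correspond to 1581 × 2048 = 3,237,888 tokens in full chunks, plus a short tail that is dropped.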


In [10]:
# save the tokenized and chunked training dataset (lm_dataset) to S3
training_input_path = f's3://{sess.default_bucket()}/processed/llama/dolly/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")


Saving the dataset (0/1 shards):   0%|          | 0/1581 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://sagemaker-us-east-2-236478807796/processed/llama/dolly/train
