# Train a model

First, we need to save our W&B credentials in the folder that will be copied to the SageMaker (SM) training instance.

In [2]:
!pip install -Uqqq sagemaker wandb huggingface_hub

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.3.3 which is incompatible.

Save your wandb API token to the `scripts` folder:

In [1]:
import wandb
wandb.sagemaker_auth(path="scripts")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


## SageMaker auth

In [2]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

DATASET_S3 = f's3://{sess.default_bucket()}/processed/wandbot/train'

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::372108735839:role/SageMakerExecutionRole
sagemaker bucket: sagemaker-us-east-1-372108735839
sagemaker session region: us-east-1


## Loading the data from W&B

In [4]:
# the dataset we pre-tokenized and created before
AT_ADDRESS = "capecape/aws_llm_workshop/wandbot_dataset_tokenized:v0"

Inside the `run_clm.py` script, we load the dataset from a W&B Artifact:

```python
import wandb
from datasets import load_from_disk

def load_from_artifact(at_address, at_type="dataset"):
    "Load a HF dataset from a W&B artifact"
    artifact = wandb.use_artifact(at_address, type=at_type)
    artifact_dir = artifact.download()
    return load_from_disk(artifact_dir)

...
    ds = load_from_artifact(AT_ADDRESS)

```
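The address stored in `AT_ADDRESS` follows the `entity/project/name:alias` pattern. As a rough illustration of how such a string breaks down, here is a minimal sketch; `ArtifactAddress` and `parse_artifact_address` are hypothetical helpers written for this example and are not part of wandb or of `run_clm.py`.

```python
from typing import NamedTuple, Optional


class ArtifactAddress(NamedTuple):
    """Parts of a W&B artifact address (hypothetical container for this sketch)."""
    entity: Optional[str]
    project: Optional[str]
    name: str
    alias: Optional[str]


def parse_artifact_address(address: str) -> ArtifactAddress:
    """Split '[entity/][project/]name[:alias]' into its parts."""
    # Everything after the first ':' is the alias or version, e.g. 'v0'
    path, _, alias = address.partition(":")
    parts = path.split("/")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"not a valid artifact address: {address!r}")
    name = parts[-1]
    project = parts[-2] if len(parts) >= 2 else None
    entity = parts[-3] if len(parts) == 3 else None
    return ArtifactAddress(entity, project, name, alias or None)


addr = parse_artifact_address("capecape/aws_llm_workshop/wandbot_dataset_tokenized:v0")
print(addr)
```

Here `capecape` is the entity, `aws_llm_workshop` the project, and `v0` the artifact version being pinned.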

## Train time! 🚂

In [5]:
WANDB_PROJECT = "aws_llm_workshop"

In [6]:
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name 
MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"
job_name = 'wandb-qlora-codellama7'

lr = 2e-4

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': MODEL_NAME,                           # pre-trained model
    # 'dataset_artifact': AT_ADDRESS,                   # Artifact containing the dataset at W&B
    # 'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
    'dataset_path': AT_ADDRESS,
    'epochs': 1,                                      # number of training epochs
    'per_device_train_batch_size': 2,                 # batch size for training
    'lr': lr,                                         # learning rate used during training
    'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 2
    'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
    'report_to': "wandb",                             # report to wandb
    'wandb_project': WANDB_PROJECT,
    "run_name":  f"{MODEL_NAME}__qlora",
}
    
# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache"}, # set env variable to cache models in /tmp
    disable_output_compression = True         # do not compress the output, to save training time and cost
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
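SageMaker hands the `hyperparameters` dictionary to the entry point as command-line arguments (`--key value`), so the training script has to parse them itself. The real `run_clm.py` is not shown in this notebook, so the parser below is only a sketch of what that side might look like: the function names `str2bool` and `parse_args` and all the defaults are assumptions, while the argument names mirror the dictionary above.

```python
import argparse


def str2bool(value: str) -> bool:
    """Interpret 'True'/'False' style strings, since CLI values arrive as text."""
    if value.lower() in {"true", "1", "yes"}:
        return True
    if value.lower() in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None) -> argparse.Namespace:
    """Hypothetical argument parser mirroring the hyperparameters dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--dataset_path", type=str, required=True)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--hf_token", type=str, default=None)
    parser.add_argument("--merge_weights", type=str2bool, default=False)
    parser.add_argument("--report_to", type=str, default="none")
    parser.add_argument("--wandb_project", type=str, default=None)
    parser.add_argument("--run_name", type=str, default=None)
    # parse_known_args ignores any extra flags this sketch does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments the training job would receive
args = parse_args([
    "--model_id", "codellama/CodeLlama-7b-Instruct-hf",
    "--dataset_path", "capecape/aws_llm_workshop/wandbot_dataset_tokenized:v0",
    "--epochs", "1",
    "--lr", "0.0002",
    "--merge_weights", "True",
    "--report_to", "wandb",
])
print(args.model_id, args.lr, args.merge_weights)
```

The explicit `str2bool` converter matters because `type=bool` would treat any non-empty string, including `"False"`, as `True`.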


In [None]:
# define a data input dictionary with our uploaded S3 URIs
data = {'training': DATASET_S3}

# data = {}
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: wandb-qlora-codellama7-2023-10-26-00-01-56-374


Using provided s3_resource
2023-10-26 00:01:56 Starting - Starting the training job...
2023-10-26 00:02:11 Starting - Preparing the instances for training......
2023-10-26 00:03:21 Downloading - Downloading input data...
2023-10-26 00:03:46 Training - Downloading the training image.......................................
2023-10-26 00:10:08 Training - Training image download completed. Training in progress......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-10-26 00:11:09,037 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-10-26 00:11:09,051 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-10-26 00:11:09,060 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-10-26 00:11:09,061 sagemaker_pytorch_container.training INFO     Invoking user

## Link Model Hack

I have to copy the W&B run ID and the S3 URI of the trained model manually here; there is probably a better way...

In [10]:
run_id = "wandb-qlora-codellama7-2023-10-25-13-33-31-404-7cd3bc-algo-1"

In [11]:
S3_URI = "s3://sagemaker-us-east-1-372108735839/wandb-qlora-codellama7-2023-10-25-13-33-31-404/output/model/"
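One possible way to avoid hard-coding the URI is to build it from the bucket and the training job name. This is only a sketch: it assumes the estimator's default output location, `s3://<bucket>/<job name>/output/model/`, which matches the URI above, and `model_output_uri` is a hypothetical helper. In the notebook, the inputs could presumably come from `sess.default_bucket()` and `huggingface_estimator.latest_training_job.name`, but that is an assumption and has not been verified here.

```python
def model_output_uri(bucket: str, job_name: str) -> str:
    """Build the S3 prefix of the uncompressed model for a training job.

    Assumes the default output layout: s3://<bucket>/<job_name>/output/model/
    """
    return f"s3://{bucket}/{job_name}/output/model/"


# Values copied from the cells above
uri = model_output_uri(
    bucket="sagemaker-us-east-1-372108735839",
    job_name="wandb-qlora-codellama7-2023-10-25-13-33-31-404",
)
print(uri)
```

The W&B `run_id` still has to be looked up separately, since it carries a suffix that is not part of the training job name.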

I am passing the `run_id` so that the artifact gets logged to the same W&B run as the training job.

In [12]:
with wandb.init(project=WANDB_PROJECT, id=run_id):
    at = wandb.Artifact(name="codellama7_wandb", 
                        type="model",
                        description="A CodeLlama7B expert on W&B")
    at.add_reference(S3_URI)
    wandb.log_artifact(at)

[34m[1mwandb[0m: Generating checksum for up to 10000 objects in "sagemaker-us-east-1-372108735839/wandb-qlora-codellama7-2023-10-25-13-33-31-404/output/model/"... Done. 0.4s



