In [1]:
!pip install sagemaker --quiet --upgrade --force-reinstall
!pip install ipywidgets==7.0.0 --quiet
!pip install langchain --quiet --upgrade

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dash 2.15.0 requires dash-core-components==2.0.0, which is not installed.
dash 2.15.0 requires dash-html-components==2.0.0, which is not installed.
dash 2.15.0 requires dash-table==5.0.0, which is not installed.
jupyter-ai 2.5.0 requires faiss-cpu, which is not installed.
aiobotocore 2.7.0 requires botocore<1.31.65,>=1.31.16, but you have botocore 1.34.57 which is incompatible.
amazon-sagemaker-jupyter-scheduler 3.0.6 requires jupyter-scheduler==2.4, but you have jupyter-scheduler 2.3.0 which is incompatible.
autogluon-common 0.8.2 requires pandas<2.2.0,>=2.0.0, but you have pandas 2.2.1 which is incompatible.
autogluon-core 0.8.2 requires pandas<2.2.0,>=2.0.0, but you have pandas 2.2.1 which is incompatible.
autogluon-features 0.8.2 requires pandas<2.2.0,>=2.0.0, but you have pandas 2.2.1 which is incompatib

## 01. Set-up

In [2]:
import sagemaker
from sagemaker.predictor_async import AsyncPredictor
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [3]:
session = sagemaker.session.Session()
bucket = session.default_bucket()  # reuse the session instead of creating a second one
prefix = "async-sagemaker-tests/inputs"
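The bucket and prefix defined above are combined later into one S3 URI per request. A minimal sketch of that construction follows; the helper name `build_input_path` and the bucket name `example-bucket` are hypothetical and not part of the SageMaker SDK.

```python
import uuid


def build_input_path(bucket: str, prefix: str) -> str:
    """Return a unique S3 URI for one asynchronous request payload.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    # Strip stray slashes so the URI never contains an empty path segment.
    clean_prefix = prefix.strip("/")
    # uuid4 gives every request its own object key, so uploads do not collide.
    return f"s3://{bucket}/{clean_prefix}/payload-{uuid.uuid4()}"


# "example-bucket" is a made-up name; in the notebook, pass `bucket` and `prefix`.
example_path = build_input_path("example-bucket", "async-sagemaker-tests/inputs")
print(example_path)
```

Generating a fresh key per request keeps concurrent requests from overwriting each other's input objects.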

## 02. Deploy Llama-2 model from JumpStart using Asynchronous Inference

To deploy the model, you must accept the model's end-user license agreement (EULA) by passing `accept_eula=True` to `deploy`.

In [9]:
%%time
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.jumpstart.model import JumpStartModel

model_id, model_version = "meta-textgeneration-llama-2-70b", "*"
my_model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = my_model.deploy(
    initial_instance_count=1,  # at least one instance is needed to create the endpoint
    instance_type="ml.g5.48xlarge",
    async_inference_config=AsyncInferenceConfig(),  # enables asynchronous inference
    accept_eula=True,  # accept the model's EULA, as required for deployment
)

-----------!
CPU times: user 220 ms, sys: 18.8 ms, total: 239 ms
Wall time: 6min 4s
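The deployment above uses `AsyncInferenceConfig()` with its defaults. As a hedged sketch, the configuration can also be given an explicit output location and a concurrency limit. The helper `async_config_kwargs`, the `outputs` sub-prefix, the bucket name, and the default of 4 concurrent invocations are all illustrative assumptions; only the keyword names `output_path` and `max_concurrent_invocations_per_instance` come from the SDK's `AsyncInferenceConfig`.

```python
def async_config_kwargs(bucket: str, base_prefix: str, max_concurrency: int = 4) -> dict:
    """Build keyword arguments for sagemaker.async_inference.AsyncInferenceConfig.

    Hypothetical helper: the "outputs" sub-prefix and the default
    concurrency of 4 are illustrative choices, not SDK defaults.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    base = f"s3://{bucket}/{base_prefix.strip('/')}"
    return {
        # Where the endpoint writes each inference result.
        "output_path": f"{base}/outputs",
        # Upper bound on requests one instance processes at the same time.
        "max_concurrent_invocations_per_instance": max_concurrency,
    }


# "example-bucket" is a made-up name; in the notebook, pass the real bucket.
kwargs = async_config_kwargs("example-bucket", "async-sagemaker-tests")
print(kwargs)
# With the sagemaker package installed, this could then be used as:
#   AsyncInferenceConfig(**kwargs)
```

Keeping outputs under a known prefix makes it easier to find results later than relying on a generated default location.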


## 03. Start making predictions

In [10]:
endpoint_name = "<ENTER_YOUR_ENDPOINT_NAME>"  # e.g. the value of predictor.endpoint_name from the deploy step

In [11]:
predictor = AsyncPredictor(
    Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=session,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )
)

In [12]:
import json
payload = {
    "inputs": "Write a program to compute factorial in python:", 
    "parameters": {
        "max_new_tokens": 400
    }
}
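The payload above has two parts: the prompt under `inputs` and generation settings under `parameters`. A small sketch of a builder for that shape follows. The function `build_payload` is hypothetical, and any parameter name other than `max_new_tokens` (such as `temperature` in the usage line) is an assumption about what the model container accepts.

```python
import json


def build_payload(prompt: str, max_new_tokens: int = 400, **extra_parameters) -> dict:
    """Return a request body shaped like the payload used in this notebook.

    Hypothetical helper; extra parameter names are passed through unchecked.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    # Start from the one setting the notebook uses, then add any extras.
    parameters = {"max_new_tokens": max_new_tokens, **extra_parameters}
    return {"inputs": prompt, "parameters": parameters}


example_payload = build_payload(
    "Write a program to compute factorial in python:",
    temperature=0.6,  # assumed parameter name, shown only for illustration
)
print(json.dumps(example_payload, indent=2))
```

Validating the prompt and token limit locally avoids spending an asynchronous round trip on a request that cannot succeed.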

In [13]:
import uuid

response = predictor.predict(
    data=payload,
    input_path=f"s3://{bucket}/{prefix}/payload-{uuid.uuid4()}",
)
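The `predict` call above blocks until the result is available. The SDK's `AsyncPredictor` also provides `predict_async`, which returns immediately so the result can be collected later. The waiting pattern behind that can be sketched generically as below; `poll_until_ready` and its `fetch` argument are hypothetical stand-ins, not SDK functions, and the default attempt count and delay are arbitrary choices.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until_ready(
    fetch: Callable[[], Optional[T]],
    max_attempts: int = 60,
    delay_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` until it returns something other than None.

    `fetch` is a hypothetical stand-in for "check whether the output
    object exists yet and, if so, load it".
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        # Do not sleep after the final attempt; fail straight away instead.
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise TimeoutError(f"no result after {max_attempts} attempts")


# Demo with a fake fetch that becomes ready on the third call.
_fake_results = iter([None, None, {"generated_text": "..."}])
demo = poll_until_ready(lambda: next(_fake_results), max_attempts=5, delay_seconds=0.0)
print(demo)
```

Passing `sleep` as an argument keeps the helper easy to exercise without real waiting; a large model may need many attempts before its output appears.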

In [14]:
print(f"\033[1m Output:\033[0m {response[0]['generated_text']}")


Output:
Factorial is a product of all positive integers less than or equal to a given number.
For example, the factorial of 5 is 5*4*3*2*1 = 120.
The factorial of 0 is 1.
The factorial of a negative number is undefined.
The factorial of a non-integer is also undefined.
The factorial of a number can be computed using the following formula:
n! = n * (n-1) * (n-2) * ... * 1
where n is the number whose factorial is to be computed.
For example, the factorial of 5 can be computed as follows:
5! = 5 * 4 * 3 * 2 * 1 = 120
The factorial of a number can also be computed using the following recursive formula:
n! = n * (n-1)!
where n is the number whose factorial is to be computed and (n-1)! is the factorial of (n-1).
For example, the factorial of 5 can be computed recursively as follows:
5! = 5 * 4!
4! = 4 * 3!
3! = 3 * 2! = 3 * 2 * 1!
2! = 2 * 1! = 2 * 1 * 0!
0! = 1
Therefore, 5! = 5 * 4 * 3 * 2 * 1 = 120.
The factorial of a number can also be computed using the following iterative for