In [8]:
from sagemaker.jumpstart.model import JumpStartModel


model_id = "huggingface-llm-falcon-40b-instruct-bf16"
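model = JumpStartModel(
    model_id=model_id,
    env={
        "SM_NUM_GPUS": "8",  # shard the model across all 8 GPUs of ml.g5.48xlarge
        "MAX_INPUT_LENGTH": "2048",  # maximum prompt length, in tokens
        "MAX_TOTAL_TOKENS": "4096",  # maximum prompt + generated tokens
    },
)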
print(model.env)
predictor = model.deploy(instance_type="ml.g5.48xlarge")

{'SM_NUM_GPUS': '8', 'MAX_INPUT_LENGTH': '2048', 'MAX_TOTAL_TOKENS': '4096', 'SAGEMAKER_PROGRAM': 'inference.py', 'ENDPOINT_SERVER_TIMEOUT': '3600', 'MODEL_CACHE_ROOT': '/opt/ml/model', 'SAGEMAKER_ENV': '1', 'HF_MODEL_ID': '/opt/ml/model', 'SAGEMAKER_MODEL_SERVER_WORKERS': '1'}
--------------------------!

In [9]:
inputs = (
    "United Arab Emirate’s (UAE) Technology Innovation Institute (TII), the applied " \
    "research pillar of Abu Dhabi’s " \
    "Advanced Technology Research Council, has launched Falcon LLM, a foundational large language model" \
    " (LLM) with 40 billion parameters. TII is a leading global research center dedicated to pushing the" \
    " frontiers of knowledge. TII’s team of scientists, researchers, and engineers work to deliver" \
    " discovery science and transformative technologies. TII’s work focuses on breakthroughs that " \
    "will future-proof our society. Trained on 1 trillion tokens, TII Falcon LLM boasts top-notch" \
    " performance while remaining incredibly cost-effective. Falcon-40B matches the performance of other " \
    "high-performing LLMs, and is the top-ranked open-source model in the public Hugging Face Open " \
    "LLM leaderboard. It’s available as open-source in two different sizes – Falcon-40B and Falcon-7B and" \
    " was built from scratch using data preprocessing and model training jobs built on Amazon SageMaker. " \
    "Open-sourcing Falcon 40B enables users to construct and customize AI tools that cater to unique" \
    " users needs, facilitating seamless integration and ensuring the long-term preservation of data " \
    "assets. The model weights are available to download, inspect and deploy anywhere. Starting June" \
    " 7th, both Falcon LLMs will also be available in Amazon SageMaker JumpStart, SageMaker’s machine" \
    " learning (ML) hub that offers pre-trained models, built-in algorithms, and pre-built solution " \
    "templates to help you quickly get started with ML. You can deploy and use the Falcon LLMs with a " \
    "few clicks in SageMaker Studio or programmatically through the SageMaker Python SDK. To deploy and " \
    "run inference against Falcon LLMs, refer to the Introduction to SageMaker JumpStart – Text" \
    " Generation with Falcon LLMs example notebook. Dr. Ebtesam Almazrouei, Executive Director–Acting" \
    " Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII," \
    " shares: “We proudly announce the official open-source release of Falcon-40B, the world’s" \
    " top-ranking open-source language model. Falcon-40B is an exceptional open-source model with 40B" \
    " parameters, specifically designed as a causal decoder-only model. It was trained on a vast dataset" \
    " of 1,000B tokens, including RefinedWeb enhanced with curated corpora. The model is made available" \
    " under the Apache 2.0 license, ensuring its accessibility and usability. Falcon-40B has surpassed" \
    " renowned models like LLaMA-65B, StableLM and MPT on the public leaderboard maintained by Hugging" \
    " Face. The architecture of Falcon-40B is optimized for inference, incorporating FlashAttention and" \
    " multiquery techniques.” “This step reflects our dedication to pushing the boundaries of AI " \
    "innovation and technology readiness level for community engagement, education, real-world" \
    " applications, and collaboration. Continues Dr Ebtesam. “By releasing Falcon-40B as an open-source" \
    " model, we provide researchers, entrepreneurs, and organizations with the opportunity to harness" \
    " its exceptional capabilities and drive advancements in AI-driven solutions from healthcare to space," \
    " finance, manufacturing to biotech; the possibilities for AI-driven solutions are boundless." \
    " To access Falcon-40B and explore its remarkable potential, please visit FalconLLM.tii.ae. Join us in" \
    " leveraging the power of Falcon-40B to shape the future of AI and revolutionize industries” In this" \
    " post, we dive deep with Dr. Almazrouei about Falcon LLM training on SageMaker, data curation, " \
    "optimization, performance, and next steps. A new generation of LLMs LLMs are software algorithms" \
    " trained to complete natural text sequences. Due to their size and the volume of training data" \
    " they interact with, LLMs have impressive text processing abilities, including summarization," \
    " question answering, in-context learning, and more. In early 2020, research organizations across" \
    " the world set the emphasis on model size, observing that accuracy correlated with number of" \
    " parameters. For example, GPT-3 (2020) and BLOOM (2022) feature around 175 billion parameters," \
    " Gopher (2021) has 230 billion parameters, and MT-NLG (2021) 530 billion parameters. In 2022," \
    " Hoffman et al. observed that the current balance of compute between model parameters and dataset" \
    " size was suboptimal, and published empirical scaling laws suggesting that balancing the compute" \
    " budget towards smaller models trained on more data could lead to better performing models. They" \
    " implemented their guidance in the 70B parameter Chinchilla (2022) model, that outperformed much" \
    " bigger models. Summarize the article above."
)
payload = {
    "inputs": inputs,
    "parameters": {
        "max_new_tokens": 50,
    }
}
payloads = {
    "stress_test": payload,
}
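
The endpoint above was deployed with `MAX_INPUT_LENGTH=2048` and `MAX_TOTAL_TOKENS=4096`. A minimal sketch of a pre-flight check follows, assuming the serving container enforces these limits the way Hugging Face Text Generation Inference does (prompt tokens at most `MAX_INPUT_LENGTH`, prompt plus generated tokens at most `MAX_TOTAL_TOKENS`). The helper name `fits_token_budget` is hypothetical, and the prompt token count must come from the model's tokenizer; the value 1200 below is a placeholder, not a measurement of `inputs`.

```python
def fits_token_budget(num_input_tokens, max_new_tokens,
                      max_input_length=2048, max_total_tokens=4096):
    """Return True if a request stays within the limits set at deployment.

    Hypothetical helper: the defaults mirror the env values passed to
    JumpStartModel above; the enforcement rule itself is an assumption.
    """
    if num_input_tokens > max_input_length:
        return False  # prompt alone is too long
    return num_input_tokens + max_new_tokens <= max_total_tokens


# Placeholder token count; measure the real prompt with the model's tokenizer.
print(fits_token_budget(num_input_tokens=1200, max_new_tokens=50))
```

A check like this only covers the configured token limits. It says nothing about GPU memory under concurrent load, which is a separate constraint.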

In [10]:
%load_ext autoreload
%autoreload complete

In [26]:
from benchmarking.runner import Benchmarker


benchmarker = Benchmarker(payloads=payloads, max_concurrent_benchmarks=10, num_invocations=15, run_latency_load_test=False)
metrics = benchmarker.run_single_predictor(model_id=model_id, predictor=predictor, clean_up=False)
print(metrics)

(Model 'huggingface-llm-falcon-7b-instruct-bf16', Payload 'stress_test'): Begin throughput load test ...
(Model 'huggingface-llm-falcon-7b-instruct-bf16', Payload 'stress_test'): Finished benchmarking load tests ...
(Model 'huggingface-llm-falcon-7b-instruct-bf16'): Skipping cleaning up resources ...
[{'Throughput': 0.31310795780782597, 'OutputSequenceWords': {'Average': 744.0, 'Minimum': 744, 'Maximum': 744, 'p50': 744.0, 'p90': 744.0, 'p95': 744.0}, 'WordThroughput': 232.9523206090225, 'SampleOutput': ' In 2022, Hoffman et al. observed that the current balance of compute between model parameters and dataset size was suboptimal, and published empirical scaling laws suggesting that balancing the compute budget towards smaller models trained on more data could lead to better performing models. They implemented their guidance in the 70B parameter Chinchilla (2022) model, that outperformed much bigger models. (Source: Hoffman et al. 2022) In 2022, Hoffman et al. observed that the current 
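
In the recorded output, `WordThroughput` appears to equal `Throughput` multiplied by the average of `OutputSequenceWords`. This relation is inferred from the printed numbers, not from the `Benchmarker` source, so treat it as an observation. The sketch below checks it against the values printed above.

```python
import math

# Values copied from the benchmark output above.
throughput = 0.31310795780782597              # 'Throughput'
average_output_words = 744.0                  # 'OutputSequenceWords' -> 'Average'
reported_word_throughput = 232.9523206090225  # 'WordThroughput'

# Assumed relation: word throughput = request throughput * average words per response.
derived_word_throughput = throughput * average_output_words
matches = math.isclose(derived_word_throughput, reported_word_throughput, rel_tol=1e-9)
print(matches)
```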

In [12]:
from benchmarking.runner import Benchmarker

results = []
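max_new_tokens = 100  # held fixed while the number of invocations is swept
num_invocations = 10  # default; overwritten by the loop below
# Alternative sweep over output length instead of invocation count:
# for max_new_tokens in range(800, 1800, 100):
for num_invocations in range(20, 50):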
    payload = {
        "inputs": inputs,
        "parameters": {
            "max_new_tokens": max_new_tokens,
        }
    }
    payloads = {
        "stress_test": payload,
    }

    print(f"Running for max_new_tokens={max_new_tokens}, num_invocations={num_invocations}")
    benchmarker = Benchmarker(payloads=payloads, max_concurrent_benchmarks=10, num_invocations=num_invocations, run_latency_load_test=False)
    metrics = benchmarker.run_single_predictor(model_id=model_id, predictor=predictor, clean_up=False)
    metrics[0]['MaxNewTokens'] = max_new_tokens
    metrics[0]['NumInvocations'] = num_invocations
    results.extend(metrics)
    print(metrics)

Running for max_new_tokens=100
(Model 'huggingface-llm-falcon-40b-instruct-bf16', Payload 'stress_test'): Begin throughput load test ...
(Model 'huggingface-llm-falcon-40b-instruct-bf16', Payload 'stress_test'): Finished benchmarking load tests ...
(Model 'huggingface-llm-falcon-40b-instruct-bf16'): Skipping cleaning up resources ...
[{'Throughput': 0.6729229887080495, 'OutputSequenceWords': {'Average': 64.2, 'Minimum': 63, 'Maximum': 65, 'p50': 65.0, 'p90': 65.0, 'p95': 65.0}, 'WordThroughput': 43.20165587505678, 'SampleOutput': ' TII’s Falcon LLM is a 40B parameter model that follows the same philosophy. Falcon LLM is trained on 1 trillion tokens, and is the top-ranked open-source model in the public Hugging Face Open LLM leaderboard. Falcon LLM is available as open-source in two different sizes – Falcon-40B and Falcon-7B. Falcon-40B was built from scratch using data preprocessing and model training jobs built on Amazon SageMaker.', 'ModelID': 'huggingface-llm-falcon-40b-instruct-bf1

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (424) from primary with message "{"error":"Request failed during generation: Server error: CUDA out of memory. Tried to allocate 2.97 GiB (GPU 0; 22.20 GiB total capacity; 13.83 GiB already allocated; 2.17 GiB free; 18.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF","error_type":"generation"}". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/hf-llm-falcon-40b-instruct-bf16-2023-06-20-13-51-42-825 in account 802376408542 for more information.
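
The sweep above stops with an unhandled `ModelError` once the endpoint runs out of GPU memory. A minimal sketch of a sweep wrapper that records the failing setting and returns the results collected so far is shown below. `run_sweep` and `run_one` are hypothetical names. In this notebook `run_one` would wrap the `Benchmarker` call for a single `num_invocations` value; the stand-in function used here only simulates a failure.

```python
def run_sweep(run_one, values):
    """Call run_one(value) for each value, stopping at the first failure.

    Returns (results, failure), where failure is None or a
    (value, exception) tuple describing the setting that failed.
    """
    results = []
    failure = None
    for value in values:
        try:
            results.append(run_one(value))
        except Exception as exc:  # e.g. the ModelError raised on CUDA out of memory
            failure = (value, exc)
            break
    return results, failure


def fake_benchmark(num_invocations):
    """Stand-in for a real benchmark call; fails from 3 invocations upward."""
    if num_invocations >= 3:
        raise RuntimeError("simulated CUDA out of memory")
    return {"NumInvocations": num_invocations}


sweep_results, sweep_failure = run_sweep(fake_benchmark, range(1, 6))
print(sweep_results)
print(sweep_failure)
```

Returning the failing value alongside the partial results makes it easy to mark on a plot where the endpoint stopped responding.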

In [38]:

import plotly.express as px
import pandas as pd

df = pd.json_normalize(results)
# display(df)
fig = px.line(df, x="NumInvocations", y="WordThroughput", title=f"throughput vs. invocations for {model_id}")
fig.show()