## LLM Inference on a g6.4xlarge EC2 Instance
This EC2 instance has one NVIDIA L4 GPU. We will run inference with the Intel/neural-chat-7b-v3-3 model on the GPU.

## Install necessary packages

In [1]:
!pip install -U -q accelerate
!pip install transformers
!pip install huggingface_hub

Collecting transformers
  Downloading transformers-4.40.2-py3-none-any.whl (9.0 MB)
Collecting tokenizers<0.20,>=0.19
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Collecting regex!=2019.12.17
  Downloading regex-2024.4.28-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (774 kB)
Installing collected packages: regex, tokenizers, transformers
Successfully installed regex-2024.4.28 tokenizers-0.19.1 transformers-4.40.2
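
Before loading a 7B model, it can help to confirm that PyTorch actually sees the GPU. The following is a minimal sketch, not part of the original notebook: `gpu_summary` is a hypothetical helper name, and it assumes PyTorch is already present in the environment (the install cell above does not install it explicitly).

```python
import importlib.util


def gpu_summary() -> str:
    """Return a one-line description of the first CUDA device PyTorch can see."""
    # Guard the import so the check itself does not crash without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"

    import torch

    if not torch.cuda.is_available():
        return "No CUDA device visible to PyTorch"

    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    return f"{props.name}: {total_gib:.1f} GiB"


print(gpu_summary())
```

On this instance type the summary would be expected to name the L4, but the exact string depends on the installed driver and PyTorch build.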


## Log in to Hugging Face

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## Load the model and tokenizer from Hugging Face

In [3]:
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/neural-chat-7b-v3-3"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)



config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]



pytorch_model.bin.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/145 [00:00<?, ?B/s]
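
The two shard sizes in the download output give a rough idea of how much GPU memory the weights need. The sketch below is back-of-the-envelope arithmetic under two assumptions that are not stated in the notebook: the NVIDIA L4 provides 24 GB of memory, and the checkpoint is stored in a 16-bit format, so its size on disk roughly matches the `torch.bfloat16` footprint after loading.

```python
# Shard sizes as reported by the download progress output (in GB).
shard_sizes_gb = [9.94, 4.54]

# Assumption: one NVIDIA L4 provides 24 GB of GPU memory.
l4_memory_gb = 24.0

weights_gb = sum(shard_sizes_gb)
headroom_gb = l4_memory_gb - weights_gb

print(f"Checkpoint size: {weights_gb:.2f} GB")
print(f"Headroom on one L4: {headroom_gb:.2f} GB")
```

Whatever headroom remains has to cover the KV cache, activations, and CUDA overhead, so the real margin is smaller than this estimate suggests.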

## Run inference on the GPU

In [4]:
import time

text = "Write be a 150 word essay Why is health important to everyone?"
device = "cuda"

inputs = tokenizer(text, return_tensors="pt").to(device)

# Get start time
t1 = time.time()

outputs = model.generate(**inputs, max_new_tokens=300)

# Get end time
t2 = time.time()

# Get total time taken
t3 = t2 - t1

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

# Count tokens by re-tokenizing the decoded response.
# Note: the response still contains the prompt, so this count includes the prompt tokens.
tokens = tokenizer.tokenize(response)
num_tokens = len(tokens)
print("Number of tokens generated: ", num_tokens)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Write be a 150 word essay Why is health important to everyone?

Health is the most valuable asset that every individual possesses. It is the foundation upon which our lives are built, allowing us to pursue our dreams, goals, and aspirations. Without good health, our ability to function effectively in various aspects of life, such as work, relationships, and personal growth, is severely compromised.

Health encompasses not only physical well-being but also mental, emotional, and social wellness. It is a holistic concept that requires a balanced approach to maintain. A healthy lifestyle involves making conscious choices about diet, exercise, sleep, stress management, and overall self-care.

The importance of health cannot be overstated. It is the key to our happiness, productivity, and overall quality of life. Good health enables us to be more resilient in the face of challenges, both physical and emotional. It also helps us to build stronger relationships with others, as we are better e
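
The printed response begins with the prompt, so a token count obtained by re-tokenizing the response covers the prompt as well as the generated text. Here is a minimal sketch of counting only the new tokens. `count_new_tokens` is a hypothetical helper, and plain Python lists stand in for the tensors so the example runs without a model.

```python
def count_new_tokens(prompt_ids, output_ids):
    """Count the tokens appended after the prompt in one generated sequence."""
    # For decoder-only models, generate() returns the prompt ids
    # followed by the newly generated ids.
    if output_ids[: len(prompt_ids)] != prompt_ids:
        raise ValueError("output does not start with the prompt")
    return len(output_ids) - len(prompt_ids)


# Toy ids standing in for inputs["input_ids"][0].tolist()
# and outputs[0].tolist() from the cell above.
prompt_ids = [1, 5, 9]
output_ids = [1, 5, 9, 4, 7, 2]
print(count_new_tokens(prompt_ids, output_ids))  # 3
```

In the notebook itself, the same quantity should be available directly from the tensor shapes as `outputs.shape[-1] - inputs["input_ids"].shape[-1]`.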

## Calculate throughput and total time

In [5]:
# Print total time taken
print(t3,": seconds")

# Calculate tokens per second
tokens_per_second = num_tokens/t3

print("Number of Tokens per second: ", tokens_per_second)

17.871081352233887 : seconds
Number of Tokens per second:  16.73094056855295
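
A single timed call can be noisy, and the first call on a GPU may include one-time setup cost. The sketch below shows one way to average throughput over several runs after a warm-up. `tokens_per_second` and `fake_generate` are hypothetical names. In the notebook, `fake_generate` would be replaced by a wrapper that calls `model.generate(...)` and returns the number of newly generated tokens.

```python
import time


def tokens_per_second(generate_fn, warmup_runs=1, timed_runs=3):
    """Average throughput of generate_fn, which must return the new-token count."""
    for _ in range(warmup_runs):
        generate_fn()  # discarded: early calls may include one-time setup cost

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(timed_runs):
        total_tokens += generate_fn()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


# Hypothetical stand-in for a wrapper around model.generate().
def fake_generate():
    time.sleep(0.01)
    return 50


print(f"{tokens_per_second(fake_generate):.0f} tokens/s")
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring intervals.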
