## Inference on a p3.2xlarge EC2 instance with an NVIDIA Tesla V100 GPU
We will use the mistralai/Mistral-7B-v0.1 model through Hugging Face Transformers. It is a 7.24B-parameter model with a native precision of BF16. At that precision the model is too big to fit in the 16 GB of GPU memory on this machine, so we need to quantize it to a lower precision when loading it.
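As a rough check on that claim, the sketch below estimates the storage needed for the weights alone at a few precisions. It assumes the 7.24B parameter count quoted above and counts 1 GB as 10^9 bytes; it ignores activations, the KV cache, and the CUDA context, all of which need additional memory. The names `PARAMS`, `BYTES_PER_PARAM`, and `weight_gb` are illustrative only.

```python
# Back-of-the-envelope estimate of weight storage only.
# Assumption: 7.24e9 parameters, as quoted in the text above.
PARAMS = 7.24e9

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate weight storage in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {name: weight_gb(PARAMS, size) for name, size in BYTES_PER_PARAM.items()}
for name, gb in estimates.items():
    print(f"{name}: {gb:.2f} GB")
```

Under these assumptions the BF16 weights come to about 14.5 GB, which leaves very little of a 16 GB card for everything else, while INT8 weights need roughly half of that.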

## Quantize the model from BF16 to INT8 using Quanto
Quanto is a versatile PyTorch quantization toolkit that uses linear quantization. Among its features is weight quantization to float8, int8, int4, or int2. Here we quantize the model's weights to INT8.

Reference - https://huggingface.co/docs/transformers/main/en/quantization#quanto
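To make "linear quantization" concrete, here is a minimal pure-Python sketch of symmetric, per-tensor INT8 quantization. It is an illustration of the general idea only, not Quanto's implementation: the function names, the choice of a single per-tensor scale, and the [-127, 127] range are assumptions made for this example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale.

    Illustrative symmetric linear quantization, not Quanto's actual code.
    """
    max_abs = max(abs(v) for v in values)
    # The scale maps the largest magnitude onto 127; guard against all zeros.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [x * scale for x in q]


weights = [0.50, -1.27, 0.03, 1.00]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("quantized:", q)
print("scale:", scale)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is stored as one signed byte plus a shared scale, and the round-trip error per value is bounded by half the scale.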

## Install Hugging Face Transformers, Quanto, and other necessary packages

In [1]:
!pip install -U -q git+https://github.com/huggingface/transformers.git
!pip install -U -q quanto
!pip install -U -q accelerate


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Log in to Hugging Face

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Perform the quantization

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
import torch

model_id = "mistralai/Mistral-7B-v0.1"
quantization_config = QuantoConfig(weights="int8")

# Load the model and quantize its weights to INT8 while loading
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)


## Perform the response generation

In [3]:
import time

text = "Write be a 150 word essay Why is health important to everyone?"
device = "cuda"

inputs = tokenizer(text, return_tensors="pt").to(device)

# Get start time
t1 = time.time()

outputs = model.generate(**inputs, max_new_tokens=300)

# Get end time
t2 = time.time()

# Get total time taken
t3 = t2 - t1

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

# Count the tokens in the decoded text. Note that outputs[0] holds the prompt
# followed by the completion, so this count includes the prompt tokens.
tokens = tokenizer.tokenize(response)
num_tokens = len(tokens)
print("Number of tokens (prompt + completion): ", num_tokens)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
FAILED: pybind_module.o 
c++ -MMD -MF pybind_module.o.d -DTORCH_EXTENSION_NAME=quanto_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ubuntu/cuda_tutorial/lib/python3.10/site-packages/torch/include -isystem /home/ubuntu/cuda_tutorial/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntu/cuda_tutorial/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntu/cuda_tutorial/lib/python3.10/site-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -c /home/ubuntu/cuda_tutorial/lib/python3.10/site-packages/quanto/library/ext/cpp/pybind_module.cpp -o pybind_module.o 
In file included from /home/ubuntu/cuda_tutorial/lib/python3.10/site-packages/torch/include/torch/csrc/Device.h:4,
                 from /home/ubuntu/

Write be a 150 word essay Why is health important to everyone?

Health is important to everyone because it is the foundation of a happy and productive life. Without good health, it is difficult to enjoy life and achieve one’s goals. Good health allows us to be active, productive, and engaged in our communities. It also helps us to avoid illness and injury, which can be costly and disruptive to our lives.

Good health is also important for our mental and emotional well-being. When we are healthy, we are more likely to feel confident, happy, and fulfilled. We are also better able to cope with stress and challenges in our lives.

Finally, good health is important for our relationships with others. When we are healthy, we are more likely to be able to participate in activities with our friends and family, and to be a positive influence in their lives.

In conclusion, health is important to everyone because it is the foundation of a happy and productive life. It allows us to be active, prod

## Calculate total time and throughput

In [4]:
# Print total time taken
print(t3,": seconds")

# Calculate tokens per second
tokens_per_second = num_tokens/t3

print("Number of Tokens per second: ", tokens_per_second)

43.7796630859375 : seconds
Number of Tokens per second:  5.801780600764548
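Because the token count above is taken from the decoded text, which contains the prompt followed by the completion, the reported figure slightly overstates generation throughput. A minimal sketch of a prompt-excluding calculation follows. The helper `generation_throughput` is hypothetical; in the notebook, the two counts could be taken from `inputs["input_ids"].shape[1]` and `outputs.shape[1]`, assuming `generate` returns the prompt plus the new tokens, as it does for decoder-only models.

```python
def generation_throughput(prompt_tokens: int, total_tokens: int, seconds: float) -> float:
    """Return newly generated tokens per second, excluding the prompt."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if total_tokens < prompt_tokens:
        raise ValueError("total_tokens must be at least prompt_tokens")
    new_tokens = total_tokens - prompt_tokens
    return new_tokens / seconds


# 254 total tokens is what the figures above imply (throughput x time).
# The 16-token prompt length is a hypothetical placeholder, not a measurement.
rate = generation_throughput(prompt_tokens=16, total_tokens=254, seconds=43.78)
print(f"New tokens per second: {rate:.2f}")
```

The difference is small for a short prompt, but it grows with prompt length, so separating the two counts makes runs with different prompts comparable.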
