<a href="https://colab.research.google.com/github/twhool02/ptm-quantization/blob/main/Quantize_Llama_2_7b_chat_hf_AWQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run Llama2-7b-chat-HF - AWQ Quantization


This notebook quantizes the Hugging Face version of the Llama 2 7B chat model (`meta-llama/Llama-2-7b-chat-hf`) to 4 bits using AWQ (Activation-aware Weight Quantization).

The notebook:

1. Loads the pre-trained model
2. Applies quantization to the pre-trained model
3. Uploads the model to Hugging Face
4. Runs inference on the model



This is based on the following blogs/videos/documentation:

* [Hugging Face AWQ Documentation](https://huggingface.co/docs/transformers/main/en/quantization?fuse=supported+architectures#awq)
* [How to Quantize an LLM with GGUF or AWQ](https://www.youtube.com/watch?v=XM8pllpBVA0)
* [Squeeze Every Drop of Performance from Your LLM with AWQ (Activation-Aware Quantization)](https://adithyask.medium.com/squeeze-every-drop-of-performance-from-your-llm-with-awq-activation-aware-quantization-53973365eaaa)
* [Hugging Face Documentation](https://huggingface.co/docs)

The AutoAWQ library used to implement AWQ is available here:

* [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)


## Setup

### Log into HuggingFace Hub

This code assumes that the user has a Hugging Face token with write permissions set up as a notebook secret called `HF_TOKEN`.

In [None]:
# Required when interacting with HuggingFace Hub
!pip install -q --upgrade huggingface_hub

import huggingface_hub

print(f"Hugging Face Version is: {huggingface_hub.__version__}")

Hugging Face Version is: 0.22.2


In [None]:
from google.colab import userdata

# using the HF_TOKEN secret, this has write permissions to Hugging Face
hftoken = userdata.get('HF_TOKEN')

In [None]:
from huggingface_hub import login

# Log into Hugging Face using the HF_TOKEN secret
login(hftoken, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful
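
Outside Colab, `google.colab.userdata` cannot be imported. A minimal sketch of a fallback, assuming the token is also exported as an environment variable with the same name; the helper name `resolve_hf_token` is hypothetical and not part of any library:

```python
import os


def resolve_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if possible, else from the environment."""
    try:
        # Only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            token = userdata.get(secret_name)
            if token:
                return token
        except Exception:
            # Secret missing or notebook access not granted: fall through
            pass

    # Assumption: the token was exported as an environment variable
    return os.environ.get(secret_name)
```

With this helper, `login(resolve_hf_token(), add_to_git_credential=True)` should behave the same way in Colab and in a local session.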


### Install Required Libraries

In [None]:
# The Transformers library provides APIs and tools to easily download and train pretrained models.
!pip install -q -U transformers

# Install the stable version of AutoAWQ and its kernels
!pip install autoawq -q


### Check library versions

In [None]:
#print the version of transformers
import transformers
print(f"version of transformers: {transformers.__version__}")

#print the version of pytorch
import torch
print(f"version of pytorch: {torch.__version__}")

version of transformers: 4.38.2
version of pytorch: 2.2.1+cu121
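
The cell above does not report the AutoAWQ version. A small sketch using only the standard library, assuming the packages are installed under the pip distribution names listed in the loop; `installed_version` is a hypothetical helper:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a pip package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("transformers", "torch", "autoawq", "huggingface_hub"):
    print(f"version of {name}: {installed_version(name) or 'not installed'}")
```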


### Import Libraries

In [None]:
# for interacting with the operating system.
import os

# torch is the main package of PyTorch.
import torch

from transformers import (
    AutoModelForCausalLM, # Generic model class with a causal language modeling head
    AutoTokenizer, # Automatically selects correct tokenizer for a model.
    pipeline, # Creates a pipeline that applies a model to some input data.
    logging, # Logs events during training and evaluation.
)

# Import required AWQ libraries
from awq import AutoAWQForCausalLM

### Define the processor to use

Ensure the model will use a GPU if available

In [None]:
# Load the model directly onto GPU (if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

### Define function to calculate model size

In [None]:
import os

def get_directory_size(directory: str) -> float:
    """
    This function calculates the size of a directory in gigabytes.
    """
    total = 0
    # Walk through the directory tree
    for path, dirs, files in os.walk(directory):
        for f in files:
            # Create the full path to the file
            fp = os.path.join(path, f)
            # Add the file size to the total
            total += os.path.getsize(fp)
    # Convert the size to gigabytes (1 GB = 1024**3 bytes here), rounded to 2 decimal places
    total_gb = round(total / (1024 * 1024 * 1024), 2)
    return total_gb

## Load Model and Tokenizer

### Define model and quantized model names

In [None]:
# Define the pre-trained model to be used
model_name = 'meta-llama/Llama-2-7b-chat-hf'

# Quantized model name
quant_model = model_name.split("/")[-1] + "-AWQ"

# Define the path to save the quantized model
quant_path = 'twhoool02/' + quant_model

### Load Base Model

In [None]:
# Load base model
model = AutoAWQForCausalLM.from_pretrained(
    model_name,
    device_map=device
)

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]


### View Model Details

Viewing details of the pre-trained model

In [None]:
print(f"Model Config: \n{model.config}")
print(f"View model structure: \n{model}")

Model Config: 
LlamaConfig {
  "_name_or_path": "/root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/92011f62d7604e261f748ec0cfe6329f31193e33",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "max_seq_len": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.38.2",
  "use_cache": true,
  "vocab_size": 32000
}

View model structure: 
LlamaAWQForCausalLM(
  (model): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(32000, 4096)
      (layers): ModuleList(


### Load Tokenizer

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

## Quantize the Model

### Quantization Config

In [None]:
# Define quantization configuration
awq_config = {
    "zero_point": True, # True means zero point quantization is enabled
    "q_group_size": 128, # 128 means the size of the quantization group is 128
    "w_bit": 4 # 4 means the bit-width of the quantization is 4
    }
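
To illustrate what these three settings control, here is a minimal NumPy sketch of group-wise, zero-point (asymmetric) round-to-nearest quantization. It is not AutoAWQ's implementation: AWQ additionally rescales weight channels based on activation statistics before quantizing, which is omitted here. The function name and the random test vector are illustrative assumptions.

```python
import numpy as np


def quantize_groupwise(weights, w_bit=4, q_group_size=128):
    """Quantize a 1-D weight vector with one scale and zero point per group."""
    q_max = 2 ** w_bit - 1                      # 4 bits -> integers 0..15
    groups = weights.reshape(-1, q_group_size)  # each row shares one scale
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / q_max
    # The zero point shifts the integer grid so that w_min maps close to 0
    zeros = np.clip(np.round(-w_min / scale), 0, q_max)
    q = np.clip(np.round(groups / scale) + zeros, 0, q_max).astype(np.uint8)
    dequantized = (q.astype(np.float32) - zeros) * scale
    return q, scale, zeros, dequantized.reshape(weights.shape)


rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale, zeros, w_hat = quantize_groupwise(w)
print("integer range:", q.min(), "to", q.max())
print("max abs error:", np.abs(w - w_hat).max())
```

A smaller `q_group_size` means more scales and zero points to store but a tighter fit per group; a larger one saves space at the cost of accuracy.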

### Quantize the Model

In [None]:
# Quantize the model using the 'quantize' method
model.quantize(tokenizer, quant_config=awq_config)

AWQ: 100%|██████████| 32/32 [15:50<00:00, 29.70s/it]


### Save the quantized model

In [None]:
# Save quantized model
model.save_quantized(quant_model, safetensors=True, shard_size="10GB")
tokenizer.save_pretrained(quant_model)

('Llama-2-7b-chat-hf-AWQ/tokenizer_config.json',
 'Llama-2-7b-chat-hf-AWQ/special_tokens_map.json',
 'Llama-2-7b-chat-hf-AWQ/tokenizer.model',
 'Llama-2-7b-chat-hf-AWQ/added_tokens.json',
 'Llama-2-7b-chat-hf-AWQ/tokenizer.json')

### Calculate quantized model size

AWQ models do not expose a model-size method, so the size of the directory the model was saved to is used instead.

In [None]:
model_size = get_directory_size(quant_model)
print(f"Model size: {model_size}GB")

Model size: 3.62GB
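
As a rough cross-check, the figure can be compared with the fp16 checkpoint. The sketch below assumes the download progress bars report decimal units (1G = 1000**3 bytes), while `get_directory_size` divides by 1024**3; the shard sizes are the 9.98G and 3.50G shown when the base model was downloaded.

```python
GIB = 1024 ** 3  # divisor used by get_directory_size
GB = 1000 ** 3   # assumption: unit used by the download progress bars

quantized_dir_gib = 3.62        # output of the cell above
fp16_shards_gb = (9.98, 3.50)   # base-model shard sizes from the download log

quantized_gb = quantized_dir_gib * GIB / GB
fp16_gb = sum(fp16_shards_gb)

print(f"quantized: {quantized_gb:.2f} GB (decimal)")
print(f"fp16:      {fp16_gb:.2f} GB (decimal)")
print(f"reduction: {fp16_gb / quantized_gb:.2f}x")
```

Under these assumptions the reduction comes out at roughly 3.5x rather than the ideal 4x for 16-bit to 4-bit, plausibly because the per-group scales and zero points add overhead and some layers, such as the embeddings, stay in fp16. The converted size also lines up with the 3.89G reported for `model.safetensors` when the model is downloaded again later in the notebook.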


## Push the model to Hugging Face

`LlamaAWQForCausalLM` does not have a `push_to_hub` attribute, so the model is pushed by uploading its files individually.

### Create Repo

In [None]:
# Import the HfApi class from the huggingface_hub module
from huggingface_hub import HfApi

# Initialize the HfApi class
api = HfApi()

# Define the ID of the repository the files will be uploaded to
# The repository is located under the user 'twhoool02' and named after the quant_model variable
repo_id = f"twhoool02/{quant_model}"
print(f"Repository ID is {repo_id}")

Repository ID is twhoool02/Llama-2-7b-chat-hf-AWQ


In [None]:
# Check if the repo exists
repo_exists = api.repo_exists(repo_id)

# If not, create the repo
if repo_exists:
    print(f"{repo_id} already exists")
else:
    api.create_repo(repo_id=repo_id, private=True)

twhoool02/Llama-2-7b-chat-hf-AWQ already exists


### Create a list of files to upload

In [None]:
import os

# specify the directory you want to use
base_path = f"/content/{quant_model}"

# create an empty list to store the file paths
local_file_paths = []

# loop through the files in the directory
for filename in os.listdir(base_path):
    file_path = os.path.join(base_path, filename)
    local_file_paths.append(file_path)

# print the list of file paths
for path in local_file_paths:
    print(path)

/content/Llama-2-7b-chat-hf-AWQ/generation_config.json
/content/Llama-2-7b-chat-hf-AWQ/tokenizer_config.json
/content/Llama-2-7b-chat-hf-AWQ/special_tokens_map.json
/content/Llama-2-7b-chat-hf-AWQ/config.json
/content/Llama-2-7b-chat-hf-AWQ/tokenizer.model
/content/Llama-2-7b-chat-hf-AWQ/tokenizer.json
/content/Llama-2-7b-chat-hf-AWQ/model.safetensors


### Upload the files

In [None]:
#Loop through each file and upload it
for file in local_file_paths:
    file_name = file.split("/")[-1]

    # Specify the path where you want the file to be uploaded in the repository
    path_in_repo = file_name

    # Use the upload_file method of the HfApi class to upload the file
    api.upload_file(
        path_or_fileobj=file,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="model"
    )

    print(f"Uploaded {file_name} to {repo_id}")

Uploaded generation_config.json to twhoool02/Llama-2-7b-chat-hf-AWQ
Uploaded tokenizer_config.json to twhoool02/Llama-2-7b-chat-hf-AWQ
Uploaded special_tokens_map.json to twhoool02/Llama-2-7b-chat-hf-AWQ
Uploaded config.json to twhoool02/Llama-2-7b-chat-hf-AWQ
Uploaded tokenizer.model to twhoool02/Llama-2-7b-chat-hf-AWQ
Uploaded tokenizer.json to twhoool02/Llama-2-7b-chat-hf-AWQ
Uploaded model.safetensors to twhoool02/Llama-2-7b-chat-hf-AWQ
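
As an alternative to the existence check and the per-file loop, `HfApi` also provides `create_repo(..., exist_ok=True)` and `upload_folder`. A hedged sketch follows; the wrapper name `push_directory` is hypothetical, and `api` is assumed to be an `HfApi` instance like the one created above:

```python
def push_directory(api, local_dir, repo_id, private=True):
    """Create the repo if needed, then upload every file in local_dir."""
    # exist_ok=True suppresses the error raised when the repo already exists
    api.create_repo(repo_id=repo_id, private=private, exist_ok=True)
    return api.upload_folder(
        folder_path=local_dir,
        repo_id=repo_id,
        repo_type="model",
    )


# Usage in this notebook (needs network access and a write token):
# push_directory(api, base_path, repo_id)
```

The per-file loop used in this notebook prints progress after each file, which can be easier to follow when one upload fails.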


## Upload Model Card

### Create Model Card

In [None]:
# Add custom model card
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    base_model=model_name,
    language='en',  # The language the model was trained on
    license='other',  # The license for the model
    library=['Transformers','AWQ'],  # The library used to train the model
    model_name=quant_model,  # The name of the model
    model_type='llama',
    pipeline_tag='text-generation',
    quantized_by="twhoool02",
    tags=["meta","AWQ","facebook", "llama", "llama2"],  # Tags for the model
)

content = f"""
---
{ card_data.to_yaml() }
---

# Model Card for {quant_model}

## Model Details

This model is an AWQ quantized version of the {model_name} model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, AWQ
- **Model type:** llama
- **Model name:** {quant_model}
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other
"""

card = ModelCard(content)
print(card)

---
language: en
license: other
tags:
- meta
- AWQ
- facebook
- llama
- llama2
base_model: meta-llama/Llama-2-7b-chat-hf
model_name: Llama-2-7b-chat-hf-AWQ
library:
- Transformers
- AWQ
model_type: llama
pipeline_tag: text-generation
quantized_by: twhoool02
---

# Model Card for Llama-2-7b-chat-hf-AWQ

## Model Details

This model is an AWQ quantized version of the meta-llama/Llama-2-7b-chat-hf model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, AWQ
- **Model type:** llama
- **Model name:** Llama-2-7b-chat-hf-AWQ
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other



### Upload the model card

In [None]:
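try:
    card.validate()
    card.save('README.md')
    card.push_to_hub(repo_id=repo_id)
except Exception as err:
    # Chain the original exception so the underlying cause stays visible
    raise ValueError("Model card could not be validated or uploaded. Please check.") from err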

## Download model from Hugging Face

The newly uploaded model is downloaded from Hugging Face again and used for inference.

In [None]:
# clear the existing model and tokenizer
del model
del tokenizer

In [None]:
model_name_or_path = "twhoool02/Llama-2-7b-chat-hf-AWQ"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
    device=device,
    )

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    trust_remote_code=False
    )

tokenizer.pad_token = tokenizer.eos_token # sets the pad token to the eos token

model.safetensors:   0%|          | 0.00/3.89G [00:00<?, ?B/s]

Replacing layers...: 100%|██████████| 32/32 [00:09<00:00,  3.53it/s]
Fusing layers...: 100%|██████████| 32/32 [00:00<00:00, 35.43it/s]


### Get model details

In [None]:
print(f"Model Config: \n{model.config}")
print(f"View model structure: \n{model}")

Model Config: 
LlamaConfig {
  "_name_or_path": "/root/.cache/huggingface/hub/models--twhoool02--Llama-2-7b-chat-hf-AWQ/snapshots/31a81ac0395450d11c7b7db780d3535d898f8d69",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "max_seq_len": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.38.2",
  "use_cache": true,
  "vocab_size"

### Get model size

AWQ models do not expose a model-size method, so the size of the directory the model was downloaded to is used instead.

In [None]:
model_size = get_directory_size('/root/.cache/huggingface/hub/models--twhoool02--Llama-2-7b-chat-hf-AWQ')
print(f"Model size: {model_size}GB")

Model size: 7.25GB
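
The 7.25GB reported here is about twice the 3.62GB measured for the saved model. One plausible explanation, which was not verified in the original run, is the Hugging Face cache layout: files are stored under `blobs/` and exposed again as symlinks under `snapshots/`, and `os.path.getsize` follows symlinks, so each file is counted twice. A sketch of a variant that counts every real file once; the function names are hypothetical:

```python
import os


def unique_bytes(directory):
    """Total size in bytes, counting each underlying file only once."""
    seen = set()
    total = 0
    for path, _dirs, files in os.walk(directory):
        for name in files:
            # Resolve symlinks so a blob and its snapshot link map to one path
            real_path = os.path.realpath(os.path.join(path, name))
            if real_path in seen or not os.path.isfile(real_path):
                continue
            seen.add(real_path)
            total += os.path.getsize(real_path)
    return total


def get_directory_size_unique(directory):
    """Same unit as get_directory_size: gigabytes of 1024**3 bytes, 2 decimals."""
    return round(unique_bytes(directory) / (1024 ** 3), 2)
```

If the assumption holds, `get_directory_size_unique` on the cache directory should report a value close to the size of the saved model.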


## Run Inference on the Quantized Model

At the time of writing, `LlamaAWQForCausalLM` models are not supported by the Transformers text-generation `pipeline`, so `model.generate` is called directly.

In [None]:
prompt = "Write a poem about Ireland"
prompt_template=f'''{prompt}

'''

tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda() # move the tokens to GPU

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.1,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

In [None]:
print("Output: ", tokenizer.decode(generation_output[0]))

Output:  <s> Write a poem about Ireland

Ireland, land of the green,
Where the shamrocks grow and the beer is seen.
From the rolling hills to the rugged coast,
Ireland's beauty is sure to astound and impress.

The Emerald Isle, a land of charm,
Where the people are warm and the whiskey's in form.
From Dublin to Galway, the culture's rich and deep,
Ireland's spirit is one that can't be beat.

The Cliffs of Moher, a sight to behold,
Where the ocean crashes below and the winds unfold.
The Blarney Stone, a gift of eloquence,
Where the lucky kiss and their words are inventive.

Ireland, a land of myth and legend,
Where the leprechauns dance and the fairies are seen.
From the Ring of Kerry to the Burren's wild,
Ireland's magic is a spell that can't be denied.

So come and visit, stay a while,
And let Ireland's beauty make your heart smile.
For in this land of green, you'll find your heart's desire,
In Ireland, where the magic never retires.</s>
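
The cells in this section repeat the same tokenize, generate, and decode steps with a different prompt each time. A sketch of a helper that wraps them, using the same sampling settings; the name `generate_text` and its parameters are hypothetical, and `model` and `tokenizer` are assumed to be the objects loaded earlier:

```python
def generate_text(model, tokenizer, prompt, device="cuda:0",
                  max_new_tokens=512, skip_special_tokens=False):
    """Tokenize a prompt, sample a completion, and return the decoded text."""
    # Same template as the cells in this section: the prompt plus a blank line
    tokens = tokenizer(f"{prompt}\n\n", return_tensors="pt").input_ids.to(device)
    output_ids = model.generate(
        tokens,
        do_sample=True,
        temperature=0.1,
        top_p=0.95,
        top_k=40,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
    )
    # skip_special_tokens=True would drop markers such as <s> and </s>
    return tokenizer.decode(output_ids[0], skip_special_tokens=skip_special_tokens)


# Usage in this notebook:
# print("Output: ", generate_text(model, tokenizer, "Write a poem about Ireland", device=device))
```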


In [None]:
prompt = "What is a large language model?"
prompt_template=f'''{prompt}

'''

tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda() # move the tokens to GPU

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.1,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

In [None]:
print("Output: ", tokenizer.decode(generation_output[0]))

Output:  <s> What is a large language model?

A large language model is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. The model is designed to learn the patterns and structures of language by analyzing the training data, and then use this knowledge to generate text that is similar in style and structure to the training data.

There are several types of large language models, including:

1. Neural network-based models: These models use a neural network architecture to learn the patterns of language from the training data. The network consists of multiple layers of interconnected nodes (neurons) that process the input text data and generate the output text.
2. Transformer-based models: These models use a transformer architecture to learn the patterns of language from the training data. The transformer is a type of neural network that is particularly well-suited for natural l

In [None]:
prompt = "Tell me what you know about Co. Donegal in Ireland."
prompt_template=f'''{prompt}

'''

tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda() # move the tokens to GPU

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.1,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

print("Output: ", tokenizer.decode(generation_output[0]))

Output:  <s> Tell me what you know about Co. Donegal in Ireland.

Co. Donegal is a county located in the northwest of Ireland. It is known for its rugged coastline, picturesque villages, and rich cultural heritage. Here are some of the top attractions and experiences to explore in Co. Donegal:

1. The Wild Atlantic Way: This scenic route runs along the western coast of Ireland and passes through Co. Donegal. It offers breathtaking views of the ocean, cliffs, and beaches.
2. Slieve League Cliffs: These cliffs are higher than the Cliffs of Moher and offer stunning views of the Atlantic Ocean.
3. Glenveagh National Park: This park is home to some of Ireland's most beautiful scenery, including lakes, waterfalls, and mountains.
4. Donegal Castle: This 15th-century castle is a must-visit for history buffs and offers a glimpse into the county's rich past.
5. The Gap of Dunloe: This scenic gap is a popular destination for hikers and offers breathtaking views of the surrounding landscape.
6. Th

In [None]:
prompt = "Can you name the all previous presidents of Ireland?"
prompt_template=f'''{prompt}

'''

tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda() # move the tokens to GPU

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.1,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

print("Output: ", tokenizer.decode(generation_output[0]))

Output:  <s> Can you name the all previous presidents of Ireland?

Ireland has had six presidents since it gained independence from the United Kingdom in 1922. Here are the names of the presidents of Ireland:

1. Douglas Hyde (1938-1945)
2. Éamon de Valera (1945-1959)
3. Séamus Éamon de Valera (1959-1973)
4. Patrick Hillery (1974-1990)
5. Mary Robinson (1990-1997)
6. Mary McAleese (1997-2011)

Note: The office of President of Ireland was established in 1937, and the first president, Douglas Hyde, took office in 1938.</s>


In [None]:
messages = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"}
]

model_inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] What is your favourite TV show? [/INST] I liked Breaking Bad and Band of Brothers </s><s> [INST] Do you have any recommendations of other shows I might like? [/INST]  Certainly! Based on your interest in "Breaking Bad" and "Band of Brothers," here are some other TV shows that you might enjoy:

1. "The Sopranos" - This HBO series follows the life of a New Jersey mob boss, Tony Soprano, as he navigates the criminal underworld and deals with personal and family issues.
2. "Narcos" - This Netflix series tells the true story of the rise and fall of drug kingpin Pablo Escobar and the DEA agents tasked with bringing him down.
3. "The Wire" - This HBO series explores the drug trade in Baltimore from multiple perspectives, including law enforcement, drug dealers, and politicians.
4. "Sons of Anarchy" - This FX series follows the lives of a close-knit outlaw motorcycle club as they navigate the criminal underworld and deal with internal conflicts.
5. "Peaky Blinders" - This BBC series
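
The decoded output above shows how `apply_chat_template` lays out the conversation with `[INST]` and `[/INST]` markers. The sketch below approximates that layout for alternating user and assistant turns. It is an illustration only: system prompts are not handled, the exact whitespace around special tokens is an assumption, and `tokenizer.apply_chat_template` remains the source of truth. The function name is hypothetical.

```python
def format_llama2_chat(messages, bos="<s>", eos="</s>"):
    """Render user/assistant messages in the [INST] ... [/INST] layout."""
    parts = []
    for message in messages:
        content = message["content"].strip()
        if message["role"] == "user":
            # Each user turn opens a new sequence and is wrapped in [INST] tags
            parts.append(f"{bos}[INST] {content} [/INST]")
        elif message["role"] == "assistant":
            # Each assistant turn closes the sequence with the EOS token
            parts.append(f" {content} {eos}")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


example = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"},
]
print(format_llama2_chat(example))
```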