### **bitsandbytes**
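bitsandbytes makes large language models more accessible through k-bit quantization for PyTorch. It provides three main features that substantially reduce memory consumption for inference and training: 8-bit optimizers, LLM.int8() 8-bit inference, and 4-bit quantization (as used by QLoRA). This section uses the 4-bit path to load a 7B chat model.
([official documentation](https://huggingface.co/docs/bitsandbytes/main/en/index))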
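
To make the memory savings concrete, here is a rough back-of-the-envelope sketch of the weight footprint of a 7B model at different precisions. The round parameter count is an assumption, and `weight_memory_gib` is a hypothetical helper named for illustration. Real usage is higher because of activations, the KV cache, and the per-block quantization constants.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # assumption: a round "7B" parameter count

# Print the estimated weight footprint for each precision.
for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions, going from float16 to 4-bit cuts the weight footprint by a factor of four, which is roughly what lets a 7B model fit on a single consumer GPU.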

In [None]:
!pip install -U bitsandbytes transformers accelerate


In [None]:
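import torch
import transformers
from transformers import AutoTokenizer, BitsAndBytesConfig

model_id = "NousResearch/Llama-2-7b-chat-hf"

# Load the weights in 4-bit via bitsandbytes and compute in float16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Call transformers.pipeline explicitly so that the `pipeline` variable
# does not shadow the factory function if this cell is re-run.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
    model_kwargs={"quantization_config": quant_config},  # This is the magic line
)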
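
For some intuition about what 4-bit loading does to the weights, here is a deliberately simplified sketch of blockwise absmax quantization in plain Python. It is an illustration under stated assumptions, not the bitsandbytes implementation: the library runs optimized GPU kernels and uses 4-bit data types such as FP4 or NF4 rather than the uniform integer grid used here. The names `quantize_block` and `dequantize_block` and the weight values are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes using one absmax scale per block."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    scale = absmax / qmax
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Made-up weights standing in for one block of a weight matrix.
weights = [0.12, -0.70, 0.33, 0.04, -0.28, 0.61, -0.02, 0.44]

codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.3f}, max abs error={max_error:.3f}")
```

Each block stores only the small integer codes plus a single scale, so the rounding error per weight is bounded by half the scale. That bound is the trade-off made in exchange for the smaller memory footprint.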

In [None]:
prompt_template = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Answer in a few words.
<</SYS>>

Answer the question below:

Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
[/INST]
""".strip()


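sequences = pipeline(
    prompt_template,
    do_sample=True,  # Sample instead of greedy decoding
    top_k=10,  # Consider only the 10 most likely next tokens
    temperature=0.7,  # Values below 1 make sampling less random
    top_p=0.95,  # Nucleus sampling threshold
    num_return_sequences=4,  # Number of responses to generate
    eos_token_id=tokenizer.eos_token_id,
    max_length=256,  # Cap on prompt tokens plus generated tokens
)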

for seq in sequences:
    print("*****"*50)
    print(f"Result: {seq['generated_text']}")
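
By default, a text-generation pipeline returns the prompt together with the completion in `generated_text`, so each printed result repeats the whole prompt. A minimal sketch for keeping only the text after the `[/INST]` marker, and for working out the expected answer to compare against, might look like this. The helper `extract_answer` and the sample string are hypothetical, made up for illustration rather than taken from real model output.

```python
def extract_answer(generated_text: str, marker: str = "[/INST]") -> str:
    """Return the text after the last `marker`, or the whole text if absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


# Expected answer, worked out by hand from the question:
# 16 eggs laid, 3 eaten, 4 baked, the rest sold at $2 each.
eggs_laid, eaten, baked, price = 16, 3, 4, 2
expected_dollars = (eggs_laid - eaten - baked) * price

# Hypothetical string standing in for seq["generated_text"].
sample = "[INST] Answer the question below: ... [/INST] She makes $18 per day."

print(extract_answer(sample))
print(f"Expected: ${expected_dollars}")
```

In the loop above, `extract_answer(seq["generated_text"])` could be applied to each sampled sequence to see how often the model reaches the expected figure.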