# Text Generation

Notebooks for text generation using LLMs from the Hugging Face Hub.

In [27]:
!nvidia-smi

Thu Mar  7 01:02:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000001:00:00.0 Off |                    0 |
| N/A   46C    P0              28W /  70W |   2612MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|    0   N/A  N/A     78830      C   /anaconda/envs/azureml_py38/bin/python     2608MiB |
+---------------------------------------------------------------------------------------+


In [21]:
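import gc
import torch

def clear_cache():
    # Collect unreachable Python objects first so their tensors can be freed,
    # then release PyTorch's cached, unused GPU memory back to the driver.
    gc.collect()
    torch.cuda.empty_cache()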


## 1. Hugging Face Pipelines

`pipeline` is the high-level Transformers API for inference, used here for text generation. It handles model loading, tokenization, batching, and device placement automatically.

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

True

In [3]:

from huggingface_hub import login

HF_TOKEN = os.getenv("HF_TOKEN")
login(HF_TOKEN)


Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/azureuser/.cache/huggingface/token
Login successful


In [2]:
from transformers import pipeline


In [8]:
# pipe = pipeline("text-generation", model="microsoft/phi-1", trust_remote_code=True)

from transformers import BitsAndBytesConfig

# load_in_4bit as a direct argument is deprecated; pass a BitsAndBytesConfig instead.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [15]:
output = pipe("Tell me a haiku about AI.", do_sample=True, top_p=0.95)
print(output[0]["generated_text"])

Tell me a haiku about AI.

The AI sings
A melody of ones
And zeroes blend

How was that?
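
The call above passes `do_sample=True, top_p=0.95`, which requests nucleus (top-p) sampling: at each step only the most probable tokens whose combined probability reaches `top_p` stay eligible, and the next token is drawn from that reduced set. The sketch below illustrates the idea on a toy distribution in plain Python. It is not the Transformers implementation; the function names and the toy probabilities are made up for illustration.

```python
import random

def top_p_filter(probs, top_p=0.95):
    """Keep the most probable tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample_top_p(probs, top_p=0.95, rng=random):
    """Draw one token from the nucleus of the distribution."""
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution (illustrative values only).
toy = {"sings": 0.5, "hums": 0.3, "dreams": 0.15, "banana": 0.05}
print(top_p_filter(toy, top_p=0.9))   # the unlikely tail token is dropped
print(sample_top_p(toy, top_p=0.9))   # random draw from the remaining tokens
```

A lower `top_p` shrinks the candidate set and makes output more conservative; `top_p=1.0` keeps the full distribution.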


In [41]:
pipe = None  # drop the reference so the model can be garbage-collected
clear_cache()

## 2. Hugging Face AutoModel
* AutoModelForCausalLM - loads the model
* BitsAndBytes - quantization support for loading models in 4-bit or 8-bit
* Accelerate - needed when a very large model must be split across multiple GPUs (also required for `device_map="auto"`)
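
To see why quantization matters on a 15 GiB Tesla T4, here is a rough back-of-the-envelope estimate of weight memory at different precisions. It assumes the Llama-2-7b-chat-hf checkpoint from section 1 stores 16-bit weights and that the shard sizes in the download log (9.98 GB and 3.50 GB) are decimal gigabytes; the helper name `weight_memory_gib` is made up for this sketch. The estimate covers weights only and ignores quantization metadata, activations, and the KV cache.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

# Assumption: shards hold 2 bytes per parameter, so the parameter count is
# (total checkpoint bytes) / 2.
llama_params = (9.98e9 + 3.50e9) / 2

# Total memory of the Tesla T4 as reported by nvidia-smi (15360 MiB).
T4_MEMORY_GIB = 15360 / 1024

for bits in (16, 8, 4):
    gib = weight_memory_gib(llama_params, bits)
    print(f"{bits:>2}-bit weights: {gib:5.1f} GiB of {T4_MEMORY_GIB:.0f} GiB")
```

Under these assumptions the 16-bit weights alone come close to the card's total memory, while 4-bit weights take roughly a quarter of that and leave room for generation.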

In [4]:
# Load model directly
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_4bit as a direct argument is deprecated; use quantization_config.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
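def generate_text(prompt):
    # Tokenize and move the tensors to the device the model was placed on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Nucleus sampling (top_p=0.95), capped at 150 newly generated tokens.
    generated_ids = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=150)
    # The decoded text contains the prompt followed by the completion.
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return outputs[0]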

In [7]:
# text completion format
prompt = "\n\nQuestion:" + "How does the brain work?"
res = generate_text(prompt)
print(res)



Question:How does the brain work?

The brain is a complex organ, but it can be boiled down to a few basic functions. One of the most important roles of the brain is to control and coordinate the activities of the body. This includes everything from breathing and digestion to movement and thinking.

Here's a simplified overview of the brain's functions:

* **Processing sensory information:** The brain receives sensory information from the body, such as touch, temperature, sound, and smell.
* **Generating and interpreting thoughts:** The brain combines and interprets sensory information to form thoughts and ideas.
* **Making decisions:** The brain helps individuals make decisions based on their thoughts and experiences.
* **Controlling voluntary movements:** The brain allows individuals to move their body in a


In [24]:
def generate_chat(chat):
    # Render the messages in the model's chat format, ending with an open model turn.
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    # The rendered prompt already contains the BOS token, so do not add special tokens again.
    inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    generated_ids = model.generate(inputs, do_sample=True, top_p=0.95, max_new_tokens=200)
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return outputs[0]

In [26]:
# using chat template
query = "If you had a name, what would you like it to be?"

chat = [
    { "role": "user", "content": query },
]

res = generate_chat(chat)
print(res)

user
If you had a name, what would you like it to be?
model
I do not have a name as I am an artificial intelligence, not a person. I am designed to assist and provide information based on the knowledge I have been programmed with.
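
The `user` and `model` lines in the output come from the chat template: the role names are ordinary text, so they survive `skip_special_tokens=True` while the turn markers around them are stripped. As a rough illustration, the sketch below approximates in plain Python what `apply_chat_template` renders for Gemma instruction-tuned models. The function `format_gemma_chat` is hypothetical; the authoritative template ships with the tokenizer and may differ in details.

```python
def format_gemma_chat(chat, add_generation_prompt=True, bos_token="<bos>"):
    """Approximate the Gemma chat format for a list of role/content messages."""
    # Gemma labels the assistant's turns "model".
    role_names = {"user": "user", "assistant": "model"}
    prompt = bos_token
    for message in chat:
        role = role_names[message["role"]]
        content = message["content"].strip()
        prompt += f"<start_of_turn>{role}\n{content}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant's reply.
        prompt += "<start_of_turn>model\n"
    return prompt

chat = [{"role": "user", "content": "If you had a name, what would you like it to be?"}]
print(format_gemma_chat(chat))
```

Because the rendered prompt already starts with the BOS token, encoding it with `add_special_tokens=False` avoids a duplicate BOS.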
