In [1]:
import torch
from transformers import AutoModelForCausalLM, GPT2Tokenizer

In [2]:
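def load_model(model_name):
    """Load a causal LM in half precision together with a GPT-2 tokenizer."""
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # half-precision weights
        device_map="auto",  # place layers on the GPU first, then spill to CPU
        max_memory={0: "28GiB", "cpu": "110GiB"},  # per-device memory budget
        low_cpu_mem_usage=True,
    ).eval()  # evaluation mode (disables dropout)

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    return model, tokenizer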

In [None]:
model_name = "cerebras/Cerebras-GPT-13B"
model, tokenizer = load_model(model_name)

In [10]:
inputs = tokenizer(
    """What is capital of USA?
    """,
    truncation=True,
    return_tensors="pt",
).to("cuda")

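with torch.inference_mode():
    completion = model.generate(
        **inputs,
        use_cache=True,  # reuse attention keys/values from earlier tokens
        do_sample=True,  # sample instead of greedy decoding
        temperature=0.5,  # below 1.0 sharpens the token distribution
        no_repeat_ngram_size=1,  # size 1 forbids repeating any single token
        repetition_penalty=1.0,  # 1.0 applies no penalty
        top_p=0.92,  # nucleus sampling threshold
        top_k=0,  # 0 disables top-k filtering
        max_new_tokens=128,
    )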

output = tokenizer.decode(completion.squeeze())
print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is capital of USA?
The capital of the United States of America (USA) is Washington, D.C.

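**The temperature parameter in the generate method of Hugging Face's transformers library** controls the degree of randomness in the generated text. The model's logits are divided by the temperature before the softmax, so the value reshapes the probability distribution from which the next token is sampled. It has an effect only when sampling is enabled (do_sample=True); which tokens are considered at all is governed by top_k and top_p, not by the temperature.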

A lower temperature will result in the model generating more conservative and predictable text, while a higher temperature will lead to more creative and diverse output, but with a higher likelihood of generating nonsensical or irrelevant text.

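The temperature parameter is usually set to a value between 0 and 1. The default is 1, which leaves the model's distribution unchanged. As the temperature approaches 0, sampling approaches greedy decoding and the model almost always picks the most likely token. Recent versions of transformers reject a temperature of exactly 0, so set do_sample=False if you want deterministic output. A very high temperature (e.g., 10) flattens the distribution toward uniform and results in extremely diverse and unpredictable output.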
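The effect of dividing logits by the temperature can be checked with a few lines of plain Python. This is a minimal sketch, not the code transformers runs internally: `softmax_with_temperature` is a hypothetical helper and the three logits are made-up values chosen for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lowest temperature most of the probability mass sits on the first token, while at 10.0 the three probabilities end up close to one another, which is why high temperatures produce more erratic text.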

---------------

**The use_cache parameter in the generate method of Hugging Face's transformers library** controls whether the model reuses its key/value cache during text generation.

When use_cache is set to True, the model stores the attention key and value tensors it computed for earlier tokens and reuses them at every decoding step, so only the newest token has to be processed. This is particularly useful when generating long sequences, as it significantly reduces the time required to generate each subsequent token.

When use_cache is set to False, the model recomputes the keys and values for the entire sequence at every step. The cache lives only within a single generate call, so disabling it does not make the output any more independent of earlier generations. It trades speed for the memory the cache would otherwise occupy.

The default value of use_cache is True. Setting it to False should not change the quality of the generated text, apart from small floating-point differences; expect slower generation and lower memory use, which may matter when GPU memory is tight.
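What the cache saves can be sketched with a toy single-head attention in plain Python. This is an illustration under simplifying assumptions, not the transformers implementation: `attend`, `project`, the two decode functions and the 2-dimensional "token" vectors are all hypothetical, and the learned projections are replaced by fixed scalings.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def project(x, factor):
    """Hypothetical stand-in for a learned key or value projection."""
    return [factor * xi for xi in x]

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        # Recompute keys and values for the whole prefix at every step.
        keys = [project(t, 0.5) for t in prefix]
        values = [project(t, 2.0) for t in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for token in tokens:
        # Project only the newest token and append it to the cache.
        key_cache.append(project(token, 0.5))
        value_cache.append(project(token, 2.0))
        projections += 2
        outputs.append(attend(token, key_cache, value_cache))
    return outputs, projections

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
slow_out, slow_cost = decode_without_cache(tokens)
fast_out, fast_cost = decode_with_cache(tokens)
print("same outputs:", slow_out == fast_out)
print("projections without cache:", slow_cost, "with cache:", fast_cost)
```

Both loops produce the same attention outputs; the cached version simply performs fewer projections, and the gap widens as the sequence grows, since the uncached count grows quadratically with length while the cached count grows linearly.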