# Lesson 9. Part 2

1. We look at how the **quantized** Gemma 2B works.
1. We solve tasks with the **quantized** Gemma 2B.

## Gemma 2B 4bit

1. Link to kaggle.com: https://www.kaggle.com/models/google/gemma
1. Link to huggingface.co: https://huggingface.co/google/gemma-1.1-2b-it
1. Link to a blog post: https://huggingface.co/blog/4bit-transformers-bitsandbytes
1. Link to another blog post: https://kipp.ly/transformer-inference-arithmetic/

In [None]:
! pip install accelerate
! pip install -i https://pypi.org/simple/ bitsandbytes

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch


# Load the model
model_path = "/kaggle/input/gemma/transformers/1.1-2b-it/1/"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Add a quantization config (we will load the weights in 4 bits)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=quantization_config, revision="float16", device_map="auto").eval()

for param in model.parameters():
    param.requires_grad = False

In [None]:
quantization_config

Not all model parameters are loaded in 4 bits: for example, the embedding and normalization layers keep their original data types.
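
One way to check which data types actually ended up in the model is to sum parameter counts per dtype. The helper below is a minimal sketch: `count_params_by_dtype` and `FakeParam` are hypothetical names introduced here, and the demo uses made-up layer names and sizes so that it runs without a GPU. It assumes that 4-bit weights show up as packed `uint8` storage, so check the real output on your setup.

```python
from collections import Counter
from dataclasses import dataclass


def count_params_by_dtype(named_params):
    """Sum element counts per dtype for (name, parameter) pairs.

    Each parameter only needs a `.dtype` attribute and a `.numel()` method,
    which is what `model.named_parameters()` yields in PyTorch.
    """
    totals = Counter()
    for _name, param in named_params:
        totals[str(param.dtype)] += param.numel()
    return dict(totals)


@dataclass
class FakeParam:
    """Hypothetical stand-in for a tensor so the sketch runs without a GPU."""
    dtype: str
    size: int

    def numel(self):
        return self.size


# Made-up layer names and sizes, for illustration only.
demo_params = [
    ("embed.weight", FakeParam("float16", 1000)),
    ("block.0.proj.weight", FakeParam("uint8", 400)),
    ("block.0.norm.weight", FakeParam("float16", 10)),
]
dtype_totals = count_params_by_dtype(demo_params)
print(dtype_totals)
```

In the notebook the same helper can be called as `count_params_by_dtype(model.named_parameters())`.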

Since we will be working on an Nvidia Tesla P100 GPU, inference speed will not improve: unlike, for example, the A100 or RTX 4090, this GPU has no built-in hardware optimizations for low-precision computation.
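
Even without a speedup, 4-bit loading still reduces the memory taken by the weights, and the saving can be estimated with simple arithmetic. The sketch below is an approximation only: the parameter counts are round, hypothetical numbers rather than the exact Gemma 2B layout, and the small overhead of quantization constants is ignored.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Hypothetical split: round illustrative numbers, not the exact Gemma 2B layout.
n_quantized = 2_000_000_000  # weights stored in 4 bits
n_kept_fp16 = 500_000_000    # e.g. embeddings and norms kept in 16 bits

fp16_total = weight_memory_gib(n_quantized + n_kept_fp16, 16)
mixed_total = weight_memory_gib(n_quantized, 4) + weight_memory_gib(n_kept_fp16, 16)

print(f"all fp16: {fp16_total:.2f} GiB")
print(f"4-bit + fp16 remainder: {mixed_total:.2f} GiB")
```

The layers that stay in 16 bits limit the overall saving, which is why the total does not shrink by a full factor of four.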

In [None]:
model

In [None]:
model.lm_head.weight

In [None]:
# Tokenize the text
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

input_ids

In [None]:
# First text generation
outputs = model.generate(**input_ids, max_new_tokens=20)

print(tokenizer.decode(outputs[0]))

In [None]:
# Tokenize as a chat: the model was trained to communicate in a dialogue format
conversation = [{"role": "user", "content": "Write me a poem about Machine Learning."}]
input_ids = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, return_tensors="pt").to("cuda")

print(tokenizer.batch_decode(input_ids)[0])
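
To make the printed prompt easier to read, here is a plain-Python approximation of the Gemma chat format. It is a sketch reconstructed from the published format, and `render_gemma_prompt` is a hypothetical helper; the template shipped with the tokenizer remains the authoritative version.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Approximate the Gemma chat format as plain text."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # An open model turn signals that the model should answer next.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


demo_conversation = [{"role": "user", "content": "Write me a poem about Machine Learning."}]
rendered = render_gemma_prompt(demo_conversation)
print(rendered)
```

Comparing this string with the decoded `input_ids` above is a quick way to confirm that the generation prompt was appended.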

In [None]:
# Second text generation
generate_ids = model.generate(input_ids, max_new_tokens=20)

print(tokenizer.batch_decode(generate_ids)[0])

In [None]:
new_tokens = generate_ids[0, input_ids.shape[-1]:]

print(tokenizer.decode(new_tokens, skip_special_tokens=True))

In [None]:
import time
from functools import wraps

def measure_time(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()  # Monotonic clock, better suited to measuring intervals
        result = func(*args, **kwargs)  # Call the actual function
        end_time = time.perf_counter()  # Record the end time
        elapsed_time = end_time - start_time  # Calculate the elapsed time
        print(f"Function '{func.__name__}' took {elapsed_time:.4f} seconds to complete.\n")
        return result  # Return the result of the function
    return wrapper

# Write a function that generates text: a reply to the user's message
@measure_time
@torch.inference_mode()
def generate_text(prompt, **kwargs):
    conversation = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, return_tensors="pt").to("cuda")
    generate_ids = model.generate(input_ids, **kwargs)
    new_tokens = generate_ids[0, input_ids.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

In [None]:
print(generate_text("Hello!", max_new_tokens=100))

## Temperature

The temperature parameter in an LLM controls the randomness of its predictions. Lower temperatures make the model more deterministic, while higher temperatures make it more creative and diverse. Here are some general guidelines for setting the temperature:

1. **Low temperature (0.1 - 0.3)**:
   - Results in more deterministic and focused outputs
   - Good for tasks requiring precision, such as factual answers or code generation
   - The model is less likely to produce unexpected or creative responses

2. **Medium temperature (0.4 - 0.7)**:
   - Balances randomness and determinism
   - Suitable for most general-purpose tasks
   - Produces coherent yet somewhat varied outputs

3. **High temperature (0.8 - 1.0)**:
   - Increases creativity and diversity in responses
   - Useful for creative writing, brainstorming, or generating poetry
   - Outputs may be less predictable and more varied

4. **Very high temperature (above 1.0)**:
   - Can produce highly diverse and sometimes nonsensical outputs
   - Generally not recommended for most tasks, but worth experimenting with for highly creative ones

### Practical recommendations:
- **0.7**: Often a good default choice for balanced outputs
- **0.5**: Good for a mix of creativity and coherence
- **0.2**: Ideal for tasks requiring high accuracy and consistency

Adjusting the temperature lets you tune the behavior of the language model for a specific task, so experimenting within these ranges helps achieve the desired output style.
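
Under the hood, temperature divides the logits before the softmax that produces next-token probabilities. The pure-Python sketch below illustrates this on made-up logits for three candidate tokens; `softmax_with_temperature` is a hypothetical helper written for this note, not part of `transformers`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass goes to the top-scoring token, while at high temperature the distribution flattens and less likely tokens get sampled more often.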

In [None]:
# Generate with different temperatures
temperature_values = [0.1, 0.3, 0.5, 0.75, 1.0, 1.5]

for temperature in temperature_values:
    print(f"{temperature=}", end="\n\n")
    print(generate_text("Write me a haiku about deep learning", max_new_tokens=100, temperature=temperature, do_sample=True))
    print("-" * 100)

## Summarization

In [None]:
text = """
The temperature parameter in language models (LLMs) controls the randomness of the predictions. Lower temperatures make the model more deterministic, while higher temperatures make it more creative and diverse. Here are some general guidelines for setting the temperature:

1. **Low Temperature (0.1 - 0.3)**:
   - Results in more deterministic and focused outputs.
   - Good for tasks requiring precision, such as factual answers or code generation.
   - The model is less likely to produce unexpected or creative responses.

2. **Medium Temperature (0.4 - 0.7)**:
   - Balances between randomness and determinism.
   - Suitable for most general-purpose tasks.
   - Produces coherent yet somewhat varied outputs.

3. **High Temperature (0.8 - 1.0)**:
   - Increases creativity and diversity in responses.
   - Useful for creative writing, brainstorming, or generating poetry.
   - Outputs may be less predictable and more varied.

4. **Very High Temperature (above 1.0)**:
   - Can produce highly diverse and sometimes nonsensical outputs.
   - Generally not recommended for most tasks but can be experimented with for highly creative tasks.

### Practical Recommendations:
- **0.7**: Often a good default choice for balanced outputs.
- **0.5**: Good for a mix of creativity and coherence.
- **0.2**: Ideal for tasks requiring high accuracy and consistency.

Adjusting the temperature allows you to fine-tune the behavior of the language model to suit specific needs, so experimenting within these ranges can help you achieve the desired output style.
"""

prompt = f"""Summarize the following text in 2-3 sentences, capturing the main points and key details while maintaining coherence and accuracy. Ensure the summary is concise and informative.

'''
{text}
'''
"""

In [None]:
print(generate_text(prompt, temperature=0.2, do_sample=True, max_new_tokens=100))

## Sentiment analysis

In [None]:
text = "This new GPT4o is a complete disaster. It's slow, inaccurate, and difficult to use. I hate it very much, the Google's Gemma is sooo better."

prompt = f"""Determine the sentiment of this text. Return only sentiment.

'''
{text}
'''
"""

In [None]:
print(generate_text(prompt, temperature=0.2, do_sample=True, max_new_tokens=100))

## Classification

In [None]:
text = "The latest smartphone from Apple has received super positive reviews for its sleek design and powerful performance. But it is very expensive, so think for yourself!"
prompt = f"""Classify the following text into one of the categories: technology, sports, politics. Return only a category.

'''
{text}
'''
"""

In [None]:
print(generate_text(prompt, temperature=0.2, do_sample=True, max_new_tokens=100))
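
Even when asked to return only a category, the model may wrap its answer in markdown or extra words. A small post-processing step can map the raw output onto the allowed labels. This is a minimal sketch with a hypothetical helper `extract_label`; it relies on crude substring matching, which is an assumption that suits short answers like these.

```python
def extract_label(raw_output, allowed_labels):
    """Return the first allowed label mentioned in the model output, else None."""
    text = raw_output.lower()
    # Find where each label first appears, then pick the earliest one.
    positions = {label: text.find(label.lower()) for label in allowed_labels}
    found = {label: pos for label, pos in positions.items() if pos >= 0}
    if not found:
        return None
    return min(found, key=found.get)


categories = ["technology", "sports", "politics"]
print(extract_label("**Technology**\n", categories))
print(extract_label("I am not sure.", categories))
```

Returning `None` for unrecognized outputs makes failures explicit instead of silently assigning a wrong category.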

## Translation

In [None]:
text = "Биршерт Алексей Дмитриевич записывает лекцию и семинар для 9 занятия по курсу Глубинное обучение."
source_language = "russian"
target_language = "spanish"

prompt = f"""Translate this text from {source_language} to {target_language}. Return only translation.

'''
{text}
'''
"""

In [None]:
translation = generate_text(prompt, temperature=0.7, do_sample=True, max_new_tokens=100)

print(translation)

In [None]:
prompt = f"""Translate this text from {target_language} to {source_language}. Return only translation.

'''
{translation}
'''
"""

In [None]:
back_translation = generate_text(prompt, temperature=0.7, do_sample=True, max_new_tokens=100)

print(back_translation)
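
A rough way to compare the back translation with the original is a character-level similarity ratio from the standard library. This is only a crude sketch: `roundtrip_similarity` is a hypothetical helper, and the score reflects surface overlap rather than translation quality.

```python
from difflib import SequenceMatcher


def roundtrip_similarity(original, back_translated):
    """Character-level similarity ratio in [0, 1] between two strings."""
    # Normalize case and whitespace so trivial differences are not penalized.
    a = " ".join(original.lower().split())
    b = " ".join(back_translated.lower().split())
    return SequenceMatcher(None, a, b).ratio()


same = roundtrip_similarity("Deep  learning course", "deep learning course")
different = roundtrip_similarity("Deep learning course", "Swimming pool")
print(same, different)
```

In the notebook it could be called as `roundtrip_similarity(text, back_translation)`, with `text` being the original Russian sentence.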

## Question answering over a text


In [None]:
text = """
Jason Statham is an English actor and former competitive diver, best known for his roles in action-thriller films. Born on July 26, 1967, in Shirebrook, Derbyshire, England, Statham's journey to stardom is as remarkable as the characters he portrays. Before entering the film industry, he was a member of Britain's National Diving Squad for over a decade, competing at world championships and the Commonwealth Games. His rugged good looks and athletic build, combined with his martial arts skills, made him a natural fit for the action genre.

Statham's film career began in 1998 when he was cast in Guy Ritchie's crime comedy "Lock, Stock and Two Smoking Barrels." His performance caught the attention of audiences and critics alike, leading to a follow-up role in Ritchie's "Snatch" (2000), where he starred alongside Brad Pitt and Benicio del Toro. These early roles established him as a reliable actor capable of delivering tough, street-smart characters with a touch of humor.

He gained international fame with his role as Frank Martin in "The Transporter" series (2002-2008), where he performed many of his own stunts, showcasing his skills in martial arts, driving, and combat. This franchise solidified his reputation as a top-tier action star. Statham continued to build on this success with roles in high-profile action films such as "Crank" (2006), "War" (2007), and "Death Race" (2008).

In 2010, Statham joined the ensemble cast of "The Expendables," alongside other action legends like Sylvester Stallone and Arnold Schwarzenegger. The film's success led to two sequels, further cementing his status in Hollywood. He also became part of the "Fast & Furious" franchise, debuting as the villain Deckard Shaw in "Fast & Furious 6" (2013) and reprising the role in subsequent films, including "Furious 7" (2015), "The Fate of the Furious" (2017), and the spin-off "Hobbs & Shaw" (2019).

Statham's appeal lies in his ability to bring authenticity to his roles, performing stunts and fight scenes with a level of realism that resonates with audiences. Off-screen, he is known for his private and low-key lifestyle, a stark contrast to the high-octane characters he portrays. He has been in a long-term relationship with model Rosie Huntington-Whiteley, with whom he shares a son.

Jason Statham's career is a testament to his versatility and dedication to his craft. From his beginnings as a competitive diver to becoming one of Hollywood's most bankable action stars, he has consistently delivered performances that are both compelling and entertaining. His contributions to the action genre have earned him a loyal fan base and a lasting legacy in the film industry.
"""

In [None]:
questions = [
    "What were Jason Statham's professions before he became an actor?",
    "Which film marked the beginning of Jason Statham's film career in 1998?",
    "How did Jason Statham's role in 'The Transporter' series impact his career?",
    "In which film franchise did Jason Statham play the character Deckard Shaw?",
    "Who is Jason Statham's long-term partner, and do they have any children?"
]

In [None]:
for question in questions:
    print(question, end="\n\n")

    prompt = f"""Answer the question based on the context provided:

QUESTION
{question}

'''CONTEXT
{text}
'''
    """
    print(generate_text(prompt, temperature=0.7, do_sample=True, max_new_tokens=100))

    print("-" * 100)

## Summary

1. Loaded the **quantized** Gemma model using the `Transformers` and `bitsandbytes` libraries
2. Solved the same tasks with the **quantized** Gemma and compared the results with the regular (non-quantized) model