**SUMMARIZER**

In this notebook we use Gemma to summarize our long prompts into something that fits FLUX: a maximum of 77 tokens, or roughly 55 words.
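The 77-token and 55-word figures above imply roughly 1.4 tokens per word. As a rough sketch, that ratio can be used to estimate whether a prompt is likely to fit before running a real tokenizer. `estimate_tokens` and `probably_fits` are hypothetical helpers, and the ratio is only the approximation quoted above, not a measured property of any tokenizer.

```python
TOKEN_BUDGET = 77  # token limit quoted above
WORD_BUDGET = 55   # approximate word equivalent quoted above


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the word count using the 77/55 ratio."""
    words = len(text.split())
    # Integer ceiling division avoids floating-point rounding surprises
    return (words * TOKEN_BUDGET + WORD_BUDGET - 1) // WORD_BUDGET


def probably_fits(text: str) -> bool:
    """Return True if the estimated token count stays within the budget."""
    return estimate_tokens(text) <= TOKEN_BUDGET
```

The estimate is only a first filter; an exact count still requires the tokenizer of the target model.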

## 0. Imports and functions

In [2]:
import os
import pickle
import sys

import numpy as np
import torch

# Make the parent directory importable so the local credentials module is found
sys.path.append(os.path.join(os.getcwd(), '../'))
from credentials import Credentials

# Route HTTP(S) traffic through the configured proxy
credentials = Credentials()
os.environ["http_proxy"] = credentials.http_proxy
os.environ["https_proxy"] = credentials.https_proxy

In [3]:
def load_images_dict(images_dict_path: str) -> dict:
    """Loads images dictionary from the path

    Args:
        images_dict_path (str): Path to images dictionary

    Returns:
        dict: Images dictionary
    """
    with open(images_dict_path, 'rb') as fp:
        images_dict = pickle.load(fp)
    return images_dict

In [4]:
images_dict = load_images_dict('images_dict.pkl')

In [5]:
def generate_random_prompt(images_dict: dict) -> str:
    """Generates a prompt from the images dictionary by combining three random elements.

    Args:
        images_dict (dict): Images dictionary

    Returns:
        str: Generated prompt
    """
    # Choose three distinct random keys (one each for person, clothes and scenario)
    keys = list(images_dict.keys())
    chosen_keys = [str(key) for key in np.random.choice(keys, 3, replace=False)]
    print(chosen_keys)

    # Build the prompt; the three descriptions are concatenated without a separator
    prompt = images_dict[chosen_keys[0]]['person']
    prompt += images_dict[chosen_keys[1]]['clothes']
    prompt += images_dict[chosen_keys[2]]['scenario']
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Final prompt (length: {len(prompt.split(' '))}): {prompt}")
    return prompt

In [6]:
random_prompt = generate_random_prompt(images_dict)

['polo2.jpg', 'orange.jpg', 'street.jpg']
Final prompt (length: 125): The image features a person who appears to be a young adult male. He has short, light-colored hair and is wearing glasses.The person is wearing a bright orange hoodie and matching orange socks. Her fashion style can be described as casual and sporty, with a focus on bold, vibrant colors.The scene is set on a city street, with a building featuring a window and a street sign in the background. The person is standing on the sidewalk, posing with one hand on her hip and the other hand resting on her thigh. She is looking directly at the camera, which gives the impression that she is the focal point of the image. The pose and the setting suggest a sense of urban style and confidence.
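In the output above, sentences from different descriptions run together (for example "glasses.The person") because the three parts are concatenated directly. A minimal sketch of one way to avoid this is to join the parts with a single space; `join_descriptions` is a hypothetical helper, not part of the notebook.

```python
def join_descriptions(*parts: str) -> str:
    """Join description fragments with single spaces, skipping empty ones."""
    return " ".join(part.strip() for part in parts if part.strip())
```

Used inside `generate_random_prompt`, this would replace the three `+=` steps with one call that takes the person, clothes and scenario strings.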


## 1. GEMMA

In [7]:
from huggingface_hub import login
login(token=credentials.hf_token)

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,             # or False if you want full precision
    bnb_4bit_quant_type="nf4",     # 'nf4' or 'fp4'
    bnb_4bit_compute_dtype=torch.float16,  # can also try torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
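To get a feel for what 4-bit loading saves, the weight memory can be estimated as parameter count times bits per parameter. This is back-of-the-envelope arithmetic only: `weight_memory_gb` is a hypothetical helper, the 2-billion parameter count is an assumption taken from the "2b" in the model name, and the estimate ignores quantization metadata, activations and the KV cache.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


ASSUMED_PARAMS = 2_000_000_000  # assumption: read off the "2b" in the model name

print(f"fp16 : {weight_memory_gb(ASSUMED_PARAMS, 16):.1f} GB")
print(f"4-bit: {weight_memory_gb(ASSUMED_PARAMS, 4):.1f} GB")
```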

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
device

'cuda'

In [7]:
prompt = 'Summarize this text into about 55 words: "The image features a person who appears to be a woman. She has dark skin and is wearing her hair in an updo.The person is wearing a long, oversized coat that reaches down to her ankles. The coat has a high collar and appears to be made of a heavy fabric, suggesting a fashion style that is both functional and fashionable.The scene is set against a white brick wall with a decorative element that resembles a heart shape. The model is posing on a stool, sitting with one leg crossed over the other, and holding the bag in his lap. He is looking directly at the camera with a neutral expression. The overall fashion style of the image suggests a casual yet stylish look, possibly for a brand that focuses on leather accessories and casual outerwear."'

In [23]:
prompt = generate_random_prompt(images_dict)
prompt = 'Summarize this text into about 55 words: "' + prompt + '"'
print(prompt)

['polo.jpg', 'outdoor.jpg', 'dior.png']
Final prompt (length: 129): The image features a person who appears to be a young woman with long blonde hair. She has a fair complexion and is wearing makeup that includes lipstick and eye makeup.The person is wearing a bright orange dress. The fashion style of the person can be described as summery and vibrant, with a focus on bold colors and light fabrics.The scene is set against a dark gray background. The person is posing in a dynamic manner, with one leg lifted and the other bent at the knee, giving the impression of movement. The person is holding a small, colorful clutch in their right hand, which is raised slightly above their hip. The pose and the way the person is holding the clutch suggest a sense of elegance and style.
Summarize this text into about 55 words: "The image features a person who appears to be a young woman with long blonde hair. She has a fair complexion and is wearing makeup that includes lipstick and eye makeup.The per
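The instruction wrapper used above could be factored into a small function so that the word limit becomes a parameter instead of being hard-coded in the string. The sketch below assumes the same instruction wording; `build_summary_prompt` and `build_messages` are hypothetical names.

```python
def build_summary_prompt(text: str, max_words: int = 55) -> str:
    """Wrap the text in the summarization instruction used in this notebook."""
    return f'Summarize this text into about {max_words} words: "{text}"'


def build_messages(text: str, max_words: int = 55) -> list:
    """Build a single-turn chat message list in the role/content format."""
    return [{"role": "user", "content": build_summary_prompt(text, max_words)}]
```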

In [24]:
messages = [
    {"role": "user", "content": prompt},
]

In [27]:
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(summary)

A young woman with blonde hair and fair skin wears a vibrant orange dress against a dark gray background.  She poses dynamically, one leg lifted and the other bent, holding a colorful clutch.  The fashion is summery and bold, showcasing a sense of elegance and style. 
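The generated summary above contains double spaces between sentences and trailing whitespace. A minimal cleanup sketch, assuming that collapsing every run of whitespace into a single space is acceptable for the downstream prompt (`normalize_whitespace` is a hypothetical helper):

```python
def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs to single spaces and trim both ends."""
    return " ".join(text.split())
```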



In [28]:
len(summary.split(" "))

49
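Note that `split(" ")` also counts the empty strings produced by consecutive spaces and trailing whitespace, so the figure above can be slightly higher than the actual number of words. The sketch below counts on any whitespace instead and adds a simple fallback that cuts the text to a word budget; `count_words` and `truncate_to_words` are hypothetical helpers, and hard truncation is just one possible fallback (it may cut a sentence in the middle).

```python
def count_words(text: str) -> int:
    """Count words separated by any run of whitespace."""
    return len(text.split())


def truncate_to_words(text: str, max_words: int = 55) -> str:
    """Keep at most max_words words, joined by single spaces."""
    return " ".join(text.split()[:max_words])
```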