**SUMMARIZER**

This notebook develops the summarization functions for a production setting.
<br>
During development we tried the following models:
1. BART-LARGE-CNN
2. DeepSeek-Qwen-R1
3. PEGASUS
4. flan-t5-large
<br>

We concluded that flan-t5-large is the best choice.

# 0. Imports

In [1]:
import sys
import os 
sys.path.append(os.path.join(os.getcwd(), '../'))
from credentials import Credentials
credentials = Credentials()
os.environ["http_proxy"] = credentials.http_proxy
os.environ["https_proxy"] = credentials.https_proxy
import numpy as np
import pickle
import torch

In [2]:
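# Explicit imports instead of a wildcard: these are the two names from this module used below
from functions.summarizer import SummParams, summarize_text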


# 1. Google flan-t5-large

Loading the tokenizer and the model.

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [4]:
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

In [5]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
device

'cuda'

Loading the dictionary of image descriptions

In [13]:
def load_images_dict(images_dict_path:str)->dict:
    """Loads images dictionary from the path

    Args:
        images_dict_path (str): Path to images dictionary

    Returns:
        dict: Images dictionary
    """
    with open(images_dict_path, 'rb') as fp:
        images_dict = pickle.load(fp)
    return images_dict

In [14]:
images_dict = load_images_dict('images_dict.pkl')
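The cells below index the dictionary as `images_dict[filename][element]`, with the elements `'person'`, `'clothes'` and `'scenario'`. A minimal sketch of that layout and of a pickle round trip follows; the file names and description texts are invented for illustration, and `save_sample_dict` / `load_sample_dict` are hypothetical helper names.

```python
import os
import pickle
import tempfile

# Hypothetical sample data: only the nesting (filename -> element -> text)
# mirrors how the notebook indexes images_dict; the contents are invented.
sample_images_dict = {
    "example_a.jpg": {
        "person": "The image features a person with short hair.",
        "clothes": "The person is wearing a long coat.",
        "scenario": "The scene is set outdoors.",
    },
    "example_b.jpg": {
        "person": "The image features a person wearing glasses.",
        "clothes": "The person is wearing a polo shirt.",
        "scenario": "The scene is set in a studio.",
    },
}


def save_sample_dict(images_dict: dict, path: str) -> None:
    """Writes the dictionary to disk with pickle."""
    with open(path, "wb") as fp:
        pickle.dump(images_dict, fp)


def load_sample_dict(path: str) -> dict:
    """Reads the dictionary back, the same way load_images_dict does."""
    with open(path, "rb") as fp:
        return pickle.load(fp)


# Round trip through a temporary file so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_images_dict.pkl")
    save_sample_dict(sample_images_dict, sample_path)
    restored = load_sample_dict(sample_path)

print(sorted(restored["example_a.jpg"].keys()))
```

Since `pickle.load` can execute arbitrary code, it is only appropriate for files from a trusted source.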

Generating a random prompt

In [15]:
def generate_random_prompt(images_dict: dict) -> str:
    """Generates a prompt by combining the person, clothes and scenario descriptions of three randomly chosen images

    Args:
        images_dict (dict): Images dictionary

    Returns:
        str: Generated prompt
    """
    # Choose three distinct random images
    keys = list(images_dict.keys())
    chosen_keys = [str(key) for key in np.random.choice(keys, 3, replace=False)]
    print(chosen_keys)

    # Build the prompt: person from the first image, clothes from the second,
    # scenario from the third. The parts are concatenated without a separator.
    prompt = images_dict[chosen_keys[0]]['person']
    prompt += images_dict[chosen_keys[1]]['clothes']
    prompt += images_dict[chosen_keys[2]]['scenario']
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Final prompt (length: {len(prompt.split(' '))}): {prompt}")
    return prompt

In [16]:
prompt = generate_random_prompt(images_dict)

['polo2.jpg', 'trench.png', 'beach.jpg']
Final prompt (length: 99): The image features a person who appears to be a young adult male. He has short, light-colored hair and is wearing glasses.The person is wearing a long, oversized coat that reaches down to her ankles. The coat has a high collar and appears to be made of a heavy fabric, suggesting a fashion style that is both functional and fashionable.The scene is set outdoors, with a basketball hoop in the background. The person is posing in front of the hoop, with her hands on her hips and her head turned to the side, giving a confident and stylish appearance.
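In the output above the three descriptions run together ("glasses.The person"), because they are concatenated without a separator. A possible variant is sketched below, assuming a single space as separator; `build_prompt` and `pick_keys` are hypothetical names, the demo data is invented, and the standard-library `random` module stands in for `np.random.choice`.

```python
import random

ELEMENTS = ("person", "clothes", "scenario")


def pick_keys(images_dict: dict, rng: random.Random) -> list:
    """Picks one distinct image key per element."""
    return rng.sample(sorted(images_dict), len(ELEMENTS))


def build_prompt(images_dict: dict, chosen_keys: list, separator: str = " ") -> str:
    """Joins person, clothes and scenario (one per chosen image) with a separator."""
    parts = [
        images_dict[key][element].strip()
        for key, element in zip(chosen_keys, ELEMENTS)
    ]
    return separator.join(parts)


# Invented demo data with the same nesting as images_dict
demo = {
    "a.jpg": {"person": "He wears glasses.", "clothes": "A red polo.", "scenario": "A beach."},
    "b.jpg": {"person": "She has long hair.", "clothes": "A long coat.", "scenario": "A street."},
    "c.jpg": {"person": "They have short hair.", "clothes": "A suit.", "scenario": "A studio."},
}

demo_keys = pick_keys(demo, random.Random(0))
demo_prompt = build_prompt(demo, ["a.jpg", "b.jpg", "c.jpg"])
print(demo_prompt)
```

Note that adding separators changes the word counts reported by `len(prompt.split(" "))`, so lengths would differ slightly from the ones logged in this notebook.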


In [10]:
def flan_t5_summarize(prompt, context):
    """Summarizes a text into about half his size given a context

    Args:
        prompt (str): Prompt to summarize.
        context (str): Context of the prompt (person, scenario, image...)

    Returns:
        str: Summarized prompt.
    """
    prompt_length = len(prompt.split(" "))
    final_prompt_1 = f'Summarize this {context} comprehensively into {prompt_length//2} tokens "' + prompt + '"'
    print(final_prompt_1)
    target_tokens = prompt_length//2
    min_tokens = target_tokens - 15
    max_tokens = target_tokens + 15
    print(f"Prompt_length: {prompt_length}, target_length: {target_tokens}, max_output: {max_tokens}, min_output: {min_tokens}")

    # First summarization 
    summparams = SummParams(
        text = final_prompt_1,
        tokenizer = tokenizer,
        model = model,
        num_beams = 6,
        max_input_length = 1024,
        min_output_length = max(16, min_tokens),
        max_output_length= max_tokens,
        device = device
    )
    summarized_text_1 = summarize_text(summparams)
    return summarized_text_1
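The length limits in `flan_t5_summarize` are derived from a word count (target = half the words, plus or minus 15), and the minimum actually passed to the model is clamped to at least 16. The arithmetic can be sketched as a small standalone helper; `output_length_bounds` is a hypothetical name, and the defaults simply repeat the constants used above.

```python
def output_length_bounds(prompt_length: int, margin: int = 15, floor: int = 16):
    """Returns (target, effective_min, max) for a prompt of prompt_length words.

    effective_min is what reaches the model: max(floor, target - margin).
    """
    target = prompt_length // 2
    raw_min = target - margin
    max_len = target + margin
    effective_min = max(floor, raw_min)
    return target, effective_min, max_len


for words in (30, 39, 99):
    print(words, output_length_bounds(words))
```

For a 30-word prompt this gives a target of 15 but an effective minimum of 16, so the model is forced to generate more than the length requested in the instruction. The printed `min_output` in the function above shows the unclamped value, which is why it can read 0. These limits are also applied as token counts by `generate`, while the prompt length is measured in words.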


Trying it on each description (person, clothes, scenario) of a randomly chosen image

In [57]:
# Choose three random images (only the first one is used below)
keys = list(images_dict.keys())
chosen_keys = [str(key) for key in np.random.choice(keys, 3, replace=False)]
print(chosen_keys)

# Summarize each description of the first chosen image separately
elements = ['person', 'clothes', 'scenario']
for element in elements:
    prompt = images_dict[chosen_keys[0]][element]
    prompt_length = len(prompt.split(" "))
    # Only descriptions longer than 20 words are summarized
    if prompt_length > 20:
        summary = flan_t5_summarize(prompt, element)
        print(f"Final prompt (length: {len(summary.split(' '))}): {summary}\n")
    else:
        print(f"Final prompt (length: {prompt_length}): {prompt}\n")

['trench.png', 'polo.jpg', 'dior.png']
Summarize this person comprehensively into 15 tokens "The image features a person who appears to be a woman. She has a fair complexion and her hair is styled in a way that it falls over her shoulders."
Prompt_length: 30, target_length: 15, max_output: 30, min_output: 0
Final prompt (length: 23): The image features a person who appears to be a woman. She has a fair complexion and her hair is styled in a

Summarize this clothes comprehensively into 19 tokens "The person is wearing a long, oversized coat that reaches down to her ankles. The coat has a high collar and appears to be made of a heavy fabric, suggesting a fashion style that is both functional and fashionable."
Prompt_length: 39, target_length: 19, max_output: 34, min_output: 4
Final prompt (length: 14): The person is wearing a long, oversized coat that reaches down to her ankles.

Summarize this scenario comprehensively into 38 tokens "The scene is set against a plain, light-colored backg

______

# Extra

In [10]:
final_prompt = 'Summarize into: person description, clothing, pose, background. Text: "' + prompt + '"'

In [18]:
final_prompt =  'Summarize this text. Text: "The scene takes place in a room with orange walls, which complements the persons outfit. The person is sitting on a large, round, orange cushion or ottoman, which is placed against the wall. She is posing in a relaxed manner, with one leg crossed over the other, and her hands resting on her knees. The persons pose and the choice of the orange background create a cohesive and visually appealing image that highlights the clothing and the color scheme."'

In [19]:
final_prompt

'Summarize this text. Text: "The scene takes place in a room with orange walls, which complements the persons outfit. The person is sitting on a large, round, orange cushion or ottoman, which is placed against the wall. She is posing in a relaxed manner, with one leg crossed over the other, and her hands resting on her knees. The persons pose and the choice of the orange background create a cohesive and visually appealing image that highlights the clothing and the color scheme."'

In [12]:
summparams = SummParams(
    text = final_prompt,
    tokenizer = tokenizer,
    model = model,
    num_beams = 6,
    max_input_length = 1024,
    min_output_length = 80,
    max_output_length= 110,
    device = device
)

In [13]:
summarized_text = summarize_text(summparams)
summarized_text

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


'The image features a person who appears to be a woman. She has a fair complexion and her hair is styled in a way that it falls over her shoulders. The person is wearing a bright orange hoodie and matching orange socks. Her fashion style can be described as casual and sporty, with a focus on bold, vibrant colors. The scene takes place in a room with a brick wall in the background.'

In [128]:
len(summarized_text.split(" "))

73

Resummarizing...

In [129]:
final_prompt = 'Summarize into about 77 tokens, including: person description, clothing, pose, background. Text: "' + summarized_text + '"'

In [130]:
final_prompt

'Summarize into about 77 tokens, including: person description, clothing, pose, background. Text: "The image features a male model with short hair and appears to be of African descent. The person is dressed in a suit and tie, with a pink shirt and a blue blazer. The scene is set on a city street, with a window and a street sign in the background. The person is standing on the sidewalk, posing with one hand on her hip and the other hand resting on her thigh."'

In [131]:
summparams = SummParams(
    text = final_prompt,
    tokenizer = tokenizer,
    model = model,
    num_beams = 6, 
    max_input_length = 1024,
    min_output_length = 60,
    max_output_length= 77,
    device = device
)

In [132]:
summarized_text_2 = summarize_text(summparams)
summarized_text_2

'The image features a male model with short hair and appears to be of African descent. The person is dressed in a suit and tie, with a pink shirt and a blue blazer. The scene is set on a city street, with a window and a street sign in the background'

In [134]:
len(summarized_text_2.split(" "))

51
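The lengths above are word counts from `split(" ")`, while the stated goal is a 77-token limit. A minimal sketch of checking a token budget directly is shown below. It assumes a tokenizer object that, like Hugging Face tokenizers, returns a mapping with an `input_ids` list when called on a string; `count_tokens`, `fits_budget` and the `WhitespaceTokenizer` stand-in are hypothetical and exist only so the sketch runs without downloading a model.

```python
def count_tokens(text: str, tokenizer) -> int:
    """Number of token ids the tokenizer produces for text."""
    return len(tokenizer(text)["input_ids"])


def fits_budget(text: str, tokenizer, budget: int = 77) -> bool:
    """True if the text tokenizes into at most `budget` tokens."""
    return count_tokens(text, tokenizer) <= budget


class WhitespaceTokenizer:
    """Hypothetical stand-in: one token per whitespace-separated word."""

    def __call__(self, text: str) -> dict:
        return {"input_ids": list(range(len(text.split())))}


stub = WhitespaceTokenizer()
print(count_tokens("a short example summary", stub))
print(fits_budget("a short example summary", stub, budget=3))
```

With a real tokenizer the count usually exceeds the word count and may include special tokens, and it should be the tokenizer of whichever model imposes the 77-token limit, which is not necessarily the summarizer's own.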

In [148]:
def summarize_prompt(prompt: str, tokenizer, model, device) -> str:
    """Summarizes the generated prompt twice in sequence with the given model, so that it fits into 77 tokens (approx. 55 words).

    Args:
        prompt (str): Generated prompt.
        tokenizer: Tokenizer.
        model: Model.
        device: Device.

    Returns:
        str: Prompt summarized into 77 tokens (approx. 55 words)
    """
    final_prompt_1 = 'Summarize into: person description, clothing, pose, background. Text: "' + prompt + '"'
    # First summarization
    summparams = SummParams(
        text = final_prompt_1,
        tokenizer = tokenizer,
        model = model,
        num_beams = 6,
        max_input_length = 1024,
        min_output_length = 80,
        max_output_length= 110,
        device = device
    )
    summarized_text_1 = summarize_text(summparams)
    print(f"Summary 1 (length: {len(summarized_text_1.split(' '))}): {summarized_text_1}")

    # Second summarization: reuse the parameters, changing only the text and the output length bounds
    final_prompt_2 = 'Summarize into 77 tokens, including: person description, clothing, pose, background. Text: "' + summarized_text_1 + '"'
    summparams.text = final_prompt_2
    summparams.min_output_length = 60
    summparams.max_output_length = 77
    summarized_text_2 = summarize_text(summparams)
    print(f"Summary 2 (length: {len(summarized_text_2.split(' '))}): {summarized_text_2}")

    return summarized_text_2

In [151]:
prompt = generate_random_prompt(images_dict)
s = summarize_prompt(prompt, tokenizer, model, device)

['gucci.jpg', 'trench.png', 'loewe.jpg']
Final prompt (length: 136): The image features a person who appears to be a woman. She has dark skin and is wearing her hair in an updo.The person is wearing a long, oversized coat that reaches down to her ankles. The coat has a high collar and appears to be made of a heavy fabric, suggesting a fashion style that is both functional and fashionable.The scene is set against a white brick wall with a decorative element that resembles a heart shape. The model is posing on a stool, sitting with one leg crossed over the other, and holding the bag in his lap. He is looking directly at the camera with a neutral expression. The overall fashion style of the image suggests a casual yet stylish look, possibly for a brand that focuses on leather accessories and casual outerwear.
Summary 1 (length: 85): The image features a person who appears to be a woman. She has dark skin and is wearing her hair in an updo. The person is wearing a long, oversized coat that

## Tests (summarize separately, then combine)

In [58]:
def prueba_summarize(prompt):
    """Summarizes a prompt to about half its length, retrying with a second instruction.

    If the first summary does not end with a period (it was probably cut
    mid-sentence), the prompt is summarized again with a more detailed instruction.

    Args:
        prompt (str): Prompt to summarize.

    Returns:
        str: Summarized prompt.
    """
    prompt_length = len(prompt.split(" "))
    target_tokens = prompt_length//2
    final_prompt_1 = f'Summarize into {target_tokens} tokens "' + prompt + '"'
    final_prompt_2 = f'Summarize this description comprehensively into {target_tokens} tokens "' + prompt + '"'
    print(final_prompt_1)
    min_tokens = target_tokens - 15
    max_tokens = target_tokens + 15
    print(f"Prompt_length: {prompt_length}, target_length: {target_tokens}, max_output: {max_tokens}, min_output: {min_tokens}")

    summparams = SummParams(
        text = final_prompt_1,
        tokenizer = tokenizer,
        model = model,
        num_beams = 6,
        max_input_length = 1024,
        min_output_length = max(16, min_tokens),
        max_output_length= max_tokens,
        device = device
    )

    # Summarize with the first instruction
    summarized_text_1 = summarize_text(summparams)
    # endswith is safe on an empty string, unlike indexing with [-1]
    if summarized_text_1.endswith('.'):
        return summarized_text_1

    # Otherwise retry with the second instruction
    summparams.text = final_prompt_2
    return summarize_text(summparams)
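The check for a trailing period above detects summaries that were cut off when generation hit `max_length`. An alternative to re-running the model is to trim the text back to its last complete sentence. The helper below is a hypothetical sketch of that idea and is not part of `functions.summarizer`; it treats `.`, `!` and `?` as sentence ends, so abbreviations or decimals would confuse it.

```python
def trim_to_last_sentence(text: str) -> str:
    """Drops a trailing sentence fragment, keeping text up to the last '.', '!' or '?'.

    Returns the text unchanged if it already ends a sentence or contains no
    sentence-ending mark at all.
    """
    text = text.strip()
    if not text or text[-1] in ".!?":
        return text
    cut = max(text.rfind(mark) for mark in ".!?")
    return text[: cut + 1] if cut != -1 else text


truncated = (
    "The image features a person who appears to be a woman. "
    "She has a fair complexion and her hair is styled in a"
)
print(trim_to_last_sentence(truncated))
```

The trade-off is that trimming silently discards information, whereas retrying with a different instruction may recover it at the cost of another generation pass.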


In [62]:
def prueba_summarize(prompt, context):
    """Summarizes a text into about half his size given a context

    Args:
        prompt (str): Prompt to summarize.
        context (str): Context of the prompt (person, scenario, image...)

    Returns:
        str: Summarized prompt.
    """
    prompt_length = len(prompt.split(" "))
    final_prompt_1 = f'Summarize this {context} comprehensively into {prompt_length//2} tokens "' + prompt + '"'
    print(final_prompt_1)
    target_tokens = prompt_length//2
    min_tokens = target_tokens - 15
    max_tokens = target_tokens + 15
    print(f"Prompt_length: {prompt_length}, target_length: {target_tokens}, max_output: {max_tokens}, min_output: {min_tokens}")

    # First summarization 
    summparams = SummParams(
        text = final_prompt_1,
        tokenizer = tokenizer,
        model = model,
        num_beams = 6,
        max_input_length = 1024,
        min_output_length = max(16, min_tokens),
        max_output_length= max_tokens,
        device = device
    )
    summarized_text_1 = summarize_text(summparams)
    return summarized_text_1


In [96]:
# Choose three random images (only the first one is used below)
keys = list(images_dict.keys())
chosen_keys = [str(key) for key in np.random.choice(keys, 3, replace=False)]
print(chosen_keys)

# Summarize each description of the first chosen image separately
elements = ['person', 'clothes', 'scenario']
for element in elements:
    prompt = images_dict[chosen_keys[0]][element]
    prompt_length = len(prompt.split(" "))
    # Only descriptions longer than 20 words are summarized
    if prompt_length > 20:
        summary = prueba_summarize(prompt, element)
        print(f"Final prompt (length: {len(summary.split(' '))}): {summary}\n")
    else:
        print(f"Final prompt (length: {prompt_length}): {prompt}\n")

['beach.jpg', 'trench.png', 'loewe.jpg']
Summarize this person comprehensively into 15 tokens "The image features a person who appears to be a woman. She has dark hair, and her skin tone is light. She is wearing a yellow hoodie and yellow sweatpants."
Prompt_length: 30, target_length: 15, max_output: 30, min_output: 0
Final prompt (length: 17): The image features a person who appears to be a woman with dark hair and light skin.

Final prompt (length: 17): The person is wearing a casual, sporty fashion style, characterized by the matching yellow hoodie and sweatpants.

Summarize this scenario comprehensively into 20 tokens "The scene is set outdoors, with a basketball hoop in the background. The person is posing in front of the hoop, with her hands on her hips and her head turned to the side, giving a confident and stylish appearance."
Prompt_length: 40, target_length: 20, max_output: 35, min_output: 5
Final prompt (length: 10): A person is posing in front of a basketball hoop.
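The loop above summarizes each description separately and prints the results; the combining step named in the section heading could look like the sketch below. `summarize_elements` is a hypothetical helper, the 20-word threshold repeats the one used in the loop, and `first_sentence` is a trivial stand-in for the model call so the example runs without a GPU.

```python
ELEMENT_ORDER = ("person", "clothes", "scenario")


def summarize_elements(descriptions: dict, summarize_fn, min_words: int = 20) -> str:
    """Summarizes each long description on its own, then joins the parts.

    summarize_fn(text, element) is only called for descriptions longer than
    min_words words; shorter ones are kept as they are.
    """
    parts = []
    for element in ELEMENT_ORDER:
        text = descriptions[element]
        if len(text.split()) > min_words:
            text = summarize_fn(text, element)
        parts.append(text.strip())
    return " ".join(parts)


def first_sentence(text: str, element: str) -> str:
    """Stand-in summarizer (an assumption for the demo): keeps the first sentence."""
    head = text.split(". ")[0]
    return head if head.endswith(".") else head + "."


demo_descriptions = {
    "person": "A woman with dark hair.",
    "clothes": (
        "The person is wearing a long coat. It has a high collar and appears to be "
        "made of a heavy fabric that looks warm and practical."
    ),
    "scenario": "A plain studio background.",
}

combined = summarize_elements(demo_descriptions, first_sentence)
print(combined)
```

In the notebook, `summarize_fn` would be a function such as `prueba_summarize(prompt, context)`, which has the same two-argument shape.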



In [68]:
def summarize_text(summparams: SummParams) -> str:
    """Summarizes text

    Args:
        summparams (SummParams): Parameters for text summarization (dataclass)

    Returns:
        str: Summarized text
    """
    # Tokenize; truncation=True makes the max_length cut explicit and silences the tokenizer warning
    inputs = summparams.tokenizer(
        [summparams.text], max_length=summparams.max_input_length, truncation=True, return_tensors='pt'
    ).to(summparams.device)
    # Beam search with the requested output length bounds
    summary_ids = summparams.model.generate(
        inputs['input_ids'], attention_mask=inputs['attention_mask'], num_beams=summparams.num_beams,
        min_length=summparams.min_output_length, max_length=summparams.max_output_length, early_stopping=True
    )
    # decode has no use for a truncation argument, so it is dropped here
    return summparams.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
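`SummParams` is imported from `functions.summarizer` and never shown in this notebook. Judging only from how it is constructed and mutated here, it could look roughly like the dataclass below. This is a reconstruction under that assumption: the class is named `SummParamsSketch` to avoid shadowing the real one, and the default values are guesses rather than the module's actual defaults.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass
class SummParamsSketch:
    """Hypothetical reconstruction of the parameter container used in this notebook."""

    text: str
    tokenizer: Any
    model: Any
    num_beams: int = 6                # assumed default
    max_input_length: int = 1024      # assumed default
    min_output_length: int = 16       # assumed default
    max_output_length: int = 77       # assumed default
    device: str = "cpu"               # assumed default


# First pass parameters (tokenizer and model left as None for the demo)
first_pass = SummParamsSketch(
    text="first instruction",
    tokenizer=None,
    model=None,
    min_output_length=80,
    max_output_length=110,
)

# Second pass: copy with a few fields changed instead of mutating in place
second_pass = replace(
    first_pass,
    text="second instruction",
    min_output_length=60,
    max_output_length=77,
)
print(second_pass)
```

Using `dataclasses.replace` keeps the first-pass parameters intact, which can make a two-stage pipeline easier to debug than reassigning fields on one shared object.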

In [118]:
#prompt = "The scene takes place in a room with orange walls, which complements the person's outfit. The person is sitting on a large, round, orange cushion or ottoman, which is placed against the wall. She is posing in a relaxed manner, with one leg crossed over the other, and her hands resting on her knees. The person's pose and the choice of the orange background create a cohesive and visually appealing image that highlights the clothing and the color scheme."
#prompt = "The image features a person who appears to be a woman. She has light skin and is wearing makeup that includes dark lipstick and eye makeup. Her hairstyle is a short, blonde bob."
#prompt = "The person is dressed in a suit and tie, with a pink shirt and a blue blazer. The fashion style can be described as formal or business casual."
#prompt = "The image features a person who appears to be a woman. She has blonde hair and is wearing makeup that includes dark lipstick."
prompt = "The person is wearing a bright orange dress. The fashion style of the person can be described as summery and vibrant, with a focus on bold colors and light fabrics."
#prompt = "The scene is set in a room with a vintage or classic aesthetic. In the background, there are framed pictures on the wall, including one that appears to be a painting of a nude woman. The model is posing in front of the wall, standing between two chairs with a relaxed yet poised posture. He is looking directly at the camera, which suggests a confident and polished demeanor."
#prompt = "The scene is set against a white brick wall with a decorative element that resembles a heart shape. The model is posing on a stool, sitting with one leg crossed over the other, and holding the bag in his lap. He is looking directly at the camera with a neutral expression. The overall fashion style of the image suggests a casual yet stylish look, possibly for a brand that focuses on leather accessories and casual outerwear."
#prompt = "The scene is set against a white brick wall with a decorative element that resembles a heart shape. The model is posing on a stool, sitting with one leg crossed over the other, and holding the bag in his lap."
#prompt = "The scene takes place in a room with orange walls, which complements the person's outfit. The person is sitting on a large, round, orange cushion or ottoman, which is placed against the wall. She is posing in a relaxed manner, with one leg crossed over the other, and her hands resting on her knees. The person's pose and the choice of the orange background create a cohesive and visually appealing image that highlights the clothing and the color scheme."
#prompt = "The image features a person who appears to be a young adult female. She has dark skin and her hair is styled in a way that it falls over her shoulders."
#prompt = "Maya is a spirited graphic designer in her late twenties, with a penchant for bold colors and unconventional ideas. She’s always sketching concepts in her notebook, humming softly as she works. Curious and empathetic, Maya thrives on collaboration, but she values her quiet moments with coffee and a good book, finding inspiration in the smallest details of life."
#prompt = "David is a middle-aged history professor with a calm demeanor and sharp intellect. His salt-and-pepper hair frames a face lined from years of laughter and thoughtful contemplation. Passionate about ancient civilizations, he captivates students with stories of the past. Outside the classroom, he enjoys gardening, jazz music, and long evening walks through the city streets."
prompt_length = len(prompt.split(" "))
target_tokens = prompt_length//2
final_prompt_1 = f'Summarize this person comprehensively into {target_tokens} tokens "' + prompt + '"'
min_tokens = target_tokens - 15
max_tokens = target_tokens + 15
print(final_prompt_1)

# First summarization 
summparams = SummParams(
    text = final_prompt_1,
    tokenizer = tokenizer,
    model = model,
    num_beams = 2,
    max_input_length = 1024,
    min_output_length = max(16, min_tokens),
    max_output_length= min(max_tokens, int(0.8*prompt_length)),
    device = device
)

summarized_text_1 = summarize_text(summparams)
print(summarized_text_1)
print(len(summarized_text_1.split(" ")))



Summarize this person comprehensively into 15 tokens "The person is wearing a bright orange dress. The fashion style of the person can be described as summery and vibrant, with a focus on bold colors and light fabrics."
The person is wearing a bright orange dress. The fashion style of the person can be described as summery and
20


In [75]:
# Same generation as summarize_text, with a repetition penalty added
inputs = summparams.tokenizer([summparams.text], max_length=summparams.max_input_length, truncation=True, return_tensors='pt').to(summparams.device)
summary_ids = summparams.model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], num_beams=summparams.num_beams, min_length=summparams.min_output_length, max_length=summparams.max_output_length, early_stopping=True, repetition_penalty=2.0)
summarized_text = summparams.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summarized_text)

The scene takes place in a room with orange walls, which complements the person's outfit. The person is sitting on a large, round, orange cushion or ottoman, which is placed against the wall.


In [None]:
# //2 +-10
# Columns: result (yes/no/failed), prompt_length, prompt_length//2, min_output, max_output
relationships = [
    ["yes", 78, 39, 29, 49],
    ["yes", 30, 15, 5, 25],
    ["yes", 31, 15, 5, 25],
    ["yes", 76, 38, 28, 48],
    ["yes", 40, 20, 10, 30],
    ["yes", 42, 21, 11, 31],
    ["yes", 54, 27, 17, 37],
    ["yes", 28, 14, 4, 24]
]

# //2 +-10 with max(16, min) and min(36, max)
relationships_2 = [
    ["no", 78, 39, 29, 49],
    ["no", 68, 34, 24, 44]
]
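Hand-kept logs like the ones above are easy to mistype, and every numeric column follows from `prompt_length` and the `//2 +-10` rule, so the rows can be checked mechanically. The sketch below uses hypothetical helper names and a few illustrative rows, the last of which has its minimum and maximum deliberately swapped.

```python
def expected_bounds(prompt_length: int, margin: int = 10):
    """Returns (target, min_output, max_output) under the '//2 +- margin' rule."""
    target = prompt_length // 2
    return target, target - margin, target + margin


def inconsistent_rows(rows: list, margin: int = 10) -> list:
    """Returns the rows whose numeric columns do not follow from prompt_length."""
    bad = []
    for row in rows:
        _result, prompt_length, target, min_out, max_out = row
        if (target, min_out, max_out) != expected_bounds(prompt_length, margin):
            bad.append(row)
    return bad


# Illustrative rows: result, prompt_length, prompt_length//2, min_output, max_output
sample_rows = [
    ["yes", 78, 39, 29, 49],
    ["yes", 30, 15, 5, 25],
    ["yes", 76, 38, 48, 28],  # min and max swapped on purpose
]

print(inconsistent_rows(sample_rows))
```

Storing only the result and `prompt_length` and deriving the other columns would remove this class of error entirely.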

In [36]:
18*1.4

25.2

In [255]:
prompt_length//2

39

In [226]:
print(f"Summary 1 (length: {len(summarized_text_1.split(" "))}): {summarized_text_1}")

Summary 1 (length: 11): The model is wearing a brown leather jacket and black pants.


## GEMMA

In [None]:
from huggingface_hub import login
login(token=credentials.hf_token)


In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# # Load model directly
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

In [None]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,             # or False if you want full precision
    bnb_4bit_quant_type="nf4",     # 'nf4' or 'fp4'
    bnb_4bit_compute_dtype=torch.float16,  # can also try torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=quantization_config,
    dtype=torch.float16,  # `torch_dtype` is deprecated in this transformers version
    low_cpu_mem_usage=True
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
device

'cuda'

In [7]:
prompt = 'Summarize this text into about 55 words: "The image features a person who appears to be a woman. She has dark skin and is wearing her hair in an updo.The person is wearing a long, oversized coat that reaches down to her ankles. The coat has a high collar and appears to be made of a heavy fabric, suggesting a fashion style that is both functional and fashionable.The scene is set against a white brick wall with a decorative element that resembles a heart shape. The model is posing on a stool, sitting with one leg crossed over the other, and holding the bag in his lap. He is looking directly at the camera with a neutral expression. The overall fashion style of the image suggests a casual yet stylish look, possibly for a brand that focuses on leather accessories and casual outerwear."'

In [23]:
prompt = generate_random_prompt(images_dict)
prompt = 'Summarize this text into about 55 words: "' + prompt + '"'
print(prompt)

['polo.jpg', 'outdoor.jpg', 'dior.png']
Final prompt (length: 129): The image features a person who appears to be a young woman with long blonde hair. She has a fair complexion and is wearing makeup that includes lipstick and eye makeup.The person is wearing a bright orange dress. The fashion style of the person can be described as summery and vibrant, with a focus on bold colors and light fabrics.The scene is set against a dark gray background. The person is posing in a dynamic manner, with one leg lifted and the other bent at the knee, giving the impression of movement. The person is holding a small, colorful clutch in their right hand, which is raised slightly above their hip. The pose and the way the person is holding the clutch suggest a sense of elegance and style.
Summarize this text into about 55 words: "The image features a person who appears to be a young woman with long blonde hair. She has a fair complexion and is wearing makeup that includes lipstick and eye makeup.The per

In [24]:
messages = [
    {"role": "user", "content": prompt},
]
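The instruction string and the single-turn `messages` list are built by hand in the cells above. A small sketch of wrapping both steps, together with a rough word-count check on the result, is given below; `build_summary_messages`, `within_word_limit` and the 20% tolerance are assumptions made for illustration.

```python
def build_summary_messages(text: str, word_limit: int = 55) -> list:
    """Builds a single-turn chat message list asking for a summary of `text`."""
    instruction = f'Summarize this text into about {word_limit} words: "{text}"'
    return [{"role": "user", "content": instruction}]


def within_word_limit(summary: str, word_limit: int = 55, tolerance: float = 0.2) -> bool:
    """True if the summary has at most word_limit * (1 + tolerance) words."""
    return len(summary.split()) <= word_limit * (1 + tolerance)


demo_messages = build_summary_messages("A woman in an orange dress poses against a gray wall.")
print(demo_messages[0]["content"])
print(within_word_limit("A woman in an orange dress."))
```

The returned list has the shape that `tokenizer.apply_chat_template` is given in the next cells, so it could replace the manual string concatenation.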

In [25]:
# model = model.eval()

In [27]:
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(summary)

A young woman with blonde hair and fair skin wears a vibrant orange dress against a dark gray background.  She poses dynamically, one leg lifted and the other bent, holding a colorful clutch.  The fashion is summery and bold, showcasing a sense of elegance and style. 



In [28]:
len(summary.split(" "))

49