**IMG 2 TEXT**

This notebook tries the LLaVA model with several prompts, then uses the chosen prompt to generate a dictionary with the descriptions of "person", "clothes" and "scenario" for each photo in the "../data" folder.

# 0. Imports

In [49]:
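import os
import pickle
import sys

# Make the parent directory importable so the local credentials module can be found
sys.path.append(os.path.join(os.getcwd(), '../'))
from credentials import Credentials

# Route HTTP(S) traffic through the proxy defined in the credentials
credentials = Credentials()
os.environ["http_proxy"] = credentials.http_proxy
os.environ["https_proxy"] = credentials.https_proxy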


# 1. LLaVA

## Loading the model

In [2]:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration, BitsAndBytesConfig
import torch
from PIL import Image

In [3]:
#processor
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [4]:
# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load model with quantization
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config,  # <-- important
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2"
)

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
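
As a rough sanity check on why the model is loaded in 4-bit, the weight storage can be estimated from the parameter count. This is a back-of-the-envelope sketch: the round figure of 7e9 parameters is an assumption taken from the "7b" in the model name, and the estimate ignores quantization constants, any layers kept in higher precision, activations and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: round figure for a "7b" model, not an exact count

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 2 bytes per parameter
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 0.5 bytes per parameter
print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the fp16 footprint, which is the main reason for the quantization config above.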

In [5]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [6]:
model = model.to(device)

## Prompt 1: What is shown in this image?

In [None]:
conversation = [
    {

      "role": "user",
      "content": [
          {"type": "text", "text": "What is shown in this image?"},
          {"type": "image"},
        ],
    },
]
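
Every prompt in this notebook uses the same single-turn structure: one text part followed by one image placeholder. A small helper could build it from the question alone. This is only a sketch; `build_conversation` is a hypothetical name and is not used elsewhere in the notebook.

```python
from typing import Any


def build_conversation(question: str) -> list[dict[str, Any]]:
    """Return a single-turn chat with one text part followed by one image placeholder."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image"},
            ],
        },
    ]


conversation = build_conversation("What is shown in this image?")
```

The result has the same shape as the hand-written `conversation` lists, so it can be passed to `processor.apply_chat_template` in the same way.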

Duration:
- 100 tokens --> 1.5 mins (better to run several 1.5-minute generations than one long one)

In [14]:
# prepare image and text prompt, using the appropriate prompt template
image = Image.open("../data/polo.jpg").convert("RGB")

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {

      "role": "user",
      "content": [
          {"type": "text", "text": "What is shown in this image?"},
          {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]  
What is shown in this image? [/INST] The image shows a woman standing in front of a colorful, patterned backdrop. She is wearing a white blazer over a floral top and has long blonde hair. The backdrop has a patriotic theme, featuring what appears to be the American flag. The text on the image reads "POLO Ralph Lauren" and "SUMMER STARTS HERE SHOP NOW," suggesting that this is an advertisement for the summer collection of the Ralph Lauren brand
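
The decoded output above echoes the prompt before the answer, with the answer starting right after the `[/INST]` marker. A minimal sketch for keeping only the answer is shown below; `extract_answer` is a hypothetical helper name, and it assumes the marker does not appear inside the answer itself.

```python
def extract_answer(decoded: str, marker: str = "[/INST]") -> str:
    """Return the text after the last instruction marker, or the whole string if the marker is absent."""
    _, _, answer = decoded.rpartition(marker)
    return answer.strip()


decoded = "[INST]  \nWhat is shown in this image? [/INST] The image shows a woman."
print(extract_answer(decoded))
```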


## Prompt 2: Explain this image from a marketing perspective

Generation times by `max_new_tokens`:
- 100 tokens -> 1 min 30 s
- 200 tokens -> 2 min
- 300 tokens -> 2 min 47 s
- 500 tokens -> 3 min 47 s (chosen setting)
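
Timings like the ones above can be collected with a small wrapper around the call being measured. This is a sketch, not how the original numbers were obtained; `timed` is a hypothetical helper, and `sum` stands in for the real generation call.

```python
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in for the real call, e.g. timed(model.generate, **inputs, max_new_tokens=100)
result, seconds = timed(sum, [1, 2, 3])
print(f"result={result}, elapsed={seconds:.4f}s")
```

`time.perf_counter` measures wall-clock time, so the figure includes everything that happens during the call, not only GPU compute.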

In [18]:
# prepare image and text prompt, using the appropriate prompt template
image = Image.open("../data/polo.jpg").convert("RGB")

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {

      "role": "user",
      "content": [
          {"type": "text", "text": "Explain this image from a marketing perspective:"},
          {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=500)

print(processor.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]  
Explain this image from a marketing perspective: [/INST] This image is a promotional advertisement for Polo Ralph Lauren's Summer collection. From a marketing perspective, the image is designed to showcase the brand's style and appeal to potential customers. Here are some key elements that contribute to the effectiveness of this advertisement:

1. **Brand Identity**: The Polo logo is prominently displayed at the top, reinforcing brand recognition. The use of the brand name in a large, bold font also helps to establish a strong brand presence.

2. **Seasonal Appeal**: The text "SUMMER STARTS HERE" suggests that the image is part of a seasonal campaign, aiming to capture the attention of consumers who are looking for summer fashion.

3. **Product Promotion**: The model is wearing a white blazer, which is likely a key piece from the summer collection. By showcasing this garment, the advertisement is promoting the product to potential buyers.

4. **Lifestyle Imagery**: The backgro

In [19]:
# prepare image and text prompt, using the appropriate prompt template
image = Image.open("../data/lacoste.png").convert("RGB")

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {

      "role": "user",
      "content": [
          {"type": "text", "text": "Explain this image from a marketing perspective:"},
          {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=500)

print(processor.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]  
Explain this image from a marketing perspective: [/INST] This image appears to be a promotional advertisement for Lacoste, a French sportswear brand. The ad features two individuals standing on a train platform, dressed in Lacoste's signature white polo shirts and white pants, with the Lacoste logo prominently displayed. The setting suggests a moment of leisure or travel, which aligns with the brand's image of sporty elegance and lifestyle.

From a marketing perspective, the ad is designed to showcase the brand's clothing in a real-life context, emphasizing the versatility and style of the Lacoste polo shirts and pants. The use of two models, one male and one female, suggests that the brand's clothing is unisex and appeals to a wide range of consumers.

The choice of location, a train platform, could be symbolic of the brand's association with sport and travel, as well as the idea of being on the move or ready for any occasion. The text "Les Ardentes Virtuoses" and "Harlem 125

In [20]:
# prepare image and text prompt, using the appropriate prompt template
image = Image.open("../data/trench.png").convert("RGB")

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {

      "role": "user",
      "content": [
          {"type": "text", "text": "Explain this image from a marketing perspective:"},
          {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=500)

print(processor.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]  
Explain this image from a marketing perspective: [/INST] This image appears to be an advertisement for a clothing brand. From a marketing perspective, the following elements are noteworthy:

1. **Visual Appeal**: The model is dressed in a chic, oversized trench coat, which is a fashionable choice. The coat's length and the model's pose suggest a sense of style and sophistication.

2. **Message**: The text "THIS TRENCH IS 30 YEARS OLD" is a play on words, implying that the trench coat is a classic, timeless piece. The phrase "buy clothes that will last longer than the trends" aligns with the brand's message, suggesting that investing in quality, long-lasting clothing is a better choice than following fleeting fashion trends.

3. **Branding**: The text "BE OLDER." is a call to action that encourages the viewer to consider the value of age and experience in their clothing choices. This phrase is likely to resonate with consumers who value quality and durability over fast fashion.

## Prompt 3: Analyze elements

### image 1 (innovative)

In [8]:
# prepare image and text prompt, using the appropriate prompt template
image = Image.open("../data/innovative.png").convert("RGB")

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": (
                    "This is a fashion banner, it has an image and probably some background and text. Focus only on the image and answer very briefly:\n"
                    "1. short answer: where is the scene and what is in the background?\n"
                    "2. short answer: how many people are in the image? where are they placed and how are they interacting with the scenario? which shot (medium, long...)?\n"
                    "3. short answer: what is their fashion style\n"
                    "4. short answer: what are the colors and emotions of this image?"
                )
            },
            {"type": "image"}
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=500)

print(processor.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]  
This is a fashion banner, it has an image and probably some background and text. Focus only on the image and answer very briefly:
1. short answer: where is the scene and what is in the background?
2. short answer: how many people are in the image? where are they placed and how are they interacting with the scenario? which shot (medium, long...)?
3. short answer: what is their fashion style
4. short answer: what are the colors and emotions of this image? [/INST] 1. The scene is set against a black metal fence.
2. There are two people in the image. They are standing in front of the fence, leaning against it. The shot appears to be a medium shot, capturing them from the waist up.
3. The fashion style of the two people is casual and colorful, with one person wearing a pink jacket and the other in a yellow jacket.
4. The colors in the image are bright and vibrant, with the pink and yellow jackets standing out against the black fence. The emotions conveyed by the image are energetic

### image 2 (trench)

In [9]:
# prepare image and text prompt, using the appropriate prompt template
image = Image.open("../data/trench.png").convert("RGB")

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": (
                    "This is a fashion banner, it has an image and probably some background and text. Focus only on the image and answer very briefly:\n"
                    "1. short answer: where is the scene and what is in the background?\n"
                    "2. short answer: how many people are in the image? where are they placed and how are they interacting with the scenario? which shot (medium, long...)?\n"
                    "3. short answer: what is their fashion style\n"
                    "4. short answer: what are the colors and emotions of this image?"
                )
            },
            {"type": "image"}
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=500)

print(processor.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]  
This is a fashion banner, it has an image and probably some background and text. Focus only on the image and answer very briefly:
1. short answer: where is the scene and what is in the background?
2. short answer: how many people are in the image? where are they placed and how are they interacting with the scenario? which shot (medium, long...)?
3. short answer: what is their fashion style
4. short answer: what are the colors and emotions of this image? [/INST] 1. The scene appears to be a studio setting, with a plain white background.
2. There is one person in the image, a woman. She is standing in the center of the frame, looking directly at the camera. She is wearing a long coat and has her head covered with the coat's collar.
3. The fashion style of the image is minimalist and modern, with a focus on the coat as the main piece of clothing.
4. The colors in the image are neutral, with the woman's coat being the main focal point. The emotions conveyed by the image are likely

### image 3 (10magazine)

In [10]:
# prepare image and text prompt, using the appropriate prompt template
image = Image.open("../data/10magazine.jpg").convert("RGB")

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": (
                    "This is a fashion banner, it has an image and probably some background and text. Focus only on the image and answer very briefly:\n"
                    "1. short answer: where is the scene and what is in the background?\n"
                    "2. short answer: how many people are in the image? where are they placed and how are they interacting with the scenario? which shot (medium, long...)?\n"
                    "3. short answer: what is their fashion style\n"
                    "4. short answer: what are the colors and emotions of this image?"
                )
            },
            {"type": "image"}
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=500)

print(processor.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]  
This is a fashion banner, it has an image and probably some background and text. Focus only on the image and answer very briefly:
1. short answer: where is the scene and what is in the background?
2. short answer: how many people are in the image? where are they placed and how are they interacting with the scenario? which shot (medium, long...)?
3. short answer: what is their fashion style
4. short answer: what are the colors and emotions of this image? [/INST] 1. The scene is set in a natural outdoor environment, possibly a park or garden, with trees and a bench in the background.
2. There are two people in the image. They are placed on opposite sides of the bench, with one person sitting on the bench and the other standing behind it. They are interacting with the scenario by posing for the photo. The shot appears to be a medium shot, focusing on the upper body and the lower body of the individuals.
3. The fashion style of the individuals is formal and elegant, with the perso

## Prompt 4: Reverse prompting

In [8]:
# prepare image and text prompt, using the appropriate prompt template
image = Image.open("../data/innovative.png").convert("RGB")

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": (
                    "Give me a prompt to generate this image in about 75 words"
                )
            },
            {"type": "image"}
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]  
Give me a prompt to generate this image in about 75 words [/INST] "Create an image featuring two individuals standing in front of a black metal fence. The person on the left has a vibrant pink jacket and yellow pants, while the person on the right is wearing a pink hoodie and white pants. Both are looking directly at the camera with a serious expression. Above them, superimpose the text 'Revolutionizing Fashion Marketing' in bold, white font. Below them, add the text 'Innovative campaigns that changed the game' in a smaller


# FINAL: LLAVA x PROMPTS

In [15]:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration, BitsAndBytesConfig
import torch
from PIL import Image
import os
import time

In [6]:
#processor
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [7]:
# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load model with quantization
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config,  # <-- important
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2"
)

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [9]:
model = model.to(device)

The idea now is to run the prompt below on every image in ../data and store the outputs in a dictionary. The goal is to combine the descriptions of different images (person, clothes, scenario) to create new images using AI.

First, let's retrieve the list of images

In [40]:
image_extensions = (".jpg", ".jpeg", ".png")
images = sorted(file for file in os.listdir('../data') if file.lower().endswith(image_extensions)) #retrieving only the images
images

['beach.jpg',
 'bixby.jpg',
 'dior.png',
 'gucci.jpg',
 'loewe.jpg',
 'orange.jpg',
 'outdoor.jpg',
 'paris.jpg',
 'polo.jpg',
 'polo2.jpg',
 'street.jpg',
 'trench.png']

This is the function we are going to use

In [56]:
def img2text(image_src: str) -> str:
    """Describe an image with LLaVA using a fixed prompt.

    Args:
        image_src (str): Path to the image file.

    Returns:
        str: Decoded output (the prompt followed by the model's answer).

    Raises:
        OSError: If the image cannot be opened.
    """

    # Release cached GPU memory left over from the previous generation
    torch.cuda.empty_cache()
    time.sleep(2)
    print("Cache is empty now!")

    # FIXED PROMPT
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "This is an image for a fashion ad. Ignore the text on the foto. Focus only on the image:\n"
                        "1. Who is in the image? Describe their physical appearance (not their clothes): gender, race (important), hairstyle...\n"
                        '2. What clothes are they wearing and which is their fashion style (use word "person")?\n'
                        '3. Where is the scene and what is in the background? How is the person posing relative to the scene? (use word "person")'
                    ),
                },
                {"type": "image"},
            ],
        },
    ]

    # Open the image; catching only OSError avoids swallowing unrelated errors such as KeyboardInterrupt
    try:
        image = Image.open(image_src).convert("RGB")
    except OSError as err:
        print(f"Unable to open image {image_src}: {err}")
        raise

    print(f"Processing image {image_src}...")
    # Format the conversation with the model's chat template
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

    # autoregressively complete prompt
    output = model.generate(**inputs, max_new_tokens=500)

    return processor.decode(output[0], skip_special_tokens=True)
    

Let's try it out

In [57]:
output = img2text('../data/'+images[3])
output

KeyboardInterrupt: 

In [54]:
final_output = output.split("[/INST]")[1] #getting the output
final_output = [line.strip()[3:] for line in final_output.split("\n") if len(line)>1] #splitting by lines and removing the "1." from the start
final_output

['The image features a person who appears to be a woman. She has dark skin and her hair is styled in a short, natural cut.',
 'The person is wearing a black and white checkered dress with a high collar and a bow tie at the neck. The dress has a full skirt and long sleeves. The fashion style of the person can be described as a blend of classic and modern, with a touch of vintage inspiration.',
 'The scene is set against a plain, light-colored background. The person is posing in a dynamic and confident manner, with one hand on their hip and the other extended outward. The pose suggests movement and energy, which is in line with the fashion style being showcased. The person is the central focus of the image, and their position relative to the background emphasizes their outfit and the overall composition of the advertisement.']
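
The `[3:]` slice above assumes every line starts with exactly a one-digit number, a dot and a space. A slightly more tolerant version could strip the numbering with a regular expression instead. This is a sketch under that assumption; `parse_numbered_answers` is a hypothetical helper and the sample text is made up for illustration.

```python
import re

# Matches a leading item number such as "1." or "2)" plus the whitespace after it.
_ITEM_PREFIX = re.compile(r"^\s*\d+[.)]\s*")


def parse_numbered_answers(answer: str) -> list[str]:
    """Split a numbered answer into one string per non-empty line, without the numbering."""
    items = []
    for line in answer.splitlines():
        line = line.strip()
        if line:
            items.append(_ITEM_PREFIX.sub("", line))
    return items


sample = " 1. A woman with short hair.\n2. A checkered dress.\n\n3. A plain background."
print(parse_numbered_answers(sample))
```

Lines without a number are kept unchanged instead of losing their first three characters.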

Now let's generate output for all the images and store them in a dictionary

In [58]:
images_dict = dict() #global dictionary: image file name -> descriptions
data_dir = '../data/'

for image in images:
    # getting the output
    output = img2text(data_dir+image) #using LLaVA
    final_output = output.split("[/INST]")[1] #keeping only the model's answer
    final_output = [line.strip()[3:] for line in final_output.split("\n") if len(line)>1] #splitting by lines and removing the "1. " numbering from the start
    print(final_output)
    # filling the dictionary
    img_description = dict() #specific dictionary for an image
    img_description["person"] = final_output[0]
    img_description["clothes"] = final_output[1]
    img_description["scenario"] = final_output[2]
    images_dict[image] = img_description
    #store the dictionary at every iteration so partial results survive an interruption
    with open('images_dict.pkl', 'wb') as fp:
        pickle.dump(images_dict, fp)
    print("................\n\n")


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/beach.jpg...
['The image features a person who appears to be a woman. She has dark hair, and her skin tone is light. She is wearing a yellow hoodie and yellow sweatpants.', 'The person is wearing a casual, sporty fashion style, characterized by the matching yellow hoodie and sweatpants.', 'The scene is set outdoors, with a basketball hoop in the background. The person is posing in front of the hoop, with her hands on her hips and her head turned to the side, giving a confident and stylish appearance.']
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/bixby.jpg...
['The image features a male model. He has short hair and appears to be of African descent.', 'The model is wearing a white suit with a white shirt and a white bow tie. His fashion style can be described as classic and formal.', 'The scene is set in a room with a vintage or classic aesthetic. In the background, there are framed pictures on the wall, including one that appears to be a painting of a nude woman. The model is posing in front of the wall, standing between two chairs with a relaxed yet poised posture. He is looking directly at the camera, which suggests a confident and polished demeanor.']
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/dior.png...
['The image features a person who appears to be a woman. She has blonde hair and is wearing makeup that includes dark lipstick.', 'The person is wearing a black dress with a high neckline and a flared skirt. The dress has a textured pattern and is adorned with what seems to be a floral or paisley design. The person is also wearing black high heels.', 'The scene is set against a dark gray background. The person is posing in a dynamic manner, with one leg lifted and the other bent at the knee, giving the impression of movement. The person is holding a small, colorful clutch in their right hand, which is raised slightly above their hip. The pose and the way the person is holding the clutch suggest a sense of elegance and style.']
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/gucci.jpg...
['The image features a person who appears to be a woman. She has dark skin and is wearing her hair in an updo.', 'The person is wearing a black and white checkered dress with a high collar and a bow tie at the neck. The dress has a full skirt and is paired with black high-heeled shoes. The overall fashion style can be described as a blend of classic and modern elements, with a touch of vintage inspiration.', "The scene is set against a plain, light-colored background. The person is posing in a dynamic and confident manner, with one hand on her hip and the other extended outward. Her pose suggests movement and energy, which is in line with the fashion style being showcased. The person is the central focus of the image, and her pose and attire are the main elements that draw the viewer's attention."]
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/loewe.jpg...
['The image features a male model. He has dark hair and a beard, and appears to be of Caucasian descent.', 'The model is wearing a brown leather jacket and black pants. He is also holding a brown bag.', 'The scene is set against a white brick wall with a decorative element that resembles a heart shape. The model is posing on a stool, sitting with one leg crossed over the other, and holding the bag in his lap. He is looking directly at the camera with a neutral expression. The overall fashion style of the image suggests a casual yet stylish look, possibly for a brand that focuses on leather accessories and casual outerwear.']
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/orange.jpg...
['The image features a person who appears to be a young adult female. She has dark skin and her hair is styled in a way that it falls over her shoulders.', 'The person is wearing a bright orange hoodie and matching orange socks. Her fashion style can be described as casual and sporty, with a focus on bold, vibrant colors.', "The scene takes place in a room with orange walls, which complements the person's outfit. The person is sitting on a large, round, orange cushion or ottoman, which is placed against the wall. She is posing in a relaxed manner, with one leg crossed over the other, and her hands resting on her knees. The person's pose and the choice of the orange background create a cohesive and visually appealing image that highlights the clothing and the color scheme."]
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/outdoor.jpg...
['In the image, there is a person who appears to be a woman. She has dark skin and her hair is styled in an afro.', 'The person is wearing a bright orange dress. The fashion style of the person can be described as summery and vibrant, with a focus on bold colors and light fabrics.', 'The scene takes place outdoors, with a large rock formation in the background. The person is posing on top of the rock formation, leaning against it with one hand. The pose gives the impression that the person is enjoying a sunny day outdoors.']
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/paris.jpg...
['The image features a person who appears to be a woman. She has light skin and is wearing makeup that includes dark lipstick and eye makeup. Her hairstyle is a short, blonde bob.', 'The person is wearing a black dress with a lace or embroidered detail on the bodice. The dress has a flared skirt and is paired with black lace gloves. The overall fashion style of the person can be described as elegant and sophisticated.', "The scene takes place in a room with a brick wall in the background. The person is posing in front of a mirror, which reflects their image. They are standing with their hands clasped together in front of them, and their pose suggests a sense of confidence and poise. The person's reflection in the mirror creates a sense of depth and symmetry in the image."]
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/polo.jpg...
['The image features a person who appears to be a young woman with long blonde hair. She has a fair complexion and is wearing makeup that includes lipstick and eye makeup.', 'The person is wearing a white blazer over a floral top, paired with white pants. The fashion style can be described as chic and casual, with a touch of femininity.', 'The scene is set outdoors, with a cityscape visible in the background. The person is posing in front of a large, colorful quilt that is draped over a structure, which adds a vibrant and artistic element to the image. The person is standing in front of the quilt, with the cityscape behind them, creating a contrast between the urban environment and the quilted fabric. The pose is relaxed yet poised, with the person looking directly at the camera.']
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/polo2.jpg...
['The image features a person who appears to be a young adult male. He has short, light-colored hair and is wearing glasses.', 'The person is dressed in a suit and tie, with a pink shirt and a blue blazer. The fashion style can be described as formal or business casual.', 'The scene is set in a room with a couch and a window. The person is sitting on the couch, holding a book, and posing with one leg crossed over the other. The pose suggests a relaxed yet sophisticated demeanor. The background is blurred, but it appears to be a well-lit, comfortable interior space.']
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/street.jpg...
['In the image, there is a person who appears to be a woman. She has dark hair, which could be black or dark brown, styled in a way that it falls over her shoulders. She has a neutral expression on her face.', 'The person is wearing a brown jumpsuit with wide-leg pants. The jumpsuit has a high neckline and long sleeves, which gives it a formal or elegant appearance. The fashion style of the person can be described as chic and sophisticated.', 'The scene is set on a city street, with a building featuring a window and a street sign in the background. The person is standing on the sidewalk, posing with one hand on her hip and the other hand resting on her thigh. She is looking directly at the camera, which gives the impression that she is the focal point of the image. The pose and the setting suggest a sense of urban style and confidence.']
................




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Cache is empty now!
Processing image ../data/trench.png...
['The image features a person who appears to be a woman. She has a fair complexion and her hair is styled in a way that it falls over her shoulders.', 'The person is wearing a long, oversized coat that reaches down to her ankles. The coat has a high collar and appears to be made of a heavy fabric, suggesting a fashion style that is both functional and fashionable.', "The scene is set against a plain, light-colored background that provides a neutral backdrop for the person. The person is posing in a way that she is looking directly at the camera, with her head slightly tilted to one side. Her pose is elegant and poised, with her hands resting at her sides. The overall composition of the image suggests a focus on the coat and the person's style, with the background serving to highlight the clothing."]
................




In [59]:
images_dict

{'beach.jpg': {'person': 'The image features a person who appears to be a woman. She has dark hair, and her skin tone is light. She is wearing a yellow hoodie and yellow sweatpants.',
  'clothes': 'The person is wearing a casual, sporty fashion style, characterized by the matching yellow hoodie and sweatpants.',
  'scenario': 'The scene is set outdoors, with a basketball hoop in the background. The person is posing in front of the hoop, with her hands on her hips and her head turned to the side, giving a confident and stylish appearance.'},
 'bixby.jpg': {'person': 'The image features a male model. He has short hair and appears to be of African descent.',
  'clothes': 'The model is wearing a white suit with a white shirt and a white bow tie. His fashion style can be described as classic and formal.',
  'scenario': 'The scene is set in a room with a vintage or classic aesthetic. In the background, there are framed pictures on the wall, including one that appears to be a painting of a nu
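
With the dictionary in this shape, descriptions from different images can be combined into one text, which is the stated goal of this notebook. The sketch below shows one possible way to do that; `merge_descriptions` is a hypothetical helper, and the toy dictionary uses placeholder strings instead of real model output.

```python
def merge_descriptions(images_dict, person_from, clothes_from, scenario_from):
    """Combine the person, clothes and scenario descriptions of up to three images."""
    parts = [
        images_dict[person_from]["person"],
        images_dict[clothes_from]["clothes"],
        images_dict[scenario_from]["scenario"],
    ]
    return " ".join(part.strip() for part in parts)


# Placeholder data with the same structure as images_dict
toy_dict = {
    "a.jpg": {"person": "Person A.", "clothes": "Clothes A.", "scenario": "Scenario A."},
    "b.jpg": {"person": "Person B.", "clothes": "Clothes B.", "scenario": "Scenario B."},
}
merged = merge_descriptions(toy_dict, person_from="a.jpg", clothes_from="b.jpg", scenario_from="a.jpg")
print(merged)
```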

In [None]:
# with open('images_dict.pkl', 'rb') as fp:
#     images_dict = pickle.load(fp)
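
The loop above rewrites `images_dict.pkl` at every iteration, so an interruption in the middle of a write could leave a truncated file. One way to reduce that risk is to write to a temporary file and then move it into place. This is a sketch of that pattern, not what the notebook ran; `save_checkpoint` and `load_checkpoint` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_checkpoint(obj, path: str) -> None:
    """Pickle obj to a temporary file, then move it over path in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fp:
            pickle.dump(obj, fp)
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the temporary file if anything went wrong, including Ctrl+C.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str):
    """Load a previously pickled object."""
    with open(path, "rb") as fp:
        return pickle.load(fp)


# Round trip in a throwaway directory, using placeholder descriptions
demo = {"beach.jpg": {"person": "...", "clothes": "...", "scenario": "..."}}
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "images_dict.pkl")
    save_checkpoint(demo, target)
    restored = load_checkpoint(target)
```

As with any pickle file, it should only be loaded when it comes from a trusted source.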