<a href="https://colab.research.google.com/github/ritwikraha/Open-Generative-Fill/blob/SLD/notebooks/sld/llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [37]:
import ast
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer
)

In [38]:
system_content = """Excellent Parser

Objective: Analyze scene descriptions to identify objects and their attributes.

Process Steps
1. Read the user prompt (scene description).
2. Identify all objects mentioned with quantities.
3. Extract attributes of each object (color, size, material, etc.).
4. If the description mentions objects that shouldn't be in the image, take note at the negation part.
5. Explain your understanding (reasoning) and then format your result (answer / negation) as shown in the examples.
6. Importance of Extracting Attributes: Attributes provide specific details about the objects. This helps differentiate between similar objects and gives a clearer understanding of the scene.

Follow the steps closely and accurately identify objects based on the given prompt. Ensure adherence to the output format.
"""

use_contents = [
"User prompt: A brown horse is beneath a black dog. Another orange cat is beneath a brown horse.",
"User prompt: There's a white car and a yellow airplane in a garage. They're in front of two dogs and behind a cat. The car is small. Another yellow car is outside the garage.",
"User prompt: A car and a dog are on top of an airplane and below a red chair. There's another dog sitting on the mentioned chair.",
"User prompt: An oil painting at the beach of a blue bicycle to the left of a bench and to the right of a palm tree with five seagulls in the sky.",
"User prompt: An animated-style image of a scene without backpacks.",
"User Prompt: Make the dog a sleeping dog and remove all shadows in an image of a grassland.",
]

assistant_contents = [
"""
Reasoning: The description talks about three objects: a brown horse, a black dog, and an orange cat. We report the color attribute thoroughly. No specified negation terms. No background is mentioned and thus fill in the default one.
Objects: [('horse', ['brown']), ('dog', ['black']), ('cat', ['orange'])]
Background: A realistic image
Negation:
""",
"""
Reasoning: The scene has two cars, one airplane, two dogs, and a cat. The car and airplane have colors. The first car also has a size. No specified negation terms. The background is a garage.
Objects: [('car', ['white and small', 'yellow']), ('airplane', ['yellow']), ('dog', [None, None]), ('cat', [None])]
Background: A realistic image in a garage
Negation:
""",
"""
Reasoning: Four objects are described: one car, airplane, two dog, and a chair. The chair is red color. No specified negation terms. No background is mentioned and thus fill in the default one.
Objects: [('car', [None]), ('airplane', [None]), ('dog', [None, None]), ('chair', ['red'])]
Background: A realistic image
Negation:
""",
"""
Reasoning: Here, there are five seagulls, one blue bicycle, one palm tree, and one bench. No specified negation terms. The background is an oil painting at the beach.
Objects: [('bicycle', ['blue']), ('palm tree', [None]), ('seagull', [None, None, None, None, None]), ('bench', [None])]
Background: An oil painting at the beach
Negation:
""",
"""
Reasoning: The description clearly states no backpacks, so this must be acknowledged. The user provides the negative prompt of backpacks. The background is an animated-style image.
Objects: [('backpacks', [None])]
Background: An animated-style image
Negation: backpacks
""",
"""
Reasoning: The user prompt specifies a sleeping dog on the image and a shadow to be removed. The background is a realistic image of a grassland.
Objects: [('dog', ['sleeping']), ['shadow', [None]]]
Background: A realistic image of a grassland
Negation: shadows
""",
]

In [44]:
model_id="Qwen/Qwen1.5-0.5B-Chat"
device = "cuda" if torch.cuda.is_available() else "cpu"

prompt="""An image of a rustic village with a cat to the left and a dog\
to the right."""

In [45]:
messages = [
    {"role":"system", "content": system_content},
]

for user_content, assistant_content in zip(use_contents, assistant_contents):
    messages.append({"role":"user", "content": user_content.strip()})
    messages.append({"role":"assistant", "content": assistant_content.strip()})

messages.append({"role":"user", "content":f"User prompt: {prompt}"})

In [46]:
language_model = AutoModelForCausalLM.from_pretrained(
    model_id,
).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [47]:
model_inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

print(tokenizer.decode(model_inputs[0]))

<|im_start|>system
Excellent Parser

Objective: Analyze scene descriptions to identify objects and their attributes.

Process Steps
1. Read the user prompt (scene description).
2. Identify all objects mentioned with quantities.
3. Extract attributes of each object (color, size, material, etc.).
4. If the description mentions objects that shouldn't be in the image, take note at the negation part.
5. Explain your understanding (reasoning) and then format your result (answer / negation) as shown in the examples.
6. Importance of Extracting Attributes: Attributes provide specific details about the objects. This helps differentiate between similar objects and gives a clearer understanding of the scene.

Follow the steps closely and accurately identify objects based on the given prompt. Ensure adherence to the output format.
<|im_end|>
<|im_start|>user
User prompt: A brown horse is beneath a black dog. Another orange cat is beneath a brown horse.<|im_end|>
<|im_start|>assistant
Reasoning: Th

In [48]:
with torch.no_grad():
    generated_ids = language_model.generate(
        model_inputs.to(device),
        max_new_tokens=512,
        temperature=0.0,
        do_sample=False,
    )



In [59]:
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(model_inputs, generated_ids)
]

output_generation = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [61]:
print(output_generation)

Reasoning: The user provided a description of a rural village with a cat to the left and a dog to the right. The image does not contain any shadows. The background is a realistic image of a rural village.
Objects: [('cat', ['left']), ('dog', ['right'])]                                                                                                      
Background: A realistic image of a rural village
Negation: shadows


In [62]:
# # Extracting key objects
key_objects_part = output_generation.split("Objects:")[1]
start_index = key_objects_part.index("[")
end_index = key_objects_part.rindex("]") + 1
objects_str = key_objects_part[start_index:end_index]

print(objects_str)

[('cat', ['left']), ('dog', ['right'])]


In [63]:
# Converting string to list
parsed_objects = ast.literal_eval(objects_str)

parsed_objects

[('cat', ['left']), ('dog', ['right'])]

In [65]:
# Extracting additional negative prompt
bg_prompt = output_generation.split("Background:")[1].split("\n")[0].strip()
negative_prompt = output_generation.split("Negation:")[1].strip()

bg_prompt, negative_prompt

('A realistic image of a rural village', 'shadows')

In [66]:
parsed_result = {
    "objects": parsed_objects,
    "bg_prompt": bg_prompt,
    "neg_prompt": negative_prompt,
}

In [67]:
parsed_result

{'objects': [('cat', ['left']), ('dog', ['right'])],
 'bg_prompt': 'A realistic image of a rural village',
 'neg_prompt': 'shadows'}