# Image Stereoscopic Reconstruction (Pipeline)

In this notebook, we explore image translation: obtaining a frontal view of an architectural object from the corresponding lateral view, with optional image enhancements (inclusion of new details, inpainting, etc.).
To achieve this, we use an attention-based, Chain-of-Thought (CoT) driven generative process that couples an LLM with a Conditional Latent Diffusion Model (in our example, Qwen 2.5 Image Edit).

## Setup

In [None]:
%pip install -r requirements.txt

In [None]:
import base64
import io
import json
import ollama
import chromadb
from matplotlib import image as mpimg
from matplotlib import pyplot as plt

### Utility Functions

In [None]:
def encode_image(image_path) -> str:
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
    
def decode_and_show_image(base64_img, img_format: str):
    decoded_bytes = io.BytesIO(base64.b64decode(base64_img))
    decoded_image = mpimg.imread(decoded_bytes, format=img_format)
    
    plt.imshow(decoded_image, interpolation='nearest')
    plt.show()
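
At the byte level, the two helpers above are inverses of each other. Below is a minimal, standard-library-only sketch of that round trip. It uses a made-up byte string in place of a real image file, and `encode_bytes` and `decode_bytes` are hypothetical names that are not part of this notebook.

```python
import base64


def encode_bytes(raw: bytes) -> str:
    """Base64-encode raw bytes into an ASCII string (the same step as encode_image, minus the file read)."""
    return base64.b64encode(raw).decode("utf-8")


def decode_bytes(b64_text: str) -> bytes:
    """Turn a base64 string back into the original bytes."""
    return base64.b64decode(b64_text)


# Placeholder payload: a few header-like bytes followed by filler text, not a real image.
fake_image = b"\xff\xd8\xff\xe0" + b"not a real JPEG payload"
encoded = encode_bytes(fake_image)

# Decoding must give back exactly the bytes we started from.
assert decode_bytes(encoded) == fake_image
print(len(fake_image), "bytes ->", len(encoded), "base64 characters")
```

Base64 output is roughly one third larger than the input, which is worth keeping in mind when storing encoded images as text.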

### Ollama

We make use of [Ollama](https://ollama.com/), a tool for running LLMs locally.
Feel free to experiment with other vision models of your choice ([list of available ones](https://ollama.com/search?c=vision)).

In [None]:
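OLLAMA_URL = "http://localhost:11434"   # Change this if your Ollama server listens on a different port
MODEL = "qwen2.5vl:32b"                 # Our approach was tested with, and works best with, Qwen2.5-VL 32B.

# The Ollama server must already be running (e.g. `ollama serve` in a separate terminal);
# starting it from a notebook cell would block the kernel.
!ollama pull $MODEL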

### Vector Store

We make use of [ChromaDB](https://www.trychroma.com/), a lightweight vector store that is easy to set up and can run entirely in memory.
Documentation can be found [here](https://docs.trychroma.com/docs/overview/getting-started).

In [None]:
chroma_client = chromadb.EphemeralClient()  # In-memory client: nothing is persisted in this demo.
collection = chroma_client.create_collection(name="eustachian_collection")

We add two images to the vector store; they are used later to enrich the details of the final image.

In [None]:
collection.add(
    ids=["id1", "id2"],
    documents=[
        f"""
        {{ 
            "caption": "A statue of St. Eustace, patron saint of Matera, suited in armor with a golden plume, holding a spear upright in its right hand.",
            "base64": "{encode_image('assets/stEustace.jpg')}"
        }}
        """,
        f"""
        {{ 
            "caption": "A statue of St. Vitus, suited in light armor and a red cape, carrying a silver cross in his left hand, followed by two dogs of the same breed, of brown and black.",
            "base64": "{encode_image('assets/stVitus.jpg')}"
        }}
        """
    ]
)
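
Hand-writing JSON inside an f-string works here because neither the captions nor the base64 payloads contain double quotes or backslashes. A hedged alternative, assuming the same two-field layout, is to build each document with `json.dumps`, which escapes such characters automatically. `make_document` is a hypothetical helper, and the base64 value below is a placeholder string rather than a real image.

```python
import json


def make_document(caption: str, image_b64: str) -> str:
    """Serialize a caption and a base64 image string into one JSON document."""
    return json.dumps({"caption": caption, "base64": image_b64})


# The caption deliberately contains double quotes to show that they are escaped.
doc = make_document('A statue holding a "silver" cross.', "aGVsbG8=")

# The document parses back into the original fields.
parsed = json.loads(doc)
print(parsed["caption"])
```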

#### Example of querying

In [None]:
result = collection.query(
    query_texts=["This is a query document about a saint followed by dogs"],
    n_results=1
)
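# `result` is already a dict; each returned document is the JSON string stored above.
top_document = json.loads(result["documents"][0][0])
print(top_document["caption"])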
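
`collection.query` returns a dictionary whose values are nested lists: one inner list per query text, each holding up to `n_results` entries. Below is a minimal sketch of unpacking such a result. It uses a hand-written stand-in dictionary (the values are illustrative, not real output), and `first_caption` is a hypothetical helper.

```python
import json

# Hand-written stand-in shaped like a query result for one query text and n_results=1.
mock_result = {
    "ids": [["id2"]],
    "documents": [['{"caption": "A statue of St. Vitus.", "base64": "aGVsbG8="}']],
    "distances": [[0.42]],
}


def first_caption(result: dict, query_index: int = 0) -> str:
    """Return the caption of the best match for the given query text."""
    top_document = result["documents"][query_index][0]  # outer list: queries, inner list: matches
    return json.loads(top_document)["caption"]


print(first_caption(mock_result))
```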

## Phase 1: Perspective Change

For this step, we use Qwen 2.5 Image Edit, an **instruction-driven** T2I model capable of image generation, image editing and in-context image editing, on top of a CoT-LLM infrastructure.

### Setup

We clone the public Qwen-Image GitHub repository; the model weights are hosted in the [Qwen Hugging Face repo](https://huggingface.co/Qwen/Qwen-Image-Edit).

In [None]:
!git clone https://github.com/QwenLM/Qwen-Image.git ./models/

In [None]:
%pip install git+https://github.com/huggingface/diffusers

In [None]:
import os
from PIL import Image
import torch
from diffusers import QwenImageEditPlusPipeline
from typing import List, Dict
from io import BytesIO
from ollama import chat, ChatResponse

In [None]:
pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2509")  # Checkpoint that matches QwenImageEditPlusPipeline
print("Pipeline loaded.")

In [None]:
pipeline.to(torch.bfloat16) # Run inference in bfloat16 (recommended for faster inference and lower memory use)
pipeline.to("cuda")         # Change to "cpu" if CUDA is not supported by your machine. The inference will be slower.
pipeline.set_progress_bar_config(disable=False)

In [None]:
def generate_image(inputs: Dict[str, object], output_filename: str = "output_edit") -> Image.Image:
    """Run the pipeline, save each generated image as <output_filename>_<i>.jpeg and return the last one."""
    with torch.inference_mode():
        output = pipeline(**inputs)
    for i, output_image in enumerate(output.images):
        output_image.save(f"{output_filename}_{i}.jpeg")
    print(f"{len(output.images)} image(s) generated and saved.")
    return output.images[-1]

def make_inputs(images: List[Image.Image], prompt: str, guidance: float = 4.0, neg_prompt: str = " ", inf_steps: int = 50, gen_images: int = 1):
    """Build the keyword arguments passed to the pipeline call."""
    return {
        "image": images,
        "prompt": prompt,
        "generator": torch.manual_seed(0),  # Fixed seed for reproducible outputs
        "true_cfg_scale": guidance,
        "negative_prompt": neg_prompt,
        "num_inference_steps": inf_steps,
        "num_images_per_prompt": gen_images,
    }


In [None]:
def polish_prompt_en(original_prompt: str, images: List[Image.Image]):
    base64_images = []
    for image in images:
        with BytesIO() as buffered:
            image.save(buffered, format="JPEG")
            base64_images.append(base64.b64encode(buffered.getvalue()).decode("utf8"))
    response: ChatResponse = chat(model=MODEL, messages=[
    {
        'role': 'user',
        'content':  f'''
        # Edit Instruction Rewriter
        You are a professional edit instruction rewriter. Your task is to generate a precise, concise, and visually achievable professional-level edit instruction based on the user-provided instruction and the image to be edited.  

        Please strictly follow the rewriting rules below:

        ## 1. General Principles
        - Keep the rewritten prompt **concise**. Avoid overly long sentences and reduce unnecessary descriptive language.  
        - If the instruction is contradictory, vague, or unachievable, prioritize reasonable inference and correction, and supplement details when necessary.  
        - Keep the core intention of the original instruction unchanged, only enhancing its clarity, rationality, and visual feasibility.  
        - All added objects or modifications must align with the logic and style of the edited input image’s overall scene.  

        ## 2. Task Type Handling Rules
        ### 1. Add, Delete, Replace Tasks
        - If the instruction is clear (already includes task type, target entity, position, quantity, attributes), preserve the original intent and only refine the grammar.  
        - If the description is vague, supplement with minimal but sufficient details (category, color, size, orientation, position, etc.). For example:  
            > Original: "Add an animal"  
            > Rewritten: "Add a light-gray cat in the bottom-right corner, sitting and facing the camera"  
        - Remove meaningless instructions: e.g., "Add 0 objects" should be ignored or flagged as invalid.  
        - For replacement tasks, specify "Replace Y with X" and briefly describe the key visual features of X.  

        ### 2. Text Editing Tasks
        - All text content must be enclosed in English double quotes `" "`. Do not translate or alter the original language of the text, and do not change the capitalization.  
        - **For text replacement tasks, always use the fixed template:**
            - `Replace "xx" to "yy"`.  
            - `Replace the xx bounding box to "yy"`.  
        - If the user does not specify text content, infer and add concise text based on the instruction and the input image’s context. For example:  
            > Original: "Add a line of text" (poster)  
            > Rewritten: "Add text \"LIMITED EDITION\" at the top center with slight shadow"  
        - Specify text position, color, and layout in a concise way.  

        ### 3. Human Editing Tasks
        - Maintain the person’s core visual consistency (ethnicity, gender, age, hairstyle, expression, outfit, etc.).  
        - If modifying appearance (e.g., clothes, hairstyle), ensure the new element is consistent with the original style.  
        - **For expression changes, they must be natural and subtle, never exaggerated.**  
        - If deletion is not specifically emphasized, the most important subject in the original image (e.g., a person, an animal) should be preserved.
            - For background change tasks, emphasize maintaining subject consistency at first.  
        - Example:  
            > Original: "Change the person’s hat"  
            > Rewritten: "Replace the man’s hat with a dark brown beret; keep smile, short hair, and gray jacket unchanged"  

        ### 4. Style Transformation or Enhancement Tasks
        - If a style is specified, describe it concisely with key visual traits. For example:  
            > Original: "Disco style"  
            > Rewritten: "1970s disco: flashing lights, disco ball, mirrored walls, colorful tones"  
        - If the instruction says "use reference style" or "keep current style," analyze the input image, extract main features (color, composition, texture, lighting, art style), and integrate them into the prompt.  
        - **For coloring tasks, including restoring old photos, always use the fixed template:** "Restore old photograph, remove scratches, reduce noise, enhance details, high resolution, realistic, natural skin tones, clear facial features, no distortion, vintage photo restoration"  
        - If there are other changes, place the style description at the end.

        ## 3. Rationality and Logic Checks
        - Resolve contradictory instructions: e.g., "Remove all trees but keep all trees" should be logically corrected.  
        - Add missing key information: if position is unspecified, choose a reasonable area based on composition (near subject, empty space, center/edges).  

        # Output Format Example
        "A 211-floor-high skyscraper, majestically dominating the crowded street below..."

        Prompt to be rewritten: {original_prompt}
        ''',
        "images": base64_images
    }
    ])
    return response['message']['content']
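
Because the output format example in the instruction is wrapped in double quotes, the model may return its rewritten prompt quoted or padded with whitespace. Below is a small post-processing sketch that assumes this behaviour, which has not been verified against the model. `clean_polished_prompt` is a hypothetical helper.

```python
def clean_polished_prompt(raw: str) -> str:
    """Strip surrounding whitespace and one pair of matching outer quotes, if present."""
    text = raw.strip()
    # Remove the outer quotes only when the first and last characters are the same quote mark.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1].strip()
    return text


print(clean_polished_prompt('  "Rotate the statue to a front view."  \n'))
```

Quotes inside the prompt, such as text to be rendered in the image, are left untouched because only the outermost pair is removed.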

### Inference

#### Solid Rotation

We first crop a region of interest (ROI) that contains only the monument, which gives better results for the rotation task.

In [None]:
image_full = Image.open("assets/eustachian_monument.jpeg").convert("RGB")
image_roi = image_full.crop((200, 500, 800, 1100))
image_roi
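
`Image.crop` takes a `(left, upper, right, lower)` box in pixel coordinates, so the box above selects a 600 x 600 region. Below is a minimal sketch, in plain Python, of computing the size of such a box and clamping it to the image bounds. `box_size` and `clamp_box` are hypothetical helpers, and the 1000 x 1000 image size is an assumed example rather than the real size of the asset.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower)


def box_size(box: Box) -> Tuple[int, int]:
    """Return the (width, height) of a crop box."""
    left, upper, right, lower = box
    return right - left, lower - upper


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Shrink a crop box so that it stays inside a width x height image."""
    left, upper, right, lower = box
    return (max(0, left), max(0, upper), min(width, right), min(height, lower))


roi = (200, 500, 800, 1100)
print(box_size(roi))               # (600, 600)
print(clamp_box(roi, 1000, 1000))  # (200, 500, 800, 1000)
```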

In [None]:
images = [image_roi]  # Use the cropped ROI, as described above
prompt = polish_prompt_en("Rotate the main subject, so it looks in front view.", images)
inputs = make_inputs(images, prompt)
output_image = generate_image(inputs)

In [None]:
output_image

## Phase 2: Inpainting

## Final Result

### Further improvements

These optional steps edit the previous output to improve image quality.
You can skip them if you wish.

#### Colorization

In [None]:
image = Image.open("./output_edit_0.jpeg").convert("RGB")  # First image saved by generate_image
prompt = "Colorize the image."
inputs = make_inputs([image], prompt)
output_image = generate_image(inputs)

In [None]:
output_image