Janus
===

**Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation**

 * Paper: https://arxiv.org/abs/2410.13848

![Janus Architecture](../assets/janus_arch.jpg)

![Janus Training Stages](../assets/janus_training_stages.jpg)

 * Installation
```bash
pip install torch torchvision
# accelerate is required by device_map="auto" when loading the model
pip install transformers accelerate
```
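
Before running the cells that follow, it can help to confirm that the packages they import are visible to the current interpreter. The snippet below is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and `PIL` is listed because the notebook imports it (it is provided by the Pillow package, which is normally pulled in by `torchvision`).

```python
import importlib.util

# Top-level module names imported by the notebook cells.
REQUIRED = ("torch", "torchvision", "transformers", "PIL")


def check_environment(packages=REQUIRED):
    """Map each module name to True if the interpreter can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:<14}{'ok' if found else 'MISSING'}")
```

A `MISSING` entry only means the module cannot be found; it does not check that the installed `transformers` version is recent enough to ship the Janus classes.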

 * Imports
```python
import textwrap

import torch
from PIL import Image
from transformers import JanusForConditionalGeneration, JanusProcessor

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```


 * Load the processor and the model
```python
model_id = "deepseek-community/Janus-Pro-1B"

processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# device_map="auto" already places the weights, so no extra .to(device) call is needed.
model.eval()
```

 * Image understanding (text generation)
```python
image_path = "../samples/plants.jpg"

# Build a single-turn conversation with one image and one text prompt.
messages = [
    {
        "role": "user",
        "content": [
            # A remote image can be passed instead of a local path:
            # {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "image", "image": image_path},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

# generation_mode="text" selects text generation rather than image generation.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output = model.generate(
    **inputs,
    max_new_tokens=40,
    generation_mode="text",
    do_sample=True,
)

# Decode the full sequence (prompt included) and wrap it at 80 columns.
text = processor.decode(output[0], skip_special_tokens=True)
print(textwrap.fill(text, 80))
```

 * Sample output (sampling is enabled, so results vary between runs; the decoded text includes the system prompt and the user turn, and the answer is cut off at 40 new tokens)
```
You are a helpful language and vision assistant. You are able to understand the
visual content that the user provides, and assist the user with a variety of
tasks using natural language.  :  What do you see in this image?  :In this
image, I see two decorative flower pots. One pot is taller and has a large plant
in it, while the other pot is shorter and has a smaller plant. Both pots are
placed on
```
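
The printed text above repeats the system prompt and the question because the whole returned sequence is decoded. One common way to show only the model's answer is to drop the prompt tokens before decoding. The sketch below illustrates the slicing on plain lists; the helper `split_generated` is hypothetical, and it assumes, as the output above suggests, that `generate` returns the prompt tokens followed by the new tokens.

```python
def split_generated(output_ids, prompt_length):
    """Split one generated sequence into (prompt_ids, new_ids)."""
    if not 0 <= prompt_length <= len(output_ids):
        raise ValueError("prompt_length must lie between 0 and len(output_ids)")
    return output_ids[:prompt_length], output_ids[prompt_length:]


# Toy example: plain integers stand in for token IDs.
prompt_ids, new_ids = split_generated([101, 7, 8, 9, 42, 43], prompt_length=4)
print(new_ids)  # [42, 43]

# With the notebook variables this would look like (not executed here):
#   prompt_length = inputs["input_ids"].shape[1]
#   _, new_ids = split_generated(output[0], prompt_length)
#   print(processor.decode(new_ids, skip_special_tokens=True))
```

Slicing by prompt length avoids string matching on the decoded text, which can break when special tokens are skipped or whitespace is normalized.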
