# Introduction
This notebook shows how to set up and test the LLaVA vision-language model with 4-bit quantization in Google Colab. The setup targets quick inference experiments on image-plus-text prompts using a Colab GPU.


In [1]:
# Check the Python version to ensure compatibility with LLaVA requirements

!python --version

Python 3.10.12
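
The same check can be made from Python, so that a later cell can stop early instead of failing partway through an install. This is a minimal sketch: the `(3, 8)` minimum is an assumption and is not stated anywhere in this notebook, so confirm it against the repository's `pyproject.toml` before relying on it. `python_is_supported` is a hypothetical helper name.

```python
import sys

# Assumed minimum version; not taken from this notebook.
MIN_PYTHON = (3, 8)


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True if `version` (default: the running interpreter) is at least `minimum`."""
    if version is None:
        version = sys.version_info[:2]
    # Compare only (major, minor) so patch releases do not matter.
    return tuple(version[:2]) >= tuple(minimum)


status = "supported" if python_is_supported() else "too old"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```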


In [None]:
# List all installed packages in the current environment
!pip list

Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
accelerate                       0.32.1
aiohttp                          3.9.5
aiosignal                        1.3.1
alabaster                        0.7.16
albumentations                   1.3.1
altair                           4.2.2
annotated-types                  0.7.0
anyio                            3.7.1
argon2-cffi                      23.1.0
argon2-cffi-bindings             21.2.0
array_record                     0.5.1
arviz                            0.15.1
astropy                          5.3.4
astunparse                       1.6.3
async-timeout                    4.0.3
atpublic                         4.1.0
attrs                            23.2.0
audioread                        3.0.1
autograd                         1.6.2
Babel                            2.15.0
backcall                         0.2.0
beautifulsoup4                   4.12.3

# Setup and Environment Preparation
Mount Google Drive, create a project folder on it, and clone the LLaVA repository there. Keeping the repository on Drive means it persists between Colab sessions, which is why several cells below report that their target already exists.


In [None]:
# Mount Google Drive to access files stored there


from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Navigate to the drive directory
%cd drive

/content/drive


In [None]:
# Create a new directory for the project in Google Drive

mkdir /content/drive/MyDrive/llmva

mkdir: cannot create directory ‘/content/drive/MyDrive/llmva’: File exists


In [None]:
# Change the current working directory to the project folder

%cd /content/drive/MyDrive/llmva

/content/drive/MyDrive/llmva


In [None]:
# Clone the specified GitHub repository for the LLaVA project

!git clone -b v1.0 https://github.com/camenduru/LLaVA

fatal: destination path 'LLaVA' already exists and is not an empty directory.
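
The `File exists` and `destination path 'LLaVA' already exists` messages above come from re-running cells whose work was already done in an earlier session. One way to make these steps safe to repeat is sketched below, assuming `git` is available on the PATH. `ensure_dir` and `clone_if_missing` are hypothetical helper names, not part of LLaVA or Colab.

```python
import subprocess
from pathlib import Path


def ensure_dir(path):
    """Create `path` (and any parents) if missing; do nothing if it already exists."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    return p


def clone_if_missing(repo_url, dest, branch=None):
    """Clone `repo_url` into `dest` unless `dest` is already a non-empty directory.

    Returns True if a clone was run and False if it was skipped.
    """
    dest = Path(dest)
    if dest.is_dir() and any(dest.iterdir()):
        return False
    cmd = ["git", "clone"]
    if branch:
        cmd += ["-b", branch]
    cmd += [repo_url, str(dest)]
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(cmd, check=True)
    return True


# Usage with the paths from this notebook (left commented out so that
# defining the helpers has no side effects):
# project = ensure_dir("/content/drive/MyDrive/llmva")
# clone_if_missing("https://github.com/camenduru/LLaVA", project / "LLaVA", branch="v1.0")
```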


In [None]:
# Change to the cloned repository's directory

%cd /content/drive/MyDrive/llmva/LLaVA

/content/drive/MyDrive/llmva/LLaVA


# Installing Dependencies
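The model depends on specific library versions. The cell below pins `transformers` to 4.36.2 and then installs `gradio` together with the cloned LLaVA package. pip reports a conflict because the package metadata for `llava` 1.1.1 requests `transformers==4.31.0`; the install still completes, and the rest of the notebook runs with 4.36.2.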


In [None]:
# Install required libraries for the project
!pip install -q transformers==4.36.2
!pip install -q gradio .

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llava 1.1.1 requires transformers==4.31.0, but you have transformers 4.36.2 which is incompatible.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for llava (pyproject.toml) ... done
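
Because the resolver output mentions two different `transformers` versions, it can help to confirm which one is actually active after the install. The sketch below uses only the standard library. `check_pins` is a hypothetical helper, and the pin passed to it is the one from the install cell above.

```python
from importlib import metadata


def check_pins(pins):
    """Return {package: (expected, installed)} for every pin that is not satisfied.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Report any package whose installed version differs from the pin.
for name, (expected, installed) in check_pins({"transformers": "4.36.2"}).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at the expected version.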


# Importing Libraries and Model Setup
Import the required libraries, then load the LLaVA model with 4-bit NF4 quantization through `BitsAndBytesConfig`. Quantizing the weights to 4 bits reduces the GPU memory they occupy, while computation runs in `float16`.


In [None]:
# Import necessary libraries and configure model parameters

import os
import requests
from PIL import Image
from io import BytesIO
from llava.conversation import conv_templates, SeparatorStyle
from llava.utils import disable_torch_init
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from transformers import TextStreamer
from transformers import AutoTokenizer, BitsAndBytesConfig
from llava.model import LlavaLlamaForCausalLM
import torch

[2024-08-24 12:44:22,762] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


In [None]:
# Set up the model with specific quantization parameters for memory efficiency

model_path = "4bit/llava-v1.5-13b-3GB"
kwargs = {"device_map": "auto"}
kwargs['load_in_4bit'] = True
kwargs['quantization_config'] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
    vision_tower.load_model()
vision_tower.to(device='cuda')
image_processor = vision_tower.image_processor

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
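
To see why 4-bit loading matters here, the arithmetic below estimates the memory taken by the language model's weights alone at different precisions. It is a rough sketch under stated assumptions: 13e9 is the nominal parameter count implied by "13b" in the model name, and the estimate ignores quantization constants, the vision tower, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB (2**30 bytes)."""
    return n_params * bits_per_param / 8 / 2**30


# Nominal size of a "13b" checkpoint; the true parameter count differs slightly.
n_params = 13e9

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(n_params, bits):5.1f} GiB")
```

At 4 bits per weight the estimate is a quarter of the `float16` figure.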

# Function Definition for Interaction
Define the `interact_image` function, which takes an image (a local path or a URL) and a text prompt. It preprocesses the image, builds the conversation prompt, runs generation, and returns the image together with the model's response.


In [None]:

def interact_image(image_file, prompt):
    """
    Load an image, preprocess it, and perform inference using the LLaVA model.

    Args:
        image_file (str): Local path or http(s) URL of the image.
        prompt (str): The prompt to guide the model's response generation, including queries about the image.

    Returns:
        tuple: The original image and the model's textual output.
    """
    # Load the image from a URL or a local path ('http' also matches 'https')
    if image_file.startswith('http'):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_file).convert('RGB')
    disable_torch_init()

    # Start a fresh conversation from the template
    conv_mode = "llava_v0"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles
    # Preprocess the image into a half-precision tensor on the GPU
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda()

    # Prepare the input prompt with role and image token markers
    inp = f"{roles[0]}: {prompt}"
    inp = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + inp
    # Append the user message and an empty assistant slot to the conversation
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    # Construct the full prompt and convert it to a tensor for model input
    raw_prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(raw_prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    # Stop generation when the conversation separator appears
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
    # Run generation under inference_mode so no autograd state is kept
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=image_tensor, do_sample=True, temperature=0.2,
                                    max_new_tokens=1024, use_cache=True, stopping_criteria=[stopping_criteria])
    # Decode only the newly generated tokens
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    conv.messages[-1][-1] = outputs
    # Drop the end-of-sequence marker and anything after it
    output = outputs.rsplit('</s>', 1)[0]
    return image, output
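
The last step of the function, `outputs.rsplit('</s>', 1)[0]`, removes the end-of-sequence marker from the decoded text. The standalone sketch below isolates that one expression so its behavior is easy to check without loading the model. `trim_at_stop` is a hypothetical name used only for illustration.

```python
def trim_at_stop(text, stop="</s>"):
    """Drop the last occurrence of `stop` and anything after it.

    Text that does not contain `stop` is returned unchanged.
    """
    return text.rsplit(stop, 1)[0]


print(trim_at_stop("A technical drawing.</s>"))
print(trim_at_stop("No marker here"))
```

Because `rsplit` with `maxsplit=1` splits at the last occurrence only, an earlier `</s>` inside the text would be kept.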

# Executing Inference
Run `interact_image` on a sample image with a prompt. The example uses a screenshot of a technical drawing and asks the model to describe it, including its colors, dimensions, and measurements.


In [None]:
# Execute the interaction with a sample image and print the output

image, output = interact_image(
    'Screenshot from 2024-08-24 13-52-43.png',
    'Describe the image and color details. as well what are this drawings? which dimensions are provided here? and what are the measurements available?'
)
print(output)

The image is a black and white drawing of a model, likely a 3D model, featuring a long object with a curve. The drawing includes various measurements and dimensions, such as 1.5" and 1.25". The measurements are provided in inches, and the drawing appears to be a blueprint or a technical drawing. The image also has a few notes, which might provide additional information or instructions for the model.
