## Prerequisites

1. In your terminal, `cd` into `tutorials_deeplearninghero/llms`
2. Clone the Mini-GPT4 repo with `git clone https://github.com/Vision-CAIR/MiniGPT-4.git`
3. `cd` into `MiniGPT-4` and create the conda environment with `conda env create -f environment.yml`
4. Activate the environment with `conda activate minigpt4`
5. Install `ipykernel` into the environment with `conda install ipykernel` so that it can serve as a Python kernel
6. Make sure the `minigpt4` kernel is selected for your notebook
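
A quick way to confirm step 6 from inside the notebook is to look at the interpreter path. The sketch below is only a heuristic: the helper name `running_in_env` is made up for this example, and it assumes the environment name appears as a directory in the interpreter path (as it does in `/opt/conda/envs/minigpt4/bin/python`, the path used later in this notebook).

```python
import sys
from pathlib import Path


def running_in_env(env_name: str, executable: str = sys.executable) -> bool:
    """Return True if a directory called env_name appears in the interpreter path."""
    # Hypothetical helper: conda normally installs each environment under
    # .../envs/<env_name>/, so the name shows up as one path component.
    return env_name in Path(executable).parts


# Check the interpreter that is running this notebook.
print(running_in_env("minigpt4"))
```

If this prints `False`, the notebook is probably attached to a different kernel and the imports further down are likely to fail.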

## Install a few more libraries

In [6]:
!/opt/conda/envs/minigpt4/bin/pip install --quiet fschat==0.1.10 gdown ipywidgets

In [8]:
import shutil
import pathlib
import os
import gdown
import transformers
import gc
import huggingface_hub
from ipywidgets import Layout, Box, Image, VBox, GridspecLayout, HTML


## Setting up Mini-GPT4

In [4]:
# Using ~/.cache instead of the absolute /home/jupyter path appears to resolve to a different location,
# so the absolute path is used here. TODO: figure out where ~ points to.
default_cache_dir = pathlib.Path("/home/jupyter/.cache/huggingface/hub")
llama_space = "decapoda-research"
llama_id = "llama-7b-hf"
vicuna_space = "lmsys"
vicuna_id = "vicuna-7b-delta-v0"
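
To investigate where the cache actually lives, it can help to print the candidate locations. The sketch below is a best-effort guess, not the library's own logic: the helper `resolve_hub_cache` is hypothetical, and the lookup order over `HF_HUB_CACHE`, `HUGGINGFACE_HUB_CACHE`, `HF_HOME` and `XDG_CACHE_HOME` is an assumption about how the Hugging Face libraries pick their cache directory, which may differ between versions.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None) -> Path:
    """Guess the Hugging Face hub cache directory from environment variables."""
    env = os.environ if env is None else env
    # Assumed priority: an explicit hub cache variable wins.
    for name in ("HF_HUB_CACHE", "HUGGINGFACE_HUB_CACHE"):
        if env.get(name):
            return Path(os.path.expanduser(env[name]))
    # Next, a custom Hugging Face home with a "hub" subdirectory.
    if env.get("HF_HOME"):
        return Path(os.path.expanduser(env["HF_HOME"])) / "hub"
    # Finally, the generic cache directory (~/.cache unless XDG_CACHE_HOME is set).
    base = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(os.path.expanduser(base)) / "huggingface" / "hub"


print("Guessed cache dir:", resolve_hub_cache())
print("~ expands to:     ", os.path.expanduser("~"))
```

Comparing the two printed paths with `/home/jupyter/.cache/huggingface/hub` should show whether `~` or one of the environment variables is responsible for the mismatch.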

## Download base models

In [None]:
def download_models():
    llama_repo_id = f"{llama_space}/{llama_id}"
    vicuna_repo_id = f"{vicuna_space}/{vicuna_id}"
    huggingface_hub.snapshot_download(repo_id=llama_repo_id)
    huggingface_hub.snapshot_download(repo_id=vicuna_repo_id)
      
download_models()

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 16.0MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 411/411 [00:00<00:00, 114kB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 333kB/s]
Loading checkpoint shards: 100%|██████████| 33/33 [02:11<00:00,  3.99s/it]
Downloading (…)lve/main/config.json: 100%|██████████| 619/619 [00:00<00:00, 144kB/s]
Downloading (…)model.bin.index.json: 100%|██████████| 26.8k/26.8k [00:00<00:00, 8.49MB/s]
Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading (…)l-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]
Downloading (…)l-00001-of-00002.bin:   0%|          | 31.5M/9.98G [00
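
The later cells rely on the way the hub cache names its folders: a repo `org/name` ends up under `models--org--name/snapshots/<revision>/`. A minimal sketch of that mapping follows. The helper names `snapshots_dir` and `latest_snapshot` are made up for this example, and picking the most recently modified snapshot is an assumption that only matters when several revisions are cached.

```python
from pathlib import Path


def snapshots_dir(cache_dir, repo_id: str) -> Path:
    """Map 'org/name' to '<cache>/models--org--name/snapshots'."""
    folder = "models--" + repo_id.replace("/", "--")
    return Path(cache_dir) / folder / "snapshots"


def latest_snapshot(cache_dir, repo_id: str):
    """Return the most recently modified snapshot directory, or None if absent."""
    root = snapshots_dir(cache_dir, repo_id)
    if not root.is_dir():
        return None
    candidates = [p for p in root.iterdir() if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


cache = Path("/home/jupyter/.cache/huggingface/hub")
print(snapshots_dir(cache, "decapoda-research/llama-7b-hf"))
print(latest_snapshot(cache, "lmsys/vicuna-7b-delta-v0"))
```

A `None` from `latest_snapshot` means the download has not finished or the cache is in a different location than expected.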

In [5]:
import json

def patch_tokenizer_config(default_cache_dir):
    # Workaround from https://github.com/huggingface/transformers/issues/22222#issuecomment-1477171703:
    # the checkpoints declare 'LLaMATokenizer', while transformers expects 'LlamaTokenizer'.
    for space, repo in [(vicuna_space, vicuna_id), (llama_space, llama_id)]:
        snapshots_dir = default_cache_dir / f"models--{space}--{repo}/snapshots/"
        for path in snapshots_dir.rglob("*/tokenizer_config.json"):
            print(f"Loading {path}")
            with open(path, "r") as f:
                config = json.load(f)
            if config.get("tokenizer_class") == "LlamaTokenizer":
                print("No fix needed")
                continue
            # Rewrite the file only when the class name actually changes
            config["tokenizer_class"] = "LlamaTokenizer"
            with open(path, "w") as f:
                json.dump(config, f)

patch_tokenizer_config(default_cache_dir)
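
After patching, a read-only check can confirm that no config still carries the old class name. This is a sketch under stated assumptions: `mismatched_tokenizer_configs` is a hypothetical helper, and the demonstration runs on a throwaway directory whose layout is invented for the example rather than on the real cache.

```python
import json
import tempfile
from pathlib import Path


def mismatched_tokenizer_configs(root, expected="LlamaTokenizer"):
    """Return tokenizer_config.json paths under root whose class differs from expected."""
    bad = []
    for path in sorted(Path(root).rglob("tokenizer_config.json")):
        with open(path, "r") as f:
            config = json.load(f)
        if config.get("tokenizer_class") != expected:
            bad.append(path)
    return bad


# Demonstration on a temporary directory (made-up layout, two fake snapshots).
with tempfile.TemporaryDirectory() as tmp:
    for name, cls in [("old", "LLaMATokenizer"), ("new", "LlamaTokenizer")]:
        snap = Path(tmp) / name
        snap.mkdir()
        (snap / "tokenizer_config.json").write_text(json.dumps({"tokenizer_class": cls}))
    demo_result = [p.parent.name for p in mismatched_tokenizer_configs(tmp)]

print(demo_result)  # ['old']
```

Pointing the helper at `default_cache_dir` should return an empty list once the patch has been applied.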

In [10]:
# Vicuna weights are deltas that need to be applied on top of the LLaMA weights
!/opt/conda/envs/minigpt4/bin/python -m fastchat.model.apply_delta \
    --base-model-path $default_cache_dir/models--$llama_space--$llama_id/snapshots/*/ \
    --target-model-path ./vicuna-7b-v0 \
    --delta-path $default_cache_dir/models--$vicuna_space--$vicuna_id/snapshots/*/ 

Loading the base model from /.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348/
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.99it/s]
Loading the delta from /.cache/huggingface/hub/models--lmsys--vicuna-7b-delta-v0/snapshots/f902a2f7e2ca5dfeedf40a0220320e50d2d4fa2a/
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:04<00:00,  2.25s/it]
Applying the delta
Applying delta: 100%|█████████████████████████| 323/323 [00:03<00:00, 85.23it/s]
Saving the target model to ./vicuna-7b-v0
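
Conceptually, applying a delta is an element-wise addition: each target weight is the base weight plus the matching delta weight. The toy sketch below shows only that idea, with plain Python lists standing in for tensors. It is not FastChat's implementation, the function name `apply_delta_toy` is made up, and real checkpoints may need extra handling (sharding, dtype, tokenizer files).

```python
def apply_delta_toy(base, delta):
    """Add each delta 'tensor' (a list of floats) to the base tensor of the same name."""
    if base.keys() != delta.keys():
        raise KeyError("base and delta must contain the same parameter names")
    target = {}
    for name in base:
        if len(base[name]) != len(delta[name]):
            raise ValueError(f"shape mismatch for {name}")
        target[name] = [b + d for b, d in zip(base[name], delta[name])]
    return target


# Made-up parameter names and values, chosen so the arithmetic is exact.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -1.0], "layer.bias": [0.0]}
target = apply_delta_toy(base, delta)
print(target)  # {'layer.weight': [1.25, 1.0], 'layer.bias': [0.5]}
```

This is also why both downloads are required: the delta on its own does not contain usable model weights.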


In [7]:
!mv ../../vicuna-7b-v0 ./

## Download BLIP-2 checkpoint

In [15]:
output_path = 'pretrained_minigpt4.pth'
gdown.download(
    "https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing", output_path, fuzzy=True
)

Downloading...
From (uriginal): https://drive.google.com/uc?id=1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R
From (redirected): https://drive.google.com/uc?id=1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R&confirm=t&uuid=50653c7d-659f-4f9e-9a7f-31b39f79dd97
To: /home/jupyter/tutorials_deeplearninghero/llms/pretrained_minigpt4.pth
100%|██████████| 37.9M/37.9M [00:00<00:00, 214MB/s]


'pretrained_minigpt4.pth'
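
The log above shows the share link being turned into a `uc?id=...` download URL. A rough sketch of that ID extraction is given below. It is an illustration only and not gdown's actual code: the helper `drive_file_id` and its regular expressions are assumptions covering just the two URL shapes that appear in this notebook.

```python
import re


def drive_file_id(url: str):
    """Extract the file ID from a '/file/d/<id>/' or '?id=<id>' style Drive URL."""
    match = re.search(r"/file/d/([\w-]+)", url) or re.search(r"[?&]id=([\w-]+)", url)
    return match.group(1) if match else None


share_url = "https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing"
file_id = drive_file_id(share_url)
print(f"https://drive.google.com/uc?id={file_id}")
```

The printed URL should match the first `From` line in the download log.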

In [9]:
#!curl -LO https://github.com/Vision-CAIR/MiniGPT-4/archive/refs/heads/main.zip 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 34.4M    0 34.4M    0     0  17.6M      0 --:--:--  0:00:01 --:--:-- 24.7M


In [36]:
#import zipfile
#with zipfile.ZipFile("main.zip", 'r') as zip_ref:
#    zip_ref.extractall("./")

## Setting paths to configs

In [13]:
import yaml

eval_config_path = pathlib.Path("MiniGPT-4/eval_configs/minigpt4_eval.yaml")
with open(eval_config_path, "r") as f:
    eval_config_dict = yaml.safe_load(f)
    eval_config_dict["model"]["ckpt"] = "./pretrained_minigpt4.pth"
    eval_config_dict["model"]["prompt_path"] = "./MiniGPT-4/prompts/alignment.txt"
    
with open(eval_config_path, "w") as f:
    yaml.dump(eval_config_dict, f)

minigpt4_config_path = pathlib.Path("MiniGPT-4/minigpt4/configs/models/minigpt4.yaml")
with open(minigpt4_config_path, "r") as f:
    minigpt4_config_dict = yaml.safe_load(f)
    minigpt4_config_dict["model"]["llama_model"] = "./vicuna-7b-v0"
    
with open(minigpt4_config_path, "w") as f:
    yaml.dump(minigpt4_config_dict, f)
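
Because these paths are relative, a wrong working directory only shows up later as a load error. The sketch below checks the configured paths up front. The helper `missing_paths` is hypothetical, and the demonstration uses a temporary directory with made-up files instead of the real config.

```python
import os
import tempfile


def missing_paths(config, keys=(("model", "ckpt"), ("model", "prompt_path"))):
    """Return the dotted names of configured paths that do not exist on disk."""
    missing = []
    for section, key in keys:
        value = config.get(section, {}).get(key)
        if value is None or not os.path.exists(value):
            missing.append(f"{section}.{key}")
    return missing


# Demonstration: the checkpoint file exists, the prompt file does not.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "pretrained_minigpt4.pth")
    open(ckpt, "wb").close()
    demo_config = {"model": {"ckpt": ckpt, "prompt_path": os.path.join(tmp, "alignment.txt")}}
    demo_missing = missing_paths(demo_config)

print(demo_missing)  # ['model.prompt_path']
```

Calling `missing_paths(eval_config_dict)` in the notebook should return an empty list when the checkpoint and prompt file are where the config says they are.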

## Loading Mini-GPT4

In [1]:
import sys
minigpt4_path = './MiniGPT-4'
if minigpt4_path not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append(minigpt4_path)

In [2]:
import argparse 
from minigpt4.common.config import Config
from minigpt4.common.registry import registry

from minigpt4.datasets.builders import *
from minigpt4.models import *
from minigpt4.processors import *
from minigpt4.runners import *
from minigpt4.tasks import *

parser = argparse.ArgumentParser(description="")
parser.add_argument('--cfg-path', help='')
parser.add_argument('--options', nargs="+",help='')
parser.add_argument('--gpu-id', default=0, help='')
args = parser.parse_args(" --cfg-path ./MiniGPT-4/eval_configs/minigpt4_eval.yaml".split())

cfg = Config(args)

model_config = cfg.model_cfg
model_config.device_8bit = args.gpu_id
model_cls = registry.get_model_class(model_config.arch)
model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))

vis_processor_cfg = cfg.datasets_cfg.cc_sbu_align.vis_processor.train
vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)

  warn(f"Failed to load image Python extension: {e}")
  from .autonotebook import tqdm as notebook_tqdm


Loading VIT
Loading VIT Done
Loading Q-Former
Loading Q-Former Done
Loading LLAMA

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues


Loading checkpoint shards: 100%|██████████| 2/2 [00:16<00:00,  8.49s/it]


Loading LLAMA Done
Load 4 training prompts
Prompt Example 
###Human: <Img><ImageHere></Img> Please provide a detailed description of the picture. ###Assistant: 
Load BLIP2-LLM Checkpoint: ./pretrained_minigpt4.pth


In [3]:
import argparse
import time
from PIL import Image

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
from transformers import StoppingCriteria, StoppingCriteriaList
from minigpt4.conversation.conversation import *


class MiniGPT4Chat:
    
    def __init__(self, model, vis_processor, device='cuda:0'):
        self.device = device
        self.model = model
        self.vis_processor = vis_processor
        stop_words_ids = [torch.tensor([835]).to(self.device),
                          torch.tensor([2277, 29937]).to(self.device)]  # '###' can be encoded in two different ways.
        self.stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
        self.conv, self.img_list = None, None
        self.reset_history()
        
    def ask(self, text):
        if len(self.conv.messages) > 0 and self.conv.messages[-1][0] == self.conv.roles[0] \
                and self.conv.messages[-1][1][-6:] == '</Img>':  # last message is image.
            self.conv.messages[-1][1] = ' '.join([self.conv.messages[-1][1], text])
        else:
            self.conv.append_message(self.conv.roles[0], text)

    def answer(self, max_new_tokens=300, num_beams=1, min_length=1, top_p=0.9,
               repetition_penalty=1.0, length_penalty=1, temperature=1.0, max_length=2000):
        self.conv.append_message(self.conv.roles[1], None)
        embs = self.get_context_emb(self.img_list)

        current_max_len = embs.shape[1] + max_new_tokens
        if current_max_len - max_length > 0:
            print('Warning: The number of tokens in current conversation exceeds the max length. '
                  'The model will not see the contexts outside the range.')
        begin_idx = max(0, current_max_len - max_length)

        embs = embs[:, begin_idx:]

        outputs = self.model.llama_model.generate(
            inputs_embeds=embs,
            max_new_tokens=max_new_tokens,
            stopping_criteria=self.stopping_criteria,
            num_beams=num_beams,
            do_sample=True if num_beams==1 else False,
            min_length=min_length,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            length_penalty=length_penalty,
            temperature=temperature,
        )
        output_token = outputs[0]
        if output_token[0] == 0:  # the model might output an unknown token <unk> at the beginning; remove it
            output_token = output_token[1:]
        if output_token[0] == 1:  # some users find that there is a start token <s> at the beginning. remove it
            output_token = output_token[1:]
        output_text = self.model.llama_tokenizer.decode(output_token, add_special_tokens=False)
        output_text = output_text.split('###')[0]  # remove the stop sign '###'
        output_text = output_text.split('Assistant:')[-1].strip()
        self.conv.messages[-1][1] = output_text
        return output_text, output_token.cpu().numpy()

    def upload_img(self, image):
        if isinstance(image, str):  # image is a file path
            raw_image = Image.open(image).convert('RGB')
            image = self.vis_processor(raw_image).unsqueeze(0).to(self.device)
        elif isinstance(image, Image.Image):
            raw_image = image
            image = self.vis_processor(raw_image).unsqueeze(0).to(self.device)
        elif isinstance(image, torch.Tensor):
            if len(image.shape) == 3:
                image = image.unsqueeze(0)
            image = image.to(self.device)

        image_emb, _ = self.model.encode_img(image)
        self.img_list.append(image_emb)
        self.conv.append_message(self.conv.roles[0], "<Img><ImageHere></Img>")
        msg = "Received."
        return msg

    def get_context_emb(self, img_list):
        prompt = self.conv.get_prompt()
        prompt_segs = prompt.split('<ImageHere>')
        assert len(prompt_segs) == len(img_list) + 1, "Unmatched numbers of image placeholders and images."
        seg_tokens = [
            self.model.llama_tokenizer(
                seg, return_tensors="pt", add_special_tokens=i == 0).to(self.device).input_ids
            # only add bos to the first seg
            for i, seg in enumerate(prompt_segs)
        ]
        seg_embs = [self.model.llama_model.model.embed_tokens(seg_t) for seg_t in seg_tokens]
        mixed_embs = [emb for pair in zip(seg_embs[:-1], img_list) for emb in pair] + [seg_embs[-1]]
        mixed_embs = torch.cat(mixed_embs, dim=1)
        return mixed_embs
    
    def reset_history(self):
        self.conv = Conversation(
            system="Give the following image: <Img>ImageContent</Img>. "
                   "You will be able to see the image once I provide it to you. Please answer my questions.",
            roles=("Human", "Assistant"),
            messages=[],
            offset=2,
            sep_style=SeparatorStyle.SINGLE,
            sep="###",
        )
        self.img_list = []
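
Three pieces of logic in the class above are easier to follow on plain Python values: how text segments and image embeddings are interleaved in `get_context_emb`, how `answer` decides where the context window starts, and how the decoded text is cleaned up. The sketch below mirrors them with strings and integers in place of tensors. The function names are made up for this example and nothing here touches the model.

```python
def interleave(text_segments, images):
    """Place one image between each pair of neighbouring text segments."""
    if len(text_segments) != len(images) + 1:
        raise ValueError("Unmatched numbers of image placeholders and images.")
    mixed = [item for pair in zip(text_segments[:-1], images) for item in pair]
    return mixed + [text_segments[-1]]


def window_start(context_len, max_new_tokens, max_length):
    """Index of the first context position kept so that context plus new tokens fit."""
    return max(0, context_len + max_new_tokens - max_length)


def clean_reply(decoded):
    """Cut at the first '###' separator and drop a leading 'Assistant:' tag."""
    return decoded.split("###")[0].split("Assistant:")[-1].strip()


prompt = "###Human: <Img><ImageHere></Img> What is this? ###Assistant:"
print(interleave(prompt.split("<ImageHere>"), ["IMG_0"]))
# ['###Human: <Img>', 'IMG_0', '</Img> What is this? ###Assistant:']
print(window_start(context_len=1900, max_new_tokens=300, max_length=2000))  # 200
print(clean_reply("Assistant: A chocolate cake.###Human:"))  # A chocolate cake.
```

The second call shows the truncation case: with 1900 context positions and 300 new tokens, the first 200 positions are dropped, which is when the class prints its warning.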

## Running Mini-GPT4

In [4]:
thumbnail_paths = [
    "./images/cake.jpg", 
    "./images/ad.png", 
    "./images/logo.jpg", 
]

In [10]:
images = []
filenames = []
for path in thumbnail_paths:
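    # Read the bytes without leaving the file open, and take the format from the extension (ad.png is not a JPEG)
    images.append(Image(value=pathlib.Path(path).read_bytes(), format=pathlib.Path(path).suffix.lstrip("."), width=256, height=256))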
    filenames.append(pathlib.Path(path).name)


# can use height_ratios to control height of rows?
grid = GridspecLayout(4, len(filenames), height='300px')

for i, (img, tid) in enumerate(zip(images, filenames)):
    grid[0, i] = HTML(value=f"Template ID: {tid}")
    grid[1:4, i] = img
        
display(grid)

GridspecLayout(children=(HTML(value='Template ID: cake.jpg', layout=Layout(grid_area='widget001')), Image(valu…

In [13]:
# Careful: ipywidgets also exports a class named Image; this import rebinds the name to PIL's Image module
from PIL import Image

prompts = {
    "./images/cake.jpg": "What are the ingredients? How do I make this?",
    "./images/ad.png": "Explain to me why this is a clever and funny advertisement",
    "./images/logo.jpg": "What are the main colors of this design? Is this a visually appealing design? Why?"
}

minigpt4 = MiniGPT4Chat(model, vis_processor)
num_beams = 1
temperature = 0.9
max_new_tokens = 200

for path, prompt in prompts.items():
    minigpt4.reset_history()
    
    minigpt4.upload_img(path)
    minigpt4.ask(prompt)
    out, _ = minigpt4.answer(
        num_beams=num_beams,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
    )    
    
    print(path,":")
    print(out)
    print('-'*20)
    
    

./images/cake.jpg :
This image shows a chocolate cake with chocolate frosting and drizzle on top. It is on a cake stand with a white plate underneath it. The cake appears to be a layered cake with a dark brown crumb and a light brown frosting on top. It has a small amount of chocolate drizzle on top of it.

There are no ingredients listed in the image, but it appears to be a homemade chocolate cake with chocolate frosting and drizzle. To make this cake, you will need the following ingredients:

* 2 cups all purpose flour
* 1 teaspoon baking powder
* 1 teaspoon salt
* 1 cup unsalted butter, at room temperature
* 1 cup granulated sugar
* 1 cup milk
*
--------------------
./images/ad.png :
The billboard in the image shows a person wearing a mask and holding up a sign that reads " best ever braces. " This advertisement is intended to promote braces for people with crooked teeth. The person in the image is smiling, which is likely intended to convey a sense of confidence and happiness about

In [25]:
from ipywidgets import Layout, Box, Image, VBox, GridspecLayout, HTML

images = []
template_ids = []
for path in thumbnail_paths:
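    # Read the bytes without leaving the file open, and take the format from the extension (ad.png is not a JPEG)
    images.append(Image(value=pathlib.Path(path).read_bytes(), format=pathlib.Path(path).suffix.lstrip("."), width=256, height=256))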
    template_ids.append(pathlib.Path(path).stem)


# can use height_ratios to control height of rows?
grid = GridspecLayout(len(template_ids), 4)

for i, (img, out) in enumerate(zip(images, gpt4_outputs)):
    grid[i, 0] = img    
    full_res = ""
    for q, r in zip([prompt_start] + prompts_followup, out):
        full_res += f"<b>{q}</b><br/>{r}<br/>"
    grid[i, 1:] = HTML(value=full_res)
    
        
display(grid)

GridspecLayout(children=(Image(value=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xf…

In [1]:
#!python MiniGPT-4/demo.py --cfg-path MiniGPT-4/eval_configs/minigpt4_eval.yaml  --gpu-id 0

Initializing Chat
Loading VIT
Loading VIT Done
Loading Q-Former
Loading Q-Former Done
Loading LLAMA

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:18<00:00,  9.21s/it]
Loading LLAMA Done
Load 4 training prompts
Prompt Example 
###Human: <Img><ImageHere></Img> Could you describe the contents of this image for me? ###Assistant: 
Load BLIP2-LLM Checkpoint: ./pretrained_minigpt4.pth
Initialization Finished
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://04f234d5480077b379.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
^C
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://04f234d5480077b379.gradio.live
