# Demo

This is a demo of the mm-cot (Multimodal Chain-of-Thought) model. The original repository is https://github.com/amazon-science/mm-cot
Make sure you have sufficient hardware before running this model. For reference, a run takes approximately 30 minutes with the base model on a 4090 GPU, and about 1 hour with the large model.

## Implementation Steps
1. Install all dependency packages from requirements.txt. Make sure to install CUDA first, before the other packages.
2. Download the ScienceQA dataset and put it inside the data folder.
3. Download the pre-trained models (4 models) from Hugging Face https://huggingface.co/cooelf/mm-cot/tree/main into the model folder.
4. Run the commands below.
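
Before running the commands, it can help to confirm that the files from steps 2 and 3 are where the commands expect them. The following is a minimal sketch, not part of the original repository: the helper name `missing_paths` is hypothetical, and the list only covers the paths that appear in the commands in this demo.

```python
from pathlib import Path

# Paths referenced by the commands in this demo. This list is an assumption
# about what a complete setup contains; extend it for your own layout.
EXPECTED_PATHS = [
    "data/instruct_captions.json",
    "model/mm-cot-base-rationale",
    "model/mm-cot-large-rationale",
    "model/mm-cot-large-answer",
]


def missing_paths(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing before running main.py:")
        for p in missing:
            print("  -", p)
    else:
        print("All expected files and folders are present.")
```

Run it from the repository root, so that the relative data and model paths resolve the same way they do for main.py.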

### Checking Hardware Availability
Make sure the GPU is detected and ready to run. PLEASE DON'T RUN THIS MODEL ON CPU, AS IT WILL TAKE FOREVER.

In [19]:
%env CUDA_VISIBLE_DEVICES=0

In [1]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.is_available())

True
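
Beyond the boolean check, it may be useful to see which GPU was detected and how much memory it has. This is a small sketch rather than part of the original demo; `describe_device` is a hypothetical helper name, and it only reports on device 0.

```python
def describe_device() -> str:
    """Summarise the first CUDA device, or explain why none is usable."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "No CUDA device detected; the model would fall back to CPU"

    # Device 0 matches CUDA_VISIBLE_DEVICES=0 set earlier in this demo.
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    return f"{props.name}: {total_gib:.1f} GiB total memory"


print(describe_device())
```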


## Rationale Generation (Step 1)
As described in the paper, there are 2 steps. In the first step, the model is given the text (question) and the visual input, and generates a rationale for the question without necessarily generating the answer yet. You can switch between the base model and the large model.
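
One way to make the base/large switch explicit is to build the command from a single size parameter. The sketch below is an illustration, not code from the repository: `rationale_command` is a hypothetical helper, and its flags are copied from the commands in this demo, where only the `--evaluate_dir` value differs between the two runs.

```python
import shlex


def rationale_command(size, model="declare-lab/flan-alpaca-large"):
    """Build the argv list for the rationale step for 'base' or 'large'."""
    if size not in {"base", "large"}:
        raise ValueError("size must be 'base' or 'large'")
    return [
        "python", "main.py",
        "--data_root", "data",
        "--caption_file", "data/instruct_captions.json",
        "--model", model,
        "--user_msg", "rationale", "--img_type", "vit",
        "--bs", "2", "--eval_bs", "4", "--epoch", "50",
        "--lr", "5e-5", "--output_len", "512",
        "--use_caption", "--use_generate", "--prompt_format", "QCM-E",
        "--output_dir", "experiments",
        # Only this value changes between the base and large runs in this demo.
        "--evaluate_dir", f"model/mm-cot-{size}-rationale",
    ]


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(rationale_command("base")))
```

To run the command from Python rather than a notebook cell, the list can be passed to `subprocess.run(cmd, check=True)`.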

In [24]:
!python main.py \
    --data_root data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg rationale --img_type vit \
    --bs 2 --eval_bs 4  --epoch 50 --lr 5e-5 --output_len 512 \
    --use_caption --use_generate --prompt_format QCM-E \
    --output_dir experiments \
    --evaluate_dir model/mm-cot-base-rationale

args Namespace(data_root='data', output_dir='experiments', model='declare-lab/flan-alpaca-large', options=['A', 'B', 'C', 'D', 'E'], epoch=50, lr=5e-05, bs=2, input_len=512, output_len=512, eval_bs=4, eval_acc=None, train_split='train', val_split='val', test_split='test', use_generate=True, final_eval=False, user_msg='rationale', img_type='vit', eval_le=None, test_le=None, evaluate_dir='model/mm-cot-base-rationale', caption_file='data/instruct_captions.json', use_caption=True, prompt_format='QCM-E', seed=42)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

  0%|          | 0/1061 [00:00<?, ?it/s]
  0%|          | 2/1061 [00:01<17:01,  1.04it/s]
  0%|          | 3/1061 [00:07<48:56,  2.78s/it]
  0%|          | 4/1061 [00:08<38:19,  2.18s/it]
  0%|          | 5/1061 [00:10<40:38,  2.31s/it]
  1%|          | 6/1061 [00:11<32:23,  1.84s/it]
  1%|          | 7/1061 [00:14<37:26,  2.13s/it]
  1%|          | 8/1061 [00:15<29:43,  1.69s/it]
  1%|          | 9/1061 [00:17<33:32,  1.91s/it]
  1%|          | 10/1061 [00:19<31:01,  1.77s/it]
  1%|1         | 11/1061 [00:19<24:37,  1.41s/it]
  1%|1         | 12/1061 [00:20<21:36,  1.24s/it]
  1%|1         | 13/1061 [00:25<38:52,  2.23s/it]
  1%|1         | 14/1061 [00:25<31:31,  1.81s/it]
  1%|1         | 15/1061 [00:28<33:33,  1.92s/it]
  2%|1         | 16/1061 [00:28<26:53,  1.


====Input Arguments====
{
  "data_root": "data",
  "output_dir": "experiments",
  "model": "declare-lab/flan-alpaca-large",
  "options": [
    "A",
    "B",
    "C",
    "D",
    "E"
  ],
  "epoch": 50,
  "lr": 5e-05,
  "bs": 2,
  "input_len": 512,
  "output_len": 512,
  "eval_bs": 4,
  "eval_acc": null,
  "train_split": "train",
  "val_split": "val",
  "test_split": "test",
  "use_generate": true,
  "final_eval": false,
  "user_msg": "rationale",
  "img_type": "vit",
  "eval_le": null,
  "test_le": null,
  "evaluate_dir": "model/mm-cot-base-rationale",
  "caption_file": "data/instruct_captions.json",
  "use_caption": true,
  "prompt_format": "QCM-E",
  "seed": 42
}
img_features size:  torch.Size([11208, 145, 1024])
number of train problems: 12726

number of val problems: 4241

number of test problems: 4241

[08:36:22] [Model]: Loading model/mm-cot-base-rationale...           main.py:71
                                                                               
           [Data]: 
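
The progress-bar total of 1061 appears consistent with one iteration per evaluation batch over the 4241 test problems at `--eval_bs 4`. The arithmetic below is a sketch of that reading, not something taken from the repository; the per-batch time of 1.7 seconds is an assumed rough average, chosen only to illustrate how the 30-minute figure could arise.

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches needed to cover every example once."""
    return math.ceil(num_examples / batch_size)


def estimated_minutes(num_examples, batch_size, seconds_per_batch):
    """Rough wall-clock estimate, assuming a constant time per batch."""
    return num_batches(num_examples, batch_size) * seconds_per_batch / 60


print(num_batches(4241, 4))  # 1061, the total shown for the rationale runs
print(num_batches(4241, 8))  # 531, the total shown for the answer run

# 1.7 s per batch is an assumption, not a measured average.
print(f"{estimated_minutes(4241, 4, 1.7):.0f} minutes")
```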

In [21]:
!python main.py \
    --data_root data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg rationale --img_type vit \
    --bs 2 --eval_bs 4  --epoch 50 --lr 5e-5 --output_len 512 \
    --use_caption --use_generate --prompt_format QCM-E \
    --output_dir experiments \
    --evaluate_dir model/mm-cot-large-rationale

args Namespace(data_root='data', output_dir='experiments', model='declare-lab/flan-alpaca-large', options=['A', 'B', 'C', 'D', 'E'], epoch=50, lr=5e-05, bs=2, input_len=512, output_len=512, eval_bs=4, eval_acc=None, train_split='train', val_split='val', test_split='test', use_generate=True, final_eval=False, user_msg='rationale', img_type='vit', eval_le=None, test_le=None, evaluate_dir='model/mm-cot-large-rationale', caption_file='data/instruct_captions.json', use_caption=True, prompt_format='QCM-E', seed=42)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

  0%|          | 0/1061 [00:00<?, ?it/s]
  0%|          | 2/1061 [00:03<30:08,  1.71s/it]
  0%|          | 3/1061 [00:16<1:53:35,  6.44s/it]
  0%|          | 4/1061 [00:18<1:26:14,  4.90s/it]
  0%|          | 5/1061 [00:24<1:33:40,  5.32s/it]
  1%|          | 6/1061 [00:26<1:13:27,  4.18s/it]
  1%|          | 7/1061 [00:33<1:27:19,  4.97s/it]
  1%|          | 8/1061 [00:34<1:08:24,  3.90s/it]
  1%|          | 9/1061 [00:40<1:18:23,  4.47s/it]
  1%|          | 10/1061 [00:43<1:09:42,  3.98s/it]
  1%|1         | 11/1061 [00:44<54:59,  3.14s/it]  
  1%|1         | 12/1061 [00:46<47:10,  2.70s/it]
  1%|1         | 13/1061 [00:57<1:32:49,  5.31s/it]
  1%|1         | 14/1061 [00:59<1:13:26,  4.21s/it]
  1%|1         | 15/1061 [01:04<1:18:26,  4.50s/it]
  2%|1         | 1


====Input Arguments====
{
  "data_root": "data",
  "output_dir": "experiments",
  "model": "declare-lab/flan-alpaca-large",
  "options": [
    "A",
    "B",
    "C",
    "D",
    "E"
  ],
  "epoch": 50,
  "lr": 5e-05,
  "bs": 2,
  "input_len": 512,
  "output_len": 512,
  "eval_bs": 4,
  "eval_acc": null,
  "train_split": "train",
  "val_split": "val",
  "test_split": "test",
  "use_generate": true,
  "final_eval": false,
  "user_msg": "rationale",
  "img_type": "vit",
  "eval_le": null,
  "test_le": null,
  "evaluate_dir": "model/mm-cot-large-rationale",
  "caption_file": "data/instruct_captions.json",
  "use_caption": true,
  "prompt_format": "QCM-E",
  "seed": 42
}
img_features size:  torch.Size([11208, 145, 1024])
number of train problems: 12726

number of val problems: 4241

number of test problems: 4241

[00:00:57] [Model]: Loading model/mm-cot-large-rationale...          main.py:71
                                                                               
           [Data]:

## Answer Generation (Step 2)

The generated rationale is now fused with the question and passed into the answer generation model, which should provide an answer to each original question.
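
As a rough illustration of this fusion at the text level, the sketch below appends the generated rationale to the original question prompt. The field labels ("Question:", "Context:", "Options:", "Solution:", "Answer:"), the helper names, and the example question are all assumptions made for illustration; the repository's actual prompt templates may differ, and the image features are not shown here.

```python
LETTERS = ["A", "B", "C", "D", "E"]  # same option labels as in the args above


def format_options(choices):
    """Render choices as '(A) first (B) second ...'."""
    return " ".join(f"({LETTERS[i]}) {c}" for i, c in enumerate(choices))


def rationale_prompt(question, context, choices):
    """Step 1 input: question, context and options; the model writes a rationale."""
    return (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Options: {format_options(choices)}\n"
        "Solution:"
    )


def answer_prompt(question, context, choices, rationale):
    """Step 2 input: the same fields plus the rationale generated in step 1."""
    return (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Options: {format_options(choices)}\n"
        f"{rationale}\n"
        "Answer:"
    )


# Made-up example, not taken from ScienceQA.
question = "Which material is a better electrical conductor?"
choices = ["copper wire", "rubber band"]
rationale = "Solution: Metals such as copper conduct electricity well."
print(answer_prompt(question, "N/A", choices, rationale))
```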

In [25]:
!python main.py \
    --data_root data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg answer --img_type vit \
    --bs 4 --eval_bs 8 --epoch 50 --lr 5e-5 --output_len 64  \
    --use_caption --use_generate --prompt_format QCMG-A \
    --output_dir experiments \
    --eval_le experiments/predictions_ans_eval.json \
    --test_le experiments/predictions_ans_test.json \
    --evaluate_dir model/mm-cot-large-answer

args Namespace(data_root='data', output_dir='experiments', model='declare-lab/flan-alpaca-large', options=['A', 'B', 'C', 'D', 'E'], epoch=50, lr=5e-05, bs=4, input_len=512, output_len=64, eval_bs=8, eval_acc=None, train_split='train', val_split='val', test_split='test', use_generate=True, final_eval=False, user_msg='answer', img_type='vit', eval_le='experiments/predictions_ans_eval.json', test_le='experiments/predictions_ans_test.json', evaluate_dir='model/mm-cot-large-answer', caption_file='data/instruct_captions.json', use_caption=True, prompt_format='QCMG-A', seed=42)
====Input Arguments====
{
  "data_root": "data",
  "output_dir": "experiments",
  "model": "declare-lab/flan-alpaca-large",
  "options": [
    "A",
    "B",
    "C",
    "D",
    "E"
  ],
  "epoch": 50,
  "lr": 5e-05,
  "bs": 4,
  "input_len": 512,
  "output_len": 64,
  "eval_bs": 8,
  "eval_acc": null,
  "train_split": "train",
  "val_split": "val",
  "test_split": "test",
  "use_generate": true,
  "final_eval": fals

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

  0%|          | 0/531 [00:00<?, ?it/s]
  0%|          | 2/531 [00:00<02:56,  3.00it/s]
  1%|          | 3/531 [00:01<04:11,  2.10it/s]
  1%|          | 4/531 [00:02<05:00,  1.76it/s]
  1%|          | 5/531 [00:02<05:20,  1.64it/s]
  1%|1         | 6/531 [00:03<06:05,  1.44it/s]
  1%|1         | 7/531 [00:04<06:09,  1.42it/s]
  2%|1         | 8/531 [00:05<06:23,  1.36it/s]
  2%|1         | 9/531 [00:05<06:32,  1.33it/s]
  2%|1         | 10/531 [00:06<06:35,  1.32it/s]
  2%|2         | 11/531 [00:07<06:33,  1.32it/s]
  2%|2         | 12/531 [00:08<06:05,  1.42it/s]
  2%|2         | 13/531 [00:08<05:53,  1.47it/s]
  3%|2         | 14/531 [00:09<05:37,  1.53it/s]
  3%|2         | 15/531 [00:09<05:23,  1.59it/s]
  3%|3         | 16/531 [00:10<05:11,  1.65it/s]
  3%|3  
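
Once the answer step has finished, the generated text still has to be mapped back to an option letter before accuracy can be computed. The sketch below shows one way to do that. It assumes the model writes something like "The answer is (B).", which is an assumption about the output format rather than something shown in this demo, and it works on plain Python lists instead of the repository's prediction files, whose structure is not covered here.

```python
import re

# Assumed output pattern: "The answer is (B)." with optional parentheses.
ANSWER_PATTERN = re.compile(r"[Tt]he answer is \(?([A-E])\)?")


def extract_choice(text):
    """Return the option letter found in `text`, or None if there is none."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1) if match else None


def accuracy(predictions, gold):
    """Fraction of predictions whose extracted letter equals the gold letter."""
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up predictions for illustration only.
preds = ["The answer is (A).", "The answer is (C).", "I am not sure."]
print(accuracy(preds, ["A", "B", "D"]))
```

A prediction with no recognisable letter counts as wrong here, which is the conservative choice when scoring.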

## Rant

The code implementation for this paper is:
1. Poorly documented
2. Poorly architected
3. Buggy