# Data Preprocessing

## 1. Data for the Alignment Stage

Download the 558K subset of the [LAION-CC-SBU dataset with BLIP captions](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) with the commands below.

In [None]:
! mkdir -p data/LLaVA-Pretrain
! wget -P data/LLaVA-Pretrain https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/resolve/main/blip_laion_cc_sbu_558k.json
! wget -P data/LLaVA-Pretrain https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/resolve/main/images.zip
! unzip data/LLaVA-Pretrain/images.zip -d data/LLaVA-Pretrain



## 2. Data for the Extension and Fine-tuning Stages

Download the annotation file for the final [LLaVA](https://github.com/haotian-liu/LLaVA) instruction-tuning data mixture, [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json):

In [None]:
! mkdir -p data/LLaVA-Instruct-150K
! wget -P data/LLaVA-Instruct-150K https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/llava_v1_5_mix665k.json



Download the images from the constituent datasets:

- COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
- GQA: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
- OCR-VQA: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), **we save all files as `.jpg`**
- TextVQA: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
- VisualGenome: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)

After downloading all of them, organize the data under `./data/LLaVA-Instruct-150K` as follows:

```
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
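
Before formatting the data, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the original pipeline: the helper name `find_missing_dirs` and the `EXPECTED_DIRS` list are introduced here for illustration and simply mirror the directory tree shown above.

```python
import os

# Sub-directories from the tree above, relative to the dataset root.
# (Hypothetical constant, written out by hand from the listing.)
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def find_missing_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    return [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]


if __name__ == "__main__":
    missing = find_missing_dirs("./data/LLaVA-Instruct-150K")
    print("missing:", missing if missing else "none")
```

An empty result only shows that the folders exist; it does not verify that every image was extracted.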

## 3. Format Data

The two cells below rewrite each raw record as a two-turn conversation with `user` and `assistant` roles, replacing the `<image>` placeholder in the user turn with `<image>path</image>`. The first cell handles the alignment-stage data and the second the fine-tuning data.

In [None]:
import json
import os
import random

from tqdm import tqdm

# Location of the downloaded alignment-stage annotations and images.
mpath = "./data/LLaVA-Pretrain"
fname = "blip_laion_cc_sbu_558k.json"

with open(os.path.join(mpath, fname), "r") as f:
    data = json.load(f)

print(len(data))     # number of raw records
print(data[-10000])  # inspect one raw record

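# Rewrite each record as a two-turn user/assistant conversation,
# replacing the <image> placeholder with the tagged image path.
new_data = []
for i in tqdm(data):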
    image = f"<image>{mpath}/{i['image']}</image>"
    user = i['conversations'][0]['value'].replace('<image>', image)
    assistant = i['conversations'][1]['value']
    info = {
        "conversations": [
            {
                "from": "user",
                "value": user
            },
            {
                "from": "assistant",
                "value": assistant
            }
        ]
    }
    new_data.append(info)

length = len(new_data)
print(length)

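# ensure_ascii=False writes non-ASCII characters as-is, so fix the encoding.
with open("data/558k_pretrain.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(new_data, ensure_ascii=False))

# randint is inclusive at both ends, so the upper bound must be length - 1.
print(new_data[random.randint(0, length - 1)])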
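
To make the conversion performed by the cell above concrete, here is a small sketch that applies the same transformation to a single record. The function name `convert_record` and the sample record are made up for illustration (the sample is not taken from the dataset); the `human`/`gpt` role names in the sample are an assumption about the raw file's format.

```python
def convert_record(record, root):
    """Convert one raw record to the user/assistant format used above."""
    question = record["conversations"][0]["value"]
    if record.get("image"):
        # Replace the bare placeholder with a tag that carries the image path.
        tag = f"<image>{root}/{record['image']}</image>"
        question = question.replace("<image>", tag)
    return {
        "conversations": [
            {"from": "user", "value": question},
            {"from": "assistant", "value": record["conversations"][1]["value"]},
        ]
    }


# Invented sample record, for illustration only.
sample = {
    "id": "0001",
    "image": "00000/0001.jpg",
    "conversations": [
        {"from": "human", "value": "Describe the image.\n<image>"},
        {"from": "gpt", "value": "A cat on a sofa."},
    ],
}

converted = convert_record(sample, "./data/LLaVA-Pretrain")
print(converted["conversations"][0]["value"])
```

Records without an `image` field pass through with their user text unchanged, which matches how text-only samples are handled in the fine-tuning cell.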

In [None]:
import json
import os
import random

from tqdm import tqdm

# Location of the downloaded fine-tuning annotations and images.
mpath = "./data/LLaVA-Instruct-150K"
fname = "llava_v1_5_mix665k.json"
with open(os.path.join(mpath, fname), "r") as f:
    data = json.load(f)

print(len(data))     # number of raw records
print(data[-10000])  # inspect one raw record

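# Keep the first user/assistant exchange of each record. Records without
# an "image" field are text-only and keep their user text unchanged.
new_data = []
for i in tqdm(data):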
    if i.get("image"):
        image = f"<image>{mpath}/{i.get('image')}</image>"
        user = i['conversations'][0]['value'].replace('<image>', image)
    else:
        user = i['conversations'][0]['value']
    assistant = i['conversations'][1]['value']
    info = {
        "conversations": [
            {
                "from": "user",
                "value": user
            },
            {
                "from": "assistant",
                "value": assistant
            }
        ]
    }
    new_data.append(info)

length = len(new_data)
print(length)

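# ensure_ascii=False writes non-ASCII characters as-is, so fix the encoding.
with open("data/665k_finetune.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(new_data, ensure_ascii=False))

# randint is inclusive at both ends, so the upper bound must be length - 1.
print(new_data[random.randint(0, length - 1)])

# Also save a random 10k subset of the formatted data.
new_data = random.sample(new_data, 10000)
with open("data/10k_finetune.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(new_data, ensure_ascii=False))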
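
Because the formatted files embed image paths directly in the text, a quick check that those paths resolve can catch a misplaced folder early. The sketch below is an optional addition, not part of the original notebook; `find_missing_images` and `check_file` are hypothetical helper names. It assumes the paths are relative, as written by the cells above, so it should be run from the same working directory.

```python
import json
import os
import re

# Matches the <image>path</image> tags written by the formatting cells.
IMAGE_TAG = re.compile(r"<image>(.*?)</image>")


def find_missing_images(records):
    """Return image paths referenced in user turns that are not on disk."""
    missing = []
    for record in records:
        user_value = record["conversations"][0]["value"]
        for path in IMAGE_TAG.findall(user_value):
            if not os.path.isfile(path):
                missing.append(path)
    return missing


def check_file(json_path):
    """Load a formatted JSON file and list its unresolved image paths."""
    with open(json_path, "r", encoding="utf-8") as f:
        return find_missing_images(json.load(f))


# Example usage (assumes the file written above exists):
# print(len(check_file("data/10k_finetune.json")))
```

Running it on the 10k subset first gives a fast signal before scanning the full files.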
