LingCloud

The LingCloud project seeks to enhance the capability of large language models by incorporating human-like eyes.


I would like to express my sincere gratitude to all co-authors: my advisors, Prof. Baotian Hu, Lin Ma, and Min Zhang, and team members Xinyu Chen, Wanqi Zhong, and Yiran Cui, for their tremendous support.

Currently, GPT-4 has achieved unparalleled proficiency in image comprehension. Given our limited computational resources and financial support, we also aim to develop a model that can perform various tasks akin to GPT-4. The goal of this project is to connect visual information to the large language model (the brain), thus increasing its ability to comprehend the external world's infinite-granularity visual content. As a result, we present the first version of LingCloud, LMEye (IPN), which will be continuously improved to achieve robust and efficient interaction between LLMs and the external world.

If you have any questions, please feel free to contact me by e-mail (liyunxin987@163.com) or Twitter (@LyxTg), or submit an issue in the repository.

🔥 News

[08.04] We achieved first place on SEED-Bench across the 9 dimensions of image understanding (see here).

[07.20] We achieved first place among multimodal LLMs with fewer parameters on the MMBench leaderboard.

[07.17] Please see the new LMEye version. The dynamically updated test address is https://c9740b4915267dc264.gradio.live. It supports single-round Q&A without input images, single-round Q&A on images, Chinese input, and English input.

[07.02] We release a new version, LMEye v0.1. Please follow the instructions here to run it. Its performance on perceptual and cognitive evaluations surpasses most MLLMs.

[07.02] The online demo is closed for a full upgrade. We will continue to provide the newest local demo with powerful LMEye variants.

[06.24] An online demo of LMEye (IPN-Bloomz-7b1): http://model.hitwds.cn:7080/.

[06.12] We release more diverse and higher-quality Multimodal Instruction-following Data (V2), termed LMEyeMID. Please see https://huggingface.co/datasets/YunxinLi/Multimodal_Insturction_Data_V2.

[05.25] We provide a file to deploy a simple demo.

[05.22] We release the code of LMEye and the tuned checkpoints for LLaMA-7b/13b and Bloomz-7b1.

[05.05] We present the paper LMEye: An Interactive Perception Network for Large Language Models.

[05.04] We release the evaluation dataset (/data/multimodal_data_all_generation.txt), constructed by GPT-3.5-turbo based on about 3.5k images from MiniGPT-4. You can also download these images and put them into the path /data/image/.

🚀 Architecture

Here, you can see the detailed architecture and some experimental analyses of LingCloud 1.0, LMEye.

✨ Presentation

You can deploy a simple LMEye demo using the following command:

python app_demo.py

Here, we present some cases from the experimental section and the Appendix.

🚀 How to run

All code is in the directory LMEye.

Environment

  1. You can install the environment from the conda environment file LMEye_environment.yml.

  2. You only need to run train.py to obtain an LMEye variant based on BLIP-2.

Train

  1. If you want to train a similar model from scratch, you can use train.py to perform the first-stage multimodal pretraining.

    Prepare the pretraining image-text pairs from released corpora such as LAION and CC3M, and use a frozen visual encoder (e.g., CLIP-ViT-L/14) to extract image features (a feature-extraction sketch follows this list).

    Download the checkpoints of corresponding LLMs and modify the path.

    At this stage, a more powerful visual encoder matters more than a more powerful language model.

  2. For the second-stage instruction tuning, also use train.py.

    Here, you can download the first or second version of the Multimodal Instruction Data. The image sources include COCO Caption, Flickr30k, and the released multimodal instruction data from LLaVA.
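For the feature-extraction step above, here is a minimal sketch assuming the HuggingFace transformers implementation of CLIP-ViT-L/14; the image path is a placeholder and the exact preprocessing used in train.py may differ.

# Sketch: extract frozen CLIP-ViT-L/14 patch features for one image.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
encoder.eval()  # the visual encoder stays frozen during pretraining

image = Image.open("data/image/example.jpg")  # hypothetical example image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state  # [1, 257, 1024] patch features

These frozen features are then paired with their captions for the first-stage multimodal pretraining.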

Test

The following test process for previous LMEye variants can be ignored.

We release the instruction-tuned checkpoints for LLaMA-7b/13b and Bloomz-7b1. You can download them from the repository on the Huggingface Hub.
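As a reference, here is a hedged example of fetching one of these checkpoints with the huggingface_hub library; the repo id below is a placeholder, so replace it with the actual repository name listed on the Huggingface Hub.

# Sketch: download a released checkpoint snapshot from the Huggingface Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="YunxinLi/LMEye")  # hypothetical repo id; use the real one
print("Checkpoint files downloaded to:", local_dir)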

🚨 Discussion

  1. Finetuning LLMs with multimodal instruction data may decrease their performance on NLP tasks. In this paper, we find that text instruction-tuned LLMs generalize better when performing multimodal interaction. In the future, could we jointly finetune LLMs with multimodal instruction data and text-only instruction-tuning data? How could we alleviate this bias?
  2. Hallucination.
  3. Text-only instruction-tuned LLMs perform better than plain LLMs for image understanding in downstream tasks.
  4. Self-instructed multimodal instruction-following data is diverse, yet its quality still has much room for improvement.
  5. How to perform image-text semantic alignment under this paradigm.

Acknowledgements

Thanks to everyone for your contributions.

If you use LMEye in your research or applications, please cite our work:

@article{li2023lmeye,
    title={LMEye: An Interactive Perception Network for Large Language Models},
    author={Li, Yunxin and Hu, Baotian and Chen, Xinyu and Ma, Lin and Zhang, Min},
    journal={arXiv preprint arXiv:2305.03701},
    year={2023}
}

License

This repository is licensed under the Apache License 2.0.
