UniFashion

The official code for paper "UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation"

Abstract

The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. Rapid advances in AI-generated content, particularly large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models to the fashion domain. However, tasks involving embeddings, such as image-to-text or text-to-image retrieval, have been largely overlooked from this perspective due to the diverse nature of the multimodal fashion domain, and current research on multi-task single models lacks a focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and an LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks and can be readily adapted to complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain.

Data preparation

  1. Download data from FashionGen (code: pz7a), Fashion200K, and FashionIQ.
  2. Process the FashionIQ dataset into Fashion-IQ-Cap (a hedged sketch of this captioning step appears below):
    cd src/llava/LLaVA
    python inference.py

We have uploaded the Fashion-IQ-Cap captions generated with LLaVA v1.6 to the dataset directory, so you can simply download them.
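For orientation only, the sketch below shows how such captions could be generated with a LLaVA v1.6 checkpoint from Hugging Face; the model id, prompt, image directory, and output file are assumptions for illustration and are not necessarily what src/llava/LLaVA/inference.py uses.

    # Hypothetical captioning sketch: model id, prompt, and paths are assumptions,
    # not the exact settings used by src/llava/LLaVA/inference.py.
    import json
    from pathlib import Path

    import torch
    from PIL import Image
    from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

    MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed LLaVA v1.6 checkpoint
    IMAGE_DIR = Path("data/fashionIQ/images")       # assumed dataset layout

    processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, device_map="auto"
    )

    # Prompt format used by the Mistral-based LLaVA v1.6 checkpoints.
    prompt = "[INST] <image>\nDescribe this fashion item in one sentence. [/INST]"

    captions = {}
    for image_path in sorted(IMAGE_DIR.glob("*.jpg")):
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=64)
        text = processor.decode(output[0], skip_special_tokens=True)
        captions[image_path.stem] = text.split("[/INST]")[-1].strip()

    Path("fashion_iq_cap.json").write_text(json.dumps(captions, indent=2))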

Training

  1. Phase 1 - Cross-modal Pre-training

    run this command:

     bash pretrain.sh
  2. Phase 2 - Composed Multimodal Fine-tuning: run this command:

     bash cir_ft.sh
  3. MGD fine-tuning: run this command:

     cd src/UNIStableVITON
     bash train.sh

Validation

You can download all the checkpoints from Hugging Face.
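As a minimal sketch (the actual Hugging Face repository id is not listed here, so the one below is a placeholder), the checkpoints can be fetched with the huggingface_hub client:

    # Minimal checkpoint download sketch with huggingface_hub.
    # The repo_id is a placeholder; substitute the actual UniFashion repository.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="<org>/<unifashion-checkpoints>",  # placeholder id
        local_dir="checkpoints/unifashion",
    )
    print("Checkpoints downloaded to", local_dir)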

  1. FashionGen dataset for cross-modal retrieval tasks: during training, we validate every epoch and save the results to a CSV file. Alternatively, you can validate a saved checkpoint by running:

     bash vail.sh
    CUDA_VISIBLE_DEVICES=0 python src/blip_validate.py \
    --dataset 'fashiongen' \
    --blip-model-name 'blip2_cir_cls' \
    --model-path ...
    
  2. Image captioning task performance:

    run this command to generate captions for images in FashionGen:

     cd src/llava/LLaVA
     python cot.py

    run this command to compute BLEU, METEOR, and ROUGE-L (a hedged scoring sketch appears after this list):

     cd src
     python metrics.py
  3. Fashion-IQ dataset for the composed image retrieval task: run this command:

     bash vail.sh
    CUDA_VISIBLE_DEVICES=0 python src/blip_validate.py \
    --dataset 'fashioniq' \
    --blip-model-name 'blip2_cir_rerank' \
    --model-path ...
    
  4. VITON-HD and MGD datasets for the try-on task: run this command:

     cd src/UNIStableVITON
     bash inference.sh
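The script src/metrics.py computes the captioning metrics above; as one possible reference (not necessarily how metrics.py does it), the pycocoevalcap package can score generated captions against references as sketched below. The caption dictionaries here are invented for illustration, and METEOR additionally requires a Java runtime.

    # Illustrative scoring sketch with pycocoevalcap; the example captions are
    # made up, and this is not necessarily how src/metrics.py computes the scores.
    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge

    # Map each image id to a list of reference captions and generated captions.
    references = {"img_001": ["a red dress with a floral print"]}
    generated = {"img_001": ["a red floral dress"]}

    for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()), ("ROUGE-L", Rouge())]:
        score, _ = scorer.compute_score(references, generated)
        print(name, score)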
