
compute_stats_single apparently not executed when running the image_text_similarity_filter and image_text_matching_filter operators #682

Open
@butterbutterflies

Description


Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

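Hardware: 8 × NVIDIA H800 80 GB

Background: I am trying to reproduce llava-1.5-pretrain-dataset-refine-recipe with the following command: dj-process --config configs/data_juicer_recipes/llava-pretrain-refine.yaml

Implementation details:
1. llava-1.5-pretrain-dataset-refine-recipe depends on the BLIP and CLIP models. I downloaded the model files locally in advance and, in image_text_similarity_filter.py and image_text_matching_filter.py, hard-coded pretrained_model_name_or_path to the local model path in code like the following: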
self.model_key = prepare_model(model_type='huggingface', pretrained_model_name_or_path="/home/for_llava_preprocess/clip-vit-base-patch32/", trust_remote_code=trust_remote_code)

2. I extracted 15 records from the full blip_laion_cc_sbu_558k_dj_fmt_only_caption.jsonl file for testing. Sample records:
{"id": "004539375", "text": "<image>\nselect luxury furniture 3 - inch gel memory foam mattress topper <|__dj__eoc|>", "images": ["00453/004539375.jpg"]}
{"id": "002239345", "text": "<image>\na grey watch with an army style strap <|__dj__eoc|>", "images": ["00223/002239345.jpg"]}
{"id": "005947502", "text": "<image>\na dragon kite flying in the blue sky stock images <|__dj__eoc|>", "images": ["00594/005947502.jpg"]}
{"id": "005116462", "text": "<image>\n$ 10 - cute cheap printed mini dress - khaki multicolor striped floral print peasant short sleeve tunic <|__dj__eoc|>", "images": ["00511/005116462.jpg"]}
...

3. The YAML configuration is shown below. It is mainly intended to test image_text_similarity_filter and image_text_matching_filter:
`project_name: 'llava-1.5-pretrain-dataset-refine-recipe'
dataset_path: '/sfs/llava_pretrain_558k/test_llava.jsonl'
export_path: '/sfs/llava_pretrain_558k/test_llava_refine.jsonl'
np: 4
text_keys: 'text'
image_key: 'images'
image_special_token: ''
eoc_special_token: '<|__dj__eoc|>'

open_tracer: true

process:

  - image_text_similarity_filter:  # 544202
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.20315419
  - image_text_matching_filter:
      hf_blip: Salesforce/blip-itm-base-coco
      min_score: 0.44930778 `

4. I added the following log output to the compute_stats_single method in image_text_similarity_filter.py:
` # ----------------------------------------------------------
import logging

# Configure logging so that each message shows the emitting process
logging.basicConfig(
    level=logging.INFO,
    format='%(processName)s - %(message)s'
)
logger = logging.getLogger()

# Print from inside the worker process; chunk_logits is a local
# variable of the surrounding compute_stats_single method
logger.info("--------------------------------")
logger.info(chunk_logits)
logger.info("--------------------------------")
# ----------------------------------------------------------

`
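
Since the messages above never showed up, one way to rule out the console output being captured or redirected in worker processes is to write from each worker to its own file. The sketch below is only an illustration: `debug_dump` and the `dj_debug` directory name are hypothetical, not part of Data-Juicer, and it uses only the standard library.

```python
import os
import tempfile
from pathlib import Path


def debug_dump(message, log_dir=None):
    """Append a message to a file named after the current process ID.

    Hypothetical helper, not part of Data-Juicer. Writing to a file avoids
    relying on the worker's stdout/stderr reaching the console.
    """
    base = Path(log_dir) if log_dir else Path(tempfile.gettempdir()) / "dj_debug"
    base.mkdir(parents=True, exist_ok=True)
    path = base / f"worker_{os.getpid()}.log"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"{message}\n")
    return path
```

Calling `debug_dump(f"sample id: {sample.get('id')}")` on the first line of compute_stats_single would then leave one file per worker process if the method is reached at all.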

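It may also be worth checking the 15 test records from step 2 before suspecting the model loading. The sketch below assumes (this is not confirmed anywhere in this issue) that empty stats could come from records whose image files or `<image>` tokens are not found. `check_samples` is a hypothetical helper, and `image_root` stands for whatever directory the relative paths in `images` are meant to resolve against.

```python
import json
from pathlib import Path

IMAGE_TOKEN = "<image>"  # token seen in the sample records from step 2


def check_samples(jsonl_path, image_root, image_key="images", text_key="text"):
    """Return a list of (line_number, problem) tuples for suspicious records."""
    problems = []
    root = Path(image_root)
    with open(jsonl_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            images = record.get(image_key) or []
            text = record.get(text_key) or ""
            if not images:
                problems.append((line_no, "no images listed"))
            if text.count(IMAGE_TOKEN) != len(images):
                problems.append((line_no, "image token count != image count"))
            for rel_path in images:
                if not (root / rel_path).is_file():
                    problems.append((line_no, f"missing file: {rel_path}"))
    return problems
```

An empty result means every record lists at least one image, each image file exists under `image_root`, and the number of `<image>` tokens matches the number of images.
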
Problem description:
1. With np=4 and only 15 test records, image_text_similarity_filter prints no chunk_logits output at all. I also added a log statement on the first line of compute_stats_single to print the sample being processed, and that produced no output either. However, when I run the unit test script test_image_text_similarity_filter.py, all log output appears as expected (the unit test uses np=2).

2. After running dj-process --config configs/data_juicer_recipes/llava-pretrain-refine.yaml, the output stats.jsonl looks like this, with image_text_similarity empty:
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}

3. Given the above, is my way of loading the models wrong? Also, how should np be calculated from the available device resources, the volume of data to process, and the GPU memory used by the loaded models? The official llava-1.5-pretrain-dataset-refine-recipe sets np=42; what hardware was that setting run on?
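
To make the np question concrete, this is the kind of back-of-the-envelope estimate I have in mind. It is a rough heuristic of my own, not Data-Juicer's actual logic: `estimate_np` is a hypothetical function, it assumes each worker holds its own model copy on a GPU, and the per-worker memory figure has to be measured rather than guessed.

```python
import os


def estimate_np(num_gpus, gpu_mem_gb, per_worker_mem_gb,
                cpu_count=None, reserve_fraction=0.1):
    """Rough upper bound on worker processes for a GPU-backed operator.

    Heuristic only: each worker is assumed to need `per_worker_mem_gb` of GPU
    memory for its own model copy, and `reserve_fraction` of each GPU is held
    back as headroom. The result is also capped by the number of CPU cores.
    """
    if per_worker_mem_gb <= 0:
        raise ValueError("per_worker_mem_gb must be positive")
    usable = gpu_mem_gb * (1.0 - reserve_fraction)
    workers_per_gpu = int(usable // per_worker_mem_gb)
    gpu_bound = num_gpus * workers_per_gpu
    cpu_bound = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(gpu_bound, cpu_bound))
```

For example, `estimate_np(8, 80, 20, cpu_count=64)` gives 24 under these assumptions; the 20 GB per worker is a placeholder, not a measured value for CLIP or BLIP.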

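For the stats.jsonl output in item 2, a small script like the one below can confirm whether every record is empty or only some of them. It is a minimal sketch using only the standard library; `count_empty_stats` is a hypothetical helper, and the `__dj__stats__` field name is taken from the output shown above.

```python
import json

STATS_FIELD = "__dj__stats__"  # field name as it appears in stats.jsonl above


def count_empty_stats(lines):
    """Count, per stat key, how many records hold an empty value.

    `lines` is any iterable of JSON strings, e.g. an open stats.jsonl file.
    Returns (number_of_records, {stat_key: number_of_empty_values}).
    """
    total = 0
    empty = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        stats = json.loads(line).get(STATS_FIELD, {})
        total += 1
        for key, value in stats.items():
            empty.setdefault(key, 0)
            if value in ([], None):
                empty[key] += 1
    return total, empty
```

Usage would be along the lines of `with open("stats.jsonl", encoding="utf-8") as f: print(count_empty_stats(f))`.
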
Additional

No response
