Description
Before Asking
I have pulled the latest code from the main branch and run it again, and the problem still exists.
Search before asking
Question
Current hardware: 8 × NVIDIA H800 80G
Background: I am trying to reproduce llava-1.5-pretrain-dataset-refine-recipe with the command: dj-process --config configs/data_juicer_recipes/llava-pretrain-refine.yaml
Steps taken:
1. llava-1.5-pretrain-dataset-refine-recipe depends on the BLIP and CLIP models. I downloaded the model files locally in advance and, in image_text_similarity_filter.py and image_text_matching_filter.py, hardcoded pretrained_model_name_or_path to the local model path in the following code:
self.model_key = prepare_model(model_type='huggingface', pretrained_model_name_or_path="/home/for_llava_preprocess/clip-vit-base-patch32/", trust_remote_code=trust_remote_code)
2. I extracted 15 records from the full blip_laion_cc_sbu_558k_dj_fmt_only_caption.jsonl file for testing. Sample content:
{"id": "004539375", "text": "<image>\nselect luxury furniture 3 - inch gel memory foam mattress topper <|__dj__eoc|>", "images": ["00453/004539375.jpg"]}
{"id": "002239345", "text": "<image>\na grey watch with an army style strap <|__dj__eoc|>", "images": ["00223/002239345.jpg"]}
{"id": "005947502", "text": "<image>\na dragon kite flying in the blue sky stock images <|__dj__eoc|>", "images": ["00594/005947502.jpg"]}
{"id": "005116462", "text": "<image>\n$ 10 - cute cheap printed mini dress - khaki multicolor striped floral print peasant short sleeve tunic <|__dj__eoc|>", "images": ["00511/005116462.jpg"]}
...
3. The YAML config is as follows. It is mainly meant to test image_text_similarity_filter and image_text_matching_filter:
`project_name: 'llava-1.5-pretrain-dataset-refine-recipe'
dataset_path: '/sfs/llava_pretrain_558k/test_llava.jsonl'
export_path: '/sfs/llava_pretrain_558k/test_llava_refine.jsonl'
np: 4
text_keys: 'text'
image_key: 'images'
image_special_token: ''
eoc_special_token: '<|__dj__eoc|>'
open_tracer: true
process:
  - image_text_similarity_filter: #544202
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.20315419
  - image_text_matching_filter:
      hf_blip: Salesforce/blip-itm-base-coco
      min_score: 0.44930778 `
4. I added the following log statements to the compute_stats_single method in image_text_similarity_filter.py:
` # ----------------------------------------------------------
import logging
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(processName)s - %(message)s'
)
logger = logging.getLogger()
# Print from within the worker process
logger.info("--------------------------------")
logger.info(chunk_logits)
logger.info("--------------------------------")
# ----------------------------------------------------------
`
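One thing I am unsure about in step 4 is whether logging.basicConfig takes effect at all: it is a no-op when the root logger already has handlers (unless force=True is passed, available since Python 3.8). A minimal sketch of a debug logger that does not depend on the root logger is below. The helper names get_debug_logger and log_value and the logger name "dj_debug" are hypothetical and are not part of Data-Juicer.

```python
import logging
import sys


def get_debug_logger(name="dj_debug", stream=None):
    """Return a logger with its own handler, independent of the root logger.

    Hypothetical helper for debugging; not a Data-Juicer API.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler(stream or sys.stderr)
        handler.setFormatter(logging.Formatter("%(processName)s - %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False  # do not rely on how the root logger is configured
    return logger


def log_value(label, value, logger=None):
    """Log a labelled value (e.g. chunk_logits) and flush immediately."""
    logger = logger or get_debug_logger()
    logger.info("%s: %r", label, value)
    for handler in logger.handlers:
        handler.flush()  # make sure output is not lost if the worker exits early
```

Inside compute_stats_single this would be called as log_value("chunk_logits", chunk_logits). If nothing is printed even then, the method is probably not being executed at all in that run, which would narrow down the problem.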
Problem description:
1. With np=4 and only 15 test records, image_text_similarity_filter prints no chunk_logits output at all. I even added a log statement on the first line of compute_stats_single to print the sample being processed, and still saw no output. However, when I run the unit test script test_image_text_similarity_filter.py, all log output appears as expected (the unit test uses np=2).
2. Running dj-process --config configs/data_juicer_recipes/llava-pretrain-refine.yaml produces a stats.jsonl with the following content, in which image_text_similarity is empty:
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
{"__dj__stats__":{"image_text_matching_score":[],"image_text_similarity":[]}}
3. Given the above, is my way of loading the models wrong? Also, how should np be chosen based on the available hardware, the amount of data to process, and the GPU memory used by the loaded models? The official llava-1.5-pretrain-dataset-refine-recipe sets np=42; what hardware was that setting run on?
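To make the np question concrete, the kind of back-of-the-envelope estimate I have in mind looks like the sketch below. The formula, the headroom factor, and the helper name estimate_np are my own assumptions, not something taken from the Data-Juicer documentation, and the per-process memory figure would have to be measured (for example with nvidia-smi while one worker is running).

```python
import os


def estimate_np(num_gpus, gpu_mem_gb, per_proc_mem_gb, cpu_count=None, headroom=0.8):
    """Rough upper bound on the number of worker processes.

    Assumption: every worker loads its own copy of the model onto one GPU,
    so np is limited both by GPU memory and by the number of CPU cores.
    `headroom` is the fraction of GPU memory considered usable.
    """
    if per_proc_mem_gb <= 0:
        raise ValueError("per_proc_mem_gb must be positive")
    cpu_count = cpu_count or os.cpu_count() or 1
    # Number of model copies that fit on a single GPU.
    procs_per_gpu = int((gpu_mem_gb * headroom) // per_proc_mem_gb)
    gpu_bound = num_gpus * procs_per_gpu
    # Never go below one process, never above the CPU core count.
    return max(1, min(gpu_bound, cpu_count))
```

For my machine this would be called as estimate_np(num_gpus=8, gpu_mem_gb=80, per_proc_mem_gb=...), with the last argument filled in from a measurement. I would like to know whether this reasoning matches how np=42 was chosen for the official recipe.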
Additional
No response