After enabling OneDiff optimization, the UNet output is NaN on A100, but the UNet result is correct on V100 #958

lss15151161 · 2024-06-17T07:42:23Z

Describe the bug

With OneDiff optimization enabled for StableVideoDiffusionPipeline, the UNet output is NaN on A100, while the UNet result is correct on V100.

Your environment

debian

OneDiff git commit id

onediff 0.13.0.dev202403280124
onediffx 0.13.0.dev202403280124

OneFlow version info

oneflow 0.9.1.dev20240615+cu118

How To Reproduce

Steps to reproduce the behavior(code or script):

The complete error message

Additional context

Add any other context about the problem here.

lijunliangTG · 2024-06-19T01:27:04Z

Please provide the code that reproduces this behavior. You could also try updating OneDiff to the latest version first.

lss15151161 · 2024-06-19T04:00:59Z

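> Please provide the code that reproduces this behavior. You could also try updating OneDiff to the latest version first.

Hello, thanks for the reply. After upgrading to the latest version, the same problem still occurs. The reproduction code is below. The model weights are too large to upload, but you should be able to reproduce it with the open-source weights.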

import json

import torch
from diffusers.models import UNetSpatioTemporalConditionModel
from onediff.infer_compiler import oneflow_compile

if __name__ == "__main__":
    unet_model_path = './xxx/unet'

    # Load the UNet in fp16 on the GPU and switch it to eval mode.
    m_torch = UNetSpatioTemporalConditionModel.from_pretrained(unet_model_path, torch_dtype=torch.float16).cuda().eval()

    # Compile the same module with OneDiff's OneFlow backend.
    m_oneflow = oneflow_compile(m_torch)

    # Build the inputs from the attached mock data.
    with open("./mock_unet_input_a100.txt", 'r') as f:
        mock_input = json.load(f)
        latent_model_input = torch.tensor(mock_input['latent_model_input']).half().cuda()
        t = torch.tensor(mock_input['t']).cuda()
        image_embeddings = torch.tensor(mock_input["image_embeddings"]).half().cuda()
        added_time_ids = torch.tensor(mock_input["added_time_ids"]).half().cuda()

    # Reference run with plain PyTorch.
    with torch.no_grad():
        out = m_torch(latent_model_input, t, image_embeddings, added_time_ids, return_dict=False)
    print(out[0][0,0,0,0,:5])

    # Run with the OneDiff-compiled module; this is the output that is NaN on A100.
    with torch.no_grad():
        out = m_oneflow(latent_model_input, t, image_embeddings, added_time_ids, return_dict=False)
    print(out[0][0,0,0,0,:5])

mock_unet_input_a100.txt
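
The script above prints only the first five values of each output, so NaN or Inf values elsewhere in the tensor would go unnoticed. A minimal sketch of a fuller check follows. `nan_report` is a hypothetical helper name, not part of OneDiff or diffusers, and it takes a flat list of Python floats so that it runs without a GPU.

```python
import math


def nan_report(values):
    """Count NaN and Inf entries in a flat iterable of Python floats."""
    values = list(values)
    n_nan = sum(1 for v in values if math.isnan(v))
    n_inf = sum(1 for v in values if math.isinf(v))
    return {"total": len(values), "nan": n_nan, "inf": n_inf}


# Small self-contained example with one NaN and one Inf.
example = nan_report([0.5, float("nan"), float("inf"), -1.0])
print(example)
```

With the reproduction script, it could be called as `nan_report(out[0].float().flatten().tolist())` after each forward pass to compare the PyTorch and compiled outputs over the whole tensor.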

lijunliangTG · 2024-06-20T02:16:51Z

Could you provide the Hugging Face model name so that I can reproduce your error?

lss15151161 · 2024-06-20T06:42:39Z

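> Could you provide the Hugging Face model name so that I can reproduce your error?

I used the model here. I tried both the fp16 and fp32 weights, loading them as fp16 in both cases, and the result is NaN either way: https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/tree/main/unet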


lss15151161 · 2024-06-20T06:43:43Z

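> Could you provide the Hugging Face model name so that I can reproduce your error?

I used the model here. I tried both the fp16 and fp32 weights, loading them as fp16 in both cases, and the result is NaN either way: https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/tree/main/unet

A10, A30, and A100 all show the same problem.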

strint · 2024-07-12T03:48:11Z

@lss15151161

Try this:

export ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION=False
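
For scripts launched from an IDE or notebook, where a shell `export` is inconvenient, the same variable can be set from Python. This is a sketch under one assumption not stated in the thread: that the variable needs to be in the environment before `oneflow` and `onediff` are imported, so the assignment is placed at the very top of the script.

```python
import os

# Assumption: set the variable before importing oneflow/onediff so that it is
# already in the process environment when those libraries read it.
os.environ["ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION"] = "False"

# Import OneDiff only after the variable is set (left commented so that this
# sketch runs without OneDiff installed):
# from onediff.infer_compiler import oneflow_compile

flag_value = os.environ["ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION"]
print(flag_value)
```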

strint · 2024-07-19T07:35:50Z

@lss15151161 Please have a try.

lss15151161 · 2024-07-19T08:21:58Z

> @lss15151161 Please have a try.

Thanks, this solved my problem.
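
The thread does not explain why the flag helps. One plausible reading, offered here only as an illustration and not as a confirmed diagnosis, is that accumulating sums in fp16 can overflow past the fp16 maximum (65504) to Inf, and later arithmetic on Inf can yield NaN. The NumPy sketch below shows that general effect on made-up values; it does not use OneFlow or reproduce its attention kernel.

```python
import numpy as np

# Eight made-up values whose true sum (80000) exceeds the fp16 maximum (65504).
values = np.full(8, 10000.0, dtype=np.float16)

with np.errstate(over="ignore", invalid="ignore"):
    # Accumulate in fp16: the running sum overflows to Inf.
    acc16 = np.float16(0)
    for v in values:
        acc16 = np.float16(acc16 + v)

    # Accumulate the same fp16 inputs in fp32: the sum stays finite.
    acc32 = np.float32(0)
    for v in values:
        acc32 = acc32 + np.float32(v)

    # Inf minus Inf is NaN, which is how an overflow can turn into NaN later.
    diff = acc16 - acc16

print(acc16, acc32, diff)
```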

lss15151161 assigned strint Jun 17, 2024

lss15151161 mentioned this issue Jun 20, 2024

StableVideoDiffusionPipeline acceleration runs successfully on V100 but fails with an error on A30 #955

Closed

strint added the sig-hfdiffusers label Jul 5, 2024

lss15151161 closed this as completed Jul 19, 2024
