yuv444p10le -> y4m: failed to parse y4m header. #127

Closed
Selur opened this issue Jun 1, 2019 · 12 comments

Comments

@Selur

commented Jun 1, 2019

using:
ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p10le -strict -2 -vsync 0 -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --profile main444 --level auto --tier high --sar 1:1 --lookahead 32 --output-depth 10 --cbrhq 44000 --max-bitrate 240000 --aq --gop-len 600 --strict-gop --bframes 4 --ref 7 --weightp --mv-precision Q-pel --preset quality --colormatrix bt709 --cuda-schedule sync --output "E:\Temp\paloma8kprores_13_36_54_1210_02.265"
using x264:
ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p10le -strict -2 -vsync 0 -f yuv4mpegpipe - | x264 --preset fast --crf 18.00 --profile high444 --level 5.2 --direct auto --b-adapt 0 --qcomp 0.5 --rc-lookahead 40 --qpmax 81 --partitions i4x4,p8x8,b8x8 --no-fast-pskip --subme 5 --trellis 0 --aq-mode 0 --vbv-maxrate 720000 --vbv-bufsize 720000 --sar 1:1 --non-deterministic --range tv --colormatrix bt709 --demuxer raw --input-res 7920x6024 --input-csp i444 --input-range tv --input-depth 10 --fps 24/1 --output-csp i444 --output-depth 10 --output "E:\Temp\paloma8kprores.264" -
encoding works fine.

-> seems like nvencc has some problem with the y4m headers when using yuv444p10le
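
A quick way to rule the header itself in or out is to capture the first line ffmpeg writes to the pipe and parse it by hand. The sketch below is only an illustration: `parse_y4m_header` is a hypothetical helper, the sample line is an assumed example of what ffmpeg emits for yuv444p10le at this resolution (it was not captured from the actual run), and it does not reproduce NVEncC's own parser.

```python
def parse_y4m_header(line: bytes) -> dict:
    """Parse the first line of a YUV4MPEG2 stream into a dict of fields."""
    tokens = line.decode("ascii").rstrip("\n").split(" ")
    if tokens[0] != "YUV4MPEG2":
        raise ValueError("not a YUV4MPEG2 stream")
    # Assumption: treat a missing C tag as plain 4:2:0.
    info = {"colorspace": "420"}
    for tok in tokens[1:]:
        if not tok:
            continue
        tag, value = tok[0], tok[1:]
        if tag == "W":
            info["width"] = int(value)
        elif tag == "H":
            info["height"] = int(value)
        elif tag in ("F", "A"):
            # Frame rate and sample aspect ratio are written as num:den.
            num, den = value.split(":")
            info["fps" if tag == "F" else "sar"] = (int(num), int(den))
        elif tag == "I":
            info["interlace"] = value
        elif tag == "C":
            info["colorspace"] = value
        elif tag == "X":
            # Vendor extensions are kept verbatim.
            info.setdefault("extensions", []).append(value)
    return info


# Assumed example header for the 8K clip, not taken from a real capture.
sample = b"YUV4MPEG2 W7920 H6024 F24:1 Ip A1:1 C444p10 XYSCSS=444P10\n"
header = parse_y4m_header(sample)
print(header)
```

If a line like this parses cleanly, the header is probably not what NVEncC is tripping over.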

@Selur

Author

commented Jun 1, 2019

strange thing is using:

ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p10le -strict -1 -vsync 0  -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --sar 1:1 --output-depth 10 --cbrhq 44000 --max-bitrate 240000 --colormatrix bt709  --output "E:\Temp\paloma8kprores_13_36_54_1210_02.265"

encoding seems to start:

Max B frames are 0 frames.
NVEncC (x64) 4.39 (r1094) by rigaya, May 18 2019 15:00:59 (VC 1916/Win/avx2)
OS Version     Windows 10 x64 (17763)
CPU            AMD Ryzen 7 1800X Eight-Core Processor (8C/16T)
GPU            #0: GeForce GTX 1070 Ti (2432 cores, 1683 MHz)[PCIe3x16][430.86]
NVENC / CUDA   NVENC API 9.0, CUDA 10.2, schedule mode: auto
Input Buffers  CUDA, 36 frames
Input Info     y4m(yuv444(10bit))->p010 [-], 7920x6024, 24/1 fps
Vpp Filters    copyHtoD
Output Info    H.265/HEVC main10 @ Level auto
               7920x6024p 1:1 24.000fps (24/1fps)
Encoder Preset default
Rate Control   CBRHQ
Bitrate        44000 kbps (Max: 240000 kbps)
Target Quality auto
Initial QP     I:20  P:23  B:25
VBV buf size   auto
Lookahead      off
GOP length     240 frames
B frames       0 frames [ref mode: disabled]
Ref frames     3 frames, LTR: off
AQ             off
CU max / min   auto / auto
Others         mv:auto
Failed to add frame into the encoder.


encoded 4 frames, 0.17 fps, 36070.46 kbps, 0.72 MB
encode time 0:00:23, CPULoad: 6.3%
frame type IDR 1
frame type I   1,  total size  0.42 MB
frame type P   3,  total size  0.30 MB

but it stops there; the source has 1132 frames.

@Selur

Author

commented Jun 1, 2019

started the last call again, now encoding doesn't stop after 4 frames,...
Taskmgr shows NVEncC at 0.1% CPU usage and 6.3GB memory usage; GPU-Z shows 7860MB memory usage and a Video Engine Load of 0%.
Is this a bug or a limit of my card? (GeForce GTX 1070 Ti, 8GB RAM)

@Selur

Author

commented Jun 1, 2019

will test in ~1hr (running some system intensive test atm.)

@rigaya

Owner

commented Jun 1, 2019

How about removing "--output-depth 10" and "--lookahead 32"? Lookahead and high bit depth both consume a lot of CPU and GPU memory.

@Selur

Author

commented Jun 1, 2019

ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p10le -strict -1 -vsync 0 -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --profile main444 --level auto --tier high --sar 1:1 --lookahead 32 --output-depth 10 --cbrhq 44000 --max-bitrate 240000 --aq --gop-len 600 --strict-gop --bframes 4 --ref 7 --weightp --mv-precision Q-pel --cu-max 32 --cu-min 8 --preset quality --colormatrix bt709 --cuda-schedule sync --output "E:\Temp\paloma8kprores_13_36_54_1210_02.265"

calling:

ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p10le -strict -1 -vsync 0  -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --profile main444 --level auto --tier high --sar 1:1 --lookahead 2 --output-depth 10 --cbrhq 44000 --max-bitrate 240000 --aq --gop-len 600 --strict-gop --bframes 4 --ref 7 --weightp --mv-precision Q-pel --preset quality --colormatrix bt709 --cuda-schedule sync --output "E:\output\test.265"

gave me:

Max B frames are 0 frames.
HEVC encode with weightp is known to be unstable on some environments.
Consider not using weightp with HEVC encode if unstable.
Failed to cuMemAllocPitch, 2 (cudaErrorMemoryAllocation)

and another time:

4m: failed to parse y4m header.
Failed to open input file.

calling: (8bit lookahead2 4:2:0)

ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv420p -vsync 0  -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --profile main --level auto --tier high --sar 1:1 --lookahead 2  --cbrhq 44000 --max-bitrate 240000 --aq --gop-len 600 --strict-gop  --ref 7 --mv-precision Q-pel --preset quality --colormatrix bt709 --cuda-schedule sync --output "E:\output\test.265"

gave me:

Max B frames are 0 frames.
NVEncC (x64) 4.39 (r1094) by rigaya, May 18 2019 15:00:59 (VC 1916/Win/avx2)
OS Version     Windows 10 x64 (17763)
CPU            AMD Ryzen 7 1800X Eight-Core Processor (8C/16T)
GPU            #0: GeForce GTX 1070 Ti (2432 cores, 1683 MHz)[PCIe3x16][430.86]
NVENC / CUDA   NVENC API 9.0, CUDA 10.2, schedule mode: sync
Input Buffers  CUDA, 36 frames
Input Info     y4m(yv12)->nv12 [AVX2], 7920x6024, 24/1 fps
Vpp Filters    copyHtoD
Output Info    H.265/HEVC main @ Level unknown
               7920x6024p 1:1 24.000fps (24/1fps)
Encoder Preset quality
Rate Control   CBRHQ
Bitrate        44000 kbps (Max: 240000 kbps)
Target Quality auto
Initial QP     I:20  P:23  B:25
VBV buf size   auto
Lookahead      on, 2 frames, Adaptive I, B Insert
GOP length     600 frames
B frames       0 frames [ref mode: disabled]
Ref frames     7 frames, LTR: off
AQ             on
CU max / min   auto / auto
Others         mv:Q-pel

the encoding started, GPU ~0-16%, VE 0-25%, speed ~2fps, so the GPU and video engine seem to be mostly idling. (CPU usage 5.5%, 4.6GB RAM)
The speed of the ffmpeg decoding itself:

ffmpeg -y -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv420p -vsync 0  -f yuv4mpegpipe - 

is 14fps (uses 1.6GB RAM, 39% cpu). (speed when using yuv444p10le, is 8.5fps with 2.2GB RAM usage)
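
To put those decode speeds in perspective, here is a rough estimate of how much raw data has to go through the pipe. It is a back-of-the-envelope sketch: `raw_frame_bytes` is a hypothetical helper, the small per-frame y4m marker is ignored, and 10-bit samples are assumed to occupy 2 bytes each.

```python
def raw_frame_bytes(width, height, chroma, bytes_per_sample):
    """Size of one uncompressed planar frame in bytes."""
    # Samples per pixel summed over the Y, U and V planes.
    samples_per_pixel = {"420": 1.5, "422": 2.0, "444": 3.0}[chroma]
    return int(width * height * samples_per_pixel * bytes_per_sample)


WIDTH, HEIGHT = 7920, 6024

# (pix_fmt, chroma layout, bytes per sample, measured ffmpeg decode fps)
cases = [
    ("yuv420p", "420", 1, 14.0),
    ("yuv444p10le", "444", 2, 8.5),
]

rates = {}
for name, chroma, bytes_per_sample, fps in cases:
    frame = raw_frame_bytes(WIDTH, HEIGHT, chroma, bytes_per_sample)
    rates[name] = frame * fps
    print(f"{name}: {frame / 1e6:.1f} MB/frame, "
          f"{rates[name] / 1e9:.2f} GB/s at {fps} fps")
```

That works out to roughly 1.0 GB/s for yuv420p and about 2.4 GB/s for yuv444p10le, so the decoder alone caps the achievable speed at those frame rates, although it does not explain the ~2fps seen with NVEncC.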

calling: (8bit, no lookahead)

ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv420p -vsync 0  -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --profile main --level auto --tier high --sar 1:1 --lookahead 0  --cbrhq 44000 --max-bitrate 240000 --aq --gop-len 600 --strict-gop  --ref 7 --mv-precision Q-pel --preset quality --colormatrix bt709 --cuda-schedule sync --output "E:\output\test.265"

gave me no further speed improvement.

calling: (8bit lookahead2 4:4:4)

ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p -vsync 0 -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --profile main444 --level auto --tier high --sar 1:1 --lookahead 2  --cbrhq 44000 --max-bitrate 240000 --aq --gop-len 600 --strict-gop  --ref 7 --mv-precision Q-pel --preset quality --colormatrix bt709 --cuda-schedule sync --output "E:\output\test.265"

gave me:

Max B frames are 0 frames.
Failed to cuMemAllocPitch, 2 (cudaErrorMemoryAllocation)

and

4m: failed to parse y4m header.
Failed to open input file.

same when using:

I:\Hybrid\64bit>ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "C:\Users\Selur\Desktop\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p -vsync 0 -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --profile main444 --sar 1:1 --lookahead 0  --cbrhq 44000 --max-bitrate 240000 --colormatrix bt709 --output "E:\output\test.265"

Testing with 4k content:
(10bit, no lookahead, 4:4:4)

ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "F:\TestClips&Co\files\MPEG-4 H.264\4k\4k_sample_4096x2160.mp4" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p10le -strict -1 -vsync 0  -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 25.000 --codec h265 --profile main444 --level auto --tier high --sar 1:1 --lookahead 0 --output-depth 10 --cbrhq 1500 --max-bitrate 240000 --preset default --colormatrix bt709 --cuda-schedule sync --output "E:\Temp\4k_sample_4096x2160_15_42_40_7310_01.265"

gave me:

4m: failed to parse y4m header.
Failed to open input file.

(8bit, no lookahead, 4:4:4)

ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "F:\TestClips&Co\files\MPEG-4 H.264\4k\4k_sample_4096x2160.mp4" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p -vsync 0  -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 25.000 --codec h265 --profile main444 --level auto --tier high --sar 1:1 --lookahead 0 --cbrhq 1500 --max-bitrate 240000 --preset default --colormatrix bt709 --cuda-schedule sync --output "E:\Temp\4k_sample_4096x2160_15_42_40_7310_01.265"

gave me:

Max B frames are 0 frames.
NVEncC (x64) 4.39 (r1094) by rigaya, May 18 2019 15:00:59 (VC 1916/Win/avx2)
OS Version     Windows 10 x64 (17763)
CPU            AMD Ryzen 7 1800X Eight-Core Processor (8C/16T)
GPU            #0: GeForce GTX 1070 Ti (2432 cores, 1683 MHz)[PCIe3x16][430.86]
NVENC / CUDA   NVENC API 9.0, CUDA 10.2, schedule mode: sync
Input Buffers  CUDA, 36 frames
Input Info     y4m(yuv444)->yuv444 [AVX2], 4096x2160, 25/1 fps
Vpp Filters    copyHtoD
Output Info    H.265/HEVC main444 @ Level unknown
               4096x2160p 1:1 25.000fps (25/1fps)
Encoder Preset default
Rate Control   CBRHQ
Bitrate        1500 kbps (Max: 240000 kbps)
Target Quality auto
Initial QP     I:20  P:23  B:25
VBV buf size   auto
Lookahead      off
GOP length     250 frames
B frames       0 frames [ref mode: disabled]
Ref frames     3 frames, LTR: off
AQ             off
CU max / min   auto / auto
Others         mv:auto

~12fps, GPU ~4-12%, VE 10-15%.

so far my conclusions are:

a. 4:4:4 y4m handling seems to be broken for 10bit and 8bit when using 8k
b. 4:4:4 y4m handling seems to be broken for 10bit and 8bit when using 4k
c. you are correct: lookahead and 10bit encoding eat a lot of memory

@rigaya

Owner

commented Jun 1, 2019

I've tested this myself, and 4:4:4 y4m handling seems to work fine.

Below is the result for a 4K 4:4:4 10bit encode; the output was fine. 4:4:4 8bit was also fine. (The system has 16GB RAM.)

x64\ffmpeg.exe -y -loglevel fatal -threads 8 -i "Y:\QSVTest\The_World_in_HDR_in_4K_HDR10.mkv" -an -pix_fmt yuv444p10le -strict -1 -f yuv4mpegpipe - | x64\NVEncC64.exe --y4m -i - -o F:\temp\test.mp4 -c hevc --output-depth 10 --profile main444 --vbrhq 0 --vbr-quality 25 --lookahead 32

NVEncC (x64) 4.41 (r1108) by rigaya, May 25 2019 09:49:12 (VC 1921/Win/avx2)
OS Version     Windows 10 x64 (17134)
CPU            Intel Core i9-7980XE @ 2.60GHz [TB: 4.11GHz] (18C/36T)
GPU            #0: GeForce RTX 2070 (2304 cores, 1710 MHz)[PCIe3x16][419.67]
NVENC / CUDA   NVENC API 9.0, CUDA 10.1, schedule mode: auto
Input Buffers  CUDA, 44 frames
Input Info     y4m(yuv444(10bit))->yuv444(16bit) [AVX2], 3840x2160, 19001/317 fps
Vpp Filters    copyHtoD
Output Info    H.265/HEVC main444 10bit @ Level auto
               3840x2160p 1:1 59.940fps (19001/317fps)
               avwriter: hevc => mp4
Encoder Preset default
Rate Control   VBRHQ
Bitrate        0 kbps (Max: 38400 kbps)
Target Quality 25.00
Initial QP     I:20  P:23  B:25
VBV buf size   auto
Lookahead      on, 32 frames, Adaptive I, B Insert
GOP length     600 frames
B frames       3 frames [ref mode: disabled]
Ref frames     3 frames, LTR: off
AQ             off
CU max / min   auto / auto
Others         mv:auto


encoded 1274 frames, 11.21 fps, 16144.31 kbps, 40.91 MB
encode time 0:01:53, CPU: 2.7%, GPU: 5.5%, VE: 22.8%, GPUClock: 1410MHz, VEClock: 1305MHz
frame type IDR   4
frame type I     4,  avgQP  19.75,  total size   1.28 MB
frame type P   336,  avgQP  19.49,  total size  26.34 MB
frame type B   934,  avgQP  22.50,  total size  13.29 MB

8K 10bit 4:2:0
x64\ffmpeg.exe -y -loglevel fatal -threads 16 -i "Y:\QSVTest\Japan in 8K.mkv" -an -pix_fmt yuv444p10le -strict -1 -f yuv4mpegpipe - | x64\NVEncC64.exe --y4m -i - -o F:\temp\test.mp4 -c hevc

NVEncC (x64) 4.41 (r1108) by rigaya, May 25 2019 09:49:12 (VC 1921/Win/avx2)
OS Version     Windows 10 x64 (17134)
CPU            Intel Core i9-7980XE @ 2.60GHz [TB: 4.11GHz] (18C/36T)
GPU            #0: GeForce RTX 2070 (2304 cores, 1710 MHz)[PCIe3x16][419.67]
NVENC / CUDA   NVENC API 9.0, CUDA 10.1, schedule mode: auto
Input Buffers  CUDA, 36 frames
Input Info     y4m(yuv444(10bit))->nv12 [-], 7680x4320, 24000/1001 fps
Vpp Filters    copyHtoD
Output Info    H.265/HEVC main @ Level auto
               7680x4320p 1:1 23.976fps (24000/1001fps)
               avwriter: hevc => mp4
Encoder Preset default
Rate Control   CQP  I:20  P:23  B:25
Lookahead      off
GOP length     240 frames
B frames       3 frames [ref mode: disabled]
Ref frames     3 frames, LTR: off
AQ             off
CU max / min   auto / auto
Others         mv:auto


encoded 76 frames, 0.51 fps, 19689.90 kbps, 7.44 MB
encode time 0:02:30, CPU: 2.8%, GPU: 0.8%, VE: 4.6%, GPUClock: 1511MHz, VEClock: 1399MHz
frame type IDR  1
frame type I    1,  avgQP  20.00,  total size  0.01 MB
frame type P   19,  avgQP  23.00,  total size  3.89 MB
frame type B   56,  avgQP  25.00,  total size  3.54 MB

8K 10bit 4:4:4
-> out of memory

Running out of memory does not always produce the same error: I got "cudaErrorMemoryAllocation" as well as "failed to parse y4m header".

I don't think y4m 4:4:4 handling is broken as long as you are not out of memory.
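
A rough estimate shows why memory is the likely limit here. This is a sketch under stated assumptions, not a description of how NVEncC allocates memory: it assumes the input buffers hold frames in the converted format shown under "Input Info", that 36 buffers are used (the count printed in the logs above), and it ignores pitch padding and whatever the encoder allocates internally. `buffer_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3


def buffer_bytes(width, height, samples_per_pixel, bytes_per_sample, frames):
    """Lower bound on memory needed for `frames` uncompressed frames."""
    frame = int(width * height * samples_per_pixel * bytes_per_sample)
    return frame * frames


# 7920x6024 source, 36 input buffers (assumed, see above).
# 4:4:4 kept as 16-bit samples: 3 samples per pixel, 2 bytes each.
est_444_16bit = buffer_bytes(7920, 6024, 3.0, 2, 36)
# p010 (4:2:0, 16-bit samples): 1.5 samples per pixel, 2 bytes each.
est_p010 = buffer_bytes(7920, 6024, 1.5, 2, 36)

print(f"yuv444(16bit): {est_444_16bit / GIB:.2f} GiB")
print(f"p010:          {est_p010 / GIB:.2f} GiB")
```

Under these assumptions the 4:4:4 16-bit buffers alone come to roughly 9.6 GiB, more than an 8GB card has, while p010 needs about 4.8 GiB before the encoder's own allocations are counted.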

@Selur

Author

commented Jun 1, 2019

System has 32GB RAM and the card has 8GB Ram, will do some more testing tomorrow and report back in case I find something new.

@Selur

Author

commented Jun 2, 2019

I can reproduce the results you got with your command line.
So seems like 8K 10bit 4:4:4 is just not meant to be with 8GB cards.

rigaya added a commit that referenced this issue Jun 5, 2019
@rigaya

Owner

commented Jun 5, 2019

I tried to reduce GPU memory usage in NVEnc 4.42 when lookahead is off.

It should now be able to encode 8K 10bit 4:4:4 in 8GB of GPU memory (but very slowly).

NVEncC (x64) 4.42 (r1121) by rigaya, Jun  4 2019 20:08:17 (VC 1921/Win/avx2)
OS Version     Windows 10 x64 (17134)
CPU            Intel Core i9-7980XE @ 2.60GHz [TB: 4.11GHz] (18C/36T)
GPU            #0: GeForce RTX 2070 (2304 cores, 1710 MHz)[PCIe3x16][419.67]
NVENC / CUDA   NVENC API 9.0, CUDA 10.1, schedule mode: auto
Input Buffers  CUDA, 12 frames
Input Info     y4m(yuv444(10bit))->yuv444(16bit) [AVX2], 7680x4320, 24000/1001 fps
Vpp Filters    copyHtoD
Output Info    H.265/HEVC main444 10bit @ Level auto
               7680x4320p 1:1 23.976fps (24000/1001fps)
               avwriter: hevc => mp4
Encoder Preset default
Rate Control   VBRHQ
Bitrate        0 kbps (Max: 57600 kbps)
Target Quality 25.00
Initial QP     I:20  P:23  B:25
VBV buf size   auto
Lookahead      off
GOP length     240 frames
B frames       3 frames [ref mode: disabled]
Ref frames     3 frames, LTR: off
AQ             off
CU max / min   auto / auto
Others         mv:auto


encoded 995 frames, 0.50 fps, 46363.95 kbps, 229.37 MB
encode time 0:32:52, CPU: 2.8%, GPU: 1.3%, VE: 5.0%, GPUClock: 1528MHz, VEClock: 1416MHz
frame type IDR   5
frame type I     5,  avgQP  19.80,  total size    6.72 MB
frame type P   249,  avgQP  20.18,  total size  133.10 MB
frame type B   741,  avgQP  23.98,  total size   89.54 MB
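
The effect of the smaller buffer count can be read straight off the log. The snippet below is a hedged sketch: `estimate_from_log` and the `BYTES_PER_PIXEL` table are hypothetical, the table only covers the formats named in the logs of this thread, and the result is a lower bound that ignores padding and the encoder's internal allocations.

```python
import re

# The two relevant lines, copied from the 4.42 log above.
LOG = """\
Input Buffers  CUDA, 12 frames
Input Info     y4m(yuv444(10bit))->yuv444(16bit) [AVX2], 7680x4320, 24000/1001 fps
"""

# Assumed bytes per pixel for the converted formats seen in this thread.
BYTES_PER_PIXEL = {"nv12": 1.5, "p010": 3.0, "yuv444": 3.0, "yuv444(16bit)": 6.0}


def estimate_from_log(log: str) -> int:
    """Estimate input buffer memory (bytes) from an NVEncC log excerpt."""
    frames = int(re.search(r"Input Buffers\s+\w+, (\d+) frames", log).group(1))
    info = re.search(r"Input Info\s+\S+->(\S+) .*?(\d+)x(\d+),", log)
    fmt, width, height = info.group(1), int(info.group(2)), int(info.group(3))
    return int(width * height * BYTES_PER_PIXEL[fmt]) * frames


estimate = estimate_from_log(LOG)
print(f"{estimate / 1024 ** 3:.2f} GiB for input buffers")
```

With 12 buffers this gives about 2.2 GiB; the same frame size with 36 buffers would be three times that, about 6.7 GiB, which leaves very little room on an 8GB card.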
@Selur

Author

commented Jun 5, 2019

could you also add a windows binary to the release of 4.42?

@rigaya

Owner

commented Jun 5, 2019

Please wait a while; the automatic build is running on AppVeyor now...

Or please download from the link.

@Selur

Author

commented Jun 5, 2019

Hurray, I can confirm 8k 10bit 4:4:4 encoding works when lookahead is set to 0. 👍 👍
I used:
ffmpeg -y -loglevel fatal -noautorotate -threads 8 -i "F:\paloma8kprores.mov" -map 0:0 -an -sn -vf zscale=rangein=tv:range=tv -pix_fmt yuv444p10le -strict -1 -vsync 0 -f yuv4mpegpipe - | NVEncC --y4m -i - --fps 24.000 --codec h265 --sar 1:1 --lookahead 0 --output-depth 10 --cbrhq 44000 --max-bitrate 240000 --colormatrix bt709 --output "E:\Temp\paloma8kprores_13_36_54_1210_02.265"

@rigaya rigaya closed this Sep 16, 2019
