v1.2.0

@Venti- released this Nov 17, 2017

We are happy to release Kvazaar version 1.2. Since the last version, Kvazaar has obtained significant speedups at all presets and the compression efficiency has improved for the fastest presets. Please find the complete list of changes below.

Features

  • Intra prediction mode encryption with --crypto=intra_pred_modes (2b8ce5e)
  • Adaptive QP for 360° video with --erp-aqp (26adef4)
  • New selection algorithm for --owf=auto and --threads=auto (8c4a347)
  • Added an option to set the encryption key using --key (2e13091)
  • Added an option to limit SAO to band offset or edge offset only with --sao=band and --sao=edge (8674c0f)

Optimization

  • Reduced number of intra modes checked when using --rd=2 (2cad317)
  • Reduced inter-frame CTU dependencies caused by SAO (050e90d)
  • Changed to a faster calculation for coefficient costs when using --rd=0 (1ead9c0)

Fixes

  • Fixed long motion vectors not getting clipped (#158, 85e2a40)
  • Fixed order of pictures in reconstruction debug output when --gop=8 is used (#101, aae141f)
  • Fixed a use-after-free when encoding very few frames with --gop=8 (#161, 2991962)
  • Fixed a crash when video size is not a multiple of the smallest CU size (2f2405d)
  • Fixed invalid bitstream when QP is too large (382636d)
  • Fixed a race condition causing a deadlock (5f8e17d)
  • Fixed a memory leak in encryption (8654b48)
  • Fixed I-frames not being IRAP frames when using GOP (00c9f52, 841597e)
  • Fixed computing inter and intra costs with different metrics (afc13f1)
  • Fixed reliance on undefined behavior (b41f0fa, 924cf85)
  • Fixed --mv-constraint=frametilemargin constraining motion vectors too much (409d211)
  • Fixed using --bipred with --tmvp (#160, 9974380)
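
A frame whose width or height is not a multiple of the smallest CU size has to be handled somehow, for example by padding the coded size up to the next multiple. A sketch of that rounding, assuming an 8-pixel minimum CU. `round_up_to_multiple` is a hypothetical helper, and this is not necessarily how commit 2f2405d addressed the crash:

```c
#include <assert.h>

/* Hypothetical helper: round a frame dimension up to the next multiple
 * of `unit` (for example an assumed minimum CU size of 8 pixels). */
static int round_up_to_multiple(int size, int unit)
{
  assert(size > 0 && unit > 0);
  return ((size + unit - 1) / unit) * unit;
}
```

With these assumptions, a height of 1082 would be coded as 1088, while 1080 stays unchanged.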

User Interface

  • Changed type of kvz_config.roi.dqps from uint8_t* to int8_t*. Delta QP values for --roi may now be negative. (79cb3a2)
  • Changed PSNR display format (20d6444)
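
The signed type matters because a delta QP is an offset from the base QP. A minimal sketch of how a signed offset might be applied, assuming the usual 0 to 51 QP range of 8-bit HEVC. `apply_delta_qp` is a hypothetical helper, not part of the Kvazaar API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a Kvazaar function): apply a signed delta QP
 * to a base QP and clamp the result to the 0..51 range assumed here. */
static int apply_delta_qp(int base_qp, int8_t dqp)
{
  assert(base_qp >= 0 && base_qp <= 51);
  int qp = base_qp + dqp;
  if (qp < 0) qp = 0;
  if (qp > 51) qp = 51;
  return qp;
}
```

An unsigned 8-bit element cannot hold an offset such as -5 as-is: (uint8_t)-5 is 251.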

Building

  • Default to no -Werror. Run configure with --enable-werror to enable it. (033bc6b)
  • make check now runs valgrind tests that used to only run on Travis. The programs ffmpeg, valgrind and TAppDecoderStatic must be available on $PATH. (6bbe5e1)

Refactoring

  • Removed duplicate code in inter MVP and merge candidate selection (4fb0783)
  • Removed duplicate code in intra reconstruction for luma and chroma (e944416)
  • Changed functions for writing the CU tree bitstream to use luma pixel coordinates (610c91b, f5eef7f)
  • Removed duplicate code in functions for writing intra CU bitstream with and without encryption (525a518)
  • Removed duplicate code in helper functions in search.c (2c73476)
  • Gathered function parameters for inter search functions into a single struct (2fa3d82)

BD-Bitrate

Average BD-Bitrate compared with v1.1:

Class 0-uf 1-sf 2-vf 3-fr 4-f 5-m 6-s 7-sr 8-vs
hevc-A -15.71 % -6.68 % -4.66 % -0.89 % -1.11 % -0.54 % +0.04 % -0.02 % +0.32 %
hevc-B -19.04 % -8.15 % -6.92 % -1.26 % -1.48 % -0.65 % -0.33 % -0.33 % -0.07 %
hevc-C -20.39 % -8.54 % -5.01 % -0.55 % -0.72 % -0.44 % +0.03 % -0.00 % +0.23 %
hevc-D -13.24 % -5.15 % -2.54 % -0.33 % -0.51 % -0.32 % -0.10 % -0.04 % +0.13 %
hevc-E -4.37 % -3.31 % -1.90 % -0.52 % -1.10 % -0.68 % -0.74 % -0.86 % -0.78 %
hevc-F -12.42 % -6.15 % -5.25 % +0.04 % +0.24 % +0.25 % +0.32 % +0.70 % +0.90 %
Total -14.80 % -6.59 % -4.68 % -0.60 % -0.78 % -0.39 % -0.13 % -0.08 % +0.14 %

Speedup

Average speedup compared with v1.1 on an Intel Core i7-4770 machine:

Class 0-uf 1-sf 2-vf 3-fr 4-f 5-m 6-s 7-sr 8-vs
hevc-A x1.07 x1.06 x1.10 x1.10 x1.09 x1.11 x1.10 x1.11 x1.10
hevc-B x1.06 x1.07 x1.09 x1.11 x1.09 x1.13 x1.12 x1.13 x1.14
hevc-C x1.22 x1.27 x1.32 x1.35 x1.33 x1.37 x1.39 x1.42 x1.41
hevc-D x1.34 x1.58 x1.64 x1.60 x1.58 x1.54 x1.55 x1.57 x1.54
hevc-E x1.20 x1.17 x1.16 x1.18 x1.16 x1.17 x1.16 x1.20 x1.19
hevc-F x1.25 x1.20 x1.20 x1.23 x1.20 x1.24 x1.24 x1.27 x1.27
Total x1.19 x1.22 x1.24 x1.26 x1.24 x1.26 x1.26 x1.28 x1.28

v1.1.0

@Venti- released this Feb 16, 2017

I think there are enough new features to call this v1.1.0.

Average BD bitrate (QP 17, 22, 27, 32) v1.1.0 vs v1.0.0

Class 0-uf 1-sf 2-vf 3-fr 4-f 5-m 6-s 7-sr 8-vs
A -1.9% -1.5% -1.2% -1.0% -0.7% -0.7% -1.1% -1.1% -1.3%
B -2.2% -1.3% -1.4% -0.8% -0.5% -0.6% -0.9% -0.8% -1.0%
C -1.3% -1.1% -1.0% -0.9% -0.6% -0.6% -0.6% -0.6% -0.8%
D -1.1% -0.9% -0.7% -0.7% -0.6% -0.5% -0.1% -0.2% -0.4%
E -2.5% -1.5% -0.9% -0.7% -0.3% -0.3% -0.5% -0.4% -0.4%
F -1.6% -0.7% -0.9% -0.8% -0.5% -0.4% -0.7% -0.7% -0.7%
All -1.7% -1.2% -1.0% -0.8% -0.5% -0.5% -0.6% -0.6% -0.8%

Average speedup (QP 17, 22, 27, 32) v1.1.0 vs v1.0.0

Class 0-uf 1-sf 2-vf 3-fr 4-f 5-m 6-s 7-sr 8-vs
A 1.03x 1.02x 1.01x 1.01x 1.01x 1.02x 1.07x 1.09x 1.16x
B 1.03x 1.01x 1.02x 1.01x 1.01x 1.01x 1.06x 1.06x 1.13x
C 1.03x 1.02x 1.01x 1.01x 1.01x 1.01x 1.07x 1.07x 1.16x
D 1.07x 1.05x 1.03x 1.03x 1.02x 1.02x 1.07x 1.09x 1.17x
E 0.99x 0.97x 0.98x 0.99x 0.99x 0.99x 1.00x 1.02x 1.05x
F 1.02x 1.01x 1.01x 1.01x 1.01x 1.01x 1.08x 1.09x 1.17x
All 1.03x 1.02x 1.01x 1.01x 1.01x 1.01x 1.06x 1.07x 1.14x

Parameters: --threads=4 --owf=1 -p64

Features

  • Bitrate control now works at LCU level, giving more consistent results. (2318bd7)
  • Added --roi parameter for LCU level delta-QP control. (4a0121a)
  • Added --slices parameter for encapsulating tiles and WPP rows into slice NAL units instead of using bitstream offsets. (1e6463c)
  • Temporal motion vector prediction now works with B-frames. (d892be5)

Optimization

  • Added AVX2 version of SSD. (778e46d)
  • Optimized intra reference building. (c31207e)
  • Optimized bitstream writes. (a9e45ef)
  • Optimized CU-split decision. (2c069a3)
  • Fix main-thread busy-looping on Linux. (a5a925f)
  • Avoid initializing memory needlessly during RDOQ. (acd12cb, b021d22)
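
For reference, SSD (sum of squared differences) is the distortion measure that the AVX2 routine speeds up. A plain scalar sketch of the same quantity, with hypothetical names and an assumed 8-bit pixel type. The vectorized Kvazaar code is not reproduced here:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: sum of squared differences between two blocks that
 * share the same stride. Names and pixel type are assumptions. */
static uint64_t ssd_scalar(const uint8_t *a, const uint8_t *b,
                           int width, int height, int stride)
{
  assert(width > 0 && height > 0 && stride >= width);
  uint64_t sum = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      int d = a[y * stride + x] - b[y * stride + x];
      sum += (uint64_t)(d * d);
    }
  }
  return sum;
}
```

A SIMD version computes the same sum but handles many pixels per instruction, which is where the speedup comes from.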

Fixes

  • Pass DTS and PTS timestamps correctly through the API. (d18de19)
  • Fixed bug with subpixel motion estimation within tiles. (2c005cd)
  • Improved 10-bit RD-performance. (70a52f0)
  • Fixed stupendously large bitstreams when --mv-constraint was used with --subme. (937a764)
  • Fixed bug with --smp and --amp. (46c9a48)
  • Fix problem with --bipred. (1e6463c)
  • Fixed hang with threading on OSX. (d893474)
  • Fix crash when frame is less than 65 pixels high and WPP is used. (b8e3513)

User Interface

  • Disabled WPP with tiles enabled. (cb6672b)
  • Improved --help. (5bf7454, 78a28e0)
  • Made it possible to disable the GOP structure that was enabled by default in v1.0.0. (deb63f7)
  • Have --threads=auto enable threading instead of disabling it. (db5e750)
  • Give errors on failures and handle them better. (97863cd, 6a178de)
  • Use reference picture number of medium preset by default. (7ff33e1)

Building

  • Include optimizations on 32-bit. (1dcc993)
  • Added appveyor CI tests for MSYS2. (e269b86)
  • Add pkg-config macros, so pkg-config doesn't need to be installed anymore. (2d7daa1)
  • Travis CI OSX tests work again. (c32f5fa)

Refactoring

  • Refactored deblocking and sign hiding. (7ec5f78)
  • Removed Exp-Golomb lookup table. (ed3bd89)
  • Copy kvz_config to encoder_control_t and remove duplicate fields. (e78a8df)
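
The length of an Exp-Golomb code can be computed directly instead of being read from a table. A sketch for the order-0 unsigned code follows. `exp_golomb_len` is a hypothetical name, and whether ed3bd89 replaced the table with a computation like this one is an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Length in bits of the order-0 unsigned Exp-Golomb code for `value`:
 * 2 * floor(log2(value + 1)) + 1. Hypothetical helper. */
static unsigned exp_golomb_len(uint32_t value)
{
  assert(value < UINT32_MAX);   /* value + 1 must not wrap around */
  uint32_t v = value + 1;
  unsigned log2_floor = 0;
  while (v > 1) {
    v >>= 1;
    ++log2_floor;
  }
  return 2 * log2_floor + 1;
}
```

For example, the value 0 takes 1 bit, values 1 and 2 take 3 bits, and values 3 to 6 take 5 bits.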

v1.0.0

@Venti- released this Oct 4, 2016

It's been 9 months since the last release. Now that the encoder just got 10x faster (on veryslow), and quite a bit faster and better on every other preset as well, I think it's time for a major version bump.

Average BD bitrate (QP 17, 22, 27, 32) v1.0.0 vs v0.8.3

Class 0-uf 1-sf 2-vf 3-fr 4-f 5-m 6-s 7-sr 8-vs
A -16.4% -26.9% -27.5% -31.0% -11.2% -11.9% -11.3% -6.7% -4.8%
B -16.2% -33.7% -31.7% -37.6% -11.6% -14.8% -15.7% -9.1% -6.3%
C -7.0% -17.6% -28.0% -31.2% -8.3% -9.0% -11.3% -7.1% -8.1%
D -3.7% -12.3% -29.2% -30.3% -5.4% -5.9% -11.5% -8.3% -9.9%
E -28.4% -42.6% -33.5% -39.4% -22.6% -28.5% -20.3% -7.0% -0.7%
F -6.1% -11.3% -12.8% -16.5% -10.1% -2.1% 2.3% 10.8% 6.4%
All -13.0% -24.1% -27.1% -31.0% -11.5% -12.0% -11.3% -4.6% -3.9%

Average speedup (QP 17, 22, 27, 32) v1.0.0 vs v0.8.3

Class 0-uf 1-sf 2-vf 3-fr 4-f 5-m 6-s 7-sr 8-vs
A 1.61x 1.91x 1.89x 1.37x 2.69x 3.33x 4.79x 7.32x 11.06x
B 1.65x 1.98x 1.96x 1.46x 2.67x 3.36x 4.79x 8.15x 13.89x
C 1.76x 1.97x 1.98x 1.45x 2.52x 2.97x 4.87x 9.32x 15.77x
D 2.09x 1.87x 1.81x 1.32x 1.97x 2.36x 5.13x 8.78x 12.65x
E 1.91x 1.96x 1.75x 1.40x 3.00x 3.70x 4.87x 6.06x 7.56x
F 1.84x 1.83x 1.74x 1.41x 2.86x 2.98x 4.60x 8.18x 13.58x
All 1.81x 1.92x 1.86x 1.40x 2.62x 3.12x 4.84x 7.97x 12.42x

Parameters: --threads=4 --owf=1 --wpp -p64

New Features

  • --version
  • --help
  • --loop-input
  • --mv-constraint to constrain motion vectors
  • --tiles=2x2 as an alternative syntax for uniform tiles
  • --hash=md5
  • Print information about what SIMD optimizations are in use
  • --mv=full8 --mv=full16 --mv=full32 --mv=full64
  • --cu-split-termination=zero/off
  • --crypto for selective encryption of bitstream (for OpenHEVC)
  • --me-early-termination=sensitive/on/off for early termination of motion vector search
  • Added 4x8 SMP and 4x12 AMP motion partitions
  • --subme=0/1/2/3/4 for control over complexity of fractional pixel motion prediction
  • --lossless for lossless coding
  • Monochrome coding
  • --input-format=420/400
  • --input-bitdepth=8/10
  • --tmvp for temporal motion vector predictor
  • --rdoq-skip to skip RDOQ in situations where it is unlikely to improve BD-rate
  • Modified --gop=lp-g4d3r1t1 syntax to not take the reference frames as a parameter; it is now --gop=lp-g4d3t1.
  • Enable WPP and multithreading by default, with detection for number of cores
  • Update all presets to rate-distortion-complexity optimized versions. These are based on a search of all (~ish) possible encoding parameters and bring a huge boost to both speed and BD-rate when encoding with the presets (10x speed for veryslow, ~1.1x-4x for others, up to 30% improved BD-rate for some presets).
  • Set default options to match medium with intra period of 64, QP 22 and --gop=lp-g4d3t1
  • --implicit-rdpcm RExt feature
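
Uniform tiles such as --tiles=2x2 split the picture into near-equal groups of CTUs. A sketch of the integer arithmetic that HEVC uses for uniformly spaced tiles. The function name is hypothetical, and the 64x64 CTU size in the example is an assumption:

```c
#include <assert.h>

/* Size in CTUs of tile column (or row) `i` when `pic_size_in_ctus` CTUs
 * are split uniformly into `num_tiles` columns (or rows). */
static int uniform_tile_size(int pic_size_in_ctus, int num_tiles, int i)
{
  assert(num_tiles > 0 && i >= 0 && i < num_tiles);
  return ((i + 1) * pic_size_in_ctus) / num_tiles
       - (i * pic_size_in_ctus) / num_tiles;
}
```

For a 1920x1080 picture with 64x64 CTUs (30x17 CTUs), a 2x2 split gives tile columns of 15 and 15 CTUs and tile rows of 8 and 9 CTUs.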

Optimizations

  • AVX2 version for Sample Adaptive Offset (SAO)
  • Optimized memory copying
  • AVX2 versions of filters for fractional pixel motion estimation
  • AVX2 version for half pixel chroma sampling for SMP/AMP
  • AVX2 versions for calculating two or four SATD values at once for small blocks
  • Rewrote AVX2 version of fractional pixel motion compensation
  • Rewrote motion vector cost calculation. It only got slightly faster, but BDRate improved a bunch due to the new implementation being more correct.
  • Made AVX2 SAD use SSE4.1 for cases where there isn't an AVX2 implementation, speeding up SMP/AMP.

Bugfixes

  • Fixed a bug in rate control where an int overflowed after coding 2^31 bits (2Gb)
  • Fixed non-determinism in tiles
  • Fixed chroma reconstruction bug in tiles
  • Fixed a bug with calculating the number of bits used for intra mode on 4x4 CUs
  • Stopped checking zero motion vector multiple times in motion compensation
  • Fixed possible segfault in motion compensation
  • Fixed a race condition with OWF and SMP/AMP
  • Passed the timeout to pthread_cond_timedwait correctly, so that the main thread now sleeps instead of busy-looping when it has nothing to do
  • Fixed rate control with lp-gop
  • Fixed full search not taking temporal motion vector into account
  • Allow non-gop-length intra period for lp-gop
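
The rate control overflow is easy to hit in practice: 2^31 bits is not much video at typical bitrates. A sketch of a wider counter and of the time it takes to reach the limit. Both helpers are hypothetical, and using a 64-bit total is an assumption about the fix, not a description of it:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical running total of coded bits, kept in 64 bits so it has
 * room far beyond the 2^31 limit of a signed 32-bit int. */
static uint64_t add_coded_bits(uint64_t total_bits, uint32_t frame_bits)
{
  assert(total_bits <= UINT64_MAX - frame_bits);
  return total_bits + frame_bits;
}

/* Seconds of output at a constant bitrate before 2^31 bits are coded. */
static double seconds_until_2_31_bits(double bits_per_second)
{
  assert(bits_per_second > 0.0);
  return 2147483648.0 / bits_per_second;
}
```

At 10 Mbit/s, for instance, the 2^31 mark is reached after roughly 215 seconds of output.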

Code / Building / Testing

  • Moved SAO to its own file
  • Removed a ton of unnecessary includes
  • Updated autotools ax_pthread
  • Added build test for OS-X for Travis
  • Made tests check for bitstream correctness
  • Refactored some of the copypasta in motion vector search starting point selection
  • Refactored the cu_info_t datastructures to hold information at a 4x4 resolution needed for AMP and SMP
  • Changed cu_info_t to use bitfields to negate the effect of increasing the cu_info_t array by a factor of 4
  • Moved bitstream generation from encoderstate.c to encode_coding_tree.c
  • Renamed encoder_state_t.global to frame, which makes sense since it holds frame-level data, not global data
  • Rewrote integer vector inter prediction, because it was so bad
  • Refactored init_lcu_t
  • Added more tests for inter SAD
  • Added speed tests for dual intra SAD functions
  • Added more realistic speed tests for inter SAD
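
Storing CU information at 4x4 resolution means four times as many entries, and bitfields are one way to keep the array from growing by the same factor. A sketch with made-up fields. The real cu_info_t layout is not reproduced here, and struct sizes are implementation-defined:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CU record with one full unsigned int per field. */
struct cu_plain {
  unsigned int type;
  unsigned int depth;
  unsigned int part_size;
  unsigned int skipped;
};

/* The same hypothetical fields packed into bitfields. */
struct cu_packed {
  unsigned int type      : 2;
  unsigned int depth     : 3;
  unsigned int part_size : 3;
  unsigned int skipped   : 1;
};

/* Bytes needed for a map of `count` entries of `entry_size` bytes. */
static size_t map_bytes(size_t count, size_t entry_size)
{
  assert(entry_size == 0 || count <= (size_t)-1 / entry_size);
  return count * entry_size;
}
```

On common ABIs the packed struct takes 4 bytes against 16 for the plain one, so a map with four times as many packed entries needs no more memory than the original.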

Other

  • Added a manpage
  • Added scripts for updating manpage and README based on --usage.
  • Added a Dockerfile. Just because.
  • Added commit date to --version

v0.8.3

@Venti- released this Jul 13, 2016

Change version to 0.8.3

v0.8.2

@Venti- released this Jan 15, 2016

This release is to fix wrong library version numbering.

v0.7.2

@Arizer released this Oct 30, 2015

Added more AVX2 optimizations for:

  • Angular prediction
  • (De)quantization

v0.7.1

@aryla released this Oct 23, 2015

API changes

  • New function encoder_headers for obtaining the VPS, SPS and PPS separately from the rest of the bitstream

Speedups

  • AVX2 optimizations for angular prediction
  • AVX2 optimizations for quantization
  • AVX2 optimizations for SATD