-
Notifications
You must be signed in to change notification settings - Fork 67
/
spec.tex
8195 lines (7479 loc) · 321 KB
/
spec.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[9pt,letterpaper]{book}
\usepackage{latexsym}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{bm}
\usepackage{textcomp}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{longtable}
\usepackage{ltablex}
\usepackage{wrapfig}
\usepackage{float}
\usepackage[pdfpagemode=None,pdfstartview=FitH,pdfview=FitH,colorlinks=true]%
{hyperref}
\newtheorem{theorem}{Theorem}[section]
\newcommand{\idx}[1]{{\ensuremath{\mathit{#1}}}}
\newcommand{\qti}{\idx{qti}}
\newcommand{\qtj}{\idx{qtj}}
\newcommand{\pli}{\idx{pli}}
\newcommand{\plj}{\idx{plj}}
\newcommand{\qi}{\idx{qi}}
\newcommand{\ci}{\idx{ci}}
\newcommand{\bmi}{\idx{bmi}}
\newcommand{\bmj}{\idx{bmj}}
\newcommand{\qri}{\idx{qri}}
\newcommand{\qrj}{\idx{qrj}}
\newcommand{\hti}{\idx{hti}}
\newcommand{\sbi}{\idx{sbi}}
\newcommand{\bi}{\idx{bi}}
\newcommand{\bj}{\idx{bj}}
\newcommand{\mbi}{\idx{mbi}}
\newcommand{\mbj}{\idx{mbj}}
\newcommand{\mi}{\idx{mi}}
\newcommand{\cbi}{\idx{cbi}}
\newcommand{\qii}{\idx{qii}}
\newcommand{\ti}{\idx{ti}}
\newcommand{\tj}{\idx{tj}}
\newcommand{\rfi}{\idx{rfi}}
\newcommand{\zzi}{\idx{zzi}}
\newcommand{\ri}{\idx{ri}}
%This somewhat odd construct ensures that \bitvar{\qi}, etc., will set the
% qi in bold face, even though it is in a \mathit font, yet \bitvar{VAR} will
% set VAR in a bold, roman font.
\newcommand{\bitvar}[1]{\ensuremath{\mathbf{\bm{#1}}}}
\newcommand{\locvar}[1]{\ensuremath{\mathrm{#1}}}
\newcommand{\term}[1]{{\em #1}}
\newcommand{\bin}[1]{\ensuremath{\mathtt{b#1}}}
\newcommand{\hex}[1]{\ensuremath{\mathtt{0x#1}}}
\newcommand{\ilog}{\ensuremath{\mathop{\mathrm{ilog}}\nolimits}}
\newcommand{\round}{\ensuremath{\mathop{\mathrm{round}}\nolimits}}
\newcommand{\sign}{\ensuremath{\mathop{\mathrm{sign}}\nolimits}}
\newcommand{\lflim}{\ensuremath{\mathop{\mathrm{lflim}}\nolimits}}
%Section-based table, figure, and equation numbering.
\numberwithin{equation}{chapter}
\numberwithin{figure}{chapter}
\numberwithin{table}{chapter}
%Provide section numbering for \paragraph.
\makeatletter
\renewcommand{\paragraph}{\@startsection{paragraph}{4}{0ex}%
{-3.25ex plus -1ex minus -0.2ex}%
{1.5ex plus 0.2ex}%
{\normalfont\normalsize\bfseries}}
\makeatother
\stepcounter{secnumdepth}
\stepcounter{tocdepth}
\keepXColumns
\pagestyle{headings}
\bibliographystyle{alpha}
\title{Theora Specification}
\author{Xiph.Org Foundation}
\date{\today}
\begin{document}
\frontmatter
\begin{titlepage}
\maketitle
\end{titlepage}
\thispagestyle{empty}
\cleardoublepage
\pagenumbering{roman}
\thispagestyle{plain}
\tableofcontents
\cleardoublepage
\thispagestyle{plain}
\listoffigures
\cleardoublepage
\thispagestyle{plain}
\listoftables
\cleardoublepage
\thispagestyle{plain}
\markboth{{\sc Notation and Conventions}}{{\sc Notation and Conventions}}
\chapter*{Notation and Conventions}
All parameters either passed in or out of a decoding procedure are given in
\bitvar{bold\ face}.
The prefix \bin{} indicates that the following value is to be interpreted as a
binary number (base 2).
\begin{verse}
{\bf Example:} The value \bin{1110100} is equal to the decimal value 116.
\end{verse}
The prefix \hex{} indicates the following value is to be interpreted as a
hexadecimal number (base 16).
\begin{verse}
{\bf Example:} The value \hex{74} is equal to the decimal value 116.
\end{verse}
All arithmetic defined by this specification is exact.
However, any real numbers that do arise will always be converted back to
integers again in short order.
The entire specification can be implemented using only normal integer
operations.
All operations are to be implemented with sufficiently large integers so that
overflow cannot occur.
Where the result of a computation is to be truncated to a fixed-sized binary
representation, this will be explicitly noted.
The size given for all variables is the maximum number of bits needed to store
any value in that variable.
Intermediate computations involving that variable may require more bits.
The following operators are defined:
\begin{description}
\item[$|a|$]
The absolute value of a number $a$.
\begin{align*}
|a| & = \left\{\begin{array}{ll}
-a, & a < 0 \\
a, & a \ge 0
\end{array}\right.
\end{align*}
\item[$a*b$]
Multiplication of a number $a$ by a number $b$.
\item[$\frac{a}{b}$]
Exact division of a number $a$ by a number $b$, producing a potentially
non-integer result.
\item[$\left\lfloor a\right\rfloor$]
The largest integer less than or equal to a real number $a$.
\item[$\left\lceil a\right\rceil$]
The smallest integer greater than or equal to a real number $a$.
\item[$a//b$]
Integer division of $a$ by $b$.
\begin{align*}
a//b & = \left\{\begin{array}{ll}
\left\lceil\frac{a}{b}\right\rceil, & a < 0 \\
\left\lfloor\frac{a}{b}\right\rfloor, & a \ge 0
\end{array}\right.
\end{align*}
\item[$a\%b$]
The remainder from the integer division of $a$ by $b$.
\begin{align*}
a\%b & = a-|b|*\left\lfloor\frac{a}{|b|}\right\rfloor
\end{align*}
Note that with this definition, the result is always non-negative and less than
$|b|$.
\item[$a<<b$]
The value obtained by left-shifting the two's complement integer $a$ by $b$
bits.
For purposes of this specification, overflow is ignored, and so this is
equivalent to integer multiplication of $a$ by $2^b$.
\item[$a>>b$]
The value obtained by right-shifting the two's complement integer $a$ by $b$
bits, filling in the leftmost bits of the new value with $0$ if $a$ is
non-negative and $1$ if $a$ is negative.
This is {\em not} equivalent to integer division of $a$ by $2^b$.
Instead,
\begin{align*}
a>>b & = \left\lfloor\frac{a}{2^b}\right\rfloor.
\end{align*}
\item[$\round(a)$]
Rounds a number $a$ to the nearest integer, with ties rounded away from $0$.
\begin{align*}
\round(a) = \left\{\begin{array}{ll}
\lceil a-\frac{1}{2}\rceil & a \le 0 \\
\lfloor a+\frac{1}{2}\rfloor & a > 0
\end{array}\right.
\end{align*}
\item[$\sign(a)$]
Returns the sign of a given number.
\begin{align*}
\sign(a) = \left\{\begin{array}{ll}
-1 & a < 0 \\
0 & a = 0 \\
1 & a > 0
\end{array}\right.
\end{align*}
\item[$\ilog(a)$]
The minimum number of bits required to store a positive integer $a$ in
two's complement notation, or $0$ for a non-positive integer $a$.
\begin{align*}
\ilog(a) = \left\{\begin{array}{ll}
0, & a \le 0 \\
\left\lfloor\log_2{a}\right\rfloor+1, & a > 0
\end{array}\right.
\end{align*}
\begin{verse}
{\bf Examples:}
\begin{itemize}
\item $\ilog(-1)=0$
\item $\ilog(0)=0$
\item $\ilog(1)=1$
\item $\ilog(2)=2$
\item $\ilog(3)=2$
\item $\ilog(4)=3$
\item $\ilog(7)=3$
\end{itemize}
\end{verse}
\item[$\min(a,b)$]
The minimum of two numbers $a$ and $b$.
\item[$\max(a,b)$]
The maximum of two numbers $a$ and $b$.
\end{description}
\cleardoublepage
\thispagestyle{plain}
\markboth{{\sc Key words}}{{\sc Key words}}
\chapter*{Key words}
%We can't rewrite this, because this is text required by RFC 2119, so we use
% some emergency stretching to get it typeset properly.
\setlength{\emergencystretch}{2em}
The key words ``MUST'', ``MUST NOT'', ``REQUIRED'', ``SHALL'', ``SHALL NOT'',
``SHOULD'', ``SHOULD NOT'', ``RECOMMENDED'', ``MAY'', and ``OPTIONAL'' in this
document are to be intrepreted as described in RFC 2119 \cite{rfc2119}.\par
\setlength{\emergencystretch}{0em}
Where such assertions are placed on the contents of a Theora bitstream itself,
implementations should be prepared to encounter bitstreams that do not follow
these requirements.
An application's behavior in the presecence of such non-conforming bitstreams
is not defined by this specification, but any reasonable method of handling
them MAY be used.
By way of example, applications MAY discard the current frame, retain the
current output thus far, or attempt to continue on by assuming some default
values for the erroneous bits.
When such an error occurs in the bitstream headers, an application MAY refuse
to decode the entire stream.
An application SHOULD NOT allow such non-conformant bitstreams to overflow
buffers and potentially execute arbitrary code, as this represents a serious
security risk.
An application MUST, however, ensure any bits marked as reserved have the value
zero, and refuse to decode the stream if they do not.
These are used as place holders for future bitstream features with which the
current bitstream is forward-compatible.
Such features may not increment the bitstream version number, and can only be
recognized by checking the value of these reserved bits.
\cleardoublepage
\mainmatter
\pagenumbering{arabic}
\setcounter{page}{1}
\chapter{Introduction}
Theora is a general purpose, lossy video codec.
It is based on the VP3 video codec produced by On2 Technologies
(\url{http://www.on2.com/}).
On2 donated the VP3.1 source code to the Xiph.Org Foundation and released it
under a BSD-like license.
On2 also made an irrevocable, royalty-free license grant for any patent claims
it might have over the software and any derivatives.
No formal specification exists for the VP3 format beyond this source code,
however Mike Melanson maintains a detailed description \cite{Mel04}.
Portions of this specification were adopted from that text with permission.
\section{VP3 and Theora}
Theora contains a superset of the features that were available in the original
VP3 codec.
Content encoded with VP3.1 can be losslessly transcoded into the Theora format.
Theora content cannot, in general, be losslessly transcoded into the VP3
format.
If a feature is not available in the original VP3 format, this is mentioned
when that feature is defined.
A complete list of these features appears in Appendix~\ref{app:vp3-compat}.
%TODO: VP3 - theora comparison in appendix
\section{Video Formats}
Theora currently supports progressive video data of arbitrary dimensions at a
constant frame rate in one of several $Y'C_bC_r$ color spaces.
The precise definition the supported color spaces appears in
Section~\ref{sec:colorspaces}.
Three different chroma subsampling formats are supported: 4:2:0, 4:2:2,
and 4:4:4.
The precise details of each of these formats and their sampling locations are
described in Section~\ref{sec:pixfmts}.
The Theora format does not support interlaced material, variable frame rates,
bit-depths larger than 8 bits per component, nor alternate color spaces such
as RGB or arbitrary multi-channel spaces.
Black and white content can be efficiently encoded, however, because the
uniform chroma planes compress well.
Support for interlaced material is planned for a future version.
\begin{verse}
{\bf Note:} Infrequently changing frame rates---as when film and video
sequences are cut together---can be supported in the Ogg container format by
chaining several Theora streams together.
\end{verse}
Support for increased bit depths or additional color spaces is not planned.
\section{Classification}
Theora is a block-based lossy transform codec that utilizes an
$8\times 8$ Type-II Discrete Cosine Transform and block-based motion
compensation.
This places it in the same class of codecs as MPEG-1, -2, -4, and H.263.
The details of how individual blocks are organized and how DCT coefficients are
stored in the bitstream differ substantially from these codecs, however.
Theora supports only intra frames (I frames in MPEG) and inter frames (P frames
in MPEG).
There is no equivalent to the bi-predictive frames (B frames) found in MPEG
codecs.
\section{Assumptions}
The Theora codec design assumes a complex, psychovisually-aware encoder and a
simple, low-complexity decoder.
%TODO: Talk more about implementation complexity.
Theora provides none of its own framing, synchronization, or protection against
transmission errors.
An encoder is solely a method of accepting input video frames and
compressing these frames into raw, unformatted `packets'.
The decoder then accepts these raw packets in sequence, decodes them, and
synthesizes a fascimile of the original video frames.
Theora is a free-form variable bit rate (VBR) codec, and packets have no
minimum size, maximum size, or fixed/expected size.
Theora packets are thus intended to be used with a transport mechanism that
provides free-form framing, synchronization, positioning, and error correction
in accordance with these design assumptions, such as Ogg (for file transport)
or RTP (for network multicast).
For the purposes of a few examples in this document, we will assume that Theora
is embedded in an Ogg stream specifically, although this is by no means a
requirement or fundamental assumption in the Theora design.
The specification for embedding Theora into an Ogg transport stream is given in
Appendix~\ref{app:oggencapsulation}.
\section{Codec Setup and Probability Model}
Theora's heritage is the proprietary commerical codec VP3, and it retains a
fair amount of inflexibility when compared to Vorbis \cite{vorbis}, the first
Xiph.Org codec, which began as a research codec.
However, to provide additional scope for encoder improvement, Theora adopts
some of the configurable aspects of decoder setup that are present in Vorbis.
This configuration data is not available in VP3, which uses hardcoded values
instead.
Theora makes the same controversial design decision that Vorbis made to include
the entire probability model for the DCT coefficients and all the quantization
parameters in the bitstream headers.
This is often several hundred fields.
It is therefore impossible to decode any frame in the stream without
having previously fetched the codec info and codec setup headers.
\begin{verse}
{\bf Note:} Theora {\em can} initiate decode at an arbitrary intra-frame packet
within a bitstream so long as the codec has been initialized with the setup
headers.
\end{verse}
Thus, Theora headers are both required for decode to begin and relatively large
as bitstream headers go.
The header size is unbounded, although as a rule-of-thumb less than 16kB is
recommended, and Xiph.Org's reference encoder follows this suggestion.
%TODO: Is 8kB enough? My setup header is 7.4kB, that doesn't leave much room
% for comments.
%RG: the lesson from vorbis is that as small as possible is really
% important in some applications. Practically, what's acceptable
% depends a great deal on the target bitrate. I'd leave 16 kB in the
% spec for now. fwiw more than 1k of comments is quite unusual.
Our own design work indicates that the primary liability of the required header
is in mindshare; it is an unusual design and thus causes some amount of
complaint among engineers as this runs against current design trends and
points out limitations in some existing software/interface designs.
However, we find that it does not fundamentally limit Theora's suitable
application space.
%silvia: renamed
%\subsection{Format Specification}
\section{Format Conformance}
The Theora format is well-defined by its decode specification; any encoder that
produces packets that are correctly decoded by an implementation following
this specification may be considered a proper Theora encoder.
A decoder must faithfully and completely implement the specification defined
herein %, except where noted,
to be considered a conformant Theora decoder.
A decoder need not be implemented strictly as described, but the
actual decoder process MUST be {\em entirely mathematically equivalent}
to the described process.
Where appropriate, a non-normative description of encoder processes is
included.
These sections will be marked as such, and a proper Theora encoder is not
bound to follow them.
%TODO: \subsection{Hardware Profile}
\chapter{Coded Video Structure}
Theora's encoding and decoding process is based on $8\times 8$ blocks of
pixels.
This sections describes how a video frame is laid out, divided into
blocks, and how those blocks are organized.
\section{Frame Layout}
A video frame in Theora is a two-dimensional array of pixels.
Theora, like VP3, uses a right-handed coordinate system, with the origin in the
lower-left corner of the frame.
This is contrary to many video formats which use a left-handed coordinate
system with the origin in the upper-left corner of the frame.
%INT: This means that for interlaced material, the definition of `even fields'
%INT: and `odd fields' may be reversed between Theora and other video codecs.
%INT: This document will always refer to them as `top fields' and `bottom
%INT: fields'.
Theora divides the pixel array up into three separate \term{color planes}, one
for each of the $Y'$, $C_b$, and $C_r$ components of the pixel.
The $Y'$ plane is also called the \term{luma plane}, and the $C_b$ and $C_r$
planes are also called the \term{chroma planes}.
Each plane is assigned a numerical value, as shown in
Table~\ref{tab:color-planes}.
\begin{table}[htbp]
\begin{center}
\begin{tabular}{cl}\toprule
Index & Color Plane \\\midrule
$0$ & $Y'$ \\
$1$ & $C_b$ \\
$2$ & $C_r$ \\
\bottomrule\end{tabular}
\end{center}
\caption{Color Plane Indices}
\label{tab:color-planes}
\end{table}
In some pixel formats, the chroma planes are subsampled by a factor of two
in one or both directions.
This means that the width or height of the chroma planes may be half that of
the total frame width and height.
The luma plane is never subsampled.
\section{Picture Region}
An encoded video frame in Theora is required to have a width and height that
are multiples of sixteen, making an integral number of blocks even when the
chroma planes are subsampled.
However, inside a frame a smaller \term{picture region} may be defined
to present material whose dimensions are not a multiple of sixteen pixels, as
shown in Figure~\ref{fig:pic-frame}.
The picture region can be offset from the lower-left corner of the frame by up
to 255 pixels in each direction, and may have an arbitrary width and height,
provided that it is contained entirely within the coded frame.
It is this picture region that contains the actual video data.
The portions of the frame which lie outside the picture region may contain
arbitrary image data, so the frame must be cropped to the picture region
before display.
The picture region plays no other role in the decode process, which operates on
the entire video frame.
\begin{figure}[htbp]
\begin{center}
\includegraphics{pic-frame}
\end{center}
\caption{Location of frame and picture regions}
\label{fig:pic-frame}
\end{figure}
\section{Blocks and Super Blocks}
\label{sec:blocks-and-sbs}
Each color plane is subdivided into \term{blocks} of $8\times 8$ pixels.
Blocks are grouped into $4\times 4$ arrays called \term{super blocks} as
shown in Figure~\ref{fig:superblock}.
Each color plane has its own set of blocks and super blocks.
If the chroma planes are subsampled, they are still divided into $8\times 8$
blocks of pixels; there are just fewer blocks than in the luma plane.
The boundaries of blocks and super blocks in the luma plane do not necessarily
coincide with those of the chroma planes, if the chroma planes have been
subsampled.
\begin{figure}[htbp]
\begin{center}
\includegraphics{superblock}
\end{center}
\caption{Subdivision of a frame into blocks and super blocks}
\label{fig:superblock}
\end{figure}
Blocks are accessed in two different orders in the various decoder processes.
The first is \term{raster order}, illustrated in Figure~\ref{fig:raster-block}.
This accesses each block in row-major order, starting in the lower left of the
frame and continuing along the bottom row of the entire frame, followed by the
next row up, starting on the left edge of the frame, etc.
\begin{figure}[htbp]
\begin{center}
\includegraphics{raster-block}
\end{center}
\caption{Raster ordering of $n\times m$ blocks}
\label{fig:raster-block}
\end{figure}
The second is \term{coded order}.
In coded order, blocks are accessed by super block.
Within each frame, super blocks are traversed in raster order,
similar to raster order for blocks.
Within each super block, however, blocks are accessed in a Hilbert curve
pattern, illustrated in Figure~\ref{fig:hilbert-block}.
If a color plane does not contain a complete super block on the top or right
sides, the same ordering is still used, simply with any blocks outside the
frame boundary ommitted.
\begin{figure}[htbp]
\begin{center}
\includegraphics{hilbert-block}
\end{center}
\caption{Hilbert curve ordering of blocks within a super block}
\label{fig:hilbert-block}
\end{figure}
To illustrate this ordering, consider a frame that is 240 pixels wide and
48 pixels high.
Each row of the luma plane has 30 blocks and 8 super blocks, and there are 6
rows of blocks and two rows of super blocks.
%When accessed in raster order, each block in the luma plane is assigned the
% following indices:
%\vspace{\baselineskip}
%\begin{center}
%\begin{tabular}{|ccccccc|}\hline
%150 & 151 & 152 & 153 & $\ldots$ & 178 & 179 \\
%120 & 121 & 122 & 123 & $\ldots$ & 148 & 149 \\\hline
% 90 & 91 & 92 & 93 & $\ldots$ & 118 & 119 \\
% 60 & 61 & 62 & 63 & $\ldots$ & 88 & 89 \\
% 30 & 31 & 32 & 33 & $\ldots$ & 58 & 59 \\
% 0 & 1 & 2 & 3 & $\ldots$ & 28 & 29 \\\hline
%\end{tabular}
%\end{center}
%\vspace{\baselineskip}
When accessed in coded order, each block in the luma plane is assigned the
following indices:
\vspace{\baselineskip}
\begin{center}
\begin{tabular}{|cccc|c|cc|}\hline
123 & 122 & 125 & 124 & $\ldots$ & 179 & 178 \\
120 & 121 & 126 & 127 & $\ldots$ & 176 & 177 \\\hline
5 & 6 & 9 & 10 & $\ldots$ & 117 & 118 \\
4 & 7 & 8 & 11 & $\ldots$ & 116 & 119 \\
3 & 2 & 13 & 12 & $\ldots$ & 115 & 114 \\
0 & 1 & 14 & 15 & $\ldots$ & 112 & 113 \\\hline
\end{tabular}
\end{center}
\vspace{\baselineskip}
Here the index values specify the order in which the blocks would be accessed.
The indices of the blocks are numbered continuously from one color plane to the
next.
They do not reset to zero at the start of each plane.
Instead, the numbering increases continuously from the $Y'$ plane to the $C_b$
plane to the $C_r$ plane.
The implication is that the blocks from all planes are treated as a unit during
the various processing steps.
Although blocks are sometimes accessed in raster order, in this document the
index associated with a block is {\em always} its index in coded order.
\section{Macro Blocks}
\label{sec:mbs}
A macro block contains a $2\times 2$ array of blocks in the luma plane
{\em and} the co-located blocks in the chroma planes, as shown in
Figure~\ref{fig:macroblock}.
Thus macro blocks can represent anywhere from six to twelve blocks, depending
on how the chroma planes are subsampled.
This is in contrast to super blocks, which only contain blocks from a single
color plane.
% the whole super vs. macro blocks thing is a little confusing, and it can be
% hard to remember which is what initially. A figure would/will help here,
% but I tried to add some text emphasizing the difference in terms of
% functionality.
%TBT: At this point we haven't described any functionality yet.
%TBT: As far as the reader knows, the only purpose of the blocks, macro blocks
%TBT: and super blocks is for data organization---and for blocks and super
%TBT: blocks, this is essentially true.
%TBT: So lets restrict the differences we emphasize to those of data
%TBT: organization, which the sentence I just added above does.
Macro blocks contain information about coding mode and motion vectors for the
corresponding blocks in all color planes.
\begin{figure}[htbp]
\begin{center}
\includegraphics{macroblock}
\end{center}
\caption{Subdivision of a frame into macro blocks}
\label{fig:macroblock}
\end{figure}
Macro blocks are also accessed in a \term{coded order}.
This coded order proceeds by examining each super block in the luma plane in
raster order, and traversing the four macro blocks inside using a smaller
Hilbert curve, as shown in Figure~\ref{fig:hilbert-mb}.
%r: I rearranged the wording to make a more formal idiom here
If the luma plane does not contain a complete super block on the top or right
sides, the same ordering is still used, with any macro blocks outside
the frame boundary simply omitted.
Because the frame size is constrained to be a multiple of 16, there are never
any partial macro blocks.
Unlike blocks, macro blocks need never be accessed in a pure raster order.
\begin{figure}[htbp]
\begin{center}
\includegraphics{hilbert-mb}
\end{center}
\caption{Hilbert curve ordering of macro blocks within a super block}
\label{fig:hilbert-mb}
\end{figure}
Using the same frame size as the example above, there are 15 macro blocks in
each row and 3 rows of macro blocks.
The macro blocks are assigned the following indices:
\vspace{\baselineskip}
\begin{center}
\begin{tabular}{|cc|cc|c|cc|c|}\hline
30 & 31 & 32 & 33 & $\cdots$ & 42 & 43 & 44 \\\hline
1 & 2 & 5 & 6 & $\cdots$ & 25 & 26 & 29 \\
0 & 3 & 4 & 7 & $\cdots$ & 24 & 27 & 28 \\\hline
\end{tabular}
\end{center}
\vspace{\baselineskip}
\section{Coding Modes and Prediction}
Each block is coded using one of a small, fixed set of \term{coding modes} that
define how the block is predicted from previous frames.
A block is predicted using one of two \term{reference frames}, selected
according to the coding mode.
A reference frame is the fully decoded version of a previous frame in the
stream.
The first available reference frame is the previous intra frame, called the
\term{golden frame}.
The second available reference frame is the previous frame, whether it was an
intra frame or an inter frame.
If the previous frame was an intra frame, then both reference frames are the
same.
See Figure~\ref{fig:reference-frames} for an illustration of the reference
frames used for an intra frame that does not follow an intra frame.
\begin{figure}[htbp]
\begin{center}
\includegraphics{reference-frames}
\end{center}
\caption{Example of reference frames for an inter frame}
\label{fig:reference-frames}
\end{figure}
Two coding modes in particular are worth mentioning here.
The INTRA mode is used for blocks that are not predicted from either reference
frame.
This is the only coding mode allowed in intra frames.
The INTER\_NOMV coding mode uses the co-located contents of the block in the
previous frame as the predictor.
This is the default coding mode.
\section{DCT Coefficients}
\label{sec:dct-coeffs}
A \term{residual} is added to the predicted contents of a block to form the
final reconstruction.
The residual is stored as a set of quantized coefficients from an integer
approximation of a two-dimensional Type II Discrete Cosine Transform.
The DCT takes an $8\times 8$ array of pixel values as input and returns an
$8\times 8$ array of coefficient values.
The \term{natural ordering} of these coefficients is defined to be row-major
order, from lowest to highest frequency.
They are also often indexed in \term{zig-zag order}, as shown in
Figure~\ref{tab:zig-zag}.
\begin{figure}[htbp]
\begin{center}
\begin{tabular}[c]{rr|c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c}
&\multicolumn{1}{r}{} & && &&&&&$c$&&& && && \\
&\multicolumn{1}{r}{} &0&&1&&2&&3&&4&&5&&6&&7 \\\cline{3-17}
&0 & 0 &$\rightarrow$& 1 && 5 &$\rightarrow$& 6 && 14 &$\rightarrow$& 15 && 27 &$\rightarrow$& 28 \\[-0.5\defaultaddspace]
& & &$\swarrow$&&$\nearrow$& &$\swarrow$&&$\nearrow$& &$\swarrow$&&$\nearrow$& &$\swarrow$& \\
&1 & 2 & & 4 && 7 & & 13 && 16 & & 26 && 29 & & 42 \\[-0.5\defaultaddspace]
& &$\downarrow$&$\nearrow$&&$\swarrow$&&$\nearrow$&&$\swarrow$&&$\nearrow$&&$\swarrow$&&$\nearrow$&$\downarrow$ \\
&2 & 3 & & 8 && 12 & & 17 && 25 & & 30 && 41 & & 43 \\[-0.5\defaultaddspace]
& & &$\swarrow$&&$\nearrow$& &$\swarrow$&&$\nearrow$& &$\swarrow$&&$\nearrow$& &$\swarrow$& \\
&3 & 9 & & 11 && 18 & & 24 && 31 & & 40 && 44 & & 53 \\[-0.5\defaultaddspace]
$r$&&$\downarrow$&$\nearrow$&&$\swarrow$&&$\nearrow$&&$\swarrow$&&$\nearrow$&&$\swarrow$&&$\nearrow$&$\downarrow$ \\
&4 & 10 & & 19 && 23 & & 32 && 39 & & 45 && 52 & & 54 \\[-0.5\defaultaddspace]
& & &$\swarrow$&&$\nearrow$& &$\swarrow$&&$\nearrow$& &$\swarrow$&&$\nearrow$& &$\swarrow$& \\
&5 & 20 & & 22 && 33 & & 38 && 46 & & 51 && 55 & & 60 \\[-0.5\defaultaddspace]
& &$\downarrow$&$\nearrow$&&$\swarrow$&&$\nearrow$&&$\swarrow$&&$\nearrow$&&$\swarrow$&&$\nearrow$&$\downarrow$ \\
&6 & 21 & & 34 && 37 & & 47 && 50 & & 56 && 59 & & 61 \\[-0.5\defaultaddspace]
& & &$\swarrow$&&$\nearrow$& &$\swarrow$&&$\nearrow$& &$\swarrow$&&$\nearrow$& &$\swarrow$& \\
&7 & 35 &$\rightarrow$& 36 && 48 &$\rightarrow$& 49 && 57 &$\rightarrow$& 58 && 62 &$\rightarrow$& 63
\end{tabular}
\end{center}
\caption{Zig-zag order}
\label{tab:zig-zag}
\end{figure}
\begin{verse}
{\bf Note:} the row and column indices refer to {\em frequency number} and not
pixel locations.
The frequency numbers are defined independently of the memory organization of
the pixels.
They have been written from top to bottom here to follow conventional notation,
despite the right-handed coordinate system Theora uses for pixel locations.
%RG: I'd rather we were internally consistent and put dc at the lower left.
Many implementations of the DCT operate `in-place'.
That is, they return DCT coefficients in the same memory buffer that the
initial pixel values were stored in.
Due to the right-handed coordinate system used for pixel locations in Theora,
one must note carefully how both pixel values and DCT coefficients are
organized in memory in such a system.
\end{verse}
DCT coefficient $(0,0)$ is called the \term{DC coefficient}.
All the other coefficients are called \term{AC coefficients}.
\chapter{Decoding Overview}
This section provides a high level description of the Theora codec's
construction.
A bit-by-bit specification appears beginning in Section~\ref{sec:bitpacking}.
The later sections assume a high-level understanding of the Theora decode
process, which is provided below.
\section{Decoder Configuration}
Decoder setup consists of configuration of the quantization matrices and the
Huffman codebooks for the DCT coefficients, and a table of limit values for
the deblocking filter.
The remainder of the decoding pipeline is not configurable.
\subsection{Global Configuration}
The global codec configuration consists of a few video related fields, such as
frame rate, frame size, picture size and offset, aspect ratio, color space,
pixel format, and a version number.
The version number is divided into a major version, a minor version, amd a
minor revision number.
%r: afaik the released vp3 codec called itself 3.1 and is compatible w/ theora
%r: even though we received the in-progress 3.2 codebase
For the format defined in this specification, these are `3', `2', and
`1', respectively, in reference to Theora's origin as a successor to
the VP3.1 format.
\subsection{Quantization Matrices}
Theora allows up to 384 different quantization matrices to be defined, one for
each \term{quantization type}, \term{color plane} ($Y'$, $C_b$, or $C_r$), and
\term{quantization index}, \qi, which ranges from zero to 63, inclusive.
There are currently two quantization types defined, which depend on the coding
mode of the block being dequantized, as shown in Table~\ref{tab:quant-types}.
\begin{table}[htbp]
\begin{center}
\begin{tabular}{cl}\toprule
Quantization Type & Usage \\\midrule
$0$ & INTRA-mode blocks \\
$1$ & Blocks in any other mode. \\
\bottomrule\end{tabular}
\end{center}
\caption{Quantization Type Indices}
\label{tab:quant-types}
\end{table}
%r: I think 'nominally' is more specific than 'generally' here
The quantization index, on the other hand, nominally represents a progressive
range of quality levels, from low quality near zero to high quality near 63.
However, the interpretation is arbitrary, and it is possible, for example, to
partition the scale into two completely separate ranges with 32 levels each
that are meant to represent different classes of source material, or any
other arrangement that suits the encoder's requirements.
Each quantization matrix is an $8\times 8$ matrix of 16-bit values, which is
used to quantize the output of the $8\times 8$ DCT\@.
Quantization matrices are specified using three components: a
\term{base matrix} and two \term{scale values}.
The first scale value is the \term{DC scale}, which is applied to the DC
component of the base matrix.
The second scale value is the \term{AC scale}, which is applied to all the
other components of the base matrix.
There are 64 DC scale values and 64 AC scale values, one for each \qi\ value.
There are 64 elements in each base matrix, one for each DCT coefficient.
They are stored in natural order (cf. Section~\ref{sec:dct-coeffs}).
There is a separate set of base matrices for each quantization type and each
color plane, with up to 64 possible base matrices in each set, one for each
\qi\ value.
%r: we will mention that the given matricies must bound the \qi range
%r: in the detailed section. it's not important at this level.
Typically the bitstream contains matrices for only a sparse subset of the
possible \qi\ values.
The base matrices for the remainder of the \qi\ values are computed using
linear interpolation.
This configuration allows the encoder to adjust the quantization matrices to
approximate the complex, non-linear response of the human visual system to
different quantization errors.
Finally, because the in-loop deblocking filter strength depends on the strength
of the quantization matrices defined in this header, a table of 64 \term{loop
filter limit values} is defined, one for each \qi\ value.
The precise specification of how all of this information is decoded appears in
Section~\ref{sub:loop-filter-limits} and Section~\ref{sub:quant-params}.
\subsection{Huffman Codebooks}
Theora uses 80 configurable binary Huffman codes to represent the 32 tokens
used to encode DCT coefficients.
Each of the 32 token values has a different semantic meaning and is used to
represent single coefficient values, zero runs, combinations of the two, and
\term{End-Of-Block markers}.
The 80 codes are divided up into five groups of 16, with each group
corresponding to a set of DCT coefficient indices.
The first group corresponds to the DC coefficient, while the remaining four
groups correspond to different subsets of the AC coefficients.
Within each frame, two pairs of 4-bit codebook indices are stored.
The first pair selects which codebooks to use from the DC coefficient group for
the $Y'$ coefficients and the $C_b$ and $C_r$ coefficients.
The second pair selects which codebooks to use from {\em all four} of the AC
coefficient groups for the $Y'$ coefficients and the $C_b$ and $C_r$
coefficients.
The precise specification of how the codebooks are decoded appears in
Section~\ref{sub:huffman-tables}.
\section{High-Level Decode Process}
\subsection{Decoder Setup}
Before decoding can begin, a decoder MUST be initialized using the bitstream
headers corresponding to the stream to be decoded.
Theora uses three header packets; all are required, in order, by this
specification.
Once set up, decode may begin at any intra-frame packet---or even inter-frame
packets, provided the appropriate decoded reference frames have already been
decoded and cached---belonging to the Theora stream.
In Theora I, all packets after the three initial headers are intra-frame or
inter-frame packets.
The header packets are, in order, the identification header, the comment
header, and the setup header.
\paragraph{Identification Header}
The identification header identifies the stream as Theora, provides a version
number, and defines the characteristics of the video stream such as frame
size.
A complete description of the identification header appears in
Section~\ref{sec:idheader}.
\paragraph{Comment Header}
The comment header includes user text comments (`tags') and a vendor string
for the application/library that produced the stream.
The format of the comment header is the same as that used in the Vorbis I and
Speex codecs, with slight modifications due to the use of a different bit
packing mechanism.
A complete description of how the comment header is coded appears in
Section~\ref{sec:commentheader}, along with a suggested set of tags.
\paragraph{Setup Header}
The setup header includes extensive codec setup information, including the
complete set of quantization matrices and Huffman codebooks needed to decode
the DCT coefficients.
A complete description of the setup header appears in
Section~\ref{sec:setupheader}.
\subsection{Decode Procedure}
The decoding and synthesis procedure for all video packets is fundamentally the
same, with some steps omitted for intra frames.
\begin{itemize}
\item
Decode packet type flag.
\item
Decode frame header.
\item
Decode coded block information (inter frames only).
\item
Decode macro block mode information (inter frames only).
\item
Decode motion vectors (inter frames only).
\item
Decode block-level \qi\ information.
\item
Decode DC coefficient for each coded block.
\item
Decode 1st AC coefficient for each coded block.
\item
Decode 2nd AC coefficient for each coded block.
\item
$\ldots$
\item
Decode 63rd AC coefficient for each coded block.
\item Perform DC coefficient prediction.
\item Reconstruct coded blocks.
\item Copy uncoded bocks.
\item Perform loop filtering.
\end{itemize}
\begin{verse}
{\bf Note:} clever rearrangement of the steps in this process is possible.
As an example, in a memory-constrained environment, one can make multiple
passes through the DCT coefficients to avoid buffering them all in memory.
On the first pass, the starting location of each coefficient is identified, and
then 64 separate get pointers are used to read in the 64 DCT coefficients
required to reconstruct each coded block in sequence.
This operation produces entirely equivalent output and is naturally perfectly
legal.
It may even be a benefit in non-memory-constrained environments due to a
reduced cache footprint.
\end{verse}
Theora makes equivalence easy to check by defining all decoding operations in
terms of exact integer operations.
No floating-point math is required, and in particular, the implementation of
the iDCT transform MUST be followed precisely.
This prevents the decoder mismatch problem commonly associated with codecs that
provide a less rigorous transform specification.
Such a mismatch problem would be devastating to Theora, since a single rounding
error in one frame could propagate throughout the entire succeeding frame due
to DC prediction.
\paragraph{Packet Type Decode}
Theora uses four packet types.
The first three packet types mark each of the three Theora headers described
above.
The fourth packet type marks a video packet.
All other packet types are reserved; packets marked with a reserved type should
be ignored.
Additionally, zero-length packets are treated as if they were an inter
frame with no blocks coded. That is, as a duplicate frame.
\paragraph{Frame Header Decode}
The frame header contains some global information about the current frame.
The first is the frame type field, which specifies if this is an intra frame or
an inter frame.
Inter frames predict their contents from previously decoded reference frames.
Intra frames can be independently decoded with no established reference frames.
The next piece of information in the frame header is the list of \qi\ values