FGDB Spec

Even Rouault edited this page May 14, 2015 · 37 revisions

This is a work-in-progress reverse-engineered specification of .gdbtable, .gdbtablx, .gdbindexes, .atx and .freelist files found in FileGDB datasets. It generally applies to FileGDB datasets v10, as well as earlier versions, unless otherwise specified.

Conventions

  • ubyte: unsigned byte
  • int16: little-endian 16-bit integer
  • int32: little-endian 32-bit integer
  • float64: little-endian 64-bit IEEE754 floating point number
  • utf16: string in little-endian UTF-16 encoding
  • string: (UTF-8 ?) string

The terms "row" and "feature" are used as synonyms in this document.

Specification of .gdbtable files

.gdbtable files describe fields and contain row data.

They are made of a header, a section describing the fields, and a section describing the rows.

Header (40 bytes)

  • 4 bytes: 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ?
  • int32: number of (valid) rows
  • 4 bytes: varying values - unknown role (TBC : this value does have something to do with row size. A value larger than the size of the largest row seems to be ok)
  • 4 bytes: 0x05 0x00 0x00 0x00 - unknown role. Constant among the files
  • 4 bytes: varying values - unknown role. Seems to be 0x00 0x00 0x00 0x00 for FGDB 10 files, but not for earlier versions
  • 4 bytes: 0x00 0x00 0x00 0x00 - unknown role. Constant among the files
  • int32: file size in bytes
  • 4 bytes: 0x00 0x00 0x00 0x00 - unknown role. Constant among the files
  • int32: offset in bytes at which the field description section begins (often 40 in FGDB 10)
  • 4 bytes: 0x00 0x00 0x00 0x00 - unknown role. Constant among the files
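
The header layout above can be sketched as a small reader. This is only an illustration under assumptions: the names GdbTableHeader, ReadLE32 and ParseGdbTableHeader are hypothetical, and rejecting a first word different from 3 is a choice made here, since the role of that word is unknown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fields of the 40-byte .gdbtable header whose role is known (hypothetical struct).
struct GdbTableHeader {
    uint32_t signature;          // bytes 0-3, observed to be 3
    int32_t  valid_rows;         // bytes 4-7
    int32_t  file_size;          // bytes 24-27
    int32_t  field_desc_offset;  // bytes 32-35
};

// Assembles a little-endian 32-bit value independently of the host byte order.
uint32_t ReadLE32(const unsigned char* p) {
    return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Returns false if the buffer is shorter than 40 bytes or if the first word is not 3
// (assumption: the constant is treated as a signature).
bool ParseGdbTableHeader(const unsigned char* buf, size_t len, GdbTableHeader& out) {
    assert(buf != nullptr);
    if (len < 40) return false;
    out.signature = ReadLE32(buf);
    out.valid_rows = static_cast<int32_t>(ReadLE32(buf + 4));
    out.file_size = static_cast<int32_t>(ReadLE32(buf + 24));
    out.field_desc_offset = static_cast<int32_t>(ReadLE32(buf + 32));
    return out.signature == 3;
}
```

The words of unknown role (bytes 8-23, 28-31 and 36-39) are deliberately skipped.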

Field description section

Fixed part

  • int32: size of header in bytes (this field excluded)
  • int32: version of the file ? Seems to be 3 for FGDB 9.X files and 4 for FGDB 10.X files
  • ubyte: layer geometry type. 0 = none, 1 = point, 2 = multipoint, 3 = (multi)polyline, 4 = (multi)polygon, 9 = multipatch
  • 3 bytes: 0x03 0x00 0x00 - unknown role
  • int16: number of fields (including geometry field and implicit OBJECTID field)

Repeated part (per field)

Following immediately: the description of the fields (repeated as many times as the number of fields)

  • ubyte: number of UTF-16 characters (not bytes) of the name of the field
  • utf16: name of the field
  • ubyte: number of UTF-16 characters (not bytes) of the alias of the field. Might be 0
  • utf16: alias of the field (omitted if the preceding character count is 0)
  • ubyte: field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 9=raster, 10/11 = UUID, 12 = XML )

The next bytes for the field description depend on the field type.

For field type = 4 (string),

  • int32: maximum length of string
  • ubyte: flag
  • varuint: ldf = length of the default value in bytes, if (flag & 4) != 0, followed by ldf bytes holding the default value

For field type = 6 (objectid),

  • ubyte: unknown role = 4
  • ubyte: unknown role = 2

For field type = 7 (geometry),

  • ubyte: unknown role = 0
  • ubyte: flag = 6 or 7. If lsb is 1, the field can be null.
  • int16: length (in bytes) of the WKT string describing the SRS.
  • string: WKT string describing the SRS, or {B286C06B-0879-11D2-AACA-00C04FA33C20} if there is no SRS.
  • ubyte: flags. Value is generally 1 (has_z = has_m = false, generally for the system table a00000004.gdbtable), 5 (has_z = true, has_m = false) or 7 (has_z = has_m = true)
  • float64: xorigin
  • float64: yorigin
  • float64: xyscale
  • float64: morigin (present only if has_m = True)
  • float64: mscale (present only if has_m = True)
  • float64: zorigin (present only if has_z = True)
  • float64: zscale (present only if has_z = True)
  • float64: xytolerance
  • float64: mtolerance (present only if has_m = True)
  • float64: ztolerance (present only if has_z = True)
  • float64: xmin of layer extent (might be NaN)
  • float64: ymin of layer extent (might be NaN)
  • float64: xmax of layer extent (might be NaN)
  • float64: ymax of layer extent (might be NaN)

There are extra bytes whose organization seems to follow the ad-hoc algorithm below. It seems to work in practice, but the hacks it contains make it obvious that the real layout is not yet understood:

  1. Read 5 bytes
  2. If those bytes are 0x00 XXX 0x00 0x00 0x00 where XXX=0x01, 0x02 or 0x03, then read XXX * float64 values and go to 6.
  3. Otherwise, rewind those 5 bytes
  4. Read a float64 value
  5. Goto 1
  6. End

Note: it seems that if at least 3 float64 values are read from the above algorithm, the 2 first ones correspond to the zmin of layer extent and zmax of layer extent (might be NaN), when the layer is of 2.5D geometry type.
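
Steps 1 to 6 above can be transcribed as follows. This is a sketch of the ad-hoc algorithm only, not of the real (unknown) layout; the function name ReadGeomFieldTrailer is hypothetical, and the code assumes a little-endian host with IEEE 754 doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Applies steps 1 to 6 to buf starting at pos. Returns the float64 values read
// and leaves pos just after the extra bytes (or where truncated input stopped it).
std::vector<double> ReadGeomFieldTrailer(const unsigned char* buf, size_t len, size_t& pos) {
    assert(pos <= len);
    std::vector<double> values;
    auto readDouble = [&]() {
        double d;
        std::memcpy(&d, buf + pos, 8);  // assumption: little-endian host
        pos += 8;
        values.push_back(d);
    };
    while (pos + 5 <= len) {
        const unsigned char* p = buf + pos;
        // Step 2: look for 0x00 XXX 0x00 0x00 0x00 with XXX in 1..3
        const bool marker = p[0] == 0 && p[1] >= 1 && p[1] <= 3 &&
                            p[2] == 0 && p[3] == 0 && p[4] == 0;
        if (marker) {
            const int n = p[1];
            pos += 5;
            for (int i = 0; i < n && pos + 8 <= len; ++i) readDouble();
            break;  // step 6
        }
        if (pos + 8 > len) break;  // truncated input: stop instead of overrunning
        readDouble();              // steps 3 to 5: the 5 bytes start a float64
    }
    return values;
}
```

Per the note above, when at least 3 values come back, the first 2 would be the zmin and zmax of the layer extent.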

For field type = 8 (binary),

  • ubyte: unknown role
  • ubyte: flag

For field type = 9 (raster),

  • ubyte: unknown role
  • ubyte: flag. If lsb is 1, the field can be null.
  • ubyte: number of UTF-16 characters (not bytes) of the following string
  • utf16: string whose value seems to be "Raster Column"
  • int16: length (in bytes) of the WKT string describing the SRS.
  • string: WKT string describing the SRS, or {B286C06B-0879-11D2-AACA-00C04FA33C20} if there is no SRS.
  • ubyte: flags. Value is generally 1 (has_z = has_m = false, generally for the system table a00000004.gdbtable), 5 (has_z = true, has_m = false) or 7 (has_z = has_m = true). If 0, none of the following float64 values is present: the next one is the ubyte of unknown role.
  • float64: xorigin
  • float64: yorigin
  • float64: xyscale
  • float64: morigin (present only if has_m = True)
  • float64: mscale (present only if has_m = True)
  • float64: zorigin (present only if has_z = True)
  • float64: zscale (present only if has_z = True)
  • float64: xytolerance
  • float64: mtolerance (present only if has_m = True)
  • float64: ztolerance (present only if has_z = True)
  • ubyte: unknown role

For field type = 10, 11

  • ubyte: width : 38
  • ubyte: flag

For field type = 12

  • ubyte: width : 0
  • ubyte: flag

For other field types,

  • ubyte: width in bytes (e.g. 2 for int16, 4 for int32, 4 for float32, 8 for float64, 8 for datetime)
  • ubyte: flag
  • ubyte: ldf = length of the default value in bytes, if (flag & 4) != 0, followed by ldf bytes

If the lsb of the flag field (when present) is set to 1, then the field can be null in records.

Rows section

The rows section does not necessarily follow the last field description immediately. It generally starts a few bytes later, but not at a predictable position. Note: for FGDB layers created by the ESRI FGDB SDK API, there are 4 bytes between the end of the field description section and the beginning of the rows section: 0xDE 0xAD 0xBE 0xEF (!)

The rows section is a sequence of X rows (where X is the total number of features found in the .gdbtablx, which might be different from the number of valid rows found in the header of the .gdbtable). Each row starts at an offset indicated in the .gdbtablx file.

Row description

  • int32: length in bytes of the row blob (this field excluded)
  • ceil(number_nullable_fields / 8) * ubyte: flags describing whether a field is null. See the explanation below.

Null fields flags

Each bit of the flags field encodes the presence or absence of the content of one nullable field in the row. The flag is set to 1 if the field is missing/null, or 0 if the field is present/non-null (0 is used as well for spare bits). The flag for the first nullable field, in the order of the fields of the field description section (typically the geometry), is the least significant bit of the first byte of the flags field.

There are no bits reserved for non-nullable fields.

If all fields are non-nullable, the flag field is absent.

Note: there's no explicit data for OBJECTID and no reserved flag bit for it.

For each non-null field, the field content is appended in the order of the fields of the field description section.
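
The flag rules above can be illustrated with a short decoder. The function names and the `nullable` vector are hypothetical; the vector is assumed to list the fields in field description order with OBJECTID left out, since it has neither data nor a flag bit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of flag bytes at the start of a row: ceil(number_nullable_fields / 8).
size_t NullFlagsSize(const std::vector<bool>& nullable) {
    size_t n = 0;
    for (bool b : nullable) n += b ? 1 : 0;
    return (n + 7) / 8;
}

// Returns, for each field, true if the field is null in this row.
// flags points to the flag bytes of the row, flagsLen is their count.
std::vector<bool> DecodeNullFlags(const std::vector<bool>& nullable,
                                  const unsigned char* flags, size_t flagsLen) {
    std::vector<bool> isNull(nullable.size(), false);
    size_t bit = 0;  // position among nullable fields only
    for (size_t i = 0; i < nullable.size(); ++i) {
        if (!nullable[i]) continue;  // no bit is reserved for non-nullable fields
        assert(bit / 8 < flagsLen);
        isNull[i] = ((flags[bit / 8] >> (bit % 8)) & 1) != 0;
        ++bit;
    }
    return isNull;
}
```

With fields {nullable, non-nullable, nullable, nullable} and a flag byte of 0x05, the first and last fields are null and only the second and third have content in the row.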

Field content

Geometry field (type = 7)

This field is generally called "SHAPE".

Geometry blobs use 2 new encoding schemes :

  • varuint (64 bit): a sequence of bytes [b0, b1, ... bN]. All bytes except last one have their msb (most significant bit) set to 1. The presence of a msb = 0 marks the end of the sequence. The value of the varuint is (b0 & 0x7F) | ((b1 & 0x7F) << 7) | ((b2 & 0x7F) << 14) | ... | ((bN & 0x7F) << (7 * N)). Note that a valid sequence might be just 1 byte.
  • varint (64 bit): same concept as varuint, but the 2nd most significant bit of b0 (i.e. the one obtained by masking with 0x40) indicates the sign of the result and must be ignored in the computation of the unsigned value: (b0 & 0x3F) | ((b1 & 0x7F) << 6) | ((b2 & 0x7F) << 13) | ... | ((bN & 0x7F) << (7 * N - 1)). If the sign bit is set to 1, the value must be negated.
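
The two encodings can be decoded as sketched below. The function names are hypothetical, and bounds checking of the input buffer is left out for brevity, so a real reader would need to add it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads a varuint at buf[pos] and advances pos past it.
uint64_t ReadVarUInt(const unsigned char* buf, size_t& pos) {
    uint64_t value = 0;
    int shift = 0;
    while (true) {
        assert(shift < 64);  // more than 10 bytes cannot fit in 64 bits
        const unsigned char b = buf[pos++];
        value |= static_cast<uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) return value;  // msb = 0 marks the last byte
        shift += 7;
    }
}

// Reads a varint at buf[pos] and advances pos past it.
int64_t ReadVarInt(const unsigned char* buf, size_t& pos) {
    unsigned char b = buf[pos++];
    const bool negative = (b & 0x40) != 0;  // sign bit of the first byte
    uint64_t value = b & 0x3F;              // the first byte carries only 6 value bits
    int shift = 6;
    while ((b & 0x80) != 0) {
        assert(shift < 64);
        b = buf[pos++];
        value |= static_cast<uint64_t>(b & 0x7F) << shift;
        shift += 7;
    }
    return negative ? -static_cast<int64_t>(value) : static_cast<int64_t>(value);
}
```

For example, the bytes 0xAC 0x02 decode to the varuint 300, and the single byte 0x45 decodes to the varint -5.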

Common preamble to all geometry types

  • varuint: length of the geometry blob in bytes (this field excluded)
  • varuint: geometry_type. 1 = 2D point, 3 = 2D (multi)linestring, 5 = 2D (multi)polygon. Other values possible. See SHPT_ enumeration of ogrpgeogeometry.h. This is generally a single byte, but for SHPT_GENERALxxxxx geometries this can be multi-byte due to flags added to the base type

The bytes of the geometry blob following this preamble depend of course on the geometry type.

  • For point geometries (geometry type = 1, 9, 21, 11)

    • varuint: x = (varuint - 1) / xyscale + xorigin
    • varuint: y = (varuint - 1) / xyscale + yorigin
    • varuint ( present only if Z component ): z = (varuint - 1) / zscale + zorigin
    • varuint ( present only if M component ): m = (varuint - 1) / mscale + morigin

    Note the (varuint - 1), instead of varint in following geometry types. The reason for that exception is unclear.

  • For multipoint geometries (geometry type = 8, 20, 28, 18)

    • varuint: number of points
    • varuint: xmin = varuint / xyscale + xorigin
    • varuint: ymin = varuint / xyscale + yorigin
    • varuint: xmax = varuint / xyscale + xmin
    • varuint: ymax = varuint / xyscale + ymin

    followed by points coordinates:

    For each point of all parts (dx = dy = 0 initially) :

    • varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
    • varint: dy = dy + varint; y[i] = dy / xyscale + yorigin

    If there is a Z component, an array of Z values follows :

    For each point of all parts (dz = 0 initially) :

    • varint: dz = dz + varint; z[i] = dz / zscale + zorigin
  • For (multi)linestring (geometry type = 3, 10, 23, 13) or (multi)polygon (geometry type = 5, 19, 25, 15)

    • varuint: total number of points of all following parts
    • varuint: number of parts, i.e. number of rings for a (multi)polygon (inner and outer rings being at the same level), number of linestrings for a multilinestring, or 1 for a linestring
    • varuint: xmin = varuint / xyscale + xorigin
    • varuint: ymin = varuint / xyscale + yorigin
    • varuint: xmax = varuint / xyscale + xmin
    • varuint: ymax = varuint / xyscale + ymin
    • varuint: number of points of first part (omitted if there is only one part)
    • ...: ...
    • varuint: number of points of the (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)

    followed by, for each part, points coordinates:

    For each point of all parts (dx = dy = 0 initially) :

    • varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
    • varint: dy = dy + varint; y[i] = dy / xyscale + yorigin

    If there is a Z component, an array of Z values follows :

    For each point of all parts (dz = 0 initially) :

    • varint: dz = dz + varint; z[i] = dz / zscale + zorigin

    For polygons, a clockwise ring is an outer ring and a counterclockwise ring is an inner ring. Although this is not documented anywhere, ESRI programs assume that inner rings always follow the outer ring that contains them. So

    [clockwise,counterclockwise,clockwise,clockwise,counterclockwise,counterclockwise] 
    

    can be represented in GeoJSON as

    [[clockwise,counterclockwise],[clockwise],[clockwise,counterclockwise,counterclockwise]] 
    

    TODO: M values. Likely like Z component. But in FileGDB_API/samples/data/Shapes.gdb/a00000028.gdbtable, which is a polylinezm, the m values all are NaN, which is represented as 0x42 0x00 0x00 0x00 0x00 at the end of the geometry blob

  • For GeneralPolyline ( (geometry type & 0xff) = 50 )

    • varuint: total number of points of all following parts
    • varuint: number of parts, number of linestrings of a multilinestring, or 1 for a linestring
    • varuint: number of curve descriptions (present if (geom_type & 0x20000000) != 0 )
    • varuint: xmin = varuint / xyscale + xorigin
    • varuint: ymin = varuint / xyscale + yorigin
    • varuint: xmax = varuint / xyscale + xmin
    • varuint: ymax = varuint / xyscale + ymin
    • varuint: number of points of first part (omitted if there is only one part)
    • ...: ...
    • varuint: number of points of the (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)

    followed by, for each part, points coordinates:

    For each point of all parts (dx = dy = 0 initially) :

    • varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
    • varint: dy = dy + varint; y[i] = dy / xyscale + yorigin

    If there is a Z component ( (geom_type & 0x80000000) != 0 ) , an array of Z values follows :

    For each point of all parts (dz = 0 initially) :

    • varint: dz = dz + varint; z[i] = dz / zscale + zorigin

    If there is a M component ( (geom_type & 0x40000000) != 0 ) , an array of M values follows (unless the next byte is 0x42, in which case the M array is skipped) :

    For each point of all parts (dm = 0 initially) :

    • varint: dm = dm + varint; m[i] = dm / mscale + morigin

    If there are curves ( (geom_type & 0x20000000) != 0 ), an array of segment modifiers follows. There are as many segment modifiers as indicated by the "number of curve descriptions" field above. The serialization of these curve descriptions is directly based on the esriSegmentModifier, WKSPoint, SegmentArc, SegmentBezierCurve and SegmentEllipticArc C structures described in extended_shape_buffer_format.pdf, with the following equivalences:

    • C long --> int32
    • C enum --> int32
    • C double --> float64
  • For GeneralMultiPatch ( (geometry type & 0xff) = 54 )

    • varuint: total number of points of all following parts
    • varuint: unknown role
    • varuint: number of parts, i.e. number of rings for a (multi)polygon (inner and outer rings being at the same level), number of linestrings for a multilinestring, or 1 for a linestring
    • varuint: xmin = varuint / xyscale + xorigin
    • varuint: ymin = varuint / xyscale + yorigin
    • varuint: xmax = varuint / xyscale + xmin
    • varuint: ymax = varuint / xyscale + ymin
    • varuint: number of points of first part (omitted if there is only one part)
    • ...: ...
    • varuint: number of points of the (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)

    followed by, for each part, part type:

    • varuint: part type. Only keep the 4 least significant bits (higher bits are for priority and material index, see extended-shapefile-format.pdf). 0 = triangle strip, 1 = triangle fan, 2 = outer ring, 3 = inner ring, 4 = first ring, 5 = ring, 6 = triangles

    followed by, for each part, points coordinates:

    For each point of all parts (dx = dy = 0 initially) :

    • varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
    • varint: dy = dy + varint; y[i] = dy / xyscale + yorigin

    If there is a Z component ( (geom_type & 0x80000000) != 0 ) , an array of Z values follows :

    For each point of all parts (dz = 0 initially) :

    • varint: dz = dz + varint; z[i] = dz / zscale + zorigin
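
The delta scheme shared by the multipoint, (multi)linestring, (multi)polygon and General* geometries above can be sketched as follows. The names XY, DecodeXY and PartSizes are hypothetical, and the varint values are assumed to have been decoded already, in file order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct XY { double x, y; };

// deltas holds the decoded varint values in file order: dx0, dy0, dx1, dy1, ...
// The running sums dx and dy are not reset between parts.
std::vector<XY> DecodeXY(const std::vector<int64_t>& deltas,
                         double xorigin, double yorigin, double xyscale) {
    std::vector<XY> pts;
    int64_t dx = 0, dy = 0;
    for (size_t i = 0; i + 1 < deltas.size(); i += 2) {
        dx += deltas[i];
        dy += deltas[i + 1];
        pts.push_back({dx / xyscale + xorigin, dy / xyscale + yorigin});
    }
    return pts;
}

// Only the point counts of the first (number of parts - 1) parts are stored;
// the count of the last part is the remainder of the total.
std::vector<uint64_t> PartSizes(uint64_t totalPoints,
                                const std::vector<uint64_t>& storedCounts) {
    uint64_t sum = 0;
    for (uint64_t c : storedCounts) sum += c;
    assert(sum <= totalPoints);  // otherwise the blob is corrupt
    std::vector<uint64_t> sizes(storedCounts);
    sizes.push_back(totalPoints - sum);
    return sizes;
}
```

Z and M arrays would be decoded the same way with a single running sum each and zscale/zorigin or mscale/morigin. Point geometries are the exception, since they use (varuint - 1) instead of accumulated varints.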

Binary (type = 8)

Number of bytes of the binary content as a varuint, followed by the binary content

Raster (type = 9)

int32 value ( constant to 1 ? )

String (type=4) or XML (type=12)

Number of bytes of the string as a varuint, followed by string content

UUID (type=10 or 11)

16 bytes.

The string representation is the following (printf like expression) :

"{%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-%02X%02X%02X%02X%02X%02X}", b[3], b[2], b[1], b[0], b[5], b[4], b[7], b[6], b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]

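The printf-like expression above translates directly into code. The function name FormatFgdbUuid is hypothetical; the byte order is taken as-is from the expression.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats the 16 raw bytes of a UUID field as its 38-character string form.
std::string FormatFgdbUuid(const unsigned char b[16]) {
    char out[39];  // 38 characters plus the terminating NUL
    const int n = std::snprintf(out, sizeof out,
        "{%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-%02X%02X%02X%02X%02X%02X}",
        b[3], b[2], b[1], b[0], b[5], b[4], b[7], b[6],
        b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
    assert(n == 38);
    (void)n;  // avoids an unused-variable warning when assertions are disabled
    return out;
}
```

The first three groups are byte-swapped while the last 8 bytes are printed in file order.
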
Other types

An int16 value for an int16 field, an int32 value for an int32 field, etc.

Note : datetime values are the number of days since 30th dec 1899 00:00:00, encoded as float64
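
As an illustration, a datetime value can be converted to seconds since 1970-01-01 by subtracting the 25569 days that separate the two epochs. The function name is hypothetical, the value is treated as a plain linear day count (including before the epoch), and no time zone is applied since the format does not seem to record one.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Number of days between 1899-12-30 and 1970-01-01.
constexpr double kDaysToUnixEpoch = 25569.0;

// Converts a FileGDB datetime (days since 1899-12-30 00:00:00) to seconds
// since 1970-01-01 00:00:00, rounded to the nearest second.
int64_t FgdbDateToUnixSeconds(double days) {
    assert(std::isfinite(days));
    return static_cast<int64_t>(std::llround((days - kDaysToUnixEpoch) * 86400.0));
}
```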

Specification of .gdbtablx file

.gdbtablx files contain the offset of the rows of the associated .gdbtable file.

Header (16 bytes)

  • 4 bytes: 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ?
  • int32: n1024BlocksPresent = number of blocks of offsets for 1024 features that are effectively present in that file (ie sparse blocks are not counted in that number).
  • int32: number_of_rows: number of rows, including deleted rows
  • int32: size_offset = number of bytes to encode each feature offset. Must be 4 (.gdbtable up to 4GB), 5 (.gdbtable up to 1TB) or 6 (.gdbtable up to 256TB)

Offset section

The section starts immediately after the header (at offset 16) and is made of size_offset x number_of_rows bytes. For each row,

  • int32, int40 or int48: (depending on size_offset value) offset of the beginning of the row in the .gdbtable file, or 0 if the row is deleted. int40 is made of an int32 holding the 32 least significant bits, followed by a 5th byte holding the 8 most significant bits. Similarly for int48.
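
Since the extra bytes of int40 and int48 simply extend the little-endian int32, one loop can read all three widths. The function names below are hypothetical, and TablxEntryPosition only holds when there is no block bitmap (no sparse blocks).

```cpp
#include <cassert>
#include <cstdint>

// Reads a row offset stored on size_offset bytes (4, 5 or 6), least significant byte first.
// A result of 0 means the row is deleted.
uint64_t ReadRowOffset(const unsigned char* p, int size_offset) {
    assert(size_offset >= 4 && size_offset <= 6);
    uint64_t v = 0;
    for (int i = 0; i < size_offset; ++i)
        v |= static_cast<uint64_t>(p[i]) << (8 * i);
    return v;
}

// Position in the .gdbtablx file of the entry for the 0-based row iRow,
// assuming every block of 1024 offsets is present.
uint64_t TablxEntryPosition(uint64_t iRow, int size_offset) {
    return 16 + iRow * static_cast<uint64_t>(size_offset);
}
```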

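If there is a bit array (bitmap) representing the presence/absence of blocks of offsets for 1024 features, then the corrected row iCorrectedRow in the index for the FID iRow+1 is given by:

        #include <cstdint>

        // Returns the position of FID iRow+1 among the offsets stored in the file,
        // or -1 if the block of 1024 rows that would hold it is absent.
        int GetCorrectedRow(const uint8_t* pabyTablXBlockMap, int iRow)
        {
            const int iBlock = iRow / 1024;
            // A bit set to 0 means the block is not stored (sparse block)
            if( (pabyTablXBlockMap[iBlock / 8] & (1 << (iBlock % 8))) == 0 )
                return -1;
            // Count the blocks stored before this one
            uint32_t nCountBlocksBefore = 0;
            for( int i = 0; i < iBlock; i++ )
                nCountBlocksBefore += ( pabyTablXBlockMap[i / 8] & (1 << (i % 8)) ) != 0;
            return static_cast<int>(nCountBlocksBefore * 1024 + (iRow % 1024));
        }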

Trailing section (16 bytes + variable number )

Located at offset 16 + size_offset * n1024BlocksPresent * 1024

  • int32: nBitmapInt32Words = number of int32 words for the bitmap (rounded to the next multiple of 32)
  • int32: n1024BlocksTotal = (number_of_rows + 1023) / 1024. In the case where there's a bitmap, this is also nBitsForBlockMap = number of bits in the block map.
  • int32: n1024BlocksPresentBis (must be == n1024BlocksPresent of the header)
  • int32: nUsefulBitmapIn32Words = number of int32 words in the bitmap where there's at least a non-zero bit. Said otherwise, all following words until the end of the bitmap are 0. Doesn't seem to be used by proprietary implementations.

if nBitmapInt32Words == 0 (no bitmap), then n1024BlocksTotal == n1024BlocksPresentBis ( == n1024BlocksPresent) and nUsefulBitmapIn32Words = 0

Otherwise, following those 16 trailer bytes, there is a bit array of at least (n1024BlocksTotal + 7) / 8 bytes (in practice its size is rounded to the next multiple of 32 int32 words). Each bit in the array represents the presence of a block of offsets for 1024 features (bit = 1), or its absence (bit = 0). The total number of bits set to 1 must be equal to n1024BlocksPresent.

Specification of .gdbindexes files

.gdbindexes files list the indexes that may exist on certain fields of a .gdbtable. This only applies to FileGDB v10 .gdbindexes: v9 .gdbindexes have a different (and more complicated) structure.

Header (4 bytes)

  • int32: number of indexes described in the file

Index description

The section starts immediately after the header (at offset 4) and is repeated as many times as there are indexes.

  • uint32: number of UTF-16 characters for the following field
  • utf16: suffix of the index file. If its value is foo, the filename of the index is aXXXXXXXX.foo.atx (unless the index is FDO_OBJECTID, in which case the index is the .gdbtablx file, or FDO_SHAPE, in which case the index is the .spx file)
  • int16: unknown role
  • int16: unknown role
  • int32: unknown role
  • int16: unknown role
  • int32: unknown role
  • uint32: number of UTF-16 characters for the following field
  • utf16: field name (or sometimes expression like "LOWER(Name)" as found in a00000001.gdbindexes)
  • int16: unknown role

Specification of .atx files

.atx files contain indexes for a field of a .gdbtable. The general idea is that the values that the field takes in the .gdbtable are listed in ascending order with the associated FID. .atx files are organized in pages of 4096 bytes and have a hierarchical organization whose depth depends on the size of the values of the field and the number of features of the table. The first page is 1, so page N is located at offset (N-1)*4096.

The reading of .atx files must start with its trailing section.

Trailing section (22 bytes)

  • byte: size in bytes of the indexed values (called size_value afterwards). This is closely related to the type of the field being indexed. For int16 it is 2, for int32 4, for float32 4 and for float64 8. For string it is a variable number that is a multiple of 2 (string values are encoded as UTF-16 characters, so 2 bytes per character) and at most 160 bytes (80 characters). For datetime: 8. For UUID: 38 (the string representation is 38 bytes. See above). Indexing of binary or XML fields has not been studied (if it is possible at all!)
  • byte: unknown role
  • int32: unknown role. Apparently always/often 1.
  • uint32: index depth >= 1. If it is 1 the first page directly references features. Otherwise the first page reference pages that reference pages referencing features (depth = 2), or pages that reference pages that reference pages that reference features (depth = 3), and so on...
  • uint32: number of features referenced in the file, in other words the number of features that have a non-null value for the field being indexed. Must not be greater than the number of valid features of the .gdbtable. It has been observed (with FileGDB SDK 1.3) that this value is not reliable for an index that has been built while features were being inserted, if the values inserted were not in increasing order.
  • int32: unknown role. Apparently always/often 0.
  • int32: unknown role. Apparently always/often 1.

The maximum number of features (or sub-pages references) in a page is : nMaxPerPages = (4096 - 12) / (4 + size_value)

The offset at which field values are found in a page is : nOffsetFirstValInPage = 12 + nMaxPerPages * 4
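
A small worked example of the two formulas, together with the page offset rule given earlier. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct AtxPageLayout {
    uint32_t max_per_page;        // nMaxPerPages
    uint32_t offset_first_value;  // nOffsetFirstValInPage
};

// Derives the page layout from the size_value byte of the trailing section.
AtxPageLayout ComputeAtxLayout(uint32_t size_value) {
    const uint32_t n = (4096 - 12) / (4 + size_value);  // integer division
    return {n, 12 + n * 4};
}

// Pages are numbered from 1, so page N starts at (N - 1) * 4096.
uint64_t AtxPageOffset(uint32_t page) {
    assert(page >= 1);
    return static_cast<uint64_t>(page - 1) * 4096;
}
```

For an int32 field (size_value = 4) this gives 510 entries per page, with values starting at offset 2052. For a float64 or datetime field (size_value = 8) it gives 340 entries and offset 1372.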

Page referencing features (4096 bytes)

For a given field value, if found in several features, the features are sorted by ascending ID. The structure of such a page is header section (12 bytes), followed by FID numbers (maximum of 4 * nMaxPerPages bytes), a few potential padding bytes, and finally field values (maximum of size_value * nMaxPerPages bytes)

Header section structure (offset 0 in the page) :

  • uint32: ID of the next page at the same depth, or 0 for last page. Not strictly needed to use the index
  • uint32: number of features referenced in the page (nFeatures). Not greater than nMaxPerPages
  • uint32: unknown role. Apparently always/often 0.

FID section structure (offset 12 in the page) :

  • uint32: FID of the first feature referenced in the page
  • ...
  • uint32: FID of the (nFeatures)th feature referenced in the page.

Padding section of zeroes (size: nOffsetFirstValInPage - 12 - 4 * nFeatures)

Values section structure (offset nOffsetFirstValInPage in the page):

  • type depending on the field (int16/int32/float32/float64/datetime as float64/string as UTF16 characters/UUID): value of field for the first feature referenced in the page
  • ...
  • type: value of field for the (nFeatures)th feature referenced in the page.

Page referencing other pages (4096 bytes)

The structure of such a page is a header section (8 bytes), followed by sub-page numbers (maximum of 4 * (1 + nMaxPerPages) bytes), a few potential padding bytes, and finally field values (maximum of size_value * nMaxPerPages bytes)

Header section structure (offset 0 in the page) :

  • uint32: ID of the next page at the same depth, or 0 for last page. Not strictly needed to use the index
  • uint32: number of sub-pages referenced in the page (nSubPages). Not greater than nMaxPerPages

Sub-pages number section (offset 8 in the page):

  • uint32: ID of the first sub-page referenced in the page
  • ...
  • uint32: ID of the (nSubPages)th sub-page referenced in the page.
  • uint32: ID of the (nSubPages+1)th sub-page referenced in the page (note: there is no matching value for that last sub-page number in the values section)

Padding section of zeroes (size: nOffsetFirstValInPage - 8 - 4 * (nSubPages+1))

Values section structure (offset nOffsetFirstValInPage in the page):

  • type depending on the field (int16/int32/float32/float64/datetime as float64/string as UTF16 characters/UUID): maximum value of field taken in the features referenced by the sub-page (and its potential sub-sub-pages) for the first sub-page referenced in the page
  • ...
  • type: maximum value of field taken in the features referenced by the sub-page (and its potential sub-sub-pages) for the (nSubPages)th sub-page referenced in the page

Partial specification of .spx files

Note: this section collects known facts about .spx files, but the reverse engineering is only partial, so nothing really useful can be made of it for now.

.spx files contain the spatial index for the geometry field of a .gdbtable. With high confidence, one can tell they have exactly the same structure as .atx files: same trailing section of 22 bytes, same principle of pages of 4096 bytes, with either pages referencing other pages (depth > 0) or pages referencing features (depth = 0). The payload being indexed is 8 bytes (size_value = 8). The key is to understand how those 8 bytes are built from the feature geometry (and presumably from characteristics of the spatial reference system object).

The following facts have been collected (warning: some might actually be wrong, or dependent on other conditions!):

  • for a table with points at the same location, the 8 bytes are set to 0
  • for 2 tables with identical points (for example a grid of 361x181 points in [-180,180]*[-90,90]), it seems that the XORIGIN, YORIGIN and XYSCALE parameters have no impact on the values of the 8 bytes
  • features seem to be sorted by ascending x coordinates, and for equal x, by descending y
  • the structure of the 8 bytes seems to be 4 bytes for y followed by 4 bytes for x on some examples, but not exactly on some other tests. So maybe there are varying integers involved.
  • when generating a layer of 361x181 points aligned on the grid [-180,180]*[-90,90], and a similar layer where the points are replaced by degenerate rectangles (or lines) whose 4 coordinates all equal a single point coordinate, the 8 bytes differ for matching features between the 2 layers.
  • a feature can be referenced more than once (presumably depending on its bounding box, seen on a polygon layer)

Specification of .freelist files

.freelist files contain the offsets of the holes (deleted rows, or old versions of updated rows) in the associated .gdbtable file. The file is rewritten after each edit session, with the most recent edit at the start of the file, and the order being maintained during repeated edit operations. The file is optional, and is deleted when the FileGDB is compacted.

The file has 344 bytes of buffer at the end, and looks to be created in 4K blocks ( so, smallest is 4096 + 344 = 4440 bytes )

Header (8 bytes)

  • int32: number of rows
  • 4 bytes: 0xFFFFFFFF. No apparent use

Offset section

The section starts immediately after the header and is made of (4 + size_offset) x number_rows bytes. For each row,

  • int32: number of bytes
  • int32, int40 or int48: (depending on size_offset value) offset of the beginning of the row in the .gdbtable file. int40 is made of an int32 holding the 32 least significant bits, followed by a 5th byte holding the 8 most significant bits. Similarly for int48.

GDB files

Files are named in the format a[number in lowercase hex].[extension] with files with the same base but different extensions being related. Files are numbered incrementally, a00000001 is first a00000002 is second, but numbers may be skipped.

FileGDB v10

For FileGDB v10, the first 8 (a00000001 to a00000008) files seem to be reserved for database information and subsequent files are feature classes (a00000009, a0000000a, ...).

  • a00000001 is called GDB_SystemCatalog and contains a list of tables (including itself, other reserved tables and user tables). Tables may be mentioned but not actually found on disk: this is often (only?) the case for table a00000008. The FID of a record in this table determines the name of the file to consider. For example the record of FID 37 (the convention taken here is that FID numbering starts from 1) will be in file a00000025. There might be deleted rows in this catalog table, hence gaps in FID numbering.

    The table contains a Name field and a FileFormat field. The value of FileFormat seems to be 0 in most cases, and sometimes 2 for a few reserved system tables.

  • a00000002 contains config parameters for the database and is called GDB_DBTune

  • a00000003 is called GDB_SpatialRefs and contains the SRS as WKT in field SRTEXT (in ESRI WKT dialect) and the following fields: FalseX, FalseY, XYUnits, FalseZ, ZUnits, FalseM, MUnits, XYTolerance, ZTolerance, MTolerance. All rows are unique, so if there are 3 feature classes with the same spatial reference system, but one has a different ZTolerance, there will be two rows.

  • a00000004 is called GDB_Items and contains metadata about the items (layers), mostly in XML. The fields are :

    • UUID (UUID) : UUID
    • Type (UUID) : item type
    • Name (string) : item/layer name. Matches the Name field of the GDB_SystemCatalog
    • PhysicalName (string) : item/layer name in upper case characters.
    • Path (string) : "\mylayername" for top-level layers or "\myfeaturedataset\mylayername" for layers attached to a feature dataset "myfeaturedataset"
    • DatasetSubType1 (int32) : 1 for user tables (TBC)
    • DatasetSubType2 (int32) : layer geometry type. 1 for point layer, 2 for multipoint layers, 3 for linestring layers, 4 for polygon layers
    • DatasetInfo1 (string) : "SHAPE" for user tables (TBC)
    • DatasetInfo2 (string) : NULL for user tables (TBC)
    • URL (string) : empty string (TBC)
    • Definition (XML) : DEFeatureClassInfo XML element. Contains an XML version of the information that can be obtained by parsing the header of a table : fields, SRS, ...
    • Documentation (XML) : metadata XML element
    • ItemInfo (XML) : NULL for user tables (TBC)
    • Properties (int32) : 1 for user tables (TBC)
    • Defaults (binary) : absent for user tables (TBC)
    • Shape (geometry) : 5 point polygon listing the corner of the bounding box of the layer reprojected into EPSG:4326 (even if the layer SRS is not EPSG:4326). Or missing if the layer SRS is undefined.

    A few particular records :

    • The first record is reserved for a kind of root item ( Name = "", Path = "\" ).
    • The second record is reserved for a Name = "Workspace" item, Path = "", Definition containing a DEWorkspace XML element
    • When there are feature datasets, they also appear as records: e.g. Name = "featuredataset", PhysicalName = "FEATUREDATASET", Path = "\FEATUREDATASET", Definition containing a DEFeatureDataset XML element
  • a00000005, a00000006 and a00000007 are each one of GDB_ItemRelationships, GDB_ItemRelationshipTypes or GDB_ItemTypes (the order may vary between datasets)

  • a00000008 is called GDB_ReplicaLog. It is often listed in the GDB_SystemCatalog, but actually missing on disk.

Overall, for v10 files, the most useful reserved table seems to be GDB_SystemCatalog, which establishes the link between a layer name and its associated .gdbtable file. Using a00000004 might be needed when GDB_SystemCatalog lists user tables that are not vector tables (rasters, relationships, ...). It may also be used to get an overview of all tables through their XML definition, without opening all the corresponding .gdbtable files.
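
The link between a GDB_SystemCatalog record and its file can be sketched as below, following the FID 37 example given for a00000001. The function name is hypothetical; the 8-digit lowercase hexadecimal form comes from the naming convention described under "GDB files".

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Builds the .gdbtable file name for a GDB_SystemCatalog record, given its FID (starting from 1).
std::string TableFileName(uint32_t fid) {
    char name[32];
    const int n = std::snprintf(name, sizeof name, "a%08x.gdbtable",
                                static_cast<unsigned>(fid));
    assert(n > 0);
    (void)n;  // avoids an unused-variable warning when assertions are disabled
    return name;
}
```

As noted above, a name obtained this way is not guaranteed to exist on disk (a00000008 is often missing).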

FileGDB v9

For FileGDB v9, the first 36 (a00000001 to a00000024) files seem to be reserved for database information and subsequent files are feature classes (a00000025, a00000026, ...). Very often, the files between a00000009 and a00000024 are missing.

  • a00000001 : GDB_SystemCatalog. Similar to v10. Contains as well a DatasetGUID field. Records 1 to 36 are reserved for GDB_ tables

  • a00000002 : GDB_DBTune

  • a00000003 : GDB_SpatialRefs. Identical to v10

  • a00000004 : GDB_Release. Contains a single record : for v9.2 databases: Major = 2, Minor = 2, Bugfix = 0. For v9.3 databases: Major = 2, Minor = 3, Bugfix = 0

  • a00000005 : GDB_FeatureDataset

  • a00000006 : GDB_ObjectClasses. Contains a Name field, and other technical fields.

  • a00000007 : GDB_FeatureClasses. Simplified version of GDB_Items of v10. Contains the layer geometry type in GeometryType and shape field name in ShapeField. The ObjectClassID field is related to the FID of GDB_ObjectClasses

  • a00000008 : GDB_FieldInfo. Contains information about some (but not all) fields of layers.

Overall, for v9 files, the most useful reserved table seems to be GDB_SystemCatalog, which establishes the link between a layer name and its associated .gdbtable file. Using a00000007 in conjunction with a00000006 might be needed when GDB_SystemCatalog lists user tables that are not vector tables (rasters, relationships, ...).

License

This specification document is (C) 2013 Even Rouault and licensed under the CC-BY-SA 3.0 terms.

Formatting to Markdown done by Calvin Metcalf.

Note: the scope of the copyrighted material does, of course, not extend onto any source or binary code derived from the specification, that may be licensed under the terms that their author may see fit.