delz (decompress LZ)

Polish vehicle registration certificate data decompression tool

Overview

Polish vehicle registration certificates issued since Poland's accession to the European Union contain a 2D code for automatic processing of the data written in the document. Reading it takes a few steps. The first and most obvious one is the code itself: it is an Aztec code, which is not very popular these days. The reader outputs a base64-encoded chunk of data plus one extra character at the end (its meaning is still unknown). Decoding the base64 gives a binary blob of a few hundred bytes, compressed with a custom, LZ77-derived algorithm, and this program decompresses that blob. The result is a list of unlabelled strings delimited by the pipe symbol (|, 0x7c). Note that this data is encoded in UTF-16LE, which is typical for Windows systems, where it is known simply as Unicode. The output of this program is covered in detail at the end of this document. The whole process is shown in the diagram below.

+-------+        +--------+                 +------------+                 +-----------+
| Aztec | scan() |        | base64_decode() | LZ         | decompress_lz() | pipe      |
|       |------->| base64 |---------------->| compressed |---------------->| delimited |
| code  |        |        |                 | data       |                 | text      |
+-------+        +--------+                 +------------+                 +-----------+
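
The steps around the decompressor can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the overview: the function names extract_compressed_blob and split_fields are made up for this example and are not part of this project, and the default encoding simply follows the UTF-16LE note above.

```python
import base64


def extract_compressed_blob(scanned: str) -> bytes:
    """Turn the Aztec reader output into the compressed binary blob.

    Assumption taken from the overview: the text is base64 followed by
    exactly one extra character whose meaning is unknown.
    """
    text = scanned.strip()
    payload = text[:-1]  # drop the trailing extra character
    return base64.b64decode(payload)


def split_fields(decompressed: bytes, encoding: str = "utf-16-le") -> list:
    """Decode the decompressed data and split it on the pipe symbol."""
    return decompressed.decode(encoding).split("|")
```

The decompression step itself sits between these two functions and is what delz provides.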

Installation

This project provides you with:

  • a standalone program for decompressing the blob you get as the output of a base64 decoding function. It can be compiled by simply typing make in your terminal (provided that you have a compiler).

  • a library in ar format, ready to be included in your application. It can be compiled by typing make liblz.a, or it gets built as a dependency of the above program.

Both parts are licensed under the LGPL, so you can use them even in commercial products. For the exact conditions of redistribution, see the LICENSE file that should come with this copy of the program source.

Usage

To decompress the data, feed the base64-decoded byte stream to the program's standard input. The decompressed data is printed to the program's standard output. Assuming that data.bin contains a valid stream and is stored in the current working directory, you could run: ./delz < data.bin > data.utf8. After that, the decompressed data should be in the data.utf8 file. In case of failure the program exits with a non-zero status.
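
If you would rather call the program from a script than from the shell, the same stdin/stdout contract can be wrapped as below. This is a hedged sketch: run_delz is a hypothetical helper name, and the only things it relies on are the behaviours described above (input on standard input, output on standard output, non-zero exit status on failure).

```python
import subprocess


def run_delz(blob: bytes, command=("./delz",)) -> bytes:
    """Pipe a base64-decoded blob through delz and return its output.

    `command` defaults to a binary in the current directory, which is an
    assumption; pass another path if yours lives elsewhere.
    """
    result = subprocess.run(list(command), input=blob, stdout=subprocess.PIPE)
    if result.returncode != 0:
        # The program signals failure with a non-zero exit status.
        raise RuntimeError(f"delz failed with exit status {result.returncode}")
    return result.stdout
```

Typical use would be run_delz(open("data.bin", "rb").read()), which mirrors the shell redirection shown above.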

Portability

The program was tested on an amd64 Linux system, but it should work on any UNIX-like system whose console supports UTF-8. It might be possible to make it work on Windows as well, but since Windows traditionally uses UTF-16 as its standard console encoding, it may not be possible to print the output to the console correctly.

Furthermore, there should be no problem using it on a big-endian system, but this would need further testing.

Technical details

The compression algorithm turns out to be a custom implementation of the LZSS algorithm. If you are interested in how LZ77-based algorithms work, you could start with the Wikipedia pages for LZ77 and LZSS. tl;dr: both algorithms find repeating substrings and replace them with a reference to the last occurrence of the same string. LZ77 always stores the reference together with the next, uncompressed byte of the string, which in practice was not efficient. LZSS tried to solve this problem by prefixing each item of output with a single bit indicating whether it is raw data or a length-offset pair. That was better, but now we have to store 9-bit words, which is still not very efficient. According to this article, one of the implementers of LZSS solved this by storing the flags in 8-bit packs.
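
To make the idea of flags stored in 8-bit packs concrete, here is a minimal textbook-style LZSS decoder. It is NOT the format used in the registration certificates: the layout (one flag byte per group of up to eight items, least significant bit first, a set bit meaning a raw byte, a clear bit meaning a one-byte offset followed by a one-byte length) is an assumption chosen only to keep the illustration short.

```python
def lzss_decode(data: bytes) -> bytes:
    """Decode a toy LZSS stream with flags grouped into bytes."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        flags = data[pos]  # one flag byte describes the next 8 items
        pos += 1
        for bit in range(8):
            if pos >= len(data):
                break  # stream ended inside the last group
            if flags & (1 << bit):
                # Flag set: a single raw byte follows.
                out.append(data[pos])
                pos += 1
            else:
                # Flag clear: an (offset, length) pair follows.
                if pos + 1 >= len(data):
                    raise ValueError("truncated length-offset pair")
                offset, length = data[pos], data[pos + 1]
                pos += 2
                if offset == 0 or offset > len(out):
                    raise ValueError("offset points outside decoded data")
                # Copy byte by byte so overlapping references work.
                for _ in range(length):
                    out.append(out[-offset])
    return bytes(out)
```

For example, the six bytes 0x07 'a' 'b' 'c' 0x03 0x06 hold three raw bytes followed by one pair that copies six bytes from three positions back.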

The implementation used here goes one step further: besides following that approach, it tries to optimize the use of length-offset pairs by:

  • encoding offsets longer than 127 bytes just after the pair indicator mentioned above
  • using the same bits as in the previous point to indicate that the previous offset should be reused (this saves a byte when, twice in a row, it needs to copy bytes from e.g. the current offset minus 1)
  • encoding the length just after the big-offset bits, in such a way that lengths shorter than about 36 bytes need 8 bits or fewer, which saves a few bits for short copies (and it is uncommon to copy more than 36 bytes at once)

If you want to learn more about the details of this algorithm, you should read the code of the function (I know it may be difficult for someone who does not write low-level code, though).

Output

It seems that some time ago, after this code was implemented, the government (or PWPW, which is responsible for producing the documents) changed the output data format. The new version stores some fields that are not present in the document itself, and their meaning is unknown too. The new version is indicated by the XXC1 field at the beginning.

The labels in the "Place in the document" column and the example values are left in Polish, as they appear on the certificate; the descriptions are given in English.

| Position (old) | Position (new) | Place in the document | Example | Description |
|---|---|---|---|---|
| - | 0 | XXC1 | n/a | Distinguishes protocol versions |
| 0 | 1 | SERIA DR | BAF1026996 | Series and number of the certificate |
| - | 2 | ? | 1465198 | TERYT code of the registering office |
| 1 | 3 | ORGAN WYDAJĄCY | PREZYDENT M. ST. WARSZAWY | Line 1 |
| 2 | 4 | | DZIELNICA ŻOLIBORZ | Line 2 |
| 3 | 5 | | ul. NIEISTNIEJĄCA 1/2 | Line 3 |
| 4 | 6 | | 01-627 WARSZAWA | Line 4 |
| | 7 | A | UA 12345 | Vehicle registration number |
| | 8 | D.1 | PEUGEOT | Make |
| | 9 | D.2 | | Type approval: type |
| | 10 | D.2 | | Type approval: variant |
| | 11 | D.2 | | Type approval: version |
| | 12 | D.3 | 206 XAD | Model |
| | 13 | E | ZFA00000123456789 | VIN |
| | 14 | I | 2001-12-21 | Issue date of the registration certificate (YYYY-MM-DD) |
| | 15 | H | --- | Validity period of the certificate |
| | 16 | C.1.1 | KOWALSKI JAN | First line |
| | 17 | C.1.1 | JAN | First name |
| | 18 | C.1.1 | KOWALSKI | Surname |
| | 19 | ? | | Unknown |
| | 20 | C.1.2 | | PESEL |
| | 21 | C.1.3 | 01-627 | Postal code |
| | 22 | C.1.3 | WARSZAWA | Municipality (gmina) |
| | 23 | C.1.3 | WARSZAWA | Town |
| | 24 | C.1.3 | NIEISTNIEJĄCA | Street |
| | 25 | C.1.3 | 6A | House number |
| | 26 | C.1.3 | | Apartment number |
| | 27 | C.2.1 | KOWALSKI JAN | First line |
| | 28 | C.2.1 | JAN | First name |
| | 29 | C.2.1 | KOWALSKI | Surname |
| | 30 | ? | | Unknown |
| | 31 | C.2.2 | | PESEL |
| | 32 | C.2.3 | 01-627 | Postal code |
| | 33 | C.2.3 | WARSZAWA | Municipality (gmina) |
| | 34 | C.2.3 | WARSZAWA | Town |
| | 35 | C.2.3 | NIEISTNIEJĄCA | Street |
| | 36 | C.2.3 | 6A | House number |
| | 37 | C.2.3 | | Apartment number |
| | 38 | F.1 | 1600 | Maximum total mass [kg] |
| | 39 | F.2 | 1600 | Permissible total mass of the vehicle [kg] |
| | 40 | F.3 | 2600 | Permissible total mass of the vehicle combination [kg] |
| | 41 | G | 1040 | Unladen mass |
| | 42 | J | --- | Vehicle category |
| | 43 | K | --- | Number of the vehicle type-approval certificate |
| | 44 | L | 2 | Number of axles |
| | 45 | O.1 | 1000 | Maximum total mass of a braked trailer |
| | 46 | O.2 | 400 | Maximum total mass of an unbraked trailer |
| | 47 | Q | | Power-to-mass ratio (in kW/kg) |
| | 48 | P.1 | 1600,00 | Engine displacement [cm^3] |
| | 49 | P.2 | 80,00 | Engine power [kW] |
| | 50 | P.3 | D | Fuel type |
| | 51 | B | 2001-12-31 | Date of first registration of the vehicle (YYYY-MM-DD) |
| | 52 | S.1 | 5 | Number of seats |
| | 53 | S.2 | --- | Number of standing places |
| | 54 | RODZAJ POJAZDU | SAMOCHÓD OSOBOWY | Vehicle type |
| | 55 | PRZEZNACZENIE | --- | Intended use |
| | 56 | ROK PRODUKCJI | 2000 | Year of manufacture |
| | 57 | DOPUSZCZALNA ŁADOWNOŚĆ | --- | Permissible payload |
| | 58 | NAJWIĘKSZY DOP. NACISK OSI | 10,00 | Maximum permissible axle load [kN] |
| | 59 | NR KARTY POJAZDU | Karty nie wydano | Vehicle card number (the example means "card not issued") |
| | 60 | ? | | Identification code |
| | 61 | ? | 03 | Type: code |
| | 62 | ? | 06 | Subtype: code |
| | 63 | ? | 000 | Intended use: code |
| | 64 | ? | 0011NNNNNNNN | Unknown |
| | 65 | ? | 007004000 | Unknown |
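
Once the output has been split on the pipe symbol, the positions from the table above can be used to pick out individual fields. The sketch below covers only the new layout and only a handful of positions; the English key names and both function names are invented for this example and do not come from the project.

```python
NEW_FORMAT_MARKER = "XXC1"

# A few positions from the "new" column of the table above.
# The key names are hypothetical labels chosen for this example.
NEW_FORMAT_FIELDS = {
    1: "certificate_number",
    7: "registration_number",
    8: "make",
    12: "model",
    13: "vin",
    51: "first_registration_date",
}


def is_new_format(fields) -> bool:
    """The new layout is recognised by XXC1 in the first field."""
    return len(fields) > 0 and fields[0] == NEW_FORMAT_MARKER


def name_fields(fields) -> dict:
    """Map selected positions of a new-layout record to readable names."""
    if not is_new_format(fields):
        raise ValueError("only the new (XXC1) layout is covered by this sketch")
    return {
        name: fields[index]
        for index, name in NEW_FORMAT_FIELDS.items()
        if index < len(fields)
    }
```

Records in the old layout are rejected here rather than guessed at, because the table lists old positions only for the first few fields.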