Enable SAS Transport Format for SAS v8 and 9 #10

selik · 2017-04-22T04:10:14Z

The file format is slightly different.
https://support.sas.com/techsup/technote/ts140_2.pdf

The first header record consists of the following character string, in ASCII:

    HEADER RECORD*******LIBV8 HEADER RECORD!!!!!!!000000000000000000000000000000

The first real header record uses the following layout:

    aaaaaaaabbbbbbbbccccccccddddddddeeeeeeee                        ffffffffffffffff

where aaaaaaaa and bbbbbbbb are each 'SAS' and cccccccc is 'SASLIB' (all blank-padded to
8 bytes), dddddddd is the version of the SAS system that created the file, and eeeeeeee is
the operating system creating it. After 24 blanks, ffffffffffffffff is the datetime created,
formatted as ddMMMyy:hh:mm:ss.
Note that only a 2-digit year appears. If any program needs to read in this 2-digit year, be
prepared to deal with dates in the 1900s or the 2000s.

Another way to consider this record is as a C structure:

    struct REAL_HEADER {
        char sas_symbol[2][8];
        char saslib[8];
        char sasver[8];
        char sas_os[8];
        char blanks[24];
        char sas_create[16];
    };
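As a rough illustration of the layout above, here is a minimal Python sketch that splits the 80-byte record with the `struct` module. The helper name `parse_real_header` and the sample values ("9.4", "Linux") are hypothetical, chosen only for the example.

```python
import struct

# Field widths taken from the C structure above:
# sas_symbol[2][8], saslib[8], sasver[8], sas_os[8], blanks[24], sas_create[16].
REAL_HEADER = struct.Struct(">8s8s8s8s8s24s16s")


def parse_real_header(record: bytes) -> dict:
    """Split the first real header record into text fields (hypothetical helper)."""
    if len(record) != REAL_HEADER.size:
        raise ValueError(f"expected {REAL_HEADER.size} bytes, got {len(record)}")
    sas1, sas2, saslib, version, os_name, _blanks, created = (
        field.decode("ascii").strip() for field in REAL_HEADER.unpack(record)
    )
    if (sas1, sas2, saslib) != ("SAS", "SAS", "SASLIB"):
        raise ValueError("not a SAS transport library header")
    # The creation datetime is returned as raw text; the 2-digit year
    # still has to be interpreted by the caller.
    return {"version": version, "os": os_name, "created": created}


# Made-up sample record, blank-padded to the widths in the layout.
example = (
    b"SAS".ljust(8) + b"SAS".ljust(8) + b"SASLIB".ljust(8)
    + b"9.4".ljust(8) + b"Linux".ljust(8) + b" " * 24 + b"22APR17:04:10:14"
)
header = parse_real_header(example)
```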

Second real header record:

    ddMMMyy:hh:mm:ss

where the string is the datetime modified. In most cases, the datetime created and the
datetime modified are the same. Pad with ASCII blanks to 80 bytes.
Note that only a 2-digit year appears. If any program needs to read in this 2-digit year, be
prepared to deal with dates in the 1900s or the 2000s.
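One possible way to handle the 2-digit year is an explicit pivot, sketched below in Python. The function name `parse_sas_datetime` and the default pivot of 69 are assumptions for illustration (69 mirrors the default that Python's own `%y` directive uses); the spec itself does not prescribe a pivot.

```python
from datetime import datetime


def parse_sas_datetime(text: str, pivot: int = 69) -> datetime:
    """Parse 'ddMMMyy:hh:mm:ss', mapping yy >= pivot to 19yy and the rest to 20yy."""
    # strptime matches month abbreviations case-insensitively, so 'APR' works.
    # %b depends on the locale; this sketch assumes English month names.
    parsed = datetime.strptime(text.strip(), "%d%b%y:%H:%M:%S")
    two_digit = parsed.year % 100
    century = 1900 if two_digit >= pivot else 2000
    return parsed.replace(year=century + two_digit)


modified = parse_sas_datetime("22APR17:04:10:14".ljust(80))
```

Passing a different `pivot` lets a caller who knows the data is older (or newer) shift the window without touching the parsing logic.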

Member header records:
Both of these occur for every member in the transport file.

    HEADER RECORD*******MEMBV8 HEADER RECORD!!!!!!!000000000000000001600000000140
    HEADER RECORD*******DSCPTV8 HEADER RECORD!!!!!!!000000000000000000000000000000

Note the 0140 that appears in the member header record above. That value is the size of the variable descriptor (NAMESTR) record that is described later in this document.
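A small sketch of reading that size back out of the member header record, in Python. The helper name `namestr_size` is hypothetical, and the sketch strips trailing blanks first because the record is blank-padded to 80 bytes.

```python
def namestr_size(member_header: bytes) -> int:
    """Return the NAMESTR size stored in the last 4 digits of the member header."""
    digits = member_header.decode("ascii").rstrip()[-4:]
    if not digits.isdigit():
        raise ValueError(f"expected 4 digits, got {digits!r}")
    return int(digits)


# The MEMBV8 record as quoted above, padded with blanks to 80 bytes.
record = (
    b"HEADER RECORD*******MEMBV8 HEADER RECORD!!!!!!!"
    b"000000000000000001600000000140"
).ljust(80)
size = namestr_size(record)
```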

Member header data:

    aaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbccccccccddddddddeeeeeeeeffffffffffffffff

where aaaaaaaa is 'SAS ', bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb is the data set name,
cccccccc is SASDATA (if a SAS data set is being created), dddddddd is the version of
the SAS System under which the file was created, and eeeeeeee is the operating system
name. ffffffffffffffff is the datetime created, formatted as in previous headers. Consider
this C structure:

    struct REAL_HEADER {
        char sas_symbol[8];
        char sas_dsname[32];
        char sasdata[8];
        char sasver[8];
        char sas_osname[8];
        char sas_create[16];
    };

The second header record is

    ddMMMyy:hh:mm:ss aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbb

where the datetime modified appears using DATETIME16. format, followed by blanks
up to column 33, where the a's above correspond to a blank-padded data set label, and
bbbbbbbb is the blank-padded data set type. Note that data set labels can be up to 256
characters as of Version 8 of the SAS System, but only up to the first 40 characters are
stored in the second header record. Note also that only a 2-digit year appears in the
datetime modified value. If any program needs to read in this 2-digit year, be prepared to
deal with dates in the 1900s or the 2000s.

Consider the following C structure:

    struct SECOND_HEADER {
        char dtmod_day[2];
        char dtmod_month[3];
        char dtmod_year[2];
        char dtmod_colon1[1];
        char dtmod_hour[2];
        char dtmod_colon2[1];
        char dtmod_minute[2];
        char dtmod_colon3[1];
        char dtmod_second[2];
        char padding[16];
        char dslabel[40];
        char dstype[8];
    };
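For the writing direction, the record could be assembled roughly as follows. This is a sketch only: `build_second_header` is a hypothetical helper, it assumes ASCII text and an English locale for the month abbreviation, and it silently truncates the label to the 40 bytes this record can hold.

```python
from datetime import datetime


def build_second_header(modified: datetime, label: str, dstype: str = "") -> bytes:
    """Build the 80-byte second member header record (hypothetical helper)."""
    # strftime gives e.g. '22Apr17:04:10:14'; upper-case it to get 'APR'.
    stamp = modified.strftime("%d%b%y:%H:%M:%S").upper().encode("ascii")
    record = (
        stamp                                    # 16 bytes, datetime modified
        + b" " * 16                              # padding up to column 33
        + label.encode("ascii")[:40].ljust(40)   # only the first 40 bytes fit
        + dstype.encode("ascii")[:8].ljust(8)    # blank-padded data set type
    )
    if len(record) != 80:
        raise ValueError("record must be exactly 80 bytes")
    return record


record = build_second_header(datetime(2017, 4, 22, 4, 10, 14), "x" * 50, "DATA")
```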

Namestr header record:
One for each member. The xxxxxx field holds the number of variables in the data set.

    HEADER RECORD*******NAMSTV8 HEADER RECORD!!!!!!!000000xxxxxx000000000000000000

Namestr records:
Each namestr field is 140 bytes long, but the fields are streamed together and broken in
80-byte pieces. If the last byte of the last namestr field does not fall in the last byte of the
80-byte record, the record is padded with ASCII blanks ('20'x) to 80 bytes.
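The streaming and padding rule can be sketched in a few lines of Python. The function name `to_records` is hypothetical; the 140-byte and 80-byte sizes come from the text above.

```python
def to_records(stream: bytes, size: int = 80) -> list:
    """Pad a byte stream with ASCII blanks and cut it into fixed-size records."""
    padding = -len(stream) % size          # 0 when already on a boundary
    padded = stream + b" " * padding
    return [padded[i:i + size] for i in range(0, len(padded), size)]


# Three 140-byte namestrs make 420 bytes, so the sixth record needs 60 blanks.
records = to_records(b"\x00" * (3 * 140))
```

Four namestrs (560 bytes) land exactly on a record boundary, in which case no padding is added.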

Here is the C structure definition for the namestr record:

    struct NAMESTR {
        short ntype; /* VARIABLE TYPE: 1=NUMERIC, 2=CHAR */
        short nhfun; /* HASH OF NNAME (always 0) */
        short nlng; /* LENGTH OF VARIABLE IN OBSERVATION */
        short nvar0; /* VARNUM */
        char8 nname; /* NAME OF VARIABLE */
        char40 nlabel; /* LABEL OF VARIABLE */
        char8 nform; /* NAME OF FORMAT */
        short nfl; /* FORMAT FIELD LENGTH OR 0 */
        short nfd; /* FORMAT NUMBER OF DECIMALS */
        short nfj; /* 0=LEFT JUSTIFICATION, 1=RIGHT JUST */
        char nfill[2]; /* (UNUSED, FOR ALIGNMENT AND FUTURE) */
        char8 niform; /* NAME OF INPUT FORMAT */
        short nifl; /* INFORMAT LENGTH ATTRIBUTE */
        short nifd; /* INFORMAT NUMBER OF DECIMALS */
        long npos; /* POSITION OF VALUE IN OBSERVATION */
        char longname[32]; /* long name for Version 8-style */
        short lablen; /* length of label */
        char rest[18]; /* remaining fields are irrelevant */
    };

The variable name truncated to 8 characters goes into nname, and the complete name
goes into longname. Use blank padding in either case if necessary. The variable label
truncated to 40 characters goes into nlabel, and the total length of the label goes into
lablen. If your label exceeds 40 characters, you will have the opportunity to write the
complete label in the label section described below.

Note that the length given in the last 4 bytes of the member header record indicates the
actual number of bytes for the NAMESTR structure. The size of the structure listed
above is 140 bytes.
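A possible Python counterpart of the NAMESTR structure is sketched below with `struct`. It assumes big-endian integers with a 2-byte `short` and a 4-byte `long`, which is what makes the fields add up to 140 bytes; the `Namestr` tuple and `unpack_namestr` helper are hypothetical names, and the sample values are made up.

```python
import struct
from collections import namedtuple

# One format code per C field, in declaration order ('>' = big-endian, no padding).
NAMESTR = struct.Struct(">hhhh8s40s8shhh2s8shhl32sh18s")

Namestr = namedtuple(
    "Namestr",
    "ntype nhfun nlng nvar0 nname nlabel nform nfl nfd nfj nfill "
    "niform nifl nifd npos longname lablen rest",
)


def unpack_namestr(raw: bytes) -> Namestr:
    """Decode one 140-byte namestr into named fields (hypothetical helper)."""
    return Namestr._make(NAMESTR.unpack(raw))


# Round-trip a made-up numeric variable whose name is longer than 8 characters.
raw = NAMESTR.pack(
    1, 0, 8, 1, b"LONGVARI", b"A label".ljust(40), b" " * 8, 0, 0, 0,
    b"\x00\x00", b" " * 8, 0, 0, 0, b"LONGVARIABLE".ljust(32), 7, b"\x00" * 18,
)
field = unpack_namestr(raw)
```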

If you have any labels that exceed 40 characters, they can be placed in a label records
section, which starts with this header:

    HEADER RECORD*******LABELV8 HEADER RECORD!!!!!!!nnnnn

where nnnnn is the number of variables for which long labels will be defined.

Each label is defined using the following:

    aabbccd.....e.....

where

    aa = variable number
    bb = length of name
    cc = length of label
    d.... = name in bb bytes
    e.... = label in cc bytes

For example, variable number 1 named x with the 43-byte label 'a very long label for x is
given right here' would be provided as a stream of 6 bytes in hex '00010001002B'X
followed by the ASCII characters.

    xa very long label for x is given right here

These are streamed together. The last label descriptor is followed by ASCII blanks
('20'X) to an 80-byte boundary.
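The worked example can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes 2-byte big-endian integers (as the hex '00010001002B'X implies) and ASCII text.

```python
import struct


def encode_label_v8(varnum: int, name: str, label: str) -> bytes:
    """Encode one LABELV8 descriptor: three 2-byte integers, then name and label."""
    name_b, label_b = name.encode("ascii"), label.encode("ascii")
    return struct.pack(">hhh", varnum, len(name_b), len(label_b)) + name_b + label_b


def decode_labels_v8(stream: bytes, count: int) -> list:
    """Decode `count` descriptors from the streamed label section."""
    labels, offset = [], 0
    for _ in range(count):
        varnum, name_len, label_len = struct.unpack_from(">hhh", stream, offset)
        offset += 6
        name = stream[offset:offset + name_len].decode("ascii")
        offset += name_len
        label = stream[offset:offset + label_len].decode("ascii")
        offset += label_len
        labels.append((varnum, name, label))
    return labels


text = "a very long label for x is given right here"
descriptor = encode_label_v8(1, "x", text)
# Pad the streamed descriptors with ASCII blanks to an 80-byte boundary.
section = descriptor + b" " * (-len(descriptor) % 80)
decoded = decode_labels_v8(section, count=1)
```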

If you have any format or informat names that exceed 8 characters, regardless of the
label length, a different form of label record header is used:

    HEADER RECORD*******LABELV9 HEADER RECORD!!!!!!!nnnnn

where nnnnn is the number of variables for which long format names and any labels will
be defined.

Each label is defined using the following:

    aabbccddeef.....g.....h.....i.....

where

    aa=variable number
    bb=length of name in bytes
    cc=length of label in bytes
    dd=length of format description in bytes
    ee=length of informat description in bytes
    f.....=text for variable name
    g.....=text for variable label
    h.....=text for format description
    i.....=text of informat description

Note: The FORMAT and INFORMAT descriptions are in the form used in a FORMAT
or INFORMAT statement. For example, my_long_fmt., my_long_fmt8.,
my_long_fmt8.2. The text values are streamed together and no characters appear for
attributes with a length of 0 bytes.

For example, variable number 1 is named X and has a label of 'ABC', no attached
format, and an 11-character informat named my_long_fmt with informat length=8 and
informat decimal=0. The data would be

    (hex)      (characters)
    010103000d XABCmy_long_fmt

The last label descriptor is followed by ASCII blanks ('20'X) to an 80-byte boundary.

Observation header:

    HEADER RECORD*******OBSV8 HEADER RECORD!!!!!!!000000000000000000000000000000

Data records:

Data records are streamed in the same way that namestrs are. There is ASCII blank
padding at the end of the last record if necessary. There is no special trailing record.


selik · 2017-04-22T04:25:29Z

Missing values are written out with the first byte (the exponent) indicating the proper
missing values. All subsequent bytes are 0x00. The first byte is:

    ._      0x5f
    .       0x2e
    .A      0x41
    .B      0x42
    ...     ...
    .Z      0x5a
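A reader could detect these markers before attempting any float conversion. The Python sketch below is one way to do it; `missing_code` is a hypothetical helper, not part of any existing API.

```python
def missing_code(field: bytes):
    """Return '.', '._' or '.A'..'.Z' for a missing value, or None otherwise."""
    first, rest = field[0], field[1:]
    if any(rest):
        return None  # non-zero bytes after the first one: an ordinary number
    char = chr(first)
    if char == ".":
        return "."
    if char == "_" or "A" <= char <= "Z":
        return "." + char
    return None


codes = [missing_code(bytes([b]) + b"\x00" * 7) for b in (0x5F, 0x2E, 0x41, 0x5A)]
```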

selik · 2017-04-22T04:27:00Z

All numeric data fields in the transport file are stored as floating point numbers.

All floating point numbers in the file are stored using the IBM mainframe representation.
If your application is to read from or write to transport files, it is necessary to convert
native floating point numbers to or from the transport representation.

Most platforms use the IEEE representation for floating point numbers.
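For the reading direction, the conversion might look like the Python sketch below. It relies on the standard IBM hexadecimal double layout (1 sign bit, a 7-bit excess-64 base-16 exponent, and a 56-bit fraction), which is not spelled out in this thread. The name `ibm_to_ieee` is hypothetical, the sketch assumes full 8-byte fields, and missing-value markers need to be checked separately beforehand because they would decode as 0.0 here.

```python
import math
import struct


def ibm_to_ieee(field: bytes) -> float:
    """Convert an 8-byte IBM hexadecimal double to a native Python float."""
    if len(field) != 8:
        raise ValueError("this sketch only handles 8-byte fields")
    (bits,) = struct.unpack(">Q", field)
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 56) & 0x7F          # power of 16, biased by 64
    fraction = bits & 0x00FFFFFFFFFFFFFF    # 56-bit fraction, value in [0, 1)
    if fraction == 0:
        return 0.0
    # value = fraction / 2**56 * 16**(exponent - 64), expressed as a power of 2
    return sign * math.ldexp(fraction, 4 * (exponent - 64) - 56)


one = ibm_to_ieee(bytes.fromhex("4110000000000000"))
```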

selik · 2020-04-21T05:39:45Z

#17 and #19 are requesting v8/v9, so it'd be nice to get around to this.

patirahardik · 2020-05-12T12:41:43Z

Waiting for this Update.

selik · 2020-05-12T17:46:48Z

@patirahardik Are you trying to read, write, or both? If reading, do you happen to have an example file for testing?

rosselliott · 2020-06-24T13:44:50Z

Also waiting for this.

Git commit logs from this overhaul. Good, bad, and ugly:

* Revision of the project structure. Using the PyPA recommended src layout, this demonstrates the new load/dump API. One other major change is pasting the test data into the Python code for easier verification and modification.
* load/loads returns a library, which has members
* Note installation and contribution instructions.
* Note Python version requirement.
* Add data validation tests. Variable type, name length, and label length.
* Ignore local data
* Show more logs during test
* Embrace Pandas dataframes. Assuming users will want to use Pandas dataframes for all library members simplifies the parsing logic dramatically, as we no longer need to worry about rows vs columns and whether someone wants to use generators. Another major change is handling multiple members in a library. This has a big impact on the API.
* Provide Pandas DataFrame and Series accessors. Rather than using composition to create ``Member`` classes that contain Pandas dataframes, it seems cleaner to provide a ``sas`` accessor for Pandas dataframes and series. Each dataset now behaves exactly like a regular dataframe, but has the capacity for adding SAS-style metadata.
* Add docstring
* Allow setting variable length. It may be possible for a SAS dataset to specify a variable length as larger than the longest string that happens to be in the data. I haven't seen this in the wild, but this will allow me to initialize an empty Series, set the length, and then read data from an XPT file.
* Allow setting variable number and position. As I work on converting the file parsing code to the new Pandas accessor implementation, it's nice to be able to store all metadata on the accessor objects while reading the file.
* Track version in setup.cfg. There was a bit of a circular import problem caused by adding the pandas dependency. If pip tries to import the module to read the version info before it has installed all dependencies, installation fails. So, I moved the version info to setup.cfg and in the code look up the info from the installation.
* Extend Pandas objects with inheritance. While the Pandas documentation discourages inheritance, it seems to be the only way to (mostly) retain metadata. Using the accessor namespace technique is bogus, since it silently discards the meta instance after a variety of usage.
* Include bool columns as SAS numeric. Otherwise `df.info` causes problems, which makes logging difficult. Also, I fixed spurious warnings occurring with empty datasets.
* Fix namestr pack/unpack.
* Note differences between v5, v8, and v9.
* Simplify format parsing and display. I don't have good examples of XPORT files with formats and informats. I must guess at what the names look like. Going through the documentation of informats, it appears that if you drop the "w.d" portion of names, no name is longer than 8 characters, including the "$" for character formats. My assumption is thus that the "$" is included in the name, but the "w.d" portion is not.
* Fix format comparison test. This is unsatisfactory, but it'll have to do until I learn more about SAS formats. I'm not sure how left- and right-alignment is specified with format names.
* Fix member header encode to XPT.
* Fix observation encoding.
* Finally! Fix Pandas extensions. What a mess. Pandas does not make it easy to attach metadata to Series and DataFrames. Fragile fragile fragile. Lots of awkward code.
* Allow decoding non-ASCII text.
* Add debug logs for metadata, info for parsing.
* Test and fix dtype coercion/validation.
* Test and fix metadata str length validation
* Add datetime validation for created, modified.
* Test and fix CLI
* Test and fix the old interface.
* Fix issue converting numeric column names.
* Update examples for Pandas subclass design.

selik · 2021-12-25T08:39:33Z

#79 implements reading. Writing is in progress.

gaineleanor · 2021-12-27T02:37:08Z

At present, it seems that the XPT format itself does not specify a text encoding.
In practice it relies on the SAS session encoding, but this is not formally specified. This is also mentioned here:

https://www.pinnacle21.com/forum/zhongwenbanxmlwenjiandeshengcheng

* Prefer conda-forge
* Add docstring and comment `text_encode`.
* Better error messages for `struct.pack`. If the `struct.pack` receives the wrong type of data, it complains about a "required argument" and doesn't specify which one. This check for `number` and `position` provides a better error message.
* Write SAS Transport v8. I generated a Transport v8 file using Stata. One surprise was that Stata enumerated variables starting from 0, instead of from 1. I've followed Stata's example. I didn't notice anything in SAS's specification that indicated whether 0 or 1 is correct.
* Update isort's call to isort v5 standard
* Bump version, note changes. Note that this writes only SAS Transport Version 8 and not 9. I haven't found a good example of long format description support that Transport Version 9 provides.

selik · 2022-01-01T01:11:48Z

This is for the most part resolved, except for v9's support of writing long format names. I'll close this and open a new issue for writing long format names. Reading them is implemented, but not writing.

selik added the enhancement label Apr 22, 2017

selik mentioned this issue Oct 16, 2018

ParseError #19

Closed

selik mentioned this issue Apr 21, 2020

xport.ParseError : header #13

Closed

This was referenced Nov 21, 2021

Use SAS Transport v8 #76

Closed

Support for SAS Version 8 Transport Files #60

Closed

Sas Transort V8 character limit compatibility #53

Closed

This was referenced Dec 25, 2021

Read Transport Version 8/9 #79

Merged

Support different text encodings #75

Closed

selik closed this as completed Jan 1, 2022
