Enable SAS Transport Format for SAS v8 and 9 #10

Closed
selik opened this issue Apr 22, 2017 · 9 comments

Comments

@selik
Owner

selik commented Apr 22, 2017

The file format is slightly different.
https://support.sas.com/techsup/technote/ts140_2.pdf

  1. The first header record consists of the following character string, in ASCII:
    HEADER RECORD*******LIBV8 HEADER RECORD!!!!!!!000000000000000000000000000000
  1. The first real header record uses the following layout:
    aaaaaaaabbbbbbbbccccccccddddddddeeeeeeee                        ffffffffffffffff

where aaaaaaaa and bbbbbbbb are each 'SAS ' and cccccccc is 'SASLIB ', dddddddd is
the version of the SAS system that created the file, and eeeeeeee is the operating system
creating it. ffffffffffffffff is the datetime created, formatted as ddMMMyy:hh:mm:ss.
Note that only a 2-digit year appears. If any program needs to read in this 2-digit year, be
prepared to deal with dates in the 1900s or the 2000s.

Another way to consider this record is as a C structure:

    struct REAL_HEADER {
        char sas_symbol[2][8];
        char saslib[8];
        char sasver[8];
        char sas_os[8];
        char blanks[24];
        char sas_create[16];
    };
  1. Second real header record:
    ddMMMyy:hh:mm:ss

where the string is the datetime modified. Most often, the datetime created and datetime
modified will be the same. Pad with ASCII blanks to 80 bytes.
Note that only a 2-digit year appears. If any program needs to read in this 2-digit year, be
prepared to deal with dates in the 1900s or the 2000s.
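
A minimal sketch of reading the first real header record with Python's `struct` module, following the `REAL_HEADER` layout above. The function names and the sample record are hypothetical, not this project's API. It assumes English month abbreviations and relies on `strptime`'s `%y` pivot (69-99 map to 1969-1999, 00-68 to 2000-2068) to handle the 2-digit year.

```python
import struct
from datetime import datetime

# Layout from the REAL_HEADER struct: two 'SAS' symbols, 'SASLIB', version,
# operating system, 24 blanks, and a 16-byte creation timestamp (80 bytes).
REAL_HEADER = struct.Struct(">8s8s8s8s8s24s16s")


def parse_timestamp(raw: bytes) -> datetime:
    """Parse ddMMMyy:hh:mm:ss (assumes English month abbreviations)."""
    text = raw.decode("ascii").strip()
    # %y pivots 2-digit years: 69-99 -> 1969-1999, 00-68 -> 2000-2068.
    return datetime.strptime(text, "%d%b%y:%H:%M:%S")


def parse_library_header(record: bytes) -> dict:
    """Split an 80-byte library header record into its fields."""
    sym1, sym2, saslib, version, os_name, _blanks, created = REAL_HEADER.unpack(record)
    if (sym1.strip(), sym2.strip(), saslib.strip()) != (b"SAS", b"SAS", b"SASLIB"):
        raise ValueError("not a SAS transport library header")
    return {
        "version": version.decode("ascii").strip(),
        "os": os_name.decode("ascii").strip(),
        "created": parse_timestamp(created),
    }


# Hypothetical sample record, blank-padded as described above.
sample = b"SAS     SAS     SASLIB  9.4     X64_10PR" + b" " * 24 + b"22APR17:10:30:00"
header = parse_library_header(sample)
```

If a file may contain dates outside the 1969-2068 window, the pivot would need to be handled explicitly instead of relying on `%y`.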

  1. Member header records:
    Both of these occur for every member in the transport file.
    HEADER RECORD*******MEMBV8 HEADER RECORD!!!!!!!000000000000000001600000000140
    HEADER RECORD*******DSCPTV8 HEADER RECORD!!!!!!!000000000000000000000000000000

Note the 0140 that appears in the member header record above. That value is the size of the variable descriptor (NAMESTR) record that is described later in this document.
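
As a rough illustration of the note above, the NAMESTR size can be read from the last four digits of the member header record. The helper name `namestr_size` is hypothetical. Trailing blanks are stripped first, so the sketch does not depend on the exact blank padding of the record.

```python
def namestr_size(member_header: bytes) -> int:
    """Return the NAMESTR size stored in the last 4 digits of a MEMBV8 record."""
    text = member_header.decode("ascii").rstrip()
    if "MEMBV8" not in text:
        raise ValueError("not a MEMBV8 header record")
    return int(text[-4:])


# The member header record quoted above, blank-padded to 80 bytes.
record = (
    b"HEADER RECORD*******MEMBV8 HEADER RECORD!!!!!!!"
    b"000000000000000001600000000140"
).ljust(80)
size = namestr_size(record)
```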

  1. Member header data:
    aaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbccccccccddddddddeeeeeeeeffffffffffffffff

where aaaaaaaa is 'SAS ', bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb is the data set name,
cccccccc is SASDATA (if a SAS data set is being created), dddddddd is the version of
the SAS System under which the file was created, and eeeeeeee is the operating system
name. ffffffffffffffff is the datetime created, formatted as in previous headers. Consider
this C structure:

    struct REAL_HEADER {
        char sas_symbol[8];
        char sas_dsname[32];
        char sasdata[8];
        char sasver[8];
        char sas_osname[8];
        char sas_create[16];
    };

The second header record is

    ddMMMyy:hh:mm:ss aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbb

where the datetime modified appears using DATETIME16. format, followed by blanks
up to column 33, where the a's above correspond to a blank-padded data set label, and
bbbbbbbb is the blank-padded data set type. Note that data set labels can be up to 256
characters as of Version 8 of the SAS System, but only up to the first 40 characters are
stored in the second header record. Note also that only a 2-digit year appears in the
datetime modified value. If any program needs to read in this 2-digit year, be prepared to
deal with dates in the 1900s or the 2000s.

Consider the following C structure:

    struct SECOND_HEADER {
        char dtmod_day[2];
        char dtmod_month[3];
        char dtmod_year[2];
        char dtmod_colon1[1];
        char dtmod_hour[2];
        char dtmod_colon2[1];
        char dtmod_minute[2];
        char dtmod_colon3[1];
        char dtmod_second[2];
        char padding[16];
        char dslabel[40];
        char dstype[8];
    };
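
A sketch of splitting the two member header data records with `struct`, following the two C structures above. The names are hypothetical and the sample values are made up. It assumes ASCII text, which may not hold for files written under other session encodings.

```python
import struct

# First record: 'SAS', data set name, 'SASDATA', version, OS, created (80 bytes).
MEMBER_HEADER = struct.Struct(">8s32s8s8s8s16s")
# Second record: modified timestamp, 16 blanks, 40-byte label, 8-byte type.
SECOND_HEADER = struct.Struct(">16s16s40s8s")


def parse_member_headers(first: bytes, second: bytes) -> dict:
    """Return the blank-stripped fields of both member header data records."""
    _symbol, name, _sasdata, version, os_name, created = MEMBER_HEADER.unpack(first)
    modified, _padding, label, dstype = SECOND_HEADER.unpack(second)

    def text(raw: bytes) -> str:
        return raw.decode("ascii").rstrip()

    return {
        "name": text(name),
        "version": text(version),
        "os": text(os_name),
        "created": text(created),
        "modified": text(modified),
        "label": text(label),  # only the first 40 characters are stored here
        "type": text(dstype),
    }


# Hypothetical sample records.
first = (b"SAS".ljust(8) + b"MYDATA".ljust(32) + b"SASDATA".ljust(8)
         + b"9.4".ljust(8) + b"X64_10PR" + b"22APR17:10:30:00")
second = b"22APR17:10:30:00" + b" " * 16 + b"My label".ljust(40) + b" " * 8
member = parse_member_headers(first, second)
```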
  1. Namestr header record:
    One for each member.
    HEADER RECORD*******NAMSTV8 HEADER RECORD!!!!!!!000000xxxxxx000000000000000000
  1. Namestr records:
    Each namestr field is 140 bytes long, but the fields are streamed together and broken in
    80-byte pieces. If the last byte of the last namestr field does not fall in the last byte of the
    80-byte record, the record is padded with ASCII blanks ('20'x) to 80 bytes.

Here is the C structure definition for the namestr record:

    struct NAMESTR {
        short ntype; /* VARIABLE TYPE: 1=NUMERIC, 2=CHAR */
        short nhfun; /* HASH OF NNAME (always 0) */
        short nlng; /* LENGTH OF VARIABLE IN OBSERVATION */
        short nvar0; /* VARNUM */
        char8 nname; /* NAME OF VARIABLE */
        char40 nlabel; /* LABEL OF VARIABLE */
        char8 nform; /* NAME OF FORMAT */
        short nfl; /* FORMAT FIELD LENGTH OR 0 */
        short nfd; /* FORMAT NUMBER OF DECIMALS */
        short nfj; /* 0=LEFT JUSTIFICATION, 1=RIGHT JUST */
        char nfill[2]; /* (UNUSED, FOR ALIGNMENT AND FUTURE) */
        char8 niform; /* NAME OF INPUT FORMAT */
        short nifl; /* INFORMAT LENGTH ATTRIBUTE */
        short nifd; /* INFORMAT NUMBER OF DECIMALS */
        long npos; /* POSITION OF VALUE IN OBSERVATION */
        char longname[32]; /* long name for Version 8-style */
        short lablen; /* length of label */
        char rest[18]; /* remaining fields are irrelevant */
    };

The variable name truncated to 8 characters goes into nname, and the complete name
goes into longname. Use blank padding in either case if necessary. The variable label
truncated to 40 characters goes into nlabel, and the total length of the label goes into
lablen. If your label exceeds 40 characters, you will have the opportunity to write the
complete label in the label section described below.

Note that the length given in the last 4 bytes of the member header record indicates the
actual number of bytes for the NAMESTR structure. The size of the structure listed
above is 140 bytes.
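
A sketch of packing and unpacking the 140-byte NAMESTR with `struct`. It assumes big-endian integers with 2-byte `short` and 4-byte `long`, which matches the hex example in the label section. The helper names and default values are hypothetical, and format and informat fields are simply left blank.

```python
import struct

# One format character per NAMESTR member, in declaration order.
NAMESTR = struct.Struct(">hhhh8s40s8shhh2s8shhl32sh18s")
assert NAMESTR.size == 140

FIELDS = (
    "ntype", "nhfun", "nlng", "nvar0", "nname", "nlabel", "nform",
    "nfl", "nfd", "nfj", "nfill", "niform", "nifl", "nifd", "npos",
    "longname", "lablen", "rest",
)


def pack_namestr(name, label, number, position, length=8, numeric=True):
    """Build one NAMESTR; the name and label are truncated and blank-padded."""
    name_b = name.encode("ascii")
    label_b = label.encode("ascii")
    return NAMESTR.pack(
        1 if numeric else 2, 0, length, number,
        name_b[:8].ljust(8), label_b[:40].ljust(40),
        b" " * 8, 0, 0, 0, b"\x00" * 2,       # no format attached
        b" " * 8, 0, 0, position,             # no informat attached
        name_b[:32].ljust(32), len(label_b),  # full name and total label length
        b"\x00" * 18,
    )


def unpack_namestr(raw):
    """Return the NAMESTR members as a dict keyed by the C field names."""
    return dict(zip(FIELDS, NAMESTR.unpack(raw)))


raw = pack_namestr("a_long_variable_name", "x" * 50, number=1, position=0)
fields = unpack_namestr(raw)
```

Here `lablen` is 50 while `nlabel` only holds the first 40 characters, so the complete label would go into the label section.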

If you have any labels that exceed 40 characters, they can be placed in this section. The
label records section starts with this header:

    HEADER RECORD*******LABELV8 HEADER RECORD!!!!!!!nnnnn

where nnnnn is the number of variables for which long labels will be defined.

Each label is defined using the following:

    aabbccd.....e.....

where

    aa = variable number
    bb = length of name
    cc = length of label
    d.... = name in bb bytes
    e.... = label in cc bytes

For example, variable number 1 named x with the 43-byte label 'a very long label for x is
given right here' would be provided as a stream of 6 bytes in hex '00010001002B'X
followed by the ASCII characters.

    xa very long label for x is given right here

These are streamed together. The last label descriptor is followed by ASCII blanks
('20'X) to an 80-byte boundary.
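
The example above can be reproduced with a short sketch. The function names are hypothetical. Each of the three leading fields is packed as a 2-byte big-endian integer, which yields the 6 bytes `00010001002B` from the example.

```python
import struct


def encode_label_v8(number: int, name: str, label: str) -> bytes:
    """Encode one LABELV8 entry: three 2-byte integers, then name and label."""
    name_b = name.encode("ascii")
    label_b = label.encode("ascii")
    return struct.pack(">hhh", number, len(name_b), len(label_b)) + name_b + label_b


def encode_label_section(entries) -> bytes:
    """Stream all entries together and blank-pad to an 80-byte boundary."""
    body = b"".join(encode_label_v8(*entry) for entry in entries)
    return body + b" " * (-len(body) % 80)


entry = encode_label_v8(1, "x", "a very long label for x is given right here")
section = encode_label_section(
    [(1, "x", "a very long label for x is given right here")]
)
```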

If you have any format or informat names that exceed 8 characters, regardless of the
label length, a different form of label record header is used:

    HEADER RECORD*******LABELV9 HEADER RECORD!!!!!!!nnnnn

where nnnnn is the number of variables for which long format names and any labels will
be defined.

Each label is defined using the following:

    aabbccddeef.....g.....h.....i.....

where

    aa=variable number
    bb=length of name in bytes
    cc=length of label in bytes
    dd=length of format description in bytes
    ee=length of informat description in bytes
    f.....=text for variable name
    g.....=text for variable label
    h.....=text for format description
    i.....=text of informat description

Note: The FORMAT and INFORMAT descriptions are in the form used in a FORMAT
or INFORMAT statement. For example, my_long_fmt., my_long_fmt8.,
my_long_fmt8.2. The text values are streamed together and no characters appear for
attributes with a length of 0 bytes.

For example, variable number 1 is named X and has a label of 'ABC', no attached
format, and an 11-character informat named my_long_fmt with informat length=8 and
informat decimal=0. The data would be

    (hex)                (characters)
    0001000100030000000D XABCmy_long_fmt8.

The last label descriptor is followed by ASCII blanks ('20'X) to an 80-byte boundary.
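
A sketch of decoding one LABELV9 entry. It assumes that each of the five leading fields is a 2-byte big-endian integer, by analogy with the LABELV8 layout, and that the informat description carries its width (`my_long_fmt8.`, 13 bytes). Both are assumptions rather than something confirmed against a real file, and the function name is hypothetical.

```python
import struct


def decode_label_v9(data: bytes, offset: int = 0):
    """Decode one LABELV9 entry; return the entry and the next offset."""
    number, n_name, n_label, n_format, n_informat = struct.unpack_from(
        ">hhhhh", data, offset
    )
    pos = offset + 10
    texts = []
    # The text values are streamed together; zero-length ones take no bytes.
    for size in (n_name, n_label, n_format, n_informat):
        texts.append(data[pos:pos + size].decode("ascii"))
        pos += size
    name, label, fmt, informat = texts
    entry = {
        "number": number,
        "name": name,
        "label": label,
        "format": fmt,
        "informat": informat,
    }
    return entry, pos


# Assumed byte layout for the example: variable 1, name X, label ABC,
# no format, informat description my_long_fmt8.
data = bytes.fromhex("0001000100030000000d") + b"XABCmy_long_fmt8."
entry, end = decode_label_v9(data)
```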

  1. Observation header:
   HEADER RECORD*******OBSV8 HEADER RECORD!!!!!!!000000000000000000000000000000
  1. Data records:

Data records are streamed in the same way that namestrs are. There is ASCII blank
padding at the end of the last record if necessary. There is no special trailing record.
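
The streaming rule for namestrs and data records can be sketched as below. The name `to_records` is hypothetical.

```python
RECORD_SIZE = 80


def to_records(stream: bytes) -> list:
    """Blank-pad a byte stream to an 80-byte boundary and split it into records."""
    padded = stream + b" " * (-len(stream) % RECORD_SIZE)
    return [padded[i:i + RECORD_SIZE] for i in range(0, len(padded), RECORD_SIZE)]


# Two 140-byte namestrs streamed together occupy four 80-byte records.
records = to_records(b"x" * 280)
```

The final record here holds 40 bytes of content followed by 40 ASCII blanks.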

@selik
Owner Author

selik commented Apr 22, 2017

Missing values are written out with the first byte (the exponent) indicating the proper
missing values. All subsequent bytes are 0x00. The first byte is:

    ._      0x5f
    .       0x2e
    .A      0x41
    .B      0x42
    .       ...
    .Z      0x5a
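
A sketch of the missing-value table as Python lookups. The names are hypothetical, and an 8-byte numeric field is assumed by default.

```python
import string

# First-byte codes from the table above: '.', '._', and '.A' through '.Z'.
MISSING_BYTES = {".": 0x2E, "._": 0x5F}
MISSING_BYTES.update({"." + c: ord(c) for c in string.ascii_uppercase})
BYTE_TO_MISSING = {byte: code for code, byte in MISSING_BYTES.items()}


def encode_missing(code: str = ".", width: int = 8) -> bytes:
    """Return the field for a missing value: the code byte, then zero bytes."""
    return bytes([MISSING_BYTES[code]]) + b"\x00" * (width - 1)


def decode_missing(field: bytes):
    """Return the missing-value code, or None if the field holds a number."""
    if any(field[1:]):
        return None  # non-zero trailing bytes: an ordinary number
    return BYTE_TO_MISSING.get(field[0])


encoded = encode_missing(".A")
```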

@selik
Owner Author

selik commented Apr 22, 2017

All numeric data fields in the transport file are stored as floating point numbers.

All floating point numbers in the file are stored using the IBM mainframe representation.
If your application is to read from or write to transport files, it is necessary to convert
native floating point numbers to or from the transport representation.

Most platforms use the IEEE representation for floating point numbers.
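
A sketch of the conversion between IBM hexadecimal doubles and native IEEE floats. It assumes the usual IBM layout of 1 sign bit, a 7-bit base-16 exponent with a bias of 64, and a 56-bit fraction. The function names are hypothetical, and missing values are not handled here.

```python
import math
import struct


def ibm_to_float(field: bytes) -> float:
    """Decode an IBM hexadecimal float of up to 8 bytes (zero-padded on the right)."""
    (bits,) = struct.unpack(">Q", field.ljust(8, b"\x00"))
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 56) & 0x7F
    fraction = bits & 0x00FFFFFFFFFFFFFF
    # value = (fraction / 16**14) * 16**(exponent - 64)
    return sign * fraction * 16.0 ** (exponent - 64 - 14)


def float_to_ibm(value: float) -> bytes:
    """Encode a finite float as an 8-byte IBM hexadecimal float."""
    if value == 0:
        return b"\x00" * 8
    sign = 0x80 if value < 0 else 0x00
    mantissa, exp2 = math.frexp(abs(value))  # abs(value) == mantissa * 2**exp2
    exp16 = -(-exp2 // 4)                    # ceil(exp2 / 4)
    shift = 4 * exp16 - exp2                 # 0..3 leading zero bits in the fraction
    fraction = int(math.ldexp(mantissa, 56 - shift))
    exponent = exp16 + 64
    if not 0 <= exponent <= 127:
        raise OverflowError("value out of range for an IBM float")
    return struct.pack(">Q", (sign | exponent) << 56 | fraction)


one = float_to_ibm(1.0)
roundtrip = ibm_to_float(float_to_ibm(-118.625))
```

Because the IBM fraction has 56 bits and an IEEE double has 53, every value that starts as a finite double within the IBM exponent range round-trips exactly in this sketch; values read from a file may lose up to 3 low bits when converted to a double.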

@selik selik mentioned this issue Oct 16, 2018
@selik
Owner Author

selik commented Apr 21, 2020

#17 and #19 are requesting v8/v9, so it'd be nice to get around to this.

@patirahardik

Waiting for this Update.

@selik
Owner Author

selik commented May 12, 2020

@patirahardik Are you trying to read, write, or both? If reading, do you happen to have an example file for testing?

@rosselliott

Also waiting for this.

selik referenced this issue Aug 18, 2020
Git commit logs from this overhaul. Good, bad, and ugly:

* Revision of the project structure.

Using the PyPA recommended src layout, this demonstrates the new
load/dump API.  One other major change is pasting the test data into the
Python code for easier verification and modification.

* load/loads returns a library, which has members

* Note installation and contribution instructions.

* Note Python version requirement.

* Add data validation tests.

Variable type, name length, and label length.

* Ignore local data

* Show more logs during test

* Embrace Pandas dataframes

Assuming users will want to use Pandas dataframes for all library
members simplifies the parsing logic dramatically, as we no longer need
to worry about rows vs columns and whether someone wants to use
generators.

Another major change is handling multiple members in a library.  This
has a big impact on the API.

* Provide Pandas DataFrame and Series accessors.

Rather than using composition to create ``Member`` classes that contain
Pandas dataframes, it seems cleaner to provide a ``sas`` accessor for
Pandas dataframes and series. Each dataset now behaves exactly like a
regular dataframe, but has the capacity for adding SAS-style metadata.

* Add docstring

* Allow setting variable length.

It may be possible for a SAS dataset to specify a variable length as
larger than the longest string that happens to be in the data.  I
haven't seen this in the wild, but this will allow me to initialize an
empty Series, set the length, and then read data from an XPT file.

* Allow setting variable number and position.

As I work on converting the file parsing code to the new Pandas accessor
implementation, it's nice to be able to store all metadata on the
accessor objects while reading the file.

* Track version in setup.cfg

There was a bit of a circular import problem caused by adding the pandas
dependency.  If pip tries to import the module to read the version info
before it has installed all dependencies, installation fails.  So, I
moved the version info to setup.cfg and in the code look up the info
from the installation.

* Extend Pandas objects with inheritance.

While the Pandas documentation discourages inheritance, it seems to be
the only way to (mostly) retain metadata.  Using the accessor namespace
technique is bogus, since it silently discards the meta instance after
a variety of usage.

* Include bool columns as SAS numeric.

Otherwise `df.info` causes problems, which makes logging difficult.

Also, I fixed spurious warnings occurring with empty datasets.

* Fix namestr pack/unpack.

* Note differences between v5, v8, and v9.

* Simplify format parsing and display.

I don't have good examples of XPORT files with formats and informats.  I
must guess at what the names look like.  Going through the documentation
of informats, it appears that if you drop the "w.d" portion of names,
no name is longer than 8 characters, including the "$" for character
formats.  My assumption is thus that the "$" is included in the name,
but the "w.d" portion is not.

* Fix format comparison test.

This is unsatisfactory, but it'll have to do until I learn more about
SAS formats.  I'm not sure how left- and right-alignment is specified
with format names.

* Fix member header encode to XPT.

* Fix observation encoding.

* Finally! Fix Pandas extensions.

What a mess.  Pandas does not make it easy to attach metadata to Series
and DataFrames.  Fragile fragile fragile.  Lots of awkward code.

* Allow decoding non-ASCII text.

* Add debug logs for metadata, info for parsing.

* Test and fix dtype coercion/validation.

* Test and fix metadata str length validation

* Add datetime validation for created, modified.

* Test and fix CLI

* Test and fix the old interface.

* Fix issue converting numeric column names.

* Update examples for Pandas subclass design.
@selik
Owner Author

selik commented Dec 25, 2021

#79 implements reading. Writing is in-progress.

@gaineleanor

At present, the XPT format itself does not appear to specify a text encoding.
In practice it uses the session encoding, but this is not formally documented. It is also mentioned here:

https://www.pinnacle21.com/forum/zhongwenbanxmlwenjiandeshengcheng

selik added a commit that referenced this issue Jan 1, 2022
* Prefer conda-forge

* Add docstring and comment `text_encode`.

* Better error messages for `struct.pack`.

If the `struct.pack` receives the wrong type of data, it complains about
a "required argument" and doesn't specify which one.  This check for
`number` and `position` provides a better error message.

* Write SAS Transport v8

I generated a Transport v8 file using Stata.  One surprise was that
Stata enumerated variables starting from 0, instead of from 1.  I've
followed Stata's example.  I didn't notice anything in SAS's
specification that indicated whether 0 or 1 is correct.

* Update isort's call to isort v5 standard

* Bump version, note changes.

Note that this writes only SAS Transport Version 8 and not 9.  I haven't
found a good example of long format description support that Transport
Version 9 provides.
@selik
Owner Author

selik commented Jan 1, 2022

This is for the most part resolved, except for v9's support of writing long format names. I'll close this and open a new issue for writing long format names. Reading them is implemented, but not writing.

@selik selik closed this as completed Jan 1, 2022