Enable SAS Transport Format for SAS v8 and 9 #10
Missing values are written out with the first byte (the exponent) indicating the proper
All numeric data fields in the transport file are stored as floating point numbers. All floating point numbers in the file are stored using the IBM mainframe representation. Most platforms use the IEEE representation for floating point numbers.
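To make the IBM-versus-IEEE point concrete, here is a minimal sketch of decoding one 8-byte IBM hexadecimal float into a Python float. The function name `ibm_to_float` is hypothetical and not part of this project's API. Treating a nonzero first byte with an all-zero fraction as a missing value is an assumption based on the remark about missing values above.

```python
import math
import struct


def ibm_to_float(raw: bytes) -> float:
    """Decode one 8-byte big-endian IBM hexadecimal float (hypothetical helper)."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    (bits,) = struct.unpack(">Q", raw)
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 56) & 0x7F        # base-16 exponent, stored with a bias of 64
    fraction = bits & 0x00FFFFFFFFFFFFFF  # 56-bit fraction, read as 0.fraction
    if fraction == 0:
        # Assumption: nonzero first byte plus zero fraction marks a missing value.
        return math.nan if raw[0] != 0 else 0.0
    return sign * (fraction / 2**56) * 16.0 ** (exponent - 64)


print(ibm_to_float(bytes.fromhex("4110000000000000")))  # 1.0
print(ibm_to_float(bytes.fromhex("C264000000000000")))  # -100.0
```

The IBM fraction has 56 bits while an IEEE double keeps 53, so some values may be rounded on conversion.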
Waiting for this Update.
@patirahardik Are you trying to read, write, or both? If reading, do you happen to have an example file for testing?
Also waiting for this.
Git commit logs from this overhaul. Good, bad, and ugly:

* Revision of the project structure.
  Using the PyPA recommended src layout, this demonstrates the new load/dump API. One other major change is pasting the test data into the Python code for easier verification and modification.
* load/loads returns a library, which has members
* Note installation and contribution instructions.
* Note Python version requirement.
* Add data validation tests.
  Variable type, name length, and label length.
* Ignore local data
* Show more logs during test
* Embrace Pandas dataframes
  Assuming users will want to use Pandas dataframes for all library members simplifies the parsing logic dramatically, as we no longer need to worry about rows vs columns and whether someone wants to use generators. Another major change is handling multiple members in a library. This has a big impact on the API.
* Provide Pandas DataFrame and Series accessors.
  Rather than using composition to create ``Member`` classes that contain Pandas dataframes, it seems cleaner to provide a ``sas`` accessor for Pandas dataframes and series. Each dataset now behaves exactly like a regular dataframe, but has the capacity for adding SAS-style metadata.
* Add docstring
* Allow setting variable length.
  It may be possible for a SAS dataset to specify a variable length as larger than the longest string that happens to be in the data. I haven't seen this in the wild, but this will allow me to initialize an empty Series, set the length, and then read data from an XPT file.
* Allow setting variable number and position.
  As I work on converting the file parsing code to the new Pandas accessor implementation, it's nice to be able to store all metadata on the accessor objects while reading the file.
* Track version in setup.cfg
  There was a bit of a circular import problem caused by adding the pandas dependency. If pip tries to import the module to read the version info before it has installed all dependencies, installation fails. So, I moved the version info to setup.cfg and in the code look up the info from the installation.
* Extend Pandas objects with inheritance.
  While the Pandas documentation discourages inheritance, it seems to be the only way to (mostly) retain metadata. Using the accessor namespace technique is bogus, since it silently discards the meta instance after a variety of usage.
* Include bool columns as SAS numeric.
  Otherwise `df.info` causes problems, which makes logging difficult. Also, I fixed spurious warnings occurring with empty datasets.
* Fix namestr pack/unpack.
* Note differences between v5, v8, and v9.
* Simplify format parsing and display.
  I don't have good examples of XPORT files with formats and informats. I must guess at what the names look like. Going through the documentation of informats, it appears that if you drop the "w.d" portion of names, no name is longer than 8 characters, including the "$" for character formats. My assumption is thus that the "$" is included in the name, but the "w.d" portion is not.
* Fix format comparison test.
  This is unsatisfactory, but it'll have to do until I learn more about SAS formats. I'm not sure how left- and right-alignment is specified with format names.
* Fix member header encode to XPT.
* Fix observation encoding.
* Finally! Fix Pandas extensions.
  What a mess. Pandas does not make it easy to attach metadata to Series and DataFrames. Fragile fragile fragile. Lots of awkward code.
* Allow decoding non-ASCII text.
* Add debug logs for metadata, info for parsing.
* Test and fix dtype coercion/validation.
* Test and fix metadata str length validation
* Add datetime validation for created, modified.
* Test and fix CLI
* Test and fix the old interface.
* Fix issue converting numeric column names.
* Update examples for Pandas subclass design.
#79 implements reading. Writing is in progress.
At present, the XPT format itself does not seem to specify a text encoding.
* Prefer conda-forge
* Add docstring and comment `text_encode`.
* Better error messages for `struct.pack`.
  If the `struct.pack` receives the wrong type of data, it complains about a "required argument" and doesn't specify which one. This check for `number` and `position` provides a better error message.
* Write SAS Transport v8
  I generated a Transport v8 file using Stata. One surprise was that Stata enumerated variables starting from 0, instead of from 1. I've followed Stata's example. I didn't notice anything in SAS's specification that indicated whether 0 or 1 is correct.
* Update isort's call to isort v5 standard
* Bump version, note changes.
  Note that this writes only SAS Transport Version 8 and not 9. I haven't found a good example of long format description support that Transport Version 9 provides.
This is for the most part resolved, except for v9's support of writing long format names. I'll close this and open a new issue for writing long format names. Reading them is implemented, but not writing.
The file format is slightly different.
https://support.sas.com/techsup/technote/ts140_2.pdf
where aaaaaaaa and bbbbbbbb are each 'SAS ' and cccccccc is 'SASLIB ', dddddddd is
the version of the SAS system that created the file, and eeeeeeee is the operating system
creating it. ffffffffffffffff is the datetime created, formatted as ddMMMyy:hh:mm:ss.
Note that only a 2-digit year appears. If any program needs to read in this 2-digit year, be
prepared to deal with dates in the 1900s or the 2000s.
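As a rough illustration of the 2-digit-year problem, the sketch below parses a `ddMMMyy:hh:mm:ss` string and picks the century with a pivot year. The function name `parse_sas_datetime` and the default pivot of 70 are assumptions for illustration. The specification does not say where the cutoff between the 1900s and the 2000s should fall.

```python
from datetime import datetime

# English month abbreviations, looked up directly to avoid locale-dependent parsing.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_sas_datetime(text: str, pivot: int = 70) -> datetime:
    """Parse 'ddMMMyy:hh:mm:ss' (hypothetical helper; the pivot is an assumption)."""
    text = text.strip()
    day = int(text[0:2])
    month = MONTHS[text[2:5].upper()]
    two_digit_year = int(text[5:7])
    hour, minute, second = (int(part) for part in text[8:].split(":"))
    # Years at or above the pivot are read as 19yy, the rest as 20yy.
    year = (1900 if two_digit_year >= pivot else 2000) + two_digit_year
    return datetime(year, month, day, hour, minute, second)


print(parse_sas_datetime("16FEB11:10:07:55"))  # 2011-02-16 10:07:55
print(parse_sas_datetime("01JAN99:00:00:00"))  # 1999-01-01 00:00:00
```

Exposing the pivot as a parameter lets a caller who knows the age of their files choose a different cutoff.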
Another way to consider this record is as a C structure:
where the string is the datetime modified. Most often, the datetime created and datetime
modified will be the same. Pad with ASCII blanks to 80 bytes.
Note that only a 2-digit year appears. If any program needs to read in this 2-digit year, be
prepared to deal with dates in the 1900s or the 2000s.
Both of these occur for every member in the transport file.
Note the 0140 that appears in the member header record above. That value is the size of the variable descriptor (NAMESTR) record that is described later in this document.
where aaaaaaaa is 'SAS ', bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb is the data set name,
cccccccc is SASDATA (if a SAS data set is being created), dddddddd is the version of
the SAS System under which the file was created, and eeeeeeee is the operating system
name. ffffffffffffffff is the datetime created, formatted as in previous headers. Consider
this C structure:
The second header record is
where the datetime modified appears using DATETIME16. format, followed by blanks
up to column 33, where the a's above correspond to a blank-padded data set label, and
bbbbbbbb is the blank-padded data set type. Note that data set labels can be up to 256
characters as of Version 8 of the SAS System, but only up to the first 40 characters are
stored in the second header record. Note also that only a 2-digit year appears in the
datetime modified value. If any program needs to read in this 2-digit year, be prepared to
deal with dates in the 1900s or the 2000s.
Consider the following C structure:
One for each member.
Each namestr field is 140 bytes long, but the fields are streamed together and broken in
80-byte pieces. If the last byte of the last namestr field does not fall in the last byte of the
80-byte record, the record is padded with ASCII blanks ('20'x) to 80 bytes.
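The streaming and padding rule can be sketched as follows. The helper names `pad_to_record` and `split_records` are hypothetical. Only the 140-byte namestr size, the 80-byte record size, and the ASCII blank padding come from the text above.

```python
RECORD_SIZE = 80    # every physical record in the file is 80 bytes
NAMESTR_SIZE = 140  # size of one namestr as stated above


def pad_to_record(stream: bytes, record_size: int = RECORD_SIZE) -> bytes:
    """Pad a byte stream with ASCII blanks up to the next record boundary."""
    remainder = len(stream) % record_size
    if remainder == 0:
        return stream  # already ends on a record boundary, nothing to add
    return stream + b" " * (record_size - remainder)


def split_records(stream: bytes, record_size: int = RECORD_SIZE) -> list:
    """Break a padded stream into fixed-size records."""
    return [stream[i:i + record_size] for i in range(0, len(stream), record_size)]


# Three namestrs occupy 420 bytes, so 60 blanks are needed to reach 480 bytes.
namestrs = bytes(NAMESTR_SIZE) * 3
padded = pad_to_record(namestrs)
print(len(padded), len(split_records(padded)))  # 480 6
```

With four namestrs the stream is 560 bytes, which is already a multiple of 80, so no padding is added.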
Here is the C structure definition for the namestr record:
The variable name truncated to 8 characters goes into nname, and the complete name
goes into longname. Use blank padding in either case if necessary. The variable label
truncated to 40 characters goes into nlabel, and the total length of the label goes into
lablen. If your label exceeds 40 characters, you will have the opportunity to write the
complete label in the label section described below.
Note that the length given in the last 4 bytes of the member header record indicates the
actual number of bytes for the NAMESTR structure. The size of the structure listed
above is 140 bytes.
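A reader might use that length field rather than hard-coding 140. The sketch below assumes the last 4 bytes of the 80-byte member header record hold the size as ASCII digits, as the `0140` mentioned above suggests. The function names are hypothetical, and the sample record uses placeholder bytes rather than real header text.

```python
def namestr_size(member_header: bytes) -> int:
    """Read the namestr size from the last 4 bytes of a member header record."""
    if len(member_header) != 80:
        raise ValueError("member header record must be 80 bytes")
    return int(member_header[-4:].decode("ascii"))


def split_namestrs(stream: bytes, count: int, size: int) -> list:
    """Cut `count` namestrs of `size` bytes each from the front of a stream."""
    if len(stream) < count * size:
        raise ValueError("stream is too short for the declared namestrs")
    return [stream[i * size:(i + 1) * size] for i in range(count)]


# Placeholder record: 76 filler bytes followed by the size field.
header = b"X" * 76 + b"0140"
size = namestr_size(header)
print(size)  # 140

# Two namestrs (280 bytes) followed by blank padding to 320 bytes.
stream = bytes(2 * size) + b" " * 40
print([len(n) for n in split_namestrs(stream, 2, size)])  # [140, 140]
```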
If you have any labels that exceed 40 characters, they can be placed in this section. The
label records section starts with this header:
where nnnnn is the number of variables for which long labels will be defined.
Each label is defined using the following:
where
For example, variable number 1 named x with the 43-byte label 'a very long label for x is
given right here' would be provided as a stream of 6 bytes in hex '00010001002B'X
followed by the ASCII characters.
These are streamed together. The last label descriptor is followed by ASCII blanks
('20'X) to an 80-byte boundary.
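The worked example above can be reproduced with a short sketch. The layout used here is inferred from the example bytes `00010001002B`: three big-endian 16-bit integers (variable number, name length, label length), followed by the name and then the label. That field order and the function name `encode_long_label` are assumptions.

```python
import struct


def encode_long_label(number: int, name: str, label: str) -> bytes:
    """Encode one long-label descriptor (layout inferred from the example above)."""
    name_bytes = name.encode("ascii")
    label_bytes = label.encode("ascii")
    # Assumed layout: variable number, name length, label length, each 2 bytes.
    prefix = struct.pack(">hhh", number, len(name_bytes), len(label_bytes))
    return prefix + name_bytes + label_bytes


def pad_labels(descriptors: list) -> bytes:
    """Stream descriptors together and blank-pad to an 80-byte boundary."""
    stream = b"".join(descriptors)
    return stream + b" " * (-len(stream) % 80)


descriptor = encode_long_label(1, "x", "a very long label for x is given right here")
print(descriptor[:6].hex().upper())  # 00010001002B
print(len(pad_labels([descriptor])))  # 80
```

The descriptor is 50 bytes (6 bytes of lengths, 1 byte of name, 43 bytes of label), so 30 blanks bring it to the 80-byte boundary.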
If you have any format or informat names that exceed 8 characters, regardless of the
label length, a different form of label record header is used:
where nnnnn is the number of variables for which long format names and any labels will
be defined.
Each label is defined using the following:
where
Note: The FORMAT and INFORMAT descriptions are in the form used in a FORMAT
or INFORMAT statement. For example, my_long_fmt., my_long_fmt8.,
my_long_fmt8.2. The text values are streamed together and no characters appear for
attributes with a length of 0 bytes.
For example, variable number 1 is named X and has a label of 'ABC,' no attached
format, and an 11-character informat named my_long_fmt with informat length=8 and
informat decimal=0. The data would be
The last label descriptor is followed by ASCII blanks ('20'X) to an 80-byte boundary.
Data records are streamed in the same way that namestrs are. There is ASCII blank
padding at the end of the last record if necessary. There is no special trailing record.