Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

custom formatters for to_csv #4668

Open
cpcloud opened this Issue · 11 comments

6 participants

@cpcloud
Collaborator

SO question

something like

df.to_csv(format='%10.4f', sep=' ')
@nehalecky

Most legacy Fortran 77 based simulators (of which many still are actively used in different scientific communities) such as TOUGH2, (which I've had the pleasure of working with extensively), often have data input file subroutines that have hardcoded fields widths. This makes life interesting, and often as much time is spent in pre processing as in post processing. Scientists I worked with had folders of rigid bash scripts and even Fortran routines to attempt to manage these 'input decks' creation—oh the scientist-hours lost. :cry:.

Input decks can contain information such as mesh geometry and physical properties, system initial conditions (i.e., thermodynamic state of each element and transport states between them), and simulator operation parameters. While it's often the case that pandas is used to analyze measured or resultant data, I certainly could envision using it to manipulate input files (indeed, I wrote an entire perl library to do this, before discovering python and pandas). With functionality to manage such input decks, it is fundamental to have fixed width output.

Clearly there are much better ways to interact with fortran libraries (f2py / numpy) but I can tell you that (some) scientists are simply interested in getting a simulation up and running. If pandas already has them hooked for data analysis use, there could be a large benefit from such functionality. If had I access to such a tool, my graduate student life would have been a who lot more social :wink:.

@jreback
Owner

@nehalecky so you want either to_csv to have a fixed width mode

what kind of an API would you see here?

we have been toying with the idea of passing a style parameter to these output routines which could be a class (pandas would provide a base class) that could be overridden for really custom behavior

but easy to see a FIxedWidthWriter

or maybe overkill an just need something straightforward?

@hayd
Collaborator

Perhaps should be float_format to match with options.display.... actually atm that requires a formatter (e.g. '{:10.4f}'.format), maybe should also accept strings like '%10.4f'...

@patricktokeeffe

I think per-column functionality should be added to this list, similar to how read_csv's dtype and na_values accept per-column parameters as a dict.

That would allow users, for example, to apply a different float format to the timestamp than the data columns. Or change the time formatting to military format. (Date formatting was touched in PR #4313 but not time IIUC)

@jreback
Owner

this is really just waiting on a nice API that either does what you are suggesting / templates or both

and of course someone to work on this....

it would not be hard to extend float_format/date_format to accept a dict of columns to format

e.g.

date_format={'A' : '%Y%m%d', 'B' : '%y'}

@nehalecky

A per column template, as suggested by @jreback I think would be grand. For large/complex column arrangements, you could use a series beforehand to prescribe slices across certain columns and generate a dict. :)

@cancan101

Currently the date_format argument is a little unclear as to what it does when the value being formatted is a "date" (datetime w/o a time) as opposed to a "datetime" (datetime w/ a time). At present, it treats these alike and uses the same formatter for each. This is different from how a DatetimeIndex is formatted to CSV. In that case, the formatting code detects if all of the values in the index do not contain times in which cases it only formats the date component. See:

df = pd.DataFrame({'a':[datetime.datetime(2013,1,1)]}, index=pd.to_datetime([datetime.datetime(2013,1,1)]))
io = StringIO()
df.to_csv(io)

In [12]: print io.getvalue()
,a
2013-01-01,2013-01-01 00:00:00

I would suggest having some way how to format datetime w/o time different from datetime with times.

@hayd
Collaborator

@cancan101 could have a flag to drop the minutes / seconds if 00:00:00 (not sure on good name). Could do with an example of date_format in doc, think it would make it clearer (or use default.

Should these be in options.io ?

@cancan101

@hayd I assume you mean drop the hours, minutes, and seconds (ie the time component of the datetime)?

I think the option and its name depend on how it will work: should it be an "intelligent" format that only prints the time component if needed (i.e. if any of the datetime values have a time != midnight, see #5701) or should it work as a truncate where datetimes are truncated to just dates.

@jreback jreback modified the milestone: 0.15.0, 0.14.0
@cpcloud
Collaborator

Did we ever settle on an API here?

@jreback
Owner

I think need to create a Format object

Format(col or cols, format=None, default=None)

so this easiky handles date_fornat and float_format (for back compat)

and handles ability to customize as well

@jreback jreback modified the milestone: 0.16.0, Next Major Release
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.