Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force quoting on certain columns? #751

Closed
Kirkman opened this issue Dec 3, 2020 · 2 comments
Closed

Force quoting on certain columns? #751

Kirkman opened this issue Dec 3, 2020 · 2 comments
Labels

Comments

@Kirkman
Copy link

Kirkman commented Dec 3, 2020

Is it possible to force .to_csv() to quote particular columns when writing the CSV?

I have in mind situations like a column of zero-padded numbers. If that data is not enclosed in quotes in the CSV, then when other programs read the CSV, they interpret it as a number (1) instead of a string ("001").

If I set a column to text_type, Agate will write the column's values properly as 001, 002, etc.:

name,COUNTYFP10
Adair,001
Andrew,003
Atchison,005

Is there a way to make agate generate the CSV like this:

name,COUNTYFP10
Adair,"001"
Andrew,"003"
Atchison,"005"
@dannguyen
Copy link

tl;dr: This doesn't answer your question, but it's a long explanation of why this feature (probably) doesn't exist. And also, why the behavior you've noticed ("001" being interpreted as a string and not a number, by an importing program) is just a quirk specific to that program – it's not CSV standard behavior.

FWIW, how a CSV file is interpreted – particularly, how to typecast each column, is generally up to the interpreting program. For example, importing your latter example (i..e. with quoted COUNTYFP10 values) into a pandas.DataFrame will result in the same as importing the non-quoted data: the COUNTYFP10 column will be interpreted as integers:

>>> import pandas as pd
>>> df = pd.read_csv('/tmp/thedata.csv')
>>> print(df)
       name  COUNTYFP10
0     Adair           1
1    Andrew           3
2  Atchison           5

>>> print(df.dtypes)
name          object
COUNTYFP10     int64
dtype: object

Not sure how Agate handles the import, just using Pandas as an example of another client program that has its own ways of auto-guessing the data types of imported CSVs.

The upshot of all this: I'd be really surprised if Agate or csvkit had a way to specify quoting by column when exporting to CSV, because there wouldn't be any point. There's just no accepted standard for how to assume datatypes of imported CSV, because CSV only has an understanding of every value being text...as opposed to something like JSON, in which there's a concept of strings, numbers, and booleans.

My advice is that you should accept having to configure datatyping on the importing program's side, e.g. in Excel's Text Import Wizard.

Or in Python pandas, it would be to set the dtype argument, e.g.

>>> df = pd.read_csv('/tmp/thedata.csv', dtype={'COUNTYFP10': str})
>>> print(df)
       name COUNTYFP10
0     Adair        001
1    Andrew        003
2  Atchison        005
>>> print(df.dtypes)
name          object
COUNTYFP10    object
dtype: object

It's a pain in the ass, but this kind of vaguery is just inherent to the CSV format. You don't want to depend on application-specific quirks when figuring out the import-export workflow.

@jpmckinney
Copy link
Member

jpmckinney commented Jul 14, 2021

to_csv passes through keyword arguments to Python's csv module. You can import csv and pass quoting=csv.QUOTE_ALL or maybe even quoting=csv.QUOTE_MINIMAL to to_csv to achieve what you want. https://docs.python.org/3/library/csv.html#csv.QUOTE_ALL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants