[discuss] Guess if a CSV File is CSV or Not #95

Open
rufuspollock opened this issue Nov 10, 2014 · 12 comments

Comments

@rufuspollock
Owner

What: we want this to be a simple library (likely node and/or python) for guessing whether a given file is csv or not.

Why (is this not trivial): CSV is plain text and has no special markers. Commas (and even tabs) turn up everywhere (in HTML, PDF, Word), and CSV can come in various file encodings, so it is not simple to give a confidence estimate as to whether a given byte stream is CSV or not.

Please add your thoughts here on features, design and research

Research

  • magic (libmagic) - is that useful here?

See Also

@brew
Collaborator

brew commented Nov 10, 2014

Anything in csvkit that can help?

@rufuspollock
Owner Author

@brew good suggestion, but nothing specific that I know of. My guess is that you'll want to do a tiny bit of statistical analysis (similar to, but different from, messytables): attempt to parse the input as CSV and then, based on the output, guess whether it really is CSV. For example, columns with a huge number of characters, or with massively varying numbers of characters, would imply that you are parsing something that is not CSV as if it were.
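A minimal sketch of that parse-then-score idea, using only Python's standard `csv` module. The function name `csv_confidence`, the thresholds, and the scoring formula are all assumptions made for illustration; nothing here comes from messytables or csvkit.

```python
import csv
import io
from collections import Counter


def csv_confidence(text, delimiter=",", max_rows=200, max_cell_len=500):
    """Return a rough score in [0, 1] for how CSV-like `text` looks.

    Hypothetical heuristic: parse as CSV, reward consistent column counts,
    and penalise single-column output and very long cells.
    """
    rows = []
    try:
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
        for row in reader:
            if row:  # skip completely blank lines
                rows.append(row)
            if len(rows) >= max_rows:
                break
    except csv.Error:
        return 0.0
    if len(rows) < 2:
        return 0.0

    # How many rows share the most common column count?
    widths = Counter(len(row) for row in rows)
    modal_width, modal_count = widths.most_common(1)[0]
    if modal_width < 2:
        return 0.0  # the delimiter rarely splits lines: likely prose or markup

    consistency = modal_count / len(rows)
    cells = [cell for row in rows for cell in row]
    long_fraction = sum(len(cell) > max_cell_len for cell in cells) / len(cells)
    return consistency * (1.0 - long_fraction)
```

A score near 1 means the rows parse into a stable number of short cells; a score of 0 means the text never split into columns consistently. Where to draw the line between the two is left open here.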

@holgerd77

What's with CSV Lint? Thought about using this in the context of Farmsubsidy/Openspending?

@rufuspollock
Owner Author

@holgerd77 I believe that CSVLint is more of a validator of a CSV against a given schema, not a check of whether something is CSV at all (i.e. it is more like https://github.com/okfn/json-table-schema-validator).

@holgerd77

Hmm, I didn't dig into it very much, but according to the README.md it also provides various basic validation; below is an extract of the error types it detects. I'll definitely take a closer look at this.

Errors

The following types of error can be reported:

  • :wrong_content_type -- content type is not text/csv
  • :ragged_rows -- row has a different number of columns (than the first row in the file)
  • :blank_rows -- completely empty row, e.g. blank line or a line where all column values are empty
  • :invalid_encoding -- encoding error when parsing row, e.g. because of invalid characters
  • :not_found -- HTTP 404 error when retrieving the data
  • :stray_quote -- missing or stray quote

...

@rufuspollock
Owner Author

@holgerd77 right - those are actual issues with the CSV itself (as tabular data - i.e. you have blank rows at the top etc). It is not about checking if this thing I'm looking at which I think is csv is actually xls or html or pdf ...

@holgerd77 you may also want to check out https://github.com/okfn/json-table-schema-validator, which has some similarities to csvlint but is a library-only, nodejs version.

@adamamyl

magic (libmagic) - is that useful here?

and

It is not about checking if this thing I'm looking at which I think is csv is actually xls or html or pdf ...

so essentially, a wrapper around file(1) like https://pypi.python.org/pypi/filemagic/1.6 ?

@rufuspollock
Owner Author

@adamamyl indeed, that may be the best simple option. Pointers to the part of the algorithm where they identify e.g. csv vs html vs xlsx would be welcome (maybe there is no one place!).

@adamamyl

@adamamyl indeed, that may be the best simple option. Pointers to the part of the algorithm where they identify e.g. csv vs html vs xlsx would be welcome (maybe there is no one place!).

A quick ack through the latest source for file doesn't yield any results for 'csv'… odd.

@pwalsh

pwalsh commented Nov 19, 2014

There is also Apache Tika (written in Java) which has bindings in Python and Node.

@pwalsh

pwalsh commented Nov 19, 2014

Question:

Do we want to know whether a given file is CSV, or whether the contents of a file can be read as CSV?

Example:

If we detect XLS, open a sheet, and can parse the stream as CSV: is it CSV?

@pwalsh

pwalsh commented Nov 27, 2014

So, using libmagic (based on some checks I've done with the Python bindings) is quite useful for picking up the obvious stuff from a file.

One important point is that, when directly passed a stream which is valid CSV, libmagic can only identify it as plain text.

So, in cases where file extension and mime-type are wrong (say, plaintext .txt) yet the streamed data is CSV, we'd get a false negative.
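One way around that false negative is a two-stage check: rule out the obvious binary and markup formats first, and only then test the remaining plain text for CSV-like structure. The sketch below is an illustration under stated assumptions, not the libmagic approach itself: a small hand-written signature table stands in for libmagic, and the standard library's `csv.Sniffer` stands in for a real content heuristic. The function name `guess_kind` and the labels it returns are hypothetical.

```python
import csv

# Leading byte signatures of a few formats that get mislabelled as CSV.
# This tiny table is a stand-in for libmagic, not a replacement for it.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip container (e.g. xlsx, docx)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 container (e.g. xls, doc)",
}


def guess_kind(data):
    """Classify the start of a byte stream, ignoring extension and mime type."""
    # Stage 1: the obvious stuff, recognisable from the first bytes.
    for magic_bytes, label in SIGNATURES.items():
        if data.startswith(magic_bytes):
            return label
    head = data[:1024].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"

    # Stage 2: it is probably text; ask whether it has CSV-like structure.
    try:
        text = data.decode("utf-8-sig")  # assumption: UTF-8, optional BOM
    except UnicodeDecodeError:
        return "unknown"
    try:
        csv.Sniffer().sniff(text, delimiters=",;\t|")
    except csv.Error:
        return "text"
    return "csv"
```

With this shape, a `.txt` file containing comma-separated rows is reported as `"csv"` because the decision depends only on the bytes. `csv.Sniffer` is itself a heuristic and can be fooled, so treat stage 2 as a placeholder for a stronger test.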

@rufuspollock rufuspollock transferred this issue from another repository Mar 25, 2020
@rufuspollock rufuspollock changed the title [discuss] Features and Design [discuss] Guess if a CSV File is CSV or Not: Features and Design Sep 20, 2020
@rufuspollock rufuspollock changed the title [discuss] Guess if a CSV File is CSV or Not: Features and Design [discuss] Guess if a CSV File is CSV or Not Sep 20, 2020

5 participants