CSV has_headers heuristic could be improved #87791
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = None closed_at = <Date 2021-07-30.17:30:38.304> created_at = <Date 2021-03-25.18:18:35.215> labels = ['type-bug', 'library', '3.10', '3.11'] title = 'CSV has_headers heuristic could be improved' updated_at = <Date 2021-07-30.17:30:38.304> user = 'https://bugs.python.org/ejacq'
activity = <Date 2021-07-30.17:30:38.304> actor = 'lukasz.langa' assignee = 'none' closed = True closed_date = <Date 2021-07-30.17:30:38.304> closer = 'lukasz.langa' components = ['Library (Lib)'] creation = <Date 2021-03-25.18:18:35.215> creator = 'ejacq' dependencies =  files = ['49915', '50131', '50132'] hgrepos =  issue_num = 43625 keywords = ['patch'] message_count = 13.0 messages = ['389515', '389528', '396663', '396699', '396714', '396716', '396720', '396722', '396733', '398575', '398581', '398582', '398586'] nosy_count = 6.0 nosy_names = ['skip.montanaro', 'rhettinger', 'lukasz.langa', 'miss-islington', 'andrei.avk', 'ejacq'] pr_nums = ['26939', '27494'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue43625' versions = ['Python 3.10', 'Python 3.11']
Here is a sample of CSV input:
When calling has_header() from csv.py on this sample, it returns False.
I think the heuristic would work better if, rather than just comparing number types, it also tried casting the values in the order int -> float -> complex. If the values in a column are compatible under that promotion, the promoted type should be treated as the type of the column.
In the end, this file would be considered to have float columns with headers.
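The promotion idea described here could be sketched roughly as follows. This is only an illustration of the proposal, not the code in csv.py; the names `NUMERIC_TYPES`, `narrowest_type` and `column_type` are hypothetical.

```python
# Hypothetical sketch of the int -> float -> complex promotion idea.
# None of these names exist in the csv module.
NUMERIC_TYPES = (int, float, complex)  # ordered narrowest to widest


def narrowest_type(text):
    """Return the first of int, float, complex that parses text, else None."""
    for candidate in NUMERIC_TYPES:
        try:
            candidate(text)
        except (ValueError, OverflowError):
            continue
        return candidate
    return None


def column_type(values):
    """Promote a column to the widest numeric type needed by any value.

    Returns None if the column is empty or any value is not numeric.
    """
    widest = None
    for text in values:
        found = narrowest_type(text)
        if found is None:
            return None  # a non-numeric value: no common numeric type
        if widest is None or NUMERIC_TYPES.index(found) > NUMERIC_TYPES.index(widest):
            widest = found
    return widest
```

Under this sketch a column holding "1", "2.5" and "3" is treated as a float column, so a textual first row would stand out against it.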
I assume the OP is referring to this sort of usage:
>>> sniffer = csv.Sniffer()
>>> raw = open("mixed.csv").read()
>>> sniffer.has_header(raw)
False
I really wish the Sniffer class had never been added to the CSV module. I can't recall who wrote it (the author is long gone). Though I am responsible for the initial commits, it wasn't me or the main authors of csvmodule.c. As far as I know, it never really worked well. I can't recall ever using it.
A simpler heuristic would be if the first row contains a bunch of strings and the second row contains a bunch of numbers, then the file has a header. That assumes that CSV files consist mostly of numeric data.
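A minimal sketch of that simpler heuristic, assuming "a bunch of" means "every cell" and that anything complex() accepts counts as a number. The helper names `is_number` and `looks_like_header` are hypothetical and not part of the csv module.

```python
import csv
import io


def is_number(text):
    """True if text parses as a number (assumption: complex() is the test)."""
    try:
        complex(text)
    except ValueError:
        return False
    return True


def looks_like_header(sample):
    """Guess a header: first row all non-numeric, second row all numeric."""
    rows = list(csv.reader(io.StringIO(sample)))
    if len(rows) < 2:
        return False  # nothing to compare the first row against
    first, second = rows[0], rows[1]
    if not first or not second:
        return False  # blank rows give no evidence either way
    first_is_text = not any(is_number(cell) for cell in first)
    second_is_numeric = all(is_number(cell) for cell in second)
    return first_is_text and second_is_numeric
```

As noted, this only helps when the data rows are numeric; a file of names and cities would be reported as having no header.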
Looking at has_header, I see this:
for thisType in [int, float, complex]:
I think this particular problem would be solved if the order of those types were reversed. The attached diff suggests that as well. Note that the Sniffer class currently contains no test cases, so the fact that the test I added failed before the change and passes after it doesn't mean the change won't break someone's mission-critical Sniffer usage.
(Sorry, Raymond. My GitHub-fu is insufficient to allow me to fork, apply the diff and create a PR.)
Skip: If I understand right, in the patch the last two types, float and int, will never have an effect, because if float(x) and int(x) succeed, so will complex(x), and conversely, if complex(x) fails, float and int will also fail.
So the effect of the patch will be to tolerate any mix of numeric columns when the headers are textual. That sounds fine to me; I just want to confirm it sounds good to you, because the unit test in the patch covers a much narrower case.
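The claim that complex() accepts everything int() and float() accept can be spot-checked with a few strings. This is a quick illustration over hand-picked samples, not a proof; `parses_as` is a hypothetical helper.

```python
def parses_as(kind, text):
    """True if kind(text) succeeds for the given string."""
    try:
        kind(text)
    except (ValueError, OverflowError):
        return False
    return True


samples = ["1", "-7", "2.5", "1e3", "3+4j", "abc", ""]

for text in samples:
    as_int = parses_as(int, text)
    as_float = parses_as(float, text)
    as_complex = parses_as(complex, text)
    # Whenever int or float accepts the string, complex accepts it too.
    assert not (as_int or as_float) or as_complex
    print(f"{text!r}: int={as_int} float={as_float} complex={as_complex}")
```

For these samples, trying complex first (or only) gives the same accept/reject decision as trying all three types.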
I think the test should then be something like:
a b c
and the code should be updated to just do
Another, stricter option would be to special-case
Thanks @andrei.avk. You are right, only the complex test is required.
I suppose it's okay to commit this, but reviewing the full code of the has_header method leaves me thinking this is just putting lipstick on a pig. If I read the code correctly, there are two (undocumented) tacit assumptions:
The second criterion means this has a header:
but this doesn't:
It seems to me that it would be a good idea to at least expand on the documentation of that method and maybe add at least one test case where the CSV sample doesn't have a header. I'll try to get that done and attach a patch.
I retract my comment about fixed length strings in the non-numeric case. There are clearly test cases (which I probably wrote, considering the values) where the sample has a header but the values are of varying length. Misread of the code on my part. I have obviously not had enough coffee yet this morning.
Here is a change to the has_header documentation and an extra test case documenting the behavior when the sample contains strings. I'm not sure about the wording of the doc change, perhaps you can tweak it? Seems kind of clumsy to me. If it seems okay to you @andrei.avk, can you fold it into your PR?