improve csv.Sniffer().sniff() behavior #73591
I'm trying to use csv.Sniffer().sniff(sample_data) to determine the delimiter on a number of input files. Through some trial and error, many "Could not determine delimiter" errors, and analyzing how this routine works/behaves, I settled on sample_data being some number of lines of the input file, particularly 30. This value seems to allow the routine to work more frequently, although not always, particularly on short input files. I realize the way this routine works is somewhat idiosyncratic, and it won't be so easy to improve it generally, but there's one simple change that occurred to me that would help in some cases. Currently the function _guess_delimiter() in csv.py contains the following lines:

    # build a list of possible delimiters
    modeList = modes.items()
    total = float(chunkLength * iteration)

So total is increased by chunkLength on each iteration. The problem occurs when total becomes greater than the length of sample_data, that is, the iteration would go beyond the end of sample_data. That reading is handled fine, it's truncated at the end of sample_data, but total is needlessly set too high. My suggested change is to add the following two lines after the above:

    if total > len(data):
        total = float(len(data))
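For readers who want to reproduce the sampling approach described above, here is a minimal sketch. The helper name `sniff_delimiter` and its `max_lines` parameter are hypothetical and not part of the csv module; only `csv.Sniffer().sniff()` and `csv.Error` come from the standard library. It assumes the sample is taken from the start of the stream, which the report does not state explicitly.

```python
import csv
import io
from itertools import islice


def sniff_delimiter(text_stream, max_lines=30):
    """Guess the delimiter from at most `max_lines` lines of a text stream.

    Hypothetical helper: returns the delimiter character, or None when
    the sniffer raises csv.Error (e.g. "Could not determine delimiter").
    """
    # Join up to max_lines lines into one sample string for the sniffer.
    sample = "".join(islice(text_stream, max_lines))
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        return None


# Usage with an in-memory stream standing in for an input file.
stream = io.StringIO("a;b;c\n1;2;3\n4;5;6\n7;8;9\n")
print(sniff_delimiter(stream))
```

Returning None instead of propagating the exception is only one possible design choice; callers that need to distinguish failure modes may prefer to let csv.Error propagate.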
FWIW, it might be more concise and more consistent with the existing code to change the one line to:

    total = min(float(chunkLength * iteration), float(len(data)))
Sounds reasonable. IIUC, if the sample data has 11 lines, the total could be 20. I also think the second min is redundant. Would you mind reviewing my patch, Milt?
That's right, with 11 lines in the sample data, total will become 20 on the second iteration. And that throws off some of the computations done in that function. Your patch looks good, in that it will achieve what I'm requesting. But :-), your pointing out that other redundant min() made me take a closer look at the code, and led me to produce the attached patch as an alternate suggestion. I think it makes the code a bit more sensible and cleaner. Please review, and go with what you think best. Thanks.
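The 11-line case discussed above can be checked with a simplified model of the chunk loop. This is a sketch, not the code from csv.py: the function name `chunk_totals` is hypothetical, a chunk length of 10 is assumed (consistent with 11 lines giving a total of 20), and the model ignores the delimiter analysis and any early exit once a delimiter is found.

```python
def chunk_totals(num_lines, chunk_length=10):
    """Model how `total` grows as the sample is read chunk by chunk.

    Returns (unclamped, clamped) as of the last iteration, where
    `unclamped` is chunk_length * iteration and `clamped` caps that
    value at the number of lines actually present in the sample.
    """
    iteration = 0
    start, end = 0, chunk_length
    unclamped = clamped = 0.0
    while start < num_lines:
        iteration += 1
        # The lines read in this pass would be data[start:end]; slicing
        # truncates at the end of the sample, but the count below does not.
        unclamped = float(chunk_length * iteration)
        clamped = float(min(chunk_length * iteration, num_lines))
        start = end
        end += chunk_length
    return unclamped, clamped


print(chunk_totals(11))  # (20.0, 11.0)
```

With 11 lines, the second pass reads a single line, yet the unclamped total counts a full chunk of 10, so any ratio computed against it is understated.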
New changeset 724d1aa7589b by Xiang Zhang in branch 'default':
Thanks Milt. I committed my change, not because it's better, but because I want to keep the change small so others aren't thrown off by unfamiliar code. :-)
Assuming it doesn't cause any behavior changes, I find Milt's patch simple enough and easier to understand than the version that uses the 'iteration' variable.
I am fine with any version (both are simple and not the hardest part to understand in the logic). :-) I have no opinion on which is better.