Autodetect encoding from CSV files #129

Open
jeanferri opened this issue Oct 20, 2015 · 5 comments
@jeanferri (Contributor) commented Oct 20, 2015

We are using rows in https://github.com/interlegis/interlegis.portalmodelo.transparency to import generic CSV files that users upload, but the tools used to generate those files (MS Excel, for example) do not always follow a standard and sometimes produce data in unexpected encodings. We need to autodetect the encoding used in the files, maybe using a library such as 'chardet' or the Linux 'file' command.

@turicas (Owner) commented Oct 22, 2015

It's a very good feature to have! I suggest implementing it as a function called rows.utils.detect_encoding so we can re-use it in other plugins.

In my experience chardet is a bit slow and can make some mistakes if you don't pass it the whole data.

I prefer to use libmagic (which the UNIX file command uses) since it's way faster than chardet. The problem is: there is no official implementation on PyPI (there are many *magic* packages there) and none of them are packaged in Debian (so it'll create some problems for the new Debian release). On the other hand, the official file release has a Debian package but not a PyPI package (there's an official Python wrapper in the file repository).

In my opinion the best approach for this issue is:

  • Add the official libmagic Python wrapper, which is already packaged in Debian, to PyPI (I'm working on it, see this issue on file's bug tracker)
  • Use this official wrapper to implement rows.utils.detect_encoding
  • Change rows.plugins.csv.import_from_csv to use rows.utils.detect_encoding

Pros:

  • Official support from libmagic
  • Debian dependency already solved
  • Faster

Cons:

  • libmagic is compiled (I think chardet is pure Python) -- but as far as I know it's available for UNIX and Windows.
  • libmagic may detect fewer encodings than chardet (need to check!)
  • We need to convert libmagic's output to Python codecs names (a simple dict solves the problem)
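
The "simple dict" from the last point could look roughly like the sketch below. The charset strings are examples of what libmagic tends to report, and the exact set still needs to be checked. `MAGIC_TO_PYTHON`, `NOT_TEXT` and `to_python_codec` are hypothetical names, not part of rows or of the libmagic wrapper, and the two mapped values are assumptions rather than confirmed equivalences.

```python
import codecs

# Hypothetical mapping for libmagic charset names that have no Python codec
# of the same name. Both entries are assumptions made for illustration.
MAGIC_TO_PYTHON = {
    "unknown-8bit": "latin-1",  # assumption: read unknown 8-bit data as Latin-1
    "ebcdic": "cp500",          # assumption: one of several EBCDIC code pages
}

# Values that mean "this is not text", so no codec applies.
NOT_TEXT = {"binary"}


def to_python_codec(magic_name, default=None):
    """Translate a libmagic charset name into a Python codec name."""
    name = magic_name.strip().lower()
    if name in NOT_TEXT:
        return default
    name = MAGIC_TO_PYTHON.get(name, name)
    try:
        # codecs.lookup resolves aliases such as "us-ascii" to canonical names
        return codecs.lookup(name).name
    except LookupError:
        return default
```

Most names (for example "utf-8" or "us-ascii") are already valid Python codec aliases, so the dict only has to cover the exceptions.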

@turicas changed the title from "Autodetec encoding from CSV files" to "Autodetect encoding from CSV files" on Nov 15, 2015

@turicas (Owner) commented Feb 2, 2016

Update on this (after some months...): I just uploaded the official version of the file wrapper to PyPI. We can install it by running:

pip install file-magic

We can use it to detect either the encoding or the file type, which could solve this issue and also #143.

I've also written a blog post about file-magic library usage.

@turicas (Owner) commented Jul 15, 2016

@jeanferri, do you still need this feature?
I did a little test here (after running pip install file-magic): I changed rows/plugins/csv.py and, inside import_from_csv, added:

import magic  # at the beginning of the file
[...]
    if encoding is None:
        # Detect the encoding from the first 4 KiB, then rewind the file
        encoding = magic.detect_from_content(fobj.read(4096)).encoding
        fobj.seek(0)

Could you please try this solution and check whether this function from the file-magic library works well for your use cases?
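
For anyone who wants to try the idea outside of rows, here is a minimal, self-contained sketch of the same "read a sample, then rewind" step. `sniff_encoding` and its parameters are hypothetical names; the detector is passed in as a plain callable so the sketch runs even without file-magic installed. With file-magic, the callable could be `lambda data: magic.detect_from_content(data).encoding`. The "utf-8" default is an assumption.

```python
import io


def sniff_encoding(fobj, detect, sample_size=4096, default="utf-8"):
    """Guess the encoding of a binary file object without consuming it.

    `detect` is any callable that takes bytes and returns an encoding name
    (or None/empty when it cannot decide).
    """
    position = fobj.tell()
    sample = fobj.read(sample_size)
    fobj.seek(position)  # rewind so the CSV reader still sees the whole file
    return detect(sample) or default


# Usage with a stand-in detector (replace it with the file-magic call above)
fobj = io.BytesIO(b"name,city\nJoao,Brasilia\n")
encoding = sniff_encoding(fobj, lambda data: "us-ascii")
```

Restoring the previous position instead of always seeking to 0 keeps the helper usable on file objects that were already partially read.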

@jeanferri (Contributor, Author) commented Jul 15, 2016

Yes, I do, please! It will help Portal Modelo process any kind of CSV file. Could you please commit this patch and make a release on PyPI?
Thank you @turicas !

@turicas (Owner) commented Jul 20, 2016

@jeanferri, I think the detection should only happen if you do not pass encoding to import_from_csv (encoding=None, which should become the default once this feature is added) and magic is available (we should also check whether magic actually comes from file-magic or from another library that provides a module with the same name). The problem is that I don't want to force everybody to install file-magic (it needs libmagic-dev and a compiler), so this feature will be optional. But if it is optional and I change the default to encoding=None, all users who have not installed file-magic will need to provide the encoding after this change.
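
One possible shape for the optional check, as a sketch and not the code in the branch: `load_file_magic` and `resolve_encoding` are hypothetical names, and falling back to "utf-8" when file-magic is missing is an assumption (it is one way to avoid forcing users without file-magic to pass an encoding). The `hasattr` test relies on file-magic exposing `detect_from_content`, the function used earlier in this thread.

```python
import importlib


def load_file_magic(module_name="magic"):
    """Return the file-magic module, or None if it is missing or a different 'magic'."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # file-magic exposes detect_from_content; other "magic" packages may not
    if not hasattr(module, "detect_from_content"):
        return None
    return module


def resolve_encoding(sample, encoding=None, default="utf-8", loader=load_file_magic):
    """Pick the encoding: explicit argument first, then detection, then default."""
    if encoding is not None:
        return encoding  # the caller's choice always wins
    magic_module = loader()
    if magic_module is None:
        return default  # file-magic not installed: keep a predictable fallback
    detected = magic_module.detect_from_content(sample).encoding
    return detected or default
```

With this order an explicit encoding never triggers detection, so existing callers keep their current behaviour.
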
About releasing: I'd like you to test it before I ship a new version, so that we can improve it if needed. I've pushed the changes to feature/129-detect-csv-encoding and you can install it by running:

pip install git+https://github.com/turicas/rows.git@129-detect-csv-encoding

Could you please test this version with your current code?

@turicas added this to the Version 0.3.0 milestone Jul 21, 2016
@turicas modified the milestones: Version 0.4.0, Version 0.3.0 Sep 2, 2016
@turicas modified the milestone: Version 0.4.0 Sep 13, 2017