Autodetect encoding from CSV files #129

Open
jeanferri opened this Issue Oct 20, 2015 · 5 comments

@jeanferri
Contributor

jeanferri commented Oct 20, 2015

We are using rows in https://github.com/interlegis/interlegis.portalmodelo.transparency to import generic CSV files that users upload, but some of the tools used to generate those files, such as MS Excel, do not follow a standard and produce data in unexpected encodings. We need to autodetect the encoding used in the files, perhaps with a library such as 'chardet' or with the 'file' Linux command.

turicas Oct 22, 2015

Owner

It's a very good feature to have! I suggest implementing it as a function called rows.utils.detect_encoding so we can re-use it in other plugins.

In my experience, chardet is a bit slow and can make mistakes if you don't pass it the whole data.
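The partial-data caveat applies to any detector that only sees a prefix of the file, not just chardet. A minimal sketch using only the standard library and made-up data (the helper name `decodes_as` and the sample CSV are hypothetical): the first 4096 bytes are pure ASCII, and the only non-ASCII bytes appear later.

```python
def decodes_as(data, encoding):
    """Return True if `data` (bytes) decodes cleanly with `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical CSV: about 5 KB of ASCII rows, then one ISO-8859-1 row at the end.
data = (
    b"name,city\n"
    + b"x,y\n" * 1250
    + "João,São Paulo\n".encode("iso-8859-1")
)
sample = data[:4096]  # what a detector sees if it only reads a prefix

print(decodes_as(sample, "ascii"))     # the prefix alone looks like plain ASCII
print(decodes_as(data, "ascii"))       # the whole file is not ASCII
print(decodes_as(data, "utf-8"))       # nor is it valid UTF-8
print(decodes_as(data, "iso-8859-1"))  # the encoding it was actually written in
```

A detector given only `sample` has no evidence to choose anything other than ASCII, so the sample size is a trade-off between speed and accuracy.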

I prefer to use libmagic (which the file UNIX command uses) since it's much faster than chardet. The problem is that there is no official implementation on PyPI (there are many *magic* packages there) and none of them is packaged in Debian (so it would create problems for the new Debian release). On the other hand, the official file release has a Debian package but no PyPI package (there's an official Python wrapper in the file repository).

In my opinion the best approach for this issue is:

  • Publish the official libmagic Python wrapper, which is already packaged in Debian, on PyPI (I'm working on it, see this issue on file's bug tracker)
  • Use this official wrapper to implement rows.utils.detect_encoding
  • Change rows.plugins.csv.import_from_csv to use rows.utils.detect_encoding

Pros:

  • Official support from libmagic
  • Debian dependency already solved
  • Faster

Cons:

  • libmagic is compiled (I think chardet is pure Python) -- but as far as I know it's available for UNIX and Windows.
  • libmagic may detect fewer encodings than chardet (need to check!)
  • We need to convert libmagic's output to Python codecs names (a simple dict solves the problem)
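The name conversion in the last point could be sketched as below. This is only an illustration, not the rows implementation: `LIBMAGIC_TO_PYTHON` and `to_python_codec` are hypothetical names, the labels in the dict are ones I believe file/libmagic can report, and treating "unknown-8bit" as Latin-1 is an assumption. Any label that Python already understands is resolved with `codecs.lookup`, so the dict only needs the exceptions.

```python
import codecs

# Illustrative overrides for libmagic labels that are not Python codec names.
LIBMAGIC_TO_PYTHON = {
    "unknown-8bit": "latin-1",  # assumption: fall back to Latin-1 for unidentified 8-bit text
    "binary": None,             # not text at all, so there is no codec to use
}


def to_python_codec(libmagic_name):
    """Map an encoding label reported by libmagic to a Python codec name.

    Returns None when no suitable codec is known.
    """
    name = libmagic_name.strip().lower()
    if name in LIBMAGIC_TO_PYTHON:
        return LIBMAGIC_TO_PYTHON[name]
    try:
        # codecs.lookup resolves aliases such as "us-ascii" to the canonical name
        return codecs.lookup(name).name
    except LookupError:
        return None


for label in ("us-ascii", "utf-8", "iso-8859-1", "unknown-8bit", "binary"):
    print(label, "->", to_python_codec(label))
```

Leaning on `codecs.lookup` keeps the dict small, and a `None` result gives the caller a clear signal to ask the user for an explicit encoding.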

@turicas turicas changed the title from Autodetec encoding from CSV files to Autodetect encoding from CSV files Nov 15, 2015

turicas Feb 2, 2016

Owner

An update on this (after some months...): I just uploaded the official version of the file wrapper to PyPI. We can install it by running:

pip install file-magic

We can use it to detect either the encoding or the file type, which could solve this issue and also #143.

I've also written a blog post about file-magic library usage.

turicas Jul 15, 2016

Owner

@jeanferri, do you still need this feature?
I did a little test here: after running pip install file-magic, I changed rows/plugins/csv.py and added the following inside import_from_csv:

import magic  # at the top of the file
[...]
    if encoding is None:
        # Detect the encoding from the first 4 KiB, then rewind the file
        encoding = magic.detect_from_content(fobj.read(4096)).encoding
        fobj.seek(0)

Could you please try this solution and check whether this function from the file-magic library works well for your use cases?

jeanferri Jul 15, 2016

Contributor

Yes I do, please! It will help Portal Modelo process any kind of CSV file. Could you please commit this patch and make a release on PyPI?
Thank you @turicas !

turicas Jul 20, 2016

Owner

@jeanferri I think detection should only happen if you do not pass encoding to import_from_csv (encoding=None, which should become the default once this feature is added) and magic is available (we should also check whether magic actually comes from file-magic rather than from another library that provides a module with the same name). The problem is that I don't want to force everybody to install file-magic (it needs libmagic-dev and a compiler), so this feature will be optional. But if it's optional and encoding=None becomes the default, every user who has not installed file-magic will need to provide the encoding explicitly after this change.
About releasing: I'd like you to test this before I ship a new version, so we can improve it if needed. I've pushed the changes to feature/129-detect-csv-encoding and you can install it by running:

pip install git+https://github.com/turicas/rows.git@129-detect-csv-encoding

Could you please test this version with your current code?
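The optional behaviour described in this comment could look roughly like the sketch below. It is not the code on the branch: `file_magic_detector`, `resolve_encoding` and the `detector` parameter are hypothetical names, and checking for `detect_from_content` is only a heuristic for telling file-magic apart from other packages that install a `magic` module (some of them may expose a compatible function too).

```python
import io

try:
    import magic  # provided by file-magic, but other packages use this name too
except (ImportError, OSError, AttributeError):
    # A missing package or a missing libmagic shared library both mean
    # that detection is unavailable.
    magic = None


def file_magic_detector():
    """Return a bytes -> encoding-label callable, or None if unavailable."""
    if magic is None or not hasattr(magic, "detect_from_content"):
        return None
    return lambda sample: magic.detect_from_content(sample).encoding


def resolve_encoding(fobj, encoding=None, sample_size=4096, detector=None):
    """Pick the encoding for a binary file object.

    An explicit `encoding` always wins; detection only runs for encoding=None.
    """
    if encoding is not None:
        return encoding
    detector = detector or file_magic_detector()
    if detector is None:
        raise ValueError("encoding is required when file-magic is not available")
    sample = fobj.read(sample_size)
    fobj.seek(0)  # rewind so the CSV reader starts from the beginning
    return detector(sample)


fobj = io.BytesIO(b"a,b\n1,2\n")
print(resolve_encoding(fobj, encoding="latin-1"))  # explicit value, no detection
```

Raising an error when detection is unavailable matches the concern above: users without file-magic must pass the encoding themselves. Falling back to a fixed default instead would be a different design choice.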

@turicas turicas added this to the Version 0.3.0 milestone Jul 21, 2016

@turicas turicas modified the milestones: Version 0.4.0, Version 0.3.0 Sep 2, 2016

@turicas turicas modified the milestone: Version 0.4.0 Sep 13, 2017
