We are using rows in https://github.com/interlegis/interlegis.portalmodelo.transparency to import generic CSV files uploaded by users, but sometimes the tools used to generate the CSV (MS Excel, for example) do not follow a standard and produce data in unexpected encodings. We need to autodetect the encoding used in these files, perhaps using a library such as chardet or the file Linux command.
It's a very good feature to have! I suggest implementing it as a function called rows.utils.detect_encoding so we can re-use it in other plugins.
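A minimal sketch of such a helper, assuming the third-party chardet package (the guard and the fallback parameter are illustrative additions, not part of rows):

```python
def detect_encoding(sample, fallback="utf-8"):
    """Guess the encoding of raw bytes.

    `sample` should ideally be the whole file contents, since chardet
    gets less reliable on partial data. Falls back to `fallback` when
    chardet is not installed or cannot decide.
    """
    try:
        import chardet  # third-party; may not be installed
    except ImportError:
        return fallback
    result = chardet.detect(sample)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
    return result["encoding"] or fallback
```

As a standalone function it would be easy to re-use from other plugins, as suggested.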
chardet, in my experience, is a bit slow and can make some mistakes if you don't pass it the whole data.
I prefer to use libmagic (which the file UNIX command uses) since it's way faster than chardet. The problem is: there is no official implementation on PyPI (there are many *magic* packages there) and none of them are packaged for Debian (so this would create some problems for the new Debian release). By contrast, the official file release has a Debian package but no PyPI package (there's an official Python wrapper in the file repository).
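For illustration, usage of the official file-magic wrapper looks roughly like this; the import is guarded since the package may not be installed, and checking for detect_from_content distinguishes it from the other *magic* packages (the API names here are my recollection of the wrapper, so treat them as an assumption):

```python
try:
    import magic  # the file-magic wrapper around libmagic
    # Other PyPI packages also install a "magic" module but lack this function
    HAS_FILE_MAGIC = hasattr(magic, "detect_from_content")
except ImportError:
    HAS_FILE_MAGIC = False

if HAS_FILE_MAGIC:
    # detect_from_content returns a record with mime_type, encoding and name
    info = magic.detect_from_content(b"nome,idade\njo\xc3\xa3o,30\n")
    print(info.encoding)
```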
In my opinion the best approach for this issue is:
@jeanferri I think the detection should only happen if you do not specify encoding as a parameter to import_from_csv (encoding=None, which should be the default after adding this feature) and magic is available (we should also check whether magic actually comes from file-magic or from another library that provides a module with the same name). The problem is: I don't want to force everybody to install file-magic (because it needs libmagic-dev and a compiler), so this feature will be optional. If it's optional and I change the default to encoding=None, all users who have not installed file-magic will need to provide the encoding after this change.
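The behaviour described above could be wired up roughly as follows; import_from_csv is reduced to a stub here, since only the encoding handling is the point, and the helper name is hypothetical:

```python
def _detect_or_default(data, default="utf-8"):
    """Detect encoding via file-magic when available, else use `default`."""
    try:
        import magic
    except ImportError:
        return default  # feature is optional: no file-magic, keep old behaviour
    if not hasattr(magic, "detect_from_content"):
        return default  # some other package named "magic" shadows file-magic
    return magic.detect_from_content(data).encoding or default

def import_from_csv(data, encoding=None):
    """Stub: encoding=None (the proposed new default) triggers detection."""
    if encoding is None:
        encoding = _detect_or_default(data)
    return data.decode(encoding)
```

With this shape, users who pass an explicit encoding see no change, and detection only kicks in for the new encoding=None default.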
About releasing: I'd like you to test it before I ship a new version, so we can improve it if needed. I've pushed the changes to feature/129-detect-csv-encoding and you can install them by running: