Automatically replace NUL (\0x00) in CSV #273

Open
turicas opened this issue Apr 3, 2018 · 8 comments · May be fixed by #282

Comments

Owner

turicas commented Apr 3, 2018

Some CSV files contain NUL characters (\x00), and the Python csv module doesn't know how to deal with them. So I think it would be a good idea for the CSV plugin to remove NULs automatically. An io.TextIOWrapper subclass will do the job, like this one:

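import io


class NotNullTextWrapper(io.TextIOWrapper):
    """Text wrapper that strips NUL characters from the data it reads."""

    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        # Drop every NUL character from the decoded text
        return data.replace('\x00', '')

    def readline(self, *args, **kwargs):
        data = super().readline(*args, **kwargs)
        # Same clean-up for line-by-line reads
        return data.replace('\x00', '')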

Sample file with this problem: http://arquivos.portaldatransparencia.gov.br/downloads.asp?a=2011&m=01&consulta=GastosDiretos

Exception raised: _csv.Error: line contains NULL byte
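
A minimal sketch of how the wrapper above might be combined with the csv module. The helper name `read_rows`, the default encoding, and the in-memory sample bytes are assumptions made for illustration and are not part of the project's code.

```python
import csv
import io


class NotNullTextWrapper(io.TextIOWrapper):
    """Text wrapper that strips NUL characters from the data it reads."""

    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        return data.replace('\x00', '')

    def readline(self, *args, **kwargs):
        data = super().readline(*args, **kwargs)
        return data.replace('\x00', '')


def read_rows(raw_bytes, encoding='iso-8859-15'):
    # Hypothetical helper: wrap a binary stream and let csv.reader iterate
    # over it, so each line is cleaned before the parser sees it.
    wrapper = NotNullTextWrapper(
        io.BytesIO(raw_bytes), encoding=encoding, newline=''
    )
    return list(csv.reader(wrapper))


# Made-up sample data with a stray NUL character inside a field
sample = 'id,name\n1,Jo\x00ão\n'.encode('iso-8859-15')
rows = read_rows(sample)
```

With a real file, the `io.BytesIO` object would be replaced by a file opened in binary mode.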

Owner Author

turicas commented May 16, 2018

Fixed on d43be1d.

turicas closed this as completed May 16, 2018
turicas reopened this May 18, 2018

Owner Author

turicas commented May 18, 2018

Reopening because of this error:
AttributeError: 'file' object has no attribute 'readable' (I think it's related to Python 2)
Maybe this thread helps.
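
A possible workaround, sketched under the assumption that the error comes from passing a Python 2 built-in `file` object to `io.TextIOWrapper`: that object has no `readable()` method, while the binary stream returned by `io.open` does. The helper name `open_text_without_nul` and the sample file are hypothetical, and this is not necessarily how the project solved it.

```python
import io
import os
import tempfile


class NotNullTextWrapper(io.TextIOWrapper):
    # Explicit super() arguments keep the class valid on Python 2 as well.

    def read(self, *args, **kwargs):
        data = super(NotNullTextWrapper, self).read(*args, **kwargs)
        return data.replace(u'\x00', u'')

    def readline(self, *args, **kwargs):
        data = super(NotNullTextWrapper, self).readline(*args, **kwargs)
        return data.replace(u'\x00', u'')


def open_text_without_nul(path, encoding='iso-8859-15'):
    # Hypothetical helper: io.open in binary mode returns a buffered reader
    # that provides readable(), which io.TextIOWrapper needs.
    binary = io.open(path, 'rb')
    return NotNullTextWrapper(binary, encoding=encoding)


# Made-up sample file containing a NUL byte
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(b'a,b\x00\n1,2\n')
try:
    with open_text_without_nul(tmp.name) as fobj:
        first_line = fobj.readline()
        rest = fobj.read()
finally:
    os.remove(tmp.name)
```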

Owner Author

turicas commented Jun 23, 2018

Reverted the merged change from #276 since it caused problems on Python 2. Trying to fix the problem in a new branch: feature/csv-remove-null-bytes.


mawkee commented Jul 2, 2019

The file is no longer accessible, but it seems you're dealing with a UTF-16 encoded file. Try using:

b = open("file.csv", "rb").read().decode("utf-16")

Owner Author

turicas commented Jul 2, 2019

@mawkee it was not a UTF-16-encoded file (this one was encoded in ISO-8859-15 but had \x00 bytes inside the data); it didn't even have the BOM.
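
A rough sketch of the BOM check mentioned above, using made-up sample bytes. The function name `has_utf16_bom` is hypothetical. Note that a missing BOM does not rule out UTF-16 on its own, so this is only a first hint.

```python
import codecs


def has_utf16_bom(raw_bytes):
    # True when the data starts with a little- or big-endian UTF-16 BOM
    return raw_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


# The "utf-16" codec writes a BOM and pads ASCII characters with NUL bytes
utf16_sample = 'a,b\n'.encode('utf-16')

# Single-byte encoded data with a stray NUL inside, and no BOM
latin_sample = 'a,\x00b\n'.encode('iso-8859-15')

utf16_has_bom = has_utf16_bom(utf16_sample)
latin_has_bom = has_utf16_bom(latin_sample)
```

Both samples contain `\x00` bytes, which is why the two situations can look alike at first glance.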


seocam commented Jul 2, 2019

Ours didn't seem to have it either, but if you open it with "rb" and then decode it, it magically works as utf-16.


mawkee commented Jul 2, 2019

@turicas got it; I tried opening the data using ftfy and it worked all right for my case

@fanden1337

Thanks for posting the code. Was also useful outside of this project.
