A lot of warnings for mo files. #11

Closed
JulienPalard opened this issue Apr 23, 2017 · 3 comments
@JulienPalard

I think that .mo files should just be ignored as they are not plain text. Instead, I'm getting loads of warnings like:

WARNING:pygount:cannot read ./git-clones/tendenci/tendenci/tendenci/locale/it/LC_MESSAGES/django.mo using encoding cp1252: 'charmap' codec can't decode byte 0x9d in position 48: character maps to <undefined>
@roskakori roskakori self-assigned this Apr 24, 2017
@roskakori
Owner

roskakori commented May 4, 2017

Technically pygments identifies *.mo as Modelica source code:

>>> import pygments.lexers
>>> lexer = pygments.lexers.get_lexer_for_filename('some.mo')
>>> lexer.name
'Modelica'

Automatically excluding binary files from the analysis makes sense in theory. However, detecting binaries is nontrivial. The most common approach seems to be checking for zero bytes, as done by Subversion and gitattributes. The actual code from git is:

#define FIRST_FEW_BYTES 8000
int buffer_is_binary(const char *ptr, unsigned long size)
{
        if (FIRST_FEW_BYTES < size)
                size = FIRST_FEW_BYTES;
        return !!memchr(ptr, 0, size);
}

In the case of pygments, it would make sense to treat files with a UTF-16 or UTF-32 header (BOM) as text despite the many zero bytes in them.
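
To illustrate the UTF-16 point, here is a rough Python equivalent of the git heuristic quoted above. The name `looks_binary` is made up for this sketch and is not part of git, pygments, or pygount.

```python
FIRST_FEW_BYTES = 8000  # same limit as in the git snippet


def looks_binary(data: bytes) -> bool:
    """Return True if a zero byte occurs within the first few bytes."""
    return b"\x00" in data[:FIRST_FEW_BYTES]


text = "Hello, world\n"
print(looks_binary(text.encode("utf-8")))   # False: plain UTF-8 text has no zero bytes
print(looks_binary(text.encode("utf-16")))  # True: every ASCII character carries a zero byte
```

So a plain zero-byte check would wrongly skip UTF-16 source files unless the BOM is looked at first.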

@JulienPalard
Author

Nice analysis!

Looks like diff from diffutils 3.3 does the same:

#define binary_file_p(buf, size) (memchr (buf, 0, size) != 0)

So diff reports UTF-16 files as binary:

$ cat ChangeLog | iconv -f utf8 -t utf16 > ChangeLog.utf16
$ diff ChangeLog ChangeLog.utf16
Binary files ChangeLog and ChangeLog.utf16 differ

Looks like your code already sniffs the BOM so just skipping files with zeros after BOM detection should be enough?

roskakori added a commit that referenced this issue May 4, 2017
Added detection of binary files and excluded them from the analysis. In particular Django model objects (``*.mo``) are not considered Modelica source code anymore.
roskakori added a commit that referenced this issue May 4, 2017
@roskakori roskakori added this to the v0.9 milestone May 4, 2017
@roskakori
Owner

I implemented the proposed solution in v0.9 (check for a BOM first, then for zero bytes within the initial 8K).

It's already available from PyPI so you can give it a try.
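
For anyone reading along, the check described above could look roughly like this. It is only a sketch under assumptions, not the actual pygount code: the name `is_probably_binary`, the BOM table, and reading "8K" as 8192 bytes are all choices made for this illustration.

```python
import codecs

# Assumed set of BOMs that mark data as text even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def is_probably_binary(data: bytes, limit: int = 8192) -> bool:
    """Check for a BOM first, then for zero bytes within the first `limit` bytes."""
    if data.startswith(_TEXT_BOMS):
        return False  # a known BOM means text, regardless of zero bytes
    return b"\x00" in data[:limit]


# Bytes resembling the start of a little-endian gettext .mo file
# (magic number 0x950412de followed by a zero revision field).
mo_header = b"\xde\x12\x04\x95\x00\x00\x00\x00"

print(is_probably_binary(mo_header))               # True
print(is_probably_binary("abc".encode("utf-16")))  # False
```

One trade-off of this approach is that a binary file which happens to start with the bytes of a BOM would still be treated as text.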
