A lot of warnings for mo files. #11

JulienPalard · 2017-04-23T20:36:21Z

I think .mo files should simply be ignored, since they are not plain text. Instead, I'm getting loads of:

WARNING:pygount:cannot read ./git-clones/tendenci/tendenci/tendenci/locale/it/LC_MESSAGES/django.mo using encoding cp1252: 'charmap' codec can't decode byte 0x9d in position 48: character maps to <undefined>

roskakori · 2017-05-04T05:12:05Z

Technically pygments identifies *.mo as Modelica source code:

>>> import pygments.lexers
>>> lexer = pygments.lexers.get_lexer_for_filename('some.mo')
>>> lexer.name
'Modelica'

Automatically excluding binary files from the analysis makes sense in theory. However, detecting binaries is non-trivial. The most common approach seems to be checking for zero bytes, as Subversion and git (gitattributes) do. The actual code from git is:

#define FIRST_FEW_BYTES 8000
int buffer_is_binary(const char *ptr, unsigned long size)
{
        if (FIRST_FEW_BYTES < size)
                size = FIRST_FEW_BYTES;
        return !!memchr(ptr, 0, size);
}

In the case of pygments it would make sense to treat files that start with a UTF-16 or UTF-32 header (BOM) as text, despite the many zero bytes in them.
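For illustration, here is a rough Python port of the git heuristic quoted above. It is only a sketch: the Python function reuses the C name for readability, and the sample strings are made up. It shows why UTF-16 text trips a plain zero-byte check.

```python
FIRST_FEW_BYTES = 8000  # same limit as in the git snippet


def buffer_is_binary(data: bytes) -> bool:
    """Return True if a zero byte occurs within the first FIRST_FEW_BYTES."""
    return b"\x00" in data[:FIRST_FEW_BYTES]


plain = "hello world\n".encode("utf-8")
wide = "hello world\n".encode("utf-16")  # BOM followed by 2-byte code units

print(buffer_is_binary(plain))  # False
print(buffer_is_binary(wide))   # True, although the content is text
```

Every ASCII character encoded as UTF-16 carries a zero byte, so the check alone cannot tell such files from real binaries.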

JulienPalard · 2017-05-04T08:27:44Z

Nice analysis!

Looks like diff from diffutils 3.3 does the same:

#define binary_file_p(buf, size) (memchr (buf, 0, size) != 0)

So diff reports UTF-16 files as binary:

$ cat ChangeLog | iconv -f utf8 -t utf16 > ChangeLog.utf16
$ diff ChangeLog ChangeLog.utf16
Binary files ChangeLog and ChangeLog.utf16 differ

Looks like your code already sniffs the BOM, so just skipping files that contain zero bytes after BOM detection should be enough?

roskakori · 2017-05-04T20:36:29Z

I implemented the proposed solution in v0.9 (check for a BOM first, then for zero bytes within the initial 8K).

It's already available from PyPI so you can give it a try.
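A minimal sketch of the "BOM first, then zero bytes" check described above might look like the following. This is not pygount's actual code: the names `looks_binary`, `is_binary_file`, `_TEXT_BOMS` and `_SNIFF_SIZE` are hypothetical, and reading "8K" as 8192 bytes is an assumption.

```python
import codecs

# Byte order marks that identify a file as Unicode text.
_TEXT_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
)
_SNIFF_SIZE = 8192  # assumed meaning of "the initial 8K"


def looks_binary(head: bytes) -> bool:
    """Classify the leading bytes of a file as binary (True) or text (False)."""
    if head.startswith(_TEXT_BOMS):
        # A BOM marks the data as text, even if zero bytes follow.
        return False
    return b"\x00" in head[:_SNIFF_SIZE]


def is_binary_file(path: str) -> bool:
    """Read only the beginning of the file at path and classify it."""
    with open(path, "rb") as source_file:
        return looks_binary(source_file.read(_SNIFF_SIZE))


# Start of a little-endian gettext .mo file: magic number, then revision 0.
mo_head = b"\xde\x12\x04\x95\x00\x00\x00\x00"
print(looks_binary(mo_head))                 # True
print(looks_binary("text".encode("utf-16"))) # False, thanks to the BOM
```

Checking the BOM before searching for zero bytes is what keeps UTF-16 and UTF-32 sources in the analysis while compiled message catalogs are skipped.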

roskakori self-assigned this Apr 24, 2017

roskakori added the enhancement label Apr 24, 2017

roskakori added a commit that referenced this issue May 4, 2017

#11: A lot of warnings for mo files.

b1f963b

Added detection of binary files and excluded them from the analysis. In particular Django model objects (``*.mo``) are not considered Modelica source code anymore.

roskakori added a commit that referenced this issue May 4, 2017

#11: A lot of warnings for mo files.

a812995

Cleaned up PEP8 issue.

roskakori added this to the v0.9 milestone May 4, 2017

roskakori closed this as completed May 4, 2017
