Merge pull request #153 from roskakori/122-add-code-count-to-json-output
#122 Add code count to JSON format
roskakori committed May 13, 2024
2 parents 10b1ae3 + 7763db2 commit 79b061f
Showing 8 changed files with 193 additions and 60 deletions.
81 changes: 58 additions & 23 deletions docs/background.rst
@@ -1,33 +1,43 @@
Background
##########

.. _How to count code:

How pygount counts code
--------------------------
-----------------------

Pygount primarily counts the physical lines of source code. It begins by using
lexers from Pygments, if available. If Pygments doesn't have a suitable lexer,
pygount employs its own internal lexers to differentiate between code and
comments. These include:

- Minimalist lexers for m4, VBScript, and WebFOCUS, capable of distinguishing between comments and code.
- The Java lexer repurposed for OMG IDL.

Pygount basically counts physical lines of source code.
Additionally, plain text is treated with a separate lexer that considers all lines as comments.

First, it lexes the code using the lexers ``pygments`` assigned to it. If
``pygments`` cannot find an appropriate lexer, pygount has a few additional
internal lexers that can at least distinguish between code and comments:
Lines consisting solely of comment tokens or whitespace are counted as comments.

* m4, VBScript and WebFOCUS use minimalistic lexers that can distinguish
between comments and code.
* OMG IDL repurposes the existing Java lexer.
Lines with only whitespace are ignored.

Furthermore plain text has a separate lexer that counts all lines as comments.
All other content is considered code.

Lines that only contain comment tokens and white space count as comments.
Lines that only contain white space are not taken into account. Everything
else counts as code.
White characters
----------------

If a line contains only "white characters" it is not taken into account
presumably because the code is only formatted that way to make it easier to
read. Currently white characters are::
A line containing only "white characters" is also ignored because they do not
contribute to code complexity in any meaningful way. Currently white
characters are::

(),:;[]{}

Because of that, pygount reports about 10 to 20 percent fewer SLOC for C-like
languages than other similar tools.
Because of that, pygount tends to report about 5 to 15 percent fewer SLOC for
C-like languages than other similar tools.
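
As a rough illustration of the rule above, the following minimal sketch checks
whether a line consists only of white characters. It works on raw text rather
than on Pygments tokens, so it is a simplification and not pygount's actual
implementation. The name ``is_white_line`` is made up for this example.

```python
# Hypothetical helper for illustration only, not part of pygount's API.
WHITE_CHARACTERS = "(),:;[]{}"


def is_white_line(line: str) -> bool:
    """Return True if the line holds nothing but whitespace and white characters."""
    return all(ch.isspace() or ch in WHITE_CHARACTERS for ch in line)


# A closing line such as "    });" would be ignored under this rule,
# while "return x;" would still count as code.
print(is_white_line("    });"))
print(is_white_line("return x;"))
```

Lines like a lone closing brace are common in C-like languages, which is why
this rule can noticeably lower the reported SLOC there.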

.. _No operations:

No operations
-------------

For some languages "no operations" are detected and treated as white space.
For example Python's ``pass`` or Transact-SQL's ``begin`` and ``end`` .
@@ -45,6 +55,31 @@ As example consider this Python code:
This counts as 1 line of code and 3 lines of comments. The line with ``pass``
is considered a "no operation" and thus not taken into account.

.. _Pure string lines:

Pure string lines
-----------------

Many programming languages support the concept of strings, which typically
contain text to be shown to the end user or simple constant values.
Similar to white characters and "no operations", in most cases they do not
add much to the complexity of the code. Notable exceptions are strings
containing code for domain specific languages, templates or SQL statements.

Pygount currently takes an opinionated approach on how to count pure string
lines depending on the output format:

- With ``--format=summary``, pure string lines are ignored, similar to empty lines.
- With ``--format`` set to ``sloccount`` or ``cloc-xml``, string lines are counted
as code, resulting in counts somewhat similar to those of the original tools.
- With ``--format=json``, all variants are available as attributes and you can choose
which one you prefer.

In hindsight, this is an inconsistency that might warrant a cleanup. See issue
`#122 <https://github.com/roskakori/pygount/issues/122>`_ for a discussion and
issue `#152 <https://github.com/roskakori/pygount/issues/152>`_ for a plan on
how to clean this up.

.. _binary:

Binary files
@@ -62,21 +97,21 @@ performs no further analysis.


Comparison with other tools
-----------------------------------
---------------------------

Pygount can analyze more languages than other common tools such as sloccount
or cloc because it builds on ``pygments``, which provides lexers for hundreds
of languages. This also makes it easy to support another language: simply
of languages. This also makes it easy to support another language: Just
`write your own lexer <http://pygments.org/docs/lexerdevelopment/>`_.

For certain corner cases pygount gives more accurate results because it
actually lexes the code unlike other tools that mostly look for comment
markers and can get confused when they show up inside strings. In practice
though this should not make much of a difference.

Pygount is slower than most other tools. Partially this is due to actually
lexing instead of just scanning the code. Partially other tools can use
statically compiled languages such as Java or C, which are generally faster
than dynamic languages. For many applications though pygount should be
Pygount is slower than most other tools. Partially, this is due to actually
lexing instead of just scanning the code. Partially, because other tools can
use statically compiled languages such as Java or C, which are generally
faster than dynamic languages. For many applications though pygount should be
"fast enough", especially when running as an asynchronous step during a
continuous integration build.
17 changes: 17 additions & 0 deletions docs/changes.rst
@@ -5,6 +5,23 @@ Changes

This chapter describes the changes coming with each new version of pygount.

Version 1.8.0, 2024-05-13

* Add all available counts and percentages to JSON format (issue
`#122 <https://github.com/roskakori/pygount/issues/122>`_).

In particular, this makes available the ``codeCount``, which is similar to
the already existing ``sourceCount`` but excludes lines that contain
only strings. You can check the availability of the new counts by validating
that the ``formatVersion`` is at least 1.1.0.

The documentation about ":ref:`How to count code`" has more information
about the available counts and the ways they are computed.

Pygount 2.0 will probably introduce some breaking changes in this area,
which can already be previewed and discussed at issue
`#152 <https://github.com/roskakori/pygount/issues/152>`_.
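
The ``formatVersion`` check mentioned above could look roughly like the
following sketch. It assumes the version is a plain ``MAJOR.MINOR.PATCH``
string without pre-release suffixes. The function name ``has_code_count`` is
hypothetical and not part of pygount.

```python
import json


def has_code_count(pygount_json: str) -> bool:
    """Return True if the JSON report is new enough to include ``codeCount``.

    Hypothetical helper: assumes ``formatVersion`` is plain MAJOR.MINOR.PATCH.
    """
    report = json.loads(pygount_json)
    version = tuple(int(part) for part in report["formatVersion"].split("."))
    # Compare as integer tuples so that for example 1.10.0 sorts after 1.9.0.
    return version >= (1, 1, 0)


print(has_code_count('{"formatVersion": "1.1.0"}'))
print(has_code_count('{"formatVersion": "1.0.0"}'))
```

Comparing integer tuples instead of raw strings avoids the pitfall that
``"1.10.0" < "1.9.0"`` when compared as text.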

Version 1.7.0, 2024-05-13

* Fix analysis with
105 changes: 79 additions & 26 deletions docs/json.rst
@@ -9,15 +9,15 @@ the results of an analysis for further processing.


General format
--------------
==============

The general structure of the resulting JSON is:

.. code-block:: JavaScript
{
"formatVersion": "1.0.0",
"pygountVersion": "1.3.0",
"formatVersion": "1.1.0",
"pygountVersion": "1.8.0",
"files": [...],
"languages": [...],
"runtime": {...},
@@ -28,28 +28,44 @@ The naming of the entries deliberately uses camel case to conform to the
`JSLint <https://www.jslint.com/>`_ guidelines.

Both ``formatVersion`` and ``pygountVersion`` use
`semantic versioning <https://semver.org/>`_. The other entries contain the following information:
`semantic versioning <https://semver.org/>`_. For more information about how
this JSON evolved, see :ref:`JSON format history`.

Files
-----

With ``files`` you can access a list of files analyzed, for example:

.. code-block:: JavaScript
{
"path": "/Users/someone/workspace/pygount/pygount/write.py",
"sourceCount": 253,
"emptyCount": 60,
"documentationCount": 27,
"codeCount": 171,
"documentationCount": 28,
"emptyCount": 56,
"group": "pygount",
"isCountable": true,
"language": "Python",
"lineCount": 266,
"path": "/tmp/pygount/pygount/write.py",
"state": "analyzed",
"stateInfo": null
"stateInfo": null,
"sourceCount": 182
}
The ``*Count`` fields have the following meaning:

* ``codeCount``: The number of lines that contain code, excluding
:ref:`Pure string lines`
* ``documentationCount``: The number of lines containing comments
* ``emptyCount``: The number of empty lines, which includes
":ref:`No operations`" lines
* ``lineCount``: Basically the number of lines shown in your editor
or computed by shell commands like ``wc -l``
* ``sourceCount``: The source lines of code, similar to the traditional SLOC
* ``stringCount``: The number of :ref:`Pure string lines`

Here, ``sourceCount`` is the number of source lines of code (SLOC),
``documentationCount`` the number of lines containing comments and
``emptyCount`` the number of empty lines (which includes "no operation"
lines).
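
The counts in the newer ``write.py`` example above are consistent with each
other, which the following sketch illustrates. The sum for ``lineCount``
matches the ``line_count`` property added to ``pygount/analysis.py`` in this
commit. The relation ``sourceCount = codeCount + stringCount`` is only
inferred from the example values and should be treated as an assumption.

```python
# Values copied from the newer example entry for write.py.
file_entry = {
    "codeCount": 171,
    "documentationCount": 28,
    "emptyCount": 56,
    "lineCount": 266,
    "sourceCount": 182,
}

# Assumption inferred from the example: source lines are code lines plus
# pure string lines, so the string count is the difference.
string_count = file_entry["sourceCount"] - file_entry["codeCount"]

# Same sum as the line_count property in pygount/analysis.py.
computed_line_count = (
    file_entry["codeCount"]
    + file_entry["documentationCount"]
    + file_entry["emptyCount"]
    + string_count
)

print(string_count)
print(computed_line_count == file_entry["lineCount"])
```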

The ``state`` can have one of the following values:

@@ -62,47 +78,71 @@ The ``state`` can have one of the following values:
* generated: the file has been generated as specified with :option:`--generated`
* unknown: pygments does not offer any lexer to analyze the file

Languages
---------

In ``languages`` the summary for each language is available, for example:

.. code-block:: JavaScript
{
"documentationCount": 406,
"emptyCount": 631,
"fileCount": 18,
"documentationCount": 429,
"documentationPercentage": 11.776008783969257,
"codeCount": 2332,
"codePercentage": 64.01317595388416,
"emptyCount": 706,
"emptyPercentage": 19.3796321712874,
"fileCount": 20,
"filePercentage": 48.78048780487805,
"isPseudoLanguage": false,
"language": "Python",
"sourceCount": 2332
"sourceCount": 2508,
"sourcePercentage": 68.84435904474334,
"stringCount": 176,
"stringPercentage": 4.831183090859182
}
Summary
-------

In ``summary`` the total counts across the whole project can be accessed, for
example:

.. code-block:: JavaScript
"summary": {
"totalDocumentationCount": 410,
"totalEmptyCount": 869,
"totalFileCount": 32,
"totalSourceCount": 2930
"totalCodeCount": 4366,
"totalCodePercentage": 68.38972431077694,
"totalDocumentationCount": 463,
"totalDocumentationPercentage": 7.25250626566416,
"totalEmptyCount": 1275,
"totalEmptyPercentage": 19.971804511278197,
"totalFileCount": 41,
"totalSourceCount": 4646,
"totalSourcePercentage": 72.77568922305764,
"totalStringCount": 280,
"totalStringPercentage": 4.385964912280702
}
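
The percentages in the newer summary example appear to be relative to the
total number of lines, in line with the ``_percentage_or_0(count,
total_line_count)`` calls visible in ``pygount/summary.py``. The sketch below
is a guess at how such a helper might behave; its body is an assumption and
not copied from pygount.

```python
def percentage_or_0(count: int, total: int) -> float:
    """Return count as a percentage of total, or 0.0 if total is 0.

    Assumed behavior, modeled on the name ``_percentage_or_0``.
    """
    return 100 * count / total if total != 0 else 0.0


# Totals from the newer summary example. The total line count is the sum of
# code, documentation, empty and string lines.
total_line_count = 4366 + 463 + 1275 + 280

print(total_line_count)
print(percentage_or_0(4366, total_line_count))
```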
Runtime
-------

The ``runtime`` entry collects general information about how well pygount performed
in collecting the information, for example:

.. code-block:: JavaScript
"runtime": {
"durationInSeconds": 0.712625,
"filesPerSecond": 44.904402736362044
"finishedAt": "2022-01-05T11:49:27.009310",
"linesPerSecond": 5906.332222417121,
"startedAt": "2022-01-05T11:49:26.296685",
"durationInSeconds": 0.6333059999999999,
"filesPerSecond": 64.73963613166464,
"finishedAt": "2024-05-13T16:14:31.977070+00:00",
"linesPerSecond": 10080.435050354807,
"startedAt": "2024-05-13T16:14:31.343764+00:00"
}
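
The newer runtime values can be reproduced from the timestamps, as the
following sketch shows. It assumes that the runtime and summary examples stem
from the same run, so that 41 files and 6384 lines were analyzed. This is
suggested by the numbers but not stated in the documentation.

```python
from datetime import datetime

# Timestamps copied from the newer runtime example (ISO 8601 with UTC offset).
started_at = datetime.fromisoformat("2024-05-13T16:14:31.343764+00:00")
finished_at = datetime.fromisoformat("2024-05-13T16:14:31.977070+00:00")

duration_in_seconds = (finished_at - started_at).total_seconds()

# Assumption: file and line totals are taken from the summary example.
files_per_second = 41 / duration_in_seconds
lines_per_second = 6384 / duration_in_seconds

print(duration_in_seconds, files_per_second, lines_per_second)
```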
Pretty printing
---------------
===============

Because the output is concise and consequently mostly illegible for a
human reader, you might want to pipe it through a pretty printer. As you
@@ -117,3 +157,16 @@ Another alternative would be `jq <https://stedolan.github.io/jq/>`_:
.. code-block:: sh
pygount --format json | jq .
.. _JSON format history:

JSON format history
===================

v1.1.0, pygount 1.8.0

* Add ``codeCount`` and ``lineCount``

v1.0.0, pygount 1.3.0

* Initial version
7 changes: 7 additions & 0 deletions pygount/analysis.py
@@ -415,6 +415,13 @@ def empty_count(self) -> int:
"""
return self._empty

@property
def line_count(self) -> int:
"""total number of lines, which is what your text editor or `wc -l`
would show
"""
return self.code_count + self.documentation_count + self.empty_count + self.string_count

@property
def string_count(self) -> int:
"""number of lines containing only strings but no other code"""
16 changes: 8 additions & 8 deletions pygount/summary.py
@@ -200,12 +200,12 @@ def total_empty_percentage(self) -> float:
return _percentage_or_0(self.total_empty_count, self.total_line_count)

@property
def total_string_count(self) -> int:
return self._total_string_count
def total_file_count(self) -> int:
return self._total_file_count

@property
def total_string_percentage(self) -> float:
return _percentage_or_0(self.total_string_count, self.total_line_count)
def total_line_count(self) -> int:
return self._total_line_count

@property
def total_source_count(self) -> int:
@@ -216,12 +216,12 @@ def total_source_percentage(self) -> float:
return _percentage_or_0(self.total_source_count, self.total_line_count)

@property
def total_file_count(self) -> int:
return self._total_file_count
def total_string_count(self) -> int:
return self._total_string_count

@property
def total_line_count(self) -> int:
return self._total_line_count
def total_string_percentage(self) -> float:
return _percentage_or_0(self.total_string_count, self.total_line_count)

def add(self, source_analysis: SourceAnalysis) -> None:
"""
