Merge pull request #153 from roskakori/122-add-code-count-to-json-output

#122 Add code count to JSON format
roskakori · May 13, 2024 · 79b061f
2 parents 10b1ae3 + 7763db2
commit 79b061f
Showing 8 changed files with 193 additions and 60 deletions.
diff --git a/docs/background.rst b/docs/background.rst
@@ -1,33 +1,43 @@
 Background
 ##########
 
+.. _How to count code:
+
 How pygount counts code
---------------------------
+-----------------------
+
+Pygount primarily counts the physical lines of source code. It begins by using
+lexers from Pygments, if available. If Pygments doesn't have a suitable lexer,
+pygount employs its own internal lexers to differentiate between code and
+comments. These include:
+
+- Minimalist lexers for m4, VBScript, and WebFOCUS, capable of distinguishing between comments and code.
+- The Java lexer repurposed for OMG IDL.
 
-Pygount basically counts physical lines of source code.
+Additionally, plain text is treated with a separate lexer that considers all lines as comments.
 
-First, it lexes the code using the lexers ``pygments`` assigned to it. If
-``pygments`` cannot find an appropriate lexer, pygount has a few additional
-internal lexers that can at least distinguish between code and comments:
+Lines consisting solely of comment tokens or whitespace are counted as comments.
 
-* m4, VBScript and WebFOCUS use minimalistic lexers that can distinguish
-  between comments and code.
-* OMG IDL repurposes the existing Java lexer.
+Lines with only whitespace are ignored.
 
-Furthermore plain text has a separate lexer that counts all lines as comments.
+All other content is considered code.
 
-Lines that only contain comment tokens and white space count as comments.
-Lines that only contain white space are not taken into account. Everything
-else counts as code.
+White characters
+----------------
 
-If a line contains only "white characters" it is not taken into account
-presumably because the code is only formatted that way to make it easier to
-read. Currently white characters are::
+A line containing only "white characters" is also ignored because they do not
+contribute to code complexity in any meaningful way. Currently white
+characters are::
 
     (),:;[]{}
 
-Because of that, pygount reports about 10 to 20 percent fewer SLOC for C-like
-languages than other similar tools.
+Because of that, pygount tends to report about 5 to 15 percent fewer SLOC for
+C-like languages than other similar tools.
+
+.. _No operations:
+
+No operations
+-------------
 
 For some languages "no operations" are detected and treated as white space.
 For example Python's ``pass`` or Transact-SQL's ``begin`` and ``end`` .
@@ -45,6 +55,31 @@ As example consider this Python code:
 This counts as 1 line of code and 3 lines of comments. The line with ``pass``
 is considered a "no operation" and thus not taken into account.
 
+.. _Pure string lines:
+
+Pure string lines
+-----------------
+
+Many programming languages support the concept of strings, which typically
+contain text to be shown to the end user or simple constant values.
+Similar to white characters and "no operations", in most cases they do not
+add much to the complexity of the code. Notable exceptions are strings
+containing code for domain specific languages, templates or SQL statements.
+
+Pygount currently takes an opinionated approach on how to count pure string
+lines depending on the output format:
+
+- With ``--format=summary``, pure string lines are ignored similar to empty lines.
+- With ``--format`` set to ``sloccount`` or ``cloc-xml``, string lines are counted
+  as code, resulting in counts similar to those of the original tools.
+- With ``--format=json``, all variants are available as attributes and you can
+  choose which one you prefer.
+
+In hindsight, this is an inconsistency that might warrant a cleanup. See issue
+`#122 <https://github.com/roskakori/pygount/issues/122>`_ for a discussion and
+issue `#152 <https://github.com/roskakori/pygount/issues/152>`_ for a plan on
+how to clean this up.
+
 .. _binary:
 
 Binary files
@@ -62,21 +97,21 @@ performs no further analysis.
 
 
 Comparison with other tools
------------------------------------
+---------------------------
 
 Pygount can analyze more languages than other common tools such as sloccount
 or cloc because it builds on ``pygments``, which provides lexers for hundreds
-of languages. This also makes it easy to support another language: simply
+of languages. This also makes it easy to support another language: Just
 `write your own lexer <http://pygments.org/docs/lexerdevelopment/>`_.
 
 For certain corner cases pygount gives more accurate results because it
 actually lexes the code unlike other tools that mostly look for comment
 markers and can get confused when they show up inside strings. In practice
 though this should not make much of a difference.
 
-Pygount is slower than most other tools. Partially this is due to actually
-lexing instead of just scanning the code. Partially other tools can use
-statically compiled languages such as Java or C, which are generally faster
-than dynamic languages. For many applications though pygount should be
+Pygount is slower than most other tools. Partially, this is due to actually
+lexing instead of just scanning the code. Partially, it is because other tools
+can use statically compiled languages such as Java or C, which are generally
+faster than dynamic languages. For many applications though pygount should be
 "fast enough", especially when running as an asynchronous step during a
 continuous integration build.
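
Note on the "white characters" rule documented in docs/background.rst above: the following is a minimal Python sketch of the idea, not pygount's actual implementation. Pygount works on lexer tokens, while this sketch looks at raw text lines only. The names ``WHITE_CHARACTERS`` and ``is_white_line`` are hypothetical and chosen for illustration.

```python
# Hypothetical sketch of the "white characters" rule; pygount itself
# classifies lexer tokens rather than raw text lines.
WHITE_CHARACTERS = "(),:;[]{}"


def is_white_line(line: str) -> bool:
    """Return True if the line holds only whitespace and white characters."""
    return all(ch.isspace() or ch in WHITE_CHARACTERS for ch in line)


lines = ["int main() {", "    return 0;", "}", "", "    });"]
white_flags = [is_white_line(line) for line in lines]
```

Under this sketch, lines such as a lone closing brace would not count as code, which is in line with the documented tendency to report fewer SLOC for C-like languages than other tools.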
diff --git a/docs/changes.rst b/docs/changes.rst
@@ -5,6 +5,23 @@ Changes
 
 This chapter describes the changes coming with each new version of pygount.
 
+Version 1.8.0, 2024-05-13
+
+* Add all available counts and percentages to JSON format (issue
+  `#122 <https://github.com/roskakori/pygount/issues/122>`_).
+
+  In particular, this makes available the ``codeCount``, which is similar to
+  the already existing ``sourceCount`` but excludes lines that contain
+  only strings. You can check their availability by validating that the
+  ``formatVersion`` is at least 1.1.0.
+
+  The documentation about ":ref:`How to count code`" has more information
+  about the available counts and the ways they are computed.
+
+  Pygount 2.0 will probably introduce some breaking changes in this area,
+  which can already be previewed and discussed at issue
+  `#152 <https://github.com/roskakori/pygount/issues/152>`_.
+
 Version 1.7.0, 2024-05-13
 
 * Fix analysis with
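
Note on the ``formatVersion`` check mentioned in the 1.8.0 entry above: a consumer of the JSON output might guard access to ``codeCount`` roughly as follows. This is a sketch under assumptions: the helper name ``supports_code_count`` is hypothetical, the reports are abbreviated for illustration, and the version is assumed to be a plain ``MAJOR.MINOR.PATCH`` string without pre-release or build suffixes.

```python
import json

# Assumption: formatVersion is a plain MAJOR.MINOR.PATCH string without
# pre-release or build suffixes.
MINIMUM_FORMAT_VERSION = (1, 1, 0)


def supports_code_count(report: dict) -> bool:
    """Return True if the report's formatVersion is at least 1.1.0."""
    version = tuple(int(part) for part in report["formatVersion"].split("."))
    return version >= MINIMUM_FORMAT_VERSION


# Hypothetical, abbreviated reports for illustration only.
new_report = json.loads('{"formatVersion": "1.1.0", "pygountVersion": "1.8.0"}')
old_report = json.loads('{"formatVersion": "1.0.0", "pygountVersion": "1.3.0"}')
```

Comparing integer tuples instead of strings avoids ordering ``1.10.0`` before ``1.2.0``.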

diff --git a/docs/json.rst b/docs/json.rst
@@ -9,15 +9,15 @@ the results of an analysis for further processing.
 
 
 General format
---------------
+==============
 
 The general structure of the resulting JSON is:
 
 .. code-block:: JavaScript
 
   {
-    "formatVersion": "1.0.0",
-    "pygountVersion": "1.3.0",
+    "formatVersion": "1.1.0",
+    "pygountVersion": "1.8.0",
     "files": [...],
     "languages": [...],
     "runtime": {...},
@@ -28,28 +28,44 @@ The naming of the entries deliberately uses camel case to conform to the
 `JSLint <https://www.jslint.com/>`_ guidelines.
 
 Both ``formatVersion`` and ``pygountVersion`` use
-`semantic versioning <https://semver.org/>`_. The other entries contain the following information:
+`semantic versioning <https://semver.org/>`_. For more information about how
+this JSON evolved, see :ref:`JSON format history`.
+
+Files
+-----
 
 With ``files`` you can access a list of files analyzed, for example:
 
 .. code-block:: JavaScript
 
   {
-    "path": "/Users/someone/workspace/pygount/pygount/write.py",
-    "sourceCount": 253,
-    "emptyCount": 60,
-    "documentationCount": 27,
+    "codeCount": 171,
+    "documentationCount": 28,
+    "emptyCount": 56,
     "group": "pygount",
     "isCountable": true,
     "language": "Python",
+    "lineCount": 266,
+    "path": "/tmp/pygount/pygount/write.py",
     "state": "analyzed",
-    "stateInfo": null
+    "stateInfo": null,
+    "sourceCount": 182
   }
 
+The ``*Count`` fields have the following meaning:
+
+* ``codeCount``: The number of lines that contain code, excluding
+  :ref:`Pure string lines`
+* ``documentationCount``: The number of lines containing comments
+* ``emptyCount``: The number of empty lines, which includes
+  ":ref:`No operations`" lines
+* ``lineCount``: Basically the number of lines shown in your editor
+  or computed by shell commands like ``wc -l``
+* ``sourceCount``: The source lines of code, similar to the traditional SLOC
+* ``stringCount``: The number of :ref:`Pure string lines`
+
 Here, ``sourceCount`` is the number of source lines of code (SLOC),
 ``documentationCount`` the number of lines containing comments and
-``emptyCount`` the number of empty lines (which includes "no operation"
-lines).
 
 The ``state`` can have one of the following values:
 
@@ -62,47 +78,71 @@ The ``state`` can have one of the following values:
 * generated: the file has been generated as specified with :option:`--generated`
 * unknown: pygments does not offer any lexer to analyze the file
 
+Languages
+---------
+
 In ``languages`` the summary for each language is available, for example:
 
 .. code-block:: JavaScript
 
   {
-    "documentationCount": 406,
-    "emptyCount": 631,
-    "fileCount": 18,
+    "documentationCount": 429,
+    "documentationPercentage": 11.776008783969257,
+    "codeCount": 2332,
+    "codePercentage": 64.01317595388416,
+    "emptyCount": 706,
+    "emptyPercentage": 19.3796321712874,
+    "fileCount": 20,
+    "filePercentage": 48.78048780487805,
     "isPseudoLanguage": false,
     "language": "Python",
-    "sourceCount": 2332
+    "sourceCount": 2508,
+    "sourcePercentage": 68.84435904474334,
+    "stringCount": 176,
+    "stringPercentage": 4.831183090859182
   }
 
+
+Summary
+-------
+
 In ``summary`` the total counts across the whole project can be accessed, for
 example:
 
 .. code-block:: JavaScript
 
   "summary": {
-    "totalDocumentationCount": 410,
-    "totalEmptyCount": 869,
-    "totalFileCount": 32,
-    "totalSourceCount": 2930
+    "totalCodeCount": 4366,
+    "totalCodePercentage": 68.38972431077694,
+    "totalDocumentationCount": 463,
+    "totalDocumentationPercentage": 7.25250626566416,
+    "totalEmptyCount": 1275,
+    "totalEmptyPercentage": 19.971804511278197,
+    "totalFileCount": 41,
+    "totalSourceCount": 4646,
+    "totalSourcePercentage": 72.77568922305764,
+    "totalStringCount": 280,
+    "totalStringPercentage": 4.385964912280702
   }
 
+Runtime
+-------
+
 The ``runtime`` entry collects general information about how well pygount performed
 in collecting the information, for example:
 
 .. code-block:: JavaScript
 
   "runtime": {
-    "durationInSeconds": 0.712625,
-    "filesPerSecond": 44.904402736362044
-    "finishedAt": "2022-01-05T11:49:27.009310",
-    "linesPerSecond": 5906.332222417121,
-    "startedAt": "2022-01-05T11:49:26.296685",
+    "durationInSeconds": 0.6333059999999999,
+    "filesPerSecond": 64.73963613166464,
+    "finishedAt": "2024-05-13T16:14:31.977070+00:00",
+    "linesPerSecond": 10080.435050354807,
+    "startedAt": "2024-05-13T16:14:31.343764+00:00"
   }
 
-
 Pretty printing
----------------
+===============
 
 Because the output is concise and consequently mostly illegible for a
 human reader, you might want to pipe it through a pretty printer. As you
@@ -117,3 +157,16 @@ Another alternativ would be `jq <https://stedolan.github.io/jq/>`_:
 .. code-block:: sh
 
   pygount --format json | jq .
+
+.. _JSON format history:
+
+JSON format history
+===================
+
+v1.1.0, pygount 1.8.0
+
+* Add ``codeCount`` and ``lineCount``
+
+v1.0.0, pygount 1.3.0
+
+* Initial version
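
Note on how the ``*Count`` fields in docs/json.rst relate to each other: the sketch below checks the numbers from the ``files`` example against the ``line_count`` property added in pygount/analysis.py. The example entry does not show ``stringCount``, so the sketch assumes it equals ``sourceCount - codeCount``, which matches the ``languages`` example (2508 - 2332 = 176).

```python
# Values copied from the ``files`` example in docs/json.rst.
file_entry = {
    "codeCount": 171,
    "documentationCount": 28,
    "emptyCount": 56,
    "lineCount": 266,
    "sourceCount": 182,
}

# Assumption: pure string lines are the difference between source and code.
string_count = file_entry["sourceCount"] - file_entry["codeCount"]

# Same sum as the line_count property: code + documentation + empty + string.
computed_line_count = (
    file_entry["codeCount"]
    + file_entry["documentationCount"]
    + file_entry["emptyCount"]
    + string_count
)
```

With these numbers the computed sum agrees with the reported ``lineCount`` of 266.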
diff --git a/pygount/analysis.py b/pygount/analysis.py
@@ -415,6 +415,13 @@ def empty_count(self) -> int:
         """
         return self._empty
 
+    @property
+    def line_count(self) -> int:
+        """number of total lines, which is what your text editor or `wc -l`
+        would show
+        """
+        return self.code_count + self.documentation_count + self.empty_count + self.string_count
+
     @property
     def string_count(self) -> int:
         """number of lines containing only strings but no other code"""

diff --git a/pygount/summary.py b/pygount/summary.py
@@ -200,12 +200,12 @@ def total_empty_percentage(self) -> float:
         return _percentage_or_0(self.total_empty_count, self.total_line_count)
 
     @property
-    def total_string_count(self) -> int:
-        return self._total_string_count
+    def total_file_count(self) -> int:
+        return self._total_file_count
 
     @property
-    def total_string_percentage(self) -> float:
-        return _percentage_or_0(self.total_string_count, self.total_line_count)
+    def total_line_count(self) -> int:
+        return self._total_line_count
 
     @property
     def total_source_count(self) -> int:
@@ -216,12 +216,12 @@ def total_source_percentage(self) -> float:
         return _percentage_or_0(self.total_source_count, self.total_line_count)
 
     @property
-    def total_file_count(self) -> int:
-        return self._total_file_count
+    def total_string_count(self) -> int:
+        return self._total_string_count
 
     @property
-    def total_line_count(self) -> int:
-        return self._total_line_count
+    def total_string_percentage(self) -> float:
+        return _percentage_or_0(self.total_string_count, self.total_line_count)
 
     def add(self, source_analysis: SourceAnalysis) -> None:
         """