Improve performance of collate & reduce memory consumption
We used to read all files into memory, which becomes too much if
someone runs a massively parallel CI: 400 * 10MB is still ~4GB,
and that's just raw file size; we do a lot more with the data.

This breaks the interface of ResultMerger.merge_and_store, but it's
not intended as a public interface. Will leave a note in the Changelog
anyhow.

What is being attempted at the top level is perhaps easier to see when
looking at the spike code: https://github.com/simplecov-ruby/simplecov/compare/collate-plus-plus?expand=1

The changes go further than just not reading all files in at once:
during the merge process we also operate on the raw file structure
as opposed to creating SimpleCov::Result. Creating a SimpleCov::Result
comes with a lot of overhead, notably reading in all source files, so
that's even worse when done ~400 times in a large code base.

There's more optimization potential for cases like these, which
I'll open a ticket about, but notably:
* potentially don't create a SimpleCov::Result at all until we actually
  produce results (just dump the raw coverage, more or less)
* allow running without a formatter, as only the last run really
  needs the formatter
PragTob committed Jan 3, 2021
1 parent 0fe63fd commit ed03db5
Showing 11 changed files with 166 additions and 124 deletions.
1 change: 1 addition & 0 deletions .rubocop.yml
@@ -125,6 +125,7 @@ Metrics/MethodLength:

Metrics/ModuleLength:
Description: Avoid modules longer than 100 lines of code.
Max: 300
Exclude:
- "lib/simplecov.rb"

10 changes: 3 additions & 7 deletions lib/simplecov.rb
@@ -81,17 +81,13 @@ def start(profile = nil, &block)
# information about coverage collation
#
def collate(result_filenames, profile = nil, &block)
raise "There's no reports to be merged" if result_filenames.empty?
raise "There are no reports to be merged" if result_filenames.empty?

initial_setup(profile, &block)

results = result_filenames.flat_map do |filename|
# Re-create each included instance of SimpleCov::Result from the stored run data.
Result.from_hash(JSON.parse(File.read(filename)) || {})
end

# Use the ResultMerger to produce a single, merged result, ready to use.
@result = ResultMerger.merge_and_store(*results)
# TODO: Did/does collate ignore old results? It probably shouldn't, right?
@result = ResultMerger.merge_and_store(*result_filenames)

run_exit_tasks!
end
2 changes: 1 addition & 1 deletion lib/simplecov/configuration.rb
@@ -10,7 +10,7 @@ module SimpleCov
# defined here are usable from SimpleCov directly. Please check out
# SimpleCov documentation for further info.
#
module Configuration # rubocop:disable Metrics/ModuleLength
module Configuration
attr_writer :filters, :groups, :formatter, :print_error_status

#
31 changes: 1 addition & 30 deletions lib/simplecov/result.rb
@@ -26,7 +26,7 @@ class Result
# Initialize a new SimpleCov::Result from given Coverage.result (a Hash of filenames each containing an array of
# coverage data)
def initialize(original_result, command_name: nil, created_at: nil)
result = adapt_result(original_result)
result = original_result
@original_result = result.freeze
@command_name = command_name
@created_at = created_at
@@ -72,10 +72,6 @@ def to_hash
}
end

def time_since_creation
Time.now - created_at
end

# Loads a SimpleCov::Result#to_hash dump
def self.from_hash(hash)
hash.map do |command_name, data|
@@ -85,31 +85,6 @@ def self.from_hash(hash)

private

# We changed the format of the raw result data in simplecov. People are likely
# to have "old" resultsets lying around (not too old, so they're still
# considered for merging) which we can adapt.
# See https://github.com/simplecov-ruby/simplecov/pull/824#issuecomment-576049747
def adapt_result(result)
if pre_simplecov_0_18_result?(result)
adapt_pre_simplecov_0_18_result(result)
else
result
end
end

# pre 0.18 coverage data pointed from file directly to an array of line coverage
def pre_simplecov_0_18_result?(result)
_key, data = result.first

data.is_a?(Array)
end

def adapt_pre_simplecov_0_18_result(result)
result.transform_values do |line_coverage_data|
{"lines" => line_coverage_data}
end
end

def coverage
keys = original_result.keys & filenames
Hash[keys.zip(original_result.values_at(*keys))]
153 changes: 101 additions & 52 deletions lib/simplecov/result_merger.rb
@@ -19,81 +19,110 @@ def resultset_writelock
File.join(SimpleCov.coverage_path, ".resultset.json.lock")
end

# Loads the cached resultset from JSON and returns it as a Hash,
# caching it for subsequent accesses.
def resultset
@resultset ||= begin
data = stored_data
if data
begin
JSON.parse(data) || {}
rescue StandardError
{}
end
else
{}
end
def merge_and_store(*file_paths)
result = merge_results(*file_paths)
store_result(result) if result
result
end

def merge_results(*file_paths)
# It is intentional here that files are only read in and parsed one at a time.
#
# In big CI setups you might deal with 100s of CI jobs and each one producing Megabytes
# of data. Reading them all in easily produces Gigabytes of memory consumption which
# we want to avoid.
#
# For similar reasons a SimpleCov::Result is only created in the end as that'd create
# even more data especially when it also reads in all source files.
initial_memo = valid_results(file_paths.shift)

command_names, coverage = file_paths.reduce(initial_memo) do |memo, file_path|
merge_coverage(memo, valid_results(file_path))
end

SimpleCov::Result.new(coverage, command_name: Array(command_names).sort.join(", "))
end

# Returns the contents of the resultset cache as a string or if the file is missing or empty nil
def stored_data
synchronize_resultset do
return unless File.exist?(resultset_path)
def valid_results(file_path)
parsed = parse_file(file_path)
valid_results = parsed.select { |_command_name, data| within_merge_timeout?(data) }
command_plus_coverage = valid_results.map { |command_name, data| [[command_name], adapt_result(data.fetch("coverage"))] }

# one file itself _might_ include multiple test runs
merge_coverage(*command_plus_coverage)
end

data = File.read(resultset_path)
return if data.nil? || data.length < 2
def parse_file(path)
data = read_file(path)
parse_json(data)
end

data
end
def read_file(path)
return unless File.exist?(path)

data = File.read(path)
return if data.nil? || data.length < 2

data
end

# Gets the resultset hash and re-creates all included instances
# of SimpleCov::Result from that.
# All results that are above the SimpleCov.merge_timeout will be
# dropped. Returns an array of SimpleCov::Result items.
def results
results = Result.from_hash(resultset)
results.select { |result| result.time_since_creation < SimpleCov.merge_timeout }
def parse_json(content)
return {} unless content

JSON.parse(content) || {}
rescue StandardError
warn "[SimpleCov]: Warning! Parsing JSON content of resultset file failed"
{}
end

def merge_and_store(*results)
result = merge_results(*results)
store_result(result) if result
result
def within_merge_timeout?(data)
time_since_result_creation(data) < SimpleCov.merge_timeout
end

# Merge two or more SimpleCov::Results into a new one with merged
# coverage data and the command_name for the result consisting of a join
# on all source result's names
def merge_results(*results)
parsed_results = JSON.parse(JSON.dump(results.map(&:original_result)))
combined_result = SimpleCov::Combine::ResultsCombiner.combine(*parsed_results)
result = SimpleCov::Result.new(combined_result)
# Specify the command name
result.command_name = results.map(&:command_name).sort.join(", ")
result
def time_since_result_creation(data)
Time.now - Time.at(data.fetch("timestamp"))
end

def merge_coverage(*results)
return results.first if results.size == 1

results.reduce do |(memo_command, memo_coverage), (command, coverage)|
# timestamp is dropped here, which is intentional
merged_coverage = SimpleCov::Combine::ResultsCombiner.combine(memo_coverage, coverage)
merged_command = memo_command + command

[merged_command, merged_coverage]
end
end

#
# Gets all SimpleCov::Results from cache, merges them and produces a new
# Gets all SimpleCov::Results stored in resultset, merges them and produces a new
# SimpleCov::Result with merged coverage data and the command_name
# for the result consisting of a join on all source result's names
#
# TODO: Maybe put synchronization just around the reading?
def merged_result
merge_results(*results)
synchronize_resultset do
merge_results(resultset_path)
end
end

def read_resultset
synchronize_resultset do
parse_file(resultset_path)
end
end

# Saves the given SimpleCov::Result in the resultset cache
def store_result(result)
synchronize_resultset do
# Ensure we have the latest, in case it was already cached
clear_resultset
new_set = resultset
new_resultset = read_resultset
# FIXME
command_name, data = result.to_hash.first
new_set[command_name] = data
new_resultset[command_name] = data
File.open(resultset_path, "w+") do |f_|
f_.puts JSON.pretty_generate(new_set)
f_.puts JSON.pretty_generate(new_resultset)
end
end
true
@@ -116,9 +145,29 @@ def synchronize_resultset
end
end

# Clear out the previously cached .resultset
def clear_resultset
@resultset = nil
# We changed the format of the raw result data in simplecov. People are likely
# to have "old" resultsets lying around (not too old, so they're still
# considered for merging) which we can adapt.
# See https://github.com/simplecov-ruby/simplecov/pull/824#issuecomment-576049747
def adapt_result(result)
if pre_simplecov_0_18_result?(result)
adapt_pre_simplecov_0_18_result(result)
else
result
end
end

# pre 0.18 coverage data pointed from file directly to an array of line coverage
def pre_simplecov_0_18_result?(result)
_key, data = result.first

data.is_a?(Array)
end

def adapt_pre_simplecov_0_18_result(result)
result.transform_values do |line_coverage_data|
{"lines" => line_coverage_data}
end
end
end
end
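Taken together, the `within_merge_timeout?` check and the relocated pre-0.18 adaptation in the diff above decide whether a stored result is still usable. A standalone sketch of that logic (the module name `ResultCheck` and the hard-coded 600-second `MERGE_TIMEOUT` are assumptions for this example; the real code lives on `SimpleCov::ResultMerger` and reads `SimpleCov.merge_timeout`):

```ruby
# Sketch of the two "is this stored result still usable?" helpers.
# ResultCheck and MERGE_TIMEOUT are names made up for this example.
module ResultCheck
  MERGE_TIMEOUT = 600 # seconds; stands in for SimpleCov.merge_timeout

  module_function

  # Drop results whose recorded timestamp is older than the merge timeout.
  def within_merge_timeout?(data, now: Time.now)
    (now - Time.at(data.fetch("timestamp"))) < MERGE_TIMEOUT
  end

  # pre-0.18 coverage data pointed from a file directly to an array of
  # line coverage instead of the modern {"lines" => [...]} shape
  def pre_simplecov_0_18_result?(result)
    _key, data = result.first
    data.is_a?(Array)
  end

  def adapt_result(result)
    pre_simplecov_0_18_result?(result) ? adapt_pre_simplecov_0_18_result(result) : result
  end

  # Wrap the bare line array in the modern {"lines" => [...]} shape.
  def adapt_pre_simplecov_0_18_result(result)
    result.transform_values { |line_coverage_data| {"lines" => line_coverage_data} }
  end
end
```

Doing the adaptation per file during the merge, rather than inside `SimpleCov::Result#initialize`, is what lets the merger work on raw hashes without instantiating a `Result` per input file.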
3 changes: 3 additions & 0 deletions spec/fixtures/conditionally_loaded_1.rb
@@ -0,0 +1,3 @@
# some comment
puts "wargh"
puts "wargh 1"
3 changes: 3 additions & 0 deletions spec/fixtures/conditionally_loaded_2.rb
@@ -0,0 +1,3 @@
# some comment
puts "wargh"
puts "wargh 2"
4 changes: 4 additions & 0 deletions spec/fixtures/parallel_tests.rb
@@ -0,0 +1,4 @@
# foo
puts "foo"
# bar
puts "bar"
