You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
After adding longer testcase to the testsuite for BNFC I was surprised to see the testsuite hanging. I then discovered that 13GB had been allocated (as much as my machine can offer). I traced the problem down to HTF and further to the diff algorithm used here, Data.Algorithm.Diff. It does not seem to scale to long strings with quite some differences.
However, when a test fails, I would be happy to see only the first couple of differences in long strings; if there are many differences, something fundamental must be wrong anyway.
I wonder whether assertEqual or the test runner could be configured to limit the number of differences shown and speed up the process.
However, I also have doubts in the quality of the library in use Data.Algorithm.Diff. It seems it uses reverse, making it impossible to only get the first n differences without computing all:
--| A form of 'getDiff' with no 'Eq' constraint. Instead, an equality predicate-- is taken as the first argument.getDiffBy:: (a->b->Bool) -> [a] -> [b] -> [PolyDiffab]
getDiffBy eq a b = markup a b .reverse$ lcs eq a b
where markup ...
To get a feel for the memory consumption of this diff algorithm, here is a protocol:
$ ./Diff 1000 +RTS -s
Comparing strings of lengths 5888896 and 5887107
Containing 15761 differences
83,067,632,840 bytes allocated in the heap
4066 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 80176 colls, 0 par 13.886s 14.796s 0.0002s 0.0440s
Gen 1 31 colls, 0 par 15.410s 17.389s 0.5609s 2.1239s
MUT time 18.025s ( 18.951s elapsed)
GC time 29.297s ( 32.184s elapsed)
Total time 47.322s ( 51.144s elapsed)
That is already 4GB for strings of length 6M, and the memory consumption is linear!
I attach my benchmark program if you want to experiment further:
{-# OPTIONS_GHC -F -pgmF htfpp #-}
{-# LANGUAGE LambdaCase, ScopedTypeVariables #-}
importData.List (groupBy)
importData.FunctorimportSystem.EnvironmentimportData.Algorithm.Diff (getDiff, PolyDiff(Both,First,Second))
importTest.Framework
main =do
n ::Int<- getArgs <&>\case
[s] ->read s
[]->14let a =concat$mapshow$ [1..n*n]
b =concat$mapshow$concat$map (\ i -> [n*i +1.. n*(i+1) -1]) [1..n]
putStrLn$"Comparing strings of lengths "++show (length a) ++" and "++show (length b)
-- runTest $ assertEqual a b-- runTest $ assertEqual (a == b) True -- without diffputStrLn$"Containing (truncated to 100) "++show (length$filter notSame $ getDiff a b) ++" differences"-- directly Diff
notSame =\caseBoth{} ->False
_ ->True-- Runs of @assertEqual a b@----------------------------------------------------------------------------- $ ./Diff 30-- Comparing strings of lengths 2592 and 2543-- Total execution time: 8ms-- 300/ms if 2.500 chars-- $ ./Diff 100-- Comparing strings of lengths 38894 and 38808-- Total execution time: 215ms-- 493,509,200 bytes allocated in the heap-- 23 MB total memory in use-- 200/ms if 40.000 chars (0.2s)-- $ ./Diff 300-- Comparing strings of lengths 438894 and 438136-- Total execution time: 2735ms-- 5,921,117,560 bytes allocated in the heap-- 269 MB total memory in use-- 160/ms if 440.000 chars (2.7s)-- $ ./Diff 500-- Comparing strings of lengths 1388895 and 1387719-- Total execution time: 9857ms-- 20,944,605,752 bytes allocated in the heap-- 867 MB total memory in use-- 140/ms if 1.400.000 chars (10s)-- $ ./Diff 1000-- Comparing strings of lengths 5888896 and 5887107-- Total execution time: 59925ms-- Total execution time: 54541ms-- 98,255,990,936 bytes allocated in the heap-- 3888 MB total memory in use-- 100/ms if 6.000.000 chars (60s)-- $ ./Diff 1500-- Comparing strings of lengths 14638896 and 14634738-- Total execution time: 160551ms-- 248,975,720,128 bytes allocated in the heap-- 9892 MB total memory in use-- 90/ms if 15.000.000 chars (160s)-- $ ./Diff 2000-- Comparing strings of lengths 26888896 and 26882552-- <INTERRUPT>-- Without diff (@assertEqual (a == b) True@)----------------------------------------------------------------------------- $ ./Diff 1000 +RTS -s-- Comparing strings of lengths 5888896 and 5887107-- Total execution time: 0ms-- 1,587,810,664 bytes allocated in the heap-- 391 MB total memory in use-- $ ./Diff 1500 +RTS -s-- Comparing strings of lengths 14638896 and 14634738-- Total execution time: 0ms-- 3,837,623,072 bytes allocated in the heap-- 1358 MB total memory in use (0 MB lost due to fragmentation)-- $ ./Diff 2000 +RTS -s-- Comparing strings of lengths 26888896 and 26882552-- Total execution time: 0ms-- 6,987,537,384 bytes allocated in the heap-- 2630 MB total memory in use-- Only Data.Algorithm.Diff----------------------------------------------------------------------------- $ ./Diff 100 +RTS -s-- Comparing strings of lengths 38894 and 38808-- Containing 1066 differences-- 387,598,992 bytes allocated in the heap-- 72,168,480 bytes copied during GC-- 10,251,528 bytes maximum residency (7 sample(s))-- 185,080 bytes maximum slop-- 25 MB total memory in use (0 MB lost due to fragmentation)-- Tot time (elapsed) Avg pause Max pause-- Gen 0 367 colls, 0 par 0.058s 0.062s 0.0002s 0.0031s-- Gen 1 7 colls, 0 par 0.026s 0.036s 0.0052s 0.0147s-- INIT time 0.000s ( 0.003s elapsed)-- MUT time 0.088s ( 0.094s elapsed)-- GC time 0.084s ( 0.098s elapsed)-- EXIT time 0.000s ( 0.006s elapsed)-- Total time 0.172s ( 0.202s elapsed)-- %GC time 48.8% (48.7% elapsed)-- $ ./Diff 1000 +RTS -s-- Comparing strings of lengths 5888896 and 5887107-- Containing 15761 differences-- 83,067,632,840 bytes allocated in the heap-- 33,259,227,112 bytes copied during GC-- 1,518,548,672 bytes maximum residency (31 sample(s))-- 8,911,168 bytes maximum slop-- 4066 MB total memory in use (0 MB lost due to fragmentation)-- Tot time (elapsed) Avg pause Max pause-- Gen 0 80176 colls, 0 par 13.886s 14.796s 0.0002s 0.0440s-- Gen 1 31 colls, 0 par 15.410s 17.389s 0.5609s 2.1239s-- INIT time 0.000s ( 0.002s elapsed)-- MUT time 18.025s ( 18.951s elapsed)-- GC time 29.297s ( 32.184s elapsed)-- EXIT time 0.000s ( 0.006s elapsed)-- Total time 47.322s ( 51.144s elapsed)-- %GC time 61.9% (62.9% elapsed)-- Alloc rate 4,608,394,976 bytes per MUT second-- Productivity 38.1% of total user, 37.1% of total elapsed
The text was updated successfully, but these errors were encountered:
The fix just restricts the length of the input strings.
With this, I get:
./Diff 1000
Total execution time: 5490ms
10,814,412,072 bytes allocated in the heap
7,258,721,136 bytes copied during GC
707,046,280 bytes maximum residency (15 sample(s))
10,879,464 bytes maximum slop
674 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10342 colls, 0 par 2.177s 2.189s 0.0002s 0.0018s
Gen 1 15 colls, 0 par 1.819s 2.481s 0.1654s 0.5047s
INIT time 0.000s ( 0.002s elapsed)
MUT time 1.471s ( 1.713s elapsed)
GC time 3.996s ( 4.670s elapsed)
EXIT time 0.000s ( 0.005s elapsed)
Total time 5.468s ( 6.390s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 7,350,942,162 bytes per MUT second
Productivity 26.9% of total user, 26.8% of total elapsed
Still, not brilliant, but that should suffice in practice.
After adding longer testcase to the testsuite for BNFC I was surprised to see the testsuite hanging. I then discovered that 13GB had been allocated (as much as my machine can offer). I traced the problem down to HTF and further to the diff algorithm used here, Data.Algorithm.Diff. It does not seem to scale to long strings with quite some differences.
However, when a test fails, I would be happy to see only the first couple of differences in long strings; if there are many differences, something fundamental must be wrong anyway.
I wonder whether
assertEqual
or the test runner could be configured to limit the number of differences shown and speed up the process.However, I also have doubts in the quality of the library in use Data.Algorithm.Diff. It seems it uses
reverse
, making it impossible to only get the first n differences without computing all:To get a feel for the memory consumption of this diff algorithm, here is a protocol:
That is already 4GB for strings of length 6M, and the memory consumption is linear!
I attach my benchmark program if you want to experiment further:
The text was updated successfully, but these errors were encountered: