Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected code in ReNumber #211

Open
jodavies opened this issue Jun 30, 2017 · 41 comments
Open

Unexpected code in ReNumber #211

jodavies opened this issue Jun 30, 2017 · 41 comments
Labels
bug Something isn't working

Comments

@jodavies
Copy link
Collaborator

Hello,

I was writing a small script to see to what extent I could increase the test coverage in sort.c and compress.c. I have something like the following:

#-

#: TermsInSmall 128
#: LargePatches 16
#: FilePatches 4
*#: SubTermsInSmall 64
#: SubLargePatches 8
#: SubFilePatches 2

Off Statistics;

CFunction f,g;
Symbol x,y;

* 128*16=2048 terms cause a sort of the large buffer to disk.
* multiples of 2048*4=8192 terms cause a stage 4 sort
#define NTERMS "40001"

#define ARGNTERMS "2001"

Local test1 = <f(1)>+...+<f(`NTERMS')>;
Local test2 = g(<f(1)>+...+<f(`ARGNTERMS')>);
.sort

* Cancel all terms, but keep distance so that most terms only cancel in the final sort
Identify f(x?) = f(x) - f(`NTERMS'-x+1);
Argument g;
	Identify f(x?) = f(x) - f(`ARGNTERMS'-x+1);
EndArgument;


Print +s;
.end

If one uncomments the #: SubTermsInSmall 64 setting, tform crashes with Unexpected code in ReNumber. form does not crash, but test2 contains many f(...) terms that are not inside g.

Thanks,
Josh

@jodavies
Copy link
Collaborator Author

==7336== Conditional jump or move depends on uninitialised value(s)
==7336==    at 0x5095C5: TestSub (proces.c:1622)
==7336==    by 0x501D66: Generator (proces.c:3051)
==7336==    by 0x503B9E: Generator (proces.c:3762)
==7336==    by 0x5040BA: Generator (proces.c:3930)
==7336==    by 0x50ACE9: Processor (proces.c:404)
==7336==    by 0x44ABB5: DoExecute (execute.c:838)
==7336==    by 0x4690CF: ExecModule (module.c:274)
==7336==    by 0x4F836E: PreProcessor (pre.c:962)
==7336==    by 0x544884: main (startup.c:1605)

@tueda
Copy link
Collaborator

tueda commented Jun 30, 2017

It must be an old bug. I got many f's with FORM 4.1 (Oct 25 2013).

@vermaseren
Copy link
Owner

vermaseren commented Jun 30, 2017 via email

@tueda tueda added the bug Something isn't working label Jun 30, 2017
@jodavies
Copy link
Collaborator Author

jodavies commented Jun 30, 2017

Great, that seems to work.

Now, if I add to the first module:

#$test3dol = <f(1)>+...+<f(`NTERMS')>;

and to the second module:

#inside $test3dol
	Identify f(x?) = f(x) - f(`NTERMS'-x+1);
#endinside

form has no problems, but tform -w4 crashes about 60% of the time, with various errors:

Unexpected code in ReNumber
Program terminating in thread 1 at sorting.frm Line 33 --> 
0FillInputGZIP: Read error during uncompressed sort.
0++Reading 4295000 bytes at position    1366592
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 418958 bytes
Program terminating in thread 0 at sorting.frm Line 23 --> 
Program terminating in thread 4 at sorting.frm Line 23 --> 

EDIT: For this, at some point before the crash, valgrind gives


==18016== Syscall param read(buf) points to unaddressable byte(s)
==18016==    at 0x5B8551D: ??? (syscall-template.S:84)
==18016==    by 0x520956: read (unistd.h:44)
==18016==    by 0x520956: Uread (unixfile.c:134)
==18016==    by 0x50F375: ReadFile (tools.c:1158)
==18016==    by 0x42D7A5: FillInputGZIP (compress.c:694)
==18016==    by 0x4ED173: PutIn (sort.c:1260)
==18016==    by 0x4F153F: MergePatches (sort.c:3656)
==18016==    by 0x4F3798: EndSort (sort.c:1066)
==18016==    by 0x430DA8: CatchDollar (dollar.c:117)
==18016==    by 0x4BB230: PreProcessor (pre.c:1050)
==18016==    by 0x4F88A0: main (startup.c:1605)
==18016==  Address 0x671d52c0 is 0 bytes after a block of size 2,560,640 alloc'd
==18016==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18016==    by 0x510762: Malloc1 (tools.c:2236)
==18016==    by 0x5111E0: TermMallocAddMemory (tools.c:2470)
==18016==    by 0x45C57B: Normalize (normal.c:220)
==18016==    by 0x4C24AF: Generator (proces.c:3064)
==18016==    by 0x430D61: CatchDollar (dollar.c:112)
==18016==    by 0x4BB230: PreProcessor (pre.c:1050)
==18016==    by 0x4F88A0: main (startup.c:1605)

@tueda
Copy link
Collaborator

tueda commented Jul 1, 2017

Running valgrind tvorm -w2 for

#:FilePatches 4
#:SubTermsInSmall 64
#:SubLargePatches 8
#:SubFilePatches 2
CF f;
#define NTERMS "5000"
#$x = <f(1)>+...+<f(`NTERMS')>;
.end

gives me

StageSort in thread 0
StageSort in thread 0
==11421== Syscall param read(buf) points to unaddressable byte(s)
==11421==    at 0x32E1A0E82D: ??? (in /lib64/libpthread-2.12.so)
==11421==    by 0x51C1B3: Uread (unixfile.c:134)
==11421==    by 0x50B031: ReadFile (tools.c:1158)
==11421==    by 0x42CF56: FillInputGZIP (compress.c:694)
==11421==    by 0x4E95DB: PutIn (sort.c:1260)
==11421==    by 0x4ED90D: MergePatches (sort.c:3656)
==11421==    by 0x4EFB3C: EndSort (sort.c:1066)
==11421==    by 0x4304D2: CatchDollar (dollar.c:117)
==11421==    by 0x4B8301: PreProcessor (pre.c:1050)
==11421==    by 0x4F4A43: main (startup.c:1605)
==11421==  Address 0x37e9c640 is not stack'd, malloc'd or (recently) free'd
==11421==
0FillInputGZIP: Read error during uncompressed sort.
0++Reading 4295000 bytes at position     106280
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 418958 bytes
Program terminating in thread 0 at test.frm Line 7 -->
==11421== Invalid read of size 4
==11421==    at 0x50F5F8: Crash (tools.c:3711)
==11421==    by 0x4F4070: Terminate (startup.c:1707)
==11421==    by 0x4E9618: PutIn (sort.c:1266)
==11421==    by 0x4ED90D: MergePatches (sort.c:3656)
==11421==    by 0x4EFB3C: EndSort (sort.c:1066)
==11421==    by 0x4304D2: CatchDollar (dollar.c:117)
==11421==    by 0x4B8301: PreProcessor (pre.c:1050)
==11421==    by 0x4F4A43: main (startup.c:1605)
==11421==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Now 100% it occurs (though it is not in a worker).

@vermaseren
Copy link
Owner

vermaseren commented Jul 1, 2017 via email

@tueda
Copy link
Collaborator

tueda commented Jul 1, 2017

The minimum seems to be 129 terms,

TFORM 4.1 (Jun 30 2017, v4.1-20131025-382-gc169281) 64-bits 0 workers  Run: Sat Jul  1 23:46:29 2017
    #:FilePatches 4
    #:SubTermsInSmall 16
    #:SubLargePatches 1
    #:SubFilePatches 4
    S x;
    CF f;
    #define NTERMS "129"
    L F = f(<x^1>+...+<x^`NTERMS'>);
    ModuleOption noparallel;
    .end
StageSort in thread 0
StageSort in thread 0
==12127== Syscall param read(buf) points to unaddressable byte(s)
==12127==    at 0x32E1A0E810: __read_nocancel (in /lib64/libpthread-2.12.so)
==12127==    by 0x51C1B3: Uread (unixfile.c:134)
==12127==    by 0x50B031: ReadFile (tools.c:1158)
==12127==    by 0x42CF56: FillInputGZIP (compress.c:694)
==12127==    by 0x4E95DB: PutIn (sort.c:1260)

but unfortunately it doesn't crash in spite of the memory bug.

@tueda
Copy link
Collaborator

tueda commented Jul 1, 2017

Two things:

  • Actually, the same valgrind error occurs without zlib.
  • MaxTermSize affects on the appearance of the error (similar to gzip errors #95).

In the following log, is file->POsize of file->handle=4 OK??

TFORM 4.1 (Jul  1 2017, v4.1-20131025-382-gc169281-dirty) 64-bits 0 workers  Run: Sun Jul  2 00:31:33 2017
    #:MaxTermSize 600
    #:FilePatches 4
    #:SubTermsInSmall 16
    #:SubLargePatches 1
    #:SubFilePatches 4
    S x;
    CF f;
    #define NTERMS "65"
    L F = f(<x^1>+...+<x^`NTERMS'>);
    ModuleOption noparallel;
    .end
StageSort in thread 0
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=4 file->POsize=4373800]
[PutIn: file->handle=4 file->POsize=4373800]
==30515== Syscall param read(buf) points to unaddressable byte(s)
==30515==    at 0x32E1A0E810: __read_nocancel (in /lib64/libpthread-2.12.so)
==30515==    by 0x51A799: Uread (unixfile.c:134)
==30515==    by 0x509618: ReadFile (tools.c:1158)
==30515==    by 0x4E831B: PutIn (sort.c:1273)
==30515==    by 0x4EC2A3: MergePatches (sort.c:3658)
==30515==    by 0x4EE13D: EndSort (sort.c:1066)
==30515==    by 0x4C477F: InFunction (proces.c:2057)
==30515==    by 0x4BFA19: Generator (proces.c:3759)
==30515==    by 0x4C00A0: Generator (proces.c:3931)
==30515==    by 0x4C1AA3: Processor (proces.c:404)
==30515==    by 0x43A4E8: DoExecute (execute.c:838)
==30515==    by 0x451866: ExecModule (module.c:274)
==30515==  Address 0x1cb453d8 is 0 bytes after a block of size 5,100,440 alloc'd
==30515==    at 0x4A05A57: malloc (vg_replace_malloc.c:299)
==30515==    by 0x50A950: Malloc1 (tools.c:2236)
==30515==    by 0x4E4EA2: AllocSort (setfile.c:901)
==30515==    by 0x4E814D: NewSort (sort.c:619)
==30515==    by 0x4C46DB: InFunction (proces.c:2033)
==30515==    by 0x4BFA19: Generator (proces.c:3759)
==30515==    by 0x4C00A0: Generator (proces.c:3931)
==30515==    by 0x4C1AA3: Processor (proces.c:404)
==30515==    by 0x43A4E8: DoExecute (execute.c:838)
==30515==    by 0x451866: ExecModule (module.c:274)
==30515==    by 0x4B6D9B: PreProcessor (pre.c:962)
==30515==    by 0x4F302A: main (startup.c:1605)

The corresponding vorm log:

FORM 4.1 (Jul  1 2017, v4.1-20131025-382-gc169281-dirty) 64-bits  Run: Sun Jul  2 00:38:11 2017
    #:MaxTermSize 600
    #:FilePatches 4
    #:SubTermsInSmall 16
    #:SubLargePatches 1
    #:SubFilePatches 4
    S x;
    CF f;
    #define NTERMS "65"
    L F = f(<x^1>+...+<x^`NTERMS'>);
    ModuleOption noparallel;
    .end
StageSort
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=4 file->POsize=100000]
[PutIn: file->handle=4 file->POsize=100000]
StageSort
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=4 file->POsize=100000]
[PutIn: file->handle=4 file->POsize=100000]

@tueda
Copy link
Collaborator

tueda commented Jul 3, 2017

The first example causes memory leaks: https://travis-ci.org/tueda/form/jobs/249641044#L4542. (Should we reopen the issue?)

@vermaseren
Copy link
Owner

vermaseren commented Jul 3, 2017 via email

@tueda
Copy link
Collaborator

tueda commented Jul 3, 2017

Here is another spam of valgrind log (vorm) with increasing SortIOSize:

FORM 4.1 (Jul  3 2017, v4.1-20131025-383-gd8ae7a6-dirty) 64-bits  Run: Mon Jul  3 21:05:02 2017
    #:SortIOSize 10M
    #:MaxTermSize 600
    #:SubTermsInSmall 16
    #:SubLargePatches 1
    #:SubFilePatches 2
    S x;
    CF f;
    #define NTERMS "65"
    L F = f(<x^1>+...+<x^`NTERMS'>);
    P;
    .end
StageSort
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=4 file->POsize=10000000]
==5619== Syscall param read(buf) points to unaddressable byte(s)
==5619==    at 0x32E12DB670: __read_nocancel (in /lib64/libc-2.12.so)
==5619==    by 0x50E343: Uread (unixfile.c:134)
==5619==    by 0x4FE08C: ReadFile (tools.c:1158)
==5619==    by 0x42B0B6: FillInputGZIP (compress.c:694)
==5619==    by 0x4DDD98: PutIn (sort.c:1260)
==5619==    by 0x4E1AA0: MergePatches (sort.c:3656)
==5619==    by 0x4E391C: EndSort (sort.c:1066)
==5619==    by 0x4BC6EA: InFunction (proces.c:2057)
==5619==    by 0x4B8007: Generator (proces.c:3759)
==5619==    by 0x4B8463: Generator (proces.c:3931)
==5619==    by 0x4B9BE3: Processor (proces.c:404)
==5619==    by 0x43762D: DoExecute (execute.c:838)
==5619==  Address 0x15469408 is 0 bytes after a block of size 5,100,488 alloc'd
==5619==    at 0x4A05A57: malloc (vg_replace_malloc.c:299)
==5619==    by 0x4FF227: Malloc1 (tools.c:2236)
==5619==    by 0x4DB0B4: AllocSort (setfile.c:901)
==5619==    by 0x4DDBF0: NewSort (sort.c:619)
==5619==    by 0x4BC64C: InFunction (proces.c:2033)
==5619==    by 0x4B8007: Generator (proces.c:3759)
==5619==    by 0x4B8463: Generator (proces.c:3931)
==5619==    by 0x4B9BE3: Processor (proces.c:404)
==5619==    by 0x43762D: DoExecute (execute.c:838)
==5619==    by 0x44DAAA: ExecModule (module.c:274)
==5619==    by 0x4AFB2D: PreProcessor (pre.c:962)
==5619==    by 0x4E8747: main (startup.c:1605)
==5619== 
[PutIn: file->handle=4 file->POsize=10000000]

It is very strange that PutIn() is called with file->POsize=10000000, i.e., SortIOSize. Then ReadFile() overflows from the buffer allocated by AllocSort for the function argument (only 5100488 bytes, valgrind says). GDB shows it is allocated with AllocSort (LargeSize=4000000, SmallSize=500000, SmallEsize=800000, TermsInSmall=16, MaxPatches=1, MaxFpatches=4, IOsize=32768). For TFORM, instead of SortIOSize, the value calculated from AM.S0->file.POsize is used (see threads.c:541).

@vermaseren
Copy link
Owner

vermaseren commented Jul 3, 2017 via email

@tueda
Copy link
Collaborator

tueda commented Jul 3, 2017

Actually, the problematic PutIn() is called with AR.FoStage4[0]:

Breakpoint 1, PutIn (file=file@entry=0x797d30 <A+7664>, position=0x7fff4e4ba520, buffer=0x7fff4e4bb8e0,
    take=take@entry=0x7fffffffcf40, npat=npat@entry=0) at sort.c:1242

(gdb) p &AR.FoStage4[0]
$14 = (FILEHANDLE *) 0x797d30 <A+7664>

@jodavies
Copy link
Collaborator Author

jodavies commented Jul 3, 2017

If you enable gzipdebug do you see the same pattern of events as in #95 (comment) ?

@vermaseren
Copy link
Owner

vermaseren commented Jul 3, 2017 via email

@tueda
Copy link
Collaborator

tueda commented Jul 3, 2017

@jodavies No, I can see only ++Reading 795200 bytes at position 2116 and similar lines.

@tueda
Copy link
Collaborator

tueda commented Jul 3, 2017

OK. I didn't enable GZIPDEBUG in sort.c. But except that I can get Valgrind errors, it looks fine...

@vermaseren
Copy link
Owner

vermaseren commented Jul 3, 2017 via email

tueda added a commit that referenced this issue Dec 13, 2017
Currently the test fails with asan/valgrind (#248). This will be fixed
by the next commit.
@tueda
Copy link
Collaborator

tueda commented Oct 3, 2018

With the current master branch a53fd8f, I locally (Ubuntu on Windows, Xenial, Homebrew gcc 5.5.0_4) have a TFORM problem:

cd checking
./check.rb ../sources/tform Issue211

always leading to

StageSort in thread 1
1FillInputGZIP: Read error during uncompressed sort.
1++Reading 131170000 bytes at position          0
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 7D17ED0 bytes

I didn't get the same error with previous versions.

Confusingly, on another machine (CentOS 6.10, gcc Red Hat 7.3.1-5), this error occurs relatively rarely.

The error did not occur on Travis CI: https://travis-ci.org/vermaseren/form/jobs/436825672, and it did fail on Circle CI once: https://circleci.com/gh/tueda/form/156.

Does anyone also hit the same gzip error with the test Issue211?

@tueda
Copy link
Collaborator

tueda commented Oct 3, 2018

Ah, maybe this is related to the fact that TFORM now consumes a crazy amount of memory even if it performs nothing (~ 8GB for tform -w4) due to the increases in fsizes.h. FORM requires big machines.

@tueda
Copy link
Collaborator

tueda commented Oct 3, 2018

The change of memory usage for doing nothing:

form: 391M -> 2226M
tform -w4: 2001M -> 7400M

FORM is now really heavy software in a sense of memory allocation. This may explain why the 32-bit test never succeeds (over the 4GB limit).

tueda added a commit that referenced this issue Oct 4, 2018
- Update the Ruby lint configuration for RuboCop 0.59.2.
  (.rubocop.yml, check.rb)

- Ignore Valgrind "set address range perms" warnings in the test suite
  (otherwise succeeded? returns false), which happen when FORM
  allocates a big memory chunk. (check.rb)

- Fix a LaTeX typesetting error. (setup.tex)

- Collect Jos's laptop settings into the usual place.
  (optimize.cc -> form3.h)

- Suppress memory usage on 32-bit machines, especially for
  the 2^32 ~ 4G limit and tform -w4. (fsizes.h)

- Don't include any system header files before config.h, because of
  the large file support on 32-bit machines. (gentopo.cc)

Note that the Issue211 test may still fail. See the recent comments on
Issue #211.
@tueda
Copy link
Collaborator

tueda commented Oct 4, 2018

A valgrind dump:

==17684== Memcheck, a memory error detector
==17684== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17684== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17684== Command: ./tvorm -w4 1.frm
==17684== 
==17684== error calling PR_SET_PTRACER, vgdb might block
==17684== Warning: set address range perms: large range [0x395d8040, 0xbe5a8a78) (undefined)
==17684== Warning: set address range perms: large range [0xbe5a9040, 0xe585a8f8) (undefined)
==17684== Warning: set address range perms: large range [0xe585b040, 0x10cb0c8f8) (undefined)
==17684== Warning: set address range perms: large range [0x10cb0d040, 0x133dbe8f8) (undefined)
==17684== Warning: set address range perms: large range [0x133dbf040, 0x15b0708f8) (undefined)
==17684== Warning: set address range perms: large range [0x15b071040, 0x178d47840) (undefined)
==17684== Warning: set address range perms: large range [0x178d48040, 0x196a1e840) (undefined)
==17684== Warning: set address range perms: large range [0x196a1f040, 0x1b46f5840) (undefined)
TFORM 4.2.0 (Oct  4 2018, v4.2.0-69-g766ae40) 64-bits 4 workers  Run: Thu Oct  4 22:10:32 2018
    #: TermsInSmall 128
    #: LargePatches 16
    #: FilePatches 4
    #: SubTermsInSmall 64
    #: SubLargePatches 8
    #: SubFilePatches 2
    
    Off stats;
    CF f,g;
    S x,y;
    
    #define NTERMS "40001"
    #define ARGNTERMS "2001"
    
    L F = g(<f(1)>+...+<f(`ARGNTERMS')>);
    .sort
    
    id f(x?) = f(x) - f(`NTERMS'-x+1);
    argument g;
      id f(x?) = f(x) - f(`ARGNTERMS'-x+1);
    endargument;
    .end
StageSort in thread 1
==17684== Warning: set address range perms: large range [0x1d01b5040, 0x1ef614b80) (undefined)
==17684== Warning: set address range perms: large range [0x1ef615040, 0x20ea74b80) (undefined)
==17684== Thread 2:
==17684== Syscall param read(buf) points to unaddressable byte(s)
==17684==    at 0x5BB751D: ??? (syscall-template.S:84)
==17684==    by 0x523411: Uread (unixfile.c:134)
==17684==    by 0x5119D2: ReadFile (tools.c:1158)
==17684==    by 0x42DCFA: FillInputGZIP (compress.c:708)
==17684==    by 0x4EFCD7: PutIn (sort.c:1276)
==17684==    by 0x4F4064: MergePatches (sort.c:3734)
==17684==    by 0x4F61FD: EndSort (sort.c:1074)
==17684==    by 0x409838: execarg (argument.c:680)
==17684==    by 0x4C5EA4: Generator (proces.c:3421)
==17684==    by 0x5255D2: RunThread (threads.c:1386)
==17684==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17684==    by 0x5ECB41C: clone (clone.S:109)
==17684==  Address 0x37967970 is 0 bytes after a block of size 9,443,632 alloc'd
==17684==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17684==    by 0x512D3F: Malloc1 (tools.c:2250)
==17684==    by 0x4EC81E: AllocSort (setfile.c:904)
==17684==    by 0x4EFB30: NewSort (sort.c:624)
==17684==    by 0x4CBD93: InFunction (proces.c:2066)
==17684==    by 0x4C6E28: Generator (proces.c:3815)
==17684==    by 0x4C74D2: Generator (proces.c:3987)
==17684==    by 0x5255D2: RunThread (threads.c:1386)
==17684==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17684==    by 0x5ECB41C: clone (clone.S:109)
==17684== 
1FillInputGZIP: Read error during uncompressed sort.
1++Reading 131170000 bytes at position          0
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 7D17ED0 bytes
Program terminating in thread 1 at 1.frm Line 21 --> 
==17684== Invalid read of size 4
==17684==    at 0x51607A: Crash (tools.c:3767)
==17684==    by 0x4FA9EF: Terminate (startup.c:1721)
==17684==    by 0x4EFD14: PutIn (sort.c:1282)
==17684==    by 0x4F4064: MergePatches (sort.c:3734)
==17684==    by 0x4F61FD: EndSort (sort.c:1074)
==17684==    by 0x409838: execarg (argument.c:680)
==17684==    by 0x4C5EA4: Generator (proces.c:3421)
==17684==    by 0x5255D2: RunThread (threads.c:1386)
==17684==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17684==    by 0x5ECB41C: clone (clone.S:109)
==17684==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==17684== 
  0.75 sec + 0.25 sec: 1.00 sec out of 0.96 sec
==17684== 
==17684== HEAP SUMMARY:
==17684==     in use at exit: 8,627,974,160 bytes in 338 blocks
==17684==   total heap usage: 417 allocs, 79 frees, 8,647,379,520 bytes allocated
==17684== 
==17684== LEAK SUMMARY:
==17684==    definitely lost: 0 bytes in 0 blocks
==17684==    indirectly lost: 0 bytes in 0 blocks
==17684==      possibly lost: 288 bytes in 1 blocks
==17684==    still reachable: 8,627,973,872 bytes in 337 blocks
==17684==         suppressed: 0 bytes in 0 blocks
==17684== Rerun with --leak-check=full to see details of leaked memory
==17684== 
==17684== For counts of detected and suppressed errors, rerun with: -v
==17684== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

@tueda
Copy link
Collaborator

tueda commented Oct 4, 2018

Another dump, the same problem and same environment, but different errors:

==17751== Memcheck, a memory error detector
==17751== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17751== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17751== Command: ./tvorm -w4 1.frm
==17751== 
==17751== error calling PR_SET_PTRACER, vgdb might block
==17751== Warning: set address range perms: large range [0x395d8040, 0xbe5a8a78) (undefined)
==17751== Warning: set address range perms: large range [0xbe5a9040, 0xe585a8f8) (undefined)
==17751== Warning: set address range perms: large range [0xe585b040, 0x10cb0c8f8) (undefined)
==17751== Warning: set address range perms: large range [0x10cb0d040, 0x133dbe8f8) (undefined)
==17751== Warning: set address range perms: large range [0x133dbf040, 0x15b0708f8) (undefined)
==17751== Warning: set address range perms: large range [0x15b071040, 0x178d47840) (undefined)
==17751== Warning: set address range perms: large range [0x178d48040, 0x196a1e840) (undefined)
==17751== Warning: set address range perms: large range [0x196a1f040, 0x1b46f5840) (undefined)
TFORM 4.2.0 (Oct  4 2018, v4.2.0-69-g766ae40) 64-bits 4 workers  Run: Thu Oct  4 22:13:09 2018
    #: TermsInSmall 128
    #: LargePatches 16
    #: FilePatches 4
    #: SubTermsInSmall 64
    #: SubLargePatches 8
    #: SubFilePatches 2
    
    Off stats;
    CF f,g;
    S x,y;
    
    #define NTERMS "40001"
    #define ARGNTERMS "2001"
    
    L F = g(<f(1)>+...+<f(`ARGNTERMS')>);
    .sort
    
    id f(x?) = f(x) - f(`NTERMS'-x+1);
    argument g;
      id f(x?) = f(x) - f(`ARGNTERMS'-x+1);
    endargument;
    .end
==17751== Thread 2:
==17751== Conditional jump or move depends on uninitialised value(s)
==17751==    at 0x4F357F: SplitMerge (sort.c:3251)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F35F5: SplitMerge (sort.c:3258)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F5310: StoreTerm (sort.c:4357)
==17751==    by 0x4C5732: Generator (proces.c:3190)
==17751==    by 0x4C74D2: Generator (proces.c:3987)
==17751==    by 0x40973C: execarg (argument.c:658)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751== 
==17751== Conditional jump or move depends on uninitialised value(s)
==17751==    at 0x4F3893: SplitMerge (sort.c:3329)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F5310: StoreTerm (sort.c:4357)
==17751==    by 0x4C5732: Generator (proces.c:3190)
==17751==    by 0x4C74D2: Generator (proces.c:3987)
==17751==    by 0x40973C: execarg (argument.c:658)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751== 
==17751== Conditional jump or move depends on uninitialised value(s)
==17751==    at 0x4F3893: SplitMerge (sort.c:3329)
==17751==    by 0x4F5310: StoreTerm (sort.c:4357)
==17751==    by 0x4C5732: Generator (proces.c:3190)
==17751==    by 0x4C74D2: Generator (proces.c:3987)
==17751==    by 0x40973C: execarg (argument.c:658)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751== 
==17751== Conditional jump or move depends on uninitialised value(s)
==17751==    at 0x4F46E1: MergePatches (sort.c:3967)
==17751==    by 0x4F5413: StoreTerm (sort.c:4386)
==17751==    by 0x4C5732: Generator (proces.c:3190)
==17751==    by 0x4C74D2: Generator (proces.c:3987)
==17751==    by 0x40973C: execarg (argument.c:658)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751== 
StageSort in thread 1
==17751== Warning: set address range perms: large range [0x1d0c59040, 0x1f00b8b80) (undefined)
==17751== Warning: set address range perms: large range [0x1f00b9040, 0x20f518b80) (undefined)
==17751== Syscall param read(buf) points to unaddressable byte(s)
==17751==    at 0x5BB751D: ??? (syscall-template.S:84)
==17751==    by 0x523411: Uread (unixfile.c:134)
==17751==    by 0x5119D2: ReadFile (tools.c:1158)
==17751==    by 0x42DCFA: FillInputGZIP (compress.c:708)
==17751==    by 0x4EFCD7: PutIn (sort.c:1276)
==17751==    by 0x4F4064: MergePatches (sort.c:3734)
==17751==    by 0x4F61FD: EndSort (sort.c:1074)
==17751==    by 0x409838: execarg (argument.c:680)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751==  Address 0x1d0356970 is 0 bytes after a block of size 9,443,632 alloc'd
==17751==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17751==    by 0x512D3F: Malloc1 (tools.c:2250)
==17751==    by 0x4EC81E: AllocSort (setfile.c:904)
==17751==    by 0x4EFB30: NewSort (sort.c:624)
==17751==    by 0x4096A3: execarg (argument.c:641)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751== 
  0.20 sec + 1.50 sec: 1.70 sec out of 1.67 sec
==17751== 
==17751== HEAP SUMMARY:
==17751==     in use at exit: 8,649,290,592 bytes in 340 blocks
==17751==   total heap usage: 421 allocs, 81 frees, 8,659,384,544 bytes allocated
==17751== 
==17751== LEAK SUMMARY:
==17751==    definitely lost: 0 bytes in 0 blocks
==17751==    indirectly lost: 0 bytes in 0 blocks
==17751==      possibly lost: 0 bytes in 0 blocks
==17751==    still reachable: 8,649,290,592 bytes in 340 blocks
==17751==         suppressed: 0 bytes in 0 blocks
==17751== Rerun with --leak-check=full to see details of leaked memory
==17751== 
==17751== For counts of detected and suppressed errors, rerun with: -v
==17751== Use --track-origins=yes to see where uninitialised values come from
==17751== ERROR SUMMARY: 49 errors from 5 contexts (suppressed: 0 from 0)

tueda added a commit that referenced this issue Oct 5, 2018
Now the test suite script recognises Valgrind errors like
  "Syscall param ... points to unaddressable byte(s)"
and makes warning? return true.

The test of #211 fails for "valgrind tvorm" on Travis CI as it should
do.
@tueda
Copy link
Collaborator

tueda commented Oct 5, 2018

I modified the test suite script such that it can realises Syscall param read(buf) points to unaddressable byte(s) in the above is a memory bug. The test on Travis CI now fails: https://travis-ci.org/vermaseren/form/builds/437423864.

The "Conditional jump or move depends on uninitialised value(s)" of S->PolyWise are easily fixable as Jos suggested (private communication).

The "pointing to unaddressable bytes" looks a real and more serious buffer-overflow bug.

@tueda tueda reopened this Oct 5, 2018
tueda added a commit that referenced this issue Oct 10, 2018
One needs to revert these changes (to check if it succeeds) when the
stage 4 sorting of function arguments and $-variables is correctly
implemented in future (#211).
@ahamaline
Copy link

Reporting that I got an "Unexpected code in ReNumber" crash. I was running TFORM 5.0.0-beta.1 (Mar 15 2024, v5.0.0-beta.1-42-g2663e14) 127 workers. The crash occurred after a very long sequence of StageSorts within one module; an average of 440 for each of the 127 threads.

Is this likely to be the same bug you guys were working on here, and if so what's a good strategy to avoid it? I guess I anyway need a much bigger LargePatches setting to get a sane runtime. Will this get rid of the crash as well?

@jodavies
Copy link
Collaborator Author

jodavies commented Apr 8, 2024

As I understand, it is hard to say: the Unexpected code in ReNumber error was probably due to messed up internal state due to some wrong buffer sizes used somewhere.

Are you able to provide an example of your input and form script in this case? Is it working with a lot of large function arguments or dollar variables? These days form is supposed to terminate with an error, if a function arg or dollar sort reaches stage4 sorting.

As a first tip, I would say that if you have any other parallelization opportunities (over diagrams, for eg) you should run more concurrent form jobs with fewer threads. The performance does not scale so well up to 128 threads (this statement does depend heavily on what exactly the form script does, though) and with fewer threads, each thread gets more memory space and you shouldn't stagesort as often or at all. Increasing your buffer sizes to avoid stagesorts will also help.

@ahamaline
Copy link

If I'm not mistaken, the module in question is just one line:
id tenredgg(?mu) = dd_(?mu);
So yes, function arguments are involved. Unfortunately I don't understand this part of the code so well, so I don't know what those arguments look like in my case.

@jodavies
Copy link
Collaborator Author

jodavies commented Apr 8, 2024

It seems like you are working at the "ground level" though, so the stagesorts are caused by a large number of generated terms, and your crash shouldn't be related to sub-buffers.

Are you able to provide the expression just before this module, if it is not too big already, or at least a sample of the terms?

@ahamaline
Copy link

I did not print any terms before the crash, but I do have a recovery file. Is that useful to you?

@jodavies
Copy link
Collaborator Author

jodavies commented Apr 8, 2024

I don't think so, without all of the scripts which come with it. Did it take a long time to arrive at this module (i.e., can you just add a print statement in the module before, and re-run) ?

@ahamaline
Copy link

It did take a couple of hours to reach that point, and there are already ~50M terms before the module.

But looking at the code, I think the arguments of tenredgg are a list of indices, perhaps up to 12 of them, that are taken from powers of a loop momentum. The dd_ then sets up all of the contractions of pairs of those indices, creating dotproducts between external momenta and polarization vectors. Does that tell you what you wanted to know?

@jodavies
Copy link
Collaborator Author

jodavies commented Apr 8, 2024

Without seeing the whole terms I can't tell what the d_ are going to end up contracting with etc, so I it seems unlikely that your crash can be reproduced without more information.

To continue investigating, I would add a print before the crashing module and simply re-run the code with the same threads/memory settings etc. Then you can see if you can reproduce the crash yourself, and at the same time we can see what the expression looks like exactly.

@vermaseren
Copy link
Owner

vermaseren commented Apr 8, 2024 via email

@ahamaline
Copy link

Advise: Put a bracket statement in tenredgg.
Can you explain more? Is this to help avoid the crash, or to investigate it?

@vermaseren
Copy link
Owner

vermaseren commented Apr 8, 2024 via email

@ahamaline
Copy link

OK, I'm running it now with the first of those options.

@ahamaline
Copy link

So I got to the printout but not yet to the crash. Here is an example term: outside the brackets we have tenredgg[mu1,mu2], and within the brackets one of the terms is 2/( - 1 + D)*Etens[p2,mu2]* Etens[p2,nu97]*Etens[p1,mu1]*Etens[p1,nu99]*Mom[p2,nu2]*Mom[p1,nu1]^2 *Mom[p1,nu2]*Mom[p1,nu97]*Mom[p1,nu99]*Mom[p1,mudotp1]^2*Mom[p1, mudotp2]^2*prop[k1,0]^4*den[4,2]*(k1.k1)^2*sSS^10*D*kappa^2*mh^(-6)
Of course the more problematic terms are those where tenredgg has many indices, but I can't seem to scroll far enough to get to those... The idea should be the same; each index appears exactly twice in either tenredgg, Mom, or Evect. There are always 4 powers of Evect but the powers of Mom vary.

'Mom[p2,nu2]' is really just a way to write p2[nu2], chosen specifically to avoid having the vector contract automatically.

@ahamaline
Copy link

Ok, I found an example term with sixteen indices in tenredgg (in the printout before the module):
tenredgg[mu1,mu2,mu4,nu1,nu99,mudotp1,mudotp2,mudotp3,mudotp4,mudotp5 ,mudotp6,mudotp7,mudotp8,mudotp9,mudotp10,mudotp11] * ( 16384/( - 1 + D) /( - 1 + D)*Etens[p2,mu2]*Etens[p2,mu4]*Etens[p1,mu1]*Etens[p1,nu99]* Mom[p1,nu1]*Mom[p1,mudotp1]*Mom[p1,mudotp2]*Mom[p1,mudotp3]*Mom[p1, mudotp4]*Mom[p1,mudotp5]*Mom[p1,mudotp6]*Mom[p1,mudotp7]*Mom[p1, mudotp8]*Mom[p1,mudotp9]*Mom[p1,mudotp10]*Mom[p1,mudotp11]*prop[k1,0] ^13*den[92897280,169906176]*(k1.k1)^11*sSS^12*kappa^2*mh^(-8) +...)

The code has been running for 17.5 hours now and has neither crashed nor finished the module.

@jodavies
Copy link
Collaborator Author

jodavies commented Apr 9, 2024

If you resolve the Mom into actual scalar products where you have repeated indices you will get a much smaller result, but I don't know what the rest of your code does and it is not really relevant to the fact that you got the crash.

I imagine this is not going to be easily reproducible. Possibly with a larger portion of your input expression and artificially small buffer settings or something like that.

Do you have the log file from the crashed run?

@vermaseren
Copy link
Owner

vermaseren commented Apr 9, 2024 via email

@ahamaline
Copy link

Yes, I agree that the code is absurdly inefficient for my use case. I did not write it and I don't understand the reasons for the authors' choices, so changing things is scary, but it looks like the savings would be worthwhile. The large number of terms is mainly due to the combinatorial explosion of index configurations, which would be avoided if we used either vectors or dummy indices.

I did not save the log file; it did not have much that seemed informative. Just 56,000 StageSort reports from the various threads, followed by the error.

Thank you for the tip about avoiding denominators. But how could the denominator be related to a crash that occurs in the dd_ replacement?

I will add that the error must not be deterministic, since this run has already continued for many more StageSorts than the one that crashed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants