Make ATLAS restart build on tolerance error #1641

Closed
sagetrac-mabshoff mannequin opened this issue Dec 30, 2007 · 10 comments
Comments

@sagetrac-mabshoff
Mannequin

sagetrac-mabshoff mannequin commented Dec 30, 2007

When the ATLAS build fails due to tolerance errors, we can restart it by running "make" again. We should do this a fixed number of times, e.g. 5, and then finally fail. I have hit the problem repeatedly while building in a VMware machine and have little to no control to prevent the issue from happening.
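A minimal sketch of that idea, assuming a POSIX shell. The helper name `retry` is hypothetical and is not taken from the actual spkg-install:

```shell
#!/bin/sh
# Hypothetical helper: run a command up to max_tries times and stop at
# the first success. Returns 0 on success, 1 if every attempt failed.
retry() {
    max_tries=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $attempt of $max_tries failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

From inside the ATLAS build directory this would be called as `retry 5 make || exit 1`, so that a later attempt continues from whatever the earlier ones already built.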

Cheers,

Michael

Component: packages: standard

Issue created by migration from https://trac.sagemath.org/ticket/1641

@sagetrac-mabshoff sagetrac-mabshoff mannequin added this to the sage-3.2 milestone Dec 30, 2007
@sagetrac-mabshoff sagetrac-mabshoff mannequin self-assigned this Dec 30, 2007
@sagetrac-mabshoff sagetrac-mabshoff mannequin modified the milestones: sage-3.2, sage-3.1.3 Sep 30, 2008
@sagetrac-mabshoff
Mannequin Author

sagetrac-mabshoff mannequin commented Feb 20, 2009

comment:3

This ought to be fixed via #5311.

Cheers,

Michael

@sagetrac-mabshoff sagetrac-mabshoff mannequin modified the milestones: sage-3.4.1, sage-3.3 Feb 20, 2009
@sagetrac-mabshoff
Mannequin Author

sagetrac-mabshoff mannequin commented Feb 20, 2009

comment:4

Fixed via #5311. This can probably be improved upon, but we will open another ticket once we get to this point.

Cheers,

Michael

@sagetrac-mabshoff sagetrac-mabshoff mannequin closed this as completed Feb 20, 2009
@williamstein
Contributor

comment:5

REFEREE REPORT:

  • There is a typo "Restartig build for the first time"

  • This patch doesn't work. I tried this on my vmware farm and what happens is that you restart part of the build that fails, but other parts also fail later.

  • I wonder if it would be better to simply wrap the whole spkg-install in a repeat timer instead of each little bit. I.e., put the current spkg-install in another file, say spkg-install-script and then make spkg-install try to run spkg-install-script, then wait some amount of time, and try again up to n times.
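The wrapper proposed in the last point could look roughly like the sketch below. It assumes a POSIX shell; `MAX_TRIES`, `SLEEP_CMD`, `random_delay` and `run_with_restarts` are hypothetical names, and the delay range is an arbitrary choice for illustration:

```shell
#!/bin/sh
# Sketch of a wrapper spkg-install: run the real install script and, on
# failure, wait a while and try again, up to MAX_TRIES times in total.
# All names and values here are assumptions, not the shipped script.

MAX_TRIES=${MAX_TRIES:-5}
SLEEP_CMD=${SLEEP_CMD:-sleep}   # overridable so the loop can be tried quickly

# Print a pause between 60 and 359 seconds. awk is used because $RANDOM
# is not available in every POSIX shell.
random_delay() {
    awk 'BEGIN { srand(); print 60 + int(rand() * 300) }'
}

run_with_restarts() {
    script=$1
    try=1
    while [ "$try" -le "$MAX_TRIES" ]; do
        if sh "$script"; then
            return 0
        fi
        # Only sleep if another attempt is still coming.
        if [ "$try" -lt "$MAX_TRIES" ]; then
            delay=$(random_delay)
            echo "Attempt $try of $MAX_TRIES failed; sleeping $delay seconds" >&2
            "$SLEEP_CMD" "$delay"
        fi
        try=$((try + 1))
    done
    echo "Giving up after $MAX_TRIES attempts" >&2
    return 1
}
```

The wrapper itself would then end with something like `run_with_restarts ./spkg-install-script || exit 1`.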

@williamstein williamstein reopened this Feb 21, 2009
@sagetrac-mabshoff
Mannequin Author

sagetrac-mabshoff mannequin commented Feb 21, 2009

comment:6

Replying to @williamstein:

REFEREE REPORT:

  • There is a typo "Restartig build for the first time"

Ok.

  • This patch doesn't work. I tried this on my vmware farm and what happens is that you restart part of the build that fails, but other parts also fail later.

The failures you reported are unrelated to this script: For example the ubuntu64.out failure for 3.3.rc3:

<SNIP>
ATLAS install complete.  Examine 
<SNIP>
Finished building ATLAS
<SNIP>
make[3]: [install_lib] Error 1 (ignored)
make[3]: Leaving directory `/space/wstein/farm/sage-3.3.rc3/spkg/build/atlas-3.8.3/ATLAS-build'
make[2]: Leaving directory `/space/wstein/farm/sage-3.3.rc3/spkg/build/atlas-3.8.3/ATLAS-build'
ATLAS failed to build because your system is too heavily loaded to obtain accurate timing.
Please restart the build by typing make, when the load on your system has decreased.

So ATLAS did finish tuning and some other error was triggered after the "make install" target, so this is not this ticket's fault.

  • I wonder if it would be better to simply wrap the whole spkg-install in a repeat timer instead of each little bit. I.e., put the current spkg-install in another file, say spkg-install-script and then make spkg-install try to run spkg-install-script, then wait some amount of time, and try again up to n times.

If you do that you will not reuse the tuning info, but start the tune from scratch each time.

Cheers,

Michael

@sagetrac-mabshoff
Mannequin Author

sagetrac-mabshoff mannequin commented Feb 21, 2009

comment:7

Ok, I figured it out I think: on debian32 this happens:

   STAGE 2-1-5: GEMV TUNE
make -f Makefile INSTALL_LOG/dMVRES pre=d 2>&1 | ./xatlas_tee INSTALL_LOG/dMVTUNE.LOG
make[3]: *** [build] Error 255
make[3]: Leaving directory `/space/wstein/farm/sage-3.3.rc3/spkg/build/atlas-3.8.3/ATLAS-build'
make[2]: *** [build] Error 2
make[2]: Leaving directory `/space/wstein/farm/sage-3.3.rc3/spkg/build/atlas-3.8.3/ATLAS-build'
ATLAS failed - round 1 - sleeping  5 minutes

Then the restart kicks in and finishes the build.

Restartig build for the first time
make[2]: Entering directory `/space/wstein/farm/sage-3.3.rc3/spkg/build/atlas-3.8.3/ATLAS-build'
make -f Make.top build
make[3]: Entering directory `/space/wstein/farm/sage-3.3.rc3/spkg/build/atlas-3.8.3/ATLAS-build'
cd bin/ ; make xatlas_install
<SNIP>

Because at some point there was a failure the makefile errors out at
the very end even though it all worked. So the script you wrote is
likely to hit the same bug unless you completely clean out the ATLAS
build directory.

The fix here is to figure out which file causes the tuning failure
message in the end and to get rid of it before restart.

Thoughts?

Cheers,

Michael

@sagetrac-mabshoff
Mannequin Author

sagetrac-mabshoff mannequin commented Feb 21, 2009

comment:8

The latest spkg is at:

http://sage.math.washington.edu/home/was/patches/atlas-3.8.3.p0.spkg

Two things need fixing:

  • spkg-install claims up to 10 tries, but max_tries is set to 5 :)
  • SPKG.txt overwrites my 3.8.3 entry, but William's changes need to be 3.8.3.p0:
-=== atlas-3.8.3 (Michael Abshoff, Januar 2nd, 2009) ===
- * rebase against latest upstream (#5311)
- * make ATLAS automatically restart build on tolerance error (#1641)
+=== atlas-3.8.3 (William Stein, February 20, 2009) ===
+ * implement up to 5 auto-restarts with random timeouts. 

The 3.8.3 entry also needs to be dated February 20th, but that was my bug.

Cheers,

Michael

@sagetrac-mabshoff sagetrac-mabshoff mannequin changed the title Make ATLAS restart build on tolerance error [positive review pending fixes] Make ATLAS restart build on tolerance error Feb 21, 2009
@sagetrac-mabshoff
Mannequin Author

sagetrac-mabshoff mannequin commented Feb 21, 2009

comment:9

The bug in my 3.8.3.spkg was actually not killing error* in ATLAS's build directory on restart:

cd $CUR/ATLAS-build
if [ -f error* ]; then
   echo "ATLAS failed to build because your system is too heavily loaded to obtain accurate timing."
   echo "Please restart the build by typing make, when the load on your system has decreased."
   exit 1
fi

That error message is wrong by the way since not every failure is due to timing issues - even though these days for ATLAS 99.9% of the time an error indicates a tolerance failure.
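One way to make a restart safe would be to clear the leftover marker files first. The sketch below assumes the markers are regular files named error* in the build directory, as the check above implies; `has_error_files` and `clear_error_files` are hypothetical helper names. It also avoids `[ -f error* ]`, which does not behave as intended when more than one file matches the glob, because `[` then receives several operands:

```shell
#!/bin/sh
# Hypothetical helpers; the directory to inspect is passed as $1.

# Succeeds if at least one regular file matching error* exists in $1.
# If nothing matches, the unexpanded pattern is tested and -f fails.
has_error_files() {
    for f in "$1"/error*; do
        [ -f "$f" ] && return 0
    done
    return 1
}

# Removes leftover error* files so that a restarted build is not
# reported as failed because of an earlier attempt.
clear_error_files() {
    rm -f "$1"/error*
}
```

A restart loop could call `clear_error_files "$CUR/ATLAS-build"` before each new attempt and `has_error_files "$CUR/ATLAS-build"` once the final attempt has finished.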

I still think an incremental restart is better than starting from scratch: think of being two hours into a tune on Sparc or Itanium when it blows up. I have made this #5328.

@williamstein
Contributor

comment:10

I think the new spkg at

http://sage.math.washington.edu/home/was/patches/atlas-3.8.3.p0.spkg

addresses all of the above comments.

@sagetrac-mabshoff
Mannequin Author

sagetrac-mabshoff mannequin commented Feb 21, 2009

comment:11

Positive review. All my concerns have been addressed.

Cheers,

Michael

@sagetrac-mabshoff sagetrac-mabshoff mannequin changed the title [positive review pending fixes] Make ATLAS restart build on tolerance error Make ATLAS restart build on tolerance error Feb 21, 2009
@sagetrac-mabshoff
Mannequin Author

sagetrac-mabshoff mannequin commented Feb 21, 2009

comment:12

Merged in Sage 3.3.final.

Cheers,

Michael
