Race condition during thirdparty downloads #4842

Open
mbautin opened this issue Jun 21, 2020 · 5 comments

@mbautin
Collaborator

mbautin commented Jun 21, 2020

Example:

/-------------------------------------------------------------------------------
| COMPILATION FAILED
|-------------------------------------------------------------------------------
ccache: error: execv of /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20200620175034-da0bfa8338-centos/clang-toolchain/bin/clang failed: No such file or directory

Input files:
  src/yb/gutil/CMakeFiles/gutil.dir/dynamic_annotations.c.o
  src/yb/gutil/dynamic_annotations.c
Output file (from -o): src/yb/gutil/CMakeFiles/gutil.dir/dynamic_annotations.c.o
\-------------------------------------------------------------------------------
@svarnau
Member

svarnau commented Jun 22, 2020

This occurs when a new pre-built THIRDPARTY tarball is introduced that has not yet been downloaded to the build-worker images. Multiple Jenkins jobs starting in parallel dispatch multiple tasks to the build workers, and each task has to find or download the dependencies. On a given build worker, the tasks that find they need to download the tarball go through code protected by a file lock, so only one process actually downloads and extracts it.

Tasks that start a bit later find that the third-party directory already exists, so they do not try to download it and never acquire the lock. They then proceed to build while the tarball is still being extracted. The build retries with a 0.1-second delay, so it can fail ten times in about a second, after which it is declared a failure.
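
To make the suspected sequence concrete, here is a minimal sketch of the check-then-lock pattern described above. It is not the actual build-worker code: the function name `ensure_dir_unsafe`, its parameters, and the use of `fcntl.flock` are all assumptions for illustration (Unix only).

```python
import fcntl
import os


def ensure_dir_unsafe(dest_dir, lock_path, populate):
    """Hypothetical illustration of a check-then-lock download guard."""
    # Fast path: no lock is taken when the directory already exists, even if
    # another process is still in the middle of filling it in.
    if os.path.isdir(dest_dir):
        return "skipped"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            # Re-check under the lock: another process may have finished first.
            if os.path.isdir(dest_dir):
                return "skipped"
            os.makedirs(dest_dir)
            populate(dest_dir)  # stands in for download + extraction
            return "populated"
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

In this sketch, a caller that arrives after `os.makedirs` but before `populate` returns gets "skipped" and goes on to use a directory that is still incomplete, which is the window the theory relies on.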

The proposed fix is to extract the tarball into a temporary directory and move it to its final name only once it is completely extracted.
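
A minimal sketch of that idea, assuming a POSIX filesystem where renaming a directory within one filesystem is atomic. The function name `extract_atomically` and its parameters are hypothetical and are not taken from download_and_extract_archive.py.

```python
import os
import shutil
import tarfile
import tempfile


def extract_atomically(archive_path, dest_dir):
    """Extract archive_path so that dest_dir only appears once it is complete."""
    parent = os.path.dirname(os.path.abspath(dest_dir))
    os.makedirs(parent, exist_ok=True)
    # Create the temporary directory next to the destination so that the
    # final rename stays on the same filesystem.
    tmp_dir = tempfile.mkdtemp(
        prefix=os.path.basename(dest_dir) + ".tmp-", dir=parent)
    try:
        with tarfile.open(archive_path) as tar:
            tar.extractall(tmp_dir)
        # mkdtemp creates the directory with mode 0700; relax it so other
        # users on the machine can read the result (an assumption about needs).
        os.chmod(tmp_dir, 0o755)
        # Publish the finished tree under its final name in one step.
        os.rename(tmp_dir, dest_dir)
    except BaseException:
        # Do not leave a half-extracted temporary tree behind.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return dest_dir
```

With this approach, a task that only checks whether the final directory exists either sees nothing or sees the fully extracted tree. If the destination already exists and is non-empty, `os.rename` raises `OSError`, so a caller would still want to hold the download lock around this call.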

@svarnau
Member

svarnau commented Jun 22, 2020

Hmm, analysis of the messages and timing across multiple logs seemed to support this theory, but the script used to extract the tar file (download_and_extract_archive.py) already seems to have a temp directory mechanism. So I'm no longer confident in this theory.

@svarnau
Member

svarnau commented Jun 23, 2020

I've been unable to reproduce this issue from my dev-server. It seems to take multiple Jenkins jobs talking to the build-workers to trigger it. Now trying to reproduce in Jenkins.

@svarnau
Member

svarnau commented Jun 23, 2020

This reproduces often in the Jenkins environment, even when only one Jenkins job is updating the thirdparty URL. The compiler wrapper seems to check for the existence of the compiler executable twice -- in find_compiler_by_type() and again in the wrapper itself before calling the compiler (via ccache). But both checks seem to pass fine.

@svarnau
Member

svarnau commented Jun 29, 2020

The centos thirdparty URL file was updated again in commit 82f53b1. This did require the build-workers to download it, and the problem did not recur. The workers check the thirdparty path in the build wrapper, and it worked fine for several days (6/24 to 6/29, when the build-worker image was updated).
