Only do PCL workarounds for x86 #73

Closed
wants to merge 1 commit

Conversation

@dhood (Member) commented May 13, 2017

Looks like #71 broke the new turtlebot CI job for aarch64: Build Status

It's a Docker mounting error (error creating aufs mount to /var/lib/docker/aufs/mnt/25641437ee2e6f787c6877d8df5ca9441c85594145074c71f3d0ca27082f484d: invalid argument), and it persists even with a clean cache. From testing, it only seems to have trouble mounting if these lines are run (e.g. if INSTALL_TURTLEBOT2_DEMO_DEPS is false, it's fine).

I don't have quite enough context to know if this is the appropriate fix (maybe these lines need to be completely reworked?), but it's one thing that works. @clalancette could you provide input please?

Turtlebot CI jobs:
linux: Build Status
linux-aarch64: Build Status
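
For reference, the change being proposed boils down to guarding the problematic lines on the platform, roughly as in the sketch below. The ARG names are the build arguments these images already use; the "x86" comparison value and /tmp/pcl-workarounds.sh are illustrative placeholders, not the actual Dockerfile contents.

ARG PLATFORM
ARG INSTALL_TURTLEBOT2_DEMO_DEPS

# Apply the PCL build workarounds only on x86 images; on aarch64 these lines
# currently make the image fail to build with the aufs mount error above.
RUN if test "${INSTALL_TURTLEBOT2_DEMO_DEPS}" = "true" && test "${PLATFORM}" = "x86"; then \
      /tmp/pcl-workarounds.sh; \
    fi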

@dhood added the "in review" label (Waiting for review, Kanban column) May 13, 2017
@dhood self-assigned this May 13, 2017
@dhood requested a review from clalancette May 13, 2017 02:53
@clalancette (Contributor) commented

Hm, that is pretty weird. We need these workarounds for aarch64 as well, and they do work on my local Pine64 board. I'll take a look at this today and see if I can figure something out.

@dhood (Member, Author) commented May 15, 2017

Are you able to build the Dockerfile with this invocation locally? I suspect it would have the same issue, but I haven't checked myself; I've only been investigating on the CI jobs.


These are the jobs I was investigating with: http://ci.ros2.org/view/All/job/test_ci_turtlebot-demo_linux-aarch64/ (they will be deleted after this issue is resolved). For the first 10 jobs I was messing around with the Docker cache on the host (eventually I just completely removed /var/lib/docker/aufs). You'll see that this job had a clean cache and had to rebuild the entire image from scratch, and it still failed.

Realising that the cache probably wasn't the issue, for jobs 11-20 I was playing around with the job configuration. In those jobs, the only ones that successfully built and ran the Docker image were ones that either:

That's what led me to the conclusion that it's the Dockerfile that's causing the build failure, not Docker issues on the host.
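
(For the record, "completely removed /var/lib/docker/aufs" was roughly the following; the service-management commands are an assumption and depend on the host's init system.)

$ sudo service docker stop          # stop the daemon before touching its storage
$ sudo rm -rf /var/lib/docker/aufs  # throw away every cached image layer
$ sudo service docker start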

@clalancette (Contributor) commented

Hm, this is getting stranger. If I log into the packet.net server, and run:

$ docker build --build-arg PLATFORM=arm --build-arg INSTALL_TURTLEBOT2_DEMO_DEPS=true -t ros2_batch_ci_turtlebot_demo linux_docker_resources

by hand, it builds the docker image just fine. I did have to make a few edits to the Dockerfile to make this work; namely, I had to change FROM ubuntu:xenial to FROM aarch64/ubuntu:xenial, and I had to comment out everything having to do with RTI, but other than that it seemed to work. I'm not sure what is going on in the context of Jenkins that is causing it to fail. Still looking into it.
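
Roughly, the local edits looked like the sketch below (not the literal diff; the commented-out ADD line is just a stand-in for the real RTI directives):

# Base image swapped for the aarch64 variant when building on the arm64 host:
# FROM ubuntu:xenial
FROM aarch64/ubuntu:xenial

# Everything related to RTI Connext commented out for this manual test, e.g.:
# ADD rticonnext-dds_tools.deb /tmp/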

@clalancette (Contributor) commented

OK. So having the RTI stuff commented out actually materially affects this issue in ways I don't understand. If I have all of the RTI stuff in the Dockerfile, I can see the problem clearly when running by hand on the packet.net server. Oddly, if I comment out the very last ADD of the rticonnext-dds_tools debian file, then things start working again. Similarly, if I comment out one or the other of the

RUN if test ${INSTALL_TURTLEBOT2_DEMO_DEPS}

(it doesn't matter which one), it also starts working. I can't say I understand any of this, but that last point lends itself to a workaround: a single RUN statement that does all of the workarounds at once. I don't like it, because I don't understand the underlying issue, but it could be a workaround for now. I'm going to try that out in a CI build.
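
Concretely, something along these lines, where the two scripts are placeholders for whatever the existing conditional steps actually run:

# Collapse the two conditional steps into a single layer, so only one
# RUN directive depends on INSTALL_TURTLEBOT2_DEMO_DEPS.
RUN if test "${INSTALL_TURTLEBOT2_DEMO_DEPS}" = "true"; then \
      /tmp/first-pcl-workaround.sh && \
      /tmp/second-pcl-workaround.sh; \
    fi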

@clalancette (Contributor) commented May 15, 2017

TB aarch64: Build Status
TB amd64: Build Status

@clalancette (Contributor) commented May 15, 2017

All right, that looked pretty good. Here are some builds that should actually build the PCL libraries, to make sure that the workarounds still work:

TB aarch64: Build Status
TB amd64: Build Status

@nuclearsandwich (Member) commented

It irks me that the issue is worked around by reducing the number of RUN directives rather than by any change specific to the generated image. I wonder if the problem is related to aufs itself. It looks like the packet.net machines are currently rocking kernel 3.13, which means we miss overlayfs (in the mainline kernel as of 3.18) by a few revisions, and the overlay2 driver available in Docker 1.12 requires 4.0 or newer.

If we hit other errors that seem to be related to image layering, looking closer at aufs bugs might be worthwhile.
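
(A quick way to read the relevant bits off a host:)

$ uname -r                                            # kernel version; overlayfs needs >= 3.18, overlay2 needs >= 4.0
$ docker info 2>/dev/null | grep -i 'storage driver'  # whether the daemon is using aufs, overlay, overlay2, ...
$ docker version --format '{{.Server.Version}}'       # daemon version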

@clalancette (Contributor) commented

@nuclearsandwich It also concerns me a lot. There is clearly something wrong, because removing some combination of ADD and RUN statements makes the thing work. Further, this only happens on aarch64, so I'm pretty convinced it is a bug in the lower layers (Docker, aufs, or the kernel). Unfortunately, I can't afford to spend another day mucking around with this, so I'd like to go with the workaround for now.

@clalancette (Contributor) commented

All right, the turtlebot builds that actually use PCL work (the unstable result there is because of a slight packaging bug in cartographer, which I will address later). One more round of CI, just to check that we don't affect the "regular" jobs, and then I'll open a new PR with my changes:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@nuclearsandwich (Member) commented

"I can't afford to spend another day mucking around with this, so I'd like to go with the workaround for now."

Yeah, that's fine. I just wanted to put a breadcrumb down before I forget that I once knew about errors like this.

@dirk-thomas (Member) commented

You might be seeing this one: ros-infrastructure/ros_buildfarm#377

@clalancette (Contributor) commented

@dirk-thomas Oh, interesting. Let me try that out.

@clalancette (Contributor) commented

The magic fix didn't work. Also, looking at the linked GitHub issue, the symptoms are different. I think we'll still have to go with my workaround for now.

@clalancette (Contributor) commented

All right, all of the jobs we care about succeeded. I've opened #74 instead to fix this, so I'm going to close this out. Also, I've opened up #75 to track down the "real" issue.

@clalancette removed the "in review" label (Waiting for review, Kanban column) May 16, 2017
@dhood deleted the debug_tb_arm branch May 16, 2017 21:44