
[CI] Make CI for glow-torch #3164

Closed
wants to merge 11 commits

Conversation

6 participants
@zrphercule (Member) commented Jun 24, 2019

Summary:

Documentation:

[Optional Fixes #issue]

Test Plan:

Please see a detailed explanation of how to fill out the fields in the relevant sections in PULL_REQUEST.md.

@zrphercule changed the title from "[WIP][Don] Make CI for glow-torch" to "[WIP][Dont Review/Merge] Make CI for glow-torch" on Jun 24, 2019

@jackm321 (Contributor) commented Jun 25, 2019

We probably don't want to build from source; it will take too long.

@jackm321 (Contributor) commented Jun 26, 2019

Also, I don't think that this should pull in new versions automatically (e.g. pip install pytorch-nightly), because then something could change on the PyTorch side and break the Glow CI for unrelated Glow diffs. It would be better if we had control over when the PyTorch version we test against is updated, for example by installing a fixed version into a docker image and updating that image periodically.
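
For illustration only, pinning might look something like this inside the docker image build; the version shown is hypothetical, not the one this PR uses:

# Hypothetical image-provisioning step: install one fixed PyTorch release so
# the CI environment only changes when we deliberately bump this line.
pip3 install torch==1.1.0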

@zrphercule (Member, Author) commented Jun 26, 2019

Also, I don't think that this should pull in new versions automatically (e.g. pip install pytorch-nightly), because then something could change on the PyTorch side and break the Glow CI for unrelated Glow diffs. It would be better if we had control over when the PyTorch version we test against is updated, for example by installing a fixed version into a docker image and updating that image periodically.

Agreed. I am just dealing with the build here; that's the only reason I am building PyTorch from source... It seems that no matter what I try, the build uses C++17, which deprecates the "register" keyword used in pybind11.

@jackm321 (Contributor) commented Jun 26, 2019

It seems that no matter what I try, the build uses C++17, which deprecates the "register" keyword used in pybind11.

Weird, I get that error on my Mac as well, but there it's just a warning (I guess the difference is -Werror). I think upgrading to Python 3 should fix this though; have you tried that?

@zrphercule (Member, Author) commented Jun 27, 2019

It seems that no matter what I try, the build uses C++17, which deprecates the "register" keyword used in pybind11.

Weird, I get that error on my Mac as well, but there it's just a warning (I guess the difference is -Werror). I think upgrading to Python 3 should fix this though; have you tried that?

Let me give it a try, but why would Python 3 fix a C++ standard error...?

@zrphercule force-pushed the zrphercule:makeci branch 2 times, most recently from 4ee4232 to 46d3cd3, on Jun 27, 2019

@zrphercule (Member, Author) commented Jun 27, 2019

It seems that no matter what I try, the build uses C++17, which deprecates the "register" keyword used in pybind11.

Weird, I get that error on my Mac as well, but there it's just a warning (I guess the difference is -Werror). I think upgrading to Python 3 should fix this though; have you tried that?

Let me give it a try, but why would Python 3 fix a C++ standard error...?

I see why now; the Python 3 version of pybind11 is maybe better, right...

@jackm321 (Contributor) commented Jun 27, 2019

Yeah, supposedly the Python 3 version doesn't use this deprecated C++ feature anymore.

@jackm321 (Contributor) left a comment

The test output says the following; do you know why this is?

Ran 0 tests in 0.000s

OK
@@ -76,7 +76,7 @@ jobs:
  OPENCL:
    environment:
      DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/caffe2/py2-clang8-ubuntu16.04:283"
    <<: *linux_default

@jackm321 (Contributor) Jun 30, 2019

nit: this change seems unnecessary

@zrphercule (Member, Author) Jul 1, 2019

Hmm, not so sure what has been changed here...

@@ -16,6 +16,8 @@ option(GLOW_WITH_BUNDLES "Build bundles" OFF)
option(LINK_PROTOBUF_AS_DLL "Link against protobuf build as dynamic libray." OFF)

set(CMAKE_CXX_STANDARD 14)
set(CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

@jackm321 (Contributor) Jun 30, 2019

Why are these changes needed?

@zrphercule (Member, Author) Jun 30, 2019

These are to be removed lol

@@ -46,6 +46,10 @@ run_and_check_resnet50_bundle() {
  done
}

run_pytorch_tests() {
  python3.6 "${GLOW_SRC}/torch_glow/setup.py" test --run_cmake

@jackm321 (Contributor) Jun 30, 2019

Probably better to just cd to ${GLOW_SRC}/torch_glow and run from there.
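
A minimal sketch of that suggestion, reusing the helper and variables from the hunk above; the subshell keeps the directory change from leaking into later steps:

run_pytorch_tests() {
  # run setup.py from inside torch_glow so its relative paths resolve;
  # the subshell restores the previous working directory afterwards
  (
    cd "${GLOW_SRC}/torch_glow" || exit 1
    python3.6 setup.py test --run_cmake
  )
}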

CMAKE_ARGS+=("-DCMAKE_C_COMPILER=/usr/bin/clang-8")

if [[ "${CIRCLE_JOB}" == "PYTORCH" ]]; then
  CMAKE_ARGS=("-DCMAKE_CXX_COMPILER=/usr/bin/clang++-7")

@jackm321 (Contributor) Jun 30, 2019

I know you mentioned this, but why do we need clang 7 for PyTorch?

@zrphercule (Member, Author) Jun 30, 2019

Actually we don't. It is because we need a python3 + clang 7.0+ docker image, and we don't have a python3 + clang 8 image...

@rdzhabarov (Contributor) Jul 9, 2019

What stops us from making one?

@zrphercule (Member, Author) Jul 9, 2019

What stops us from making one?

Nothing; I will do this tomorrow. The only problem here is that we are using caffe2 docker images, and I need to find out who is maintaining them...

@zrphercule changed the title from "[WIP][Dont Review/Merge] Make CI for glow-torch" to "[WIP]Make CI for glow-torch" on Jun 30, 2019

@zrphercule (Member, Author) commented Jun 30, 2019

The test output says the following; do you know why this is?

Ran 0 tests in 0.000s

OK

Not really; python torch_glow/setup.py test shows the same thing locally and in my docker image. I will try to follow your suggestion and cd into torch_glow first.

@zrphercule force-pushed the zrphercule:makeci branch from 31fd63a to c9e7ad9 on Jul 4, 2019

@zrphercule changed the title from "[WIP]Make CI for glow-torch" to "Make CI for glow-torch" on Jul 8, 2019

@zrphercule force-pushed the zrphercule:makeci branch 2 times, most recently from 852ca17 to 337aa8c, on Jul 8, 2019

@zrphercule force-pushed the zrphercule:makeci branch from 337aa8c to eb79a01 on Jul 8, 2019

@zrphercule (Member, Author) commented Jul 8, 2019

@jackm321 This PR is ready to be merged; however, CI is currently super slow because we are building PyTorch from source, so it takes about one hour for CI to pass.
I think 1 hour is tolerable for now; what is your opinion?

@facebook-github-bot left a comment

@zrphercule has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@yinghai (Contributor) approved these changes Jul 8, 2019 and left a comment

1 hr is not ideal, but maybe acceptable, as the internal Glow CI takes about an hour too.

@rdzhabarov (Contributor) left a comment

The current open source CI takes 10 minutes to run, which is nice and gives a very quick signal (though ultimately we are still bottlenecked by sandcastle for merging).

@jackm321 @zrphercule any thoughts on how to reduce that 1 hr run?
Could we put more pre-built stuff into the docker container, etc.?

@zrphercule (Member, Author) commented Jul 9, 2019

The current open source CI takes 10 minutes to run, which is nice and gives a very quick signal (though ultimately we are still bottlenecked by sandcastle for merging).

@jackm321 @zrphercule any thoughts on how to reduce that 1 hr run?
Could we put more pre-built stuff into the docker container, etc.?

I agree that 1 hr is still too long for the OSS build; eventually we need to reduce the build time.
One solution, as you mentioned, is to build a docker image of our own, with a stable pytorch version and the environment we need (py3 & clang8). The only problem is that we might need to sync pytorch periodically into this docker image to make sure that everything works fine (which I guess is not a big problem).

The open question for this PR is whether we can afford 1 or 2 weeks of slow builds before we update the docker image (since I am not confident I can finish that in a day or two; it looks easy, but it is CI related, who knows...), because we may want to merge a lot of things, including resnet support, soon after this CI lands.

@yinghai (Contributor) commented Jul 9, 2019

Let's set up sccache and see how it helps.
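
A rough sketch of one way to route a CMake-driven build through sccache; the compiler paths and job count are illustrative, and how this plugs into the actual PyTorch-from-source step in the CI scripts would need checking:

# Hypothetical fragment: use sccache as a compiler launcher so repeated CI
# builds of unchanged sources hit the compilation cache instead of recompiling.
cmake -DCMAKE_C_COMPILER_LAUNCHER=sccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=sccache \
      -DCMAKE_C_COMPILER=/usr/bin/clang-8 \
      -DCMAKE_CXX_COMPILER=/usr/bin/clang++-8 ..
make -j "$(nproc)"
sccache --show-stats   # report the cache hit rate after the build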

@rdzhabarov (Contributor) commented Jul 9, 2019

The only problem is that we might need to sync pytorch periodically into this docker image to make sure that everything works fine (which I guess is not a big problem).

This would require some manual work every time we update pytorch, and we likely want to build against the latest. I'd try to speed up the existing approach.

Do you have a rough breakdown of where most of the time is spent?

Is it just building pytorch (sccache, as @yinghai mentioned; do we make use of parallelism during the build)?
Or is it setting up the pytorch dependencies? Those are easy to preinstall into docker, as in the sketch below.
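
As a sketch of that second option, the build-time Python dependencies could be baked into the image ahead of time; the package list below is illustrative, and the real list would come from PyTorch's requirements at whatever version is pinned:

# Hypothetical docker provisioning step: preinstall PyTorch's build-time Python
# dependencies so each CI run skips this setup work.
pip3 install pyyaml numpy cffi typing future six ninja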

@jackm321 (Contributor) commented Jul 9, 2019

I agree 1 hour is too long and we should look for an alternative.
@rdzhabarov, out of curiosity, how painful is it to update a docker image each time we want to update the PyTorch version?
sccache sounds like it could be a better option.

@facebook-github-bot left a comment

@zrphercule has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

CI time has been reduced to 20min while Roman is sick QAQ

CI
@facebook-github-bot left a comment

@zrphercule has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot left a comment

@zrphercule has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@rdzhabarov (Contributor) commented Jul 10, 2019

@zrphercule perfect, thank you! Have you tried disabling CUDA compilation as well? Does it speed things up?

@zrphercule (Member, Author) commented Jul 10, 2019

@zrphercule perfect, thank you! Have you tried disabling CUDA compilation as well? Does it speed things up?

Not really; since we don't have CUDA, it is ignored anyway. In any case, it is now down to 15 min, which I think is good enough. Cheers!

@rdzhabarov (Contributor) commented Jul 10, 2019

Before merging, please check why "coverage" now takes 20 mins instead of 10 mins; build time increased by a lot.

check:
https://circleci.com/gh/pytorch/glow/24822?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
vs
https://circleci.com/gh/pytorch/glow/24729?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
(or other previous coverage builds).

Pytorch test time looks great.

@houseroad (Member) left a comment

Rebase to resolve the conflict?

@@ -97,6 +97,11 @@ jobs:
    environment:
      DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/caffe2/py2-clang8-ubuntu16.04:283"
    <<: *linux_default
  PYTORCH:
    resource_class: 2xlarge+

@rdzhabarov (Contributor) Jul 11, 2019

xlarge? I do not see this in the documentation: https://circleci.com/docs/2.0/configuration-reference/#resource_class

See also the warning in the corresponding CI job.

@zrphercule (Member, Author) Jul 11, 2019

Hmm, this is a recommendation from tvm. Will double check.

@rdzhabarov changed the title from "Make CI for glow-torch" to "[CI] Make CI for glow-torch" on Jul 11, 2019

zrphercule added some commits Jul 11, 2019

@zrphercule (Member, Author) commented Jul 11, 2019

Coverage time is reduced to 10 min after removing sccache from the Coverage test.
Let's go!

@facebook-github-bot left a comment

@zrphercule is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@rdzhabarov (Contributor) commented Jul 11, 2019

Hmm, this is a recommendation from tvm. Will double check.

Does not seem to work. Let's make that xlarge and merge this PR.
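
One way to catch config problems like this before pushing, assuming the CircleCI local CLI is installed; it checks the config schema, though plan-specific resource classes may still only show up as warnings in the running job:

# Validate .circleci/config.yml locally; schema errors are reported here,
# while an unsupported resource_class may only appear as a warning in CI.
circleci config validate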

@zrphercule (Member, Author) commented Jul 11, 2019

Hmm, this is a recommendation from tvm. Will double check.

Does not seem to work. Let's make that xlarge and merge this PR.

lol, this diff has just landed internally. I will make another one to change this...

@rdzhabarov (Contributor) commented Jul 11, 2019

Sounds good!

facebook-github-bot added a commit that referenced this pull request Jul 11, 2019

Make large host for pytorch. (#3223)
Summary:
2xlarge+ is not currently available.

Some comments from #3164

Documentation:
n/a

cc: zrphercule
Pull Request resolved: #3223

Differential Revision: D16207528

Pulled By: rdzhabarov

fbshipit-source-id: 366886ba863d5fba09f9cbd86dcb976d0d3caccd
@facebook-github-bot commented Jul 11, 2019

@zrphercule merged this pull request in 05c925e.
