Refactored Dockerfiles #625

Closed
wants to merge 6 commits into from
Changes from 5 commits
4 changes: 4 additions & 0 deletions README.md
@@ -206,6 +206,10 @@ docker run -it -v$(pwd)/..:/workspace/TRTorch build_trtorch_wheel /bin/bash /wor
```
Python compilation expects using the tarball based compilation strategy from above.

## Building Docker Image
The `Dockerfile` in the `//docker` directory builds an image based on the latest NGC PyTorch container, with TRTorch binaries, libraries, and headers installed in `/opt/trtorch` and the Python wheel installed locally.
To use a different NGC PyTorch container release as the base, pass `--build-arg BASE=<release>`, e.g. `--build-arg BASE=21.06` for 21.06.
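
For illustration, the `BASE` build argument can be wrapped in a small helper. This is only a sketch: the function name `build_image`, the `DRY_RUN` switch, and the `trtorch:<release>` image tag are assumptions made for this example, not names defined by the project. It also assumes the command is run from the repository root, since the Dockerfile copies the whole repository into the image.

```shell
#!/bin/bash
# Hypothetical helper: assemble (and optionally run) the docker build command
# for a given NGC PyTorch release. The image tag is an assumed name.
build_image() {
  base="$1"
  tag="trtorch:${base}"   # assumed tag, chosen for this example only
  # Reuse the positional parameters to hold the command words.
  set -- docker build --build-arg "BASE=${base}" -f docker/Dockerfile -t "${tag}" .
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"             # dry run: only print the command
  else
    "$@"                  # run the build from the repository root
  fi
}

build_image 21.06
```

By default the helper only prints the command; setting `DRY_RUN=0` runs the build.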

## How do I add support for a new op...

### In TRTorch?
44 changes: 44 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,44 @@
ARG BASE=21.07
FROM nvcr.io/nvidia/pytorch:${BASE}-py3 as base

FROM base as trtorch-builder-base

RUN apt-get update && apt-get install -y curl gnupg
RUN curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > /etc/apt/trusted.gpg.d/bazel.gpg
RUN echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list

RUN apt-get update && apt-get install -y bazel-4.0.0
RUN ln -s /usr/bin/bazel-4.0.0 /usr/bin/bazel

# Workaround: Bazel expects both static and shared versions, but we only use shared libraries inside the container
RUN cp /usr/lib/x86_64-linux-gnu/libnvinfer.so /usr/lib/x86_64-linux-gnu/libnvinfer_static.a
borisfom marked this conversation as resolved.

RUN apt-get update && apt-get install -y locales ninja-build && rm -rf /var/lib/apt/lists/* && locale-gen en_US.UTF-8

FROM trtorch-builder-base as trtorch-builder

COPY . /workspace/trtorch/src
WORKDIR /workspace/trtorch/src
RUN cp ./docker/WORKSPACE.cu.docker WORKSPACE

# This script builds both the libtrtorch bin/lib/include tarball and the Python wheel, in dist/
RUN ./docker/dist-build.sh

FROM base as trtorch

# copy source repo
COPY . /workspace/trtorch
COPY --from=trtorch-builder /workspace/trtorch/src/dist/ .

RUN conda init bash

RUN pip install ipywidgets --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host=files.pythonhosted.org
RUN jupyter nbextension enable --py widgetsnbextension

RUN mkdir -p /opt/trtorch && tar xvf libtrtorch.tar.gz --strip-components 2 -C /opt/trtorch --exclude=LICENSE && pip3 install *.whl && rm -fr /workspace/trtorch/dist/*
Collaborator

Do we want this library to be in /usr/ or /opt?

Collaborator Author

This is up to you to decide. Since various parts of TRTorch are going to be pulled into other containers (Triton, Riva SM), it's easier to handle if it's under its own root, but I am fine with whichever way you choose.

Collaborator

If there is a standing convention in DLFW containers to put the libraries we add in /opt, then I am fine with /opt. Otherwise my preference is /usr, since it seems more conventional and easier to link against.

Collaborator Author

Yes we did put things in /opt, like triton and such. Let's use that for now.
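
As a rough sketch of what linking against the `/opt` install would involve: `/opt/trtorch` is not on the default compiler search path, so a build would have to pass the prefix explicitly. The helper name `trtorch_flags` is hypothetical, and the `include/` and `lib/` subdirectories are assumed from the bin/lib/include tarball layout mentioned in the Dockerfile. The PyTorch and TensorRT flags that a real build also needs are left out.

```shell
#!/bin/bash
# Hypothetical helper: print compiler/linker flags for a TRTorch install prefix.
# Assumes <prefix>/include holds the headers and <prefix>/lib holds libtrtorch.
trtorch_flags() {
  prefix="${1:-/opt/trtorch}"
  echo "-I${prefix}/include -L${prefix}/lib -ltrtorch"
}

trtorch_flags             # flags for the /opt/trtorch layout chosen in this thread
trtorch_flags /usr/local  # flags if the tarball were unpacked under /usr/local instead
```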


ENV LD_LIBRARY_PATH /opt/conda/lib/python3.8/site-packages/torch/lib:/opt/trtorch/lib:${LD_LIBRARY_PATH}
ENV PATH /opt/trtorch/bin:${PATH}

WORKDIR /workspace/trtorch/
Collaborator

Do we need to do any sort of cleanup for size reduction? Do we conventionally leave the source in the container?

Collaborator

For example, things like Bazel are probably not needed in the container itself after testing and installation.

Collaborator Author

Bazel is only installed in the trtorch-builder image, not in the target container. I do leave the source in the TRTorch container; I won't in the PyTorch MR.

Collaborator

ok
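
The multi-stage layout discussed in this thread (Bazel installed only in the `trtorch-builder` stage) can be spot-checked from a shell inside the final image. A minimal sketch follows; the helper name `has_tool` is made up for illustration, and the messages are arbitrary.

```shell
#!/bin/bash
# has_tool NAME: succeeds if NAME resolves to a command on PATH.
has_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Inside the final `trtorch` image, bazel is expected to be absent,
# because it is only installed in the builder stage.
if has_tool bazel; then
  echo "bazel found: build tooling is present in this environment"
else
  echo "bazel not found"
fi
```

The result depends on where the script runs: it is meant to be executed inside the built container, not on the host.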

CMD /bin/bash
19 changes: 11 additions & 8 deletions docker/Dockerfile.21.06
@@ -15,14 +15,13 @@ RUN pip install notebook

FROM builder as trtorch

-COPY . /opt/trtorch
-RUN rm /opt/trtorch/WORKSPACE
-COPY ./docker/WORKSPACE.cu.docker /opt/trtorch/WORKSPACE
+COPY . /opt/trtorch/src
+WORKDIR /opt/trtorch/src
+RUN cp ./docker/WORKSPACE.cu.docker WORKSPACE

-WORKDIR /opt/trtorch
-RUN bazel build //:libtrtorch --compilation_mode opt
+RUN bazel build //:libtrtorch //:bin --compilation_mode opt

-WORKDIR /opt/trtorch/py
+WORKDIR /opt/trtorch/src/py

RUN pip install ipywidgets --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host=files.pythonhosted.org
RUN jupyter nbextension enable --py widgetsnbextension
@@ -32,11 +31,15 @@ RUN apt-get update && apt-get install -y locales ninja-build && rm -rf /var/lib/
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
-RUN python3 setup.py install --use-cxx11-abi
+RUN MAX_JOBS=1 python3 setup.py install --use-cxx11-abi

RUN conda init bash

-ENV LD_LIBRARY_PATH /opt/conda/lib/python3.8/site-packages/torch/lib:$LD_LIBRARY_PATH
+WORKDIR /opt/trtorch/src
+
+RUN cd ./bazel-bin && tar xvf libtrtorch.tar.gz -C /usr/local --strip-components 2 --exclude=LICENSE
+
+ENV LD_LIBRARY_PATH /opt/conda/lib/python3.8/site-packages/torch/lib:${LD_LIBRARY_PATH}

WORKDIR /opt/trtorch/
CMD /bin/bash
42 changes: 0 additions & 42 deletions docker/Dockerfile.21.07

This file was deleted.

21 changes: 21 additions & 0 deletions docker/dist-build.sh
@@ -0,0 +1,21 @@
#!/bin/bash

mkdir -p dist

bazel build //:libtrtorch //:bin --compilation_mode opt
borisfom marked this conversation as resolved.

cd py && MAX_JOBS=1 LANG=en_US.UTF-8 LANGUAGE=en_US:en LC_ALL=en_US.UTF-8 python3 setup.py bdist_wheel --use-cxx11-abi

cd ..

cp bazel-bin/libtrtorch.tar.gz dist/
cp py/dist/* dist/

pip3 install ipywidgets --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host=files.pythonhosted.org
jupyter nbextension enable --py widgetsnbextension

pip3 install timm

# test install
mkdir -p /opt/trtorch && tar xvf dist/libtrtorch.tar.gz --strip-components 2 -C /opt/trtorch --exclude=LICENSE && pip3 install dist/*.whl

8 changes: 8 additions & 0 deletions docker/dist-test.sh
@@ -0,0 +1,8 @@
#!/bin/bash

# Build and run unit tests
cd tests/modules && python3 ./hub.py
cd ../..

bazel test //tests:tests //tests:python_api_tests --compilation_mode opt --jobs 4

2 changes: 1 addition & 1 deletion py/requirements.txt
@@ -1,3 +1,3 @@
 -f https://download.pytorch.org/whl/torch_stable.html
-torch==1.9.0+cu111
+torch>=1.9.0+cu111
borisfom marked this conversation as resolved.
 pybind11==2.6.2
4 changes: 2 additions & 2 deletions py/setup.py
@@ -203,7 +203,7 @@ def run(self):
'trtorch/csrc/tensorrt_classes.cpp',
'trtorch/csrc/register_tensorrt_classes.cpp',
],
-library_dirs=[(dir_path + '/trtorch/lib/'), "/opt/conda/lib/python3.6/config-3.6m-x86_64-linux-gnu"],
+library_dirs=[(dir_path + '/trtorch/lib/')],
libraries=["trtorch"],
include_dirs=[
dir_path + "trtorch/csrc",
@@ -234,7 +234,7 @@ def run(self):
long_description=long_description,
ext_modules=ext_modules,
install_requires=[
-'torch>=1.9.0+cu111,<1.10.0',
+'torch>=1.9.0+cu111,<1.11.0',
Collaborator

I don't think we want this change in master until we retarget the mainline codebase to PyTorch 1.10 (after PyTorch's 1.10 release). @andi4191, this might be a good time to start working through the DLFW release workflow. We can retarget this branch to be merged into a DLFW-21.10 branch (let's all agree on a convention for these branch names, @borisfom @andi4191 @ptrblck) and then cherry-pick any changes we want in the mainline at a later date (like the single unified Dockerfile, which I think will be super useful).

Collaborator Author

Why is this critical? It just allows TRTorch from the master branch to be used in the 21.07, 21.08, and 21.09 containers, and it breaks nothing?

Collaborator

Because in the typical case (that is, not the state of the repo as it is right now) we only guarantee compatibility with the latest released PyTorch, since we depend on internal APIs. There have been cases in the past where changes needed to support pytorch-next broke support for the previous version. We aren't checking against PyTorch ToT, and if a user has a newer build installed, it might not work even though setup.py says the dependency is compatible. I'd prefer that we make changes specifically targeted at DLFW only in their own branches, so that the guarantees we give to users stay simple to understand.

Collaborator Author

@borisfom borisfom Sep 24, 2021

Well, the concern is understood. However, the effect of a user trying to install the current ToT TRTorch in a 21.07+ container would not be a non-working TRTorch; it would mess up the user's container (it will uninstall PyTorch 1.10a and try to install 1.9). So I believe we'll frustrate fewer users if we relax the requirement, especially since I have checked both 21.07 and 21.08 and it works, at least for the most part.

Contributor

@borisfom: I agree with @narendasan on this. We don't have control over the end-user use cases here. It is safe to proceed with what we can guarantee works.

Also, can you re-target this PR to release/ngc/21.10 instead of DLFW-21.10?
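
To make the disagreement concrete, here is a simplified sketch of how the two upper bounds in `setup.py` treat a 1.10 pre-release build. It compares only the numeric `major.minor.patch` release segment and is not a full implementation of pip's version matching. The function name `release_lt` and the version string `1.10.0a0` (standing in for the "1.10a" build mentioned in this thread) are assumptions for illustration, and versions are assumed to have exactly three numeric parts.

```shell
#!/bin/bash
# release_lt VERSION BOUND: succeeds if VERSION's release segment is below BOUND.
# Simplification: pre-release and local suffixes (e.g. "a0", "+cu111") are dropped.
release_lt() {
  a="${1%%[!0-9.]*}"      # "1.10.0a0" -> "1.10.0"
  b="${2%%[!0-9.]*}"
  old_ifs="$IFS"
  IFS=.
  set -- $a $b            # split both versions into six numeric fields
  IFS="$old_ifs"
  [ "$1" -lt "$4" ] && return 0
  [ "$1" -gt "$4" ] && return 1
  [ "$2" -lt "$5" ] && return 0
  [ "$2" -gt "$5" ] && return 1
  [ "$3" -lt "$6" ]
}

installed="1.10.0a0"      # illustrative stand-in for the container's PyTorch build
for upper in 1.10.0 1.11.0; do
  if release_lt "$installed" "$upper"; then
    echo "torch ${installed} satisfies <${upper}"
  else
    echo "torch ${installed} violates <${upper}"
  fi
done
```

Under this simplified comparison the installed build falls outside `<1.10.0` but inside `<1.11.0`, which is the case the relaxed bound is meant to cover.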

],
setup_requires=[],
cmdclass={
2 changes: 1 addition & 1 deletion tests/modules/requirements.txt
@@ -1,3 +1,3 @@
 -f https://download.pytorch.org/whl/torch_stable.html
 timm==v0.4.12
-torch==1.9.0+cu111
+torch>=1.9.0+cu111