
Extremely slow execution for torch models after pyinstaller packing #8211

Closed · fridary opened this issue Jan 5, 2024 · 7 comments

fridary commented Jan 5, 2024

OS: Windows 11 x64
python: 3.11.7 (latest)
torch: 2.1.2 (latest)
torchvision: 0.16.2 (latest)
pyinstaller: 6.3.0 (latest)
ultralytics: 8.0.231 (latest)

Ultralytics is a torch-based library for working with images and machine learning. When I run a torch/ultralytics model.predict() call (a machine-learning image prediction) after packing with pyinstaller, it is extremely slow: ~12 seconds. Without packing it takes ~1.8 seconds.

Any ideas what could be wrong? At 12 seconds per prediction, the application is unusable.

How I pack:
$ pyinstaller --onedir --hidden-import 'torch.jit' --collect-all ultralytics --collect-all torch --collect-all torchvision script.py

Minimal Reproducible Example

$ conda create --name test1 python=3.11
$ conda activate test1
$ conda install libpng jpeg
$ conda install pytorch==2.1.2 torchvision==0.16.2 -c pytorch
$ pip install ultralytics pyinstaller

script.py:

import time
import cv2
import numpy as np
import base64
import multiprocessing
from ultralytics import YOLO

if __name__ == "__main__":
    multiprocessing.freeze_support()

    # read the image and round-trip it through base64 before decoding with OpenCV
    with open('img.png', 'rb') as f:
        image_data_binary = f.read()
        image_puzzle = (base64.b64encode(image_data_binary)).decode('ascii')
    nparr = np.frombuffer(base64.b64decode(image_puzzle), np.uint8)
    img_np = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

    model = YOLO('yolov8n.pt')  # downloaded automatically from the ultralytics GitHub repo on first run
    print('model loaded')
    start_time = time.time()
    results = model.predict(source=img_np, show=False, save=False, save_conf=False, show_conf=False, save_txt=False)
    print("predicted in %.3f seconds" % (time.time() - start_time))
    time.sleep(5)

img.png can be any PNG image.

fridary added the triage label Jan 5, 2024
rokm (Member) commented Jan 5, 2024

Does it make any difference if you divert the program flow in the multiprocessing worker subprocesses as soon as possible, instead of going through the imports of numpy, cv2, etc.?

I.e., if you reorganize the code as follows:

# Divert the program in multiprocessing workers as soon as possible...
if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    
import time
import cv2
import numpy as np
import base64
from ultralytics import YOLO

if __name__ == "__main__":
    with open('img.png', 'rb') as f:
        image_data_binary = f.read()
        image_puzzle = (base64.b64encode(image_data_binary)).decode('ascii')
    nparr = np.frombuffer(base64.b64decode(image_puzzle), np.uint8)
    img_np = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    
    model = YOLO('yolov8n.pt') # this file will be downloaded automatically from github repo
    print('model loaded')
    start_time = time.time()
    results = model.predict(source=img_np, show=False, save=False, save_conf=False, show_conf=False, save_txt=False)
    print("predicted in %.3f seconds" % (time.time() - start_time))
    time.sleep(5)
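(The idea, assuming torch/ultralytics spawn worker subprocesses under the hood: in a frozen application, each spawned worker re-runs the executable from the top of the script, so calling freeze_support() before the heavy imports lets the workers divert into their work loop without first paying the import cost of numpy, cv2, torch, and friends.)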

fridary (Author) commented Jan 6, 2024

@rokm thank you very much, prediction is now ~3.1 seconds. That is still ~1.3 seconds slower than the original. Is there anything else we can do to speed it up? The difference is about 2x; I can work with that, but faster is of course always better.

rokm (Member) commented Jan 6, 2024

Hmm, I'm afraid I'm out of other obvious optimization opportunities.

FWIW, I can reproduce the original slowdown (although not to such a drastic extent) on my Win10 x64 laptop, but the modified program seems to perform comparably to the original unfrozen code. Then again, I'm using python.org python with all dependencies installed via pip in a clean virtual environment - i.e., no conda involved.

Full reproduction steps:

>c:\Users\Rok\AppData\Local\Programs\Python\Python311\python.exe -m venv venv
>venv\scripts\activate
>python -m pip install -U pip wheel setuptools
>pip install pyinstaller ultralytics

This installs torch and torchvision from PyPI, as well as the latest pyinstaller and the latest pyinstaller-hooks-contrib (which has pyinstaller/pyinstaller-hooks-contrib#676, so torch and torchvision work out-of-the-box):

>pip freeze
altgraph==0.17.4
certifi==2023.11.17
charset-normalizer==3.3.2
colorama==0.4.6
contourpy==1.2.0
cycler==0.12.1
filelock==3.13.1
fonttools==4.47.0
fsspec==2023.12.2
idna==3.6
Jinja2==3.1.2
kiwisolver==1.4.5
MarkupSafe==2.1.3
matplotlib==3.8.2
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.3
opencv-python==4.9.0.80
packaging==23.2
pandas==2.1.4
pefile==2023.2.7
pillow==10.2.0
psutil==5.9.7
py-cpuinfo==9.0.0
pyinstaller==6.3.0
pyinstaller-hooks-contrib==2023.12
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
pywin32-ctypes==0.2.2
PyYAML==6.0.1
requests==2.31.0
scipy==1.11.4
seaborn==0.13.1
six==1.16.0
sympy==1.12
thop==0.1.1.post2209072238
torch==2.1.2
torchvision==0.16.2
tqdm==4.66.1
typing_extensions==4.9.0
tzdata==2023.4
ultralytics==8.0.235
urllib3==2.1.0

Now, if I run the unfrozen script 5 times (on an otherwise idle system):

>python process_image.py
Downloading https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8n.pt to 'yolov8n.pt'...
100%|█████████████████████████████████████████████████████████████████████████████| 6.23M/6.23M [00:01<00:00, 6.34MB/s]
model loaded

0: 640x480 1 person, 131.7ms
Speed: 5.6ms preprocess, 131.7ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 3.101 seconds
>python process_image.py
model loaded

0: 640x480 1 person, 147.6ms
Speed: 6.0ms preprocess, 147.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.229 seconds
>python process_image.py
model loaded

0: 640x480 1 person, 141.6ms
Speed: 6.0ms preprocess, 141.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.233 seconds
>python process_image.py
model loaded

0: 640x480 1 person, 132.7ms
Speed: 6.0ms preprocess, 132.7ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.221 seconds
>python process_image.py
model loaded

0: 640x480 1 person, 154.6ms
Speed: 6.0ms preprocess, 154.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.246 seconds

With the exception of the very first run, the total measured prediction time is ~2.2 seconds.


If I freeze the original version of the program

>pyinstaller --clean --noconfirm --collect-data ultralytics process_image.py

and run the resulting executable five times:

>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 90.8ms
Speed: 4.0ms preprocess, 90.8ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 6.430 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 125.8ms
Speed: 5.0ms preprocess, 125.8ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 5.744 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 141.6ms
Speed: 4.0ms preprocess, 141.6ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 5.775 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 106.7ms
Speed: 5.0ms preprocess, 106.7ms inference, 1.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 5.779 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 130.6ms
Speed: 5.0ms preprocess, 130.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 5.786 seconds

The measured total prediction time is ~5.7 s.


Finally, if I move the multiprocessing.freeze_support() call to the top (as in the reorganized example above), rebuild, and run five times:

>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 93.7ms
Speed: 4.0ms preprocess, 93.7ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.665 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 136.6ms
Speed: 4.0ms preprocess, 136.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.002 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 129.7ms
Speed: 5.0ms preprocess, 129.7ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.013 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 133.6ms
Speed: 6.0ms preprocess, 133.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.063 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 126.7ms
Speed: 6.0ms preprocess, 126.7ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 1.996 seconds

This time, I get a prediction time that is comparable to (if not slightly better than) the unfrozen version.

rokm (Member) commented Jan 6, 2024

Can you try with a pure python.org python + pip environment instead of conda, to see if the difference between the unfrozen and frozen versions is smaller on your hardware?

I suppose it would also be interesting to compare, on the same hardware, the speed of the unfrozen script in the python.org + pip environment vs. the conda environment that you are using.

rokm removed the triage label Jan 6, 2024
fridary (Author) commented Jan 7, 2024

@rokm I tested in a venv and the speed jumps between 2 and 4 seconds on a clean script start. Very weird. After packing with pyinstaller, it's ~3 seconds. Anyway, thank you, I will keep testing. Which Python version did you use?

rokm (Member) commented Jan 7, 2024

@rokm I tested in a venv and the speed jumps between 2 and 4 seconds on a clean script start. Very weird. After packing with pyinstaller, it's ~3 seconds. Anyway, thank you, I will keep testing. Which Python version did you use?

I used python.org python 3.11.7.

If python.org + pip python is slower in the unfrozen variant compared to conda, it could be that conda-installed packages (numpy comes to mind, since the Anaconda build uses MKL) are more optimized for this sort of task. It could also be that, due to the use of MKL, Anaconda numpy is slower to import, which would explain why the initial slow-down in the frozen application was more pronounced in your case than in mine.
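
A quick, rough way to check would be to time the heavy imports in both environments. Just a sketch to compare orders of magnitude, not a rigorous benchmark:

import time

# Time each heavy import separately; run this in both the conda and the
# python.org + pip environment and compare the numbers.
for name in ("numpy", "cv2", "torch", "ultralytics"):
    start = time.perf_counter()
    __import__(name)
    print("%-12s imported in %.3f s" % (name, time.perf_counter() - start))

If the conda builds of numpy/torch take noticeably longer to import, that would fit this theory.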

I suppose I'll also check what happens on my laptop with a (mini)conda environment.

rokm (Member) commented Jan 7, 2024

I've tested with a miniconda-installed environment, following your instructions, and on my laptop the performance is largely unchanged.

For reference, here is the output of conda env export (showing what is installed from conda packages and what from PyPI):


>conda env export
name: pyi-ultralytics
channels:
  - pytorch
  - defaults
dependencies:
  - blas=1.0=mkl
  - brotli-python=1.0.9=py311hd77b12b_7
  - bzip2=1.0.8=he774522_0
  - ca-certificates=2023.12.12=haa95532_0
  - certifi=2023.11.17=py311haa95532_0
  - cffi=1.16.0=py311h2bbff1b_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - cryptography=41.0.7=py311h89fc84f_0
  - filelock=3.13.1=py311haa95532_0
  - freetype=2.12.1=ha860e81_0
  - giflib=5.2.1=h8cc25b3_3
  - gmpy2=2.1.2=py311h7f96b67_0
  - idna=3.4=py311haa95532_0
  - intel-openmp=2023.1.0=h59b6b97_46320
  - jinja2=3.1.2=py311haa95532_0
  - jpeg=9e=h2bbff1b_1
  - lerc=3.0=hd77b12b_0
  - libdeflate=1.17=h2bbff1b_1
  - libffi=3.4.4=hd77b12b_0
  - libjpeg-turbo=2.0.0=h196d8e1_0
  - libpng=1.6.39=h8cc25b3_0
  - libtiff=4.5.1=hd77b12b_0
  - libuv=1.44.2=h2bbff1b_0
  - libwebp=1.3.2=hbc33d0d_0
  - libwebp-base=1.3.2=h2bbff1b_0
  - lz4-c=1.9.4=h2bbff1b_0
  - markupsafe=2.1.3=py311h2bbff1b_0
  - mkl=2023.1.0=h6b88ed4_46358
  - mkl-service=2.4.0=py311h2bbff1b_1
  - mkl_fft=1.3.8=py311h2bbff1b_0
  - mkl_random=1.2.4=py311h59b6b97_0
  - mpc=1.1.0=h7edee0f_1
  - mpfr=4.0.2=h62dcd97_1
  - mpir=3.0.0=hec2e145_1
  - mpmath=1.3.0=py311haa95532_0
  - networkx=3.1=py311haa95532_0
  - numpy=1.26.3=py311hdab7c0b_0
  - numpy-base=1.26.3=py311hd01c5d8_0
  - openjpeg=2.4.0=h4fc8c34_0
  - openssl=3.0.12=h2bbff1b_0
  - pillow=10.0.1=py311h045eedc_0
  - pip=23.3.1=py311haa95532_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pyopenssl=23.2.0=py311haa95532_0
  - pysocks=1.7.1=py311haa95532_0
  - python=3.11.7=he1021f5_0
  - pytorch=2.1.2=py3.11_cpu_0
  - pytorch-mutex=1.0=cpu
  - pyyaml=6.0.1=py311h2bbff1b_0
  - requests=2.31.0=py311haa95532_0
  - setuptools=68.2.2=py311haa95532_0
  - sqlite=3.41.2=h2bbff1b_0
  - sympy=1.12=py311haa95532_0
  - tbb=2021.8.0=h59b6b97_0
  - tk=8.6.12=h2bbff1b_0
  - torchvision=0.16.2=py311_cpu
  - typing_extensions=4.7.1=py311haa95532_0
  - urllib3=1.26.18=py311haa95532_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - wheel=0.41.2=py311haa95532_0
  - win_inet_pton=1.1.0=py311haa95532_0
  - xz=5.4.5=h8cc25b3_0
  - yaml=0.2.5=he774522_0
  - zlib=1.2.13=h8cc25b3_0
  - zstd=1.5.5=hd43e919_0
  - pip:
      - altgraph==0.17.4
      - colorama==0.4.6
      - contourpy==1.2.0
      - cycler==0.12.1
      - fonttools==4.47.0
      - fsspec==2023.12.2
      - kiwisolver==1.4.5
      - matplotlib==3.8.2
      - opencv-python==4.9.0.80
      - packaging==23.2
      - pandas==2.1.4
      - pefile==2023.2.7
      - psutil==5.9.7
      - py-cpuinfo==9.0.0
      - pyinstaller==6.3.0
      - pyinstaller-hooks-contrib==2023.12
      - pyparsing==3.1.1
      - python-dateutil==2.8.2
      - pytz==2023.3.post1
      - pywin32-ctypes==0.2.2
      - scipy==1.11.4
      - seaborn==0.13.1
      - six==1.16.0
      - thop==0.1.1-2209072238
      - tqdm==4.66.1
      - tzdata==2023.4
      - ultralytics==8.0.237
prefix: C:\Users\Rok\miniconda3\envs\pyi-ultralytics

And the results (using the same image):

Five runs of the unfrozen script:

>python process_image.py
model loaded

0: 640x480 1 person, 137.6ms
Speed: 6.0ms preprocess, 137.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.424 seconds
>python process_image.py
model loaded

0: 640x480 1 person, 156.6ms
Speed: 5.0ms preprocess, 156.6ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.311 seconds
>python process_image.py
model loaded

0: 640x480 1 person, 164.6ms
Speed: 6.0ms preprocess, 164.6ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.342 seconds
>python process_image.py
model loaded

0: 640x480 1 person, 152.1ms
Speed: 6.0ms preprocess, 152.1ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.303 seconds
>python process_image.py
model loaded

0: 640x480 1 person, 154.6ms
Speed: 6.0ms preprocess, 154.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.292 seconds

And five runs of the frozen application rebuilt in this environment (which, as a side note, is now 1.19 GB, compared to 618 MB when built with python.org and pip-installed dependencies):

>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 142.6ms
Speed: 6.0ms preprocess, 142.6ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.918 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 128.2ms
Speed: 5.0ms preprocess, 128.2ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.081 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 132.6ms
Speed: 5.0ms preprocess, 132.6ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.112 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 136.6ms
Speed: 4.0ms preprocess, 136.6ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.094 seconds
>dist\process_image\process_image.exe
model loaded

0: 640x480 1 person, 127.7ms
Speed: 5.0ms preprocess, 127.7ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 480)
predicted in 2.055 seconds

So I get results that are largely consistent, both between the frozen and unfrozen versions and with the earlier results from the python.org + pip environment.

No idea why it differs in your case.

fridary closed this as completed Jan 8, 2024
github-actions bot locked as resolved and limited conversation to collaborators Mar 9, 2024