Hi,
I like this project a lot, and thanks for releasing it! I see the potential to save a lot of time when I first receive new datasets. However, I'm running into performance issues.
Environment: the notebook runs in a Docker container based on https://hub.docker.com/r/tensorflow/tensorflow/.
Hardware:
- GPUs: 16x NVIDIA® Tesla V100
- GPU memory: 512 GB total
- CPU: dual Intel Xeon Platinum 8168, 2.7 GHz, 24 cores each
- System memory: 1.5 TB
It takes me more than 8 hours (30,900 s) to compute the statistics for a dataset of ~100 files, with file sizes ranging from 0.5 MB to 300 MB (median 70 MB). The Docker container does introduce some overhead, but given the hardware specs above, this seems excessive. Any tips on how to speed up the computation without changing hardware (i.e., no cloud)? For example, if there were an option to compute the statistics into a dataframe rather than a protocol buffer, one could use Modin to speed up the pandas computations with minimal code changes (see the sketch below).
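To make the Modin idea concrete, here is a rough sketch under the assumption that per-file statistics were computed with plain pandas operations; `compute_column_stats` and the file path are hypothetical placeholders for illustration, not this project's actual API:

```python
# Sketch of the Modin drop-in idea: Modin parallelizes pandas operations across
# CPU cores (using Ray or Dask under the hood), ideally with a one-line import
# change. Everything below is illustrative, not this project's implementation.
# import pandas as pd
import modin.pandas as pd  # drop-in replacement for pandas

def compute_column_stats(csv_path):
    # Hypothetical per-file statistics, loosely mirroring a basic stats pass.
    df = pd.read_csv(csv_path)
    return {
        "numeric": df.describe(),      # count / mean / std / quantiles
        "missing": df.isnull().sum(),  # per-column missing-value counts
        "cardinality": df.nunique(),   # distinct values per column
    }

stats = compute_column_stats("example_file.csv")  # hypothetical file path
```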
P.S. If I use a GPU container instead, with
$ docker run --runtime=nvidia -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
should I see a speedup?