Hi,
I like this project a lot, and thanks for releasing it! I see the potential to save a lot of time when I first receive new datasets. However, I'm running into performance issues.
Environment: the notebook runs in a Docker container based on https://hub.docker.com/r/tensorflow/tensorflow/.
Hardware:
- GPUs: 16x NVIDIA® Tesla V100
- GPU memory: 512 GB total
- CPU: dual Intel Xeon Platinum 8168, 2.7 GHz, 24 cores each
- System memory: 1.5 TB
It takes me more than 8 hours (30,900 s) to compute the statistics for a dataset of ~100 files, with file sizes ranging from 0.5 MB to 300 MB (median 70 MB). The Docker container does introduce some overhead, but given the hardware specs above, this seems excessive. Any tips on how to speed up the computation without changing hardware (i.e., no cloud)? For example, if there were an option to compute the statistics into a dataframe rather than a protocol buffer, one could use Modin to speed up the pandas computations with minimal code changes (see the sketch below).
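To make the Modin idea concrete, here is a rough sketch under the assumption that per-file statistics were computed with plain pandas operations; `compute_column_stats` and the file path are hypothetical placeholders for illustration, not this project's actual API:

```python
# Sketch of the Modin drop-in idea: Modin parallelizes pandas operations across
# CPU cores (using Ray or Dask under the hood), ideally with a one-line import
# change. Everything below is illustrative, not this project's implementation.
# import pandas as pd
import modin.pandas as pd  # drop-in replacement for pandas

def compute_column_stats(csv_path):
    # Hypothetical per-file statistics, loosely mirroring a basic stats pass.
    df = pd.read_csv(csv_path)
    return {
        "numeric": df.describe(),      # count / mean / std / quantiles
        "missing": df.isnull().sum(),  # per-column missing-value counts
        "cardinality": df.nunique(),   # distinct values per column
    }

stats = compute_column_stats("example_file.csv")  # hypothetical file path
```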
P.S. If I use a GPU container instead, with
$ docker run --runtime=nvidia -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
should I see a speedup?