Q-Q plot file size scales linearly with data size #8582

Open
adamjstewart opened this issue Dec 16, 2022 · 3 comments

@adamjstewart

Describe the bug

The file size of saved Q-Q plots appears to scale as $\mathcal{O}(N)$ for data of size $N$.

Code Sample

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot_2samples

for size in [1e2, 1e3, 1e4, 1e5, 1e6]:
    size = int(size)
    x = np.random.normal(loc=8.5, scale=2.5, size=size)
    y = np.random.normal(loc=8.0, scale=3.0, size=size)
    pp_x = sm.ProbPlot(x)
    pp_y = sm.ProbPlot(y)
    qqplot_2samples(pp_x, pp_y)
    plt.savefig(f"test_{size}.svg")

$ ls -lh test_*.svg
-rw-r-----  1 ajstewart  staff    33K Dec 16 14:19 test_100.svg
-rw-r-----  1 ajstewart  staff   128K Dec 16 14:19 test_1000.svg
-rw-r-----  1 ajstewart  staff   1.0M Dec 16 14:19 test_10000.svg
-rw-r-----  1 ajstewart  staff    10M Dec 16 14:19 test_100000.svg
-rw-r-----  1 ajstewart  staff   102M Dec 16 14:20 test_1000000.svg
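
As a rough check of the linear scaling, the listing above can be turned into a marginal cost per plotted point. This is only a sketch: it assumes `ls -lh` reports binary units (K = 1024 bytes, M = 1024² bytes), and the listed sizes are rounded, so the results are approximate.

```python
# Sizes copied from the `ls -lh` listing above.
# Assumption: K and M are binary units; the values are rounded by `ls`.
KIB, MIB = 1024, 1024 ** 2
sizes = {
    100: 33 * KIB,
    1_000: 128 * KIB,
    10_000: 1.0 * MIB,
    100_000: 10 * MIB,
    1_000_000: 102 * MIB,
}

points = sorted(sizes)
pairs = list(zip(points, points[1:]))

# Extra bytes written per extra data point between consecutive sample sizes.
marginal = [(sizes[b] - sizes[a]) / (b - a) for a, b in pairs]

for (a, b), m in zip(pairs, marginal):
    print(f"{a:>9,} -> {b:>9,} points: ~{m:.0f} bytes per extra point")
```

Under these assumptions, each step works out to roughly 100 to 110 bytes per additional point, which is consistent with one SVG marker being written per observation.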

Note: the same behavior can be reproduced with qqplot or qqplot_2samples.

Expected Output

I would expect each plot to contain 100 points (1 for each quantile), but it seems to contain every single point in the dataset. This results in files that are too large to even render, making the plot rather useless for large datasets. Is this the intended behavior? I think it would be more useful to plot 100 points, or at least have an option to do so.

Versions

INSTALLED VERSIONS

Python: 3.10.8.final.0
OS: Darwin 21.6.0 Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:56 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T6000 arm64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

statsmodels

Installed: 0.13.5 (/Users/ajstewart/spack/opt/spack/darwin-monterey-m1/apple-clang-14.0.0/py-statsmodels-0.13.5-adxlc7oy3ktl4kh5vovnox6p4xfo42mu/lib/python3.10/site-packages/statsmodels)

Required Dependencies

cython: 0.29.32 (/Users/ajstewart/spack/opt/spack/darwin-monterey-m1/apple-clang-14.0.0/py-cython-0.29.32-ivsv2b4ous7evv2ubjgvg55kfhmipfha/lib/python3.10/site-packages/Cython)
numpy: 1.23.4 (/Users/ajstewart/spack/opt/spack/darwin-monterey-m1/apple-clang-14.0.0/py-numpy-1.23.4-lrwjrgphd7votelcqhtiofnwjc66qeu7/lib/python3.10/site-packages/numpy)
scipy: 1.9.3 (/Users/ajstewart/spack/opt/spack/darwin-monterey-m1/apple-clang-14.0.0/py-scipy-1.9.3-awasw63634ccwei2d4yzqawpemumjce3/lib/python3.10/site-packages/scipy)
pandas: 1.5.1 (/Users/ajstewart/spack/opt/spack/darwin-monterey-m1/apple-clang-14.0.0/py-pandas-1.5.1-ig3wajvrqdbhneo6hrxjjtqblcxjbm7b/lib/python3.10/site-packages/pandas)
dateutil: 2.8.2 (/Users/ajstewart/spack/opt/spack/darwin-monterey-m1/apple-clang-14.0.0/py-python-dateutil-2.8.2-6jntibflip4pxdx4zcobobgq6uujtrth/lib/python3.10/site-packages/dateutil)
patsy: 0.5.2 (/Users/ajstewart/spack/opt/spack/darwin-monterey-m1/apple-clang-14.0.0/py-patsy-0.5.2-abnyxte3so75czuftdj4aqc2c34hri2f/lib/python3.10/site-packages/patsy)

Optional Dependencies

matplotlib: 3.6.2 (/Users/ajstewart/.spack/.spack-env/view/lib/python3.10/site-packages/matplotlib)
backend: MacOSX
cvxopt: Not installed
joblib: 1.2.0 (/Users/ajstewart/.spack/.spack-env/view/lib/python3.10/site-packages/joblib)

Developer Tools

IPython: 8.5.0 (/Users/ajstewart/.spack/.spack-env/view/lib/python3.10/site-packages/IPython)
jinja2: 3.1.2 (/Users/ajstewart/.spack/.spack-env/view/lib/python3.10/site-packages/jinja2)
sphinx: 5.3.0 (/Users/ajstewart/.spack/.spack-env/view/lib/python3.10/site-packages/sphinx)
pygments: 2.13.0 (/Users/ajstewart/.spack/.spack-env/view/lib/python3.10/site-packages/pygments)
pytest: 7.1.3 (/Users/ajstewart/.spack/.spack-env/view/lib/python3.10/site-packages/pytest)
virtualenv: Not installed

@pkaf

pkaf commented Jan 17, 2023

I think we can work around it with:

perc = lambda x: np.percentile(x, np.arange(101))
qqplot_2samples(perc(pp_x.data), perc(pp_y.data))

I'm happy to patch this in as an option if that is the suggested route.
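
For reference, here is a self-contained sketch of the same percentile reduction using only NumPy. The helper name `percentile_summary` is hypothetical and is not part of statsmodels; the plotting call itself is left out so the snippet runs without matplotlib.

```python
import numpy as np


def percentile_summary(x, n_quantiles=100):
    """Reduce a sample to n_quantiles + 1 evenly spaced percentiles (0 to 100).

    Hypothetical helper for illustration only, not a statsmodels function.
    """
    q = np.linspace(0, 100, n_quantiles + 1)
    return np.percentile(x, q)


rng = np.random.default_rng(0)
x = rng.normal(loc=8.5, scale=2.5, size=100_000)
y = rng.normal(loc=8.0, scale=3.0, size=100_000)

x_q = percentile_summary(x)
y_q = percentile_summary(y)
print(x_q.shape, y_q.shape)  # (101,) (101,)

# The two reduced arrays can then be passed on in place of the raw data,
# e.g. qqplot_2samples(x_q, y_q) as in the snippet above.
```

Note that `np.arange(101)` yields 101 values (the 0th through 100th percentiles), so the endpoints are the sample minimum and maximum.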

@adamjstewart
Author

I think that's a good workaround, but it should probably happen by default when len(x) > 100.

@bashtage
Member

This would need an enhancement that enables some form of subsampling when a flag is set. One issue with using percentiles is that the plotted values may not be actual data points, since some percentile estimation methods interpolate.
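
One way to avoid the interpolation concern is to pick order statistics at evenly spaced ranks, so that every plotted value is an observed data point. The following is a minimal sketch under that assumption; `order_statistic_subsample` is a hypothetical name, not an existing or planned statsmodels function.

```python
import random


def order_statistic_subsample(data, n_points=100):
    """Return at most n_points sorted values, each an actual observation.

    Hypothetical helper for illustration only; assumes n_points >= 2.
    """
    ordered = sorted(data)
    n = len(ordered)
    if n <= n_points:
        # Nothing to thin out: keep every observation.
        return ordered
    # Evenly spaced ranks from 0 to n - 1, rounded to integer indices.
    step = (n - 1) / (n_points - 1)
    return [ordered[round(i * step)] for i in range(n_points)]


random.seed(0)
sample = [random.gauss(8.5, 2.5) for _ in range(10_000)]
subsample = order_statistic_subsample(sample, n_points=100)
print(len(subsample))  # 100
```

Because the values are selected rather than estimated, the minimum and maximum of the sample are always retained. With NumPy 1.22 or later, passing `method="nearest"` to `np.percentile` is another way to get non-interpolated values.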
