Minimal Pyjanitor installation #826

Open
GuiMarthe opened this issue Apr 7, 2021 · 7 comments
Labels: available for hacking, help wanted, high priority, infrastructure

Comments

@GuiMarthe

Hey folks, I recently looked at the dependency list for pyjanitor, and it seems too large for a production environment.
I saw in the .requirements directory that there are a few sets of dependencies for different use cases, but I can't find how to limit the scope at installation time.

Is the following possible?
pip install pyjanitor[base]

Just in case, this is the package's dependency list as generated by pipdeptree.

pyjanitor==0.20.10
  black==20.8b1
    appdirs==1.4.4
    click==7.1.2
    mypy-extensions==0.4.3
    pathspec==0.8.1
    regex==2021.4.4
    toml==0.10.2
    typed-ast==1.4.2
    typing-extensions==3.7.4.3
  hypothesis==6.8.9
    attrs==20.3.0
    sortedcontainers==2.3.0
  interrogate==1.3.2
    attrs==20.3.0
    click==7.1.2
    colorama==0.4.4
    py==1.10.0
    tabulate==0.8.9
    toml==0.10.2
  ipykernel==5.5.3
    ipython==7.21.0
      backcall==0.2.0
      decorator==5.0.5
      jedi==0.18.0
        parso==0.8.2
      pexpect==4.8.0
        ptyprocess==0.7.0
      pickleshare==0.7.5
      prompt-toolkit==3.0.18
        wcwidth==0.2.5
      Pygments==2.8.1
      setuptools==54.2.0
      traitlets==5.0.5
        ipython-genutils==0.2.0
    jupyter-client==6.1.13
      jupyter-core==4.7.1
        traitlets==5.0.5
          ipython-genutils==0.2.0
      nest-asyncio==1.5.1
      python-dateutil==2.8.0
        six==1.15.0
      pyzmq==22.0.3
      tornado==6.1
      traitlets==5.0.5
        ipython-genutils==0.2.0
    tornado==6.1
    traitlets==5.0.5
      ipython-genutils==0.2.0
  isort==5.8.0
  jupyter-client==6.1.13
    jupyter-core==4.7.1
      traitlets==5.0.5
        ipython-genutils==0.2.0
    nest-asyncio==1.5.1
    python-dateutil==2.8.0
      six==1.15.0
    pyzmq==22.0.3
    tornado==6.1
    traitlets==5.0.5
      ipython-genutils==0.2.0
  lxml==4.6.3
  natsort==7.1.1
  nbsphinx==0.8.2
    docutils==0.15.2
    Jinja2==2.11.3
      MarkupSafe==1.1.1
    nbconvert==6.0.7
      bleach==3.3.0
        packaging==20.9
          pyparsing==2.4.7
        six==1.15.0
        webencodings==0.5.1
      defusedxml==0.7.1
      entrypoints==0.3
      Jinja2==2.11.3
        MarkupSafe==1.1.1
      jupyter-core==4.7.1
        traitlets==5.0.5
          ipython-genutils==0.2.0
      jupyterlab-pygments==0.1.2
        Pygments==2.8.1
      mistune==0.8.4
      nbclient==0.5.3
        async-generator==1.10
        jupyter-client==6.1.13
          jupyter-core==4.7.1
            traitlets==5.0.5
              ipython-genutils==0.2.0
          nest-asyncio==1.5.1
          python-dateutil==2.8.0
            six==1.15.0
          pyzmq==22.0.3
          tornado==6.1
          traitlets==5.0.5
            ipython-genutils==0.2.0
        nbformat==5.1.3
          ipython-genutils==0.2.0
          jsonschema==3.2.0
            attrs==20.3.0
            importlib-metadata==3.10.0
              typing-extensions==3.7.4.3
              zipp==3.4.1
            pyrsistent==0.17.3
            setuptools==54.2.0
            six==1.15.0
          jupyter-core==4.7.1
            traitlets==5.0.5
              ipython-genutils==0.2.0
          traitlets==5.0.5
            ipython-genutils==0.2.0
        nest-asyncio==1.5.1
        traitlets==5.0.5
          ipython-genutils==0.2.0
      nbformat==5.1.3
        ipython-genutils==0.2.0
        jsonschema==3.2.0
          attrs==20.3.0
          importlib-metadata==3.10.0
            typing-extensions==3.7.4.3
            zipp==3.4.1
          pyrsistent==0.17.3
          setuptools==54.2.0
          six==1.15.0
        jupyter-core==4.7.1
          traitlets==5.0.5
            ipython-genutils==0.2.0
        traitlets==5.0.5
          ipython-genutils==0.2.0
      pandocfilters==1.4.3
      Pygments==2.8.1
      testpath==0.4.4
      traitlets==5.0.5
        ipython-genutils==0.2.0
    nbformat==5.1.3
      ipython-genutils==0.2.0
      jsonschema==3.2.0
        attrs==20.3.0
        importlib-metadata==3.10.0
          typing-extensions==3.7.4.3
          zipp==3.4.1
        pyrsistent==0.17.3
        setuptools==54.2.0
        six==1.15.0
      jupyter-core==4.7.1
        traitlets==5.0.5
          ipython-genutils==0.2.0
      traitlets==5.0.5
        ipython-genutils==0.2.0
    Sphinx==3.5.3
      alabaster==0.7.12
      Babel==2.9.0
        pytz==2020.5
      docutils==0.15.2
      imagesize==1.2.0
      Jinja2==2.11.3
        MarkupSafe==1.1.1
      packaging==20.9
        pyparsing==2.4.7
      Pygments==2.8.1
      requests==2.23.0
        certifi==2020.12.5
        chardet==3.0.4
        idna==2.10
        urllib3==1.25.11
      setuptools==54.2.0
      snowballstemmer==2.1.0
      sphinxcontrib-applehelp==1.0.2
      sphinxcontrib-devhelp==1.0.2
      sphinxcontrib-htmlhelp==1.0.3
      sphinxcontrib-jsmath==1.0.1
      sphinxcontrib-qthelp==1.0.3
      sphinxcontrib-serializinghtml==1.1.4
    traitlets==5.0.5
      ipython-genutils==0.2.0
  pandas-flavor==0.2.0
    pandas==1.1.3
      numpy==1.20.2
      python-dateutil==2.8.0
        six==1.15.0
      pytz==2020.5
    xarray==0.17.0
      numpy==1.20.2
      pandas==1.1.3
        numpy==1.20.2
        python-dateutil==2.8.0
          six==1.15.0
        pytz==2020.5
      setuptools==54.2.0
  pandas-vet==0.2.2
    attrs==20.3.0
    flake8==3.9.0
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
      mccabe==0.6.1
      pycodestyle==2.7.0
      pyflakes==2.3.1
  pre-commit==2.12.0
    cfgv==3.2.0
    identify==2.2.2
    importlib-metadata==3.10.0
      typing-extensions==3.7.4.3
      zipp==3.4.1
    nodeenv==1.5.0
    PyYAML==5.4.1
    toml==0.10.2
    virtualenv==20.4.3
      appdirs==1.4.4
      distlib==0.3.1
      filelock==3.0.12
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
      six==1.15.0
  pyspark==3.1.1
    py4j==0.10.9
  pytest==6.2.3
    attrs==20.3.0
    importlib-metadata==3.10.0
      typing-extensions==3.7.4.3
      zipp==3.4.1
    iniconfig==1.1.1
    packaging==20.9
      pyparsing==2.4.7
    pluggy==0.13.1
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
    py==1.10.0
    toml==0.10.2
  pytest-azurepipelines==0.8.0
    pytest==6.2.3
      attrs==20.3.0
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
      iniconfig==1.1.1
      packaging==20.9
        pyparsing==2.4.7
      pluggy==0.13.1
        importlib-metadata==3.10.0
          typing-extensions==3.7.4.3
          zipp==3.4.1
      py==1.10.0
      toml==0.10.2
  pytest-cov==2.11.1
    coverage==5.5
    pytest==6.2.3
      attrs==20.3.0
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
      iniconfig==1.1.1
      packaging==20.9
        pyparsing==2.4.7
      pluggy==0.13.1
        importlib-metadata==3.10.0
          typing-extensions==3.7.4.3
          zipp==3.4.1
      py==1.10.0
      toml==0.10.2
  scikit-learn==0.23.2
    joblib==0.13.2
    numpy==1.20.2
    scipy==1.6.0
      numpy==1.20.2
    threadpoolctl==2.1.0
  seaborn==0.11.1
    matplotlib==3.4.1
      cycler==0.10.0
        six==1.15.0
      kiwisolver==1.3.1
      numpy==1.20.2
      Pillow==8.0.0
      pyparsing==2.4.7
      python-dateutil==2.8.0
        six==1.15.0
    numpy==1.20.2
    pandas==1.1.3
      numpy==1.20.2
      python-dateutil==2.8.0
        six==1.15.0
      pytz==2020.5
    scipy==1.6.0
      numpy==1.20.2
  setuptools==54.2.0
  sphinxcontrib-fulltoc==1.2.0
  unyt==2.8.0
    numpy==1.20.2
    sympy==1.7.1
      mpmath==1.2.1
  xarray==0.17.0
    numpy==1.20.2
    pandas==1.1.3
      numpy==1.20.2
      python-dateutil==2.8.0
        six==1.15.0
      pytz==2020.5
    setuptools==54.2.0

If I'm doing anything wrong, please let me know!

@samukweku
Collaborator

Possible duplicate of #793.

@ericmjl
Member

ericmjl commented Apr 8, 2021

Hello @GuiMarthe! Thanks for chiming in. Yes, there are a lot of dependencies for pyjanitor. Dependency sprawl is something I haven't managed well in the past, and I'm still learning here.

Looks like it might be good for us to split out at least the pip-installable package using the optional dependencies convention that I have just learned from googling on SO. The conda package will have to wait a bit though.

Would you like to contribute a PR, if you've got the bandwidth? Meanwhile, I'll tag this issue and #793 as higher-priority issues.

@ericmjl added the available for hacking, help wanted, high priority, and infrastructure labels on Apr 8, 2021
@GuiMarthe
Author

Hey @ericmjl, thank you for the prompt response! I think I can help with this issue. Even though I'm a beginner, I know how friendly pyjanitor is 😄.

Now, I'd still need to dig into the code, but from the pipdeptree list I've posted above, I can see the following groups of dependencies:

  • development: pytest, hypothesis, sphinx, and a few others
  • base/minimal: pandas-flavor, numpy, and pandas itself
  • notebook: ipython, jupyter, seaborn (?), etc.
  • ML: scikit-learn
  • Spark: pyspark, etc.

Perhaps the ML category could be merged with the base category. Does that make sense?

@GuiMarthe
Author

Also, should we add messages and/or warnings whenever the user tries to use a method that depends on a non-installed dependency?

@ericmjl
Member

ericmjl commented Apr 12, 2021

> Also, should we add messages and/or warnings whenever the user tries to use a method that depends on a non-installed dependency?

@GuiMarthe yesss! That sounds like a great idea.
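
One way the message idea could look, as a minimal sketch only: the helper name `import_optional` and the wording of the message are hypothetical and not part of pyjanitor's API. It wraps the import and, if the package is missing, re-raises with a hint naming the extra to install.

```python
import importlib


def import_optional(module_name, extra):
    """Import an optional dependency, or fail with an install hint.

    Hypothetical helper for illustration; `extra` is the name of the
    pip extra that is assumed to provide `module_name`.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Re-raise with a message that tells the user how to fix it.
        raise ImportError(
            f"{module_name!r} is required for this function but is not "
            f"installed. Try: pip install pyjanitor[{extra}]"
        ) from err
```

A function that needs Spark could then call `import_optional("pyspark", "spark")` at the top of its body, so the error appears only when that function is actually used.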

> Perhaps the ML category could be merged with the base category. Does that make sense?

That makes sense too.

In terms of the different ways we can group things, would you be kind enough to do the following?

  • pyjanitor[all]
  • pyjanitor[base] - includes base dependencies only
  • pyjanitor[bio] - includes base + bio packages
  • pyjanitor[chem] - includes base + chemistry packages
  • pyjanitor[eng] - includes base + engineering packages
  • pyjanitor[spark] - includes base + spark dependencies

Doing so would mirror the structure in the .requirements directory (https://github.com/pyjanitor-devs/pyjanitor/tree/dev/.requirements); the more consistently we follow one pattern, the easier the project is to maintain later on. That said, I think we can omit the test submodule because that's generally for development purposes.
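
The grouping above could be sketched roughly as follows. This is an illustration under assumptions, not the project's actual setup code: the package lists and their assignment to groups are guesses (the real lists would come from the files in the .requirements directory), and `build_extras` is a hypothetical helper.

```python
# Hypothetical package lists, for illustration only. The real contents
# would be read from the .requirements directory.
BASE = ["pandas", "pandas-flavor", "numpy"]
GROUPS = {
    "bio": ["biopython"],  # assumed
    "chem": ["rdkit"],     # assumed
    "eng": ["unyt"],       # assumed
    "spark": ["pyspark"],  # assumed
}


def build_extras(base, groups):
    """Return an extras_require-style dict where every group includes base."""
    extras = {"base": sorted(set(base))}
    for name, packages in groups.items():
        extras[name] = sorted(set(base) | set(packages))
    # "all" is the union of every group; each group already contains base.
    extras["all"] = sorted(set().union(*extras.values()))
    return extras


EXTRAS = build_extras(BASE, GROUPS)
```

In a setuptools-based `setup.py`, a dict like `EXTRAS` would be passed as `setup(..., extras_require=EXTRAS)`, which is what makes `pip install pyjanitor[spark]` resolve. If the base set were made the default install instead, it would go in `install_requires` and the `base` key could be dropped.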

I'm looking forward to reviewing the PR! I will be getting my vaccine shot this week, so I might be KO'd for a day or two (depending on whether my immune system kicks off in a big or small way), but I should be able to come back to it later.

@hectormz
Collaborator

Would pyjanitor[base] be the default and require no actual [base] specification?

@ericmjl
Member

ericmjl commented May 9, 2021

Actually, that sounds like a good idea, @hectormz!

@GuiMarthe, can I check in: do you have the bandwidth to handle this one? I ran into a busy patch myself and have dropped the ball here, and that will probably continue until the end of the week.
