
Question on the list of dependencies #93

Closed
adrinjalali opened this issue Feb 23, 2021 · 6 comments · Fixed by #275
Labels: installation

@adrinjalali

Right now, install_requires includes packages that are not necessarily hard dependencies, depending on what the user wants to do.

For instance, a user who wants a minimal machine learning environment built on frameworks other than tensorflow would still have tensorflow and many other packages pulled into their environment by this package.

I was wondering if you'd be open to the idea of making dependencies soft wherever possible: document which dependencies are optional for which parts of the library, and raise an informative error when a user calls a function that needs a library they haven't installed. A sketch of such a guard follows the package list below.

Here is the list of packages pip pulls into a fresh environment, which is admittedly quite long:

Installing collected packages: urllib3, six, pyasn1, ipython-genutils, idna, chardet, traitlets, rsa, requests, pyrsistent, pyparsing, pycparser, pyasn1-modules, protobuf, oauthlib, cachetools, attrs, wcwidth, typing-extensions, tornado, requests-oauthlib, pyzmq, pytz, python-dateutil, ptyprocess, parso, packaging, mypy-extensions, jupyter-core, jsonschema, grpcio, googleapis-common-protos, google-auth, cffi, werkzeug, webencodings, typing-inspect, tensorboard-plugin-wit, pyyaml, pygments, prompt-toolkit, pickleshare, pexpect, pbr, numpy, nest-asyncio, nbformat, MarkupSafe, markdown, jupyter-client, jedi, httplib2, grpcio-gcp, google-crc32c, google-auth-oauthlib, google-api-core, docopt, decorator, backcall, async-generator, absl-py, wrapt, testpath, termcolor, tensorflow-estimator, tensorboard, pymongo, pydot, pyarrow, proto-plus, pandocfilters, opt-einsum, oauth2client, nbclient, mock, mistune, libcst, keras-preprocessing, jupyterlab-pygments, jinja2, ipython, hdfs, h5py, grpc-google-iam-v1, google-resumable-media, google-pasta, google-cloud-core, gast, future, fasteners, fastavro, entrypoints, dill, defusedxml, crcmod, bleach, avro-python3, astunparse, uritemplate, terminado, tensorflow, Send2Trash, prometheus-client, nbconvert, ipykernel, google-cloud-vision, google-cloud-videointelligence, google-cloud-spanner, google-cloud-pubsub, google-cloud-language, google-cloud-dlp, google-cloud-datastore, google-cloud-build, google-cloud-bigtable, google-cloud-bigquery, google-auth-httplib2, google-apitools, argon2-cffi, apache-beam, tensorflow-serving-api, tensorflow-metadata, pandas, notebook, google-api-python-client, widgetsnbextension, tfx-bsl, jupyterlab-widgets, tensorflow-transform, scipy, pillow, kiwisolver, joblib, ipywidgets, cycler, tensorflow-model-analysis, tensorflow-data-validation, semantic-version, ml-metadata, matplotlib, model-card-toolkit
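
A minimal sketch of the soft-dependency pattern described above. This is hypothetical illustration only, not model-card-toolkit's actual API: the function name and its arguments are made up, and only the import-guard shape is the point.

```python
# Hypothetical sketch of a soft dependency guard -- not model-card-toolkit's
# actual API. tensorflow_model_analysis is imported only when the feature
# that needs it is called, with an actionable error otherwise.
def render_tfma_graphics(model_card, eval_result):
    try:
        import tensorflow_model_analysis as tfma  # optional dependency
    except ImportError as exc:
        raise ImportError(
            "tensorflow-model-analysis is required for TFMA graphics. "
            "Install it with: pip install tensorflow-model-analysis"
        ) from exc
    # ... build graphics from eval_result using tfma here ...
```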

@amadeuspzs

Some metrics to support this issue:

macOS 10.15.7
1.4 GHz Quad-Core Intel Core i5
Python 3.8.2
pip 21.2.4

```
$ pip install model_card_toolkit --no-cache-dir --use-deprecated=legacy-resolver
70.43s user 31.26s system 18% cpu 9:13.00 total
```

9 minutes is a very long time to install a package.

The size of site-packages is 1,545 MB, which seems excessive for this tool.

@vishwanath-prudhivi

Hi,

We are currently experimenting with model cards, using them as summary reports for sklearn models trained on Vertex AI on GCP. We wanted to understand when we could expect a refined package dependency list: currently our training jobs cannot get past environment setup (because of Model Card Toolkit's dependencies), with pip spending a lot of time determining the right versions to install.

Here is the package list from the setup.py file we use to create the Vertex training package (custom code option):

```python
REQUIRED_PACKAGES = [
    'pandas-gbq>=0.10.0',
    'pandas==1.1.3',
    'google_compute_engine',
    'google-cloud-bigquery==1.24.0',
    'google-cloud-core>=1.0.0',
    'google-cloud-logging',
    'google-cloud-storage>=1.16.0',
    'parmap',
    'pyarrow==0.16.0',
    'google-api-core>=1.11.0',
    'google-api-python-client>=1.7.8',
    'google-cloud-pubsub>=0.41.0',
    'cython',
    'gcsfs',
    'sklearn',
    'google-cloud-profiler',
    # 'imblearn==0.8.0',
    # 'autoimpute==0.12.2',
    'imblearn',
    'autoimpute',
    'optbinning',
    'model_card_toolkit==1.1.0',
]
```

This runs on top of the europe-docker.pkg.dev/vertex-ai/training/tf-cpu.2-6:latest container (the TensorFlow 2.6 ML framework version).

Scanning the training job logs, we see many messages like the following:

INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking

Installing the Model Card Toolkit library in a Jupyter notebook is a one-time step and works fine, however.
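
Pip's hint in the log message above refers to constraint files (pip's `-c` option). A minimal sketch of supplying stricter constraints; the specific pins below are illustrative assumptions, not tested versions:

```
# constraints.txt -- pin the packages pip backtracks over the most
tensorflow==2.6.0
numpy==1.19.5

$ pip3 install -c constraints.txt model_card_toolkit==1.1.0
```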

@amadeuspzs

@vishwanath-prudhivi are you passing the --use-deprecated=legacy-resolver flag to pip, as per https://github.com/tensorflow/model-card-toolkit/blob/master/model_card_toolkit/documentation/guide/install.md?

@vishwanath-prudhivi

@amadeuspzs thanks for the suggestion. When we build a training package, we call:

python3 setup.py sdist --formats=gztar

After the job is submitted to Vertex AI, the following command is invoked by default:

pip3 install --user <training_package_name>.tar.gz

Any suggestions on how to add the extra flag here (as mentioned in the previous suggestion) would be helpful; one general mechanism is sketched after this comment.

Regards
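
A general pip mechanism that may help here (this is documented pip behavior, not Vertex-specific advice, so whether you can apply it depends on how much of the environment you control): pip reads any of its command-line options from PIP_* environment variables, so the legacy-resolver flag can be enabled without editing the install command itself.

```
# Equivalent to passing --use-deprecated=legacy-resolver to every pip run
# in this environment, e.g. set in a custom container image.
export PIP_USE_DEPRECATED=legacy-resolver
pip3 install --user <training_package_name>.tar.gz
```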

@codesue linked a pull request Nov 9, 2022 that will close this issue
@codesue
Collaborator

codesue commented Nov 9, 2022

Hi all, we're in the process of removing the tfx dependency, which will reduce the number of packages installed considerably. I linked the pull request to this issue.

@adrinjalali, we opened discussion topic #228 to brainstorm how to loosen the dependencies and what the new dependency list should look like. I'd love to learn more about your use case and what an ideal workflow would look like for you. 😄

@codesue linked a pull request Nov 17, 2022 that will close this issue
@codesue removed a link to a pull request Nov 17, 2022
@codesue reopened this Dec 3, 2022
@codesue removed a link to a pull request Dec 6, 2022
@codesue
Collaborator

codesue commented Dec 7, 2022

After removing tfx, the required packages on the main branch are:

  • absl-py>=0.9,<1.1: primarily used for testing and building docs; used for logging in one instance
  • jinja2>=3.1,<3.2: used for rendering the model card
  • matplotlib>=3.2.0,<4: used for generating graphics from tfma and tfdv objects
  • jsonschema>=3.2.0,<4: used for validating JSON schema
  • tensorflow-data-validation>=1.5.0,<2.0.0: used for reading and parsing tf stats
  • tensorflow-model-analysis>=0.36.0,<0.42.0: used for reading and parsing tf metrics
  • tensorflow-metadata>=1.5.0,<2.0.0: contains the stats proto definition
  • ml-metadata>=1.5.0,<2.0.0: used for querying MLMD
  • dataclasses;python_version<"3.7": used for Python 3.6, which is no longer supported

I did some testing, and I think the minimal requirements to support core functionality with little refactoring are:

  • absl-py>=0.9,<1.1: primarily used for testing and building docs; used for logging in one instance
  • jinja2>=3.1,<3.2: used for rendering the model card
  • jsonschema>=3.2.0,<4: used for validating JSON schema
  • protobuf>=3.19.0,<3.20.0: used for building, serializing, and parsing protos -- previously installed as a transitive dependency

Here, core functionality means creating ModelCard and ModelCardToolkit objects, annotating graphics and performance metrics manually (rather than generating them from tfma/tfdv objects), and rendering and exporting the model card.
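
For reference, that core path looks roughly like this (a sketch based on the toolkit's documented quickstart; exact method names may differ between versions):

```python
import model_card_toolkit

# Initialize the toolkit with a directory for generated assets.
mct = model_card_toolkit.ModelCardToolkit('model_card_assets')

# Scaffold a ModelCard and annotate it manually --
# no tfx, tfma, or tfdv involved.
model_card = mct.scaffold_assets()
model_card.model_details.name = 'My Model'

# Persist the annotations, then render the card as HTML.
mct.update_model_card(model_card)
html = mct.export_format()
```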

Depending on how TensorFlow docs are built, it might be possible to move absl-py to an extra. If exporting a model card as a proto is made optional, protobuf could be made optional as well.

Model Card Toolkit is now a community-led open source project under the TFX Addons special interest group. (Learn more in this announcement.) The project now relies on the community for contributions, bug fixes, and documentation. This means the timeline for creating a basic model-card-toolkit package for core functionality and moving optional dependencies to extras depends on contributors from the community volunteering to implement and review these changes. We're in the process of updating the contributing guide to improve the contributor experience and lower the barriers to contributing. 😄
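
A sketch of what that split could look like in setup.py; the extra name and grouping are hypothetical, not a committed design:

```python
# Hypothetical setup.py fragment: core dependencies stay required,
# TF-ecosystem integrations move behind an extra.
from setuptools import setup

setup(
    name='model-card-toolkit',
    version='0.0.0',  # placeholder
    install_requires=[
        'absl-py>=0.9,<1.1',
        'jinja2>=3.1,<3.2',
        'jsonschema>=3.2.0,<4',
        'protobuf>=3.19.0,<3.20.0',
    ],
    extras_require={
        # installed via: pip install model-card-toolkit[tensorflow]
        'tensorflow': [
            'tensorflow-data-validation>=1.5.0,<2.0.0',
            'tensorflow-model-analysis>=0.36.0,<0.42.0',
            'tensorflow-metadata>=1.5.0,<2.0.0',
            'ml-metadata>=1.5.0,<2.0.0',
        ],
    },
)
```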

@codesue added the help wanted and contributions welcome labels Dec 7, 2022
@codesue self-assigned this Dec 8, 2022
@codesue removed their assignment Dec 16, 2022
@codesue self-assigned this May 11, 2023
@codesue added the work in progress label and removed the help wanted label May 11, 2023
@codesue added the installation label May 15, 2023
@codesue removed the contributions welcome and work in progress labels May 22, 2023