Develop a better system for distributing pickled models and other extra data #20

Closed
sloria opened this Issue Sep 14, 2013 · 4 comments


sloria commented Sep 14, 2013

Currently, @syllog1sm's trontagger.pickle file is distributed using GitHub's Releases feature, which allows binary files to be attached to a release.

While this is fine in the short term, there is a 5MB limit on attached files, which will likely be too small in the long term.

I am not sure of the best way to do this and I welcome suggestions.

Also, the process of installing the pickled models is very manual, i.e. saving the files to the TextBlob installation path. It would be nice to automate this in some way, possibly through a downloader module, similar to NLTK.

>>> from text.downloader import download
>>> download("trontagger")
Successfully downloaded trontagger.pickle
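
A downloader along these lines could be sketched as follows. This is only an illustration of the idea: the `MODEL_URLS` registry, its placeholder URL, and the `dest_dir` and `urls` parameters are assumptions, and none of them exist in TextBlob.

```python
import os
import posixpath
import shutil
import urllib.parse
import urllib.request

# Hypothetical registry mapping model names to download URLs.
# The URL below is a placeholder, not a real release asset.
MODEL_URLS = {
    "trontagger": "https://example.com/models/trontagger.pickle",
}


def download(name, dest_dir, urls=None):
    """Fetch the model registered under `name` and save it in `dest_dir`.

    Returns the path of the saved file.
    """
    if urls is None:
        urls = MODEL_URLS
    if name not in urls:
        raise ValueError("Unknown model: %r" % name)
    url = urls[name]
    # Derive the file name from the path component of the URL.
    filename = posixpath.basename(urllib.parse.urlparse(url).path)
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, filename)
    # Stream to disk so that large models are never held fully in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    print("Successfully downloaded %s" % filename)
    return dest_path
```

Passing the destination directory explicitly keeps the sketch independent of where TextBlob happens to be installed; a real downloader would presumably default to the package's own data directory.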

syllog1sm commented Sep 14, 2013

Here's how I see the desiderata:

  1. Host should provide curl/wget-able URLs. We want to be able to automate the downloads.
  2. Host should be free.
  3. Host should allow downloads of at least 50 MB, and ideally 100 MB.
  4. Host should be stable and long-term. We want to know that URLs will persist, even if the project stops being actively maintained.

sloria commented Sep 15, 2013

Yup, I agree with all of the above.


sloria commented Sep 18, 2013

Why not just distribute the models on PyPI? The PerceptronTagger could be distributed as its own package. This would solve a number of problems:

  1. No more manual downloading. Installing the tagger becomes a single command:

$ pip install textblob-aptagger

     after which you can write:

from text.blob import TextBlob
from textblob_aptagger import PerceptronTagger
blob = TextBlob("some text", pos_tagger=PerceptronTagger())

  2. The extensions could be developed independently of TextBlob core.

  3. No need to host binary files on GitHub. The binary files would be hosted on PyPI. It is at the package maintainer's discretion whether to put the package on GitHub.
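
With the PyPI approach the pickled model ships inside the installed package, so the tagger can locate it relative to its own module instead of asking the user to copy files around. A minimal sketch, assuming the model file sits next to the module that loads it; `load_model`, its parameters, and the default file name are hypothetical and not taken from textblob-aptagger:

```python
import os
import pickle


def load_model(filename="trontagger.pickle", package_dir=None):
    """Unpickle a model file stored alongside the package's modules."""
    if package_dir is None:
        # Directory of the installed package that contains this module.
        package_dir = os.path.dirname(os.path.abspath(__file__))
    path = os.path.join(package_dir, filename)
    if not os.path.exists(path):
        raise IOError("Model file not found: %s" % path)
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)
```

For the file to be installed in the first place, the package's setup.py would also need to list it, for example through the `package_data` argument of setuptools.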

I've begun restructuring TextBlob to make it amenable to developing extensions (see #23 ).


sloria commented Sep 18, 2013

@syllog1sm : I've ported your PerceptronTagger to an experimental extension here: https://github.com/sloria/textblob-aptagger. You own the copyright to the code; if you'd like to be a maintainer, I can add you as a collaborator on the repo.

sloria closed this Sep 24, 2013
