Skip to content
This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

Extension for our production usage #18

Closed
vmarkovtsev opened this issue Oct 9, 2019 · 7 comments
Closed

Extension for our production usage #18

vmarkovtsev opened this issue Oct 9, 2019 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@vmarkovtsev
Copy link

vmarkovtsev commented Oct 9, 2019

Hi there, thank you so much for this great lib!

We want to use it in our production, and need to implement these critical features:

  1. Saving the model to a Python file object.
  2. Fix the packaging. It is a bad practice to include the precompiled Cython code because CPython API changes in a backwards-incompatible way from time to time (I learned it the hard way), and yttm.cpp will compile to a malfunctioning extension or will not compile at all. It has to be regenerated with Cython every time setup.py is called, no shortcuts.
  3. Switch to the binary model format. The current model format is text - we want a more compact representation which is also compression-friendly (columnar, not interleaved rules).
  4. Pure Go model loading and encoding implementation.

My questions are:

  • Are you OK with features (1), (2), and (3) or we should hard-fork?
  • If you can help make (1), (2), (3), please tell us! We will devote all our efforts to (4).
  • Shall we contribute the Go lib to this repo or create a new one?
@xbelonogov
Copy link
Contributor

Hi,
I am glad that you considered using our lib in production.
(1) What do you mean by saving model as python file object? Is it about saving model using pickle?
(2) Could you check this one? Cython documentation strongly recommend distribute .c files, so we followed that.
(3) Current model size almost always doesn't exceed 1MB. We can compress it. But is it really necessary?

@vmarkovtsev
Copy link
Author

vmarkovtsev commented Oct 10, 2019

(1) https://docs.python.org/3/c-api/file.html
(2) Probably Cython maintainers did not have weird CPython API mismatch problems in production. We had. I can list 2-3 other projects where there are issues regarding the inability to install the package straight away: e.g. SheffieldML/GPy#649 (comment) or snowballstem/pystemmer#18
I am sure that if I dig this crap properly I will find hundreds of suffered users because of that evil practice.
(3) This is about saving in binary and slicing the rules by column to be compression-friendly. The compression will be on our side because we will package in ASDF - ofc separately :)
(4) We will create src-d/go-YouTokenToMe.

@yutkin
Copy link
Contributor

yutkin commented Oct 10, 2019

I want to add that (2) requires a C++ compiler toolchain in OS. We use minified Docker images for fast CI pipelines and deployments to our Cloud. This images don't have pre-installed GCC. Moreover, Windows users often don't have compilers at their system at all. To minimize problems with different CPython API we use ManyLinux for building compatible wheels, therefore YouTokenToMe maintainers are responsible for all build related issues.

@yutkin yutkin added enhancement New feature or request help wanted Extra attention is needed labels Oct 10, 2019
@yutkin
Copy link
Contributor

yutkin commented Oct 10, 2019

(4) It would be better to make a separate repo for Golang wrapper. We'll add a link to it in README.

@vmarkovtsev
Copy link
Author

@yutkin

We are using Docker, too. Our images do not have preinstalled GCC (or Go compiler, whatever), too. I wonder if you know about multiple FROM? Example: https://github.com/src-d/identity-matching/blob/master/Dockerfile#L10

To minimize problems with different CPython API we use ManyLinux for building compatible wheels

You don't seem to ship different yttm.cpp for different CPython API versions in this repo. 3.7 broke everybody last year. This is not the first time, this is not the last time. Those Windows users who you mention will be the first to report broken installations, and they will show no mercy and will attach screenshots instead of logs 😄

ManyLinux will not help with Python API compatibility. ManyLinux means that your native extension is supposed to run on any Linux (to be precise, those with some ancient libc versions and above).

@vmarkovtsev
Copy link
Author

vmarkovtsev commented Oct 10, 2019

Actually, if you build wheels for all possible OS-Python combinations, then the build from source can be arbitrary complex (nobody is going to use it anyway, right?).

@vmarkovtsev
Copy link
Author

We are hard-forking. Thanks for answering the questions and all the help!

The fork will live at src-d/YouTokenToMe and the Python pkg name will be youtokentome-srcd. I'll leave the PRs open just in case, feel free to close.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants