
separate version updates to their own microservice #842

Closed
beckermr opened this issue Feb 25, 2020 · 99 comments

Comments

@beckermr
Contributor

beckermr commented Feb 25, 2020

It is time to split out the version update cron job to its own service.

  • For now, we will reuse the LazyJson infrastructure to build out a new version update table stored in a new cf-graph-version repo. (I had mentioned putting it in the cf-graph-countyfair repo, but then we have to deal with git pull and push issues, which we can punt on using a separate repo.)
  • Eventually, this will be moved into dynamodb in another table.
  • The JSON model will be a list of version blobs per feedstock.
  • This list will be flattened by version string in the final dynamodb table.
  • We will run it at half past the hour with the hope that it finishes by the time the bot starts, but it doesn't have to.
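As a concrete illustration, the per-feedstock "list of version blobs" might look something like the fragment below (field names are hypothetical; the issue does not fix a schema):

```json
{
  "numpy": [
    {"version": "1.18.1", "source": "PyPI"},
    {"version": "1.18.0", "source": "PyPI"}
  ]
}
```

In the eventual DynamoDB table this list would be flattened, keyed by version string.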

cc @CJ-Wright @mariusvniekerk @scopatz

@beckermr beckermr pinned this issue Feb 25, 2020
@beckermr
Contributor Author

The entire script is here: https://github.com/regro/cf-scripts/blob/master/conda_forge_tick/update_upstream_versions.py

The key is to move the data structures to another repo with a separate process.

@viniciusdc
Contributor

I think I can do it, but it will take some time, since I have to understand how the entire structure specified in that script operates. There are still a number of things I don't know, but of course I should be able to learn them.

@beckermr
Contributor Author

I would start simpler.

  1. Read the graph (clone cf-graph-countyfair, use --depth=1, cd to that dir)
  2. Load it (call gx = load_graph(), see the main function)
  3. call the update versions function in main
  4. Inside the update versions function, write the potentially new version for each package to some json in a dir

Once that works, post a PR with the code and then we can move on from there.
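A runnable sketch of the four steps, with the cf-scripts pieces stubbed out; in the real flow, load_graph comes from conda_forge_tick.utils and is run from inside a --depth=1 clone of cf-graph-countyfair, and the version lookup lives in update_upstream_versions.py. All names and data below are illustrative stand-ins.

```python
def load_graph():
    # Stand-in: the real graph is a networkx DiGraph with LazyJson payloads.
    return {"numpy": {}, "scipy": {}}

def get_latest_version(name):
    # Stand-in for the per-source version lookup (PyPI, CRAN, NPM, ...).
    return {"numpy": "1.18.1", "scipy": "1.4.1"}[name]

gx = load_graph()                                               # steps 1-2
new_versions = {name: get_latest_version(name) for name in gx}  # step 3
# Step 4 would write each entry to its own JSON file in an output dir.
print(new_versions)
```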

@viniciusdc
Contributor

Ok, I will try that.

@viniciusdc
Contributor

OK, now should I work on the first code you sent, or continue with the simple tasks, i.e. (3) "call the update versions function in main"?

@beckermr
Contributor Author

continue on 3!

@CJ-Wright
Member

@beckermr is it worthwhile to put this into the same Circle system? I think we can have multiple cron jobs offset from each other.

@beckermr
Contributor Author

Yep that is fine by me! Great idea!

@viniciusdc
Contributor

Well, I got something, but I think I missed something along the way...
```
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Task exception was never retrieved
```
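That RuntimeError comes from Python's multiprocessing (which dask uses under the hood) starting child processes on a platform that spawns rather than forks; the standard fix is to guard the entry point. A minimal sketch, unrelated to the cf-scripts code itself:

```python
import multiprocessing as mp

def square(x):
    return x * x

def main():
    # Work that spawns child processes must only run from the entry point.
    with mp.Pool(2) as pool:
        return pool.map(square, [1, 2, 3])

if __name__ == "__main__":
    # Without this guard, spawn-based platforms re-import the module in each
    # child process and raise the "bootstrapping phase" RuntimeError above.
    print(main())
```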

@viniciusdc
Contributor

viniciusdc commented Mar 19, 2020

In load_graph.py I call update_upstream_versions(gx) from .\update_upstream_versions, and I got the result above when running it.

@viniciusdc
Contributor

And I am not supposed to fork, as the message says, since I am running in a new dir for testing, right?

@beckermr
Contributor Author

Try turning off dask. If you set the right environment variables it will run serially.

@viniciusdc
Contributor

Hi, sorry for the late response. I am still on step 3, but I have a question: once gx = load_graph() is defined, should I write something into it before calling update_upstream_versions(gx)?

@beckermr
Contributor Author

Nope probably not.

@viniciusdc
Contributor

So, I think the error I am seeing is probably because I am not defining a source for update_upstream_versions. Now my question: if I create a new file called pck.json and use its dir as the source, will it be read? (pck.json has a package and its version listed, as in cf-scripts/requirements/test.)

@beckermr
Contributor Author

Idk sorry. Trace the control flow path through the main function in that file. That should help you figure it out.

@viniciusdc
Contributor

Yes, I think I discovered the issue with my attempts. Using the code you sent, I have to insert something into gx before using update_upstream_versions(gx), right? Because the attrs are LazyJson objects... right?

And if that's the case, how exactly can I input the new data? I think my doubts center on this: "Inside the update versions function, write the potentially new version for each package to some json in a dir". Just to check that I have it right: create a new JSON file (as told) and call update_upstream_versions(gx, 'json_dir')?

@viniciusdc
Contributor

Also, must the JSON file have some special structure? I saw some "verifications" of the sources' structure inside some classes.
Thanks for everything so far, and sorry for so much trouble.

@beckermr
Contributor Author

I’m not sure I follow. The goal is to write the new versions to a separate data structure outside of the graph.

@viniciusdc
Contributor

Yes, as you said here: "The JSON model will be a list of version blobs per feedstock." Right.
But returning to the subject of update_upstream_versions, I am a little bit confused by step (4). As I see in the script you sent (https://github.com/regro/cf-scripts/blob/master/conda_forge_tick/update_upstream_versions.py), update_upstream_versions asks for two inputs, gx and sources, and as you said our goal is to create a new file outside of gx.
Here enters my question: can I pass a dir to the JSON file (with package and version information) as a source?

@beckermr
Contributor Author

Ahhh. See what the code currently inputs for sources and follow that.

@viniciusdc
Contributor

Hi, if possible, can you explain a little bit about the main function,

```python
def main(args: Any = None) -> None:
    from .xonsh_utils import env

    debug = env.get("CONDA_FORGE_TICK_DEBUG", False)
    if debug:
        setup_logger(logger, level="debug")
    else:
        setup_logger(logger)

    logger.info("Reading graph")
    gx = load_graph()

    update_upstream_versions(gx)

    logger.info("writing out file")
    dump_graph(gx)
```

and about the meta_yaml? Thanks

@viniciusdc
Contributor

The JSON file can be something like this:

```json
{ "package": "package_name", "version": "package_version" }
```

@beckermr
Contributor Author

We can work on the json format later so go with that for now.

What exactly is your question about the main function?

@viniciusdc
Contributor

What does this mean?

```
distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available
```

@beckermr
Contributor Author

I have no idea. Usually the dask warnings are harmless and opaque. Maybe @CJ-Wright knows?

@viniciusdc
Contributor

viniciusdc commented Mar 23, 2020

This part I don't understand: "Inside the update versions function, write the potentially new version for each package to some json in a dir".
By now I have the following:

```python
from conda_forge_tick.utils import load_graph
from conda_forge_tick.update_upstream_versions import *

gx = load_graph()
update_upstream_versions(gx)
```

May I add the dir to a JSON file inside the update function?

@viniciusdc
Contributor

Wait, I think I get it. I was thinking of something else. You want me to create a new function, a "new update function", that reads the graph and then outputs the potentially new versions, right?

@viniciusdc
Contributor

OK, one more question: how are the nodes of the graph organized?
What was expected when running gx.nodes['python']['payloads']?

@beckermr
Contributor Author

That is an env var. So on the command line on linux do something like export CONDA_FORGE_TICK_DEBUG=1

@viniciusdc
Contributor

Thanks 👍

@viniciusdc
Contributor

I know we are using networkx to create the graph, so is it necessary to create the JSON output as an nx.DiGraph too?

@beckermr
Contributor Author

I'd use a python dict in memory.

@viniciusdc
Contributor

"In memory"?

@beckermr
Contributor Author

When coding in python, store the versions in a dictionary. Then I'd dump that dictionary to disk, separating it out into different files for different nodes. There is no need for a graph here.
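A minimal sketch of that suggestion; package names, versions, and the output directory are all illustrative:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical in-memory result of the version check: one entry per node.
new_versions = {"numpy": "1.18.1", "scipy": "1.4.1"}

out = Path(tempfile.mkdtemp())
for name, version in new_versions.items():
    # One small JSON file per node; no graph structure is needed here.
    (out / f"{name}.json").write_text(json.dumps({"version": version}))

written = sorted(p.name for p in out.glob("*.json"))
print(written)
```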

@viniciusdc
Contributor

viniciusdc commented May 22, 2020

Hi, I am separating out the source classes to test... What is this source?

```python
class AbstractSource(abc.ABC):
    name: str

    @abc.abstractmethod
    def get_version(self, url: str) -> Optional[str]:
        pass

    @abc.abstractmethod
    def get_url(self, url: str) -> Optional[str]:
        pass
```

@beckermr
Contributor Author

@viniciusdc
Contributor

thanks :)

@viniciusdc
Contributor

viniciusdc commented May 22, 2020

One more thing: why do we use an `if` here? Isn't sources always None (update_upstream_versions(gx))?

```python
sources = (
    (PyPI(), CRAN(), NPM(), ROSDistro(), RawURL(), Github())
    if sources is None
    else sources
)
```

@beckermr
Contributor Author

Link to the source?

We have lots of residual code lying around, so many things might just be old.

@viniciusdc
Contributor

viniciusdc commented May 22, 2020

Here

But if you look at main, when it's called we use only gx as a parameter.
And there is an unnecessary comma here too:

```python
gx: nx.DiGraph, sources: Iterable[AbstractSource] = None,
```

@beckermr
Contributor Author

Yeah, so sources is None by default, but this block of code lets us pass in other things if we want.
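This is the common default-argument idiom: None in the signature, the real default built inside the function. A stripped-down illustration (the source names are just strings here, not the real source classes):

```python
def update_versions(gx, sources=None):
    # None means "use the built-in sources"; callers or tests can inject
    # their own list instead.
    sources = ["PyPI", "CRAN", "NPM"] if sources is None else sources
    return sources

print(update_versions({}))            # built-in default
print(update_versions({}, ["PyPI"]))  # caller-supplied override
```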

@viniciusdc
Contributor

Yup, so if we want to use update_upstream_versions separately with another source, that's possible. Basically that's what the block means, right?

@viniciusdc
Contributor

Hi, I've changed the code; now I only need to rebuild the get_latest_version function and its affiliates. Where can I put the new script?

@viniciusdc
Contributor

viniciusdc commented May 25, 2020

@CJ-Wright
Member

Why was that development not done in your fork of cf-scripts?

@viniciusdc
Contributor

viniciusdc commented May 25, 2020

I created a separate workspace to test some ideas with the graph itself. When finished, I will push to my fork and then create a PR. But this time I forgot to do that, haha.

@viniciusdc
Contributor

One question: do I create a new PR with this or update the old one?

@CJ-Wright
Member

Whichever is easiest for you

@viniciusdc
Contributor

OK, I added the new updates to the fork repo, and I will work on get_update_versions. Sorry for any late response.

@viniciusdc
Contributor

viniciusdc commented May 26, 2020

When using logging, is there a way to set the same logger for a submodule (like update_sources)? E.g. I've separated the source classes into another file (well... not entirely), but some of them emit debug info on our logger (which is currently defined in another script). Is there a simple way to solve this? Or just call logging.getLogger again?

@viniciusdc
Contributor

viniciusdc commented May 26, 2020

Can I do something like this?

```python
logging.getLogger(conda_forge_tick.cs-graph_update_version.update_sources)
```

As a hierarchical logger for the update sources debug?
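Almost: getLogger takes a dotted string, and the usual pattern is for each submodule to call logging.getLogger(__name__), which makes its logger a child of the package logger, so the handlers and level configured on the parent apply automatically. A small sketch (the logger names here are illustrative):

```python
import logging

# Configure the top-level package logger once, e.g. in the main script.
parent = logging.getLogger("conda_forge_tick")
parent.setLevel(logging.DEBUG)

# In a submodule, ask for a child by dotted name; in practice this is
# just `logger = logging.getLogger(__name__)`.
child = logging.getLogger("conda_forge_tick.update_sources")

# The child has no level of its own, so records inherit the parent's
# configuration and propagate to the parent's handlers.
print(child.parent is parent)
```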

@viniciusdc
Contributor

viniciusdc commented Jun 22, 2020

Hi @CJ-Wright, @beckermr, sorry for the late feedback. I was testing the pool version. It looks fine so far, but I received a strange output from get_latest_version. I saved the initial output to the file warnings.txt; after that, the code runs as smoothly as before (but now takes just 4 min 😄).
warnings.txt

I will make a new PR with the updates. I would also like to know whether we will move to DynamoDB once it is running OK.

@viniciusdc
Contributor

I think that with #1075 working, we can start discussing the next steps, right?

@beckermr
Contributor Author

Yup, this issue is done!

@viniciusdc viniciusdc unpinned this issue Sep 22, 2020