
separate version updates to their own microservice #842

Closed
beckermr opened this issue Feb 25, 2020 · 99 comments

Comments

@beckermr
Contributor

beckermr commented Feb 25, 2020

It is time to split out the version update cron job to its own service.

  • For now, we will reuse the LazyJson infrastructure to build out a new version update table stored in a new cf-graph-version repo. (I had mentioned putting it in the cf-graph-countyfair repo, but then we have to deal with git pull and push issues, which we can punt on using a separate repo.)
  • Eventually, this will be moved into dynamodb in another table.
  • The JSON model will be a list of version blobs per feedstock.
  • This list will be flattened by version string in the final dynamodb table.
  • We will run it at half past the hour with the hope that it finishes by the time the bot starts, but it doesn't have to.
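As a concrete illustration, the per-feedstock "list of version blobs" might look something like the fragment below (field names are hypothetical; the issue does not fix a schema):

```json
{
  "numpy": [
    {"version": "1.18.1", "source": "PyPI"},
    {"version": "1.18.0", "source": "PyPI"}
  ]
}
```

In the eventual DynamoDB table this list would be flattened, keyed by version string.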

cc @CJ-Wright @mariusvniekerk @scopatz

@beckermr beckermr pinned this issue Feb 25, 2020
@beckermr
Contributor Author

The entire script is here: https://github.com/regro/cf-scripts/blob/master/conda_forge_tick/update_upstream_versions.py

The key is to move the data structures to another repo with a separate process.

@viniciusdc
Contributor

I think I can do it, but it will take some time, since I have to understand how the entire structure specified in that script operates. There are still a number of things I don't know, but of course I should be able to learn them.

@beckermr
Contributor Author

I would start simpler.

  1. Read the graph (clone cf-graph-countyfair, use --depth=1, cd to that dir)
  2. Load it (call gx = load_graph(), see the main function)
  3. call the update versions function in main
  4. Inside the update versions function, write the potentially new version for each package to some json in a dir

Once that works, post a PR with the code and then we can move on from there.
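A runnable sketch of the four steps, with the cf-scripts pieces stubbed out; in the real flow, load_graph comes from conda_forge_tick.utils and is run from inside a --depth=1 clone of cf-graph-countyfair, and the version lookup lives in update_upstream_versions.py. All names and data below are illustrative stand-ins.

```python
def load_graph():
    # Stand-in: the real graph is a networkx DiGraph with LazyJson payloads.
    return {"numpy": {}, "scipy": {}}

def get_latest_version(name):
    # Stand-in for the per-source version lookup (PyPI, CRAN, NPM, ...).
    return {"numpy": "1.18.1", "scipy": "1.4.1"}[name]

gx = load_graph()                                               # steps 1-2
new_versions = {name: get_latest_version(name) for name in gx}  # step 3
# Step 4 would write each entry to its own JSON file in an output dir.
print(new_versions)
```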

@viniciusdc
Contributor

Ok, I will try that.

@viniciusdc
Contributor

OK, now should I work on the first code you sent, or continue with the simple tasks, i.e. (3) "call the update versions function in main"?

@beckermr
Contributor Author

continue on 3!

@CJ-Wright
Member

@beckermr is it worthwhile to put this into the same Circle system? I think we can have multiple cron jobs offset from each other.

@beckermr
Contributor Author

Yep that is fine by me! Great idea!

@viniciusdc
Contributor

Well, I got something, but I think I missed something along the way...
```
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Task exception was never retrieved
```
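That RuntimeError comes from Python's multiprocessing (which dask uses under the hood) starting child processes on a platform that spawns rather than forks; the standard fix is to guard the entry point. A minimal sketch, unrelated to the cf-scripts code itself:

```python
import multiprocessing as mp

def square(x):
    return x * x

def main():
    # Work that spawns child processes must only run from the entry point.
    with mp.Pool(2) as pool:
        return pool.map(square, [1, 2, 3])

if __name__ == "__main__":
    # Without this guard, spawn-based platforms re-import the module in each
    # child process and raise the "bootstrapping phase" RuntimeError above.
    print(main())
```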

@viniciusdc
Contributor

viniciusdc commented Mar 19, 2020

In load_graph.py I call update_upstream_versions(gx) from .\update_upstream_versions, and I got the result above when running it.

@viniciusdc
Contributor

And I am not supposed to fork, as the message says, since I am running in a new dir for testing, right?

@beckermr
Contributor Author

Try turning off dask. If you set the right environment variables it will run serially.

@viniciusdc
Contributor

Hi, sorry for the late response. I am still on step 3, but I have a question: once gx = load_graph() is defined, should I write something into it before calling update_upstream_versions(gx)?

@beckermr
Contributor Author

Nope probably not.

@viniciusdc
Contributor

So, I think the error I am seeing is probably because I am not defining a source for update_upstream_versions. Now my question: if I create a new file called pck.json and use its dir as the source, will it be read? (pck.json has a package and its version listed, as in cf-scripts/requirements/test.)

@beckermr
Contributor Author

Idk sorry. Trace the control flow path through the main function in that file. That should help you figure it out.

@viniciusdc
Contributor

Yes, I think I discovered the issue with my attempts. Using the code you sent, I have to insert something into gx before using update_upstream_versions(gx), right? Because the attrs are LazyJson objects... right?

And if that's the case, how exactly can I input the new data? I think my doubts center on this: "Inside the update versions function, write the potentially new version for each package to some json in a dir". Just to check that I have it right: create a new JSON file (as told) and call update_upstream_versions(gx, 'json_dir')?

@viniciusdc
Contributor

Also, must the JSON file have some special structure? I saw some "verifications" of the sources' structure inside some classes.
Thanks for everything so far, and sorry for so much trouble.

@beckermr
Contributor Author

I’m not sure I follow. The goal is to write the new versions to a separate data structure outside of the graph.

@viniciusdc
Contributor

Yes, as you said here: "The JSON model will be a list of version blobs per feedstock." Right.
But returning to the subject of update_upstream_versions, I am a little bit confused by step (4). As I see in the script you sent (https://github.com/regro/cf-scripts/blob/master/conda_forge_tick/update_upstream_versions.py), update_upstream_versions asks for two inputs, gx and sources, and as you said our goal is to create a new file outside of gx.
Here enters my question: can I pass a dir to the JSON file (with package and version information) as a source?

@beckermr
Contributor Author

Ahhh. See what the code currently inputs for sources and follow that.

@viniciusdc
Contributor

Hi, if possible, can you explain a little bit about the main function,

```python
def main(args: Any = None) -> None:
    from .xonsh_utils import env

    debug = env.get("CONDA_FORGE_TICK_DEBUG", False)
    if debug:
        setup_logger(logger, level="debug")
    else:
        setup_logger(logger)

    logger.info("Reading graph")
    gx = load_graph()

    update_upstream_versions(gx)

    logger.info("writing out file")
    dump_graph(gx)
```

and about the meta_yaml? Thanks

@viniciusdc
Contributor

The JSON file can be something like this:

```json
{ "package": "package_name", "version": "package_version" }
```

@beckermr
Contributor Author

We can work on the json format later so go with that for now.

What exactly is your question about the main function?

@viniciusdc
Contributor

What does this mean?

```
distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available
```

@beckermr
Contributor Author

I have no idea. Usually the dask warnings are harmless and opaque. Maybe @CJ-Wright knows?

@viniciusdc
Contributor

viniciusdc commented Mar 23, 2020

This part I don't understand: "Inside the update versions function, write the potentially new version for each package to some json in a dir".
By now I have the following:

```python
from conda_forge_tick.utils import load_graph
from conda_forge_tick.update_upstream_versions import *

gx = load_graph()
update_upstream_versions(gx)
```

May I add the dir to a JSON file inside the update function?

@viniciusdc
Contributor

Wait, I think I get it. I was thinking of something else. You want me to create a new function, a "new update function", that reads the graph and then outputs the potentially new versions, right?

@viniciusdc
Contributor

OK, one more question: how are the nodes of the graph organized?
What was expected when running gx.nodes['python']['payloads']?

@beckermr
Contributor Author

That is an env var. So on the command line on linux do something like export CONDA_FORGE_TICK_DEBUG=1

@viniciusdc
Contributor

Thanks 👍

@viniciusdc
Contributor

I know we are using networkx to create the graph, so is it necessary to create the JSON output as an nx.DiGraph too?

@beckermr
Contributor Author

I'd use a python dict in memory.

@viniciusdc
Contributor

"In memory"?

@beckermr
Contributor Author

When coding in python, store the versions in a dictionary. Then I'd dump that dictionary to disk, separating it out into different files for different nodes. There is no need for a graph here.
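A minimal sketch of that suggestion; package names, versions, and the output directory are all illustrative:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical in-memory result of the version check: one entry per node.
new_versions = {"numpy": "1.18.1", "scipy": "1.4.1"}

out = Path(tempfile.mkdtemp())
for name, version in new_versions.items():
    # One small JSON file per node; no graph structure is needed here.
    (out / f"{name}.json").write_text(json.dumps({"version": version}))

written = sorted(p.name for p in out.glob("*.json"))
print(written)
```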

@viniciusdc
Contributor

viniciusdc commented May 22, 2020

Hi, I am separating out the source classes to test... What is this source?

```python
class AbstractSource(abc.ABC):
    name: str

    @abc.abstractmethod
    def get_version(self, url: str) -> Optional[str]:
        pass

    @abc.abstractmethod
    def get_url(self, url: str) -> Optional[str]:
        pass
```

@beckermr
Contributor Author

@viniciusdc
Contributor

thanks :)

@viniciusdc
Contributor

viniciusdc commented May 22, 2020

One more thing: why do we use an `if` here? Isn't sources always None (update_upstream_versions(gx))?

```python
sources = (
    (PyPI(), CRAN(), NPM(), ROSDistro(), RawURL(), Github())
    if sources is None
    else sources
)
```

@beckermr
Contributor Author

Link to the source?

We have lots of residual code lying around, so many things might just be old.

@viniciusdc
Contributor

viniciusdc commented May 22, 2020

Here

But if you look at main, when it's called we use only gx as a parameter.
And there is an unnecessary comma here too:

```python
gx: nx.DiGraph, sources: Iterable[AbstractSource] = None,
```

@beckermr
Contributor Author

Yeah, so sources is None by default, but this block of code lets us pass in other things if we want.
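This is the common default-argument idiom: None in the signature, the real default built inside the function. A stripped-down illustration (the source names are just strings here, not the real source classes):

```python
def update_versions(gx, sources=None):
    # None means "use the built-in sources"; callers or tests can inject
    # their own list instead.
    sources = ["PyPI", "CRAN", "NPM"] if sources is None else sources
    return sources

print(update_versions({}))            # built-in default
print(update_versions({}, ["PyPI"]))  # caller-supplied override
```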

@viniciusdc
Contributor

Yup, so if we want to use update_upstream_versions separately with another source, that's possible. Basically that's what the block means, right?

@viniciusdc
Contributor

Hi, I've changed the code; now I only need to rebuild the get_latest_version function and its affiliates. Where can I put the new script?

@viniciusdc
Contributor

viniciusdc commented May 25, 2020

@CJ-Wright
Member

Why was that development not done in your fork of cf-scripts?

@viniciusdc
Contributor

viniciusdc commented May 25, 2020

I created a separate workspace to test some ideas with the graph itself. When finished, I will push to my fork and then create a PR. But this time I forgot to do that, haha.

@viniciusdc
Contributor

One question: do I create a new PR with this or update the old one?

@CJ-Wright
Member

Whichever is easiest for you

@viniciusdc
Contributor

OK, I added the new updates to the fork repo, and I will work on get_update_versions. Sorry for any late response.

@viniciusdc
Contributor

viniciusdc commented May 26, 2020

When using logging, is there a way to set the same logger for a submodule (like update_sources)? E.g. I've separated the source classes into another file (well... not entirely), but some of them emit debug info on our logger (which is currently defined in another script). Is there a simple way to solve this? Or just call logging.getLogger again?

@viniciusdc
Contributor

viniciusdc commented May 26, 2020

Can I do something like this?

```python
logging.getLogger(conda_forge_tick.cs-graph_update_version.update_sources)
```

As a hierarchical logger for the update sources debug?
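Almost: getLogger takes a dotted string, and the usual pattern is for each submodule to call logging.getLogger(__name__), which makes its logger a child of the package logger, so the handlers and level configured on the parent apply automatically. A small sketch (the logger names here are illustrative):

```python
import logging

# Configure the top-level package logger once, e.g. in the main script.
parent = logging.getLogger("conda_forge_tick")
parent.setLevel(logging.DEBUG)

# In a submodule, ask for a child by dotted name; in practice this is
# just `logger = logging.getLogger(__name__)`.
child = logging.getLogger("conda_forge_tick.update_sources")

# The child has no level of its own, so records inherit the parent's
# configuration and propagate to the parent's handlers.
print(child.parent is parent)
```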

@viniciusdc
Contributor

viniciusdc commented Jun 22, 2020

Hi @CJ-Wright, @beckermr, sorry for the late feedback. I was testing the pool version. It looks fine so far, but I received a strange output from get_latest_version. I saved the initial output to the file warnings.txt; after that, the code runs as smoothly as before (but now takes just 4 min 😄).
warnings.txt

I will make a new PR with the updates. I would also like to know whether we will move to DynamoDB once it is running OK.

@viniciusdc
Contributor

I think that with #1075 working, we can start discussing the next steps, right?

@beckermr
Contributor Author

Yup, this issue is done!

@viniciusdc viniciusdc unpinned this issue Sep 22, 2020