Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update/improve get deployment elastic #276

Merged

Conversation

adrianfusco
Copy link
Member

  • I wanted to do the SQL equivalent of GROUP BY to take the information of the last build. The way to approach it is using aggregations so I've modified the query used.
  • We have been using the scan helper to get the documents from Elasticsearch through a query. In this case it doesn't support the aggregations so in this case we'll have a condition to use the search method of the elastic client.
  • We can't send a giant query in the request to the Elasticsearch for asking to all the jobs information so we have been asking the information of the deployment for each job 1:1. Instead of doing one query for job we create a list of jobs sub lists and do calls divided by chunks. chunk_size_for_search quantity will be the size of every sub list. e.g, If we have 2000 jobs we will have the following calls: 2000 / 600 = 3.33 -> 4 calls.
  • With method append_get_specific_field we'll get just the specific fields we require from the source field, equivalent to SELECT field in SQL instead of SELECT *.

- The aggregations query will use now the search method of the client. The queries will use the scan helper. The second one is good if we are receiving a lot of information (more than 10K).
- Add explanation about chunks division in the get_deployment method of the elastic source
@adrianfusco adrianfusco requested a review from a team May 25, 2022 14:07
@adrianfusco adrianfusco marked this pull request as draft May 25, 2022 14:07
Copy link
Contributor

@bregman-arie bregman-arie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference I see in speed compared to the previous version is amazing. Well done.
One thing we need to address is inconsistency with other sources. I get completely different results running the same queries with Jenkins for example. But this is something that should be addressed in a separate future PR.

Maybe you and @cescgina can take a look at it together and identify any shared components/blocks in the code.

@adrianfusco
Copy link
Member Author

The difference I see in speed compared to the previous version is amazing. Well done. One thing we need to address is inconsistency with other sources. I get completely different results running the same queries with Jenkins for example. But this is something that should be addressed in a separate future PR.

Maybe you and @cescgina can take a look at it together and identify any shared components/blocks in the code.

Thanks. Totally. I've tested now clear; cibyl -vvv -d --source elasticsearch --jobs --topology and I see a good difference:

INFO     cibyl.orchestrator   Took 3.49s to query system osp_jenkins using source elasticsearch of type elasticsearch using method get_deployment

We should take in count the following things:

  1. We're not showing all the information in Elasticsearch because we don't have stored all the information of all jobs and builds as Jenkins is doing.
  2. Using the Jenkins source we are showing all the jobs. The jobs that don't have information associate we display: No openstack information associated with this job. Using the Elasticsearch source I just show the jobs that have information associated. What do we prefer here?
  3. Using Jenkins we're displaying all the information related to each node and role of this one. I don't have this information in Elasticsearch so I can't show it.

@cescgina
Copy link
Contributor

2. Using the Jenkins source we are showing all the jobs. The jobs that don't have information associate we display: No openstack information associated with this job. Using the Elasticsearch source I just show the jobs that have information associated. What do we prefer here?

About this, I worked under the assumption that if the user just passed --jobs --topology without specifying any value for them, then no filtering is done. How does it work for builds in elasticsearch @adrianfusco? If for example called cibyl with --jobs --builds would it show the jobs with no builds?

adrianfusco added a commit to adrianfusco/cibyl that referenced this pull request May 30, 2022
At the same way of get_deployment, we are doing one query for each job so if we have 2000 jobs then we have 2000 queries.

As we can't send all the jobs to ask information of them in one single query because it's too big and it returns error, I've used chunks like: RedHatCRE#276 and we have reduced the time of the queries:

From:
INFO     cibyl.orchestrator   Took 551.80s to query system osp_jenkins using source: 'elasticsearch' of type: 'elasticsearch' using method get_builds
To:
INFO     cibyl.orchestrator   Took 117.57s to query system osp_jenkins using source elasticsearch of type elasticsearch using method get_builds
adrianfusco added a commit that referenced this pull request May 30, 2022
At the same way of get_deployment, we are doing one query for each job so if we have 2000 jobs then we have 2000 queries.

As we can't send all the jobs to ask information of them in one single query because it's too big and it returns error, I've used chunks like: #276 and we have reduced the time of the queries:

From:
INFO     cibyl.orchestrator   Took 551.80s to query system osp_jenkins using source: 'elasticsearch' of type: 'elasticsearch' using method get_builds
To:
INFO     cibyl.orchestrator   Took 117.57s to query system osp_jenkins using source elasticsearch of type elasticsearch using method get_builds
@adrianfusco
Copy link
Member Author

  1. Using the Jenkins source we are showing all the jobs. The jobs that don't have information associate we display: No openstack information associated with this job. Using the Elasticsearch source I just show the jobs that have information associated. What do we prefer here?

About this, I worked under the assumption that if the user just passed --jobs --topology without specifying any value for them, then no filtering is done. How does it work for builds in elasticsearch @adrianfusco? If for example called cibyl with --jobs --builds would it show the jobs with no builds?

Yes, we're showing all the jobs even those with no builds. I've changed the behavior here too.

@adrianfusco adrianfusco marked this pull request as ready for review May 30, 2022 15:25
@adrianfusco
Copy link
Member Author

The difference I see in speed compared to the previous version is amazing. Well done. One thing we need to address is inconsistency with other sources. I get completely different results running the same queries with Jenkins for example. But this is something that should be addressed in a separate future PR.

Maybe you and @cescgina can take a look at it together and identify any shared components/blocks in the code.

I added another change to show all jobs. Let's review it and we can continue working on the other topics in a corespondent task and branch.

Copy link
Contributor

@cescgina cescgina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adrianfusco adrianfusco merged commit 84b8f7f into RedHatCRE:main May 30, 2022
@adrianfusco adrianfusco deleted the update/improve-get-deployment-elastic branch May 30, 2022 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants