Skip to content

Conversation

@ernjvr
Copy link

@ernjvr ernjvr commented Jun 26, 2018

Description

Overview:
There is currently no functional mechanism that captures or persists the end time of when an asynchronous job has finished. As a result, users are not able to do any reporting about the duration of various asynchronous jobs in Cloudstack.

  • Data Layer changes:
    The time stamp of when an asynchronous job has finished should be persisted to the 'removed' column of the async_job database table. This is an existing column which is not being used for any other functionality.
  • QueryAsyncJobResult API changes:
    The queryAsyncJobResult API should now include an 'end_time' response tag that will display the value of the 'removed' database column.
  • Management Server process changes:
    When a Cloudstack management server starts up, it finds in the database all the asynchronous jobs it is the owner of, whose statuses are in an 'in_progress' state and updates them to a 'failed' status. At the same time, it should now also mark them as 'removed' by updating the 'removed' field with the current timestamp.
  • Garbage collection changes:
    Currently, the garbage collector uses the 'created' database column to find in the database completed asynchronous job records that are older than the 'job.expire.minutes' global setting's timestamp and, then, deletes these records. It should no longer use the 'created' column to do this. It should use the 'removed' column instead.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

GitHub Issue/PRs

Screenshots (if appropriate):

How Has This Been Tested?

  • Start a virtual machine to initiate an asynchronous job. Query the cloud.async_job table to find the record that was created in response to the 'start virtual machine' job. When the job finishes, verify that the 'removed' column has been correctly updated with the end timestamp of the relevant record.
  • Using cloudmonkey, make an API call to queryAsyncJobResult using the uuid as the jobid and verify that the 'endtime' field returns the correct timestamp value of the relevant database record's 'removed' column.
  • Verify that the garbage collector deletes the correct expired asynchronous job records from the cloud.async_job table.
  • Hypervisor: KVM

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
    Testing
  • I have added tests to cover my changes.
  • All relevant new and existing integration tests have passed.
  • A full integration testsuite with all test that can run on my environment has passed.

Copy link
Member

@rohityadavcloud rohityadavcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes request, overall LGTM good work. I've not tested/checked your marvin test cases though.

@Param(description = " the created date of the job")
private Date created;

@SerializedName(ApiConstants.END_TIME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be finished or completed?

expiringUnfinishedAsyncJobSearch.done();

expiringCompletedAsyncJobSearch = createSearchBuilder();
expiringCompletedAsyncJobSearch.and("created", expiringCompletedAsyncJobSearch.entity().getCreated(), SearchCriteria.Op.LTEQ);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is okay only if the 'created' it not used in any of the methods to query the result.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used by the Asyncjob garbage collector.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified; it is only used by the async jobs garbage collector.

SearchCriteria<AsyncJobVO> sc = expiringCompletedAsyncJobSearch.create();
sc.setParameters("created", cutTime);
public List<AsyncJobVO> getExpiredCompletedJobs(final Date cutTime, final int limit) {
final SearchCriteria<AsyncJobVO> sc = expiringCompletedAsyncJobSearch.create();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to discuss this bit over our syncup call when you're back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: during the upgrade 4.11.1->4.12.0 -> set all the removed field of the async_jobs -> NOW()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update added to schema-41110to41200.sql

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update added.

job.setLastUpdated(DateUtil.currentGMTTime());
final Date currentGMTTime = DateUtil.currentGMTTime();
job.setLastUpdated(currentGMTTime);
job.setRemoved(currentGMTTime);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

public AsyncJob queryJob(long jobId, boolean updatePollTime) {
AsyncJobVO job = _jobDao.findById(jobId);
public AsyncJob queryJob(final long jobId, final boolean updatePollTime) {
final AsyncJobVO job = _jobDao.findByIdIncludingRemoved(jobId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may have some API response changes, wrt who uses this method and if it can cause any regression (as the method will return a removed job).

minimum_vms_to_pass = 2
for vm_ip in vms_ips:
ssh_command = "ping -c 5 %s" % vm_ip
ssh_command = "ping -c 10 %s" % vm_ip
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ernjvr is this related change? if it's a bugfix can we have it in a separate PR?

Copy link
Member

@borisstoyanov borisstoyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ernjvr I see lots of other commits in your PR, can you pull and rebase your code against master and resubmit the pull request

@ernjvr ernjvr force-pushed the async_jobs_add_endtime branch from 01ee941 to 98ddc26 Compare July 3, 2018 10:36
private Date created;

@SerializedName(ApiConstants.COMPLETED)
@Param(description = " the removed date of the job")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the description and variable to completed?

public interface AsyncJobDao extends GenericDao<AsyncJobVO, Long> {

@Override
AsyncJobVO findById(Long id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it was referenced by interface and not implementation. But as discussed, I will remove the override and use findByIdIncludingRemoved.

}

@Override
public AsyncJobVO findById(Long id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The findByIdIncluding removed previous usage was okay.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted

cmd.jobid = self.virtual_machine.jobid
cmd_response = self.apiclient.queryAsyncJobResult(cmd)
endtime = cmd_response.endtime
self.assertIsNotNone(endtime, "Expected 'endtime' field of queryAsyncJobResult to have a timestamp of when the job finished.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmd_response -> get the complete time, then use the dbclient -> async_jobs table -> job id and verify that completed matches what you got in the API response.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the API, jobstatus -> success/failure -> check that against db as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"basic",
"sg"],
required_hardware="true")
def test_async_job_table_values(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Squash the above two test cases into one, asset equal for the job id's removed/completed time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm not a fan of combining separate tests into one, this is not a good practice since it's limiting the test granularity. But in this case maybe it does makes sense to do it since the api result is depended on the DB value and could not work without it. In this case it'll save us the time to deploy a VM which is a lot and it does makes sense to combine.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def test_async_job_table_values(self):
"""
Test async job db table for valid values
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add another failure test case. Destroy something twice, etc. job status -> failed (int 2?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried this case and I'm not sure it baheves as expected. I've successfully destroyed a VM and tried to do it again and the job result was successful same as the first try. Should this behave like that? I think it should return an error saying the VM is already in Destroyed state.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the same result as you @borisstoyanov

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ernjvr do you think we should refactor the service if it tries to bring a VM into a state which it is, to prompt an error? It kinda makes sense to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@borisstoyanov I investigated the code for the scenario of starting a VM. It also returns a success status when I make a request to start a VM that has already been started. This success status is returned by the VM job handler code. The async job status gets updated with whatever status is returned by the VM job handler code. So, I think changing how the VM handler code deals with duplicate commands for the same resource is out of scope for this feature. I think it should be dealt with in a separate bug fix ticket. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sound the right way to do it, thanks for the investigation @ernjvr, let me log a ticket in community. When we have time we could get back to this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@rohityadavcloud rohityadavcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments. For integration tests, make use of both the API respones and db client/query result to verify both the job status and the removed time.

except Exception as e:
self.debug("Warning! Exception in tearDown: %s" % e)

@attr(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be formatted to a single line.

"advancedns",
"basic",
"sg"],
required_hardware="true")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be set to required_hardware="false".

VirtualMachine.RUNNING)
self.assertEqual(response[0], PASS, response[1])

result = self.dbclient.execute("select * from async_job where uuid='%s'" % self.virtual_machine.jobid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to line 127, closer to the access on result[0][17], probably name this as db_result to make it more obvious.

removed = timezone('UTC').localize(result[0][17])
removed = removed.astimezone(timezone('CET'))
removed = removed.strftime("%Y-%m-%dT%H:%M:%S%z")
self.assertEqual(completed, removed, "Expected 'completed' tag value to be equal to 'removed' db column value.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the word usage tag and use timestamp/datetime instead, it's good enough otherwise. Same at line 137.

Copy link
Member

@rohityadavcloud rohityadavcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left minor nits, otherwise LGTM. I did not test it, which the internal testing would approve or advise bugs.

jobstatus_db = result[0][8]
self.assertEqual(jobstatus_api, jobstatus_db, "Expected 'jobstatus' tag value to be equal to 'job_status' db column value.")

return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't return at the end .

domainid=self.domain.id
)
self.cleanup = [self.account]
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the return.

from marvin.lib.base import *
from marvin.lib.common import *
from pytz import timezone

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run pep8 or pylint on this file and fix and lint related issues.

@ernjvr ernjvr closed this Jul 11, 2018
rohityadavcloud added a commit that referenced this pull request Jan 20, 2021
Fixes #11

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
shwstppr pushed a commit that referenced this pull request Jul 27, 2021
* Changed lease time to inifinity

* message around DHCP lease infinity

Co-authored-by: Andrija Panic <45762285+andrijapanicsb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants