This repository has been archived by the owner on Jul 11, 2022. It is now read-only.

[1094540] Agents get permanently backfilled if backend database goes offline for a moment

Clean up some of the live availability doco/workflow:
- Fix the jdoc for DiscoveryAgentService.getCurrentAvailability() to reflect reality
- Fix ResourceManagerBean.getLiveResourceAvailability() to mark its avail report
  as a "ServerSideReport".  This prevents it from interfering with server-agent
  backfill coordination.  Also, correct some confusing inline doco and do some cleanup.
- Fix AvailabilityManagerBean.mergeAvailabilityReport() such that logic checking for the
  backfill flag does not execute *after* the backfill flag has been reset. And clean up
  the overall logic a bit more.
- Improve jdoc for AvailabilityManagerLocal.updateLastAvailabilityReportInNewTransaction
  to clearly indicate its side-effect of clearing the backfill flag.
jshaughn committed Jun 13, 2014
1 parent 8cca198 commit 2fa7c54
Showing 4 changed files with 67 additions and 74 deletions.
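
The ordering problem described in the commit message (the backfill check running after the flag has already been reset) can be illustrated with a small standalone sketch. This is not the RHQ code: `AgentState` and both `askForFullReport...` methods are hypothetical stand-ins, and the only behavior assumed from the commit is that updating the last-report time clears the backfill flag as a side effect.

```java
import java.util.HashSet;
import java.util.Set;

public class BackfillOrderSketch {

    /** Hypothetical stand-in for the server's per-agent backfill bookkeeping. */
    static class AgentState {
        private final Set<Integer> backfilled = new HashSet<>();

        void markBackfilled(int agentId) {
            backfilled.add(agentId);
        }

        boolean isAgentBackfilled(int agentId) {
            return backfilled.contains(agentId);
        }

        // Assumed side effect (per the commit message): recording the report
        // time also unsets the backfill flag.
        void updateLastAvailabilityReport(int agentId) {
            backfilled.remove(agentId);
        }
    }

    /** Old ordering: the flag is cleared first, so the later check never fires. */
    static boolean askForFullReportOldOrder(AgentState state, int agentId, boolean changesOnly) {
        state.updateLastAvailabilityReport(agentId);
        return changesOnly && state.isAgentBackfilled(agentId);
    }

    /** New ordering: read the flag first, then clear it. */
    static boolean askForFullReportNewOrder(AgentState state, int agentId, boolean changesOnly) {
        boolean askForFull = changesOnly && state.isAgentBackfilled(agentId);
        state.updateLastAvailabilityReport(agentId);
        return askForFull;
    }

    public static void main(String[] args) {
        AgentState oldState = new AgentState();
        oldState.markBackfilled(1);
        System.out.println("old order asks for full report: "
            + askForFullReportOldOrder(oldState, 1, true));

        AgentState newState = new AgentState();
        newState.markBackfilled(1);
        System.out.println("new order asks for full report: "
            + askForFullReportNewOrder(newState, 1, true));
    }
}
```

In both orderings the flag ends up cleared; the difference is only whether the server notices the backfilled state in time to request a full report.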
@@ -33,9 +33,7 @@
import org.rhq.core.domain.discovery.MergeResourceResponse;
import org.rhq.core.domain.discovery.PlatformSyncInfo;
import org.rhq.core.domain.discovery.ResourceSyncInfo;
import org.rhq.core.domain.measurement.AvailabilityType;
import org.rhq.core.domain.resource.Resource;
import org.rhq.core.domain.resource.ResourceError;
import org.rhq.core.domain.resource.ResourceType;

/**
@@ -129,13 +127,13 @@ void updatePluginConfiguration(int resourceId, Configuration newPluginConfigurat
AvailabilityReport executeAvailabilityScanImmediately(boolean changedOnlyReport);

/**
* Returns the current availability for the specified resource.
* Return an availability report for the specified root resource and its descendants.
* <p/>
* This call returns an availability report (rather just a simple availability of a single resource)
* because it also scans for the changes in availability in the child resources. Notice that the returned report may
* contain no results if {@code changesOnly} is set to true. If it is false, the report will always contain
* the availability of the supplied resource but can also additionally contain the availabilities of some of its
* child resources, if they were eligible for availability collection at the time of calling this method.
* The returned report may contain no results if {@code changesOnly} is set to true. Otherwise it will return
* the availability of the root resource and its descendants. Note, a live availability check (i.e. a call
* to getAvailability()) is always performed on the root resource. Only descendants normally eligible for
* availability collection at the time of this call will also have live availability. Others will report their
* most recently reported availability.
* <p/>
* Also note that the availability types of the resources in the report may have any of the following values from
* the {@link AvailabilityType} enum - it may happen that the availability of the resource is
@@ -146,13 +144,10 @@ void updatePluginConfiguration(int resourceId, Configuration newPluginConfigurat
* correctly handle the report within the server.
*
* @param resource the resource to return the availability of.
* @param changesOnly if true, only changes in availability will be reported, if false the report will contain
* the availabilities of all resources eligible for collection at the time of the call regardless
* of whether their availability changed or not.
* @return an availability report containing at least the availability of the supplied resource + possibly avails
* of some of the child resources that were eligible for avail collection at the time. The rest of the
* children are scheduled for availability collection in the next collector run (which happens
* approximately 30 seconds after this call).
* @param changesOnly if true, only changes in availability will be reported. if false the report will contain
* the availabilities of the root resource and all descendants, whether their availability
* changed or not.
* @return an availability report populated as described in the above options.
*/
@NotNull
AvailabilityReport getCurrentAvailability(Resource resource, boolean changesOnly);
@@ -714,12 +714,9 @@ public boolean mergeAvailabilityReport(AvailabilityReport report) {
if (reportSize == 0) {
log.error("Agent [" + agentName + "] sent an empty availability report. This is a bug, please report it");
return true; // even though this report is bogus, do not ask for an immediate full report to avoid unusual infinite recursion due to this error condition
}

if (log.isDebugEnabled()) {
if (reportSize > 1) {
log.debug("Agent [" + agentName + "]: processing availability report of size: " + reportSize);
}
} else if (log.isDebugEnabled()) {
log.debug("Agent [" + agentName + "]: processing availability report of size: " + reportSize);
}

// translate data into Availability objects for downstream processing
@@ -730,44 +727,41 @@ public boolean mergeAvailabilityReport(AvailabilityReport report) {
}

Integer agentToUpdate = agentManager.getAgentIdByName(agentName);

// if this report is from an agent update the lastAvailReport time
if (!report.isServerSideReport() && agentToUpdate != null) {
availabilityManager.updateLastAvailabilityReportInNewTransaction(agentToUpdate.intValue());
}

MergeInfo mergeInfo = new MergeInfo(report);

// if this report is from an agent, and is a changes-only report, and the agent appears backfilled,
// then we need to skip this report so as not to waste our time. Then, immediately request and process
// a full report because, obviously, the agent is no longer down but the server thinks
// it still is down - we need to know the availabilities for all the resources on that agent
if (!report.isServerSideReport() && report.isChangesOnlyReport()
&& agentManager.isAgentBackfilled(agentToUpdate.intValue())) {
// For agent reports (not a server-side report)
if (!report.isServerSideReport() && agentToUpdate != null) {
// if this is a changes-only report, and the agent appears backfilled, then immediately request and process
// a full report because, obviously, the agent is no longer down but the server thinks it still is down -
// we need to know the availabilities for all the resources on that agent
if (report.isChangesOnlyReport() && agentManager.isAgentBackfilled(agentToUpdate.intValue())) {
mergeInfo.setAskForFullReport(true);
}

mergeInfo.setAskForFullReport(true);
// update the lastAvailReport time and unset the backfill flag if it is set.
availabilityManager.updateLastAvailabilityReportInNewTransaction(agentToUpdate.intValue());

} else {
// process the report in batches to avoid an overly long transaction and to potentially increase the
// speed in which an avail change becomes visible.
}

while (!availabilities.isEmpty()) {
int size = availabilities.size();
int end = (MERGE_BATCH_SIZE < size) ? MERGE_BATCH_SIZE : size;
// process the report in batches to avoid an overly long transaction and to potentially increase the
// speed in which an avail change becomes visible.

List<Availability> availBatch = availabilities.subList(0, end);
availabilityManager.mergeAvailabilitiesInNewTransaction(availBatch, mergeInfo);
while (!availabilities.isEmpty()) {
int size = availabilities.size();
int end = (MERGE_BATCH_SIZE < size) ? MERGE_BATCH_SIZE : size;

// Advance our progress and possibly help GC. This will remove the processed avails from the backing list
availBatch.clear();
}
List<Availability> availBatch = availabilities.subList(0, end);
availabilityManager.mergeAvailabilitiesInNewTransaction(availBatch, mergeInfo);

MeasurementMonitor.getMBean().incrementAvailabilityReports(report.isChangesOnlyReport());
MeasurementMonitor.getMBean().incrementAvailabilitiesInserted(mergeInfo.getNumInserted());
MeasurementMonitor.getMBean().incrementAvailabilityInsertTime(watch.getElapsed());
watch.reset();
// Advance our progress and possibly help GC. This will remove the processed avails from the backing list
availBatch.clear();
}

MeasurementMonitor.getMBean().incrementAvailabilityReports(report.isChangesOnlyReport());
MeasurementMonitor.getMBean().incrementAvailabilitiesInserted(mergeInfo.getNumInserted());
MeasurementMonitor.getMBean().incrementAvailabilityInsertTime(watch.getElapsed());
watch.reset();

if (!report.isServerSideReport()) {
if (agentToUpdate != null) {
// don't bother asking for a full report if the one we are currently processing is already full
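
The batching loop in this hunk relies on `List.subList` returning a view of the backing list, so clearing the view removes the processed entries from the original list. A minimal sketch of that idiom, assuming a hypothetical batch size of 3 and a plain counter in place of the per-batch transactional merge:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDrainSketch {

    // Hypothetical value for illustration; the real constant is not shown in the diff.
    static final int MERGE_BATCH_SIZE = 3;

    /** Drains the (mutable) list in batches and returns the size of each batch processed. */
    static List<Integer> drainInBatches(List<String> items) {
        List<Integer> batchSizes = new ArrayList<>();
        while (!items.isEmpty()) {
            int end = Math.min(MERGE_BATCH_SIZE, items.size());

            // subList is a view onto items, not a copy
            List<String> batch = items.subList(0, end);

            // stand-in for merging the batch in its own transaction
            batchSizes.add(batch.size());

            // clearing the view removes the processed entries from the backing list
            batch.clear();
        }
        return batchSizes;
    }

    public static void main(String[] args) {
        List<String> avails = new ArrayList<>(List.of("a", "b", "c", "d", "e", "f", "g"));
        System.out.println("batch sizes: " + drainInBatches(avails));
        System.out.println("remaining: " + avails.size());
    }
}
```

Because each pass shrinks the backing list, the loop terminates once every entry has been handed to a batch, and no second index variable is needed to track progress.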
@@ -180,7 +180,9 @@ List<AvailabilityPoint> findAvailabilitiesForAutoGroup(Subject subject, int pare

/**
* Executing this method will update the given agent's lastAvailabilityReport time
* in a new transaction
* in a new transaction.
* <p/>
* SIDE-EFFECT: will unset the backfill flag if currently set on the agent.
*
* @param agentId the id of the agent
*/
@@ -2499,15 +2499,17 @@ public Resource getResource(Subject subject, int resourceId) {
@Override
@TransactionAttribute(TransactionAttributeType.NEVER)
public ResourceAvailability getLiveResourceAvailability(Subject subject, int resourceId) {

Resource res = getResourceById(subject, resourceId);
//platforms are never unknown, just up or down, so we need to default the availability to a different value
//depending on the resource's category
ResourceAvailability results = new ResourceAvailability(res,

// platforms are never unknown, just up or down, so we need to default the availability to a different value
// depending on the resource's category
ResourceAvailability result = new ResourceAvailability(res,
res.getResourceType().getCategory() == ResourceCategory.PLATFORM ? AvailabilityType.DOWN
: AvailabilityType.UNKNOWN);

try {
// first, quickly see if we can even ping the agent, if not, don't bother trying to get the resource avail
// validate the resource and agent, protect against REST dummy agent
Agent agent = agentManager.getAgentByResourceId(subjectManager.getOverlord(), resourceId);
if (agent == null) {
if (log.isDebugEnabled()) {
@@ -2516,7 +2518,6 @@ public ResourceAvailability getLiveResourceAvailability(Subject subject, int res
new IllegalStateException("No agent is associated with the resource with id [" + resourceId + "]");
} else if (agent.getName().startsWith(ResourceHandlerBean.DUMMY_AGENT_NAME_PREFIX)
&& agent.getAgentToken().startsWith(ResourceHandlerBean.DUMMY_AGENT_TOKEN_PREFIX)) {
// dummy agent created from REST
return getResourceById(subject, resourceId).getCurrentAvailability();
}
AgentClient client = agentManager.getAgentClient(agent);
@@ -2526,45 +2527,46 @@ public ResourceAvailability getLiveResourceAvailability(Subject subject, int res

AvailabilityReport report = null;

// first, quickly see if we can even ping the agent, if not, don't bother trying to get the resource avail
boolean agentPing = client.pingService(5000L);
if (agentPing) {
// we can't serialize the resource due to the hibernate proxies (agent can't deserialize hibernate objs)
// but we know we only need the basics for the agent to collect availability, so create a bare resource object
Resource bareResource = new Resource(res.getResourceKey(), res.getName(), res.getResourceType());
bareResource.setId(res.getId());
bareResource.setUuid(res.getUuid());
// root the avail check at the desired resource. Ask for a full report to guarantee that we
// get back the agent-side avail for the resource and keep the server in sync. This also means we'll
// get the descendants as well.
report = client.getDiscoveryAgentService().getCurrentAvailability(bareResource, false);
}

if (report == null) {
report = new AvailabilityReport(client.getAgent().getName());
Availability fakeAvail = new Availability(res,
res.getResourceType().getCategory() == ResourceCategory.PLATFORM ? AvailabilityType.DOWN
: AvailabilityType.UNKNOWN);
fakeAvail.setStartTime(System.currentTimeMillis());
report.addAvailability(fakeAvail);
}
if (report != null) {
// although the data came from the agent this should be processed like a server-side report
// because it was requested and initiated by the server (bz 1094540). The availabilities will
// still be merged but certain backfill logic will remain unscathed.
report.setServerSideReport(true);

// The report is most likely empty as it's unlikely the avail has changed. Don't merge it and return
AvailabilityType foundAvail = report.forResource(res.getId());
if (foundAvail != null) {
availabilityManager.mergeAvailabilityReport(report);
} else {
foundAvail = res.getCurrentAvailability() == null ? AvailabilityType.UNKNOWN : res
.getCurrentAvailability().getAvailabilityType();
}
AvailabilityType foundAvail = report.forResource(res.getId());
if (foundAvail != null) {
availabilityManager.mergeAvailabilityReport(report);
} else {
foundAvail = res.getCurrentAvailability() == null ? AvailabilityType.UNKNOWN : res
.getCurrentAvailability().getAvailabilityType();
}

// make sure we don't somehow leak/persist a MISSING avail
foundAvail = (AvailabilityType.MISSING == foundAvail) ? AvailabilityType.DOWN : foundAvail;
results.setAvailabilityType(foundAvail);
// make sure we don't somehow leak/persist a MISSING avail
foundAvail = (AvailabilityType.MISSING == foundAvail) ? AvailabilityType.DOWN : foundAvail;
result.setAvailabilityType(foundAvail);
}

} catch (Exception e) {
if (log.isDebugEnabled()) {
log.debug("Failed to get live availability: " + e.getMessage());
}
}

return results;
return result;
}

// lineage is a getXXX (not findXXX) because it logically returns a single object, but modeled as a list here
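
The availability resolution in `getLiveResourceAvailability()` follows three rules visible in the hunk above: platforms default to DOWN while other categories default to UNKNOWN, a missing entry in the report falls back to the last known availability, and MISSING is never returned to the caller. The sketch below restates those rules in isolation. The enums are simplified local stand-ins rather than the RHQ domain classes, and `defaultFor` and `resolve` are hypothetical helper names.

```java
public class LiveAvailSketch {

    // Simplified stand-ins; only the values needed for the illustration.
    enum AvailabilityType { UP, DOWN, UNKNOWN, MISSING }

    enum ResourceCategory { PLATFORM, SERVER, SERVICE }

    /** Platforms are never unknown, so they default to DOWN; everything else to UNKNOWN. */
    static AvailabilityType defaultFor(ResourceCategory category) {
        return category == ResourceCategory.PLATFORM ? AvailabilityType.DOWN : AvailabilityType.UNKNOWN;
    }

    /**
     * Picks the availability to return: the value from the report if present,
     * otherwise the last known value (or UNKNOWN), and never MISSING.
     */
    static AvailabilityType resolve(AvailabilityType fromReport, AvailabilityType lastKnown) {
        AvailabilityType found = fromReport;
        if (found == null) {
            found = (lastKnown == null) ? AvailabilityType.UNKNOWN : lastKnown;
        }
        // do not leak a MISSING availability to the caller
        return (found == AvailabilityType.MISSING) ? AvailabilityType.DOWN : found;
    }

    public static void main(String[] args) {
        System.out.println(defaultFor(ResourceCategory.PLATFORM));
        System.out.println(resolve(null, AvailabilityType.UP));
        System.out.println(resolve(AvailabilityType.MISSING, AvailabilityType.UP));
    }
}
```

Keeping the default separate from the resolution step mirrors the method's structure: the default is what the caller sees when the agent cannot be reached or an exception is swallowed, while the resolved value applies only when a report was obtained.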