Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing lastSuccess/lastFailure in GET /v2/subscriptions in some cases #2974

Closed
fgalan opened this issue Sep 14, 2017 · 4 comments
Closed

Missing lastSuccess/lastFailure in GET /v2/subscriptions in some cases #2974

fgalan opened this issue Sep 14, 2017 · 4 comments
Assignees
Labels
Milestone

Comments

@fgalan
Copy link
Member

fgalan commented Sep 14, 2017

We have tests that in some cases (e.g. 1960_failure_stats_in_db_and_notifications/failure_stats_in_db_and_notifications_stop_start_accumulator.test) the lastSucess/lastFailure field is missing in the GET /v2/subscription operation, although lastNotification and timesSent seem to be ok.

The problem is a bit annoying as it doesn't happend in all systems (e.g. we have a jenkins running regression that is showing the problem... while another jenkins with the same setup is not showing it!). It could be related with cache (in some systems we have observed the problem arises when cache is enabled and doesn't happen with noCache enabled).

Research in progress...

@fgalan fgalan added this to the 1.9.0 milestone Sep 14, 2017
@fgalan
Copy link
Member Author

fgalan commented Sep 14, 2017

In 1960_failure_stats_in_db_and_notifications/failure_stats_in_db_and_notifications_stop_start_accumulator.test it seems it happens only in the case of initial notification (which we know is special in the sense is synchronous to the subscription operation). In this test step 1 creates the entity, step 2 creates the subscription. If we reverse these steps, the problem doesn't happen.

@fgalan fgalan changed the title Missing lastSuccess/lastFailure in GET /v2/subscripiotions in some cases Missing lastSuccess/lastFailure in GET /v2/subscriptions in some cases Sep 14, 2017
@fgalan
Copy link
Member Author

fgalan commented Sep 14, 2017

In one of the jenkins environment I'm testing this I have found that every time a test fails, the following appears in the logs:

time=2017-09-14T13:25:46.279Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=AlarmManager.cpp[405]:badInput | msg=Raising alarm BadInput : intent to update error status of non-existing subscription

That points out to the subCacheItemNotificationErrorStatus(), which is invoked by startSenderThread() once the notification is sent. Let's have a look to this function:

void subCacheItemNotificationErrorStatus(const std::string& tenant, const std::string& subscriptionId, int errors)
{
  if (noCache)
  {
    <here is the logic that updates directly in the database, so it makes sense that the problem doesn't happens with noCache is enabled>
    return;
  }
  ...
  cacheSemTake(__FUNCTION__, "Looking up an item for lastSuccess/Failure");

  CachedSubscription* subP = subCacheItemLookup(tenant.c_str(), subscriptionId.c_str());

  if (subP == NULL)
  {
    cacheSemGive(__FUNCTION__, "Looking up an item for lastSuccess/Failure");
    const char* errorString = "intent to update error status of non-existing subscription";

    alarmMgr.badInput(clientIp, errorString);
    return;
  }
  ...
}

Note that the notification flow is independent of the create subscription flow (in other words, notifications are asyncrhonous to the subscription creation) as they happen in separate threads. The creation flow is the one that inserts the subscription in the cache. But it may happen that the notifications flow executes faster than the creation flow, thus in that case the subCacheItemLookup() doesn't find the entry in which annotate the fail/sucess status and that is the cause of the problem.

Note I say "it may happen". That would explain the "randomness" observed in this problem, happening in some environments but not in others.

@fgalan
Copy link
Member Author

fgalan commented Sep 14, 2017

I will try to refactor so csub is inserted in the cache before spawn the notification thread. If the refactor gets to complex, I'll .DISABLE the affected .test in a quick PR in order not blocking CI.

@fgalan
Copy link
Member Author

fgalan commented Sep 14, 2017

Fixed in PR #2975

@fgalan fgalan self-assigned this Sep 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant