[portmgr] Fixed the orchagent crash due to late arrival of notif #2431
Required for 202205
/azpw run Azure.sonic-swss
/AzurePipelines run Azure.sonic-swss
Azure Pipelines successfully started running 1 pipeline(s).
ProducerStateTable doesn't actually SET/DEL the APPL_DB table we intend to check. It is the ConsumerStateTable (i.e. orchagent in this case) that gets the notification, reads the data from the temporary hash, and applies the SET/DEL to the corresponding table in APPL_DB. So I thought of adding some delay between the CFG_DB checks and the APPL_DB checks, but apparently that didn't help. The VS test still failed.
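To make the point above concrete, here is a toy in-memory model of that hand-off. It is only a sketch: `ToyApplDb`, `producerSet`, and `consumerPop` are hypothetical names invented for illustration and are not the swss-common API. It shows that the table a test inspects stays unchanged until the consumer side has processed the notification.

```cpp
#include <map>
#include <set>
#include <string>

// Toy model only; every type and function name here is hypothetical.
using Fields = std::map<std::string, std::string>;

struct ToyApplDb {
    std::map<std::string, Fields> temp;    // staging hash written by the producer
    std::set<std::string> pending;         // keys with unconsumed notifications
    std::map<std::string, Fields> table;   // the table a test would actually check
};

// Producer side: stages the fields and marks the key as pending.
// It never touches db.table.
void producerSet(ToyApplDb &db, const std::string &key, const Fields &fields) {
    for (const auto &kv : fields) {
        db.temp[key][kv.first] = kv.second;
    }
    db.pending.insert(key);
}

// Consumer side (orchagent's role in this model): drains the pending keys
// and only then copies the staged fields into the real table.
// Returns the number of keys consumed.
int consumerPop(ToyApplDb &db) {
    int consumed = 0;
    for (const auto &key : db.pending) {
        for (const auto &kv : db.temp[key]) {
            db.table[key][kv.first] = kv.second;
        }
        db.temp.erase(key);
        ++consumed;
    }
    db.pending.clear();
    return consumed;
}
```

In this model, a check against `db.table` made right after `producerSet` finds nothing, however long the test sleeps, unless the consumer has actually run.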
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
5d088b6
to
9700711
Compare
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
…nto bulk_write_portmgr
@prsunny, @Junchao-Mellanox please review
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
@prsunny, as this is a degradation introduced by previous PRs, I would appreciate it if you could prioritize the review so we can have it in 202205.
/azp run Azure.sonic-swss
Azure Pipelines successfully started running 1 pipeline(s).
lgtm, @prgeor , can you please review?
@@ -2615,6 +2615,15 @@ bool PortsOrch::addPort(const set<int> &lane_set, uint32_t speed, int an, string
 {
     SWSS_LOG_ENTER();

     if (!speed || lane_set.empty())
Does this impact VS swss initialization? Seems these two are mandatory on create
No, it doesn't. Because both are mandatory on create, we defer the port creation until both of them are available. The config used in VS has speed and lane_set defined, so it shouldn't be a problem.
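As a rough illustration of the deferral discussed in this thread, the check from the hunk above can be read as a readiness test on the two mandatory attributes. This is a minimal sketch under assumptions: `PendingPort` and `canCreatePort` are hypothetical names, and what the real `PortsOrch::addPort` does after the check is not shown on this page.

```cpp
#include <cstdint>
#include <set>

// Hypothetical container for the attributes gathered so far for one port.
struct PendingPort {
    std::set<int> lane_set;   // empty until the lanes have been received
    uint32_t speed = 0;       // 0 until the speed has been received
};

// Mirrors the condition quoted in the diff: creation is attempted only
// when both mandatory attributes are present. Returning false means
// "not yet", so the caller can retry once more fields have arrived.
bool canCreatePort(const PendingPort &port) {
    if (!port.speed || port.lane_set.empty()) {
        return false;
    }
    return true;
}
```

A configuration that always supplies both speed and lanes, as the VS one is said to, passes this check on the first attempt, so its initialization path is unaffected.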
…ic-net#2431)
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Bulk write to APP_DB, i.e. alias, lanes, speed must be read through one notification by orchagent during create_port.
Handled a race condition in portmgrd which tries to immediately apply a mtu/admin_status SET notif after a DEL, causing it to crash.
…) (#2451)
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Bulk write to APP_DB, i.e. alias, lanes, speed must be read through one notification by orchagent during create_port.
Handled a race condition in portmgrd which tries to immediately apply a mtu/admin_status SET notif after a DEL, causing it to crash.
…ication (#2704)
What I did
Handled a race condition in intfmgrd which tries to immediately apply an admin_status SET notif after a DEL, causing it to crash.
Why I did it
Ignores errors on the set admin_status command for a subinterface when the subinterface state is not OK.
How I verified it
Unit tests
Details if related
This PR refers to an older PR that fixed the same issue in portmgr: #2431
What I did
Why I did it
After the recent incremental config update changes, portmgrd writes to the APPL_DB PORT_TABLE (except during startup, when portsyncd writes to the APP_DB).
This can lead to a problem during breakout/breakin and might cause orchagent to crash, because portmgrd currently doesn't write to the APPL_DB at once but in steps.
In some cases (high probability, not always), orchagent decides to act on the notifications immediately, and since speed is not yet set, it tries to create_port with speed 0 and orchagent crashes.
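The difference between writing in steps and writing at once can be sketched with a small model. Everything here is hypothetical (`PortRequest`, `parseNotification`, `writeInSteps`, `writeAtOnce`); it only illustrates why a consumer acting on the first of several partial notifications ends up with speed 0, while a single bulk notification carries all the fields together.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Fields = std::map<std::string, std::string>;

// What a hypothetical consumer extracts from a single notification.
struct PortRequest {
    std::string lanes;
    uint32_t speed = 0;   // stays 0 if the notification carried no "speed"
};

PortRequest parseNotification(const Fields &fields) {
    PortRequest req;
    auto lanes = fields.find("lanes");
    if (lanes != fields.end()) {
        req.lanes = lanes->second;
    }
    auto speed = fields.find("speed");
    if (speed != fields.end()) {
        req.speed = static_cast<uint32_t>(std::stoul(speed->second));
    }
    return req;
}

// Stepwise writer: one notification per field, so a fast consumer may
// see a notification that lacks some of the mandatory fields.
std::vector<Fields> writeInSteps(const Fields &all) {
    std::vector<Fields> notifications;
    for (const auto &kv : all) {
        Fields one;
        one[kv.first] = kv.second;
        notifications.push_back(one);
    }
    return notifications;
}

// Bulk writer: all fields travel in a single notification.
std::vector<Fields> writeAtOnce(const Fields &all) {
    return std::vector<Fields>(1, all);
}
```

With the stepwise writer, a consumer that acts on the first notification it receives gets `speed == 0`; with the bulk writer, the only notification it can act on already contains lanes and speed.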
Also handled a race condition in portmgrd, which happened because it was trying to set mtu/admin_status immediately after a DEL was sent. portmgrd raced to set the fields on a netdev that had by then been deleted, and thus portmgrd crashed.
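One way to tolerate this kind of race, similar in spirit to the intfmgrd change referenced earlier on this page, is to treat a failed set as non-fatal when the port is no longer in an OK state. The sketch below is an assumption about the general approach, not the code in this PR; `ToyKernel`, `applyMtu`, and the `portsStateOk` set are hypothetical.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical model of the kernel side: the netdevs that currently exist.
struct ToyKernel {
    std::set<std::string> netdevs;

    // Returns 0 on success and nonzero when the device is missing,
    // like a failed shell command would.
    int setMtu(const std::string &port, int mtu) {
        (void)mtu;   // the value itself is irrelevant to this model
        return netdevs.count(port) ? 0 : 1;
    }
};

// Returns true if the MTU was applied and false if the failure was ignored
// because the port is not in an OK state (for example, it is being deleted).
// Throws only when the port is expected to exist but the command failed.
bool applyMtu(ToyKernel &kernel, const std::set<std::string> &portsStateOk,
              const std::string &port, int mtu) {
    if (kernel.setMtu(port, mtu) == 0) {
        return true;
    }
    if (!portsStateOk.count(port)) {
        return false;   // late SET after a DEL: skip instead of crashing
    }
    throw std::runtime_error("failed to set mtu on " + port);
}
```

The design choice here is to keep genuine failures loud: an error on a port that is still marked OK is still raised, and only the post-delete case is swallowed.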
How I verified it
Before the fix:
root@r-panther-23:/home/admin# config interface breakout Ethernet72 2x50G[25G,10G] -y -f
After the fix:
Details if related