Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling exceptions in CMIS SM to prevent xcvrd crash #483

Merged
merged 2 commits into from
May 1, 2024

Conversation

mihirpat1
Copy link
Contributor

@mihirpat1 mihirpat1 commented Apr 30, 2024

Description

Currently, the CmisManagerTask thread crashes upon encountering an exception which causes the entire XCVRD process to restart. The CmisManagerTask thread crash scenarios are more often seen during instances of failure to read EEPROM of the transceivers.

Crash snippet

Apr  1 13:39:29.652330 STG01-0101-0200-02T2-lc01 ERR pmon#: Exception occured at CmisManagerTask thread due to TypeError("'NoneType' object is not subscriptable")
Apr  1 13:39:29.654469 STG01-0101-0200-02T2-lc01 ERR pmon#: Traceback (most recent call last):
Apr  1 13:39:29.654498 STG01-0101-0200-02T2-lc01 ERR pmon#:   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1693, in run
Apr  1 13:39:29.654498 STG01-0101-0200-02T2-lc01 ERR pmon#:     self.task_worker()
Apr  1 13:39:29.654498 STG01-0101-0200-02T2-lc01 ERR pmon#:   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1655, in task_worker
Apr  1 13:39:29.654518 STG01-0101-0200-02T2-lc01 ERR pmon#:     if not self.check_datapath_state(api, host_lanes_mask, ['DataPathInitialized']):
Apr  1 13:39:29.654531 STG01-0101-0200-02T2-lc01 ERR pmon#:   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1263, in check_datapath_state
Apr  1 13:39:29.654531 STG01-0101-0200-02T2-lc01 ERR pmon#:     if dpstate[key] not in states:
Apr  1 13:39:29.654569 STG01-0101-0200-02T2-lc01 ERR pmon#: TypeError: 'NoneType' object is not subscriptable
Apr  1 13:39:29.654752 STG01-0101-0200-02T2-lc01 ERR pmon#: Xcvrd: exception found at child thread CmisManagerTask due to TypeError("'NoneType' object is not subscriptable")
Apr  1 13:39:29.654752 STG01-0101-0200-02T2-lc01 ERR pmon#: Exiting main loop as child thread raised exception!

Motivation and Context

In order to avoid restarting of XCVRD triggered due to CmisManagerTask thread crash, this PR will ensure to move the CMIS SM to CMIS_STATE_FAILED state for the corresponding ports which have generated an exception. This will also help in ensuring that if module EEPROM access fails for 1 or more ports, the corresponding port will transition to CMIS_STATE_FAILED instead.

How Has This Been Tested?

An exception was manually generated while CMIS SM was in CMIS_STATE_INSERTED and it was ensured that XCVRD did not crash.

Apr 30 08:58:01.283582 sonic NOTICE pmon#xcvrd[16173]: CMIS: Ethernet0: 400G, lanemask=0xff, state=INSERTED, appl 1 host_lane_count 8 retries=0
Apr 30 08:58:01.283582 sonic ERR pmon#xcvrd[16173]: CMIS: Ethernet0: internal errors due to 'PATELMI: Simulated KeyError!!!'
Apr 30 08:58:01.285296 sonic ERR pmon#xcvrd[16173]: Traceback (most recent call last):
Apr 30 08:58:01.285296 sonic ERR pmon#xcvrd[16173]:   File "/usr/local/lib/python3.11/dist-packages/xcvrd/xcvrd.py", line 1404, in task_worker
Apr 30 08:58:01.285296 sonic ERR pmon#xcvrd[16173]:     raise KeyError("PATELMI: Simulated KeyError!!!")
Apr 30 08:58:01.285296 sonic ERR pmon#xcvrd[16173]: KeyError: 'PATELMI: Simulated KeyError!!!'

root@sonic:/home/admin# redis-cli -n 6 hget "TRANSCEIVER_STATUS|Ethernet0" cmis_state
"FAILED"
root@sonic:/home/admin# 

Also, CMIS initialization was successful on the same port after the exception was not seen any more.

Additional Information (Optional)

MSFT ADO - 27441561

Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@mihirpat1 mihirpat1 marked this pull request as ready for review April 30, 2024 09:07
@mihirpat1 mihirpat1 requested a review from prgeor April 30, 2024 09:08
@rlhui rlhui added the P0 label Apr 30, 2024
@@ -1376,6 +1381,11 @@ def task_worker(self):
# Skip if these essential routines are not available
self.update_port_transceiver_status_table_sw_cmis_state(lport, CMIS_STATE_READY)
continue
except Exception as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mihirpat1 why we made distinction between AttributeError vs others?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor The AttributeError relevant handling here seems to be a day 1 code. However, I am not sure on the original reason for this so decided to leave the existing behavior as is since with AttributeError, we are moving CMIS SM to CMIS_STATE_READY.
Hence, I am now handling all other exceptions in a different block and moving the CMIS SM to CMIS_STATE_FAILED state.

@@ -1376,6 +1381,11 @@ def task_worker(self):
# Skip if these essential routines are not available
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be good to understand when the Attribute error exception happens. Ideally the below line "except Exception as e:" is a superset and should cover all exceptions.

Also what is the exact error/exception when the eeprom data is corrupted/bad -- sue to bad optics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@judyjoseph Yes - the below line "except Exception as e:" covers all exceptions apart from AttributeError. However, for AttributeError, the CMIS SM is currently being transitioned to CMIS_STATE_READY and not CMIS_STATE_FAILED.

In case of bad optics, I have seen KeyError and TypeError so far.

@gechiang
Copy link
Contributor

@mihirpat1 , This PR has conflict for 202205. Can you please raise a separate PR under 202205 branch and link the PR# here.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

8 participants