Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. #10500

Merged
merged 19 commits into from
May 24, 2022

Conversation

abdosi
Copy link
Contributor

@abdosi abdosi commented Apr 8, 2022

What/Why I did:

Issue1: By setting up of ipvlan interface in interface-config.sh we are not tolerant to failures. Reason being interface-config.service is one-shot and do not have restart capability. 

Scenario: For example if let's say database service goes in fail state  then interface-services also gets failed because of dependency check but later database service gets restart but interface service will remain in stuck state and the ipvlan interface nevers get created.

Solution: Moved all the logic in database service from interface-config service which looks more align logically also since the namespace is created here and all the network setting (sysctl) are happening here.With this if database starts we recreate the interface.

Issue 2: Use of IPVLAN vs MACVLAN

Currently we are using ipvlan mode.  However above failure scenario is not handle correctly by ipvlan mode. Once the ipvlan interface is created and ip address assign to it and if we restart interface-config or database (new PR) service Linux Kernel gives error "Error: Address already assigned to an ipvlan device."  based on this:https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978Reason being if we do not do cleanup of ip address assignment (need to be unique for IPVLAN)  it remains in Kernel Database and never goes to free pool even though namespace is deleted. 

Solution: Considering this hard dependency of unique ip macvlan mode is better for us and since everything is managed by Linux Kernel and no dependency for on user configured IP address.

Issue3: Namespace database Service do not check reachability to Supervisor Redis Chassis   Server.

Currently there is no explicit check as we never do Redis PING from namespace to Supervisor Redis Chassis  Server. With this check it's possible we will start database and all other docker even though there is no connectivity and will hit the error/failure late in cycle

Solution: Added explicit PING from namespace that will check this reachability.

Issue 4:flushdb give exception when trying to accces Chassis Server DB over Unix Sokcet.

Solution: Handle gracefully via try..except and log the message.

Verification:
Did manual verification by killing database service and see post restart all redis connectivity and database service are healthy.

abdosi added 17 commits April 3, 2021 01:41
df46ed418e661a9bccdb2639d8873def356f8ba0 (HEAD -> master, origin/master, origin/HEAD) Fix the LLDP_LOC_CHASSIS not getting populated if no remote neighbors are present (sonic-net#39)
e487532e11cc0e97cfce573b6b997fdd0beeb660 [CI] Set up CI&PR with Azure Pipelines (sonic-net#38)
3c9f488490a1dbded20dbf2d8a88a5ab4dbda8df Replace swsssdk's SonicV2Connector with swsscommon's implementation (sonic-net#35)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
78f167e4728f939712b3f3ea550949e2ea675fec With the changes in PR:sonic-net#5289 access to redis unix socket is given to the redis group members. Many of sonic-util commands (especially in multi-asic) case use redis unix socket to connect to DB and thus those comamnd fails without providing sudo. This PR is continuation  of PR: sonic-net#7002 where we default to use TCP for Redis if user is not root

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
mid-plane connectivity for namespace.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
@abdosi abdosi requested a review from lguohan as a code owner April 8, 2022 00:00
@abdosi abdosi requested a review from judyjoseph April 8, 2022 00:01
@abdosi abdosi changed the title Swssdk [chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. Apr 8, 2022
@abdosi
Copy link
Contributor Author

abdosi commented Apr 8, 2022

cc @anamehra

@abdosi abdosi added Chassis 🤖 Modular chassis support labels Apr 8, 2022
@abdosi
Copy link
Contributor Author

abdosi commented Apr 8, 2022

@ysmanman @minionatwork can you please review this too,

1 similar comment
@abdosi
Copy link
Contributor Author

abdosi commented Apr 8, 2022

@ysmanman @minionatwork can you please review this too,

@@ -152,7 +175,8 @@ function postStartAction()
# then we catch python exception of file not valid
# that comes to syslog which is unwanted so wait till database
# config is ready and then ping
until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) ]]; do
until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) &&
($(docker exec -i database$DEV sonic-db-cli PING | grep -c PONG) -gt 0) ]]; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't line 178 also doing SONIC_DB_CLI PING and line 179 also doing sonic-db-cli PING using docker exec? Whats the difference here?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are in namespace context SONIC_DB_CLI maps as sonic-db-cli -n asicx



if [[ -n "$midplane_subnet" ]]; then
# Use /16 for loopback interface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont see code to cleanup all these linux networking state when database docker restart?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ip netns delete should remove all the resources.

minionatwork
minionatwork previously approved these changes Apr 12, 2022
judyjoseph
judyjoseph previously approved these changes Apr 29, 2022
Copy link
Contributor

@judyjoseph judyjoseph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arlakshm
Copy link
Contributor

arlakshm commented May 7, 2022

add @nikitha6179 for viz

@abdosi
Copy link
Contributor Author

abdosi commented May 16, 2022

@minionatwork @judyjoseph this pr enforces not to have midplane-subnet in chassisdb.conf for supervisor . we do not want to create macvlan interface on supervisor as they are reachable via docker bridge.

Updated Nokia chassisdb.conf file with same.

@abdosi
Copy link
Contributor Author

abdosi commented May 16, 2022

@ysmanman please review also.

@minionatwork
Copy link
Contributor

your code is still using midplane_subnet bash variable. Where is it set now?.

@abdosi
Copy link
Contributor Author

abdosi commented May 17, 2022

your code is still using midplane_subnet bash variable. Where is it set now?.

it should be present in line-card chassisdb.conf. Not Needed on Supervisor chassisdb.conf.

@abdosi
Copy link
Contributor Author

abdosi commented May 17, 2022

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@abdosi
Copy link
Contributor Author

abdosi commented May 19, 2022

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@mlorrillere mlorrillere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good to me.

@abdosi abdosi merged commit 0285bfe into sonic-net:master May 24, 2022
@abdosi abdosi deleted the swssdk branch May 24, 2022 23:54
liushilongbuaa pushed a commit to liushilongbuaa/sonic-buildimage that referenced this pull request Jun 20, 2022
…anch

Related work items: #52, #71, #73, #75, #77, sonic-net#1306, sonic-net#1588, sonic-net#1991, sonic-net#2031, sonic-net#2040, sonic-net#2053, sonic-net#2066, sonic-net#2069, sonic-net#2087, sonic-net#2107, sonic-net#2110, sonic-net#2112, sonic-net#2113, sonic-net#2117, sonic-net#2124, sonic-net#2125, sonic-net#2126, sonic-net#2128, sonic-net#2130, sonic-net#2131, sonic-net#2132, sonic-net#2133, sonic-net#2134, sonic-net#2135, sonic-net#2136, sonic-net#2137, sonic-net#2138, sonic-net#2139, sonic-net#2140, sonic-net#2143, sonic-net#2158, sonic-net#2161, sonic-net#2233, sonic-net#2243, sonic-net#2250, sonic-net#2254, sonic-net#2260, sonic-net#2261, sonic-net#2267, sonic-net#2278, sonic-net#2282, sonic-net#2285, sonic-net#2288, sonic-net#2289, sonic-net#2292, sonic-net#2294, sonic-net#8887, sonic-net#9279, sonic-net#9390, sonic-net#9511, sonic-net#9700, sonic-net#10025, sonic-net#10322, sonic-net#10479, sonic-net#10484, sonic-net#10493, sonic-net#10500, sonic-net#10580, sonic-net#10595, sonic-net#10628, sonic-net#10634, sonic-net#10635, sonic-net#10644, sonic-net#10670, sonic-net#10691, sonic-net#10716, sonic-net#10731, sonic-net#10750, sonic-net#10751, sonic-net#10752, sonic-net#10761, sonic-net#10769, sonic-net#10775, sonic-net#10776, sonic-net#10779, sonic-net#10786, sonic-net#10792, sonic-net#10793, sonic-net#10800, sonic-net#10806, sonic-net#10826, sonic-net#10839, sonic-net#10840, sonic-net#10842, sonic-net#10844, sonic-net#10847, sonic-net#10849, sonic-net#10852, sonic-net#10865, sonic-net#10872, sonic-net#10877, sonic-net#10886, sonic-net#10889, sonic-net#10903, sonic-net#10904, sonic-net#10905, sonic-net#10913, sonic-net#10914, sonic-net#10916, sonic-net#10919, sonic-net#10925, sonic-net#10926, sonic-net#10929, sonic-net#10933, sonic-net#10934, sonic-net#10937, sonic-net#10941, sonic-net#10947, sonic-net#10952, sonic-net#10953, sonic-net#10957, sonic-net#10959, sonic-net#10971, sonic-net#10972, sonic-net#10980
@@ -180,7 +203,8 @@ function postStartAction()
# then we catch python exception of file not valid
# that comes to syslog which is unwanted so wait till database
# config is ready and then ping
until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) ]]; do
until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) &&
($(docker exec -i database$DEV sonic-db-cli PING | grep -c PONG) -gt 0) ]]; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you have an issue in the supervisor's database containers here, since midplane_subnet is not defined the midplane interface doesn't exist in the container, so trying to PING the databases from within the container will fail?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Chassis 🤖 Modular chassis support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants