CA-136041: Multipathd forgets faulty paths after an SR operation #274

Closed
wants to merge 1 commit into from

Conversation

geosharath

The pbd-plug and pbd-unplug operations trigger 'multipathd -k "reconfigure"'.
During reconfigure, all(!) maps are deleted and re-created, and multipathd
re-creates them without any paths that are temporarily down. Those paths are
restored only if they time out beyond their dev_loss_tmo and are re-added
later, because that triggers multipathd's handling of new devices. Hence, for
any SR operation performed before dev_loss_tmo expires, the faulty paths are
lost and never re-added. The fix for this problem is twofold:

a) multipathd reconfigure has been modified to include even faulty paths
while recreating maps

b) devices have to be deleted from the system when SR operations are
performed before dev_loss_tmo expires

Signed-off-by: Sharath Babu <sharath.babu@citrix.com>

# for any SR operation done before dev_loss_tmo, after which the device is
# automatically dropped by the kernel.

delete_nodes=True
Contributor

We cannot force this behaviour and ignore the value that is passed to the function.
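
To illustrate the point, here is a minimal sketch of a helper that honours the caller's `delete_nodes` argument instead of overwriting it in the body. The names `detach_sketch`, `device_paths` and `remove` are hypothetical and are not taken from this patch or from the sm code base; the actual removal is left to an injected callable so that the sketch makes no assumption about how nodes are deleted.

```python
def detach_sketch(device_paths, delete_nodes=False, remove=None):
    """Hypothetical detach helper that respects the caller's delete_nodes.

    device_paths: iterable of device node paths (assumed input).
    delete_nodes: caller's choice; it is never reassigned in this function.
    remove:       optional callable invoked once per path to delete the node.
    Returns the list of paths that were selected for removal.
    """
    removed = []
    if delete_nodes:
        for path in device_paths:
            if remove is not None:
                remove(path)  # deletion mechanism is supplied by the caller
            removed.append(path)
    return removed
```

With this shape, a caller that needs the nodes removed passes `delete_nodes=True` explicitly, and every other caller keeps the existing behaviour.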

Contributor

Even though this mitigates the problem with multipath reconfigure, it forces us to rescan the bus
every time we need those paths.
This is an expensive operation that slows down our control path significantly.

@germanop
Contributor

I added a sleep after deleting the nodes to give the system time to settle.
In fact, xapi issues sr_attach immediately after sr-create and, in some cases, it finds devices
that are in the middle of a remove operation but still visible.
Because of that, we conclude that we do not need to rescan the bus, and a few lines later we find ourselves without valid devices for multipath.
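
A fixed sleep is one way to let the system settle. An alternative, sketched here only as an illustration, is to poll until the device nodes have actually disappeared or a timeout expires. The function name `wait_for_removal` and its parameters are hypothetical, the sketch assumes Python 3 (`time.monotonic`), and it is not what this patch does.

```python
import os
import time


def wait_for_removal(paths, timeout=10.0, interval=0.2, exists=os.path.exists):
    """Poll until none of `paths` exists, or until `timeout` seconds elapse.

    `exists` is injectable so the check can be replaced in tests.
    Returns the paths still present at the end (an empty list on success).
    """
    deadline = time.monotonic() + timeout
    remaining = [p for p in paths if exists(p)]
    while remaining and time.monotonic() < deadline:
        time.sleep(interval)
        # Re-check only the paths that were still visible last time.
        remaining = [p for p in remaining if exists(p)]
    return remaining
```

A caller could treat a non-empty return value as "removal still in progress" and decide whether to retry or fail, rather than proceeding with devices that are about to vanish.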

@germanop
Contributor

@geosharath, I am going to close this PR for the following reasons:

  • Device removal may not complete as quickly as later calls expect, as explained above
  • Removing the devices explicitly goes against the vendor-specific settings specified
    in the multipath configuration (on one hand we allow vendors to specify their preferred settings and on the other hand we force device removal)
  • Rescanning the bus is too expensive to be done without a valid reason
  • The counterpart of this patch in multipath is dangerous: we no longer let "reconfigure" discard
    the paths that are down, but if those paths are present in the system while the mapping is done, the mapping
    will fail. This corner case is less likely than the one we are trying to address, but still likely.
  • We do have an alternative approach

germanop closed this on Mar 22, 2016