multipath support/boot from san for ReaR -- i.e. recovery technique #572
Just tested this twice. Still having issues restoring /boot on /mnt/local/boot. Unsure what the issue is. Will keep trying a few different ways to see what the deal is. But I believe this has to do with grub running on multipath versus being on a standard block device. The error is:
Tailing the log and looking at it shows the following:
Which is odd, because of the following output...
So it's mounted... Unsure why it can't find the actual partition, as it is sitting in /mnt/local/. Will boot into a Linux rescue image and re-load grub from scratch as stated in the document I wrote above. After the grub rebuild of the boot device, reboot commencing...

####NOTES#### The above issue can be worked around. When mounting boot to /mnt/local/boot, simply change the permissions of the boot folder to 755 to allow your backup agent to restore to this directory. Then, once the contents are restored, you may still see the above error, but you can switch the permissions back to 555 and continue with the selinux and autorelabel steps.
So just tried another recovery. This time, I deleted (
- Mount up the recovery ISO. Boot to the ISO (1st option, manual mode); do not do auto, as this process is quite manual!
- Mount these in order in /mnt/local.
- Select the filesystems you wish to recover (1 2 3 4 21 22 23, or whatever you need). Select 1 to continue for any continue/abort messages.
- Let TSM take over and restore everything (including boot). You will get an error that no boot partitions are found (even though you just recovered /boot). Once done, chmod 555 /mnt/local/boot.
- Change enforcing to permissive.
- Reboot. The system will come up, start the autorelabel process, and then reboot again. Once the system comes up, log in and then reboot again. The system will come up in a good state, and you can then put selinux back into enforcement mode.

Congrats, you just successfully restored a system that is running multipath with boot from SAN.
Just to give an update: I'm trying a restore where ReaR needs to be able to rebuild an entire disk (i.e. your disk has failed, or you have migrated to another system and need to rebuild from scratch). Currently the output of ReaR tells me that the "disk layout has been created," but this is a false echo message when running with multipath (nothing is actually created). So I'm currently attempting to script this as a pre-run script, before the ReaR software contacts TSM, so that my filesystems are created and mounted via lvols into /mnt/local; even though there is no content, they can still be mounted and a TSM restore can take place. The grub re-install is also not working: if you whack your entire VG00 drive as well as the MBR for boot, regardless of whether you restored /boot, you will still need to do the following: pop in the RHEL7 installation media, select boot/install, and hit escape twice to get to the boot: prompt of the DVD/ISO.
Ideally this needs to be scripted, and since the drives are failing, a full re-creation of partitions and lvols will need to be scripted as well. Probably something to the effect of:

EDIT: I added some more logic and some corrections to my test machine to get these mounted and created if for some reason your entire disk is gone. Obviously, as you can see, my boot drive and volume group are currently hard-coded for testing purposes, but they will become variables that pull from the system itself as time goes on; hence my cat output.recover | awk lines, which aren't finished until I can spool up the test lab again that is connected to our SAN infrastructure. But this will at least create a disk from scratch; I would then like to see it find the correct disk from the original mkrescue image. From a testing perspective, it will rebuild vg00 (vg01 in my testing), complete with boot partition and root partition, create the lvols, and then mount them to /mnt/local; from that point forward, rear recover takes over and restores the mounted filesystems from backup.
But I am still in the infancy of this particular task. Currently I can restore if someone simply does an rm -Rf on some directory or drive. But a full-on bare-metal restore is quite difficult; more information as I come across it and can do more testing in the lab.
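As a rough illustration of that pre-run idea (this is my sketch, not the author's unfinished script): assuming the saved layout file uses plain `df -Ph` columns, the VG is vg00, lvol names contain no extra dashes, and the filesystems are xfs, an awk pass can emit the lvcreate/mkfs/mount commands for review before piping them to sh:

```shell
# Sketch of a pre-run rebuild-command generator. All names and the
# df-style column layout are assumptions; it only PRINTS commands so
# they can be reviewed before being piped to sh.
emit_rebuild_cmds() {
    awk -v vg="vg00" '
        $1 ~ ("^/dev/mapper/" vg) {
            lv = $1; sub(".*-", "", lv)                # vg00-lv_root -> lv_root
            printf "lvcreate -L %s -n %s %s\n", $2, lv, vg
            printf "mkfs -t xfs /dev/%s/%s\n", vg, lv  # fstype assumed
            mp = ($6 == "/") ? "" : $6                 # root lands on /mnt/local itself
            printf "mount /dev/%s/%s /mnt/local%s\n", vg, lv, mp
        }
    ' "$@"
}
```

On a real rescue system this would read whatever file ReaR saved the df output to, e.g. `emit_rebuild_cmds df.txt | sh` once the output looks right.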
Wow - lots of information to absorb! Let's start at the beginning: if you define For my understanding - an EMC gatekeeper is something similar to a Hitachi (or HP) command device; however, a command device has Another important item (not mentioned) is the settings of the FC card to make it bootable (at least in the case of cloning to other HW). You could come to the conclusion that rear is broken, but it could just be a missing (or wrong) setting on the FC card (inside its BIOS). I have the impression (but could be completely wrong here) that you make it too complex to restore. But, it would be nice if I could test it out myself (you can always hire me for a short period :). You could have a look at
Awesome! Thank you so much for getting back to me. I understand that you guys are super busy. I have no issues taking a look at those shell scripts and providing feedback. Also good information about the command device being read/write as well; I was under the impression that most gatekeeper-type devices would be read-only. As far as the REQUIRED_PROGS in the local.conf: BOOT_OVER_SAN is most definitely set to Y, so I'm unsure as to why I need multipathd and multipath in there, but you will need to put them in there for sure (I have tested this a few times; it's a must), as well as the grub2 binaries (i.e. in case you need to rebuild your MBR/boot area). Currently, GDHA, we believe that the SCSI ID of our boot LUN was set improperly, so we are working with our SAN team to get this ironed out. Once we have corrected the SAN issue, we are going to kickstart our lab server again, and then I will be allotted more time to focus on getting this to work. Ideally I am looking for you to point me in some direction where we need more information, or maybe to get a rough construct across the finish line that is at least somewhat workable. I will take a look at the Fibre Channel info as well as the multipath_tools.sh and transport_info.sh files once I can get the lab server back up and running. Thank you again for responding to this post (where somehow I managed to mess up the formatting). :-) Thanks -- Mmaster
Update time! Okay, GDHA, I have this thing almost across the finish line from a restore perspective. I edited
All I did was include parted at the end. Here is the site.conf file:
then, in I have this running at the very top (above the beginning of your if statement):
So my test is to go inside /etc/ or something of this nature and rm -Rf *. Then, using my newly created iso image (dsmc i; rear mkrescue), I can do the automatic recovery of the filesystem I have whacked (i.e. rm -Rf *). The issue: at the very end of the restore process, grub complains that it could not find a filesystem that has a boot partition, which is quite odd, because my script mounts the boot partition to /mnt/local/boot. So if this was a bare-metal restore, you will have to use a RHEL7 rescue image to do the whole
and then:
, and then finally reboot. My script is integrated as far as pulling boot and vg00 information from the df.txt file. The problem is this: even though I could write a script to put the .autorelabel file in / and also to change the /etc/selinux/config parameters to permissive mode, and I would like to place this script in the post_recovery script, my recovery errors out on the grub portion, so the rear script set never actually gets that far (putting that script in the post would be frivolous because it never makes it there). So I'm looking for options from you and your team on how I could get the grub portion to work. However, what is nice is that even though it errors on the grub portion, saying it could not find a boot partition, the restore is actually 100% complete, and you can just reboot back to a fully restored system (HOORAY!!!!!). So at this point in time we are literally 95% finished with making this work; it's the last portion, the grub re-install, that is the issue (i.e. I fully tested this in automated mode this morning after blowing away /var and /etc). Hope to hear back from you guys soon! -- MultipathMaster
Just wanted to provide an update: sometimes DHCP isn't loaded for the network adapter. If this is the case, the automatic recovery will fail, with TSM (our recovery agent) not being able to get an available IP address. Just hit option 2 when it fails to get to a shell, run 'dhclient adapter_name', and then re-run "rear recover"; it should be good to go. Tested this again this morning by removing /etc, /home, and /var, thus putting the system into a degraded state. Using the above script at the beginning of the pre_recovery script mounts everything that is required and restores. Simply reboot (even though the error with grub is still present) and it will come back up fully restored. Just another update. Thanks! -- MultipathMaster
Update: I've been scouring for days trying to figure out why the grub re-install isn't working. Finally found something super important on the Red Hat site: "Note: The grub-install command does not work for multipath devices." Well shoot, that kinda shoots a giant hole in everything. Will continue testing, maybe coming up with a different way to get grub re-installed; the initramfs work will be simple enough (dracut -f -v), but getting grub to install is certainly a challenge. There might be a few different ways to do this. The first option I'm going to explore is getting the multipath devices to fail down to only one path (thus giving us a /dev/sdX device); maybe grub-install can go ahead and install there, but the problem then is that it won't boot up properly if /etc/fstab is showing that /dev/sdX device. Will update as I have more information. Thanks -- MultipathMaster
Just to provide an update: I have opened a case with Red Hat regarding not having standard grub included with RHEL7. This whole grub thing could be sorted out if the standard packaged grub came with the system out of the box (you can script some answers and pipe them into grub in order to re-write the MBR and boot partition). I will update you guys whenever I have heard back from Red Hat. Thanks -- MultipathMaster
Still waiting on Red Hat. They have apparently set up a similar configuration in their lab and are conducting testing.
I have gotten this to work. I'm currently still in the lab testing with a senior resource from Red Hat; as soon as this is completed I will update you guys with my findings. It's starting to look like we might not need grub2-install to be run at all for the final steps of restoring. We might actually use dd to capture the MBR and GRUB partition, save it to a small image file, restore that with the TSM run, and then, instead of doing a grub2-install at the end, run dd to restore from that image back to the root drive in question. Just an update, but I should have more notes by the end of the day, with a completed script without all of the dirty stuff. Thanks so much! -- MultipathMaster
Okay, gents, here are my findings while working with Red Hat as well as a few other resources.

To re-create the MBR for boot, working with Red Hat yielded that using dd is the best option. So we run the dd string beforehand and output the img file to /usr/share/rear/custom prior to the mkrescue. I know this has stripped a majority of the scripts from this process; however, because of the delicacy of using it with boot from SAN, as well as multipath, it was required to get this working and functioning.

As ReaR is going to be included with RHEL 7.2 (talking to the Red Hat engineers), they helped me figure out exactly what was wrong with grub2-install and have created a KCS article for this very issue. Also, each time you patch and put a new kernel on the system, the dd as well as the mkrescue should be re-run to re-create an image that points to the correct kernel the system is running.

Feel free to leave me any feedback, but in all honesty this should get anyone that is using multipath/LVM/boot from SAN, as well as TSM, across the finish line. My hope is that you guys can use this information going forward to make this a bit more streamlined.
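The dd approach can be sketched as follows; this is my illustration, not the exact string worked out with Red Hat. The device path, image name, and the 1 MiB size are assumptions, and the demo runs against scratch files so nothing real is touched:

```shell
# Sketch of the dd-based MBR/GRUB save-and-restore. On a real system the
# source would be the multipath boot LUN (e.g. /dev/mapper/mpatha) and
# the image would live under /usr/share/rear/custom. The 1 MiB count is
# an assumption covering the MBR plus the embedded GRUB core image on an
# msdos-labelled disk.
BOOT_DISK=$(mktemp)   # stand-in for /dev/mapper/mpatha
dd if=/dev/urandom of="$BOOT_DISK" bs=1M count=2 2>/dev/null

IMG=$(mktemp)         # stand-in for /usr/share/rear/custom/mbr-grub.img
# 1) Before 'rear mkrescue': capture the boot area.
dd if="$BOOT_DISK" of="$IMG" bs=1M count=1 2>/dev/null
# 2) After the TSM restore, instead of grub2-install, write it back.
dd if="$IMG" of="$BOOT_DISK" bs=1M count=1 conv=notrunc 2>/dev/null
```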
Thanks, I look forward to hearing from you! MultipathMaster

PS -- Seems the formatting in this text box got messed up again; I don't know what I'm doing to make that happen (shrugs).
@multipathmaster: Can you please contact me privately via webmaster AT schapiro DOT org? I would like to discuss some details with you...
Just shot you an email.
@multipathmaster @schlomo could you share some info about this issue if it is still hot? If we integrate this in rear it should be with prep scripts and finalize scripts, right? |
Sorry to re-open this, gents. I finally have some time dedicated back to this project and will update this issue accordingly. We have streamlined the module method that Schlomo showed us, and have slowly been getting this to install properly with puppet (i.e. ensure a version, and then install our custom module that utilizes the mrecover module method). I will update this shortly with additional information gathered by also utilizing livecd-creator to create a bootable image that we are going to use for troubleshooting.
No feedback in more than one year.
Hey gents, I know y'all are really busy; that being said, I have some information for you guys (as well as a bug, but I will get to the bug at the end of this rough-draft instruction manual for restoring a system that is boot from SAN as well as using multipath).
Late on Friday I was able to successfully restore a system that uses boot from SAN, multipath, and LVM2 for RHEL 7.X.
I have written a script that does some translations from multipath devices to the specific block devices pertaining to them. But to start off, I will start this thread from scratch.

The first thing you will want to do, before creating a new recovery image via 'rear mkrescue', is to run this script, which I have provided...

NOTE --> You will have to edit this script for your output of multipath -ll and exclude the model from field 5. You can elect to keep the model name within the contents; however, being an admin/engineer, you should know what your environment is using and how it is utilizing these technologies.

The above script will put your block devices on a line with the VG associated with that disk, as well as the read/write permission (i.e. assuming you may have gatekeepers presented to the blade that are read-only).
Sample output from the above script:
Now what does this do for ReaR? It at least sets out which volume groups are associated with which multipath disks, and which block devices are associated with each multipath disk. It will also eliminate any read-only disks that may show up due to gatekeepers being presented, or any number of read-only disks that may be presented to the system over fibre channel.
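A minimal sketch of what such a translation script could look like (my reconstruction, since the original script isn't shown): it parses captured `multipath -ll` output and prints one line per path with the mpath name, member block device, and write-permission flag. The field positions assume stock RHEL7 `multipath -ll` formatting.

```shell
# Pair each mpath device with its member block devices and rw/ro flag.
map_mpaths() {
    awk '
        # device header lines: "mpatha (360000...) dm-0 EMC,SYMMETRIX"
        /^[a-z0-9]+ \(/ { dev = $1 }
        # size lines carry the write-permission flag as the last field: "... wp=rw"
        /^size=/        { wp = $NF; sub(/^wp=/, "", wp) }
        # path lines name the member block devices: "| |- 1:0:0:0 sda 8:0 active ..."
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^sd[a-z]+$/)
                    print dev, $i, wp
        }
    ' "$@"
}
```

On a live system you would pipe the real command in (`multipath -ll | map_mpaths`) and join the result against `pvs -o pv_name,vg_name` to add the volume-group column, the ro lines being your gatekeepers to ignore.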
Once you have taken note of which volume groups are associated with which drives, go into disklayout.conf (/var/lib/rear/layout/.) and uncomment the multipath entries associated with the set of block devices that your volume group is a member of. In the case above, if we were only to restore VG00 (i.e. the root volume group), we would find the entries for /dev/sda associated with /dev/mapper/mpatha (yes, it is actually mpatha2, but we aren't talking about partitions, YET...) within the disklayout.conf file.
As shown here:
Also, head into /etc/rear and look at the `site.conf` file. Ensure that `multipathd` and `multipath` are in the `REQUIRED_PROGS=( ... )` array as well. An example would be the following:

Also set `BOOT_OVER_SAN=Y` in this file, as well as `AUTOEXCLUDE_MULTIPATH=n` (or leave it blank so it is unset).

Once these steps are done, go ahead and make the rescue image with `rear mkrescue` (this takes some time). Once complete, the rescue image will be in `/var/lib/rear/output/rear-<servername>.iso`.
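A minimal sketch of such a site.conf, assuming ReaR's bash-array convention for REQUIRED_PROGS (parted included, since an earlier comment notes it was needed); the exact stock program list will differ per release:

```shell
# Hypothetical /etc/rear/site.conf fragment based on the settings this
# thread calls out -- adjust to your environment.
BOOT_OVER_SAN=Y
AUTOEXCLUDE_MULTIPATH=n
# Append rather than replace, so the defaults from default.conf survive:
REQUIRED_PROGS+=( multipath multipathd parted )
```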
FTP this image to your laptop, or whatever device you will be using to go into the console.
In this particular case, this server had a DELL iDRAC connected, from which we will use the CD/DVD/ISO virtual media manager to mount the image and force the system to boot to the rescue image.
Once at the boot menu, DO NOT SELECT AUTOMATIC, select MANUAL recovery mode (there are some steps that must be completed first and foremost).
Once booted into the rescue image after selecting the manual recovery method, the first and THE MOST IMPORTANT step to do in this environment is to turn on multipath (assuming you didn't have to dhclient your interface to obtain an IP address from DHCP):

`multipathd` (this will start the multipath daemon)

`multipath -ll` (this will print and confirm whether or not multipath is working and can translate the block devices into selective paths via the WWID on each LUN from the SAN)

If you notice above, there is a disk presented that is only 5.6M in size whose `wp` flag is set to ro (read only), which means this LUN is in actuality a gatekeeper from the EMC SAN. If you followed along correctly above, you would know to ignore this particular device and shouldn't have uncommented its entry in the `disklayout.conf` file in the previous step (prior to mkrescue).

Now, here is where the "bug" comes into play; however, I have figured out a workaround for this particular bug (the bug that says NO FILE SYSTEMS MOUNTED TO /mnt/local).
This isn't due to a bug within multipath or anything else, except a bug within the rear code and how it handles LVM2 when a VG is associated with the mpath disk. I want it to be understood that this is in no way anyone's fault. The ideas and concepts around boot from SAN and multipath can differ greatly from environment to environment. Some choose to use LVM/LVM2 to control the growth of data; other environments choose not to use LVM and go with straight block devices (mpath numbers) instead. Boot from SAN can also vary greatly depending on the environment in question, so my goal with this document was to show the mechanics behind several different methodologies and to get around certain bugs that may come up, because there isn't a huge group of folks going about this method the same way (i.e. there is no RFC or white paper on the proper way to do boot from SAN with multipathing enabled while using LVM2 for data size control).
Now that being said, let's get around this bug where it cannot mount logical volumes to /mnt/local.
Run a `vgchange -a y` (that's right, do a blanket activation of all volume groups the system can see; even if the contents of the lvols within a volume group have been deleted, the volume group and its lvols will still be there and can still be activated). Now you will see why running my script at the very beginning, prior to mkrescue, is so important.

Once the volume group or groups (you could theoretically just have vg00) are activated, mount the lvols from the volume group you are using into /mnt/local. Yes, I understand that I am doing this manually, but the rear scripts themselves are failing because of some error where they don't fully understand that, from this point forward, with multipath enabled and LVM2 up and operational, you don't need to worry about disk names (that is the whole point of multipath: you do not have to worry about which block devices your data is on).
DO NOT FORGET ABOUT /boot!!!!
This is because 9 times out of 10 /boot will be at 555, and so TSM or any other backup/restore utility will not be able to write to this directory due to permissions. Please check the permissions and, if necessary, chmod the directory to 755; after the restore, you can chmod it back to 555.

The boot information was gathered, again, from the original script you ran before you even ran mkrescue; more specifically, in the fdisk area printed at the bottom, the asterisk-marked block device specifies the boot partition. Since we know that mpatha is our primary disk, we know that mpatha1 will be our boot partition and mpatha2 will be our vg00 disk. As shown here:
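Putting the manual activation, mounts, and /boot permission fix together, a sketch (the vg00/lvol names and the mpatha1 boot partition are assumptions drawn from the example environment above; substitute your own):

```shell
# Manual activate-and-mount sequence for the rescue environment.
# 'RUN=echo' previews the commands without touching any disks; call
# with 'RUN= mount_local' on the real rescue system to execute.
mount_local() {
    $RUN vgchange -a y                              # blanket-activate every VG
    $RUN mount /dev/vg00/lv_root /mnt/local         # root filesystem first
    $RUN mkdir -p /mnt/local/boot /mnt/local/var
    $RUN mount /dev/mapper/mpatha1 /mnt/local/boot  # DO NOT FORGET /boot
    $RUN mount /dev/vg00/lv_var /mnt/local/var
    $RUN chmod 755 /mnt/local/boot                  # let TSM write; 555 again after restore
}

RUN=echo mount_local    # dry run: print what would be executed
```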
Now, if everything is okay, and the sun and planets are aligned just perfectly, go ahead and do the recover from rear:

`rear recover`

Everything should be fine with the configuration of `disklayout.conf` as well as default.conf and site.conf. Select the correct numbers for all the filesystems you wish to restore, i.e. 1 2 3 4 5 6 7 8 9 10, etc...
it will be normal to see some errors that say "no code added to for filesystem blah blah blah"
Select option 1 for every question that asks this.
Once it goes through these questions, it will then check that there is something mounted in /mnt/local. SURPRISE!!!! Your lvols are already mounted there via multipath because of the steps you just followed, so if files or directories are missing (i.e. if someone tried a 'rm -Rf * /'), TSM will now pick up where you left off and begin recovery.
When the above is completed, if you are utilizing selinux, you need to go into /mnt/local/etc/selinux and set enforcing mode to permissive mode. Then touch /mnt/local/.autorelabel. This is to combat a minor issue where SELinux blocks all newly scanned FC devices during reboot, and therefore your devices will not be mounted (your kernel will panic, or go into the mode where it kills off udev every so often).
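The selinux step above boils down to two commands; sketched here against scratch files so they can be exercised safely (the real targets are /mnt/local/etc/selinux/config and /mnt/local/.autorelabel):

```shell
# Switch the restored system to permissive and force a full relabel on
# its first boot, so SELinux does not block the newly scanned FC devices.
cfg=$(mktemp)     # stand-in for /mnt/local/etc/selinux/config
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$cfg"

sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' "$cfg"

root=$(mktemp -d) # stand-in for /mnt/local
touch "$root/.autorelabel"   # triggers the autorelabel pass on first boot
```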
Change /mnt/local/boot back to permission level 555. (Note: you may still encounter the error that rear is unable to find any /boot partitions. From a theoretical standpoint, if something was wiped from /boot, you could use the permissions workaround to restore it from backup.)
Reboot. And wait for the system to come back up.
Once the system comes back up, run `multipath -r`, then `multipath -v3`, then reboot one more time (sometimes only 3 of the 4 paths come back on the initial reboot, and doing a refresh followed by a rescan (-v3) will fix this issue). And voila: you will have your multipath, boot-from-SAN, restored-from-backup machine.

FOOD FOR THOUGHT:
It might be worth saying that for this type of recovery it is best to get your root filesystem recovered first and foremost. Once you can boot normally because / is working correctly, you could theoretically kick off a real TSM session with a system that is up, in order to recover any additional volume groups you have (i.e. an oracle database, some application, or customer data).
I HOPE TO GOD THAT THIS HELPS THE REAR TEAM GET CLOSER TO COMPLETING THIS PROJECT WITH FULLY AUTOMATED SUPPORT for BOOT FROM SAN as well as MULTIPATHD.
TOTAL DISASTER NOTE

If the lvols associated with the VG are completely gone (i.e. `vgchange -ay` completely fails out due to bad/broken disks), what you could do prior to the recover is to use the RHEL rescue image (i.e. boot into rescue mode with multipath enabled), re-create your disks (i.e. pvcreate, kpartx -a, etc...), then create your VGs as well as your lvols (use the df.txt file that ReaR has coded into its recovery agent if you need a quick look at how you had originally set up your filesystems). Once this is complete, stop right there, boot into the rescue image you created with ReaR, and then follow the documented guide above. This means all lvols will be completely empty and TSM will restore EVERYTHING that you mount into /mnt/local/.
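As a sketch, the from-scratch rebuild might look like this; every device name, partition boundary, and size here is an assumption to be replaced from your own environment and df.txt:

```shell
# From-scratch rebuild of the boot LUN before a total-disaster recover.
# 'RUN=echo' previews the commands; use 'RUN= rebuild_disk' for real.
rebuild_disk() {
    $RUN parted -s /dev/mapper/mpatha mklabel msdos
    $RUN parted -s /dev/mapper/mpatha mkpart primary 1MiB 501MiB   # /boot (size assumed)
    $RUN parted -s /dev/mapper/mpatha mkpart primary 501MiB 100%   # vg00 PV
    $RUN kpartx -a /dev/mapper/mpatha       # map the mpatha1/mpatha2 partitions
    $RUN pvcreate /dev/mapper/mpatha2
    $RUN vgcreate vg00 /dev/mapper/mpatha2
    $RUN lvcreate -L 10G -n lv_root vg00    # sizes come from ReaR's df.txt
}

RUN=echo rebuild_disk   # dry run: print what would be executed
```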