MVS wait code 00090064 during IPL under VM with ECPS:VM active #63

Closed
wably opened this issue Jun 15, 2017 · 2 comments

Comments

wably commented Jun 15, 2017

The following bug report is reproduced from Hyperion issue 193 so that the bug can be documented and fixed in Spinhawk.

When attempting to IPL MVS 3.8J under VM/370 with ECPS:VM active, MVS sometimes loads a disabled wait PSW of 00020000 00090064. When this happens, the wait code appears after responding (just pressing ENTER) to the MVS message IEA101A SPECIFY SYSTEM PARAMETERS.

According to OS/VS2 System Codes, wait 064 is issued because of a program check during nucleus initialization (the x'09' in the code specifically means program check), and the program old PSW points to the instruction that failed. The problem is that the program old PSW contains a wait PSW: 070E0000 00000004. A program check cannot occur while the CPU is in a wait state, so something is amiss here.
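As a rough illustration, the two PSW words quoted above can be picked apart as follows. The wait bit is PSW bit 14, which is mask 0x00020000 in the first word. The split of the second word into a reason byte and a 12-bit wait code is an assumption that matches the 00090064 value in this report; it is not taken from the System Codes manual. The names `wait_info` and `decode_wait_psw` are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Wait bit in the first word of a System/370 PSW (PSW bit 14). */
#define PSW_WAIT_BIT 0x00020000u

/* Hypothetical container for the pieces of a wait-state code. */
typedef struct {
    bool     is_wait;  /* PSW wait bit is on */
    uint8_t  reason;   /* ASSUMED layout: bits 16-23 of word 1, e.g. 0x09 */
    uint16_t code;     /* ASSUMED layout: low 12 bits of word 1, e.g. 0x064 */
} wait_info;

/* Split a PSW into the fields discussed in this report. */
static wait_info decode_wait_psw(uint32_t word0, uint32_t word1)
{
    wait_info w;
    w.is_wait = (word0 & PSW_WAIT_BIT) != 0;
    w.reason  = (uint8_t)((word1 >> 16) & 0xFFu);
    w.code    = (uint16_t)(word1 & 0x0FFFu);
    return w;
}
```

Under those assumptions, `decode_wait_psw(0x00020000, 0x00090064)` reports a wait state with reason 0x09 and code 0x064, and the first word of the program old PSW, 070E0000, also has the wait bit set.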

Skipping over the details of hours of research, debugging, and single stepping, I tracked the issue to the DISP2 assist of ECPS:VM. It turns out that DISP2 is dispatching the run user (MVS) even though the user's virtual PSW has the wait bit set. DISP2 dutifully builds the dispatch PSW by merging the virtual instruction address into a standard CP dispatch PSW. Since the virtual PSW instruction address is 0, the resulting dispatch PSW is 070D0000 00000000. DISP2 then exits so that control can be given to the run user. MVS immediately program checks because the instruction address is 0. The value 070E0000 ends up in MVS's program old PSW because that is what the virtual PSW contained.
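The merge described above can be modeled with a small sketch. This is a simplification and not the code in ecpsvm.c: the `psw` struct, `build_dispatch_psw`, and the 24-bit address mask are assumptions made for illustration, and the 070D0000 first word is simply the value quoted in this report.

```c
#include <stdint.h>
#include <stdbool.h>

#define PSW_WAIT_BIT      0x00020000u  /* PSW bit 14 in the first word */
#define CP_DISPATCH_WORD0 0x070D0000u  /* first word quoted in this report */
#define PSW_ADDR_MASK     0x00FFFFFFu  /* ASSUMED 24-bit instruction address */

/* Hypothetical two-word PSW representation. */
typedef struct {
    uint32_t word0;
    uint32_t word1;
} psw;

static bool psw_is_wait(psw p)
{
    return (p.word0 & PSW_WAIT_BIT) != 0;
}

/* Model of the merge: the standard dispatch word is used as-is and only
 * the instruction address is taken from the virtual PSW, so the virtual
 * wait bit never reaches the real PSW. */
static psw build_dispatch_psw(psw virt)
{
    psw real;
    real.word0 = CP_DISPATCH_WORD0;
    real.word1 = virt.word1 & PSW_ADDR_MASK;
    return real;
}
```

With a virtual PSW of 070E0000 00000000, this model yields 070D0000 00000000: the wait bit is gone and the guest starts executing at address 0, which is consistent with the program check described above.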

The bottom line is that DISP2 should not be dispatching a user that is in virtual PSW wait. Moreover, there are dispatchability flags in the VMBLOK that indicate that a user should not be dispatched for a number of reasons, and one of them is VMPSWAIT (in byte VMRSTAT), which means the user is in virtual PSW wait. The assist code in DISP2 does not check this flag.

But even if DISP2 did check this flag, it would not resolve the problem, because the flag is not set in this case. I have been unable to determine how a user can have a virtual PSW with the wait bit set and not have the VMPSWAIT bit set.
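The discrepancy can be made concrete with a small classifier that compares the two indicators. This is only a sketch: the bit value used for VMPSWAIT is an assumption (check it against the VMBLOK definition before relying on it), and `wait_state`, `classify_wait`, and `should_skip` are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

#define VMPSWAIT_ASSUMED 0x10u    /* ASSUMPTION: position of VMPSWAIT in VMRSTAT */
#define PSW_WAIT_HW      0x0002u  /* wait bit in the first halfword of the PSW */

typedef enum {
    WAIT_NONE,        /* neither indicator says "waiting" */
    WAIT_CONSISTENT,  /* flag and PSW agree */
    WAIT_FLAG_ONLY,   /* VMPSWAIT set, PSW wait bit clear */
    WAIT_PSW_ONLY     /* the case in this report */
} wait_state;

/* Compare the VMRSTAT flag with the wait bit of the virtual PSW. */
static wait_state classify_wait(uint8_t vmrstat, uint16_t psw_hw0)
{
    bool flag = (vmrstat & VMPSWAIT_ASSUMED) != 0;
    bool wait = (psw_hw0 & PSW_WAIT_HW) != 0;

    if (flag && wait) return WAIT_CONSISTENT;
    if (flag)         return WAIT_FLAG_ONLY;
    if (wait)         return WAIT_PSW_ONLY;
    return WAIT_NONE;
}

/* Conservative policy: any sign of a wait means the assist should not
 * dispatch this user. */
static bool should_skip(uint8_t vmrstat, uint16_t psw_hw0)
{
    return classify_wait(vmrstat, psw_hw0) != WAIT_NONE;
}
```

The failing case here corresponds to `WAIT_PSW_ONLY`: VMRSTAT shows no PSW wait while the first halfword of the virtual PSW is 070E. A check on the flag alone would miss it, which is why the check proposed in this issue tests the PSW itself.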

Regardless, adding a check in the DISP2 code to see if the wait bit is set in the virtual PSW is the likely solution. This will cause the user to be skipped and another runnable user to be selected or the machine idled. The solution is this code snippet:

if(EVM_LH(vmb+VMPSW) & 0x0002)
{
    DEBUG_CPASSISTX(DISP2,logmsg("DISP2 : VMB @ %6.6X Not eligible : User in virtual PSW wait\n",vmb));
    continue;
}

This new code should be located immediately after this line in ecpsvm_do_disp2( ):

for(vmb=EVM_L(FW1);vmb!=FW1;vmb=EVM_L(vmb))
{
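The effect of the new check on this scan can be shown with a stripped-down, self-contained model. It is not the Hercules code: the real loop works on guest storage through EVM_L and EVM_LH and applies several other eligibility tests, while the `user` struct and `pick_runnable` below are hypothetical stand-ins that keep only the circular-list walk and the wait-bit test.

```c
#include <stddef.h>
#include <stdint.h>

#define PSW_WAIT_HW 0x0002u  /* wait bit in the first halfword of the PSW */

/* Hypothetical stand-in for a VMBLOK on a circular queue. */
typedef struct user {
    struct user *next;     /* circular list; the anchor itself is not a user */
    uint16_t     psw_hw0;  /* first halfword of the virtual PSW */
    int          id;
} user;

/* Walk the queue from the anchor and return the first user whose virtual
 * PSW is not in wait, or NULL if every user is waiting. */
static user *pick_runnable(user *anchor)
{
    for (user *u = anchor->next; u != anchor; u = u->next) {
        if (u->psw_hw0 & PSW_WAIT_HW)
            continue;  /* user in virtual PSW wait: not eligible */
        return u;
    }
    return NULL;  /* nothing runnable: leave the decision to CP */
}
```

With a queue holding one user at 070E and another at 070D, the model skips the first and returns the second. If every user is in PSW wait it returns NULL, which corresponds to letting CP handle the case.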

My justification for this solution is based on these points:
• There is no case where a user in a virtual wait state should be dispatched.
• While I cannot explain the reason for the discrepancy between the VMPSWAIT dispatchability flag and the wait bit in the virtual PSW, the rules throughout the ECPS code logic say: when in doubt about something, let CP handle it. The new code does exactly that. This is a dispatch case that cannot be reconciled, so let CP deal with it.
• I do think there is a problem somewhere that allows this discrepancy to occur but I have been unable to find it. Nevertheless, the solution code does resolve the issue. Thus, the safest course when something isn't right is to turn it over to CP.

The problem of the MVS wait 064 is resolved after implementing the solution above.

wably commented Jun 15, 2017

On 02/05/2017 PeterCoghlan wrote:

I appear to have come across this one too. While trying to start an RSCS link driver with TRACE PROG active:

*** 000002 PROG 0001 ==> 0104D8
D 28.8
000028 FF060001 40000002

suggesting that RSCS took a PROG 1 interrupt while in a wait state. This one was quite elusive. Even more elusive was the single one I got at IPL time:

IPL 191
RDR 001 DETACHED
RDR 001 DEFINED
*** 000002 PROG 0001 ==> 000007
(I seem to have mislaid the contents of the old PSW in this case unfortunately.)

It is very hard to be completely sure but so far, there is a very high degree of correlation between disabling DISP2 and the problem not occurring. Re-enabling DISP2 does result in it occurring again.

I haven't seen the problem since but it is very hard to be certain that it is gone as the tearing down and setting up again of the environment required to apply the fix makes it hard to know if I have successfully recreated the conditions under which it used to occur. It looks good so far though.

wably commented Jul 31, 2017

Fixed by commit e28ab94, pull request #66.

wably closed this as completed Jul 31, 2017