A misconfigured CTCE causes Hercules to crash #646

Closed
jeff-snyder opened this issue Mar 30, 2024 · 2 comments

jeff-snyder commented Mar 30, 2024

Hi Peter,

The two systems JS01 and JS10 are configured to talk to each other over a CTC. Unfortunately, a typo was made in the JS01 configuration: E20 was entered as the remote CUU instead of 610.

JS01
610 CTCE 30801 E20=Hercules 30810 # link to JS10/E20

JS10
610 CTCE 30810 610=Hercules 30801 ATTNDELAY 200 # link to JS01/610
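
A mismatch like this can be caught before starting either system by cross-checking the two statements. The following is a minimal sketch, not part of Hercules: it assumes statements in exactly the form shown above (`<ccuu> CTCE <lport> <rccuu>=<host> <rport> ...`), and the names `CtceLink`, `parse_ctce` and `link_mismatches` are hypothetical helpers introduced only for illustration.

```python
from typing import List, NamedTuple


class CtceLink(NamedTuple):
    local_ccuu: int
    local_port: int
    remote_ccuu: int
    remote_host: str
    remote_port: int


def parse_ctce(statement: str) -> CtceLink:
    """Parse '<ccuu> CTCE <lport> <rccuu>=<host> <rport> ...' (assumed form)."""
    # Drop any trailing '# comment' and split on whitespace.
    fields = statement.split("#", 1)[0].split()
    if len(fields) < 5 or fields[1].upper() != "CTCE":
        raise ValueError(f"not a CTCE statement in the expected form: {statement!r}")
    remote_ccuu, sep, remote_host = fields[3].partition("=")
    if not sep:
        raise ValueError(f"no '<rccuu>=' prefix on the remote address: {statement!r}")
    return CtceLink(
        local_ccuu=int(fields[0], 16),   # device numbers are hexadecimal
        local_port=int(fields[2]),
        remote_ccuu=int(remote_ccuu, 16),
        remote_host=remote_host,
        remote_port=int(fields[4]),
    )


def link_mismatches(a: CtceLink, b: CtceLink) -> List[str]:
    """Return one message for each way the two sides fail to point at each other."""
    problems = []
    for name, x, y in (("first", a, b), ("second", b, a)):
        if x.remote_ccuu != y.local_ccuu:
            problems.append(
                f"{name} side expects remote CCUU {x.remote_ccuu:04X}, "
                f"but the other side is defined as {y.local_ccuu:04X}"
            )
        if x.remote_port != y.local_port:
            problems.append(
                f"{name} side expects remote port {x.remote_port}, "
                f"but the other side listens on {y.local_port}"
            )
    return problems


if __name__ == "__main__":
    js01 = parse_ctce("610 CTCE 30801 E20=Hercules 30810 # link to JS10/E20")
    js10 = parse_ctce("610 CTCE 30810 610=Hercules 30801 ATTNDELAY 200 # link to JS01/610")
    for problem in link_mismatches(js01, js10):
        print(problem)
```

For the two statements above, the check reports the CCUU mismatch on the JS01 side; the port numbers already agree. The sketch does not compare host names, since both sides use the same name here.
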

When I bring up Hercules for JS01 and IPL VM/ESA, there are no problems. A "devlist ctca" shows the link defined (among others).

2024-03-29 18:09:18 HHC01603I devlist ctca
2024-03-29 18:09:18 HHC02279I 0:0610 CTCE CTCE 30801/63504 !=! 0:0E20=192.168.1.32:30810/* IO[0] 

When I then bring up Hercules for JS10, it thinks it successfully connected to this link.

2024-03-29 18:10:48 HHC05063I 0:0610 CTCE: Awaiting inbound connection :30810 <- 0:0610=192.168.1.32:30801/*
2024-03-29 18:10:48 HHC05070I 0:0610 CTCE: Accepted inbound connection :30810 <- 0:0610=192.168.1.32:63830 (bufsize=62552,16)
2024-03-29 18:10:48 HHC05054I 0:0610 CTCE: Renewed outbound connection :63844 -> 0:0610=192.168.1.32:30801

The "devlist" for JS10 shows a connection.

2024-03-29 18:10:57 HHC01603I devlist ctca
2024-03-29 18:10:57 HHC02279I 0:0610 CTCE CTCE 30810/63844 <-> 0:0610=192.168.1.32:30801/63830 IO[0] open 

The log for JS01 shows it started an outbound connection, but there was never an inbound connection.

2024-03-29 18:10:48 HHC05054I 0:0610 CTCE: Started outbound connection :63830 -> 0:0E20=192.168.1.32:30810

The "devlist" confirms an incomplete link.

2024-03-29 18:11:17 HHC01603I devlist ctca
2024-03-29 18:11:17 HHC02279I 0:0610 CTCE CTCE 30801/63830 !=> 0?0E20=192.168.1.32:30810/* IO[2] open 
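
The same check can be scripted against a saved log. This is a rough sketch that relies only on what the excerpts in this issue show: the established link on JS10 is listed with `<->`, while the JS01 side shows `!=!` or `!=>`. Treating `<->` as the only "connected" marker is an assumption drawn from these logs, and `incomplete_ctce_links` is a hypothetical helper, not a Hercules facility.

```python
import re
from typing import List, Tuple

# Matches devlist lines such as:
#   HHC02279I 0:0610 CTCE CTCE 30801/63830 !=> 0?0E20=192.168.1.32:30810/* IO[2] open
# Group 1 is the local device, group 2 the link marker between the two ends.
CTCE_DEVLIST = re.compile(r"HHC02279I\s+(\S+)\s+CTCE\s+CTCE\s+\S+\s+(\S+)\s+\S+")

# Assumption: '<->' is the marker for a fully established link.
CONNECTED = "<->"


def incomplete_ctce_links(log_text: str) -> List[Tuple[str, str]]:
    """Return (device, marker) for every CTCE devlist line not marked as connected."""
    found = []
    for line in log_text.splitlines():
        match = CTCE_DEVLIST.search(line)
        if match and match.group(2) != CONNECTED:
            found.append((match.group(1), match.group(2)))
    return found
```

Running it over the JS01 log would flag device 0:0610, while the JS10 log would come back clean even though its partner never completed the link, so both sides need to be checked.
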

When I IPL JS10, errors ensue:

2024-03-29 18:11:56 HHC01603I ipl 1c0
2024-03-29 18:11:56 HHC05074E 0:0610 CTCE: Error writing to 0:0610=192.168.1.32:30801/63830: An established connection was aborted by the software in your host machine.
2024-03-29 18:11:56 HHC00007I Previous message from function 'CTCE_Send' at ctcadpt.c(2555)
2024-03-29 18:11:56 HHC05086I 0:0610 CTCE: Recovery is about to issue Hercules command: DEVINIT 0:0610
2024-03-29 18:12:31 HHC00822S PROCESSOR CP00 APPEARS TO BE HUNG!

and, eventually, a crash dump.

Note: this happened on Windows 10, running Hercules version 4.8.0.11129-SDL-DEV-g5517d322-modified.
I retested with Hercules version 4.8.0.11129-SDL-DEV-g5517d322, i.e. without the changes to ctcadpt.c, and it still happens.

JS01 is VM/ESA 2.4 and JS10 is VM/SP 5.

Here are the associated log and config files.

Unfortunately, due to its 74 MB size, I cannot upload the dump file. For now, I have put it on my Google Drive. Hopefully you can get it from there, or we can find another way to get it to you.

Thanks for looking at this!
Jeff

Collaborator

Peter-J-Jansen commented Mar 31, 2024

Hi Jeff,

CTCE recovery attempts are known not to always succeed. The DEVINIT 0:0610 attempt may well fail if the device is busy or has an interrupt pending by then. That it led to a crash in this case was probably due to the Hercules watchdog timer discovering HHC00822S PROCESSOR CP00 APPEARS TO BE HUNG!. So this crash was probably a case of Works As Designed ("WAD").

Some years ago, considerable effort was spent on making the CTCE automatic recovery fail-safe, and progress was made, but I was unable to make it work in all cases. As this occurrence was triggered by an incorrect CTCE configuration, I'd suggest we close this issue. To help avoid CTCE configuration errors, I'd suggest not specifying any port numbers at all when the CTCE links are between different hosts, and restricting the configuration to device (CCUU) numbers only, e.g.:

0610 CTCE =Hercules

or, if one prefers host-specific device numbers:

0610 CTCE 0601=js01.hostname

0601 CTCE 0610=js10.hostname

Cheers,

Peter

@jeff-snyder
Author

Peter,

> Some years ago, considerable effort was spent on making the CTCE automatic recovery fail-safe, and progress was made, but I was unable to make it work in all cases. As this occurrence was triggered by an incorrect CTCE configuration, I'd suggest we close this issue.

I'm good with that. I have a workaround (i.e. fix your stupid configuration error!).

> To help avoid CTCE configuration errors, I'd suggest not specifying any port numbers at all when the CTCE links are between different hosts, and restricting the configuration to device (CCUU) numbers only, e.g.:

Unfortunately, this doesn't work for me because I run multiple Hercules images on each host and I move them around, so I'm never sure which images will be running on which hosts. It's a good solution for people with fewer images or a more stable environment, though!

Thanks,
Jeff
