{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":204618213,"defaultBranch":"master","name":"pytorch","ownerLogin":"satgera","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2019-08-27T04:16:15.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/15789658?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1703115937.0","currentOid":""},"activityList":{"items":[{"before":null,"after":"e6a4ad623e6b3c20a8fa23f3b1b4654b4feb7593","ref":"refs/heads/export-D52348727","pushedAt":"2023-12-20T23:45:37.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"Fix elastic multiprocessing in case a child does exit(0)\n\nSummary: Fix elastic multiprocessing in case a child does exit(0)\n\nTest Plan: We got into an edge case where child process calls exit(0) this isn't handled well in elastic multiprocessing and we end up deadlocking. This change take care of the same. The issue is that torch.multiprocessing considers only non-zero exit codes as errors while elastic logic assumes unless there is an error we would get the return info in the queue.\n\nDifferential Revision: D52348727","shortMessageHtmlLink":"Fix elastic multiprocessing in case a child does exit(0)"}},{"before":"303e8b0d5a8f14ee29b1ac20de1ca0f9e7acd280","after":"425ac85783a4880dde6ef184f6aefac512cd3ec3","ref":"refs/heads/export-D51862545","pushedAt":"2023-12-07T16:20:19.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)\n\nSummary:\n\n[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\nWe have seen a handful of jobs training stuck where one of the trainer goes down\nwhile others are stuck in c++ land and hence not handling the sigterm.\n\nTest Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```\n\nReviewed By: wconstab, fduwjj\n\nDifferential Revision: D51862545","shortMessageHtmlLink":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill …"}},{"before":"0468926ec0c7894d2cb4a058b7185a493b8ab7fc","after":"303e8b0d5a8f14ee29b1ac20de1ca0f9e7acd280","ref":"refs/heads/export-D51862545","pushedAt":"2023-12-07T02:08:26.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)\n\nSummary:\n\n[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\nWe have seen a handful of jobs training stuck where one of the trainer goes down\nwhile others are stuck in c++ land and hence not handling the sigterm.\n\nTest Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```\n\nReviewed By: wconstab, fduwjj\n\nDifferential Revision: D51862545","shortMessageHtmlLink":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill …"}},{"before":"c5417ce245156cbb8b949f5e24c7fad45cc2a211","after":"0468926ec0c7894d2cb4a058b7185a493b8ab7fc","ref":"refs/heads/export-D51862545","pushedAt":"2023-12-07T00:02:33.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)\n\nSummary:\n\n[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\nWe have seen a handful of jobs training stuck where one of the trainer goes down\nwhile others are stuck in c++ land and hence not handling the sigterm.\n\nTest Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```\n\nReviewed By: wconstab, fduwjj\n\nDifferential Revision: D51862545","shortMessageHtmlLink":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill …"}},{"before":"dce8fcda8dc5f3098fd907f140ec60d6955ad003","after":"c5417ce245156cbb8b949f5e24c7fad45cc2a211","ref":"refs/heads/export-D51862545","pushedAt":"2023-12-06T23:03:40.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)\n\nSummary:\n\n[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\nWe have seen a handful of jobs training stuck where one of the trainer goes down\nwhile others are stuck in c++ land and hence not handling the sigterm.\n\nTest Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```\n\nReviewed By: wconstab, fduwjj\n\nDifferential Revision: D51862545","shortMessageHtmlLink":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill …"}},{"before":"10cb508796c7bc52e178a04dc41ec4ee376217de","after":"dce8fcda8dc5f3098fd907f140ec60d6955ad003","ref":"refs/heads/export-D51862545","pushedAt":"2023-12-06T22:39:53.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)\n\nSummary:\n\n[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\nWe have seen a handful of jobs training stuck where one of the trainer goes down\nwhile others are stuck in c++ land and hence not handling the sigterm.\n\nTest Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```\n\nReviewed By: wconstab, fduwjj\n\nDifferential Revision: D51862545","shortMessageHtmlLink":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill …"}},{"before":"23cf6390e9618497b4941461882cde88f869162a","after":"10cb508796c7bc52e178a04dc41ec4ee376217de","ref":"refs/heads/export-D51862545","pushedAt":"2023-12-06T21:08:51.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)\n\nSummary:\n\n[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\nWe have seen a handful of jobs training stuck where one of the trainer goes down\nwhile others are stuck in c++ land and hence not handling the sigterm.\n\nTest Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```\n\nReviewed By: wconstab, fduwjj\n\nDifferential Revision: D51862545","shortMessageHtmlLink":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill …"}},{"before":"3655e31e60e634354bd8ee4803365b21763c7b3d","after":"23cf6390e9618497b4941461882cde88f869162a","ref":"refs/heads/export-D51862545","pushedAt":"2023-12-06T01:14:59.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)\n\nSummary:\n\n[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\nWe have seen a handful of jobs training stuck where one of the trainer goes down\nwhile others are stuck in c++ land and hence not handling the sigterm.\n\nTest Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```\n\nDifferential Revision: D51862545","shortMessageHtmlLink":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill …"}},{"before":null,"after":"3655e31e60e634354bd8ee4803365b21763c7b3d","ref":"refs/heads/export-D51862545","pushedAt":"2023-12-06T00:47:52.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"satgera","name":"Satendra Gera","path":"/satgera","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/15789658?s=80&v=4"},"commit":{"message":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\n\nSummary:\n[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process\nWe have seen a handful of jobs training stuck where one of the trainer goes down\nwhile others are stuck in c++ land and hence not handling the sigterm.\n\nTest Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```\n\nDifferential Revision: D51862545","shortMessageHtmlLink":"[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill …"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAADzyNaLQA","startCursor":null,"endCursor":null}},"title":"Activity · satgera/pytorch"}