Skip to content

Conversation

@ssnl
Copy link
Collaborator

@ssnl ssnl commented Jan 2, 2019

Currently, in test_proper_exit,

  1. we do not kill the correct input pid in the kill_pid function
    def kill_pid(pid):
    if IS_WINDOWS:
    os.system('taskkill /PID ' + str(os.getpid()) + ' /F')
    else:
    os.kill(os.getpid(), signal.SIGKILL)
  2. the Windows command that detects process status doesn't actually work
    def _is_process_alive(pid, pname):
    # There is a chance of a terminated child process's pid being reused by a new unrelated process,
    # but since we are looping this check very frequently, we will know that the child process dies
    # before the new unrelated process starts.
    if IS_WINDOWS:
    command = 'tasklist | find "{}" /i'.format(pid)
  3. worker_error and worker_kill cases (sometimes?) are not tested because the workers may exit naturally due to the pre-fetching mechanism and a too small dataset size / batch size.

In this PR, I, in separate commits:

  1. Install psutil (a python package specifically built for process monitoring) on some CI builds. (Linux builds installation are done in Install psutil for dataloader tests pietern/pytorch-dockerfiles#29 install psutil and librosa on non-conda builds pietern/pytorch-dockerfiles#30 Update pytorch docker version to 278 ossci-job-dsl#36 and Bump CircleCI docker version to 278 #15795).

  2. Rewrite test_proper_exit with psutil so we

    1. do not rely on the hacky is_process_alive
      @staticmethod
      def _is_process_alive(pid, pname):
      # There is a chance of a terminated child process's pid being reused by a new unrelated process,
      # but since we are looping this check very frequently, we will know that the child process dies
      # before the new unrelated process starts.
      if IS_WINDOWS:
      command = 'tasklist | find "{}" /i'.format(pid)
      else:
      command = 'ps -p {} -o comm='.format(pid)
      p = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
      (output, err) = p.communicate()
      p_status = p.wait()
      output = output.decode('utf-8')
      return pname in output
    2. increase the #task per worker so worker_error and worker_kill properly trigger
    3. test error message content to ensure that the loader exits with correct message corresponding to each exiting scenario.
  3. Fix Windows data loader not having any mechanism to detect worker failures.

@ssnl ssnl force-pushed the fix_test_proper_exit branch 13 times, most recently from ac3b842 to e21b7f0 Compare January 9, 2019 12:36
@ssnl ssnl changed the title [Test CI] fix TestDataLoader.test_proper_exit Fix TestDataLoader.test_proper_exit Jan 9, 2019
@ssnl ssnl force-pushed the fix_test_proper_exit branch 4 times, most recently from 389a18c to 2babbf8 Compare January 9, 2019 16:36
@ssnl ssnl requested review from apaszke, soumith and yf225 January 9, 2019 16:39
@ssnl ssnl force-pushed the fix_test_proper_exit branch from 2babbf8 to 0303df2 Compare January 9, 2019 16:43
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl
Copy link
Collaborator Author

ssnl commented Jan 10, 2019

@soumith I observed some flakiness so I increased the test join time out. Please land again when you have time. :)

@ssnl ssnl force-pushed the fix_test_proper_exit branch from fff7213 to 5da15f9 Compare January 10, 2019 01:53
@ssnl
Copy link
Collaborator Author

ssnl commented Jan 10, 2019

lint is already broken on master

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl ssnl changed the title Fix TestDataLoader.test_proper_exit [DataLoader] Fix Windows worker exit detection, fix test_proper_exit Jan 14, 2019
@ssnl ssnl deleted the fix_test_proper_exit branch January 14, 2019 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants