Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for response time out issue during build #420

Merged
merged 92 commits into from Jun 25, 2020
Merged

Fix for response time out issue during build #420

merged 92 commits into from Jun 25, 2020

Conversation

harshbafna
Copy link
Contributor

@harshbafna harshbafna commented Jun 1, 2020

Description

We use Java’s inbuilt destroyForcibly method to kill the backend worker and as per the following documentation it doesn’t guarantee an immediate forceful killing of the process and may take long time, which is more than the unregister response time out, whenever the CPU is busy/overloaded.

The process may not terminate immediately. i.e. isAlive() may return true for a brief period after destroyForcibly() is called. This method may be chained to waitFor() if needed

As a fix, instead of using the inbuilt destroyForcibly method and waiting on the same, we create another process with kill -9 <worker_pid> and wait on this process to complete.

Fixes #417

Also addresses #416

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Feature/Issue validation/testing

Successfully executed install_from_src_ubuntu on m4.xlarge EC2 instance and install_from_src_mac on local MacBook.

Logs from m4.xlarge EC2 instance for executing install_from_src_ubuntu script 10 times in a loop for staging_0_1_1 and issue_417 branch

staging_install.log
issue_417_install.log

Checklist:

  • Have you added tests that prove your fix is effective or that this feature works?
  • New and existing unit tests pass locally with these changes?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

harshbafna and others added 30 commits April 23, 2020 20:31
fixed resnet152 custom handler to work with GPU
* issue template

* Moving issue templates to root folder per pull request template

* adding feature request template

* Minor changes related to sections and names.

* Update feature_template.md

* Update bug_template.md

* doc template

* minor changes

* Minor changes to upload custom model or handler

Co-authored-by: mycpuorg <mail@mycpu.org>
Invokes docker with  --runtime=nvidia for GPU
    Added the option to specify the GPU / specific GPU ids in start.sh
    Fixed the documentation for start.sh script
    Fixed the JDK version in Dockerfile.gpu

Co-authored-by: Aaqib <maaquib@gmail.com>
* pylint fixes and code cleanup

* UT fixes, pylint updates per latest version and updated sanity script to run pylint

* fixed pylint and wrapt installation issue

* freeze pylint version to 2.4.4 due to bug in version 2.5

* removed duplicate statement to install pylint

Co-authored-by: Aaqib <maaquib@gmail.com>
* updated code to now log errors only in case when exception is not genreated by unregister api call

* java formatting

* fixed illigalstateexception

* removed the unrequired handling for illegal monitor state exception

Co-authored-by: dhaniram kshirsagar <dhaniram_kshirsagar@persistent.com>
Co-authored-by: Aaqib <maaquib@gmail.com>
* changes for Homebrew  issue 98

* adding changes for path in benchmark.py

* CI Build corrected

* benchmark changes

corrected path

* changed PATH for homebrew

* Update Readme

Made changes for homebrew installation and instruction for GPU based machine instruction for install_dependencies.sh

* Updated Readme.md

removed unnecessory  lines

* corrected indentation

corrected indentation for  new lines added

* Update Readme.md fil

added separate commands for CPU and GPU based

Co-authored-by: Ubuntu <ubuntu@ip-172-31-35-42.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-42-124.ec2.internal>
Co-authored-by: harshbafna <harsh_bafna@persistent.co.in>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-67.ec2.internal>
Co-authored-by: Harsh Bafna <harshbafna619@gmail.com>
Co-authored-by: Aaqib <maaquib@gmail.com>
* Moved plugins SDK to org.pytorch

* Updating the repo for publishing plugins SDK

* Enabled bintray maven repo temporarily

* Reset tests

* Added GPG plugin to pom file to sign the plugins SDK

* Update the SDK version

* Removed bintray repo

* Modify the buildspec to not deploy the SDK

Co-authored-by: Vamshidhar Dantu <dantu@amazon.com>
Co-authored-by: Aaqib <maaquib@gmail.com>
Co-authored-by: mycpuorg <mail@mycpu.org>
Co-authored-by: mycpuorg <mail@mycpu.org>
Co-authored-by: Brad Heintz <bradheintz@fb.com>
* Snapshot is now independent of model

* fixed checkstyle issue

* fixed formatting

* doc review: configuration and custom_service

* addressed review comments

* fixed incorrect merge

* fixed review comments and corrected the error messages

* adding remaining fixes for string constants

Co-authored-by: harshbafna <harsh_bafna@persistent.co.in>
Co-authored-by: eslesar-aws <eslesar@amazon.com>
Co-authored-by: Aaron Markham <markhama@amazon.com>
Co-authored-by: Harsh Bafna <harshbafna619@gmail.com>
#285)

* Regression Tests Suite - AWS Code Build Infra - Initial Commit for #57

* Script to clone Pytorch Repo & Run Post Man Scripts
* Basic Tests for Inference & Managment APIs
* AWS Code Build buildspec.yml

* Refactor postman tests, add empty postman collection for todos

* Add a readme for adding tests

* Add more management tests

* Fix typo

* Snapshot Regression tests - #57

* Fix typo

* Comment out snapshot test for now

* Weird typos are showing up, try saving everything

* Need to manually save every request in postman...

* When reregister don't use s3

* More workarounds for rereg

* Fix test messages

* Remove test for empty reqs

* Change get to post

* Fix pytest for subprocess - #57

* Enable pytest - #57

* Inference API Testcases / Rename Management Testcases  #57

* Wipe Model MARs in Modelstore between tests #54

* Fix Inference Endpoint Execution #54

* Fix file paths - #57

* Fix Initial Workers #57

* Fix Inference Assertions #57

* Fix DenseNet Inference #57

* Fix Densenet Register #54

* Redirect Logs to File #54

* Fix Test Execution Log #54

* README, Build Badge #57

* Format ReadMe #57

* Update README - Add Instruction to add new tests / Run tests locally

* Update README

* Update branch for build Codebuild Badge

Co-authored-by: alexwong <11878166+alexwong@users.noreply.github.com>
Co-authored-by: Aaqib <maaquib@gmail.com>
Co-authored-by: mycpuorg <mail@mycpu.org>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-12-178.us-west-2.compute.internal>
* documentation: fixed issues in installation and quick start in README.md

* documentation: Reverting conda related doc changes from PR#286

Fixes #297

* remove minus y (#311)

update pip install instructions

* Update instructions for installation from source

This commit aims to address Issue #312

Tested on a fresh EC2 P3.2 instance:
* Followed instructions under
https://github.com/pytorch/serve#install-torchserve-for-development
* Started the server manually from command line
* Ran a ping query

Received a response from server with "Healthy" status

Signed-off-by: Manoj Rao <mail@mycpu.org>

* add separate files for installation

one each for linux and macos

* update README and remove relative path from script

* Install script - Dont uninstall if not previously installed - #132

* explicitly build the source code with gradle prior to install

* remove upgrade instructions in favor of clean reinstall from source

* update instructions to contain the correct script name

Co-authored-by: eslesar-aws <eslesar@amazon.com>
Co-authored-by: Aaqib Ansari <maaquib@gmail.com>
Co-authored-by: Geeta Chauhan <4461127+chauhang@users.noreply.github.com>
Co-authored-by: Aaron Markham <markhama@amazon.com>
Co-authored-by: Manoj Rao <mail@mycpu.org>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-0-61.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-7-197.us-west-2.compute.internal>
* updated sanity script to use install_from_src_ubuntu script

* added java version check

* fixed issue with if statement

* fixed variable name

* fixed repeated cd command

* updated artifacts

* removed artifacts from buildspec
@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: f0f1c24
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: f0f1c24
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: 61d863d
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 61d863d
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@harshbafna harshbafna changed the base branch from staging_0_1_1 to master June 19, 2020 06:16
@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: e708c80
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: 44f75fa
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: 91c89a5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 91c89a5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: e708c80
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 44f75fa
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 93194dd
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: 93194dd
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@dhanainme dhanainme merged commit fe4cfb2 into master Jun 25, 2020
@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: 93194dd
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 93194dd
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@harshbafna harshbafna deleted the issue_417 branch July 1, 2020 10:04
@maaquib maaquib added this to PR In Progress in TorchServe v0.2.0 PR Lifecycle via automation Jul 2, 2020
@maaquib maaquib moved this from PR In Progress to PR Ready for Testing in TorchServe v0.2.0 PR Lifecycle Jul 2, 2020
@dhanainme dhanainme moved this from PR Ready for Testing to PR Tested in TorchServe v0.2.0 PR Lifecycle Jul 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects