
Problem with InfraValidator: RPC Error StatusCode.UNAVAILABLE #2914

Closed
dbustosp opened this issue Dec 3, 2020 · 7 comments

Comments

@dbustosp

dbustosp commented Dec 3, 2020

I have been trying to use the InfraValidator component in a local environment. I have tried three different ways: Interactive, Airflow, and Beam. For all of them I get the same error in the logs.

It is strange because the container is triggered and the model is loaded successfully, yet I get a confusing error. In the end the model is blessed by the component.

How I am using it:

from tfx.components import InfraValidator
from tfx.proto import infra_validator_pb2

infra_validator = InfraValidator(
    model=trainer.outputs['model'],
    serving_spec=infra_validator_pb2.ServingSpec(
        tensorflow_serving=infra_validator_pb2.TensorFlowServing(
            tags=['latest']
        ),
        local_docker=infra_validator_pb2.LocalDockerConfig()
    ),
    validation_spec=infra_validator_pb2.ValidationSpec(
        max_loading_time_seconds=60,
        num_tries=2
    ),
    request_spec=infra_validator_pb2.RequestSpec(
        tensorflow_serving=infra_validator_pb2.TensorFlowServingRequestSpec(
            signature_names=['classification']
        ),
        num_examples=10  # How many requests to make.
    )
)

The logs, with the error ("<_InactiveRpcError..") in between:

INFO:absl:Starting infra validation (attempt 1/2).
INFO:absl:Starting LocalDockerRunner(image: tensorflow/serving:latest).
INFO:absl:Running container with parameter {'auto_remove': True, 'detach': True, 'publish_all_ports': True, 'image': 'tensorflow/serving:latest', 'environment': {'MODEL_NAME': 'infra-validation-model', 'MODEL_BASE_PATH': '/model'}, 'mounts': [{'Target': '/model/infra-validation-model/1', 'Source': '/Users/home/.temp/466/infra-validation-model/1606971601', 'Type': 'bind', 'ReadOnly': True}]}
INFO:absl:Error while obtaining model status:
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1606971601.677635000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4166,"referenced_errors":[{"created":"@1606971601.677632000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>
INFO:absl:Waiting for model to be loaded...
INFO:absl:Model is successfully loaded.
INFO:absl:Stopping LocalDockerRunner(image: tensorflow/serving:latest).
INFO:absl:Stopping container.
INFO:absl:Running publisher for InfraValidator
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component InfraValidator is finished.

Could you please tell me why I am seeing this behaviour? The model is still blessed even though this error appears.

Thanks!

@Arghya999

Please check the "docker service logs" output as well as "docker info" to see the status of the container.
It seems to be a memory-related issue.
Restarting the Docker daemon after flushing the changes might help.

@dbustosp
Author

dbustosp commented Dec 3, 2020

Please check the "docker service logs" output as well as "docker info" to see the status of the container.
It seems to be a memory-related issue.
Restarting the Docker daemon after flushing the changes might help.

I already did and everything looks normal.

@chongkong
Contributor

Hi, this is an INFO log (not WARNING or ERROR) that was intentionally enabled for verbosity. When InfraValidator checks whether the model has loaded successfully, it polls the model server periodically (about once per second), and the server may respond UNAVAILABLE until the model is actually loaded; the log shows that particular response. Polling continues until the model becomes available or a timeout is reached; in the former case the model is blessed, in the latter it is not.

Source code:

logging.info('Error while obtaining model status:\n%s', e)
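The behavior described above can be sketched as a generic poll-until-ready loop. This is only a minimal illustration of the pattern, not TFX's actual implementation; the function names and the `Unavailable` stand-in exception are made up:

```python
import time


class Unavailable(Exception):
    """Stand-in for a gRPC StatusCode.UNAVAILABLE error."""


def wait_until_loaded(get_model_status, timeout_s=60.0, poll_interval_s=1.0):
    """Poll get_model_status() until it succeeds or the deadline passes.

    UNAVAILABLE responses are expected while the server is still starting,
    so they are logged and retried rather than treated as failures.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return get_model_status()
        except Unavailable as e:
            # Same situation as the "Error while obtaining model status"
            # INFO line in the thread: the container is up, but the model
            # server is not serving yet.
            print(f"INFO: model not available yet: {e}")
            time.sleep(poll_interval_s)
    raise TimeoutError("model was not loaded before the deadline")
```

With a fake status function that fails twice before succeeding, the loop logs two UNAVAILABLE responses and then returns normally, mirroring the logs in this issue.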

@rmothukuru
Contributor

@selknamintech,
Can you please respond to @chongkong's comment above? Thanks!

@dbustosp
Author

dbustosp commented Dec 10, 2020

Hi @chongkong,

Sorry for the delay in my response.
That absolutely makes sense. I appreciate your answer and the clarification.

The message is confusing, though. It would be great if at least the word "Error" could be changed, since it can lead to misunderstanding.

Marking this as resolved ✅


copybara-service bot pushed a commit that referenced this issue Jan 5, 2021
@chongkong
Contributor

Will update the INFO log message format in a follow-up PR.
