Torchserve gives errors while running docker image on k8s but not when running image locally #2300
Tested with archiving and building on Linux, but same issue. It runs fine locally but gives errors on k8s (I also tested different clusters). Another strange behaviour I observed is that when you set initial models to only one of the models, it sometimes also works on k8s, not only locally.
@stefanknegt Do you mind sharing the YAML file you are using to deploy the cluster and how you are calling the inference API? Also, it's not clear what the 2nd and 3rd parts of the repro steps are, or whether the entire thing is part of one Dockerfile.
I have bash scripts (the second and third code snippets in the repro steps) that are used to make the .mar files. I am not calling the inference API, since it already 'breaks' before I can make any calls to it (see the logs in the initial post). To reproduce, build the Docker image and run the torch-model-archiver as described. You have to do this twice, since loading one model sometimes does work.
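The archiving step mentioned above can be sketched roughly as follows; the model name, serialized-file path, and handler are illustrative placeholders, not the exact values from the original bash scripts:

```shell
# Sketch of the archiving step, assuming a serialized model file and a
# custom handler; all names and paths here are illustrative placeholders.
torch-model-archiver \
  --model-name my_model \
  --version 1.0 \
  --serialized-file model.pt \
  --handler handler.py \
  --export-path model_store \
  --force
```

The resulting `model_store/my_model.mar` is what gets copied into the Docker image.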
@agunapal Thanks for helping me out. If you have any questions, please let me know.
@stefanknegt I made this PR showing a k8s MNIST example with Minikube.
@agunapal Do you have any thoughts on how I can fix this issue?
@agunapal Is there anything I can do to get this fixed? Thanks!
+1. My code runs well in a local image, but fails on a k8s cluster with the same image. However, unlike your random errors, the error from my side is always the same.
Check if you guys have set self.init, and self.context = context.bla |
@arnavmehta7 See my code comment above, I have both. |
Hi @stefanknegt,
I ran into a similar issue when deploying TorchServe to Kubernetes. It turned out to be due to OOM.
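If OOM is indeed the cause, explicitly setting memory requests and limits on the TorchServe container makes the failure mode visible (the pod gets OOMKilled instead of producing seemingly random errors). A minimal sketch of the container's `resources` stanza in a Deployment spec; the values are illustrative and should be sized to your models:

```yaml
# Illustrative resource settings for a TorchServe container in a k8s
# Deployment; adjust the memory values to the combined size of your models.
resources:
  requests:
    memory: "4Gi"
    cpu: "1"
  limits:
    memory: "8Gi"
    cpu: "2"
```

Checking `kubectl describe pod` for an `OOMKilled` last state is a quick way to confirm or rule this out.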
@sowmyay, right finding... I used JConsole to figure this out, but there were no relevant logs in the exception.
🐛 Describe the bug
When running TorchServe on a k8s cluster (Minikube locally), I get a lot of errors, while running the exact same Docker image locally works fine. The errors seem to be different every time: sometimes they are about loading transformer models, for instance a .json file being corrupt (which it is not), but sometimes it is also just a lot of TorchServe Java errors.
Error logs
Installation instructions
I am using the following base image: pytorch/torchserve:0.7.0-cpu
Model Packaging
I am packaging the models locally and then adding the .mar files to the Dockerfile.
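A minimal sketch of such a Dockerfile, assuming the .mar files were archived into a local `model_store/` directory and a `config.properties` sits next to it (the paths are illustrative):

```dockerfile
FROM pytorch/torchserve:0.7.0-cpu

# Copy pre-archived models and the server config into the image.
COPY model_store /home/model-server/model-store
COPY config.properties /home/model-server/config.properties

# The base image's entrypoint starts TorchServe; it picks up the
# config.properties and model store under /home/model-server.
```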
config.properties
Versions
Repro instructions
Run the following Dockerfile, where the models are archived using the bash scripts below.
Possible Solution
The only suspicion I currently have is that it has something to do with packaging the models on a Mac M1 and then serving them on a Linux server. Could anyone tell me whether this could cause these errors?
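One way to rule out the M1-vs-Linux suspicion is to build the image explicitly for the cluster's platform, or to run the archiving step inside the Linux container itself rather than on the Mac. A sketch using docker buildx; the image name is a placeholder:

```shell
# Build an amd64 image on an Apple Silicon machine so the image matches
# the architecture of the k8s nodes; the tag is an illustrative placeholder.
docker buildx build --platform linux/amd64 -t my-torchserve-image .
```

Whether the .mar archive itself is architecture-sensitive depends on what is packaged: a .mar is essentially a zip, so plain weights and Python files are generally portable, while any bundled native artifacts are not.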