Memory Leak in JIT Model under Multi-Goroutine Environment #117
Thanks for the report. I had a quick look, however, and I see 2-3 things in your code that may cause the memory blow-up:
Please try those things and see how it goes. Thanks.
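The general pattern with gotch, which the snippets below also follow, is to run inference inside ts.NoGrad and drop every returned tensor explicitly once it has been consumed. A minimal sketch of that cleanup discipline, where net and input are placeholders for an already-loaded model and its input tensor:

// Minimal cleanup sketch; net and input are assumed to exist already.
ts.NoGrad(func() {
	// Forward pass with gradient tracking disabled.
	result := net.ForwardT(input, false)
	// Free the result tensor explicitly once it has been consumed.
	result.MustDrop()
})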
Thank you for your response. I have adjusted my code according to your suggestions, but the memory usage still keeps increasing. Below is my latest code:

package main
import (
"encoding/json"
"os"
"time"
"github.com/sugarme/gotch"
"github.com/sugarme/gotch/nn"
"github.com/sugarme/gotch/pickle"
"github.com/sugarme/gotch/ts"
"github.com/sugarme/gotch/vision"
)
func getModel() (net nn.FuncT) {
modelName := "resnet18"
url, ok := gotch.ModelUrls[modelName]
if !ok {
panic("Unsupported model name")
}
modelFile, err := gotch.CachedPath(url)
if err != nil {
panic(err)
}
vs := nn.NewVarStore(gotch.CPU)
net = vision.ResNet18NoFinalLayer(vs.Root())
err = pickle.LoadAll(vs, modelFile)
if err != nil {
panic(err)
}
return
}
func getTensor() (tensor *ts.Tensor) {
b, err := os.ReadFile("test.data")
if err != nil {
panic(err)
}
var data []float32
err = json.Unmarshal(b, &data)
if err != nil {
panic(err)
}
tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
tensor = tensor.MustUnsqueeze(0, true)
return
}
func main() {
net := getModel()
tensor := getTensor()
defer tensor.MustDrop()
var goroutineNum = 10
for i := 0; i < goroutineNum; i++ {
go func(net nn.FuncT) {
for {
ts.NoGrad(func() {
result := net.ForwardT(tensor, false)
result.MustDrop()
})
}
}(net)
}
time.Sleep(5 * time.Minute)
}
When calling the model in multiple goroutines, a lot of warning messages appear, as follows:

2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235087". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235091". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235100". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235098". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235215". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235245". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235395". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235566". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235609". Called from "ts.Drop()". Just skipping...
You should probably create a model for each goroutine then. Actually, I have never tried running concurrency on one model like that. I guess there will be a lot of data collisions as all goroutines feed into a single model.
I created a model for each goroutine and used the corresponding model when calling it within the goroutine, but the issue remains.

package main
import (
"encoding/json"
"os"
"time"
"github.com/sugarme/gotch"
"github.com/sugarme/gotch/nn"
"github.com/sugarme/gotch/pickle"
"github.com/sugarme/gotch/ts"
"github.com/sugarme/gotch/vision"
)
func getModel() (net nn.FuncT) {
modelName := "resnet18"
url, ok := gotch.ModelUrls[modelName]
if !ok {
panic("Unsupported model name")
}
modelFile, err := gotch.CachedPath(url)
if err != nil {
panic(err)
}
vs := nn.NewVarStore(gotch.CPU)
net = vision.ResNet18NoFinalLayer(vs.Root())
err = pickle.LoadAll(vs, modelFile)
if err != nil {
panic(err)
}
return
}
func getTensor() (tensor *ts.Tensor) {
b, err := os.ReadFile("test.data")
if err != nil {
panic(err)
}
var data []float32
err = json.Unmarshal(b, &data)
if err != nil {
panic(err)
}
tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
tensor = tensor.MustUnsqueeze(0, true)
return
}
func main() {
var goroutineNum = 10
var nets []nn.FuncT
for i := 0; i < goroutineNum; i++ {
nets = append(nets, getModel())
}
tensor := getTensor()
defer tensor.MustDrop()
for i := 0; i < goroutineNum; i++ {
net := nets[i]
go func(net nn.FuncT) {
for {
ts.NoGrad(func() {
result := net.ForwardT(tensor, false)
result.MustDrop()
})
}
}(net)
}
time.Sleep(5 * time.Minute)
}
I will try to reproduce your problem when I have time this week. However, in your latest code the input tensor is still created in main and shared by all goroutines. What about something like this:

for i := 0; i < goroutineNum; i++ {
go func() {
net := getModel()
tensor := getTensor()
ts.NoGrad(func() {
result := net.ForwardT(tensor, false)
result.MustDrop()
})
tensor.MustDrop()
}()
}
The memory usage still keeps increasing. The key code is as follows:

for i := 0; i < goroutineNum; i++ {
go func() {
// goroutine model
net := getModel()
// test input tensor
tensor := getTensor()
defer tensor.MustDrop()
// stress test to observe memory increase
for {
ts.NoGrad(func() {
result := net.ForwardT(tensor, false)
// drop result tensor
result.MustDrop()
})
}
}()
}
I understand now. I seem to have found a bug in tensor.go that causes some tensors not to be released. This is the old code:

atomic.AddInt64(&TensorCount, 1)
nbytes := x.nbytes()
atomic.AddInt64(&AllocatedMem, nbytes)
lock.Lock()
if _, ok := ExistingTensors[name]; ok {
name = fmt.Sprintf("%s_%09d", name, TensorCount)
}
ExistingTensors[name] = struct{}{}
lock.Unlock()

Change it to:

tensorCount := atomic.AddInt64(&TensorCount, 1)
nbytes := x.nbytes()
atomic.AddInt64(&AllocatedMem, nbytes)
lock.Lock()
if _, ok := ExistingTensors[name]; ok {
name = fmt.Sprintf("%s_%09d", name, tensorCount)
}
ExistingTensors[name] = struct{}{}
lock.Unlock()

I just realized that you had already made a fix for this issue last week, but I wasn't using your latest code. The problem is resolved now; this issue can be closed.
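For context on why that change matters: atomic.AddInt64 returns the post-increment value for that particular call, while re-reading the shared TensorCount afterwards can pick up increments from other goroutines, so two tensors may be given the same generated name and one entry in ExistingTensors silently overwrites the other. A small standalone sketch, plain Go and unrelated to gotch, contrasting the two patterns:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const n = 1000
	var counter int64
	var mu sync.Mutex

	// Values obtained by re-reading the shared counter after the add (old pattern)
	// versus values returned directly by atomic.AddInt64 (fixed pattern).
	seenReread := map[int64]int{}
	seenReturned := map[int64]int{}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			returned := atomic.AddInt64(&counter, 1) // unique per call
			reread := atomic.LoadInt64(&counter)     // may collide with another goroutine
			mu.Lock()
			seenReturned[returned]++
			seenReread[reread]++
			mu.Unlock()
		}()
	}
	wg.Wait()

	dupReread, dupReturned := 0, 0
	for _, c := range seenReread {
		if c > 1 {
			dupReread++
		}
	}
	for _, c := range seenReturned {
		if c > 1 {
			dupReturned++
		}
	}
	fmt.Println("duplicate values when re-reading the counter:", dupReread)
	fmt.Println("duplicate values from the AddInt64 return:", dupReturned) // always 0
}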
Thanks for reporting.
I have encountered a memory leak issue when executing a JIT model under a multi-goroutine environment. When a single goroutine is used, the memory usage appears to be normal, stabilizing around 1GB. However, when multiple goroutines are launched (e.g., 10), the memory usage rapidly exceeds 10GB and continues to increase at a fast pace.
Below is the code to reproduce the issue:
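A condensed sketch of the core of that reproduction, using the getModel and getTensor helpers from the full program shown in the comments above:

// Shared model and input tensor, driven by several goroutines at once.
net := getModel()
tensor := getTensor()
defer tensor.MustDrop()

for i := 0; i < 10; i++ {
	go func() {
		for {
			ts.NoGrad(func() {
				result := net.ForwardT(tensor, false)
				result.MustDrop()
			})
		}
	}()
}
time.Sleep(5 * time.Minute)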
Steps to Reproduce:
1. Load the model with the getModel function.
2. Build the input tensor with the getTensor function.
3. Launch multiple goroutines that repeatedly call the ForwardT method on the net object, and observe the memory usage.

Expected Behavior:
The memory usage should remain stable regardless of the number of goroutines launched.
Actual Behavior:
The memory usage rapidly increases when multiple goroutines are launched, indicating a potential memory leak issue.
Environment:
Any assistance on this issue would be greatly appreciated. Thank you!