
Memory Leak in JIT Model under Multi-Goroutine Environment #117

Closed
yinziyang opened this issue Oct 30, 2023 · 9 comments

@yinziyang

I have encountered a memory leak issue when executing a JIT model under a multi-goroutine environment. When a single goroutine is used, the memory usage appears to be normal, stabilizing around 1GB. However, when multiple goroutines are launched (e.g., 10), the memory usage rapidly exceeds 10GB and continues to increase at a fast pace.

Below is the code to reproduce the issue:

package main

import (
    "encoding/json"
    "log"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

// getModel loads the resnet18 model.
func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }
    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

// getTensor generates a test tensor.
func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, false)
    return
}

func main() {

    // Load resnet18 model
    net := getModel()

    // Generate test tensor
    tensor := getTensor()
    defer tensor.MustDrop()

    // Launch goroutines
    var goroutineNum = 10
    // When a single goroutine is used, the memory usage appears normal, long-term occupying 1GB.
    // When multiple goroutines are launched, e.g., 10, memory usage quickly exceeds 10GB, and continues to increase rapidly.
    for i := 0; i < goroutineNum; i++ {
        go func(net nn.FuncT) {
            for {
                log.Println(net.ForwardT(tensor, false))
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

Steps to Reproduce:

  1. Load the resnet18 model using the getModel function.
  2. Generate a test tensor using the getTensor function.
  3. Launch multiple goroutines that continuously call the ForwardT method on the net object, and observe the memory usage.

Expected Behavior:
The memory usage should remain stable regardless of the number of goroutines launched.

Actual Behavior:
The memory usage rapidly increases when multiple goroutines are launched, indicating a potential memory leak issue.

Environment:

  • Go version: 1.21.3
  • Gotch version: v0.9.0
  • OS: (e.g., Ubuntu 22.04)

Any assistance on this issue would be greatly appreciated. Thank you!

@sugarme
Owner

sugarme commented Oct 30, 2023

@yinziyang ,

Thanks for the report. However, I had a quick look and see two or three things that may cause memory to blow up:

  1. The last operation in your getTensor func should be tensor = tensor.MustUnsqueeze(0, true) (note the true) so the existing tensor is deleted before the new one is assigned; otherwise it leaks here.
  2. In the goroutine's for loop, you run net.ForwardT(tensor, false), which returns a tensor; that tensor should be deleted after log.Println() has used it, otherwise it leaks here as well.
  3. When doing a forward pass in inference mode, you should wrap it in ts.NoGrad(); otherwise autograd state will build up (not really a memory leak, but hidden tensors).

Please try those changes and see how things go. Thanks.

@yinziyang
Author

yinziyang commented Oct 30, 2023

@sugarme

Thank you for your response. I have adjusted my code according to your suggestions, but the memory usage still keeps increasing. Below is my latest code:

package main

import (
    "encoding/json"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }

    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, true)
    return
}

func main() {

    net := getModel()

    tensor := getTensor()
    defer tensor.MustDrop()

    var goroutineNum = 10
    for i := 0; i < goroutineNum; i++ {
        go func(net nn.FuncT) {
            for {
                ts.NoGrad(func() {
                    result := net.ForwardT(tensor, false)
                    result.MustDrop()
                })
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

@yinziyang yinziyang reopened this Oct 30, 2023
@yinziyang
Author

When calling the model from multiple goroutines, many warning messages appear, such as:

2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235087". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235091". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235100". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235098". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "BatchNorm_000235215". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235245". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235395". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Conv2d_000235566". Called from "ts.Drop()". Just skipping...
2023/10/30 11:54:50 WARNING: Probably double free tensor "Relu_000235609". Called from "ts.Drop()". Just skipping...
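
A hedged illustration of how warnings like these can arise: if two goroutines end up registering tensors under the same generated name in a shared registry, the first Drop removes the entry and the second Drop finds nothing, which looks like a double free. This is a simplified model of a name registry, not gotch's actual code; the registry type and method names here are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a simplified model of a tensor name registry. If name
// generation collides, two "tensors" share one registry entry, and
// the second drop sees the entry already gone.
type registry struct {
	mu      sync.Mutex
	entries map[string]struct{}
}

func (r *registry) register(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[name] = struct{}{}
}

// drop reports whether the entry was actually present.
func (r *registry) drop(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.entries[name]; !ok {
		// This is where a "Probably double free ... Just skipping..."
		// warning would be logged.
		return false
	}
	delete(r.entries, name)
	return true
}

func main() {
	r := &registry{entries: map[string]struct{}{}}

	// Two tensors end up with the same generated name: the second
	// register silently overwrites the first entry.
	r.register("Conv2d_000235087")
	r.register("Conv2d_000235087")

	fmt.Println(r.drop("Conv2d_000235087")) // true: entry removed
	fmt.Println(r.drop("Conv2d_000235087")) // false: looks like a double free
}
```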

@sugarme
Owner

sugarme commented Oct 30, 2023

@yinziyang ,

You should probably create a model for each goroutine, then. Actually, I have never tried running concurrent inference on one model like that; I suspect there is a lot of data collision when all goroutines feed into a single model.
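
For readers who want to keep a single shared model, one workaround (not suggested in the thread, and it trades throughput for safety) is to serialize forward calls with a sync.Mutex so only one goroutine runs inference at a time. The serialModel type and its forward field below are hypothetical stand-ins for a real model such as net.ForwardT:

```go
package main

import (
	"fmt"
	"sync"
)

// serialModel guards a shared forward function with a mutex so
// concurrent goroutines never run inference on the same model at once.
type serialModel struct {
	mu      sync.Mutex
	forward func(x float64) float64 // placeholder for a real model's forward pass
}

func (m *serialModel) Predict(x float64) float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	// Any per-call cleanup (e.g. dropping a result tensor) would also
	// happen inside this critical section.
	return m.forward(x)
}

func main() {
	m := &serialModel{forward: func(x float64) float64 { return 2 * x }}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(v float64) {
			defer wg.Done()
			fmt.Println(m.Predict(v)) // calls are serialized by the mutex
		}(float64(i))
	}
	wg.Wait()
}
```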

@yinziyang
Author

I created a model for each goroutine and used the corresponding model inside each goroutine, but the issue persists.

package main

import (
    "encoding/json"
    "os"
    "time"

    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch/pickle"
    "github.com/sugarme/gotch/ts"
    "github.com/sugarme/gotch/vision"
)

func getModel() (net nn.FuncT) {
    modelName := "resnet18"
    url, ok := gotch.ModelUrls[modelName]
    if !ok {
        panic("Unsupported model name")
    }
    modelFile, err := gotch.CachedPath(url)
    if err != nil {
        panic(err)
    }

    vs := nn.NewVarStore(gotch.CPU)
    net = vision.ResNet18NoFinalLayer(vs.Root())

    err = pickle.LoadAll(vs, modelFile)
    if err != nil {
        panic(err)
    }

    return
}

func getTensor() (tensor *ts.Tensor) {
    b, err := os.ReadFile("test.data")
    if err != nil {
        panic(err)
    }

    var data []float32
    err = json.Unmarshal(b, &data)
    if err != nil {
        panic(err)
    }

    tensor = ts.MustOfSlice(data).MustView([]int64{3, 224, 224}, true)
    tensor = tensor.MustUnsqueeze(0, true)
    return
}

func main() {

    var goroutineNum = 10

    var nets []nn.FuncT
    for i := 0; i < goroutineNum; i++ {
        nets = append(nets, getModel())
    }

    tensor := getTensor()
    defer tensor.MustDrop()

    for i := 0; i < goroutineNum; i++ {
        net := nets[i]
        go func(net nn.FuncT) {
            for {
                ts.NoGrad(func() {
                    result := net.ForwardT(tensor, false)
                    result.MustDrop()
                })
            }
        }(net)
    }

    time.Sleep(5 * time.Minute)
}

@sugarme
Owner

sugarme commented Oct 30, 2023

@yinziyang ,

I will try to reproduce your problem when I have time this week. However, your latest go func() should not need the input parameter, then.

What about something like this:

for i := 0; i < goroutineNum; i++ {
    go func() {
        net := getModel()
        tensor := getTensor()
        ts.NoGrad(func() {
            result := net.ForwardT(tensor, false)
            result.MustDrop()
        })
        tensor.MustDrop()
    }()
}

@yinziyang
Author

The memory usage still keeps increasing, the key code is as follows:

for i := 0; i < goroutineNum; i++ {
    go func() {
        // goroutine model
        net := getModel()

        // test input tensor
        tensor := getTensor()
        defer tensor.MustDrop()

        // stress test to observe memory increase
        for {
            ts.NoGrad(func() {
                result := net.ForwardT(tensor, false)

                // drop result tensor
                result.MustDrop()
            })
        }
    }()
}

@yinziyang
Author

@sugarme

I understand now. I seem to have found a bug in tensor.go that causes some tensors not to be released.

This is the old code:

	atomic.AddInt64(&TensorCount, 1)
	nbytes := x.nbytes()
	atomic.AddInt64(&AllocatedMem, nbytes)

	lock.Lock()
	if _, ok := ExistingTensors[name]; ok {
		name = fmt.Sprintf("%s_%09d", name, TensorCount)
	}
	ExistingTensors[name] = struct{}{}
	lock.Unlock()

Changed to:

	tensorCount := atomic.AddInt64(&TensorCount, 1)
	nbytes := x.nbytes()
	atomic.AddInt64(&AllocatedMem, nbytes)

	lock.Lock()
	if _, ok := ExistingTensors[name]; ok {
		name = fmt.Sprintf("%s_%09d", name, tensorCount)
	}
	ExistingTensors[name] = struct{}{}
	lock.Unlock()

I just realized you had already made a fix for this issue last week, but I wasn't using your latest code. The problem is resolved now; this issue can be closed.
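
The race in the old code: after atomic.AddInt64, fmt.Sprintf re-reads the shared TensorCount global, so two goroutines can observe the same value and format colliding tensor names; the fixed code uses AddInt64's return value, which is unique per call. A minimal sketch of the fixed pattern (uniqueIDs is a hypothetical helper for illustration, not gotch code):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// uniqueIDs launches n goroutines that each take an ID from a shared
// counter via the *return value* of atomic.AddInt64, and reports how
// many distinct IDs were handed out. Re-reading the counter variable
// after the add instead would allow two goroutines to observe the
// same value.
func uniqueIDs(n int) int {
	var counter int64
	var mu sync.Mutex
	seen := make(map[int64]struct{})

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			id := atomic.AddInt64(&counter, 1) // unique per call
			mu.Lock()
			seen[id] = struct{}{}
			mu.Unlock()
		}()
	}
	wg.Wait()
	return len(seen)
}

func main() {
	fmt.Println("distinct ids:", uniqueIDs(1000)) // all 1000 IDs are distinct
}
```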

@sugarme
Owner

sugarme commented Oct 30, 2023

@yinziyang ,

Thanks for reporting.

@sugarme sugarme closed this as completed Oct 30, 2023