Putting tensors on different devices does not reduce GPU memory use #86780

@serend1p1ty

Description

🐛 Describe the bug

I tried to use model parallelism with PyTorch.

First, I put all the Linear layers on one CUDA device.

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(1000, 1000)
        self.linear2 = nn.Linear(1000, 1000)

net = Net()
net.cuda(6)  # move every parameter to GPU 6

Then I observe that PyTorch occupies 1421MiB of memory on device 6.

+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:DB:00.0 Off |                    0 |
| N/A   33C    P0    54W / 300W |   1421MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:DC:00.0 Off |                    0 |
| N/A   32C    P0    56W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
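As a sanity check (a minimal sketch, assuming the model above is already on GPU 6), the caching allocator's own statistics can be queried. Two Linear(1000, 1000) layers hold only about 8MB of parameters, so most of the 1421MiB that nvidia-smi reports is not the model itself:

import torch

# Bytes handed out for live tensors on device 6. Two Linear(1000, 1000)
# layers hold ~2 * (1000*1000 + 1000) * 4 bytes ≈ 8 MB of parameters.
print(torch.cuda.memory_allocated(6) / 1024**2, "MiB allocated")
# Total memory the caching allocator has reserved from the driver.
print(torch.cuda.memory_reserved(6) / 1024**2, "MiB reserved")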

Second, I put the two Linear layers on different devices.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(1000, 1000)
        self.linear2 = nn.Linear(1000, 1000)

net = Net()
net.linear1.cuda(6)  # first layer on GPU 6
net.linear2.cuda(7)  # second layer on GPU 7

Then I observe that PyTorch still occupies 1421MiB on each GPU. Why does model parallelism not save GPU memory? Ideally, each GPU would use only about half of the single-device footprint, roughly 710MiB.

+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:DB:00.0 Off |                    0 |
| N/A   33C    P0    55W / 300W |   1421MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:DC:00.0 Off |                    0 |
| N/A   32C    P0    56W / 300W |   1421MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
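Running the same check on both devices should confirm that the parameters really were split. Most of the 1421MiB per device is likely the per-device CUDA context plus the allocator's cache, which a process pays once on every GPU it touches, rather than the tensors themselves. A minimal sketch, assuming the split model from above:

import torch

# Compare allocator statistics on the two devices (assumes net.linear1
# is on cuda:6 and net.linear2 is on cuda:7, as in the snippet above).
for dev in (6, 7):
    alloc = torch.cuda.memory_allocated(dev) / 1024**2
    reserved = torch.cuda.memory_reserved(dev) / 1024**2
    # Each device holds roughly one Linear(1000, 1000), i.e. ~4 MB of
    # parameters; the remainder of the nvidia-smi figure is context and
    # allocator overhead.
    print(f"cuda:{dev}: {alloc:.1f} MiB allocated, {reserved:.1f} MiB reserved")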

Versions

Collecting environment information...
PyTorch version: 1.10.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Alibaba Group Enterprise Linux Server 7.2 (Paladin) (x86_64)
GCC version: (GCC) 5.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-327.ali2018.alios7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 440.64.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] torch==1.10.0+cu102
[pip3] torchvision==0.11.1+cu102
[conda] numpy                     1.19.5                    <pip>
[conda] torch                     1.10.0+cu102              <pip>
[conda] torchvision               0.11.1+cu102              <pip>
