Background
- After working on a Seq2Seq model with attention using only LightningModules, I was still getting device-mismatch errors.
- This also occurred with PackedSequence objects, but I think that is due to a CPU-specific implementation detail in PyTorch, not Lightning.
- The following Attention class required a .cuda() call (here, .to(DEVICE)).
- The class is called from another LightningModule during sequence decoding, which is in turn called by the training/testing loop.

- Without the .to(DEVICE) call, we get:

```
  File "/content/11785-hw4-lightning/models.py", line 46, in forward
    energy.masked_fill_(mask, -1e9) # (N, T_max)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #2 'mask' in call to _th_masked_fill_bool_

```






This is an issue with ```torch.arange()```, not the boolean comparison: unless it is given a ```device``` argument, ```torch.arange()``` allocates its result on the default device, which is the CPU. I'm not sure whether the problem extends to every CPU tensor that torch creates inside a LightningModule, but I suspect it does.
As such, simply removing all ```to('cuda')``` calls can break things.



In [None]:
class Attention(pl.LightningModule):
    '''
    Attention is calculated using key, value and query from Encoder and decoder.
    Below are the set of operations you need to perform for computing attention:
        energy = bmm(key, query)
        attention = softmax(energy)
        context = bmm(attention, value)
    '''
    def __init__(self):
        super(Attention, self).__init__()

    def forward(self, query, key, value, lens):
        '''
        :param query: (N, context_size) Query is the output of LSTMCell from Decoder
        :param key: (N, T_max, key_size) Key Projection from Encoder per time step
        :param value: (N, T_max, value_size) Value Projection from Encoder per time step
        :param lens: (N,) Length of each key/value sequence, used for binary masking
        :return output: (N, value_size) Attended Context
        :return attention: (N, T_max) Attention weights that can be plotted
        '''

        energy = torch.bmm(key, query.unsqueeze(2)).squeeze(2) # (N, T_max, key_size) * (N, context_size, 1) = (N, T_max, 1) -> (N, T_max)

        # binary masking for padded positions
        # torch.arange() returns a CPU tensor here, so the mask starts out on the CPU
        mask = torch.arange(key.size(1)).unsqueeze(0) >= lens.unsqueeze(1) # (1, T_max) >= (N, 1) -> (N, T_max)
        mask = mask.to(DEVICE)
        energy.masked_fill_(mask, -1e9) # (N, T_max)
        attention = nn.functional.softmax(energy, dim=1) # (N, T_max)
        output = torch.bmm(attention.unsqueeze(1), value).squeeze(1) # (N, value_size)

        return output, attention
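
One way to drop the hard-coded DEVICE from this class is to create the mask directly on the device of a tensor that is already in the right place. The following is only a sketch under that assumption: `masked_attention` is a hypothetical standalone version of `Attention.forward`, and it assumes `key` has already been moved to the target device (for example because it derives from the batch).

```python
import torch

def masked_attention(query, key, value, lens):
    # Hypothetical standalone version of Attention.forward (not the original class).
    # query: (N, D), key: (N, T_max, D), value: (N, T_max, V), lens: (N,)
    energy = torch.bmm(key, query.unsqueeze(2)).squeeze(2)  # (N, T_max)

    # Create the position index on the same device as `key` instead of the CPU
    positions = torch.arange(key.size(1), device=key.device)  # (T_max,)
    mask = positions.unsqueeze(0) >= lens.to(key.device).unsqueeze(1)  # (N, T_max)

    energy = energy.masked_fill(mask, -1e9)  # padded positions get ~zero weight
    attention = torch.softmax(energy, dim=1)  # (N, T_max)
    output = torch.bmm(attention.unsqueeze(1), value).squeeze(1)  # (N, V)
    return output, attention
```

Because the device is taken from `key`, the same code runs unchanged on CPU or GPU, and no global DEVICE constant is needed.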



I think this is an issue with ```torch.arange()``` also because, in a separate case, I use a masked loss with the following function invoked from a LightningModule:

```
def generate_mask(lens):
    lens = torch.tensor(lens).to(DEVICE)
    max_len = torch.max(lens)
    mask = (torch.arange(0, max_len).repeat(lens.size(0), 1).to(DEVICE) < \
                lens.unsqueeze(1).expand(lens.size(0), max_len)).int()
    return mask
    
```
Note the use of ```.to(DEVICE)``` here. However, this is more forgivable, since my hacky mask function is 1) not a LightningModule and 2) declared in the global scope of the script (though, as I said, it is *called* from within a LightningModule).

Here, I also get the device error.


In [None]:
    def training_step(self, batch, batch_idx):

        ...

        logits = self.model(speech, speech_len, batch_idx, target, isTrain=True)
        ...

        mask = generate_mask(target_len).to(DEVICE)
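
A possible device-agnostic variant of the mask helper is sketched below. `generate_mask_on_device` and its `device` parameter are hypothetical names, not part of the original script. The idea is that the caller passes the device of a tensor it already has, so neither the helper nor `training_step` needs the global DEVICE.

```python
import torch

def generate_mask_on_device(lens, device=None):
    # Hypothetical variant of generate_mask.
    # lens: list of ints or 1-D tensor of sequence lengths
    # device: where the mask should live; None keeps the device of `lens`
    lens = torch.as_tensor(lens, device=device)  # (N,)
    max_len = int(lens.max())
    # Create the position index directly on the target device
    positions = torch.arange(max_len, device=lens.device)  # (max_len,)
    # Broadcasting: (1, max_len) < (N, 1) -> (N, max_len)
    return (positions.unsqueeze(0) < lens.unsqueeze(1)).int()
```

In `training_step` this could presumably be called as `generate_mask_on_device(target_len, device=logits.device)`, assuming `logits` is already on the training device.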

What to do? A minimal example!


In [1]:
!pip install -q pytorch_lightning

In [20]:
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader,Dataset
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision import transforms
import os

# issue: a tensor created outside the main pytorch-lightning module, or in a plain function, is not migrated to the device.


class LitModel(pl.LightningModule):

     def __init__(self):
         super().__init__()
         self.l1 = torch.nn.Linear(28 * 28, 10)
         self.masker = TrivialMasking()

     def forward(self, x):
        x = self.l1(x.view(x.size(0), -1))
        # x = self.masker(x)

        return torch.relu(x)

     def training_step(self, batch, batch_idx):
         x, y = batch
         y_hat = self(x)
        #  y_hat = self.masker(y_hat)

         loss = F.cross_entropy(y_hat, y)

         loss = self.masker(loss.unsqueeze(0)).squeeze()
         return loss

     def configure_optimizers(self):
         return torch.optim.Adam(self.parameters(), lr=0.02)

    

class TrivialMasking(pl.LightningModule):
    def __init__(self):
        super(TrivialMasking, self).__init__()

    def forward(self, input_tensor):
        '''
        '''
        mask = torch.arange(len(input_tensor))
        return input_tensor*mask

In [21]:
train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()),num_workers=16)
trainer = pl.Trainer(fast_dev_run=True,gpus=1,progress_bar_refresh_rate=50)
model = LitModel()
trainer.fit(model, train_loader)

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Running in fast_dev_run mode: will run a full train, val and test loop using 1 batch(es)

  | Name   | Type           | Params
------------------------------------------
0 | l1     | Linear         | 7.9 K 
1 | masker | TrivialMasking | 0     
------------------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params






RuntimeError: ignored

Now, fixing it with .cuda():

In [22]:
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader,Dataset
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision import transforms
import os

# issue: a tensor created outside the main pytorch-lightning module, or in a plain function, is not migrated to the device.


class LitModel(pl.LightningModule):

     def __init__(self):
         super().__init__()
         self.l1 = torch.nn.Linear(28 * 28, 10)
         self.masker = TrivialMasking()

     def forward(self, x):
        x = self.l1(x.view(x.size(0), -1))
        # x = self.masker(x)

        return torch.relu(x)

     def training_step(self, batch, batch_idx):
         x, y = batch
         y_hat = self(x)
        #  y_hat = self.masker(y_hat)

         loss = F.cross_entropy(y_hat, y)

         loss = self.masker(loss.unsqueeze(0)).squeeze()
         return loss

     def configure_optimizers(self):
         return torch.optim.Adam(self.parameters(), lr=0.02)

    

class TrivialMasking(pl.LightningModule):
    def __init__(self):
        super(TrivialMasking, self).__init__()

    def forward(self, input_tensor):
        '''
        '''
        mask = torch.arange(len(input_tensor)).cuda()
        return input_tensor*mask


train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()),num_workers=16)
trainer = pl.Trainer(fast_dev_run=True,gpus=1,progress_bar_refresh_rate=50)
model = LitModel()
trainer.fit(model, train_loader)

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Running in fast_dev_run mode: will run a full train, val and test loop using 1 batch(es)

  | Name   | Type           | Params
------------------------------------------
0 | l1     | Linear         | 7.9 K 
1 | masker | TrivialMasking | 0     
------------------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params






1

So, the issue is reproduced, just as with the more complex Attention pipeline. How can we fix this? We need to ensure that any tensor torch creates inside a LightningModule with ```on_gpu == True``` is already on the relevant device rather than on the CPU. Otherwise I can see Lightning being crippled as usage increases and projects are migrated.
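
Until something like that exists, two user-side workarounds avoid the hard-coded `.cuda()` in the minimal example. The sketch below uses plain `torch.nn.Module` so that it runs without Lightning installed; since LightningModule subclasses `nn.Module`, the same pattern should carry over. `DeviceAwareMasking` and `BufferedMasking` are hypothetical names, not part of the original example.

```python
import torch
from torch import nn

class DeviceAwareMasking(nn.Module):
    # Variant of TrivialMasking: the mask is created on the input's device
    def forward(self, input_tensor):
        mask = torch.arange(len(input_tensor), device=input_tensor.device)
        return input_tensor * mask

class BufferedMasking(nn.Module):
    # Alternative for a mask whose maximum size is known up front:
    # a registered buffer moves along with the module on .to() / .cuda()
    def __init__(self, size):
        super().__init__()
        self.register_buffer("mask", torch.arange(size))

    def forward(self, input_tensor):
        return input_tensor * self.mask[: len(input_tensor)]

x = torch.tensor([2.0, 3.0, 4.0])
print(DeviceAwareMasking()(x).device == x.device)  # True: the mask followed the input
```

The first variant suits masks whose shape depends on the input. The second suits fixed tensors, because buffers are moved together with the module's parameters.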
