# General notes about things I've learned, not lesson specific

## How to install a python package for dev, or latest from git

Clone the repo, then cd into it and:

    pip install -e .

## Jupyter

### Diff two notebooks

    !pip install nbdime

[nbdime](https://nbdime.readthedocs.io/en/latest/) is recommended [here on stackoverflow](https://stackoverflow.com/questions/18171968/is-there-any-way-to-generate-a-diff-between-two-versions-of-an-ipython-notebook).

I don't recommend running `nbdiff-web` from a notebook: it hangs the notebook, and you might have to kill it from a terminal!

    nbdiff-web --ignore-outputs notes.ipynb 11_notes.ipynb

### Trust a notebook

    !jupyter trust notebook.ipynb

## Git

Of course I still don't know much about git after more than 13 years...!

Maybe I should RTFM and actually learn it with Anki or something.

Piping colored git output to less:

    git -c color.status=always status | less -R

Unstage something from git:
    
    git reset -- file dir

## Cuda

    import torch
    torch.cuda.empty_cache()

\#discussions

[6:24 PM]Nike-Zoldyck: Hey Everyone, I'm curious about what your usual fix is when you run into a CUDA out of memory error, other than lowering batch size . I've set mine to 1 on a GNN model that worked fine before with batch size 5 on similar GPU VM, but i had to replace the machine and create everything from scratch. Doing export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128  didn't help either. 

 — 11/02/2022

[1:38 AM]Alexey Zaytsev: In the end, the only way to fix the issue is to find out what's using the memory. To be honest, I don't know if there is a solid way to find that out after the OOM happened - I had cases where no large tensor seems to exist among python variables, but the GPU memory is not freed, despite attempts at gc.collect() and empty_cache().  Would love to hear if someone has a more systematic way to debug PyTorch CUDA memory issues.

[1:50 PM]satan_99: After reducing the batch size, you can use gradient accumulation.

[2:18 PM]Nike-Zoldyck: I was able to figure it out after some trial and error and trying out all methods.  The loss, preds and targets were all on gpu too. When logging them or performing metrics calculation, they take up a lot of space. Moving them to cpu without detaching them from the computation graph helped fix it and I was able to increase the batch size and run too

[2:19 PM]Nike-Zoldyck: Yeah this didn't work even when the batch size was 1

[2:19 PM]Alexey Zaytsev: Why not detach them before moving?

[2:19 PM]Nike-Zoldyck: Then back prop won't work and your loss will be a constant value after epoch 1, and nothing will train

[2:20 PM]Alexey Zaytsev: Buttt, you leave the loss on gpu and make a copy with .cpu().

[2:20 PM]Alexey Zaytsev: loss = loss_f(y, hat); log_save(loss.cpu()); loss.backward()

[2:21 PM]Nike-Zoldyck: You can use loss as loss.item() instead of the whole tensor and while logging you can do loss.cpu().  For metrics, you can pass preds and targets but do a detach().cpu()

[2:21 PM]Nike-Zoldyck: Yes that's what I meant

[2:22 PM]Alexey Zaytsev: Sorry, I meant. 

[2:22 PM]Alexey Zaytsev: log(loss.detach().cpu())

[2:24 PM]Nike-Zoldyck: I've faced error using detach VS just cpu

[2:24 PM]Nike-Zoldyck: Detaching will remove from computation graphs so there's no way backward

[2:24 PM]Alexey Zaytsev: It makes a copy, original loss  is still attached.

[2:30 PM]Alexey Zaytsev: More importantly, the gradient is back-propagated between devices, .cpu() or .cuda() does not break the link

<!-- ![](https://cdn.discordapp.com/attachments/766837559920951316/1037206929694523402/unknown.png) -->
![](notes/cuda1.png)

[2:36 PM]Alexey Zaytsev: I'm not sure how it works with reference counting. If I del t, I can still call .backward() on tc.sum(), so I assume it just keeps a reference to t and t is not cleared from CUDA memory.

[2:44 PM]Nike-Zoldyck: yeah probably, which ends up taking a lot of space over batches and before the epoch fully completes and everything is reset. so reducing batch size won't matter since it is not cleared or reset until the epoch finishes and there are multiple copies of the same variables

[2:46 PM]Alexey Zaytsev: Exactly. Which is why it needs to be detached. .item() does it internally.

[2:47 PM]Alexey Zaytsev:

<!-- ![](https://cdn.discordapp.com/attachments/766837559920951316/1037211289199595620/unknown.png) -->
![](notes/cuda2.png)

[3:20 PM]Nike-Zoldyck: hmm, i've read that these two methods might release the memory but not to Python

[3:21 PM]Nike-Zoldyck: so you're saying if we detach it, it doesn't release the cache because it didn't keep the memory in the first place? 

[3:23 PM]Alexey Zaytsev: No, I'm saying, if you call .cpu() without .detach(), the resulting tensor holds a reference to the tensor that still lives in GPU memory (because .cpu() does not move, but copies a tensor), and that GPU tensor can't be freed until you free the CPU copy of the tensor. 

[3:24 PM]Alexey Zaytsev: Of course, only applies to tensors that have requires_grad=True.

[3:24 PM]Nike-Zoldyck: Oh , that makes sense !
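My own minimal sketch of the pattern the thread converges on, not code from the discussion. The tiny `Linear` model, the random batch, and the variable names are all made up for illustration, and it runs on CPU, where `.cpu()` is a no-op but the autograd behaviour is the same. It shows that `.cpu()` alone keeps the graph attached, while `.item()` and `.detach().cpu()` give values that are safe to keep around for logging.

```python
import torch

# Hypothetical stand-ins for a real model and batch.
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

preds = model(x)
loss = torch.nn.functional.mse_loss(preds, y)

# .cpu() without .detach(): the copy still carries grad_fn, so it keeps
# the autograd graph (and the tensors it references) alive.
attached_copy = loss.cpu()
print(attached_copy.grad_fn is not None)

# Values that are safe to store across batches: no graph attached.
logged_loss = loss.item()            # plain Python float
logged_preds = preds.detach().cpu()  # tensor with no grad_fn
print(type(logged_loss).__name__, logged_preds.requires_grad)

# The original loss is untouched by the copies, so backprop still works.
loss.backward()
print(model.weight.grad is not None)
```

Anything appended to a list of metrics over an epoch should be one of the detached forms; otherwise every stored entry pins its whole graph in memory.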

[3:35 PM]Alexey Zaytsev: Another source of annoying memory leaks I just found - ipython  keeps all cell outputs in _<n> variables for all previous cells executed. So if your tensor is the cell output (part of the final expression), it won't be freed, because a reference to it is held by ipython.

[3:38 PM]Alexey Zaytsev:

<!-- ![](https://cdn.discordapp.com/attachments/766837559920951316/1037224105549766666/unknown.png) -->
![](notes/cuda3.png)

[3:40 PM]Alexey Zaytsev: I feel this is actually a big deal, and the source of the weird CUDA leak I mention in the beginning, where something holds references to variables that you assume are gone.

[3:41 PM]Alexey Zaytsev: The solution is probably to pass --InteractiveShell.cache_size=0 to ipython.

[3:56 PM]Alexey Zaytsev: And you can't even use %config InteractiveShell.cache_size=0, you have to put it into the ipython config.

[3:56 PM]Alexey Zaytsev: Yes, 🤯

[3:57 PM]Nike-Zoldyck: I've never run into this issue on notebooks tho. Always when running scripts on terminal. those fixes probably won't work for the regular case right?

[3:59 PM]Alexey Zaytsev: Yes, this only applies to notebooks and ipython in general.

[4:06 PM]Alexey Zaytsev: @jeremyhoward , did you know about this? I feel it's a pretty significant source of suffering when using notebooks and dealing with weird CUDA OOMs and more people should be aware of it.

[5:00 PM]jeremyhoward: in theory, yes i knew about it. in practice, i haven't considered it nearly as much as i should have

[5:02 PM]jeremyhoward: btw for a bit less typing i think you could use .data instead of .detach()

[5:03 PM]jeremyhoward: also fastai's to_np() does it for you

[5:05 PM]jeremyhoward: apparently this is meant to work %config ZMQInteractiveShell.cache_size = 0

[5:06 PM]jeremyhoward: %reset -f out is meant to remove all stuff in the cache
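A plain-Python sketch of why the output cache matters, as I understand it. `FakeTensor` and `out_cache` are hypothetical stand-ins for a large GPU tensor and for IPython's output history (`Out` / `_<n>`); no IPython or PyTorch is involved. The point is only that `del` on your own name does not free an object while some cache still references it.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a large GPU tensor."""


# Hypothetical stand-in for IPython's output history (Out / _<n>).
out_cache = {}

t = FakeTensor()
alive = weakref.ref(t)  # lets us check whether the object still exists
out_cache[1] = t        # what happens when a cell ends with the expression `t`

del t
gc.collect()
# The cache still holds a reference, so the object survives the del.
still_alive_after_del = alive() is not None

out_cache.clear()       # roughly the effect `%reset -f out` is meant to have
gc.collect()
# With no references left, the object is gone.
freed_after_clear = alive() is None

print(still_alive_after_del, freed_after_clear)
```

With a real tensor you would still need `torch.cuda.empty_cache()` afterwards if you want the freed blocks handed back to the driver.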

## Miscellaneous

- A fast.ai `model.pkl` file is actually a zip file containing an `archive` dir, with an `archive/data.pkl` file and other data files.
- I can do two prompts at once using diffusers on my 2060S (8GB) now.
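
A quick way to check the zip claim in the first bullet, using only the standard library. `list_archive_members` is a helper name I made up, and the demo builds a throwaway zip that mimics the layout described above rather than reading a real exported model.

```python
import os
import tempfile
import zipfile


def list_archive_members(path):
    """Return member names if `path` is a zip archive, else an empty list."""
    if not zipfile.is_zipfile(path):
        return []
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()


# Demo on a placeholder file, not a real fast.ai export.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model.pkl")
    with zipfile.ZipFile(demo_path, "w") as zf:
        zf.writestr("archive/data.pkl", b"placeholder")
    members = list_archive_members(demo_path)

print(members)  # ['archive/data.pkl']
```

Pointing `list_archive_members` at a real `model.pkl` should list the actual members, assuming the file was saved in the zip-based format.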