# Lab 02a: PyTorch Lightning

## What You Will Learn

* The core components of a PyTorch Lightning training loop: `LightningModule`s and `Trainer`s
* Useful quality-of-life improvements offered by PyTorch Lightning: `LightningDataModule`s, `Callback`s, and `Metric`s
* How we use these features in the FSDL codebase

## Setup

In [1]:
import os
from pathlib import Path
current_dir = Path.cwd()

print(current_dir)

lab_idx = 2
lab_name = f"lab{str(lab_idx).zfill(2)}"
my_fsdl = "my-fsdl-text-recognizer-2022"

# change into the lab directory unless we are already there
if current_dir.name != lab_name:
    os.chdir(current_dir / my_fsdl / lab_name)

Path.cwd()

/Users/tomlu/Workspace


PosixPath('/Users/tomlu/Workspace/my-fsdl-text-recognizer-2022/lab02')

## Why Lightning

PyTorch is a powerful library for executing differentiable tensor operations with hardware acceleration, and it includes many neural network primitives, but it has no concept of "training". At a high level, an `nn.Module` is a stateful function with gradients and a `torch.optim.Optimizer` can update that state using gradients, but there are no pre-built tools in PyTorch to iteratively generate those gradients from data.

So the first thing many folks do in PyTorch is write that code -- a "training loop" to iterate over their `DataLoader`, which in pseudocode might look something like:

```Python
for batch in dataloader:
    inputs, targets = batch

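    outputs = model(inputs)
    loss = some_loss_function(outputs, targets)  # compare predictions to targets

    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()  # compute gradients of the loss w.r.t. the parameters

    optimizer.step()  # update the parameters using those gradients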
```
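To make the loop above concrete, here is a minimal runnable sketch in plain PyTorch. The toy regression data, the `nn.Linear` model, the learning rate, and the `run_epoch` helper are all assumptions made up for illustration, not part of the FSDL codebase.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# toy data (an assumption for illustration): y = 2x + 1 plus a little noise
xs = torch.linspace(-1, 1, 64).unsqueeze(1)
ys = 2 * xs + 1 + 0.01 * torch.randn_like(xs)
dataloader = DataLoader(TensorDataset(xs, ys), batch_size=16, shuffle=True)

model = nn.Linear(1, 1)
loss_function = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def run_epoch():
    """Make one pass over the data and return the mean batch loss."""
    total = 0.0
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = loss_function(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total += loss.item()
    return total / len(dataloader)


first_loss = run_epoch()
for _ in range(49):
    last_loss = run_epoch()

print(f"mean loss: {first_loss:.4f} -> {last_loss:.4f}")
```

Even in this small version, the epoch loop and the loss bookkeeping are code you have to write and maintain yourself.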

This is a solid start, but other needs immediately arise. You'll want to run your model on validation and test data, which need their own `DataLoader`s. Once finished, you'll want to save your model -- and for long-running jobs, you probably want to save checkpoints of the training process so that it can be resumed in case of a crash. For state-of-the-art model performance in many domains, you'll want to distribute your training across multiple nodes/machines and across multiple GPUs within those nodes.

PyTorch Lightning is a popular framework built on top of PyTorch that addresses these needs.

In [2]:
import pytorch_lightning as pl

version = pl.__version__

docs_url = f"https://pytorch-lightning.readthedocs.io/en/{version}/"
docs_url

'https://pytorch-lightning.readthedocs.io/en/0.8.5/'

At its core, PyTorch Lightning provides
1. the `pl.Trainer` class, which organizes and executes your training, validation, and test loops, and
2. the `pl.LightningModule` class, which links optimizers to models and defines how the model behaves during training, validation, and testing.

Both of these are kitted out with all the features a cutting-edge deep learning codebase needs:
* flags for switching device types and distributed computing strategy
* saving, checkpointing, and resumption
* calculation and logging of metrics

and much more.

Importantly, these features can be easily added, removed, extended, or bypassed as desired, meaning your code isn't constrained by the framework.

In some ways, you can think of Lightning as a tool for "organizing" your PyTorch code, as shown in the video below.

In [3]:
import IPython.display as display


display.IFrame(src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/pl_mod_vid.m4v",
               width=720, height=720)


That's as opposed to the other common way frameworks are designed: providing abstractions over the lower-level library (here, PyTorch).

Because of this "organize don't abstract" style, writing PyTorch Lightning code involves a lot of overriding of methods -- you inherit from a class and then implement the specific version of a general method that you need for your code, rather than Lightning providing a bunch of already fully-defined classes that you just instantiate, using arguments for configuration.