# Cornell Demo
DB Seminar, Spring 2022.
Rolando Garcia, UC Berkeley.
rogarcia@berkeley.edu

# <- What's in the repo?
0. This is a vanilla Jupyter Notebook, running in VS Code
1. Show README
2. Let's see some code

# -> Let's see train_rnn.py

The structure of train_rnn.py can be summarized as follows:
```python
import flor
import torch

trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

for epoch in flor.it(range(...)):
    if flor.SkipBlock.step_into('training_loop'):
        for data in trainloader:
            inputs, labels = data
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            print(f"loss: {loss.item()}")
    flor.SkipBlock.end(net, optimizer)
    eval(net, testloader)
```

Brief overview of Record-Replay:
* Record:
    * `flor.SkipBlock.end` serializes and writes a partial checkpoint
    * changes are auto-committed to the repository (on a special branch)
* Replay:
    * `flor.it` restores its starting state from a checkpoint (parallelism)
    * `flor.SkipBlock` may skip the block and load its side effects from disk instead (memoization)
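
The skip-or-run idea can be sketched in plain Python. This is a toy illustration, not flor's implementation: `run_epochs`, the `state` dict, and the per-epoch pickle files are all hypothetical stand-ins for the model, the optimizer, and flor's checkpoints.

```python
import pickle
import tempfile
from pathlib import Path


def run_epochs(num_epochs, ckpt_dir, replay=False):
    """Toy record/replay loop; `state` stands in for model and optimizer state."""
    state = {"weight": 0.0}
    executed = []  # epochs whose training block actually ran
    for epoch in range(num_epochs):
        ckpt = Path(ckpt_dir) / f"epoch_{epoch}.pkl"
        if replay and ckpt.exists():
            # Replay: skip the block and load its side effects from disk.
            state = pickle.loads(ckpt.read_bytes())
        else:
            # Record: run the expensive block, then checkpoint its side effects.
            state = {"weight": state["weight"] + 1.0}  # stand-in for training
            executed.append(epoch)
            ckpt.write_bytes(pickle.dumps(state))
    return state, executed


with tempfile.TemporaryDirectory() as ckpt_dir:
    recorded_state, recorded_epochs = run_epochs(3, ckpt_dir)
    replayed_state, replayed_epochs = run_epochs(3, ckpt_dir, replay=True)
```

On replay, no training block runs, yet the final state matches the recorded one. That is the memoization half; the parallelism half comes from starting a loop at any checkpointed epoch.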

# "I don't want to learn a new API"
# -> flor has hands-free mode
Side-by-side comparison. I want to show you what I'm doing.
```bash
python -c "import flor; flor.transformer.Transform('...')"
```

# "I don't want this to slow training"
# -> Overhead is negligible
Fast Record (<6% overhead): Buffering, Write-Behind, Background Serialization/IO, Physiological Logging
![Record Plot](doc/img/record.png)
Figure from [Garcia et al. VLDB'21]
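
To give a feel for write-behind with background serialization and IO, here is a minimal sketch using a queue and a writer thread. It is an assumption-laden toy: the names `writer_loop` and `buffer` are hypothetical, and flor's actual mechanism may differ.

```python
import pickle
import queue
import tempfile
import threading
from pathlib import Path


def writer_loop(buffer):
    """Drain the buffer, serializing and writing each snapshot off the main thread."""
    while True:
        item = buffer.get()
        if item is None:  # sentinel: no more checkpoints
            break
        path, obj = item
        path.write_bytes(pickle.dumps(obj))


with tempfile.TemporaryDirectory() as out_dir:
    buffer = queue.Queue()
    writer = threading.Thread(target=writer_loop, args=(buffer,))
    writer.start()

    for epoch in range(3):
        state = {"epoch": epoch, "weight": float(epoch)}  # stand-in for training
        # Enqueue a snapshot copy and keep going; the write happens behind us.
        buffer.put((Path(out_dir) / f"ckpt_{epoch}.pkl", dict(state)))

    buffer.put(None)
    writer.join()  # wait for pending writes before reading results

    written = sorted(p.name for p in Path(out_dir).iterdir())
    restored = pickle.loads((Path(out_dir) / "ckpt_2.pkl").read_bytes())
```

The training loop only pays for the copy and the enqueue; serialization and disk IO overlap with the next epoch.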

# Flor & Git
Model developers iterate quickly to try many ideas. We want to store every version of model training that was tried, so every run is auto-committed.
* Show timeline

In [None]:
!git log

In [None]:
!git branch

# Let's explore the Model Training History
Exploratory model development

In [None]:
import os
os.getcwd()

In [None]:
import flor
import numpy as np
import plotly.express as px  # used by the 3D scatter plots below

Fact table with all the data logged so far:

In [None]:
raw_df = flor.load_kvs()
raw_df

### -> The table is populated with logged data
Let's see the logging statements in train_rnn.py

In [None]:
df = raw_df[['tstamp', 'epoch', 'step', 'name', 'alpha', 'value']]
df

In [None]:
cols = ['tstamp', 'epoch', 'step', 'name', 'value']
record_df = df.loc[df['alpha'] == 'a', cols]
replay_df = df.loc[df['alpha'] == 'b', cols]
record_df['name'].unique(), replay_df['name'].unique() # What did I log in the past? What did the other students log?

In [None]:
df = record_df
avg_train_loss = df[df['name'] == 'avg_train_loss']
avg_train_loss_agg = avg_train_loss.groupby(['tstamp', 'epoch']).agg({'value': 'mean'}).reset_index()
avg_train_loss_agg['tstamp'] = avg_train_loss_agg['tstamp'].map(str)
avg_train_loss_agg # Rollup

fig = px.scatter_3d(avg_train_loss_agg, x='tstamp', y='epoch', z='value', color='tstamp')
fig.show()

In [None]:
df = record_df
avg_val_loss = df[df['name'] == 'average_valid_loss']
avg_valid_loss_agg = avg_val_loss.groupby(['tstamp', 'epoch']).agg({'value': 'mean'}).reset_index()
avg_valid_loss_agg['tstamp'] = avg_valid_loss_agg['tstamp'].map(str)
avg_valid_loss_agg

m_df = avg_train_loss_agg.merge(avg_valid_loss_agg, on=['tstamp', 'epoch'])
m_df['diff'] = m_df['value_x'] - m_df['value_y']  # train_loss - val_loss
m_df

fig = px.scatter_3d(m_df, x='tstamp', y='epoch', z='diff', color='tstamp')
fig.show()

# Let's do some hindsight logging
* Skip Retraining when possible
    * Use memoization: observe physical-logical equivalence
* Parallelize Retraining otherwise
    * Enable resuming from a checkpoint
    * Work Partitioning: Control the epoch sub-range from the command-line
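
Work partitioning can be sketched as splitting the epoch range into contiguous sub-ranges, one per worker. This is a hypothetical helper for illustration (`partition_epochs` is not a flor function); each worker would resume from the checkpoint at the start of its sub-range.

```python
def partition_epochs(num_epochs, num_workers):
    """Split range(num_epochs) into contiguous, near-equal sub-ranges."""
    base, extra = divmod(num_epochs, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional epoch each.
        size = base + (1 if worker < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


parts = partition_epochs(10, 4)
epoch_lists = [list(r) for r in parts]
```

With 10 epochs and 4 workers, the sub-ranges cover epochs 0-2, 3-5, 6-7, and 8-9.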

### -> Add print statement and replay latest version
And show mechanics of code below.
What does it do?

In [1]:
!python train_rnn.py --replay_flor

Traceback (most recent call last):
  File "train_rnn.py", line 10, in <module>
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
ModuleNotFoundError: No module named 'sklearn'


In [None]:
df = raw_df = flor.load_kvs()
cols = ['tstamp', 'epoch', 'step', 'name', 'value']
record_df = df.loc[df['alpha'] == 'a', cols]
replay_df = df.loc[df['alpha'] == 'b', cols]
record_df['name'].unique(), replay_df['name'].unique() # What did I log in the past? What did the other students log?

In [None]:
# Which versions have I replayed?
replay_df[replay_df['name'] == 'learning_rate']['tstamp'].unique()

### -> Propagate logging statements back in time
And show mechanics of code below.
What does it do?

In [None]:
raw_df[['tstamp', 'vid']][
    raw_df['tstamp'] >= np.datetime64('2022-02-10')
    ].drop_duplicates()

In [None]:
!python -m flor stage train_rnn.py

In [None]:
!git checkout d9973057cb00a470ab29763679fd8d7f84eec1b0

In [None]:
!python -m flor propagate train_rnn.py

In [None]:
!python train_rnn.py --replay_flor