# Installation

In [None]:
! pip install flordb

# Getting Started

We start by selecting (or creating) a `git` repository to save our model training code as we iterate and experiment. Flor automatically commits your changes on every run, so no change is lost. Below we provide a sample repository you can use to follow along:

In [1]:
!git clone git@github.com:ucbepic/ml_tutorial ../ml_tutorial

Cloning into '../ml_tutorial'...
remote: Enumerating objects: 56, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 56 (delta 20), reused 44 (delta 12), pack-reused 0
Receiving objects: 100% (56/56), 15.29 KiB | 1.91 MiB/s, done.
Resolving deltas: 100% (20/20), done.


In [2]:
import os
os.chdir('../ml_tutorial/')

Run the `train.py` script to train a small linear model, 
and test your `flordb` installation.

In [4]:
! python train.py --flor myFirstRun

Epoch [1/5], Step [100/1875], Loss: 0.1726
Epoch [1/5], Step [200/1875], Loss: 0.4662
Epoch [1/5], Step [300/1875], Loss: 0.1456
Epoch [1/5], Step [400/1875], Loss: 0.1074
Epoch [1/5], Step [500/1875], Loss: 0.2317
Epoch [1/5], Step [600/1875], Loss: 0.2387
Epoch [1/5], Step [700/1875], Loss: 0.3520
Epoch [1/5], Step [800/1875], Loss: 0.1224
Epoch [1/5], Step [900/1875], Loss: 0.2337
Epoch [1/5], Step [1000/1875], Loss: 0.0819
Epoch [1/5], Step [1100/1875], Loss: 0.0921
Epoch [1/5], Step [1200/1875], Loss: 0.1345
Epoch [1/5], Step [1300/1875], Loss: 0.1298
Epoch [1/5], Step [1400/1875], Loss: 0.0987
Epoch [1/5], Step [1500/1875], Loss: 0.1604
Epoch [1/5], Step [1600/1875], Loss: 0.0550
Epoch [1/5], Step [1700/1875], Loss: 0.1614
Epoch [1/5], Step [1800/1875], Loss: 0.0570
Epoch [2/5], Step [100/1875], Loss: 0.0634
Epoch [2/5], Step [200/1875], Loss: 0.2885
Epoch [2/5], Step [300/1875], Loss: 0.0539
Epoch [2/5], Step [400/1875], Loss: 0.3525
Epoch [2/5], Step [500/1875], Loss: 0.0433
Ep

Flor manages checkpoints, logs, command-line arguments, code changes, and other experiment metadata on each run (more details [below](#storage--data-layout)). All of this data is then exposed to the user via SQL or Pandas queries.


# View your experiment history
From the same directory where you ran the examples above, open an IPython terminal, then load and pivot the log records.


In [5]:
from flor import full_pivot, log_records
df = full_pivot(log_records())

df.head()

Unnamed: 0,projid,runid,tstamp,vid,epoch,step,loss,lr,epochs,hidden,batch_size
0,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,100,0.1726001054048538,0.001,5,500,32
1,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,200,0.4662041068077087,0.001,5,500,32
2,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,300,0.1455779373645782,0.001,5,500,32
3,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,400,0.1074397191405296,0.001,5,500,32
4,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,500,0.2316924929618835,0.001,5,500,32


# Run some more experiments
The `train.py` script has been prepared in advance to define and manage four different hyper-parameters:

In [6]:
%cat train.py | grep flor.arg

hidden_size = flor.arg("hidden", default=500)
num_epochs = flor.arg("epochs", 5)
batch_size = flor.arg("batch_size", 32)
learning_rate = flor.arg("lr", 1e-3)


You can control any of the hyper-parameters (e.g. `hidden`) using Flor's command-line interface:

In [7]:
! python train.py --flor mySecondRun --hidden 75

Epoch [1/5], Step [100/1875], Loss: 0.7625
Epoch [1/5], Step [200/1875], Loss: 0.4340
Epoch [1/5], Step [300/1875], Loss: 0.3457
Epoch [1/5], Step [400/1875], Loss: 0.4310
Epoch [1/5], Step [500/1875], Loss: 0.3177
Epoch [1/5], Step [600/1875], Loss: 0.3360
Epoch [1/5], Step [700/1875], Loss: 0.1061
Epoch [1/5], Step [800/1875], Loss: 0.2862
Epoch [1/5], Step [900/1875], Loss: 0.4795
Epoch [1/5], Step [1000/1875], Loss: 0.3659
Epoch [1/5], Step [1100/1875], Loss: 0.1897
Epoch [1/5], Step [1200/1875], Loss: 0.1757
Epoch [1/5], Step [1300/1875], Loss: 0.1298
Epoch [1/5], Step [1400/1875], Loss: 0.3751
Epoch [1/5], Step [1500/1875], Loss: 0.2477
Epoch [1/5], Step [1600/1875], Loss: 0.1193
Epoch [1/5], Step [1700/1875], Loss: 0.6109
Epoch [1/5], Step [1800/1875], Loss: 0.1453
Epoch [2/5], Step [100/1875], Loss: 0.3296
Epoch [2/5], Step [200/1875], Loss: 0.1020
Epoch [2/5], Step [300/1875], Loss: 0.4109
Epoch [2/5], Step [400/1875], Loss: 0.0651
Epoch [2/5], Step [500/1875], Loss: 0.0818
Ep
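
To make the override behavior concrete, here is a minimal sketch of how a `flor.arg`-style helper might resolve a value: it uses the command-line value when the flag is present and falls back to the default otherwise. The `resolve_arg` helper and its parsing rules are assumptions for illustration, not Flor's actual implementation.

```python
import sys


def resolve_arg(name, default, argv=None):
    """Return the value passed as `--name value`, else `default`.

    Hypothetical stand-in for illustration; not Flor's implementation.
    """
    argv = sys.argv[1:] if argv is None else argv
    flag = f"--{name}"
    if flag in argv:
        idx = argv.index(flag)
        if idx + 1 < len(argv):
            raw = argv[idx + 1]
            # Without a default there is no type to cast to, so keep the string.
            if default is None:
                return raw
            # Cast to the default's type so `--hidden 75` yields an int.
            return type(default)(raw)
    return default


# Same flags as the command above, passed explicitly instead of via sys.argv.
cli = ["--flor", "mySecondRun", "--hidden", "75"]
print(resolve_arg("hidden", 500, cli))  # overridden on the command line
print(resolve_arg("epochs", 5, cli))    # falls back to the default
```

Casting to the type of the default keeps the training script free of manual string conversion, which is one plausible reason to pair every hyper-parameter with a default.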

### Advanced (Optional): Batch Processing
Alternatively, we can call `flor.batch()` from an interactive environment
inside our model training repository to dispatch a group of potentially long-running jobs:

In [8]:
import flor

jobs = flor.cross_prod(hidden=[i*100 for i in range(1,6)],lr=(1e-4, 1e-3))
assert jobs is not None

flor.batch(jobs)

--hidden 100 --lr 0.0001 
--hidden 100 --lr 0.001 
--hidden 200 --lr 0.0001 
--hidden 200 --lr 0.001 
--hidden 300 --lr 0.0001 
--hidden 300 --lr 0.001 
--hidden 400 --lr 0.0001 
--hidden 400 --lr 0.001 
--hidden 500 --lr 0.0001 
--hidden 500 --lr 0.001 
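
The ten flag strings printed above are the Cartesian product of the two value lists. As a rough sketch of that expansion, the same strings can be generated with `itertools.product`. The `cross_product_flags` helper is hypothetical and only mirrors the printed output; it is not how `flor.cross_prod` is implemented.

```python
from itertools import product


def cross_product_flags(**grid):
    """Expand keyword lists into one flag string per combination.

    Hypothetical helper for illustration; not Flor's implementation.
    """
    names = list(grid)
    return [
        " ".join(f"--{name} {value}" for name, value in zip(names, combo))
        for combo in product(*(grid[name] for name in names))
    ]


jobs = cross_product_flags(hidden=[i * 100 for i in range(1, 6)], lr=(1e-4, 1e-3))
for job in jobs:
    print(job)
```

Five values for `hidden` times two values for `lr` gives the ten jobs that the progress query later reports.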


Then, using a new console or terminal, we start a `flordb` server to process the batch jobs:
```bash
$ python -m flor serve
```

or, if we want to allocate a GPU to the flor server:
```bash
$ python -m flor serve 0 
```
(where 0 is replaced by the GPU id).

You can check the progress of your jobs with the following query:

In [18]:
!sqlite3 ~/.flor/main.db -header 'select done, path, count(*) from jobs group by done, path;'

done|path|count(*)
1|/home/rogarcia/git/ml_tutorial|10


When all jobs have finished, the query reports 10 jobs marked as `done` = 1:

```
done|path|count(*)
1|/Users/rogarcia/git/ml_tutorial|10
```
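
The same progress check can be run from Python with the standard-library `sqlite3` module. The sketch below reuses the query from the command above. The in-memory table is a toy with an assumed minimal schema (only the `done` and `path` columns that the query references), so that the example runs without a Flor installation.

```python
import sqlite3

QUERY = "select done, path, count(*) from jobs group by done, path;"


def job_progress(conn):
    """Return (done, path, count) rows summarizing job status."""
    return conn.execute(QUERY).fetchall()


# Toy in-memory database so the sketch runs anywhere. To query real data,
# connect to the database file used in the command above instead.
conn = sqlite3.connect(":memory:")
conn.execute("create table jobs (done integer, path text)")  # assumed schema
conn.executemany(
    "insert into jobs values (?, ?)",
    [(1, "/tmp/ml_tutorial")] * 7 + [(0, "/tmp/ml_tutorial")] * 3,
)
for done, path, count in job_progress(conn):
    print(done, path, count)
```

While jobs are still running, the result has one row per `done` value; once everything has finished, only the `done` = 1 row remains.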

You can view the updated pivot view as follows:

In [19]:
df = full_pivot(log_records())
df.head()

Unnamed: 0,projid,runid,tstamp,vid,epoch,step,loss,epochs,batch_size,lr,hidden
0,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,100,0.1726001054048538,5,32,0.001,500
1,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,200,0.4662041068077087,5,32,0.001,500
2,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,300,0.1455779373645782,5,32,0.001,500
3,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,400,0.1074397191405296,5,32,0.001,500
4,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,500,0.2316924929618835,5,32,0.001,500


In [20]:
df['vid'].drop_duplicates().count()

12

# Model Training Kit (MTK)
The Model Training Kit (MTK) includes utilities for serializing and checkpointing PyTorch state,
and utilities for resuming, auto-parallelizing, and memoizing executions from checkpoint.

In this context, `Flor` is an alias for `MTK`. The model developer passes objects for checkpointing to `Flor.checkpoints(*args)`,
and gives it control over loop iterators by 
calling `Flor.loop(iterator)` as follows:

In [21]:
!cat train.py | grep -B 3 -A 25 Flor.checkpoints 

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Flor.checkpoints(model, optimizer)

# Train the model
total_step = len(train_loader)
for epoch in Flor.loop(range(num_epochs)):
    for i, (images, labels) in Flor.loop(enumerate(train_loader)):
        # Move tensors to the configured device
        images = images.reshape(-1, 28 * 28).to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i + 1) % 100 == 0:
            print(
                "Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}".format(
                    epoch + 1,
                    num_epochs,
                    i + 1,
                    total_step,


As shown, we wrap both the main loop and the nested training loop with `Flor.loop` so Flor can manage their state. Flor uses loop iteration boundaries to store selected checkpoints adaptively, and at replay time uses those same checkpoints to resume training from the appropriate epoch.
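
To illustrate the idea of checkpointing at loop iteration boundaries, here is a simplified sketch in plain Python. The `checkpointed_loop` generator, the `store` dictionary, and the `resume_from` parameter are all hypothetical. Unlike Flor, this sketch checkpoints after every iteration rather than adaptively, and it copies a plain dictionary instead of serializing PyTorch state.

```python
def checkpointed_loop(iterable, state, store, resume_from=None):
    """Yield items, snapshotting `state` at the end of every iteration.

    `store` maps iteration index -> a copy of `state`. If `resume_from` is
    given, restore that snapshot and skip the iterations it already covers.
    Hypothetical sketch; not Flor's implementation.
    """
    start = 0
    if resume_from is not None:
        state.clear()
        state.update(store[resume_from])
        start = resume_from + 1
    for index, item in enumerate(iterable):
        if index < start:
            continue  # already covered by the restored checkpoint
        yield item
        store[index] = dict(state)  # checkpoint at the iteration boundary


# Full run: five "epochs", each of which mutates the state.
store = {}
state = {"weight": 0}
for epoch in checkpointed_loop(range(5), state, store):
    state["weight"] += epoch + 1  # stand-in for one epoch of training

# Resume from the checkpoint taken after epoch index 2; redo only epochs 3 and 4.
resumed = {}
replayed = []
for epoch in checkpointed_loop(range(5), resumed, store, resume_from=2):
    replayed.append(epoch)
    resumed["weight"] += epoch + 1

print(replayed, resumed == state)
```

Because the wrapper owns the iterator, it can decide which iterations to skip on resume, which is why the loops are handed over to the wrapper rather than iterated directly.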


### Logging API

You call `flor.log(name, value)` to log metrics and `flor.arg(name, default=None)` to register tunable hyper-parameters.

In [22]:
%cat train.py | grep flor.arg

hidden_size = flor.arg("hidden", default=500)
num_epochs = flor.arg("epochs", 5)
batch_size = flor.arg("batch_size", 32)
learning_rate = flor.arg("lr", 1e-3)


In [26]:
%cat train.py | grep -C 3 flor.log

                    num_epochs,
                    i + 1,
                    total_step,
                    flor.log("loss", loss.item()),
                )
            )



Each `name` you pass to `flor.log` or `flor.arg` becomes a column (measure) in the full pivoted view (see [View your experiment history](#view-your-experiment-history)).
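
As a rough illustration of how logged names turn into columns, the sketch below pivots a list of long-format records (one record per logged value) into wide rows (one row per run, epoch, and step). The record layout and the `pivot_records` helper are assumptions made for this example and do not describe Flor's storage format.

```python
def pivot_records(records):
    """Pivot (runid, epoch, step, name, value) records into wide rows.

    Hypothetical record layout for illustration; not Flor's storage format.
    """
    rows = {}
    for runid, epoch, step, name, value in records:
        key = (runid, epoch, step)
        row = rows.setdefault(key, {"runid": runid, "epoch": epoch, "step": step})
        row[name] = value  # each logged name becomes a column
    return list(rows.values())


records = [
    ("myFirstRun", 1, 100, "loss", 0.1726),
    ("myFirstRun", 1, 100, "lr", 0.001),
    ("myFirstRun", 1, 200, "loss", 0.4662),
    ("myFirstRun", 1, 200, "lr", 0.001),
]
wide = pivot_records(records)
for row in wide:
    print(row)
```

Logging a new name simply adds a key to the affected rows, which corresponds to a new column appearing in the pivoted view.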


# Storage & Data Layout
On each run, Flor will:
1. Save model checkpoints in `~/.flor/`
1. Commit code changes, command-line args, and log records to `git`, inside a dedicated `flor.shadow` branch.

In [30]:
! ls -lagh ~/.flor | grep "ml_tutorial\|main"

-rw-r--r--  1 rogarcia  20K Jul 20 13:38 main.db
-rw-r--r--  1 rogarcia 176K Jul 20 13:40 ml_tutorial.db
drwxrwxr-x  5 rogarcia 4.0K Jul 20 13:40 ml_tutorial_flor.shadow.readme


In [31]:
! echo $(pwd)
! git branch

/home/rogarcia/git/ml_tutorial
* flor.shadow.readme


In [32]:
! echo $(pwd)'/.flor'
! ls -lagh ./.flor/

/home/rogarcia/git/ml_tutorial/.flor
total 20K
drwxrwxr-x 2 rogarcia 4.0K Jul 20 13:40 .
drwxrwxr-x 4 rogarcia 4.0K Jul 20 13:31 ..
-rw-rw-r-- 1 rogarcia 2.9K Jul 20 13:40 log_records.csv
-rw-rw-r-- 1 rogarcia  236 Jul 20 13:40 .replay.json
-rw-rw-r-- 1 rogarcia  223 Jul 20 13:40 seconds.json


Flor will access and interpret contents of `.flor` automatically. The data and log records will be exposed to the user via SQL or Pandas queries.

# Hindsight Logging


Suppose you wanted to start logging the `device`
identifier where the model is run, as well as the
final `accuracy` after training.
You would add the corresponding logging statements
to `train.py`, for example:

In [33]:
%cat train.py | grep -C 4 flor.log

from flor import MTK as Flor

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
flor.log("device", str(device))

# Hyper-parameters
input_size = 784
hidden_size = flor.arg("hidden", default=500)
--
                    epoch + 1,
                    num_epochs,
                    i + 1,
                    total_step,
                    flor.log("loss", loss.item()),
                )
            )

# Test the model
--
        correct += (predicted == labels).sum().item()

    print(
        "Accuracy of the network on the 10000 test images: {} %".format(
            flor.log("accuracy", 100 * correct / total)
        )
    )


In [34]:
! echo $(pwd)
! git commit -am "hindsight logging stmts added."

/home/rogarcia/git/ml_tutorial
LICENSE notebook.ipynb README.md train.py flor.shadow.readme
[flor.shadow.readme 04b7174] hindsight logging stmts added.
 1 file changed, 2 insertions(+), 1 deletion(-)


Typically, when you add a logging statement, logging 
begins "from now on", and you have no visibility into the past.
With hindsight logging, the aim is to allow model developers to send
new logging statements back in time, and replay the past 
efficiently from checkpoint.
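
Conceptually, backfilling means evaluating a newly added logging statement against state that was saved in the past, without retraining. The sketch below shows that idea on toy data. The `backfill` helper, the version ids, and the contents of `checkpoints` are all hypothetical; real checkpoints hold model and optimizer state rather than counters.

```python
def backfill(checkpoints, name, probe):
    """Apply a new logging probe to every stored checkpoint.

    `checkpoints` maps a version id to saved state; `probe` computes the new
    value from that state. Hypothetical sketch; not Flor's implementation.
    """
    return [{"vid": vid, name: probe(state)} for vid, state in checkpoints.items()]


# Toy checkpoints standing in for saved state from two past versions.
checkpoints = {
    "v1": {"correct": 9788, "total": 10000},
    "v2": {"correct": 9718, "total": 10000},
}

# New logging statement, added after the fact and evaluated without retraining.
rows = backfill(checkpoints, "accuracy", lambda s: 100 * s["correct"] / s["total"])
for row in rows:
    print(row)
```

The cost of such a replay depends on how much work sits between the nearest checkpoint and the new logging statement, which is why Flor reports an estimated replay time before proceeding.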

To do that, we open an interactive environment from within the `ml_tutorial` directory and call `flor.replay()`, asking Flor to apply the logging statements named `device` and `accuracy` to all previous versions (leave `where_clause` null in `flor.replay()`):

In [35]:
flor.replay(['device', 'accuracy'])

What is the log level of logging statement `device`? Leave blank to infer `DATA_PREP`:  
What is the log level of logging statement `accuracy`? Leave blank to infer `DATA_PREP`:  


Unnamed: 0,projid,runid,tstamp,vid,prep_secs,eval_secs
0,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,2.418149,0.571114
1,ml_tutorial_flor.shadow.readme,mySecondRun,2023-07-20T13:33:30,2be1ef88e7b2e39d2a5844b0945811244bb40715,0.680741,0.561131
2,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:34:24,d1035feb1274889a9f479f3f687659fc7bf712b8,0.679419,0.555125
3,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:34:50,4d1d90fa76b932edd2d637c0adfce8979487824f,0.681324,0.558446
4,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:35:15,60598fc693c15877a912366b2e29b6f35158a1e1,0.669353,0.560964
5,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:35:40,ef946c517a019861e7f7cf13b956deea3f3e0978,0.652545,0.55979
6,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:36:05,8e34cf572b8ea2f3fcfd8b7e398702a5023b8d0c,0.658422,0.55938
7,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:36:30,35f276660121be3b166da84f8808c41236dff6dd,0.670165,0.563717
8,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:36:56,cdc3c10f280e3473cb47f49425aa1d7db78fff42,0.659957,0.565566
9,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:37:22,5e53085992c1ed7941fe26930859cb4f86cba40c,0.681741,0.564458


Continue replaying 12 versions at DATA_PREP level for 52.52 seconds?[Y/n]?  


Flordb registered 12 replay jobs.


Then, using a new console or terminal, we start a `flordb` server to process the replay jobs:
```bash
$ python -m flor serve
```

or, if we want to allocate a GPU to the flor server:
```bash
$ python -m flor serve 0 
```
(where 0 is replaced by the GPU id).

You can check the progress of your jobs with the following query:

In [39]:
!sqlite3 ~/.flor/main.db -header 'select done, path, appvars, count(*) from replay group by done, path, appvars;'

done|path|appvars|count(*)
1|/home/rogarcia/git/ml_tutorial|device, accuracy|12


When the process is finished, you will be able to view the values for `device` and `accuracy` for historical executions, and they will continue to be logged in subsequent iterations:

In [41]:
from flor import full_pivot, log_records
df = full_pivot(log_records())
df.head()

Unnamed: 0,projid,runid,tstamp,vid,epoch,step,loss,device,epochs,batch_size,lr,accuracy,hidden
0,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,100,0.1726001054048538,cuda,5,32,0.001,97.88,500
1,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,200,0.4662041068077087,cuda,5,32,0.001,97.88,500
2,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,300,0.1455779373645782,cuda,5,32,0.001,97.88,500
3,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,400,0.1074397191405296,cuda,5,32,0.001,97.88,500
4,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,1,500,0.2316924929618835,cuda,5,32,0.001,97.88,500


In [42]:
df[list(flor.DATA_PREP) + ['device', 'accuracy']].drop_duplicates()

Unnamed: 0,projid,runid,tstamp,vid,device,accuracy
0,ml_tutorial_flor.shadow.readme,myFirstRun,2023-07-20T13:31:46,14480b3c1ec4636e0f26ec51b5bc7bc1a5c7d9d1,cuda,97.88
90,ml_tutorial_flor.shadow.readme,mySecondRun,2023-07-20T13:33:30,2be1ef88e7b2e39d2a5844b0945811244bb40715,cuda,97.18
180,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:34:24,d1035feb1274889a9f479f3f687659fc7bf712b8,cuda,93.83
270,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:34:50,4d1d90fa76b932edd2d637c0adfce8979487824f,cuda,97.57
360,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:35:15,60598fc693c15877a912366b2e29b6f35158a1e1,cuda,94.69
450,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:35:40,ef946c517a019861e7f7cf13b956deea3f3e0978,cuda,97.43
540,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:36:05,8e34cf572b8ea2f3fcfd8b7e398702a5023b8d0c,cuda,95.43
630,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:36:30,35f276660121be3b166da84f8808c41236dff6dd,cuda,97.63
720,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:36:56,cdc3c10f280e3473cb47f49425aa1d7db78fff42,cuda,95.92
810,ml_tutorial_flor.shadow.readme,BATCH,2023-07-20T13:37:22,5e53085992c1ed7941fe26930859cb4f86cba40c,cuda,97.81


Note the new columns `device` and `accuracy` that are backfilled.

## Publications

To cite this work, please refer to the [Hindsight Logging](http://www.vldb.org/pvldb/vol14/p682-garcia.pdf) paper (VLDB '21).

FLOR is open source software developed at UC Berkeley. 
[Joe Hellerstein](https://dsf.berkeley.edu/jmh/) (databases), [Joey Gonzalez](http://people.eecs.berkeley.edu/~jegonzal/) (machine learning), and [Koushik Sen](https://people.eecs.berkeley.edu/~ksen) (programming languages) 
are the primary faculty members leading this work.

This work is released as part of [Rolando Garcia](https://rlnsanz.github.io/)'s doctoral dissertation at UC Berkeley,
and has been the subject of study by Eric Liu and Anusha Dandamudi, 
both of whom completed their master's theses on FLOR.
Our list of publications is reproduced below.
Finally, we thank [Vikram Sreekanti](https://www.vikrams.io/), [Dan Crankshaw](https://dancrankshaw.com/), and [Neeraja Yadwadkar](https://cs.stanford.edu/~neeraja/) for guidance, comments, and advice.
[Bobby Yan](https://bobbyy.org/) was instrumental in the development of FLOR and its corresponding experimental evaluation.

* [Hindsight Logging for Model Training](http://www.vldb.org/pvldb/vol14/p682-garcia.pdf). _R Garcia, E Liu, V Sreekanti, B Yan, A Dandamudi, JE Gonzalez, JM Hellerstein, K Sen_. Proceedings of the VLDB Endowment (PVLDB), Vol. 14, 2021.
* [Fast Low-Overhead Logging Extending Time](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-117.html). _A Dandamudi_. EECS Department, UC Berkeley Technical Report, 2021.
* [Low Overhead Materialization with FLOR](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-79.html). _E Liu_. EECS Department, UC Berkeley Technical Report, 2020. 


## License
FLOR is licensed under the [Apache v2 License](https://www.apache.org/licenses/LICENSE-2.0).
