In [1]:
import time

import numpy as np
import pandas as pd

# Jupyter Notebooks for Research

## 2. Multi Notebook Projects

I find that some of my notebooks get out of hand. Weeks of work lead to 100+ cells.
The notebook feels encumbered (browser, kernel or server load?). A kernel restart
means 10+ minutes to get back to where I was. It's unpleasant.

There is probably sub-optimal code in there, but the real issue is that I've abused
the single-notebook model and it's time to do better.

### Splitting your notebook into multiple notebooks

Maybe you could think of this as the chapters of the analysis. Technically, I suppose
"books" might seem more natural, but with a substantial dataset and/or compute-intensive
analyses I reckon you'll be best off splitting on what you'd consider chapters. Here's an
example project layout:

1. Data preparation
2. Exploratory analysis
   1. Facet 1
   2. Facet 2
3. Model fitting

The main technical aspect you need to consider is **exchanging data between the notebooks**.
In this example you might have something like the following dependency tree:

```
01_01_data_prep.ipynb
├── 02_01_facet_1.ipynb
├── 02_02_facet_2.ipynb
└── 03_01_model.ipynb
```

There are a variety of ways you can achieve this, and it's going to be pretty straightforward.
I would advise some form of checksumming though, so a downstream notebook can tell whether
the data it loads is the data it expects.
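
To make the checksumming idea concrete, here is a minimal sketch of one round trip between an "upstream" and a "downstream" notebook. It assumes pandas' `to_pickle`/`read_pickle` as the exchange format, which is my stand-in and only one of several options. The helper name `hash_dataframe`, the temporary directory and the toy frame are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def hash_dataframe(df):
    """Return a SHA-256 hex digest built from per-row hashes of the frame."""
    # One uint64 hash per row; index=True folds the index into each row hash.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.values).hexdigest()


# "Upstream" notebook: store the frame with a hash prefix in the filename.
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list('ABCD'))
df_hash = hash_dataframe(df)
workdir = Path(tempfile.mkdtemp())  # hypothetical stand-in for the project directory
path = workdir / f'data_prep_df_{df_hash[:7]}.pkl.gz'
df.to_pickle(path)  # compression is inferred from the .gz suffix

# "Downstream" notebook: load, then check the content against the filename.
loaded = pd.read_pickle(path)
expected_prefix = path.name.split('_')[-1].split('.')[0]
if not hash_dataframe(loaded).startswith(expected_prefix):
    raise ValueError(f'{path.name}: content does not match the hash in its name')
```

Putting a hash prefix in the filename means a downstream notebook can spot a stale or modified file without knowing anything about how it was produced.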

### Generate data

First we generate our raw dataset.

In [2]:
def long_running_data_generation(n=5):
    time.sleep(n)
    np.random.seed(seed=444)  # vcs would be a pain without this
    return pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))


df = long_running_data_generation()
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.35744 | 0.377538 | 1.382338 | 1.175549 |
| 1 | -0.939276 | -1.14315 | -0.54244 | -0.548708 |
| 2 | 0.20852 | 0.21269 | 1.268021 | -0.807303 |
| 3 | -3.303072 | -0.80665 | -0.360329 | -0.880396 |
| 4 | 0.152631 | 0.25025 | 0.078508 | -0.903904 |
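
The fixed seed is what keeps the generated data, and therefore the notebook output under version control, identical from run to run. As a small sketch of the same idea without touching NumPy's global random state, one could use a local generator. Note that `np.random.default_rng` is a swapped-in alternative to `np.random.seed`, so it produces different numbers from the cell above. The function name `generate` is hypothetical.

```python
import numpy as np
import pandas as pd


def generate(seed=444):
    """Build a 100x4 frame of standard-normal draws from a local, seeded generator."""
    rng = np.random.default_rng(seed)  # local state, global np.random is untouched
    return pd.DataFrame(rng.standard_normal((100, 4)), columns=list('ABCD'))


first = generate()
second = generate()
print(first.equals(second))  # same seed, so the two frames should be identical
```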


### Processing

We have a few processing steps to do.

In [3]:
df = df.assign(A2=2*df.A+df.B)
df.head()

|   | A | B | C | D | A2 |
|---|---|---|---|---|---|
| 0 | 0.35744 | 0.377538 | 1.382338 | 1.175549 | 1.092418 |
| 1 | -0.939276 | -1.14315 | -0.54244 | -0.548708 | -3.021702 |
| 2 | 0.20852 | 0.21269 | 1.268021 | -0.807303 | 0.629729 |
| 3 | -3.303072 | -0.80665 | -0.360329 | -0.880396 | -7.412793 |
| 4 | 0.152631 | 0.25025 | 0.078508 | -0.903904 | 0.555513 |


### Storage

Let's store that for future use.

In [4]:
# cell imports?! maybe I won't use these anywhere else...
import hashlib
import gzip
from datetime import datetime

# safely hash a dataframe: one uint64 hash per row (index included),
# then a single SHA-256 digest over all the row hashes
# TODO: Include reference to where I saw this
row_hashes = pd.util.hash_pandas_object(df, index=True)
df_hash = hashlib.sha256(row_hashes.values).hexdigest()
print(df_hash)
# write the file; mode 'x' won't clobber it if it's already there, this could be slow
filename = f'data_prep_df_{df_hash[:7]}.p.gz'
now = datetime.now()  # creation timestamp for the file header
try:
    with gzip.open(filename, 'x') as scores_file:
        scores_file.write('# Creation time: {}\n'.format(str(now)).encode())
        scores_file.write('# Table hash: {}\n'.format(df_hash).encode())
        scores_file.write(df.to_string().encode())
except FileExistsError:
    print('{} already exists.'.format(filename))

0ebc21d7983d3ce8bcc159aea7cc8127b76ef3183e260333780a54b619b129a6
data_prep_df_0ebc21d.p.gz already exists.
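
A downstream notebook can read the recorded hash back out of the file header before doing anything expensive. The sketch below assumes the header layout written by the storage cell (`#`-prefixed lines ahead of the table). The helper name `read_table_hash` and the throwaway demo file are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def read_table_hash(path):
    """Return the hash recorded in a '# Table hash: ...' header line, or None."""
    with gzip.open(path, 'rt') as f:
        for line in f:
            if not line.startswith('#'):
                break  # header is over, the table body starts here
            if line.startswith('# Table hash:'):
                return line.split(':', 1)[1].strip()
    return None


# Demo on a throwaway file that mimics the layout of the stored table.
demo_hash = '0ebc21d7983d3ce8bcc159aea7cc8127b76ef3183e260333780a54b619b129a6'
demo_path = Path(tempfile.mkdtemp()) / f'data_prep_df_{demo_hash[:7]}.p.gz'
with gzip.open(demo_path, 'x') as f:
    f.write(b'# Creation time: (elided)\n')
    f.write('# Table hash: {}\n'.format(demo_hash).encode())
    f.write(b'   A  B\n0  1  2\n')  # toy table body, not real data

recorded = read_table_hash(demo_path)
# The first 7 characters of the recorded hash should appear in the filename.
print(recorded[:7] in demo_path.name)
```

One caveat: `df.to_string()` rounds floats for display, so re-hashing a table parsed back from this text file will not necessarily reproduce the recorded hash. The header check confirms which table the file claims to hold, not that its values survived the round trip bit for bit.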
