# Files and versioning

Unless you're a string theorist, at some point you're probably going to want to save and load some data. This tutorial covers some of Sciris' tools for doing that more easily.

<div class="alert alert-warning">
    <b>Warning!</b> The tools here are powerful, which also makes them dangerous. Unless it's in a simple text format like JSON or CSV, loading a data file can run arbitrary code on your computer, just like running a Python script can. If you wouldn't run a Python file from a particular source, don't open a data file from that source either.
</div>

<div class="alert alert-info">
    
Click [here](https://mybinder.org/v2/gh/sciris/sciris/HEAD?labpath=docs%2Ftutorials%2Ftut_files.ipynb) to open an interactive version of this notebook.

</div>


## Files

### Saving and loading literally anything

Let's assume you're mostly just saving and loading files you've created yourself or received from trusted colleagues, not opening email attachments from the local branch of the mafia. Then everything here is absolutely fine.

Let's revisit our sim from the first tutorial:

In [None]:
import sciris as sc
import numpy as np
import matplotlib.pyplot as plt
sc.options(jupyter=True) # To make plots nicer

class Sim:
    
    def __init__(self, days, trials):
        self.days = days
        self.trials = trials
    
    def run(self):
        self.x = np.arange(self.days)
        self.y = np.cumsum(np.random.randn(self.days, self.trials)**3, axis=0)
    
    def plot(self):
        with plt.style.context('sciris.fancy'):
            plt.plot(self.x, self.y, alpha=0.6)

Now let's run it, save it, reload it, and keep working with the reloaded version:

In [None]:
# Run and save
sim = Sim(days=30, trials=5)
sim.run()
sc.save('my-sim.obj', sim) # Save any Python object to disk

# Load and plot
new_sim = sc.load('my-sim.obj') # Load any Python object
new_sim.plot()

We can create any object, save it, then reload it from disk and it works just like new – even calling methods works! What's happening here? Under the hood, `sc.save()` saves the object as a [gzipped](https://docs.python.org/3/library/gzip.html) (compressed) [pickle](https://docs.python.org/3/library/pickle.html) (byte stream). Pickles are how Python sends objects internally, so they can handle almost anything. (For the few corner cases that `pickle` can't handle, `sc.save()` falls back on [dill](https://dill.readthedocs.io/en/latest/), which can handle nearly everything else.)

There are also compression options other than gzip ([zstandard](https://python-zstandard.readthedocs.io/en/latest/) or no compression), but you probably don't need to worry about these. (If you _really_ care about performance, then `sc.zsave()`, which uses `zstandard` by default, is slightly faster than `sc.save()` – but regardless of how a file was saved, you can load it with `sc.load()`.)
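
To make the "gzipped pickle" idea concrete, here is a minimal sketch of the same round trip using only the standard library. This is an illustration of the concept, not Sciris' actual implementation (which also handles the `dill` fallback and other compression schemes); the data and file name are made up.

```python
import datetime
import gzip
import os
import pickle
import tempfile

# Made-up data containing types that a text format like JSON can't store directly
original = {
    'tags': {'a', 'b', 'c'},             # A set
    'shape': (30, 5),                    # A tuple
    'date': datetime.date(2024, 1, 31),  # A date object
}

# Write to a temporary folder so nothing is left behind in the working directory
filename = os.path.join(tempfile.mkdtemp(), 'my-data.obj')

# Save: pickle the object to bytes, gzipping them on the way to disk
with gzip.open(filename, 'wb') as f:
    pickle.dump(original, f)

# Load: un-gzip, then unpickle back into an equivalent object
with gzip.open(filename, 'rb') as f:
    restored = pickle.load(f)

print(restored == original)
```

The types survive the round trip intact, which is exactly what makes pickles convenient (and, per the warning at the top, what makes untrusted pickles dangerous).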

### Saving and loading JSON

While `sc.save()` and `sc.load()` are great for many things, they _aren't_ great for just sharing data. For one thing, they're not compatible with anything other than Sciris, so if you try to share one of those files with, say, an R user, they won't be able to open it.

If you just have data and don't need to save custom objects, you should save just the data. If you want to save to CSV or Excel (i.e., data that looks like a spreadsheet), you should convert it to a dataframe (`df = sc.dataframe(data)`), then save it from there (`df.to_csv()` and `df.to_excel()`, respectively).

But if you want to save data that's a little more complex, you should consider JSON: it's fast, it's easy for humans to read, and almost every language and tool can load it. While a JSON file typically maps onto a Python `dict`, Sciris will take pretty much any object and save out the JSONifiable parts of it:

In [None]:
# Try saving our sim as a JSON
sc.savejson('my-sim.json', sim)

# Load it as a JSON
sim_json = sc.loadjson('my-sim.json')
print(sim_json)

It's not exactly beautiful, and it's not as powerful as `sc.save()` (for example, `sim_json.plot()` doesn't exist), but it has all the _data_, exactly as it was laid out in the original object:

In [None]:
print(f"{sim_json['x'] = }")
print(f"{sim_json['y'][0] = }")

(Note that when exported to JSON and loaded back again, everything is in default Python types – so the data is now a list of lists rather than a 2D NumPy array.)
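
The same loss of type information can be seen with Python's built-in `json` module alone. The sketch below is not how Sciris converts objects (Sciris handles many more types, such as NumPy arrays); the `Experiment` class is hypothetical, and `vars()` is used as a crude way to grab just the data attributes.

```python
import json

class Experiment:
    """A hypothetical class holding some data and a method."""
    def __init__(self):
        self.days = 3
        self.values = [(0.0, 1.5), (0.5, 2.5), (1.0, 4.0)]  # Tuples, not lists

    def total(self):
        return sum(b for a, b in self.values)

exp = Experiment()

# Export only the data attributes; methods are not part of vars(exp)
text = json.dumps(vars(exp))

# Loading gives back a plain dict of default Python types
loaded = json.loads(text)
print(type(loaded))
print(loaded['values'][0])  # Now a list: the tuple type was not preserved
```

All the numbers are still there, but the result is a `dict` with no `total()` method, and the tuples have come back as lists.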

### Saving and loading YAML

If you're not super familiar with [YAML](https://yaml.org/), you might think of it as that quirky format for configuration files with lots of colons and indents. It _is_ that, but it's also a powerful extension to JSON – every JSON file is also a valid YAML file, but the reverse is not true (i.e., JSON is a subset of YAML). Of most interest to you, dear scientist, is that you can add comments to YAML files. Consider this (relatively) common situation:

In [None]:
raw_json = '''
{"variables": {
    "timepoints": [0,1,2,3,4,5],
    "really_important_variable": 12.566370614359172
  }
}
'''
data = sc.readjson(raw_json)
print(data)

Now you're tearing your hair out. Where did 12.566370614359172 come from? It looks vaguely familiar, or at least it did when you wrote it 6 months ago. But with YAML, you can have your data and comment it too:

In [None]:
raw_yaml = '''
{"variables": {
    "timepoints": [0,1,2,3,4,5],
    "really_important_variable": 12.566370614359172 # This is just 4π lol
  }
}
'''
data = sc.readyaml(raw_yaml)
print(data)

Mystery solved.
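
For contrast, the commented version is _not_ valid JSON. Here is a small check using only Python's built-in `json` module (rather than a Sciris function), which follows the JSON standard strictly and so rejects the comment:

```python
import json

commented = '''
{"variables": {
    "timepoints": [0,1,2,3,4,5],
    "really_important_variable": 12.566370614359172 # This is just 4π lol
  }
}
'''

try:
    json.loads(commented)
    json_ok = True
except json.JSONDecodeError as E:
    json_ok = False
    print(f'JSON parser failed: {E}')
```

So if you want comments in your data files, YAML is the format to reach for.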

### Other file functions

Sciris includes a number of other file utilities. For example, to get a list of files, you can use `sc.getfilelist()`:

In [None]:
sc.getfilelist('*.ipynb')

It's often useful to get the folder that the current file is in: your code might be called from a different working directory, and you want relative paths to keep working (for example, to load something from a subfolder):

In [None]:
sc.thispath()

(This looks wonky here because this notebook is run on some random cloud server, but it should look more normal if you do it at home!)

Most Sciris file functions can return either strings or [Paths](https://docs.python.org/3/library/pathlib.html). If you've never used `pathlib`, it's a really powerful way of handling paths. It's also quite intuitive. For example, to create a `data` subfolder that's always relative to this notebook regardless of where it's run from, you can do

In [None]:
datafolder = sc.thispath() / 'data'
print(datafolder)

Sciris also makes it easy to ensure that a path exists:

In [None]:
datafile = sc.makefilepath(datafolder / 'my-data.csv', makedirs=True)
print(datafile)

Sciris usually handles all this internally, but this can be useful when working with non-Sciris functions, e.g.

In [None]:
np.savetxt(datafile, np.random.rand(2,2)) # Would give an error without sc.makefilepath() above
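
For reference, the plain `pathlib` way of making sure a folder exists before writing to it looks roughly like this. It's a sketch only: a temporary folder stands in for the notebook's location, and the names `base` and `tmp_file` are placeholders.

```python
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())  # Stand-in for the notebook's folder
tmp_file = base / 'data' / 'my-data.csv'

# Create the parent folder (and any missing ancestors); no error if it already exists
tmp_file.parent.mkdir(parents=True, exist_ok=True)

# Now that the folder exists, writing the file succeeds
tmp_file.write_text('0.1,0.2\n0.3,0.4\n')
print(tmp_file.exists())
```

Without the `mkdir()` call, the write would fail with a `FileNotFoundError`, since the `data` subfolder would not exist yet.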

Lastly, you can clean up after yourself with `sc.rmpath()`, which will automatically figure out whether to use [os.remove()](https://docs.python.org/3/library/os.html#os.remove) (which works for files but not folders) or [shutil.rmtree()](https://docs.python.org/3/library/shutil.html#shutil.rmtree) (which, frustratingly, works for folders but not files):

In [None]:
sc.rmpath(datafile)
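
To see why a single function is handy, here is a rough standard-library sketch of the file-or-folder decision. The `remove_path()` helper is hypothetical and much simpler than `sc.rmpath()`; it only shows the dispatch between the two underlying functions.

```python
import os
import shutil
import tempfile

def remove_path(path):
    """Hypothetical helper: delete a file or a folder, whichever `path` is."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)  # Folders (and their contents)
    else:
        os.remove(path)      # Files and symlinks

# Set up one folder and one file inside a temporary directory
base = tempfile.mkdtemp()
folder = os.path.join(base, 'data')
filename = os.path.join(base, 'my-data.csv')
os.mkdir(folder)
with open(filename, 'w') as f:
    f.write('1,2,3\n')

# The same call works for both
remove_path(filename)
remove_path(folder)
print(os.listdir(base))  # Both are gone
```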

## Versioning

### Getting version information

You've probably heard people talk about reproducibility. Quite likely you yourself have talked about reproducibility. Central to computational reproducibility is knowing what version everything is. Sciris provides several tools for this. To collect all the metadata available – including the current Python environment, system version, and so on – use `sc.metadata()`:

In [None]:
md = sc.metadata(pipfreeze=False)
print(md)

(We turned off `pipfreeze` above because this stores the entire output of `pip freeze`, i.e. the version of every Python library installed. This is a lot to display in a notebook, but typically you'd leave it enabled.)
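
To give a feel for the kind of information involved, here is a tiny standard-library sketch that gathers a few similar fields by hand. The dictionary keys are chosen for illustration and are not the fields that `sc.metadata()` itself uses.

```python
import datetime
import json
import platform
import sys

# A handful of environment details, collected manually
metadata = {
    'timestamp': datetime.datetime.now().isoformat(),
    'python': platform.python_version(),
    'executable': sys.executable,
    'platform': platform.platform(),
}
print(json.dumps(metadata, indent=2))
```

Since the result is a plain dictionary of strings, it can be stored alongside your results in a JSON file.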

If you want specific versions of things, there are two functions for that: `sc.compareversions()` and `sc.require()`. The first does explicit version checks:

In [None]:
if sc.compareversions(np, '>1.0'):
    print('You do not have an ancient version of NumPy')
else:
    print('When you last updated NumPy, dinosaurs roamed the earth')
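
Why use a dedicated function rather than comparing version strings directly? Because strings compare character by character, which goes wrong as soon as a version number reaches two digits. The sketch below shows the problem; `parse_version()` is a hypothetical helper that only handles plain dotted integers (not suffixes like `rc1`), and is not how Sciris does the comparison.

```python
def parse_version(version):
    """Hypothetical helper: turn '1.10.2' into (1, 10, 2). Plain dotted integers only."""
    return tuple(int(part) for part in version.split('.'))

# Comparing strings goes character by character, which gives the wrong answer
print('1.10.0' > '1.9.0')                                # False (wrong!)

# Comparing tuples of integers gives the right answer
print(parse_version('1.10.0') > parse_version('1.9.0'))  # True
```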

In contrast, `sc.require()` will raise a warning (or exception) if the requirement isn't met. For example:

In [None]:
sc.require('numpy>99.9.9', die=False) # We don't want to die, we're in the middle of a tutorial!

You can see it raises a warning (there is no NumPy v99.9.9), and attempts to give a helpful suggestion (which in this case is not very helpful).

### Saving and loading version information

#### Metadata-enhanced figures

Sciris includes a wrapper around `plt.savefig()` named `sc.savefig()`. Aside from saving with publication-quality resolution by default, the other difference is that it automatically saves metadata along with the figure (including optional comments, if we want). For example:

In [None]:
plt.pcolor(sc.smooth(np.random.rand(10,10)), cmap='turbo')
sc.savefig('my-fig.png', comments='This is a pretty plot')

We can load metadata from the saved file using `sc.loadmetadata()`:

In [None]:
md = sc.loadmetadata('my-fig.png')
sc.printjson(md) # Can just use print(), but sc.printjson() is prettier

#### Metadata-enhanced files

Remember `sc.save()` and `sc.load()` from earlier in this tutorial? The metadata-enhanced versions of these are `sc.savearchive()` and `sc.loadarchive()`. These will save an arbitrary object to a zip file, but also include a file called `sciris_metadata.json` along with it. You can include other files or even whole folders too – for example, if you want to save a big set of sim results and figure you might as well throw in the whole source code along with it. For example, re-using our sim from before, let's save it along with this notebook:

In [None]:
sim_archive = sc.savearchive('my-sim.zip', sim, files='tut_files.ipynb', comments='Sim plus notebook')

This is just an ordinary zip file, so we can open it with any application. But we can also load the metadata automatically with `sc.loadmetadata()`: 

In [None]:
md = sc.loadmetadata(sim_archive)
print(md['comments'])

And, of course, we can load the whole thing as a brand new, fully-functional object:

In [None]:
sim = sc.loadarchive(sim_archive)
sim.plot()
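
If you're curious what "an object plus a metadata file in a zip" might look like, here is a simplified standard-library sketch of the idea. Only the `sciris_metadata.json` name comes from the description above; the `data.obj` entry name and the contents are assumptions for illustration, and the real archive layout may differ.

```python
import json
import os
import pickle
import tempfile
import zipfile

data = {'days': 30, 'trials': 5}              # Stand-in for the sim
metadata = {'comments': 'Sim plus notebook'}  # Stand-in for the full metadata

zip_path = os.path.join(tempfile.mkdtemp(), 'my-sim.zip')

# Write the object and its metadata as two separate entries in one zip file
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('sciris_metadata.json', json.dumps(metadata))
    zf.writestr('data.obj', pickle.dumps(data))  # Entry name is an assumption

# Read both entries back
with zipfile.ZipFile(zip_path) as zf:
    loaded_md = json.loads(zf.read('sciris_metadata.json'))
    loaded_data = pickle.loads(zf.read('data.obj'))

print(loaded_md['comments'])
```

Keeping the metadata as a separate JSON entry means it can be inspected with any zip tool, without unpickling the object itself.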