# [Tutorial](https://github.com/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb)

# Caching

This notebook illustrates the use of the CliMetLab cache and highlights some cache configuration settings.

The relevant CliMetLab documentation is located at https://climetlab.readthedocs.io/en/latest/guide/caching.html

Relevant CliMetLab settings are:
- cache-directory 
- maximum-cache-disk-usage 
- maximum-cache-size

## How to run this exercise

This exercise is in the form of a [Jupyter notebook](https://jupyter.org/). It can be run in a number of free cloud-based environments (see two options below), which require no installation. When you click on one of the links below ([`Open in Colab`](https://colab.research.google.com/github/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb) or [`Launch in Deepnote`](https://deepnote.com/launch?url=https://github.com/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb)) you will be prompted to create a free account, after which you will see the same page you see here. You can run each block of code in turn by pressing Shift+Enter, or by selecting the "play" icon.

Advanced users may wish to run this exercise on their own computers by first installing [Python](https://www.python.org/downloads/), [Jupyter](https://jupyter.org/install) and [CliMetLab](https://climetlab.readthedocs.io/en/latest/installing.html).

<style>
td, th {
   border: 1px solid white;
   border-collapse: collapse;
}
</style>
<table align="left">
  <tr>
    <th>Run the tutorial via free cloud platforms: </th>
    <th><a href="https://colab.research.google.com/github/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb">
        <img src = "https://colab.research.google.com/assets/colab-badge.svg" alt = "Colab"></a></th>
    <th><a href="https://deepnote.com/launch?url=https://github.com/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb">
        <img src = "https://deepnote.com/buttons/launch-in-deepnote-small.svg" alt = "Deepnote"></a></th>
  </tr>
</table>

### Change the setting `download-out-of-date-urls` to `True` (the default is `False`)

In [1]:
!climetlab settings

cache-directory                           /var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen
check-out-of-date-urls                    True
dask-directories                          ['/Users/wenwen/.climetlab/dask']
datasets-catalogs-urls                    ['https://github.com/ecmwf-lab/climetlab-datasets/raw/main/datasets']
datasets-directories                      ['/Users/wenwen/.climetlab/datasets']
download-out-of-date-urls                 False
layers-directories                        ['/Users/wenwen/.climetlab/layers']
maximum-cache-disk-usage                  90%
maximum-cache-size                        None
number-of-download-threads                4
plotting-options                          {}
projections-directories                   ['/Users/wenwen/.climetlab/projections']
styles-directories                        ['/Users/wenwen/.cl

In [2]:
import climetlab as cml
cml.settings.get('download-out-of-date-urls')

False

In [3]:
cml.settings.set('download-out-of-date-urls', True)
cml.settings.get('download-out-of-date-urls')

True

## Let's begin the exercise...

In [4]:
# pip install climetlab --quiet

In [5]:
import climetlab as cml
URL1 = "https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.SP.list.v04r00.csv"
URL2 = "https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.NI.list.v04r00.csv"

### Using ``cml.load_source("url",...)`` stores the data in the climetlab cache.  

In [6]:
data = cml.load_source("url", URL1)
data.to_pandas()

Remote content of URL https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.SP.list.v04r00.csv has changed
Invalidating cache version and re-downloading https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.SP.list.v04r00.csv
Deleting entry {
    "path": "/var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/url-e29d836ce6a5ea6e24cbb4398dd11140a280205e0248a9ad468bde15e1727667.SP.list.v04r00.csv",
    "size": 34820706,
    "owner": null,
    "args": null
}
CliMetLab cache: deleting /var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/url-e29d836ce6a5ea6e24cbb4398dd11140a280205e0248a9ad468bde15e1727667.SP.list.v04r00.csv (33.2 MiB, 2 days 17 hours 45 minutes 1 second)
CliMetLab cache: None None


ibtracs.SP.list.v04r00.csv:   0%|          | 0.00/33.2M [00:00<?, ?B/s]

  return pandas.read_csv(self.path, **pandas_read_csv_kwargs)


Unnamed: 0,SID,SEASON,NUMBER,BASIN,SUBBASIN,NAME,ISO_TIME,NATURE,LAT,LON,...,BOM_GUST_PER,REUNION_GUST,REUNION_GUST_PER,USA_SEAHGT,USA_SEARAD_NE,USA_SEARAD_SE,USA_SEARAD_SW,USA_SEARAD_NW,STORM_SPEED,STORM_DIR
0,,Year,,,,,,,degrees_north,degrees_east,...,second,kts,second,ft,nmile,nmile,nmile,nmile,kts,degrees
1,1897005S10135,1897,1,SP,EA,NOT_NAMED,1897-01-04 12:00:00,NR,-10.1000,135.300,...,,,,,,,,,9,246
2,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 15:00:00,NR,-10.2755,134.902,...,,,,,,,,,8,246
3,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 18:00:00,NR,-10.4406,134.523,...,,,,,,,,,8,246
4,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 21:00:00,NR,-10.5853,134.182,...,,,,,,,,,7,247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75934,2023061S14162,2023,12,SP,MM,KEVIN,2023-03-06 06:00:00,NR,-30.8,193.0,...,,,,,,,,,23,113
75935,2023061S14162,2023,12,SP,MM,KEVIN,2023-03-06 09:00:00,NR,-31.1068,193.947,...,,,,,,,,,16,112
75936,2023061S14162,2023,12,SP,MM,KEVIN,2023-03-06 12:00:00,NR,-31.4,194.7,...,,,,,,,,,17,123
75937,2023061S14162,2023,12,SP,MM,KEVIN,2023-03-06 15:00:00,NR,-32.0486,195.673,...,,,,,,,,,23,130


### The next call to the same code does not re-download the data.

In [7]:
data = cml.load_source("url", URL1)
data.to_pandas()

  return pandas.read_csv(self.path, **pandas_read_csv_kwargs)


Unnamed: 0,SID,SEASON,NUMBER,BASIN,SUBBASIN,NAME,ISO_TIME,NATURE,LAT,LON,...,BOM_GUST_PER,REUNION_GUST,REUNION_GUST_PER,USA_SEAHGT,USA_SEARAD_NE,USA_SEARAD_SE,USA_SEARAD_SW,USA_SEARAD_NW,STORM_SPEED,STORM_DIR
0,,Year,,,,,,,degrees_north,degrees_east,...,second,kts,second,ft,nmile,nmile,nmile,nmile,kts,degrees
1,1897005S10135,1897,1,SP,EA,NOT_NAMED,1897-01-04 12:00:00,NR,-10.1000,135.300,...,,,,,,,,,9,246
2,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 15:00:00,NR,-10.2755,134.902,...,,,,,,,,,8,246
3,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 18:00:00,NR,-10.4406,134.523,...,,,,,,,,,8,246
4,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 21:00:00,NR,-10.5853,134.182,...,,,,,,,,,7,247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75934,2023061S14162,2023,12,SP,MM,KEVIN,2023-03-06 06:00:00,NR,-30.8,193.0,...,,,,,,,,,23,113
75935,2023061S14162,2023,12,SP,MM,KEVIN,2023-03-06 09:00:00,NR,-31.1068,193.947,...,,,,,,,,,16,112
75936,2023061S14162,2023,12,SP,MM,KEVIN,2023-03-06 12:00:00,NR,-31.4,194.7,...,,,,,,,,,17,123
75937,2023061S14162,2023,12,SP,MM,KEVIN,2023-03-06 15:00:00,NR,-32.0486,195.673,...,,,,,,,,,23,130
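
The behaviour above (one download, then reuse) can be illustrated with a small, self-contained sketch. It is not CliMetLab's implementation: the helper names `cache_path` and `fetch`, the choice of SHA-256 over a JSON description of the request, and the `example.com` URL are all assumptions made for illustration. The only detail taken from the outputs above is that cached files are named after an owner (`url`) followed by a long hexadecimal digest.

```python
import hashlib
import json
import os
import tempfile


def cache_path(cache_dir, owner, args, extension=""):
    """Build a deterministic file path for a request (hypothetical helper)."""
    # Serialise the request in a stable way so that equal requests hash equally.
    payload = json.dumps({"owner": owner, "args": args}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{owner}-{digest}{extension}")


def fetch(cache_dir, url, download):
    """Return the cached file for `url`, calling `download` only on a cache miss."""
    path = cache_path(cache_dir, "url", {"url": url}, ".csv")
    if not os.path.exists(path):
        download(url, path)
    return path


calls = []


def fake_download(url, path):
    # Stand-in for a real HTTP download: records the call and writes a tiny file.
    calls.append(url)
    with open(path, "w") as f:
        f.write("SID,SEASON\n")


cache_dir = tempfile.mkdtemp()
first = fetch(cache_dir, "https://example.com/data.csv", fake_download)
second = fetch(cache_dir, "https://example.com/data.csv", fake_download)
print(first == second, len(calls))  # True 1
```

The second call finds the file already present and skips the download, which is the same effect observed in the cell above.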


### Observe and manipulate cache

The downloaded data is stored in a cache directory, managed by CliMetLab using a small database. Data is also unzipped, if needed, within the cache directory.

The cache can be observed and manipulated:
- Within Python using ``cml.cache``
- With the command-line interface ``climetlab cache`` and ``climetlab decache``
- Using the web interface GUI (in progress: summer of code project https://github.com/ecmwf-lab/climetlab-script-web)
- NOT by playing directly with the cache files (same logic as a web browser cache).
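
As a rough illustration of what can be done with cache metadata once it is available as records, the sketch below groups entries by owner and sums their sizes. The field names (`path`, `owner`, `size`, `accesses`) are those printed by `climetlab cache --all` further down; the sample records are hand-written with shortened, hypothetical paths, and the assumption that the `--json` output is a list of such records has not been verified here.

```python
import json
from collections import defaultdict

# Hand-written sample mirroring the fields printed by `climetlab cache --all`.
# The paths are shortened placeholders, not real cache files.
sample = json.loads("""
[
  {"path": "grib-index-19d2.json", "owner": "grib-index", "size": 4, "accesses": 1},
  {"path": "url-1528.grib", "owner": "url", "size": 1052, "accesses": 1},
  {"path": "grib-index-b621.json", "owner": "grib-index", "size": 4, "accesses": 2}
]
""")


def summarise(entries):
    """Return {owner: (number_of_entries, total_size_in_bytes)}."""
    totals = defaultdict(lambda: [0, 0])
    for entry in entries:
        totals[entry["owner"]][0] += 1
        totals[entry["owner"]][1] += entry["size"]
    return {owner: tuple(pair) for owner, pair in totals.items()}


summary = summarise(sample)
print(summary)  # {'grib-index': (2, 8), 'url': (1, 1052)}
```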

In [8]:
cml.cache

In [9]:
!climetlab cache

Cache directory:            /var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen
Cache size:                 156.1 MiB
Number of entries in cache: 29
Most recently accessed:     2 minutes ago
Least recently accessed:    last Tuesday
Youngest entry:             2 minutes ago
Oldest entry:               last Tuesday


In [10]:
!climetlab cache --all

/var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/grib-index-19d28c477a0bcdc6db97ea241e2f4e8187eb931be09140df1eb278a0d46a4dae.json
  creation_date: 2023-04-25 15:34:36.574213
  last_access: 2023-04-25 15:34:36.574213
  accesses: 1
  type: file
  size: 4
  owner: grib-index
  args: ['test.grib', 1682461945.5818264, 1682461945.5818264, 1052, 0]
  expires: None
  extra: None
  flags: 0
  owner_data: None
  parent: None
  replaced: None

/var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib
  creation_date: 2023-04-25 15:45:00.599192
  last_access: 2023-04-25 15:45:00.599192
  accesses: 1
  type: file
  size: 1052
  owner: url
  args: {'url': 'https://github.com/ecmwf/climetlab

  args: ['reanalysis-era5-single-levels', {'variable': '2t', 'product_type': 'reanalysis', 'area': [54.5, -6.0, 39.0, 9.5], 'time': 12, 'year': 1980, 'month': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'], 'day': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31']}]
  expires: None
  extra: None
  flags: 0
  owner_data: None
  parent: None
  replaced: None

/var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/grib-index-b6217d73b387755cdff7289aa0df3170f71ae36c261da92735529784026c4305.json
  creation_date: 2023-04-25 16:25:10.122007
  last_access: 2023-04-25 16:26:10.545558
  accesses: 2
  type: file
  size: 4
  owner: grib-index
  args: ['/var/folders/y3/4fs71r3n6sd517ny

In [11]:
!climetlab cache --newer 1d

Entries newer than '2023-04-27 09:34:23'.
Cache directory:            /var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen
Cache size:                 0
Number of entries in cache: 3
Most recently accessed:     2 minutes ago
Least recently accessed:    36 minutes ago
Youngest entry:             2 minutes ago
Oldest entry:               36 minutes ago


In [12]:
!climetlab cache --help

usage: cache [-h] [--json] [--all] [--path] [--sort KEY] [--reverse]
             [--match STRING] [--newer DATE] [--older DATE] [--accessed]
             [--larger SIZE] [--smaller SIZE]

Cache command to inspect the CliMetLab cache. The selection arguments are the
same as for the ``climetlab decache`` deletion command. Examples: climetlab
cache --all

optional arguments:
  -h, --help      show this help message and exit
  --json          produce a JSON output
  --all
  --path          print the path of cache directory and exit
  --sort KEY      sort output according to increasing values of KEY.
  --reverse       reverse the order of the sort, from larger to smaller
  --match STRING  TODO
  --newer DATE    TODO
  --older DATE    TODO
  --accessed      use the date of last access instead of the creation date
  --larger SIZE   consider only cache entries that are larger than SIZE bytes
  --smaller SIZE  consider only cache entries that are smaller than SIZE bytes



In [13]:
# Delete cached data newer than one day
# !climetlab decache --newer 1d # This is commented out to avoid running this cell by mistake

# Configuring CliMetLab cache settings

In [14]:
!climetlab settings cache-directory 
!climetlab settings maximum-cache-disk-usage 
!climetlab settings maximum-cache-size  

/var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen
90
None


# Concurrent cache use

If the cache is full, the oldest data is automatically deleted (with a log message).
When multiple scripts use the same cache, a file may therefore be deleted to free space even while another script is still using it.
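
To make the risk concrete, here is a minimal sketch of an oldest-first eviction policy. It is an illustration only and not CliMetLab's actual algorithm: the function name `entries_to_evict`, the use of `last_access` as the ordering key, and the sample entries are assumptions.

```python
from datetime import datetime


def entries_to_evict(entries, max_size):
    """Pick the least recently accessed entries until the total size fits max_size."""
    total = sum(e["size"] for e in entries)
    victims = []
    # Walk through the entries from oldest to newest `last_access`.
    for entry in sorted(entries, key=lambda e: e["last_access"]):
        if total <= max_size:
            break
        victims.append(entry["path"])
        total -= entry["size"]
    return victims


# Hypothetical entries; sizes are in arbitrary units.
entries = [
    {"path": "a.grib", "size": 60, "last_access": datetime(2023, 4, 25, 15, 34)},
    {"path": "b.csv", "size": 30, "last_access": datetime(2023, 4, 27, 9, 30)},
    {"path": "c.json", "size": 40, "last_access": datetime(2023, 4, 25, 16, 26)},
]
victims = entries_to_evict(entries, max_size=50)
print(victims)  # ['a.grib', 'c.json']
```

Nothing in this selection knows whether another process currently has `a.grib` open, which is the situation described above when several scripts share one cache.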
 




In [15]:
import climetlab as cml
cml.settings.get("maximum-cache-size")
cml.settings.set("maximum-cache-size", "50M")
cml.settings.get("maximum-cache-size")

CliMetLab cache: trying to free 106.1 MiB
Decaching files oldest than 2023-04-25T16:09:34.085710 (age: 2 days 17 hours 26 minutes 28 seconds)
Deleting entry {
    "path": "/var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/grib-index-19d28c477a0bcdc6db97ea241e2f4e8187eb931be09140df1eb278a0d46a4dae.json",
    "owner": "grib-index",
    "args": [
        "test.grib",
        1682461945.5818264,
        1682461945.5818264,
        1052,
        0
    ],
    "creation_date": "2023-04-25 15:34:36.574213",
    "flags": 0,
    "owner_data": null,
    "last_access": "2023-04-25 15:34:36.574213",
    "type": "file",
    "parent": null,
    "replaced": null,
    "extra": null,
    "expires": null,
    "accesses": 1,
    "size": 4
}
CliMetLab cache: deleting /var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/grib-index-19d28c477a0bcdc6db97ea241e2f4e8187eb931be09140df1eb278a0d46a4dae.json (4, 2 days 18 hours 1 minute 26 seconds)
CliMetLab cache: grib-index ["test.grib

CliMetLab cache: e-odretriever {"params": ["2t", "msl"]}
Deleting entry {
    "path": "/var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/grib-index-995f88c48e1c08940c5782240c180f704c0d9648682bba3b47b1fcbc6f694d11.json",
    "owner": "grib-index",
    "args": [
        "/var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/e-odretriever-ee877de4e68b1399120142ff002b5a6a44f9642ab455bf33ec91acec4d0e59dd.cache",
        1682463058.1478097,
        1682463058.1450922,
        50956766,
        0
    ],
    "creation_date": "2023-04-25 15:50:58.155246",
    "flags": 0,
    "owner_data": null,
    "last_access": "2023-04-25 15:50:58.155246",
    "type": "file",
    "parent": null,
    "replaced": null,
    "extra": null,
    "expires": null,
    "accesses": 1,
    "size": 4
}
CliMetLab cache: deleting /var/folders/y3/4fs71r3n6sd517ny3ydt1w5w0000gp/T/climetlab-wenwen/grib-index-995f88c48e1c08940c5782240c180f704c0d9648682bba3b47b1fcbc6f694d11.json (4, 2 days 17 hours 

52428800
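
The value returned above, 52428800, is 50 × 1024 × 1024: the string `"50M"` was interpreted with binary multiples. The arithmetic can be checked with a small stand-alone helper. `parse_size` is a hypothetical function written for this notebook, not part of the CliMetLab API, and the set of accepted suffixes is an assumption.

```python
import re

# Binary multiples: 1 K = 1024 bytes, 1 M = 1024 K, and so on.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert strings such as '50M' to a number of bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?)\s*", text.upper())
    if match is None:
        raise ValueError(f"Cannot parse size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


size = parse_size("50M")
print(size)  # 52428800
```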

# Take home message

- End users do not need to manage the data. Data is downloaded on demand, with minimal duplication.

- The CliMetLab cache is a **cache**: it is managed by CliMetLab and automatically cleaned up.

- Multiple users should not share the same cache directory.

Let us reset the CliMetLab settings to their defaults, just in case.

In [16]:
cml.settings.reset()