# Data download notebook 

**Warning! The entire dataset is almost 1 TB. Each individual session, including only spike data, is around 1-3 GB. Do not try to download multiple sessions without first checking the available space on your computer.** The entire spike dataset is about 120 GB.
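One way to act on this warning is to check free disk space before starting a download. The sketch below uses only the standard library. The helper name `has_free_space` and the 5 GB headroom are illustrative assumptions, not part of the AllenSDK.

```python
import shutil

def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

# A single spike-only session is roughly 1-3 GB, so we ask for some headroom.
# The 5 GB threshold is an arbitrary choice for this example.
if has_free_space(".", required_gb=5):
    print("Enough free space for one session.")
else:
    print("Low disk space: free some room before downloading.")
```

Pass the folder where the data will be stored instead of `"."` if it lives on a different drive.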

## Useful links 

- The [main page of the Neuropixels data](https://allensdk.readthedocs.io/en/latest/visual_coding_neuropixels.html) has more tutorials. In fact, this notebook is based on the [Data Access](https://allensdk.readthedocs.io/en/latest/_static/examples/nb/ecephys_data_access.html) tutorial.



## Select the folder where the data will be stored 

By default, the data is stored in a folder called `allendata`, located at the same level as this notebook. **Make sure it exists before proceeding.** The cell below creates it if it is missing; change `data_dir` if you want to store the data somewhere else.



In [1]:
import os

#IMPORTANT: check path to where the data might be stored.
data_dir = "./allendata"

if not os.path.exists(data_dir):
    os.makedirs(data_dir)

Execute the next cell to configure the dataset in the selected folder. The file `manifest.json` is used to keep track of everything. The `cache` object manages the downloads. It is very important to increase the default `timeout`, which is set to `1200` seconds (20 min), because downloads might be slow. If this time is exceeded, the download is cancelled.

In [2]:
import os
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from allensdk.brain_observatory.ecephys.ecephys_project_cache import EcephysProjectCache

#We need to set here also a maximum download time for all the data. 
#Let's select 1 hour.  
seconds_2_cancel = 60 * 60
manifest_path = os.path.join(data_dir, "manifest.json")
cache = EcephysProjectCache.from_warehouse(manifest=manifest_path, timeout=seconds_2_cancel)

Matplotlib is building the font cache; this may take a moment.


## Download index files

We do not download the data directly. First we download index files with information about the sessions. These are just four CSV files that weigh only a few MB, so they do not take up much space on your computer.

The command `get_session_table()` loads a Pandas dataframe with information about all the sessions. The ID identifies the session, and then we have `session_type` (which kind of experiment it was) and data about the animal, such as `age_in_days` or its `full_genotype`. We take a sneak peek at the first rows of the table by calling `.head()`.

**Warning: the very first time `get_session_table` is called, it can take up to 20 minutes to run. There might be no progress indicator. Be patient.**

Once all the files are downloaded and configured, access will be instant.

In [3]:
sessions = cache.get_session_table() #Returns a Pandas dataframe
sessions.head() #Sneak peek of the table

| id | published_at | specimen_id | session_type | age_in_days | sex | full_genotype | unit_count | channel_count | probe_count | ecephys_structure_acronyms |
|---|---|---|---|---|---|---|---|---|---|---|
| 715093703 | 2019-10-03T00:00:00Z | 699733581 | brain_observatory_1.1 | 118.0 | M | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 884 | 2219 | 6 | [CA1, VISrl, nan, PO, LP, LGd, CA3, DG, VISl, ... |
| 719161530 | 2019-10-03T00:00:00Z | 703279284 | brain_observatory_1.1 | 122.0 | M | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 755 | 2214 | 6 | [TH, Eth, APN, POL, LP, DG, CA1, VISpm, nan, N... |
| 721123822 | 2019-10-03T00:00:00Z | 707296982 | brain_observatory_1.1 | 125.0 | M | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 444 | 2229 | 6 | [MB, SCig, PPT, NOT, DG, CA1, VISam, nan, LP, ... |
| 732592105 | 2019-10-03T00:00:00Z | 717038288 | brain_observatory_1.1 | 100.0 | M | wt/wt | 824 | 1847 | 5 | [grey, VISpm, nan, VISp, VISl, VISal, VISrl] |
| 737581020 | 2019-10-03T00:00:00Z | 718643567 | brain_observatory_1.1 | 108.0 | M | wt/wt | 568 | 2218 | 6 | [grey, VISmma, nan, VISpm, VISp, VISl, VISrl] |
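Since the session table is an ordinary Pandas dataframe, it can be filtered to pick candidate sessions before downloading anything heavy. The sketch below uses a small stand-in dataframe, `sessions_demo`, built by hand from the five rows shown above, so it runs without the cache. The selection criteria (wild-type animals, more than 600 units) are arbitrary choices for illustration.

```python
import pandas as pd

# Toy stand-in for cache.get_session_table(), copied from the rows shown above.
sessions_demo = pd.DataFrame(
    {
        "session_type": ["brain_observatory_1.1"] * 5,
        "full_genotype": [
            "Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt",
            "Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt",
            "Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt",
            "wt/wt",
            "wt/wt",
        ],
        "unit_count": [884, 755, 444, 824, 568],
    },
    index=pd.Index(
        [715093703, 719161530, 721123822, 732592105, 737581020], name="id"
    ),
)

# Keep wild-type sessions with more than 600 recorded units.
selected = sessions_demo[
    (sessions_demo["full_genotype"] == "wt/wt")
    & (sessions_demo["unit_count"] > 600)
]

# The dataframe index holds the session IDs.
candidate_ids = selected.index.tolist()
print(candidate_ids)
```

The same boolean-mask pattern should work on the real `sessions` dataframe, since the column names used here are the ones shown in the table.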


Then we can load any session we want. Let's use the session with ID `798911424` as an example. The data for this session is downloaded using `cache.get_session_data(id)`. **Observe that the memory for the session will not be allocated yet**.

The first time, this call downloads the data. Subsequent accesses will be instant.

In [4]:
session_id = 798911424 
oursession = cache.get_session_data(session_id)



Downloading:   0%|          | 0.00/2.86G [00:00<?, ?B/s]

  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."


### IMPORTANT: monitor use of RAM, memory will be allocated in the next cell.

Observe that the following cell actually calls a function (`metadata` is not a plain attribute of the object, but rather a `@property`), and this will allocate all the necessary memory. This happens the first time it is accessed after starting the Python kernel.

If you see the metadata coming out in the next cell, you're ready to go for the projects. It will look something like this:

```
{'specimen_name': 'Vip-IRES-Cre;Ai32-421338',
 'session_type': 'brain_observatory_1.1',
 'full_genotype': 'Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt',
 'sex': 'F',
 'age_in_days': 110.0,
 'rig_equipment_name': 'NP.1', ... }
```

If you don't see it, the cell keeps running, and RAM usage continues to increase, interrupt and restart the kernel. We'll give you other files!
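If you prefer to track memory from inside the notebook rather than with a system monitor, the standard-library `resource` module reports the peak resident memory of the current process. This is a minimal sketch and assumes a Unix-like system (Linux or macOS), since `resource` is not available on Windows. The helper name `peak_rss_mb` is hypothetical.

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident memory of this process in MB (Unix-like systems only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024**2 if sys.platform == "darwin" else 1024
    return peak / divisor

before = peak_rss_mb()
# ... run the memory-heavy step here, e.g. accessing oursession.metadata ...
after = peak_rss_mb()
print(f"Peak memory grew by {after - before:.1f} MB")
```

Note that this measures the peak, so it shows how much memory was needed at most. It does not show how much is still in use.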

In [4]:
oursession.metadata

  return func(args[0], **pargs)


{'specimen_name': 'Vip-IRES-Cre;Ai32-421338',
 'session_type': 'brain_observatory_1.1',
 'full_genotype': 'Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt',
 'sex': 'F',
 'age_in_days': 110.0,
 'rig_equipment_name': 'NP.1',
 'num_units': 825,
 'num_channels': 2233,
 'num_probes': 6,
 'num_stimulus_presentations': 70931,
 'session_start_time': datetime.datetime(2018, 12, 21, 0, 2, 57, tzinfo=tzoffset(None, -28800)),
 'ecephys_session_id': 798911424,
 'structure_acronyms': ['LP',
  'DG',
  'CA1',
  'VISam',
  nan,
  'APN',
  'TH',
  'Eth',
  'CA3',
  'VISrl',
  'HPF',
  'ProS',
  'SUB',
  'VISp',
  'CA2',
  'VISl',
  'MB',
  'NOT',
  'LGv',
  'VISal'],
 'stimulus_names': ['spontaneous',
  'gabors',
  'flashes',
  'drifting_gratings',
  'natural_movie_three',
  'natural_movie_one',
  'static_gratings',
  'natural_scenes',
  'drifting_gratings_contrast']}
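The metadata is a plain Python dictionary, so its fields can be inspected with ordinary code. As a small sketch, the snippet below copies part of the output above into a hypothetical `metadata_demo` dictionary and extracts the visual areas recorded in this session. Note that `structure_acronyms` contains a `nan` entry, which has to be skipped before doing string operations.

```python
# Subset of the metadata shown above, copied by hand for illustration.
metadata_demo = {
    "num_units": 825,
    "num_probes": 6,
    "structure_acronyms": [
        "LP", "DG", "CA1", "VISam", float("nan"), "APN", "TH", "Eth", "CA3",
        "VISrl", "HPF", "ProS", "SUB", "VISp", "CA2", "VISl", "MB", "NOT",
        "LGv", "VISal",
    ],
}

# Skip the non-string nan entry and keep acronyms that start with "VIS".
visual_areas = sorted(
    name
    for name in metadata_demo["structure_acronyms"]
    if isinstance(name, str) and name.startswith("VIS")
)
print(visual_areas)
```

With the real session, replacing `metadata_demo` with `oursession.metadata` should give the same list, as long as the output matches the one shown above.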