# BARC dataset

In many ways this BARC dataset is much like the Allen Institute's [BigNeuron project](https://alleninstitute.org/what-we-do/brain-science/news-press/articles/bigneuron-project-launched-advance-3d-reconstructions-neurons) of a few years ago.

Here, though, there is only one type of data: brightfield imaging of biocytin-stained neurons. Each neuron's image stack is on the order of 10 gigabytes of data.

## Access info
The challenge dataset is hosted on Wasabi Cloud Storage, which mimics the AWS S3 APIs, so all the regular ways of accessing data on S3 also work here.

- Service endpoint address: s3.wasabisys.com
- Access Key Id: 2G7POM6IZKJ3KLHSC4JB
- Secret Access Key: 0oHD5BXPim7fR1n7zDXpz4YoB7CHAHAvFgzpuJnt
- Storage region: us-west-1
- bucket name: brightfield-auto-reconstruction-competition  

## Overview of bucket's contents
There are two parts to the data:
1. training data (100+ neurons, with manual SWCs)
2. test data (10 neurons, no SWCs)

Each training neuron is in its own folder off the root of the bucket, so there are over 100 folders with names like `647225829`, `767485082`, and `861519869`.

Each of these folders consists of:
- the input: a few hundred TIFF image files
- the output: one SWC manually traced skeleton file

There is one unusual sub-root folder, `TEST_DATA_SET`, which contains the data for the ten neurons used during the challenge's evaluation phase. These ten neuron image stacks *DO NOT* have SWC files.

The goal is for software to read the image stack and auto-reconstruct the SWC, without a human having to manually craft a SWC skeleton file (or at least with minimal human input time).

So, the idea is a two-phase challenge. First, train an auto-reconstruction program on the roughly 100 neurons in the training data set, checking your results against the human-traced SWC skeleton that comes with each neuron's image stack. Then, for the evaluation phase, submit the 10 SWC files the program generates for the ten neurons in `TEST_DATA_SET`.

Each image stack has its own image count, seemingly a few hundred TIFF images each (e.g., 270, 500, 309, etc.). Within a stack the images are all the same size, but the size differs between stacks (e.g., 33MB images, 58MB images, etc.), seemingly on the order of 30 to 60 MB per image.

For example, one neuron in `TEST_DATA_SET` is stored in a folder named `665856925`:
- It holds about 280 TIFF images
- All files are named like `reconstruction_0_0539044525_639962984-0007.tif`
- The only thing that changes is the last four characters of the filename root, after the hyphen
- Each file is about 33 MB in size
- The neuron's data as a whole is on the order of 10 gigabytes
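Because only the four-character suffix varies, a stack's files can be put in order by parsing that suffix. The following is a minimal sketch, assuming the suffix is a zero-padded counter giving the image's position in the stack (the source does not confirm this). The helper names `slice_index` and `sort_stack` are hypothetical.

```python
import re

# Hypothetical helper: pull the trailing four-digit counter out of a TIFF name.
# Assumption: the counter is the image's position in the stack.
_SUFFIX = re.compile(r"-(\d{4})\.tif$")

def slice_index(filename):
    match = _SUFFIX.search(filename)
    if match is None:
        raise ValueError("unexpected file name: " + filename)
    return int(match.group(1))

def sort_stack(filenames):
    """Return the file names ordered by their trailing counter."""
    return sorted(filenames, key=slice_index)

names = [
    "reconstruction_0_0539044525_639962984-0010.tif",
    "reconstruction_0_0539044525_639962984-0007.tif",
]
ordered = sort_stack(names)
print([slice_index(n) for n in ordered])
```

Raising on an unexpected name, rather than silently skipping it, makes it obvious if a folder contains files that don't follow the pattern.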


# Colab can handle one neuron's data at a time


Consider one large neuron, ID `647225829`. This one has 460 images, each 57.7MB. So a single neuron's data can be as big as, say, 25 gigabytes.
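To make that arithmetic explicit, here is a quick check. It assumes the 57.7MB figure is in decimal megabytes (an assumption; the source doesn't say) and prints the total in both decimal gigabytes and binary (1024-based) gigabytes, since the two differ by several percent at this scale.

```python
# Figures quoted above for neuron 647225829.
n_images = 460
bytes_per_image = 57.7e6  # assumption: 57.7 decimal megabytes per TIFF

total_bytes = n_images * bytes_per_image
total_gb = total_bytes / 1e9       # decimal gigabytes
total_gib = total_bytes / 2**30    # binary gigabytes (1024-based)

print("{:.1f} GB (decimal), {:.1f} GiB (binary)".format(total_gb, total_gib))
```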

Fortunately, Google's Colab has that much file system. The default runtime comes with a 50GB file system, and if you ask for a GPU you actually get roughly 360GB. (U-Net can use a GPU.)



In [1]:
# Get some stats on the file system:
!!df -h .


['Filesystem      Size  Used Avail Use% Mounted on',
 'overlay          49G   25G   22G  54% /']

The default file system on Colab is 50G, but a 360G file system can be requested, simply by configuring the runtime to have a GPU (yup).

So, on the default (50G) file system, about half the space is already used by the OS and other pre-installed software. A big neuron's data would consume all of the remaining 22G or so. So it's **probably a good idea to request a GPU**, which also comes with a ~360G file system.
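Before pulling down a stack, it can help to check programmatically whether it will fit. This is a minimal sketch using the standard library's `shutil.disk_usage`. The helper name `fits_on_disk` and the 10% headroom are illustrative assumptions, not part of the challenge tooling.

```python
import shutil

def fits_on_disk(needed_bytes, path=".", headroom=0.10):
    """True if `needed_bytes` fit in the free space at `path`,
    keeping a fraction `headroom` of that free space in reserve."""
    free = shutil.disk_usage(path).free
    return needed_bytes <= free * (1.0 - headroom)

big_neuron_bytes = 25 * 2**30  # roughly the size of neuron 647225829
print(fits_on_disk(big_neuron_bytes))
```

The headroom leaves space for whatever intermediate files a reconstruction tool writes alongside the raw images.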



## Overview of the dataset


The data is stored on Wasabi Cloud Storage, which mimics the AWS S3 APIs, so AWS's Python client, boto3, can be used to access the data. boto3 comes preinstalled on Colab.

Here's Wasabi's how-to doc, [How do I use the AWS SDK for Python (boto3) with Wasabi?](https://wasabi-support.zendesk.com/hc/en-us/articles/115002579891-How-do-I-use-the-AWS-SDK-for-Python-boto3-with-Wasabi-)

The goal here is to have a bit of code that completely maps the dataset's file system: all 115 neurons and all their files (names and sizes), programmatically indexed into a convenient data structure from which to build manifest files for, say, ShuTu or some U-Net reconstructor to process. That is, this will make it easier for folks to massage the data into whatever tool they decide to run with.
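As an illustration of where this is headed, here is a minimal sketch of turning such an index into a manifest. It assumes the index is a dict keyed by folder prefix, with `swc`, `imagestack`, and `size` entries (the shape the indexing cell in this section builds). The CSV layout and the name `write_manifest` are hypothetical; the manifest format that ShuTu or any particular U-Net tool actually expects is not known here.

```python
import csv
import io

def write_manifest(neurons, out):
    """Write one CSV row per image: neuron ID, image key, SWC key."""
    writer = csv.writer(out)
    writer.writerow(["neuron", "image_key", "swc_key"])
    for prefix in sorted(neurons):
        entry = neurons[prefix]
        for key in sorted(entry["imagestack"]):
            writer.writerow([prefix.rstrip("/"), key, entry["swc"] or ""])

# Made-up sample entry, for illustration only (not real bucket contents).
sample = {
    "123/": {"prefix": "123/", "swc": "123/123.swc",
             "imagestack": ["123/img-0002.tif", "123/img-0001.tif"],
             "size": 2048},
}
buf = io.StringIO()
write_manifest(sample, buf)
manifest_text = buf.getvalue()
print(manifest_text)
```

Writing to any file-like object keeps the function easy to try in a notebook before pointing it at a real file with `open(..., "w", newline="")`.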

In [12]:
import boto3

s3 = boto3.resource('s3',
     endpoint_url = 'https://s3.us-west-1.wasabisys.com',
     aws_access_key_id = '2G7POM6IZKJ3KLHSC4JB',
     aws_secret_access_key = "0oHD5BXPim7fR1n7zDXpz4YoB7CHAHAvFgzpuJnt")  
bucket = s3.Bucket('brightfield-auto-reconstruction-competition')

result = bucket.meta.client.list_objects(Bucket=bucket.name,
                                         Delimiter='/')
print( "total root subfolders = " + str(sum(1 for _ in result.get('CommonPrefixes') )) + "\n")

def sumObjectsForPrefix(a_prefix):
  "Counts the objects stored under a prefix (i.e., the files in a folder)."
  return sum(1 for _ in bucket.objects.filter(Prefix = a_prefix))

# The hundred or so training TIFF stacks, with SWCs                    
training_neurons = {}
for o in result.get('CommonPrefixes'):
  a_prefix = o.get('Prefix')
  # Uncomment to print the 106 root prefixes (numeric neuron IDs plus TEST_DATA_SET/):
  #print(a_prefix)
  
  # Enumerate all files
  # print("----------------")
  imagestack_bytes = 0
  imagestack = []
  swc_key = None
  for s3_object in bucket.objects.filter(Prefix = a_prefix):
    # print(s3_object.key + "= " + str(s3_object.size))
    if not s3_object.key.endswith(".swc"):
      if s3_object.key != a_prefix:
        # if == it's the directory itself, not a file in it so ignore
        imagestack.append(s3_object.key)
        imagestack_bytes += s3_object.size
    else:
      swc_key = s3_object.key
  
  if a_prefix != "TEST_DATA_SET/":
    training_neurons[a_prefix] = {"prefix": a_prefix, "swc": swc_key, "imagestack": imagestack, "size": imagestack_bytes}
    
    
    
    
print( "# training neurons: " + str(len(training_neurons)))    

# https://stackoverflow.com/a/49361727
def format_bytes(size):
    # 2**10 = 1024
    power = 2**10
    n = 0
    power_labels = {0 : '', 1: 'kilo', 2: 'mega', 3: 'giga', 4: 'tera'}
    while size >= power and n < 4:
        size /= power
        n += 1
    return size, power_labels[n]+'bytes'
  
for a_neuron_name in training_neurons:
  a_neuron = training_neurons[a_neuron_name]
  sizeAndUnit = format_bytes(a_neuron["size"])
  print(a_neuron_name + ": " + str(len(a_neuron["imagestack"])) + "= " + '{:02.2f}'.format(sizeAndUnit[0]) + " " + sizeAndUnit[1] )
    
# The ten test TIFF stacks, without SWCs (left unpopulated in this cell)
testing_neurons = {}
    
  


total root subfolders = 106

# training neurons: 105
647225829/: 460= 24.74 gigabytes
647244741/: 261= 8.00 gigabytes
647247980/: 299= 9.15 gigabytes
647278927/: 346= 17.49 gigabytes
647289876/: 228= 7.01 gigabytes
649052017/: 307= 9.42 gigabytes
650917845/: 245= 9.98 gigabytes
651511374/: 414= 12.72 gigabytes
651748297/: 336= 7.02 gigabytes
651790667/: 250= 13.42 gigabytes
651806289/: 291= 6.05 gigabytes
651829339/: 529= 35.29 gigabytes
651834134/: 469= 14.44 gigabytes
652113069/: 359= 14.56 gigabytes
654221379/: 334= 10.27 gigabytes
654591451/: 300= 12.18 gigabytes
663523681/: 539= 27.28 gigabytes
663961066/: 414= 16.82 gigabytes
664466860/: 382= 11.73 gigabytes
668664690/: 464= 14.33 gigabytes
669371214/: 295= 11.97 gigabytes
672278613/: 330= 10.12 gigabytes
673066511/: 283= 14.52 gigabytes
674317065/: 344= 18.39 gigabytes
676633030/: 387= 10.65 gigabytes
677326176/: 595= 24.10 gigabytes
677347027/: 586= 18.04 gigabytes
685884456/: 492= 20.07 gigabytes
687702530/: 358= 18.06 gigabyt

That makes 106 root folders: 105 training neurons, plus the last folder, `TEST_DATA_SET`, which contains 10 neuron image stacks in subfolders (without SWC answers).
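The test stacks could be indexed the same way as the training ones. This is a minimal sketch that groups `(key, size)` pairs by the subfolder directly under `TEST_DATA_SET/`. It assumes keys look like `TEST_DATA_SET/<neuron id>/<file>.tif` (one level of subfolders, which has not been verified against the bucket). The function name `group_test_stacks` is hypothetical, and the sample listing is made up.

```python
from collections import defaultdict

def group_test_stacks(objects, root="TEST_DATA_SET/"):
    """Group (key, size) pairs by the sub-folder directly under `root`."""
    stacks = defaultdict(lambda: {"imagestack": [], "size": 0})
    for key, size in objects:
        if not key.startswith(root):
            continue
        parts = key[len(root):].split("/")
        # Skip folder placeholder keys and anything not inside a sub-folder.
        if len(parts) < 2 or not parts[-1]:
            continue
        stacks[parts[0]]["imagestack"].append(key)
        stacks[parts[0]]["size"] += size
    return dict(stacks)

# Made-up listing for illustration; the sizes are not real.
listing = [
    ("TEST_DATA_SET/", 0),
    ("TEST_DATA_SET/665856925/", 0),
    ("TEST_DATA_SET/665856925/reconstruction_0_0539044525_639962984-0007.tif", 33),
    ("TEST_DATA_SET/665856925/reconstruction_0_0539044525_639962984-0008.tif", 33),
]
testing_neurons = group_test_stacks(listing)
print({name: len(s["imagestack"]) for name, s in testing_neurons.items()})
```

Against the live bucket, the same function can be fed `((o.key, o.size) for o in bucket.objects.filter(Prefix="TEST_DATA_SET/"))`.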

## Trial and error hacking

In [0]:
import boto3


s3 = boto3.resource('s3',
     endpoint_url = 'https://s3.us-west-1.wasabisys.com',
     aws_access_key_id = '2G7POM6IZKJ3KLHSC4JB',
     aws_secret_access_key = '0oHD5BXPim7fR1n7zDXpz4YoB7CHAHAvFgzpuJnt')  
  
bucket = s3.Bucket('brightfield-auto-reconstruction-competition')
#for obj in bucket.objects.filter(Prefix="668664690"):
#  print(obj.key)

# Good:
#print( "objects=" + str(sum(1 for _ in bucket.objects.filter(Prefix="668664690"))))

  
#size = sum(1 for _ in bucket.objects.all())  
  
  
#print ('sub-folders:')  
#for obj in bucket.objects.filter(Delimiter="/"):
#   print(obj.key)


print( "objects=" + str(sum(1 for _ in bucket.objects.filter(Delimiter="/"))))

  
  
result = bucket.meta.client.list_objects(Bucket=bucket.name,
                                         Delimiter='/')
for o in result.get('CommonPrefixes'):
    print(o.get('Prefix'))  

#list(prefix='/', delimiter='/')

# long list of all files
#for a_bucket_object in bucket.objects.all():
#    print(a_bucket_object.key)