## Pulling barcodes from Figshare

This notebook rebuilds the list of image barcodes from scratch, as a first step toward replicating "Applications of deep convolutional neural networks to digitized natural history collections" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680669/).

In [1]:
import tarfile
import pandas as pd
import numpy as np

First, we need to download the image bundles from Figshare to get the image barcodes. The images are posted as two separate datasets: [stained](https://smithsonian.figshare.com/articles/dataset/Mercury-stained_botany_images_for_deep_learning/5423083) and [unstained](https://smithsonian.figshare.com/articles/dataset/Unstained_botany_images_for_deep_learning/5423098).

In [2]:
! wget -nc -O stained.tar.gz https://smithsonian.figshare.com/ndownloader/files/9355285

File ‘stained.tar.gz’ already there; not retrieving.
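Where `wget` is not available, the same skip-if-present download can be sketched in plain Python. This is only an illustration: `fetch_once` is a hypothetical helper name, not something the notebook defines. Unlike `wget`, it has no way to tell a complete file from a partial one, so an interrupted download has to be deleted by hand before retrying.

```python
import urllib.request
from pathlib import Path


def fetch_once(url, dest):
    """Download `url` to `dest` unless `dest` already exists (like `wget -nc`).

    Returns True if a download happened, False if the file was already there.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    urllib.request.urlretrieve(url, str(dest))
    return True


# Example call with the URL used in the cell above (left commented out
# because it triggers a large download):
# fetch_once("https://smithsonian.figshare.com/ndownloader/files/9355285",
#            "stained.tar.gz")
```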


In [3]:
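stained_barcodes = []
with tarfile.open("stained.tar.gz", "r:gz") as tar:
    # Only the member names are read; no image is extracted to disk.
    for filename in tar.getnames():
        if filename.endswith('.jpg'):
            # Second path component, up to the first dot, is the barcode.
            barcode = filename.split('/')[1].split('.')[0]
            stained_barcodes.append(barcode)
stained_barcodes[:5]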

['00000140', '00000162', '00000185', '00000209', '00000231']

In [4]:
! wget -nc -O unstained.tar.gz https://smithsonian.figshare.com/ndownloader/files/9355303

--2021-06-04 13:51:48--  https://smithsonian.figshare.com/ndownloader/files/9355303
Resolving smithsonian.figshare.com (smithsonian.figshare.com)... 99.80.170.16, 52.208.22.115
Connecting to smithsonian.figshare.com (smithsonian.figshare.com)|99.80.170.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 413485713 (394M) [binary/octet-stream]
Saving to: ‘unstained.tar.gz’


2021-06-04 13:52:20 (12.5 MB/s) - ‘unstained.tar.gz’ saved [413485713/413485713]



In [5]:
unstained_barcodes = []
with tarfile.open("unstained.tar.gz", "r:gz") as tar:
    # Same extraction as for the stained bundle: names only, nothing unpacked.
    for filename in tar.getnames():
        if filename.endswith('.jpg'):
            barcode = filename.split('/')[1].split('.')[0]
            unstained_barcodes.append(barcode)
unstained_barcodes[:5]

['00000001', '00000003', '00000015', '00000020', '00000021']
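The stained and unstained cells run the same loop, so it could be factored into one helper. The sketch below is a possible refactoring, not the notebook's own code: `barcodes_from_tar` is a hypothetical name, and it takes the barcode to be the file stem of each `.jpg` member (last path component without its extension). That matches the `split('/')[1].split('.')[0]` logic only if every image sits exactly one directory deep and its name contains a single dot.

```python
import tarfile
from pathlib import PurePosixPath


def barcodes_from_tar(archive_path, suffix=".jpg"):
    """Return the barcode (file stem) of every image in a gzipped tar archive."""
    barcodes = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = PurePosixPath(member.name)
            # Skip directories and anything that is not an image file.
            if member.isfile() and name.suffix.lower() == suffix:
                barcodes.append(name.stem)
    return barcodes


# Intended use with the archives downloaded above:
# stained_barcodes = barcodes_from_tar("stained.tar.gz")
# unstained_barcodes = barcodes_from_tar("unstained.tar.gz")
```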

In [6]:
stained_barcode_df = pd.DataFrame(stained_barcodes, columns=['barcode'])
stained_barcode_df['stain_status'] = 'stained'
stained_barcode_df.head()

    barcode stain_status
0  00000140      stained
1  00000162      stained
2  00000185      stained
3  00000209      stained
4  00000231      stained


In [7]:
unstained_barcode_df = pd.DataFrame(unstained_barcodes, columns=['barcode'])
unstained_barcode_df['stain_status'] = 'unstained'
unstained_barcode_df.head()

    barcode stain_status
0  00000001    unstained
1  00000003    unstained
2  00000015    unstained
3  00000020    unstained
4  00000021    unstained


In [8]:
stain_status_df = pd.concat([stained_barcode_df, unstained_barcode_df])
stain_status_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15553 entries, 0 to 7776
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   barcode       15553 non-null  object
 1   stain_status  15553 non-null  object
dtypes: object(2)
memory usage: 364.5+ KB


In [9]:
stain_status_df['stain_status'].value_counts()

unstained    7777
stained      7776
Name: stain_status, dtype: int64
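The counts alone do not show whether any barcode ended up in both bundles. A quick sanity check could look like the sketch below, which assumes that each barcode is meant to carry exactly one stain status. The helper name `barcodes_with_conflicting_status` is hypothetical, and the toy frame is made-up illustration data, not values from the Figshare bundles.

```python
import pandas as pd


def barcodes_with_conflicting_status(df):
    """Return barcodes that appear under more than one stain_status."""
    n_status = df.groupby("barcode")["stain_status"].nunique()
    return sorted(n_status[n_status > 1].index)


# Toy frame for illustration only.
toy = pd.DataFrame({
    "barcode": ["00000001", "00000002", "00000002"],
    "stain_status": ["unstained", "stained", "unstained"],
})
print(barcodes_with_conflicting_status(toy))  # ['00000002']
```

Calling it on `stain_status_df` should return an empty list if the two datasets do not overlap.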

In [10]:
stain_status_df.to_csv('barcodes_from_figshare.tsv', index=False, sep='\t')
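One thing to watch when reading this file back: the barcodes are zero-padded strings, and `pd.read_csv` infers an integer column by default, which drops the padding. The sketch below shows the difference on a two-row in-memory TSV that stands in for `barcodes_from_figshare.tsv`; passing `dtype={"barcode": str}` keeps the barcodes as text.

```python
import io

import pandas as pd

# Small in-memory stand-in for barcodes_from_figshare.tsv.
tsv = "barcode\tstain_status\n00000140\tstained\n00000001\tunstained\n"

# Default type inference turns the barcodes into integers.
inferred = pd.read_csv(io.StringIO(tsv), sep="\t")
# Forcing the column to str preserves the leading zeros.
as_text = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"barcode": str})

print(inferred["barcode"].tolist())  # [140, 1]
print(as_text["barcode"].tolist())   # ['00000140', '00000001']
```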