# RDM - Tools for supporting (large) data transfers - excerpts 

This is an experiment -- a notebook to capture practices that support the acceleration of large data transfer (10s of TBs; millions of files) and perhaps automate some or all of them.

Run this notebook locally on the machine that has the data to transfer. It's written for Mac OS X (Unix/bash). It can probably be easily modified for Linux boxes.

#### Last revised: 27 March 2018

By: Rick

To turn automagic on, run the `%automagic` command with the argument '1'. With automagic on, line magics such as `cd` and `ls` work without the `%` prefix.

In [1]:
%automagic 1


Automagic is ON, % prefix IS NOT needed for line magics.


## Sample data

To demonstrate the data profiling tools, we'll need a directory with a ton of files. I have one with Library of Congress data on my computer. Locate an appropriate set of sample data on yours.

In [2]:
cd /Users/rjaffe/Jupyter_notebooks/RDM_datatransfer/test_data/LOC_Data/batch_az_acacia_ver01/

/Users/rjaffe/Jupyter_notebooks/rdm_datatransfer/test_data/LOC_Data/batch_az_acacia_ver01


In [3]:
ls

LOCTree.txt  sn89053851/  sn94051341/  sn94051342/  sn95060791/


Next, count the regular files (`-type f`), starting from the present working directory.

In [4]:
!find . -type f | wc -l

   18488
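
The same count can be taken from Python, which is handy when later cells need the number as a variable. This is a minimal sketch, not part of the original notebook; `count_files` is a hypothetical helper name, and it is meant to mirror `find . -type f` by skipping symbolic links.

```python
import os

def count_files(root="."):
    """Count regular files under root, recursively (roughly `find root -type f | wc -l`)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # `find -type f` does not match symlinks, so skip them here too
            if os.path.isfile(path) and not os.path.islink(path):
                total += 1
    return total
```

Calling `count_files(".")` from the sample data directory should agree with the `find` count above.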


### Looping one-liners

We can use looping (and other iterating techniques) to profile the files in a directory tree, recursively from the present working directory.

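Here's a one-liner adapted from a command that John White used recently on Savio. It uses the `ls` command to list every entry in the current directory (subdirectories and plain files such as LOCTree.txt alike), then runs `find` on each entry and passes the results to `wc` to count them. Because `find` is not restricted with `-type f` here, the counts include directories as well as files.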

In [5]:
!for i in `ls`; do echo $i; find $i | wc -l; done

LOCTree.txt
       1
sn89053851
    8287
sn94051341
    4783
sn94051342
    6256
sn95060791
   12465
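
The per-entry totals above add up to more than the 18,488 files counted earlier, because `find` without `-type f` also lists directories. The sketch below separates the two counts for each top-level entry. It is an illustration only; `profile_top_level` is a hypothetical name, and "files" here means any non-directory entry.

```python
import os

def profile_top_level(root="."):
    """Return {entry_name: (n_files, n_dirs)} for each top-level entry under root."""
    results = {}
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            n_files = n_dirs = 0
            for _dirpath, dirnames, filenames in os.walk(entry.path):
                n_dirs += len(dirnames)    # subdirectories below this entry
                n_files += len(filenames)  # non-directory entries
            results[entry.name] = (n_files, n_dirs)
        else:
            # a plain file at the top level counts as one file
            results[entry.name] = (1, 0)
    return results
```

For a directory entry, `find <entry> | wc -l` corresponds to `n_files + n_dirs + 1`, since `find` also lists the starting directory itself.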


Some researchers have found that certain file types -- .mat files, for one, apparently -- take a long time to transfer. This variation counts the number of files in each directory with a particular filename extension. (I use .xml as an example only because my test data set contains a number of .xml files.)

In [7]:
!for i in `ls`; do echo $i; find $i -type f -name '*.xml' | wc -l; done

sn89053851
    2352
sn94051341
    1426
sn94051342
    1797
sn95060791
    3663
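
When it isn't known in advance which file types dominate, it can help to tally every extension in one pass rather than running `find` once per extension. A minimal sketch using only the standard library follows; `count_extensions` is a hypothetical helper, and extensions are lower-cased so that `.XML` and `.xml` are counted together (an assumption about what is wanted).

```python
import os
from collections import Counter

def count_extensions(root="."):
    """Tally filename extensions for every file under root."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            counts[ext or "(none)"] += 1  # files without an extension
    return counts
```

`count_extensions(".").most_common(10)` then lists the ten most frequent extensions with their counts.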


Find all files with a given filename extension (i.e., of a specific file type).

In [8]:
!find ./sn89053851 -type f -name '*.xml'  # '*' is a wildcard meaning 'any set of characters'; the quotes keep the shell from expanding it before find sees it

./sn89053851/1899/11/16/ed-1/seq-7/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-9/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-8/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-1/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-6/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-12/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-3/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-4/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-5/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-2/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-11/ocr.xml
./sn89053851/1899/11/16/ed-1/seq-10/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-7/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-9/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-8/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-1/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-6/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-12/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-3/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-4/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-5/ocr.xml
./sn89053851/1899/11/30/ed-1/seq-2/ocr.xml
./sn89053851/1899/11/30/ed-1

./sn89053851/1911/03/10/ed-1/seq-1/ocr.xml
./sn89053851/1911/03/10/ed-1/seq-3/ocr.xml
./sn89053851/1911/03/10/ed-1/seq-4/ocr.xml
./sn89053851/1911/03/10/ed-1/seq-2/ocr.xml
./sn89053851/1911/03/31/ed-1/seq-1/ocr.xml
./sn89053851/1911/03/31/ed-1/seq-3/ocr.xml
./sn89053851/1911/03/31/ed-1/seq-4/ocr.xml
./sn89053851/1911/03/31/ed-1/seq-2/ocr.xml
./sn89053851/1911/03/24/ed-1/seq-1/ocr.xml
./sn89053851/1911/03/24/ed-1/seq-3/ocr.xml
./sn89053851/1911/03/24/ed-1/seq-4/ocr.xml
./sn89053851/1911/03/24/ed-1/seq-2/ocr.xml
./sn89053851/1911/04/28/ed-1/seq-1/ocr.xml
./sn89053851/1911/04/28/ed-1/seq-3/ocr.xml
./sn89053851/1911/04/28/ed-1/seq-4/ocr.xml
./sn89053851/1911/04/28/ed-1/seq-2/ocr.xml
./sn89053851/1911/04/21/ed-1/seq-1/ocr.xml
./sn89053851/1911/04/21/ed-1/seq-3/ocr.xml
./sn89053851/1911/04/21/ed-1/seq-4/ocr.xml
./sn89053851/1911/04/21/ed-1/seq-2/ocr.xml
./sn89053851/1911/04/07/ed-1/seq-1/ocr.xml
./sn89053851/1911/04/07/ed-1/seq-3/ocr.xml
./sn89053851/1911/04/07/ed-1/seq

./sn89053851/1903/01/01/ed-1/seq-3/ocr.xml
./sn89053851/1903/01/01/ed-1/seq-4/ocr.xml
./sn89053851/1903/01/01/ed-1/seq-2/ocr.xml
./sn89053851/1903/01/15/ed-1/seq-1/ocr.xml
./sn89053851/1903/01/15/ed-1/seq-3/ocr.xml
./sn89053851/1903/01/15/ed-1/seq-4/ocr.xml
./sn89053851/1903/01/15/ed-1/seq-2/ocr.xml
./sn89053851/1903/01/22/ed-1/seq-1/ocr.xml
./sn89053851/1903/01/22/ed-1/seq-3/ocr.xml
./sn89053851/1903/01/22/ed-1/seq-4/ocr.xml
./sn89053851/1903/01/22/ed-1/seq-2/ocr.xml
./sn89053851/1903/06/04/ed-1/seq-1/ocr.xml
./sn89053851/1903/06/04/ed-1/seq-3/ocr.xml
./sn89053851/1903/06/04/ed-1/seq-4/ocr.xml
./sn89053851/1903/06/04/ed-1/seq-2/ocr.xml
./sn89053851/1903/06/18/ed-1/seq-1/ocr.xml
./sn89053851/1903/06/18/ed-1/seq-3/ocr.xml
./sn89053851/1903/06/18/ed-1/seq-4/ocr.xml
./sn89053851/1903/06/18/ed-1/seq-2/ocr.xml
./sn89053851/1903/06/11/ed-1/seq-1/ocr.xml
./sn89053851/1903/06/11/ed-1/seq-3/ocr.xml
./sn89053851/1903/06/11/ed-1/seq-4/ocr.xml
./sn89053851/1903/06/11/ed-1/seq

./sn89053851/1908/03/26/ed-1/seq-3/ocr.xml
./sn89053851/1908/03/26/ed-1/seq-4/ocr.xml
./sn89053851/1908/03/26/ed-1/seq-2/ocr.xml
./sn89053851/1908/03/12/ed-1/seq-1/ocr.xml
./sn89053851/1908/03/12/ed-1/seq-3/ocr.xml
./sn89053851/1908/03/12/ed-1/seq-4/ocr.xml
./sn89053851/1908/03/12/ed-1/seq-2/ocr.xml
./sn89053851/1908/04/02/ed-1/seq-1/ocr.xml
./sn89053851/1908/04/02/ed-1/seq-3/ocr.xml
./sn89053851/1908/04/02/ed-1/seq-4/ocr.xml
./sn89053851/1908/04/02/ed-1/seq-2/ocr.xml
./sn89053851/1908/04/16/ed-1/seq-1/ocr.xml
./sn89053851/1908/04/16/ed-1/seq-3/ocr.xml
./sn89053851/1908/04/16/ed-1/seq-4/ocr.xml
./sn89053851/1908/04/16/ed-1/seq-2/ocr.xml
./sn89053851/1908/04/09/ed-1/seq-1/ocr.xml
./sn89053851/1908/04/09/ed-1/seq-3/ocr.xml
./sn89053851/1908/04/09/ed-1/seq-4/ocr.xml
./sn89053851/1908/04/09/ed-1/seq-2/ocr.xml
./sn89053851/1908/04/30/ed-1/seq-1/ocr.xml
./sn89053851/1908/04/30/ed-1/seq-3/ocr.xml
./sn89053851/1908/04/30/ed-1/seq-4/ocr.xml
./sn89053851/1908/04/30/ed-1/seq

./sn89053851/1907/06/13/ed-1/seq-3/ocr.xml
./sn89053851/1907/06/13/ed-1/seq-4/ocr.xml
./sn89053851/1907/06/13/ed-1/seq-2/ocr.xml
./sn89053851/1907/12/05/ed-1/seq-1/ocr.xml
./sn89053851/1907/12/05/ed-1/seq-3/ocr.xml
./sn89053851/1907/12/05/ed-1/seq-4/ocr.xml
./sn89053851/1907/12/05/ed-1/seq-2/ocr.xml
./sn89053851/1907/12/19/ed-1/seq-1/ocr.xml
./sn89053851/1907/12/19/ed-1/seq-3/ocr.xml
./sn89053851/1907/12/19/ed-1/seq-4/ocr.xml
./sn89053851/1907/12/19/ed-1/seq-2/ocr.xml
./sn89053851/1907/12/26/ed-1/seq-1/ocr.xml
./sn89053851/1907/12/26/ed-1/seq-6/ocr.xml
./sn89053851/1907/12/26/ed-1/seq-3/ocr.xml
./sn89053851/1907/12/26/ed-1/seq-4/ocr.xml
./sn89053851/1907/12/26/ed-1/seq-5/ocr.xml
./sn89053851/1907/12/26/ed-1/seq-2/ocr.xml
./sn89053851/1907/12/12/ed-1/seq-1/ocr.xml
./sn89053851/1907/12/12/ed-1/seq-3/ocr.xml
./sn89053851/1907/12/12/ed-1/seq-4/ocr.xml
./sn89053851/1907/12/12/ed-1/seq-2/ocr.xml
./sn89053851/1900/03/01/ed-1/seq-7/ocr.xml
./sn89053851/1900/03/01/ed-1/seq

This one-liner loops through a set of size thresholds, calling the `find` command to identify files larger than each threshold, then pipes that list to `wc` to count those files. Here, M = megabytes and G = gigabytes.

The output shows how many files exceed each threshold. The counts are cumulative: the 604 files larger than 1M are also included in the `+0` count.

In [10]:
!for size in +0 +1M +10M +1G +10G ; do echo "$size"; find . -type f -size $size | wc -l; done

+0
   18485
+1M
     604
+10M
       0
+1G
       0
+10G
       0
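
The same profile can be built in a single pass over the tree instead of one `find` run per threshold. The sketch below is an assumption-laden illustration: `THRESHOLDS` and `count_larger_than` are hypothetical names, sizes are compared in exact bytes with M and G taken as powers of 1024, and `find` implementations round sizes differently, so counts for files close to a threshold may not match exactly.

```python
import os

# Labels mirror the shell loop above; byte values are an assumption (binary units)
THRESHOLDS = {
    "+0": 0,
    "+1M": 1024 ** 2,
    "+10M": 10 * 1024 ** 2,
    "+1G": 1024 ** 3,
    "+10G": 10 * 1024 ** 3,
}

def count_larger_than(root=".", thresholds=THRESHOLDS):
    """Return {label: number of files under root whose size exceeds that threshold}."""
    counts = {label: 0 for label in thresholds}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            for label, limit in thresholds.items():
                if size > limit:
                    counts[label] += 1
    return counts
```

Walking the tree once matters more as the file count grows, since each extra `find` run repeats the full directory traversal.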


### Tree

The `tree` command diagrams the structure of your local file tree. The first argument points to the top of the tree you want to draw. The `-d` flag means 'show only directories' (i.e., not files); the `-o` option writes the output to a local file, which we'll name 'LOCTree.txt'. Writing to a file is valuable because the output can get quite long. 

I didn't have the 'tree' package installed on my Mac, so I needed to use a package manager to download and install it. I use Homebrew (https://brew.sh/), but there are others. To install 'tree' using Homebrew, the command is `brew install tree`. More documentation can be found at https://docs.brew.sh/Manpage.

In [9]:
!tree ./ -d -o LOCTree.txt

Let's read the file into the notebook and take a look at the file tree.

In [11]:
cat "LOCTree.txt"

./
├── sn89053851
│   ├── 1899
│   │   ├── 06
│   │   │   └── 01
│   │   │       └── ed-1
│   │   │           ├── seq-1
│   │   │           ├── seq-10
│   │   │           ├── seq-11
│   │   │           ├── seq-12
│   │   │           ├── seq-2
│   │   │           ├── seq-3
│   │   │           ├── seq-4
│   │   │           ├── seq-5
│   │   │           ├── seq-6
│   │   │           ├── seq-7
│   │   │           ├── seq-8
│   │   │           └── seq-9
│   │   ├── 08
│   │   │   └── 31
│   │   │       └── ed-1
│   │   │           ├── seq-1
│   │   │           ├── seq-10
│   │   │           ├── seq-2
│   │   │           ├── seq-3
│   │   │           ├── seq-4
│   │   │           ├── seq-5
│   │   │           ├── seq-6
│   │   │           ├── seq-7
│   │   │           ├── seq-8
│   │   │           └── seq-9
│   │   ├── 09
│   │   │   ├── 07
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-10
│   │   │   │       ├── seq-2


│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   ├── 17
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   └── 24
│   │   │       └── ed-1
│   │   │           ├── seq-1
│   │   │           ├── seq-2
│   │   │           ├── seq-3
│   │   │           └── seq-4
│   │   ├── 07
│   │   │   ├── 01
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   ├── 08
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   ├── 15
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │  

│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   ├── 14
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   ├── 21
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   └── 28
│   │   │       └── ed-1
│   │   │           ├── seq-1
│   │   │           ├── seq-2
│   │   │           ├── seq-3
│   │   │           └── seq-4
│   │   ├── 07
│   │   │   ├── 05
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   ├── 12
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │   │   │       └── seq-4
│   │   │   ├── 19
│   │   │   │   └── e

    │   │   │   └── ed-1
    │   │   │       ├── seq-1
    │   │   │       ├── seq-2
    │   │   │       ├── seq-3
    │   │   │       └── seq-4
    │   │   ├── 12
    │   │   │   └── ed-1
    │   │   │       ├── seq-1
    │   │   │       ├── seq-2
    │   │   │       ├── seq-3
    │   │   │       ├── seq-4
    │   │   │       ├── seq-5
    │   │   │       └── seq-6
    │   │   ├── 19
    │   │   │   └── ed-1
    │   │   │       ├── seq-1
    │   │   │       ├── seq-2
    │   │   │       ├── seq-3
    │   │   │       ├── seq-4
    │   │   │       ├── seq-5
    │   │   │       └── seq-6
    │   │   └── 26
    │   │       └── ed-1
    │   │           ├── seq-1
    │   │           ├── seq-2
    │   │           ├── seq-3
    │   │           ├── seq-4
    │   │           ├── seq-5
    │   │           └── seq-6
    │   ├── 07
    │   │   ├── 03
    │   │   │   └── ed-1
    │   │   │       ├── seq-1
    │   │   │       ├── seq-2
    │   │   │       ├── seq-3

    │   │           ├── seq-3
    │   │           └── seq-4
    │   ├── 11
    │   │   ├── 02
    │   │   │   └── ed-1
    │   │   │       ├── seq-1
    │   │   │       ├── seq-2
    │   │   │       ├── seq-3
    │   │   │       └── seq-4
    │   │   ├── 09
    │   │   │   └── ed-1
    │   │   │       ├── seq-1
    │   │   │       ├── seq-2
    │   │   │       ├── seq-3
    │   │   │       └── seq-4
    │   │   ├── 16
    │   │   │   └── ed-1
    │   │   │       ├── seq-1
    │   │   │       ├── seq-2
    │   │   │       ├── seq-3
    │   │   │       └── seq-4
    │   │   ├── 23
    │   │   │   └── ed-1
    │   │   │       ├── seq-1
    │   │   │       ├── seq-2
    │   │   │       ├── seq-3
    │   │   │       └── seq-4
    │   │   └── 30
    │   │       └── ed-1
    │   │           ├── seq-1
    │   │           ├── seq-2
    │   │           ├── seq-3
    │   │           └── seq-4
    │   └── 12
    │       ├── 07
    │       │   └── ed-1
    │    
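
Because the full tree gets so long, a depth-limited view is often easier to scan. (The `tree` command has its own `-L` option for limiting depth.) A rough Python equivalent is sketched below for cases where `tree` isn't installed; `dir_tree_lines` is a hypothetical helper that lists directories only, with plain indentation rather than box-drawing characters.

```python
import os

def dir_tree_lines(root=".", max_depth=2):
    """Return indented lines naming the directories under root, down to max_depth levels."""
    lines = [root]

    def walk(path, depth):
        if depth > max_depth:
            return
        subdirs = sorted(
            e.name for e in os.scandir(path) if e.is_dir(follow_symlinks=False)
        )
        for name in subdirs:
            lines.append("    " * depth + name + "/")
            walk(os.path.join(path, name), depth + 1)

    walk(root, 1)
    return lines
```

`print("\n".join(dir_tree_lines(".", max_depth=2)))` would show only the top two levels of the sample data tree.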

## Running shell scripts

Can we run a shell script? Yes. (Of course.)

For instructions, see: https://www.quora.com/How-do-I-execute-bash-scripts-via-IPython-Jupyter-notebook.

Let's demonstrate by running a useful script: one that finds all directories containing more than 2,000 files and displays the size and name of each of those directories.

First, let's locate the script: it's called 'dirs-gt-2000-du.sh' and it's in the same directory as this notebook. See In [2], above, for the path.

In [11]:
cd /Users/rjaffe/Jupyter_notebooks/rdm_datatransfer

/Users/rjaffe/Jupyter_notebooks/rdm_datatransfer


In [12]:
ls

Shell_play-excerpt.ipynb  dirs-gt-2000-du.sh*       readme.md
Shell_play.ipynb          dirs-gt-2000.sh*          test_data/


Running scripts within the notebook requires starting a new shell process. There's a library for that: `subprocess`.

In [13]:
import subprocess

The script takes an argument: the directory that we will profile. We pass the script name and its arguments as a list to `subprocess.run`, and capture the script's standard output with `stdout=subprocess.PIPE` so that we can print it afterwards.

Be patient: this will take a while!

In [None]:
dirinfo = subprocess.run(['./dirs-gt-2000-du.sh', '/Users/rjaffe/Jupyter_notebooks/rdm_datatransfer/test_data/LOC_Data/batch_az_acacia_ver01', '/dev/null'], stdout=subprocess.PIPE)
print(dirinfo.stdout.decode("utf-8", "strict"))

Voilà!
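
For long-running scripts it is useful to know when something went wrong rather than getting empty output back. The sketch below wraps `subprocess.run` with `check=True`, which raises `subprocess.CalledProcessError` on a non-zero exit status, and captures standard error alongside standard output. `run_script` is a hypothetical helper name, and `capture_output` and `text` assume Python 3.7 or later.

```python
import subprocess

def run_script(script, *args):
    """Run a script with its arguments and return its standard output as text.

    Raises subprocess.CalledProcessError if the script exits with a non-zero status.
    """
    result = subprocess.run(
        [script, *args],
        capture_output=True,  # collect both stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise on failure instead of returning silently
    )
    return result.stdout
```

With this helper, the cell above could be written as `print(run_script('./dirs-gt-2000-du.sh', <directory>, '/dev/null'))`, and a failing script would surface as an exception whose `stderr` attribute holds the error text.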

## Challenge

PROBLEM: RDM has consulted on several cases in which researchers were trying to upload millions of files to Box or Drive and encountering v-e-r-y slow transfer rates. Also, we've encountered at least two common research tools (Opera Phenix microscopes and FMRIB FSL analysis software) that can produce individual directories containing more than 20,000 files -- the most that Box can display without throwing a "File can't be found" error in response to the MLSD command used by FileZilla and other FTP clients.

CONSTRAINTS:

• Copying zillions of tiny files ("ZOTS") takes much longer than copying the same data as fewer, larger files.

• Google Drive and Box both throttle bulk uploads to as few as two files per second. 

• Box has a per-file size limit of 15 GB.
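
These limits can be checked before a transfer starts. The sketch below flags directories that directly hold more than 20,000 files and individual files above 15 GB. It is an illustration under stated assumptions: `preflight` and both constants are hypothetical names, the 15 GB limit is treated as decimal gigabytes, and only files (not subdirectories) count toward the per-directory limit.

```python
import os

MAX_FILES_PER_DIR = 20_000        # Box display limit mentioned above
MAX_FILE_BYTES = 15 * 1000 ** 3   # 15 GB, assuming decimal gigabytes

def preflight(root, max_files=MAX_FILES_PER_DIR, max_bytes=MAX_FILE_BYTES):
    """Return (crowded_dirs, oversized_files) found under root.

    crowded_dirs:    [(dirpath, n_files)] for directories over max_files
    oversized_files: [(path, size_in_bytes)] for files over max_bytes
    """
    crowded_dirs, oversized_files = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > max_files:
            crowded_dirs.append((dirpath, len(filenames)))
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # skip files that cannot be inspected
            if size > max_bytes:
                oversized_files.append((path, size))
    return crowded_dirs, oversized_files
```

Two empty lists suggest the tree fits within these two limits; it says nothing about the upload throttling, which depends on the total number of files.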

CHALLENGE 1: Can we use these tools to devise a strategy for identifying the ends of the tree -- where the number of files tends to be large -- and then tarring or zipping up just those directories? The goal is to enable faster transfer rates to Box and Drive, while allowing researchers to download to their analysis environments only those portions of the data that they need for any given analysis.
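
One possible starting point for Challenge 1 is sketched below: find the directories that directly hold many files, then write each one to its own `.tar.gz` archive. This is an untested strategy, not an established RDM workflow. `find_crowded_dirs` and `tar_directory` are hypothetical names, the 2,000-file threshold is borrowed from the script name above, and the sketch does not check the resulting archive against the 15 GB limit or handle crowded directories nested inside one another.

```python
import os
import tarfile

def find_crowded_dirs(root, min_files=2000):
    """Return [(dirpath, n_files)] for directories that directly hold more than min_files files."""
    crowded = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > min_files:
            crowded.append((dirpath, len(filenames)))
    return crowded

def tar_directory(dirpath, root, out_dir):
    """Archive dirpath (recursively) as a .tar.gz inside out_dir; return the archive path."""
    os.makedirs(out_dir, exist_ok=True)
    # Name the archive after the directory's path relative to root
    rel = os.path.relpath(dirpath, root)
    name = "root" if rel == "." else rel.replace(os.sep, "_")
    archive = os.path.join(out_dir, name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps only the directory's own name inside the archive
        tar.add(dirpath, arcname=os.path.basename(os.path.abspath(dirpath)))
    return archive
```

`out_dir` should sit outside `root`, so the archives are not themselves picked up by a later scan. Naming each archive after its relative path keeps the original location recoverable, which supports the goal of downloading only the portions needed for a given analysis.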

CHALLENGE 2: Can we run this notebook from the cloud and point it at a local machine with internal or mounted storage to transfer?

# Using rclone

Now let's try running rclone to copy a file from the local machine to Google Drive.

First, navigate to the file...

In [None]:
cd /Users/rjaffe/Documents/RDM/TestData

In [None]:
ls

In [None]:
cd Transfer-test-no-subfolders/

In [None]:
ls -al

Now run rclone to copy one of the files in this directory to a pre-configured drive account.

In [None]:
!rclone copy /Users/rjaffe/Documents/RDM/TestData/Transfer-test-no-subfolders/Picture\ 8.png rclone-rdmconsult-bdrive:test_from_notebook

Check the drive account in a separate browser to confirm that the file actually transferred. (It did!)
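
To drive rclone from Python rather than from a `!` shell line, the command can be assembled as a list and handed to `subprocess.run`. Passing a list means a filename containing a space, such as "Picture 8.png", needs no backslash escaping. The sketch below only builds the command; `rclone_copy_cmd` is a hypothetical helper, and it relies on rclone's `--dry-run` and `--transfers` flags.

```python
def rclone_copy_cmd(source, remote, dest_path, dry_run=True, transfers=4):
    """Build an `rclone copy` command as an argument list for subprocess.run."""
    cmd = [
        "rclone", "copy",
        source,
        f"{remote}:{dest_path}",          # e.g. a pre-configured remote name plus a folder
        "--transfers", str(transfers),    # number of parallel file transfers
    ]
    if dry_run:
        cmd.append("--dry-run")           # report what would be copied without copying
    return cmd
```

Running `subprocess.run(rclone_copy_cmd(path, 'rclone-rdmconsult-bdrive', 'test_from_notebook'), check=True)` would first perform a trial run; passing `dry_run=False` performs the actual copy.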

# Using Globus and the Globus API

...to be added.