# RDM - Tools for supporting (large) data transfers

This is an experiment -- a notebook to capture practices that support the acceleration of large data transfer (10s of TBs; millions of files) and perhaps automate some or all of them.

Run this notebook locally on the machine that has the data to transfer. It's written for Mac OS X (Unix/bash) and can probably be modified easily for Linux boxes.

#### Last revised: 26 March 2018

By: Rick

## A few shell command basics

Prefix a shell command with an exclamation point ('!') to run that command from the notebook.

A few examples:

In [1]:
!python --version

Python 3.6.2 :: Anaconda custom (64-bit)


In [2]:
!pwd

/Users/rjaffe/Jupyter_notebooks/rdm_datatransfer


In [3]:
!ls

LOCTree.txt        dirs-gt-2000-du.sh readme.md
Shell_play.ipynb   dirs-gt-2000.sh


Many common commands, including `pwd` and `ls`, can be run by prefixing the command with a '%' sign.

In [None]:
%cd /Users/rjaffe/Documents/RDM/RDM_Consulting/

In [None]:
%cd /Users/rjaffe/Jupyter_notebooks/RDM_datatransfer

These commands can be run without using the % prefix by turning on '**automagic**'.

From: https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html

>Besides %cd, other available shell-like magic functions are %cat, %cp, %env, %ls, %man, %mkdir, %more, %mv, %pwd, %rm, and %rmdir, any of which can be used without the % sign if automagic is on. This makes it so that you can almost treat the IPython prompt as if it's a normal shell.

The '%automagic' command toggles the setting on and off. Run this cell several times to see. Continue with automagic on.

In [5]:
%automagic


Automagic is ON, % prefix IS NOT needed for line magics.


In [None]:
pwd

However, you still need the '!' to capture the command's output in a variable.

This works:

In [None]:
directory = !pwd

In [None]:
print(directory)

But this throws an error:

In [None]:
directory = pwd   #This will throw an error.

Within the notebook, a variable assigned this way is not a plain Python string. It is an IPython `SList`: a list of the command's output lines, with a few extra convenience methods.

In [None]:
type(directory)
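
The `!` capture syntax exists only inside IPython and Jupyter. As a rough plain-Python counterpart, the sketch below uses the standard `subprocess` module to collect a command's output as a list of lines. The helper name `run_lines` is made up for this example, and it assumes a Unix-like system where `pwd` is available.

```python
import subprocess

def run_lines(args):
    """Run a command and return its standard output as a list of lines."""
    result = subprocess.run(
        args,
        stdout=subprocess.PIPE,     # capture the output instead of printing it
        universal_newlines=True,    # decode bytes to str
        check=True,                 # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.splitlines()

directory = run_lines(["pwd"])
print(type(directory), directory)
```

Here `directory` is an ordinary Python list, so `directory[0]` is the working directory as a string.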

# Sample data

To demonstrate the data profiling tools, we'll need a directory with a ton of files. I have one with Library of Congress data on my computer. Locate an appropriate set of sample data on yours.

In [14]:
cd /Users/rjaffe/documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/

/Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01


In [None]:
ls

## Some useful commands

Use the `find` command to list directories. The dot ('.') refers to your present location within the file tree. Don't forget to prefix the command with an exclamation point ("bang"): `find` is not one of the commands covered by 'automagic'.

In [None]:
!find . -type d

Now let's determine the number of directories (`-type d`) by piping the output of the `find` command to `wc` and counting the number of lines (`-l`) in the output.

In [None]:
!find . -type d | wc -l

(Notice that you only need to include the 'bang' at the beginning of the cell.)

My sample data contains multiple directories. We can inspect the first level of subdirectories using the `-maxdepth` option.

In [None]:
!find . -type d -maxdepth 1

See more layers by increasing the -maxdepth value.

In [None]:
!find . -type d -maxdepth 2

Let's look at one of the subdirectories and count the directories in it. (Adding `./sn89053851` to the find command means start from the subdirectory named 'sn89053851'.)

In [None]:
!find ./sn89053851 -type d | wc -l

Now let's count the files (`-type f`), starting from the present working directory.

In [None]:
!find . -type f | wc -l
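
For comparison, the same two counts can be sketched in Python with `os.walk`. This is an approximation rather than an exact replacement for `find`: `os.walk` classifies symbolic links by what they point to, and anything that is not a directory lands in the file count, so the numbers can differ slightly on trees that contain links or special files. The function name `count_dirs_and_files` is made up for this example.

```python
import os

def count_dirs_and_files(top):
    """Return (directories, files) under `top`, counting `top` itself as a directory."""
    n_dirs, n_files = 1, 0      # `find . -type d` also lists the starting directory
    for _dirpath, dirnames, filenames in os.walk(top):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_dirs, n_files

print(count_dirs_and_files("."))
```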

One more option: find files by size (`-size`). In this case, files larger than 1 MB (`+1M`).

In [None]:
!find . -type f -size +1M | wc -l

Many, many tiny files slow down transfers. So let's use the command above to look for files smaller than 20 KB. (Note the use of a lowercase 'k', as opposed to an uppercase 'M' or 'G'.)

In [None]:
!find . -type f -size -20k | wc -l

For more information on the `find` and `wc` commands, use the `man` command. (While you don't need the '!' for this command, it does affect the display. Try running the cell with and without it.)

In [None]:
!man find

In [None]:
!man wc

Next, use a bash one-liner to profile the files in a directory tree, recursively from the present working directory.

This one-liner loops through a given set of values, calling the `find` command to identify files whose size is greater than each value, then pipes that information to `wc` to count those files.  Here, M = megabytes and G = gigabytes.

The output shows, for each threshold, the number of files larger than that size.

In [None]:
!for size in +0 +1M +10M +1G +10G ; do echo "$size"; find . -type f -size $size | wc -l; done
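
The one-liner walks the whole tree once per threshold. A single-pass version can be sketched in Python, as below. It assumes that `M` and `G` are binary units (1024-based) and it skips symbolic links; the names `THRESHOLDS` and `size_profile` are made up for this example.

```python
import os

# (label, size in bytes); labels mirror the arguments passed to `find -size`
THRESHOLDS = [
    ("+0", 0),
    ("+1M", 1024 ** 2),
    ("+10M", 10 * 1024 ** 2),
    ("+1G", 1024 ** 3),
    ("+10G", 10 * 1024 ** 3),
]

def size_profile(top, thresholds=THRESHOLDS):
    """Count, for each threshold, the files under `top` that are larger than it."""
    counts = {label: 0 for label, _ in thresholds}
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue                    # count regular files only
            try:
                size = os.path.getsize(path)
            except OSError:
                continue                    # file vanished or is unreadable
            for label, limit in thresholds:
                if size > limit:
                    counts[label] += 1
    return counts

for label, count in size_profile(".").items():
    print(label, count)
```

Because each file is examined once, this should scale better than repeated `find` runs on trees with millions of files.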

The `tree` command diagrams the structure of your local file tree. The first argument points to the top of the tree you want to draw. The `-d` flag means 'show only directories' (i.e., not files); the `-o` option writes the output to a local file, which we'll name 'LOCTree.txt'. Writing to a file is valuable because the output can get quite long. 

I didn't have the 'tree' package installed on my Mac. I needed to use a package manager to download and install. I use Homebrew (https://brew.sh/), but there are others. To install 'tree' using Homebrew, the command is `brew install tree`. More documentation can be found at https://docs.brew.sh/Manpage.

In [15]:
!tree ./ -d -o LOCTree.txt

Find the output file just where you put it -- in the present working directory of the local file system.

In [16]:
ls

LOCTree.txt  sn89053851/  sn94051341/  sn94051342/  sn95060791/


Let's read the file into the notebook and take a look at the file tree.

In [17]:
with open("LOCTree.txt", "r") as f:
    print(f.read())

./
├── sn89053851
│   ├── 1899
│   │   ├── 06
│   │   │   └── 01
│   │   │       └── ed-1
│   │   │           ├── seq-1
│   │   │           ├── seq-10
│   │   │           ├── seq-11
│   │   │           ├── seq-12
│   │   │           ├── seq-2
│   │   │           ├── seq-3
│   │   │           ├── seq-4
│   │   │           ├── seq-5
│   │   │           ├── seq-6
│   │   │           ├── seq-7
│   │   │           ├── seq-8
│   │   │           └── seq-9
│   │   ├── 08
│   │   │   └── 31
│   │   │       └── ed-1
│   │   │           ├── seq-1
│   │   │           ├── seq-10
│   │   │           ├── seq-2
│   │   │           ├── seq-3
│   │   │           ├── seq-4
│   │   │           ├── seq-5
│   │   │           ├── seq-6
│   │   │           ├── seq-7
│   │   │           ├── seq-8
│   │   │           └── seq-9
│   │   ├── 09
│   │   │   ├── 07
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-10
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │  

Tree includes an option to limit how many levels of subdirectories to assess. That's useful for getting a sense of the structure of your file tree.

In [None]:
!tree ./ -d -L 1

In [None]:
!tree ./ -d -L 2

In [None]:
!tree ./ -d -L 3
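
On a machine where `tree` is not installed, a depth-limited directory listing can be approximated in Python. The sketch below is not a reimplementation of `tree`: it prints plain indentation instead of branch characters and does not follow symbolic links. The function name `dir_tree` is made up for this example.

```python
import os

def dir_tree(top, max_depth):
    """Return indented lines naming the directories under `top`, down to `max_depth` levels."""
    lines = [top]

    def walk(path, depth):
        if depth > max_depth:
            return
        try:
            names = sorted(
                entry.name for entry in os.scandir(path)
                if entry.is_dir(follow_symlinks=False)
            )
        except OSError:
            return                          # unreadable directory: skip it
        for name in names:
            lines.append("    " * depth + name)
            walk(os.path.join(path, name), depth + 1)

    walk(top, 1)
    return lines

print("\n".join(dir_tree(".", 2)))
```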

#  Shell scripts

Can we run a shell script? Yes. (Of course.)

For instructions, see: https://www.quora.com/How-do-I-execute-bash-scripts-via-IPython-Jupyter-notebook.

Let's demonstrate how by running a useful script: one that finds all directories containing more than 2,000 files and displays the size and name of each.

First, let's locate the script: it's called 'dirs-gt-2000-du.sh' and it's in the same directory as this notebook. See In[2], above.

In [6]:
cd /Users/rjaffe/Jupyter_notebooks/rdm_datatransfer

/Users/rjaffe/Jupyter_notebooks/rdm_datatransfer


In [7]:
ls

LOCTree.txt         dirs-gt-2000-du.sh* readme.md
Shell_play.ipynb    dirs-gt-2000.sh*


To be sure that the script does what we expect, let's look at it.

In [12]:
with open("dirs-gt-2000-du.sh", "r") as f:
    print(f.read())

#!/bin/bash
topdir=$1
for dir in `find $topdir -type d` ; do
  count=`find $dir -type f | wc -l`
  if [ $count -gt 2000 ]
  then
    size=`du -d=0 -h $dir`
	echo $size
  fi
done
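
The script runs a fresh `find` for every directory, which gets slow on deep trees, and its unquoted variables will split paths that contain spaces. As an alternative sketch (not the script's own method), the Python below makes a single bottom-up pass with `os.walk`. It reports the sum of file sizes in bytes rather than the disk usage that `du` reports, so the numbers will not match `du` exactly. The function name `dirs_with_many_files` is made up for this example.

```python
import os

def dirs_with_many_files(top, threshold=2000):
    """Return (path, file_count, total_bytes) for each directory whose
    whole subtree contains more than `threshold` files."""
    totals = {}     # path -> (files, bytes) for subtrees already visited
    results = []
    # topdown=False visits children before their parents
    for dirpath, dirnames, filenames in os.walk(top, topdown=False):
        n_files, n_bytes = 0, 0
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                n_bytes += os.path.getsize(path)
            except OSError:
                continue
            n_files += 1
        for name in dirnames:
            # symlinked directories are never walked, hence the (0, 0) default
            sub_files, sub_bytes = totals.pop(os.path.join(dirpath, name), (0, 0))
            n_files += sub_files
            n_bytes += sub_bytes
        totals[dirpath] = (n_files, n_bytes)
        if n_files > threshold:
            results.append((dirpath, n_files, n_bytes))
    return results

for path, n_files, n_bytes in sorted(dirs_with_many_files(".", 2000)):
    print("{:>8} files  {:>14} bytes  {}".format(n_files, n_bytes, path))
```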


Running scripts within the notebook requires starting a new shell process. There's a library for that: `subprocess`.

In [9]:
import subprocess

The script takes an argument: the directory that we will profile. We pass both the script name and directory to be profiled as arguments to the subprocess function. 

Be patient: this will take a while!

In [None]:
subprocess.run(['./dirs-gt-2000-du.sh', '/Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01'])

Hmmmm...something's wrong. On my local system, the output reads:

    5.7G /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01
    1.9G /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/sn89053851
    898M /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/sn94051342
    913M /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/sn94051341
    2.1G /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/sn95060791

Here's what's up: `subprocess.run` returns a `CompletedProcess` instance that holds the arguments we provided and the script's return code (`0` means success). That's what gets displayed in the cell's output. The script's own output is not part of that return value, so it does not appear in the cell. To see the *output* of the script's `echo` commands in the notebook, capture it by passing `stdout=subprocess.PIPE`, per:
https://docs.python.org/3/library/subprocess.html#subprocess.run

As the documentation notes, "By default, this function returns the data as encoded bytes." We decode it to UTF-8. (No extra argument such as `dev/null` is needed: the script reads only its first argument.)

In [10]:
dirinfo = subprocess.run(['./dirs-gt-2000-du.sh', '/Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01'], stdout=subprocess.PIPE)
print(dirinfo.stdout.decode("utf-8", "strict"))

5.7G /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01
1.9G /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/sn89053851
898M /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/sn94051342
913M /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/sn94051341
2.1G /Users/rjaffe/Documents/RDM/TestData/LOC_Data/batch_az_acacia_ver01/sn95060791



voilà!
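
Once the output is captured, it can be parsed into something more useful than a block of text. The sketch below shows the pattern with a stand-in child process, so that it runs anywhere: the command and the two lines it prints are made up for illustration and are not real measurements. With `universal_newlines=True`, `subprocess.run` returns text directly, so no explicit decode step is needed.

```python
import subprocess
import sys

# Stand-in for the shell script: a child process that prints two du-style lines.
# (Hypothetical output, used only to demonstrate capturing and parsing.)
child = [sys.executable, "-c",
         "print('5.7G /data/top'); print('1.9G /data/top/sub')"]

proc = subprocess.run(
    child,
    stdout=subprocess.PIPE,     # capture standard output
    stderr=subprocess.PIPE,     # capture error messages separately
    universal_newlines=True,    # return str instead of bytes
)
print("return code:", proc.returncode)

# Split each line into [size, path]; maxsplit=1 keeps spaces inside the path
rows = [line.split(None, 1) for line in proc.stdout.splitlines() if line.strip()]
for size, path in rows:
    print(size, path)
```

Checking `proc.returncode` and `proc.stderr` before trusting the output helps catch a script that failed partway through.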

## Challenge

PROBLEM: RDM has consulted on several cases in which researchers were trying to upload millions of files to Box or Drive and encountering v-e-r-y slow transfer rates. Also, we've encountered at least two common research tools (Opera Phenix microscopes and FMRIB FSL analysis software) that can produce individual directories containing more than 20,000 files -- the limit that Box can display without throwing a "File can't be found" error in response to the MLSD command used by FileZilla and other FTP clients.

CONSTRAINTS:

• Copying zillions of tiny files ("ZOTS") takes much longer than copying far fewer, larger files. 

• Google Drive and Box both throttle bulk uploads to as few as two files per second. 

• Box has a per-file file size limit of 15GB.  

CHALLENGE 1: Can we use these tools to devise a strategy for identifying the ends of the tree -- where the number of files tends to be large -- and then tarring or zipping up just those directories? The goal is to enable faster transfer rates to Box and Drive, while allowing researchers to download to their analysis environments only those portions of the data that they need for any given analysis.
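
One possible starting point for Challenge 1, offered as a sketch rather than a worked-out strategy: treat a directory with no subdirectories as an "end of the tree", select those holding at least some number of files, and archive each one with the standard `tarfile` module. The function names and the 2,000-file default are assumptions made for this example. The sketch leaves the original directories in place and does not check the resulting archive against Box's per-file size limit.

```python
import os
import tarfile

def leaf_dirs_with_many_files(top, min_files=2000):
    """Return directories that have no subdirectories and hold at least `min_files` files."""
    return sorted(
        dirpath
        for dirpath, dirnames, filenames in os.walk(top)
        if not dirnames and len(filenames) >= min_files
    )

def tar_directory(path):
    """Write <path>.tar.gz next to `path` and return the archive's name."""
    path = path.rstrip(os.sep)
    archive = path + ".tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps only the directory's own name inside the archive
        tar.add(path, arcname=os.path.basename(path))
    return archive

# List the candidates first; archive them only after reviewing the list.
for leaf in leaf_dirs_with_many_files(".", min_files=2000):
    print("candidate for archiving:", leaf)
```

Comparing `os.path.getsize(archive)` with the 15GB limit after each call to `tar_directory` would show which archives still need to be split.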

CHALLENGE 2: Can we run this notebook from the cloud and point it at a local machine with internal or mounted storage to transfer?

# Using rclone

Now let's try running rclone to copy a file from the local machine to Google Drive.

First, navigate to the file...

In [None]:
cd /Users/rjaffe/Documents/RDM/TestData

In [None]:
ls

In [None]:
cd Transfer-test-no-subfolders/

In [None]:
ls -al

Now run rclone to copy one of the files in this directory to a pre-configured drive account.

In [None]:
!rclone copy /Users/rjaffe/Documents/RDM/TestData/Transfer-test-no-subfolders/Picture\ 8.png rclone-rdmconsult-bdrive:test_from_notebook
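
The same transfer can be launched from Python, which avoids escaping the space in the file name by hand. The sketch below builds the argument list for `rclone copy` and runs it only when asked. The function name `rclone_copy` and its `execute` parameter are made up for this example; `--dry-run` is rclone's flag for a trial run that copies nothing. Actually executing the command assumes that rclone is installed and that the remote has been configured.

```python
import subprocess

def rclone_copy(source, remote, remote_path, dry_run=True, execute=False):
    """Build the `rclone copy` command and optionally run it. Returns the argument list."""
    cmd = ["rclone", "copy", source, "{}:{}".format(remote, remote_path)]
    if dry_run:
        cmd.append("--dry-run")     # report what would be copied without copying
    if execute:
        subprocess.run(cmd, check=True)     # raises CalledProcessError if rclone fails
    return cmd

cmd = rclone_copy(
    "/Users/rjaffe/Documents/RDM/TestData/Transfer-test-no-subfolders/Picture 8.png",
    "rclone-rdmconsult-bdrive",
    "test_from_notebook",
)
print(cmd)
```

Passing the arguments as a list means no shell is involved, so the path is handed to rclone exactly as written.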

Check the drive account in a separate browser to confirm that the file actually transferred. (It did!)

# Using Globus and the Globus API

...to be added.