# RDM - Tools for supporting (large) data transfers

This is an experiment: a notebook to capture practices that speed up large data transfers (tens of TB; millions of files) and perhaps to automate some or all of them.

Run this notebook locally on the machine that has the data to transfer. It's written for Mac OS X (Unix/bash) and can probably be modified easily for Linux machines.

Last revised: 22 March 2018

By: Rick

## A few shell command basics

In [38]:
!python --version

Python 3.6.2 :: Anaconda custom (64-bit)


In [1]:
!pwd

/Users/rjaffe/Jupyter_notebooks/RDM_datatransfer


In [14]:
!ls

LOC_Data                         dirs-gt-2000.sh
Shell_play.ipynb                 dirs-specifytopdir-gt-2000-du.sh
dirs-gt-2000-du.sh               dirs-specifytopdir-gt-500-du.sh


In [3]:
%cd /Users/rjaffe/Documents/RDM/RDM_Consulting/

/Users/rjaffe/Documents/RDM/RDM_Consulting


In [4]:
%cd /Users/rjaffe/Jupyter_notebooks/RDM_datatransfer

/Users/rjaffe/Jupyter_notebooks/RDM_datatransfer


From: https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html

"Besides %cd, other available shell-like magic functions are %cat, %cp, %env, %ls, %man, %mkdir, %more, %mv, %pwd, %rm, and %rmdir, **any of which can be used without the % sign if automagic is on**. This makes it so that you can almost treat the IPython prompt as if it's a normal shell" (**emphasis** added).

The '%automagic' command toggles the setting on and off.

In [6]:
%automagic


Automagic is ON, % prefix IS NOT needed for line magics.


In [7]:
%automagic


Automagic is OFF, % prefix IS needed for line magics.


In [8]:
%automagic


Automagic is ON, % prefix IS NOT needed for line magics.


However, automagic does not let you assign a command's output to a Python variable. For that you still need the '!' prefix.

In [9]:
directory = pwd   #This will throw an error.

NameError: name 'pwd' is not defined

This works:

In [10]:
directory = !pwd

Within the notebook, the captured output is not a plain Python string: it is an IPython SList, a list subclass with one element per line of output.

In [11]:
type(directory)

IPython.utils.text.SList

In [12]:
print(directory)

['/Users/rjaffe/Jupyter_notebooks/RDM_datatransfer']


As advertised, with automagic on, shell-like magics such as ls and cd work with no prefix at all: neither '%' nor '!'.

In [15]:
ls

LOC_Data/                         dirs-gt-2000.sh*
Shell_play.ipynb                  dirs-specifytopdir-gt-2000-du.sh*
dirs-gt-2000-du.sh*               dirs-specifytopdir-gt-500-du.sh*


In [16]:
cd LOC_Data/

/Users/rjaffe/Jupyter_notebooks/RDM_datatransfer/LOC_Data


In [17]:
ls

batch_az_acacia_ver01/         batch_az_acacia_ver01.tar.bz2


In [18]:
cd batch_az_acacia_ver01/

/Users/rjaffe/Jupyter_notebooks/RDM_datatransfer/LOC_Data/batch_az_acacia_ver01


In [19]:
ls

sn89053851/ sn94051341/ sn94051342/ sn95060791/


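Next, use a shell one-liner to profile the files in a directory tree, recursively from the present working directory.

Run the one-liner. Each count is the number of files larger than the given threshold, so the counts are cumulative rather than per-range. (Here, M and G are find's size suffixes for mebibytes and gibibytes; '+0' matches every non-empty file.)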

In [20]:
!for size in +0 +1M +10M +1G +10G ; do echo "$size\t"; find . -type f -size $size | wc -l; done

+0	
   18485
+1M	
     604
+10M	
       0
+1G	
       0
+10G	
       0
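
Because the one-liner reports cumulative counts, a per-range breakdown takes a little more work. The following is a minimal Python sketch of one way to get it, not part of the original notebook: the bucket boundaries mirror the thresholds above, while the bucket labels and the function names `bucket_for` and `profile_sizes` are illustrative assumptions.

```python
import os
from collections import Counter

# Upper bounds in bytes for each bucket. Labels are illustrative assumptions.
BUCKETS = [
    (1024 ** 2, "<= 1 MiB"),
    (10 * 1024 ** 2, "1-10 MiB"),
    (1024 ** 3, "10 MiB-1 GiB"),
    (10 * 1024 ** 3, "1-10 GiB"),
]
LAST_LABEL = "> 10 GiB"


def bucket_for(size):
    """Return the label of the first bucket whose upper bound holds `size`."""
    for upper, label in BUCKETS:
        if size <= upper:
            return label
    return LAST_LABEL


def profile_sizes(top="."):
    """Count regular files under `top`, grouped by size bucket."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files, roughly like `find -type f`.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            counts[bucket_for(os.path.getsize(path))] += 1
    return counts
```

Calling `profile_sizes(".")` from the top of the tree returns a `Counter` keyed by bucket label. Unlike the `+0` threshold above, this sketch also counts empty files, in the smallest bucket.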


The 'tree' command (not installed on macOS by default; it is available through package managers such as Homebrew) diagrams the structure of your local file tree. The first argument points to the top of the tree you want to draw. The '-d' flag means 'show only directories' (i.e., not files); the '-o' option writes the output to a local file. This is valuable because the output can get quite long.

In [21]:
pwd

'/Users/rjaffe/Jupyter_notebooks/RDM_datatransfer/LOC_Data/batch_az_acacia_ver01'

In [22]:
!tree ./ -d -o LOCTree.txt

Find the output file just where you put it -- in the present working directory of the local file system.

In [23]:
pwd

'/Users/rjaffe/Jupyter_notebooks/RDM_datatransfer/LOC_Data/batch_az_acacia_ver01'

In [24]:
ls

LOCTree.txt  sn89053851/  sn94051341/  sn94051342/  sn95060791/


In [25]:
with open("LOCTree.txt", "r") as f:   # the file is closed automatically
    print(f.read())

./
├── sn89053851
│   ├── 1899
│   │   ├── 06
│   │   │   └── 01
│   │   │       └── ed-1
│   │   │           ├── seq-1
│   │   │           ├── seq-10
│   │   │           ├── seq-11
│   │   │           ├── seq-12
│   │   │           ├── seq-2
│   │   │           ├── seq-3
│   │   │           ├── seq-4
│   │   │           ├── seq-5
│   │   │           ├── seq-6
│   │   │           ├── seq-7
│   │   │           ├── seq-8
│   │   │           └── seq-9
│   │   ├── 08
│   │   │   └── 31
│   │   │       └── ed-1
│   │   │           ├── seq-1
│   │   │           ├── seq-10
│   │   │           ├── seq-2
│   │   │           ├── seq-3
│   │   │           ├── seq-4
│   │   │           ├── seq-5
│   │   │           ├── seq-6
│   │   │           ├── seq-7
│   │   │           ├── seq-8
│   │   │           └── seq-9
│   │   ├── 09
│   │   │   ├── 07
│   │   │   │   └── ed-1
│   │   │   │       ├── seq-1
│   │   │   │       ├── seq-10
│   │   │   │       ├── seq-2
│   │   │   │       ├── seq-3
│   │  

# Shell scripts

Can we run a shell script?

For instructions, see: https://www.quora.com/How-do-I-execute-bash-scripts-via-IPython-Jupyter-notebook.

First, let's locate the script...it's two directories up.

In [26]:
cd ../../

/Users/rjaffe/Jupyter_notebooks/RDM_datatransfer


In [27]:
ls

LOC_Data/                         dirs-gt-2000.sh*
Shell_play.ipynb                  dirs-specifytopdir-gt-2000-du.sh*
dirs-gt-2000-du.sh*               dirs-specifytopdir-gt-500-du.sh*


The file 'dirs-specifytopdir-gt-2000-du.sh' finds, starting from ./LOC_Data/batch_az_acacia_ver01, all directories with more than 2,000 files, and displays the size and name of those folders. Be patient: this will take a while!

In [28]:
import subprocess

In [29]:
subprocess.call(['./dirs-specifytopdir-gt-2000-du.sh'])

0

Hmmmm...something's wrong: all we get back is a '0'. Run from a terminal on my local system, the script prints:

    5.7G ./LOC_Data/batch_az_acacia_ver01
    1.9G ./LOC_Data/batch_az_acacia_ver01/sn89053851
    898M ./LOC_Data/batch_az_acacia_ver01/sn94051342
    913M ./LOC_Data/batch_az_acacia_ver01/sn94051341
    2.1G ./LOC_Data/batch_az_acacia_ver01/sn95060791

The '0' is the script's exit status, which is all that subprocess.call returns; the script's printed output is not captured. To get the output, use subprocess.check_output, per:
https://docs.python.org/3/library/subprocess.html#subprocess.check_output.
As the documentation notes, "By default, this function returns the data as encoded bytes." We decode it as UTF-8.

In [30]:
dirinfo = subprocess.check_output('./dirs-specifytopdir-gt-2000-du.sh')

In [31]:
type(dirinfo)

bytes

In [32]:
print(dirinfo.decode("utf-8", "strict"))

5.7G ./LOC_Data/batch_az_acacia_ver01
1.9G ./LOC_Data/batch_az_acacia_ver01/sn89053851
898M ./LOC_Data/batch_az_acacia_ver01/sn94051342
913M ./LOC_Data/batch_az_acacia_ver01/sn94051341
2.1G ./LOC_Data/batch_az_acacia_ver01/sn95060791



voilà!
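
A small variation, sketched here rather than taken from the notebook, collects the exit status and the decoded output in a single call with `subprocess.run`. The helper name `run_and_capture` is hypothetical, and a stand-in command replaces the shell script so that the sketch runs anywhere. It uses `universal_newlines=True` because the `text=True` spelling only arrived in Python 3.7, after the 3.6.2 shown at the top of this notebook.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run `cmd` (a list of arguments); return (exit_code, stdout, stderr) as text."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode bytes to str using the locale encoding
    )
    return result.returncode, result.stdout, result.stderr


# Stand-in command (an assumption for illustration). In this notebook the
# list would be ['./dirs-specifytopdir-gt-2000-du.sh'].
code, out, err = run_and_capture([sys.executable, "-c", "print('5.7G ./LOC_Data')"])
print(code)
print(out, end="")
```

Unlike `check_output`, this does not raise when the command fails, so check `code` before trusting `out`.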

## Challenge

PROBLEM: RDM has consulted on several cases in which researchers trying to upload millions of files to Box or Drive encountered very slow transfer rates. We've also encountered at least two common research tools (Opera Phenix microscopes and FMRIB FSL analysis software) that can produce individual directories containing more than 20,000 files, which is the most that Box can list without returning a "File can't be found" error in response to the MLSD command used by FileZilla and other FTP clients.

CONSTRAINTS:

• Copying zillions of tiny files ("ZOTS") takes much longer than copying the same data as far fewer, larger files.

• Google Drive and Box both throttle bulk uploads. Google, in particular, limits uploads to two files per second. 

• Box has a per-file file size limit of 15GB.  

CHALLENGE 1: Can we use these tools to devise a strategy for identifying the ends of the tree, where the number of files tends to be great, and then tarring or zipping up just those directories? The goal is to enable faster transfer rates to Box and Drive, while allowing researchers to download to their analysis environments only those portions of the data that they need for any given analysis.
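
One possible starting point for Challenge 1 is sketched below in Python. It is an untested-at-scale sketch under stated assumptions, not an established solution: the function names, the 2,000-file default threshold (borrowed from the shell script's name), and the archive naming scheme are all illustrative. It also assumes that "more than N files" means files directly inside a directory, which may differ from what the shell scripts count.

```python
import os
import tarfile


def dirs_over_threshold(top, threshold=2000):
    """Yield (path, n_files) for each directory directly holding more than `threshold` files."""
    for dirpath, _dirnames, filenames in os.walk(top):
        if len(filenames) > threshold:
            yield dirpath, len(filenames)


def tar_directory(dirpath, top, dest_dir):
    """Write an uncompressed tar of `dirpath` into `dest_dir`; return the archive path.

    Hypothetical naming scheme: the path relative to `top`, with path
    separators replaced by underscores, plus '.tar'.
    """
    rel = os.path.relpath(dirpath, top)
    if rel == os.curdir:
        rel = os.path.basename(os.path.abspath(top))
    archive = os.path.join(dest_dir, rel.replace(os.sep, "_") + ".tar")
    with tarfile.open(archive, "w") as tar:
        # tar.add is recursive, so any subdirectories are included too.
        tar.add(dirpath, arcname=rel)
    return archive
```

A caller would first collect `list(dirs_over_threshold(top))` and then archive each hit into a destination outside `top`, so that new archives are not picked up by the walk. The sketch leaves the original directories in place and does not check an archive against Box's 15GB per-file limit.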

CHALLENGE 2: Can we run this notebook from the cloud and point it at a local machine with internal or mounted storage to transfer?

# Using rclone

Now let's try running rclone to copy a file from the local machine to Google Drive.

First, navigate to the file...

In [33]:
cd /Users/rjaffe/Documents/RDM/TestData

/Users/rjaffe/Documents/RDM/TestData


In [34]:
ls

25GBfile.txt
30000-sensor-report-like-files/
98gigfolder/
AA Kellam 2011 Numeric Data Sources for Ref Libs.pdf.7z
DebuggingFailure/
Incremental-Upload-Testing/
Transfer-test-no-subfolders/
Transfer_test_data/
lftp-log2.txt
lftp-log3.txt
lftp_log.txt
newer-than-test/


In [35]:
cd Transfer-test-no-subfolders/

/Users/rjaffe/Documents/RDM/TestData/Transfer-test-no-subfolders


In [36]:
ls -al

total 5821672
drwxr-xr-x  85 rjaffe  staff        2720 Mar  6  2017 ./
drwxr-xr-x  15 rjaffe  staff         480 Mar  8 15:37 ../
-rwx------   1 rjaffe  staff   966222326 Jan 21  2016 Beahrs_ELP_video_2012_1080p.mov*
-rwx------   1 rjaffe  staff   484826466 Jan 21  2016 Beahrs_ELP_video_2012_720p.mov*
-rwx------   1 rjaffe  staff     4898775 Jan 21  2016 Catal_Color.mov*
-rwx------   1 rjaffe  staff    11687174 Jan 21  2016 Catal_Dido_Remix.mov*
-rwx------   1 rjaffe  staff      622541 Jan 20  2016 Picture 1.png*
-rwx------   1 rjaffe  staff     1048386 Jan 20  2016 Picture 10.png*
-rwx------   1 rjaffe  staff      813169 Jan 20  2016 Picture 11.png*
-rwx------   1 rjaffe  staff     1223535 Jan 20  2016 Picture 12.png*
-rwx------   1 rjaffe  staff     1455440 Jan 20  2016 Picture 13.png*
-rwx------   1 rjaffe  staff     1267979 Jan 20  2016 Picture 14

Now run rclone to copy one of the files in this directory to a pre-configured Google Drive remote.

In [37]:
!rclone copy /Users/rjaffe/Documents/RDM/TestData/Transfer-test-no-subfolders/Picture\ 8.png rclone-rdmconsult-bdrive:test_from_notebook

Check the Drive account in a browser to confirm that the file actually transferred. (It did!)
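
To script the same transfer from Python, one option is to build the rclone command as an argument list, which avoids hand-escaping the space in the file name. This is a minimal sketch, assuming rclone is installed and the remote from the cell above is configured; the helper name `rclone_copy_cmd` is hypothetical. The `--dry-run` flag asks rclone to report what it would copy without copying anything.

```python
import shlex


def rclone_copy_cmd(source, remote, remote_path, dry_run=False):
    """Build the argument list for `rclone copy SOURCE REMOTE:PATH`."""
    cmd = ["rclone", "copy", source, f"{remote}:{remote_path}"]
    if dry_run:
        cmd.append("--dry-run")  # report what would be copied, copy nothing
    return cmd


cmd = rclone_copy_cmd(
    "/Users/rjaffe/Documents/RDM/TestData/Transfer-test-no-subfolders/Picture 8.png",
    "rclone-rdmconsult-bdrive",
    "test_from_notebook",
    dry_run=True,
)
# Show the command as it would be typed in a shell, with quoting applied.
print(" ".join(shlex.quote(part) for part in cmd))

# To execute it (needs rclone and the configured remote):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Because the command is a list and no shell is involved, the file name reaches rclone as a single argument.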