# CloudStor access via WebDAV

[CloudStor](https://www.aarnet.edu.au/network-and-services/cloud-services-applications/cloudstor/) is a data storage service provided by AARNet. Individual researchers in AARNet-connected institutions get 100 GB of storage space for free, and research projects can apply for additional space.

We're using CloudStor to store and share high-resolution scans of Sydney Stock Exchange records from the Noel Butlin Archives at ANU. By my reckoning, there are 72,843 TIFF files, each weighing in at about 100 MB. I'm going to be exploring ways of getting useful structured data out of the images, but as a first step I just wanted to be able to access data *about* the files.

CloudStor is an instance of OwnCloud, and OwnCloud provides WebDAV access, so I thought I'd have a go at using WebDAV to access file data on CloudStor. 

It works, but there are a few tricks...


## Software

I'm using a [Python WebDAV client](https://github.com/CloudPolis/webdav-client-python). I installed it using `pip` but ran into some problems with the dependencies. PyCurl complained that it didn't know what SSL library it was meant to use. Thanks to [StackOverflow](https://stackoverflow.com/a/48092283), I got it going with:

```
brew install curl --with-openssl
pip install --no-cache-dir --global-option=build_ext --global-option="-L/usr/local/opt/openssl/lib" --global-option="-I/usr/local/opt/openssl/include" --user pycurl
```
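To check whether the rebuilt PycURL picked up an SSL library, you can look at its version string, which lists the libraries libcurl was built against. This is just a minimal sketch: the helper name `ssl_backend` and the list of backend names are my own assumptions, not part of PycURL.

```python
def ssl_backend(version_string):
    """Return the SSL library token from a PycURL version string, or None."""
    # Backend names to look for (an assumed, non-exhaustive list)
    known = ('OpenSSL', 'LibreSSL', 'GnuTLS', 'NSS', 'SecureTransport', 'BoringSSL')
    for token in version_string.split():
        # Tokens look like 'OpenSSL/1.0.2n', so compare the part before the slash
        if token.split('/')[0] in known:
            return token
    return None

try:
    import pycurl
    print(ssl_backend(pycurl.version))
except ImportError as e:
    # A missing or mismatched SSL backend can surface as an ImportError
    print('Could not import pycurl: {}'.format(e))
```

If this prints `None`, or the import fails, the SSL problem probably hasn't been fixed yet.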

In [43]:
# Import what we need
import webdav.client as wc
import random
import pandas as pd
import time
from credentials import * # Storing my CloudStor credentials in another file

## Configuration

This was the thing that caused me most confusion.

First of all, you have to create a password in CloudStor to use with WebDAV. This is **not** the password that you use to access the CloudStor web interface (via the AAF). 

* Log onto the CloudStor web interface (using your institutional credentials)
* Click on **Settings** in the top menu
* Enter your new password in the 'Password' box and click **Change password**

This is the password you'll use with the WebDAV client. The WebDAV username is the email address you've used to register with CloudStor.

On the bottom left of the CloudStor web interface is another **Settings** link. Clicking it displays the URL to use with WebDAV: `https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/`

Originally, I just plugged this URL in below as the `webdav_hostname`, and at first things seemed to work: I could list the contents of a directory. But I couldn't get resource information or download a file. Eventually, [amongst the issues](https://github.com/CloudPolis/webdav-client-python/issues/18) on the client's GitHub site, I found the answer. You have to separate the host from the path, and supply the path as `webdav_root`.
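If you'd rather not split the URL by hand, the standard library can do it. This is a small sketch; `split_webdav_url` is a hypothetical helper of my own, not something provided by the WebDAV client.

```python
from urllib.parse import urlsplit

def split_webdav_url(url):
    """Split a full WebDAV URL into (hostname, root path)."""
    parts = urlsplit(url)
    # Scheme plus host goes in 'webdav_hostname', the rest in 'webdav_root'
    hostname = '{}://{}'.format(parts.scheme, parts.netloc)
    return hostname, parts.path

hostname, root = split_webdav_url('https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/')
print(hostname)  # https://cloudstor.aarnet.edu.au
print(root)      # /plus/remote.php/webdav/
```

The two values can then be used for `webdav_hostname` and `webdav_root` in the connection options.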


In [37]:
# Set the connection options. CLOUDSTOR_USER and CLOUDSTOR_PW are stored in a separate credentials file.
options = {
    'webdav_hostname': 'https://cloudstor.aarnet.edu.au',
    'webdav_login':    CLOUDSTOR_USER,
    'webdav_password': CLOUDSTOR_PW,
    'webdav_root':     '/plus/remote.php/webdav/'
}

## Getting file lists

In [38]:
# Ok let's initiate the client.
client = wc.Client(options)

In [39]:
# Use .list() to get a list of resources in the directory
# In this case it's a list of subdirectories
dirs = client.list('Shared/ANU-Library/Sydney Stock Exchange 1901-1950/')
# The parent directory itself is included in the list (WebDAV listings typically
# return the requested collection as well as its members), so let's filter it out
dirs = [d for d in dirs if d[:2] == 'AU']

In [40]:
# Loop through all the subdirectories and use .list() again to get all the filenames
details = []
summary = []
for d in dirs:
    files = [f for f in client.list('Shared/ANU-Library/Sydney Stock Exchange 1901-1950/{}'.format(d)) if f[:1] == 'N']
    print('{}: {} files'.format(d, len(files)))
    # Save the details for each subdirectory
    summary.append({'directory': d, 'number': len(files)})
    for f in files:
        path = 'Shared/ANU-Library/Sydney Stock Exchange 1901-1950/{}{}'.format(d, f)
        # This slows things down a lot, so disable for now
        # info = client.info(path)
        info = {}
        info['name'] = f
        info['directory'] = d
        info['path'] = path
        # print(info)
        details.append(info)
    time.sleep(0.5)

AU NBAC N193-001/: 303 files
AU NBAC N193-002/: 312 files
AU NBAC N193-003/: 345 files
AU NBAC N193-004/: 312 files
AU NBAC N193-005/: 305 files
AU NBAC N193-006/: 334 files
AU NBAC N193-007/: 349 files
AU NBAC N193-008/: 318 files
AU NBAC N193-009/: 327 files
AU NBAC N193-010/: 327 files
AU NBAC N193-011/: 350 files
AU NBAC N193-012/: 310 files
AU NBAC N193-013/: 330 files
AU NBAC N193-014/: 329 files
AU NBAC N193-015/: 349 files
AU NBAC N193-016/: 313 files
AU NBAC N193-017/: 331 files
AU NBAC N193-018/: 322 files
AU NBAC N193-019/: 348 files
AU NBAC N193-020/: 312 files
AU NBAC N193-021/: 330 files
AU NBAC N193-022/: 314 files
AU NBAC N193-023/: 344 files
AU NBAC N193-024/: 310 files
AU NBAC N193-025/: 323 files
AU NBAC N193-026/: 332 files
AU NBAC N193-027/: 349 files
AU NBAC N193-028/: 314 files
AU NBAC N193-029/: 328 files
AU NBAC N193-030/: 327 files
AU NBAC N193-031/: 339 files
AU NBAC N193-032/: 316 files
AU NBAC N193-033/: 329 files
AU NBAC N193-034/: 322 files
AU NBAC N193-0

In [44]:
# How many files are there?
len(details)

72843

In [72]:
# Get some information on individual files
client.info('Shared/ANU-Library/Sydney Stock Exchange 1901-1950/{}/{}'.format('AU NBAC N193-001', 'N193-001_0001.tif'))

{'created': None,
 'name': None,
 'size': '106240746',
 'modified': 'Wed, 13 Jun 2018 01:56:48 GMT'}
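Note that `size` comes back as a string and `modified` as an HTTP-style date. If you want to sort or filter on these values, it helps to convert them first. Here's a minimal sketch using only the standard library; `tidy_info` and its output keys are names I've made up for illustration.

```python
from email.utils import parsedate_to_datetime

def tidy_info(info):
    """Convert the string values returned by client.info() into typed values."""
    size = int(info['size'])
    return {
        'size_bytes': size,
        'size_mb': round(size / 1000000, 1),
        # Dates like 'Wed, 13 Jun 2018 01:56:48 GMT' parse to an aware datetime
        'modified': parsedate_to_datetime(info['modified']),
    }

sample = {'created': None, 'name': None, 'size': '106240746',
          'modified': 'Wed, 13 Jun 2018 01:56:48 GMT'}
tidy = tidy_info(sample)
print(tidy['size_mb'])               # 106.2
print(tidy['modified'].isoformat())  # 2018-06-13T01:56:48+00:00
```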

## Saving the results

I saved the results as CSV files — one for [files](files.csv) and one for [directories](directories.csv).

In [17]:
# Save previously downloaded data as CSV files so that I don't have to do it again
# I use Pandas for these conversions because it's easy
df_files = pd.DataFrame(details)
df_files.to_csv('files.csv', index=False)
df_dirs = pd.DataFrame(summary)
df_dirs.to_csv('directories.csv', index=False)

In [34]:
# Load previously harvested data
files = pd.read_csv('files.csv').to_dict('records')
directories = pd.read_csv('directories.csv').to_dict('records')
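Once the data is reloaded, it's worth checking that the two files still agree with each other. The sketch below compares the number of file records per directory with the count saved in the directory summary. It assumes the records have the `directory` and `number` keys used above; `count_mismatches` is a hypothetical helper.

```python
from collections import Counter

def count_mismatches(files, directories):
    """Return {directory: (expected, found)} wherever the counts disagree."""
    found = Counter(f['directory'] for f in files)
    mismatches = {}
    for d in directories:
        actual = found.get(d['directory'], 0)
        if d['number'] != actual:
            mismatches[d['directory']] = (d['number'], actual)
    return mismatches

# Tiny made-up example: the second directory is missing a file record
demo_files = [{'directory': 'A/', 'name': 'a1.tif'},
              {'directory': 'B/', 'name': 'b1.tif'}]
demo_dirs = [{'directory': 'A/', 'number': 1},
             {'directory': 'B/', 'number': 2}]
print(count_mismatches(demo_files, demo_dirs))  # {'B/': (2, 1)}
```

An empty dictionary means the file list and the directory summary are consistent.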

## Getting a random sample of images

To do some testing on the images, I wanted to download a random sample.

In [35]:
# First we'll make a random selection from the list of file names.
random_files = random.sample(files, 10)
random_files

[{'directory': 'AU NBAC N193-140/',
  'name': 'N193-140_0317.tif',
  'path': 'Shared/ANU-Library/Sydney Stock Exchange 1901-1950/AU NBAC N193-140/N193-140_0317.tif'},
 {'directory': 'AU NBAC N193-050/',
  'name': 'N193-050_0097.tif',
  'path': 'Shared/ANU-Library/Sydney Stock Exchange 1901-1950/AU NBAC N193-050/N193-050_0097.tif'},
 {'directory': 'AU NBAC N193-123/',
  'name': 'N193-123_0009.tif',
  'path': 'Shared/ANU-Library/Sydney Stock Exchange 1901-1950/AU NBAC N193-123/N193-123_0009.tif'},
 {'directory': 'AU NBAC N193-141/',
  'name': 'N193-141_0294.tif',
  'path': 'Shared/ANU-Library/Sydney Stock Exchange 1901-1950/AU NBAC N193-141/N193-141_0294.tif'},
 {'directory': 'AU NBAC N193-025/',
  'name': 'N193-025_0076.tif',
  'path': 'Shared/ANU-Library/Sydney Stock Exchange 1901-1950/AU NBAC N193-025/N193-025_0076.tif'},
 {'directory': 'AU NBAC N193-193/',
  'name': 'N193-193_0132.tif',
  'path': 'Shared/ANU-Library/Sydney Stock Exchange 1901-1950/AU NBAC N193-193/N193-193_0132.tif'}
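Because `random.sample()` gives a different selection on every run, the sample can't be recreated later. If you want the same sample each time, one option is to use a seeded random number generator. This is a sketch only: the helper name, the seed value, and the made-up file names are my own.

```python
import random

def reproducible_sample(items, k, seed):
    """Return the same k-item sample every time for a given seed."""
    # A dedicated Random instance avoids changing the global random state
    rng = random.Random(seed)
    return rng.sample(items, k)

# Made-up file names, just to demonstrate
names = ['N193-001_{:04d}.tif'.format(i) for i in range(1, 101)]
first = reproducible_sample(names, 10, seed=42)
second = reproducible_sample(names, 10, seed=42)
print(first == second)  # True
```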

In [29]:
# Then we'll just loop through the randomly selected files and download them
for image in random_files:
    print('Downloading {}'.format(image['name']))
    client.download_sync(remote_path=image['path'], local_path='images/{}'.format(image['name']))

Downloading N193-019_0217.tif
Downloading N193-127_0222.tif
Downloading N193-145_0325.tif
Downloading N193-163_0009.tif
Downloading N193-120_0118.tif
Downloading N193-009_0289.tif
Downloading N193-012_0142.tif
Downloading N193-161_0085.tif
Downloading N193-101_0132.tif
Downloading N193-190_0228.tif
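At about 100 MB per file, it's handy to be able to re-run the download without fetching everything again. Here's a minimal sketch that creates the local directory if needed and skips files that already exist. It assumes `client` has the `download_sync()` method used above; `download_missing` is a hypothetical helper.

```python
import os

def download_missing(client, records, local_dir='images'):
    """Download each record's file unless a local copy already exists."""
    os.makedirs(local_dir, exist_ok=True)
    downloaded = []
    for record in records:
        local_path = os.path.join(local_dir, record['name'])
        if os.path.exists(local_path):
            # Already on disk, so don't fetch it again
            continue
        print('Downloading {}'.format(record['name']))
        client.download_sync(remote_path=record['path'], local_path=local_path)
        downloaded.append(record['name'])
    return downloaded
```

Calling `download_missing(client, random_files)` would then replace the loop above. Note that this only checks whether a file exists, so a partially downloaded file would be skipped too.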


## But wait, there's more...

Wondering how to access a public share? Have [a look here](Cloudstor-access-to-a-public-share-via-WebDAV.ipynb)...