# Working with data

On your computer you can open a file explorer that shows you all files and directories. iRODS has something similar: *data objects*, which you can think of as files for the moment, and *collections*, which are similar to directories.

In the course of this and the next tutorials it will become clear that this is only an analogy and that there is more to data objects and collections.

## Create an iRODS session

Again, to work with data in iRODS from your laptop you will need a `session`, as described in [01-Setup-and-connect](01-Setup-and-connect.ipynb).

In [None]:
from ibridges.interactive import interactive_auth
session = interactive_auth()

## Create a collection and upload a file

We will now create a new collection on iRODS in your current iRODS working location:

In [None]:
from ibridges.path import IrodsPath

irods_path = IrodsPath(session, '~')
print("Current working location:", irods_path)
coll_path = irods_path.joinpath('demo')
print("New collection name:", coll_path)
coll_path.create_collection()
print("New collection is created:", coll_path.collection_exists())

Now we will upload a file to that new iRODS collection. In this tutorial we assume that you have a file called `demofile.txt` in your home directory:
- Linux: `/home/<user>/demofile.txt`
- Mac: `/Users/<user>/demofile.txt`
- Windows: `C:\Users\<user>\demofile.txt`

Let's check that first.

In [None]:
from pathlib import Path
local_file = Path.home().joinpath('demofile.txt')
if local_file.is_file():
    print('You are good to follow the next steps!')
else:
    print(f'Please create a file {local_file} before you continue')

Now we can upload that file:

In [None]:
from ibridges import upload
ops = upload(local_file, coll_path, overwrite=True)

The call returns information about what was changed on the iRODS server. **All uploads, downloads and synchronisations return such a summary.** You can also retrieve the changes without actually carrying them out by setting the flag `dry_run=True`.

We now created a new data object in iRODS:

<img src="img/DataObject1.png" width="400">

How can we now be sure that the file is uploaded? Let's inspect the collection:

In [None]:
print(coll_path.name)
print(coll_path.collection_exists())

We can get the list of `subcollections` and `data_objects` behind an `IrodsPath` like below.

In [None]:
print(f'Current subcollections in {str(coll_path)}:', coll_path.collection.subcollections)
print(f'Current data objects in {str(coll_path)}:', coll_path.collection.data_objects)

**Note** that if the `IrodsPath` points to a data object, you need to use `path.dataobject` to get the object the path points to. We will see that later on in the tutorial.

## Download the file and the collection

Of course we can also download the data object we have just created. We will download it to the `Downloads` directory on your computer. Let us first retrieve the changes without carrying them out.

In [None]:
local_path = Path.home().joinpath('Downloads')
assert local_path.is_dir()

In [None]:
from ibridges import download
ops = download(coll_path.joinpath('demofile.txt'), local_path, dry_run=True, overwrite=True)
ops.print_summary()

Now let's really download the file:

In [None]:
ops = download(coll_path.joinpath('demofile.txt'), local_path)

What will happen if we download again?

In [None]:
download(coll_path.joinpath('demofile.txt'), local_path)

You will receive an Exception `FileExistsError`. This means that the file is already there.  

**Note, both the upload and the download function do not overwrite existing data by default.** 

You can force overwriting of existing data by setting the flag `overwrite`:

In [None]:
ops = download(coll_path.joinpath('demofile.txt'), local_path, overwrite=True)

You also see that no data was transferred. *ibridges* compares the source and the destination, and if the data is exactly the same (checked by so-called [checksums](https://en.wikipedia.org/wiki/Checksum)) the respective data will be omitted. In the next section we will show you how to see beforehand which data will be updated in a transfer.
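The two behaviours seen above (refusing to overwrite by default, and skipping the transfer when the checksums match) can be illustrated with a purely local sketch. This is not how *ibridges* is implemented: `sha256_of` and `copy_if_changed` are hypothetical helpers that copy between two local files using only the Python standard library.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_if_changed(src: Path, dst: Path, overwrite: bool = False) -> str:
    """Copy src to dst and report what happened (hypothetical helper).

    Returns 'copied' or 'skipped'. Raises FileExistsError when dst
    already exists and overwrite is False.
    """
    if dst.exists():
        if not overwrite:
            raise FileExistsError(f"{dst} already exists, set overwrite=True")
        if sha256_of(src) == sha256_of(dst):
            return "skipped"  # identical contents, nothing to transfer
    shutil.copyfile(src, dst)
    return "copied"


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "demofile.txt"
    dst = Path(tmp) / "copy_of_demofile.txt"
    src.write_text("Hello")

    first = copy_if_changed(src, dst)                   # destination is new
    second = copy_if_changed(src, dst, overwrite=True)  # same contents
    src.write_text("Hallo")
    third = copy_if_changed(src, dst, overwrite=True)   # contents differ
    print(first, second, third)  # copied skipped copied
```

Calling `copy_if_changed(src, dst)` a second time without `overwrite=True` would raise `FileExistsError`, mirroring the repeated download above.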

### Check whether you need to upload or download

The two functions `upload` and `download` have a parameter `dry_run`. With this parameter you can first check what would be changed, i.e. which data would be updated and which collections/folders would be created in a real up- or download:

In [None]:
ops = download(coll_path.joinpath('demofile.txt'), local_path, dry_run=True, overwrite=True)
ops.print_summary()

Which lists in this summary are filled depends on whether you upload or download data. In the example above we want to download data from iRODS to our local file system. In this case only the entries `'create_dir'` and `'download'` will be populated.

The `ops` are always returned, not only in a dry run. They can also serve as a summary of what was changed.
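The idea behind a dry run can be sketched locally: first collect what would happen, and only afterwards decide whether to carry it out. The helper `plan_download` below is hypothetical and does not reproduce the *ibridges* internals. It only borrows the names `'create_dir'` and `'download'` from the text above and assumes the remote files are given as relative paths.

```python
import tempfile
from pathlib import Path, PurePosixPath


def plan_download(remote_files: list[str], local_dir: Path) -> dict[str, list[str]]:
    """List what a download would do, without touching the file system.

    Hypothetical helper for illustration only.
    """
    plan: dict[str, list[str]] = {"create_dir": [], "download": []}
    for rel in remote_files:
        target = local_dir.joinpath(*PurePosixPath(rel).parts)
        # Collect every missing parent folder, outermost first
        for folder in reversed(target.parents):
            if not folder.exists() and str(folder) not in plan["create_dir"]:
                plan["create_dir"].append(str(folder))
        # Only files that are not there yet need to be fetched
        if not target.exists():
            plan["download"].append(str(target))
    return plan


with tempfile.TemporaryDirectory() as tmp:
    local_dir = Path(tmp)
    plan = plan_download(["demo/demofile.txt", "demo/raw/data.csv"], local_dir)
    for key, entries in plan.items():
        print(key, len(entries))
    # The plan was computed, but nothing was created on disk
    nothing_created = not any(local_dir.iterdir())
    print("Nothing created:", nothing_created)
```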

## System metadata

Both collections and data objects in iRODS are automatically labelled with some extra information. This information is called *system metadata*. We already saw the `name` and the `path` of the collection. But there is more!

In [None]:
print(f'Collection {coll_path.name} was created on {coll_path.collection.create_time}')
print(f'The collection was last modified on {coll_path.collection.modify_time}')
print(f'The collection was created by and is owned by {coll_path.collection.owner_name}')

We can also get some system information about the data object we created by uploading the demo file. To inspect a data object (and also a collection) we have to retrieve it from iRODS. We will use the `IrodsPath` to do so.

In [None]:
obj_path = IrodsPath(session, coll_path.joinpath('demofile.txt'))

Now we can inspect the object:

In [None]:
print(f'Data object name: {obj_path.name}')
print(f'Data object {obj_path.name} full path in iRODS is {obj_path}')
print(f'The data object was created on {obj_path.dataobject.create_time}')
print(f'The data object was last modified on {obj_path.dataobject.modify_time}')
print(f'The data object was uploaded by and is owned by {obj_path.dataobject.owner_name}')

**Data objects also carry a `size` and a `checksum`**, with which you can check whether the data reached iRODS completely. Checksums are particularly useful. With the size you can only check whether the length of your local file matches the one in iRODS. However, you cannot see whether the contents are really the same. For example, 'Hello' and 'Hallo' have the same length, but they differ. With a checksum you can detect this. A checksum is a digest of the contents of a file.

**Note, both functions, `upload` and `download`, will always calculate and compare the checksums between your computer and iRODS**, to make sure that the data is transferred correctly. 

In [None]:
print('Data object size', obj_path.size)
print('Data object checksum', obj_path.checksum)
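To make the 'Hello' versus 'Hallo' point concrete, here is a small standard-library sketch. It assumes the iRODS server uses SHA-256 checksums, which are commonly displayed as `sha2:` followed by the base64-encoded digest; your server may be configured differently (for example with MD5). The helper `irods_style_sha256` is hypothetical.

```python
import base64
import hashlib

hello = b"Hello"
hallo = b"Hallo"

# Same length, so a size check alone cannot tell them apart
print(len(hello) == len(hallo))  # True

# Different contents give different digests
print(hashlib.sha256(hello).hexdigest() == hashlib.sha256(hallo).hexdigest())  # False


def irods_style_sha256(data: bytes) -> str:
    """Format a SHA-256 digest as 'sha2:<base64>' (assumed server scheme)."""
    digest = hashlib.sha256(data).digest()
    return "sha2:" + base64.b64encode(digest).decode("ascii")


print(irods_style_sha256(hello))
```

Under that assumption, applying `irods_style_sha256` to the bytes of your local `demofile.txt` should give the same string as the checksum printed for the data object.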

The iRODS full path `obj.path` is the address with which you can get the full data object. Currently, this is the uploaded file and its system metadata.

<img src="img/DataObject2.png" width="400">

## Data object replicas

Usually when we upload a file to another computer we create one new item which can be found under the path that we used to upload the data. 

Some iRODS systems distribute the uploaded data to different storage systems. That means that the file is stored on several storage systems, but there is one path under which you can find and address the data. In iRODS, each of these copies is called a data object `replica`. When you download data, the system decides for you which replica is the most advantageous to give to you in terms of speed and integrity.

<img src="img/DataObject3.png" width="400">

Let us inspect how many replicas of our file we have in iRODS. We need to get the real iRODS object behind the path:

In [None]:
from ibridges.util import obj_replicas
obj_replicas(obj_path.dataobject)

In Yoda you might have to wait for some time until the data is replicated.

The answer is a list in which each element is structured like this:

In [None]:
item = obj_replicas(obj_path.dataobject)[0]
print('Replica index:', item[0])
print('Storage system:', item[1])
print('Replica checksum:', item[2])
print('Replica size:', item[3])
print('Replica status:', item[4])

The replica status tells you whether this particular copy of the data is verified and OK. It should say `good`. In all other cases please send the whole replica information, including the path, to your local iRODS helpdesk.
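If a data object has many replicas, a small filter saves you from reading every entry by hand. The sketch below uses made-up entries in the order described above (index, storage system, checksum, size, status); the resource names, checksums and the status `stale` are illustrative values only, and `problem_replicas` is a hypothetical helper.

```python
# Made-up replica entries: (index, storage system, checksum, size, status)
replicas = [
    (0, "resource_a", "sha2:AAAA", 5, "good"),
    (1, "resource_b", "sha2:AAAA", 5, "stale"),
]


def problem_replicas(replicas):
    """Return the entries whose status is not 'good'."""
    return [item for item in replicas if item[4] != "good"]


for index, resource, checksum, size, status in problem_replicas(replicas):
    print(f"Replica {index} on {resource} has status {status!r}")
# Replica 1 on resource_b has status 'stale'
```

With a live session you would pass the result of `obj_replicas(obj_path.dataobject)` instead of the sample list.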