# Submit to biosamples

This notebook shows how to use this library in the simplest possible way:

- Before you start
    - Generate a valid input file with metadata about 1 sample (A very, very simple TSV)
- What components do we need
- What do we need to input to each component
- How to correct our samples before submission
- How to submit
- See the results in Excel (or TSV, your decision)

This is the simplest example; in other notebooks, we will explore how to submit multiple samples, with relationships defined amongst them, how to validate our samples against ENA checklists, and how to transform input data.

## Before you start

- Please make sure you have `python 3.10` or higher
- Please make sure you have a Webin account set up in [webin-dev](https://wwwdev.ebi.ac.uk/ena/submit/webin/login)
- Please make sure you have the latest biobroker library installed: `pip install biobroker`

In [1]:
%pip install --upgrade biobroker==0.0.4

Collecting biobroker==0.0.4
  Downloading biobroker-0.0.4-py3-none-any.whl.metadata (42 kB)
Downloading biobroker-0.0.4-py3-none-any.whl (55 kB)
Installing collected packages: biobroker
Successfully installed biobroker-0.0.4
Note: you may need to restart the kernel to use updated packages.


### Generate input file

I don't want to have an example file of this kind in the examples, so, let's generate it ourselves! Let's do a simple example with 2 attributes: "name" and "collected_at"

In [1]:
sample_tsv = [
    ["name", "collected_at"],
    ["sumple", "noon"]         
]

writable_sample = "\n".join(["\t".join(row) for row in sample_tsv])
with open("simple_sample_sumple.tsv", "w") as f:
    f.write(writable_sample)

"name" is a special property in BioSamples. We'll talk about it later; for now, just remember that this property must always be set.

## What components do we need

Given that you've read the documentation on the main page (I know you've done it, I wrote it with lots of love), you will know by now that, in order to submit, we need:
- An authenticator: To authenticate ourselves to the archive
- An api: To store/execute the instructions to submit to the archive
- At least 1 metadata entity: To store the metadata about our samples
- An input processor: To process the input file with the metadata

Additionally, I will also import an output processor. It's not needed, but it lets us save our brokering results to a very nice, very demure and readable Excel file.

Additionally number 2: metadata entities have a very nice feature, a `static method` (that just means you can call the method without creating an instance) that gives you guidelines on how to fill out the metadata. Lovely, eh?
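
If static methods are new to you, here is a minimal sketch of the idea. `ToyEntity` is a made-up class for illustration only; it is not the biobroker `Biosample` class, and its guideline text is invented.

```python
class ToyEntity:
    """Hypothetical stand-in for a metadata entity (not part of biobroker)."""

    def __init__(self, metadata: dict):
        self.entity = metadata

    @staticmethod
    def guidelines() -> str:
        # No `self` parameter: the method does not need an instance to run
        return "A ToyEntity MUST have the following properties set:\n\t- name"


# Called on the class itself, without creating an object first
print(ToyEntity.guidelines())
```

The same call also works on an instance, but the point is that you can read the guidelines before you have any metadata at all.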

In [2]:
from biobroker.authenticator import WebinAuthenticator # Biosamples uses the WebinAuthenticator
from biobroker.api import BsdApi # BioSamples Database (BSD) API
from biobroker.metadata_entity import Biosample # The metadata entity
from biobroker.input_processor import TsvInputProcessor # An input processor
from biobroker.output_processor import XlsxOutputProcessor # An output processor

print(Biosample.guidelines())

A Biosamples entity MUST have the following properties set:
	- name: a descriptive title for the sample
	- taxId or organism: either the integer code for a taxon ID (taxId), according to https://www.ncbi.nlm.nih.gov/taxonomy, or a string that validates against those records (organism)
A Biosamples entity SHOULD have the following properties set:
	- release: date of release for the metadata of the entity. DEFAULTS TO MOMENT OF CREATION.
For more information, please see https://www.ebi.ac.uk/biosamples/docs/references/api/submit#_submission_minimal_fields.

To indicate relationships in the samples, please use a field named after the relationshipitself: namely, 'derived_from', 'same_as', 'has_member' or 'child_of'.
Please seehttps://www.ebi.ac.uk/biosamples/docs/guides/relationships


2024-09-26 09:00:36,723 - BsdApi - INFO - Set up BSD API successfully: using base uri 'https://wwwdev.ebi.ac.uk/biosamples/samples'
2024-09-26 09:01:59,291 - BsdApi - ERROR - Found following errors in sample validation:
	- /characteristics.organism: should have required property 'organism'
	- /characteristics.Organism: should have required property 'Organism'
	- /characteristics.species: should have required property 'species'
	- /characteristics.Species: should have required property 'Species'
	- /characteristics: should match some schema in anyOf)


Now we have imported everything! See how easy it is?

You may have noticed that, aside from `name`, the guidelines list 2 other fields:
- taxId/organism (mandatory): BioSamples requires each sample to state the taxonomic classification of the organism it comes from. This is not important now - it will throw an error later and we will correct it.
- release (recommended): As with any archive, the [meta]data can be kept private for an amount of time, and this field sets the release date. We could set it, but for this example we won't, so the sample will be released as soon as we submit it. If no release date is provided, the `Biosample` metadata entity sets it to the timestamp at which the object was created.

## How to set up each component

Alrighty! Let's start setting up the components:

### 1. Set up the input processor

For the input processor, we just need to give the path to the input file :)

In [3]:
path = "simple_sample_sumple.tsv" # This is the file we created previously

input_processor = TsvInputProcessor(input_data=path)

Let's check it out!

In [4]:
input_processor.input_data

[{'name': 'sumple', 'collected_at': 'noon'}]
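
To get a feel for where that list of dictionaries comes from, here is a rough standard-library equivalent. It only sketches the idea (one dictionary per row, keyed by the header) and is not the code that `TsvInputProcessor` runs internally.

```python
import csv
import io

# Same content as simple_sample_sumple.tsv, kept in memory so the sketch is self-contained
tsv_text = "name\tcollected_at\nsumple\tnoon"

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
records = [dict(row) for row in reader]
print(records)  # [{'name': 'sumple', 'collected_at': 'noon'}]
```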

There's another functionality of the input processors: the `transform` function. I will discuss it in further notebooks!

For now, we have the input processor set up. Cool!

### 2. Set up the samples

Now that we have the data in an object... where does that go?

Well, it's simple: input processors have a method called `process`. You pass it the `metadata_entity` class that you want to create, and it returns a list of those entities built from the `.input_data`. If you want to see more documentation on that, refer to [ReadTheDocs](https://biobroker.readthedocs.io/en/latest/biobroker.input_processor.html#biobroker.input_processor.input_processor.GenericInputProcessor.process).

In [5]:
my_sample = input_processor.process(Biosample) # We're giving it a Biosample class to process
print(my_sample)

[<biobroker.metadata_entity.metadata_entity.Biosample object at 0x10f9fead0>]


It's... a list of objects?

Yup! The `process` function always returns a list of objects (`Biosample` entities in this case). This makes writing code against the output much easier, as you don't need separate handling for a list and for a single entity. Don't be lazy - write against the list!

(Also, let's see how the metadata inside has been transformed)

In [6]:
print(my_sample[0].entity)

{'name': 'sumple', 'characteristics': {'collected_at': [{'text': 'noon'}]}, 'release': '2024-09-26T07:52:45.407206Z'}


Now the sample is set up! See how it has re-structured the metadata?

This is the format in which BioSamples expects its metadata. You don't need to understand everything - just know that certain keywords (e.g. `name`) get treated differently, and everything else is stored under `characteristics`. You can review the list of these properties in the RTD docs: [ROOT PROPERTIES](https://biobroker.readthedocs.io/en/latest/biobroker.metadata_entity.html#biobroker.metadata_entity.metadata_entity.ROOT_PROPERTIES)
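
The restructuring can be sketched in a few lines. This is an illustration under assumptions, not the library's implementation: `to_biosamples_shape` is a hypothetical helper, and `ASSUMED_ROOT_KEYS` is a small made-up subset of root-level keys taken from the outputs in this notebook (the real list is the one linked above).

```python
# Assumption: illustrative subset of root-level keys, not the library's ROOT_PROPERTIES
ASSUMED_ROOT_KEYS = {"name", "release", "accession"}


def to_biosamples_shape(flat: dict) -> dict:
    """Hypothetical helper: root keys stay at the top level, the rest become characteristics."""
    shaped = {"characteristics": {}}
    for key, value in flat.items():
        if key in ASSUMED_ROOT_KEYS:
            shaped[key] = value
        else:
            # Each characteristic is stored as a list of {'text': ...} objects
            shaped["characteristics"][key] = [{"text": value}]
    return shaped


print(to_biosamples_shape({"name": "sumple", "collected_at": "noon"}))
```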

### 3. Setting up the authenticator + API

Now, we need to set up the authenticator and the API. For this example, we're going to use BioSamples dev - the testing environment.

For that, we will set up an environment variable, `API_ENVIRONMENT`, and we will provide the authenticator with our webin-dev username and password.

In [7]:
import os
os.environ['API_ENVIRONMENT'] = "dev" # There are multiple ways to set up environment variables

username = "" # Your username goes here
password = "" # Your password goes here
authenticator = WebinAuthenticator(username=username, password=password)

api = BsdApi(authenticator=authenticator)

For your username and password in a workflow environment, I would recommend either setting them up as environment variables and loading them in your script, or using a config file that you're sure is not going to be pushed to the repository. Be mindful!
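
A minimal sketch of the environment variable option follows. The variable names `WEBIN_USERNAME` and `WEBIN_PASSWORD` and the helper `load_webin_credentials` are choices made for this example; biobroker does not read them on its own.

```python
import os


def load_webin_credentials() -> tuple[str, str]:
    """Hypothetical helper: read credentials from the environment, fail loudly if missing."""
    # Assumption: these variable names are made up for this sketch
    username = os.environ.get("WEBIN_USERNAME", "")
    password = os.environ.get("WEBIN_PASSWORD", "")
    if not username or not password:
        raise RuntimeError("Set WEBIN_USERNAME and WEBIN_PASSWORD before running")
    return username, password
```

The returned pair can then be passed to `WebinAuthenticator(username=..., password=...)` exactly as in the cell above, so the credentials never appear in the notebook itself.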

Now that we have everything set up, let's try to submit!

### 4. Submitting your sample

This step is very easy - Since we've done everything, we just need to hit submit on the API object and pass the samples we generated!

In [10]:
results = api.submit(my_sample)

BiosamplesValidationError: Found following errors in sample validation:
	- /characteristics.organism: should have required property 'organism'
	- /characteristics.Organism: should have required property 'Organism'
	- /characteristics.species: should have required property 'species'
	- /characteristics.Species: should have required property 'Species'
	- /characteristics: should match some schema in anyOf)

As we can see, BioSamples is complaining: it's telling us our sample is missing a certain set of characteristics. In this specific case, the error comes from an `anyOf` schema rule, which means, in plain English, that you only need to set one of the listed characteristics.
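
The "any one of these is enough" logic can be sketched as follows. This only mirrors the idea using the four keys from the error message; the real validation is done by BioSamples against its own JSON schema, and `satisfies_any_of` is a hypothetical helper.

```python
# The four alternatives listed in the error message above
ORGANISM_KEYS = ("organism", "Organism", "species", "Species")


def satisfies_any_of(characteristics: dict) -> bool:
    """True when at least one of the alternative keys is present."""
    return any(key in characteristics for key in ORGANISM_KEYS)


missing = {"collected_at": [{"text": "noon"}]}
complete = {**missing, "organism": [{"text": "Homo sapiens"}]}

print(satisfies_any_of(missing))   # False
print(satisfies_any_of(complete))  # True
```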

Since we mentioned it before, let's correct our sample and add the `organism` field. `organism` takes a string, but this should match the label in the NCBI taxonomy service. Let's choose human!

In [11]:
my_sample[0]['organism'] = "Homo sapiens"

One very cool thing is that `metadata_entity` objects behave like dictionaries. You can add a key:value pair to a metadata entity the same way you would to a dictionary, and the entity will handle where and how to write it.
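
For the curious, dictionary-style assignment on a custom object is done with `__setitem__`. The class below is a toy written for this explanation, with an assumed set of root keys; it is not how `Biosample` is actually implemented.

```python
class ToySample:
    """Hypothetical dict-like entity; only illustrates the routing idea."""

    ROOT_KEYS = {"name", "release"}  # assumed subset, for illustration only

    def __init__(self, name: str):
        self.entity = {"name": name, "characteristics": {}}

    def __setitem__(self, key, value):
        # Root keys are written at the top level, everything else under characteristics
        if key in self.ROOT_KEYS:
            self.entity[key] = value
        else:
            self.entity["characteristics"][key] = [{"text": value}]

    def __getitem__(self, key):
        if key in self.ROOT_KEYS:
            return self.entity[key]
        return self.entity["characteristics"][key][0]["text"]


sample = ToySample("sumple")
sample["organism"] = "Homo sapiens"
print(sample.entity)
# {'name': 'sumple', 'characteristics': {'organism': [{'text': 'Homo sapiens'}]}}
```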

(Tip: For multiple corrections, I'd advise correcting the source of the metadata (the TSV) rather than working directly on the objects.)
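
Correcting the source could look like the sketch below, which adds an `organism` column to every row using only the standard library. The output file name `simple_sample_corrected.tsv` is made up for this example, and the input is inlined so the snippet runs on its own.

```python
import csv
import io

# Same content as the input file created earlier, inlined so the sketch is self-contained
original = "name\tcollected_at\nsumple\tnoon\n"
rows = list(csv.DictReader(io.StringIO(original), delimiter="\t"))

for row in rows:
    row["organism"] = "Homo sapiens"  # same value for every sample in this sketch

# Assumption: a new file name, chosen so the original input is left untouched
with open("simple_sample_corrected.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

The corrected file can then be given to a new `TsvInputProcessor`, the same way as in step 1.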

Let's try again!

In [12]:
submitted_samples = api.submit(entities=my_sample)

In [13]:
print(submitted_samples[0].entity)

{'name': 'sumple', 'characteristics': {'SRA accession': [{'text': 'ERS30993787'}], 'collected_at': [{'text': 'noon'}], 'organism': [{'text': 'Homo sapiens'}]}, 'accession': 'SAMEA131394580', 'sraAccession': 'ERS30993787', 'webinSubmissionAccountId': 'Webin-64342', 'taxId': 9606, 'status': 'PUBLIC', 'release': '2024-09-26T07:52:45.407Z', 'update': '2024-09-26T08:03:34.403Z', 'submitted': '2024-09-26T08:03:34.403Z', 'submittedVia': 'JSON_API', 'create': '2024-09-26T08:03:34.403Z', '_links': {'self': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/samples'}, 'curationDomain': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/samples{?curationdomain}', 'templated': True}, 'curationLinks': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA131394580/curationlinks'}, 'curationLink': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA131394580/curationlinks/{hash}', 'templated': True}, 'structuredData': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/structureddata/SAMEA131394580'}}}


And it's submitted! If you want to see it, it's already available in BioSamples dev:

In [14]:
print(f"https://wwwdev.ebi.ac.uk/biosamples/samples/{submitted_samples[0]['accession']}")

https://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA131394580


(You may need to wait a bit - BioSamples dev operates a bit slower, as is normal for testing grounds. It may take a while for the sample to become public.)

Let's create an output file so you can be happy with your local version of the metadata!

In [15]:
from biobroker.output_processor import XlsxOutputProcessor
output_processor = XlsxOutputProcessor(output_path="simple_sample_submitted.xlsx", sheet_name="Awesome submission")
output_processor.save(submitted_samples)

And you should see something like this! Isn't this demure?


<img src="simple_sample_submitted.png">
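
If you would rather keep a TSV copy, the nested entity first has to be flattened into one row. The sketch below shows one way to do that with the standard library. It is not a biobroker output processor: `flatten`, the `"; "` separator for multi-valued characteristics and the output file name are all assumptions made for this example.

```python
import csv

# Trimmed copy of the submitted entity printed earlier
entity = {
    "name": "sumple",
    "accession": "SAMEA131394580",
    "characteristics": {
        "collected_at": [{"text": "noon"}],
        "organism": [{"text": "Homo sapiens"}],
    },
}


def flatten(entity: dict) -> dict:
    """Hypothetical helper: one column per root string field and per characteristic."""
    row = {key: value for key, value in entity.items() if isinstance(value, str)}
    for key, values in entity["characteristics"].items():
        # Assumption: multiple values are joined with "; " in a single cell
        row[key] = "; ".join(item["text"] for item in values)
    return row


row = flatten(entity)
with open("simple_sample_submitted_sketch.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row), delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)
```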

### 5. Enjoy!