In [1]:
import math
import lithops
import pandas as pd
from dataplug import CloudObject

from dataplug.formats.genomics.fasta import FASTA, partition_chunks_strategy


### 🔐 AWS Credentials Configuration

There are two main ways to configure your AWS credentials:

---

#### ✅ Option 1: (Recommended) Use AWS CLI profiles

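Configure your credentials with the AWS CLI by running the following command in your terminal. It prompts for your access key, secret key, default region, and output format, and stores them under `~/.aws/`: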

```bash
aws configure
```

✅ This is the **recommended approach**, especially for production environments or shared projects, because the keys never appear in the notebook itself.

---

#### ⚠️ Option 2: (Less secure) Set credentials in code

For quick experiments or in isolated environments, you can manually set the environment variables in this notebook:

```python
import os

# Replace the placeholder strings with your own keys; never commit real values.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
```

> ⚠️ **IMPORTANT**: Never expose your credentials in source code or public repositories.  
> This method should only be used for temporary testing and never in production.
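
Before running the rest of the notebook, it can help to confirm which of the two options is actually in effect. The following is a minimal sketch using only the standard library; `detect_aws_credential_source` is a hypothetical helper (not part of `dataplug`, Lithops, or any AWS SDK), and it only approximates the usual lookup order of environment variables first, then the shared credentials file. It never prints the secret values.

```python
import configparser
import os
from pathlib import Path


def detect_aws_credential_source(env=None, credentials_file=None):
    """Return 'environment', 'profile', or 'none' (hypothetical helper)."""
    env = os.environ if env is None else env

    # Option 2: keys set directly as environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    # Option 1: shared credentials file written by `aws configure`.
    # AWS_SHARED_CREDENTIALS_FILE can override the default location.
    path = Path(
        credentials_file
        or env.get("AWS_SHARED_CREDENTIALS_FILE")
        or Path.home() / ".aws" / "credentials"
    )
    profile = env.get("AWS_PROFILE", "default")

    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is ignored rather than raising
    if parser.has_section(profile) and parser.has_option(profile, "aws_access_key_id"):
        return "profile"
    return "none"


print(f"Credentials found via: {detect_aws_credential_source()}")
```

A result of `none` suggests that neither option has been set up yet, although other credential sources (for example, an instance role) are not covered by this sketch.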


### ⚙️ Preprocessing a FASTA File from S3 with Lithops

The following code creates a `CloudObject` for a FASTA file stored in an S3 bucket and preprocesses it in **4 parallel jobs** by splitting the file into equally sized chunks.

```python
co = CloudObject.from_s3(
    FASTA, 
    "s3://dnastack-covid-19-sra-data/PacBio/fasta/SRR16804994/SRR16804994.fasta"
)

# Perform preprocessing in 4 parallel jobs (chunk size = total size / 4, rounded up)
parallel_config = {"verbose": 10}
chunk_size = math.ceil(co.size / 4)
co.preprocess(parallel_config=parallel_config, chunk_size=chunk_size)
```

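- `CloudObject.from_s3(...)`: Creates a `dataplug` `CloudObject` of type `FASTA` that points at the remote file.
- `co.size`: The size of the file in bytes.
- `chunk_size`: The number of bytes each parallel job will process; `math.ceil` rounds up so that the 4 chunks cover the whole file.
- `parallel_config`: Options for the parallel run; here it only sets the logging verbosity.
- `preprocess(...)`: Preprocesses the file in chunks of `chunk_size` bytes, which results in **4 parallel jobs** here and improves performance for large datasets.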

> ℹ️ You can adjust the number of parallel jobs by modifying the divisor used in the `chunk_size` calculation.
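
The relationship between the divisor and the resulting jobs can be sketched in plain Python. This only illustrates the arithmetic; `chunk_byte_ranges` is a hypothetical helper, and the byte boundaries that `dataplug` uses internally may differ.

```python
import math


def chunk_byte_ranges(total_size, num_jobs):
    """Split [0, total_size) into at most num_jobs contiguous byte ranges."""
    if total_size <= 0 or num_jobs <= 0:
        raise ValueError("total_size and num_jobs must be positive")
    # Same formula as in the notebook: round up so no bytes are left over.
    chunk_size = math.ceil(total_size / num_jobs)
    return [
        (start, min(start + chunk_size, total_size))
        for start in range(0, total_size, chunk_size)
    ]


print(chunk_byte_ranges(10, 4))  # → [(0, 3), (3, 6), (6, 9), (9, 10)]
```

Because the chunk size is rounded up, the last range is usually shorter than the others, and for small files the number of ranges can end up lower than the requested number of jobs.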


In [3]:
co = CloudObject.from_s3(FASTA, "s3://dnastack-covid-19-sra-data/PacBio/fasta/SRR16804994/SRR16804994.fasta")

# Perform preprocessing in 4 parallel jobs (chunk size = total size / 4, rounded up)
parallel_config = {"verbose": 10}
chunk_size = math.ceil(co.size / 4)
co.preprocess(parallel_config=parallel_config, chunk_size=chunk_size)

### 📊 Inspecting and Partitioning the FASTA File

After preprocessing the FASTA `CloudObject`, we can inspect its metadata and split it into multiple slices for parallel processing.

```python
print(f"FASTA file has {co.attributes.num_sequences} sequences")

data_slices = co.partition(partition_chunks_strategy, num_chunks=8)
```

- `co.attributes.num_sequences`: Returns the number of sequences found in the FASTA file.
- `partition(...)`: Divides the file into chunks using the `partition_chunks_strategy` and the specified number of partitions (`num_chunks=8`).

In [4]:
print(f"FASTA file has {co.attributes.num_sequences} sequences")
data_slices = co.partition(partition_chunks_strategy, num_chunks=8)

FASTA file has 1 sequences


In [5]:
data_slices[0].get()

b'>SRR16804994\nNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTTTTGCAGCCGATTATCAGCACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGTACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTG

In [6]:
len(data_slices)

8
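
Each slice can then be processed independently. As a minimal sketch, assuming `get()` returns FASTA-formatted bytes as in the output above, a per-slice function could count the bases while skipping header lines. `count_bases` and the `sample` bytes are hypothetical and exist only for illustration.

```python
from collections import Counter


def count_bases(fasta_bytes):
    """Count residue characters in a FASTA chunk, skipping '>' header lines."""
    counts = Counter()
    for line in fasta_bytes.splitlines():
        if line.startswith(b">"):
            continue  # header line, not sequence data
        counts.update(line.decode("ascii").upper())
    return counts


# Small made-up chunk in the same shape as the slice output above.
sample = b">SRR16804994\nNNNNTCTTGTAGATC\nTGTTC\n"
print(count_bases(sample))

# With the notebook's slices, the per-slice counts could be merged like this:
# total = sum((count_bases(s.get()) for s in data_slices), Counter())
```

Because the function works on one chunk of bytes at a time, it could also be mapped over the slices by a parallel executor instead of the sequential `sum` shown in the comment.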