
# Malaria Gene Annotation — Reproducible Workflow

This notebook walks through the annotation workflow end to end: it inspects the inputs, runs the reference script, checks the output, and then discusses dependencies, version pinning, and the FAIR principles.


In [None]:

# Optionally install exact versions (useful on Colab)
# Uncomment if using Colab:
# !pip install -r ../requirements.txt



## 1. Inputs and expectations

We expect two input files:

- `malaria.fna` — multi-FASTA with **tab-delimited** headers, e.g.
```
>1_g	length=1143	scaffold=scaffold00001	strand=-
ATGGATATAA...
```
- `malaria.blastx.tab` — BLASTX table with **gene ID in column 1** and **protein description in column 10**.

```
#queryName	queryLength	firstQueryPos	lastQueryPos	hitName	hitLength	firstHitPos	lastHitPos	frame	hitDescription	numberOfHits	coverage	identity	evalue	score	hspLength	relIdentity
2701_g	1881	1	1806	Q7RHP3	516	1	516	+1	GAF domain protein	10	1.18	256	2e-110	405	608	0.43
```

Rows whose protein description is `null` are excluded from the output.
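The two formats above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the logic of `src/malaria.py`: the function names `parse_header` and `load_descriptions` are hypothetical, the second BLASTX row is an invented `null` example, and keeping the first description seen per gene is an assumption.

```python
import io

def parse_header(line):
    """Split a tab-delimited FASTA header into its gene ID and key=value fields."""
    gene_id, *fields = line.rstrip("\n")[1:].split("\t")
    attrs = dict(field.split("=", 1) for field in fields if "=" in field)
    return gene_id, attrs

def load_descriptions(handle):
    """Map gene ID (column 1) to protein description (column 10), skipping null rows."""
    descriptions = {}
    for line in handle:
        # Skip blank lines and the '#'-prefixed column header.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10 or cols[9] == "null":
            continue
        # Assumption: keep the first description seen for each gene.
        descriptions.setdefault(cols[0], cols[9])
    return descriptions

# First row is the example from this section; the second row is hypothetical.
rows = [
    ["#queryName", "queryLength", "firstQueryPos", "lastQueryPos", "hitName",
     "hitLength", "firstHitPos", "lastHitPos", "frame", "hitDescription"],
    ["2701_g", "1881", "1", "1806", "Q7RHP3", "516", "1", "516", "+1",
     "GAF domain protein"],
    ["9999_g", "300", "1", "300", "null", "0", "0", "0", "+1", "null"],
]
sample_blast = io.StringIO("\n".join("\t".join(row) for row in rows) + "\n")

descriptions = load_descriptions(sample_blast)
header = parse_header(">1_g\tlength=1143\tscaffold=scaffold00001\tstrand=-")
print(descriptions)
print(header)
```

In the notebook, the same function could be applied to the real table with `load_descriptions(open(blast))` once `blast` is defined.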



In [None]:

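from itertools import islice
from pathlib import Path

# Paths are resolved relative to the directory this notebook runs in.
base = Path("..").resolve()
fna = base / "data" / "malaria.fna"
blast = base / "data" / "malaria.blastx.tab"

# Preview only the first few lines instead of reading each file in full.
for path, n_lines in ((fna, 4), (blast, 2)):
    with path.open() as handle:
        print([line.rstrip("\n") for line in islice(handle, n_lines)])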



## 2. Run the reference script

We reuse the provided `src/malaria.py` to generate `output.txt`.


In [None]:

import subprocess
import sys
out_path = (base / "output.txt").as_posix()
cmd = [sys.executable, (base / "src" / "malaria.py").as_posix(), fna.as_posix(), blast.as_posix(), out_path]
print("Running:", " ".join(cmd))
subprocess.check_call(cmd)
print("\nFirst lines of output:")
print("\n".join(Path(out_path).read_text().splitlines()[:4]))



## 3. Simple sanity checks

- All headers must contain a `protein=` field.
- No entry with `null` description should appear.


In [None]:

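from pathlib import Path

total = with_protein = null_hits = 0
for line in Path(out_path).read_text().splitlines():
    if not line.startswith(">"):
        continue
    total += 1
    # Check 1: the header carries a tab-separated protein= field.
    if "\tprotein=" in line:
        with_protein += 1
    # Check 2: no excluded (null) description leaked into the output.
    if line.endswith("\tprotein=null") or "\tprotein=null\t" in line:
        null_hits += 1
ok = total == with_protein and null_hits == 0
print(f"Headers: {total}; with protein field: {with_protein}; null descriptions: {null_hits}; OK={ok}")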



## 4. Visualization

We plot the `length=` value from each output header as a simple bar chart to check the content visually. (No colors are set explicitly.)


In [None]:

import re
import matplotlib.pyplot as plt

lengths = []
with open(out_path) as f:
    for line in f:
        if line.startswith(">"):
            m = re.search(r"length=(\d+)", line)
            if m:
                lengths.append(int(m.group(1)))

plt.figure()
# Numeric x positions keep the axis readable when there are many records.
plt.bar(range(1, len(lengths) + 1), lengths)
plt.title("Sequence lengths in output.txt")
plt.xlabel("Record index")
plt.ylabel("Length (bp)")
plt.show()



## 5. Notes on robustness

The reference script assumes **single-line sequences**. For multi-line FASTA, we would need to buffer sequences until the next header. See below for a robust approach (pseudocode):

```text
for each line:
  if header -> flush previous record if it had a hit; start new record
  else -> append to current sequence buffer
at end of input -> flush the last record if it had a hit
```
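The pseudocode above could be turned into Python roughly as follows. This is a sketch under stated assumptions rather than a drop-in replacement for `src/malaria.py`: `read_fasta` and `annotate` are hypothetical names, and the output header format (original header plus a tab-separated `protein=` field) is inferred from the sanity check in section 3.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs, joining sequences that span several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line.strip())
    # Flush the last record, which has no following header.
    if header is not None:
        yield header, "".join(chunks)

def annotate(handle, descriptions):
    """Yield output lines for records whose gene ID has a description."""
    for header, sequence in read_fasta(handle):
        gene_id = header[1:].split("\t", 1)[0]
        if gene_id in descriptions:
            # Assumed output format: header + tab + protein=<description>.
            yield f"{header}\tprotein={descriptions[gene_id]}"
            yield sequence

# Hypothetical two-record input; the first sequence is wrapped over two lines.
fasta = io.StringIO(">1_g\tlength=8\nATGG\nATAA\n>2_g\tlength=4\nATGC\n")
lines = list(annotate(fasta, {"1_g": "GAF domain protein"}))
print("\n".join(lines))
```

Because `read_fasta` is a generator, only one record is held in memory at a time, which matters for large assemblies.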


## 6. Version pinning

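Let \( p \) be a package and \( v \) a pinned version. The goal of pinning is:
\[
\texttt{install}(p == v) \Rightarrow \text{the same version of } p \text{ on every machine.}
\]
Pinning direct dependencies does not by itself guarantee identical behavior: unpinned transitive dependencies, the Python version, and the platform can still differ. Tradeoffs and risks are detailed in the README.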
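One way to check that the running environment matches the pins is to compare each `name==version` line against the installed distribution, using the standard-library `importlib.metadata`. The sketch below is illustrative: `parse_pins` and `check_pins` are hypothetical helpers, the `matplotlib` pin is an invented example, and the parser ignores extras, environment markers, and any specifier other than `==`.

```python
from importlib.metadata import PackageNotFoundError, version

def parse_pins(lines):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins

def check_pins(pins):
    """Return {package: (pinned, installed)} for every mismatch.

    'installed' is None when the package is not installed at all.
    Versions are compared as plain strings, so '1.0' and '1.0.0' differ.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches

# Hypothetical requirements content, for illustration only.
example = ["matplotlib==3.8.4  # hypothetical pin", "", "# a comment line"]
pins = parse_pins(example)
print("Pins:", pins)
print("Mismatches:", check_pins(pins))
```

In this notebook the input could come from `(base / "requirements.txt").read_text().splitlines()`; an empty mismatch dictionary means every pinned package is installed at the pinned version.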



## 7. FAIR discussion

- **Findable:** clear metadata (README, keywords), standard names, CITATION.
- **Accessible:** open license and plain-text formats.
- **Interoperable:** FASTA/TSV standards; Python 3 environment.
- **Reusable:** pinned environment, runnable example, and tests/sanity checks.
