# 🧠 Part 1: The Plan for unzipping the Train and Test folders and viewing the MPEG-G files
We’ll create a local Jupyter notebook or script that works through the following steps:

0. 🔧 Prerequisites
1. 📤 Extract the Train and Test files from TrainFiles.zip and TestFiles.zip
2. 🐳 Pull the Genie Docker image
3. 🐳 Run the Genie Docker container
4. 📂 Mount the directory containing one `.mgb` file
5. 🔍 Decode that `.mgb` file to `.fastq`
6. 🧬 Load the FASTQ in Python
7. 🏃 Run steps 4 & 5 for all files in TrainFiles and TestFiles

Along the way, we 💡 explain the layers and outputs in plain English for a data scientist unfamiliar with bioinformatics.


## 📓 Let’s Start with a Local Notebook
Here's how your local notebook should look:

### 🔧 Step 0: Prerequisites

1. Install an integrated development environment (IDE) such as VS Code, or the Anaconda distribution
2. Install Docker (if you haven’t already):
👉 https://www.docker.com/products/docker-desktop/
3. Install Python
4. Install Conda
5. Create a virtual environment

Note: depending on your machine, you may be asked to install command-line tools (Xcode, Git, etc.) along the way. They are worth installing, as you will find them useful. Some of the downloads may take more than half an hour, so be patient and follow the instructions.

Instructions for Mac are available [here](https://docs.google.com/document/d/1Rug61nxs6FLqlYh8Qzqgzuv6nq3lyTq45HVB6cqERfE/edit?usp=sharing)  
If you already have an IDE, Python, and Conda, you only need to install Docker:
👉 https://www.docker.com/products/docker-desktop/

Important:

We recommend installing Biopython with conda, as it also installs the required dependencies. The conda package is named `biopython`, while the name you import in Python is `Bio`:

`conda install -c conda-forge biopython`

A better approach is to create a new conda environment for your project and install Biopython in that environment:

`conda create -n mpegg_env`

Activate the environment:

`conda activate mpegg_env`

Now install Biopython in the new environment; you can then `import Bio` in your code:

`conda install -c conda-forge biopython`
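
Once the environment is active, it can help to confirm that the packages this notebook imports are visible to the interpreter. The following is a minimal sketch; `missing_packages` is a hypothetical helper name, and the list of modules is taken from the import cell further down.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# "Bio" is the import name of Biopython; the others are imported later in the notebook.
required = ["Bio", "pandas", "numpy", "matplotlib", "sklearn"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is reported as missing, install it into the active environment before continuing.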



In [1]:
# Test Docker works
!docker run hello-world


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
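
If the `hello-world` test fails, the cause is usually either that the `docker` command is not on your PATH or that Docker Desktop is not running. The sketch below checks both from Python. The helper names are hypothetical, and it assumes that `docker info` exits with a non-zero code when the daemon cannot be reached.

```python
import shutil
import subprocess

def docker_cli_path():
    """Return the path to the docker executable, or None if it is not on PATH."""
    return shutil.which("docker")

def docker_daemon_running(timeout_seconds=30):
    """Return True if `docker info` succeeds (assumed to mean the daemon is reachable)."""
    if docker_cli_path() is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            capture_output=True, text=True, timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0

print("Docker CLI found:", docker_cli_path() is not None)
print("Docker daemon reachable:", docker_daemon_running())
```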



In [3]:
!pip install Bio

Collecting Bio
  Downloading bio-1.8.0-py3-none-any.whl.metadata (5.7 kB)
Collecting gprofiler-official (from Bio)
  Downloading gprofiler_official-1.0.0-py3-none-any.whl.metadata (11 kB)
Collecting mygene (from Bio)
  Downloading mygene-3.2.2-py2.py3-none-any.whl.metadata (10 kB)
Collecting pandas (from Bio)
  Downloading pandas-2.3.1-cp310-cp310-win_amd64.whl.metadata (19 kB)
Collecting pooch (from Bio)
  Downloading pooch-1.8.2-py3-none-any.whl.metadata (10 kB)
Collecting requests (from Bio)
  Downloading requests-2.32.4-py3-none-any.whl.metadata (4.9 kB)
Collecting tqdm (from Bio)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting biothings-client>=0.2.6 (from mygene->Bio)
  Downloading biothings_client-0.4.1-py3-none-any.whl.metadata (10 kB)
Collecting httpx>=0.22.0 (from biothings-client>=0.2.6->mygene->Bio)
  Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting anyio (from httpx>=0.22.0->biothings-client>=0.2.6->mygene->Bio)
  Downloading 

In [None]:
# Import packages
import os
import zipfile
import subprocess
import random
import pandas as pd
import numpy as np
#%pip install matplotlib
import matplotlib.pyplot as plt
import shutil


from Bio import SeqIO
from collections import Counter

#%pip install scikit-learn

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

## 📤 Step 1: Extract the Train and Test files from the .zip archives

In [1]:
import os  # imported here as well, in case this cell runs before the import cell

print(os.listdir('.'))

In [9]:
# Define your zip files and corresponding output directories
zip_targets = {
    'TrainFiles.zip': './',
    'TestFiles.zip': './'
}

for zip_path, extract_to in zip_targets.items():
    # Create the output directory if it doesn't exist
    os.makedirs(extract_to, exist_ok=True)

    # Extract zip content
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
        print(f"✅ Extracted {zip_path} to {extract_to}")

✅ Extracted TrainFiles.zip to ./
✅ Extracted TestFiles.zip to ./
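
Before (or after) extracting, it can be useful to see what an archive contains without unpacking it. This is a minimal sketch using only the standard library; `count_extensions` is a hypothetical helper, and the expectation that the archives mostly hold `.mgb` files is an assumption based on the rest of this notebook.

```python
import os
import zipfile
from collections import Counter

def count_extensions(zip_path):
    """Count archive members by file extension without extracting anything."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # Directory entries end with "/" and are skipped.
        names = [n for n in zip_ref.namelist() if not n.endswith("/")]
    return Counter(os.path.splitext(n)[1].lower() for n in names)

for zip_path in ("TrainFiles.zip", "TestFiles.zip"):
    if os.path.exists(zip_path):
        print(zip_path, dict(count_extensions(zip_path)))
    else:
        print(f"{zip_path} not found in {os.getcwd()}")
```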


## 🐳 Step 2: Pull the Genie Docker Image

In [4]:
# Ensure you have the latest version of Genie
!docker pull muefab/genie:latest

latest: Pulling from muefab/genie
Digest: sha256:c3112a3879cc18061bbab5ed8f76dec255ab1be46e2133cd59320dd5ba98ef89
Status: Image is up to date for muefab/genie:latest
docker.io/muefab/genie:latest


In [None]:
# Generic form of the command (a template, not runnable as written):
# replace <dir> with a host directory and <cmd> with a Genie subcommand.
# !docker run --rm -v <dir>:/work muefab/genie:latest <cmd> <dir>

## 🏃 Steps 4 - 6 for one MPEG-G file

### 📂 Step 4: Mount the directory containing one `.mgb` file

In [5]:
notebook_dir = os.getcwd()

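# Pick one `.mgb` file (this example uses a file from TestFiles)
mgb_filename = "ID_ZZWUCJ.mgb"
mgb_filename_no_mgb = os.path.splitext(mgb_filename)[0]  # strip the ".mgb" extension
# NOTE: despite its name, train_dir points at TestFiles, where this file lives
train_dir = os.path.join(notebook_dir, "TestFiles")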
mgb_file_path = os.path.join(train_dir, mgb_filename)

# Output location for decoded FASTQ
output_fastq = f"{mgb_filename_no_mgb}.fastq"

# Docker mount paths
host_dir = train_dir                # Local directory with the `.mgb` file
container_dir = "/data"             # Directory inside the container

# Show paths
print(f"📁 Host path to `.mgb`: {mgb_file_path}")
print(f"📁 Host directory mounted: {host_dir}")
print(f"📦 Container directory will be: {container_dir}")
print(f"📄 Output FASTQ: {output_fastq}")

📁 Host path to `.mgb`: c:\Users\Reinhard\Documents\computer_vision_projects\microbiome_classification_challenge\TestFiles\ID_ZZWUCJ.mgb
📁 Host directory mounted: c:\Users\Reinhard\Documents\computer_vision_projects\microbiome_classification_challenge\TestFiles
📦 Container directory will be: /data
📄 Output FASTQ: ID_ZZWUCJ.fastq
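
The key idea of the mount is that Docker maps a host directory to a directory inside the container, so every path handed to Genie must be written as the container sees it. The sketch below builds the argument list for one file. `build_genie_decode_command` is a hypothetical helper, the host path is a placeholder, and the flags are the ones used elsewhere in this notebook.

```python
import os

def build_genie_decode_command(host_dir, mgb_filename, container_dir="/data"):
    """Build the `docker run` argument list that decodes one .mgb file to FASTQ."""
    stem = os.path.splitext(mgb_filename)[0]
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",      # host directory -> container directory
        "muefab/genie:latest", "run",
        "-f",
        "-i", f"{container_dir}/{mgb_filename}",  # input path inside the container
        "-o", f"{container_dir}/{stem}.fastq",    # output path inside the container
    ]

# "/path/to/TestFiles" is a placeholder; substitute your own host directory.
command = build_genie_decode_command("/path/to/TestFiles", "ID_ZZWUCJ.mgb")
print(" ".join(command))
```

Because the output path lies inside the mounted directory, the decoded FASTQ file appears in the host directory once the container exits.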


### 🔍 Step 5: Decode that `.mgb` file to `.fastq`

In [6]:
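def inspect_mgb_structure(host_dir=".", container_dir="/work", mgb_filename=mgb_filename):
    """Decode one `.mgb` file to FASTQ with Genie and print Genie's log.

    host_dir (the notebook directory by default) is mounted at container_dir,
    so every path passed to Genie is a path inside the container.
    """
    stem = os.path.splitext(mgb_filename)[0]  # file name without ".mgb"
    command = [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",
        "muefab/genie:latest", "run",  # Genie's `run` subcommand
        "-f",  # force: overwrite the output file if it already exists
        "-i", f"{container_dir}/TestFiles/{mgb_filename}",
        "-o", f"{container_dir}/TestFiles/{stem}.fastq"
    ]
    print("Running:", " ".join(command))
    result = subprocess.run(command, capture_output=True, text=True)
    print("\n--- STDOUT ---\n")
    print(result.stdout)
    if result.stderr:
        print("\n--- STDERR ---\n")
        print(result.stderr)
    if result.returncode != 0:
        print(f"❌ Genie exited with code {result.returncode}")

inspect_mgb_structure()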

Running: docker run --rm -v .:/work muefab/genie:latest run -f -i /work/TestFiles/ID_ZZWUCJ.mgb -o /work/TestFiles/ID_ZZWUCJ.fastq

--- STDOUT ---

[INFO,      0.000s, App]:    ______           _
[INFO,      0.001s, App]:   / ____/__  ____  (_)__
[INFO,      0.001s, App]:  / / __/ _ \/ __ \/ / _ \
[INFO,      0.001s, App]: / /_/ /  __/ / / / /  __/
[INFO,      0.001s, App]: \____/\___/_/ /_/_/\___/
[INFO,      0.001s, App]: Command: /usr/local/bin/genie run -f -i /work/TestFiles/ID_ZZWUCJ.mgb -o /work/TestFiles/ID_ZZWUCJ.fastq 
[INFO,      0.026s, App/Run]: Input file 1: /work/TestFiles/ID_ZZWUCJ.mgb with size 5.43MiB
[INFO,      0.063s, App/Run]: Working directory: /work/TestFiles with 11.7GiB available
[INFO,      0.095s, App/Run]: Output file: /work/TestFiles/ID_ZZWUCJ.fastq with 11.7GiB available
[INFO,      0.095s, App/Run]: Threads: 4 with 4 supported
[INFO,      0.106s, Spring]: Temporary directory: /work/TestFiles/tmp.x7KUBpRZFm/
[INFO,      0.116s, Spring]: Temporary directory
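
A long log does not by itself tell you whether decoding succeeded; the process exit code does. The following sketch wraps `subprocess.run` so the caller gets an explicit success flag. `run_and_check` is a hypothetical helper, and the demonstration uses a harmless Python one-liner in place of the Docker command.

```python
import subprocess
import sys

def run_and_check(command):
    """Run a command and return (succeeded, stdout, stderr)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout, result.stderr

# Demonstration with a harmless command; substitute the docker argument list.
ok, out, err = run_and_check([sys.executable, "-c", "print('decoded')"])
print("Succeeded:", ok)
print("Output:", out.strip())
```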

### 🧬 Step 6: Load FASTQ in Python

In [7]:
# Build the path with os.path.join so it works on Windows, macOS, and Linux
# (train_dir is already an absolute path)
fastq_path = os.path.join(train_dir, f"{mgb_filename_no_mgb}.fastq")

# Check if the file exists before parsing
if not os.path.exists(fastq_path):
    print(f"❌ FASTQ file not found at: {fastq_path}")
else:
    total_reads = 0
    read_lengths = []
    quality_scores = []

    for record in SeqIO.parse(fastq_path, "fastq"):
        total_reads += 1
        read_lengths.append(len(record.seq))
        quality_scores.extend(record.letter_annotations["phred_quality"])

    print(f"🔍 Total reads: {total_reads}")
    print(f"📏 Avg read length: {sum(read_lengths)/len(read_lengths):.1f} bp")
    print(f"🎯 Avg quality score: {sum(quality_scores)/len(quality_scores):.1f}")


🔍 Total reads: 163440
📏 Avg read length: 124.5 bp
🎯 Avg quality score: 33.3
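
For readers new to sequencing data: a Phred quality score Q encodes the estimated probability that a base call is wrong as P = 10^(-Q/10). The short sketch below converts a few scores, including the average reported above. Note that averaging Q scores is not the same as averaging error probabilities, so treat the last line as a rough indication only.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q to the error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

for q in (10, 20, 30, 33.3):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.5f}, accuracy {100 * (1 - p):.3f}%")
```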


In [8]:
print("🧪 First 3 reads:\n")
for i, record in enumerate(SeqIO.parse(fastq_path, "fastq")):
    print(f"🔹 ID: {record.id}")
    print(f"🔹 SEQ: {record.seq[:50]}...")  # just preview first 50 bp
    print(f"🔹 QUALITY: {record.letter_annotations['phred_quality'][:10]}...\n")
    if i >= 2:
        break


🧪 First 3 reads:

🔹 ID: NB501656:452:H3WJWAFXY:2:11306:21744:4603
🔹 SEQ: ATACGTAAGGACCGAGCGTTGTCCGGAATCATTGGGCGTAAAGGGTACGT...
🔹 QUALITY: [36, 36, 36, 36, 36, 36, 36, 36, 36, 36]...

🔹 ID: NB501656:452:H3WJWAFXY:2:11306:21744:4603
🔹 SEQ: CCTGTTTGCTACCCACGCTTTCGTACCTCAGCGTCAGATAATGGCCAGAA...
🔹 QUALITY: [36, 36, 36, 36, 36, 36, 36, 36, 36, 36]...

🔹 ID: NB501656:452:H3WJWAFXY:1:11302:10166:2744
🔹 SEQ: ATACGTAAGGACCGAGCGTTGTCCGGAATCATTGGGCGTAAAGGGTACGT...
🔹 QUALITY: [36, 36, 36, 36, 36, 36, 36, 36, 36, 36]...
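
To see what `SeqIO.parse` is reading, here is a minimal pure-Python sketch of the FASTQ layout: each read occupies four lines (an `@` header, the sequence, a `+` separator, and a quality string with one character per base). The sketch assumes unwrapped four-line records and Phred+33 encoding, and the two example reads are made up for illustration. It is not a replacement for Biopython's parser.

```python
def parse_fastq_lines(lines):
    """Yield (read_id, sequence, qualities) from four-line FASTQ records.

    Assumes unwrapped records and Phred+33 quality encoding.
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        read_id = header[1:].split()[0]               # drop "@", keep the first token
        qualities = [ord(ch) - 33 for ch in quality]  # Phred+33: "!" is 0, "I" is 40
        yield read_id, sequence, qualities

# Made-up example records, not taken from the challenge data.
example = """@read_1
ACGT
+
IIII
@read_2
TTGA
+
!5?I
"""
for read_id, sequence, qualities in parse_fastq_lines(example.splitlines()):
    print(read_id, sequence, qualities)
```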



### 🧠 What MPEG-G Did (Plain English)

Your `.mgb` file used the MPEG-G standard to store sequencing data efficiently. Here's what happened under the hood:

- **Access Units (AUs)**: Think of these as independent blocks, like packets or video frames. Each AU can be decoded without needing the entire file.
  
- **Descriptor Streams**:
  - `SEQUENCE`: These are the DNA letters (A, T, C, G...).
  - `QUALITY`: Confidence for each base (used to assess sequencing accuracy).
  - `READ_IDENTIFIER`: Name or ID of each read.

- **Compression Techniques**:
  - Redundancies in the reads and IDs were removed.
  - Quality scores may have been quantized or entropy-coded.
  - Optional reference-based compression could align reads to a known genome and store only differences.

- **Output Format (`.fastq`)**:
  - This format is standard in genomics: it includes the ID, DNA sequence, and quality scores for each read.

MPEG-G is to genomics what `.mp4` is to video: a way to store large data efficiently without losing critical information.
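
To make the idea of quantized quality scores concrete, the sketch below maps each score to a representative value for its bin, which reduces the number of distinct symbols that have to be stored. The bin edges and representative values are invented for illustration; this is not the scheme that Genie or the MPEG-G standard actually uses.

```python
def quantize_quality(scores, bin_edges=(10, 20, 30), bin_values=(5, 15, 25, 35)):
    """Map each score to a representative value for its bin (a lossy step)."""
    quantized = []
    for score in scores:
        index = sum(score >= edge for edge in bin_edges)  # number of edges reached
        quantized.append(bin_values[index])
    return quantized

# Hypothetical quality scores for one short read.
original = [36, 36, 34, 28, 14, 2, 36, 31]
reduced = quantize_quality(original)
print("Original: ", original)
print("Quantized:", reduced)
print("Distinct values:", len(set(original)), "->", len(set(reduced)))
```

Fewer distinct values compress better, at the cost of losing the exact original scores.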

### 🏃 Step 7: Run steps 4 & 5 for all files in TrainFiles and TestFiles

In [10]:
# Set base directories
notebook_dir = os.getcwd()
container_dir = "/data"  # This is the container's path

def decode_all_mgb_in_folder(folder_name):
    host_dir = os.path.join(notebook_dir, folder_name)
    for mgb_filename in os.listdir(host_dir):
        if not mgb_filename.endswith(".mgb"):
            continue

        mgb_filename_no_ext = os.path.splitext(mgb_filename)[0]
        print(f"\n🔄 Decoding: {mgb_filename}")

        command = [
            "docker", "run", "--rm",
            "-v", f"{host_dir}:{container_dir}",
            "muefab/genie:latest", "run",
            "-f",
            "-i", f"{container_dir}/{mgb_filename}",
            "-o", f"{container_dir}/{mgb_filename_no_ext}.fastq"
        ]

        print("Running:", " ".join(command))
        result = subprocess.run(command, capture_output=True, text=True)

        # Print Genie's log only when decoding fails: printing the full
        # stdout for every file makes the notebook output very large.
        if result.returncode != 0:
            print(f"❌ Decoding failed for {mgb_filename}")
            print(result.stderr)

## 📢 N.B. Split your Train and Test files into smaller subsets

Decoding the `.mgb` files is time- and compute-intensive, so we recommend splitting the train and test files into smaller, bite-sized batches.
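
One simple way to do this is to split the list of file names into fixed-size batches and decode one batch per session. The sketch below shows the splitting step only. `split_into_batches` is a hypothetical helper and the batch size of 3 is arbitrary.

```python
def split_into_batches(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

file_names = ["ID_AAFNOT.mgb", "ID_AAXPTO.mgb", "ID_AAYKAN.mgb", "ID_ABEZNS.mgb"]
for number, batch in enumerate(split_into_batches(file_names, 3), start=1):
    print(f"Batch {number}: {batch}")
```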

In [None]:
decode_all_mgb_in_folder("TrainFiles")
decode_all_mgb_in_folder("TestFiles")


🔄 Decoding: ID_AAFNOT.mgb
Running: docker run --rm -v c:\Users\Reinhard\Documents\computer_vision_projects\microbiome_classification_challenge\TrainFiles:/data muefab/genie:latest run -f -i /data/ID_AAFNOT.mgb -o /data/ID_AAFNOT.fastq

🔄 Decoding: ID_AAXPTO.mgb
Running: docker run --rm -v c:\Users\Reinhard\Documents\computer_vision_projects\microbiome_classification_challenge\TrainFiles:/data muefab/genie:latest run -f -i /data/ID_AAXPTO.mgb -o /data/ID_AAXPTO.fastq

🔄 Decoding: ID_AAYKAN.mgb
Running: docker run --rm -v c:\Users\Reinhard\Documents\computer_vision_projects\microbiome_classification_challenge\TrainFiles:/data muefab/genie:latest run -f -i /data/ID_AAYKAN.mgb -o /data/ID_AAYKAN.fastq

🔄 Decoding: ID_ABEZNS.mgb
Running: docker run --rm -v c:\Users\Reinhard\Documents\computer_vision_projects\microbiome_classification_challenge\TrainFiles:/data muefab/genie:latest run -f -i /data/ID_ABEZNS.mgb -o /data/ID_ABEZNS.fastq
