# Drosophila melanogaster Embryogenesis Dataset (GEO: GSE121160) 
   * **Description**: This dataset contains time-course gene expression data from whole *Drosophila melanogaster* embryos, covering 14 time points across the first 20 hours of development (0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, and 20 hours). Each time point was measured in biological quadruplicate, giving 56 RNA sequencing (RNA-Seq) samples that cover thousands of genes, including transcription factors and their targets. The study pairs the transcriptome with proteome measurements (ProteomeXchange PXD005713), but the transcriptome portion alone captures gene expression dynamics over time.  
   * **Why It Fits**: It tracks gene expression across multiple developmental stages. Of the 7,640 genes analyzed, 791 are classified as regulators (e.g., transcription factors and genes involved in replication or transcription), so you can study both transcription factor activity and downstream gene expression changes over time.  
   * **Access**: Available at NCBI GEO under accession number GSE121160 ([https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121160](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121160)).

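Because the sampling grid is uneven (hourly up to 6 h, then every 2 h), it can help to write the design down explicitly before any time-series analysis. The following is a minimal sketch, assuming the 14 time points listed above and the four biological replicates per time point reported in the GEO `overall_design` field; the variable names are illustrative only.

```python
# Time points in hours of development, as listed in the dataset description
time_points_h = [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20]

# Biological replicates per time point (from the GEO overall_design field)
replicates_per_time_point = 4

# Gaps between consecutive time points; a non-constant gap means the
# series is unevenly sampled
intervals_h = [later - earlier for earlier, later in zip(time_points_h, time_points_h[1:])]

# Total number of samples the series should contain
n_samples = len(time_points_h) * replicates_per_time_point

print(f"{len(time_points_h)} time points, {n_samples} samples expected")
print(f"Distinct sampling intervals (h): {sorted(set(intervals_h))}")
```

Methods that assume equally spaced observations (for example, simple lagged correlations) would need interpolation or explicit handling of the 1 h and 2 h intervals.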


In [None]:
import GEOparse
from pprint import pprint

gse_id = "GSE121160"
gse = GEOparse.get_GEO(geo=gse_id, destdir="../"+ gse_id + "_data")

# Show metadata
pprint(gse.metadata)

23-Mar-2025 20:12:50 DEBUG utils - Directory ../GSE121160_data already exists. Skipping.
23-Mar-2025 20:12:50 INFO GEOparse - File already exist: using local version.
23-Mar-2025 20:12:50 INFO GEOparse - Parsing ../GSE121160_data\GSE121160_family.soft.gz: 
23-Mar-2025 20:12:50 DEBUG GEOparse - DATABASE: GeoMiame
23-Mar-2025 20:12:50 DEBUG GEOparse - SERIES: GSE121160
23-Mar-2025 20:12:50 DEBUG GEOparse - PLATFORM: GPL17275
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427100
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427101
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427102
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427103
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427104
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427105
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427106
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427107
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427108
23-Mar-2025 20:12:50 DEBUG GEOparse - SAMPLE: GSM3427109
23-

{'contact_address': ['Ackermannweg 4'],
 'contact_city': ['Mainz'],
 'contact_country': ['Germany'],
 'contact_institute': ['Institute of Molecular Biology, Mainz'],
 'contact_name': ['Sergi,,Sayols'],
 'contact_zip/postal_code': ['55128'],
 'contributor': ['Falk,,Butter', 'Kolja,,Becker', 'Stefan,,Legewie'],
 'geo_accession': ['GSE121160'],
 'last_update_date': ['Mar 25 2019'],
 'overall_design': ['Whole embryos of Drosophila melanogaster measured at 14 '
                    'time points during the first 20h of development (0h, 1h, '
                    '2h, 3h, 4h, 5h, 6h, 8h, 10h, 12h, 14h, 16h, 18h, 20h). '
                    'Each sample was measured in biological quadruplicates. '
                    'RNAseq samples correspond to proteome measurements '
                    'deposited in ProteomeXchange as PXD005713.'],
 'platform_id': ['GPL17275'],
 'platform_taxid': ['7227'],
 'pubmed_id': ['30478415'],
 'relation': ['SubSeries of: GSE121167',
              'BioProject: https:/

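The metadata printed above stores every value as a list, even single values, and person names appear as comma-separated fields with an empty middle part (for example `'Falk,,Butter'`). A minimal sketch of two helpers for reading such a dictionary follows; `first_value` and `format_geo_name` are hypothetical names, not part of GEOparse, and the example dictionary is a small excerpt copied from the output above.

```python
def first_value(metadata, key, default=None):
    """Return the first entry of a list-valued metadata field, or a default."""
    values = metadata.get(key, [])
    return values[0] if values else default


def format_geo_name(raw_name):
    """Join the non-empty comma-separated parts of a name with spaces."""
    return " ".join(part.strip() for part in raw_name.split(",") if part.strip())


# Excerpt of the series metadata shown above (not the full dictionary)
example_metadata = {
    "contact_name": ["Sergi,,Sayols"],
    "contributor": ["Falk,,Butter", "Kolja,,Becker", "Stefan,,Legewie"],
    "pubmed_id": ["30478415"],
}

contact = format_geo_name(first_value(example_metadata, "contact_name"))
contributors = [format_geo_name(name) for name in example_metadata["contributor"]]

print("Contact:", contact)
print("Contributors:", contributors)
print("PubMed ID:", first_value(example_metadata, "pubmed_id"))
```

With the real object, `example_metadata` would be replaced by `gse.metadata`.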
In [10]:
# Show sample names
pprint(gse.gsms.keys())

dict_keys(['GSM3427100', 'GSM3427101', 'GSM3427102', 'GSM3427103', 'GSM3427104', 'GSM3427105', 'GSM3427106', 'GSM3427107', 'GSM3427108', 'GSM3427109', 'GSM3427110', 'GSM3427111', 'GSM3427112', 'GSM3427113', 'GSM3427114', 'GSM3427115', 'GSM3427116', 'GSM3427117', 'GSM3427118', 'GSM3427119', 'GSM3427120', 'GSM3427121', 'GSM3427122', 'GSM3427123', 'GSM3427124', 'GSM3427125', 'GSM3427126', 'GSM3427127', 'GSM3427128', 'GSM3427129', 'GSM3427130', 'GSM3427131', 'GSM3427132', 'GSM3427133', 'GSM3427134', 'GSM3427135', 'GSM3427136', 'GSM3427137', 'GSM3427138', 'GSM3427139', 'GSM3427140', 'GSM3427141', 'GSM3427142', 'GSM3427143', 'GSM3427144', 'GSM3427145', 'GSM3427146', 'GSM3427147', 'GSM3427148', 'GSM3427149', 'GSM3427150', 'GSM3427151', 'GSM3427152', 'GSM3427153', 'GSM3427154', 'GSM3427155'])

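The 56 accessions above correspond to 14 time points with four replicates each. The supplementary file names listed later in this notebook (for example `GSM3427100_00h_1.tsv.gz`) appear to encode the accession, the time point in hours, and the replicate number. A minimal sketch of parsing that pattern follows, assuming all files use the same naming scheme; `parse_sample_filename` is a hypothetical helper, and the example names are taken from the file listing in this notebook.

```python
import re

# Assumed naming scheme: <GSM accession>_<hours>h_<replicate>.tsv[.gz]
FILENAME_PATTERN = re.compile(r"^(GSM\d+)_(\d+)h_(\d+)\.tsv(?:\.gz)?$")


def parse_sample_filename(filename):
    """Split a supplementary file name into accession, time point, and replicate."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    accession, hours, replicate = match.groups()
    return {"accession": accession, "time_h": int(hours), "replicate": int(replicate)}


# File names copied from the extracted directory listing
example_files = [
    "GSM3427100_00h_1.tsv.gz",
    "GSM3427107_01h_4.tsv.gz",
    "GSM3427128_08h_1.tsv.gz",
]

parsed = [parse_sample_filename(name) for name in example_files]
for row in parsed:
    print(row)
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to build a sample sheet for grouping replicates by time point.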

In [2]:
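import GEOparse
import tarfile
import os
import gzip
import shutil
import pandas as pd

# Step 1: Download and parse the SOFT family file (series and sample metadata).
# Note: the SOFT file is saved to ./GSE121160_data, while Step 2 reads the
# GSE121160_RAW.tar archive from ../GSE121160_data and assumes it is already there.
gse = GEOparse.get_GEO(geo="GSE121160", destdir="./GSE121160_data", include_data=True, silent=False)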

# Step 2: Extract .tar file
tar_file = "../GSE121160_data/GSE121160_RAW.tar"
extract_dir = "../GSE121160_data/extracted"

# Check if .tar file exists
if os.path.exists(tar_file):
    try:
        with tarfile.open(tar_file, "r") as tar:
            tar.extractall(path=extract_dir)
        print(f"Extracted .tar to {extract_dir}")
    except Exception as e:
        print(f"Error extracting {tar_file}: {e}")
        raise  # re-raise instead of exit(), which would stop the notebook kernel
else:
    print(f"No .tar file found at {tar_file}. Listing files in ./GSE121160_data:")
    print(os.listdir("./GSE121160_data"))
    extract_dir = "./GSE121160_data"  # Fallback to base directory if no .tar

# Step 3: Ensure extract_dir exists
if not os.path.exists(extract_dir):
    print(f"Directory {extract_dir} does not exist. Creating it.")
    os.makedirs(extract_dir)
else:
    print(f"Using directory: {extract_dir}")
    print(f"Files in {extract_dir}: {os.listdir(extract_dir)}")

# Step 4: Decompress .tsv.gz files
tsv_gz_found = False
for file in os.listdir(extract_dir):
    if file.endswith(".tsv.gz"):
        tsv_gz_found = True
        gz_file = os.path.join(extract_dir, file)
        # Strip only the trailing ".gz"; str.replace would also remove ".gz" elsewhere in the name
        tsv_file = os.path.join(extract_dir, file[: -len(".gz")])
        try:
            with gzip.open(gz_file, "rb") as f_in:
                with open(tsv_file, "wb") as f_out:
                    shutil.copyfileobj(f_in, f_out)
            print(f"Decompressed {file} to {tsv_file}")
        except Exception as e:
            print(f"Error decompressing {gz_file}: {e}")

if not tsv_gz_found:
    print(f"No .tsv.gz files found in {extract_dir}")

# Step 5: Read TSV files
tsv_files = [f for f in os.listdir(extract_dir) if f.endswith(".tsv")]
for tsv_file in tsv_files:
    file_path = os.path.join(extract_dir, tsv_file)
    print(f"\nProcessing {tsv_file}")
    
    # Load TSV into DataFrame
    try:
        df = pd.read_csv(file_path, sep="\t")
        print(f"Shape: {df.shape}")
        print("Columns:", df.columns.tolist())
        print("First few rows:")
        print(df.head())
    except Exception as e:
        print(f"Error reading {tsv_file}: {e}")

# Step 6: Link to samples
print("\nMatching TSV files to samples:")
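for gsm_name, gsm in gse.gsms.items():
    # The key may carry a numeric suffix (e.g. supplementary_file_1), so match on the prefix
    sup_files = [url for key, urls in gsm.metadata.items()
                 if key.startswith("supplementary_file") for url in urls]
    for sup_file in sup_files:
        name = sup_file.split("/")[-1]
        sup_file_name = name[:-3] if name.endswith(".gz") else name
        if sup_file_name in tsv_files:
            # Metadata values are lists, so take the first title entry
            print(f"Sample {gsm_name} ({gsm.metadata['title'][0]}) -> {sup_file_name}")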

23-Mar-2025 18:02:41 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE121nnn/GSE121160/soft/GSE121160_family.soft.gz to ./GSE121160_data\GSE121160_family.soft.gz
100%|███████████████████████████████████████████████████████████████████| 5.05k/5.05k [00:00<00:00, 35.9kB/s]
23-Mar-2025 18:02:42 DEBUG downloader - Size validation passed
23-Mar-2025 18:02:42 DEBUG downloader - Moving C:\Users\rrtuc\AppData\Local\Temp\tmp7n9pgzam to C:\Users\rrtuc\Desktop\backed-up\python-projects\gene_causal_mapper\jupyter_notebooks\GSE121160_data\GSE121160_family.soft.gz
23-Mar-2025 18:02:42 DEBUG downloader - Successfully downloaded ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE121nnn/GSE121160/soft/GSE121160_family.soft.gz
23-Mar-2025 18:02:42 INFO GEOparse - Parsing ./GSE121160_data\GSE121160_family.soft.gz: 
23-Mar-2025 18:02:42 DEBUG GEOparse - DATABASE: GeoMiame
23-Mar-2025 18:02:42 DEBUG GEOparse - SERIES: GSE121160
23-Mar-2025 18:02:42 DEBUG GEOparse - PLATFORM: GPL17275
23-Mar-2025 

Extracted .tar to ../GSE121160_data/extracted
Using directory: ../GSE121160_data/extracted
Files in ../GSE121160_data/extracted: ['GSM3427100_00h_1.tsv.gz', 'GSM3427101_00h_2.tsv.gz', 'GSM3427102_00h_3.tsv.gz', 'GSM3427103_00h_4.tsv.gz', 'GSM3427104_01h_1.tsv.gz', 'GSM3427105_01h_2.tsv.gz', 'GSM3427106_01h_3.tsv.gz', 'GSM3427107_01h_4.tsv.gz', 'GSM3427108_02h_1.tsv.gz', 'GSM3427109_02h_2.tsv.gz', 'GSM3427110_02h_3.tsv.gz', 'GSM3427111_02h_4.tsv.gz', 'GSM3427112_03h_1.tsv.gz', 'GSM3427113_03h_2.tsv.gz', 'GSM3427114_03h_3.tsv.gz', 'GSM3427115_03h_4.tsv.gz', 'GSM3427116_04h_1.tsv.gz', 'GSM3427117_04h_2.tsv.gz', 'GSM3427118_04h_3.tsv.gz', 'GSM3427119_04h_4.tsv.gz', 'GSM3427120_05h_1.tsv.gz', 'GSM3427121_05h_2.tsv.gz', 'GSM3427122_05h_3.tsv.gz', 'GSM3427123_05h_4.tsv.gz', 'GSM3427124_06h_1.tsv.gz', 'GSM3427125_06h_2.tsv.gz', 'GSM3427126_06h_3.tsv.gz', 'GSM3427127_06h_4.tsv.gz', 'GSM3427128_08h_1.tsv.gz', 'GSM3427129_08h_2.tsv.gz', 'GSM3427130_08h_3.tsv.gz', 'GSM3427131_08h_4.tsv.gz', 'GSM34