rachadele/extract-geo-metadata

GEO Metadata Extraction and HCA Spreadsheet Generation

Spreadsheet Generator

Description

The Python script generate-spreadsheet.py generates an HCA template sheet from the TSV produced by extract-geo-metadata.py. It runs that script automatically, so you do not need to run both. However, the TSV produced by extract-geo-metadata.py contains important information that varies between datasets and cannot be imported programmatically into the HCA template sheet. When wrangling, it is useful to look at both the initial TSV and the HCA Excel spreadsheet.
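One way to review the TSV while wrangling is to separate the columns that are identical for every sample from those that differ. The sketch below is a hypothetical helper, not part of this repository; it assumes only that the TSV has a header row and one row per sample, and the column names in the example are made up.

```python
import csv
import io


def summarise_columns(handle):
    """Split TSV columns into those constant across samples and those that vary."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    columns = reader.fieldnames or []
    constant, varying = {}, {}
    for column in columns:
        # Collect the distinct values seen in this column (missing cells count as "").
        values = sorted({row[column] or "" for row in rows})
        if len(values) <= 1:
            constant[column] = values[0] if values else ""
        else:
            varying[column] = values
    return constant, varying


# Stand-in for a real file; the column names here are illustrative only.
example_tsv = io.StringIO(
    "sample\torganism\ttissue\n"
    "GSM1\tHomo sapiens\tliver\n"
    "GSM2\tHomo sapiens\tlung\n"
)
constant, varying = summarise_columns(example_tsv)
print("constant:", constant)
print("varying:", varying)
```

For a real run, pass an open file instead of the in-memory example, for instance `open("GSE123456.tsv", newline="", encoding="utf-8")`.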

Prerequisites

Before using this script, ensure that you have the following prerequisites installed:

  • Python 3.x
  • Required Python packages (installable via pip):
    • BeautifulSoup (beautifulsoup4)
    • pandas (pandas)
    • openpyxl (openpyxl)

Usage

  1. Clone this repository to your local machine:
    git clone https://github.com/rachadele/extract-geo-metadata.git
    cd extract-geo-metadata
  2. Install the required Python libraries:
    pip install -r requirements.txt
  3. Run the script using the following command, replacing GSE123456 with the GSE accession of the project you want to process:
    python generate-spreadsheet.py GSE123456

This generates an HCA template sheet with minimal metadata filled in for the given GEO accession (biosamples, organism, analysis files, etc.). The script also writes a TSV (e.g. GSE123456.tsv) that contains more specific project metadata. To generate

GEO Metadata Extractor

Overview

The GEO Metadata Extractor is a Python script designed to simplify the process of downloading and extracting metadata from MINiML files for Gene Expression Omnibus (GEO) datasets. This script is particularly useful for researchers and data scientists working with GEO datasets who want to extract metadata for further analysis.

Features

  • Downloads MINiML files for a given GEO accession (GSE) from the GEO FTP server.
  • Parses the MINiML file to extract metadata for all samples.
  • Allows filtering of samples by organism, for example to include only "Homo sapiens" (human) data, using the --organism flag.
  • Outputs the extracted metadata in a tab-separated values (TSV) file for easy analysis.
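The download and parsing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in extract-geo-metadata.py: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup to stay dependency-free, it assumes the usual GEO directory layout and MINiML namespace, and the function names and output keys are hypothetical. The example parses a small inline document rather than fetching anything from the network.

```python
import xml.etree.ElementTree as ET

# Assumed MINiML namespace; confirm it against the root element of a real file.
NS = {"m": "http://www.ncbi.nlm.nih.gov/geo/info/MINiML"}


def miniml_url(gse):
    """Build the (assumed) download URL of the MINiML archive for a series."""
    digits = gse[3:]
    # GEO groups series into directories such as GSE123nnn; short IDs use GSEnnn.
    stub = "GSE" + (digits[:-3] if len(digits) > 3 else "") + "nnn"
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{gse}/miniml/{gse}_family.xml.tgz"
    )


def parse_samples(xml_text):
    """Return one dict per <Sample> element in a MINiML document."""
    root = ET.fromstring(xml_text)
    rows = []
    for sample in root.findall("m:Sample", NS):
        platform = sample.find("m:Platform-Ref", NS)
        row = {
            "sample": sample.get("iid"),
            "title": (sample.findtext("m:Title", "", NS) or "").strip(),
            "organism": (sample.findtext(".//m:Organism", "", NS) or "").strip(),
            "platform": platform.get("ref") if platform is not None else "",
        }
        # Each <Characteristics tag="..."> becomes its own column.
        for item in sample.iterfind(".//m:Characteristics", NS):
            row[item.get("tag", "characteristics")] = (item.text or "").strip()
        rows.append(row)
    return rows


# Hand-written stand-in for a real MINiML file.
EXAMPLE = """<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML">
  <Sample iid="GSM0000001">
    <Title>liver donor 1</Title>
    <Platform-Ref ref="GPL12345" />
    <Channel position="1">
      <Organism taxid="9606">Homo sapiens</Organism>
      <Characteristics tag="tissue">liver</Characteristics>
    </Channel>
  </Sample>
</MINiML>"""

if __name__ == "__main__":
    print(miniml_url("GSE123456"))
    print(parse_samples(EXAMPLE))
```

Flattening each Characteristics tag into its own key is one reason the resulting table can differ between datasets: the set of tags is chosen by the submitter.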

Usage

  1. Run the script to extract metadata for a specific GSE accession. Replace GSE123456 with the desired GSE accession:
    python extract-geo-metadata.py GSE123456
  2. To filter for a specific organism, use the --organism flag:
    python extract-geo-metadata.py GSE123456 --organism "Homo sapiens"
  3. To filter by platform, use the --platform flag:
    python extract-geo-metadata.py GSE123456 --platform GPL12345

The --platform and --organism flags may be passed together to return metadata only for samples that match both the given organism and the given platform. The extracted metadata is saved as a TSV file named after the accession (e.g. GSE123456.tsv) in the same directory.
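The combined-filter behaviour described above amounts to a logical AND over whichever filters are supplied. The following is a minimal sketch of that logic and of writing the result as TSV. The function names, the "organism" and "platform" keys, and the sample rows are assumptions made for illustration; the script's real column names and implementation may differ.

```python
import csv
import io


def filter_samples(rows, organism=None, platform=None):
    """Keep rows that match every filter that was supplied (logical AND)."""
    kept = []
    for row in rows:
        if organism is not None and row.get("organism") != organism:
            continue
        if platform is not None and row.get("platform") != platform:
            continue
        kept.append(row)
    return kept


def to_tsv(rows):
    """Serialise rows as tab-separated text, using the union of all keys as columns."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows for illustration.
samples = [
    {"sample": "GSM1", "organism": "Homo sapiens", "platform": "GPL12345"},
    {"sample": "GSM2", "organism": "Mus musculus", "platform": "GPL12345"},
    {"sample": "GSM3", "organism": "Homo sapiens", "platform": "GPL99999"},
]

if __name__ == "__main__":
    both = filter_samples(samples, organism="Homo sapiens", platform="GPL12345")
    print(to_tsv(both))
```

With both filters set, only GSM1 survives; omitting a filter leaves that field unconstrained.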

Acknowledgments

This script uses the BeautifulSoup library for XML parsing. It relies on the GEO FTP server for data retrieval.