GEO Metadata Extraction and HCA Spreadsheet Generation

Spreadsheet Generator

Description

The Python script generate-spreadsheet.py generates an HCA template sheet from the TSV produced by extract-geo-metadata.py. It runs that script automatically, so you do not need to run both. However, the TSV produced by extract-geo-metadata.py contains important information that varies between datasets and cannot be programmatically imported into the HCA template sheet. When wrangling, it is useful to look at both the initial TSV and the HCA Excel spreadsheet.
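When reviewing the TSV by hand, it can help to see which columns hold a single value for every sample and which ones differ. The following is a minimal sketch of such a check, not part of either script. The function name summarize_columns and the column names in the usage note are hypothetical; the actual TSV columns depend on the dataset.

```python
import csv


def summarize_columns(lines):
    """Split TSV columns into constant ones and ones that vary across rows.

    `lines` is any iterable of text lines, such as an open file handle.
    Returns (constant, varying): constant maps column -> its single value,
    varying maps column -> sorted list of the distinct values seen.
    """
    values = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        for column, value in row.items():
            if column is None:
                # Extra cells beyond the header row; ignored in this sketch.
                continue
            # Missing cells come back as None; treat them as empty strings.
            values.setdefault(column, set()).add(value or "")
    constant = {c: next(iter(v)) for c, v in values.items() if len(v) == 1}
    varying = {c: sorted(v) for c, v in values.items() if len(v) > 1}
    return constant, varying
```

To use it on a generated file, open the TSV with open("GSE123456.tsv", newline="") and pass the handle to summarize_columns. Columns reported as varying are the ones most likely to need manual attention in the HCA spreadsheet.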

Prerequisites

Before using this script, ensure that you have the following prerequisites installed:

Python 3.x
Required Python packages (installable via pip):
- BeautifulSoup (beautifulsoup4)
- pandas (pandas)
- openpyxl (openpyxl)

Usage

Clone this repository to your local machine:

git clone https://github.com/rachadele/extract-geo-metadata.git
cd extract-geo-metadata

Install the required Python libraries:

pip install -r requirements.txt

Run the script using the following command, replacing GSE123456 with the specific GSE accession of the project you want to process:
```
python generate-spreadsheet.py GSE123456
```

This will generate an HCA template sheet with minimal metadata filled in for the given GEO accession (biosamples, organism, analysis files, etc.). The script also generates a TSV (e.g. GSE123456.tsv) that contains more specific project metadata. To generate

GEO Metadata Extractor

Overview

The GEO Metadata Extractor is a Python script designed to simplify the process of downloading and extracting metadata from MINiML files for Gene Expression Omnibus (GEO) datasets. This script is particularly useful for researchers and data scientists working with GEO datasets who want to extract metadata for further analysis.

Features

- Downloads MINiML files for a given GEO accession (GSE) from the GEO FTP server.
- Parses the MINiML file to extract metadata for all samples.
- Allows filtering of samples by organism (for example, "Homo sapiens") using the --organism flag, or by platform using the --platform flag.
- Outputs the extracted metadata in a tab-separated values (TSV) file for easy analysis.
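As an illustration of the download step, the sketch below builds the location of the MINiML archive for a series. It assumes the public GEO directory layout, in which series are grouped into folders named by replacing the last three digits of the accession with "nnn", and it uses the HTTPS address of the GEO FTP server. The function name miniml_url is hypothetical, and the script itself may construct the path differently.

```python
import re

# HTTPS front end of the GEO FTP server (assumed here; the script may use ftp://).
GEO_SERIES_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def miniml_url(accession):
    """Return the URL of the MINiML archive for a GSE accession."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GSE accession: {accession!r}")
    digits = match.group(1)
    # Series are grouped by thousands: GSE123456 lives under GSE123nnn,
    # and accessions with three or fewer digits live under GSEnnn.
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_SERIES_ROOT}/{stub}/{accession}/miniml/{accession}_family.xml.tgz"
```

The returned archive is a gzipped tarball containing the MINiML XML file, so it has to be unpacked before parsing.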

Usage

Run the script to extract metadata for a specific GSE accession. Replace GSE123456 with the desired GSE accession:

python extract-geo-metadata.py GSE123456
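To give a sense of what the extraction involves, here is a minimal sketch that turns the Sample elements of a MINiML document into one row per sample. It is not the script's code: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it runs without extra packages, and the function name parse_samples, the output keys, and the example document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace declared by MINiML documents.
NS = {"m": "http://www.ncbi.nlm.nih.gov/geo/info/MINiML"}

# Tiny hand-written document with the same shape as a MINiML file (illustrative only).
EXAMPLE = """<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML">
  <Sample iid="GSM0000001">
    <Title>example sample</Title>
    <Channel position="1">
      <Organism taxid="9606">Homo sapiens</Organism>
      <Characteristics tag="tissue">
        liver
      </Characteristics>
    </Channel>
    <Platform-Ref ref="GPL12345" />
  </Sample>
</MINiML>"""


def parse_samples(xml_text):
    """Return one dict of metadata per <Sample> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for sample in root.findall("m:Sample", NS):
        platform = sample.find("m:Platform-Ref", NS)
        row = {
            "sample": sample.get("iid", ""),
            "title": sample.findtext("m:Title", "", NS).strip(),
            "organism": sample.findtext("m:Channel/m:Organism", "", NS).strip(),
            "platform": platform.get("ref", "") if platform is not None else "",
        }
        # Each <Characteristics tag="..."> becomes its own column.
        for item in sample.findall("m:Channel/m:Characteristics", NS):
            row[item.get("tag", "characteristics")] = (item.text or "").strip()
        rows.append(row)
    return rows
```

Calling parse_samples(EXAMPLE) yields a single row with the sample accession, title, organism, platform, and a tissue column. Because characteristics tags differ between datasets, the set of columns differs too, which is why the TSV needs manual review.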

To filter for a specific organism, use the --organism flag:

python extract-geo-metadata.py GSE123456 --organism "Homo sapiens"

To filter by platform, use the --platform flag:

python extract-geo-metadata.py GSE123456 --platform GPL12345

The --platform and --organism flags may be passed together to return metadata only for samples that match both the given organism and the given platform. The extracted metadata will be saved as a TSV file named GSE123456.tsv in the same directory.
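The combined filter behaves as a logical AND over whichever filters are supplied. The sketch below shows that logic on a list of sample rows and renders the result as TSV text. The names filter_samples and to_tsv and the row keys are hypothetical and only illustrate the behavior described above.

```python
import csv
import io


def filter_samples(rows, organism=None, platform=None):
    """Keep rows that match every filter that was supplied (logical AND)."""
    kept = []
    for row in rows:
        if organism is not None and row.get("organism") != organism:
            continue
        if platform is not None and row.get("platform") != platform:
            continue
        kept.append(row)
    return kept


def to_tsv(rows):
    """Render rows as TSV text, using the union of keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Leaving a filter as None disables it, so passing neither argument returns every sample, which mirrors running the script with no flags.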

Acknowledgments

This script uses the BeautifulSoup library for XML parsing. It relies on the GEO FTP server for data retrieval.
