The Python script generate-spreadsheet.py generates an HCA template sheet from the TSV produced by extract-geo-metadata.py. It runs that script automatically, so you do not need to run both. However, the TSV produced by extract-geo-metadata.py contains important information that varies between datasets and cannot be programmatically imported into the HCA template sheet. It is useful to look at both the initial TSV and the HCA Excel spreadsheet when wrangling.
Before using this script, ensure that you have the following prerequisites installed:
- Python 3.x
- Required Python packages (installable via pip):
  - BeautifulSoup (beautifulsoup4)
  - pandas (pandas)
  - openpyxl (openpyxl)
- Clone this repository to your local machine:
git clone https://github.com/rachadele/extract-geo-metadata.git
cd extract-geo-metadata
- Install the required Python libraries:
pip install -r requirements.txt
- Run the script using the following command, replacing GSE123456 with the specific GSE accession of the project you want to process:
python generate-spreadsheet.py GSE123456
This generates an HCA template sheet with minimal metadata filled in for the given GEO accession (BioSamples, organism, analysis files, etc.). The TSV that the script also generates (e.g. GSE123456.tsv) contains more specific project metadata. To generate
The GEO Metadata Extractor is a Python script designed to simplify the process of downloading and extracting metadata from MINiML files for Gene Expression Omnibus (GEO) datasets. This script is particularly useful for researchers and data scientists working with GEO datasets who want to extract metadata for further analysis.
- Downloads MINiML files for a given GEO accession (GSE) from the GEO FTP server.
- Parses the MINiML file to extract metadata for all samples.
- Allows filtering of samples by organism, for example to include only "Homo sapiens" (human) data, using the --organism flag.
- Outputs the extracted metadata in a tab-separated values (TSV) file for easy analysis.
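The parse-and-export steps can be sketched roughly as follows. This is an illustration only, not the script's actual code: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, and the function names parse_samples and to_tsv and the column names (sample_id, title, platform, organism) are assumptions made for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# MINiML documents declare this default XML namespace.
NS = {"m": "http://www.ncbi.nlm.nih.gov/geo/info/MINiML"}


def parse_samples(xml_text):
    """Return one dict per <Sample> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for sample in root.findall("m:Sample", NS):
        row = {
            "sample_id": sample.get("iid", ""),
            "title": sample.findtext("m:Title", default="", namespaces=NS).strip(),
        }
        platform = sample.find("m:Platform-Ref", NS)
        row["platform"] = platform.get("ref", "") if platform is not None else ""
        # A sample may list an organism once per channel; de-duplicate them.
        organisms = {o.text.strip() for o in sample.iterfind(".//m:Organism", NS) if o.text}
        row["organism"] = ";".join(sorted(organisms))
        # Each <Characteristics tag="..."> becomes its own column.
        # Note: a tag that matches a fixed column name would overwrite it.
        for item in sample.iterfind(".//m:Characteristics", NS):
            tag = item.get("tag")
            if tag and item.text:
                row[tag.strip()] = item.text.strip()
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize rows to TSV text, using the union of all keys as columns."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because characteristics tags differ between datasets, the columns are collected from the data rather than fixed in advance; samples that lack a given tag get an empty cell.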
- Run the script to extract metadata for a specific GSE accession. Replace GSE123456 with the desired GSE accession:
python extract-geo-metadata.py GSE123456
- To filter for a specific organism, use the --organism flag:
python extract-geo-metadata.py GSE123456 --organism "Homo sapiens"
- To filter by platform, use the --platform flag:
python extract-geo-metadata.py GSE123456 --platform GPL12345
The --platform and --organism flags may be passed together to return metadata only for samples that match both the given organism AND the given platform. The extracted metadata is saved as a TSV file named after the accession (e.g. GSE123456.tsv) in the same directory.
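The AND behaviour of the two flags can be sketched as below. The flag names come from the commands above; everything else (build_parser, filter_samples, and the row keys organism and platform) is hypothetical and may not match how the script is actually written.

```python
import argparse


def build_parser():
    """Command-line interface mirroring the usage shown above (a sketch)."""
    parser = argparse.ArgumentParser(description="Extract GEO sample metadata.")
    parser.add_argument("accession", help="GSE accession, e.g. GSE123456")
    parser.add_argument("--organism", default=None, help='e.g. "Homo sapiens"')
    parser.add_argument("--platform", default=None, help="e.g. GPL12345")
    return parser


def filter_samples(rows, organism=None, platform=None):
    """Keep rows that match every filter that was supplied (logical AND).

    A filter left as None is ignored, so calling this with no filters
    returns all rows.
    """
    kept = []
    for row in rows:
        if organism is not None and row.get("organism") != organism:
            continue
        if platform is not None and row.get("platform") != platform:
            continue
        kept.append(row)
    return kept
```

For example, parsing the arguments GSE123456 --organism "Homo sapiens" --platform GPL12345 and passing args.organism and args.platform to filter_samples keeps only human samples run on GPL12345.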
This script uses the BeautifulSoup library for XML parsing. It relies on the GEO FTP server for data retrieval.
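For orientation, the location of a series' MINiML archive on the GEO server can be derived from the accession alone. The sketch below assumes GEO's published directory layout (series grouped into folders such as GSE123nnn, with the archive named GSE123456_family.xml.tgz) and uses the HTTPS front end of the same server; miniml_url is a hypothetical helper, and the script may build the address differently.

```python
import re

# HTTPS front end of the GEO FTP area (assumed layout, see lead-in).
GEO_SERIES_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def miniml_url(accession):
    """Build the MINiML archive URL for a GSE accession (hypothetical helper)."""
    match = re.fullmatch(r"GSE(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GSE accession: {accession!r}")
    digits = match.group(1)
    # Series are grouped by replacing the last three digits with "nnn";
    # accessions with three or fewer digits fall under "GSEnnn".
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_SERIES_ROOT}/{stub}/GSE{digits}/miniml/GSE{digits}_family.xml.tgz"
```

Validating the accession before building the URL means a typo such as a GPL or GSM identifier fails early with a clear error instead of a failed download.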