This code uses Biopython and spaCy to retrieve PubMed IDs and generate NLP summaries of PubMed articles. The pipeline is divided into the following steps:
- Use Biopython's Entrez API to search for and retrieve PubMed IDs for a given search query.
- Download the PubMed article XML/txt files corresponding to each retrieved PubMed ID from an Amazon S3 bucket.
- Parse the downloaded article files with Biopython's `Medline` module to extract the article text (optional).
- Process the article text with spaCy's NLP pipeline to generate a summary of the article.
- Save the summary in a CSV file named after the original article file.
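The search and summarization steps above might be sketched as follows. This is an illustration, not the exact code in `main.py`: the function names, the Entrez contact email, and the word-frequency summary heuristic (standing in for the spaCy pipeline) are all assumptions.

```python
import collections
import re


def search_pubmed(query, email, retmax=20):
    """Query PubMed via Biopython's Entrez esearch and return PubMed IDs.

    Biopython is imported lazily so the pure helper below can be used
    without it installed.
    """
    from Bio import Entrez  # requires `pip install biopython`

    Entrez.email = email  # NCBI asks for a contact address on every request
    with Entrez.esearch(db="pubmed", term=query, retmax=retmax) as handle:
        record = Entrez.read(handle)
    return record["IdList"]


def summarize(text, n_sentences=2):
    """Tiny extractive summary: keep the sentences whose words are most
    frequent in the text. The real pipeline uses spaCy; this plain-Python
    stand-in only illustrates the idea.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = collections.Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```

A spaCy-based version would instead feed the text through `nlp = spacy.load("en_core_web_sm")` and rank `doc.sents`, but the scoring idea is the same.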
To get started, you'll need to have the following prerequisites installed:
- Python 3
- Biopython
- spaCy
- An AWS account with access to an S3 bucket containing PubMed article XML files.
You can install the necessary Python packages using pip by running the following command:
```
pip install biopython spacy
```
You'll also need to download the spaCy English language model by running the following command:
```
python -m spacy download en_core_web_sm
```
Once the prerequisites are installed, you can run the `main.py` script in the project directory. Before doing so, modify the following variables in `main.py` to match your AWS S3 bucket and PubMed search query:
```python
S3_BUCKET = "your-s3-bucket-name"
SEARCH_QUERY = "your-pubmed-search-query"
```
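The download step driven by `S3_BUCKET` could be sketched with boto3 as below. The key layout (one `<pmid>.xml` object per article at the bucket root) is an assumption about how the bucket is organized, not something the script guarantees.

```python
from pathlib import Path


def article_key(pmid):
    """Map a PubMed ID to its assumed S3 object key, e.g. '7529838.xml'."""
    return f"{pmid}.xml"


def download_article(bucket_name, pmid, dest_dir="text"):
    """Download one article file from S3 into dest_dir and return its path.

    Requires `pip install boto3` and AWS credentials configured in the
    environment (e.g. via `aws configure`).
    """
    import boto3  # imported lazily so article_key works without boto3

    key = article_key(pmid)
    dest = Path(dest_dir) / key
    dest.parent.mkdir(parents=True, exist_ok=True)
    boto3.client("s3").download_file(bucket_name, key, str(dest))
    return dest
```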
After setting these variables, you can run the script by navigating to the project directory in your terminal and running:
```
python main.py
```
This will search PubMed for articles matching your query, download the corresponding article XML files from your S3 bucket, process the article text with spaCy, and write a summary CSV file for each article to the `text` directory.
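The final step, naming each CSV after its source article, can be sketched as follows. The column layout here is an assumption for illustration; `main.py` may write different fields.

```python
import csv
from pathlib import Path


def write_summary_csv(article_path, summary, out_dir="text"):
    """Write a one-row summary CSV named after the article file,
    e.g. 'PMC123456.xml' -> 'text/PMC123456.csv'.
    """
    out = Path(out_dir) / (Path(article_path).stem + ".csv")
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["article_file", "summary"])  # assumed columns
        writer.writerow([Path(article_path).name, summary])
    return out
```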