# FigureYa104GEOmining

title: "FigureYa104GEOmining"
author: "Yu Sun, Xuan Da, Taojun Ye"
reviewer: "Ying Ge"
date: "2025-5-20"
output: html_document

## Requirement description

For high-throughput datasets retrieved from the GEO database, find in bulk which article each dataset comes from, annotate the journal impact factor, and output a text file and a web page. The web page links to the PubMed page of each article.

**For usage, refer to this post:** <https://mp.weixin.qq.com/s/G-CQhNEJBmMRuDe2kxND_w>

## Application scenarios

Scenario 1: Your supervisor asks you to run a sequencing experiment. How should you design it? Look at how articles that generated the same type of data designed theirs.

Scenario 2: The sequencing data is back. How should you analyze it, which figures can you draw, and how should you describe the results? Refer to articles with similar data.

Scenario 3: You want to run an integrative analysis with published data. Which dataset is more reliable? Start with the data from articles in high-impact-factor journals.

## Environment setup

Download and install the Anaconda distribution: https://www.anaconda.com/distribution/#download-section

It already includes Python 3, IPython, and Jupyter Notebook, which are needed to run this document.

BioPython, metapub, and pytablewriter must be installed separately. Newer eutils releases changed the API in a way that is not backward compatible, so an older version also needs to be installed. Run the following commands in a terminal:

```bash
conda install -c anaconda biopython
pip install metapub
pip install eutils==0.5.0
pip install pytablewriter
pip install markdown
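pip install tqdm  # tqdm already ships with Anaconda
pip install beautifulsoup4 lxml googletrans retry  # also imported by the code below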
```
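Before running the notebook, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library. The module list is an assumption taken from the install commands and the import statements in the code area, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Import names assumed from the install commands and the code area's imports.
REQUIRED = ["pandas", "bs4", "Bio", "metapub", "pytablewriter",
            "tqdm", "googletrans", "retry", "markdown"]

def missing_modules(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Anything listed as missing can then be installed with the commands above.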

## Input

Write the search keywords after `term = ` in the code area below, for example `(gds pubmed[Filter]) AND "Drosophila melanogaster"[orgn:__txid7227] AND ATAC-seq`.

It is best to search on <https://www.ncbi.nlm.nih.gov/geo/> first and, once you have found suitable keywords, come back here to extract the literature.

Adding `(gds pubmed[Filter])` at the beginning filters out datasets that do not yet have a published article.
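As an illustration of how such a search term is put together, here is a small sketch that assembles the query string from its parts. `build_geo_term` and its parameters are hypothetical names introduced for this example; the notebook itself simply assigns the finished string to `term`.

```python
def build_geo_term(keywords, organism=None, taxid=None, published_only=True):
    """Compose a GEO DataSets query string from its parts.

    published_only prepends the (gds pubmed[Filter]) clause; organism and
    taxid add an organism clause in the same form as the example term.
    """
    parts = []
    if published_only:
        parts.append("(gds pubmed[Filter])")
    if organism and taxid:
        parts.append(f'"{organism}"[orgn:__txid{taxid}]')
    parts.extend(keywords)
    return " AND ".join(parts)

term = build_geo_term(["ATAC-seq"], organism="Drosophila melanogaster", taxid=7227)
print(term)
# (gds pubmed[Filter]) AND "Drosophila melanogaster"[orgn:__txid7227] AND ATAC-seq
```

It is still worth testing the resulting string on the GEO website before using it here.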

## Run the code

Open Anaconda, launch Jupyter Notebook, open this document, and click the Run button to run the code below.

**Speed-up:** Register an NCBI account, click your email address in the upper-right corner, and apply for an API key to speed up retrieval: E-utils users are allowed 3 requests/second without an API key. Create an API key to increase your e-utils limit to 10 requests/second.

Add your API key after `Entrez.api_key = ` in the code area.

For details on how to obtain an API key, see <https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/>
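To make the two request limits concrete, the sketch below shows one way to space out calls so that several threads together stay under a chosen rate. It is not part of the notebook's code: `Throttle` and `allowed_rate` are hypothetical names, and the only figures taken from the text are 3 and 10 requests per second.

```python
import threading
import time

def allowed_rate(api_key=None):
    """Requests per second: 3 without an API key, 10 with one."""
    return 10 if api_key else 3

class Throttle:
    """Space calls at least 1/rate seconds apart, across threads."""

    def __init__(self, requests_per_second):
        self.interval = 1.0 / requests_per_second
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)

throttle = Throttle(allowed_rate(api_key=None))
```

Calling `throttle.wait()` immediately before each request is the intended usage in this sketch.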

## Code area

In [None]:
import os
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup
from Bio import Entrez
from metapub import PubMedFetcher
from pytablewriter import MarkdownTableWriter
from tqdm import tqdm
from googletrans import Translator  
from retry import retry

# Read the CSV file containing journal impact factors
df = pd.read_csv('impact_factor.csv')

# Retrieve the impact factor of a journal
def get_impact_factor(issn_eissn):
    # Look up the journal by ISSN or E-ISSN
    impact_factor = df.loc[(df['ISSN'] == issn_eissn) | (df['E-ISSN'] == issn_eissn), 'Impact Factor'].values
    if len(impact_factor) > 0:
        return impact_factor[0]
    else:
        return '-NA-'

# Set the email and API key for the Entrez module (used for NCBI database access);
# replace both with your own
Entrez.email = "yetaojun0709@gmail.com"
Entrez.api_key = "e479d557c9c4a6533a56188d731704f66107"

# Search term for the GEO DataSets (gds) database: human myocardial infarction
# datasets that have a linked PubMed article
term = '(gds pubmed[Filter]) AND "myocardial infarction" AND "Homo sapiens"[orgn:__txid9606]'
# Search and retrieve the list of GDS record IDs matching the term
handle = Entrez.esearch(db="gds", term=term, retmax=100000)
record = Entrez.read(handle)
handle.close()

# Create a PubMedFetcher instance to retrieve articles
fetch = PubMedFetcher()

# Set the header row of the Markdown table, including a column for the Chinese title
writer = MarkdownTableWriter()
writer.headers = ["Title", "Title (Chinese)", "Year", "Impact Factor", "Journal", "PMID", "GEO accession", "Authors"]
writer.value_matrix = []

# Initialize the Google translator
translator = Translator()

# Translate an article title into Chinese, retrying up to 3 times
# with a 2-second delay between attempts
@retry(tries=3, delay=2)
def translate_title(title):
    return translator.translate(title, dest='zh-CN').text

# Print search result statistics (in Chinese, then in English)
print("搜索 '{}' 返回了 {} 条结果。正在解析...".format(term, len(record['IdList'])))
print("Search for '{}' returned {} results. Parsing...".format(term, len(record['IdList'])))

# Process a single GDS ID and extract the related article information
def process_gds_id(gds_id):
    result = []
    gds_url = f'https://www.ncbi.nlm.nih.gov/gds/?term={gds_id}'
    try:
        # Open the NCBI GDS record page
        soup = BeautifulSoup(urlopen(gds_url), 'lxml')
    except Exception:
        # print("Error accessing NCBI API.")
        return result

    # Extract the GEO accession number and build its URL
    accn = soup.body.form('dl', class_="rprtid")[0].contents[1].string
    accn_url = f'https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={accn}'
    try:
        # Open the GEO accession page
        soup = BeautifulSoup(urlopen(accn_url), 'lxml')
    except Exception:
        # print("Error accessing NCBI API. Please wait or change API key.")
        return result

    # Extract the PubMed ID(s) of the article
    pubmed_id_element = soup.body.find('span', class_="pubmed_id")
    if pubmed_id_element:
        pmids = pubmed_id_element['id'].split(',')
        # Process each PubMed ID
        for pmid in pmids:
            pmid_url = f'https://www.ncbi.nlm.nih.gov/pubmed/{pmid}'
            try:
                # Retrieve article information using the PubMed ID
                article = fetch.article_by_pmid(pmid)
            except Exception:
                # print("Error accessing NCBI API. Please wait or change API key.")
                return result

            # Get the ISSN or E-ISSN of the journal
            issn_eissn = article.issn or article.e_issn or '-NA-'
            # Retrieve the journal's impact factor
            if_value = get_impact_factor(issn_eissn)
            # Format the author list: commas between the first n-1 authors,
            # "and" before the last; also handle one author or none
            authors = list(article.authors or [])
            if len(authors) > 1:
                author_list = ", ".join(authors[:-1]) + " and " + authors[-1]
            else:
                author_list = authors[0] if authors else '-NA-'

            # Translate the article title
            try:
                translated_title = translate_title(article.title)
            except Exception as e:
                print(f"Error translating title: {e}")
                translated_title = "Translation Failed"
            result.append([article.title, translated_title, article.year, if_value, article.journal, '[' + pmid + '](' + pmid_url + ')', '[' + accn + '](' + accn_url + ')', author_list])
    return result
        
# Set the maximum number of threads (here, twice the number of CPU cores)
max_workers = 2 * (os.cpu_count() or 1)
# Process all GDS IDs in parallel with a thread pool
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    results = list(tqdm(executor.map(process_gds_id, record['IdList']), total=len(record['IdList'])))
# Add all processed results to the Markdown table
for result in results:
    writer.value_matrix.extend(result)
# Write the results to file and print the content.
# NOTE: the lines that define `content` and write the files are missing from this
# document; the reconstruction below is an assumption based on the Output section.
import markdown
content = writer.dumps()
with open('GEO_citations.txt', 'w', encoding='utf-8') as f:
    f.write(content)
with open('GEO_citations.html', 'w', encoding='utf-8') as f:
    f.write(markdown.markdown(content, extensions=['tables']))
print(content)

搜索 '(gds pubmed[Filter]) AND "myocardial infarction" AND "Homo sapiens"[orgn:__txid9606]' 返回了 435 条结果。正在解析...
Search for '(gds pubmed[Filter]) AND "myocardial infarction" AND "Homo sapiens"[orgn:__txid9606]' returned 435 results. Parsing...


 14%|██████                                    | 63/435 [00:12<01:11,  5.19it/s]


## Output

Two files are generated in the current folder:

1. `GEO_citations.txt`: a text file summarizing the article information.

2. `GEO_citations.html`: a web page with an embedded three-line table. Blue text is a link: click a GSE ID to go to the dataset's GEO page, or a PMID to go to the article's PubMed page. Rows are in the default relevance order ("Sort by Default order"). You can copy the table into Excel to sort and filter it; the links are preserved in Excel, and clicking one opens the paper's web page.
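As an alternative to sorting in Excel, the rows could be ordered by impact factor in Python before the table is written. The sketch below is an assumption about how one might do that, not part of the notebook. `sort_by_impact_factor` is a hypothetical helper, the sample rows are placeholders rather than real results, and index 3 is the "Impact Factor" column in the table's header order.

```python
import math

def sort_by_impact_factor(rows, if_index=3):
    """Sort rows by impact factor, highest first; non-numeric values go last."""
    def key(row):
        try:
            value = float(row[if_index])
            if math.isnan(value):
                raise ValueError("NaN impact factor")
            return (0, -value)
        except (TypeError, ValueError):
            # '-NA-' or any other non-numeric entry sorts after all numbers
            return (1, 0.0)
    return sorted(rows, key=key)

# Placeholder rows in the table's column order (values are illustrative only)
rows = [
    ["Title A", "", "2019", "4.1", "Journal A", "PMID-A", "GSE-A", "Author A"],
    ["Title B", "", "2020", "-NA-", "Journal B", "PMID-B", "GSE-B", "Author B"],
    ["Title C", "", "2021", 12.5, "Journal C", "PMID-C", "GSE-C", "Author C"],
]
sorted_rows = sort_by_impact_factor(rows)
print([row[0] for row in sorted_rows])
```

Because `sorted` is stable, rows without a usable impact factor keep their original relative order at the end of the table.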

## Notes on special cases

1. If you encounter a TimeoutError, try again on a better network connection.

2. Occasionally every impact factor shows as NA. The <https://www.scijournal.org> website may be temporarily unreachable; try again later.

3. The impact factor of some journals, such as PNAS and NAR, cannot be found on this website, so it is shown as NA.

4. If an article title contains a special character such as "<", the title is cut off at that point. This is a bug in the Entrez package.
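For the timeout case, a retry with a short pause is sometimes enough before switching networks. The following is a minimal standard-library sketch. `call_with_retries` is a hypothetical helper, and the choice to catch only `TimeoutError` is an assumption; the exceptions actually raised depend on the library making the request.

```python
import time

def call_with_retries(func, *args, tries=3, delay=2.0, exceptions=(TimeoutError,)):
    """Call func(*args); on a listed exception, wait and retry up to `tries` times."""
    for attempt in range(1, tries + 1):
        try:
            return func(*args)
        except exceptions:
            if attempt == tries:
                raise  # out of attempts: surface the last error
            time.sleep(delay * attempt)  # wait a little longer after each failure
```

The function being wrapped is passed in as `func`, so the same helper can be tried around any single network call.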

In [None]:
import IPython
print(IPython.sys_info())