This notebook downloads the PDF files for the papers for which we have a location, abstract, organization, and URL.

In [1]:
import pickle

import numpy as np
import pandas as pd

import requests
import mimetypes
import os

import uuid

In [2]:
DATA_FOLDER = '../../../data/'

In [3]:
DOWNLOAD_FOLDER = '../../../downloads/'

Load the dataset with papers of interest

In [4]:
with open(f'{DATA_FOLDER}organization_paper_location_abstract_df.pickle', 'rb') as handle:
    organization_paper_location_abstract_df = pickle.load(handle)

In [5]:
# Generate a unique integer key for each paper to make the plotting easier
paper_ids = np.arange(len(organization_paper_location_abstract_df)).tolist()

# Store the paper_ids in the dataframe to make inspection easier
organization_paper_location_abstract_df['paper_id'] = paper_ids

In [6]:
organization_paper_location_abstract_df

Unnamed: 0,title,organization,location,abstract,paper_id
0,Predicting lncRNA-protein interactions by mach...,Biomedical Informatics,"(Health & Biomedical Informatics Centre, 202-2...","Here, we aim to provide a review of machine-le...",0
1,Recent advances in predicting protein-lncRNA i...,Tianjin University,"(天津医科大学, 22号, 气象台路, 新兴街道, 天津市, 和平区, 天津市, 30005...",classified into the deep learning-based method...,1
2,Prediction of plant lncRNA by ensemble machine...,Roche,"(Roche, La Tour-du-Pin, Isère, Auvergne-Rhône-...",Multiple machine learning approaches to lncRNA...,2
3,Long non-coding RNA and RNA-binding protein in...,Qatar Foundation,"(المؤسسة القطرية - كابينة 3, شارع 2730, المدين...",interplay between lncRNAs and lncRNAs and RBP...,3
4,A four-methylated LncRNA signature predicts su...,IUB,"(Iub, Dollo, ሶማሌ ክልል / Somali, ኢትዮጵያ, (8.23333...",In order to identify the optimal prognostic si...,4
5,LncMachine: a machine learning algorithm for l...,Stanford University,"(Stanford University, 408, Panama Mall, Stanfo...",We evaluated the performance of machine learni...,5
6,CRlncRC: a machine learning-based method for c...,Columbia University,"(Columbia University, Broadway, Manhattan Comm...",learning models on measurements of model sensi...,6
7,Machine learning-based identification of tumor...,The Second Affiliated Hospital,"(深圳市第二人民医院, 泥岗西路, 黄木岗社区, 华富街道, 福田区, 深圳市, 广东省, ...",lncRNAs lncRNA (TIIClncRNA) in low-grade gliom...,7
8,Evaluation of machine learning models that pre...,Computer Science & Electrical Engineering,"(Kenneth H Keller Hall, 200, Southeast Union S...",Our literature survey identified machine learn...,8
9,Machine learning-based construction of a ferro...,Beijing Institute of Technology,"(北京理工大学, 5, 中关村南大街, 北下关街道, 海淀区, 北京市, 100872, 中...",We have identified lncrna related to iron deat...,9


These records are the papers for which we have both an organization and an abstract.

Get the URLs for downloading the files for these papers.

In [7]:
with open(f'{DATA_FOLDER}papers.pickle', 'rb') as handle:
    papers = pickle.load(handle)

In [8]:
papers_df = pd.DataFrame.from_dict(papers)

In [9]:
papers_df

Unnamed: 0,title,abstract,year,url,author_id
0,Predicting lncRNA-protein interactions by mach...,"Here, we aim to provide a review of machine-le...",2020,Unknown,[zkBXb_kAAAAJ]
1,Recent advances in predicting protein-lncRNA i...,classified into the deep learning-based method...,2022,Unknown,"[, , , EHvA-IUAAAAJ]"
2,Recent advances in machine learning methods fo...,machine learning prediction models of LDAs. Fi...,2022,https://www.frontiersin.org/articles/10.3389/f...,"[5RoxYhkAAAAJ, , , ]"
3,Prediction of plant lncRNA by ensemble machine...,Multiple machine learning approaches to lncRNA...,2018,https://link.springer.com/article/10.1186/s128...,"[ap3FfWEAAAAJ, , ]"
4,Machine learning-based integration develops an...,related lncRNAs remains largely unexplored. In...,2022,https://www.nature.com/articles/s41467-022-284...,"[, , , , , ]"
...,...,...,...,...,...
95,Machine-Learning-Based identification of key f...,330500) was used to assess the differential ex...,2024,Unknown,"[clJGV9UAAAAJ, , , ]"
96,Prediction of ncRNA from RNA-Seq data using ma...,ncRNAs or lncRNAs. By classifying coding and l...,2023,Unknown,"[AEaAOCQAAAAJ, QVJvfz8AAAAJ]"
97,A classification model for lncRNA and mRNA bas...,"For these four machine learning algorithms, we...",2019,https://link.springer.com/article/10.1186/s128...,"[, , , , , ]"
98,Integrating multiple machine learning algorith...,"from TCGA-STAD, we identified 26 prognostic ln...",2023,https://www.frontiersin.org/articles/10.3389/f...,"[, , , , , , ]"


In [10]:
# Inner join on title to attach each paper's url
papers_to_download_df = pd.merge(organization_paper_location_abstract_df, papers_df[['title', 'url']], on='title', how='inner')

In [11]:
papers_to_download_df

Unnamed: 0,title,organization,location,abstract,paper_id,url
0,Predicting lncRNA-protein interactions by mach...,Biomedical Informatics,"(Health & Biomedical Informatics Centre, 202-2...","Here, we aim to provide a review of machine-le...",0,Unknown
1,Recent advances in predicting protein-lncRNA i...,Tianjin University,"(天津医科大学, 22号, 气象台路, 新兴街道, 天津市, 和平区, 天津市, 30005...",classified into the deep learning-based method...,1,Unknown
2,Prediction of plant lncRNA by ensemble machine...,Roche,"(Roche, La Tour-du-Pin, Isère, Auvergne-Rhône-...",Multiple machine learning approaches to lncRNA...,2,https://link.springer.com/article/10.1186/s128...
3,Long non-coding RNA and RNA-binding protein in...,Qatar Foundation,"(المؤسسة القطرية - كابينة 3, شارع 2730, المدين...",interplay between lncRNAs and lncRNAs and RBP...,3,https://www.sciencedirect.com/science/article/...
4,A four-methylated LncRNA signature predicts su...,IUB,"(Iub, Dollo, ሶማሌ ክልል / Somali, ኢትዮጵያ, (8.23333...",In order to identify the optimal prognostic si...,4,https://www.sciencedirect.com/science/article/...
5,LncMachine: a machine learning algorithm for l...,Stanford University,"(Stanford University, 408, Panama Mall, Stanfo...",We evaluated the performance of machine learni...,5,https://escholarship.org/content/qt32n7m7td/qt...
6,CRlncRC: a machine learning-based method for c...,Columbia University,"(Columbia University, Broadway, Manhattan Comm...",learning models on measurements of model sensi...,6,https://link.springer.com/article/10.1186/s129...
7,Machine learning-based identification of tumor...,The Second Affiliated Hospital,"(深圳市第二人民医院, 泥岗西路, 黄木岗社区, 华富街道, 福田区, 深圳市, 广东省, ...",lncRNAs lncRNA (TIIClncRNA) in low-grade gliom...,7,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9...
8,Evaluation of machine learning models that pre...,Computer Science & Electrical Engineering,"(Kenneth H Keller Hall, 200, Southeast Union S...",Our literature survey identified machine learn...,8,https://academic.oup.com/nargab/article/6/3/lq...
9,Machine learning-based construction of a ferro...,Beijing Institute of Technology,"(北京理工大学, 5, 中关村南大街, 北下关街道, 海淀区, 北京市, 100872, 中...",We have identified lncrna related to iron deat...,9,https://www.frontiersin.org/articles/10.3389/f...


Some of these papers are missing URLs, so let's drop those records.

In [12]:
papers_to_download_df = papers_to_download_df.query('url != "Unknown"')
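
The filter above removes only the literal placeholder "Unknown". A minimal sketch of a stricter check follows, assuming we would also want to reject empty strings, missing values, and strings that are not absolute http(s) URLs. The helper `is_downloadable_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse


def is_downloadable_url(value):
    """Return True if value looks like an absolute http(s) URL."""
    # Missing values may arrive as None or NaN (a float), so require a string
    if not isinstance(value, str):
        return False
    parsed = urlparse(value.strip())
    # Require both a web scheme and a host name
    return parsed.scheme in ('http', 'https') and bool(parsed.netloc)


candidates = ['Unknown', '', None, 'https://www.mdpi.com/2218-273X/12/12/1890']
valid = [url for url in candidates if is_downloadable_url(url)]
print(valid)
```

In the notebook this could be applied as `papers_to_download_df[papers_to_download_df['url'].map(is_downloadable_url)]`.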

In [13]:
papers_to_download_df

Unnamed: 0,title,organization,location,abstract,paper_id,url
2,Prediction of plant lncRNA by ensemble machine...,Roche,"(Roche, La Tour-du-Pin, Isère, Auvergne-Rhône-...",Multiple machine learning approaches to lncRNA...,2,https://link.springer.com/article/10.1186/s128...
3,Long non-coding RNA and RNA-binding protein in...,Qatar Foundation,"(المؤسسة القطرية - كابينة 3, شارع 2730, المدين...",interplay between lncRNAs and lncRNAs and RBP...,3,https://www.sciencedirect.com/science/article/...
4,A four-methylated LncRNA signature predicts su...,IUB,"(Iub, Dollo, ሶማሌ ክልል / Somali, ኢትዮጵያ, (8.23333...",In order to identify the optimal prognostic si...,4,https://www.sciencedirect.com/science/article/...
5,LncMachine: a machine learning algorithm for l...,Stanford University,"(Stanford University, 408, Panama Mall, Stanfo...",We evaluated the performance of machine learni...,5,https://escholarship.org/content/qt32n7m7td/qt...
6,CRlncRC: a machine learning-based method for c...,Columbia University,"(Columbia University, Broadway, Manhattan Comm...",learning models on measurements of model sensi...,6,https://link.springer.com/article/10.1186/s129...
7,Machine learning-based identification of tumor...,The Second Affiliated Hospital,"(深圳市第二人民医院, 泥岗西路, 黄木岗社区, 华富街道, 福田区, 深圳市, 广东省, ...",lncRNAs lncRNA (TIIClncRNA) in low-grade gliom...,7,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9...
8,Evaluation of machine learning models that pre...,Computer Science & Electrical Engineering,"(Kenneth H Keller Hall, 200, Southeast Union S...",Our literature survey identified machine learn...,8,https://academic.oup.com/nargab/article/6/3/lq...
9,Machine learning-based construction of a ferro...,Beijing Institute of Technology,"(北京理工大学, 5, 中关村南大街, 北下关街道, 海淀区, 北京市, 100872, 中...",We have identified lncrna related to iron deat...,9,https://www.frontiersin.org/articles/10.3389/f...
10,Identification and validation of cuproptosis-r...,Beijing Institute of Technology,"(北京理工大学, 5, 中关村南大街, 北下关街道, 海淀区, 北京市, 100872, 中...",LncRNAs with prognostic value and construct a ...,10,https://www.mdpi.com/2218-273X/12/12/1890
11,Prediction of LncRNA subcellular localization ...,Genetics,"(Genetics, Errigal Road, Crumlin, Crumlin F Wa...","In this work, we develop a DeepLncRNA to ident...",11,https://www.nature.com/articles/s41598-018-347...


In [14]:
papers_to_download_df.shape

(40, 6)

In [15]:
def download_files(paper_ids, urls, output_dir):
    """Download every URL that serves a PDF and record which file belongs to which paper."""
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)

    downloaded_files = {
        'paper_id': [],
        'file_name': []
    }

    for paper_id, link in zip(paper_ids, urls):
        try:
            # Skip records without a link
            if not link:
                print("No downloadable link found.")
                continue

            # Stream the body in chunks; the timeout stops a stalled server
            # from hanging the whole loop
            with requests.get(link, stream=True, timeout=30) as response:
                content_type = response.headers.get('Content-Type', '')

                # Only save responses that the server labels as PDF
                if 'application/pdf' not in content_type.lower():
                    print(f"Skipping non-PDF content: {link}")
                    continue

                # Use a random UUID as the file name to avoid collisions
                unique_filename = str(uuid.uuid4()) + '.pdf'
                file_path = os.path.join(output_dir, unique_filename)

                with open(file_path, "wb") as file:
                    for chunk in response.iter_content(chunk_size=8192):
                        file.write(chunk)

            print(f"Downloaded: {file_path}")
            downloaded_files['paper_id'].append(paper_id)
            downloaded_files['file_name'].append(unique_filename)
        except Exception as e:
            # Report the failure and carry on with the next paper
            print(f"Error occurred while processing article {paper_id}: {e}")

    return downloaded_files
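
The `Content-Type` header is only the server's claim about the payload. A minimal sketch of an additional check follows, assuming we are willing to inspect the first bytes of the response body: PDF files begin with the signature `%PDF-`. The helper name `looks_like_pdf` is hypothetical, and requiring both signals to agree is a design choice rather than something the notebook does.

```python
PDF_SIGNATURE = b'%PDF-'


def looks_like_pdf(content_type, first_bytes):
    """Return True when both the header and the leading bytes indicate a PDF."""
    # Drop parameters such as "; charset=binary" and normalise case
    media_type = content_type.split(';')[0].strip().lower()
    # Accept only when the server's label and the file signature agree
    return media_type == 'application/pdf' and first_bytes.startswith(PDF_SIGNATURE)


print(looks_like_pdf('application/pdf; charset=binary', b'%PDF-1.7\n'))
print(looks_like_pdf('application/pdf', b'<!DOCTYPE html>'))
print(looks_like_pdf('text/html', b'%PDF-1.4'))
```

Inside `download_files`, the first chunk yielded by `response.iter_content` could be passed as `first_bytes` before any file is created.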

In [16]:
paper_ids = papers_to_download_df['paper_id'].values.tolist()
urls = papers_to_download_df['url'].values.tolist()

In [17]:
downloaded_files = download_files(paper_ids=paper_ids, urls=urls, output_dir=DOWNLOAD_FOLDER)

Skipping non-PDF content: https://link.springer.com/article/10.1186/s12864-018-4665-2
Skipping non-PDF content: https://www.sciencedirect.com/science/article/pii/S1044579X22001249
Skipping non-PDF content: https://www.sciencedirect.com/science/article/pii/S0888754320319698
Downloaded: ../../downloads/51807917-91f3-4b8f-8ad4-e1c5c923432e.pdf
Skipping non-PDF content: https://link.springer.com/article/10.1186/s12920-018-0436-9
Skipping non-PDF content: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9373811/
Skipping non-PDF content: https://academic.oup.com/nargab/article/6/3/lqae125/7759975
Skipping non-PDF content: https://www.frontiersin.org/articles/10.3389/fonc.2023.1171878/full
Skipping non-PDF content: https://www.mdpi.com/2218-273X/12/12/1890
Skipping non-PDF content: https://www.nature.com/articles/s41598-018-34708-w
Skipping non-PDF content: https://www.nature.com/articles/s41598-024-68750-8
Downloaded: ../../downloads/92c4c16b-1cc5-49b0-8ff3-09182bfb02fc.pdf
Skipping non-PDF con

In [18]:
downloaded_files_df = pd.DataFrame(downloaded_files)

In [19]:
downloaded_files_df

Unnamed: 0,paper_id,file_name
0,5,51807917-91f3-4b8f-8ad4-e1c5c923432e.pdf
1,14,92c4c16b-1cc5-49b0-8ff3-09182bfb02fc.pdf
2,23,d320ce0f-7cd8-4afa-8ec7-baad93b09505.pdf
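
Because `uuid.uuid4()` is random, rerunning the notebook would save the same paper under a new file name each time. A minimal sketch of a deterministic alternative follows, assuming the URL uniquely identifies the paper: `uuid.uuid5` derives a stable UUID from a namespace and a name. The helper `filename_for_url` is hypothetical.

```python
import uuid


def filename_for_url(url):
    """Derive a stable PDF file name from the URL."""
    # uuid5 hashes the namespace and the name, so the same URL
    # always maps to the same UUID
    return f"{uuid.uuid5(uuid.NAMESPACE_URL, url)}.pdf"


first = filename_for_url('https://www.mdpi.com/2218-273X/12/12/1890')
second = filename_for_url('https://www.mdpi.com/2218-273X/12/12/1890')
print(first == second)
```

With stable names, a rerun could check `os.path.exists` on the target path and skip papers that are already on disk.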


In [20]:
downloaded_files_df = pd.merge(papers_to_download_df, downloaded_files_df, on='paper_id')[['title', 'file_name']]

In [21]:
downloaded_files_df

Unnamed: 0,title,file_name
0,LncMachine: a machine learning algorithm for l...,51807917-91f3-4b8f-8ad4-e1c5c923432e.pdf
1,DMFLDA: a deep learning framework for predicti...,92c4c16b-1cc5-49b0-8ff3-09182bfb02fc.pdf
2,Evaluation of deep learning in non-coding RNA ...,d320ce0f-7cd8-4afa-8ec7-baad93b09505.pdf


In [22]:
downloaded_files_df.to_parquet(f'{DATA_FOLDER}downloaded_files_df.parquet')