# Merge 2020 and 2021 Results

The file PMC8012676.xml is a good example of a composite figure that can be detected from the text alone. It also contains many good examples of the relationships among the text, the citations, and the figures.

In [1]:
import json
import os
import re
import sys
import tempfile
from pathlib import Path, PurePath
from pprint import pprint

import numpy as np
import pandas as pd
import requests
import requests_cache

In [2]:
%load_ext sql

In [3]:
requests_cache.install_cache("pfocr_cache")

## Import PFOCR 2020 Results

In [4]:
from functools import partial

import rpy2.robjects as ro
from rpy2.ipython import html
from rpy2.robjects import default_converter, pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.lib.dplyr import DataFrame
from rpy2.robjects.packages import importr

html.html_rdataframe = partial(html.html_rdataframe, table_class="docutils")

In [5]:
pandas2ri.activate()
base = importr("base")
readRDS = ro.r["readRDS"]
saveRDS = ro.r["saveRDS"]



In [6]:
def rds2pandas(rds_path):
    """Read an RDS file with R's readRDS and return it as a pandas DataFrame."""
    r_df = readRDS(str(rds_path))
    with localconverter(ro.default_converter + pandas2ri.converter):
        pandas_df = ro.conversion.rpy2py(r_df)
    return pandas_df

In [7]:
def pandas2rds(pandas_df, rds_path):
    """Convert a pandas DataFrame to an R data frame and write it with saveRDS."""
    with localconverter(default_converter + pandas2ri.converter):
        r_df = DataFrame(pandas_df)

    saveRDS(r_df, str(rds_path))

### Import Figures 2020

#### from `pfocr20200224` Database

In [8]:
%sql postgresql:///pfocr20200224

In [9]:
%%sql figures_2020_db_data << SELECT id, filepath, figure_number, resolution, hash
FROM figures;

 * postgresql:///pfocr20200224
114176 rows affected.
Returning data to local variable figures_2020_db_data


Let's turn this into a dataframe and get rid of some rows that are identical except for `id` and `filepath`.

Expecting 114117 rows × 4 columns.

In [10]:
raw_figures_2020_db_df = figures_2020_db_data.DataFrame()

raw_figures_2020_db_df["pfocr_id"] = raw_figures_2020_db_df["filepath"].apply(
    lambda f: PurePath(f).name
)

figures_2020_db_df = raw_figures_2020_db_df.drop(
    columns=["id", "filepath"]
).drop_duplicates()

del raw_figures_2020_db_df

figures_2020_db_df

Unnamed: 0,figure_number,resolution,hash,pfocr_id
0,nihms631297f1,150,f1251c61cdae166d9efe021a38a26df030e53cc0c8b446...,PMC4183209__nihms631297f1.jpg
1,zam0120900500001,72,99bfb01f132886e392fbe9987723edf477d1480eb116a1...,PMC2698365__zam0120900500001.jpg
2,nutrients-09-01243-g007,600,a34de0ab2744b046c9a2d927676b234d308327c615bcc4...,PMC5707715__nutrients-09-01243-g007.jpg
3,EMS83014-f003,96,7f4fafb790d9b92b98e3573f7d1edef1b9990e67d3af6d...,PMC6538535__EMS83014-f003.jpg
4,12014_2017_9165_Fig1_HTML,72,cae8e2ac3e341cec7e41c21cbdb2317999a1bfbb0bc7bc...,PMC5557313__12014_2017_9165_Fig1_HTML.jpg
...,...,...,...,...
114171,kpsb-11-01-1119962-g016,200,d40168e350337ab4e9f18a8eacbcb1b233eeaa703e11b9...,PMC4871689__kpsb-11-01-1119962-g016.jpg
114172,1471-2407-14-217-3,600,1ec4ad03dd5295e97fe7f1718cf609cb21d263341c12ec...,PMC3994450__1471-2407-14-217-3.jpg
114173,gr3,113,581cebd1c383b70ba61359690b58ca69b1e3f978dd447f...,PMC4157144__gr3.jpg
114174,nihms491534f1,150,86a3b076b363fdf036689937015d97eeae8d1e664fa1da...,PMC3748258__nihms491534f1.jpg
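The de-duplication step above can be illustrated without a database. The following is a minimal sketch using only the standard library; the rows, directory names, and shortened hashes are made up for illustration. Rows that differ only in `id` and in the directory part of `filepath` collapse into one once `pfocr_id` is taken from the file name.

```python
from pathlib import PurePath

# Hypothetical rows; directories and shortened hashes are made up for illustration.
rows = [
    {"id": 1, "filepath": "/run_a/PMC4157144__gr3.jpg",
     "figure_number": "gr3", "resolution": 113, "hash": "581ceb"},
    {"id": 2, "filepath": "/run_b/PMC4157144__gr3.jpg",
     "figure_number": "gr3", "resolution": 113, "hash": "581ceb"},
    {"id": 3, "filepath": "/run_a/PMC3748258__nihms491534f1.jpg",
     "figure_number": "nihms491534f1", "resolution": 150, "hash": "86a3b0"},
]

KEPT_COLUMNS = ("figure_number", "resolution", "hash", "pfocr_id")


def dedupe_figures(rows):
    """Drop `id` and `filepath`, then keep one row per remaining combination of values."""
    seen = set()
    unique = []
    for row in rows:
        key = (
            row["figure_number"],
            row["resolution"],
            row["hash"],
            PurePath(row["filepath"]).name,  # the file name becomes pfocr_id
        )
        if key not in seen:
            seen.add(key)
            unique.append(dict(zip(KEPT_COLUMNS, key)))
    return unique


deduped = dedupe_figures(rows)
print(len(rows), "->", len(deduped))
```

The pandas cell above does the same thing with `drop(columns=...)` followed by `drop_duplicates()`; the sketch only makes the comparison key explicit.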


#### from `pfocr_figures.rds`

Expecting 64643 rows × 14 columns

In [11]:
figures_2020_rds_url = (
    "https://www.dropbox.com/s/qhc33zho78rnaoj/pfocr_figures.rds?dl=1"
)

with tempfile.NamedTemporaryFile(suffix=".rds") as f:
    figures_2020_rds_path = f.name
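    with requests.get(figures_2020_rds_url, stream=True) as r:
        # fail early instead of writing an HTTP error page into the .rds file
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
        # flush buffered bytes so that R can read the file by its path
        f.flush()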
    figures_2020_rds_df = rds2pandas(figures_2020_rds_path).rename(
        columns={
            "figid": "pfocr_id",
            "pmcid": "pmc_id",
            "filename": "figure_filename",
            "number": "figure_number",
            "pmc_ranked_result_index": "pmc_search_index",
            "figtitle": "figure_title",
            "papertitle": "paper_title",
            "caption": "figure_caption",
            "figlink": "relative_figure_page_url",
            "reftext": "reference",
            "year": "publication_year",
        }
    )


figures_2020_rds_df["paper_url"] = (
    "https://www.ncbi.nlm.nih.gov/pmc/articles/" + figures_2020_rds_df["pmc_id"]
)

figures_2020_rds_df["figure_page_url"] = (
    "https://www.ncbi.nlm.nih.gov"
    + figures_2020_rds_df["relative_figure_page_url"]
)

figures_2020_rds_df["figure_thumbnail_url"] = (
    "https://www.ncbi.nlm.nih.gov/pmc/articles/"
    + figures_2020_rds_df["pmc_id"]
    + "/bin/"
    + figures_2020_rds_df["figure_filename"]
)

figures_2020_rds_df["pfocr_year"] = 2020

figures_2020_rds_df.drop(
    columns=[
        "relative_figure_page_url",
        "figure_filename",
        "source_f",
        "type.man",
        "automl_index",
    ],
    inplace=True,
)

figures_2020_rds_df

Unnamed: 0,pfocr_id,figure_number,reference,publication_year,pathway_score,pmc_search_index,figure_title,paper_title,figure_caption,pmc_id,paper_url,figure_page_url,figure_thumbnail_url,pfocr_year
1,PMC5653847__41598_2017_14124_Fig8_HTML.jpg,Figure 8,"Céline Barthelemy, et al. Sci Rep. 2017;7:13816.",2017,0.968270,133303,Model of FTY720-induced transporter endocytosi...,FTY720-induced endocytosis of yeast and human ...,Model of FTY720-induced transporter endocytosi...,PMC5653847,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,2020
2,PMC4187043__zh20191474070013.jpg,Fig. 13,"Yuan Wei, et al. Am J Physiol Renal Physiol. 2...",2014,0.965793,79929,Proposed signaling pathway by which the stimul...,Angiotensin II type 2 receptor regulates ROMK-...,Proposed signaling pathway by which the stimul...,PMC4187043,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,2020
3,PMC5746550__rsob-7-170228-g1.jpg,Figure 1,"Georgia R. Frost, et al. Open Biol. 2017 Dec;7...",2017,0.962470,98034,Aβ production,The role of astrocytes in amyloid production a...,Aβ production. In the amyloidogenic pathway (...,PMC5746550,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,2020
4,PMC4211692__pone.0110875.g008.jpg,Figure 8,"Enida Gjoni, et al. PLoS One. 2014;9(10):e110875.",2014,0.966721,142401,,Glucolipotoxicity Impairs Ceramide Flow from t...,Glucolipotoxicity impairs CERT- and vesicular-...,PMC4211692,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,2020
5,PMC2588433__nihms78212f8.jpg,Figure 8,"Amanda L. Lewis, et al. J Biol Chem. ;282(38):...",,0.966758,67398,,NeuA sialic acid O-acetylesterase activity mod...,Bacterial Sia biosynthesis can be divided into...,PMC2588433,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64639,PMC4216988__zh20221474360006.jpg,Fig. 6,"Marcelo D. Carattino, et al. Am J Physiol Rena...",2014,0.143076,108774,Hypothetical mechanism of activation of ENaC b...,Prostasin interacts with the epithelial Na+ ch...,Hypothetical mechanism of activation of ENaC b...,PMC4216988,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,2020
64640,PMC2873070__nihms128887f5.jpg,Scheme 1,"Hua Cheng, et al. Neurobiol Aging. ;31(7):1188...",,0.127176,143547,A schematic diagram of a proposed working mode...,Apolipoprotein E mediates sulfatide depletion ...,A schematic diagram of a proposed working mode...,PMC2873070,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,2020
64641,PMC3651446__pnas.1220523110fig06.jpg,Fig. 6,"Jiun-Ming Wu, et al. Proc Natl Acad Sci U S A....",2013,0.055546,159643,Models for nucleation of centrosomal and kinet...,Aurora kinase inhibitors reveal mechanisms of ...,Models for nucleation of centrosomal and kinet...,PMC3651446,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,2020
64642,PMC6770832__cancers-11-01236-g005.jpg,Figure 5,"Carmel Mothersill, et al. Cancers (Basel). 201...",2019,0.140041,618,A simplified TGFβ pathway leading to p21 expr...,Relevance of Non-Targeted Effects for Radiothe...,A simplified TGFβ pathway leading to p21 expr...,PMC6770832,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,2020
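The three URL columns in the cell above are plain string concatenations. As a minimal sketch for a single record: the helper name `build_urls` is hypothetical, the figure file name is inferred from the `pfocr_id` shown in the table, and the relative figure page path is a made-up placeholder.

```python
PMC_ARTICLES = "https://www.ncbi.nlm.nih.gov/pmc/articles/"
NCBI = "https://www.ncbi.nlm.nih.gov"


def build_urls(pmc_id, figure_filename, relative_figure_page_url):
    """Mirror the three string concatenations applied to the 2020 RDS columns."""
    return {
        "paper_url": PMC_ARTICLES + pmc_id,
        "figure_page_url": NCBI + relative_figure_page_url,
        "figure_thumbnail_url": PMC_ARTICLES + pmc_id + "/bin/" + figure_filename,
    }


# The relative figure page path is a made-up placeholder, not taken from the data.
urls = build_urls(
    pmc_id="PMC4211692",
    figure_filename="pone.0110875.g008.jpg",
    relative_figure_page_url="/pmc/articles/PMC4211692/figure/example/",
)
for name, url in urls.items():
    print(name, url)
```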


Expecting 64643 unique `pfocr_id` values

In [12]:
len(set(figures_2020_rds_df["pfocr_id"].to_list()))

64643

In [13]:
print(len(figures_2020_rds_df.columns))
figures_2020_rds_df.columns

14


Index(['pfocr_id', 'figure_number', 'reference', 'publication_year',
       'pathway_score', 'pmc_search_index', 'figure_title', 'paper_title',
       'figure_caption', 'pmc_id', 'paper_url', 'figure_page_url',
       'figure_thumbnail_url', 'pfocr_year'],
      dtype='object')

In [14]:
len(
    set(figures_2020_db_df["pfocr_id"].to_list())
    & set(figures_2020_rds_df["pfocr_id"].to_list())
)

63917

In [15]:
len(
    set(figures_2020_db_df["pfocr_id"].to_list())
    - set(figures_2020_rds_df["pfocr_id"].to_list())
)

50200

In [16]:
len(
    set(figures_2020_rds_df["pfocr_id"].to_list())
    - set(figures_2020_db_df["pfocr_id"].to_list())
)

726
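The three comparisons above (shared IDs, database-only IDs, RDS-only IDs) can be sketched as one helper. The function name `compare_ids` and the toy IDs are hypothetical and only illustrate the set operations.

```python
def compare_ids(left, right):
    """Count IDs shared by both collections and IDs unique to each side."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Toy IDs; duplicates are ignored because the inputs are converted to sets.
db_ids = ["a.jpg", "b.jpg", "c.jpg"]
rds_ids = ["b.jpg", "c.jpg", "d.jpg", "d.jpg"]

summary = compare_ids(db_ids, rds_ids)
print(summary)
```

In the notebook, the equivalent call would be `compare_ids(figures_2020_db_df["pfocr_id"], figures_2020_rds_df["pfocr_id"])`. A useful property is that `both + left_only` equals the number of distinct IDs on the left, and `both + right_only` the number on the right.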

### Import Genes 2020

Expecting 1112551 rows × 8 columns

In [17]:
genes_2020_rds_url = (
    "https://www.dropbox.com/s/alf7auvxve36oer/pfocr_genes.rds?dl=1"
)

with tempfile.NamedTemporaryFile(suffix=".rds") as f:
    pfocr_genes_2020_rds_path = f.name
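    with requests.get(genes_2020_rds_url, stream=True) as r:
        # fail early instead of writing an HTTP error page into the .rds file
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
        # flush buffered bytes so that R can read the file by its path
        f.flush()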

    genes_2020_df = rds2pandas(pfocr_genes_2020_rds_path).rename(
        columns={
            "figid": "pfocr_id",
            "pmcid": "pmc_id",
            "entrez": "ncbigene_id",
            "word": "matched_ocr_text",
            "symbol": "lexicon_term",
            "source": "lexicon_term_source",
        }
    )

genes_2020_df["pfocr_year"] = 2020

genes_2020_df

Unnamed: 0,pfocr_id,pmc_id,matched_ocr_text,lexicon_term,lexicon_term_source,hgnc_symbol,ncbigene_id,pfocr_year
1,PMC100003__mb2410470011.jpg,PMC100003,"Ga12,Gaq",G-ALPHA-q,hgnc_alias_symbol,GNAQ,2776,2020
2,PMC100003__mb2410470011.jpg,PMC100003,Etk,ETK,hgnc_alias_symbol,BMX,660,2020
3,PMC100003__mb2410470011.jpg,PMC100003,FAK,FAK,hgnc_alias_symbol,PTK2,5747,2020
4,PMC100003__mb2410470011.jpg,PMC100003,AR*,AR,hgnc_symbol,AR,367,2020
5,PMC100003__mb2410470011.jpg,PMC100003,(Src,SRC,hgnc_symbol,SRC,6714,2020
...,...,...,...,...,...,...,...,...
1112547,PMC99976__mb2310138007.jpg,PMC99976,MEK-2,MEK2,hgnc_alias_symbol,MAP2K2,5605,2020
1112548,PMC99976__mb2310138007.jpg,PMC99976,RAS,RAS,bioentities_symbol,HRAS,3265,2020
1112549,PMC99976__mb2310138007.jpg,PMC99976,RAS,RAS,bioentities_symbol,KRAS,3845,2020
1112550,PMC99976__mb2310138007.jpg,PMC99976,RAS,RAS,bioentities_symbol,NRAS,4893,2020
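Both import cells stream a download into a named temporary file and then hand its path to R. A minimal sketch of that pattern without any network access follows; the helper name is hypothetical, and reopening the file by path while it is still open is assumed to work (it does on POSIX systems, not necessarily on Windows).

```python
import tempfile
from pathlib import Path


def write_chunks_and_read_back(chunks, suffix=".rds"):
    """Write byte chunks to a named temp file, then read them back via the file's path."""
    with tempfile.NamedTemporaryFile(suffix=suffix) as f:
        for chunk in chunks:
            f.write(chunk)
        # Without a flush, a reader that opens the path may see a truncated file.
        f.flush()
        return Path(f.name).read_bytes()


# Stand-in for the chunks yielded by a streamed HTTP response.
payload = write_chunks_and_read_back([b"ab", b"cd", b"ef"])
print(payload)
```

The file is deleted when the `with` block exits, which is why the notebook calls `rds2pandas` inside the block.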


In [18]:
len(set(genes_2020_df["pfocr_id"].to_list()))

58962

In [19]:
len(
    set(genes_2020_df["pfocr_id"].to_list())
    - set(figures_2020_rds_df["pfocr_id"].to_list())
)

0

In [20]:
len(
    set(genes_2020_df["pfocr_id"].to_list())
    - set(figures_2020_db_df["pfocr_id"].to_list())
)

0

In [21]:
print("PMC6936734__nihms-1063332-f0001.jpg" in set(genes_2020_df["pfocr_id"]))

False


## Import Data 2021

### Import Figures 2021

In [22]:
target_date = "20210513"
images_dir = Path(f"../data/images/{target_date}")

OLD: Previously, without any check for duplicate `pfocr_id` values, we got 15955 rows × 14 columns.

CORRECT: When we exclude only hits whose `pfocr_id` is in the 2020 database (that is, figures we actually ran), we expect 15914 rows × 13 columns.

In [23]:
figures_2021_df = rds2pandas(images_dir.joinpath("pfocr_figures_2021.rds"))

figures_2021_df

Unnamed: 0,pfocr_id,figure_page_url,figure_thumbnail_url,figure_number,figure_title,figure_caption,pmc_id,paper_url,paper_title,reference,pmc_search_index,pathway_score,pfocr_year
11,PMC7226520__cells-09-01043-g007.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 7,Comparative downstream pathway analysis of the...,Comparative downstream pathway analysis of the...,PMC7226520,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,"A STAT3 of Addiction: Adipose Tissue, Adipocyt...","Rose Kadye, et al. Cells. 2020 Apr;9(4):1043.",12,0.811027,2021
16,PMC7346062__aging-12-103262-g005.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 5,KEGG pathways related to resveratrol-targeted ...,KEGG pathways related to resveratrol-targeted ...,PMC7346062,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Resveratrol promotes osteogenesis and alleviat...,"Tao Yu, et al. Aging (Albany NY). 2020 Jun 15;...",17,0.943144,2021
21,PMC7063815__13578_2020_396_Fig2_HTML.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Fig. 2,Interaction between tumor metabolism and the m...,Interaction between tumor metabolism and the m...,PMC7063815,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,mTOR signaling pathway and mTOR inhibitors in ...,"Zhilin Zou, et al. Cell Biosci. 2020;10:31.",22,0.943767,2021
22,PMC6497965__zbc0191904900006.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 6,A proposed signaling cascade for phosphorylate...,A proposed signaling cascade for phosphorylate...,PMC6497965,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Phosphorylation of proliferating cell nuclear ...,"Bo Peng, et al. J Biol Chem. 2019 Apr 26;294(1...",23,0.603429,2021
23,PMC7538683__bsr-40-bsr20202711-g5.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 5,,(A) Cell cycle signaling pathway is significan...,PMC7538683,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Bioinformatics analysis and experimental valid...,"Jiajia Chen, et al. Biosci Rep. 2020 Oct 30;40...",24,0.863582,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...
124414,PMC7803631__gr3a.jpg,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,,Fig,Cell free biosynthesis for erythromycin A. The...,PMC7803631,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Complex natural product production methods and...,"Dongwon Park, et al. Synth Syst Biotechnol. 20...",124515,0.648432,2021
124421,PMC7359798__gr2_lrg.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 2,RNA Sensing and ResponseNon-comprehensive over...,RNA Sensing and ResponseNon-comprehensive over...,PMC7359798,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Immune Sensing Mechanisms that Discriminate Se...,"Eva Bartok, et al. Immunity. 2020 Jul 14;53(1)...",124522,0.922250,2021
124439,PMC7530268__fonc-10-586530-g0001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 1,The ferroptotic cascade,The ferroptotic cascade. Accumulation of free ...,PMC7530268,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,From Iron Chelation to Overload as a Therapeut...,"Eric Grignano, et al. Front Oncol. 2020;10:586...",124540,0.952995,2021
124445,PMC7466447__fcell-08-00766-g001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,FIGURE 1,Immunoregulatory functions of TEC in the TME,Immunoregulatory functions of TEC in the TME. ...,PMC7466447,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Tumor Endothelial Cells (TECs) as Potential Im...,"Laurenz Nagl, et al. Front Cell Dev Biol. 2020...",124546,0.933006,2021


In [24]:
len(
    set(figures_2021_df["pfocr_id"].to_list())
    & set(figures_2020_rds_df["pfocr_id"].to_list())
)

608

In [25]:
len(
    set(figures_2021_df["pfocr_id"].to_list())
    & set(figures_2020_db_df["pfocr_id"].to_list())
)

0

In [26]:
print("PMC6936734__nihms-1063332-f0001.jpg" in set(figures_2021_df["pfocr_id"]))

True


Previously, without any check for duplicate `pfocr_id` values, we got 649 rows × 14 columns.

When we exclude only hits whose `pfocr_id` is in the 2020 database (that is, figures we actually ran), we expect 608 rows × 13 columns.

Note we still get some duplicates between `figures_2020_rds_df["pfocr_id"]` and `figures_2021_df["pfocr_id"]`, because `figures_2020_rds_df` is based on the RDS file, not the database.
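The note above can be illustrated with toy sets. This is a sketch with made-up IDs: if the 2021 candidates are filtered against the 2020 database only, any remaining overlap with the 2020 RDS must consist of IDs that are in the RDS but not in the database.

```python
# Made-up figure IDs for illustration.
db_2020 = {"fig_a", "fig_b", "fig_c"}    # figures present in the 2020 database
rds_2020 = {"fig_b", "fig_c", "fig_d"}   # figures in the 2020 RDS; fig_d is not in the database
candidates_2021 = {"fig_c", "fig_d", "fig_e"}

# The 2021 set was filtered against the database, not against the RDS.
figures_2021 = candidates_2021 - db_2020

overlap_with_db = figures_2021 & db_2020
overlap_with_rds = figures_2021 & rds_2020

print(sorted(figures_2021))      # ['fig_d', 'fig_e']
print(sorted(overlap_with_rds))  # ['fig_d']
```

This is consistent with the counts reported earlier: the 608 overlapping figures can be at most the 726 IDs found in the 2020 RDS but not in the 2020 database.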

In [27]:
figures_2021_df[
    figures_2021_df["pfocr_id"].isin(
        set(figures_2020_rds_df["pfocr_id"].to_list())
    )
]

Unnamed: 0,pfocr_id,figure_page_url,figure_thumbnail_url,figure_number,figure_title,figure_caption,pmc_id,paper_url,paper_title,reference,pmc_search_index,pathway_score,pfocr_year
135,PMC6936734__nihms-1063332-f0001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 1,Schematic representation of sphingosine-1-phos...,Schematic representation of sphingosine-1-phos...,PMC6936734,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Sphingolipid metabolism and drug resistance in...,"Kelly M. Kreitzburg, et al. Cancer Drug Resist...",136,0.956722,2021
224,PMC6932950__fcell-07-00335-g001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,FIGURE 1,An overview of current and emerging agents for...,An overview of current and emerging agents for...,PMC6932950,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Therapeutic and Mechanistic Perspectives of Pr...,"Mark P. Waterhouse, et al. Front Cell Dev Biol...",225,0.915671,2021
415,PMC6944318__nihms-1063769-f0002.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 2:,,"To support proliferation, cells take up more g...",PMC6944318,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Bioenergetics and translational metabolism: Im...,"Bradford G. Hill, et al. Biol Chem. ;401(1):3-29.",416,0.898072,2021
540,PMC6902900__fimmu-10-02839-g0002.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 2,Overview of cellular metabolism in T cells,Overview of cellular metabolism in T cells. Th...,PMC6902900,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Metabolic Pathways Involved in Regulatory T Ce...,"Rosalie W. M. Kempkes, et al. Front Immunol. 2...",541,0.967499,2021
592,PMC6909948__jcav10p6848g004.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 4,cAMP can regulate multidrug-resistance in lung...,cAMP can regulate multidrug-resistance in lung...,PMC6909948,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,A perspective profile of ADCY1 in cAMP signali...,"Ting Zou, et al. J Cancer. 2019;10(27):6848-6857.",593,0.960402,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...
123360,PMC6932173__fimmu-10-02900-g0006.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 6,Proposed inositol-requiring protein 1 alpha (I...,Proposed inositol-requiring protein 1 alpha (I...,PMC6932173,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Molecular Insight Into the IRE1α-Mediated Type...,"Maja Studencka-Turski, et al. Front Immunol. 2...",123461,0.914990,2021
123555,PMC6934036__fimmu-10-02854-g0001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 1,A schematic illustration representing differen...,A schematic illustration representing differen...,PMC6934036,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,WNT Signaling in Tumors: The Way to Evade Drug...,"Elena Martin-Orozco, et al. Front Immunol. 201...",123656,0.894541,2021
124075,PMC6892982__fcell-07-00308-g001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,FIGURE 1,Autophagy regulators and functions associated ...,Autophagy regulators and functions associated ...,PMC6892982,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Agephagy – Adapting Autophagy for Health Durin...,"Eleanor R. Stead, et al. Front Cell Dev Biol. ...",124176,0.881064,2021
124142,PMC6952997__cells-08-01612-g008.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 8,Influence of ethylene on amount of compounds a...,Influence of ethylene on amount of compounds a...,PMC6952997,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Effect of Ethylene on Cell Wall and Lipid Meta...,"Yongchao Zhu, et al. Cells. 2019 Dec;8(12):1612.",124243,0.922297,2021


In [28]:
"PMC6936734__nihms-1063332-f0001.jpg" in set(figures_2020_rds_df["pfocr_id"])

True

In [29]:
columns_symmetric_difference = set(figures_2021_df.columns) ^ set(
    figures_2020_rds_df.columns
)
if len(columns_symmetric_difference) != 0:
    print(columns_symmetric_difference)
    print(len(figures_2020_rds_df.columns))
    print(figures_2020_rds_df.columns)

    print(len(figures_2021_df.columns))
    print(figures_2021_df.columns)

{'publication_year'}
14
Index(['pfocr_id', 'figure_number', 'reference', 'publication_year',
       'pathway_score', 'pmc_search_index', 'figure_title', 'paper_title',
       'figure_caption', 'pmc_id', 'paper_url', 'figure_page_url',
       'figure_thumbnail_url', 'pfocr_year'],
      dtype='object')
13
Index(['pfocr_id', 'figure_page_url', 'figure_thumbnail_url', 'figure_number',
       'figure_title', 'figure_caption', 'pmc_id', 'paper_url', 'paper_title',
       'reference', 'pmc_search_index', 'pathway_score', 'pfocr_year'],
      dtype='object')
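A symmetric difference reports which columns differ but not which frame lacks them. A minimal sketch of the directional version, using the two column lists printed above as plain Python lists:

```python
columns_2020 = [
    "pfocr_id", "figure_number", "reference", "publication_year",
    "pathway_score", "pmc_search_index", "figure_title", "paper_title",
    "figure_caption", "pmc_id", "paper_url", "figure_page_url",
    "figure_thumbnail_url", "pfocr_year",
]
columns_2021 = [
    "pfocr_id", "figure_page_url", "figure_thumbnail_url", "figure_number",
    "figure_title", "figure_caption", "pmc_id", "paper_url", "paper_title",
    "reference", "pmc_search_index", "pathway_score", "pfocr_year",
]

# Directional differences show which side is missing each column.
only_in_2020 = set(columns_2020) - set(columns_2021)
only_in_2021 = set(columns_2021) - set(columns_2020)

print("only in 2020:", only_in_2020)
print("only in 2021:", only_in_2021)
```

Here the 2021 frame is the one without `publication_year`, so rows coming from it will have a missing value in that column after the two frames are combined.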


Expecting 15306 rows × 13 columns: the 15914 rows of `figures_2021_df` minus the 608 that also appear in the 2020 RDS.

In [30]:
figures_2021_df[
    ~figures_2021_df["pfocr_id"].isin(
        set(figures_2020_rds_df["pfocr_id"].to_list())
    )
]

Unnamed: 0,pfocr_id,figure_page_url,figure_thumbnail_url,figure_number,figure_title,figure_caption,pmc_id,paper_url,paper_title,reference,pmc_search_index,pathway_score,pfocr_year
11,PMC7226520__cells-09-01043-g007.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 7,Comparative downstream pathway analysis of the...,Comparative downstream pathway analysis of the...,PMC7226520,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,"A STAT3 of Addiction: Adipose Tissue, Adipocyt...","Rose Kadye, et al. Cells. 2020 Apr;9(4):1043.",12,0.811027,2021
16,PMC7346062__aging-12-103262-g005.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 5,KEGG pathways related to resveratrol-targeted ...,KEGG pathways related to resveratrol-targeted ...,PMC7346062,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Resveratrol promotes osteogenesis and alleviat...,"Tao Yu, et al. Aging (Albany NY). 2020 Jun 15;...",17,0.943144,2021
21,PMC7063815__13578_2020_396_Fig2_HTML.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Fig. 2,Interaction between tumor metabolism and the m...,Interaction between tumor metabolism and the m...,PMC7063815,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,mTOR signaling pathway and mTOR inhibitors in ...,"Zhilin Zou, et al. Cell Biosci. 2020;10:31.",22,0.943767,2021
22,PMC6497965__zbc0191904900006.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Figure 6,A proposed signaling cascade for phosphorylate...,A proposed signaling cascade for phosphorylate...,PMC6497965,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,Phosphorylation of proliferating cell nuclear ...,"Bo Peng, et al. J Biol Chem. 2019 Apr 26;294(1...",23,0.603429,2021
23,PMC7538683__bsr-40-bsr20202711-g5.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 5,,(A) Cell cycle signaling pathway is significan...,PMC7538683,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Bioinformatics analysis and experimental valid...,"Jiajia Chen, et al. Biosci Rep. 2020 Oct 30;40...",24,0.863582,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...
124414,PMC7803631__gr3a.jpg,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,,Fig,Cell free biosynthesis for erythromycin A. The...,PMC7803631,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Complex natural product production methods and...,"Dongwon Park, et al. Synth Syst Biotechnol. 20...",124515,0.648432,2021
124421,PMC7359798__gr2_lrg.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 2,RNA Sensing and ResponseNon-comprehensive over...,RNA Sensing and ResponseNon-comprehensive over...,PMC7359798,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Immune Sensing Mechanisms that Discriminate Se...,"Eva Bartok, et al. Immunity. 2020 Jul 14;53(1)...",124522,0.922250,2021
124439,PMC7530268__fonc-10-586530-g0001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Figure 1,The ferroptotic cascade,The ferroptotic cascade. Accumulation of free ...,PMC7530268,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,From Iron Chelation to Overload as a Therapeut...,"Eric Grignano, et al. Front Oncol. 2020;10:586...",124540,0.952995,2021
124445,PMC7466447__fcell-08-00766-g001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,FIGURE 1,Immunoregulatory functions of TEC in the TME,Immunoregulatory functions of TEC in the TME. ...,PMC7466447,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Tumor Endothelial Cells (TECs) as Potential Im...,"Laurenz Nagl, et al. Front Cell Dev Biol. 2020...",124546,0.933006,2021
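The filter above and the earlier overlap filter are two halves of one partition. A small sketch with a toy frame (the IDs are made up) shows that a single boolean mask from `isin` gives both halves and that their sizes add up to the original row count:

```python
import pandas as pd

# Toy data; the IDs are made up for illustration.
figures_new = pd.DataFrame(
    {
        "pfocr_id": ["PMC1__a.jpg", "PMC2__b.jpg", "PMC3__c.jpg"],
        "pfocr_year": 2021,
    }
)
ids_old = {"PMC2__b.jpg", "PMC9__z.jpg"}

# One mask, two complementary selections.
in_old = figures_new["pfocr_id"].isin(ids_old)
overlapping = figures_new[in_old]
remaining = figures_new[~in_old]

print(len(overlapping), len(remaining))
```

Checking `len(overlapping) + len(remaining) == len(figures_new)` is a cheap way to confirm the expected row counts in this section.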


Expecting 56095 unique `pmc_id` values in the 2020 RDS

In [31]:
print(len(set(figures_2020_rds_df["pmc_id"].to_list())))

56095


Previously, without any check for duplicate `pfocr_id` values, we got 13441 unique `pmc_id` values.

When we exclude only hits whose `pfocr_id` is in the 2020 database (that is, figures we actually ran), we expect 13414.

In [32]:
print(len(set(figures_2021_df["pmc_id"].to_list())))

13414


### Import Genes 2021

OLD: Previously, without any check for duplicate `pfocr_id` values, we got 214155 rows × 5 columns.

OLD: When excluding hits whose `pfocr_id` is in the 2020 figures RDS, we expect 205010 rows × 5 columns.

CORRECT: When we exclude only hits whose `pfocr_id` is in the 2020 database (that is, figures we actually ran), we expect 213858 rows × 5 columns.

In [33]:
genes_2021_df = rds2pandas(images_dir.joinpath("pfocr_genes_2021.rds"))
genes_2021_df

Unnamed: 0,pfocr_id,matched_ocr_text,lexicon_term,ncbigene_id,pfocr_year
0,PMC8036963__cancers-13-01583-g003.jpg,H2AX,H2AX,3014,2021
1,PMC8036963__cancers-13-01583-g003.jpg,H2AX,H2AX,3014,2021
2,PMC8036963__cancers-13-01583-g003.jpg,Ku70/80,KU70,2547,2021
3,PMC8036963__cancers-13-01583-g003.jpg,Ku70/80,KU80,7520,2021
4,PMC8036963__cancers-13-01583-g003.jpg,ATM,ATM,472,2021
...,...,...,...,...,...
213853,PMC7927090__ijms-22-02194-g002.jpg,AKT,AKT,207,2021
213854,PMC7927090__ijms-22-02194-g002.jpg,Smad,SMAD,4092,2021
213855,PMC7927090__ijms-22-02194-g002.jpg,MTOR,MTOR,2475,2021
213856,PMC7927090__ijms-22-02194-g002.jpg,VEGFR2,VEGFR2,3791,2021


Expecting 205010 rows × 5 columns

In [34]:
genes_2021_df[
    ~genes_2021_df["pfocr_id"].isin(
        set(figures_2020_rds_df["pfocr_id"].to_list())
    )
]

Unnamed: 0,pfocr_id,matched_ocr_text,lexicon_term,ncbigene_id,pfocr_year
0,PMC8036963__cancers-13-01583-g003.jpg,H2AX,H2AX,3014,2021
1,PMC8036963__cancers-13-01583-g003.jpg,H2AX,H2AX,3014,2021
2,PMC8036963__cancers-13-01583-g003.jpg,Ku70/80,KU70,2547,2021
3,PMC8036963__cancers-13-01583-g003.jpg,Ku70/80,KU80,7520,2021
4,PMC8036963__cancers-13-01583-g003.jpg,ATM,ATM,472,2021
...,...,...,...,...,...
213853,PMC7927090__ijms-22-02194-g002.jpg,AKT,AKT,207,2021
213854,PMC7927090__ijms-22-02194-g002.jpg,Smad,SMAD,4092,2021
213855,PMC7927090__ijms-22-02194-g002.jpg,MTOR,MTOR,2475,2021
213856,PMC7927090__ijms-22-02194-g002.jpg,VEGFR2,VEGFR2,3791,2021


OLD: Previously, without any check for duplicate `pfocr_id` values, we got 9145 rows × 5 columns.

CORRECT: When we exclude only hits whose `pfocr_id` is in the 2020 database (that is, figures we actually ran), we expect 8848 rows × 5 columns.

In [35]:
genes_2021_df[
    genes_2021_df["pfocr_id"].isin(
        set(figures_2020_rds_df["pfocr_id"].to_list())
    )
]

Unnamed: 0,pfocr_id,matched_ocr_text,lexicon_term,ncbigene_id,pfocr_year
56,PMC6941108__ijms-20-06207-g003.jpg,KIF20A,KIF20A,10112,2021
57,PMC6941108__ijms-20-06207-g003.jpg,AIM2,AIM2,9447,2021
58,PMC6941108__ijms-20-06207-g003.jpg,HMMR,HMMR,3161,2021
59,PMC6941108__ijms-20-06207-g003.jpg,CCNE2,CCNE2,9134,2021
60,PMC6941108__ijms-20-06207-g003.jpg,NDC80,NDC80,10403,2021
...,...,...,...,...,...
212533,PMC6954688__nihms-1059517-f0008.jpg,PPARA,PPARA,5465,2021
212534,PMC6954688__nihms-1059517-f0008.jpg,AMP,AMP,4236,2021
212535,PMC6954688__nihms-1059517-f0008.jpg,ADP,ADP,23038,2021
212536,PMC6954688__nihms-1059517-f0008.jpg,ADP,ADP,23038,2021


Expecting no `pfocr_id` overlaps between `genes_2020_df` and `genes_2021_df`.

In [36]:
len(
    genes_2021_df[
        genes_2021_df["pfocr_id"].isin(set(genes_2020_df["pfocr_id"]))
    ]
)

0

Expecting no `pfocr_id` overlaps between `genes_2020_df` and `figures_2021_df`.

In [37]:
len(
    figures_2021_df[
        figures_2021_df["pfocr_id"].isin(set(genes_2020_df["pfocr_id"]))
    ]
)

0

Expecting no `pfocr_id` overlaps between `genes_2021_df` and `figures_2020_db_df`.

In [38]:
len(
    genes_2021_df[
        genes_2021_df["pfocr_id"].isin(set(figures_2020_db_df["pfocr_id"]))
    ]
)

0

Expecting 8,848 `pfocr_id` overlaps between `genes_2021_df` and `figures_2020_rds_df`.

In [39]:
genes_2021_df[
    genes_2021_df["pfocr_id"].isin(set(figures_2020_rds_df["pfocr_id"]))
]

Unnamed: 0,pfocr_id,matched_ocr_text,lexicon_term,ncbigene_id,pfocr_year
56,PMC6941108__ijms-20-06207-g003.jpg,KIF20A,KIF20A,10112,2021
57,PMC6941108__ijms-20-06207-g003.jpg,AIM2,AIM2,9447,2021
58,PMC6941108__ijms-20-06207-g003.jpg,HMMR,HMMR,3161,2021
59,PMC6941108__ijms-20-06207-g003.jpg,CCNE2,CCNE2,9134,2021
60,PMC6941108__ijms-20-06207-g003.jpg,NDC80,NDC80,10403,2021
...,...,...,...,...,...
212533,PMC6954688__nihms-1059517-f0008.jpg,PPARA,PPARA,5465,2021
212534,PMC6954688__nihms-1059517-f0008.jpg,AMP,AMP,4236,2021
212535,PMC6954688__nihms-1059517-f0008.jpg,ADP,ADP,23038,2021
212536,PMC6954688__nihms-1059517-f0008.jpg,ADP,ADP,23038,2021


Those 8,848 gene mentions come from 571 figures (distinct `pfocr_id` values).

In [40]:
len(
    genes_2021_df[
        genes_2021_df["pfocr_id"].isin(set(figures_2020_rds_df["pfocr_id"]))
    ]["pfocr_id"].drop_duplicates()
)

571
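The cell above counts distinct `pfocr_id` values, which identify figures. Counting papers instead would require the PMC ID. A minimal sketch, assuming the `PMCID__filename` pattern that the tables in this notebook suggest; the helper name and the last ID in the list are hypothetical.

```python
def pmc_id_from_pfocr_id(pfocr_id):
    """Return the part before the first double underscore (assumed to be the PMC ID)."""
    return pfocr_id.split("__", 1)[0]


pfocr_ids = [
    "PMC6941108__ijms-20-06207-g003.jpg",
    "PMC6941108__ijms-20-06207-g003.jpg",   # same figure, second gene mention
    "PMC6954688__nihms-1059517-f0008.jpg",
    "PMC6954688__made-up-second-figure.jpg",  # hypothetical ID for illustration
]

n_figures = len(set(pfocr_ids))
n_papers = len({pmc_id_from_pfocr_id(p) for p in pfocr_ids})

print(n_figures, n_papers)
```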

## Merge 2020 and 2021

In [41]:
data_dir = Path(
    "~/Dropbox (Gladstone)/Documents/pathway-ocr/20210515/"
).expanduser()

### figures

#### Load Generic PMC Metadata

Expecting 7129730 rows × 12 columns

In [42]:
pmc_ids_url = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz"
pmcs_df = pd.read_csv(
    pmc_ids_url,
    dtype={
        "Year": "object",
        # TODO: some of these are NaN, so the following doesn't work
        # "Year": np.int32,
        # "Volume": np.int32,
        # "Issue": np.int32,
        # "PMID": np.int32,
    },
)
pmcs_df



Unnamed: 0,Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date
0,Breast Cancer Res,1465-5411,1465-542X,2000,3,1.0,55,10.1186/bcr271,PMC13900,11250746.0,,live
1,Breast Cancer Res,1465-5411,1465-542X,2000,3,1.0,61,10.1186/bcr272,PMC13901,11250747.0,,live
2,Breast Cancer Res,1465-5411,1465-542X,2000,3,1.0,66,10.1186/bcr273,PMC13902,11250748.0,,live
3,Breast Cancer Res,1465-5411,1465-542X,1999,2,1.0,59,10.1186/bcr29,PMC13911,11056684.0,,live
4,Breast Cancer Res,1465-5411,1465-542X,1999,2,1.0,64,10.1186/bcr30,PMC13912,11400682.0,,live
...,...,...,...,...,...,...,...,...,...,...,...,...
7139974,J Am Geriatr Soc,0002-8614,1532-5415,2019,67,11,2376,10.1111/jgs.16201,PMC8173537,31675106.0,NIHMS1698321,live
7139975,Org Lett,1523-7060,1523-7052,2021,23,10,4008,10.1021/acs.orglett.1c01218,PMC8173538,33979173.0,NIHMS1704325,live
7139976,Science,0036-8075,1095-9203,2021,372,6539,292,10.1126/science.aba7582,PMC8173539,33859035.0,NIHMS1699275,live
7139977,Clin Immunol,1521-6616,1521-7035,2020,221,,108602,10.1016/j.clim.2020.108602,PMC8173542,33007439.0,NIHMS1701840,2021-12-01
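
Regarding the TODO in the cell above: NumPy integer dtypes cannot hold NaN, but pandas' nullable `"Int64"` extension dtype can. A minimal sketch on an in-memory toy CSV (the second row, including `Example J` and `PMC0000000`, is made up for illustration). Whether `Volume` and `Issue` in the real file are purely numeric is not confirmed here; any non-numeric value would still raise an error.

```python
import io

import pandas as pd

# Toy CSV mimicking a few PMC-ids.csv columns; the second row has no PMID.
toy_csv = io.StringIO(
    "Journal Title,Year,PMCID,PMID\n"
    "Breast Cancer Res,2000,PMC13900,11250746\n"
    "Example J,2021,PMC0000000,\n"
)

toy_pmcs_df = pd.read_csv(
    toy_csv,
    # "Int64" (capital I) is the nullable integer dtype; missing -> <NA>.
    dtype={"Year": "object", "PMID": "Int64"},
)
print(toy_pmcs_df.dtypes)
print(toy_pmcs_df["PMID"].isna().sum())
```

With this dtype the PMID column stays integer-valued instead of being upcast to float.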


#### Perform merge

- OLD: de-dupe figures by pfocr_id in the 2020 database: 80557 rows × 14 columns.
- CORRECT: de-dupe figures by pfocr_id in the 2020 figures RDS: 79949 rows × 14 columns.

In [43]:
pfocr_figures_df = pd.merge(
    (
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        pd.concat(
            [
                # excluding 608 figures we downloaded but didn't OCR in 2020
                figures_2020_rds_df[
                    ~figures_2020_rds_df["pfocr_id"].isin(
                        figures_2021_df["pfocr_id"]
                    )
                ],
                figures_2021_df,
            ]
        )
        .reset_index(drop=True)
        .drop(columns=["pmc_search_index"])
    ),
    (
        pmcs_df[["PMCID", "Year", "Release Date"]]
        .rename(
            columns={
                "PMCID": "pmc_id",
                "Year": "publication_year_updated",
                # "Release Date": "release_date",
            }
        )
        .set_index("pmc_id")
    ),
    how="left",
    on=["pmc_id"],
)

pfocr_figures_df["publication_year_combined"] = pfocr_figures_df[
    "publication_year"
].combine_first(pfocr_figures_df["publication_year_updated"])

pfocr_figures_df = pfocr_figures_df.drop(
    columns=["publication_year", "publication_year_updated"]
).rename(columns={"publication_year_combined": "publication_year"})

pfocr_figures_df

Unnamed: 0,pfocr_id,figure_number,reference,pathway_score,figure_title,paper_title,figure_caption,pmc_id,paper_url,figure_page_url,figure_thumbnail_url,pfocr_year,Release Date,publication_year
0,PMC5653847__41598_2017_14124_Fig8_HTML.jpg,Figure 8,"Céline Barthelemy, et al. Sci Rep. 2017;7:13816.",0.968270,Model of FTY720-induced transporter endocytosi...,FTY720-induced endocytosis of yeast and human ...,Model of FTY720-induced transporter endocytosi...,PMC5653847,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,2020,live,2017
1,PMC4187043__zh20191474070013.jpg,Fig. 13,"Yuan Wei, et al. Am J Physiol Renal Physiol. 2...",0.965793,Proposed signaling pathway by which the stimul...,Angiotensin II type 2 receptor regulates ROMK-...,Proposed signaling pathway by which the stimul...,PMC4187043,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,2020,live,2014
2,PMC5746550__rsob-7-170228-g1.jpg,Figure 1,"Georgia R. Frost, et al. Open Biol. 2017 Dec;7...",0.962470,AŒ≤ production,The role of astrocytes in amyloid production a...,AŒ≤ production. In the amyloidogenic pathway (...,PMC5746550,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,2020,live,2017
3,PMC4211692__pone.0110875.g008.jpg,Figure 8,"Enida Gjoni, et al. PLoS One. 2014;9(10):e110875.",0.966721,,Glucolipotoxicity Impairs Ceramide Flow from t...,Glucolipotoxicity impairs CERT- and vesicular-...,PMC4211692,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,2020,live,2014
4,PMC2588433__nihms78212f8.jpg,Figure 8,"Amanda L. Lewis, et al. J Biol Chem. ;282(38):...",0.966758,,NeuA sialic acid O-acetylesterase activity mod...,Bacterial Sia biosynthesis can be divided into...,PMC2588433,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,2020,live,2007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79944,PMC7803631__gr3a.jpg,,"Dongwon Park, et al. Synth Syst Biotechnol. 20...",0.648432,Fig,Complex natural product production methods and...,Cell free biosynthesis for erythromycin A. The...,PMC7803631,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,2021,live,2021
79945,PMC7359798__gr2_lrg.jpg,Figure 2,"Eva Bartok, et al. Immunity. 2020 Jul 14;53(1)...",0.922250,RNA Sensing and ResponseNon-comprehensive over...,Immune Sensing Mechanisms that Discriminate Se...,RNA Sensing and ResponseNon-comprehensive over...,PMC7359798,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,2021,live,2020
79946,PMC7530268__fonc-10-586530-g0001.jpg,Figure 1,"Eric Grignano, et al. Front Oncol. 2020;10:586...",0.952995,The ferroptotic cascade,From Iron Chelation to Overload as a Therapeut...,The ferroptotic cascade. Accumulation of free ...,PMC7530268,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,2021,live,2020
79947,PMC7466447__fcell-08-00766-g001.jpg,FIGURE 1,"Laurenz Nagl, et al. Front Cell Dev Biol. 2020...",0.933006,Immunoregulatory functions of TEC in the TME,Tumor Endothelial Cells (TECs) as Potential Im...,Immunoregulatory functions of TEC in the TME. ...,PMC7466447,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,2021,live,2020
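
The merge cell above combines three steps: an anti-join that drops 2020 rows superseded by 2021, a left merge against the PMC metadata, and `combine_first` to fill missing publication years. A minimal sketch of the same pattern on toy data (all frames and values below are made up for illustration):

```python
import pandas as pd

toy_2020 = pd.DataFrame(
    {
        "pfocr_id": ["PMC1__a.jpg", "PMC2__b.jpg"],
        "pmc_id": ["PMC1", "PMC2"],
        "publication_year": ["2017", None],
        "pfocr_year": [2020, 2020],
    }
)
toy_2021 = pd.DataFrame(
    {
        "pfocr_id": ["PMC2__b.jpg", "PMC3__c.jpg"],
        "pmc_id": ["PMC2", "PMC3"],
        "publication_year": [None, "2021"],
        "pfocr_year": [2021, 2021],
    }
)
toy_pmcs = pd.DataFrame(
    {
        "pmc_id": ["PMC1", "PMC2", "PMC3"],
        "publication_year_updated": ["2017", "2020", "2021"],
    }
)

# Anti-join: keep 2020 rows whose pfocr_id is absent from 2021, then stack.
only_2020 = toy_2020[~toy_2020["pfocr_id"].isin(toy_2021["pfocr_id"])]
combined = pd.concat([only_2020, toy_2021], ignore_index=True)

# Left merge keeps every figure even if the metadata has no match.
merged = combined.merge(toy_pmcs, how="left", on="pmc_id")

# Prefer the existing year; fall back to the metadata year only when missing.
merged["publication_year"] = merged["publication_year"].combine_first(
    merged["publication_year_updated"]
)
merged = merged.drop(columns=["publication_year_updated"])
print(merged)
```

Note that `combine_first` keeps the original `publication_year` wherever it is present, so the PMC metadata only fills gaps and never overrides an existing value.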


Save file:

In [45]:
len(set(figures_2021_df["pfocr_id"]))

15914

In [46]:
len(set(figures_2020_rds_df["pfocr_id"]) & set(figures_2021_df["pfocr_id"]))

608

In [47]:
len(
    (set(figures_2020_rds_df["pfocr_id"]) & set(figures_2021_df["pfocr_id"]))
    - set(genes_2020_df["pfocr_id"])
)

608

In [48]:
len(
    (set(figures_2020_rds_df["pfocr_id"]) & set(figures_2021_df["pfocr_id"]))
    - set(figures_2020_db_df["pfocr_id"])
)

608

In [49]:
len(set(figures_2020_rds_df["pfocr_id"]) | set(figures_2021_df["pfocr_id"]))

79949

In [50]:
print(len(set(figures_2020_rds_df["pfocr_id"])))
len(figures_2020_rds_df["pfocr_id"])

64643


64643

In [51]:
print(len(set(figures_2021_df["pfocr_id"])))
len(figures_2021_df["pfocr_id"])

15914


15914

In [52]:
len(set(genes_2020_df["pfocr_id"]) & set(genes_2021_df["pfocr_id"]))

0

In [53]:
print(len(set(pfocr_figures_df["pfocr_id"])))
print(len(pfocr_figures_df["pfocr_id"]))

79949
79949


In [54]:
print(len(set(figures_2020_rds_df["pfocr_id"])))
print(len(figures_2020_rds_df["pfocr_id"]))

64643
64643


In [55]:
print(len(set(figures_2021_df["pfocr_id"])))
print(len(figures_2021_df["pfocr_id"]))

15914
15914


In [56]:
print(
    len(
        set(
            figures_2020_rds_df["pfocr_id"].to_list()
            + figures_2021_df["pfocr_id"].to_list()
        )
    )
)
print(
    len(
        figures_2020_rds_df["pfocr_id"].to_list()
        + figures_2021_df["pfocr_id"].to_list()
    )
)

79949
80557


In [57]:
len(
    set(figures_2021_df["pmc_id"].to_list())
    | set(figures_2020_rds_df["pmc_id"].to_list())
)

68976

In [58]:
len(set(figures_2020_rds_df["pmc_id"].to_list()))

56095

In [59]:
len(set(figures_2021_df["pmc_id"].to_list()))

13414
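
The counts printed above are tied together by inclusion-exclusion: the size of a union equals the sum of the two set sizes minus the size of their intersection. A small sketch that checks this with the numbers copied from the outputs above; the shared `pmc_id` count is derived here and is not printed anywhere in the notebook:

```python
# Counts copied from the cell outputs above.
n_2020 = 64643   # unique pfocr_ids in figures_2020_rds_df
n_2021 = 15914   # unique pfocr_ids in figures_2021_df
n_shared = 608   # pfocr_ids present in both
n_union = 79949  # pfocr_ids present in either

# Plain concatenation keeps the 608 shared figures twice.
assert n_2020 + n_2021 == 80557
# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
assert n_2020 + n_2021 - n_shared == n_union

# The same identity rearranged gives the number of shared pmc_ids.
pmc_2020, pmc_2021, pmc_union = 56095, 13414, 68976
pmc_shared = pmc_2020 + pmc_2021 - pmc_union
print(pmc_shared)
```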

#### Get OA PMC data

Get the OA PMC data in XML format (only run this during the off-hours).

[Docs](https://www.ncbi.nlm.nih.gov/pmc/tools/oai/)

Maybe this should go into a different file?
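
As a starting point, here is a sketch that only builds an OAI-PMH `GetRecord` URL and makes no network request. The base URL and the `oai:pubmedcentral.nih.gov:` identifier prefix are assumptions that should be confirmed against the docs linked above; `build_get_record_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: base URL of the PMC OAI-PMH service; confirm against the docs.
OAI_BASE_URL = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"


def build_get_record_url(pmc_id, metadata_prefix="pmc", base_url=OAI_BASE_URL):
    """Build a GetRecord URL for one article, e.g. pmc_id='PMC8012676'."""
    numeric_id = pmc_id[3:] if pmc_id.startswith("PMC") else pmc_id
    params = {
        "verb": "GetRecord",
        # Assumption: identifier format used by the PMC OAI service.
        "identifier": f"oai:pubmedcentral.nih.gov:{numeric_id}",
        "metadataPrefix": metadata_prefix,
    }
    return f"{base_url}?{urlencode(params)}"


url = build_get_record_url("PMC8012676")
print(url)
```

The resulting URL could then be fetched with `requests.get`, which would go through the `requests_cache` cache installed at the top of the notebook.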

### genes

- OLD: de-duping figures only by hash in 2020 database: 1,326,706 rows (wrong because some figure hashes changed)
- OLD: de-duping figures by pfocr_id in 2020 figures RDS: 1,317,561 rows × 8 columns (wrong because OCR request failed for 608 figures in 2020)
- CORRECT: de-duping figures by hash or pfocr_id in 2020 database: 1,326,409 rows × 8 columns

In [60]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
pfocr_genes_df = pd.concat([genes_2020_df, genes_2021_df]).reset_index(drop=True)
pfocr_genes_df

Unnamed: 0,pfocr_id,pmc_id,matched_ocr_text,lexicon_term,lexicon_term_source,hgnc_symbol,ncbigene_id,pfocr_year
0,PMC100003__mb2410470011.jpg,PMC100003,"Ga12,Gaq",G-ALPHA-q,hgnc_alias_symbol,GNAQ,2776,2020
1,PMC100003__mb2410470011.jpg,PMC100003,Etk,ETK,hgnc_alias_symbol,BMX,660,2020
2,PMC100003__mb2410470011.jpg,PMC100003,FAK,FAK,hgnc_alias_symbol,PTK2,5747,2020
3,PMC100003__mb2410470011.jpg,PMC100003,AR*,AR,hgnc_symbol,AR,367,2020
4,PMC100003__mb2410470011.jpg,PMC100003,(Src,SRC,hgnc_symbol,SRC,6714,2020
...,...,...,...,...,...,...,...,...
1326404,PMC7927090__ijms-22-02194-g002.jpg,,AKT,AKT,,,207,2021
1326405,PMC7927090__ijms-22-02194-g002.jpg,,Smad,SMAD,,,4092,2021
1326406,PMC7927090__ijms-22-02194-g002.jpg,,MTOR,MTOR,,,2475,2021
1326407,PMC7927090__ijms-22-02194-g002.jpg,,VEGFR2,VEGFR2,,,3791,2021


Save file:

In [61]:
1317561 - 1112551

205010

In [62]:
(1317561 - 1112551) / 1112551

0.18427020424232238

In [63]:
214000 / 1112551

0.19235073268551284

In [64]:
280000 / 1112551

0.25167385585020374

In [65]:
0.24 * 1112551

267012.24

In [66]:
15 / 65

0.23076923076923078

In [67]:
16 / 65

0.24615384615384617

In [68]:
(79949 - 64643) / 64643

0.236777377287564
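
The scratch cells above compute relative growth from 2020 to the merged set. A small sketch that names the calculation; the figure counts come from the outputs earlier in this notebook, while reading 1112551 as the 2020 gene-row count is an assumption (it is not labeled in the cells above):

```python
def relative_growth(before, after):
    """Fractional increase going from `before` to `after`."""
    return (after - before) / before


# Figures: 64643 in the 2020 RDS, 79949 after merging in 2021.
figure_growth = relative_growth(64643, 79949)

# Assumption: 1112551 is the number of 2020 gene rows; 1326409 is the
# merged row count marked CORRECT above.
gene_row_growth = relative_growth(1112551, 1326409)

print(round(figure_growth, 4), round(gene_row_growth, 4))
```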

Expected number of unique NCBI Gene IDs (gene count) in `pfocr_genes_df`:

- OLD: de-dupe figures by pfocr_id in 2020 figures RDS: 14233.
- CORRECT: de-dupe figures by pfocr_id in 2020 database: 14251.

In [70]:
len(pfocr_genes_df["ncbigene_id"].drop_duplicates())

14251

Expected number of unique hgnc_symbols (alternate gene count) in `pfocr_genes_df`:

- CORRECT: de-dupe figures by pfocr_id in 2020 database: 13465.

In [71]:
len(pfocr_genes_df["hgnc_symbol"].drop_duplicates())

13465
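
One caveat on the counting method: `drop_duplicates()` keeps a missing value as one distinct entry, whereas `nunique()` skips missing values by default. Since the 2021 rows in the table above show an empty `hgnc_symbol`, the count of 13465 may include the missing value as one "symbol". A toy illustration (the series below is made up):

```python
import numpy as np
import pandas as pd

toy_symbols = pd.Series(
    ["GNAQ", "BMX", "GNAQ", np.nan, np.nan], name="hgnc_symbol"
)

# NaN survives drop_duplicates() as a single distinct value.
with_nan = len(toy_symbols.drop_duplicates())
# nunique() excludes NaN unless dropna=False is passed.
without_nan = toy_symbols.nunique()

print(with_nan, without_nan)
```

If missing values should not count, `pfocr_genes_df["hgnc_symbol"].nunique()` would be the closer fit.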

Expected number of unique pfocr_ids in `pfocr_genes_df` (count of figures with at least one gene):

- OLD: de-dupe figures by pfocr_id in 2020 figures RDS: 73305.
- CORRECT: de-dupe figures by pfocr_id in 2020 database: 73876.

In [72]:
len(pfocr_genes_df["pfocr_id"].drop_duplicates())

73876

Expected number of unique pfocr_ids (figure count) in `pfocr_figures_df`: 79949

In [73]:
len(pfocr_figures_df["pfocr_id"].drop_duplicates())

79949

## Join figure & gene dfs

We've already merged 2020 and 2021 data for figures and for genes. Now we're joining the figures DF and the genes DF.

Disabling this for now, because I'm not sure it's useful or up-to-date.