# Analysis Template Walkthrough

# Setup

## Select extract
To point the template cells at the correct extract, set `repository` to the repository name and `object_type` to the repository object type.

In [1]:
repository = 'figshare'
object_type = 'articles'
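The two variables above are later combined into the extract's file name (the notebook reads `f'{repository}_{object_type}.json'`). As a minimal sketch of that convention, the helper below builds the path explicitly so it can be checked before loading. The function name `extract_path` and the `base_dir` parameter are hypothetical and not part of the template.

```python
from pathlib import Path


def extract_path(repository: str, object_type: str, base_dir: str = ".") -> Path:
    """Build the path of the JSON extract for a repository/object-type pair.

    Hypothetical helper: it only mirrors the f-string used when the
    notebook reads the extract.
    """
    return Path(base_dir) / f"{repository}_{object_type}.json"


path = extract_path("figshare", "articles")
print(path.name)      # figshare_articles.json
print(path.exists())  # True only if the extract sits in base_dir
```

Checking `path.exists()` before calling `pd.read_json` can give a clearer error when the variables are set for an extract that has not been downloaded.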

In [2]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

In [3]:
# Show up to 100 rows and 100 columns when displaying output
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

## Helper Functions

In [4]:
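import os
import sys

# Add the directory two levels above the working directory to the import path
# so that the `utils` package can be imported.
dir2 = os.path.abspath('../')
dir1 = os.path.dirname(dir2)
if dir1 not in sys.path:
    sys.path.append(dir1)

from utils import analysis
from utils.crosswalk import RepositoryExtract, property_crosswalk
from utils import accessors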

# Summary Statistic Walkthroughs

Read in the repository's JSON extract.

In [5]:
df = pd.read_json(f'{repository}_{object_type}.json')

In [6]:
df

Unnamed: 0,id,title_search,doi_search,handle_search,url_search,published_date_search,thumb_search,defined_type_search,defined_type_name_search,group_id_search,url_private_api_search,url_public_api_search,url_private_html_search,url_public_html_search,timeline_search,resource_title_search,resource_doi_search,search_page,publish_query,authors,categories,citation,confidential_reason,created_date,custom_fields,defined_type_metadata,defined_type_name_metadata,description,doi_metadata,embargo_date,embargo_options,embargo_reason,embargo_title,embargo_type,figshare_url,files,funding,funding_list,group_id_metadata,handle_metadata,has_linked_file,is_confidential,is_embargoed,is_metadata_record,is_public,license,metadata_reason,modified_date,published_date_metadata,references,resource_doi_metadata,resource_title_metadata,size,status,tags,thumb_metadata,timeline_metadata,title_metadata,url_metadata,url_private_api_metadata,url_private_html_metadata,url_public_api_metadata,url_public_html_metadata,version
0,9342671,Schools in control? Questions about the develo...,,2134/1575,https://api.figshare.com/v2/articles/9342671,2006-05-08T15:07:09Z,https://s3-eu-west-1.amazonaws.com/ppreviews-l...,14,conference contribution,20438.0,https://api.figshare.com/v2/account/articles/9...,https://api.figshare.com/v2/articles/9342671,https://figshare.com/account/articles/9342671,https://repository.lboro.ac.uk/articles/confer...,"{'posted': '2006-05-08T15:07:09', 'publisherPu...",,,1,1950-01-01,"[{'id': 2654377, 'full_name': 'Martin Owen', '...","[{'id': 1121, 'title': 'Design Practice and Ma...","Owen, Martin (1993): Schools in control? Quest...",,2006-05-08T15:07:09Z,"[{'name': 'School', 'value': ['Design']}, {'na...",14,conference contribution,The paper considers the role of he teaching of...,,,[],,,,https://repository.lboro.ac.uk/articles/confer...,"[{'id': 16951490, 'name': 'owen93.pdf', 'size'...",,[],20438.0,2134/1575,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2019-08-19T10:08:03Z,2006-05-08T15:07:09Z,[],,,49256,public,[untagged],https://s3-eu-west-1.amazonaws.com/ppreviews-l...,"{'posted': '2006-05-08T15:07:09', 'publisherPu...",Schools in control? Questions about the develo...,https://api.figshare.com/v2/articles/9342671,https://api.figshare.com/v2/account/articles/9...,https://figshare.com/account/articles/9342671,https://api.figshare.com/v2/articles/9342671,https://repository.lboro.ac.uk/articles/confer...,1
1,9490289,Extracting more meaning from CAA results using...,,2134/1892,https://api.figshare.com/v2/articles/9490289,2006-05-24T15:27:57Z,https://s3-eu-west-1.amazonaws.com/ppreviews-l...,14,conference contribution,20438.0,https://api.figshare.com/v2/account/articles/9...,https://api.figshare.com/v2/articles/9490289,https://figshare.com/account/articles/9490289,https://repository.lboro.ac.uk/articles/confer...,"{'posted': '2006-05-24T15:27:57', 'publisherPu...",,,1,1950-01-01,"[{'id': 7194734, 'full_name': 'S. Valenti', 'i...","[{'id': 2, 'title': 'Uncategorized', 'parent_i...","Valenti, S.; Cucchiarelli, A. (2002): Extracti...",,2006-05-24T15:27:57Z,"[{'name': 'School', 'value': ['University Acad...",14,conference contribution,This work describes a novel approach to the pr...,,,[],,,,https://repository.lboro.ac.uk/articles/confer...,"[{'id': 17116004, 'name': 'valenti_s1.pdf', 's...",,[],20438.0,2134/1892,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2019-08-19T12:31:31Z,2006-05-24T15:27:57Z,[],,,185571,public,[untagged],https://s3-eu-west-1.amazonaws.com/ppreviews-l...,"{'posted': '2006-05-24T15:27:57', 'publisherPu...",Extracting more meaning from CAA results using...,https://api.figshare.com/v2/articles/9490289,https://api.figshare.com/v2/account/articles/9...,https://figshare.com/account/articles/9490289,https://api.figshare.com/v2/articles/9490289,https://repository.lboro.ac.uk/articles/confer...,1
2,152586,Large-Scale Mapping and Validation of Escheric...,10.1371/journal.pbio.0050008,,https://api.figshare.com/v2/articles/152586,2007-01-09T00:43:06Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,116.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/152586,https://figshare.com/account/articles/152586,https://plos.figshare.com/articles/dataset/Lar...,"{'posted': '2007-01-09T00:43:06', 'firstOnline...",Large-Scale Mapping and Validation of Escheric...,10.1371/journal.pbio.0050008,1,1950-01-01,"[{'id': 96324, 'full_name': 'Jeremiah J Faith'...","[{'id': 48, 'title': 'Biological Sciences', 'p...","J Faith, Jeremiah; Hayete, Boris; T Thaden, Jo...",,2007-01-09T00:43:06Z,[],3,dataset,<div><p>Machine learning approaches offer the ...,10.1371/journal.pbio.0050008,,[],,,,https://plos.figshare.com/articles/dataset/Lar...,"[{'id': 469682, 'name': 'Figure_S1.pdf', 'size...",,[],116.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-18T11:28:43Z,2007-01-09T00:43:06Z,[],10.1371/journal.pbio.0050008,Large-Scale Mapping and Validation of Escheric...,3358192,public,"[large-scale, validation, transcriptional, com...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-01-09T00:43:06', 'firstOnline...",Large-Scale Mapping and Validation of Escheric...,https://api.figshare.com/v2/articles/152586,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/152586,https://api.figshare.com/v2/articles/152586,https://plos.figshare.com/articles/dataset/Lar...,1
3,622417,Simplified Support Vector Machine,10.1371/journal.pcbi.0030020.g001,,https://api.figshare.com/v2/articles/622417,2007-02-23T00:40:17Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,1,figure,101.0,https://api.figshare.com/v2/account/articles/6...,https://api.figshare.com/v2/articles/622417,https://figshare.com/account/articles/622417,https://plos.figshare.com/articles/figure/_Sim...,"{'posted': '2007-02-23T00:40:17', 'firstOnline...",Improving the <em>Caenorhabditis elegans</em> ...,10.1371/journal.pcbi.0030020,1,1950-01-01,"[{'id': 54319, 'full_name': 'Gunnar Rätsch', '...","[{'id': 53, 'title': 'Mathematics', 'parent_id...","Rätsch, Gunnar; Sonnenburg, Sören; Srinivasan,...",,2007-02-23T00:40:17Z,[],1,figure,<p>Learn a function <i>f</i> such that the dif...,10.1371/journal.pcbi.0030020.g001,,[],,,,https://plos.figshare.com/articles/figure/_Sim...,"[{'id': 952110, 'name': 'Figure_1.tif', 'size'...",,[],101.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2015-12-02T17:29:15Z,2007-02-23T00:40:17Z,[],10.1371/journal.pcbi.0030020,Improving the <em>Caenorhabditis elegans</em> ...,0,public,[vector],https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-02-23T00:40:17', 'firstOnline...",Simplified Support Vector Machine,https://api.figshare.com/v2/articles/622417,https://api.figshare.com/v2/account/articles/6...,https://figshare.com/account/articles/622417,https://api.figshare.com/v2/articles/622417,https://plos.figshare.com/articles/figure/_Sim...,1
4,152441,"Data Preparation Protocols, Additional Results...",10.1371/journal.pcbi.0030020.sd001,,https://api.figshare.com/v2/articles/152441,2007-02-23T00:40:41Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,101.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/152441,https://figshare.com/account/articles/152441,https://plos.figshare.com/articles/dataset/Imp...,"{'posted': '2007-02-23T00:40:41', 'firstOnline...",Improving the <em>Caenorhabditis elegans</em> ...,10.1371/journal.pcbi.0030020,1,1950-01-01,"[{'id': 54319, 'full_name': 'Gunnar Rätsch', '...","[{'id': 53, 'title': 'Mathematics', 'parent_id...","Rätsch, Gunnar; Sonnenburg, Sören; Srinivasan,...",,2007-02-23T00:40:41Z,[],3,dataset,<p>(161 KB PDF)</p>,10.1371/journal.pcbi.0030020.sd001,,[],,,,https://plos.figshare.com/articles/dataset/Imp...,"[{'id': 468834, 'name': 'Protocol_S1.pdf', 'si...",,[],101.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2020-04-27T08:11:31Z,2007-02-23T00:40:41Z,[],10.1371/journal.pcbi.0030020,Improving the <em>Caenorhabditis elegans</em> ...,0,public,"[improving, genome, annotation]",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-02-23T00:40:41', 'firstOnline...","Data Preparation Protocols, Additional Results...",https://api.figshare.com/v2/articles/152441,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/152441,https://api.figshare.com/v2/articles/152441,https://plos.figshare.com/articles/dataset/Imp...,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24208,14107373,Using Machine Learning to Increase NPC Fidelity,10.1184/r1/14107373.v2,,https://api.figshare.com/v2/articles/14107373,2021-12-02T17:15:02Z,https://s3-eu-west-1.amazonaws.com/ppreviews-c...,18,report,9987.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/14107373,https://figshare.com/account/articles/14107373,https://kilthub.cmu.edu/articles/report/Using_...,"{'posted': '2021-12-02T17:15:02', 'firstOnline...",,,7,2021-03-21,"[{'id': 5441456, 'full_name': 'Dustin Updyke',...","[{'id': 23, 'title': 'Software Engineering', '...","Updyke, Dustin; Podnar, Thomas; Dobson, Geoffr...",,2021-12-02T17:15:02Z,"[{'name': 'Publisher Statement', 'value': 'Thi...",18,report,Experiences that seem real to players in train...,10.1184/R1/14107373.v2,,[],,,file,https://kilthub.cmu.edu/articles/report/Using_...,"[{'id': 31645775, 'name': '2021_005_001_743903...",,[],9987.0,,0,0,0,0,1,"{'value': 43, 'name': 'In Copyright', 'url': '...",,2021-12-02T17:15:04Z,2021-12-02T17:15:02Z,[],,,791782,public,"[Machine Learning, modeling, cybersecurity tra...",https://s3-eu-west-1.amazonaws.com/ppreviews-c...,"{'posted': '2021-12-02T17:15:02', 'firstOnline...",Using Machine Learning to Increase NPC Fidelity,https://api.figshare.com/v2/articles/14107373,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/14107373,https://api.figshare.com/v2/articles/14107373,https://kilthub.cmu.edu/articles/report/Using_...,2
24209,17114091,Generative Chemical Transformer: Neural Machin...,10.1021/acs.jcim.1c01289.s001,,https://api.figshare.com/v2/articles/17114091,2021-12-02T18:05:37Z,https://ndownloader.figshare.com/files/3164602...,6,journal contribution,2409.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17114091,https://figshare.com/account/articles/17114091,https://acs.figshare.com/articles/journal_cont...,"{'posted': '2021-12-02T18:05:37', 'publisherPu...",Generative Chemical Transformer: Neural Machin...,10.1021/acs.jcim.1c01289,7,2021-03-21,"[{'id': 10732168, 'full_name': 'Hyunseung Kim'...","[{'id': 4, 'title': 'Biochemistry', 'parent_id...","Kim, Hyunseung; Na, Jonggeol; Lee, Won Bo (202...",,2021-12-02T18:05:37Z,[],6,journal contribution,Discovering new materials better\nsuited to sp...,10.1021/acs.jcim.1c01289.s001,,[],,,,https://acs.figshare.com/articles/journal_cont...,"[{'id': 31646022, 'name': 'ci1c01289_si_001.pd...",,[],2409.0,,0,0,0,0,1,"{'value': 10, 'name': 'CC BY-NC 4.0', 'url': '...",,2021-12-02T18:05:37Z,2021-12-02T18:05:37Z,[],10.1021/acs.jcim.1c01289,Generative Chemical Transformer: Neural Machin...,2501061,public,"[single condition set, multiple target propert...",https://ndownloader.figshare.com/files/3164602...,"{'posted': '2021-12-02T18:05:37', 'publisherPu...",Generative Chemical Transformer: Neural Machin...,https://api.figshare.com/v2/articles/17114091,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17114091,https://api.figshare.com/v2/articles/17114091,https://acs.figshare.com/articles/journal_cont...,1
24210,17114094,Anti-hypertensive Peptide Predictor: A Machine...,10.1021/acs.jafc.1c04555.s001,,https://api.figshare.com/v2/articles/17114094,2021-12-02T18:06:11Z,,3,dataset,2505.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17114094,https://figshare.com/account/articles/17114094,https://acs.figshare.com/articles/dataset/Anti...,"{'posted': '2021-12-02T18:06:11', 'publisherPu...",Anti-hypertensive Peptide Predictor: A Machine...,10.1021/acs.jafc.1c04555,7,2021-03-21,"[{'id': 8441022, 'full_name': 'Gazal Kalyan', ...","[{'id': 4, 'title': 'Biochemistry', 'parent_id...","Kalyan, Gazal; Junghare, Vivek; Khan, Mohammad...",,2021-12-02T18:06:11Z,[],3,dataset,Angiotensin\nconverting enzyme-I (ACE-I) is a ...,10.1021/acs.jafc.1c04555.s001,,[],,,,https://acs.figshare.com/articles/dataset/Anti...,"[{'id': 31646025, 'name': 'jf1c04555_si_001.xl...",,[],2505.0,,0,0,0,0,1,"{'value': 10, 'name': 'CC BY-NC 4.0', 'url': '...",,2021-12-02T18:06:11Z,2021-12-02T18:06:11Z,[],10.1021/acs.jafc.1c04555,Anti-hypertensive Peptide Predictor: A Machine...,153877,public,"[key therapeutic target, hypertensive peptide ...",,"{'posted': '2021-12-02T18:06:11', 'publisherPu...",Anti-hypertensive Peptide Predictor: A Machine...,https://api.figshare.com/v2/articles/17114094,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17114094,https://api.figshare.com/v2/articles/17114094,https://acs.figshare.com/articles/dataset/Anti...,1
24211,17114097,Anti-hypertensive Peptide Predictor: A Machine...,10.1021/acs.jafc.1c04555.s002,,https://api.figshare.com/v2/articles/17114097,2021-12-02T18:06:12Z,https://ndownloader.figshare.com/files/3164602...,6,journal contribution,2505.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17114097,https://figshare.com/account/articles/17114097,https://acs.figshare.com/articles/journal_cont...,"{'posted': '2021-12-02T18:06:12', 'publisherPu...",Anti-hypertensive Peptide Predictor: A Machine...,10.1021/acs.jafc.1c04555,7,2021-03-21,"[{'id': 8441022, 'full_name': 'Gazal Kalyan', ...","[{'id': 4, 'title': 'Biochemistry', 'parent_id...","Kalyan, Gazal; Junghare, Vivek; Khan, Mohammad...",,2021-12-02T18:06:12Z,[],6,journal contribution,Angiotensin\nconverting enzyme-I (ACE-I) is a ...,10.1021/acs.jafc.1c04555.s002,,[],,,,https://acs.figshare.com/articles/journal_cont...,"[{'id': 31646028, 'name': 'jf1c04555_si_002.pd...",,[],2505.0,,0,0,0,0,1,"{'value': 10, 'name': 'CC BY-NC 4.0', 'url': '...",,2021-12-02T18:06:12Z,2021-12-02T18:06:12Z,[],10.1021/acs.jafc.1c04555,Anti-hypertensive Peptide Predictor: A Machine...,403235,public,"[key therapeutic target, hypertensive peptide ...",https://ndownloader.figshare.com/files/3164602...,"{'posted': '2021-12-02T18:06:12', 'publisherPu...",Anti-hypertensive Peptide Predictor: A Machine...,https://api.figshare.com/v2/articles/17114097,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17114097,https://api.figshare.com/v2/articles/17114097,https://acs.figshare.com/articles/journal_cont...,1


Figshare "machine learning" objects span many resource types, including ones that are not relevant to the question at hand, such as journal articles, book chapters, and other items.

Full list of Figshare resource types returned from the initial metadata extract, with the relevant subset identified in bold:
* **dataset**
* journal contribution
* figure
* thesis
* presentation
* media
* preprint
* **software**
* poster
* conference contribution
* online resource
* chapter
* book
* report
* educational resource
* **model**
* workflow

We want to re-run all summary statistics on this subset of Figshare metadata.
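As a rough sketch of what this subsetting amounts to, the example below filters a toy DataFrame to the three relevant resource types in a single step. The column name `defined_type_name_metadata` is taken from the extract; the toy data and the names `toy`, `RELEVANT_TYPES`, and `toy_subset` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the extract; the real frame has many more columns.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "defined_type_name_metadata": [
        "dataset", "figure", "software", "report", "model",
    ],
})

# Resource types marked in bold in the list above.
RELEVANT_TYPES = ["dataset", "software", "model"]

# Boolean mask: True where the row's resource type is one of the relevant types.
mask = toy["defined_type_name_metadata"].isin(RELEVANT_TYPES)
toy_subset = toy[mask]

print(toy_subset["id"].tolist())  # [1, 3, 5]
```

Filtering on the type column directly and filtering through a list of matching ids select the same rows, provided the ids are unique.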

In [7]:
#subset df based on resource type

In [8]:
ids = df.FigshareArticlesCrosswalk.unique_identifier
ids

0         9342671
1         9490289
2          152586
3          622417
4          152441
           ...   
24208    14107373
24209    17114091
24210    17114094
24211    17114097
24212    17114100
Name: id, Length: 24213, dtype: int64

In [9]:
resource_types = df.FigshareArticlesCrosswalk.resource_type
resource_types

0        conference contribution
1        conference contribution
2                        dataset
3                         figure
4                        dataset
                  ...           
24208                     report
24209       journal contribution
24210                    dataset
24211       journal contribution
24212       journal contribution
Name: defined_type_name_metadata, Length: 24213, dtype: object
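Before subsetting, it can help to see how many records fall under each resource type. The sketch below uses `Series.value_counts` on made-up data; `toy_types`, `type_counts`, and `relevant_total` are illustrative names, not part of the template.

```python
import pandas as pd

# Toy resource-type column; values mimic those in the extract.
toy_types = pd.Series(
    ["dataset", "figure", "dataset", "software", "journal contribution", "dataset"],
    name="defined_type_name_metadata",
)

# Count of records per resource type, most frequent first.
type_counts = toy_types.value_counts()
print(type_counts)

# Total across the relevant types; reindex fills types absent from the data with 0.
relevant_total = type_counts.reindex(
    ["dataset", "software", "model"], fill_value=0
).sum()
print(relevant_total)  # 4
```

Applied to the real `resource_types` series, the same total should match the length of the filtered id list.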

In [10]:
# Combine the ID and resource type columns side by side
resource_ids = pd.concat([ids, resource_types], axis = 1)
resource_ids

Unnamed: 0,id,defined_type_name_metadata
0,9342671,conference contribution
1,9490289,conference contribution
2,152586,dataset
3,622417,figure
4,152441,dataset
...,...,...
24208,14107373,report
24209,17114091,journal contribution
24210,17114094,dataset
24211,17114097,journal contribution


In [11]:
#keep only ids that are 'dataset', 'software', or 'model'
keep_subset = resource_ids[resource_ids['defined_type_name_metadata'].isin(['dataset', 'software', 'model'])]
keep_ids = keep_subset['id']
keep_ids

2          152586
4          152441
9          151745
10         151418
11         151373
           ...   
24204    17113517
24205    17113523
24206    17113529
24207    17113550
24210    17114094
Name: id, Length: 10359, dtype: int64

In [12]:
#subset full dataset to keep only these ids
subset_df = df[df['id'].isin(keep_ids)]

In [13]:
subset_df

Unnamed: 0,id,title_search,doi_search,handle_search,url_search,published_date_search,thumb_search,defined_type_search,defined_type_name_search,group_id_search,url_private_api_search,url_public_api_search,url_private_html_search,url_public_html_search,timeline_search,resource_title_search,resource_doi_search,search_page,publish_query,authors,categories,citation,confidential_reason,created_date,custom_fields,defined_type_metadata,defined_type_name_metadata,description,doi_metadata,embargo_date,embargo_options,embargo_reason,embargo_title,embargo_type,figshare_url,files,funding,funding_list,group_id_metadata,handle_metadata,has_linked_file,is_confidential,is_embargoed,is_metadata_record,is_public,license,metadata_reason,modified_date,published_date_metadata,references,resource_doi_metadata,resource_title_metadata,size,status,tags,thumb_metadata,timeline_metadata,title_metadata,url_metadata,url_private_api_metadata,url_private_html_metadata,url_public_api_metadata,url_public_html_metadata,version
2,152586,Large-Scale Mapping and Validation of Escheric...,10.1371/journal.pbio.0050008,,https://api.figshare.com/v2/articles/152586,2007-01-09T00:43:06Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,116.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/152586,https://figshare.com/account/articles/152586,https://plos.figshare.com/articles/dataset/Lar...,"{'posted': '2007-01-09T00:43:06', 'firstOnline...",Large-Scale Mapping and Validation of Escheric...,10.1371/journal.pbio.0050008,1,1950-01-01,"[{'id': 96324, 'full_name': 'Jeremiah J Faith'...","[{'id': 48, 'title': 'Biological Sciences', 'p...","J Faith, Jeremiah; Hayete, Boris; T Thaden, Jo...",,2007-01-09T00:43:06Z,[],3,dataset,<div><p>Machine learning approaches offer the ...,10.1371/journal.pbio.0050008,,[],,,,https://plos.figshare.com/articles/dataset/Lar...,"[{'id': 469682, 'name': 'Figure_S1.pdf', 'size...",,[],116.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-18T11:28:43Z,2007-01-09T00:43:06Z,[],10.1371/journal.pbio.0050008,Large-Scale Mapping and Validation of Escheric...,3358192,public,"[large-scale, validation, transcriptional, com...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-01-09T00:43:06', 'firstOnline...",Large-Scale Mapping and Validation of Escheric...,https://api.figshare.com/v2/articles/152586,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/152586,https://api.figshare.com/v2/articles/152586,https://plos.figshare.com/articles/dataset/Lar...,1
4,152441,"Data Preparation Protocols, Additional Results...",10.1371/journal.pcbi.0030020.sd001,,https://api.figshare.com/v2/articles/152441,2007-02-23T00:40:41Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,101.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/152441,https://figshare.com/account/articles/152441,https://plos.figshare.com/articles/dataset/Imp...,"{'posted': '2007-02-23T00:40:41', 'firstOnline...",Improving the <em>Caenorhabditis elegans</em> ...,10.1371/journal.pcbi.0030020,1,1950-01-01,"[{'id': 54319, 'full_name': 'Gunnar Rätsch', '...","[{'id': 53, 'title': 'Mathematics', 'parent_id...","Rätsch, Gunnar; Sonnenburg, Sören; Srinivasan,...",,2007-02-23T00:40:41Z,[],3,dataset,<p>(161 KB PDF)</p>,10.1371/journal.pcbi.0030020.sd001,,[],,,,https://plos.figshare.com/articles/dataset/Imp...,"[{'id': 468834, 'name': 'Protocol_S1.pdf', 'si...",,[],101.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2020-04-27T08:11:31Z,2007-02-23T00:40:41Z,[],10.1371/journal.pcbi.0030020,Improving the <em>Caenorhabditis elegans</em> ...,0,public,"[improving, genome, annotation]",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-02-23T00:40:41', 'firstOnline...","Data Preparation Protocols, Additional Results...",https://api.figshare.com/v2/articles/152441,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/152441,https://api.figshare.com/v2/articles/152441,https://plos.figshare.com/articles/dataset/Imp...,1
9,151745,Module-Based Outcome Prediction Using Breast C...,10.1371/journal.pone.0001047,,https://api.figshare.com/v2/articles/151745,2007-10-17T00:29:05Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,107.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/151745,https://figshare.com/account/articles/151745,https://plos.figshare.com/articles/dataset/Mod...,"{'posted': '2007-10-17T00:29:05', 'firstOnline...",Module-Based Outcome Prediction Using Breast C...,10.1371/journal.pone.0001047,1,1950-01-01,"[{'id': 80239, 'full_name': 'Martin H. van Vli...","[{'id': 13, 'title': 'Genetics', 'parent_id': ...","H. van Vliet, Martin; N. Klijn, Christiaan; F....",,2007-10-17T00:29:05Z,[],3,dataset,<div><h3>Background</h3><p>The availability of...,10.1371/journal.pone.0001047,,[],,,,https://plos.figshare.com/articles/dataset/Mod...,"[{'id': 465350, 'name': 'Text_S1.doc', 'size':...",,[],107.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-18T12:04:10Z,2007-10-17T00:29:05Z,[],10.1371/journal.pone.0001047,Module-Based Outcome Prediction Using Breast C...,12532568,public,"[module-based, cancer, compendia]",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-10-17T00:29:05', 'firstOnline...",Module-Based Outcome Prediction Using Breast C...,https://api.figshare.com/v2/articles/151745,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/151745,https://api.figshare.com/v2/articles/151745,https://plos.figshare.com/articles/dataset/Mod...,1
10,151418,Activation of Inflammation/NF-κB Signaling in ...,10.1371/journal.pgen.0030207,,https://api.figshare.com/v2/articles/151418,2007-11-23T00:23:38Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,98.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/151418,https://figshare.com/account/articles/151418,https://plos.figshare.com/articles/dataset/Act...,"{'posted': '2007-11-23T00:23:38', 'firstOnline...",Activation of Inflammation/NF-κB Signaling in ...,10.1371/journal.pgen.0030207,1,1950-01-01,"[{'id': 272843, 'full_name': 'Rebecca C Fry', ...","[{'id': 13, 'title': 'Genetics', 'parent_id': ...","C Fry, Rebecca; Navasumrit, Panida; Valiathan,...",,2007-11-23T00:23:38Z,[],3,dataset,<div><p>The long-term health outcome of prenat...,10.1371/journal.pgen.0030207,,[],,,,https://plos.figshare.com/articles/dataset/Act...,"[{'id': 463877, 'name': 'Figure_S1.pdf', 'size...",,[],98.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-18T12:04:14Z,2007-11-23T00:23:38Z,[],10.1371/journal.pgen.0030207,Activation of Inflammation/NF-κB Signaling in ...,12846051,public,"[activation, signaling, infants, arsenic-expos...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-11-23T00:23:38', 'firstOnline...",Activation of Inflammation/NF-κB Signaling in ...,https://api.figshare.com/v2/articles/151418,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/151418,https://api.figshare.com/v2/articles/151418,https://plos.figshare.com/articles/dataset/Act...,1
11,151373,Intragenomic Matching Reveals a Huge Potential...,10.1371/journal.pcbi.0030238,,https://api.figshare.com/v2/articles/151373,2007-11-30T00:22:53Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,101.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/151373,https://figshare.com/account/articles/151373,https://plos.figshare.com/articles/dataset/Int...,"{'posted': '2007-11-30T00:22:53', 'firstOnline...",Intragenomic Matching Reveals a Huge Potential...,10.1371/journal.pcbi.0030238,1,1950-01-01,"[{'id': 82401, 'full_name': 'Morten Lindow', '...","[{'id': 12, 'title': 'Cell Biology', 'parent_i...","Lindow, Morten; Jacobsen, Anders; Nygaard, San...",,2007-11-30T00:22:53Z,[],3,dataset,<div><p>microRNAs (miRNAs) are important post-...,10.1371/journal.pcbi.0030238,,[],,,,https://plos.figshare.com/articles/dataset/Int...,"[{'id': 6649146, 'name': 'Dataset S1.TDS', 'si...",,[],101.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-10-28T06:55:11Z,2007-11-30T00:22:53Z,[],10.1371/journal.pcbi.0030238,Intragenomic Matching Reveals a Huge Potential...,3756384,public,"[support vector machine classification, target...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-11-30T00:22:53', 'firstOnline...",Intragenomic Matching Reveals a Huge Potential...,https://api.figshare.com/v2/articles/151373,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/151373,https://api.figshare.com/v2/articles/151373,https://plos.figshare.com/articles/dataset/Int...,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24204,17113517,Supplementary Table 2: Unsupervised learning o...,10.25402/fon.17113517.v2,,https://api.figshare.com/v2/articles/17113517,2021-12-02T15:58:43Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113517,https://figshare.com/account/articles/17113517,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:58:43', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:58:43Z,[],3,dataset,<div>Supplementary Table 2. Unsupervised lear...,10.25402/FON.17113517.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645340, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:58:46Z,2021-12-02T15:58:43Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,77444,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:58:43', 'firstOnline...",Supplementary Table 2: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113517,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113517,https://api.figshare.com/v2/articles/17113517,https://future-science-group.figshare.com/arti...,2
24205,17113523,Supplementary Table 3: Unsupervised learning o...,10.25402/fon.17113523.v2,,https://api.figshare.com/v2/articles/17113523,2021-12-02T15:59:13Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113523,https://figshare.com/account/articles/17113523,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:13', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:13Z,[],3,dataset,<div>Supplementary Table 3. Unsupervised lear...,10.25402/FON.17113523.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645343, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:15Z,2021-12-02T15:59:13Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,185790,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:13', 'firstOnline...",Supplementary Table 3: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113523,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113523,https://api.figshare.com/v2/articles/17113523,https://future-science-group.figshare.com/arti...,2
24206,17113529,Supplementary Table 4: Unsupervised learning o...,10.25402/fon.17113529.v2,,https://api.figshare.com/v2/articles/17113529,2021-12-02T15:59:36Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113529,https://figshare.com/account/articles/17113529,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:36', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:36Z,[],3,dataset,<div>Supplementary Table 4. Unsupervised lear...,10.25402/FON.17113529.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645349, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:38Z,2021-12-02T15:59:36Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,137646,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:36', 'firstOnline...",Supplementary Table 4: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113529,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113529,https://api.figshare.com/v2/articles/17113529,https://future-science-group.figshare.com/arti...,2
24207,17113550,Supplementary Table 5: Unsupervised learning o...,10.25402/fon.17113550.v2,,https://api.figshare.com/v2/articles/17113550,2021-12-02T15:59:52Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113550,https://figshare.com/account/articles/17113550,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:52', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:52Z,[],3,dataset,<div>Supplementary Table 5. Unsupervised lear...,10.25402/FON.17113550.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645355, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:54Z,2021-12-02T15:59:52Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,109962,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:52', 'firstOnline...",Supplementary Table 5: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113550,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113550,https://api.figshare.com/v2/articles/17113550,https://future-science-group.figshare.com/arti...,2


In [14]:
#rebind subset_df to df, so the remainder of the code is identical to the full-extract version
del df
df = subset_df

In [15]:
df

Unnamed: 0,id,title_search,doi_search,handle_search,url_search,published_date_search,thumb_search,defined_type_search,defined_type_name_search,group_id_search,url_private_api_search,url_public_api_search,url_private_html_search,url_public_html_search,timeline_search,resource_title_search,resource_doi_search,search_page,publish_query,authors,categories,citation,confidential_reason,created_date,custom_fields,defined_type_metadata,defined_type_name_metadata,description,doi_metadata,embargo_date,embargo_options,embargo_reason,embargo_title,embargo_type,figshare_url,files,funding,funding_list,group_id_metadata,handle_metadata,has_linked_file,is_confidential,is_embargoed,is_metadata_record,is_public,license,metadata_reason,modified_date,published_date_metadata,references,resource_doi_metadata,resource_title_metadata,size,status,tags,thumb_metadata,timeline_metadata,title_metadata,url_metadata,url_private_api_metadata,url_private_html_metadata,url_public_api_metadata,url_public_html_metadata,version
2,152586,Large-Scale Mapping and Validation of Escheric...,10.1371/journal.pbio.0050008,,https://api.figshare.com/v2/articles/152586,2007-01-09T00:43:06Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,116.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/152586,https://figshare.com/account/articles/152586,https://plos.figshare.com/articles/dataset/Lar...,"{'posted': '2007-01-09T00:43:06', 'firstOnline...",Large-Scale Mapping and Validation of Escheric...,10.1371/journal.pbio.0050008,1,1950-01-01,"[{'id': 96324, 'full_name': 'Jeremiah J Faith'...","[{'id': 48, 'title': 'Biological Sciences', 'p...","J Faith, Jeremiah; Hayete, Boris; T Thaden, Jo...",,2007-01-09T00:43:06Z,[],3,dataset,<div><p>Machine learning approaches offer the ...,10.1371/journal.pbio.0050008,,[],,,,https://plos.figshare.com/articles/dataset/Lar...,"[{'id': 469682, 'name': 'Figure_S1.pdf', 'size...",,[],116.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-18T11:28:43Z,2007-01-09T00:43:06Z,[],10.1371/journal.pbio.0050008,Large-Scale Mapping and Validation of Escheric...,3358192,public,"[large-scale, validation, transcriptional, com...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-01-09T00:43:06', 'firstOnline...",Large-Scale Mapping and Validation of Escheric...,https://api.figshare.com/v2/articles/152586,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/152586,https://api.figshare.com/v2/articles/152586,https://plos.figshare.com/articles/dataset/Lar...,1
4,152441,"Data Preparation Protocols, Additional Results...",10.1371/journal.pcbi.0030020.sd001,,https://api.figshare.com/v2/articles/152441,2007-02-23T00:40:41Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,101.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/152441,https://figshare.com/account/articles/152441,https://plos.figshare.com/articles/dataset/Imp...,"{'posted': '2007-02-23T00:40:41', 'firstOnline...",Improving the <em>Caenorhabditis elegans</em> ...,10.1371/journal.pcbi.0030020,1,1950-01-01,"[{'id': 54319, 'full_name': 'Gunnar Rätsch', '...","[{'id': 53, 'title': 'Mathematics', 'parent_id...","Rätsch, Gunnar; Sonnenburg, Sören; Srinivasan,...",,2007-02-23T00:40:41Z,[],3,dataset,<p>(161 KB PDF)</p>,10.1371/journal.pcbi.0030020.sd001,,[],,,,https://plos.figshare.com/articles/dataset/Imp...,"[{'id': 468834, 'name': 'Protocol_S1.pdf', 'si...",,[],101.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2020-04-27T08:11:31Z,2007-02-23T00:40:41Z,[],10.1371/journal.pcbi.0030020,Improving the <em>Caenorhabditis elegans</em> ...,0,public,"[improving, genome, annotation]",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-02-23T00:40:41', 'firstOnline...","Data Preparation Protocols, Additional Results...",https://api.figshare.com/v2/articles/152441,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/152441,https://api.figshare.com/v2/articles/152441,https://plos.figshare.com/articles/dataset/Imp...,1
9,151745,Module-Based Outcome Prediction Using Breast C...,10.1371/journal.pone.0001047,,https://api.figshare.com/v2/articles/151745,2007-10-17T00:29:05Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,107.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/151745,https://figshare.com/account/articles/151745,https://plos.figshare.com/articles/dataset/Mod...,"{'posted': '2007-10-17T00:29:05', 'firstOnline...",Module-Based Outcome Prediction Using Breast C...,10.1371/journal.pone.0001047,1,1950-01-01,"[{'id': 80239, 'full_name': 'Martin H. van Vli...","[{'id': 13, 'title': 'Genetics', 'parent_id': ...","H. van Vliet, Martin; N. Klijn, Christiaan; F....",,2007-10-17T00:29:05Z,[],3,dataset,<div><h3>Background</h3><p>The availability of...,10.1371/journal.pone.0001047,,[],,,,https://plos.figshare.com/articles/dataset/Mod...,"[{'id': 465350, 'name': 'Text_S1.doc', 'size':...",,[],107.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-18T12:04:10Z,2007-10-17T00:29:05Z,[],10.1371/journal.pone.0001047,Module-Based Outcome Prediction Using Breast C...,12532568,public,"[module-based, cancer, compendia]",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-10-17T00:29:05', 'firstOnline...",Module-Based Outcome Prediction Using Breast C...,https://api.figshare.com/v2/articles/151745,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/151745,https://api.figshare.com/v2/articles/151745,https://plos.figshare.com/articles/dataset/Mod...,1
10,151418,Activation of Inflammation/NF-κB Signaling in ...,10.1371/journal.pgen.0030207,,https://api.figshare.com/v2/articles/151418,2007-11-23T00:23:38Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,98.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/151418,https://figshare.com/account/articles/151418,https://plos.figshare.com/articles/dataset/Act...,"{'posted': '2007-11-23T00:23:38', 'firstOnline...",Activation of Inflammation/NF-κB Signaling in ...,10.1371/journal.pgen.0030207,1,1950-01-01,"[{'id': 272843, 'full_name': 'Rebecca C Fry', ...","[{'id': 13, 'title': 'Genetics', 'parent_id': ...","C Fry, Rebecca; Navasumrit, Panida; Valiathan,...",,2007-11-23T00:23:38Z,[],3,dataset,<div><p>The long-term health outcome of prenat...,10.1371/journal.pgen.0030207,,[],,,,https://plos.figshare.com/articles/dataset/Act...,"[{'id': 463877, 'name': 'Figure_S1.pdf', 'size...",,[],98.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-18T12:04:14Z,2007-11-23T00:23:38Z,[],10.1371/journal.pgen.0030207,Activation of Inflammation/NF-κB Signaling in ...,12846051,public,"[activation, signaling, infants, arsenic-expos...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-11-23T00:23:38', 'firstOnline...",Activation of Inflammation/NF-κB Signaling in ...,https://api.figshare.com/v2/articles/151418,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/151418,https://api.figshare.com/v2/articles/151418,https://plos.figshare.com/articles/dataset/Act...,1
11,151373,Intragenomic Matching Reveals a Huge Potential...,10.1371/journal.pcbi.0030238,,https://api.figshare.com/v2/articles/151373,2007-11-30T00:22:53Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,101.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/151373,https://figshare.com/account/articles/151373,https://plos.figshare.com/articles/dataset/Int...,"{'posted': '2007-11-30T00:22:53', 'firstOnline...",Intragenomic Matching Reveals a Huge Potential...,10.1371/journal.pcbi.0030238,1,1950-01-01,"[{'id': 82401, 'full_name': 'Morten Lindow', '...","[{'id': 12, 'title': 'Cell Biology', 'parent_i...","Lindow, Morten; Jacobsen, Anders; Nygaard, San...",,2007-11-30T00:22:53Z,[],3,dataset,<div><p>microRNAs (miRNAs) are important post-...,10.1371/journal.pcbi.0030238,,[],,,,https://plos.figshare.com/articles/dataset/Int...,"[{'id': 6649146, 'name': 'Dataset S1.TDS', 'si...",,[],101.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-10-28T06:55:11Z,2007-11-30T00:22:53Z,[],10.1371/journal.pcbi.0030238,Intragenomic Matching Reveals a Huge Potential...,3756384,public,"[support vector machine classification, target...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2007-11-30T00:22:53', 'firstOnline...",Intragenomic Matching Reveals a Huge Potential...,https://api.figshare.com/v2/articles/151373,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/151373,https://api.figshare.com/v2/articles/151373,https://plos.figshare.com/articles/dataset/Int...,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24204,17113517,Supplementary Table 2: Unsupervised learning o...,10.25402/fon.17113517.v2,,https://api.figshare.com/v2/articles/17113517,2021-12-02T15:58:43Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113517,https://figshare.com/account/articles/17113517,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:58:43', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:58:43Z,[],3,dataset,<div>Supplementary Table 2. Unsupervised lear...,10.25402/FON.17113517.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645340, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:58:46Z,2021-12-02T15:58:43Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,77444,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:58:43', 'firstOnline...",Supplementary Table 2: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113517,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113517,https://api.figshare.com/v2/articles/17113517,https://future-science-group.figshare.com/arti...,2
24205,17113523,Supplementary Table 3: Unsupervised learning o...,10.25402/fon.17113523.v2,,https://api.figshare.com/v2/articles/17113523,2021-12-02T15:59:13Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113523,https://figshare.com/account/articles/17113523,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:13', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:13Z,[],3,dataset,<div>Supplementary Table 3. Unsupervised lear...,10.25402/FON.17113523.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645343, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:15Z,2021-12-02T15:59:13Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,185790,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:13', 'firstOnline...",Supplementary Table 3: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113523,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113523,https://api.figshare.com/v2/articles/17113523,https://future-science-group.figshare.com/arti...,2
24206,17113529,Supplementary Table 4: Unsupervised learning o...,10.25402/fon.17113529.v2,,https://api.figshare.com/v2/articles/17113529,2021-12-02T15:59:36Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113529,https://figshare.com/account/articles/17113529,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:36', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:36Z,[],3,dataset,<div>Supplementary Table 4. Unsupervised lear...,10.25402/FON.17113529.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645349, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:38Z,2021-12-02T15:59:36Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,137646,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:36', 'firstOnline...",Supplementary Table 4: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113529,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113529,https://api.figshare.com/v2/articles/17113529,https://future-science-group.figshare.com/arti...,2
24207,17113550,Supplementary Table 5: Unsupervised learning o...,10.25402/fon.17113550.v2,,https://api.figshare.com/v2/articles/17113550,2021-12-02T15:59:52Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113550,https://figshare.com/account/articles/17113550,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:52', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:52Z,[],3,dataset,<div>Supplementary Table 5. Unsupervised lear...,10.25402/FON.17113550.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645355, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:54Z,2021-12-02T15:59:52Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,109962,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:52', 'firstOnline...",Supplementary Table 5: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113550,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113550,https://api.figshare.com/v2/articles/17113550,https://future-science-group.figshare.com/arti...,2


## 1. How many total objects (not just records) are in our main dataset extracts for each repository?
**Property:** unique_identifier

In [16]:
ids = df.FigshareArticlesCrosswalk.unique_identifier
ids

2          152586
4          152441
9          151745
10         151418
11         151373
           ...   
24204    17113517
24205    17113523
24206    17113529
24207    17113550
24210    17114094
Name: id, Length: 10359, dtype: int64

In [17]:
ids.nunique()

10347

In [18]:
print(f'There are {len(ids)} items in the Figshare extract, with {ids.nunique()} unique IDs.')

There are 10359 items in the Figshare extract, with 10347 unique IDs.
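The gap between the row count and the unique-ID count is the number of surplus rows, that is, rows beyond the first for each ID. A minimal sketch on toy data (the `toy_ids` values are hypothetical, not taken from the extract) shows how to count those rows and list the rows whose ID repeats:

```python
import pandas as pd

# Hypothetical IDs: 2 appears three times, 1 and 3 once each.
toy_ids = pd.Series([1, 2, 2, 2, 3])

# Rows beyond the first occurrence of each ID.
surplus_rows = len(toy_ids) - toy_ids.nunique()

# keep=False marks every row whose ID occurs more than once.
repeated = toy_ids[toy_ids.duplicated(keep=False)]

print(surplus_rows)
print(repeated.unique().tolist())
```

Applied to the extract, the same subtraction gives 10359 - 10347 = 12 surplus rows.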


In [19]:
#most IDs have 1 row; 4 IDs have 4 rows each
ids.value_counts().value_counts()

1    10343
4        4
Name: id, dtype: int64

In [20]:
#look into these duplicate IDs
ids.value_counts()

8832509     4
14099231    4
14254089    4
12618089    4
13517543    1
           ..
8200526     1
8206505     1
8143394     1
8207012     1
17114094    1
Name: id, Length: 10347, dtype: int64

In [21]:
fig_dupes = df.loc[df['id'] == 8832509]
fig_dupes

Unnamed: 0,id,title_search,doi_search,handle_search,url_search,published_date_search,thumb_search,defined_type_search,defined_type_name_search,group_id_search,url_private_api_search,url_public_api_search,url_private_html_search,url_public_html_search,timeline_search,resource_title_search,resource_doi_search,search_page,publish_query,authors,categories,citation,confidential_reason,created_date,custom_fields,defined_type_metadata,defined_type_name_metadata,description,doi_metadata,embargo_date,embargo_options,embargo_reason,embargo_title,embargo_type,figshare_url,files,funding,funding_list,group_id_metadata,handle_metadata,has_linked_file,is_confidential,is_embargoed,is_metadata_record,is_public,license,metadata_reason,modified_date,published_date_metadata,references,resource_doi_metadata,resource_title_metadata,size,status,tags,thumb_metadata,timeline_metadata,title_metadata,url_metadata,url_private_api_metadata,url_private_html_metadata,url_public_api_metadata,url_public_html_metadata,version
9003,8832509,"Data and R code for ""The paleoclimatic footpri...",10.6084/m9.figshare.8832509.v2,,https://api.figshare.com/v2/articles/8832509,2019-08-16T04:34:07Z,,3,dataset,,https://api.figshare.com/v2/account/articles/8...,https://api.figshare.com/v2/articles/8832509,https://figshare.com/account/articles/8832509,https://figshare.com/articles/dataset/Data_and...,"{'posted': '2019-08-16T04:34:07', 'firstOnline...",,,9,1950-01-01,"[{'id': 3861649, 'full_name': 'Jinzhi Ding', '...","[{'id': 266, 'title': 'Carbon Sequestration Sc...","Ding, Jinzhi; Wang, Tao; Piao, Shilong; Smith,...",,2019-08-16T04:34:07Z,[],3,dataset,"A new estimate of Tibetan soil carbon pool, a...",10.6084/m9.figshare.8832509.v2,2019-11-09,[],,,file,https://figshare.com/articles/dataset/Data_and...,"[{'id': 16174970, 'name': 'Tibetan soil carbon...",the Strategic Priority Research Program (A) of...,"[{'id': 14948300, 'title': 'the Strategic Prio...",,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2019-08-16T04:34:07Z,2019-08-16T04:34:07Z,[],,,1336299,public,"[soil carbon stock, alpine permafrost, Tibetan...",,"{'posted': '2019-08-16T04:34:07', 'firstOnline...","Data and R code for ""The paleoclimatic footpri...",https://api.figshare.com/v2/articles/8832509,https://api.figshare.com/v2/account/articles/8...,https://figshare.com/account/articles/8832509,https://api.figshare.com/v2/articles/8832509,https://figshare.com/articles/dataset/Data_and...,2
9004,8832509,"Data and R code for ""The paleoclimatic footpri...",10.6084/m9.figshare.8832509.v2,,https://api.figshare.com/v2/articles/8832509,2019-08-16T04:34:07Z,,3,dataset,,https://api.figshare.com/v2/account/articles/8...,https://api.figshare.com/v2/articles/8832509,https://figshare.com/account/articles/8832509,https://figshare.com/articles/dataset/Data_and...,"{'posted': '2019-08-16T04:34:07', 'firstOnline...",,,9,1950-01-01,"[{'id': 3861649, 'full_name': 'Jinzhi Ding', '...","[{'id': 266, 'title': 'Carbon Sequestration Sc...","Ding, Jinzhi; Wang, Tao; Piao, Shilong; Smith,...",,2019-08-16T04:34:07Z,[],3,dataset,"A new estimate of Tibetan soil carbon pool, a...",10.6084/m9.figshare.8832509.v2,2019-11-09,[],,,file,https://figshare.com/articles/dataset/Data_and...,"[{'id': 16174970, 'name': 'Tibetan soil carbon...",the Strategic Priority Research Program (A) of...,"[{'id': 14948300, 'title': 'the Strategic Prio...",,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2019-08-16T04:34:07Z,2019-08-16T04:34:07Z,[],,,1336299,public,"[soil carbon stock, alpine permafrost, Tibetan...",,"{'posted': '2019-08-16T04:34:07', 'firstOnline...","Data and R code for ""The paleoclimatic footpri...",https://api.figshare.com/v2/articles/8832509,https://api.figshare.com/v2/account/articles/8...,https://figshare.com/account/articles/8832509,https://api.figshare.com/v2/articles/8832509,https://figshare.com/articles/dataset/Data_and...,2
9005,8832509,"Data and R code for ""The paleoclimatic footpri...",10.6084/m9.figshare.8832509.v2,,https://api.figshare.com/v2/articles/8832509,2019-08-16T04:34:07Z,,3,dataset,,https://api.figshare.com/v2/account/articles/8...,https://api.figshare.com/v2/articles/8832509,https://figshare.com/account/articles/8832509,https://figshare.com/articles/dataset/Data_and...,"{'posted': '2019-08-16T04:34:07', 'firstOnline...",,,1,2019-08-16,"[{'id': 3861649, 'full_name': 'Jinzhi Ding', '...","[{'id': 266, 'title': 'Carbon Sequestration Sc...","Ding, Jinzhi; Wang, Tao; Piao, Shilong; Smith,...",,2019-08-16T04:34:07Z,[],3,dataset,"A new estimate of Tibetan soil carbon pool, a...",10.6084/m9.figshare.8832509.v2,2019-11-09,[],,,file,https://figshare.com/articles/dataset/Data_and...,"[{'id': 16174970, 'name': 'Tibetan soil carbon...",the Strategic Priority Research Program (A) of...,"[{'id': 14948300, 'title': 'the Strategic Prio...",,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2019-08-16T04:34:07Z,2019-08-16T04:34:07Z,[],,,1336299,public,"[soil carbon stock, alpine permafrost, Tibetan...",,"{'posted': '2019-08-16T04:34:07', 'firstOnline...","Data and R code for ""The paleoclimatic footpri...",https://api.figshare.com/v2/articles/8832509,https://api.figshare.com/v2/account/articles/8...,https://figshare.com/account/articles/8832509,https://api.figshare.com/v2/articles/8832509,https://figshare.com/articles/dataset/Data_and...,2
9006,8832509,"Data and R code for ""The paleoclimatic footpri...",10.6084/m9.figshare.8832509.v2,,https://api.figshare.com/v2/articles/8832509,2019-08-16T04:34:07Z,,3,dataset,,https://api.figshare.com/v2/account/articles/8...,https://api.figshare.com/v2/articles/8832509,https://figshare.com/account/articles/8832509,https://figshare.com/articles/dataset/Data_and...,"{'posted': '2019-08-16T04:34:07', 'firstOnline...",,,1,2019-08-16,"[{'id': 3861649, 'full_name': 'Jinzhi Ding', '...","[{'id': 266, 'title': 'Carbon Sequestration Sc...","Ding, Jinzhi; Wang, Tao; Piao, Shilong; Smith,...",,2019-08-16T04:34:07Z,[],3,dataset,"A new estimate of Tibetan soil carbon pool, a...",10.6084/m9.figshare.8832509.v2,2019-11-09,[],,,file,https://figshare.com/articles/dataset/Data_and...,"[{'id': 16174970, 'name': 'Tibetan soil carbon...",the Strategic Priority Research Program (A) of...,"[{'id': 14948300, 'title': 'the Strategic Prio...",,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2019-08-16T04:34:07Z,2019-08-16T04:34:07Z,[],,,1336299,public,"[soil carbon stock, alpine permafrost, Tibetan...",,"{'posted': '2019-08-16T04:34:07', 'firstOnline...","Data and R code for ""The paleoclimatic footpri...",https://api.figshare.com/v2/articles/8832509,https://api.figshare.com/v2/account/articles/8...,https://figshare.com/account/articles/8832509,https://api.figshare.com/v2/articles/8832509,https://figshare.com/articles/dataset/Data_and...,2


In [22]:
#compare rows to see if identical
fig1 = fig_dupes.iloc[0]
fig2 = fig_dupes.iloc[1]
fig3 = fig_dupes.iloc[2]
fig4 = fig_dupes.iloc[3]

print(fig1.equals(fig2))
print(fig2.equals(fig3))
print(fig3.equals(fig4))

True
False
True


In [23]:
#So the first two instances are identical, and the last two are identical, but the two pairs differ from each other
#we can at least narrow this down to 2 distinct instances rather than 4
#we need to dig deeper into how they differ, since at a glance they look the same

In [24]:
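#element-wise comparison; note that NaN == NaN is False, so fields missing in both rows are flagged as different
fig_diffs = fig1 == fig4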
fig_diffs

id                             True
title_search                   True
doi_search                     True
handle_search                  True
url_search                     True
published_date_search          True
thumb_search                   True
defined_type_search            True
defined_type_name_search       True
group_id_search               False
url_private_api_search         True
url_public_api_search          True
url_private_html_search        True
url_public_html_search         True
timeline_search                True
resource_title_search          True
resource_doi_search            True
search_page                   False
publish_query                 False
authors                        True
categories                     True
citation                       True
confidential_reason            True
created_date                   True
custom_fields                  True
defined_type_metadata          True
defined_type_name_metadata     True
description                 

In [25]:
#keep only the fields flagged as different
index_diffs = fig_diffs[~fig_diffs]
index_diffs

group_id_search            False
search_page                False
publish_query              False
group_id_metadata          False
resource_doi_metadata      False
resource_title_metadata    False
dtype: bool

In [26]:
index_diffs_str = index_diffs.index.tolist()
index_diffs_str

['group_id_search',
 'search_page',
 'publish_query',
 'group_id_metadata',
 'resource_doi_metadata',
 'resource_title_metadata']

In [27]:
fig1[index_diffs_str].dropna()

search_page               9
publish_query    1950-01-01
Name: 9003, dtype: object

In [28]:
fig4[index_diffs_str].dropna()

search_page               1
publish_query    2019-08-16
Name: 9006, dtype: object

In [29]:
#we don't use search_page for anything, so that difference can be ignored
#besides publish_query, the other flagged fields are None or NaN in both rows, so they can be ignored too
#as with duplicates in other repos, when rows share an id we likely want to keep the one with the most recent publish_query
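The fields that are missing in both rows are flagged because an element-wise `==` treats NaN as unequal to NaN, whereas `Series.equals` treats NaNs in the same position as equal. A minimal sketch of a NaN-aware comparison, using hypothetical toy rows (`row_a` and `row_b` stand in for `fig1` and `fig4`):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: group_id is missing in both, search_page really differs.
row_a = pd.Series({'id': 1, 'group_id': np.nan, 'search_page': 9})
row_b = pd.Series({'id': 1, 'group_id': np.nan, 'search_page': 1})

# Plain == flags the NaN field as different, because NaN != NaN.
naive_same = row_a == row_b

# Treat a field as equal when it is missing in both rows.
nan_aware_same = naive_same | (row_a.isna() & row_b.isna())

# Only the fields that genuinely differ remain.
real_diffs = nan_aware_same[~nan_aware_same].index.tolist()
print(real_diffs)
```

With this mask, the `dropna()` step used above to filter out the all-missing fields would no longer be needed.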

In [30]:
#subset to view only the duplicate ids
dupes_all = ids.value_counts().to_frame()
dupes_all = dupes_all[dupes_all['id'] > 1]
dupes_all

Unnamed: 0,id
8832509,4
14099231,4
14254089,4
12618089,4


In [31]:
dupes_all_ids = dupes_all.index.to_list()
dupes_all_ids

[8832509, 14099231, 14254089, 12618089]

In [32]:
dupes_df = df[df.id.isin(dupes_all_ids)].sort_values(['id', 'search_page', 'publish_query', 'embargo_date'])
dupes_df[['id', 'search_page', 'publish_query', 'embargo_date']]

Unnamed: 0,id,search_page,publish_query,embargo_date
9005,8832509,1,2019-08-16,2019-11-09
9006,8832509,1,2019-08-16,2019-11-09
9003,8832509,9,1950-01-01,2019-11-09
9004,8832509,9,1950-01-01,2019-11-09
13009,12618089,4,2019-08-16,
13010,12618089,4,2019-08-16,
13011,12618089,5,2019-08-16,
13012,12618089,5,2019-08-16,
3136,14099231,1,2019-08-16,
3137,14099231,1,2019-08-16,


In [33]:
#there's no way the 1950-01-01 dates are accurate
#and we can ignore the missing (None) embargo dates
#so the differences are in search_page and publish_query
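The rule these observations point to, keeping one row per `id` with the most recent `publish_query`, can also be written with `drop_duplicates`, which avoids relying on how `groupby(...).nth()` indexes its result. This is a sketch on hypothetical toy data, not the template's own method; it assumes `publish_query` holds ISO-formatted date strings, which sort chronologically as text:

```python
import pandas as pd

# Hypothetical extract: id 10 appears three times with two different query dates.
toy = pd.DataFrame({
    'id': [10, 10, 20, 10],
    'publish_query': ['1950-01-01', '2019-08-16', '1950-01-01', '2019-08-16'],
    'search_page': [9, 1, 3, 1],
})

latest = (
    toy.sort_values(['id', 'publish_query'], ascending=[True, False])  # newest date first within each id
       .drop_duplicates(subset='id', keep='first')                     # keep that newest row
       .reset_index(drop=True)
)
print(latest)
```

Each `id` appears exactly once in `latest`, carrying its most recent `publish_query`.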

In [34]:
#sort by id and publish_query in descending order, group by id, and keep the first row per id (the most recent publish_query)
df_use = df.sort_values(['id', 'publish_query'], ascending=False).groupby('id').nth(0).reset_index()
df_use

Unnamed: 0,id,title_search,doi_search,handle_search,url_search,published_date_search,thumb_search,defined_type_search,defined_type_name_search,group_id_search,url_private_api_search,url_public_api_search,url_private_html_search,url_public_html_search,timeline_search,resource_title_search,resource_doi_search,search_page,publish_query,authors,categories,citation,confidential_reason,created_date,custom_fields,defined_type_metadata,defined_type_name_metadata,description,doi_metadata,embargo_date,embargo_options,embargo_reason,embargo_title,embargo_type,figshare_url,files,funding,funding_list,group_id_metadata,handle_metadata,has_linked_file,is_confidential,is_embargoed,is_metadata_record,is_public,license,metadata_reason,modified_date,published_date_metadata,references,resource_doi_metadata,resource_title_metadata,size,status,tags,thumb_metadata,timeline_metadata,title_metadata,url_metadata,url_private_api_metadata,url_private_html_metadata,url_public_api_metadata,url_public_html_metadata,version
0,114867,Genome-Wide Screens for <em>In Vivo</em> Tinma...,10.1371/journal.pgen.1003195,,https://api.figshare.com/v2/articles/114867,2013-01-10T01:21:07Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,98.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/114867,https://figshare.com/account/articles/114867,https://plos.figshare.com/articles/dataset/Gen...,"{'posted': '2013-01-10T01:21:07', 'firstOnline...",Genome-Wide Screens for <em>In Vivo</em> Tinma...,10.1371/journal.pgen.1003195,1,1950-01-01,"[{'id': 65601, 'full_name': 'Hong Jin', 'is_ac...","[{'id': 13, 'title': 'Genetics', 'parent_id': ...","Jin, Hong; Stojnic, Robert; Adryan, Boris; Ozd...",,2013-01-10T01:21:07Z,[],3,dataset,<div><p>The NK homeodomain factor Tinman is a ...,10.1371/journal.pgen.1003195,,[],,,,https://plos.figshare.com/articles/dataset/Gen...,"[{'id': 278296, 'name': 'Figure_S1.tif', 'size...",,[],98.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-19T11:42:14Z,2013-01-10T01:21:07Z,[],10.1371/journal.pgen.1003195,Genome-Wide Screens for <em>In Vivo</em> Tinma...,51942464,public,"[genome-wide, screens, tinman, binding, sites,...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2013-01-10T01:21:07', 'firstOnline...",Genome-Wide Screens for <em>In Vivo</em> Tinma...,https://api.figshare.com/v2/articles/114867,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/114867,https://api.figshare.com/v2/articles/114867,https://plos.figshare.com/articles/dataset/Gen...,1
1,116703,PROSPER: An Integrated Feature-Based Tool for ...,10.1371/journal.pone.0050300,,https://api.figshare.com/v2/articles/116703,2012-11-29T01:51:43Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,107.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/116703,https://figshare.com/account/articles/116703,https://plos.figshare.com/articles/dataset/PRO...,"{'posted': '2012-11-29T01:51:43', 'firstOnline...",PROSPER: An Integrated Feature-Based Tool for ...,10.1371/journal.pone.0050300,1,1950-01-01,"[{'id': 59322, 'full_name': 'Jiangning Song', ...","[{'id': 52, 'title': 'Information and Computin...","Song, Jiangning; Tan, Hao; J. Perry, Andrew; A...",,2012-11-29T01:51:43Z,[],3,dataset,<div><p>The ability to catalytically cleave pr...,10.1371/journal.pone.0050300,,[],,,,https://plos.figshare.com/articles/dataset/PRO...,"[{'id': 287843, 'name': 'Figure_S1.tif', 'size...",,[],107.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2012-11-29T01:51:51Z,2012-11-29T01:51:43Z,[],10.1371/journal.pone.0050300,PROSPER: An Integrated Feature-Based Tool for ...,4188905,public,"[feature-based, predicting, protease, substrat...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2012-11-29T01:51:43', 'firstOnline...",PROSPER: An Integrated Feature-Based Tool for ...,https://api.figshare.com/v2/articles/116703,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/116703,https://api.figshare.com/v2/articles/116703,https://plos.figshare.com/articles/dataset/PRO...,1
2,116813,Exploiting Genomic Knowledge in Optimising Mol...,10.1371/journal.pone.0048862,,https://api.figshare.com/v2/articles/116813,2012-11-21T01:53:33Z,,3,dataset,107.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/116813,https://figshare.com/account/articles/116813,https://plos.figshare.com/articles/dataset/Exp...,"{'posted': '2012-11-21T01:53:33', 'firstOnline...",Exploiting Genomic Knowledge in Optimising Mol...,10.1371/journal.pone.0048862,1,1950-01-01,"[{'id': 116761, 'full_name': 'Steve O'Hagan', ...","[{'id': 21, 'title': 'Biotechnology', 'parent_...","O'Hagan, Steve; Knowles, Joshua; Kell, Douglas...",,2012-11-21T01:53:33Z,[],3,dataset,<div><p>Comparatively few studies have address...,10.1371/journal.pone.0048862,,[],,,,https://plos.figshare.com/articles/dataset/Exp...,"[{'id': 288459, 'name': 'File_S1.xlsx', 'size'...",,[],107.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-19T09:42:08Z,2012-11-21T01:53:33Z,[],10.1371/journal.pone.0048862,Exploiting Genomic Knowledge in Optimising Mol...,1092903,public,"[exploiting, genomic, optimising, molecular, b...",,"{'posted': '2012-11-21T01:53:33', 'firstOnline...",Exploiting Genomic Knowledge in Optimising Mol...,https://api.figshare.com/v2/articles/116813,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/116813,https://api.figshare.com/v2/articles/116813,https://plos.figshare.com/articles/dataset/Exp...,1
3,117032,Text S1 - <em>e</em>Thread: A Highly Optimized...,10.1371/journal.pone.0050200.s001,,https://api.figshare.com/v2/articles/117032,2012-11-21T01:57:12Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,107.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/117032,https://figshare.com/account/articles/117032,https://plos.figshare.com/articles/dataset/_e_...,"{'posted': '2012-11-21T01:57:12', 'firstOnline...",<em>e</em>Thread: A Highly Optimized Machine L...,10.1371/journal.pone.0050200,1,1950-01-01,"[{'id': 119287, 'full_name': 'Michal Brylinski...","[{'id': 52, 'title': 'Information and Computin...","Brylinski, Michal; Lingam, Daswanth (2015): Te...",,2012-11-21T01:57:12Z,[],3,dataset,<p>\n <b>Calculation of the burial ...,10.1371/journal.pone.0050200.s001,,[],,,,https://plos.figshare.com/articles/dataset/_e_...,"[{'id': 289627, 'name': 'Text_S1.pdf', 'size':...",,[],107.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2020-04-25T08:25:26Z,2012-11-21T01:57:12Z,[],10.1371/journal.pone.0050200,<em>e</em>Thread: A Highly Optimized Machine L...,0,public,"[optimized, learning-based, meta-threading, mo...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2012-11-21T01:57:12', 'firstOnline...",Text S1 - <em>e</em>Thread: A Highly Optimized...,https://api.figshare.com/v2/articles/117032,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/117032,https://api.figshare.com/v2/articles/117032,https://plos.figshare.com/articles/dataset/_e_...,1
4,117391,A Machine Learning Approach for Identifying Am...,10.1371/journal.pone.0049538,,https://api.figshare.com/v2/articles/117391,2012-11-14T02:03:11Z,https://s3-eu-west-1.amazonaws.com/ppreviews-p...,3,dataset,107.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/117391,https://figshare.com/account/articles/117391,https://plos.figshare.com/articles/dataset/A_M...,"{'posted': '2012-11-14T02:03:11', 'firstOnline...",A Machine Learning Approach for Identifying Am...,10.1371/journal.pone.0049538,1,1950-01-01,"[{'id': 120312, 'full_name': 'Alexander G. Hol...","[{'id': 48, 'title': 'Biological Sciences', 'p...","G. Holman, Alexander; Gabuzda, Dana (2016): A ...",,2012-11-14T02:03:11Z,[],3,dataset,<div><p>The identification of nucleotide seque...,10.1371/journal.pone.0049538,,[],,,,https://plos.figshare.com/articles/dataset/A_M...,"[{'id': 291521, 'name': 'Figure_S1.pdf', 'size...",,[],107.0,,0,0,0,0,1,"{'value': 1, 'name': 'CC BY 4.0', 'url': 'http...",,2016-01-19T09:42:00Z,2012-11-14T02:03:11Z,[],10.1371/journal.pone.0049538,A Machine Learning Approach for Identifying Am...,2850456,public,"[identifying, amino, signatures, hiv, predicti...",https://s3-eu-west-1.amazonaws.com/ppreviews-p...,"{'posted': '2012-11-14T02:03:11', 'firstOnline...",A Machine Learning Approach for Identifying Am...,https://api.figshare.com/v2/articles/117391,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/117391,https://api.figshare.com/v2/articles/117391,https://plos.figshare.com/articles/dataset/A_M...,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10342,17113517,Supplementary Table 2: Unsupervised learning o...,10.25402/fon.17113517.v2,,https://api.figshare.com/v2/articles/17113517,2021-12-02T15:58:43Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113517,https://figshare.com/account/articles/17113517,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:58:43', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:58:43Z,[],3,dataset,<div>Supplementary Table 2. Unsupervised lear...,10.25402/FON.17113517.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645340, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:58:46Z,2021-12-02T15:58:43Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,77444,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:58:43', 'firstOnline...",Supplementary Table 2: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113517,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113517,https://api.figshare.com/v2/articles/17113517,https://future-science-group.figshare.com/arti...,2
10343,17113523,Supplementary Table 3: Unsupervised learning o...,10.25402/fon.17113523.v2,,https://api.figshare.com/v2/articles/17113523,2021-12-02T15:59:13Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113523,https://figshare.com/account/articles/17113523,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:13', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:13Z,[],3,dataset,<div>Supplementary Table 3. Unsupervised lear...,10.25402/FON.17113523.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645343, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:15Z,2021-12-02T15:59:13Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,185790,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:13', 'firstOnline...",Supplementary Table 3: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113523,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113523,https://api.figshare.com/v2/articles/17113523,https://future-science-group.figshare.com/arti...,2
10344,17113529,Supplementary Table 4: Unsupervised learning o...,10.25402/fon.17113529.v2,,https://api.figshare.com/v2/articles/17113529,2021-12-02T15:59:36Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113529,https://figshare.com/account/articles/17113529,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:36', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:36Z,[],3,dataset,<div>Supplementary Table 4. Unsupervised lear...,10.25402/FON.17113529.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645349, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:38Z,2021-12-02T15:59:36Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,137646,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:36', 'firstOnline...",Supplementary Table 4: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113529,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113529,https://api.figshare.com/v2/articles/17113529,https://future-science-group.figshare.com/arti...,2
10345,17113550,Supplementary Table 5: Unsupervised learning o...,10.25402/fon.17113550.v2,,https://api.figshare.com/v2/articles/17113550,2021-12-02T15:59:52Z,https://s3-eu-west-1.amazonaws.com/ppreviews-f...,3,dataset,21381.0,https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/17113550,https://figshare.com/account/articles/17113550,https://future-science-group.figshare.com/arti...,"{'posted': '2021-12-02T15:59:52', 'firstOnline...",Unsupervised learning of cross-modal mappings ...,10.2217/fon-2021-1059,7,2021-03-21,"[{'id': 511385, 'full_name': 'Jianmin Xu', 'is...","[{'id': 387, 'title': 'Solid Tumours', 'parent...","Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...",,2021-12-02T15:59:52Z,[],3,dataset,<div>Supplementary Table 5. Unsupervised lear...,10.25402/FON.17113550.v2,,[],,,file,https://future-science-group.figshare.com/arti...,"[{'id': 31645355, 'name': 'Supplementary Table...",,[],21381.0,,0,0,0,0,1,"{'value': 12, 'name': 'CC BY-NC-ND 4.0', 'url'...",,2021-12-02T15:59:54Z,2021-12-02T15:59:52Z,[],10.2217/fon-2021-1059,Unsupervised learning of cross-modal mappings ...,109962,public,"[machine learning, gastric cancer, disease mod...",https://s3-eu-west-1.amazonaws.com/ppreviews-f...,"{'posted': '2021-12-02T15:59:52', 'firstOnline...",Supplementary Table 5: Unsupervised learning o...,https://api.figshare.com/v2/articles/17113550,https://api.figshare.com/v2/account/articles/1...,https://figshare.com/account/articles/17113550,https://api.figshare.com/v2/articles/17113550,https://future-science-group.figshare.com/arti...,2


In [35]:
df_use[df_use.id.isin(dupes_all_ids)].sort_values(['id', 'publish_query'])[['id','publish_query']]

Unnamed: 0,id,publish_query
3726,8832509,2019-08-16
5668,12618089,2019-08-16
7509,14099231,2019-08-16
7793,14254089,2021-03-21


In [36]:
#for matching other notebooks, rename 'df_use' back to 'df'
df = df_use

In [37]:
ids = df.FigshareArticlesCrosswalk.unique_identifier

In [38]:
print(f'There are {len(ids)} items in the Figshare extract after removing duplicates, with {ids.nunique()} unique IDs.')

There are 10347 items in the Figshare extract after removing duplicates, with 10347 unique IDs.


## 2. See the "Licenses offered" tab in /Working documents/Licenses sheet for list of licenses by repo.

## Given the type(s) of license(s) offered by the repo, how many of each type are assigned?
**Property:** License

In [39]:
licenses = df.FigshareArticlesCrosswalk.license
licenses

0              CC BY 4.0
1              CC BY 4.0
2              CC BY 4.0
3              CC BY 4.0
4              CC BY 4.0
              ...       
10342    CC BY-NC-ND 4.0
10343    CC BY-NC-ND 4.0
10344    CC BY-NC-ND 4.0
10345    CC BY-NC-ND 4.0
10346       CC BY-NC 4.0
Name: license, Length: 10347, dtype: object

In [40]:
license_counts = licenses.value_counts().to_frame()
license_counts['percent'] = license_counts['license']/len(licenses) * 100
license_counts

Unnamed: 0,license,percent
CC BY 4.0,8004,77.355755
CC BY + CC0,918,8.872137
CC BY-NC 4.0,880,8.504881
CC0,251,2.425824
MIT,140,1.353049
GPL 3.0+,45,0.434909
Apache 2.0,27,0.260945
CC BY-NC-ND 4.0,17,0.164299
In Copyright,15,0.14497
CC BY,8,0.077317


## 3. What is the mean number of characters (excluding whitespace, if possible) per object?
**Property:** Description
**Related function:** `mean_characters`

In [41]:
description = df.FigshareArticlesCrosswalk.description
description

0        <div><p>The NK homeodomain factor Tinman is a ...
1        <div><p>The ability to catalytically cleave pr...
2        <div><p>Comparatively few studies have address...
3        <p>\n            <b>Calculation of the burial ...
4        <div><p>The identification of nucleotide seque...
                               ...                        
10342    <div>Supplementary Table 2.  Unsupervised lear...
10343    <div>Supplementary Table 3.  Unsupervised lear...
10344    <div>Supplementary Table 4.  Unsupervised lear...
10345    <div>Supplementary Table 5.  Unsupervised lear...
10346    Angiotensin\nconverting enzyme-I (ACE-I) is a ...
Name: description, Length: 10347, dtype: object

In [42]:
description.describe()

count             10347
unique             8146
top       <p>(XLSX)</p>
freq                 74
Name: description, dtype: object

In [43]:
print(f'Number of null descriptions: {sum(description.isna())}')

Number of null descriptions: 0


In [44]:
print(f'{analysis.mean_characters(description)} mean characters')

862.0527866437516 mean characters
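
The implementation of `analysis.mean_characters` is not shown in this notebook, so it is unclear from the output alone whether whitespace or HTML markup is excluded. As a rough illustration only, a helper of this kind could look like the sketch below. The name `mean_characters_sketch` and its `exclude_whitespace` parameter are assumptions, not the project's API.

```python
import re

def mean_characters_sketch(texts, exclude_whitespace=True):
    """Mean character count per text; None entries count as empty strings."""
    lengths = []
    for text in texts:
        text = text or ""
        if exclude_whitespace:
            # Remove spaces, tabs, and newlines before counting.
            text = re.sub(r"\s+", "", text)
        lengths.append(len(text))
    return sum(lengths) / len(lengths) if lengths else 0.0

# Two description strings in the style of the output above.
sample = ["<p>(XLSX)</p>", "Angiotensin\nconverting enzyme-I"]
print(mean_characters_sketch(sample))
```

In this sketch HTML tags such as `<div>` and `<p>` still count as characters, so descriptions with heavy markup would look longer than their visible text.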


## 4. What is the mean number of characters (excluding whitespace, if possible) per object?
**Property:** Methods
**Related function:** `mean_characters`

In [45]:
methods = df.FigshareArticlesCrosswalk.methods
methods

In [46]:
#confirm missing for this repo
print(df.FigshareArticlesCrosswalk.methods)

None
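
Some crosswalk properties, such as Methods here, are not available for this repository and come back as `None`. A quick way to list them up front is sketched below. The helper name `missing_properties` and the `SimpleNamespace` stand-in are illustrative assumptions; the sketch relies only on attribute access, not on how `FigshareArticlesCrosswalk` is implemented.

```python
from types import SimpleNamespace

def missing_properties(accessor, names):
    """Return the property names whose value on `accessor` is None (or absent)."""
    return [name for name in names if getattr(accessor, name, None) is None]

# Stand-in for the crosswalk accessor (hypothetical values).
fake_accessor = SimpleNamespace(description=["some text"], methods=None, views=None)
print(missing_properties(fake_accessor, ["description", "methods", "views"]))
```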


## 5. What are the min and max publication dates for each repo?

## How many objects were published each year for each repo?
**Property:** Publication date

In [47]:
publication_dates = df.FigshareArticlesCrosswalk.publication_date
publication_dates

0        2016-01-19T11:42:14
1        2012-11-29T01:51:51
2        2016-01-19T09:42:08
3        2015-12-02T12:40:08
4        2016-01-19T09:42:00
                ...         
10342    2021-12-02T15:12:45
10343    2021-12-02T15:12:45
10344    2021-12-02T15:12:45
10345    2021-12-02T15:12:45
10346    2021-12-02T18:06:11
Name: timeline_metadata, Length: 10347, dtype: object

In [48]:
#min and max publication date
publication_dates.min(), publication_dates.max()

('2011-11-03T00:28:42', '2021-12-02T18:06:11')
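
Calling `min()` and `max()` on the raw strings works here because every timestamp uses the same fixed-width ISO 8601 layout, so alphabetical order matches chronological order. The sketch below checks that on a few timestamps copied from the outputs above; the variable names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the outputs above.
stamps = [
    "2016-01-19T11:42:14",
    "2011-11-03T00:28:42",
    "2021-12-02T18:06:11",
    "2012-11-29T01:51:51",
]

# Compare as plain strings and as parsed datetimes.
earliest_by_string = min(stamps)
earliest_by_datetime = min(stamps, key=datetime.fromisoformat)
latest_by_string = max(stamps)
latest_by_datetime = max(stamps, key=datetime.fromisoformat)

print(earliest_by_string == earliest_by_datetime)
print(latest_by_string == latest_by_datetime)
```

If a repository mixed formats or time zone offsets, the strings would need to be parsed before comparing.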

In [49]:
#objects per year
pd.to_datetime(publication_dates).dt.year.value_counts().sort_index()

2011       1
2012       1
2013      10
2014      16
2015     467
2016     866
2017     442
2018    1089
2019    1582
2020    2470
2021    3403
Name: timeline_metadata, dtype: int64

In [50]:
#export for plotting
pub_dates_export = pd.to_datetime(publication_dates).dt.year.value_counts().sort_index().to_frame()

In [51]:
#update column names
pub_dates_export_ready = pub_dates_export.reset_index(level=0)
pub_dates_export_ready.columns = ['year', 'count']

In [52]:
#add column with name of repo
pub_dates_export_ready['repo'] = 'figshare_subset'

In [53]:
pub_dates_export_ready

Unnamed: 0,year,count,repo
0,2011,1,figshare_subset
1,2012,1,figshare_subset
2,2013,10,figshare_subset
3,2014,16,figshare_subset
4,2015,467,figshare_subset
5,2016,866,figshare_subset
6,2017,442,figshare_subset
7,2018,1089,figshare_subset
8,2019,1582,figshare_subset
9,2020,2470,figshare_subset
10,2021,3403,figshare_subset


In [54]:
#export to Figures folder
pub_dates_export_ready.to_csv(Path('..', '..', 'Figures', 'Figure1', 'repository_dates', 'figshare_subset_pub_years.csv'))

## 6. What are the unweighted mean, median, and max file sizes among all ingested files?
**Property:** File size
**Related function:** `get_summary_statistics`

We first get the file size attribute using the crosswalk.

In [55]:
file_sizes = df.FigshareArticlesCrosswalk.file_size
file_sizes

0        [6866146, 9545372, 6460828, 1569434, 9797854, ...
1        [837800, 289792, 42496, 56832, 67072, 71168, 7...
2          [239186, 227786, 170104, 225301, 152663, 77863]
3                                                 [158279]
4                                 [2296307, 497404, 56745]
                               ...                        
10342                                              [77444]
10343                                             [185790]
10344                                             [137646]
10345                                             [109962]
10346                                             [153877]
Name: files, Length: 10347, dtype: object

In [56]:
#replace None values with empty list
file_sizes = file_sizes.apply(lambda d: d if isinstance(d, list) else [])
file_sizes

0        [6866146, 9545372, 6460828, 1569434, 9797854, ...
1        [837800, 289792, 42496, 56832, 67072, 71168, 7...
2          [239186, 227786, 170104, 225301, 152663, 77863]
3                                                 [158279]
4                                 [2296307, 497404, 56745]
                               ...                        
10342                                              [77444]
10343                                             [185790]
10344                                             [137646]
10345                                             [109962]
10346                                             [153877]
Name: files, Length: 10347, dtype: object

In [57]:
#collapse into single column - each file size in own row, since we are interested in summary stats across all files
file_sizes_long = file_sizes.explode()
file_sizes_long

0        6866146
0        9545372
0        6460828
0        1569434
0        9797854
          ...   
10342      77444
10343     185790
10344     137646
10345     109962
10346     153877
Name: files, Length: 31926, dtype: object

In [58]:
#drop NaN values, so median calculates correctly
file_sizes_long = file_sizes_long.dropna()
file_sizes_long

0        6866146
0        9545372
0        6460828
0        1569434
0        9797854
          ...   
10342      77444
10343     185790
10344     137646
10345     109962
10346     153877
Name: files, Length: 31894, dtype: object
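
`Series.explode()` emits one row per list element and turns an empty list into a single `NaN` row. The 32 rows removed by `dropna()` above are therefore consistent with objects that have no files (or files with no recorded size). A plain-Python sketch of the same flattening, using made-up sizes, shows why no missing-value step is needed when empty lists are simply skipped:

```python
from itertools import chain

# Hypothetical per-object file sizes in bytes; the empty list is an object without files.
sizes_per_object = [[6866146, 9545372], [], [158279]]

# Flattening skips empty lists, so there is no missing-value row to drop afterwards.
all_sizes = list(chain.from_iterable(sizes_per_object))
objects_without_files = sum(1 for sizes in sizes_per_object if not sizes)

print(all_sizes)
print(objects_without_files)
```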

In [59]:
#get summary statistics
analysis.get_summary_statistics(file_sizes_long)

{'mean': 147661899.27497336, 'median': 157358.0, 'max': 570679833692}
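
The raw byte counts are hard to read at a glance. The sketch below pairs a stand-in for the summary helper with a formatter that uses decimal (SI) units. Both function names are assumptions for illustration; the real `analysis.get_summary_statistics` may be implemented differently.

```python
import statistics

def summary_statistics_sketch(values):
    """Return mean, median, and max of a non-empty sequence of numbers."""
    values = list(values)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
    }

def format_bytes(num_bytes):
    """Format a byte count using decimal (SI) units, e.g. 1500 -> '1.5 kB'."""
    value = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if value < 1000 or unit == "TB":
            return f"{value:.1f} {unit}"
        value /= 1000

# A few file sizes taken from the outputs above.
stats = summary_statistics_sketch([158279, 77444, 185790, 6866146])
print({name: format_bytes(value) for name, value in stats.items()})
```

Applied to the figures reported above, the median file is about 157.4 kB, the mean about 147.7 MB, and the largest file about 570.7 GB, so a small number of very large files pull the mean far above the median.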

## 7. What are the mean, median, and max number of files per object?
**Property:** URL
**Related function:** `get_summary_statistics`

Objects without files have `None` here; these are replaced with an empty list so that they count as having zero files.

In [60]:
files = df.FigshareArticlesCrosswalk.url
files

0        [https://ndownloader.figshare.com/files/278296...
1        [https://ndownloader.figshare.com/files/287843...
2        [https://ndownloader.figshare.com/files/288459...
3          [https://ndownloader.figshare.com/files/289627]
4        [https://ndownloader.figshare.com/files/291521...
                               ...                        
10342    [https://ndownloader.figshare.com/files/31645340]
10343    [https://ndownloader.figshare.com/files/31645343]
10344    [https://ndownloader.figshare.com/files/31645349]
10345    [https://ndownloader.figshare.com/files/31645355]
10346    [https://ndownloader.figshare.com/files/31646025]
Name: files, Length: 10347, dtype: object

In [61]:
#replace None with empty list
files = files.apply(lambda d: d if isinstance(d, list) else [])
files

0        [https://ndownloader.figshare.com/files/278296...
1        [https://ndownloader.figshare.com/files/287843...
2        [https://ndownloader.figshare.com/files/288459...
3          [https://ndownloader.figshare.com/files/289627]
4        [https://ndownloader.figshare.com/files/291521...
                               ...                        
10342    [https://ndownloader.figshare.com/files/31645340]
10343    [https://ndownloader.figshare.com/files/31645343]
10344    [https://ndownloader.figshare.com/files/31645349]
10345    [https://ndownloader.figshare.com/files/31645355]
10346    [https://ndownloader.figshare.com/files/31646025]
Name: files, Length: 10347, dtype: object

In [62]:
#get files per object
files_counts = files.apply(len)
files_counts

0        12
1        14
2         6
3         1
4         3
         ..
10342     1
10343     1
10344     1
10345     1
10346     1
Name: files, Length: 10347, dtype: int64

In [63]:
#get summary statistics
analysis.get_summary_statistics(files_counts)

{'mean': 3.082439354402242, 'median': 1.0, 'max': 1100}

## 8. What are the mean, median, and max total dataset size (summed across all files) per object?
**Property:** Dataset size
**Related function:** `get_summary_statistics`

In [64]:
dataset_sizes = df.FigshareArticlesCrosswalk.dataset_size
dataset_sizes

0        [6866146, 9545372, 6460828, 1569434, 9797854, ...
1        [837800, 289792, 42496, 56832, 67072, 71168, 7...
2          [239186, 227786, 170104, 225301, 152663, 77863]
3                                                 [158279]
4                                 [2296307, 497404, 56745]
                               ...                        
10342                                              [77444]
10343                                             [185790]
10344                                             [137646]
10345                                             [109962]
10346                                             [153877]
Name: files, Length: 10347, dtype: object

In [65]:
#replace None values with empty list, so that the summed size for those objects is 0
dataset_sizes = dataset_sizes.apply(lambda d: d if isinstance(d, list) else [])
dataset_sizes

0        [6866146, 9545372, 6460828, 1569434, 9797854, ...
1        [837800, 289792, 42496, 56832, 67072, 71168, 7...
2          [239186, 227786, 170104, 225301, 152663, 77863]
3                                                 [158279]
4                                 [2296307, 497404, 56745]
                               ...                        
10342                                              [77444]
10343                                             [185790]
10344                                             [137646]
10345                                             [109962]
10346                                             [153877]
Name: files, Length: 10347, dtype: object

In [66]:
#sum up size of files within object (sum up within each list in series)
dataset_sizes_total = dataset_sizes.apply(sum)
dataset_sizes_total

0        51942464
1         4188905
2         1092903
3          158279
4         2850456
           ...   
10342       77444
10343      185790
10344      137646
10345      109962
10346      153877
Name: files, Length: 10347, dtype: int64

In [67]:
#get summary statistics
analysis.get_summary_statistics(dataset_sizes_total, suppress_output=False);

mean: 455158849.47095776
median: 72567.0
max: 675675420197


## 9. How many of each scientific domain are assigned?
**Property:** Domain
**Related function:** `domains.value_counts()`

In [68]:
domains = df.FigshareArticlesCrosswalk.domain
domains

0                        [Genetics, Developmental Biology]
1        [Information and Computing Sciences, Biologica...
2        [Biotechnology, Biological Sciences, Informati...
3        [Information and Computing Sciences, Biologica...
4              [Biological Sciences, Cancer, Neuroscience]
                               ...                        
10342                                      [Solid Tumours]
10343                                      [Solid Tumours]
10344                                      [Solid Tumours]
10345                                      [Solid Tumours]
10346    [Biochemistry, Molecular Biology, Pharmacology...
Name: categories, Length: 10347, dtype: object

In [69]:
#expand so one domain per row
domains_all = domains.explode()
domains_all

0                                            Genetics
0                               Developmental Biology
1                  Information and Computing Sciences
1                                 Biological Sciences
1                                        Biochemistry
                             ...                     
10346                                    Pharmacology
10346      Chemical Sciences not elsewhere classified
10346    Biological Sciences not elsewhere classified
10346    Information Systems not elsewhere classified
10346                                        Virology
Name: categories, Length: 66847, dtype: object

In [70]:
domains_counts = domains_all.value_counts().to_frame()
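#percent of objects (not of domain assignments), so the percents can sum to more than 100
domains_counts['percent'] = domains_counts['categories']/len(domains) * 100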
domains_counts.head(10)

Unnamed: 0,categories,percent
Biological Sciences not elsewhere classified,4292,41.480622
Information Systems not elsewhere classified,3009,29.080893
Genetics,2210,21.358848
Biotechnology,1846,17.84092
Cancer,1816,17.550981
Medicine,1708,16.5072
Chemical Sciences not elsewhere classified,1582,15.289456
Science Policy,1524,14.728907
Biochemistry,1488,14.38098
Mathematical Sciences not elsewhere classified,1449,14.004059
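
Because one object can carry several domains, the percent column is relative to the number of objects rather than the number of domain assignments, and the percents can add up to more than 100. A small sketch with made-up objects illustrates this; all names and values in it are hypothetical.

```python
from collections import Counter

# Hypothetical objects, each tagged with one or more domains.
domains_per_object = [
    ["Genetics", "Developmental Biology"],
    ["Genetics", "Cancer"],
    ["Cancer"],
    ["Genetics"],
]

# One count per (object, domain) assignment, as after explode().
domain_counts = Counter(
    domain for domains in domains_per_object for domain in domains
)

# Divide by the number of objects, not by the number of assignments.
n_objects = len(domains_per_object)
percent_of_objects = {
    domain: count / n_objects * 100
    for domain, count in domain_counts.most_common()
}

print(percent_of_objects)
print(sum(percent_of_objects.values()))
```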


## 10. What is the mean number of characters (excluding whitespace, if possible) per object?
**Property:** Technical details
**Related function:** `mean_characters`

In [71]:
# "usage notes" is not in crosswalk, none for this repo

## 11-13. What are the mean and median total number of keyword terms per object, after merging results for Keyword, Geographic keyword, and Scientific keyword?
**Property:** Keyword

In [72]:
print(df.FigshareArticlesCrosswalk.keyword)

0        [genome-wide, screens, tinman, binding, sites,...
1        [feature-based, predicting, protease, substrat...
2        [exploiting, genomic, optimising, molecular, b...
3        [optimized, learning-based, meta-threading, mo...
4        [identifying, amino, signatures, hiv, predicti...
                               ...                        
10342    [machine learning, gastric cancer, disease mod...
10343    [machine learning, gastric cancer, disease mod...
10344    [machine learning, gastric cancer, disease mod...
10345    [machine learning, gastric cancer, disease mod...
10346    [key therapeutic target, hypertensive peptide ...
Name: tags, Length: 10347, dtype: object


In [73]:
print(df.FigshareArticlesCrosswalk.geographic_keyword)

0        [genome-wide, screens, tinman, binding, sites,...
1        [feature-based, predicting, protease, substrat...
2        [exploiting, genomic, optimising, molecular, b...
3        [optimized, learning-based, meta-threading, mo...
4        [identifying, amino, signatures, hiv, predicti...
                               ...                        
10342    [machine learning, gastric cancer, disease mod...
10343    [machine learning, gastric cancer, disease mod...
10344    [machine learning, gastric cancer, disease mod...
10345    [machine learning, gastric cancer, disease mod...
10346    [key therapeutic target, hypertensive peptide ...
Name: tags, Length: 10347, dtype: object


In [74]:
print(df.FigshareArticlesCrosswalk.scientific_keyword)

0        [genome-wide, screens, tinman, binding, sites,...
1        [feature-based, predicting, protease, substrat...
2        [exploiting, genomic, optimising, molecular, b...
3        [optimized, learning-based, meta-threading, mo...
4        [identifying, amino, signatures, hiv, predicti...
                               ...                        
10342    [machine learning, gastric cancer, disease mod...
10343    [machine learning, gastric cancer, disease mod...
10344    [machine learning, gastric cancer, disease mod...
10345    [machine learning, gastric cancer, disease mod...
10346    [key therapeutic target, hypertensive peptide ...
Name: tags, Length: 10347, dtype: object


In [75]:
#confirm all identical
keywords1 = df.FigshareArticlesCrosswalk.keyword
keywords2 = df.FigshareArticlesCrosswalk.geographic_keyword
keywords3 = df.FigshareArticlesCrosswalk.scientific_keyword

In [76]:
print(keywords1.equals(keywords2))
print(keywords2.equals(keywords3))

True
True
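
All three keyword properties point to the same `tags` field for this repository, so there is nothing to merge here. For a repository where they differ, a merge that avoids double counting could be sketched as below. The function name, the case-insensitive matching, and the sample keywords are assumptions for illustration.

```python
def merge_keywords(*keyword_lists):
    """Merge keyword lists, keeping first-seen order and dropping duplicates."""
    merged = {}
    for keywords in keyword_lists:
        for keyword in keywords or []:  # treat None as "no keywords"
            # Assumption: keywords that differ only by case or padding are the same term.
            merged.setdefault(keyword.strip().lower(), keyword)
    return list(merged.values())

# Hypothetical keyword lists for one object.
general = ["machine learning", "gastric cancer"]
geographic = ["China"]
scientific = ["Machine Learning", "disease models"]

merged = merge_keywords(general, geographic, scientific)
print(merged)
print(len(merged))
```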


In [77]:
#use keyword column since all keyword columns are identical
keywords = df.FigshareArticlesCrosswalk.keyword
keywords

0        [genome-wide, screens, tinman, binding, sites,...
1        [feature-based, predicting, protease, substrat...
2        [exploiting, genomic, optimising, molecular, b...
3        [optimized, learning-based, meta-threading, mo...
4        [identifying, amino, signatures, hiv, predicti...
                               ...                        
10342    [machine learning, gastric cancer, disease mod...
10343    [machine learning, gastric cancer, disease mod...
10344    [machine learning, gastric cancer, disease mod...
10345    [machine learning, gastric cancer, disease mod...
10346    [key therapeutic target, hypertensive peptide ...
Name: tags, Length: 10347, dtype: object

In [78]:
keywords_counts = keywords.apply(len)
keywords_counts

0         8
1         6
2         8
3         6
4         6
         ..
10342     9
10343     9
10344     9
10345     9
10346    32
Name: tags, Length: 10347, dtype: int64

In [79]:
#get summary statistics
analysis.get_summary_statistics(keywords_counts, suppress_output = False);

mean: 9.812602686769111
median: 7.0
max: 61


## 14. Who are the most common funding agencies for each repo? What are the object counts per agency?
**Property:** Funding Agency

In [80]:
funders = df.FigshareArticlesCrosswalk.funding_agency
funders

0        []
1        []
2        []
3        []
4        []
         ..
10342    []
10343    []
10344    []
10345    []
10346    []
Name: funding_list, Length: 10347, dtype: object

In [81]:
#may be more than one funder per object, so expand
funders_long = funders.explode()
funders_long

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
10342    NaN
10343    NaN
10344    NaN
10345    NaN
10346    NaN
Name: funding_list, Length: 10563, dtype: object

In [82]:
funders_counts = funders_long.value_counts().to_frame()
funders_counts['percent'] = funders_counts['funding_list']/len(funders) * 100
funders_counts

Unnamed: 0,funding_list,percent
National Science Foundation,28,0.270610
CSU Animal Cancer Shipley Chair in Comparative Oncology,25,0.241616
National Natural Science Foundation of China,19,0.183628
"Foundation for the National Institutes of Health Defense Sciences Office, DARPA",18,0.173963
National Institutes of Health,15,0.144970
...,...,...
Google daydream,1,0.009665
ERA-NET NEURON NeuroNiche project 01EW1708,1,0.009665
ERC advanced grant (Zf-BrainReg),1,0.009665
Alexander-von-Humboldt Stiftung,1,0.009665


## 15. What are the mean, median, and max number of Views per object?
**Property:** Views
**Related function:** `get_summary_statistics`

In [83]:
views = df.FigshareArticlesCrosswalk.views
views

In [84]:
#confirm none - variable not available for this repo
print(df.FigshareArticlesCrosswalk.views)

None


## 16. What are the mean, median, and max (total) number of downloads per object?
**Property:** Downloads
**Related function:** `get_summary_statistics`

In [85]:
downloads = df.FigshareArticlesCrosswalk.downloads
downloads

In [86]:
#confirm none - variable not available for this repo
print(df.FigshareArticlesCrosswalk.downloads)

None


## 17. What are the mean, median, and max Citation counts per object?
**Property:** Citation count
**Related function:** `get_summary_statistics`

In [87]:
citation_count = df.FigshareArticlesCrosswalk.citation_count
citation_count

In [88]:
#confirm none - variable not available for this repo
print(df.FigshareArticlesCrosswalk.citation_count)

None


## 18. How many objects contain each given resource type?
**Property:** Resource type

In [89]:
resource_types = df.FigshareArticlesCrosswalk.resource_type
resource_types

0        dataset
1        dataset
2        dataset
3        dataset
4        dataset
          ...   
10342    dataset
10343    dataset
10344    dataset
10345    dataset
10346    dataset
Name: defined_type_name_metadata, Length: 10347, dtype: object

In [90]:
resource_counts = resource_types.value_counts().to_frame()
resource_counts['percent'] = round(resource_counts['defined_type_name_metadata']/len(resource_types) * 100)
resource_counts

Unnamed: 0,defined_type_name_metadata,percent
dataset,10132,98.0
software,213,2.0
model,2,0.0
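Rounding to whole percentages makes the two `model` objects appear as 0.0. As a sketch of one alternative, `value_counts(normalize=True)` returns shares directly, and rounding to two decimals keeps small categories visible. The series below is rebuilt from the counts shown in the table above; the variable names are illustrative.

```python
import pandas as pd

# Rebuild a series with the same category counts as the table above
types_toy = pd.Series(['dataset'] * 10132 + ['software'] * 213 + ['model'] * 2)

# normalize=True returns fractions of the total; scale to percent and round
pct = (types_toy.value_counts(normalize=True) * 100).round(2)
print(pct)
```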


## 19. How many objects contain each type of file extension given?
**Property:** File Extension
**Related function:** `get_file_extensions`

In [91]:
files = df.FigshareArticlesCrosswalk.file_extension
files

0        [Figure_S1.tif, Figure_S2.tif, Figure_S3.tif, ...
1        [Figure_S1.tif, Table_S8.doc, Table_S7.doc, Ta...
2        [File_S1.xlsx, File_S2.xlsx, File_S3.xlsx, Fil...
3                                            [Text_S1.pdf]
4            [Figure_S1.pdf, Figure_S2.pdf, Table_S1.xlsx]
                               ...                        
10342                          [Supplementary Table 2.pdf]
10343                          [Supplementary Table 3.pdf]
10344                          [Supplementary Table 4.pdf]
10345                          [Supplementary Table 5.pdf]
10346                              [jf1c04555_si_001.xlsx]
Name: files, Length: 10347, dtype: object

In [92]:
#add ID column to make easier to count by object
files_ids = pd.concat([ids, files], axis = 1)
files_ids

Unnamed: 0,id,files
0,114867,"[Figure_S1.tif, Figure_S2.tif, Figure_S3.tif, ..."
1,116703,"[Figure_S1.tif, Table_S8.doc, Table_S7.doc, Ta..."
2,116813,"[File_S1.xlsx, File_S2.xlsx, File_S3.xlsx, Fil..."
3,117032,[Text_S1.pdf]
4,117391,"[Figure_S1.pdf, Figure_S2.pdf, Table_S1.xlsx]"
...,...,...
10342,17113517,[Supplementary Table 2.pdf]
10343,17113523,[Supplementary Table 3.pdf]
10344,17113529,[Supplementary Table 4.pdf]
10345,17113550,[Supplementary Table 5.pdf]


In [93]:
#replace the None values with empty lists so the count of string values evaluates to 0
files_ids = files_ids.apply(
    lambda row: row.apply(
        lambda cell: cell if cell else []
    ),
    axis=1
)
files_ids

Unnamed: 0,id,files
0,114867,"[Figure_S1.tif, Figure_S2.tif, Figure_S3.tif, ..."
1,116703,"[Figure_S1.tif, Table_S8.doc, Table_S7.doc, Ta..."
2,116813,"[File_S1.xlsx, File_S2.xlsx, File_S3.xlsx, Fil..."
3,117032,[Text_S1.pdf]
4,117391,"[Figure_S1.pdf, Figure_S2.pdf, Table_S1.xlsx]"
...,...,...
10342,17113517,[Supplementary Table 2.pdf]
10343,17113523,[Supplementary Table 3.pdf]
10344,17113529,[Supplementary Table 4.pdf]
10345,17113550,[Supplementary Table 5.pdf]
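Note that the row-wise replacement above tests every cell for truthiness, so it would also replace an `id` of 0 with an empty list. A column-targeted variant avoids that. This is a minimal sketch on toy data, assuming the `files` cells are either lists or `None`.

```python
import pandas as pd

# Toy frame: an id of 0 and a missing file list
toy = pd.DataFrame({'id': [0, 7], 'files': [['a.pdf'], None]})

# Only touch the files column; anything that is not a list becomes an empty list
toy['files'] = toy['files'].apply(lambda v: v if isinstance(v, list) else [])
print(toy)
```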


The following code extracts the full file extension (all dot-separated values after the first dot) for each file in an object's file list and collects the results in a set, so that we count the number of objects that contain a given extension rather than the number of files. Only unique extensions are kept: for instance, if an object has 5 .tif files, .tif appears only once in that object's list of file extensions.

In [94]:
files_extension_set = files_ids['files'].apply(
    lambda file_list: list({''.join(Path(file).suffixes) for file in file_list})
)
files_extension_set

0        [.tif, .pdf, .xls]
1              [.doc, .tif]
2             [.xlsx, .zip]
3                    [.pdf]
4             [.xlsx, .pdf]
                ...        
10342                [.pdf]
10343                [.pdf]
10344                [.pdf]
10345                [.pdf]
10346               [.xlsx]
Name: files, Length: 10347, dtype: object
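To make the behavior of the set-based extraction concrete, here is a small sketch using `Path.suffixes` on made-up file names. The helper name `unique_extensions` is an assumption for illustration. It also shows two caveats of taking everything after the first dot: file names containing extra dots yield compound "extensions", and upper- and lower-case extensions are treated as different.

```python
from pathlib import Path

def unique_extensions(file_list):
    """Return the sorted unique full extensions for one object's file names."""
    return sorted({''.join(Path(name).suffixes) for name in file_list})

# Five .tif files collapse to a single entry
print(unique_extensions(['f1.tif', 'f2.tif', 'f3.tif', 'f4.tif', 'f5.tif']))

# Multi-part extensions are kept whole
print(unique_extensions(['archive.tar.gz', 'notes.txt']))

# Caveats: a dot inside the name and differing case both produce new "extensions"
print(unique_extensions(['Table 2.1.pdf', 'Table_S1.PDF', 'Table_S2.pdf']))
```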

In [95]:
#expand so each unique extension within an object is its own row
files_ext_ids = files_extension_set.explode().to_frame()
files_ext_ids

Unnamed: 0,files
0,.tif
0,.pdf
0,.xls
1,.doc
1,.tif
...,...
10342,.pdf
10343,.pdf
10344,.pdf
10345,.pdf


In [96]:
#group by extension type to count how many objects have each file type
ext_grouped = files_ext_ids.groupby('files').value_counts().to_frame().sort_values(0, ascending = False)
ext_grouped['percent'] = round(ext_grouped[0]/len(files_ids) * 100)

In [97]:
#this is an ESTIMATE, pre-cleaning
ext_grouped[ext_grouped['percent'] >= 5]

Unnamed: 0_level_0,0,percent
files,Unnamed: 1_level_1,Unnamed: 2_level_1
.xls,2780,27.0
.xlsx,1504,15.0
.zip,1077,10.0
.docx,837,8.0
.pdf,703,7.0
.csv,533,5.0


In [98]:
ext_grouped.head(20)

Unnamed: 0_level_0,0,percent
files,Unnamed: 1_level_1,Unnamed: 2_level_1
.xls,2780,27.0
.xlsx,1504,15.0
.zip,1077,10.0
.docx,837,8.0
.pdf,703,7.0
.csv,533,5.0
.XLSX,436,4.0
.DOCX,347,3.0
.PDF,343,3.0
.txt,320,3.0
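The table counts `.xlsx` and `.XLSX` (and `.pdf` and `.PDF`) separately, which is one reason these figures are pre-cleaning estimates. One possible clean-up step, sketched below on toy data, is to lower-case the extensions and then drop duplicates within each object so that an object holding both `.pdf` and `.PDF` is still counted once. The actual clean-up applied after export is not shown in this walkthrough, so treat this as an assumption.

```python
import pandas as pd

# Toy long-format table: one row per (object index, extension)
ext_toy = pd.DataFrame({
    'index': [0, 0, 1, 2],
    'files': ['.pdf', '.PDF', '.XLSX', '.xlsx'],
})

# Normalize case, then keep one row per object and extension
ext_toy['files'] = ext_toy['files'].str.lower()
ext_toy = ext_toy.drop_duplicates(subset=['index', 'files'])

# Number of objects containing each extension
ext_counts_toy = ext_toy['files'].value_counts()
print(ext_counts_toy)
```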


In [99]:
#confirm accuracy - there should be 51 objects with .json files
len(files_ext_ids[files_ext_ids['files'] == '.json'] )

51

In [100]:
#export for further clean up, refining estimates, and plotting

In [101]:
ext_grouped_ready = files_ext_ids.reset_index(level=0)
ext_grouped_ready

Unnamed: 0,index,files
0,0,.tif
1,0,.pdf
2,0,.xls
3,1,.doc
4,1,.tif
...,...,...
14303,10342,.pdf
14304,10343,.pdf
14305,10344,.pdf
14306,10345,.pdf


In [102]:
#update column names (the index was reset in the previous cell)
ext_grouped_ready.columns = ['index', 'files']

#add column with name of repo
ext_grouped_ready['repo'] = 'figshare_subset'

ext_grouped_ready

Unnamed: 0,index,files,repo
0,0,.tif,figshare_subset
1,0,.pdf,figshare_subset
2,0,.xls,figshare_subset
3,1,.doc,figshare_subset
4,1,.tif,figshare_subset
...,...,...,...
14303,10342,.pdf,figshare_subset
14304,10343,.pdf,figshare_subset
14305,10344,.pdf,figshare_subset
14306,10345,.pdf,figshare_subset


In [103]:
ext_grouped_ready['index'].unique() #confirmed accurate

array([    0,     1,     2, ..., 10344, 10345, 10346], dtype=int64)

In [104]:
#export to Figures folder
ext_grouped_ready.to_csv('..\\..\\Figures\\Figure2\\file_ext_data\\figshare_subset_extensions.csv')
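The export path above uses Windows-style backslashes, which will not resolve on other operating systems. A portable alternative is to build the path with `pathlib`. The sketch below writes a toy frame into a temporary directory that stands in for the project root; the temporary directory and the toy data are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for ext_grouped_ready
ext_toy = pd.DataFrame({
    'index': [0, 0, 1],
    'files': ['.tif', '.pdf', '.doc'],
    'repo': 'figshare_subset',
})

# Temporary directory standing in for the project root
base_dir = Path(tempfile.mkdtemp())
out_path = base_dir / 'Figures' / 'Figure2' / 'file_ext_data' / 'figshare_subset_extensions.csv'

# Create the folder tree if it does not exist yet, then export
out_path.parent.mkdir(parents=True, exist_ok=True)
ext_toy.to_csv(out_path)
print(out_path.exists())
```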

## 19.5 How many files of each type of file extension are present?
**Property:** File extension

In [105]:
#pick up from files
files

0        [Figure_S1.tif, Figure_S2.tif, Figure_S3.tif, ...
1        [Figure_S1.tif, Table_S8.doc, Table_S7.doc, Ta...
2        [File_S1.xlsx, File_S2.xlsx, File_S3.xlsx, Fil...
3                                            [Text_S1.pdf]
4            [Figure_S1.pdf, Figure_S2.pdf, Table_S1.xlsx]
                               ...                        
10342                          [Supplementary Table 2.pdf]
10343                          [Supplementary Table 3.pdf]
10344                          [Supplementary Table 4.pdf]
10345                          [Supplementary Table 5.pdf]
10346                              [jf1c04555_si_001.xlsx]
Name: files, Length: 10347, dtype: object

In [106]:
#remove None (and any other empty values, such as empty lists); note that this also resets the index
files_ext_all = pd.Series(filter(None, files))
files_ext_all

0        [Figure_S1.tif, Figure_S2.tif, Figure_S3.tif, ...
1        [Figure_S1.tif, Table_S8.doc, Table_S7.doc, Ta...
2        [File_S1.xlsx, File_S2.xlsx, File_S3.xlsx, Fil...
3                                            [Text_S1.pdf]
4            [Figure_S1.pdf, Figure_S2.pdf, Table_S1.xlsx]
                               ...                        
10310                          [Supplementary Table 2.pdf]
10311                          [Supplementary Table 3.pdf]
10312                          [Supplementary Table 4.pdf]
10313                          [Supplementary Table 5.pdf]
10314                              [jf1c04555_si_001.xlsx]
Length: 10315, dtype: object

In [107]:
#expand so each item in own row
files_ext_all = files_ext_all.explode()
files_ext_all

0                    Figure_S1.tif
0                    Figure_S2.tif
0                    Figure_S3.tif
0                    Figure_S4.tif
0                    Figure_S5.tif
                   ...            
10310    Supplementary Table 2.pdf
10311    Supplementary Table 3.pdf
10312    Supplementary Table 4.pdf
10313    Supplementary Table 5.pdf
10314        jf1c04555_si_001.xlsx
Length: 31894, dtype: object

In [108]:
files_ext_all_df = files_ext_all.apply(lambda fn: Path(fn).suffixes).value_counts().to_frame()
files_ext_all_df.head(10)

Unnamed: 0,0
[.xls],3046
[.tif],2482
[.csv],2363
[.mp4],2107
[.xlsx],1783
[.zip],1604
[.jpg],1566
[.docx],1378
[.pdf],1234
[.txt],1209
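Sections 19 and 19.5 answer different questions: the first counts objects that contain an extension, the second counts files. The sketch below computes both on the same toy data so the difference is visible. It joins the suffixes into a string (as in section 19) rather than keeping them as lists; the toy file names and variable names are assumptions.

```python
from pathlib import Path

import pandas as pd

# Toy file lists: object 0 has two .tif files and one .pdf, object 1 has none
files_toy = pd.Series([['a.tif', 'b.tif', 'c.pdf'], None, ['d.pdf']])

# One row per file, labelled with its full extension
per_file = (
    files_toy.dropna()
    .explode()
    .dropna()
    .map(lambda name: ''.join(Path(name).suffixes))
    .rename('ext')
)

# File-level counts: every file contributes
file_counts = per_file.value_counts()

# Object-level counts: each extension contributes once per object
object_counts = per_file.reset_index().drop_duplicates()['ext'].value_counts()

print(file_counts)
print(object_counts)
```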


## 20. How many objects contain each type of File format given?
**Property:** File format

In [109]:
file_formats = df.FigshareArticlesCrosswalk.file_format
file_formats

In [110]:
#confirm missing for this repo
print(df.FigshareArticlesCrosswalk.file_format)

None


## 21. How many objects contain each type of Media type given?
**Property:** Media type

In [111]:
media_types = df.FigshareArticlesCrosswalk.media_type
media_types

In [112]:
#confirm missing for this repo
print(df.FigshareArticlesCrosswalk.media_type)

None


## 22. a) How many objects report one related resource type, and b) how many objects report each of those types? c) How many objects report multiple related resource types (regardless of which types)?
**Property:** Related resource type

In [113]:
related_resource_types = df.FigshareArticlesCrosswalk.related_resource_type
related_resource_types

In [114]:
#confirm missing for this repo
print(df.FigshareArticlesCrosswalk.related_resource_type)

None



## 23-25. If there is an entry for an object in one of the three properties (Original data URL, Primary manuscript PID/URL, and Related resource identifier) count as Related resources = True and then count the number of objects that return True.
**Property:** Related Resource Identifier

In [115]:
related_resource1 = df.FigshareArticlesCrosswalk.original_data_url
related_resource2 = df.FigshareArticlesCrosswalk.primary_manuscript
related_resource3 = df.FigshareArticlesCrosswalk.related_resource_identifier

In [116]:
print(related_resource1)

None


In [117]:
print(related_resource2)

0        10.1371/journal.pgen.1003195
1        10.1371/journal.pone.0050300
2        10.1371/journal.pone.0048862
3        10.1371/journal.pone.0050200
4        10.1371/journal.pone.0049538
                     ...             
10342           10.2217/fon-2021-1059
10343           10.2217/fon-2021-1059
10344           10.2217/fon-2021-1059
10345           10.2217/fon-2021-1059
10346        10.1021/acs.jafc.1c04555
Name: resource_doi_metadata, Length: 10347, dtype: object


In [118]:
print(related_resource3)

0        10.1371/journal.pgen.1003195
1        10.1371/journal.pone.0050300
2        10.1371/journal.pone.0048862
3        10.1371/journal.pone.0050200
4        10.1371/journal.pone.0049538
                     ...             
10342           10.2217/fon-2021-1059
10343           10.2217/fon-2021-1059
10344           10.2217/fon-2021-1059
10345           10.2217/fon-2021-1059
10346        10.1021/acs.jafc.1c04555
Name: resource_doi_metadata, Length: 10347, dtype: object


In [119]:
#confirm related_resource2 and related_resource3 are identical
related_resource2.equals(related_resource3)

True

In [120]:
#concatenate resources along with id
related_resource_all = pd.concat([related_resource1, related_resource2], axis = 1)
related_resource_all = pd.concat([related_resource_all, ids], axis = 1)
related_resource_all

Unnamed: 0,resource_doi_metadata,id
0,10.1371/journal.pgen.1003195,114867
1,10.1371/journal.pone.0050300,116703
2,10.1371/journal.pone.0048862,116813
3,10.1371/journal.pone.0050200,117032
4,10.1371/journal.pone.0049538,117391
...,...,...
10342,10.2217/fon-2021-1059,17113517
10343,10.2217/fon-2021-1059,17113523
10344,10.2217/fon-2021-1059,17113529
10345,10.2217/fon-2021-1059,17113550


In [121]:
#count how many values are not None
rr_count = related_resource_all['resource_doi_metadata'].count()
rr_count

8960

In [122]:
#confirm, make new df
rr_objects = related_resource_all.dropna()
rr_objects

Unnamed: 0,resource_doi_metadata,id
0,10.1371/journal.pgen.1003195,114867
1,10.1371/journal.pone.0050300,116703
2,10.1371/journal.pone.0048862,116813
3,10.1371/journal.pone.0050200,117032
4,10.1371/journal.pone.0049538,117391
...,...,...
10342,10.2217/fon-2021-1059,17113517
10343,10.2217/fon-2021-1059,17113523
10344,10.2217/fon-2021-1059,17113529
10345,10.2217/fon-2021-1059,17113550


In [123]:
#NOTE: also need to clean up empty cells in here (blank, not None)
rr_objects_ready = rr_objects.replace('', None)
rr_objects_ready

Unnamed: 0,resource_doi_metadata,id
0,10.1371/journal.pgen.1003195,114867
1,10.1371/journal.pone.0050300,116703
2,10.1371/journal.pone.0048862,116813
3,10.1371/journal.pone.0050200,117032
4,10.1371/journal.pone.0049538,117391
...,...,...
10342,10.2217/fon-2021-1059,17113517
10343,10.2217/fon-2021-1059,17113523
10344,10.2217/fon-2021-1059,17113529
10345,10.2217/fon-2021-1059,17113550


In [124]:
rr_objects_ready = rr_objects_ready.dropna()
rr_objects_ready

Unnamed: 0,resource_doi_metadata,id
0,10.1371/journal.pgen.1003195,114867
1,10.1371/journal.pone.0050300,116703
2,10.1371/journal.pone.0048862,116813
3,10.1371/journal.pone.0050200,117032
4,10.1371/journal.pone.0049538,117391
...,...,...
10342,10.2217/fon-2021-1059,17113517
10343,10.2217/fon-2021-1059,17113523
10344,10.2217/fon-2021-1059,17113529
10345,10.2217/fon-2021-1059,17113550


In [125]:
print(f'There are {len(rr_objects_ready)} objects with a related resource link')

There are 8906 objects with a related resource link
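The rule in the heading (an object counts as having related resources if any of the three properties has an entry) can also be expressed as a single boolean flag per object. Below is a minimal sketch on toy data, treating both `None` and empty strings as "no entry". The DOIs are made up, and the all-`None` column stands in for `original_data_url`, which is not available for this repository.

```python
import pandas as pd

# Toy versions of the three properties
toy = pd.DataFrame({
    'original_data_url': [None, None, None, None],
    'primary_manuscript': ['10.1/x', '', None, '10.1/y'],
    'related_resource_identifier': ['10.1/x', '', None, None],
})

# A cell is populated when it is neither missing nor an empty string
populated = toy.notna() & toy.ne('')

# True when at least one of the three properties has an entry
has_related_resource = populated.any(axis=1)
print(has_related_resource.sum())
```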


## 23-25. Also, what is the mean number of related resource links per object (again looking at the three properties: Original data URL, Primary manuscript PID/URL, and Related resource identifier)?
**Property:** Related Resource Identifier

In this case, `rr_objects_ready` already contains only the 8906 objects (out of 10347) that have at least one related resource, so there is no need to subset further before calculating the mean/median number of related resources.

In [126]:
#function to count links
def count_links(entry):
    try:
        return len(entry)
    except TypeError:
        return 0

The function `count_links` expects a list (an object may have multiple links), so the single DOI strings are first wrapped in lists.
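To see why the wrapping matters, here is a minimal sketch that repeats `count_links` and calls it on a list, on `None`, and on a bare string. A bare string does not raise `TypeError`, so its character count would be returned instead of 1.

```python
def count_links(entry):
    """Count the items in a list of links; non-sized values such as None count as 0."""
    try:
        return len(entry)
    except TypeError:
        return 0

doi = '10.1371/journal.pone.0050300'

print(count_links([doi]))  # one link wrapped in a list
print(count_links(None))   # missing value
print(count_links(doi))    # bare string: returns the number of characters, not 1
```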

In [127]:
#replace the column string values with lists
related_resource_list = rr_objects_ready.apply(
    lambda row: row.apply(
        lambda cell: [cell] if cell else []
    ),
    axis=1
)
related_resource_list

Unnamed: 0,resource_doi_metadata,id
0,[10.1371/journal.pgen.1003195],[114867]
1,[10.1371/journal.pone.0050300],[116703]
2,[10.1371/journal.pone.0048862],[116813]
3,[10.1371/journal.pone.0050200],[117032]
4,[10.1371/journal.pone.0049538],[117391]
...,...,...
10342,[10.2217/fon-2021-1059],[17113517]
10343,[10.2217/fon-2021-1059],[17113523]
10344,[10.2217/fon-2021-1059],[17113529]
10345,[10.2217/fon-2021-1059],[17113550]


In [128]:
#remove id column before counting
related_resource_list = related_resource_list.drop('id', axis=1)
related_resource_list

Unnamed: 0,resource_doi_metadata
0,[10.1371/journal.pgen.1003195]
1,[10.1371/journal.pone.0050300]
2,[10.1371/journal.pone.0048862]
3,[10.1371/journal.pone.0050200]
4,[10.1371/journal.pone.0049538]
...,...
10342,[10.2217/fon-2021-1059]
10343,[10.2217/fon-2021-1059]
10344,[10.2217/fon-2021-1059]
10345,[10.2217/fon-2021-1059]


In [129]:
links_per_object = related_resource_list.apply(
    lambda row: sum([count_links(entry) for entry in row]),
    axis=1
)

links_per_object

0        1
1        1
2        1
3        1
4        1
        ..
10342    1
10343    1
10344    1
10345    1
10346    1
Length: 8906, dtype: int64

In [130]:
print(f'mean {round(links_per_object.mean(), 3)} links per object')

mean 1.0 links per object


In [131]:
print(f'median {round(links_per_object.median(), 3)} links per object')

median 1.0 links per object


In [132]:
#percent of all objects that have a related resource link
8906/10347 * 100

86.073257949164

## 26. How many objects report each relation type? How many objects report multiple relation types, regardless of what those types are?
**Property:** Related resource relation type

In [133]:
relation_type = df.FigshareArticlesCrosswalk.related_resource_relation_type
relation_type

In [134]:
#confirm missing for this repo
print(df.FigshareArticlesCrosswalk.related_resource_relation_type)

None


## 27. For repositories that store the full citation in a designated field, how many objects have a populated citation? How many objects have a citation and a URL or other actionable link?
**Property:** Citation

In [135]:
citations = df.FigshareArticlesCrosswalk.citation
citations

0        Jin, Hong; Stojnic, Robert; Adryan, Boris; Ozd...
1        Song, Jiangning; Tan, Hao; J. Perry, Andrew; A...
2        O'Hagan, Steve; Knowles, Joshua; Kell, Douglas...
3        Brylinski, Michal; Lingam, Daswanth (2015): Te...
4        G. Holman, Alexander; Gabuzda, Dana (2016): A ...
                               ...                        
10342    Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...
10343    Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...
10344    Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...
10345    Xu, Jianmin; Xu, Binghua; Li, Yipeng; Su, Zhij...
10346    Kalyan, Gazal; Junghare, Vivek; Khan, Mohammad...
Name: citation, Length: 10347, dtype: object

In [136]:
#check number missing citations
citations.isna().sum()

0

In [137]:
print(f'There are {citations.notna().sum()} objects with a citation')

There are 10347 objects with a citation
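The second half of the question (how many citations include a URL or other actionable link) is not computed above. One possible approach, sketched here on made-up citation strings, is to search each citation for a URL or a DOI-like token. The regular expression is a rough assumption, not a complete DOI or URL validator, and its matches would need spot-checking against the real citations.

```python
import pandas as pd

# Rough pattern: an http(s) URL, or a DOI-like token starting with "10."
link_pattern = r'https?://\S+|\b10\.\d{4,9}/\S+'

# Made-up citation strings for illustration
citations_toy = pd.Series([
    'Doe, Jane (2015): Example dataset. figshare. Dataset. https://doi.org/10.1234/example.1',
    'Doe, Jane (2016): Another dataset. Dataset.',
])

# True where the citation text contains a link-like token
has_link = citations_toy.str.contains(link_pattern, regex=True)
print(f'There are {has_link.sum()} objects with a citation and a link')
```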
