# Analysis Template Walkthrough

# Setup

## Select extract
To point the template cells at the correct repository, set `repository` to the repository name and `object_type` to the repository object type.

In [1]:
repository = 'zenodo'
object_type = 'records'

In [2]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

In [3]:
#see more rows and columns of output
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100) 

## Helper Functions

In [4]:
import os
import sys

#add the directory two levels up to the import path so that `utils` can be imported
dir2 = os.path.abspath('../')
dir1 = os.path.dirname(dir2)
if dir1 not in sys.path:
    sys.path.append(dir1)

from utils import analysis
from utils.crosswalk import RepositoryExtract, property_crosswalk
from utils import accessors
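
The cells that follow read properties through `df.ZenodoRecordsCrosswalk.<property>`, which looks like a custom pandas accessor. The project's own implementation is not shown here, so the following is only a minimal sketch of the general pattern, using pandas' `register_dataframe_accessor`. The accessor name `toy_crosswalk`, the class, and the nested `resource_type` / `title` keys are all assumptions made for illustration.

```python
import pandas as pd


# Hypothetical accessor: the name and the field mapping are illustrative only,
# not the project's actual crosswalk.
@pd.api.extensions.register_dataframe_accessor("toy_crosswalk")
class ToyCrosswalk:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    @property
    def unique_identifier(self):
        # Property stored as a top-level column: return the column as is.
        return self._df["id"]

    @property
    def resource_type(self):
        # Property nested inside each record's metadata dict (assumed layout).
        return self._df["metadata"].apply(
            lambda m: m.get("resource_type", {}).get("title")
            if isinstance(m, dict)
            else None
        )


toy = pd.DataFrame(
    {
        "id": [1, 2],
        "metadata": [
            {"resource_type": {"title": "Dataset"}},
            {"resource_type": {"title": "Software"}},
        ],
    }
)
toy_ids = toy.toy_crosswalk.unique_identifier
toy_types = toy.toy_crosswalk.resource_type
```

In this sketch the returned Series keeps the name of its source column (`id`, `metadata`), which would be consistent with the `Name:` lines in the outputs further down.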

# Summary Statistic Walkthroughs

Read in the repository .json file

In [5]:
df = pd.read_json(f'{repository}_{object_type}.json')

In [6]:
df

Unnamed: 0,conceptdoi,conceptrecid,created,doi,files,id,links,metadata,owners,revision,stats,updated,page
0,10.5281/zenodo.5528620,5528620,2021-09-26T12:38:34.480681+00:00,10.5281/zenodo.5528621,[{'bucket': '23d1783f-8ed3-4b80-ae21-4cec73073...,5528621,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[255977],2,"{'downloads': 17.0, 'unique_downloads': 17.0, ...",2021-09-26T13:48:23.212663+00:00,1
1,10.5281/zenodo.5747172,5747172,2021-12-01T13:47:08.272108+00:00,10.5281/zenodo.5747173,,5747173,{'badge': 'https://zenodo.org/badge/doi/10.528...,{'access_conditions': '<p>The dataset is set t...,[262744],6,"{'downloads': 1.0, 'unique_downloads': 1.0, 'u...",2021-12-03T01:48:35.649838+00:00,1
2,,4768051,2021-05-17T17:53:16.165204+00:00,10.1007/s10994-021-05968-x,[{'bucket': 'a43e8b77-a43a-488c-8e02-489f02047...,4768052,{'badge': 'https://zenodo.org/badge/doi/10.100...,"{'access_right': 'open', 'access_right_categor...",[71235],2,"{'downloads': 18.0, 'unique_downloads': 18.0, ...",2021-05-18T01:48:13.633614+00:00,1
3,10.5281/zenodo.4738769,4738769,2021-05-05T10:21:43.604973+00:00,10.5281/zenodo.4738770,[{'bucket': 'fdefeabc-7897-4130-9628-438795c87...,4738770,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[37667],3,"{'downloads': 11.0, 'unique_downloads': 11.0, ...",2021-05-05T13:48:11.586654+00:00,1
4,10.5281/zenodo.5602144,5602144,2021-10-28T21:22:24.722154+00:00,10.5281/zenodo.5609988,[{'bucket': '9ecdb5a8-e08a-45a1-a900-70eacf48b...,5609988,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[260803],2,"{'downloads': 3.0, 'unique_downloads': 2.0, 'u...",2021-10-29T01:48:57.739525+00:00,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7953,,642480,2014-03-10T17:39:52+00:00,10.1186/1471-2105-14-170,[{'bucket': 'a625bb5f-df27-4cbe-b49c-345a41f1b...,8330,{'badge': 'https://zenodo.org/badge/doi/10.118...,"{'access_right': 'open', 'access_right_categor...",[2290],11,"{'downloads': 70.0, 'unique_downloads': 69.0, ...",2020-01-20T16:14:01.294191+00:00,1
7954,,641703,2014-03-10T17:38:32+00:00,10.5281/zenodo.7490,[{'bucket': '5d9de852-6754-41d2-b0ac-cacfeb1d4...,7490,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[946],12,"{'downloads': 14289.0, 'unique_downloads': 141...",2020-01-20T17:33:37.234664+00:00,1
7955,,641617,2014-03-10T17:37:37+00:00,10.2478/v10229-011-0015-3,[{'bucket': 'bccfe33c-0ef8-45a7-ac8e-ec48c4d0b...,7016,{'badge': 'https://zenodo.org/badge/doi/10.247...,"{'access_right': 'open', 'access_right_categor...",[],9,"{'downloads': 112.0, 'unique_downloads': 109.0...",2020-01-20T14:23:17.635939+00:00,1
7956,,641676,2014-03-10T17:37:37+00:00,,[{'bucket': 'bb3a59b2-fe88-4964-bc17-78a73b620...,7015,{'bucket': 'https://zenodo.org/api/files/bb3a5...,"{'access_right': 'open', 'access_right_categor...",[],8,"{'downloads': 17.0, 'unique_downloads': 16.0, ...",2020-01-20T15:09:19.938762+00:00,1


Zenodo "machine learning" objects include a variety of resource types that are not relevant to the question at hand, such as journal articles, book chapters, and other items.

Full list of Zenodo resource types returned by the initial metadata extract, with the relevant subset in bold:
* Journal article
* **Dataset**
* **Software**
* Conference paper
* Presentation
* Poster
* Project deliverable
* Other
* Report
* Thesis
* Figure
* Preprint
* Working paper
* Video/Audio
* Book section
* Book
* Lesson
* Technical note
* Software documentation
* Photo
* Proposal
* Patent
* Data management plan
* Physical object
* Diagram
* Taxonomic treatment
* Plot

We want to re-run all summary statistics on this subset of Zenodo metadata.

In [7]:
#subset df based on resource type

In [8]:
ids = df.ZenodoRecordsCrosswalk.unique_identifier
ids

0       5528621
1       5747173
2       4768052
3       4738770
4       5609988
         ...   
7953       8330
7954       7490
7955       7016
7956       7015
7957      12726
Name: id, Length: 7958, dtype: int64

In [9]:
resource_types = df.ZenodoRecordsCrosswalk.resource_type
resource_types

0           Presentation
1                Dataset
2        Journal article
3               Software
4       Conference paper
              ...       
7953     Journal article
7954              Report
7955     Journal article
7956              Report
7957     Journal article
Name: metadata, Length: 7958, dtype: object

In [10]:
#add ID column to resource type column
resource_ids = pd.concat([ids, resource_types], axis = 1)
resource_ids

Unnamed: 0,id,metadata
0,5528621,Presentation
1,5747173,Dataset
2,4768052,Journal article
3,4738770,Software
4,5609988,Conference paper
...,...,...
7953,8330,Journal article
7954,7490,Report
7955,7016,Journal article
7956,7015,Report


In [11]:
#keep only ids that are 'Dataset', 'Software'
keep_subset = resource_ids[resource_ids['metadata'].isin(['Dataset', 'Software'])]
keep_ids = keep_subset['id']
keep_ids

1       5747173
3       4738770
7       4456470
9       5654850
10      4547779
         ...   
7935      14183
7938      22204
7939      31904
7942      10075
7949      12501
Name: id, Length: 2217, dtype: int64

In [12]:
#subset full dataset to keep only these ids
subset_df = df[df['id'].isin(keep_ids)]
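
The same subset could probably be taken in a single step with a boolean mask, provided the resource type Series shares its index with the DataFrame. A minimal sketch on toy data (the `toy_*` names and values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the full extract and its resource types (same index).
toy_df = pd.DataFrame({"id": [11, 12, 13, 14]})
toy_types = pd.Series(
    ["Presentation", "Dataset", "Software", "Report"], index=toy_df.index
)

# True for rows whose resource type is in the subset of interest.
mask = toy_types.isin(["Dataset", "Software"])

# .copy() makes the subset independent of the original frame.
toy_subset = toy_df[mask].copy()
```

Going through the ID column, as the cells above do, has the advantage that it still works if the two objects are not index-aligned.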

In [13]:
subset_df

Unnamed: 0,conceptdoi,conceptrecid,created,doi,files,id,links,metadata,owners,revision,stats,updated,page
1,10.5281/zenodo.5747172,5747172,2021-12-01T13:47:08.272108+00:00,10.5281/zenodo.5747173,,5747173,{'badge': 'https://zenodo.org/badge/doi/10.528...,{'access_conditions': '<p>The dataset is set t...,[262744],6,"{'downloads': 1.0, 'unique_downloads': 1.0, 'u...",2021-12-03T01:48:35.649838+00:00,1
3,10.5281/zenodo.4738769,4738769,2021-05-05T10:21:43.604973+00:00,10.5281/zenodo.4738770,[{'bucket': 'fdefeabc-7897-4130-9628-438795c87...,4738770,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[37667],3,"{'downloads': 11.0, 'unique_downloads': 11.0, ...",2021-05-05T13:48:11.586654+00:00,1
7,10.5281/zenodo.4456151,4456151,2021-01-22T08:59:53.920317+00:00,10.5281/zenodo.4456470,[{'bucket': '723fc682-b0bd-4b5a-ba84-5c13f1586...,4456470,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[189980],2,"{'downloads': 5.0, 'unique_downloads': 5.0, 'u...",2021-01-22T12:27:21.545991+00:00,1
9,10.5281/zenodo.5639567,5639567,2021-11-08T16:12:09.613970+00:00,10.5281/zenodo.5654850,[{'bucket': '9ebdf204-c339-45f0-890d-2ec282f81...,5654850,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[259815],3,"{'downloads': 0.0, 'unique_downloads': 0.0, 'u...",2021-11-09T13:48:40.037016+00:00,1
10,10.5281/zenodo.4547778,4547778,2021-02-18T05:59:05.081026+00:00,10.5281/zenodo.4547779,[{'bucket': '0b35fbd4-0d16-463a-b3eb-571fbc57c...,4547779,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[197830],3,"{'downloads': 1.0, 'unique_downloads': 1.0, 'u...",2021-02-18T12:27:25.273226+00:00,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7935,10.5281/zenodo.592829,592829,2015-06-25T15:09:27+00:00,10.5281/zenodo.14183,[{'bucket': '051fce62-5352-4abd-b53b-b4de2d1ac...,14183,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[7022],10,"{'downloads': 3.0, 'unique_downloads': 3.0, 'u...",2020-01-25T19:23:07.983801+00:00,1
7938,,597810,2015-07-31T06:02:49+00:00,10.5281/zenodo.22204,[{'bucket': 'bff7131e-f8e3-42c3-b3b2-cc6d88aad...,22204,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[16206],15,"{'downloads': 7.0, 'unique_downloads': 7.0, 'u...",2020-01-25T07:22:23.620589+00:00,1
7939,,591554,2015-10-07T13:20:53+00:00,10.5281/zenodo.31904,[{'bucket': '09dd6cd5-c5ca-4776-bda9-3c40d1326...,31904,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[3832],10,"{'downloads': 1.0, 'unique_downloads': 1.0, 'u...",2020-01-24T19:22:25.401259+00:00,1
7942,,592133,2015-06-25T15:03:29+00:00,10.5281/zenodo.10075,[{'bucket': 'c0dccdf0-92fa-4df5-94ca-78c581e72...,10075,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[4179],11,"{'downloads': 8.0, 'unique_downloads': 8.0, 'u...",2020-01-25T19:21:23.237152+00:00,1


In [14]:
#rename subset_df back to df, so remainder of code identical to full extract
del df
df = subset_df

In [15]:
df

Unnamed: 0,conceptdoi,conceptrecid,created,doi,files,id,links,metadata,owners,revision,stats,updated,page
1,10.5281/zenodo.5747172,5747172,2021-12-01T13:47:08.272108+00:00,10.5281/zenodo.5747173,,5747173,{'badge': 'https://zenodo.org/badge/doi/10.528...,{'access_conditions': '<p>The dataset is set t...,[262744],6,"{'downloads': 1.0, 'unique_downloads': 1.0, 'u...",2021-12-03T01:48:35.649838+00:00,1
3,10.5281/zenodo.4738769,4738769,2021-05-05T10:21:43.604973+00:00,10.5281/zenodo.4738770,[{'bucket': 'fdefeabc-7897-4130-9628-438795c87...,4738770,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[37667],3,"{'downloads': 11.0, 'unique_downloads': 11.0, ...",2021-05-05T13:48:11.586654+00:00,1
7,10.5281/zenodo.4456151,4456151,2021-01-22T08:59:53.920317+00:00,10.5281/zenodo.4456470,[{'bucket': '723fc682-b0bd-4b5a-ba84-5c13f1586...,4456470,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[189980],2,"{'downloads': 5.0, 'unique_downloads': 5.0, 'u...",2021-01-22T12:27:21.545991+00:00,1
9,10.5281/zenodo.5639567,5639567,2021-11-08T16:12:09.613970+00:00,10.5281/zenodo.5654850,[{'bucket': '9ebdf204-c339-45f0-890d-2ec282f81...,5654850,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[259815],3,"{'downloads': 0.0, 'unique_downloads': 0.0, 'u...",2021-11-09T13:48:40.037016+00:00,1
10,10.5281/zenodo.4547778,4547778,2021-02-18T05:59:05.081026+00:00,10.5281/zenodo.4547779,[{'bucket': '0b35fbd4-0d16-463a-b3eb-571fbc57c...,4547779,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[197830],3,"{'downloads': 1.0, 'unique_downloads': 1.0, 'u...",2021-02-18T12:27:25.273226+00:00,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7935,10.5281/zenodo.592829,592829,2015-06-25T15:09:27+00:00,10.5281/zenodo.14183,[{'bucket': '051fce62-5352-4abd-b53b-b4de2d1ac...,14183,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[7022],10,"{'downloads': 3.0, 'unique_downloads': 3.0, 'u...",2020-01-25T19:23:07.983801+00:00,1
7938,,597810,2015-07-31T06:02:49+00:00,10.5281/zenodo.22204,[{'bucket': 'bff7131e-f8e3-42c3-b3b2-cc6d88aad...,22204,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[16206],15,"{'downloads': 7.0, 'unique_downloads': 7.0, 'u...",2020-01-25T07:22:23.620589+00:00,1
7939,,591554,2015-10-07T13:20:53+00:00,10.5281/zenodo.31904,[{'bucket': '09dd6cd5-c5ca-4776-bda9-3c40d1326...,31904,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[3832],10,"{'downloads': 1.0, 'unique_downloads': 1.0, 'u...",2020-01-24T19:22:25.401259+00:00,1
7942,,592133,2015-06-25T15:03:29+00:00,10.5281/zenodo.10075,[{'bucket': 'c0dccdf0-92fa-4df5-94ca-78c581e72...,10075,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[4179],11,"{'downloads': 8.0, 'unique_downloads': 8.0, 'u...",2020-01-25T19:21:23.237152+00:00,1


## 1. How many total objects (not just records) are in our main dataset extracts for each repository?
**Property:** unique_identifier

In [16]:
ids = df.ZenodoRecordsCrosswalk.unique_identifier
ids

1       5747173
3       4738770
7       4456470
9       5654850
10      4547779
         ...   
7935      14183
7938      22204
7939      31904
7942      10075
7949      12501
Name: id, Length: 2217, dtype: int64

In [17]:
ids.nunique()

2217

In [18]:
#number of items = number of unique IDs --> each row (item/record) is a unique object
print(f'There are {len(ids)} items in the Zenodo extract, with {ids.nunique()} unique IDs.')

There are 2217 items in the Zenodo extract, with 2217 unique IDs.


## 2. See the "Licenses offered" tab in /Working documents/Licenses sheet for list of licenses by repo.

## Given the type(s) of license(s) offered by the repo, how many of each type is assigned?
**Property:** License

In [19]:
licenses = df.ZenodoRecordsCrosswalk.license
licenses

1             None
3       other-open
7        CC-BY-4.0
9       other-open
10      other-open
           ...    
7935    other-open
7938    other-open
7939       CC0-1.0
7942       GPL-3.0
7949     CC-BY-4.0
Name: metadata, Length: 2217, dtype: object

In [20]:
#name the count column explicitly; pandas >= 2.0 would otherwise call it 'count'
license_counts = licenses.value_counts().to_frame('metadata')
license_counts['percent'] = license_counts['metadata']/len(licenses) * 100
license_counts

Unnamed: 0,metadata,percent
CC-BY-4.0,1067,48.128101
other-open,373,16.824538
CC0-1.0,240,10.82544
MIT,106,4.781236
ODbL-1.0,61,2.751466
Apache-2.0,49,2.210194
GPL-3.0,32,1.443392
CC-BY-NC-SA-4.0,31,1.398286
CC-BY-SA-4.0,30,1.35318
CC-BY-NC-4.0,17,0.766802
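
Note that `value_counts()` drops missing values by default, so objects without a license do not appear as a row above, although they still count in the denominator of `percent`. A minimal sketch of one way to make the missing group visible, on toy data (the values are invented for illustration):

```python
import pandas as pd

toy_licenses = pd.Series(["CC-BY-4.0", "MIT", None, "CC-BY-4.0"], name="metadata")

# dropna=False keeps a row for missing licenses.
toy_license_counts = toy_licenses.value_counts(dropna=False).to_frame("count")

# Percent of all objects, including those without a license.
toy_license_counts["percent"] = (
    toy_license_counts["count"] / len(toy_licenses) * 100
)
```

With the missing group included, the `percent` column sums to 100.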


## 3. What is the mean number of characters (excluding whitespaces, if possible) per object?
**Property:** Description
**Related function:** `mean_characters`

In [21]:
descriptions = df.ZenodoRecordsCrosswalk.description
descriptions

1       <pre>&nbsp;</pre>\n\n<p>A dataset containing I...
3                                No description provided.
7       <p>Supplement to https://github.com/twopin/CAM...
9       <p>Code and data supporting Delphi's PNAS pape...
10      <p>Release for making bibtex citation.\nIt is ...
                              ...                        
7935    <p>This version offers some improvements over ...
7938    <p><strong>Training and testing of the agents ...
7939    <p><strong>Abstract</strong></p>\n\n<p>Taxonom...
7942    <p>Malheur is a tool for the automatic analysi...
7949    <p>This dataset includes classification of Eng...
Name: metadata, Length: 2217, dtype: object

In [22]:
descriptions.describe()

count                         2217
unique                        2188
top       No description provided.
freq                             5
Name: metadata, dtype: object

In [23]:
print(f'Number of null descriptions: {sum(descriptions.isna())}')

Number of null descriptions: 0


In [24]:
#this is a rough approximation, given additional characters like paragraph and newline indicators in description
print(f'{analysis.mean_characters(descriptions)} mean characters')

1371.393967093236 mean characters
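
Because the descriptions contain HTML markup, the figure above overstates the visible text. How `analysis.mean_characters` treats markup and whitespace is not shown here. As a rough sketch of a stricter count, assuming tags, HTML entities, and whitespace should all be excluded (the function name `mean_characters_stripped` is hypothetical):

```python
import html
import re

import pandas as pd


def mean_characters_stripped(texts):
    """Mean number of non-whitespace characters after removing HTML tags."""

    def count(text):
        if not isinstance(text, str):
            return 0  # assumption: missing descriptions count as zero characters
        no_tags = re.sub(r"<[^>]+>", "", text)   # drop tags such as <p> and </p>
        unescaped = html.unescape(no_tags)        # turn &nbsp; etc. into characters
        return len(re.sub(r"\s+", "", unescaped)) # drop all whitespace

    return texts.apply(count).mean()


toy_descriptions = pd.Series(["<p>ab cd</p>", "<pre>&nbsp;</pre>\n\n<p>xyz</p>"])
toy_mean = mean_characters_stripped(toy_descriptions)
```

The regular expression is a simple heuristic and not a full HTML parser, so unusual markup may still slip through.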


## 4. What is the mean number of characters (excluding whitespaces, if possible) per object?
**Property:** Methods
**Related function:** `mean_characters`

In [25]:
methods = df.ZenodoRecordsCrosswalk.methods
methods

In [26]:
#Confirm missing for this repo
print(df.ZenodoRecordsCrosswalk.methods)

None


## 5. What are the min and max publication dates for each repo?

## How many objects were published each year for each repo?
**Property:** Publication date

In [27]:
publication_dates = df.ZenodoRecordsCrosswalk.publication_date
publication_dates

1       2021-12-01
3       2021-05-05
7       2021-01-22
9       2021-11-04
10      2021-02-18
           ...    
7935    2015-01-22
7938    2015-07-30
7939    2015-10-07
7942    2013-12-25
7949    2014-10-30
Name: metadata, Length: 2217, dtype: object

In [28]:
#min and max publication date
publication_dates.min(), publication_dates.max()

('2006-01-01', '2022-06-04')
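
Taking `min()` and `max()` directly on the strings works here because ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in chronological order. If the column might contain other formats or malformed values, parsing first is safer. A minimal sketch on toy data:

```python
import pandas as pd

toy_dates = pd.Series(["2021-12-01", "2006-01-01", "not a date", "2022-06-04"])

# errors="coerce" turns values that do not match the format into NaT.
parsed = pd.to_datetime(toy_dates, format="%Y-%m-%d", errors="coerce")

earliest, latest = parsed.min(), parsed.max()  # NaT is skipped
n_unparsed = int(parsed.isna().sum())
```

Checking `n_unparsed` shows how many values were silently excluded from the range.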

In [29]:
#objects per year
pd.to_datetime(publication_dates).dt.year.value_counts().sort_index()

2006      1
2012      3
2013      4
2014     11
2015     29
2016     52
2017     97
2018    179
2019    326
2020    642
2021    872
2022      1
Name: metadata, dtype: int64

In [30]:
#ready for plotting export
pub_dates_export = pd.to_datetime(publication_dates).dt.year.value_counts().sort_index().to_frame()

In [31]:
#update column names
pub_dates_export_ready = pub_dates_export.reset_index(level=0)
pub_dates_export_ready.columns = ['year', 'count']

In [32]:
#add column with name of repo
pub_dates_export_ready['repo'] = 'zenodo_subset'

In [33]:
pub_dates_export_ready

Unnamed: 0,year,count,repo
0,2006,1,zenodo_subset
1,2012,3,zenodo_subset
2,2013,4,zenodo_subset
3,2014,11,zenodo_subset
4,2015,29,zenodo_subset
5,2016,52,zenodo_subset
6,2017,97,zenodo_subset
7,2018,179,zenodo_subset
8,2019,326,zenodo_subset
9,2020,642,zenodo_subset


In [34]:
#export to Figures folder
#build the path with pathlib so it works on any operating system
pub_dates_export_ready.to_csv(Path('..', '..', 'Figures', 'Figure1', 'repository_dates', 'zenodo_subset_pub_years.csv'))

## 6. What are the unweighted mean, median, and max file sizes among all ingested files?
**Property:** File size
**Related function:** `get_summary_statistics`

In [35]:
file_sizes = df.ZenodoRecordsCrosswalk.file_size
file_sizes

1             None
3         [133317]
7       [58343170]
9       [45947386]
10       [1577773]
           ...    
7935     [3906128]
7938    [25615252]
7939    [31428769]
7942      [140158]
7949     [2084084]
Name: files, Length: 2217, dtype: object

In [36]:
#replace None values with empty list
file_sizes = file_sizes.apply(lambda d: d if isinstance(d, list) else [])
file_sizes

1               []
3         [133317]
7       [58343170]
9       [45947386]
10       [1577773]
           ...    
7935     [3906128]
7938    [25615252]
7939    [31428769]
7942      [140158]
7949     [2084084]
Name: files, Length: 2217, dtype: object

In [37]:
#collapse into single column - each file size in own row, since we are interested in summary stats across all files
file_sizes_long = file_sizes.explode()
file_sizes_long

1            NaN
3         133317
7       58343170
9       45947386
10       1577773
          ...   
7935     3906128
7938    25615252
7939    31428769
7942      140158
7949     2084084
Name: files, Length: 19060, dtype: object

In [38]:
#drop NaN values, so median calculates correctly
file_sizes_long = file_sizes_long.dropna()
file_sizes_long

3         133317
7       58343170
9       45947386
10       1577773
12          3731
          ...   
7935     3906128
7938    25615252
7939    31428769
7942      140158
7949     2084084
Name: files, Length: 18989, dtype: object

In [39]:
#get summary statistics
analysis.get_summary_statistics(file_sizes_long)

{'mean': 609352656.069198, 'median': 18996648.0, 'max': 279115940800}
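
The source of `analysis.get_summary_statistics` is not shown here. Judging only from its output, it returns the mean, median, and max of the values passed in. The following sketch reproduces that behaviour together with the preceding list-flattening steps on toy data. The function name `summary_statistics_sketch` is hypothetical and is not the project's helper.

```python
import numpy as np
import pandas as pd


def summary_statistics_sketch(values):
    """Return mean, median, and max as a dict (illustrative stand-in)."""
    arr = np.asarray(list(values), dtype=float)
    return {
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        "max": float(arr.max()),
    }


# One entry per object: None (no files) or a list of file sizes in bytes.
toy_sizes = pd.Series([None, [100], [200, 600]])

toy_long = (
    toy_sizes.apply(lambda d: d if isinstance(d, list) else [])  # None -> []
    .explode()   # one row per file; an empty list becomes NaN
    .dropna()    # remove the placeholder rows for objects without files
)
toy_stats = summary_statistics_sketch(toy_long)
```

Here the statistics are taken over files, not objects, so an object with many files contributes many values.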

An alternative approach using a NumPy array gives the same values.

In [40]:
#confirm accuracy with alternative method
file_size_array = np.array([size for size_list in file_sizes for size in size_list])
file_size_array

array([  133317, 58343170, 45947386, ..., 31428769,   140158,  2084084],
      dtype=int64)

In [41]:
analysis.get_summary_statistics(file_size_array, suppress_output=False);

mean: 609352656.069198
median: 18996648.0
max: 279115940800


## 7. What are the mean, median, and max number of files per object?
**Property:** URL
**Related function:** `get_summary_statistics`

`None` values (objects without files) are replaced with an empty list so that those objects count as having zero files.

In [42]:
files = df.ZenodoRecordsCrosswalk.url
files

1                                                    None
3       [https://zenodo.org/api/files/fdefeabc-7897-41...
7       [https://zenodo.org/api/files/723fc682-b0bd-4b...
9       [https://zenodo.org/api/files/9ebdf204-c339-45...
10      [https://zenodo.org/api/files/0b35fbd4-0d16-46...
                              ...                        
7935    [https://zenodo.org/api/files/051fce62-5352-4a...
7938    [https://zenodo.org/api/files/bff7131e-f8e3-42...
7939    [https://zenodo.org/api/files/09dd6cd5-c5ca-47...
7942    [https://zenodo.org/api/files/c0dccdf0-92fa-4d...
7949    [https://zenodo.org/api/files/269831f2-f474-44...
Name: files, Length: 2217, dtype: object

In [43]:
#replace None with empty list
files = files.apply(lambda d: d if isinstance(d, list) else [])
files

1                                                      []
3       [https://zenodo.org/api/files/fdefeabc-7897-41...
7       [https://zenodo.org/api/files/723fc682-b0bd-4b...
9       [https://zenodo.org/api/files/9ebdf204-c339-45...
10      [https://zenodo.org/api/files/0b35fbd4-0d16-46...
                              ...                        
7935    [https://zenodo.org/api/files/051fce62-5352-4a...
7938    [https://zenodo.org/api/files/bff7131e-f8e3-42...
7939    [https://zenodo.org/api/files/09dd6cd5-c5ca-47...
7942    [https://zenodo.org/api/files/c0dccdf0-92fa-4d...
7949    [https://zenodo.org/api/files/269831f2-f474-44...
Name: files, Length: 2217, dtype: object

In [44]:
#get files per object
files_counts = files.apply(len)
files_counts

1       0
3       1
7       1
9       1
10      1
       ..
7935    1
7938    1
7939    1
7942    1
7949    1
Name: files, Length: 2217, dtype: int64

In [45]:
#get summary statistics
analysis.get_summary_statistics(files_counts)

{'mean': 8.565178168696436, 'median': 1.0, 'max': 2600}

## 8. What are the mean, median, and max total dataset size (summed across all files) per object?
**Property:** Dataset size
**Related function:** `get_summary_statistics`

In [46]:
dataset_sizes = df.ZenodoRecordsCrosswalk.dataset_size
dataset_sizes

1             None
3         [133317]
7       [58343170]
9       [45947386]
10       [1577773]
           ...    
7935     [3906128]
7938    [25615252]
7939    [31428769]
7942      [140158]
7949     [2084084]
Name: files, Length: 2217, dtype: object

In [47]:
#replace None values with empty list, file size of 0
dataset_sizes = dataset_sizes.apply(lambda d: d if isinstance(d, list) else [])
dataset_sizes

1               []
3         [133317]
7       [58343170]
9       [45947386]
10       [1577773]
           ...    
7935     [3906128]
7938    [25615252]
7939    [31428769]
7942      [140158]
7949     [2084084]
Name: files, Length: 2217, dtype: object

In [48]:
#sum up size of files within object (sum up within each list in series)
dataset_sizes_total = dataset_sizes.apply(sum)
dataset_sizes_total

1              0
3         133317
7       58343170
9       45947386
10       1577773
          ...   
7935     3906128
7938    25615252
7939    31428769
7942      140158
7949     2084084
Name: files, Length: 2217, dtype: int64

In [49]:
#get summary statistics
analysis.get_summary_statistics(dataset_sizes_total, suppress_output=False);

mean: 5219214066.801083
median: 32713830.0
max: 300398018118
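
These values are easier to read in larger units. Assuming the sizes are in bytes (as Zenodo file sizes normally are), a small helper can convert them using decimal (SI) units. The helper name `format_bytes` is hypothetical, and the input values are copied from the output above.

```python
def format_bytes(num_bytes):
    """Format a byte count with decimal (SI) units, e.g. 1_000_000 -> '1.00 MB'."""
    size = float(num_bytes)
    for unit in ["B", "kB", "MB", "GB", "TB"]:
        # Stop at the first unit that keeps the number below 1000 (or at TB).
        if size < 1000 or unit == "TB":
            return f"{size:.2f} {unit}"
        size /= 1000


# Values taken from the summary statistics printed above.
reported = {"mean": 5219214066.801083, "median": 32713830.0, "max": 300398018118}
readable = {name: format_bytes(value) for name, value in reported.items()}
```

The large gap between the mean (about 5.2 GB) and the median (about 33 MB) indicates a strongly right-skewed size distribution.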


## 9. How many of each scientific domain are assigned?
**Property:** Domain
**Related function:** `domains.value_counts()`

In [50]:
domains = df.ZenodoRecordsCrosswalk.domain
domains

In [51]:
#confirm missing for this repo
print(df.ZenodoRecordsCrosswalk.domain)

None


## 10. What is the mean number of characters (excluding whitespaces, if possible) for usage notes per object?
**Property:** Technical details
**Related function:** `mean_characters`

In [52]:
# "usage notes" is not in crosswalk - need to return to

## 11-13. What are the mean and median total number of keyword terms per object, after merging results for Keyword, Geographic keyword, and Scientific keyword?
**Property:** Keyword

In [53]:
print(df.ZenodoRecordsCrosswalk.keyword)

1                                                    None
3                                                    None
7                                                    None
9                                                    None
10                                                   None
                              ...                        
7935                                                 None
7938                                                 None
7939    [artificial neural networks, Cypripedioideae, ...
7942                                                 None
7949    [educational data mining, word levels, machine...
Name: metadata, Length: 2217, dtype: object


In [54]:
print(df.ZenodoRecordsCrosswalk.geographic_keyword)

1                                                    None
3                                                    None
7                                                    None
9                                                    None
10                                                   None
                              ...                        
7935                                                 None
7938                                                 None
7939    [artificial neural networks, Cypripedioideae, ...
7942                                                 None
7949    [educational data mining, word levels, machine...
Name: metadata, Length: 2217, dtype: object


In [55]:
# Crosswalk lists:  ('genus', 'kingdom', 'order', 'phylum', 'scientificNameAuthorship', 'specificEpithet') 
# for "scientific_keyword"
# but not seeing those anywhere in metadata column so ignoring for now

In [56]:
#create and concatenate keywords
keywords1 = df.ZenodoRecordsCrosswalk.keyword
keywords2 = df.ZenodoRecordsCrosswalk.geographic_keyword

In [57]:
keywords_all = pd.concat([keywords1, keywords2], axis = 1)
keywords_all

Unnamed: 0,metadata,metadata.1
1,,
3,,
7,,
9,,
10,,
...,...,...
7935,,
7938,,
7939,"[artificial neural networks, Cypripedioideae, ...","[artificial neural networks, Cypripedioideae, ..."
7942,,


In [58]:
#replace the None values with empty lists so the count of string values evaluates to 0
keywords_all = keywords_all.apply(
    lambda row: row.apply(
        lambda cell: cell if cell else []
    ),
    axis=1
)
keywords_all

Unnamed: 0,metadata,metadata.1
1,[],[]
3,[],[]
7,[],[]
9,[],[]
10,[],[]
...,...,...
7935,[],[]
7938,[],[]
7939,"[artificial neural networks, Cypripedioideae, ...","[artificial neural networks, Cypripedioideae, ..."
7942,[],[]


In [59]:
#confirm that the 2 columns are identical, meaning we only need to count keywords in one of these
keywords_all.iloc[:,0].equals(keywords_all.iloc[:,1])

True

In [60]:
#keep one column of keywords
keywords_use = keywords_all.iloc[:,0].to_frame()
keywords_use

Unnamed: 0,metadata
1,[]
3,[]
7,[]
9,[]
10,[]
...,...
7935,[]
7938,[]
7939,"[artificial neural networks, Cypripedioideae, ..."
7942,[]


In [61]:
#count number of keywords for each row
keyword_counts = keywords_use.apply(
    lambda row: sum([len(row[entry]) for entry in keywords_use.columns if row[entry]]),
    axis=1
)
keyword_counts

1       0
3       0
7       0
9       0
10      0
       ..
7935    0
7938    0
7939    7
7942    0
7949    1
Length: 2217, dtype: int64

In [62]:
#confirm accuracy - counting keyword phrases, not all individual words
keywords_use.iloc[0]['metadata']

[]

In [63]:
keywords_use.iloc[1]['metadata']

[]

In [64]:
#get summary statistics
analysis.get_summary_statistics(keyword_counts)

{'mean': 2.8407758231844835, 'median': 2.0, 'max': 92}
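
Since only one keyword column is kept, the per-object count could also be computed directly on the Series, without the intermediate DataFrame. A minimal sketch on toy data (the keywords are invented for illustration):

```python
import pandas as pd

# One entry per object: None (no keywords) or a list of keyword phrases.
toy_keywords = pd.Series([None, ["machine learning"], ["a b", "c", "d e f"]])

# Count phrases, not individual words; missing entries count as zero.
toy_keyword_counts = toy_keywords.apply(
    lambda k: len(k) if isinstance(k, list) else 0
)
```

This counts each list element once, so a multi-word phrase such as "machine learning" is one keyword.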

## 14. Who are the most common funding agencies for each repo? What are the object counts per agency?
**Property:** Funding Agency

In [65]:
funders = df.ZenodoRecordsCrosswalk.funding_agency
funders

1       None
3       None
7       None
9       None
10      None
        ... 
7935    None
7938    None
7939    None
7942    None
7949    None
Name: metadata, Length: 2217, dtype: object

In [66]:
#how many are not none?
len(pd.Series(filter(None, funders)))

227

In [67]:
#may be more than one funder per object, so expand
funders_long = funders.explode()
funders_long

1       None
3       None
7       None
9       None
10      None
        ... 
7935    None
7938    None
7939    None
7942    None
7949    None
Name: metadata, Length: 2272, dtype: object

In [68]:
#name the count column explicitly; pandas >= 2.0 would otherwise call it 'count'
funders_counts = funders_long.value_counts().to_frame('metadata')
funders_counts['percent'] = funders_counts['metadata']/len(funders) * 100
funders_counts

Unnamed: 0,metadata,percent
European Commission,178,8.028868
Research Councils UK,26,1.172756
Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung,24,1.082544
Academy of Finland,14,0.631484
National Science Foundation,13,0.586378
Austrian Science Fund,7,0.315742
Wellcome Trust,5,0.22553
National Institutes of Health,5,0.22553
Agence Nationale de la Recherche,2,0.090212
Australian Research Council,2,0.090212


## 15. What are the mean, median, and max number of Views per object?
**Property:** Views
**Related function:** `get_summary_statistics`

In [69]:
views = df.ZenodoRecordsCrosswalk.views
views

Unnamed: 0,stats,stats.1
1,12.0,5.0
3,66.0,51.0
7,44.0,43.0
9,19.0,18.0
10,6.0,6.0
...,...,...
7935,55.0,51.0
7938,50.0,50.0
7939,55.0,50.0
7942,244.0,240.0


In [70]:
#refer to crosswalk to identify which is unique views and which is total views
#{'stats': ('views', 'unique_views')}
# first column is 'views', second column is 'unique_views'

In [71]:
#get summary stats for 'views', making sure treated as numeric and no missing values
all_views = views.iloc[:,0]
all_views = pd.to_numeric(all_views)
all_views = all_views.dropna()

In [72]:
analysis.get_summary_statistics(all_views)

{'mean': 250.36671177266575, 'median': 40.0, 'max': 132688.0}

In [73]:
#get summary stats for 'unique views', making sure treated as numeric and no missing values
unique_views = views.iloc[:,1]
unique_views = pd.to_numeric(unique_views)
unique_views = unique_views.dropna()

In [74]:
unique_views

1         5.0
3        51.0
7        43.0
9        18.0
10        6.0
        ...  
7935     51.0
7938     50.0
7939     50.0
7942    240.0
7949    458.0
Name: stats, Length: 2217, dtype: float64

In [75]:
analysis.get_summary_statistics(unique_views)

{'mean': 217.0365358592693, 'median': 35.0, 'max': 114993.0}
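
Selecting the columns by position relies on the crosswalk returning `views` before `unique_views`. If the raw `stats` dictionaries are available, selecting by key avoids that dependency. A minimal sketch, assuming each `stats` entry is a dict with `views` and `unique_views` keys (the toy values are the first two rows of the output above, plus one missing entry):

```python
import pandas as pd

toy_stats = pd.Series(
    [
        {"views": 12.0, "unique_views": 5.0},
        {"views": 66.0, "unique_views": 51.0},
        None,  # record without usage statistics
    ]
)

# Expand the dicts into named columns; missing entries become NaN rows.
stats_frame = pd.DataFrame(
    [s if isinstance(s, dict) else {} for s in toy_stats],
    index=toy_stats.index,
)

toy_all_views = pd.to_numeric(stats_frame["views"]).dropna()
toy_unique_views = pd.to_numeric(stats_frame["unique_views"]).dropna()
```

The same pattern would apply to `downloads` and `unique_downloads`.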

## 16. What are the mean, median, and max (total) number of downloads per object?
**Property:** Downloads
**Related function:** `get_summary_statistics`

In [76]:
downloads = df.ZenodoRecordsCrosswalk.downloads
downloads

Unnamed: 0,stats,stats.1
1,1.0,1.0
3,11.0,11.0
7,5.0,5.0
9,0.0,0.0
10,1.0,1.0
...,...,...
7935,3.0,3.0
7938,7.0,7.0
7939,1.0,1.0
7942,8.0,8.0


In [77]:
#refer to crosswalk to identify which is unique downloads and which is total downloads
# {'stats': ('downloads', 'unique_downloads')}
#first column is 'downloads', second column is 'unique downloads'

In [78]:
#get summary stats for 'downloads', making sure treated as numeric and no missing values
all_downloads = downloads.iloc[:,0]
all_downloads = pd.to_numeric(all_downloads)
all_downloads = all_downloads.dropna()

In [79]:
analysis.get_summary_statistics(all_downloads)

{'mean': 731.2557510148849, 'median': 10.0, 'max': 366857.0}

In [80]:
#get summary stats for 'unique downloads', making sure treated as numeric and no missing values
unique_downloads = downloads.iloc[:,1]
unique_downloads = pd.to_numeric(unique_downloads)
unique_downloads = unique_downloads.dropna()

In [81]:
analysis.get_summary_statistics(unique_downloads)

{'mean': 217.22417681551647, 'median': 7.0, 'max': 186905.0}

## 17. What are the mean, median, and max Citation counts per object?
**Property:** Citation count
**Related function:** `get_summary_statistics`

In [82]:
citation_count = df.ZenodoRecordsCrosswalk.citation_count
citation_count

In [83]:
#confirm missing for this repo
print(df.ZenodoRecordsCrosswalk.citation_count)

None


## 18. How many objects contain each given resource type?
**Property:** Resource type

In [84]:
resource_types = df.ZenodoRecordsCrosswalk.resource_type
resource_types

1        Dataset
3       Software
7       Software
9       Software
10      Software
          ...   
7935    Software
7938    Software
7939     Dataset
7942    Software
7949     Dataset
Name: metadata, Length: 2217, dtype: object

In [85]:
resource_types_counts = resource_types.value_counts().to_frame()
resource_types_counts['percent'] = round(resource_types_counts['metadata']/len(resource_types) * 100)
resource_types_counts

Unnamed: 0,metadata,percent
Dataset,1367,62.0
Software,850,38.0
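
The cell above refers to the count column as `'metadata'`, which depends on how `value_counts().to_frame()` names its output; to my knowledge newer pandas releases (2.0 and later) name that column `count` instead. A sketch that avoids relying on the column name, using toy data and hypothetical variable names:

```python
import pandas as pd

# toy data, not taken from the Zenodo extract
resource_types_example = pd.Series(
    ['Dataset', 'Software', 'Dataset', 'Dataset'], name='metadata'
)

# keep the counts as a Series and build the table with explicit column names
counts = resource_types_example.value_counts()
summary_table = pd.DataFrame({
    'count': counts,
    'percent': (counts / len(resource_types_example) * 100).round(),
})
print(summary_table)
```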


## 19. How many objects contain each type of file extension given?
**Property:** File Extension
**Related function:** `get_file_extensions`

In [86]:
files = df.ZenodoRecordsCrosswalk.file_extension
files

1        None
3       [zip]
7       [zip]
9       [zip]
10      [zip]
        ...  
7935    [zip]
7938    [zip]
7939    [zip]
7942    [zip]
7949     [gz]
Name: files, Length: 2217, dtype: object

In [87]:
#add ID column to make easier to count by object
files_ids = pd.concat([ids, files], axis = 1)
files_ids.head(10)

Unnamed: 0,id,files
1,5747173,
3,4738770,[zip]
7,4456470,[zip]
9,5654850,[zip]
10,4547779,[zip]
12,4596407,"[txt, txt, txt, txt, txt, txt, txt, txt, txt, ..."
13,5515761,[zip]
14,5008068,"[zip, pdf]"
15,5541446,[zip]
20,4986399,"[7z, 7z, 7z, 7z, 7z, 7z, 7z, 7z, 7z, 7z, 7z, 7..."


In [88]:
#expand so each file within object is own row
files_ids = files_ids.explode('files')
files_ids.head(10)

Unnamed: 0,id,files
1,5747173,
3,4738770,zip
7,4456470,zip
9,5654850,zip
10,4547779,zip
12,4596407,txt
12,4596407,txt
12,4596407,txt
12,4596407,txt
12,4596407,txt


In [89]:
#drop duplicates
files_ids_unique = files_ids.drop_duplicates()
files_ids_unique

Unnamed: 0,id,files
1,5747173,
3,4738770,zip
7,4456470,zip
9,5654850,zip
10,4547779,zip
...,...,...
7935,14183,zip
7938,22204,zip
7939,31904,zip
7942,10075,zip


In [90]:
#confirm
files_ids_unique.loc[files_ids_unique['id'] == 4596407]

Unnamed: 0,id,files
12,4596407,txt


In [91]:
#confirm
files_ids_unique.loc[files_ids_unique['id'] == 5008068]

Unnamed: 0,id,files
14,5008068,zip
14,5008068,pdf


In [92]:
#get ESTIMATE of most common formats - needs some clean up

#group by file extension and count objects
files_ids_grouped = files_ids_unique.groupby('files').size().to_frame().sort_values(0, ascending = False)
files_ids_grouped['percent'] = files_ids_grouped[0]/len(files) * 100
files_ids_grouped.head(10)

Unnamed: 0_level_0,0,percent
files,Unnamed: 1_level_1,Unnamed: 2_level_1
zip,1205,54.352729
gz,292,13.170952
csv,217,9.788002
txt,212,9.562472
xlsx,113,5.096978
pdf,88,3.969328
tsv,83,3.743798
md,76,3.428056
tif,58,2.616148
png,55,2.48083
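
The per-object count above rests on three steps: explode, drop duplicate (object, extension) pairs, then group. A compact sketch of those steps on toy data (the ids and extensions are made up for illustration):

```python
import pandas as pd

# toy data, not taken from the Zenodo extract
example = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'files': [None, ['zip'], ['txt', 'txt', 'txt'], ['zip', 'pdf']],
})

# one row per file, then one row per (object, extension) pair
per_object = example.explode('files').drop_duplicates()

# number of objects that contain at least one file of each extension;
# groupby drops the missing extension of object 101 by default
objects_per_ext = per_object.groupby('files').size().sort_values(ascending=False)
percent_of_objects = objects_per_ext / len(example) * 100
print(objects_per_ext)
```

Because the denominator is the number of objects, including those with no files, the percentages can sum to more or less than 100.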


In [93]:
#export for further clean up, refining estimates, and plotting

In [94]:
#reset index and update column names
ext_grouped_ready = files_ids_unique.reset_index(level=0)
ext_grouped_ready.columns = ['index', 'id', 'files']

#add column with name of repo
ext_grouped_ready['repo'] = 'zenodo_subset'

#drop extra column to make consistent with other repo exports
ext_grouped_ready = ext_grouped_ready.drop(columns=['id'])

ext_grouped_ready.head(10)

Unnamed: 0,index,files,repo
0,1,,zenodo_subset
1,3,zip,zenodo_subset
2,7,zip,zenodo_subset
3,9,zip,zenodo_subset
4,10,zip,zenodo_subset
5,12,txt,zenodo_subset
6,13,zip,zenodo_subset
7,14,zip,zenodo_subset
8,14,pdf,zenodo_subset
9,15,zip,zenodo_subset


In [95]:
#export to Figures folder
#Path (imported in Setup) keeps the relative path portable across operating systems
ext_grouped_ready.to_csv(Path('../../Figures/Figure2/file_ext_data/zenodo_subset_extensions.csv'))

## 19.5 How many files of each type of file extension are present?
**Property:** File extension

In [96]:
files = df.ZenodoRecordsCrosswalk.file_extension
files

1        None
3       [zip]
7       [zip]
9       [zip]
10      [zip]
        ...  
7935    [zip]
7938    [zip]
7939    [zip]
7942    [zip]
7949     [gz]
Name: files, Length: 2217, dtype: object

In [97]:
files = files.explode()
files

1       None
3        zip
7        zip
9        zip
10       zip
        ... 
7935     zip
7938     zip
7939     zip
7942     zip
7949      gz
Name: files, Length: 19060, dtype: object

In [98]:
#drop None and count
files_count = files.dropna()
files_count = files_count.value_counts()
files_count.head(10)

xz     3043
gz     2634
zip    2290
jpg    1639
tif    1063
nc      914
csv     848
tsv     648
bz2     581
txt     496
Name: files, dtype: int64

## 20. How many objects contain each type of File format given?
**Property:** File format

In [99]:
file_formats = df.ZenodoRecordsCrosswalk.file_format
file_formats

In [100]:
#confirm missing in this repo
print(df.ZenodoRecordsCrosswalk.file_format)

None


## 21. How many objects contain each type of Media type given?
**Property:** Media type

In [101]:
media_types = df.ZenodoRecordsCrosswalk.media_type
media_types

In [102]:
#confirm missing in this repo
print(df.ZenodoRecordsCrosswalk.media_type)

None


## 22. a) How many objects report one related resource type, and b) how many objects report each of those types? c) How many objects report multiple related resource types (regardless of which types)?
**Property:** Related resource type

In [103]:
related_resource_types = df.ZenodoRecordsCrosswalk.related_resource_type
related_resource_types

1                          [isVersionOf]
3          [isSupplementTo, isVersionOf]
7                          [isVersionOf]
9          [isSupplementTo, isVersionOf]
10         [isSupplementTo, isVersionOf]
                      ...               
7935       [isSupplementTo, isVersionOf]
7938                    [isSupplementTo]
7939                    [isSupplementTo]
7942    [isSupplementTo, isSupplementTo]
7949                                None
Name: metadata, Length: 2217, dtype: object

In [104]:
print(df.ZenodoRecordsCrosswalk.related_resource_type)

1                          [isVersionOf]
3          [isSupplementTo, isVersionOf]
7                          [isVersionOf]
9          [isSupplementTo, isVersionOf]
10         [isSupplementTo, isVersionOf]
                      ...               
7935       [isSupplementTo, isVersionOf]
7938                    [isSupplementTo]
7939                    [isSupplementTo]
7942    [isSupplementTo, isSupplementTo]
7949                                None
Name: metadata, Length: 2217, dtype: object


In [105]:
#replace None values with empty list, 0 related resource types
related_resource_types = related_resource_types.apply(lambda d: d if isinstance(d, list) else [])
related_resource_types

1                          [isVersionOf]
3          [isSupplementTo, isVersionOf]
7                          [isVersionOf]
9          [isSupplementTo, isVersionOf]
10         [isSupplementTo, isVersionOf]
                      ...               
7935       [isSupplementTo, isVersionOf]
7938                    [isSupplementTo]
7939                    [isSupplementTo]
7942    [isSupplementTo, isSupplementTo]
7949                                  []
Name: metadata, Length: 2217, dtype: object

In [106]:
#remove None values from lists
related_resource_types_clean = related_resource_types
related_resource_types_clean = related_resource_types_clean.apply(lambda el: [x for x in el if x is not None])
related_resource_types_clean

1                          [isVersionOf]
3          [isSupplementTo, isVersionOf]
7                          [isVersionOf]
9          [isSupplementTo, isVersionOf]
10         [isSupplementTo, isVersionOf]
                      ...               
7935       [isSupplementTo, isVersionOf]
7938                    [isSupplementTo]
7939                    [isSupplementTo]
7942    [isSupplementTo, isSupplementTo]
7949                                  []
Name: metadata, Length: 2217, dtype: object

In [107]:
related_resource_types_counts = related_resource_types_clean.apply(len)
related_resource_types_counts

1       1
3       2
7       1
9       2
10      2
       ..
7935    2
7938    1
7939    1
7942    2
7949    0
Name: metadata, Length: 2217, dtype: int64

In [108]:
#how many objects (second column) have a specified number of related resource types listed (first column)
related_resource_types_counts.value_counts().to_frame()

Unnamed: 0,metadata
1,1205
2,700
0,140
3,77
4,46
5,27
6,10
8,3
10,2
7,2
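
Parts a) and c) of the question can be read directly off the per-object counts. A minimal sketch on toy data; `count_entries` is a hypothetical helper that combines the two clean-up steps above (non-list values count as 0, `None` items inside a list are skipped). Like the cells above, it counts list entries, so a repeated type is counted each time.

```python
import pandas as pd

# toy data, not taken from the Zenodo extract
example = pd.Series([
    ['isVersionOf'],
    ['isSupplementTo', 'isVersionOf'],
    None,
    ['cites', None],
])

def count_entries(entry):
    # hypothetical helper: missing values count as zero entries
    if not isinstance(entry, list):
        return 0
    return sum(x is not None for x in entry)

n_types = example.apply(count_entries)

exactly_one = int((n_types == 1).sum())     # a) objects reporting one type
multiple = int((n_types > 1).sum())         # c) objects reporting several
none_reported = int((n_types == 0).sum())   # objects reporting none
print(exactly_one, multiple, none_reported)
```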


#### How many objects report each type of related resource?

In [109]:
#add ID column to make easier to count by object
related_resource_types_ids = pd.concat([ids, related_resource_types_clean], axis = 1)
related_resource_types_ids.head(10)

Unnamed: 0,id,metadata
1,5747173,[isVersionOf]
3,4738770,"[isSupplementTo, isVersionOf]"
7,4456470,[isVersionOf]
9,5654850,"[isSupplementTo, isVersionOf]"
10,4547779,"[isSupplementTo, isVersionOf]"
12,4596407,"[isSupplementTo, isVersionOf]"
13,5515761,[isVersionOf]
14,5008068,[cites]
15,5541446,"[isSupplementTo, isVersionOf]"
20,4986399,"[isSourceOf, isVersionOf]"


In [110]:
#expand so each related resource type within object is own row
related_resource_types_ids = related_resource_types_ids.explode('metadata')
related_resource_types_ids.head(10)

Unnamed: 0,id,metadata
1,5747173,isVersionOf
3,4738770,isSupplementTo
3,4738770,isVersionOf
7,4456470,isVersionOf
9,5654850,isSupplementTo
9,5654850,isVersionOf
10,4547779,isSupplementTo
10,4547779,isVersionOf
12,4596407,isSupplementTo
12,4596407,isVersionOf


In [111]:
#group by related resource type and count occurrences (repeats within an object are counted each time)
related_resource_types_ids_grouped = related_resource_types_ids.groupby('metadata').size().to_frame().sort_values(0, ascending = False)
related_resource_types_ids_grouped['percent'] = related_resource_types_ids_grouped[0]/len(related_resource_types) * 100
related_resource_types_ids_grouped

Unnamed: 0_level_0,0,percent
metadata,Unnamed: 1_level_1,Unnamed: 2_level_1
isVersionOf,1886,85.069914
isSupplementTo,719,32.431213
cites,329,14.839874
isSupplementedBy,78,3.518268
isDocumentedBy,75,3.38295
isDerivedFrom,55,2.48083
isCitedBy,52,2.345512
isCompiledBy,43,1.939558
references,38,1.714028
isReferencedBy,37,1.668922



## 23-25. How many objects have an entry in at least one of the three properties (Original data URL, Primary manuscript PID/URL, and Related resource identifier)? Count such objects as Related resources = True.
**Property:** Related Resource Identifier

In [112]:
related_resources_1 = df.ZenodoRecordsCrosswalk.original_data_url
related_resources_2 = df.ZenodoRecordsCrosswalk.primary_manuscript
related_resources_3 = df.ZenodoRecordsCrosswalk.related_resource_identifier

In [113]:
print(related_resources_1)

None


In [114]:
print(related_resources_2)

1                                [10.5281/zenodo.5747172]
3       [https://github.com/kratzert/multiple_forcing/...
7                                [10.5281/zenodo.4456151]
9       [https://github.com/cmu-delphi/covidcast-pnas/...
10      [https://github.com/medipixel/rl_algorithms/tr...
                              ...                        
7935    [https://github.com/datapoet/hubminer/tree/v1....
7938    [https://github.com/stefanfausser/neural-netwo...
7939    [https://github.com/naturalis/nbclassify-data/...
7942    [http://mlsec.org/malheur, https://github.com/...
7949                                                 None
Name: metadata, Length: 2217, dtype: object


In [115]:
print(related_resources_3)

1                                [10.5281/zenodo.5747172]
3       [https://github.com/kratzert/multiple_forcing/...
7                                [10.5281/zenodo.4456151]
9       [https://github.com/cmu-delphi/covidcast-pnas/...
10      [https://github.com/medipixel/rl_algorithms/tr...
                              ...                        
7935    [https://github.com/datapoet/hubminer/tree/v1....
7938    [https://github.com/stefanfausser/neural-netwo...
7939    [https://github.com/naturalis/nbclassify-data/...
7942    [http://mlsec.org/malheur, https://github.com/...
7949                                                 None
Name: metadata, Length: 2217, dtype: object


In [116]:
#confirm that related_resources_2 and related_resources_3 are identical
related_resources_2.equals(related_resources_3)

True

In [117]:
#how many are not none?
len(pd.Series(filter(None, related_resources_3)))

2077

In [118]:
print(f'There are {len(pd.Series(filter(None, related_resources_3)))} objects with a related resource link')

There are 2077 objects with a related resource link
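
For this repository only one of the three properties carries distinct information, so counting `related_resources_3` is enough. For a repository where all three are populated, the "any of the three" rule might be sketched as below. The helper `has_related_resource` and the toy values are hypothetical, and the sketch assumes each property is a Series aligned by object (here `original_data_url` is a plain `None`, which would need to be expanded first).

```python
import pandas as pd

def has_related_resource(*entries):
    """Hypothetical helper: True if any property value holds at least one link."""
    for entry in entries:
        if isinstance(entry, (list, tuple)) and len(entry) > 0:
            return True
        if isinstance(entry, str) and entry.strip():
            return True
    return False

# toy data, not taken from the Zenodo extract
original_data_url = pd.Series([None, None, None])
primary_manuscript = pd.Series([['10.1234/abc'], None, None])
related_identifier = pd.Series([['10.1234/abc'], ['https://example.org/x'], None])

# one True/False flag per object
flags = pd.Series([
    has_related_resource(a, b, c)
    for a, b, c in zip(original_data_url, primary_manuscript, related_identifier)
])
n_with_related = int(flags.sum())
print(f'There are {n_with_related} objects with a related resource link')
```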


## 23-25. Also, what is the mean number of related resource links per object (again looking at the three properties: Original data URL, Primary manuscript PID/URL, and Related resource identifier)?
**Property:** Related Resource Identifier

In this case, 2077 of the 2217 objects have at least one related resource. The cells below keep only those objects, so the mean and median describe the number of related resources among objects that report any.

In [119]:
#function to count links
def count_links(entry):
    try:
        return len(entry)
    except TypeError:
        return 0

Function `count_links` expects a list (to allow for multiple links) and returns 0 for entries that have no length, such as `None`.

In [120]:
#only not none values
rr_counts = pd.Series(filter(None, related_resources_3))
rr_counts

0                                [10.5281/zenodo.5747172]
1       [https://github.com/kratzert/multiple_forcing/...
2                                [10.5281/zenodo.4456151]
3       [https://github.com/cmu-delphi/covidcast-pnas/...
4       [https://github.com/medipixel/rl_algorithms/tr...
                              ...                        
2072        [https://github.com/wavecake/housed1/tree/v1]
2073    [https://github.com/datapoet/hubminer/tree/v1....
2074    [https://github.com/stefanfausser/neural-netwo...
2075    [https://github.com/naturalis/nbclassify-data/...
2076    [http://mlsec.org/malheur, https://github.com/...
Length: 2077, dtype: object

In [121]:
#get links per object
rr_counts_calc = rr_counts.apply(len)
rr_counts_calc

0       1
1       2
2       1
3       2
4       2
       ..
2072    1
2073    2
2074    1
2075    1
2076    2
Length: 2077, dtype: int64

In [122]:
#get summary statistics
analysis.get_summary_statistics(rr_counts_calc)

{'mean': 1.65912373615792, 'median': 1.0, 'max': 104}
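
The statistics above are computed over objects that have at least one link. Applying `count_links` to the unfiltered Series would instead include objects without links as zeros, which lowers the mean. A sketch of both variants on toy data (the links are made up):

```python
import pandas as pd

def count_links(entry):
    # same logic as the count_links cell in this section;
    # note that a bare string would be counted by its characters
    try:
        return len(entry)
    except TypeError:
        return 0

# toy data, not taken from the Zenodo extract
example = pd.Series([
    ['10.1234/abc'],
    ['https://example.org/a', 'https://example.org/b'],
    None,
])

links_all = example.apply(count_links)        # objects without links count as 0
links_nonempty = links_all[links_all > 0]     # only objects with at least one link

mean_all = links_all.mean()
mean_nonempty = links_nonempty.mean()
print(mean_all, mean_nonempty)
```

Which denominator is appropriate depends on whether the question is about all objects or only those that report related resources.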

## 26. How many objects report each relation type? How many objects report multiple relation types, regardless of what those types are?
**Property:** Related resource relation type

In [123]:
relation_type = df.ZenodoRecordsCrosswalk.related_resource_relation_type
relation_type

1                          [isVersionOf]
3          [isSupplementTo, isVersionOf]
7                          [isVersionOf]
9          [isSupplementTo, isVersionOf]
10         [isSupplementTo, isVersionOf]
                      ...               
7935       [isSupplementTo, isVersionOf]
7938                    [isSupplementTo]
7939                    [isSupplementTo]
7942    [isSupplementTo, isSupplementTo]
7949                                None
Name: metadata, Length: 2217, dtype: object

#### How many objects report each relation type?

In [124]:
#add ID column to make easier to count by object
relation_type_ids = pd.concat([ids, relation_type], axis = 1)
relation_type_ids.head(10)

Unnamed: 0,id,metadata
1,5747173,[isVersionOf]
3,4738770,"[isSupplementTo, isVersionOf]"
7,4456470,[isVersionOf]
9,5654850,"[isSupplementTo, isVersionOf]"
10,4547779,"[isSupplementTo, isVersionOf]"
12,4596407,"[isSupplementTo, isVersionOf]"
13,5515761,[isVersionOf]
14,5008068,[cites]
15,5541446,"[isSupplementTo, isVersionOf]"
20,4986399,"[isSourceOf, isVersionOf]"


In [125]:
#expand so each type within object is own row
relation_type_ids = relation_type_ids.explode('metadata')
relation_type_ids.head(10)

Unnamed: 0,id,metadata
1,5747173,isVersionOf
3,4738770,isSupplementTo
3,4738770,isVersionOf
7,4456470,isVersionOf
9,5654850,isSupplementTo
9,5654850,isVersionOf
10,4547779,isSupplementTo
10,4547779,isVersionOf
12,4596407,isSupplementTo
12,4596407,isVersionOf


In [126]:
#group by relation type to sum up relations of same type within objects
relation_type_ids_grouped = relation_type_ids.groupby('metadata').value_counts().to_frame()
relation_type_ids_grouped.tail(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,0
metadata,id,Unnamed: 2_level_1
references,3332712,1
references,808456,1
references,2535759,1
references,2558452,1
references,2634024,1
references,2669180,1
references,2669505,1
references,3601310,1
references,3842143,1
references,3984905,1


In [127]:
#now sum up number of IDs within each relation group
relation_type_ids_grouped = relation_type_ids_grouped.groupby('metadata').value_counts('id').to_frame()
relation_type_ids_grouped['percent'] = relation_type_ids_grouped[0]/len(relation_type) * 100
relation_type_ids_grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,0,percent
metadata,0,Unnamed: 2_level_1,Unnamed: 3_level_1
isVersionOf,1,1886,85.069914
isSupplementTo,1,654,29.499323
cites,1,128,5.773568
isDocumentedBy,1,47,2.119982
isCitedBy,1,36,1.623816
isDerivedFrom,1,33,1.488498
isSupplementedBy,1,24,1.082544
isReferencedBy,1,23,1.037438
isCompiledBy,1,21,0.947226
references,1,15,0.67659
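
An alternative way to count objects per relation type is to collapse repeats within each object first, so that an object listing the same type twice is counted once. A minimal sketch on toy data, with hypothetical variable names; it is not the method used in the cells above.

```python
import pandas as pd

# toy data, not taken from the Zenodo extract
example = pd.Series([
    ['isVersionOf'],
    ['isSupplementTo', 'isSupplementTo'],
    ['isSupplementTo', 'isVersionOf'],
    None,
])

# distinct relation types per object; missing values become an empty list
distinct_per_object = example.apply(
    lambda e: sorted({x for x in e if x is not None}) if isinstance(e, list) else []
)

# one row per (object, type); empty lists explode to NaN and are dropped
objects_per_type = distinct_per_object.explode().dropna().value_counts()
percent_of_objects = objects_per_type / len(example) * 100
print(objects_per_type)
```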


#### How many objects report multiple relation types, regardless of what those types are?

In [128]:
#replace None values with empty list, 0 related resource types
relation_type = relation_type.apply(lambda d: d if isinstance(d, list) else [])
relation_type

1                          [isVersionOf]
3          [isSupplementTo, isVersionOf]
7                          [isVersionOf]
9          [isSupplementTo, isVersionOf]
10         [isSupplementTo, isVersionOf]
                      ...               
7935       [isSupplementTo, isVersionOf]
7938                    [isSupplementTo]
7939                    [isSupplementTo]
7942    [isSupplementTo, isSupplementTo]
7949                                  []
Name: metadata, Length: 2217, dtype: object

In [129]:
relation_type_counts = relation_type.apply(len)
relation_type_counts

1       1
3       2
7       1
9       2
10      2
       ..
7935    2
7938    1
7939    1
7942    2
7949    0
Name: metadata, Length: 2217, dtype: int64

In [130]:
#how many objects (second column) have a specified number of relation types listed (first column)
relation_type_counts.value_counts().to_frame()

Unnamed: 0,metadata
1,1205
2,700
0,140
3,77
4,46
5,27
6,10
8,3
10,2
7,2


## 27. For repositories that store the full citation in a designated field, how many objects have a populated citation? How many objects have a citation and a URL or other actionable link?
**Property:** Citation

In [131]:
citations = df.ZenodoRecordsCrosswalk.citation
citations

In [132]:
#confirm missing in this repo
print(df.ZenodoRecordsCrosswalk.citation)

None
