# Analysis Template Walkthrough

# Setup

## Select extract
So that the template cells read the correct extract, set `repository` to the repository name and `object_type` to the repository object type. Together these two values determine the file that is loaded later, `{repository}_{object_type}.json`.

In [1]:
repository = 'uci'
object_type = 'datasets'

In [2]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

In [3]:
#see more rows and columns of output
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100) 

## Helper Functions

In [4]:
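import os, sys
#add the directory two levels above the working directory to the import path so that `utils` can be imported
dir2 = os.path.abspath('../')
dir1 = os.path.dirname(dir2)
if dir1 not in sys.path:
    sys.path.append(dir1)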

from utils import analysis
from utils.crosswalk import RepositoryExtract, property_crosswalk
from utils import accessors
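
The cells below read every property through an attribute of the form `df.UCIDatasetsCrosswalk.<property>`, which returns the matching column of the extract, or `None` when the repository does not provide that property. One way such an attribute can be provided is pandas' accessor registration. The following is a minimal sketch of that idea only: `DemoCrosswalk`, `DEMO_COLUMN_MAP`, and `_lookup` are hypothetical names, and the mapping is inferred from the output names shown in this notebook. The actual implementation in `utils.crosswalk` and `utils.accessors` may work differently.

```python
import pandas as pd

# Hypothetical mapping from crosswalk property names to extract columns.
# These pairs are inferred from output names in this notebook; the real
# mapping lives in utils.crosswalk and may differ.
DEMO_COLUMN_MAP = {
    "unique_identifier": "url",
    "description": "abstract",
    "views": "num_views",
}


@pd.api.extensions.register_dataframe_accessor("DemoCrosswalk")
class DemoCrosswalk:
    """Illustrative accessor; not the project's UCIDatasetsCrosswalk."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def _lookup(self, prop):
        # Return the mapped column, or None when no column is available.
        column = DEMO_COLUMN_MAP.get(prop)
        if column is None or column not in self._obj.columns:
            return None
        return self._obj[column]

    @property
    def unique_identifier(self):
        return self._lookup("unique_identifier")

    @property
    def views(self):
        return self._lookup("views")

    @property
    def file_size(self):
        return self._lookup("file_size")


demo = pd.DataFrame({"url": ["a", "b"], "num_views": [10, 20]})
demo_ids = demo.DemoCrosswalk.unique_identifier  # the 'url' column
demo_sizes = demo.DemoCrosswalk.file_size        # None: nothing mapped
```

Returning `None` for an unmapped property matches the "confirm missing for this repo" checks later in the walkthrough, where printing the property shows `None`.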

# Summary Statistic Walkthroughs

Read in the repository's .json extract.

In [5]:
df = pd.read_json(f'{repository}_{object_type}.json')

In [6]:
df

Unnamed: 0,abstract,additional_info,associated_tasks,citation_requests/acknowledgements,creation_purpose,creators,dataset_characteristics,doi,donation_date,files,funders,instances_represent,keywords,license,link_date,missing_value_placeholder,missing_values,num_attributes,num_citations,num_instances,num_views,preprocessing_done,previous_tasks,recommended_data_split,sensitive_data,subject_area,url,papers
0,"A small classic dataset from Fisher, 1936. One...",This is perhaps the best known database to be ...,Classification,,,[R.A. Fisher],Multivariate,,1988-07-01,"[ Parent Directory, Index, bezdekIris.data, ...",,1. sepal length in cm\n 2. sepal width in...,[ecology],[This allows for the sharing and adaptation of...,,,False,5.0,351,150,120760,,,,,Life Science,https://archive-beta.ics.uci.edu/ml/datasets/iris,[{'title': '$ell_p$-Box ADMM: A Versatile Fram...
1,This diabetes dataset is from AIM '94,Diabetes patient records were obtained from tw...,,,,[Michael Kahn],"Multivariate, Time-Series",,,"[ Parent Directory, Index, README, diabetes...",,Diabetes files consist of four fields per reco...,[N/A],[This allows for the sharing and adaptation of...,,,,,102,0,84510,,,,,Life,https://archive-beta.ics.uci.edu/ml/datasets/d...,[{'title': 'A Cooperative Learning Model for t...
2,Predict whether income exceeds $50K/yr based o...,Extraction was done by Barry Becker from the 1...,Classification,,,[],Multivariate,,1996-05-01,"[ Parent Directory, Index, adult.data, adul...",,"Listing of attributes:\n\n>50K, <=50K.\n\nage:...","[fairness, census]",[This allows for the sharing and adaptation of...,,1,True,15.0,256,48842,79966,,,,,Social,https://archive-beta.ics.uci.edu/ml/datasets/a...,"[{'title': '($k$,$epsilon$)-Anonymity: $k$-Ano..."
3,"4 databases: Cleveland, Hungary, Switzerland, ...","This database contains 76 attributes, but all ...",Classification,,,"[Andras Janosi, William Steinbrunn, Matthias P...",Multivariate,,1988-07-01,"[ Parent Directory, Index, WARNING, ask-det...",,Only 14 attributes used:\n 1. #3 (age) ...,[N/A],[This allows for the sharing and adaptation of...,,,,,63,303,75839,,,,,Life,https://archive-beta.ics.uci.edu/ml/datasets/h...,[{'title': 'A Collective Learning Approach for...
4,Using chemical analysis determine the origin o...,These data are the results of a chemical analy...,Classification,,test,[],Multivariate,,1991-07-01,"[ Parent Directory, Index, wine.data, wine....",,All attributes are continuous\n\t\nNo statisti...,[N/A],[This allows for the sharing and adaptation of...,,0,True,14.0,130,178,62072,,,,,Physical,https://archive-beta.ics.uci.edu/ml/datasets/wine,[{'title': '$k$-POD: A Method for $k$-Means Cl...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586,This dataset was collected by Shan-Hung Wu and...,This dataset was collected by Shan-Hung Wu and...,Clustering,,,[Shan-Hung Wu],"Multivariate, Text",,2016-11-02,"[ Parent Directory, Mturk User-Perceived Clus...",,As the above.,[N/A],[This allows for the sharing and adaptation of...,,,True,500.0,0,180,24,,,,,Computer,https://archive-beta.ics.uci.edu/ml/datasets/m...,[]
587,A pen-based database with more than 11k isolat...,"<font class=""normal"">We have created the UJIpe...",Classification,,,"[F. Prat, M. Castro, D. Llorens, A. Marzal, J....","Multivariate, Sequential",,2009-01-22,"[ Parent Directory, uji2.names, ujipenchars2...",,"<font class=""normal"">The file 'ujipenchars2.tx...",[N/A],[This allows for the sharing and adaptation of...,,,,,0,11640,22,,,,,Computer,https://archive-beta.ics.uci.edu/ml/datasets/u...,[]
588,The handwritten dataset was collected from 170...,The dataset contains handwritten Urdu/Arabic n...,Classification,,,[Ghazanfar Latif],Univariate,,2018-08-05,"[ Parent Directory, PMU-UD.zip]",,The participants were asked to write the numer...,[N/A],[This allows for the sharing and adaptation of...,,,,,0,5180,22,,,,,Computer,https://archive-beta.ics.uci.edu/ml/datasets/p...,[]
589,This dataset contains sentences extracted from...,This dataset contains sentences extracted from...,,,,[],Text,,2010-07-06,"[ Parent Directory, OpinosisDataset1.0.zip, ...",,,[N/A],[This allows for the sharing and adaptation of...,,,,,0,51,21,,,,,Computer,https://archive-beta.ics.uci.edu/ml/datasets/o...,[]


## 1. How many total objects (not just records) are in our main dataset extracts for each repository?
**Property:** unique_identifier

In [7]:
ids = df.UCIDatasetsCrosswalk.unique_identifier
ids

0      https://archive-beta.ics.uci.edu/ml/datasets/iris
1      https://archive-beta.ics.uci.edu/ml/datasets/d...
2      https://archive-beta.ics.uci.edu/ml/datasets/a...
3      https://archive-beta.ics.uci.edu/ml/datasets/h...
4      https://archive-beta.ics.uci.edu/ml/datasets/wine
                             ...                        
586    https://archive-beta.ics.uci.edu/ml/datasets/m...
587    https://archive-beta.ics.uci.edu/ml/datasets/u...
588    https://archive-beta.ics.uci.edu/ml/datasets/p...
589    https://archive-beta.ics.uci.edu/ml/datasets/o...
590    https://archive-beta.ics.uci.edu/ml/datasets/c...
Name: url, Length: 591, dtype: object

In [8]:
ids.nunique()

583

In [9]:
print(f'There are {len(ids)} items in the UCI extract, with {ids.nunique()} unique IDs.')

There are 591 items in the UCI extract, with 583 unique IDs.


In [10]:
#most IDs appear in a single row, but a few appear in 2 rows
ids.value_counts()

https://archive-beta.ics.uci.edu/ml/datasets/divorce+predictors+data+set           2
https://archive-beta.ics.uci.edu/ml/datasets/air+quality                           2
https://archive-beta.ics.uci.edu/ml/datasets/stock+keeping+units                   2
https://archive-beta.ics.uci.edu/ml/datasets/lastfm+asia+social+network            2
https://archive-beta.ics.uci.edu/ml/datasets/wave+energy+converters                2
                                                                                  ..
https://archive-beta.ics.uci.edu/ml/datasets/banknote+authentication               1
https://archive-beta.ics.uci.edu/ml/datasets/drug+consumption+quantified           1
https://archive-beta.ics.uci.edu/ml/datasets/wikipedia+math+essentials-1           1
https://archive-beta.ics.uci.edu/ml/datasets/parkinsons                            1
https://archive-beta.ics.uci.edu/ml/datasets/connectionist+bench+nettalk+corpus    1
Name: url, Length: 583, dtype: int64

In [11]:
#look into these duplicate IDs
uci_test_dupes = df.loc[df['url'] == "https://archive-beta.ics.uci.edu/ml/datasets/divorce+predictors+data+set"]
uci_test_dupes

Unnamed: 0,abstract,additional_info,associated_tasks,citation_requests/acknowledgements,creation_purpose,creators,dataset_characteristics,doi,donation_date,files,funders,instances_represent,keywords,license,link_date,missing_value_placeholder,missing_values,num_attributes,num_citations,num_instances,num_views,preprocessing_done,previous_tasks,recommended_data_split,sensitive_data,subject_area,url,papers
166,Participants completed the Personal Informatio...,Provide all relevant information about your da...,Classification,,,"[Mustafa Yntem, Kemal Adem, Serhat Klarslan]","Multivariate, Univariate",,2019-07-24,"[ Parent Directory, divorce.rar]",,1. If one of us apologizes when our discussion...,[N/A],[This allows for the sharing and adaptation of...,,,True,54.0,1,170,429,,,,,Life,https://archive-beta.ics.uci.edu/ml/datasets/d...,"[{'title': 'Paper', 'authors': [], 'year': '19..."
295,Participants completed the Personal Informatio...,Provide all relevant information about your da...,Classification,,,"[Mustafa Yntem, Kemal Adem, Serhat Klarslan]","Multivariate, Univariate",,2019-07-24,"[ Parent Directory, divorce.rar]",,1. If one of us apologizes when our discussion...,[N/A],[This allows for the sharing and adaptation of...,,,True,54.0,1,170,430,,,,,Life,https://archive-beta.ics.uci.edu/ml/datasets/d...,"[{'title': 'Paper', 'authors': [], 'year': '19..."


In [12]:
#check if they are identical
uci1 = uci_test_dupes.iloc[0]
uci2 = uci_test_dupes.iloc[1]
print(uci1.equals(uci2))

False


In [13]:
#find where instances differ
uci_diffs = uci1 == uci2
uci_diffs

abstract                               True
additional_info                        True
associated_tasks                       True
citation_requests/acknowledgements     True
creation_purpose                       True
creators                               True
dataset_characteristics                True
doi                                   False
donation_date                          True
files                                  True
funders                                True
instances_represent                    True
keywords                               True
license                                True
link_date                             False
missing_value_placeholder              True
missing_values                         True
num_attributes                         True
num_citations                          True
num_instances                          True
num_views                             False
preprocessing_done                     True
previous_tasks                  

In [14]:
#keep only the fields where the two rows differ
index_diffs = uci_diffs[~uci_diffs]
index_diffs

doi          False
link_date    False
num_views    False
dtype: bool

In [15]:
index_diffs_str = index_diffs.index.tolist()
index_diffs_str

['doi', 'link_date', 'num_views']

In [16]:
uci1[index_diffs_str]

doi          None
link_date    None
num_views     429
Name: 166, dtype: object

In [17]:
uci2[index_diffs_str]

doi          None
link_date    None
num_views     430
Name: 295, dtype: object

In [18]:
#doi and link_date are None in both rows (missing values do not compare as equal), so num_views is the only real difference
#the off-by-one view count may be an artifact of web scraping
#decision: keep the row with more views, assuming that it is more current
#(the difference in views is small, so downstream effects on calculations will be minimal)
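
The element-wise comparison above reports `doi` and `link_date` as different even though both rows hold `None` there. A comparison that treats two missing values as equal isolates the real differences in one step. This is a minimal sketch on two abbreviated, made-up rows that mirror the duplicate pair; `row_a`, `row_b`, and `real_diffs` are illustrative names only.

```python
import pandas as pd

# Two illustrative rows that mirror the duplicate pair (values abbreviated).
row_a = pd.Series({"abstract": "same text", "doi": None, "num_views": 429})
row_b = pd.Series({"abstract": "same text", "doi": None, "num_views": 430})

# A field counts as equal when the values match or when both are missing.
both_missing = row_a.isna() & row_b.isna()
same = (row_a == row_b) | both_missing

# Fields that genuinely differ between the two rows.
real_diffs = same[~same].index.tolist()
print(real_diffs)
```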

In [19]:
#subset to view only the duplicate ids
dupes_all = ids.value_counts().to_frame()
dupes_all = dupes_all[dupes_all['url'] == 2]
dupes_all

Unnamed: 0,url
https://archive-beta.ics.uci.edu/ml/datasets/divorce+predictors+data+set,2
https://archive-beta.ics.uci.edu/ml/datasets/air+quality,2
https://archive-beta.ics.uci.edu/ml/datasets/stock+keeping+units,2
https://archive-beta.ics.uci.edu/ml/datasets/lastfm+asia+social+network,2
https://archive-beta.ics.uci.edu/ml/datasets/wave+energy+converters,2
https://archive-beta.ics.uci.edu/ml/datasets/iranian+churn+dataset,2
https://archive-beta.ics.uci.edu/ml/datasets/unmanned+aerial+vehicle+uav+intrusion+detection,2
https://archive-beta.ics.uci.edu/ml/datasets/bar+crawl+detecting+heavy+drinking,2


In [20]:
dupes_all_ids = dupes_all.index.to_list()
dupes_all_ids

['https://archive-beta.ics.uci.edu/ml/datasets/divorce+predictors+data+set',
 'https://archive-beta.ics.uci.edu/ml/datasets/air+quality',
 'https://archive-beta.ics.uci.edu/ml/datasets/stock+keeping+units',
 'https://archive-beta.ics.uci.edu/ml/datasets/lastfm+asia+social+network',
 'https://archive-beta.ics.uci.edu/ml/datasets/wave+energy+converters',
 'https://archive-beta.ics.uci.edu/ml/datasets/iranian+churn+dataset',
 'https://archive-beta.ics.uci.edu/ml/datasets/unmanned+aerial+vehicle+uav+intrusion+detection',
 'https://archive-beta.ics.uci.edu/ml/datasets/bar+crawl+detecting+heavy+drinking']

In [21]:
#confirm that number of views is off by 1 for these duplicates
dupes_df = df[df.url.isin(dupes_all_ids)].sort_values(['url', 'num_views'])
dupes_df[['url', 'num_views']]

Unnamed: 0,url,num_views
120,https://archive-beta.ics.uci.edu/ml/datasets/a...,1450
277,https://archive-beta.ics.uci.edu/ml/datasets/a...,1451
143,https://archive-beta.ics.uci.edu/ml/datasets/b...,656
386,https://archive-beta.ics.uci.edu/ml/datasets/b...,657
166,https://archive-beta.ics.uci.edu/ml/datasets/d...,429
295,https://archive-beta.ics.uci.edu/ml/datasets/d...,430
160,https://archive-beta.ics.uci.edu/ml/datasets/i...,488
389,https://archive-beta.ics.uci.edu/ml/datasets/i...,489
323,https://archive-beta.ics.uci.edu/ml/datasets/l...,124
526,https://archive-beta.ics.uci.edu/ml/datasets/l...,125


In [22]:
#sort descending by num_views, group by url, and select the first row (most views) within each group
df_use = df.sort_values(['num_views'], ascending = False).groupby('url').nth(0).reset_index()
df_use[['url', 'num_views']]

Unnamed: 0,url,num_views
0,https://archive-beta.ics.uci.edu/ml/datasets/2...,683
1,https://archive-beta.ics.uci.edu/ml/datasets/3...,350
2,https://archive-beta.ics.uci.edu/ml/datasets/3...,502
3,https://archive-beta.ics.uci.edu/ml/datasets/a...,335
4,https://archive-beta.ics.uci.edu/ml/datasets/a...,117
...,...,...
578,https://archive-beta.ics.uci.edu/ml/datasets/y...,42
579,https://archive-beta.ics.uci.edu/ml/datasets/y...,148
580,https://archive-beta.ics.uci.edu/ml/datasets/y...,215
581,https://archive-beta.ics.uci.edu/ml/datasets/z...,24


In [23]:
#check output
df_use[df_use.url.isin(dupes_all_ids)].sort_values(['url', 'num_views'])[['url','num_views']]

Unnamed: 0,url,num_views
17,https://archive-beta.ics.uci.edu/ml/datasets/a...,1451
54,https://archive-beta.ics.uci.edu/ml/datasets/b...,657
160,https://archive-beta.ics.uci.edu/ml/datasets/d...,430
280,https://archive-beta.ics.uci.edu/ml/datasets/i...,489
297,https://archive-beta.ics.uci.edu/ml/datasets/l...,125
498,https://archive-beta.ics.uci.edu/ml/datasets/s...,163
540,https://archive-beta.ics.uci.edu/ml/datasets/u...,308
556,https://archive-beta.ics.uci.edu/ml/datasets/w...,100
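
The shape of the result of `groupby('url').nth(0).reset_index()` depends on the pandas version, because the return shape of `nth` changed in pandas 2.0. A version-independent way to keep the most-viewed row per URL is to sort and then drop duplicates. The sketch below uses a small made-up frame; `demo` and `deduped` are illustrative names, and this is an alternative formulation rather than the cell that produced the output above.

```python
import pandas as pd

demo = pd.DataFrame({
    "url": ["u/air", "u/iris", "u/air", "u/wine"],
    "num_views": [1450, 120760, 1451, 62072],
})

deduped = (
    demo.sort_values("num_views", ascending=False)   # most views first
        .drop_duplicates(subset="url", keep="first")  # keep top row per url
        .sort_values("url")                           # order by url
        .reset_index(drop=True)
)
```

Sorting by `url` at the end gives the same URL ordering as the table above; drop that step if row order does not matter downstream.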


In [24]:
#for matching other notebooks, rename 'df_use' back to 'df'
df = df_use

In [25]:
len(df)

583

In [26]:
ids = df.UCIDatasetsCrosswalk.unique_identifier
print(f'There are {len(ids)} items in the UCI extract after removing duplicates, with {ids.nunique()} unique IDs.')

There are 583 items in the UCI extract after removing duplicates, with 583 unique IDs.


In [27]:
#For UCI, each row is an object; its component files are stored as a list in a single field of that row

## 2. See the "Licenses offered" tab in /Working documents/Licenses sheet for list of licenses by repo.

## Given the type(s) of license(s) offered by the repo, how many of each type is assigned?
**Property:** License

In [28]:
licenses = df.UCIDatasetsCrosswalk.license
licenses

0      [This allows for the sharing and adaptation of...
1      [This allows for the sharing and adaptation of...
2      [This allows for the sharing and adaptation of...
3      [This allows for the sharing and adaptation of...
4      [This allows for the sharing and adaptation of...
                             ...                        
578    [This allows for the sharing and adaptation of...
579    [This allows for the sharing and adaptation of...
580    [This allows for the sharing and adaptation of...
581    [This allows for the sharing and adaptation of...
582    [This allows for the sharing and adaptation of...
Name: license, Length: 583, dtype: object

In [29]:
license_counts = licenses.value_counts().to_frame()
license_counts['percent'] = license_counts['license']/len(licenses) * 100
license_counts

Unnamed: 0,license,percent
"[This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.]",576,98.799314
[Go to linked dataset for licensing information],5,0.857633
"[Additional Information, This file is part of APS Failure and Operational Data for Scania Trucks.\n\nCopyright (c) <2016> <Scania CV AB>\n\nThis program (APS Failure and Operational Data for Scania Trucks) is \nfree software: you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or\n(at your option) any later version.\n\nThis program is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\nGNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program. If not, see <http://www.gnu.org/licenses/>.\n\n------------------------------------------------------------------------\n\n1. Title: APS Failure at Scania Trucks\n\n2. Source Information\n -- Creator: Scania CV AB\n VagnmakarvÃ¤gen 1 \n 151 32 SÃ¶dertÃ¤lje \n Stockholm\n Sweden \n -- Donor: Tony Lindgren (tony@dsv.su.se) and Jonas Biteus (jonas.biteus@scania.com)\n -- Date: September, 2016\n \n3. Past Usage:\n Industrial Challenge 2016 at The 15th International Symposium on Intelligent Data Analysis (IDA) \n -- Results: \n The top three contestants | Score | Number of Type 1 faults | Number of Type 2 faults\n ------------------------------------------------------------------------------------------------------------------------------------\n Camila F. Costa and Mario A. Nascimento | 9920 | 542 | 9\n Christopher Gondek, Daniel Hafner and Oliver R. Sampson | 10900 | 490 | 12\n Sumeet Garnaik, Sushovan Das, Rama Syamala Sreepada and Bidyut Kr. Patra | 11480 | 398 | 15\n\n4. Relevant Information:\n -- Introduction\n The dataset consists of data collected from heavy Scania \n trucks in everyday usage. 
The system in focus is the \n Air Pressure system (APS) which generates pressurised \n air that are utilized in various functions in a truck, \n such as braking and gear changes. The datasets' \n positive class consists of component failures \n for a specific component of the APS system. \n The negative class consists of trucks with failures \n for components not related to the APS. The data consists \n of a subset of all available data, selected by experts. \n\n -- Challenge metric \n\n Cost-metric of miss-classification:\n\n Predicted class | True class |\n | pos | neg |\n -----------------------------------------\n pos | - | Cost_1 |\n -----------------------------------------\n neg | Cost_2 | - |\n -----------------------------------------\n Cost_1 = 10 and cost_2 = 500\n\n The total cost of a prediction model the sum of 'Cost_1' \n multiplied by the number of Instances with type 1 failure \n and 'Cost_2' with the number of instances with type 2 failure, \n resulting in a 'Total_cost'.\n\n In this case Cost_1 refers to the cost that an unnessecary \n check needs to be done by an mechanic at an workshop, while \n Cost_2 refer to the cost of missing a faulty truck, \n which may cause a breakdown.\n\n Total_cost = Cost_1*No_Instances + Cost_2*No_Instances.\n\n5. Number of Instances: \n The training set contains 60000 examples in total in which \n 59000 belong to the negative class and 1000 positive class. \n The test set contains 16000 examples. \n\n6. Number of Attributes: 171 \n\n7. Attribute Information:\n The attribute names of the data have been anonymized for \n proprietary reasons. It consists of both single numerical \n counters and histograms consisting of bins with different \n conditions. Typically the histograms have open-ended \n conditions at each end. 
For example if we measuring \n the ambient temperature 'T' then the histogram could \n be defined with 4 bins where: \n\n bin 1 collect values for temperature T < -20\n bin 2 collect values for temperature T >= -20 and T < 0 \n bin 3 collect values for temperature T >= 0 and T < 20 \n bin 4 collect values for temperature T > 20 \n\n | b1 | b2 | b3 | b4 | \n ----------------------------- \n -20 0 20\n\n The attributes are as follows: class, then \n anonymized operational data. The operational data have \n an identifier and a bin id, like 'Identifier_Bin'.\n In total there are 171 attributes, of which 7 are \n histogram variabels. Missing values are denoted by 'na'.]",2,0.343053


## 3. What is the mean number of characters (excluding whitespaces, if possible) per object?
**Property:** Description
**Related function:** `mean_characters`

In [30]:
descriptions = df.UCIDatasetsCrosswalk.description
descriptions

0      Measurement of the S21,consists of 10 sweeps, ...
1      3D road network with highly accurate elevation...
2      The first realistic and public dataset with ra...
3      Mainly from Project Gutenberg, we combine Upan...
4      This data set compromises the metadata for the...
                             ...                        
578    The datasets are taken from top 2 Indian cooki...
579    This dataset contains about 120k instances, ea...
580    It is a public set of comments collected for s...
581                  It was collected for CAD diagnosis.
582                     Artificial, 7 classes of animals
Name: abstract, Length: 583, dtype: object

In [31]:
print(f'{analysis.mean_characters(descriptions)} mean characters')

139.82547993019196 mean characters
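
The source of `analysis.mean_characters` is not shown in this walkthrough. As a rough sketch of the calculation the question describes, the function below counts non-whitespace characters per description and averages over the non-missing entries. `mean_characters_sketch` is a hypothetical name, and both the whitespace stripping and the handling of missing values are assumptions that may not match the project helper.

```python
import re

import pandas as pd


def mean_characters_sketch(texts):
    """Mean count of non-whitespace characters over non-missing entries."""
    present = texts.dropna().astype(str)
    # Remove all whitespace before measuring length (assumed behavior).
    lengths = present.apply(lambda text: len(re.sub(r"\s+", "", text)))
    return lengths.mean()


# "a b" has 2 non-whitespace characters, "cd e" has 3; None is skipped.
demo_mean = mean_characters_sketch(pd.Series(["a b", "cd e", None]))
```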


## 4. What is the mean number of characters (excluding whitespaces, if possible) per object?
**Property:** Methods
**Related function:** `mean_characters`

In [32]:
methods = df.UCIDatasetsCrosswalk.methods
methods

In [33]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.methods)

None


## 5. What are the min and max publication dates for each repo?

## How many objects were published each year for each repo?
**Property:** Publication date

In [34]:
publication_dates = df.UCIDatasetsCrosswalk.publication_date
publication_dates

0      2018-11-30
1      2013-04-16
2      2019-08-15
3      2019-12-24
4      2014-07-30
          ...    
578    2019-07-03
579    2013-10-16
580    2017-03-26
581    2017-11-17
582    1990-05-15
Name: donation_date, Length: 583, dtype: object

In [35]:
#there are missing values, need to remove those
sum(publication_dates.isna())

34

In [36]:
publication_dates = publication_dates.dropna()

In [37]:
#need to coerce to recognize as date
publication_dates = pd.to_datetime(publication_dates)

In [38]:
#min and max publication date
publication_dates.min(), publication_dates.max()

(Timestamp('1900-05-30 00:00:00'), Timestamp('2021-07-05 00:00:00'))

In [39]:
#objects per year
publication_dates.dt.year.value_counts().sort_index()

1900     1
1954     1
1987     8
1988    15
1989     7
1990    11
1991     4
1992     9
1993     7
1994    10
1995     7
1996     5
1997     4
1998     9
1999    17
2000     3
2001     4
2002     1
2003     1
2006     2
2007     5
2008    17
2009     8
2010    11
2011    18
2012    20
2013    33
2014    41
2015    27
2016    39
2017    38
2018    54
2019    59
2020    46
2021     7
Name: donation_date, dtype: int64
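
The per-year counts skip years with no objects (for example, 2004 and 2005 do not appear above). If the plot should show those years as zero, the counts can be reindexed over the full year range. This is a sketch on made-up dates, with `demo_dates`, `per_year`, and `per_year_filled` as illustrative names; whether to fill gaps is a plotting decision that the notebook itself does not make.

```python
import pandas as pd

demo_dates = pd.to_datetime(
    pd.Series(["2003-01-15", "2006-03-02", "2006-09-30", "2007-05-05"])
)
per_year = demo_dates.dt.year.value_counts().sort_index()

# Build every year between the first and last observed year, then
# insert a count of 0 for years that have no objects.
full_range = list(range(int(per_year.index.min()), int(per_year.index.max()) + 1))
per_year_filled = per_year.reindex(full_range, fill_value=0)
```

With outlying dates such as the 1900 entry above, filling the whole range adds many zero rows, so it may be preferable to restrict the range first.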

In [40]:
#export for plotting
pub_dates_export = publication_dates.dt.year.value_counts().sort_index().to_frame()

In [41]:
#update column names
pub_dates_export_ready = pub_dates_export.reset_index(level=0)
pub_dates_export_ready.columns = ['year', 'count']

In [42]:
#add column with name of repo
pub_dates_export_ready['repo'] = 'uci'
pub_dates_export_ready

Unnamed: 0,year,count,repo
0,1900,1,uci
1,1954,1,uci
2,1987,8,uci
3,1988,15,uci
4,1989,7,uci
5,1990,11,uci
6,1991,4,uci
7,1992,9,uci
8,1993,7,uci
9,1994,10,uci


In [43]:
#export to Figures folder (path built with pathlib so that it also resolves outside Windows)
pub_dates_export_ready.to_csv(Path('..', '..', 'Figures', 'Figure1', 'repository_dates', 'uci_pub_years.csv'))

## 6. What are the unweighted mean, median, and max file sizes among all ingested files?
**Property:** File size
**Related function:** `get_summary_statistics`

We first get the file size attribute using the crosswalk.

In [44]:
file_sizes = df.UCIDatasetsCrosswalk.file_size

In [45]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.file_size)

None


## 7. What are the mean, median, and max number of files per object?
**Property:** URL
**Related function:** `get_summary_statistics`

Objects without files have `None` for this property. Those values are replaced with an empty list so that each such object counts as having zero files.

In [46]:
files = df.UCIDatasetsCrosswalk.url
files

0      [ Parent Directory,  Measurements_Upload_Small...
1           [ Parent Directory,  3D_spatial_network.txt]
2                       [ Parent Directory,  readme.txt]
3           [ Parent Directory,  AsianReligionsData.zip]
4      [ Parent Directory,  [UCI] AAAI-13 Accepted Pa...
                             ...                        
578               [ Parent Directory,  Cooking Data.zip]
579                   [ Parent Directory,  dir_data.tar]
580    [ Parent Directory,  YouTube-Spam-Collection-v...
581    [ Parent Directory,  Z-Alizadeh sani dataset.x...
582    [ Parent Directory,  Index,  zoo.data,  zoo.na...
Name: files, Length: 583, dtype: object

In [47]:
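#replace None with a placeholder string (fillna does not accept a list as the fill value);
#the next cell converts the placeholder into a real empty list
files = files.fillna('[]')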
files

0      [ Parent Directory,  Measurements_Upload_Small...
1           [ Parent Directory,  3D_spatial_network.txt]
2                       [ Parent Directory,  readme.txt]
3           [ Parent Directory,  AsianReligionsData.zip]
4      [ Parent Directory,  [UCI] AAAI-13 Accepted Pa...
                             ...                        
578               [ Parent Directory,  Cooking Data.zip]
579                   [ Parent Directory,  dir_data.tar]
580    [ Parent Directory,  YouTube-Spam-Collection-v...
581    [ Parent Directory,  Z-Alizadeh sani dataset.x...
582    [ Parent Directory,  Index,  zoo.data,  zoo.na...
Name: files, Length: 583, dtype: object

In [48]:
#convert every non-list value (including the '[]' placeholder string) to a real empty list
files = files.apply(lambda d: d if isinstance(d, list) else [])
files

0      [ Parent Directory,  Measurements_Upload_Small...
1           [ Parent Directory,  3D_spatial_network.txt]
2                       [ Parent Directory,  readme.txt]
3           [ Parent Directory,  AsianReligionsData.zip]
4      [ Parent Directory,  [UCI] AAAI-13 Accepted Pa...
                             ...                        
578               [ Parent Directory,  Cooking Data.zip]
579                   [ Parent Directory,  dir_data.tar]
580    [ Parent Directory,  YouTube-Spam-Collection-v...
581    [ Parent Directory,  Z-Alizadeh sani dataset.x...
582    [ Parent Directory,  Index,  zoo.data,  zoo.na...
Name: files, Length: 583, dtype: object

In [49]:
files_counts = files.apply(len)
files_counts

0      2
1      2
2      2
3      2
4      2
      ..
578    2
579    2
580    2
581    2
582    4
Name: files, Length: 583, dtype: int64

In [50]:
analysis.get_summary_statistics(files_counts)

{'mean': 3.334476843910806, 'median': 2.0, 'max': 35}
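
The output above shows that `analysis.get_summary_statistics` returns a dictionary with `mean`, `median`, and `max` keys. Its source is not shown here, so the following is only a sketch of an equivalent calculation; `summary_statistics_sketch` is a hypothetical name, and dropping missing values first is an assumption.

```python
import pandas as pd


def summary_statistics_sketch(values):
    """Return mean, median, and max of the non-missing values."""
    values = pd.Series(values).dropna()
    return {
        "mean": values.mean(),
        "median": values.median(),
        "max": values.max(),
    }


demo_stats = summary_statistics_sketch([2, 2, 2, 4, 35])
```

The file lists counted above appear to include directory-listing entries such as `Parent Directory` and `Index`; whether those should count as files is an analysis decision that this sketch does not address.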

## 8. What are the mean, median, and max total dataset size (summed across all files) per object?
**Property:** Dataset size
**Related function:** `get_summary_statistics`

In [51]:
dataset_sizes = df.UCIDatasetsCrosswalk.dataset_size
dataset_sizes

In [52]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.dataset_size)

None


## 9. How many of each scientific domain are assigned?
**Property:** Domain
**Related function:** `domains.value_counts()`

In [53]:
domains = df.UCIDatasetsCrosswalk.domain
domains

0      Computer
1      Computer
2      Computer
3        Social
4      Computer
         ...   
578    Computer
579    Computer
580    Computer
581        Life
582        Life
Name: subject_area, Length: 583, dtype: object

In [54]:
domains_counts = domains.value_counts().to_frame()
domains_counts['percent'] = domains_counts['subject_area']/len(domains) * 100
domains_counts

Unnamed: 0,subject_area,percent
Computer,210,36.020583
Life,130,22.298456
Other,78,13.379074
Physical,57,9.777015
Business,40,6.861063
Social,37,6.346484
Game,11,1.886792
Financial,5,0.857633
,4,0.686106
Computer Science,3,0.51458


## 10. What is the mean number of characters (excluding whitespaces, if possible) per object?
**Property:** Technical details
**Related function:** `mean_characters`

In [55]:
# "usage notes" is not in crosswalk

## 11-13. What are the mean and median total number of keyword terms per object, after merging results for Keyword, Geographic keyword, and Scientific keyword?
**Property:** Keyword

In [56]:
print(df.UCIDatasetsCrosswalk.keyword)

0                                                  [N/A]
1                                                  [N/A]
2                                                  [N/A]
3                                                  [N/A]
4      [What do the instances that comprise the datas...
                             ...                        
578                                                [N/A]
579                                                [N/A]
580                                                [N/A]
581                                                [N/A]
582                                                [N/A]
Name: keywords, Length: 583, dtype: object


In [57]:
print(df.UCIDatasetsCrosswalk.geographic_keyword)

None


In [58]:
print(df.UCIDatasetsCrosswalk.scientific_keyword)

None


In [59]:
keywords = df.UCIDatasetsCrosswalk.keyword
keywords

0                                                  [N/A]
1                                                  [N/A]
2                                                  [N/A]
3                                                  [N/A]
4      [What do the instances that comprise the datas...
                             ...                        
578                                                [N/A]
579                                                [N/A]
580                                                [N/A]
581                                                [N/A]
582                                                [N/A]
Name: keywords, Length: 583, dtype: object

In [60]:
#some keyword entries are messy (free text rather than short terms)
#for consistency with other repos, each list element is counted as a single keyword
#example below

In [61]:
keywords[4]

['What do the instances that comprise the dataset represent?',
 'Title: Free text; title of the paper\nKeywords: Free text; author-generated keywords\nTopics: Categorical; author-selected, low-level keywords from conference-provided list\nHigh-level keywords: Categorical; author-selected, high-level keywords from conference-provided list\n']

In [62]:
#remove N/A values from lists
keywords_clean = keywords.apply(lambda el: [x for x in el if x != 'N/A'])
keywords_clean

0                                                     []
1                                                     []
2                                                     []
3                                                     []
4      [What do the instances that comprise the datas...
                             ...                        
578                                                   []
579                                                   []
580                                                   []
581                                                   []
582                                                   []
Name: keywords, Length: 583, dtype: object

In [63]:
keywords_counts = keywords_clean.apply(len)
keywords_counts

0      0
1      0
2      0
3      0
4      2
      ..
578    0
579    0
580    0
581    0
582    0
Name: keywords, Length: 583, dtype: int64

In [64]:
analysis.get_summary_statistics(keywords_counts)

{'mean': 0.0823327615780446, 'median': 0.0, 'max': 4}

## 14. Who are the most common funding agencies for each repo? What are the object counts per agency?
**Property:** Funding Agency

In [65]:
funders = df.UCIDatasetsCrosswalk.funding_agency
funders

In [66]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.funding_agency)

None


## 15. What are the mean, median, and max number of Views per object?
**Property:** Views
**Related function:** `get_summary_statistics`

In [67]:
views = df.UCIDatasetsCrosswalk.views
views

0        683
1        350
2        502
3        335
4        117
       ...  
578       42
579      148
580      215
581       24
582    24200
Name: num_views, Length: 583, dtype: int64

In [68]:
#get summary statistics
analysis.get_summary_statistics(views)

{'mean': 4040.21269296741, 'median': 149.0, 'max': 120760}

## 16. What are the mean, median, and max (total) number of downloads per object?
**Property:** Downloads
**Related function:** `get_summary_statistics`

In [69]:
downloads = df.UCIDatasetsCrosswalk.downloads
downloads

In [70]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.downloads)

None


## 17. What are the mean, median, and max Citation counts per object?
**Property:** Citation count
**Related function:** `get_summary_statistics`

In [71]:
citation_count = df.UCIDatasetsCrosswalk.citation_count
citation_count

0       0
1       0
2       0
3       0
4       0
       ..
578     0
579     0
580     0
581     0
582    29
Name: num_citations, Length: 583, dtype: int64

In [72]:
#get summary statistics
analysis.get_summary_statistics(citation_count)

{'mean': 3.7993138936535162, 'median': 0.0, 'max': 351}

## 18. How many objects contain each given resource type?
**Property:** Resource type

In [73]:
resource_types = df.UCIDatasetsCrosswalk.resource_type
resource_types

In [74]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.resource_type)

None


## 19. How many objects contain each type of file extension given?
**Property:** File Extension
**Related function:** `get_file_extensions`

In [75]:
files = df.UCIDatasetsCrosswalk.file_extension
files

0      [ Parent Directory,  Measurements_Upload_Small...
1           [ Parent Directory,  3D_spatial_network.txt]
2                       [ Parent Directory,  readme.txt]
3           [ Parent Directory,  AsianReligionsData.zip]
4      [ Parent Directory,  [UCI] AAAI-13 Accepted Pa...
                             ...                        
578               [ Parent Directory,  Cooking Data.zip]
579                   [ Parent Directory,  dir_data.tar]
580    [ Parent Directory,  YouTube-Spam-Collection-v...
581    [ Parent Directory,  Z-Alizadeh sani dataset.x...
582    [ Parent Directory,  Index,  zoo.data,  zoo.na...
Name: files, Length: 583, dtype: object

In [76]:
#replace any None values with empty list
files_clean = files.apply(lambda d: d if isinstance(d, list) else [])
files_clean

0      [ Parent Directory,  Measurements_Upload_Small...
1           [ Parent Directory,  3D_spatial_network.txt]
2                       [ Parent Directory,  readme.txt]
3           [ Parent Directory,  AsianReligionsData.zip]
4      [ Parent Directory,  [UCI] AAAI-13 Accepted Pa...
                             ...                        
578               [ Parent Directory,  Cooking Data.zip]
579                   [ Parent Directory,  dir_data.tar]
580    [ Parent Directory,  YouTube-Spam-Collection-v...
581    [ Parent Directory,  Z-Alizadeh sani dataset.x...
582    [ Parent Directory,  Index,  zoo.data,  zoo.na...
Name: files, Length: 583, dtype: object

In [77]:
#since interested in per object, need to group by ID
files_ids = pd.concat([ids, files_clean], axis = 1)
files_ids

Unnamed: 0,url,files
0,https://archive-beta.ics.uci.edu/ml/datasets/2...,"[ Parent Directory, Measurements_Upload_Small..."
1,https://archive-beta.ics.uci.edu/ml/datasets/3...,"[ Parent Directory, 3D_spatial_network.txt]"
2,https://archive-beta.ics.uci.edu/ml/datasets/3...,"[ Parent Directory, readme.txt]"
3,https://archive-beta.ics.uci.edu/ml/datasets/a...,"[ Parent Directory, AsianReligionsData.zip]"
4,https://archive-beta.ics.uci.edu/ml/datasets/a...,"[ Parent Directory, [UCI] AAAI-13 Accepted Pa..."
...,...,...
578,https://archive-beta.ics.uci.edu/ml/datasets/y...,"[ Parent Directory, Cooking Data.zip]"
579,https://archive-beta.ics.uci.edu/ml/datasets/y...,"[ Parent Directory, dir_data.tar]"
580,https://archive-beta.ics.uci.edu/ml/datasets/y...,"[ Parent Directory, YouTube-Spam-Collection-v..."
581,https://archive-beta.ics.uci.edu/ml/datasets/z...,"[ Parent Directory, Z-Alizadeh sani dataset.x..."


In [78]:
#make files lists as series with index as object id
files_use = files_clean.set_axis(files_ids['url'])
files_use

url
https://archive-beta.ics.uci.edu/ml/datasets/2+4+ghz+indoor+channel+measurements                      [ Parent Directory,  Measurements_Upload_Small...
https://archive-beta.ics.uci.edu/ml/datasets/3d+road+network+north+jutland+denmark                         [ Parent Directory,  3D_spatial_network.txt]
https://archive-beta.ics.uci.edu/ml/datasets/3w+dataset                                                                [ Parent Directory,  readme.txt]
https://archive-beta.ics.uci.edu/ml/datasets/a+study+of+asian+religious+and+biblical+texts                 [ Parent Directory,  AsianReligionsData.zip]
https://archive-beta.ics.uci.edu/ml/datasets/aaai+2013+accepted+papers                                [ Parent Directory,  [UCI] AAAI-13 Accepted Pa...
                                                                                                                            ...                        
https://archive-beta.ics.uci.edu/ml/datasets/youtube+cookery+channels+viewers+commen

The following code separates the full file extensions (all dot-separated values after the first dot) for a list of files and creates a set, allowing us to only look at the number of objects that contain a given extension.

In [79]:
files_extension_set = files_use.apply(
    lambda file_list: list({''.join(Path(file).suffixes) for file in file_list})
)
files_extension_set

url
https://archive-beta.ics.uci.edu/ml/datasets/2+4+ghz+indoor+channel+measurements                               [, .zip]
https://archive-beta.ics.uci.edu/ml/datasets/3d+road+network+north+jutland+denmark                             [, .txt]
https://archive-beta.ics.uci.edu/ml/datasets/3w+dataset                                                        [, .txt]
https://archive-beta.ics.uci.edu/ml/datasets/a+study+of+asian+religious+and+biblical+texts                     [, .zip]
https://archive-beta.ics.uci.edu/ml/datasets/aaai+2013+accepted+papers                                         [, .csv]
                                                                                                            ...        
https://archive-beta.ics.uci.edu/ml/datasets/youtube+cookery+channels+viewers+comments+in+hinglish             [, .zip]
https://archive-beta.ics.uci.edu/ml/datasets/youtube+multiview+video+games+dataset                             [, .tar]
https://archive-beta.ics.uci.edu/ml/

In [80]:
#note that items without an extension are not dropped; they yield an empty-string extension ('')
#each object for UCI looks to have a "Parent Directory" item (overarching folder name), which shows up as the blank entry in each list
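
To make the extension logic concrete, here is a minimal sketch of how `Path(...).suffixes` behaves on a few file names. The names other than `Parent Directory` and `zoo.data` are made up for illustration, and `full_extension` is a hypothetical helper, not part of `utils`.

```python
from pathlib import Path

def full_extension(name: str) -> str:
    """Join every dot-separated suffix after the first dot into one string."""
    return ''.join(Path(name.strip()).suffixes)

# Leading spaces mimic the scraped UCI file names.
samples = [' Parent Directory', ' zoo.data', ' archive.tar.gz', ' report.v1.2.csv']
extensions = {name.strip(): full_extension(name) for name in samples}

for name, ext in extensions.items():
    print(f'{name!r} -> {ext!r}')
```

Names with dots in a version number, such as the last sample, produce a long compound "extension", which is one reason the counts below are described as estimates that need further cleanup.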

Confirm accuracy

In [81]:
files_ids[files_ids['url'] == 'https://archive-beta.ics.uci.edu/ml/datasets/2+4+ghz+indoor+channel+measurements']['files'].tolist()

[[' Parent Directory', ' Measurements_Upload_Smaller.zip']]

In [82]:
files_extension_set.loc['https://archive-beta.ics.uci.edu/ml/datasets/2+4+ghz+indoor+channel+measurements']

['', '.zip']

In [83]:
#expand so each unique extension within an object is its own row
files_ext_ids = files_extension_set.explode().to_frame()
files_ext_ids.head(20)

Unnamed: 0_level_0,files
url,Unnamed: 1_level_1
https://archive-beta.ics.uci.edu/ml/datasets/2+4+ghz+indoor+channel+measurements,
https://archive-beta.ics.uci.edu/ml/datasets/2+4+ghz+indoor+channel+measurements,.zip
https://archive-beta.ics.uci.edu/ml/datasets/3d+road+network+north+jutland+denmark,
https://archive-beta.ics.uci.edu/ml/datasets/3d+road+network+north+jutland+denmark,.txt
https://archive-beta.ics.uci.edu/ml/datasets/3w+dataset,
https://archive-beta.ics.uci.edu/ml/datasets/3w+dataset,.txt
https://archive-beta.ics.uci.edu/ml/datasets/a+study+of+asian+religious+and+biblical+texts,
https://archive-beta.ics.uci.edu/ml/datasets/a+study+of+asian+religious+and+biblical+texts,.zip
https://archive-beta.ics.uci.edu/ml/datasets/aaai+2013+accepted+papers,
https://archive-beta.ics.uci.edu/ml/datasets/aaai+2013+accepted+papers,.csv


In [84]:
#get ESTIMATE of most common - needs some clean up of extensions

#group by extension type to count how many objects have each file type
ext_grouped = files_ext_ids.groupby('files').value_counts().to_frame().sort_values(0, ascending = False)
ext_grouped['percent'] = round(ext_grouped[0]/len(files)*100)
ext_grouped.head(30)

Unnamed: 0_level_0,0,percent
files,Unnamed: 1_level_1,Unnamed: 2_level_1
,574,98.0
.zip,190,33.0
.names,128,22.0
.data,105,18.0
.csv,64,11.0
.txt,43,7.0
.html,31,5.0
.data.html,25,4.0
.rar,23,4.0
.tar.gz,19,3.0


In [85]:
#The blank extension (574 objects) comes from 'Parent Directory' and/or 'Index' being included in the files list for objects.
#Since these items do not have a file extension, they appear as a blank file extension in this table.
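
If the blank entries are not wanted in the per-object counts, one option is to filter them out before counting. The sketch below uses toy extension lists (made up for illustration) and is only one possible cleanup, not the approach used for the exported figures.

```python
import pandas as pd

# Toy per-object extension lists; '' stands for items like 'Parent Directory'.
ext_sets = pd.Series(
    [['', '.zip'], ['', '.txt'], ['', '.zip', '.csv'], []],
    index=['obj_a', 'obj_b', 'obj_c', 'obj_d'],
)

# One row per (object, extension); empty lists become NaN after explode().
exploded = ext_sets.explode()

# Keep only real extensions: drop NaN rows and blank strings.
exploded = exploded[exploded.notna() & (exploded != '')]

# Number of objects containing each extension, and share of all objects.
objects_per_ext = exploded.value_counts()
percent = (objects_per_ext / len(ext_sets) * 100).round()

print(pd.DataFrame({'objects': objects_per_ext, 'percent': percent}))
```

Dividing by `len(ext_sets)` keeps objects with no files in the denominator, matching the `len(files)` denominator used above.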

In [86]:
ext_grouped[ext_grouped['percent'] >= 5]

Unnamed: 0_level_0,0,percent
files,Unnamed: 1_level_1,Unnamed: 2_level_1
,574,98.0
.zip,190,33.0
.names,128,22.0
.data,105,18.0
.csv,64,11.0
.txt,43,7.0
.html,31,5.0


In [87]:
#export for further cleaning, refining estimates, and plotting

In [88]:
#reset index and update column names
ext_grouped_ready = files_ext_ids.reset_index(level=0)
ext_grouped_ready.columns = ['index', 'files']

#add column with name of repo
ext_grouped_ready['repo'] = 'uci'

ext_grouped_ready.head(10)

Unnamed: 0,index,files,repo
0,https://archive-beta.ics.uci.edu/ml/datasets/2...,,uci
1,https://archive-beta.ics.uci.edu/ml/datasets/2...,.zip,uci
2,https://archive-beta.ics.uci.edu/ml/datasets/3...,,uci
3,https://archive-beta.ics.uci.edu/ml/datasets/3...,.txt,uci
4,https://archive-beta.ics.uci.edu/ml/datasets/3...,,uci
5,https://archive-beta.ics.uci.edu/ml/datasets/3...,.txt,uci
6,https://archive-beta.ics.uci.edu/ml/datasets/a...,,uci
7,https://archive-beta.ics.uci.edu/ml/datasets/a...,.zip,uci
8,https://archive-beta.ics.uci.edu/ml/datasets/a...,,uci
9,https://archive-beta.ics.uci.edu/ml/datasets/a...,.csv,uci


In [89]:
#export to Figures folder (path built with pathlib so it also works outside Windows)
ext_grouped_ready.to_csv(Path('..') / '..' / 'Figures' / 'Figure2' / 'file_ext_data' / 'uci_extensions.csv')

## 19.5 How many files of each type of file extension are present?
**Property:** File extension

In [90]:
#pick up from files_clean
files_clean

0      [ Parent Directory,  Measurements_Upload_Small...
1           [ Parent Directory,  3D_spatial_network.txt]
2                       [ Parent Directory,  readme.txt]
3           [ Parent Directory,  AsianReligionsData.zip]
4      [ Parent Directory,  [UCI] AAAI-13 Accepted Pa...
                             ...                        
578               [ Parent Directory,  Cooking Data.zip]
579                   [ Parent Directory,  dir_data.tar]
580    [ Parent Directory,  YouTube-Spam-Collection-v...
581    [ Parent Directory,  Z-Alizadeh sani dataset.x...
582    [ Parent Directory,  Index,  zoo.data,  zoo.na...
Name: files, Length: 583, dtype: object

In [91]:
#expand so each file within object is own row
files_all = files_clean.explode()
files_all

0                      Parent Directory
0       Measurements_Upload_Smaller.zip
1                      Parent Directory
1                3D_spatial_network.txt
2                      Parent Directory
                     ...               
581        Z-Alizadeh sani dataset.xlsx
582                    Parent Directory
582                               Index
582                            zoo.data
582                           zoo.names
Name: files, Length: 1953, dtype: object

In [92]:
files_ext_use = files_all.to_frame()
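#coerce to string before removing 'Index' and 'Parent Directory'
#(likely needed because exploding an empty file list yields NaN, and .str.contains returns NaN for those rows, which breaks the boolean masks below)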
files_ext_use['files'] = files_ext_use['files'].astype(str)
files_ext_use

Unnamed: 0,files
0,Parent Directory
0,Measurements_Upload_Smaller.zip
1,Parent Directory
1,3D_spatial_network.txt
2,Parent Directory
...,...
581,Z-Alizadeh sani dataset.xlsx
582,Parent Directory
582,Index
582,zoo.data


In [93]:
#remove rows with 'Parent Directory' or 'Index' as item
files_ext_clean = files_ext_use
files_ext_clean = files_ext_clean[~files_ext_clean['files'].str.contains("Parent Directory")]
files_ext_clean = files_ext_clean[~files_ext_clean['files'].str.contains("Index")]
files_ext_clean

Unnamed: 0,files
0,Measurements_Upload_Smaller.zip
1,3D_spatial_network.txt
2,readme.txt
3,AsianReligionsData.zip
4,[UCI] AAAI-13 Accepted Papers - Papers.csv
...,...
579,dir_data.tar
580,YouTube-Spam-Collection-v1.zip
581,Z-Alizadeh sani dataset.xlsx
582,zoo.data


In [94]:
#extract as series
files_ext_ready = files_ext_clean['files']
files_ext_ready

0                  Measurements_Upload_Smaller.zip
1                           3D_spatial_network.txt
2                                       readme.txt
3                           AsianReligionsData.zip
4       [UCI] AAAI-13 Accepted Papers - Papers.csv
                          ...                     
579                                   dir_data.tar
580                 YouTube-Spam-Collection-v1.zip
581                   Z-Alizadeh sani dataset.xlsx
582                                       zoo.data
582                                      zoo.names
Name: files, Length: 1279, dtype: object

In [95]:
files_ext_all = files_ext_ready.apply(lambda fn: Path(fn).suffixes)
files_ext_all

0        [.zip]
1        [.txt]
2        [.txt]
3        [.zip]
4        [.csv]
         ...   
579      [.tar]
580      [.zip]
581     [.xlsx]
582     [.data]
582    [.names]
Name: files, Length: 1279, dtype: object

In [96]:
files_ext_all.value_counts().head(10)

[.zip]            202
[.data]           171
[.names]          163
[]                142
[.csv]             69
[.txt]             49
[.html]            37
[.test]            32
[.data, .html]     25
[.rar]             23
Name: files, dtype: int64
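
The substring filters above would also drop any real file whose name happens to contain "Index" or "Parent Directory". A stricter variant, sketched here on made-up file names, strips whitespace and removes only exact matches; it also joins the suffixes into a single string so the counts are keyed by text rather than by list. This is an alternative under those assumptions, not the method that produced the counts above.

```python
import pandas as pd
from pathlib import Path

# Toy exploded file names (illustrative), with the leading spaces seen in the UCI data.
files_all = pd.Series([
    ' Parent Directory', ' Index', ' zoo.data', ' zoo.names',
    ' Index_of_terms.txt', ' data.tar.gz', None,
])

placeholders = ['Parent Directory', 'Index']

# Drop missing rows, normalize whitespace, then remove exact placeholder names only.
names = files_all.dropna().astype(str).str.strip()
real_files = names[~names.isin(placeholders)]

# One extension string per file, then count files per extension.
ext_per_file = real_files.apply(lambda fn: ''.join(Path(fn).suffixes))
ext_counts = ext_per_file.value_counts()

print(ext_counts)
```

Here `Index_of_terms.txt` is kept, whereas `str.contains("Index")` would have removed it.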

## 20. How many objects contain each type of File format given?
**Property:** File format

In [97]:
file_formats = df.UCIDatasetsCrosswalk.file_format
file_formats

In [98]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.file_format)

None


## 21. How many objects contain each type of Media type given?
**Property:** Media type

In [99]:
media_types = df.UCIDatasetsCrosswalk.media_type
media_types

In [100]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.media_type)

None


## 22. a) How many objects report one related resource type, and b) how many objects report each of those types? c) How many objects report multiple related resource types (regardless of which types)?
**Property:** Related resource type

In [101]:
related_resource_types = df.UCIDatasetsCrosswalk.related_resource_type
related_resource_types

In [102]:
#confirm missing for this repo
print(df.UCIDatasetsCrosswalk.related_resource_type)

None



## 23-25. If there is an entry for an object in one of the three properties (Original data URL, Primary manuscript PID/URL, and Related resource identifier) count as Related resources = True and then count the number of objects that return True.
**Property:** Related Resource Identifier

In [103]:
print(df.UCIDatasetsCrosswalk.original_data_url)

None


In [104]:
print(df.UCIDatasetsCrosswalk.primary_manuscript)

0                                                     []
1                                                     []
2                                                     []
3                                                     []
4                                                     []
                             ...                        
578                                                   []
579                                                   []
580                                                   []
581                                                   []
582    [{'title': 'A Hash-based Co-Clustering Algorit...
Name: papers, Length: 583, dtype: object


In [105]:
print(df.UCIDatasetsCrosswalk.related_resource_identifier)

None


In [106]:
#only one of the three related resource fields has content for this repo
related_resources = df.UCIDatasetsCrosswalk.primary_manuscript
related_resources

0                                                     []
1                                                     []
2                                                     []
3                                                     []
4                                                     []
                             ...                        
578                                                   []
579                                                   []
580                                                   []
581                                                   []
582    [{'title': 'A Hash-based Co-Clustering Algorit...
Name: papers, Length: 583, dtype: object

In [107]:
#since interested in number of objects with related resource
#can count up non-empty lists in series
related_resources_use = related_resources[related_resources.str.len() != 0]
related_resources_use

6      [{'title': 'A Comparison of Model Aggregation ...
14     [{'title': 'Determining the Acute Inflammation...
15     [{'title': '($k$,$epsilon$)-Anonymity: $k$-Ano...
17     [{'title': 'Boosting for Dynamical Systems', '...
19     [{'title': 'Paper', 'authors': [], 'year': '19...
                             ...                        
564    [{'title': 'An Efficient Algorithm for Density...
569    [{'title': '$k$-POD: A Method for $k$-Means Cl...
575    [{'title': 'Active Learning to Rank using Pair...
576    [{'title': 'A Siamese Deep Forest', 'link': 'h...
582    [{'title': 'A Hash-based Co-Clustering Algorit...
Name: papers, Length: 102, dtype: object

In [108]:
print(f'There are {len(related_resources_use)} items with related resources.')

There are 102 items with related resources.
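
For a repository where more than one of the three properties is populated, the "Related resources = True" rule could be applied across all of them at once. The sketch below uses a toy frame whose column names mirror the crosswalk properties; `has_entry` is a hypothetical helper and the example URL and title are placeholders.

```python
import pandas as pd

def has_entry(value) -> bool:
    """True when a cell holds a non-empty list/string (or any other non-missing value)."""
    if value is None:
        return False
    if isinstance(value, float) and pd.isna(value):
        return False
    if isinstance(value, (list, tuple, dict, str)):
        return len(value) > 0
    return True

# Toy data: one column per related-resource property.
toy = pd.DataFrame({
    'original_data_url': [None, None, 'https://example.org/raw', None],
    'primary_manuscript': [[], [{'title': 'Paper A'}], [], []],
    'related_resource_identifier': [None, None, None, None],
})

# An object counts as having related resources if any of the three fields has an entry.
has_related = toy.apply(lambda col: col.map(has_entry)).any(axis=1)
n_with_related = int(has_related.sum())

print(f'There are {n_with_related} items with related resources.')
```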


## 23-25. Also, what is the mean number of related resource links per object (again looking at the three properties: Original data URL, Primary manuscript PID/URL, and Related resource identifier)?
**Property:** Related Resource Identifier

We calculate this value as the mean number of links *for objects that have links*; objects with no links are excluded from the average.

In [109]:
#look closer at dictionary content in related_resources_use
related_resources_use[6]

[{'title': 'A Comparison of Model Aggregation Methods for Regression',
  'link': 'https://api.semanticscholar.org/CorpusID:7173976',
  'authors': ['Zafer Barutçuoglu'],
  'year': '2003',
  'published_in': 'icann',
  'journal': 'Lecture Notes in Computer Science'},
 {'title': 'A highly efficient semismooth Newton augmented Lagrangian method for solving Lasso problems',
  'link': 'https://api.semanticscholar.org/CorpusID:17305034',
  'authors': ['Xudong Li', 'Defeng Sun'],
  'year': '2016',
  'published_in': 'siam journal on optimization',
  'journal': 'SIAM Journal on Optimization'},
 {'title': 'Adaptive Linear and Normalized Combination of Radial Basis Function Networks for Function Approximation and Regression',
  'link': 'https://api.semanticscholar.org/CorpusID:119685847',
  'authors': ['Yunfeng Wu',
   'Xin Luo',
   'Fang Zheng',
   'Shanshan Yang',
   'Suxian Cai',
   'Sin Ng'],
  'year': '2014',
  'published_in': 'mathematical problems in engineering',
  'journal': 'Mathematical 

In [110]:
#turn into dataframe
related_resources_df = pd.DataFrame(related_resources_use.tolist())
related_resources_df

Unnamed: 0,0,1,2,3,4
0,{'title': 'A Comparison of Model Aggregation M...,{'title': 'A highly efficient semismooth Newto...,{'title': 'Adaptive Linear and Normalized Comb...,{'title': 'Aggregation of Classifiers: A Justi...,{'title': 'An Interactive Approach to Outlier ...
1,{'title': 'Determining the Acute Inflammations...,,,,
2,"{'title': '($k$,$epsilon$)-Anonymity: $k$-Anon...","{'title': '(α, k)-anonymous data publishing', ...",{'title': '1 Induction of Association Rules: A...,{'title': '2 Using SDR to Formulate Fairness D...,{'title': 'A Combintorial Tree based Frequent ...
3,"{'title': 'Boosting for Dynamical Systems', 'l...",{'title': 'Combined modeling of sparse and den...,{'title': 'Zoom-SVD: Fast and Memory Efficient...,,
4,"{'title': 'Paper', 'authors': [], 'year': '198...",,,,
...,...,...,...,...,...
97,{'title': 'An Efficient Algorithm for Density ...,,,,
98,{'title': '$k$-POD: A Method for $k$-Means Clu...,{'title': '2 9 A ug 2 01 9 Gradient Methods fo...,{'title': 'A Bayesian Approach for Classificat...,{'title': 'A Deep and Tractable Density Estima...,{'title': 'A Discretization Algorithm for Unce...
99,{'title': 'Active Learning to Rank using Pairw...,{'title': 'Anytime Stochastic Gradient Descent...,{'title': 'Closed Form Variational Objectives ...,{'title': 'Controversy Rules - Discovering Reg...,{'title': 'Linear Convergence of Stochastic Fr...
100,"{'title': 'A Siamese Deep Forest', 'link': 'ht...",{'title': 'A new structure entropy of complex ...,{'title': 'AGORAS: A Fast Algorithm for Estima...,{'title': 'AdaCluster : Adaptive Clustering fo...,{'title': 'BAC: A Bagged Associative Classifie...


In [111]:
#for each row, count number of non 'None' cells
related_resource_counts = related_resources_df.notnull().sum(axis=1)
related_resource_counts

0      5
1      1
2      5
3      3
4      1
      ..
97     1
98     5
99     5
100    5
101    5
Length: 102, dtype: int64

In [112]:
print(f'mean {round(related_resource_counts.mean(), 3)} links per object')

mean 3.588 links per object


In [113]:
print(f'median {round(related_resource_counts.median(), 3)} links per object')

median 5.0 links per object
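
The same per-object link counts can be obtained without building the intermediate dataframe, since `.str.len()` returns the length of each list in a series. A minimal sketch on toy data (titles are placeholders):

```python
import pandas as pd

# Toy 'papers' column: each cell is a list of paper dictionaries.
papers = pd.Series([
    [],
    [{'title': 'A'}],
    [{'title': 'B'}, {'title': 'C'}, {'title': 'D'}],
    [],
])

# Length of each list = number of related resource links for that object.
link_counts = papers.str.len()

# Restrict to objects that have at least one link before averaging.
with_links = link_counts[link_counts > 0]
mean_links = with_links.mean()
median_links = with_links.median()

print(f'mean {round(mean_links, 3)} links per object')
print(f'median {round(median_links, 3)} links per object')
```

Note that in the output above the per-object count tops out at 5, which is worth keeping in mind when interpreting the median.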


## 26. How many objects report each relation type? How many objects report multiple relation types, regardless of what those types are?
**Property:** Related resource relation type

In [114]:
relation_type = df.UCIDatasetsCrosswalk.related_resource_relation_type
relation_type

In [115]:
#confirm missing for repo
print(df.UCIDatasetsCrosswalk.related_resource_relation_type)

None


## 27. For repositories that store the full citation in a designated field, how many objects have a populated citation? How many objects have a citation and a URL or other actionable link?
**Property:** Citation

In [116]:
citations = df.UCIDatasetsCrosswalk.citation
citations

0      N/A
1      N/A
2      N/A
3      N/A
4      N/A
      ... 
578    N/A
579    N/A
580    N/A
581    N/A
582    N/A
Name: citation_requests/acknowledgements, Length: 583, dtype: object

In [117]:
citations.value_counts().to_frame()

Unnamed: 0,citation_requests/acknowledgements
,577
"Please cite Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, Technical Report, Computer Science Department, University of Toronto, 2009.",1
"If you use this dataset, please cite the following paper:\n\n@article{mancuso2021,\ntitle = {A machine learning approach for forecasting hierarchical time series},\njournal = {Expert Systems with Applications},\nvolume = {182},\npages = {115102},\nyear = {2021},\nissn = {0957-4174},\ndoi = {https://doi.org/10.1016/j.eswa.2021.115102},\nurl = {https://www.sciencedirect.com/science/article/pii/S0957417421005431},\nauthor = {Paolo Mancuso and Veronica Piccialli and Antonio M. Sudoso},\nkeywords = {Hierarchical time series, Forecast, Machine learning, Deep neural network}\n}",1
"Deng, J. and Dong, W. and Socher, R. and Li, L.-J. and Li, K. and Fei-Fei, L., ImageNet: A Large-Scale Hierarchical Image Database, CVPR, 2009",1
See the dataset’s link for required citation and use policies. Dataset image credit: Artwork by @allison_horst.,1
"M. Irfan, L. Tokarchuk, L. Marcenaro and C. Regazzoni, ""ANOMALY DETECTION IN CROWDS USING MULTI SENSORY INFORMATION,"" 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2018, pp. 1-6, doi: 10.1109/AVSS.2018.8639151.",1
"Liang Zhao, Olga Gkountouna, and Dieter Pfoser. 2019. Spatial Auto-regressive Dependency Interpretable Learning Based on Spatial Topological Constraints. ACM Trans. Spatial Algorithms Syst. 5, 3, Article 19 (August 2019), 28 pages. DOI:https://doi.org/10.1145/3339823",1


Note that the beta site was scraped during a transition, and it looks like not all citations had been transferred yet. Checking the UCIMLR beta site at a later date shows more objects with citations.
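
To answer both parts of question 27, the populated citations can be separated from the `'N/A'` placeholders and then checked for a URL or DOI. The sketch below uses made-up citation strings and a deliberately simple pattern (`link_pattern` is an assumption about what counts as an "actionable link"), so it should be read as a starting point rather than a definitive rule.

```python
import pandas as pd

# Toy citation strings (fabricated for illustration); 'N/A' marks a missing citation.
citations = pd.Series([
    'N/A',
    'Please cite Doe 2009, Technical Report.',
    'Doe, J. 2019. Some title. DOI:https://doi.org/10.1000/xyz',
    'N/A',
    'Roe 2018, pp. 1-6, doi: 10.1000/abc.',
])

# Populated = not missing and not the 'N/A' placeholder.
populated = citations[citations.notna() & (citations.str.strip() != 'N/A')]

# Assumed definition of an actionable link: an http(s) URL or a 'doi: 10.xxxx' identifier.
link_pattern = r'https?://|\bdoi:\s*10\.\d{4,}'
has_link = populated.str.contains(link_pattern, case=False, regex=True)

n_populated = len(populated)
n_with_link = int(has_link.sum())

print(f'{n_populated} objects have a citation; {n_with_link} of those include a URL or DOI.')
```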