# Analysis Template Walkthrough

# Setup

## Select extract
So that the template cells query data from the correct repository, set `repository` to the repository name and `object_type` to the repository object type.

In [1]:
repository = 'kaggle'
object_type = 'datasets'

In [2]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

In [3]:
#see more rows and columns of output
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100) 

## Helper Functions

In [4]:
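import os, sys
#add the directory two levels above the working directory to the import path so `utils` can be found
dir2 = os.path.abspath('../')
dir1 = os.path.dirname(dir2)
if dir1 not in sys.path:
    sys.path.append(dir1)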

from utils import analysis
from utils.crosswalk import RepositoryExtract, property_crosswalk
from utils import accessors

# Summary Statistic Walkthroughs

Read in the repository .json file.

In [5]:
df = pd.read_json(f'{repository}_{object_type}.json')

In [6]:
df

Unnamed: 0,datasetId_search,id,subtitle_search,creatorName,creatorUrl,totalBytes,url,lastUpdated,downloadCount,isPrivate_search,isReviewed,isFeatured,licenseName,description_search,ownerName,ownerRef,kernelCount,title_search,topicCount,currentVersionNumber,usabilityRating_search,tags,files,versions,page,collaborators,data,datasetId_metadata,datasetSlug,description_metadata,id_no,isPrivate_metadata,keywords,licenses,ownerUser,subtitle_metadata,title_metadata,totalDownloads,totalViews,totalVotes,usabilityRating_metadata
0,70947,kaggle/kaggle-survey-2018,The most comprehensive dataset available on th...,Paul Mooney,paultimothymooney,4.405170e+06,https://www.kaggle.com/kaggle/kaggle-survey-2018,2018-11-03T22:35:07.12Z,16346,False,True,False,CC BY-SA 4.0,,Kaggle,kaggle,483,2018 Kaggle Machine Learning & Data Science Su...,16,5,0.852941,"[{'ref': 'survey analysis', 'name': 'survey an...",[],[],1,,,,,,,,,,,,,,,,
1,2733,kaggle/kaggle-survey-2017,A big picture view of the state of data scienc...,Mark McDonald,markmcdonald,3.692241e+06,https://www.kaggle.com/kaggle/kaggle-survey-2017,2017-10-27T22:03:03.417Z,24031,False,True,False,"Database: Open Database, Contents: © Original ...",,Kaggle,kaggle,435,2017 Kaggle Machine Learning & Data Science Su...,10,4,0.823529,"[{'ref': 'employment', 'name': 'employment', '...",[],[],1,,,,,,,,,,,,,,,,
2,635,alopez247/pokemon,(Almost) all Pokémon stats until generation 6:...,alopez247,alopez247,7.317770e+05,https://www.kaggle.com/alopez247/pokemon,2017-03-05T15:01:26.013Z,11232,False,True,False,CC BY-NC-SA 4.0,,alopez247,alopez247,47,Pokémon for Data Mining and Machine Learning,0,2,0.852941,"[{'ref': 'video games', 'name': 'video games',...",[],[],1,[],[],635.0,pokemon,# Context \n\nWith the rise of the popularity ...,635.0,0.0,"[arts and entertainment, games, card games, vi...",[{'name': 'CC-BY-NC-SA-4.0'}],alopez247,(Almost) all Pokémon stats until generation 6:...,Pokémon for Data Mining and Machine Learning,11232.0,151545.0,248.0,0.852941
3,654897,kaushil268/disease-prediction-using-machine-le...,Use Machine Learning and Deep Learning models ...,KAUSHIL268,kaushil268,3.049000e+04,https://www.kaggle.com/kaushil268/disease-pred...,2020-05-15T03:58:44.15Z,3698,False,False,False,"Database: Open Database, Contents: Database Co...",,KAUSHIL268,kaushil268,17,Disease Prediction Using Machine Learning,2,1,0.823529,"[{'ref': 'neural networks', 'name': 'neural ne...",[],[],1,[],[],654897.0,disease-prediction-using-machine-learning,### Context\n\nDuring the time when Machine Le...,654897.0,0.0,"[diseases, earth and nature, biology, educatio...",[{'name': 'DbCL-1.0'}],kaushil268,Use Machine Learning and Deep Learning models ...,Disease Prediction Using Machine Learning,3698.0,26306.0,56.0,0.823529
4,32132,kashnitsky/mlcourse,Open Machine Learning Course by OpenDataScience,Yury Kashnitsky,kashnitsky,5.359952e+07,https://www.kaggle.com/kashnitsky/mlcourse,2018-12-09T16:45:09.507Z,27972,False,True,False,CC BY-NC-SA 4.0,,Yury Kashnitsky,kashnitsky,472,mlcourse.ai,4,17,0.882353,"[{'ref': 'classification', 'name': 'classifica...",[],[],1,[],[],32132.0,mlcourse,![](https://habrastorage.org/webt/ia/m9/zk/iam...,32132.0,0.0,"[computer science, data visualization, classif...",[{'name': 'CC-BY-NC-SA-4.0'}],kashnitsky,Open Machine Learning Course by OpenDataScience,mlcourse.ai,27973.0,214157.0,1411.0,0.882353
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2011,1755248,manishtripathi86/deloitte-hackathon-predict-th...,,Manish Tripathi,manishtripathi86,9.131913e+06,https://www.kaggle.com/manishtripathi86/deloit...,2021-11-30T18:00:43.22Z,5,False,False,False,Unknown,,Manish Tripathi,manishtripathi86,0,deloitte hackathon predict the loan defaulter,0,2,0.470588,"[{'ref': 'business', 'name': 'business', 'desc...",[],[],101,[],[],1755248.0,deloitte-hackathon-predict-the-loan-defaulter,Overview\nDeloitte refers to one or more of De...,1755248.0,0.0,[business],[{'name': 'unknown'}],manishtripathi86,,deloitte hackathon predict the loan defaulter,5.0,41.0,0.0,0.470588
2012,1418312,jimregan/dalaj-v10,,Jim O'Regan,jimregan,2.664310e+05,https://www.kaggle.com/jimregan/dalaj-v10,2021-06-18T21:07:57.907Z,7,False,False,False,Attribution 4.0 International (CC BY 4.0),,Jim O'Regan,jimregan,0,DaLAJ v.1.0,0,1,0.705882,"[{'ref': 'linguistics', 'name': 'linguistics',...",[],[],101,[],[],1418312.0,dalaj-v10,# I. IDENTIFYING INFORMATION\n\n## Title\n\nDa...,1418312.0,0.0,"[sports, education, linguistics]",[{'name': 'Attribution 4.0 International (CC B...,jimregan,,DaLAJ v.1.0,7.0,75.0,0.0,0.705882
2013,1736649,claytonmiller/ashrae-global-occupant-behavior-...,Data from 32 research studies focused on occup...,Clayton Miller,claytonmiller,4.209520e+08,https://www.kaggle.com/claytonmiller/ashrae-gl...,2021-11-22T06:21:35.157Z,4,False,False,False,ODC Attribution License (ODC-By),,Clayton Miller,claytonmiller,0,ASHRAE Global Occupant Behavior Database,0,1,0.764706,"[{'ref': 'architecture', 'name': 'architecture...",[],[],101,[],[],1736649.0,ashrae-global-occupant-behavior-database,# Dataset and documentation: https://ashraeobd...,1736649.0,0.0,"[architecture, energy, renewable energy, time ...",[{'name': 'ODC Attribution License (ODC-By)'}],claytonmiller,Data from 32 research studies focused on occup...,ASHRAE Global Occupant Behavior Database,4.0,73.0,0.0,0.764706
2014,776337,awsaf49/vip-cup-2020,,Awsaf,awsaf49,5.307890e+09,https://www.kaggle.com/awsaf49/vip-cup-2020,2020-08-02T18:30:26.807Z,21,False,False,False,CC0: Public Domain,,Awsaf,awsaf49,0,VIP CUP 2020,0,8,0.529412,"[{'ref': 'universities and colleges', 'name': ...",[],[],101,[],[],776337.0,vip-cup-2020,## IEEE Video and Image Processing Cup (VIP Cu...,776337.0,0.0,"[universities and colleges, arts and entertain...",[{'name': 'CC0-1.0'}],awsaf49,,VIP CUP 2020,21.0,387.0,0.0,0.529412


## 1. How many total objects (not just records) are in our main dataset extracts for each repository?
**Property:** unique_identifier

In [7]:
ids = df.KaggleDatasetsCrosswalk.unique_identifier
ids

0                               kaggle/kaggle-survey-2018
1                               kaggle/kaggle-survey-2017
2                                       alopez247/pokemon
3       kaushil268/disease-prediction-using-machine-le...
4                                     kashnitsky/mlcourse
                              ...                        
2011    manishtripathi86/deloitte-hackathon-predict-th...
2012                                   jimregan/dalaj-v10
2013    claytonmiller/ashrae-global-occupant-behavior-...
2014                                 awsaf49/vip-cup-2020
2015                           kusuri/melspectrogramsdemo
Name: id, Length: 2016, dtype: object

In [8]:
ids.nunique()

2007

In [9]:
print(f'There are {len(ids)} items in the Kaggle extract, with {ids.nunique()} unique IDs.')

There are 2016 items in the Kaggle extract, with 2007 unique IDs.


In [10]:
#most rows are unique objects, but three IDs appear 4 times each
ids.value_counts()

arbazkhan971/github-bugs-prediction-challenge-machine-hack    4
sakhawat18/asteroid-dataset                                   4
ranjan6459/flairs-for-machine-learning-subreddit-data         4
kaggle/kaggle-survey-2018                                     1
csbuja/people-walking-with-no-occlusion                       1
                                                             ..
danielbacioiu/tig-stainless-steel-304                         1
fanbyprinciple/file-pe-headers                                1
blessondensil294/av-janatahack-healthcare-hackathon-ii        1
prathumarikeri/american-sign-language-09az                    1
kusuri/melspectrogramsdemo                                    1
Name: id, Length: 2007, dtype: int64
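The repeated identifiers can also be isolated directly with `Series.duplicated(keep=False)`, which marks every row whose ID occurs more than once. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the extract):

```python
import pandas as pd

# Hypothetical stand-in for the extract: 'b' appears three times.
toy = pd.DataFrame({
    'id': ['a', 'b', 'b', 'c', 'b'],
    'page': [1, 1, 1, 2, 2],
})

# keep=False flags every occurrence of a repeated ID, not only the later ones.
dupe_mask = toy['id'].duplicated(keep=False)
dupe_rows = toy[dupe_mask]

# Number of rows versus number of distinct IDs.
n_rows, n_unique = len(toy), toy['id'].nunique()
print(n_rows, n_unique)
print(dupe_rows)
```

Applied to the real extract, the same mask would be built from `ids` rather than `toy['id']`.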

In [11]:
#look into these duplicate IDs
kaggle_dupes1 = df.loc[df['id'] == "arbazkhan971/github-bugs-prediction-challenge-machine-hack"]
kaggle_dupes1

Unnamed: 0,datasetId_search,id,subtitle_search,creatorName,creatorUrl,totalBytes,url,lastUpdated,downloadCount,isPrivate_search,isReviewed,isFeatured,licenseName,description_search,ownerName,ownerRef,kernelCount,title_search,topicCount,currentVersionNumber,usabilityRating_search,tags,files,versions,page,collaborators,data,datasetId_metadata,datasetSlug,description_metadata,id_no,isPrivate_metadata,keywords,licenses,ownerUser,subtitle_metadata,title_metadata,totalDownloads,totalViews,totalVotes,usabilityRating_metadata
541,911365,arbazkhan971/github-bugs-prediction-challenge-...,GitHub Bugs Prediction Challenge (Machine Hack),ask9,arbazkhan971,103105526.0,https://www.kaggle.com/arbazkhan971/github-bug...,2020-10-08T08:05:42.173Z,123,False,False,False,Other (specified in description),,ask9,arbazkhan971,1,GitHub Bugs Prediction Challenge (Machine Hack),0,1,0.647059,"[{'ref': 'computer science', 'name': 'computer...",[],[],27,[],[],911365.0,github-bugs-prediction-challenge-machine-hack,"Overview\nForeseeing bugs, features, and quest...",911365.0,0.0,"[computer science, programming, nlp, classific...",[{'name': 'other'}],arbazkhan971,GitHub Bugs Prediction Challenge (Machine Hack),GitHub Bugs Prediction Challenge (Machine Hack),123.0,1553.0,25.0,0.647059
542,911365,arbazkhan971/github-bugs-prediction-challenge-...,GitHub Bugs Prediction Challenge (Machine Hack),ask9,arbazkhan971,103105526.0,https://www.kaggle.com/arbazkhan971/github-bug...,2020-10-08T08:05:42.173Z,123,False,False,False,Other (specified in description),,ask9,arbazkhan971,1,GitHub Bugs Prediction Challenge (Machine Hack),0,1,0.647059,"[{'ref': 'computer science', 'name': 'computer...",[],[],27,[],[],911365.0,github-bugs-prediction-challenge-machine-hack,"Overview\nForeseeing bugs, features, and quest...",911365.0,0.0,"[computer science, programming, nlp, classific...",[{'name': 'other'}],arbazkhan971,GitHub Bugs Prediction Challenge (Machine Hack),GitHub Bugs Prediction Challenge (Machine Hack),123.0,1553.0,25.0,0.647059
543,911365,arbazkhan971/github-bugs-prediction-challenge-...,GitHub Bugs Prediction Challenge (Machine Hack),ask9,arbazkhan971,103105526.0,https://www.kaggle.com/arbazkhan971/github-bug...,2020-10-08T08:05:42.173Z,123,False,False,False,Other (specified in description),,ask9,arbazkhan971,1,GitHub Bugs Prediction Challenge (Machine Hack),0,1,0.647059,"[{'ref': 'computer science', 'name': 'computer...",[],[],28,[],[],911365.0,github-bugs-prediction-challenge-machine-hack,"Overview\nForeseeing bugs, features, and quest...",911365.0,0.0,"[computer science, programming, nlp, classific...",[{'name': 'other'}],arbazkhan971,GitHub Bugs Prediction Challenge (Machine Hack),GitHub Bugs Prediction Challenge (Machine Hack),123.0,1553.0,25.0,0.647059
544,911365,arbazkhan971/github-bugs-prediction-challenge-...,GitHub Bugs Prediction Challenge (Machine Hack),ask9,arbazkhan971,103105526.0,https://www.kaggle.com/arbazkhan971/github-bug...,2020-10-08T08:05:42.173Z,123,False,False,False,Other (specified in description),,ask9,arbazkhan971,1,GitHub Bugs Prediction Challenge (Machine Hack),0,1,0.647059,"[{'ref': 'computer science', 'name': 'computer...",[],[],28,[],[],911365.0,github-bugs-prediction-challenge-machine-hack,"Overview\nForeseeing bugs, features, and quest...",911365.0,0.0,"[computer science, programming, nlp, classific...",[{'name': 'other'}],arbazkhan971,GitHub Bugs Prediction Challenge (Machine Hack),GitHub Bugs Prediction Challenge (Machine Hack),123.0,1553.0,25.0,0.647059


In [12]:
#look closer at these
dupes1_1 = kaggle_dupes1.iloc[0]
dupes1_2 = kaggle_dupes1.iloc[1]
dupes1_3 = kaggle_dupes1.iloc[2]
dupes1_4 = kaggle_dupes1.iloc[3]

In [13]:
dupes1_2.equals(dupes1_1)

True

In [14]:
dupes1_2.equals(dupes1_3)

False

In [15]:
dupes1_3.equals(dupes1_4)

True

In [16]:
#the first two rows are identical and the last two rows are identical, but the two pairs differ from each other

In [17]:
#find where instances differ
kaggle_diffs = dupes1_1 == dupes1_4
kaggle_diffs

datasetId_search             True
id                           True
subtitle_search              True
creatorName                  True
creatorUrl                   True
totalBytes                   True
url                          True
lastUpdated                  True
downloadCount                True
isPrivate_search             True
isReviewed                   True
isFeatured                   True
licenseName                  True
description_search          False
ownerName                    True
ownerRef                     True
kernelCount                  True
title_search                 True
topicCount                   True
currentVersionNumber         True
usabilityRating_search       True
tags                         True
files                        True
versions                     True
page                        False
collaborators                True
data                         True
datasetId_metadata           True
datasetSlug                  True
description_me

In [18]:
index_diffs = kaggle_diffs[~kaggle_diffs]
index_diffs

description_search    False
page                  False
dtype: bool

In [19]:
index_diffs_str = index_diffs.index.tolist()
index_diffs_str

['description_search', 'page']

In [20]:
dupes1_1[index_diffs_str]

description_search    NaN
page                   27
Name: 541, dtype: object

In [21]:
dupes1_4[index_diffs_str]

description_search    NaN
page                   28
Name: 544, dtype: object

In [22]:
#page number is the only real difference (may be an artifact of web scraping);
#description_search is flagged only because NaN != NaN
#decision: we don't use page in any calculations, so just group by ID and select the first row
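The comparison above flags `description_search` even though both values are missing, because `NaN == NaN` evaluates to `False`. A minimal sketch of a missing-value-aware comparison follows; the two rows are hypothetical stand-ins for a pair of duplicate records, not values copied from the extract:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for two duplicate records.
row_a = pd.Series({'id': 'x/y', 'description_search': np.nan, 'page': 27})
row_b = pd.Series({'id': 'x/y', 'description_search': np.nan, 'page': 28})

# Plain == reports fields that are NaN in both rows as different.
equal = row_a == row_b
naive_diff = equal[~equal].index.tolist()

# Treat a field as equal when it is missing in both rows.
both_missing = row_a.isna() & row_b.isna()
aware = equal | both_missing
real_diff = aware[~aware].index.tolist()

print(naive_diff)
print(real_diff)
```

With this adjustment only `page` remains, which matches the conclusion drawn in the notebook.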

In [23]:
#subset to view only the duplicate ids
dupes_all = ids.value_counts().to_frame()
dupes_all = dupes_all[dupes_all['id'] == 4]
dupes_all

Unnamed: 0,id
arbazkhan971/github-bugs-prediction-challenge-machine-hack,4
sakhawat18/asteroid-dataset,4
ranjan6459/flairs-for-machine-learning-subreddit-data,4


In [24]:
dupes_all_ids = dupes_all.index.to_list()
dupes_all_ids

['arbazkhan971/github-bugs-prediction-challenge-machine-hack',
 'sakhawat18/asteroid-dataset',
 'ranjan6459/flairs-for-machine-learning-subreddit-data']

In [25]:
dupes_df = df[df.id.isin(dupes_all_ids)].sort_values(['id', 'page'])
dupes_df[['id', 'page']]

Unnamed: 0,id,page
541,arbazkhan971/github-bugs-prediction-challenge-...,27
542,arbazkhan971/github-bugs-prediction-challenge-...,27
543,arbazkhan971/github-bugs-prediction-challenge-...,28
544,arbazkhan971/github-bugs-prediction-challenge-...,28
1342,ranjan6459/flairs-for-machine-learning-subredd...,67
1343,ranjan6459/flairs-for-machine-learning-subredd...,67
1344,ranjan6459/flairs-for-machine-learning-subredd...,68
1345,ranjan6459/flairs-for-machine-learning-subredd...,68
139,sakhawat18/asteroid-dataset,7
140,sakhawat18/asteroid-dataset,7


In [26]:
#confirmed - value is off by 1 for page

In [27]:
#group by id and select first within group
df_use = df.groupby('id').nth(0).reset_index()
df_use[df_use.id.isin(dupes_all_ids)].sort_values(['id', 'page'])[['id','page']]

Unnamed: 0,id,page
195,arbazkhan971/github-bugs-prediction-challenge-...,27
1413,ranjan6459/flairs-for-machine-learning-subredd...,67
1521,sakhawat18/asteroid-dataset,7
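An alternative to `groupby('id').nth(0)` is `DataFrame.drop_duplicates` with a `subset`. This is a sketch on made-up data (the `toy` frame is an assumption for illustration), not the approach the notebook uses:

```python
import pandas as pd

# Hypothetical frame: 'b' was scraped twice, on pages 27 and 28.
toy = pd.DataFrame({
    'id': ['b', 'a', 'b', 'c'],
    'page': [27, 5, 28, 9],
})

# Keep the first row seen for each ID and renumber the index.
deduped = toy.drop_duplicates(subset='id', keep='first').reset_index(drop=True)
print(deduped)
```

One difference worth noting: `drop_duplicates` preserves the original row order, whereas the output above shows the frame ordered by `id` after the `groupby` step. Downstream results that depend on row position may therefore differ between the two approaches.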


In [28]:
#for matching other notebooks, rename 'df_use' back to 'df'
df = df_use

In [29]:
ids = df.KaggleDatasetsCrosswalk.unique_identifier

In [30]:
ids = df.KaggleDatasetsCrosswalk.unique_identifier
print(f'There are {len(ids)} items in the Kaggle extract after removing duplicates, with {ids.nunique()} unique IDs.')

There are 2007 items in the Kaggle extract after removing duplicates, with 2007 unique IDs.


## 2. See the "Licenses offered" tab in /Working documents/Licenses sheet for list of licenses by repo.

## Given the type(s) of license(s) offered by the repo, how many of each type is assigned?
**Property:** License

In [31]:
licenses = df.KaggleDatasetsCrosswalk.license
licenses

0                      None
1                      None
2                      None
3                      None
4                      None
               ...         
2002    [copyright-authors]
2003              [CC0-1.0]
2004              [CC0-1.0]
2005             [DbCL-1.0]
2006              [CC0-1.0]
Name: licenses, Length: 2007, dtype: object

In [32]:
#replace None values with empty list
licenses = licenses.apply(lambda d: d if isinstance(d, list) else [])
licenses

0                        []
1                        []
2                        []
3                        []
4                        []
               ...         
2002    [copyright-authors]
2003              [CC0-1.0]
2004              [CC0-1.0]
2005             [DbCL-1.0]
2006              [CC0-1.0]
Name: licenses, Length: 2007, dtype: object

In [33]:
#since we are interested in licenses per object, pair each license list with its ID
licenses_ids = pd.concat([ids, licenses], axis = 1)
licenses_ids

Unnamed: 0,id,licenses
0,BengaliAI/numta,[]
1,Cornell-University/arxiv,[]
2,FootballPrediction/fbpdataset,[]
3,HRAnalyticRepository/employee-attrition-data,[]
4,HRAnalyticRepository/job-classification-dataset,[]
...,...,...
2002,zusmani/pakistans-largest-ecommerce-dataset,[copyright-authors]
2003,zusmani/the-holy-quran,[CC0-1.0]
2004,zusmani/trumps-legacy,[CC0-1.0]
2005,zusmani/us-mass-shootings-last-50-years,[DbCL-1.0]


In [34]:
#expand licenses in list (in case of multiple) and drop duplicates, so one license per unique ID
licenses_ids_unique = licenses_ids.explode(['licenses']).drop_duplicates()
licenses_ids_unique

Unnamed: 0,id,licenses
0,BengaliAI/numta,
1,Cornell-University/arxiv,
2,FootballPrediction/fbpdataset,
3,HRAnalyticRepository/employee-attrition-data,
4,HRAnalyticRepository/job-classification-dataset,
...,...,...
2002,zusmani/pakistans-largest-ecommerce-dataset,copyright-authors
2003,zusmani/the-holy-quran,CC0-1.0
2004,zusmani/trumps-legacy,CC0-1.0
2005,zusmani/us-mass-shootings-last-50-years,DbCL-1.0


In [35]:
license_counts = licenses_ids_unique['licenses'].value_counts().to_frame()
license_counts['percent'] = license_counts['licenses']/len(licenses_ids_unique) * 100
license_counts

Unnamed: 0,licenses,percent
unknown,645,32.137519
CC0-1.0,431,21.474838
other,215,10.712506
copyright-authors,167,8.320877
Attribution 4.0 International (CC BY 4.0),99,4.932735
DbCL-1.0,84,4.185351
CC-BY-NC-SA-4.0,65,3.238665
ODbL-1.0,48,2.391629
CC-BY-SA-4.0,44,2.192327
GPL-2.0,42,2.092676
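In the table above, the denominator `len(licenses_ids_unique)` includes objects with no license, while `value_counts()` drops them by default, so the percentages do not sum to 100. A sketch that makes the missing group explicit is shown below; the `toy` frame is hypothetical:

```python
import pandas as pd

# Hypothetical license lists; object 'c' has no license recorded.
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'licenses': [['CC0-1.0'], ['CC0-1.0', 'ODbL-1.0'], [], ['unknown']],
})

# One row per (id, license); an empty list becomes a single NaN row.
exploded = toy.explode('licenses').drop_duplicates()

# dropna=False keeps unlicensed objects as their own row in the counts.
summary = exploded['licenses'].value_counts(dropna=False).to_frame('count')
summary['percent'] = summary['count'] / summary['count'].sum() * 100
print(summary)
```

Whether unlicensed objects belong in the denominator is an analysis decision; the sketch only shows how to keep the choice visible.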


## 3. What is the mean number of characters (excluding whitespaces, if possible) per object?
**Property:** Description
**Related function:** `mean_characters`

In [36]:
descriptions = df.KaggleDatasetsCrosswalk.description
descriptions

0                                                    None
1                                                    None
2                                                    None
3                                                    None
4                                                    None
                              ...                        
2002    ### Context\n\nThis is the largest retail e-co...
2003    ### Context\n\nThe Holy Quran is the central t...
2004    ### Context\n\nUnited States 45th President Do...
2005    ### Context\n\nMass Shootings in the United St...
2006    # NFL Football Stats\nMy family has always bee...
Name: description_metadata, Length: 2007, dtype: object

In [37]:
descriptions = descriptions.drop_duplicates()

In [38]:
#the text would need substantial cleanup, but this is a starting point for mean characters
print(f'{analysis.mean_characters(descriptions)} mean characters')

2132.1904212348527 mean characters
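The implementation of `analysis.mean_characters` is not shown in this walkthrough. As a rough illustration of the "excluding whitespace" variant the question asks about, here is a hypothetical helper (`mean_nonspace_characters` is an assumed name, not part of `utils.analysis`):

```python
import re
import pandas as pd

def mean_nonspace_characters(texts: pd.Series) -> float:
    """Mean length per entry after removing all whitespace; missing entries are skipped."""
    cleaned = texts.dropna().astype(str).map(lambda s: re.sub(r'\s+', '', s))
    return cleaned.str.len().mean()

# Hypothetical descriptions, including one missing value.
toy = pd.Series(['ab cd', None, 'e\n\nf g'])
result = mean_nonspace_characters(toy)
print(result)
```

Markdown syntax and embedded links would still count toward the total, which is part of the cleanup the notebook comment refers to.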


## 4. What is the mean number of characters (excluding whitespaces, if possible) per object?
**Property:** Methods
**Related function:** `mean_characters`

In [39]:
methods = df.KaggleDatasetsCrosswalk.methods
methods

In [40]:
#confirm missing for repo
print(df.KaggleDatasetsCrosswalk.methods)

None


## 5. What are the min and max publication dates for each repo?

## How many objects were published each year for each repo?
**Property:** Publication date

In [41]:
publication_dates = df.KaggleDatasetsCrosswalk.publication_date
publication_dates

0        2018-08-14T03:08:59.81Z
1       2021-11-28T00:51:07.293Z
2       2019-03-23T09:35:39.467Z
3       2017-04-26T18:39:14.473Z
4         2017-01-07T22:15:38.4Z
                  ...           
2002     2021-01-19T11:42:57.93Z
2003    2017-11-20T09:46:11.137Z
2004    2021-01-29T14:24:20.103Z
2005    2021-05-10T05:36:10.597Z
2006    2017-12-08T03:40:48.143Z
Name: lastUpdated, Length: 2007, dtype: object

In [42]:
#since we are interested in dates per object, pair each date with its ID
publication_dates_ids = pd.concat([ids, publication_dates], axis = 1)
publication_dates_ids

Unnamed: 0,id,lastUpdated
0,BengaliAI/numta,2018-08-14T03:08:59.81Z
1,Cornell-University/arxiv,2021-11-28T00:51:07.293Z
2,FootballPrediction/fbpdataset,2019-03-23T09:35:39.467Z
3,HRAnalyticRepository/employee-attrition-data,2017-04-26T18:39:14.473Z
4,HRAnalyticRepository/job-classification-dataset,2017-01-07T22:15:38.4Z
...,...,...
2002,zusmani/pakistans-largest-ecommerce-dataset,2021-01-19T11:42:57.93Z
2003,zusmani/the-holy-quran,2017-11-20T09:46:11.137Z
2004,zusmani/trumps-legacy,2021-01-29T14:24:20.103Z
2005,zusmani/us-mass-shootings-last-50-years,2021-05-10T05:36:10.597Z


In [43]:
publication_dates_unique = publication_dates_ids.drop_duplicates()
publication_dates_unique

Unnamed: 0,id,lastUpdated
0,BengaliAI/numta,2018-08-14T03:08:59.81Z
1,Cornell-University/arxiv,2021-11-28T00:51:07.293Z
2,FootballPrediction/fbpdataset,2019-03-23T09:35:39.467Z
3,HRAnalyticRepository/employee-attrition-data,2017-04-26T18:39:14.473Z
4,HRAnalyticRepository/job-classification-dataset,2017-01-07T22:15:38.4Z
...,...,...
2002,zusmani/pakistans-largest-ecommerce-dataset,2021-01-19T11:42:57.93Z
2003,zusmani/the-holy-quran,2017-11-20T09:46:11.137Z
2004,zusmani/trumps-legacy,2021-01-29T14:24:20.103Z
2005,zusmani/us-mass-shootings-last-50-years,2021-05-10T05:36:10.597Z


In [44]:
#min and max publication date
publication_dates_unique['lastUpdated'].min(), publication_dates_unique['lastUpdated'].max()

('2016-05-20T01:32:31.27Z', '2021-12-03T01:47:37.417Z')

In [45]:
#objects per year
pd.to_datetime(publication_dates_unique['lastUpdated'], utc=True).dt.year.value_counts().sort_index()

2016     35
2017    144
2018    187
2019    241
2020    663
2021    737
Name: lastUpdated, dtype: int64

In [46]:
#Kaggle only provides a last-updated date, not a publication date
#but it is the only date available to compare against other repos' publication dates
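The min and max above are taken on the raw strings, which works here because ISO 8601 timestamps in the same timezone sort lexicographically. A sketch of the same steps on parsed datetimes is given below, using a few made-up timestamps in the same format as `lastUpdated`:

```python
import pandas as pd

# Hypothetical timestamps in the same ISO 8601 style as the extract.
toy = pd.Series([
    '2016-05-20T01:32:31.270Z',
    '2021-12-03T01:47:37.417Z',
    '2021-01-19T11:42:57.930Z',
], name='lastUpdated')

# Parse once, then derive everything from the datetime values.
parsed = pd.to_datetime(toy, utc=True)
earliest, latest = parsed.min(), parsed.max()

# Objects per year, ordered by year.
per_year = parsed.dt.year.value_counts().sort_index()
print(earliest.year, latest.year)
print(per_year)
```

Parsing first also guards against timestamps that are not uniformly formatted, where string comparison could give a misleading min or max.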

In [47]:
#export for plotting
pub_dates_export = pd.to_datetime(publication_dates_unique['lastUpdated'], utc=True).dt.year.value_counts().sort_index().to_frame()

In [48]:
#update column names
pub_dates_export_ready = pub_dates_export.reset_index(level=0)
pub_dates_export_ready.columns = ['year', 'count']

In [49]:
#add column with name of repo
pub_dates_export_ready['repo'] = 'kaggle'
pub_dates_export_ready

Unnamed: 0,year,count,repo
0,2016,35,kaggle
1,2017,144,kaggle
2,2018,187,kaggle
3,2019,241,kaggle
4,2020,663,kaggle
5,2021,737,kaggle


In [50]:
#export to Figures folder
pub_dates_export_ready.to_csv(Path('..', '..', 'Figures', 'Figure1', 'repository_dates', 'kaggle_pub_years.csv'))

## 6. What are the unweighted mean, median, and max file sizes among all ingested files?
**Property:** File size
**Related function:** `get_summary_statistics`

We first get the file size attribute using the crosswalk.

In [51]:
file_sizes = df.KaggleDatasetsCrosswalk.file_size
file_sizes

In [52]:
#confirm missing for repo
print(df.KaggleDatasetsCrosswalk.file_size)

None


## 7. What are the mean, median, and max number of files per object?
**Property:** URL
**Related function:** `get_summary_statistics`

Note that while there is a 'files' column in the original metadata extract, all entries are empty lists.

In [53]:
files_metadata = df['files']
files_metadata

0       []
1       []
2       []
3       []
4       []
        ..
2002    []
2003    []
2004    []
2005    []
2006    []
Name: files, Length: 2007, dtype: object

In [54]:
files_metadata.drop_duplicates()

0    []
Name: files, dtype: object

## 8. What are the mean, median, and max total dataset size (summed across all files) per object?
**Property:** Dataset size
**Related function:** `get_summary_statistics`

In [55]:
dataset_sizes = df.KaggleDatasetsCrosswalk.dataset_size
dataset_sizes

0       2.049704e+09
1       1.064541e+09
2       3.681320e+05
3       5.132020e+05
4       1.428000e+03
            ...     
2002    1.443226e+07
2003    1.004318e+07
2004    3.896206e+06
2005    2.700300e+05
2006    3.439986e+07
Name: totalBytes, Length: 2007, dtype: float64

In [56]:
#since we are interested in size per object, pair each size with its ID
dataset_sizes_ids = pd.concat([ids, dataset_sizes], axis = 1)
dataset_sizes_ids

Unnamed: 0,id,totalBytes
0,BengaliAI/numta,2.049704e+09
1,Cornell-University/arxiv,1.064541e+09
2,FootballPrediction/fbpdataset,3.681320e+05
3,HRAnalyticRepository/employee-attrition-data,5.132020e+05
4,HRAnalyticRepository/job-classification-dataset,1.428000e+03
...,...,...
2002,zusmani/pakistans-largest-ecommerce-dataset,1.443226e+07
2003,zusmani/the-holy-quran,1.004318e+07
2004,zusmani/trumps-legacy,3.896206e+06
2005,zusmani/us-mass-shootings-last-50-years,2.700300e+05


In [57]:
#remove NA values
dataset_sizes_ids = dataset_sizes_ids.dropna()
dataset_sizes_ids

Unnamed: 0,id,totalBytes
0,BengaliAI/numta,2.049704e+09
1,Cornell-University/arxiv,1.064541e+09
2,FootballPrediction/fbpdataset,3.681320e+05
3,HRAnalyticRepository/employee-attrition-data,5.132020e+05
4,HRAnalyticRepository/job-classification-dataset,1.428000e+03
...,...,...
2002,zusmani/pakistans-largest-ecommerce-dataset,1.443226e+07
2003,zusmani/the-holy-quran,1.004318e+07
2004,zusmani/trumps-legacy,3.896206e+06
2005,zusmani/us-mass-shootings-last-50-years,2.700300e+05


In [58]:
analysis.get_summary_statistics(dataset_sizes_ids['totalBytes'])

{'mean': 638643184.5164671, 'median': 1233604.0, 'max': 73677739407.0}
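The mean here (about 639 MB) is far larger than the median (about 1.2 MB), which indicates that a few very large datasets dominate the mean. The implementation of `analysis.get_summary_statistics` is not shown; the following is a hypothetical equivalent (`summary_statistics` is an assumed name) that returns the same three keys:

```python
import pandas as pd

def summary_statistics(values: pd.Series) -> dict:
    """Mean, median, and max of the non-missing values."""
    clean = values.dropna()
    return {
        'mean': float(clean.mean()),
        'median': float(clean.median()),
        'max': float(clean.max()),
    }

# Hypothetical sizes in bytes, with one missing value and one large outlier.
toy = pd.Series([1_000.0, 2_000.0, None, 9_000.0])
stats = summary_statistics(toy)
print(stats)
```

Even in this tiny example the single large value pulls the mean well above the median.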

## 9. How many of each scientific domain are assigned?
**Property:** Domain
**Related function:** `domains.value_counts()`

In [59]:
domains = df.KaggleDatasetsCrosswalk.domain
domains

In [60]:
#confirm missing for repo
print(df.KaggleDatasetsCrosswalk.domain)

None


## 10. What is the mean number of characters (excluding whitespaces, if possible) per object?
**Property:** Technical details
**Related function:** `mean_characters`

In [61]:
# "usage notes" is not in crosswalk

## 11-13. What are the mean and median total number of keyword terms per object, after merging results for Keyword, Geographic keyword, and Scientific keyword?
**Property:** Keyword

In [62]:
print(df.KaggleDatasetsCrosswalk.keyword)

                                               keywords  \
0                                                  None   
1                                                  None   
2                                                  None   
3                                                  None   
4                                                  None   
...                                                 ...   
2002  [business, computer science, internet, retail ...   
2003  [languages, religion and belief systems, inter...   
2004                          [united states, politics]   
2005                             [united states, crime]   
2006                                    [games, sports]   

                                                   tags  
0               [image data, computer vision, business]  
1                         [earth and nature, education]  
2                                               [games]  
3                      [employment, business, internet]  
4

In [63]:
print(df.KaggleDatasetsCrosswalk.geographic_keyword)

None


In [64]:
print(df.KaggleDatasetsCrosswalk.scientific_keyword)

None


In [65]:
#keywords field has both keywords and tags, which we consider all together as keywords
keywords = df.KaggleDatasetsCrosswalk.keyword
keywords

Unnamed: 0,keywords,tags
0,,"[image data, computer vision, business]"
1,,"[earth and nature, education]"
2,,[games]
3,,"[employment, business, internet]"
4,,[earth and nature]
...,...,...
2002,"[business, computer science, internet, retail ...","[e-commerce services, retail and shopping, int..."
2003,"[languages, religion and belief systems, inter...","[languages, internet, religion and belief syst..."
2004,"[united states, politics]","[united states, politics]"
2005,"[united states, crime]","[united states, crime]"


In [66]:
#since we are interested in keywords per object, pair each keyword list with its ID
keywords = pd.concat([ids, keywords], axis = 1)
keywords

Unnamed: 0,id,keywords,tags
0,BengaliAI/numta,,"[image data, computer vision, business]"
1,Cornell-University/arxiv,,"[earth and nature, education]"
2,FootballPrediction/fbpdataset,,[games]
3,HRAnalyticRepository/employee-attrition-data,,"[employment, business, internet]"
4,HRAnalyticRepository/job-classification-dataset,,[earth and nature]
...,...,...,...
2002,zusmani/pakistans-largest-ecommerce-dataset,"[business, computer science, internet, retail ...","[e-commerce services, retail and shopping, int..."
2003,zusmani/the-holy-quran,"[languages, religion and belief systems, inter...","[languages, internet, religion and belief syst..."
2004,zusmani/trumps-legacy,"[united states, politics]","[united states, politics]"
2005,zusmani/us-mass-shootings-last-50-years,"[united states, crime]","[united states, crime]"


In [67]:
#replace the None values with empty lists so the count of string values evaluates to 0
keywords_all = keywords.apply(
    lambda row: row.apply(
        lambda cell: cell if cell else []
    ),
    axis=1
)
keywords_all

Unnamed: 0,id,keywords,tags
0,BengaliAI/numta,[],"[image data, computer vision, business]"
1,Cornell-University/arxiv,[],"[earth and nature, education]"
2,FootballPrediction/fbpdataset,[],[games]
3,HRAnalyticRepository/employee-attrition-data,[],"[employment, business, internet]"
4,HRAnalyticRepository/job-classification-dataset,[],[earth and nature]
...,...,...,...
2002,zusmani/pakistans-largest-ecommerce-dataset,"[business, computer science, internet, retail ...","[e-commerce services, retail and shopping, int..."
2003,zusmani/the-holy-quran,"[languages, religion and belief systems, inter...","[languages, internet, religion and belief syst..."
2004,zusmani/trumps-legacy,"[united states, politics]","[united states, politics]"
2005,zusmani/us-mass-shootings-last-50-years,"[united states, crime]","[united states, crime]"


In [68]:
#remove id column for counts
keywords_use = keywords_all.drop('id', axis=1)
keywords_use

Unnamed: 0,keywords,tags
0,[],"[image data, computer vision, business]"
1,[],"[earth and nature, education]"
2,[],[games]
3,[],"[employment, business, internet]"
4,[],[earth and nature]
...,...,...
2002,"[business, computer science, internet, retail ...","[e-commerce services, retail and shopping, int..."
2003,"[languages, religion and belief systems, inter...","[languages, internet, religion and belief syst..."
2004,"[united states, politics]","[united states, politics]"
2005,"[united states, crime]","[united states, crime]"


In [69]:
#keywords and tags overlap, but not always in the same order,
#so combine the two lists first
keywords_use['keywords_all'] = keywords_use['keywords'] + keywords_use['tags']
keywords_use

Unnamed: 0,keywords,tags,keywords_all
0,[],"[image data, computer vision, business]","[image data, computer vision, business]"
1,[],"[earth and nature, education]","[earth and nature, education]"
2,[],[games],[games]
3,[],"[employment, business, internet]","[employment, business, internet]"
4,[],[earth and nature],[earth and nature]
...,...,...,...
2002,"[business, computer science, internet, retail ...","[e-commerce services, retail and shopping, int...","[business, computer science, internet, retail ..."
2003,"[languages, religion and belief systems, inter...","[languages, internet, religion and belief syst...","[languages, religion and belief systems, inter..."
2004,"[united states, politics]","[united states, politics]","[united states, politics, united states, polit..."
2005,"[united states, crime]","[united states, crime]","[united states, crime, united states, crime]"


In [70]:
#then remove duplicates within list
keywords_use['keywords_all'] = keywords_use['keywords_all'].apply(lambda x: list(set(x)))
keywords_use

Unnamed: 0,keywords,tags,keywords_all
0,[],"[image data, computer vision, business]","[computer vision, image data, business]"
1,[],"[earth and nature, education]","[earth and nature, education]"
2,[],[games],[games]
3,[],"[employment, business, internet]","[employment, internet, business]"
4,[],[earth and nature],[earth and nature]
...,...,...,...
2002,"[business, computer science, internet, retail ...","[e-commerce services, retail and shopping, int...","[retail and shopping, internet, business, e-co..."
2003,"[languages, religion and belief systems, inter...","[languages, internet, religion and belief syst...","[languages, religion and belief systems, inter..."
2004,"[united states, politics]","[united states, politics]","[politics, united states]"
2005,"[united states, crime]","[united states, crime]","[crime, united states]"
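As the output above shows, `list(set(x))` removes duplicates but does not keep the original order of the items. The counts used later are unaffected. If the order matters for display, one possible alternative (a sketch on made-up lists, not what the notebook does) is `dict.fromkeys`, which keeps the first occurrence of each item:

```python
# Made-up example lists; the notebook's real values come from the Kaggle extract
keywords = ["united states", "politics"]
tags = ["politics", "united states", "elections"]

# set() removes duplicates but gives no ordering guarantee
unordered = list(set(keywords + tags))

# dict.fromkeys() removes duplicates and preserves first-seen order
ordered = list(dict.fromkeys(keywords + tags))
print(ordered)  # ['united states', 'politics', 'elections']
```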


In [71]:
keywords_use_clean = keywords_use['keywords_all']
keywords_use_clean

0                 [computer vision, image data, business]
1                           [earth and nature, education]
2                                                 [games]
3                        [employment, internet, business]
4                                      [earth and nature]
                              ...                        
2002    [retail and shopping, internet, business, e-co...
2003    [languages, religion and belief systems, inter...
2004                            [politics, united states]
2005                               [crime, united states]
2006                                      [games, sports]
Name: keywords_all, Length: 2007, dtype: object

In [72]:
keywords_counts = keywords_use_clean.apply(len)
keywords_counts

0       3
1       2
2       1
3       3
4       1
       ..
2002    5
2003    3
2004    2
2005    2
2006    2
Name: keywords_all, Length: 2007, dtype: int64

In [73]:
analysis.get_summary_statistics(keywords_counts)

{'mean': 3.4314897857498754, 'median': 3.0, 'max': 11}
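The implementation of `analysis.get_summary_statistics` is not shown in this walkthrough. Judging only from the returned dictionary, a rough pandas equivalent might look like the sketch below. `summary_statistics_sketch` is a hypothetical name and the input values are made up.

```python
import pandas as pd

def summary_statistics_sketch(values):
    """Hypothetical stand-in that mirrors the output shape of get_summary_statistics."""
    series = pd.Series(values)
    return {
        "mean": series.mean(),
        "median": series.median(),
        "max": series.max(),
    }

# Made-up per-object keyword counts
toy_counts = pd.Series([3, 2, 1, 3, 1])
stats = summary_statistics_sketch(toy_counts)
print(stats)
```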

## 14. Who are the most common funding agencies for each repo? What are the object counts per agency?
**Property:** Funding Agency

In [74]:
funders = df.KaggleDatasetsCrosswalk.funding_agency
funders

In [75]:
#confirm missing for this repo
print(df.KaggleDatasetsCrosswalk.funding_agency)

None
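Several properties in the remaining sections are missing for this repository, and each is confirmed one at a time. A possible shortcut is to check them in a loop, as sketched below. The sketch uses a stub object in place of `df.KaggleDatasetsCrosswalk` so that it runs on its own, and assumes (as the output above suggests) that an unmapped property evaluates to `None`.

```python
from types import SimpleNamespace

# Stub standing in for `df.KaggleDatasetsCrosswalk`; the real accessor lives in utils
crosswalk_stub = SimpleNamespace(
    funding_agency=None,
    citation_count=None,
    downloads=[2935, 11323, 60],
)

properties = ["funding_agency", "citation_count", "downloads"]

# A property counts as missing when the accessor returns None for it
missing = [name for name in properties if getattr(crosswalk_stub, name, None) is None]
print(missing)  # ['funding_agency', 'citation_count']
```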


## 15. What are the mean, median, and max number of Views per object?
**Property:** Views
**Related function:** `get_summary_statistics`

In [76]:
views = df.KaggleDatasetsCrosswalk.views
views

0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
         ...   
2002    25255.0
2003    44997.0
2004    10696.0
2005    89557.0
2006    30590.0
Name: totalViews, Length: 2007, dtype: float64

In [77]:
#remove NA values
views_clean = views.dropna()
views_clean

11        376.0
12       7897.0
13       2122.0
14        288.0
15        564.0
         ...   
2002    25255.0
2003    44997.0
2004    10696.0
2005    89557.0
2006    30590.0
Name: totalViews, Length: 1943, dtype: float64

In [78]:
analysis.get_summary_statistics(views_clean)

{'mean': 11826.444673185795, 'median': 1436.0, 'max': 1147607.0}
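Dropping NA values removed 64 of the 2007 objects (1943 remain), so the statistics above describe only objects that report views. Whether `dropna()` is strictly required depends on how `get_summary_statistics` is implemented, which is not shown here. The sketch below, on made-up values, illustrates the difference between pandas and plain NumPy reductions:

```python
import numpy as np
import pandas as pd

# Made-up view counts with one missing value
toy_views = pd.Series([np.nan, 100.0, 300.0])

n_missing = int(toy_views.isna().sum())                # number of objects without views
pandas_mean = toy_views.mean()                         # pandas skips NaN by default
numpy_mean = np.mean(toy_views.to_numpy())             # NumPy propagates NaN
clean_mean = np.mean(toy_views.dropna().to_numpy())    # safe after dropna()
```

Calling `dropna()` first keeps the result independent of which library computes the statistic.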

## 16. What are the mean, median, and max (total) number of downloads per object?
**Property:** Downloads
**Related function:** `get_summary_statistics`

In [79]:
downloads = df.KaggleDatasetsCrosswalk.downloads
downloads

0        2935
1       11323
2          60
3       10548
4        3215
        ...  
2002     2407
2003     3774
2004      132
2005    13583
2006     3528
Name: downloadCount, Length: 2007, dtype: int64

In [80]:
#get summary statistics
analysis.get_summary_statistics(downloads)

{'mean': 2348.1011459890383, 'median': 129.0, 'max': 365376}

## 17. What are the mean, median, and max Citation counts per object?
**Property:** Citation count
**Related function:** `get_summary_statistics`

In [81]:
citation_count = df.KaggleDatasetsCrosswalk.citation_count
citation_count

In [82]:
#confirm missing for this repo
print(df.KaggleDatasetsCrosswalk.citation_count)

None


## 18. How many objects contain each given resource type?
**Property:** Resource type

In [83]:
resource_types = df.KaggleDatasetsCrosswalk.resource_type
resource_types

In [84]:
#confirm missing for this repo
print(df.KaggleDatasetsCrosswalk.resource_type)

None


## 19. How many objects contain each type of file extension given?
**Property:** File Extension

In [85]:
files = df.KaggleDatasetsCrosswalk.file_extension
files

In [86]:
#confirm missing for this repo
print(df.KaggleDatasetsCrosswalk.file_extension)

None


## 19.5 How many files of each type of file extension are present?
**Property:** File extension
**Related function:** `get_file_extensions`

In [87]:
files = df.KaggleDatasetsCrosswalk.file_extension
files

In [88]:
#confirm missing for this repo
print(df.KaggleDatasetsCrosswalk.file_extension)

None
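Questions 19 and 19.5 differ only in the unit being counted: objects versus files. Because the property is missing for Kaggle, the sketch below uses made-up file names and assumes a repository where each object carries a list of file names. It does not use `get_file_extensions`, whose signature is not shown in this walkthrough.

```python
from pathlib import Path

import pandas as pd

# Made-up file listings, one list per object
files_per_object = pd.Series(
    [["a.csv", "b.csv", "readme.md"], ["data.json"], ["x.csv", "y.json"]],
    index=["obj-1", "obj-2", "obj-3"],
)

# Extract the lowercase extension of every file name
extensions = files_per_object.apply(
    lambda names: [Path(name).suffix.lower() for name in names]
)

# Question 19: objects containing each extension (each object counted once per extension)
objects_per_extension = (
    extensions.apply(lambda exts: sorted(set(exts))).explode().value_counts()
)

# Question 19.5: total number of files of each extension
files_per_extension = extensions.explode().value_counts()
```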


## 20. How many objects contain each type of File format given?
**Property:** File format

In [89]:
file_formats = df.KaggleDatasetsCrosswalk.file_format
file_formats

In [90]:
#confirm missing for repo
print(df.KaggleDatasetsCrosswalk.file_format)

None


## 21. How many objects contain each type of Media type given?
**Property:** Media type

In [91]:
media_types = df.KaggleDatasetsCrosswalk.media_type
media_types

In [92]:
#confirm missing for repo
print(df.KaggleDatasetsCrosswalk.media_type)

None


## 22. a) How many objects report one related resource type, and b) how many objects report each of those types? c) How many objects report multiple related resource types (regardless of which types)?
**Property:** Related resource type

In [93]:
related_resource_types = df.KaggleDatasetsCrosswalk.related_resource_type
related_resource_types

In [94]:
#confirm missing for repo
print(df.KaggleDatasetsCrosswalk.related_resource_type)

None
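For a repository that does report this property, the three parts of the question could be answered roughly as follows. This is a sketch on made-up data that assumes each object holds a list of type labels, and it reads part b) as counting types among objects that report exactly one type.

```python
import pandas as pd

# Made-up related resource types, one list per object
related_types = pd.Series([["paper"], ["paper", "code"], [], ["code"], ["paper"]])

# Number of distinct types reported by each object
n_types = related_types.apply(lambda types: len(set(types)))

# a) objects reporting exactly one related resource type
single_type_objects = int((n_types == 1).sum())

# b) among those objects, how many report each type
per_type_counts = related_types[n_types == 1].explode().value_counts()

# c) objects reporting more than one type, regardless of which
multi_type_objects = int((n_types > 1).sum())
```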



## 23-25. If an object has an entry in any of the three properties (Original data URL, Primary manuscript PID/URL, and Related resource identifier), count it as Related resources = True; then count the number of objects that are True.
**Property:** Related Resource Identifier

In [95]:
#confirm missing for this repo
print(df.KaggleDatasetsCrosswalk.original_data_url)

None


In [96]:
#confirm missing for this repo
print(df.KaggleDatasetsCrosswalk.primary_manuscript)

None


In [97]:
#confirm missing for this repo
print(df.KaggleDatasetsCrosswalk.related_resource_identifier)

None
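All three properties are missing for Kaggle, so the rule in the heading cannot be run here. For a repository that does provide them, it could be sketched as below. The column names mirror the crosswalk property names, the values are made up, and `has_entry` is a hypothetical helper.

```python
import pandas as pd

# Made-up values for the three properties, one row per object
toy = pd.DataFrame({
    "original_data_url": ["https://example.org/data", None, None, None],
    "primary_manuscript": [None, None, "doi:10.0000/example", None],
    "related_resource_identifier": [["id-1", "id-2"], None, ["id-3"], []],
})

def has_entry(cell):
    """Hypothetical helper: True when a cell holds a non-empty value."""
    if cell is None:
        return False
    if isinstance(cell, (list, tuple, set, str)):
        return len(cell) > 0
    return not pd.isna(cell)

# Related resources = True when any of the three properties has an entry
related_resources = toy.apply(lambda col: col.apply(has_entry)).any(axis=1)
n_with_related = int(related_resources.sum())
```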


## 23-25. Also, what is the mean number of related resource links per object (again looking at the three properties: Original data URL, Primary manuscript PID/URL, and Related resource identifier)?
**Property:** Related Resource Identifier

In [98]:
#none of these properties present in this repo (see above)
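For a repository where these properties exist, the mean number of links per object could be computed along these lines. This is a sketch on made-up data that assumes a single string counts as one link and a list counts as one link per element; `count_links` is a hypothetical helper.

```python
import pandas as pd

# Made-up values for the three properties, one row per object
toy = pd.DataFrame({
    "original_data_url": ["https://example.org/data", None, None, None],
    "primary_manuscript": [None, None, "doi:10.0000/example", None],
    "related_resource_identifier": [["id-1", "id-2"], None, ["id-3"], []],
})

def count_links(cell):
    """Hypothetical helper: number of links held in one cell."""
    if isinstance(cell, (list, tuple, set)):
        return len(cell)
    if isinstance(cell, str):
        return 1 if cell else 0
    return 0  # None, NaN, or any other non-link value

# Sum the links across the three properties, then average over all objects
links_per_object = toy.apply(lambda col: col.apply(count_links)).sum(axis=1)
mean_links = links_per_object.mean()
```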

## 26. How many objects report each relation type? How many objects report multiple relation types, regardless of what those types are?
**Property:** Related resource relation type

In [99]:
relation_type = df.KaggleDatasetsCrosswalk.related_resource_relation_type
relation_type

In [100]:
#confirm missing for repo
print(df.KaggleDatasetsCrosswalk.related_resource_relation_type)

None


## 27. For repositories that store the full citation in a designated field, how many objects have a populated citation? How many objects have a citation and a URL or other actionable link?
**Property:** Citation

In [101]:
citations = df.KaggleDatasetsCrosswalk.citation
citations

In [102]:
#confirm missing for repo
print(df.KaggleDatasetsCrosswalk.citation)

None
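Kaggle has no citation field, but for a repository that does, the two counts could be approximated as sketched below. The citations are made up, and the pattern for an "actionable link" (an `http(s)` URL or a `doi:` prefix) is an assumption that would need tuning for a real repository.

```python
import pandas as pd

# Made-up citation strings, including a missing and an empty entry
citations = pd.Series([
    "Doe, J. (2020). Example dataset. https://doi.org/10.0000/example",
    "Doe, J. (2019). Another dataset.",
    None,
    "",
])

text = citations.fillna("")

# A citation is populated when it contains something other than whitespace
populated = text.str.strip().ne("")

# Assumed definition of an actionable link: a URL or a doi: identifier
link_pattern = r"https?://\S+|doi:\s*10\.\d{4,9}/\S+"
has_link = text.str.contains(link_pattern, case=False, regex=True)

n_populated = int(populated.sum())
n_with_link = int((populated & has_link).sum())
```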
