In [1]:
import pandas as pd
import numpy as np
import config as cfg
import json

# Load py2neo
import py2neo
from py2neo import Graph
from py2neo.matching import *

# Interactive Plotting Libraries
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import plotly.graph_objects as go

# Plotting Widgets
import cufflinks as cf

# Throughput Github Analysis

This is a research project led by PhD Simon Goring.

Different research questions are tried to be answer such as: 

- How do individuals and organizations use GitHub (or other public code repositories) to reference, analyze or reuse data from Data Catalogs?

- Are there clear patterns of use across public repositories?

- Do patterns of use differ by data/disciplinary domain, or do properties of the data resource (presence of an API, online documentation, size of user community) affect patterns of use? 

- Does the data reuse observed here expand our understanding of current modes of data reuse, e.g. those outlined in https://datascience.codata.org/articles/10.5334/dsj-2017-008/ ?

- What are the characteristics and shape of the Earth Science research object network?
- What are major nodes of connectivity?
- What poorly connected islands exist? 
- What is the nature of data reuse in this network?
- What downstream/second order grant products can be identified from this network?

## Current Approach

Categorizing a subset of scraped repos, with pre-defined types, which may be updated iteratively as categorization progresses (education, analysis, archiving, informational).


Using ML techniques, we might be able to classify repos according to type automatically; and could consider classifying according to repository quality/completeness. Repository quality or completeness would be defined by:

- presence/absence/length of readme
- number of commits
- number of contributors

By using neo4j, we can construct and analyze the network graph in order to get:
- Centrality and level of connection
- Identification of small networks/islands within the network
- What databases are highly connected and which are not?
- Use database properties (has API, online search portal, has R/Python package, has user forum . . .)

## Objective of the Notebook

This Notebook is going to be used to created an initial Data Exploratory using Neo4j in order to later on, create a Recommendation System using of graph databases. 

In its initial stages, it might look rough, but this will be improved as it is updated and upgraded.

First, let's connect to Neo4j's graph.

There is a `config.py` script, imported as `cfg` that includes personal credentials to log into the database. A `config_sample.py` script has been included. There, change the words `username` and `password` accordingly to match your own credentials.

The port that neo4j automatically usees is 7687 when working in a local database.

In [2]:
# Connect to Graph
graph = Graph("bolt://localhost:7687", auth=(cfg.neo4j['auth']), bolt=True, password=cfg.neo4j['password'])

In [3]:
graph

Graph('bolt://neo4j@localhost:7687', name='neo4j')

In order to select nodes that are a certain kind, we use the command `match`. 

In order to run queries, you can do `graph.run()` and do the Querie inside quotes. Get data using the verb `.data()`

In [4]:
trial = graph.run("MATCH (n:AGENT) RETURN n LIMIT 10").data()
trial[1]

{'n': Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/keywordMgmt', name='Keyword synonymy')}

As seen above, the nature of nested dictionaries in lists, will definitely represent a challenge when trying to organized data and functions will be needed to make sure each observation's data is appropriately organized in the corresponding features. To convert a list that should be a dictionary use: `json.loads(list_that_should_be_dictionary)`

### Counting observations

In [5]:
graph.run('MATCH (crt:TYPE {type:"schema:CodeRepository"})\
           MATCH (crt)<-[:isType]-(ocr:OBJECT) \
           RETURN COUNT(DISTINCT ocr)').to_data_frame()

Unnamed: 0,COUNT(DISTINCT ocr)
0,73563


# EDA to get the right queries

In order to figure out how to create a ML model, we need to extract the correct data from the Throughput database.

We will analyze and graph the following:
- Distribution of references to DBs

- Note 'Earth Science' databases within graph
    - X = DBs; y = # of referenced repos
    - Linked repos (x) by commits (y)

- Note ES commits 
    - Linked repos (x) by # of contributors (y)
    - Linked repos (x) by # of forks (y)

## Getting DataCatalogs and Counts(CodeRepos)

In [6]:
counts = graph.run('''MATCH (k:KEYWORD {keyword: "earth science"})\
MATCH (k)<-[:hasKeyword]-(:ANNOTATION)-[:Body]->(dc:dataCat)\
MATCH (dc)<-[:Target]-(:ANNOTATION)-[:Target]->(cr:codeRepo)\
RETURN DISTINCT properties(dc), count(DISTINCT cr)''').data()

![](img/01_graph.png)

### Example on extracting data

In [7]:
# Extracting ID of Data Catalog
counts[1]['properties(dc)']['id']

'r3d100010867'

In [8]:
# Extracting number of CodeRepos linked to Data Catalog
counts[1]['count(DISTINCT cr)']

8

In [9]:
# Put DataCatalogs ID's and CodeRepo's counts together

helper_dict={'item': [],
            'counts':[]}

for i in range (0, len(counts)-1):
    helper_dict['item'].append(counts[i]['properties(dc)']['id'])
    helper_dict['counts'].append(counts[i]['count(DISTINCT cr)'])

counts_df = pd.DataFrame(helper_dict)
counts_df = counts_df.rename(columns={'item':'dacat', 'counts':'cr_counts'})

## Getting Other MetaData

In [10]:
data = graph.run('''MATCH (k:KEYWORD {keyword: "earth science"})\
MATCH (k)<-[:hasKeyword]-(a1:ANNOTATION)-[:Body]->(dc:dataCat)\
MATCH (dc)<-[:Target]-(a2:ANNOTATION)-[:Target]->(cr:codeRepo)\
RETURN distinct properties(dc), properties(cr)''').data()

In [12]:
dict1 = data[0]
dict1

{'properties(dc)': {'created': 1586832689251,
  'name': "Unidata's RAMADDA",
  'description': "Our mission is to provide the data services, tools, and cyberinfrastructure leadership that advance earth-system science, enhance educational opportunities, and broaden participation. Unidata's main RAMADDA server (hosted on Unidata's motherlode data server) contains access to a variety of datasets including the full IDD feed, Case Studies and other project data.",
  'id': 'r3d100010356',
  'url': 'http://motherlode.ucar.edu/repository'},
 'properties(cr)': {'created': 1588852640385,
  'meta': '{"id": 37471462, "repo": "ramadda", "owner": "donmurray", "name": "donmurray/ramadda", "url": "https://github.com/donmurray/ramadda", "created": "2015-06-15 (14:49:29.000000)", "description": null, "topics": [], "readme": {"readme": {"readme": true, "badges": 0, "headings": 0, "char": 3369}, "license": "Other"}, "commits": {"totalCommits": 5604, "range": ["2015-06-09 (01:08:44.000000)", "2015-06-13 (12

In [13]:
dict1['properties(dc)']['name']

"Unidata's RAMADDA"

In [None]:
dict1['properties(cr)']['meta']

In [None]:
string = dict1['properties(cr)']['meta'] # this is a string, from here, using find and REGEX, get commits 
string

In [None]:
response = json.loads(string) 
response['forks']

In [None]:
response['id']

## Metadata to DF

In [116]:
helper_dict = None
helper_dict = {'dacat': [],
               'dacat_name': [],
               'meta':[],
               'cr_item': [],
               'cr_name': [],
               'forks':[],
               'commits':[],
               'contributors':[]}

for i in range (0, len(data)-1):
    helper_dict['dacat'].append(data[i]['properties(dc)']['id'])
    helper_dict['dacat_name'].append(data[i]['properties(dc)']['name'])
    try:
        helper_dict['meta'].append(data[i]['properties(cr)']['meta'])
        json_data = json.loads(data[i]['properties(cr)']['meta'])
        helper_data = json_data['id']
        helper_data_name = json_data['name']
        
        
        # Forks
        forks = json_data['forks']
        helper_dict['cr_item'].append(helper_data)
        helper_dict['cr_name'].append(helper_data_name)
        helper_dict['forks'].append(forks)
        
        # Commits
        commits = json_data['commits']['totalCommits']
        helper_dict['commits'].append(commits)
        
        # Contributors 
        contributors = json_data['commits']['authors']
        helper_dict['contributors'].append(len(contributors))
        
    # Take care of empty spaces.    
    except KeyError:
        helper_dict['meta'].append("None2")
        helper_dict['cr_item'].append("Missing")
        helper_dict['cr_name'].append("Missing")
        helper_dict['forks'].append("Missing")
        helper_dict['commits'].append("Missing")
        helper_dict['contributors'].append("Missing")
        

meta_df = pd.DataFrame(helper_dict)
meta_df = meta_df[meta_df['meta'] != "None2"]
meta_df = meta_df[['dacat', 'dacat_name', 'cr_item', 'cr_name', 'forks', 'commits', 'contributors']]
meta_df = meta_df.astype({'cr_item':'str', 'forks': 'int64', 'commits': 'int64', 'contributors': 'int64'})
meta_df['cr_item'] = meta_df['cr_item']+'cr'
meta_df['dacat'] = meta_df['dacat']+'dc'

In [117]:
meta_df.head(2)

Unnamed: 0,dacat,dacat_name,cr_item,cr_name,forks,commits,contributors
0,r3d100010356dc,Unidata's RAMADDA,37471462cr,donmurray/ramadda,0,5604,1
1,r3d100010356dc,Unidata's RAMADDA,44131591cr,CINERGI/TextTeaserOnline,0,6,1


In [118]:
meta_df.describe(include = 'all')

Unnamed: 0,dacat,dacat_name,cr_item,cr_name,forks,commits,contributors
count,328,328,328,328,328.0,328.0,328.0
unique,13,13,319,319,,,
top,r3d100010134dc,PANGAEA,229084981cr,dataone-website-test/hugo-and-forestry,,,
freq,93,93,3,3,,,
mean,,,,,5.987805,1158.920732,5.463415
std,,,,,29.254757,4694.78514,28.739873
min,,,,,0.0,1.0,1.0
25%,,,,,0.0,12.0,1.0
50%,,,,,0.0,54.0,2.0
75%,,,,,2.0,251.5,3.0


### Plotting by Data Catalog or Code Repo

In [119]:
@interact(x=(0,500))
def show_dc_more_than(selection =['dacat','cr'], column=['forks', 'commits', 'contributors'], x = 1):
    meta_df
    if selection =='dacat':
        df = meta_df[['dacat_name', 'cr_item', 'forks', 'commits', 'contributors']]
        df = meta_df.groupby('dacat').agg({'dacat_name': 'max', 'cr_item' : 'count', 'forks' : 'sum', 'commits' : 'sum', 'contributors' : 'sum'}).reset_index()
        
    if selection =='cr':
        df = meta_df.groupby('cr_item').agg({'dacat_name': 'max', 'cr_name': 'max', 'forks' : 'sum', 'commits' : 'sum', 'contributors' : 'sum'}).reset_index()
        
        
    
    return df.loc[df[column] > x]

interactive(children=(Dropdown(description='selection', options=('dacat', 'cr'), value='dacat'), Dropdown(desc…

In [120]:
@interact
def histogram_plot(x = ['dacat_name', 'cr_name', 'cr_item', 'dacat'], 
                   y = list(meta_df.select_dtypes('int64').columns)[0:],
                   filt = widgets.IntSlider(min = 0, max = 100, step = 1, value = 0)):
       
    if x == 'dacat_name':
        grouped_df = meta_df.groupby('dacat_name').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]
        
    if x == 'cr_name':
        grouped_df = meta_df.groupby('cr_name').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]
        
    if x == 'dacat':
        grouped_df = meta_df.groupby('dacat').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]
    
    if x == 'cr_item':
        grouped_df = meta_df.groupby('cr_item').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]       
        
    # trace
    trace = [go.Bar(x=grouped_df[x], y=grouped_df[y])]

    # layout
    layout = go.Layout(
                title = 'Counts plot', # Graph title
                xaxis = dict(title = x.title()), # x-axis label
                yaxis = dict(title = y.title()), # y-axis label
                hovermode ='closest' # handles multiple points landing on the same vertical
    )

    # fig
    fig = go.Figure(trace, layout)
    fig.show()

interactive(children=(Dropdown(description='x', options=('dacat_name', 'cr_name', 'cr_item', 'dacat'), value='…

## Filtering Surprising Data Points for Analysis

## Extraodinary Repo over 400 Forks

In [36]:
meta_df[meta_df['forks']>300]

Unnamed: 0,dacat,dacat_name,cr_item,forks,commits,contributors
302,r3d100011758dc,Nasa's Data Portal,12745174cr,417,55881,504


## Data Catalog it belongs to

In [37]:
meta_df[(meta_df['dacat']=='r3d100011758dc') & (meta_df['forks']>50)]

Unnamed: 0,dacat,dacat_name,cr_item,forks,commits,contributors
302,r3d100011758dc,Nasa's Data Portal,12745174cr,417,55881,504
314,r3d100011758dc,Nasa's Data Portal,33125718cr,142,12237,3
328,r3d100011758dc,Nasa's Data Portal,90807748cr,79,112,2


# Analysis checking for Subject

In [38]:
subject_data = graph.run('''MATCH (s:SUBJECT)\
WHERE s.id IN [313, 314, 315, 317]\
MATCH (s)<-[:hasSubject]-(a:ANNOTATION)-[]->(dc:dataCat)\
MATCH (dc)<-[:Target]-(:ANNOTATION)-[:Target]->(cr:codeRepo)\
RETURN distinct properties(dc), properties(cr), s.id''').data()

In [121]:
def create_df():
    helper_dict = None
    subject_dict = {'313': 'Atmospheric Science, Oceanography and Climate Research',
                '314': 'Geology and Palaeontology',
                '315': 'Geophysics and Geodesy',
                '317': 'Geography'}
    
    helper_dict = {'dacat': [],
                   'dacat_name':[],
                   'meta':[],
                   'cr_item': [],
                   'cr_name': [],
                   'forks':[],
                   'commits':[],
                   'contributors':[],
                   'subject':[]}

    for i in range (0, len(data)-1):
        helper_dict['dacat'].append(subject_data[i]['properties(dc)']['id'])
        helper_dict['dacat_name'].append(data[i]['properties(dc)']['name'])
        helper_dict['subject'].append(subject_data[i]['s.id'])

        try:
            helper_dict['meta'].append(subject_data[i]['properties(cr)']['meta'])
            json_data = json.loads(subject_data[i]['properties(cr)']['meta'])
            helper_data = json_data['id']
            helper_data_name = json_data['name']

            # Forks
            forks = json_data['forks']
            helper_dict['cr_item'].append(helper_data)
            helper_dict['cr_name'].append(helper_data_name)
            helper_dict['forks'].append(forks)

            # Commits
            commits = json_data['commits']['totalCommits']
            helper_dict['commits'].append(commits)

            # Contributors 
            contributors = json_data['commits']['authors']
            helper_dict['contributors'].append(len(contributors))

        # Take care of empty spaces.    
        except KeyError:
            helper_dict['meta'].append("None2")
            helper_dict['cr_item'].append("Missing")
            helper_dict['cr_name'].append("Missing")
            helper_dict['forks'].append("Missing")
            helper_dict['commits'].append("Missing")
            helper_dict['contributors'].append("Missing")


    meta_df = pd.DataFrame(helper_dict)
    meta_df = meta_df[meta_df['meta'] != "None2"]
    meta_df = meta_df[['dacat', 'dacat_name', 'cr_item', 'cr_name', 'forks', 'commits', 'contributors', 'subject']]
    meta_df = meta_df.astype({'cr_item':'str', 'forks': 'int64', 'commits': 'int64', 'contributors': 'int64', 'subject':'str'})
    meta_df['cr_item'] = meta_df['cr_item']+'cr'
    meta_df['dacat'] = meta_df['dacat']+'dc'
    meta_df['subject_str'] = meta_df['subject'].map(subject_dict)
    return meta_df

In [122]:
meta_df_subject = create_df()

In [123]:
meta_df_subject

Unnamed: 0,dacat,dacat_name,cr_item,cr_name,forks,commits,contributors,subject,subject_str
15,r3d100011290dc,Comprehensive Large Array-data Stewardship System,260025976cr,rjmalka/Unit4_Springboard,0,2,1,317,Geography
18,r3d100011290dc,Comprehensive Large Array-data Stewardship System,257601085cr,lokesh-1998/London_HousingPrices,0,2,1,317,Geography
22,r3d100011290dc,Comprehensive Large Array-data Stewardship System,70050572cr,rauldiazpoblete/notes,0,154,2,317,Geography
25,r3d100011290dc,Comprehensive Large Array-data Stewardship System,227422282cr,jasdalucl2018/CASA_Assignment_2019,0,36,1,317,Geography
39,r3d100011290dc,Comprehensive Large Array-data Stewardship System,259462062cr,joew1234/Springboard-London-housing-case-study,0,2,1,317,Geography
...,...,...,...,...,...,...,...,...,...
332,r3d100012894dc,Nasa's Data Portal,193558412cr,earthcubearchitecture-project418/CDFSemanticNe...,0,19,1,315,Geophysics and Geodesy
333,r3d100012894dc,Nasa's Data Portal,193558412cr,earthcubearchitecture-project418/CDFSemanticNe...,0,19,1,314,Geology and Palaeontology
334,r3d100012894dc,Nasa's Data Portal,78788586cr,rdiersing1/FuzzyNameMatching,0,18,1,317,Geography
335,r3d100012894dc,Nasa's Data Portal,78788586cr,rdiersing1/FuzzyNameMatching,0,18,1,315,Geophysics and Geodesy


In [74]:
#meta_df_subject.describe(include='all')

In [126]:
@interact(x=(0,50))
def show_df(value = list(set(meta_df_subject['subject_str'])), 
            selection =['dacat_name', 'dacat','cr', 'cr_name'], 
            column=['forks', 'commits', 'contributors'], 
            x = 1):
   
    df = meta_df_subject[meta_df_subject['subject_str'] == value]
    
    df = pd.DataFrame(df, columns=['dacat', 'dacat_name', 'cr_item', 'cr_name', 'forks', 'commits', 'contributors', 'subject', 'subject_str'])
     
    if selection == 'dacat_name':
        df = df.groupby('dacat_name').agg({'dacat': 'max', 'cr_item' : 'count', 'forks' : 'sum', 'commits' : 'sum', 'contributors' : 'sum'}).reset_index()
        df2 = df[['dacat', 'dacat_name', 'cr_item', 'forks', 'commits', 'contributors']]
        
    if selection =='dacat':
        df = df.groupby('dacat').agg({'dacat_name': 'max', 'cr_item' : 'count', 'forks' : 'sum', 'commits' : 'sum', 'contributors' : 'sum'}).reset_index()
        df2 = df[['dacat', 'dacat_name', 'cr_item', 'forks', 'commits', 'contributors']]
        
    if selection =='cr':
        df = df.groupby('cr_item').agg({'dacat_name': 'max', 'cr_name': 'max', 'forks' : 'sum', 'commits' : 'sum', 'contributors' : 'sum'}).reset_index()
        df2 = df[['cr_item', 'dacat_name', 'forks', 'commits', 'contributors']] 
    
    if selection =='cr_name':
        df = df.groupby('cr_name').agg({'dacat_name': 'max', 'cr_item': 'max', 'forks' : 'sum', 'commits' : 'sum', 'contributors' : 'sum'}).reset_index()
        df2 = df[['cr_name', 'cr_item', 'dacat_name', 'forks', 'commits', 'contributors']] 
           
    return df2.loc[df2[column] >= x]

# Default sort on Number of Commits
# Add Catalog Name

interactive(children=(Dropdown(description='value', options=('Geophysics and Geodesy', 'Geology and Palaeontol…

In [127]:
@interact
def histogram_plot(x = ['dacat_name', 'cr_name'], 
                   y = list(meta_df_subject.select_dtypes('int64').columns)[0:],
                   subject_id = list(set(meta_df_subject['subject_str'])),
                   filt = widgets.IntSlider(min = 0, max = 50, step = 1, value = 0)):
    
    df = meta_df_subject[meta_df_subject['subject_str']==subject_id]
    
    if x == 'dacat_name':
        grouped_df = df.groupby('dacat_name').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]
    
    if x == 'cr_name':
        grouped_df = df.groupby('cr_name').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]
        
    
    # trace
    trace = [go.Bar(x=grouped_df[x], y=grouped_df[y])]

    # layout
    layout = go.Layout(
                title = 'Counts plot', # Graph title
                xaxis = dict(title = x.title()), # x-axis label
                yaxis = dict(title = y.title()), # y-axis label
                hovermode ='closest' # handles multiple points landing on the same vertical
    )

    # fig
    fig = go.Figure(trace, layout)
    fig.show()

interactive(children=(Dropdown(description='x', options=('dacat_name', 'cr_name'), value='dacat_name'), Dropdo…

![img2](img/subject_graph.png)

Green: Data Cat
Navy blue: subject
Pink: Code Repo
Ligh Blue: Annotation

## All Data Without Subjects

In [98]:
all_data = graph.run('''MATCH ()<-[:hasSubject]-(a:ANNOTATION)-[]->(dc:dataCat)\
MATCH (dc)<-[:Target]-(:ANNOTATION)-[:Target]->(cr:codeRepo)\
RETURN distinct properties(dc), properties(cr)''').data()

In [99]:
len(all_data)

57693

In [101]:
all_data[0]

{'properties(dc)': {'created': 1586832657305,
  'contact': 'https://earthquake.usgs.gov/contactus/',
  'name': 'National Earthquake Information Center',
  'description': "The mission of the National Earthquake Information Center (NEIC) is to determine rapidly the location and size of all destructive earthquakes worldwide and to immediately disseminate this information to concerned national and international agencies, scientists, and the general public. The NEIC compiles and maintains an extensive, global seismic database on earthquake parameters and their effects that serves as a solid foundation for basic and applied earth science research.\nThe NEIC maintained until 2012 the former 'World Data Center for Seismology'.",
  'id': 'r3d100010313',
  'url': 'https://earthquake.usgs.gov/contactus/golden/neic.php'},
 'properties(cr)': {'meta': '{"id": 247252173, "repo": "wikipedia.ko", "owner": "chinapedia", "name": "chinapedia/wikipedia.ko", "url": "https://github.com/chinapedia/wikipedia.k

In [128]:
helper_dict = None
helper_dict = {'dacat': [],
               'dacat_name':[],
               'meta':[],
               'cr_item': [],
               'cr_name': [],
               'forks':[],
               'commits':[],
               'contributors':[]}

for i in range (0, len(all_data)-1):
    helper_dict['dacat'].append(all_data[i]['properties(dc)']['id'])
    helper_dict['dacat_name'].append(all_data[i]['properties(dc)']['name'])
    
    try:
        helper_dict['meta'].append(all_data[i]['properties(cr)']['meta'])
        json_data = json.loads(all_data[i]['properties(cr)']['meta'])
        helper_data = json_data['id']
        helper_data_name = json_data['name']
        
        # Forks
        forks = json_data['forks']
        helper_dict['cr_item'].append(helper_data)
        helper_dict['cr_name'].append(helper_data_name)
        helper_dict['forks'].append(forks)
        
        # Commits
        commits = json_data['commits']['totalCommits']
        helper_dict['commits'].append(commits)
        
        # Contributors 
        contributors = json_data['commits']['authors']
        helper_dict['contributors'].append(len(contributors))
        
    # Take care of empty spaces.    
    except KeyError:
        helper_dict['meta'].append("None2")
        helper_dict['cr_item'].append("Missing")
        helper_dict['cr_name'].append("Missing")
        helper_dict['forks'].append("Missing")
        helper_dict['commits'].append("Missing")
        helper_dict['contributors'].append("Missing")
        

meta_df = pd.DataFrame(helper_dict)
meta_df = meta_df[meta_df['meta'] != "None2"]
meta_df = meta_df[['dacat', 'dacat_name', 'cr_item', 'cr_name', 'forks', 'commits', 'contributors']]
meta_df = meta_df.astype({'cr_item':'str', 'forks': 'int64', 'commits': 'int64', 'contributors': 'int64'})
meta_df['cr_item'] = meta_df['cr_item']+'cr'
meta_df['dacat'] = meta_df['dacat']+'dc'

In [129]:
meta_df

Unnamed: 0,dacat,dacat_name,cr_item,cr_name,forks,commits,contributors
0,r3d100010313dc,National Earthquake Information Center,247252173cr,chinapedia/wikipedia.ko,0,326,1
1,r3d100010313dc,National Earthquake Information Center,18522395cr,usgs/earthquake-website,54,3641,29
2,r3d100010313dc,National Earthquake Information Center,94204283cr,ttsteiger/ttsteiger.github.io,0,102,2
3,r3d100010313dc,National Earthquake Information Center,84948908cr,ttsteiger/Udacity_DAND,0,87,2
4,r3d100010313dc,National Earthquake Information Center,69878384cr,usgs/neic-locator,6,273,7
...,...,...,...,...,...,...,...
57675,r3d100010268dc,Incorporated Research Institutions for Seismology,67154115cr,jallen2/Research-Trend,6,42,3
57676,r3d100010268dc,Incorporated Research Institutions for Seismology,52891387cr,iris-edu/stationxml-validator,7,420,7
57677,r3d100010268dc,Incorporated Research Institutions for Seismology,56903665cr,iris-edu-int/ispaq,5,220,3
57683,r3d100010269dc,J. Craig Venter Institute,116506085cr,poldham/blog,3,71,1


In [130]:
@interact
def histogram_plot(x = ['dacat_name', 'cr_name', 'cr_item', 'dacat'], 
                   y = list(meta_df.select_dtypes('int64').columns)[0:], 
                   filt = widgets.IntSlider(min = 0, max = 3000, step = 1, value = 50)):
       
    if x == 'dacat_name':
        grouped_df = meta_df.groupby('dacat_name').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]
    
    if x == 'cr_name':
        grouped_df = meta_df.groupby('cr_name').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]     
        
    if x == 'dacat':
        grouped_df = meta_df.groupby('dacat').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]
    
    if x == 'cr_item':
        grouped_df = meta_df.groupby('cr_item').sum().reset_index()
        grouped_df = grouped_df[grouped_df[y] > filt]     
        
    # trace
    trace = [go.Bar(x=grouped_df[x], y=grouped_df[y])]

    # layout
    layout = go.Layout(
                title = 'Counts plot', # Graph title
                xaxis = dict(title = x.title()), # x-axis label
                yaxis = dict(title = y.title()), # y-axis label
                hovermode ='closest' # handles multiple points landing on the same vertical
    )

    # fig
    fig = go.Figure(trace, layout)
    fig.show()

interactive(children=(Dropdown(description='x', options=('dacat_name', 'cr_name', 'cr_item', 'dacat'), value='…