In [1]:
import pandas as pd
import numpy as np
import config as cfg

# Load py2neo
import py2neo
from py2neo import Graph
from py2neo.matching import *

# Throughput Github Analysis

This is a research project led by Simon Goring, PhD.

The project tries to answer research questions such as:

- How do individuals and organizations use GitHub (or other public code repositories) to reference, analyze or reuse data from Data Catalogs?

- Are there clear patterns of use across public repositories?

- Do patterns of use differ by data/disciplinary domain, or do properties of the data resource (presence of an API, online documentation, size of user community) affect patterns of use? 

- Does the data reuse observed here expand our understanding of current modes of data reuse, e.g. those outlined in https://datascience.codata.org/articles/10.5334/dsj-2017-008/ ?

- What are the characteristics and shape of the Earth Science research object network?
- What are major nodes of connectivity?
- What poorly connected islands exist? 
- What is the nature of data reuse in this network?
- What downstream/second order grant products can be identified from this network?

## Current Approach

Categorizing a subset of scraped repos, with pre-defined types, which may be updated iteratively as categorization progresses (education, analysis, archiving, informational).


Using ML techniques, we might be able to classify repos by type automatically, and we could also consider classifying them by repository quality or completeness. Repository quality or completeness would be defined by:

- presence/absence/length of readme
- number of commits
- number of contributors
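
As a minimal sketch of how the three signals above could be turned into classifier inputs, the hypothetical helper below builds a feature dictionary for one repository. The function name, argument names, and feature names are assumptions for illustration and are not part of the Throughput codebase.

```python
def completeness_features(readme_text, n_commits, n_contributors):
    """Build simple completeness features for one repository.

    All names here are hypothetical; the inputs are assumed to have been
    scraped already (README text may be None when no README exists).
    """
    readme_text = readme_text or ""
    return {
        "has_readme": len(readme_text) > 0,   # presence/absence of README
        "readme_length": len(readme_text),    # length in characters
        "n_commits": n_commits,
        "n_contributors": n_contributors,
    }


# Example: one repo with a short README and one with none
with_readme = completeness_features("# My analysis\nUses Neotoma data.", 45, 3)
without_readme = completeness_features(None, 2, 1)
```

Features like these could later be combined with the manually assigned repo types to train a model.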

By using neo4j, we can construct and analyze the network graph in order to get:
- Centrality and level of connection
- Identification of small networks/islands within the network
- What databases are highly connected and which are not?
- Use database properties (has API, online search portal, has R/Python package, has user forum . . .)
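
To make the first two items concrete, here is a small pure-Python sketch of degree centrality and island detection on a toy edge list. The node names are made up, and in practice these measures would more likely be computed inside Neo4j or with a graph library; this only illustrates what the quantities mean.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neigh) / (n - 1) for node, neigh in adjacency.items()}


def connected_components(adjacency):
    """Return the list of connected components (islands) as sets of nodes."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Toy network: hypothetical databases linked to hypothetical repos
edges = [("db_A", "repo_1"), ("db_A", "repo_2"),
         ("db_B", "repo_2"), ("db_C", "repo_3")]
adjacency = build_adjacency(edges)
centrality = degree_centrality(adjacency)
islands = sorted(connected_components(adjacency), key=len)
```

Here `islands[0]` is the smallest component, which is the kind of poorly connected island the research questions ask about.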

## Objective of the Notebook

This notebook performs an initial exploratory data analysis with Neo4j, with the goal of later building a recommendation system on top of the graph database.

In its initial stages it might look rough, but it will improve as it is updated.

First, let's connect to Neo4j's graph.

A `config.py` script, imported as `cfg`, holds the personal credentials used to log into the database. A `config_sample.py` script is included as a template: replace the words `username` and `password` in it with your own credentials.

When working with a local database, neo4j uses port 7687 by default.
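
For orientation, a `config.py` compatible with the connection cell might look like the sketch below. The keys `auth` and `password` are the ones the notebook reads from `cfg.neo4j`; the shape of the `auth` value (a user/password tuple) is an assumption, so check it against the real `config_sample.py`.

```python
# config.py (hypothetical layout; adapt to the real config_sample.py)
neo4j = {
    # Assumed to be a (user, password) tuple, as accepted by py2neo's `auth`
    "auth": ("username", "password"),
    # Read separately by the notebook as cfg.neo4j['password']
    "password": "password",
}
```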

In [2]:
# Connect to Graph
graph = Graph("bolt://localhost:7687", auth=(cfg.neo4j['auth']), bolt=True, password=cfg.neo4j['password'])

In [4]:
graph

Graph('bolt://neo4j@localhost:7687', name='neo4j')

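To select nodes with a given label, use the `match` method of a `NodeMatcher`. The `nodes` matcher used here is created in cell `In [3]`, which appears further below because the cells are shown out of execution order.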

In [6]:
# Match all nodes
nodes.match("AGENT").all()

[Node('AGENT', homepage='https://github.com/throughput-ec/github_scrapers', name='Code scrapers'),
 Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/keywordMgmt', name='Keyword synonymy'),
 Node('AGENT', created=1586830369606, id='0000-0002-2700-4605', name='Simon Goring'),
 Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb', name='Database addition'),
 Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/Re3Databases', name='Keyword addition'),
 Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/ropensci_libraries', name='Keyword addition'),
 Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/ropensci_libraries', name='ROpensci Code Addition'),
 Node('AGENT', email='', name='Schlesinger, S.'),
 Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb', name='GitHub code linker'),
 Node('AGENT', email='chemap@emory.edu', name='Padwa, Albert'),
 Node('AGENT', email='boyerjs@misso

In [7]:
# Match the first node.
nodes.match("AGENT").first()

Node('AGENT', homepage='https://github.com/throughput-ec/github_scrapers', name='Code scrapers')

In [3]:
nodes = NodeMatcher(graph)

To run a query, pass the Cypher string to `graph.run()`. The result can be converted to a DataFrame with `to_data_frame()`, but I would not recommend it, because most of the information stays nested inside a single column.

In [9]:
my_list = graph.run("MATCH (n:AGENT) RETURN n LIMIT 25").to_data_frame()
my_list.head(3)

Unnamed: 0,n
0,"{'name': 'Code scrapers', 'homepage': 'https:/..."
1,"{'name': 'Keyword synonymy', 'homepage': 'http..."
2,"{'name': 'Simon Goring', 'id': '0000-0002-2700..."


Instead, call `data()` and unnest the resulting list of dictionaries.

In [12]:
trial = graph.run("MATCH (n:AGENT) RETURN n LIMIT 10").data()
trial

[{'n': Node('AGENT', homepage='https://github.com/throughput-ec/github_scrapers', name='Code scrapers')},
 {'n': Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/keywordMgmt', name='Keyword synonymy')},
 {'n': Node('AGENT', created=1586830369606, id='0000-0002-2700-4605', name='Simon Goring')},
 {'n': Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb', name='Database addition')},
 {'n': Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/Re3Databases', name='Keyword addition')},
 {'n': Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/ropensci_libraries', name='Keyword addition')},
 {'n': Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb/ropensci_libraries', name='ROpensci Code Addition')},
 {'n': Node('AGENT', email='', name='Schlesinger, S.')},
 {'n': Node('AGENT', homepage='https://github.com/throughput-ec/throughputdb', name='GitHub code linker')},
 {'n': Node('AGENT', email='chemap@e

In [14]:
trial[0]

{'n': Node('AGENT', homepage='https://github.com/throughput-ec/github_scrapers', name='Code scrapers')}

In [24]:
trial[9]

{'n': Node('AGENT', email='chemap@emory.edu', name='Padwa, Albert')}

In [17]:
print(type(trial[0]), type(trial))

<class 'dict'> <class 'list'>


In [21]:
pd.DataFrame.transpose(pd.DataFrame.from_dict(trial[0]))

Unnamed: 0,homepage,name
n,https://github.com/throughput-ec/github_scrapers,Code scrapers


In [28]:
df1=pd.DataFrame.transpose(pd.DataFrame.from_dict(trial[0]))
df2=pd.DataFrame.transpose(pd.DataFrame.from_dict(trial[1]))
pd.concat([df1, df2])

Unnamed: 0,homepage,name
n,https://github.com/throughput-ec/github_scrapers,Code scrapers
n,https://github.com/throughput-ec/throughputdb/...,Keyword synonymy


In [91]:
auxiliary_list = list()

for i in range(0, len(trial)):
    df = pd.DataFrame.transpose(pd.DataFrame.from_dict(trial[i]))
    y = df.values.tolist()
    auxiliary_list.append(y)
    
    
df = pd.DataFrame (auxiliary_list,columns=['Column_Name'])
df_expanded = pd.DataFrame(df['Column_Name'].values.tolist(), index=df.index, columns=['url', 'homepage', 'name'])
df_expanded

Unnamed: 0,url,homepage,name
0,https://github.com/throughput-ec/github_scrapers,Code scrapers,
1,https://github.com/throughput-ec/throughputdb/...,Keyword synonymy,
2,1586830369606,0000-0002-2700-4605,Simon Goring
3,https://github.com/throughput-ec/throughputdb,Database addition,
4,https://github.com/throughput-ec/throughputdb/...,Keyword addition,
5,https://github.com/throughput-ec/throughputdb/...,Keyword addition,
6,https://github.com/throughput-ec/throughputdb/...,ROpensci Code Addition,
7,,"Schlesinger, S.",
8,https://github.com/throughput-ec/throughputdb,GitHub code linker,
9,chemap@emory.edu,"Padwa, Albert",


As seen above, the columns are misaligned: each node carries a different set of properties, so values land under the wrong headers. Nested dictionaries inside lists will be a challenge when organizing the data, and functions will be needed to make sure each observation's values end up in the corresponding features.
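
One possible way to handle this, sketched below, is to align every record on the union of property names before building a table. The helper `flatten_records` is hypothetical, and plain dictionaries stand in for py2neo `Node` objects; the sketch assumes that `dict(node)` yields a node's properties, which should be checked against the installed py2neo version.

```python
def flatten_records(records, key="n"):
    """Align records from `.data()` on the union of their property names.

    Returns (rows, columns): one dict per record with every column present,
    using None where a node lacks that property.
    """
    raw = [dict(record[key]) for record in records]
    columns = sorted({name for row in raw for name in row})
    rows = [{col: row.get(col) for col in columns} for row in raw]
    return rows, columns


# Plain dicts stand in for py2neo Node objects (values copied from the output above)
sample = [
    {"n": {"homepage": "https://github.com/throughput-ec/github_scrapers",
           "name": "Code scrapers"}},
    {"n": {"created": 1586830369606, "id": "0000-0002-2700-4605",
           "name": "Simon Goring"}},
    {"n": {"email": "chemap@emory.edu", "name": "Padwa, Albert"}},
]
rows, columns = flatten_records(sample)
```

Passing `rows` to `pd.DataFrame` would then give one column per property, with missing values where a node has no such property.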

### Useful queries for counting observations

In [92]:
graph.run('MATCH (crt:TYPE {type:"schema:CodeRepository"})\
           MATCH (crt)<-[:isType]-(ocr:OBJECT) \
           RETURN COUNT(DISTINCT ocr)').to_data_frame()

Unnamed: 0,COUNT(DISTINCT ocr)
0,73563


In [None]:
graph.run('MATCH (ocr:codeRepo)\
           MATCH (odb:dataCat)\
           WITH SPLIT(ocr.name, "/")[0] AS owner, COUNT(DISTINCT odb.name) AS n, COLLECT(DISTINCT odb.name) AS resources \
           WHERE n > 3 \
           RETURN owner, n, resources\
           ORDER BY n DESC, resources[0]')

# EDA to get the right queries

To figure out how to build an ML model, we first need to extract the right data from the Throughput database.

We will analyze and graph the following:
- Distribution of references to DBs

- Note 'Earth Science' databases within graph
    - X = DBs; y = # of referenced repos
    - Linked repos (x) by commits (y)

- Note ES commits 
    - Linked repos (x) by # of contributors (y)
    - Linked repos (x) by # of forks (y)

In [None]:
graph.run('MATCH p=()-[r:Referenced_by]->() RETURN p')

**Challenges:** Complex queries that showcase relationships are hard to bring...

In [None]:
graphs_with_keywords = pd.DataFrame(graph.run('MATCH p=()-[r:hasKeyword]->() RETURN p').data())#.to_data_frame()
#graphs_with_keywords.head(3)

In [None]:
keywords_data = graph.run('MATCH p=()-[r:hasKeyword]->() RETURN p').data()

In [None]:
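# graphs_with_keywords is already a pandas DataFrame, which has no to_data_frame();
# rebuild it from the list of records returned by .data() instead
graphs_with_keywords = pd.DataFrame(keywords_data)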

In [None]:
keywords_df = pd.DataFrame(graphs_with_keywords['p'].values.tolist(), index=graphs_with_keywords.index)
keywords_df.head(15)

In [None]:
# TODO
# Graphing function that takes data, x and y

### Queries January 12

In [None]:
data = graph.run('''MATCH (n:OBJECT)-[:isType]-(:TYPE {type:"schema:DataCatalog"}) 
MATCH (k:KEYWORD {keyword: "earth"}) 
MATCH (o:OBJECT)-[:isType]-(:TYPE {type:"schema:CodeRepository"}) 
WITH n, o 
MATCH p=(n)-[]-(:ANNOTATION)-[:Target]-(o) 
RETURN p 
''')

In [None]:
data.data()

In [22]:
import json

# Load the exported query records; 'utf-8-sig' strips a leading BOM if present
with open('records.json', 'r', encoding='utf-8-sig') as f:
    file = json.load(f)

In [27]:
file

[{'p': {'start': {'identity': 40690,
    'labels': ['OBJECT', 'dataCat'],
    'properties': {'created': 1586832553472,
     'contact': 'https://www.arabidopsis.org/contact/index.jsp',
     'name': 'The Arabidopsis Information Resource',
     'description': 'The Arabidopsis Information Resource (TAIR) maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana . Data available from TAIR includes the complete genome sequence along with gene structure, gene product information, metabolism, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications, and information about the Arabidopsis research community. Gene product function data is updated every two weeks from the latest published research literature and community data submissions. Gene structures are updated 1-2 times per year using computational and manual methods as well as community submissions of new and updated genes. TAIR also provides extensive li

In [50]:
for i in range(0,10):
    print(file[i]['p']['start']['identity'])
    print(file[i]['p']['end']['labels'])

40690
['OBJECT', 'codeRepo']
30119
['OBJECT', 'codeRepo']
32999
['OBJECT', 'codeRepo']
10357
['OBJECT', 'codeRepo']
46130
['OBJECT', 'codeRepo']
20714
['OBJECT', 'codeRepo']
28363
['OBJECT', 'codeRepo']
13639
['OBJECT', 'codeRepo']
3260
['OBJECT', 'codeRepo']
30923
['OBJECT', 'codeRepo']
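
Building on the loop above, the sketch below counts how many distinct code repositories are linked to each data catalog, which is the "DBs by number of referenced repos" distribution listed in the EDA plan. It assumes each record follows the structure shown in `file` (`start` and `end` entries with `identity`, `labels`, and `properties`). That `end` carries an `identity` and that `properties` always has a `name` are assumptions, and the sample repository identities are invented.

```python
from collections import defaultdict


def repos_per_catalog(records):
    """Count distinct linked code repos for each data catalog name."""
    linked = defaultdict(set)
    for record in records:
        start = record["p"]["start"]
        end = record["p"]["end"]
        # Keep only paths running from a data catalog to a code repository
        if "dataCat" in start["labels"] and "codeRepo" in end["labels"]:
            linked[start["properties"]["name"]].add(end["identity"])
    return {name: len(repos) for name, repos in linked.items()}


# Toy records mimicking the exported JSON; repo identities are made up
sample = [
    {"p": {"start": {"identity": 40690, "labels": ["OBJECT", "dataCat"],
                     "properties": {"name": "The Arabidopsis Information Resource"}},
           "end": {"identity": 1, "labels": ["OBJECT", "codeRepo"],
                   "properties": {}}}},
    {"p": {"start": {"identity": 40690, "labels": ["OBJECT", "dataCat"],
                     "properties": {"name": "The Arabidopsis Information Resource"}},
           "end": {"identity": 2, "labels": ["OBJECT", "codeRepo"],
                   "properties": {}}}},
    {"p": {"start": {"identity": 40690, "labels": ["OBJECT", "dataCat"],
                     "properties": {"name": "The Arabidopsis Information Resource"}},
           "end": {"identity": 2, "labels": ["OBJECT", "codeRepo"],
                   "properties": {}}}},
]
counts = repos_per_catalog(sample)
```

Using a set per catalog means a repository reached through several paths is counted only once.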
