# Querying the GitHub API for repositories and organizations
By Stuart Geiger and Jamie Whitacre, made at a SciPy 2016 sprint

In [1]:
!pip install pygithub
!pip install geopy
!pip install ipywidgets
!pip install ipyleaflet



In [2]:
from github import Github

In [3]:
# these are my private login credentials, stored in ghlogin.py
import ghlogin


In [4]:
g = Github(login_or_token=ghlogin.gh_user, password=ghlogin.gh_passwd)

With this Github object, you can get all kinds of GitHub objects, which you can then further explore.

In [5]:
user = g.get_user("staeiou")


In [6]:
print(user.name)
print(user.created_at)
print(user.location)

Stuart Geiger
2013-06-14 00:25:39
Berkeley, CA


In [7]:
repo = g.get_repo("jupyter/notebook")

In [8]:
print(repo.name)
print(repo.description)
print(repo.organization)
print(repo.organization.name)
print(repo.language)


notebook
Jupyter Interactive Notebook
<github.Organization.Organization object at 0x7f5889b06cf8>
Project Jupyter
JavaScript


In [9]:
print(repo.get_commits())

<github.PaginatedList.PaginatedList object at 0x7f589fe2c630>


In [10]:
commits = repo.get_commits()
commit = commits[0]
print(commit.author.name)
print(commit.commit.message)
print(commit.stats.additions)
print(commit.stats.deletions)

Matthias Bussonnier
Merge pull request #1603 from minrk/rm-docker-readme

remove outdated/false docker info from README
0
28


## Organizations

Organizations are objects too, which have similar properties:

In [11]:
org = g.get_organization("jupyter")

In [12]:
print(org.name)
print(org.created_at)
print(org.html_url)

Project Jupyter
2014-04-23 21:36:43
https://github.com/jupyter


The API has a get_public_members() function, but it just shows those who are on the "people" board on the [organization's page](https://github.com/jupyter). You can also see that if someone doesn't have a field set, it returns None. Some people just have usernames set without full names.

In [13]:
for member in org.get_public_members():
    print(member.name, member.url)

Matthias Bussonnier https://api.github.com/users/Carreau
JamieW https://api.github.com/users/JamiesHQ
Corey Stubbs https://api.github.com/users/Lull3rSkat3r
Sylvain Corlay https://api.github.com/users/SylvainCorlay
Afshin Darian https://api.github.com/users/afshin
Steven Silvester https://api.github.com/users/blink1073
Safia Abdalla https://api.github.com/users/captainsafia
Dave Willmer https://api.github.com/users/dwillmer
Fernando Perez https://api.github.com/users/fperez
Paul Ivanov https://api.github.com/users/ivanov
None https://api.github.com/users/jakirkham
Jason Grout https://api.github.com/users/jasongrout
Jonathan Frederic https://api.github.com/users/jdfreder
Jessica B. Hamrick https://api.github.com/users/jhamrick
Min RK https://api.github.com/users/minrk
Peter Parente https://api.github.com/users/parente
Mike https://api.github.com/users/poplav
Kyle Kelley https://api.github.com/users/rgbkrk
Sumit Sahrawat https://api.github.com/users/sumitsahrawat
Thomas Kluyver https://a

We can go through all the repositories in the organization with the get_repos() function. It returns a paginated list of repository objects, which have their own properties and methods.

In [14]:
for repo in org.get_repos():
    print(repo.name, repo.description)

nbviewer Nbconvert as a webservice (rendering ipynb to static HTML)
nbconvert-examples Examples that illustrate how nbconvert can be used
colaboratory Jupyter CoLaboratory
jupyter.github.io Jupyter Website
design Design related materials for Project Jupyter
nbcache Notebook Caching layer in Docker
nbgrader A system for assigning and grading notebooks
tmpnb Creates temporary Jupyter Notebook servers using Docker containers.
nature-demo Materials for the November 2014 Nature Article
jupyter-drive Google drive for jupyter notebooks
tmpnb-redirector Simple HTTP redirector for tmpnb nodes
tmpnb-deploy Deploying tmpnb nodes
docker-demo-images Demo images for use in try.jupyter.org and tmpnb.org
try.jupyter.org Try Jupyter!
strata-sv-2015-tutorial Strata Silicon Valley 2015 Tutorial
testpath Test utilities for Python code working with files and commands
scipy-2015-advanced-topics Advanced topics in Jupyter tutorial for SciPy 2015.
jupyter_core Core Jupyter functionality
nbformat Reference imp

## Rate limiting

Now that we have made a few requests, we can see what our rate limit is. If you are logged in, you get 5,000 requests per hour. If you are not, you only get 60 per hour. You can use properties of the Github object to see your remaining requests, limit, and reset time. We have used fewer than 50 of our 5,000 requests with these calls.

In [15]:
g.rate_limiting

(4969, 5000)

In [16]:
reset_time = g.rate_limiting_resettime
reset_time

1468970425

This value is in seconds since the UTC epoch (Jan 1st, 1970), so we have to convert it. Here is a quick function that takes a GitHub object, queries the API to find our next reset time, and converts it to minutes.

In [17]:
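import datetime
def minutes_to_reset(github):
    # Query the API for the next reset time (seconds since the Unix epoch)
    reset_time = github.rate_limiting_resettime
    timedelta_to_reset = datetime.datetime.fromtimestamp(reset_time) - datetime.datetime.now()
    # total_seconds() covers the whole interval; .seconds ignores any whole days
    return timedelta_to_reset.total_seconds() / 60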
    

In [18]:
minutes_to_reset(g)

59.7
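
As a minimal sketch of the same conversion with time-zone-aware datetimes: the helper below works in UTC throughout and accepts an optional `now` argument so it can be checked against a fixed clock. The name `minutes_until` and the `now` parameter are illustrative assumptions and are not part of PyGithub.

```python
import datetime

def minutes_until(reset_epoch, now=None):
    """Return minutes from `now` until `reset_epoch` (seconds since the Unix epoch, UTC)."""
    utc = datetime.timezone.utc
    if now is None:
        now = datetime.datetime.now(utc)
    reset = datetime.datetime.fromtimestamp(reset_epoch, tz=utc)
    # total_seconds() includes whole days, unlike the .seconds attribute
    return (reset - now).total_seconds() / 60

# Fixed clock set 30 minutes before the reset time shown earlier
fake_now = datetime.datetime.fromtimestamp(1468970425 - 1800, tz=datetime.timezone.utc)
print(minutes_until(1468970425, fake_now))
```

In a notebook, this could be called as `minutes_until(g.rate_limiting_resettime)`.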

## Getting location data for an organization's contributors
### Mapping and geolocation

Before we get into how to query GitHub, we know we will have to get location coordinates for each contributor and then plot them on a map. So we are going to do that first.

For geolocation, we are using geopy's geolocator object, which is based on OpenStreetMap's Nominatim API. Nominatim takes in any arbitrary location data and then returns a location object, which includes the best latitude and longitude coordinates it can find.

This does mean that we will have more error than if we did this manually, and there might be vastly different levels of accuracy. For example, if someone just has "UK" as their location, it will show up in the geographic center of the UK, which is somewhere on the edge of the Lake District. "USA" resolves to somewhere in Kansas. However, you can get very specific location data if you put in more detail.

In [19]:
from geopy.geocoders import Nominatim

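# Recent geopy versions require a custom user_agent; this string is a placeholder
geolocator = Nominatim(user_agent="github-org-mapping-notebook")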
uk_loc = geolocator.geocode("UK")
print(uk_loc.longitude,uk_loc.latitude)

us_loc = geolocator.geocode("USA")
print(us_loc.longitude,us_loc.latitude)

bids_loc = geolocator.geocode("Doe Library, Berkeley CA, 94720 USA")
print(bids_loc.longitude,bids_loc.latitude)

-3.2765752 54.7023545
-100.4458824 39.7837304
-122.259492086406 37.87219435
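
To get a rough sense of how much error a country-level location can introduce, here is a small sketch that computes the great-circle distance between two of the points above using the haversine formula. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this illustration; geopy has its own distance utilities, which are not used here.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Coordinates copied from the output above, reordered as (latitude, longitude)
usa = (39.7837304, -100.4458824)
doe_library = (37.87219435, -122.259492086406)

# How far a Berkeley contributor would be plotted from home if their profile only said "USA"
print(round(haversine_km(*usa, *doe_library)), "km")
```

Note that the printed output above is in (longitude, latitude) order, while both this function and the map markers expect latitude first.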


We can plot points on a map using ipyleaflet and ipywidgets. We first set up a map object, which is created with various parameters. Then we create Marker objects, which are then appended to the map. We then display the map inline in this notebook.

In [20]:
import ipywidgets

from ipyleaflet import (
    Map,
    Marker,
    TileLayer, ImageOverlay,
    Polyline, Polygon, Rectangle, Circle, CircleMarker,
    GeoJSON,
    DrawControl
)

center = [30.0, 5.0]
zoom = 2
m = Map(default_tiles=TileLayer(opacity=1.0), center=center, zoom=zoom, layout=ipywidgets.Layout(height="600px"))

uk_mark = Marker(location=[uk_loc.latitude,uk_loc.longitude])
uk_mark.visible
m += uk_mark

us_mark = Marker(location=[us_loc.latitude,us_loc.longitude])
us_mark.visible
m += us_mark

bids_mark = Marker(location=[bids_loc.latitude,bids_loc.longitude])
bids_mark.visible
m += bids_mark

### Querying GitHub for location data

For our mapping script, we want to get profiles for everyone who has made a commit to any of the repositories in the Jupyter organization, find their location (if any), then add it to a list. The API has a get_contributors function for repo objects, which returns a list of contributors ordered by number of commits, but not one that works across all repos in an org. So we have to iterate through all the repos in the org and run the get_contributors method for each one. We also want to make sure we don't add any duplicates to our list, which would over-represent some areas, so we keep track of people in a dictionary.

I've written a few functions to make it easy to retrieve and map an organization's contributors.

In [43]:
def get_org_contributor_locations(github, org_name):
    """
    For a GitHub organization, get location for contributors to any repo in the org.
    
    Returns a dictionary of {username URLS : geopy Locations}, then a dictionary of various metadata.
    
    """
    
    # Set up empty dictionaries and metadata variables
    contributor_locs = {}
    locations = []
    none_count = 0
    error_count = 0
    user_loc_count = 0
    duplicate_count = 0
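    # Recent geopy versions require a custom user_agent; this string is a placeholder
    geolocator = Nominatim(user_agent="github-org-mapping-notebook")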

    
    # For each repo in the organization
    for repo in github.get_organization(org_name).get_repos():
        #print(repo.name)
        
        # For each contributor in the repo        
        for contributor in repo.get_contributors():
            print('.', end="")
            # If the contributor_locs dictionary doesn't have an entry for this user
            if contributor_locs.get(contributor.url) is None:
                
                # Try-Except block to handle API errors
                try:
                    # If the contributor has no location in profile
                    if(contributor.location is None):
                        #print("No Location")
                        none_count += 1
                    else:
                        # Get coordinates for location string from Nominatim API
                        location=geolocator.geocode(contributor.location)

                        #print(contributor.location, " | ", location)
                        
                        # Add a new entry to the dictionary. Key is user's URL, value is geocoded location object
                        contributor_locs[contributor.url] = location
                        user_loc_count += 1
                except Exception:
                    print('!', end="")
                    error_count += 1
            else:
                duplicate_count += 1
                
    return contributor_locs,{'no_loc_count':none_count, 'user_loc_count':user_loc_count, 
                             'duplicate_count':duplicate_count, 'error_count':error_count}
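
The de-duplication logic can be looked at in isolation with plain Python data. One thing to notice in the function above is that only contributors with a geocoded location are stored in the dictionary, so a contributor without a location who appears in several repos is counted as "no location" each time. The sketch below is one possible variant, not the function used in this notebook: it remembers every user URL it has seen, so each person is counted once. The name `collect_unique` and the sample data are made up for illustration.

```python
def collect_unique(repos):
    """
    repos: {repo_name: [(user_url, location_string_or_None), ...]}
    Returns ({user_url: location_string}, counts), counting each user once.
    """
    seen = {}  # every user URL encountered so far, with or without a location
    counts = {"no_loc_count": 0, "user_loc_count": 0, "duplicate_count": 0}
    for contributors in repos.values():
        for url, location in contributors:
            if url in seen:
                counts["duplicate_count"] += 1
                continue
            seen[url] = location
            if location is None:
                counts["no_loc_count"] += 1
            else:
                counts["user_loc_count"] += 1
    # Only users with a location are returned, as in the original function
    with_location = {url: loc for url, loc in seen.items() if loc is not None}
    return with_location, counts

# Made-up sample data: two repos that share two contributors
sample = {
    "repo-a": [("users/u1", "Berkeley, CA"), ("users/u2", None)],
    "repo-b": [("users/u1", "Berkeley, CA"), ("users/u2", None), ("users/u3", "UK")],
}
locs, counts = collect_unique(sample)
print(counts)
```

Counting this way would produce different metadata totals from the ones shown in this notebook, so the two should not be compared directly.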


With this, we can easily query an organization. The U.S. Digital Service (org name: usds) is a small org that works well for testing. It takes about a second per contributor to get this data, so we want to test on small orgs.

In [42]:
usds_locs, usds_metadata = get_org_contributor_locations(g,'usds')

...............................

In [44]:
usds_metadata

{'duplicate_count': 1,
 'error_count': 0,
 'no_loc_count': 8,
 'user_loc_count': 22}
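
Since each lookup takes about a second and many contributors share the same location string, one way to reduce the number of geocoding requests is to cache results by string. This is a hedged sketch rather than part of the notebook's code: `make_cached_geocoder` is a hypothetical helper that wraps any function taking a location string, and the fake geocoder below stands in for `geolocator.geocode` so the example runs without network access.

```python
def make_cached_geocoder(geocode_fn):
    """Wrap a geocoding function so each distinct location string is looked up once."""
    cache = {}

    def cached(location_string):
        # Normalize lightly so "Washington, DC" and " washington, dc " share an entry
        key = location_string.strip().lower()
        if key not in cache:
            cache[key] = geocode_fn(location_string)
        return cache[key]

    return cached

# Stand-in for a real geocoder: records each call and returns a dummy coordinate pair
calls = []
def fake_geocode(location_string):
    calls.append(location_string)
    return (0.0, 0.0)

cached_geocode = make_cached_geocoder(fake_geocode)
for loc in ["Washington, DC", " washington, dc ", "Oakland", "Washington, DC"]:
    cached_geocode(loc)
print(len(calls), "lookups for 4 contributors")
```

With geopy, the wrapped function would be `make_cached_geocoder(geolocator.geocode)`, under the assumption that identical strings should always map to the same result.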

We are going to explore this dataset, but not plot names or usernames. I'm a bit hesitant to publish location data with unique identifiers, even if people put that information in their profiles. 

In [45]:
usds_locs_nousernames = []
for contributor, location in usds_locs.items():
    usds_locs_nousernames.append(location)
usds_locs_nousernames

[Location(Portland, Multnomah County, Oregon, United States of America, (45.5202471, -122.6741948, 0.0)),
 Location(東京都, 日本, (34.2255804, 139.294774527387, 0.0)),
 Location(D,C, Buccaneer Ridge Drive, Johnson City, Washington County, Tennessee, 37614, United States of America, (36.29885175, -82.3591932141095, 0.0)),
 Location(Washington, District of Columbia, United States of America, (38.8949549, -77.0366455, 0.0)),
 Location(Oakland, Alameda County, California, United States of America, (37.8044557, -122.2713562, 0.0)),
 Location(Washington, District of Columbia, United States of America, (38.8949549, -77.0366455, 0.0)),
 Location(Dayton, Montgomery County, Ohio, United States of America, (39.7589478, -84.1916068, 0.0)),
 Location(Washington, District of Columbia, United States of America, (38.8949549, -77.0366455, 0.0)),
 Location(SF, California, United States of America, (37.7792808, -122.4192362, 0.0)),
 Location(Milwaukee, Milwaukee County, Wisconsin, United States of America, (4

Now we can map this data using another function I have written.

In [46]:
def map_location_dict(map_obj,org_location_dict):
    """
    Maps the locations in a dictionary of {ids : geoPy Locations}. 
    
    Must be passed a map object, then the dictionary. Returns the map object.
    
    """
    for username, location in org_location_dict.items():
        if(location is not None):
            mark = Marker(location=[location.latitude,location.longitude])
            mark.visible
            map_obj += mark
            

    return map_obj

In [48]:
center = [30.0,5.0]
zoom = 2
usds_map = Map(default_tiles=TileLayer(opacity=1.0), center=center, zoom=zoom, layout=ipywidgets.Layout(height="600px"))

usds_map = map_location_dict(usds_map, usds_locs)

In [49]:
usds_map

## Mapping multiple organizations
Sometimes you have multiple organizations within a group of interest. Because these are functions, they can be combined with some loops.

In [38]:
jupyter_orgs = ['jupyter', 'ipython', 'jupyter-attic','jupyterhub']


In [29]:
orgs_location_dict = {}
orgs_metadata_dict = {}
for org in jupyter_orgs:
    # For a status update, print when we get to a new org in the list
    print(org)
    orgs_location_dict[org], orgs_metadata_dict[org] = get_org_contributor_locations(g,org)

jupyter
ipython
jupyter-attic
jupyterhub


In [30]:
orgs_metadata_dict

{'ipython': {'duplicate_count': 185,
  'error_count': 1,
  'no_loc_count': 307,
  'user_loc_count': 314},
 'jupyter': {'duplicate_count': 322,
  'error_count': 0,
  'no_loc_count': 273,
  'user_loc_count': 322},
 'jupyter-attic': {'duplicate_count': 33,
  'error_count': 0,
  'no_loc_count': 39,
  'user_loc_count': 29},
 'jupyterhub': {'duplicate_count': 35,
  'error_count': 0,
  'no_loc_count': 27,
  'user_loc_count': 46}}

### Plotting the map

In [31]:
center = [30, 5]
zoom = 2
jupyter_orgs_maps = Map(default_tiles=TileLayer(opacity=1.0), center=center, zoom=zoom, 
                        layout=ipywidgets.Layout(height="600px"))

for org_name,org_location_dict in orgs_location_dict.items():
    # map_location_dict adds the markers and returns the same map object
    jupyter_orgs_maps = map_location_dict(jupyter_orgs_maps,org_location_dict)

In [32]:
jupyter_orgs_maps

### Saving to file

In [33]:
def org_dict_to_csv(org_location_dict, filename, hashed_usernames = True):
    """
    Outputs a dict of users : locations to a CSV file. 
    
    Requires org_location_dict and filename, optional hashed_usernames parameter.
    
    Uses hashes of usernames by default for privacy reasons. Think carefully 
    about publishing location data about uniquely identifiable users. Hashing
    allows you to check unique users without revealing personal information.
    """
    try:
        import hashlib
        with open(filename, 'w') as f:
            f.write("user, longitude, latitude\n")
            for user, location in org_location_dict.items():
                if location is not None:
                    if hashed_usernames:
                        user_output = hashlib.sha1(user.encode('utf-8')).hexdigest()
                    else:
                        user_output = user
                    line = user_output + ", " + str(location.longitude) + ", " \
                           + str(location.latitude) + "\n"
                    f.write(line)
    except Exception as e:
        return e
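
A note on the hashing step: because GitHub profile URLs are public, a plain hash can in principle be matched by hashing candidate URLs and comparing the results. One possible way to make that harder, sketched here as an assumption and not as what this notebook does, is a keyed hash (HMAC) with a secret that is never published. The function name `pseudonymize`, the key, and the example URL are all hypothetical.

```python
import hashlib
import hmac

def pseudonymize(user, secret_key):
    """Keyed SHA-256 digest of a user identifier; stable for a given key."""
    return hmac.new(secret_key, user.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key; it must stay private and out of the published data
key = b"replace-with-a-private-random-key"
first = pseudonymize("https://api.github.com/users/example-user", key)
second = pseudonymize("https://api.github.com/users/example-user", key)

# The same user and key always give the same pseudonym, so unique users can still be counted
print(first == second)
```

The trade-off is that anyone who wants to compare pseudonyms across files needs the same key.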

In [34]:
org_dict_to_csv(orgs_location_dict['ipython'], "org_data/ipython.csv")

In [35]:
for org_name, org_location_dict in orgs_location_dict.items():
    org_dict_to_csv(org_location_dict, "org_data/" + org_name + ".csv")
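
Files in this format can be read back with the standard `csv` module. The sketch below is a partial take on that idea under stated assumptions: it returns plain `(latitude, longitude)` tuples and does not rebuild geopy Location objects. `read_org_csv` is a hypothetical name, and it takes an open file object so it can be tried on an in-memory sample.

```python
import csv
import io

def read_org_csv(f):
    """Read rows of 'user, longitude, latitude' into {user: (latitude, longitude)}."""
    # skipinitialspace drops the space written after each comma, including in the header
    reader = csv.DictReader(f, skipinitialspace=True)
    return {row["user"]: (float(row["latitude"]), float(row["longitude"])) for row in reader}

# In-memory sample in the same layout as the files written above (values are made up)
sample_file = io.StringIO("user, longitude, latitude\nabc123, -122.2594, 37.8722\n")
coords = read_org_csv(sample_file)
print(coords)
```

For a file on disk, this could be used as `with open("org_data/ipython.csv") as f: coords = read_org_csv(f)`.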

In [36]:
def csv_to_org_dict(filename):
    
    
    """
    TODO: Write function to read an outputted CSV file back to an org_dict.
    Should convert lon/lat pairs to geopy Location objects for full compatibility.
    
    Also, think about a general class object for org_dicts. 
    """
    

            

Note that this will have duplicates across the organizations, as it is just getting the location data from each of the organizations and putting it into a different dictionary.
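
If each person should appear only once across all organizations, the per-organization dictionaries can be merged on their keys. This is a minimal sketch under the assumption that the keys are the same user URLs in every dictionary; `merge_org_dicts` is a hypothetical helper, and the sample values are strings standing in for geopy Location objects.

```python
def merge_org_dicts(orgs_location_dict):
    """Combine {org: {user: location}} into one {user: location}, keeping the first entry per user."""
    merged = {}
    for org_name, location_dict in orgs_location_dict.items():
        for user, location in location_dict.items():
            # setdefault leaves an existing entry untouched, so each user is kept once
            merged.setdefault(user, location)
    return merged

# Made-up sample: one contributor appears in both organizations
sample_orgs = {
    "org-a": {"users/u1": "Berkeley, CA", "users/u2": "UK"},
    "org-b": {"users/u1": "Berkeley, CA", "users/u3": "Oakland"},
}
merged_sample = merge_org_dicts(sample_orgs)
print(len(merged_sample), "unique contributors")
```

The merged dictionary has the same shape as a single organization's dictionary, so it could be passed to `map_location_dict` or `org_dict_to_csv` unchanged.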