In [1]:
name = "2020-09-10-github-scrape"
title = "Scraping GitHub after a hackweek"
tags = "requests, github, webscrape, pandas, dataviz, text processing"
author = "Callum Rollo"

In [2]:
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML

html = connect_notebook_to_post(name, title, tags, author)

### A meta-hackweek hack

I put this notebook together after attending the excellent [Ocean Hack Week 2020](https://oceanhackweek.github.io/) (OHW) event. You can read my blog post about it [here](https://callumrollo.github.io/hackweek.html#hackweek). A key part of the event was creating collaborative projects on GitHub. I wanted to see if attending the hackweek changed participants' pattern of activity on GitHub. I thought the simplest way would be to get info on all the commits by OHW participants to OHW projects.

It goes without saying that **commits =/= work done on a project**. I freely admit that many of my commits are nonsense! This is just a fun side project for me to explore the github API and try some simple analysis methods.

First off, to access the [Github API](https://docs.github.com/) you'll need to edit the credentials file `credentials.json` to supply your username and a [Github access token](https://github.blog/2013-05-16-personal-api-tokens/).

```
{
	"username": "<your-username>",
	"token": "<your-access-token>"
}
```

Once you have supplied these creds, you are still limited to [5000 requests per hour](https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting), so if you get bounced by the API, leave it some time to cool off. Without credentials the limit is far lower and you would soon generate an error message like this:

![rate limit reached](../figures/rate-limit-anon.png)

Obviously I have not provided my own credentials! I'm not sure what action github would take if a DDOS on their API originated from my account, but I'm not willing to find out.
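
If you would rather not guess how long to "cool off", the API reports the state of your quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. Below is a minimal sketch of a helper that turns those headers into a wait time. The function name `seconds_until_reset` is my own invention for illustration, not part of `requests` or the GitHub API.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before retrying, based on rate-limit headers.

    Returns 0 if requests remain in the current window or the headers are missing.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # epoch seconds
    if remaining > 0:
        return 0
    return max(0, reset_at - int(now))


# Usage with a real response (not run here):
# response = requests.get(url, auth=(username, token))
# wait = seconds_until_reset(response.headers)
# if wait:
#     time.sleep(wait)
```

Passing `now` explicitly is only there to make the helper easy to check without touching the network.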

### Let's get scraping!

In [3]:
import json
import requests
from collections import Counter
import pandas as pd
import numpy as np

In [6]:
with open('credentials-secret.json') as f:  # don't forget to add your creds here!
    credentials = json.load(f)

username = credentials['username']
token = credentials['token']

For a start, let's use the API to get some details on my account

In [10]:
user_data = requests.get('https://api.github.com/users/' + credentials['username'],auth = (username,token)).json()
user_data

{'login': 'callumrollo',
 'id': 28703282,
 'node_id': 'MDQ6VXNlcjI4NzAzMjgy',
 'avatar_url': 'https://avatars0.githubusercontent.com/u/28703282?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/callumrollo',
 'html_url': 'https://github.com/callumrollo',
 'followers_url': 'https://api.github.com/users/callumrollo/followers',
 'following_url': 'https://api.github.com/users/callumrollo/following{/other_user}',
 'gists_url': 'https://api.github.com/users/callumrollo/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/callumrollo/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/callumrollo/subscriptions',
 'organizations_url': 'https://api.github.com/users/callumrollo/orgs',
 'repos_url': 'https://api.github.com/users/callumrollo/repos',
 'events_url': 'https://api.github.com/users/callumrollo/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/callumrollo/received_events',
 'type': 'User',
 'site_admin': False,
 'n

This returns a lot of info. Now, I'll try the account of one of my collaborators `ocefpaf`. You can specify any user, though the information returned is less than when you look at your own account.

In [11]:
data = requests.get('https://api.github.com/users/' + 'ocefpaf',auth = (username,token)).json()
data

{'login': 'ocefpaf',
 'id': 950575,
 'node_id': 'MDQ6VXNlcjk1MDU3NQ==',
 'avatar_url': 'https://avatars1.githubusercontent.com/u/950575?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/ocefpaf',
 'html_url': 'https://github.com/ocefpaf',
 'followers_url': 'https://api.github.com/users/ocefpaf/followers',
 'following_url': 'https://api.github.com/users/ocefpaf/following{/other_user}',
 'gists_url': 'https://api.github.com/users/ocefpaf/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/ocefpaf/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/ocefpaf/subscriptions',
 'organizations_url': 'https://api.github.com/users/ocefpaf/orgs',
 'repos_url': 'https://api.github.com/users/ocefpaf/repos',
 'events_url': 'https://api.github.com/users/ocefpaf/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/ocefpaf/received_events',
 'type': 'User',
 'site_admin': False,
 'name': 'Filipe',
 'company': None,
 'blog': 'http://o

We can see a user's core stats. How about their commits and other actions taken? Simply append `/events` to the request query

### Events

In [12]:
data = requests.get('https://api.github.com/users/' + 'callumrollo' +'/events',auth = (username,token)).json()
data[0]

{'id': '13873691257',
 'type': 'PushEvent',
 'actor': {'id': 28703282,
  'login': 'callumrollo',
  'display_login': 'callumrollo',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/callumrollo',
  'avatar_url': 'https://avatars.githubusercontent.com/u/28703282?'},
 'repo': {'id': 48320187,
  'name': 'ueapy/ueapy.github.io',
  'url': 'https://api.github.com/repos/ueapy/ueapy.github.io'},
 'payload': {'push_id': 5868625550,
  'size': 1,
  'distinct_size': 1,
  'ref': 'refs/heads/master',
  'head': '0b70258f3668dbd25fe8e9681b9324af62e3066f',
  'before': 'f9c3596cef28b24cfd4d7c27ba91a1c038c943b6',
  'commits': [{'sha': '0b70258f3668dbd25fe8e9681b9324af62e3066f',
    'author': {'email': 'c.rollo@outlook.com', 'name': 'Callum Rollo'},
    'message': 'Generate Pelican site',
    'distinct': True,
    'url': 'https://api.github.com/repos/ueapy/ueapy.github.io/commits/0b70258f3668dbd25fe8e9681b9324af62e3066f'}]},
 'public': True,
 'created_at': '2020-10-16T17:42:00Z',
 'org': {'id': 1

Note that, by default, the Github API returns only the 30 most recent events per request

In [13]:
len(data)

30

To get at more events, we use a short loop to access subsequent pages of results. I found out the hard way that the API restricts you to 10 pages.

In [16]:
tgt_user = 'callumrollo'
base_url = 'https://api.github.com/users/' + tgt_user +'/events'
url = base_url
url_list = [base_url]
data = []
page_no = 1
repos_data = []
total_fetched = 0
while (True):
    response = requests.get(url,auth = (username,token)).json()
    data = data + response
    events_fetched = len(response)
    total_fetched = events_fetched + total_fetched
    print(f"Page: {page_no} total events fetched: {total_fetched}")
    
    if total_fetched == 300:
        print(f"\nAPI maxed out! https://docs.github.com/v3/#pagination\n\
        returning only most recent 300 events by {tgt_user}")
        print(f"\nevents span the range \n{data[-1]['created_at']}\n{data[0]['created_at']}")
        break
    
    if (events_fetched == 30):
        page_no = page_no + 1
        url = base_url + '?page=' + str(page_no)
        url_list.append(url)
    else:
        print(f"\n{tgt_user}: all your events are belong to us now")
        print(f"\nevents span the range \n{data[-1]['created_at']}\n{data[0]['created_at']}")
        break

Page: 1 total events fetched: 30
Page: 2 total events fetched: 60
Page: 3 total events fetched: 90
Page: 4 total events fetched: 120
Page: 5 total events fetched: 150
Page: 6 total events fetched: 180
Page: 7 total events fetched: 210
Page: 8 total events fetched: 240
Page: 9 total events fetched: 270
Page: 10 total events fetched: 299

callumrollo: all your events are belong to us now

events span the range 
2020-07-22T15:01:25Z
2020-10-16T17:42:00Z
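
The loop above builds the page URLs by hand and stops on a hard-coded count. An alternative sketch is to follow the `next` link that the API sends back with each page, which `requests` exposes as `response.links`. The function `fetch_all_events` and its `get` argument are hypothetical names of mine; `get` is any callable that takes a URL and returns a requests-style response.

```python
def fetch_all_events(first_url, get, max_pages=10):
    """Collect events from every page, following the 'next' link until it runs out.

    `get` takes a URL and returns an object with .json() and .links,
    e.g. lambda url: requests.get(url, auth=(username, token)).
    """
    events = []
    url = first_url
    for _ in range(max_pages):
        response = get(url)
        page = response.json()
        if not isinstance(page, list):
            # The API returned an error message (a dict) rather than a list of events
            raise RuntimeError(f"Unexpected response from {url}: {page}")
        events.extend(page)
        next_link = response.links.get("next")
        if next_link is None:
            break  # no more pages
        url = next_link["url"]
    return events


# Usage with the real API (not run here). Asking for 100 events per page
# should cut the number of requests needed:
# data = fetch_all_events(base_url + '?per_page=100',
#                         lambda url: requests.get(url, auth=(username, token)))
```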


This system logs all events: commits, issues, PRs, forks, stars etc. We are only interested in commits.

These are referred to as `PushEvent` in the json entry `type`

In [17]:
for event in data[-10:]:
    print(event["type"])


CreateEvent
CreateEvent
WatchEvent
WatchEvent
WatchEvent
CreateEvent
PushEvent
PushEvent
PushEvent
PushEvent


In [18]:
commit_events = []
for event in data:
    if event["type"] == "PushEvent":
        commit_events.append(event)
len(commit_events)

186
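
The same filtering can be written more compactly, and `Counter` gives a quick overview of which event types are present. This is a small sketch on made-up data; `sample_events` stands in for the `data` list fetched above.

```python
from collections import Counter

# Stand-in for the `data` list fetched above; only the "type" field matters here
sample_events = [
    {"type": "PushEvent"},
    {"type": "WatchEvent"},
    {"type": "PushEvent"},
    {"type": "CreateEvent"},
]

# How many events of each type do we have?
type_counts = Counter(event["type"] for event in sample_events)

# Keep only the pushes, equivalent to the loop above
push_events = [event for event in sample_events if event["type"] == "PushEvent"]

print(type_counts.most_common())
print(len(push_events))
```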

There are some complications. Not all of the commits in these events are by the Github user we are querying. For instance, some are commits by other users that our target user has merged in.

To work around this, we look through the payload of each `PushEvent` and retain only the commits associated with the user we are interested in.

**n.b.** this approach will only work if the name on the user's Github profile exactly matches the name the author uses for their git commits. We apply a workaround for when this is not the case later

In [19]:
tgt_username = requests.get('https://api.github.com/users/' + credentials['username'],
                            auth = (username,token)).json()["name"]
tgt_username

'Callum Rollo'

In [20]:
user_commits = []
for event in commit_events:
    commit_list = event["payload"]["commits"]
    commit_list_author = []
    if len(commit_list)>0:
        for com_n in range(len(commit_list)):
            commit_username = commit_list[com_n]["author"]["name"]
            #print(commit_username)
            if commit_username == tgt_username:
                commit_list_author.append(commit_list[com_n])
        if commit_list_author:
            event["payload"]["commits"] = commit_list_author
            user_commits.append(event)

In [21]:
len(user_commits)

182
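
One way to soften the exact-match requirement is to accept several names for the same person and compare them case-insensitively. The sketch below does that and, unlike the cell above, leaves the original events unmodified. The function name `commits_by_author` and the sample data are assumptions for illustration only.

```python
def commits_by_author(push_events, author_names):
    """Return (event, commit) pairs whose git author name is in author_names."""
    wanted = {name.casefold() for name in author_names}
    matches = []
    for event in push_events:
        for commit in event["payload"].get("commits", []):
            if commit["author"]["name"].casefold() in wanted:
                matches.append((event, commit))
    return matches


# Made-up events with the same shape as the API response
sample_push_events = [
    {"id": "1", "payload": {"commits": [
        {"sha": "aaa", "author": {"name": "Callum Rollo"}},
        {"sha": "bbb", "author": {"name": "Someone Else"}},
    ]}},
    {"id": "2", "payload": {"commits": [
        {"sha": "ccc", "author": {"name": "callumrollo"}},
    ]}},
]
matches = commits_by_author(sample_push_events, ["Callum Rollo", "callumrollo"])
print([commit["sha"] for _, commit in matches])
```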

We can print out the name associated with the commits we have selected to confirm

In [22]:
commit_names = []
for event in user_commits:
    commit_list = event["payload"]["commits"]
    for com_n in range(len(commit_list)):
            commit_username = commit_list[com_n]["author"]["name"]
            commit_names.append(commit_username)
print("git usernames and number of commits:")
Counter(commit_names).most_common()

git usernames and number of commits:


[('Callum Rollo', 232)]

In [23]:
print(f"From {len(data)} events we have extracted {len(commit_names)} commits by {tgt_username}")

From 299 events we have extracted 232 commits by Callum Rollo


The final step of this (almost certainly imperfect) data cleaning is to get info on all the commits by this user. We will pull the author, message, SHA, url, repo and date into a pandas dataframe. 

In [30]:
rows = []
for event in user_commits:
    for commit in event["payload"]["commits"]:
        rows.append({"id": event["id"],
                     "datetime": event["created_at"],
                     "sha": commit["sha"],
                     "message": commit["message"],
                     "author": commit["author"]["name"],
                     "url": commit["url"],
                     "repo": event["repo"]["name"]})
# DataFrame.append was removed in pandas 2.0, so collect the rows first and build the frame once
df = pd.DataFrame(rows)

We index by datetime and have a look at our dataframe

In [31]:
df.index = pd.DatetimeIndex(df.datetime)
df = df.drop("datetime", axis=1)
df.head()

datetime,author,id,message,repo,sha,url
2020-10-16 17:42:00+00:00,Callum Rollo,13873691257,Generate Pelican site,ueapy/ueapy.github.io,0b70258f3668dbd25fe8e9681b9324af62e3066f,https://api.github.com/repos/ueapy/ueapy.githu...
2020-10-16 17:41:23+00:00,Callum Rollo,13873684842,Added notebook on webscraping,ueapy/ueapy.github.io,3364fd1e786612c16b4e999cbb832e9584818804,https://api.github.com/repos/ueapy/ueapy.githu...
2020-10-16 16:47:44+00:00,Callum Rollo,13873114207,added beam miss figure and subplot labelling,callumrollo/adcp-glider,ad7aeeac2527b92cb5a98aa8cfc134bf355571e6,https://api.github.com/repos/callumrollo/adcp-...
2020-10-16 16:43:06+00:00,Callum Rollo,13873061051,Generate Pelican site,callumrollo/callumrollo.github.io,f5e76561f2e5886db6a413e079209d039f0bac74,https://api.github.com/repos/callumrollo/callu...
2020-10-16 16:42:30+00:00,Callum Rollo,13873054646,added MATS poster,callumrollo/callumrollo.github.io,30a6802786ba5b479e9cf09594e51e3e5ebcef94,https://api.github.com/repos/callumrollo/callu...


We can see an uncharacteristically helpful set of commit messages, and the meta event of a commit I made to this very notebook

Now we remove any repeated commits that may have snuck in by deduplicating on the SHA checksum

**side note** the SHA checksum uniquely identifies each commit. Even if you had commits by the same author to the same repo with the same message ("added stuff" or something similarly helpful) the SHA will differentiate the two. See more [here](https://www.lifewire.com/what-is-sha-1-2626011)

In [32]:
df = df.drop_duplicates(subset=['sha'])
df.head()

datetime,author,id,message,repo,sha,url
2020-10-16 17:42:00+00:00,Callum Rollo,13873691257,Generate Pelican site,ueapy/ueapy.github.io,0b70258f3668dbd25fe8e9681b9324af62e3066f,https://api.github.com/repos/ueapy/ueapy.githu...
2020-10-16 17:41:23+00:00,Callum Rollo,13873684842,Added notebook on webscraping,ueapy/ueapy.github.io,3364fd1e786612c16b4e999cbb832e9584818804,https://api.github.com/repos/ueapy/ueapy.githu...
2020-10-16 16:47:44+00:00,Callum Rollo,13873114207,added beam miss figure and subplot labelling,callumrollo/adcp-glider,ad7aeeac2527b92cb5a98aa8cfc134bf355571e6,https://api.github.com/repos/callumrollo/adcp-...
2020-10-16 16:43:06+00:00,Callum Rollo,13873061051,Generate Pelican site,callumrollo/callumrollo.github.io,f5e76561f2e5886db6a413e079209d039f0bac74,https://api.github.com/repos/callumrollo/callu...
2020-10-16 16:42:30+00:00,Callum Rollo,13873054646,added MATS poster,callumrollo/callumrollo.github.io,30a6802786ba5b479e9cf09594e51e3e5ebcef94,https://api.github.com/repos/callumrollo/callu...


------------------------
### Scaling it up: work from a list of Github usernames

Now that we have a method for finding commits by a user, the next step is to loop through a list of users. As a test case, I have analysed the commits from [Oceanhackweek2020](https://oceanhackweek.github.io/)

In [33]:
def gh_scrape(tgt_users, cred_file = 'credentials.json', verbose=True):
    """Simple scraping function
    Supply a list of github usernames ['jane-doe', 'torvalds', 'satoshi_nakamoto']
    Returns a dataframe of unique commits by these users over the last 90 days
    Assumes that the github user is the user with most git commits associated with their github profile
    Rate limited by the Github API to 300 events
    Requires you to supply a Github API token in a credentials.json file
    Verbose switch prints a line for each user with the number of events and commits found
    Returns a pandas dataframe of commit info for all usernames in supplied list
    """
    # Get user supplied credentials for the github API
    credentials = json.loads(open(cred_file).read())
    username = credentials['username']
    token = credentials['token']
    df = pd.DataFrame()
    
    for tgt_user in tgt_users:
        base_url = 'https://api.github.com/users/' + tgt_user +'/events'
        url = base_url
        url_list = [base_url]
        data = []
        page_no = 1
        repos_data = []
        total_fetched = 0
        while (True):
            response = requests.get(url,auth = (username,token)).json()
            if type(response)==dict:
                # Catch when the API returns a dict rather than expected list. Usually a credentials error message
                print(response)
                return
            data = data + response
            events_fetched = len(response)
            total_fetched = events_fetched + total_fetched
            if total_fetched == 300:
                # Requesting more will max out the API
                break
            if (events_fetched == 30):
                # if we fetched 30 events from this page, there will be another one after it
                page_no = page_no + 1
                url = base_url + '?page=' + str(page_no)
                url_list.append(url)
            else:
                # We have collected all events by this user
                break
        commits_events = []
        for event in data:
            # We're only interested in commits, which are classed as "PushEvents"
            if event["type"] == "PushEvent":
                commits_events.append(event)
        if len(commits_events)==0:
            # If the user has no commit events, stop processing
            continue
            
        commit_usernames_list = []
        for event in commits_events:
            # Search through the payload for which git user is associated with each commit
            commit_list = event["payload"]["commits"]
            if len(commit_list)>0:    
                for com_n in range(len(commit_list)):
                        commit_username = commit_list[com_n]["author"]["name"]       
                        commit_usernames_list.append(commit_username)
        # Working on the assumption that the git user with the most commits pushed to Github by this user is the one we want
        c = Counter(commit_usernames_list)
        most_common_username = c.most_common(1)[0][0]
        user_commits = []
        
        # Go back through the commits and pull only the ones by the most common git username
        for event in commits_events:
            commit_list = event["payload"]["commits"]
            commit_list_author = []
            if len(commit_list)>0:    
                for com_n in range(len(commit_list)):
                    commit_username = commit_list[com_n]["author"]["name"]
                    if commit_username == most_common_username:
                        commit_list_author.append(commit_list[com_n])
                if commit_list_author:
                    event["payload"]["commits"] = commit_list_author
                    user_commits.append(event)
        
        # Extract the information we're interested in and put it in a pandas DataFrame
        rows = []
        for commit in user_commits:
            for commit_detail in commit["payload"]["commits"]:
                rows.append({"id": commit["id"],
                             "datetime": commit["created_at"],
                             "sha": commit_detail["sha"],
                             "message": commit_detail["message"],
                             "author": commit_detail["author"]["name"],
                             "url": commit_detail["url"],
                             "repo": commit["repo"]["name"]})
        # DataFrame.append was removed in pandas 2.0; concatenate this user's rows instead
        df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
        df.index = pd.DatetimeIndex(df.datetime)
        df = df.drop_duplicates(subset=['sha'])
        if verbose:
            print(f"{tgt_user}: found {len(data)} events containing {len(user_commits)} unique commits by {most_common_username}\n")
    return df


In [34]:
users = ["callumrollo"]
df = gh_scrape(users, cred_file = 'credentials-secret.json')
len(df)

callumrollo: found 299 events containing 182 unique commits by Callum Rollo



210

Quite a few commits. What if we want solely the hackweek ones?

In [35]:
df['ohw20_repo'] = df['repo'].str.contains("ohw20")
sum(df['ohw20_repo'] )

23
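
With the hackweek flag in place, one simple way to look at the pattern of activity over time is to count commits per week for hackweek and non-hackweek repos. This is a sketch on synthetic data with the same layout as `df` (a UTC `DatetimeIndex` plus a `repo` column). The repo names are invented.

```python
import pandas as pd

# Synthetic stand-in for the scraped dataframe
sample = pd.DataFrame(
    {"repo": ["oceanhackweek/ohw20-proj-a", "someone/other",
              "oceanhackweek/ohw20-proj-a", "someone/other"]},
    index=pd.DatetimeIndex(
        ["2020-08-10T09:00:00Z", "2020-08-11T15:30:00Z",
         "2020-08-13T20:00:00Z", "2020-09-02T11:00:00Z"],
        name="datetime",
    ),
).sort_index()
sample["ohw20_repo"] = sample["repo"].str.contains("ohw20")

# Number of commits in each calendar week, for each group of repos
weekly_ohw = sample[sample["ohw20_repo"]].resample("W").size()
weekly_other = sample[~sample["ohw20_repo"]].resample("W").size()

print(weekly_ohw)
print(weekly_other)
```

Weeks with no commits show up as zeros, which makes gaps in activity easy to spot.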

### OHW analysis

Using the above function and a list of hackweek participants (not included) I grab github commits from the last 90 days

In [36]:
import csv
from itertools import chain
with open('ohw_participants.csv', newline='') as f:
    nest_list = list(csv.reader(f))
ohw_participants_list = list(chain.from_iterable(nest_list))
df_all = gh_scrape(ohw_participants_list, cred_file='credentials-secret.json', verbose=False)

In [37]:
len(df_all)

1356

As you saw when I grabbed the data just from my username, it contains a lot of identifying information. I have anonymised and saved this data for a later notebook of analysis
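
I won't reproduce the anonymisation step here, but for illustration, one possible approach is to replace names with a salted hash and drop free-text columns. The sketch below is an assumption about how this could be done, not the method used for the saved data. The `pseudonymise` helper, the salt and the sample rows are all made up.

```python
import hashlib

import pandas as pd


def pseudonymise(value, salt):
    """Map a string to a short, stable pseudonym using a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


sample = pd.DataFrame({
    "author": ["Jane Doe", "Jane Doe", "Someone Else"],
    "repo": ["jane-doe/project", "org/shared", "org/shared"],
    "message": ["added stuff", "fix typo", "initial commit"],
})

salt = "replace-with-a-secret-value"  # hypothetical; keep it out of version control

# Drop the free-text commit messages and replace author names with pseudonyms
anon = sample.drop(columns=["message"]).assign(
    author=sample["author"].map(lambda name: pseudonymise(name, salt))
)
print(anon)
```

Note that repo names, URLs and SHAs can identify people just as well as the author column, so a real anonymisation would need to treat those too.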

-----------------------------

# Going deeper

We can get all the details of a commit by delving deeper into the json structure accessed through the commit url.

**example:**

In [40]:
url = event['payload']['commits'][0]['url']
commit_detail = requests.get(url,auth = (username,token)).json()
commit_detail.keys()

dict_keys(['sha', 'node_id', 'commit', 'url', 'html_url', 'comments_url', 'author', 'committer', 'parents', 'stats', 'files'])

The most interesting section is `files`. This gives a summary of the lines changed on each file altered in this commit

In [52]:
'files' in commit_detail.keys()

True

In [50]:
commit_detail['files']

[{'sha': 'c90c6b45472c3d1a962024805ce9bacac967066c',
  'filename': 'dev-environment.yml',
  'status': 'added',
  'additions': 13,
  'deletions': 0,
  'changes': 13,
  'blob_url': 'https://github.com/callumrollo/geotiff-generator/blob/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/dev-environment.yml',
  'raw_url': 'https://github.com/callumrollo/geotiff-generator/raw/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/dev-environment.yml',
  'contents_url': 'https://api.github.com/repos/callumrollo/geotiff-generator/contents/dev-environment.yml?ref=1b7056d0bd435af6ca6eec9414dd4dc3cd39d587',
  'patch': '@@ -0,0 +1,13 @@\n+name: geotiff-test\n+channels:\n+   - conda-forge\n+dependencies:\n+   - python=3.7\n+   - numpy\n+   - pandas\n+   - xarray\n+   - netcdf4\n+   - gdal\n+   - jupyter\n+   - ipython\n+   - pytest'},
 {'sha': '8ddd4f0dc27b9251adaa755485f754ded4ddc3f3',
  'filename': 'geotiff_gen.py',
  'status': 'modified',
  'additions': 0,
  'deletions': 1,
  'changes': 1,
  'blob_url': 'https://githu
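
Each entry in `files` carries its own `additions` and `deletions` counts, so a per-commit summary is a few lines of Python. A minimal sketch follows; `summarise_files` is a hypothetical helper and `sample_files` is a trimmed copy of the two entries shown above.

```python
def summarise_files(files):
    """Total the additions and deletions and list the filenames from a commit's `files` entry."""
    return {
        "n_files": len(files),
        "additions": sum(f.get("additions", 0) for f in files),
        "deletions": sum(f.get("deletions", 0) for f in files),
        "filenames": [f["filename"] for f in files],
    }


# Trimmed copy of the two entries shown above
sample_files = [
    {"filename": "dev-environment.yml", "status": "added", "additions": 13, "deletions": 0},
    {"filename": "geotiff_gen.py", "status": "modified", "additions": 0, "deletions": 1},
]
summary = summarise_files(sample_files)
print(summary)
```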

### Let's go through our little database and extract the file extensions of all the files altered in each commit

This was originally [Filipe](https://github.com/ocefpaf)'s idea

This requires a small extension to our scraping function. Because we want details at a commit level, this will make an additional API call for every single commit, which adds quite a time penalty. A good target for optimisation with async methods perhaps? *This is left as an exercise for the reader*

In [53]:
def gh_scrape(tgt_users, cred_file = 'credentials.json', verbose=True):
    """Simple scraping function
    Supply a list of github usernames ['jane-doe', 'torvalds', 'satoshi_nakamoto']
    Returns a dataframe of unique commits by these users over the last 90 days
    Assumes that the github user is the user with most git commits associated with their github profile
    Rate limited by the Github API to 300 events
    Requires you to supply a Github API token in a credentials.json file
    Verbose switch prints a line for each user with the number of events and commits found
    Returns a pandas dataframe of commit info for all usernames in supplied list
    """
    # Get user supplied credentials for the github API
    credentials = json.loads(open(cred_file).read())
    username = credentials['username']
    token = credentials['token']
    df = pd.DataFrame()
    
    for tgt_user in tgt_users:
        print(tgt_user)
        base_url = 'https://api.github.com/users/' + tgt_user +'/events'
        url = base_url
        url_list = [base_url]
        data = []
        page_no = 1
        repos_data = []
        total_fetched = 0
        while (True):
            response = requests.get(url,auth = (username,token)).json()
            if type(response)==dict:
                # Catch when the API returns a dict rather than expected list. Usually a credentials error message
                print(response)
                return
            data = data + response
            events_fetched = len(response)
            total_fetched = events_fetched + total_fetched
            if total_fetched == 300:
                # Requesting more will max out the API
                break
            if (events_fetched == 30):
                # if we fetched 30 events from this page, there will be another one after it
                page_no = page_no + 1
                url = base_url + '?page=' + str(page_no)
                url_list.append(url)
            else:
                # We have collected all events by this user
                break
        commits_events = []
        for event in data:
            # We're only interested in commits, which are classed as "PushEvents"
            if event["type"] == "PushEvent":
                commits_events.append(event)
        if len(commits_events)==0:
            # If the user has no commit events, stop processing
            continue
            
        commit_usernames_list = []
        for event in commits_events:
            # Search through the payload for which git user is associated with each commit
            commit_list = event["payload"]["commits"]
            if len(commit_list)>0:    
                for com_n in range(len(commit_list)):
                        commit_username = commit_list[com_n]["author"]["name"]       
                        commit_usernames_list.append(commit_username)
        # Working on the assumption that the git user with the most commits pushed to Github by this user is the one we want
        c = Counter(commit_usernames_list)
        most_common_username = c.most_common(1)[0][0]
        user_commits = []
        
        # Go back through the commits and pull only the ones by the most common git username
        for event in commits_events:
            commit_list = event["payload"]["commits"]
            commit_list_author = []
            if len(commit_list)>0:    
                for com_n in range(len(commit_list)):
                    commit_username = commit_list[com_n]["author"]["name"]
                    if commit_username == most_common_username:
                        commit_list_author.append(commit_list[com_n])
                if commit_list_author:
                    event["payload"]["commits"] = commit_list_author
                    user_commits.append(event)
        
        # Extract the information we're interested in and put it in a pandas DataFrame
        rows = []
        for commit in user_commits:
            for commit_detail in commit["payload"]["commits"]:
                # One extra API call per commit to get the list of files it touched
                commit_all_details = requests.get(commit_detail["url"],auth = (username,token)).json()
                if 'files' not in commit_all_details.keys():
                    continue
                extensions = ", ".join(file['filename'].split('.')[-1] for file in commit_all_details['files'])
                rows.append({"id": commit["id"],
                             "datetime": commit["created_at"],
                             "sha": commit_detail["sha"],
                             "message": commit_detail["message"],
                             "author": commit_detail["author"]["name"],
                             "url": commit_detail["url"],
                             "repo": commit["repo"]["name"],
                             "extensions": extensions})
        # DataFrame.append was removed in pandas 2.0; concatenate this user's rows instead
        df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
        df.index = pd.DatetimeIndex(df.datetime)
        df = df.drop_duplicates(subset=['sha'])
        if verbose:
            print(f"{tgt_user}: found {len(data)} events containing {len(user_commits)} unique commits by {most_common_username}\n")
    return df
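
The per-commit requests in the function above are independent of each other, so they could in principle be issued concurrently. The sketch below uses a thread pool from the standard library rather than true async methods, which is my substitution for simplicity. `fetch_many` and its `fetch` argument are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL using a small thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Usage with the real API (not run here):
# commit_urls = [c["url"] for e in user_commits for c in e["payload"]["commits"]]
# details = fetch_many(commit_urls,
#                      lambda url: requests.get(url, auth=(username, token)).json())
```

The hourly rate limit still applies however the requests are sent, so it is probably wise to keep the pool small.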

In [43]:
users = ["callumrollo"]
df = gh_scrape(users, cred_file = 'credentials-secret.json')
df.head()

callumrollo: found 299 events containing 182 unique commits by Callum Rollo



datetime,author,datetime,extensions,id,message,repo,sha,url
2020-10-16 17:42:00+00:00,Callum Rollo,2020-10-16T17:42:00Z,"html, html, html, html, html, html, html, html...",13873691257,Generate Pelican site,ueapy/ueapy.github.io,0b70258f3668dbd25fe8e9681b9324af62e3066f,https://api.github.com/repos/ueapy/ueapy.githu...
2020-10-16 17:41:23+00:00,Callum Rollo,2020-10-16T17:41:23Z,"md, png, ipynb",13873684842,Added notebook on webscraping,ueapy/ueapy.github.io,3364fd1e786612c16b4e999cbb832e9584818804,https://api.github.com/repos/ueapy/ueapy.githu...
2020-10-16 16:47:44+00:00,Callum Rollo,2020-10-16T16:47:44Z,"ipynb, png, png",13873114207,added beam miss figure and subplot labelling,callumrollo/adcp-glider,ad7aeeac2527b92cb5a98aa8cfc134bf355571e6,https://api.github.com/repos/callumrollo/adcp-...
2020-10-16 16:43:06+00:00,Callum Rollo,2020-10-16T16:43:06Z,"pdf, html",13873061051,Generate Pelican site,callumrollo/callumrollo.github.io,f5e76561f2e5886db6a413e079209d039f0bac74,https://api.github.com/repos/callumrollo/callu...
2020-10-16 16:42:30+00:00,Callum Rollo,2020-10-16T16:42:30Z,"pdf, md",13873054646,added MATS poster,callumrollo/callumrollo.github.io,30a6802786ba5b479e9cf09594e51e3e5ebcef94,https://api.github.com/repos/callumrollo/callu...


Success! We have the filetypes used in every commit by this author. This should tell us, in broad strokes, what programming language they are working in and potentially many other things. Are they working on `.md` files for documentation? Uploading lots of `.png` images? Do they prefer `.ipynb` notebooks to pure `.py` files?
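
One caveat: splitting the filename on `.` returns the whole name for files with no extension (`Makefile`), and part of the path if only a directory name contains a dot. A slightly more careful sketch uses `pathlib` and tallies the results with `Counter`. The helper `file_extension` and the sample filenames are made up for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath


def file_extension(filename):
    """Return the lowercase extension without the dot, or '' if the file has none."""
    return PurePosixPath(filename).suffix.lstrip(".").lower()


sample_filenames = [
    "notebooks/analysis.ipynb",
    "geotiff_gen.py",
    "figures/map.PNG",
    "Makefile",
    "docs.v2/README",
]
extension_counts = Counter(file_extension(name) for name in sample_filenames)
print(extension_counts.most_common())
```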

### Ideas for further analysis
- Use of different filetypes. Particularly .py vs .ipynb
- word cloud of commit messages
- check out non-commit activity: merge, PR, issue...
- geographical/timezone patterns
- how much "crunch" did we get before presentations on Friday?
- examine links between authors (who merged who? Comments mentioning issues?)

### Github is a rich mine of information
I hope this notebook has given you some ideas of the kinds of information you can grab from GitHub. It may also serve as a good reminder that everything we put on there is public and scrapable, so write helpful commit messages!

In [None]:
HTML(html)