
## Research Project Description

This project investigates how neural networks and their performance evolve over time, using signals such as hyperparameters, commits, and the technology used to train the networks. For now, the primary method is to scrape releases and commits through the GitHub API.

### Imports

In [1]:
import requests as r
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from tools import *

You will need a `.env` file containing your GitHub username and token, stored under the keys `user` and `token`.

In [None]:
from dotenv import load_dotenv
from os import environ
load_dotenv()
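
Before making any API calls, it can help to confirm that the credentials were actually loaded. The following is a minimal sketch: the key names `user` and `token` are the ones this notebook reads when it constructs the `Github` object, while the helper `missing_env_keys` is hypothetical and is not part of `tools.py`.

```python
from os import environ


def missing_env_keys(required, env=environ):
    """Return the names in `required` that are absent or empty in `env`."""
    return [key for key in required if not env.get(key)]


# Hypothetical check; `user` and `token` are the keys this notebook reads.
missing = missing_env_keys(["user", "token"])
if missing:
    print(f"Missing entries in .env: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.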

## Repository Scraping

We can use the GitHub API to request repository information such as:

- Commit information (includes messages and links to diffs)
- Release information
- Stars and other metrics

For now, our repository of interest is mozilla/DeepSpeech, which has 104 releases at the time of writing.

In [3]:
g = Github(environ['user'],environ['token'])


Refer to `tools.py` for the class used for GitHub scraping. Parameters such as the page number and the number of items per page are passed along with each request and control pagination.
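
Since the GitHub API returns at most 100 items per page, collecting everything means walking through the pages until one comes back short. The sketch below shows one way to do that generically. The function `fetch_all_pages` and its `fetch_page` callback are assumptions made for illustration; they are not part of `tools.py`.

```python
def fetch_all_pages(fetch_page, per_page=100, max_pages=1000):
    """Collect items from a paginated endpoint until a short page appears.

    `fetch_page(page)` must return the list of items for that 1-based page.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        items.extend(batch)
        if len(batch) < per_page:  # a short (or empty) page is the last one
            break
    return items


# Hypothetical usage with the notebook's scraper object `g`:
# releases = fetch_all_pages(
#     lambda page: g.get_releases(
#         "mozilla/DeepSpeech", {"page": page, "per_page": 100}
#     ).json()
# )
```

This avoids hard-coding the number of pages, at the cost of one extra request when the total is an exact multiple of `per_page`.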

In [7]:
g.get_releases("mozilla/DeepSpeech", {'page':1,'per_page':100}).json()[-1].keys()

dict_keys(['url', 'assets_url', 'upload_url', 'html_url', 'id', 'author', 'node_id', 'tag_name', 'target_commitish', 'name', 'draft', 'prerelease', 'created_at', 'published_at', 'assets', 'tarball_url', 'zipball_url', 'body'])

At the time of writing there are 104 releases, so two requests of up to 100 items each suffice to obtain all the release information. We then extract the columns of interest.

In [10]:
repo = "mozilla/DeepSpeech"
# Two pages of up to 100 releases each cover all 104 releases
releases = (g.get_releases(repo, {'page': 1, 'per_page': 100}).json()
            + g.get_releases(repo, {'page': 2, 'per_page': 100}).json())

# Get and store important information on releases
releases_df = pd.DataFrame(releases).drop([
    'reactions','zipball_url','tarball_url',
    'assets','created_at','draft','assets_url',
    'upload_url','author','node_id',
    'target_commitish','url','id'],axis=1)
# Parse the publication timestamp into a datetime column
releases_df['date'] = pd.to_datetime(releases_df['published_at'])
releases_df.to_pickle('data/deepspeech_releases.pkl')
releases_df

| | html_url | tag_name | name | prerelease | published_at | body |
|---|---|---|---|---|---|---|
| 0 | https://github.com/mozilla/DeepSpeech/releases... | v0.10.0-alpha.3 | v0.10.0-alpha.3 | True | 2020-12-19T10:13:23Z | |
| 1 | https://github.com/mozilla/DeepSpeech/releases... | v0.9.3 | DeepSpeech 0.9.3 | False | 2020-12-10T15:58:47Z | # General\r\n\r\nThis is the 0.9.3 release of ... |
| 2 | https://github.com/mozilla/DeepSpeech/releases... | v0.9.2 | DeepSpeech 0.9.2 | False | 2020-12-03T16:40:18Z | # General\r\n\r\nThis is the 0.9.2 release of ... |
| 3 | https://github.com/mozilla/DeepSpeech/releases... | v0.9.1 | DeepSpech 0.9.1 | False | 2020-11-04T16:56:50Z | # General\r\n\r\nThis is the 0.9.1 release of ... |
| 4 | https://github.com/mozilla/DeepSpeech/releases... | v0.9.0-alpha.12 | v0.9.0-alpha.12 | True | 2020-10-30T17:17:02Z | |
| ... | ... | ... | ... | ... | ... | ... |
| 100 | https://github.com/mozilla/DeepSpeech/releases... | v0.2.1-alpha.0 | v0.2.1-alpha.0 | True | 2018-09-26T11:35:33Z | |
| 101 | https://github.com/mozilla/DeepSpeech/releases... | v0.2.0-alpha.10 | v0.2.0-alpha.10 | True | 2018-09-18T15:01:36Z | |
| 102 | https://github.com/mozilla/DeepSpeech/releases... | v0.2.0 | Deep Speech 0.2.0 | False | 2018-09-18T22:14:11Z | # General\r\n\r\nThis is the 0.2.0 release of ... |
| 103 | https://github.com/mozilla/DeepSpeech/releases... | v0.1.1 | Deep Speech 0.1.1 | False | 2018-09-18T12:15:21Z | # General\r\n\r\nThis is the 0.1.1 release of ... |


We repeat the process for the commits. There are around 3500 commits.

In [15]:
commits = [g.get_commits('mozilla/DeepSpeech',{'page':x+1,'per_page':100}).json() for x in range(34)]
len(flatten(commits))
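
The `flatten` helper used here comes from `tools.py` and is not shown in this notebook. Assuming it simply concatenates the per-page lists into one list, a stand-in could look like the sketch below. The name `flatten_pages` is hypothetical and chosen so that it does not shadow the real helper.

```python
from itertools import chain


def flatten_pages(pages):
    """Concatenate a list of lists into one flat list (one level deep)."""
    return list(chain.from_iterable(pages))


# Each inner list stands for one page of API results.
print(len(flatten_pages([["c1", "c2"], ["c3"], []])))
```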

In [19]:
commits_df = pd.DataFrame([
    {
        'name': commit['commit']['author']['name'],
        'date': commit['commit']['author']['date'],
        'html_url': commit['html_url'],
        'message': commit['commit']['message'],
    }
    for commit in flatten(commits)
])
commits_df['date'] = pd.to_datetime(commits_df['date'])
commits_df = commits_df.sort_values(by="date")
commits_df

| | name | date | html_url | message |
|---|---|---|---|---|
| 3225 | Reuben Morais | 2016-10-20 18:27:12+00:00 | https://github.com/mozilla/DeepSpeech/commit/c... | Implement Fisher corpus importer |
| 3388 | Chris Lord | 2016-10-26 15:13:49+00:00 | https://github.com/mozilla/DeepSpeech/commit/2... | Use reshape/unpack to remove dependency on n_s... |
| 3224 | Reuben Morais | 2016-10-27 18:08:14+00:00 | https://github.com/mozilla/DeepSpeech/commit/d... | Address review comments and do further filteri... |
| 3189 | Andre Natal | 2016-10-29 22:21:01+00:00 | https://github.com/mozilla/DeepSpeech/commit/c... | Switchboard importer |
| 3190 | Andre Natal | 2016-10-29 22:21:01+00:00 | https://github.com/mozilla/DeepSpeech/commit/5... | Switchboard importer |
| ... | ... | ... | ... | ... |
| 4 | lissyx | 2021-05-13 16:09:11+00:00 | https://github.com/mozilla/DeepSpeech/commit/9... | Merge pull request #3647 from mozilla/ftyers-p... |
| 3 | lissyx | 2021-07-30 18:50:45+00:00 | https://github.com/mozilla/DeepSpeech/commit/f... | Update conf.py |
| 2 | lissyx | 2021-07-30 18:50:58+00:00 | https://github.com/mozilla/DeepSpeech/commit/7... | Merge pull request #3674 from mozilla/lissyx-p... |
| 1 | Daniel Tinazzi | 2021-11-17 13:20:19+00:00 | https://github.com/mozilla/DeepSpeech/commit/4... | Fixed M-AILABS broken link\n\nI replaced the b... |


### Scrapable Data from Releases

Every proper release follows a standard pattern in its hyperparameters section, which contains useful information. We use a regex to match every entry of the form " * `name` value \r\n" and then turn the matches into a dictionary.
Important values to tabulate are:

- default_stddev
- dev_batch_size
- dev_files
- epochs
- train_files
- n_hidden
- lm_alpha
- lm_beta
- train_batch_size
- test_batch_size

The filter used in the next cell, `imp_hyperparams`, presumably holds the following keys:

'train_batch_size', 'dev_batch_size', 'test_batch_size', 'n_hidden', 'learning_rate', 'dropout_rate', 'epochs', 'lm_alpha', 'lm_beta'
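
The extraction step described above can be sketched on a small example. The regex is the one used in this notebook, written as a raw string. The release body is made up for illustration, and `pairs_to_dict` is a hypothetical stand-in for the `tup_dict` helper from `tools.py`, whose actual implementation is not shown here.

```python
import re

# Same pattern as in the notebook cell, written as a raw string.
HYPERPARAM_PATTERN = re.compile(r"[*] `(.{2,}?)`(.{2,}?)\r\n")


def pairs_to_dict(pairs, keep):
    """Turn (name, value) tuples into a dict, keeping only names in `keep`."""
    return {name: value for name, value in pairs if name in keep}


# Made-up release body for illustration; not copied from a real release.
sample_body = (
    "# Hyperparameters\r\n\r\n"
    "* `train_batch_size` 128\r\n"
    "* `n_hidden` 2048\r\n"
    "* `train_files` train.csv\r\n"
)

pairs = HYPERPARAM_PATTERN.findall(sample_body)
hyperparams = pairs_to_dict(pairs, keep={"train_batch_size", "n_hidden"})
print(hyperparams)
```

The captured values keep their leading space, because the pattern matches everything between the closing backtick and the line ending.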

In [21]:
releases_df['hyperparams'] = releases_df[releases_df['body']!='']['body'].str.findall('[*] `(.{2,}?)`(.{2,}?)\r\n').apply(tup_dict, filter=imp_hyperparams) 

In [30]:
releases_df[releases_df['body']!=''][['name','published_at','hyperparams']]

| | name | published_at | hyperparams |
|---|---|---|---|
| 1 | DeepSpeech 0.9.3 | 2020-12-10T15:58:47Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 2 | DeepSpeech 0.9.2 | 2020-12-03T16:40:18Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 3 | DeepSpech 0.9.1 | 2020-11-04T16:56:50Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 6 | DeepSpech 0.9.0 | 2020-11-02T13:07:03Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 10 | DeepSpeech 0.8.2 | 2020-08-22T14:38:12Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 13 | DeepSpeech 0.8.1 | 2020-08-11T08:25:48Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 16 | DeepSpeech 0.8.0 | 2020-07-30T17:16:00Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 41 | DeepSpeech 0.7.4 | 2020-06-18T14:57:08Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 43 | DeepSpeech 0.7.3 | 2020-06-04T09:29:32Z | {'train_batch_size': ' 128', 'dev_batch_size':... |
| 47 | DeepSpeech 0.7.1 | 2020-05-12T15:31:11Z | {'train_batch_size': ' 128', 'dev_batch_size':... |


We now have hyperparameter information for 21 releases. Note that there is much more information to uncover within the release messages.
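
The scraped values are still raw strings with a leading space, such as `' 128'`. Before tabulating or plotting them, they would likely need to be converted to numbers. The sketch below shows one possible cleaning step; `parse_value` is a hypothetical helper and is not part of `tools.py`.

```python
def parse_value(raw):
    """Convert a scraped string such as ' 128' to int or float when possible."""
    text = raw.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text  # leave non-numeric values (e.g. file names) as strings


# Example input mirrors the shape of the scraped dictionaries.
raw_params = {"train_batch_size": " 128", "lm_alpha": " 0.75"}
clean_params = {name: parse_value(value) for name, value in raw_params.items()}
print(clean_params)
```

Trying `int` before `float` keeps whole numbers as integers, which is convenient for values like batch sizes.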