# GA4GH GitHub Commit Statistics

Expects `github-stats-summary.tsv` to exist already; see the Makefile for how it is generated.

This code counts commits in ga4gh-org repos only. Commits to repos outside the org are not counted, and comments and issues are not counted anywhere.


## Possible views

### Total Statistics
- num of repos
- num of committers 

### Stats over time
- num of repos
- num committers, per repo
- num commits, per repo
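
A minimal sketch of how these over-time views might be computed. It assumes a frame with the `repo`, `ts_Y`, and `author_email` columns built in the loading cell of this notebook; the sample rows and the names `per_year` and `repos_per_year` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real frame is loaded from github-stats-summary.tsv.
sample = pd.DataFrame(
    {
        "repo": ["ga4gh/a", "ga4gh/a", "ga4gh/a", "ga4gh/b"],
        "ts_Y": ["2022", "2022", "2023", "2023"],
        "author_email": [
            "x@example.org",
            "y@example.org",
            "x@example.org",
            "x@example.org",
        ],
    }
)

# Commits and distinct author emails per repo per year.
per_year = sample.groupby(["repo", "ts_Y"], as_index=False).agg(
    num_commits=pd.NamedAgg(column="author_email", aggfunc="size"),
    num_committers=pd.NamedAgg(column="author_email", aggfunc="nunique"),
)

# Number of repos with at least one commit in each year.
repos_per_year = sample.groupby("ts_Y")["repo"].nunique()
```

Grouping on `ts_YM` instead of `ts_Y` would give the same views at monthly resolution.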

### Repo stats
- first 10
- top 10 num committers
- top 10 2023
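
The top-10 views could be sketched with `nlargest`, as below. This reads "top 10 2023" as the repos with the most commits in 2023, which is an assumption; the sample rows and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows with the same column names as the notebook's frame.
sample = pd.DataFrame(
    {
        "repo": ["ga4gh/a", "ga4gh/a", "ga4gh/b", "ga4gh/c", "ga4gh/c", "ga4gh/c"],
        "ts_Y": ["2022", "2023", "2023", "2023", "2023", "2021"],
        "author_email": [
            "x@example.org",
            "y@example.org",
            "x@example.org",
            "x@example.org",
            "x@example.org",
            "z@example.org",
        ],
    }
)

# Repos ranked by number of distinct author emails.
top_by_committers = sample.groupby("repo")["author_email"].nunique().nlargest(10)

# Repos ranked by number of commits made in 2023 (assumed meaning of "top 10 2023").
top_2023 = sample[sample["ts_Y"] == "2023"].groupby("repo").size().nlargest(10)
```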

### Committers
- contributor longevity
- commits
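
A possible sketch of the committer view. Longevity is taken here to mean the span between an author's first and last commit, which is an assumption since the notebook does not define it; the sample rows and the name `committers` are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; timestamps are normalized to UTC for simplicity.
sample = pd.DataFrame(
    {
        "author_email": ["x@example.org", "x@example.org", "y@example.org"],
        "ts": pd.to_datetime(
            ["2020-01-01T00:00:00Z", "2023-01-01T00:00:00Z", "2022-06-01T00:00:00Z"],
            utc=True,
        ),
    }
)

# First commit, last commit, and commit count per author email.
committers = sample.groupby("author_email", as_index=False).agg(
    first_commit=pd.NamedAgg(column="ts", aggfunc="min"),
    last_commit=pd.NamedAgg(column="ts", aggfunc="max"),
    num_commits=pd.NamedAgg(column="ts", aggfunc="size"),
)

# Assumed definition: longevity is the time between first and last commit.
committers["longevity"] = committers["last_commit"] - committers["first_commit"]
```

Because one person can commit under several addresses, these numbers are per email address, not per human.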


## Bugs
- [ ] Email addresses are not mapped to unique humans. For example, reece@harts.net and reecehart@gmail.com are the same person, 102165525+wesleygoar@users.noreply.github.com and wesley.goar@nationwidechildrens.org are the same person, and 49699333+dependabot[bot]@users.noreply.github.com is a bot.
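
One way this might be addressed, as a sketch only: keep a hand-maintained alias table and flag bot addresses by pattern. The table `ALIASES`, the helpers `canonical_author` and `is_bot`, and the bot pattern (inferred from the dependabot address above) are all hypothetical and not part of this notebook.

```python
import re

# Hypothetical, hand-maintained alias table: known address -> canonical identity.
ALIASES = {
    "reecehart@gmail.com": "reece@harts.net",
    "102165525+wesleygoar@users.noreply.github.com": "wesley.goar@nationwidechildrens.org",
}

# Assumed pattern for bot accounts, e.g. "<id>+dependabot[bot]@users.noreply.github.com".
BOT_RE = re.compile(r"\[bot\]@users\.noreply\.github\.com$")


def is_bot(email: str) -> bool:
    """Return True if the address looks like a GitHub bot noreply address."""
    return BOT_RE.search(email.strip().lower()) is not None


def canonical_author(email: str) -> str:
    """Map an address to its canonical identity; unknown addresses map to themselves."""
    key = email.strip().lower()
    return ALIASES.get(key, key)
```

Applied to the frame, this could look like `df["author"] = df["author_email"].map(canonical_author)` after dropping rows where `df["author_email"].map(is_bot)` is true; unique-author counts would then use `author` rather than `author_email`.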


In [28]:
from datetime import datetime

import pandas as pd
import pytz

stats_fn = "github-stats-summary.tsv"
now = datetime.now(tz=pytz.UTC)

In [29]:
df = pd.read_csv(
    stats_fn,
    delimiter="\t",
    parse_dates=["ts"],
    keep_default_na=False,
    converters={
        "files_changed": lambda x: int(x or 0),
        "insertions": lambda x: int(x or 0),
        "deletions": lambda x: int(x or 0),
    },
)

df["ts_YM"] = df["ts"].apply(lambda ts: ts.strftime("%Y-%m"))
df.insert(2, "ts_YM", df.pop("ts_YM"))   # move ts_YM to right of ts
df["ts_Y"] = df["ts"].apply(lambda ts: ts.strftime("%Y"))
df.insert(2, "ts_Y", df.pop("ts_Y"))   # move ts_Y to right of ts
df.pop("hash")
df.pop("committer_email")

df

Unnamed: 0,repo,ts,ts_Y,ts_YM,author_email,files_changed,insertions,deletions,subject
0,ga4gh/ADA-M,2019-01-21 12:41:00-05:00,2019,2019-01,mirocupak@gmail.com,1,9,30,Remove released build dependencies
1,ga4gh/ADA-M,2018-11-28 17:04:29-05:00,2018,2018-11,fjeanson@yahoo.com,1,1,1,fixed wagger parser version error
2,ga4gh/ADA-M,2018-11-28 16:57:17-05:00,2018,2018-11,fjeanson@yahoo.com,1,1,1,fixed swagger-core clone version
3,ga4gh/ADA-M,2018-10-03 13:07:08-04:00,2018,2018-10,fjeanson@yahoo.com,1,1,1,updated .travis.yml swagger-core to 2.0.2
4,ga4gh/ADA-M,2018-04-19 16:28:57-04:00,2018,2018-04,mirocupak@gmail.com,1,0,2,Remove mention of Protocol Buffers
...,...,...,...,...,...,...,...,...,...
19883,ga4gh/workflow-execution-service-schemas,2016-04-05 16:55:50-04:00,2016,2016-04,briandoconnor@gmail.com,1,1,1,working on first pass at API
19884,ga4gh/workflow-execution-service-schemas,2016-04-05 16:53:01-04:00,2016,2016-04,briandoconnor@gmail.com,1,1,1,working on first pass at API
19885,ga4gh/workflow-execution-service-schemas,2016-04-05 16:50:44-04:00,2016,2016-04,briandoconnor@gmail.com,1,1,1,working on first pass at API
19886,ga4gh/workflow-execution-service-schemas,2016-04-05 16:31:53-04:00,2016,2016-04,briandoconnor@gmail.com,3,592,1,"initial checkin, a work in progress"


In [30]:
stats = {
    "number of repos": df["repo"].nunique(),
    "number of commits": len(df),
    "number of unique authors": df["author_email"].nunique()
}
stats

{'number of repos': 108,
 'number of commits': 19888,
 'number of unique authors': 1345}

In [32]:
df_repo = df.groupby(["repo"], as_index=False).agg(
    min_ts = pd.NamedAgg(column="ts", aggfunc=min),
    num_commits = pd.NamedAgg(column="repo", aggfunc=len),
    num_uniq_authors = pd.NamedAgg(column="author_email", aggfunc="nunique")
)
df_repo["age"] = now - df_repo["min_ts"]
df_repo.insert(2, "age", df_repo.pop("age"))
df_repo

Unnamed: 0,repo,min_ts,age,num_commits
0,ga4gh/ADA-M,2017-05-15 14:29:11-07:00,2273 days 22:21:40.966130,21
1,ga4gh/Get-Started-with-GA4GH-APIs,2022-02-09 09:26:47-05:00,543 days 05:24:04.966130,142
2,ga4gh/Strategic-Refresh,2022-11-30 11:17:47-05:00,249 days 03:33:04.966130,6
3,ga4gh/TASC,2020-01-16 23:40:40+02:00,1297 days 22:10:11.966130,35
4,ga4gh/approval-tracker,2018-07-05 13:32:31+01:00,1858 days 07:18:20.966130,15
...,...,...,...,...
103,ga4gh/vrsatile-pydantic,2021-08-25 16:11:03-04:00,710 days 23:39:48.966130,44
104,ga4gh/w3id.org,2013-05-07 17:10:03-04:00,3742 days 22:40:48.966130,4966
105,ga4gh/wiki,2017-08-03 11:48:15-07:00,2194 days 01:02:36.966130,2
106,ga4gh/workflow-execution-server,2016-03-21 16:26:44-07:00,2693 days 20:24:07.966130,1
