
# GitHub Statistics For /statmike/vertex-ai-mlops

Using the [GitHub API](https://docs.github.com/en/rest/metrics/statistics?apiVersion=2022-11-28) to:
- get commit history

**Notes:**

The API offers different ways to retrieve commit information.  These are different summaries of the same information.
- `/commits` retrieves all commits for a repository, but not files
- `/commits/sha` where `sha` is the sha for a commit will retrieve the files associated with the individual commit
- `/stats/contributors` is a weekly summary of commits, additions, and deletions for each contributor
- `/stats/code_frequency` is a weekly history of additions and deletions summarized
- `/stats/commit_activity` has the last year of weekly data for commits


The most granular level uses `/commits` and then `/commits/sha` to retrieve timestamped, file-level information within commits.  This data could then be rolled up to all other levels in reporting.
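As a rough illustration of that roll-up, the sketch below buckets file-level rows into weeks that start at 00:00 UTC on Sunday, which is how the weekly statistics endpoints label their weeks. The rows are hypothetical sample values, `week_start` is a helper name introduced only for this example, and it assumes the roll-up should use the commit timestamp in UTC.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical file-level rows (timestamp plus per-file additions/deletions)
rows = [
    {'datetime': '2023-02-19T18:19:54Z', 'additions': 1, 'deletions': 0},
    {'datetime': '2023-02-19T01:13:21Z', 'additions': 1, 'deletions': 1},
    {'datetime': '2023-02-18T22:59:45Z', 'additions': 11, 'deletions': 0},
]

def week_start(iso_utc):
    """Return the UNIX timestamp for 00:00 UTC on the Sunday on or before iso_utc."""
    dt = datetime.strptime(iso_utc, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)
    days_since_sunday = (dt.weekday() + 1) % 7  # weekday(): Monday=0 ... Sunday=6
    sunday = dt - timedelta(days=days_since_sunday)
    sunday = sunday.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(sunday.timestamp())

# Sum additions and deletions per week
weekly = defaultdict(lambda: {'additions': 0, 'deletions': 0})
for r in rows:
    w = week_start(r['datetime'])
    weekly[w]['additions'] += r['additions']
    weekly[w]['deletions'] += r['deletions']
weekly = dict(weekly)
print(weekly)
```

Note that file-level deletions arrive as positive counts, while `stats/code_frequency` reports them as negative numbers, so a sign flip would be needed before comparing the two.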


Approach notes:
- I prefer not to convert dates/times to datetime formats in pandas and instead save this as a step in BigQuery.  Why? Loading a dataframe to BigQuery has a middle layer where the data gets serialized and transferred.  This middle step is another set of format conversions that can impact dates/times.  This can cause errors when later appending to the same BigQuery tables, even when the new dataframe matches the original exactly. A > B > C is not the same as A > B|C > C
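One way the conversion could be deferred to BigQuery is sketched below: the integer `week` column is loaded as-is and converted at query time with the BigQuery function `TIMESTAMP_SECONDS`. The table name is a placeholder and the query is only built as a string here; it is an assumption about how the later step might look, not the notebook's actual query.

```python
# Placeholder table name (assumption); column names follow the weekly dataframes in this notebook
table = 'my-project.github_metrics.weekly_commits'

# Convert the raw epoch-seconds integer to a TIMESTAMP inside BigQuery, not in pandas
query = f"""
SELECT
  TIMESTAMP_SECONDS(week) AS week_start,
  additions,
  deletions,
  commits
FROM `{table}`
"""
print(query)
```

The string could then be passed to a BigQuery client query call, leaving the stored column untouched so later appends keep the same integer type.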

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/architectures/tracking/setup/github/GitHub%20Metrics%20-%201%20-%20Initial%20Creation.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [50]:
PROJECT_ID = 'vertex-ai-mlops-369716' # replace with project ID

In [51]:
try:
    import google.colab
    try:
      from google.cloud import secretmanager
    except ImportError:
      !pip install google-cloud-secret-manager -q
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

Updated property [core/project].


---
## Setup

In [52]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'vertex-ai-mlops-369716'

In [53]:
REGION = 'us-central1'

github_user = 'statmike'
github_repo = 'vertex-ai-mlops'

BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'github_metrics'

In [54]:
import requests
import json
import time
from datetime import datetime
import pandas as pd
from io import StringIO
import os, shutil

from google.cloud import bigquery
from google.cloud import secretmanager

In [55]:
bq = bigquery.Client(project = PROJECT_ID)
secret_client = secretmanager.SecretManagerServiceClient()

In [56]:
secret = secret_client.access_secret_version(request = {"name": f'projects/{PROJECT_ID}/secrets/github_api/versions/latest'})
pat = secret.payload.data.decode('utf-8')

---
## GitHub API

Define the API URL for the user and repository.  Create a helper function that makes GET requests to API addresses and, if it receives a 202 response (request accepted), retries until it receives a 200 response (success).

In [57]:
github_api_url = f'https://api.github.com/repos/{github_user}/{github_repo}'

In [58]:
def metric_get(metric_type):
  response = requests.get(f'{github_api_url}/{metric_type}', headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'})
  while response.status_code == 202:
      time.sleep(10)
      response = requests.get(f'{github_api_url}/{metric_type}', headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'})
  return response
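
The helper above keeps polling for as long as the API answers 202. A bounded variant is sketched below as one possible alternative. `get_with_retry`, `max_attempts`, and `FakeResponse` are names made up for this illustration, and `fetch` stands in for a call such as `requests.get(...)` so the example runs without network access.

```python
import time

def get_with_retry(fetch, max_attempts=6, wait_seconds=10, sleep=time.sleep):
    """Call fetch() until the status is not 202, making at most max_attempts calls.

    fetch is any zero-argument callable that returns an object with a
    status_code attribute.
    """
    response = fetch()
    attempts = 1
    while response.status_code == 202 and attempts < max_attempts:
        sleep(wait_seconds)
        response = fetch()
        attempts += 1
    return response

class FakeResponse:
    """Stand-in for an HTTP response (illustration only)."""
    def __init__(self, status_code):
        self.status_code = status_code

# Simulate an endpoint that answers 202 twice and then 200; skip real sleeping
statuses = iter([202, 202, 200])
result = get_with_retry(lambda: FakeResponse(next(statuses)), sleep=lambda s: None)
print(result.status_code)
```

If the cap is reached, the last 202 response is returned, so the caller still needs to check `status_code` before parsing.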

---
## Data Exploration

The following subsections retrieve and format data from different parts of the API related to commits.

---
### code_frequency
- stats/code_frequency
- https://docs.github.com/en/rest/metrics/statistics?apiVersion=2022-11-28#get-the-weekly-commit-activity
- Full history of repository
- schema: list of list for each week
  - 3 elements are integers: UNIX Timestamp for 12AM each Sunday, additions, deletions
- LRO (long-running operation) may return status_code = 202 while running, retry until 200
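
To sanity-check what a `week` value represents without changing the stored integer, it can be decoded on the side. This is a small sketch using only the standard library; the value is one of the `week` integers this endpoint returns for this repository.

```python
from datetime import datetime, timezone

week = 1616889600  # a `week` value returned by stats/code_frequency for this repository
as_utc = datetime.fromtimestamp(week, tz=timezone.utc)
print(as_utc.isoformat())  # 2021-03-28T00:00:00+00:00
print(as_utc.weekday())    # Monday=0 ... Sunday=6
```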


In [59]:
metric_type = 'stats/code_frequency'
response = metric_get(metric_type)
response.status_code

200

In [60]:
code_frequency = pd.DataFrame(json.loads(response.text), columns = ['week', 'additions', 'deletions'])
#code_frequency['week'] = pd.to_datetime(code_frequency['week'], unit = 's')

In [61]:
code_frequency.head()

Unnamed: 0,week,additions,deletions
0,1616889600,2983,-547
1,1617494400,7461,-3499
2,1618099200,12394,-6314
3,1618704000,7904,-6179
4,1619308800,0,0


---
### commit_activity
- stats/commit_activity
- https://docs.github.com/en/rest/metrics/statistics?apiVersion=2022-11-28#get-the-last-year-of-commit-activity
- last year of commit activity
- schema: list of dict for each week
  - week: UNIX Timestamp for 12AM each Sunday
  - total: int of number of commits
  - days: list of ints with commit per day [sunday, ..., saturday]
- LRO may return status_code = 202 while running, do retry until 200
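
The cells below drop the `days` list, but it could be expanded into daily rows if daily reporting were needed. The sketch below assumes the list is ordered Sunday through Saturday, as described in the schema above, and the `entry` values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical entry shaped like one element of the stats/commit_activity response
entry = {'week': 1676764800, 'total': 5, 'days': [4, 1, 0, 0, 0, 0, 0]}

# The week timestamp marks Sunday 00:00 UTC; each list position adds one day
start = datetime.fromtimestamp(entry['week'], tz=timezone.utc)
daily = [
    {'date': (start + timedelta(days=i)).date().isoformat(), 'commits': n}
    for i, n in enumerate(entry['days'])
]
print(daily[0], daily[-1])
```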


In [62]:
metric_type = 'stats/commit_activity'
response = metric_get(metric_type)
response.status_code

200

In [63]:
commit_activity = pd.DataFrame(json.loads(response.text)).drop(columns = ['days']).rename(columns = {'total' : 'commits'})
#commit_activity['week'] = pd.to_datetime(commit_activity['week'], unit = 's')

In [64]:
commit_activity.head()

Unnamed: 0,commits,week
0,0,1645920000
1,10,1646524800
2,14,1647129600
3,6,1647734400
4,7,1648339200


---
### contributors
- stats/contributors
- https://docs.github.com/en/rest/metrics/statistics?apiVersion=2022-11-28#get-all-contributor-commit-activity
- full history of repository
- schema: list of dict for contributor
  - total: total commits all time
  - weeks: list of dicts for each week:
    - w: UNIX Timestamp for 12AM each Sunday
    - a: additions
    - d: deletions
    - c: commits
  - author: dict
    - login: user name on GitHub
    - ...
- LRO may return status_code = 202 while running, do retry until 200


In [65]:
metric_type = 'stats/contributors'
response = metric_get(metric_type)
response.status_code

200

In [66]:
parsed_response = []
for e in json.loads(response.text):
  for week in e['weeks']:
    parsed_response += [{'author': e['author']['login'], 'week': week['w'], 'additions': week['a'], 'deletions': -1*week['d'], 'commits': week['c']}]

In [67]:
parsed_response[0]

{'author': 'karticn-google',
 'week': 1616889600,
 'additions': 0,
 'deletions': 0,
 'commits': 0}

In [68]:
contributors = pd.DataFrame(parsed_response)
#contributors['week'] = pd.to_datetime(contributors['week'], unit = 's')

In [69]:
contributors.head()

Unnamed: 0,author,week,additions,deletions,commits
0,karticn-google,1616889600,0,0,0
1,karticn-google,1617494400,0,0,0
2,karticn-google,1618099200,0,0,0
3,karticn-google,1618704000,0,0,0
4,karticn-google,1619308800,0,0,0


### commits
- /commits
- Get a list of commits with details
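- schema: list of dict for each commit
  - sha: the commit SHA
  - commit: dict with author and committer (name, email, date), message, ...
  - html_url: link to the commit on GitHub
  - author, committer: dict with the GitHub login (may be empty)
  - ...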


In [70]:
page_size = 100
page = 1
raw_commits = []
while page_size == 100:
  response = requests.get(f'{github_api_url}/commits?per_page=100&page={page}', headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'})
  new_page = json.loads(response.text)
  if response.status_code == 200:
    raw_commits += new_page
    page_size = len(new_page)
    page +=1
  else: break
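
The stopping rule used above (keep requesting pages until one comes back with fewer than 100 items) can be isolated into a small reusable function. This is a sketch with invented names: `fetch_all_pages` and `fetch_page` are hypothetical, and a list of fake commits stands in for the API so the example runs offline.

```python
def fetch_all_pages(fetch_page, per_page=100):
    """Collect items from fetch_page(page) until a page comes back short.

    fetch_page is a callable taking a 1-based page number and returning
    a list of at most per_page items.
    """
    items, page = [], 1
    while True:
        batch = fetch_page(page)
        items += batch
        if len(batch) < per_page:
            return items
        page += 1

# Stand-in for the API: 250 fake commits served 100 at a time
fake_commits = [{'sha': f'{i:040x}'} for i in range(250)]

def fake_page(page, per_page=100):
    return fake_commits[(page - 1) * per_page : page * per_page]

all_commits = fetch_all_pages(fake_page)
print(len(all_commits))
```

Checking the size of the latest page, not the running total, means the loop also ends cleanly when the total is an exact multiple of the page size: the final request returns an empty page.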

In [71]:
len(raw_commits)

708

In [72]:
raw_commits[0]

{'sha': '877ff535c3f01b9383a0cbb4be204705230983cc',
 'node_id': 'C_kwDOFwjp5toAKDg3N2ZmNTM1YzNmMDFiOTM4M2EwY2JiNGJlMjA0NzA1MjMwOTgzY2M',
 'commit': {'author': {'name': 'Mike Henderson',
   'email': 'statmike@gmail.com',
   'date': '2023-02-19T18:19:54Z'},
  'committer': {'name': 'Mike Henderson',
   'email': 'statmike@gmail.com',
   'date': '2023-02-19T18:19:54Z'},
  'message': 'notes',
  'tree': {'sha': '8dad1a7b6c6c707565dd8fc45561d1e73efb0666',
   'url': 'https://api.github.com/repos/statmike/vertex-ai-mlops/git/trees/8dad1a7b6c6c707565dd8fc45561d1e73efb0666'},
  'url': 'https://api.github.com/repos/statmike/vertex-ai-mlops/git/commits/877ff535c3f01b9383a0cbb4be204705230983cc',
  'comment_count': 0,
  'verification': {'verified': False,
   'reason': 'unsigned',
   'signature': None,
   'payload': None}},
 'url': 'https://api.github.com/repos/statmike/vertex-ai-mlops/commits/877ff535c3f01b9383a0cbb4be204705230983cc',
 'html_url': 'https://github.com/statmike/vertex-ai-mlops/commit/87

In [73]:
commits = []
for i, c in enumerate(raw_commits):
  committer = c['commit']['committer']['name']
  author = c['commit']['author']['name']
  committer2 = ''
  if 'committer' in c and c['committer']:
    if 'login' in c['committer']: committer2 = c['committer']['login']
  author2 = ''
  if 'author' in c and c['author']:
    if 'login' in c['author']: author2 = c['author']['login']

  # refined author with logic:
  if author2: refined_author = author2
  else: refined_author = author 

  commits += [{
      'sha': c['sha'],
      'datetime': c['commit']['committer']['date'],
      'url': c['html_url'],
      'message': c['commit']['message'],
      'author': refined_author,
      #'committer': committer,
      #'author': author,
      #'committer2': committer2,
      #'author2': author2
  }]

In [74]:
commits = pd.DataFrame(commits)

In [75]:
commits.head()

Unnamed: 0,sha,datetime,url,message,author
0,877ff535c3f01b9383a0cbb4be204705230983cc,2023-02-19T18:19:54Z,https://github.com/statmike/vertex-ai-mlops/co...,notes,statmike
1,5b85c2efa0a17cf4b0a49d045dc259135a158c59,2023-02-19T01:19:08Z,https://github.com/statmike/vertex-ai-mlops/co...,process update,statmike
2,dbfc6a9755a8eb782856af435a8383c683a598c1,2023-02-19T01:17:43Z,https://github.com/statmike/vertex-ai-mlops/co...,add colab process,statmike
3,151476a18b3a56228e3810bcfd5416e7cfe0de5f,2023-02-19T01:13:21Z,https://github.com/statmike/vertex-ai-mlops/co...,GA4 reporting process update,statmike
4,e8c28c8ed407688435c470b3ce39e3b1d697eef3,2023-02-18T22:59:45Z,https://github.com/statmike/vertex-ai-mlops/co...,tracking update,statmike


In [76]:
#commits['datetime'] = pd.to_datetime(commits['datetime'], infer_datetime_format = True)

In [77]:
commits.head()

Unnamed: 0,sha,datetime,url,message,author
0,877ff535c3f01b9383a0cbb4be204705230983cc,2023-02-19T18:19:54Z,https://github.com/statmike/vertex-ai-mlops/co...,notes,statmike
1,5b85c2efa0a17cf4b0a49d045dc259135a158c59,2023-02-19T01:19:08Z,https://github.com/statmike/vertex-ai-mlops/co...,process update,statmike
2,dbfc6a9755a8eb782856af435a8383c683a598c1,2023-02-19T01:17:43Z,https://github.com/statmike/vertex-ai-mlops/co...,add colab process,statmike
3,151476a18b3a56228e3810bcfd5416e7cfe0de5f,2023-02-19T01:13:21Z,https://github.com/statmike/vertex-ai-mlops/co...,GA4 reporting process update,statmike
4,e8c28c8ed407688435c470b3ce39e3b1d697eef3,2023-02-18T22:59:45Z,https://github.com/statmike/vertex-ai-mlops/co...,tracking update,statmike


In [78]:
commits.dtypes

sha         object
datetime    object
url         object
message     object
author      object
dtype: object

In [79]:
commits.groupby(['author']).count()

author,sha,datetime,url,message
Mike Henderson,44,44,44,44
PavelPetukhov,2,2,2,2
goodrules,10,10,10,10
karticn-google,1,1,1,1
mike henderson,1,1,1,1
statmike,650,650,650,650


In [80]:
commits.loc[commits['author'].str.lower() == 'mike henderson', 'author'] = 'statmike'

In [81]:
commits.groupby(['author']).count()

author,sha,datetime,url,message
PavelPetukhov,2,2,2,2
goodrules,10,10,10,10
karticn-google,1,1,1,1
statmike,695,695,695,695


In [82]:
commits.head()

Unnamed: 0,sha,datetime,url,message,author
0,877ff535c3f01b9383a0cbb4be204705230983cc,2023-02-19T18:19:54Z,https://github.com/statmike/vertex-ai-mlops/co...,notes,statmike
1,5b85c2efa0a17cf4b0a49d045dc259135a158c59,2023-02-19T01:19:08Z,https://github.com/statmike/vertex-ai-mlops/co...,process update,statmike
2,dbfc6a9755a8eb782856af435a8383c683a598c1,2023-02-19T01:17:43Z,https://github.com/statmike/vertex-ai-mlops/co...,add colab process,statmike
3,151476a18b3a56228e3810bcfd5416e7cfe0de5f,2023-02-19T01:13:21Z,https://github.com/statmike/vertex-ai-mlops/co...,GA4 reporting process update,statmike
4,e8c28c8ed407688435c470b3ce39e3b1d697eef3,2023-02-18T22:59:45Z,https://github.com/statmike/vertex-ai-mlops/co...,tracking update,statmike


### commits_files
- /commits/REF (use SHA)

In [83]:
sha = list(commits['sha'])

In [84]:
raw_files = []
headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'}
for s in sha:
  page = 1
  files = []
  while True:
    response = requests.get(f'{github_api_url}/commits/{s}?per_page=100&page={page}', headers = headers)
    new_files = json.loads(response.text)['files']
    files += new_files
    # a page with fewer than 100 files is the last page; testing the running total
    # (len(files) % 100) would loop forever when the total is an exact multiple of 100
    if len(new_files) < 100: break
    page += 1
  raw_files += [{'sha': s, 'files': files}]

In [85]:
len(raw_files)

708

In [86]:
len(raw_files[2]['files'])

1

In [87]:
raw_files[0]['files']

[{'sha': '2076f1aa518041c2a238e81ffc23e36fba2433d2',
  'filename': 'architectures/tracking/setup/github/readme.md',
  'status': 'modified',
  'additions': 1,
  'deletions': 0,
  'changes': 1,
  'blob_url': 'https://github.com/statmike/vertex-ai-mlops/blob/877ff535c3f01b9383a0cbb4be204705230983cc/architectures%2Ftracking%2Fsetup%2Fgithub%2Freadme.md',
  'raw_url': 'https://github.com/statmike/vertex-ai-mlops/raw/877ff535c3f01b9383a0cbb4be204705230983cc/architectures%2Ftracking%2Fsetup%2Fgithub%2Freadme.md',
  'contents_url': 'https://api.github.com/repos/statmike/vertex-ai-mlops/contents/architectures%2Ftracking%2Fsetup%2Fgithub%2Freadme.md?ref=877ff535c3f01b9383a0cbb4be204705230983cc',
  'patch': '@@ -7,5 +7,6 @@ Setup the data reads from the GitHub API\n \n ## Notebooks For Gathering and Processing Information:\n - [GitHub Metrics - 1 - Initial Creation](./GitHub%20Metrics%20-%201%20-%20Initial%20Creation.ipynb)\n+    - Building the tables `commits` and `commits_files` in the BigQuery

In [88]:
commits_files = []
for c in raw_files:
  for f in c['files']:
    commits_files += [{
        'sha': c['sha'],
        'file_sha': f['sha'],
        'file': f"{github_user}/{github_repo}/{f['filename']}",
        'additions': f['additions'],
        'deletions': f['deletions']
    }]

In [89]:
commits_files = pd.DataFrame(commits_files)

In [90]:
commits_files.head()

Unnamed: 0,sha,file_sha,file,additions,deletions
0,877ff535c3f01b9383a0cbb4be204705230983cc,2076f1aa518041c2a238e81ffc23e36fba2433d2,statmike/vertex-ai-mlops/architectures/trackin...,1,0
1,5b85c2efa0a17cf4b0a49d045dc259135a158c59,d56dfb3c5ff1cb2bfd92ab404cd11a2677875618,statmike/vertex-ai-mlops/architectures/trackin...,1,0
2,dbfc6a9755a8eb782856af435a8383c683a598c1,696a6b8c109004639678b964a782fffea8bec850,statmike/vertex-ai-mlops/architectures/trackin...,11,0
3,151476a18b3a56228e3810bcfd5416e7cfe0de5f,ec4f76a6f89ae3b2e49be2761b7f8d21822ccec2,statmike/vertex-ai-mlops/architectures/trackin...,1,1
4,151476a18b3a56228e3810bcfd5416e7cfe0de5f,b2e135734a66e00479f69e8e4865c2e802e276c4,statmike/vertex-ai-mlops/architectures/trackin...,1,1


---
## Pandas Tables

The following section combines the data retrieved above and applies final formatting to prepare the dataframes that will be saved.

### weekly_commits (combine code_frequency and commit_activity)

In [91]:
weekly_commits = pd.merge(code_frequency, commit_activity, on = 'week', how = 'outer')
weekly_commits['github_account'] = github_user
weekly_commits['github_repo'] = github_repo
weekly_commits.head()

Unnamed: 0,week,additions,deletions,commits,github_account,github_repo
0,1616889600,2983,-547,,statmike,vertex-ai-mlops
1,1617494400,7461,-3499,,statmike,vertex-ai-mlops
2,1618099200,12394,-6314,,statmike,vertex-ai-mlops
3,1618704000,7904,-6179,,statmike,vertex-ai-mlops
4,1619308800,0,0,,statmike,vertex-ai-mlops


### author_weekly_commits (contributors)

In [92]:
author_weekly_commits = contributors
author_weekly_commits['github_account'] = github_user
author_weekly_commits['github_repo'] = github_repo
author_weekly_commits.head()

Unnamed: 0,author,week,additions,deletions,commits,github_account,github_repo
0,karticn-google,1616889600,0,0,0,statmike,vertex-ai-mlops
1,karticn-google,1617494400,0,0,0,statmike,vertex-ai-mlops
2,karticn-google,1618099200,0,0,0,statmike,vertex-ai-mlops
3,karticn-google,1618704000,0,0,0,statmike,vertex-ai-mlops
4,karticn-google,1619308800,0,0,0,statmike,vertex-ai-mlops


### commits

In [93]:
commits = commits
commits.head()

Unnamed: 0,sha,datetime,url,message,author
0,877ff535c3f01b9383a0cbb4be204705230983cc,2023-02-19T18:19:54Z,https://github.com/statmike/vertex-ai-mlops/co...,notes,statmike
1,5b85c2efa0a17cf4b0a49d045dc259135a158c59,2023-02-19T01:19:08Z,https://github.com/statmike/vertex-ai-mlops/co...,process update,statmike
2,dbfc6a9755a8eb782856af435a8383c683a598c1,2023-02-19T01:17:43Z,https://github.com/statmike/vertex-ai-mlops/co...,add colab process,statmike
3,151476a18b3a56228e3810bcfd5416e7cfe0de5f,2023-02-19T01:13:21Z,https://github.com/statmike/vertex-ai-mlops/co...,GA4 reporting process update,statmike
4,e8c28c8ed407688435c470b3ce39e3b1d697eef3,2023-02-18T22:59:45Z,https://github.com/statmike/vertex-ai-mlops/co...,tracking update,statmike


### commits_files

In [94]:
commits_files = pd.merge(commits, commits_files, on = 'sha', how = 'outer')
commits_files.head()

Unnamed: 0,sha,datetime,url,message,author,file_sha,file,additions,deletions
0,877ff535c3f01b9383a0cbb4be204705230983cc,2023-02-19T18:19:54Z,https://github.com/statmike/vertex-ai-mlops/co...,notes,statmike,2076f1aa518041c2a238e81ffc23e36fba2433d2,statmike/vertex-ai-mlops/architectures/trackin...,1,0
1,5b85c2efa0a17cf4b0a49d045dc259135a158c59,2023-02-19T01:19:08Z,https://github.com/statmike/vertex-ai-mlops/co...,process update,statmike,d56dfb3c5ff1cb2bfd92ab404cd11a2677875618,statmike/vertex-ai-mlops/architectures/trackin...,1,0
2,dbfc6a9755a8eb782856af435a8383c683a598c1,2023-02-19T01:17:43Z,https://github.com/statmike/vertex-ai-mlops/co...,add colab process,statmike,696a6b8c109004639678b964a782fffea8bec850,statmike/vertex-ai-mlops/architectures/trackin...,11,0
3,151476a18b3a56228e3810bcfd5416e7cfe0de5f,2023-02-19T01:13:21Z,https://github.com/statmike/vertex-ai-mlops/co...,GA4 reporting process update,statmike,ec4f76a6f89ae3b2e49be2761b7f8d21822ccec2,statmike/vertex-ai-mlops/architectures/trackin...,1,1
4,151476a18b3a56228e3810bcfd5416e7cfe0de5f,2023-02-19T01:13:21Z,https://github.com/statmike/vertex-ai-mlops/co...,GA4 reporting process update,statmike,b2e135734a66e00479f69e8e4865c2e802e276c4,statmike/vertex-ai-mlops/architectures/trackin...,1,1


---
## BigQuery Tables: Initial Creation


### weekly_commits

In [95]:
weekly_commits_job = bq.load_table_from_dataframe(
    dataframe = weekly_commits,
    destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.weekly_commits"),
    job_config = bigquery.LoadJobConfig(
        write_disposition = 'WRITE_TRUNCATE', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
        autodetect = True, # detect schema
    )
)
weekly_commits_job.result()

LoadJob<project=vertex-ai-mlops-369716, location=US, id=8c6337df-8d27-4676-a05c-987614dddf9a>

### author_weekly_commits

In [96]:
author_weekly_commits_job = bq.load_table_from_dataframe(
    dataframe = author_weekly_commits,
    destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits"),
    job_config = bigquery.LoadJobConfig(
        write_disposition = 'WRITE_TRUNCATE', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
        autodetect = True, # detect schema
    )
)
author_weekly_commits_job.result()

LoadJob<project=vertex-ai-mlops-369716, location=US, id=9bb0abca-04b4-45a9-88f6-051a2123cc21>

### commits

In [97]:
commits_job = bq.load_table_from_dataframe(
    dataframe = commits,
    destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.commits"),
    job_config = bigquery.LoadJobConfig(
        write_disposition = 'WRITE_TRUNCATE', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
        autodetect = True, # detect schema
    )
)
commits_job.result()

LoadJob<project=vertex-ai-mlops-369716, location=US, id=57fa4124-66df-449e-812f-bd0250b0d273>

### commits_files

In [98]:
commits_files_job = bq.load_table_from_dataframe(
    dataframe = commits_files,
    destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.commits_files"),
    job_config = bigquery.LoadJobConfig(
        write_disposition = 'WRITE_TRUNCATE', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
        autodetect = True, # detect schema
    )
)
commits_files_job.result()

LoadJob<project=vertex-ai-mlops-369716, location=US, id=9a2ec441-6d3e-44da-bd48-1eabdf263c5f>

---
## BigQuery Tables: Increment

- get the most recent week of data stored in BigQuery: prior_weekly_commits
- keep all freshly retrieved records for that week and newer: new_weekly_commits
- conditions:
  - if both have the same number of rows and the rows are identical: do nothing
  - elif the rows for the last week are the same: remove that week from new, then append any remaining new records
  - else (the row for the last week differs): delete it from BigQuery, then append all of new
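
The conditions above can be read as a single decision: which week (if any) to delete, and which rows to append. The sketch below expresses that decision as a pure function over lists of dicts so it can be checked without BigQuery. `plan_increment` is a hypothetical name, the sample rows are trimmed to two columns, and the comparison assumes both inputs list the last week's rows in the same order.

```python
def plan_increment(prior_rows, new_rows, key='week'):
    """Decide what to delete and append for an incremental load.

    prior_rows: rows currently stored for the most recent week.
    new_rows:   freshly fetched rows for that week and any newer weeks.
    Returns (week_to_delete or None, rows_to_append).
    """
    last_week = max(r[key] for r in prior_rows)
    overlap = [r for r in new_rows if r[key] == last_week]
    newer = [r for r in new_rows if r[key] > last_week]
    if overlap == prior_rows:
        # stored week is unchanged: keep it, append only newer weeks (possibly none)
        return None, newer
    # stored week changed: delete it, then append it again along with newer weeks
    return last_week, overlap + newer

prior = [{'week': 1676160000, 'commits': 21}]
fresh = [{'week': 1676160000, 'commits': 21}, {'week': 1676764800, 'commits': 5}]
delete_week, to_append = plan_increment(prior, fresh)
print(delete_week, to_append)
```

In the first case of the list (same rows, nothing newer) this returns `None` and an empty list, which corresponds to doing nothing.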

### weekly_commits

In [147]:
metric_type = 'stats/code_frequency'
response = metric_get(metric_type)
code_frequency = pd.DataFrame(json.loads(response.text), columns = ['week', 'additions', 'deletions'])
#code_frequency['week'] = pd.to_datetime(code_frequency['week'], unit = 's')

In [148]:
metric_type = 'stats/commit_activity'
response = metric_get(metric_type)
commit_activity = pd.DataFrame(json.loads(response.text)).drop(columns = ['days']).rename(columns = {'total' : 'commits'})
#commit_activity['week'] = pd.to_datetime(commit_activity['week'], unit = 's')

In [149]:
weekly_commits = pd.merge(code_frequency, commit_activity, on = 'week', how = 'outer')
weekly_commits['github_account'] = github_user
weekly_commits['github_repo'] = github_repo

In [150]:
prior_weekly_commits = bq.query(query = f"""SELECT t.* FROM `{BQ_PROJECT}.{BQ_DATASET}.weekly_commits` t WHERE 1=1 QUALIFY row_number() over(order by week desc) = 1""").to_dataframe()

In [151]:
prior_weekly_commits

Unnamed: 0,week,additions,deletions,commits,github_account,github_repo
0,1676160000,36560,-13179,21.0,statmike,vertex-ai-mlops


In [152]:
new_weekly_commits = weekly_commits[(weekly_commits['week'] >= prior_weekly_commits['week'].max())]

In [153]:
new_weekly_commits

Unnamed: 0,week,additions,deletions,commits,github_account,github_repo
98,1676160000,36560,-13179,21.0,statmike,vertex-ai-mlops
99,1676764800,21,-3,5.0,statmike,vertex-ai-mlops


In [154]:
# fake a change to force a load
#print(new_weekly_commits['commits'].iloc[0])
#new_weekly_commits['commits'].iloc[0] = 5
#print(new_weekly_commits['commits'].iloc[0])

In [155]:
# check if no new weeks data added (both will have 1 record for last week)
if prior_weekly_commits.shape[0] == new_weekly_commits.shape[0]:
  # check if last weeks data changed
  if prior_weekly_commits.values.tolist() != new_weekly_commits.values.tolist():
    # remove old week from BQ
    job = bq.query(query = f"""DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.weekly_commits` WHERE week = {prior_weekly_commits['week'].max()}""")
    job.result()
    # append updated week
    new_weekly_commits_job = bq.load_table_from_dataframe(
        dataframe = new_weekly_commits,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.weekly_commits"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
            autodetect = True, # detect schema
        )
    )
    new_weekly_commits_job.result()
# if new weeks data has been added (could even be more than one):
else:
  # check if overlapping week changed
  if prior_weekly_commits.values.tolist() != new_weekly_commits[(new_weekly_commits['week'] == prior_weekly_commits['week'].max())].values.tolist():
    # remove old week from BQ
    job = bq.query(query = f"""DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.weekly_commits` WHERE week = {prior_weekly_commits['week'].max()}""")
    job.result()
    # append all: updated and new
    new_weekly_commits_job = bq.load_table_from_dataframe(
        dataframe = new_weekly_commits,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.weekly_commits"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
            autodetect = True, # detect schema
        )
    )
    new_weekly_commits_job.result()
  else:
    # remove old week from new_weekly_commits
    new_weekly_commits = new_weekly_commits[(new_weekly_commits['week'] != prior_weekly_commits['week'].max())]
    # append new week(s)
    new_weekly_commits_job = bq.load_table_from_dataframe(
        dataframe = new_weekly_commits,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.weekly_commits"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
            autodetect = True, # detect schema
        )
    )
    new_weekly_commits_job.result()

### author_weekly_commits

In [156]:
metric_type = 'stats/contributors'
response = metric_get(metric_type)

parsed_response = []
for e in json.loads(response.text):
  for week in e['weeks']:
    parsed_response += [{'author': e['author']['login'], 'week': week['w'], 'additions': week['a'], 'deletions': -1*week['d'], 'commits': week['c']}]

contributors = pd.DataFrame(parsed_response)
#contributors['week'] = pd.to_datetime(contributors['week'], unit = 's')

In [157]:
author_weekly_commits = contributors
author_weekly_commits['github_account'] = github_user
author_weekly_commits['github_repo'] = github_repo

In [158]:
prior_author_weekly_commits = bq.query(query = f"""SELECT t.* FROM `{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits` t WHERE 1=1 QUALIFY row_number() over(partition by author order by week desc) = 1""").to_dataframe().sort_values(by = ['week', 'author'])

In [159]:
prior_author_weekly_commits

Unnamed: 0,author,week,additions,deletions,commits,github_account,github_repo
3,PavelPetukhov,1676764800,0,0,1,statmike,vertex-ai-mlops
2,goodrules,1676764800,0,0,0,statmike,vertex-ai-mlops
0,karticn-google,1676764800,0,0,0,statmike,vertex-ai-mlops
1,statmike,1676764800,20,-3,4,statmike,vertex-ai-mlops


In [160]:
new_author_weekly_commits = author_weekly_commits[(author_weekly_commits['week'] >= prior_author_weekly_commits['week'].max())].sort_values(by = ['week', 'author'])

In [161]:
new_author_weekly_commits

Unnamed: 0,author,week,additions,deletions,commits,github_account,github_repo
199,PavelPetukhov,1676764800,0,0,0,statmike,vertex-ai-mlops
299,goodrules,1676764800,0,0,0,statmike,vertex-ai-mlops
99,karticn-google,1676764800,0,0,0,statmike,vertex-ai-mlops
399,statmike,1676764800,21,-3,5,statmike,vertex-ai-mlops


In [162]:
# fake a change to force a load
#print(new_author_weekly_commits['commits'].iloc[0])
#new_author_weekly_commits['commits'].iloc[0] = 1
#print(new_author_weekly_commits['commits'].iloc[0])

In [163]:
new_author_weekly_commits

Unnamed: 0,author,week,additions,deletions,commits,github_account,github_repo
199,PavelPetukhov,1676764800,0,0,0,statmike,vertex-ai-mlops
299,goodrules,1676764800,0,0,0,statmike,vertex-ai-mlops
99,karticn-google,1676764800,0,0,0,statmike,vertex-ai-mlops
399,statmike,1676764800,21,-3,5,statmike,vertex-ai-mlops


In [164]:
# check if no new weeks data added (both will have 1 record per author for last week)
if prior_author_weekly_commits.shape[0] == new_author_weekly_commits.shape[0]:
  # check if last weeks data changed
  if prior_author_weekly_commits.values.tolist() != new_author_weekly_commits.values.tolist():
    # remove old week from BQ
    job = bq.query(query = f"""DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits` WHERE week = {prior_author_weekly_commits['week'].max()}""")
    job.result()
    # append updated week
    new_author_weekly_commits_job = bq.load_table_from_dataframe(
        dataframe = new_author_weekly_commits,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
            autodetect = True, # detect schema
        )
    )
    new_author_weekly_commits_job.result()
# if new weeks data has been added (could even be more than one):
else:
  # check if overlapping week changed
  if prior_author_weekly_commits.values.tolist() != new_author_weekly_commits[(new_author_weekly_commits['week'] == prior_author_weekly_commits['week'].max())].values.tolist():
    # remove old week from BQ
    job = bq.query(query = f"""DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits` WHERE week = {prior_author_weekly_commits['week'].max()}""")
    job.result()
    # append all: updated and new
    new_author_weekly_commits_job = bq.load_table_from_dataframe(
        dataframe = new_author_weekly_commits,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
            #autodetect = True, # detect schema
        )
    )
    new_author_weekly_commits_job.result()
  else:
    # remove old week from new_author_weekly_commits
    new_author_weekly_commits = new_author_weekly_commits[(new_author_weekly_commits['week'] != prior_author_weekly_commits['week'].max())]
    # append new week(s)
    new_author_weekly_commits_job = bq.load_table_from_dataframe(
        dataframe = new_author_weekly_commits,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
            autodetect = True, # detect schema
        )
    )
    new_author_weekly_commits_job.result()

### commits

Commits can arrive after they are created because a push or pull may be delayed.  For this reason, load the full commit history and check for new `sha` values.  Only these need to be added. This is always an append operation.
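
The check for new `sha` values amounts to an anti-join between the fetched history and what is already stored. A minimal sketch with plain Python structures follows; the abbreviated sha strings and messages are hypothetical sample values.

```python
# Hypothetical values: sha strings already stored vs. the freshly fetched history
prior_shas = {'877ff5', '5b85c2', 'dbfc6a'}
fetched = [
    {'sha': '09404b', 'message': 'notes'},
    {'sha': '877ff5', 'message': 'notes'},
    {'sha': '5b85c2', 'message': 'process update'},
]

# Keep only commits whose sha is not already stored; these are the rows to append
new_commits = [c for c in fetched if c['sha'] not in prior_shas]
print(new_commits)
```

With dataframes, the same filter could likely be written with `Series.isin` on the `sha` column, negated to keep only the unseen rows.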

In [165]:
page_size = 100
page = 1
raw_commits = []
while page_size == 100: # a full page means more commits may remain
  response = requests.get(f'{github_api_url}/commits?per_page=100&page={page}', headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'})
  if response.status_code != 200: break # stop on errors (for example rate limiting) before parsing the body
  new_page = json.loads(response.text)
  raw_commits += new_page
  page_size = len(new_page)
  page += 1

commits = []
for i, c in enumerate(raw_commits):
  committer = c['commit']['committer']['name']
  author = c['commit']['author']['name']
  committer2 = ''
  if 'committer' in c and c['committer']:
    if 'login' in c['committer']: committer2 = c['committer']['login']
  author2 = ''
  if 'author' in c and c['author']:
    if 'login' in c['author']: author2 = c['author']['login']

  # refined author with logic:
  if author2: refined_author = author2
  else: refined_author = author 

  commits += [{
      'sha': c['sha'],
      'datetime': c['commit']['committer']['date'],
      'url': c['html_url'],
      'message': c['commit']['message'],
      'author': refined_author,
      #'committer': committer,
      #'author': author,
      #'committer2': committer2,
      #'author2': author2
  }]

commits = pd.DataFrame(commits)
#commits['datetime'] = pd.to_datetime(commits['datetime'], infer_datetime_format = True)
commits.loc[commits['author'].str.lower() == 'mike henderson', 'author'] = 'statmike'
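
The pagination loop at the top of the cell above stops when a page comes back short. GitHub also returns `Link` headers, which the `requests` library exposes as `response.links`, so an alternative is to follow the `next` link until none is left. The following is a minimal sketch of that idea, not the notebook's method: `fetch_all_pages` is a hypothetical helper, and the HTTP getter is passed in as an argument (an assumption made here so the sketch can run without network access).

```python
from typing import Callable, Dict, List, Optional


def fetch_all_pages(get: Callable, url: str, headers: Optional[Dict[str, str]] = None) -> List[dict]:
    """Collect the items from every page by following the rel="next" link.

    `get` is any callable with the shape of requests.get: it returns an object
    with .status_code, .json() and .links (hypothetical injection point).
    """
    items: List[dict] = []
    while url:
        response = get(url, headers=headers)
        if response.status_code != 200:
            break  # keep what was collected so far, as the loop above does
        items.extend(response.json())
        # .links is an empty dict when the server sends no Link header
        url = response.links.get('next', {}).get('url')
    return items

# Possible usage with the names from this notebook (not executed here):
# raw_commits = fetch_all_pages(
#     requests.get,
#     f'{github_api_url}/commits?per_page=100',
#     headers={'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'},
# )
```

This avoids one extra request when the last page happens to hold exactly 100 commits, at the cost of depending on the `Link` header being present.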

In [166]:
commits.head()

| | sha | datetime | url | message | author |
|---|---|---|---|---|---|
| 0 | 09404bb01b5c98b55b2387ef5cf48fb73581bc18 | 2023-02-19T19:14:01Z | https://github.com/statmike/vertex-ai-mlops/co... | notes | statmike |
| 1 | 877ff535c3f01b9383a0cbb4be204705230983cc | 2023-02-19T18:19:54Z | https://github.com/statmike/vertex-ai-mlops/co... | notes | statmike |
| 2 | 5b85c2efa0a17cf4b0a49d045dc259135a158c59 | 2023-02-19T01:19:08Z | https://github.com/statmike/vertex-ai-mlops/co... | process update | statmike |
| 3 | dbfc6a9755a8eb782856af435a8383c683a598c1 | 2023-02-19T01:17:43Z | https://github.com/statmike/vertex-ai-mlops/co... | add colab process | statmike |
| 4 | 151476a18b3a56228e3810bcfd5416e7cfe0de5f | 2023-02-19T01:13:21Z | https://github.com/statmike/vertex-ai-mlops/co... | GA4 reporting process update | statmike |


In [167]:
prior_commits = bq.query(query = f"""SELECT sha FROM `{BQ_PROJECT}.{BQ_DATASET}.commits`""").to_dataframe()

In [168]:
new_commits = pd.merge(prior_commits, commits, on = 'sha', how = 'outer', indicator = True)
new_commits = new_commits[new_commits['_merge'] == 'right_only'].drop('_merge', axis = 1)
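
The outer merge with `indicator = True` above is an anti-join: it keeps the rows whose `sha` exists only in the freshly fetched data. The same idea can be sketched with a plain Python set, which may be easier to reason about before the data becomes a dataframe. This is an illustrative alternative under stated assumptions, not the notebook's method; `select_new_commits` and the sample values are hypothetical.

```python
from typing import Dict, Iterable, List


def select_new_commits(commits: List[Dict[str, str]], prior_shas: Iterable[str]) -> List[Dict[str, str]]:
    """Return only the commits whose sha is not already stored."""
    known = set(prior_shas)  # set membership checks are O(1) on average
    return [c for c in commits if c['sha'] not in known]


# Hypothetical sample data, not taken from the repository
fetched = [{'sha': 'a1'}, {'sha': 'b2'}, {'sha': 'c3'}]
already_loaded = ['a1', 'c3']
print(select_new_commits(fetched, already_loaded))
```

When every fetched `sha` is already stored the function returns an empty list, which corresponds to `new_commits.shape[0]` being 0 in the notebook.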

In [169]:
new_commits.shape[0]

0

In [170]:
if new_commits.shape[0] > 0:
  new_commits_job = bq.load_table_from_dataframe(
      dataframe = new_commits,
      destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.commits"),
      job_config = bigquery.LoadJobConfig(
          write_disposition = 'WRITE_APPEND', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
          autodetect = True # detect schema
      )
  )
  new_commits_job.result()

### commits_files

Only need to fetch files for new commits found above.  This is always an append operation.

In [171]:
if new_commits.shape[0] > 0:
  sha = list(new_commits['sha'])
  raw_files = []
  for s in sha:
    page = 1
    response = requests.get(f'{github_api_url}/commits/{s}?per_page=100&page={page}', headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'})
    files = json.loads(response.text)['files']
    last_page_size = len(files)
    while last_page_size == 100: # a full page means more files may remain; stop on the first short (or empty) page
      page += 1
      response = requests.get(f'{github_api_url}/commits/{s}?per_page=100&page={page}', headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'})
      page_files = json.loads(response.text)['files']
      files += page_files
      last_page_size = len(page_files)
    raw_files += [{'sha': s, 'files': files}]

  commits_files = []
  for c in raw_files:
    for f in c['files']:
      commits_files += [{
          'sha': c['sha'],
          'file_sha': f['sha'],
          'file': f"{github_user}/{github_repo}/{f['filename']}",
          'additions': f['additions'],
          'deletions': f['deletions']
      }]
  commits_files = pd.DataFrame(commits_files)
  commits_files = pd.merge(new_commits, commits_files, on = 'sha', how = 'inner')

  new_commits_files_job = bq.load_table_from_dataframe(
      dataframe = commits_files,
      destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.commits_files"),
      job_config = bigquery.LoadJobConfig(
          write_disposition = 'WRITE_APPEND', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
          autodetect = True, # detect schema
      )
  )
  new_commits_files_job.result()

---
## Diagnostics

Mainly, check whether:
- the results in `weekly_commits` can be created from `commits`
- the results in `author_weekly_commits` can be created from `commits`
- the results in `commits` can be created from `commits_files`
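
The first two checks depend on rolling individual commit timestamps up to week starts. In BigQuery, `DATE_TRUNC(..., WEEK)` truncates to the preceding Sunday, and the same rollup can be sketched locally to sanity-check a handful of rows. This is a minimal sketch using only the standard library; `week_start` and `weekly_counts` are hypothetical helper names, the sample timestamps are made up, and the result is meant for inspection only, not for loading to BigQuery.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from typing import Dict, Iterable


def week_start(timestamp: str) -> date:
    """Map a GitHub UTC timestamp (e.g. 2023-02-19T19:14:01Z) to the Sunday that starts its week."""
    day = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%SZ').date()
    # date.weekday(): Monday = 0 ... Sunday = 6, so Sunday maps to 0 days back
    days_since_sunday = (day.weekday() + 1) % 7
    return day - timedelta(days=days_since_sunday)


def weekly_counts(timestamps: Iterable[str]) -> Dict[date, int]:
    """Count commits per Sunday-based week."""
    return dict(Counter(week_start(t) for t in timestamps))


# Hypothetical timestamps: one late on a Saturday, two on the following Sunday
sample = ['2023-02-18T23:59:59Z', '2023-02-19T00:00:00Z', '2023-02-19T19:14:01Z']
for week, count in sorted(weekly_counts(sample).items()):
    print(week, count)
```

Commits that fall close to a week boundary are the first place to look when the weekly summaries and the commit-level data disagree.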

In [196]:
# does weekly_commits match the data in commits?
query = f"""
WITH
  weekly_commits AS (
    SELECT
      EXTRACT(DATE FROM TIMESTAMP_SECONDS(week)) AS week,
      commits
    FROM `{BQ_PROJECT}.{BQ_DATASET}.weekly_commits`
    ORDER BY week
  ),
  commits AS (
    SELECT
      DATE_TRUNC(EXTRACT(DATE FROM TIMESTAMP(datetime)), WEEK) AS week,
      #EXTRACT(DATE FROM TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(datetime), WEEK)) AS week,
      count(*) AS commits
    FROM `{BQ_PROJECT}.{BQ_DATASET}.commits`
    GROUP BY week
    ORDER BY week 
  ),
  combo AS (
    SELECT week, weekly_commits.commits as weekly_commits, commits.commits as commits, weekly_commits.commits - commits.commits as diff
    FROM weekly_commits
    JOIN commits
    USING(week)
  ),
  diffs AS (
    SELECT *
    FROM combo
    WHERE diff != 0
  )
SELECT *
FROM diffs
#JOIN combo
#USING(week)
"""
bq.query(query).to_dataframe()

| | week | weekly_commits | commits | diff |
|---|---|---|---|---|
| 0 | 2023-01-29 | 15.0 | 17 | -2.0 |
| 1 | 2023-01-22 | 9.0 | 11 | -2.0 |
| 2 | 2023-01-08 | 8.0 | 9 | -1.0 |
| 3 | 2022-10-16 | 14.0 | 15 | -1.0 |
| 4 | 2022-08-21 | 17.0 | 18 | -1.0 |
| 5 | 2022-05-22 | 13.0 | 14 | -1.0 |
| 6 | 2022-03-13 | 14.0 | 16 | -2.0 |


In [199]:
# does author_weekly_commits match the data in commits?
query = f"""
WITH
  author_weekly_commits AS (
    SELECT
      author, 
      EXTRACT(DATE FROM TIMESTAMP_SECONDS(week)) AS week,
      commits
    FROM `{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits`
    ORDER BY author, week
  ),
  commits AS (
    SELECT
      author,
      DATE_TRUNC(EXTRACT(DATE FROM TIMESTAMP(datetime)), WEEK) AS week,
      #EXTRACT(DATE FROM TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(datetime), WEEK)) AS week,
      count(*) AS commits
    FROM `{BQ_PROJECT}.{BQ_DATASET}.commits`
    GROUP BY author, week
    ORDER BY author, week 
  ),
  combo AS (
    SELECT author, week, author_weekly_commits.commits as author_weekly_commits, commits.commits as commits, author_weekly_commits.commits - commits.commits as diff
    FROM author_weekly_commits
    JOIN commits
    USING(author, week)
  ),
  diffs AS (
    SELECT *
    FROM combo
    WHERE diff != 0
  )
SELECT *
FROM diffs
#JOIN combo
#USING(week)
"""
bq.query(query).to_dataframe()

| | author | week | author_weekly_commits | commits | diff |
|---|---|---|---|---|---|
| 0 | statmike | 2021-03-28 | 0 | 8 | -8 |
| 1 | statmike | 2022-05-15 | 0 | 2 | -2 |
| 2 | statmike | 2021-08-08 | 0 | 4 | -4 |
| 3 | statmike | 2021-07-18 | 0 | 4 | -4 |
| 4 | statmike | 2022-03-27 | 0 | 7 | -7 |
| 5 | statmike | 2021-07-04 | 0 | 9 | -9 |
| 6 | statmike | 2022-04-17 | 0 | 2 | -2 |
| 7 | statmike | 2021-04-04 | 0 | 12 | -12 |
| 8 | statmike | 2021-06-20 | 0 | 1 | -1 |
| 9 | statmike | 2022-04-10 | 0 | 4 | -4 |


In [201]:
# total commits in weekly_commits
# recall that this table only has history for 52 weeks = 1 year
query = f"""
SELECT SUM(commits)
FROM `{BQ_PROJECT}.{BQ_DATASET}.weekly_commits`
"""
bq.query(query).to_dataframe()

| | f0_ |
|---|---|
| 0 | 503.0 |


In [202]:
# total commits in author_weekly_commits
query = f"""
SELECT SUM(commits)
FROM `{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits`
"""
bq.query(query).to_dataframe()

| | f0_ |
|---|---|
| 0 | 579 |


In [203]:
# total commits in commits
query = f"""
SELECT count(*)
FROM `{BQ_PROJECT}.{BQ_DATASET}.commits`
"""
bq.query(query).to_dataframe()

| | f0_ |
|---|---|
| 0 | 709 |


In [205]:
# total commits in commits_files
query = f"""
SELECT count(*)
FROM
(SELECT DISTINCT sha FROM `{BQ_PROJECT}.{BQ_DATASET}.commits_files`)
"""
bq.query(query).to_dataframe()

| | f0_ |
|---|---|
| 0 | 709 |


## Decision

Keep only `commits` and `commits_files`: they are more detailed, hold the full history of the repository, and include commits that are part of pull requests.  The load logic also picks up commits that arrive late, for example commits pulled from branches, other users, or local clones days after they were committed.

In [206]:
bq.delete_table(f"{BQ_PROJECT}.{BQ_DATASET}.weekly_commits", not_found_ok = True)

In [207]:
bq.delete_table(f"{BQ_PROJECT}.{BQ_DATASET}.author_weekly_commits", not_found_ok = True)

In [209]:
for t in list(bq.list_tables(f"{BQ_PROJECT}.{BQ_DATASET}")):
  print(t.full_table_id)

vertex-ai-mlops-369716:github_metrics.commits
vertex-ai-mlops-369716:github_metrics.commits_files
