
# GitHub Traffic For /statmike/vertex-ai-mlops

Using the [GitHub API](https://docs.github.com/en/rest/metrics/statistics?apiVersion=2022-11-28) to:
- get traffic data

**Notes:**

The API offers different ways to retrieve commit information.  These are different summaries of the same information.
- `/commits` retrieves all commits for a repository, but not files
- `/commits/{sha}`, where `sha` is the SHA of a commit, retrieves the files associated with that individual commit
- `/stats/contributors` is a weekly summary of commits, additions, and deletions for each contributor
- `/stats/code_frequency` is a weekly history of additions and deletions, summarized
- `/stats/commit_activity` has the last year of weekly data for commits


The most granular level uses `/commits` and then `/commits/{sha}` to retrieve timestamped, file-level information within commits.  This data could then be rolled up to all other levels in reporting.


Approach notes:
- I prefer not to convert dates/times in pandas and instead do this as a step in BigQuery.  Why? Loading a dataframe to BigQuery has a middle layer where the data gets serialized and transferred.  That middle step is another set of format conversions that can affect dates/times.  It can cause errors when later appending to the same BigQuery tables, even when the dataframe matches the original identically. A -> B -> C is not the same as A -> B|C -> C.
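
As a minimal sketch of this approach, the snippet below only builds the SQL text that would parse the string column inside BigQuery. The helper name `timestamp_conversion_sql`, the table name `github_metrics.views_raw`, and the format string are assumptions for illustration and are not defined elsewhere in this notebook.

```python
def timestamp_conversion_sql(table, column='timestamp'):
    """Build a BigQuery query that parses an ISO-8601 string column into a TIMESTAMP.

    Hypothetical helper: the dataframe keeps `column` as plain strings
    (for example '2023-02-09T00:00:00Z') and the conversion happens in SQL.
    """
    # The literal 'T' and 'Z' in the format match the strings returned by the GitHub API.
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return (
        f"SELECT * EXCEPT(`{column}`), "
        f"PARSE_TIMESTAMP('{fmt}', `{column}`) AS `{column}` "
        f"FROM `{table}`"
    )

# Assumed table name, for illustration only.
query = timestamp_conversion_sql('github_metrics.views_raw')
print(query)
```

The resulting string could be passed to `bq.query(...)`; keeping the column as a string in pandas means the load step has no date/time conversion to get wrong.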

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/architectures/tracking/setup/github/GitHub%20Metrics%20-%201%20-%20Traffic%20-%20Initial%20Creation.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [50]:
PROJECT_ID = 'vertex-ai-mlops-369716' # replace with project ID

In [51]:
try:
    import google.colab
    try:
      from google.cloud import secretmanager
    except ImportError:
      !pip install google-cloud-secret-manager -q
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

Updated property [core/project].


---
## Setup

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'

github_user = 'statmike'
github_repo = 'vertex-ai-mlops'

BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'github_metrics'

In [38]:
import requests
import json
import time
from datetime import datetime
import pandas as pd
import numpy as np
from io import StringIO
import os, shutil

from google.cloud import bigquery
from google.cloud import secretmanager

In [4]:
bq = bigquery.Client(project = PROJECT_ID)
secret_client = secretmanager.SecretManagerServiceClient()

In [5]:
secret = secret_client.access_secret_version(request = {"name": f'projects/{PROJECT_ID}/secrets/github_api/versions/latest'})
pat = secret.payload.data.decode('utf-8')

---
## GitHub API

Define the API URL for the user and repository.  Create a helper function that makes GET requests to API addresses and, if it receives a 202 response (request accepted, result not ready yet), retries every 10 seconds until the response is no longer 202 (typically a 200, a successful response).

In [6]:
github_api_url = f'https://api.github.com/repos/{github_user}/{github_repo}'

In [85]:
def metric_get(metric_type, query_parameters = ''):
    # build the request once so the retry below repeats the identical call
    url = f'{github_api_url}/{metric_type}{query_parameters}'
    headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'}
    response = requests.get(url, headers = headers)
    # 202 = accepted but not ready yet: wait and ask again
    while response.status_code == 202:
        time.sleep(10)
        response = requests.get(url, headers = headers)
    return response
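
The retry loop above keeps going for as long as the API answers 202. A bounded variant might look like the sketch below; `get_with_retry`, its parameters, and `FakeResponse` are hypothetical names introduced here so the logic can run without network access.

```python
import time

def get_with_retry(fetch, max_attempts=5, wait_seconds=10, sleep=time.sleep):
    """Call `fetch()` until it returns a non-202 response or attempts run out.

    `fetch` is any zero-argument callable returning an object with a
    `status_code` attribute (for example a lambda wrapping `requests.get`).
    """
    response = fetch()
    attempts = 1
    while response.status_code == 202 and attempts < max_attempts:
        sleep(wait_seconds)
        response = fetch()
        attempts += 1
    return response

class FakeResponse:
    """Stand-in for a requests.Response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code

# Two "not ready yet" answers followed by a success; sleeping is skipped in the demo.
codes = iter([202, 202, 200])
result = get_with_retry(lambda: FakeResponse(next(codes)), sleep=lambda seconds: None)
print(result.status_code)
```

If every attempt returns 202, the last 202 response is returned, so the caller still needs to check `status_code`.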

---
## Data Exploration

The following subsections retrieve and format data from different parts of the API related to traffic (clones, popular paths, referrers, views) and to the repository's stargazers, forks, and watchers.

### /traffic/clones
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones
- 14 day history of clones
- schema:
    - count = total clones for the window
    - uniques = unique cloners across the window (not the sum of daily uniques)
    - clones:
        - timestamp = midnight of day (start of day)
        - count = daily count
        - uniques = daily unique count
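
To make the schema concrete, here is a small sketch that flattens a payload of this shape into daily rows and attaches the window-level unique count to the most recent day only. The function name `flatten_traffic` and the totals in `sample` are made up for illustration; they are not values returned by the API.

```python
def flatten_traffic(payload, series_key):
    """Return the daily rows with the window-level unique count on the last row only."""
    rows = [{**day, '14day_uniques': None} for day in payload[series_key]]
    if rows:
        rows[-1]['14day_uniques'] = payload['uniques']
    return rows

# Illustrative payload in the documented shape; the totals are invented.
sample = {
    'count': 48,
    'uniques': 25,
    'clones': [
        {'timestamp': '2023-02-09T00:00:00Z', 'count': 20, 'uniques': 12},
        {'timestamp': '2023-02-10T00:00:00Z', 'count': 28, 'uniques': 17},
    ],
}
rows = flatten_traffic(sample, 'clones')
for row in rows:
    print(row)
```

The same function would apply to `/traffic/views` by passing `'views'` as `series_key`.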

In [86]:
metric_type = 'traffic/clones'
response = metric_get(metric_type)
response.status_code

200

In [87]:
#json.loads(response.text)

In [88]:
clones = pd.DataFrame(json.loads(response.text)['clones'])
clones['14day_uniques'] = np.nan
# assign through a single .loc call; the chained form ['14day_uniques'].iloc[-1] = ... raises SettingWithCopyWarning
clones.loc[clones.index[-1], '14day_uniques'] = json.loads(response.text)['uniques']
clones


Unnamed: 0,timestamp,count,uniques,14day_uniques
0,2023-02-09T00:00:00Z,20,12,
1,2023-02-10T00:00:00Z,28,17,
2,2023-02-11T00:00:00Z,10,6,
3,2023-02-12T00:00:00Z,9,6,
4,2023-02-13T00:00:00Z,6,6,
5,2023-02-14T00:00:00Z,29,7,
6,2023-02-15T00:00:00Z,13,8,
7,2023-02-16T00:00:00Z,20,19,
8,2023-02-17T00:00:00Z,3,2,
9,2023-02-18T00:00:00Z,14,6,


### /traffic/popular/paths
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-top-referral-paths
- top 10 documents for past 14 days

In [47]:
metric_type = 'traffic/popular/paths'
response = metric_get(metric_type)
response.status_code

200

In [48]:
json.loads(response.text)

[{'path': '/statmike/vertex-ai-mlops',
  'title': 'statmike/vertex-ai-mlops: Google Cloud Platform Vertex AI end-to-end workflow...',
  'count': 588,
  'uniques': 223},
 {'path': '/statmike/vertex-ai-mlops/blob/main/00%20-%20Setup/00%20-%20Environment%20Setup.ipynb',
  'title': 'vertex-ai-mlops/00 - Environment Setup.ipynb at main · statmike/vertex-ai-mlops',
  'count': 73,
  'uniques': 43},
 {'path': '/statmike/vertex-ai-mlops/tree/main/00%20-%20Setup',
  'title': 'vertex-ai-mlops/00 - Setup at main · statmike/vertex-ai-mlops · GitHub',
  'count': 68,
  'uniques': 48},
 {'path': '/statmike/vertex-ai-mlops/tree/main/02%20-%20Vertex%20AI%20AutoML',
  'title': 'vertex-ai-mlops/02 - Vertex AI AutoML at main · statmike/vertex-ai-mlops · Gi...',
  'count': 63,
  'uniques': 42},
 {'path': '/statmike/vertex-ai-mlops/tree/main/04%20-%20scikit-learn',
  'title': 'vertex-ai-mlops/04 - scikit-learn at main · statmike/vertex-ai-mlops · GitHub',
  'count': 60,
  'uniques': 35},
 {'path': '/statmike

In [49]:
paths = pd.DataFrame(json.loads(response.text))
paths

Unnamed: 0,path,title,count,uniques
0,/statmike/vertex-ai-mlops,statmike/vertex-ai-mlops: Google Cloud Platfor...,588,223
1,/statmike/vertex-ai-mlops/blob/main/00%20-%20S...,vertex-ai-mlops/00 - Environment Setup.ipynb a...,73,43
2,/statmike/vertex-ai-mlops/tree/main/00%20-%20S...,vertex-ai-mlops/00 - Setup at main · statmike/...,68,48
3,/statmike/vertex-ai-mlops/tree/main/02%20-%20V...,vertex-ai-mlops/02 - Vertex AI AutoML at main ...,63,42
4,/statmike/vertex-ai-mlops/tree/main/04%20-%20s...,vertex-ai-mlops/04 - scikit-learn at main · st...,60,35
5,/statmike/vertex-ai-mlops/blob/main/01%20-%20D...,vertex-ai-mlops/01 - BigQuery - Table Data Sou...,52,30
6,/statmike/vertex-ai-mlops/tree/main/01%20-%20D...,vertex-ai-mlops/01 - Data Sources at main · st...,48,29
7,/statmike/vertex-ai-mlops/tree/main/05%20-%20T...,vertex-ai-mlops/05 - TensorFlow at main · stat...,47,30
8,/statmike/vertex-ai-mlops/blob/main/architectu...,vertex-ai-mlops/05_overview.png at main · stat...,36,19
9,/statmike/vertex-ai-mlops/tree/main/03%20-%20B...,vertex-ai-mlops/03 - BigQuery ML (BQML) at mai...,35,21


In [50]:
# remove title
# parse path: nothing after the repository name indicates readme.md, otherwise remove /blob/main and handle the URL encoding
# add today's date (or yesterday's?)
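
One possible reading of the path-parsing note above, as a sketch: strip the repository prefix, treat an empty remainder as the README, drop a leading `blob/main/` or `tree/main/`, and decode the percent-encoding. The function name `normalize_path`, the decision to decode rather than encode, and the handling of `tree/main` are assumptions, not the notebook's settled design.

```python
from urllib.parse import unquote

def normalize_path(path, repo_prefix='/statmike/vertex-ai-mlops'):
    """Reduce a GitHub traffic path to a repository-relative, decoded path."""
    remainder = path[len(repo_prefix):] if path.startswith(repo_prefix) else path
    remainder = remainder.strip('/')
    if not remainder:
        # The repository root is the page that renders the README.
        return 'readme.md'
    for marker in ('blob/main/', 'tree/main/'):
        if remainder.startswith(marker):
            remainder = remainder[len(marker):]
            break
    return unquote(remainder)

print(normalize_path('/statmike/vertex-ai-mlops'))
print(normalize_path('/statmike/vertex-ai-mlops/tree/main/04%20-%20scikit-learn'))
```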

### /traffic/popular/referrers
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-top-referral-sources
- top 10 referring sites over past 14 days

In [51]:
metric_type = 'traffic/popular/referrers'
response = metric_get(metric_type)
response.status_code

200

In [52]:
json.loads(response.text)

[{'referrer': 'youtube.com', 'count': 490, 'uniques': 117},
 {'referrer': 'github.com', 'count': 199, 'uniques': 41},
 {'referrer': 'Google', 'count': 187, 'uniques': 60},
 {'referrer': 'statics.teams.cdn.office.net', 'count': 10, 'uniques': 2},
 {'referrer': 'notebooks.githubusercontent.com', 'count': 7, 'uniques': 4},
 {'referrer': 'm.youtube.com', 'count': 6, 'uniques': 1},
 {'referrer': 'mail.google.com', 'count': 1, 'uniques': 1},
 {'referrer': 'colab.research.google.com', 'count': 1, 'uniques': 1}]

In [53]:
referrer = pd.DataFrame(json.loads(response.text))
referrer

Unnamed: 0,referrer,count,uniques
0,youtube.com,490,117
1,github.com,199,41
2,Google,187,60
3,statics.teams.cdn.office.net,10,2
4,notebooks.githubusercontent.com,7,4
5,m.youtube.com,6,1
6,mail.google.com,1,1
7,colab.research.google.com,1,1


In [54]:
# add today's date (or yesterday's?)
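
Because this endpoint returns a rolling 14-day summary with no timestamp of its own, each retrieval needs a date attached before it is stored. A minimal sketch follows; the helper `stamp_rows`, the field name `snapshot_date`, and the example date are assumptions for illustration.

```python
from datetime import date, datetime, timedelta, timezone

def stamp_rows(rows, snapshot_date=None):
    """Return copies of `rows` with a `snapshot_date` field (ISO date string)."""
    if snapshot_date is None:
        # Default to the current UTC date, since the API reports days in UTC.
        snapshot_date = datetime.now(timezone.utc).date()
    return [{**row, 'snapshot_date': snapshot_date.isoformat()} for row in rows]

referrers = [
    {'referrer': 'youtube.com', 'count': 490, 'uniques': 117},
    {'referrer': 'github.com', 'count': 199, 'uniques': 41},
]
# Illustrative retrieval date; subtract one day to stamp with "yesterday" instead.
retrieved_on = date(2023, 2, 22)
stamped_today = stamp_rows(referrers, retrieved_on)
stamped_yesterday = stamp_rows(referrers, retrieved_on - timedelta(days=1))
print(stamped_today[0])
```

Whether "today" or "yesterday" is the better label depends on whether the current, partial day should count as part of the window; that choice is left open here, as in the note above.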

### /traffic/views
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-page-views
- daily views for last 14 days
- schema:
    - count = total views for last 2 weeks (sum of daily)
    - uniques = total unique over 14 days (not sum of daily)
    - views:
        - timestamp = daily at midnight
        - count = daily count
        - uniques = daily unique count

In [55]:
metric_type = 'traffic/views'
response = metric_get(metric_type)
response.status_code

200

In [56]:
json.loads(response.text)

{'count': 2224,
 'uniques': 324,
 'views': [{'timestamp': '2023-02-09T00:00:00Z', 'count': 155, 'uniques': 43},
  {'timestamp': '2023-02-10T00:00:00Z', 'count': 185, 'uniques': 45},
  {'timestamp': '2023-02-11T00:00:00Z', 'count': 78, 'uniques': 17},
  {'timestamp': '2023-02-12T00:00:00Z', 'count': 90, 'uniques': 20},
  {'timestamp': '2023-02-13T00:00:00Z', 'count': 219, 'uniques': 43},
  {'timestamp': '2023-02-14T00:00:00Z', 'count': 176, 'uniques': 48},
  {'timestamp': '2023-02-15T00:00:00Z', 'count': 118, 'uniques': 37},
  {'timestamp': '2023-02-16T00:00:00Z', 'count': 162, 'uniques': 35},
  {'timestamp': '2023-02-17T00:00:00Z', 'count': 157, 'uniques': 38},
  {'timestamp': '2023-02-18T00:00:00Z', 'count': 87, 'uniques': 18},
  {'timestamp': '2023-02-19T00:00:00Z', 'count': 82, 'uniques': 18},
  {'timestamp': '2023-02-20T00:00:00Z', 'count': 255, 'uniques': 43},
  {'timestamp': '2023-02-21T00:00:00Z', 'count': 260, 'uniques': 51},
  {'timestamp': '2023-02-22T00:00:00Z', 'count': 200

In [57]:
views = pd.DataFrame(json.loads(response.text)['views'])
views['14day_uniques'] = np.nan
# assign through a single .loc call; the chained form ['14day_uniques'].iloc[-1] = ... raises SettingWithCopyWarning
views.loc[views.index[-1], '14day_uniques'] = json.loads(response.text)['uniques']
views


Unnamed: 0,timestamp,count,uniques,14day_uniques
0,2023-02-09T00:00:00Z,155,43,
1,2023-02-10T00:00:00Z,185,45,
2,2023-02-11T00:00:00Z,78,17,
3,2023-02-12T00:00:00Z,90,20,
4,2023-02-13T00:00:00Z,219,43,
5,2023-02-14T00:00:00Z,176,48,
6,2023-02-15T00:00:00Z,118,37,
7,2023-02-16T00:00:00Z,162,35,
8,2023-02-17T00:00:00Z,157,38,
9,2023-02-18T00:00:00Z,87,18,


### /stargazers
- https://docs.github.com/en/rest/activity/starring?apiVersion=2022-11-28#list-stargazers
- list of current users who have starred the repository

In [92]:
metric_type = 'stargazers'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1
len(raw)

148
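
The loop above stops when a page comes back with fewer than 100 items. It assumes every response is a successful JSON list; an error response would be a dict and would be added to `raw` silently. A sketch of the same pagination idea with a status check is below, where `fetch_all_pages` and `fetch_page` are hypothetical names and `fake_page` stands in for a call such as `metric_get`.

```python
def fetch_all_pages(fetch_page, per_page=100):
    """Collect items from a paginated endpoint.

    `fetch_page(page, per_page)` is a hypothetical callable returning
    `(status_code, items)` for one page.
    """
    items, page = [], 1
    while True:
        status_code, batch = fetch_page(page, per_page)
        if status_code != 200:
            raise RuntimeError(f'page {page} returned status {status_code}')
        items.extend(batch)
        # A short (or empty) page means there is nothing left to request.
        if len(batch) < per_page:
            return items
        page += 1

# Fake data source with 250 entries, used only to exercise the loop.
data = [{'login': f'user{i}'} for i in range(250)]

def fake_page(page, per_page):
    start = (page - 1) * per_page
    return 200, data[start:start + per_page]

everyone = fetch_all_pages(fake_page)
print(len(everyone))
```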

In [93]:
raw[0]

{'login': 'newcooldiscoveries',
 'id': 68378011,
 'node_id': 'MDQ6VXNlcjY4Mzc4MDEx',
 'avatar_url': 'https://avatars.githubusercontent.com/u/68378011?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/newcooldiscoveries',
 'html_url': 'https://github.com/newcooldiscoveries',
 'followers_url': 'https://api.github.com/users/newcooldiscoveries/followers',
 'following_url': 'https://api.github.com/users/newcooldiscoveries/following{/other_user}',
 'gists_url': 'https://api.github.com/users/newcooldiscoveries/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/newcooldiscoveries/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/newcooldiscoveries/subscriptions',
 'organizations_url': 'https://api.github.com/users/newcooldiscoveries/orgs',
 'repos_url': 'https://api.github.com/users/newcooldiscoveries/repos',
 'events_url': 'https://api.github.com/users/newcooldiscoveries/events{/privacy}',
 'received_events_url': 'https://api.github.com/us

In [94]:
stars = pd.DataFrame(raw)['login']
stars

0      newcooldiscoveries
1                giranntu
2                 sinanek
3             amith-ajith
4                 rsavoie
              ...        
143           JosephDavis
144                dunncw
145        PeterGolovatyi
146        littlefish0331
147                   bx2
Name: login, Length: 148, dtype: object

In [63]:
# add columns for first seen, last seen
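
The endpoint only lists current stargazers, so first-seen and last-seen dates have to be maintained across retrievals. One way this could be tracked is sketched below with plain dictionaries; `update_seen`, the placeholder logins, and the dates are all hypothetical.

```python
def update_seen(history, current_logins, snapshot_date):
    """Return a new {login: {'first_seen', 'last_seen'}} mapping after one snapshot."""
    # Copy so the previous history is left untouched.
    updated = {login: dict(dates) for login, dates in history.items()}
    for login in current_logins:
        if login in updated:
            updated[login]['last_seen'] = snapshot_date
        else:
            updated[login] = {'first_seen': snapshot_date, 'last_seen': snapshot_date}
    return updated

# Placeholder history and snapshot, for illustration only.
history = {
    'user_a': {'first_seen': '2023-02-01', 'last_seen': '2023-02-01'},
    'user_b': {'first_seen': '2023-02-01', 'last_seen': '2023-02-01'},
}
latest = update_seen(history, ['user_a', 'user_c'], '2023-02-22')
for login, dates in latest.items():
    print(login, dates)
```

A login missing from the current list keeps its old `last_seen`, which is how an un-star would show up in the data. The same pattern would apply to forks and watchers.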

### /forks
- https://docs.github.com/en/rest/repos/forks?apiVersion=2022-11-28#list-forks
- list of current forks of main repository

In [95]:
metric_type = 'forks'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1
len(raw)

73

In [96]:
raw[0]

{'id': 605249844,
 'node_id': 'R_kgDOJBNhNA',
 'name': 'vertex-ai-mlops',
 'full_name': 'yfumero/vertex-ai-mlops',
 'private': False,
 'owner': {'login': 'yfumero',
  'id': 36447627,
  'node_id': 'MDQ6VXNlcjM2NDQ3NjI3',
  'avatar_url': 'https://avatars.githubusercontent.com/u/36447627?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/yfumero',
  'html_url': 'https://github.com/yfumero',
  'followers_url': 'https://api.github.com/users/yfumero/followers',
  'following_url': 'https://api.github.com/users/yfumero/following{/other_user}',
  'gists_url': 'https://api.github.com/users/yfumero/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/yfumero/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/yfumero/subscriptions',
  'organizations_url': 'https://api.github.com/users/yfumero/orgs',
  'repos_url': 'https://api.github.com/users/yfumero/repos',
  'events_url': 'https://api.github.com/users/yfumero/events{/privacy}',
  'received_e

In [97]:
forks = []
for f in raw:
    forks += [{
        'name': f['name'],
        'full_name': f['full_name'],
        'owner': f['owner']['login'],
        'stars': f['stargazers_count'],
        'watchers': f['watchers_count'],
        'forks': f['forks_count']
    }]
forks = pd.DataFrame(forks)
forks

Unnamed: 0,name,full_name,owner,stars,watchers,forks
0,vertex-ai-mlops,yfumero/vertex-ai-mlops,yfumero,0,0,0
1,vertex-ai-mlops,ivanmkc/vertex-ai-mlops,ivanmkc,0,0,0
2,vertex-ai-mlops,xjaztek/vertex-ai-mlops,xjaztek,0,0,0
3,vertex-ai-mlops,praneethkumar4/vertex-ai-mlops,praneethkumar4,0,0,0
4,vertex-ai-mlops,psod18/vertex-ai-mlops,psod18,0,0,0
...,...,...,...,...,...,...
68,vertex-ai-mlops,danielnguyen-ds/vertex-ai-mlops,danielnguyen-ds,0,0,0
69,vertex-ai-mlops,justinjm/vertex-ai-mlops,justinjm,0,0,0
70,vertex-ai-mlops,motconmeobuon/vertex-ai-mlops,motconmeobuon,0,0,0
71,vertex-ai-mlops,ANN-KOREA/vertex-ai-mlops,ANN-KOREA,0,0,0


In [72]:
# add columns for first seen, last seen

### /subscribers
- https://docs.github.com/en/rest/activity/watching?apiVersion=2022-11-28#list-watchers
- list of watchers for repository

In [98]:
metric_type = 'subscribers'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1
len(raw)

12

In [99]:
raw[0]

{'login': 'statmike',
 'id': 17235991,
 'node_id': 'MDQ6VXNlcjE3MjM1OTkx',
 'avatar_url': 'https://avatars.githubusercontent.com/u/17235991?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/statmike',
 'html_url': 'https://github.com/statmike',
 'followers_url': 'https://api.github.com/users/statmike/followers',
 'following_url': 'https://api.github.com/users/statmike/following{/other_user}',
 'gists_url': 'https://api.github.com/users/statmike/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/statmike/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/statmike/subscriptions',
 'organizations_url': 'https://api.github.com/users/statmike/orgs',
 'repos_url': 'https://api.github.com/users/statmike/repos',
 'events_url': 'https://api.github.com/users/statmike/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/statmike/received_events',
 'type': 'User',
 'site_admin': False}

In [100]:
watchers = pd.DataFrame(raw)['login']
watchers

0              statmike
1               sinanek
2              inardini
3        rafal-wasowski
4            majacaci00
5             hamehrabi
6     alvaroferrerrizzo
7       rmazara-kinaxis
8           slopez-lmes
9              drkostas
10              rsher60
11       shantanusharma
Name: login, dtype: object

In [101]:
# add columns for first seen, last seen

---
## Pandas Tables

---
## BigQuery Tables: Initial Creation

---
## BigQuery Tables: Increment

---
## Diagnostics