
# GitHub Traffic For /statmike/vertex-ai-mlops

Using the [GitHub API](https://docs.github.com/en/rest/metrics/statistics?apiVersion=2022-11-28) to:
- get traffic data and engagement data (stars, forks, watchers)

**Notes:**

The API offers traffic and engagement (stars, forks, watchers) data:
- `/traffic/clones`
- `/traffic/popular/paths`
- `/traffic/popular/referrers`
- `/traffic/views`
- `/stargazers`
- `/forks`
- `/subscribers`


Approach notes:
- I prefer not to convert dates/times in pandas and instead leave that step to BigQuery. Why? Loading a dataframe to BigQuery passes through a middle layer where the data is serialized and transferred. That step is another set of format conversions that can alter dates/times, and it can cause errors when later appending to the same BigQuery tables even when the dataframe matches the original exactly. In other words, A -> B -> C is not the same as A -> B|C -> C.
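
One way to defer the conversion, sketched here under assumptions: keep the `timestamp` column as the ISO 8601 string that GitHub returns and parse it in a query at read time. The helper name `typed_view_sql` and the alias `event_ts` are hypothetical and not part of this notebook; `PARSE_TIMESTAMP` is a BigQuery standard SQL function.

```python
def typed_view_sql(project: str, dataset: str, table: str) -> str:
    """Build a query that parses the stored ISO 8601 strings inside BigQuery.

    Hypothetical helper: the table is assumed to hold a STRING column named
    `timestamp` with values like '2023-02-11T00:00:00Z'.
    """
    return f"""
SELECT
  * EXCEPT(timestamp),
  PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', timestamp) AS event_ts
FROM `{project}.{dataset}.{table}`
"""

# placeholder project and dataset names
sql = typed_view_sql('my-project', 'github_metrics', 'traffic_clones')
print(sql)
```

The dataframe then only ever carries strings, so repeated appends do not depend on how pandas types are serialized during the load.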

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/architectures/tracking/setup/github/GitHub%20Metrics%20-%201%20-%20Traffic%20-%20Initial%20Creation.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'vertex-ai-mlops-369716' # replace with project ID

In [2]:
try:
    import google.colab
    try:
      from google.cloud import secretmanager
    except ImportError:
      !pip install google-cloud-secret-manager -q
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

Updated property [core/project].


---
## Setup

In [3]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'vertex-ai-mlops-369716'

In [4]:
REGION = 'us-central1'

github_user = 'statmike'
github_repo = 'vertex-ai-mlops'

BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'github_metrics'

In [5]:
import requests
import json
import time
from datetime import datetime
import pandas as pd
import numpy as np
from io import StringIO
import os, shutil
import urllib.parse

from google.cloud import bigquery
from google.cloud import secretmanager

In [6]:
bq = bigquery.Client(project = PROJECT_ID)
secret_client = secretmanager.SecretManagerServiceClient()

In [7]:
secret = secret_client.access_secret_version(request = {"name": f'projects/{PROJECT_ID}/secrets/github_api/versions/latest'})
pat = secret.payload.data.decode('utf-8')

---
## GitHub API

Define the API URL for the user and repository.  Create a helper function that makes GET requests to API endpoints; if it receives a 202 response (request accepted, result not ready yet), it waits and retries until it receives a different status, normally 200 (success).

In [8]:
github_api_url = f'https://api.github.com/repos/{github_user}/{github_repo}'

In [9]:
def metric_get(metric_type, query_parameters = ''):
  response = requests.get(f'{github_api_url}/{metric_type}{query_parameters}', headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'})
  while response.status_code == 202:
      time.sleep(10)
      response = requests.get(f'{github_api_url}/{metric_type}{query_parameters}', headers = {'Authorization': f'Bearer {pat}', 'Accept': 'application/vnd.github+json'})
  return response
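
The helper above keeps retrying for as long as the API answers 202. A bounded variant could look like the sketch below. The names `get_with_retry`, `max_attempts`, and `FakeResponse` are hypothetical and only serve to make the example runnable offline; in the notebook, `fetch` would wrap the `requests.get` call.

```python
import time

def get_with_retry(fetch, max_attempts=5, wait_seconds=10, sleep=time.sleep):
    """Call fetch() until the status code is not 202 or attempts run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute, for example a lambda wrapping requests.get.
    """
    response = fetch()
    attempts = 1
    while response.status_code == 202 and attempts < max_attempts:
        sleep(wait_seconds)  # pause before asking again
        response = fetch()
        attempts += 1
    return response


class FakeResponse:
    """Stand-in for requests.Response so the sketch runs offline."""
    def __init__(self, status_code):
        self.status_code = status_code

# simulate two 202 replies followed by a 200, without actually sleeping
replies = iter([FakeResponse(202), FakeResponse(202), FakeResponse(200)])
result = get_with_retry(lambda: next(replies), sleep=lambda seconds: None)
print(result.status_code)
```

If the attempts run out, the last 202 response is returned, so the caller still needs to check `status_code` before parsing the body.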

---
## Data Exploration

The following subsections retrieve and format data from the traffic and engagement parts of the API.

### /traffic/clones
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones
- 14 day history of clones
- schema:
    - count = total clones for the window
    - uniques = unique cloners across window (not the sum of daily)
    - clones:
        - timestamp = midnight of day (start of day)
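
To make the schema concrete, here is a small sketch with a made-up payload in the shape described above (the numbers are illustrative, not real traffic). It uses plain dictionaries instead of pandas so that it runs on its own, and the column name `uniques_last14days` follows the cells below.

```python
import json

# made-up payload in the shape described above (values are illustrative only)
payload = json.loads("""
{
  "count": 19,
  "uniques": 9,
  "clones": [
    {"timestamp": "2023-02-11T00:00:00Z", "count": 10, "uniques": 6},
    {"timestamp": "2023-02-12T00:00:00Z", "count": 9, "uniques": 6}
  ]
}
""")

rows = [dict(day, uniques_last14days=None) for day in payload['clones']]
# the window-level unique count is not a sum of daily uniques, so it is
# stored once, on the newest day, rather than repeated on every row
rows[-1]['uniques_last14days'] = payload['uniques']

for row in rows:
    print(row)
```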

In [10]:
metric_type = 'traffic/clones'
response = metric_get(metric_type)
response.status_code

200

In [11]:
#json.loads(response.text)

In [12]:
traffic_clones = pd.DataFrame(json.loads(response.text)['clones'])
traffic_clones['uniques_last14days'] = np.nan
# set the 14-day unique count on the newest row only; .loc avoids chained assignment
traffic_clones.loc[traffic_clones.index[-1], 'uniques_last14days'] = json.loads(response.text)['uniques']
traffic_clones['repo'] = github_user + '/' + github_repo

traffic_clones

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-11T00:00:00Z,10,6,,statmike/vertex-ai-mlops
1,2023-02-12T00:00:00Z,9,6,,statmike/vertex-ai-mlops
2,2023-02-13T00:00:00Z,6,6,,statmike/vertex-ai-mlops
3,2023-02-14T00:00:00Z,29,7,,statmike/vertex-ai-mlops
4,2023-02-15T00:00:00Z,13,8,,statmike/vertex-ai-mlops
5,2023-02-16T00:00:00Z,20,19,,statmike/vertex-ai-mlops
6,2023-02-17T00:00:00Z,3,2,,statmike/vertex-ai-mlops
7,2023-02-18T00:00:00Z,14,6,,statmike/vertex-ai-mlops
8,2023-02-19T00:00:00Z,10,6,,statmike/vertex-ai-mlops
9,2023-02-20T00:00:00Z,16,10,,statmike/vertex-ai-mlops


### /traffic/popular/paths
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-top-referral-paths
- top 10 documents for past 14 days

In [13]:
metric_type = 'traffic/popular/paths'
response = metric_get(metric_type)
response.status_code

200

In [14]:
#json.loads(response.text)

In [15]:
traffic_popular_paths = pd.DataFrame(json.loads(response.text))
traffic_popular_paths

Unnamed: 0,path,title,count,uniques
0,/statmike/vertex-ai-mlops,statmike/vertex-ai-mlops: Google Cloud Platfor...,636,237
1,/statmike/vertex-ai-mlops/blob/main/00%20-%20S...,vertex-ai-mlops/00 - Environment Setup.ipynb a...,75,43
2,/statmike/vertex-ai-mlops/tree/main/04%20-%20s...,vertex-ai-mlops/04 - scikit-learn at main · st...,72,41
3,/statmike/vertex-ai-mlops/tree/main/00%20-%20S...,vertex-ai-mlops/00 - Setup at main · statmike/...,70,46
4,/statmike/vertex-ai-mlops/tree/main/02%20-%20V...,vertex-ai-mlops/02 - Vertex AI AutoML at main ...,70,43
5,/statmike/vertex-ai-mlops/blob/main/01%20-%20D...,vertex-ai-mlops/01 - BigQuery - Table Data Sou...,59,31
6,/statmike/vertex-ai-mlops/tree/main/05%20-%20T...,vertex-ai-mlops/05 - TensorFlow at main · stat...,54,34
7,/statmike/vertex-ai-mlops/tree/main/01%20-%20D...,vertex-ai-mlops/01 - Data Sources at main · st...,48,29
8,/statmike/vertex-ai-mlops/tree/main/03%20-%20B...,vertex-ai-mlops/03 - BigQuery ML (BQML) at mai...,40,24
9,/statmike/vertex-ai-mlops/blob/main/architectu...,vertex-ai-mlops/05_overview.png at main · stat...,35,18


In [16]:
# remove title
# parse path: no file extension indicates a folder view (readme.md); otherwise remove /blob/main and URL-decode
# add today's date (or yesterday?)

In [17]:
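def parse_path(p):
    # decode URL escapes, then drop the blob/tree prefixes used for file and folder views
    p = urllib.parse.unquote(p).replace('blob/main/', '').replace('tree/main/', '')
    # no extension after the last '/' means a folder view, which renders its readme
    if p.rfind('.') == -1 or (p.rfind('.') < p.rfind('/')):
        p += '/readme.md'
    return p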
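
A quick check of the parsing rule on a few inputs may help. The paths below are hypothetical (they are not taken from the API response), and the function is restated in condensed form so the example runs on its own.

```python
import urllib.parse

def parse_path(p):
    # same rule as the cell above, restated so this example is self-contained
    p = urllib.parse.unquote(p).replace('blob/main/', '').replace('tree/main/', '')
    if p.rfind('.') == -1 or (p.rfind('.') < p.rfind('/')):
        p += '/readme.md'
    return p

# hypothetical inputs, not taken from the API response
examples = [
    '/someuser/somerepo',
    '/someuser/somerepo/tree/main/docs%20folder',
    '/someuser/somerepo/blob/main/docs%20folder/guide.ipynb',
    '/someuser/somerepo/tree/main/release-1.2',
]
parsed = [parse_path(p) for p in examples]
for before, after in zip(examples, parsed):
    print(before, '->', after)
```

The last input shows a limitation of the rule: a folder whose name contains a dot is treated as a file, so no `/readme.md` is appended.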

In [18]:
traffic_popular_paths['file'] = traffic_popular_paths.apply(lambda x: parse_path(x['path']), axis = 1)
traffic_popular_paths = traffic_popular_paths.drop(['title', 'path'], axis = 1)
traffic_popular_paths['timestamp'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z") #strftime("%Y-%m-%dT%H:%M:%SZ")
traffic_popular_paths['repo'] = github_user + '/' + github_repo
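
One detail worth noting: `datetime.now()` returns local time, while the trailing `Z` in the format string denotes UTC. If the notebook runs somewhere that is not set to UTC, the stamped date can be off by one day around midnight. A minimal sketch of a UTC-based stamp, assuming that UTC dates are what is wanted here:

```python
from datetime import datetime, timezone

# midnight-of-today stamp computed from the UTC date rather than the local date
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT00:00:00Z")
print(stamp)
```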

In [19]:
list(traffic_popular_paths['file'])

['/statmike/vertex-ai-mlops/readme.md',
 '/statmike/vertex-ai-mlops/00 - Setup/00 - Environment Setup.ipynb',
 '/statmike/vertex-ai-mlops/04 - scikit-learn/readme.md',
 '/statmike/vertex-ai-mlops/00 - Setup/readme.md',
 '/statmike/vertex-ai-mlops/02 - Vertex AI AutoML/readme.md',
 '/statmike/vertex-ai-mlops/01 - Data Sources/01 - BigQuery - Table Data Source.ipynb',
 '/statmike/vertex-ai-mlops/05 - TensorFlow/readme.md',
 '/statmike/vertex-ai-mlops/01 - Data Sources/readme.md',
 '/statmike/vertex-ai-mlops/03 - BigQuery ML (BQML)/readme.md',
 '/statmike/vertex-ai-mlops/architectures/overview/05_overview.png']

In [20]:
traffic_popular_paths

Unnamed: 0,count,uniques,file,timestamp,repo
0,636,237,/statmike/vertex-ai-mlops/readme.md,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
1,75,43,/statmike/vertex-ai-mlops/00 - Setup/00 - Envi...,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
2,72,41,/statmike/vertex-ai-mlops/04 - scikit-learn/re...,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
3,70,46,/statmike/vertex-ai-mlops/00 - Setup/readme.md,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
4,70,43,/statmike/vertex-ai-mlops/02 - Vertex AI AutoM...,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
5,59,31,/statmike/vertex-ai-mlops/01 - Data Sources/01...,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
6,54,34,/statmike/vertex-ai-mlops/05 - TensorFlow/read...,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
7,48,29,/statmike/vertex-ai-mlops/01 - Data Sources/re...,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
8,40,24,/statmike/vertex-ai-mlops/03 - BigQuery ML (BQ...,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
9,35,18,/statmike/vertex-ai-mlops/architectures/overvi...,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops


### /traffic/popular/referrers
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-top-referral-sources
- top 10 referring sites over past 14 days

In [21]:
metric_type = 'traffic/popular/referrers'
response = metric_get(metric_type)
response.status_code

200

In [22]:
#json.loads(response.text)

In [23]:
traffic_popular_referrers = pd.DataFrame(json.loads(response.text))
traffic_popular_referrers

Unnamed: 0,referrer,count,uniques
0,youtube.com,541,124
1,github.com,234,41
2,Google,215,62
3,notebooks.githubusercontent.com,11,6
4,statics.teams.cdn.office.net,10,2
5,m.youtube.com,5,1
6,mail.google.com,2,2
7,colab.research.google.com,1,1


In [24]:
# add todays date (or yesterday?)

In [25]:
traffic_popular_referrers['timestamp'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z") #strftime("%Y-%m-%dT%H:%M:%SZ")
traffic_popular_referrers['repo'] = github_user + '/' + github_repo

traffic_popular_referrers

Unnamed: 0,referrer,count,uniques,timestamp,repo
0,youtube.com,541,124,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
1,github.com,234,41,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
2,Google,215,62,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
3,notebooks.githubusercontent.com,11,6,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
4,statics.teams.cdn.office.net,10,2,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
5,m.youtube.com,5,1,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
6,mail.google.com,2,2,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops
7,colab.research.google.com,1,1,2023-02-24T00:00:00Z,statmike/vertex-ai-mlops


### /traffic/views
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-page-views
- daily views for last 14 days
- schema:
    - count = total views for last 2 weeks (sum of daily)
    - uniques = total unique over 14 days (not sum of daily)
    - views:
        - timestamp - daily at midnight
        - count = daily count
        - uniques = daily unique count

In [26]:
metric_type = 'traffic/views'
response = metric_get(metric_type)
response.status_code

200

In [27]:
#json.loads(response.text)

In [28]:
traffic_views = pd.DataFrame(json.loads(response.text)['views'])
traffic_views['uniques_last14days'] = np.nan
# set the 14-day unique count on the newest row only; .loc avoids chained assignment
traffic_views.loc[traffic_views.index[-1], 'uniques_last14days'] = json.loads(response.text)['uniques']
traffic_views

Unnamed: 0,timestamp,count,uniques,uniques_last14days
0,2023-02-10T00:00:00Z,1,1,
1,2023-02-11T00:00:00Z,78,17,
2,2023-02-12T00:00:00Z,90,20,
3,2023-02-13T00:00:00Z,219,43,
4,2023-02-14T00:00:00Z,176,48,
5,2023-02-15T00:00:00Z,118,37,
6,2023-02-16T00:00:00Z,162,35,
7,2023-02-17T00:00:00Z,157,38,
8,2023-02-18T00:00:00Z,87,18,
9,2023-02-19T00:00:00Z,82,18,


In [29]:
traffic_views['repo'] = github_user + '/' + github_repo

traffic_views

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-10T00:00:00Z,1,1,,statmike/vertex-ai-mlops
1,2023-02-11T00:00:00Z,78,17,,statmike/vertex-ai-mlops
2,2023-02-12T00:00:00Z,90,20,,statmike/vertex-ai-mlops
3,2023-02-13T00:00:00Z,219,43,,statmike/vertex-ai-mlops
4,2023-02-14T00:00:00Z,176,48,,statmike/vertex-ai-mlops
5,2023-02-15T00:00:00Z,118,37,,statmike/vertex-ai-mlops
6,2023-02-16T00:00:00Z,162,35,,statmike/vertex-ai-mlops
7,2023-02-17T00:00:00Z,157,38,,statmike/vertex-ai-mlops
8,2023-02-18T00:00:00Z,87,18,,statmike/vertex-ai-mlops
9,2023-02-19T00:00:00Z,82,18,,statmike/vertex-ai-mlops


### /stargazers
- https://docs.github.com/en/rest/activity/starring?apiVersion=2022-11-28#list-stargazers
- list of current users who have starred the repository

In [30]:
metric_type = 'stargazers'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1
len(raw)

148
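
The same paging loop is repeated for forks and subscribers below, so it could be factored into one helper. The sketch below is one way to do that; `fetch_all`, `fetch_page`, and the fake data are hypothetical names used so the example runs offline. With the notebook's helper, `fetch_page` could wrap `metric_get(...)` and `json.loads(response.text)`.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect items from a paged endpoint.

    fetch_page(page, page_size) is a hypothetical callable that returns the
    list of items on that page.
    """
    items, page = [], 1
    while True:
        batch = fetch_page(page, page_size)
        items += batch
        if len(batch) < page_size:  # a short (or empty) page is the last one
            return items
        page += 1

# offline stand-in for the API: 148 fake logins served in pages
fake_logins = [f'user{i}' for i in range(148)]

def fake_page(page, page_size):
    start = (page - 1) * page_size
    return fake_logins[start:start + page_size]

print(len(fetch_all(fake_page)))
```

When the total is an exact multiple of the page size, the loop makes one extra request that returns an empty page, which matches the behavior of the cell above.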

In [31]:
#raw[0]

In [32]:
stargazers = pd.DataFrame(raw)[['login']]
stargazers

Unnamed: 0,login
0,newcooldiscoveries
1,giranntu
2,sinanek
3,amith-ajith
4,rsavoie
...,...
143,JosephDavis
144,dunncw
145,PeterGolovatyi
146,littlefish0331


In [33]:
# add columns for added, dropped, count

In [34]:
stargazers['added'] = ''
stargazers['dropped'] = ''
stargazers['count'] = 1
stargazers['repo'] = github_user + '/' + github_repo

stargazers

Unnamed: 0,login,added,dropped,count,repo
0,newcooldiscoveries,,,1,statmike/vertex-ai-mlops
1,giranntu,,,1,statmike/vertex-ai-mlops
2,sinanek,,,1,statmike/vertex-ai-mlops
3,amith-ajith,,,1,statmike/vertex-ai-mlops
4,rsavoie,,,1,statmike/vertex-ai-mlops
...,...,...,...,...,...
143,JosephDavis,,,1,statmike/vertex-ai-mlops
144,dunncw,,,1,statmike/vertex-ai-mlops
145,PeterGolovatyi,,,1,statmike/vertex-ai-mlops
146,littlefish0331,,,1,statmike/vertex-ai-mlops


### /forks
- https://docs.github.com/en/rest/repos/forks?apiVersion=2022-11-28#list-forks
- list of current forks of main repository

In [35]:
metric_type = 'forks'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1
len(raw)

73

In [36]:
#raw[0]

In [37]:
forks = []
for f in raw:
    forks += [{
        'name': f['name'],
        'full_name': f['full_name'],
        'owner': f['owner']['login'],
        'stars': f['stargazers_count'],
        'watchers': f['watchers_count'],
        'forks': f['forks_count']
    }]
forks = pd.DataFrame(forks)
forks

Unnamed: 0,name,full_name,owner,stars,watchers,forks
0,vertex-ai-mlops,yfumero/vertex-ai-mlops,yfumero,0,0,0
1,vertex-ai-mlops,ivanmkc/vertex-ai-mlops,ivanmkc,0,0,0
2,vertex-ai-mlops,xjaztek/vertex-ai-mlops,xjaztek,0,0,0
3,vertex-ai-mlops,praneethkumar4/vertex-ai-mlops,praneethkumar4,0,0,0
4,vertex-ai-mlops,psod18/vertex-ai-mlops,psod18,0,0,0
...,...,...,...,...,...,...
68,vertex-ai-mlops,danielnguyen-ds/vertex-ai-mlops,danielnguyen-ds,0,0,0
69,vertex-ai-mlops,justinjm/vertex-ai-mlops,justinjm,0,0,0
70,vertex-ai-mlops,motconmeobuon/vertex-ai-mlops,motconmeobuon,0,0,0
71,vertex-ai-mlops,ANN-KOREA/vertex-ai-mlops,ANN-KOREA,0,0,0


In [38]:
# add columns for added, dropped, count

In [39]:
forks['added'] = ''
forks['dropped'] = ''
forks['count'] = 1
forks['repo'] = github_user + '/' + github_repo

forks

Unnamed: 0,name,full_name,owner,stars,watchers,forks,added,dropped,count,repo
0,vertex-ai-mlops,yfumero/vertex-ai-mlops,yfumero,0,0,0,,,1,statmike/vertex-ai-mlops
1,vertex-ai-mlops,ivanmkc/vertex-ai-mlops,ivanmkc,0,0,0,,,1,statmike/vertex-ai-mlops
2,vertex-ai-mlops,xjaztek/vertex-ai-mlops,xjaztek,0,0,0,,,1,statmike/vertex-ai-mlops
3,vertex-ai-mlops,praneethkumar4/vertex-ai-mlops,praneethkumar4,0,0,0,,,1,statmike/vertex-ai-mlops
4,vertex-ai-mlops,psod18/vertex-ai-mlops,psod18,0,0,0,,,1,statmike/vertex-ai-mlops
...,...,...,...,...,...,...,...,...,...,...
68,vertex-ai-mlops,danielnguyen-ds/vertex-ai-mlops,danielnguyen-ds,0,0,0,,,1,statmike/vertex-ai-mlops
69,vertex-ai-mlops,justinjm/vertex-ai-mlops,justinjm,0,0,0,,,1,statmike/vertex-ai-mlops
70,vertex-ai-mlops,motconmeobuon/vertex-ai-mlops,motconmeobuon,0,0,0,,,1,statmike/vertex-ai-mlops
71,vertex-ai-mlops,ANN-KOREA/vertex-ai-mlops,ANN-KOREA,0,0,0,,,1,statmike/vertex-ai-mlops


### /subscribers
- https://docs.github.com/en/rest/activity/watching?apiVersion=2022-11-28#list-watchers
- list of watchers for repository

In [40]:
metric_type = 'subscribers'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1
len(raw)

12

In [41]:
#raw[0]

In [42]:
subscribers = pd.DataFrame(raw)[['login']]
subscribers

Unnamed: 0,login
0,statmike
1,sinanek
2,inardini
3,rafal-wasowski
4,majacaci00
5,hamehrabi
6,alvaroferrerrizzo
7,rmazara-kinaxis
8,slopez-lmes
9,drkostas


In [43]:
# add columns for added, dropped, count

In [44]:
subscribers['added'] = ''
subscribers['dropped'] = ''
subscribers['count'] = 1
subscribers['repo'] = github_user + '/' + github_repo

subscribers

Unnamed: 0,login,added,dropped,count,repo
0,statmike,,,1,statmike/vertex-ai-mlops
1,sinanek,,,1,statmike/vertex-ai-mlops
2,inardini,,,1,statmike/vertex-ai-mlops
3,rafal-wasowski,,,1,statmike/vertex-ai-mlops
4,majacaci00,,,1,statmike/vertex-ai-mlops
5,hamehrabi,,,1,statmike/vertex-ai-mlops
6,alvaroferrerrizzo,,,1,statmike/vertex-ai-mlops
7,rmazara-kinaxis,,,1,statmike/vertex-ai-mlops
8,slopez-lmes,,,1,statmike/vertex-ai-mlops
9,drkostas,,,1,statmike/vertex-ai-mlops


---
## Pandas Tables

In [45]:
# none to combine from above... yet

---
## BigQuery Tables: Initial Creation

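**Running these with `write_disposition = 'WRITE_TRUNCATE'` will REPLACE the current tables in BigQuery**

Tip: the loader below is set to `write_disposition = 'WRITE_EMPTY'`, which only writes to a table that is empty or does not exist yet, so existing data is not overwritten. Change it to `'WRITE_TRUNCATE'` only when replacing the tables is intended.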

In [48]:
def bq_loader(df, df_name):
    load_job = bq.load_table_from_dataframe(
        dataframe = df,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.{df_name}"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_EMPTY', # WRITE_TRUNCATE = replace if exists, WRITE_APPEND = append if exists, WRITE_EMPTY = write new but dont overwrite
            autodetect = True, # detect schema
        )
    )
    return load_job.result()

In [49]:
bq_loader(traffic_clones, 'traffic_clones')
bq_loader(traffic_popular_paths, 'traffic_popular_paths')
bq_loader(traffic_popular_referrers, 'traffic_popular_referrers')
bq_loader(traffic_views, 'traffic_views')
bq_loader(stargazers, 'stargazers')
bq_loader(forks, 'forks')
bq_loader(subscribers, 'subscribers')

LoadJob<project=vertex-ai-mlops-369716, location=US, id=79ac1b17-3cd3-4952-879d-7a0113a883e1>

In [48]:
for table in bq.list_tables(
    dataset = bigquery.DatasetReference(
        project = BQ_PROJECT,
        dataset_id = BQ_DATASET
    )
):
    print(table.full_table_id)

vertex-ai-mlops-369716:github_metrics.commits
vertex-ai-mlops-369716:github_metrics.commits_files
vertex-ai-mlops-369716:github_metrics.forks
vertex-ai-mlops-369716:github_metrics.stargazers
vertex-ai-mlops-369716:github_metrics.subscribers
vertex-ai-mlops-369716:github_metrics.traffic_clones
vertex-ai-mlops-369716:github_metrics.traffic_popular_paths
vertex-ai-mlops-369716:github_metrics.traffic_popular_referrers
vertex-ai-mlops-369716:github_metrics.traffic_views


---
## BigQuery Tables: Increment

**WAIT OVERNIGHT THEN PROCEED HERE TO TEST INCREMENTING**

Approach:
- Forward incrementing, same time or later
- Efficiency
    - only pull what is needed
    - only replace what is changed or changeable
    - only append what is new
    - only update as often as needed


### /traffic/clones
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones
- 14 day history of clones
  - caveat: the response contains partial (truncated) values for today and for the oldest day (day 14)
- increment:
    - retrieve the most recent stored record based on timestamp
      - this record can change because it may have been partial at the time of the last run
    - pull new data
    - if count or uniques differs from the stored record (normally larger), update the stored record:
      - why? GitHub truncates the first and last day of the response based on its last calculation time
      - delete the stored record so that an append replaces it
    - if there are new date(s), keep them
    - if there are changes or new dates, append
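
The steps above can be sketched without BigQuery or pandas as a small planning function. This is an illustration under assumptions, not the notebook's implementation: `plan_increment` is a hypothetical name, rows are plain dictionaries, and the values are made up. Like the cells below, it treats any difference in `timestamp`, `count`, or `uniques` as a change.

```python
def plan_increment(prior_latest, new_rows):
    """Split freshly pulled rows into (replace_overlap, rows_to_append).

    prior_latest: dict for the most recent stored row.
    new_rows: list of dicts from the API, each with timestamp/count/uniques.
    Timestamps are ISO 8601 strings, so string comparison orders them.
    """
    keys = ('timestamp', 'count', 'uniques')
    overlap = [r for r in new_rows if r['timestamp'] == prior_latest['timestamp']]
    to_append = [r for r in new_rows if r['timestamp'] > prior_latest['timestamp']]
    replace_overlap = False
    if len(overlap) == 1 and any(overlap[0][k] != prior_latest[k] for k in keys):
        # the stored day was partial: delete it and append the fuller version,
        # keeping the 14-day unique count recorded when it was first stored
        updated = dict(overlap[0], uniques_last14days=prior_latest['uniques_last14days'])
        to_append = [updated] + to_append
        replace_overlap = True
    return replace_overlap, to_append

# illustrative values only
prior = {'timestamp': '2023-02-24T00:00:00Z', 'count': 2, 'uniques': 2, 'uniques_last14days': 69.0}
pulled = [
    {'timestamp': '2023-02-23T00:00:00Z', 'count': 7, 'uniques': 5, 'uniques_last14days': None},
    {'timestamp': '2023-02-24T00:00:00Z', 'count': 5, 'uniques': 4, 'uniques_last14days': None},
    {'timestamp': '2023-02-25T00:00:00Z', 'count': 4, 'uniques': 3, 'uniques_last14days': None},
]
replace_overlap, to_append = plan_increment(prior, pulled)
print(replace_overlap, [r['timestamp'] for r in to_append])
```

When `replace_overlap` is true, the caller would delete the stored row for that timestamp before appending `to_append`.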

In [10]:
query = f"""
SELECT t.*
FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_clones` t
WHERE 1=1 QUALIFY row_number() OVER(ORDER BY timestamp DESC) = 1
"""
prior_traffic_clones = bq.query(query = query).to_dataframe()
prior_traffic_clones

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-24T00:00:00Z,2,2,69.0,statmike/vertex-ai-mlops


In [11]:
metric_type = 'traffic/clones'
response = metric_get(metric_type)

new_traffic_clones = pd.DataFrame(json.loads(response.text)['clones'])
new_traffic_clones['uniques_last14days'] = np.nan
# set the 14-day unique count on the newest row only; .loc avoids chained assignment
new_traffic_clones.loc[new_traffic_clones.index[-1], 'uniques_last14days'] = json.loads(response.text)['uniques']
new_traffic_clones['repo'] = github_user + '/' + github_repo

new_traffic_clones

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-12T00:00:00Z,3,3,,statmike/vertex-ai-mlops
1,2023-02-13T00:00:00Z,6,6,,statmike/vertex-ai-mlops
2,2023-02-14T00:00:00Z,29,7,,statmike/vertex-ai-mlops
3,2023-02-15T00:00:00Z,13,8,,statmike/vertex-ai-mlops
4,2023-02-16T00:00:00Z,20,19,,statmike/vertex-ai-mlops
5,2023-02-17T00:00:00Z,3,2,,statmike/vertex-ai-mlops
6,2023-02-18T00:00:00Z,14,6,,statmike/vertex-ai-mlops
7,2023-02-19T00:00:00Z,10,6,,statmike/vertex-ai-mlops
8,2023-02-20T00:00:00Z,16,10,,statmike/vertex-ai-mlops
9,2023-02-21T00:00:00Z,1,1,,statmike/vertex-ai-mlops


In [12]:
overlap_record = new_traffic_clones[new_traffic_clones['timestamp'] == prior_traffic_clones['timestamp'].iloc[0]]
overlap_record

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
11,2023-02-24T00:00:00Z,2,2,,statmike/vertex-ai-mlops


In [13]:
new_records = new_traffic_clones[new_traffic_clones['timestamp'] > prior_traffic_clones['timestamp'].iloc[0]]
new_records

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
12,2023-02-25T00:00:00Z,4,3,,statmike/vertex-ai-mlops
13,2023-02-26T00:00:00Z,6,3,69.0,statmike/vertex-ai-mlops


In [14]:
if overlap_record.shape[0] == 1:
  # compare the stored row with the freshly pulled row for the same day
  if overlap_record[['timestamp', 'count', 'uniques']].values.tolist() != prior_traffic_clones[['timestamp', 'count', 'uniques']].values.tolist():
    # copy so the assignment below does not write into a slice of new_traffic_clones
    updated_record = overlap_record.copy()
    updated_record.loc[updated_record.index[0], 'uniques_last14days'] = prior_traffic_clones['uniques_last14days'].iloc[0]
    new_records = pd.concat([updated_record, new_records], ignore_index = True, axis = 0)
    # delete the stale row so the append below replaces it
    job = bq.query(query = f"DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_clones` WHERE timestamp = '{updated_record['timestamp'].iloc[0]}'")
    job.result()

new_records

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
12,2023-02-25T00:00:00Z,4,3,,statmike/vertex-ai-mlops
13,2023-02-26T00:00:00Z,6,3,69.0,statmike/vertex-ai-mlops


In [15]:
if new_records.shape[0] >=1:
  append_job = bq.load_table_from_dataframe(
        dataframe = new_records,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.traffic_clones"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND',
            autodetect = True, # detect schema
        ) 
  )
  append_job.result()

In [16]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_clones` 
ORDER BY timestamp
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-11T00:00:00Z,10,6,,statmike/vertex-ai-mlops
1,2023-02-12T00:00:00Z,9,6,,statmike/vertex-ai-mlops
2,2023-02-13T00:00:00Z,6,6,,statmike/vertex-ai-mlops
3,2023-02-14T00:00:00Z,29,7,,statmike/vertex-ai-mlops
4,2023-02-15T00:00:00Z,13,8,,statmike/vertex-ai-mlops
5,2023-02-16T00:00:00Z,20,19,,statmike/vertex-ai-mlops
6,2023-02-17T00:00:00Z,3,2,,statmike/vertex-ai-mlops
7,2023-02-18T00:00:00Z,14,6,,statmike/vertex-ai-mlops
8,2023-02-19T00:00:00Z,10,6,,statmike/vertex-ai-mlops
9,2023-02-20T00:00:00Z,16,10,,statmike/vertex-ai-mlops


### /traffic/popular/paths
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-top-referral-paths
- top 10 documents for past 14 days
- increment:
    - append or drop/append only

In [17]:
metric_type = 'traffic/popular/paths'
response = metric_get(metric_type)

traffic_popular_paths = pd.DataFrame(json.loads(response.text))

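def parse_path(p):
    # decode URL escapes, then drop the blob/tree prefixes used for file and folder views
    p = urllib.parse.unquote(p).replace('blob/main/', '').replace('tree/main/', '')
    # no extension after the last '/' means a folder view, which renders its readme
    if p.rfind('.') == -1 or (p.rfind('.') < p.rfind('/')):
        p += '/readme.md'
    return p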

traffic_popular_paths['file'] = traffic_popular_paths.apply(lambda x: parse_path(x['path']), axis = 1)
traffic_popular_paths = traffic_popular_paths.drop(['title', 'path'], axis = 1)
traffic_popular_paths['timestamp'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z") #strftime("%Y-%m-%dT%H:%M:%SZ") 
traffic_popular_paths['repo'] = github_user + '/' + github_repo

traffic_popular_paths

Unnamed: 0,count,uniques,file,timestamp,repo
0,643,232,/statmike/vertex-ai-mlops/readme.md,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
1,73,38,/statmike/vertex-ai-mlops/04 - scikit-learn/re...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
2,71,41,/statmike/vertex-ai-mlops/00 - Setup/00 - Envi...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
3,68,43,/statmike/vertex-ai-mlops/00 - Setup/readme.md,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
4,66,41,/statmike/vertex-ai-mlops/02 - Vertex AI AutoM...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
5,61,29,/statmike/vertex-ai-mlops/01 - Data Sources/01...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
6,54,33,/statmike/vertex-ai-mlops/05 - TensorFlow/read...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
7,47,29,/statmike/vertex-ai-mlops/01 - Data Sources/re...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
8,40,24,/statmike/vertex-ai-mlops/03 - BigQuery ML (BQ...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
9,34,19,/statmike/vertex-ai-mlops/architectures/overvi...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops


In [18]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_paths`
WHERE timestamp = '{traffic_popular_paths['timestamp'].max()}'
ORDER BY count DESC
"""
prior = bq.query(query = query).to_dataframe()
prior

Unnamed: 0,count,uniques,file,timestamp,repo


In [19]:
# if you want to replace previous values on any run then this section clears the old and loads the current
#if prior.shape[0] > 0:
#  job = bq.query(query = f"DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_paths` WHERE timestamp = '{traffic_popular_paths['timestamp'].max()}'")
#  job.result()
#  append_job = bq.load_table_from_dataframe(
#          dataframe = traffic_popular_paths,
#          destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_paths"),
#          job_config = bigquery.LoadJobConfig(
#              write_disposition = 'WRITE_APPEND',
#              autodetect = True, # detect schema
#          ) 
#    )
#  append_job.result()

In [20]:
# append if new
if prior.shape[0] == 0:
  append_job = bq.load_table_from_dataframe(
          dataframe = traffic_popular_paths,
          destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_paths"),
          job_config = bigquery.LoadJobConfig(
              write_disposition = 'WRITE_APPEND',
              autodetect = True, # detect schema
          ) 
    )
  append_job.result()

In [21]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_paths`
WHERE timestamp = '{traffic_popular_paths['timestamp'].max()}'
ORDER BY count DESC
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,count,uniques,file,timestamp,repo
0,643,232,/statmike/vertex-ai-mlops/readme.md,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
1,73,38,/statmike/vertex-ai-mlops/04 - scikit-learn/re...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
2,71,41,/statmike/vertex-ai-mlops/00 - Setup/00 - Envi...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
3,68,43,/statmike/vertex-ai-mlops/00 - Setup/readme.md,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
4,66,41,/statmike/vertex-ai-mlops/02 - Vertex AI AutoM...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
5,61,29,/statmike/vertex-ai-mlops/01 - Data Sources/01...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
6,54,33,/statmike/vertex-ai-mlops/05 - TensorFlow/read...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
7,47,29,/statmike/vertex-ai-mlops/01 - Data Sources/re...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
8,40,24,/statmike/vertex-ai-mlops/03 - BigQuery ML (BQ...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
9,34,19,/statmike/vertex-ai-mlops/architectures/overvi...,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops


### /traffic/popular/referrers
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-top-referral-sources
- top 10 referring sites over past 14 days
- increment:
    - append or drop/append only

In [22]:
metric_type = 'traffic/popular/referrers'
response = metric_get(metric_type)

traffic_popular_referrers = pd.DataFrame(json.loads(response.text))
traffic_popular_referrers['timestamp'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z") #strftime("%Y-%m-%dT%H:%M:%SZ")
traffic_popular_referrers['repo'] = github_user + '/' + github_repo

traffic_popular_referrers

Unnamed: 0,referrer,count,uniques,timestamp,repo
0,youtube.com,498,121,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
1,github.com,252,35,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
2,Google,217,61,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
3,statics.teams.cdn.office.net,12,3,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
4,notebooks.githubusercontent.com,11,6,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
5,m.youtube.com,5,2,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
6,mail.google.com,2,2,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
7,colab.research.google.com,1,1,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops


In [23]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_referrers`
WHERE timestamp = '{traffic_popular_referrers['timestamp'].max()}'
ORDER BY count
"""
prior = bq.query(query = query).to_dataframe()
prior

Unnamed: 0,referrer,count,uniques,timestamp,repo


In [24]:
# if you want to replace previous values on any run then this section clears the old and loads the current
#if prior.shape[0] > 0:
#  job = bq.query(query = f"DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_referrers` WHERE timestamp = '{traffic_popular_referrers['timestamp'].max()}'")
#  job.result()
#  append_job = bq.load_table_from_dataframe(
#          dataframe = traffic_popular_referrers,
#          destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_referrers"),
#          job_config = bigquery.LoadJobConfig(
#              write_disposition = 'WRITE_APPEND',
#              autodetect = True, # detect schema
#          ) 
#    )
#  append_job.result()

In [25]:
# append if new
if prior.shape[0] == 0:
  append_job = bq.load_table_from_dataframe(
          dataframe = traffic_popular_referrers,
          destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_referrers"),
          job_config = bigquery.LoadJobConfig(
              write_disposition = 'WRITE_APPEND',
              autodetect = True, # detect schema
          ) 
    )
  append_job.result()

In [26]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_popular_referrers`
WHERE timestamp = '{traffic_popular_referrers['timestamp'].max()}'
ORDER BY count DESC
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,referrer,count,uniques,timestamp,repo
0,youtube.com,498,121,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
1,github.com,252,35,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
2,Google,217,61,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
3,statics.teams.cdn.office.net,12,3,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
4,notebooks.githubusercontent.com,11,6,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
5,m.youtube.com,5,2,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
6,mail.google.com,2,2,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops
7,colab.research.google.com,1,1,2023-02-26T00:00:00Z,statmike/vertex-ai-mlops


### /traffic/views
- https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-page-views
- daily views for the last 14 days
  - caveat: counts for today and for the oldest day (day 14) are partial
- increment:
    - retrieve the most recent stored record based on timestamp
      - its counts can be stale because that day was still in progress at the last run
    - pull new data
    - if count or uniques for that date differ from the stored record, update it:
      - why? GitHub truncates the first and last day of each response based on when it last calculated the data
      - delete the stored record so the append replaces it
    - if there are new date(s), keep them
    - append any changed or new records
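
The reconciliation steps above can be sketched in pandas alone. This is a minimal illustration under stated assumptions: `reconcile_views` is a hypothetical helper, the frames hold only the columns needed here, and carrying the stored `uniques_last14days` forward is left out for brevity.

```python
import pandas as pd


def reconcile_views(prior: pd.DataFrame, new: pd.DataFrame):
    """Split a fresh pull into rows to append and timestamps to delete first.

    `prior` is the single most recent stored row; `new` is the fresh API pull.
    Timestamps are ISO-8601 strings, so string comparison orders them correctly.
    """
    last_stamp = prior['timestamp'].iloc[0]
    cols = ['timestamp', 'count', 'uniques']

    overlap = new[new['timestamp'] == last_stamp]
    to_append = new[new['timestamp'] > last_stamp]
    to_delete = []

    if len(overlap) == 1 and overlap[cols].values.tolist() != prior[cols].values.tolist():
        # the stored row was partial at the last run: replace it with the fresh values
        to_delete = [last_stamp]
        to_append = pd.concat([overlap, to_append], ignore_index=True)

    return to_append, to_delete


# Hypothetical data for illustration only
prior = pd.DataFrame({'timestamp': ['2023-02-25T00:00:00Z'], 'count': [5], 'uniques': [4]})
new = pd.DataFrame({
    'timestamp': ['2023-02-24T00:00:00Z', '2023-02-25T00:00:00Z', '2023-02-26T00:00:00Z'],
    'count': [120, 56, 38],
    'uniques': [30, 14, 9],
})

to_append, to_delete = reconcile_views(prior, new)
```

Here the stored row for 2023-02-25 is stale, so it lands in `to_delete` and its fresh version is appended together with the new 2023-02-26 row; the older 2023-02-24 row is ignored.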

In [27]:
query = f"""
SELECT t.*
FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_views` t
WHERE 1=1 QUALIFY row_number() OVER(ORDER BY timestamp DESC) = 1
"""
prior_traffic_views = bq.query(query = query).to_dataframe()
prior_traffic_views

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-25T00:00:00Z,5,4,308.0,statmike/vertex-ai-mlops


In [28]:
metric_type = 'traffic/views'
response = metric_get(metric_type)

new_traffic_views = pd.DataFrame(json.loads(response.text)['views'])
new_traffic_views['uniques_last14days'] = np.nan
# set the 14-day unique count on the last row with a single .loc call (avoids chained assignment)
new_traffic_views.loc[new_traffic_views.index[-1], 'uniques_last14days'] = json.loads(response.text)['uniques']
new_traffic_views['repo'] = github_user + '/' + github_repo

new_traffic_views

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-12T00:00:00Z,52,13,,statmike/vertex-ai-mlops
1,2023-02-13T00:00:00Z,219,43,,statmike/vertex-ai-mlops
2,2023-02-14T00:00:00Z,176,48,,statmike/vertex-ai-mlops
3,2023-02-15T00:00:00Z,118,37,,statmike/vertex-ai-mlops
4,2023-02-16T00:00:00Z,162,35,,statmike/vertex-ai-mlops
5,2023-02-17T00:00:00Z,157,38,,statmike/vertex-ai-mlops
6,2023-02-18T00:00:00Z,87,18,,statmike/vertex-ai-mlops
7,2023-02-19T00:00:00Z,82,18,,statmike/vertex-ai-mlops
8,2023-02-20T00:00:00Z,255,43,,statmike/vertex-ai-mlops
9,2023-02-21T00:00:00Z,260,51,,statmike/vertex-ai-mlops


In [29]:
overlap_record = new_traffic_views[new_traffic_views['timestamp'] == prior_traffic_views['timestamp'].iloc[0]]
overlap_record

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
13,2023-02-25T00:00:00Z,56,14,,statmike/vertex-ai-mlops


In [30]:
new_records = new_traffic_views[new_traffic_views['timestamp'] > prior_traffic_views['timestamp'].iloc[0]]
new_records

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
14,2023-02-26T00:00:00Z,38,9,308.0,statmike/vertex-ai-mlops


In [31]:
if overlap_record.shape[0] == 1:
  if overlap_record[['timestamp', 'count', 'uniques']].values.tolist() != prior_traffic_views[['timestamp', 'count', 'uniques']].values.tolist():
    # copy so the assignment below does not write to a slice of new_traffic_views
    updated_record = overlap_record.copy()
    updated_record.loc[updated_record.index[0], 'uniques_last14days'] = prior_traffic_views['uniques_last14days'].iloc[0]
    new_records = pd.concat([updated_record, new_records], ignore_index = True, axis = 0)
    job = bq.query(query = f"DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_views` WHERE timestamp = '{updated_record['timestamp'].iloc[0]}'")
    job.result()

new_records

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-25T00:00:00Z,56,14,308.0,statmike/vertex-ai-mlops
1,2023-02-26T00:00:00Z,38,9,308.0,statmike/vertex-ai-mlops


In [32]:
if new_records.shape[0] >=1:
  append_job = bq.load_table_from_dataframe(
        dataframe = new_records,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.traffic_views"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND',
            autodetect = True, # detect schema
        ) 
  )
  append_job.result()

In [33]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.traffic_views` 
ORDER BY timestamp
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,timestamp,count,uniques,uniques_last14days,repo
0,2023-02-10T00:00:00Z,1,1,,statmike/vertex-ai-mlops
1,2023-02-11T00:00:00Z,78,17,,statmike/vertex-ai-mlops
2,2023-02-12T00:00:00Z,90,20,,statmike/vertex-ai-mlops
3,2023-02-13T00:00:00Z,219,43,,statmike/vertex-ai-mlops
4,2023-02-14T00:00:00Z,176,48,,statmike/vertex-ai-mlops
5,2023-02-15T00:00:00Z,118,37,,statmike/vertex-ai-mlops
6,2023-02-16T00:00:00Z,162,35,,statmike/vertex-ai-mlops
7,2023-02-17T00:00:00Z,157,38,,statmike/vertex-ai-mlops
8,2023-02-18T00:00:00Z,87,18,,statmike/vertex-ai-mlops
9,2023-02-19T00:00:00Z,82,18,,statmike/vertex-ai-mlops


### /stargazers
- https://docs.github.com/en/rest/activity/starring?apiVersion=2022-11-28#list-stargazers
- list of current users who have starred the repository
- increment:
    - conditions
      - added = in `current` but not in `known`: completely new
        - append with added = today
      - dropped = not in `current` but in `known_active`: just dropped
        - update (drop and append): dropped = today
      - re-added = in `current` and `known_inactive`: just re-added
        - update (drop and append): added = today
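
The three conditions above can be sketched with `isin` masks. This is a minimal illustration, assuming `added` and `dropped` hold ISO-8601 strings (empty when unset); `classify_logins` is a hypothetical helper and the logins are made up.

```python
import pandas as pd


def classify_logins(current: pd.Series, known: pd.DataFrame):
    """Return (added, dropped, readded) login lists.

    `known` needs 'login', 'added', 'dropped' columns holding ISO-8601 strings ('' when unset).
    """
    active = known[known['added'] >= known['dropped']]
    inactive = known[known['dropped'] > known['added']]

    added = current[~current.isin(known['login'])]                     # in current, not in known
    dropped = active.loc[~active['login'].isin(current), 'login']      # in known_active, not in current
    readded = inactive.loc[inactive['login'].isin(current), 'login']   # in current and known_inactive
    return added.tolist(), dropped.tolist(), readded.tolist()


# Hypothetical data for illustration only
known = pd.DataFrame({
    'login': ['ann', 'bob', 'cat'],
    'added': ['', '2023-02-01T00:00:00Z', '2023-02-01T00:00:00Z'],
    'dropped': ['', '', '2023-02-10T00:00:00Z'],
})
current = pd.Series(['ann', 'cat', 'dan'], name='login')

added, dropped, readded = classify_logins(current, known)
```

The string comparison works because ISO-8601 timestamps sort lexicographically and an empty string sorts before any date.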

In [34]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.stargazers`
"""
known = bq.query(query = query).to_dataframe()
known

Unnamed: 0,login,added,dropped,count,repo
0,newcooldiscoveries,,,1,statmike/vertex-ai-mlops
1,giranntu,,,1,statmike/vertex-ai-mlops
2,sinanek,,,1,statmike/vertex-ai-mlops
3,amith-ajith,,,1,statmike/vertex-ai-mlops
4,rsavoie,,,1,statmike/vertex-ai-mlops
...,...,...,...,...,...
143,JosephDavis,,,1,statmike/vertex-ai-mlops
144,dunncw,,,1,statmike/vertex-ai-mlops
145,PeterGolovatyi,,,1,statmike/vertex-ai-mlops
146,littlefish0331,,,1,statmike/vertex-ai-mlops


In [35]:
# list of expected active stargazers (> covers added and re-added, = covers added and never dropped)
known_active = known[known['added'] >= known['dropped']]
# list of known users in current state of dropped
known_inactive = known[known['dropped'] > known['added']]

In [36]:
known_active

Unnamed: 0,login,added,dropped,count,repo
0,newcooldiscoveries,,,1,statmike/vertex-ai-mlops
1,giranntu,,,1,statmike/vertex-ai-mlops
2,sinanek,,,1,statmike/vertex-ai-mlops
3,amith-ajith,,,1,statmike/vertex-ai-mlops
4,rsavoie,,,1,statmike/vertex-ai-mlops
...,...,...,...,...,...
143,JosephDavis,,,1,statmike/vertex-ai-mlops
144,dunncw,,,1,statmike/vertex-ai-mlops
145,PeterGolovatyi,,,1,statmike/vertex-ai-mlops
146,littlefish0331,,,1,statmike/vertex-ai-mlops


In [37]:
known_inactive

Unnamed: 0,login,added,dropped,count,repo


In [38]:
metric_type = 'stargazers'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1

stargazers = pd.DataFrame(raw)[['login']]
#stargazers['added'] = ''
#stargazers['dropped'] = ''
#stargazers['count'] = 1
#stargazers['repo'] = github_user + '/' + github_repo

current = stargazers
current

Unnamed: 0,login
0,newcooldiscoveries
1,giranntu
2,sinanek
3,amith-ajith
4,rsavoie
...,...
145,PeterGolovatyi
146,littlefish0331
147,bx2
148,ajonsson


In [39]:
# newly added: in current but not in known
newly_added = pd.DataFrame([x for x in current['login'].values.tolist() if x not in known['login'].values.tolist()], columns = ['login'])
newly_added['added'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")
newly_added['dropped'] = ''
newly_added['count'] = 1
newly_added['repo'] = github_user + '/' + github_repo

# newly dropped: in known_active but not in current
newly_dropped = pd.DataFrame([x for x in known_active['login'].values.tolist() if x not in current['login'].values.tolist()], columns = ['login'])
newly_dropped = pd.merge(known_active, newly_dropped, how = 'inner', on = ['login'])
newly_dropped['dropped'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")

# newly readded: in current and in known_inactive
newly_readded = pd.merge(current['login'], known_inactive['login'], how = 'inner', on = ['login'])
newly_readded = pd.merge(newly_readded, known_inactive, how = 'inner', on = ['login'])
newly_readded['added'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")
newly_readded['count'] = newly_readded['count'] + 1

# newly combo
new_records = pd.concat([newly_added, newly_dropped, newly_readded], ignore_index = True, axis = 0)

new_records

Unnamed: 0,login,added,dropped,count,repo
0,ajonsson,2023-02-26T00:00:00Z,,1,statmike/vertex-ai-mlops
1,Nitrostrider,2023-02-26T00:00:00Z,,1,statmike/vertex-ai-mlops


In [40]:
if new_records.shape[0] >= 1:
  job = bq.query(query = f"""DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.stargazers` WHERE login in ({', '.join([f"'{x}'" for x in new_records['login'].values.tolist()])})""")
  job.result()
  append_job = bq.load_table_from_dataframe(
        dataframe = new_records,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.stargazers"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND',
            autodetect = True, # detect schema
        ) 
  )
  append_job.result()

In [41]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.stargazers`
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,login,added,dropped,count,repo
0,newcooldiscoveries,,,1,statmike/vertex-ai-mlops
1,giranntu,,,1,statmike/vertex-ai-mlops
2,sinanek,,,1,statmike/vertex-ai-mlops
3,amith-ajith,,,1,statmike/vertex-ai-mlops
4,rsavoie,,,1,statmike/vertex-ai-mlops
...,...,...,...,...,...
145,PeterGolovatyi,,,1,statmike/vertex-ai-mlops
146,littlefish0331,,,1,statmike/vertex-ai-mlops
147,bx2,,,1,statmike/vertex-ai-mlops
148,ajonsson,2023-02-26T00:00:00Z,,1,statmike/vertex-ai-mlops


### /forks
- https://docs.github.com/en/rest/repos/forks?apiVersion=2022-11-28#list-forks
- list of current forks of main repository
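- increment:
    - conditions
      - added = in `current` but not in `known`: completely new
        - append with added = today, dropped = blank, count = 1
      - dropped = not in `current` but in `known_active`: just dropped
        - update (drop and append): dropped = today
      - re-added = in `current` and `known_inactive`: just re-added
        - update (drop and append): added = today, count += 1
      - changed = in `current` and `known_active` with different stars, watchers, or forks
        - update (drop and append): refresh stars, watchers, forks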
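
Later in this section the notebook also refreshes the stored `stars`, `watchers`, and `forks` counts for known forks. A minimal sketch of that comparison, assuming a left merge with `indicator=True` (the notebook uses an outer merge and keeps the `left_only` rows, which selects the same records); `changed_forks` and the sample repositories are hypothetical.

```python
import pandas as pd

METRICS = ['stars', 'watchers', 'forks']


def changed_forks(known_active: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Return current metric values for known forks whose stored metrics are out of date."""
    merged = pd.merge(
        known_active, current,
        how='left', on=['full_name'] + METRICS, indicator=True,
    )
    # 'left_only' = the stored row has no exact match in current (changed or removed)
    stale = merged.loc[merged['_merge'] == 'left_only', ['full_name']]
    # the inner join keeps only forks that still exist and picks up their fresh metrics
    return pd.merge(stale, current[['full_name'] + METRICS], how='inner', on='full_name')


# Hypothetical data for illustration only
known_active = pd.DataFrame({
    'full_name': ['a/repo', 'b/repo', 'c/repo'],
    'stars': [0, 1, 0],
    'watchers': [0, 1, 0],
    'forks': [0, 0, 0],
})
current = pd.DataFrame({
    'full_name': ['a/repo', 'b/repo', 'd/repo'],
    'stars': [0, 2, 0],
    'watchers': [0, 2, 0],
    'forks': [0, 0, 0],
})

updated = changed_forks(known_active, current)
```

Only `b/repo` comes back: `a/repo` is unchanged, `c/repo` no longer exists (the dropped condition handles it), and `d/repo` is new (the added condition handles it).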

In [10]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.forks`
"""
known = bq.query(query = query).to_dataframe()
known

Unnamed: 0,name,full_name,owner,stars,watchers,forks,added,dropped,count,repo
0,GCP,dxc7jack/GCP,dxc7jack,0,0,0,,,1,statmike/vertex-ai-mlops
1,vertex-ai-mlops,yfumero/vertex-ai-mlops,yfumero,0,0,0,,,1,statmike/vertex-ai-mlops
2,vertex-ai-mlops,ivanmkc/vertex-ai-mlops,ivanmkc,0,0,0,,,1,statmike/vertex-ai-mlops
3,vertex-ai-mlops,xjaztek/vertex-ai-mlops,xjaztek,0,0,0,,,1,statmike/vertex-ai-mlops
4,vertex-ai-mlops,praneethkumar4/vertex-ai-mlops,praneethkumar4,0,0,0,,,1,statmike/vertex-ai-mlops
...,...,...,...,...,...,...,...,...,...,...
68,vertex-ai-mlops,nishitpatel01/vertex-ai-mlops,nishitpatel01,1,1,0,,,1,statmike/vertex-ai-mlops
69,GCP-mlops-vertex-AI,kanishkpatel1995/GCP-mlops-vertex-AI,kanishkpatel1995,0,0,0,,,1,statmike/vertex-ai-mlops
70,vertex-ai-mlops-GCP-,shahaparan/vertex-ai-mlops-GCP-,shahaparan,0,0,0,,,1,statmike/vertex-ai-mlops
71,vertex-ai-mlops-Mike,alfonso-miranda/vertex-ai-mlops-Mike,alfonso-miranda,1,1,0,,,1,statmike/vertex-ai-mlops


In [11]:
# list of expected active forks (> covers added and re-added, = covers added and never dropped)
known_active = known[known['added'] >= known['dropped']]
# list of known forks in current state of dropped
known_inactive = known[known['dropped'] > known['added']]

In [23]:
known_active

Unnamed: 0,name,full_name,owner,stars,watchers,forks,added,dropped,count,repo
0,GCP,dxc7jack/GCP,dxc7jack,0,0,0,,,1,statmike/vertex-ai-mlops
1,vertex-ai-mlops,yfumero/vertex-ai-mlops,yfumero,0,0,0,,,1,statmike/vertex-ai-mlops
2,vertex-ai-mlops,ivanmkc/vertex-ai-mlops,ivanmkc,0,0,0,,,1,statmike/vertex-ai-mlops
3,vertex-ai-mlops,xjaztek/vertex-ai-mlops,xjaztek,0,0,0,,,1,statmike/vertex-ai-mlops
4,vertex-ai-mlops,praneethkumar4/vertex-ai-mlops,praneethkumar4,0,0,0,,,1,statmike/vertex-ai-mlops
...,...,...,...,...,...,...,...,...,...,...
68,vertex-ai-mlops,nishitpatel01/vertex-ai-mlops,nishitpatel01,1,1,0,,,1,statmike/vertex-ai-mlops
69,GCP-mlops-vertex-AI,kanishkpatel1995/GCP-mlops-vertex-AI,kanishkpatel1995,0,0,0,,,1,statmike/vertex-ai-mlops
70,vertex-ai-mlops-GCP-,shahaparan/vertex-ai-mlops-GCP-,shahaparan,0,0,0,,,1,statmike/vertex-ai-mlops
71,vertex-ai-mlops-Mike,alfonso-miranda/vertex-ai-mlops-Mike,alfonso-miranda,1,1,0,,,1,statmike/vertex-ai-mlops


In [13]:
known_inactive

Unnamed: 0,name,full_name,owner,stars,watchers,forks,added,dropped,count,repo


In [14]:
metric_type = 'forks'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1

forks = []
for f in raw:
    forks += [{
        'name': f['name'],
        'full_name': f['full_name'],
        'owner': f['owner']['login'],
        'stars': f['stargazers_count'],
        'watchers': f['watchers_count'],
        'forks': f['forks_count']
    }]
forks = pd.DataFrame(forks)
#forks['added'] = ''
#forks['dropped'] = ''
#forks['count'] = 1
#forks['repo'] = github_user + '/' + github_repo

current = forks
current

Unnamed: 0,name,full_name,owner,stars,watchers,forks
0,vertex-ai-mlops-youjun-revised,littlefish0331/vertex-ai-mlops-youjun-revised,littlefish0331,0,0,0
1,vertex-ai-mlops,yfumero/vertex-ai-mlops,yfumero,0,0,0
2,vertex-ai-mlops,ivanmkc/vertex-ai-mlops,ivanmkc,0,0,0
3,vertex-ai-mlops,xjaztek/vertex-ai-mlops,xjaztek,0,0,0
4,vertex-ai-mlops,praneethkumar4/vertex-ai-mlops,praneethkumar4,0,0,0
...,...,...,...,...,...,...
69,vertex-ai-mlops,danielnguyen-ds/vertex-ai-mlops,danielnguyen-ds,0,0,0
70,vertex-ai-mlops,justinjm/vertex-ai-mlops,justinjm,0,0,0
71,vertex-ai-mlops,motconmeobuon/vertex-ai-mlops,motconmeobuon,0,0,0
72,vertex-ai-mlops,ANN-KOREA/vertex-ai-mlops,ANN-KOREA,0,0,0


In [19]:
# newly added: in current but not in known
newly_added = pd.DataFrame([x for x in current['full_name'].values.tolist() if x not in known['full_name'].values.tolist()], columns = ['full_name'])
newly_added = pd.merge(newly_added, current, how = 'inner', on = ['full_name'])
newly_added['added'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")
newly_added['dropped'] = ''
newly_added['count'] = 1
newly_added['repo'] = github_user + '/' + github_repo

# newly dropped: in known_active but not in current
newly_dropped = pd.DataFrame([x for x in known_active['full_name'].values.tolist() if x not in current['full_name'].values.tolist()], columns = ['full_name'])
newly_dropped = pd.merge(known_active, newly_dropped, how = 'inner', on = ['full_name'])
newly_dropped['dropped'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")

# newly readded: in current and in known_inactive
newly_readded = pd.merge(current['full_name'], known_inactive['full_name'], how = 'inner', on = ['full_name'])
newly_readded = pd.merge(newly_readded, known_inactive, how = 'inner', on = ['full_name'])
newly_readded['added'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")
newly_readded['count'] = newly_readded['count'] + 1

# newly combo
new_records = pd.concat([newly_added, newly_dropped, newly_readded], ignore_index = True, axis = 0)

new_records

Unnamed: 0,full_name,name,owner,stars,watchers,forks,added,dropped,count,repo
0,littlefish0331/vertex-ai-mlops-youjun-revised,vertex-ai-mlops-youjun-revised,littlefish0331,0,0,0,2023-02-26T00:00:00Z,,1,statmike/vertex-ai-mlops


In [32]:
## check for updated stars/watchers/forks comparing current and known_active, if yes then add to the new_records dataframe so it gets updated

# start with outer merge on all columns in current
non_match = pd.merge(known_active, current, how = 'outer', indicator = True, left_on = ['full_name', 'stars', 'watchers', 'forks'], right_on = ['full_name', 'stars', 'watchers', 'forks'])
# make list of full_name that did not have an exact match in current - these need updating
non_match = non_match[non_match._merge == 'left_only']
non_match = non_match[['full_name']]
# now get current records for the non_match
non_match = pd.merge(non_match, current, how = 'inner', on = ['full_name'])
# now get updated records
updated_records = pd.merge(known_active[['name', 'full_name', 'owner', 'added', 'dropped', 'count', 'repo']], non_match[['full_name', 'stars', 'watchers', 'forks']], how = 'inner', on = 'full_name')

updated_records

Unnamed: 0,name,full_name,owner,added,dropped,count,repo,stars,watchers,forks


In [33]:
# stack updated records with the new_records before updating
new_records = pd.concat([updated_records, new_records], ignore_index = True, axis = 0)

new_records

Unnamed: 0,name,full_name,owner,added,dropped,count,repo,stars,watchers,forks
0,vertex-ai-mlops-youjun-revised,littlefish0331/vertex-ai-mlops-youjun-revised,littlefish0331,2023-02-26T00:00:00Z,,1,statmike/vertex-ai-mlops,0,0,0


In [37]:
if new_records.shape[0] >= 1:
  job = bq.query(query = f"""DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.forks` WHERE full_name in ({', '.join([f"'{x}'" for x in new_records['full_name'].values.tolist()])})""")
  job.result()
  append_job = bq.load_table_from_dataframe(
        dataframe = new_records,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.forks"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND',
            autodetect = True, # detect schema
        ) 
  )
  append_job.result()

In [38]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.forks`
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,name,full_name,owner,stars,watchers,forks,added,dropped,count,repo
0,GCP,dxc7jack/GCP,dxc7jack,0,0,0,,,1,statmike/vertex-ai-mlops
1,vertex-ai-mlops,yfumero/vertex-ai-mlops,yfumero,0,0,0,,,1,statmike/vertex-ai-mlops
2,vertex-ai-mlops,ivanmkc/vertex-ai-mlops,ivanmkc,0,0,0,,,1,statmike/vertex-ai-mlops
3,vertex-ai-mlops,xjaztek/vertex-ai-mlops,xjaztek,0,0,0,,,1,statmike/vertex-ai-mlops
4,vertex-ai-mlops,praneethkumar4/vertex-ai-mlops,praneethkumar4,0,0,0,,,1,statmike/vertex-ai-mlops
...,...,...,...,...,...,...,...,...,...,...
69,GCP-mlops-vertex-AI,kanishkpatel1995/GCP-mlops-vertex-AI,kanishkpatel1995,0,0,0,,,1,statmike/vertex-ai-mlops
70,vertex-ai-mlops-GCP-,shahaparan/vertex-ai-mlops-GCP-,shahaparan,0,0,0,,,1,statmike/vertex-ai-mlops
71,vertex-ai-mlops-Mike,alfonso-miranda/vertex-ai-mlops-Mike,alfonso-miranda,1,1,0,,,1,statmike/vertex-ai-mlops
72,vertex-ai-mlops-statmike,suddhasatwabhaumik/vertex-ai-mlops-statmike,suddhasatwabhaumik,0,0,0,,,1,statmike/vertex-ai-mlops


### /subscribers
- https://docs.github.com/en/rest/activity/watching?apiVersion=2022-11-28#list-watchers
- list of watchers for repository
- increment:
    - conditions
      - added = in `current` but not in `known`: completely new
        - append with added = today
      - dropped = not in `current` but in `known_active`: just dropped
        - update (drop and append): dropped = today
      - re-added = in `current` and `known_inactive`: just re-added
        - update (drop and append): added = today
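
As with `/stargazers`, this endpoint returns a paged list, so `current` has to be assembled from every page before the comparisons above are meaningful. A minimal sketch of the paging loop as a reusable generator, assuming a page shorter than `per_page` marks the end; `paginate`, `fetch_page`, and the fake data are hypothetical stand-ins for the real API call.

```python
from typing import Callable, Iterator, List


def paginate(fetch_page: Callable[[int, int], List[dict]], per_page: int = 100) -> Iterator[dict]:
    """Yield items page by page until a page comes back short."""
    page = 1
    while True:
        items = fetch_page(page, per_page)
        yield from items
        if len(items) < per_page:
            break
        page += 1


# Stand-in for the API: 5 fake users (hypothetical data)
FAKE_USERS = [{'login': f'user{i}'} for i in range(5)]


def fake_fetch(page: int, per_page: int) -> List[dict]:
    """Return one page of FAKE_USERS, mimicking ?per_page=N&page=M."""
    start = (page - 1) * per_page
    return FAKE_USERS[start:start + per_page]


logins = [u['login'] for u in paginate(fake_fetch, per_page=2)]
```

In the notebook, `fetch_page` could wrap `metric_get(metric_type, f'?per_page={per_page}&page={page}')` and parse the JSON response.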

In [48]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.subscribers`
"""
known = bq.query(query = query).to_dataframe()
known

Unnamed: 0,login,added,dropped,count,repo
0,statmike,,,1,statmike/vertex-ai-mlops
1,sinanek,,,1,statmike/vertex-ai-mlops
2,inardini,,,1,statmike/vertex-ai-mlops
3,rafal-wasowski,,,1,statmike/vertex-ai-mlops
4,majacaci00,,,1,statmike/vertex-ai-mlops
5,hamehrabi,,,1,statmike/vertex-ai-mlops
6,alvaroferrerrizzo,,,1,statmike/vertex-ai-mlops
7,rmazara-kinaxis,,,1,statmike/vertex-ai-mlops
8,slopez-lmes,,,1,statmike/vertex-ai-mlops
9,drkostas,,,1,statmike/vertex-ai-mlops


In [49]:
# list of expected active subscribers (> covers added and re-added, = covers added and never dropped)
known_active = known[known['added'] >= known['dropped']]
# list of known users in current state of dropped
known_inactive = known[known['dropped'] > known['added']]

In [50]:
known_active

Unnamed: 0,login,added,dropped,count,repo
0,statmike,,,1,statmike/vertex-ai-mlops
1,sinanek,,,1,statmike/vertex-ai-mlops
2,inardini,,,1,statmike/vertex-ai-mlops
3,rafal-wasowski,,,1,statmike/vertex-ai-mlops
4,majacaci00,,,1,statmike/vertex-ai-mlops
5,hamehrabi,,,1,statmike/vertex-ai-mlops
6,alvaroferrerrizzo,,,1,statmike/vertex-ai-mlops
7,rmazara-kinaxis,,,1,statmike/vertex-ai-mlops
8,slopez-lmes,,,1,statmike/vertex-ai-mlops
9,drkostas,,,1,statmike/vertex-ai-mlops


In [51]:
known_inactive

Unnamed: 0,login,added,dropped,count,repo


In [52]:
metric_type = 'subscribers'

page_size = 100
page = 1
raw = []
while page_size == 100:
    response = metric_get(metric_type, f'?per_page={page_size}&page={page}')
    raw_new = json.loads(response.text)
    raw += raw_new
    page_size = len(raw_new)
    page += 1

subscribers = pd.DataFrame(raw)[['login']]
#subscribers['added'] = ''
#subscribers['dropped'] = ''
#subscribers['count'] = 1
#subscribers['repo'] = github_user + '/' + github_repo

current = subscribers
current

Unnamed: 0,login
0,statmike
1,sinanek
2,inardini
3,rafal-wasowski
4,majacaci00
5,hamehrabi
6,alvaroferrerrizzo
7,rmazara-kinaxis
8,slopez-lmes
9,drkostas


In [53]:
# newly added: in current but not in known
newly_added = pd.DataFrame([x for x in current['login'].values.tolist() if x not in known['login'].values.tolist()], columns = ['login'])
newly_added['added'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")
newly_added['dropped'] = ''
newly_added['count'] = 1
newly_added['repo'] = github_user + '/' + github_repo

# newly dropped: in known_active but not in current
newly_dropped = pd.DataFrame([x for x in known_active['login'].values.tolist() if x not in current['login'].values.tolist()], columns = ['login'])
newly_dropped = pd.merge(known_active, newly_dropped, how = 'inner', on = ['login'])
newly_dropped['dropped'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")

# newly readded: in current and in known_inactive
newly_readded = pd.merge(current['login'], known_inactive['login'], how = 'inner', on = ['login'])
newly_readded = pd.merge(newly_readded, known_inactive, how = 'inner', on = ['login'])
newly_readded['added'] = datetime.now().strftime("%Y-%m-%dT00:00:00Z")
newly_readded['count'] = newly_readded['count'] + 1

# newly combo
new_records = pd.concat([newly_added, newly_dropped, newly_readded], ignore_index = True, axis = 0)

new_records

Unnamed: 0,login,added,dropped,count,repo


In [54]:
if new_records.shape[0] >= 1:
  job = bq.query(query = f"""DELETE FROM `{BQ_PROJECT}.{BQ_DATASET}.subscribers` WHERE login in ({', '.join([f"'{x}'" for x in new_records['login'].values.tolist()])})""")
  job.result()
  append_job = bq.load_table_from_dataframe(
        dataframe = new_records,
        destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.subscribers"),
        job_config = bigquery.LoadJobConfig(
            write_disposition = 'WRITE_APPEND',
            autodetect = True, # detect schema
        ) 
  )
  append_job.result()

In [55]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.subscribers`
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,login,added,dropped,count,repo
0,statmike,,,1,statmike/vertex-ai-mlops
1,sinanek,,,1,statmike/vertex-ai-mlops
2,inardini,,,1,statmike/vertex-ai-mlops
3,rafal-wasowski,,,1,statmike/vertex-ai-mlops
4,majacaci00,,,1,statmike/vertex-ai-mlops
5,hamehrabi,,,1,statmike/vertex-ai-mlops
6,alvaroferrerrizzo,,,1,statmike/vertex-ai-mlops
7,rmazara-kinaxis,,,1,statmike/vertex-ai-mlops
8,slopez-lmes,,,1,statmike/vertex-ai-mlops
9,drkostas,,,1,statmike/vertex-ai-mlops


---
## Diagnostics