# ETL GitHub exercise

Let's work out a simple ETL-like pipeline. The goal is to practice some Python coding and to exercise decision-making about transforming, naming, and storing data.

The exercise:

> Query for the 100 most-starred Python repositories created this year, and create (related) tables to represent the (JSON) information.
>
> Important steps:
> - Save the respective JSON content in individual files in their own directories.
> - Download the corresponding readme files, and create an index (of words) relating the overall most frequent words to the repositories.

Since we are not interested in GitHub's API features here, this is the URL to query for such information:

```
https://api.github.com/search/repositories?q=created:>2023-01-01+language:python&sort=stars&order=desc&per_page=100
```
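
The same URL can also be assembled from its parts, which makes the creation date and page size easy to change. The following is a minimal sketch using only the standard library; the function name `build_search_url` and its parameters are illustrative and not part of the exercise. The result is a percent-encoded equivalent of the URL above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(created_after, language="python", per_page=100):
    """Build a repository-search URL sorted by stars, descending.

    `created_after` is an ISO date string such as "2023-01-01".
    """
    # Search qualifiers are separated by spaces; urlencode turns each space
    # into "+" and percent-encodes ":" and ">".
    query = f"created:>{created_after} language:{language}"
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("2023-01-01")
print(url)
```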

> See below for [an example of such query results](#Example-query-result)

The response is composed of some top-level attributes (`total_count`, `incomplete_results`) that are not of interest here. The content of `items` is what we want: the metadata of the repositories we queried for.

* Save each `items` metadata block (i.e., an _object_ in JS) in a `.json` file in a directory named after the owner and/or repository name.

* Organize such information (i.e., inside `items`) into:
  1. a "main" table for the _items_, in which each record represents a repository;
    - remove all unnecessary URLs: keep only `html_url`.
  2. an "owners" table, with the information inside each item's `owner` object;
    - same for `license` and `topics`.

* Download the corresponding _readme_ files and save them next to the metadata `.json` file.
  1. Create an index of words for each _readme_ file and compute the word frequencies;
  2. Merge them all into one "index" table of the (100) most frequent words;
    - **Remember**: an _index_ (table) relates a _word/term_ to its _source/location_.
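
The index step can be sketched with `collections.Counter`. This is one possible approach, not a prescribed one: the function names, the tokenization rule (lowercase alphabetic words of three or more letters), and the sample readme texts are all assumptions made for illustration. It assumes the readme texts have already been downloaded and does not filter stop words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words of 3+ letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def build_index(readmes, top_n=100):
    """Relate the `top_n` most frequent words overall to their sources.

    `readmes` maps a repository name to its readme text. The result maps
    each selected word to a {repository: count} dictionary.
    """
    per_repo = {repo: Counter(tokenize(text)) for repo, text in readmes.items()}

    # Sum the per-repository counters to get the overall frequencies.
    overall = Counter()
    for counts in per_repo.values():
        overall.update(counts)

    index = {}
    for word, _total in overall.most_common(top_n):
        index[word] = {
            repo: counts[word] for repo, counts in per_repo.items() if word in counts
        }
    return index


# Made-up readme snippets, for illustration only.
readmes = {
    "alice/demo": "Install the model. Run the model with Python.",
    "bob/tool": "A Python tool. Install with pip.",
}
index = build_index(readmes, top_n=3)
for word, sources in index.items():
    print(word, sources)
```

Each entry of the resulting dictionary is one row of the "index" table: the word, the repositories it occurs in, and its count in each.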


### Example query result

Here is an example of a similar query (`per_page=1`):

```json
// 20230228175452
// https://api.github.com/search/repositories?q=language:python&sort=stars&order=desc&per_page=1

{
  "total_count": 8951990,
  "incomplete_results": true,
  "items": [
    {
      "id": 12888993,
      "node_id": "MDEwOlJlcG9zaXRvcnkxMjg4ODk5Mw==",
      "name": "core",
      "full_name": "home-assistant/core",
      "private": false,
      "owner": {
        "login": "home-assistant",
        "id": 13844975,
        "node_id": "MDEyOk9yZ2FuaXphdGlvbjEzODQ0OTc1",
        "avatar_url": "https://avatars.githubusercontent.com/u/13844975?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/home-assistant",
        "html_url": "https://github.com/home-assistant",
        "followers_url": "https://api.github.com/users/home-assistant/followers",
        "following_url": "https://api.github.com/users/home-assistant/following{/other_user}",
        "gists_url": "https://api.github.com/users/home-assistant/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/home-assistant/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/home-assistant/subscriptions",
        "organizations_url": "https://api.github.com/users/home-assistant/orgs",
        "repos_url": "https://api.github.com/users/home-assistant/repos",
        "events_url": "https://api.github.com/users/home-assistant/events{/privacy}",
        "received_events_url": "https://api.github.com/users/home-assistant/received_events",
        "type": "Organization",
        "site_admin": false
      },
      "html_url": "https://github.com/home-assistant/core",
      "description": ":house_with_garden: Open source home automation that puts local control and privacy first.",
      "fork": false,
      "url": "https://api.github.com/repos/home-assistant/core",
      "forks_url": "https://api.github.com/repos/home-assistant/core/forks",
      "keys_url": "https://api.github.com/repos/home-assistant/core/keys{/key_id}",
      "collaborators_url": "https://api.github.com/repos/home-assistant/core/collaborators{/collaborator}",
      "teams_url": "https://api.github.com/repos/home-assistant/core/teams",
      "hooks_url": "https://api.github.com/repos/home-assistant/core/hooks",
      "issue_events_url": "https://api.github.com/repos/home-assistant/core/issues/events{/number}",
      "events_url": "https://api.github.com/repos/home-assistant/core/events",
      "assignees_url": "https://api.github.com/repos/home-assistant/core/assignees{/user}",
      "branches_url": "https://api.github.com/repos/home-assistant/core/branches{/branch}",
      "tags_url": "https://api.github.com/repos/home-assistant/core/tags",
      "blobs_url": "https://api.github.com/repos/home-assistant/core/git/blobs{/sha}",
      "git_tags_url": "https://api.github.com/repos/home-assistant/core/git/tags{/sha}",
      "git_refs_url": "https://api.github.com/repos/home-assistant/core/git/refs{/sha}",
      "trees_url": "https://api.github.com/repos/home-assistant/core/git/trees{/sha}",
      "statuses_url": "https://api.github.com/repos/home-assistant/core/statuses/{sha}",
      "languages_url": "https://api.github.com/repos/home-assistant/core/languages",
      "stargazers_url": "https://api.github.com/repos/home-assistant/core/stargazers",
      "contributors_url": "https://api.github.com/repos/home-assistant/core/contributors",
      "subscribers_url": "https://api.github.com/repos/home-assistant/core/subscribers",
      "subscription_url": "https://api.github.com/repos/home-assistant/core/subscription",
      "commits_url": "https://api.github.com/repos/home-assistant/core/commits{/sha}",
      "git_commits_url": "https://api.github.com/repos/home-assistant/core/git/commits{/sha}",
      "comments_url": "https://api.github.com/repos/home-assistant/core/comments{/number}",
      "issue_comment_url": "https://api.github.com/repos/home-assistant/core/issues/comments{/number}",
      "contents_url": "https://api.github.com/repos/home-assistant/core/contents/{+path}",
      "compare_url": "https://api.github.com/repos/home-assistant/core/compare/{base}...{head}",
      "merges_url": "https://api.github.com/repos/home-assistant/core/merges",
      "archive_url": "https://api.github.com/repos/home-assistant/core/{archive_format}{/ref}",
      "downloads_url": "https://api.github.com/repos/home-assistant/core/downloads",
      "issues_url": "https://api.github.com/repos/home-assistant/core/issues{/number}",
      "pulls_url": "https://api.github.com/repos/home-assistant/core/pulls{/number}",
      "milestones_url": "https://api.github.com/repos/home-assistant/core/milestones{/number}",
      "notifications_url": "https://api.github.com/repos/home-assistant/core/notifications{?since,all,participating}",
      "labels_url": "https://api.github.com/repos/home-assistant/core/labels{/name}",
      "releases_url": "https://api.github.com/repos/home-assistant/core/releases{/id}",
      "deployments_url": "https://api.github.com/repos/home-assistant/core/deployments",
      "created_at": "2013-09-17T07:29:48Z",
      "updated_at": "2023-02-28T15:43:21Z",
      "pushed_at": "2023-02-28T16:53:17Z",
      "git_url": "git://github.com/home-assistant/core.git",
      "ssh_url": "git@github.com:home-assistant/core.git",
      "clone_url": "https://github.com/home-assistant/core.git",
      "svn_url": "https://github.com/home-assistant/core",
      "homepage": "https://www.home-assistant.io",
      "size": 432783,
      "stargazers_count": 58525,
      "watchers_count": 58525,
      "language": "Python",
      "has_issues": true,
      "has_projects": true,
      "has_downloads": true,
      "has_wiki": false,
      "has_pages": false,
      "has_discussions": false,
      "forks_count": 22165,
      "mirror_url": null,
      "archived": false,
      "disabled": false,
      "open_issues_count": 2731,
      "license": {
        "key": "apache-2.0",
        "name": "Apache License 2.0",
        "spdx_id": "Apache-2.0",
        "url": "https://api.github.com/licenses/apache-2.0",
        "node_id": "MDc6TGljZW5zZTI="
      },
      "allow_forking": true,
      "is_template": false,
      "web_commit_signoff_required": false,
      "topics": [
        "asyncio",
        "hacktoberfest",
        "home-automation",
        "internet-of-things",
        "iot",
        "mqtt",
        "python",
        "raspberry-pi"
      ],
      "visibility": "public",
      "forks": 22165,
      "open_issues": 2731,
      "watchers": 58525,
      "default_branch": "dev",
      "score": 1.0
    }
  ]
}
```
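
Most keys of an item (and of its `owner` object) are API URLs. A small filter can drop them before building the tables. The sketch below assumes that a URL field is any key named `url` or ending in `_url`; the helper name `drop_url_fields` and its `keep` parameter are hypothetical.

```python
def drop_url_fields(record, keep=("html_url",)):
    """Return a copy of `record` without its URL-valued keys.

    A key counts as a URL field when it is named "url" or ends in "_url";
    names listed in `keep` are preserved.
    """
    return {
        key: value
        for key, value in record.items()
        if key in keep or not (key == "url" or key.endswith("_url"))
    }


# A trimmed-down item with values taken from the example response above.
item = {
    "id": 12888993,
    "full_name": "home-assistant/core",
    "html_url": "https://github.com/home-assistant/core",
    "url": "https://api.github.com/repos/home-assistant/core",
    "forks_url": "https://api.github.com/repos/home-assistant/core/forks",
    "homepage": "https://www.home-assistant.io",
    "stargazers_count": 58525,
}
print(drop_url_fields(item))
```

Note that `homepage` survives this filter because its name does not match the pattern; whether to keep it is a design decision.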

**Name:** Muzhikbayev Timur

**Date:** 06.03.2023

**Email:** tmuzhikbay@jacobs-university.de



In [None]:
import os
from collections import Counter

import pandas as pd
import requests

print(f'Requests version: {requests.__version__}')

Requests version: 2.25.1


## Workflow with Requests and built-ins

> The Python Standard Library provides the [`urllib`](https://docs.python.org/3/library/urllib.html) package for web requests. It is, however, a rather low-level interface, and the (external) [Requests](https://requests.readthedocs.io/) library is broadly recommended instead.

Now that we have built the URL to request (in theory) exactly what we want from GitHub, all we have to do is fetch the data for the (100) repositories and then write it into JSON files as requested.

In [None]:
# Get data

response = requests.get('https://api.github.com/search/repositories?q=created:>2023-01-01+language:python&sort=stars&order=desc&per_page=100')

if response.status_code == 200:
    data_js = response.json()
    print(f"Request successful. Response items size: {len(data_js['items'])}")
else:
    print("Something went wrong with our request:", response.text)

Request successful. Response items size: 100
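
Writing each item to its own JSON file could look like the sketch below. It assumes a layout of `<base_dir>/<owner>/<repo>/metadata.json`; the function name `save_items`, the file name `metadata.json`, and the use of a temporary directory are illustrative choices. In the notebook, `data_js['items']` would take the place of `sample_items`.

```python
import json
import tempfile
from pathlib import Path


def save_items(items, base_dir):
    """Write each item to <base_dir>/<owner>/<repo>/metadata.json."""
    paths = []
    for item in items:
        # `full_name` has the form "owner/repo", so it maps onto two
        # nested directories.
        owner, repo = item["full_name"].split("/", 1)
        target = Path(base_dir) / owner / repo
        target.mkdir(parents=True, exist_ok=True)
        path = target / "metadata.json"
        path.write_text(json.dumps(item, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


# Stand-in for data_js['items'], with values from the example response.
sample_items = [{"id": 12888993, "full_name": "home-assistant/core"}]
out_dir = tempfile.mkdtemp()
saved = save_items(sample_items, out_dir)
print(saved[0])
```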


In [None]:

# Initialize the main table from the response items
df = pd.DataFrame(data_js['items'])


# Select the columns to keep (`html_url` is the only URL field retained)
df = df[['id', 'full_name', 'html_url', 'description', 'language', 'stargazers_count', 'forks_count', 'visibility']]
items = data_js['items']

licenses = []       # license records, one per repository
owners = []         # owner IDs, one per repository
owner_records = []  # owner records, one per repository

for item in items:
    ow = item['owner']
    owners.append(ow['id'])
    owner_records.append(ow)

    # Repositories without a license get a placeholder record of None values
    if not item['license']:
        placeholder = {}
        placeholder['key'] = None
        placeholder['name'] = None
        placeholder['spdx_id'] = None  # same key as in the API response
        placeholder['url'] = None
        placeholder['node_id'] = None
        licenses.append(placeholder)
    else:
        licenses.append(item['license'])


df['owners_id'] = owners
licenses_df = pd.DataFrame(licenses)
owners_df = pd.DataFrame(owner_records)

# Use the repository ID as a foreign key in the licenses table
licenses_df['Repository_ID'] = df['id'].tolist()

# Display everything
display(df)
display(owners_df)
display(licenses_df)

# Create a folder where the tables will be saved
os.makedirs('folder/subfolder', exist_ok=True)
df.to_csv('folder/subfolder/repositories.csv', index=False)
owners_df.to_csv('folder/subfolder/owners.csv', index=False)
licenses_df.to_csv('folder/subfolder/licenses.csv', index=False)


Unnamed: 0,id,full_name,html_url,description,language,stargazers_count,forks_count,visibility,owners_id
0,595893961,lllyasviel/ControlNet,https://github.com/lllyasviel/ControlNet,Let us control diffusion models!,Python,11885,938,public,19834515
1,601538369,facebookresearch/llama,https://github.com/facebookresearch/llama,Inference code for LLaMA models,Python,7791,978,public,16943930
2,602270517,FMInference/FlexGen,https://github.com/FMInference/FlexGen,Throughput-oriented systems for large language...,Python,6294,323,public,125944572
3,600798098,Mikubill/sd-webui-controlnet,https://github.com/Mikubill/sd-webui-controlnet,WebUI extension for ControlNet,Python,4104,401,public,31246794
4,586711486,timothybrooks/instruct-pix2pix,https://github.com/timothybrooks/instruct-pix2pix,,Python,3658,328,public,10535711
...,...,...,...,...,...,...,...,...,...
95,594287160,chavinlo/sda-node,https://github.com/chavinlo/sda-node,Stable Diffusion Accelerated. Node Module.,Python,197,21,public,85657083
96,607276532,facebookresearch/dropout,https://github.com/facebookresearch/dropout,"Code release for ""Dropout Reduces Underfitting""",Python,192,8,public,16943930
97,605682921,DAGWorks-Inc/hamilton,https://github.com/DAGWorks-Inc/hamilton,A scalable general purpose micro-framework for...,Python,191,1,public,116846391
98,585309503,machine1337/gmailc2,https://github.com/machine1337/gmailc2,A Fully Undetectable C2 Server That Communicat...,Python,190,32,public,82051128


Unnamed: 0,login,id,node_id,avatar_url,gravatar_id,url,html_url,followers_url,following_url,gists_url,starred_url,subscriptions_url,organizations_url,repos_url,events_url,received_events_url,type,site_admin
0,lllyasviel,19834515,MDQ6VXNlcjE5ODM0NTE1,https://avatars.githubusercontent.com/u/198345...,,https://api.github.com/users/lllyasviel,https://github.com/lllyasviel,https://api.github.com/users/lllyasviel/followers,https://api.github.com/users/lllyasviel/follow...,https://api.github.com/users/lllyasviel/gists{...,https://api.github.com/users/lllyasviel/starre...,https://api.github.com/users/lllyasviel/subscr...,https://api.github.com/users/lllyasviel/orgs,https://api.github.com/users/lllyasviel/repos,https://api.github.com/users/lllyasviel/events...,https://api.github.com/users/lllyasviel/receiv...,User,False
1,facebookresearch,16943930,MDEyOk9yZ2FuaXphdGlvbjE2OTQzOTMw,https://avatars.githubusercontent.com/u/169439...,,https://api.github.com/users/facebookresearch,https://github.com/facebookresearch,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,Organization,False
2,FMInference,125944572,O_kgDOB4HC_A,https://avatars.githubusercontent.com/u/125944...,,https://api.github.com/users/FMInference,https://github.com/FMInference,https://api.github.com/users/FMInference/follo...,https://api.github.com/users/FMInference/follo...,https://api.github.com/users/FMInference/gists...,https://api.github.com/users/FMInference/starr...,https://api.github.com/users/FMInference/subsc...,https://api.github.com/users/FMInference/orgs,https://api.github.com/users/FMInference/repos,https://api.github.com/users/FMInference/event...,https://api.github.com/users/FMInference/recei...,Organization,False
3,Mikubill,31246794,MDQ6VXNlcjMxMjQ2Nzk0,https://avatars.githubusercontent.com/u/312467...,,https://api.github.com/users/Mikubill,https://github.com/Mikubill,https://api.github.com/users/Mikubill/followers,https://api.github.com/users/Mikubill/followin...,https://api.github.com/users/Mikubill/gists{/g...,https://api.github.com/users/Mikubill/starred{...,https://api.github.com/users/Mikubill/subscrip...,https://api.github.com/users/Mikubill/orgs,https://api.github.com/users/Mikubill/repos,https://api.github.com/users/Mikubill/events{/...,https://api.github.com/users/Mikubill/received...,User,False
4,timothybrooks,10535711,MDQ6VXNlcjEwNTM1NzEx,https://avatars.githubusercontent.com/u/105357...,,https://api.github.com/users/timothybrooks,https://github.com/timothybrooks,https://api.github.com/users/timothybrooks/fol...,https://api.github.com/users/timothybrooks/fol...,https://api.github.com/users/timothybrooks/gis...,https://api.github.com/users/timothybrooks/sta...,https://api.github.com/users/timothybrooks/sub...,https://api.github.com/users/timothybrooks/orgs,https://api.github.com/users/timothybrooks/repos,https://api.github.com/users/timothybrooks/eve...,https://api.github.com/users/timothybrooks/rec...,User,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,chavinlo,85657083,MDQ6VXNlcjg1NjU3MDgz,https://avatars.githubusercontent.com/u/856570...,,https://api.github.com/users/chavinlo,https://github.com/chavinlo,https://api.github.com/users/chavinlo/followers,https://api.github.com/users/chavinlo/followin...,https://api.github.com/users/chavinlo/gists{/g...,https://api.github.com/users/chavinlo/starred{...,https://api.github.com/users/chavinlo/subscrip...,https://api.github.com/users/chavinlo/orgs,https://api.github.com/users/chavinlo/repos,https://api.github.com/users/chavinlo/events{/...,https://api.github.com/users/chavinlo/received...,User,False
96,facebookresearch,16943930,MDEyOk9yZ2FuaXphdGlvbjE2OTQzOTMw,https://avatars.githubusercontent.com/u/169439...,,https://api.github.com/users/facebookresearch,https://github.com/facebookresearch,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,https://api.github.com/users/facebookresearch/...,Organization,False
97,DAGWorks-Inc,116846391,O_kgDOBvbvNw,https://avatars.githubusercontent.com/u/116846...,,https://api.github.com/users/DAGWorks-Inc,https://github.com/DAGWorks-Inc,https://api.github.com/users/DAGWorks-Inc/foll...,https://api.github.com/users/DAGWorks-Inc/foll...,https://api.github.com/users/DAGWorks-Inc/gist...,https://api.github.com/users/DAGWorks-Inc/star...,https://api.github.com/users/DAGWorks-Inc/subs...,https://api.github.com/users/DAGWorks-Inc/orgs,https://api.github.com/users/DAGWorks-Inc/repos,https://api.github.com/users/DAGWorks-Inc/even...,https://api.github.com/users/DAGWorks-Inc/rece...,Organization,False
98,machine1337,82051128,MDQ6VXNlcjgyMDUxMTI4,https://avatars.githubusercontent.com/u/820511...,,https://api.github.com/users/machine1337,https://github.com/machine1337,https://api.github.com/users/machine1337/follo...,https://api.github.com/users/machine1337/follo...,https://api.github.com/users/machine1337/gists...,https://api.github.com/users/machine1337/starr...,https://api.github.com/users/machine1337/subsc...,https://api.github.com/users/machine1337/orgs,https://api.github.com/users/machine1337/repos,https://api.github.com/users/machine1337/event...,https://api.github.com/users/machine1337/recei...,User,False


Unnamed: 0,key,name,spdx_id,url,node_id,spdx-id,Repository_ID
0,apache-2.0,Apache License 2.0,Apache-2.0,https://api.github.com/licenses/apache-2.0,MDc6TGljZW5zZTI=,,595893961
1,gpl-3.0,GNU General Public License v3.0,GPL-3.0,https://api.github.com/licenses/gpl-3.0,MDc6TGljZW5zZTk=,,601538369
2,apache-2.0,Apache License 2.0,Apache-2.0,https://api.github.com/licenses/apache-2.0,MDc6TGljZW5zZTI=,,602270517
3,mit,MIT License,MIT,https://api.github.com/licenses/mit,MDc6TGljZW5zZTEz,,600798098
4,other,Other,NOASSERTION,,MDc6TGljZW5zZTA=,,586711486
...,...,...,...,...,...,...,...
95,,,,,,,594287160
96,other,Other,NOASSERTION,,MDc6TGljZW5zZTA=,,607276532
97,bsd-3-clause-clear,BSD 3-Clause Clear License,BSD-3-Clause-Clear,https://api.github.com/licenses/bsd-3-clause-c...,MDc6TGljZW5zZTIx,,605682921
98,apache-2.0,Apache License 2.0,Apache-2.0,https://api.github.com/licenses/apache-2.0,MDc6TGljZW5zZTI=,,585309503
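
In the owners output above, owner `16943930` (`facebookresearch`) appears on rows 1 and 96, so the table holds one row per repository rather than one per owner, and the `topics` lists are not tabulated at all. A possible refinement, sketched here with plain Python, keeps one record per owner ID and builds a repository-to-topic table. The helper names and the sample topics are hypothetical.

```python
def unique_owners(items):
    """Return one owner record per distinct owner ID, in first-seen order."""
    seen = {}
    for item in items:
        owner = item["owner"]
        seen.setdefault(owner["id"], owner)
    return list(seen.values())


def topic_rows(items):
    """Return one {repository_id, topic} row per topic of each repository."""
    return [
        {"repository_id": item["id"], "topic": topic}
        for item in items
        for topic in item.get("topics", [])
    ]


# Trimmed-down items: IDs and logins are taken from the output above, the
# topics are made up for illustration.
sample_items = [
    {"id": 601538369, "owner": {"id": 16943930, "login": "facebookresearch"},
     "topics": ["llm"]},
    {"id": 607276532, "owner": {"id": 16943930, "login": "facebookresearch"},
     "topics": []},
    {"id": 595893961, "owner": {"id": 19834515, "login": "lllyasviel"},
     "topics": ["diffusion", "controlnet"]},
]
print(unique_owners(sample_items))
print(topic_rows(sample_items))
```

Both results are lists of dictionaries, so they can be passed to `pd.DataFrame` in the same way as the other tables.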


