# Download Curated Topics from GitHub

This notebook downloads every user-curated topic on GitHub, along with each topic's description, creator, release date, creation date, last-updated date, and whether it is a featured topic.

We use the [GitHub REST API](https://docs.github.com/en/rest) to collect the topics page by page. The API accepts personal access tokens for authenticating requests, which raises the number of requests allowed per hour, among other benefits. See [Getting Started with Authentication](https://docs.github.com/en/rest/guides/getting-started-with-the-rest-api#authentication) for more information.

To use authentication in this notebook, uncomment the next two lines of code and fill in your GitHub username and personal access token. The later cells detect these variables and send them with every request.

In [1]:
# user = ""
# token = ""
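Pasting a token into a notebook makes it easy to commit the token by accident. One alternative is to read the credentials from environment variables. The following is a minimal sketch, not part of the original notebook: the helper `load_credentials` and the variable names `GITHUB_USER` and `GITHUB_TOKEN` are hypothetical choices made for illustration.

```python
import os


def load_credentials(env=os.environ):
    """Return a (user, token) tuple for HTTP basic auth, or None if either is unset."""
    user = env.get("GITHUB_USER")    # hypothetical variable name
    token = env.get("GITHUB_TOKEN")  # hypothetical variable name
    if user and token:
        return (user, token)
    return None


auth = load_credentials()
```

The result can be passed as `auth=auth` to `requests.get`; when it is `None`, the request is sent unauthenticated.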

In [2]:
from requests import get
import pandas as pd

query = "q=is:curated"
params = "&per_page=100"  # the API limit is 100

# Authenticate only if user and token were defined in the first cell.
auth = (user, token) if 'user' in locals() and 'token' in locals() else None

topics = get(f"https://api.github.com/search/topics?{query}{params}", auth=auth)
topics.raise_for_status()  # fail early on an HTTP error instead of a KeyError below
topics = topics.json()

items = topics["items"]

total = topics["total_count"]
per_page = len(topics["items"])

print(f'Total topics: {total}')
print(f'Topics returned: {per_page}')

Total topics: 670
Topics returned: 100
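
The two numbers printed above determine how many requests are needed in total. A small worked sketch of that arithmetic follows; `page_count` is a hypothetical helper name, not something the notebook defines.

```python
import math


def page_count(total, per_page):
    """Number of pages needed to retrieve `total` items at `per_page` items per page."""
    return math.ceil(total / per_page)


pages = page_count(670, 100)
last_page_size = 670 - (pages - 1) * 100

print(pages, last_page_size)  # 7 70
```

With 670 topics and 100 per page, seven requests are needed, and the last page holds the remaining 70 topics.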


In [3]:
params += "&page={pnum}"
pnum = 1

# Authenticate only if user and token were defined in the first cell.
auth = (user, token) if 'user' in locals() and 'token' in locals() else None

# Keep sending requests for the next page until we receive all the topics.
while pnum * per_page < total:
    pnum += 1

    topics = get(f"https://api.github.com/search/topics?{query}{params.format(pnum=pnum)}", auth=auth)
    topics = topics.json()

    # An error response (for example, rate limiting) carries "message" instead of "items".
    if 'items' not in topics:
        print(topics.get("message", "Unexpected response without items"))
        break

    items.extend(topics["items"])

print(f'Total items retrieved: {len(items)}')

Total items retrieved: 670
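
The paging loop above can also be written as a function that receives the page fetcher as an argument, so the stopping logic can be exercised without network access. This is only a sketch under stated assumptions: `collect_pages`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake data stands in for real API responses.

```python
def collect_pages(fetch_page, per_page=100):
    """Gather items from consecutive pages until total_count is reached.

    fetch_page(pnum) must return a dict shaped like the search response:
    {"total_count": int, "items": list}, or {"message": str} on error.
    """
    items = []
    pnum = 1
    while True:
        page = fetch_page(pnum)
        if "items" not in page:
            # Error payload: stop and keep whatever was collected so far.
            break
        items.extend(page["items"])
        if not page["items"] or pnum * per_page >= page["total_count"]:
            break
        pnum += 1
    return items


# Stand-in fetcher over fake data, so the example runs offline.
fake = list(range(250))


def fake_fetch(pnum, per_page=100):
    start = (pnum - 1) * per_page
    return {"total_count": len(fake), "items": fake[start:start + per_page]}


print(len(collect_pages(fake_fetch)))  # 250
```

In the notebook, `fetch_page` would wrap the `get(...)` call and return the decoded JSON for the requested page.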


In [4]:
# Convert to a DataFrame for easy processing
df_topics = pd.DataFrame(items)

print(df_topics.shape)
df_topics.head()

(670, 11)


| | name | display_name | short_description | description | created_by | released | created_at | updated_at | featured | curated | score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | liko-12 | LIKO-12 | LIKO-12 is an open source fantasy computer mad... | LIKO-12 is an open source fantasy computer com... | Rami Sabbagh | | 2017-09-30T07:04:42Z | 2021-03-03T11:45:47Z | False | True | 1.0 |
| 1 | basic8 | BASIC8 | BASIC8 is an integrated Fantasy Computer for g... | BASIC8 is an integrated Fantasy Computer for g... | Wang Renxin | | 2018-08-14T19:17:07Z | 2021-02-24T18:12:29Z | False | True | 1.0 |
| 2 | pixel-vision-8 | Pixel Vision 8 | Pixel Vision 8 is an open source fantasy game ... | Pixel Vision 8 is a platform that standardizes... | Jesse Freeman | | 2018-08-14T19:17:08Z | 2021-06-28T15:22:06Z | False | True | 1.0 |
| 3 | university-of-texas-arlington | The University of Texas at Arlington | The University of Texas at Arlington is a publ... | UT Arlington was founded in 1895 and is the th... | | | 2019-08-23T19:58:24Z | 2021-06-18T02:56:44Z | False | True | 1.0 |
| 4 | racing-simulator | racing-simulator | A genre of video game. | Computer software that attempts to accurately ... | | | 2021-03-15T23:02:09Z | 2021-10-11T07:45:06Z | False | True | 1.0 |


In [5]:
# List every row that is an exact duplicate of another row (keep=False marks all copies)
df_topics[df_topics.duplicated(keep=False)]

| | name | display_name | short_description | description | created_by | released | created_at | updated_at | featured | curated | score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 86 | action-game | action-game | A genre of video game. | A video game genre that emphasizes physical ch... | | | 2017-02-02T18:22:49Z | 2022-02-20T11:48:54Z | False | True | 1.0 |
| 87 | powertoys | Microsoft PowerToys | Microsoft PowerToys is a set of utilities for ... | Microsoft PowerToys is a set of utilities for ... | Microsoft | | 2019-09-22T16:18:54Z | 2022-02-28T22:56:06Z | False | True | 1.0 |
| 88 | aurelia | aurelia | A next generation JavaScript client framework ... | Aurelia is a next generation JavaScript client... | Rob Eisenberg | July 2016 | 2017-01-31T21:28:33Z | 2021-08-02T09:22:09Z | False | True | 1.0 |
| 89 | roguelite | roguelite | A genre of video game. | A genre of video games that take certain eleme... | | | 2017-02-02T21:11:36Z | 2021-10-11T07:45:07Z | False | True | 1.0 |
| 90 | luvit | Luvit | Asynchronous I/O for Lua. | Luvit is an asynchronous I/O Lua runtime envir... | Tim Caswell | 2011 | 2017-02-01T02:43:02Z | 2022-02-14T15:44:28Z | False | True | 1.0 |
| 91 | openfin | OpenFin | OpenFin allows web applications to run as firs... | OpenFin is a web application runtime and API t... | Brian Schwinn | December 2012 | 2017-02-01T14:34:38Z | 2021-09-17T16:53:02Z | False | True | 1.0 |
| 92 | christianity | Christianity | Christianity is an Abrahamic monotheistic reli... | Christianity is an Abrahamic monotheistic reli... | | | 2017-02-01T03:00:07Z | 2022-02-07T09:39:54Z | False | True | 1.0 |
| 93 | raku | Raku | Raku is an expressive and feature-rich program... | Raku is an expressive and feature-rich program... | Larry Wall | December 25, 2015 | 2017-05-17T10:33:29Z | 2021-12-02T17:25:28Z | False | True | 1.0 |
| 94 | synthetic-biology | Synthetic biology | The creation of new biological systems via the... | Synthetic biology (SynBio) is a multidisciplin... | | | 2017-01-31T21:42:02Z | 2022-01-17T19:07:13Z | False | True | 1.0 |
| 95 | real-time-strategy | real-time-strategy | A genre of video game. | A sub-genre of strategy video games in which t... | | | 2017-02-18T21:57:26Z | 2022-03-02T02:08:51Z | False | True | 1.0 |


In [6]:
# Remove exact duplicate rows, then drop the curated and score columns
df_topics = df_topics.drop_duplicates().drop(['curated','score'], axis=1)
df_topics.shape

(655, 9)
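
Called without arguments, `drop_duplicates()` removes only rows that match in every column. If the same topic came back twice with, say, a different `updated_at`, both copies would survive. The sketch below contrasts that with deduplicating on `name` alone; the rows are made up for illustration and are not taken from the API results.

```python
import pandas as pd

# Made-up rows: one exact repeat (aurelia) and one near-repeat (luvit).
rows = pd.DataFrame({
    "name": ["aurelia", "luvit", "aurelia", "luvit"],
    "updated_at": ["2021-08-02", "2022-02-14", "2021-08-02", "2022-03-01"],
})

exact = rows.drop_duplicates()                 # drops only the repeated aurelia row
by_name = rows.drop_duplicates(subset="name")  # keeps the first row for each name

print(len(exact), len(by_name))  # 3 2
```

Whether `subset="name"` is appropriate depends on whether the topic name is meant to be a unique key in the saved file.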

In [7]:
import os

os.makedirs('./data_output', exist_ok=True)
df_topics.reset_index(drop=True).to_csv('./data_output/github_topics.csv')
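
Because `to_csv` writes the index by default, reading the file back with a plain `read_csv` produces an extra column named `Unnamed: 0`. The sketch below shows the round trip on a tiny made-up frame, using an in-memory buffer in place of `./data_output/github_topics.csv`.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for df_topics.
df = pd.DataFrame({"name": ["raku", "luvit"], "featured": [False, False]})

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
plain = pd.read_csv(buffer)                 # index comes back as a data column
buffer.seek(0)
indexed = pd.read_csv(buffer, index_col=0)  # first column restored as the index

print(list(plain.columns))    # ['Unnamed: 0', 'name', 'featured']
print(list(indexed.columns))  # ['name', 'featured']
```

Passing `index_col=0` when loading, or `index=False` when saving, avoids the extra column.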