<h1>Requesting data from GitHub using the API</h1>

## 1. Request JSON output from the GitHub Search API
+ Search syntax documentation: https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories
+ API documentation: https://docs.github.com/en/rest/reference/search#search-repositories
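
As a rough illustration of how a search request is put together, the sketch below combines a keyword with search qualifiers into the `q` parameter and URL-encodes the result. The helper name `build_search_url` and the example qualifiers are assumptions made for illustration; they do not come from the notebook.

```python
from urllib.parse import urlencode, urlparse, parse_qs

SEARCH_URL = "https://api.github.com/search/repositories"

def build_search_url(term, qualifiers=None, per_page=100, page=1):
    """Build a repository search URL (hypothetical helper, not part of any library)."""
    # Qualifiers such as language:python are appended to the keyword, separated by spaces
    parts = [term] + [f"{key}:{value}" for key, value in (qualifiers or {}).items()]
    params = {"q": " ".join(parts), "per_page": per_page, "page": page}
    # urlencode escapes spaces, colons and comparison operators for us
    return f"{SEARCH_URL}?{urlencode(params)}"

example_url = build_search_url("education", {"language": "python", "created": ">=2021-09-01"})
print(example_url)
```

Building the URL with `urlencode` avoids hand-escaping the spaces and colons that qualifiers introduce.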

In [6]:
# Function that searches repositories for a keyword and returns the number of result pages
# per_page: request 100 items per page
import requests
import math

def total_pages(term):
    url = "https://api.github.com/search/repositories"
    response = requests.get(url, headers={"Accept": "application/json"},
                            params={"q": term, "per_page": 100})
    git_request = response.json()
    items = git_request.get("items", [])
    # Avoid dividing by zero when the search returns no items
    if not items:
        return 0
    # Use a distinct local name so the function name is not shadowed
    page_count = math.ceil(git_request["total_count"] / len(items))
    return page_count

In [7]:
total_pages("python")

22110
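
A page count this large cannot actually be walked, because the Search API serves at most the first 1,000 results of a query. The sketch below works through that arithmetic. `reachable_pages` and `MAX_SEARCH_RESULTS` are hypothetical names introduced for illustration, and the total of roughly 2.2 million matches is only inferred from the 22110 pages reported above.

```python
import math

MAX_SEARCH_RESULTS = 1000  # the Search API returns at most this many results per query

def reachable_pages(total_count, per_page=100):
    """Number of pages that can really be fetched for one query (illustrative helper)."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # Results beyond the cap exist, but cannot be paged through
    capped = min(total_count, MAX_SEARCH_RESULTS)
    return math.ceil(capped / per_page)

# About 2.2 million matches still yield only 10 fetchable pages of 100 items
print(reachable_pages(2_211_000))
```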

## 2. Trying to get data from all pages
From this point on, we attempt pagination, because the GitHub Search API returns at most the first 1,000 results for any single query.
1. We built a function that calculates the total number of pages.
2. We planned to loop over the pages while the page number is no larger than <total_page>, but this has not succeeded yet.
3. Next, we will try a solution suggested on Stack Overflow: splitting the search into segments by date of creation: https://stackoverflow.com/questions/37602893/github-search-limit-results
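
The date-splitting idea in step 3 can be sketched without any network access: cut the period of interest into fixed-size windows and turn each window into a `created:` range qualifier. `date_windows` and `created_qualifier` are hypothetical helper names, and the 12-hour step is an arbitrary choice for this sketch.

```python
from datetime import datetime, timedelta, timezone

def date_windows(start, end, step):
    """Split [start, end) into consecutive windows of at most `step` (illustrative helper)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    windows = []
    since = start
    while since < end:
        until = min(since + step, end)  # clip the last window to the end of the range
        windows.append((since, until))
        since = until
    return windows

def created_qualifier(since, until):
    """Format one window as a created:SINCE..UNTIL search qualifier."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"created:{since.strftime(fmt)}..{until.strftime(fmt)}"

start = datetime(2021, 9, 4, tzinfo=timezone.utc)
end = datetime(2021, 9, 5, 6, tzinfo=timezone.utc)
for since, until in date_windows(start, end, timedelta(hours=12)):
    print(created_qualifier(since, until))
```

Each window should match fewer than 1,000 repositories for the approach to work; if a window exceeds the cap, a smaller step is needed.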

In [97]:
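# Function to calculate the total number of pages (default page size: 30 items)

def total_page(term):
    url = "https://api.github.com/search/repositories"
    response = requests.get(url, headers={"Accept": "application/json"},
                            params={"q": term})
    data = response.json()  # parse the response body once
    items = data.get("items", [])
    # Avoid dividing by zero when the search returns no items
    if not items:
        return 0
    return math.ceil(data["total_count"] / len(items))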

term = "open education"
total_page(term)

33

In [36]:
import os
import requests
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

term = "education"
URL = f'https://api.github.com/search/repositories?q={term} created:SINCE..UNTIL&per_page=100'
# Read the token from the environment rather than hard-coding it (GITHUB_TOKEN is an assumed variable name)
HEADERS = {'Authorization': f'token {os.environ["GITHUB_TOKEN"]}'}

since = datetime.today() - relativedelta(months=1)  # start with repositories created 1 month ago
until = since + timedelta(hours=12)  # split the search into 12-hour segments
repo_list = []

while until < datetime.today():
    day_url = URL.replace('SINCE', since.strftime('%Y-%m-%dT%H:%M:%SZ')).replace('UNTIL', until.strftime('%Y-%m-%dT%H:%M:%SZ'))
    repo_request = requests.get(day_url, headers=HEADERS)
    print(f'Repositories created between {since} and {until}: {repo_request.json().get("total_count", 0)}')
    # Update dates for the next search
    since = until #move start-date and end-date up 12hours
    until = since + timedelta(hours=12)

Repositories created between 2021-09-04 14:06:58.138787 and 2021-09-05 02:06:58.138787: 16
Repositories created between 2021-09-05 02:06:58.138787 and 2021-09-05 14:06:58.138787: 28
Repositories created between 2021-09-05 14:06:58.138787 and 2021-09-06 02:06:58.138787: 21
Repositories created between 2021-09-06 02:06:58.138787 and 2021-09-06 14:06:58.138787: 32
Repositories created between 2021-09-06 14:06:58.138787 and 2021-09-07 02:06:58.138787: 17
Repositories created between 2021-09-07 02:06:58.138787 and 2021-09-07 14:06:58.138787: 33
Repositories created between 2021-09-07 14:06:58.138787 and 2021-09-08 02:06:58.138787: 22
Repositories created between 2021-09-08 02:06:58.138787 and 2021-09-08 14:06:58.138787: 22
Repositories created between 2021-09-08 14:06:58.138787 and 2021-09-09 02:06:58.138787: 18
Repositories created between 2021-09-09 02:06:58.138787 and 2021-09-09 14:06:58.138787: 39
Repositories created between 2021-09-09 14:06:58.138787 and 2021-09-10 02:06:58.138787: 27

In [41]:
import os
import requests
import math
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

term = "python"
URL = f'https://api.github.com/search/repositories?q={term} created:SINCE..UNTIL&per_page=100'
# Read the token from the environment rather than hard-coding it (GITHUB_TOKEN is an assumed variable name)
HEADERS = {'Authorization': f'token {os.environ["GITHUB_TOKEN"]}'}

since = datetime.today() - relativedelta(days=1)  # start with repositories created 1 day ago
until = since + timedelta(hours=12)  # split the search into 12-hour segments
repo_list = []

while until < datetime.today():
    day_url = URL.replace('SINCE', since.strftime('%Y-%m-%dT%H:%M:%SZ')).replace('UNTIL', until.strftime('%Y-%m-%dT%H:%M:%SZ'))
    repo_request = requests.get(day_url, headers=HEADERS)
    total = repo_request.json().get("total_count", 0)  # 0 if the response carries no count
    print(f'Repositories created between {since} and {until}: {total}')
    no_page = math.ceil(total / 100)  # total number of pages for this segment
    for i in range(1, no_page + 1):  # fetch each page
        page_url = f'{day_url}&page={i}'
        page_request = requests.get(page_url, headers=HEADERS)
        # add the fetched page to the list; a failed request has no "items" key
        repo_list.extend(page_request.json().get("items", []))
    # Update dates for the next search
    since = until #move start-date and end-date up 12hours
    until = since + timedelta(hours=12)

Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17
Repositories created between 2021-09-04 14:15:09.946350 and 2021-09-05 02:15:09.946350: 17

KeyboardInterrupt: 
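
Loops like the one above send many requests in a row, and the Search API is rate limited. GitHub reports the remaining budget in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The sketch below shows one way to decide how long to pause before the next request; `seconds_until_reset` is a hypothetical helper, and whether rate limiting played any part in the interrupted run above is not known.

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how long to wait before the next request (illustrative helper).

    `headers` is any mapping of response headers, e.g. response.headers from requests.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0.0  # requests are still available, no need to wait
    # The reset header holds a UTC epoch timestamp in seconds
    reset_at = float(lowered.get("x-ratelimit-reset", 0))
    now = time.time() if now is None else now
    return max(0.0, reset_at - now)

# Example with made-up header values
wait = seconds_until_reset({"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}, now=990)
print(wait)
```

In a fetch loop, calling `time.sleep(seconds_until_reset(response.headers))` after each request would be one way to use it.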

In [15]:
len(repo_list)

1141
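
Because each segment starts exactly where the previous one ends (`since = until`), a repository created on a boundary timestamp might be returned by two adjacent segments. Whether this actually happens depends on how the API treats range endpoints, so the sketch below is only a precaution: it removes repeated entries by their `id` field. `dedupe_by_id` is a hypothetical helper name.

```python
def dedupe_by_id(repos):
    """Keep the first occurrence of each repository id, preserving order (illustrative helper)."""
    seen = set()
    unique = []
    for repo in repos:
        repo_id = repo.get("id")
        if repo_id in seen:
            continue  # already collected from an earlier segment
        seen.add(repo_id)
        unique.append(repo)
    return unique

# Made-up sample data: the first repository appears twice
sample = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
print(len(dedupe_by_id(sample)))
```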

In [69]:
# Function to search for a specific "term" and fetch all repositories created in the last {day} days
def find_repo(term, day):
    import os
    import requests
    import math
    from datetime import datetime, timedelta
    from dateutil.relativedelta import relativedelta

    URL = f'https://api.github.com/search/repositories?q={term}+created:SINCE..UNTIL&per_page=100'
    # Read the token from the environment rather than hard-coding it (GITHUB_TOKEN is an assumed variable name)
    HEADERS = {'Authorization': f'token {os.environ["GITHUB_TOKEN"]}'}

    since = datetime.today() - relativedelta(days= day)  # Start fetching repo created {day} days ago
    until = since + timedelta(hours=12) # dividing the total No.of repo into segments of 12 hours each
    repo_list = []

    while until < datetime.today():
        day_url = URL.replace('SINCE', since.strftime('%Y-%m-%dT%H:%M:%SZ')).replace('UNTIL', until.strftime('%Y-%m-%dT%H:%M:%SZ'))
        repo_request = requests.get(day_url, headers=HEADERS)
        #print(f'Repositories created between {since} and {until}: {repo_request.json().get("total_count")}')
        # total number of pages for this segment; 0 if the response carries no count
        no_page = math.ceil(repo_request.json().get("total_count", 0) / 100)
        for i in range(1, no_page + 1):  # fetch each page
            page_url = f'{day_url}&page={i}'
            page_request = requests.get(page_url, headers=HEADERS)
            # add the fetched page to the list; a failed request has no "items" key
            repo_list.extend(page_request.json().get("items", []))
        # Update dates for the next search
        since = until #move start-date and end-date up 12hours
        until = since + timedelta(hours=12)
    return repo_list

In [72]:
dt = find_repo("python",1)

In [86]:
import csv
import pandas as pd

# dt holds the raw API items, so read the API field names (html_url, created_at, ...)
filename = f"{term}_{datetime.now().date()}.csv"
with open(filename, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file, delimiter=";")
    writer.writerow(["id", "name", "url", "language", "created", "stars", "watch", "forks"])
    for repo in dt:
        writer.writerow([repo['id'], repo['name'], repo['html_url'], repo['language'],
                         repo['created_at'], repo['stargazers_count'],
                         repo['watchers_count'], repo['forks_count']])
rep = pd.read_csv(filename, delimiter=";")

rep

Unnamed: 0,id,name,url,language,created,stars,watch,forks
0,413226906,HacktoberFest2021-python-For-RUSL-Students,https://github.com/Priyasad1997/HacktoberFest2...,Python,2021-10-04T00:09:01Z,2,2,12
1,413127168,Python-Programs,https://github.com/tasha06/Python-Programs,Python,2021-10-03T16:06:35Z,0,0,9
2,413151356,new-python-codes,https://github.com/DheerajMandvi9/new-python-c...,Python,2021-10-03T17:40:24Z,1,1,5
3,413151410,StreamlitModelApp,https://github.com/sonamehdi19/StreamlitModelApp,Python,2021-10-03T17:40:41Z,0,0,4
4,413147970,Hacktoberfest-python-code-bunch,https://github.com/Ankitkundu21/Hacktoberfest-...,Python,2021-10-03T17:26:25Z,0,0,2
...,...,...,...,...,...,...,...,...
1449,413301099,SparkXcloud-Gdrive-MirrorBot,https://github.com/Spark-X-Cloud/SparkXcloud-G...,Python,2021-10-04T06:33:44Z,0,0,0
1450,413374787,EEE3097S-Project-Repository,https://github.com/MikeMillard/EEE3097S-Projec...,,2021-10-04T10:28:37Z,0,0,0
1451,413393269,Prediction-of-Default-Customers-based-on-credi...,https://github.com/maheshk-DS/Prediction-of-De...,Jupyter Notebook,2021-10-04T11:31:14Z,0,0,0
1452,413282742,talo-hacktoberfest2021,https://github.com/abhishek213-alb/talo-hackto...,,2021-10-04T05:14:34Z,0,0,0


In [87]:
# Wrapping everything into a function:

def find_repo(term, day):
    import os
    import requests
    import math
    from datetime import datetime, timedelta
    from dateutil.relativedelta import relativedelta
    import csv
    import pandas as pd

    URL = f'https://api.github.com/search/repositories?q={term}+created:SINCE..UNTIL&per_page=100'
    # Read the token from the environment rather than hard-coding it (GITHUB_TOKEN is an assumed variable name)
    HEADERS = {'Authorization': f'token {os.environ["GITHUB_TOKEN"]}'}

    since = datetime.today() - relativedelta(days= day)  # Start fetching repo created {day} days ago
    until = since + timedelta(hours=12) # dividing the total No.of repo into segments of 12 hours each
    repo_list = []
    dt = []
    repo = []

    # Fetching repositories:
    while until < datetime.today():
        day_url = URL.replace('SINCE', since.strftime('%Y-%m-%dT%H:%M:%SZ')).replace('UNTIL', until.strftime('%Y-%m-%dT%H:%M:%SZ'))
        repo_request = requests.get(day_url, headers=HEADERS)
        #print(f'Repositories created between {since} and {until}: {repo_request.json().get("total_count")}')
        # total number of pages for this segment; 0 if the response carries no count
        no_page = math.ceil(repo_request.json().get("total_count", 0) / 100)
        for i in range(1, no_page + 1):  # fetch each page
            page_url = f'{day_url}&page={i}'
            page_request = requests.get(page_url, headers=HEADERS)
            # add the fetched page to the list; a failed request has no "items" key
            repo_list.extend(page_request.json().get("items", []))
        # Update dates for the next search
        since = until #move start-date and end-date up 12hours
        until = since + timedelta(hours=12)
    
    #Saving relevant variables into a list:
    for item in repo_list:
        # Build each row directly, without temporaries that shadow built-ins such as id
        dt.append({"id": item.get("id"),
                   "name": item.get("name"),
                   "url": item.get("html_url"),
                   "created": item.get("created_at"),
                   "stars": item.get("stargazers_count"),
                   "watch": item.get("watchers_count"),
                   "language": item.get("language"),
                   "forks": item.get("forks_count")})
        
    #Writing data into .csv file and returning the table:
    filename = f"{term}_{datetime.now().date()}.csv"
    # newline="" keeps the csv module from writing blank rows on Windows
    with open(filename, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(["id", "name", "url", "language", "created", "stars", "watch", "forks"])
        for repo in dt:
            writer.writerow([repo['id'], repo['name'], repo['url'], repo['language'], repo['created'], repo['stars'], repo['watch'], repo['forks']])
    rep = pd.read_csv(filename, delimiter=";")
    
    return rep

In [88]:
find_repo("python",1)

Unnamed: 0,id,name,url,language,created,stars,watch,forks
0,413226906,HacktoberFest2021-python-For-RUSL-Students,https://github.com/Priyasad1997/HacktoberFest2...,Python,2021-10-04T00:09:01Z,2,2,12
1,413151356,new-python-codes,https://github.com/DheerajMandvi9/new-python-c...,Python,2021-10-03T17:40:24Z,1,1,5
2,413151410,StreamlitModelApp,https://github.com/sonamehdi19/StreamlitModelApp,Python,2021-10-03T17:40:41Z,0,0,4
3,413271258,python_wikipedia,https://github.com/harinandanan2112/python_wik...,Python,2021-10-04T04:17:18Z,0,0,3
4,413147970,Hacktoberfest-python-code-bunch,https://github.com/Ankitkundu21/Hacktoberfest-...,Python,2021-10-03T17:26:25Z,0,0,2
...,...,...,...,...,...,...,...,...
1456,413374787,EEE3097S-Project-Repository,https://github.com/MikeMillard/EEE3097S-Projec...,,2021-10-04T10:28:37Z,0,0,0
1457,413393269,Prediction-of-Default-Customers-based-on-credi...,https://github.com/maheshk-DS/Prediction-of-De...,Jupyter Notebook,2021-10-04T11:31:14Z,0,0,0
1458,413455845,FolderOrganizer,https://github.com/arevish/FolderOrganizer,,2021-10-04T14:25:26Z,0,0,0
1459,413282742,talo-hacktoberfest2021,https://github.com/abhishek213-alb/talo-hackto...,,2021-10-04T05:14:34Z,0,0,0
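
The field selection and CSV export inside `find_repo` can also be expressed as a single mapping from column names to API keys. The sketch below is one possible arrangement using only the standard library; `FIELDS`, `to_row` and `rows_to_csv` are hypothetical names, and the sample item is made up.

```python
import csv
import io

# Column name in the CSV -> key in the API response item
FIELDS = {
    "id": "id",
    "name": "name",
    "url": "html_url",
    "language": "language",
    "created": "created_at",
    "stars": "stargazers_count",
    "watch": "watchers_count",
    "forks": "forks_count",
}

def to_row(item):
    """Pick the relevant fields from one API item; missing keys become None."""
    return {column: item.get(api_key) for column, api_key in FIELDS.items()}

def rows_to_csv(items, delimiter=";"):
    """Serialise the items to CSV text with a header row (illustrative helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELDS), delimiter=delimiter,
                            lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow(to_row(item))
    return buffer.getvalue()

# Made-up item shaped like a search result
sample_item = {"id": 1, "name": "demo", "html_url": "https://github.com/example/demo",
               "language": None, "created_at": "2021-10-04T00:00:00Z",
               "stargazers_count": 2, "watchers_count": 2, "forks_count": 0}
print(rows_to_csv([sample_item]))
```

Keeping the column mapping in one place means a new column only needs one extra entry in `FIELDS`.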
