# GitHub repos list

In this Jupyter Notebook we will go through the BioHackathon repos and retrieve the list of repositories that we will extract metadata for.

## What BioHackathons?

We are using the project repositories to get a tentative list of participating project repos.
* [BioHackathon Europe 2019](https://github.com/elixir-europe/BioHackathon-projects-2019)
* [BioHackathon Europe 2020](https://github.com/elixir-europe/BioHackathon-projects-2020), plus repositories tagged with the topic BioHackEU20

Later we plan to add BioHackathon repos from the DBCLS/NBDC BioHackathons and the Covid-19 Virtual BioHackathon.

## What comes next?

* We plan to integrate the repo list retrieval with the process pipeline
* We plan to improve the metadata we are retrieving to get better analyses

## What are we doing here?
* Get projects from the two selected BioHackathons
* For each project we get the readme file and identify any mention of a GitHub URL
* For 2020 we double check the repo has the topic BioHackEU20
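
The URL extraction step listed above can be tried offline on a small piece of text. The following is a minimal sketch, assuming a repository URL is `https://github.com` followed by exactly two path segments; the README text and the `example-org` names are hypothetical and used only for illustration.

```python
import re

# One path segment: '/' followed by characters accepted in owner and repository names.
SEGMENT = r"/[A-Za-z0-9_.~#-]+"
REPO_URL = re.compile(r"https://github\.com" + SEGMENT + SEGMENT)


def extract_repo_names(readme_text):
    """Return the set of 'owner/repo' names mentioned in a README text."""
    urls = REPO_URL.findall(readme_text)
    return {url.replace("https://github.com/", "") for url in urls}


# Hypothetical README content, not taken from any real project.
sample_readme = """
# Project 1
Code lives at https://github.com/example-org/example-repo and in
[a second repository](https://github.com/example-org/other_repo).
See also https://github.com/example-org (an organisation page, not a repository).
"""

repo_names = extract_repo_names(sample_readme)
print(sorted(repo_names))
# prints ['example-org/example-repo', 'example-org/other_repo']
```

The organisation page is skipped because it has only one path segment after the host name.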


In [11]:
import re

#constants
BH_2019 = 'elixir-europe/BioHackathon-projects-2019'
BH_2020 = 'elixir-europe/BioHackathon-projects-2020'
BH_2020_TOPIC = 'biohackeu20'
README = 'README.md'

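#A path segment: '/' followed by characters allowed in owner and repository names
#Note: the range must be A-Za-z; A-z would also match [ \ ] ^ _ and the backtick
MID_PATTERN = r'/[A-Za-z0-9_.~#-]+'
PATTERN = re.compile(r'(https://github\.com' + MID_PATTERN + MID_PATTERN + ')')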

In [2]:
#let's get the GitHub token so we can use the API
import getpass

try:
    from secret import GITHUB_TOKEN
except ModuleNotFoundError:
    GITHUB_TOKEN = getpass.getpass("Enter your personal access token to access the GitHub API: ")

Enter your personal access token to access the GitHub API: ········
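
When the notebook runs unattended, an interactive prompt is not practical. One possible alternative, sketched here as an assumption and not part of the original notebook, is to look for the token in an environment variable first and prompt only when it is absent. The function name `load_github_token` and the variable name `GITHUB_TOKEN` are illustrative choices.

```python
import getpass
import os


def load_github_token(env_var="GITHUB_TOKEN"):
    """Return the token from the environment, prompting only if it is missing."""
    token = os.environ.get(env_var)
    if token:
        return token
    # Fall back to the interactive prompt used in the cell above.
    return getpass.getpass("Enter your personal access token to access the GitHub API: ")
```

Calling `load_github_token()` in place of the `try`/`except` block would keep the token out of the notebook and out of version control.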


In [12]:
#Retrieve all GitHub repo-like URLs (using pattern matching) from the README files of all folders listed under the projects folder
#This function relies on the project folder layout defined for the BioHackathon Europe repositories
def retrieve_bheu_repo_set(g, repo_path):
    repo_set = set()
    repo = g.get_repo(repo_path)
    contents = repo.get_contents("projects")
    for project in contents:
        path = project.path + '/' + README
        readme = repo.get_contents(path)
        readme_text = readme.decoded_content.decode()
        #ToDo: Is there a better way to do this?
        items = re.findall(PATTERN, readme_text)        
        repo_set.update(items)
        
    repo_set = {item.replace('https://github.com/', '') for item in repo_set}
    
    return repo_set
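
Regarding the ToDo in the function above: because the pattern accepts `.` and `#` inside a path segment, a URL that ends a sentence or carries a fragment may be captured with extra characters. A possible clean-up step is sketched below. It assumes that anything after `#` is a URL fragment and that a trailing full stop or `.git` suffix is not part of the repository name; `normalize_repo_name` and `normalize_repo_set` are hypothetical helpers, not part of the notebook.

```python
def normalize_repo_name(name):
    """Clean one 'owner/repo' candidate; return None if it is not usable."""
    name = name.split("#", 1)[0]   # drop URL fragments such as '#readme'
    name = name.rstrip(".")        # drop a full stop that ended a sentence
    if name.endswith(".git"):      # drop the clone-URL suffix
        name = name[: -len(".git")]
    owner, _, repo = name.partition("/")
    if not owner or not repo or "/" in repo:
        return None
    return owner + "/" + repo


def normalize_repo_set(names):
    """Normalize every candidate and discard the unusable ones."""
    cleaned = (normalize_repo_name(name) for name in names)
    return {name for name in cleaned if name is not None}
```

Applying `normalize_repo_set(repo_set)` before returning would also merge duplicates that differ only in such suffixes.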

In [13]:
#Save to a file all URLs
def save_to_file(file_path, repo_set):
    with open(file_path, 'w') as f:
        for item in repo_set:
            f.write("%s\n" % item)
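
Sets have no defined iteration order, so the file written above can list the same repositories in a different order on each run. A minimal sketch of a sorted variant, together with a function to read the list back, is given below. `save_sorted` and `load_repo_set` are hypothetical names, and the repository names are placeholders.

```python
import os
import tempfile


def save_sorted(file_path, repo_set):
    """Write one repository name per line, in sorted order."""
    with open(file_path, "w", encoding="utf-8") as f:
        for item in sorted(repo_set):
            f.write(item + "\n")


def load_repo_set(file_path):
    """Read the file back into a set, skipping blank lines."""
    with open(file_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Round trip through a temporary directory, with placeholder names.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "repos.txt")
    save_sorted(path, {"example-org/b-repo", "example-org/a-repo"})
    with open(path, encoding="utf-8") as f:
        saved_lines = f.read().splitlines()
    restored = load_repo_set(path)
```

A stable order keeps diffs of the saved lists small when the notebook is re-run.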
   

In [3]:
#!pip install PyGithub
from github import Github

g = Github(GITHUB_TOKEN)
print(g)


<github.MainClass.Github object at 0x0000024CBAC86DC8>


In [47]:
#Getting and saving repositories for BioHackathon Europe 2019
bh_2019_repo_set = retrieve_bheu_repo_set(g, BH_2019)
bh_2019_repo_set.add(BH_2019)
save_to_file('./data/bh_2019_repos.txt', bh_2019_repo_set)
        
print('done')

done


In [15]:
bh_2020_repo_set = set()

#Getting repositories for BioHackathon Europe 2020
#bh_2020_repo_set = retrieve_bheu_repo_set(g, BH_2020)
bh_2020_repo_set.add(BH_2020)

#Get also those with the topic
#ToDo: Should it be the intersection?
repositories = g.search_repositories(query='topic:' + BH_2020_TOPIC)
biohackeu20_repo_list = [repo.full_name for repo in repositories]
bh_2020_repo_set.update(biohackeu20_repo_list)

#Save it all
save_to_file('./data/bh_2020_repos_biohackeu20_topic.txt', bh_2020_repo_set)
        
print('done')

done
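
On the ToDo about the intersection: the two choices select different repositories, which a small example with Python sets makes concrete. The names below are hypothetical and stand in for the README-derived list and the topic search results.

```python
# Placeholder repository names for the two sources.
readme_repos = {"example-org/tool-a", "example-org/tool-b", "example-org/unrelated-lib"}
topic_repos = {"example-org/tool-b", "example-org/tool-c"}

union = readme_repos | topic_repos         # mentioned in a README or tagged with the topic
intersection = readme_repos & topic_repos  # mentioned in a README and tagged with the topic
readme_only = readme_repos - topic_repos   # mentioned but untagged: candidates for manual review

print(sorted(union))
print(sorted(intersection))
print(sorted(readme_only))
```

The union keeps projects that never added the topic but may include unrelated dependencies linked from a README; the intersection is stricter and drops both.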


In [53]:
#Get also those with the topic BioHackCovid20, for these we do not have a seed repository (or do we?)
repositories = g.search_repositories(query='topic:biohackcovid20')
bh_covid_2020_repo_set = {repo.full_name for repo in repositories}

#Save it all
save_to_file('./data/bh_covid_2020_repos.txt', bh_covid_2020_repo_set)
        
print('done')

done
