GitHub Events Monitor

Using the GH Archive dataset of events on the public GitHub timeline, the goal of this project is to build a real-time ingestion and processing application with Kafka to monitor topics, contributions, repositories, and collaborations of interest.

Ingestion

The GH Archive is available for download via an API. However, to mitigate the risk of streaming directly from the API during Kafka ingestion, I also store all of this data in an S3 bucket. This is done with api_to_s3.py, which takes dates as command-line arguments so that it can be run from the shell over different subsets of dates on multiple EC2 instances in parallel, which speeds up ingestion. The only requirements are that awscli is configured with your credentials on each instance and that python3 is installed with the necessary packages (e.g. boto3). The script runs without modification if a file logs/gh-events.log exists in your working directory and an S3 bucket named git-events has already been created.
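The flow described above can be sketched roughly as follows. This is not the contents of api_to_s3.py, only a minimal illustration under stated assumptions: the two positional arguments (start and end day), the helper names hourly_files and copy_to_s3, and the choice of using the archive file name as the S3 key are all hypothetical. The bucket name and log path come from the text above, and the hourly file naming follows the public GH Archive download URLs.

```python
import argparse
import logging
import sys
import urllib.request
from datetime import date, datetime, timedelta

BUCKET = "git-events"                    # bucket name from the text above
LOG_FILE = "logs/gh-events.log"          # log path from the text above
BASE_URL = "https://data.gharchive.org"  # public GH Archive download endpoint


def hourly_files(start: date, end: date):
    """Yield one GH Archive file name per hour from start to end, inclusive."""
    day = start
    while day <= end:
        for hour in range(24):
            # Files are named YYYY-MM-DD-H.json.gz; the hour is not zero-padded.
            yield f"{day.isoformat()}-{hour}.json.gz"
        day += timedelta(days=1)


def copy_to_s3(name, s3_client):
    """Download one hourly archive and store it in the bucket unchanged."""
    with urllib.request.urlopen(f"{BASE_URL}/{name}") as response:
        # Assumption: the object key is simply the archive file name.
        s3_client.put_object(Bucket=BUCKET, Key=name, Body=response.read())


def main(argv):
    # Assumption: the script takes a start day and an end day.
    parser = argparse.ArgumentParser(description="Copy GH Archive files to S3")
    parser.add_argument("start", help="first day, YYYY-MM-DD")
    parser.add_argument("end", help="last day, YYYY-MM-DD")
    args = parser.parse_args(argv)

    import boto3  # imported here so the date helper works without boto3

    logging.basicConfig(filename=LOG_FILE, level=logging.INFO)
    s3_client = boto3.client("s3")  # uses the credentials configured for awscli
    start = datetime.strptime(args.start, "%Y-%m-%d").date()
    end = datetime.strptime(args.end, "%Y-%m-%d").date()
    for name in hourly_files(start, end):
        try:
            copy_to_s3(name, s3_client)
            logging.info("uploaded %s", name)
        except Exception:
            # Log the failure and continue with the remaining hours.
            logging.exception("failed %s", name)


# Only run when dates are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

With a script shaped like this, each EC2 instance could be given its own date range, for example python3 api_to_s3.py 2015-01-01 2015-01-31 on one instance and the following month on another. Because the ranges do not overlap, the instances never write the same key.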

tdeshong/github-monitor

Folders and files

Latest commit

History

Repository files navigation

GitHub Events Monitor

Ingestion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages