05-Sandagomi-Pieris-CrawlerX-Distributed-Crawling-System.md

CrawlerX Distributed Crawling System

Student Info

Project Abstract

CrawlerX is a platform for crawling web URLs over different kinds of protocols in a distributed way. Web crawling, often called web scraping, is a method of programmatically going over a collection of web pages and extracting data, which is useful for analysis of web-based data. With a web scraper, you can mine data about a set of products, gather a large corpus of text or quantitative data to experiment with, collect data from a site without an official API, or simply satisfy your own curiosity. CrawlerX was originally designed to run in a VM-based environment with limited functionality; this project extends the platform to containerized environments.

K8s with Helm

Project Overview

  1. Pod Deployment

Pods

  2. Service Deployment

Services

  3. Ingress Deployment

Ingress

  4. ConfigMap Deployment

ConfigMaps
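Once the chart is installed, each of the four artifact types above can be inspected with `kubectl`. A minimal sketch, assuming the release is deployed into a namespace named `crawlerx` (the namespace name here is an assumption, not taken from the repository):

```shell
# List the artifacts created by the chart, per resource type.
# The namespace "crawlerx" is an assumption; substitute your own.
kubectl get pods -n crawlerx
kubectl get services -n crawlerx
kubectl get ingress -n crawlerx
kubectl get configmaps -n crawlerx
```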

Work Summary

This project mainly focuses on deploying the CrawlerX web platform on Kubernetes. As per the details provided by the SCoRe organization mentors, CrawlerX needs to be deployed as an on-demand platform, and after investigation Helm was chosen to implement this requirement. Helm manages Kubernetes applications as charts, which are easy to create, version, share, and publish (or unpublish). Users can now deploy CrawlerX on a K8s environment with a single command:

```shell
helm install <RELEASE_NAME> <HELM_HOME> --namespace <NAMESPACE> --dependency-update --create-namespace
```

Helm Deployment
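For illustration, the command with the placeholders filled in might look as follows; the release name, chart path, and namespace below are hypothetical, not values from the repository:

```shell
# Install a "crawlerx" release from a local chart directory into its own
# namespace; all names here are illustrative assumptions.
helm install crawlerx ./crawlerx \
  --namespace crawlerx \
  --dependency-update \
  --create-namespace

# Check the status of the release afterwards:
helm status crawlerx --namespace crawlerx
```

The `--dependency-update` flag refreshes the chart's declared dependencies (here MongoDB, RabbitMQ, and Elasticsearch) before installing, and `--create-namespace` creates the target namespace if it does not already exist.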

What Was Covered

  • Add Helm chart for CrawlerX platform
  • Add K8s artifacts for VueJS based frontend server deployment
  • Add K8s artifacts for Django backend server deployment
  • Add K8s artifacts for Celery beat deployment
  • Add K8s artifacts for Celery worker deployment
  • Add K8s artifacts for Scrapy crawler deployment
  • Add MongoDB, RabbitMQ and Elasticsearch deployments as chart dependencies
  • Add K8s secret artifacts to pull private images for the pods
  • Add ConfigMaps for each deployment
  • Configure values.yaml to customize deployment parameters
  • Documentation of the K8s deployment
  • Testing on a local Minikube cluster
  • Testing on Google Kubernetes Engine
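Since `values.yaml` exposes the deployment parameters, individual settings can be overridden at install time. A sketch of the two usual approaches; the `frontend.replicaCount` key and the file name `my-values.yaml` are assumptions for illustration, not keys from the actual chart:

```shell
# Override a single value inline (the key name is an assumption):
helm install crawlerx ./crawlerx --namespace crawlerx --create-namespace \
  --set frontend.replicaCount=2

# Or supply a whole override file layered on top of the chart defaults:
helm install crawlerx ./crawlerx --namespace crawlerx --create-namespace \
  -f my-values.yaml
```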

What's Left

  • Integrate Grafana dashboard

References

  1. Helm
  2. Google Kubernetes Engine