05-Sandagomi-Pieris-CrawlerX-Distributed-Crawling-System.md

CrawlerX Distributed Crawling System

Student Info

Project Abstract

CrawlerX is a platform for crawling web URLs over different kinds of protocols in a distributed way. Web crawling, often called web scraping, is a method of programmatically going over a collection of web pages and extracting data, which is useful for analysis of web-based data. With a web scraper, you can mine data about a set of products, gather a large corpus of text or quantitative data to experiment with, collect data from a site without an official API, or simply satisfy your own curiosity. CrawlerX was originally designed to run in a VM-based environment with limited functionality; this project extends the platform to containerized environments.

K8s with Helm

Project Overview

  1. Pod Deployment

Pods

  2. Service Deployment

Services

  3. Ingress Deployment

Ingress

  4. ConfigMap Deployment

ConfigMaps
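Once the chart is installed, each of the four artifact types above can be inspected with `kubectl`. A minimal sketch, assuming the release is deployed into a namespace named `crawlerx` (the namespace name here is an assumption, not taken from the repository):

```shell
# List the artifacts created by the chart, per resource type.
# The namespace "crawlerx" is an assumption; substitute your own.
kubectl get pods -n crawlerx
kubectl get services -n crawlerx
kubectl get ingress -n crawlerx
kubectl get configmaps -n crawlerx
```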

Work Summary

This project mainly focuses on deploying the CrawlerX web platform on Kubernetes. As per the details provided by the SCoRe organization mentors, CrawlerX needs to be deployed as an on-demand platform, and after investigation Helm was chosen to implement this requirement. Helm manages Kubernetes applications as charts, which are easy to create, version, share, and publish (or unpublish). Users can now deploy CrawlerX on a K8s environment with a single command:

```shell
helm install <RELEASE_NAME> <HELM_HOME> --namespace <NAMESPACE> --dependency-update --create-namespace
```

Helm Deployment
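For illustration, the command with the placeholders filled in might look as follows; the release name, chart path, and namespace below are hypothetical, not values from the repository:

```shell
# Install a "crawlerx" release from a local chart directory into its own
# namespace; all names here are illustrative assumptions.
helm install crawlerx ./crawlerx \
  --namespace crawlerx \
  --dependency-update \
  --create-namespace

# Check the status of the release afterwards:
helm status crawlerx --namespace crawlerx
```

The `--dependency-update` flag refreshes the chart's declared dependencies (here MongoDB, RabbitMQ, and Elasticsearch) before installing, and `--create-namespace` creates the target namespace if it does not already exist.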

What Was Covered

  • Add Helm chart for CrawlerX platform
  • Add K8s artifacts for VueJS based frontend server deployment
  • Add K8s artifacts for Django backend server deployment
  • Add K8s artifacts for Celery beat deployment
  • Add K8s artifacts for Celery worker deployment
  • Add K8s artifacts for Scrapy crawler deployment
  • Add MongoDB, RabbitMQ and Elasticsearch deployments as chart dependencies
  • Add K8s secret artifacts to pull private images for the pods
  • Add ConfigMaps for each deployment
  • Configure values.yaml to customize deployment parameters
  • Documentation of the K8s deployment
  • Testing on a local Minikube cluster
  • Testing on Google Kubernetes Engine
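Since `values.yaml` exposes the deployment parameters, individual settings can be overridden at install time. A sketch of the two usual approaches; the `frontend.replicaCount` key and the file name `my-values.yaml` are assumptions for illustration, not keys from the actual chart:

```shell
# Override a single value inline (the key name is an assumption):
helm install crawlerx ./crawlerx --namespace crawlerx --create-namespace \
  --set frontend.replicaCount=2

# Or supply a whole override file layered on top of the chart defaults:
helm install crawlerx ./crawlerx --namespace crawlerx --create-namespace \
  -f my-values.yaml
```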

What's Left

  • Integrate Grafana dashboard

References

  1. Helm
  2. Google Kubernetes Engine