Distributed-crawler-system-based-on-Scrapy-framework-and-Redis-database-cluster

We design a distributed crawler system based on the Scrapy framework and a Redis database cluster. Scrapy provides efficient data fetching and parsing, while the Redis cluster handles task distribution and deduplication and reduces the storage space taken by string data.
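The deduplication idea can be sketched as follows. This is a minimal illustration, not the project's code: it assumes each request URL is reduced to a fixed-length SHA-1 fingerprint, and it uses an in-memory Python set as a stand-in for a shared Redis set. The names fingerprint and SeenFilter and the example.com URL are hypothetical.

```python
import hashlib


def fingerprint(url: str) -> str:
    """Return the 40-character hex SHA-1 digest of a URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class SeenFilter:
    """In-memory stand-in for a shared Redis set of request fingerprints."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True only the first time it is seen."""
        fp = fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenFilter()
urls = [
    "https://weibo.com/",
    "https://example.com/comments?id=1&page=2",  # hypothetical URL
    "https://weibo.com/",  # duplicate, filtered out below
]
to_crawl = [u for u in urls if seen.add_if_new(u)]
print(to_crawl)
```

Because every fingerprint has the same length regardless of how long the URL is, storing fingerprints instead of raw URLs is one plausible way to keep the stored string data small.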

Development Environment

Operating system: Ubuntu 24.04.1

Python version: Python 3.11.4

Redis version: Redis-x64-5.0.14.1

Test Run

Download Package

Clone the repository with Git and install the dependencies.

git clone https://github.com/SYSU-Zhangyp/Weibo-Comment-Manager-Scrapy-Redis.git
cd weibospider
pip install -r requirements.txt

Replace Cookie

Open the Weibo desktop site (https://weibo.com/) and log in to your account. Then open your browser's developer tools, refresh the page, and copy the Cookie value from the request headers.

Replace the cookie in the settings file (./weibo_spider/weibo_spider/settings.py) with your own.
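One way to supply the copied cookie is sketched below. The layout of the project's settings.py is not shown in this README, so the variable name RAW_COOKIE and the helper parse_cookie_header are assumptions, and the cookie value is a placeholder. COOKIES_ENABLED and DEFAULT_REQUEST_HEADERS are standard Scrapy settings.

```python
# Placeholder: paste the Cookie request header copied from the browser.
RAW_COOKIE = "name1=value1; name2=value2"

# Standard Scrapy settings: with the cookie middleware disabled, the
# Cookie header below is sent unchanged with every request.
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {"Cookie": RAW_COOKIE}


def parse_cookie_header(raw: str) -> dict[str, str]:
    """Split a 'name=value; name2=value2' header string into a dict."""
    cookies: dict[str, str] = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


print(parse_cookie_header(RAW_COOKIE))
```

The dict form is only needed if the cookie is passed per request instead of through the header, for example through the cookies argument of scrapy.Request, which requires the cookie middleware to stay enabled.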

Add proxy IP

Rewrite the fetch_dexy method in ./weibo_spider/weibo_spider/middlewares.py so that it returns a proxy IP address.
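A minimal sketch of such a method is shown below. Only the method name fetch_dexy comes from this README; the class name ProxyMiddleware, the static PROXY_POOL list, and the way the method is called from process_request are assumptions, and the project's actual middleware may differ. Setting request.meta["proxy"] is the standard way to route a Scrapy request through a proxy.

```python
import random

# Hypothetical proxy pool; replace with real proxy endpoints.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class ProxyMiddleware:
    """Sketch of a downloader middleware that assigns a proxy per request."""

    def fetch_dexy(self) -> str:
        """Return one proxy URL, chosen at random from the static pool."""
        return random.choice(PROXY_POOL)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"].
        request.meta["proxy"] = self.fetch_dexy()
        return None  # let Scrapy continue processing the request
```

In practice fetch_dexy would usually query a proxy provider or a rotating pool instead of a fixed list.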

Deploying Redis Cluster

A Redis cluster must be deployed before running the crawler. For details, see the Redis Cluster Deployment Tutorial (https://blog.csdn.net/Yel_Liang/article/details/132093594). The Redis cluster folder in this repository can also serve as a reference for the deployment.
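For orientation on how a cluster spreads keys across its nodes: Redis Cluster maps every key to one of 16384 hash slots using CRC16 of the key modulo 16384, and hashes only the part inside {...} when the key contains a non-empty hash tag. The sketch below reproduces that rule in plain Python. The helper name key_slot and the example key names are hypothetical and are not taken from the project.

```python
import binascii


def key_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot (0..16383) for a key."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Use the hash tag only if there is at least one byte inside it.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant
    # that Redis Cluster uses.
    return binascii.crc_hqx(data, 0) % 16384


print(key_slot("123456789"))  # 12739
# Keys sharing a hash tag land in the same slot, hence on the same node.
print(key_slot("{crawl}:requests") == key_slot("{crawl}:dupefilter"))  # True
```

Hash tags matter when several keys must be used together in one multi-key command, since such commands only work when all keys fall in the same slot.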

Run crawler

Start the Redis cluster, then run the following commands in a terminal.

cd weibo_spider
scrapy crawl weibo_comment

Wait for the request queue to be generated. The crawler then starts and prints its results to the terminal.

Visualize Redis database

Download and install Redis Desktop Manager to browse the data stored in Redis.

Comment Query Platform

Return to the parent directory and run the Streamlit app, which provides the interactive comment query interface.

cd ..
cd streamlit
streamlit run visualization.py

Shen-Ruyu/Distributed-crawler-system-based-on-Scrapy-framework-and-Redis-database-cluster
