Data Scraping Project for Zhihu Dataset


A simple project to scrape data from Zhihu.


About The Project

This project provides a way to scrape data from the Zhihu site. We use it to build a dataset for the Open Assistant LLM project (https://open-assistant.io/).

Use scrape_process.py to get started.

(back to top)

Built With

The project is written in Python. We primarily use Playwright for headless browsing, Ray for parallel processing, and DuckDB for persistence.

(back to top)

Getting Started

Installation

Install all dependencies:

  • Install the Python dependencies with pip

      pip install -r requirements.txt

  • Install the Playwright browsers

      playwright install

  • If needed, run the following command to install the required system libraries (Debian/Ubuntu)

      sudo apt install libatk1.0-0 libatk-bridge2.0-0 libcups2 libatspi2.0-0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libxkbcommon0 libpango-1.0-0 libcairo2 libasound2

Useful Concepts

To understand how we scrape data from Zhihu, it is useful to define the hierarchy of question-answer categorization. We currently work with four levels of categorization: base topics, common topics, questions, and answers. Each level contains multiple instances of the next level down, i.e. one common topic might contain hundreds of unique questions, and one question can contain hundreds of unique answers. An example of this categorization is shown below.

(figure: zhihu_answer_categorization)
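As a rough illustration, the hierarchy can be modeled as nested records. This is a minimal sketch; the class and field names are our own and are not taken from the codebase:

    from dataclasses import dataclass, field

    @dataclass
    class Answer:
        answer_id: str
        content: str = ""

    @dataclass
    class Question:
        question_id: str
        title: str
        answers: list[Answer] = field(default_factory=list)

    @dataclass
    class CommonTopic:
        topic_id: str
        name: str
        questions: list[Question] = field(default_factory=list)

    @dataclass
    class BaseTopic:
        name: str
        common_topics: list[CommonTopic] = field(default_factory=list)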

Currently, the scraping pipeline consists of three independent processes, one for each level of categorization below the base topics. Each process is independent of the others and only requires input from local files to start working. We provide a list of base topics to initialise the pipeline.

  • The first process, scrape_common_topic, scrapes the common topics for all base topics. All common topics are saved periodically to the local file system.
  • The second process is the heaviest in the system, as it uses a headless browser to scrape answer URLs. It also uses Ray for parallel processing to speed up the entire chain of processes. The question-answer URLs are saved periodically to the local file system. A sketch of this stage is shown after this list.
  • The last process downloads the actual answers via API requests. Shadowsocks (SOCKS5) proxies are used to bypass API rate limits. The extracted answers are persisted into a database; see the sketch after the architecture figure below.
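As an illustration of the second stage, here is a minimal sketch of collecting answer URLs with a headless Playwright browser fanned out over Ray workers. The function name, the CSS selector, and the example question URL are assumptions for illustration, not the project's actual code:

    import ray
    from playwright.sync_api import sync_playwright

    ray.init()

    @ray.remote
    def scrape_answer_urls(question_url: str) -> list[str]:
        """Open a question page in a headless browser and collect answer URLs."""
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(question_url)
            # Hypothetical selector: the real one depends on Zhihu's current markup.
            urls = page.eval_on_selector_all(
                "a[href*='/answer/']", "els => els.map(e => e.href)"
            )
            browser.close()
        return urls

    # One Ray task per question; tasks run in parallel across workers.
    question_urls = ["https://www.zhihu.com/question/12345678"]  # example input
    futures = [scrape_answer_urls.remote(url) for url in question_urls]
    answer_urls = ray.get(futures)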

(figure: process_architecture)
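For the final stage, here is a minimal sketch of fetching an answer through a SOCKS5 proxy and persisting it with DuckDB. It assumes requests[socks] is installed; the proxy address, the answer URL, and the table schema are illustrative assumptions, not the project's actual code:

    import duckdb
    import requests

    # Hypothetical local Shadowsocks endpoint exposing a SOCKS5 proxy.
    PROXIES = {
        "http": "socks5://127.0.0.1:1080",
        "https": "socks5://127.0.0.1:1080",
    }

    def fetch_answer(answer_url: str) -> str:
        # Route the request through the proxy to sidestep per-IP rate limits.
        resp = requests.get(answer_url, proxies=PROXIES, timeout=30)
        resp.raise_for_status()
        return resp.text

    con = duckdb.connect("answers.duckdb")
    con.execute("CREATE TABLE IF NOT EXISTS answers (url VARCHAR, content VARCHAR)")

    url = "https://www.zhihu.com/question/12345678/answer/87654321"  # example
    con.execute("INSERT INTO answers VALUES (?, ?)", [url, fetch_answer(url)])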

Roadmap

  • Decouple all scraping processes
  • Add Ray for parallel processing
  • Add Shadowsocks for API rate-limit bypass
  • Add DuckDB for persistence
  • Scrape comment sections
  • Scale up scraping processes

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)
