Douban AI Spider - Intelligent Movie Data Crawler

An intelligent web scraping project that combines traditional crawling techniques with Large Language Model (LLM) AI capabilities to extract and analyze movie data from Douban Top 250.

Project Overview

This project demonstrates how to leverage AI to solve the pain points of traditional web scraping, particularly in parsing unstructured text data into structured formats.

Key Features

Automated Crawling: Automatically fetches 250 top-rated movies from Douban
AI-Powered Parsing: Uses LLM (DeepSeek/OpenAI) to intelligently parse unstructured movie information
Data Persistence: Stores clean, structured data in SQLite database
Data Visualization: Generates statistical charts and analysis

Traditional vs AI Approach

Aspect	Traditional (Regex)	AI-Powered
Complexity	High - complex regex patterns	Low - natural language prompts
Maintainability	Fragile - breaks with HTML changes	Robust - adapts to format changes
Accuracy	Moderate	High - understands context

Tech Stack

Language: Python 3.10+
HTTP Client: httpx (async support)
HTML Parser: BeautifulSoup4
AI Engine: DeepSeek API / OpenAI API
Database: SQLite3
Data Analysis: Pandas + Matplotlib/PyEcharts

Project Structure

Douban_AI_Spider/
├── core/
│   ├── __init__.py
│   ├── spider.py          # Web scraping logic
│   ├── ai_engine.py       # AI parsing engine
│   └── database.py        # Database operations
├── data/
│   └── douban.db          # SQLite database file
├── utils/
│   ├── config.py          # Configuration management
│   └── logger.py          # Logging module
├── analysis/
│   └── charts.py          # Data visualization
├── main.py                # Project entry point
├── requirements.txt       # Python dependencies
└── README.md              # This file

Installation

Clone the repository:

git clone <repository-url>
cd python-ai-spider

Create virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Configure API keys:

Edit utils/config.py and add your API keys:

DEEPSEEK_API_KEY = "your-deepseek-api-key"  # or OpenAI API key

Get your API key from:

DeepSeek: https://platform.deepseek.com/
OpenAI: https://platform.openai.com/

Usage

Run the main script to start crawling and analysis:

python main.py

The script will:

Fetch movie data from Douban Top 250 (10 pages, 25 movies per page)
Use AI to parse unstructured information into structured data
Store the data in SQLite database
Generate visualization charts

Modules Usage

You can also run individual modules:

# Run spider only
from core.spider import DoubanSpider
spider = DoubanSpider()
movies = spider.fetch_all_pages()

# Run AI parsing only
from core.ai_engine import AIEngine
ai = AIEngine()
parsed_data = ai.parse_movie_info(raw_text)

# Query database
from core.database import Database
db = Database()
movies = db.get_all_movies()

# Generate charts
from analysis.charts import ChartGenerator
charts = ChartGenerator()
charts.generate_all_charts()

Database Schema

Table: `movies`

Column	Type	Description
id	INTEGER	Primary key
rank	INTEGER	Movie ranking (1-250)
title	TEXT	Movie title
director	TEXT	Director name(s)
actors	TEXT	Actors list (JSON format)
year	INTEGER	Release year
country	TEXT	Production country
genres	TEXT	Movie genres (JSON format)
rating	REAL	Douban rating (0-10)
vote_count	INTEGER	Number of ratings
quote	TEXT	Famous quote from the movie
ai_summary	TEXT	AI-generated summary
created_at	TIMESTAMP	Record creation time

Anti-Scraping Measures

To respect Douban's servers and avoid being blocked:

Rate Limiting: Random delay 1-3 seconds between requests
User-Agent Rotation: Random user agents to simulate different browsers
Concurrency: Single-threaded requests (no parallel scraping)
Exception Handling: Comprehensive error handling with logging

Legal & Ethical Notes

Educational Purpose Only: This project is for learning and research purposes only
Respect robots.txt: Follows website scraping guidelines
Rate Limiting: Implements delays to avoid server overload
Data Usage: Collected data should not be used for commercial purposes

Configuration

Edit utils/config.py to customize:

# API Configuration
DEEPSEEK_API_KEY = "your-api-key"
DEEPSEEK_BASE_URL = "https://api.deepseek.com"

# Scraping Configuration
BASE_URL = "https://movie.douban.com/top250"
TOTAL_PAGES = 10
MOVIES_PER_PAGE = 25
DELAY_MIN = 1  # Minimum delay in seconds
DELAY_MAX = 3  # Maximum delay in seconds

# Database Configuration
DB_PATH = "data/douban.db"

# Logging Configuration
LOG_LEVEL = "INFO"
LOG_FILE = "logs/spider.log"

Troubleshooting

Common Issues

403 Forbidden Error:
- Increase delay time in config
- Check if your IP is blocked
- Try using different User-Agents
API Rate Limit:
- Implement caching to avoid duplicate API calls
- Check your API quota
- Reduce batch size
Database Locked:
- Close other database connections
- Check file permissions

Roadmap

Add proxy support
Implement incremental updates (only fetch new data)
Add more visualization options
Support for other movie databases (IMDb, Rotten Tomatoes)
Export data to CSV/Excel
Web dashboard for viewing results

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is for educational purposes only. Please respect the terms of service of the websites being scraped.

Blog Series

This project accompanies a blog series covering:

Environment Setup: Setting up development environment and analyzing HTML structure
AI Integration: Using LLM APIs for intelligent data extraction
Database Design: SQLite schema design and persistence logic
Data Analysis: Statistical analysis and visualization techniques

Acknowledgments

Douban for providing the movie data
DeepSeek for AI API services
The open-source community for the amazing tools and libraries

Note: Always comply with website terms of service and robots.txt when scraping data. This project is intended for educational purposes to demonstrate AI-powered web scraping techniques.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Douban AI Spider - Intelligent Movie Data Crawler

Project Overview

Key Features

Traditional vs AI Approach

Tech Stack

Project Structure

Installation

Usage

Modules Usage

Database Schema

Table: `movies`

Anti-Scraping Measures

Legal & Ethical Notes

Configuration

Troubleshooting

Common Issues

Roadmap

Contributing

License

Blog Series

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
analysis		analysis
core		core
utils		utils
.env.example		.env.example
.gitignore		.gitignore
QUICKSTART.md		QUICKSTART.md
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

stars1324/python-ai-spider

Folders and files

Latest commit

History

Repository files navigation

Douban AI Spider - Intelligent Movie Data Crawler

Project Overview

Key Features

Traditional vs AI Approach

Tech Stack

Project Structure

Installation

Usage

Modules Usage

Database Schema

Table: movies

Anti-Scraping Measures

Legal & Ethical Notes

Configuration

Troubleshooting

Common Issues

Roadmap

Contributing

License

Blog Series

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Table: `movies`

Packages