🏡 591 Rental Listings Scraper with OCR Analysis

This project automates scraping rental listings from 591租屋網: it loads the listing page with Playwright, takes a full-page screenshot, and extracts the listing text with Tesseract OCR.

⚠️ Note: The site serves obfuscated or dynamically generated HTML, which makes it difficult to scrape with static requests, so this project uses Playwright to load the page in a headless browser and capture the rendered content. Browser developer tools cannot be used to inspect the page either: opening them triggers client-side scripts that redirect back to the previous view. An automated browser such as Playwright is therefore needed to get at the data.

⚠️ Warning: This project is a work in progress and the code is not yet optimized.

🤝 Contributing & Feature Requests

Contributions are welcome! Whether you're fixing a bug, improving documentation, or adding new functionality, feel free to open a pull request.

Got ideas for new features or improvements? Please open an issue and share your thoughts! Even simple suggestions or feedback are greatly appreciated.

Let's build something useful together. 💡

Prerequisites for Contributing

We use conventional commits for the changelog.

Please use commitizen to commit your changes.

cz commit

🔧 Tools Used

Playwright – for automating the browser and taking screenshots.
Tesseract OCR – to extract text from screenshots.
pytesseract – Python wrapper for Tesseract.
Pillow – for image handling in Python.

📸 Workflow

Navigate to Listing Page using Playwright.
Take Full Page Screenshot of filtered results.
Run OCR on the screenshot to extract listing text.
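
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual main.py: the function names (`capture_screenshot`, `ocr_image`, `clean_ocr_lines`, `run`), the output file name, the `networkidle` wait strategy, and the `chi_tra+eng` language setting are all assumptions made for the example.

```python
from pathlib import Path


def capture_screenshot(url: str, out_path: Path) -> Path:
    """Open `url` in headless Chromium and save a full-page screenshot."""
    # Imported lazily so the pure helper below works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Assumption: waiting for network idle is enough for listings to render.
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=str(out_path), full_page=True)
        browser.close()
    return out_path


def ocr_image(image_path: Path, lang: str = "chi_tra+eng") -> str:
    """Run Tesseract on an image and return the raw recognized text."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def clean_ocr_lines(raw: str) -> list[str]:
    """Strip surrounding whitespace and drop empty lines from raw OCR output."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def run(url: str, screenshot_path: Path = Path("listings.png")) -> list[str]:
    """Screenshot the page at `url`, OCR it, and return the cleaned text lines."""
    shot = capture_screenshot(url, screenshot_path)
    return clean_ocr_lines(ocr_image(shot))
```

Calling `run(url)` with the URL of a filtered listing page would return the recognized text line by line. OCR output on a full-page screenshot is usually noisy, so some post-processing beyond `clean_ocr_lines` is likely needed in practice.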

Why not use Selenium/BeautifulSoup/...?

I've tried them. The site blocks them.

Why not just use the data from `page.content()`?

Try printing it and see. The useful data appears to be encoded in the HTML and cannot be used directly.

💻 Prerequisites

Install uv

We use uv to install the dependencies.

brew install uv

Install Playwright (runtime only)

playwright install

Install Tesseract OCR

macOS (via Homebrew)

brew install tesseract
brew install tesseract-lang  # Adds extra language data, including Chinese

Windows

Download from: https://github.com/tesseract-ocr/tesseract
Add the Tesseract-OCR install directory to the PATH environment variable

Run

uv run main.py

🛠 Troubleshooting

TesseractNotFoundError: Ensure Tesseract is installed. If the executable is not on your PATH, set pytesseract.pytesseract.tesseract_cmd to its full path.
Language file error: Ensure chi_tra.traineddata exists in the correct tessdata directory.
- macOS default (Homebrew on Apple Silicon): /opt/homebrew/share/tessdata/
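
Both checks can be scripted before running the scraper. The sketch below is illustrative only: the helper names (`find_tesseract`, `has_language`, `configure_pytesseract`) are hypothetical and not part of this project. The only pytesseract feature it relies on is the `pytesseract.pytesseract.tesseract_cmd` setting.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path to the tesseract binary if it is on PATH, else None."""
    return shutil.which("tesseract")


def has_language(tessdata_dir: Path, lang: str = "chi_tra") -> bool:
    """Check whether `<lang>.traineddata` exists in the given tessdata directory."""
    return (tessdata_dir / f"{lang}.traineddata").is_file()


def configure_pytesseract(tesseract_path: str) -> None:
    """Point pytesseract at an explicit binary, for when it is not on PATH."""
    import pytesseract  # lazy import: only needed when actually configuring

    pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

For example, `has_language(Path("/opt/homebrew/share/tessdata"))` reports whether the Traditional Chinese data is present in the macOS default location listed above. If `find_tesseract()` returns None, pass the full path of the installed executable to `configure_pytesseract`.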

📄 License

MIT License
