
bearomorphism/scraper-591

🏡 591 Rental Listings Scraper with OCR Analysis

ChatGPT discussion thread

This project automates scraping rental listings from 591租屋網 using Playwright, takes a screenshot, and extracts listing text using Tesseract OCR.

⚠️ Note: This site serves obfuscated or dynamically generated HTML, which makes it difficult to scrape with static requests, so I use Playwright to load the page in a headless browser and capture the content. Browser developer tools cannot be used to inspect the page directly: opening them triggers client-side scripts that redirect back to the previous view. An automated tool like Playwright is therefore needed to get around these restrictions and extract the data.

⚠️ Warning: This project is a work in progress and the code is not yet optimized.

🤝 Contributing & Feature Requests

Contributions are welcome! Whether you are fixing a bug, improving documentation, or adding new functionality, feel free to open a pull request.

Got ideas for new features or improvements? Please open an issue and share your thoughts! Even simple suggestions or feedback are greatly appreciated.

Let's build something useful together. 💡

Prerequisites for Contributing

We use conventional commits for the changelog.

Please use commitizen to commit your changes.

cz commit

🔧 Tools Used

  • Playwright – for automating browser and taking a screenshot.
  • Tesseract OCR – to extract text from screenshots.
  • pytesseract – Python wrapper for Tesseract.
  • Pillow – for image handling in Python.

📸 Workflow

  1. Navigate to Listing Page using Playwright.
  2. Take Full Page Screenshot of filtered results.
  3. Run OCR on the screenshot to extract listing text.
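The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of main.py: the function names, the `chi_tra+eng` language setting, the `networkidle` wait strategy, and the output filename are all assumptions, and the listing URL is passed in by the caller rather than hard-coded.

```python
from pathlib import Path


def capture_listing_page(url: str, out_path: Path) -> Path:
    """Steps 1 and 2: open `url` in headless Chromium and save a full-page screenshot."""
    # Imported lazily so the pure helper below works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Waiting for network idle is an assumption; the site may need a different wait.
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=str(out_path), full_page=True)
        browser.close()
    return out_path


def ocr_screenshot(image_path: Path, lang: str = "chi_tra+eng") -> str:
    """Step 3: run Tesseract on the screenshot and return the raw text."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def clean_ocr_lines(raw_text: str) -> list[str]:
    """Hypothetical post-processing: strip whitespace and drop empty lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


if __name__ == "__main__":
    import sys

    # Usage: python sketch.py <listing-page-url>
    screenshot = capture_listing_page(sys.argv[1], Path("listings.png"))
    for line in clean_ocr_lines(ocr_screenshot(screenshot)):
        print(line)
```

OCR output on a full-page screenshot is usually noisy, so some cleanup step like `clean_ocr_lines` is likely needed before the text can be parsed into individual listings.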

Why not use Selenium/BeautifulSoup/...?

I tried them, but the site blocks them.

Why not just use the data from page.content()?

You can try printing it. The useful data appears to be encoded in the HTML and cannot be used directly.

💻 Prerequisites

Install uv

We use uv to install the dependencies.

brew install uv

Install Playwright (runtime only)

playwright install

Install Tesseract OCR

macOS (via Homebrew)

brew install tesseract
brew install tesseract-lang  # To install Chinese languages

Windows

Run

uv run main.py

🛠 Troubleshooting

  • TesseractNotFoundError: Ensure Tesseract is installed. If the tesseract binary is not on your PATH, set pytesseract.pytesseract.tesseract_cmd to its full path.
  • Language file error: Ensure chi_tra.traineddata exists in the correct tessdata directory.
    • macOS default: /opt/homebrew/share/tessdata/
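Both checks above can be run from Python before starting the scraper. The sketch below is illustrative only: the helper names are hypothetical, and the binary path `/opt/homebrew/bin/tesseract` is an assumption about a typical Homebrew install on Apple Silicon, so adjust it to your system.

```python
from pathlib import Path


def has_language_data(tessdata_dir: str, lang: str = "chi_tra") -> bool:
    """Return True if `<lang>.traineddata` exists in the given tessdata directory."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def configure_tesseract(binary_path: str) -> None:
    """Point pytesseract at an explicit tesseract binary."""
    # Imported lazily so has_language_data() works without pytesseract installed.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = binary_path


if __name__ == "__main__":
    # Assumed Homebrew locations; change these if your install differs.
    configure_tesseract("/opt/homebrew/bin/tesseract")
    if not has_language_data("/opt/homebrew/share/tessdata/"):
        print("chi_tra.traineddata not found; try `brew install tesseract-lang`.")
```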

📄 License

MIT License
