This project automates scraping rental listings from 591租屋網 using Playwright, takes a screenshot, and extracts listing text using Tesseract OCR.
⚠️ Note: This site serves obfuscated, dynamically generated HTML, which makes it difficult to scrape with static requests, so this project uses Playwright to load the page in a headless browser and capture the content. Browser developer tools cannot be used to inspect the page directly: opening them triggers client-side scripts that redirect back to the previous view. An automated browser such as Playwright is therefore needed to get past these restrictions and extract the data.
⚠️ Warning: This project is a work in progress and the code is not yet optimized.
Contributions are welcome! Whether it's fixing a bug, improving documentation, or adding new functionality — feel free to open a pull request.
Got ideas for new features or improvements? Please open an issue and share your thoughts! Even simple suggestions or feedback are greatly appreciated.
Let's build something useful together. 💡
We use conventional commits for the changelog.
Please use commitizen to commit your changes.
cz commit
- Playwright – for automating browser and taking a screenshot.
- Tesseract OCR – to extract text from screenshots.
- pytesseract – Python wrapper for Tesseract.
- Pillow – for image handling in Python.
- Navigate to Listing Page using Playwright.
- Take Full Page Screenshot of filtered results.
- Run OCR on the screenshot to extract listing text.
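The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `main.py`: the function names (`capture_full_page`, `ocr_image`, `non_empty_lines`, `run`), the output filename, and the `networkidle` wait strategy are all assumptions, and the listing URL is left as a parameter because the exact filtered URL is not given here.

```python
from pathlib import Path


def capture_full_page(url: str, out_path: Path) -> Path:
    """Load `url` in headless Chromium and save a full-page screenshot."""
    # Imported lazily so the text helper below works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Assumption: waiting for network idle is enough for listings to render.
            page.goto(url, wait_until="networkidle")
            page.screenshot(path=str(out_path), full_page=True)
        finally:
            browser.close()
    return out_path


def ocr_image(image_path: Path, lang: str = "chi_tra") -> str:
    """Run Tesseract on the screenshot and return the raw recognized text."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def non_empty_lines(text: str) -> list[str]:
    """Strip whitespace and drop blank lines from OCR output."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def run(url: str, out_path: Path = Path("listings.png")) -> list[str]:
    """Navigate, screenshot, then OCR; returns the cleaned text lines."""
    screenshot = capture_full_page(url, out_path)
    return non_empty_lines(ocr_image(screenshot))
```

Calling `run(...)` with the filtered results URL would return the recognized text one line per entry. Splitting that text into individual listings is left out, since it depends on how the page is laid out.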
I've tried... The site blocks it.
You can try printing it. All the useful data appears to be encoded in the HTML content and cannot be used directly.
We use uv to install the dependencies.
brew install uv
playwright install
brew install tesseract
brew install tesseract-lang # To install Chinese languages
- Download from: https://github.com/tesseract-ocr/tesseract
- Add the `Tesseract-OCR` path to your environment variables
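After installing, it can help to confirm that Tesseract is on the PATH and that the Traditional Chinese data is visible to it. The sketch below is one possible check, assuming the `tesseract --list-langs` output consists of a header line followed by one language code per line. The helper names `parse_langs` and `installed_langs` are made up for illustration.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g.: List of available languages in "..." (3):
    return [ln for ln in lines if not ln.startswith("List of available languages")]


def installed_langs() -> list[str]:
    """Return the languages Tesseract reports, or [] if it is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some older builds print the list to stderr instead of stdout.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    print("chi_tra installed:", "chi_tra" in installed_langs())
```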
uv run main.py
- TesseractNotFoundError: Ensure Tesseract is installed and `tesseract_cmd` is set if needed.
- Language file error: Ensure `chi_tra.traineddata` exists in the correct `tessdata` directory.
  - macOS default: `/opt/homebrew/share/tessdata/`
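Both checks above can be done from Python. The following is a rough sketch rather than code from this project: `has_language`, `resolve_tesseract`, and `configure_pytesseract` are hypothetical helper names, and the only library-specific detail used is the `pytesseract.pytesseract.tesseract_cmd` attribute, which tells pytesseract which binary to call.

```python
import shutil
from pathlib import Path
from typing import Optional

# macOS default from the note above; other platforms will use a different path.
MACOS_TESSDATA = Path("/opt/homebrew/share/tessdata/")


def has_language(tessdata_dir: Path, lang: str = "chi_tra") -> bool:
    """True if `<lang>.traineddata` exists in the given tessdata directory."""
    return (tessdata_dir / f"{lang}.traineddata").is_file()


def resolve_tesseract(explicit_path: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit binary path; otherwise fall back to a PATH lookup."""
    if explicit_path and Path(explicit_path).is_file():
        return explicit_path
    return shutil.which("tesseract")


def configure_pytesseract(explicit_path: Optional[str] = None) -> str:
    """Point pytesseract at the resolved binary, or raise if none is found."""
    exe = resolve_tesseract(explicit_path)
    if exe is None:
        raise FileNotFoundError("tesseract not found; install it or pass its path")
    import pytesseract  # imported lazily so the checks above work without it

    pytesseract.pytesseract.tesseract_cmd = exe
    return exe
```

For example, `has_language(MACOS_TESSDATA)` returning `False` on a Homebrew setup would suggest the language pack is missing or installed elsewhere.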
MIT License