Crawreilly is a Python-based web scraper built using Scrapy that allows you to download all books and other resources from O'Reilly, pre-process the scraped HTML, CSS, and image resources, and save them locally as PDF files. To use this tool, you will need a valid subscription to O'Reilly.
To run this scraper, you'll need the following:

- Python >= 3.9
- Scrapy
- WeasyPrint
- A MongoDB instance or cluster, along with its connection URI so the app can communicate with the right database
- Clone the repository: `git clone https://github.com/rifatrakib/crawreilly.git`
- Open a terminal window and navigate to the repository directory: `cd crawreilly`
- Create a virtual environment: `virtualenv venv`
- Install poetry: `pip install poetry`
- Install dependencies: `poetry install`
First, you need to authenticate with your O'Reilly credentials to access the resources. The `auth` spider does this automatically. Before running the spider, you need to create a file called `auth.sh` in the `keys/raw` directory of the repository and save the cURL command for logging in:

- In a web browser, log in to O'Reilly using your credentials.
- Use a web developer tool, such as the network tab, to capture the cURL command for the login request.
- Save the cURL command as an `auth.sh` file in the `keys/raw` directory.
- Run the scraper using the command `scrapy crawl auth`. This will log you in to O'Reilly and save the necessary session cookies for future requests.
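A browser-exported cURL command bundles the login URL, headers, and cookies into one shell string. As a rough illustration of how such a command can be turned into data a Scrapy request could reuse, here is a minimal, hypothetical parser (the actual `auth` spider's parsing logic may differ; `parse_curl` is not part of the project):

```python
import shlex


def parse_curl(command: str) -> dict:
    """Split a browser-exported cURL command into its URL and headers.

    Hypothetical sketch: handles only the common `-H`/`--header` flags
    that browser "Copy as cURL" output uses for headers and cookies.
    """
    tokens = shlex.split(command.replace("\\\n", " "))
    url, headers = None, {}
    it = iter(tokens)
    for token in it:
        if token == "curl":
            continue
        if token in ("-H", "--header"):
            # Header arguments look like "Name: value".
            name, _, value = next(it).partition(":")
            headers[name.strip()] = value.strip()
        elif not token.startswith("-") and url is None:
            url = token
    return {"url": url, "headers": headers}
```

For example, `parse_curl("curl 'https://example.com/login' -H 'Cookie: sid=abc'")` yields the URL plus a headers dict containing the session cookie, which is the kind of material the spider needs to replay the login request.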
The `catalogue` spider collects information about all books and other resources from the O'Reilly catalogue and stores them locally as CSV, JSON, and JSONLines files in the `data/csv`, `data/json`, and `data/jsonline` directories respectively, and stores the JSON-formatted records in a MongoDB collection called `catalogue`, after the spider name:

- In a web browser, log in to O'Reilly using your credentials.
- Use a web developer tool, such as the network tab, to capture the cURL command for the request that fetches paginated resource information, and save it in `catalogue.sh` under the `keys/raw` directory.
- In the terminal, navigate to the project directory.
- Run the scraper using the command `scrapy crawl catalogue -o catalogue.json`. This will collect information about all books and other resources from the O'Reilly catalogue, save it locally in the directories listed above, and store the JSON-formatted records in the `catalogue` MongoDB collection.
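To make the three output formats concrete, here is a minimal sketch of writing the same records to the `data/csv`, `data/json`, and `data/jsonline` layout. This is an illustration only; `export_records` is a hypothetical helper, not the project's actual pipeline code:

```python
import csv
import json
from pathlib import Path


def export_records(records: list[dict], base_dir: str) -> None:
    """Write catalogue records as CSV, JSON, and JSONLines files,
    mirroring the data/csv, data/json, and data/jsonline layout.
    Hypothetical sketch; the real Scrapy pipeline may differ."""
    base = Path(base_dir)
    for sub in ("csv", "json", "jsonline"):
        (base / sub).mkdir(parents=True, exist_ok=True)

    # CSV: one row per record, columns from the union of all keys.
    fields = sorted({key for record in records for key in record})
    with open(base / "csv" / "catalogue.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)

    # JSON: one array holding every record.
    (base / "json" / "catalogue.json").write_text(json.dumps(records, indent=2))

    # JSONLines: one JSON object per line, convenient for streaming.
    with open(base / "jsonline" / "catalogue.jsonl", "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

The JSONLines form is the natural fit for inserting records into the `catalogue` MongoDB collection one document at a time.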
The `book` spider downloads and pre-processes the scraped HTML, CSS, and image resources, and saves them locally in directories based on their category and book title:

- In a web browser, log in to O'Reilly using your credentials.
- Use a web developer tool, such as the network tab, to capture the cURL commands for the request that fetches book information and the request that downloads an image, and save them in `book.sh` and `image.sh` respectively under the `keys/raw` directory.
- In the terminal, navigate to the project directory.
- Run the scraper using the command `scrapy crawl book`. This will download and pre-process the HTML, CSS, and image resources for each book in your O'Reilly subscription and save them locally based on their category and book title. The pre-processing includes, but is not limited to, fixing links so that the final PDF is a more readable and complete representation of the book.
- Information about each individual book in JSON format, which is also the source of the URLs for the corresponding HTML, CSS, and image files, will be stored in a MongoDB collection called `book`, along with some metadata.
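The "fixing links" part of pre-processing can be pictured as rewriting remote URLs in the scraped HTML to point at the locally saved copies, so the pages render offline. The sketch below is a simplified, hypothetical version of that idea (`localize_links` and the mapping it takes are not the project's actual code, which may use a proper HTML parser instead of a regex):

```python
import re


def localize_links(html: str, url_to_local: dict[str, str]) -> str:
    """Rewrite remote src/href URLs to local file paths so saved pages
    render offline. Hypothetical sketch of the kind of link fixing the
    book spider's pre-processing performs; unknown URLs are left as-is."""

    def repl(match: re.Match) -> str:
        attr, url = match.group(1), match.group(2)
        return f'{attr}="{url_to_local.get(url, url)}"'

    # Matches src="..." and href="..." attributes with double quotes.
    return re.sub(r'\b(src|href)="([^"]+)"', repl, html)
```

For example, mapping `https://cdn.example.com/fig1.png` to `images/fig1.png` swaps the remote image reference for the local copy while leaving unmapped links untouched.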
After running the `book` spider, you can combine all corresponding resources (HTML, CSS, and images) for each individual book and create one PDF per book:

- In the terminal, navigate to the project directory.
- Run the command `python services/pdfmaker.py`. This will combine all the corresponding resources for each individual book and create one PDF per book. The PDFs will be saved in the `data/books` directory.
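Conceptually, the PDF step stitches a book's chapter pages into one HTML document and hands it to WeasyPrint. The following is a minimal sketch under assumed file layout (chapter HTML files plus a `style.css` per book directory); `combine_book` and `render_pdf` are hypothetical names, not the actual contents of `services/pdfmaker.py`:

```python
from pathlib import Path


def combine_book(book_dir: str, css_name: str = "style.css") -> str:
    """Concatenate a book's chapter HTML files (in sorted filename
    order) into a single document referencing the shared stylesheet.
    Hypothetical sketch of the combining step in services/pdfmaker.py."""
    chapters = sorted(Path(book_dir).glob("*.html"))
    body = "\n".join(page.read_text() for page in chapters)
    return (
        "<html><head>"
        f'<link rel="stylesheet" href="{css_name}">'
        f"</head><body>{body}</body></html>"
    )


def render_pdf(book_dir: str, out_path: str) -> None:
    """Render the combined document to a single PDF with WeasyPrint."""
    from weasyprint import HTML  # imported lazily; requires WeasyPrint

    # base_url lets WeasyPrint resolve the relative CSS and image paths.
    HTML(string=combine_book(book_dir), base_url=book_dir).write_pdf(out_path)
```

Sorting by filename assumes chapters are saved with order-preserving names (e.g. zero-padded numbers), which keeps the PDF's page order matching the book.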
Contributions are always welcome! Please follow these steps to contribute:

- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and test thoroughly.
- Submit a pull request with a clear description of your changes.
Thank you for contributing to Crawreilly!
This project is licensed under the Apache License Version 2.0 - see the LICENSE file for details.