
Crawreilly

Crawreilly is a Python-based web scraper built using Scrapy that allows you to download all books and other resources from O'Reilly, pre-process the scraped HTML, CSS, and image resources, and save them locally as PDF files. To use this tool, you will need a valid subscription to O'Reilly.

Getting Started

Prerequisites

To run this scraper, you'll need to have the following:

  • Python >=3.9
  • Scrapy
  • WeasyPrint
  • A running MongoDB deployment (local server or cluster) and its connection URI, so the app can communicate with the right database

Installation

  1. Clone the repository: git clone https://github.com/rifatrakib/crawreilly.git

  2. Open a terminal window and navigate to the repository directory: cd crawreilly

  3. Create a virtual environment and activate it: virtualenv venv, then source venv/bin/activate (on Unix-like shells)

  4. Install Poetry inside the environment: pip install poetry

  5. Install dependencies: poetry install
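
Put together, the installation looks like the sequence below. The activation line assumes a Unix-like shell; on Windows, use venv\Scripts\activate instead:

    git clone https://github.com/rifatrakib/crawreilly.git
    cd crawreilly
    virtualenv venv
    source venv/bin/activate
    pip install poetry
    poetry install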

Usage

Authentication

First, you need to authenticate yourself with your O'Reilly credentials to access the resources. The auth spider does this automatically. Before running the spider, you need to create a file called auth.sh in the keys/raw directory of the repository and save the cURL command for logging in:

  1. In a web browser, log in to O'Reilly using your credentials.

  2. Use a web developer tool such as the network tab to capture the cURL command for the login request.

  3. Save the cURL command as an auth.sh file in the keys/raw directory. For reference, an illustrative example of what this file might contain is shown after this list.

  4. Run the scraper using the command scrapy crawl auth. This will log you in to O'Reilly and save the necessary session cookies for future requests.
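
A captured login request generally has the shape below. This is only an illustrative sketch: the endpoint, headers, and payload values are placeholders, and you should keep exactly what your browser's network tab exported.

    # keys/raw/auth.sh -- every value below is a placeholder; use the
    # command your browser actually captured for the login request
    curl 'https://www.oreilly.com/member/auth/login/' \
      -H 'Content-Type: application/json' \
      --data-raw '{"email": "you@example.com", "password": "your-password"}'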

Collecting Catalogue Information

The catalogue spider collects information about all books and other resources from the O'Reilly catalogue, stores it locally as CSV, JSON, and JSON Lines files in the data/csv, data/json, and data/jsonline directories respectively, and writes the JSON-formatted records to a MongoDB collection called catalogue, named after the spider:

  1. In a web browser, log in to O'Reilly using your credentials.

  2. Use a web developer tool such as the network tab to capture the cURL command for the request that fetches paginated resource information and save it in catalogue.sh under the keys/raw directory.

  3. In the terminal, navigate to the project directory.

  4. Run the scraper using the command scrapy crawl catalogue -o catalogue.json. This collects information about every book and other resource in the O'Reilly catalogue and saves it to the formats, directories, and MongoDB collection described above.
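
The MongoDB side of this behaviour can be pictured with the minimal Scrapy pipeline below. It is a sketch, not the project's actual pipeline, and the MONGO_URI and MONGO_DATABASE setting names are assumptions for illustration:

    # Sketch of a Scrapy item pipeline that writes each scraped record to a
    # MongoDB collection named after the spider ("catalogue", "book", ...).
    import pymongo

    class MongoPipeline:
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # MONGO_URI / MONGO_DATABASE are assumed setting names
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI"),
                mongo_db=crawler.settings.get("MONGO_DATABASE", "crawreilly"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # the collection name follows the spider name, as described above
            self.db[spider.name].insert_one(dict(item))
            return item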

Downloading and Pre-Processing Resources

The book spider downloads the HTML, CSS, and image resources for each book, pre-processes them, and saves them locally in directories based on category and book title:

  1. In a web browser, log in to O'Reilly using your credentials.

  2. Use a web developer tool such as the network tab to capture the cURL commands for the request that fetches book information and for the request that downloads an image, and save them as book.sh and image.sh respectively under the keys/raw directory.

  3. In the terminal, navigate to the project directory.

  4. Run the scraper using the command scrapy crawl book. This will download and pre-process the HTML, CSS, and image resources for each book in your O'Reilly subscription and save them locally based on their category and book title. The pre-processing includes, but is not limited to, fixing links so that the final PDF is a more readable and complete representation of the book (one possible approach is sketched after this list).

  5. Information about each individual book, in JSON format, is stored in a MongoDB collection called book along with some metadata; these records are also the source of the URLs for the corresponding HTML, CSS, and image files.
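
As an illustration of the link-fixing idea, the sketch below rewrites image and stylesheet references so they point at local copies. This is not the project's actual pre-processing code; BeautifulSoup and the directory layout are stand-ins:

    # Hypothetical helper: rewrite remote asset references to local paths so
    # that the rendered PDF can resolve images and stylesheets offline.
    from pathlib import Path
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup

    def localize_links(html: str, base_url: str, asset_dir: Path) -> str:
        soup = BeautifulSoup(html, "html.parser")
        for tag, attr in (("img", "src"), ("link", "href")):
            for node in soup.find_all(tag):
                if node.get(attr):
                    # resolve relative URLs, then point them at the local copy
                    absolute = urljoin(base_url, node[attr])
                    node[attr] = str(asset_dir / Path(absolute).name)
        return str(soup)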

Combining Resources into PDFs

After running the book spider, you can combine the corresponding resources (HTML, CSS, and images) for each individual book into one PDF per book by running the following command:

  1. In the terminal, navigate to the project directory.

  2. Run the command python services/pdfmaker.py. This combines the HTML, CSS, and image resources for each book into a single PDF, saved in the data/books directory.
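
The core of that step can be pictured with WeasyPrint as below. This is a minimal sketch assuming one HTML file per chapter under a per-book directory; services/pdfmaker.py itself may handle CSS, chapter ordering, and metadata differently, and the "some-book" path is a placeholder:

    # Sketch: render each chapter HTML with WeasyPrint, then stitch the pages
    # together into a single PDF.
    from pathlib import Path

    from weasyprint import HTML

    def build_pdf(chapter_files, output):
        # render each chapter separately, then merge all pages into one document
        documents = [HTML(filename=str(path)).render() for path in chapter_files]
        pages = [page for doc in documents for page in doc.pages]
        documents[0].copy(pages).write_pdf(str(output))

    if __name__ == "__main__":
        chapters = sorted(Path("data/books/some-book").glob("*.html"))
        build_pdf(chapters, Path("data/books/some-book.pdf"))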

Contributing

Contributions are always welcome! Please follow these steps to contribute:

  1. Fork the repository.

  2. Create a new branch for your feature or bug fix.

  3. Make your changes and test thoroughly.

  4. Submit a pull request with a clear description of your changes.

Thank you for contributing to Crawreilly!

License

This project is licensed under the Apache License Version 2.0 - see the LICENSE file for details.
