
link-scraper

The most basic link scraper in the whole world.

Synopsis

This is a simple link scraper that returns all the anchor links in a webpage.

A simple CLI is also available for quick prototyping.
You can run it locally or directly on Colab using this notebook.
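
Under the hood, scraping anchor links boils down to downloading the page and collecting every <a> tag that has an href attribute. The snippet below is only a minimal sketch of that idea using requests and BeautifulSoup; it is not the package's actual implementation, and the real dependencies may differ:

import requests
from bs4 import BeautifulSoup

def scrape_anchor_links(url):
    # Download the page and parse the HTML
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect (text, href) pairs for every anchor tag that has an href
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]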

Install

pip install git+https://github.com/yukienomiya/link-scraper.git

or

poetry add git+https://github.com/yukienomiya/link-scraper.git

Usage

from link_scraper.utils import scrape_links

scrape_links('https://google.it')
# => [
#     ('Immagini', 'https://www.google.it/imghp?hl=it&tab=wi'),
#     ('Maps', 'https://maps.google.it/maps?hl=it&tab=wl'),
#     ...
#    ]
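
Since scrape_links returns a list of (text, url) tuples, the result can be post-processed directly. For example, to keep only the unique URLs (a small illustrative snippet, not part of the package):

from link_scraper.utils import scrape_links

links = scrape_links('https://google.it')
# Drop the anchor text and deduplicate the URLs
unique_urls = sorted({url for _text, url in links})
print(unique_urls)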

CLI

The pip package includes a CLI that you can use to extract links.

usage: link-scraper [-h] [--debug] urls [urls ...]

Extract links from a list of URLs.

positional arguments:
  urls        A list of URLs to scrape, one per line. If - is given as the filename, the URLs are read
              from stdin instead.

optional arguments:
  -h, --help  show this help message and exit
  --debug     If provided, enables additional logging in case of errors.
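
For example, you can pass URLs as arguments or pipe them in, one per line (urls.txt here is just a hypothetical plain-text file with one URL per line):

# Scrape one or more URLs given as arguments
link-scraper https://google.it https://www.python.org

# Read the URLs from stdin by passing - as the only argument
cat urls.txt | link-scraper -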

Development

You can install this library locally for development using the commands below. If you don't have it already, you need to install poetry first.

# Clone the repo
git clone https://github.com/yukienomiya/link-scraper
# CD into the created folder
cd link-scraper
# Create a virtualenv and install the required dependencies using poetry
poetry install

You can then run commands inside the virtualenv by using poetry run COMMAND.
Alternatively, you can open a shell inside the virtualenv using poetry shell.
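
For example, to try the CLI from inside the virtualenv (using the link-scraper entry point shown in the CLI section above):

poetry run link-scraper https://google.it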

If you wish to contribute to this project, run the following commands locally before opening a PR and check that no error is reported (warnings are fine).

# Run the code formatter
poetry run task format
# Run the linter
poetry run task lint
# Run the static type checker
poetry run task types
# Run the tests
poetry run task test

License

This project is licensed under the MIT License - see the license file for details.
