Find which links on a web page are pagination links


Page Finder

This module detects which links on a web page are pagination links. To use it, you manually mark at least one link on the page as a pagination link. The algorithm then applies label propagation, using a Gaussian kernel over the Levenshtein edit distance between links as the similarity measure, to determine which other links are pagination links. A small demo is included to show how to use and test it.
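To make the idea concrete, here is a minimal, dependency-free sketch of label propagation over link URLs. It is not this package's implementation or API: the function names (`levenshtein`, `gaussian_similarity`, `propagate_labels`), the `sigma` and `prior_weight` parameters, and the small pull of unlabeled links toward a score of 0 are all assumptions made for illustration. The pull toward 0 is needed in this sketch because, with only positive seeds, plain propagation would eventually push every link's score to 1.

```python
import math


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def gaussian_similarity(a, b, sigma=2.0):
    """Gaussian kernel over the edit distance: exp(-d^2 / (2 * sigma^2))."""
    d = levenshtein(a, b)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def propagate_labels(links, seeds, sigma=2.0, prior_weight=0.1, n_iter=50):
    """Spread pagination scores from manually marked links to the others.

    links:        list of URL strings found on the page
    seeds:        dict mapping marked links to 1.0 (pagination)
    prior_weight: hypothetical pull of each unlabeled link toward 0
    """
    n = len(links)
    # Pairwise similarity matrix; a link is not its own neighbour.
    w = [[0.0 if i == j else gaussian_similarity(links[i], links[j], sigma)
          for j in range(n)] for i in range(n)]
    scores = [seeds.get(link, 0.0) for link in links]
    for _ in range(n_iter):
        updated = []
        for i, link in enumerate(links):
            if link in seeds:
                updated.append(seeds[link])  # marked links stay clamped
                continue
            weighted = sum(w[i][j] * scores[j] for j in range(n))
            updated.append(weighted / (prior_weight + sum(w[i])))
        scores = updated
    return dict(zip(links, scores))


if __name__ == "__main__":
    base = "https://news.ycombinator.com/"
    page_links = [base + path for path in
                  ("news?p=2", "news?p=3", "newest", "jobs", "ask")]
    result = propagate_labels(page_links, {page_links[0]: 1.0})
    for link, score in sorted(result.items(), key=lambda kv: -kv[1]):
        print("%.3f  %s" % (score, link))
```

Links that differ from a marked link by only a character or two, such as a different page number, end up with scores close to 1, while structurally different links stay low. A threshold on the score would then separate pagination links from the rest.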

Install

python setup.py develop

Dependencies: numpy and scrapely. To install them:

pip install -r requirements.txt

Demo

cd tests
python demo.py https://news.ycombinator.com

Enter link to follow (tab autocompletes): news?<TAB>
Enter link to follow (tab autocompletes): https://news.ycombinator.com/news?p=2 <RET>

0) Quit
1) Enter link directly
2) https://news.ycombinator.com/news?p=3
3) https://news.ycombinator.com/news
4) https://news.ycombinator.com/newest
5) https://news.ycombinator.com/jobs
6) https://news.ycombinator.com/ask
Select link to follow:
2 <RET>