RTDMTD algorithm for the extraction of html templates

python2and3developer/RTDMTD

RTDMTD algorithm

I implemented the algorithm described in this paper using Beautiful Soup:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.629&rep=rep1&type=pdf

These are the steps in the algorithm:

  • Given two pages A and B, use the DOM of each page to represent it as a tree.
  • Find the minimal-cost set of edit operations that transforms the tree of A into the tree of B. The possible tree edit operations are insertion, deletion, and replacement.
  • The nodes that are kept intact by the minimal-cost edit are considered template nodes. Create the minimal subtree containing those nodes. This subtree is the template.
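
The three steps above can be illustrated with a minimal sketch. This is not the code in this repository and it simplifies the paper's algorithm: it assumes a unit cost per inserted or deleted node, models a replacement as a deletion plus an insertion, and only matches sibling nodes in order when their labels are equal. To stay dependency-free it builds the tree with the standard library's `html.parser` instead of Beautiful Soup. All names (`Node`, `TreeBuilder`, `parse`, `size`, `align`, `outline`, `PAGE_A`, `PAGE_B`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Step 1: turn an HTML string into a labeled ordered tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text:" + text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    return 1 + sum(size(child) for child in node.children)


def align(a, b):
    """Steps 2 and 3: return (cost, template) for two nodes with equal labels."""
    n, m = len(a.children), len(b.children)
    # table[i][j] = (cost, kept template children) for the first i children
    # of a and the first j children of b.
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, [])
    for i in range(1, n + 1):
        cost, kept = table[i - 1][0]
        table[i][0] = (cost + size(a.children[i - 1]), kept)
    for j in range(1, m + 1):
        cost, kept = table[0][j - 1]
        table[0][j] = (cost + size(b.children[j - 1]), kept)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            options = []
            if ca.label == cb.label:
                # Keep both nodes and align their children recursively.
                sub_cost, sub_template = align(ca, cb)
                cost, kept = table[i - 1][j - 1]
                options.append((cost + sub_cost, kept + [sub_template]))
            cost, kept = table[i - 1][j]
            options.append((cost + size(ca), kept))  # delete subtree of a
            cost, kept = table[i][j - 1]
            options.append((cost + size(cb), kept))  # insert subtree of b
            table[i][j] = min(options, key=lambda option: option[0])
    cost, kept = table[n][m]
    template = Node(a.label)
    template.children = kept
    return cost, template


def outline(node):
    """Serialize a tree as a nested-parentheses string."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(outline(c) for c in node.children) + ")"


PAGE_A = ("<html><body><div><h1>Site</h1></div><p>First article</p>"
          "<ul><li>Home</li></ul></body></html>")
PAGE_B = ("<html><body><div><h1>Site</h1></div><p>Second story</p>"
          "<ul><li>Home</li></ul></body></html>")

cost, template = align(parse(PAGE_A), parse(PAGE_B))
print(cost)
print(outline(template))
```

In this example the two pages differ only in the paragraph text, so the heading and the navigation list survive in the template while the `p` element is kept without its text. Text nodes carry their content in the label, which is why identical text counts as template and differing text does not.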
