[discussion] compared to other tools (links) #66

dgtlmoon · 2022-08-30T13:17:32Z

I wanted to add this as a discussion, but can't find the tab. I thought the comparison was interesting, although links is a different tool (command line only).

Using links, it takes 0.13 s vs. nearly 3 s for the pip-installed inscript.py (about 23 times slower).

links uses about 16 MB of memory vs. about 178 MB for inscript.py.

$ /usr/bin/time -v links -dump ./leaky.html > links.txt
        Command being timed: "links -dump ./leaky.html"
        User time (seconds): 0.08
        System time (seconds): 0.01
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.10
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 15964
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 4227
        Voluntary context switches: 1
        Involuntary context switches: 2
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

$ cat leaky.html |time -v inscript.py  > inscript.txt
        Command being timed: "inscript.py"
        User time (seconds): 2.99
        System time (seconds): 0.08
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.08
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 178232
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 54636
        Voluntary context switches: 55
        Involuntary context switches: 8
        Swaps: 0
        File system inputs: 0
        File system outputs: 1136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
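
For anyone who wants to redo the arithmetic, here is a minimal sketch that pulls the wall-clock time and peak memory out of GNU `time -v` reports and computes the ratios. The helper name `parse_time_v` and the two constants are made up for illustration; the sample values are copied from the two runs pasted above. Note that the pasted runs give roughly a 31x wall-clock difference, whereas the 0.13 s and 3 s quoted at the top give about 23x, so the exact factor depends on which run is used.

```python
import re

# Relevant lines copied from the two reports above (everything else omitted).
LINKS_REPORT = """
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.10
        Maximum resident set size (kbytes): 15964
"""
INSCRIPT_REPORT = """
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.08
        Maximum resident set size (kbytes): 178232
"""


def parse_time_v(report: str) -> dict:
    """Extract wall-clock seconds and peak RSS (kB) from GNU `time -v` output.

    Hypothetical helper for this thread, not part of inscriptis or links.
    """
    clock = re.search(
        r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\): ([\d:.]+)", report
    ).group(1)
    rss = re.search(r"Maximum resident set size \(kbytes\): (\d+)", report).group(1)

    # "h:mm:ss" or "m:ss.cc" -> seconds, folding from the leftmost field.
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)

    return {"elapsed_s": seconds, "max_rss_kb": int(rss)}


links = parse_time_v(LINKS_REPORT)
inscript = parse_time_v(INSCRIPT_REPORT)

time_ratio = inscript["elapsed_s"] / links["elapsed_s"]
mem_ratio = inscript["max_rss_kb"] / links["max_rss_kb"]

print(f"wall clock: {time_ratio:.1f}x, peak memory: {mem_ratio:.1f}x")
```

A single run per tool is noisy at these time scales, so repeating each command several times and comparing medians would probably give a steadier figure.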

inscriptis adds a lot of whitespace at the start, before the content, whereas links seems to get it right in one go.

If I trim the whitespace, links shows the 'visible' representation, whilst inscript.py attempts to show the field values on one line.


AlbertWeichselbraun · 2022-08-31T06:17:04Z

That's interesting.

There is a benchmarking script in the repository (./inscriptis/benchmarking/run_benchmarking.py) which tests the performance of lynx, beautifulsoup, html2text, justext and inscriptis against use cases that have been relevant to our work.

In these tests, inscriptis was, most of the time, either the top or the second-best performer.

Maybe running links against these benchmarks could also provide some interesting insights.

AlbertWeichselbraun · 2022-09-03T20:01:23Z

Regarding speed: did you also check the output?

I converted leaky.html with links, lynx and inscriptis.

links yielded a 7.7 kB text file.
Both inscriptis and lynx, in contrast, produced approx. 500 kB of text.
It looks like links only provides a fraction of the total output.
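
A quick way to check this kind of size difference is sketched below. The function name `share_of_largest` and the `lynx.txt` file name are assumptions for illustration (`links.txt` and `inscript.txt` are the output files from the commands earlier in the thread); the sizes passed in the example are the approximate figures quoted above, not fresh measurements.

```python
from pathlib import Path


def share_of_largest(sizes: dict) -> dict:
    """Map each tool to its output size as a fraction of the largest output.

    Hypothetical helper; `sizes` maps a tool name to a size in any one unit.
    """
    largest = max(sizes.values())
    return {tool: size / largest for tool, size in sizes.items()}


# Approximate sizes in kB, as quoted in this comment.
quoted = {"links": 7.7, "lynx": 500.0, "inscriptis": 500.0}
shares = share_of_largest(quoted)
print({tool: f"{share:.1%}" for tool, share in shares.items()})

# Optional: measure real files if they exist in the working directory.
# "lynx.txt" is an assumed file name; adjust the paths to your own outputs.
outputs = {"links": "links.txt", "lynx": "lynx.txt", "inscriptis": "inscript.txt"}
measured = {
    tool: Path(path).stat().st_size
    for tool, path in outputs.items()
    if Path(path).exists()
}
if measured:
    print(share_of_largest(measured))
```

Raw byte counts also include indentation and blank lines, so comparing the number of non-whitespace characters may be a fairer measure when the tools format their output differently.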

Regarding formatting:

inscriptis has been optimized for natural text processing. That's why it runs by default with the --indentation relaxed flag (which yields more spaces, but also reduces the likelihood of words being glued together).
The --indentation strict mode does not add these additional spaces and therefore produces a more compact text representation.

AlbertWeichselbraun · 2022-09-20T13:10:03Z

I'm closing this for now, since I did not receive any feedback on the difference in output size (links provides only about 1.5% of the total output).

Please feel free to reopen if this becomes relevant again.

AlbertWeichselbraun closed this as completed Sep 20, 2022
