[discussion] compared to other tools (links) #66

Closed
dgtlmoon opened this issue Aug 30, 2022 · 3 comments

Comments

@dgtlmoon

I wanted to add this as a discussion, but I can't find the tab. I thought the comparison was interesting, although links is a different kind of tool (command line only).

Using links, it takes 0.13s, vs. nearly 3s for the pip-installed inscript.py (about 23 times slower).

links uses about 16 MB of memory vs. about 178 MB for inscript.py.

$ /usr/bin/time -v links -dump ./leaky.html > links.txt
        Command being timed: "links -dump ./leaky.html"
        User time (seconds): 0.08
        System time (seconds): 0.01
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.10
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 15964
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 4227
        Voluntary context switches: 1
        Involuntary context switches: 2
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

$ cat leaky.html |time -v inscript.py  > inscript.txt
        Command being timed: "inscript.py"
        User time (seconds): 2.99
        System time (seconds): 0.08
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.08
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 178232
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 54636
        Voluntary context switches: 55
        Involuntary context switches: 8
        Swaps: 0
        File system inputs: 0
        File system outputs: 1136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
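
To compare the two reports above without reading them by eye, the wall-clock time and peak memory can be pulled out of the `time -v` output programmatically. This is only a rough sketch: `parse_time_v` and the two report constants are hypothetical names, and the strings contain just the two relevant lines copied from the reports pasted above.

```python
import re


def parse_time_v(report: str) -> dict:
    """Extract wall-clock seconds and peak RSS (kB) from GNU `time -v` output."""
    elapsed = re.search(
        r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\): ([\d:.]+)", report
    )
    rss = re.search(r"Maximum resident set size \(kbytes\): (\d+)", report)
    if elapsed is None or rss is None:
        raise ValueError("report does not look like `time -v` output")

    # Fold "h:mm:ss" or "m:ss.xx" into seconds, left to right.
    seconds = 0.0
    for part in elapsed.group(1).split(":"):
        seconds = seconds * 60 + float(part)
    return {"elapsed_s": seconds, "max_rss_kb": int(rss.group(1))}


# Only the two relevant lines from each report in this thread.
LINKS_REPORT = """
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.10
        Maximum resident set size (kbytes): 15964
"""
INSCRIPT_REPORT = """
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.08
        Maximum resident set size (kbytes): 178232
"""

links = parse_time_v(LINKS_REPORT)
inscript = parse_time_v(INSCRIPT_REPORT)
print(f"wall-clock ratio: {inscript['elapsed_s'] / links['elapsed_s']:.1f}x")
print(f"peak memory ratio: {inscript['max_rss_kb'] / links['max_rss_kb']:.1f}x")
```

These ratios describe a single run on a single file, so they are an indication rather than a benchmark.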

inscriptis adds a lot of whitespace before the content starts, whereas links seems to nail it in one.

image

If I trim the whitespace, links shows the 'visible' representation, whilst inscript.py attempts to show the field values on one line.

image

@AlbertWeichselbraun
Contributor

that's interesting.

there is a benchmarking script in the repository (./inscriptis/benchmarking/run_benchmarking.py) which tests the performance of lynx, beautifulsoup, html2text, justext and inscriptis against use cases that have been relevant to our work.

in these tests, inscriptis was most of the time either the top or the second-best performer.

maybe running links against these benchmarks could also provide some interesting insights.

@AlbertWeichselbraun
Contributor

AlbertWeichselbraun commented Sep 3, 2022

ad speed: did you also check the output?

i converted leaky.html with links, lynx and inscriptis.

  • links yielded a 7.7 kB text file
  • both inscriptis and lynx, in contrast, produced approx. 500 kB of text.
  • it looks like links only provides a fraction of the total output.
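
A quick sanity check along these lines can be sketched as follows, assuming the sizes quoted above (7.7 kB and roughly 500 kB). `size_share` and `word_coverage` are hypothetical helpers, not part of any of the tools; the coverage measure is a crude word-set overlap chosen only for illustration.

```python
def size_share(small_kb: float, large_kb: float) -> float:
    """Size of the smaller output as a percentage of the larger one."""
    return 100.0 * small_kb / large_kb


def word_coverage(candidate: str, reference: str) -> float:
    """Fraction of distinct words in `reference` that also occur in `candidate`."""
    reference_words = set(reference.split())
    if not reference_words:
        return 1.0
    return len(reference_words & set(candidate.split())) / len(reference_words)


# Sizes quoted in this comment: links vs. lynx/inscriptis.
share = size_share(7.7, 500.0)
print(f"links output is about {share:.1f}% of the lynx/inscriptis output size")

# With the real files, one would pass the contents of links.txt and
# inscript.txt to word_coverage(); the strings here are placeholders.
print(word_coverage("alpha beta", "alpha beta gamma delta"))
```

If the coverage is low, a speed comparison is not like for like, because the faster tool did less work.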

ad formatting:

  • inscriptis has been optimized for natural text processing - that's why it is run by default with the --indentation relaxed flag (which yields more spaces, but also reduces the likelihood of words being glued together).
  • the --indentation strict mode does not add these additional spaces and, therefore, produces a more compact text representation.
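
The trade-off between compactness and glued words can be illustrated with a toy extractor built on Python's standard `html.parser`. This is not how inscriptis works internally; `TextCollector` and `extract` are hypothetical names, and the separator argument merely stands in for the extra spacing described above.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes and joins them with a configurable separator."""

    def __init__(self, separator: str):
        super().__init__()
        self.separator = separator
        self.parts = []

    def handle_data(self, data):
        # Ignore whitespace-only nodes between tags.
        if data.strip():
            self.parts.append(data.strip())

    def text(self) -> str:
        return self.separator.join(self.parts)


def extract(html: str, separator: str) -> str:
    collector = TextCollector(separator)
    collector.feed(html)
    collector.close()
    return collector.text()


html = "<table><tr><td>first</td><td>second</td></tr></table>"
compact = extract(html, "")   # adjacent cells run together
padded = extract(html, " ")   # extra space keeps the words apart
print(repr(compact))
print(repr(padded))
```

For downstream text processing, a few surplus spaces are usually cheaper than tokens that have been merged into one.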

@AlbertWeichselbraun
Contributor

I'm closing this for now, since I did not receive any feedback on the difference in output size (links provides only about 1.5% of the total output).

Please feel free to reopen, if this becomes relevant again.
