
Create lists of commands to test coverage parity against #1070

Open
waldyrious opened this issue Sep 17, 2016 · 41 comments
Assignees
Labels
documentation (Issues/PRs modifying the documentation), tooling (Helper tools, scripts and automated processes)

Comments

@waldyrious
Member

No description provided.

waldyrious added the new command (Issues requesting creation of a new page) label on Sep 17, 2016
waldyrious self-assigned this on Sep 17, 2016
@leostera
Contributor

leostera commented Sep 17, 2016

We should absolutely leverage the online linux man pages to periodically fetch a big, big list of commands.

Sample: http://linux.die.net/man/1/ has almost 10,000 commands.

leostera added the tooling (Helper tools, scripts and automated processes) label on Sep 17, 2016
@waldyrious
Member Author

waldyrious commented Sep 17, 2016

We could make separate projects to track commands based on platform (since they overlap, we can't use milestones, which is a pity since it would give us a nice progress bar)

Linux:

Windows:

OS X:

@be5invis

be5invis commented Jan 16, 2017

@waldyrious For Windows, commands in CMD and PowerShell are DIFFERENT. For example, dir is a CMD built-in, also an alias of Get-ChildItem in PowerShell.
(Even ls in PowerShell is an alias, though they are going to remove it.)

@waldyrious
Member Author

@be5invis thanks for bringing that up. It is certainly something we need to consider (e.g. we currently treat all linuxes the same, even though some of the commands are shell-specific). See #190 and #816 for previous discussion.

That said, that problem does not affect this issue: the former deals with how we organize the command pages we do have, while this issue is about identifying which commands we don't yet have, but should.

@be5invis

be5invis commented Jan 16, 2017

@waldyrious The full PowerShell commands on my PC:
https://gist.github.com/be5invis/57d906e6f6935f7a1f19279878c2c214

@sbrl
Member

sbrl commented May 4, 2017

If the tldr client emits a different exit status depending on whether the page exists or not (like tldr-bash-client does), then we could have a semi-automatic bash script that runs through a list of commands and emits a list of those that don't exist yet. I could even write something like that & create a gist quite easily.

It would certainly help people who want to contribute find a page that needs doing.
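The exit-status approach described above could be sketched roughly like this (a sketch, not a definitive implementation: the checker command and file names are assumptions, and it only works with clients that signal a missing page via their exit code, as tldr-bash-client does):

```shell
# list_missing CHECKER LIST: print each command from LIST for which
# CHECKER exits non-zero. CHECKER stands in for a tldr client that
# signals a missing page via its exit status (an assumption here;
# not every client does this).
list_missing() {
    local checker="$1" list="$2" cmd
    while read -r cmd; do
        "$checker" "$cmd" > /dev/null 2>&1 || echo "$cmd"
    done < "$list"
}

# Usage (illustrative): list_missing tldr commands.txt > missing.txt
```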

@agnivade
Member

agnivade commented May 4, 2017

You can always check the files present in the repo itself for parity, no?

@sbrl
Member

sbrl commented May 4, 2017

@agnivade Yeah, we could do that too! Do a git clone and then a `find tldr -iname "command.md"` or something
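A minimal sketch of that repo-based check (directory layout and file names are assumptions):

```shell
# page_status LIST PAGES_DIR: report, for each command name in LIST,
# whether a matching page file exists anywhere under PAGES_DIR.
page_status() {
    local list="$1" pages="$2" cmd
    while read -r cmd; do
        if find "$pages" -name "$cmd.md" | grep -q .; then
            echo "have: $cmd"
        else
            echo "need: $cmd"
        fi
    done < "$list"
}

# Usage (illustrative):
#   git clone https://github.com/tldr-pages/tldr
#   page_status commands.txt tldr/pages
```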

@sbrl
Member

sbrl commented Jun 29, 2017

Executing

var els = document.querySelectorAll("dt a[href]");
var cmds = [];
for(let el of els) cmds.push(el.innerText);
cmds.join("\n");

on http://linux.die.net/man/1/, gives this file: linux-commands.txt

This is obviously pending sorting, which I'll do soon.

@sbrl
Member

sbrl commented Jul 1, 2017

Sorting complete! Here's what I came up with:

cat linux-commands.txt | xargs -P4 -I {} bash -c 'if [[ "$(find tldr/pages/ -name {}.md | wc -l)" -ne 0 ]]; then echo yep>>yeses.txt; else echo nope>>nos.txt; fi'
echo We have $(cat yeses.txt | wc -l) out of $(cat linux-commands.txt | wc -l) commands in tldr-pages -  $(cat nos.txt | wc -l) commands are missing.

Running the above reveals that:

  • We've got 328 commands documented
  • We've got 9497 to go
  • We've done ~3.34% so far

@waldyrious
Member Author

I wonder if, after we've compiled one or more lists of commands to add, we could somehow calculate the completeness percentages automatically and display them in the README with a badge.

If we do compile multiple lists, we could even organize the completion badges in a table to provide a dashboard similar to the progress table of Wikipedia's WikiProject Missing encyclopedic articles.

Does anyone have an idea whether something like that is doable and/or hints about how to go about implementing it?

@agnivade
Member

agnivade commented Sep 4, 2017

I would like to take a stab at this. I am thinking of just taking the GNU coreutils list and testing parity against it. The linux.die.net page contains a lot of commands that have to be installed separately.

The badge thing can be easily done with a custom svg element.
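One low-effort way to get such a badge (an assumption on my part: using shields.io rather than a hand-rolled SVG) would be to compute the percentage and build a static badge URL from it:

```shell
# badge_url HAVE TOTAL: print a shields.io static-badge URL showing
# the completion percentage. Integer arithmetic; shields.io is an
# assumption here, since the comment above only mentions a custom SVG.
badge_url() {
    local pct=$(( 100 * $1 / $2 ))
    echo "https://img.shields.io/badge/coverage-${pct}%25-blue.svg"
}

# e.g. badge_url 328 9825
```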

@waldyrious
Member Author

@agnivade I can't wait to see what you come up with! I'm more than willing to provide the actual content of the lists if that takes some work off your plate (I have a bunch of notes and links in a google doc, besides the resources I listed above).

@agnivade
Member

agnivade commented Sep 4, 2017

Sure, that would be great.

@sbrl
Member

sbrl commented Sep 4, 2017

Oooh, awesome :D

@agnivade
Member

@waldyrious - I might take a stab at it this weekend. Can you share the links/notes that you have ?

@waldyrious
Member Author

Sure. I'll block off one hour to work on this today, and will post the resulting data.

@waldyrious
Member Author

waldyrious commented Sep 15, 2017

Heads-up: the wiki page "Pages plan" has been deleted to centralize tracking of missing pages in this thread. I've moved all the information that was present there to this spreadsheet, which is publicly viewable and anyone can add comments. It's a work in progress (I just started it). I'll give write access to the current maintainers.

@sbrl
Member

sbrl commented Sep 15, 2017

@waldyrious Wow, that's an impressive spreadsheet! Is there a filter for just the ones that haven't been done yet? How are bulk lists of commands added to the list?

@waldyrious
Member Author

There will be a filter, yeah -- that's one of the reasons I've decided to build it in a spreadsheet. The lists of commands will be added manually (using various helper tools, of course), since the various sources don't use a common format. Let me know on Gitter if you'd like to work on this so we can coordinate.

@agnivade
Member

I am concerned about how I'd get the total list of commands programmatically, since I would like to run the check against every commit merged into master.

@waldyrious
Member Author

waldyrious commented Sep 15, 2017

That document is by no means meant to be the final location of the list. It's just the way I figured would be easiest to get it started and quickly filling it. I don't know yet what setup would be the best balance of (1) community maintenance of the data, (2) machine consumption of the contents, (3) automatic synchronization (as much as possible) as new pages are added. Ideas are welcome.

Also, the choice of how to set this up would depend on how often we would want to update the list. I think we can start with something reasonably static, to make things easier, especially since we have a lot of work to catch up to established commands before it would make sense to start chasing more dynamic lists (say, top node.js-based CLI tools or something like that)

@agnivade
Member

(3) automatic synchronization (as much as possible) as new pages are added.

Umm no .. I think you got the wrong idea. 😝 We don't need to synchronize when new pages are added. That would be crazy. It seems like you put a lot of effort into this. Frankly, I didn't need so much detail.

Here's what my plan is -

  • On every commit to master branch, run a script which will get all the commands in the repo, get a list of target commands we want to match our list with, calculate percentage and update the svg badge which shows the percent completion.

That's it. No need to update any list when new pages are added.
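That per-commit script might look something like this (a sketch under assumed file names; the real implementation could differ):

```shell
# coverage PAGES_DIR TARGET_LIST: print the percentage of commands
# from TARGET_LIST that have a page under PAGES_DIR. TARGET_LIST is
# the checked-in list of target commands described above.
coverage() {
    local have total
    # Page names with the .md suffix stripped, intersected with the
    # (deduplicated) target list via comm.
    have=$(find "$1" -name '*.md' -exec basename {} .md \; | sort -u \
           | comm -12 - <(sort -u "$2") | wc -l)
    total=$(sort -u "$2" | wc -l)
    echo "$(( 100 * have / total ))"
}

# A CI job could then run `coverage pages target-commands.txt` and
# regenerate the badge from the result.
```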

waldyrious added the documentation (Issues/PRs modifying the documentation) label and removed the new command (Issues requesting creation of a new page) label on Sep 15, 2017
@waldyrious
Member Author

Hahah, yeah, I got a little carried away there. Although I might have given off the wrong impression.

The way I was planning to have this "automatic sync" feature was to simply open one issue per command to add, and assign them to milestones according to the lists they appear in. That way we'd get a nice live overview page with progress bars for each of the lists we'd want to reach parity with. For reference, my inspiration came from the overview table of Wikipedia's WikiProject Missing encyclopedic articles.

In addition to one milestone per (major) source, we might also want platform-specific lists (Windows commands, BSD, etc.), and maybe topic-specific lists (email clients, text editors, compilers, etc.).

Of course, this doesn't prevent us from having a "master completeness list" and using that to compute a single "overall completeness" metric. We do need to decide what goes into that list, though. The obvious choice is a metric of the most popular commands (e.g. the top 1000 entries sorted by how many of those lists they appear in), but let me know if you think something else would make more sense.

@agnivade
Member

Your idea seems like a lot of manual work, something I personally would want to avoid. I was planning to just compute that "completeness" metric and be done with it. If we indeed decide that it's just gonna be 1000 commands, then we might as well compute the list and check it into the repo, so that my code can easily compare against it.

@waldyrious
Member Author

Sure, as I said, the list will probably not change much after we compile it. My idea is just a nice-to-have I might do on my own later on (unless you guys object).

For the master list, we just need to decide what criteria we'll use to define its contents -- from there it's just a matter of collecting the rest of the data and applying the filters.

So in that regard, what are your thoughts regarding which criteria to use: which lists to compare against, how many commands to include, etc?

@waldyrious
Member Author

waldyrious commented Sep 16, 2017

Update: the table is pretty much ready now. Some areas that still need some help:

  • cells painted yellow indicate sources that haven't been integrated yet. There are four of them. Any help in that regard would be appreciated (just parsing those sources into a plaintext list of commands would suffice)
  • cells painted orange indicate expected counts (number of "x" marks in that column) that don't match the automated count. I'm not sure what's going on there, so it would be nice if someone with fresh eyes could double-check those columns.

Apart from that, we can start deciding how to use that data to compile our master list :)

Note: I didn't include the linux.die.net manpages, since even just the first section contains about 10,000 commands, which makes the table unwieldy and kinda overwhelming, to be honest.

@waldyrious
Member Author

By the way, the plan to use milestones won't be possible after all. I had already reached this conclusion before, but forgot it in the meantime: it turns out GitHub only allows a single milestone per issue, so there would be no way to simultaneously track progress towards multiple coverage parity goals :(

That said, we could still have a milestone for the master parity list, which IMO would be a good thing as it would make those missing commands more visible as issues that newcomers could tackle. (It could also be the target URL for the badge.)

@agnivade
Member

cells painted orange indicate expected counts (number of "x" marks in that column) that don't match the automated count.

So are you saying you have manually counted each 'x' just to verify the automated count ? That's some dedication ! Why bother with the manual count at all if there is already automation for it ? Unless you suspect that =COUNTIF(L5:L,"x") is wrong ?

@waldyrious
Member Author

waldyrious commented Sep 16, 2017

So are you saying you have manually counted each 'x' just to verify the automated count ? That's some dedication !

Oh god no, haha :P I'm not that crazy ;)
I had the correct count from the actual lists that can be seen in the "lists" sheet (number of lines, basically, which any decent text editor will provide), which gives me more confidence in the result than using the formulas. Besides, some of the automated counts indeed are correct. I found some duplicated entries before, due to imperfect filtering, and that fixed some of the mismatched counts -- but I can't figure out what's causing the remaining mismatches...

@agnivade
Member

Ah I see :) Didn't notice that there was another sheet.

@sbrl
Member

sbrl commented Sep 16, 2017

Awesome work! Yeah, perhaps we could have a 'current goal' to document all the commands in a given list, and keep moving to new lists as we complete old ones. Having a list of commands auto-generated that have yet to be documented for the 'current goal' parity list would be helpful for newcomers, yeah.

The sheet is rather unwieldy though on my screen, since the frozen panes take up about 60% of my available screen real-estate on my laptop 😕

@agnivade
Member

I think we should move the orange and yellow cells to a new row below, because they're in the same row as coverage, and they just signify the expected count, not coverage.

And lastly, our current coverage % is 52 right ?

@waldyrious
Member Author

The sheet is rather unwieldy though on my screen, since the frozen panes take up about 60% of my available screen real-estate on my laptop 😕

I made the heading more compact. Is that workable now?

I think we should move the orange and yellow cells to a new row below. Because its in the same row as coverage. And it just signifies the expected count, not coverage.

Agreed, I just did that. Ideally we won't even have to include the expected count on the table, but until we figure out what's going on with the mismatched values, we'll need those cells.

@waldyrious
Member Author

waldyrious commented Sep 17, 2017

Ok, I've filled the table some more.
Also, apparently I can't reproduce the count mismatch anymore, so ¯\_(ツ)_/¯

The two sources that still need parsing into a plain list of command names are Inconsolation and ArchWiki's List of applications. Any help appreciated!

And lastly, our current coverage % is 52 right ?

Yes, but that's a plain fraction that doesn't consider the relative importance of the missing commands. I'd rather have a weighted coverage percentage, where each entry is weighted by the number of occurrences in these other lists. I'll have that working in a bit. Edit: done (see top right corner of the table). Looks like at this point it isn't much different from the plain percentage, though 😝
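The weighted metric could be computed along these lines (a sketch only; the "command weight" input format is an assumption, with each weight being the number of source lists the command appears in):

```shell
# weighted_coverage WEIGHTS_FILE HAVE_FILE: WEIGHTS_FILE has lines of
# "command weight"; HAVE_FILE lists the commands already documented.
# Prints the percentage of total weight covered, so missing a command
# that appears in many lists hurts more than missing an obscure one.
weighted_coverage() {
    awk 'NR==FNR { have[$1] = 1; next }          # first file: HAVE_FILE
         { total += $2; if ($1 in have) covered += $2 }
         END { printf "%.1f\n", 100 * covered / total }' "$2" "$1"
}
```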

@sbrl
Member

sbrl commented Sep 17, 2017

Through weird es6 magic, I bring you a list of commands for the Inconsolation lists! Here's the code I used in the firefox console for reference:

(function() {
let result = [];
document.querySelectorAll(".entry-content > p:nth-child(4) a[href]").forEach((el) => {
    if(el.innerText.search(":") === -1 || el.innerText.trim()[0] !== el.innerText.trim()[0].toLowerCase() || el.innerText.search(/\./) !== -1) return;
    result.push(...el.innerText.split(":")[0].split(/\s*(and|,)\s*/gi));
});
result = result.filter((cmd) => cmd.search(/[,\*\(\) \{\}]|and/) === -1 && cmd.length > 0);
console.log(result.filter((el, i, arr) => arr.indexOf(el) === i).join("\n"));
})();

...I've pasted them into the spreadsheet. They might need a little bit of tidy-up work though, since the input was messy.

That archwiki one though looks tough, since they don't detail the name of all commands in the list.

@waldyrious
Member Author

Can you explain the code? I'm afraid just parsing the link titles will produce a list with way too many missing entries, because many of the titles don't contain command names directly. On the other hand, I'm not sure I can think of anything that would work better without involving manual processing of each page linked from the entries... 😕

As for the ArchWiki page, I guess it would suffice to extract only the contents of the sections titled "Console". That will definitely leave some gaps in the output, but the page isn't meant to be a structured list anyway, nor does it focus specifically on command line programs, so I guess it's reasonable to parse it more loosely.

@sbrl
Member

sbrl commented Sep 17, 2017

It is a bit messy, isn't it! 😛 What it does is extract the names of the commands listed on the page, since I assumed that it was an index of all the commands the author had talked about. It discards the following:

  • Items in the list with a capital letter at the beginning
  • Items without a colon

Once done, it extracts the bit before the colon and does the following:

  • Splits it on `,` and `and`
  • Discards any parts containing `and`, `,`, `(`, `)`, `{`, or `}`
  • Discards any parts that have a length of zero.

@gingerbeardman
Contributor

Some related discussion here: #1953

@agnivade
Member

agnivade commented Feb 1, 2018

This has been pending too long ! I will go on vacation soon, I promise to work on this during that time !

@gingerbeardman
Contributor

Enjoy your vacation! If the two coincide, then so be it 👍

agnivade removed their assignment on Oct 17, 2019