Ep4 `sort -n` has different behaviour on Mac and git bash vs linux #810

Open
gcapes opened this Issue Jul 3, 2018 · 6 comments

@gcapes
Contributor

gcapes commented Jul 3, 2018

The -n flag only makes a visible difference on Linux. On the other OSes, sort -n gives the same output as plain sort.

We noticed this when @mawds was teaching using git bash.

@gcapes gcapes changed the title from Ep4 sort -n exercise doesn't work on Mac or git bash to Ep4 `sort -n` has different behaviour on Mac and git bash Vs linux Jul 3, 2018

@gcapes gcapes changed the title from Ep4 `sort -n` has different behaviour on Mac and git bash Vs linux to Ep4 `sort -n` has different behaviour on Mac and git bash vs linux Jul 3, 2018

@gcapes


Contributor

gcapes commented Jul 5, 2018

The example in question uses the molecules directory.

wc -l *.pdb > lengths.txt

On Linux:

$ sort lengths.txt 
 107 total
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
   9 methane.pdb
$ sort -n lengths.txt 
   9 methane.pdb
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
 107 total
$ sort -b lengths.txt 
 107 total
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
   9 methane.pdb

On Git Bash:

$ sort lengths.txt
   9 methane.pdb
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
 107 total
$ sort -n lengths.txt
   9 methane.pdb
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
 107 total
$ sort -b lengths.txt
 107 total
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
   9 methane.pdb

Can anyone help me understand what is going on here?

@rgaiacs


Collaborator

rgaiacs commented Jul 5, 2018

Can anyone help me understand what is going on here?

Git Bash, like macOS, has different default arguments for sort than GNU/Linux.

@gcapes


Contributor

gcapes commented Jul 5, 2018

Hi Raniere,

I'm not sure I understand - are you suggesting that sort may be aliased to use certain flags in Git Bash so that when I run sort it might run e.g. sort -n by default?
There are no aliases involved here (I ran alias sort, and \sort to investigate), but I can see the behaviour is different depending on the OS.

@mawds


Contributor

mawds commented Jul 5, 2018

I wondered whether this was something to do with locale.
(I set the locale to en_US as this appeared to be the default on git bash. My default locale on Linux is en_GB.)

On Linux:

export LANG=en_US.UTF8
sort ~/VirtualBoxShared/lengths.txt --debug
export LANG=C
sort ~/VirtualBoxShared/lengths.txt --debug

gives:

sort: using ‘en_US.UTF8’ sorting rules
 107 total
__________
  12 ethane.pdb
_______________
  15 propane.pdb
________________
  20 cubane.pdb
_______________
  21 pentane.pdb
________________
  30 octane.pdb
_______________
   9 methane.pdb
________________

sort: using simple byte comparison
   9 methane.pdb
________________
  12 ethane.pdb
_______________
  15 propane.pdb
________________
  20 cubane.pdb
_______________
  21 pentane.pdb
________________
  30 octane.pdb
_______________
 107 total
__________

The same in git bash gives:

sort: using ‘en_US.UTF8’ sorting rules
   9 methane.pdb
________________
  12 ethane.pdb
_______________
  15 propane.pdb
________________
  20 cubane.pdb
_______________
  21 pentane.pdb
________________
  30 octane.pdb
_______________
 107 total
__________

sort: using simple byte comparison
   9 methane.pdb
________________
  12 ethane.pdb
_______________
  15 propane.pdb
________________
  20 cubane.pdb
_______________
  21 pentane.pdb
________________
  30 octane.pdb
_______________
 107 total
__________

(again, no aliases set on sort).

Not sure if that helps, or just adds to the confusion. I'm not sure what else to try to work out what's going on.

@rgaiacs


Collaborator

rgaiacs commented Jul 5, 2018

As mentioned in https://stackoverflow.com/a/28903/1802726,

The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
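As a small illustration of the quoted advice, the sketch below sets LC_ALL=C for a single sort invocation instead of exporting it. The four input letters are made up for the example and are not from the lesson data. Under byte comparison, uppercase letters sort before lowercase ones.

```shell
# Hypothetical input: two uppercase and two lowercase letters.
letters=$(printf 'b\nA\na\nB\n')

# LC_ALL=C applies only to this one sort command; the shell's locale is unchanged.
# Byte values: A=0x41, B=0x42, a=0x61, b=0x62.
printf '%s\n' "$letters" | LC_ALL=C sort
```

The order without LC_ALL=C depends on the locale in effect, which is exactly the variable being investigated in this thread.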

@mawds


Contributor

mawds commented Jul 5, 2018

I get the same output on git bash whether I use export LANG=C or export LC_ALL=C (i.e. a numeric-looking sort regardless of whether -n is set).

Doing some further digging, the callout example in the lesson:

If we run sort on a file containing the following lines:

10
2
19
22
6

This does work as expected on Git Bash (regardless of LC_ALL/LANG being set to C or en_US.UTF8), i.e. I get character or numeric sorting depending on -n, so the difference is something to do with leading blanks.
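The callout comparison can be reproduced with a sketch like the following. The scratch directory and the file name numbers.txt are assumptions made for this example. Because the lines contain only digits and no leading blanks, the two commands should differ on any of the platforms discussed here.

```shell
# Work in a scratch directory; the name numbers.txt is hypothetical.
demo_dir=$(mktemp -d)
printf '10\n2\n19\n22\n6\n' > "$demo_dir/numbers.txt"

# Character sort: lines are compared as strings, so "10" and "19" precede "2".
sort "$demo_dir/numbers.txt"

# Numeric sort: lines are compared by numeric value.
sort -n "$demo_dir/numbers.txt"
```

The first command prints 10, 19, 2, 22, 6 and the second prints 2, 6, 10, 19, 22.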

What I think is happening is that with LC_ALL=C on either platform, or LC_ALL=en_US.UTF-8 on git Bash, space sorts before numeric digits. On Linux, with LC_ALL=en_US.UTF-8 space sorts after numeric digits.

Where space sorts low you'll get an apparently numeric sort, even if the sort command is sorting by character.
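A minimal sketch of that effect, using three lines copied from the lengths.txt output earlier in the thread and forcing byte comparison with LC_ALL=C (the variable names are made up for the example):

```shell
# Three right-aligned lines in the style of the lengths.txt output above.
padded=$(printf '%4d %s\n' 107 total 12 ethane.pdb 9 methane.pdb)

# Byte comparison without -n: space (0x20) is lower than any digit (0x30-0x39),
# so the line with the most leading spaces (the smallest number) comes first.
byte_order=$(printf '%s\n' "$padded" | LC_ALL=C sort)

# Numeric comparison for reference.
numeric_order=$(printf '%s\n' "$padded" | LC_ALL=C sort -n)

printf '%s\n' "$byte_order"
[ "$byte_order" = "$numeric_order" ] && echo "same order"
```

Right-aligned padding means shorter numbers carry more leading spaces, which is why a plain character sort lands on the numeric order here.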

In terms of episode 4, this all means that the call-out behaves as expected, but asking learners to compare the output of:

 wc -l *.pdb |sort

and

 wc -l *.pdb |sort -n 

(as I did in the Carpentry course on Monday) is system dependent because of the leading spaces in the output of wc.
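One way to check that the leading spaces are responsible is to strip them before sorting. This is only a sketch for investigation, not a proposed change to the lesson. It uses three lines from the thread as a stand-in for the wc output, and the sed step and variable names are assumptions of the example.

```shell
# Stand-in for `wc -l *.pdb` output, using three lines from the thread.
counts=$(printf '%4d %s\n' 107 total 12 ethane.pdb 9 methane.pdb)

# Strip the leading spaces so that every line starts with a digit.
stripped=$(printf '%s\n' "$counts" | sed 's/^ *//')

# Character sort: "107" and "12" come before "9".
printf '%s\n' "$stripped" | sort

# Numeric sort: 9, 12, 107.
printf '%s\n' "$stripped" | sort -n
```

With the padding gone, sort and sort -n should give visibly different orders, as in the callout example.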
