Hard to look for Halloween #1493

Open
jidanni opened this issue Nov 5, 2021 · 17 comments

Comments

@jidanni

jidanni commented Nov 5, 2021

Hard to look for Halloween. Look is scared to look for it.
Man page says

       -f, --ignore-case
           Ignore the case of alphabetic characters. This is on by default if
           no file is specified.

Well all I know is

$ look Halloween
$ look -f Halloween
$ look Halloween /usr/share/dict/words
Halloween
Halloween's
$ look -f Halloween /usr/share/dict/words
$ look --version
look from util-linux 2.37.2
@jidanni
Author

jidanni commented Nov 5, 2021

So I have proven, -f or not, some other factor not mentioned on the man page is at work behind the scenes/screens.

@karelzak
Collaborator

$ look Halloween
Hallowe'en
Halloween
hallow-e'en
halloween
halloweens

$ look -f Halloween
Hallowe'en
Halloween
hallow-e'en
halloween
halloweens

$ look Halloween /usr/share/dict/words
$  look -f Halloween /usr/share/dict/words
halloween
halloweens

$  look --version
look from util-linux 2.37.2

What distro? And how was the words file composed?

For example, on Fedora we use data from the Moby Project (now available only on archive.org), and it is sorted by:

cd mwords
dos2unix -o *; chmod a+r *
cat [1-9]*.??? | egrep --invert-match "'s$" | egrep  "^[[:alnum:]'&!,./-]+$" | sort --ignore-case --dictionary-order | uniq > moby

This is copied from the Fedora words spec file; the important part is "--ignore-case --dictionary-order".

@karelzak
Collaborator

It seems that "look Halloween /usr/share/dict/words", which returns nothing in my example, is a bug.

@jidanni
Author

jidanni commented Nov 16, 2021

What distro?

Debian.

And how was the words file composed?

You will need to post a script I can run that will print that information.
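
A sketch of what such a script could look like, assuming GNU sort: it uses sort -c to report which of the -d, -f and -df orderings a word list already satisfies. The function name report_sort_order is hypothetical, and LC_ALL=C is pinned only as an assumption to make the result reproducible; remove it to test under your own locale.

```shell
#!/bin/sh
# Hypothetical helper (not part of util-linux): report which sort(1)
# orderings a word list already satisfies.
report_sort_order() {
    file=$1
    for opts in -d -f -df; do
        # sort -c exits 0 only if the input is already in order for these options.
        # LC_ALL=C is an assumption made here for reproducible results.
        if LC_ALL=C sort -c "$opts" "$file" 2>/dev/null; then
            echo "sort $opts: sorted"
        else
            echo "sort $opts: NOT sorted"
        fi
    done
}

# Report on the system word list, if there is one.
if [ -r /usr/share/dict/words ]; then
    report_sort_order /usr/share/dict/words
fi
```

A file that reports "sorted" for -d but "NOT sorted" for -df was built without case folding.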

@uablrek

uablrek commented May 3, 2022

Ref. https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/1971425

It works with an older version of "look" with the same database. This indicates that the "words" file is OK.

@Inkbottle007

look ancillary returns nothing.
/usr/share/dict/words is https://packages.debian.org/sid/wamerican.
There's an open Debian bug: https://bugs.debian.org/973471.
The following one-liner fails a lot:

look $(shuf /usr/share/dict/words|head -n1)

You can try this one:

for A in a b c d e f g h i j k l m n o p q r s t u v w x y z
    do (for i in $(seq 1 100); do
        look $(grep "^$A" /usr/share/dict/words|shuf|head -n1)|head -n1
    done)|wc -l
done
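
The sampling loop above can also be written to count outright misses for a given file. This is only a sketch: count_misses is a hypothetical name, it assumes look, shuf and grep are installed, and it passes -df explicitly on the assumption that this mirrors what look does by default when no file is given.

```shell
#!/bin/sh
# Hypothetical helper: sample N words from FILE and count how many
# look(1) fails to return when called with -df on that same file.
count_misses() {
    file=$1
    n=$2
    shuf -n "$n" "$file" | {
        misses=0
        while IFS= read -r word; do
            # A hit means look printed the exact word as one of its output lines.
            if ! look -df "$word" "$file" | grep -qxF -- "$word"; then
                misses=$((misses + 1))
            fi
        done
        echo "$misses"
    }
}

# Try 100 random words from the system word list, if both are available.
if command -v look >/dev/null 2>&1 && [ -r /usr/share/dict/words ]; then
    count_misses /usr/share/dict/words 100
fi
```

A result of 0 means every sampled word was found; anything larger points at a mismatch between how the file is sorted and how look searches it.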

@karelzak
Collaborator

OK, I have tried the "american-english" dictionary from the Debian package. You're right that look ancillary returns nothing.

The way to fix it is to sort the dictionary properly: run sort --dictionary-order on the dictionary file, and then it works as expected.

@karelzak
Collaborator

# cd /usr/share/dict
# ls -l
total 10652
-rw-r--r-- 1 kzak kzak  985084 Jan 20 06:16 american-english
-rw-r--r-- 1 root root 4953598 Jan 22 05:48 linux.words
lrwxrwxrwx 1 root root      16 May 27 09:29 words -> american-english

# look ancillary

# sort --dictionary-order < american-english > american-english.fixed
# ln -fs american-english.fixed words

# look ancillary
ancillary
ancillary's

@karelzak
Collaborator

Note, see also #284

@Inkbottle007

It seems my dictionary is already "dictionary-order" sorted:
sort -dc /usr/share/dict/american-english shows no error.
I did what you said but look ancillary is still refusing to work.

# aptitude show wamerican bsdextrautils coreutils | grep '^Package\|^Version'
Package: wamerican
Version: 2020.12.07-2
Package: bsdextrautils
Version: 2.38-4
Package: coreutils
Version: 8.32-4.1
# cd /usr/share/dict
# sort --dictionary-order < american-english > american-english.fixed
# ln -fs american-english.fixed words
# chmod u+r american-english.fixed
# diff american-english*
# sort -dc american-english
# md5sum american-english*
16de2454dee65e9ceed77f9c1cd8a15e  american-english
16de2454dee65e9ceed77f9c1cd8a15e  american-english.fixed

@Inkbottle007

I'm not sure symbolic links play well with relative paths on my Debian system. Anyway, I've put things back the way they were with:

# ln -fs /usr/share/dict/american-english /etc/dictionaries-common/words
# ln -fs /etc/dictionaries-common/words /usr/share/dict/words
# aptitude reinstall wamerican
$ md5sum /usr/share/dict/american-english 
16de2454dee65e9ceed77f9c1cd8a15e  /usr/share/dict/american-english
$ md5sum <(sort --dictionary-order /usr/share/dict/american-english)
16de2454dee65e9ceed77f9c1cd8a15e  /dev/fd/63
$ look anc
Anchorage
Anchorage's

Only those two suggestions with look anc.

@uablrek

uablrek commented May 28, 2022

It works with an older version of "look" with the same database. This indicates that the "words" file is OK.

At least on Ubuntu 22.04, it works when I take the "look" program from Ubuntu 20.04. So it is not only the dictionary. Maybe it's the combination.

@Inkbottle007

I haven't found a more minimal example.

$ grep -i "[^a-z']" /usr/share/dict/american-english > dict_02

$ look -df co dict_02
Concepción
Concepción's

$ grep -i "^co" dict_02
Concepción
Concepción's
confrère
confrère's
confrères
consommé
consommé's
cortège
cortège's
cortèges

$ md5sum /usr/share/dict/american-english 
16de2454dee65e9ceed77f9c1cd8a15e  /usr/share/dict/american-english

$ md5sum dict_02
dfc0d6e69caf71225c1b1e3622deb904  dict_02

$ cat dict_02 | tr "\n" " " | fmt
Asunción Asunción's Atatürk Atatürk's Bartók Bartók's Bogotá
Bogotá's Boötes Boötes's Buñuel Buñuel's Concepción Concepción's
Dürer Dürer's Düsseldorf Düsseldorf's Dvorák Dvorák's Elysée
Elysée's Esterházy Esterházy's Fabergé Fabergé's Furtwängler
Furtwängler's Gödel Gödel's Gewürztraminer Gewürztraminer's
Grünewald Grünewald's Gruyère Gruyère's Göteborg Göteborg's Héloise
Héloise's Köln Köln's Lumière Lumière's Mallarmé Mallarmé's
Münchhausen Münchhausen's Paraná Paraná's Poincaré Poincaré's
Pokémon Pokémon's Provençal Provençal's Pôrto Pôrto's Pétain
Pétain's Québecois Québecois's Ragnarök Ragnarök's Schrödinger
Schrödinger's Sèvres Sèvres's Tannhäuser Tannhäuser's Thessaloníki
Thessaloníki's Valéry Valéry's Velásquez Velásquez's Velázquez
Velázquez's Zürich Zürich's abbé abbé's abbés adiós appliqué
appliquéd appliquéing appliqué's appliqués attaché attaché's
attachés blasé boutonnière boutonnière's boutonnières café
café's cafés canapé canapé's canapés château château's châteaux
châtelaine châtelaine's châtelaines éclair éclair's éclairs éclat
éclat's cliché clichéd cliché's clichés clientèle clientèle's
clientèles confrère confrère's confrères consommé consommé's
cortège cortège's cortèges crèche crèche's crèches croûton
croûton's croûtons crudités crudités's débutante débutante's
débutantes décolleté derrière derrière's derrières divorcée
divorcée's divorcées dérailleur dérailleur's dérailleurs détente
détente's entrée entrée's entrées fiancé fiancée fiancée's
fiancées fiancé's fiancés flambé flambéed flambé's frappé frappé's
fête fête's fêtes habitué habitué's habitués ingénue ingénue's
ingénues jalapeño jalapeño's jalapeños jardinière jardinière's
jardinières kindergärtner kindergärtner's kindergärtners króna
króna's krónur élan élan's macramé macramé's manège manège's
manqué matinée matinée's matinées matériel matériel's émigré
émigré's émigrés mêlée mêlée's mêlées métier métier's
métiers naiveté naiveté's née Ångström Ångström's outré
passé épée épée's épées précis précised précising précis's
protégé protégé's protégés recherché risqué roué roué's roués
séance séance's séances sauté sautéed sautéing sauté's sautés
smörgåsbord smörgåsbord's smörgåsbords soirée soirée's soirées
soufflé soufflé's soufflés soupçon soupçon's soupçons touché
étude étude's études vicuña vicuña's vicuñas

Now you can even do:

$ echo Halloween >> dict_02
$ sort -d dict_02 > dict_03
$ look -df Halloween dict_03
$ grep Halloween dict_03
Halloween
$ look -d Halloween dict_03
Halloween

@Inkbottle007

Here is a very small minimal example. It was generated with a line such as:

(grep -i "[^a-z']" /usr/share/dict/american-english | shuf | head -n20; echo Halloween) | sort -d > dict_02

Many such examples display the "unpredictable behavior" thing.

cat dict_03

Output

Concepción
Halloween
Héloise's
Provençal
Sèvres
appliqué
appliquéing
consommé
cortège's

sort -dc dict_03

no output

look -d Halloween dict_03

output

Halloween

look -df Halloween dict_03

no output

head -n5 dict_03 > dict_04
look -df Halloween dict_04

output

Halloween

cat dict_03 | python3 -c "import sys,urllib.parse;[sys.stdout.write(urllib.parse.quote_plus(line.rstrip('\n'))+'\n') for line in sys.stdin]"

output

Concepci%C3%B3n
Halloween
H%C3%A9loise%27s
Proven%C3%A7al
S%C3%A8vres
appliqu%C3%A9
appliqu%C3%A9ing
consomm%C3%A9
cort%C3%A8ge%27s

md5sum dict_03 <(cat dict_03 | python3 -c "import sys,urllib.parse;[sys.stdout.write(urllib.parse.quote_plus(line.rstrip('\n'))+'\n') for line in sys.stdin]" | python3 -c "import sys,urllib.parse;[sys.stdout.write(urllib.parse.unquote(line.rstrip('\n'))+'\n') for line in sys.stdin]")

output

09afe24f8cb91d0c3adebaa06275cc80  dict_03
09afe24f8cb91d0c3adebaa06275cc80  /dev/fd/63

@Inkbottle007

Minimal examples are getting really small:

(echo Benchley; echo benched) > dict_20
sort -dc dict_20
look -df benched dict_20

@Inkbottle007

# cd /usr/share/dict
# ls -l
total 10652
-rw-r--r-- 1 kzak kzak  985084 Jan 20 06:16 american-english
-rw-r--r-- 1 root root 4953598 Jan 22 05:48 linux.words
lrwxrwxrwx 1 root root      16 May 27 09:29 words -> american-english

# look ancillary

# sort --dictionary-order < american-english > american-english.fixed
# ln -fs american-english.fixed words

# look ancillary
ancillary
ancillary's

Hi, you must have been using a different version of look, or your american-english dictionary was initially sorted differently from mine. In any case, I couldn't reproduce your example.

For it to work, one additional constraint must be met, which I overlooked earlier in Debian bug #973471. Specifically, man look says: "the lines in file must be sorted (where sort(1) was given the same options -d and/or -f that look is invoked with)."

So the -f aka --ignore-case option must be added to the sort constraints.

So with the minimal example:

Not working:

(echo Benchley; echo benched) > dict_20
sort -dc dict_20
look -df benched dict_20

Working:

(echo Benchley; echo benched) > dict_20
sort -df dict_20 > dict_20_2
look -df benched dict_20_2
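
Why the first file fails can be seen by approximating the key that a -df comparison uses: keep only alphanumerics and blanks, then fold case. This is a sketch built on tr, not look's actual code; compare_key is a hypothetical name, and LC_ALL=C is an assumption made for reproducible byte-wise behavior.

```shell
#!/bin/sh
# Approximate the comparison key for -df: keep only alphanumerics and
# blanks (-d), then fold lowercase to uppercase (-f).
compare_key() {
    LC_ALL=C tr -cd '[:alnum:][:blank:]\n' | LC_ALL=C tr '[:lower:]' '[:upper:]'
}

# Keys for the two lines of dict_20, in file order.
printf 'Benchley\nbenched\n' | compare_key

# Check whether those keys are themselves in sorted order.
if printf 'Benchley\nbenched\n' | compare_key | LC_ALL=C sort -c 2>/dev/null; then
    echo "keys in order"
else
    echo "keys out of order"
fi
```

This prints BENCHLEY before BENCHED and then "keys out of order": the file is sorted when case matters (B before b) but not once case is folded, so a binary search that folds case can skip past the entry.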

So, to fix my Debian system, I ran the following as root:

cd /usr/share/dict
sort -df american-english > american-english.fixed
ln -sf /usr/share/dict/american-english.fixed /etc/dictionaries-common/words
chmod o+r american-english.fixed
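
A more defensive variant of the same steps, as a sketch only: rebuild_wordlist is a hypothetical helper that writes the -df sorted copy to a temporary file, verifies it with sort -dfc, checks that no lines were lost, and only then moves it into place. It assumes GNU sort and mktemp, and it leaves the locale at the caller's default, which can also affect the resulting order.

```shell
#!/bin/sh
# Hypothetical helper: write a copy of SRC sorted with -df to DEST, and
# refuse to keep it unless it verifies and has the same line count.
rebuild_wordlist() {
    src=$1
    dest=$2
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    if sort -df "$src" > "$tmp" &&
       sort -dfc "$tmp" &&
       [ "$(wc -l < "$src")" -eq "$(wc -l < "$tmp")" ]; then
        # mktemp creates the file readable by its owner only, so open it up.
        chmod 644 "$tmp"
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (as root): rebuild_wordlist american-english american-english.fixed
```

The symlink step stays manual here, so nothing under /etc or /usr/share/dict is touched until the new file has been checked.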

Now:

$ look ancillary
ancillary
ancillary's

$ look halloween
Hallowe'en
Halloween
Halloween's
Halloweens

$ look Halloween
Hallowe'en
Halloween
Halloween's
Halloweens

$ look accident
accident
accidental
accidentally
accidental's
accidentals
accident's
accidents

So, as far as I'm concerned, it is fixed: there was no bug on the util-linux side to begin with.

@uablrek

uablrek commented Mar 18, 2023

The above fixes the problem, which still exists on the latest Ubuntu 22.04.2 LTS (which is kind of lame).
