Broken link on :h thesaurus #629

DionisiusMayr · 2016-02-09T03:46:17Z

It seems that there is a broken link in the thesaurus documentation:

"To obtain a file to be used here, check out this ftp site: ftp://ftp.ox.ac.uk/pub/wordlists/ ..."

dcslagel · 2018-11-08T21:07:19Z

After manually following the link suggested in #3583, it doesn't look like a comma separated thesaurus file is there.

The following link has the Moby Thesaurus (public-domain) and might be more reliable going forward.
http://www.gutenberg.org/files/3202/files/mthesaur.txt
Format note: each line of the file holds one set of similar words separated by commas, not by spaces as the current documentation prefers.

Additional info:
https://en.wikipedia.org/wiki/Moby_Project#Thesaurus
http://www.gutenberg.org/catalog/world/results?title=moby+list
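
To make the format note above concrete, here is a minimal Python sketch (not part of the original report) that turns one comma-separated Moby line into a space-separated line. The function name `moby_line_to_vim` is hypothetical, and dropping every entry that contains a space is an assumption about how one might cope with multi-word entries; neither Vim nor the Moby file prescribes it. The sample line is an entry from mthesaur.txt that is quoted in this thread.

```python
from typing import Optional


def moby_line_to_vim(line: str) -> Optional[str]:
    """Convert one comma-separated line to a space-separated line.

    Entries containing a space are dropped (an assumption), because a
    space would otherwise be read as a separator between words.
    Returns None when fewer than two words survive.
    """
    words = [w.strip() for w in line.strip().split(",")]
    kept = [w for w in words if w and " " not in w]
    if len(kept) < 2:
        return None
    return " ".join(kept)


sample = "table,Domesday Book,account,account book,address book,adjourn,"
print(moby_line_to_vim(sample))  # prints: table account adjourn
```

Note that this only changes the separators; it does nothing about how loosely related the words on a line are.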

brammool · 2018-11-08T22:07:59Z

> After manually following the link suggested in #3583, it doesn't look like a comma separated thesaurus file is there.

Looks like it's one word per line, thus that won't work as a thesaurus.

> The following link has the Moby Thesaurus (public-domain) and might be more reliable going forward.
> http://www.gutenberg.org/files/3202/files/mthesaur.txt

Hmm, does that actually work? I found this random entry:

table,Domesday Book,account,account book,address book,adjourn,

A table is a "Domesday Book"? Also, it uses comma-separated words, and includes spaces. Vim doesn't appear to handle that.


-- A computer programmer is a device for turning requirements into undocumented features. It runs on cola, pizza and Dilbert cartoons. Bram Moolenaar /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\ /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\ an exciting new programming language -- http://www.Zimbu.org /// \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

dcslagel · 2018-11-08T22:48:33Z

Yeah, 'table,Domesday Book,account,account book,address book,adjourn,' is a very broad association. It looks like Domesday Book is an accounting survey of land and a table can be a table of accounts.

That does seem like too far a stretch in meaning to be useful, so this isn't a good thesaurus source.

dcslagel · 2018-11-09T19:41:46Z

After searching around for quite a bit more, there doesn't seem to be a current, openly licensed thesaurus in the space-separated 'key-word alt1-word alt2-word...' form (at least in English). So it probably makes sense to just remove the text suggesting a thesaurus file to download.

dcslagel · 2018-11-09T20:13:35Z

Just some notes that might be of interest to anyone looking at this ticket but not directly related to the solution.

WordNet seems to be the main source for an English thesaurus:
https://wordnet.princeton.edu

OpenOffice maintains a structured text version of the WordNet data here:
https://www.openoffice.org/lingucomponent/thesaurus.html
The main download file is here:
https://www.openoffice.org/lingucomponent/MyThes-1.zip

brammool · 2018-11-09T21:28:55Z

> Just some notes that might be of interest to anyone looking at this ticket but not directly related to the solution.
>
> WordNet seems to be the main source for an English thesaurus:
> https://wordnet.princeton.edu
>
> OpenOffice maintains a structured text version of the WordNet data here:
> https://www.openoffice.org/lingucomponent/thesaurus.html
> The main download file is here:
> https://www.openoffice.org/lingucomponent/MyThes-1.zip

If the data exists but is in the wrong format, perhaps someone can write a script to turn it into the right format. We could then include the script with Vim and/or make the output available on the ftp site.


dcslagel · 2018-11-30T18:21:24Z

Thesaurus Zip Attached:
thesaurus_pkg.zip

The attached thesaurus_pkg.zip contains a thesaurus.txt in the Vim space-separated
format, derived from the WordNet/MyThes-1 sources.

Can the thesaurus.txt be made available on the ftp site?

The thesaurus_pkg.zip contains these files:

* A patch file to modify MyThes-1.0 and generate the thesaurus.txt file:
  thesaurus_pkg/thesaurus-for-vim.patch
* A thesaurus file in the Vim format:
  thesaurus_pkg/thesaurus.txt
* The WordNet and MyThes-1 licenses:
  thesaurus_pkg/WordNet_license.txt
  thesaurus_pkg/license.readme

Notes:

* All distinct terms made of multiple words were removed, because the space between the words conflicts with the default term delimiter, which is a space.
* All terms that didn't have any synonyms were removed.
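
The two rules in the notes can be sketched in a few lines of Python. This is an illustration only, not the code used to build the attached thesaurus.txt. The function name `mythes_to_vim` is hypothetical, the sample data is invented, and the input layout is an assumption: an encoding line, then for each term a `word|N` line followed by N meaning lines of the form `(pos)|syn1|syn2|...`.

```python
from typing import Iterable, Iterator


def mythes_to_vim(lines: Iterable[str]) -> Iterator[str]:
    """Yield space-separated thesaurus lines from MyThes-style lines.

    Assumed layout: one encoding line, then per term a "word|N" line
    followed by N meaning lines "(pos)|syn1|syn2|...".
    """
    it = iter(lines)
    next(it, None)  # skip the encoding line, e.g. "ISO8859-1"
    for header in it:
        if not header.strip():
            continue
        word, _, count = header.rstrip("\n").partition("|")
        synonyms = []
        for _ in range(int(count)):
            fields = next(it).rstrip("\n").split("|")
            # fields[0] is the part-of-speech tag; the rest are synonyms.
            # Rule 1: skip synonyms that contain a space.
            synonyms.extend(s for s in fields[1:] if s and " " not in s)
        # Rule 1 also applies to the head word itself; rule 2 drops
        # terms that are left without any synonym.
        if " " in word or not synonyms:
            continue
        yield " ".join([word, *synonyms])


# Invented sample data in the assumed layout.
sample = [
    "ISO8859-1",
    "simple|2",
    "(adj)|plain|easy",
    "(noun)|herb|medicinal plant",
    "ice cream|1",
    "(noun)|dessert",
    "zyzzyva|1",
    "(noun)|tropical weevil",
]
print(list(mythes_to_vim(sample)))  # prints: ['simple plain easy herb']
```

Here "ice cream" is dropped by the first rule, and "zyzzyva" by the second, since its only synonym contains a space.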

Patch File:
Here is the patch file thesaurus_pkg/thesaurus-for-vim.patch for MyThes-1:

diff -ruN MyThes-1.0/Makefile MyThes-1.0-vim/Makefile
--- MyThes-1.0/Makefile	2003-12-08 14:42:33.000000000 -0700
+++ MyThes-1.0-vim/Makefile	2018-11-30 07:48:21.000000000 -0700
@@ -21,7 +21,7 @@
 	-@ ($(RANLIB) $@ || true) >/dev/null 2>&1
 
 example: example.o $(LIBS)
-	$(CXX) $(CXXFLAGS) -o $@ example.o $(LDFLAGS)
+	$(CXX) -o $@ example.o $(LDFLAGS)
 
 %.o: %.cxx 
 	$(CXX) $(CXXFLAGS) -c $<
diff -ruN MyThes-1.0/README-VIM-THESAURUS MyThes-1.0-vim/README-VIM-THESAURUS
--- MyThes-1.0/README-VIM-THESAURUS	1969-12-31 17:00:00.000000000 -0700
+++ MyThes-1.0-vim/README-VIM-THESAURUS	2018-11-30 10:04:10.000000000 -0700
@@ -0,0 +1,21 @@
+To create a thesaurus file formatted for vim's thesaurus run:
+bash ./mk-vim-thesaurus.sh
+
+The file 'thesaurus.txt' will be created.
+
+Here are the steps that mk-vim-thesaurus.sh takes:
+
+1. Extract term list from MyThes-1.0 th_en_US_new.dat file:
+# Note: This will remove complex words with spaces in them because space is the
+# default delimiter for vim's thesaurus format.
+grep -v "^(" th_en_US_new.dat | awk -F"|" '{print $1}' | grep -v ' ' | grep -v 'ISO8859-1' > words-without-spaces.lst
+
+2. make example
+make
+
+3. Run mk_vim_thesaurus_format:
+# Note: While extracting synonyms, multiple word synonyms with spaces are excluded.
+./example th_en_US_new.idx th_en_US_new.dat words-without-spaces.lst  > raw-list
+
+4. Remove entries that don't have synonyms:
+grep -v "^\w\+$" raw-list > thesaurus.txt 
diff -ruN MyThes-1.0/example.cxx MyThes-1.0-vim/example.cxx
--- MyThes-1.0/example.cxx	2003-12-08 14:37:13.000000000 -0700
+++ MyThes-1.0-vim/example.cxx	2018-11-30 09:00:44.000000000 -0700
@@ -70,16 +70,20 @@
       // or count since needed for CleanUpAfterLookup routine
       mentry* pm = pmean;
       if (count) {
-        fprintf(stdout,"%s has %d meanings\n",buf,count);
-	for (int  i=0; i < count; i++) {
-          fprintf(stdout,"   meaning %d: %s\n",i,pm->defn);
+        // initial word
+        fprintf(stdout,"%s",buf);
+        for (int  i=0; i < count; i++) {
           for (int j=0; j < pm->count; j++) {
-	    fprintf(stdout,"       %s\n",pm->psyns[j]);
+            // only output the word if it doesn't have spaces
+            // because space is the standard delimiter in the 
+            // vim thesaurus file format.
+            if (strchr(pm->psyns[j], ' ') == NULL)  {
+              fprintf(stdout," %s",pm->psyns[j]);
+            }
           }
-          fprintf(stdout,"\n");
           pm++;
-	}
-        fprintf(stdout,"\n\n");
+        }
+        fprintf(stdout,"\n");
         // now clean up all allocated memory 
         pMT->CleanUpAfterLookup(&pmean,count);
       } else {
diff -ruN MyThes-1.0/mk-vim-thesaurus.sh MyThes-1.0-vim/mk-vim-thesaurus.sh
--- MyThes-1.0/mk-vim-thesaurus.sh	1969-12-31 17:00:00.000000000 -0700
+++ MyThes-1.0-vim/mk-vim-thesaurus.sh	2018-11-30 08:51:15.000000000 -0700
@@ -0,0 +1,13 @@
+
+# Extract term list from MyThes-1.0 th_en_US_new.dat file:
+grep -v "^(" th_en_US_new.dat | awk -F"|" '{print $1}' | grep -v ' ' | grep -v 'ISO8859-1' > words-without-spaces.lst
+
+# make example
+make
+
+#  Run mk_vim_thesaurus_format:
+./example th_en_US_new.idx th_en_US_new.dat words-without-spaces.lst  > raw-list
+
+#  Remove entries that don't have synonyms:
+grep -v '^\w\+$' raw-list > thesaurus.txt 
+
diff -ruN MyThes-1.0/mythes.cxx MyThes-1.0-vim/mythes.cxx
--- MyThes-1.0/mythes.cxx	2003-12-08 14:40:27.000000000 -0700
+++ MyThes-1.0-vim/mythes.cxx	2018-11-28 13:12:29.000000000 -0700
@@ -25,7 +25,7 @@
 // return index of char in string
 int mystr_indexOfChar(const char * d, int c)
 {
-  char * p = strchr(d,c);
+  const char * p = strchr(d,c);
   if (p) return (int)(p-d);
   return -1;
 }

brammool · 2018-11-30T20:13:17Z

I can at least mention this comment in the help; unpacking the .zip file isn't too difficult.

dcslagel · 2018-12-01T13:41:00Z

Thanks. Is there anything else that needs to be done to complete this issue?

brammool · 2018-12-01T20:05:40Z

> Thanks. Is there anything else that needs to be done to complete this issue?

Well, this only provides one English thesaurus. I don't know if this is even a good one. And there are many other languages...


Since German and English are mostly distinct and share only a few words, we're taking the easy approach and adding both thesaurus files at the same time. Vim will query both and display the accumulated matches. Taken from:
* vim/vim#629 (comment)
* https://github.com/Yamagi/vim-german-thesaurus

k-takata added the documentation label Oct 12, 2017

localstatic mentioned this issue Nov 1, 2018

Update thesaurus wordlist URL #3583

Closed

ghost mentioned this issue Jul 13, 2019

Thesaurus completion matches more than one line in thesaurus file #4667

Closed
