Broken link on :h thesaurus #629

Open
DionisiusMayr opened this issue Feb 9, 2016 · 10 comments

@DionisiusMayr

It seems that there is a broken link on the thesaurus documentation:

"To obtain a file to be used here, check out this ftp site: ftp://ftp.ox.ac.uk/pub/wordlists/ ..."

@dcslagel

dcslagel commented Nov 8, 2018

After manually following the link suggested in #3583, it doesn't look like a comma separated thesaurus file is there.

The following link has the Moby Thesaurus (public-domain) and might be more reliable going forward.
http://www.gutenberg.org/files/3202/files/mthesaur.txt
Format note: each line of the file holds a set of similar words separated by commas, not by spaces as the current documentation prefers.

Additional info:
https://en.wikipedia.org/wiki/Moby_Project#Thesaurus
http://www.gutenberg.org/catalog/world/results?title=moby+list
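
For anyone who wants to try the comma-separated file anyway, here is a minimal sketch of converting one line to the space-separated form. The function name `comma_line_to_vim_line` is made up for illustration, and dropping multi-word terms is an assumption of this sketch (a space inside a term would otherwise read as a separator), not something the documentation prescribes. The sample line is an excerpt quoted from the Moby file in this thread.

```python
def comma_line_to_vim_line(line: str) -> str:
    """Turn one comma-separated thesaurus line into a space-separated one.

    Assumption of this sketch: terms containing whitespace are dropped,
    because a space would be read as a separator between two terms.
    """
    terms = [term.strip() for term in line.split(",")]
    # Keep only non-empty terms without any internal whitespace.
    kept = [t for t in terms if t and not any(ch.isspace() for ch in t)]
    return " ".join(kept)


sample = "table,Domesday Book,account,account book,address book,adjourn,"
print(comma_line_to_vim_line(sample))  # prints: table account adjourn
```

Note that this throws away a large share of the entries, since many of the terms in that file are phrases rather than single words.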

@brammool
Contributor

brammool commented Nov 8, 2018 via email

@dcslagel

dcslagel commented Nov 8, 2018

Yeah, 'table,Domesday Book,account,account book,address book,adjourn,' is a very broad association. It looks like Domesday Book is an accounting survey of land and a table can be a table of accounts.

That does seem like too far a stretch in meaning to be useful, so this isn't a good thesaurus source.

@dcslagel

dcslagel commented Nov 9, 2018

After searching around quite a bit more, there doesn't seem to be a current, openly licensed thesaurus in the space-separated 'key-word alt1-word alt2-word...' form (at least in English). So it probably makes sense to just remove the text suggesting a thesaurus file to download.

@dcslagel

dcslagel commented Nov 9, 2018

Just some notes that might be of interest to anyone looking at this ticket but are not directly related to the solution.

WordNet seems to be the main source for an English thesaurus:
https://wordnet.princeton.edu

OpenOffice maintains a structured text version of the WordNet data here:
https://www.openoffice.org/lingucomponent/thesaurus.html
The main download file is here:
https://www.openoffice.org/lingucomponent/MyThes-1.zip

@brammool
Contributor

brammool commented Nov 9, 2018 via email

@dcslagel

Thesaurus Zip Attached:
thesaurus_pkg.zip

The attached thesaurus_pkg.zip contains a thesaurus.txt in the vim space-separated
format derived from the Wordnet/MyThes-1 sources.

Can the thesaurus.txt be made available on the FTP site?

The thesaurus_pkg.zip contains these files:

  1. A patch file to modify MyThes-1.0 and generate the thesaurus.txt file
    thesaurus_pkg/thesaurus-for-vim.patch

  2. A thesaurus file in the vim format
    thesaurus_pkg/thesaurus.txt

  3. The WordNet and MyThes-1 licences
    thesaurus_pkg/WordNet_license.txt
    thesaurus_pkg/license.readme

Notes:

  1. All distinct terms made of multiple words were removed because the space
    between the words conflicts with the default term delimiter which is a space.

  2. All terms that didn't have any synonyms were removed.
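
The two notes above can be sketched roughly like this. It is only an illustration of the filtering rules, not the code in the package: `clean_entries` is a hypothetical helper, and the sample entries are made up.

```python
def clean_entries(entries):
    """Apply the two filtering rules to (headword, synonyms) pairs.

    Returns thesaurus lines in the 'key-word alt1-word alt2-word...' form.
    """
    lines = []
    for head, synonyms in entries:
        # Note 1: skip headwords made of multiple words.
        if " " in head:
            continue
        # Note 1 again: drop multi-word synonyms.
        single_words = [s for s in synonyms if " " not in s]
        # Note 2: skip terms that are left without any synonym.
        if not single_words:
            continue
        lines.append(" ".join([head, *single_words]))
    return lines


# Made-up sample data, for illustration only.
sample = [
    ("table", ["desk", "coffee table"]),
    ("ice cream", ["gelato"]),
    ("xyzzy", []),
]
print(clean_entries(sample))  # prints: ['table desk']
```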

Patch File:
Here is the patch file thesaurus_pkg/thesaurus-for-vim.patch for MyThes-1:

diff -ruN MyThes-1.0/Makefile MyThes-1.0-vim/Makefile
--- MyThes-1.0/Makefile	2003-12-08 14:42:33.000000000 -0700
+++ MyThes-1.0-vim/Makefile	2018-11-30 07:48:21.000000000 -0700
@@ -21,7 +21,7 @@
 	-@ ($(RANLIB) $@ || true) >/dev/null 2>&1
 
 example: example.o $(LIBS)
-	$(CXX) $(CXXFLAGS) -o $@ example.o $(LDFLAGS)
+	$(CXX) -o $@ example.o $(LDFLAGS)
 
 %.o: %.cxx 
 	$(CXX) $(CXXFLAGS) -c $<
diff -ruN MyThes-1.0/README-VIM-THESAURUS MyThes-1.0-vim/README-VIM-THESAURUS
--- MyThes-1.0/README-VIM-THESAURUS	1969-12-31 17:00:00.000000000 -0700
+++ MyThes-1.0-vim/README-VIM-THESAURUS	2018-11-30 10:04:10.000000000 -0700
@@ -0,0 +1,21 @@
+To create a thesaurus file formatted for vim's thesaurus run:
+bash ./mk-vim-thesaurus.sh
+
+The file 'thesaurus.txt' will be created.
+
+Here are the steps that mk-vim-thesaurus.sh takes:
+
+1. Extract term list from MyThes-1.0 th_en_US_new.dat file:
+# Note: This will remove complex words with spaces in them because space is the
+# default delimiter for vim's thesaurus format.
+grep -v "^(" th_en_US_new.dat | awk -F"|" '{print $1}' | grep -v ' ' | grep -v 'ISO8859-1' > words-without-spaces.lst
+
+2. make example
+make
+
+3. Run mk_vim_thesaurus_format:
+# Note: While extracting synonyms, multiple word synonyms with spaces are excluded.
+./example th_en_US_new.idx th_en_US_new.dat words-without-spaces.lst  > raw-list
+
+4. Remove entries that don't have synonyms:
+grep -v "^\w\+$" raw-list > thesaurus.txt 
diff -ruN MyThes-1.0/example.cxx MyThes-1.0-vim/example.cxx
--- MyThes-1.0/example.cxx	2003-12-08 14:37:13.000000000 -0700
+++ MyThes-1.0-vim/example.cxx	2018-11-30 09:00:44.000000000 -0700
@@ -70,16 +70,20 @@
       // or count since needed for CleanUpAfterLookup routine
       mentry* pm = pmean;
       if (count) {
-        fprintf(stdout,"%s has %d meanings\n",buf,count);
-	for (int  i=0; i < count; i++) {
-          fprintf(stdout,"   meaning %d: %s\n",i,pm->defn);
+        // initial word
+        fprintf(stdout,"%s",buf);
+        for (int  i=0; i < count; i++) {
           for (int j=0; j < pm->count; j++) {
-	    fprintf(stdout,"       %s\n",pm->psyns[j]);
+            // only output the word if it doesn't have spaces
+            // because space is the standard delimiter in the 
+            // vim thesaurus file format.
+            if (strchr(pm->psyns[j], ' ') == NULL)  {
+              fprintf(stdout," %s",pm->psyns[j]);
+            }
           }
-          fprintf(stdout,"\n");
           pm++;
-	}
-        fprintf(stdout,"\n\n");
+        }
+        fprintf(stdout,"\n");
         // now clean up all allocated memory 
         pMT->CleanUpAfterLookup(&pmean,count);
       } else {
diff -ruN MyThes-1.0/mk-vim-thesaurus.sh MyThes-1.0-vim/mk-vim-thesaurus.sh
--- MyThes-1.0/mk-vim-thesaurus.sh	1969-12-31 17:00:00.000000000 -0700
+++ MyThes-1.0-vim/mk-vim-thesaurus.sh	2018-11-30 08:51:15.000000000 -0700
@@ -0,0 +1,13 @@
+
+# Extract term list from MyThes-1.0 th_en_US_new.dat file:
+grep -v "^(" th_en_US_new.dat | awk -F"|" '{print $1}' | grep -v ' ' | grep -v 'ISO8859-1' > words-without-spaces.lst
+
+# make example
+make
+
+#  Run mk_vim_thesaurus_format:
+./example th_en_US_new.idx th_en_US_new.dat words-without-spaces.lst  > raw-list
+
+#  Remove entries that don't have synonyms:
+grep -v '^\w\+$' raw-list > thesaurus.txt 
+
diff -ruN MyThes-1.0/mythes.cxx MyThes-1.0-vim/mythes.cxx
--- MyThes-1.0/mythes.cxx	2003-12-08 14:40:27.000000000 -0700
+++ MyThes-1.0-vim/mythes.cxx	2018-11-28 13:12:29.000000000 -0700
@@ -25,7 +25,7 @@
 // return index of char in string
 int mystr_indexOfChar(const char * d, int c)
 {
-  char * p = strchr(d,c);
+  const char * p = strchr(d,c);
   if (p) return (int)(p-d);
   return -1;
 }
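
For readers who would rather not build the C++ example, here is a rough Python sketch of reading the data file directly. The layout it assumes is inferred only from the grep/awk pipeline in the patch: a first line naming the encoding, headword lines of the form `word|count`, and meaning lines that start with `(` and hold `|`-separated fields. The name `parse_mythes_dat` and the sample data are made up, and this has not been checked against the real th_en_US_new.dat.

```python
import io


def parse_mythes_dat(stream):
    """Yield (headword, synonyms) pairs from a MyThes-style .dat stream.

    Assumed layout (inferred, not verified against the real file):
      line 1:        encoding name, e.g. ISO8859-1
      headword line: word|number_of_meanings
      meaning line:  (pos)|synonym1|synonym2|...
    """
    next(stream, None)  # skip the encoding line
    head, synonyms = None, []
    for raw in stream:
        line = raw.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("|")
        if fields[0].startswith("("):
            # Meaning line: everything after the first field is a synonym.
            synonyms.extend(f for f in fields[1:] if f)
        else:
            # New headword: emit the previous entry first.
            if head is not None:
                yield head, synonyms
            head, synonyms = fields[0], []
    if head is not None:
        yield head, synonyms


# Made-up sample in the assumed layout.
sample = io.StringIO(
    "ISO8859-1\n"
    "simple|2\n"
    "(adj)|elementary|uncomplicated\n"
    "(noun)|herb\n"
    "ice cream|1\n"
    "(noun)|frozen dessert\n"
)
for head, synonyms in parse_mythes_dat(sample):
    print(head, synonyms)
```

With a real file, the stream would presumably be opened with the encoding named on its first line, for example `open(path, encoding="iso-8859-1")`.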

@brammool
Contributor

I can at least mention this comment in the help; unpacking the .zip file isn't too difficult.

@dcslagel

dcslagel commented Dec 1, 2018

Thanks. Is there anything else that needs to be done to complete this issue?

@brammool
Contributor

brammool commented Dec 1, 2018 via email

Yamagi added a commit to Yamagi/vimrc that referenced this issue Nov 3, 2019
Since German and English are mostly distinct and have only a few words in
common, we're taking the easy approach and adding both thesaurus files at
the same time. Vim will query both and display the cumulative matches.

Taken from:
 * vim/vim#629 (comment)
 * https://github.com/Yamagi/vim-german-thesaurus