Stopwords #35

Closed
wants to merge 13 commits into from

assem-ch
Contributor

http://trac.xapian.org/ticket/269#no1

  • Gathering the list of Arabic Stop-words
  • Include the stop-word lists from Snowball Project

- contains about 10k words (counting all forms)
…ges.

+ convert the words to Arabic letters, instead of Unicode
- eliminate a lot of words that may not act as stop words
- eliminate different forms, since not all forms have high frequency
(xapian-data is only for test files)
about the  function to load stopwords from a file
Except Russian: after conversion, the letters seem to be totally messed up
@@ -35,6 +35,7 @@

#include <set>
#include <string>
#include <fstream>
Contributor

You don't need to actually include <fstream> if you only reference things from it in a documentation comment.

Contributor Author

I removed it

(Intended to use it in a stopper loader from a file)
$(snowball_stopwords:.txt=.list)

.txt.list:
sed 's/[$$'\t' ]*|.*//;/^[$$'\t' ]*$$/d' < $< > $@
Contributor

I guess you're trying to use $'\t' here? That's a bash-ism, and will only work as you expect if SHELL gets set to bash, which certainly won't happen on systems without bash installed. I think you've not managed to get it quite right for bash either - if I un-double the $ and echo the result to see how it expands in bash, I get:

$ echo sed 's/[$'\t' ]*|.*//;/^[$'\t' ]*$/d'
sed s/[$t ]*|.*//;/^[$t ]*$/d

But what you want is simply a literal tab character in this Makefile.mk. How to insert one depends on what editor you're using. Maybe just hitting the tab key will do it, but often that's mapped to some sort of smart indent function. For example, in vim it's Ctrl+V tab to insert a literal tab.
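As a rough illustration of the portability point above, here is a minimal sketch run as a plain shell script rather than as a Makefile recipe (inside a recipe each `$` would need doubling). It builds the tab with `printf`, which POSIX sh provides, instead of relying on `$'\t'`. The function name `strip_comments` and the sample input are hypothetical; the sed expression mirrors the intent of the rule under review, not necessarily the final committed version.

```shell
#!/bin/sh
# Build a literal tab portably; $'\t' is a bash extension, not POSIX sh.
tab=$(printf '\t')

# Strip "| comment" tails (and the whitespace before them), then drop
# lines that are empty or contain only tabs and spaces.
strip_comments() {
    sed "s/[${tab} ]*|.*//;/^[${tab} ]*\$/d"
}

# Hypothetical sample in the Snowball stop.txt style: word, then "|" comment.
printf 'the\t| definite article\n | comment-only line\n\nand   | conjunction\n' \
    | strip_comments
```

With the sample input this prints `the` and `and` on separate lines. Typing a literal tab into the Makefile, as suggested above, avoids the need for any shell variable at all.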

Contributor Author

I fixed it

languages/stopwords/spanish/stop.txt\
languages/stopwords/swedish/stop.txt

snowball_stopwords_preprocessed =\
Contributor

There's a space after the backslash on this line, which confuses the preautoreconf script which collates a list of the sources to feed to doxygen, and results in make choking on docs/Makefile:

make[2]: Entering directory '/home/olly/git/xapian-svn/xapian-core/docs'
Makefile:390: *** recipe commences before first target.  Stop.

This may be something GNU make 4.0 is fussier about than 3.x, but generally spaces or tabs after a \ intended to continue the line are asking for trouble, as they may be interpreted as escaping the whitespace instead.
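One way to catch this class of problem before it reaches make is to search for a backslash followed only by whitespace at the end of a line. The sketch below is only an illustration: the function name `find_trailing_ws_after_backslash` and the sample fragment are made up, and it is not part of the Xapian build.

```shell
#!/bin/sh
# Print "line-number:line" for every line where a backslash is followed
# by one or more whitespace characters before the end of the line.
find_trailing_ws_after_backslash() {
    grep -n '\\[[:space:]][[:space:]]*$' "$1"
}

# Hypothetical Makefile fragment: line 1 has a stray space after the
# backslash, line 2 is a correct continuation.
tmp=$(mktemp)
printf 'srcs =\\ \n\ta.txt\\\n\tb.txt\n' > "$tmp"
find_trailing_ws_after_backslash "$tmp"
rm -f "$tmp"
```

Only the first line of the sample is reported, since its backslash is followed by a space.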

Contributor

ojwb commented Jun 17, 2014

OK, merged almost all of this via SVN, and committed the whitespace fixup to languages/Makefile.mk as well.

I haven't merged the bindings changes yet, but it's past time I went to bed.

Contributor

ojwb commented Jun 28, 2014

I just noticed there are a number of duplicates in your Arabic stopword list (the number is how many times each word occurs):

      2 إذ
      2 إذا
      2 أما
      2 أنى
      3 أي
      2 إياك
      2 أيان
      3 أين
      2 بكم
      2 ذا
      2 سوى
      2 كل
      2 كلا
      2 كم
      3 لست
      3 لكن
      2 لما
      3 متى
      2 هيا
      2 وا

List was generated by:

sort arabic/stop.list|uniq -c|grep -v '^ *1\>'
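The `\>` word-boundary anchor in that grep pattern is a GNU extension. A variant that avoids it, sketched here with a hypothetical function name and made-up sample data, filters on the count column with awk instead:

```shell
#!/bin/sh
# List each word that occurs more than once, prefixed by its count.
# LC_ALL=C keeps the sort order independent of the user's locale.
list_duplicates() {
    LC_ALL=C sort "$1" | uniq -c | awk '$1 > 1 { print $1, $2 }'
}

# Hypothetical one-word-per-line list with repeated entries.
tmp=$(mktemp)
printf 'b\na\nb\nc\na\nb\n' > "$tmp"
list_duplicates "$tmp"    # prints "2 a" then "3 b"
rm -f "$tmp"
```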

Contributor

ojwb commented Jun 28, 2014

OK, I've tweaked the stop word prep makefile rule to sort and uniq the words, so any duplicates in the sources will get eliminated, though unless there's a good reason to have them in the sources it's probably better to remove them from there too.
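For cleaning the source lists themselves, re-sorting may be undesirable if the file's order is meaningful. The following is a minimal sketch, assuming a one-entry-per-line file, that drops repeated lines while keeping the first occurrence and the original order. The function name `dedupe_keep_order` is hypothetical, and this is not the makefile rule mentioned above.

```shell
#!/bin/sh
# Keep blank lines as-is; otherwise print a line only the first time it
# is seen, so the original ordering is preserved.
dedupe_keep_order() {
    awk 'NF == 0 || !seen[$0]++' "$@"
}

# Hypothetical sample: "b" and "a" each appear twice.
printf 'b\na\nb\n\nc\na\n' | dedupe_keep_order
```

For the sample this prints `b`, `a`, an empty line, and `c`.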

A few of the Snowball lists also had duplicates, which I've removed, and I will open a PR with Snowball.

The Russian list seemed to be KOI8-R, so I've converted that to UTF-8.
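A conversion like this can be done with `iconv`. The sketch below is an assumption about how such a conversion might look, not a record of the command actually used; the function name `koi8r_to_utf8` is hypothetical, and the input is written as octal escapes so the example does not depend on the encoding of this page.

```shell
#!/bin/sh
# Convert KOI8-R text on standard input to UTF-8 on standard output.
koi8r_to_utf8() {
    iconv -f KOI8-R -t UTF-8
}

# Octal 316 305 are the KOI8-R bytes for the two-letter Russian word "не".
printf '\316\305\n' | koi8r_to_utf8
```

Decoding KOI8-R bytes with the wrong source encoding produces valid but meaningless characters, which would be consistent with the "messed up" letters reported earlier in this thread.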

And I've merged the stemmer change, reworked slightly to wrap it as a constructor so it's more similar to the C++ API. Also added a quick testcase.

So everything's now merged - sorry it's taken a while.

ojwb closed this Jun 28, 2014