Support full text search for all languages #700

Closed
wants to merge 20 commits into
from

Projects

None yet

5 participants

@kou
Contributor
kou commented Apr 24, 2016

GitHub: #615

It uses PGroonga instead of built-in PostgreSQL full text search
feature. Because built-in PostgreSQL full text search doesn't support
languages that don't put space between terms such as Japanese, Chinese
and so on. PGroonga supports all languages including Japanese and
Chinese.

This change doesn't enable PGroonga by default. It means that this
change doesn't change the current behavior.

You need to set True to USING_PGROONGA in zrpoject/settings.py to enable
PGroonga.

@smarx
smarx commented Apr 24, 2016

Automated message from Dropbox CLA bot

@kou, thanks for the pull request! It looks like you haven't yet signed the Dropbox CLA. Please sign it here.

@smarx
smarx commented Apr 24, 2016

Automated message from Dropbox CLA bot

@kou, thanks for signing the CLA!

@timabbott timabbott and 1 other commented on an outdated diff Apr 25, 2016
puppet/zulip/files/postgresql/process_fts_updates
@@ -43,6 +43,8 @@ sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../.
try:
os.environ['DJANGO_SETTINGS_MODULE'] = 'zproject.settings'
from django.conf import settings
+ if settings.USING_PGROONGA:
+ exit()
@timabbott
timabbott Apr 25, 2016 edited Contributor

@kou can you explain a bit more what's going on here? Is the story that with PGROONGA we're not logging and deferring updates to the full-text search index, and so we don't need this script?

@kou
kou Apr 26, 2016 Contributor

Sure.
PGroonga supports read lock free updating. It means that search performance isn't decreased while we add new messages. So we don't need to defer full-text search index update.

(I don't know why Zulip defers full-text search index update. So the answer may not describe what you want to know.)

@timabbott
timabbott Apr 26, 2016 Contributor

Right, so the story here is that Zulip does this deferral as a performance optimization. Updating the full-text search index is a substantial portion of the total work involved in do_send_message, so rather than blocking on doing that before finishing the message sending operation, we do it in the background after the message is sent. So this issue isn't search performance, it's message sending performance.

I think it would be good to use the same mechanism with PGroonga if there isn't a technical constraint that makes this difficult, since otherwise sites using this will experience substantially slower message sending performance.

@kou
kou Apr 27, 2016 Contributor

Thanks for the information.

PGroonga has "newer data are more important" policy. So PGroonga's index update is faster than other full-text search implementation such as GIN. But it's slower than "do nothing".

I hope that sent messages are searchable as soon as possible. (It also respects PGroonga policy.)

But this is Zulip project. Zulip project prefers deferring index update, we should do. It should be done in this pull request? Can we separate it as another issue? Because the current implementation works.

@timabbott
timabbott Apr 27, 2016 Contributor

The deferred index update usually only adds <10ms of delay to the data being available in the index -- it just avoids blocking the critical send_message code path on updating the index.

I definitely understand the desire to get something finished and merged and then move onto building improvements and optimizations. I'm OK with fixing this being a follow-up issue, but I consider it a blocker for PGroonga being available in Zulip in production, because I don't want to support two models here (and I think it'll actually be very little work to make the PGroonga integration use the existing infrastructure for doing deferred updates we have for postgres full text search).

@kou
kou Apr 29, 2016 Contributor

OK. I've implemented and pushed it.

In the implementation, the default full-text search index is also updated. Should we disable index update for the default full-text search?

@timabbott timabbott and 1 other commented on an outdated diff Apr 25, 2016
zproject/settings.py
@@ -329,6 +329,22 @@ def get_secret(key):
else:
DATABASES['default']['OPTIONS']['sslmode'] = 'verify-full'
+if 'postgres' in DATABASES['default']['ENGINE']:
+ # Set True if you want to support full text search against all languages
+ # USING_PGROONGA = True
+ USING_PGROONGA = False
+else:
+ USING_PGROONGA = False
@timabbott
timabbott Apr 25, 2016 Contributor

This user-configurable setting for USING_PGROONGA should be added to zproject/local_settings_template.py (installed as /etc/zulip/settings.py) so users don't need to patch Zulip to set this setting. You should add its default value of False to DEFAULT_SETTINGS in this file.

@timabbott
timabbott Apr 25, 2016 Contributor

You should add some docs in a new file docs/full-text-search.md explaining how to setup pgroonga, and linking to the upstream broader docs on pgroonga. (We should probably also have some text in there explaining how the default Zulip full-text search works, but I'm happy to write that).

Also, the section for this that you add in zproject/local_settings_template.py should have comments linking to that documentation.

@kou
kou Apr 26, 2016 Contributor

This user-configurable setting for USING_PGROONGA should be added to zproject/local_settings_template.py (installed as /etc/zulip/settings.py) so users don't need to patch Zulip to set this setting. You should add its default value of False to DEFAULT_SETTINGS in this file.

Done.

You should add some docs in a new file docs/full-text-search.md explaining how to setup pgroonga, and linking to the upstream broader docs on pgroonga. (We should probably also have some text in there explaining how the default Zulip full-text search works, but I'm happy to write that).

I've added it with some TODO marks. Could you resolve TODOs for full-text search feature summary and the default full-text search feature description after this pull request is merged?

Also, the section for this that you add in zproject/local_settings_template.py should have comments linking to that documentation.

Done.

@timabbott timabbott and 1 other commented on an outdated diff Apr 25, 2016
@@ -105,6 +107,11 @@
def main():
run(["sudo", "apt-get", "update"])
+ run(["sudo", "apt-get", "-y", "install", "software-properties-common"])
+ run(["sudo", "add-apt-repository", "-y", "universe"])
+ run(["sudo", "add-apt-repository", "-y", "ppa:groonga/ppa"])
+
+ run(["sudo", "apt-get", "update"])
@timabbott
timabbott Apr 25, 2016 Contributor

This only suffices to install pgroonga in a development environment; we'll also need some code for installing it in a production environment. I think the right way to do this would be to add a new (optional) puppet module under puppet/zulip/manifests/ intended to be used alongside postgres_appdb.pp that adds this PPA and depends on the package. Let me know if you need help understanding how to do this properly.

@kou
kou Apr 26, 2016 Contributor

How about separating the task a separated issue? I want to work on minimum task in one pull request.

@timabbott
timabbott Apr 27, 2016 Contributor

I think that addressing this is required to make pgroonga a supported feature for Zulip, but I'm OK with making this a follow-up issue if you like. But we should clearly mark this feature as experimental in the docs/ page and be clear it only works in the Zulip development environment for now. Does that seem reasonable? I don't want to have people think they can use this in production when that's not done yet.

@kou
kou Apr 29, 2016 Contributor

OK. I've pushed puppet support.

@timabbott timabbott and 1 other commented on an outdated diff Apr 25, 2016
zerver/views/messages.py
+ if settings.USING_PGROONGA:
+ return self._by_search_pgroonga(query, operand, maybe_negate)
+ else:
+ return self._by_search_tsearch(query, operand, maybe_negate)
+
+ def _by_search_pgroonga(self, query, operand, maybe_negate):
+ match_positions_byte = func.pgroonga.match_positions_byte
+ query_extract_keywords = func.pgroonga.query_extract_keywords
+ keywords = query_extract_keywords(operand)
+ query = query.column(match_positions_byte(column("rendered_content"),
+ keywords).label("content_matches"))
+ query = query.column(match_positions_byte(column("subject"),
+ keywords).label("subject_matches"))
+ condition = or_(column("subject").op("@@")(operand),
+ column("rendered_content").op("@@")(operand))
+ return query.where(maybe_negate(condition))
@timabbott
timabbott Apr 25, 2016 Contributor

Does highlighting of the matched words work with the pgroonga version here?

@kou
kou Apr 26, 2016 Contributor

Sure. I've added a new function that compatible with ts_match_locs_array in tsearch_extras to PGroonga.

We can use the existing code for highlighting.

@timabbott
timabbott Apr 26, 2016 Contributor

Cool, that's great! Do you need to add any code to make that work?

@kou
kou Apr 27, 2016 Contributor

Do you need to add any code to make that work?

No. The current code is enough.

@timabbott
timabbott Apr 27, 2016 Contributor

great!

@timabbott timabbott and 1 other commented on an outdated diff Apr 25, 2016
zerver/migrations/0001_initial.py
@@ -47,8 +53,8 @@ class Migration(migrations.Migration):
CREATE TRIGGER zerver_message_update_search_tsvector_async
BEFORE INSERT OR UPDATE OF subject, rendered_content ON zerver_message
FOR EACH ROW EXECUTE PROCEDURE append_to_fts_update_log();
-"""),
- ]
+""")
+ zulip_postgres_migrations = [full_text_search_migration]
@timabbott
timabbott Apr 25, 2016 Contributor

This approach of modifying the initial migration to use PGROONGA isn't ideal, since it makes it fairly unpleasant for an existing Zulip installation to switch to using PGROONGA. I'm not sure what the right answer is (maybe we just provide a separate script the user can use when they toggle PGROONGA it to create the migration), but wanted to flag this as a potential issue.

@kou
kou Apr 26, 2016 Contributor

OK. Should we create a new issue about it? It may be a large task.

@timabbott
timabbott Apr 26, 2016 Contributor

I'm not sure it needs to be a large task. Assuming we're OK with the old full-text search still having its implementation present, it seems to me like we need the script to just do a few things:

  • Do an ALTER ROLE to update the default search path to the appropriate value
  • Run the CREATE INDEX command above

Or am I missing something?

@kou
kou Apr 27, 2016 Contributor

Sorry. My description didn't have enough information.

What the script needs to do is what you list. It's not a large task as you said.

I think that we need to decide the followings things. It may be a large task because we may need many things or discussions.

  • Where should we put the script?
  • How do users use the script?
@timabbott
timabbott Apr 27, 2016 Contributor

I'd put the script with a path like scripts/setup/enable-pgroonga, and the documentation for the USING_PGROONGA option would suggest that users need to run that script in order to use it.

Does that seem reasonable?

@kou
kou Apr 29, 2016 Contributor

Thanks for deciding them. It's reasonable.

I've put scripts/setup/enable-pgroonga and scripts/setup/disable-pgroonga and removed related codes from tools/postgres-init-deb-db and zerver/migrations/0001_initial.py. docs/full-text-search.md is updated.

@timabbott timabbott and 1 other commented on an outdated diff Apr 25, 2016
tools/postgres-init-dev-db
@@ -24,7 +24,7 @@ if [[ $# == 0 ]]; then
USERNAME=zulip
PASSWORD=$("$(dirname "$0")/../bin/get-django-setting" LOCAL_DATABASE_PASSWORD)
DBNAME=zulip
- SEARCH_PATH="$USERNAME",public
+ SEARCH_PATH="$USERNAME",public,pgroonga,pg_catalog
@timabbott
timabbott Apr 25, 2016 Contributor

why do we need to add pg_catalog here as well?

@kou
kou Apr 26, 2016 Contributor

PostgreSQL searches pg_catalog first if it's not specified in search_path.

http://www.postgresql.org/docs/9.5/static/runtime-config-client.html :

The system catalog schema, pg_catalog, is always searched, whether it is mentioned in the path or not. If it is mentioned in the path then it will be searched in the specified order. If pg_catalog is not in the path then it will be searched before searching any of the path items.

pgroonga schema provides @@ operator. It's also defined in pg_catalog. If we just specify only pgroonga, @@ operator is never used. Because pg_catalog is searched first.

It causes a problem when sequential scan is used by PostgreSQL. Both of @@ operator in pg_catalog and @@ operator in pgroonga provide full-text search feature. But @@ operator in pg_catalog doesn't support languages that doesn't put space between words such as Japanese and Chinese. In sequential scan, we need to use @@ operator in pgroonga to support all languages.

@timabbott
timabbott Apr 26, 2016 edited Contributor

OK, makes sense. Can you add a comment explaining this in the relevant section of zproject/settings.py?

I'm thinking something like:

We need to have pgroonga before pg_catalog in the postgres search path, because pgroonga overrides the @@ operator from pg_catalog, and pg_catalog is searched first if not specified in the search path.

Also, you'll need to put this in scripts/setup/postgres-init-db as well (the version that runs for setting up production instances).

@kou
kou Apr 27, 2016 Contributor

OK, makes sense. Can you add a comment explaining this in the relevant section of zproject/settings.py?

Done.

Also, you'll need to put this in scripts/setup/postgres-init-db as well (the version that runs for setting up production instances).

We should not add pgroonga to search when we don't use PGroonga. If PGroonga is installed but Zulip doesn't use PGroonga, @@ operator in pgroonga schema is used. It's a problem.

We need to know whether does the user want to use PGroonga or not in scripts/setup/postgres-init-db. If the user wants to use PGroonga, we should setup search path. How do we know about it?

@kou
kou Apr 27, 2016 Contributor

Ah, should we use ADDITIONAL_PACKAGES environment variable introduced at 68c6d51 ?

@timabbott
timabbott Apr 27, 2016 Contributor

Hmm, this is tricky -- postgres-init-db is run before we give the user the opportunity to put things into settings.py. Let me think about this a bit more and get back to you. It might work to move postgres-init-db to being run a bit later in the installation process.

@timabbott
timabbott Apr 27, 2016 Contributor

From our discussion on provision.py, it sounds like you're planning to defer enabling this feature in production for a follow-up PR; since this is related to the same goal, it's fine for this to be bundled with that follow-up issue.

@timabbott
Contributor

Cool, thanks for working on this @kou! I posted a bunch of comments and questions; let me know when you have a new version ready!

@kou
Contributor
kou commented Apr 26, 2016

I've pushed some commits to apply your comments.

@timabbott timabbott and 1 other commented on an outdated diff Apr 26, 2016
docs/full-text-search.md
+See also:
+[GRANT USAGE ON SCHEMA pgroonga](http://pgroonga.github.io/reference/grant-usage-on-schema-pgroonga.html)
+
+Then, you set `True` to `USING_PGROONGA` in `local_settings.py`.
+
+Before:
+
+ # USING_PGROONGA = True
+
+After:
+
+ USING_PGROONGA = True
+
+TODO: Describe how to enable PGroonga on installed Zulip.
+
+Now, you can use full-text search against all languages.
@timabbott
timabbott Apr 26, 2016 Contributor

Does the user need to add any dictionary files or do other setup to make PGROONGA work?

@kou
kou Apr 27, 2016 Contributor

There are no more setup.
PGroonga can use dictionary based full-text search and not dictionary based full-text search. In this pull request, we use not dictionary based full-text search. Because we need dictionaries for all languages to support all languages with dictionary based full-text search.

@timabbott
timabbott Apr 27, 2016 Contributor

I see -- maybe we should add documentation on how to add a dictionary for a particular language to PGroonga?

@kou
kou Apr 27, 2016 Contributor

I think that we don't need it. Because we use dictionary based full-text search, we can't support all languages. If the user who doesn't need to support all languages, the user will use the current (default) full-text search feature.

@timabbott timabbott and 1 other commented on an outdated diff Apr 26, 2016
docs/full-text-search.md
+## How to enable full-text search against all languages
+
+Zulip also supports [PGroonga](http://pgroonga.github.io/). PGroonga
+is a PostgreSQL extension that provides full-text search
+feature. PGroonga supports all languages including Japanese, Chinese
+and so on.
+
+This section describes how to enable full-text search feature based on
+PGroonga.
+
+First, you [install PGroonga](http://pgroonga.github.io/install/).
+
+Then, you grant `USAGE` privilege on `pgroonga` schema to `zulip`
+user:
+
+ GRANT USAGE ON SCHEMA pgroonga TO zulip;
@timabbott
timabbott Apr 26, 2016 Contributor

Can we move the GRANT USAGE part into the pgroonga setup script?

@kou
kou Apr 27, 2016 Contributor

Yes when we decide how to provide the PGroonga setup script. It will be discussed at https://github.com/zulip/zulip/pull/700/files/3532f455a1bff2fb8aa413cceaa6b1aba9fdb8de#r60951735 or new issue.

@timabbott
timabbott Apr 27, 2016 Contributor

OK.

@timabbott
Contributor

@kou thanks for addressing those comments! I posted a few more issues.

The other thing that I think would be really good to have here is some sort of automated testing to confirm the pgroonga implementation doesn't get broken the next time we do refactoring in the search code paths. I'm not totally sure what the best way to do this is. https://docs.djangoproject.com/en/1.8/topics/testing/tools/#overriding-settings would help with letting us set different settings for a test case, but since this involves a modified database schema, we may need to do something more complex to do the testing.

@kou
Contributor
kou commented Apr 27, 2016

The other thing that I think would be really good to have here is some sort of automated testing to confirm the pgroonga implementation doesn't get broken the next time we do refactoring in the search code paths.

I've added it: 062afa6

@timabbott
Contributor

Posted a few more comments. You have a lint failure that travis CI has flagged. The remaining big issue from my perspective is patching the initial migration as the model for supporting USING_PGROONGA. I think it will make it quite difficult for anyone to switch between using pgroonga and the default postgres full text search.

The simplest fix would be to add a scripts/setup/enable-pgroonga script (and maybe a corresponding disable script) to setup pgroonga and add the PGroonga index, which is outside the zerver migration model. Perhaps the nicest approach might be to make a separate Django app for PGroonga with its own migration sequence; the user would add the new app to INSTALLED_APPS and run manage.py migrate to enable PGroonga; this would leave us the flexibility to update the PGroonga database schema in a natural way by adding new migrations.

Maybe the simplest path to getting this merged would be to move the PGroonga setup code out of the first migration and instead just put it into the docs/ file along with the GRANT statement, and then we can put it into a script as we get this productionizable?

What do you think @kou?

@kou
Contributor
kou commented Apr 29, 2016

Posted a few more comments.

I think that I've applied all your comments.

You have a lint failure that travis CI has flagged.

Sorry. There was a code for debug. I've fixed it.

@timabbott
Contributor

OK cool; I'm working on getting a Zulip 1.3.11 release out the door and then will re-review and hopefully merge this just after (since I think it's worth having a bit of time on master befoer this goes into a release)

@kou
Contributor
kou commented May 3, 2016

OK. I wait for your review.

@timabbott
Contributor

@kou just wanted to update you that I still really want to review this, but I simply haven't had time to do it and I'm about to go on a 2-week vacation, so it may be a little while. I'm really sorry about the delay here!

@kou
Contributor
kou commented May 19, 2016

Thanks for notifying your status. Have a nice vacation!

@Jianchun1
Contributor

hi all, I have enable the Chinese full-text search in my Zulip production environment. It works well for some days. I hope it would be a feature of the next release. Thanks for all your works.

@kou
Contributor
kou commented Jun 7, 2016

@timabbott Could you restart this pull request?

@timabbott
Contributor

@kou yep I've recently returned from my trip, and am trying to catch up on a large backlog of open pull requests. I'm hopefully going to be able to review this again properly soon. Sorry again for the delays @kou!

And thanks @Jianchun1 for the report that this is working for you in production! It's very helpful data.

@timabbott
Contributor

@kou I found time to take another detailed look at this, and tested it out; things went smoothly except for a couple small issues:

  • It seems the enable/disable scripts need to run as root, but they aren't run under sudo in the docs; can you update the docs?
  • Do you need to add the universe repository in the provision.py file?

I have a few questions about how pgroonga's release and packaging process works:

  • Is there a plan for PGroonga being included in Debian upstream (and thus eventually Ubuntu)?
  • To what extent is PGroonga in a stable state of development? E.g. can we assume that it won't have major backwards-incompatible upgrades in the future? If so, we should at least document this status.
  • If we're going to be using the PPA in all Zulip production environments, we're putting all Zulip users at (hopefully small) risk of apt-get upgrade breaking their system, so I'd like to understand the tradeoffs there. An alternative, more conservative approach could be to only add the Pgroonga PPA and package in production if the feature is enabled, so problems with PGroonga can only affect those sites that are using it. We can do easily that by making a separate puppet manifest for it. (I think it's fine to have it installed in development for everyone).
  • To what extent can we expect the PGroonga PPA to have packages for all major versions of Ubuntu (+ supported Postgres versions), and to be kept up to date with any security updates?
  • Is there benchmark data on the performance of PGroonga compares with Postgres' built-in full-text search? I think I'd want to have that before switching our full-text search to use it for everything (as opposed to just an option), which seems like the long-term path (since full-text search working in all languages is clearly better than just working in some).

In the meantime I can work on writing the rest of the docs on our full-text search system, so that we can remove the TODOs from this pull request.

@timabbott
Contributor

BTW, thanks for your work on this and the upstream PGroonga tool -- I'm definitely excited about being able to offer full-text search not just in English!

@kou
Contributor
kou commented Jun 18, 2016 edited
  • It seems the enable/disable scripts need to run as root, but they aren't run under sudo in the docs; can you update the docs?

Sure. I've added the following description:

The following processes should be executed as the root user. Run:

sudo -i

The description is based on https://zulip.org/server.html :

The installation process should be executed as the root user. Run:

sudo -i


  • Do you need to add the universe repository in the provision.py file?

Yes. postgresql-9.4-pgroonga package depends on liblz4-1 package. It's in the universe repository in Ubuntu 14.04: http://packages.ubuntu.com/trusty/liblz4-1

It's in the main repository since Ubuntu 16.04: http://packages.ubuntu.com/xenial/liblz4-1

@kou
Contributor
kou commented Jun 18, 2016
  • Is there a plan for PGroonga being included in Debian upstream (and thus eventually Ubuntu)?

Unfortunately, no.

  • To what extent is PGroonga in a stable state of development? E.g. can we assume that it won't have major backwards-incompatible upgrades in the future? If so, we should at least document this status.

PGroonga 1.x keeps backward compatibility. (The current version is 1.1.0.)

But PGroonga 2.0.0 introduces backward incompatible changes. PGroonga users need to recreate PGroonga indexes and change SQL (custom operators in WHERE). PGroonga 2.0.0 will be released within a year.

I'll send a pull request when PGroonga 2.0.0 is released.

  • If we're going to be using the PPA in all Zulip production environments, we're putting all Zulip users at (hopefully small) risk of apt-get upgrade breaking their system, so I'd like to understand the tradeoffs there. An alternative, more conservative approach could be to only add the Pgroonga PPA and package in production if the feature is enabled, so problems with PGroonga can only affect those sites that are using it. We can do easily that by making a separate puppet manifest for it. (I think it's fine to have it installed in development for everyone).

Here are risks:

  • If the PPA site is down, apt-get update will be failed.
  • If the PPA introduces wrong dependency, apt-get upgrade will be failed. (It'll be fixed soon when it's occurred. Because Groonga (full text search engine used in PGroonga) and Mroonga (storage engine based on Groonga for MySQL) users are also using the PPA. They will find the problem before Zulip users find the problem. If the problem is found, we'll fix the problem soon.)
  • When PGroonga 2.0.0 is released but Zulip doesn't support PGroonga 2.0.0 yet, apt-get upgrade will break Zulip. (It may be better that we use Pin: version 1.* in /etc/apt/preferences.)

There is no Zulip breakage risk by apt-get upgrade for Zulip users who install PGroonga but doesn't use PGroonga. PGroonga is never used until Zulip users run CREATE EXTENSION pgroonga.

  • To what extent can we expect the PGroonga PPA to have packages for all major versions of Ubuntu (+ supported Postgres versions), and to be kept up to date with any security updates?

The PGroonga PPA will support Ubuntu 14.04 or later (= PostgreSQL 9.3 or later) at least until the next few years. We may drop Ubuntu 14.04 before Ubuntu 14.04 EOL when we need to require any features that aren't in Ubuntu 14.04. We'll keep supporting Ubuntu 14.04 (= PostgreSQL 9.3) as much as possible.

The PGroonga PPA is updated new PGroonga version is released. If any security issues are found, new PGroonga version is released. The new package is also released.

  • Is there benchmark data on the performance of PGroonga compares with Postgres' built-in full-text search? I think I'd want to have that before switching our full-text search to use it for everything (as opposed to just an option), which seems like the long-term path (since full-text search working in all languages is clearly better than just working in some).

No.

There is benchmark data for PGroonga and pg_bigm. pg_bigm is a full text search extension based on GIN. The benchmark uses Japanese Wikipedia data: http://groonga.org/en/blog/2015/10/29/pgroonga-1.0.0.html

@timabbott
Contributor

OK, cool. It'd be great to have a benchmark against the default postgres full-text search, but that's certianly plenty of evidence that this is efficient :). My main concern is the PGroona 2.0 compatibility issue:

When PGroonga 2.0.0 is released but Zulip doesn't support PGroonga 2.0.0 yet, apt-get upgrade will break Zulip. (It may be better that we use Pin: version 1.* in /etc/apt/preferences.)

I think probably we can reasonably include with the Zulip PGroonga configuration an /etc/apt/preferences.d file to ping PGroonga to 1.0.x. When PGroonga 2.0.0 comes out, will PGroonga 1.0.x still be available in the PGroonga PPA repository? If not, we may need to build PGroonga packages for our own PPA, so we can make sure old versions of Zulip remain installable.

@kou
Contributor
kou commented Jun 23, 2016

When PGroonga 2.0.0 comes out, will PGroonga 1.0.x still be available in the PGroonga PPA repository?

Ah, no. PPA provides only the latest versions.

OK. I change my plan. PGroonga 2.x keeps 1.x API and make 1.x API deprecated.
I'll send a pull request to use PGroonga 2.x API after I release PGroonga 2.0.0.
Zulip can use PGroonga 1.x API while PGroonga 2.x. I drop PGroonga 1.x API when I release PGroonga 3.0.0. All Zulip users who use PGroonga should upgrade to PGroonga 2.0.0 or later until PGroonga 3.0.0 is released.

What do you think about this plan?

@timabbott
Contributor

That plan sounds great! I imagine it'd help a lot of other projects that might use PGroonga as well.

@kou
Contributor
kou commented Jun 24, 2016

Thanks for confirming the new plan.
Is there any concern to merge this pull request?

@timabbott timabbott and 1 other commented on an outdated diff Jun 26, 2016
zproject/local_settings_template.py
@@ -336,3 +336,12 @@
# Controls whether Zulip will rate-limit user requests.
# RATE_LIMITING = True
+
+# Full-text search configuration
+#
+# By default, Zulip supports full-text search only for English. If
+# you want to use other languages, use PGroonga based full-text search
+# feature. See
+# https://zulip.readthedocs.org/en/latest/full-text-search.html how to
+# enable PGroonga based full-text search feature.
+# USING_PGROONGA = True
@timabbott
timabbott Jun 26, 2016 Contributor

maybe change this to just read USING_PGROONGA = False and then the user can change False -> True to enable it? The instructions would be simpler this way.

@kou
kou Jun 27, 2016 Contributor

Done.

@timabbott timabbott and 1 other commented on an outdated diff Jun 26, 2016
docs/full-text-search.md
+### How to disable full-text search against all languages
+
+This section describes how to disable full-text search feature based
+on PGroonga.
+
+You set `False` to `USING_PGROONGA` in
+`local_settings.py`. `USING_PGROONGA` is `False` by default. So you
+just comment it out:
+
+Before:
+
+ USING_PGROONGA = True
+
+After:
+
+ # USING_PGROONGA = True
@timabbott
timabbott Jun 26, 2016 Contributor

See my comment below about changing how this setting appears in local_settings_template.py, but I think we can simplify these instructions substantially once it's just "change False to True" or "True to False".

@kou
kou Jun 27, 2016 Contributor

Done.

@timabbott timabbott and 1 other commented on an outdated diff Jun 26, 2016
scripts/setup/enable-pgroonga
+# Make sure the current working directory is readable by postgres
+cd /
+
+su "$POSTGRES_USER" -c 'psql zulip' <<EOF
+CREATE EXTENSION pgroonga;
+
+GRANT USAGE ON SCHEMA pgroonga TO zulip;
+ALTER ROLE zulip SET search_path TO zulip,public,pgroonga,pg_catalog;
+
+SET search_path = zulip,public,pgroonga,pg_catalog;
+
+ALTER TABLE zerver_message ADD COLUMN search_pgroonga text;
+
+UPDATE zerver_message SET search_pgroonga = subject || ' ' || rendered_content;
+
+CREATE INDEX zerver_message_search_pgroonga ON zerver_message
@timabbott
timabbott Jun 26, 2016 Contributor

We probably want this to use CREATE INDEX CONCURRENTLY; otherwise, nobody will be able to send messages while this is running.

@kou
kou Jun 27, 2016 Contributor

Done.

@timabbott timabbott and 1 other commented on an outdated diff Jun 26, 2016
scripts/setup/enable-pgroonga
+GRANT USAGE ON SCHEMA pgroonga TO zulip;
+ALTER ROLE zulip SET search_path TO zulip,public,pgroonga,pg_catalog;
+
+SET search_path = zulip,public,pgroonga,pg_catalog;
+
+ALTER TABLE zerver_message ADD COLUMN search_pgroonga text;
+
+UPDATE zerver_message SET search_pgroonga = subject || ' ' || rendered_content;
+
+CREATE INDEX zerver_message_search_pgroonga ON zerver_message
+ USING pgroonga(search_pgroonga pgroonga.text_full_text_search_ops);
+EOF
+)
+
+# Clear memcached to avoid contamination from previous database state
+sh "$(dirname "$0")/flush-memcached"
@timabbott
timabbott Jun 26, 2016 Contributor

We should probably run this just after the ALTER TABLE line, before creating the new index, since the contamination issue is primarily around the new column on zerver_message.

@kou
kou Jun 27, 2016 Contributor

I think that we don't need to run the line hurry.

Because the new column on zerver_message isn't used until we set True to USING_PGROONGA and run restart-server. We'll do them after enable-pgroonga is finished.

@timabbott
Contributor

I posted a few smaller comments; the one larger issue is that I'd prefer to be using the Django migrations system for managing Zulip's database migrations, rather than some scripts off to the side. Here's what I think we could do that would work nicely:

  • We create a new (very small) Django app in the Zulip tree called "pgroonga", which just contains an single migration containing the code from enable-pgroonga (with a "reverse" migration with the code from disable-pgroonga).
  • We modify INSTALLED_APPS in zproject/settings.py to have pgroonga added if and only if USING_PGROONGA=True.
  • The instructions to install it become "set USING_PGROONGA and run ./manage.py migrate"
  • The instructions to remove it become "run ./manage.py migrate pgroonga zero and unset USING_PGROONGA"

Since that's basically just rearranging the code you've already written, I think this shouldn't be a lot of work.

@timabbott
Contributor

Also I pushed ea611c0; so you can rebase and merge your new docs with that, to eliminate the TODOs.

@kou
Contributor
kou commented Jul 2, 2016 edited

I created "pgroonga" Django application and moved PGroonga related SQL to the application.

Here are limitations by this change:

  • Users need to add SUPERUSER permission to zulip role in PostgreSQL temporary while enabling/disabling PGroonga. Because CREATE EXTENSION requires SUPERUSER permission.
  • We can't use CREATE INDEX CONCURRENTLY. Because we can't use CREATE INDEX CONCURRENTLY in transaction and Django migration uses transaction.
@timabbott
Contributor

Urgh, those are annoying limitations. The CREATE INDEX CONCURRENTLY issue is apparently fixed in Django 1.10, but that doesn't help us quite yet. The other issue we may need to resolve another way (don't want to change the permissions model). Sigh. I guess we could have a script run as root do the CREATE EXTENSION and the rest in the migration? Or perhaps better, we can do the CREATE EXTENSION as part of the installation of the extension...

I'll need to find more time to think about this later...

@kou
Contributor
kou commented Jul 9, 2016

I'll need to find more time to think about this later...

Does it mean that this pull request isn't merged soon...?

@timabbott
Contributor

No, I think this is important and I really want to get this merged soon, I'm just extremely busy right now and need to find the time. Sorry again about the delays!

@brainwane
Contributor
brainwane commented Aug 17, 2016 edited

@timabbott Good news: @kou can probably come to the office hour on August 25/26. So if you have some time to do some more thinking before then, you & kou can talk about it live next week. :)

(I've just reread the comments on this pull request. I think the main thing that needs to happen is that Tim needs to revisit his open questions from July 6th.)

kou added some commits Apr 24, 2016
@kou kou Support full text search for all languages.
GitHub: #615

It uses PGroonga instead of built-in PostgreSQL full text search
feature. Because built-in PostgreSQL full text search doesn't support
languages that don't put space between terms such as Japanese, Chinese
and so on. PGroonga supports all languages including Japanese and
Chinese.

This change doesn't enable PGroonga by default. It means that this
change doesn't change the current behavior.

You need to set True to USING_PGROONGA in zrpoject/settings.py to enable
PGroonga.
df3c4d8
@kou kou Add document about full-text search. 4dd7f71
@kou kou Move USING_PGROONGA configuration to local_settings_template.py.
Because it's an user configurable item.
197f867
@kou kou Add description why we need to custom search_path. d9ce99d
@kou kou Add a test for PGroonga based full-text search. 94c05b0
@kou kou Extract PGroonga enable/disable code as separated scripts.
It's for enabling PGroonga based full-text search feature on already
installed Zulip instance.
2466ffa
@kou kou Support deferred full-text search index update for PGroonga.
It reduces response time on sending message. It's good for users.
e980b7d
@kou kou Remove needless "su".
"su" is used in these scripts. We need to root privilege to use "su".
a9f2a7e
@kou kou Fix a typo e7efa63
@kou kou Add missing description about execution user 4181af0
@kou kou Specify operator class for PGroonga v1 API explicitly
It will be useful to work with PGroonga 2.x without any changes.
67c6ad5
@kou kou Set False to USING_PGROONGA by default in configuration template
It's for simplifying enable/disable instructions.
3cf4b01
@kou kou Use CREATE INDEX CONCURRENTLY
It's for supporting sending messages while creating index.
77aeaaf
@kou kou Use pgroonga project to enable/disable PGroonga
You can enable PGroonga by the following command:

    ./manage.py migrate pgroonga

You can disable PGroonga by the following command:

    ./manage.py migrate pgroonga zero

There are some limitations:

  * You need to add SUPERUSER permission to "zulip" role temporary while
    you enable/disable PGroonga. Because CREATE/DROP EXTENSION require
    SUPERUSER permission.

  * You can't send new messages until the PGroonga enable migration is
    finished. Because CREATE INDEX CONCURRENTLY can't be used in
    transaction.
3edd588
@kou kou doc: Remove needless superuser related descriptions 38d8975
@kou kou Drop PGroonga setup from migration d2bdc88
@kou kou doc: Describe about installing PGroonga by Puppet 9e8166c
@kou kou Add comment about CREATE INDEX CONCURRENTLY problem
2b1da51
@kou kou puppet: check "pgroonga" value
d0959d7
@kou kou test: Add end-to-end test for PGroonga search
It fails for now because Zulip requires match locations in characters
instead of bytes in a90b470.

PGroonga provides only in bytes version. I need to implement in
characters version.
edf85b0
@timabbott
Contributor

Cool, the test looks good; I think we should just make a second copy of highlight_string that uses bytes for this code path (it's probably a bug that the existing full-text search system, tsearch_extras doesn't measure in bytes).

@timabbott
Contributor

I did a bit of cleanups to the code, tweaked the tests a bit and squashed the commits, creating this: https://github.com/timabbott/zulip/tree/pgroonga-test, which I think is ready to merge (pending Travis CI passing).

@kou thanks so much for all your work on this PR (and also on building PGroonga itself)! I'm really excited about this feature. I'm going on vacation tomorrow but when I return, I plan to enable PGroonga on zulip.tabbott.net; with luck we can test this extensively enough that we can just remove tsearch-extras and use PGroonga as the only supported full-text search in Zulip (since given both options, I imagine everyone would want the every-language version).

Sorry again for all the delay in figuring out how to integrate this.

@timabbott
Contributor

Merged, thanks @kou!

@timabbott timabbott closed this Aug 27, 2016
@kou
Contributor
kou commented Aug 27, 2016

Thanks!
I'll start using Zulip with PGroonga. Some problems will be happen but we'll be able to fix them. Please mention me when full-text search related issues are reported.

@timabbott
Contributor

Will do!

@kou kou deleted the kou:full-text-search-support-all-languages branch Dec 1, 2016
@kou
Contributor
kou commented Dec 1, 2016

FYI: I have published benchmark result with English Wikipedia data against PGroonga, textsearch and pg_trgm: http://pgroonga.github.io/reference/pgroonga-versus-textsearch-and-pg-trgm.html

Search performance: PGroonga and textsearch are similar. pg_trgm is slower.

Index creation performance: PGroonga is the fastest module. PGroong is about 2x faster than textsearch.

@timabbott
Contributor

Awesome, thanks for following up on that @kou!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment