New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comments on "Implementing faceted search with Django and PostgreSQL" #1

Open
simonw opened this Issue Oct 5, 2017 · 4 comments

Comments

Projects
None yet
2 participants
@simonw
Owner

simonw commented Oct 5, 2017

@pauloxnet

This comment has been minimized.

Show comment
Hide comment
@pauloxnet

pauloxnet Oct 6, 2017

Hi Simon, and thank you for sharing your article.

We used the postgresql's full-text search module in django for some projects at work and I think that striping html tags from your Entry's body is not necessary.

I tried to apply to_tsvector on a snippet of html code from your article, before and after strip_tags, and the resulting vectors are the same:

SQL Fiddle

PostgreSQL 9.6 Schema Setup:

Query 1:

SELECT to_tsvector('<p>I''ve added <a href="https://simonwillison.net/search/">a faceted search engine</a> to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:</p><ul><li><a href="https://simonwillison.net/search/?q=postgresql">All content matching "postgresql"</a></li><li><a href="https://simonwillison.net/search/?q=django&amp;type=quotation">Just quotations matching "django"</a></li><li><a href="https://simonwillison.net/search/?q=python+javascript&amp;tag=mozilla&amp;year=2007">All content matching "python" and "javascript" with the tag "mozilla" posted in 2007</a></li></ul><p>It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.</p>')

Results:

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        to_tsvector |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007':70 'ad':3 'also':72 'appli':89 'back':86 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':104 'count':75 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,74 'feel':98 'filter':26,29,32,93 'general':97 'get':85,95 'javascript':63 'like':24 'mani':81 'match':52,56,60 'mozilla':67 'navig':107 'one':90 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':73 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':82 'search':6,18,20,49 'shape':101 'sql':23 'support':15 'tag':28,66 'tell':79 'text':17 'type':35 've':2 'vs':37,39 |

Query 2:

SELECT to_tsvector('I''ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:All content matching "postgresql"Just quotations matching "django"All content matching "python" and "javascript" with the tag "mozilla" posted in 2007It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.')

Results:

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          to_tsvector |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007it':70 'ad':3 'also':71 'appli':88 'back':85 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':103 'count':74 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,73 'feel':97 'filter':26,29,32,92 'general':96 'get':84,94 'javascript':63 'like':24 'mani':80 'match':52,56,60 'mozilla':67 'navig':106 'one':89 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':72 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':81 'search':6,18,20,49 'shape':100 'sql':23 'support':15 'tag':28,66 'tell':78 'text':17 'type':35 've':2 'vs':37,39 |

pauloxnet commented Oct 6, 2017

Hi Simon, and thank you for sharing your article.

We used the postgresql's full-text search module in django for some projects at work and I think that striping html tags from your Entry's body is not necessary.

I tried to apply to_tsvector on a snippet of html code from your article, before and after strip_tags, and the resulting vectors are the same:

SQL Fiddle

PostgreSQL 9.6 Schema Setup:

Query 1:

SELECT to_tsvector('<p>I''ve added <a href="https://simonwillison.net/search/">a faceted search engine</a> to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:</p><ul><li><a href="https://simonwillison.net/search/?q=postgresql">All content matching "postgresql"</a></li><li><a href="https://simonwillison.net/search/?q=django&amp;type=quotation">Just quotations matching "django"</a></li><li><a href="https://simonwillison.net/search/?q=python+javascript&amp;tag=mozilla&amp;year=2007">All content matching "python" and "javascript" with the tag "mozilla" posted in 2007</a></li></ul><p>It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.</p>')

Results:

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        to_tsvector |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007':70 'ad':3 'also':72 'appli':89 'back':86 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':104 'count':75 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,74 'feel':98 'filter':26,29,32,93 'general':97 'get':85,95 'javascript':63 'like':24 'mani':81 'match':52,56,60 'mozilla':67 'navig':107 'one':90 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':73 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':82 'search':6,18,20,49 'shape':101 'sql':23 'support':15 'tag':28,66 'tell':79 'text':17 'type':35 've':2 'vs':37,39 |

Query 2:

SELECT to_tsvector('I''ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:All content matching "postgresql"Just quotations matching "django"All content matching "python" and "javascript" with the tag "mozilla" posted in 2007It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.')

Results:

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          to_tsvector |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007it':70 'ad':3 'also':71 'appli':88 'back':85 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':103 'count':74 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,73 'feel':97 'filter':26,29,32,92 'general':96 'get':84,94 'javascript':63 'like':24 'mani':80 'match':52,56,60 'mozilla':67 'navig':106 'one':89 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':72 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':81 'search':6,18,20,49 'shape':100 'sql':23 'support':15 'tag':28,66 'tell':78 'text':17 'type':35 've':2 'vs':37,39 |
@simonw

This comment has been minimized.

Show comment
Hide comment
@simonw

simonw Oct 6, 2017

Owner

You're absolutely right! Thanks, that's really interesting. I'll update the entry to reflect this.

Owner

simonw commented Oct 6, 2017

You're absolutely right! Thanks, that's really interesting. I'll update the entry to reflect this.

@simonw

This comment has been minimized.

Show comment
Hide comment
@simonw

simonw Oct 6, 2017

Owner

Added this to the post:

Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL ts_vector() function already handles tag removal. Sure enough, executing SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>') using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.

I'm leaving the regexp_replace stuff in there though as it's a nice demonstration of how to run regular expressions inside of PostgreSQL, even though they aren't actually needed here.

Owner

simonw commented Oct 6, 2017

Added this to the post:

Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL ts_vector() function already handles tag removal. Sure enough, executing SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>') using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.

I'm leaving the regexp_replace stuff in there though as it's a nice demonstration of how to run regular expressions inside of PostgreSQL, even though they aren't actually needed here.

@pauloxnet

This comment has been minimized.

Show comment
Hide comment
@pauloxnet

pauloxnet Oct 7, 2017

Thank you for mentioning me.

I think you have done well to leave the regexp_replace stuff, because it is very interesting.

I think I will use some of the ideas contained in your article in my upcoming talk about full-text search in django with postgresql, obviously quoting you.

pauloxnet commented Oct 7, 2017

Thank you for mentioning me.

I think you have done well to leave the regexp_replace stuff, because it is very interesting.

I think I will use some of the ideas contained in your article in my upcoming talk about full-text search in django with postgresql, obviously quoting you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment