Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to distinct on a multi-table search #137

Closed
tomcardoso opened this issue Jun 19, 2020 · 20 comments
Closed

Unable to distinct on a multi-table search #137

tomcardoso opened this issue Jun 19, 2020 · 20 comments

Comments

@tomcardoso
Copy link

tomcardoso commented Jun 19, 2020

Hi there, I'm running into an issue where I'm unable to dedupe my results after using web_search on a model and its relations. Here's what I'm doing:

Case
  .where(nil)
  .joins(:case_events)
  .web_search({
    case_number: 'cobalt',
    short_title: 'cobalt',
    case_events: {
      party_name: 'cobalt',
      short_title: 'cobalt'
    }
  }, false)

So far so good. Except, this returns four results (since I have matches in both the parent Case model and its relation CaseEvent), which becomes two results after the distinct:

irb(main):041:0> Case.joins(:case_events).web_search({ case_number: 'cobalt', short_title: 'cobalt', case_events: { party_name: 'cobalt', short_title: 'cobalt' }}, false).distinct
=> #<ActiveRecord::Relation [#<Case id: 592, case_number: "CV-13-00495078-0000", short_title: "IGERNAN LIMITED v COBALT CONSTRUCTION AND CONSULTI...", case_type_id: 1, court_id: 1, created_at: "2020-06-19 20:54:23", updated_at: "2020-06-19 20:54:24">, #<Case id: 592, case_number: "CV-13-00495078-0000", short_title: "IGERNAN LIMITED v COBALT CONSTRUCTION AND CONSULTI...", case_type_id: 1, court_id: 1, created_at: "2020-06-19 20:54:23", updated_at: "2020-06-19 20:54:24">]>

As you can see, they both have the same id, 592, so I'd like to dedupe further. I try doing a .select("DISTINCT ON (id) *") on that first command up there, but then I get this:

irb(main):042:0> Case.joins(:case_events).web_search({ case_number: 'cobalt', short_title: 'cobalt', case_events: { party_name: 'cobalt', short_title: 'cobalt' }}, false).distinct.select("DISTINCT ON (id) *")
Traceback (most recent call last):
ActiveRecord::StatementInvalid (PG::SyntaxError: ERROR:  syntax error at or near "DISTINCT")
LINE 1: ... 'cobalt'::text)), 0) AS "rank86178496123729701", DISTINCT O...
                                                             ^

I realize I could just do a uniq, but I need to keep this in ActiveRecord for several reasons. I'm a bit out of my depth here… Any suggestions?

@tomcardoso
Copy link
Author

Hi, just wanted to follow up… @gregmolnar @simi any thoughts on this? I tried a bunch of approaches but I don't have much experience with ActiveRecord (or Postgres, for that matter), so I've struck out so far.

@simi
Copy link
Contributor

simi commented Jul 2, 2020

Is this problem present for other searches as well, right? It is not related to web search only.

All you wrote makes sense actual, you need to de-duplicate results. Would grouping by Case id help here?

@tomcardoso
Copy link
Author

I'm not sure, I haven't tried other searches… but I keep running into that syntax error whenever I try to run any kind of select or group command. Could you send me an example of what you'd do? I'll test it out and report back here.

@gregmolnar
Copy link
Member

@tomcardoso you can't use .select("DISTINCT ON (id) *") with textacular, because that messes up the generating SQL query. Unless your dataset will be huge, using uniq would be the simplest resolution. When I get some free time I will look into why distinct doesn't work as expected.

@tomcardoso
Copy link
Author

@gregmolnar Thanks. My results could be in the tens of thousands easily, if not more (my database grows by between 2,000-10,000 entries a day). I'm also using web_search as part of a chain of filters and whatnot which gets passed to gems like will-paginate, so I want to make sure to keep it in ActiveRecord if at all possible.

@gregmolnar
Copy link
Member

gregmolnar commented Jul 3, 2020 via email

@gregmolnar
Copy link
Member

@tomcardoso I tried to reproduce this and wrote a spec here: e7f7da5
But it passes, so I probably miss something. Can you help me to cover the case you are having?

@tomcardoso
Copy link
Author

@gregmolnar Thanks, took me a minute to figure out what was going on… try doing this instead:

expect(
  WebComicWithSearchableName
    .joins(:characters)
    .web_search({name: "Batman", characters: { name: "Batman" }}, false)
    .distinct
    .size
  ).to eq(1)

I'm using the false there to allow me to search for hits on either result (I want to be able to search both the WebComic and Character and return results for either). If you do that and add a distinct, it will return 2 instead of 1.

@gregmolnar
Copy link
Member

Ah, I missed that part. If you do that, an OR condition is used and since we return the rank as well, which will have a unique alias for both, distinct can't return a unique row, since the alias is different.
To get around this you can do something like this:

Case
  .joins(:case_events)
  .select("DISTINCT ON (cases.id) cases.*")
  .order("cases.id")
  .web_search({
    case_number: 'cobalt',
    short_title: 'cobalt',
    case_events: {
      party_name: 'cobalt',
      short_title: 'cobalt'
    }
  }, false)

Let me know if that still doesn't work.

@tomcardoso
Copy link
Author

tomcardoso commented Jul 8, 2020

Thanks @gregmolnar. But with this, I'd lose the rank-based sorting, right? Is there any way to retain that? I realize some entries would have two rankings – I think it'd make sense to pick the entry with the higher rank value and drop the other.

@tomcardoso
Copy link
Author

(I did test your code out, and it works! But I assume that I'd lose the rank-based sorting.)

@gregmolnar
Copy link
Member

gregmolnar commented Jul 8, 2020

@tomcardoso indeed, you would lose the ranking based sorting.

Is there any way to retain that?

I don't know any solution from the top of my head, but I will think about it.

@tomcardoso
Copy link
Author

Thanks. I tried something dumb like:

Case
  .joins(:case_events)
  .select("DISTINCT ON (cases.id) cases.*")
  .order("cases.rank_web")
  .web_search({ case_number: 'cobalt', short_title: 'cobalt', case_events: { party_name: 'cobalt', short_title: 'cobalt' }}, false, rank_alias = 'rank_web')

But that didn't work:

Traceback (most recent call last):
ActiveRecord::StatementInvalid (PG::UndefinedColumn: ERROR:  column cases.rank_web does not exist)
LINE 1: ...h_to_tsquery('english', 'cobalt'::text)) ORDER BY cases.rank...
                                                             ^

@gregmolnar
Copy link
Member

There is a unique alias generated for the ranking, so you can't do it like that. And the column you use with distinct on has to be the first column you order on, so even if you would use the correct alias, it would fail with another error.

@tomcardoso
Copy link
Author

Perhaps for cases like this, we almost need a special Textacular version of distinct or something similar…

@simi
Copy link
Contributor

simi commented Jul 10, 2020

I think the query you're looking for is this:

SELECT web_comics.*
FROM "web_comics"
INNER JOIN "characters" ON "characters"."web_comic_id" = "web_comics"."id"
WHERE (to_tsvector('english', "web_comics"."name"::text) @@ websearch_to_tsquery('english', 'Batman'::text) OR to_tsvector('english', "characters"."name"::text) @@ websearch_to_tsquery('english', 'Batman'::text))
GROUP BY "web_comics"."id"
ORDER BY max(COALESCE(ts_rank(to_tsvector('english', "web_comics"."name"::text), websearch_to_tsquery('english', 'Batman'::text)), 0) + COALESCE(ts_rank(to_tsvector('english', "characters"."name"::text), websearch_to_tsquery('english', 'Batman'::text)), 0))

^ group by ID and order by max is the key to get deduplicated result set.

I'm not sure how to get this query out of Textacular directly. This hacky version was enough for me, but I think this deservers much better API if we would like to make it into master. This is just POC:

# monkey-patch
module Textacular
  # hacked version of web_search to not assemble final query
  def web_search_similarities_and_conditions(query = '', exclusive = true, rank_alias = nil)
    exclusive, query = munge_exclusive_and_query(exclusive, query)
    parsed_query_hash = parse_query_hash(query)
    similarities, conditions = web_similarities_and_conditions(parsed_query_hash)
  end
end

# passing spec
it "abc" do
  WebComicWithSearchableName.create(
    name: 'Batman',
    author: 'Bill Finger',
    characters: Character.create([
      { name: "Batman" },
      { name: "Robin" },
    ])
  )

  order, conditions = WebComicWithSearchableName.web_search_similarities_and_conditions({name: "Batman", characters: { name: "Batman" }}, false)

  r = WebComicWithSearchableName
    .joins(:characters)
    .where(conditions.join(' OR ')) # depends on exclusive, could be AND as well
    .order(Arel.sql("max(#{order.join(' + ')})")) # use aggregation function to pick highest rank for dupilcates
    .group(:id) # group by ID to deduplicate

  expect(r.length).to eq(1)
end

@simi
Copy link
Contributor

simi commented Jul 10, 2020

Looking again at this, for those advanced cases Textacular can just provide related SQL fragments and let user to use it as needed. Any ideas for API around this?

@tomcardoso
Copy link
Author

tomcardoso commented Jul 13, 2020

Huh! This is all well beyond my comprehension, but if it works, great. I'm happy to test this on our data when it's incorporated into the Textacular API generally (assuming there's an interest in doing so). Could it be as simple as adding an extra param telling Textacular whether to assemble the final query?

@tomcardoso
Copy link
Author

@simi and @gregmolnar, one more question if you don't mind. Your code has worked great, but it's unfortunately broken the .size function for me since it's now returning the size of each case's case_events. For instance, using the order and conditions code and then doing this:

cases
  .joins(:case_events)
  .where(conditions.join(' OR '))
  .group(:id)
  .order(Arel.sql("max(#{order.join(' + ')})"))
  .size

Returns a hash like so:

irb(main):042:0> cases.joins(:case_events).where(conditions.join(' OR ')).group(:id).order(Arel.sql("max(#{order.join(' + ')})")).size
=> {6550=>1, 6958=>10}

Where each key is the case ID and each value is the number of case events – but I just want .size to be able to return 2. Is there anything I can do to change the grouping back to counting the overall number of items in the relation, instead of just the case events in each case?

@tomcardoso
Copy link
Author

Oh – is the answer to just use the results of that query to build a new, final one, like so?

first_query = cases
  .joins(:case_events)
  .where(conditions.join(' OR '))
  .group(:id)
  .order(Arel.sql("max(#{order.join(' + ')})"))

final_query = cases.where(:id => first_query)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants