Connective boosts support #959

Merged
merged 4 commits into from
Feb 13, 2020

@heaven heaven commented Sep 30, 2019

Currently, only full-text searches support boosts, although boosts can be applied to almost any search, including those that contain no full-text query.

This patch adds boost support for such searches: it injects an empty DisMax search into the scope with a no-op *:* query, which carries the bf, bq, and boost query parameters.

An example:

Post.search do
  boost(0.5) do
    with(:is_promoted, true)
  end

  # adds a boost function (bf parameter)
  boost(function { sqrt(:promotion) })

  # adds a multiplicative boost function (boost parameter)
  boost_multiplicative(function { sqrt(:promotion) })
end
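
As a rough illustration of the description above, the sketch below builds the kind of Solr parameters a boost-only search could produce. It is not the patch's code: `boost_only_params` is a hypothetical helper, the indexed field names `is_promoted_b` and `promotion_f` are assumed, and the real patch may encode the boosts as local params inside `q` instead of as top-level parameters.

```ruby
# Hypothetical helper, not part of Sunspot: assembles request parameters
# for a search that carries boosts but no full-text query.
def boost_only_params(boost_queries: [], boost_functions: [], multiplicative_boosts: [])
  # The no-op *:* query matches every document, so only the boosts affect scoring.
  params = { q: '*:*', defType: 'edismax' }
  params[:bq] = boost_queries unless boost_queries.empty?
  params[:bf] = boost_functions unless boost_functions.empty?
  params[:boost] = multiplicative_boosts unless multiplicative_boosts.empty?
  params
end

# Field names below are assumptions made for the example.
params = boost_only_params(
  boost_queries: ['is_promoted_b:true^0.5'],   # boost(0.5) { with(:is_promoted, true) }
  boost_functions: ['sqrt(promotion_f)'],      # boost(function { sqrt(:promotion) })
  multiplicative_boosts: ['sqrt(promotion_f)'] # boost_multiplicative(function { ... })
)
puts params.inspect
```
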

@heaven
Contributor Author

heaven commented Sep 30, 2019

It seems the problem comes from the JSON parser when several fulltext queries are specified in a single search request.

This example fails:

search = Sunspot.search(Post) do
  fulltext('Post Ipsum') do
    boost_fields :body => 0.2
    minimum_match 1
  end

  boost(0.9) do
    with(:blog_id, 1)
  end

  boost(function() { div(field(:average_rating), 100) })

  fulltext('Post') do
    minimum_match 1
  end
end

This one passes:

search = Sunspot.search(Post) do
  fulltext('Post Ipsum') do
    boost_fields :body => 0.2
    minimum_match 1
  end

  boost(0.9) do
    with(:blog_id, 1)
  end

  boost(function() { div(field(:average_rating), 100) })
end

This happens only with the JSON format.

@serggl
Collaborator

serggl commented Oct 1, 2019

this is 👍 hope you'll sort out JSON issue

@heaven
Contributor Author

heaven commented Oct 1, 2019

Here are the raw docs that are being sent to Solr using the XML and JSON formats:

{
  :data => "<?xml version=\"1.0\" encoding=\"UTF-8\"?><add><doc boost=\"7.75\"><field name=\"id\">Post 1</field><field name=\"type\">Post</field><field name=\"type\">SuperClass</field><field name=\"type\">MockRecord</field><field name=\"class_name\">Post</field><field name=\"title_ss\">Post</field><field name=\"blog_id_i\">1</field><field name=\"category_ids_im\">3</field><field name=\"average_rating_ft\">30.0</field><field name=\"sort_title_s\">post</field><field name=\"primary_category_id_i\">3</field><field name=\"last_indexed_at_ds\">2019-10-01T13:07:36Z</field><field name=\"legacy_field_s\">legacy Post</field><field name=\"legacy_array_field_sm\">first string</field><field name=\"legacy_array_field_sm\">second string</field><field boost=\"2\" name=\"title_text\">Post</field><field boost=\"3\" name=\"text_array_text\">Post</field><field boost=\"3\" name=\"text_array_text\">Post</field><field name=\"body_textsv\">Lorem</field><field name=\"backwards_title_text\">tsoP</field><field name=\"custom_integer:3_i\">1</field></doc><doc boost=\"15.25\"><field name=\"id\">Post 2</field><field name=\"type\">Post</field><field name=\"type\">SuperClass</field><field name=\"type\">MockRecord</field><field name=\"class_name\">Post</field><field name=\"title_ss\">Post</field><field name=\"blog_id_i\">2</field><field name=\"category_ids_im\">2</field><field name=\"average_rating_ft\">60.0</field><field name=\"sort_title_s\">post</field><field name=\"primary_category_id_i\">2</field><field name=\"last_indexed_at_ds\">2019-10-01T13:07:36Z</field><field name=\"legacy_field_s\">legacy Post</field><field name=\"legacy_array_field_sm\">first string</field><field name=\"legacy_array_field_sm\">second string</field><field boost=\"2\" name=\"title_text\">Post</field><field boost=\"3\" name=\"text_array_text\">Post</field><field boost=\"3\" name=\"text_array_text\">Post</field><field name=\"body_textsv\">Ipsum</field><field name=\"backwards_title_text\">tsoP</field><field 
name=\"custom_integer:2_i\">1</field></doc><doc boost=\"22.75\"><field name=\"id\">Post 3</field><field name=\"type\">Post</field><field name=\"type\">SuperClass</field><field name=\"type\">MockRecord</field><field name=\"class_name\">Post</field><field name=\"title_ss\">Post</field><field name=\"blog_id_i\">3</field><field name=\"category_ids_im\">1</field><field name=\"average_rating_ft\">90.0</field><field name=\"sort_title_s\">post</field><field name=\"primary_category_id_i\">1</field><field name=\"last_indexed_at_ds\">2019-10-01T13:07:36Z</field><field name=\"legacy_field_s\">legacy Post</field><field name=\"legacy_array_field_sm\">first string</field><field name=\"legacy_array_field_sm\">second string</field><field boost=\"2\" name=\"title_text\">Post</field><field boost=\"3\" name=\"text_array_text\">Post</field><field boost=\"3\" name=\"text_array_text\">Post</field><field name=\"body_textsv\">Dolor</field><field name=\"backwards_title_text\">tsoP</field><field name=\"custom_integer:1_i\">1</field></doc></add>",
  :headers => { "Content-Type" => "text/xml" }
}

{
  :data    => "{\"add\":{\"boost\":7.75,\"doc\":{\"id\":\"Post 1\",\"type\":[\"Post\",\"SuperClass\",\"MockRecord\"],\"class_name\":\"Post\",\"title_ss\":\"Post\",\"blog_id_i\":\"1\",\"category_ids_im\":\"3\",\"average_rating_ft\":\"30.0\",\"sort_title_s\":\"post\",\"primary_category_id_i\":\"3\",\"last_indexed_at_ds\":\"2019-10-01T13:24:02Z\",\"legacy_field_s\":\"legacy Post\",\"legacy_array_field_sm\":[\"first string\",\"second string\"],\"title_text\":{\"boost\":2,\"value\":\"Post\"},\"text_array_text\":{\"boost\":3,\"value\":[\"Post\",\"Post\"]},\"body_textsv\":\"Lorem\",\"backwards_title_text\":\"tsoP\",\"custom_integer:3_i\":\"1\"}},\"add\":{\"boost\":15.25,\"doc\":{\"id\":\"Post 2\",\"type\":[\"Post\",\"SuperClass\",\"MockRecord\"],\"class_name\":\"Post\",\"title_ss\":\"Post\",\"blog_id_i\":\"2\",\"category_ids_im\":\"2\",\"average_rating_ft\":\"60.0\",\"sort_title_s\":\"post\",\"primary_category_id_i\":\"2\",\"last_indexed_at_ds\":\"2019-10-01T13:24:02Z\",\"legacy_field_s\":\"legacy Post\",\"legacy_array_field_sm\":[\"first string\",\"second string\"],\"title_text\":{\"boost\":2,\"value\":\"Post\"},\"text_array_text\":{\"boost\":3,\"value\":[\"Post\",\"Post\"]},\"body_textsv\":\"Ipsum\",\"backwards_title_text\":\"tsoP\",\"custom_integer:2_i\":\"1\"}},\"add\":{\"boost\":22.75,\"doc\":{\"id\":\"Post 3\",\"type\":[\"Post\",\"SuperClass\",\"MockRecord\"],\"class_name\":\"Post\",\"title_ss\":\"Post\",\"blog_id_i\":\"3\",\"category_ids_im\":\"1\",\"average_rating_ft\":\"90.0\",\"sort_title_s\":\"post\",\"primary_category_id_i\":\"1\",\"last_indexed_at_ds\":\"2019-10-01T13:24:02Z\",\"legacy_field_s\":\"legacy Post\",\"legacy_array_field_sm\":[\"first string\",\"second string\"],\"title_text\":{\"boost\":2,\"value\":\"Post\"},\"text_array_text\":{\"boost\":3,\"value\":[\"Post\",\"Post\"]},\"body_textsv\":\"Dolor\",\"backwards_title_text\":\"tsoP\",\"custom_integer:1_i\":\"1\"}}}",
  :headers => { "Content-Type" => "application/json" }
}

Here's the formatted XML:

<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<add>
    <doc boost=\"7.75\">
        <field name=\"id\">Post 1</field>
        <field name=\"type\">Post</field>
        <field name=\"type\">SuperClass</field>
        <field name=\"type\">MockRecord</field>
        <field name=\"class_name\">Post</field>
        <field name=\"title_ss\">Post</field>
        <field name=\"blog_id_i\">1</field>
        <field name=\"category_ids_im\">3</field>
        <field name=\"average_rating_ft\">30.0</field>
        <field name=\"sort_title_s\">post</field>
        <field name=\"primary_category_id_i\">3</field>
        <field name=\"last_indexed_at_ds\">2019-10-01T13:07:36Z</field>
        <field name=\"legacy_field_s\">legacy Post</field>
        <field name=\"legacy_array_field_sm\">first string</field>
        <field name=\"legacy_array_field_sm\">second string</field>
        <field boost=\"2\" name=\"title_text\">Post</field>
        <field boost=\"3\" name=\"text_array_text\">Post</field>
        <field boost=\"3\" name=\"text_array_text\">Post</field>
        <field name=\"body_textsv\">Lorem</field>
        <field name=\"backwards_title_text\">tsoP</field>
        <field name=\"custom_integer:3_i\">1</field>
    </doc>
    <doc boost=\"15.25\">
        <field name=\"id\">Post 2</field>
        <field name=\"type\">Post</field>
        <field name=\"type\">SuperClass</field>
        <field name=\"type\">MockRecord</field>
        <field name=\"class_name\">Post</field>
        <field name=\"title_ss\">Post</field>
        <field name=\"blog_id_i\">2</field>
        <field name=\"category_ids_im\">2</field>
        <field name=\"average_rating_ft\">60.0</field>
        <field name=\"sort_title_s\">post</field>
        <field name=\"primary_category_id_i\">2</field>
        <field name=\"last_indexed_at_ds\">2019-10-01T13:07:36Z</field>
        <field name=\"legacy_field_s\">legacy Post</field>
        <field name=\"legacy_array_field_sm\">first string</field>
        <field name=\"legacy_array_field_sm\">second string</field>
        <field boost=\"2\" name=\"title_text\">Post</field>
        <field boost=\"3\" name=\"text_array_text\">Post</field>
        <field boost=\"3\" name=\"text_array_text\">Post</field>
        <field name=\"body_textsv\">Ipsum</field>
        <field name=\"backwards_title_text\">tsoP</field>
        <field name=\"custom_integer:2_i\">1</field>
    </doc>
    <doc boost=\"22.75\">
        <field name=\"id\">Post 3</field>
        <field name=\"type\">Post</field>
        <field name=\"type\">SuperClass</field>
        <field name=\"type\">MockRecord</field>
        <field name=\"class_name\">Post</field>
        <field name=\"title_ss\">Post</field>
        <field name=\"blog_id_i\">3</field>
        <field name=\"category_ids_im\">1</field>
        <field name=\"average_rating_ft\">90.0</field>
        <field name=\"sort_title_s\">post</field>
        <field name=\"primary_category_id_i\">1</field>
        <field name=\"last_indexed_at_ds\">2019-10-01T13:07:36Z</field>
        <field name=\"legacy_field_s\">legacy Post</field>
        <field name=\"legacy_array_field_sm\">first string</field>
        <field name=\"legacy_array_field_sm\">second string</field>
        <field boost=\"2\" name=\"title_text\">Post</field>
        <field boost=\"3\" name=\"text_array_text\">Post</field>
        <field boost=\"3\" name=\"text_array_text\">Post</field>
        <field name=\"body_textsv\">Dolor</field>
        <field name=\"backwards_title_text\">tsoP</field>
        <field name=\"custom_integer:1_i\">1</field>
    </doc>
</add>

And the formatted JSON (I had to manually convert it to an array of separate add hashes):

[{"add"=>
   {"boost"=>7.75,
    "doc"=>
     {"id"=>"Post 1",
      "type"=>["Post", "SuperClass", "MockRecord"],
      "class_name"=>"Post",
      "title_ss"=>"Post",
      "blog_id_i"=>"1",
      "category_ids_im"=>"3",
      "average_rating_ft"=>"30.0",
      "sort_title_s"=>"post",
      "primary_category_id_i"=>"3",
      "last_indexed_at_ds"=>"2019-10-01T13:24:02Z",
      "legacy_field_s"=>"legacy Post",
      "legacy_array_field_sm"=>["first string", "second string"],
      "title_text"=>{"boost"=>2, "value"=>"Post"},
      "text_array_text"=>{"boost"=>3, "value"=>["Post", "Post"]},
      "body_textsv"=>"Lorem",
      "backwards_title_text"=>"tsoP",
      "custom_integer:3_i"=>"1"}}},
 {"add"=>
   {"boost"=>15.25,
    "doc"=>
     {"id"=>"Post 2",
      "type"=>["Post", "SuperClass", "MockRecord"],
      "class_name"=>"Post",
      "title_ss"=>"Post",
      "blog_id_i"=>"2",
      "category_ids_im"=>"2",
      "average_rating_ft"=>"60.0",
      "sort_title_s"=>"post",
      "primary_category_id_i"=>"2",
      "last_indexed_at_ds"=>"2019-10-01T13:24:02Z",
      "legacy_field_s"=>"legacy Post",
      "legacy_array_field_sm"=>["first string", "second string"],
      "title_text"=>{"boost"=>2, "value"=>"Post"},
      "text_array_text"=>{"boost"=>3, "value"=>["Post", "Post"]},
      "body_textsv"=>"Ipsum",
      "backwards_title_text"=>"tsoP",
      "custom_integer:2_i"=>"1"}}},
 {"add"=>
   {"boost"=>22.75,
    "doc"=>
     {"id"=>"Post 3",
      "type"=>["Post", "SuperClass", "MockRecord"],
      "class_name"=>"Post",
      "title_ss"=>"Post",
      "blog_id_i"=>"3",
      "category_ids_im"=>"1",
      "average_rating_ft"=>"90.0",
      "sort_title_s"=>"post",
      "primary_category_id_i"=>"1",
      "last_indexed_at_ds"=>"2019-10-01T13:24:02Z",
      "legacy_field_s"=>"legacy Post",
      "legacy_array_field_sm"=>["first string", "second string"],
      "title_text"=>{"boost"=>2, "value"=>"Post"},
      "text_array_text"=>{"boost"=>3, "value"=>["Post", "Post"]},
      "body_textsv"=>"Dolor",
      "backwards_title_text"=>"tsoP",
      "custom_integer:1_i"=>"1"}}}]

The select query looks identical for both XML and JSON formats:

{
  :data => {
    :fq    => ["type:Post"],
    :q     => "(_query_:\"{!edismax qf='body_textsv^0.2 title_text text_array_text backwards_title_text tags_textv' mm='1' bq='blog_id_i:1^0.9' bf='div(field(average_rating_ft),100)'}Post Ipsum\" AND _query_:\"{!edismax qf='title_text text_array_text body_textsv backwards_title_text tags_textv' mm='2'}Post\")",
    :fl    => "* score",
    :start => 0,
    :rows  => 30 }
}

And here are the responses for the XML and JSON requests, respectively (the global maxScore and the score of each doc differ between the XML and JSON protocols):

{ "responseHeader" => { "status" => 0, "QTime" => 5 },
  "response"       =>
    { "numFound" => 3,
      "start"    => 0,
      "maxScore" => 36.18859,
      "docs"     =>
        [{ "id"                => "Post 2",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T13:48:42Z",
          "body_textsv"        => ["Ipsum"],
          "_version_"          => 1646199017083764736,
          "score"              => 36.18859 },
        { "id"                 => "Post 3",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T13:48:42Z",
          "body_textsv"        => ["Dolor"],
          "_version_"          => 1646199017085861888,
          "score"              => 34.87313 },
        { "id"                 => "Post 1",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T13:48:42Z",
          "body_textsv"        => ["Lorem"],
          "_version_"          => 1646199017071181824,
          "score"              => 13.51507 }] } }

{ "responseHeader" => { "status" => 0, "QTime" => 6 },
  "response"       =>
    { "numFound" => 3,
      "start"    => 0,
      "maxScore" => 13.21888,
      "docs"     =>
        [{ "id"                => "Post 3",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T13:50:37Z",
          "body_textsv"        => ["Dolor"],
          "_version_"          => 1646199138248818688,
          "score"              => 13.21888 },
        { "id"                 => "Post 2",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T13:50:37Z",
          "body_textsv"        => ["Ipsum"],
          "_version_"          => 1646199138247770112,
          "score"              => 13.090725 },
        { "id"                 => "Post 1",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T13:50:37Z",
          "body_textsv"        => ["Lorem"],
          "_version_"          => 1646199138243575808,
          "score"              => 4.8533697 }] } }

@heaven
Contributor Author

heaven commented Oct 1, 2019

Even for the simplest search, the XML and JSON responses look different.

For this example:

search = Sunspot.search(Post) do
  fulltext('Post Ipsum') do
    minimum_match 1
  end
end

When using XML format (the maxScore and scores in the results are higher):

{ "responseHeader" => { "status" => 0, "QTime" => 3 },
  "response"       =>
    { "numFound" => 3,
      "start"    => 0,
      "maxScore" => 32.528587,
      "docs"     =>
        [{ "id"                => "Post 2",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T14:37:09Z",
          "body_textsv"        => ["Ipsum"],
          "_version_"          => 1646202065285808128,
          "score"              => 32.528587 },
        { "id"                 => "Post 3",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T14:37:09Z",
          "body_textsv"        => ["Dolor"],
          "_version_"          => 1646202065288953856,
          "score"              => 15.473748 },
        { "id"                 => "Post 1",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T14:37:09Z",
          "body_textsv"        => ["Lorem"],
          "_version_"          => 1646202065279516672,
          "score"              => 5.8026557 }] } }

When using JSON format:

{ "responseHeader" => { "status" => 0, "QTime" => 4 },
  "response"       =>
    { "numFound" => 3,
      "start"    => 0,
      "maxScore" => 17.054836,
      "docs"     =>
        [{ "id"                => "Post 2",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T14:37:43Z",
          "body_textsv"        => ["Ipsum"],
          "_version_"          => 1646202100724531200,
          "score"              => 17.054836 },
        { "id"                 => "Post 3",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T14:37:43Z",
          "body_textsv"        => ["Dolor"],
          "_version_"          => 1646202100726628352,
          "score"              => 5.8026557 },
        { "id"                 => "Post 1",
          "title_ss"           => "Post",
          "last_indexed_at_ds" => "2019-10-01T14:37:43Z",
          "body_textsv"        => ["Lorem"],
          "_version_"          => 1646202100722434048,
          "score"              => 1.9342185 }] } }

Thus I'm assuming there's another discrepancy between these two protocols that is unrelated to this feature. The protocol affects how documents are indexed, and index-time boosts are calculated differently.

BTW, we're on Solr 8 now, and index-time boosts have been removed from Solr altogether.
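
To make the gap between the two responses to the simple search above easier to see, here is a small sketch that divides each XML-indexed score by the JSON-indexed score for the same document. The scores are copied from those responses; the variable names are only for illustration.

```ruby
# Scores copied from the two responses to the simple 'Post Ipsum' search.
xml_scores  = { 'Post 2' => 32.528587, 'Post 3' => 15.473748, 'Post 1' => 5.8026557 }
json_scores = { 'Post 2' => 17.054836, 'Post 3' => 5.8026557, 'Post 1' => 1.9342185 }

# Ratio of the XML-indexed score to the JSON-indexed score, per document.
ratios = xml_scores.map { |id, score| [id, score / json_scores.fetch(id)] }.to_h

ratios.each { |id, ratio| puts format('%s: XML/JSON = %.3f', id, ratio) }
```

The ratio is not the same for every document, which suggests the difference arises from how each document was indexed and not from a single constant factor applied at query time.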

@serggl
Collaborator

serggl commented Oct 2, 2019

@heaven I'm not sure I follow. Are you saying that Solr now returns different results depending on which protocol was used for the query?

@heaven
Contributor Author

heaven commented Oct 2, 2019

@serggl the opposite – Solr returns different results depending on what protocol was used while indexing documents. The same docs indexed with XML protocol get higher scores than the same documents indexed with JSON protocol.

Even for the simple search I described in an earlier comment on #959, the same docs indexed using different transports get different scores when searching.

I wonder whether Solr treats these two forms differently, somehow multiplying the resulting boost in one case:
XML:

<field boost=\"3\" name=\"text_array_text\">Post</field>
<field boost=\"3\" name=\"text_array_text\">Post</field>

JSON:

"text_array_text"=>{"boost"=>3, "value"=>["Post", "Post"]},

I will try removing all index-time boosts from the Post model and see how that changes the situation. For now, it feels very much like boosts described in XML are applied differently from those in JSON docs.

@heaven
Contributor Author

heaven commented Oct 2, 2019

So my assumption above was correct, and this is what causes the problem:

text :text_array, :boost => 3 do

With the XML transport, this boost is applied once per array element; with JSON, it is applied just once. Once I removed this boost, XML and JSON started returning equal scores in the results.
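
A minimal sketch of that difference follows. It assumes that boosts on repeated instances of the same field are multiplied together at index time; `effective_field_boost` is a hypothetical helper, not Solr or Sunspot code.

```ruby
# Assumption: boosts on repeated instances of one field multiply together.
def effective_field_boost(per_instance_boosts)
  per_instance_boosts.reduce(1.0) { |product, boost| product * boost }
end

values = %w[Post Post] # the two values of text_array_text

# XML transport: one <field boost="3"> element per array value.
xml_boost = effective_field_boost(values.map { 3 })

# JSON transport: a single {"boost" => 3, "value" => [...]} entry.
json_boost = effective_field_boost([3])

puts "XML: #{xml_boost}, JSON: #{json_boost}"
```

Under this assumption the gap grows with the length of the array, so longer arrays would diverge further between the two transports.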

…ncy in boost calculation between XML and JSON protocols

* Restore test case for boosted queries combined with full-texts
@serggl
Collaborator

serggl commented Oct 3, 2019

I'm not sure how we should deal with this index-time boost issue you found...
But if you say that current Solr versions have deprecated them, that makes it not too significant.
So this looks good to merge to me.

@heaven
Contributor Author

heaven commented Oct 3, 2019 via email

@mlh758
Collaborator

mlh758 commented Oct 3, 2019 via email

@heaven
Contributor Author

heaven commented Feb 12, 2020

Hi, how about merging this one?

Collaborator

@mlh758 mlh758 left a comment

I think this looks good.

@serggl serggl merged commit 4b8d5fa into sunspot:master Feb 13, 2020
@heaven heaven deleted the connective-boosts branch February 14, 2020 18:47
3 participants