Nested Documents #89

Crossener · 2014-02-12T09:09:55Z

How can I use Mongo Connector to index fields of embedded documents in Solr? For example, the field StreetName of:

{
  "_id" : ObjectId("52fa3674395d4602b8be4a1b"),
  "AddressDirectory" : {
    "Owner" : "Mayank",
    "Age" : "24",
    "Company" : "BIPL",
    "Address" : {
      "HouseNo" : "4",
      "StreetName" : "Rohini",
      "City" : "Delhi"
    }
  }
}

The manual says the connector flattens nested documents, but in Solr I can't see the flattened field StreetName.

llvtt · 2014-02-12T22:14:09Z

That manual is not for this project ("Mongo Connector"), it is for a different project called "solr mongo connector". The manual for this project is located here.

Mongo Connector doesn't do any kind of restructuring of MongoDB documents beyond excluding fields not listed in schema.xml. Nested documents are turned into strings, so the document you provided would look something like this in Solr:

{
    "_id": "52fa3674395d4602b8be4a1b",
    "ns": "dbname.collectionname",
    "_ts": 12345,
    "_version_": 67890,

    "AddressDirectory": '{
      u"Owner" : u"Mayank",
      u"Age" : u"24",
      u"Company" : u"BIPL",
      u"Address" : {
        u"HouseNo" : u"4",
        u"StreetName" : u"Rohini",
        u"City" : u"Delhi"
      }
    }'
}

The u prefix on the strings indicates Python unicode string literals. The prefixes are preserved when passed on to Solr because of the way the underlying library, pysolr, encodes Python dictionaries.

One obvious disadvantage to this approach is the fact that stringifying sub-documents makes the inner fields unindexable in Solr. Mongo Connector should flatten the documents to provide a way to index those fields. I'm marking this as a bug and will use this issue to track progress.
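
To illustrate why a stringified sub-document cannot be indexed field by field, here is a minimal sketch in plain Python. It makes no Mongo Connector or pysolr calls, and the variable names are only for illustration. Under Python 3 the u prefixes do not appear, but the result is still a single string.

```python
# Illustrative only: plain Python, no Mongo Connector or pysolr involved.
doc = {
    "AddressDirectory": {
        "Owner": "Mayank",
        "Address": {"StreetName": "Rohini", "City": "Delhi"},
    }
}

# Turning the sub-document into a string yields one opaque value.
stringified = str(doc["AddressDirectory"])

# The inner value is still present as text, but there is no separate
# "StreetName" field that a search engine could index on its own.
is_single_string = isinstance(stringified, str)
street_is_in_text = "Rohini" in stringified

# With the original nested structure the field is directly addressable.
street = doc["AddressDirectory"]["Address"]["StreetName"]
```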

Thanks for pointing this out, @Crossener!

Crossener · 2014-02-13T08:42:48Z

My query output in solr is:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "indent": "true",
      "q": "*:*",
      "_": "1392280406387",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "_id": "52fa3674395d4602b8be4a1b",
        "_ts": 5979151327131861000,
        "ns": "test.test",
        "_version_": 1459753744036003800
      },
      {
        "_id": "52fa3e57395d4616743f2cf2",
        "_ts": 5979159998670832000,
        "ns": "test.test",
        "_version_": 1459755859305300000
      }
    ]
  }
}

What should I do if I want to index fields from subdocuments? Do I have to flatten the structure of the embedded documents?

llvtt · 2014-02-14T19:30:49Z

Hi @Crossener,

Flattening the document would allow the inner fields to be indexed. There isn't a way to index the inner fields when they're in the string-ified Python dictionary form. I have a patch in code review right now to address this issue, and I should have it pushed in the next few days. If you need a more immediate fix, try adding the following at the very beginning of the clean_doc method in solr_doc_manager.py (before and at the same indentation level as if not self.field_list):

# flatten document
def flattened(doc):
    def flattened_kernel(doc, path):
        for k, v in doc.items():
            path.append(k)
            if isinstance(v, dict):
                for inner_k, inner_v in flattened_kernel(v, path):
                    yield inner_k, inner_v
            else:
                yield ".".join(path), v
            path.pop()
    return dict(flattened_kernel(doc, []))
doc = flattened(doc)

This change will flatten your nested documents so that a document that looks like this in MongoDB:

{
    "_id": 0,
    "billing": {
        "address": {
            "street": "123 Ft. Knox Street"
        },
        "method": "gold bricks"
    }
}

will now look like this in Solr:

{
    "_id": 0,
    "_version_": 123129379183,
    "ns": "billing.receipts",
    "_ts": 239847892374,
    "billing.address.street": "123 Ft. Knox Street",
    "billing.method": "gold bricks"
}

Note that this requires the definition of two new fields in schema.xml: billing.address.street and billing.method. Otherwise, these fields will be removed by the doc manager, since they are not in Solr's schema and would cause an exception to be thrown instead of the documents being inserted.
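
For anyone who wants to try the flattening outside the doc manager, here is a self-contained sketch of the same function applied to the billing example. The schema_fields set is a hypothetical stand-in for the field names declared in schema.xml; it is not how the doc manager actually reads the schema.

```python
def flattened(doc):
    """Flatten nested dicts into a single level with dotted key paths."""
    def flattened_kernel(doc, path):
        for k, v in doc.items():
            path.append(k)
            if isinstance(v, dict):
                # Recurse into sub-documents, extending the current path.
                for inner_k, inner_v in flattened_kernel(v, path):
                    yield inner_k, inner_v
            else:
                yield ".".join(path), v
            path.pop()
    return dict(flattened_kernel(doc, []))


# Hypothetical stand-in for the field names declared in schema.xml.
schema_fields = {"_id", "billing.address.street"}

doc = {
    "_id": 0,
    "billing": {
        "address": {"street": "123 Ft. Knox Street"},
        "method": "gold bricks",
    },
}

flat = flattened(doc)

# Keep only the fields the (assumed) schema knows about.
kept = {k: v for k, v in flat.items() if k in schema_fields}
```

In this sketch billing.method is dropped from kept because it is missing from the assumed schema, which mirrors the removal described above.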

Crossener · 2014-02-17T13:36:54Z

Many thx, what happens if there are arrays, e.g.

 "AddressDirectory" : {
    "@typ" : "abc",
    "Owner" : "Mayank",
    "Age" : "24",
    "Company" : "BIPL",
    "Address" : [{
       "HouseNo" : "1",
       "StreetName" : "Pitampura",
       "City" : "Delhi"
     }, {
       "HouseNo" : "4",
       "StreetName" : "Rohini",
       "City" : "Delhi"
     }]
   }

llvtt · 2014-02-26T01:19:21Z

Resolved by commit 2ecf3b6. This includes unwinding arrays so that sub-documents can be flattened. See this comment for _clean_doc in solr_doc_manager.py for an example of how this works.
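
As a rough sketch of what unwinding arrays during flattening can look like (this is not the code from the commit, and the function name flatten_with_arrays is made up for illustration), list positions can be treated as extra path components:

```python
def flatten_with_arrays(doc):
    """Flatten nested dicts and lists; list positions become path parts."""
    def kernel(value, path):
        if isinstance(value, dict):
            for k, v in value.items():
                yield from kernel(v, path + [str(k)])
        elif isinstance(value, list):
            # Unwind the array: each element gets its index in the path.
            for i, item in enumerate(value):
                yield from kernel(item, path + [str(i)])
        else:
            yield ".".join(path), value
    return dict(kernel(doc, []))


doc = {
    "AddressDirectory": {
        "Owner": "Mayank",
        "Address": [
            {"StreetName": "Pitampura", "City": "Delhi"},
            {"StreetName": "Rohini", "City": "Delhi"},
        ],
    }
}

flat = flatten_with_arrays(doc)
```

With this sketch the two street names end up under AddressDirectory.Address.0.StreetName and AddressDirectory.Address.1.StreetName.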

acmeinternetsolutions · 2014-02-26T23:02:12Z

Awesome!

Crossener · 2014-02-27T08:28:55Z

Thx a lot.

sirmyron · 2014-02-27T17:56:07Z

Thanks for the solution guys... my only problem is the "index" included in the field names for the arrays which restricts how I configure my fields to be indexed in Solr.

Here's an example of what I have:

{
    "a": 0,
    "b": [
        {
            "c": 6,
            "d": "six"
        },
        {
            "c": 7,
            "d": "seven"
        },
        {
            "c": 8,
            "d": "eight"
        }
    ]
}

I'm forced to configure dynamic fields that look like:

<dynamicField name="b.*" type="string" indexed="true" stored="true"/>

I want to be able to do something like this:

<dynamicField name="b.c.*" type="int" indexed="true" stored="true"/>
<dynamicField name="b.d.*" type="string" indexed="true" stored="true"/>
<!-- OR -->
<field name="b.c" type="int" indexed="true" stored="true" multiValued="true" />
<field name="b.d" type="string" indexed="true" stored="true" multiValued="true" />

Let me know if there is something I'm doing wrong or missing.

Thanks.

llvtt · 2014-02-27T19:20:44Z

@sirmyron,

There isn't a very easy way to deal with that schema. Your example document will look like the following after going through Mongo Connector:

{
    "a": 0,
    "b.0.c": 6,
    "b.0.d": "six",
    "b.1.c": 7,
    "b.1.d": "seven",
    "b.2.c": 8,
    "b.2.d": "eight"
}

You do have a few options:

1. Create fields b.<n>.c and b.<n>.d for each index n in the array. This would be a good choice if you know there are a limited number of entries in b.
2. Create dynamicFields *.c and *.d, if c and d are field names not used elsewhere.
3. In addition to (2), you could also pair each of these dynamicFields with a copyField from *.c to an inner_c field that you can index, so you can search all b.<n>.c, for example.

You could change your schema in MongoDB to be more Solr-friendly. For example:

{
    "a": 0,
    "b.c": [6, 7, 8],
    "b.d": ["six", "seven", "eight"]
}

which will turn into:

{
    "a": 0,
    "b.c.0": 6,
    "b.c.1": 7,
    "b.c.2": 8,
    "b.d.0": "six",
    "b.d.1": "seven",
    "b.d.2": "eight"
}

I'm sorry that the transformation isn't super-helpful to you. It's difficult to find a good solution for flattening a MongoDB document, given that arrays may contain any mixture of data types.

sirmyron · 2014-02-28T17:10:09Z

Thanks @lovett89, I totally understand what you're saying. I don't know how many items will be in the array and it will be a lot of effort to change the mongo structure at this point.

I may try to tinker with the code a bit to see if I can get the list index at the end. I'm considering adding an additional param to the "flattened_kernel" method for the index, where I only add it and pop it back if there's a value.

It may work for my case, but I don't know if it would be best as a general solution. Will let you know how it works out.
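
One variation on this idea, sketched here as a hypothetical post-processing step rather than anything Mongo Connector does, is to drop the numeric index from each flattened key and gather the values into a list, which would fit multiValued fields such as b.c and b.d. The function name collapse_indices is made up, and the sketch assumes no real field name consists only of digits and that an indexed key never collides with a non-indexed one.

```python
def collapse_indices(flat):
    """Drop numeric path components and gather values under the shared key."""
    collapsed = {}
    for key, value in flat.items():
        parts = key.split(".")
        had_index = any(p.isdigit() for p in parts)
        name = ".".join(p for p in parts if not p.isdigit())
        if had_index:
            # Values from the same array position order are appended in turn.
            collapsed.setdefault(name, []).append(value)
        else:
            collapsed[name] = value
    return collapsed


flat = {
    "a": 0,
    "b.0.c": 6,
    "b.0.d": "six",
    "b.1.c": 7,
    "b.1.d": "seven",
    "b.2.c": 8,
    "b.2.d": "eight",
}

collapsed = collapse_indices(flat)
```

Note that this loses the pairing between c and d within one array element, so it only helps when the fields are searched independently.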

rohsan · 2014-03-24T05:49:50Z

Any fix yet for this issue? I am facing the same problem.

llvtt · 2014-03-24T15:38:22Z

@suja-arun,

Please see #99.

rohsan · 2014-03-24T17:29:48Z

Hi, I saw that, but I have an array of objects, as given below.

This is my schema:

<dynamicField name="Contributor.*" type="string" indexed="true"  
  stored="true"    />

And the data I am trying to insert is:

db.tests.insert({
    Title: {
        TitleText: "Title"
    },
    _id: "1",
    Contributor: [
        {
            Name: "John"
        },
        {
            Name: "David"
        }
    ]
})

But this is not inserting anything into Solr for Contributor.

rohsan · 2014-03-31T18:11:51Z

Hi,

I need some help with the following. I have data as follows:

    "Publisher" : [
        {
            "PublishingRole" : "01",
            "NameCodeType" : "01",
            "NameCodeValue" : "SPVB",
            "PublisherName" : "Springer Berlin Heidelberg"
        },
        {
            "PublishingRole" : "01",
            "NameCodeType" : "05",
            "NameCodeValue" : "5108985",
            "PublisherName" : "Springer Berlin Heidelberg"
        }
    ]

I need only the publisher name from the above.
Given the following schema,

<dynamicField name="Publisher*" type="string" indexed="true"  
  stored="true"   multiValued="true" />
<field name="pname" type="string_lowercase" indexed="true" 
    stored="true" multiValued="true"/>
<copyField source="Publisher*" dest="pname" />

I get the following Document

{
        "pname": [
          "SPVB",
          "01",
          "01",
          "Springer Berlin Heidelberg",
          "01",
          "5108985",
          "Springer Berlin Heidelberg",
          "05"
        ],
        "BookId": 21
      }

However, I need only Publisher.PublisherName, and I need to facet on it. How can I accomplish this? Defining the dynamic field as *PublisherName instead of Publisher* did not add any documents for the field.

I'd much appreciate help on this topic.

acmeinternetsolutions · 2014-03-31T18:29:37Z

Hey, hope this helps. Try this:

<dynamicField name="Publisher.PublisherName*" type="string" indexed="true"  
  stored="true"   multiValued="false" />
<field name="pname" type="string_lowercase" indexed="true" 
    stored="true" multiValued="true"/>
<copyField source="Publisher.PublisherName*" dest="pname" />

rohsan · 2014-03-31T18:47:45Z

Hi,

Thanks for your response, but this does not seem to work.
I changed the dynamic field name to Publisher.PublisherName* and the multiValued param to false, and also changed the copyField source as suggested above, but pname gets no data in that case!

What are other options?

acmeinternetsolutions · 2014-03-31T19:07:51Z

I assumed that would create Publisher.PublisherName.0, Publisher.PublisherName.1, etc. and the copyField would catch it. Is it instead creating Publisher.0.PublisherName, Publisher.1.PublisherName? If so, maybe change the copyField source to be *.PublisherName. I believe the wildcard can go at the beginning or end of the dynamic field name in Solr.

Tim

rohsan · 2014-04-01T05:00:21Z

Hi,
Publisher.PublisherName* - There is no data at all in the dynamic field or the pname field.
*.PublisherName - Same, no data in either the dynamic field or the pname field. Please see my schema for both cases and let me know if I am missing anything. Thanks.

<dynamicField name="Publisher.PublisherName*" type="string" indexed="true"  
  stored="true"   multiValued="false" />
<field name="pname" type="string_lowercase" indexed="true" 
    stored="true" multiValued="true"/>
<copyField source="Publisher.PublisherName*" dest="pname" />

<dynamicField name="*.PublisherName" type="string" indexed="true"  
  stored="true"   multiValued="false" />
<field name="pname" type="string_lowercase" indexed="true" 
    stored="true" multiValued="true"/>
<copyField source="*.PublisherName" dest="pname" />

No data in either case in pname or the dynamic field.

llvtt · 2014-04-02T01:01:38Z

@suja-arun @acmeinternetsolutions
The regular expressions built from dynamicFields were incorrect in Solr DocManager. The issue of not being able to match nested field names with dynamicFields should now be fixed by commit 4ef0bcc. Please let me know if there are more issues.
Thank you for pointing this out, and thank you for your patience!
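
For readers following along, here is a minimal sketch of how a dynamicField name with a wildcard could be translated into a regular expression and checked against flattened field names. It is not the code from the commit; dynamic_field_regex and matches are hypothetical helpers written only to show which of the patterns discussed above can match a name like Publisher.0.PublisherName.

```python
import re


def dynamic_field_regex(pattern):
    """Translate a Solr-style dynamicField name (with '*') into a regex."""
    # Escape everything literally, then turn the escaped '*' back into '.*'.
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))


def matches(pattern, field_name):
    """Return True if the whole field name fits the dynamicField pattern."""
    return dynamic_field_regex(pattern).fullmatch(field_name) is not None


# The name produced by flattening an array of sub-documents.
flattened_name = "Publisher.0.PublisherName"

suffix_match = matches("*.PublisherName", flattened_name)
prefix_match = matches("Publisher.PublisherName*", flattened_name)
broad_match = matches("Publisher*", flattened_name)
```

In this sketch *.PublisherName and Publisher* match the flattened name, while Publisher.PublisherName* does not, because the array index sits between Publisher and PublisherName.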

rohsan · 2014-04-03T06:00:51Z

Thank you.This is working as expected. Perfect !!

annvakulchyk · 2015-05-29T21:28:48Z

Hi, is it possible to keep the existing MongoDB structure in Solr? Solr supports multivalued fields, so when I have an array in Mongo I expect Solr to receive an array as well. The copyField solution is not an option for me because I have a lot of dynamic fields and I can't define them all in my schema.xml. So it's really critical to be able to insert arrays like:
"category_ids" : [ NumberLong(2), NumberLong(13), NumberLong(14), NumberLong(15), NumberLong(16), NumberLong(37) ]

llvtt · 2015-06-01T16:23:22Z

Hi Anna,

Mongo Connector automatically unwinds arrays. For the rationale behind this, see #148 (comment). This is not configurable right now.

llvtt added the bug label Feb 12, 2014

llvtt added waiting for input and removed waiting for input labels Feb 14, 2014

llvtt closed this as completed Feb 26, 2014

llvtt removed the in progress label Feb 27, 2014

llvtt mentioned this issue Mar 12, 2014

Solr arrays not getting inserted in to solr #99

Closed

flavouski mentioned this issue Feb 28, 2015

SOLR Insert/Update Failing, still in Python dictionary form? #216

Closed
