Nested Documents #89

Closed
Crossener opened this issue Feb 12, 2014 · 22 comments

@Crossener

How can I use Mongo Connector to index fields of embedded documents in Solr? For example, the field StreetName in:

{
  "_id" : ObjectId("52fa3674395d4602b8be4a1b"),
  "AddressDirectory" : {
    "Owner" : "Mayank",
    "Age" : "24",
    "Company" : "BIPL",
    "Address" : {
      "HouseNo" : "4",
      "StreetName" : "Rohini",
      "City" : "Delhi"
    }
  }
} 

The manual says the connector flattens nested documents, but in Solr I can't see the flattened field StreetName.

@llvtt

llvtt commented Feb 12, 2014

  1. That manual is not for this project ("Mongo Connector"), it is for a different project called "solr mongo connector". The manual for this project is located here.

  2. Mongo Connector doesn't do any kind of restructuring of MongoDB documents beyond excluding fields not listed in schema.xml. Nested documents are turned into strings, so the document you provided would look something like this in Solr:

    {
        "_id": "52fa3674395d4602b8be4a1b",
        "ns": "dbname.collectionname",
        "_ts": 12345,
        "_version_": 67890,
    
        "AddressDirectory": '{
          u"Owner" : u"Mayank",
          u"Age" : u"24",
          u"Company" : u"BIPL",
          u"Address" : {
            u"HouseNo" : u"4",
            u"StreetName" : u"Rohini",
            u"City" : u"Delhi"
          }
        }'
    } 
    

The u prefix on the strings indicates Python unicode string literals. They are preserved when passed on to Solr because of the way the underlying library, pysolr, encodes Python dictionaries.

One obvious disadvantage to this approach is the fact that stringifying sub-documents makes the inner fields unindexable in Solr. Mongo Connector should flatten the documents to provide a way to index those fields. I'm marking this as a bug and will use this issue to track progress.

Thanks for pointing this out, @Crossener!

@llvtt llvtt added the bug label Feb 12, 2014
@Crossener
Author

My query output in solr is:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "indent": "true",
      "q": "*:*",
      "_": "1392280406387",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "_id": "52fa3674395d4602b8be4a1b",
        "_ts": 5979151327131861000,
        "ns": "test.test",
        "_version_": 1459753744036003800
      },
      {
        "_id": "52fa3e57395d4616743f2cf2",
        "_ts": 5979159998670832000,
        "ns": "test.test",
        "_version_": 1459755859305300000
      }
    ]
  }
}

What should I do if I want to index fields from subdocuments? Do I have to flatten the structure of the embedded documents?

@llvtt

llvtt commented Feb 14, 2014

Hi @Crossener,

Flattening the document would allow the inner fields to be indexed. There isn't a way to index the inner fields when they're in the string-ified Python dictionary form. I have a patch in code review right now to address this issue, and I should have it pushed in the next few days. If you need a more immediate fix, try adding the following at the very beginning of the clean_doc method in solr_doc_manager.py (before and at the same indentation level as if not self.field_list):

# flatten document
def flattened(doc):
    def flattened_kernel(doc, path):
        for k, v in doc.items():
            path.append(k)
            if isinstance(v, dict):
                for inner_k, inner_v in flattened_kernel(v, path):
                    yield inner_k, inner_v
            else:
                yield ".".join(path), v
            path.pop()
    return dict(flattened_kernel(doc, []))
doc = flattened(doc)

This change will flatten your nested documents so that a document that looks like this in MongoDB:

{
    "_id": 0,
    "billing": {
        "address": {
            "street": "123 Ft. Knox Street"
        },
        "method": "gold bricks"
    }
}

will now look like this in Solr:

{
    "_id": 0,
    "_version_": 123129379183,
    "ns": "billing.receipts",
    "_ts": 239847892374,
    "billing.address.street": "123 Ft. Knox Street",
    "billing.method": "gold bricks"
}

Note that this requires the definition of two new fields in schema.xml: billing.address.street and billing.method. Otherwise, these fields will be removed by the doc manager, since they are not in Solr's schema and would cause an exception to be thrown instead of the documents being inserted.
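
For anyone who wants to try the flattening outside the doc manager, here is a minimal standalone sketch of the same idea. The function name `flatten` and the inline sample document are only for illustration; this is not the code that ships in solr_doc_manager.py, and it handles nested dictionaries only, not arrays.

```python
def flatten(doc):
    """Return a single-level dict where nested dicts become dotted keys."""
    def walk(node, path):
        for key, value in node.items():
            path.append(key)
            if isinstance(value, dict):
                # Recurse into sub-documents, extending the dotted path.
                for item in walk(value, path):
                    yield item
            else:
                yield ".".join(path), value
            path.pop()
    return dict(walk(doc, []))


doc = {
    "_id": 0,
    "billing": {
        "address": {"street": "123 Ft. Knox Street"},
        "method": "gold bricks",
    },
}
flat = flatten(doc)
print(sorted(flat))
# ['_id', 'billing.address.street', 'billing.method']
```

The printed names are the ones that would need matching field or dynamicField definitions in schema.xml.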

@Crossener
Author

Many thanks. What happens if there are arrays, e.g.:

 "AddressDirectory" : {
    "@typ" : "abc",
    "Owner" : "Mayank",
    "Age" : "24",
    "Company" : "BIPL",
    "Address" : [{
       "HouseNo" : "1",
       "StreetName" : "Pitampura",
       "City" : "Delhi"
     }, {
       "HouseNo" : "4",
       "StreetName" : "Rohini",
       "City" : "Delhi"
     }]
   }

@llvtt

llvtt commented Feb 26, 2014

Resolved by commit 2ecf3b6. This includes unwinding arrays so that sub-documents can be flattened. See this comment for _clean_doc in solr_doc_manager.py for an example of how this works.
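
As a rough illustration of what unwinding arrays means here, the sketch below treats each list position as one more path segment. It is an assumption-laden approximation, not the code from the commit: the name `flatten_with_arrays` is made up, and empty dicts or lists simply produce no fields.

```python
def flatten_with_arrays(doc):
    """Flatten nested dicts and lists into one level with dotted keys."""
    def walk(node, path):
        if isinstance(node, dict):
            children = node.items()
        elif isinstance(node, list):
            # List positions become path segments: "Address.0", "Address.1", ...
            children = enumerate(node)
        else:
            yield ".".join(path), node
            return
        for key, value in children:
            for item in walk(value, path + [str(key)]):
                yield item
    return dict(walk(doc, []))


doc = {
    "AddressDirectory": {
        "Owner": "Mayank",
        "Address": [
            {"StreetName": "Pitampura", "City": "Delhi"},
            {"StreetName": "Rohini", "City": "Delhi"},
        ],
    }
}
flat = flatten_with_arrays(doc)
for name in sorted(flat):
    print(name, "=", flat[name])
```

With this shape, the second street name ends up under AddressDirectory.Address.1.StreetName.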

@llvtt llvtt closed this as completed Feb 26, 2014
@acmeinternetsolutions

Awesome!

@Crossener
Author

Thx a lot.

@sirmyron

Thanks for the solution guys... my only problem is the "index" included in the field names for the arrays which restricts how I configure my fields to be indexed in Solr.

Here's an example of what I have:

{
    "a": 0,
    "b": [
        {
            "c": 6,
            "d": "six"
        },
        {
            "c": 7,
            "d": "seven"
        },
        {
            "c": 8,
            "d": "eight"
        }
    ]
}

I'm forced to configure dynamic fields that look like:

<dynamicField name="b.*" type="string" indexed="true" stored="true"/>

I want to be able to do something like this:

<dynamicField name="b.c.*" type="int" indexed="true" stored="true"/>
<dynamicField name="b.d.*" type="string" indexed="true" stored="true"/>
<!-- OR -->
<field name="b.c" type="int" indexed="true" stored="true" multiValued="true" />
<field name="b.d" type="string" indexed="true" stored="true" multiValued="true" />

Let me know if there is something I'm doing wrong or missing.

Thanks.

@llvtt

llvtt commented Feb 27, 2014

@sirmyron,

There isn't a very easy way to deal with that schema. Your example document will look like the following after going through Mongo Connector:

{
    "a": 0,
    "b.0.c": 6,
    "b.0.d": "six",
    "b.1.c": 7,
    "b.1.d": "seven",
    "b.2.c": 8,
    "b.2.d": "eight"
}

You do have a few options:

  1. Create fields b.<n>.c and b.<n>.d for each index n in the array. This would be a good choice if you know there are a limited number of entries in b.

  2. Create dynamicFields *.c and *.d, if c and d are field names not used elsewhere.

  3. In addition to (2), you could also pair each of these dynamicFields with a copyField from *.c to an inner_c field that you can index, so you can search all b.<n>.c, for example.

  4. You could change your schema in MongoDB to be more Solr-friendly. For example:

    {
        "a": 0,
        "b.c": [6, 7, 8],
        "b.d": ["six", "seven", "eight"]
    }
    

    which will turn into:

    {
        "a": 0,
        "b.c.0": 6,
        "b.c.1": 7,
        "b.c.2": 8,
        "b.d.0": "six",
        "b.d.1": "seven",
        "b.d.2": "eight"
    }
    

I'm sorry that the transformation isn't super-helpful to you. It's difficult to find a good solution for flattening a MongoDB document, given that arrays may contain any mixture of data types.

@llvtt llvtt removed the in progress label Feb 27, 2014
@sirmyron

Thanks @lovett89, I totally understand what you're saying. I don't know how many items will be in the array and it will be a lot of effort to change the mongo structure at this point.

I may try to tinker with the code a bit to see if I can get the list index at the end. I'm considering adding an additional parameter to the "flattened_kernel" method for the index, where I only add it and pop it back if there's a value.

It may work for my case, but I don't know if it would be best as a general solution. Will let you know how it works out.
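
One way the idea above could look, purely as a hypothetical sketch: carry the list indexes separately from the field names and append them only when a leaf value is emitted. The name `flatten_index_last` is invented for this example, and nothing here has been tested against Mongo Connector itself.

```python
def flatten_index_last(doc):
    """Flatten a document, moving list indexes to the end of each field name."""
    def walk(node, names, indexes):
        if isinstance(node, dict):
            for key, value in node.items():
                for item in walk(value, names + [key], indexes):
                    yield item
        elif isinstance(node, list):
            for i, value in enumerate(node):
                # Remember the position, but do not put it in the name yet.
                for item in walk(value, names, indexes + [str(i)]):
                    yield item
        else:
            yield ".".join(names + indexes), node
    return dict(walk(doc, [], []))


doc = {
    "a": 0,
    "b": [
        {"c": 6, "d": "six"},
        {"c": 7, "d": "seven"},
        {"c": 8, "d": "eight"},
    ],
}
flat = flatten_index_last(doc)
print(sorted(flat))
# ['a', 'b.c.0', 'b.c.1', 'b.c.2', 'b.d.0', 'b.d.1', 'b.d.2']
```

Names of this form would line up with dynamicFields such as b.c.* and b.d.*, which is the configuration asked for earlier in the thread.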

@rohsan

rohsan commented Mar 24, 2014

Is there any fix yet for this issue? I am facing the same problem.

@llvtt

llvtt commented Mar 24, 2014

@suja-arun,

Please see #99.

@rohsan

rohsan commented Mar 24, 2014

Hi, I saw that, but I have an array of objects, as shown below.

This is my schema:

<dynamicField name="Contributor.*" type="string" indexed="true"  
  stored="true"    />

And the data I am trying to insert is:

db.tests.insert({
    Title: {
        TitleText: "Title"
    },
    _id: "1",
    Contributor: [
        {
            Name: "John"
        },
        {
            Name: "David"
        }
    ]
})

But this does not insert anything into Solr for Contributor.

@rohsan

rohsan commented Mar 31, 2014

Hi,

I need some help with the following. I have data as follows:

    "Publisher" : [
        {
            "PublishingRole" : "01",
            "NameCodeType" : "01",
            "NameCodeValue" : "SPVB",
            "PublisherName" : "Springer Berlin Heidelberg"
        },
        {
            "PublishingRole" : "01",
            "NameCodeType" : "05",
            "NameCodeValue" : "5108985",
            "PublisherName" : "Springer Berlin Heidelberg"
        }
    ]

I need only the publisher name from the data above. Given the following schema:

<dynamicField name="Publisher*" type="string" indexed="true"  
  stored="true"   multiValued="true" />
<field name="pname" type="string_lowercase" indexed="true" 
    stored="true" multiValued="true"/>
<copyField source="Publisher*" dest="pname" />

I get the following document:

{
        "pname": [
          "SPVB",
          "01",
          "01",
          "Springer Berlin Heidelberg",
          "01",
          "5108985",
          "Springer Berlin Heidelberg",
          "05"
        ],
        "BookId": 21
      }

However, I need only Publisher.PublisherName, and I need to facet on it. How can I accomplish this? Defining the dynamic field as *PublisherName instead of Publisher* did not add any documents for the field.

I would much appreciate help on this topic.

@acmeinternetsolutions

Hey, hope this helps. Try this:

<dynamicField name="Publisher.PublisherName*" type="string" indexed="true"  
  stored="true"   multiValued="false" />
<field name="pname" type="string_lowercase" indexed="true" 
    stored="true" multiValued="true"/>
<copyField source="Publisher.PublisherName*" dest="pname" /> 

@rohsan

rohsan commented Mar 31, 2014

Hi,

Thanks for your response, but this does not seem to work. I changed the dynamic field name to Publisher.PublisherName* and the multiValued param to false, and also changed the copyField source as suggested above, but pname gets no data in that case!

What are the other options?

@acmeinternetsolutions

I assumed that would create Publisher.PublisherName.0, Publisher.PublisherName.1, etc., and the copyField would catch them. Is it instead creating Publisher.0.PublisherName, Publisher.1.PublisherName? If so, maybe change the copyField source to *.PublisherName. I believe the wildcard can go at the beginning or end of the dynamic field name in Solr.

Tim

@rohsan

rohsan commented Apr 1, 2014

Hi,

Publisher.PublisherName*: there is no data at all in the dynamic field or the pname field.
*.PublisherName: same, no data in either the dynamic field or the pname field. Please see my schema for both cases and let me know if I am missing anything. Thanks.

<dynamicField name="Publisher.PublisherName*" type="string" indexed="true"  
  stored="true"   multiValued="false" />
<field name="pname" type="string_lowercase" indexed="true" 
    stored="true" multiValued="true"/>
<copyField source="Publisher.PublisherName*" dest="pname" />
<dynamicField name="*.PublisherName" type="string" indexed="true"  
  stored="true"   multiValued="false" />
<field name="pname" type="string_lowercase" indexed="true" 
    stored="true" multiValued="true"/>
<copyField source="*.PublisherName" dest="pname" />

No data in pname or the dynamic field in either case.

@llvtt

llvtt commented Apr 2, 2014

@suja-arun @acmeinternetsolutions
The regular expressions built from dynamicFields were incorrect in Solr DocManager. The issue of not being able to match nested field names with dynamicFields should now be fixed by commit 4ef0bcc. Please let me know if there are more issues.
Thank you for pointing this out, and thank you for your patience!
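
To make the matching problem concrete, here is a small sketch of turning a dynamicField name into a regular expression. It is not the change made in that commit; `dynamic_field_regex` is a hypothetical helper, and it assumes the wildcard appears only at the beginning or end of the name, as suggested earlier in this thread. The important detail is escaping the literal dots so they are not treated as "any character".

```python
import re


def dynamic_field_regex(pattern):
    """Compile a dynamicField name with a leading or trailing '*' into a regex."""
    if pattern.startswith("*"):
        return re.compile(r".*" + re.escape(pattern[1:]) + r"$")
    if pattern.endswith("*"):
        return re.compile(re.escape(pattern[:-1]) + r".*$")
    # No wildcard: the field name must match exactly.
    return re.compile(re.escape(pattern) + r"$")


# Field names as they would look after array unwinding and flattening.
flattened_names = [
    "Publisher.0.PublisherName",
    "Publisher.0.NameCodeType",
    "Publisher.1.PublisherName",
]
for pattern in ("Publisher*", "*.PublisherName", "Publisher.PublisherName*"):
    regex = dynamic_field_regex(pattern)
    print(pattern, [name for name in flattened_names if regex.match(name)])
```

Under these assumptions, Publisher* matches every flattened name (which is why pname collected all the values), *.PublisherName matches only the two publisher names, and Publisher.PublisherName* matches nothing because the array index sits between the two name segments.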

@rohsan

rohsan commented Apr 3, 2014

Thank you. This is working as expected. Perfect!

@annvakulchyk

Hi, is it possible to keep the existing MongoDB structure in Solr? Solr supports multivalued fields, so I have an array in Mongo and expect Solr to store an array as well. The copyField solution is not an option for me because I have a lot of dynamic fields and can't define them all in my schema.xml. So it's really critical to be able to insert arrays like:
"category_ids" : [
NumberLong(2),
NumberLong(13),
NumberLong(14),
NumberLong(15),
NumberLong(16),
NumberLong(37)
]

@llvtt

llvtt commented Jun 1, 2015

Hi Anna,

Mongo Connector automatically unwinds arrays. For the rationale behind this, see #148 (comment). This is not configurable right now.