solr loads only one nested attribute #870

Closed
hackartisan opened this issue Aug 3, 2015 · 16 comments

@hackartisan
Contributor

Here is a test showing the problem:
https://github.com/chemheritage/active_fedora/tree/solr-load-bug

@hackartisan
Contributor Author

@Cam156 @awead @hectorcorrea - according to @jcoyne you all know everything about active fedora and may be able to help me get to the root of this.

@awead
Contributor

awead commented Aug 3, 2015

@HackMasterA I don't have time to look into this at the moment, but as a workaround, I would suggest creating your own solr indexer for the classes that have nested attributes. Then you could either index ids directly, or even index the properties of the nested attributes. That might be a better option anyway, because you'd have all the terms you need in one solr document.

Another option would be to do solr joins, but that's @Cam156 territory.
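A minimal sketch of the indexer idea in plain Ruby, not an actual ActiveFedora indexer. The Struct models and the Solr field names (author_ids_ssim, author_names_tesim) are assumptions made for illustration; the point is only that ids and selected properties of the nested objects can sit side by side in one document.

```ruby
# Hypothetical stand-ins for real models; these are not ActiveFedora classes.
Author = Struct.new(:id, :name)
Book = Struct.new(:id, :title, :authors)

# Builds one flat Solr-style document that carries both the ids and
# selected properties of the nested objects.
# The field names below are assumptions, not a fixed schema.
def book_solr_document(book)
  {
    'id' => book.id,
    'title_tesim' => [book.title],
    'author_ids_ssim' => book.authors.map(&:id),
    'author_names_tesim' => book.authors.map(&:name)
  }
end

book = Book.new('b1', 'Example', [Author.new('a1', 'Ada'), Author.new('a2', 'Grace')])
doc = book_solr_document(book)
```

With a document shaped like this, a search on author names would not need to load the nested objects at all.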

@hackartisan
Contributor Author

See also this hydra-tech list message:

https://groups.google.com/forum/#!topic/hydra-tech/bybriXid38Y

I'm going to try changing my model as suggested by that poster. That doesn't invalidate the bug, though.

@hackartisan
Contributor Author

Thanks, @awead! It doesn't look like the indexing itself is the problem; both IDs are in solr when I look at it directly. It's definitely an issue with the retrieval.

I take your point, though, that indexing the attributes themselves would probably be useful, as well.

@awead
Contributor

awead commented Aug 3, 2015

Ah, ok, I misunderstood the problem. The issue is likely with an incomplete object profile. Take a look at:

https://github.com/projecthydra/active_fedora/blob/master/lib/active_fedora/solr_instance_loader.rb

It's loading the object based on the object_profile_ssm field in the solr document. Check and see if each nested attribute is there. I've never looked into this, so I'm not sure what you should expect.
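One way to run that check, sketched in plain Ruby. This assumes object_profile_ssm holds a JSON string as its first value; the sample document and the author_ids key are invented for illustration, so substitute a document fetched from your own index.

```ruby
require 'json'

# Hypothetical Solr document; the profile contents are invented for illustration.
solr_doc = {
  'id' => 'b1',
  'object_profile_ssm' => [
    { 'id' => 'b1', 'author_ids' => ['a1', 'a2'] }.to_json
  ]
}

# Returns the values stored under `key` in the JSON object profile,
# or an empty array when the profile or the key is missing.
def profile_values(solr_doc, key)
  raw = Array(solr_doc['object_profile_ssm']).first
  return [] if raw.nil?

  Array(JSON.parse(raw)[key])
end

ids = profile_values(solr_doc, 'author_ids')
```

If every expected id shows up here, the profile is complete and the loss is happening later, during loading.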

@hackartisan
Contributor Author

@awead, yes, that's the class the test is written against. When you say "Check and see if each nested attribute is there", do you mean in my solr index itself?

@ojlyytinen
Contributor

When I ran into this (presumably the same) issue earlier, I spent quite some time tracing it through Active Fedora. It's been a while but from what I remember, everything is stored correctly in Solr, but the loading overwrites values and leaves only the last one.

From my earlier post to the Hydra-Tech list (linked above): when execution gets to ActiveFedora::Associations::RDF.replace, all the author_ids are still intact. But there it loops over each of them and calls ActiveFedora::LoadableFromJson::SolrBackedResource.insert, one id at a time. That in turn calls set_value in the same file, which replaces any existing value rather than adding to a list, so only the last of the author_ids is set in the end.
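The overwrite described above can be modelled in a few lines of plain Ruby. This is a simplified illustration, not the ActiveFedora source: the class name OverwritingResource and its internals are hypothetical, and only the insert/set_value shape is taken from the description.

```ruby
# Simplified model of the behaviour described above; this is NOT the
# ActiveFedora source, only an illustration of overwrite-on-insert.
class OverwritingResource
  attr_reader :values

  def initialize
    @values = {}
  end

  # Called once per id, mirroring the one-at-a-time loop.
  def insert(key, value)
    set_value(key, value)
  end

  private

  # Replaces whatever was stored under key, so earlier ids are lost.
  def set_value(key, value)
    @values[key] = value
  end
end

resource = OverwritingResource.new
%w[a1 a2 a3].each { |id| resource.insert(:author_ids, id) }
remaining = resource.values[:author_ids]
```

After the loop, only the last id is still stored, which matches the symptom reported in this issue.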

@hackartisan
Contributor Author

@ojlyytinen It sounds like you got a little deeper than I did, but your account agrees exactly with what I have observed -- only the last id comes through, even though all are there when you observe solr directly.

@hackartisan
Contributor Author

@ojlyytinen any chance you can point me to your models so I can see how you reworked them?

@hackartisan
Contributor Author

Working up a PR based on @ojlyytinen's observations. Small fix, assuming it doesn't break anything else...

@ojlyytinen
Contributor

I didn't dare try to fix this myself as I don't know Active Fedora that well and the problem seemed a little complicated. You can probably make a simple fix in that insert function but I have a feeling that might break something else. It seems like the current behaviour might be the right thing to do in some cases.

Our model has contributors that belong to files or collections and files and collections have many contributors. The contributor model is here and files/collections metadata file here. The stuff with contributorable is just due to the polymorphic relationship, a contributor can belong to either a collection or a file. If your time spans only apply to files then you probably don't need that.

The key thing to work around this bug seemed to be to not use the has_and_belongs_to_many. Active Fedora then does the reflections differently and loading from Solr works fine.

@hackartisan
Contributor Author

Yeah, it seems to work fine to fix it in set_value, but I sort of worry that would result in duplications of data somewhere else. I also tried fixing it in ActiveFedora::Associations::RDF.replace but that is not working out (lots of nested arrays)
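For illustration only, here is what an appending set_value with a duplicate guard could look like in plain Ruby. AppendingResource is a hypothetical stand-in, not the ActiveFedora class, and this sketch says nothing about whether the same change is safe inside ActiveFedora, where single-valued properties may rely on replacement.

```ruby
# Hypothetical stand-in for a Solr-backed resource; not ActiveFedora code.
class AppendingResource
  def initialize
    # Every key starts out as an empty list.
    @values = Hash.new { |hash, key| hash[key] = [] }
  end

  # Called once per value, or with a list of values.
  def insert(key, value)
    set_value(key, value)
  end

  # Returns a copy so callers cannot mutate the stored list.
  def get_values(key)
    @values[key].dup
  end

  private

  # Appends instead of replacing, and skips values already present
  # so repeated inserts do not duplicate data.
  def set_value(key, value)
    Array(value).each do |v|
      @values[key] << v unless @values[key].include?(v)
    end
  end
end

resource = AppendingResource.new
%w[a1 a2 a1].each { |id| resource.insert(:author_ids, id) }
stored = resource.get_values(:author_ids)
```

The guard addresses the duplication worry only for exact repeats of the same value; it does not help where the caller expects replace semantics.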

@hackartisan
Contributor Author

I'm getting side effects when I try to fix the root problem. Not able to continue pursuing this.

@hackartisan
Contributor Author

@ojlyytinen the problem I have with the alternate modeling is that it creates the relationship in the wrong direction. In my case if I follow this structure I have a TimeSpan that knows what file it belongs to, but the file itself doesn't point out to the TimeSpan. It's not a logical direction for this relationship; no one will ever think about the data that way. I realize I can store the data as needed in solr, it just seems backwards.

The n-1 relationships don't exist in the other direction in AF. So to get the relationship in the right direction you have to use has_and_belongs_to_many and has_many. It makes more sense to me to model as n-n in the right direction than as n-1 in the wrong direction.

Unfortunately, this bug prevents me from being able to do so.

(corrected; I initially had the wrong cardinality)

@ojlyytinen
Contributor

@HackMasterA yes in Fedora that's how it's stored, the link will only be in the TimeSpan. However, you can still traverse the relationship either way. ActiveFedora should automatically load everything fine and you should be able to access the time spans of a file using file.date_of_publication or something similar.
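A toy in-memory model of that idea, in plain Ruby rather than ActiveFedora: the link is stored only on the TimeSpan side, yet the file can still reach its time spans through an inverse lookup. FileRecord, the store array, and the method names are all hypothetical.

```ruby
# Toy model, not ActiveFedora: the reference lives only on the TimeSpan.
class TimeSpan
  attr_reader :id, :file_id

  def initialize(id, file_id)
    @id = id
    @file_id = file_id
  end
end

# Named FileRecord to avoid clashing with Ruby's built-in File class.
class FileRecord
  attr_reader :id

  # `store` is the collection of all known time spans.
  def initialize(id, store)
    @id = id
    @store = store
  end

  # Inverse traversal: find every time span that points back at this file.
  def time_spans
    @store.select { |span| span.file_id == id }
  end
end

store = [TimeSpan.new('t1', 'f1'), TimeSpan.new('t2', 'f1'), TimeSpan.new('t3', 'f2')]
file = FileRecord.new('f1', store)
span_ids = file.time_spans.map(&:id)
```

The file never stores its own list of time spans; it derives the list on demand from the objects that reference it.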

If the time spans need to be indexed in Solr, then you'll need to override the to_solr method and add them manually. See our metadata.rb file above and the overridden to_solr method there.

Here's a minimal test to demonstrate the bug itself. This is from my old debugging which used books and authors instead of files and time spans.

class Book < ActiveFedora::Base
  has_and_belongs_to_many :authors, predicate: ::RDF::DC.creator, inverse_of: :books
end

class Author < ActiveFedora::Base   
  has_and_belongs_to_many :books, predicate: ::RDF::URI.new('http://example.org/ns#author_of'), inverse_of: :authors
end

b = Book.create
b.authors.create
b.authors.create

Book.find(b.id).authors                    # returns both authors
Book.load_instance_from_solr(b.id).authors # returns only the last author

If you change the model to this, then Fedora and Solr return the same authors. But obviously now a single Author can belong to only one book instead of any number of books.

class Book < ActiveFedora::Base
  has_many :authors, inverse_of: :book
end

class Author < ActiveFedora::Base   
  belongs_to :book, predicate: ::RDF::URI.new('http://example.org/ns#author_of')
end

@hackartisan
Contributor Author

@ojlyytinen thanks for this sample -- I was having trouble because I still had a predicate on the has_many relationship. I was obstinately trying to get the relationship in the direction I like... :)

I think we'll be able to use your workaround for now, with the long-term hope that I or someone else will add more associations to active fedora in the future. (and/or gain enough familiarity with AF to resolve this bug).
