Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support multi-index ES search #57

Merged
merged 7 commits into from
May 20, 2015

Conversation

Rizziepit
Copy link
Contributor

No description provided.

@Rizziepit
Copy link
Contributor Author

@smn @miltontony I'd like your input on this. I'm weighing up the pros and cons of implementing multi-index search vs. using the same index for multiple repos.

Multi-index

  • Pro: We can easily search across a collection of workspaces. I can reuse existing elastic-git functionality to make this happen.
  • Con: Elasticsearch's scoring operates per index (more info here). I might be mistaken, but I don't think there is a way to rank results from multiple indices given that the scores aren't comparable. Either of you know more?
  • Con: Probably a slight performance hit given the small amount of data per repo.

Single-index, multiple repos

  • Pro: Search scoring works as intended.
  • Con: We can't use the index to identify the repo. So we have to store that information in Elasticsearch. The one repo = one index assumption in elastic-git means I'll have to rewrite some functionality.

@smn
Copy link
Contributor

smn commented May 13, 2015

How are we going to map the list of workspaces to indexes in the underlying elastig-git S() stuff?

@smn
Copy link
Contributor

smn commented May 13, 2015

What if we changed the idea of workspaces to include multiple repositories?

ws = EG.workspace(dir1, dir2, dir3)

Would we use the same index prefix for all content directories or have separate ones for each and then query multiple indexes with S()?

Which, now that I'm reading things again, is what I suppose you're asking too?

@Rizziepit
Copy link
Contributor Author

@smn I like upgrading a workspace to have multiple content dirs. I'm slightly inclined towards implementing the same index for all content directories. This approach doesn't break scoring. Is there a good reason to put content dirs in separate indices? @miltontony?

@smn
Copy link
Contributor

smn commented May 13, 2015

Are we sure scoring breaks across different indexes?
I like the idea of separate indexes because then we still have the option of querying them separately.

@Rizziepit
Copy link
Contributor Author

No, I'm not sure. I'm basing this on the fact that scoring can be wonky across shards. You can explicitly use a DFS query to fix it, but I can't find mention of similar functionality across indices. It might be because people tend to put different types in different indices. I'm probably going down a rabbit hole though. Ignore and implement separate indices?

@smn
Copy link
Contributor

smn commented May 13, 2015

Perhaps we could allow both? Always single indexes, optionally a combined index?

@Rizziepit
Copy link
Contributor Author

+1

@Rizziepit
Copy link
Contributor Author

@smn @miltontony feedback on what I've done so far would be much appreciated.

So far I've partially converted Workspace and StorageManager to use multiple repos. It adds a lot of complexity, but I don't think it's a good idea to put multi-repo logic elsewhere (e.g. a Unicore frontend site). Thoughts?

@Rizziepit
Copy link
Contributor Author

New approach for multi-index search:

# indexes = ['repo1-master', 'repo2-master']
[obj] = S(
    Localisation,
    in_=['repos/repo1', 'repos/repo2'])

# indexes = ['tz-repo1-prod-master', 'tz-repo2-prod-master']
[obj] = S(
    Localisation,
    in_=['repos/repo1', 'repos/repo2'],
    index_prefixes=['tz-repo1-prod', 'tz-repo2-prod'])

Queries return ReadOnlyModelMappingType objects which have a to_object method - get_object raises NotImplementedError.

@Rizziepit
Copy link
Contributor Author

@smn @miltontony ready for review.

I'm getting a failure locally on repeat test runs because TestSearch.workspace1's index isn't deleted after tests. I can't figure out why. Any ideas?

@smn
Copy link
Contributor

smn commented May 18, 2015

@Rizziepit try running that single test with KEEP_REPO=1 py.test unicore -k <name of test> and then inspect what stuff is left lying around in the .repos dir? The KEEP_REPO environment flag is in elasticgit.tests.base.ModelBaseTest

@@ -24,25 +35,57 @@ def get_mapping_type_name(cls):
def get_model(self):
return self.model_class

def get_object(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rather raise a NotImplementedError here instead of removing the function entirely.

@smn
Copy link
Contributor

smn commented May 19, 2015

Sorry for the questions overload, I'm probably missing some obvious stuff...

@smn
Copy link
Contributor

smn commented May 19, 2015

👍 on the code.
Could you add some documentation for this?

return obj


class SM(S):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smn this could use a re-review. Instead of a single S class, there is now S and SM where SM is explicitly instantiated with model_class and in_. Workspace still uses S so that it can specify the read-write mapping type.

@smn
Copy link
Contributor

smn commented May 19, 2015

👍 from me on the S() -> SM() changes.

@smn
Copy link
Contributor

smn commented May 20, 2015

👍 again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants