Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking/abstract out examine attempt 2 #14495

Closed

Conversation

bielu
Copy link
Contributor

@bielu bielu commented Jul 4, 2023

Retargeted pr, please read convs here:
#13851 (comment)

See also #13620

@bielu bielu changed the base branch from contrib to v13/dev July 4, 2023 07:47
@github-actions
Copy link

github-actions bot commented Jul 4, 2023

Hi there @bielu, thank you for this contribution! 👍

While we wait for one of the Core Collaborators team to have a look at your work, we wanted to let you know about that we have a checklist for some of the things we will consider during review:

  • It's clear what problem this is solving, there's a connected issue or a description of what the changes do and how to test them
  • The automated tests all pass (see "Checks" tab on this PR)
  • The level of security for this contribution is the same or improved
  • The level of performance for this contribution is the same or improved
  • Avoids creating breaking changes; note that behavioral changes might also be perceived as breaking
  • If this is a new feature, Umbraco HQ provided guidance on the implementation beforehand
  • 💡 The contribution looks original and the contributor is presumably allowed to share it

Don't worry if you got something wrong. We like to think of a pull request as the start of a conversation, we're happy to provide guidance on improving your contribution.

If you realize that you might want to make some changes then you can do that by adding new commits to the branch you created for this work and pushing new commits. They should then automatically show up as updates to this pull request.

Thanks, from your friendly Umbraco GitHub bot 🤖 🙂

@JasonElkin
Copy link
Contributor

Hey @bielu, I know that you'll need some more input from HQ to get this over the line, just want to flag a couple of things in the mean time.

Draft PRs don't necessarily alert anyone, so once you're ready for some input be sure to tag someone to take a look (I don't know if simply changing it from a draft will alert anyone either, and as you've already started the conversation on #13851 it's best to tag Bjarke and/or Kenn when you're ready).

Core Collaborators don't have the power to merge to v13/dev so you'll either need to get someone from HQ to do that or retarget it to /contrib.

Thanks, and good luck!

@bielu
Copy link
Contributor Author

bielu commented Jul 18, 2023

@JasonElkin It is draft on purpose, as agreed with Umbraco HQ :) as it is one of PRs which we want avoid merging by incident.
I know I need get Umbraco Employers to merge to dev branches, it is not first cms core pr which I am doing :)

@JasonElkin
Copy link
Contributor

Totally, @bielu. TBH my reply was more to stop the Core Collabs automated alert from complaining to to us that nobody had gotten back to you yet (no reply for 201 business hours, it tells me 😱).

But also, on the grounds that this isn't the usual contribution path, and we're doing things in the open, it's helpful to make the process clear for the benefit of any others that come across this.

@bielu
Copy link
Contributor Author

bielu commented Jul 23, 2023

@kjac when you have time, I need some back and forward about unit tests.
There is my thinking:

  • as we now rely on provider not directly on examine, we should rewrite test based on provider in way so we dont know what is underhood, so instead of indexing directly to examine in unit tests as we doing now, we should have unit tests which basically use provider first to index and later retrieve items from provider, and we should be able to switch providers in this tests :)
  • we should have now integration test for providers including checking Headless API responses
    I think I might miss some bits around here, but for sure we need come with test strategy which covers less examine and more abstraction.

@kjac
Copy link
Contributor

kjac commented Jul 24, 2023

Hi @bielu 👋

we should rewrite test based on provider in way so we dont know what is underhood, ... , and we should be able to switch providers in this tests

Yes, definitively 💯

we should have now integration test for providers including checking Headless API responses

I'm not entirely sure what you mean here. If the Delivery API query providers are rewritten to use the search abstractions you're creating here, wouldn't the above mentioned provider based test be sufficient? Don't get me wrong, I'm all for more tests 😄 just making sure you don't scope creep yourself here.

@bielu
Copy link
Contributor Author

bielu commented Jul 24, 2023

@kjac I meant that we now have special abstraction for filters, which I would love to cover with examples from current requests to headless api, to ensure swap from examine to search providers doesnt change outcome :)
As umbraco responses are pretty basic and use really simple requests, headless use filtering which makes it little harder to test just in provider as I can write test but without context they might not cover correctly parts which headless is using

@bielu
Copy link
Contributor Author

bielu commented Aug 2, 2023

@kjac I started implementing unit test, so to ensure the abstraction and idea works as we wanted, I introduced Lifti provider, which is basically in memory search provider :)
also I noticed CodeScene complains about a lot stuff, because I didnt move it but duplicated by obsoleting them, any tips how can I make it less unhappy without refactoring bits which imho shouldn't be touch as part of this PR?

@bielu
Copy link
Contributor Author

bielu commented Aug 28, 2023

@kjac managed to correct indexing of Node Alias! btw when you requested from api can you share test db + request which you doing as I cannot reproduce issue with api not returning content 😢

@kjac
Copy link
Contributor

kjac commented Aug 29, 2023

@bielu I see the alias now 😄

Still don't get anything from the Delivery API though. DB attached here - admin login is test@test / Test123456

I'm still using the "default" Lucene directory factory, but hopefully that is not the issue:

  "Umbraco": {
    "CMS": {
      "Examine": {
        "LuceneDirectoryFactory": "Default"
      },
      ...

I can query single items just fine (/umbraco/delivery/api/v1.0/content/item/{key}) but that endpoint does not use the Examine index, so that makes sense.

Querying multiple items (/umbraco/delivery/api/v1.0/content/, /umbraco/delivery/api/v1.0/content/?filter=contentType:page, ...) all yield zero results.

@bielu
Copy link
Contributor Author

bielu commented Aug 29, 2023

@kjac will have a look over next few days, but going for holidays today so will be less responsive

@kjac
Copy link
Contributor

kjac commented Aug 30, 2023

Don't fret - enjoy holidays 😄

@bielu
Copy link
Contributor Author

bielu commented Sep 16, 2023

@kjac I am back from holiday, I am looking into issue now, btw I was trying to update branch with changes from v13/dev but I cannot get v13/dev to built at all, do you know if there are any issues with this branch or maybe it is git issue and time for fresh pull? 😕
Edit: fresh pull fixed issues

@bielu
Copy link
Contributor Author

bielu commented Sep 16, 2023

@kjac just pulled and play around with your db,
image
so it has to be one of settings somewhere 😕 but now is question which one!
Do you have maybe time following week to catch up on call and debug together to see why it fails on your machine? :)
As I am not able to reproduce it at all. I am working on .net 8 and windows 11, and I am assuming you are using same OS?

@bielu
Copy link
Contributor Author

bielu commented Sep 17, 2023

Also added small quality of live change in dashboard:
for indexes:
image
for searchers:
image
so it is clear which native query syntax people should use

@kjac
Copy link
Contributor

kjac commented Sep 18, 2023

Hi @bielu - welcome back.

I'm pressed for time this week... the community teams are visiting, so a fair bit of my time is reserved for that. I'll try to find time to pull and test again during the week, but it will likely be next week.

@bielu
Copy link
Contributor Author

bielu commented Sep 27, 2023

@kjac Lets find time next week to sitdown and see why it doesnt work if it still doesnt work for you?

@kjac
Copy link
Contributor

kjac commented Sep 27, 2023

Hi @bielu,

Coincidentally I did debug the PR this morning.

The concrete problem has to do with the published field. When indexing for the Delivery API, the field is stored as __Published, meaning any query for published:y explicitly filters out all results.

I'm seeing a few other issues with the query when querying the Delivery API index. The Lucene query looks like this: +*:* +published:y +(+__IndexType:content) +(+((culture:en culture:us) culture:none))

However, the corresponding Lucene query in v12 and v13 looks like this: +__IndexType:content +(culture:en-us culture:none) +published:y

  • The culture part seems wrong in the Lucene first query. The culture field should be indexed and queried as raw text.
  • I'm not sure I feel comfortable with the +*:* in the beginning of the first Lucene query. Won't that have a negative performance impact?

On top of that, I can't see any difference in the generated Lucene queries when appending filters from my Delivery API query - i.e. appending filter=contentType:[alias]. The resulting Lucene query is exactly the same as the one above. To compare, in v12 and v13 it would be appended like this: +__IndexType:content +(culture:en-us culture:none) +published:y +contentType:page

I haven't tested appending sorting to the Delivery API queries yet, so I can't tell if the same issue is present for sorting.

On a side note, the solution cannot build because Umbraco.Tests.Integration.csproj is malformed - likely due to a bad merge. See this line:

<ProjectReference Include="..\Umbraco.Tests.Common\Umbraco.Tests.Common.csproj" />

Now, we've been discussing things internally and there are a few more things to consider.

Testing

When it comes to Examine we're basically in a bad place test coverage wise. You also addressed this earlier. We've currently started looking into building a test suite for Examine, which will help ensure that we do not break things unintentionally while abstracting the search functionality.

These tests will eventually cover both search and indexing for "regular" Examine indexes (like against the internal and external index) and Delivery API index.

Tracking changes to the existing code

In an effort to cut Examine from the core, you have copied out a bunch of code to new class libraries and obsoleted the old code. I like it, but I'm afraid it might be a bit premature.

We are continuously making changes to that code. Particularly the bits that involve the Delivery API change frequently, but we might also be bugfixing things for "regular" Examine as well. I fear your approach will make it quite hard to track the ongoing changes and ensure that they're ported over to the new libraries.

Maybe it would be better to do all these changes in-place to start with and split things out later? It would of course mean that Examine would still be a first-class citizen, but the upsides in form of change tracking may outweigh that.

Thoughts?

Backwards compatability

It seems I'm still able to use ExamineManager etc. directly to perform searches. That's good for backwards compatability. We must be able to, otherwise the upgrade path will be a nightmare.

What I can't entirely grasp is... how will all these changes affect community search packages and generally the ability to perform advanced stuff like faceting and spatial search?

@bielu
Copy link
Contributor Author

bielu commented Sep 27, 2023

  1. about published field it is related to inconsistency between indexing, we should make decision if we going to be consistent with other indexes or with v12 delivery api
  2. "+*:* " has negilable performance impact, the most used way of searching sadly is managed query which generates only 4 queries per index field... I wish we could skip match all query but AFAIK we can't if we want reuse this query with correct way of doing it
  3. let me check Umbraco.Tests.Integration.csproj but I think it is related to change in same file and wrong git decision aroudn which are master chanes
  4. Tracking changes to the existing code - orginal pr was moving classes around without obsoleting them! which would be cleaner and easier to track changes, but because of Umbraco breaking changes policy I moved it to use obsoleted classes and new duplicated classes. we can work around it if we do v12 pr to obselete classes and remove obseleting classes in this pr! it would make it much easier to track!
  5. Backwards compatability - TBH you never should speak directly with ExamineManager but umbraco never abstracted it out earlier... in this code we do keep ExamineManager in container which means people can access it, but they shouldn't all code should use new abstraction code, which makes code search engine agnostic :) but yeah because Examine is so long not correctly abstract out it will be quite a road!
  6. culture field - just readed again and I will fix it asap.
  7. for faceting and spatial search we agreed with Bjarke to not think about it right now, as Examine is not supporting it yet, but we can introduce abstraction to handle it as well :)

@kjac
Copy link
Contributor

kjac commented Sep 29, 2023

about published field it is related to inconsistency between indexing, we should make decision if we going to be consistent with other indexes or with v12 delivery api

I'm all for consistency. But we can't really change it at this point, because it'd require everyone to rebuild their indexes. We also can't assume what the field name is for custom indexes, or if it's even present (InternalIndex for example must be able to query unpublished content).

I wish we could skip match all query but AFAIK we can't if we want reuse this query with correct way of doing it

I believe +__IndexType:content does the trick here? That part of the query basically ensures that we select everything in the index, because everything has type content.

In terms of change tracking and backwards compat I'm mostly expressing concerns here. At this point it's way more important to keep up with core changes than to worry about obsoletion strategies in my opinion.

Also, I understand all the changes correctly, the obsoleted code won't actually be run (used in the DI) anyway? If that's correct we need to treat this whole thing as one big breaking change. That is, we still need to ensure that ExamineManager works for at least one major, as people have been using it for ages. But ... well, generally speaking, the obsoletion likely won't make a lot of sense, if the obsoleted code doesn't run.

Ideally we'll have sufficient test coverage for indexing and searching to ensure that any functionally breakage is discovered by said tests 😄

@bielu
Copy link
Contributor Author

bielu commented Sep 29, 2023

I'm all for consistency. But we can't really change it at this point, because it'd require everyone to rebuild their indexes. We also can't assume what the field name is for custom indexes, or if it's even present (InternalIndex for example must be able to query unpublished content).

We can duplicate property for now, that will have slight impact, it will just make marginally slower indexes.

I believe +__IndexType:content does the trick here? That part of the query basically ensures that we select everything in the index, because everything has type content.

Are we sure indexType is correct query? abstraction will be used between all types of indexes 🤔 maybe we should add dummy field and search for it?

In terms of change tracking and backwards compat I'm mostly expressing concerns here. At this point it's way more important to keep up with core changes than to worry about obsoletion strategies in my opinion.

i can remove obsoleting part and do just breaking changes, than we can track changes and git should try help use find what change and where 🤔 (we can add not breaking obsolete tags in separate prs than we dont have to worrying about introducing obsoletes in this pr?

Also, I understand all the changes correctly, the obsoleted code won't actually be run (used in the DI) anyway? If that's correct we need to treat this whole thing as one big breaking change. That is, we still need to ensure that ExamineManager works for at least one major, as people have been using it for ages. But ... well, generally speaking, the obsoletion likely won't make a lot of sense, if the obsoleted code doesn't run.

Yes, you understood correctly, we still use ExamineManager in there so you can access it. We can do path with introducing obseleting in v12.x, but yeah it doesnt change classes would have to be in v13 still present, but yeah we have a lot code which doesnt run in umbraco and tagged and obsolete (logger injecting helpers etc)

I think it would be beneficial to have call and include Bjarke to this and make full strategy on it, as right now we looking into it without any.

@kjac
Copy link
Contributor

kjac commented Oct 2, 2023

I agree. We should talk this through soon.

Bjarke is gone all week for the US summit. I'll get back to you next week, we'll see how the calendars might align 👍

@bielu
Copy link
Contributor Author

bielu commented Oct 11, 2023

Hi @kjac,
I just had though about it and came with this POC strategy:

  • we introduce obsolete information in next patch for v12
  • we keep them in v13
  • I will readdress this pr to v14, so we can have clean transition without having maintaining both approaches in v14
    as keeping both namespaces in DI will cause us troubles.
    Anyway I still think we should sit down on call and prepare full strategy for it. :)

@bielu
Copy link
Contributor Author

bielu commented Oct 16, 2023

Hey @kjac any updates on your side?

@bielu
Copy link
Contributor Author

bielu commented Oct 24, 2023

@kjac Hello mate, I have some time on my plate and wanted do some work related to this pr, but not sure if I should continue or should open new retargated to v14?

@bergmania
Copy link
Member

Hi @bielu

I apologize many times that we have been so slow to respond. It's just been really busy lately.

In the team we have discussed how we can handle this PR and generally how we want to abstract the search in a good way.

I'm afraid we'll have to go back into the thinking box a bit, so we make sure we get the right approach to all the use cases we see. I am very sorry for that, as I know you have already put in a lot of effort, which fortunately I believe we can get a lot out of. Sadly it will certainly mean that Umbraco 13 is not realistic to reach.

The search abstraction is very central for us to get right. I think the next step is a RFC to be published with our thoughts, so that we can also involve a larger part of the community.

Best Regards

@bielu
Copy link
Contributor Author

bielu commented Nov 9, 2023

@bergmania that is fine, I am just little annoyed about lack of communication.
It is okay to bin this pr, as I already would have to do it after latest changes in umbraco related to some bits. About RFC yeah might a way to go. I am happy to help with implementation or giving some opinion points about it as package developer for search providers.

@bielu
Copy link
Contributor Author

bielu commented Dec 13, 2023

Hey @bergmania,
Any ETA for RFC? :)
Best,
Arkadiusz

@bergmania
Copy link
Member

We haven't startet yet, but the aim in next year

@bielu
Copy link
Contributor Author

bielu commented Apr 17, 2024

Hey @bergmania any updates on RFC?

@bergmania
Copy link
Member

No.. It's currently not a priority. Goal is still some time this year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants