
feat: Remove CollectionDescription.Schema #1965

Merged

Conversation

AndrewSisley
Contributor

@AndrewSisley AndrewSisley commented Oct 13, 2023

Relevant issue(s)

Resolves #1958

Description

Removes CollectionDescription.Schema.

Also splits the storage of schema out from within collection.

The storage of schema has been broken out to a new sub-package of db. At the moment it is a very simple file, but collection will be moved there in #1964. I was planning on doing that in this PR (in part, to provide context for reviewers, as at the moment it is basically a single-file package), but it proved to be non-trivial due to some existing messiness in that space, so it was broken out into two more tasks.

I also wish for stuff in that directory to eventually follow a repository-like pattern, where results are cached (within a context/txn's scope) instead of being fetched from the store on each call.

Moving this stuff out to a new directory instead of preserving it in the (already very large) db directory should make both db and the new sub-package a fair bit more cohesive and easier to read.
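As an illustration of the repository-like pattern mentioned above, a minimal sketch follows. Everything here is hypothetical: the schemaRepository type, its fetch function, and the exact datastore.Txn usage are assumptions for illustration, not code from this PR.

package description

import (
    "context"

    "github.com/sourcenetwork/defradb/client"
    "github.com/sourcenetwork/defradb/datastore"
)

// fetchFunc stands in for whichever store-level getter ends up existing;
// it loads a schema version from the underlying store.
type fetchFunc func(ctx context.Context, txn datastore.Txn, versionID string) (client.SchemaDescription, error)

// schemaRepository caches fetched schema versions for the lifetime of a
// transaction, instead of hitting the store on every call.
type schemaRepository struct {
    txn   datastore.Txn
    fetch fetchFunc
    cache map[string]client.SchemaDescription // keyed by schema version ID
}

// GetSchemaVersion consults the per-txn cache first, falling back to the
// store (and populating the cache) on a miss.
func (r *schemaRepository) GetSchemaVersion(
    ctx context.Context,
    versionID string,
) (client.SchemaDescription, error) {
    if schema, ok := r.cache[versionID]; ok {
        return schema, nil
    }
    schema, err := r.fetch(ctx, r.txn, versionID)
    if err != nil {
        return client.SchemaDescription{}, err
    }
    r.cache[versionID] = schema
    return schema, nil
}

Because a repository like this would be constructed per transaction, the cache can never serve data from a different txn's view of the store.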

@AndrewSisley AndrewSisley added feature New feature or request area/schema Related to the schema system area/collections Related to the collections system refactor This issue specific to or requires *notable* refactoring of existing codebases and components code quality Related to improving code quality labels Oct 13, 2023
@AndrewSisley AndrewSisley added this to the DefraDB v0.8 milestone Oct 13, 2023
@AndrewSisley AndrewSisley requested a review from a team October 13, 2023 16:07
@AndrewSisley AndrewSisley self-assigned this Oct 13, 2023
@AndrewSisley AndrewSisley force-pushed the 1958-schema-collection-storage branch 2 times, most recently from b16ff04 to 159aa0b on October 13, 2023 16:49
@codecov

codecov bot commented Oct 13, 2023

Codecov Report

Attention: 54 lines in your changes are missing coverage. Please review.

Comparison is base (20b6329) 74.30% compared to head (17bd290) 74.03%.
Report is 1 commit behind head on develop.


@@             Coverage Diff             @@
##           develop    #1965      +/-   ##
===========================================
- Coverage    74.30%   74.03%   -0.28%     
===========================================
  Files          242      244       +2     
  Lines        23436    23508      +72     
===========================================
- Hits         17414    17402      -12     
- Misses        4837     4899      +62     
- Partials      1185     1207      +22     
Flag Coverage Δ
all-tests 74.03% <65.38%> (-0.28%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
client/descriptions.go 70.42% <ø> (ø)
db/schema.go 85.37% <100.00%> (+0.03%) ⬆️
planner/sum.go 88.98% <100.00%> (ø)
core/key.go 88.97% <87.50%> (-0.06%) ⬇️
db/collection.go 69.35% <88.46%> (-0.05%) ⬇️
db/description/errors.go 0.00% <0.00%> (ø)
db/description/schema.go 56.31% <56.31%> (ø)

... and 8 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d0c7fb...17bd290.

db/collection.go Outdated
-	existingDescriptionsByName map[string]client.CollectionDescription,
-	proposedDescriptionsByName map[string]client.CollectionDescription,
+	existingSchemaByName map[string]client.SchemaDescription,
+	proposedDescriptionsByName map[string]client.SchemaDescription,
 	desc client.CollectionDescription,
Contributor Author

todo: Passing this in no longer makes any sense, nor does fetching it in patchSchema. Drop the patchSchema call to getCollectionsByName, and fetch it on demand in here.

See also #1939 (comment)

  • Remove unwanted complexity RE collectionDescription param

Comment on lines +146 to +147
 	collectionSchemaVersionKey := core.NewCollectionSchemaVersionKey(schema.VersionID)
 	// Whilst the schemaVersionKey is global, the data persisted at the key's location
 	// is local to the node (the global only elements are not useful beyond key generation).
 	err = txn.Systemstore().Put(ctx, collectionSchemaVersionKey.ToDS(), buf)
 	if err != nil {
 		return nil, err
 	}

-	collectionSchemaKey := core.NewCollectionSchemaKey(schemaID)
-	err = txn.Systemstore().Put(ctx, collectionSchemaKey.ToDS(), []byte(schemaVersionID))
+	collectionSchemaKey := core.NewCollectionSchemaKey(schema.SchemaID)
+	err = txn.Systemstore().Put(ctx, collectionSchemaKey.ToDS(), []byte(schema.VersionID))
 	if err != nil {
 		return nil, err
 	}

-	err = txn.Systemstore().Put(ctx, collectionKey.ToDS(), []byte(schemaVersionID))
+	err = txn.Systemstore().Put(ctx, collectionKey.ToDS(), []byte(schema.VersionID))
Collaborator

thought: These make no sense anymore. I think this is what you were referring to here

Contributor Author

Yes :)

Collaborator

suggestion: Although I like the functions that you create here, I'm not a fan of the descriptions package. I would much rather see this under db, and db/schema.go is still small enough to host the content of this file without causing trouble.

Contributor Author

The reasons for this are documented in the description: whilst it is very small now, it will hopefully grow very soon. This package provides a single cohesive location for all similar stuff (schema, collection, index, repo-stuff) to live alongside each other without getting blended into the rest of the db package.

Collaborator

I read the description. My comment still stands. I don't think it would make sense to me even after moving collection in there. I think having the content under a file with a representative name under db would be much better.

Contributor Author

It would be quite a lot of code. The schema stuff is 170 lines; collection, index, and anything else that follows would be similar, plus the repository layer on top of them. Why would you want all of that spread about a fairly large and general-purpose package (db)? It would be much, much easier to read and maintain if it all lived in isolation in the same location.

Collaborator

I don't believe it would make any difference for maintainability. The schema stuff was already in there. You took it out of existing functions. If the files are well named and represent their contents, I don't really care how large the package is in terms of lines of code. Readability would be unchanged. I also care more about the size of functions than the size of files.

If you move collection to a sub package, we will end up with the sub-package having its parent as a dependency and eventually risk hitting circular dependency issues.

Contributor Author

collection depends on db...

I do not understand what you mean here. You might be misunderstanding what will be living in this package.

Why not just add collection_description.go and schema_description.go instead of creating a sub-package?

Because they are going to be separated by 10 or so other files in the db package that are alphabetically larger than the first and smaller than the second, with index_description.go somewhere in between, plus whatever files the repo stuff will use, all causing cohesion to be lost.

Collaborator

If schema_description.go is next to schema.go and functions inside schema.go call functions inside schema_description.go, I think that's pretty good cohesion.

Why would having schema_description.go next to collection_description.go be more cohesive than the above?

Contributor Author

Why would having schema_description.go next to collection_description.go be more cohesive than the above?

The below is the definition of 'cohesive' returned from DuckDuckGo. By definition, if two files are not next to each other, they are less cohesive than two files that are next to each other.

cohesive:
Holding the particles of a homogeneous body together; producing cohesion.
Cohering, or sticking together, as in a mass; capable of cohering; tending to cohere.

For example, if the files were called description_schema.go and description_collection.go, they would be co-located, and it would be much easier to spot the shared concepts. A directory is just that really, but using a / instead of an _, with some special treatment from the file browser - this also emphasises the fact that they are conceptually separate from the rest of the db package contents.

When I look at the codebase, I'd like to see the stuff defining how to save and fetch schema next to the code used to save and fetch collections, and I'd also want the two neatly organised so that the implementation to fetch schema isn't mixed up with the code to save collections.

This lets me make sure that they all look and behave in predictable ways, and if I want to change all of them it is much more obvious what 'all of them' is, so that I don't accidentally miss some similar stuff (e.g. index descriptions).

Collaborator

Alright, I'll get behind your approach. I would like you to name it in the singular, if it's not too big an ask, as pointed out by Islam.

I am worried that we will eventually hit a circular dependency situation given the proximity to the types in the root db package. But we can cross that bridge when we get there.

Contributor Author

Okay cheers, I'll add a task/note against the package name convo to rename it

@AndrewSisley AndrewSisley force-pushed the 1958-schema-collection-storage branch 3 times, most recently from 74704e0 to c8170a0 on October 17, 2023 16:18
@@ -132,6 +133,12 @@ type CollectionIndexKey struct {

var _ Key = (*CollectionIndexKey)(nil)

type SchemaVersionKey struct {
Contributor

todo: I think it needs a comment here

Contributor Author

Cheers - I missed that, will add

  • Doc key

@@ -257,6 +264,11 @@ func NewCollectionSchemaVersionKey(schemaVersionId string) CollectionSchemaVersi
return CollectionSchemaVersionKey{SchemaVersionId: schemaVersionId}
}

func NewCollectionSchemaVersionKeyFromString(key string) CollectionSchemaVersionKey {
Contributor

question: Don't you think it makes sense to add some unit tests? There is a dedicated file that tests keys
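For context, a round-trip unit test of the kind being suggested here might look like the sketch below. It assumes the key type exposes ToString() (as other key types referenced in this PR do, e.g. prodCatKey.ToString()); the example version ID is made up.

package core

import (
    "testing"

    "github.com/stretchr/testify/assert"
)

func TestNewCollectionSchemaVersionKeyFromString_RoundTrips(t *testing.T) {
    original := NewCollectionSchemaVersionKey("bafkexampleversionid")

    // Serialize the key and parse it back, then check the version ID survives.
    parsed := NewCollectionSchemaVersionKeyFromString(original.ToString())

    assert.Equal(t, original.SchemaVersionId, parsed.SchemaVersionId)
}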

Contributor Author

I do not. This is very simple, and tested by pretty much every integration test we have. Unit tests would only slow down development (now and in the future).

Collaborator

Why would we be relying on an integration test to flag an issue with something that can easily be locally tested? That means that if we make a change to any of the keys, we would have to run the integration test suite to see if there are any problems with the change instead of running the much faster local unit tests.

Collaborator

I had a closer look and I think that not unit testing our keys could cause bad changes to go unnoticed until the breaking change tests are executed, which could be quite annoying.

Contributor Author

I'm not really in the mood to get back into the integration vs unit test conversation at the moment. I have written a large amount on this topic before and you know my opinion on this matter.

Collaborator

@fredcarle fredcarle Oct 17, 2023

We have two overall objectives with testing:
1. Testing behaviour that is meaningful to the end user (like you mentioned above).
2. Reaching maximum code coverage so that all possible code paths have been tested (sometimes that is currently impossible to do with integration tests alone). This is a goal that the team set for itself (significantly increase code coverage).

I'll end my input on this conversation by reminding you of the above objectives. If you believe that simply relying on integration tests in this case contributes towards achieving the above objectives, then I think we're good.

Contributor Author

If you believe that simply relying on integration tests in this case contributes towards achieving the above objectives

I very strongly believe so. Sometimes the integration tests will need to be more imaginative than our current core suite, for example chaos-monkey-like magic, or providing a deliberately faulty/test datastore implementation (the root datastore is something that can be publicly provided), but ultimately everything should be testable via public-only access points. If it isn't, it is dead code.

Contributor

just another note: it is practically impossible to reach a desired test coverage just with integration tests.
Just imagine setting up "a deliberately faulty/test datastore" that you want to trigger for a specific request at a specific code location. You saw it was hard to achieve with unit tests; doing it with integration tests will be a nightmare.

And those tests helped me write better code and fix bugs (not reported) in the existing code. Like this:

err := someAction()
if err != nil {
    return err
}
...
res := resultsIter.NextSync()
if res.Error != nil {
    return err // here is the problem: err is nil at this point, res.Error is what should be returned
}

Usually we don't write tests that cover such paths, but we can easily overlook the details and ship incorrect code.

Contributor Author

Such resilience tests don't need to be coded like traditional do-this, do-that tests. You can use the https://en.wikipedia.org/wiki/Chaos_engineering approach instead. This can allow the tests to be somewhat proactive (searching for bugs) instead of reactive (testing known pathways where bugs are expected).

Also, a lot of our errors can be reached by manipulating the datastore values directly - no need to mess around with mocks.
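To make the "deliberately faulty/test datastore" idea concrete, here is a rough sketch of a hypothetical error-injecting wrapper. It assumes the ipfs go-datastore interface that defradb's root datastore builds on; the random failure rate is just one possible policy - a real chaos harness could target specific keys or call sequences instead.

package chaos

import (
    "context"
    "errors"
    "math/rand"

    ds "github.com/ipfs/go-datastore"
)

var errInjected = errors.New("injected datastore failure")

// faultyDatastore wraps a real datastore and fails a fraction of calls,
// letting integration tests exercise error paths without per-call mocks.
type faultyDatastore struct {
    ds.Datastore
    failureRate float64 // probability of injecting an error on each call
    rng         *rand.Rand
}

func (f *faultyDatastore) Put(ctx context.Context, key ds.Key, value []byte) error {
    if f.rng.Float64() < f.failureRate {
        return errInjected
    }
    return f.Datastore.Put(ctx, key, value)
}

func (f *faultyDatastore) Get(ctx context.Context, key ds.Key) ([]byte, error) {
    if f.rng.Float64() < f.failureRate {
        return nil, errInjected
    }
    return f.Datastore.Get(ctx, key)
}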

Contributor Author

I also want to introduce a really magic testing layer, where randomized queries and setup data are fed in. Traditionally you can do this using recorded real-world data/actions, but there has been significant movement in using 'AI' to generate stuff like queries for such things.

Contributor

suggestion: I think the conventional way of naming packages in Go is in singular form.

Contributor Author

I think I always saw that as applying when talking about a single thing. For example, the db package contains a single db definition that can be used to instantiate multiple db instances. In descriptions we have multiple description definitions (schema, collection, indexes - eventually), each of which can be used to instantiate multiple instances.

Contributor

As I said, there is no official rule for it; it's just community convention.
The main argument for it is how you are going to use it. Yes, the package might deal with several different descriptions, but for most packages, most of their public methods deal with a single entity (a.k.a. a description).
Like context.Context allows you to create (or wrap) different types of contexts, but you handle a single one at a time.

Contributor Author

Like context.Context allows you to create (or wrap) different types of contexts, but you handle a single one at a time.

Yes, but a context.Context is still just a (singular) context.Context. The descriptions in this package do (will) not share an interface or base inherited type. The description types are all public and independent; 'descriptions' is just a made-up umbrella term because I needed to give the package a name - there is no Description interface or type.

Another example is the enumerable package; this is singular because it describes a single thing, although there are many implementations and some extensions of the base interface.

Note: I'm mostly just chatting for fun here - I don't really care about the name of the package, and think description(s) is a bit awkward anyway.

Contributor Author

Fred also prefers description over descriptions, renaming :)

  • Rename package

"github.com/sourcenetwork/defradb/datastore"
)

// CreateSchemaVersion creates and saves to the store a new schema version.
Contributor

suggestion: "a new schema description version"

Contributor Author

I do disagree, and don't think adding description there adds any clarity. It does also clash with Fred's long-term naming suggestion, where SchemaDescription will be renamed to Schema (or maybe SchemaVersion; either way, Description will be dropped).

Contributor

It was just an observation as a reader: I was looking for the word "description" because the method is in the descriptions package, but it wasn't clear to me at first sight why.

return desc, nil
}

// GetSchemas returns the schema of all the default schemas in the system.
Contributor

question: what is "default schema"? Is it a common term in defra?

Contributor Author

It is a fairly new term that came to be when we allowed the toggling of schema versions. There is a word missing here; it should read "default schema versions". A default schema version is the version used by a collection/query/etc. when not explicitly requesting a version.

  • Tweak doc wording

Contributor

all right, the term "default schema versions" I know :)

}
defer func() {
if err := collectionSchemaVersionQuery.Close(); err != nil {
log.Error(ctx, NewErrFailedToCloseSchemaQuery(err).Error())
Contributor

todo: #1450 (comment) :)

Contributor Author

Hahaha, nice 😁 Thanks Islam 😆

  • Return error

@@ -661,86 +661,6 @@ func TestNonUniqueDrop_ShouldDeleteStoredIndexedFields(t *testing.T) {
assert.Len(t, f.getPrefixFromDataStore(prodCatKey.ToString()), 1)
}

func TestNonUniqueDrop_IfDataStorageFails_ReturnError(t *testing.T) {
Contributor

question: I know we already touched on this one, but it is really hard to see these tests being cut off.

I know these unit tests might be a bit hard to maintain with all these mocks, but they reach places that integration tests can't, and help us achieve the greater code coverage we are striving for.

I wonder how you decide if some tests need to be removed.

Contributor Author

I know we already touched on this one, but it is really hard to see these tests being cut off.

I really do appreciate that, and it is a big personal reason why I do not want the removal of these tests to be blended in with the key.go unit test discussion.

help us achieve the greater code coverage we are striving for

  1. Code coverage is not an end goal, it is just an indicator. I do not strive for code coverage, I strive for correct and maintainable code. Good tests contribute to code correctness and lower maintenance costs, bad tests can have a negative effect on both.
  2. Removing these tests does not appear to have affected code coverage.

I wonder how you decide if some tests need to be removed.

:) I'm a lot more conservative than I want to be RE preserving unit tests. I have tried several times before to remove many of the unit tests that I have spent notable time changing in these last few PRs.

The index unit tests however have cost me a lot of time over the last 10 days or so. It took me roughly half a day of development to get them working in a previous PR; I didn't bother trying in this one and removed the ones that started failing.

I do not consider these to be worth their maintenance cost.

They are the only tests to change in this PR.

This PR does not change any public behaviour.

This PR does not touch any index related code.

Yet these tests started to fail. These tests inhibit useful work; they prevent us from making useful changes to the codebase - changes that are irrelevant to what they are testing for.

Almost everything that the tests test is covered by integration tests. Some tests in these files do test errors which do not yet have integration test coverage, however I consider testing those errors to be very low value.

As well as the above, fixing these once they start failing is particularly painful. There is quite a lot of mock setup, and it is not easy code to read. Fixing them requires guessing how many internal calls to various mocks used by the tests are the 'correct' number, with the 'correct' input params and responses.

Somewhat related to the conversation we had RE the index-explain integration tests, these unit tests make unimportant stuff really important, and this wastes a lot of time.

In my opinion, the cost of these tests far outweighs any benefit of having them. Good tests save developer time (including external users' time in the case of bugs), that is why we write them - these tests cost more time than they will ever save.

I am not against the unit testing of stuff. Unit tests are useful in many cases.

I am not against mocks; they have their place. However here, with the amount of irrelevant stuff that they need to mock, they make these tests bad tests that cost more than they save.

If we wish to preserve tests similar in concept to the tests in this file they should be reworked somehow. They really should not fail when making the changes I have made in this PR.

Contributor Author

Yet these tests started to fail.

To clarify on this, I changed the commit order - I try quite hard to make all of my commits pass the CI; it makes a lot of things easier. So they started to fail part way through this PR, I removed them, then moved the commit to the start.

Contributor

hey Andy, thanks for the detailed reasoning. I appreciate it.

I agree, these tests turned out to be a bit too invasive and have high maintenance cost with relatively little benefit.

These have cost me more than a day's worth of time in the last week already, and they have started failing again due to various internal mocks requiring the exact number of calls with the exact params, so I have decided that is enough. These are bad tests that cost far more time to maintain than they will ever save. No public behaviour has changed here, yet they insist on failing, and they require constant deep diving into really dense mock setup code that has very little to do with anything useful to our users. They are being removed now.
Collaborator

@fredcarle fredcarle left a comment

We're making good progress on detangling collection and schema. Keep it up! 🙂

@AndrewSisley AndrewSisley merged commit 0efe835 into sourcenetwork:develop Oct 17, 2023
29 checks passed
@AndrewSisley AndrewSisley deleted the 1958-schema-collection-storage branch October 17, 2023 22:10
nasdf pushed a commit to nasdf/defradb that referenced this pull request Oct 18, 2023
shahzadlone pushed a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024
Development

Successfully merging this pull request may close these issues.

Store collection and schema descriptions against different keys
3 participants