New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New work interface implementation #1819

Open
wants to merge 20 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@marta-lokhova
Contributor

marta-lokhova commented Oct 29, 2018

Work interface makeover based on the following discussions:
#1755 - original issue with design brainstorm
#1801 - rough prototype to exhibit how Work would operate

@marta-lokhova marta-lokhova changed the title from New work interface builds to New work interface implementation Oct 29, 2018

@MonsieurNicolas

overall things look good. I went over most of the new implementations, and things look a lot better now.
I spent more time on BasicWork/Work as it seems we can make certain improvements.

Show resolved Hide resolved src/main/ApplicationImpl.cpp
State getState() const;
bool isDone() const;
size_t getMaxRetries() const;

This comment has been minimized.

@MonsieurNicolas

MonsieurNicolas Nov 2, 2018

Contributor

why do we need getMaxRetries in the public interface?

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 9, 2018

Contributor

GetHistoryArchiveState is using this method to set number of retries for its children. At first I thought it might not be necessary to use it, and just default to RETRY_A_FEW, but looks like it affects the timing on some unrelated tests (e.g. upgrade tests), causing them to fail.

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

GetHistoryArchiveStateWork only has one constructor that takes maxRetries as a parameter. Instead of calling getMaxRetries it should just store that value during the constructor and use that. But more theoretically, it doesn't really make sense for the number of retries of a child to depend on the number of retries of the parent. We should fix this behavior and update the tests.

Show resolved Hide resolved src/work/BasicWork.h Outdated
Show resolved Hide resolved src/work/BasicWork.h
Show resolved Hide resolved src/work/BasicWork.h
mApp.getHistoryManager().checkpointContainingLedger(mRange.first());
mHdrIn.close();
mTxIn.close();
return BasicWork::getStatus();

This comment has been minimized.

@MonsieurNicolas

MonsieurNicolas Nov 2, 2018

Contributor

in general, I would prefer to not special case the status strings too much and use proper delegation: ie, use the status string of the parent & optionally add some extra details.
In this case:
if applying a checkpoint is in progress (running), just append to BasicWork::getStatus() a string similar to what you have.

That way it's easy to match status changes.

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 12, 2018

Contributor

That makes sense, but looks like it's already handled by mApp.getCatchupManager().logAndUpdateCatchupStatus

Show resolved Hide resolved src/catchup/CatchupWork.cpp Outdated
namespace stellar
{
BatchWork::BatchWork(Application& app, std::string name)

This comment has been minimized.

@MonsieurNicolas

MonsieurNicolas Nov 2, 2018

Contributor

why are you introducing BatchWork in this PR? I thought you wanted to merge the others first? #1701

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 7, 2018

Contributor

Before we discussed the merge order, I already had BatchWork in this implementation (as it actually didn't change that much comparing to #1701), so I just kept it here for now. Once everything else is merged, I can re-visit this and properly rebase.

Show resolved Hide resolved src/historywork/VerifyBucketWork.cpp Outdated
@@ -0,0 +1,5 @@
//
// Created by Marta Lokhava on 10/29/18.

This comment has been minimized.

@MonsieurNicolas

MonsieurNicolas Nov 2, 2018

Contributor

I don't think you wanted to check in those files (GZipAndPutFilesWork.cpp/h) in this commit

uint64_t getRetryETA() const;
// Main method for state transition, mostly dictated by `onRun`
void crankWork();

This comment has been minimized.

@vogel

vogel Nov 5, 2018

Contributor

I think it would be nice to also remove this from public interface. It is used only in classes derivered from BasicWork itself (Work and WorkScheduler). As it is used through shared_ptrs it may not be as trivial as just making it protected, but it may be worth it.

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 9, 2018

Contributor

Work should be able to call "crank" on its children (children could be instances of BasicWork, not just Work). While it's possible to do it with crankWork being protected, I'd prefer not to use any hacks :)

Show resolved Hide resolved src/work/WorkScheduler.cpp Outdated
virtual void onReset();
virtual State onRun() = 0;
virtual void onWakeUp();
virtual void onFailureRetry();

This comment has been minimized.

@vogel

vogel Nov 5, 2018

Contributor

This one seems not to be used.
Probably only onFailureRaise is needed (and can be renamed to onFailure).

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 10, 2018

Contributor

Looks like it's needed for RepairMissingBucketsWork (as it needs to differentiate between a retry vs terminal failure to call a handler)

Show resolved Hide resolved src/work/BasicWork.cpp Outdated
*/
class WorkSequence : public Work
{
std::vector<std::shared_ptr<BasicWork>> mSequenceOfWork;

This comment has been minimized.

@vogel

vogel Nov 6, 2018

Contributor

Does that mean that WorkSequence has two lists of work: mSequenceOfWork and Work::mChildren? Seems like items from mSequenceOfWork are not shutted down when WorkSequence is.

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 7, 2018

Contributor

All items in mSequenceOfWork (except the one currently executing), are not part of the work tree (orchestrated by WorkScheduler) and are in reset/terminal state, and by definition cannot be running. However, it's probably good to shut them down as well for clarity, so that they are in proper DESTRUCTING state.

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 13, 2018

Contributor

(Note: this change will happen with proper abort implementation though, because we need to be able to properly transition from ABORTED to PENDING state for work sequence retries)

Show resolved Hide resolved src/work/Work.cpp Outdated
scheduleComplete(WORK_COMPLETE_FATAL);
auto child = std::make_shared<T>(mApp, std::forward<Args>(args)...);
addChild(child);
child->startWork(wakeUpCallback());

This comment has been minimized.

@vogel

vogel Nov 6, 2018

Contributor

Is that ever going to change for a single instance of Work? Maybe it can be moved into constructor parameter of Work class?

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 10, 2018

Contributor

I assumed it might, so kept it like this for flexibility.

}
CLOG(DEBUG, "History") << "Downloading and unzipping " << mFileType
<< " for checkpoint " << mNext;
auto getAndUnzip = addWork<GetAndUnzipRemoteFileWork>(ft);

This comment has been minimized.

@vogel

vogel Nov 6, 2018

Contributor

I think it is strange to make implementors of yieldMoreWork do the addWork themselves.

I think that:

  1. we should replace that addWork with non-templated addWork that just gets shared_ptr (you got rid of passing *this to new work instances, so this wont hurt)
  2. caller of yieldMoreWork should then do addWork() on result (not it justs adds result to mBatch).
  3. note: interesting, Work::addChild does not check for duplicate work, but BatchWork checks for that

This comment has been minimized.

@marta-lokhova

marta-lokhova Nov 13, 2018

Contributor

Added a TODO, as BatchWork will be updated after #1701 is merged.

@marta-lokhova

This comment has been minimized.

Contributor

marta-lokhova commented Nov 13, 2018

Pushed a few updates earlier today - I've addressed most of the comments; I've marked some as resolved, and replied to the rest (in case people have more comments; upon agreement I'll mark those as resolved as well). Note that this PR still needs to be rebased to address comments on some commits (I did not want to mess with commits just yet, so that it's easier to review)

@marta-lokhova marta-lokhova referenced this pull request Nov 13, 2018

Closed

New work interface #1801

std::set<BasicWork::Transition> const BasicWork::ALLOWED_TRANSITIONS = {
Transition(InternalState::PENDING, InternalState::RUNNING),
Transition(InternalState::PENDING, InternalState::DESTRUCTING),
Transition(InternalState::RUNNING, InternalState::RUNNING),

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

I also think that moving to the same state should be a bug (I remarked about the same issue in #1801). I don't think your example invalidates this because ApplyBucketsWork does not repeatedly transition from RUNNING -> RUNNING while doing its work in chunks. There may be other examples that show that this does make sense though?

static size_t const RETRY_A_LOT;
static size_t const RETRY_FOREVER;
// Publicly exposed state of work:

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

Nitpick: Can you reflow the text of this comment so it is easier to read?

WORK_RUNNING,
WORK_WAITING,
WORK_SUCCESS,
WORK_FAILURE,

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

Extra trailing comma?

State getState() const;
bool isDone() const;
size_t getMaxRetries() const;

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

GetHistoryArchiveStateWork only has one constructor that takes maxRetries as a parameter. Instead of calling getMaxRetries it should just store that value during the constructor and use that. But more theoretically, it doesn't really make sense for the number of retries of a child to depend on the number of retries of the parent. We should fix this behavior and update the tests.

void wakeUp();
// Default wakeUp callback that implementers can use
std::function<void()> wakeUpCallback();

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

I'm not sure that this should be a member function: it doesn't depend on any private variables.

bool
CatchupWork::applyBuckets()
BasicWork::State
CatchupWork::doWork()

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

Can we refactor this function into several smaller functions? Or better yet, could CatchupWork be refactored to take advantage of WorkSequence?

}
}
else
{
scheduleSuccess();
if (allChildrenSuccessful())

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

I've seen this exact block a few times, maybe this should be a helper function.

{
std::shared_ptr<StateSnapshot> mSnapshot;
std::unique_ptr<VirtualTimer> mTimer;
std::chrono::seconds const mDelay{1};

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

I don't see a lot of reason to store this as a member variable, it is only used in one function.

}
else if (state == State::WORK_SUCCESS)
{
mNextChild = mChildren.erase(next);

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

This looks like a memory leak to me. It is possible for a work to fail without causing stellar-core to terminate (I believe PublishWork is such an example). If the Work in WORK_FAILURE state is not erased then that memory will never be reclaimed since Work stores a std::shared_ptr. But WorkScheduler lives for ever and is derived from Work, so it will accumulate failed work over time.

return State::WORK_RUNNING;
}
else
{
auto state = w->getState();
if (state == State::WORK_SUCCESS)
{
clearChildren();
mCurrentExecuting = nullptr;
mNextInSequence++;
return State::WORK_RUNNING;

This comment has been minimized.

@jonjove

jonjove Nov 16, 2018

Contributor

This looks inefficient: WorkSequence is scheduled but actually doesn't crank any of its children.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment