Do not change type or health map for health reasons. #1685

Merged on May 11, 2016 (32 commits)
Commits
f33951b
First pass at not going spare if unhealthy.
alainjobart May 6, 2016
e97391d
Fixing tabletmanager.py with new healthcheck.
alainjobart May 6, 2016
0a6d9de
Fixing this file too.
alainjobart May 6, 2016
bc6af81
Now ChangeType also runs a health check.
alainjobart May 6, 2016
fc89dab
Re-init tablets to fix vtgatev2_test.py.
alainjobart May 9, 2016
859c558
Removing tablet type parameter to RunHealthCheck.
alainjobart May 9, 2016
5c7b385
Removing targetTabletType from health check proto.
alainjobart May 9, 2016
8ecfa8a
Now making target_tablet_type obsolete.
alainjobart May 9, 2016
6d815ad
No more healthmap in tablet.
alainjobart May 9, 2016
25bc282
Better doc for tablet type, better unit tests.
alainjobart May 9, 2016
1071e4e
Now run a health check after master demotion.
alainjobart May 9, 2016
1161251
TER now goes to REPLICA as well. Updating tests.
alainjobart May 9, 2016
e1118ee
Experimental server type now runs query service.
alainjobart May 9, 2016
a1f9fee
Improvements to the mysql health module.
alainjobart May 9, 2016
1b6e299
Fixing health check.
alainjobart May 10, 2016
1a5f6b1
Lots of fixes after Anthony's comments.
alainjobart May 10, 2016
4000a72
Fixing a corner case in health check.
alainjobart May 10, 2016
9f93aaa
Removing target_tablet_type from tests.
alainjobart May 10, 2016
67a9196
Fixing this test after changes.
alainjobart May 10, 2016
7d2b3bb
Addressing couple more review comments.
alainjobart May 10, 2016
022854e
De-duplicating some of the tablet state logic.
alainjobart May 10, 2016
2702ec2
Fixing tests after last change.
alainjobart May 10, 2016
84939d6
Changing expectations on default state.
alainjobart May 10, 2016
ec97892
Fixing this integration test.
alainjobart May 11, 2016
c4d287e
Now always entering lameduck on shutdown.
alainjobart May 11, 2016
6e55faf
Fixing this test after state change.
alainjobart May 11, 2016
1132202
Also fixing this test.
alainjobart May 11, 2016
552ba12
Re-generating the doc.
alainjobart May 11, 2016
c10a84f
Merge branch 'master' into sparenomore
alainjobart May 11, 2016
8902d03
Adding result of merge (I think)
alainjobart May 11, 2016
839b600
Nox fixing these tests.
alainjobart May 11, 2016
9656240
Fixing the last 3 failing tests.
alainjobart May 11, 2016
342 changes: 144 additions & 198 deletions doc/VitessApi.md

Large diffs are not rendered by default.

30 changes: 7 additions & 23 deletions doc/vtctlReference.md
@@ -636,7 +636,7 @@ Starts a transaction on the provided server.
 | connect_timeout | Duration | Connection timeout for vttablet client |
 | keyspace | string | keyspace the tablet belongs to |
 | shard | string | shard the tablet belongs to |
-| tablet_type | string | tablet type we expect from the tablet (use unknown to use sessionId) |
+| tablet_type | string | tablet type we expect from the tablet |


 #### Arguments
@@ -669,7 +669,7 @@ Commits a transaction on the provided server.
 | connect_timeout | Duration | Connection timeout for vttablet client |
 | keyspace | string | keyspace the tablet belongs to |
 | shard | string | shard the tablet belongs to |
-| tablet_type | string | tablet type we expect from the tablet (use unknown to use sessionId) |
+| tablet_type | string | tablet type we expect from the tablet |


 #### Arguments
@@ -703,7 +703,7 @@ Executes the given query on the given tablet.
 | json | Boolean | Output JSON instead of human-readable table |
 | keyspace | string | keyspace the tablet belongs to |
 | shard | string | shard the tablet belongs to |
-| tablet_type | string | tablet type we expect from the tablet (use unknown to use sessionId) |
+| tablet_type | string | tablet type we expect from the tablet |
 | transaction_id | Int | transaction id to use, if inside a transaction. |


@@ -738,7 +738,7 @@ Rollbacks a transaction on the provided server.
 | connect_timeout | Duration | Connection timeout for vttablet client |
 | keyspace | string | keyspace the tablet belongs to |
 | shard | string | shard the tablet belongs to |
-| tablet_type | string | tablet type we expect from the tablet (use unknown to use sessionId) |
+| tablet_type | string | tablet type we expect from the tablet |


 #### Arguments
@@ -2000,35 +2000,19 @@ Reparent a tablet to the current master in the shard. This only works if the cur

 ### RunHealthCheck

-Runs a health check on a remote tablet with the specified target type.
+Runs a health check on a remote tablet.

 #### Example

-<pre class="command-example">RunHealthCheck &lt;tablet alias&gt; &lt;target tablet type&gt;</pre>
+<pre class="command-example">RunHealthCheck &lt;tablet alias&gt;</pre>

 #### Arguments

 * <code>&lt;tablet alias&gt;</code> &ndash; Required. A Tablet Alias uniquely identifies a vttablet. The argument value is in the format <code>&lt;cell name&gt;-&lt;uid&gt;</code>.
-* <code>&lt;target tablet type&gt;</code> &ndash; Required. The vttablet's role. Valid values are:
-
-* <code>backup</code> &ndash; A slaved copy of data that is offline to queries other than for backup purposes
-* <code>batch</code> &ndash; A slaved copy of data for OLAP load patterns (typically for MapReduce jobs)
-* <code>experimental</code> &ndash; A slaved copy of data that is ready but not serving query traffic. The value indicates a special characteristic of the tablet that indicates the tablet should not be considered a potential master. Vitess also does not worry about lag for experimental tablets when reparenting.
-* <code>master</code> &ndash; A primary copy of data
-* <code>rdonly</code> &ndash; A slaved copy of data for OLAP load patterns
-* <code>replica</code> &ndash; A slaved copy of data ready to be promoted to master
-* <code>restore</code> &ndash; A tablet that is restoring from a snapshot. Typically, this happens at tablet startup, then it goes to its right state.
-* <code>schema_apply</code> &ndash; A slaved copy of data that had been serving query traffic but that is now applying a schema change. Following the change, the tablet will revert to its serving type.
-* <code>snapshot_source</code> &ndash; A slaved copy of data where mysqld is <b>not</b> running and where Vitess is serving data files to clone slaves. Use this command to enter this mode: <pre>vtctl Snapshot -server-mode ...</pre> Use this command to exit this mode: <pre>vtctl SnapshotSourceEnd ...</pre>
-* <code>spare</code> &ndash; A slaved copy of data that is ready but not serving query traffic. The data could be a potential master tablet.
-* <code>worker</code> &ndash; A tablet that is in use by a vtworker process. The tablet is likely lagging in replication.
-
-
-

 #### Errors

-* The <code>&lt;tablet alias&gt;</code> and <code>&lt;target tablet type&gt;</code> arguments are required for the <code>&lt;RunHealthCheck&gt;</code> command. This error occurs if the command is not called with exactly 2 arguments.
+* The <code>&lt;tablet alias&gt;</code> argument is required for the <code>&lt;RunHealthCheck&gt;</code> command. This error occurs if the command is not called with exactly one argument.


 ### SetReadOnly
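
Reviewer note: after this change `RunHealthCheck` takes a single `<cell name>-<uid>` argument. The following is a minimal sketch of that argument check, not the vtctl implementation. The names `tabletAlias`, `parseTabletAlias` and `checkRunHealthCheckArgs` are hypothetical, and splitting at the last hyphen is an assumption made here for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tabletAlias is a hypothetical stand-in for the real topology type.
type tabletAlias struct {
	Cell string
	UID  uint32
}

// parseTabletAlias splits "<cell name>-<uid>" at the last hyphen
// (an assumption of this sketch) and parses the uid as a number.
func parseTabletAlias(s string) (tabletAlias, error) {
	i := strings.LastIndex(s, "-")
	if i <= 0 || i == len(s)-1 {
		return tabletAlias{}, fmt.Errorf("invalid tablet alias %q, expected <cell name>-<uid>", s)
	}
	uid, err := strconv.ParseUint(s[i+1:], 10, 32)
	if err != nil {
		return tabletAlias{}, fmt.Errorf("invalid uid in tablet alias %q: %v", s, err)
	}
	return tabletAlias{Cell: s[:i], UID: uint32(uid)}, nil
}

// checkRunHealthCheckArgs mirrors the documented rule: exactly one argument.
func checkRunHealthCheckArgs(args []string) (tabletAlias, error) {
	if len(args) != 1 {
		return tabletAlias{}, fmt.Errorf("the <tablet alias> argument is required for the RunHealthCheck command")
	}
	return parseTabletAlias(args[0])
}

func main() {
	alias, err := checkRunHealthCheckArgs([]string{"test-0000000100"})
	fmt.Println(alias.Cell, alias.UID, err)

	// Passing the old second argument is now rejected.
	_, err = checkRunHealthCheckArgs([]string{"test-0000000100", "replica"})
	fmt.Println(err)
}
```

Callers that still pass a target tablet type would hit the "exactly one argument" error described in the updated reference.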
4 changes: 2 additions & 2 deletions go/cmd/vtcombo/tablet_map.go
@@ -557,13 +557,13 @@ func (itmc *internalTabletManagerClient) RefreshState(ctx context.Context, table
 	})
 }

-func (itmc *internalTabletManagerClient) RunHealthCheck(ctx context.Context, tablet *topo.TabletInfo, targetTabletType topodatapb.TabletType) error {
+func (itmc *internalTabletManagerClient) RunHealthCheck(ctx context.Context, tablet *topo.TabletInfo) error {
 	t, ok := tabletMap[tablet.Tablet.Alias.Uid]
 	if !ok {
 		return fmt.Errorf("tmclient: cannot find tablet %v", tablet.Tablet.Alias.Uid)
 	}
 	return t.agent.RPCWrap(ctx, actionnode.TabletActionRunHealthCheck, nil, nil, func() error {
-		t.agent.RunHealthCheck(ctx, targetTabletType)
+		t.agent.RunHealthCheck(ctx)
 		return nil
 	})
 }
34 changes: 18 additions & 16 deletions go/cmd/vttablet/status.go
@@ -46,8 +46,11 @@ var (
 {{if .BlacklistedTables}}
 BlacklistedTables: {{range .BlacklistedTables}}{{.}} {{end}}<br>
 {{end}}
-{{if .DisableQueryService}}
-Query Service disabled by TabletControl<br>
+{{if .DisallowQueryService}}
+Query Service disabled: {{.DisallowQueryService}}<br>
+{{end}}
+{{if .DisableUpdateStream}}
+Update Stream disabled<br>
 {{end}}
 </td>
 <td width="25%" border="">
@@ -173,22 +176,21 @@ var onStatusRegistered func()
 func addStatusParts(qsc tabletserver.Controller) {
 	servenv.AddStatusPart("Tablet", tabletTemplate, func() interface{} {
 		return map[string]interface{}{
-			"Tablet":              topo.NewTabletInfo(agent.Tablet(), -1),
-			"BlacklistedTables":   agent.BlacklistedTables(),
-			"DisableQueryService": agent.DisableQueryService(),
+			"Tablet":               topo.NewTabletInfo(agent.Tablet(), -1),
+			"BlacklistedTables":    agent.BlacklistedTables(),
+			"DisallowQueryService": agent.DisallowQueryService(),
+			"DisableUpdateStream":  !agent.EnableUpdateStream(),
+		}
+	})
+	servenv.AddStatusFuncs(template.FuncMap{
+		"github_com_youtube_vitess_health_html_name": healthHTMLName,
+	})
+	servenv.AddStatusPart("Health", healthTemplate, func() interface{} {
+		return &healthStatus{
+			Records: agent.History.Records(),
+			Config:  tabletmanager.ConfigHTML(),
 		}
 	})
-	if agent.IsRunningHealthCheck() {
-		servenv.AddStatusFuncs(template.FuncMap{
-			"github_com_youtube_vitess_health_html_name": healthHTMLName,
-		})
-		servenv.AddStatusPart("Health", healthTemplate, func() interface{} {
-			return &healthStatus{
-				Records: agent.History.Records(),
-				Config:  tabletmanager.ConfigHTML(),
-			}
-		})
-	}
 	qsc.AddStatusPart()
 	servenv.AddStatusPart("Binlog Player", binlogTemplate, func() interface{} {
 		return agent.BinlogPlayerMap.Status()
59 changes: 35 additions & 24 deletions go/vt/health/health.go
@@ -1,20 +1,25 @@
 package health

 import (
+	"errors"
 	"fmt"
 	"html/template"
 	"sort"
 	"strings"
 	"sync"
 	"time"
-
-	"github.com/youtube/vitess/go/vt/concurrency"
 )

 var (
 	// DefaultAggregator is the global aggregator to use for real
 	// programs. Use a custom one for tests.
 	DefaultAggregator *Aggregator
+
+	// ErrSlaveNotRunning is returned by health plugins when replication
+	// is not running and we can't figure out the replication delay.
+	// Note everything else should be operational, and the underlying
+	// MySQL instance should be capable of answering queries.
+	ErrSlaveNotRunning = errors.New("slave is not running")
 )

 func init() {
@@ -64,47 +69,53 @@ func NewAggregator() *Aggregator {
 	}
 }

+type singleResult struct {
+	name  string
+	delay time.Duration
+	err   error
+}
+
 // Report aggregates health statuses from all the reporters. If any
 // errors occur during the reporting, they will be logged, but only
 // the first error will be returned.
 // The returned replication delay will be the highest of all the replication
 // delays returned by the Reporter implementations (although typically
 // only one implementation will actually return a meaningful one).
 func (ag *Aggregator) Report(isSlaveType, shouldQueryServiceBeRunning bool) (time.Duration, error) {
-	var (
-		wg  sync.WaitGroup
-		rec concurrency.AllErrorRecorder
-	)
-
-	results := make(chan time.Duration, len(ag.reporters))
+	wg := sync.WaitGroup{}
+	results := make([]singleResult, len(ag.reporters))
+	index := 0
 	ag.mu.Lock()
 	for name, rep := range ag.reporters {
 		wg.Add(1)
-		go func(name string, rep Reporter) {
+		go func(index int, name string, rep Reporter) {
 			defer wg.Done()
-			replicationDelay, err := rep.Report(isSlaveType, shouldQueryServiceBeRunning)
-			if err != nil {
-				rec.RecordError(fmt.Errorf("%v: %v", name, err))
-				return
-			}
-			results <- replicationDelay
-		}(name, rep)
+			results[index].name = name
+			results[index].delay, results[index].err = rep.Report(isSlaveType, shouldQueryServiceBeRunning)
+		}(index, name, rep)
+		index++
 	}
 	ag.mu.Unlock()
 	wg.Wait()
-	close(results)
-	if err := rec.Error(); err != nil {
-		return 0, err
-	}

 	// merge and return the results
 	var result time.Duration
-	for replicationDelay := range results {
-		if replicationDelay > result {
-			result = replicationDelay
+	var err error
+	for _, s := range results {
+		switch s.err {
+		case ErrSlaveNotRunning:
+			// Return the ErrSlaveNotRunning sentinel
+			// value, only if there are no other errors.
+			err = ErrSlaveNotRunning
+		case nil:
+			if s.delay > result {
+				result = s.delay
+			}
+		default:
+			return 0, fmt.Errorf("%v: %v", s.name, s.err)
 		}
 	}
-	return result, nil
+	return result, err
 }

 // Register registers rep with ag. Only keys specified in keys will be
21 changes: 16 additions & 5 deletions go/vt/health/health_test.go
@@ -8,32 +8,43 @@ import (

 func TestReporters(t *testing.T) {

+	// two aggregators returning valid numbers
 	ag := NewAggregator()
-
 	ag.Register("a", FunctionReporter(func(bool, bool) (time.Duration, error) {
 		return 10 * time.Second, nil
 	}))
-
 	ag.Register("b", FunctionReporter(func(bool, bool) (time.Duration, error) {
 		return 5 * time.Second, nil
 	}))
-
 	delay, err := ag.Report(true, true)
-
 	if err != nil {
 		t.Error(err)
 	}
 	if delay != 10*time.Second {
 		t.Errorf("delay=%v, want 10s", delay)
 	}

+	// three aggregators, third one returning an error
+	cReturns := errors.New("e error")
 	ag.Register("c", FunctionReporter(func(bool, bool) (time.Duration, error) {
-		return 0, errors.New("e error")
+		return 0, cReturns
 	}))
 	if _, err := ag.Report(true, false); err == nil {
 		t.Errorf("ag.Run: expected error")
+	} else {
+		want := "c: e error"
+		if got := err.Error(); got != want {
+			t.Errorf("got wrong error: got '%v' expected '%v'", got, want)
+		}
 	}

+	// three aggregators, third one returning ErrSlaveNotRunning
+	cReturns = ErrSlaveNotRunning
+	if _, err := ag.Report(true, false); err != ErrSlaveNotRunning {
+		t.Errorf("ag.Run: expected error: %v", err)
+	}
+
+	// check name is good
 	name := ag.HTMLName()
 	if string(name) != "FunctionReporter&nbsp; + &nbsp;FunctionReporter&nbsp; + &nbsp;FunctionReporter" {
 		t.Errorf("ag.HTMLName() returned: %v", name)
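
Reviewer note: the test reuses reporter "c" for two scenarios by reassigning `cReturns` after the closure has been registered. This works because a Go closure captures the variable itself, not a copy of its value. A minimal sketch of that behavior, where `reporter` and `observe` are hypothetical names used only here:

```go
package main

import (
	"errors"
	"fmt"
)

// reporter is a simplified, hypothetical stand-in for health.Reporter.
type reporter func() error

// observe registers a closure once, then changes the captured variable
// and calls the same closure again.
func observe() []string {
	cReturns := errors.New("e error")
	rep := reporter(func() error { return cReturns })

	seen := []string{rep().Error()}

	// Reassigning the variable changes what the existing closure returns.
	cReturns = errors.New("slave is not running")
	seen = append(seen, rep().Error())
	return seen
}

func main() {
	fmt.Println(observe())
}
```

The same trick keeps the reporter count at three, which is what the `HTMLName` assertion at the end of the test expects.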
37 changes: 31 additions & 6 deletions go/vt/mysqlctl/health.go
@@ -1,7 +1,6 @@
 package mysqlctl

 import (
-	"fmt"
 	"html/template"
 	"time"

@@ -10,7 +9,14 @@ import (

 // mysqlReplicationLag implements health.Reporter
 type mysqlReplicationLag struct {
-	mysqld *Mysqld
+	// set at construction time
+	mysqld MysqlDaemon
+	now    func() time.Time
+
+	// store the last time we successfully got the lag, so if we
+	// can't get the lag any more, we can extrapolate.
+	lastKnownValue time.Duration
+	lastKnownTime  time.Time
 }

 // Report is part of the health.Reporter interface
@@ -21,12 +27,28 @@ func (mrl *mysqlReplicationLag) Report(isSlaveType, shouldQueryServiceBeRunning

 	slaveStatus, err := mrl.mysqld.SlaveStatus()
 	if err != nil {
+		// mysqld is not running. We can't report healthy.
 		return 0, err
 	}
 	if !slaveStatus.SlaveRunning() {
-		return 0, fmt.Errorf("Replication is not running")
+		// mysqld is running, but slave is not replicating (most likely,
+		// replication has been stopped). See if we can extrapolate.
+		if mrl.lastKnownTime.IsZero() {
+			// we can't.
+			return 0, health.ErrSlaveNotRunning
+		}
+
+		// we can extrapolate with the worst possible
+		// value (that is we made no replication
+		// progress since last time, and just fell more behind).
+		elapsed := mrl.now().Sub(mrl.lastKnownTime)
+		return elapsed + mrl.lastKnownValue, nil
 	}
-	return time.Duration(slaveStatus.SecondsBehindMaster) * time.Second, nil
+
+	// we got a real value, save it.
+	mrl.lastKnownValue = time.Duration(slaveStatus.SecondsBehindMaster) * time.Second
+	mrl.lastKnownTime = mrl.now()
+	return mrl.lastKnownValue, nil
 }

 // HTMLName is part of the health.Reporter interface
@@ -36,6 +58,9 @@ func (mrl *mysqlReplicationLag) HTMLName() template.HTML {

 // MySQLReplicationLag lag returns a reporter that reports the MySQL
 // replication lag.
-func MySQLReplicationLag(mysqld *Mysqld) health.Reporter {
-	return &mysqlReplicationLag{mysqld}
+func MySQLReplicationLag(mysqld MysqlDaemon) health.Reporter {
+	return &mysqlReplicationLag{
+		mysqld: mysqld,
+		now:    time.Now,
+	}
 }