
Duplicate information of cluster #9

Closed
zhangxin511 opened this issue May 22, 2019 · 11 comments

Comments

@zhangxin511

zhangxin511 commented May 22, 2019

The influxcluster section in rwha-sample.influxdb-srelay.conf seems to duplicate information with query-router-endpoint-api?

With SyncFlux, we can run an HA cluster with 1 master and 1 slave.

My question is:

  1. Do we still need to define the cluster information again in srelay?
  2. In the current implementation of SyncFlux, the HA cluster can have two DB backends, one master and one slave. Can it have more slaves?
  3. I don't quite understand the current example rwha-sample.influxdb-srelay.conf. Say I have two DB backends running, one on localhost:8086 and one on localhost:8087. We can define two [[influxdb]] sections, but the example shows we can also define two endpoints in query-router-endpoint-api, which is the cluster endpoint of SyncFlux. My understanding is that we can only have one SyncFlux cluster built from the existing two DBs.
  4. I also don't quite understand how the hamonitor of SyncFlux works. I tested writing data while one node was down (either master or slave), but I don't see the data being synced.
@toni-moreno
Owner

Hi @zhangxin511, you should keep in mind the proposed architecture layout.

[Diagram: HA architecture with syncflux]

As you can see, there are:

  • 2 influxdb-srelay instances
  • 2 syncflux instances

working on both nodes.

In this layout, the DB backend names in influxdb-srelay.conf and the influxdb names in syncflux.conf should be the same.

Suppose the rwha-sample.influxdb-srelay.conf:

influxdb-srelay.conf ( in myinfluxdb01_server )

...
...
[[influxdb]]
  name = "myinfluxdb01"
  location = "http://myinfluxdb01_server:8086/"
  timeout = "10s"


[[influxdb]]
  name = "myinfluxdb02"
  location = "http://myinfluxdb02_server:8086/"
  timeout = "10s"

[[influxcluster]]
  # name = cluster id for route configs and logs
  name  = "ha_cluster"
  # members = array of influxdb backends
  members = ["myinfluxdb01","myinfluxdb02"]
  log-file = "ha_cluster.log"
  log-level = "info"
  type = "HA"
  query-router-endpoint-api = ["http://myinfluxdb01_server:4090/api/queryactive","http://myinfluxdb02_server:4090/api/queryactive"]
..
...

influxdb-srelay.conf ( in myinfluxdb02_server )

...
...
[[influxdb]]
  name = "myinfluxdb01"
  location = "http://myinfluxdb01_server:8086/"
  timeout = "10s"


[[influxdb]]
  name = "myinfluxdb02"
  location = "http://myinfluxdb02_server:8086/"
  timeout = "10s"

[[influxcluster]]
  # name = cluster id for route configs and logs
  name  = "ha_cluster"
  # members = array of influxdb backends
  members = ["myinfluxdb02","myinfluxdb01"] 
  log-file = "ha_cluster.log"
  log-level = "info"
  type = "HA"
  query-router-endpoint-api = ["http://myinfluxdb02_server:4090/api/queryactive","http://myinfluxdb01_server:4090/api/queryactive"]
...
...

The only changes are the order of members and query-router-endpoint-api, so that each node queries its own syncflux first.
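As a rough illustration of that ordering (a hypothetical Python sketch; the actual srelay router is written in Go, and `pick_active`/`fetch` are made-up names): the relay asks each queryactive endpoint in turn and uses the first one that answers, so listing the local syncflux first makes it the preferred source.

```python
def pick_active(endpoints, fetch):
    """Return the answer of the first reachable queryactive endpoint.

    endpoints -- syncflux /api/queryactive URLs, local node listed first
    fetch     -- callable(url) -> active backend name; raises OSError if down
    """
    for url in endpoints:
        try:
            return fetch(url)
        except OSError:
            continue  # this syncflux is unreachable, try the next one
    return None

# Ordering used on myinfluxdb01_server: its own syncflux comes first.
endpoints = [
    "http://myinfluxdb01_server:4090/api/queryactive",
    "http://myinfluxdb02_server:4090/api/queryactive",
]
```

If the local syncflux is down, the relay simply falls through to the peer's endpoint, which is why the two config files only differ in ordering.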

syncflux.conf (on myinfluxdb01_server )

 master-db = "myinfluxdb01"
 slave-db = "myinfluxdb02"

[[influxdb]]
 release = "1x"
 name = "myinfluxdb01"
 location = "http://myinfluxdb01_server:8086/"
 admin-user = "admin"
 admin-passwd = "admin"
 timeout = "10s"

[[influxdb]]
 release = "1x"
 name = "myinfluxdb02"
 location = "http://myinfluxdb02_server:8086/"
 admin-user = "admin"
 admin-passwd = "admin"
 timeout = "10s"

syncflux.conf (on myinfluxdb02_server )

Only the master and slave values are swapped:

 master-db = "myinfluxdb02"
 slave-db = "myinfluxdb01"

[[influxdb]]
 release = "1x"
 name = "myinfluxdb01"
 location = "http://myinfluxdb01_server:8086/"
 admin-user = "admin"
 admin-passwd = "admin"
 timeout = "10s"

[[influxdb]]
 release = "1x"
 name = "myinfluxdb02"
 location = "http://myinfluxdb02_server:8086/"
 admin-user = "admin"
 admin-passwd = "admin"
 timeout = "10s"

About your questions:

  1. As in the previous example (let me know if you have more doubts on that issue).
  2. Right now there is no way to have more than one slave (this is also a very young project), but if needed it won't be difficult to add this feature.
  3. and 4. Configure as in the layout and example above. If you have more questions about configuration or possible errors, please open a specific issue in the https://github.com/toni-moreno/syncflux issue tracker.

I hope you can see how smart-relay and syncflux work together to build a better HA solution when we cannot run an InfluxDB Enterprise cluster.

Any other question?

@zhangxin511
Author

Thanks for your detailed info. I tried your approach; since I am not sure where the HA load balancing comes from, I used only one srelay instance but kept the rest as you suggested, and I still can't get data to sync when one node is down.
This is what I have done:

  1. docker-compose up
  2. curl -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
  3. curl -XPOST http://localhost:8087/query --data-urlencode "q=CREATE DATABASE mydb" (I haven't tried turning on admin auth on InfluxDB to use your admin endpoint)
  4. Baseline: curl -i -XPOST "http://127.0.0.1:9096/write?db=mydb" --data-binary "cpu_load_short,host=server01,region=us-west value=0.64 1434055561000000000"; both database backends got the data 2015-06-11T20:46:01Z server01 us-west 0.64
  5. Stop influx-a (running on 8086): docker-compose stop influx-a
  6. Try inserting data while a is down: curl -i -XPOST "http://127.0.0.1:9096/write?db=mydb" --data-binary "cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000", which is 2015-06-11T20:46:02Z server01 us-west 0.64
  7. Start influx-a again: docker-compose start influx-a
  8. Wait, then check the databases: 2015-06-11T20:46:02Z server01 us-west 0.64 never synced back to a, while b has both data entries.

Here is my setup:

docker-compose.yml

version: '3.7'
services:
  influx-a:
    image: influxdb:1.7
    ports:
      - 8086:8086
    volumes:
      - C:/Docker/InfluxHA/Influxdb/a:/var/lib/influxdb
  influx-b:
    image: influxdb:1.7
    ports: 
      - 8087:8086
    volumes:
      - C:/Docker/InfluxHA/Influxdb/b:/var/lib/influxdb
  influx-relay:
    image: tonimoreno/influxdb-srelay:latest
    ports:
      - 9096:9096
    links:
      - influx-a
      - influx-b
      - sync-flux-a
      - sync-flux-b
    volumes:
      - C:/Docker/InfluxHA/Influx-srelay/conf/influxdb-srelay.conf:/etc/influxdb-srelay/influxdb-srelay.conf
      - C:/Docker/InfluxHA/Influx-srelay/log/:/var/log/
  sync-flux-a:
    image: tonimoreno/syncflux
    ports:
      - 4090:4090
    links:
      - influx-a
      - influx-b
    volumes:
      - C:/Docker/InfluxHA/Sync-flux/a/conf/:/opt/syncflux/conf/
      - C:/Docker/InfluxHA/Sync-flux/a/log/:/opt/syncflux/log/
  sync-flux-b:
    image: tonimoreno/syncflux
    ports:
      - 4091:4090
    links:
      - influx-a
      - influx-b
    volumes:
      - C:/Docker/InfluxHA/Sync-flux/b/conf/:/opt/syncflux/conf/
      - C:/Docker/InfluxHA/Sync-flux/b/log/:/opt/syncflux/log/  

The configuration files and folder structure are attached. I am sorry to bother you like this, but could you take a look and let me know what went wrong? Much appreciated!
InfluxHA.zip

@toni-moreno
Owner

Hi @zhangxin511 I will check your config ASAP

@toni-moreno
Owner

Hi @zhangxin511, the first thing I've detected is in your syncflux.toml config.

The DB names should be the same in both engines' configs (srelay and syncflux):

influxdb-srelay.conf

  [[influxdb]]
    name = "myinfluxdb01"
    location = "http://influx-a:8086/"
    timeout = "10s"
  
  [[influxdb]]
    name = "myinfluxdb02"
    location = "http://influx-b:8086/"
    timeout = "10s"

syncflux-a.toml

master-db = "myinfluxdb01"
slave-db = "myinfluxdb02"

[[influxdb]]
 release = "1x"
 name = "myinfluxdb01"
 location = "http://influx-a:8086/"
 admin-user = "admin"
 admin-passwd = "admin"
 timeout = "10s"

[[influxdb]]
 release = "1x"
 name = "myinfluxdb02"
 location = "http://influx-b:8086/"
 admin-user = "admin"
 admin-passwd = "admin"
 timeout = "10s"

syncflux-b.toml

master-db = "myinfluxdb02"
slave-db = "myinfluxdb01"

[[influxdb]]
 release = "1x"
 name = "myinfluxdb01"
 location = "http://influx-a:8086/"
 admin-user = "admin"
 admin-passwd = "admin"
 timeout = "10s"

[[influxdb]]
 release = "1x"
 name = "myinfluxdb02"
 location = "http://influx-b:8086/"
 admin-user = "admin"
 admin-passwd = "admin"
 timeout = "10s"

Could you fix these config files and test again please?

@zhangxin511
Author

@toni-moreno I was not able to make recovery work using your suggestions alone. But after changing the docker-compose file from tonimoreno/syncflux to tonimoreno/syncflux:latest, the data got synced! It looks like I was using an old version of syncflux which might not have the recovery feature (I was able to see part of the recovery logs, but not all of them). Anyway, it is finally working for me now. I will do more performance testing and keep you posted.

I understand this is still in the early development phase, but a suggestion based on my issue: srelay and syncflux seem tightly coupled, so maybe consider merging the two repos into one, or relaxing the naming restriction?

@zhangxin511
Author

Sorry, I spoke too soon; it looks like data recovery is not ALWAYS working in my case. I have only gotten one good replication so far, and all the others failed.

I do see in the log that syncflux was trying to recover data. When the recovery does not recover any data, the log shows:

time="2019-05-29 18:45:42" level=info msg="HACluster check...."
time="2019-05-29 18:45:42" level=info msg="HACLuster: detected UP Last(2019-05-29 18:45:32.1542501 +0000 UTC m=+540.024756701) Duratio OK (9.9980293s) RECOVERING"
time="2019-05-29 18:45:42" level=info msg="HACLUSTER: INIT RECOVERY : FROM [ 2019-05-29 18:44:32.1545497 +0000 UTC m=+480.024959901 ] TO [ 2019-05-29 18:45:32.1542501 +0000 UTC m=+540.024756701 ]"
time="2019-05-29 18:45:42" level=info msg="Replicating Data from DB mydb RP autogen..."
time="2019-05-29 18:45:42" level=debug msg="SYNC-DB-RP[mydb|&{autogen 0s 168h0m0s %!s(int64=1) %!s(bool=true) map[]}] From:2019-05-29 18:44:32.1545497 +0000 UTC m=+480.024959901 To:2019-05-29 18:45:32.1542501 +0000 UTC m=+540.024756701 | Duration: 59.9997968s || #chunks: 1  | chunk Duration 1h0m0s "
time="2019-05-29 18:45:42" level=info msg="InfluxMonitor: InfluxDB : myinfluxdb01  OK (Version  1.7.6 : Duration 1.454ms )"
time="2019-05-29 18:45:42" level=info msg="Processed Chunk [1/1](100%) from [1559151932][2019-05-29 17:45:32 +0000 UTC] to [1559155532][2019-05-29 18:45:32 +0000 UTC] (0) Points Took [9.7µs]"
time="2019-05-29 18:45:42" level=info msg="Processed DB data from myinfluxdb02[mydb|autogen] to myinfluxdb01[mydb|autogen] has done  #Points (0)  Took [798.4µs] !\n"
time="2019-05-29 18:45:42" level=info msg="HACLUSTER: DATA SYNCRONIZATION Took 1.9629ms"

Only once was there a good recovery, which gave this output:

time="2019-05-29 18:23:34" level=info msg="HACluster check...."
time="2019-05-29 18:23:34" level=info msg="HACLuster: detected UP Last(2019-05-29 18:23:24.5806122 +0000 UTC m=+280.038978101) Duratio OK (9.9964807s) RECOVERING"
time="2019-05-29 18:23:34" level=info msg="HACLUSTER: INIT RECOVERY : FROM [ 2019-05-29 18:22:34.5799461 +0000 UTC m=+230.038399201 ] TO [ 2019-05-29 18:23:24.5806122 +0000 UTC m=+280.038978101 ]"
time="2019-05-29 18:23:34" level=info msg="Replicating Data from DB mydb RP autogen..."
time="2019-05-29 18:23:34" level=debug msg="SYNC-DB-RP[mydb|&{autogen 0s 168h0m0s %!s(int64=1) %!s(bool=true) map[cpu_load_short:%!s(*agent.MeasurementSch=&{cpu_load_short map[value:0xc000276600]})]}] From:2019-05-29 18:22:34.5799461 +0000 UTC m=+230.038399201 To:2019-05-29 18:23:24.5806122 +0000 UTC m=+280.038978101 | Duration: 50.0005789s || #chunks: 1  | chunk Duration 1h0m0s "
time="2019-05-29 18:23:34" level=debug msg="processing Database mydb Measurement cpu_load_short from 1559150604 to 1559154204"
time="2019-05-29 18:23:34" level=info msg="InfluxMonitor: InfluxDB : myinfluxdb01  OK (Version  1.7.6 : Duration 1.6828ms )"
time="2019-05-29 18:23:34" level=info msg="InfluxMonitor: InfluxDB : myinfluxdb02  OK (Version  1.7.6 : Duration 2.414ms )"
time="2019-05-29 18:23:34" level=debug msg="Query [select * from  \"cpu_load_short\" where time  > 1559150604s and time < 1559154204s group by *] took 1.5465ms "
time="2019-05-29 18:23:34" level=debug msg="processed 4 points"
time="2019-05-29 18:23:34" level=debug msg="Write attempt [1] took 3.3763ms "
time="2019-05-29 18:23:34" level=info msg="Processed Chunk [1/1](100%) from [1559150604][2019-05-29 17:23:24 +0000 UTC] to [1559154204][2019-05-29 18:23:24 +0000 UTC] (4) Points Took [6.3572ms]"
time="2019-05-29 18:23:34" level=info msg="Processed DB data from myinfluxdb02[mydb|autogen] to myinfluxdb01[mydb|autogen] has done  #Points (4)  Took [6.6578ms] !\n"
time="2019-05-29 18:23:34" level=info msg="HACLUSTER: DATA SYNCRONIZATION Took 8.6714ms"

It looks like this block of code is not always executed, and I have no idea why: https://github.com/toni-moreno/syncflux/blob/6627a8281cd93305f9315b6b6be325f4cdbd0dbb/pkg/agent/client.go#L594-L615

@toni-moreno
Owner

Hi @zhangxin511 my workmate @sbengo will review your case ASAP

@zhangxin511
Author

Thank you @toni-moreno for your continuous help! Let me know if you need anything else @sbengo

@zhangxin511
Author

@sbengo @toni-moreno I figured out partially why my data was not recovered:

  1. Whenever the log line processing Database mydb Measurement cpu_load_short appears, it means the recovery is in a "good" state for me. But even in the good state, there are two issues with the syncflux logic getvalues := fmt.Sprintf("select * from \"%v\" where time > %vs and time < %vs group by *", m, startsec, endsec) (https://github.com/toni-moreno/syncflux/blob/6627a8281cd93305f9315b6b6be325f4cdbd0dbb/pkg/agent/client.go#L602):
  • If I insert data without specifying a time, InfluxDB uses its current UTC time as the point's timestamp. The logic above finds the data missing during the downtime window and backfills it once the node is alive again. However, because of clock differences between the InfluxDB servers and the moment syncflux detects the cluster went down, it may add duplicate records. For example, the value 2 was inserted just before the node went down, and after the node recovered, syncflux thought it should add this value again, so it got duplicated.
  • For us, most data is inserted with an earlier timestamp because of our delayed/scheduled/batch jobs. Say a node goes down at time 8, at time 10 we insert an entry whose timestamp is time 3, and the broken node recovers at time 11. That entry will never be synced due to the above logic.
  2. I still don't know why the processing Database mydb Measurement cpu_load_short log sometimes never appears, putting the recovery in a "bad" state.
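The late-arrival problem above can be illustrated with a tiny simulation (a hypothetical point model in Python, not syncflux code): recovery filters by each point's event timestamp, not by when the point was written, so a point written during the outage but stamped with an older time falls outside the window.

```python
# Each point has an event timestamp (the "time" field the recovery query
# filters on) and a wall-clock write time (when the batch job sent it).
points = [
    {"event": 3, "written": 10},  # backfilled by a batch job during the outage
    {"event": 9, "written": 9},   # normal real-time point during the outage
]

down_from, down_to = 8, 11  # node outage window

# Recovery as described: select by event time inside the outage window only.
recovered = [p for p in points if down_from < p["event"] < down_to]

# The event=3 point was written during the outage (written=10) but its
# event timestamp is outside the window, so it is silently skipped.
```

Filtering by write time (or re-scanning a wider window) would catch such points, at the cost of re-reading more data.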

@sbengo
Collaborator

sbengo commented Jun 3, 2019

Hi @zhangxin511 , thanks for the info and sorry for the late response!

When Syncflux starts, it gets info about the available databases, RPs and the measurements attached to them (a.k.a. the schema), and it currently never refreshes it (only on init).

As I can see in your logs, in the failing case the schema seems to be empty, so it won't iterate over the measurements (in the linked function).

Bad case:

...
time="2019-05-29 18:45:42" level=debug msg="SYNC-DB-RP[mydb|&{autogen 0s 168h0m0s %!s(int64=1) %!s(bool=true) map[]}] From:2019-05-29 18:44:32.1545497 +0000 UTC m=+480.024959901 To:2019-05-29 18:45:32.1542501 +0000 UTC m=+540.024756701 | Duration: 59.9997968s || #chunks: 1  | chunk Duration 1h0m0s "
...

Working case:

time="2019-05-29 18:23:34" level=debug msg="SYNC-DB-RP[mydb|&{autogen 0s 168h0m0s %!s(int64=1) %!s(bool=true) map[cpu_load_short:%!s(*agent.MeasurementSch=&{cpu_load_short map[value:0xc000276600]})]}] From:2019-05-29 18:22:34.5799461 +0000 UTC m=+230.038399201 To:2019-05-29 18:23:24.5806122 +0000 UTC m=+280.038978101 | Duration: 50.0005789s || #chunks: 1  | chunk Duration 1h0m0s "

Review

I think it's related to schema creation (if there was no data, the schema would be empty: only the db and rp were stored). So:

  • Was there data in your DB when you brought up the srelay + syncflux stack?

@toni-moreno opened an issue (I think it was before your comment!) asking for a schema reload: toni-moreno/syncflux#16. We have discussed it and we think we will add this feature in the next few days, so the schema will always be reloaded before the data sync process.
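The proposed change can be sketched as follows (a hypothetical Python sketch; syncflux itself is Go, and `HACluster`/`fetch_schema` here are made-up names): instead of reading the schema once at init, refresh it right before every recovery pass so databases and measurements created after startup are picked up.

```python
class HACluster:
    def __init__(self, fetch_schema):
        # fetch_schema: callable returning {db: {rp: [measurement, ...]}}
        self.fetch_schema = fetch_schema
        self.schema = fetch_schema()  # current behaviour: loaded once at init

    def recover(self):
        # Proposed behaviour: reload the schema before each sync, so
        # anything created after startup is included in the replication.
        self.schema = self.fetch_schema()
        synced = []
        for db, rps in self.schema.items():
            for rp, measurements in rps.items():
                for m in measurements:
                    synced.append((db, rp, m))  # replicate this measurement
        return synced
```

With the old behaviour, a database created after init would leave `self.schema` empty forever, which matches the empty `map[]` seen in the "bad case" log line.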


About the timing issues/feature, we will keep discussing it, but we currently don't support those cases.

Thanks,
Regards!

@zhangxin511
Author

@sbengo Thank you for your detailed response.
Yes, I created the DB AFTER starting syncflux, so syncflux didn't know about the DB. The change you mentioned makes sense.

It would be great to backfill data based on when the data was inserted instead of purely on its timestamp, because a lot of InfluxDB data is inserted by scheduled jobs rather than in real time.

Lastly, I think syncflux takes a while to start; there is a noticeable delay. I hope you can take a look at it.

With that being said, I have a full srelay setup working as you specified, so I will close this issue now. I appreciate all your help @toni-moreno and @sbengo
