Clickhouse Cluster Support #29

Closed
poundifdef opened this issue Oct 13, 2023 · 3 comments

Comments

@poundifdef
Contributor

poundifdef commented Oct 13, 2023

We want to be able to support cluster operations for creating tables.

  1. When you create a database or table in ClickHouse, you have to add an "ON CLUSTER" clause to make sure the statement is executed on every database server in the cluster. You will need to run clickhouse-keeper as part of your Docker setup. ClickHouse Keeper is ClickHouse's ZooKeeper-compatible coordination service, which allows the different DB servers to coordinate.

  2. When we run a CREATE DATABASE, CREATE TABLE, or ALTER TABLE command in ClickHouse, we should be able to specify an "ON CLUSTER" clause. We should add a new configuration param on the User object, GetCluster() string, and use its value as the cluster name. If that user does not have a cluster assigned (that is, the cluster is blank), we should omit the ON CLUSTER clause from the query.
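To make point 2 concrete, here is a minimal sketch of how the clause could be built. Only `GetCluster() string` comes from this issue; the `User` interface shape, `exampleUser`, `onClusterClause`, `createDatabaseSQL`, and the identifier check are hypothetical names and choices made for illustration, not the project's actual code.

```go
package main

import (
	"fmt"
	"regexp"
)

// User models only the accessor proposed in this issue. The real User
// object in the project will have more methods (assumption).
type User interface {
	GetCluster() string
}

// exampleUser is a hypothetical implementation used for the demo below.
type exampleUser struct{ cluster string }

func (u exampleUser) GetCluster() string { return u.cluster }

// identRe is a deliberately conservative check so that names can be
// interpolated into SQL without quoting. This restriction is an assumption.
var identRe = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// onClusterClause returns " ON CLUSTER <name>" for users with a cluster,
// and an empty string when the cluster is blank.
func onClusterClause(u User) (string, error) {
	cluster := u.GetCluster()
	if cluster == "" {
		return "", nil
	}
	if !identRe.MatchString(cluster) {
		return "", fmt.Errorf("invalid cluster name %q", cluster)
	}
	return " ON CLUSTER " + cluster, nil
}

// createDatabaseSQL builds a CREATE DATABASE statement for the given user.
func createDatabaseSQL(u User, database string) (string, error) {
	if !identRe.MatchString(database) {
		return "", fmt.Errorf("invalid database name %q", database)
	}
	clause, err := onClusterClause(u)
	if err != nil {
		return "", err
	}
	return "CREATE DATABASE IF NOT EXISTS " + database + clause, nil
}

func main() {
	for _, u := range []User{exampleUser{cluster: "A"}, exampleUser{cluster: ""}} {
		sql, err := createDatabaseSQL(u, "scratch")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(sql)
	}
}
```

The same `onClusterClause` helper could be reused for CREATE TABLE and ALTER TABLE, where the clause goes right after the table name.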

@amrullahfarook

amrullahfarook commented Oct 14, 2023

Hey there @poundifdef! I came across Scratch DB and was interested in contributing, can I take up this issue?

@poundifdef
Contributor Author

Sure.

@poundifdef
Contributor Author

poundifdef commented Oct 30, 2023

I think there need to be two data structures: one for users and one for servers.

The users data structure looks like this:

{
  "users": [
    {"api_key": "A","cluster": "A"},
    {"api_key": "B","cluster": "B"}
   ]
}

The servers structure would look like this:

{
  "servers": {
    "1.1.1.1": [
      {"cluster": "A", "shard": "A1"},
      {"cluster": "B", "shard": "B1"}
    ],
    "2.2.2.2": [
      {"cluster": "A", "shard": "A2"},
      {"cluster": "B", "shard": "B1"}
    ]
  }
}

Now each user is tied to their cluster through the users.cluster field.
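As a rough sketch of that lookup, the users structure can be decoded with `encoding/json` and searched by API key. The type and function names (`UserEntry`, `UsersConfig`, `parseUsers`, `clusterForAPIKey`) are hypothetical; only the JSON field names come from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UserEntry mirrors one element of the "users" array shown above.
type UserEntry struct {
	APIKey  string `json:"api_key"`
	Cluster string `json:"cluster"`
}

// UsersConfig mirrors the top-level users document.
type UsersConfig struct {
	Users []UserEntry `json:"users"`
}

// usersJSON is the example from this issue, inlined for the demo.
const usersJSON = `{
  "users": [
    {"api_key": "A", "cluster": "A"},
    {"api_key": "B", "cluster": "B"}
  ]
}`

// parseUsers decodes the users document.
func parseUsers(data []byte) (UsersConfig, error) {
	var cfg UsersConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

// clusterForAPIKey returns the cluster assigned to an API key.
// The boolean is false when the key is unknown.
func clusterForAPIKey(cfg UsersConfig, apiKey string) (string, bool) {
	for _, u := range cfg.Users {
		if u.APIKey == apiKey {
			return u.Cluster, true
		}
	}
	return "", false
}

func main() {
	cfg, err := parseUsers([]byte(usersJSON))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	cluster, ok := clusterForAPIKey(cfg, "B")
	fmt.Println(cluster, ok)
}
```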

The servers structure should translate directly to a ClickHouse cluster XML config. For example, the above would translate to:

<clickhouse>
    <remote_servers>
        <A>
            <shard>
                <!-- shard A1 -->
                <replica>
                    <host>1.1.1.1</host>
                </replica>
            </shard>
            <shard>
                <!-- shard A2 -->
                <replica>
                    <host>2.2.2.2</host>
                </replica>
            </shard>
        </A>
        <B>
            <shard>
                <!-- shard B1, 2 replicas -->
                <replica>
                    <host>1.1.1.1</host>
                </replica>
                <replica>
                    <host>2.2.2.2</host>
                </replica>
            </shard>
        </B>
    </remote_servers>
</clickhouse>
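
One possible way to do this translation is to first regroup the servers structure by cluster and shard, then print the XML. The sketch below assumes cluster names are valid XML element names and sorts everything for stable output; `Placement`, `Servers`, `buildLayout`, and `remoteServersXML` are hypothetical names, and the output only includes the `<host>` element shown in the example above.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"sort"
	"strings"
)

// Placement is one {"cluster": ..., "shard": ...} entry for a host.
type Placement struct {
	Cluster string
	Shard   string
}

// Servers maps a host address to the shards it serves.
type Servers map[string][]Placement

// buildLayout regroups the data as cluster -> shard -> sorted hosts.
// Hosts that share a cluster and shard become replicas of that shard.
func buildLayout(servers Servers) map[string]map[string][]string {
	layout := map[string]map[string][]string{}
	for host, placements := range servers {
		for _, p := range placements {
			if layout[p.Cluster] == nil {
				layout[p.Cluster] = map[string][]string{}
			}
			layout[p.Cluster][p.Shard] = append(layout[p.Cluster][p.Shard], host)
		}
	}
	for _, shards := range layout {
		for _, hosts := range shards {
			sort.Strings(hosts)
		}
	}
	return layout
}

// sortedKeys returns the keys of a map in ascending order.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// remoteServersXML renders the layout in the shape shown above.
// Assumption: cluster names are valid XML element names.
func remoteServersXML(servers Servers) string {
	layout := buildLayout(servers)
	var b strings.Builder
	b.WriteString("<clickhouse>\n    <remote_servers>\n")
	for _, cluster := range sortedKeys(layout) {
		fmt.Fprintf(&b, "        <%s>\n", cluster)
		for _, shard := range sortedKeys(layout[cluster]) {
			b.WriteString("            <shard>\n")
			fmt.Fprintf(&b, "                <!-- shard %s -->\n", shard)
			for _, host := range layout[cluster][shard] {
				b.WriteString("                <replica>\n                    <host>")
				_ = xml.EscapeText(&b, []byte(host)) // writes to a Builder do not fail
				b.WriteString("</host>\n                </replica>\n")
			}
			b.WriteString("            </shard>\n")
		}
		fmt.Fprintf(&b, "        </%s>\n", cluster)
	}
	b.WriteString("    </remote_servers>\n</clickhouse>\n")
	return b.String()
}

func main() {
	servers := Servers{
		"1.1.1.1": {{Cluster: "A", Shard: "A1"}, {Cluster: "B", Shard: "B1"}},
		"2.2.2.2": {{Cluster: "A", Shard: "A2"}, {Cluster: "B", Shard: "B1"}},
	}
	fmt.Print(remoteServersXML(servers))
}
```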

I would also be open to the inverse: having the config store data that mirrors the config ClickHouse requires, and using that to figure out which clusters are hosted on which servers.

At the end of this project, we should be able to do the following:

  1. Ensure that we are performing database operations on the correct cluster depending on which API key was used
  2. Write algorithms to choose which individual shard or replica to write data to
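
For goal 2, the issue does not prescribe an algorithm. As one illustrative policy (an assumption, not a decision), a shard could be picked by hashing a routing key and a replica by round-robin. `Shard`, `pickShard`, `replicaPicker`, and the choice of FNV-1a are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
	"sync/atomic"
)

// Shard is a named shard with the hosts that hold its replicas.
type Shard struct {
	Name     string
	Replicas []string
}

// pickShard maps a routing key (for example a table name) to a shard by
// hashing it, so the same key always lands on the same shard as long as
// the shard list does not change.
func pickShard(shards []Shard, key string) (Shard, error) {
	if len(shards) == 0 {
		return Shard{}, errors.New("no shards configured")
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return shards[h.Sum32()%uint32(len(shards))], nil
}

// replicaPicker hands out replicas of a shard in round-robin order.
// It is safe for concurrent use.
type replicaPicker struct {
	next uint64
}

func (p *replicaPicker) pick(s Shard) (string, error) {
	if len(s.Replicas) == 0 {
		return "", errors.New("shard has no replicas")
	}
	n := atomic.AddUint64(&p.next, 1) - 1
	return s.Replicas[n%uint64(len(s.Replicas))], nil
}

func main() {
	clusterB := []Shard{{Name: "B1", Replicas: []string{"1.1.1.1", "2.2.2.2"}}}
	shard, err := pickShard(clusterB, "events")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	picker := &replicaPicker{}
	for i := 0; i < 3; i++ {
		host, _ := picker.pick(shard)
		fmt.Println(shard.Name, host)
	}
}
```

Hashing keeps related writes together but reshuffles keys when shards are added; whether that trade-off is acceptable is still an open question here.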
