[INLONG-2029] add pulsar example document for the InLong (apache#230)
Co-authored-by: dockerzhang <dockerzhang@tencent.com>
dockerzhang and dockerzhang committed Dec 20, 2021
1 parent b4c2c6b commit fe415de
Showing 16 changed files with 188 additions and 10 deletions.
10 changes: 5 additions & 5 deletions docs/quick_start/hive_example.md
@@ -5,17 +5,17 @@ sidebar_position: 2

Here we use a simple example to help you experience InLong with Docker.

-## 1 Install Hive
+## Install Hive
Hive is a required component. If you don't have Hive on your machine, we recommend using Docker to install it. Details can be found [here](https://github.com/big-data-europe/docker-hive).

> Note that if you use Docker, you need to add a port mapping `8020:8020`, because it's the port of HDFS DefaultFS, and we need to use it later.
-## 2 Install InLong
+## Install InLong
Before we begin, we need to install InLong. Here we provide two ways:
1. Install InLong with Docker according to the [instructions here](deployment/docker.md). (Recommended)
2. Install the InLong binaries according to the [instructions here](deployment/bare_metal.md).

-## 3 Create a data access
+## Create a data access
After deployment, we first enter the "Data Access" interface, click "Create an Access" in the upper right corner to create a new data access, and fill in the data stream group information as shown in the figure below.

![Create Group](img/create-group.png)
@@ -38,12 +38,12 @@ Note that the target table does not need to be created in advance, as InLong Man

Then we click the "Submit for Approval" button; the data access is created and enters the approval state.

-## 4 Approve the data access
+## Approve the data access
Then we enter the "Approval Management" interface and click "My Approval" to approve the data access that we just applied for.

At this point, the data access has been created successfully. We can see that the corresponding table has been created in Hive, and we can see that the corresponding topic has been created successfully in the management GUI of TubeMQ.

-## 5 Configure the agent
+## Configure the agent
Here we use `docker exec` to enter the container of the agent and configure it.
```
$ docker exec -it agent sh
```
Binary file added docs/quick_start/img/pulsar-arch.png
Binary file added docs/quick_start/img/pulsar-data.png
Binary file added docs/quick_start/img/pulsar-group.png
Binary file added docs/quick_start/img/pulsar-hive.png
Binary file added docs/quick_start/img/pulsar-stream.png
Binary file added docs/quick_start/img/pulsar-topic.png
90 changes: 90 additions & 0 deletions docs/quick_start/pulsar_example.md
@@ -0,0 +1,90 @@
---
title: Pulsar Example
sidebar_position: 2
---

Apache InLong has added the ability to access data through Apache Pulsar, taking advantage of the features that distinguish Pulsar from other message queues and providing a complete solution for data access scenarios with higher data quality requirements, such as finance and billing.
The following walks through a complete example of accessing data with Apache Pulsar through Apache InLong.

![Pulsar architecture](img/pulsar-arch.png)

## Install Pulsar
To deploy an Apache Pulsar cluster, refer to the [official installation guide](https://pulsar.apache.org/docs/en/standalone/).

## Install Hive
Hive is a required component. If you don't have Hive on your machine, we recommend using Docker to install it. Details can be found [here](https://github.com/big-data-europe/docker-hive).

> Note that if you use Docker, you need to add a port mapping `8020:8020`, because it's the port of HDFS DefaultFS, and we need to use it later.
## Install InLong
Before we begin, we need to install InLong. Here we provide two ways:
1. Install InLong with Docker according to the [instructions here](deployment/docker.md). (Recommended)
2. Install the InLong binaries according to the [instructions here](deployment/bare_metal.md).

Unlike with InLong TubeMQ, if you use Apache Pulsar you must configure the Pulsar cluster information
when installing the Manager component. The format is as follows:
```
# Pulsar admin URL
pulsar.adminUrl=http://127.0.0.1:8080,127.0.0.2:8080,127.0.0.3:8080
# Pulsar broker address
pulsar.serviceUrl=pulsar://127.0.0.1:6650,127.0.0.2:6650,127.0.0.3:6650
# Default tenant of Pulsar
pulsar.defaultTenant=public
```
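Before starting the Manager, it can help to confirm that all three keys are actually present in the file you edited. The following is a minimal sketch, not part of InLong: `check_pulsar_props` is a hypothetical helper, and the properties file path is an assumption because this guide does not name the file.

```shell
# Hypothetical helper: report which of the three Pulsar keys are missing
# from a properties file. Returns 0 only when all of them are present.
check_pulsar_props() {
  props_file=$1
  missing=0
  for key in pulsar.adminUrl pulsar.serviceUrl pulsar.defaultTenant; do
    # Compare the text before the first "=" with the key, as an exact string.
    if ! awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$props_file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Example call; the path is a placeholder for your Manager properties file.
# check_pulsar_props /path/to/manager.properties
```

Commented-out lines such as `# pulsar.adminUrl=...` do not count as present, because the text before `=` then starts with `#`.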

## Create a data access
### Configure data stream group information
![](img/pulsar-group.png)
When creating a data access, select Pulsar as the message middleware for the data stream group.
The other Pulsar-related configuration items are:
- Queue module: Parallel or Serial. With Parallel you can set the number of topic partitions; Serial uses a single partition.
- Write quorum: number of copies to store for each message
- Ack quorum: number of guaranteed copies (acks to wait for before a write is complete)
- retention time: how long messages already acknowledged by consumers are retained
- ttl: default time to live for messages that have not been acknowledged
- retention size: how much data from messages already acknowledged by consumers is retained
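The two quorum settings are related: BookKeeper cannot wait for more acknowledgements than the number of copies it writes, so the ack quorum must not exceed the write quorum. A small sketch of that check follows; `valid_quorum` is a hypothetical helper for illustration and is not something InLong provides.

```shell
# Returns 0 when the pair is consistent: at least one ack,
# and no more acks than copies written.
valid_quorum() {
  write_quorum=$1
  ack_quorum=$2
  [ "$ack_quorum" -ge 1 ] && [ "$ack_quorum" -le "$write_quorum" ]
}

valid_quorum 3 2 && echo "write=3 ack=2 is consistent"
valid_quorum 2 3 || echo "write=2 ack=3 is not consistent"
```

If the `pulsar-admin` CLI is available, `pulsar-admin namespaces get-persistence`, `get-retention`, and `get-message-ttl` show the values in effect for a namespace. Whether InLong applies these settings at the namespace level is an assumption here, so treat that as a starting point for inspection only.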

### Configure data stream
![](img/pulsar-stream.png)
When configuring the message source, see [File Agent configuration](https://inlong.apache.org/docs/next/modules/agent/file#file-agent-configuration) for how to set the file path of a file data source.

### Configure data format
![](img/pulsar-data.png)

### Configure Hive cluster
Save the Hive cluster information and click "Ok" to submit.
![](img/pulsar-hive.png)

## Data access Approval
Enter the **Approval** page, click **My Approval**, and approve the data access application. Once the approval completes,
the topics and subscriptions required by the data stream are created synchronously in the Pulsar cluster.
We can use the command-line tool in the Pulsar cluster to check whether the topic was created successfully:
![](img/pulsar-topic.png)
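Since the check is only shown as a screenshot, here is a hedged sketch of what it might look like with `pulsar-admin`. It assumes the group ID `b_test_group` and stream ID `test_stream` (the names that appear in the troubleshooting section of this guide) and the default tenant `public`; `topic_name` is a hypothetical helper.

```shell
TENANT=public            # assumed: the value of pulsar.defaultTenant
GROUP_ID=b_test_group    # assumed data stream group ID
STREAM_ID=test_stream    # assumed data stream ID

# Build the fully qualified topic name from tenant, namespace and topic.
topic_name() {
  printf 'persistent://%s/%s/%s\n' "$1" "$2" "$3"
}

EXPECTED_TOPIC=$(topic_name "$TENANT" "$GROUP_ID" "$STREAM_ID")
echo "expecting: $EXPECTED_TOPIC"

if command -v pulsar-admin >/dev/null 2>&1; then
  # Non-partitioned topics, plus the individual partitions of partitioned ones.
  pulsar-admin topics list "$TENANT/$GROUP_ID"
  # Partitioned topics, listed by their base name.
  pulsar-admin topics list-partitioned-topics "$TENANT/$GROUP_ID"
else
  echo "pulsar-admin is not on PATH; run it from the bin directory of your Pulsar installation"
fi
```

With the Parallel queue module the topic is partitioned, so it should show up in the second listing under its base name.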

## Configure File Agent
When configuring the file agent, you must create the file in the directory specified when creating the data access:
```
touch /data/test_file.txt
```

Write data to the file in the data source format specified when creating the data stream:
```
printf '1|test\n2|test\n' >> /data/test_file.txt
```
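To push more than two rows through the pipeline, the same `id|name` layout can be generated in a loop. This is only a sketch: `gen_rows` and the `TARGET_FILE` variable are hypothetical names, and the default target is a scratch file so the snippet is safe to run anywhere.

```shell
# Print N rows in the "id|name" layout used above.
gen_rows() {
  count=$1
  i=1
  while [ "$i" -le "$count" ]; do
    printf '%s|test\n' "$i"
    i=$((i + 1))
  done
}

# Defaults to a scratch file; set TARGET_FILE=/data/test_file.txt
# to append to the file the agent is watching.
TARGET_FILE=${TARGET_FILE:-$(mktemp)}
gen_rows 5 >> "$TARGET_FILE"
tail -n 5 "$TARGET_FILE"
```

Using `printf` here avoids the trailing empty line that `echo -e "...\n"` would append to the file.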

## Data Check
Finally, we log in to the Hive cluster and use Hive SQL commands to check
whether data was successfully inserted into the `test_stream` table.
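A quick way to run that check is to sample a few rows from the table. The sketch below assumes the `hive` CLI is on the PATH and that `test_stream` lives in the database your session uses by default; `sample_query` is a hypothetical helper.

```shell
# Build a small sampling query; table name and row limit are arguments.
sample_query() {
  printf 'SELECT * FROM %s LIMIT %s;\n' "$1" "$2"
}

QUERY=$(sample_query test_stream 10)

if command -v hive >/dev/null 2>&1; then
  hive -e "$QUERY"
else
  echo "hive CLI not found; run this in a Hive session instead: $QUERY"
fi
```

If the rows written to the file earlier appear in the result, the whole path from the agent to Hive is working.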

## Troubleshooting
If data is not correctly written to the Hive cluster, you can check whether the information for `DataProxy` and `Sort` has been synchronized:
- Check whether the topic information corresponding to the data stream is correctly written in the `conf/topics.properties` file of `InLong DataProxy`:
```
b_test_group/test_stream=persistent://public/b_test_group/test_stream
```

- Check whether the configuration information of the data stream has been successfully pushed to
  the ZooKeeper monitored by `InLong Sort`:
```
get /inlong_hive/dataflows/{{sink_id}}
```
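Both checks above can be scripted. The sketch below makes several assumptions: it is run from the DataProxy installation directory, ZooKeeper listens on `127.0.0.1:2181`, and `zkCli.sh` is on the PATH. `has_topic_mapping`, `TOPICS_FILE`, and `ZK_SERVER` are hypothetical names introduced for illustration.

```shell
# Returns 0 when the topics file has an entry whose key is exactly group/stream.
has_topic_mapping() {
  topics_file=$1
  group_stream=$2
  awk -F= -v k="$group_stream" '$1 == k { found = 1 } END { exit !found }' "$topics_file"
}

TOPICS_FILE=${TOPICS_FILE:-conf/topics.properties}
if [ -f "$TOPICS_FILE" ]; then
  if has_topic_mapping "$TOPICS_FILE" "b_test_group/test_stream"; then
    echo "topic mapping found"
  else
    echo "topic mapping missing"
  fi
else
  echo "$TOPICS_FILE not found; run this from the DataProxy directory"
fi

ZK_SERVER=${ZK_SERVER:-127.0.0.1:2181}   # assumed ZooKeeper address
if command -v zkCli.sh >/dev/null 2>&1; then
  # List the sink IDs first, then inspect one with: get /inlong_hive/dataflows/<sink_id>
  zkCli.sh -server "$ZK_SERVER" ls /inlong_hive/dataflows
fi
```

Listing the children of `/inlong_hive/dataflows` is a convenient way to find the `sink_id` value needed by the `get` command shown above.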
@@ -6,18 +6,18 @@ sidebar_position: 2
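This section uses a simple example to help you quickly experience the complete InLong workflow with Docker.


-## 1 Install Hive
+## Install Hive
Hive is a required component. If Hive is not installed on your machine, we recommend using Docker for a quick installation; see [here](https://github.com/big-data-europe/docker-hive) for details.

> Note that if you use the Docker image above, you need to add a port mapping `8020:8020` to the namenode, because it is the port of HDFS DefaultFS and is needed later when configuring Hive.
-## 2 Install InLong
+## Install InLong
Before we begin, we need to install all InLong components. There are two ways:
1. Follow [the instructions here](deployment/docker.md) to deploy quickly with Docker. (Recommended)
2. Follow [the instructions here](deployment/bare_metal.md) to install each component in turn from the binary packages.


-## 3 Create a data access
+## Create a data access
After deployment, we first enter the "Data Access" interface, click "Create an Access" in the upper right corner to create a new data access, and fill in the data stream group information as shown in the figure below.

![Create Group](img/create-group.png)
@@ -40,12 +40,12 @@ Hive is a required component. If Hive is not installed on your machine, we recommend

Then click the "Submit for Approval" button; the data access is created and enters the approval state.

-## 4 Approve the data access
+## Approve the data access
Enter the "Approval Management" interface, click "My Approval", and approve the data access we just applied for.

At this point the data access has been created. We can see that the corresponding table has been created in Hive, and that the corresponding topic has been created in the TubeMQ management interface.

-## 5 Configure the agent
+## Configure the agent
Then we use docker to enter the agent container and create the corresponding agent configuration.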
```
$ docker exec -it agent sh
```
@@ -0,0 +1,88 @@
---
title: Pulsar Example
sidebar_position: 2
---

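Apache InLong has added the ability to access data through Apache Pulsar, making full use of the technical advantages that distinguish Pulsar from other MQs and providing a complete solution for data access scenarios with higher data quality requirements, such as finance and billing.
In the following, we use a complete example to show how to access data with Apache Pulsar through Apache InLong.

![Create Group](img/pulsar-arch.png)

## Install Pulsar
To deploy an Apache Pulsar cluster, refer to the [official installation guide](https://pulsar.apache.org/docs/en/standalone/).

## Install Hive
Hive is a required component. If Hive is not installed on your machine, we recommend using Docker for a quick installation; see [here](https://github.com/big-data-europe/docker-hive) for details.

> Note that if you use the Docker image above, you need to add a port mapping `8020:8020` to the namenode, because it is the port of HDFS DefaultFS and is needed later when configuring Hive.
## Install InLong
Before we begin, we need to install all InLong components. There are two ways:
1. Follow [the instructions here](deployment/docker.md) to deploy quickly with Docker. (Recommended)
2. Follow [the instructions here](deployment/bare_metal.md) to install each component in turn from the binary packages.

Unlike with InLong TubeMQ, if you use Apache Pulsar you need to configure the Pulsar cluster information when installing the Manager component, in the following format: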
```
# Pulsar admin URL
pulsar.adminUrl=http://127.0.0.1:8080,127.0.0.2:8080,127.0.0.3:8080
# Pulsar broker address
pulsar.serviceUrl=pulsar://127.0.0.1:6650,127.0.0.2:6650,127.0.0.3:6650
# Default tenant of Pulsar
pulsar.defaultTenant=public
```

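## Create a data access
### Configure data stream group information
![](img/pulsar-group.png)
When creating a data access, select Pulsar as the message middleware for the data stream group. Other Pulsar-related configuration items include:
- Queue module: the queue model, Parallel or Serial. With Parallel you can set the number of topic partitions; Serial uses a single partition.
- Write quorum: number of replicas written for each message
- Ack quorum: number of Bookies that must acknowledge a write
- retention time: how long messages acknowledged by consumers are kept
- ttl: expiry time for unacknowledged messages
- retention size: how much data from messages acknowledged by consumers is kept

### Configure data stream
![](img/pulsar-stream.png)
When configuring the message source, see the [detailed File Agent guide](https://inlong.apache.org/docs/next/modules/agent/file#file-agent-configuration) in inlong-agent for the file path of a file data source.

### Configure data format
![](img/pulsar-data.png)

### Configure Hive cluster
Save the Hive cluster information and click "Ok".
![](img/pulsar-hive.png)

## Data access approval
Enter the **Approval Management** page, click **My Approval**, and approve the data access application submitted above. Once the approval completes, the topics and subscriptions required by the data stream are created synchronously in the Pulsar cluster.
We can use the command-line tool in the Pulsar cluster to check whether the topic was created successfully: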
![](img/pulsar-topic.png)

## Configure File Agent
When configuring the file agent, create the file in the directory specified when the data access was created:
```
touch /data/test_file.txt
```

Write data to the file in the data source format specified when creating the data stream (you can write more rows in the same format):
```
printf '1|test\n2|test\n' >> /data/test_file.txt
```

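## Data Check

Finally, we log in to the Hive cluster and use Hive SQL commands to check whether data was successfully inserted into the `test_stream` table.

## Troubleshooting
If data is not correctly written to the Hive cluster, check whether the information for `DataProxy` and `Sort` has been synchronized:
- Check whether the topic information for the data stream is correctly written in the `conf/topics.properties` file of `InLong DataProxy`: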
```
b_test_group/test_stream=persistent://public/b_test_group/test_stream
```

- Check whether the configuration information of the data stream has been successfully pushed to the ZooKeeper monitored by InLong Sort:
```
get /inlong_hive/dataflows/{{sink_id}}
```

