Commit
[INLONG-1814] Show document file subdirectories and change the document directory level (apache#190)
bluewang committed Nov 20, 2021
1 parent a279898 commit b5e9420
Showing 35 changed files with 260 additions and 260 deletions.
12 changes: 6 additions & 6 deletions docs/modules/agent/architecture.md
@@ -2,27 +2,27 @@
title: Architecture
---

## 1. Overview of InLong-Agent
## 1 Overview of InLong-Agent
InLong-Agent is a data collection tool that supports multiple types of data sources and is committed to stable, efficient data collection across heterogeneous sources such as files, SQL, Binlog, and metrics.

### The brief architecture diagram is as follows:
### 1.1 The brief architecture diagram is as follows:
![](img/architecture.png)

### design concept
### 1.2 design concept
In order to solve the problem of data source diversity, InLong-agent abstracts multiple data sources into a unified source concept, and abstracts sinks to write data. When you need to access a new data source, you only need to configure the format and reading parameters of the data source to achieve efficient reading.

### Current status of use
### 1.3 Current status of use
InLong-Agent is widely used within the Tencent Group, undertaking most of the data collection business, and the amount of online data reaches tens of billions.

## 2. InLong-Agent architecture
## 2 InLong-Agent architecture
The InLong Agent task runs on a data collection framework built with a channel + plug-in architecture. Reading and writing a data source are implemented as reader/writer plug-ins, which are then integrated into the framework (a minimal sketch of these roles follows the list).

+ Reader: the data collection module, responsible for collecting data from the data source and sending it to the channel.
+ Writer: the data writing module, which continuously pulls data from the channel and writes it to the destination.
+ Channel: connects the reader and the writer, serves as the data transmission pipeline between them, and provides monitoring of the data flowing through it.
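
A minimal, illustrative Java sketch of how the three roles fit together is shown below. The interface names, method signatures, and the `Message` carrier type are assumptions made for this sketch; they are not the actual InLong-Agent plug-in interfaces.

```java
import java.util.Map;

// Simplified sketch of the reader/writer/channel roles described above; the real
// InLong-Agent plug-in interfaces carry more lifecycle and configuration methods.
interface Channel {
    void push(Message message);   // reader side: hand a collected message to the channel
    Message pull(long timeoutMs); // writer side: take the next message, waiting up to timeoutMs
}

interface Reader {
    void read(Channel channel);   // collect records from the data source and push them to the channel
}

interface Writer {
    void write(Channel channel);  // continuously pull from the channel and write to the destination
}

// Illustrative message carrier: the payload plus routing attributes such as group/stream ids.
class Message {
    byte[] body;
    Map<String, String> attributes;
}
```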


## 3. Different kinds of agent
## 3 Different kinds of agent
### 3.1 file agent
File collection includes the following functions:

18 changes: 9 additions & 9 deletions docs/modules/agent/quick_start.md
@@ -2,15 +2,15 @@
title: Build && Deployment
---

## 1Configuration
## 1 Configuration
```
cd inlong-agent
```

The agent supports two modes of operation: local operation and online operation.


### Agent configuration
### 1.1 Agent configuration

Online operation needs to pull the configuration from inlong-manager; the configuration in conf/agent.properties is as follows:
```ini
@@ -20,25 +20,25 @@ agent.manager.vip.http.host=manager web host
agent.manager.vip.http.port=manager web port
```

## 2run
## 2 run
After decompression, run the following command

```bash
sh agent.sh start
```


## 3Add job configuration in real time
## 3 Add job configuration in real time

#### 3.1 agent.properties Modify the following two places
### 3.1 agent.properties Modify the following two places
```ini
# whether enable http service
agent.http.enable=true
# http default port
agent.http.port=Available ports
```

#### 3.2 Execute the following command
### 3.2 Execute the following command
```bash
curl --location --request POST 'http://localhost:8008/config/job' \
--header 'Content-Type: application/json' \
@@ -78,7 +78,7 @@ agent.http.port=Available ports
- proxy.streamId: the streamId used when writing to the proxy; streamId is the data stream id shown on the data stream page in inlong-manager


## 4eg for directory config
## 4 eg for directory config

E.g:
/data/inlong-agent/test.log //Represents reading the new file test.log in the inlong-agent folder
@@ -87,7 +87,7 @@ agent.http.port=Available ports
/data/inlong-agent/^\\d+(\\.\\d+)? // Matches file names that start with one or more digits, optionally followed by a dot and one or more digits (the ? makes the fractional part optional); examples: "5", "1.5" and "2.21"


## 5. Support to get data time from file name
## 5 Support to get data time from file name

Agent supports obtaining the time from the file name as the production time of the data. The configuration instructions are as follows:
/data/inlong-agent/***YYYYMMDDHH***
@@ -143,7 +143,7 @@ curl --location --request POST'http://localhost:8008/config/job' \
}'
```

## 6. Support time offset reading
## 6 Support time offset reading

After time-based reading is configured, if you want to read data for a time other than the current time, you can configure a time offset.
Set the job attribute job.timeOffset; its value is a number plus a time unit, where the supported units are d (day) and h (hour). For example, `-1d` reads data from one day before the current time, and `1h` reads data from one hour after the current time.
16 changes: 8 additions & 8 deletions docs/modules/dataproxy-sdk/architecture.md
@@ -1,16 +1,16 @@
---
title: Architecture
---
# 1、intro
## 1 intro
When a business uses the message access method, it generally only needs to format its data into a format the proxy can recognize (such as the six-segment protocol, the digital protocol, etc.) and send it in packed batches to get the data into InLong. However, to guarantee data reliability, load balancing, dynamic updates of the proxy list, and other such safeguards,
the user program would have to handle much more logic, which ultimately makes it cumbersome and bloated.

The original intention of the API design is to simplify user access and take over part of the reliability-related logic. After the user integrates the API into the data-producing program, data can be sent to the proxy without worrying about packing formats, load balancing, and other such logic.

# 2、functions
## 2 functions

## 2.1 overall functions
### 2.1 overall functions

| function | description |
| ---- | ---- |
@@ -22,17 +22,17 @@ The original intention of API design is to simplify user access and assume some
| proxy list persistence (new)| Persist the proxy list by business group id, so that data can still be sent at program start-up even if the configuration center is unreachable |


## 2.2 Data transmission function description
### 2.2 Data transmission function description

### Synchronous batch function
#### Synchronous batch function

public SendResult sendMessage(List<byte[]> bodyList, String groupId, String streamId, long dt, long timeout, TimeUnit timeUnit)

Parameter Description

bodyList is the collection of data records the user needs to send; the recommended total length is less than 512 KB. groupId is the business group id, and streamId is the data stream id. dt is the timestamp of the data, accurate to the millisecond; it can also be set to 0, in which case the API uses the current time as the timestamp. timeout and timeUnit set the timeout for sending data; 20s is generally recommended.
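
For illustration, a hedged usage sketch of this synchronous batch call is shown below. `MessageSender` stands for the SDK sender type that exposes the signature above, and its construction (manager address, authentication, etc.) is omitted because it depends on the SDK version; the group/stream ids are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative only: how an already-initialized sender might be used for a batch send.
class SyncBatchExample {
    static void sendBatch(MessageSender sender) {
        List<byte[]> bodyList = new ArrayList<>();
        bodyList.add("field1|field2|field3".getBytes(StandardCharsets.UTF_8));
        bodyList.add("field4|field5|field6".getBytes(StandardCharsets.UTF_8));

        // "test_group"/"test_stream" are placeholders; dt = 0 lets the API fill in the
        // current time; 20 seconds is the generally recommended timeout.
        SendResult result = sender.sendMessage(bodyList, "test_group", "test_stream", 0, 20, TimeUnit.SECONDS);
        if (result != SendResult.OK) {
            // retry or log according to the business requirement
        }
    }
}
```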

### Synchronize a single function
#### Synchronize a single function

public SendResult sendMessage(byte[] body, String groupId, String streamId, long dt, long timeout, TimeUnit timeUnit)

@@ -41,7 +41,7 @@ The original intention of API design is to simplify user access and assume some
body is the content of a single piece of data that the user wants to send, and the meaning of the remaining parameters is basically the same as the batch sending interface.


### Asynchronous batch function
#### Asynchronous batch function

public void asyncSendMessage(SendMessageCallback callback, List<byte[]> bodyList, String groupId, String streamId, long dt, long timeout,TimeUnit timeUnit)

@@ -50,7 +50,7 @@ The original intention of API design is to simplify user access and assume some
SendMessageCallback is the callback used to process the result of the send. bodyList is the collection of data records the user needs to send; the recommended total length is less than 512 KB. groupId is the business group id, and streamId is the data stream id. dt is the timestamp of the data, accurate to the millisecond; it can also be set to 0, in which case the API uses the current time as the timestamp. timeout and timeUnit set the timeout for sending data, generally recommended to be 20s.
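
A hedged sketch of an asynchronous batch call is shown below. `MessageSender` again stands for the SDK sender type, and the `SendMessageCallback` implementation is assumed to be provided by the application (its concrete methods are defined by the SDK and not shown here).

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative only: the call returns immediately; success or failure is reported to
// the callback once the proxy acknowledges or the timeout elapses.
class AsyncBatchExample {
    static void sendAsync(MessageSender sender, SendMessageCallback callback) {
        List<byte[]> bodyList =
                Collections.singletonList("field1|field2".getBytes(StandardCharsets.UTF_8));
        sender.asyncSendMessage(callback, bodyList, "test_group", "test_stream", 0, 20, TimeUnit.SECONDS);
    }
}
```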


### Asynchronous single function
#### Asynchronous single function


public void asyncSendMessage(SendMessageCallback callback, byte[] body, String groupId, String streamId, long dt, long timeout, TimeUnit timeUnit)
8 changes: 4 additions & 4 deletions docs/modules/dataproxy/architecture.md
@@ -1,22 +1,22 @@
---
title: Architecture
---
# 1、intro
## 1 intro

InLong-dataProxy belongs to the InLong proxy layer and is used for data collection, reception, and forwarding. Through format conversion, the data is converted into the TDMsg1 format that the cache layer can buffer and process.
InLong-dataProxy acts as a bridge from the InLong collection end to the InLong buffer end. DataProxy pulls the mapping between business group ids and the corresponding topic names from the manager module, and internally manages the producers for multiple topics.
The overall architecture of InLong-dataProxy is based on Apache Flume. On the basis of that project, inlong-bus extends the source and sink layers and optimizes disaster-tolerant forwarding, which improves the stability of the system.


# 2、architecture
## 2 architecture

![](img/architecture.png)

1. The source layer opens port listening, implemented through a Netty server; the decoded data is sent to the channel layer.
2. The channel layer has a selector that chooses which type of channel the data goes through; if memory eventually fills up, the data is processed accordingly.
3. The data in the channel layer is forwarded by the sink layer, whose main purpose is to convert the data into the TDMsg1 format and push it to the cache layer (TubeMQ is the most commonly used here).

# 3、DataProxy support configuration instructions
## 3 DataProxy support configuration instructions

DataProxy supports configurable source-channel-sink, and the configuration method is the same as the configuration file structure of flume:

@@ -158,7 +158,7 @@ agent1.sinks.meta-sink-more1.max-survived-size = 3000000
Maximum number of caches
```
# 4、Monitor metrics configuration instructions
## 4 Monitor metrics configuration instructions
DataProxy provides monitoring metrics based on JMX; users can implement code that reads the metrics and reports them to a user-defined monitoring system.
The source and sink modules can add a monitoring metric class that is a subclass of org.apache.inlong.commons.config.metrics.MetricItemSet and register it to the MBeanServer. A user-defined plugin can then obtain the module metrics via JMX and report the metric data to different monitoring systems; a minimal sketch of this registration pattern follows.
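
The following self-contained sketch illustrates the JMX pattern described above: a simple metric holder is registered with the platform MBeanServer and read back by ObjectName. The `SourceMetrics` class and the ObjectName used here are illustrative stand-ins, not the actual InLong classes or names (a real module would register a MetricItemSet subclass instead).

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Standard MBean pattern: the management interface name is the class name + "MBean".
interface SourceMetricsMBean {
    long getReceivedCount();
}

class SourceMetrics implements SourceMetricsMBean {
    private final AtomicLong receivedCount = new AtomicLong();

    public void incReceived() { receivedCount.incrementAndGet(); }

    @Override
    public long getReceivedCount() { return receivedCount.get(); }
}

class MetricRegistrationExample {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("org.apache.inlong.dataproxy:type=SourceMetrics"); // illustrative name
        SourceMetrics metrics = new SourceMetrics();
        server.registerMBean(metrics, name);

        metrics.incReceived();
        // A user-defined reporter would periodically read the attribute via JMX and
        // forward it to the target monitoring system.
        Object received = server.getAttribute(name, "ReceivedCount");
        System.out.println("ReceivedCount = " + received);
    }
}
```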
14 changes: 7 additions & 7 deletions docs/modules/dataproxy/quick_start.md
@@ -1,11 +1,11 @@
---
title: Build && Deployment
---
## Deploy DataProxy
## 1 Deploy DataProxy

All deployment files are located in the `inlong-dataproxy` directory.

### config TubeMQ master
### 1.1 config TubeMQ master

`tubemq_master_list` is the rpc address of TubeMQ Master.
```
@@ -14,33 +14,33 @@ $ sed -i 's/TUBE_LIST/tubemq_master_list/g' conf/flume.conf

Note that in conf/flume.conf, FLUME_HOME is the directory used for the proxy's internal data.

### Environmental preparation
### 1.2 Environmental preparation

```
sh prepare_env.sh
```

### config manager web url
### 1.3 config manager web url

configuration file: `conf/common.properties`:
```
# manager web
manager_hosts=ip:port
```

## run
## 2 run

```
sh bin/start.sh
```


## check
## 3 check
```
telnet 127.0.0.1 46801
```

## Add DataProxy configuration to InLong-Manager
## 4 Add DataProxy configuration to InLong-Manager

After installing DataProxy, you need to insert the IP and port where the DataProxy service is located into the backend database of InLong-Manager.

10 changes: 5 additions & 5 deletions docs/modules/manager/architecture.md
@@ -2,19 +2,19 @@
title: Architecture
---

## Introduction to Apache InLong Manager
## 1 Introduction to Apache InLong Manager

+ Target positioning: Apache InLong is positioned as a one-stop data access solution, providing technical capabilities that cover the complete big data access scenario, from data collection and transmission to sorting.

+ Platform value: Users can complete task configuration, management, and indicator monitoring through the platform's built-in management and configuration platform. At the same time, the platform provides SPI extension points in the main links of the process to implement custom logic as needed. Ensure stable and efficient functions while lowering the threshold for platform use.

+ Apache InLong Manager is the user-oriented, unified UI of the entire data access platform. After a user logs in, it provides different function and data permissions according to the user's role. The page provides maintenance portals for the platform's basic clusters (such as MQ and sorting), so basic maintenance information and capacity planning adjustments can be viewed at any time. Business users can also create, modify, and maintain data access tasks, and use the metric viewing and reconciliation functions. When users create and start tasks, the corresponding backend service interacts with the underlying modules and dispatches the work each module needs to perform, coordinating the execution of the whole backend pipeline.
## Architecture
## 2 Architecture

![](img/inlong-manager.png)


##Module division of labor
## 3 Module division of labor

| Module | Responsibilities |
| :----| :---- |
@@ -24,9 +24,9 @@ title: Architecture
| manager-web | Front-end interactive response interface |
| manager-workflow-engine | Workflow Engine |

## use process
## 4 use process
![](img/interactive.jpg)


## data model
## 5 data model
![](img/datamodel.jpg)
12 changes: 6 additions & 6 deletions docs/modules/manager/quick_start.md
@@ -2,7 +2,7 @@
title: Build && Deployment
---

# 1. Environmental preparation
## 1 Environmental preparation
- Install and start MySQL 5.7+, copy the `doc/sql/apache_inlong_manager.sql` file in the inlong-manager module to the
server where the MySQL database is located (for example, copy to `/data/` directory), load this file through the
following command to complete the initialization of the table structure and basic data:
@@ -25,15 +25,15 @@ title: Build && Deployment
to [Compile and deploy TubeMQ Manager](https://inlong.apache.org/zh-cn/docs/modules/tubemq/tubemq-manager/quick_start.html)
, install and start TubeManager.

# 2. Deploy and start manager-web
## 2 Deploy and start manager-web

**manager-web is a background service that interacts with the front-end page.**

## 2.1 Prepare installation files
### 2.1 Prepare installation files

All installation files at `inlong-manager-web` directory.

## 2.2 Modify configuration
### 2.2 Modify configuration

Go to the decompressed `inlong-manager-web` directory and modify the `conf/application.properties` file:

@@ -74,7 +74,7 @@ The dev configuration is specified above, then modify the `conf/application-dev.
sort.appName=inlong_app
```

## 2.3 Start the service
### 2.3 Start the service

Enter the decompressed directory, execute `sh bin/startup.sh` to start the service, and check the
log `tailf log/manager-web.log`. If a log similar to the following appears, the service has started successfully:
Expand All @@ -83,7 +83,7 @@ log `tailf log/manager-web.log`. If a log similar to the following appears, the
Started InLongWebApplication in 6.795 seconds (JVM running for 7.565)
```

# 3. Service access verification
## 3 Service access verification

Verify the manager-web service:

18 changes: 9 additions & 9 deletions docs/modules/sort/introduction.md
@@ -7,31 +7,31 @@ Inlong-sort is used to extract data from different source systems, then transfor
Inlong-sort is simply a Flink application, and it relies on Inlong-manager to manage metadata (such as source information and storage information).

# features
## multi-tenancy
## 1 multi-tenancy
Inlong-sort is a multi-tenant system, which means you can extract data from different sources (these sources must be of the same source type) and load data into different sinks (these sinks must be of the same storage type).
e.g. you can extract data from different topics of inlong-tubemq and then load them into different hive clusters.

## change meta data without restart
## 2 change meta data without restart
Inlong-sort uses ZooKeeper to manage its metadata; every time you change metadata on ZK, the inlong-sort application is informed immediately.
e.g. if you want to change the schema of your data, just change the metadata on ZK without restarting your inlong-sort application.

# supported sources
## 3 supported sources
- inlong-tubemq
- pulsar

# supported storages
## 4 supported storages
- clickhouse
- hive (Currently we just support parquet file format)

# limitations
## 5 limitations
Currently, we just support extracting specified fields in the stage of **Transform**.

# future plans
## More kinds of source systems
## 6 future plans
### 6.1 More kinds of source systems
Kafka, etc.

## More kinds of storage systems
### 6.2 More kinds of storage systems
HBase, Elasticsearch, etc.

## More kinds of file format in hive sink
### 6.3 More kinds of file format in hive sink
sequence file, orc
4 changes: 2 additions & 2 deletions docs/modules/sort/protocol_introduction.md
@@ -7,7 +7,7 @@ Currently the metadata management of inlong-sort relies on inlong-manager.

Metadata interaction between inlong-sort and inlong-manager is performed via ZK.

# Zookeeper's path structure
## 1 Zookeeper's path structure

![img.png](img.png)

@@ -20,6 +20,6 @@ A path at the top of the figure indicates which dataflow are running in a cluste

The path below is used to store the details of the dataflow.

# Protocol
## 2 Protocol
Please reference
`org.apache.inlong.sort.protocol.DataFlowInfo`