exchange update export #2588
Merged · 3 commits · Feb 15, 2023
@@ -116,7 +116,7 @@ Exchange {{exchange.release}} supports converting data in the following formats or sources into N

{{ ent.ent_begin }}

In addition, Exchange Enterprise Edition supports using NebulaGraph as the source to [export data to CSV files or another NebulaGraph](../use-exchange/ex-ug-export-from-nebula.md).

{{ ent.ent_end }}

257 changes: 212 additions & 45 deletions docs-2.0/nebula-exchange/use-exchange/ex-ug-export-from-nebula.md
@@ -1,10 +1,10 @@
# Export data from NebulaGraph

Exchange supports exporting data from NebulaGraph to CSV files or to another NebulaGraph database. This topic describes the steps in detail.

!!! enterpriseonly

Only Exchange Enterprise Edition supports exporting NebulaGraph data.

## Prepare the environment

@@ -49,81 +49,246 @@ CentOS 7.9.2009

2. Modify the configuration files.

Exchange Enterprise Edition provides dedicated configuration file templates for exporting NebulaGraph data: `export_to_csv.conf` and `export_to_nebula.conf`. For descriptions of the configuration options, see [Exchange parameters](../parameter-reference/ex-ug-parameter.md). The core content of the configuration files used in this example is as follows:

- Export to CSV files:

```conf
...
# Use the command to submit the exchange job:

# spark-submit \
# --master "spark://master_ip:7077" \
# --driver-memory=2G --executor-memory=30G \
# --num-executors=3 --executor-cores=20 \
# --class com.vesoft.nebula.exchange.Exchange \
# nebula-exchange-3.0-SNAPSHOT.jar -c export_to_csv.conf

{
  # Spark config
  spark: {
    app: {
      name: NebulaGraph Exchange
    }
  }

  # NebulaGraph config
  # If you export NebulaGraph data to CSV, you can ignore this nebula section.
  nebula: {
    address:{
      graph:["127.0.0.1:9669"]
      # If your NebulaGraph server is in a virtual network such as k8s, configure the leader address of meta.
      # Use `SHOW meta leader` to see the meta leader's address.
      meta:["127.0.0.1:9559"]
    }
    user: root
    pswd: nebula
    space: test

    # nebula client connection parameters
    connection {
      # socket connect & execute timeout, unit: millisecond
      timeout: 30000
    }

    error: {
      # max number of failures; if the number of failures exceeds it, the application exits
      max: 32
      # failed data will be recorded in the output path, in nGQL format
      output: /tmp/errors
    }

    # use Google's RateLimiter to limit the requests sent to NebulaGraph
    rate: {
      # the stable throughput of RateLimiter
      limit: 1024
      # Acquires a permit from RateLimiter, unit: MILLISECONDS
      # If it cannot be obtained within the specified timeout, the request is given up.
      timeout: 1000
    }
  }

  # Processing tags
  tags: [
    # Export NebulaGraph tag data to CSV.
    {
      # You can use any tag name when exporting NebulaGraph data to CSV.
      name: tag-name-1
      type: {
        source: nebula
        sink: csv
      }

      ...

      metaAddress:"127.0.0.1:9559"
      space:"test"
      label:"person"
      # Configure the fields you want to export from NebulaGraph.
      fields: [nebula-field-0, nebula-field-1, nebula-field-2]
      noFields:false # default false; if true, only the vertex IDs are exported
      partition: 60
      limit:10000
      # Configure the path to save the CSV files. If the files are not in HDFS, use "file:///path/test.csv".
      path: "hdfs://ip:port/path/person"
      separator: ","
      header: true
    }
  ]

  # process edges
  edges: [
    # Export NebulaGraph edge data to CSV.
    {
      # You can use any edge name when exporting NebulaGraph data to CSV.
      name: edge-name-1
      type: {
        source: nebula
        sink: csv
      }
      metaAddress:"127.0.0.1:9559"
      space:"test"
      label:"friend"
      # Configure the fields you want to export from NebulaGraph.
      fields: [nebula-field-0, nebula-field-1, nebula-field-2]
      noFields:false # default false; if true, only src, dst, and rank are exported
      partition: 60
      limit:10000
      # Configure the path to save the CSV files. If the files are not in HDFS, use "file:///path/test.csv".
      path: "hdfs://ip:port/path/friend"
      separator: ","
      header: true
    }
  ]
}
```
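The `rate` section above throttles requests with Google's RateLimiter: `limit` is the stable throughput in permits per second, and `timeout` is how long a request may wait for a permit before being given up. As a rough illustration of these semantics only (not Exchange's actual implementation), a minimal Python sketch with an injectable clock:

```python
# Minimal sketch of the rate.limit / rate.timeout semantics in the config above.
# This mirrors the idea behind Guava's RateLimiter, not Exchange's actual code.
class RateLimiter:
    def __init__(self, permits_per_second, clock):
        self.interval = 1.0 / permits_per_second  # seconds between permits
        self.clock = clock                        # injectable time source
        self.next_free = clock()                  # when the next permit becomes available

    def try_acquire(self, timeout_ms):
        now = self.clock()
        wait = self.next_free - now
        if wait > timeout_ms / 1000.0:
            return False  # cannot get a permit within the timeout: give up the request
        self.next_free = max(self.next_free, now) + self.interval
        return True

# Deterministic fake clock for demonstration.
t = [0.0]
limiter = RateLimiter(permits_per_second=2, clock=lambda: t[0])
print(limiter.try_acquire(1000))  # True: a permit is available immediately
print(limiter.try_acquire(1000))  # True: the 0.5 s wait fits in the 1000 ms timeout
print(limiter.try_acquire(100))   # False: the next permit is 1.0 s away, beyond 100 ms
```

With `limit: 1024` and `timeout: 1000` as configured above, a request waits at most one second for one of 1024 permits issued per second.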

- Export to another NebulaGraph:

```conf
# Use the command to submit the exchange job:

# spark-submit \
# --master "spark://master_ip:7077" \
# --driver-memory=2G --executor-memory=30G \
# --num-executors=3 --executor-cores=20 \
# --class com.vesoft.nebula.exchange.Exchange \
# nebula-exchange-3.0-SNAPSHOT.jar -c export_to_nebula.conf

{
  # Spark config
  spark: {
    app: {
      name: NebulaGraph Exchange
    }
  }

  ...
  # NebulaGraph config. Only the sink (target) NebulaGraph needs to be configured here.
  nebula: {
    address:{
      graph:["127.0.0.1:9669"]
      # If your NebulaGraph server is in a virtual network such as k8s, configure the leader address of meta.
      # Use `SHOW meta leader` to see the meta leader's address.
      meta:["127.0.0.1:9559"]
    }
    user: root
    pswd: nebula
    space: test

    # nebula client connection parameters
    connection {
      # socket connect & execute timeout, unit: millisecond
      timeout: 30000
    }

    error: {
      # max number of failures; if the number of failures exceeds it, the application exits
      max: 32
      # failed data will be recorded in the output path, in nGQL format
      output: /tmp/errors
    }

    # use Google's RateLimiter to limit the requests sent to NebulaGraph
    rate: {
      # the stable throughput of RateLimiter
      limit: 1024
      # Acquires a permit from RateLimiter, unit: MILLISECONDS
      # If it cannot be obtained within the specified timeout, the request is given up.
      timeout: 1000
    }
  }

  # Processing tags
  tags: [
    {
      name: tag-name-1
      type: {
        source: nebula
        sink: client
      }
      # data source (the source NebulaGraph) config
      metaAddress:"127.0.0.1:9559"
      space:"test"
      label:"person"
      # Map the fields of the source NebulaGraph to the fields of the target NebulaGraph.
      fields: [source_nebula-field-0, source_nebula-field-1, source_nebula-field-2]
      nebula.fields: [target_nebula-field-0, target_nebula-field-1, target_nebula-field-2]
      limit:10000
      vertex: _vertexId # must be `_vertexId`
      batch: 2000
      partition: 60
    }
  ]

  # process edges
  edges: [
    {
      name: edge-name-1
      type: {
        source: nebula
        sink: client
      }
      # data source (the source NebulaGraph) config
      metaAddress:"127.0.0.1:9559"
      space:"test"
      label:"friend"
      fields: [source_nebula-field-0, source_nebula-field-1, source_nebula-field-2]
      nebula.fields: [target_nebula-field-0, target_nebula-field-1, target_nebula-field-2]
      limit:1000
      source: _srcId # must be `_srcId`
      target: _dstId # must be `_dstId`
      ranking: source_nebula-field-2
      batch: 2000
      partition: 60
    }
  ]
}
```
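The `fields` / `nebula.fields` pair above defines a positional mapping: the i-th source property is written to the i-th target property. A minimal sketch of that idea (the property names are hypothetical; this is not Exchange's implementation):

```python
# Positional mapping from source properties to target properties,
# as expressed by the `fields` and `nebula.fields` lists above.
def map_row(row, fields, nebula_fields):
    # row: one vertex/edge property dict read from the source NebulaGraph
    return {dst: row[src] for src, dst in zip(fields, nebula_fields)}

source_row = {"name": "Tim", "age": 42, "city": "SF"}  # hypothetical properties
mapped = map_row(source_row, ["name", "age"], ["player_name", "player_age"])
print(mapped)  # {'player_name': 'Tim', 'player_age': 42}
```

Note that properties not listed in `fields` (such as `city` here) are simply not exported, so the two lists must have the same length and order.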

3. Export data from NebulaGraph with the following command.

!!! note

The parameters of the Driver and Executor processes can be adjusted flexibly according to your own configuration.

```bash
<spark_install_path>/bin/spark-submit --master "spark://<master_ip>:7077" \
--driver-memory=2G --executor-memory=30G \
--num-executors=3 --executor-cores=20 \
--class com.vesoft.nebula.exchange.Exchange <nebula-exchange-x.y.z.jar_path> \
-c <conf_file_path>
```

For example, the command to export data to CSV files is as follows:

```bash
$ ./spark-submit --master "spark://192.168.10.100:7077" \
--driver-memory=2G --executor-memory=30G \
--num-executors=3 --executor-cores=20 \
--class com.vesoft.nebula.exchange.Exchange ~/exchange-ent/nebula-exchange-ent-{{exchange.release}}.jar \
-c ~/exchange-ent/export_to_csv.conf
```

4. Check the exported data.

- Export to CSV files:

Check whether the CSV files are successfully generated under the target path, and check the contents of the files.

```bash
$ hadoop fs -ls /vertex/player
@@ -141,4 +306,6 @@ CentOS 7.9.2009
-rw-r--r-- 3 nebula supergroup 119 2021-11-05 07:36 /vertex/player/ part-00009-17293020-ba2e-4243-b834-34495c0536b3-c000.csv
```
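Beyond listing the part files, you can sanity-check the export by counting data rows across them. A small sketch, assuming the part files have been pulled to a local directory (the directory name is hypothetical) and were written with `header: true` and `separator: ","` as configured above:

```python
import csv
import glob
import os

def count_exported_rows(directory, separator=",", header=True):
    """Count data rows across all part-*.csv files in a local directory."""
    total = 0
    for part in sorted(glob.glob(os.path.join(directory, "part-*.csv"))):
        with open(part, newline="") as f:
            rows = list(csv.reader(f, delimiter=separator))
        # With header: true, each part file repeats the header line once.
        total += len(rows) - (1 if header and rows else 0)
    return total

# Hypothetical local copy of the exported directory, e.g. fetched with
# `hadoop fs -get /vertex/player ./player`.
# print(count_exported_rows("./player"))
```

The resulting count can be compared against the vertex statistics of the source space (for example, from `SUBMIT JOB STATS` and `SHOW STATS`).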

- Export to another NebulaGraph:

Log in to the target NebulaGraph and run the `SUBMIT JOB STATS` and `SHOW STATS` statements to view the statistics and confirm that the export succeeded.