Importing VertexID with UTF8 characters in CSV file #257

goranc · 2023-01-02T11:22:42Z

Importing data with escaped UTF-8 characters in a string-type VertexID does not convert the escape sequences into UTF-8 characters; the escaped characters are inserted as-is from the CSV file.

My Nebula cluster runs version 3.3.0.
The VertexID is a fixed string type with a length of 28 characters.
I'm using a custom algorithm for the VertexID to avoid collisions. It combines a lexicographic prefix (the first 8 characters of the string) with the hash of the string (standard Nebula hash function) converted to a string.

Steps to reproduce the behavior:

Create a space whose VertexID is defined as a fixed string of 28 characters
Create a TAG for the URL vertex
Import URL data that contains UTF-8 characters

Examples:
CREATE SPACE IF NOT EXISTS graph(partition_num=128, replica_factor=3, vid_type=fixed_string(28));

USE graph;

CREATE TAG url(link string, subdomain_name string, domain_name string, protocol string, classification string);

Then try to import data into the TAG:

"stubhub\xe6-2541048767624938324": ("http://stubhub手数料3.xyz","stubhub手数料3.xyz","stubhub手数料3.xyz","http",""),
"download1336853390718461484": ("http://downloads.sourceforge.net/project/orz123/a23.mp3?r=&ts=1448325706&use_mirror=heanet","downloads.sourceforge.net","sourceforge.net","http",""),
"oss.jfro1186231920510779202": ("https://oss.jfrog.org/artifactory/jcenter-remote/com/google/apis/google-api-services-cloudkms/v1-rev20-1.21.0/google-api-services-cloudkms-v1-rev20-1.21.0.jar","oss.jfrog.org","jfrog.org","https","");
ErrMsg: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit., ErrCode: -1005

We expect the VertexID field to contain the actual characters, as the domain field does. Instead, the escape sequence is not converted and we get an error saying the VertexID exceeds the length limit.
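A quick check in Python suggests where the `\xe6` may come from and why the insert is rejected. It assumes the 8-unit prefix was cut by bytes rather than by characters, which is a guess based on the VID shown above and is not confirmed in this thread.

```python
# Domain part of the failing URL and the VID exactly as it appears in the CSV.
domain = "stubhub手数料3.xyz"
csv_vid = r"stubhub\xe6-2541048767624938324"  # raw string: backslash kept as text

# Cutting the UTF-8 encoding after 8 bytes splits the 3-byte character "手".
prefix_bytes = domain.encode("utf-8")[:8]
print(prefix_bytes)   # b'stubhub\xe6'

# Stored as literal text, the escape is four ASCII characters, not one byte.
print(len(csv_vid))   # 31, which is over the fixed_string(28) limit
```

If that assumption holds, the VID is too long even before any escape conversion is considered, because `\xe6` occupies four bytes as plain text.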

wey-gu · 2023-01-03T02:02:11Z

cc @veezhang

veezhang · 2023-01-06T03:13:49Z

@goranc Hi, can you paste your csv data here?

goranc · 2023-01-11T16:29:54Z

OK, I've spent a little more time on this issue.

What has changed since the previous test is that I now pass strings as complete UTF-8 characters, so we avoid hexadecimal escape sequences (like \xe6 in the previous example). Handling those could be a completely separate feature.

So let's concentrate on importing regular UTF-8 strings as VertexID.
The issue is that the VertexID is limited to a fixed string size, and UTF-8 characters can take more than one byte each: Chinese and Japanese characters take 3 bytes, and Cyrillic characters take 2. Such characters make the VertexID overflow its size limit, which shouldn't happen if the VertexID is defined as 28 characters long.

You can try to import the domain data I got errors for and see that size is the problem here.

The tag definition for these records is:

CREATE SPACE IF NOT EXISTS graph(partition_num=128, replica_factor=3, vid_type=fixed_string(28));
USE graph;
CREATE TAG domain(name string, classification string, active bool);

An INSERT command that fails looks like this:

[ERROR] handler.go:63: Client 8 fail to execute: 

INSERT VERTEX `domain`(`name`,`classification`,`active`) VALUES  
"neuroeco-2725713350576147783": ("neuroeconomia.com.br","",true), "majortoo-1093788498676804281": ("majortool.website","",true), 
"f1024pro7198050941800472293": ("f1024proku.cn","",true), "pixers.p-6753124849544050968": ("pixers.pl","",true), 
"iojet.co9111344562419197580": ("iojet.com","",true), "christba-5296937626571511539": ("christbaumservice.de","",true), 
"badgerfa5077377065993132103": ("badgerfarms.com","",true), "adventpo6252634886331700143": ("adventpowerprotection.com","",true), 
"davidsto5677837830947720780": ("davidstout.net","",true), "nicolasp-2162137739921036202": ("nicolaspoggi.com","",true), 
"intercon-6026190726284770122": ("intercontb.com","",true), "exclusiv-7590762108140403075": ("exclusiveagencyofficial.com","",true), 
"kawn.inf3823846560429494917": ("kawn.info","",true), "cengocen-5500962025655561744": ("cengocengo.github.io","",true), 
"中醫中藥cn.t4508929864515433325": ("中醫中藥cn.top","",true), "zomerter-2623832337387839851": ("zomerterras50bar.com","",true);

, ErrMsg: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit., ErrCode: -1005
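One way to find the offending row before sending a batch is to measure each VID in UTF-8 bytes. The sketch below is a hypothetical pre-check: the helper name `oversized_vids` is made up for illustration, and it assumes the limit is counted in bytes, as the later reply in this thread indicates. The VIDs are copied from the statement above.

```python
VID_LIMIT = 28  # from vid_type=fixed_string(28)

def oversized_vids(vids, limit=VID_LIMIT):
    """Return (vid, byte_length) pairs whose UTF-8 encoding exceeds the limit."""
    sized = ((vid, len(vid.encode("utf-8"))) for vid in vids)
    return [(vid, size) for vid, size in sized if size > limit]

batch = [
    "neuroeco-2725713350576147783",
    "f1024pro7198050941800472293",
    "pixers.p-6753124849544050968",
    "中醫中藥cn.t4508929864515433325",
]

for vid, size in oversized_vids(batch):
    print(f"{vid}: {size} bytes > {VID_LIMIT}")
```

Of the four sample VIDs, only the one with Chinese characters is reported.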

goranc · 2023-01-11T16:37:01Z

Just to explain this hybrid VertexID structure:

It combines a lexicographic prefix with a hash function, which we use to avoid collisions in the graph space.

The VertexID is generated from the TAG property domain.name as:
substring(domain.name,1,8) + toString(hash(domain.name))

Note:
I think this is an issue with how Nebula internally defines the VertexID string size, not something specific to importing data from CSV files.
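The scheme above can be sketched in Python to compare character counts with byte counts. This is only an approximation under stated assumptions: `stand_in_hash` is a placeholder built on SHA-256 and is not Nebula's `hash()` function, and `name[:8]` assumes the prefix is the first 8 characters.

```python
import hashlib

def stand_in_hash(text: str) -> int:
    """Placeholder signed 64-bit hash. This is NOT Nebula's hash() function."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def make_vid(name: str, prefix_chars: int = 8) -> str:
    """First `prefix_chars` characters of the name, then the hash as text."""
    return name[:prefix_chars] + str(stand_in_hash(name))

for name in ("majortool.website", "中醫中藥cn.top"):
    vid = make_vid(name)
    print(len(vid), "characters,", len(vid.encode("utf-8")), "bytes")
```

A signed 64-bit integer prints as at most 20 characters, so such a VID never exceeds 28 characters. With an 8-character prefix made of 3-byte characters, however, it can reach 8 * 3 + 20 = 44 bytes.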

wey-gu · 2023-01-12T01:59:06Z

Dear @goranc

Sorry, @whitewum was not aware that you cannot read Chinese. We have this screen capture in the Chinese documentation mentioning that one Chinese UTF-8 character takes 3 bytes (which may not be what you expected when calculating the length?).

For example:

(root@nebula) [nba]> show create space nba
+-------+-------------------------------------------------------------------------------------------------------------------------------+
| Space | Create Space                                                                                                                  |
+-------+-------------------------------------------------------------------------------------------------------------------------------+
| "nba" | "CREATE SPACE `nba` (partition_num = 7, replica_factor = 1, charset = utf8, collate = utf8_bin, vid_type = FIXED_STRING(32))" |
+-------+-------------------------------------------------------------------------------------------------------------------------------+
Got 1 rows (time spent 989/28945 us)

# 11 chinese utf8 chars
(root@nebula) [nba]> insert vertex player(name,age) values "中中中中中中中中中中中":('length_11_chinese_utf8', 42);
[ERROR (-1005)]: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit.

Thu, 12 Jan 2023 09:55:35 CST

# 10 chinese utf8 chars + 2 ascii chars
(root@nebula) [nba]> insert vertex player(name,age) values "中中中中中中中中中中01":('length_10_and_2_chinese_utf8', 42);
Execution succeeded (time spent 1257/25011 us)

Thu, 12 Jan 2023 09:55:51 CST

In this case the length of "中醫中藥cn.t4508929864515433325" is actually 35 bytes:

In [7]: 3 * len("中醫中藥") + len("cn.t4508929864515433325")
Out[7]: 35

We will add this information to the English documentation later; sorry about this.

We have this note in cn docs already. related issue: vesoft-inc/nebula-importer#257

goranc · 2023-01-16T14:10:22Z

OK, it is clear now what is going on behind the scenes.
We just need to be aware of it. It would be good to include this in the documentation, explained the way it is in this discussion, with examples that use multibyte characters.
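As a small illustration of the kind of example that could accompany such a note, the Python snippet below prints the UTF-8 size of a few sample characters. The characters are chosen here for illustration and are not taken from the Nebula documentation.

```python
# Sample characters from different scripts, with a short label for each.
samples = {
    "a": "ASCII letter",
    "ß": "Latin letter outside ASCII",
    "д": "Cyrillic letter",
    "中": "Chinese character",
    "手": "Japanese kanji",
    "😀": "emoji",
}

for char, label in samples.items():
    size = len(char.encode("utf-8"))
    print(f"{char!r} ({label}): {size} byte(s)")
```

Assuming the limit is counted in bytes, a FIXED_STRING(28) VID has room for 28 ASCII characters but only 9 three-byte characters.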

wey-gu · 2023-01-17T03:39:25Z

Thanks @goranc, do you think this doc patch is enough?

https://github.com/vesoft-inc/nebula-docs/pull/1871/files

wey-gu · 2023-02-03T02:39:41Z

closing it, thanks @goranc !

wey-gu mentioned this issue Jan 7, 2023

Weekly Report 2023-01-06 vesoft-inc/nebula-community#185

Closed

goranc changed the title ~~Importing VertexID with escaped UTF8 characters in CSV file~~ Importing VertexID with UTF8 characters in CSV file Jan 11, 2023

wey-gu added a commit to vesoft-inc/nebula-docs that referenced this issue Jan 12, 2023

add note for utf-8 char length in vid

6795f77

We have this note in cn docs already. related issue: vesoft-inc/nebula-importer#257

wey-gu mentioned this issue Jan 12, 2023

add note for utf-8 char length in vid vesoft-inc/nebula-docs#1871

Merged

whitewum pushed a commit to vesoft-inc/nebula-docs that referenced this issue Jan 12, 2023

add note for utf-8 char length in vid (#1871)

b48b1a0

We have this note in cn docs already. related issue: vesoft-inc/nebula-importer#257

wey-gu closed this as completed Feb 3, 2023

wey-gu mentioned this issue Feb 4, 2023

Weekly Report 2023-02-03 vesoft-inc/nebula-community#315

Closed
