Importing VertexID with UTF8 characters in CSV file #257

Closed
goranc opened this issue Jan 2, 2023 · 8 comments

Comments

goranc commented Jan 2, 2023

When a string-type VertexID contains escaped UTF-8 characters, the importer does not convert the escape sequences into UTF-8 characters; it inserts them exactly as they appear in the CSV file.

My Nebula cluster runs version 3.3.0.
The VertexID is a fixed-length string of 28 characters.
I use a custom algorithm to generate the VertexID and avoid collisions: an 8-character lexicographic prefix taken from the source string, concatenated with the string's hash (the standard Nebula hash function) converted to a string.

Steps to reproduce the behavior:

Create a space whose VertexID is a fixed string of 28 characters
Create a TAG for URL vertices
Import URL data whose UTF-8 values contain non-ASCII characters

Examples:
CREATE SPACE IF NOT EXISTS graph(partition_num=128, replica_factor=3, vid_type=fixed_string(28));

USE graph;

CREATE TAG url(link string, subdomain_name string, domain_name string, protocol string, classification string);

Try to import data into the TAG:

"stubhub\xe6-2541048767624938324": ("http://stubhub手数料3.xyz","stubhub手数料3.xyz","stubhub手数料3.xyz","http",""),
"download1336853390718461484": ("http://downloads.sourceforge.net/project/orz123/a23.mp3?r=&ts=1448325706&use_mirror=heanet","downloads.sourceforge.net","sourceforge.net","http",""),
"oss.jfro1186231920510779202": ("https://oss.jfrog.org/artifactory/jcenter-remote/com/google/apis/google-api-services-cloudkms/v1-rev20-1.21.0/google-api-services-cloudkms-v1-rev20-1.21.0.jar","oss.jfrog.org","jfrog.org","https","");
ErrMsg: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit., ErrCode: -1005

We expect the VertexID field to contain the same decoded characters as the domain field. Instead, the escape sequence is not converted, and we get an error saying the VertexID exceeds the length limit.
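
To illustrate why the first VID is rejected, here is a small Python check. It assumes, as the report above describes, that the `\xe6` sequence is stored as four literal characters rather than as one byte. The variable names are illustrative only.

```python
# VID exactly as it appears in the CSV: backslash, 'x', 'e', '6' are four separate characters.
literal_vid = r"stubhub\xe6-2541048767624938324"
literal_len = len(literal_vid.encode("utf-8"))

# Length if the escape had been decoded to the single byte 0xE6.
decoded_len = len(b"stubhub\xe6-2541048767624938324")

# 0xE6 is only the first of the three UTF-8 bytes that encode "手".
first_byte = "手".encode("utf-8")[0]

print(literal_len, decoded_len, hex(first_byte))
```

With the escape kept literally, the VID is 31 bytes long and exceeds fixed_string(28); decoded, it would be exactly 28 bytes. Because 0xE6 alone is not a complete UTF-8 character, this suggests the 8-unit prefix was cut by bytes rather than by characters.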

Contributor

wey-gu commented Jan 3, 2023

cc @veezhang

Contributor

veezhang commented Jan 6, 2023

@goranc Hi, can you paste your CSV data here?

@goranc goranc changed the title Importing VertexID with escaped UTF8 characters in CSV file Importing VertexID with UTF8 characters in CSV file Jan 11, 2023
Author

goranc commented Jan 11, 2023

OK, I've spent a bit more time on this issue.

What has changed since the previous test is that I now make sure the strings contain only complete UTF-8 characters, so we avoid hexadecimal escape sequences (like \xe6 in the previous example). Handling those could be a completely separate feature.

So let's concentrate on importing regular UTF-8 strings as VertexID.
The issue is that the VertexID has a fixed string size, and UTF-8 characters can take more than one byte each: Cyrillic characters take 2 bytes, and Chinese and Japanese characters typically take 3. Such characters overflow the VertexID size limit, which shouldn't happen if the VertexID is defined as 28 characters long.
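
As a quick illustration of how many bytes a single character occupies in UTF-8, the following Python sketch measures one sample character per script. The sample characters are arbitrary choices, not taken from the data set.

```python
# One arbitrary sample character per script.
samples = {
    "ASCII": "a",
    "Cyrillic": "д",
    "Chinese": "中",
    "Japanese kana": "の",
}

# Encode each character to UTF-8 and count the resulting bytes.
widths = {script: len(ch.encode("utf-8")) for script, ch in samples.items()}

for script, n in widths.items():
    print(f"{script}: {n} byte(s)")
```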

You can try to import the domain data I got errors for and see that size is the problem here.

The space and tag definitions for these records are:

CREATE SPACE IF NOT EXISTS graph(partition_num=128, replica_factor=3, vid_type=fixed_string(28));
USE graph;
CREATE TAG domain(name string, classification string, active bool);

An INSERT statement that fails looks like this:

[ERROR] handler.go:63: Client 8 fail to execute: 

INSERT VERTEX `domain`(`name`,`classification`,`active`) VALUES  
"neuroeco-2725713350576147783": ("neuroeconomia.com.br","",true), "majortoo-1093788498676804281": ("majortool.website","",true), 
"f1024pro7198050941800472293": ("f1024proku.cn","",true), "pixers.p-6753124849544050968": ("pixers.pl","",true), 
"iojet.co9111344562419197580": ("iojet.com","",true), "christba-5296937626571511539": ("christbaumservice.de","",true), 
"badgerfa5077377065993132103": ("badgerfarms.com","",true), "adventpo6252634886331700143": ("adventpowerprotection.com","",true), 
"davidsto5677837830947720780": ("davidstout.net","",true), "nicolasp-2162137739921036202": ("nicolaspoggi.com","",true), 
"intercon-6026190726284770122": ("intercontb.com","",true), "exclusiv-7590762108140403075": ("exclusiveagencyofficial.com","",true), 
"kawn.inf3823846560429494917": ("kawn.info","",true), "cengocen-5500962025655561744": ("cengocengo.github.io","",true), 
"中醫中藥cn.t4508929864515433325": ("中醫中藥cn.top","",true), "zomerter-2623832337387839851": ("zomerterras50bar.com","",true);

, ErrMsg: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit., ErrCode: -1005
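
A batch like this can be screened before import. The sketch below assumes the fixed_string(28) limit is counted in bytes of the UTF-8 encoding, which is what the error suggests; `oversized_vids` and `VID_LIMIT_BYTES` are hypothetical names, not part of Nebula or nebula-importer.

```python
VID_LIMIT_BYTES = 28  # from vid_type=fixed_string(28); assumed to be a byte limit


def oversized_vids(vids, limit=VID_LIMIT_BYTES):
    """Return (vid, byte_length) pairs whose UTF-8 encoding exceeds the limit."""
    result = []
    for vid in vids:
        size = len(vid.encode("utf-8"))
        if size > limit:
            result.append((vid, size))
    return result


# A few VIDs copied from the failing INSERT statement above.
batch = [
    "neuroeco-2725713350576147783",
    "f1024pro7198050941800472293",
    "中醫中藥cn.t4508929864515433325",
    "zomerter-2623832337387839851",
]

flagged = oversized_vids(batch)
print(flagged)
```

Only the VID with the Chinese prefix is flagged. The ASCII-only VIDs stay within 28 bytes, so a single oversized VID is enough to make the whole statement fail.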

Author

goranc commented Jan 11, 2023

Just to explain this hybrid VertexID structure.

It is a combination of a lexicographic prefix and a hash function, which we use to avoid collisions in the graph space.

The VertexID is generated from the TAG property domain.name as:
substring(domain.name,1,8) + toString(hash(domain.name))

Note:
I think this is an issue with the Nebula VertexID and its internal definition of string size, not only with importing data from CSV files.
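
One possible workaround on the generating side is to bound the prefix by bytes instead of characters. The Python sketch below is only an illustration under that assumption: `utf8_prefix` and `make_vid` are hypothetical helpers, and the hash is passed in as a precomputed integer because the snippet does not reimplement Nebula's hash function.

```python
def utf8_prefix(text, max_bytes):
    """Longest prefix of text whose UTF-8 encoding fits in max_bytes,
    without ever splitting a multibyte character."""
    out = []
    used = 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if used + size > max_bytes:
            break
        out.append(ch)
        used += size
    return "".join(out)


def make_vid(name, hash_value, prefix_bytes=8):
    """Hypothetical helper: byte-bounded prefix plus the decimal hash."""
    return utf8_prefix(name, prefix_bytes) + str(hash_value)


# Hash value copied from the VID shown in this thread.
vid = make_vid("中醫中藥cn.top", 4508929864515433325)
print(vid, len(vid.encode("utf-8")))
```

With an 8-byte prefix and a signed 64-bit hash of at most 20 characters, the result never exceeds 28 bytes. The trade-off is that non-ASCII names keep fewer prefix characters (two Chinese characters here instead of eight).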

Contributor

wey-gu commented Jan 12, 2023

Dear @goranc

Sorry, @whitewum was not aware that you cannot read Chinese. The Chinese documentation has a note (shown in this screen capture) that one Chinese UTF-8 character takes 3 bytes, which may not be what you expected when calculating the length.

For example:

(root@nebula) [nba]> show create space nba
+-------+-------------------------------------------------------------------------------------------------------------------------------+
| Space | Create Space                                                                                                                  |
+-------+-------------------------------------------------------------------------------------------------------------------------------+
| "nba" | "CREATE SPACE `nba` (partition_num = 7, replica_factor = 1, charset = utf8, collate = utf8_bin, vid_type = FIXED_STRING(32))" |
+-------+-------------------------------------------------------------------------------------------------------------------------------+
Got 1 rows (time spent 989/28945 us)

# 11 Chinese UTF-8 chars (33 bytes, over the 32-byte limit)
(root@nebula) [nba]> insert vertex player(name,age) values "中中中中中中中中中中中":('length_11_chinese_utf8', 42);
[ERROR (-1005)]: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit.

Thu, 12 Jan 2023 09:55:35 CST

# 10 Chinese UTF-8 chars + 2 ASCII chars (32 bytes, fits the limit)
(root@nebula) [nba]> insert vertex player(name,age) values "中中中中中中中中中中01":('length_10_and_2_chinese_utf8', 42);
Execution succeeded (time spent 1257/25011 us)

Thu, 12 Jan 2023 09:55:51 CST

In this case, the byte length of "中醫中藥cn.t4508929864515433325" is actually 35, which exceeds the limit of 28:

In [7]: 3 * len("中醫中藥") + len("cn.t4508929864515433325")
Out[7]: 35

We will add this information to the English documentation later. Sorry for the inconvenience.

wey-gu added a commit to vesoft-inc/nebula-docs that referenced this issue Jan 12, 2023
We have this note in cn docs already.

related issue: vesoft-inc/nebula-importer#257
whitewum pushed a commit to vesoft-inc/nebula-docs that referenced this issue Jan 12, 2023
We have this note in cn docs already.

related issue: vesoft-inc/nebula-importer#257
Author

goranc commented Jan 16, 2023

OK, it is clear now what is going on behind the scenes.
So we just need to be aware of this. It would be good to include it in the documentation, explained as in our discussion here, with examples that include multibyte characters.

Contributor

wey-gu commented Jan 17, 2023

Thanks @goranc, do you think this doc patch is enough?

https://github.com/vesoft-inc/nebula-docs/pull/1871/files

Contributor

wey-gu commented Feb 3, 2023

Closing it, thanks @goranc!
