This repository has been archived by the owner on Sep 18, 2021. It is now read-only.

How to create an index for a field of a sub-struct in Thrift #3

Open
packageyao opened this issue Oct 31, 2012 · 4 comments

Comments

@packageyao

I have these Thrift structs:

struct A
{
1: string a1,
......
}

struct B
{
1: i32 b1,
2: A a,
......
}

If I use Pig to load the data file, how can I create an index for a.a1, and how can I use the index to filter blocks with the condition "a.a1=='1234'"?

The data file uses the base64-encoded, line-based LZO format.

Thanks.

@dvryaboy
Contributor

That's exactly how you refer to it:

stuff = load ...;
filtered_stuff = filter stuff by a.a1 == '1234';

Does this not work? Could you post the script you are using and the error you get?
Could you also post the result of running "describe" on the relation you are trying to filter?

@packageyao
Author

Thrift file:

struct Company
{
1:required string Id,
2:required string name,
3:required string address,
4:required string tele,
}

struct Person
{
1:required string ID,
2:required string name,
3:required byte age,
4:required Company company,
5:required string phone,
}

Pig script:

T1 = LOAD 'data_dir' USING com.twitter.elephanttwin.retrieval.IndexedPigLoader('com.twitter.elephantbird.pig.load.ThriftPigLoader', 'Person', 'index_dir');
T2 = FILTER T1 BY company.address=='address_12';
DUMP T2;

If I create an index for Person::name, Pig uses the index and I get the correct result.

I also want to create an index for Person::Company::address, so I modified the source code. The index-creation job now extracts the value of Person::Company::address, and the partition key is "company.address". But when I run the script above, Pig scans all the blocks instead of only the indexed blocks to find the record I want.

I read the Pig source code and found that the setPartitionFilter method is never invoked, so the index is not used.

I am using Pig 0.8.1.

Can you give me some advice? Thanks.

@dvryaboy
Contributor

dvryaboy commented Nov 2, 2012

Ah I see what's happening. I think this is a Pig bug -- it needs to push down the filter, but nested relations confuse it. I don't see any reason Elephant-Twin wouldn't be able to support it if Pig can push it. Could you open a Jira with Apache Pig?

@packageyao
Author

I have now written my own Pig loader. It takes the filter expression as an extra field and passes the expression directly to the InputFormat.
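
For readers who want to try the same workaround, here is a minimal, self-contained sketch of the idea in plain Java. It only models the block-selection step: the filter expression is handed over as a string, and an in-memory index decides which blocks to read. Every name in it (FilterAwareLoaderSketch, BlockIndex, parseEquality, blocksToScan) is hypothetical and is not part of Elephant-Twin or Pig; a real loader would do the equivalent inside its InputFormat when it computes splits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: none of these names are Elephant-Twin or Pig APIs.
class FilterAwareLoaderSketch {

    // Stand-in for a block index: indexed key -> value -> ids of blocks holding that value.
    static final class BlockIndex {
        private final Map<String, Map<String, SortedSet<Integer>>> entries = new HashMap<>();

        void add(String key, String value, int blockId) {
            entries.computeIfAbsent(key, k -> new HashMap<>())
                   .computeIfAbsent(value, v -> new TreeSet<>())
                   .add(blockId);
        }
    }

    // Splits "key == 'value'" into {key, value}; the only form this sketch accepts.
    static String[] parseEquality(String expr) {
        String[] parts = expr.split("==", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected key == 'value', got: " + expr);
        }
        String key = parts[0].trim();
        String value = parts[1].trim();
        // Strip the surrounding single quotes of a string literal, if present.
        if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[] {key, value};
    }

    // Returns the block ids to read for the given filter expression.
    static List<Integer> blocksToScan(BlockIndex index, String filterExpr, int totalBlocks) {
        String[] kv = parseEquality(filterExpr);
        Map<String, SortedSet<Integer>> byValue = index.entries.get(kv[0]);
        List<Integer> blocks = new ArrayList<>();
        if (byValue == null) {
            // The key is not indexed: fall back to scanning every block.
            for (int b = 0; b < totalBlocks; b++) {
                blocks.add(b);
            }
            return blocks;
        }
        // The key is indexed: read only the blocks recorded for this value (possibly none).
        SortedSet<Integer> hits = byValue.get(kv[1]);
        if (hits != null) {
            blocks.addAll(hits);
        }
        return blocks;
    }

    public static void main(String[] args) {
        BlockIndex index = new BlockIndex();
        index.add("company.address", "address_12", 3);
        index.add("company.address", "address_12", 7);
        index.add("company.address", "address_99", 1);
        System.out.println(blocksToScan(index, "company.address=='address_12'", 10));
    }
}
```

Because the expression is given to the loader directly, this approach does not depend on Pig calling setPartitionFilter. The trade-off is that the same condition has to be written twice, once for the loader and once in the FILTER statement, and the two must be kept in sync by hand.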
