
AnalysisException: resolved attribute(s) missing for blockEntityLinkage #150

Closed
yeikel opened this issue Feb 8, 2019 · 12 comments
yeikel commented Feb 8, 2019

Describe the bug

I'd like to link two dataframes, blocking on a field named cd, using blockEntityLinkage with the following schemas:

df1
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- cd: string (nullable = true)

df2
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- cd: string (nullable = true)

 val linkedResults = LuceneRDD.blockEntityLinkage(df1, df2,
    linker,
    Array("cd"),
    Array("cd"),
    500
  )

But it produces the following exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) cd#351 missing from name#444,address#454,cd#101 in operator !Project [name#444, address#454, cd#101, concat(cd#351) AS __PARTITION_COLUMN__#481];;
!Project [name#444, address#454, cd#101, concat(cd#351) AS __PARTITION_COLUMN__#481]
+- Project [name#444, address#454, cd#101]

A self-join on df1 works just fine.

I tried renaming the column, but it did not work.

I am running Spark 2.1.0 with lucenerdd 0.3.3.

Edit: An explanation of the issue can be found here, but I don't believe it can be fixed on my end.

Thank you
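For reference, a minimal sketch (not from the thread; data and names are illustrative) of how this class of AnalysisException arises in plain Spark, without the library: a Column obtained from one DataFrame carries an attribute bound to that DataFrame's plan, so using it in a projection over a different DataFrame leaves a "resolved attribute(s) missing" error — the same shape as the `concat(cd#351)` projection in the stack trace above.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

object ResolvedAttributeRepro {
  // Returns true when Spark rejects the cross-plan column reference.
  def reproduces(): Boolean = {
    val spark = SparkSession.builder().master("local[1]").appName("repro").getOrCreate()
    import spark.implicits._

    val df1 = Seq(("a", "addr1", "01")).toDF("name", "address", "cd")
    val df2 = Seq(("b", "addr2", "01")).toDF("name", "address", "cd")

    // df1("cd") is an AttributeReference resolved against df1's plan.
    // Projecting it over df2 fails analysis with
    // "resolved attribute(s) cd#N missing from name#...,address#...,cd#...".
    val failed =
      try { df2.select(df1("cd")).collect(); false }
      catch { case _: AnalysisException => true }

    spark.stop()
    failed
  }

  def main(args: Array[String]): Unit =
    println(s"AnalysisException raised: ${reproduces()}")
}
```

If the library builds its internal __PARTITION_COLUMN__ from a column reference taken from the wrong side of the linkage, it would hit exactly this error, which would explain why a self-join (where both sides share one plan) works fine.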

@yeikel yeikel changed the title org.apache.spark.sql.AnalysisException: resolved attribute(s) missing org.apache.spark.sql.AnalysisException: resolved attribute(s) missing for blockEntityLinkage Feb 8, 2019
@yeikel yeikel changed the title org.apache.spark.sql.AnalysisException: resolved attribute(s) missing for blockEntityLinkage AnalysisException: resolved attribute(s) missing for blockEntityLinkage Feb 8, 2019

yeikel commented Feb 8, 2019

For now I can use a regular link with no blocking, but ideally I'd like to know if it is possible to use the blocking method.


zouzias commented Feb 8, 2019

Can you try with version 0.3.5 and report back?


yeikel commented Feb 25, 2019

Seems to be fixed in 0.3.5.

@yeikel yeikel closed this as completed Feb 25, 2019
@yeikel yeikel reopened this Feb 25, 2019

yeikel commented Feb 25, 2019

Actually, it is not.

I am still seeing this in 0.3.5:

org.apache.spark.sql.AnalysisException: resolved attribute(s) cd#594 missing from name#738,address#748,mkt_cd#101,id#2 in operator !Project [name#738, address#748, mkt_cd#101, id#2, concat(mkt_cd#594) AS __PARTITION_COLUMN__#778];;


zouzias commented Feb 25, 2019

Can you share the full exception here?


zouzias commented Feb 26, 2019

I pushed a hotfix here: https://github.com/zouzias/spark-lucenerdd/pull/151/files (feedback is welcome), and I plan to release it tonight under 0.3.6-SNAPSHOT.


yeikel commented Feb 27, 2019

Will this be available with Spark 2.1?

I will test it as soon as it is in Maven Central.


zouzias commented Feb 27, 2019

I reproduced it here. I will try to fix it now.


zouzias commented Feb 27, 2019

Released a fix under 0.3.6-SNAPSHOT. Can you check whether the exception still appears?

Tests are clean on the CI: https://travis-ci.org/zouzias/spark-lucenerdd/jobs/499546155

@zouzias zouzias added the bug label Feb 27, 2019
@zouzias zouzias self-assigned this Feb 27, 2019
@zouzias zouzias added this to To do in Kanban via automation Feb 27, 2019
@zouzias zouzias moved this from To do to In progress in Kanban Feb 27, 2019

yeikel commented Mar 8, 2019

I believe it is fixed now. Thank you.

@yeikel yeikel closed this as completed Mar 8, 2019
Kanban automation moved this from In progress to Done Mar 8, 2019

zouzias commented Mar 9, 2019

Glad to hear.


yeikel commented Mar 11, 2019

This is very strange.

When I run it in a cluster (reading from a Hive table), I don't see this error anymore. On the other hand, when I run it locally (reading Parquet files), I still see it. I am not sure how to replicate it, and the contents of the files are sensitive, so I can't share them here.

I need to test more, but let's leave it closed for now.
