RDD is removing null columns on fuzzy linking #258

Open
saumyasuhagiya opened this issue Aug 17, 2020 · 3 comments

Comments

saumyasuhagiya commented Aug 17, 2020

Describe the bug
After fuzzy linking, the rows in the resulting RDD no longer contain the columns whose values were null.

To Reproduce

  1. Take a sample RDD with null values in some column.
  2. Do a fuzzy join with the link method.

-- Code --

    <dependency>
        <groupId>org.zouzias</groupId>
        <artifactId>spark-lucenerdd_2.11</artifactId>
        <version>0.3.7</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>8.5.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>8.5.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-codecs</artifactId>
        <version>8.5.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>8.5.2</version>
    </dependency>

/------------------------------------/

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.lower;

    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.spark.rdd.RDD;
    import org.apache.spark.sql.Row;
    import org.zouzias.spark.lucenerdd.LuceneRDD;
    import scala.Tuple2;
    import scala.reflect.ClassTag;

    // leftDs, rightDs, fuzziness, noOfResults and SearchQuery are defined elsewhere (not shown).
    String leftColumn = "a";
    String rightColumn = "b";

    ClassTag<Row> simpleRowTag = scala.reflect.ClassTag$.MODULE$.apply(Row.class);

    // Index the right-hand dataset, lower-casing the join column first.
    LuceneRDD<Row> rightDsLuceneRDD = LuceneRDD.apply(
        rightDs.withColumn(rightColumn, lower(col(rightColumn))),
        "org.apache.lucene.analysis.standard.ClassicAnalyzer",
        "org.apache.lucene.analysis.standard.ClassicAnalyzer",
        "org.apache.lucene.search.similarities.BM25Similarity");

    // For each left row, build a fuzzy Lucene query against the right column.
    RDD<Tuple2<Row, Row[]>> fuzzyJoinResults =
        rightDsLuceneRDD.link(leftDs.rdd(), new SearchQuery<Row, String>() {
            @Override
            public String apply(Row row) {
                // Note: throws NullPointerException if the left column value is null.
                String leftRDDValue = row.getAs(leftColumn).toString();
                return rightColumn + ":"
                    + QueryParser.escape(leftRDDValue.toLowerCase()) + "~" + fuzziness;
            }
        }, noOfResults, null, simpleRowTag);

Expected behavior
Linking should not remove null columns; the result should contain every field that was in the original RDD.

Versions (please complete the following information):

  • spark-lucenerdd version: [0.3.7]
  • Spark Version: [2.4.5]
  • Java version: [Java 8]

Additional context
I am writing this code in Java.

Owner

zouzias commented Aug 29, 2020

Many thanks for reporting the issue.

As far as I understand, the issue is that after linkage, the Row object does not contain the columns whose original values are null, correct?

Author

saumyasuhagiya commented Aug 31, 2020

Yes @zouzias, you understood it correctly.

The issue shows up when converting the linked data back to a dataset. I was using the schema of the first row to do that.
The first row had some null values and subsequent rows did not, so the conversion started failing there.

Example:
1, null, "abc"
1, 2, "bcd"
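
The mismatch described above can be modelled without Spark. The following is only a minimal sketch under stated assumptions, not spark-lucenerdd's actual behavior or API: each linked row is represented as a field-name to value map in which null-valued fields are missing, and the column names `id`, `count`, and `name` are hypothetical labels for the example rows. Realigning every row against the full original schema, instead of taking the schema from the first row, gives all rows the same width again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Spark-free model of the problem: a linked row is a field-name -> value map
// in which null-valued fields are missing. All names here are hypothetical.
final class RowRealigner {

    // Rebuilds a full-width row in schema order; fields absent from the
    // sparse row are filled with null.
    static List<Object> realign(List<String> schema, Map<String, Object> sparseRow) {
        List<Object> full = new ArrayList<>(schema.size());
        for (String field : schema) {
            full.add(sparseRow.get(field)); // Map.get returns null for absent keys
        }
        return full;
    }

    public static void main(String[] args) {
        // Full schema of the original dataset (assumed column names).
        List<String> schema = Arrays.asList("id", "count", "name");

        // First row from the example: the null "count" field has been dropped.
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("id", 1);
        first.put("name", "abc");

        // Second row from the example: all three fields are present.
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("id", 1);
        second.put("count", 2);
        second.put("name", "bcd");

        // A schema inferred from the first row has 2 fields, the second row has 3.
        System.out.println(first.size() + " vs " + second.size());

        // Realigned against the full schema, both rows have the same width.
        System.out.println(realign(schema, first));
        System.out.println(realign(schema, second));
    }
}
```

In a real job the same idea would mean keeping the original dataset's schema around and using it when rebuilding the dataset, rather than deriving a schema from whichever row comes first.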

Once we resolve this, I am also thinking of contributing back a Java example that keeps the datasets' data types intact (two datasets in, a modified dataset out).

@saumyasuhagiya
Author

Hi @zouzias, let me know if I can help with further details or POCs.


2 participants