Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parso library update #10

Closed
treysp opened this issue May 26, 2016 · 15 comments
Closed

Parso library update #10

treysp opened this issue May 26, 2016 · 15 comments

Comments

@treysp
Copy link

treysp commented May 26, 2016

FYI - the updated Parso library is now available:

https://github.com/epam/parso/releases

@saurfang
Copy link
Owner

saurfang commented Nov 19, 2016

Just to give some pointers in case someone wants to help. Here you may find the total diff I made on the original 1.2.1 parso library to support splitting: https://www.diffchecker.com/V6tGyqNC
The main takeaway is we have to make sure we can seek to the position we'd like to read, read fully when reading, and avoid stackoverflow for large dataset.

It looks like the library has been refactored quite a bit in 2.0 so correspondingly touch points need to be found and updated.

@Tagar
Copy link
Contributor

Tagar commented Oct 19, 2017

@saurfang could you please give us some guidance how to update saurfang/spark-sas7bdat to latest parso library?

I see https://github.com/saurfang/spark-sas7bdat/blob/master/src/main/java/com/ggasoftware/parso/SasFileParser.java
Can we just overwrite it with latest file from https://github.com/epam/parso/tree/master/src/main/java/com/epam/parso ? I am probably oversimplifying this. Or you expect something will break? If something breaks, we will be glad to help with that too.

We will try to get this resolved. There are a few bugs we ran into today and both of them seems to be could be resolved by getting to latest praso library.

@saurfang
Copy link
Owner

saurfang commented Oct 22, 2017

@Tagar thank you for taking the interests. Like I outlined in my comment above, the whole reason that we have our own copy of SasFileParser.java in the repo is that we need to make modifications to support parallel read from a split location. The diff exactly shows the points where these logics are implemented.

To see #19 to be successful, I would like to see a minimal test data that replicates the failure on 1.2.1. Only then will we able to confidently say that any changes we lay on top of it to upgrade to 2.0 actually solve the problems. That is not to say #19 is incorrect but I just do not have the capacity to perform detailed code review and has to rely on unit tests to convince myself that the change is actually doing and fixing what it claims.

@Tagar
Copy link
Contributor

Tagar commented Oct 25, 2017

Thank you @saurfang
A colleague of mine and me are working on this currently.
It may take some more time.

Can we ask why that diff in SasFileParser.java can't be pushed over to Parso library itself?
If it'll be accepted into Parso library itself, this diff doesn't have to be maintained and
re-applied after each Parso library release.

@Tagar
Copy link
Contributor

Tagar commented Oct 25, 2017

epam/parso#21

@Tagar
Copy link
Contributor

Tagar commented Oct 26, 2017

@saurfang that diff is already in upstream Parso library.
See comments from @printsev in epam/parso#21
We created a fork of spark-sas7bdat and

  • added unit test that you were referencing to
  • get it running with Parso 2.0.7
    We can confirm that all issues we had earlier with reading compressed files and some
    other issues are now resolved.

Please let us know if that's okay with you to create a new PR for this work ?

@saurfang
Copy link
Owner

Yes please! I forked the implementation originally because their library development went dormant. Now that they are back on Github I am more than happy to push/pull changes to the upstream library.

@printsev
Copy link

@Tagar as you are working on the PR to update parso to the latest version, could you please update link to parso in README.md in the Related Work section? It should be https://github.com/epam/parso instead of scitouch site. @saurfang hope you are ok with the link change.

@Tagar
Copy link
Contributor

Tagar commented Oct 31, 2017

@saurfang our manager asked to bounce our contribution at legal department first as it's first time we contribute back to an open-source project. I don't expect this will be an issue, but will just take some time. We delivered this change internally a week ago. Thanks.

@printsev sure - will do. Thanks.

@spatil6
Copy link

spatil6 commented Dec 16, 2017

With earlier version of parso, was getting error while reading compressed files. So I have updated parso 2.0.7 and done relevant changes, now able to read compressed file. :)
However while reading big file (2 GB) from HDFS,Getting Error #9

@Tagar Seems like you have successfully implemented. Have you faced this issue? If yes, can you please share your idea/implementation. Please.

@saurfang Any suggestion or inputs, would like to provide on this.

@spatil6
Copy link

spatil6 commented Dec 21, 2017

After doing some analysis on #9 with updated parso library 2.0.7. I have found that issue is with below method.

private def getPartialPageLength(pos: Long) = (pos - sasFileProperties.getHeaderLength) % sasFileProperties.getPageLength

When it try to access bytes from previous page, it won't able read subheader.

@saurfang Can you please elabarote this method, on what basis we are calculating partial page length. Any supporting blog/document.

@saurfang
Copy link
Owner

If I remember correctly, this is based on the reverse engineered http://www2.uaem.mx/r-mirror/web/packages/sas7bdat/vignettes/sas7bdat.pdf @spatil6

@printsev
Copy link

@spatil6
Do you face the same issue when working with parso directly? If yes, could you please raise a new issue in parso library itself? If you could share the dataset, it would be great.
The line you reference to looks like a part of spark-sas7bdat to me, but there could be something in parso as well.

@spatil6
Copy link

spatil6 commented Jan 23, 2018

Issue resolved.
@printsev Thanks for asking. There is issue in spark-sas7bdat, not in parso library.
@saurfang Existing SasRecoder class implementation won't work with latest parso library. It require some modification, to make it work. So make note of it.
Thanks.

@saurfang
Copy link
Owner

saurfang commented Feb 3, 2018

Thanks all for the discussion and contribution. Kudos to @mulya to pull everything through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants