Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spring data hdfs #242

Merged
merged 9 commits into from Jun 17, 2020
Merged

Spring data hdfs #242

merged 9 commits into from Jun 17, 2020

Conversation

aditya300899
Copy link
Member

1.) I did not find the filesystem class to be thread-safe. Source as mentioned in the doc "The implementations of FileSystem shipped with Apache Hadoop do not make any attempt to synchronize access to the working directory field." and was also mentioned on a stack overflow answer.
2) I did not understand the wildcard part of the FIleConnectionImpl (the backward compatibility) and also a little more detail can help, thus I just did a simple glob matching there
3)When listing files I by the understanding of how things are done for fileConnectionImpl made the recursive boolean to false by default.
4)The build will possibly fail as I was getting duplicate classes on the classpath on the build. I'll see as to how to resolve this or would ask for your help in case I'm not able to
Submitting the first draft so that we can have an idea of things ahead!

@aditya300899
Copy link
Member Author

The build issue
-Coming due to the plugin duplicate-finder-maven-plugin:1.3.0

  • duplicate classes found for following artifacts: [com.google.guava:failureaccess:1.0, com.google.guava:guava:27.0-jre], [jakarta.activation:jakarta.activation-api:1.2.2,
    javax.activation:javax.activation-api:1.2.0], [commons-logging:commons-logging:1.1.3, org.springframework:spring-jcl:5.2.5.RELEASE]:
  • not able to understand why javax.activation:javax.activation-api:1.2.0 is there on the classpath, it should not be, not even on the dependency tree
    I think I would have to configure the plugin. Your views @shawkins?

@shawkins
Copy link
Contributor

shawkins commented Jun 8, 2020

I think I would have to configure the plugin. Your views @shawkins?

Some of the dependency stuff can be tricky. Is everything up-to-date on your branch? I can try to make some suggestions by looking at it locally.

@aditya300899
Copy link
Member Author

Yes it's up to date, no changes have been made post the PR! Please have a look!

@shawkins
Copy link
Contributor

shawkins commented Jun 9, 2020

Here's an initial refinement: shawkins@3385a02

If possible dependencies should be managed in the root pom - this makes it easy to track all versions in use and to share dependencies. @rareddy just aligned things to guava 20, so I've reverted back to that - if hadoop common actually requires version 27, then we need to make sure that's good across everything. Finally I think we're good without the managed version of commons-net, but if that something like that is needed we should be able to just upgrade the Teiid version.

@aditya300899
Copy link
Member Author

Here's an initial refinement: shawkins@3385a02

Thanks for the help!

If possible dependencies should be managed in the root pom - this makes it easy to track all versions in use and to share dependencies. @rareddy just aligned things to guava 20, so I've reverted back to that - if Hadoop common actually requires version 27, then we need to make sure that's good across everything. Finally I think we're good without the managed version of commons-net, but if that something like that is needed we should be able to just upgrade the Teiid version.

I used the latest version for Hadoop commons i.e 3.2.1 but version 3.1.2 and below use guava 11.0.2. Can downgrading help?

@shawkins
Copy link
Contributor

shawkins commented Jun 9, 2020

To prevent any possible mix up later, add explicit dependencies in the teiid spring data hdfs to the actual jars we want to use:

  • the replacement for commons-logging is https://mvnrepository.com/artifact/org.springframework/spring-jcl

  • the replacement for the excluded guava is just the same guava dependency, but that one will reference our managed version

  • the replacement for javax.activation is jakarta.activation

  • there is no replacement for the jdk.tools - but there is a duplicate exclusion that should be removed

any of the these we don't actually directly use in our code should be marked with a runtime scope.

The configuration we can simplify to just the uri and an optional configuration file resource. That config file will need to get added to the Configuration you construct as a resource - most likely as a classpath resource for spring boot, but in wildfly it may make sense to also have it be on the filesystem.

For testing see if you can use one of the mini fs options to create a cluster and test out your basic operations. You'll probably need to make the createfs method protected so that you can inject the local/mock filesystem instead of trying to create a real instance.

@aditya300899
Copy link
Member Author

@shawkins please have a look and suggest changes! Look at the glob search part carefully, rest seems fine to me! I wasn't able to do a recursive glob search, "**" is not working!

@shawkins
Copy link
Contributor

This seems to resolve all the dependency issues for me: shawkins@006b1f7

It also moves the test class into a package and removes the exception handling - generally in tests you won't need to wrap exceptions.

I wasn't able to do a recursive glob search, "**" is not working!

Can you workaround with multiple explicit directory searches ///*.txt

If that's the case we can just call this a known issue.

@aditya300899
Copy link
Member Author

aditya300899 commented Jun 16, 2020

@shawkins as suggested I've done the changes. Though the log4j warning still pops up. I would add the moustache tomorrow and carry on with the s3 source

Copy link
Contributor

@shawkins shawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good work Aditya.

@shawkins shawkins merged commit 40824f8 into teiid:master Jun 17, 2020
@shawkins
Copy link
Contributor

There are a couple of follow on tasks.

One is to get this functionality to Teiid Wildfly - https://issues.redhat.com/browse/TEIID-3647 - I will handle that likely by pulling the necessary code from here over to Teiid and have Aditya review the changes.

We'll also need a .mustache file and sample project.

@aditya300899
Copy link
Member Author

aditya300899 commented Jun 17, 2020

We'll also need a .mustache file and sample project.

I'll do this part. What would this sample look like?

@aditya300899 aditya300899 deleted the spring-data-hdfs branch June 17, 2020 08:39
@rareddy
Copy link
Member

rareddy commented Jun 17, 2020

some time ago I added the following text for task to be done for every source we add to the Spring Boot https://github.com/teiid/teiid-spring-boot/blob/master/docs/CustomSource.adoc#house-keeping-tasks

@aditya300899
Copy link
Member Author

aditya300899 commented Jun 17, 2020

Thanks @rareddy, the doc explains things well.
I have an idea of the sample as I did the Cassandra sample but it was a database whereas HDFS in a file system. Would the sample process a file, let's say csv as in the ftp example?

@rareddy
Copy link
Member

rareddy commented Jun 17, 2020

The way I had for others is written example, but there may need a step sthat user needs to do like installing the HDFS etc, so we won't be able to test it using Junit, but if someone wants they can follow it they can follow and set it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants