Spring data hdfs #242

aditya300899 · 2020-06-06T13:44:00Z

1) I did not find the FileSystem class to be thread-safe. The source is the doc, which says "The implementations of FileSystem shipped with Apache Hadoop do not make any attempt to synchronize access to the working directory field."; a Stack Overflow answer mentions this as well.
2) I did not understand the wildcard part of FileConnectionImpl (the backward compatibility), and a little more detail would help, so for now I just did simple glob matching there.
3) When listing files, I set the recursive boolean to false by default, based on my understanding of how things are done in FileConnectionImpl.
4) The build will possibly fail, as I was getting duplicate classes on the classpath during the build. I'll look into how to resolve this, or ask for your help in case I'm not able to.
Submitting the first draft so that we can have an idea of things ahead!

aditya300899 · 2020-06-08T13:38:44Z

The build issue is coming from the plugin duplicate-finder-maven-plugin:1.3.0:

duplicate classes found for following artifacts: [com.google.guava:failureaccess:1.0, com.google.guava:guava:27.0-jre], [jakarta.activation:jakarta.activation-api:1.2.2, javax.activation:javax.activation-api:1.2.0], [commons-logging:commons-logging:1.1.3, org.springframework:spring-jcl:5.2.5.RELEASE]:

I'm not able to understand why javax.activation:javax.activation-api:1.2.0 is on the classpath. It should not be there, and it is not even in the dependency tree.
I think I would have to configure the plugin. Your views @shawkins?

shawkins · 2020-06-08T13:43:50Z

> I think I would have to configure the plugin. Your views @shawkins?

Some of the dependency stuff can be tricky. Is everything up-to-date on your branch? I can try to make some suggestions by looking at it locally.

aditya300899 · 2020-06-08T14:15:54Z

Yes, it's up to date; no changes have been made since the PR was opened. Please have a look!

shawkins · 2020-06-09T02:15:57Z

Here's an initial refinement: shawkins@3385a02

If possible, dependencies should be managed in the root pom - this makes it easy to track all versions in use and to share dependencies. @rareddy just aligned things to guava 20, so I've reverted back to that - if Hadoop Common actually requires version 27, then we need to make sure that's good across everything. Finally, I think we're good without the managed version of commons-net, but if something like that is needed we should be able to just upgrade the Teiid version.

aditya300899 · 2020-06-09T06:14:44Z

> Here's an initial refinement: shawkins@3385a02

Thanks for the help!

> If possible, dependencies should be managed in the root pom - this makes it easy to track all versions in use and to share dependencies. @rareddy just aligned things to guava 20, so I've reverted back to that - if Hadoop Common actually requires version 27, then we need to make sure that's good across everything. Finally, I think we're good without the managed version of commons-net, but if something like that is needed we should be able to just upgrade the Teiid version.

I used the latest version of Hadoop Common, i.e. 3.2.1, but versions 3.1.2 and below use guava 11.0.2. Would downgrading help?

shawkins · 2020-06-09T13:16:24Z

To prevent any possible mix-up later, add explicit dependencies in the teiid spring data hdfs module on the actual jars we want to use:

the replacement for commons-logging is https://mvnrepository.com/artifact/org.springframework/spring-jcl
the replacement for the excluded guava is just the same guava dependency, but that one will reference our managed version
the replacement for javax.activation is jakarta.activation
there is no replacement for jdk.tools, but there is a duplicate exclusion that should be removed

Any of these that we don't directly use in our code should be marked with runtime scope.
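As a rough illustration of the list above, the module pom might end up looking something like the fragment below. This is only a sketch: it assumes the excluded jars come in through org.apache.hadoop:hadoop-common, that all three replacements are unused in our own code (hence runtime scope), and that versions are managed in the root pom, so none are given here.

```xml
<!-- Sketch only: the artifact carrying the exclusions is assumed to be hadoop-common. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.activation</groupId>
      <artifactId>javax.activation-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Explicit replacements; versions are assumed to come from the root pom. -->
<dependency>
  <groupId>org.springframework</groupId>
  <artifactId>spring-jcl</artifactId>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>jakarta.activation</groupId>
  <artifactId>jakarta.activation-api</artifactId>
  <scope>runtime</scope>
</dependency>
```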

We can simplify the configuration to just the URI and an optional configuration file resource. That config file will need to be added as a resource to the Configuration you construct - most likely as a classpath resource for Spring Boot, but in WildFly it may make sense to also allow it to be on the filesystem.
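One possible way to support both locations is to look the resource up on the classpath first and fall back to the filesystem. The sketch below uses only the JDK; the class and method names (ConfigResourceLocator, locate) are hypothetical and not part of this PR. The resulting URL could then be handed to Hadoop's Configuration.addResource(URL) before creating the FileSystem.

```java
import java.io.File;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Optional;

// Hypothetical helper, not part of the PR: resolves an optional config file
// either from the classpath (Spring Boot style) or from the filesystem.
class ConfigResourceLocator {

    static Optional<URL> locate(String resource) {
        // 1. Try the classpath first.
        URL onClasspath = ConfigResourceLocator.class.getClassLoader().getResource(resource);
        if (onClasspath != null) {
            return Optional.of(onClasspath);
        }
        // 2. Fall back to a plain file on disk.
        File onDisk = new File(resource);
        if (!onDisk.isFile()) {
            return Optional.empty();
        }
        try {
            return Optional.of(onDisk.toURI().toURL());
        } catch (MalformedURLException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A resource every JVM can see on the classpath, used here only as a demo.
        System.out.println(locate("java/lang/Object.class").isPresent());
        // A name that is neither on the classpath nor on disk.
        System.out.println(locate("no-such-hdfs-config-xyz.xml").isPresent());
    }
}
```

Returning an Optional keeps the "optional configuration file" case explicit: when nothing is found, the connection can simply be built from the URI alone.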

For testing see if you can use one of the mini fs options to create a cluster and test out your basic operations. You'll probably need to make the createfs method protected so that you can inject the local/mock filesystem instead of trying to create a real instance.
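The protected factory method idea could look roughly like this. It is a minimal sketch with hypothetical names (Fs, FsBackedConnection, InMemoryFs, FakeFsConnection) standing in for the real Hadoop types and the connection class in this PR; only the shape of the seam is the point.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical stand-in for the real filesystem client type.
interface Fs {
    boolean exists(String path);
}

// Hypothetical connection class: all filesystem access goes through createFs().
class FsBackedConnection {
    private Fs fs;

    // Protected so that tests can override it; production code would
    // construct the real filesystem client here.
    protected Fs createFs() {
        throw new IllegalStateException("no real cluster configured in this sketch");
    }

    public boolean exists(String path) {
        if (fs == null) {
            fs = createFs(); // created lazily on first use
        }
        return fs.exists(path);
    }
}

// In-memory fake used instead of a real cluster.
class InMemoryFs implements Fs {
    private final Set<String> paths;

    InMemoryFs(Set<String> paths) {
        this.paths = paths;
    }

    @Override
    public boolean exists(String path) {
        return paths.contains(path);
    }
}

// Test-side subclass that injects the fake through the protected method.
class FakeFsConnection extends FsBackedConnection {
    @Override
    protected Fs createFs() {
        return new InMemoryFs(Collections.singleton("/data/a.txt"));
    }
}
```

A test would then exercise FakeFsConnection exactly like the real connection, without needing a running cluster.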

modifying dependencies making code checks pass

aditya300899 · 2020-06-15T13:46:21Z

@shawkins please have a look and suggest changes! Look at the glob search part carefully; the rest seems fine to me. I wasn't able to do a recursive glob search: "**" is not working!

shawkins · 2020-06-15T19:30:58Z

This seems to resolve all the dependency issues for me: shawkins@006b1f7

It also moves the test class into a package and removes the exception handling - generally in tests you won't need to wrap exceptions.

> I wasn't able to do a recursive glob search, "**" is not working!

Can you work around it with multiple explicit directory searches, e.g. /*/*/*.txt?

If that's the case we can just call this a known issue.
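For reference, the workaround can be sketched as generating one pattern per directory depth, up to some chosen limit, and running each through the normal (non-recursive) glob call such as Hadoop's FileSystem.globStatus. The helper name GlobDepthExpander and the maxDepth parameter are hypothetical; this only builds the pattern strings and does not touch a filesystem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: emulates a bounded "**" by listing explicit depths.
class GlobDepthExpander {

    // Returns baseDir/filePattern, baseDir/*/filePattern, ... up to maxDepth
    // levels of intermediate directories.
    static List<String> expand(String baseDir, String filePattern, int maxDepth) {
        if (maxDepth < 0) {
            throw new IllegalArgumentException("maxDepth must be >= 0");
        }
        String prefix = baseDir.endsWith("/") ? baseDir : baseDir + "/";
        List<String> patterns = new ArrayList<>();
        StringBuilder wildcardDirs = new StringBuilder();
        for (int depth = 0; depth <= maxDepth; depth++) {
            patterns.add(prefix + wildcardDirs + filePattern);
            wildcardDirs.append("*/"); // one more directory level for the next pattern
        }
        return patterns;
    }

    public static void main(String[] args) {
        // Each pattern would be passed to the glob call separately
        // and the results merged by the caller.
        for (String pattern : expand("/", "*.txt", 2)) {
            System.out.println(pattern);
        }
    }
}
```

The obvious limitation is that files deeper than maxDepth are not found, which is why this is a workaround for a known issue rather than a replacement for "**".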

aditya300899 · 2020-06-16T15:18:54Z

@shawkins as suggested, I've made the changes, though the log4j warning still pops up. I will add the mustache file tomorrow and carry on with the S3 source.

shawkins

LGTM. Good work Aditya.

shawkins · 2020-06-17T01:32:52Z

There are a couple of follow-on tasks.

One is to get this functionality to Teiid Wildfly - https://issues.redhat.com/browse/TEIID-3647 - I will handle that likely by pulling the necessary code from here over to Teiid and have Aditya review the changes.

We'll also need a .mustache file and sample project.

aditya300899 · 2020-06-17T03:43:12Z

> We'll also need a .mustache file and sample project.

I'll do this part. What would this sample look like?

rareddy · 2020-06-17T11:58:01Z

Some time ago I added the following text on the tasks to be done for every source we add to Spring Boot: https://github.com/teiid/teiid-spring-boot/blob/master/docs/CustomSource.adoc#house-keeping-tasks

aditya300899 · 2020-06-17T12:32:44Z

Thanks @rareddy, the doc explains things well.
I have an idea of the sample since I did the Cassandra sample, but that was a database whereas HDFS is a file system. Would the sample process a file, let's say a CSV, as in the FTP example?

rareddy · 2020-06-17T14:33:30Z

The way I did it for the others is a written example. There may be steps that the user needs to do, like installing HDFS, so we won't be able to test it using JUnit, but anyone who wants to can follow it and set it up.

aditya300899 force-pushed the spring-data-hdfs branch from 3385a02 to 3dfcebe · June 9, 2020 05:57

aditya300899 and others added 6 commits June 15, 2020 16:49

initial commit

5dae57d

implementing the hdfs connection

63fafb9

TEIIDSB-206 refinements to hdfs supports

92e93b4

modifying dependencies making code checks pass

revert commit

4b6e75b

revert changes due to IDE

f7f5e6b

adding test class and refining code

23d2b7b

aditya300899 force-pushed the spring-data-hdfs branch from 50e406f to 23d2b7b · June 15, 2020 13:41

shawkins and others added 3 commits June 16, 2020 15:04

TEIIDSB-206 updating dependencies and style changes to the test class

8cb4c72

refining getfiles method and adding dependency on log4j-api

ed24242

Merge branch 'master' into spring-data-hdfs

a8ccd54

shawkins approved these changes Jun 17, 2020

shawkins merged commit 40824f8 into teiid:master Jun 17, 2020

aditya300899 deleted the spring-data-hdfs branch June 17, 2020 08:39
