Add support for hadoop fs -appendToFile #910
lgtm
Actually, after proposing this and digging a bit more into the matter, I realise that it's perhaps worth discussing whether it makes sense to add appending capabilities to luigi. Tomorrow I will write down some thoughts on the matter, but I think it would also be useful to have for:
Yea, let us know when you've done some more research. Let's not merge this for now. Looking forward to your findings! :)
@Tarrasch so I have started adding append support to My use case is that I have a bunch of smallish files (you can think of them as log files, but they are a bit more complex) and I would like to aggregate them all into a daily view. Hadoop does not perform well when processing many small files; it works best with files that are larger than its block size, which I believe is around 60 MB. I don't think I have fully grasped the execution model of luigi yet, but I am assuming that tasks will parallelise on tasks returned by This allows me to consolidate my logs into daily views and ensure consistency of the output file. Commit hellais@3e7816b shows this, and I will now proceed to test it with my data. Let me know if you have any questions or feedback on this PR.
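To make the use case above concrete, here is a minimal sketch of consolidating many small files into one file per day by appending. It runs against the local filesystem rather than HDFS, and the `YYYY-MM-DD-` file name prefix as well as the names `group_by_day` and `consolidate` are assumptions made for illustration, not part of this PR.

```python
import os
from collections import defaultdict


def group_by_day(filenames):
    """Group file names by their leading YYYY-MM-DD prefix.

    The prefix is a hypothetical naming convention, e.g. "2015-04-20-host1.log".
    """
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[name[:10]].append(name)
    return dict(groups)


def consolidate(src_dir, dst_dir):
    """Append every small file in src_dir to one output file per day in dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for day, names in group_by_day(os.listdir(src_dir)).items():
        daily_path = os.path.join(dst_dir, day + ".log")
        # "ab" opens the daily file in append mode, so earlier data is kept.
        with open(daily_path, "ab") as out_file:
            for name in names:
                with open(os.path.join(src_dir, name), "rb") as in_file:
                    out_file.write(in_file.read())
```

Note that running `consolidate` twice over the same sources appends the data twice. That lack of idempotency is one reason appending deserves some thought in a framework built around "output exists, so the task is done".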
Looks great, but you should also add support for append mode in
@@ -45,6 +47,7 @@ def generate_tmp_path(self, path):


class LocalFileSystem(FileSystem):
Are these really pep8 fixes??? I mean, you didn't change this code and the pep8 check (tox pep8) ran fine before. Hmm...
Pylint claims they are.
From pep8: https://www.python.org/dev/peps/pep-0008/#blank-lines
"Method definitions inside a class are surrounded by a single blank line."
I guess it's up to interpretation if you should consider the constructor to be a class method or not. Pylint seems to believe that is the case.
Thanks for the code review.
I am still a bit uncertain if this is in line with the overall design philosophy of this framework. While testing this feature for my use case I realised that I have to use another approach. The difficulty that I am running into is that it's not possible for me to dynamically generate the
I'm not sure what this means. I mean, when defining a task, you have the same context for both defining the input() and output().
Yeah, not sure what this means, sorry
@Tarrasch @erikbern what I mean is that I don't see how it's possible to accomplish something like this:

import luigi


def list_resources():
    # Returns some tasks or is iterable
    pass


def get_output_from_input(input):
    # Returns the output Target given a certain input
    pass


def process(data):
    # Transforms the input
    pass


class ReportStreams(luigi.Task):
    def requires(self):
        # Collect every task and return the whole list
        tasks = []
        for task in list_resources():
            tasks.append(task)
        return tasks

    # Hypothetical signature: luigi itself calls output() without arguments
    def output(self, input=None):
        return get_output_from_input(input)

    def run(self):
        for input in self.input():
            # One output target per input target
            with input.open("r") as in_file, self.output(input).open("w") as out_file:
                out_file.write(process(in_file.read()))
Use mutex lock to avoid concurrent file appends
6fe7dc8 to 5e7c422
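As a rough illustration of what the commit title describes, the sketch below serialises appends with a lock so that concurrent writers cannot interleave. It is an assumption about the general technique, not the code in the commit: `_append_lock` and `append_line` are hypothetical names, and a `threading.Lock` only protects threads inside a single process.

```python
import threading

# Hypothetical module-level lock shared by every writer in this process.
_append_lock = threading.Lock()


def append_line(path, line):
    """Append one line to path while holding the lock."""
    with _append_lock:
        with open(path, "a") as out_file:
            out_file.write(line + "\n")
```

Workers running in separate processes or on separate machines would need a different mechanism, such as a lock file or a coordination service.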
I created a new pull request for this where I integrate your feedback: #973
In this pull request I add support for the Hadoop appendToFile command.
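For readers unfamiliar with the command, this is a minimal sketch of invoking `hadoop fs -appendToFile` from Python. The helper names `build_append_command` and `append_to_hdfs` are hypothetical and are not the API added by this PR. The sketch assumes a `hadoop` binary on the `PATH` and a cluster where appending is enabled.

```python
import subprocess


def build_append_command(local_paths, hdfs_path):
    """Build the argument list for `hadoop fs -appendToFile <localsrc> ... <dst>`."""
    local_paths = list(local_paths)
    if not local_paths:
        raise ValueError("at least one local source file is required")
    return ["hadoop", "fs", "-appendToFile"] + local_paths + [hdfs_path]


def append_to_hdfs(local_paths, hdfs_path):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(build_append_command(local_paths, hdfs_path))
```

Keeping command construction separate from execution makes the argument order easy to check without a running cluster.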