Prod 399 version 2 #16
Conversation
appobj=None,  # application_model object
eventlog=None,  # spark eventlog path,
spark_eventlog_parsed_path=None,
spark_eventlog_path=None,  # application_model object
Love the new names, but I don't think this comment lines up.
if (appobj is not None) or (eventlog is not None):  # Load an application_model or eventlog
if (appobj is not None) or (
    self.eventlog is not None
Wasn't this attribute renamed?
No, objfile was renamed. appobj is a very old option that doesn't get used anymore.
On second thought, I'll clean this up a bit more to remove appobj and make sure the attributes are all consistent.
tests/test_parse.py
Outdated
event_log, parsed = eventlog.EventLogBuilder(event_log_paths, temp_dir).build()
sparkApplication(spark_eventlog_parsed_path=str(event_log))

assert parsed
I think this would be slightly easier to follow if the assert immediately followed the line in which parsed is assigned.
tests/test_parse.py
Outdated
obj = sparkApplication(spark_eventlog_path=str(event_log))

assert list(obj.sqlData.index.values) == [0, 2, 3, 5, 6, 7, 8]
assert not parsed
This can be moved up too.
try:  # Test if it is a parsed log
    with open(path) as fobj:
        data = json.load(fobj)
        if "jobData" in data:
Ideally, sparkApplication should be the one determining whether the object is a valid event log or not. It creates the parsed logs, after all.
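A minimal, self-contained sketch of what this check does (the `jobData` key and file-reading approach come from the diff above; the `is_parsed_log` helper name and the exception handling are assumptions for illustration):

```python
import json


def is_parsed_log(path: str) -> bool:
    """Heuristic: a previously parsed log is a single JSON document with a 'jobData' key."""
    try:
        with open(path) as fobj:
            data = json.load(fobj)
    except (json.JSONDecodeError, UnicodeDecodeError):
        # Raw Spark event logs are newline-delimited JSON, so json.load fails on them.
        return False
    return isinstance(data, dict) and "jobData" in data
```

A raw event log (one JSON object per line) fails `json.load` and is reported as unparsed, while a parsed output containing `jobData` is reported as parsed.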
In principle I agree, and we can make that happen in the future. For now, though, you'll see in the sync_backend sister PR that the parsed/unparsed judgement has to be made before the log has been parsed, because it must be known before entering spark_log_parser through SparkApplicationAdvanced, which inherits from sparkApplication. I know it's real janky right now, but it's getting better bit by bit. A separation of these classes will happen, and we can make this even cleaner then -- I don't want to get carried away in the one PR.
Just to be clear, I was thinking it could be a classmethod like,
from typing import Type, TypeVar

T = TypeVar('T', bound='sparkApplication')

class sparkApplication:
    @classmethod
    def is_parsed_log(cls: Type[T], eventlog: dict) -> bool:
        ...
(btw, I learned the type annotations for classmethods here: https://stackoverflow.com/questions/44640479/type-annotation-for-classmethod-returning-instance)
which allows us to call the method before we create an instance of the class. That way we keep the parsed log logic in the same module.
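As a runnable illustration of calling the classmethod before any instance exists (the method body here is a placeholder assumption, not the project's real check):

```python
from typing import Type, TypeVar

T = TypeVar("T", bound="sparkApplication")


class sparkApplication:
    @classmethod
    def is_parsed_log(cls: Type[T], eventlog: dict) -> bool:
        # Placeholder heuristic for illustration only.
        return "jobData" in eventlog


# No instance needed; the check runs against the class itself:
result = sparkApplication.is_parsed_log({"jobData": []})
```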
I totally agree that we are making good steady progress. It's a balance between improving the structure, not losing any of the knowledge in the code, and delivering features n' fixes.
...or staticmethod if we don't need that class variable
ohh I see what you mean, that's nice. I'll keep this in mind for the next batch of log_parser work.
rmoneys
left a comment
Looks good. I think the small change to cli.py that I suggested is all that's needed.
As I mentioned in a comment, it'd be great at some point to move the logic that determines whether an object constitutes a parsed log into the class that creates the parsed log (maybe a classmethod that takes a dict) - preexisting tech debt though.
Also, at some point it'd be good to add some notes in sync_backend's README.md about how to test these changes there.
spark_log_parser/cli.py
Outdated
if parsed:
    print("Input log is already parsed")
    return
I think this constitutes an abnormal exit since we're not putting the result in the result directory. In which case we can simply do,
return "Input already contains a parsed log file"
The invocation is wrapped in a sys.exit in the script that's created when the package is installed:
% cat ~/.virtualenvs/sync_backend/bin/spark-log-parser
#!/Users/scottromney/.virtualenvs/sync_backend/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from spark_log_parser.cli import main
if __name__ == '__main__':
sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
sys.exit(main())
So returning the string will result in a message printed to stderr and an exit code of 1 as expected.
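That behavior can be demonstrated with plain `sys.exit` semantics (the message text follows the suggestion above; everything else is standard Python):

```python
import sys


def main():
    # Abnormal exit: return the message string instead of printing and returning None.
    return "Input already contains a parsed log file"


# In the installed entry-point script this is invoked as sys.exit(main()):
# a string argument to sys.exit is printed to stderr and the exit status becomes 1,
# while returning None would exit with status 0.
```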
Alternatively, we could treat this as a normal scenario and copy the parsed event log to the result directory (it might be buried in an archive, after all), logging a warning to indicate that it was already parsed. That way this function would continue to have a single point of return. But I think the abnormal-exit behavior is more in keeping with this command-line utility being a parser only - when it can't parse, you're using it incorrectly and should expect an error.
Nice, good call. I changed it to copy the parsed logs over to the destination regardless.
Kudos, SonarCloud Quality Gate passed!
Modified Eventlog to take in parsed logs without error. The output of .build() is now a tuple with both the eventlog_path and parsed (bool) to indicate which type of log it is. This PR pairs with bugfix/PROD-399-version-2 in sync_backend. A new test was introduced for the new case. Some error messages were also adjusted to be more informative about the exact problem; this was necessary with the introduction of parsed logs.
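Calling code would then unpack the new return value along these lines (a sketch with a stub builder; only the `(eventlog_path, parsed)` tuple shape comes from this PR, and the detection rule in the stub is invented for illustration):

```python
# Stub mirroring the new EventLogBuilder.build() contract described above.
class EventLogBuilder:
    def __init__(self, event_log_paths, work_dir):
        self._path = event_log_paths[0]

    def build(self):
        # Invented rule for this sketch: treat *.parsed.json inputs as already parsed.
        parsed = self._path.endswith(".parsed.json")
        return self._path, parsed


eventlog_path, parsed = EventLogBuilder(["run1.parsed.json"], "/tmp").build()
if parsed:
    print("Input log is already parsed")
```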