Skip to content

Conversation

@gorskysd
Copy link
Contributor

Modified Eventlog to take in parsed logs without error. The output of .build() is now a tuple with both the eventlog_path and parsed (bool) to indicate which type of log it is. This PR pairs bugfix/PROD-399-version-2 in sync_backend. A new test was introduced for the new case.

Some error messages were also adjusted to be more informative on the exact problem, this was necessary with the introduction of parsed logs.

@gorskysd gorskysd requested review from NKSync and rmoneys September 19, 2022 18:58
appobj=None, # application_model object
eventlog=None, # spark eventlog path,
spark_eventlog_parsed_path=None,
spark_eventlog_path=None, # application_model object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the new names, but I don't think this comment lines up.


if (appobj is not None) or (eventlog is not None): # Load an application_model or eventlog
if (appobj is not None) or (
self.eventlog is not None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't this attribute renamed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objfile was renamed. appobj is a very old option that doesn't get used anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on second thought, I'll clean this up a bit more to remove appobj and make sure attributes are all consistent.

event_log, parsed = eventlog.EventLogBuilder(event_log_paths, temp_dir).build()
sparkApplication(spark_eventlog_parsed_path=str(event_log))

assert parsed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be slightly easier to follow if the assert immediately followed the line in which parsed is assigned.

obj = sparkApplication(spark_eventlog_path=str(event_log))

assert list(obj.sqlData.index.values) == [0, 2, 3, 5, 6, 7, 8]
assert not parsed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved up too.

try: # Test if it is a parsed log
with open(path) as fobj:
data = json.load(fobj)
if "jobData" in data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, sparkApplication should be making the determination whether the object is a valid event log or not. It creates the parsed logs after all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle I agree, and in the future I we can make that happen. For now though, you'll see in the sync_backend sister PR that parsed/unparsed judgement has to come before the log has been parsed because that must be known before entering the spark_log_parser through SparkApplicationAdvanced, which inherits from sparkApplication. I know, it's real janky right now, but it's getting better bit by bit. A separation of these classes will happen, and we can make this even cleaner then -- I don't want to get carried away in the one PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, I was thinking it could be a classmethod like,

T = TypeVar('T', bound='sparkApplication')


class sparkApplication:
    @classmethod
    def is_parsed_log(cls: Type[T], eventlog: dict) -> boolean:
        ...

(btw, I learned the type annotations for classmethods here: https://stackoverflow.com/questions/44640479/type-annotation-for-classmethod-returning-instance)
which allows us to call the method before we create an instance of the class. That way we keep the parsed log logic in the same module.

I totally agree that we are making good steady progress. It's a balance between improving the structure, not losing any of the knowledge in the code, and delivering features n' fixes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...or staticmethod if we don't need that class variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh I see what you mean, that's nice. I'll keep this in mind for the next batch of log_parser work.

Copy link
Contributor

@rmoneys rmoneys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think the small change to cli.py that I suggested is all that's needed.

As I mentioned in a comment, it'd be great a some point to move the logic that determines whether an object constitutes a parsed log to the class that creates the parsed log (maybe a class method that takes a dict) - preexisting tech debt though.

Also, at some point it'd be good to add some notes in sync_backend's README.md about how to test these changes there.


if parsed:
print("Input log is already parsed")
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this constitutes an abnormal exit since we're not putting the result in the result directory. In which case we can simply do,

return "Input already contains a parsed log file"

The invocation is wrapped in a sys.exit in the script that's created when the package is installed:

% cat ~/.virtualenvs/sync_backend/bin/spark-log-parser
#!/Users/scottromney/.virtualenvs/sync_backend/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from spark_log_parser.cli import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

So returning the string will result in a message printed to stderr and an exit code of 1 as expected.

Alternatively, we could treat this as a normal scenario and copy the parsed event log to the result directory (it might be buried in an archive after all) logging a warning to indicate that it was already parsed. That way this function would continue to have a single point of return, but I think this behavior is more in keeping with this command line utility as a parser only - when it can't parse you're using it incorrectly and should expect an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice good call, I changed it to copy over the parsed logs to the destination regardless.

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 2 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@gorskysd gorskysd requested a review from rmoneys September 20, 2022 00:34
@gorskysd gorskysd merged commit bb82e49 into main Sep 20, 2022
@gorskysd gorskysd deleted the PROD-399-version-2 branch September 20, 2022 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants