Prod 399 version 2 #16

gorskysd · 2022-09-19T18:58:04Z

Modified Eventlog to accept parsed logs without error. The output of .build() is now a tuple of eventlog_path and parsed (a bool) that indicates which type of log it is. This PR pairs with bugfix/PROD-399-version-2 in sync_backend. A new test was added for the new case.
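As a rough illustration of how a caller might consume the new tuple, here is a minimal sketch. `build_stub` and `app_kwargs` are hypothetical helpers, not part of spark_log_parser; only the keyword names `spark_eventlog_parsed_path` and `spark_eventlog_path` come from this PR's diff.

```python
from pathlib import Path
from typing import Dict, Tuple


def build_stub(path: str, parsed: bool) -> Tuple[Path, bool]:
    """Hypothetical stand-in for EventLogBuilder(...).build()."""
    return Path(path), parsed


def app_kwargs(event_log: Path, parsed: bool) -> Dict[str, str]:
    """Pick the sparkApplication keyword that matches the log type."""
    key = "spark_eventlog_parsed_path" if parsed else "spark_eventlog_path"
    return {key: str(event_log)}


# Unpack the tuple, then route the path to the matching keyword argument.
event_log, parsed = build_stub("eventlog.json", parsed=True)
kwargs = app_kwargs(event_log, parsed)
```

The resulting dict could then be splatted into the constructor, e.g. `sparkApplication(**kwargs)`, assuming the constructor accepts those keywords as shown in the diff.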

Some error messages were also adjusted to be more informative about the exact problem; this was necessary with the introduction of parsed logs.

rmoneys · 2022-09-19T20:56:48Z

spark_log_parser/parsing_models/application_model_v2.py

-        appobj=None,  # application_model object
-        eventlog=None,  # spark eventlog path,
+        spark_eventlog_parsed_path=None,
+        spark_eventlog_path=None,  # application_model object


Love the new names, but I don't think this comment lines up.

rmoneys · 2022-09-19T20:58:26Z

spark_log_parser/parsing_models/application_model_v2.py


-        if (appobj is not None) or (eventlog is not None):  # Load an application_model or eventlog
+        if (appobj is not None) or (
+            self.eventlog is not None


Wasn't this attribute renamed?

No, objfile was renamed. appobj is a very old option that doesn't get used anymore.

On second thought, I'll clean this up a bit more to remove appobj and make sure the attributes are all consistent.

rmoneys · 2022-09-19T21:07:40Z

tests/test_parse.py

+        event_log, parsed = eventlog.EventLogBuilder(event_log_paths, temp_dir).build()
+        sparkApplication(spark_eventlog_parsed_path=str(event_log))
+
+    assert parsed


I think this would be slightly easier to follow if the assert immediately followed the line in which parsed is assigned.

rmoneys · 2022-09-19T21:09:13Z

tests/test_parse.py

+        obj = sparkApplication(spark_eventlog_path=str(event_log))

    assert list(obj.sqlData.index.values) == [0, 2, 3, 5, 6, 7, 8]
+    assert not parsed


This can be moved up too.
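The ordering suggested in the two comments above might look like the following sketch. `build_stub` and `load_index_stub` are hypothetical stand-ins for `EventLogBuilder(...).build()` and for reading `sqlData.index.values` from a `sparkApplication`; they exist only so the example runs on its own.

```python
from typing import List, Tuple


def build_stub() -> Tuple[str, bool]:
    """Hypothetical stand-in for EventLogBuilder(...).build()."""
    return "eventlog.json", False


def load_index_stub(event_log: str) -> List[int]:
    """Hypothetical stand-in for sparkApplication(...).sqlData.index.values."""
    return [0, 2, 3, 5, 6, 7, 8]


def test_unparsed_log() -> None:
    event_log, parsed = build_stub()
    assert not parsed  # checked right where `parsed` is assigned

    index = load_index_stub(event_log)
    assert index == [0, 2, 3, 5, 6, 7, 8]


test_unparsed_log()
```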

rmoneys · 2022-09-19T22:14:42Z

spark_log_parser/eventlog.py

+                try:  # Test if it is a parsed log
+                    with open(path) as fobj:
+                        data = json.load(fobj)
+                        if "jobData" in data:


Ideally, sparkApplication should be making the determination whether the object is a valid event log or not. It creates the parsed logs after all.

In principle I agree, and in the future we can make that happen. For now though, you'll see in the sync_backend sister PR that the parsed/unparsed judgement has to come before the log has been parsed, because that must be known before entering the spark_log_parser through SparkApplicationAdvanced, which inherits from sparkApplication. I know, it's real janky right now, but it's getting better bit by bit. A separation of these classes will happen, and we can make this even cleaner then -- I don't want to get carried away in the one PR.

Just to be clear, I was thinking it could be a classmethod like,

        from typing import Type, TypeVar

        T = TypeVar('T', bound='sparkApplication')

        class sparkApplication:
            @classmethod
            def is_parsed_log(cls: Type[T], eventlog: dict) -> bool:
                ...

A classmethod allows us to call the method before we create an instance of the class. That way we keep the parsed log logic in the same module.

(btw, I learned the type annotations for classmethods here: https://stackoverflow.com/questions/44640479/type-annotation-for-classmethod-returning-instance)

I totally agree that we are making good steady progress. It's a balance between improving the structure, not losing any of the knowledge in the code, and delivering features n' fixes.

...or staticmethod if we don't need that class variable

ohh I see what you mean, that's nice. I'll keep this in mind for the next batch of log_parser work.
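A minimal sketch of the staticmethod variant discussed in this thread. `SparkApplicationSketch` is a hypothetical stand-in for sparkApplication, and the rule that a parsed log is a JSON object with a top-level "jobData" key is an assumption carried over from the eventlog.py diff above, not a confirmed definition of the parsed format.

```python
class SparkApplicationSketch:
    """Hypothetical stand-in for sparkApplication; only the check is shown."""

    @staticmethod
    def is_parsed_log(eventlog: dict) -> bool:
        # Assumption: a parsed log carries a top-level "jobData" key,
        # mirroring the check in the eventlog.py diff.
        return isinstance(eventlog, dict) and "jobData" in eventlog


# The check can be called without creating an instance of the class.
is_parsed = SparkApplicationSketch.is_parsed_log({"jobData": {}})
is_raw = SparkApplicationSketch.is_parsed_log({"Event": "SparkListenerLogStart"})
```

Because the method needs no class state, a staticmethod is enough here; a classmethod would only be needed if the check depended on class-level attributes.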

rmoneys

Looks good. I think the small change to cli.py that I suggested is all that's needed.

As I mentioned in a comment, it'd be great at some point to move the logic that determines whether an object constitutes a parsed log to the class that creates the parsed log (maybe a class method that takes a dict) - preexisting tech debt though.

Also, at some point it'd be good to add some notes in sync_backend's README.md about how to test these changes there.

rmoneys · 2022-09-20T00:07:12Z

spark_log_parser/cli.py

+
+        if parsed:
+            print("Input log is already parsed")
+            return


I think this constitutes an abnormal exit since we're not putting the result in the result directory. In which case we can simply do,

return "Input already contains a parsed log file"

The invocation is wrapped in a sys.exit in the script that's created when the package is installed:

        % cat ~/.virtualenvs/sync_backend/bin/spark-log-parser
        #!/Users/scottromney/.virtualenvs/sync_backend/bin/python
        # -*- coding: utf-8 -*-
        import re
        import sys
        from spark_log_parser.cli import main
        if __name__ == '__main__':
            sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
            sys.exit(main())

So returning the string will result in a message printed to stderr and an exit code of 1 as expected.

Alternatively, we could treat this as a normal scenario and copy the parsed event log to the result directory (it might be buried in an archive after all), logging a warning to indicate that it was already parsed. That way this function would continue to have a single point of return. Still, I think returning an error is more in keeping with this command line utility as a parser only: when it can't parse, you're using it incorrectly and should expect an error.

Nice, good call. I changed it to copy the parsed logs over to the destination regardless.
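For reference, the `sys.exit(main())` behavior discussed in this thread can be checked with a small standalone sketch. The `main` below is a hypothetical stand-in, not the real `spark_log_parser.cli.main`; it relies only on Python's documented behavior that a string passed to `sys.exit` is printed to stderr and the process exits with status 1.

```python
import subprocess
import sys

# A tiny script that mimics the installed entry point wrapping main() in sys.exit().
SCRIPT = '''
import sys

def main():
    # Returning a string signals an abnormal exit to the wrapper.
    return "Input already contains a parsed log file"

sys.exit(main())
'''

# Run the script in a child interpreter and capture its streams.
result = subprocess.run(
    [sys.executable, "-c", SCRIPT], capture_output=True, text=True
)
print(result.returncode)
print(result.stderr.strip())
```

A `main` that returns `None` would exit with status 0 instead, which is what the normal path of a CLI function relies on.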

sonarqubecloud · 2022-09-20T00:34:09Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
2 Code Smells

No Coverage information
0.0% Duplication

Sean Gorsky added 4 commits September 19, 2022 10:49

Added ability to pass through parsed logs. Added more detail to ValueError messages

41f6e35

Change kwargs to be better aligned and more informative

9164df0

Merge branch 'main' into PROD-399-version-2

d95f279

incremented version

d3e004c

gorskysd requested review from NKSync and rmoneys September 19, 2022 18:58

rmoneys reviewed Sep 19, 2022

Sean Gorsky added 2 commits September 19, 2022 17:36

PR cleanup

9a2ac18

more tidying

667d70c

rmoneys reviewed Sep 19, 2022

rmoneys suggested changes Sep 20, 2022

Cleaned up cli.py to still copy parsed logs to destination

af61f93

gorskysd requested a review from rmoneys September 20, 2022 00:34

rmoneys approved these changes Sep 20, 2022

gorskysd merged commit bb82e49 into main Sep 20, 2022

gorskysd deleted the PROD-399-version-2 branch September 20, 2022 01:14

3 participants