Crawler Finished Time added to done list and improvements #45

Merged
merged 4 commits into from
Apr 26, 2019

N0taN3rd
Contributor

added the time a crawler finishes to the done list (fixes #44)
implemented the ability to specify screenshot dimensions via the SCREENSHOT_DIMENSIONS environment variable
improved the handling of out link collection in order to handle pages with 1k+ links present in the DOM at a time
switched to using ABCMeta and EventEmitterS in order to ensure we are not forcing a __dict__ on classes that do not opt in for one
unified the __slots__ format and removed unnecessary ABC usage in intermediate abstract classes
updated README with the new environment variables
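
For context on the `__dict__` item above: in Python, a class only avoids a per-instance `__dict__` when every class in its inheritance chain declares `__slots__`. The sketch below illustrates that general rule with `ABCMeta` used as a metaclass on a slotted base. All class names here are hypothetical and are not the classes touched by this PR.

```python
from abc import ABCMeta, abstractmethod


class PlainBase:
    """Hypothetical base class that does not declare __slots__."""


class SlottedOnPlain(PlainBase):
    # __slots__ here does not help: PlainBase already gives instances a __dict__.
    __slots__ = ("reqid",)

    def __init__(self, reqid: str) -> None:
        self.reqid = reqid


class SlottedAbstract(metaclass=ABCMeta):
    # Abstract base that opts out of __dict__ by declaring empty slots.
    __slots__ = ()

    @abstractmethod
    def run(self) -> str:
        ...


class SlottedConcrete(SlottedAbstract):
    # Every class in the chain declares __slots__, so no __dict__ is created.
    __slots__ = ("reqid",)

    def __init__(self, reqid: str) -> None:
        self.reqid = reqid

    def run(self) -> str:
        return self.reqid


def has_instance_dict(obj: object) -> bool:
    """Return True when the instance carries a per-instance __dict__."""
    return hasattr(obj, "__dict__")
```

With these definitions, `has_instance_dict(SlottedOnPlain("a"))` is true while `has_instance_dict(SlottedConcrete("a"))` is false, and assigning an undeclared attribute on a `SlottedConcrete` instance raises `AttributeError`.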

@N0taN3rd N0taN3rd requested a review from ikreymer April 24, 2019 21:37
…order to be more clear about what it does

when the all_frames keyword arg of collect_outlinks (previously manual_collection) is true, both all-frame and behavior out link collection occur rather than one or the other
added typing to BehaviorTabs __slots__ to make linting happy
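
A rough sketch of the control flow that the commit message describes, assuming that behavior-based collection always runs and that `all_frames=True` adds the all-frame pass on top of it. The function name, its parameters, and the two collector callables are hypothetical stand-ins. The real `collect_outlinks` signature is not shown in this PR.

```python
from typing import Callable, List


def collect_outlinks_sketch(
    from_behavior: Callable[[], List[str]],
    from_all_frames: Callable[[], List[str]],
    all_frames: bool = False,
) -> List[str]:
    """Hypothetical illustration: gather out links from one or both sources."""
    # Assumption: behavior-based collection always runs.
    links = list(from_behavior())
    if all_frames:
        # With all_frames set, the all-frame pass runs in addition to,
        # not instead of, the behavior-based pass.
        links.extend(from_all_frames())
    return links
```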
@@ -165,7 +197,8 @@ def main_frame_getter(self) -> Frame:
exc_info=e,
)

self.logger.info(logged_method, "crawl loop task ended")
end_info = Helper.json_string(id=self.reqid, time=time.time())
Member

Suggested change
- end_info = Helper.json_string(id=self.reqid, time=time.time())
+ end_info = Helper.json_string(id=self.reqid, time=int(time.time()))

Otherwise includes a long decimal for microseconds, don't really need that :)
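
To illustrate the point of the suggestion: `time.time()` returns a float with a fractional-seconds part, and `int()` truncates it to whole seconds. The sketch below uses the standard-library `json.dumps` as a stand-in for the project's `Helper.json_string`, whose implementation is not shown here. The function name and the sample timestamp are made up for the example.

```python
import json
import time


def end_info_sketch(reqid: str, now: float) -> str:
    """Hypothetical stand-in for building the crawler-finished payload."""
    # int() drops the fractional part, leaving whole seconds since the epoch.
    return json.dumps({"id": reqid, "time": int(now)})


# Example with a fixed timestamp so the result is reproducible:
payload = end_info_sketch("abc", 1556310123.987654)
# payload == '{"id": "abc", "time": 1556310123}'

# In real use the current time would be passed in:
live_payload = end_info_sketch("abc", time.time())
```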

Contributor Author

Done

@N0taN3rd N0taN3rd requested a review from ikreymer April 26, 2019 15:07
@ikreymer ikreymer merged commit 8d39df4 into master Apr 26, 2019
@ikreymer ikreymer deleted the crawler-done-timing-and-tweaks branch April 26, 2019 20:22
Successfully merging this pull request may close these issues.

Set crawl state to done when last browser is done