Closes #48: Unit tests for crawler and new crawler modules, improved scraping #52

Merged
7 commits merged into master from 48-scraper-tests on Oct 25, 2017

Conversation

@bobheadxi (Member) commented Oct 23, 2017

Related Issues

#48, #49
This also grew a little out of scope, oops

Description

  • Unit tests + test data, improved documentation for the new crawler modules added in #45 (Closes #44, closes #37: Scraper upgrades)
  • Updated generic_page_parser to get site name and page name if available, both of which are now inserted into Solr (see the sketch after this list)
  • Removed pageTitle from Solr model and tests (the other two name fields are enough I think)
  • Improved parser robustness with safer xpath extracts and other tweaks to parser utils
  • Added .coveragerc which should hopefully exclude coverage of our tests from overall coverage
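
A rough sketch of the new site/page name handling mentioned above (illustrative only; the real logic lives inline in generic_page_parser and the helper name here is hypothetical):

def split_title(full_title):
    """Split a raw <title> like "Contact Us | Example Site" into page and site names."""
    title = full_title.strip()
    site_title = ""
    parts = full_title.split('|')
    if len(parts) == 2:
        title = parts[0].strip()
        site_title = parts[1].strip()
    return title, site_title

For example, split_title("Contact Us | Example Site") returns ("Contact Us", "Example Site"), while a title without a '|' separator comes back with an empty site name.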

WIKI Updates

  • Another day, another Schema update

Todos

General:

  • Tests 🔥🔥🔥
  • Documentation
  • Wiki

@coveralls

Coverage Status

Coverage increased (+0.7%) to 83.228% when pulling 36cb5dd on 48-scraper-tests into c85f6d0 on master.

@coveralls

Coverage Status

Coverage increased (+1.3%) to 83.861% when pulling 117c081 on 48-scraper-tests into c85f6d0 on master.

@coveralls

Coverage Status

Coverage increased (+2.6%) to 85.127% when pulling 05d6fed on 48-scraper-tests into c85f6d0 on master.

@coveralls

Coverage Status

Coverage increased (+2.6%) to 85.127% when pulling bd7c60b on 48-scraper-tests into c85f6d0 on master.

@bfbachmann (Member) left a comment

Looks good, just a few minor suggestions.

@@ -0,0 +1,3 @@
[report]
omit =
    */tests/*
@bfbachmann (Member):

What does this do?

@bfbachmann (Member):

Never mind, just read your description.

@@ -1,50 +1,64 @@
import scrapy
import sleuth_crawler.scraper.scraper.spiders.parsers.utils as utils
@bfbachmann (Member):

I think you can just do

from sleuth_crawler.scraper.scraper.spiders.parsers import utils

@@ -19,7 +19,7 @@ def strip_content(data):
return lines
@bfbachmann (Member):

Just realized that you have lines.append(line.strip()), but you already stripped the line so there's no need to call strip() again.

@bobheadxi (Member, Author) replied Oct 25, 2017:

ay good catch - fixed in b865341

@@ -19,7 +19,7 @@ def strip_content(data):
         return lines
     except Exception:
         # if page is not a webpage, catch errors on attempted parse
-        return None
+        return [""]
@bfbachmann (Member):

Why not return an empty list?

@bobheadxi (Member, Author) replied Oct 25, 2017:

True, empty lines get discarded in the pipeline anyway; I'll fix that.
edit: fixed in b865341
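
For reference, a minimal sketch of strip_content with both fixes applied (single strip(), empty-list return); the real implementation may differ in how it pulls text out of the response body:

def strip_content(data):
    """Split a raw response body into stripped, non-empty lines of text."""
    try:
        text = data.decode("utf-8")  # assumes a UTF-8 body
        lines = []
        for line in text.splitlines():
            line = line.strip()
            if line:
                lines.append(line)  # already stripped above, no second strip()
        return lines
    except Exception:
        # if page is not a webpage, catch errors on attempted parse
        return []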

titles = title.split('|')
if len(titles) == 2:
    title = titles[0].strip()
    site_title = titles[1].strip()
desc = utils.extract_element(response.xpath("//meta[@name='description']/@content"), 0)
raw_content = utils.strip_content(response.body)
@bfbachmann (Member):

In the case where there is an error in strip_content it will return an array with an empty string, so we should probably check for that and not try to create a ScrapyGenericPage() in that case.
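
Something along these lines, maybe (ScrapyGenericPage and its field names here are assumptions based on the surrounding diff, not the actual model):

def build_generic_page(response, title, site_title, desc):
    # Hypothetical helper showing the suggested guard; field names are assumed.
    raw_content = utils.strip_content(response.body)
    if not raw_content or raw_content == [""]:
        # strip_content hit a parse error (e.g. a PDF), so skip the item
        return None
    return ScrapyGenericPage(
        url=response.url,
        title=title,
        site_title=site_title,
        description=desc,
        raw_content=raw_content,
    )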

@bobheadxi (Member, Author) replied Oct 25, 2017:

@bfbachmann My goal with strip_content returning an empty string was to save non-webpage links, since PDFs and other files are still somewhat relevant search results; if you don't think that's a good idea, I'll change it. From what I have seen, this is the only case where strip_content errors, since Scrapy itself catches pretty much every other edge case.

@bobheadxi (Member, Author):

I am going to leave this in for now, since we might want to handle other file types properly in the future

return ""
@bfbachmann (Member):

I think it would be cleaner here to just check the item_list length before trying to access it at a particular index, thereby avoiding a try-catch block altogether.
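
For example, something like this (the extract_element signature is inferred from its call sites in this PR):

def extract_element(item_list, index):
    """Return the extracted text at the given index, or "" if the list is too short."""
    # Length check instead of wrapping the index access in try/except.
    if len(item_list) > index:
        return item_list[index].extract()
    return ""

where item_list is the SelectorList returned by response.xpath(...).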

@bobheadxi (Member, Author):

fixed in b865341

@coveralls

Coverage Status

Coverage increased (+2.5%) to 85.079% when pulling b865341 on 48-scraper-tests into c85f6d0 on master.

@bobheadxi merged commit 818a6e8 into master on Oct 25, 2017
@bobheadxi deleted the 48-scraper-tests branch on October 26, 2017 at 03:00