Closes #48: Unit tests for crawler and new crawler modules, improved scraping #52
Conversation
Compared 117c081 to 05d6fed
Looks good, just a few minor suggestions.
@@ -0,0 +1,3 @@
[report]
omit =
    */tests/*
What does this do?
Never mind, just read your description.
@@ -1,50 +1,64 @@
import scrapy
import sleuth_crawler.scraper.scraper.spiders.parsers.utils as utils
I think you can just do `from sleuth_crawler.scraper.scraper.spiders.parsers import utils`
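For what it's worth, both spellings bind the same module object; the `from ... import` form just avoids repeating the full dotted path. A quick illustration with a stdlib package as a stand-in (since `sleuth_crawler` isn't importable outside the project):

```python
# Both forms resolve to the same module object; only the local
# name differs. "os.path" stands in for the project package here.
import os.path as long_form
from os import path as short_form

assert long_form is short_form
```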
@@ -19,7 +19,7 @@ def strip_content(data):
        return lines
Just realized that you have `lines.append(line.strip())`, but you already stripped the line so there's no need to call `strip()` again.
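The redundancy is easy to confirm: `str.strip()` is idempotent, so stripping an already-stripped string is a no-op. A minimal check:

```python
# str.strip() is idempotent: a second strip() on an
# already-stripped string returns an equal string.
line = "  some text  ".strip()
assert line.strip() == line
```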
ay good catch - fixed in b865341
@@ -19,7 +19,7 @@ def strip_content(data):
        return lines
    except Exception:
        # if page is not a webpage, catch errors on attempted parse
-       return None
+       return [""]
Why not return an empty list?
True, empty lines get discarded anyway in the pipeline, I'll fix that.
edit: fixed in b865341
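One practical difference between the two return values, for any caller that truth-tests the result: a list containing a single empty string is truthy, while an empty list is falsy, so only the empty-list version works with a plain `if raw_content:` guard.

```python
# [""] is truthy (it has one element), [] is falsy, so a simple
# `if raw_content:` check only behaves as intended when the
# error path returns an empty list.
assert bool([""]) is True
assert bool([]) is False
```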
titles = title.split('|')
if len(titles) == 2:
    title = titles[0].strip()
    site_title = titles[1].strip()
desc = utils.extract_element(response.xpath("//meta[@name='description']/@content"), 0)
raw_content = utils.strip_content(response.body)
In the case where there is an error in `strip_content` it will return an array with an empty string, so we should probably check for that and not try to create a `ScrapyGenericPage()` in that case.
@bfbachmann My goal with `strip_content` returning an empty string was to save non-webpage links, since PDFs/other files are still somewhat relevant search results, though if you don't think that's a good idea I'll change it. From what I have seen this is the only case where `strip_content` errors, since Scrapy itself catches pretty much every other edge case.
I am going to leave this in for now, since we might want to handle other file types properly in the future
return ""
I think it would be cleaner here to just check the `item_list` length before trying to access it at a particular index, thereby avoiding a try-catch block altogether.
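A sketch of what that length check might look like, assuming the helper keeps the `return ""` fallback shown above (the exact signature of `utils.extract_element` is inferred from the diff, not confirmed):

```python
def extract_element(item_list, index):
    # Bounds check instead of a try/except IndexError block;
    # falls back to "" when the index is out of range.
    if 0 <= index < len(item_list):
        return item_list[index]
    return ""

assert extract_element(["a", "b"], 1) == "b"
assert extract_element([], 0) == ""
```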
fixed in b865341
Related Issue
#48, #49
This also grew a little out of scope, oops
Description
- `generic_page_parser` to get site name and page name if available, both of which are now inserted into Solr
- `pageTitle` from Solr model and tests (the other two `name` fields are enough I think)
- `utils`
- `.coveragerc` which should hopefully exclude coverage of our tests from overall coverage

WIKI Updates
Todos
General: