Closes #48: Unit tests for crawler and new crawler modules, improved scraping #52

Merged
7 commits merged into master from 48-scraper-tests on Oct 25, 2017

Conversation

@bobheadxi (Member) commented Oct 23, 2017

Related Issues

#48, #49
This also grew a little out of scope, oops

Description

  • Unit tests + test data, improved documentation for the new crawler modules added in #45 (Closes #44, closes #37: Scraper upgrades)
  • Updated generic_page_parser to get site name and page name if available, both of which are now inserted into Solr (see the sketch after this list)
  • Removed pageTitle from Solr model and tests (the other two name fields are enough I think)
  • Improved parser robustness with safer xpath extracts and other tweaks to parser utils
  • Added .coveragerc which should hopefully exclude coverage of our tests from overall coverage
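
A rough sketch of the new site/page name handling mentioned above (illustrative only; the real logic lives inline in generic_page_parser and the helper name here is hypothetical):

def split_title(full_title):
    """Split a raw <title> like "Contact Us | Example Site" into page and site names."""
    title = full_title.strip()
    site_title = ""
    parts = full_title.split('|')
    if len(parts) == 2:
        title = parts[0].strip()
        site_title = parts[1].strip()
    return title, site_title

For example, split_title("Contact Us | Example Site") returns ("Contact Us", "Example Site"), while a title without a '|' separator comes back with an empty site name.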

WIKI Updates

  • Another day, another Schema update

Todos

General:

  • Tests 🔥🔥🔥
  • Documentation
  • Wiki

@coveralls

Coverage Status

Coverage increased (+0.7%) to 83.228% when pulling 36cb5dd on 48-scraper-tests into c85f6d0 on master.

@coveralls

Coverage Status

Coverage increased (+1.3%) to 83.861% when pulling 117c081 on 48-scraper-tests into c85f6d0 on master.

@coveralls

Coverage Status

Coverage increased (+2.6%) to 85.127% when pulling 05d6fed on 48-scraper-tests into c85f6d0 on master.

@coveralls

Coverage Status

Coverage increased (+2.6%) to 85.127% when pulling bd7c60b on 48-scraper-tests into c85f6d0 on master.

@bfbachmann (Member) left a comment

Looks good, just a few minor suggestions.

@@ -0,0 +1,3 @@
[report]
omit =
    */tests/*
@bfbachmann (Member):

What does this do?

@bfbachmann (Member):

Never mind, just read your description.

@@ -1,50 +1,64 @@
import scrapy
import sleuth_crawler.scraper.scraper.spiders.parsers.utils as utils
@bfbachmann (Member):

I think you can just do

from sleuth_crawler.scraper.scraper.spiders.parsers import utils

@@ -19,7 +19,7 @@ def strip_content(data):
return lines
@bfbachmann (Member):

Just realized that you have lines.append(line.strip()), but you already stripped the line so there's no need to call strip() again.

@bobheadxi (Member, Author) replied Oct 25, 2017:

ay good catch - fixed in b865341

@@ -19,7 +19,7 @@ def strip_content(data):
         return lines
     except Exception:
         # if page is not a webpage, catch errors on attempted parse
-        return None
+        return [""]
@bfbachmann (Member):

Why not return an empty list?

@bobheadxi (Member, Author) replied Oct 25, 2017:

True, empty lines get discarded in the pipeline anyway; I'll fix that.
edit: fixed in b865341
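
For reference, a minimal sketch of strip_content with both fixes applied (single strip(), empty-list return); the real implementation may differ in how it pulls text out of the response body:

def strip_content(data):
    """Split a raw response body into stripped, non-empty lines of text."""
    try:
        text = data.decode("utf-8")  # assumes a UTF-8 body
        lines = []
        for line in text.splitlines():
            line = line.strip()
            if line:
                lines.append(line)  # already stripped above, no second strip()
        return lines
    except Exception:
        # if page is not a webpage, catch errors on attempted parse
        return []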

titles = title.split('|')
if len(titles) == 2:
    title = titles[0].strip()
    site_title = titles[1].strip()
desc = utils.extract_element(response.xpath("//meta[@name='description']/@content"), 0)
raw_content = utils.strip_content(response.body)
@bfbachmann (Member):

In the case where there is an error in strip_content it will return an array with an empty string, so we should probably check for that and not try to create a ScrapyGenericPage() in that case.
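
Something along these lines, maybe (ScrapyGenericPage and its field names here are assumptions based on the surrounding diff, not the actual model):

def build_generic_page(response, title, site_title, desc):
    # Hypothetical helper showing the suggested guard; field names are assumed.
    raw_content = utils.strip_content(response.body)
    if not raw_content or raw_content == [""]:
        # strip_content hit a parse error (e.g. a PDF), so skip the item
        return None
    return ScrapyGenericPage(
        url=response.url,
        title=title,
        site_title=site_title,
        description=desc,
        raw_content=raw_content,
    )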

@bobheadxi (Member, Author) replied Oct 25, 2017:

@bfbachmann My goal with strip_content returning an empty string was to save non-webpage links, since PDFs and other files are still somewhat relevant search results; if you don't think that's a good idea, I'll change it. From what I have seen, this is the only case where strip_content errors, since Scrapy itself catches pretty much every other edge case.

@bobheadxi (Member, Author):

I am going to leave this in for now, since we might want to handle other file types properly in the future

return ""
@bfbachmann (Member):

I think it would be cleaner here to just check the item_list length before trying to access it at a particular index, thereby avoiding a try-catch block altogether.
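
For example, something like this (the extract_element signature is inferred from its call sites in this PR):

def extract_element(item_list, index):
    """Return the extracted text at the given index, or "" if the list is too short."""
    # Length check instead of wrapping the index access in try/except.
    if len(item_list) > index:
        return item_list[index].extract()
    return ""

where item_list is the SelectorList returned by response.xpath(...).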

@bobheadxi (Member, Author):

fixed in b865341

@coveralls

Coverage Status

Coverage increased (+2.5%) to 85.079% when pulling b865341 on 48-scraper-tests into c85f6d0 on master.

@bobheadxi merged commit 818a6e8 into master on Oct 25, 2017
@bobheadxi deleted the 48-scraper-tests branch on October 26, 2017 at 03:00