Added new crawler attribute for finer control over Media Type detection
New "Media Type detection" section in the advanced crawl start page allows choosing between:
- not loading URLs with an unknown or unsupported file extension, without checking the actual Media Type (relying on the Content-Type header for now). This was the old default behavior: faster, but not really accurate.
- always cross-checking the URL file extension against the actual Media Type. This makes it possible to properly parse URLs that end with an apparently odd file extension but actually have a supported Media Type such as text/html.

Sample URLs with misleading file extensions are added as documentation in the crawl start page.

Fixes issue #244
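The two options can be illustrated with a minimal, standalone sketch. It is not the code from this commit: the class name, the `Strategy` enum, the method names, and the sets of supported extensions and Media Types are all hypothetical, and the handling of URLs without any extension is an assumption made only for the example.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not YaCy code: every name below is an assumption.
final class MediaTypeDetectionSketch {

    /** The two crawl options described in the commit message. */
    enum Strategy { EXTENSION_ONLY, CROSS_CHECK }

    // Illustrative sets only; the real parser registry is much larger.
    static final Set<String> SUPPORTED_EXTENSIONS = Set.of("html", "htm", "txt", "pdf");
    static final Set<String> SUPPORTED_MEDIA_TYPES =
            Set.of("text/html", "text/plain", "application/pdf");

    /** Lower-cased extension of the last path segment, or "" when there is none. */
    static String extensionOf(String urlPath) {
        String path = urlPath;
        int cut = path.indexOf('?');
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('#');
        if (cut >= 0) path = path.substring(0, cut);
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        return dot < 0 ? "" : lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Strips parameters such as "; charset=UTF-8" from a Content-Type header value. */
    static String baseMediaType(String contentType) {
        if (contentType == null) return null;
        int semi = contentType.indexOf(';');
        String base = semi < 0 ? contentType : contentType.substring(0, semi);
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    /** Assumption: a URL without any extension counts as acceptable. */
    static boolean extensionAccepted(String urlPath) {
        String ext = extensionOf(urlPath);
        return ext.isEmpty() || SUPPORTED_EXTENSIONS.contains(ext);
    }

    /** Decision taken before fetching: may the URL be loaded at all? */
    static boolean shouldLoad(String urlPath, Strategy strategy) {
        if (strategy == Strategy.CROSS_CHECK) {
            return true; // always load, then check the actual Media Type
        }
        return extensionAccepted(urlPath); // old behavior: reject early
    }

    /** Decision taken after fetching: should the response be parsed? */
    static boolean shouldParse(String urlPath, String contentType, Strategy strategy) {
        String mediaType = baseMediaType(contentType);
        if (strategy == Strategy.CROSS_CHECK && mediaType != null) {
            return SUPPORTED_MEDIA_TYPES.contains(mediaType);
        }
        return extensionAccepted(urlPath); // fall back to the extension
    }
}
```

With the extension-only strategy, a path such as `/article.de` is rejected before any request is made; with cross-checking it is loaded and then parsed if the server reports `text/html`. The cost is one extra request for each URL that turns out to be unsupported, which is why the extension-only option is described as faster.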
Showing with 187 additions and 45 deletions.
- +1 −0 htroot/CrawlProfileEditor_p.xml
- +21 −0 htroot/CrawlStartExpert.html
- +7 −0 htroot/CrawlStartExpert.java
- +3 −1 htroot/Crawler_p.java
- +13 −6 source/net/yacy/crawler/CrawlStacker.java
- +23 −2 source/net/yacy/crawler/data/CrawlProfile.java
- +12 −4 source/net/yacy/crawler/data/NoticedURL.java
- +6 −2 source/net/yacy/crawler/retrieval/Response.java
- +30 −2 source/net/yacy/document/TextParser.java
- +71 −28 source/net/yacy/search/Switchboard.java