Crawls a series of URLs and extracts content from each page into an XML file. The original goal of this project was to extract test data for prototype projects.
The solution is split into multiple projects:
Cidean.WebScraper.Core: contains the core processing logic for scraping data from a web page.
Uses Cidean.WebScraper.Core to run the scrape process from a console application.
Usage:
webscrape -m "example-datamap.xml" -o "example-output.xml"
Further projects showing how to use the scraper in a web application will be added.
The configuration for a given scrape/extract is stored in a Data Map (DataMap) file in XML format. Each DataMapItem takes the following attributes:
Name: Output name of the data item
Type: Text, Link, Image, or List
Path: CSS selector path of the element
<DataMap Name="Amazon">
  <Urls>
    <Url><![CDATA[amazon-mystery-list.html]]></Url>
    <Url><![CDATA[https://web.archive.org/web/20150616214557/http://www.amazon.com/gp/bestsellers/books/18]]></Url>
  </Urls>
  <DataMapItems>
    <DataMapItem Type="text" Path="#zg_listTitle" Name="Title"/>
    <DataMapItem Type="list" Path=".zg_itemImmersion" ListName="Books" Name="Book">
      <DataMapItems>
        <DataMapItem Type="text" Path=".zg_rankDiv" Name="Rank"/>
        <DataMapItem Type="text" Path=".zg_title" Name="Title"/>
        <DataMapItem Type="text" Path=".zg_byline" Name="Byline"/>
        <DataMapItem Type="text" Path=".price" Name="Price"/>
        <DataMapItem Type="image" Path="img" Name="Thumb"/>
      </DataMapItems>
    </DataMapItem>
  </DataMapItems>
</DataMap>
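
To make the nesting in the Data Map above easier to see, here is a minimal Python sketch that reads a trimmed copy of it into plain dictionaries. It is not the scraper's own code and says nothing about how Cidean.WebScraper.Core loads a Data Map; the function names `read_items` and `load_datamap` and the dictionary keys are hypothetical, chosen only for illustration. It uses the standard library's `xml.etree.ElementTree` and does no scraping.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example Data Map (one URL, fewer items) so the
# sketch is self-contained.
DATAMAP_XML = """
<DataMap Name="Amazon">
  <Urls>
    <Url><![CDATA[amazon-mystery-list.html]]></Url>
  </Urls>
  <DataMapItems>
    <DataMapItem Type="text" Path="#zg_listTitle" Name="Title"/>
    <DataMapItem Type="list" Path=".zg_itemImmersion" ListName="Books" Name="Book">
      <DataMapItems>
        <DataMapItem Type="text" Path=".zg_rankDiv" Name="Rank"/>
        <DataMapItem Type="image" Path="img" Name="Thumb"/>
      </DataMapItems>
    </DataMapItem>
  </DataMapItems>
</DataMap>
"""


def read_items(parent):
    """Return the DataMapItem elements directly under `parent` as dicts.

    Hypothetical helper: items of Type "list" are read recursively, so
    their child items end up under the "children" key.
    """
    items = []
    for node in parent.findall("./DataMapItems/DataMapItem"):
        item = {
            "name": node.get("Name"),
            "type": node.get("Type"),
            "path": node.get("Path"),
        }
        if node.get("Type") == "list":
            item["list_name"] = node.get("ListName")
            item["children"] = read_items(node)
        items.append(item)
    return items


def load_datamap(xml_text):
    """Parse a Data Map document into a dict (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("Name"),
        # CDATA content is exposed as ordinary element text.
        "urls": [(u.text or "").strip() for u in root.findall("./Urls/Url")],
        "items": read_items(root),
    }


datamap = load_datamap(DATAMAP_XML)
```

In this sketch `datamap["items"]` holds two entries: the page-level Title item and the Book list, whose `children` are the per-book items selected relative to each `.zg_itemImmersion` element.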