scdunn/Cidean.WebScraper

Cidean WebScraper

Crawls a series of URLs and extracts content from each page into an XML file. The original goal of this project was to extract test data for prototype projects.

The solution is split into multiple projects.

Cidean.WebScraper.Core

Contains the core processing logic for scraping data from a webpage.

Cidean.WebScraper.Runner

A console application that uses Cidean.WebScraper.Core to execute the scrape process.

Usage:

webscrape -m "example-datamap.xml" -o "example-output.xml"

Other

Other projects will be added to showcase using the scraper in a web application.

Configuration

The configuration for a given scrape/extract is stored in a Data Map (DataMap) file in XML format. Each DataMapItem in the file has the following attributes.

Name

Output name of data item

Type

Text, Link, Image, List

Path

CSS selector path of the element

<DataMap Name="Amazon">
  <Urls>
    <Url><![CDATA[amazon-mystery-list.html]]></Url>
    <Url><![CDATA[https://web.archive.org/web/20150616214557/http://www.amazon.com/gp/bestsellers/books/18]]></Url>
  </Urls>
  <DataMapItems>
    <DataMapItem Type="text" Path="#zg_listTitle" Name="Title"/>
    <DataMapItem Type="list" Path=".zg_itemImmersion" ListName="Books" Name="Book">
      <DataMapItems>
        <DataMapItem Type="text" Path=".zg_rankDiv" Name="Rank"/>
        <DataMapItem Type="text" Path=".zg_title" Name="Title"/>
        <DataMapItem Type="text" Path=".zg_byline" Name="Byline"/>
        <DataMapItem Type="text" Path=".price" Name="Price"/>
        <DataMapItem Type="image" Path="img" Name="Thumb"/>
      </DataMapItems>
    </DataMapItem>
  </DataMapItems>
</DataMap>
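
To make the nesting of the data map concrete, here is a minimal Python sketch that reads an abbreviated copy of the example above into plain dictionaries. It is not the project's own code (Cidean.WebScraper.Core is a separate .NET implementation whose internals are not shown here); the helper names `read_items` and `load_data_map` and the dictionary keys are hypothetical, chosen only for illustration. It assumes that an item of type `list` carries a nested `DataMapItems` element, as in the example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example data map (assumption: same element and
# attribute names as in the README example).
DATA_MAP_XML = """
<DataMap Name="Amazon">
  <Urls>
    <Url><![CDATA[amazon-mystery-list.html]]></Url>
  </Urls>
  <DataMapItems>
    <DataMapItem Type="text" Path="#zg_listTitle" Name="Title"/>
    <DataMapItem Type="list" Path=".zg_itemImmersion" ListName="Books" Name="Book">
      <DataMapItems>
        <DataMapItem Type="text" Path=".zg_rankDiv" Name="Rank"/>
        <DataMapItem Type="image" Path="img" Name="Thumb"/>
      </DataMapItems>
    </DataMapItem>
  </DataMapItems>
</DataMap>
"""


def read_items(parent):
    """Hypothetical helper: collect the DataMapItem children of `parent`."""
    container = parent.find("DataMapItems")
    if container is None:
        return []
    items = []
    for node in container.findall("DataMapItem"):
        item = {
            "name": node.get("Name"),
            "type": (node.get("Type") or "").lower(),
            "path": node.get("Path"),
        }
        # Assumption: only "list" items contain nested items.
        if item["type"] == "list":
            item["list_name"] = node.get("ListName")
            item["children"] = read_items(node)
        items.append(item)
    return items


def load_data_map(xml_text):
    """Hypothetical helper: parse a data map into a nested dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("Name"),
        "urls": [url.text for url in root.findall("./Urls/Url")],
        "items": read_items(root),
    }


data_map = load_data_map(DATA_MAP_XML)
print(data_map["name"], data_map["urls"])
for item in data_map["items"]:
    print(item["type"], item["name"], item["path"])
```

Reading the map this way shows the two levels in the example: a top-level `Title` text item, and a `Book` list item whose `Path` selects each repeated element and whose child items are presumably resolved relative to that element.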

About

Configurable console application that crawls websites and scrapes content using rules based on CSS selectors.
