Cidean WebScraper

Crawls a series of URLs and extracts content from each page into an XML file. The original goal of this project was to extract test data for prototype projects.

The solution is split into multiple projects.

Cidean.WebScraper.Core

Contains the core processing logic for scraping data from a webpage.

Cidean.WebScraper.Runner

A console application that uses Cidean.WebScraper.Core to run the scrape process.

Usage:

webscrape -m "example-datamap.xml" -o "example-output.xml"

Other

Other projects will be added to showcase using the scraper in a web application.

Configuration

The configuration for a given scrape/extract is stored in a Data Map (DataMap) file in XML format. Each data item in the map has the following attributes.

Name

Output name of the data item.

Type

Kind of content to extract: Text, Link, Image, or List.

Path

CSS selector path of the element.

<DataMap Name="Amazon">
  <Urls>
    <Url><![CDATA[amazon-mystery-list.html]]></Url>
    <Url><![CDATA[https://web.archive.org/web/20150616214557/http://www.amazon.com/gp/bestsellers/books/18]]></Url>
  </Urls>
  <DataMapItems>
    <DataMapItem Type="text" Path="#zg_listTitle" Name="Title"/>
    <DataMapItem Type="list" Path=".zg_itemImmersion" ListName="Books" Name="Book">
      <DataMapItems>
        <DataMapItem Type="text" Path=".zg_rankDiv" Name="Rank"/>
        <DataMapItem Type="text" Path=".zg_title" Name="Title"/>
        <DataMapItem Type="text" Path=".zg_byline" Name="Byline"/>
        <DataMapItem Type="text" Path=".price" Name="Price"/>
        <DataMapItem Type="image" Path="img" Name="Thumb"/>
      </DataMapItems>
    </DataMapItem>
  </DataMapItems>
</DataMap>
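
To make the nesting in the Data Map above easier to see, here is a minimal Python sketch that reads a trimmed copy of the example and turns it into nested dictionaries. It is not the project's own loader (that lives in Cidean.WebScraper.Core), and the helper names `read_items` and `load_data_map` are hypothetical. It only shows how a list item carries its own child DataMapItems.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example Data Map from this README.
DATA_MAP_XML = """
<DataMap Name="Amazon">
  <Urls>
    <Url><![CDATA[amazon-mystery-list.html]]></Url>
  </Urls>
  <DataMapItems>
    <DataMapItem Type="text" Path="#zg_listTitle" Name="Title"/>
    <DataMapItem Type="list" Path=".zg_itemImmersion" ListName="Books" Name="Book">
      <DataMapItems>
        <DataMapItem Type="text" Path=".zg_rankDiv" Name="Rank"/>
        <DataMapItem Type="image" Path="img" Name="Thumb"/>
      </DataMapItems>
    </DataMapItem>
  </DataMapItems>
</DataMap>
"""


def read_items(parent):
    """Return the DataMapItem elements directly under `parent` as nested dicts."""
    items = []
    # Only follow the direct <DataMapItems> child so nested levels stay nested.
    for element in parent.findall("./DataMapItems/DataMapItem"):
        items.append({
            "name": element.get("Name"),
            "type": element.get("Type"),
            "path": element.get("Path"),
            "list_name": element.get("ListName"),  # None when the attribute is absent
            "children": read_items(element),
        })
    return items


def load_data_map(xml_text):
    """Parse a Data Map document into a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.get("Name"),
        "urls": [url.text for url in root.findall("./Urls/Url")],
        "items": read_items(root),
    }


data_map = load_data_map(DATA_MAP_XML)
for item in data_map["items"]:
    print(item["type"], item["name"], item["path"], len(item["children"]))
```

In this sketch the top-level "Title" item has no children, while the "Book" list item holds the "Rank" and "Thumb" items that are selected relative to each element matched by `.zg_itemImmersion`.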
