Skip to content

w4p/w4parser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

89 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

W4Parser - Java HTML/XML to POJO parser

W4Parser - is a Java library for working with real-world HTML data and transform required part of data to simple java object(POJO). It provides a very convenient API for extracting data and based on popular java parser Jsoup.

W4Parser repo

<dependency>
    <groupId>com.github.w4p</groupId>
    <artifactId>w4parser</artifactId>
    <version>1.0.1</version>
</dependency>

Example

Quick steps to parse HTML data to java object.

  1. Prepare java object
    public class BBC {
    
        @W4Parse(select = "li[class=media-list__item media-list__item--1]")
        private BBCNews mainNews;
    
        public static class BBCNews {
    
            @W4Parse(select = "h3.media__title/a")
            private String title;
    
            @W4Parse(select = "p.media__summary")
            private String desc;
        }
    }
  1. Run parser
BBC bbc = W4Parser.url("http://www.bbc.com/", BBC.class).get();

Well done! We already fetched data from BBC website and got BBC class object. It is very simple.

For supported "select" options in W4Parser annotation please read the jsoup docs.

Open source

W4Parser is an open source project distributed under the liberal MIT license.

Status

W4Parser now under active development and has release status.

Advanced usage

For example we want to fetch additional data from external url. In this case we can use W4Fetch annotation. For example:

public class BBC {

    @W4Fetch(href = @W4Parse(select = "//section.module--promo//li[class='media-list__item media-list__item--1']//a.media__link/@href"))
    private BBCNews mainNews;

    public static class BBCNews {

        @W4Parse(select = "h1.story-body__h1")
        private String title;

        @W4Parse(select = "div.story-body__inner")
        private String fulltext;
    }
}
///.......
BBC bbc = W4Parser.url("http://www.bbc.com/", BBC.class).get();

In this case W4Parser parse the top page and find all links with selected rules @W4Parse(select = "//section.module--promo//li[class='media-list__item media-list__item--1']//a.media__link/@href") . Then fetch data from this link and parse page with BBCNews class.

or we can fecth the list of remote pages

public class BBC {

    @W4Fetch(href = @W4Parse(select = "//section.module--promo//a.media__link/@href"),
            maxFetch = 5)
    private List<BBCNews> mainNews;

    public static class BBCNews {

        @W4Parse(select = "h1.story-body__h1")
        private String title;

        @W4Parse(select = "div.story-body__inner")
        private String fulltext;
    }
}
///.......
BBC bbc = W4Parser.url("http://www.bbc.com/", BBC.class).get();

where maxFetch - is limit for W4Parser

W4Parser support for predefined remote url too

public class BBC {

    @W4Fetch(url="http://www.bbc.com/news/world-australia-40822310")
    private BBCNews mainNews;

    public static class BBCNews {

        @W4Parse(select = "h1.story-body__h1")
        private String title;

        @W4Parse(select = "div.story-body__inner")
        private String fulltext;
    }
}
///.......
BBC bbc = W4Parser.url("http://www.bbc.com/", BBC.class).get();

Also with W4Parser we can fetch remote & parse remote pages asynchronously

W4QueueResult result = W4Parser
                .url("http://www.bbc.com/", BBC.class)
                .url("http://www.cnn.com/", CNN.class)
                .run();

or fully async implementation with promise

W4Parser
    .url("http://www.bbc.com/", BBC.class)
    .url("http://www.cnn.com/", CNN.class)
    .run((result) -> {
        //Process W4Parser results.
    });

and what about progress of our task queue. No problem

W4Parser
    .url("http://www.bbc.com/", BBC.class)
    .url("http://www.cnn.com/", CNN.class)
    .onProgress((taskResult) -> {
        //Here we can manipulate with completed task results
    })
    .run((result) -> {
         //Process W4Parser results.
     });

Used by

4weiver

Author

Vadim Dobroskok