Tika cleanup
==========

This notebook experiments with how to clean up a cluttered website and extract only the content that is relevant to the subject the website describes. This means removing things like navigation, cookie walls, footers and other elements that are common on websites but add no informational value.

Merchandising on infonu.nl
---------------------------------------

One webpage that looks promising as an information source is: https://zakelijk.infonu.nl/marketing/4984-merchandising-verhoogt-je-omzet.html
The raw Tika output for this page is stored as ``merchandising-verhoogt-je-omzet.tika-raw.json`` in the directory of this notebook, and its XML content is stored in ``merchandising-verhoogt-je-omzet.tika-xml-content.html``.

Looking at the XML, it seems like a good idea to determine which headers belong to the article and which headers are generic site headers. If we only take content below the article headers and stop when we encounter a site header, the text should be much cleaner than taking all text (which is the current default for the harvester).
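
To make the idea concrete, here is a minimal sketch of how site headers could be told apart from article headers: a header that shows up in most documents of a domain is probably part of the site chrome. This is not the code in ``transformer.py``; the names ``extract_headers`` and ``find_site_headers`` and the 0.5 threshold are assumptions for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderCollector(HTMLParser):
    """Collects the text of every <h1>..<h6> element in a document."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._in_header = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._in_header = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_header:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self._in_header:
            self.headers.append("".join(self._buffer).strip())
            self._in_header = False


def extract_headers(xml):
    """Return the header texts of one document, in document order."""
    collector = HeaderCollector()
    collector.feed(xml)
    return collector.headers


def find_site_headers(documents, threshold=0.5):
    """Return headers that occur in more than `threshold` of the documents.

    The threshold is an assumed value; it would need tuning per dataset.
    """
    counts = Counter()
    for xml in documents:
        # Count each header once per document, however often it repeats.
        counts.update(set(extract_headers(xml)))
    return {header for header, n in counts.items() if n / len(documents) > threshold}
```

Headers returned by ``find_site_headers`` would then act as stop markers, while all other headers mark the start of article content.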

Input data
--------------

We're selecting data based on a few requirements:
* It's a website. For PDFs we could probably take a similar approach, but since there is no such thing as a PDF "domain" we would have to find another way to group similar PDFs together
* It's not a homepage. We want to extract information from a detail page. It's ambiguous what information we want to collect from homepages and which headers are truly relevant

There are 4386 documents in Edusources harvester that match these criteria. Here is a list of the top 20 domains of the webpages that match our criteria:

* wur.yuja.com (671)
* youtube.com (639)
* dekooktips.com (178)
* economiehulp.nl (118)
* ocw.tudelft.nl (109)
* maken.wikiwijs.nl (92)
* wiki.utwente.nl (77)
* infonu.nl (63)
* food-info.net (58)
* leren.nl (56)
* hyfoma.com (53)
* vimeo.com (53)
* aa1car.com (41)
* xoteur.12change.eu (41)
* libguides.vu.nl (40)
* huidinfo.nl (39)
* nrc.nl (36)
* turnstep.com (35)
* video.vu.nl (34)
* libguides.ru.nl (30)

youtube.com and youtu.be pages have been merged under a single domain, and so have all \*.infonu.nl pages. All the HttpTikaResources from Edusources can be found in the ``http-tika-resources.json`` file. It is too large to commit at the moment, and because of its size the file uses the JSON Lines format.
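
The domain merging could look roughly like the sketch below. The function ``normalize_domain``, the alias table and the stripping of a leading ``www.`` are assumptions for illustration; the transformer has its own domain parsing, which may behave differently.

```python
from urllib.parse import urlparse

# Assumed alias table: short-link hosts that map onto a main domain.
DOMAIN_ALIASES = {"youtu.be": "youtube.com"}
# Assumed list of sites whose subdomains are grouped under one domain.
MERGED_SUFFIXES = ("infonu.nl",)


def normalize_domain(url):
    """Map a URL onto the domain used for grouping similar webpages."""
    host = urlparse(url).hostname or ""
    # Assumption: a leading "www." carries no meaning for grouping.
    if host.startswith("www."):
        host = host[4:]
    host = DOMAIN_ALIASES.get(host, host)
    for suffix in MERGED_SUFFIXES:
        if host == suffix or host.endswith("." + suffix):
            return suffix
    return host
```

With this, ``zakelijk.infonu.nl`` and ``mens-en-samenleving.infonu.nl`` both end up under ``infonu.nl``, while domains such as ``ocw.tudelft.nl`` are left untouched.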

In [1]:
import json
from urllib.parse import urlparse

Transformer for infonu.nl
------------------------------------

In ``transformer.py`` we've made a transformer that looks at the headers of websites during the fit phase. It then transforms any Tika output into a list of strings that contains only the texts we want.

To create this transformer I've used examples from infonu.nl. Later we'll take a look at how it performs on other domains.

In [2]:
from transformer import TikaWebsiteContentTransformer

In [3]:
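def load_resources(data):
    """Add the decoded Tika output and the requested URL to each resource, in place."""
    for resource in data:
        # The body field holds the Tika output as a JSON-encoded string.
        resource["content"] = json.loads(resource["fields"]["body"])
        # The first positional request argument is stored as the resource URL.
        resource["url"] = resource["fields"]["request"]["args"][0]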

In [4]:
with open("infonu-tika-resources.json", "r") as tika_resources:
    data = json.load(tika_resources)

load_resources(data)

In [5]:
train = data[:-10]
infonu = TikaWebsiteContentTransformer()
infonu.fit([("infonu.nl", rsc["content"][0]["X-TIKA:content"]) for rsc in train])

In [6]:
test = data[-10:]
for rsc in test:
    print(rsc["url"])
    print(infonu.transform(("infonu.nl", rsc["content"][0]["X-TIKA:content"])))
    print("*"*80)

http://mens-en-samenleving.infonu.nl/filosofie/973-le-sacre-du-printemps-als-uiting-van-modernisme.html
["'Le Sacre du printemps' als uiting van Modernisme", "'Le Sacre du printemps' van Igor Stravinski kan gezien worden als het begin van het modernisme. Om deze aanname te kunnen beargumenteren moet het ballet besproken worden zowel als de kenmerken van het modernisme en deze tegen elkaar uitgezet worden. Op deze wijze kan het begrip modernisme makkelijk uitgelegd worden.", 'Le Sacre du printemps', "'Het Lenteoffer' is een ballet van Sergej Diaghilev's Ballets Russes op muziek van de Russische componist Igor Stravinsky. De première vond plaats op 29 mei 1913 in het Théâtre des Champs-Elysées te Parijs.\n\nDe voorstelling is wat betreft muziek als choreografie een onconventioneel ballet. Dit ligt vooral in het feit dat dit stuk niet voldeed aan de strenge convensies die er bestonden voor dansvoorstellingen. De etiquette voor ballet werden volledig geschonden op meerdere gebieden: de man

Other domains
---------------------

The output above looks pretty solid to me. It still has a bit of copyright notification at the bottom, but all site navigation has been stripped and it didn't lose any of the website content.

To see how the transformer works on other domains we'll make a ``samples/headers_only`` directory. From each domain we take the first Tika output. Then we save the Tika output next to the transformer's output. After this we can compare the files in the samples directory to see how the transformer is doing.

In [7]:
data = []
with open("http-tika-resources.json", "r") as tika_resources:
    # JSON Lines: one resource per line, read lazily instead of all at once.
    for line in tika_resources:
        data.append(json.loads(line))

load_resources(data)

In [8]:
train = data[:-10]
transformer = TikaWebsiteContentTransformer()
transformer.fit([(urlparse(rsc["url"]).netloc, rsc["content"][0]["X-TIKA:content"],) for rsc in train])

In [9]:
sites_of_interest = {
    "wur.yuja.com",
    "youtube.com",
    "dekooktips.com",
    "economiehulp.nl",
    "ocw.tudelft.nl",
    "maken.wikiwijs.nl",
    "wiki.utwente.nl",
    "infonu.nl",
    "food-info.net",
    "leren.nl",
    "hyfoma.com",
    "vimeo.com",
    "aa1car.com",
    "xoteur.12change.eu",
    "libguides.vu.nl",
    "huidinfo.nl",
    "nrc.nl",
    "turnstep.com",
    "video.vu.nl",
    "libguides.ru.nl",
}

In [10]:
seen = set()
for rsc in data:
    domain = transformer._parse_domain(urlparse(rsc["url"]).netloc)
    # Only keep the first resource of each domain we are interested in.
    if domain not in sites_of_interest or domain in seen:
        continue
    seen.add(domain)
    xml = rsc["content"][0]["X-TIKA:content"]
    # Save the raw Tika XML next to the transformer's output for comparison.
    with open(f"samples/headers_only/{domain}.html", "w") as xml_file:
        xml_file.write(xml)
    with open(f"samples/headers_only/{domain}.txt.json", "w") as json_file:
        text = transformer.transform((domain, xml))
        json.dump(text, json_file, indent=4)

Samples of domains
-----------------------------

Here's a quick rundown of the top 20 most popular domains in the dataset.

#### aa1car.com 
Looks solid to me

#### dekooktips.com
For this domain we could consider stripping out \<a\> tags, or at least process them similarly to the \<h*\> tags. That would help a lot with removing the navigation I think.

#### economiehulp.nl
This is a typical example of a "learning material" that is not a very useful learning material. At the end our procedure picks up on a sentence or two that should not be there. For the rest the content is not very good, because the website is not very informative (it only redirects to other sources).

#### food-info.net
The difficulty with this example is that it holds a number of tables with nutritional values. They are part of the content, but they are also more or less an illustration. I can't really imagine that the number of grams is useful for search or as content that helps to classify. Ingredients are maybe useful, but we should consider removing tables with a high density of numbers.
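
A rough way to measure "density of numbers" is the share of whitespace-separated tokens that are numeric. The sketch below assumes the text of a table has already been extracted as a string; ``numeric_density``, ``is_mostly_numbers`` and the 0.5 threshold are hypothetical and would need tuning against real tables.

```python
import re

# Matches tokens such as "12", "3.5", "1,5" or "40%".
NUMERIC_TOKEN = re.compile(r"\d+(?:[.,]\d+)*%?")


def numeric_density(text):
    """Return the fraction of whitespace-separated tokens that are numeric."""
    tokens = text.split()
    if not tokens:
        return 0.0
    numeric = sum(1 for token in tokens if NUMERIC_TOKEN.fullmatch(token))
    return numeric / len(tokens)


def is_mostly_numbers(table_text, threshold=0.5):
    """Flag table text whose numeric density exceeds the assumed threshold."""
    return numeric_density(table_text) > threshold
```

An ingredient list with an occasional quantity stays below the threshold, while a table of nutritional values is likely to exceed it.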

#### huidinfo.nl
This example is a redirect. The redirect itself does not contain text and therefore Tika is not picking up anything. We could look at the "refresh" meta tag in the head to see where the redirect goes (at least in this case). On that page there seems to be some interesting content.
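
Reading the redirect target from a refresh meta tag could be sketched as follows. This assumes we have the original HTML of the page at hand (I have not checked whether the tag survives in the Tika output); ``find_meta_refresh`` is a hypothetical helper and the URLs in the usage note are made up.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remembers the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=/some/page.html".
        match = re.search(
            r"url\s*=\s*['\"]?([^'\";]+)",
            attributes.get("content", ""),
            re.IGNORECASE,
        )
        if match:
            self.target = match.group(1).strip()


def find_meta_refresh(html, base_url):
    """Return the absolute redirect URL, or None when there is no refresh tag."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return urljoin(base_url, parser.target) if parser.target else None
```

For example, a page at ``https://example.org/old.html`` containing ``<meta http-equiv="refresh" content="0; url=/new/page.html">`` resolves to ``https://example.org/new/page.html``, which could then be sent through Tika instead.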

#### hyfoma.com
Similar to dekooktips.com, this domain would be helped by processing \<a\> similarly to how we process \<h*\> to remove navigation trees. A response form would still make it into the content even if we process \<a\> correctly. Perhaps we can sha1 all text that is on a single line and use that as a hash key to make comparisons with other articles on the same domain. The strings can also be used directly as a hash key, but the keys might get very long if we try to put entire paragraphs into the hash map.
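
The hashing idea could be sketched like this, assuming each document is available as plain text with one block of text per line. The helper names and the 0.5 threshold are hypothetical, and the whitespace and case normalization is my own assumption about what should count as "the same line".

```python
import hashlib
from collections import Counter


def line_key(line):
    """Return a fixed-length sha1 key for a line, ignoring case and extra spaces."""
    normalized = " ".join(line.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def count_line_hashes(documents):
    """Count in how many documents of a domain each line occurs."""
    counts = Counter()
    for text in documents:
        keys = {line_key(line) for line in text.splitlines() if line.strip()}
        counts.update(keys)
    return counts


def strip_repeated_lines(text, counts, total_documents, threshold=0.5):
    """Drop lines that occur in more than `threshold` of the documents."""
    kept = []
    for line in text.splitlines():
        if line.strip() and counts[line_key(line)] / total_documents > threshold:
            continue
        kept.append(line)
    return "\n".join(kept)
```

The keys are always 40 hexadecimal characters, no matter how long the paragraph is, which keeps the hash map small compared to using the strings themselves.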

#### infonu.nl
Looks pretty solid to me. It still has a bit of copyright notification at the bottom, but all site navigation has been stripped and it didn't lose any of the website content. Using a hash approach, or an \<a\> approach where we ignore not only the link but also the text immediately around it, would get rid of the copyright notification.

#### leren.nl
This one has a tricky problem. It (mis)uses an \<h1\> tag to include its breadcrumb. This breadcrumb is of course relatively unique across articles, so it gets included in the content text. One strategy that I use elsewhere is to look for the \<title\> within the text. It's good SEO practice to put the title of the article in both the \<title\> and \<h1\> tags. If we ignore anything before the "title" then we will also exclude this breadcrumb. At the end some footer paragraphs get included, which would be removed if we look at repeating texts as well as repeating headers. Looking at the \<a\> tags would also help. I'm surprised that an "Over de auteur" ("About the author") header gets included. Perhaps we could see how often it occurs and whether our threshold is too liberal.
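
A minimal sketch of the title trick, assuming we have the \<title\> of the page and the list of texts the transformer returns. ``strip_before_title`` is a hypothetical helper, and it only handles the simple case where the title and the header match exactly apart from case and surrounding whitespace.

```python
def strip_before_title(title, texts):
    """Drop every text that precedes the first text equal to the page title.

    Returns the texts unchanged when the title is empty or cannot be found.
    """
    needle = title.strip().lower()
    if not needle:
        return texts
    for index, text in enumerate(texts):
        if text.strip().lower() == needle:
            return texts[index:]
    return texts
```

In practice a \<title\> often carries a site name as suffix, so the exact comparison would have to be loosened before this works on many domains.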

#### libguides.ru.nl
Also a tough one. It also includes too much at the beginning, which would be helped by the \<title\> \<h1\> trick if they didn't spell the titles slightly differently. Maybe something like a "string distance" comparison would help. The content here is not very interesting: this is basically a table of contents, with all the interesting information in links. It is a good test case to see whether another navigation strip method is not too aggressive, because the navigation structure in the table of contents should be part of the content I think (regardless of how little information the page gives us).
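
One candidate for such a comparison is ``difflib.SequenceMatcher`` from the standard library, which gives a similarity ratio between 0 and 1 rather than a true edit distance. The sketch below is an assumption about how it could be used; ``find_title_index`` and the 0.8 cut-off are hypothetical.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_title_index(title, texts, min_ratio=0.8):
    """Return the index of the text most similar to the title.

    Returns None when no text reaches `min_ratio` (an assumed cut-off).
    """
    best_index, best_ratio = None, 0.0
    for index, text in enumerate(texts):
        ratio = title_similarity(title, text)
        if ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index if best_ratio >= min_ratio else None
```

Everything before the returned index could then be dropped, in the same way as with an exact title match.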

#### libguides.vu.nl
This is the first one to go really astray. It sees the title "Introduction" as very common and then stops further processing, but of course that's just when the interesting stuff starts to come. This libguide is different enough from the RU libguide to warrant separation. However, as with the other libguide that we've seen, this is mostly a table of contents, which does not necessarily hold the interesting information about the topic it tries to teach.

#### maken.wikiwijs.nl
This is an instruction on how to use the library. Again this is more of a table of contents. It takes too much text as content at the end, but this would clean up if we look at repeating text and repeating \<a\>.

#### nrc.nl
This is an archive page with a website on it from the 90's. No headers, no chocola ;)

#### ocw.tudelft.nl
Another table of contents. It takes too much text as content at the end, but this would clean up if we look at repeating text and repeating \<a\>. The more worrisome conclusion here is that the content just isn't interesting.

#### turnstep.com
We can remove the \<a\> navigation structure. It's interesting that it's embedded in a \<table\>. Perhaps we can go up the element tree and strip parents wherever we encounter repeating \<a\>
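
Going up the element tree could be sketched as below with ``xml.etree.ElementTree``. The sketch assumes the set of repeating link texts is already known and that tags carry no XML namespace (in real Tika XHTML the tag names may be namespaced, which would need handling). ``strip_navigation`` is a hypothetical helper, and it only removes containers, never a lone link inside running text.

```python
import xml.etree.ElementTree as ET


def compact_text(element):
    """Return all text inside an element with whitespace removed."""
    return "".join("".join(element.itertext()).split())


def link_text(anchor):
    """Return the text of a link with whitespace collapsed."""
    return " ".join("".join(anchor.itertext()).split())


def strip_navigation(root, repeated_links):
    """Remove the largest containers that hold nothing but repeating links."""
    parents = {child: parent for parent in root.iter() for child in parent}
    to_remove = []
    for anchor in root.iter("a"):
        if link_text(anchor) not in repeated_links:
            continue
        node = anchor
        # Climb while the parent consists purely of repeating links.
        while node in parents and parents[node] is not root:
            parent = parents[node]
            anchors = list(parent.iter("a"))
            only_repeats = all(link_text(a) in repeated_links for a in anchors)
            link_chars = sum(len(compact_text(a)) for a in anchors)
            if not only_repeats or len(compact_text(parent)) > link_chars:
                break
            node = parent
        if node is not anchor:
            to_remove.append(node)
    for node in to_remove:
        parent = parents.get(node)
        # Several links can climb to the same container; remove it only once.
        if parent is not None and node in list(parent):
            parent.remove(node)
    return root
```

On a page where the navigation sits in a \<table\> of links, the climb ends at the \<table\> itself, so the whole table is removed while paragraphs that merely contain a link are left alone.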

#### video.vu.nl
Doesn't have interesting content because of the video content, but it's extracted reasonably well. Perhaps we want to blacklist words like "comments" and "views". That would clean it up even more.

#### vimeo.com
Again no interesting content, but considering it's video content it's extracted pretty well.

#### wiki.utwente.nl
Also only a video. Doesn't seem to be much content again.

#### wur.yuja.com
This site seems to rely heavily on JavaScript. No interesting content is extracted by Tika at all, so it's hard for us to improve things here. We could open a ticket with the Tika project.

#### xoteur.12change.eu
This site also contains no information in the Tika output. So we can't do much with this except for seeing if there are Tika developers that want to have a crack at this.

#### youtube.com
Here we should include the title. For the rest there isn't any interesting text in the example, so the fact that we get no output is a good thing. We may want to look deeper into YouTube and compare more videos with one another. We could also simply call the YouTube API and get content the official way. For a platform like YouTube that seems a fair approach.


