# Webstruct

- Dependencies:
    - `lxml` for parsing HTML;
    - `scikit-learn` >= 0.14.

## TODOs
- Webstruct Tutorial: https://webstruct.readthedocs.io/en/latest/tutorial.html
- WebAnnotator: https://www.aclweb.org/anthology/L12-1021/

In [2]:
#!pip install webstruct

In [22]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import webstruct
from webstruct import WebAnnotatorLoader, HtmlFeatureExtractor

In [4]:
TRAIN_PATH_JSON = "data/Branchen/data/train.ndjson"
TEST_PATH_JSON = "data/Branchen/data/test.ndjson"
TRAIN_PATH_CSV = "data/Branchen/data/train.csv"
TEST_PATH_CSV = "data/Branchen/data/test.csv"

## Load CSVs and create a test HTML file

In [5]:
%%time
train = pd.read_csv(TRAIN_PATH_CSV)
train.head(2)

CPU times: user 8.54 s, sys: 996 ms, total: 9.53 s
Wall time: 9.53 s


Unnamed: 0,text,html,industry,country,industry_name
0,Home | NETZkultur GmbH\n\nZum Inhalt wechseln\...,"<!DOCTYPE html>\n<html lang=""de-DE"">\n<head>\n...",4,DE,Computer Software
1,"\n\nNXP Semiconductors | Automotive, Security,...",<!DOCTYPE html>\n<html>\n<head>\n\t<title>NXP ...,7,,Semiconductors


In [7]:
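import os

testhtml = train.iloc[0]["html"]
os.makedirs("data/train", exist_ok=True)  # the target directory must exist before writing
with open("data/train/test.html", "w", encoding="utf-8") as f:
    f.write(testhtml)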

TODO: write all HTML documents from the CSV to `data/train/`, not just the first one.
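
One possible way to address this TODO is sketched below. The helper name `write_html_files` and the `<index>.html` file naming are assumptions made for illustration; they are not part of the notebook or of webstruct. The helper skips values that are not strings, since a CSV column may contain missing entries.

```python
from pathlib import Path


def write_html_files(html_docs, out_dir):
    """Write each HTML string to out_dir/<index>.html and return the paths.

    Hypothetical helper: the name and the file naming scheme are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    paths = []
    for i, html in enumerate(html_docs):
        if not isinstance(html, str):  # skip missing values such as NaN
            continue
        path = out_dir / f"{i}.html"
        path.write_text(html, encoding="utf-8")
        paths.append(path)
    return paths


# In this notebook the call could look like:
# write_html_files(train["html"], "data/train")
```

Files written this way would be picked up by a glob pattern such as `data/train/*.html`.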

## ...

In [9]:
loader = WebAnnotatorLoader()
loader.load("data/train/test.html")

<Element html at 0x7fb6dd5a95f0>

In [10]:
trees = webstruct.load_trees("data/train/*.html", WebAnnotatorLoader())

#### HTML Tokenizer

- Two parallel arrays, each holding one list per document:
    - a list of `HtmlToken` instances (here `X`)
    - a list of tags (here `y`)
        - BIO/IOB format (inside, outside, beginning)
            - B: beginning of a chunk
            - I: inside a chunk
            - O: not part of any chunk
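
To make the BIO scheme concrete, here is a minimal, library-independent sketch. The function `spans_to_bio` and the `ORG` label are assumptions for illustration only; webstruct produces its tags from the annotations in the loaded HTML, and the test page in this notebook carries no annotations.

```python
def spans_to_bio(tokens, spans):
    """Tag tokens with BIO labels.

    spans: list of (start, end, label) token index ranges, end exclusive.
    Hypothetical helper, not webstruct's implementation.
    """
    tags = ["O"] * len(tokens)  # everything is outside a chunk by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the chunk
    return tags


tokens = ["Home", "|", "NETZkultur", "GmbH"]
# Assumed annotation: tokens 2 and 3 form an organisation name.
tags = spans_to_bio(tokens, [(2, 4, "ORG")])
print(list(zip(tokens, tags)))
```

Without any annotated spans, every token receives the tag `O`.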

In [11]:
html_tokenizer = webstruct.HtmlTokenizer()
X, y = html_tokenizer.tokenize(trees)

In [16]:
X[0][0]

HtmlToken(token='Home', parent=<Element title at 0x7fb6dd5b5230>, index=0, position=0, length=4)

In [18]:
y[0][0]

'O'

In [20]:
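# Each feature function takes one HtmlToken and returns a dict of features.

def token_identity(html_token):
    # The token text itself.
    return {'token': html_token.token}

def token_isupper(html_token):
    # True only if the token is written entirely in upper case.
    return {'isupper': html_token.token.isupper()}

def parent_tag(html_token):
    # Name of the HTML element that contains the token.
    return {'parent_tag': html_token.parent.tag}

def border_at_left(html_token):
    # True if the token is the first one in its text block (index 0).
    return {'border_at_left': html_token.index == 0}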

In [23]:
feature_extractor = HtmlFeatureExtractor(
    token_features=[
        token_identity,
        token_isupper,
        parent_tag,
        border_at_left,
    ]
)

In [24]:
features = feature_extractor.fit_transform(X)
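
Conceptually, the extractor calls every token feature function on every token and merges the resulting dicts. The sketch below imitates that idea in plain Python. It is not webstruct's implementation: `Token` is a simplified stand-in for `HtmlToken`, and `extract_features` is a hypothetical name.

```python
from collections import namedtuple

# Simplified stand-in for webstruct's HtmlToken (assumption: only two fields).
Token = namedtuple("Token", ["token", "index"])

# Feature functions in the same style as the ones defined above.
feature_funcs = [
    lambda tok: {"token": tok.token},
    lambda tok: {"isupper": tok.token.isupper()},
    lambda tok: {"border_at_left": tok.index == 0},
]


def extract_features(docs, funcs):
    """Merge the dicts returned by all feature functions, token by token."""
    result = []
    for doc in docs:
        doc_features = []
        for tok in doc:
            merged = {}
            for func in funcs:
                merged.update(func(tok))  # later functions can overwrite keys
            doc_features.append(merged)
        result.append(doc_features)
    return result


docs = [[Token("Home", 0), Token("|", 1)]]
sketch_features = extract_features(docs, feature_funcs)
print(sketch_features)
```

The result has the same shape as the output shown below: one list per document, containing one feature dict per token.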

In [25]:
features

[[{'token': 'Home',
   'isupper': False,
   'parent_tag': 'title',
   'border_at_left': True},
  {'token': '|',
   'isupper': False,
   'parent_tag': 'title',
   'border_at_left': False},
  {'token': 'NETZkultur',
   'isupper': False,
   'parent_tag': 'title',
   'border_at_left': False},
  {'token': 'GmbH',
   'isupper': False,
   'parent_tag': 'title',
   'border_at_left': False},
  {'token': 'Zum',
   'isupper': False,
   'parent_tag': 'a',
   'border_at_left': True},
  {'token': 'Inhalt',
   'isupper': False,
   'parent_tag': 'a',
   'border_at_left': False},
  {'token': 'wechseln',
   'isupper': False,
   'parent_tag': 'a',
   'border_at_left': False},
  {'token': 'Home',
   'isupper': False,
   'parent_tag': 'a',
   'border_at_left': True},
  {'token': 'Dienste',
   'isupper': False,
   'parent_tag': 'a',
   'border_at_left': True},
  {'token': 'Unser',
   'isupper': False,
   'parent_tag': 'a',
   'border_at_left': True},
  {'token': 'Team',
   'isupper': False,
   'parent_tag':