# Classify Industries

## TODOS 

- top 10 klassen oder so
- preprocessing function einführen
    - `\n` weg
    - andere unnütze Zeichen wie `|` etc.
- remove pos einführen (siehe `clustering_whole_corpus`)
    - vielleicht mit Language identifier?
- ~~"country" als colum~~~~

## IDEEN

### Datensatzaufbereitung

- Übersetzung der Websites in einheitliche Sprache (z.b. Englisch)
- Andere Klassenlabels?
    - Erst allgemeinere Klassen und dann in diesen Klassen feiner klassifizieren?
        - Von: https://towardsdatascience.com/industrial-classification-of-websites-by-machine-learning-with-hands-on-python-3761b1b530f1
            - Technology, Office, & Education products website (Class_1)
            - Consumer products website (Class_2)
            - Industrial Tools and Hardware products website (Class_3)
    - seltene Klassenlables wegwerfen?

### HTML Klassifizierung


- Text zusammenfassen und dann klassifizieren? Dafür auch HTML-Tags verwenden?
- ~~HTML Struktur verwenden, um vorher **Boilerplate Content** von Main Content zu entfernen:~~
    - ~~Plain Text ist sehr noisy (viel unnötiges drin)~~
    - ICH: gemacht mit CLEAN HTML, aber ohne explizites Boilerplate Content Removal
- Bestimmten Wörtern/Tags höhere Gewichtungen geben
    - Anchor Text (= klickbarer Text in einem Hyperlink)
        - alleine zu wenig Inhalt (QI, S. 12)
        - umliegende Wörter interessant! (QI, S. 12)
        - auch für Nachbar-Seiten-Ansatz
    - Title, Headers (QI, S. 12)
        - auch für Nachbar-Seiten-Ansatz
    - Keywords für Branchen
        ```python3  
        Class_1_keywords = ['Office', 'School', 'phone', 'Technology', 'Electronics', 'Cell', 'Business', 'Education', 'Classroom']
        
        Class_2_keywords = ['Restaurant', 'Hospitality', 'Tub', 'Drain', 'Pool', 'Filtration', 'Floor', 'Restroom', 'Consumer', 'Care', 'Bags', 'Disposables']
        
        Class_3_keywords = ['Pull', 'Lifts', 'Pneumatic', 'Emergency', 'Finishing', 'Hydraulic', 'Lockout', 'Towers', 'Drywall', 'Tools', 'Packaging', 'Measure', 'Tag ']
        ```
- NER mit **Tags** als weitere Tokens
- Features von "Nachbarseiten" verwenden
    - Hilfreich, da mehr Infos als "nur" Startseite
    - Fragen: 
        - Was sind Nachbarseiten, wie definieren?
            - Webgraph Webseiten?
            - Weitere Seiten des Unternehmens?
        - Wie viele Nachbarseiten?
        - Wieviel von den Nachbarseiten verwenden?
            - Ganze Seite?
            - text, title, heading, Metadaten?
    
- *Weiteres*:
    - Flat classification oder Hierarchical classification?
        - Flat: parallele Klassen
        - Hierarchical: hierarchische Klassen, bauen aufeinander auf
    - Nur nach bestimmten Keywords filtern? (das geht jedoch mehr Richtung PLAIN-Textclassification)
    - "implicit links": Seiten, die beide bei Suche von **Suchmaschine** erschienen sind und auf die beide der User geklickt hat (QI, S. 12) &rarr; nicht wirklich realisierbar


## Paper / Repos

- **Boilerplate Removal using a Neural Sequence Labeling Model** (2020): https://arxiv.org/pdf/2004.14294.pdf
    - Verbesserung von **Web2Text** &rarr; basiert nicht auf teuren, handgemachten Feature Engineering
    - <u>Hypothese</u>: "Our hypothesis is that the **order** of text blocks in a web page **encodes important information** about their type, i.e. content or boilerplate, as the placement is determined by the authoring style"
- **Web2Text: Deep Structured Boilerplate Removal** (2018): https://arxiv.org/pdf/1801.02607.pdf
- **Mozillas readability**: https://github.com/mozilla/readability
- **Webpage Classification based on Compound of Using HTML Features & URLFeatures and Features of Sibling Pages** (2010): https://www.researchgate.net/publication/220419545_Webpage_Classification_based_on_Compound_of_Using_HTML_Features_URL_Features_and_Features_of_Sibling_Pages
    - TODO
- **Web Page Classification: Features and Algorithms** (2009): https://www.cs.ucf.edu/~dcm/Teaching/COT4810-Fall%202012/Literature/WebPageClassification.pdf
    - S. 7: Using On-Page Features
        - GOLUB, ARDO (2005): title, headings, metadata, main text
    - TODO

## Tests

- Evaluation metric: **F1 Scores**
- TF-IDF Vectorizer
    - kein lowercase
    - stop words werden entfernt
    - keine max features
- Top $n$ classes = most frequent classes
- CLEAN HTML auch für Test Set (ansonsten unglaublich schlechte Accuracy und etwas sinnlos)


#### Label: `group_representatives`

| Experiment | SGD F1 (Precision) | LSVM F1 (Precision) |
| ---------- |:-----:| ----:|
| HTML (10000 features) | **0.5292** (0.5962) | **0.5493** (0.6371) |
| HTML (10000 features) ((1, 3) ngrams) | **0.4035** (0.463) | **0.4188** (0.5345) |
| HTML (10000 features) ((2, 2) ngrams) | **0.2442** (0.2787) | **0.252** (0.3146) |

## Aktuelle TODOS zum ausprobieren (26.04)
- N-gramme &rarr; erstmal lassen
- Max Features
- Count Vectorizer / TF-IDF Vectorizer manipulieren

In [39]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# sklearn classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# sklearn general
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, f1_score, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder


from stop_words import get_stop_words
import ujson as json


import sys
import os

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from app.utils import clean_html_boilerplate

## Paths

In [4]:
DATA_DIR_PATH = "../data/"
LANG = "_DE"

INDUSTRIES_PATH_CSV = DATA_DIR_PATH + "industries.csv"
TRAIN_PATH_CSV = DATA_DIR_PATH + "train" + LANG + ".csv"
TEST_PATH_CSV = DATA_DIR_PATH + "test" + LANG + ".csv"

## Load train csv

In [17]:
%%time
train = pd.read_csv(TRAIN_PATH_CSV)

CPU times: user 6.33 s, sys: 1.08 s, total: 7.41 s
Wall time: 7.39 s


In [18]:
train.head(1)

Unnamed: 0,url,industry,industry_label,group,group_representative,html,text,source,country,group_representative_label
0,http://www.blume-rollen.com,135,Mechanical or Industrial Engineering,"cons, gov, man",135,"<html class=""no-js"" lang=""de""> <head> ...",BLUME FÖRDERANLAGEN | BLUME-ROLLEN GMBH - FÖRD...,xing,DE,Mechanical or Industrial Engineering


In [19]:
train.shape

(19857, 10)

## Hyperparameters

In [40]:
# "text" or "html"
TEXT_COL = "html"

# "group_representative", "group_representative_label", "industry", "industry_label" or "group"
CLASS_COL = "group_representative"
CLASS_NAMES = "group_representative_label"

MAX_DOCUMENT_FREQUENCY = 1.
MAX_FEATURES = 100
NGRAM_RANGE = (1,1)
LOWERCASE = False
STOP_WORDS = get_stop_words("de") 

### Vectorizing text

TODO: eigener vectorizer

In [77]:
train = train.head() # todo weg

In [78]:
%%time

train_text = train[TEXT_COL]
train_labels = train[CLASS_COL].values

vectorizer = CountVectorizer(max_df=MAX_DOCUMENT_FREQUENCY,
                             lowercase=LOWERCASE,
                             max_features=MAX_FEATURES,
                             ngram_range=NGRAM_RANGE,
                             stop_words=STOP_WORDS)
transformer = TfidfTransformer()

vector = vectorizer.fit_transform(train_text)
train_vector = transformer.fit_transform(vector)

CPU times: user 25.4 ms, sys: 100 µs, total: 25.5 ms
Wall time: 31.8 ms


TODO:
- building own preprocessor for html tags
- vielleicht auch eigenen tokenizer der html tags als einzelne tokens ansieht
    - https://stackoverflow.com/questions/47549856/tokenizing-an-html-document
    - hier vielleicht auch gucken, welche tokens man nur verwenden möchte
    - vorher klassen und sowas entfernen? ja oder, da sich (wahrscheinlich) nur auf hauseigene css klassen und sowas bezieht, ist ja eher unwichtig
    -  d.h. sowas schreiben wie `TagCleaner` oder so. mal bei lxml gucken &rarr; https://lxml.de/api/lxml.html.clean.Cleaner-class.html

In [80]:
vectorizer.get_feature_names()

['Sie',
 '_blank',
 'amp',
 'anlegen',
 'antrieb',
 'background',
 'block',
 'blume',
 'box',
 'br',
 'button',
 'carousel',
 'cf',
 'class',
 'col',
 'col10',
 'com',
 'component',
 'container',
 'content',
 'de',
 'div',
 'dropdown',
 'dropdown__contact',
 'dropdown__link',
 'dropdown__title',
 'entry',
 'et',
 'et_pb_bg_layout_light',
 'et_pb_column',
 'et_pb_css_mix_blend_mode_passthrough',
 'et_pb_module',
 'et_pb_text',
 'et_pb_text_align_left',
 'et_pb_text_inner',
 'fa',
 'foerderelemente',
 'footer',
 'gem',
 'george',
 'h2',
 'h3',
 'h4',
 'header',
 'href',
 'https',
 'icon',
 'id',
 'image',
 'initialized',
 'inner',
 'item',
 'karten',
 'keller',
 'konto',
 'label',
 'left',
 'level',
 'lg',
 'li',
 'link',
 'list',
 'md',
 'menu',
 'mobile',
 'module',
 'nav',
 'navigation',
 'noe',
 'none',
 'not',
 'object',
 'page',
 'post_type',
 'privatkunden',
 'produkte',
 'qbiq',
 'qbiq_liste_01',
 'reinhold',
 'right',
 'rollen',
 'section',
 'services',
 'site',
 'sm',
 'spacer'

# Test Dataset

In [57]:
%%time
test = pd.read_csv(TEST_PATH_CSV)
    

test_vector = vectorizer.transform(test[TEXT_COL].values)
test_labels = test[CLASS_COL].values

CPU times: user 14.6 s, sys: 139 ms, total: 14.7 s
Wall time: 14.8 s


# SGD

In [58]:
%%time
print("SGD CLF", "\n-------------------------")
# training
clf = SGDClassifier()
clf.fit(train_vector, train_labels)

# prediction
train_preds = clf.predict(test_vector)

# evaluation
precision = precision_score(test_labels, train_preds, average="macro", zero_division=0)
recall = recall_score(test_labels, train_preds, average="macro", zero_division=0)
f1 = f1_score(test_labels, train_preds, average="macro", zero_division=0)
clf1_f1 = np.round(f1, decimals=4)
clf1_precision = np.round(precision, decimals=4)

print(np.round(precision, decimals=4), "\tPrecision")
print(np.round(recall, decimals=4), "\tRecall")
print(np.round(f1, decimals=4), "\tF1")
print()

clf1_report = classification_report(test_labels, 
                                   train_preds, 
                                   target_names = np.unique(test[CLASS_NAMES]), 
                                   zero_division = 0)

SGD CLF 
-------------------------
0.0145 	Precision
0.057 	Recall
0.0211 	F1

CPU times: user 36.8 ms, sys: 714 µs, total: 37.5 ms
Wall time: 70.8 ms


# LSVM

In [59]:
%%time
print("LSVM CLF", "\n-------------------------")
# training
clf = LinearSVC()
clf.fit(train_vector, train_labels)

# prediction
train_preds = clf.predict(test_vector)

# evaluation
precision = precision_score(test_labels, train_preds, average="macro", zero_division=0)
recall = recall_score(test_labels, train_preds, average="macro", zero_division=0)
f1 = f1_score(test_labels, train_preds, average="macro", zero_division=0)
clf2_f1 = np.round(f1, decimals=4)
clf2_precision = np.round(precision, decimals=4)

print(np.round(precision, decimals=4), "\tPrecision")
print(np.round(recall, decimals=4), "\tRecall")
print(np.round(f1, decimals=4), "\tF1")
print()

clf2_report = classification_report(test_labels, 
                                   train_preds, 
                                   target_names = np.unique(test[CLASS_NAMES]),
                                   zero_division = 0)

LSVM CLF 
-------------------------
0.0212 	Precision
0.0471 	Recall
0.0138 	F1

CPU times: user 29 ms, sys: 168 µs, total: 29.2 ms
Wall time: 65.1 ms


## Summary: Classification Results

In [61]:
result = "| "

if TEXT_COL == "text":
    result += "Plain Text"
else:
    result += "HTML"
    
if MAX_FEATURES is None:
    result += " (all features)"
else:
    result += f" ({MAX_FEATURES} features)"
    
if NGRAM_RANGE != (1,1):
    result += f" ({NGRAM_RANGE} ngrams)"
    
            
result += f" | **{clf1_f1}** ({clf1_precision}) | **{clf2_f1}** ({clf2_precision}) |"
print(CLASS_COL)
print()
print(result)

group_representative

| HTML (100 features) | **0.0211** (0.0145) | **0.0138** (0.0212) |


# Confusion Matrix

TODO: label und text names und so; allg. änderungen von oben hier ergänzen

In [None]:
NORMALIZE_CM = True
INDUSTRY_TRESHOLD = 250
PLT_SCALING_FACTOR = 0.8

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

filtered_train = train.groupby(CLASS_COL).filter(lambda x: len(x)>INDUSTRY_TRESHOLD)
remaining_industries = filtered_train[CLASS_NAMES].drop_duplicates().tolist()


cnf_matrix = confusion_matrix(test_labels, train_preds)

classes = train[CLASS_COL].drop_duplicates().tolist()

cnf_df = pd.DataFrame(cnf_matrix, index=classes, columns=classes)
cnf_df = cnf_df[remaining_industries]
cnf_df = cnf_df.loc[remaining_industries]

In [None]:
plt.figure(figsize=(10*PLT_SCALING_FACTOR, 8*PLT_SCALING_FACTOR))

if NORMALIZE_CM:
    normalized_cnf_df = cnf_df.astype('float') / cnf_df.sum(axis=1)[:, np.newaxis]
    sns.heatmap(normalized_cnf_df, annot=True, cmap=sns.color_palette("Blues"), fmt='.2f')
else:
    sns.heatmap(cnf_df, annot=True, cmap=sns.color_palette("Blues"), fmt='g')
plt.tight_layout()