# Classify Industries

## TODOS 

- top 10 classes or so
- introduce a preprocessing function
    - remove `\n`
    - remove other useless characters such as `|` etc.
- introduce POS removal (see `clustering_whole_corpus`)
    - maybe with a language identifier?
- ~~"country" as a column~~
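
The preprocessing step from the list above could look roughly like the following. This is a minimal sketch: the function name `preprocess_text` and the exact set of characters treated as noise are assumptions, not part of the project code.

```python
import re

# Characters treated as noise; this set is an assumption and should be tuned to the data.
NOISE_CHARS = re.compile(r"[|•·»«]+")
WHITESPACE = re.compile(r"\s+")


def preprocess_text(text: str) -> str:
    """Replace noise characters and line breaks with single spaces."""
    text = NOISE_CHARS.sub(" ", text)
    # \s+ also covers \n, \r and \t, so line breaks collapse here
    text = WHITESPACE.sub(" ", text)
    return text.strip()


print(preprocess_text("Energy Net Apple Reseller\n\nSpringe zum Inhalt | Home"))
```

Applied to the `text` column, this would run once per document before vectorizing.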

## IDEAS

### Dataset preparation

- Translate the websites into a single language (e.g. English)
- Different class labels?
    - First classify into more general classes, then classify more finely within those classes?
        - From: https://towardsdatascience.com/industrial-classification-of-websites-by-machine-learning-with-hands-on-python-3761b1b530f1
            - Technology, Office, & Education products website (Class_1)
            - Consumer products website (Class_2)
            - Industrial Tools and Hardware products website (Class_3)
    - discard rare class labels?
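
Discarding rare class labels can be sketched with a simple frequency count. The helper name `keep_frequent_classes` and its parameters `top_n` and `min_count` are hypothetical; the toy labels are made up for illustration.

```python
from collections import Counter


def keep_frequent_classes(labels, top_n=None, min_count=1):
    """Return the set of labels to keep: the top_n most frequent ones
    that also occur at least min_count times."""
    counts = Counter(labels)
    ranked = counts.most_common(top_n)  # top_n=None keeps all classes
    return {label for label, count in ranked if count >= min_count}


# Toy labels, made up for illustration
labels = ["tech", "tech", "tech", "food", "food", "legal"]
keep = keep_frequent_classes(labels, top_n=2)

# Keep only the rows whose label survived the filter
rows = [(i, label) for i, label in enumerate(labels) if label in keep]
print(keep)
print(rows)
```

The same filter would have to be applied to the training and the test set so that both contain the same classes.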

### HTML classification


- Summarize the text and then classify? Also use HTML tags for that?
- ~~Use the HTML structure to remove **boilerplate content** from the main content beforehand:~~
    - ~~Plain text is very noisy (contains a lot of unnecessary material)~~
    - ME: done with CLEAN HTML, but without explicit boilerplate content removal
- Give certain words/tags higher weights
    - Anchor text (= clickable text in a hyperlink)
        - too little content on its own (QI, p. 12)
        - surrounding words are interesting! (QI, p. 12)
        - also for the neighboring-pages approach
    - Title, headers (QI, p. 12)
        - also for the neighboring-pages approach
    - Keywords for industries
        ```python3  
        Class_1_keywords = ['Office', 'School', 'phone', 'Technology', 'Electronics', 'Cell', 'Business', 'Education', 'Classroom']
        
        Class_2_keywords = ['Restaurant', 'Hospitality', 'Tub', 'Drain', 'Pool', 'Filtration', 'Floor', 'Restroom', 'Consumer', 'Care', 'Bags', 'Disposables']
        
        Class_3_keywords = ['Pull', 'Lifts', 'Pneumatic', 'Emergency', 'Finishing', 'Hydraulic', 'Lockout', 'Towers', 'Drywall', 'Tools', 'Packaging', 'Measure', 'Tag ']
        ```
- NER with **tags** as additional tokens
- Use features from "neighboring pages"
    - Helpful, since they carry more information than "just" the home page
    - Questions: 
        - What are neighboring pages, and how should they be defined?
            - Web-graph pages?
            - Other pages of the same company?
        - How many neighboring pages?
        - How much of each neighboring page should be used?
            - The whole page?
            - text, title, heading, metadata?
    
- *Other*:
    - Flat classification or hierarchical classification?
        - Flat: parallel classes
        - Hierarchical: hierarchical classes that build on one another
    - Filter only by certain keywords? (though that leans more toward PLAIN text classification)
    - "implicit links": pages that both appeared in the results of a **search engine** query and that the user clicked on in both cases (QI, p. 12) &rarr; not really feasible
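
One possible way to give text inside certain tags a higher weight is to repeat those words before vectorizing, so that a count-based vectorizer sees a higher term frequency. The sketch below uses only the standard-library `html.parser`; the class `WeightedTextExtractor`, the function `weighted_tokens`, and the weights in `TAG_WEIGHTS` are assumptions for illustration, not the behavior of `tokenizing_html`.

```python
from html.parser import HTMLParser

# Hypothetical weights: how often the text inside a tag is repeated.
TAG_WEIGHTS = {"title": 3, "h1": 2, "h2": 2, "a": 1}


class WeightedTextExtractor(HTMLParser):
    """Collect words, repeating those inside weighted tags."""

    def __init__(self, tag_weights):
        super().__init__()
        self.tag_weights = tag_weights
        self.open_tags = []  # stack of currently open weighted tags
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tag_weights:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # remove the innermost open occurrence of this tag
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # text outside any weighted tag counts once
        weight = max((self.tag_weights[t] for t in self.open_tags), default=1)
        self.tokens.extend(data.split() * weight)


def weighted_tokens(html, tag_weights=TAG_WEIGHTS):
    parser = WeightedTextExtractor(tag_weights)
    parser.feed(html)
    parser.close()
    return parser.tokens


html = ("<html><head><title>Energy Net</title></head>"
        "<body><h1>Apple</h1><p>Reseller</p></body></html>")
print(weighted_tokens(html))
```

Repetition is only one option; an alternative would be to vectorize each tag's text separately and scale the resulting feature blocks.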


## Papers / Repos

- **Boilerplate Removal using a Neural Sequence Labeling Model** (2020): https://arxiv.org/pdf/2004.14294.pdf
    - Improvement over **Web2Text** &rarr; does not rely on expensive, hand-crafted feature engineering
    - <u>Hypothesis</u>: "Our hypothesis is that the **order** of text blocks in a web page **encodes important information** about their type, i.e. content or boilerplate, as the placement is determined by the authoring style"
- **Web2Text: Deep Structured Boilerplate Removal** (2018): https://arxiv.org/pdf/1801.02607.pdf
- **Mozilla's Readability**: https://github.com/mozilla/readability
- **Webpage Classification based on Compound of Using HTML Features & URL Features and Features of Sibling Pages** (2010): https://www.researchgate.net/publication/220419545_Webpage_Classification_based_on_Compound_of_Using_HTML_Features_URL_Features_and_Features_of_Sibling_Pages
    - TODO
- **Web Page Classification: Features and Algorithms** (2009): https://www.cs.ucf.edu/~dcm/Teaching/COT4810-Fall%202012/Literature/WebPageClassification.pdf
    - p. 7: Using On-Page Features
        - GOLUB, ARDO (2005): title, headings, metadata, main text
    - TODO

## Tests

- Evaluation metric: **F1 score**
- TF-IDF vectorizer
    - no lowercasing
    - stop words are removed
    - no max features
- Top $n$ classes = most frequent classes
- CLEAN HTML for the test set as well (otherwise the accuracy is extremely poor and the evaluation is somewhat pointless)


#### Label: `group_representative`

| Experiment | SGD F1 (Precision) | LSVM F1 (Precision) |
| ---------- |:-----:| ----:|
| HTML (10000 features) | **0.5292** (0.5962) | **0.5493** (0.6371) |
| HTML (kept stop words) (10000 features) | **0.5268** (0.5845) | **0.5473** (0.6439) |
| HTML (10000 features) ((1, 3) ngrams) | **0.4035** (0.463) | **0.4188** (0.5345) |
| HTML (10000 features) ((2, 2) ngrams) | **0.2442** (0.2787) | **0.252** (0.3146) |
| ---------- |-----| ----|
| *ALL LANGS* HTML (kept stop words) (10000 features) | **0.5781** (0.6464) | **0.6406** (0.7024) |

## Current TODOS to try out (26.04)
- N-grams &rarr; leave for now
- Max features
- Tweak the Count Vectorizer / TF-IDF Vectorizer

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# sklearn classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# sklearn general
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, f1_score, precision_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


from stop_words import get_stop_words
import ujson as json


import sys
import os

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from app.utils import clean_html_boilerplate, tokenizing_html, trim_html

## Paths

In [2]:
DATA_DIR_PATH = "../data/"
LANG = ""

INDUSTRIES_PATH_CSV = DATA_DIR_PATH + "industries.csv"
TRAIN_PATH_CSV = DATA_DIR_PATH + "train" + LANG + ".csv"
TEST_PATH_CSV = DATA_DIR_PATH + "test" + LANG + ".csv"

## Load train csv

In [3]:
%%time
train = pd.read_csv(TRAIN_PATH_CSV)

CPU times: user 1min 32s, sys: 34.1 s, total: 2min 7s
Wall time: 2min 6s


In [4]:
train.head(1)

| Unnamed: 0 | url | industry | industry_label | group | group_representative | html | text | source | country | group_representative_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | http://www.energy-net.de | 8 | Telecommunications | gov, tech | 8 | `<html><head><title>Energy Net Apple Reseller</...` | `Energy Net Apple Reseller\n\nSpringe zum Inhal...` | xing | DE | Telecommunications |


In [5]:
train.shape

(30292, 10)

## Hyperparameters

In [10]:
# "text" or "html"
TEXT_COL = "html"

# "group_representative", "group_representative_label", "industry", "industry_label" or "group"
CLASS_COL = "group_representative"
CLASS_NAMES = "group_representative_label"

MAX_DOCUMENT_FREQUENCY = 1.
MAX_FEATURES = 10000
NGRAM_RANGE = (1,1)
LOWERCASE = False
#STOP_WORDS = get_stop_words("de")
STOP_WORDS = None

TAG_LIST = ['a', 'b', 'em', 'h1', 'h2', 'h3', 'i', 'li', 'p', 'strong', 'title']

In [8]:
# TAG_LIST is defined in the Hyperparameters cell; `token_list` was never defined
tokenizing_html(train.iloc[0].html, TAG_LIST)

['<title>',
 'Energy',
 'Net',
 'Apple',
 'Reseller',
 '</title>',
 '<a>',
 'Springe',
 'zum',
 'Inhalt',
 '</a>',
 '<a>',
 '</a>',
 '<a>',
 '<i>',
 '</i>',
 '</a>',
 '<li>',
 '<a>',
 'Home',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Unternehmen',
 '</a>',
 '<li>',
 '<a>',
 'Über',
 'Energy',
 'Net',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Partner',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Referenzen',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Aktuelles',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Stellenangebote',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Soziales',
 'Engagement',
 '</a>',
 '</li>',
 '</li>',
 '<li>',
 '<a>',
 'Lösungen',
 '</a>',
 '<li>',
 '<a>',
 'Apple',
 'Enterprise',
 'Services',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Collaboration',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Publishing',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Print',
 'amp',
 'Copy',
 '</a>',
 '</li>',
 '<li>',
 '<a>',
 'Training',
 'amp',
 'Events',
 '</a>',
 '</li>',
 '</li>',
 '<li>',
 '<a>',
 'Services',
 '</a>',
 '</li

## Trim HTML

In [9]:
train2 = train.head(10)

In [17]:
%%time
train["html"] = train["html"].apply(lambda x: trim_html(x, tag_list = TAG_LIST, tagless_output_string=True))

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
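
The message above is the one lxml raises when it is asked to parse a `str` that starts with an XML declaration naming an encoding. That `trim_html` uses lxml internally is an assumption, since its source is not shown here. Under that assumption, one possible workaround is to strip the declaration before parsing; the helper name `strip_xml_declaration` is hypothetical.

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>", flags=re.IGNORECASE)


def strip_xml_declaration(html: str) -> str:
    """Remove a leading XML declaration so the document can be parsed as a str."""
    return XML_DECLARATION.sub("", html, count=1)


doc = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body>Hi</body></html>'
print(strip_xml_declaration(doc))
```

The cleaned string could then be passed on, for example as `trim_html(strip_xml_declaration(x), ...)` inside the `apply` call. The other option named in the error message, passing bytes, only helps if `trim_html` accepts bytes input, which is not known here.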

### Vectorizing text

In [11]:
%%time

train_text = train[TEXT_COL]
train_labels = train[CLASS_COL].values

vectorizer = CountVectorizer(max_df=MAX_DOCUMENT_FREQUENCY,
                             lowercase=LOWERCASE,
                             max_features=MAX_FEATURES,
                             ngram_range=NGRAM_RANGE,
                             stop_words=STOP_WORDS,
                             tokenizer=tokenizing_html)
transformer = TfidfTransformer()

vector = vectorizer.fit_transform(train_text)
train_vector = transformer.fit_transform(vector)

CPU times: user 2.5 ms, sys: 18.5 ms, total: 21 ms
Wall time: 17.6 ms


## Test Dataset

In [12]:
%%time
test = pd.read_csv(TEST_PATH_CSV)
    
test_vector = vectorizer.transform(test[TEXT_COL].values)
test_vector = transformer.transform(test_vector)
test_labels = test[CLASS_COL].values

CPU times: user 5.33 s, sys: 0 ns, total: 5.33 s
Wall time: 5.7 s


## SGD

In [13]:
%%time
print("SGD CLF", "\n-------------------------")
# training
clf = SGDClassifier()
clf.fit(train_vector, train_labels)

# prediction (on the test set, despite the variable name)
train_preds = clf.predict(test_vector)

# evaluation
precision = precision_score(test_labels, train_preds, average="macro", zero_division=0)
recall = recall_score(test_labels, train_preds, average="macro", zero_division=0)
f1 = f1_score(test_labels, train_preds, average="macro", zero_division=0)
clf1_f1 = np.round(f1, decimals=4)
clf1_precision = np.round(precision, decimals=4)

print(np.round(precision, decimals=4), "\tPrecision")
print(np.round(recall, decimals=4), "\tRecall")
print(np.round(f1, decimals=4), "\tF1")
print()

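# target_names must follow the order of `labels` (sorted ids), not alphabetical name order
report_labels = np.unique(test_labels)
id_to_name = dict(zip(test[CLASS_COL], test[CLASS_NAMES]))
clf1_report = classification_report(test_labels,
                                   train_preds,
                                   labels = report_labels,
                                   target_names = [str(id_to_name[l]) for l in report_labels],
                                   zero_division = 0)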

SGD CLF 
-------------------------
0.0555 	Precision
0.076 	Recall
0.0502 	F1

CPU times: user 58.4 ms, sys: 0 ns, total: 58.4 ms
Wall time: 53.3 ms
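
`classification_report` pairs the entries of `target_names` with the labels in sorted label order, so the display names have to be listed in the order of the sorted label ids rather than alphabetically. A standalone sketch of building such a list follows; the helper name `names_in_label_order` is hypothetical, and the ids and names other than `8` / `Telecommunications` are made up for illustration.

```python
def names_in_label_order(label_ids, label_names):
    """Return one display name per distinct label id, ordered by sorted id."""
    id_to_name = dict(zip(label_ids, label_names))
    return [id_to_name[i] for i in sorted(id_to_name)]


# Toy data: only (8, "Telecommunications") appears in the notebook output
ids = [8, 3, 8, 12]
names = ["Telecommunications", "Banking", "Telecommunications", "Automotive"]

by_id = names_in_label_order(ids, names)
alphabetical = sorted(set(names))
print(by_id)
print(alphabetical)
```

The two lists differ in this toy case, which is the situation in which alphabetically sorted names would end up next to the wrong rows of the report.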


## LSVM

In [14]:
%%time
print("LSVM CLF", "\n-------------------------")
# training
clf = LinearSVC()
clf.fit(train_vector, train_labels)

# prediction (on the test set, despite the variable name)
train_preds = clf.predict(test_vector)

# evaluation
precision = precision_score(test_labels, train_preds, average="macro", zero_division=0)
recall = recall_score(test_labels, train_preds, average="macro", zero_division=0)
f1 = f1_score(test_labels, train_preds, average="macro", zero_division=0)
clf2_f1 = np.round(f1, decimals=4)
clf2_precision = np.round(precision, decimals=4)

print(np.round(precision, decimals=4), "\tPrecision")
print(np.round(recall, decimals=4), "\tRecall")
print(np.round(f1, decimals=4), "\tF1")
print()

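# target_names must follow the order of `labels` (sorted ids), not alphabetical name order
report_labels = np.unique(test_labels)
id_to_name = dict(zip(test[CLASS_COL], test[CLASS_NAMES]))
clf2_report = classification_report(test_labels,
                                   train_preds,
                                   labels = report_labels,
                                   target_names = [str(id_to_name[l]) for l in report_labels],
                                   zero_division = 0)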

LSVM CLF 
-------------------------
0.1247 	Precision
0.0506 	Recall
0.0222 	F1

CPU times: user 40.2 ms, sys: 0 ns, total: 40.2 ms
Wall time: 34.7 ms
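
In the LSVM output above, the F1 score (0.0222) is lower than both precision and recall. That is expected with `average="macro"`: the macro F1 is the mean of the per-class F1 scores, not the harmonic mean of the macro precision and macro recall. The following pure-Python sketch reproduces the macro averaging on toy labels; the function name `macro_scores` is hypothetical, and undefined ratios are set to 0 to mirror `zero_division=0`.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1; undefined ratios count as 0."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels: class 2 is never predicted, so its scores are all 0
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Classes that are never predicted correctly contribute a per-class F1 of 0, which pulls the macro average down quickly when there are many rare classes.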


## Summary: Classification Results

In [15]:
result = "| "

if TEXT_COL == "text":
    result += "Plain Text"
else:
    result += "HTML"
    
if STOP_WORDS is None:
    result += " (kept stop words)"
    
if MAX_FEATURES is None:
    result += " (all features)"
else:
    result += f" ({MAX_FEATURES} features)"
    
if NGRAM_RANGE != (1,1):
    result += f" ({NGRAM_RANGE} ngrams)"
    
            
result += f" | **{clf1_f1}** ({clf1_precision}) | **{clf2_f1}** ({clf2_precision}) |"
print(CLASS_COL)
print()
print(result)

group_representative

| Plain Text (kept stop words) (10000 features) | **0.0502** (0.0555) | **0.0222** (0.1247) |


## Confusion Matrix

TODO: label and text names etc.; add the general changes from above here

In [None]:
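NORMALIZE_CM = True
INDUSTRY_THRESHOLD = 250
PLT_SCALING_FACTOR = 0.8

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# keep only classes with more than INDUSTRY_THRESHOLD training samples
filtered_train = train.groupby(CLASS_COL).filter(lambda x: len(x) > INDUSTRY_THRESHOLD)
remaining_industries = filtered_train[CLASS_COL].drop_duplicates().tolist()

# confusion_matrix orders rows/columns by sorted label, so pass the labels explicitly
classes = np.unique(np.concatenate([test_labels, train_preds]))
cnf_matrix = confusion_matrix(test_labels, train_preds, labels=classes)

cnf_df = pd.DataFrame(cnf_matrix, index=classes, columns=classes)
# restrict rows and columns to the frequent classes present in the matrix
remaining_industries = [c for c in remaining_industries if c in cnf_df.index]
cnf_df = cnf_df.loc[remaining_industries, remaining_industries]

# show readable class names on the axes
id_to_name = dict(zip(train[CLASS_COL], train[CLASS_NAMES]))
cnf_df = cnf_df.rename(index=id_to_name, columns=id_to_name)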

In [None]:
plt.figure(figsize=(10*PLT_SCALING_FACTOR, 8*PLT_SCALING_FACTOR))

if NORMALIZE_CM:
    # divide each row by its sum; Series[:, np.newaxis] no longer works in recent pandas
    normalized_cnf_df = cnf_df.astype('float').div(cnf_df.sum(axis=1), axis=0)
    sns.heatmap(normalized_cnf_df, annot=True, cmap=sns.color_palette("Blues"), fmt='.2f')
else:
    sns.heatmap(cnf_df, annot=True, cmap=sns.color_palette("Blues"), fmt='g')
plt.tight_layout()
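
Row normalization divides every row of the confusion matrix by the number of true samples of that class, so the diagonal then shows the per-class recall. A minimal pure-Python sketch on a made-up 3x3 matrix follows; the function name `normalize_rows` is hypothetical.

```python
def normalize_rows(matrix):
    """Divide each row by its sum; all-zero rows stay zero."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Made-up counts: rows are true classes, columns are predicted classes
cm = [[8, 2, 0],
      [1, 3, 1],
      [0, 0, 0]]

normalized_cm = normalize_rows(cm)
for row in normalized_cm:
    print([round(value, 2) for value in row])
```

Classes without any test samples produce an all-zero row, which would otherwise lead to a division by zero.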