# Classify Industries

## TODOS

- Add "country" as a column
- Introduce a preprocessing function
    - remove `\n`
    - remove other useless characters such as `|`
- Introduce `remove_pos` (see `clustering_whole_corpus`)
    - perhaps combined with a language identifier?
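
The preprocessing TODO could look roughly like the following sketch. The function name `preprocess_text` and the exact set of noise characters are assumptions made for illustration; they are not part of the notebook or of `app.utils`.

```python
import re

# Hypothetical helper, not part of the notebook or of app.utils.
# Assumed set of "useless" characters; extend after inspecting the real data.
_NOISE_CHARS = re.compile(r"[|•]")
_WHITESPACE = re.compile(r"\s+")


def preprocess_text(text: str) -> str:
    """Replace noise characters and line breaks with single spaces."""
    text = _NOISE_CHARS.sub(" ", text)
    # \s+ also matches "\n", "\r" and "\t", so line breaks collapse here.
    return _WHITESPACE.sub(" ", text).strip()


cleaned = preprocess_text("DE | EN | NL\n\n\nFRETTWORK  Network")
print(cleaned)
```

Applied to a whole column, this would be something like `train[TEXT_COL].map(preprocess_text)` before vectorizing.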

## IDEAS

### Dataset preparation

- Translate the websites into a single language (e.g. English)
- Different class labels?
    - Classify into coarser classes first, then classify more finely within each of them?
        - From: https://towardsdatascience.com/industrial-classification-of-websites-by-machine-learning-with-hands-on-python-3761b1b530f1
            - Technology, Office, & Education products website (Class_1)
            - Consumer products website (Class_2)
            - Industrial Tools and Hardware products website (Class_3)
    - Discard rare class labels?
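
Discarding rare class labels could be sketched as follows. The helper `drop_rare_classes` and the threshold value are hypothetical; a sensible threshold would have to be chosen on the real class distribution.

```python
import pandas as pd

MIN_SAMPLES_PER_CLASS = 2  # hypothetical threshold; tune on the real data


def drop_rare_classes(df: pd.DataFrame, class_col: str, min_count: int) -> pd.DataFrame:
    """Keep only rows whose class occurs at least `min_count` times."""
    counts = df[class_col].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df[class_col].isin(frequent)].reset_index(drop=True)


# Toy data: class 44 occurs only once and is therefore dropped.
toy = pd.DataFrame({"industry": [4, 4, 96, 96, 96, 44], "text": list("abcdef")})
filtered = drop_rare_classes(toy, "industry", MIN_SAMPLES_PER_CLASS)
print(sorted(filtered["industry"].unique().tolist()))
```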

### HTML classification

- Summarize the text first and then classify it? Use the HTML tags for this as well?
- Use the HTML structure to separate **boilerplate content** from the main content beforehand:
    - Plain text is very noisy (it contains a lot of irrelevant material)
- Give certain words/tags higher weights
    - Anchor text (= the clickable text of a hyperlink)
        - too little content on its own (QI, p. 12)
        - the surrounding words are interesting! (QI, p. 12)
        - also useful for the neighboring-pages approach
    - Title, headers (QI, p. 12)
        - also useful for the neighboring-pages approach
    - Keywords per industry
        ```python3  
        Class_1_keywords = ['Office', 'School', 'phone', 'Technology', 'Electronics', 'Cell', 'Business', 'Education', 'Classroom']
        
        Class_2_keywords = ['Restaurant', 'Hospitality', 'Tub', 'Drain', 'Pool', 'Filtration', 'Floor', 'Restroom', 'Consumer', 'Care', 'Bags', 'Disposables']
        
        Class_3_keywords = ['Pull', 'Lifts', 'Pneumatic', 'Emergency', 'Finishing', 'Hydraulic', 'Lockout', 'Towers', 'Drywall', 'Tools', 'Packaging', 'Measure', 'Tag ']
        ```
- Flat or hierarchical classification?
    - flat: classes sit side by side on a single level
    - hierarchical: classes form a hierarchy and build on one another
- Filter by specific keywords only? (this leans more towards plain-text classification, though)
- "Implicit links": pages that both appeared in the results of a **search engine** query and that the user clicked on both (QI, p. 12) &rarr; not really feasible
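
One crude way to give title, header and anchor text a higher weight is to repeat that text before TF-IDF vectorization, so that its term frequency goes up. The sketch below uses only the standard-library `html.parser`; the class name, the tag weights and the list of void tags are assumptions for illustration, not something taken from the cited papers.

```python
from html.parser import HTMLParser

# Hypothetical weights: text inside these tags is repeated this many times.
TAG_WEIGHTS = {"title": 3, "h1": 3, "h2": 2, "h3": 2, "a": 2}
SKIPPED_TAGS = {"script", "style"}  # content is never visible text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # have no closing tag


class WeightedTextExtractor(HTMLParser):
    """Collect visible text and repeat text found inside weighted tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unbalanced HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or SKIPPED_TAGS.intersection(self.open_tags):
            return
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.open_tags), default=1)
        self.parts.extend([text] * weight)


def weighted_text(html: str) -> str:
    parser = WeightedTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><title>Acme Tools</title><style>p{}</style></head>"
    "<body><h1>Hydraulic lifts</h1>"
    "<p>We sell <a href='/x'>pneumatic tools</a> worldwide.</p></body></html>"
)
result = weighted_text(page)
print(result)
```

The output of `weighted_text` could then be fed to the vectorizer in place of the raw `html` column.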


## Paper / Repos

- **Boilerplate Removal using a Neural Sequence Labeling Model** (2020): https://arxiv.org/pdf/2004.14294.pdf
    - Improves on **Web2Text** &rarr; does not rely on expensive, hand-crafted feature engineering
    - <u>Hypothese</u>: "Our hypothesis is that the **order** of text blocks in a web page **encodes important information** about their type, i.e. content or boilerplate, as the placement is determined by the authoring style"
- **Web2Text: Deep Structured Boilerplate Removal** (2018): https://arxiv.org/pdf/1801.02607.pdf
- **Mozilla's Readability**: https://github.com/mozilla/readability
- **Webpage Classification based on Compound of Using HTML Features & URL Features and Features of Sibling Pages** (2010): https://www.researchgate.net/publication/220419545_Webpage_Classification_based_on_Compound_of_Using_HTML_Features_URL_Features_and_Features_of_Sibling_Pages
    - TODO
- **Web Page Classification: Features and Algorithms** (2009): https://www.cs.ucf.edu/~dcm/Teaching/COT4810-Fall%202012/Literature/WebPageClassification.pdf
    - p. 7: Using On-Page Features
        - GOLUB, ARDO (2005): title, headings, metadata, main text
    - TODO

## Tests

Evaluation metric: **F1 Scores**

| Experiment | Dummy | LSVM |
| ---------- |:-----:| ----:|
| Plain Text (DE, 10000 samples) | 0.0268 | 0.6326 |
| Plain Text (DE, all samples) | 0.0271 | 0.6501 |
| Plain HTML (DE, 10000 samples) | 0.0319 | 0.5648 |
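
The notebook does not show how the Dummy column was computed. One plausible way, assuming scikit-learn's `DummyClassifier` and the macro-averaged F1 used in the classifier cells, is sketched here on toy labels. The `stratified` strategy is an assumption; the original baseline may have used a different one.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Toy labels standing in for the industry codes; the dummy ignores the features.
toy_train_labels = [4, 4, 4, 96, 96, 44]
toy_test_labels = [4, 96, 44, 4]
toy_train_features = [[0]] * len(toy_train_labels)
toy_test_features = [[0]] * len(toy_test_labels)

# Assumption: predictions are drawn at random according to the class distribution.
dummy = DummyClassifier(strategy="stratified", random_state=1)
dummy.fit(toy_train_features, toy_train_labels)
dummy_preds = dummy.predict(toy_test_features)

dummy_f1 = f1_score(toy_test_labels, dummy_preds, average="macro", zero_division=0)
print(round(dummy_f1, 4))
```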

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# sklearn classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# sklearn clustering
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# sklearn general
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, f1_score, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder


from stop_words import get_stop_words
import ujson as json


import sys
import os

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from app.utils import remove_pos

In [2]:
DATA_DIR_PATH = "../data/"
TRAIN_PATH_CSV = DATA_DIR_PATH + "train.csv"
TEST_PATH_CSV = DATA_DIR_PATH + "test.csv"

In [40]:
# "text" or "html"
TEXT_COL = "text"

# "group_representative", "industry", "industry_label" or "group"
CLASS_COL = "group_representative"
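TEXT_CLASS_COL = None  # value missing in the original notebook; presumably the column holding the readable class names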

DIM_RED = False
MAX_DOCUMENT_FREQUENCY = 1.
MAX_FEATURES = 1000000
LOWERCASE = False
STOP_WORDS = get_stop_words("de")

# POS TAGGING
POS_TAGGING = False
POS_TAGS = ["NOUN"]

# SUBSAMPLING
SUBSAMPLING = True
N_SAMPLES = 10000
USED_LANG = ["DE"] # use ["ALL"] to keep every language

# Load train csv

In [59]:
%%time
train = pd.read_csv(TRAIN_PATH_CSV)
train.head(5)

CPU times: user 19.4 s, sys: 2.09 s, total: 21.5 s
Wall time: 21.5 s


Unnamed: 0,url,industry,industry_label,group,group_representative,html,text,source,country
0,http://www.dps-software.de,4,Computer Software,tech,96,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<...",DPS Software: DPS Software GmbH - Wir finden L...,xing,DE
1,http://www.sales-rockstars.com,96,Information Technology and Services,tech,96,"<!doctype html>\n<html lang=""de-DE"">\n <head>...",Sales Rockstars – Kommunikationsagentur für de...,xing,DE
2,http://www.immobilien-ps.de,44,Real Estate,"cons, fin, good",44,"<!DOCTYPE html>\n<html lang=""de-DE"" class=""no-...",Paul Schmidmaier Immobilien – Immobilienmakler...,xing,DE
3,http://www.hdp-profitools.de,133,Wholesale,good,133,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 S...",HDP Bauwerkzeuge - Ihr leistungsstarker Servic...,xing,DE
4,http://www.avaya.com,8,Telecommunications,"gov, tech",8,"<!doctype html> \r\n\t<html class=""no-js"" lang...",Avaya | Leader in Business Communication and C...,linkedin,EN


In [66]:
codes.iloc[63].industry

112

In [69]:
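# NOTE: `iloc` selects by row position, not by industry code, so the labels in the output below
# do not belong to the codes (e.g. 96 is shown as "Law Practice").
train["group_representative_label"] = train.apply(lambda row: codes.iloc[row.group_representative].industry_label, axis=1)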

In [70]:
train

Unnamed: 0,url,industry,industry_label,group,group_representative,html,text,source,country,group_representative_label
0,http://www.dps-software.de,4,Computer Software,tech,96,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<...",DPS Software: DPS Software GmbH - Wir finden L...,xing,DE,Law Practice
1,http://www.sales-rockstars.com,96,Information Technology and Services,tech,96,"<!doctype html>\n<html lang=""de-DE"">\n <head>...",Sales Rockstars – Kommunikationsagentur für de...,xing,DE,Law Practice
2,http://www.immobilien-ps.de,44,Real Estate,"cons, fin, good",44,"<!DOCTYPE html>\n<html lang=""de-DE"" class=""no-...",Paul Schmidmaier Immobilien – Immobilienmakler...,xing,DE,Research
3,http://www.hdp-profitools.de,133,Wholesale,good,133,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 S...",HDP Bauwerkzeuge - Ihr leistungsstarker Servic...,xing,DE,Hospitality
4,http://www.avaya.com,8,Telecommunications,"gov, tech",8,"<!doctype html> \r\n\t<html class=""no-js"" lang...",Avaya | Leader in Business Communication and C...,linkedin,EN,Museums and Institutions
...,...,...,...,...,...,...,...,...,...,...
30287,http://www.connex-stb.de,10,Legal Services,leg,10,<!DOCTYPE html>\n<!--[if lt IE 7 ]><html lang=...,"Connex: Steuerberatung, Unternehmensberatung, ...",xing,DE,Writing and Editing
30288,http://www.frettwork.de,11,Management Consulting,"corp, consulting",11,"<!DOCTYPE html>\r\n<html xmlns=""https://www.w3...",Frettwork Network\n\nDE | EN | NL\n\n\nFRETTWO...,xing,DE,Motion Pictures and Film
30289,http://www.zwf.de,4,Computer Software,tech,96,"<!DOCTYPE html> \n<html dir=""ltr"" lang=""de-DE""...","ERP, Infor, Comet PA\n\nZWF Software & Consult...",xing,DE,Law Practice
30290,http://www.imsengineering.co.za,56,Mining & Metals,man,55,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n<...",IMS ENGINEERING - Homepage\n\nMENU\n\n\nAbout ...,linkedin,EN,Apparel & Fashion
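
A lookup keyed by industry code rather than by row position could be sketched as below. The toy `codes` table is an assumption: it is taken to have one row per code with the columns `industry` and `industry_label`, as the cells above suggest.

```python
import pandas as pd

# Toy stand-ins; `codes` is assumed to hold one row per industry code.
codes = pd.DataFrame({
    "industry": [4, 96, 44],
    "industry_label": [
        "Computer Software",
        "Information Technology and Services",
        "Real Estate",
    ],
})
train = pd.DataFrame({"group_representative": [96, 44, 96, 30]})

# Index the label column by industry code, then map codes to labels.
code_to_label = codes.set_index("industry")["industry_label"]
train["group_representative_label"] = train["group_representative"].map(code_to_label)

# Codes missing from the table (30 here) become NaN instead of raising a KeyError.
print(train)
```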


In [60]:
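# Build a code -> label lookup from the training data
lookup = pd.Series(train.industry_label.values, index=train.industry).to_dict()
# Raises KeyError for group representatives (e.g. 30) that never occur as an `industry` value in train
train["group_representative_label"] = train.apply(lambda row: lookup[row.group_representative], axis=1)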

KeyError: 30

In [61]:
lookup

{4: 'Computer Software',
 96: 'Information Technology and Services',
 44: 'Real Estate',
 133: 'Wholesale',
 8: 'Telecommunications',
 6: 'Internet',
 31: 'Hospitality',
 34: 'Food & Beverages',
 56: 'Mining & Metals',
 41: 'Banking',
 53: 'Automotive',
 135: 'Mechanical or Industrial Engineering',
 116: 'Logistics and Supply Chain',
 50: 'Architecture & Planning',
 80: 'Marketing and Advertising',
 11: 'Management Consulting',
 15: 'Pharmaceuticals',
 113: 'Online Media',
 137: 'Human Resources',
 17: 'Medical Devices',
 48: 'Construction',
 43: 'Financial Services',
 124: 'Health, Wellness and Fitness',
 27: 'Retail',
 10: 'Legal Services',
 42: 'Insurance',
 144: 'Renewables & Environment',
 55: 'Machinery',
 112: 'Electrical/Electronic Manufacturing',
 57: 'Oil & Energy'}

In [5]:
train.shape

(30292, 9)

## Some information about the dataset

In [6]:
print("Most frequent countries:\n")
train.country.value_counts().head(5)

Most frequent countries:



DE    19848
EN     8244
NL      433
FR      407
ES      304
Name: country, dtype: int64

In [7]:
text_percentage = train.apply(lambda row: len(row.text)/len(row.html), axis=1)

print(f"Average/mean share of actual/plain text of HTML: {np.round(np.mean(text_percentage), decimals=2)*100}%")

Average/mean share of actual/plain text of HTML: 8.0%


In [42]:
unique_classes = list(np.unique(train[CLASS_COL]))

print(f"{CLASS_COL} ({len(unique_classes)}): \n")
for idx, i in enumerate(unique_classes):
    print(str(idx+1)+". "+str(i), end="\t")

group_representative (21): 

1. 8	2. 10	3. 11	4. 13	5. 25	6. 30	7. 40	8. 42	9. 43	10. 44	11. 48	12. 53	13. 55	14. 80	15. 96	16. 116	17. 126	18. 133	19. 135	20. 137	21. 144	

## Subsampling

- Only specific language (e.g. "DE")
- Only $n$ samples (e.g. 1000)
- Stratified sampling by industry col

In [44]:
if SUBSAMPLING:
    
    if USED_LANG[0] != "ALL":
        train = train[train.country.isin(USED_LANG)]
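    # NOTE: `weights=CLASS_COL` uses the numeric class codes as sampling weights, so rows with
    # larger codes are drawn more often; this is weighted sampling, not stratified sampling.
    train = train.sample(n=N_SAMPLES, weights=CLASS_COL, random_state=1).reset_index(drop=True)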
    
    
unique_sampled_classes = len(train[CLASS_COL].unique())
print("Count of classes (sampled train):", unique_sampled_classes)
print("Equal to original train?", unique_sampled_classes == len(unique_classes))
train.shape

Count of classes (sampled train): 21
Equal to original train? True


(10000, 9)
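
For comparison, a stratified subsample that preserves the class proportions could be drawn with scikit-learn's `train_test_split`. This is a sketch on toy data, not what the cell above does; it assumes every class has at least two rows and that the requested size is smaller than the data set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 60/30/10 class distribution.
toy = pd.DataFrame({
    "group_representative": [96] * 60 + [44] * 30 + [8] * 10,
    "text": ["..."] * 100,
})

# `stratify` keeps the class proportions of the full data in the subsample.
sample, _ = train_test_split(
    toy, train_size=20, stratify=toy["group_representative"], random_state=1
)
sample = sample.reset_index(drop=True)

class_counts = sample["group_representative"].value_counts().to_dict()
print(class_counts)
```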

# Data preprocessing (vectorizing, dimensionality reduction etc.)

- ignore terms with a document frequency > MAX_DOCUMENT_FREQUENCY (`max_df` in TF-IDF)
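
A small sketch of what `max_df` does, on a made-up corpus. With `max_df=1.0`, as configured in this notebook, no term is removed; a lower value drops terms that occur in too many documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: "impressum" occurs in 3 of 4, "software" and "handel" in 2 of 4.
docs = [
    "impressum software beratung",
    "impressum immobilien makler",
    "impressum werkzeuge handel",
    "software handel",
]

# max_df=1.0 keeps every term.
all_terms = TfidfVectorizer(max_df=1.0).fit(docs)
# max_df=0.5 drops terms that occur in more than 50% of the documents.
rare_terms = TfidfVectorizer(max_df=0.5).fit(docs)

dropped = sorted(set(all_terms.vocabulary_) - set(rare_terms.vocabulary_))
print(dropped)
```

Terms at exactly the threshold ("software" and "handel" at 50%) are kept, because only a document frequency strictly above `max_df` is ignored.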

In [45]:
if TEXT_COL == "html":
    POS_TAGGING = False
train_text_plain = train[TEXT_COL].values


train_labels = train[CLASS_COL].values
unique_train_labels = list(np.unique(train[CLASS_COL]))
print("Count of unique classes in train set:", len(unique_train_labels))
print("Count of unique languages in train set:", len(np.unique(train["country"].values)))

Count of unique classes in train set: 21
Count of unique languages in train set: 1


In [46]:
%%time

if POS_TAGGING:
    train_text = remove_pos(train, pos_tags=POS_TAGS)
else:
    train_text = train_text_plain
    print("No POS TAGS are removed.\n")

No POS TAGS are removed.

CPU times: user 855 µs, sys: 58 µs, total: 913 µs
Wall time: 911 µs


### Vectorizing text

In [47]:
%%time

vectorizer = TfidfVectorizer(max_df=MAX_DOCUMENT_FREQUENCY,
                             lowercase=LOWERCASE,
                             max_features=MAX_FEATURES,
                             stop_words=STOP_WORDS)


# Learn the vocabulary and IDF weights on the training texts and vectorize them in one step
train_vector = vectorizer.fit_transform(train_text)

CPU times: user 8.05 s, sys: 66.2 ms, total: 8.11 s
Wall time: 8.11 s


# Test Dataset

In [48]:
%%time
test = pd.read_csv(TEST_PATH_CSV)

if SUBSAMPLING:
    if USED_LANG[0] != "ALL":
        test = test[test.country.isin(USED_LANG)]
    # Keep all test rows and only shuffle them; sampling weights have no effect when every row is drawn
    test = test.sample(frac=1, random_state=1).reset_index(drop=True)


test_vector = vectorizer.transform(test[TEXT_COL].values)
test_labels = test[CLASS_COL].values

CPU times: user 6.88 s, sys: 427 ms, total: 7.31 s
Wall time: 7.31 s


In [49]:
test.shape

(4976, 9)

# Naive Bayes (Multinomial)

In [51]:
%%time
print("Multinomial Naive Bayes CLF", "\n-------------------------")
# training
clf = MultinomialNB(alpha=1.0)
clf.fit(train_vector, train_labels)

# prediction
train_preds = clf.predict(test_vector)

# evaluation
precision = precision_score(test_labels, train_preds, average="macro")
recall = recall_score(test_labels, train_preds, average="macro")
f1 = f1_score(test_labels, train_preds, average="macro")

print(np.round(precision, decimals=4), "\tPrecision")
print(np.round(recall, decimals=4), "\tRecall")
print(np.round(f1, decimals=4), "\tF1")
print()

# Integer class codes must be converted to strings for `target_names`;
# passing `labels` keeps names and classes aligned even if a class is missing from the test set
clf_report = classification_report(test_labels, train_preds,
                                   labels=unique_train_labels,
                                   target_names=[str(c) for c in unique_train_labels])

Multinomial Naive Bayes CLF 
-------------------------
0.4133 	Precision
0.1782 	Recall
0.1466 	F1
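
The Naive Bayes F1 score is lower than both its precision and its recall. That is possible because macro F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall. The toy labels below are made up to reproduce the effect with a classifier that mostly predicts the majority class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: class 1 is never predicted, class 0 is over-predicted.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 2]

# zero_division=0 scores a never-predicted class as 0 without emitting a warning.
macro_precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Per-class F1 shows where the macro average loses points.
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)

print(round(macro_precision, 4), "\tPrecision")
print(round(macro_recall, 4), "\tRecall")
print(round(macro_f1, 4), "\tF1")
print(per_class_f1.round(4))
```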

# LSVM

In [15]:
%%time
print("LSVM CLF", "\n-------------------------")
# training
clf = LinearSVC(C = 1)
clf.fit(train_vector, train_labels)

# prediction
train_preds = clf.predict(test_vector)

# evaluation
precision = precision_score(test_labels, train_preds, average="macro")
recall = recall_score(test_labels, train_preds, average="macro")
f1 = f1_score(test_labels, train_preds, average="macro")
print(np.round(precision, decimals=4), "\tPrecision")
print(np.round(recall, decimals=4), "\tRecall")
print(np.round(f1, decimals=4), "\tF1")
print()

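# `industry_label` only matches CLASS_COL = "industry"; derive the names from the classes actually used
clf_report = classification_report(test_labels, train_preds,
                                   labels=unique_train_labels,
                                   target_names=[str(c) for c in unique_train_labels])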

LSVM CLF 
-------------------------
0.6517 	Precision
0.5683 	Recall
0.5486 	F1

CPU times: user 4.5 s, sys: 18.1 ms, total: 4.52 s
Wall time: 4.52 s


# Confusion Matrix

TODO: update label and text column names etc.; carry over the general changes from above

In [None]:
NORMALIZE_CM = True
INDUSTRY_THRESHOLD = 250
PLT_SCALING_FACTOR = 0.8

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# Keep only classes with more than INDUSTRY_THRESHOLD training samples
filtered_train = train.groupby(CLASS_COL).filter(lambda x: len(x) > INDUSTRY_THRESHOLD)
remaining_industries = sorted(filtered_train[CLASS_COL].unique())

# Fix the label order explicitly so that the matrix rows/columns match `classes`
classes = unique_train_labels
cnf_matrix = confusion_matrix(test_labels, train_preds, labels=classes)

cnf_df = pd.DataFrame(cnf_matrix, index=classes, columns=classes)
cnf_df = cnf_df.loc[remaining_industries, remaining_industries]

In [None]:
plt.figure(figsize=(10*PLT_SCALING_FACTOR, 8*PLT_SCALING_FACTOR))

if NORMALIZE_CM:
    # Divide each row by its row sum (row-wise normalization)
    normalized_cnf_df = cnf_df.div(cnf_df.sum(axis=1), axis=0)
    sns.heatmap(normalized_cnf_df, annot=True, cmap=sns.color_palette("Blues"), fmt='.2f')
else:
    sns.heatmap(cnf_df, annot=True, cmap=sns.color_palette("Blues"), fmt='g')
plt.tight_layout()
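
The row-wise normalization can be checked on a small example. The labels below are made up; passing `labels` explicitly fixes the row and column order of the matrix.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up class codes and predictions.
labels = [8, 44, 96]
y_true = [8, 8, 44, 44, 44, 96]
y_pred = [8, 44, 44, 44, 96, 96]

cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)

# Divide every row by its sum: each row then shows how the true class was distributed over predictions.
normalized = cm_df.div(cm_df.sum(axis=1), axis=0)
print(normalized.round(2))
```

On the full matrix the diagonal of the normalized result is the per-class recall; after filtering columns, as in the cell above, the rows are normalized over the remaining classes only.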