# Classify Industries

## TODOs

- Introduce POS removal (see `clustering_whole_corpus`)
    - possibly combined with a language identifier?
- Introduce a preprocessing function
    - strip `\n`
    - strip other useless characters such as `|`
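
A minimal sketch of what such a preprocessing function could look like. The name `clean_text` and the set of noise characters are assumptions for illustration, not part of the existing code base.

```python
import re

# Characters treated as layout noise; this set is an assumption, extend as needed.
NOISE_CHARS = r"[|•·»«]"


def clean_text(text: str) -> str:
    """Replace noise characters with spaces, then collapse all whitespace runs."""
    text = re.sub(NOISE_CHARS, " ", text)
    # split() without arguments splits on any whitespace run, including "\n" and "\t"
    return " ".join(text.split())


print(clean_text("Home | NETZkultur GmbH\n\nZum Inhalt wechseln"))
```

The function could be applied to the `text` column before vectorizing, e.g. via `train["text"].map(clean_text)`.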

## IDEAS

### Dataset preparation

- Translate the websites into a single language (e.g. English)
- Different class labels?
    - Classify into broader classes first, then classify more finely within each of them?
        - From: https://towardsdatascience.com/industrial-classification-of-websites-by-machine-learning-with-hands-on-python-3761b1b530f1
            - Technology, Office, & Education products website (Class_1)
            - Consumer products website (Class_2)
            - Industrial Tools and Hardware products website (Class_3)
    - Drop rare class labels?
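
Dropping rare class labels could be sketched as below. The function name `drop_rare_classes` and the `min_count` threshold are hypothetical; the notebook itself does not yet filter the training data this way.

```python
from collections import Counter


def drop_rare_classes(texts, labels, min_count):
    """Keep only samples whose label occurs at least `min_count` times."""
    counts = Counter(labels)
    kept = [(text, label) for text, label in zip(texts, labels) if counts[label] >= min_count]
    kept_texts = [text for text, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_texts, kept_labels


# Toy data: label 4 occurs three times, labels 53 and 117 only once each
texts = ["a", "b", "c", "d", "e"]
labels = [4, 4, 53, 4, 117]
kept_texts, kept_labels = drop_rare_classes(texts, labels, min_count=2)
print(kept_texts, kept_labels)
```

The same filter has to be applied to the test set, otherwise it contains classes the classifier has never seen.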

### HTML classification

- Summarize the text and then classify it? Use the HTML tags for that as well?
- Use the HTML structure to separate **boilerplate content** from the main content beforehand:
    - Plain text is very noisy (it contains a lot of irrelevant material)
- Give certain words/tags higher weights
    - Anchor text (= the clickable text of a hyperlink)
        - too little content on its own (QI, p. 12)
        - the surrounding words are interesting! (QI, p. 12)
        - also useful for the neighboring-pages approach
    - Title, headers (QI, p. 12)
        - also useful for the neighboring-pages approach
    - Keywords for industries
        ```python3  
        Class_1_keywords = ['Office', 'School', 'phone', 'Technology', 'Electronics', 'Cell', 'Business', 'Education', 'Classroom']
        
        Class_2_keywords = ['Restaurant', 'Hospitality', 'Tub', 'Drain', 'Pool', 'Filtration', 'Floor', 'Restroom', 'Consumer', 'Care', 'Bags', 'Disposables']
        
        Class_3_keywords = ['Pull', 'Lifts', 'Pneumatic', 'Emergency', 'Finishing', 'Hydraulic', 'Lockout', 'Towers', 'Drywall', 'Tools', 'Packaging', 'Measure', 'Tag ']
        ```
- Flat classification or hierarchical classification?
    - flat: parallel classes
    - hierarchical: classes are arranged in a hierarchy and build on each other
- Filter only by certain keywords? (this leans more toward plain-text classification, though)
- "Implicit links": pages that both appeared in the results of a **search engine** query and were both clicked by the user (QI, p. 12) &rarr; not really feasible
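
One simple way to give title and header text a higher weight is to repeat it before vectorizing, which raises its term frequency in TF-IDF. The sketch below uses only the standard-library `html.parser`. The class name, the `TAG_WEIGHTS` values and the choice of tags are assumptions for illustration and have not been evaluated on this dataset.

```python
from html.parser import HTMLParser

# Hypothetical weights: how many times the text inside a tag is repeated.
TAG_WEIGHTS = {"title": 3, "h1": 2, "h2": 2}


class WeightedTextExtractor(HTMLParser):
    """Collect visible text and repeat text that sits inside weighted tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag; tolerates unclosed tags like <br>.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or self.SKIPPED_TAGS & set(self.open_tags):
            return
        # Use the highest weight among all enclosing tags (1 if none is weighted)
        weight = max([TAG_WEIGHTS.get(t, 1) for t in self.open_tags], default=1)
        self.chunks.extend([text] * weight)


def weighted_text(html: str) -> str:
    parser = WeightedTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(weighted_text("<title>Acme</title><h1>Drills</h1><p>We sell drills.</p>"))
```

The resulting string can be passed to the vectorizer in place of the raw `html` column; anchor text could be handled the same way by adding `"a"` to `TAG_WEIGHTS`.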


## Paper / Repos

- **Boilerplate Removal using a Neural Sequence Labeling Model** (2020): https://arxiv.org/pdf/2004.14294.pdf
    - Improvement over **Web2Text** &rarr; does not rely on expensive, hand-crafted feature engineering
    - <u>Hypothesis</u>: "Our hypothesis is that the **order** of text blocks in a web page **encodes important information** about their type, i.e. content or boilerplate, as the placement is determined by the authoring style"
- **Web2Text: Deep Structured Boilerplate Removal** (2018): https://arxiv.org/pdf/1801.02607.pdf
- **Mozilla's readability**: https://github.com/mozilla/readability
- **Webpage Classification based on Compound of Using HTML Features & URL Features and Features of Sibling Pages** (2010): https://www.researchgate.net/publication/220419545_Webpage_Classification_based_on_Compound_of_Using_HTML_Features_URL_Features_and_Features_of_Sibling_Pages
    - TODO
- **Web Page Classification: Features and Algorithms** (2009): https://www.cs.ucf.edu/~dcm/Teaching/COT4810-Fall%202012/Literature/WebPageClassification.pdf
    - p. 7: Using On-Page Features
        - GOLUB, ARDO (2005): title, headings, metadata, main text
    - TODO

## Tests

Evaluation metric: **macro-averaged F1 score**

| Experiment | Dummy | LSVM |
| ---------- |:-----:| ----:|
| Plain Text (no POS removal) | 0.01665 | 0.50609 |
| Plain Text (POS removal: VERB, ADJ, NOUN) | 0.01417 | 0.01395 |
| HTML | 0.0145 | 0.46618 |
| Clean HTML | 0.01554 | 0.39454 |
| Plain Text (POS removal: NOUN, only DE) | 0.01707 | 0.01865 |
| HTML (only DE) | 0.01709 | 0.41037 |
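
The scores above are computed with `average="macro"`, i.e. the unweighted mean of the per-class F1 scores, so rare industries count as much as frequent ones. The pure-Python sketch below illustrates that definition; `macro_f1` is a hypothetical helper, not the scikit-learn implementation, and it assumes that a class without any true or predicted samples contributes an F1 of 0.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class: 80 % accuracy,
# but class 1 gets an F1 of 0, which pulls the macro average down.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print(round(macro_f1(y_true, y_pred), 4))
```

This is why a model that ignores small industries can still look acceptable on accuracy while scoring poorly on macro F1.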




In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# sklearn classification
from sklearn.dummy import DummyClassifier
from sklearn.svm import LinearSVC

# sklearn clustering
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# sklearn general
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, f1_score, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder


from stop_words import get_stop_words
import ujson as json


import sys
import os

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from app.utils import remove_pos

In [2]:
DATA_PATH = "../data/"

INDUSTRY_CODES_PATH = DATA_PATH + "linkedin-industry-codes.json"
TRAIN_PATH_JSON = DATA_PATH + "train.ndjson"
TEST_PATH_JSON = DATA_PATH + "test.ndjson"
TRAIN_PATH_CSV = DATA_PATH + "train.csv"
TEST_PATH_CSV = DATA_PATH + "test.csv"

In [8]:
DIM_RED = False
MAX_DOCUMENT_FREQUENCY = 1.
MAX_FEATURES = 1000000
LOWERCASE = False
STOP_WORDS = get_stop_words("de")

# POS TAGGING
POS_TAGGING = True
POS_TAGS = ["NOUN"]

# SUBSAMPLING
SUBSAMPLING = True
N_SAMPLES = 10000
USED_LANG = ["DE"]  # values of the "country" column to keep; ["ALL"] disables the filter


# HTML
USE_HTML = True

# Load train csv

In [9]:
%%time
train = pd.read_csv(TRAIN_PATH_CSV)
train.head(5)

CPU times: user 8.36 s, sys: 889 ms, total: 9.25 s
Wall time: 9.25 s


Unnamed: 0,text,html,industry,country,industry_name
0,Home | NETZkultur GmbH\n\nZum Inhalt wechseln\...,"<!DOCTYPE html>\n<html lang=""de-DE"">\n<head>\n...",4,DE,Computer Software
1,"\n\nNXP Semiconductors | Automotive, Security,...",<!DOCTYPE html>\n<html>\n<head>\n\t<title>NXP ...,7,UNKNOWN,Semiconductors
2,Suer Nutzfahrzeugtechnik Onlineshop\n\nSie wis...,"<!DOCTYPE html>\n<html lang=""de"">\n <head>\...",53,DE,Automotive
3,Improve cash flows and long-term profitability...,"\n<!DOCTYPE html>\n<html lang=""en"" prefix=""og:...",43,UNKNOWN,Financial Services
4,Your specialist for plastic compounds\n\nMenu ...,"<!DOCTYPE html>\n<html xmlns:og=""http://ogp.me...",117,UNKNOWN,Plastics


In [10]:
train.shape

(13114, 5)

In [30]:
list(np.unique(train.industry_name))

['Apparel & Fashion',
 'Architecture & Planning',
 'Automotive',
 'Banking',
 'Building Materials',
 'Chemicals',
 'Civil Engineering',
 'Computer & Network Security',
 'Computer Hardware',
 'Computer Software',
 'Construction',
 'Consumer Goods',
 'Electrical/Electronic Manufacturing',
 'Events Services',
 'Financial Services',
 'Food & Beverages',
 'Furniture',
 'Government Administration',
 'Graphic Design',
 'Health, Wellness and Fitness',
 'Higher Education',
 'Human Resources',
 'Industrial Automation',
 'Information Technology and Services',
 'Insurance',
 'Internet',
 'Legal Services',
 'Leisure, Travel & Tourism',
 'Logistics and Supply Chain',
 'Management Consulting',
 'Marketing and Advertising',
 'Mechanical or Industrial Engineering',
 'Medical Devices',
 'Mining & Metals',
 'Oil & Energy',
 'Online Media',
 'Pharmaceuticals',
 'Plastics',
 'Printing',
 'Professional Training & Coaching',
 'Public Relations and Communications',
 'Publishing',
 'Real Estate',
 'Renewables 

### Subsampling

- Only specific language (e.g. "DE")
- Only first $n$ samples (e.g. 1000)

In [11]:
if SUBSAMPLING:
    
    if USED_LANG[0] != "ALL":
        train = train[train.country.isin(USED_LANG)]
    train = train.head(N_SAMPLES)
train.shape

(7375, 5)

# Data preprocessing (vectorizing, dimensionality reduction etc.)

- Ignore terms with a document frequency > `MAX_DOCUMENT_FREQUENCY` (`max_df` in TF-IDF)
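
When `max_df` is a float, it is interpreted as a proportion of documents, so the setting `MAX_DOCUMENT_FREQUENCY = 1.` used in this notebook does not remove any term. The sketch below illustrates the filtering idea in plain Python; `filter_by_max_df` and the toy documents are hypothetical and only mimic the criterion, not the `TfidfVectorizer` internals.

```python
from collections import Counter


def filter_by_max_df(tokenized_docs, max_df):
    """Return the sorted vocabulary without terms whose document-frequency share exceeds `max_df`."""
    n_docs = len(tokenized_docs)
    # Count every term at most once per document
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items() if df / n_docs <= max_df)


docs = [
    ["home", "software", "kontakt"],
    ["home", "automotive"],
    ["home", "kontakt", "plastics"],
]
print(filter_by_max_df(docs, max_df=1.0))  # nothing is removed
print(filter_by_max_df(docs, max_df=0.5))  # "home" and "kontakt" are removed
```

Lowering the threshold is a cheap way to drop navigation words that appear on almost every website.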

In [12]:
if USE_HTML:
    POS_TAGGING = False
    train_text_plain = train["html"].values
else:
    train_text_plain = train["text"].values


train_labels = train["industry"].values
unique_train_labels = list(np.unique(train["industry"]))
print("Count of unique industry names in train set:", len(unique_train_labels))
print("Count of unique languages in train set:", len(np.unique(train["country"].values)))

Count of unique industry names in train set: 50
Count of unique languages in train set: 1


In [13]:
%%time

if POS_TAGGING:
    train_text = remove_pos(train, pos_tags=POS_TAGS)
else:
    train_text = train_text_plain
    print("No POS TAGS are removed.\n")

No POS TAGS are removed.

CPU times: user 90 µs, sys: 37 µs, total: 127 µs
Wall time: 111 µs


### Vectorizing text

In [14]:
%%time


vectorizer = TfidfVectorizer(max_df=MAX_DOCUMENT_FREQUENCY,
                             lowercase=LOWERCASE,
                             max_features=MAX_FEATURES,
                             stop_words=STOP_WORDS)


vectorizer.fit(train_text)

train_vector = vectorizer.transform(train_text)

CPU times: user 1min 32s, sys: 389 ms, total: 1min 32s
Wall time: 1min 32s


# Test Dataset

There is one class/industry that appears in the test set but not in the training set. All instances of this class were removed from the test set.

In [15]:
%%time
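test = pd.read_csv(TEST_PATH_CSV)
test = test[test["industry"].isin(unique_train_labels)]

# Apply the same language filter as for the training set
if SUBSAMPLING and USED_LANG[0] != "ALL":
    test = test[test.country.isin(USED_LANG)]

# NOTE: the test set is vectorized from the "text" column, even when USE_HTML
# is True and the vectorizer was fitted on the "html" column.
test_vector = vectorizer.transform(test["text"].values)
test_labels = test["industry"].values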

CPU times: user 3.21 s, sys: 261 ms, total: 3.47 s
Wall time: 3.48 s


#### Get industry names for test data

In [16]:
with open(INDUSTRY_CODES_PATH) as f:
    industry_codes = json.load(f)
    
def get_code(code_list, identifier):
    """Return the "Description" of the first entry whose "Code" equals `identifier`, or "" if none matches."""
    for entry in code_list:
        if entry["Code"] == identifier:
            return entry["Description"]
    return ""

test_label_names = [get_code(industry_codes, code) for code in test["industry"]]
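
`get_code` scans the whole code list for every sample. If that becomes a bottleneck, a dictionary built once gives the same lookup in constant time. The sketch below assumes that codes are unique and uses two hypothetical entries in the shape that `get_code` expects (`"Code"` / `"Description"` keys).

```python
# Hypothetical entries; the real list is loaded from linkedin-industry-codes.json
industry_codes_example = [
    {"Code": 4, "Description": "Computer Software"},
    {"Code": 53, "Description": "Automotive"},
]

# Build the lookup table once
code_to_name = {entry["Code"]: entry["Description"] for entry in industry_codes_example}


def get_name(code):
    """Return the industry name for `code`, or "" if the code is unknown."""
    return code_to_name.get(code, "")


print([get_name(code) for code in [53, 4, 999]])
```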

# Dummy

In [17]:
%%time
print("Dummy CLF", "\n-------------------------")
clf = DummyClassifier(strategy="uniform")
clf.fit(train_vector, train_labels)

train_preds = clf.predict(test_vector)

precision = precision_score(test_labels, train_preds, average="macro")
recall = recall_score(test_labels, train_preds, average="macro")
f1 = f1_score(test_labels, train_preds, average="macro")
print(np.round(precision, decimals=5), "\tPrecision")
print(np.round(recall, decimals=5), "\tRecall")
print(np.round(f1, decimals=5), "\tF1")
print()

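# Name the report rows in the order of the sorted label codes
report_labels = np.unique(test_labels)
report_names = [get_code(industry_codes, code) for code in report_labels]
clf_report = classification_report(test_labels, train_preds, labels=report_labels, target_names=report_names)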

Dummy CLF 
-------------------------
0.02184 	Precision
0.023 	Recall
0.01709 	F1

CPU times: user 11.1 ms, sys: 1.3 ms, total: 12.4 ms
Wall time: 11.2 ms


# LSVM

In [18]:
%%time
clf = LinearSVC(C = 1)
clf.fit(train_vector, train_labels)

train_preds = clf.predict(test_vector)

# Metrics
print("LSVM CLF", "\n-------------------------")
precision = precision_score(test_labels, train_preds, average="macro")
recall = recall_score(test_labels, train_preds, average="macro")
f1 = f1_score(test_labels, train_preds, average="macro")
print(np.round(precision, decimals=5), "\tPrecision")
print(np.round(recall, decimals=5), "\tRecall")
print(np.round(f1, decimals=5), "\tF1")
print()

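# Name the report rows in the order of the sorted label codes
report_labels = np.unique(test_labels)
report_names = [get_code(industry_codes, code) for code in report_labels]
clf_report = classification_report(test_labels, train_preds, labels=report_labels, target_names=report_names)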

LSVM CLF 
-------------------------
0.56993 	Precision
0.36637 	Recall
0.41037 	F1

CPU times: user 42.1 s, sys: 182 ms, total: 42.3 s
Wall time: 42.3 s




# Confusion Matrix

In [None]:
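NORMALIZE_CM = True
INDUSTRY_THRESHOLD = 250
PLT_SCALING_FACTOR = 0.8

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# Keep only industries with more than INDUSTRY_THRESHOLD training samples
filtered_train = train.groupby("industry_name").filter(lambda x: len(x) > INDUSTRY_THRESHOLD)
remaining_industries = filtered_train["industry_name"].drop_duplicates().tolist()

# Fix the label order explicitly so that rows and columns can be named reliably
cnf_matrix = confusion_matrix(test_labels, train_preds, labels=unique_train_labels)

# Industry names in the same order as unique_train_labels
code_to_name = train.drop_duplicates("industry").set_index("industry")["industry_name"]
classes = code_to_name.loc[unique_train_labels].tolist()

cnf_df = pd.DataFrame(cnf_matrix, index=classes, columns=classes)
cnf_df = cnf_df.loc[remaining_industries, remaining_industries]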

In [None]:
plt.figure(figsize=(10*PLT_SCALING_FACTOR, 8*PLT_SCALING_FACTOR))

if NORMALIZE_CM:
    # Row-normalize: divide each row (true class) by its row sum
    normalized_cnf_df = cnf_df.div(cnf_df.sum(axis=1), axis=0)
    sns.heatmap(normalized_cnf_df, annot=True, cmap=sns.color_palette("Blues"), fmt='.2f')
else:
    sns.heatmap(cnf_df, annot=True, cmap=sns.color_palette("Blues"), fmt='g')
plt.tight_layout()