# Leftover

## Tests

- Evaluation metric: **F1 score** (macro-averaged); precision is reported in parentheses
- TF-IDF Vectorizer
    - no lowercasing
    - stop words are removed
    - no limit on the number of features (`max_features`)
- Top $n$ classes = the $n$ most frequent classes
- Clean HTML is also applied to the test set (otherwise accuracy is extremely poor and the comparison is fairly pointless)
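
The vectorizer settings listed above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual configuration: the stop word list `GERMAN_STOP_WORDS` and the two toy documents are hypothetical, since the notes do not say which stop word list was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stop word list (assumption: the notes do not name the list used).
GERMAN_STOP_WORDS = ["der", "die", "das", "und", "ist"]

vectorizer = TfidfVectorizer(
    lowercase=False,               # keep the original casing
    stop_words=GERMAN_STOP_WORDS,  # drop stop words
    max_features=None,             # no cap on the vocabulary size
)

# Toy documents, for illustration only.
docs = ["Die Bank ist neu", "die Bank und das Haus"]
matrix = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(matrix.shape)
```

With `lowercase=False` the stop word match is case-sensitive, so in this toy example the capitalized `Die` stays in the vocabulary while `die` is removed.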


#### Label: `group_representatives`

| Experiment | KNN F1 (Precision) | LSVM F1 (Precision) |
| ---------- |:-----:| ----:|
| Plain Text (all samples) | **0.4295** (0.4774) | **0.6874** (0.7178) |
| Plain HTML (all samples) | **0.1474** (0.1954) | **0.5109** (0.6139) |
| | | |
| Plain Text ([DE] all samples) | **0.5224** (0.5563) | **0.7209** (0.7476) |
| Plain HTML ([DE] all samples) | **0.1515** (0.1915) | **0.5194** (0.6232) |
| Plain Text ([DE, EN] all samples) | **0.472** (0.5258) | **0.7012** (0.7262) |
| Plain HTML ([DE, EN] all samples) | **0.1466** (0.197) | **0.5166** (0.6211) |
| | | |
| Plain Text (all samples) (Top 10 classes) | **0.2059** (0.203) | **0.3024** (0.2721) |
| Plain HTML (all samples) (Top 10 classes) | **0.0943** (0.0995) | **0.2376** (0.2323) |
| Plain Text ([DE] all samples) (Top 10 classes) | **0.2212** (0.2036) | **0.3006** (0.2687) |
| Plain HTML ([DE] all samples) (Top 10 classes) | **0.089** (0.0912) | **0.2231** (0.2164) |
| | | |
| Clean HTML (all samples)  | **0.0612** (0.0912) | **0.5086** (0.6424) |
| Clean HTML ([DE] all samples)  | **0.0624** (0.0806) | **0.5579** (0.6607) |
| Clean HTML ([DE, EN] all samples)  | **0.065** (0.088) | **0.5354** (0.645) |

#### Label: `industry`

| Experiment | KNN F1 (Precision) | LSVM F1 (Precision) |
| ---------- |:-----:| ----:|
| Plain Text (all samples)  | **0.3788** (0.4416) | **0.6218** (0.6423) |
| Plain HTML (all samples)  | **0.1295** (0.1933) | **0.4589** (0.537) |
| | | |
| Plain Text ([DE] all samples)  | **0.4592** (0.5016) | **0.6446** (0.6591) |
| Plain HTML ([DE] all samples)  | **0.1285** (0.1949) | **0.4574** (0.5382) |

## Some information about the dataset

In [1]:
print("Most frequent countries:\n")
train.country.value_counts().head(5)


In [None]:
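# Ratio of plain-text length to raw HTML length, per document
text_percentage = train.apply(lambda row: len(row.text) / len(row.html), axis=1)

# Format as a percentage directly; multiplying a rounded float by 100 can print e.g. 28.999999999999996
print(f"Mean share of plain text in the HTML: {np.mean(text_percentage):.0%}")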

In [None]:
unique_classes = list(np.unique(train[CLASS_NAMES]))

print(f"{CLASS_NAMES} ({len(unique_classes)}): \n")
for idx, i in enumerate(unique_classes):
    print(str(idx+1)+". "+str(i), end="\t")

## Subsampling

- Keep only specific languages (e.g. `"DE"`), filtered via the `country` column
- Keep at most $n$ samples (e.g. 1000)
- Stratified sampling by the `industry` column
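
As a small illustration of what stratified sampling means here, the sketch below draws the same fraction from every class of a toy frame. The column values and the 10% fraction are hypothetical; only the `industry` column name comes from the notes.

```python
import pandas as pd

# Toy data (hypothetical): 80 rows of class "A" and 20 rows of class "B".
toy = pd.DataFrame({
    "industry": ["A"] * 80 + ["B"] * 20,
    "text": ["doc"] * 100,
})

# Draw 10% of the rows from every class, so class proportions are preserved.
sampled = toy.groupby("industry").sample(frac=0.1, random_state=1)

counts = sampled["industry"].value_counts().to_dict()
print(counts)
```

Each class contributes 10% of its rows (8 and 2 in this toy case), so the 80/20 split of the full frame carries over to the sample.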

In [None]:
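if SUBSAMPLING:
    if USED_LANG[0] != "ALL":
        train = train[train.country.isin(USED_LANG)]
    max_samples = min(N_SAMPLES, train.shape[0])
    # Stratified sampling: draw the same fraction from every class so that class
    # proportions are preserved (`sample(weights=CLASS_COL)` would instead use the
    # class column's values as sampling weights). Per-class rounding can make the
    # total differ slightly from max_samples.
    frac = max_samples / train.shape[0]
    train = train.groupby(CLASS_COL).sample(frac=frac, random_state=1).reset_index(drop=True)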
    
    
unique_sampled_classes = len(train[CLASS_COL].unique())
print("Count of classes (sampled train):", unique_sampled_classes)
print("Equal to original train?", unique_sampled_classes == len(unique_classes))
train.shape

## Use top n labels

- Keep only the $n$ most frequent classes (`TOP_N_LABELS`)

In [None]:
if USE_TOP_LABELS:
    top_n_classes = train[CLASS_COL].value_counts().head(TOP_N_LABELS).index
    train = train[train[CLASS_COL].isin(top_n_classes)]
    
    unique_classes = list(np.unique(train[CLASS_NAMES]))

    print(f"{CLASS_NAMES} ({len(unique_classes)}): \n")
    for idx, i in enumerate(unique_classes):
        print(str(idx+1)+". "+str(i), end="\t")
else:
    print("Using all labels.")

# Data preprocessing (vectorization, dimensionality reduction, etc.)

- Ignore terms whose document frequency exceeds `MAX_DOCUMENT_FREQUENCY` (`max_df` in TF-IDF)
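
The effect of `max_df` can be illustrated with a minimal sketch. The threshold value and the toy documents below are assumptions; the notebook's real `MAX_DOCUMENT_FREQUENCY` is not given in these notes, and the actual vectorizer may use further settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_DOCUMENT_FREQUENCY = 0.5  # assumed value, for illustration only

# Toy training documents: "bank" occurs in 3 of 4 documents, "food" in 2 of 4.
docs = [
    "bank loan credit",
    "bank car engine",
    "bank food restaurant",
    "hotel food travel",
]

# Terms appearing in more than 50% of the documents are ignored.
vectorizer = TfidfVectorizer(max_df=MAX_DOCUMENT_FREQUENCY)
train_vector = vectorizer.fit_transform(docs)
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)

# Transforming new text reuses the fitted vocabulary; dropped terms are ignored.
test_vector = vectorizer.transform(["bank hotel"])
print(test_vector.shape, test_vector.nnz)
```

Here `bank` exceeds the threshold and is dropped, while `food` sits exactly at 50% and is kept, because `max_df` only removes terms strictly above the limit.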

In [None]:
%%time

train_labels = train[CLASS_COL].values
unique_train_labels = list(np.unique(train[CLASS_COL]))
print("Count of unique classes in train set:", len(unique_train_labels))
print("Count of unique languages in train set:", len(np.unique(train["country"].values)))

### Remove tokens by POS tag

In [None]:
%%time

if POS_TAGGING:
    train_text = remove_pos(train, pos_tags=POS_TAGS)
else:
    train_text = train[TEXT_COL]
    print("No POS TAGS are removed.\n")

In [None]:
%%time
test = pd.read_csv(TEST_PATH_CSV)

if SUBSAMPLING:
    if USED_LANG[0] != "ALL":
        test = test[test.country.isin(USED_LANG)]
    # Shuffle only: the full (language-filtered) test set is kept
    test = test.sample(frac=1, random_state=1).reset_index(drop=True)
    

test_vector = vectorizer.transform(test[TEXT_COL].values)
test_labels = test[CLASS_COL].values

# K-Nearest Neighbors

In [None]:
%%time
print("K-Nearest Neighbors CLF", "\n-------------------------")
# training
clf = KNeighborsClassifier()
clf.fit(train_vector, train_labels)

# prediction
test_preds = clf.predict(test_vector)

# evaluation (macro average: every class counts equally, regardless of its size)
precision = precision_score(test_labels, test_preds, average="macro", zero_division=0)
recall = recall_score(test_labels, test_preds, average="macro", zero_division=0)
f1 = f1_score(test_labels, test_preds, average="macro", zero_division=0)
clf1_f1 = np.round(f1, decimals=4)
clf1_precision = np.round(precision, decimals=4)

print(clf1_precision, "\tPrecision")
print(np.round(recall, decimals=4), "\tRecall")
print(clf1_f1, "\tF1")
print()

clf_report = classification_report(test_labels, 
                                   test_preds, 
                                   target_names = np.unique(test[CLASS_NAMES]), 
                                   zero_division = 0)
print(clf_report)
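
To make the `average="macro"` setting used above more concrete, here is a minimal pure-Python sketch of macro-averaged precision, recall and F1, with undefined ratios set to 0 as with `zero_division=0`. The helper `macro_scores` and the toy labels are hypothetical and only meant to show the arithmetic, not to replace the scikit-learn functions.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1); undefined ratios count as 0."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    # Macro average: unweighted mean over classes.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels: the rare class "b" is never predicted.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "c"]
macro_precision, macro_recall, macro_f1 = macro_scores(y_true, y_pred)
print(round(macro_precision, 4), round(macro_recall, 4), round(macro_f1, 4))
```

In this toy case 5 of 6 predictions are correct, yet the macro F1 is only about 0.63, because the missed class `b` contributes an F1 of 0 with the same weight as the frequent class `a`.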

In [None]:
result = "| "

if TEXT_COL == "text":
    result += "Plain Text"
else:
    result += "HTML"
    
result += " ("
    
if SUBSAMPLING:
    result += "[" + ", ".join(USED_LANG) + "]"
    if N_SAMPLES < train.shape[0]:
        result += f" {N_SAMPLES} samples"
    else:
        result += " all samples"
else:
    result += "all samples"
    
result += ") "

if USE_TOP_LABELS:
    result += f"(Top {TOP_N_LABELS} classes)"
        
result += f" | **{clf1_f1}** ({clf1_precision}) | **{clf2_f1}** ({clf2_precision}) |"
print(CLASS_COL)
print()
print(result)