## Experiments on R8 dataset

This notebook runs the proposed method on the R8 dataset, for which the original paper reported the following results:

![](figures/r8.png)

Note that the scores in the original paper are overly optimistic because of a bug in the authors' code repository, as described at [https://kenschutte.com/gzip-knn-paper/](https://kenschutte.com/gzip-knn-paper/).

In [1]:
import gzip
import os.path as op

import numpy as np
import pandas as pd

### Load dataset

Before running the code below, make sure to download the dataset from here: https://www.kaggle.com/datasets/weipengfei/ohr8r52

In [2]:
df_train = pd.read_csv("r8-train-stemmed.csv")
df_test = pd.read_csv("r8-test-stemmed.csv")

In [3]:
uniq = list(set(df_train["intent"].values))
labels = {name: idx for idx, name in enumerate(uniq)}
labels

{'money-fx': 0,
 'crude': 1,
 'interest': 2,
 'trade': 3,
 'earn': 4,
 'grain': 5,
 'ship': 6,
 'acq': 7}
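
The mapping above follows the iteration order of a Python `set`, which for strings can change between interpreter sessions. This may matter for the `np.argmax(np.bincount(...))` vote in the "Original" section, because that vote resolves ties in favor of the smallest label index. If a reproducible mapping is preferred, one option is to sort the class names first. The following is a minimal sketch, not the code used to produce the results in this notebook; `make_label_map` is a hypothetical helper name, and the indices it assigns differ from those shown above.

```python
def make_label_map(class_names):
    """Map each distinct class name to an integer index in sorted order."""
    return {name: idx for idx, name in enumerate(sorted(set(class_names)))}


# Toy example with three of the R8 class names (not the real training column)
example = make_label_map(["earn", "acq", "earn", "crude"])
print(example)  # {'acq': 0, 'crude': 1, 'earn': 2}
```

In the notebook, this would presumably be called as `make_label_map(df_train["intent"])`.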

In [4]:
df_train["label"] = df_train["intent"].apply(lambda x: labels[x])
df_test["label"] = df_test["intent"].apply(lambda x: labels[x])

## Original

This is a reimplementation of the pseudocode from the paper *"Low-Resource" Text Classification: A Parameter-Free Classification Method with Compressors* ([https://aclanthology.org/2023.findings-acl.426/](https://aclanthology.org/2023.findings-acl.426/)):


<img src="figures/pseudocode.png" width="500">


- Same code as [1_1_nn_plus_gzip_original.ipynb](1_1_nn_plus_gzip_original.ipynb)
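
The quantity computed in the inner loop of the code in this section is the normalized compression distance, $\mathrm{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$, where $C(\cdot)$ is the gzip-compressed length in bytes and $xy$ is the two texts joined by a space. As a minimal sketch, the same arithmetic can be factored into a standalone helper; the function name `ncd` and the toy strings are assumptions made for illustration and are not part of the notebook.

```python
import gzip


def ncd(text_a, text_b):
    """Normalized compression distance between two strings using gzip."""
    c_a = len(gzip.compress(text_a.encode()))
    c_b = len(gzip.compress(text_b.encode()))
    # Compressed length of both texts joined by a single space
    c_ab = len(gzip.compress(" ".join([text_a, text_b]).encode()))
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)


# Toy strings (made up for illustration, not taken from R8)
doc = "oil prices rose as crude supply fell " * 5
different = "the bank raised interest rates on deposits " * 5

# A text paired with itself compresses well, so the first distance
# should be the smaller of the two
print(ncd(doc, doc), ncd(doc, different))
```

Lower values indicate that the concatenation compresses almost as well as the larger of the two texts alone, which the method treats as a sign of similarity.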

In [6]:
from tqdm import tqdm

k = 2

predicted_classes = []

for row_test in tqdm(df_test.iterrows(), total=df_test.shape[0]):
    test_text = row_test[1]["text"]
    # Compressed length of the test text alone
    c_test_text = len(gzip.compress(test_text.encode()))
    distance_from_test_instance = []

    for row_train in df_train.iterrows():
        train_text = row_train[1]["text"]
        # Compressed length of the training text alone
        c_train_text = len(gzip.compress(train_text.encode()))

        # Compressed length of the two texts joined by a space
        train_plus_test = " ".join([test_text, train_text])
        c_train_plus_test = len(gzip.compress(train_plus_test.encode()))

        # Normalized compression distance (NCD)
        ncd = ( (c_train_plus_test - min(c_train_text, c_test_text))
                / max(c_test_text, c_train_text) )
        distance_from_test_instance.append(ncd)

    # Training indices ordered from nearest to farthest
    sorted_idx = np.argsort(np.array(distance_from_test_instance))

    #top_k_class = list(df_train.iloc[sorted_idx[:k]]["label"].values)
    #predicted_class = max(set(top_k_class), key=top_k_class.count)
    top_k_class = df_train.iloc[sorted_idx[:k]]["label"].values
    # Majority vote; on a tie, np.argmax returns the smallest label index
    predicted_class = np.argmax(np.bincount(top_k_class))

    predicted_classes.append(predicted_class)

print("Accuracy:", np.mean(np.array(predicted_classes) == df_test["label"].values))

100%|███████████████████████████████████████| 2189/2189 [09:44<00:00,  3.74it/s]

Accuracy: 0.8889904065783463





## With Tie-Breaking Fix

This version uses improved tie-breaking with `Counter`, as described in [0_some-concepts.ipynb](0_some-concepts.ipynb).

- Same code as [1_2_nn_plus_gzip_fix-tie-breaking.ipynb](1_2_nn_plus_gzip_fix-tie-breaking.ipynb)
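
To illustrate how the two voting rules differ, the sketch below applies both to a tie with `k = 2`. The label values are made up for illustration. It relies on two documented behaviors: `np.argmax` returns the first maximum, and `Counter.most_common()` lists elements with equal counts in the order they were first encountered.

```python
from collections import Counter

import numpy as np

# Hypothetical labels of the k=2 nearest neighbors, nearest first
top_k_class = np.array([3, 1])

# np.bincount counts occurrences per label index; np.argmax returns the
# first maximum, so a tie goes to the smallest label index
vote_argmax = int(np.argmax(np.bincount(top_k_class)))

# Counter.most_common() keeps first-encountered order among equal counts,
# so a tie goes to the label of the nearest neighbor
vote_counter = int(Counter(top_k_class).most_common()[0][0])

print(vote_argmax, vote_counter)  # 1 3
```

Because the top-k labels are ordered by distance, the `Counter` vote falls back to the single nearest neighbor whenever the two neighbors disagree.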

In [5]:
from tqdm import tqdm
from collections import Counter

k = 2

predicted_classes = []

for row_test in tqdm(df_test.iterrows(), total=df_test.shape[0]):
    test_text = row_test[1]["text"]
    # Compressed length of the test text alone
    c_test_text = len(gzip.compress(test_text.encode()))
    distance_from_test_instance = []

    for row_train in df_train.iterrows():
        train_text = row_train[1]["text"]
        # Compressed length of the training text alone
        c_train_text = len(gzip.compress(train_text.encode()))

        # Compressed length of the two texts joined by a space
        train_plus_test = " ".join([test_text, train_text])
        c_train_plus_test = len(gzip.compress(train_plus_test.encode()))

        # Normalized compression distance (NCD)
        ncd = ( (c_train_plus_test - min(c_train_text, c_test_text))
                / max(c_test_text, c_train_text) )
        distance_from_test_instance.append(ncd)

    # Training indices ordered from nearest to farthest
    sorted_idx = np.argsort(np.array(distance_from_test_instance))
    top_k_class = np.array(df_train["label"])[sorted_idx[:k]]
    # Majority vote; on a tie, the nearest neighbor's label wins
    predicted_class = Counter(top_k_class).most_common()[0][0]

    predicted_classes.append(predicted_class)

print("Accuracy:", np.mean(np.array(predicted_classes) == df_test["label"].values))

100%|███████████████████████████████████████| 2189/2189 [09:49<00:00,  3.71it/s]

Accuracy: 0.912745545911375



