# Working with entropy-based weighting and the Rocchio classifier
Author: Eric S. Tellez -- [donsadit@gmail.com](mailto:donsadit@gmail.com)

## Abstract

Class-distributional representations are another interesting way to represent text in multiclass problems: each token is represented by its latent distribution. Here we test a simpler scheme based on a single number per token, a weight derived from the entropy of the token's empirical distribution. This scheme is useful for complex tasks, and its performance often surpasses that of more complex schemes. This tutorial also shows how to use the Rocchio classifier implemented in `TextSearch.jl`, a simple classifier designed for multiclass linear problems that can be used whenever other schemes are not available.


## Preparation
The first step consists of initializing our environment and downloading our data.


In [1]:
using Pkg
pkg"activate ."
# uncomment to install the required packages
#pkg"add https://github.com/sadit/SimilaritySearch.jl https://github.com/sadit/TextSearch.jl https://github.com/sadit/KernelMethods.jl LIBLINEAR Random StatsBase CSV DataFrames StatsPlots"
using SimilaritySearch, TextSearch, LIBLINEAR, Random, StatsBase, KernelMethods, CSV, DataFrames, StatsPlots, SparseArrays

# fetching data
url = "http://ingeotec.mx/~sadit/emotions.csv"
!isfile(basename(url)) && download(url, basename(url))
# CSV.File works across CSV.jl versions; newer releases reject CSV.read without a sink
db = DataFrame(CSV.File(basename(url)))

Activating environment at `~/Research/TextSearch.jl/examples/Project.toml`


| | klass | text |
|---|---|---|
| | String | String |
| 1 | 😰 | @DanuFSanz Me gustaría amika pero tengo una panza enorme _emo_ |
| 2 | 😥 | Debería de estar en mi casa, no aquí sufriendo porque no hay luz _emo_ |
| 3 | 😊 | @PowerMusicRadio _emo_ Voto X #HayAmores de @julionalvarez para que ingrese en él #TopPower con @yuyu_perez @NoticiasJulion @ViejonasJAySNB |
| 4 | ♡ | @CelopanYT Muy buenos días bae _emo_ |
| 5 | 💔 | @IsamarPortilla Gracias no era necesario, después de que me rompes el corazón _emo_ |
| 6 | 🙂 | Buena idea venir a mis trámites donde hubiera mi gym pa desestresarme! _emo_ (at @SmartFit_mex) _url_ _url_ |
| 7 | 😋 | Lasaña!!!. _emo_ _emo_ _emo_ @ Postodoro Xalapa Centro _url_ |
| 8 | 😊 | Hay que aprender a vivir nuestra vida y dejar a los demás vivir la suya _emo_ |
| 9 | ♡ | Tu por siempre _emo_ _url_ |
| 10 | 😪 | A penas tuve un problema con una R en una guardia porque la muy idiota creyó que andaba con el residente que seguramente le gustaba. Y cuando descubrió que literal, SÓLÓ ERA MI AMIGO!!... Ya me había hecho sacarle todos sus pendientes la muy tarada!! 🤦🏻‍♀️ _emo_ |


In [2]:
display("text/markdown", "We take a small sample for training and validation collections to estimate the performance of our predictions; here we use a holdout scheme for cross-validation.")
        
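function sample_and_partition(db, n, p)
    # number of training examples; the remaining n - m rows are held out
    m = round(Int, n * p)
    # sample without replacement (StatsBase samples with replacement by default),
    # so that the training and test partitions do not share rows
    G = db[sample(1:size(db, 1), n, replace=false), :]
    corpus_train = G.text[1:m]
    corpus_test = G.text[m+1:end]
    labels_train = G.klass[1:m]
    labels_test = G.klass[m+1:end]
    # the label encoder is fitted on the training labels only
    le = fit(LabelEncoder, labels_train)
    labels_train = KernelMethods.transform.(le, labels_train)
    labels_test = KernelMethods.transform.(le, labels_test)

    (corpus_train=corpus_train, labels_train=labels_train, corpus_test=corpus_test, labels_test=labels_test, le=le)
end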


We take a small sample for training and validation collections to estimate the performance of our predictions; here we use a holdout scheme for cross-validation.

sample_and_partition (generic function with 1 method)

## The experiments

The following functions run and evaluate the models. The idea is to compare `EntModel` against `VectorModel` (TF-IDF vectors), and Rocchio against a linear SVM.

In [3]:
function prepare_model(db, kind, p)
    part = sample_and_partition(db, size(db, 1), p)
    config = TextConfig(qlist=[5, 7], nlist=[1])
    if kind == VectorModel
        model = fit(VectorModel, config, part.corpus_train)
    else
        model = fit(EntModel, config, part.corpus_train, part.labels_train, smooth=0.0)
    end

    @info "vocabulary: $(length(model.tokens))"
    Xtrain = [vectorize(model, text) for text in part.corpus_train]
    Xtest = [vectorize(model, text) for text in part.corpus_test]
    
    (part=part, model=model, Xtrain=Xtrain, Xtest=Xtest)
end

prepare_model (generic function with 1 method)
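
To make the `EntModel` option in `prepare_model` more concrete, the following is a minimal sketch of one common way to turn a token's per-class counts into a single entropy-based weight. It is an illustration under stated assumptions, not the `TextSearch.jl` implementation: the function name `entropy_weights`, its arguments, and the normalization `1 - H / log2(nclasses)` are assumptions, and the library may handle smoothing and normalization differently. It also assumes at least two classes and labels encoded as integers `1:nclasses`.

```julia
# Sketch only: `entropy_weights` is a hypothetical helper, not part of TextSearch.jl.
# Each document is a vector of tokens; labels are integers in 1:nclasses (nclasses >= 2).
function entropy_weights(corpus::Vector{Vector{String}}, labels::Vector{Int}, nclasses::Int; smooth::Float64=0.0)
    # counts[token][c] = (smoothed) number of occurrences of `token` in class c
    counts = Dict{String,Vector{Float64}}()
    for (tokens, c) in zip(corpus, labels)
        for t in tokens
            v = get!(() -> fill(smooth, nclasses), counts, t)
            v[c] += 1.0
        end
    end

    maxent = log2(nclasses)  # entropy of the uniform distribution over classes
    weights = Dict{String,Float64}()
    for (t, v) in counts
        total = sum(v)
        h = 0.0
        for x in v
            x > 0 || continue   # 0 * log(0) is taken as 0
            p = x / total
            h -= p * log2(p)
        end
        # assumed normalization: 1 for tokens seen in a single class,
        # 0 for tokens spread uniformly over all classes
        weights[t] = 1.0 - h / maxent
    end
    weights
end

toy_corpus = [["feliz", "dia"], ["triste", "dia"], ["feliz", "feliz"]]
toy_labels = [1, 2, 1]
w = entropy_weights(toy_corpus, toy_labels, 2)
```

In this toy example `"feliz"` and `"triste"` each occur in a single class and receive the maximum weight, while `"dia"` occurs equally in both classes and receives a weight of zero, which matches the intuition that low-entropy tokens are better class descriptors.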

In [6]:
function run_rocchio(M)
    classifier = fit(Rocchio, M.Xtrain, M.part.labels_train)
    ypred = predict.(classifier, M.Xtest)
    scores(ypred, M.part.labels_test)
end

function run_lsvm(M)
    # liblinear needs a sparse Matrix in CSC format
    Xtrain = sparse(M.Xtrain, M.model.m)
    Xtest = sparse(M.Xtest, M.model.m)
    
    lsvm = linear_train(M.part.labels_train, Xtrain, C=0.1)
    predictions, decision_values = linear_predict(lsvm, Xtest)
    scores(predictions, M.part.labels_test)
end

# one row per (classifier, model) pair; `time` holds a Dates period
perf = DataFrame(accuracy=Float64[], macro_recall=Float64[], macro_f1=Float64[], name=String[], time=[])
using Dates
for kind in [VectorModel, EntModel]
    M = prepare_model(db, kind, 0.9)
    start = now()
    p = run_rocchio(M)
    p[:time] = now() - start

    p[:name] = "Rocchio with $kind"
    push!(perf, p, columns=:intersect)
    p = run_lsvm(M)
    # `start` is not reset here, so this time also includes the Rocchio run above
    p[:time] = now() - start
    p[:name] = "Linear SVM with $kind"
    push!(perf, p, columns=:intersect)
end

perf

fitting VectorModel with 11270 items
xxxxxxxxxxxfinished VectorModel: 11270 processed items, voc-size: 341836
┌ Info: vocabulary: 341836
└ @ Main In[3]:10
fitting Rocchio classifier with 11270 items; and 16 classes
***********feeding DistModel with 11270 items, classes: 16
***********finished DistModel: 11270 processed items
┌ Info: vocabulary: 339982
└ @ Main In[3]:10
fitting Rocchio classifier with 11270 items; and 16 classes
***********

| | accuracy | macro_recall | macro_f1 | name | time |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | String | Any |
| 1 | 0.519169 | 0.545443 | 0.512344 | Rocchio with VectorModel | 422 milliseconds |
| 2 | 0.653355 | 0.652828 | 0.649962 | Linear SVM with VectorModel | 2204 milliseconds |
| 3 | 0.645367 | 0.662801 | 0.646621 | Rocchio with EntModel | 412 milliseconds |
| 4 | 0.686901 | 0.691691 | 0.686189 | Linear SVM with EntModel | 2163 milliseconds |
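
For readers unfamiliar with the columns of this table, the sketch below shows how accuracy, macro-averaged recall, and macro-averaged F1 are usually computed from gold and predicted labels. The function `macro_scores` is a hypothetical helper written for illustration; the `scores` function from `KernelMethods.jl` used above may differ in details such as how classes without predictions are handled.

```julia
# Sketch only: `macro_scores` is a hypothetical helper, not the KernelMethods.jl API.
function macro_scores(gold::Vector{Int}, pred::Vector{Int})
    classes = sort(unique(gold))
    recalls = Float64[]
    f1s = Float64[]
    for c in classes
        tp = count(i -> gold[i] == c && pred[i] == c, eachindex(gold))
        fn = count(i -> gold[i] == c && pred[i] != c, eachindex(gold))
        fp = count(i -> gold[i] != c && pred[i] == c, eachindex(gold))
        recall = tp / (tp + fn)                      # tp + fn > 0 since c occurs in gold
        prec = tp + fp == 0 ? 0.0 : tp / (tp + fp)   # assumed convention: 0 when c is never predicted
        f1 = prec + recall == 0 ? 0.0 : 2 * prec * recall / (prec + recall)
        push!(recalls, recall)
        push!(f1s, f1)
    end
    accuracy = count(gold .== pred) / length(gold)
    # macro averages weigh every class equally, regardless of its frequency
    (accuracy=accuracy, macro_recall=sum(recalls) / length(recalls), macro_f1=sum(f1s) / length(f1s))
end

s = macro_scores([1, 1, 2, 2], [1, 2, 2, 2])
```

Because macro averages give every class the same weight, they are a reasonable complement to accuracy in a 16-class problem where some emoji labels may be much rarer than others.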


## Conclusions

Rocchio can replace LIBLINEAR.jl whenever you want a pure-Julia solution: if Julia works, Rocchio will work.
Another major use case is when the training time cannot be ignored, i.e., many training examples and only a few items to classify; the same may apply to conversion (sparse matrix construction) times. Finally, notice that Rocchio works surprisingly well with `EntModel` compared with `VectorModel`: in the run above its accuracy rises from 0.519 to 0.645. This is because `EntModel` measures how well a token describes a class, while `TfidfModel` (the weighting scheme behind `VectorModel`) is designed to describe token importance in a flat collection.
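
The low training cost mentioned above follows from how a Rocchio (nearest-centroid) classifier works: training is a single pass that accumulates one centroid per class, and prediction picks the centroid most similar to the query vector. The following is a minimal dense-matrix sketch of that idea. The names `RocchioSketch`, `fit_rocchio`, and `predict_rocchio` are hypothetical, and the `Rocchio` type in `TextSearch.jl` operates on sparse vectors and may differ in its similarity function and other details.

```julia
using LinearAlgebra

# Sketch only: hypothetical names, not the TextSearch.jl implementation.
struct RocchioSketch
    centroids::Matrix{Float64}  # one unit-norm column per class
end

# X holds one example per column; y holds labels in 1:nclasses
function fit_rocchio(X::Matrix{Float64}, y::Vector{Int}, nclasses::Int)
    C = zeros(size(X, 1), nclasses)
    for (j, c) in enumerate(y)
        C[:, c] .+= X[:, j]        # accumulate the examples of each class
    end
    for c in 1:nclasses
        n = norm(C[:, c])
        n > 0 && (C[:, c] ./= n)   # normalizing the sum gives the centroid's direction
    end
    RocchioSketch(C)
end

# with unit-norm centroids, the largest dot product is the largest cosine similarity
predict_rocchio(model::RocchioSketch, x::Vector{Float64}) = argmax(model.centroids' * x)

X = [1.0 0.9 0.0 0.1;
     0.0 0.1 1.0 0.9]
y = [1, 1, 2, 2]
rocchio = fit_rocchio(X, y, 2)
pred_a = predict_rocchio(rocchio, [0.8, 0.2])
pred_b = predict_rocchio(rocchio, [0.2, 0.8])
```

Since the model is only one vector per class, a good weighting scheme matters more here than for a linear SVM, which is consistent with the larger gain Rocchio obtains from `EntModel` in the table above.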