# Random search model selection for Emoji-predictors for Spanish tweets
Author: Eric S. Tellez -- [donsadit@gmail.com](mailto:donsadit@gmail.com)

## Abstract
This script shows how to perform a random search over a space of configurations to select a model that predicts the emoji associated with a given short text.
The text model uses entropy-based weighting; the size of the model can be reduced with pruning techniques.
This example uses a Rocchio classifier.
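
As a rough illustration of the random search idea, the following sketch samples configurations uniformly from a small space and keeps the best-scoring one. It is not the procedure implemented in TextSearch.jl or KernelMethods.jl: the `space` fields, `sample_config`, `random_search`, and `toy_score` are hypothetical names, and `toy_score` is only a placeholder for a real validation score (for example, the accuracy of a classifier trained with that configuration).

```julia
using Random

# Hypothetical configuration space; names and values are illustrative only.
space = (
    qlist = [[3], [4], [4, 6], [3, 5, 7]],   # candidate character q-gram sizes
    nlist = [Int[], [1], [1, 2]],            # candidate word n-gram sizes
    prune = [0.0, 0.1, 0.2],                 # candidate pruning levels
)

# Draw one configuration by sampling every parameter independently and uniformly.
sample_config(rng, space) = map(choices -> rand(rng, choices), space)

# Evaluate `n` random configurations and keep the one with the highest score.
function random_search(score, space; n=16, rng=Random.default_rng())
    best_score, best_config = -Inf, nothing
    for _ in 1:n
        config = sample_config(rng, space)
        s = score(config)
        if s > best_score
            best_score, best_config = s, config
        end
    end
    (score=best_score, config=best_config)
end

# Placeholder score (an assumption): stands in for a cross-validated measure.
toy_score(config) = length(config.qlist) + length(config.nlist) - config.prune

best = random_search(toy_score, space; n=32, rng=MersenneTwister(0))
println(best)
```

Because each configuration is evaluated independently, the loop is easy to parallelize, and the number of evaluations `n` directly controls the cost of the search.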

## Example

The first step is to initialize the environment and fetch the data:

In [1]:
using Pkg
pkg"activate ."
# uncomment to install the required packages
# pkg"add https://github.com/sadit/SimilaritySearch.jl https://github.com/sadit/TextSearch.jl https://github.com/sadit/KernelMethods.jl Random StatsBase CSV DataFrames StatsPlots IterTools"
using SimilaritySearch, TextSearch, Random, StatsBase, KernelMethods, CSV, DataFrames, StatsPlots, SparseArrays, IterTools

# fetching data
url = "http://ingeotec.mx/~sadit/emotions.csv"
!isfile(basename(url)) && download(url, basename(url))
db = DataFrame(CSV.read(basename(url)))

Activating environment at `~/Research/TextSearch.jl/tutorials/Project.toml`


| Row | klass (String) | text (String) |
|---|---|---|
| 1 | 😰 | @DanuFSanz Me gustaría amika pero tengo una panza enorme _emo_ |
| 2 | 😥 | Debería de estar en mi casa, no aquí sufriendo porque no hay luz _emo_ |
| 3 | 😊 | @PowerMusicRadio _emo_ Voto X #HayAmores de @julionalvarez para que ingrese en él #TopPower con @yuyu_perez @NoticiasJulion @ViejonasJAySNB |
| 4 | ♡ | @CelopanYT Muy buenos días bae _emo_ |
| 5 | 💔 | @IsamarPortilla Gracias no era necesario, después de que me rompes el corazón _emo_ |
| 6 | 🙂 | Buena idea venir a mis trámites donde hubiera mi gym pa desestresarme! _emo_ (at @SmartFit_mex) _url_ _url_ |
| 7 | 😋 | Lasaña!!!. _emo_ _emo_ _emo_ @ Postodoro Xalapa Centro _url_ |
| 8 | 😊 | Hay que aprender a vivir nuestra vida y dejar a los demás vivir la suya _emo_ |
| 9 | ♡ | Tu por siempre _emo_ _url_ |
| 10 | 😪 | A penas tuve un problema con una R en una guardia porque la muy idiota creyó que andaba con el residente que seguramente le gustaba. Y cuando descubrió que literal, SÓLÓ ERA MI AMIGO!!... Ya me había hecho sacarle todos sus pendientes la muy tarada!! 🤦🏻‍♀️ _emo_ |


## Partitioning the data


In [27]:
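# Encode the labels and vectorize the texts; returns everything as a named tuple
function create_model(db)
    le = fit(LabelEncoder, db.klass)
    labels = KernelMethods.transform.(le, db.klass)
    # character q-grams of sizes 4 and 6, plus single words
    config = TextConfig(qlist=[4, 6], nlist=[1])
    model = fit(VectorModel, config, db.text)
    # one TF-IDF weighted vector per tweet
    X = [vectorize(model, TfidfModel, text) for text in db.text]
    (le=le, labels=labels, model=model, X=X)
end

b = create_model(db)
# select 256 centers under the angle distance
@time centers, epsilon = kcenters(angle_distance, b.X, 256)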

computing fartest point 1 of 256, epsilon: Inf, imax: 12362
computing fartest point 2 of 256, epsilon: 1.5707111879664133, imax: 7161
computing fartest point 3 of 256, epsilon: 1.5706330461126883, imax: 902
computing fartest point 4 of 256, epsilon: 1.5705911918880269, imax: 9074
computing fartest point 5 of 256, epsilon: 1.5705345651803486, imax: 7068
computing fartest point 6 of 256, epsilon: 1.5704574231292732, imax: 6559
computing fartest point 7 of 256, epsilon: 1.5704196363712994, imax: 1913
computing fartest point 8 of 256, epsilon: 1.570392519757142, imax: 11639
computing fartest point 9 of 256, epsilon: 1.5703651312505202, imax: 1088
computing fartest point 10 of 256, epsilon: 1.5703422323703797, imax: 1763
computing fartest point 11 of 256, epsilon: 1.5702946917077876, imax: 6757
computing fartest point 12 of 256, epsilon: 1.5702188581011196, imax: 972
computing fartest point 13 of 256, epsilon: 1.5701163751343568, imax: 8540
computing fartest point 14 of 256, epsilon: 1.5699

 12.399611 seconds (29.63 k allocations: 1.732 MiB)


([12362, 7161, 902, 9074, 7068, 6559, 1913, 11639, 1088, 1763  …  1871, 3091, 5344, 5472, 2599, 6949, 208, 11335, 630, 2556], 1.5527974407956515)
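
The log above suggests a farthest-first traversal: each step selects the point farthest from the centers chosen so far and reports the current largest distance. The values close to 1.5707 are near π/2, the angle between orthogonal vectors, which is plausible for sparse, high-dimensional text vectors that share few terms. The sketch below illustrates the general technique on dense toy vectors; it is not the `kcenters` implementation, and `angle_dist` and `farthest_first` are hypothetical names introduced here.

```julia
using LinearAlgebra

# Angle between two vectors in radians; clamp guards against rounding errors.
angle_dist(u, v) = acos(clamp(dot(u, v) / (norm(u) * norm(v)), -1.0, 1.0))

# Farthest-first traversal: each new center is the point farthest from
# all centers selected so far. `start` is the index of the first center.
function farthest_first(dist, X, k; start=1)
    n = length(X)
    mindist = fill(Inf, n)      # distance from each point to its nearest center
    centers = Int[]
    imax = start
    for _ in 1:min(k, n)
        push!(centers, imax)
        for i in 1:n
            mindist[i] = min(mindist[i], dist(X[i], X[imax]))
        end
        imax = argmax(mindist)  # next candidate: the worst-covered point
    end
    centers, maximum(mindist)   # selected indices and final covering radius
end

# Toy data (an assumption for illustration): five 2-dimensional vectors.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
centers, radius = farthest_first(angle_dist, X, 3)
println(centers, " ", radius)
```

Each step scans all points once, so selecting `k` centers from `n` points costs about `k * n` distance evaluations.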

In [28]:
@time R = fftclustering(angle_distance, b.X, 256)

computing fartest point 1, dmax: Inf, imax: 6265, stop: #97
computing fartest point 2, dmax: 1.570696562603413, imax: 2102, stop: #97
computing fartest point 3, dmax: 1.5706642581413228, imax: 2368, stop: #97
computing fartest point 4, dmax: 1.5706216274082843, imax: 9074, stop: #97
computing fartest point 5, dmax: 1.5705596452476343, imax: 6025, stop: #97
computing fartest point 6, dmax: 1.5705021172963087, imax: 211, stop: #97
computing fartest point 7, dmax: 1.5704430261251963, imax: 2212, stop: #97
computing fartest point 8, dmax: 1.5703737338888617, imax: 11375, stop: #97
computing fartest point 9, dmax: 1.5703283083634763, imax: 2091, stop: #97
computing fartest point 10, dmax: 1.570272873064207, imax: 1353, stop: #97
computing fartest point 11, dmax: 1.5702389743085035, imax: 9087, stop: #97
computing fartest point 12, dmax: 1.5700032225926264, imax: 2417, stop: #97
computing fartest point 13, dmax: 1.5699503785527236, imax: 8971, stop: #97
computing fartest point 14, dmax: 1.56

 11.218806 seconds (66.03 k allocations: 4.057 MiB)


(NN = KnnResult{Int64}[KnnResult{Int64}(1, Item{Int64}[Item{Int64}(3332, 1.5359119466169147)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(4586, 1.5205805463196982)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(2601, 1.544811967760972)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(2362, 1.498977484431566)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(1406, 1.5153264534951683)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(8893, 1.4992366611199053)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(3986, 1.5327156008058773)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(1119, 1.4841330349296498)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(1072, 1.5064014972424489)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(9702, 1.5118774457288886)])  …  KnnResult{Int64}(1, Item{Int64}[Item{Int64}(9093, 1.2885156932788147)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(8755, 1.4993664791971053)]), KnnResult{Int64}(1, Item{Int64}[Item{Int64}(3937, 1.4778351612463423)]), KnnResult{Int64}(1, Item{In

In [26]:
R.NN

12522-element Array{KnnResult{Int64},1}:
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(598, 1.5323405779428851)])  
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(10858, 1.5268683983654376)])
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(8770, 1.5435320466061049)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(6392, 1.472948601141364)])  
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(8524, 1.5226563224844196)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(9987, 1.5184016267728833)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(5224, 1.5418699863902208)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(2438, 1.5164667545606327)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(9510, 1.4684981184194392)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(2330, 1.5376907307733527)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(6830, 1.3720130172210943)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(10858, 1.517515689299309)]) 
 KnnResult{Int64}(1, Item{Int64}[Item{Int64}(8442, 1.4839485591066095)]