# Predicting emojis for Spanish tweets
Author: Eric S. Tellez -- [donsadit@gmail.com](mailto:donsadit@gmail.com)

## Abstract
This script shows how to create a text model and a classifier that predicts the related emoji for a given short text.
The text model can be a classical TF-IDF model or an entropy-based weighting; the size of the model can be reduced with pruning techniques.
This example uses a linear SVM (LIBLINEAR.jl).
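
As a rough illustration of the two weighting schemes and of pruning, the following minimal sketch works on a toy, pre-tokenized corpus in plain Julia. It does not use TextSearch.jl; the function names (`idf_weights`, `entropy_weights`, `prune_top`), the toy data, and the exact formulas (inverse document frequency as `log(N / df)`, entropy weight as `1 - H / log(k)` over smoothed class counts) are assumptions chosen for illustration, and the library's `TfidfModel` and `EntModel` may differ in detail.

```julia
# Toy corpus: each document is already tokenized; labels are class ids in 1:k.
docs = [["buenos", "dias", "sol"],
        ["buenas", "noches", "luna"],
        ["buenos", "dias", "de", "cafe"],
        ["noches", "de", "luna"]]
labels = [1, 2, 1, 2]
vocab = sort(unique(vcat(docs...)))

# Inverse document frequency: tokens found in few documents get larger weights.
function idf_weights(docs, vocab)
    N = length(docs)
    Dict(t => log(N / count(d -> t in d, docs)) for t in vocab)
end

# Entropy weighting (illustrative formulation): a token whose occurrences are
# concentrated in one class has low entropy and therefore a weight close to 1;
# a token spread evenly over all classes gets a weight close to 0.
function entropy_weights(docs, labels, vocab; smooth=1.0)
    k = maximum(labels)
    w = Dict{String,Float64}()
    for t in vocab
        counts = fill(float(smooth), k)      # smoothed per-class counts
        for (d, y) in zip(docs, labels)
            counts[y] += count(==(t), d)
        end
        p = counts ./ sum(counts)
        H = -sum(q * log(q) for q in p)      # every q > 0 because smooth > 0
        w[t] = 1 - H / log(k)
    end
    w
end

# Pruning: keep only the given fraction of tokens with the largest weights.
function prune_top(weights, ratio)
    ranked = sort(collect(weights), by=last, rev=true)
    keep = ceil(Int, ratio * length(ranked))
    Dict(ranked[1:keep])
end

idf = idf_weights(docs, vocab)
ent = entropy_weights(docs, labels, vocab)
pruned = prune_top(ent, 0.5)
```

In this toy data, `"de"` appears once in each class, so its entropy weight is close to zero and pruning drops it, while class-specific tokens such as `"buenos"` survive.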

## Example


The first step is to initialize the environment and fetch the data.

In [1]:


using Pkg
pkg"activate ."
# install the required packages; comment out the next line once they are installed
pkg"add https://github.com/sadit/SimilaritySearch.jl https://github.com/sadit/TextSearch.jl https://github.com/sadit/KernelMethods.jl LIBLINEAR Random StatsBase"
using SimilaritySearch, TextSearch, LIBLINEAR, Random, StatsBase, KernelMethods


# fetching data
url = "http://ingeotec.mx/~sadit/emospace50k.json.gz"
!isfile(basename(url)) && download(url, basename(url))
db = loadtweets(basename(url))
n = length(db)

[32m[1mActivating[22m[39m environment at `~/Research/TextSearch.jl/examples/Project.toml`
[32m[1m  Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m  Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[?25l[2K[?25h[32m[1m  Updating[22m[39m git-repo `https://github.com/sadit/SimilaritySearch.jl`
[?25l[2K[?25h[32m[1m  Updating[22m[39m git-repo `https://github.com/sadit/TextSearch.jl`
[2K[?25h[32m[1m  Updating[22m[39m git-repo `https://github.com/sadit/TextSearch.jl`
[?25l[2K[?25h[32m[1m  Updating[22m[39m git-repo `https://github.com/sadit/KernelMethods.jl`
[?25l[2K[?25h[32m[1m  Updating[22m[39m git-repo `https://github.com/sadit/KernelMethods.jl`
[?25l[2K[?25h[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `~/Research/TextSearch.jl/examples/Project.toml`
 [90m [7f6f6c8a][39m[93m ~ TextSearch v0.3.0 #master (https://github.com/sadit/TextSearch.jl)[39m
[32m[1m  Up

┌ Info: Recompiling stale cache file /Users/sadit/.julia/compiled/v1.2/TextSearch/mqUEb.ji for TextSearch [7f6f6c8a-3b03-11e9-223d-e7d88259bd6c]
└ @ Base loading.jl:1240


50000

## Partitioning the data

To estimate the performance of our predictions, we split the dataset into two halves: one for training and one for testing.
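
A minimal sketch of such a hold-out split, assuming only the `Random` standard library: the helper name `holdout_indices` and the fixed seed are illustrative additions (the notebook's own code shuffles without a seed), used here so that the split is reproducible across runs.

```julia
using Random

# Hypothetical helper: returns two disjoint index vectors that together cover 1:n.
function holdout_indices(n; ratio=0.5, rng=MersenneTwister(0))
    perm = shuffle(rng, 1:n)           # random permutation of the indices
    ntrain = floor(Int, ratio * n)     # size of the training part
    perm[1:ntrain], perm[ntrain+1:end]
end

train, test = holdout_indices(10)
```

Indexing the corpus and the labels with the same pair of index vectors keeps texts and classes aligned in both parts.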

In [5]:

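# Entropy-weighted vectors. `config` and the index sets `P1` (train) and `P2` (test)
# are passed explicitly because they are not defined inside this function.
function entropy_vectors(config, corpus, labels, P1, P2)
    le = fit(LabelEncoder, labels)
    # the entropy model is fitted on the training part only
    model = fit(EntModel, config, corpus[P1], KernelMethods.transform.(le, labels[P1]), smooth=9)
    # prune the model, keeping a 0.2 fraction of its tokens
    model = prune_select_top(model, 0.2)
    @info "number-of-tokens:" length(model.tokens)

    X = [vectorize(model, EntModel, text) for text in corpus]
    X[P1], X[P2], labels[P1], labels[P2]
end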

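function partition(db)
    # shuffle the indices and split them into two halves: P1 (train) and P2 (test)
    G = shuffle(1:length(db))
    P1 = G[1:div(length(G), 2)]
    P2 = G[div(length(G), 2)+1:end]

    # missing fields default to the empty string
    corpus = get.(db, "text", "")
    labels = get.(db, "klass", "")
    (corpus_train=corpus[P1], labels_train=labels[P1], corpus_test=corpus[P2], labels_test=labels[P2])
end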

function main_tfidf(db)
    part = partition(db)
    # TextConfig specifies the way the text will be processed;
    # note that emoticons are specially handled to remove them from the text
    config = TextConfig(qlist=[3, 5], nlist=[], group_emo=true)
    model_ = fit(VectorModel, config, part.corpus_train)
    for p in [1.0, 0.9, 0.7, 0.5, 0.3, 0.1]
        model = prune_select_top(model_, p, FreqModel)
        Xtrain = [vectorize(model, TfidfModel, text) for text in part.corpus_train]
        Xtest = [vectorize(model, TfidfModel, text) for text in part.corpus_test]

        # Alternative: entropy-based vectors from `entropy_vectors` (defined above) could replace Xtrain and Xtest here.
        classifier = linear_train(part.labels_train, hcat(Xtrain...), C=0.1)
        predictions, decision_values = linear_predict(classifier, hcat(Xtest...))
        accuracy = mean(part.labels_test .== predictions)
        display(p => accuracy)
    end
end


@show main_tfidf(db)

fitting VectorModel with 25000 items
xxxxxxxxxxxxxxxxxxxxxxxxxfinished VectorModel: 25000 processed items, voc-size: 230048


MethodError: MethodError: Cannot `convert` an object of type Int64 to an object of type TextSearch.IdFreq
Closest candidates are:
  convert(::Type{T}, !Matched::T) where T at essentials.jl:167
  TextSearch.IdFreq(::Int64, !Matched::Int64) at /Users/sadit/.julia/packages/TextSearch/Ao92B/src/basicmodels.jl:16
  TextSearch.IdFreq(::Any, !Matched::Any) at /Users/sadit/.julia/packages/TextSearch/Ao92B/src/basicmodels.jl:16