# Capstone project (2022)
Author: Eric S. Tellez <eric.tellez@infotec.mx> <br/>

Throughout the Information Retrieval course we covered different ways to model text, taking into account either its vocabulary or its semantics. In particular, the vocabulary was manipulated in several ways to amplify or dampen effects that impact answer quality or response time. We used inverted indexes for sparse representations and metric search for dense vectors, which gives different options for meeting the varying requirements of information systems.

Beyond search, we also covered organizing documents into groups (_clustering_) and visualizing them with non-linear dimensionality reduction techniques.

These topics are expected to be useful in the work of a data scientist, both in exploratory data analysis and in building intelligent systems.

# Activities

- Build a mini information system that solves a problem you are familiar with
- Document collection
- Document modeling
- Indexing, search, and presentation of documents
- Data analysis and visualization
- Report


# Example: Packages in the Julia language's main registry

In [1]:
using Pkg
Pkg.activate(".")

[32m[1m  Activating[22m[39m project at `~/Cursos/IR-2024/Unidades`


In [2]:
Pkg.status()

[32m[1mStatus[22m[39m `~/Cursos/IR-2024/Unidades/Project.toml`
  [90m[87b137f8] [39mBagOfWords v0.3.3
  [90m[336ed68f] [39mCSV v0.10.12
  [90m[aaaa29a8] [39mClustering v0.15.6
  [90m[944b1d66] [39mCodecZlib v0.7.3
  [90m[a80b9123] [39mCommonMark v0.8.12
  [90m[a93c6f00] [39mDataFrames v1.6.1
  [90m[c5bfea45] [39mEmbeddings v0.4.5 `~/.julia/dev/Embeddings`
  [90m[ac1192a8] [39mHypertextLiteral v0.9.5
  [90m[7073ff75] [39mIJulia v1.24.2
  [90m[c601a237] [39mInteract v0.10.5
  [90m[b20bd276] [39mInvertedFiles v0.7.1
  [90m[033835bb] [39mJLD2 v0.4.41
  [90m[682c06a0] [39mJSON v0.21.4
  [90m[4dca28ae] [39mKNearestCenters v0.7.7
  [90m[8ef0a80b] [39mLanguages v0.4.6
  [90m[eb30cadb] [39mMLDatasets v0.7.14
  [90m[06eb3307] [39mManifoldLearning v0.9.0
[33m⌅[39m [90m[6f286f6a] [39mMultivariateStats v0.9.0
  [90m[91a5bcdd] [39mPlots v1.39.0
  [90m[438e738f] [39mPyCall v1.96.4
  [90m[ca7ab67e] [39mSimSearchManifoldLearning v0.2.10
  [90m[053f045d] 

In [3]:
using SimilaritySearch, Interact, SimSearchManifoldLearning, TextSearch, StatsBase, Clustering, ZipFile, CommonMark, JSON, Base64, Plots, LinearAlgebra, HypertextLiteral
include("read_datasets.jl")

get_julia_packages (generic function with 1 method)

In [4]:
function sections(readmetext, maxsections=1)
    S = []
    for p in eachmatch(r"#(.+?)\n([^#]+)"ims, readmetext)
        if length(p.captures) == 2
            head, text = p.captures
            head = strip(replace(head, "#" => ""))
            push!(S, (; head, text))
            length(S) == maxsections && break
        end
    end
    
    if 0 == length(S)
        push!(S, (head="", text=readmetext))
    end
    S
end

function packages_metadata()
    packages = ZipFile.Reader(get_julia_packages())
    readme = Dict{String, Int}()
    metadata = Dict{String, Int}()
    
    for (i, file) in enumerate(packages.files)
        arr = splitpath(file.name)
        name, kind = arr[end-1], arr[end]
        if kind == "Metadata.json"
            metadata[name] = i
        else
            readme[name] = i
        end
    end

    packages, readme, metadata
end

function read_zipped(z)
    seekstart(z)
    JSON.parse(read(z, String))
end

#metadata(name::String) = read_zipped(name, D.metadata)

function readme(z)
    r = read_zipped(z)
    String(base64decode(r["content"]))
end

# readme(name::String) = readme(D.readme[name])

function create_dataset()
    packages, readme_, metadata_ = packages_metadata()
    
    names = String[]
    urls = String[]
    descriptions = String[]
    stars = Int[]
    corpus = String[]
    zipid = typeof((; readme=1, metadata=1))[]
    name2id = Dict{String,Int}()
    
    for (k, i) in readme_
        s = only(sections(readme(packages.files[i]), 1))
        j = metadata_[k]
        m = read_zipped(packages.files[j])
        push!(names, k)
        push!(urls, m["html_url"])
        d = m["description"]
        push!(descriptions, d === nothing ? "_no description_" : d)
        push!(stars, m["watchers_count"])
        push!(corpus, s.text)
        push!(zipid, (readme=i, metadata=j))
        name2id[k] = length(names)
    end    
    
    (; names, urls, descriptions, stars, zipid, name2id, corpus, packages)
end

function readme_and_metadata(name::String, D::NamedTuple)
    # Look up the requested package instead of a hard-coded name.
    f = D.zipid[D.name2id[name]]
    readme(D.packages.files[f.readme]), read_zipped(D.packages.files[f.metadata])
end

function package(name::String, D::NamedTuple)
    id = D.name2id[name]
    (; id, name, zipid=D.zipid[id], text=D.corpus[id])
end

package (generic function with 1 method)
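
To see what `sections` keeps from a README, the sketch below copies the function from the cell above and runs it on a made-up README string (the sample text and its URL are assumptions for illustration only). Because the body pattern `[^#]+` stops at the first `#` character, a URL fragment such as `#anchor` also ends the captured text, which may explain why some README excerpts appear cut off in the middle of a link.

```julia
# Copy of `sections` from the notebook, repeated so the example is self-contained.
function sections(readmetext, maxsections=1)
    S = []
    for p in eachmatch(r"#(.+?)\n([^#]+)"ims, readmetext)
        if length(p.captures) == 2
            head, text = p.captures
            head = strip(replace(head, "#" => ""))
            push!(S, (; head, text))
            length(S) == maxsections && break
        end
    end

    # Fall back to the whole text when no heading was found.
    if 0 == length(S)
        push!(S, (head="", text=readmetext))
    end
    S
end

# Hypothetical README: one heading, a link with a `#` fragment, a second heading.
sample = "# Demo\nIntro with a [link](https://example.org/page#anchor) inside.\n\n## Usage\nMore text.\n"

s = only(sections(sample))
println("head: ", s.head)
println("text: ", s.text)   # the text ends right before `#anchor`
```

If the truncation is not desired, one option would be to match headings only at the start of a line; that changes which text is indexed, so it is a design decision rather than a pure fix.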

In [5]:
D = create_dataset();

In [6]:
function create_index(vectors)
    dist = NormalizedCosineDistance()
    db = VectorDatabase(vectors)
    index = SearchGraph(; dist, db)
    minrecall = MinRecall(0.9)
    ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(minrecall))
    index!(index, ctx)
    optimize_index!(index, ctx, minrecall)
    index
end

function text_model_and_vectors(corpus;
        textconfig = TextConfig(group_usr=false, group_url=true, del_diac=true, del_punc=true, lc=true, group_num=true, nlist=[], qlist=[4]),
        voc = Vocabulary(textconfig, corpus),
        model = VectorModel(IdfWeighting(), TfWeighting(), voc)
    )
    model = filter_tokens(model) do t
        5 ≤ t.ndocs ≤ 1000
    end
    vectors = vectorize_corpus(model, corpus)
    (; textconfig, model, vectors)
end

myvectorize(text::String, T::NamedTuple) = vectorize(T.model, text)
myvectorize_corpus(corpus, T::NamedTuple) = vectorize_corpus(T.model, corpus)

myvectorize_corpus (generic function with 1 method)
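
The text model above combines character 4-grams (`qlist=[4]`), TF-IDF weighting, and a document-frequency filter on tokens. The following is a minimal sketch of those three ideas in plain Julia, not the `TextSearch` implementation: the names `qgrams`, `tfidf_vectors`, and `sparse_dot` are hypothetical, the idf formula is the textbook `log(N / df)`, and the tokenization details (padding, punctuation handling) are simplified assumptions.

```julia
# Character q-grams of a lowercased string (q = 4 mirrors `qlist=[4]`).
function qgrams(text::AbstractString, q::Int=4)
    chars = collect(lowercase(text))
    [String(chars[i:i+q-1]) for i in 1:length(chars)-q+1]
end

# Sparse, unit-length TF-IDF vectors stored as Dict{String,Float64}.
# Tokens whose document frequency falls outside [mindocs, maxdocs] are dropped,
# in the spirit of the `5 ≤ t.ndocs ≤ 1000` filter used in the notebook.
function tfidf_vectors(corpus; q=4, mindocs=1, maxdocs=typemax(Int))
    bags = [qgrams(doc, q) for doc in corpus]
    df = Dict{String,Int}()
    for bag in bags, t in unique(bag)
        df[t] = get(df, t, 0) + 1          # number of documents containing t
    end
    n = length(corpus)
    map(bags) do bag
        v = Dict{String,Float64}()
        for t in bag
            mindocs <= df[t] <= maxdocs || continue
            # adding idf once per occurrence yields tf * idf
            v[t] = get(v, t, 0.0) + log(n / df[t])
        end
        nrm = isempty(v) ? 0.0 : sqrt(sum(abs2, values(v)))
        nrm > 0 ? Dict{String,Float64}(t => w / nrm for (t, w) in v) : v
    end
end

# Dot product of two sparse vectors; for unit vectors, 1 - dot is the cosine distance.
function sparse_dot(u, v)
    s = 0.0
    for (t, w) in u
        s += w * get(v, t, 0.0)
    end
    s
end

docs = ["similarity search", "nearest neighbor search", "web framework"]
V = tfidf_vectors(docs)
println("doc1 vs doc2: ", sparse_dot(V[1], V[2]))   # they share the q-grams of "search"
println("doc1 vs doc3: ", sparse_dot(V[1], V[3]))   # no q-grams in common
```

Filtering by document frequency removes both very rare tokens (often noise) and very common ones (little discriminative power), which also keeps the sparse vectors short.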

In [7]:
T = text_model_and_vectors(D.corpus)
@show T.model
@time index = create_index(T.vectors);

T.model = {VectorModel
    global_weighting: IdfWeighting()
    local_weighting: TfWeighting()
    vocsize: 21210
    trainsize=6686
    maxoccs=167309                                    
}
 19.421375 seconds (4.44 M allocations: 406.315 MiB, 1.11% gc time, 23.63% compilation time)
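
`SearchGraph` is an approximate index, and the `MinRecall(0.9)` setting asks the optimizer for a target recall. As a reference point, the sketch below shows what an exact (exhaustive) k-nearest-neighbor search under the normalized cosine distance looks like, and how recall compares an approximate answer against it. The helper names `normalized_cosine`, `exhaustive_knn`, and `recall` are hypothetical and are not part of `SimilaritySearch`; dense vectors are used only to keep the example short.

```julia
using LinearAlgebra

# Cosine distance for vectors that are already unit length.
normalized_cosine(u, v) = 1 - dot(u, v)

# Exact k nearest neighbors by scanning the whole database.
function exhaustive_knn(db, q, k)
    dists = [normalized_cosine(x, q) for x in db]
    ids = partialsortperm(dists, 1:min(k, length(db)))
    [(id=i, weight=dists[i]) for i in ids]
end

# Fraction of the exact neighbors that an approximate result recovered.
recall(approx_ids, exact_ids) = length(intersect(approx_ids, exact_ids)) / length(exact_ids)

db = normalize.([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = normalize([1.0, 0.1])

exact = exhaustive_knn(db, q, 2)
exact_ids = [p.id for p in exact]
println(exact_ids)

# A made-up approximate answer that misses one of the two true neighbors.
println(recall([1, 2], exact_ids))
```

An exhaustive scan costs one distance evaluation per indexed item and query, which is the cost an approximate index trades against a small loss in recall.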


In [8]:
function search_and_display(text, k, D, T)
    # Note: `index` is the global search index created in the previous cell.
    res = KnnResult(k)
    search(index, myvectorize(text, T), res)
    display("text/markdown", "# Results for `$text`")
    
    for (i, p) in enumerate(res)
        display(@htl """
            <hr />
            <div style="padding: 0.5em; border-style: solid; border-color: #557799;">
            <div><a href="$(D.urls[p.id])">$(D.names[p.id])</a> &nbsp;&nbsp; ⭐ $(D.stars[p.id]) </div>
            <div> $(D.descriptions[p.id])</div>
            <span style="color: #aaaa30;">debug:: i: $(i), id: $(p.id), dist=$(p.weight)</span>
            </div>
        """)
        display("text/markdown", D.corpus[p.id])
    end
end

#search_and_display("similarity search nearest neighbor", 10, D, T)


search_and_display (generic function with 1 method)

In [9]:
search_and_display("similarity search", 10, D, T)

# Results for `similarity search`


[![Stable docs](https://img.shields.io/badge/docs-stable-blue.svg)](https://kernelmethod.github.io/LSHFunctions.jl/stable/) [![Dev docs](https://img.shields.io/badge/docs-dev-blue.svg)](https://kernelmethod.github.io/LSHFunctions.jl/dev/)
[![Build Status](https://github.com/kernelmethod/LSHFunctions.jl/workflows/CI/badge.svg)](https://github.com/kernelmethod/LSHFunctions.jl/actions)
[![Codecov](https://codecov.io/gh/kernelmethod/LSHFunctions.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/kernelmethod/LSHFunctions.jl)
[![DOI](https://zenodo.org/badge/197700982.svg)](https://zenodo.org/badge/latestdoi/197700982)

A Julia package for [locality-sensitive
hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) to accelerate
similarity search.

- [What's LSH?](

 [![][docs-dev-img]][docs-dev-url]
 
A simple Julia wrapper around the [Faiss](https://github.com/facebookresearch/Faiss) library for similarity search with [`PythonCall.jl`](https://github.com/cjdoris/PythonCall.jl).

While functional and faster than [`NearestNeighbors.jl`](https://github.com/KristofferC/NearestNeighbors.jl).

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Facebook AI Research.






SimilaritySearch.jl is a library for nearest neighbor search. In particular, it contains the implementation for `SearchGraph,` a fast and flexible search index using any metric function. It is designed to support multithreading in most of its functions and structures.

The package provides the following indexes:

- `ParallelExhaustiveSearch`: An brute force search index where each query is solved using all available threads.
- `ExhaustiveSearch`: A brute force search index, each query is solved using a single thread.
- `SearchGraph`: An approximate search index with parallel construction.

The main set of functions are:

- `search`: Solves a single query.
- `searchbatch`: Solves a set of queries.
- `allknn`: Computes the $k$ nearest neighbors for all elements in an index.
- `neardup`: Removes near-duplicates from a metric dataset.
- `closestpair`: Computes the closest pair in a metric dataset.

The precise definitions of these functions and the complete set of functions and structures can be found in the [documentation](https://sadit.github.io/SimilaritySearch.jl/dev).



# VisiualSearch



This package implements inverted files, also known as inverted indexes, that are data structures that represents a large sparse matrix, specially organized to compute some distance functions and fetch `k` nearest neighbors.
It is mainly used for full text search and other search tasks where data can be formulated as large sparse vectors.
In particular, the package implements three types of inverted files:

- `WeightedInvertedFile`: Inverted files for sparse vectors, it can solve $k$ nearest neighbors using the  normalized cosine distance, $1 - dot(u, q)$
- `BinaryInvertedFile`: Inverted file for sparse binary data, it can solve $k$ nearest neighbors using Jaccard, Dice, and Cosine distances, and also the intersection dissimilarity measure.
- `KnrIndex`: An approximated similarity search index based on inverted files. It supports general metric spaces.

These structs integrates with the `SimilaritySearch` environment, such that you can use it as a drop-in replacement of other indexes. In particular, inverted files are well-known for its scalability when the proper setup is used.



This package provides some support to use `SimilaritySearch` with manifold learning methods. In particular,
we implement the required methods to implement `knn` function for `ManifoldLearning` and also provides an `UMAP`
implementation that takes advantage of many `SimilaritySearch` features like multithreading and data independency; it supports string, sets, vectors, etc. under diverse distance functions.

The `ManifoldLearning` support is limited to some structure specification due to the design of the package. See the `ManifoldKnnIndex` type in the documentation pages.



![SearchLight Logo](https://dl.dropboxusercontent.com/s/sy04ofyyi8es388/searchlight-logo.png)

[![Docs](https://img.shields.io/badge/searchlight-docs-greenyellow)](https://www.genieframework.com/docs/)
# Genie / SearchLight
#### SearchLight is the ORM layer of Genie.jl, the high-performance high-productivity Julia web framework.



[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://johnnychen94.github.io/BlockMatching.jl/stable)
[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://johnnychen94.github.io/BlockMatching.jl/dev)
[![Build Status](https://github.com/johnnychen94/BlockMatching.jl/workflows/CI/badge.svg)](https://github.com/johnnychen94/BlockMatching.jl/actions)
[![Coverage](https://codecov.io/gh/johnnychen94/BlockMatching.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/johnnychen94/BlockMatching.jl)

`BlockMatching` aims to provide a sophisticated implementation on common [block matching
algorithms](https://en.wikipedia.org/wiki/Block-matching_algorithm) for image processing and
computer vision tasks. Block matching is a data and computational intense algorithm, performance is
of high priority for this package.

🚧 This is still a WIP project.

Two functions are provided as the standard API:

- `best_match`: finds the best matching candidate. This is also known as nearest neighbor search.
- `multi_match`: sort the similarities of all candidates and return the smallest K results. This is sometimes known as K nearest neighbor search or top-k selection.

Available block matching strategies:

- `FullSearch`(brute force): search among all possible candidates. This gives the most accurate result 
  but is computationally intensive. CUDA is supported for commonly used distances defined in 
  [Distances.jl].


[Distances.jl]: https://github.com/JuliaStats/Distances.jl



This package aims at providing an interface for branch and prune search in Julia.



*A ElasticSearch client for Julia*

