# Capstone project
Author: Eric S. Tellez <eric.tellez@infotec.mx> <br/>

Throughout the Information Retrieval course we studied different ways of modeling text, taking into account either its vocabulary or its semantics. In particular, the vocabulary was manipulated in several ways to emphasize or dampen effects that impact answer quality or the speed of the results. We used the inverted index for sparse representations and metric search for dense vectors, which gives different options for meeting the varied requirements of information systems.
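
As a reminder of how an inverted index answers queries over a sparse, vocabulary-based representation, here is a minimal sketch in plain Julia. The toy documents and the helper names `build_inverted_index` and `search_and` are assumptions made for illustration; they are not part of the course packages.

```julia
# Toy corpus; the documents are made up for illustration.
docs = [
    "fast nearest neighbor search",
    "markdown parser for julia",
    "julia package for similarity search",
]

# Build the inverted index: token => list of ids of the documents containing it.
function build_inverted_index(docs)
    index = Dict{String,Vector{Int}}()
    for (docid, text) in enumerate(docs)
        for token in unique(String.(split(lowercase(text))))
            push!(get!(index, token, Int[]), docid)
        end
    end
    index
end

# Conjunctive (AND) query: intersect the posting lists of all query tokens.
function search_and(index, query)
    tokens = String.(split(lowercase(query)))
    isempty(tokens) && return Int[]
    lists = [get(index, t, Int[]) for t in tokens]
    reduce(intersect, lists)
end

invindex = build_inverted_index(docs)
search_and(invindex, "julia search")   # ids of documents containing both tokens
```

Only the posting lists of the query tokens are touched, which is why this structure suits sparse representations; ranking by weights such as TF-IDF would be layered on top of the same lists.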

Besides search, the course also covered organizing documents into groups (_clustering_) and visualization using non-linear dimensionality reduction techniques.

All of these topics are expected to be useful in the work of a data scientist, both during exploratory data analysis and when building intelligent systems.

# Activities

- Build a mini information system that solves a problem you are familiar with
- Document collection
- Document modeling
- Indexing, search, and presentation of documents
- Data analysis and visualization
- Report


# Example: Packages in the Julia language's main registry

In [1]:
using Pkg
Pkg.activate(".")

[32m[1m  Activating[22m[39m project at `~/IR-2022/Unidades`


In [2]:
using SimilaritySearch, Interact, SimSearchManifoldLearning, TextSearch, StatsBase, Clustering, ZipFile, CommonMark, JSON, Base64, Plots, LinearAlgebra, HypertextLiteral
include("read_datasets.jl")

get_julia_packages (generic function with 1 method)

In [3]:
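# Split a README into (head, text) sections; returns at most `maxsections` of them.
function sections(readmetext, maxsections=1)
    S = []
    # A heading starting with '#', followed by the text up to the next '#'
    for p in eachmatch(r"#(.+?)\n([^#]+)"ims, readmetext)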
        if length(p.captures) == 2
            head, text = p.captures
            head = strip(replace(head, "#" => ""))
            push!(S, (; head, text))
            length(S) == maxsections && break
        end
    end
    
    if 0 == length(S)
        push!(S, (head="", text=readmetext))
    end
    S
end

function packages_metadata()
    packages = ZipFile.Reader(get_julia_packages())
    readme = Dict{String, Int}()
    metadata = Dict{String, Int}()
    
    for (i, file) in enumerate(packages.files)
        arr = splitpath(file.name)
        name, kind = arr[end-1], arr[end]
        if kind == "Metadata.json"
            metadata[name] = i
        else
            readme[name] = i
        end
    end

    packages, readme, metadata
end

function read_zipped(z)
    seekstart(z)
    JSON.parse(read(z, String))
end

#metadata(name::String) = read_zipped(name, D.metadata)

function readme(z)
    r = read_zipped(z)
    String(base64decode(r["content"]))
end

# readme(name::String) = readme(D.readme[name])

function create_dataset()
    packages, readme_, metadata_ = packages_metadata()
    
    names = String[]
    urls = String[]
    descriptions = String[]
    stars = Int[]
    corpus = String[]
    zipid = typeof((; readme=1, metadata=1))[]
    name2id = Dict{String,Int}()
    
    for (k, i) in readme_
        s = only(sections(readme(packages.files[i]), 1))
        j = metadata_[k]
        m = read_zipped(packages.files[j])
        push!(names, k)
        push!(urls, m["html_url"])
        d = m["description"]
        push!(descriptions, d === nothing ? "_no description_" : d)
        push!(stars, m["watchers_count"])
        push!(corpus, s.text)
        push!(zipid, (readme=i, metadata=j))
        name2id[k] = length(names)
    end    
    
    (; names, urls, descriptions, stars, zipid, name2id, corpus, packages)
end

function readme_and_metadata(name::String, D::NamedTuple)
    # Zip entry ids (readme, metadata) of the requested package
    f = D.zipid[D.name2id[name]]
    readme(D.packages.files[f.readme]), read_zipped(D.packages.files[f.metadata])
end

function package(name::String, D::NamedTuple)
    id = D.name2id[name]
    (; id, name, zipid=D.zipid[id], text=D.corpus[id])
end

package (generic function with 1 method)
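
To see what the regular expression in `sections` extracts, the following sketch applies the same pattern to a made-up README string. It is only an illustration of the pattern's behavior, not a replacement for the function above.

```julia
# Made-up README text for illustration.
readme_text = "# Title\nIntro text.\n## Usage\nMore text.\n"

# Same pattern as in `sections`: a '#', the heading up to the newline,
# then everything up to the next '#'.
pattern = r"#(.+?)\n([^#]+)"ims

# For "## Usage" the heading capture still holds a leading '#',
# which is why the heading is cleaned with `replace` and `strip`.
found = [(head=strip(replace(m.captures[1], "#" => "")), text=m.captures[2])
         for m in eachmatch(pattern, readme_text)]

for s in found
    println(repr(s.head), " => ", repr(s.text))
end
```

Note that the body capture `[^#]+` stops at the first `#` of any kind, so a `#` inside a URL fragment or a code comment also ends the section text.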

In [4]:
D = create_dataset();

In [5]:
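function create_index(vectors)
    # Cosine distance for vectors that are already normalized
    dist = NormalizedCosineDistance()
    db = VectorDatabase(vectors)
    index = SearchGraph(; dist, db, verbose=false)
    index!(index)  # insert all database vectors into the graph
    # Tune the search parameters targeting a recall of at least 0.9
    optimize!(index, MinRecall(0.9))
    index
end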

function text_model_and_vectors(corpus;
        textconfig=TextConfig(group_usr=false, group_url=true, del_diac=true, del_punc=true, lc=true, group_num=true, nlist=[], qlist=[4]),
        model=VectorModel(IdfWeighting(), TfWeighting(), textconfig, corpus)
    )
    # Keep only tokens that appear in at least 5 and at most 1000 documents
    model = filter_tokens(model) do t
        5 ≤ t.ndocs ≤ 1000
    end
    vectors = vectorize_corpus(model, textconfig, corpus)
    (; textconfig, model, vectors)
end

myvectorize(text::String, T::NamedTuple) = vectorize(T.model, T.textconfig, text)
myvectorize_corpus(corpus, T::NamedTuple) = vectorize_corpus(T.model, T.textconfig, corpus)

myvectorize_corpus (generic function with 1 method)
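
The configuration above uses `qlist=[4]`, that is, character 4-grams as tokens, and then filters tokens by the number of documents in which they appear. The sketch below reproduces both ideas in plain Julia on a three-document toy corpus. The helpers `qgrams` and `document_frequency` are hypothetical, the bounds are scaled down to fit the toy corpus, and the IDF formula is one common definition that is not necessarily the one `IdfWeighting` implements.

```julia
# Character q-grams of a lowercased text (hypothetical helper, not the TextSearch API).
function qgrams(text::AbstractString, q::Integer)
    chars = collect(lowercase(text))
    String[String(chars[i:i+q-1]) for i in 1:length(chars)-q+1]
end

# Document frequency: in how many documents each token appears.
function document_frequency(corpus, q)
    df = Dict{String,Int}()
    for text in corpus
        for token in unique(qgrams(text, q))
            df[token] = get(df, token, 0) + 1
        end
    end
    df
end

corpus = ["markdown parser", "markdown viewer", "csv parser"]
df = document_frequency(corpus, 4)

# Keep tokens whose document frequency lies in a range, mirroring `5 ≤ t.ndocs ≤ 1000`
# with bounds scaled down to this three-document corpus.
vocabulary = sort([t for (t, n) in df if 2 ≤ n ≤ 2])

# Inverse document frequency of the surviving tokens (assumed formula: log(N / df)).
idf = Dict(t => log(length(corpus) / df[t]) for t in vocabulary)
```

Dropping very rare tokens removes noise such as typos, while dropping very frequent ones removes tokens that barely discriminate between documents; both also shrink the vocabulary and speed up search.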

In [6]:
T = text_model_and_vectors(D.corpus)
@show T.model
@time index = create_index(T.vectors);

T.model = {VectorModel global_weighting=IdfWeighting(), local_weighting=TfWeighting(), train-voc=26234, train-n=6686, maxoccs=1000}
 21.792111 seconds (12.09 M allocations: 1.523 GiB, 4.46% gc time, 30.78% compilation time)
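
The index is approximate, and `MinRecall(0.9)` in `create_index` refers to recall measured against exact nearest neighbors. A minimal sketch of that exact baseline follows, assuming sparse vectors stored as `Dict{Int,Float64}` and scaled to unit length so that the cosine distance reduces to one minus the dot product. All names here are hypothetical and independent of SimilaritySearch.

```julia
# Scale a sparse vector (token id => weight) to unit length.
function normalize_sparse(v::Dict{Int,Float64})
    n = sqrt(sum(abs2, values(v)))
    Dict{Int,Float64}(k => w / n for (k, w) in v)
end

# Dot product of two sparse vectors; iterate over the smaller one.
function sparse_dot(a, b)
    length(a) > length(b) && return sparse_dot(b, a)
    s = 0.0
    for (k, w) in a
        s += w * get(b, k, 0.0)
    end
    s
end

# Cosine distance for vectors that are already unit length.
normalized_cosine(a, b) = 1.0 - sparse_dot(a, b)

# Exhaustive k-nearest-neighbor search: the exact answer an approximate index is compared with.
function brute_force_knn(db, query, k)
    dists = [(normalized_cosine(v, query), i) for (i, v) in enumerate(db)]
    sort!(dists)
    dists[1:min(k, length(dists))]
end

# Recall: fraction of the exact neighbors that an approximate result retrieved.
recall(exact_ids, approx_ids) = length(intersect(exact_ids, approx_ids)) / length(exact_ids)

db = normalize_sparse.([
    Dict(1 => 1.0, 2 => 1.0),
    Dict(2 => 1.0, 3 => 1.0),
    Dict(4 => 1.0),
])
query = normalize_sparse(Dict(1 => 1.0, 2 => 2.0))

exact = [i for (_, i) in brute_force_knn(db, query, 2)]
recall(exact, [1, 3])   # a made-up approximate answer that misses one true neighbor
```

The exhaustive scan costs one distance evaluation per database vector and query, which is the cost a graph index trades against a small loss in recall.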


In [7]:
function search_and_display(text, k, D, T)
    res = KnnResult(k)
    # Note: relies on the global `index` created in the previous cell
    search(index, myvectorize(text, T), res)
    display("text/markdown", "# Results for `$text`")
    
    for (i, (id_, dist_)) in enumerate(res)
        display(@htl """
            <hr />
            <div style="padding: 0.5em; border-style: solid; border-color: #557799;">
            <div><a href="$(D.urls[id_])">$(D.names[id_])</a> &nbsp;&nbsp; ⭐ $(D.stars[id_]) </div>
            <div> $(D.descriptions[id_])</div>
            <span style="color: #aaaa30;">debug:: i: $i, id: $id_, dist=$dist_</span>
            </div>
        """)
        display("text/markdown", D.corpus[id_])
    end
end

#search_and_display("similarity search nearest neighbor", 10, D, T)


search_and_display (generic function with 1 method)

In [8]:
search_and_display("markdown parser", 10, D, T)

# Results for `markdown parser`


The macro `@markdown` lets you write [Markdown](https://www.markdownguide.org/getting-started/) inside Pluto notebooks. *Here is an example:*

```julia
@markdown("""


[![codecov](https://codecov.io/gh/jiachengzhang1/SwaggerMarkdown/branch/master/graph/badge.svg?token=GF65PUANJ2)](https://codecov.io/gh/jiachengzhang1/SwaggerMarkdown)

Swagger Markdown allows you to generate `swagger.json` for API documentation from the julia source code. The package uses marco to process the markdown that contains an API endpoint's documentation. The markdowon needs to follow the `paths` described by the OpenAPI Specification ([v3](https://swagger.io/specification/


[![Build Status](https://travis-ci.org/jonathanBieler/GtkMarkdownTextView.jl.svg?branch=master)](https://travis-ci.org/jonathanBieler/GtkMarkdownTextView.jl)

[![Coverage Status](https://coveralls.io/repos/jonathanBieler/GtkMarkdownTextView.jl/badge.svg?branch=master&service=github)](https://coveralls.io/github/jonathanBieler/GtkMarkdownTextView.jl?branch=master)

A widget to display simple markdown formatted text:

![screenshot](assets/GtkMarkdownTextView.png)

```julia
w = GtkWindow("")

md = """



![CI](https://github.com/JunoLab/Weave.jl/workflows/CI/badge.svg)
[![codecov](https://codecov.io/gh/JunoLab/Weave.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/JunoLab/Weave.jl)
[![](https://img.shields.io/badge/docs-stable-blue.svg)](http://weavejl.mpastell.com/stable/)
[![](https://img.shields.io/badge/docs-dev-blue.svg)](http://weavejl.mpastell.com/dev/)
[![](http://joss.theoj.org/papers/10.21105/joss.00204/status.svg)](http://dx.doi.org/10.21105/joss.00204)

Weave is a scientific report generator/literate programming tool for the [Julia programming language](https://julialang.org/).
It resembles
[Pweave](http://mpastell.com/pweave),
[knitr](https://yihui.org/knitr/),
[R Markdown](https://rmarkdown.rstudio.com/),
and [Sweave](https://stat.ethz.ch/R-manual/R-patched/library/utils/doc/Sweave.pdf).

You can write your documentation and code in input document using Markdown, Noweb or ordinal Julia script syntax,
and then use `weave` function to execute code and generate an output document while capturing results and figures.

**Current features**

- Publish markdown directly to HTML and PDF using Julia or [Pandoc](https://pandoc.org/MANUAL.html)
- Execute code as in terminal or in a unit of code chunk
- Capture [Plots.jl](https://github.com/JuliaPlots/Plots.jl) or [Gadfly.jl](https://github.com/GiovineItalia/Gadfly.jl) figures
- Supports various input format: Markdown, [Noweb](https://www.cs.tufts.edu/~nr/noweb/), [Jupyter Notebook](https://jupyter.org/), and ordinal Julia script
- Conversions between those input formats
- Supports various output document formats: HTML, PDF, GitHub markdown, Jupyter Notebook, MultiMarkdown, Asciidoc and reStructuredText
- Simple caching of results

**Citing Weave:** *Pastell, Matti. 2017. Weave.jl: Scientific Reports Using Julia. The Journal of Open Source Software. http://dx.doi.org/10.21105/joss.00204*

![Weave in Juno demo](https://user-images.githubusercontent.com/40514306/76081328-32f41900-5fec-11ea-958a-375f77f642a2.png)




This extension, **[toolips markdown](http://github.com/ChifiSource/ToolipsMarkdown.jl)** allows the conversion of regular markdown into Toolips components.
"""
heading1s = Style("h1", color = "pink")
heading1s:"hover":["color" => "lightblue"]

myroute = route("/") do c::Connection
    write!(c, heading1s)
    mdexample2 = tmd("mymarkdown", "


| **Build Status**                                        |
|:-------------------------------------------------------:|
| [![][gha-img]][gha-url] [![][codecov-img]][codecov-url] |

This package provides a Markdown / MkDocs backend to [`Documenter.jl`][documenter].

**Package status:** Currently, the package does not work with the 0.28 branch of Documenter, and
therefore the latest versions of Documenter do not have a Markdown backend available.
Older, released versions of this package can still be used together with older versions of Documenter (0.27
and earlier) to enable the Markdown backend built in to those versions of Documenter.

Right now, this package is not actively maintained. However, contributions are welcome by anyone
who might be interested in using and developing this backend.



```julia
using Parsers




[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://tkf.github.io/DisplayAs.jl/stable)
[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://tkf.github.io/DisplayAs.jl/dev)
[![Run tests](https://github.com/tkf/DisplayAs.jl/actions/workflows/test.yml/badge.svg)](https://github.com/tkf/DisplayAs.jl/actions/workflows/test.yml)
[![Codecov](https://codecov.io/gh/tkf/DisplayAs.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/tkf/DisplayAs.jl)
[![Aqua QA](https://raw.githubusercontent.com/JuliaTesting/Aqua.jl/master/badge.svg)](https://github.com/JuliaTesting/Aqua.jl)
[![GitHub last commit](https://img.shields.io/github/last-commit/tkf/DisplayAs.jl.svg?style=social&logo=github)](https://github.com/tkf/DisplayAs.jl)

DisplayAs.jl provides functions to show objects in a chosen MIME type.

```julia
julia> using DisplayAs
       using Markdown

julia> md_as_html = Markdown.parse("hello") |> DisplayAs.HTML;

julia> showable("text/html", md_as_html)
true

julia> showable("text/markdown", md_as_html)
false

julia> md_as_md = Markdown.parse("hello") |> DisplayAs.MD;

julia> showable("text/html", md_as_md)
false

julia> showable("text/markdown", md_as_md)
true
```

It is also possible to use nesting in order to allow the object to be displayed
as multiple MIME types:

```julia
julia> md_as_html_or_text = Markdown.parse("hello") |> DisplayAs.HTML |> DisplayAs.Text;

julia> showable("text/html", md_as_html_or_text)
true

julia> showable("text/plain", md_as_html_or_text)
true

julia> showable("text/markdown", md_as_html_or_text)
false
```



[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://tkf.github.io/ExternalDocstrings.jl/dev)
[![CI](https://github.com/tkf/ExternalDocstrings.jl/actions/workflows/test.yml/badge.svg)](https://github.com/tkf/ExternalDocstrings.jl/actions/workflows/test.yml)

ExternalDocstrings.jl is a helper for writing docstrings in markdown files.

See the [documentation](https://tkf.github.io/ExternalDocstrings.jl/dev) for more information.




[![Build Status](https://travis-ci.org/JuliaWeb/UAParser.jl.svg?branch=master)](https://travis-ci.org/JuliaWeb/UAParser.jl) </br>
[![Coverage Status](https://coveralls.io/repos/JuliaWeb/UAParser.jl/badge.svg)](https://coveralls.io/r/JuliaWeb/UAParser.jl)


UAParser is a Julia port of [ua-parser](https://github.com/ua-parser/uap-python), which itself is a multi-language port of [BrowserScope's](http://www.browserscope.org) [user agent string parser](http://code.google.com/p/ua-parser/). Per the [README file](https://github.com/ua-parser/uap-core/blob/master/README.md) of the main project:

> "The crux of the original parser--the data collected by [Steve Souders](http://stevesouders.com/) over the years--has been extracted into a separate [YAML file](https://github.com/tobie/ua-parser/blob/master/regexes.yaml) so as to be reusable _as is_ by implementations in other programming languages."

UAParser is a limited Julia implementation heavily influenced by the [Python code](https://github.com/ua-parser/uap-python) from the ua-parser library.

New regexes have were retrieved from [here](https://github.com/ua-parser/uap-core/blob/master/regexes.yaml) on 2018-12-19.

