Add BitPacked embeddings for RAG retrieval (#152)
svilupp committed May 19, 2024
1 parent 27e4301 commit 9e9fd16
Showing 8 changed files with 435 additions and 21 deletions.
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -16,12 +16,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added new field `meta` to `TracerMessage` and `TracerMessageLike` to hold metadata in a simple dictionary. Change is backward-compatible.
- Changed behaviour of `aitemplates(name::Symbol)` to look for the exact match on the template name, not just a partial match. This is a breaking change for the `aitemplates` function only. Motivation is that having multiple matches could have introduced subtle bugs when looking up valid placeholders for a template.


### Added
- Improved support for `aiclassify` with OpenAI models (you can now encode up to 40 choices).
- Added a template for routing questions `:QuestionRouter` (to be used with `aiclassify`)
- Improved tracing by `TracerSchema` to automatically capture crucial metadata such as any LLM API kwargs (`api_kwargs`), use of prompt templates and its versions. Information is captured in `meta(tracer)` dictionary. See `?TracerSchema` for more information.
- Improved tracing by `TracerSchema` to automatically capture crucial metadata such as any LLM API kwargs (`api_kwargs`), use of prompt templates and its version. Information is captured in `meta(tracer)` dictionary. See `?TracerSchema` for more information.
- New tracing schema `SaverSchema` allows you to automatically serialize all conversations. It can be composed with other tracing schemas, eg, `TracerSchema`, to automatically capture necessary metadata and serialize. See `?SaverSchema` for more information.
- Updated options for Binary embeddings (refer to release v0.18 for motivation). Adds utility functions `pack_bits` and `unpack_bits` to move between binary and UInt64 representations of embeddings. RAGTools adds the corresponding `BitPackedBatchEmbedder` and `BitPackedCosineSimilarity` for fast retrieval on these Bool<->UInt64 embeddings (credit to [**domluna's tinyRAG**](https://github.com/domluna/tinyRAG)).

### Fixed
- Fixed a bug where `aiclassify` would not work when returning the full conversation for choices with extra descriptions
64 changes: 63 additions & 1 deletion src/Experimental/RAGTools/preparation.jl
@@ -31,15 +31,27 @@ struct BatchEmbedder <: AbstractEmbedder end
"""
BinaryBatchEmbedder <: AbstractEmbedder
Same as `BatchEmbedder` but reduces the embeddings matrix to binary tool (eg, `BitMatrix`).
Same as `BatchEmbedder` but reduces the embeddings matrix to a binary form (eg, `BitMatrix`).
Reference: [HuggingFace: Embedding Quantization](https://huggingface.co/blog/embedding-quantization#binary-quantization-in-vector-databases).
"""
struct BinaryBatchEmbedder <: AbstractEmbedder end

"""
BitPackedBatchEmbedder <: AbstractEmbedder
Same as `BatchEmbedder` but reduces the embeddings matrix to a binary form packed in UInt64 (eg, `BitMatrix.chunks`).
See also utilities `pack_bits` and `unpack_bits` to move between packed/non-packed binary forms.
Reference: [HuggingFace: Embedding Quantization](https://huggingface.co/blog/embedding-quantization#binary-quantization-in-vector-databases).
"""
struct BitPackedBatchEmbedder <: AbstractEmbedder end

EmbedderEltype(::T) where {T} = EmbedderEltype(T)
EmbedderEltype(::Type{<:AbstractEmbedder}) = Float32
EmbedderEltype(::Type{BinaryBatchEmbedder}) = Bool
EmbedderEltype(::Type{BitPackedBatchEmbedder}) = UInt64

### Tagging Types
"""
@@ -302,6 +314,56 @@ function get_embeddings(
emb = (emb .> 0) |> x -> x isa return_type ? x : return_type(x)
end

"""
get_embeddings(embedder::BitPackedBatchEmbedder, docs::AbstractVector{<:AbstractString};
verbose::Bool = true,
model::AbstractString = PT.MODEL_EMBEDDING,
truncate_dimension::Union{Int, Nothing} = nothing,
cost_tracker = Threads.Atomic{Float64}(0.0),
target_batch_size_length::Int = 80_000,
ntasks::Int = 4 * Threads.nthreads(),
kwargs...)
Embeds a vector of `docs` using the provided model (kwarg `model`) in a batched manner and then returns the binary embeddings matrix represented in UInt64 (bit-packed) - `BitPackedBatchEmbedder`.
`BitPackedBatchEmbedder` tries to batch embedding calls for roughly 80K characters per call (to avoid exceeding the API rate limit) to reduce network latency.
The best option for FAST and MEMORY-EFFICIENT storage of embeddings. For retrieval, use `BitPackedCosineSimilarity`.
# Notes
- `docs` are assumed to be already chunked to reasonable sizes that fit within the embedding context limit.
- If you get errors about exceeding input sizes, first check the `max_length` in your chunks.
If that does NOT resolve the issue, try reducing the `target_batch_size_length` parameter (eg, 10_000) and number of tasks `ntasks=1`.
Some providers cannot handle large batch sizes.
# Arguments
- `docs`: A vector of strings to be embedded.
- `verbose`: A boolean flag for verbose output. Default is `true`.
- `model`: The model to use for embedding. Default is `PT.MODEL_EMBEDDING`.
- `truncate_dimension`: The dimensionality of the embeddings to truncate to. Default is `nothing`.
- `cost_tracker`: A `Threads.Atomic{Float64}` object to track the total cost of the API calls. Useful to pass the total cost to the parent call.
- `target_batch_size_length`: The target length (in characters) of each batch of document chunks sent for embedding. Default is 80_000 characters. Speeds up embedding process.
- `ntasks`: The number of tasks to use for asyncmap. Default is 4 * Threads.nthreads().
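# Examples
A minimal sketch of the memory saving, assuming a random matrix as a stand-in for real embeddings and a 1536-dimensional model (both are assumptions; no API call is made). It mirrors the binarize-and-pack step with Base's `BitArray` rather than calling `pack_bits`:

```julia
# Illustrative only: random values stand in for real embeddings (assumption).
emb = randn(Float32, 1536, 100)      # 1536 dimensions x 100 documents

# Binarize by sign, then let BitArray store 64 Bools in each UInt64 chunk.
bits = BitArray(emb .> 0)
packed = reshape(bits.chunks, 1536 ÷ 64, 100)

float_bytes = sizeof(emb)        # 4 bytes per dimension
packed_bytes = sizeof(packed)    # 1 bit per dimension
ratio = float_bytes ÷ packed_bytes
```

At 1 bit per dimension instead of 32, the packed matrix is 32x smaller than the `Float32` original.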
See also: `unpack_bits`, `pack_bits`, `BitPackedCosineSimilarity`.
"""
function get_embeddings(
embedder::BitPackedBatchEmbedder, docs::AbstractVector{<:AbstractString};
verbose::Bool = true,
model::AbstractString = PT.MODEL_EMBEDDING,
truncate_dimension::Union{Int, Nothing} = nothing,
cost_tracker = Threads.Atomic{Float64}(0.0),
target_batch_size_length::Int = 80_000,
ntasks::Int = 4 * Threads.nthreads(),
kwargs...)
emb = get_embeddings(BatchEmbedder(), docs; verbose, model, truncate_dimension,
cost_tracker, target_batch_size_length, ntasks, kwargs...)
# This will return Matrix{UInt64} to save space
# Use unpack_bits to convert back to BitMatrix
pack_bits(emb .> 0)
end

### Tag Extraction

function get_tags(tagger::AbstractTagger, docs::AbstractVector{<:AbstractString};
102 changes: 85 additions & 17 deletions src/Experimental/RAGTools/retrieval.jl
@@ -46,6 +46,18 @@ Reference: [HuggingFace: Embedding Quantization](https://huggingface.co/blog/emb
"""
struct BinaryCosineSimilarity <: AbstractSimilarityFinder end

"""
BitPackedCosineSimilarity <: AbstractSimilarityFinder
Finds the closest chunks to a query embedding by measuring the Hamming distance AND cosine similarity between the query and the chunks' embeddings in binary form.
The difference to `BinaryCosineSimilarity` is that the binary values are packed into UInt64, which is more efficient.
Reference: [HuggingFace: Embedding Quantization](https://huggingface.co/blog/embedding-quantization#binary-quantization-in-vector-databases).
Implementation of `hamming_distance` is based on [TinyRAG](https://github.com/domluna/tinyrag/blob/main/README.md).
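As a rough illustration of the first pass (a sketch, not the package implementation; `hamming_sketch` is a hypothetical helper), the Hamming distance between two packed vectors is the number of set bits left after XOR-ing them word by word:

```julia
# Hypothetical helper for illustration only.
# `xor(x, y)` is the function form of the `⊻` operator.
hamming_sketch(a::AbstractVector{UInt64}, b::AbstractVector{UInt64}) =
    sum(count_ones(xor(x, y)) for (x, y) in zip(a, b))

a = UInt64[0b1011, 0]
b = UInt64[0b0010, 1]
d = hamming_sketch(a, b)  # 2 differing bits in the first word + 1 in the second
```

Each `count_ones` call compares 64 embedding dimensions at once, which is where the speed-up over a `Bool`-by-`Bool` comparison comes from.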
"""
struct BitPackedCosineSimilarity <: AbstractSimilarityFinder end

"""
NoTagFilter <: AbstractTagFilter
@@ -202,32 +214,46 @@ function find_closest(
c -> find_closest(finder, index, c; top_k = top_k_, kwargs...), vcat, eachcol(query_emb))
end

## For binary embeddings
#### For binary embeddings
## Source: https://github.com/domluna/tinyrag/blob/main/README.md
## With minor modifications to the signatures

@inline function hamming_distance(x1::T, x2::T)::Int where {T <: Integer}
    return Int(count_ones(x1 ⊻ x2))
end
@inline function hamming_distance(x1::T, x2::T)::Int where {T <: Bool}
    return Int(x1 ⊻ x2)
end
@inline function hamming_distance(
x1::AbstractVector{T}, x2::AbstractVector{T})::Int where {T <: Integer}
s = 0
@inbounds @simd for i in eachindex(x1, x2)
s += hamming_distance(x1[i], x2[i])
end
s
end

"""
hamming_distance(mat::AbstractMatrix{<:Bool}, vect::AbstractVector{<:Bool})
hamming_distance(
mat::AbstractMatrix{T}, query::AbstractVector{T})::Vector{Int} where {T <: Integer}
Calculates the column-wise Hamming distance between a matrix of binary vectors `mat` and a single binary vector `vect`.
This is the first-pass ranking for `BinaryCosineSimilarity` method.
Implementation from [**domluna's tinyRAG**](https://github.com/domluna/tinyRAG).
"""
function hamming_distance(mat::AbstractMatrix{<:Bool}, vect::AbstractVector{<:Bool})
@inline function hamming_distance(
mat::AbstractMatrix{T}, query::AbstractVector{T})::Vector{Int} where {T <: Integer}
# Check if the number of rows matches
if size(mat, 1) != length(vect)
throw(ArgumentError("Matrix must have the same number of rows as the length of the Vector (provided: $(size(mat, 1)) vs $(length(vect)))"))
if size(mat, 1) != length(query)
throw(ArgumentError("Matrix must have the same number of rows as the length of the Vector (provided: $(size(mat, 1)) vs $(length(query)))"))
end

# Calculate number of different bits, the smaller the number, the more similar they are.
distances = zeros(Int, size(mat, 2))
@inbounds for j in axes(mat, 2)
cnt = 0
v = @view(mat[:, j])
@simd for i in eachindex(vect, v)
cnt += v[i] ⊻ vect[i]
end
distances[j] = cnt
dists = zeros(Int, size(mat, 2))
@inbounds @simd for i in axes(mat, 2)
dists[i] = hamming_distance(@view(mat[:, i]), query)
end

return distances
dists
end

"""
@@ -272,6 +298,48 @@ function find_closest(
return positions[new_positions], scores
end

"""
find_closest(
finder::BitPackedCosineSimilarity, emb::AbstractMatrix{<:Integer},
query_emb::AbstractVector{<:Real};
top_k::Int = 100, rescore_multiplier::Int = 4, minimum_similarity::AbstractFloat = -1.0, kwargs...)
Finds the indices of chunks (represented by embeddings in `emb`) that are closest to query embedding (`query_emb`) using bit-packed binary embeddings (in the index).
This is a two-pass approach:
- First pass: Hamming distance in bit-packed binary form to get the `top_k * rescore_multiplier` (i.e., more than top_k) candidates.
- Second pass: Rescore the candidates with float embeddings and return the top_k.
Returns only `top_k` closest indices.
Reference: [HuggingFace: Embedding Quantization](https://huggingface.co/blog/embedding-quantization#binary-quantization-in-vector-databases).
# Examples
Convert any Float embeddings to bit-packed binary like this:
```julia
bitpacked_emb = pack_bits(emb.>0)
```
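The two passes can be sketched in plain Julia as below. This is only an illustration under assumptions: `sketch_pack`, `sketch_unpack` and `two_pass_sketch` are hypothetical helpers, and a plain dot product stands in for the cosine rescoring done by `CosineSimilarity`.

```julia
using LinearAlgebra: dot

# Hypothetical helpers for illustration only (not the package API).
sketch_pack(m::AbstractMatrix{Bool}) =
    reshape(BitArray(m).chunks, size(m, 1) ÷ 64, size(m, 2))
sketch_pack(v::AbstractVector{Bool}) = BitArray(v).chunks
sketch_unpack(col) = Bool[((x >> i) & 1) == 1 for x in col for i in 0:63]

function two_pass_sketch(packed::AbstractMatrix{UInt64}, query::AbstractVector{<:Real};
        top_k::Int = 2, rescore_multiplier::Int = 4)
    # First pass: Hamming distance between packed documents and packed query
    q = sketch_pack(query .> 0)
    dists = [sum(count_ones(xor(x, y)) for (x, y) in zip(col, q))
             for col in eachcol(packed)]
    candidates = first(sortperm(dists), top_k * rescore_multiplier)

    # Second pass: rescore unpacked candidates against the float query
    scores = [dot(sketch_unpack(@view(packed[:, j])), query) for j in candidates]
    order = first(sortperm(scores; rev = true), top_k)

    # Translate back to the original column indices
    return candidates[order], scores[order]
end

# Three toy documents with 64 dimensions each
emb = hcat(ones(64), -ones(64), vcat(-ones(32), ones(32)))
packed = sketch_pack(emb .> 0)
query = ones(64)
positions, scores = two_pass_sketch(packed, query; top_k = 1, rescore_multiplier = 2)
```

Here the first document matches the query on every bit, so it survives both passes and is returned as the single closest position.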
"""
function find_closest(
finder::BitPackedCosineSimilarity, emb::AbstractMatrix{<:Integer},
query_emb::AbstractVector{<:Real};
top_k::Int = 100, rescore_multiplier::Int = 4, minimum_similarity::AbstractFloat = -1.0, kwargs...)
# emb is an embedding matrix where the first dimension is the embedding dimension

## First pass, both in binary with Hamming, get rescore_multiplier times top_k
bit_query_emb = pack_bits(query_emb .> 0)
scores = hamming_distance(emb, bit_query_emb)
positions = scores |> sortperm |> x -> first(x, top_k * rescore_multiplier)

## Second pass, rescore with float embeddings and return top_k
unpacked_emb = unpack_bits(@view(emb[:, positions]))
new_positions, scores = find_closest(CosineSimilarity(), unpacked_emb,
query_emb; top_k, minimum_similarity, kwargs...)

## translate to original indices
return positions[new_positions], scores
end

## TODO: Implement for MultiIndex
## function find_closest(index::AbstractMultiIndex,
## query_emb::AbstractVector{<:Real};
97 changes: 97 additions & 0 deletions src/Experimental/RAGTools/utils.jl
@@ -367,3 +367,100 @@ function merge_kwargs_nested(nt1::NamedTuple, nt2::NamedTuple)
end
return (; zip(keys(result), values(result))...)
end

### Support for binary embeddings

function pack_bits(arr::AbstractArray{<:Number})
throw(ArgumentError("Input must be of binary eltype (Bool vs provided $(eltype(arr))). Please convert your matrix to binary before packing."))
end

"""
pack_bits(arr::AbstractMatrix{<:Bool}) -> Matrix{UInt64}
pack_bits(vect::AbstractVector{<:Bool}) -> Vector{UInt64}
Pack a matrix or vector of boolean values into a more compact representation using UInt64.
# Arguments (Input)
- `arr::AbstractMatrix{<:Bool}`: A matrix of boolean values where the number of rows must be divisible by 64.
# Returns
- For `arr::AbstractMatrix{<:Bool}`: Returns a matrix of UInt64 where each element represents 64 boolean values from the original matrix.
# Examples
For vectors:
```julia
bin = rand(Bool, 128)
binint = pack_bits(bin)
binx = unpack_bits(binint)
@assert bin == binx
```
For matrices:
```julia
bin = rand(Bool, 128, 10)
binint = pack_bits(bin)
binx = unpack_bits(binint)
@assert bin == binx
```
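The sketch below (Base Julia only, not part of the package) shows the bit layout this relies on: in a `BitArray` chunk, the first Bool lands in the least significant bit, which is the order in which `unpack_bits` reads the bits back.

```julia
flags = falses(64)
flags[1] = true
flags[3] = true

# One UInt64 chunk holds all 64 flags; elements 1 and 3 map to bits 0 and 2.
chunk = flags.chunks[1]

# Read the lowest four bits back, least significant bit first.
recovered = [((chunk >> i) & 1) == 1 for i in 0:3]
```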
"""
function pack_bits(arr::AbstractMatrix{<:Bool})
rows, cols = size(arr)
@assert rows % 64==0 "Number of rows must be divisible by 64"
new_rows = rows ÷ 64
reshape(BitArray(arr).chunks, new_rows, cols)
end
function pack_bits(vect::AbstractVector{<:Bool})
len = length(vect)
@assert len % 64==0 "Length must be divisible by 64"
BitArray(vect).chunks
end

function unpack_bits(arr::AbstractArray{<:Number})
throw(ArgumentError("Input must be of UInt64 eltype (provided: $(eltype(arr))). Are you sure you've packed this array?"))
end

"""
unpack_bits(packed_vector::AbstractVector{UInt64}) -> Vector{Bool}
unpack_bits(packed_matrix::AbstractMatrix{UInt64}) -> Matrix{Bool}
Unpack a vector or matrix of UInt64 values into their original boolean representation.
# Arguments (Input)
- `packed_matrix::AbstractMatrix{UInt64}`: A matrix of UInt64 values where each element represents 64 boolean values.
# Returns
- For `packed_matrix::AbstractMatrix{UInt64}`: Returns a matrix of boolean values where the number of rows is 64 times the number of rows in the input matrix.
# Examples
For vectors:
```julia
bin = rand(Bool, 128)
binint = pack_bits(bin)
binx = unpack_bits(binint)
@assert bin == binx
```
For matrices:
```julia
bin = rand(Bool, 128, 10)
binint = pack_bits(bin)
binx = unpack_bits(binint)
@assert bin == binx
```
"""
function unpack_bits(packed_vector::AbstractVector{UInt64})
return Bool[((x >> i) & 1) == 1 for x in packed_vector for i in 0:63]
end
function unpack_bits(packed_matrix::AbstractMatrix{UInt64})
num_rows, num_cols = size(packed_matrix)
output_rows = num_rows * 64
output_matrix = Matrix{Bool}(undef, output_rows, num_cols)

for col in axes(packed_matrix, 2)
output_matrix[:, col] = unpack_bits(@view(packed_matrix[:, col]))
end

return output_matrix
end
9 changes: 9 additions & 0 deletions test/Experimental/RAGTools/preparation.jl
@@ -9,6 +9,7 @@ using PromptingTools.Experimental.RAGTools: build_tags, build_index, SimpleIndex
get_tags, get_chunks, get_embeddings
using PromptingTools.Experimental.RAGTools: build_tags, build_index
using PromptingTools: TestEchoOpenAISchema
using PromptingTools.Experimental.RAGTools: pack_bits, BitPackedBatchEmbedder

@testset "load_text" begin
# from file
@@ -80,9 +81,17 @@ end
@test size(output) == (100, 2)
@test eltype(output) == Bool

# BitPackedBatchEmbedder
output = get_embeddings(
BitPackedBatchEmbedder(), docs; model = "mock-emb")
@test size(output) == (2, 2)
@test eltype(output) == UInt64
output = pack_bits(ones(Float32, 128, 2) .> 0)

# EmbedderEltype
@test EmbedderEltype(BinaryBatchEmbedder()) == Bool
@test EmbedderEltype(BatchEmbedder()) == Float32
@test EmbedderEltype(BitPackedBatchEmbedder()) == UInt64
end

@testset "tags_extract" begin