Research ways to speed up brute force cosine similarity #246

Open
simonw opened this issue Sep 6, 2023 · 6 comments
simonw commented Sep 6, 2023

Might be faster to use numpy or some other trick.
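
For context, the brute-force approach scores every stored vector against the query one pair at a time in pure Python, roughly like this (a minimal sketch, not necessarily the exact code in LLM):

import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Dot product of the two vectors divided by the product of their magnitudes
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(x * x for x in b))
    return dot_product / (magnitude_a * magnitude_b)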

simonw added the research label Sep 6, 2023
simonw commented Sep 6, 2023

I got Code Interpreter to run some benchmarks for me: https://chat.openai.com/share/3b4ae422-17eb-4855-a5d7-e39d3f58c1f7

Most interesting result:

(screenshot: benchmark chart comparing NumPy and pure Python timings for top_n)

That's comparing this function with and without numpy:

import time
from typing import List, Tuple

import matplotlib.pyplot as plt


def cosine_similarity_fallback(a: List[float], b: List[float]) -> float:
    # Pure-Python cosine similarity, used when NumPy is unavailable or skipped
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def top_n(
    vectors: List[List[float]], n: int, skip_numpy: bool = False
) -> Tuple[List[Tuple[int, float]], float]:
    if not skip_numpy:
        try:
            import numpy as np
            numpy_available = True
        except ImportError:
            numpy_available = False
    else:
        numpy_available = False

    start_time = time.time()

    if numpy_available:
        # NumPy-based calculation: first vector is the query, the rest are the corpus
        target_vector = np.array(vectors[0])
        all_vectors = np.array(vectors[1:])

        dot_products = np.dot(all_vectors, target_vector)
        magnitude_a = np.linalg.norm(target_vector)
        magnitude_b = np.linalg.norm(all_vectors, axis=1)

        similarities = dot_products / (magnitude_a * magnitude_b)
        top_indices = np.argsort(similarities)[::-1][:n]
        top_scores = [(int(idx), float(similarities[idx])) for idx in top_indices]
    else:
        # Fallback to Python-based calculation
        target_vector = vectors[0]
        similarities = [
            (idx, cosine_similarity_fallback(target_vector, vec))
            for idx, vec in enumerate(vectors[1:])
        ]
        top_scores = sorted(similarities, key=lambda x: x[1], reverse=True)[:n]

    elapsed_time = time.time() - start_time
    return top_scores, elapsed_time


# Sizes for benchmarking
sizes_to_test = [1000, 2000, 5000, 10000]

# Benchmarking the function
benchmark_results = {'numpy': [], 'python': []}
for size in sizes_to_test:
    sample_vectors = [[float(j) / (i + 1) for j in range(1, 11)] for i in range(size)]

    # With NumPy
    _, elapsed_time_numpy = top_n(sample_vectors, 10)
    benchmark_results['numpy'].append(elapsed_time_numpy)

    # Without NumPy
    _, elapsed_time_python = top_n(sample_vectors, 10, skip_numpy=True)
    benchmark_results['python'].append(elapsed_time_python)

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(sizes_to_test, benchmark_results['numpy'], label='NumPy', marker='o')
plt.plot(sizes_to_test, benchmark_results['python'], label='Python', marker='x')
plt.xlabel('Number of Vectors')
plt.ylabel('Time (seconds)')
plt.title('Benchmarking top_n Function')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
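
One related tweak that might be worth adding to the benchmark: np.argsort sorts every similarity even though only the top n are needed, whereas np.argpartition does a partial selection. A sketch (top_n_indices is a hypothetical helper, not part of the code above):

import numpy as np

def top_n_indices(similarities: np.ndarray, n: int) -> np.ndarray:
    # Partition so the n largest scores land in the last n positions (unordered)...
    candidates = np.argpartition(similarities, -n)[-n:]
    # ...then fully sort only those n candidates, descending by score
    return candidates[np.argsort(similarities[candidates])[::-1]]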

simonw commented Sep 6, 2023

I integrated that into LLM, but I'm not quite ready to land it yet, because I worry about the difference between loading all the vectors into memory vs. the current approach, which compares them one at a time as part of the SQL query:

llm/llm/embeddings.py

Lines 245 to 277 in b2fc0a1

def distance_score(other_encoded):
    other_vector = llm.decode(other_encoded)
    return llm.cosine_similarity(other_vector, vector)

self.db.register_function(distance_score, replace=True)
where_bits = ["collection_id = ?"]
where_args = [str(self.id)]
if skip_id:
    where_bits.append("id != ?")
    where_args.append(skip_id)
return [
    Entry(
        id=row["id"],
        score=row["score"],
        content=row["content"],
        metadata=json.loads(row["metadata"]) if row["metadata"] else None,
    )
    for row in self.db.query(
        """
        select id, content, metadata, distance_score(embedding) as score
        from embeddings
        where {where}
        order by score desc limit {number}
        """.format(
            where=" and ".join(where_bits),
            number=number,
        ),
        where_args,
    )
]
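
For anyone unfamiliar with that pattern: SQLite calls back into a registered Python function once per row, which keeps memory usage flat but also means one Python-level call per stored embedding. A standalone illustration using the standard library sqlite3 module (the table layout and JSON storage here are simplified stand-ins, not LLM's actual schema):

import json
import sqlite3

import llm  # for the same cosine_similarity used above

query_vector = [0.1, 0.2, 0.3]

def distance_score(encoded):
    # Invoked by SQLite for every row in the scan
    return llm.cosine_similarity(json.loads(encoded), query_vector)

db = sqlite3.connect(":memory:")
db.execute("create table embeddings (id text, embedding text)")
db.executemany(
    "insert into embeddings values (?, ?)",
    [("a", json.dumps([0.1, 0.2, 0.3])), ("b", json.dumps([0.3, 0.2, 0.1]))],
)
db.create_function("distance_score", 1, distance_score)
rows = db.execute(
    "select id, distance_score(embedding) as score "
    "from embeddings order by score desc limit 10"
).fetchall()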

simonw commented Sep 6, 2023

Leaving this here for the moment as a future research task. The big question is whether it's worth loading every embedding into memory in exchange for this performance increase.

Maybe I should have a cutoff, where it counts the members of the collection first and only loads them all into memory if there are fewer than N, where N is 10,000 or something?

Could there be a way of speeding up the SQL-based comparisons as well, maybe by using an aggregation function and loading vectors a batch at a time?
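
A rough sketch of the cutoff idea, assuming hypothetical helper methods similar_in_memory() and similar_via_sql() that stand in for the two code paths above:

IN_MEMORY_CUTOFF = 10_000  # placeholder threshold, would need benchmarking

def similar(self, vector, number=10):
    # Count the collection first, then pick a strategy
    count = self.db.execute(
        "select count(*) from embeddings where collection_id = ?", [str(self.id)]
    ).fetchone()[0]
    if count < IN_MEMORY_CUTOFF:
        # Small collection: load every embedding and use the NumPy fast path
        return self.similar_in_memory(vector, number)
    # Large collection: stick with the per-row SQL comparison
    return self.similar_via_sql(vector, number)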

simonw commented Sep 6, 2023

I do this:

llm/llm/__init__.py

Lines 26 to 29 in 85e9225

try:
    import numpy as np
except ImportError:
    np = None

llm/llm/__init__.py

Lines 264 to 278 in 85e9225

if np:
    # NumPy-based calculation
    all_vectors = np.array(vectors)
    dot_products = np.dot(all_vectors, target_vector)
    magnitude_a = np.linalg.norm(target_vector)
    magnitude_b = np.linalg.norm(all_vectors, axis=1)
    similarities = dot_products / (magnitude_a * magnitude_b)
    top_indices = np.argsort(similarities)[::-1][:n]
    top_scores = [(int(idx), float(similarities[idx])) for idx in top_indices]
else:
    # Fallback to Python-based calculation
    similarities = [
        (idx, cosine_similarity(target_vector, vec))
        for idx, vec in enumerate(vectors)
    ]

So that I can have numpy as an optional dependency - I'd rather not pull it in for LLM by default just for this single optimization.

simonw commented Sep 6, 2023

Maybe I don't implement this optimization in core, but instead make it available as one of the vector indexing plugins (the dumbest one of all, which simply maintains an in-memory numpy array for each collection) as part of this:
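
A sketch of what that simplest plugin could look like, assuming a hypothetical InMemoryIndex class rather than any actual LLM plugin hook:

import numpy as np

class InMemoryIndex:
    """Brute-force index: one normalized NumPy matrix per collection."""

    def __init__(self):
        self.ids = []
        self.matrix = None  # shape: (num_vectors, dimensions)

    def add(self, id, vector):
        row = np.asarray(vector, dtype=np.float32)
        row = row / np.linalg.norm(row)  # normalize once at insert time
        self.ids.append(id)
        self.matrix = row[None, :] if self.matrix is None else np.vstack([self.matrix, row])

    def search(self, vector, n=10):
        query = np.asarray(vector, dtype=np.float32)
        query = query / np.linalg.norm(query)
        # With pre-normalized rows, cosine similarity is just a dot product
        scores = self.matrix @ query
        top = np.argsort(scores)[::-1][:n]
        return [(self.ids[i], float(scores[i])) for i in top]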

simonw commented Sep 10, 2023

I got some good results from using PyTorch for this, running the F.cosine_similarity() function against a bunch of tensors.
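
Presumably something along these lines (a sketch with made-up shapes; 1536 is just a typical embedding size):

import torch
import torch.nn.functional as F

embeddings = torch.rand(10_000, 1536)  # hypothetical stored embeddings
query = torch.rand(1536)               # hypothetical query vector

# F.cosine_similarity broadcasts the single query row against every stored row
scores = F.cosine_similarity(embeddings, query.unsqueeze(0), dim=1)

# Top 10 matches by score
top_scores, top_indices = torch.topk(scores, k=10)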
