Description
Talk title
Writing High Performance LLM Applications in Python: Insights from building Modulo
Short talk description
Learn to build high-performance LLM apps in Python!
Today, it is easier than ever to put up a working AI SaaS app.
However, improving the usability of the application is often a challenge, especially when the workflows involved are non-trivial and process a lot of data.
Suddenly, we are back in the world of careful software engineering.
In this talk we will cover various considerations and techniques that can be used to improve the performance and token efficiency of your LLM application.
Discover how to optimize for time-to-first-token and overall latency. Learn about async patterns for faster responses, caching strategies, memory techniques to reduce token usage, and a myriad of other techniques used to give us real-time semantics while working with LLMs.
Long talk description
In the era of AI-driven development, coding agents powered by large language models (LLMs) promise to automate complex software tasks, from generating code to debugging and deployment. However, raw LLM power often falls short due to latency (e.g., 10 seconds to several minutes per call), cost ($0.01-$0.10 per 1K tokens), and reliability issues.
This talk dives into practical strategies for building high-performance LLM applications in Python, drawn from real-world experience developing a coding agent that fixes bugs.
We'll explore key optimizations:
- Set output token limits and reason about throughput: total throughput = (tokens/sec per call) × (number of parallel calls); for example, four parallel calls at 50 tokens/sec give roughly 200 tokens/sec in aggregate
- Use async/await for all I/O-bound LLM operations
- Use a microservices architecture for cleaner state-machine management (we will also briefly cover RabbitMQ)
- Batch requests when possible
- Compress context and manage memory actively
- Implement semantic caching
- Cache MCP query/response pairs with Redis
We will go through code examples for all the optimizations above. We will also explain how to do basic code profiling in Python (a minimal sketch is included at the end of this description).
If time permits, we will also cover some pitfalls to avoid when building applications that use LLMs.
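To give a flavour of the profiling part, here is a minimal sketch using the standard-library cProfile module; build_prompt is a hypothetical stand-in for whichever CPU-bound step of your agent you want to measure.

```python
# Minimal profiling sketch with the standard library (illustrative only).
import cProfile
import io
import pstats

def build_prompt(files: list[str]) -> str:
    # Hypothetical stand-in for a CPU-bound step such as prompt assembly.
    return "\n".join(f"# {name}" for name in files)

profiler = cProfile.Profile()
profiler.enable()
build_prompt([f"file_{i}.py" for i in range(100_000)])
profiler.disable()

# Print the five most expensive call sites by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```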
What format do you have in mind?
Talk (20-25 minutes + Q&A)
Talk outline / Agenda
1. Introduction: The Performance Challenge (5 minutes)
- The rise of coding agents (Cursor, Devin, Aider, CodeRabbit)
- Why performance matters: UX, cost, scalability
- The Modulo AI journey
- Common LLM application bottlenecks
2. Asynchronous Programming & Concurrency (8 minutes)
- Python asyncio for LLM API calls
- Parallel execution for file operations and network calls
- Batching for throughput
- Microservices architecture
- Code example: Multi-file code analysis
3. Latency Optimizations (7 minutes)
- Optimizing LLM calls for throughput
- Time-to-First-Token optimization
- Real-world latency improvements
- Model selection trade-offs
4. Caching Strategies (6 minutes)
- Semantic caching
- MCP server API caching with Redis
- When to cache and when not to
- Code example: Implementing LLM caching
5. Context & Memory Management (6 minutes)
- Memory compression
- Context engineering for coding agents
- Progressive summarization strategies
6. Real-World Lessons from Production Coding Agents (3 minutes)
- Devin: 4x speedup with fine-tuning
- Aider: Multi-file editing approach
- CodeRabbit: 50% accuracy improvement
- Performance vs accuracy trade-offs
Q&A (5 minutes)
- Open floor for questions
Key takeaways
Building production-ready LLM applications requires more than just calling an API.
This talk shares practical insights from building Modulo AI, a coding agent for bug management and debugging, exploring how to write high-performance LLM applications in Python.
Key Takeaways:
- Asynchronous Programming & Concurrency in Python
- Achieve faster response times with asyncio patterns
- Implement parallel execution for multi-agent workflows
- Batching techniques
- Microservices architectures with RabbitMQ
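To make the concurrency pattern concrete, here is a minimal sketch assuming a hypothetical call_llm coroutine in place of a real SDK call (you would swap in your provider's async client); the semaphore caps in-flight requests so that parallelism, rather than per-call speed, drives aggregate throughput.

```python
# Minimal bounded-concurrency sketch; call_llm is a hypothetical stand-in
# for an async SDK call (e.g. an async chat-completion request).
import asyncio

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(1.0)                     # simulated network + generation latency
    return f"response to: {prompt[:30]}"

async def analyze_files(prompts: list[str], max_parallel: int = 5) -> list[str]:
    semaphore = asyncio.Semaphore(max_parallel)  # cap in-flight requests (rate limits)

    async def bounded_call(prompt: str) -> str:
        async with semaphore:
            return await call_llm(prompt)

    # All calls are scheduled together; at most max_parallel run at once,
    # so wall time is roughly ceil(n / max_parallel) * per-call latency.
    return await asyncio.gather(*(bounded_call(p) for p in prompts))

if __name__ == "__main__":
    results = asyncio.run(analyze_files([f"analyze file_{i}.py" for i in range(10)]))
    print(len(results), "responses")
```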
- Latency Optimization
- Optimize LLM calls for throughput
- Reduce Time-to-First-Token (TTFT) for better UX
- Implement streaming responses with async generators
- Optimize perceived vs actual latency
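Here is a minimal streaming sketch; stream_llm is a hypothetical async generator standing in for an SDK's streaming mode, and the point is simply that the user starts seeing tokens long before the full answer is done.

```python
# Minimal streaming sketch; stream_llm is a hypothetical async generator
# standing in for a real streaming API (most SDKs expose a stream option).
import asyncio
from typing import AsyncIterator

async def stream_llm(prompt: str) -> AsyncIterator[str]:
    for token in ["Fixing", " the", " bug", " in", " utils.py", "..."]:
        await asyncio.sleep(0.1)             # simulated per-token generation delay
        yield token

async def answer(prompt: str) -> str:
    chunks: list[str] = []
    async for token in stream_llm(prompt):
        print(token, end="", flush=True)     # low TTFT: show tokens as they arrive
        chunks.append(token)
    print()
    return "".join(chunks)                   # full response still available afterwards

if __name__ == "__main__":
    asyncio.run(answer("explain the failing test"))
```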
- Caching Strategies
- Semantic caching
- Choose between Redis and in-memory solutions
- Implement effective cache invalidation, and understand the pitfalls of optimizing for performance
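As a sketch of the caching idea, here is an exact-match cache built on redis-py; for true semantic caching, the prompt hash below would be replaced by an embedding-similarity lookup (the role GPTCache plays), and call_llm is a hypothetical stand-in for a real client call.

```python
# Minimal exact-match response cache with redis-py (assumes a local Redis).
import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def call_llm(prompt: str) -> str:
    return f"response to: {prompt}"          # hypothetical stand-in for a real API call

def cached_llm(prompt: str, ttl_seconds: int = 3600) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)["response"]   # cache hit: no tokens spent, no latency
    response = call_llm(prompt)
    cache.set(key, json.dumps({"response": response}), ex=ttl_seconds)  # TTL as simple invalidation
    return response
```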
- Context & Memory Management
- Reduce token usage with memory compression
- Navigate context window limitations effectively
- Apply progressive summarization strategies
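A minimal sketch of progressive summarization: older turns are collapsed into a running summary so the prompt stays within a budget. Here, summarize is a hypothetical call to a small, cheap model, and the character budget stands in for a real token count.

```python
# Minimal progressive-summarization sketch; summarize is a hypothetical
# call to a cheap model, and characters stand in for tokens.
def summarize(text: str) -> str:
    return text[:200] + "..."                # stand-in for an LLM summarization call

def compress_history(messages: list[str], keep_recent: int = 4,
                     budget_chars: int = 2000) -> list[str]:
    if sum(len(m) for m in messages) <= budget_chars:
        return messages                      # within budget: keep everything verbatim
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(old))      # collapse older turns into one message
    return [f"Summary of earlier conversation: {summary}"] + recent
```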
- Common Pitfalls
- Implement retry logic with exponential backoff
- Handle rate limits gracefully
- Design fallback mechanisms
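A minimal retry-with-exponential-backoff sketch; RateLimitError and call_llm are hypothetical stand-ins for your SDK's throttling exception and client call.

```python
# Minimal retry sketch with exponential backoff and jitter.
import random
import time

class RateLimitError(Exception):
    pass

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: fails roughly half the time to exercise retries.
    if random.random() < 0.5:
        raise RateLimitError("429 Too Many Requests")
    return f"response to: {prompt}"

def call_with_retries(prompt: str, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        try:
            return call_llm(prompt)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise                         # out of retries: surface the error
            delay = (2 ** attempt) + random.uniform(0, 1)  # ~1s, 2s, 4s, 8s + jitter
            time.sleep(delay)

if __name__ == "__main__":
    print(call_with_retries("fix the flaky test"))
```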
Technologies Covered:
- asyncio, aiohttp
- Redis, GPTCache
- RabbitMQ
- LangChain
- OpenAI, Anthropic APIs
- MCP (Model Context Protocol) servers
- FastAPI for async endpoints
- Mem0 for memory management
What domain would you say your talk falls under?
Data Science and Machine Learning
Duration (including Q&A)
30 minutes
Prerequisites and preparation
Required Knowledge:
- Intermediate Python programming
- Basic knowledge of LLMs
Setup
- Python
- Code examples will be shared via GitHub
Resources and references
[GitHub repository with all code examples - to be created]
Link to slides/demos (if available)
Here are the slides for the talk!
Writing-High-Performance-AI-Agents-in-Python-Insights-from-building-Modulo.pdf
Twitter/X handle (optional)
LinkedIn profile (optional)
https://www.linkedin.com/in/kirtivr/
Profile picture URL (optional)
https://i1.rgstatic.net/ii/profile.image/763888240975874-1559136557455_Q128/Kirti-Rathore.jpg
Speaker bio
Hello, this is Kirti Vardhan Rathore, or KV for short.
I love digging into how computer systems work, their lifecycle, and improving them to better serve our customers.
Over the last 10 years as a Computer Scientist, I have enjoyed working on many different problems:
Networked Storage Systems (Google)
Computer Security (VMware)
Machine Learning (VMware)
Firmware Development and Distributed Systems (VMware)
Software Defined Networking (Arista Networks)
Computer Graphics (SAP Labs)
As a Tech Lead, I have a track record of:
Collaborating across different teams and stakeholders.
Distributing tasks to optimize both the growth of team members and delivery efficiency.
Providing crucial inputs in software design and development.
Deploying, testing (CI), and monitoring systems at scale.
In my free time I like working out, reading fantasy fiction and talking with friends.
Availability
08/11/2025
Accessibility & special requirements
No
Speaker checklist
- I have read and understood the PyDelhi guidelines for submitting proposals and giving talks
- I will make my talk accessible to all attendees and will proactively ask for any accommodations or special requirements I might need
- I agree to share slides, code snippets, and other materials used during the talk with the community
- I will follow PyDelhi's Code of Conduct and maintain a welcoming, inclusive environment throughout my participation
- I understand that PyDelhi meetups are community-centric events focused on learning, knowledge sharing, and networking, and I will respect this ethos by not using this platform for self-promotion or hiring pitches during my presentation, unless explicitly invited to do so by means of a sponsorship or similar arrangement
- If the talk is recorded by the PyDelhi team, I grant permission to release the video on PyDelhi's YouTube channel under the CC-BY-4.0 license, or a different license of my choosing if I am specifying it in my proposal or with the materials I share
Additional comments
No response