Pipe Your Data Naturally
A modern, intuitive data manipulation library for Python that makes your data workflows read like natural language. Built on pandas' robust foundation with a clean, pipe-based syntax inspired by R's dplyr and tidyverse.
from pipeframe import *
# Your data pipeline reads like a story
result = (df
>> filter('age > 21')
>> group_by('city')
>> summarize(avg_income='mean(income)', count='count()')
>> arrange('-avg_income')
)π‘ How to read
>>: Read the>>operator as "pipe to" or "then". For example, the code above reads as: "Take df, then filter for age > 21, then group by city, then summarize..."
# β Traditional pandas: Hard to read
df[df['age'] > 30].groupby('dept')['salary'].mean().sort_values(ascending=False)
# β
PipeFrame: Clear and intuitive
df >> filter('age > 30') >> group_by('dept') >> summarize(avg='mean(salary)') >> arrange('-avg')- π Pipe Operator
>>- Natural method chaining without nested parentheses - π String Expressions - Write conditions as readable strings:
'age > 30 & salary > 50000' - π Security Hardened - Built-in expression validation prevents code injection
- πΌ Pandas Compatible - Works seamlessly with existing pandas DataFrames
- π― Type Safe - Full type hints for excellent IDE support and autocomplete
- β‘ Performance - Only ~5-15% overhead vs raw pandas
- π Rich I/O - Read/write CSV, Excel, JSON, Parquet, SQL, and more
- π Powerful Reshaping - Tidyr-style pivoting, melting, and transformations
- π‘οΈ Production Ready - Comprehensive error handling and validation
# Basic installation
pip install pipeframe
# With all optional dependencies
pip install pipeframe[all]
# Specific features
pip install pipeframe[excel] # Excel support
pip install pipeframe[parquet] # Parquet files
pip install pipeframe[sql] # SQL databases
pip install pipeframe[plot] # Visualizationfrom pipeframe import *
# Create a DataFrame
df = DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 32, 37, 29],
'salary': [50000, 65000, 72000, 58000],
'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales']
})
# Transform with intuitive verbs
result = (df
>> filter('age > 30')
>> define(
bonus='salary * 0.1',
total='salary + bonus'
)
>> select('name', 'dept', 'total')
>> arrange('-total')
)
print(result)
# name dept total
# 0 Charlie Engineering 79200.0
# 1 Bob Marketing 71500.0Chain operations naturally without nested function calls:
# Traditional approach (hard to read)
result = arrange(
select(
define(
filter(df, 'age > 25'),
experience='2024 - start_year'
),
'name', 'experience', 'salary'
),
'-salary'
)
# PipeFrame approach (reads like a recipe)
result = (df
>> filter('age > 25')
>> define(experience='2024 - start_year')
>> select('name', 'experience', 'salary')
>> arrange('-salary')
)| Verb | Purpose | Example |
|---|---|---|
define() |
Create/modify columns | >> define(total='price * quantity') |
filter() |
Filter rows | >> filter('age > 30 & city == "NYC"') |
select() |
Choose columns | >> select('name', 'age', 'salary') |
arrange() |
Sort data | >> arrange('-salary', 'name') |
group_by() |
Group data | >> group_by('category', 'region') |
summarize() |
Aggregate | >> summarize(total='sum(sales)', avg='mean(price)') |
rename() |
Rename columns | >> rename(customer_id='cid') |
distinct() |
Unique rows | >> distinct('product', 'store') |
# if_else for binary conditions
df >> define(
status=if_else('salary > 60000', 'High', 'Standard'),
category=if_else('age >= 30', 'Senior', 'Junior')
)
# case_when for multiple conditions
df >> define(
grade=case_when(
('score >= 90', 'A'),
('score >= 80', 'B'),
('score >= 70', 'C'),
('score >= 60', 'D'),
default='F'
)
)# Summary by group
summary = (df
>> group_by('department', 'location')
>> summarize(
headcount='count()',
avg_salary='mean(salary)',
total_sales='sum(sales)',
top_performer='max(performance_score)'
)
>> arrange('-avg_salary')
)
# Multiple aggregations
analysis = (df
>> group_by('product_category')
>> summarize(
units_sold='sum(quantity)',
revenue='sum(price * quantity)',
avg_price='mean(price)',
num_transactions='count()'
)
>> define(
avg_transaction_value='revenue / num_transactions'
)
)# Pivot wider (long to wide)
wide = (df
>> pivot_wider(
id_cols='student',
names_from='subject',
values_from='grade'
)
)
# Pivot longer (wide to long)
long = (df
>> pivot_longer(
cols=['Q1_sales', 'Q2_sales', 'Q3_sales', 'Q4_sales'],
names_to='quarter',
values_to='sales'
)
)
# Separate columns
separated = df >> separate('full_name', into=['first', 'last'], sep=' ')
# Unite columns
united = df >> unite('full_date', ['year', 'month', 'day'], sep='-')# Select by pattern
df >> select(
'id',
starts_with('date_'), # All columns starting with 'date_'
ends_with('_amount'), # All columns ending with '_amount'
contains('price'), # All columns containing 'price'
matches(r'Q\d_sales') # Regex pattern matching
)
# Column ranges
df >> select('id', 'name:salary') # Select from 'name' to 'salary'# Read from various sources
df = read_csv('data.csv')
df = read_excel('data.xlsx', sheet_name='Sales')
df = read_json('data.json', orient='records')
df = read_parquet('data.parquet')
df = read_sql('SELECT * FROM users', connection)
df = read_clipboard() # Paste from spreadsheet!
# Write to different formats
df.to_csv('output.csv', index=False)
df.to_excel('report.xlsx', sheet_name='Results')
df.to_parquet('data.parquet', compression='gzip')
df.to_json('data.json', orient='records', lines=True)from pipeframe import *
# Load and analyze sales data
analysis = (
read_csv('sales_data.csv')
>> filter('date >= "2024-01-01" & revenue > 0')
>> define(
profit='revenue - cost',
margin='profit / revenue * 100',
quarter='pd.to_datetime(date).dt.quarter'
)
>> group_by('product_category', 'quarter')
>> summarize(
total_revenue='sum(revenue)',
total_profit='sum(profit)',
avg_margin='mean(margin)',
num_sales='count()'
)
>> define(
profit_per_sale='total_profit / num_sales'
)
>> arrange('-total_revenue')
)
# Export results
analysis.to_excel('quarterly_analysis.xlsx', sheet_name='Summary')# Segment customers by behavior
segments = (df
>> filter('total_purchases > 0')
>> define(
avg_order_value='total_spent / total_purchases',
recency_days='(pd.Timestamp.now() - last_purchase_date).dt.days',
segment=case_when(
('avg_order_value > 100 & recency_days < 30', 'Premium Active'),
('avg_order_value > 100 & recency_days >= 30', 'Premium At Risk'),
('recency_days < 30', 'Standard Active'),
('recency_days < 90', 'At Risk'),
default='Churned'
)
)
>> group_by('segment')
>> summarize(
customers='count()',
total_value='sum(total_spent)',
avg_value='mean(total_spent)'
)
)# Clean and standardize data
clean_data = (
read_excel('messy_data.xlsx')
>> filter('id.notna()') # Remove rows without ID
>> define(
# Standardize text fields
name='name.str.title().str.strip()',
email='email.str.lower().str.strip()',
# Parse dates
signup_date='pd.to_datetime(signup_date)',
# Fill missing values
phone='phone.fillna("Not Provided")',
# Create derived fields
account_age_days='(pd.Timestamp.now() - signup_date).dt.days'
)
>> distinct('email', keep='first') # Deduplicate by email
>> arrange('signup_date')
)PipeFrame includes built-in security features to prevent code injection:
# β
Safe expressions are allowed
df >> define(total='price * quantity')
df >> filter('age > 30 & city == "NYC"')
# β Dangerous expressions are blocked
df >> define(bad="__import__('os').system('rm -rf /')")
# PipeFrameExpressionError: Expression contains dangerous pattern
# All string expressions are validated before execution
# - Blocks: __import__, exec(), eval(), compile(), open(), file()
# - Validates expression syntax
# - Uses pandas' restricted eval environmentPipeFrame adds minimal overhead while dramatically improving code readability:
Benchmarks (1M rows):
- Filter operation: ~8% overhead
- GroupBy aggregation: ~12% overhead
- Complex pipeline (5 operations): ~10% overhead
Why the overhead is worth it:
- π§ Reduced cognitive load
- π Fewer bugs from clearer intent
- β‘ Faster development time
- π₯ Easier code review
- π Better maintainability
- Tutorial Notebook - Complete walkthrough
- API Reference - Detailed documentation
- Examples - Real-world use cases
- Contributing Guide - How to contribute
We welcome contributions! Here's how you can help:
- π Report bugs - Open an issue
- π‘ Suggest features - Share your ideas
- π Improve docs - Help others learn
- π§ Submit PRs - Fix bugs or add features
See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE file for details.
Dr. Yasser Mustafa
AI & Data Science Specialist | Theoretical Physics PhD
- π PhD in Theoretical Nuclear Physics
- πΌ 10+ years in production AI/ML systems
- π¬ 48+ research publications
- π’ Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
- π Based in Newcastle Upon Tyne, UK
- βοΈ yasser.mustafan@gmail.com
- π LinkedIn | GitHub
PipeFrame was born from years of working with data pipelines in production environments, combining the elegance of R's tidyverse with Python's practicality.
If PipeFrame helps your work, please consider giving it a star! β
- β Core verbs and operators
- β Security hardening
- β Comprehensive I/O
- β Reshape operations
- β Type hints
- Join operations (left_join, inner_join, etc.)
- Window functions
- Time series helpers
- Enhanced plotting integration
- Performance optimizations
- Lazy evaluation engine
- Alternative backends (Polars, DuckDB)
- Distributed computing support
- Interactive data exploration tools
- SQL generation from pipes
- Issues: Report bugs or request features
- Discussions: Ask questions, share use cases
Built with β€οΈ for data scientists who value readability
Make your data speak naturally with PipeFrame π