Skip to content

A modern, intuitive data manipulation library for Python that makes your data workflows read like natural language. Built on pandas' robust foundation with a clean, pipe-based syntax inspired by R's dplyr and tidyverse. Built with ❀️ for data scientists who value readability. Make your data speak naturally with PipeFrame πŸ”„.

License

Notifications You must be signed in to change notification settings

Yasser03/pipeframe

Repository files navigation

PipeFrame πŸ”„

Pipe Your Data Naturally

PyPI version PyPI downloads Python 3.8+ License: MIT Code style: black

A modern, intuitive data manipulation library for Python that makes your data workflows read like natural language. Built on pandas' robust foundation with a clean, pipe-based syntax inspired by R's dplyr and tidyverse.

from pipeframe import *

# Your data pipeline reads like a story
result = (df
    >> filter('age > 21')
    >> group_by('city')  
    >> summarize(avg_income='mean(income)', count='count()')
    >> arrange('-avg_income')
)

πŸ’‘ How to read >>: Read the >> operator as "pipe to" or "then". For example, the code above reads as: "Take df, then filter for age > 21, then group by city, then summarize..."


🌟 Why PipeFrame?

Readability First

# ❌ Traditional pandas: Hard to read
df[df['age'] > 30].groupby('dept')['salary'].mean().sort_values(ascending=False)

# βœ… PipeFrame: Clear and intuitive
df >> filter('age > 30') >> group_by('dept') >> summarize(avg='mean(salary)') >> arrange('-avg')

Key Features

  • πŸ”— Pipe Operator >> - Natural method chaining without nested parentheses
  • πŸ“ String Expressions - Write conditions as readable strings: 'age > 30 & salary > 50000'
  • πŸ”’ Security Hardened - Built-in expression validation prevents code injection
  • 🐼 Pandas Compatible - Works seamlessly with existing pandas DataFrames
  • 🎯 Type Safe - Full type hints for excellent IDE support and autocomplete
  • ⚑ Performance - Only ~5-15% overhead vs raw pandas
  • πŸ“Š Rich I/O - Read/write CSV, Excel, JSON, Parquet, SQL, and more
  • πŸ”„ Powerful Reshaping - Tidyr-style pivoting, melting, and transformations
  • πŸ›‘οΈ Production Ready - Comprehensive error handling and validation

πŸš€ Quick Start

Installation

# Basic installation
pip install pipeframe

# With all optional dependencies
pip install pipeframe[all]

# Specific features
pip install pipeframe[excel]      # Excel support
pip install pipeframe[parquet]    # Parquet files
pip install pipeframe[sql]        # SQL databases
pip install pipeframe[plot]       # Visualization

Hello PipeFrame!

from pipeframe import *

# Create a DataFrame
df = DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 32, 37, 29],
    'salary': [50000, 65000, 72000, 58000],
    'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales']
})

# Transform with intuitive verbs
result = (df
    >> filter('age > 30')
    >> define(
        bonus='salary * 0.1',
        total='salary + bonus'
    )
    >> select('name', 'dept', 'total')
    >> arrange('-total')
)

print(result)
#       name          dept    total
# 0  Charlie  Engineering  79200.0
# 1      Bob     Marketing  71500.0

πŸ“š Core Concepts

The Pipe Operator >>

Chain operations naturally without nested function calls:

# Traditional approach (hard to read)
result = arrange(
    select(
        define(
            filter(df, 'age > 25'),
            experience='2024 - start_year'
        ),
        'name', 'experience', 'salary'
    ),
    '-salary'
)

# PipeFrame approach (reads like a recipe)
result = (df
    >> filter('age > 25')
    >> define(experience='2024 - start_year')
    >> select('name', 'experience', 'salary')
    >> arrange('-salary')
)

Core Verbs

Verb Purpose Example
define() Create/modify columns >> define(total='price * quantity')
filter() Filter rows >> filter('age > 30 & city == "NYC"')
select() Choose columns >> select('name', 'age', 'salary')
arrange() Sort data >> arrange('-salary', 'name')
group_by() Group data >> group_by('category', 'region')
summarize() Aggregate >> summarize(total='sum(sales)', avg='mean(price)')
rename() Rename columns >> rename(customer_id='cid')
distinct() Unique rows >> distinct('product', 'store')

πŸ”₯ Advanced Features

Conditional Logic

# if_else for binary conditions
df >> define(
    status=if_else('salary > 60000', 'High', 'Standard'),
    category=if_else('age >= 30', 'Senior', 'Junior')
)

# case_when for multiple conditions
df >> define(
    grade=case_when(
        ('score >= 90', 'A'),
        ('score >= 80', 'B'),
        ('score >= 70', 'C'),
        ('score >= 60', 'D'),
        default='F'
    )
)

GroupBy Operations

# Summary by group
summary = (df
    >> group_by('department', 'location')
    >> summarize(
        headcount='count()',
        avg_salary='mean(salary)',
        total_sales='sum(sales)',
        top_performer='max(performance_score)'
    )
    >> arrange('-avg_salary')
)

# Multiple aggregations
analysis = (df
    >> group_by('product_category')
    >> summarize(
        units_sold='sum(quantity)',
        revenue='sum(price * quantity)',
        avg_price='mean(price)',
        num_transactions='count()'
    )
    >> define(
        avg_transaction_value='revenue / num_transactions'
    )
)

Data Reshaping

# Pivot wider (long to wide)
wide = (df
    >> pivot_wider(
        id_cols='student',
        names_from='subject',
        values_from='grade'
    )
)

# Pivot longer (wide to long)
long = (df
    >> pivot_longer(
        cols=['Q1_sales', 'Q2_sales', 'Q3_sales', 'Q4_sales'],
        names_to='quarter',
        values_to='sales'
    )
)

# Separate columns
separated = df >> separate('full_name', into=['first', 'last'], sep=' ')

# Unite columns
united = df >> unite('full_date', ['year', 'month', 'day'], sep='-')

Column Selection Helpers

# Select by pattern
df >> select(
    'id',
    starts_with('date_'),      # All columns starting with 'date_'
    ends_with('_amount'),      # All columns ending with '_amount'
    contains('price'),         # All columns containing 'price'
    matches(r'Q\d_sales')      # Regex pattern matching
)

# Column ranges
df >> select('id', 'name:salary')  # Select from 'name' to 'salary'

I/O Operations

# Read from various sources
df = read_csv('data.csv')
df = read_excel('data.xlsx', sheet_name='Sales')
df = read_json('data.json', orient='records')
df = read_parquet('data.parquet')
df = read_sql('SELECT * FROM users', connection)
df = read_clipboard()  # Paste from spreadsheet!

# Write to different formats
df.to_csv('output.csv', index=False)
df.to_excel('report.xlsx', sheet_name='Results')
df.to_parquet('data.parquet', compression='gzip')
df.to_json('data.json', orient='records', lines=True)

🎯 Real-World Examples

Sales Analysis Pipeline

from pipeframe import *

# Load and analyze sales data
analysis = (
    read_csv('sales_data.csv')
    >> filter('date >= "2024-01-01" & revenue > 0')
    >> define(
        profit='revenue - cost',
        margin='profit / revenue * 100',
        quarter='pd.to_datetime(date).dt.quarter'
    )
    >> group_by('product_category', 'quarter')
    >> summarize(
        total_revenue='sum(revenue)',
        total_profit='sum(profit)',
        avg_margin='mean(margin)',
        num_sales='count()'
    )
    >> define(
        profit_per_sale='total_profit / num_sales'
    )
    >> arrange('-total_revenue')
)

# Export results
analysis.to_excel('quarterly_analysis.xlsx', sheet_name='Summary')

Customer Segmentation

# Segment customers by behavior
segments = (df
    >> filter('total_purchases > 0')
    >> define(
        avg_order_value='total_spent / total_purchases',
        recency_days='(pd.Timestamp.now() - last_purchase_date).dt.days',
        segment=case_when(
            ('avg_order_value > 100 & recency_days < 30', 'Premium Active'),
            ('avg_order_value > 100 & recency_days >= 30', 'Premium At Risk'),
            ('recency_days < 30', 'Standard Active'),
            ('recency_days < 90', 'At Risk'),
            default='Churned'
        )
    )
    >> group_by('segment')
    >> summarize(
        customers='count()',
        total_value='sum(total_spent)',
        avg_value='mean(total_spent)'
    )
)

Data Cleaning Pipeline

# Clean and standardize data
clean_data = (
    read_excel('messy_data.xlsx')
    >> filter('id.notna()')  # Remove rows without ID
    >> define(
        # Standardize text fields
        name='name.str.title().str.strip()',
        email='email.str.lower().str.strip()',
        # Parse dates
        signup_date='pd.to_datetime(signup_date)',
        # Fill missing values
        phone='phone.fillna("Not Provided")',
        # Create derived fields
        account_age_days='(pd.Timestamp.now() - signup_date).dt.days'
    )
    >> distinct('email', keep='first')  # Deduplicate by email
    >> arrange('signup_date')
)

πŸ”’ Security Features

PipeFrame includes built-in security features to prevent code injection:

# βœ… Safe expressions are allowed
df >> define(total='price * quantity')
df >> filter('age > 30 & city == "NYC"')

# ❌ Dangerous expressions are blocked
df >> define(bad="__import__('os').system('rm -rf /')")
# PipeFrameExpressionError: Expression contains dangerous pattern

# All string expressions are validated before execution
# - Blocks: __import__, exec(), eval(), compile(), open(), file()
# - Validates expression syntax
# - Uses pandas' restricted eval environment

πŸ“Š Performance

PipeFrame adds minimal overhead while dramatically improving code readability:

Benchmarks (1M rows):

  • Filter operation: ~8% overhead
  • GroupBy aggregation: ~12% overhead
  • Complex pipeline (5 operations): ~10% overhead

Why the overhead is worth it:

  • 🧠 Reduced cognitive load
  • πŸ› Fewer bugs from clearer intent
  • ⚑ Faster development time
  • πŸ‘₯ Easier code review
  • πŸ“š Better maintainability

πŸŽ“ Learning Resources


🀝 Contributing

We welcome contributions! Here's how you can help:

  1. πŸ› Report bugs - Open an issue
  2. πŸ’‘ Suggest features - Share your ideas
  3. πŸ“ Improve docs - Help others learn
  4. πŸ”§ Submit PRs - Fix bugs or add features

See CONTRIBUTING.md for guidelines.


πŸ“œ License

MIT License - see LICENSE file for details.


πŸ‘¨β€πŸ’» Author

Dr. Yasser Mustafa

AI & Data Science Specialist | Theoretical Physics PhD

  • πŸŽ“ PhD in Theoretical Nuclear Physics
  • πŸ’Ό 10+ years in production AI/ML systems
  • πŸ”¬ 48+ research publications
  • 🏒 Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
  • πŸ“ Based in Newcastle Upon Tyne, UK
  • βœ‰οΈ yasser.mustafan@gmail.com
  • πŸ”— LinkedIn | GitHub

PipeFrame was born from years of working with data pipelines in production environments, combining the elegance of R's tidyverse with Python's practicality.


🌟 Star History

If PipeFrame helps your work, please consider giving it a star! ⭐


πŸ“ˆ Roadmap

Current (v0.2.0)

  • βœ… Core verbs and operators
  • βœ… Security hardening
  • βœ… Comprehensive I/O
  • βœ… Reshape operations
  • βœ… Type hints

Upcoming (v0.3.0)

  • Join operations (left_join, inner_join, etc.)
  • Window functions
  • Time series helpers
  • Enhanced plotting integration
  • Performance optimizations

Future (v1.0.0)

  • Lazy evaluation engine
  • Alternative backends (Polars, DuckDB)
  • Distributed computing support
  • Interactive data exploration tools
  • SQL generation from pipes

πŸ’¬ Community

  • Issues: Report bugs or request features
  • Discussions: Ask questions, share use cases

Built with ❀️ for data scientists who value readability

Make your data speak naturally with PipeFrame πŸ”„

About

A modern, intuitive data manipulation library for Python that makes your data workflows read like natural language. Built on pandas' robust foundation with a clean, pipe-based syntax inspired by R's dplyr and tidyverse. Built with ❀️ for data scientists who value readability. Make your data speak naturally with PipeFrame πŸ”„.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Languages