# Feature Engineering: Binning (Discretization)

**The Power of Strategic Bucketing**

Converting continuous noisy data into robust categorical features.

In [None]:
import pandas as pd
import numpy as np

## The Problem: Continuous Data Creates Noise

Continuous variables like **exact prices** create problems:
- **Overfitting**: The model can latch onto specific values ($19.99, $21.50, $19.95) instead of general patterns
- **Noise**: Minor differences ($19.99 vs $20.00) are treated as distinct values even when they carry no real signal
- **Instability**: Small price changes can shift predictions in ways that are hard to anticipate

In [None]:
df = pd.read_csv('udemy_courses.csv')

# Show the continuous price problem
print("BEFORE BINNING - Continuous Prices (Noisy):")
print(df[['course_title', 'price']].head(10))
print(f"\nUnique price values: {df['price'].nunique()}")
print(f"Price range: ${df['price'].min()} to ${df['price'].max()}")

## The Solution: Binning (Discretization)

**Technique**: Group continuous values into strategic categories (bins)

**Strategy**: Price → [Free], [Budget <$50], [Premium ≥$50]

This collapses every distinct price in the dataset into 3 meaningful categories.

In [None]:
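# Define bin edges and labels
# Intervals with right=False: [0, 0.01) = Free, [0.01, 50) = Budget, [50, inf) = Premium
bin_edges = [0, 0.01, 50, float('inf')]
bin_labels = ['Free', 'Budget', 'Premium']

# Apply binning using pd.cut(); right=False closes each bin on the left,
# so a $50 course lands in Premium, matching the "Premium ≥ $50" rule
df['price_bin'] = pd.cut(df['price'],
                         bins=bin_edges,
                         labels=bin_labels,
                         right=False)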

print("AFTER BINNING - Categorical Buckets (Clean):")
print(df[['course_title', 'price', 'price_bin']].head(10))
print(f"\nReduced to {df['price_bin'].nunique()} categories!")
print("\nDistribution:")
print(df['price_bin'].value_counts().sort_index())
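
Edge handling is the easiest part of binning to get wrong, so it can help to check it on a few hand-picked values before trusting the result on the full dataset. The sketch below is illustrative only: `toy_prices` is made-up data, and it assumes the Free / Budget / Premium rule stated above, with exactly $50 counted as Premium.

```python
import pandas as pd

# Toy prices chosen to sit on and around each bin edge (illustrative values only).
toy_prices = pd.Series([0, 0.01, 19.99, 49.99, 50, 200])

# right=False makes every bin closed on the left: [0, 0.01), [0.01, 50), [50, inf)
toy_bins = pd.cut(
    toy_prices,
    bins=[0, 0.01, 50, float('inf')],
    labels=['Free', 'Budget', 'Premium'],
    right=False,
)

print(pd.DataFrame({'price': toy_prices, 'price_bin': toy_bins}))
```

With `pd.cut`'s default `right=True`, the same edges would instead place a $50 course in Budget, so the choice of `right` should match whichever side of the boundary the strategy intends.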

## Comparison: Before vs After

Let's see how binning simplifies the data:

In [None]:
# Example: Three courses with similar prices
example_prices = [19.99, 21.50, 19.95]
example_courses = ['Python Basics', 'Java Fundamentals', 'Web Dev 101']

print("TRANSFORMATION EXAMPLE:")
print("="*60)
for course, price in zip(example_courses, example_prices):
    # Determine bin
    if price == 0:
        bin_name = 'Free'
    elif price < 50:
        bin_name = 'Budget'
    else:
        bin_name = 'Premium'
    
    print(f"Course: {course}")
    print(f"  Before: price = ${price} (unique continuous value)")
    print(f"  After:  price_bin = '{bin_name}' (shared category)")
    print()

print("="*60)
print("RESULT: All three courses now share the SAME feature value!")
print("        Model learns: 'Budget courses are popular'")
print("        Instead of: 'Courses priced $19.99 behave differently than $21.50'")
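
The loop above spells out the rule one course at a time. On a full DataFrame the same if/elif logic can be written in vectorized form, for example with `np.select`. This is a sketch under the same thresholds; the `example` DataFrame and its column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same three illustrative courses as in the loop example.
example = pd.DataFrame({
    'course': ['Python Basics', 'Java Fundamentals', 'Web Dev 101'],
    'price': [19.99, 21.50, 19.95],
})

# Conditions are checked in order and the first match wins, like if/elif.
conditions = [example['price'] == 0, example['price'] < 50]
choices = ['Free', 'Budget']

# Anything matching neither condition falls through to the default.
example['price_bin'] = np.select(conditions, choices, default='Premium')

print(example)
```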

## The Benefits of Binning

✅ **Macro-Trends Over Micro-Differences**
   - Model identifies "Free courses go viral" instead of getting lost in exact price variations
   
✅ **Reduced Overfitting**
   - Fewer unique values = less chance to memorize noise
   
✅ **Robust Features**
   - Small price changes ($19→$21) usually stay within the same bin, so predictions stay stable; only a change that crosses a bin edge (such as $49→$51) switches the category
   
✅ **Interpretability**
   - "Premium courses have fewer subscribers" is easier to understand than complex price coefficients
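
The robustness claim can be checked with a small experiment: raise two prices by the same $2 and see which one changes bin. This is a minimal sketch; `price_to_bin` is a hypothetical helper written for this illustration, and it assumes the Free / Budget / Premium thresholds used throughout.

```python
import pandas as pd

def price_to_bin(prices):
    """Hypothetical helper: map prices to Free / Budget / Premium (Premium >= $50)."""
    return pd.cut(
        pd.Series(prices, dtype='float64'),
        bins=[0, 0.01, 50, float('inf')],
        labels=['Free', 'Budget', 'Premium'],
        right=False,
    ).astype(str)

before = price_to_bin([19.0, 49.0])
after = price_to_bin([21.0, 51.0])  # each price raised by $2

print(pd.DataFrame({
    'bin_before': before,
    'bin_after': after,
    'changed': before != after,
}))
```

The $19 course stays in Budget, while the $49 course moves to Premium. Binning absorbs variation inside a bin but concentrates sensitivity at the edges, which is one reason to place edges where a real behavioral difference is expected.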

In [None]:
# Convert bins to numerical encoding for ML models
# (cat.codes follows the category order; rows with a missing bin get -1)
df['price_bin_encoded'] = df['price_bin'].cat.codes

print("Final Step: Convert to Numerical for ML Models")
print(df[['course_title', 'price', 'price_bin', 'price_bin_encoded']].head(10))
print("\nEncoding Map:")
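# Derive the map from the categorical itself so it cannot drift from the real codes
for code, label in enumerate(df['price_bin'].cat.categories):
    print(f"{label} → {code}")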
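
Two details of this encoding step are easy to miss: `cat.codes` assigns -1 to rows whose bin is missing, and integer codes imply an order (Free < Budget < Premium). When that order is not wanted, one-hot encoding is a common alternative. The sketch below uses a made-up `toy` DataFrame to show both behaviors; it is an illustration, not part of the original pipeline.

```python
import pandas as pd

# Toy bins, including one missing value (illustrative data only).
toy = pd.DataFrame({
    'price_bin': pd.Categorical(
        ['Free', 'Budget', 'Premium', None],
        categories=['Free', 'Budget', 'Premium'],
        ordered=True,
    )
})

# Ordinal codes follow category order; the missing value becomes -1.
toy['price_bin_encoded'] = toy['price_bin'].cat.codes

# One-hot alternative: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy['price_bin'], prefix='price')

print(toy)
print(one_hot)
```

A -1 code is worth checking for before training, since a model would otherwise read it as a fourth category sitting below Free.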