# 🐼 Python for Data Science: Advanced Pandas Techniques

**Instructor:** Siva R Jasthi | Metropolitan State University  
**Course:** Python for Data Science  
**Level:** Middle School

---

## 📚 What You'll Learn Today

In this notebook, you'll master these powerful pandas techniques:

1. **`pd.cut()`** - Sorting numbers into bins (like grades: A, B, C)
2. **`pd.qcut()`** - Creating equal-sized groups (like dividing a class into quarters)
3. **`apply()`** - Applying custom functions to data
4. **`transform()`** - Transforming data while keeping the shape
5. **`filter()`** - Filtering groups based on conditions
6. **`melt()`** - Reshaping wide data to long format
7. **`stack()`** - Stacking columns into rows
8. **`crosstab()`** - Creating frequency tables

### 🎯 Why These Matter

These tools help you:
- 📊 Organize and categorize data
- 🔄 Reshape data for analysis
- 📈 Create summaries and reports
- 🎮 Analyze real-world datasets

Let's dive in! 🚀

---

# 1️⃣ pd.cut() - Sorting into Bins

## 🤔 What is `pd.cut()`?

**Think of it like sorting laundry:**  
Your clothes need different wash temperatures, so you sort them into bins:
- 🥶 Cold (0-30°)
- 🌡️ Warm (31-60°)
- 🔥 Hot (61-90°)

`pd.cut()` does the same with numbers! It puts each number into a bin based on the range it falls in. You can list the bin edges yourself, or give a number of bins and let pandas make them **equal-width**.

## 📖 Real-World Examples

- **Grades:** Convert test scores (0-100) to letter grades (A, B, C, D, F)
- **Age Groups:** Sort people into age ranges (child, teen, adult, senior)
- **Price Ranges:** Categorize products as cheap, medium, or expensive
- **Game Scores:** Classify players as beginner, intermediate, or expert

## 🔍 How It Works

```
pd.cut(data, bins=[0, 10, 20, 30], labels=['Low', 'Medium', 'High'])

Numbers → Labels:
2  → Low     (in range 0-10)
15 → Medium  (in range 10-20)
25 → High    (in range 20-30)
35 → NaN     (above 30, so outside every bin)
```
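
Here is a small runnable sketch of the call above. The `values` Series is made up for illustration. Notice that 35 is past the last edge (30), so pandas gives it no label and shows `NaN` instead.

```python
import pandas as pd

# Hypothetical sample values, chosen to hit each bin plus one value outside
values = pd.Series([2, 15, 25, 35])

# Four edges define three bins: (0, 10], (10, 20], (20, 30]
binned = pd.cut(values, bins=[0, 10, 20, 30], labels=['Low', 'Medium', 'High'])

# Show each value next to the label it received
print(pd.DataFrame({'value': values, 'label': binned}))
```

If you need a label for 35 as well, add another edge (for example 40) and a fourth label.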

---

In [None]:
# 📚 EXAMPLE 1: Converting Test Scores to Letter Grades
import pandas as pd
import numpy as np

print("="*60)
print("EXAMPLE 1: Test Scores → Letter Grades")
print("="*60)

# Create sample data: test scores for 10 students
test_scores = [95, 87, 76, 92, 65, 88, 58, 73, 91, 82]
students = ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan',
            'Fiona', 'George', 'Hannah', 'Ian', 'Julia']

df = pd.DataFrame({
    'Student': students,
    'Score': test_scores
})

print("\n📊 Original Data:")
display(df)

# Define grade bins
# Each bin includes its right edge (the pandas default), so for whole-number scores:
# F: 0-60, D: 61-70, C: 71-80, B: 81-90, A: 91-100
bins = [0, 60, 70, 80, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']

# Apply pd.cut to assign letter grades
df['Letter_Grade'] = pd.cut(df['Score'],
                             bins=bins,
                             labels=labels,
                             include_lowest=True)  # Include 0 in the first bin

print("\n✅ After Adding Letter Grades:")
display(df)

# Count how many of each grade
print("\n📈 Grade Distribution:")
grade_counts = df['Letter_Grade'].value_counts().sort_index()
print(grade_counts)

# Visualize
print("\n📊 Visual Grade Distribution:")
for grade in ['A', 'B', 'C', 'D', 'F']:
    count = grade_counts.get(grade, 0)
    bar = '█' * count
    print(f"{grade}: {bar} ({count})")

### 💡 Understanding the Code

**Step 1:** Define your bins (boundaries)
```python
bins = [0, 60, 70, 80, 90, 100]
```
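With the default `right=True`, each bin includes its right edge: (0, 60], (60, 70], (70, 80], (80, 90], (90, 100]. A score of exactly 60 is therefore an F, and `include_lowest=True` is what lets a score of 0 into the first bin.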

**Step 2:** Define your labels
```python
labels = ['F', 'D', 'C', 'B', 'A']
```
The number of labels must be one less than the number of bin edges (5 labels for 6 edges).

**Step 3:** Apply the cut
```python
df['Letter_Grade'] = pd.cut(df['Score'], bins=bins, labels=labels)
```

**Important Parameters:**
- `bins`: List of bin edges
- `labels`: Names for each bin
- `include_lowest=True`: Include the lowest edge (0 in this case)
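
With the default settings, a score that sits exactly on an edge goes into the lower bin, so 60 counts as an F and 90 as a B. If you want the usual school cutoffs instead (60 is a D, 90 is an A), one option is the `right=False` parameter. The sketch below uses made-up scores, and the last edge of 101 is an assumption chosen so that a perfect 100 still lands in the A bin.

```python
import pandas as pd

# Hypothetical scores, including values that sit exactly on a cutoff
scores = pd.Series([59, 60, 79.5, 90, 100])

# right=False makes each bin include its LEFT edge: [0, 60), [60, 70), ...
# The final edge is 101 (not 100) so that a score of 100 is inside the last bin
grades = pd.cut(scores,
                bins=[0, 60, 70, 80, 90, 101],
                labels=['F', 'D', 'C', 'B', 'A'],
                right=False)

print(pd.DataFrame({'Score': scores, 'Grade': grades}))
```

Either convention is fine as long as the bin edges and the comments describing them agree.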

---

In [None]:
# 📚 EXAMPLE 2: Age Groups for a Youth Program
print("="*60)
print("EXAMPLE 2: Sorting People by Age Groups")
print("="*60)

# Sample data: ages of program participants
ages = [5, 12, 8, 15, 18, 22, 7, 14, 10, 19, 25, 6, 13, 17, 21]
names = ['Emma', 'Liam', 'Olivia', 'Noah', 'Ava', 'Ethan', 'Sophia',
         'Mason', 'Isabella', 'Lucas', 'Mia', 'Logan', 'Charlotte',
         'Jacob', 'Amelia']

df_ages = pd.DataFrame({
    'Name': names,
    'Age': ages
})

print("\n📊 Original Data:")
display(df_ages.head(10))

# Define age groups
bins = [0, 10, 13, 18, 25]
labels = ['Kids (5-10)', 'Tweens (11-13)', 'Teens (14-18)', 'Young Adults (19-25)']

df_ages['Age_Group'] = pd.cut(df_ages['Age'],
                              bins=bins,
                              labels=labels,
                              include_lowest=True)

print("\n✅ After Categorizing by Age Group:")
display(df_ages)

# Group summary
print("\n📈 Participants per Age Group:")
for group in labels:
    count = (df_ages['Age_Group'] == group).sum()
    print(f"  {group}: {count} participants")

In [None]:
# 🎨 VISUALIZATION: Before and After pd.cut()
print("="*60)
print("VISUAL COMPARISON: How pd.cut() Works")
print("="*60)

# Simple example
data = [39, 36, 38, 37, 23, 37, 2, 5, 8, 10, 15, 18, 20, 25, 30, 35]
df = pd.DataFrame({'values': data})

print("\n📊 BEFORE - Just Numbers:")
print(sorted(data))

# Apply cut
df['category'] = pd.cut(df['values'],
                        bins=[0, 10, 20, 30, 40],
                        labels=['Low', 'Medium', 'High', 'Very High'])

print("\n📊 AFTER - Categorized:")
print("\nNumber Line with Categories:")
print("0────10────20────30────40")
print(" Low  Medium High VeryHigh")

# Show which numbers fall into which category
for cat in ['Low', 'Medium', 'High', 'Very High']:
    values_in_cat = df[df['category'] == cat]['values'].tolist()
    print(f"\n{cat:12}: {values_in_cat}")

# Frequency count
print("\n📈 Frequency Count:")
print(df['category'].value_counts().sort_index())

### ⚠️ Common Mistakes to Avoid

1. **Wrong number of labels**
   ```python
   # ❌ WRONG: 5 bin edges, 5 labels (should be 4 labels)
   bins = [0, 10, 20, 30, 40]
   labels = ['A', 'B', 'C', 'D', 'E']  # One too many!
   
   # ✅ CORRECT: 5 bin edges, 4 labels
   bins = [0, 10, 20, 30, 40]
   labels = ['A', 'B', 'C', 'D']  # Perfect!
   ```

2. **Forgetting `include_lowest=True`**
   ```python
   # Without this, a value equal to the lowest edge (0 here) becomes NaN!
   pd.cut(data, bins=[0, 10, 20], labels=['A', 'B'], include_lowest=True)
   ```

3. **Values outside bins**
   ```python
   # If you have value 45 but bins only go to 40, it will be NaN
   # Make sure your bins cover all your data!
   ```
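
The sketch below tries out mistakes 1 and 3 on a tiny made-up Series. Using `np.inf` as the last edge is just one possible way to catch values above the top bin; widening the last edge would work too.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 45 is larger than the last bin edge used below
data = pd.Series([5, 25, 45])

# Mistake 3: 45 is past the last edge (40), so it gets no category
capped = pd.cut(data, bins=[0, 10, 20, 30, 40], labels=['A', 'B', 'C', 'D'])
print(capped.isna().sum(), "value(s) fell outside the bins")

# One possible fix: make the last bin open-ended with np.inf
open_ended = pd.cut(data, bins=[0, 10, 20, 30, np.inf], labels=['A', 'B', 'C', 'D'])
print(open_ended.tolist())

# Mistake 1: too many labels makes pandas raise a ValueError
try:
    pd.cut(data, bins=[0, 10, 20, 30, 40], labels=['A', 'B', 'C', 'D', 'E'])
except ValueError as err:
    print("ValueError:", err)
```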

---

# 2️⃣ pd.qcut() - Creating Equal-Sized Groups

## 🤔 What is `pd.qcut()`?

**Think of it like dividing pizza fairly:**  
You have a pizza and want to cut it into 4 equal pieces. Each piece should have the same amount of pizza!

`pd.qcut()` does the same with data - it creates **equal-sized groups** (same number of items in each group).

## 📊 pd.cut() vs pd.qcut()

```
NUMBERS: 2, 5, 8, 10, 15, 18, 20, 25, 30, 35, 40, 45

pd.cut() → EQUAL WIDTHS:
  Group 1 (0-15):   5 numbers █████
  Group 2 (15-30):  4 numbers ████
  Group 3 (30-45):  3 numbers ███
  (Groups have DIFFERENT counts but SAME width)

pd.qcut() → EQUAL COUNTS:
  Group 1: 4 numbers ████
  Group 2: 4 numbers ████
  Group 3: 4 numbers ████
  (Groups have SAME counts but DIFFERENT widths)
```
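
You can check the comparison yourself with the same twelve numbers. This is a minimal sketch: the explicit edges `[0, 15, 30, 45]` and `q=3` are choices made here to mirror the three groups in the picture, and the code prints how many values land in each group.

```python
import pandas as pd

numbers = pd.Series([2, 5, 8, 10, 15, 18, 20, 25, 30, 35, 40, 45])

# Equal-width bins with explicit edges: (0, 15], (15, 30], (30, 45]
by_width = pd.cut(numbers, bins=[0, 15, 30, 45])
print(by_width.value_counts().sort_index())

# Equal-count groups: q=3 asks for three groups with the same number of values
by_count = pd.qcut(numbers, q=3)
print(by_count.value_counts().sort_index())
```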

## 📖 Real-World Examples

- **Class Rankings:** Divide class into top 25%, next 25%, etc.
- **Income Brackets:** Split population into equal quartiles
- **Performance Tiers:** Classify athletes into equal-sized performance groups
- **Game Leaderboards:** Top 10%, next 10%, etc.

---

In [None]:
# 📚 EXAMPLE 3: Ranking Students by Test Performance
print("="*60)
print("EXAMPLE 3: Student Rankings Using Quartiles")
print("="*60)

# Sample data: test scores
np.random.seed(42)
scores = np.random.randint(60, 100, size=20)
students = [f"Student_{i+1}" for i in range(20)]

df_scores = pd.DataFrame({
    'Student': students,
    'Score': scores
})

# Sort by score for better visualization
df_scores = df_scores.sort_values('Score', ascending=False).reset_index(drop=True)

print("\n📊 Student Scores (Sorted):")
display(df_scores.head(10))

# Use qcut to divide into quartiles (4 equal groups)
df_scores['Quartile'] = pd.qcut(df_scores['Score'],
                                 q=4,
                                 labels=['Q4 (Bottom)', 'Q3', 'Q2', 'Q1 (Top)'])

print("\n✅ After Adding Quartile Rankings:")
display(df_scores)

# Count students in each quartile
print("\n📈 Students per Quartile:")
quartile_counts = df_scores['Quartile'].value_counts()
print(quartile_counts)

print("\n💡 Notice: With 20 students, each quartile has about 5 (20 ÷ 4 = 5); tied scores can make the groups slightly uneven.")

# Show score ranges for each quartile
print("\n📊 Score Ranges per Quartile:")
for quartile in ['Q1 (Top)', 'Q2', 'Q3', 'Q4 (Bottom)']:
    scores_in_q = df_scores[df_scores['Quartile'] == quartile]['Score']
    print(f"  {quartile:15}: {scores_in_q.min()}-{scores_in_q.max()}")

In [None]:
# 📚 EXAMPLE 4: Video Game Leaderboard
print("="*60)
print("EXAMPLE 4: Game Leaderboard Tiers")
print("="*60)

# Sample data: player scores in a video game
np.random.seed(123)
game_scores = np.random.randint(100, 10000, size=16)
players = [f"Player_{i+1}" for i in range(16)]

df_game = pd.DataFrame({
    'Player': players,
    'Points': game_scores
})

df_game = df_game.sort_values('Points', ascending=False).reset_index(drop=True)

print("\n🎮 Player Scores:")
display(df_game)

# Divide into tiers: Bronze, Silver, Gold, Platinum
df_game['Tier'] = pd.qcut(df_game['Points'],
                          q=4,
                          labels=['🥉 Bronze', '🥈 Silver', '🥇 Gold', '💎 Platinum'])

print("\n✅ After Adding Tiers:")
display(df_game)

# Tier summary
print("\n🏆 Tier Distribution:")
print(df_game['Tier'].value_counts())

print("\n💡 Each tier has 4 players (16 ÷ 4), as long as no scores are tied!")

In [None]:
# 🎨 COMPARISON: pd.cut() vs pd.qcut()
print("="*60)
print("VISUAL COMPARISON: cut() vs qcut()")
print("="*60)

# Same data for both
data = [2, 5, 8, 10, 15, 18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
df_compare = pd.DataFrame({'values': data})

print(f"\n📊 Original Data ({len(data)} values): {data}")

# Using pd.cut() - equal width bins
df_compare['cut'] = pd.cut(df_compare['values'],
                           bins=4,
                           labels=['Group1', 'Group2', 'Group3', 'Group4'])

# Using pd.qcut() - equal counts
df_compare['qcut'] = pd.qcut(df_compare['values'],
                             q=4,
                             labels=['Quartile1', 'Quartile2', 'Quartile3', 'Quartile4'])

print("\n📊 Comparison:")
display(df_compare)

print("\n📈 pd.cut() - Equal Width Bins:")
cut_counts = df_compare['cut'].value_counts().sort_index()
for group, count in cut_counts.items():
    bar = '█' * count
    print(f"  {group}: {bar} ({count} values)")

print("\n📈 pd.qcut() - Equal Counts:")
qcut_counts = df_compare['qcut'].value_counts().sort_index()
for quartile, count in qcut_counts.items():
    bar = '█' * count
    print(f"  {quartile}: {bar} ({count} values)")

print("\n💡 Key Difference:")
print("  • pd.cut():  Groups may have DIFFERENT counts")
print("  • pd.qcut(): Groups always have SAME counts (or as close as possible)")

### 🎯 When to Use Which?

| Situation | Use This | Why |
|-----------|----------|-----|
| **Letter grades** (A, B, C, D, F) | `pd.cut()` | Fixed score ranges |
| **Age groups** (child, teen, adult) | `pd.cut()` | Fixed age ranges |
| **Class rankings** (top 25%, next 25%) | `pd.qcut()` | Equal group sizes |
| **Price categories** ($, $$, $$$) | `pd.cut()` | Fixed price ranges |
| **Performance tiers** (based on percentiles) | `pd.qcut()` | Equal distribution |

**Rule of Thumb:**
- Use `pd.cut()` when you care about **specific ranges** (like grade cutoffs)
- Use `pd.qcut()` when you want **equal-sized groups** (like top 10%)
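
One thing to watch for with `pd.qcut()`: when many values are tied, two group boundaries can land on the same number, and pandas raises an error. The sketch below shows that with made-up scores, followed by one common workaround (ranking first). Note that ranking with `method='first'` breaks ties by order of appearance, which is an arbitrary choice.

```python
import pandas as pd

# Hypothetical scores with many ties at 70
scores = pd.Series([70, 70, 70, 70, 70, 80, 90, 100])

# Tied scores make some quantile edges identical, which qcut rejects
try:
    pd.qcut(scores, q=4)
except ValueError as err:
    print("ValueError:", err)

# Workaround: rank first so every value gets a unique position, then cut the ranks
ranks = scores.rank(method='first')
tiers = pd.qcut(ranks, q=4, labels=['Q4 (Bottom)', 'Q3', 'Q2', 'Q1 (Top)'])
print(tiers.value_counts().sort_index())
```

Another option is `pd.qcut(scores, q=4, duplicates='drop')`, which merges the repeated edges and returns fewer groups.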

---

# 3️⃣ apply() - Custom Functions on Data

## 🤔 What is `apply()`?

**Think of it like a factory assembly line:**  
Each item goes through the same process. `apply()` takes each row (or column) and applies the same function to it.

```
    Data → [Function] → Modified Data
     10  →  [× 2]    →      20
     15  →  [× 2]    →      30
     20  →  [× 2]    →      40
```
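
The picture above can be written in a few lines. This is a small sketch; the function name `double` is made up for the example.

```python
import pandas as pd

data = pd.Series([10, 15, 20])

def double(value):
    """Return the value multiplied by 2."""
    return value * 2

# Series.apply calls the function once for every value
doubled = data.apply(double)
print(doubled.tolist())  # [20, 30, 40]
```

For simple math like this, `data * 2` gives the same result and is faster. `apply()` is most useful when the logic needs `if` statements or several steps.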

## 📖 Real-World Examples

- **Grade Calculator:** Apply grading formula to all students
- **Tax Calculator:** Calculate tax for all purchases
- **Unit Converter:** Convert all temperatures from °F to °C
- **Bonus Calculator:** Calculate bonuses based on performance

---

In [None]:
# 📚 EXAMPLE 5: Calculating Final Grades with Weighted Scores
print("="*60)
print("EXAMPLE 5: Grade Calculator with apply()")
print("="*60)

# Sample data: student grades
df_grades = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan'],
    'Homework': [85, 92, 78, 95, 88],
    'Midterm': [88, 85, 90, 92, 86],
    'Final': [90, 88, 85, 94, 91]
})

print("\n📊 Student Grades:")
display(df_grades)

# Define a function to calculate weighted average
# Homework: 30%, Midterm: 30%, Final: 40%
def calculate_final_grade(row):
    homework_weight = 0.30
    midterm_weight = 0.30
    final_weight = 0.40

    final_grade = (row['Homework'] * homework_weight +
                   row['Midterm'] * midterm_weight +
                   row['Final'] * final_weight)
    return round(final_grade, 2)

# Apply the function to each row
df_grades['Final_Grade'] = df_grades.apply(calculate_final_grade, axis=1)

print("\n✅ After Calculating Final Grades:")
print("Formula: (Homework × 30%) + (Midterm × 30%) + (Final × 40%)")
display(df_grades)

# Add letter grades using cut
df_grades['Letter'] = pd.cut(df_grades['Final_Grade'],
                             bins=[0, 60, 70, 80, 90, 100],
                             labels=['F', 'D', 'C', 'B', 'A'],
                             include_lowest=True)

print("\n🎓 Final Report:")
display(df_grades[['Name', 'Final_Grade', 'Letter']])

In [None]:
# 📚 EXAMPLE 6: Temperature Converter
print("="*60)
print("EXAMPLE 6: Temperature Converter (°F → °C)")
print("="*60)

# Sample data: temperatures in different cities
cities_data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Temp_F': [75, 82, 68, 88, 95]
}

df_temps = pd.DataFrame(cities_data)

print("\n🌡️ Temperatures in Fahrenheit:")
display(df_temps)

# Define conversion function
def fahrenheit_to_celsius(temp_f):
    """Convert Fahrenheit to Celsius"""
    temp_c = (temp_f - 32) * 5/9
    return round(temp_c, 1)

# Apply to the Temp_F column
df_temps['Temp_C'] = df_temps['Temp_F'].apply(fahrenheit_to_celsius)

print("\n✅ After Conversion:")
print("Formula: (°F - 32) × 5/9 = °C")
display(df_temps)

# Add weather description based on Celsius
def weather_description(temp_c):
    if temp_c < 10:
        return "🥶 Cold"
    elif temp_c < 20:
        return "😊 Cool"
    elif temp_c < 30:
        return "☀️ Warm"
    else:
        return "🔥 Hot"

df_temps['Weather'] = df_temps['Temp_C'].apply(weather_description)

print("\n🌤️ With Weather Description:")
display(df_temps)

In [None]:
# 📚 EXAMPLE 7: Salary Bonus Calculator
print("="*60)
print("EXAMPLE 7: Performance Bonus Calculator")
print("="*60)

# Employee data
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan', 'Fiona'],
    'Salary': [50000, 60000, 45000, 70000, 55000, 65000],
    'Performance_Score': [92, 78, 88, 95, 82, 90]
})

print("\n💼 Employee Data:")
display(employees)

# Bonus calculation rules:
# Score >= 90: 10% bonus
# Score >= 80: 5% bonus
# Score < 80: 2% bonus

def calculate_bonus(row):
    """Calculate bonus based on performance score"""
    score = row['Performance_Score']
    salary = row['Salary']

    if score >= 90:
        bonus_rate = 0.10
    elif score >= 80:
        bonus_rate = 0.05
    else:
        bonus_rate = 0.02

    bonus = salary * bonus_rate
    return round(bonus, 2)

# Apply bonus calculation
employees['Bonus'] = employees.apply(calculate_bonus, axis=1)
employees['Total_Compensation'] = employees['Salary'] + employees['Bonus']

print("\n✅ After Calculating Bonuses:")
print("Bonus Rules:")
print("  • Score ≥ 90: 10% bonus")
print("  • Score ≥ 80: 5% bonus")
print("  • Score < 80: 2% bonus")
display(employees)

print(f"\n💰 Total Bonuses Paid: ${employees['Bonus'].sum():,.2f}")

### 💡 Understanding axis Parameter

```python
df.apply(function, axis=0)  # Apply to each COLUMN (top to bottom)
df.apply(function, axis=1)  # Apply to each ROW (left to right)
```

**Visual Example:**
```
DataFrame:
        A    B    C
    0  10   20   30
    1  40   50   60

axis=0 (columns):
    ↓    ↓    ↓
   [10]  [20]  [30]
   [40]  [50]  [60]

axis=1 (rows):
   [10, 20, 30] →
   [40, 50, 60] →
```
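
To see the two directions in action, the sketch below rebuilds the small table from the picture and adds up the numbers both ways. The name `df_axis` is made up for this example.

```python
import pandas as pd

df_axis = pd.DataFrame({'A': [10, 40], 'B': [20, 50], 'C': [30, 60]})

# axis=0: the function receives each COLUMN, so we get one result per column
column_totals = df_axis.apply(lambda col: col.sum(), axis=0)
print(column_totals)   # A: 50, B: 70, C: 90

# axis=1: the function receives each ROW, so we get one result per row
row_totals = df_axis.apply(lambda row: row.sum(), axis=1)
print(row_totals)      # row 0: 60, row 1: 150
```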

---

# 4️⃣ transform() - Transform While Keeping Shape

## 🤔 What is `transform()`?

**Think of it like photo filters:**  
You apply a filter to a photo, but the photo stays the same size and shape - just the colors change!

`transform()` modifies values but **returns the same shape** as the input.

## 📊 apply() vs transform()

```
apply():     Can return different shape
transform(): ALWAYS returns same shape as input

Input:  [10, 20, 30, 40, 50] (5 values)

apply(sum):     → 150         (1 value)
transform(+10): → [20, 30, 40, 50, 60] (5 values)
```
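
Here is a small sketch of the same idea on a single column. It uses `agg('sum')` for the step that collapses the values into one number, and a lambda for the "+10" step; both are illustrative choices.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# Aggregating collapses five values into one
total = values.agg('sum')
print(total)             # 150

# transform gives back one output for every input
shifted = values.transform(lambda x: x + 10)
print(shifted.tolist())  # [20, 30, 40, 50, 60]
```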

## 📖 Common Uses

- **Normalization:** Scale values to 0-1 range
- **Standardization:** Convert to z-scores
- **Percentage of total:** Show each value as % of total
- **Group statistics:** Add group averages or totals to each row
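
The uses listed above can be sketched with `groupby(...).transform(...)`. The data is made up, and the column names (`Class_Avg`, `Z_Score`, `Scaled`) are just labels chosen for this example.

```python
import pandas as pd

# Hypothetical scores for two classes
scores = pd.DataFrame({
    'Class': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Score': [70, 80, 90, 60, 80, 100],
})

by_class = scores.groupby('Class')['Score']

# Group average repeated on every row of that group
class_avg = by_class.transform('mean')

# Standardization: z-score within each class
z_score = by_class.transform(lambda s: (s - s.mean()) / s.std())

# Normalization: squeeze each class into the 0-1 range
scaled = by_class.transform(lambda s: (s - s.min()) / (s.max() - s.min()))

result = scores.assign(Class_Avg=class_avg, Z_Score=z_score, Scaled=scaled)
print(result)
```

Every new column has six values, one for each original row, which is exactly what "keeping the shape" means.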

---

In [None]:
# 📚 EXAMPLE 8: Normalizing Test Scores by Class
print("="*60)
print("EXAMPLE 8: Score Normalization with transform()")
print("="*60)

# Student scores from different classes
students_multi = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan', 'Fiona', 'George', 'Hannah'],
    'Class': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Score': [85, 90, 78, 95, 70, 88, 92, 75]
})

print("\n📊 Original Scores:")
display(students_multi)

# Calculate each student's score as percentage of class average
def normalize_to_average(group):
    """Express each score as percentage of group average"""
    avg = group.mean()
    return (group / avg) * 100

# Group by class and transform
students_multi['Normalized_Score'] = students_multi.groupby('Class')['Score'].transform(normalize_to_average)
students_multi['Normalized_Score'] = students_multi['Normalized_Score'].round(1)

print("\n✅ After Normalization (% of class average):")
display(students_multi)

# Show class averages
print("\n📈 Class Statistics:")
class_stats = students_multi.groupby('Class')['Score'].agg(['mean', 'min', 'max'])
display(class_stats)

print("\n💡 Notice:")
print("  • Normalized_Score > 100: Above class average")
print("  • Normalized_Score = 100: At class average")
print("  • Normalized_Score < 100: Below class average")

In [None]:
# 📚 EXAMPLE 9: Adding Group Totals
print("="*60)
print("EXAMPLE 9: Sales with Group Totals")
print("="*60)

# Sales data by region
sales_data = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
    'Store': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [5000, 7000, 6000, 8000, 5500, 6500, 7500, 8500]
})

print("\n💰 Sales Data:")
display(sales_data)

# Add region total to each row
sales_data['Region_Total'] = sales_data.groupby('Region')['Sales'].transform('sum')

# Calculate percentage of region total
sales_data['Pct_of_Region'] = (sales_data['Sales'] / sales_data['Region_Total'] * 100).round(1)

print("\n✅ With Region Totals:")
display(sales_data)

print("\n📊 Summary by Region:")
for region in sales_data['Region'].unique():
    region_data = sales_data[sales_data['Region'] == region]
    total = region_data['Region_Total'].iloc[0]
    print(f"  {region}: ${total:,} total")

### 🔑 Key Differences Summary

| Method | Returns | Use When |
|--------|---------|----------|
| **apply()** | Any shape | Calculating new values that might change size |
| **transform()** | Same shape | Normalizing, scaling, or adding group statistics |
| **filter()** | Subset of rows | Removing groups based on conditions |

---

# 5️⃣ filter() - Filtering Groups

## 🤔 What is `filter()`?

**Think of it like a bouncer at a club:**  
Groups that meet the requirements get in, others don't!

`filter()` keeps or removes **entire groups** based on a condition.

## 📖 Real-World Examples

- **Keep only classes with > 5 students**
- **Remove teams with < 3 wins**
- **Show only regions with sales > $10,000**
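
As a quick sketch of the last idea, the code below keeps only the regions whose total sales are above $10,000. The sales numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical sales: North totals 13,000, South 7,000, East 12,000
sales = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'East'],
    'Sales': [6000, 7000, 3000, 4000, 12000],
})

# The function receives each group's rows and must return a single True or False
big_regions = sales.groupby('Region').filter(lambda g: g['Sales'].sum() > 10000)
print(big_regions)
```

Both South rows disappear together, because the decision is made once for the whole group and not row by row.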

---

In [None]:
# 📚 EXAMPLE 10: Filtering Classes by Size
print("="*60)
print("EXAMPLE 10: Keep Only Large Classes")
print("="*60)

# Student enrollment data
enrollment = pd.DataFrame({
    'Student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'Class': ['Math', 'Math', 'Science', 'Science', 'History',
              'Math', 'Science', 'Art', 'Art', 'History'],
    'Grade': [85, 90, 78, 92, 88, 95, 87, 82, 79, 91]
})

print("\n📚 All Students:")
display(enrollment)

print("\n📊 Class Sizes:")
class_sizes = enrollment['Class'].value_counts()
print(class_sizes)

# Keep only classes with at least 3 students
large_classes = enrollment.groupby('Class').filter(lambda x: len(x) >= 3)

print("\n✅ After Filtering (classes with ≥ 3 students):")
display(large_classes)

print("\n💡 Notice:")
print("  • Math: 3 students → KEPT")
print("  • Science: 3 students → KEPT")
print("  • History: 2 students → REMOVED")
print("  • Art: 2 students → REMOVED")

---

# 6️⃣ melt() - Reshaping Wide to Long

## 🤔 What is `melt()`?

**Think of it like unpacking a spreadsheet:**  
Instead of having many columns, you stack them into rows!

```
WIDE FORMAT (Before):
Name     Math  Science  English
Alice      85       88       89
Bob        90       92       85

LONG FORMAT (After melt):
Name     Subject    Score
Alice    Math         85
Alice    Science      88
Alice    English      89
Bob      Math         90
Bob      Science      92
Bob      English      85
```
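
The before-and-after picture can be reproduced with the sketch below. The variable names `wide` and `long` are made up. Since `melt()` lists one subject at a time, the result is sorted by name afterwards to match the layout shown above.

```python
import pandas as pd

wide = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Math': [85, 90],
    'Science': [88, 92],
    'English': [89, 85],
})

# Turn the three subject columns into (Subject, Score) pairs
long = wide.melt(id_vars='Name', var_name='Subject', value_name='Score')

# A stable sort groups each student's rows while keeping the subject order
long = long.sort_values('Name', kind='stable').reset_index(drop=True)
print(long)
```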

## 📖 Why Use melt()?

- **Database format:** Easier to query and filter
- **Plotting:** Many plotting libraries prefer long format
- **Analysis:** Easier to group and aggregate

---

In [None]:
# 📚 EXAMPLE 11: Student Grades from Wide to Long
print("="*60)
print("EXAMPLE 11: Melting Grade Data")
print("="*60)

# Wide format: Each subject is a column
grades_wide = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Math': [85, 90, 78, 92],
    'Science': [88, 92, 82, 95],
    'English': [89, 85, 88, 91],
    'History': [87, 88, 90, 94]
})

print("\n📊 WIDE FORMAT (Before):")
print("Each subject is a separate column")
display(grades_wide)

# Melt into long format
grades_long = pd.melt(
    grades_wide,
    id_vars=['Name'],           # Column to keep as-is
    var_name='Subject',         # Name for the new "variable" column
    value_name='Score'          # Name for the new "value" column
)

print("\n📊 LONG FORMAT (After melt):")
print("Each row is one student-subject combination")
display(grades_long)

print("\n💡 Comparison:")
print(f"  Wide format:  {len(grades_wide)} rows × {len(grades_wide.columns)} columns")
print(f"  Long format:  {len(grades_long)} rows × {len(grades_long.columns)} columns")

# Now easier to analyze by subject
print("\n📈 Average Score per Subject (easy with long format):")
subject_avg = grades_long.groupby('Subject')['Score'].mean().round(1)
print(subject_avg)

In [None]:
# 📚 EXAMPLE 12: Monthly Sales Data
print("="*60)
print("EXAMPLE 12: Monthly Sales Analysis")
print("="*60)

# Sales data in wide format
sales_wide = pd.DataFrame({
    'Product': ['Widget', 'Gadget', 'Gizmo'],
    'Jan': [100, 150, 120],
    'Feb': [120, 160, 110],
    'Mar': [110, 170, 130]
})

print("\n📊 WIDE FORMAT:")
display(sales_wide)

# Melt to long format
sales_long = pd.melt(
    sales_wide,
    id_vars=['Product'],
    var_name='Month',
    value_name='Sales'
)

print("\n📊 LONG FORMAT:")
display(sales_long)

# Easy analysis in long format
print("\n📈 Analysis (easier with long format):")
print("\nTotal Sales by Product:")
print(sales_long.groupby('Product')['Sales'].sum())

print("\nTotal Sales by Month:")
print(sales_long.groupby('Month')['Sales'].sum())

print("\nBest performing product-month:")
best = sales_long.loc[sales_long['Sales'].idxmax()]
print(f"  {best['Product']} in {best['Month']}: {best['Sales']} units")

### 💡 melt() Parameters Explained

```python
pd.melt(
    df,
    id_vars=['Name'],      # Columns to keep (identifier columns)
    var_name='Subject',    # Name for the "melted" column names
    value_name='Score'     # Name for the values
)
```

**Think of it as:**
- `id_vars`: What stays the same?
- `var_name`: What are these columns called?
- `value_name`: What do these numbers mean?
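
`melt()` also has a `value_vars` parameter for choosing which columns to melt. If you leave it out, every column that is not in `id_vars` gets melted. The sketch below, with made-up grades, melts only two of the three subject columns.

```python
import pandas as pd

grades = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Math': [85, 90],
    'Science': [88, 92],
    'English': [89, 85],
})

# value_vars limits the melt to the listed columns; English is left out
stem_long = pd.melt(
    grades,
    id_vars=['Name'],
    value_vars=['Math', 'Science'],
    var_name='Subject',
    value_name='Score'
)
print(stem_long)
```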

---

# 7️⃣ stack() - Pivoting Columns to Rows

## 🤔 What is `stack()`?

**Think of it like stacking pancakes:**  
You take things that are side-by-side and stack them on top of each other!

`stack()` moves column labels into the row index, creating a **multi-level index**.

```
BEFORE (unstacked):
         Math  Science
Alice     85       88
Bob       90       92

AFTER (stacked):
Alice  Math       85
       Science    88
Bob    Math       90
       Science    92
```

## 📊 melt() vs stack()

Both reshape data, but:
- `melt()`: Creates regular columns, resets index
- `stack()`: Creates multi-level index (hierarchical)
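
A side-by-side sketch makes the difference easier to see. The small table here is made up; the point is the type and shape of each result.

```python
import pandas as pd

wide = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Math': [85, 90], 'Science': [88, 92]})

# melt: a DataFrame with ordinary columns and a fresh 0, 1, 2, ... index
melted = wide.melt(id_vars='Name', var_name='Subject', value_name='Score')

# stack: a Series whose index has two levels (Name, then subject)
stacked = wide.set_index('Name').stack()

print(type(melted).__name__, melted.shape)
print(type(stacked).__name__, stacked.shape)

# With the two-level index, a (Name, Subject) pair looks up one score
print(stacked.loc[('Bob', 'Science')])
```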

---

In [None]:
# 📚 EXAMPLE 13: Stacking Grade Data
print("="*60)
print("EXAMPLE 13: Understanding stack()")
print("="*60)

# Create sample data
df_stack = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 90, 78],
    'Science': [88, 92, 82],
    'English': [89, 85, 88]
})

print("\n📊 ORIGINAL (Wide):")
display(df_stack)

# Set Name as index first so that only the subject columns get stacked
df_indexed = df_stack.set_index('Name')

print("\n📊 After Setting Index:")
display(df_indexed)

# Stack the data
stacked = df_indexed.stack()

print("\n📊 STACKED (Multi-level Index):")
print(stacked)
print(f"\nType: {type(stacked)}")
print(f"Shape: {stacked.shape}")

# Convert back to DataFrame
stacked_df = stacked.reset_index()
stacked_df.columns = ['Name', 'Subject', 'Score']

print("\n📊 Converted to DataFrame:")
display(stacked_df)

print("\n💡 This looks like melt(), but the process was different!")

In [None]:
# 📚 EXAMPLE 14: Quarterly Performance Data
print("="*60)
print("EXAMPLE 14: Quarterly Sales with stack()")
print("="*60)

# Quarterly data
quarterly = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'],
    'Q1': [10000, 12000, 11000, 13000],
    'Q2': [11000, 13000, 12000, 14000],
    'Q3': [12000, 14000, 13000, 15000],
    'Q4': [13000, 15000, 14000, 16000]
})

print("\n📊 ORIGINAL:")
display(quarterly)

# Stack with Region as index
stacked_quarterly = quarterly.set_index('Region').stack()

print("\n📊 STACKED:")
print(stacked_quarterly)

# Access specific values with a (Region, Quarter) pair
print("\n🔍 Accessing Values:")
print(f"North, Q1: ${stacked_quarterly.loc[('North', 'Q1')]:,}")
print(f"South, Q4: ${stacked_quarterly.loc[('South', 'Q4')]:,}")

# Convert to regular DataFrame
final_df = stacked_quarterly.reset_index()
final_df.columns = ['Region', 'Quarter', 'Sales']

print("\n📊 Final DataFrame:")
display(final_df)

### 🔄 Unstack - The Opposite of Stack

```python
# Stack: columns → rows
stacked = df.stack()

# Unstack: rows → columns (reverse of stack)
unstacked = stacked.unstack()
```
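
A quick round trip shows the two methods undoing each other. This is a minimal sketch with a made-up table named `table`.

```python
import pandas as pd

table = pd.DataFrame({'Math': [85, 90], 'Science': [88, 92]},
                     index=['Alice', 'Bob'])

# Columns move into the row index...
stacked = table.stack()
print(stacked)

# ...and unstack moves the inner index level back out into columns
restored = stacked.unstack()
print(restored)
```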

---

# 8️⃣ crosstab() - Creating Frequency Tables

## 🤔 What is `crosstab()`?

**Think of it like a survey summary:**  
Count how many people chose each combination of options!

```
Survey: Gender × Favorite Sport

           Basketball  Soccer  Tennis
Female            3       5       2
Male              5       3       2

Reads as: 3 females like basketball, 5 like soccer, etc.
```
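
The survey table above can be rebuilt from raw answers. In this sketch the 20 survey rows are constructed to match those counts; in a real survey each row would be one person's answer.

```python
import pandas as pd

# 10 females (3 Basketball, 5 Soccer, 2 Tennis), then 10 males (5, 3, 2)
survey_demo = pd.DataFrame({
    'Gender': ['Female'] * 10 + ['Male'] * 10,
    'Sport': (['Basketball'] * 3 + ['Soccer'] * 5 + ['Tennis'] * 2 +
              ['Basketball'] * 5 + ['Soccer'] * 3 + ['Tennis'] * 2),
})

# Rows come from the first argument, columns from the second
table = pd.crosstab(survey_demo['Gender'], survey_demo['Sport'])
print(table)
```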

## 📖 Real-World Uses

- **Survey Analysis:** Gender × Preference
- **Sales Analysis:** Region × Product
- **Academic:** Grade × Subject
- **Demographics:** Age × Income bracket

---

In [None]:
# 📚 EXAMPLE 15: Student Preferences Survey
print("="*60)
print("EXAMPLE 15: Student Activity Preferences")
print("="*60)

# Survey data: Gender and activity preference
survey = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female',
               'Male', 'Female', 'Male', 'Female', 'Male', 'Female',
               'Female', 'Male', 'Female', 'Male'],
    'Activity': ['Sports', 'Music', 'Music', 'Sports', 'Art', 'Sports',
                'Music', 'Music', 'Sports', 'Art', 'Art', 'Sports',
                'Music', 'Music', 'Sports', 'Art']
})

print("\n📊 Raw Survey Data:")
display(survey.head(10))

# Create crosstab
ct = pd.crosstab(survey['Gender'], survey['Activity'])

print("\n📊 CROSSTAB: Gender × Activity")
display(ct)

print("\n📈 Reading the Table:")
print(f"  • {ct.loc['Female', 'Music']} females prefer Music")
print(f"  • {ct.loc['Male', 'Sports']} males prefer Sports")
print(f"  • {ct.loc['Female', 'Sports']} females prefer Sports")

# Add totals
ct_with_margins = pd.crosstab(survey['Gender'], survey['Activity'], margins=True)

print("\n📊 WITH TOTALS:")
display(ct_with_margins)

# Calculate percentages
ct_pct = pd.crosstab(survey['Gender'], survey['Activity'], normalize='all') * 100
print("\n📊 AS PERCENTAGES (% of total):")
display(ct_pct.round(1))

In [None]:
# 📚 EXAMPLE 16: Sales Analysis by Region and Product
print("="*60)
print("EXAMPLE 16: Product Sales by Region")
print("="*60)

# Sales transaction data
sales_transactions = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West', 'North', 'South',
               'East', 'West', 'North', 'South', 'East', 'West',
               'North', 'South', 'East', 'West', 'North', 'South'],
    'Product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Gizmo', 'Widget',
               'Widget', 'Gizmo', 'Gadget', 'Gadget', 'Gizmo', 'Widget',
               'Gizmo', 'Gizmo', 'Widget', 'Gadget', 'Widget', 'Gadget'],
    'Amount': [100, 150, 200, 180, 120, 160, 140, 190, 110, 170,
              130, 200, 140, 150, 160, 180, 120, 190]
})

print("\n💰 Sales Transactions:")
display(sales_transactions.head(10))

# Frequency crosstab (count of transactions)
ct_freq = pd.crosstab(sales_transactions['Region'],
                      sales_transactions['Product'])

print("\n📊 TRANSACTION COUNT: Region × Product")
display(ct_freq)

# Crosstab with values (sum of amounts)
ct_sales = pd.crosstab(
    sales_transactions['Region'],
    sales_transactions['Product'],
    values=sales_transactions['Amount'],
    aggfunc='sum'
)

print("\n💰 TOTAL SALES AMOUNT: Region × Product")
display(ct_sales)

# With margins (totals)
ct_sales_total = pd.crosstab(
    sales_transactions['Region'],
    sales_transactions['Product'],
    values=sales_transactions['Amount'],
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)

print("\n💰 WITH TOTALS:")
display(ct_sales_total)

# Find best performers
print("\n🏆 Analysis:")
print(f"  Best region: {ct_sales.sum(axis=1).idxmax()} (${ct_sales.sum(axis=1).max():,.0f})")
print(f"  Best product: {ct_sales.sum(axis=0).idxmax()} (${ct_sales.sum(axis=0).max():,.0f})")

In [None]:
# 📚 EXAMPLE 17: Multi-Level Crosstab
print("="*60)
print("EXAMPLE 17: Advanced Crosstab with Multiple Categories")
print("="*60)

# Student grade data
students = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'Class': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Grade': ['A', 'A', 'B', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
})

print("\n📚 Student Data:")
display(students)

# Multi-index crosstab
ct_multi = pd.crosstab(
    [students['Class'], students['Gender']],  # Row indices
    students['Grade']                         # Column index
)

print("\n📊 Multi-Level Crosstab: (Class, Gender) × Grade")
display(ct_multi)

print("\n💡 Reading Multi-Level Tables:")
print("  First level: Class (A or B)")
print("  Second level: Gender (F or M)")
print("  Columns: Grade received (A, B, or C)")

### 💡 crosstab() Parameters

```python
pd.crosstab(
    index,              # Row variable(s)
    columns,            # Column variable(s)
    values=None,        # Values to aggregate (optional)
    aggfunc=None,       # How to aggregate (sum, mean, etc.)
    margins=False,      # Add row/column totals?
    normalize=False     # Show as proportions (fractions) instead of counts?
)
```

**Common aggfunc values:**
- `'sum'` - Add up all values
- `'mean'` - Calculate average
- `'count'` - Count occurrences
- `'min'` / `'max'` - Find min/max
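
Two of these parameters are easier to understand with a small example. In the sketch below, the `orders` data is made up. `normalize='index'` turns each row into fractions that add up to 1, and `values` with `aggfunc='mean'` averages a number column for each Region and Product pair.

```python
import pandas as pd

# Hypothetical orders: South never sold a Gadget
orders = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Product': ['Widget', 'Gadget', 'Widget', 'Widget'],
    'Amount': [100, 300, 200, 400],
})

# Share of each product within a region (each row adds up to 1)
row_share = pd.crosstab(orders['Region'], orders['Product'], normalize='index')
print(row_share)

# Average Amount for every Region/Product pair
avg_amount = pd.crosstab(orders['Region'], orders['Product'],
                         values=orders['Amount'], aggfunc='mean')
print(avg_amount)
```

A pair with no rows at all (South and Gadget here) shows up as `NaN` in the averaged table, since there is nothing to average. Multiply `row_share` by 100 if you want percentages.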

---

# 🎯 Practice Exercises

## Exercise 1: Grade Categorization

**Task:** Create a dataset of 15 students with random test scores (60-100). Use `pd.cut()` to assign letter grades, then create a crosstab showing the distribution.

**Bonus:** Add a gender column and create a crosstab of Gender × Letter Grade.

In [None]:
# YOUR CODE HERE
# Exercise 1



## Exercise 2: Sales Analysis

**Task:** Create a sales dataset with columns: Region (North/South/East/West), Product (A/B/C), and Sales amount.

1. Use `pd.qcut()` to divide sales into quartiles
2. Use `transform()` to add a column showing % of region total
3. Use `crosstab()` to show total sales by Region and Product

In [None]:
# YOUR CODE HERE
# Exercise 2



## Exercise 3: Data Reshaping

**Task:** Create a dataset of student scores in wide format (Name, Math, Science, English).

1. Use `melt()` to convert to long format
2. Use `apply()` to add a "Pass/Fail" column (passing is 70+)
3. Create a crosstab of Subject × Pass/Fail

In [None]:
# YOUR CODE HERE
# Exercise 3



---

# 📝 Summary Cheat Sheet

## Quick Reference

| Function | Purpose | Example Use |
|----------|---------|-------------|
| **pd.cut()** | Sort into bins (ranges you choose, or equal width) | Grade ranges: A, B, C, D, F |
| **pd.qcut()** | Sort into groups (equal size) | Top 25%, next 25%, etc. |
| **apply()** | Apply function to rows/columns | Calculate weighted average |
| **transform()** | Transform keeping same shape | Normalize to group average |
| **filter()** | Keep/remove entire groups | Keep classes with 5+ students |
| **melt()** | Wide → Long format | Subjects as rows instead of columns |
| **stack()** | Columns → Multi-level rows | Create hierarchical index |
| **crosstab()** | Frequency tables | Count by category combinations |

## When to Use What?

```
Need to categorize numbers?
  ├─ Fixed ranges → pd.cut()
  └─ Equal groups → pd.qcut()

Need to apply calculations?
  ├─ Custom logic → apply()
  ├─ Keep shape → transform()
  └─ Remove groups → filter()

Need to reshape data?
  ├─ Wide to long → melt()
  ├─ Hierarchical → stack()
  └─ Summary table → crosstab()
```

---

# 🎉 Congratulations!

You've learned powerful pandas techniques for:
- ✅ Categorizing and binning data
- ✅ Applying custom transformations
- ✅ Reshaping data for analysis
- ✅ Creating summary tables

## 🚀 Next Steps

1. **Practice** with real datasets
2. **Combine** multiple techniques
3. **Explore** pandas documentation
4. **Build** your own analysis projects!

## 📚 Additional Resources

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- Practice datasets on [Kaggle](https://www.kaggle.com/datasets)

---

**Happy Data Science!** 🐼📊
