# Week 1: NumPy Fundamentals & Vectorized Computing

**Course:** Applied ML Foundations for SaaS Analytics  
**Week Focus:** Learn NumPy arrays, vectorization, and efficient numerical operations for processing millions of SaaS telemetry records.

---

## 🎯 Learning Objectives

By the end of this week, you will:
- Understand NumPy arrays and their advantages over Python lists
- Master array creation, indexing, and manipulation
- Apply vectorized operations for fast computations at scale
- Use broadcasting to efficiently compute metrics across dimensions
- Process real SaaS telemetry data with NumPy operations

## 📊 Real-World Context

At a SaaS company like CloudWave, you're analyzing:
- **Daily Active Users (DAU)**: tracking engagement trends
- **Session counts**: per region, by plan tier, time-of-day
- **Feature adoption**: how many users engage with each product feature
- **Performance metrics**: response times, error rates, API latency

NumPy is the foundation because:
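1. **Speed**: vectorized operations are often one to two orders of magnitude faster than pure Python loops for element-wise arithmetic
2. **Memory**: compact arrays typically use several times less memory than lists of Python numbers
3. **Simplicity**: expressive syntax for complex operations in one line
4. **Integration**: ecosystem standard (Pandas and scikit-learn are built on NumPy; TensorFlow interoperates with NumPy arrays)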
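
To make the speed point concrete, here is a minimal timing sketch. The array size, the doubling operation, and all names in it are assumptions chosen for illustration; the measured ratio will vary with hardware, NumPy version, and the operation being timed.

```python
import timeit

import numpy as np

n = 1_000_000  # illustrative size, not taken from the CloudWave data
values_list = list(range(n))
values_array = np.arange(n)


def double_with_loop():
    # Pure Python: one interpreter-level multiplication per element
    return [x * 2 for x in values_list]


def double_with_numpy():
    # Vectorized: a single call that runs in compiled code
    return values_array * 2


# Take the best of three runs to reduce noise from other processes
loop_seconds = min(timeit.repeat(double_with_loop, number=1, repeat=3))
numpy_seconds = min(timeit.repeat(double_with_numpy, number=1, repeat=3))

print(f"Python loop: {loop_seconds * 1000:.1f} ms")
print(f"NumPy:       {numpy_seconds * 1000:.1f} ms")
print(f"Speedup:     {loop_seconds / numpy_seconds:.0f}x")
```

Running this on your own machine is a more reliable guide than any quoted multiplier.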

In [None]:
from IPython.display import HTML
HTML('''
<style>
details {
  margin: 10px 0;
  padding: 8px 12px;
  border: 1px solid #d9e2ec;
  border-radius: 8px;
  background: #f9fbfd;
}
details summary {
  font-weight: 600;
  color: #0056b3;
  cursor: pointer;
}
details[open] {
  background: #f1f7ff;
  border-color: #c3d4f0;
}
details pre {
  background: #f8f9fa;
  padding: 8px;
  border-radius: 6px;
}
</style>
''')

## 🏢 Scenario: CloudWave Growth Analytics

You're a data analyst at CloudWave, a growing SaaS company. Your CEO wants a daily health report:
- Which regions are growing?
- How do different customer tiers compare?
- Is feature adoption uniform or concentrated?

The challenge: you have 3 months of telemetry data with 220,000 event records. Processing millions of numbers efficiently requires understanding NumPy.

<details>
<summary>💡 Hint: Breaking Down the Problem</summary>

**Hint 1:** When working with large arrays, think about the shape and dimensions you need:
- Do you have per-user metrics or per-event metrics?
- Are you aggregating across time (days/weeks) or across users (cohorts)?
- What dimensions need broadcasting?

**Hint 2:** Use NumPy's aggregation methods efficiently:
```python
# Good: single operation
mean_usage = arr.mean()

# Avoid: the built-in sum() walks the array one element at a time in Python
mean_usage = sum(arr) / len(arr)  # slower
```

**Hint 3:** For edge cases:
- Missing values: use `np.nanmean()` to ignore NaN (zeros are valid values and still count toward the mean)
- Outliers: use `np.percentile()` for robust statistics
- Integer vs float: be aware of type promotion (important for memory!)
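
The three edge cases above can be sketched on a tiny array. The values here are hypothetical and chosen so that the effects are easy to check by hand.

```python
import numpy as np

# Hypothetical usage counts; NaN marks a missing measurement
usage = np.array([3.0, 0.0, np.nan, 5.0, 120.0])

print(np.mean(usage))               # nan: a single NaN poisons the plain mean
print(np.nanmean(usage))            # 32.0: NaN skipped, the zero still counts
print(np.nanpercentile(usage, 50))  # 4.0: the median barely notices the 120 outlier

# Type promotion: multiplying an int32 array by a Python float yields float64,
# which doubles the bytes per element
counts = np.array([1, 2, 3], dtype=np.int32)
scaled = counts * 1.5
print(counts.dtype, counts.itemsize)  # int32 4
print(scaled.dtype, scaled.itemsize)  # float64 8
```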

</details>

<details>
<summary>✅ Solution: Feature Usage Aggregation</summary>

```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('../data/feature_usage.csv', parse_dates=['date'], low_memory=False)

# Extract feature usage counts as NumPy array
usage_counts = df['usage_count'].values  # converts to NumPy array

# Compute statistics
print(f"Total events: {len(usage_counts)}")
print(f"Mean usage per event: {usage_counts.mean():.2f}")
print(f"Median: {np.median(usage_counts):.2f}")
print(f"Std Dev: {usage_counts.std():.2f}")
print(f"Min: {usage_counts.min()}, Max: {usage_counts.max()}")

# Top 10% threshold
percentile_90 = np.percentile(usage_counts, 90)
top_10_percent = df[df['usage_count'] >= percentile_90]
print(f"\nTop 10% threshold: {percentile_90}")
print(f"Number of events in top 10%: {len(top_10_percent)}")
print(f"Top users: {top_10_percent['user_id'].nunique()}")
```

**Why this works:**
- `.values` converts a Pandas Series to a NumPy array, which enables fast array operations (`.to_numpy()` is the equivalent method that the Pandas documentation now recommends)
- `np.percentile()` is vectorized and handles large arrays efficiently
- No Python loops means we're leveraging NumPy's compiled C backend

</details>

## 📚 Key Concepts: Why NumPy Matters

### Arrays vs Lists
```
Python list:  [1, 2, 3, 4, 5]  → stored as 5 separate objects in memory
NumPy array:  array([1,2,3,4,5]) → contiguous block, all elements same type
```
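
A rough way to see the layout difference is to compare byte counts. This is a sketch under assumptions: the size is illustrative, and the list figure is approximate because CPython object sizes vary by version and small integers are shared between references.

```python
import sys

import numpy as np

n = 100_000  # illustrative size
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)

# A list stores pointers; each int is a separate Python object on the heap
list_bytes = sys.getsizeof(values_list) + sum(sys.getsizeof(x) for x in values_list)

# An array stores raw 8-byte integers back to back
array_bytes = values_array.nbytes

print(f"list:  ~{list_bytes:,} bytes")
print(f"array:  {array_bytes:,} bytes ({values_array.itemsize} bytes per element)")
print(f"contiguous: {values_array.flags['C_CONTIGUOUS']}")
```

Choosing a smaller dtype (for example `np.int32` for counts that fit) shrinks the array further, which is one reason dtype awareness matters at scale.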

**Vectorization**: Operations applied to entire arrays without explicit loops.

```python
# Slow (Python loop)
result = []
for x in big_list:
    result.append(x * 2)

# Fast (NumPy vectorization)
result = big_array * 2
```

### Broadcasting
Automatically aligns dimensions for operations:
```python
import numpy as np

# 2D array (100 days × 50 regions)
daily_revenue = np.random.rand(100, 50)

# 1D baseline (50 regions): illustrative values 100, 150, 200, ... one per region
baseline = np.arange(100, 2600, 50)

# Automatically broadcasts baseline across all days
adjusted = daily_revenue / baseline  # shape (100, 50)
```

## ✍️ Hands-on Exercises

### Exercise 1: Create and Manipulate Arrays
Load feature usage data and compute metrics using NumPy operations:
1. Load the feature_usage.csv data into a NumPy array
2. Extract usage counts for a single feature across all events
3. Compute: mean, median, std dev, min, max usage counts
4. Identify which users are in the top 10% for usage

### Exercise 2: Broadcasting in Action
1. Load daily DAU data (shape: 90 days × 50 regions)
2. Compute the regional average DAU across the 90-day period
3. Calculate deviation from regional average for each day (broadcast)
4. Find which region+day combination has highest growth rate
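
One possible starting point for this exercise, using synthetic data in place of the real DAU file. The random values, the seed, and the day-over-day definition of "growth rate" are all assumptions; the exercise may intend a different definition.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for the DAU matrix: 90 days x 50 regions
dau = rng.integers(low=500, high=5000, size=(90, 50)).astype(np.float64)

# Step 2: average over the day axis -> one value per region, shape (50,)
regional_avg = dau.mean(axis=0)

# Step 3: (90, 50) - (50,) broadcasts the averages across all 90 rows
deviation = dau - regional_avg

# Step 4: day-over-day growth, shape (89, 50); row i compares day i+1 to day i
growth = np.diff(dau, axis=0) / dau[:-1]

# argmax works on the flattened array; unravel_index maps it back to (row, column)
row, region = np.unravel_index(np.argmax(growth), growth.shape)
print(f"Highest growth: region {region}, from day {row} to day {row + 1} "
      f"(0-indexed), rate {growth[row, region]:.2%}")
```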

### Exercise 3: Vectorized Aggregations
1. Load user_events.csv
2. Group events by user (use unique users as an index)
3. For each user, compute: total events, avg event_value, events per day
4. All operations must use NumPy (no Pandas loops)
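
A common NumPy-only grouping pattern combines `np.unique(..., return_inverse=True)` with `np.bincount`. The sketch below uses small hypothetical arrays rather than the real `user_events.csv`, and it assumes "events per day" means events divided by the user's active span in days.

```python
import numpy as np

# Hypothetical event-level columns, as they might look after loading the file
user_ids = np.array([101, 102, 101, 103, 102, 101])
event_values = np.array([2.0, 5.0, 4.0, 1.0, 3.0, 6.0])
event_days = np.array([0, 0, 1, 1, 2, 2])  # day offset of each event

# unique_users is sorted; group_index maps every event to its user's position
unique_users, group_index = np.unique(user_ids, return_inverse=True)

# bincount counts events per group; with weights it sums values per group
total_events = np.bincount(group_index)
total_value = np.bincount(group_index, weights=event_values)
avg_value = total_value / total_events

# Active span per user (last day - first day + 1), using unbuffered ufunc.at
first_day = np.full(unique_users.size, event_days.max())
last_day = np.zeros(unique_users.size, dtype=event_days.dtype)
np.minimum.at(first_day, group_index, event_days)
np.maximum.at(last_day, group_index, event_days)
events_per_day = total_events / (last_day - first_day + 1)

print(unique_users)    # [101 102 103]
print(total_events)    # [3 2 1]
print(avg_value)       # [4. 4. 1.]
print(events_per_day)  # approximately 1.0, 0.667, 1.0
```

The `group_index` array plays the role of a group key: every per-user statistic is computed in one pass over the events, with no Python-level loop over users.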

<details>
<summary>💡 Hint: Broadcasting Challenge</summary>

When you have arrays of different shapes, NumPy automatically broadcasts them:

```python
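# DAU matrix (90 days × 50 regions), shape (90, 50)
#   day 1 row (illustrative): [100, 150, 200, ...]
#   day 2 row (illustrative): [120, 160, 190, ...]
# Regional baseline (50):     [95,  140, 210, ...]  shape (50,)
#
# When you divide them, NumPy broadcasts the baseline across every row,
# so each region's value is divided by that region's baseline:
# [100/95, 150/140, 200/210, ...] for day 1
# [120/95, 160/140, 190/210, ...] for day 2
# etc.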
```

**Key:** The shapes must be compatible for broadcasting:
- Dimensions align from the right
- Size 1 broadcasts to any size
- Mismatched non-1 dimensions cause an error
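
These rules can be checked on shapes alone with `np.broadcast_shapes` (available in NumPy 1.20 and later). The shapes below are illustrative and mirror the 90 days × 50 regions layout from Exercise 2.

```python
import numpy as np

# Each call returns the result shape without allocating any data
print(np.broadcast_shapes((90, 50), (50,)))    # (90, 50): trailing 50s match
print(np.broadcast_shapes((90, 1), (1, 50)))   # (90, 50): size-1 axes stretch
print(np.broadcast_shapes((90, 50), (90, 1)))  # (90, 50): column stretches across regions

# A (90,) vector lines up with the LAST axis (50), so the shapes are incompatible
try:
    np.broadcast_shapes((90, 50), (90,))
except ValueError as err:
    print(f"Incompatible: {err}")

# Fix: add a trailing axis so the per-day vector becomes a (90, 1) column
per_day = np.arange(90)
print(per_day[:, np.newaxis].shape)            # (90, 1)
```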

</details>

<details>
<summary>✅ Solution: Vectorized Metrics Computation</summary>

```python
import pandas as pd
import numpy as np

# Load feature usage and reshape for analysis
df = pd.read_csv('../data/feature_usage.csv', parse_dates=['date'], low_memory=False)

# Method 1: Using Pandas groupby, then convert to NumPy for vectorized operations
user_stats = df.groupby('user_id').agg({
    'usage_count': ['sum', 'mean', 'count']
}).reset_index()
user_stats.columns = ['user_id', 'total_usage', 'avg_usage', 'event_count']

# Now use NumPy to find top users
total_usage_array = user_stats['total_usage'].values
top_10_percent_threshold = np.percentile(total_usage_array, 90)
top_users = user_stats[user_stats['total_usage'] >= top_10_percent_threshold]

print(f"Total unique users: {len(user_stats)}")
print(f"Top 10% threshold: {top_10_percent_threshold:.0f}")
print(f"Top 10% users: {len(top_users)}")
print(f"Their avg usage: {top_users['avg_usage'].mean():.2f}")

# Using NumPy broadcasting to compute z-scores
mean_usage = total_usage_array.mean()
std_usage = total_usage_array.std()
z_scores = (total_usage_array - mean_usage) / std_usage
outliers = np.where(np.abs(z_scores) > 2.5)[0]
print(f"Outlier users (|z| > 2.5): {len(outliers)}")
```

**Why this hybrid approach works:**
- Pandas handles grouped aggregation (simpler and less error-prone than hand-rolled NumPy grouping)
- NumPy handles statistical computations (vectorized and fast)
- `.values` bridges Pandas and NumPy seamlessly

</details>

In [None]:
# Quick demo: NumPy operations on real SaaS data
import pandas as pd
import numpy as np

# Load feature usage data
df = pd.read_csv('../data/feature_usage.csv')
print("=" * 60)
print("FEATURE USAGE DATA OVERVIEW")
print("=" * 60)
print(f"Total records: {len(df):,}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print(f"Unique users: {df['user_id'].nunique():,}")
print(f"Unique features: {df['feature_name'].nunique()}")
print()

# NumPy operations: find total usage per feature
print("=" * 60)
print("FEATURE USAGE SUMMARY (VECTORIZED WITH NUMPY)")
print("=" * 60)

# Convert to NumPy array
usage_array = df['usage_count'].values

# Compute statistics
stats = {
    'Total usage events': len(usage_array),
    'Mean usage': f"{usage_array.mean():.2f}",
    'Median usage': f"{np.median(usage_array):.2f}",
    'Std Dev': f"{usage_array.std():.2f}",
    'Min': f"{usage_array.min():.0f}",
    'Max': f"{usage_array.max():.0f}",
    'P95': f"{np.percentile(usage_array, 95):.0f}",
}

for key, val in stats.items():
    print(f"{key:.<40} {val}")

# Group by feature using pandas, then analyze with NumPy
feature_usage = df.groupby('feature_name')['usage_count'].sum().sort_values(ascending=False)
print()
print("Top 5 Features by Total Usage:")
print(feature_usage.head())
print()
print(f"NumPy power: computed statistics on {len(df):,} records in milliseconds!")

## 🤔 Reflection & Application

**Question 1:** Broadcasting helps scale baseline usage across 100k users. How would you structure this?
- Create a baseline array (one value per feature) and a usage array (users × features)
- The broadcasting rule: `(100000, 50) / (50,)` becomes `(100000, 50)`, because the baseline's single axis aligns with the trailing feature axis

**Question 2:** When is NumPy not the right tool?
- When you need labeled data (use Pandas)
- When working with strings/text heavily (use regular Python)
- When data doesn't fit in memory (use Dask or Spark)

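**Question 3:** How does NumPy performance scale?
- 1,000 elements: every method finishes in well under a millisecond, so the choice rarely matters
- 1,000,000 elements: NumPy is typically tens of times faster than a Python loop
- 1,000,000,000+ elements: NumPy essential; pure Python becomes impractical (and at 8 bytes per float64, the array alone needs about 8 GB of memory)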

## 📝 Practice Assignment

**Problem:** You have daily DAU data for 30 days across 10 regions. Compute:
1. Total DAU across all regions for each day
2. Regional share of total DAU each day (as percentages)
3. Which region had highest average share? Lowest?
4. Compute z-score for each region's daily DAU (standardize by region)
5. Identify anomalies (days where any region's DAU is > 2σ from its mean)

**Deliverable:** Write functions using NumPy (no loops), document with comments explaining the broadcasting logic.
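
The assignment relies on reducing along one axis and broadcasting the result back. The sketch below shows that pattern on a toy 3 × 2 matrix with made-up numbers; it is meant as a shape reference, not as a complete solution for the 30 × 10 case.

```python
import numpy as np

# Toy DAU matrix: 3 days x 2 regions (the assignment uses 30 x 10)
dau = np.array([[100.0, 300.0],
                [200.0, 200.0],
                [300.0, 100.0]])

# Per-day totals: keepdims=True keeps shape (3, 1) so the division goes row by row
daily_total = dau.sum(axis=1, keepdims=True)
share_pct = dau / daily_total * 100  # (3, 2) / (3, 1) -> (3, 2)

# Per-region statistics have shape (2,), which already aligns with the last axis
z = (dau - dau.mean(axis=0)) / dau.std(axis=0)

print(share_pct)      # each row sums to 100
print(np.abs(z) > 2)  # boolean mask of candidate anomalies
```

Without `keepdims=True`, `daily_total` would have shape `(3,)` and would be aligned with the region axis instead of the day axis, which raises an error here and silently gives wrong results when the matrix is square.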

## 🔗 Next Steps

In Week 2, we'll layer Pandas on top to handle labeled data, join multiple tables, and prepare datasets for analysis. NumPy becomes the engine underlying Pandas' high-performance operations.