# Lab 1: Python for Data Science - SOLUTIONS

**Duration:** 60-90 minutes | **Difficulty:** Beginner

**This notebook contains all solutions. Use for reference after attempting the exercises.**

---

## Overview

This lab introduces the essential Python libraries for data science: NumPy, Pandas, and Matplotlib. You'll learn to manipulate data arrays, work with tabular data, create visualizations, and preprocess data for machine learning applications.

### Lab Structure

| Part | Topic | Key Concepts |
|------|-------|--------------|
| **Part 1** | NumPy Arrays | Creating arrays, statistics (mean/max/min/sum), reshaping, slicing |
| **Part 2** | Pandas DataFrames | Creating DataFrames, exploring data, filtering, grouping & aggregation |
| **Part 3** | Data Visualization | Histograms, scatter plots, bar charts with Matplotlib |
| **Part 4** | Data Preprocessing | Normalization, standardization, one-hot encoding |

### Libraries Used

- **NumPy** - Numerical computing and array operations
- **Pandas** - Data manipulation and analysis
- **Matplotlib** - Data visualization

### Dataset

This lab uses a synthetic **Customer dataset** (100 rows) with columns:
`customer_id`, `age`, `income`, `years_customer`, `region`, `purchased`

---

## Learning Objectives

By the end of this lab, you will be able to:
1. Create and manipulate NumPy arrays
2. Use Pandas DataFrames to filter and aggregate data
3. Create visualizations with Matplotlib
4. Preprocess data for machine learning

## Instructions

- This is the **SOLUTIONS** notebook - use for reference after attempting exercises
- All exercise cells contain the correct answers
- Run cells with `Shift+Enter`

## Setup

Run the cell below to import the required libraries. Run it before any other cell, because the rest of the notebook depends on these imports.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 6]
np.random.seed(42)

print("Setup complete! NumPy:", np.__version__, "| Pandas:", pd.__version__)

---
# Part 1: NumPy Arrays

NumPy is the foundation of data science in Python: Pandas and Matplotlib both build on its array type. It provides an N-dimensional array with fast, vectorized operations.

## 1.1 Creating Arrays - Demonstration

Run the cell below to see different ways to create NumPy arrays:
- `np.array([list])` - Create from a Python list
- `np.zeros((rows, cols))` - Create array filled with zeros
- `np.ones((rows, cols))` - Create array filled with ones
- `np.arange(n)` - Create array with values 0 to n-1
- `np.eye(n)` - Create n×n identity matrix

In [None]:
# From a list
arr1 = np.array([1, 2, 3, 4, 5])
print("From list:", arr1)

# Zeros (2 rows, 3 columns)
arr2 = np.zeros((2, 3))
print("\nZeros (2x3):")
print(arr2)

# Range 0 to 9
arr3 = np.arange(10)
print("\nRange 0-9:", arr3)

# Identity matrix
arr4 = np.eye(3)
print("\n3x3 Identity:")
print(arr4)

## Exercise 1.1: Create Arrays - SOLUTION

In the cell below, replace each `None` with the correct code:

| Variable | What to create | Code to write |
|----------|----------------|---------------|
| `arr_a` | Array containing [10, 20, 30, 40, 50] | `np.array([10, 20, 30, 40, 50])` |
| `arr_b` | 4×4 array filled with zeros | `np.zeros((4, 4))` |
| `arr_c` | Array with values 0 to 19 | `np.arange(20)` |
| `arr_d` | 5×5 identity matrix | `np.eye(5)` |

In [None]:
"""
Exercise 1.1 Solution: Creating NumPy Arrays

This solution demonstrates four different ways to create NumPy arrays:
- From a Python list
- Filled with zeros
- Using a range of values
- As an identity matrix
"""

# Create array [10, 20, 30, 40, 50] from a Python list
arr_a = np.array([10, 20, 30, 40, 50])

# Create 4x4 array filled with zeros - useful for initializing matrices
arr_b = np.zeros((4, 4))

# Create array with values 0 to 19 using arange
arr_c = np.arange(20)

# Create 5x5 identity matrix - diagonal of 1s, rest 0s
arr_d = np.eye(5)

# Print results to verify
print("arr_a:", arr_a)
print("arr_b shape:", arr_b.shape)
print("arr_c:", arr_c)
print("arr_d shape:", arr_d.shape)

### Code Explanation: Exercise 1.1

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `arr_a = np.array([10, 20, 30, 40, 50])` | **Creates an array from a Python list.** The `np.array()` function converts a regular Python list into a NumPy array, enabling fast mathematical operations. |
| 2 | `arr_b = np.zeros((4, 4))` | **Creates a 4×4 matrix of zeros.** The tuple `(4, 4)` specifies the shape. Zeros arrays are commonly used to initialize matrices before filling them with computed values. |
| 3 | `arr_c = np.arange(20)` | **Creates an array with values 0 to 19.** `arange(n)` generates integers from 0 to n-1, similar to Python's `range()`, but it returns a NumPy array rather than a lazy `range` object. |
| 4 | `arr_d = np.eye(5)` | **Creates a 5×5 identity matrix.** An identity matrix has 1s on the diagonal and 0s elsewhere. It's essential in linear algebra (multiplying by identity returns the original matrix). |
| 5 | `print(..., arr_b.shape)` | **The `.shape` attribute** returns a tuple with the dimensions of the array. For `arr_b`, this is `(4, 4)` meaning 4 rows and 4 columns. |

**Why these functions matter:**
- `np.array()` is the foundation - it converts Python data to NumPy's optimized format
- `np.zeros()` and `np.ones()` are used to pre-allocate memory for efficiency
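- `np.arange()` builds the array directly, avoiding the intermediate Python list that `np.array(list(range(n)))` would create
- `np.eye()` supplies the identity matrix used throughout linear algebra (for example, `A @ np.eye(n)` equals `A` for an n×n matrix `A`)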
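
As a quick check on the points above, the sketch below inspects what these constructors return. The variable names (`zeros`, `int_zeros`, `evens`, `A`) are illustrative and not part of the lab.

```python
import numpy as np

# np.zeros (like np.ones and np.eye) produces 64-bit floats unless dtype is given
zeros = np.zeros((2, 3))
int_zeros = np.zeros((2, 3), dtype=int)
print(zeros.dtype, int_zeros.dtype)

# np.arange also accepts start, stop, and step, like Python's range()
evens = np.arange(0, 10, 2)
print("evens:", evens)  # [0 2 4 6 8]

# Multiplying a matrix by the identity returns the same values
A = np.arange(9).reshape(3, 3)
print(np.array_equal(A @ np.eye(3), A))  # True
```

Passing `dtype` explicitly is worth the extra typing when an array will later hold counts or indices, since the float default can otherwise go unnoticed.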

## 1.2 Array Statistics - Demonstration

NumPy provides functions to calculate statistics:
- `np.mean(arr)` - Average value
- `np.max(arr)` - Maximum value
- `np.min(arr)` - Minimum value
- `np.sum(arr)` - Sum of all values
- `np.argmax(arr)` - Index of maximum value

Run the cell below to see these in action:

In [None]:
scores = np.array([85, 92, 78, 90, 88, 76, 95, 89])
print("Scores:", scores)
print("Mean:", np.mean(scores))
print("Max:", np.max(scores))
print("Min:", np.min(scores))
print("Sum:", np.sum(scores))
print("Index of max:", np.argmax(scores))

"""
Exercise 1.2 Solution: Array Statistics

This solution demonstrates NumPy's statistical functions for analyzing data.
These functions operate on entire arrays and return single values.
"""

temperatures = np.array([72, 75, 68, 80, 85, 70, 60])
print("Temperatures:", temperatures)

# Calculate the arithmetic mean (sum of values / count)
avg_temp = np.mean(temperatures)

# Find the maximum value in the array
max_temp = np.max(temperatures)

# Find the minimum value in the array
min_temp = np.min(temperatures)

# Find the INDEX (position) of the maximum value, not the value itself
hottest_day = np.argmax(temperatures)

# Calculate range: difference between highest and lowest values
temp_range = max_temp - min_temp

print("\nAverage:", avg_temp)
print("Max:", max_temp)
print("Min:", min_temp)
print("Hottest day index:", hottest_day)
print("Range:", temp_range)

### Code Explanation: Exercise 1.2

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `avg_temp = np.mean(temperatures)` | **Calculates the arithmetic mean.** Adds all values and divides by the count. Result: (72+75+68+80+85+70+60)/7 = 510/7 ≈ 72.86. The mean is one common measure of the data's central tendency. |
| 2 | `max_temp = np.max(temperatures)` | **Finds the maximum value.** Scans the entire array and returns the largest element (85). Useful for finding peaks or upper bounds. |
| 3 | `min_temp = np.min(temperatures)` | **Finds the minimum value.** Returns the smallest element (60). Together with max, defines the range of data. |
| 4 | `hottest_day = np.argmax(temperatures)` | **Returns the INDEX of the maximum, not the value.** Returns 4 because `temperatures[4]` is 85. The "arg" prefix stands for "argument": the input (here, the index) at which the maximum occurs. |
| 5 | `temp_range = max_temp - min_temp` | **Calculates the range.** The difference between max and min (85-60=25) shows the spread of the data. |

**Key distinction: `max()` vs `argmax()`**
- `np.max([10, 50, 30])` returns `50` (the value)
- `np.argmax([10, 50, 30])` returns `1` (the index where 50 is located)

**Why these functions matter:**
- `mean()` summarizes data with a single representative value
- `max()`/`min()` identify the extremes, which is a first check for possible outliers
- `argmax()` finds WHERE the maximum occurs (e.g., which day was hottest)
- These are vectorized operations - much faster than Python loops
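
The value/index distinction pays off when the index is used to look something up. A minimal sketch, assuming a hypothetical `days` list (not part of the lab) that is aligned with the temperature readings:

```python
import numpy as np

temperatures = np.array([72, 75, 68, 80, 85, 70, 60])
# Hypothetical day labels, one per temperature reading
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

hottest_idx = np.argmax(temperatures)  # position of the maximum
coldest_idx = np.argmin(temperatures)  # position of the minimum

print("Hottest:", days[hottest_idx], temperatures[hottest_idx])  # Fri 85
print("Coldest:", days[coldest_idx], temperatures[coldest_idx])  # Sun 60

# The vectorized mean agrees with a plain Python computation
loop_mean = sum(temperatures.tolist()) / len(temperatures)
print(np.isclose(np.mean(temperatures), loop_mean))  # True
```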

In [None]:
temperatures = np.array([72, 75, 68, 80, 85, 70, 60])
print("Temperatures:", temperatures)

avg_temp = np.mean(temperatures)  # SOLUTION
max_temp = np.max(temperatures)  # SOLUTION
min_temp = np.min(temperatures)  # SOLUTION
hottest_day = np.argmax(temperatures)  # SOLUTION
temp_range = max_temp - min_temp  # SOLUTION

print("\nAverage:", avg_temp)
print("Max:", max_temp)
print("Min:", min_temp)
print("Hottest day index:", hottest_day)
print("Range:", temp_range)

## 1.3 Reshaping Arrays - Demonstration

You can change the shape of an array with `.reshape(rows, cols)`.

**Important:** The total number of elements must stay the same (e.g., 12 elements can be 3×4, 4×3, 2×6, etc.)
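
To illustrate the element-count rule, the sketch below lets NumPy infer one dimension with `-1` and shows the error raised when the counts do not match. The names `inferred` and `mismatch_raised` are illustrative.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 2 = 6)
inferred = arr.reshape(2, -1)
print(inferred.shape)  # (2, 6)

# A shape whose product is not 12 raises a ValueError
try:
    arr.reshape(5, 3)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("Cannot reshape:", err)
```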

Run the cell below to see reshaping in action:

In [None]:
arr = np.arange(12)  # [0, 1, 2, ..., 11]
print("Original:", arr)
print("Shape:", arr.shape)

# Reshape to 3 rows x 4 columns
arr_3x4 = arr.reshape(3, 4)
print("\nReshaped to 3x4:")
print(arr_3x4)

# Reshape to 4 rows x 3 columns
arr_4x3 = arr.reshape(4, 3)
print("\nReshaped to 4x3:")
print(arr_4x3)

## 1.4 Array Slicing - Demonstration

Select parts of arrays using slicing syntax `[row, column]`:
- `arr[0, :]` - First row (all columns)
- `arr[:, 0]` - First column (all rows)
- `arr[:, -1]` - Last column
- `arr[1:4, 1:4]` - Rows 1-3, columns 1-3

Run the cell below:

In [None]:
"""
Exercise 1.3 Solution: Reshaping and Slicing Arrays

This solution demonstrates how to change array dimensions with reshape()
and extract portions of arrays using slicing notation.
"""

arr = np.arange(20)
matrix = np.arange(25).reshape(5, 5)

print("arr:", arr)
print("\nmatrix:")
print(matrix)

# Reshape 1D array (20 elements) into 2D array (4 rows × 5 columns)
# Total elements must match: 4 × 5 = 20
arr_4x5 = arr.reshape(4, 5)

# Slice first row: [0, :] means row 0, all columns
# The colon (:) is a wildcard meaning "all"
first_row = matrix[0, :]

# Slice last column: [:, -1] means all rows, last column
# Negative indices count from the end (-1 = last)
last_col = matrix[:, -1]

# Slice center 3×3: [1:4, 1:4] means rows 1,2,3 and columns 1,2,3
# Note: end index is exclusive, so 1:4 gives indices 1, 2, 3
center = matrix[1:4, 1:4]

print("\narr_4x5:")
print(arr_4x5)
print("\nfirst_row:", first_row)
print("last_col:", last_col)
print("center:")
print(center)

### Code Explanation: Exercise 1.3

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `arr_4x5 = arr.reshape(4, 5)` | **Reshapes a 1D array into 2D.** The 20-element array becomes 4 rows × 5 columns. Elements fill row-by-row (row-major order). Total elements must match: 4×5=20. |
| 2 | `first_row = matrix[0, :]` | **Extracts row 0 (first row).** The syntax `[row, col]` selects elements. `:` means "all columns". Result: `[0, 1, 2, 3, 4]`. |
| 3 | `last_col = matrix[:, -1]` | **Extracts the last column.** `:` means "all rows", `-1` is the last column index. Negative indexing counts from the end. |
| 4 | `center = matrix[1:4, 1:4]` | **Extracts a 3×3 subarray.** `1:4` means indices 1, 2, 3 (end is exclusive). This gets the center portion, excluding edges. |

**Slicing Syntax Breakdown:**
```
matrix[start_row:end_row, start_col:end_col]
```
- Omitting start defaults to 0: `[:3]` = `[0:3]`
- Omitting end defaults to length: `[2:]` = from index 2 to end
- `:` alone means "everything"

**Visual Example of `matrix[1:4, 1:4]`:**
```
Original 5×5:           Slice [1:4, 1:4]:
[[ 0  1  2  3  4]       
 [ 5 [6  7  8] 9]       [[ 6  7  8]
 [10 [11 12 13] 14]  →   [11 12 13]
 [15 [16 17 18] 19]      [16 17 18]]
 [20 21 22 23 24]]
```

**Why reshaping and slicing matter:**
- Reshaping prepares data for algorithms that expect specific dimensions
- Slicing extracts features, windows, or batches from datasets
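- Basic slicing returns a view of the original data rather than a copy, and `reshape()` returns a view whenever the memory layout allows, so both stay cheap even on large arrays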
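
Basic slicing returns a view rather than a copy, which has a practical consequence: writing to a slice also changes the original array. The sketch below demonstrates this and shows `.copy()` as the way to get independent data. The names `center_view` and `center_copy` are illustrative.

```python
import numpy as np

matrix = np.arange(25).reshape(5, 5)

# Basic slicing returns a view that shares memory with the original
center_view = matrix[1:4, 1:4]
print(np.shares_memory(matrix, center_view))  # True

# Writing through the view changes the original array
center_view[0, 0] = -1
print(matrix[1, 1])  # -1

# .copy() creates independent data
center_copy = matrix[1:4, 1:4].copy()
center_copy[0, 0] = 99
print(matrix[1, 1])  # still -1
```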

## Exercise 1.3: Reshape and Slice - SOLUTION

Complete the following tasks by replacing `None` with the correct code:

| Variable | What to do | Code to write |
|----------|------------|---------------|
| `arr_4x5` | Reshape `arr` to 4 rows × 5 columns | `arr.reshape(4, 5)` |
| `first_row` | Get the first row of `matrix` | `matrix[0, :]` |
| `last_col` | Get the last column of `matrix` | `matrix[:, -1]` |
| `center` | Get the center 3×3 subarray | `matrix[1:4, 1:4]` |

In [None]:
arr = np.arange(20)
matrix = np.arange(25).reshape(5, 5)

print("arr:", arr)
print("\nmatrix:")
print(matrix)

# Reshape arr to 4x5
arr_4x5 = arr.reshape(4, 5)  # SOLUTION

# Get first row of matrix
first_row = matrix[0, :]  # SOLUTION

# Get last column of matrix
last_col = matrix[:, -1]  # SOLUTION

# Get center 3x3
center = matrix[1:4, 1:4]  # SOLUTION

print("\narr_4x5:")
print(arr_4x5)
print("\nfirst_row:", first_row)
print("last_col:", last_col)
print("center:")
print(center)

---
# Part 2: Pandas DataFrames

Pandas provides the DataFrame: a table with labeled columns and indexed rows, similar to a spreadsheet.

## 2.1 Creating DataFrames - Demonstration

Create a DataFrame from a dictionary where:
- Keys become column names
- Values (lists) become the data

Run the cell below to see an example:

In [None]:
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'LA'],
    'salary': [70000, 80000, 90000, 75000]
}

df = pd.DataFrame(data)
print(df)
print("\nShape:", df.shape)
print("Columns:", list(df.columns))

## 2.2 Exploring DataFrames - Demonstration

Useful methods for exploring data:
- `df.head()` - First 5 rows
- `df.describe()` - Summary statistics
- `df['column']` - Select a single column
- `df['column'].mean()` - Mean of a column

Run the cell below:

In [None]:
print("First 2 rows:")
print(df.head(2))

print("\nSummary statistics:")
print(df.describe())

print("\nAverage salary:", df['salary'].mean())

"""
Exercise 2.1 Solution: Exploring DataFrame Statistics

This solution demonstrates how to access DataFrame columns and
calculate summary statistics using Pandas methods.
"""

# Access 'age' column and calculate mean
# The bracket notation df['column'] returns a Series
avg_age = customers['age'].mean()

# Access 'income' column and find maximum value
max_income = customers['income'].max()

# Sum the 'purchased' column (0s and 1s)
# Since purchased is binary, sum gives count of 1s (purchases)
total_purchased = customers['purchased'].sum()

print("Average age:", avg_age)
print("Max income:", max_income)
print("Total purchased:", total_purchased)

### Code Explanation: Exercise 2.1

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `avg_age = customers['age'].mean()` | **Accesses the 'age' column and calculates mean.** `customers['age']` returns a Pandas Series (single column). `.mean()` calculates the average of all values in that column. |
| 2 | `max_income = customers['income'].max()` | **Finds the maximum income.** The `.max()` method returns the largest value in the column. Useful for identifying the highest earner. |
| 3 | `total_purchased = customers['purchased'].sum()` | **Sums the purchase column.** Since 'purchased' contains 0s and 1s, summing counts how many 1s exist (i.e., how many customers made a purchase). |

**Understanding DataFrame Column Access:**
```python
customers['age']      # Returns a Series (single column)
customers[['age', 'income']]  # Returns a DataFrame (multiple columns)
customers.age         # Dot notation (fails for names with spaces or names that clash with DataFrame methods)
```

**Common Statistical Methods:**
| Method | Returns |
|--------|---------|
| `.mean()` | Average value |
| `.median()` | Middle value when sorted |
| `.std()` | Standard deviation |
| `.sum()` | Sum of all values |
| `.count()` | Number of non-null values |
| `.min()` / `.max()` | Minimum / Maximum |

**Why this matters:**
- Understanding data distributions is the first step in any analysis
- These summary statistics help identify outliers and data quality issues
- Binary columns (0/1) can be summed to count occurrences
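
The difference between single and double brackets, and the trick of summing a 0/1 column, can be checked on a small table. This sketch uses a hypothetical four-row DataFrame named `small` rather than the lab's `customers` data.

```python
import pandas as pd

# Hypothetical four-row table, used only for illustration
small = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [42000, 61000, 55000, 73000],
    "purchased": [1, 0, 1, 1],
})

single = small["age"]              # one column name -> Series
double = small[["age", "income"]]  # list of column names -> DataFrame
print(type(single).__name__, type(double).__name__)  # Series DataFrame

# Summing a 0/1 column counts the 1s; the mean gives the proportion
n_buyers = small["purchased"].sum()
buyer_rate = small["purchased"].mean()
print(n_buyers, buyer_rate)  # 3 0.75
```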

In [None]:
np.random.seed(42)
n = 100

customers = pd.DataFrame({
    'customer_id': range(1, n+1),
    'age': np.random.randint(18, 65, n),
    'income': np.random.normal(50000, 15000, n).astype(int),
    'years_customer': np.random.randint(1, 15, n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'purchased': np.random.choice([0, 1], n, p=[0.6, 0.4])
})

print("Customer dataset created!")
print(f"Shape: {customers.shape[0]} rows, {customers.shape[1]} columns")
print("\nFirst 5 rows:")
print(customers.head())

## Exercise 2.1: Explore the Customer Data - SOLUTION

Calculate statistics about the customer dataset by replacing `None` with the correct code:

| Variable | What to calculate | Code to write |
|----------|-------------------|---------------|
| `avg_age` | Average customer age | `customers['age'].mean()` |
| `max_income` | Maximum income | `customers['income'].max()` |
| `total_purchased` | Total customers who purchased (sum of 1s) | `customers['purchased'].sum()` |

In [None]:
"""
Exercise 2.2 Solution: Filtering DataFrames

This solution demonstrates boolean indexing to filter rows based on conditions.
Pandas uses bracket notation with boolean expressions inside.
"""

# Filter rows where income is greater than 70000
# customers['income'] > 70000 creates a boolean Series (True/False for each row)
# Passing this to customers[...] returns only rows where True
high_income = customers[customers['income'] > 70000]

# Filter rows where region equals 'East'
# Use == for equality comparison (single = is assignment)
east_region = customers[customers['region'] == 'East']

# Multiple conditions: age < 30 AND purchased == 1
# Each condition must be in parentheses
# Use & for AND, | for OR (not 'and'/'or' keywords)
young_buyers = customers[(customers['age'] < 30) & (customers['purchased'] == 1)]

print("High income count:", len(high_income))
print("East region count:", len(east_region))
print("Young buyers count:", len(young_buyers))

### Code Explanation: Exercise 2.2

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `high_income = customers[customers['income'] > 70000]` | **Filters rows where income exceeds 70000.** The inner expression creates a boolean mask (True/False for each row), and the outer brackets select only True rows. |
| 2 | `east_region = customers[customers['region'] == 'East']` | **Filters rows where region is 'East'.** Uses `==` for equality. Strings must be quoted and match exactly (case-sensitive). |
| 3 | `young_buyers = customers[(customers['age'] < 30) & (customers['purchased'] == 1)]` | **Combines two conditions with AND.** Each condition is wrapped in parentheses. The `&` operator requires both conditions to be True. |

**How Boolean Indexing Works:**
```python
# Step 1: Create boolean mask
mask = customers['income'] > 70000
# mask is: [False, True, False, True, ...]

# Step 2: Use mask to filter
result = customers[mask]
# Returns only rows where mask is True
```

**Logical Operators in Pandas:**
| Operator | Meaning | Example |
|----------|---------|---------|
| `&` | AND | `(cond1) & (cond2)` |
| `\|` | OR | `(cond1) \| (cond2)` |
| `~` | NOT | `~(condition)` |

**Common Pitfall:**
```python
# WRONG - the `and`/`or` keywords do not work element-wise
df[(df['a'] > 5) and (df['b'] < 10)]  # Raises ValueError: truth value of a Series is ambiguous

# CORRECT - Use & with parentheses
df[(df['a'] > 5) & (df['b'] < 10)]
```

**Why filtering matters:**
- Segment data for targeted analysis
- Remove outliers or invalid records
- Create subsets for training/testing in ML
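
The exercise only uses `&`, so here is a short sketch of the other operators from the table, plus `isin()` for membership tests. The DataFrame `df` and its values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "age": [22, 35, 47, 29, 61],
    "region": ["East", "West", "North", "East", "South"],
    "purchased": [1, 0, 1, 0, 1],
})

# OR: under 30, or made a purchase
young_or_buyer = df[(df["age"] < 30) | (df["purchased"] == 1)]

# NOT: everyone outside the East region
not_east = df[~(df["region"] == "East")]

# isin(): membership in a set of values, shorter than chaining | comparisons
east_or_west = df[df["region"].isin(["East", "West"])]

print(len(young_or_buyer), len(not_east), len(east_or_west))  # 4 3 3

# The `and` keyword raises ValueError when applied to two Series
try:
    df[(df["age"] < 30) and (df["purchased"] == 1)]
    raised = False
except ValueError:
    raised = True
print("`and` raised ValueError:", raised)  # True
```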

## 2.4 Filtering Data - Demonstration

Filter rows using boolean conditions:
- `df[df['column'] > value]` - Rows where column > value
- `df[df['column'] == 'text']` - Rows where column equals text
- `df[(cond1) & (cond2)]` - Multiple conditions with AND

Run the cell below:

In [None]:
# Filter: age > 50
older = customers[customers['age'] > 50]
print(f"Customers over 50: {len(older)}")

# Filter: region is 'West'
west = customers[customers['region'] == 'West']
print(f"Customers in West: {len(west)}")

# Multiple conditions: age > 40 AND purchased
older_buyers = customers[(customers['age'] > 40) & (customers['purchased'] == 1)]
print(f"Older buyers: {len(older_buyers)}")

"""
Exercise 2.3 Solution: Grouping and Aggregation

This solution demonstrates the split-apply-combine pattern:
1. Split data into groups by a column
2. Apply an aggregation function to each group
3. Combine results into a new Series
"""

# Group by 'region', then calculate mean of 'income' column
# Result is a Series with region as index and average income as values
avg_income_by_region = customers.groupby('region')['income'].mean()

# Group by 'region', count customer_ids in each group
# This gives the number of customers per region
count_by_region = customers.groupby('region')['customer_id'].count()

# Group by 'region', calculate mean of 'purchased' (0 or 1)
# Mean of binary values gives the proportion/rate
purchase_rate = customers.groupby('region')['purchased'].mean()

print("Average income by region:")
print(avg_income_by_region)
print("\nCount by region:")
print(count_by_region)
print("\nPurchase rate by region:")
print(purchase_rate)

### Code Explanation: Exercise 2.3

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `avg_income_by_region = customers.groupby('region')['income'].mean()` | **Groups rows by region, then calculates mean income for each group.** This is the split-apply-combine pattern: split by region → apply mean → combine into Series. |
| 2 | `count_by_region = customers.groupby('region')['customer_id'].count()` | **Counts customers in each region.** `.count()` returns the number of non-null values in each group. |
| 3 | `purchase_rate = customers.groupby('region')['purchased'].mean()` | **Calculates purchase rate per region.** Since 'purchased' is 0 or 1, the mean gives the proportion who purchased (e.g., 0.4 = 40%). |

**Understanding `groupby()` Step by Step:**
```python
# Step 1: Group the data
grouped = customers.groupby('region')
# Creates GroupBy object (no computation yet)

# Step 2: Select column to aggregate
column = grouped['income']
# Specifies which column to calculate on

# Step 3: Apply aggregation
result = column.mean()
# Computes mean for each group
```

**Common Aggregation Functions:**
| Function | Purpose |
|----------|---------|
| `.mean()` | Average value |
| `.sum()` | Total |
| `.count()` | Number of non-null values (use `.size()` to count rows) |
| `.min()` / `.max()` | Extremes |
| `.std()` | Standard deviation |
| `.agg(['mean', 'sum'])` | Multiple aggregations |

**Why the mean of binary data gives a rate:**
```
purchased = [1, 0, 1, 1, 0]
mean = (1+0+1+1+0) / 5 = 3/5 = 0.6 = 60% purchase rate
```

**Why grouping matters:**
- Compares metrics across segments (regions, demographics)
- Essential for creating pivot tables and reports
- Foundation for feature engineering in ML
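
Two details from the tables above are easy to verify on a small example: `.agg()` with several functions, and the difference between `.count()` and `.size()` when values are missing. The `sales` DataFrame below is hypothetical and deliberately contains one missing income.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing income value
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "income": [40000, 60000, 50000, np.nan, 70000],
    "purchased": [1, 0, 1, 1, 0],
})

# Several aggregations of one column in a single call
summary = sales.groupby("region")["income"].agg(["mean", "min", "max"])
print(summary)

# count() skips missing values; size() counts every row in the group
non_null = sales.groupby("region")["income"].count()
rows = sales.groupby("region").size()
print(non_null["West"], rows["West"])  # 2 3
```

On the lab's synthetic dataset there are no missing values, so `count()` and `size()` would agree there; the distinction matters once real data is involved.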

In [None]:
# Income > 70000
high_income = customers[customers['income'] > 70000]  # SOLUTION

# Region is East
east_region = customers[customers['region'] == 'East']  # SOLUTION

# Young buyers (age < 30 and purchased)
young_buyers = customers[(customers['age'] < 30) & (customers['purchased'] == 1)]  # SOLUTION

print("High income count:", len(high_income))
print("East region count:", len(east_region))
print("Young buyers count:", len(young_buyers))

## 2.5 Grouping Data - Demonstration

Group data and calculate aggregates:
- `df.groupby('column')['other'].mean()` - Average by group
- `df.groupby('column')['other'].count()` - Count by group

Run the cell below:

In [None]:
# Average income by region
print("Average income by region:")
print(customers.groupby('region')['income'].mean())

# Count by region
print("\nCustomers per region:")
print(customers.groupby('region')['customer_id'].count())

"""
Exercise 3.1 Solution: Creating a Histogram

This solution creates a histogram to visualize the distribution
of customer income values using Matplotlib.
"""

# Create a new figure with specified size (width=10, height=5 inches)
plt.figure(figsize=(10, 5))

# Create histogram of income data
# bins=20: divide data range into 20 equal intervals
# edgecolor='black': add black border around each bar
# alpha=0.7: make bars 70% opaque (slight transparency)
plt.hist(customers['income'], bins=20, edgecolor='black', alpha=0.7)

# Add labels and title for clarity
plt.xlabel('Income ($)')      # X-axis label
plt.ylabel('Frequency')       # Y-axis label
plt.title('Distribution of Customer Income')  # Chart title

# Display the plot
plt.show()

### Code Explanation: Exercise 3.1

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `plt.figure(figsize=(10, 5))` | **Creates a new figure canvas.** The tuple (10, 5) sets width and height in inches. Larger figures show more detail. |
| 2 | `plt.hist(customers['income'], bins=20, ...)` | **Creates the histogram.** Divides income range into 20 bins and counts how many values fall in each bin. |
| 3 | `edgecolor='black'` | **Adds black borders around bars.** Makes individual bars more visible, especially when colors are similar. |
| 4 | `alpha=0.7` | **Sets opacity to 70%.** Values range from 0 (fully transparent) to 1 (fully opaque). Partial transparency helps when plots overlap. |
| 5 | `plt.xlabel('Income ($)')` | **Labels the X-axis.** Always include units (like $) for clarity. |
| 6 | `plt.ylabel('Frequency')` | **Labels the Y-axis.** Frequency means "count of occurrences" in each bin. |
| 7 | `plt.title(...)` | **Adds a title above the plot.** Describes what the visualization shows. |
| 8 | `plt.show()` | **Renders and displays the plot.** Required in scripts; optional in Jupyter notebooks. |

**Understanding Histogram Bins:**
```
Example income range (illustrative): $20,000 - $80,000
With bins=20: each bin covers ($80,000 - $20,000) / 20 = $3,000

Bin 1: $20,000-$23,000 → count how many customers
Bin 2: $23,000-$26,000 → count how many customers
... and so on
```

**Why histograms matter:**
- Reveal the shape of data distribution (normal, skewed, bimodal)
- Show where most values concentrate
- Help identify outliers
- Essential first step in exploratory data analysis (EDA)
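
The binning arithmetic can be inspected without drawing anything, because `np.histogram` returns the counts and bin edges directly. The sketch below uses a hypothetical income sample generated on the spot, not the lab's `customers` data.

```python
import numpy as np

# Hypothetical income sample, not the lab's customer dataset
rng = np.random.default_rng(0)
income = rng.normal(50000, 15000, 100)

# np.histogram returns the per-bin counts and the 21 edges of 20 bins
counts, edges = np.histogram(income, bins=20)

# With an integer bin count, bins are equal-width across the data range
bin_width = (income.max() - income.min()) / 20
print("Bin width:", round(bin_width, 2))
print("Counts sum to sample size:", counts.sum() == len(income))  # True
print("Edges evenly spaced:", np.allclose(np.diff(edges), bin_width))  # True
```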

In [None]:
# Average income by region
avg_income_by_region = customers.groupby('region')['income'].mean()  # SOLUTION

# Count by region
count_by_region = customers.groupby('region')['customer_id'].count()  # SOLUTION

# Purchase rate by region
purchase_rate = customers.groupby('region')['purchased'].mean()  # SOLUTION

print("Average income by region:")
print(avg_income_by_region)
print("\nCount by region:")
print(count_by_region)
print("\nPurchase rate by region:")
print(purchase_rate)

---
# Part 3: Data Visualization

Matplotlib is the standard plotting library for Python.

"""
Exercise 3.2 Solution: Creating a Scatter Plot

This solution creates a scatter plot to explore the relationship
between customer tenure (years) and income.
"""

# Create figure with specified dimensions
plt.figure(figsize=(10, 6))

# Create scatter plot: each point represents one customer
# x-axis: years as customer
# y-axis: income
# alpha=0.5: semi-transparent points reveal overlapping data
plt.scatter(customers['years_customer'], customers['income'], alpha=0.5)

# Add descriptive labels
plt.xlabel('Years as Customer')
plt.ylabel('Income ($)')
plt.title('Customer Tenure vs Income')

# Display the plot
plt.show()

### Code Explanation: Exercise 3.2

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `plt.figure(figsize=(10, 6))` | **Creates a 10×6 inch figure.** Wider aspect ratio works well for scatter plots. |
| 2 | `plt.scatter(x, y, alpha=0.5)` | **Plots each data point as a dot.** First argument is X values, second is Y values. Each customer becomes one point. |
| 3 | `alpha=0.5` | **50% transparency.** When points overlap, darker areas show higher density. Essential for seeing patterns in crowded plots. |
| 4 | `plt.xlabel(...)` / `plt.ylabel(...)` | **Label the axes.** Readers need to know what each axis represents. |

**Interpreting Scatter Plots:**
- **Positive correlation:** Points trend upward left-to-right
- **Negative correlation:** Points trend downward left-to-right
- **No correlation:** Points scattered randomly (like this example)
- **Clusters:** Groups of points may indicate segments

**Why scatter plots matter:**
- Reveal relationships between two variables
- Essential for detecting correlations before modeling
- Help identify outliers (points far from the cluster)
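
A scatter plot's visual impression can be backed by a number. The sketch below computes Pearson's correlation coefficient with `np.corrcoef` for three hypothetical relationships; all data here is generated for illustration and is unrelated to the lab's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 15, 1000)               # hypothetical tenure values

y_up = 2 * x + 1                           # increases exactly with x
y_down = -3 * x + 10                       # decreases exactly with x
y_random = rng.normal(50000, 15000, 1000)  # generated independently of x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_random = np.corrcoef(x, y_random)[0, 1]

print(round(r_up, 3), round(r_down, 3))  # 1.0 -1.0
print("Independent data, r near 0:", round(r_random, 3))
```

Values of r near +1 or -1 correspond to the upward and downward trends described above, while a value near 0 matches the random scatter case.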

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(customers['age'], bins=15, edgecolor='black', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Customer Ages')
plt.show()

## Exercise 3.1: Create a Histogram - SOLUTION

Create a histogram of customer income. Add the following lines to the code cell below:

```python
plt.hist(customers['income'], bins=20, edgecolor='black', alpha=0.7)
plt.xlabel('Income ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Customer Income')
```

In [None]:
"""
Exercise 3.3 Solution: Creating a Bar Chart

This solution creates a bar chart comparing purchase rates across regions.
Bar charts are ideal for comparing values across categories.
"""

# Create figure
plt.figure(figsize=(8, 5))

# First, calculate the purchase rate per region using groupby
# Mean of binary 0/1 column gives the proportion (rate)
purchase_rate = customers.groupby('region')['purchased'].mean()

# Create bar chart from the Series
# kind='bar': vertical bars (use 'barh' for horizontal)
# color: fill color of bars
# edgecolor: border color of bars
purchase_rate.plot(kind='bar', color='green', edgecolor='black')

# Add labels and title
plt.xlabel('Region')
plt.ylabel('Purchase Rate')
plt.title('Purchase Rate by Region')

# Keep x-axis labels horizontal (rotation=0); pandas bar plots rotate them 90 degrees by default
plt.xticks(rotation=0)

# Display the plot
plt.show()

### Code Explanation: Exercise 3.3

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `purchase_rate = customers.groupby('region')['purchased'].mean()` | **Calculates purchase rate per region.** Groups by region, then computes mean of the binary column. Result is a Series with regions as index. |
| 2 | `purchase_rate.plot(kind='bar', ...)` | **Creates bar chart from Series.** Pandas Series have a built-in `.plot()` method. Index becomes x-axis labels, values become bar heights. |
| 3 | `color='green'` | **Sets bar fill color.** Can use color names ('red', 'blue') or hex codes ('#FF5733'). |
| 4 | `edgecolor='black'` | **Sets bar border color.** Adds definition to each bar. |
| 5 | `plt.xticks(rotation=0)` | **Controls label rotation.** 0 = horizontal, 45 = diagonal, 90 = vertical. Adjust based on label length. |

**Bar Chart vs Histogram:**
| Bar Chart | Histogram |
|-----------|-----------|
| Categorical data | Continuous data |
| Bars have gaps | Bars touch |
| Compares groups | Shows distribution |
| Example: Sales by region | Example: Age distribution |

**Why bar charts matter:**
- Best for comparing values across discrete categories
- Height makes differences immediately visible
- Commonly used in business reports and dashboards
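
The same kind of chart can also be drawn with Matplotlib's Axes interface instead of the Pandas `.plot()` shortcut. This is a sketch under assumptions: the `rates` values are hypothetical, and fixing the y-axis to 0-1 is one possible design choice, not something the lab requires.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical purchase rates, for illustration only
rates = pd.Series({"East": 0.35, "North": 0.42, "South": 0.38, "West": 0.45})

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(rates.index, rates.values, color="green", edgecolor="black")

ax.set_xlabel("Region")
ax.set_ylabel("Purchase Rate")
ax.set_title("Purchase Rate by Region")
ax.set_ylim(0, 1)  # a rate is a proportion, so show the full 0-1 scale
```

Call `plt.show()` afterwards when running this as a script. Starting the y-axis at 0 keeps bar heights proportional to the values, which avoids exaggerating small differences between regions.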

## 3.2 Scatter Plots - Demonstration

Scatter plots show the relationship between two variables.

Run the cell below:

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(customers['age'], customers['income'], alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.title('Age vs Income')
plt.show()

## Exercise 3.2: Create a Scatter Plot - SOLUTION

Create a scatter plot of `years_customer` vs `income`. Add the following lines:

```python
plt.scatter(customers['years_customer'], customers['income'], alpha=0.5)
plt.xlabel('Years as Customer')
plt.ylabel('Income ($)')
plt.title('Customer Tenure vs Income')
```

In [None]:
"""
Exercise 4.1 Solution: Normalization Function

This function scales array values to the range [0, 1] using min-max normalization.
Essential for algorithms sensitive to feature scales (e.g., neural networks, KNN).
"""

def normalize(arr):
    """
    Normalize array values to range [0, 1].
    
    Formula: normalized = (x - min) / (max - min)
    
    Parameters:
        arr: NumPy array of numeric values
        
    Returns:
        NumPy array with values scaled to [0, 1]
    """
    # Subtract minimum to shift range to start at 0
    # Divide by (max - min) to scale range to [0, 1]
    return (arr - arr.min()) / (arr.max() - arr.min())

# Test the function
test_data = np.array([10, 20, 30, 40, 50])
result = normalize(test_data)
print("Input:", test_data)
print("Output:", result)
print("Expected: [0.   0.25 0.5  0.75 1.  ]")

### Code Explanation: Exercise 4.1

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `def normalize(arr):` | **Defines a reusable function.** Takes any NumPy array as input. |
| 2 | `arr - arr.min()` | **Shifts values so minimum becomes 0.** If data is [10, 20, 30], after this: [0, 10, 20]. |
| 3 | `arr.max() - arr.min()` | **Calculates the range.** This is the denominator that scales values to [0, 1]. |
| 4 | `(arr - min) / (max - min)` | **Complete formula.** Numerator shifts to 0, denominator scales to 1. |

**Step-by-Step Example:**
```
Original:    [10, 20, 30, 40, 50]
min = 10, max = 50, range = 40

Step 1 (shift): [10-10, 20-10, 30-10, 40-10, 50-10] = [0, 10, 20, 30, 40]
Step 2 (scale): [0/40, 10/40, 20/40, 30/40, 40/40] = [0, 0.25, 0.5, 0.75, 1.0]
```

**Properties of Normalized Data:**
- Minimum value becomes exactly 0
- Maximum value becomes exactly 1
- All other values fall between 0 and 1
- Preserves the relative distances between values

**Why normalization matters:**
- Many ML algorithms (neural networks, SVM, KNN) perform better when features are on similar scales
- Prevents features with large values from dominating the model
- Often helps gradient-based optimization converge faster and more reliably
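
One edge case worth knowing: when every value in the array is identical, `max - min` is 0 and the formula divides by zero. The sketch below shows one possible guard. The name `normalize_safe` and the choice to return all zeros for a constant array are assumptions for illustration, not part of the lab solution:

```python
import numpy as np

def normalize_safe(arr):
    """Min-max normalize; return all zeros when every value is identical."""
    arr = np.asarray(arr, dtype=float)
    value_range = arr.max() - arr.min()
    if value_range == 0:
        # Assumed convention: a constant array maps to zeros
        return np.zeros_like(arr)
    return (arr - arr.min()) / value_range

print(normalize_safe([10, 20, 30, 40, 50]))
print(normalize_safe([7, 7, 7]))
```

Other conventions (for example, returning 0.5 everywhere or raising an error) are equally defensible; the point is to decide explicitly instead of letting the division produce `nan`.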

## 3.3 Bar Charts - Demonstration

Bar charts compare values across categories. Use `.plot(kind='bar')` on a Pandas Series.

Run the cell below:

In [None]:
avg_income = customers.groupby('region')['income'].mean()

plt.figure(figsize=(8, 5))
avg_income.plot(kind='bar', color='steelblue', edgecolor='black')
plt.xlabel('Region')
plt.ylabel('Average Income ($)')
plt.title('Average Income by Region')
plt.xticks(rotation=0)
plt.show()

"""
Exercise 4.2 Solution: Standardization Function

This function transforms data to have mean=0 and standard deviation=1 (z-score normalization).
Useful when features have different units or scales. It rescales the data but does not change the shape of its distribution.
"""

def standardize(arr):
    """
    Standardize array to mean=0, std=1.
    
    Formula: z = (x - mean) / std
    
    Parameters:
        arr: NumPy array of numeric values
        
    Returns:
        NumPy array with mean=0 and std=1
    """
    # Subtract mean to center data around 0
    # Divide by std to scale spread to 1
    return (arr - arr.mean()) / arr.std()

# Test the function
test_data = np.array([10, 20, 30, 40, 50])
result = standardize(test_data)
print("Input:", test_data)
print("Output:", result)
if result is not None:
    print(f"Mean: {result.mean():.4f} (should be ~0)")
    print(f"Std: {result.std():.4f} (should be ~1)")

### Code Explanation: Exercise 4.2

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `def standardize(arr):` | **Defines a function for z-score transformation.** |
| 2 | `arr - arr.mean()` | **Centers data around 0.** After this, the mean of the result is 0. |
| 3 | `arr.std()` | **Calculates standard deviation.** Measures the spread of data. |
| 4 | `(arr - mean) / std` | **Complete z-score formula.** Each value becomes "how many standard deviations from the mean." |

**Step-by-Step Example:**
```
Original:    [10, 20, 30, 40, 50]
mean = 30, std = 14.14

Step 1 (center): [10-30, 20-30, 30-30, 40-30, 50-30] = [-20, -10, 0, 10, 20]
Step 2 (scale):  [-20/14.14, -10/14.14, 0/14.14, 10/14.14, 20/14.14]
                = [-1.41, -0.71, 0, 0.71, 1.41]
```
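
The value `std = 14.14` above is the population standard deviation, which is what NumPy's `.std()` computes by default (`ddof=0`). Pandas defaults to the sample standard deviation (`ddof=1`), so the same data can give a different number. A short sketch of the difference:

```python
import numpy as np
import pandas as pd

data = np.array([10, 20, 30, 40, 50])

population_std = data.std()          # NumPy default: ddof=0, divide by n
sample_std = data.std(ddof=1)        # divide by n - 1
pandas_std = pd.Series(data).std()   # pandas default: ddof=1

print(round(population_std, 2))  # 14.14
print(round(sample_std, 2))      # 15.81
print(round(pandas_std, 2))      # 15.81
```

If a standardized result checked with `.std()` is not exactly 1, a mismatch between these two conventions is a likely cause.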

**Interpreting Z-Scores:**
- z = 0: Value equals the mean
- z = 1: Value is 1 standard deviation above mean
- z = -2: Value is 2 standard deviations below mean
- |z| > 3: Often considered an outlier

**Standardization vs Normalization:**
| Standardization | Normalization |
|-----------------|---------------|
| Mean=0, Std=1 | Range [0, 1] |
| No fixed range | Fixed range |
| Works well for roughly Gaussian data | Works well for bounded data |
| Often used with linear models | Often used with neural networks |

**Why standardization matters:**
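- Puts features on a common scale for algorithms that are sensitive to feature magnitude (e.g., SVM, PCA, regularized regression); it does not make the data normally distributed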
- Makes different features comparable (e.g., age in years vs income in dollars)
- Helps with numerical stability in optimization
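
To illustrate how standardization makes features comparable, the sketch below standardizes each column of a small 2-D array separately. The `features` array is hypothetical, with age in years in the first column and income in dollars in the second:

```python
import numpy as np

# Hypothetical rows of (age in years, income in dollars)
features = np.array([
    [25, 30000.0],
    [35, 60000.0],
    [45, 90000.0],
])

# axis=0 computes one mean and one std per column
col_mean = features.mean(axis=0)
col_std = features.std(axis=0)
standardized = (features - col_mean) / col_std

print(standardized)
```

After the transformation both columns are expressed in standard deviations from their own mean, so a 10-year age gap and a $30,000 income gap end up on the same scale in this example.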

In [None]:
plt.figure(figsize=(8, 5))

# SOLUTION:
purchase_rate = customers.groupby('region')['purchased'].mean()
purchase_rate.plot(kind='bar', color='green', edgecolor='black')
plt.xlabel('Region')
plt.ylabel('Purchase Rate')
plt.title('Purchase Rate by Region')
plt.xticks(rotation=0)

plt.show()

---
# Part 4: Data Preprocessing

Before using data in machine learning, we often need to preprocess it.

"""
Exercise 4.3 Solution: One-Hot Encoding

This solution converts categorical 'region' column into binary indicator columns.
Required for ML algorithms that cannot handle text categories directly.
"""

print("Original columns:", list(customers.columns))

# pd.get_dummies() converts categorical columns to binary columns
# columns=['region'] specifies which column(s) to encode
# Original 'region' column is removed and replaced with:
# region_East, region_North, region_South, region_West
customers_encoded = pd.get_dummies(customers, columns=['region'])

if customers_encoded is not None:
    print("\nNew columns:", list(customers_encoded.columns))
    print("\nFirst 3 rows:")
    print(customers_encoded.head(3))

### Code Explanation: Exercise 4.3

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `pd.get_dummies(customers, columns=['region'])` | **Converts 'region' to binary columns.** Each unique value becomes a new indicator column. Pandas 2.0 and later show the values as `True`/`False` by default; pass `dtype=int` to get 0/1. |

**Before and After One-Hot Encoding:**
```
BEFORE (categorical):
| region  |
|---------|
| North   |
| East    |
| South   |

AFTER (one-hot encoded):
| region_East | region_North | region_South | region_West |
|-------------|--------------|--------------|-------------|
| 0           | 1            | 0            | 0           |
| 1           | 0            | 0            | 0           |
| 0           | 0            | 1            | 0           |
```

**Key Features of get_dummies():**
- Original column is removed automatically
- New columns named `originalname_value`
- Exactly one column is 1 per row (mutually exclusive), provided the original value is not missing
- Creates k columns for k categories
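
These properties can be checked directly. A minimal sketch on a hypothetical one-column DataFrame `toy`, passing `dtype=int` so the output is 0/1 regardless of pandas version:

```python
import pandas as pd

# Hypothetical toy data with four region values
toy = pd.DataFrame({'region': ['North', 'East', 'South', 'West', 'North']})

# dtype=int forces 0/1 output; recent pandas versions default to True/False
encoded = pd.get_dummies(toy, columns=['region'], dtype=int)

# Each row should contain exactly one 1
row_totals = encoded.sum(axis=1)

print(list(encoded.columns))
print(row_totals.tolist())
```

The new columns are sorted by category value, and the original `region` column no longer appears in `encoded`.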

**The "Dummy Variable Trap":**
When using one-hot encoding in regression:
- k categories create k columns
- Only k-1 are needed (the remaining one is implied when all the others are 0)
- Use `drop_first=True` to drop the first category and avoid multicollinearity:
  ```python
  pd.get_dummies(df, columns=['region'], drop_first=True)
  ```
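
As a small illustration of what `drop_first=True` changes, the sketch below encodes a hypothetical four-row DataFrame both ways. `dtype=int` is passed only to make the output read as 0/1:

```python
import pandas as pd

# Hypothetical toy data: one row per region
toy = pd.DataFrame({'region': ['North', 'East', 'South', 'West']})

full = pd.get_dummies(toy, columns=['region'], dtype=int)
reduced = pd.get_dummies(toy, columns=['region'], drop_first=True, dtype=int)

print(list(full.columns))     # k = 4 columns
print(list(reduced.columns))  # k - 1 = 3 columns, first category dropped

# The 'East' row (index 1) is now represented by all zeros
east_row = reduced.iloc[1]
print(east_row.tolist())
```

The dropped category is the first one in sorted order (`East` here), and it becomes the baseline that the remaining columns are compared against.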

**Why one-hot encoding matters:**
- Most ML algorithms require numeric input
- Text categories have no mathematical meaning (North < South makes no sense)
- Creates a representation where each category is equidistant
- Essential preprocessing step for classification and regression

In [None]:
data = np.array([10, 20, 30, 40, 50])
print("Original:", data)

normalized = (data - data.min()) / (data.max() - data.min())
print("Normalized:", normalized)

## Exercise 4.1: Write a Normalize Function - SOLUTION

Complete the `normalize` function below. Replace the `return None` line with:

```python
return (arr - arr.min()) / (arr.max() - arr.min())
```

In [None]:
def normalize(arr):
    """Normalize array to [0, 1] range"""
    return (arr - arr.min()) / (arr.max() - arr.min())  # SOLUTION

# Test
test_data = np.array([10, 20, 30, 40, 50])
result = normalize(test_data)
print("Input:", test_data)
print("Output:", result)
print("Expected: [0.   0.25 0.5  0.75 1.  ]")

## 4.2 Standardization - Demonstration

Standardization centers data around 0 and rescales it to a standard deviation of 1:

```
standardized = (x - mean) / std
```

Run the cell below:

In [None]:
data = np.array([10, 20, 30, 40, 50])
print("Original:", data)
print("Mean:", data.mean(), "Std:", data.std())

standardized = (data - data.mean()) / data.std()
print("\nStandardized:", standardized)
print("New mean:", standardized.mean().round(10))
print("New std:", standardized.std())

## Exercise 4.2: Write a Standardize Function - SOLUTION

Complete the `standardize` function below. Replace the `return None` line with:

```python
return (arr - arr.mean()) / arr.std()
```

In [None]:
def standardize(arr):
    """Standardize array to mean=0, std=1"""
    return (arr - arr.mean()) / arr.std()  # SOLUTION

# Test
test_data = np.array([10, 20, 30, 40, 50])
result = standardize(test_data)
print("Input:", test_data)
print("Output:", result)
if result is not None:
    print(f"Mean: {result.mean():.4f} (should be ~0)")
    print(f"Std: {result.std():.4f} (should be ~1)")

## 4.3 One-Hot Encoding - Demonstration

One-hot encoding converts categorical variables to binary columns.

Use `pd.get_dummies(df, columns=['column_name'])` to encode a column.

Run the cell below:

In [None]:
sample = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['NYC', 'LA', 'NYC']
})
print("Before:")
print(sample)

encoded = pd.get_dummies(sample, columns=['city'])
print("\nAfter one-hot encoding:")
print(encoded)

## Exercise 4.3: One-Hot Encode the Customer Data - SOLUTION

One-hot encode the `region` column in the customers DataFrame. Replace `None` with:

```python
pd.get_dummies(customers, columns=['region'])
```

In [None]:
print("Original columns:", list(customers.columns))

# One-hot encode the region column
customers_encoded = pd.get_dummies(customers, columns=['region'])  # SOLUTION

if customers_encoded is not None:
    print("\nNew columns:", list(customers_encoded.columns))
    print("\nFirst 3 rows:")
    print(customers_encoded.head(3))

---
# Lab Complete!

## Summary

You learned:
- **NumPy**: Create arrays, calculate statistics, reshape, slice
- **Pandas**: Create DataFrames, filter, group, aggregate
- **Matplotlib**: Histograms, scatter plots, bar charts
- **Preprocessing**: Normalize, standardize, one-hot encode

## Quick Reference

```python
# NumPy
np.array([1,2,3])       # Create array
np.zeros((3,4))         # 3x4 zeros
np.mean(arr)            # Average
arr.reshape(2,3)        # Reshape

# Pandas
df['col'].mean()        # Column average
df[df['col'] > 5]       # Filter
df.groupby('a')['b'].mean()  # Group

# Matplotlib
plt.hist(data)          # Histogram
plt.scatter(x, y)       # Scatter
series.plot(kind='bar') # Bar chart
```