# Lab 1: Python for Data Science - SOLUTIONS

**Duration:** 60-90 minutes | **Difficulty:** Beginner

**This notebook contains all solutions. Use for reference after attempting the exercises.**

---

## Overview

This lab introduces the essential Python libraries for data science: NumPy, Pandas, and Matplotlib. You'll learn to manipulate data arrays, work with tabular data, create visualizations, and preprocess data for machine learning applications.

### Lab Structure

| Part | Topic | Key Concepts |
|------|-------|--------------|
| **Part 1** | NumPy Arrays | Creating arrays, statistics (mean/max/min/sum), reshaping, slicing |
| **Part 2** | Pandas DataFrames | Creating DataFrames, exploring data, filtering, grouping & aggregation |
| **Part 3** | Data Visualization | Histograms, scatter plots, bar charts with Matplotlib |
| **Part 4** | Data Preprocessing | Normalization, standardization, one-hot encoding |

### Libraries Used

- **NumPy** - Numerical computing and array operations
- **Pandas** - Data manipulation and analysis
- **Matplotlib** - Data visualization

### Dataset

This lab uses a synthetic **Customer dataset** (100 rows) with columns:
`customer_id`, `age`, `income`, `years_customer`, `region`, `purchased`

---

## Learning Objectives

By the end of this lab, you will be able to:
1. Create and manipulate NumPy arrays
2. Use Pandas DataFrames to filter and aggregate data
3. Create visualizations with Matplotlib
4. Preprocess data for machine learning

## Instructions

- This is the **SOLUTIONS** notebook - use for reference after attempting exercises
- All exercise cells contain the correct answers
- Run cells with `Shift+Enter`

## Setup

Run the cell below to import the required libraries. Run it before any other cell; the rest of the notebook depends on these imports.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 6]  # default figure size in inches (width, height)
np.random.seed(42)  # fixed seed so random results are reproducible

print("Setup complete! NumPy:", np.__version__, "| Pandas:", pd.__version__)

---
# Part 1: NumPy Arrays

NumPy is the foundation of most data science work in Python. It provides fast, vectorized operations on n-dimensional arrays.

## 1.1 Creating Arrays - Demonstration

Run the cell below to see different ways to create NumPy arrays:
- `np.array([list])` - Create from a Python list
- `np.zeros((rows, cols))` - Create array filled with zeros
- `np.ones((rows, cols))` - Create array filled with ones
- `np.arange(n)` - Create array with values 0 to n-1
- `np.eye(n)` - Create n×n identity matrix

In [None]:
# From a list
arr1 = np.array([1, 2, 3, 4, 5])
print("From list:", arr1)

# Zeros (2 rows, 3 columns)
arr2 = np.zeros((2, 3))
print("\nZeros (2x3):")
print(arr2)

# Range 0 to 9
arr3 = np.arange(10)
print("\nRange 0-9:", arr3)

# Identity matrix
arr4 = np.eye(3)
print("\n3x3 Identity:")
print(arr4)
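
One detail the demonstration does not print is the element type (`dtype`) of each array. As a minimal sketch (the `*_demo` names are illustrative and not part of the lab), `np.zeros` and `np.eye` produce floating-point arrays by default, while `np.array` and `np.arange` produce integer arrays when given integers:

```python
import numpy as np

from_list_demo = np.array([1, 2, 3, 4, 5])   # integers in, integer dtype out
zeros_demo = np.zeros((2, 3))                # float64 by default
range_demo = np.arange(10)                   # integer dtype for integer arguments
eye_demo = np.eye(3)                         # float64 by default

for name, a in [("array", from_list_demo), ("zeros", zeros_demo),
                ("arange", range_demo), ("eye", eye_demo)]:
    print(f"{name:>6}: dtype={a.dtype}, shape={a.shape}")

# Pass dtype explicitly when a specific element type is needed
int_zeros_demo = np.zeros((2, 3), dtype=int)
print("zeros with dtype=int:", int_zeros_demo.dtype)
```

The exact integer type (for example `int64` or `int32`) can depend on the platform and NumPy version.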

## Exercise 1.1: Create Arrays - SOLUTION

In the cell below, replace each `None` with the correct code:

| Variable | What to create | Code to write |
|----------|----------------|---------------|
| `arr_a` | Array containing [10, 20, 30, 40, 50] | `np.array([10, 20, 30, 40, 50])` |
| `arr_b` | 4×4 array filled with zeros | `np.zeros((4, 4))` |
| `arr_c` | Array with values 0 to 19 | `np.arange(20)` |
| `arr_d` | 5×5 identity matrix | `np.eye(5)` |

In [None]:
# Create array [10, 20, 30, 40, 50]
arr_a = np.array([10, 20, 30, 40, 50])  # SOLUTION

# Create 4x4 zeros
arr_b = np.zeros((4, 4))  # SOLUTION

# Create range 0-19
arr_c = np.arange(20)  # SOLUTION

# Create 5x5 identity
arr_d = np.eye(5)  # SOLUTION

# Test
print("arr_a:", arr_a)
print("arr_b shape:", arr_b.shape if arr_b is not None else None)
print("arr_c:", arr_c)
print("arr_d shape:", arr_d.shape if arr_d is not None else None)

## 1.2 Array Statistics - Demonstration

NumPy provides functions to calculate statistics:
- `np.mean(arr)` - Average value
- `np.max(arr)` - Maximum value
- `np.min(arr)` - Minimum value
- `np.sum(arr)` - Sum of all values
- `np.argmax(arr)` - Index of maximum value

Run the cell below to see these in action:

In [None]:
scores = np.array([85, 92, 78, 90, 88, 76, 95, 89])
print("Scores:", scores)
print("Mean:", np.mean(scores))
print("Max:", np.max(scores))
print("Min:", np.min(scores))
print("Sum:", np.sum(scores))
print("Index of max:", np.argmax(scores))
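
`np.argmax` returns a position rather than a value, so it is usually combined with indexing. The sketch below uses a hypothetical `scores_demo` array in which the maximum appears twice; in that case `np.argmax` reports the first occurrence:

```python
import numpy as np

scores_demo = np.array([85, 92, 78, 95, 88, 76, 95, 89])  # 95 appears twice

best_index = np.argmax(scores_demo)       # first position of the maximum
best_value = scores_demo[best_index]      # value stored at that position

print("Index of max:", best_index)
print("Value at that index:", best_value)
print("Same as np.max:", best_value == np.max(scores_demo))
```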

## Exercise 1.2: Calculate Statistics - SOLUTION

Given the `temperatures` array, calculate the following by replacing `None` with the correct code:

| Variable | What to calculate | Code to write |
|----------|-------------------|---------------|
| `avg_temp` | Average temperature | `np.mean(temperatures)` |
| `max_temp` | Maximum temperature | `np.max(temperatures)` |
| `min_temp` | Minimum temperature | `np.min(temperatures)` |
| `hottest_day` | Index of hottest day | `np.argmax(temperatures)` |
| `temp_range` | Range (max - min) | `max_temp - min_temp` |

In [None]:
temperatures = np.array([72, 75, 68, 80, 85, 70, 60])
print("Temperatures:", temperatures)

avg_temp = np.mean(temperatures)  # SOLUTION
max_temp = np.max(temperatures)  # SOLUTION
min_temp = np.min(temperatures)  # SOLUTION
hottest_day = np.argmax(temperatures)  # SOLUTION
temp_range = max_temp - min_temp  # SOLUTION

print("\nAverage:", avg_temp)
print("Max:", max_temp)
print("Min:", min_temp)
print("Hottest day index:", hottest_day)
print("Range:", temp_range)

## 1.3 Reshaping Arrays - Demonstration

You can change the shape of an array with `.reshape(rows, cols)`.

**Important:** The total number of elements must stay the same (e.g., 12 elements can be 3×4, 4×3, 2×6, etc.)

Run the cell below to see reshaping in action:

In [None]:
arr = np.arange(12)  # [0, 1, 2, ..., 11]
print("Original:", arr)
print("Shape:", arr.shape)

# Reshape to 3 rows x 4 columns
arr_3x4 = arr.reshape(3, 4)
print("\nReshaped to 3x4:")
print(arr_3x4)

# Reshape to 4 rows x 3 columns
arr_4x3 = arr.reshape(4, 3)
print("\nReshaped to 4x3:")
print(arr_4x3)
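
The rule that the element count must stay the same can be checked directly. In this sketch (`arr_demo` is an illustrative name), passing `-1` lets NumPy infer one dimension, and asking for a shape that does not hold 12 elements raises a `ValueError`:

```python
import numpy as np

arr_demo = np.arange(12)

# -1 asks NumPy to infer that dimension from the total element count
inferred = arr_demo.reshape(3, -1)
print("reshape(3, -1) shape:", inferred.shape)

# 12 elements cannot fill a 5x3 grid, so NumPy raises ValueError
try:
    arr_demo.reshape(5, 3)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print("ValueError:", exc)
```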

## 1.4 Array Slicing - Demonstration

Select parts of a 2D array using the slicing syntax `arr[rows, columns]`. The stop index of a slice is excluded:
- `arr[0, :]` - First row (all columns)
- `arr[:, 0]` - First column (all rows)
- `arr[:, -1]` - Last column
- `arr[1:4, 1:4]` - Rows 1-3, columns 1-3

Run the cell below:

In [None]:
matrix = np.arange(16).reshape(4, 4)
print("Matrix:")
print(matrix)

print("\nFirst row:", matrix[0, :])
print("Last column:", matrix[:, -1])
print("Center 2x2:")
print(matrix[1:3, 1:3])
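
A point worth knowing when slicing: a basic slice is a view onto the original array, not a copy, so writing to the slice changes the original. The following sketch (with illustrative `*_demo`, `center_view`, and `center_copy` names) shows the difference and how `.copy()` avoids it:

```python
import numpy as np

matrix_demo = np.arange(16).reshape(4, 4)

center_view = matrix_demo[1:3, 1:3]   # basic slice: a view, not a copy
center_view[0, 0] = -1                # also changes matrix_demo[1, 1]
print("matrix_demo[1, 1] after editing the view:", matrix_demo[1, 1])

center_copy = matrix_demo[1:3, 1:3].copy()  # independent copy of the same region
center_copy[0, 0] = 99                      # the original is unaffected
print("matrix_demo[1, 1] after editing the copy:", matrix_demo[1, 1])
```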

## Exercise 1.3: Reshape and Slice - SOLUTION

Complete the following tasks by replacing `None` with the correct code:

| Variable | What to do | Code to write |
|----------|------------|---------------|
| `arr_4x5` | Reshape `arr` to 4 rows × 5 columns | `arr.reshape(4, 5)` |
| `first_row` | Get the first row of `matrix` | `matrix[0, :]` |
| `last_col` | Get the last column of `matrix` | `matrix[:, -1]` |
| `center` | Get the center 3×3 subarray | `matrix[1:4, 1:4]` |

In [None]:
arr = np.arange(20)
matrix = np.arange(25).reshape(5, 5)

print("arr:", arr)
print("\nmatrix:")
print(matrix)

# Reshape arr to 4x5
arr_4x5 = arr.reshape(4, 5)  # SOLUTION

# Get first row of matrix
first_row = matrix[0, :]  # SOLUTION

# Get last column of matrix
last_col = matrix[:, -1]  # SOLUTION

# Get center 3x3
center = matrix[1:4, 1:4]  # SOLUTION

print("\narr_4x5:")
print(arr_4x5)
print("\nfirst_row:", first_row)
print("last_col:", last_col)
print("center:")
print(center)

---
# Part 2: Pandas DataFrames

Pandas provides the DataFrame, a table of labeled rows and columns, much like a spreadsheet in Python.

## 2.1 Creating DataFrames - Demonstration

Create a DataFrame from a dictionary where:
- Keys become column names
- Values (lists) become the data

Run the cell below to see an example:

In [None]:
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'LA'],
    'salary': [70000, 80000, 90000, 75000]
}

df = pd.DataFrame(data)
print(df)
print("\nShape:", df.shape)
print("Columns:", list(df.columns))

## 2.2 Exploring DataFrames - Demonstration

Useful methods for exploring data:
- `df.head()` - First 5 rows (`df.head(n)` returns the first `n`)
- `df.describe()` - Summary statistics for the numeric columns
- `df['column']` - Select a single column
- `df['column'].mean()` - Mean of a column

Run the cell below:

In [None]:
print("First 2 rows:")
print(df.head(2))

print("\nSummary statistics:")
print(df.describe())

print("\nAverage salary:", df['salary'].mean())
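
By default, `describe()` summarizes only the numeric columns, which is why text columns such as `name` and `city` do not appear in its output. A small sketch, rebuilding the same four-row table under the illustrative name `people_demo`, compares the default with `include='all'`:

```python
import pandas as pd

people_demo = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'LA'],
    'salary': [70000, 80000, 90000, 75000],
})

# Default: numeric columns only
numeric_summary = people_demo.describe()
print("Default describe() columns:", list(numeric_summary.columns))

# include='all' also summarizes the text columns (count, unique, top, freq)
full_summary = people_demo.describe(include='all')
print("describe(include='all') columns:", list(full_summary.columns))
```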

## 2.3 Setup: Customer Dataset

Run the cell below to create a larger dataset that we'll use for the exercises:

In [None]:
np.random.seed(42)
n = 100

customers = pd.DataFrame({
    'customer_id': range(1, n+1),
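    'age': np.random.randint(18, 65, n),                      # ages 18-64 (upper bound is exclusive)
    'income': np.random.normal(50000, 15000, n).astype(int),  # mean 50,000, std 15,000, truncated to int
    'years_customer': np.random.randint(1, 15, n),            # 1-14 years
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'purchased': np.random.choice([0, 1], n, p=[0.6, 0.4])    # 1 = purchased, with probability 0.4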
})

print("Customer dataset created!")
print(f"Shape: {customers.shape[0]} rows, {customers.shape[1]} columns")
print("\nFirst 5 rows:")
print(customers.head())

## Exercise 2.1: Explore the Customer Data - SOLUTION

Calculate statistics about the customer dataset by replacing `None` with the correct code:

| Variable | What to calculate | Code to write |
|----------|-------------------|---------------|
| `avg_age` | Average customer age | `customers['age'].mean()` |
| `max_income` | Maximum income | `customers['income'].max()` |
| `total_purchased` | Total customers who purchased (sum of 1s) | `customers['purchased'].sum()` |

In [None]:
# Average age
avg_age = customers['age'].mean()  # SOLUTION

# Maximum income
max_income = customers['income'].max()  # SOLUTION

# Total who purchased
total_purchased = customers['purchased'].sum()  # SOLUTION

print("Average age:", avg_age)
print("Max income:", max_income)
print("Total purchased:", total_purchased)

## 2.4 Filtering Data - Demonstration

Filter rows using boolean conditions:
- `df[df['column'] > value]` - Rows where column > value
- `df[df['column'] == 'text']` - Rows where column equals text
- `df[(cond1) & (cond2)]` - Multiple conditions with AND

Run the cell below:

In [None]:
# Filter: age > 50
older = customers[customers['age'] > 50]
print(f"Customers over 50: {len(older)}")

# Filter: region is 'West'
west = customers[customers['region'] == 'West']
print(f"Customers in West: {len(west)}")

# Multiple conditions: age > 40 AND purchased
older_buyers = customers[(customers['age'] > 40) & (customers['purchased'] == 1)]
print(f"Older buyers: {len(older_buyers)}")
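
The same pattern extends to OR (`|`), NOT (`~`), and membership tests with `.isin()`. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons such as `>` and `==`. A minimal sketch on a small made-up table (`shoppers_demo` is hypothetical, not the lab dataset):

```python
import pandas as pd

shoppers_demo = pd.DataFrame({
    'age': [22, 45, 67, 34, 51],
    'region': ['West', 'East', 'North', 'West', 'South'],
    'purchased': [1, 0, 1, 0, 1],
})

# OR: under 30 or over 60
young_or_senior = shoppers_demo[(shoppers_demo['age'] < 30) | (shoppers_demo['age'] > 60)]

# NOT: everyone outside the West region
not_west = shoppers_demo[~(shoppers_demo['region'] == 'West')]

# Membership: region is any of several values
east_or_west = shoppers_demo[shoppers_demo['region'].isin(['East', 'West'])]

print("Young or senior:", len(young_or_senior))
print("Not West:", len(not_west))
print("East or West:", len(east_or_west))
```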

## Exercise 2.2: Filter the Data - SOLUTION

Create filtered DataFrames by replacing `None` with the correct code:

| Variable | What to filter | Code to write |
|----------|----------------|---------------|
| `high_income` | Customers with income > 70000 | `customers[customers['income'] > 70000]` |
| `east_region` | Customers in 'East' region | `customers[customers['region'] == 'East']` |
| `young_buyers` | Age < 30 AND purchased == 1 | `customers[(customers['age'] < 30) & (customers['purchased'] == 1)]` |

In [None]:
# Income > 70000
high_income = customers[customers['income'] > 70000]  # SOLUTION

# Region is East
east_region = customers[customers['region'] == 'East']  # SOLUTION

# Young buyers (age < 30 and purchased)
young_buyers = customers[(customers['age'] < 30) & (customers['purchased'] == 1)]  # SOLUTION

print("High income count:", len(high_income) if high_income is not None else None)
print("East region count:", len(east_region) if east_region is not None else None)
print("Young buyers count:", len(young_buyers) if young_buyers is not None else None)

## 2.5 Grouping Data - Demonstration

Group data and calculate aggregates:
- `df.groupby('column')['other'].mean()` - Average by group
- `df.groupby('column')['other'].count()` - Number of non-missing values by group

Run the cell below:

In [None]:
# Average income by region
print("Average income by region:")
print(customers.groupby('region')['income'].mean())

# Count by region
print("\nCustomers per region:")
print(customers.groupby('region')['customer_id'].count())
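
When several statistics are needed for the same groups, `.agg()` can compute them in one call instead of repeating the `groupby`. This is a sketch on a small made-up table (`orders_demo` is hypothetical), not a required part of the lab:

```python
import pandas as pd

orders_demo = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'North'],
    'income': [40000, 52000, 60000, 48000, 58000, 50000],
})

# One row per region, one column per statistic
summary_demo = orders_demo.groupby('region')['income'].agg(['mean', 'count', 'max'])
print(summary_demo)
```

The result is a DataFrame indexed by region, so a single value can be read with, for example, `summary_demo.loc['North', 'mean']`.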

## Exercise 2.3: Group and Aggregate - SOLUTION

Calculate group statistics by replacing `None` with the correct code:

| Variable | What to calculate | Code to write |
|----------|-------------------|---------------|
| `avg_income_by_region` | Average income per region | `customers.groupby('region')['income'].mean()` |
| `count_by_region` | Number of customers per region | `customers.groupby('region')['customer_id'].count()` |
| `purchase_rate` | Purchase rate (mean of purchased) per region | `customers.groupby('region')['purchased'].mean()` |

In [None]:
# Average income by region
avg_income_by_region = customers.groupby('region')['income'].mean()  # SOLUTION

# Count by region
count_by_region = customers.groupby('region')['customer_id'].count()  # SOLUTION

# Purchase rate by region
purchase_rate = customers.groupby('region')['purchased'].mean()  # SOLUTION

print("Average income by region:")
print(avg_income_by_region)
print("\nCount by region:")
print(count_by_region)
print("\nPurchase rate by region:")
print(purchase_rate)

---
# Part 3: Data Visualization

Matplotlib is a widely used plotting library for Python, and Pandas uses it for its built-in `.plot()` methods.

## 3.1 Histograms - Demonstration

Histograms show the distribution of a single variable.

Key parameters:
- `bins` - Number of equal-width bins (bars)
- `edgecolor` - Color of bar edges
- `alpha` - Opacity, from 0 (transparent) to 1 (opaque)

Run the cell below:

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(customers['age'], bins=15, edgecolor='black', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Customer Ages')
plt.show()
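
To see what the `bins` parameter does numerically, the binning can be reproduced without drawing anything. The sketch below uses `np.histogram` on illustrative synthetic ages (`ages_demo` is not the lab's `customers['age']` column): asking for 15 bins yields 15 counts and 16 evenly spaced bin edges.

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded generator for repeatable data
ages_demo = rng.integers(18, 65, 100)    # 100 synthetic ages from 18 to 64

# counts[i] is the number of values falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(ages_demo, bins=15)

print("Bin edges:", np.round(edges, 1))
print("Counts per bin:", counts)
print("Total counted:", counts.sum())
```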

## Exercise 3.1: Create a Histogram - SOLUTION

Create a histogram of customer income. Add the following lines to the code cell below:

```python
plt.hist(customers['income'], bins=20, edgecolor='black', alpha=0.7)
plt.xlabel('Income ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Customer Income')
```

In [None]:
plt.figure(figsize=(10, 5))

# SOLUTION:
plt.hist(customers['income'], bins=20, edgecolor='black', alpha=0.7)
plt.xlabel('Income ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Customer Income')

plt.show()

## 3.2 Scatter Plots - Demonstration

Scatter plots show the relationship between two variables.

Run the cell below:

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(customers['age'], customers['income'], alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.title('Age vs Income')
plt.show()

## Exercise 3.2: Create a Scatter Plot - SOLUTION

Create a scatter plot of `years_customer` vs `income`. Add the following lines:

```python
plt.scatter(customers['years_customer'], customers['income'], alpha=0.5)
plt.xlabel('Years as Customer')
plt.ylabel('Income ($)')
plt.title('Customer Tenure vs Income')
```

In [None]:
plt.figure(figsize=(10, 6))

# SOLUTION:
plt.scatter(customers['years_customer'], customers['income'], alpha=0.5)
plt.xlabel('Years as Customer')
plt.ylabel('Income ($)')
plt.title('Customer Tenure vs Income')

plt.show()

## 3.3 Bar Charts - Demonstration

Bar charts compare values across categories. Use `.plot(kind='bar')` on a Pandas Series.

Run the cell below:

In [None]:
avg_income = customers.groupby('region')['income'].mean()

plt.figure(figsize=(8, 5))
avg_income.plot(kind='bar', color='steelblue', edgecolor='black')
plt.xlabel('Region')
plt.ylabel('Average Income ($)')
plt.title('Average Income by Region')
plt.xticks(rotation=0)
plt.show()

## Exercise 3.3: Create a Bar Chart - SOLUTION

Create a bar chart showing purchase rate by region. Add the following lines:

```python
purchase_rate = customers.groupby('region')['purchased'].mean()
purchase_rate.plot(kind='bar', color='green', edgecolor='black')
plt.xlabel('Region')
plt.ylabel('Purchase Rate')
plt.title('Purchase Rate by Region')
plt.xticks(rotation=0)
```

In [None]:
plt.figure(figsize=(8, 5))

# SOLUTION:
purchase_rate = customers.groupby('region')['purchased'].mean()
purchase_rate.plot(kind='bar', color='green', edgecolor='black')
plt.xlabel('Region')
plt.ylabel('Purchase Rate')
plt.title('Purchase Rate by Region')
plt.xticks(rotation=0)

plt.show()

---
# Part 4: Data Preprocessing

Before using data in machine learning, we often need to preprocess it.

## 4.1 Normalization - Demonstration

Normalization (min-max scaling) rescales values to the range [0, 1] using the formula:

```
normalized = (x - min) / (max - min)
```

Run the cell below:

In [None]:
data = np.array([10, 20, 30, 40, 50])
print("Original:", data)

normalized = (data - data.min()) / (data.max() - data.min())
print("Normalized:", normalized)

## Exercise 4.1: Write a Normalize Function - SOLUTION

Complete the `normalize` function below. Replace the `return None` line with:

```python
return (arr - arr.min()) / (arr.max() - arr.min())
```

In [None]:
def normalize(arr):
    """Normalize array to [0, 1] range"""
    return (arr - arr.min()) / (arr.max() - arr.min())  # SOLUTION

# Test
test_data = np.array([10, 20, 30, 40, 50])
result = normalize(test_data)
print("Input:", test_data)
print("Output:", result)
print("Expected: [0.   0.25 0.5  0.75 1.  ]")
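
The formula divides by `max - min`, which is zero when every value in the array is the same. One possible way to handle that case is sketched below; `safe_normalize` is a hypothetical helper, and returning zeros for a constant array is an assumption rather than a fixed convention:

```python
import numpy as np

def safe_normalize(arr):
    """Min-max normalize to [0, 1]; return zeros if all values are equal."""
    arr = np.asarray(arr, dtype=float)
    value_range = arr.max() - arr.min()
    if value_range == 0:
        # Assumption: map a constant array to zeros instead of dividing by zero
        return np.zeros_like(arr)
    return (arr - arr.min()) / value_range

print(safe_normalize([10, 20, 30, 40, 50]))
print(safe_normalize([7, 7, 7]))
```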

## 4.2 Standardization - Demonstration

Standardization (z-score scaling) rescales data to a mean of 0 and a standard deviation of 1:

```
standardized = (x - mean) / std
```

Run the cell below:

In [None]:
data = np.array([10, 20, 30, 40, 50])
print("Original:", data)
print("Mean:", data.mean(), "Std:", data.std())

standardized = (data - data.mean()) / data.std()
print("\nStandardized:", standardized)
print("New mean:", standardized.mean().round(10))
print("New std:", standardized.std())
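
One thing to watch when moving between libraries: NumPy's `.std()` computes the population standard deviation (`ddof=0`) by default, while a Pandas Series' `.std()` computes the sample standard deviation (`ddof=1`). The sketch below (with the illustrative name `values_demo`) shows that the two defaults give different numbers for the same data:

```python
import numpy as np
import pandas as pd

values_demo = np.array([10, 20, 30, 40, 50])

population_std = values_demo.std()           # NumPy default: ddof=0
sample_std = pd.Series(values_demo).std()    # pandas default: ddof=1

print("NumPy std (ddof=0): ", population_std)
print("pandas std (ddof=1):", sample_std)
print("NumPy with ddof=1:  ", values_demo.std(ddof=1))
```

Standardizing with the `ddof=1` value gives a result whose NumPy `.std()` is slightly below 1, so it helps to use the same convention throughout.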

## Exercise 4.2: Write a Standardize Function - SOLUTION

Complete the `standardize` function below. Replace the `return None` line with:

```python
return (arr - arr.mean()) / arr.std()
```

In [None]:
def standardize(arr):
    """Standardize array to mean=0, std=1"""
    return (arr - arr.mean()) / arr.std()  # SOLUTION

# Test
test_data = np.array([10, 20, 30, 40, 50])
result = standardize(test_data)
print("Input:", test_data)
print("Output:", result)
if result is not None:
    print(f"Mean: {result.mean():.4f} (should be ~0)")
    print(f"Std: {result.std():.4f} (should be ~1)")

## 4.3 One-Hot Encoding - Demonstration

One-hot encoding converts a categorical column into one binary indicator column per category.

Use `pd.get_dummies(df, columns=['column_name'])` to encode the listed columns.

Run the cell below:

In [None]:
sample = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['NYC', 'LA', 'NYC']
})
print("Before:")
print(sample)

encoded = pd.get_dummies(sample, columns=['city'])
print("\nAfter one-hot encoding:")
print(encoded)
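
Depending on the Pandas version, the indicator columns may print as `True`/`False` rather than `1`/`0`. If integer columns are preferred, `get_dummies` accepts a `dtype` argument. A minimal sketch using the same three-row example under the illustrative name `cities_demo`:

```python
import pandas as pd

cities_demo = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['NYC', 'LA', 'NYC'],
})

# dtype=int requests 0/1 integer indicator columns
encoded_demo = pd.get_dummies(cities_demo, columns=['city'], dtype=int)
print(encoded_demo)
print("Columns:", list(encoded_demo.columns))
```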

## Exercise 4.3: One-Hot Encode the Customer Data - SOLUTION

One-hot encode the `region` column in the customers DataFrame. Replace `None` with:

```python
pd.get_dummies(customers, columns=['region'])
```

In [None]:
print("Original columns:", list(customers.columns))

# One-hot encode the region column
customers_encoded = pd.get_dummies(customers, columns=['region'])  # SOLUTION

if customers_encoded is not None:
    print("\nNew columns:", list(customers_encoded.columns))
    print("\nFirst 3 rows:")
    print(customers_encoded.head(3))

---
# Lab Complete!

## Summary

You learned:
- **NumPy**: Create arrays, calculate statistics, reshape, slice
- **Pandas**: Create DataFrames, filter, group, aggregate
- **Matplotlib**: Histograms, scatter plots, bar charts
- **Preprocessing**: Normalize, standardize, one-hot encode

## Quick Reference

```python
# NumPy
np.array([1,2,3])       # Create array
np.zeros((3,4))         # 3x4 zeros
np.mean(arr)            # Average
arr.reshape(2,3)        # Reshape

# Pandas
df['col'].mean()        # Column average
df[df['col'] > 5]       # Filter
df.groupby('a')['b'].mean()  # Group

# Matplotlib
plt.hist(data)          # Histogram
plt.scatter(x, y)       # Scatter
series.plot(kind='bar') # Bar chart
```