# Lab 1: Python for Data Science

## Welcome to Your First Lab!

**Duration:** 60 minutes | **Difficulty:** Beginner | **Prerequisites:** Basic Python syntax

---

### What You Will Learn

In this lab, you will practice with the essential Python libraries for data science and machine learning:

1. **NumPy** - The foundation of scientific computing (creating arrays, mathematical operations)
2. **Pandas** - Data manipulation and analysis with DataFrames
3. **Matplotlib** - Creating visualizations to understand your data
4. **Data Preprocessing** - Preparing data for machine learning

### How This Lab Works

1. **Read the instructions** in each markdown cell carefully
2. **Complete the code** in the code cells where you see `# YOUR CODE HERE`
3. **Run each cell** using Shift+Enter or the ▶ button
4. **Check your output** against the expected results provided
5. **Don't hesitate to experiment** - you can always reset your notebook!

### Keyboard Shortcuts

| Shortcut | Action |
|----------|--------|
| Shift+Enter | Run cell and move to next |
| Ctrl+Enter | Run cell and stay |
| Ctrl+S | Save notebook |
| Esc | Enter command mode |
| Enter | Enter edit mode |

### Grading

This lab is self-paced. Compare your solutions with the solution notebook after completing each section.

---

**Let's begin!** Start by running the setup cell below.

## Setup: Import Required Libraries

Run this cell first to import all the libraries we'll use. If it succeeds, you should see a confirmation message followed by the NumPy and Pandas version numbers.

In [None]:
# Run this cell first - DO NOT MODIFY
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

print("✓ All libraries imported successfully!")
print(f"  NumPy version: {np.__version__}")
print(f"  Pandas version: {pd.__version__}")

---

## Exercise 1: NumPy Fundamentals

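NumPy (Numerical Python) is the foundation of nearly all data science and ML in Python. Its arrays store values of a single type in contiguous memory and run operations in compiled code, which often makes numerical work 10-100x faster than equivalent loops over Python lists.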
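As a rough illustration of the difference in style, the sketch below computes the same element-wise result with a Python list and with a NumPy array. The variable names are made up for this example, and the actual speedup depends on array size and hardware, so no timing is claimed here.

```python
import numpy as np

values = list(range(1_000))

# Pure Python: visit every element explicitly
squares_list = [v * v for v in values]

# NumPy: one vectorized expression, executed in compiled code
arr = np.array(values)
squares_arr = arr ** 2

# Both approaches produce the same numbers
print(squares_list[:5])                           # [0, 1, 4, 9, 16]
print(squares_arr[:5])                            # [ 0  1  4  9 16]
print(np.array_equal(squares_arr, squares_list))  # True
```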

### Task 1.1: Create Different Types of Arrays

**Your Goal:** Complete the `create_arrays()` function to create and return:

1. `arr_1d` - A 1D array containing [1, 2, 3, 4, 5]
2. `arr_zeros` - A 3x3 array filled with zeros
3. `arr_ones` - A 3x3 array filled with ones
4. `arr_range` - An array with values from 0 to 9
5. `arr_identity` - A 3x3 identity matrix

**Hints:**
- Use `np.array([...])` to create an array from a list
- Use `np.zeros((rows, cols))` for zeros
- Use `np.ones((rows, cols))` for ones
- Use `np.arange(n)` for a range of values
- Use `np.eye(n)` for an identity matrix
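The constructors in the hints can be tried on other shapes first. This is only a sketch with illustrative names and sizes (not the ones the task asks for), showing that the shape is passed as a tuple and that `np.arange` excludes its stop value.

```python
import numpy as np

# Illustrative shapes only (not the ones asked for in the task)
zeros_2x4 = np.zeros((2, 4))   # the shape is passed as a tuple
eye_2 = np.eye(2)              # 2x2 identity matrix
first_three = np.arange(3)     # 0, 1, 2 (the stop value is excluded)

print(zeros_2x4.shape, zeros_2x4.dtype)   # (2, 4) float64
print(eye_2)
print(first_three)                        # [0 1 2]
```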

**Expected Output:**
```
1D Array: [1 2 3 4 5]

Identity matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
```

In [None]:
def create_arrays():
    """Create different types of NumPy arrays.
    
    Returns:
        tuple: (arr_1d, arr_zeros, arr_ones, arr_range, arr_identity)
    """
    # YOUR CODE HERE - Create a 1D array [1, 2, 3, 4, 5]
    arr_1d = None
    
    # YOUR CODE HERE - Create a 3x3 array of zeros
    arr_zeros = None
    
    # YOUR CODE HERE - Create a 3x3 array of ones
    arr_ones = None
    
    # YOUR CODE HERE - Create an array with values 0-9
    arr_range = None
    
    # YOUR CODE HERE - Create a 3x3 identity matrix
    arr_identity = None
    
    return arr_1d, arr_zeros, arr_ones, arr_range, arr_identity

# Test your function
arr_1d, arr_zeros, arr_ones, arr_range, arr_identity = create_arrays()
print("1D Array:", arr_1d)
print("\nIdentity matrix:")
print(arr_identity)

### Task 1.2: Array Statistics

**Your Goal:** Complete the `array_operations()` function to calculate statistics on an array.

**What to implement:** Given an input array, return a dictionary with:
- `'mean'` - The average value (use `np.mean()`)
- `'std'` - The standard deviation (use `np.std()`)
- `'max'` - The maximum value (use `np.max()`)
- `'min_index'` - The INDEX of the minimum value (use `np.argmin()`)
- `'sum'` - The sum of all values (use `np.sum()`)

**Expected Output** (values only; NumPy 2.x prints each number wrapped in its type, for example `np.float64(5.5)`):
```
Array stats: {'mean': 5.5, 'std': 2.87..., 'max': 10, 'min_index': 0, 'sum': 55}
```
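The difference between a minimum *value* and the *index* of that minimum is easy to mix up. A minimal sketch on made-up readings (the array `temps` is an assumption for illustration):

```python
import numpy as np

temps = np.array([21.5, 19.0, 23.2, 18.4, 20.1])  # made-up readings

lowest_value = np.min(temps)      # the smallest value itself
lowest_index = np.argmin(temps)   # the position of that value

print(lowest_value)                         # 18.4
print(lowest_index)                         # 3
print(temps[lowest_index] == lowest_value)  # True
```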

In [None]:
def array_operations(arr):
    """Perform basic array operations.
    
    Args:
        arr: A NumPy array
    
    Returns:
        dict: Statistics about the array
    """
    return {
        # YOUR CODE HERE - Calculate the mean
        'mean': None,
        
        # YOUR CODE HERE - Calculate the standard deviation
        'std': None,
        
        # YOUR CODE HERE - Find the maximum value
        'max': None,
        
        # YOUR CODE HERE - Find the INDEX of the minimum value
        'min_index': None,
        
        # YOUR CODE HERE - Calculate the sum
        'sum': None
    }

# Test your function
test_arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print("Test array:", test_arr)
print("Array stats:", array_operations(test_arr))

---

## Exercise 2: Array Reshaping and Slicing

Reshaping arrays and selecting subsets (slicing) are fundamental skills for data manipulation.

### Task 2.1: Reshape Arrays

**Your Goal:** Complete the `reshape_and_slice()` function.

Starting with `arr = np.arange(12)` (values 0-11), create:
1. `reshaped_3x4` - Reshape to 3 rows, 4 columns
2. `reshaped_4x3` - Reshape to 4 rows, 3 columns
3. `reshaped_3d` - Reshape to 3D: 2 x 2 x 3

**Hint:** Use `arr.reshape(rows, cols)` or `arr.reshape(dim1, dim2, dim3)`

**Expected Output for 3x4:**
```
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
```
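A reshape only works when the new shape holds exactly as many elements as the original. The sketch below uses a smaller, illustrative array to show a valid reshape, the `-1` placeholder that lets NumPy infer one dimension, and the error raised on a mismatch.

```python
import numpy as np

arr = np.arange(6)              # [0 1 2 3 4 5]

two_rows = arr.reshape(2, 3)    # 2 * 3 == 6, so this works
auto_cols = arr.reshape(3, -1)  # -1 lets NumPy infer the missing dimension

print(two_rows.shape)   # (2, 3)
print(auto_cols.shape)  # (3, 2)

# A shape whose product differs from the element count raises ValueError
try:
    arr.reshape(4, 2)
except ValueError as err:
    print("Cannot reshape:", err)
```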

In [None]:
def reshape_and_slice():
    """Practice array reshaping.
    
    Returns:
        tuple: (reshaped_3x4, reshaped_4x3, reshaped_3d)
    """
    arr = np.arange(12)  # [0, 1, 2, ..., 11]
    
    # YOUR CODE HERE - Reshape to 3 rows, 4 columns
    reshaped_3x4 = None
    
    # YOUR CODE HERE - Reshape to 4 rows, 3 columns
    reshaped_4x3 = None
    
    # YOUR CODE HERE - Reshape to 3D: 2 x 2 x 3
    reshaped_3d = None
    
    return reshaped_3x4, reshaped_4x3, reshaped_3d

# Test your function
r1, r2, r3 = reshape_and_slice()
print("3x4 reshape:")
print(r1)
print("\n4x3 reshape:")
print(r2)

### Task 2.2: Array Slicing

**Your Goal:** Complete the `array_slicing()` function to extract parts of a 2D array.

Given a 2D array, extract:
1. `first_row` - The entire first row
2. `last_col` - The entire last column
3. `top_left` - A 2x2 subarray from the top-left corner
4. `every_other` - Every other element from the first row

**Slicing Syntax:**
- `arr[row, col]` - Single element
- `arr[row, :]` - Entire row
- `arr[:, col]` - Entire column
- `arr[r1:r2, c1:c2]` - Subarray
- `arr[0, ::2]` - Every other element (step=2)
- `arr[:, -1]` - Last column (negative indexing)

**Expected Output:**
```
First row: [0 1 2 3]
Last column: [ 3  7 11 15]
```
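The slicing patterns above can be tried on a smaller array before tackling the task. This sketch uses an illustrative 3x3 `grid`. It also shows one behavior worth knowing: a basic slice is a view, so writing to it changes the original array.

```python
import numpy as np

grid = np.arange(9).reshape(3, 3)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]

print(grid[1, :])    # middle row: [3 4 5]
print(grid[:, 0])    # first column: [0 3 6]
print(grid[1:, 1:])  # rows 1-2, columns 1-2: [[4 5], [7 8]]
print(grid[2, ::2])  # every other element of the last row: [6 8]

# Basic slices are views: modifying the slice modifies `grid` too
row_view = grid[0, :]
row_view[0] = 99
print(grid[0, 0])    # 99
```

Call `.copy()` on a slice when you need an independent array.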

In [None]:
def array_slicing(arr):
    """Practice array slicing on a 2D array.
    
    Args:
        arr: A 2D NumPy array
    
    Returns:
        tuple: (first_row, last_col, top_left, every_other)
    """
    # YOUR CODE HERE - Get the entire first row
    first_row = None
    
    # YOUR CODE HERE - Get the entire last column
    last_col = None
    
    # YOUR CODE HERE - Get top-left 2x2 subarray
    top_left = None
    
    # YOUR CODE HERE - Get every other element from first row
    every_other = None
    
    return first_row, last_col, top_left, every_other

# Test your function
test_matrix = np.arange(16).reshape(4, 4)
print("Test matrix:")
print(test_matrix)
print()

first_row, last_col, top_left, every_other = array_slicing(test_matrix)
print("First row:", first_row)
print("Last column:", last_col)
print("Top-left 2x2:")
print(top_left)
print("Every other (row 0):", every_other)

---

## Exercise 3: Pandas DataFrames

Pandas is the go-to library for data manipulation. A DataFrame is a labeled table of rows and columns, much like a spreadsheet you can manipulate with Python code.

### Setup: Create Sample Dataset

Run this cell to create a sample dataset we'll use for the exercises.

In [None]:
# Run this cell - DO NOT MODIFY
# This creates a sample dataset about customers

np.random.seed(42)
n_samples = 100

data = {
    'age': np.random.randint(18, 65, n_samples),
    'income': np.random.normal(50000, 15000, n_samples).astype(int),
    'education_years': np.random.randint(8, 22, n_samples),
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n_samples),
    'purchased': np.random.choice([0, 1], n_samples, p=[0.6, 0.4])
}

df = pd.DataFrame(data)

print("Dataset created! Here are the first 5 rows:")
print(df.head())
print(f"\nDataset shape: {df.shape[0]} rows, {df.shape[1]} columns")

### Task 3.1: Explore the DataFrame

**Your Goal:** Complete `explore_dataframe()` to return basic information about a DataFrame.

**What to return:**
1. `stats` - Summary statistics (use `df.describe()`)
2. `dtypes` - Data types of each column (use `df.dtypes`)
3. `missing` - Count of missing values per column (use `df.isnull().sum()`)
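The lab dataset has no missing values, so the missing-value count will be all zeros. To see what these calls report when data is incomplete, here is a small sketch on a made-up frame (the `sample` DataFrame and its columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one missing entry per column
sample = pd.DataFrame({
    "height_cm": [170.0, np.nan, 182.5],
    "name": ["Ana", "Ben", None],
})

print(sample.dtypes)          # height_cm is float64; name holds strings
print(sample.isnull().sum())  # one missing value in each column
print(sample.describe())      # summarizes numeric columns only by default
```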

In [None]:
def explore_dataframe(df):
    """Get basic information about a DataFrame.
    
    Args:
        df: A Pandas DataFrame
    
    Returns:
        tuple: (stats, dtypes, missing)
    """
    # YOUR CODE HERE - Get summary statistics
    stats = None
    
    # YOUR CODE HERE - Get data types
    dtypes = None
    
    # YOUR CODE HERE - Count missing values
    missing = None
    
    return stats, dtypes, missing

# Test your function
stats, dtypes, missing = explore_dataframe(df)
print("Summary Statistics:")
print(stats)
print("\nData Types:")
print(dtypes)
print("\nMissing Values:")
print(missing)

### Task 3.2: Filter and Group Data

**Your Goal:** Complete `filter_and_group()` to filter and aggregate data.

**What to return:**
1. `older_than_30` - All rows where age > 30
2. `high_income_buyers` - Rows where income > 60000 AND purchased == 1
3. `income_by_city` - Average income grouped by city
4. `purchases_by_city` - Total purchases grouped by city

**Syntax hints:**
- Filter: `df[df['column'] > value]`
- Multiple conditions: `df[(condition1) & (condition2)]`
- Group and aggregate: `df.groupby('column')['other'].mean()`
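The syntax hints can be sketched on a small, made-up table first. The `orders` frame and its columns are assumptions for illustration and are unrelated to the lab dataset.

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east"],
    "amount": [120, 80, 200, 50, 90],
    "returned": [0, 1, 0, 0, 1],
})

# Each condition needs its own parentheses; combine with & (not `and`)
big_kept = orders[(orders["amount"] > 100) & (orders["returned"] == 0)]
print(len(big_kept))          # 2

# groupby returns one aggregated value per group, indexed by the group key
mean_amount = orders.groupby("region")["amount"].mean()
print(mean_amount["west"])    # 65.0
```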

In [None]:
def filter_and_group(df):
    """Filter and group DataFrame operations.
    
    Args:
        df: A Pandas DataFrame
    
    Returns:
        tuple: (older_than_30, high_income_buyers, income_by_city, purchases_by_city)
    """
    # YOUR CODE HERE - Filter rows where age > 30
    older_than_30 = None
    
    # YOUR CODE HERE - Filter rows where income > 60000 AND purchased == 1
    high_income_buyers = None
    
    # YOUR CODE HERE - Group by city and calculate mean income
    income_by_city = None
    
    # YOUR CODE HERE - Group by city and sum purchases
    purchases_by_city = None
    
    return older_than_30, high_income_buyers, income_by_city, purchases_by_city

# Test your function
older, high_buyers, income_city, purch_city = filter_and_group(df)
print(f"People older than 30: {len(older)} rows")
print(f"High income buyers: {len(high_buyers)} rows")
print("\nAverage income by city:")
print(income_city)
print("\nTotal purchases by city:")
print(purch_city)

---

## Exercise 4: Data Visualization

Visualization helps you understand patterns in your data. We'll use Matplotlib.

### Task 4.1: Create a Histogram

**Your Goal:** Complete `create_histogram()` to visualize the distribution of a column.

**What to do:**
1. Use `plt.hist(df[column], bins=20, edgecolor='black', alpha=0.7)`
2. Add x-label, y-label ('Frequency'), and title
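A histogram splits the value range into equal-width bins and counts how many values fall into each. The counting step can be sketched without any plotting by calling `np.histogram`, which returns the counts and bin edges directly. The `scores` data below is made up for illustration.

```python
import numpy as np

scores = np.array([1, 2, 2, 3, 3, 3, 4, 9])  # made-up data

# 4 equal-width bins spanning min..max, i.e. edges 1, 3, 5, 7, 9
counts, edges = np.histogram(scores, bins=4)

print(edges)   # [1. 3. 5. 7. 9.]
print(counts)  # [3 4 0 1]
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.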

In [None]:
def create_histogram(df, column):
    """Create a histogram of a DataFrame column.
    
    Args:
        df: A Pandas DataFrame
        column: Name of the column to plot
    """
    # YOUR CODE HERE - Create histogram
    # Use: plt.hist(df[column], bins=20, edgecolor='black', alpha=0.7)
    
    # YOUR CODE HERE - Add x-label (the column name)
    
    # YOUR CODE HERE - Add y-label ('Frequency')
    
    # YOUR CODE HERE - Add title (f'Distribution of {column}')
    pass

# Test your function
plt.figure(figsize=(8, 5))
create_histogram(df, 'income')
plt.show()

### Task 4.2: Create a Scatter Plot

**Your Goal:** Complete `create_scatter_plot()` to show the relationship between two variables.

**What to do:**
1. Use `plt.scatter(df[x_col], df[y_col], alpha=0.6)`
2. If `color_col` is provided, use it to color the points
3. Add axis labels and title
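For comparison, here is a minimal sketch of a colored scatter plot on made-up data. It uses Matplotlib's object-oriented interface (`fig, ax = plt.subplots()`) rather than the `plt.*` functions the task uses; both produce the same kind of plot. All names and values are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])
group = np.array([0, 1, 0, 1, 1])  # made-up labels used only for coloring

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(x, y, c=group, cmap="viridis", alpha=0.6)
fig.colorbar(points, ax=ax, label="group")  # key for the color scale
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("y vs. x, colored by group")
plt.show()
```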

In [None]:
def create_scatter_plot(df, x_col, y_col, color_col=None):
    """Create a scatter plot.
    
    Args:
        df: A Pandas DataFrame
        x_col: Column for x-axis
        y_col: Column for y-axis
        color_col: Optional column to color points by
    """
    # YOUR CODE HERE - Create scatter plot
    # If color_col is provided, also pass: c=df[color_col], cmap='viridis'
    # Use alpha=0.6 in both cases
    
    # YOUR CODE HERE - Add colorbar if color_col was used
    # Use: plt.colorbar(label=color_col)
    
    # YOUR CODE HERE - Add x-label, y-label, and title
    pass

# Test your function
plt.figure(figsize=(8, 5))
create_scatter_plot(df, 'age', 'income', 'purchased')
plt.show()

### Task 4.3: Create a Bar Chart

**Your Goal:** Complete `create_bar_chart()` to compare values across categories.

**What to do:**
1. Group by `category_col` and calculate mean of `value_col`
2. Use `.plot(kind='bar', edgecolor='black')` to create the chart
3. Add labels and title

In [None]:
def create_bar_chart(df, category_col, value_col):
    """Create a bar chart showing mean values by category.
    
    Args:
        df: A Pandas DataFrame
        category_col: Column to group by (x-axis)
        value_col: Column to average (y-axis)
    """
    # YOUR CODE HERE - Group by category and calculate mean
    # grouped = df.groupby(category_col)[value_col].mean()
    
    # YOUR CODE HERE - Create bar chart
    # grouped.plot(kind='bar', edgecolor='black')
    
    # YOUR CODE HERE - Add labels and title
    # Don't forget: plt.xticks(rotation=45) for readable labels
    pass

# Test your function
plt.figure(figsize=(8, 5))
create_bar_chart(df, 'city', 'income')
plt.tight_layout()
plt.show()

---

## Exercise 5: Data Preprocessing

Before feeding data to ML models, we need to preprocess it. Common steps include normalization, standardization, and encoding categorical variables.

### Task 5.1: Normalize Data (Min-Max Scaling)

**Your Goal:** Complete `normalize_column()` to scale values to [0, 1].

**Formula:** `normalized = (x - min) / (max - min)`

**Expected Output:**
```
Original: [10 20 30 40 50]
Normalized: [0.   0.25 0.5  0.75 1.  ]
```
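One edge case the formula does not cover: if every value is identical, `max - min` is zero and the division is undefined. The sketch below shows one possible guard; `min_max_safe` is a hypothetical helper, and returning zeros for constant input is a design choice, not a requirement of the task.

```python
import numpy as np

def min_max_safe(values):
    """Hypothetical helper: min-max scale, returning zeros for constant input."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are identical, so the formula would divide by zero
        return np.zeros_like(values)
    return (values - values.min()) / span

print(min_max_safe([2, 4, 6]))   # [0.  0.5 1. ]
print(min_max_safe([7, 7, 7]))   # [0. 0. 0.]
```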

In [None]:
def normalize_column(arr):
    """Min-max normalization to scale values to [0, 1].
    
    Args:
        arr: A NumPy array
    
    Returns:
        Normalized array with values between 0 and 1
    """
    # YOUR CODE HERE
    # Formula: (arr - arr.min()) / (arr.max() - arr.min())
    return None

# Test your function
test_data = np.array([10, 20, 30, 40, 50])
print("Original:", test_data)
print("Normalized:", normalize_column(test_data))

### Task 5.2: Standardize Data (Z-Score)

**Your Goal:** Complete `standardize_column()` to center data around 0 with std=1.

**Formula:** `standardized = (x - mean) / std`

**Expected Output:**
```
Standardized mean: 0.0000, std: 1.0000
```
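The expected standard deviation of exactly 1 relies on NumPy's default, which is the population standard deviation (`ddof=0`). Pandas defaults to the sample standard deviation (`ddof=1`), so the two libraries give different numbers on the same data. A short sketch on made-up values:

```python
import numpy as np
import pandas as pd

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.std(data))           # 2.0 (population std, ddof=0)
print(np.std(data, ddof=1))   # sample std, slightly larger
print(pd.Series(data).std())  # pandas defaults to ddof=1
```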

In [None]:
def standardize_column(arr):
    """Z-score standardization (mean=0, std=1).
    
    Args:
        arr: A NumPy array
    
    Returns:
        Standardized array
    """
    # YOUR CODE HERE
    # Formula: (arr - arr.mean()) / arr.std()
    return None

# Test your function
test_data = np.array([10, 20, 30, 40, 50])
standardized = standardize_column(test_data)
print("Original:", test_data)
print("Standardized:", standardized)
print(f"Standardized mean: {standardized.mean():.4f}, std: {standardized.std():.4f}")

### Task 5.3: One-Hot Encode Categorical Variables

**Your Goal:** Complete `one_hot_encode()` to convert categorical columns to binary columns.

**Why?** ML models work with numbers, not text. One-hot encoding converts categories like ['NYC', 'LA', 'Chicago'] into separate binary columns.

**Use:** `pd.get_dummies(df, columns=[column])`
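To see what the encoded result looks like, here is a sketch on a tiny, made-up frame (`pets` and its columns are assumptions for illustration). Passing `dtype=int` is optional; it is used here so the new columns hold 0/1, because the default dtype of the dummy columns differs between pandas versions.

```python
import pandas as pd

pets = pd.DataFrame({
    "animal": ["cat", "dog", "cat"],
    "age": [3, 5, 2],
})

# dtype=int yields 0/1 columns; the default dtype varies by pandas version
encoded_pets = pd.get_dummies(pets, columns=["animal"], dtype=int)

print(list(encoded_pets.columns))           # ['age', 'animal_cat', 'animal_dog']
print(encoded_pets["animal_cat"].tolist())  # [1, 0, 1]
```

The original `animal` column is replaced by one column per category, and the untouched columns keep their values.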

In [None]:
def one_hot_encode(df, column):
    """One-hot encode a categorical column.
    
    Args:
        df: A Pandas DataFrame
        column: Name of the categorical column to encode
    
    Returns:
        DataFrame with one-hot encoded columns
    """
    # YOUR CODE HERE
    # Use: pd.get_dummies(df, columns=[column])
    return None

# Test your function
encoded = one_hot_encode(df, 'city')
print("Original columns:", list(df.columns))
print("\nEncoded columns:", list(encoded.columns))
print("\nFirst 5 rows of encoded data:")
print(encoded.head())

---

## Checkpoint: Lab Complete!

### Congratulations!

You've completed Lab 1 and learned the essential tools for data science:

| Topic | Key Functions |
|-------|---------------|
| **NumPy Arrays** | `np.array()`, `np.zeros()`, `np.ones()`, `np.arange()`, `np.eye()` |
| **Array Stats** | `np.mean()`, `np.std()`, `np.max()`, `np.argmin()`, `np.sum()` |
| **Reshaping & Slicing** | `arr.reshape()`, slicing with `[row, col]` |
| **Pandas** | `df.describe()`, `df.groupby()`, `df[condition]` |
| **Visualization** | `plt.hist()`, `plt.scatter()`, `Series.plot(kind='bar')` |
| **Preprocessing** | Normalization, standardization, one-hot encoding |

### Next Steps

1. **Compare your solutions** with the solution notebook
2. **Practice more** - try different operations on the dataset
3. **Move on** to Lab 2: Machine Learning Basics

---

*Remember to save your work! (Ctrl+S)*