# Lab 1: Python for Data Science

**Duration:** 60-90 minutes | **Difficulty:** Beginner

---

## Overview

This lab introduces the essential Python libraries for data science: NumPy, Pandas, and Matplotlib.

### Lab Structure

| Part | Topic | Key Concepts |
|------|-------|---------------|
| **Part 1** | NumPy Arrays | Creating arrays, statistics, reshaping, slicing |
| **Part 2** | Pandas DataFrames | Creating DataFrames, filtering, grouping |
| **Part 3** | Data Visualization | Histograms, scatter plots, bar charts |
| **Part 4** | Data Preprocessing | Normalization, standardization, one-hot encoding |

### Instructions

- Read each markdown cell carefully
- Write your code in the empty code cells below each explanation
- Run cells with `Shift+Enter`

## Setup

Run the cell below to import the required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 6]
np.random.seed(42)

print("Setup complete! NumPy:", np.__version__, "| Pandas:", pd.__version__)

---
# Part 1: NumPy Arrays

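NumPy is the foundation of most data science work in Python. Its core object, the N-dimensional array (`ndarray`), supports fast operations on whole arrays at once instead of element-by-element loops.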

## 1.1 Creating Arrays

NumPy provides several functions to create arrays:

| Function | Description | Example |
|----------|-------------|----------|
| `np.array([...])` | Create from a Python list | `np.array([1, 2, 3])` |
| `np.zeros((rows, cols))` | Array filled with zeros | `np.zeros((2, 3))` |
| `np.ones((rows, cols))` | Array filled with ones | `np.ones((3, 3))` |
| `np.arange(n)` | Values from 0 to n-1 | `np.arange(10)` |
| `np.eye(n)` | n×n identity matrix | `np.eye(4)` |

**Your Task:** Create the following arrays:
- `arr_a`: Array containing values 10, 20, 30, 40, 50
- `arr_b`: A 4×4 array filled with zeros
- `arr_c`: Array with values from 0 to 19
- `arr_d`: A 5×5 identity matrix

Print each array to verify your results.

**Expected Output:**
```
arr_a: [10 20 30 40 50]
arr_b shape: (4, 4)
arr_c: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
arr_d shape: (5, 5)
```

**Sample Code:**
```python
# Creating arrays with different methods
my_list = np.array([1, 2, 3])       # From a list
my_zeros = np.zeros((2, 2))         # 2x2 zeros
my_range = np.arange(5)             # [0, 1, 2, 3, 4]
print("Array:", my_list)
print("Shape:", my_zeros.shape)
```

In [None]:
# Your code here


## 1.2 Array Statistics

NumPy provides functions to calculate statistics on arrays:

| Function | Description | Example |
|----------|-------------|----------|
| `np.mean(arr)` | Average value | `np.mean(scores)` |
| `np.max(arr)` | Maximum value | `np.max(scores)` |
| `np.min(arr)` | Minimum value | `np.min(scores)` |
| `np.sum(arr)` | Sum of all values | `np.sum(scores)` |
| `np.argmax(arr)` | Index of maximum | `np.argmax(scores)` |

**Your Task:** Given the following `temperatures` array (one reading per day):
```python
temperatures = np.array([72, 75, 68, 80, 85, 70, 60])
```

Calculate and print:
- `avg_temp`: The average temperature
- `max_temp`: The maximum temperature
- `min_temp`: The minimum temperature
- `hottest_day`: The index of the hottest day
- `temp_range`: The range (max minus min)

**Expected Output:**
```
Average: 72.85714285714286
Max: 85
Min: 60
Hottest day index: 4
Range: 25
```

**Sample Code:**
```python
# Calculating statistics on an array
scores = np.array([88, 92, 79, 95, 84])
average = np.mean(scores)
highest = np.max(scores)
best_idx = np.argmax(scores)
print("Average:", average)
print("Highest:", highest)
print("Best index:", best_idx)
```

In [None]:
# Your code here


## 1.3 Reshaping Arrays

You can change the shape of an array with `.reshape(rows, cols)`.

**Important:** The total number of elements must stay the same. For example, 12 elements can be arranged as 3×4, 4×3, or 2×6, but not as 5×3.

| Example | Description |
|---------|-------------|
| `arr.reshape(3, 4)` | Reshape to 3 rows, 4 columns |
| `arr.reshape(2, -1)` | 2 rows, auto-calculate columns |
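
As a small illustration of the two rows above, the sketch below uses a hypothetical array `values` (not part of the lab data) to show `-1` being inferred from the element count, and a shape with the wrong number of slots being rejected.

```python
import numpy as np

values = np.arange(12)              # 12 elements: 0..11 (illustrative data)

auto_cols = values.reshape(2, -1)   # -1 asks NumPy to infer the column count
print(auto_cols.shape)              # (2, 6)

try:
    values.reshape(5, 3)            # 15 slots for 12 elements: not allowed
except ValueError as err:
    print("Reshape failed:", err)
```

Only one dimension can be `-1` in a single call, since NumPy needs the other dimensions to work out the missing one.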

**Your Task:**
1. Create an array with values 0-19 using `np.arange(20)`
2. Reshape it to 4 rows and 5 columns
3. Print both the original and reshaped arrays

**Expected Output:**
```
Original: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
Reshaped (4x5):
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
```

**Sample Code:**
```python
# Reshaping a 1D array into 2D
arr = np.arange(12)              # [0, 1, 2, ..., 11]
matrix = arr.reshape(3, 4)       # 3 rows, 4 columns
print("Original:", arr)
print("Reshaped:")
print(matrix)
```

In [None]:
# Your code here


## 1.4 Array Slicing

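Select parts of a 2D array with the syntax `arr[rows, columns]`. A slice `start:stop` includes `start` but excludes `stop`, so `1:4` selects indices 1, 2, and 3: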

| Syntax | Description |
|--------|-------------|
| `arr[0, :]` | First row (all columns) |
| `arr[:, 0]` | First column (all rows) |
| `arr[:, -1]` | Last column |
| `arr[1:4, 1:4]` | Rows 1-3, columns 1-3 |

**Your Task:**
1. Create a 5×5 matrix using `np.arange(25).reshape(5, 5)`
2. Extract and print:
   - `first_row`: The first row
   - `last_col`: The last column
   - `center`: The center 3×3 subarray (rows 1-3, columns 1-3)

**Expected Output:**
```
Matrix:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

First row: [0 1 2 3 4]
Last column: [ 4  9 14 19 24]
Center 3x3:
[[ 6  7  8]
 [11 12 13]
 [16 17 18]]
```

**Sample Code:**
```python
# Slicing a 2D array
matrix = np.arange(16).reshape(4, 4)
row_0 = matrix[0, :]           # First row
col_2 = matrix[:, 2]           # Third column
subarray = matrix[1:3, 1:3]    # 2x2 block from middle
print("Row 0:", row_0)
print("Column 2:", col_2)
```

In [None]:
# Your code here


---
# Part 2: Pandas DataFrames

Pandas provides the DataFrame, a table with labeled rows and columns that works much like a spreadsheet in Python.

## 2.1 Creating DataFrames

Create a DataFrame from a dictionary: each key becomes a column name, and each list holds that column's values (all lists must have the same length):

```python
data = {
    'column1': [value1, value2, value3],
    'column2': [value1, value2, value3]
}
df = pd.DataFrame(data)
```
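
The template above uses placeholders, so it will not run as written. A concrete version might look like the following sketch, where the `inventory` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each key becomes a column, each list becomes that column's values
inventory = pd.DataFrame({
    'item': ['pen', 'notebook', 'eraser'],
    'price': [1.5, 4.0, 0.75],
    'in_stock': [120, 35, 60],
})

print(inventory)
print("Shape:", inventory.shape)             # (rows, columns)
print("Columns:", list(inventory.columns))
```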

**Your Task:** Create a DataFrame called `students` with:
- Column `name`: ['Alice', 'Bob', 'Charlie', 'Diana']
- Column `age`: [22, 25, 23, 24]
- Column `grade`: [85, 92, 78, 95]

Print the DataFrame and its shape using `students.shape`.

**Expected Output:**
```
      name  age  grade
0    Alice   22     85
1      Bob   25     92
2  Charlie   23     78
3    Diana   24     95

Shape: (4, 3)
```

In [None]:
# Your code here


## 2.2 Setup: Customer Dataset

Run the cell below to create a customer dataset we'll use for the remaining exercises.

In [None]:
np.random.seed(42)
n = 100

customers = pd.DataFrame({
    'customer_id': range(1, n+1),
    'age': np.random.randint(18, 65, n),
    'income': np.random.normal(50000, 15000, n).astype(int),
    'years_customer': np.random.randint(1, 15, n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'purchased': np.random.choice([0, 1], n, p=[0.6, 0.4])
})

print("Customer dataset created!")
print(f"Shape: {customers.shape[0]} rows, {customers.shape[1]} columns")
print("\nFirst 5 rows:")
print(customers.head())

## 2.3 Exploring DataFrames

Useful methods for exploring and calculating statistics:

| Method | Description | Example |
|--------|-------------|----------|
| `df.head()` | First 5 rows | `customers.head()` |
| `df.describe()` | Summary statistics | `customers.describe()` |
| `df['column']` | Select a column | `customers['age']` |
| `df['column'].mean()` | Mean of column | `customers['age'].mean()` |
| `df['column'].max()` | Max of column | `customers['income'].max()` |
| `df['column'].sum()` | Sum of column | `customers['purchased'].sum()` |

**Your Task:** Using the `customers` DataFrame, calculate and print:
- `avg_age`: The average customer age
- `max_income`: The maximum income value
- `total_purchased`: The total number of customers who purchased (sum of the `purchased` column)

**Expected Output:**
```
Average age: 40.47
Max income: 89594
Total purchased: 42
```

**Sample Code:**
```python
# Getting statistics from a DataFrame column
avg_years = customers['years_customer'].mean()
min_id = customers['customer_id'].min()
print("Average years:", avg_years)
print("Min ID:", min_id)
```

In [None]:
# Your code here


## 2.4 Filtering Data

Filter rows using boolean conditions:

| Syntax | Description |
|--------|-------------|
| `df[df['column'] > value]` | Rows where column > value |
| `df[df['column'] == 'text']` | Rows where column equals text |
| `df[(cond1) & (cond2)]` | Multiple conditions with AND |
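
Each condition produces a boolean Series with one `True`/`False` per row, and the parentheses matter because `&` binds more tightly than comparisons such as `>` and `==`. The sketch below, on a made-up `orders` DataFrame, names each condition first and also shows `|` (OR) and `~` (NOT), which combine conditions in the same way.

```python
import pandas as pd

# Hypothetical data for illustration only
orders = pd.DataFrame({
    'amount': [20, 150, 75, 300, 45],
    'status': ['paid', 'paid', 'refunded', 'paid', 'refunded'],
})

is_large = orders['amount'] > 100        # boolean Series, one value per row
is_paid = orders['status'] == 'paid'

large_and_paid = orders[is_large & is_paid]        # both conditions hold
large_or_refunded = orders[is_large | ~is_paid]    # large, or not paid

print("Large and paid:", len(large_and_paid))          # 2
print("Large or refunded:", len(large_or_refunded))    # 4
```

Python's `and`/`or` keywords do not work element-wise on a Series, which is why the `&`, `|`, and `~` operators are used instead.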

**Your Task:** Create filtered DataFrames:
- `high_income`: Customers with income greater than 70000
- `east_region`: Customers in the 'East' region
- `young_buyers`: Customers under 30 who purchased (age < 30 AND purchased == 1)

Print the count of each using `len()`.

**Expected Output:**
```
High income count: 7
East region count: 29
Young buyers count: 5
```

**Sample Code:**
```python
# Filtering a DataFrame
seniors = customers[customers['age'] > 50]
long_term = customers[customers['years_customer'] >= 10]
north_buyers = customers[(customers['region'] == 'North') & (customers['purchased'] == 1)]
print("Seniors count:", len(seniors))
```

In [None]:
# Your code here


## 2.5 Grouping Data

Group data and calculate aggregates:

| Syntax | Description |
|--------|-------------|
| `df.groupby('col')['other'].mean()` | Average by group |
| `df.groupby('col')['other'].count()` | Count by group |
| `df.groupby('col')['other'].sum()` | Sum by group |
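
When the aggregated column contains only 0s and 1s, its mean is the fraction of rows equal to 1, so `.mean()` doubles as a rate. A minimal sketch on a hypothetical `visits` DataFrame:

```python
import pandas as pd

# Hypothetical data: 'bought' is 1 if the visit ended in a purchase, else 0
visits = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B'],
    'bought': [1, 0, 1, 0, 0],
})

rate_by_store = visits.groupby('store')['bought'].mean()
print(rate_by_store)   # store A: 2 of 3 visits bought; store B: 0 of 2
```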

**Your Task:** Calculate and print:
- `avg_income_by_region`: Average income for each region
- `count_by_region`: Number of customers in each region
- `purchase_rate`: Average of the `purchased` column per region (a purchase rate between 0 and 1)

**Expected Output:**
```
Average income by region:
region
East     51338.655172
North    49878.333333
South    52014.769231
West     49458.727273
Name: income, dtype: float64

Count by region:
region
East     29
North    24
South    26
West     21
```

**Sample Code:**
```python
# Grouping and aggregating data
avg_age_by_region = customers.groupby('region')['age'].mean()
total_by_region = customers.groupby('region')['customer_id'].count()
print("Average age by region:")
print(avg_age_by_region)
```

In [None]:
# Your code here


---
# Part 3: Data Visualization

Matplotlib is the most widely used plotting library for Python, and the Pandas `.plot()` methods are built on top of it.

## 3.1 Histograms

Histograms show the distribution of a single variable:

```python
plt.figure(figsize=(10, 5))
plt.hist(data, bins=15, edgecolor='black', alpha=0.7)
plt.xlabel('X Label')
plt.ylabel('Frequency')
plt.title('Title')
plt.show()
```
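
To see what `bins=15` means numerically, `np.histogram` computes the same kind of per-bin counts without drawing anything. The sketch below uses a made-up sample called `heights` and a separate seeded generator, so it does not disturb the `np.random.seed(42)` state used elsewhere in the lab.

```python
import numpy as np

rng = np.random.default_rng(0)             # independent generator for illustrative data
heights = rng.normal(170, 10, size=200)    # hypothetical sample of 200 values

counts, edges = np.histogram(heights, bins=15)

print("Bins:", len(counts))                # 15
print("Edges:", len(edges))                # 16, one more than the number of bins
print("Total counted:", counts.sum())      # 200, every value lands in exactly one bin
```

Fewer bins give a smoother but coarser picture; more bins show finer detail but can look noisy on small samples.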

**Your Task:** Create a histogram of customer ages:
- Use `customers['age']` as the data
- Use 15 bins
- Add appropriate labels and title

**Expected Output:** A histogram plot showing the distribution of customer ages across 15 bins.

In [None]:
# Your code here


## 3.2 Scatter Plots

Scatter plots show the relationship between two variables:

```python
plt.figure(figsize=(10, 6))
plt.scatter(x_data, y_data, alpha=0.5)
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.title('Title')
plt.show()
```

**Your Task:** Create a scatter plot showing age vs income:
- X-axis: `customers['age']`
- Y-axis: `customers['income']`
- Add appropriate labels and title

**Expected Output:** A scatter plot with 100 points showing the relationship between customer age and income.

In [None]:
# Your code here


## 3.3 Bar Charts

Bar charts compare values across categories. Use `.plot(kind='bar')` on a Pandas Series:

```python
data_series = df.groupby('category')['value'].mean()

plt.figure(figsize=(8, 5))
data_series.plot(kind='bar', color='steelblue', edgecolor='black')
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Title')
plt.xticks(rotation=0)
plt.show()
```

**Your Task:** Create a bar chart showing average income by region:
1. First calculate average income per region using groupby
2. Then create the bar chart with appropriate labels

**Expected Output:** A bar chart with 4 bars (East, North, South, West) showing average income around $49,000-$52,000.

In [None]:
# Your code here


---
# Part 4: Data Preprocessing

Before using data in machine learning, we often need to preprocess it.

## 4.1 Normalization

Min-max normalization rescales values to the range [0, 1], mapping the smallest value to 0 and the largest to 1:

```
normalized = (x - min) / (max - min)
```

In NumPy: `(arr - arr.min()) / (arr.max() - arr.min())`
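
The formula divides by `max - min`, which is zero when every value is the same. The sketch below wraps the formula in a hypothetical helper, `min_max_scale` (not a NumPy function), and returns all zeros in that case; that fallback is an assumption made for illustration, and other conventions are possible.

```python
import numpy as np

def min_max_scale(arr):
    """Scale values to [0, 1]. Hypothetical helper, not part of NumPy."""
    arr = np.asarray(arr, dtype=float)
    span = arr.max() - arr.min()
    if span == 0:
        # Every value is identical, so the formula would divide by zero
        return np.zeros_like(arr)
    return (arr - arr.min()) / span

weights = np.array([2, 4, 6, 10])     # illustrative data
print(min_max_scale(weights))         # [0.   0.25 0.5  1.  ]
print(min_max_scale([7, 7, 7]))       # [0. 0. 0.]
```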

**Your Task:** 
1. Create an array: `data = np.array([10, 20, 30, 40, 50])`
2. Normalize it using the formula above
3. Print both original and normalized arrays

**Expected Output:**
```
Original: [10 20 30 40 50]
Normalized: [0.   0.25 0.5  0.75 1.  ]
```

In [None]:
# Your code here


## 4.2 Standardization

Standardization (also called z-scoring) rescales data to have a mean of 0 and a standard deviation of 1:

```
standardized = (x - mean) / std
```

In NumPy: `(arr - arr.mean()) / arr.std()`
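
One detail worth knowing: NumPy's `.std()` divides by n by default (`ddof=0`), while the Pandas `Series.std()` divides by n - 1 (`ddof=1`), so the two give slightly different results on the same data. The sketch below, on a made-up `readings` array, shows the difference and confirms that the NumPy formula yields a mean of 0 and a standard deviation of 1.

```python
import numpy as np
import pandas as pd

readings = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # illustrative data

z_scores = (readings - readings.mean()) / readings.std()   # NumPy default: ddof=0

print("NumPy std (ddof=0):", readings.std())                # 2.0
print("Pandas std (ddof=1):", pd.Series(readings).std())    # slightly larger
print("Mean of z-scores:", round(z_scores.mean(), 10))
print("Std of z-scores:", round(z_scores.std(), 10))
```

If you standardize a DataFrame column with Pandas methods instead, pass `ddof=0` to `.std()` to match the NumPy result.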

**Your Task:**
1. Create an array: `data = np.array([10, 20, 30, 40, 50])`
2. Standardize it using the formula above
3. Print the standardized array
4. Verify: print the mean (should be ~0) and std (should be ~1) of the result

**Expected Output:**
```
Original: [10 20 30 40 50]
Standardized: [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
Mean: 0.0
Std: 1.0
```

In [None]:
# Your code here


## 4.3 One-Hot Encoding

One-hot encoding converts a categorical variable into one binary column per category:

```python
encoded_df = pd.get_dummies(df, columns=['column_name'])
```

This replaces a column like `region`, with values ['North', 'South', 'East', 'West'], with four binary columns named after the categories and ordered alphabetically: `region_East`, `region_North`, `region_South`, `region_West`. In each row, exactly one of them is set.
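
A minimal sketch on a hypothetical `pets` DataFrame shows the resulting layout. The `dtype=int` argument is an optional choice made here to get 0/1 values; without it, recent Pandas versions (2.0 and later) return `True`/`False` columns instead.

```python
import pandas as pd

# Hypothetical data for illustration only
pets = pd.DataFrame({
    'name': ['Rex', 'Tom', 'Bubbles'],
    'species': ['dog', 'cat', 'fish'],
})

# dtype=int requests 0/1 columns instead of True/False
pets_encoded = pd.get_dummies(pets, columns=['species'], dtype=int)

print(list(pets_encoded.columns))
# ['name', 'species_cat', 'species_dog', 'species_fish']
print(pets_encoded)
```

The original `species` column is dropped, untouched columns keep their position, and the new columns are appended at the end.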

**Your Task:**
1. One-hot encode the `region` column in the `customers` DataFrame
2. Store the result in `customers_encoded`
3. Print the new column names to see the result

**Expected Output:**
```
Original columns: ['customer_id', 'age', 'income', 'years_customer', 'region', 'purchased']
New columns: ['customer_id', 'age', 'income', 'years_customer', 'purchased', 'region_East', 'region_North', 'region_South', 'region_West']
```

In [None]:
# Your code here


---
# Lab Complete!

## Summary

You learned:
- **NumPy**: Create arrays, calculate statistics, reshape, slice
- **Pandas**: Create DataFrames, filter, group, aggregate
- **Matplotlib**: Histograms, scatter plots, bar charts
- **Preprocessing**: Normalize, standardize, one-hot encode