# Lesson 01: Python and Pandas Basics

**What you'll learn:**
- Load CSV files into DataFrames
- Select rows and columns
- Get basic statistics
- Handle missing values

---

## Section 1: What is a DataFrame?

### READ

A **DataFrame** is a table of data in Python, much like an Excel spreadsheet. It has labeled rows and columns.

- Each **row** is one data sample (like one person, one product, one network connection)
- Each **column** is one feature (like age, price, packet size)

We use the `pandas` library to work with DataFrames.

### TRY IT

In [None]:
# First, import pandas
import pandas as pd

# Load a dataset (tomato juice quality data)
df = pd.read_csv('../datasets/tomatjus.csv')

# See the first 5 rows
df.head()

### EXPLAIN

What happened:
- `pd.read_csv()` reads a CSV file and creates a DataFrame
- `df.head()` shows the first 5 rows (you can change this: `df.head(10)` for 10 rows)
- Each row is one tomato juice sample
- Each column is a measurement (acidity, sugar, etc.)
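
To experiment with `pd.read_csv()` and `head()` without a file on disk, here is a minimal sketch that reads CSV text from memory. The column names echo ones used later in this lesson, but the values are invented for illustration and are not taken from `tomatjus.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content: the values below are made up for this example.
csv_text = """pH,pulp,quality
3.4,9.5,5
3.6,11.0,6
3.2,8.7,5
3.7,12.1,7
"""

# read_csv accepts a file path or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

# head(n) returns only the first n rows
first_two = toy.head(2)
print(first_two)
print(toy.shape)  # (4, 3): 4 rows, 3 columns
```

The first line of the text becomes the column names, and every following line becomes one row.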

---

## Section 2: Exploring Your Data

### READ

Before doing any Machine Learning, you need to understand your data:
- How many rows and columns?
- What are the column names?
- Are there missing values?
- What do the numbers look like?

### TRY IT

In [None]:
# Shape: (rows, columns)
print("Shape:", df.shape)
print("This means", df.shape[0], "rows and", df.shape[1], "columns")

In [None]:
# Column names
print("Column names:")
print(df.columns.tolist())

In [None]:
# Data types and missing values
df.info()

In [None]:
# Basic statistics (min, max, mean, etc.)
df.describe()

### EXPLAIN

What we learned:
- `shape` gives (rows, columns) - our dataset has 1599 rows, 12 columns
- `columns.tolist()` shows all column names as a list
- `info()` shows data types and counts non-null (non-empty) values
- `describe()` gives statistics for the numeric columns: count, mean, std, min, 25%, 50%, 75%, max

**Tip:** Always start with these commands to get a feel for your data!
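
The commands above can be tried on a tiny table where the numbers are easy to check by hand. This is a sketch with an invented DataFrame (the `label` column is hypothetical and only there to show that `describe()` skips text columns by default).

```python
import pandas as pd

# Invented toy data: two numeric columns and one text column
toy = pd.DataFrame({
    "pH": [3.4, 3.6, 3.2, 3.7],
    "quality": [5, 6, 5, 7],
    "label": ["ok", "good", "ok", "good"],
})

print(toy.dtypes)  # data type of each column

# describe() summarizes numeric columns only, unless told otherwise
stats = toy.describe()
print(stats.columns.tolist())         # ['pH', 'quality']
print(stats.loc["mean", "quality"])   # 5.75
```

The result of `describe()` is itself a DataFrame, so you can pick out a single statistic with `.loc[row, column]`.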

---

## Section 3: Selecting Data

### READ

You often need to select specific columns or rows:
- Select one column: `df['column_name']`
- Select multiple columns: `df[['col1', 'col2']]`
- Filter rows by condition: `df[df['column'] > value]`

### TRY IT

In [None]:
# Select ONE column (returns a Series)
quality = df['quality']
print("Quality column:")
print(quality.head())
print("Type:", type(quality))

In [None]:
# Select MULTIPLE columns (returns a DataFrame)
subset = df[['pH', 'pulp', 'quality']]
print("Subset with 3 columns:")
print(subset.head())
print("Type:", type(subset))

In [None]:
# Filter rows by condition
# Get only rows where pH is greater than 3.5
high_ph = df[df['pH'] > 3.5]
print(f"Original rows: {len(df)}")
print(f"Rows with pH > 3.5: {len(high_ph)}")

In [None]:
# Combine conditions with & (and) or | (or)
# Get rows where pH > 3.5 AND pulp > 10
filtered = df[(df['pH'] > 3.5) & (df['pulp'] > 10)]
print(f"Rows with pH > 3.5 AND pulp > 10: {len(filtered)}")

### EXPLAIN

What we learned:
- Single brackets `['col']` with one column name gives a Series (one-dimensional)
- Double brackets `[['col1', 'col2']]` with a list gives a DataFrame
- Conditions inside brackets filter rows: `df[df['col'] > value]`
- Use `&` for AND and `|` for OR. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparisons such as `>`
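
The filtering rules above can be checked on a table small enough to verify by eye. This is a sketch on invented values. It also shows `~` for NOT, which is a standard pandas operator that this lesson does not otherwise cover.

```python
import pandas as pd

# Invented toy data with the default row labels 0, 1, 2, 3
toy = pd.DataFrame({
    "pH": [3.4, 3.6, 3.2, 3.7],
    "pulp": [9.5, 11.0, 12.5, 8.0],
})

# AND: both conditions must hold
both = toy[(toy["pH"] > 3.5) & (toy["pulp"] > 10)]

# OR: at least one condition must hold
either = toy[(toy["pH"] > 3.5) | (toy["pulp"] > 10)]

# NOT: ~ inverts a condition
not_high = toy[~(toy["pH"] > 3.5)]

print(both.index.tolist())      # [1]
print(either.index.tolist())    # [1, 2, 3]
print(not_high.index.tolist())  # [0, 2]
```

Filtered rows keep their original row labels, which is why the printed lists show which rows of `toy` matched.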

---

## Section 4: Handling Missing Values

### READ

Missing values are empty cells in your data. Many ML algorithms raise an error or give misleading results when they meet one, so handle them before training a model.

Common ways to handle missing values:
1. **Drop rows** with missing values
2. **Fill** missing values with mean, median, or a specific value

### TRY IT

In [None]:
# Check for missing values in each column
print("Missing values per column:")
print(df.isnull().sum())

In [None]:
# Total missing values
total_missing = df.isnull().sum().sum()
print(f"Total missing values: {total_missing}")

if total_missing == 0:
    print("Great! No missing values in this dataset.")

In [None]:
# Example: If there WERE missing values, here's how to handle them

# Create a copy with some fake missing values for demonstration
df_demo = df.copy()
df_demo.loc[0:4, 'pH'] = None  # .loc slices include both ends, so 0:4 sets the first 5 pH values to missing

print("Missing values in demo data:")
print(df_demo.isnull().sum())

In [None]:
# Option 1: Drop rows with missing values
df_dropped = df_demo.dropna()
print(f"Original rows: {len(df_demo)}")
print(f"After dropping missing: {len(df_dropped)}")

In [None]:
# Option 2: Fill missing values with the mean
df_filled = df_demo.copy()
mean_pH = df_filled['pH'].mean()
print(f"Mean pH value: {mean_pH:.2f}")

df_filled['pH'] = df_filled['pH'].fillna(mean_pH)
print(f"Missing values after fill: {df_filled['pH'].isnull().sum()}")

### EXPLAIN

What we learned:
- `isnull().sum()` counts missing values per column
- `dropna()` removes rows with any missing value
- `fillna(value)` replaces missing values with a specified value
- Common fill strategies: mean (for numbers), most frequent (for categories)
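
The other fill strategies mentioned in this section can be sketched in the same way. The example below uses invented data with a hypothetical `grade` column. It fills a numeric column with its median, fills a category column with its most frequent value, and drops rows only when one chosen column is missing.

```python
import pandas as pd

# Invented toy data with one missing number and one missing category
toy = pd.DataFrame({
    "pH": [3.4, None, 3.2, 3.8],
    "grade": ["A", "B", None, "B"],
})

filled = toy.copy()

# Numbers: fill with the median (less sensitive to extreme values than the mean)
filled["pH"] = filled["pH"].fillna(filled["pH"].median())

# Categories: fill with the most frequent value; mode() returns a Series
filled["grade"] = filled["grade"].fillna(filled["grade"].mode()[0])

print(filled)

# Drop rows only where 'pH' is missing, ignoring gaps in other columns
ph_present = toy.dropna(subset=["pH"])
print(len(ph_present))  # 3
```

Whether to drop or fill depends on how much data you can afford to lose. Dropping is simplest, but filling keeps every row.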

---

## Section 5: Quick Reference

| Task | Code |
|------|------|
| Load CSV | `df = pd.read_csv('file.csv')` |
| First 5 rows | `df.head()` |
| Shape (rows, cols) | `df.shape` |
| Column names | `df.columns.tolist()` |
| Data info | `df.info()` |
| Statistics | `df.describe()` |
| Select one column | `df['column']` |
| Select multiple | `df[['col1', 'col2']]` |
| Filter rows | `df[df['col'] > value]` |
| Check missing | `df.isnull().sum()` |
| Drop missing | `df.dropna()` |
| Fill missing | `df['col'].fillna(value)` |

---

## Practice Exercise

Try these on your own:

In [None]:
# Exercise 1: Load the churn dataset
churn = pd.read_csv('../datasets/churn_modelling.csv')

# How many rows and columns does it have?
# YOUR CODE HERE:


In [None]:
# Exercise 2: What are the column names?
# YOUR CODE HERE:


In [None]:
# Exercise 3: Select only the 'Age' and 'Balance' columns
# YOUR CODE HERE:


In [None]:
# Exercise 4: Filter rows where Balance > 100000
# How many rows match?
# YOUR CODE HERE:


In [None]:
# Exercise 5: Check for missing values
# YOUR CODE HERE:


---

## Next Lesson

In **Lesson 02**, you'll learn:
- What is Machine Learning?
- Supervised vs Unsupervised learning
- Classification vs Regression
- The ML pipeline