# Notebook 5: Pandas Preview - Your First Taste of Data Science

**Duration:** 15 minutes  
**Goal:** Get a sneak peek at the most important data science library without overwhelming complexity

**Why This Matters:**
You'll understand what DataFrames are and why they're everywhere in data science. This preview prepares you to recognize pandas patterns in advanced ML notebooks.

**What You'll Learn:**
- Reading CSV files (`pd.read_csv()`)
- Basic DataFrame operations
- Column selection and filtering
- Why pandas is the data scientist's best friend

## What is Pandas?

Pandas is THE library for data manipulation in Python. If NumPy is the foundation, pandas is the main tool data scientists use daily.

**Why pandas is everywhere:**
- Works with spreadsheet-like data (rows and columns)
- Reads/writes CSV, Excel, JSON files easily
- Handles missing data with built-in tools (`isna()`, `fillna()`, `dropna()`)
- Built on top of NumPy
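
To make the last two bullets concrete, here is a minimal sketch. The temperature readings are made up for illustration and are not part of this notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; None marks a missing measurement
temps = pd.Series([21.5, None, 23.0, None, 22.5], name="temp_c")

missing_count = int(temps.isna().sum())  # count the missing values
mean_temp = temps.mean()                 # missing values are skipped by default
filled = temps.fillna(mean_temp)         # replace missing values with the mean

# A numeric Series can be handed back to NumPy at any time
as_array = temps.to_numpy()

print("Missing values:", missing_count)
print("Mean of known values:", round(mean_temp, 2))
print("Filled series:", filled.round(2).tolist())
print("Underlying type:", type(as_array))
```

Filling with the mean is only one possible choice; whether it is appropriate depends on the dataset.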

In [None]:
# The standard pandas import - you'll see this in almost every data science notebook!
import pandas as pd
import numpy as np

print(f"✅ Pandas version: {pd.__version__}")
print("Pandas imported successfully!")

# Create sample data (simulating a CSV file)
sample_data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'score': [85, 92, 78, 96, 88],
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin']
}

# Create a DataFrame (pandas' main data structure)
df = pd.DataFrame(sample_data)
print("\nSample DataFrame:")
print(df)

## Basic DataFrame Operations

These are the operations you'll see most often in data science notebooks:

In [None]:
# Basic info about the DataFrame
print("DataFrame shape:", df.shape)  # (rows, columns)
print("Column names:", df.columns.tolist())
print("Data types:")
print(df.dtypes)
print()

# First few rows - very common! (df.tail(3) shows the last rows instead)
print("First 3 rows:")
print(df.head(3))
print()

print("Basic statistics:")
print(df.describe())
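
A few more inspection calls often appear alongside the ones above. The sketch below rebuilds the same sample DataFrame so it can run on its own; which calls you need depends on the data.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 32],
    "score": [85, 92, 78, 96, 88],
    "city": ["New York", "London", "Paris", "Tokyo", "Berlin"],
})

# Last rows of the table (the counterpart of head)
last_two = df.tail(2)
print(last_two)

# Column names, non-null counts, and dtypes in one printed summary
df.info()

# Single-column summaries
mean_age = df["age"].mean()
top_score = df["score"].max()
print("Mean age:", mean_age)
print("Top score:", top_score)
```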

## Column Selection - You'll See This Everywhere!

These patterns appear in almost every ML notebook:

In [None]:
# Select single column
names = df['name']  # Returns a Series
print("Names column:")
print(names)
print(f"Type: {type(names)}")
print()

# Select multiple columns
subset = df[['name', 'score']]  # Returns a DataFrame
print("Name and score columns:")
print(subset)
print()

# This pattern is EVERYWHERE in ML: separate features from target
X = df[['age']]   # Features (inputs); double brackets keep X a DataFrame
y = df['score']   # Target (what we want to predict)

print("Features (X):")
print(X)
print("\nTarget (y):")
print(y)
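
When a table has many columns, listing every feature by name gets tedious. One common alternative, sketched here under the assumption that `score` is the target and only numeric columns are useful as features, is to drop the target from the numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 32],
    "score": [85, 92, 78, 96, 88],
    "city": ["New York", "London", "Paris", "Tokyo", "Berlin"],
})

target_column = "score"  # assumed target for this illustration

# Keep numeric columns only, then remove the target to get the features
X = df.select_dtypes(include="number").drop(columns=[target_column])
y = df[target_column]

print("Feature columns:", X.columns.tolist())
print("X shape:", X.shape)  # 2-D: (rows, features)
print("y shape:", y.shape)  # 1-D: (rows,)
```

Note the shapes: `X` is two-dimensional even with a single feature, while `y` is one-dimensional. Many ML libraries expect exactly this layout.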

## Filtering Data - Essential Skill

Filtering is used constantly in data preprocessing:

In [None]:
# Boolean filtering - very common pattern!
high_scorers = df[df['score'] > 85]
print("High scorers (score > 85):")
print(high_scorers)
print()

# Multiple conditions
young_high_scorers = df[(df['age'] < 30) & (df['score'] > 85)]
print("Young high scorers (age < 30 AND score > 85):")
print(young_high_scorers)
print()

# Get values as NumPy array (for ML algorithms)
# .to_numpy() is the recommended method; older code often uses .values
scores_array = df['score'].to_numpy()
print(f"Scores as NumPy array: {scores_array}")
print(f"Type: {type(scores_array)}")
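
The filters above work because a comparison such as `df['score'] > 85` produces a Boolean Series (a mask), and `df[mask]` keeps the rows where it is True. The sketch below shows a few related operators on the same sample data: `|` for OR, `~` for NOT, `.isin()` for membership, and `.loc` for filtering rows and picking a column in one step. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 32],
    "score": [85, 92, 78, 96, 88],
    "city": ["New York", "London", "Paris", "Tokyo", "Berlin"],
})

# A mask is a Series of True/False values, one per row
mask = df["score"] > 85
print(mask.tolist())

# OR: rows matching at least one condition
young_or_top = df[(df["age"] < 30) | (df["score"] > 90)]

# NOT: invert a mask
not_high = df[~mask]

# Membership test against a list of values
in_cities = df[df["city"].isin(["Paris", "Tokyo"])]

# Filter rows and select a column in a single step
high_scorer_names = df.loc[mask, "name"]

print(young_or_top["name"].tolist())
print(not_high["name"].tolist())
print(in_cities["name"].tolist())
print(high_scorer_names.tolist())
```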

## Reading CSV Files - The Most Common Operation

This is how real data science projects start:

In [None]:
# Create a sample CSV file for demonstration
sample_csv_data = """
product,sales,region,quarter
Laptop,1200,North,Q1
Phone,800,South,Q1
Tablet,600,East,Q1
Laptop,1350,North,Q2
Phone,900,South,Q2
Tablet,550,East,Q2
""".strip()

# Save to file
with open('sample_sales.csv', 'w') as f:
    f.write(sample_csv_data)

# Read CSV file - this is how most projects start!
sales_df = pd.read_csv('sample_sales.csv')
print("Data loaded from CSV:")
print(sales_df)
print()

# Quick analysis
print("Sales by product:")
print(sales_df.groupby('product')['sales'].sum())
print()

print("Average sales by region:")
print(sales_df.groupby('region')['sales'].mean())

# Clean up
import os
os.remove('sample_sales.csv')
print("\n✅ CSV file demonstration complete!")
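
`pd.read_csv()` accepts many optional parameters. As a small sketch, the example below reads the same sales text from an in-memory buffer (`io.StringIO` stands in for a file on disk, which is a substitution made here for convenience) and uses `usecols` to load only some columns. It also shows `agg()` computing several statistics per group at once.

```python
import io

import pandas as pd

csv_text = """product,sales,region,quarter
Laptop,1200,North,Q1
Phone,800,South,Q1
Tablet,600,East,Q1
Laptop,1350,North,Q2
Phone,900,South,Q2
Tablet,550,East,Q2
"""

# read_csv accepts a path or any file-like object; usecols limits the columns loaded
sales_df = pd.read_csv(io.StringIO(csv_text), usecols=["product", "sales", "quarter"])
print(sales_df.dtypes)  # column types are inferred from the text

# Total sales per product
totals = sales_df.groupby("product")["sales"].sum()
print(totals)

# Several statistics per group in one call
by_quarter = sales_df.groupby("quarter")["sales"].agg(["sum", "mean"])
print(by_quarter)
```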

## Key Takeaways for ML Notebooks

Now you understand these common patterns you'll see:

```python
# Loading data
df = pd.read_csv('data.csv')

# Exploring data
df.head()
df.shape
df.describe()

# Preparing for ML
X = df[['feature1', 'feature2', 'feature3']]  # Features
y = df['target']                              # Target

# Converting to NumPy for algorithms
# (.to_numpy() is the recommended method; older code often uses .values)
X_array = X.to_numpy()
y_array = y.to_numpy()
```

**What's Next:**
- In intermediate courses, you'll learn pandas in depth
- You'll work with real datasets
- You'll learn data cleaning and preprocessing

**For now:** You can recognize and understand pandas code in ML notebooks! 🎉

---

## 🎯 Try It Yourself: Mini Data Analysis

### Interactive Challenge: Explore the Sales Data
Use the `sales_df` DataFrame from the CSV section above to answer these questions. Type your code in the cell below!

In [None]:
# YOUR TURN: Answer these questions using pandas code

# 1. What's the average sale amount?
# Try: sales_df['sales'].mean()

# 2. How many sales were made in each region?
# Try: sales_df['region'].value_counts()

# 3. What's the total sales amount for the East region?
# Try: sales_df[sales_df['region'] == 'East']['sales'].sum()

# 4. Create a new column 'large_sale' that's True if sales > 1000
# Try: sales_df['large_sale'] = sales_df['sales'] > 1000

# Write your solutions below:
print("1. Average sale amount:")
# Your code here

print("\n2. Sales count by region:")
# Your code here

print("\n3. Total sales in East region:")
# Your code here

print("\n4. Adding large_sale column:")
# Your code here
# Then display the updated DataFrame

### 💡 Helpful Hints

**Stuck? Here are some tips:**

1. **For question 1**: Use `.mean()` on a column
2. **For question 2**: Use `.value_counts()` to count unique values
3. **For question 3**: First filter the DataFrame, then sum the `sales` column
4. **For question 4**: Use comparison operators to create boolean columns

**Remember**: 
- Column names go in quotes: `df['column_name']`
- Use square brackets for filtering: `df[condition]`
- Chain operations: `df[condition]['column'].sum()`
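
If you want to see these hints in action without spoiling the challenge, here is a sketch on a small made-up `orders` table (the table and its column names are hypothetical). It also shows the `.loc` form of the chained filter, which does the row filter and column selection in one step.

```python
import pandas as pd

# Hypothetical data, unrelated to the sales exercise
orders = pd.DataFrame({
    "item": ["pen", "book", "lamp", "desk"],
    "price": [2, 15, 40, 120],
    "store": ["A", "B", "A", "B"],
})

avg_price = orders["price"].mean()         # average of a column
per_store = orders["store"].value_counts() # rows per unique value

# Filter, then sum: chained form and the equivalent .loc form
store_a_total = orders[orders["store"] == "A"]["price"].sum()
store_a_total_loc = orders.loc[orders["store"] == "A", "price"].sum()

# A comparison creates a Boolean column
orders["expensive"] = orders["price"] > 30

print(avg_price)
print(per_store)
print(store_a_total, store_a_total_loc)
print(orders)
```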

---

## 🎓 Congratulations!

You've just completed your first pandas operations! Here's what you accomplished:

✅ **Loaded data** from a CSV file  
✅ **Explored data** with `.head()`, `.shape`, `.describe()`  
✅ **Selected columns** and filtered rows  
✅ **Performed calculations** on data  
✅ **Created new columns** based on conditions  

## 🚀 What's Next?

In future courses, you'll learn:
- **Data cleaning**: Handling missing values, duplicates
- **Advanced filtering**: Complex queries beyond the simple conditions shown here
- **Grouping and aggregation**: Going deeper than the `groupby()` preview above
- **Merging datasets**: Combining multiple data sources
- **Time series analysis**: Working with dates and time data
- **Data visualization**: Creating charts directly from DataFrames

**For now**: You know enough pandas to understand most ML notebook code! 🎉