# Pandas Data Analysis

This notebook demonstrates data manipulation and analysis with Pandas.

**Library:** [Pandas](https://pandas.pydata.org/) - Data manipulation and analysis

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime

## Creating DataFrames

A DataFrame is the primary data structure in Pandas: a two-dimensional table with labeled rows and columns, where each column can hold a different type.

In [None]:
# Create DataFrame from dictionary
# Note: the "ME" (month-end) frequency alias requires pandas >= 2.2; older versions use "M"
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 32],
    "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
    "salary": [75000, 65000, 85000, 70000, 72000],
    "start_date": pd.date_range("2020-01-01", periods=5, freq="3ME"),
}
df = pd.DataFrame(data)
df

In [None]:
# DataFrame info
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nData types:")
print(df.dtypes)

## Data Selection and Filtering

Columns are selected by label with bracket indexing; rows are filtered with boolean conditions or `query()`.

### Selecting Columns

In [None]:
# Select single column
df["name"]

In [None]:
# Select multiple columns
df[["name", "department", "salary"]]

### Filtering Rows

In [None]:
# Filter by condition
engineers = df[df["department"] == "Engineering"]
print("Engineers only:")
engineers

In [None]:
# Multiple conditions with & (and) and | (or)
high_earners = df[(df["salary"] > 70000) & (df["age"] < 35)]
print("High earners under 35:")
high_earners
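
Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==` in Python. The sketch below rebuilds a small copy of the employee table (the name `staff` is only illustrative) and shows two related idioms: `.loc` with a mask plus column labels, and `isin()` in place of several `==` tests joined with `|`.

```python
import pandas as pd

# Illustrative copy of the employee table used in this notebook
staff = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "age": [25, 30, 35, 28, 32],
        "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
        "salary": [75000, 65000, 85000, 70000, 72000],
    }
)

# Each comparison yields a boolean Series; & combines them element-wise
mask = (staff["salary"] > 70000) & (staff["age"] < 35)

# .loc takes the row mask and a list of column labels in one step
selected = staff.loc[mask, ["name", "salary"]]

# isin() replaces a chain of == comparisons joined with |
non_engineering = staff[staff["department"].isin(["Marketing", "Sales"])]

print(selected)
print(non_engineering)
```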

In [None]:
# Using query() for more readable filtering
result = df.query("department == 'Marketing' and salary > 60000")
print("Marketing with salary > 60000:")
result
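
A `query()` string can also refer to Python variables by prefixing them with `@`, which avoids building the expression with string formatting. The following is a minimal sketch; `staff` and `min_salary` are illustrative names, not part of the notebook above.

```python
import pandas as pd

# Illustrative subset of the employee table
staff = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
        "salary": [75000, 65000, 85000, 70000, 72000],
    }
)

min_salary = 70000  # hypothetical threshold

# @min_salary is looked up in the surrounding Python scope
via_query = staff.query("department == 'Marketing' and salary > @min_salary")

# The same filter written as a boolean mask
via_mask = staff[(staff["department"] == "Marketing") & (staff["salary"] > min_salary)]

print(via_query)
```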

## Aggregation and Grouping

Summary statistics can be computed over a whole column or separately for each group of rows.

### Basic Aggregation

In [None]:
# Describe gives summary statistics
df["salary"].describe()

### Group By Operations

In [None]:
# Group by department and aggregate
dept_stats = df.groupby("department").agg(
    {"salary": ["mean", "min", "max", "count"], "age": "mean"}
)
print("Statistics by department:")
dept_stats
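
Passing a dictionary that contains a list of functions to `agg()` produces two-level (MultiIndex) column labels such as `("salary", "mean")`. The sketch below, on an illustrative `staff` table, shows how to pick one of those columns and one possible way to flatten the labels into plain strings.

```python
import pandas as pd

# Illustrative subset of the employee table
staff = pd.DataFrame(
    {
        "age": [25, 30, 35, 28, 32],
        "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
        "salary": [75000, 65000, 85000, 70000, 72000],
    }
)

dept_stats = staff.groupby("department").agg(
    {"salary": ["mean", "min", "max", "count"], "age": "mean"}
)

# A single column is addressed with a (top level, function name) tuple
salary_mean = dept_stats[("salary", "mean")]

# Join the two levels to get flat labels like "salary_mean"
flat = dept_stats.copy()
flat.columns = ["_".join(col) for col in flat.columns]

print(flat)
```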

In [None]:
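# Custom aggregation with apply
# include_groups=False (pandas >= 2.2) keeps the grouping column out of each
# sub-frame and avoids the deprecation warning raised by the old default.
summary = df.groupby("department").apply(
    lambda x: pd.Series(
        {"avg_salary": x["salary"].mean(), "total_employees": len(x)}
    ),
    include_groups=False,
)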
print("Custom summary by department:")
summary
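
When every output column is a standard reduction, named aggregation is a possible alternative to `apply()`: each keyword names an output column and maps it to a `(source column, function)` pair. This is a sketch on an illustrative `staff` table, not a replacement for the cell above.

```python
import pandas as pd

# Illustrative subset of the employee table
staff = pd.DataFrame(
    {
        "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
        "salary": [75000, 65000, 85000, 70000, 72000],
    }
)

# keyword = (source column, aggregation function)
named_summary = staff.groupby("department").agg(
    avg_salary=("salary", "mean"),
    total_employees=("salary", "count"),  # counts non-null salaries per group
)

print(named_summary)
```

Unlike the `apply()` version, this keeps `total_employees` as an integer column, because each output column is computed independently.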

## Data Transformation

Add new columns derived from existing ones and transform values within groups.

In [None]:
# Add categorical column based on salary
df["salary_category"] = pd.cut(
    df["salary"], bins=[0, 70000, 80000, float("inf")], labels=["Low", "Medium", "High"]
)

# Approximate years employed (365.25 days per year accounts for leap years)
df["years_employed"] = (datetime.now() - df["start_date"]).dt.days / 365.25

print("DataFrame with new columns:")
df
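
By default `pd.cut` treats each bin as open on the left and closed on the right, so a salary of exactly 70000 lands in "Low" with the bins used above. The sketch below uses a standalone `salaries` Series (an illustrative name) to compare that default with `right=False`, which moves each boundary value into the next bin up.

```python
import pandas as pd

salaries = pd.Series([65000, 70000, 72000, 75000, 85000])
bins = [0, 70000, 80000, float("inf")]
labels = ["Low", "Medium", "High"]

# Default: intervals are (0, 70000], (70000, 80000], (80000, inf]
right_closed = pd.cut(salaries, bins=bins, labels=labels)

# right=False: intervals are [0, 70000), [70000, 80000), [80000, inf)
left_closed = pd.cut(salaries, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"salary": salaries, "default": right_closed, "right=False": left_closed}))
```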

In [None]:
# Standardize salary within each department (z-score) using transform.
# Groups with one member (std is NaN) or zero variance get 0.
df["salary_normalized"] = df.groupby("department")["salary"].transform(
    lambda x: (x - x.mean()) / x.std() if x.std() > 0 else 0
)
df[["name", "department", "salary", "salary_normalized"]]
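
The same z-score can be written without a lambda by transforming the group mean and standard deviation separately. This is a sketch on an illustrative `staff` table; it assumes that a group with an undefined or zero standard deviation should map to 0, which matches the intent of the cell above.

```python
import pandas as pd

# Illustrative subset of the employee table
staff = pd.DataFrame(
    {
        "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
        "salary": [75000, 65000, 85000, 70000, 72000],
    }
)

grouped = staff.groupby("department")["salary"]
group_mean = grouped.transform("mean")
group_std = grouped.transform("std")  # sample std (ddof=1); NaN for one-row groups

# Division by a NaN or zero std yields NaN, which is then replaced with 0
z_scores = ((staff["salary"] - group_mean) / group_std).fillna(0.0)

print(staff.assign(salary_normalized=z_scores))
```

Built-in reductions named by string are typically faster than a Python lambda on large tables, since they avoid calling Python code once per group.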

## Pivot Tables

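A pivot table reshapes long-format data into a grid: the values of one column become row labels, the values of another become column headers, and each cell holds an aggregate.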

In [None]:
# Create sample sales data
np.random.seed(42)
sales_data = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=12, freq="ME"),
        "region": ["North", "South"] * 6,
        # Pattern differs from "region" so every region/product pair occurs
        "product": ["A", "A", "B", "B"] * 3,
        "sales": np.random.randint(1000, 5000, 12),
    }
)
print("Sales data:")
sales_data

In [None]:
# Create pivot table
pivot = pd.pivot_table(
    sales_data, values="sales", index="region", columns="product", aggfunc="sum"
)
print("Sales pivot table (sum by region and product):")
pivot
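
Two optional arguments of `pd.pivot_table` are often useful here: `margins=True` adds an "All" row and column with totals, and `fill_value` replaces cells for which no rows exist. The sketch below uses a tiny hand-written `sales` table (illustrative values) so the totals are easy to check by eye.

```python
import pandas as pd

# Small illustrative table: one row per region/product pair
sales = pd.DataFrame(
    {
        "region": ["North", "South", "North", "South"],
        "product": ["A", "A", "B", "B"],
        "sales": [100, 200, 300, 400],
    }
)

pivot_with_totals = pd.pivot_table(
    sales,
    values="sales",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,   # shown instead of NaN when a combination has no rows
    margins=True,   # adds the "All" row and column
)

print(pivot_with_totals)
```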

## Time Series Operations

A Series or DataFrame indexed by a `DatetimeIndex` supports date-aware operations such as resampling and rolling windows.

In [None]:
# Create time series with datetime index
np.random.seed(42)
dates = pd.date_range("2024-01-01", periods=100, freq="D")
ts = pd.Series(np.random.randn(100).cumsum(), index=dates)

print("Time series (first 10 days):")
ts.head(10)

### Resampling

Change the frequency of your time series data.

In [None]:
# Resample to weekly frequency
weekly = ts.resample("W").mean()
print("Weekly resampled (mean):")
weekly.head()
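
The `"W"` alias groups days into weeks that end on Sunday and labels each bin with that Sunday. The sketch below uses a deterministic 14-day series (illustrative name `daily`) starting on Monday 2024-01-01, so each weekly bin covers exactly seven values.

```python
import numpy as np
import pandas as pd

# Values 0..13 on consecutive days, starting on a Monday
daily = pd.Series(
    np.arange(14, dtype=float),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

weekly_mean = daily.resample("W").mean()
weekly_sum = daily.resample("W").sum()

print(weekly_mean)
print(weekly_sum)
```

Each label is the last day of its bin, so the value for 2024-01-07 summarizes 1 January through 7 January.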

### Rolling Statistics

Compute statistics over a sliding window.

In [None]:
# 7-day rolling mean
rolling = ts.rolling(window=7).mean()
print("7-day rolling mean (last 10 values):")
rolling.tail(10)
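
A rolling window only produces a value once it holds `window` observations, so the first `window - 1` results are NaN. The `min_periods` argument relaxes that requirement. The sketch below uses a short illustrative series `s` and a window of 3 to keep the arithmetic visible.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Needs 3 observations: the first two results are NaN
strict = s.rolling(window=3).mean()

# Accepts partial windows at the start of the series
partial = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"value": s, "strict": strict, "partial": partial}))
```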

## Data Cleaning

Handling missing values and duplicates.

In [None]:
# Create messy data with missing values
messy = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4, 5],
        "B": [np.nan, 2, 3, np.nan, 5],
        "C": ["x", "y", "z", "x", "y"],
    }
)
print("Messy data:")
messy

In [None]:
# Check for missing values
print("Missing values per column:")
messy.isnull().sum()

In [None]:
# Fill missing values
filled = messy.fillna({"A": messy["A"].mean(), "B": 0})
print("Filled data:")
filled
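
Filling with a constant or the column mean is one option among several. The sketch below, on an illustrative copy of the numeric columns, shows two others: dropping incomplete rows with `dropna()` and carrying the last valid value forward with `ffill()`. Which one is appropriate depends on the data, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the numeric columns from the messy table
messy_numeric = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4, 5],
        "B": [np.nan, 2, 3, np.nan, 5],
    }
)

# Keep only rows with no missing values
complete_rows = messy_numeric.dropna()

# Propagate the previous valid value downward; a leading NaN has nothing to copy
forward_filled = messy_numeric.ffill()

print(complete_rows)
print(forward_filled)
```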

### Handling Duplicates

In [None]:
# Create data with duplicates
with_dups = pd.DataFrame({"x": [1, 1, 2], "y": [1, 1, 3]})
print("With duplicates:")
print(with_dups)

print("\nWithout duplicates:")
with_dups.drop_duplicates()
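
`drop_duplicates()` compares whole rows and keeps the first occurrence by default. The sketch below uses a slightly larger illustrative table (`records`) to show `duplicated()` for inspecting which rows would be dropped, plus the `subset` and `keep` arguments.

```python
import pandas as pd

records = pd.DataFrame({"x": [1, 1, 2, 2], "y": [1, 1, 3, 4]})

# True for every row that repeats an earlier row
is_repeat = records.duplicated()

# Compare only column "x": rows 1 and 3 repeat earlier x values
unique_x = records.drop_duplicates(subset=["x"])

# Keep the last occurrence of each duplicated row instead of the first
keep_last = records.drop_duplicates(keep="last")

print(is_repeat)
print(unique_x)
print(keep_last)
```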

---

## Summary

In this notebook, we covered:

1. **Creating DataFrames** from dictionaries
2. **Selection and Filtering** using conditions and query()
3. **Aggregation and Grouping** with groupby() and agg()
4. **Data Transformation** adding columns and applying functions
5. **Pivot Tables** for reshaping data
6. **Time Series Operations** including resampling and rolling statistics
7. **Data Cleaning** handling missing values and duplicates

For more information, visit the [Pandas Documentation](https://pandas.pydata.org/docs/).