# Pandas Tutorial: Complete Guide for Data Analysis

This notebook provides a comprehensive introduction to pandas, Python's most popular data manipulation and analysis library.

## Table of Contents
1. [Introduction to Pandas](#introduction)
2. [Data Structures: Series and DataFrame](#data-structures)
3. [Reading and Writing Data](#io-operations)
4. [Data Inspection and Exploration](#data-inspection)
5. [Data Selection and Indexing](#selection-indexing)
6. [Data Cleaning](#data-cleaning)
7. [Data Transformation](#data-transformation)
8. [Grouping and Aggregation](#grouping-aggregation)
9. [Merging and Joining](#merging-joining)
10. [Time Series Analysis](#time-series)
11. [Data Visualization with Pandas](#visualization)

## 1. Introduction to Pandas {#introduction}

Pandas is a powerful Python library for data manipulation and analysis. It provides:
- Fast, flexible data structures (Series and DataFrame)
- Tools for reading/writing data from various formats
- Data cleaning and preparation capabilities
- Statistical analysis functions
- Integration with other Python libraries

Let's start by importing pandas and other useful libraries:

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 100)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Pandas version: 2.0.2
NumPy version: 1.24.3


## 2. Data Structures: Series and DataFrame {#data-structures}

### Series
A Series is a one-dimensional array-like object with labeled indices. Think of it as a column in a spreadsheet.

In [4]:
# Creating a Series from a list
ages = pd.Series([25, 30, 35, 40, 45], name='Age')
print("Series from list:")
print(ages)
print(f"Data type: {ages.dtype}")
print(f"Name: {ages.name}")
print()

Series from list:
0    25
1    30
2    35
3    40
4    45
Name: Age, dtype: int64
Data type: int64
Name: Age



In [5]:
# Creating a Series with custom index
scores = pd.Series([85, 92, 78, 96, 88], 
                   index=['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   name='Test_Score')
print("Series with custom index:")
print(scores)
print(f"Index: {list(scores.index)}")
print()

Series with custom index:
Alice      85
Bob        92
Charlie    78
David      96
Eve        88
Name: Test_Score, dtype: int64
Index: ['Alice', 'Bob', 'Charlie', 'David', 'Eve']



In [6]:
# Creating a Series from a dictionary
city_population = pd.Series({
    'New York': 8_419_000,
    'Los Angeles': 3_980_000,
    'Chicago': 2_716_000,
    'Houston': 2_328_000
}, name='Population')
print("Series from dictionary:")
print(city_population)

Series from dictionary:
New York       8419000
Los Angeles    3980000
Chicago        2716000
Houston        2328000
Name: Population, dtype: int64
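
The index labels are more than display names: arithmetic between two Series matches values by label, not by position. The sketch below uses two small made-up Series (`q1` and `q2` are hypothetical names, not part of the tutorial data) to show how labels present on only one side behave.

```python
import pandas as pd

# Hypothetical data: two Series that share only some labels
q1 = pd.Series({'Alice': 10, 'Bob': 20, 'Charlie': 30})
q2 = pd.Series({'Bob': 5, 'Charlie': 15, 'David': 25})

# Arithmetic aligns on labels; a label missing on either side gives NaN
total = q1 + q2

# fill_value treats a missing label as 0 instead
total_filled = q1.add(q2, fill_value=0)

print(total)
print(total_filled)
```

Because of this alignment, the order of the entries in each Series does not matter for the result.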


### DataFrame
A DataFrame is a two-dimensional data structure with labeled rows and columns. It's like a spreadsheet or SQL table.

In [7]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 90000, 95000, 85000]
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:")
print(df)
print(f"\nShape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Index: {list(df.index)}")

DataFrame from dictionary:
      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
3    David   40      Houston   95000
4      Eve   45      Phoenix   85000

Shape: (5, 4)
Columns: ['Name', 'Age', 'City', 'Salary']
Index: [0, 1, 2, 3, 4]
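
A dictionary of columns is only one way to build a DataFrame. As a small sketch, assuming row-oriented input such as parsed JSON records, a list of dictionaries also works; the `records` data here is made up for illustration.

```python
import pandas as pd

# Hypothetical records; the second one has no 'City' key
records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30},
]
people = pd.DataFrame(records)

# Each column is a Series that shares the DataFrame's index
age_column = people['Age']

print(people)
print(type(age_column).__name__)
```

Keys missing from a record become missing values in the resulting column.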


## 3. Reading and Writing Data {#io-operations}

Pandas can read data from various file formats including CSV, Excel, JSON, SQL databases, and more.

In [8]:
# Create sample data and save to CSV for demonstration
sample_data = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'Price': [999.99, 29.99, 79.99, 299.99, 149.99],
    'Stock': [50, 200, 150, 75, 100],
    'Rating': [4.5, 4.2, 4.0, 4.8, 4.3]
})

# Save to CSV
sample_data.to_csv('sample_products.csv', index=False)
print("Sample data saved to CSV:")
print(sample_data)

Sample data saved to CSV:
      Product     Category   Price  Stock  Rating
0      Laptop  Electronics  999.99     50     4.5
1       Mouse  Accessories   29.99    200     4.2
2    Keyboard  Accessories   79.99    150     4.0
3     Monitor  Electronics  299.99     75     4.8
4  Headphones  Accessories  149.99    100     4.3


In [9]:
# Reading from CSV
df_csv = pd.read_csv('sample_products.csv')
print("Data read from CSV:")
print(df_csv)
print(f"\nData types:\n{df_csv.dtypes}")

Data read from CSV:
      Product     Category   Price  Stock  Rating
0      Laptop  Electronics  999.99     50     4.5
1       Mouse  Accessories   29.99    200     4.2
2    Keyboard  Accessories   79.99    150     4.0
3     Monitor  Electronics  299.99     75     4.8
4  Headphones  Accessories  149.99    100     4.3

Data types:
Product      object
Category     object
Price       float64
Stock         int64
Rating      float64
dtype: object


In [10]:
# Reading with specific parameters
df_custom = pd.read_csv('sample_products.csv', 
                        index_col='Product',  # Set Product as index
                        dtype={'Stock': 'int32'},  # Specify data type (column must be listed in usecols)
                        usecols=['Product', 'Price', 'Stock', 'Rating'])  # Select specific columns

print("Custom CSV reading:")
print(df_custom)
print(f"\nData types:\n{df_custom.dtypes}")

Custom CSV reading:
             Price  Stock  Rating
Product                          
Laptop      999.99     50     4.5
Mouse        29.99    200     4.2
Keyboard     79.99    150     4.0
Monitor     299.99     75     4.8
Headphones  149.99    100     4.3

Data types:
Price     float64
Stock       int32
Rating    float64
dtype: object
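
The same reader functions also accept file-like objects, which is handy for tests or for data that arrives as text. The following is a minimal sketch, assuming a small made-up CSV string, that reads from memory and round-trips the result through JSON.

```python
import io
import pandas as pd

# Hypothetical CSV content held in a string instead of a file
csv_text = "Product,Price,Stock\nLaptop,999.99,50\nMouse,29.99,200\n"

# read_csv accepts any file-like object, not only a path
inventory = pd.read_csv(io.StringIO(csv_text))

# Round-trip through JSON; orient='records' gives one object per row
json_text = inventory.to_json(orient='records')
restored = pd.read_json(io.StringIO(json_text))

print(json_text)
print(restored.dtypes)
```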


## 4. Data Inspection and Exploration {#data-inspection}

Before working with data, it's crucial to understand its structure and content.

In [11]:
# Use our sample data for exploration
df = sample_data.copy()

print("Basic information about the DataFrame:")
print(f"Shape: {df.shape}")
print(f"Size: {df.size}")
print(f"Columns: {list(df.columns)}")
print(f"Index: {list(df.index)}")
print()

# First few rows
print("First 3 rows:")
print(df.head(3))
print()

# Last few rows
print("Last 2 rows:")
print(df.tail(2))

Basic information about the DataFrame:
Shape: (5, 5)
Size: 25
Columns: ['Product', 'Category', 'Price', 'Stock', 'Rating']
Index: [0, 1, 2, 3, 4]

First 3 rows:
    Product     Category   Price  Stock  Rating
0    Laptop  Electronics  999.99     50     4.5
1     Mouse  Accessories   29.99    200     4.2
2  Keyboard  Accessories   79.99    150     4.0

Last 2 rows:
      Product     Category   Price  Stock  Rating
3     Monitor  Electronics  299.99     75     4.8
4  Headphones  Accessories  149.99    100     4.3


In [12]:
# Data types and memory usage
print("Data types and memory info:")
df.info()
print()

# Data types only
print("Data types:")
print(df.dtypes)

Data types and memory info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Product   5 non-null      object 
 1   Category  5 non-null      object 
 2   Price     5 non-null      float64
 3   Stock     5 non-null      int64  
 4   Rating    5 non-null      float64
dtypes: float64(2), int64(1), object(2)
memory usage: 328.0+ bytes

Data types:
Product      object
Category     object
Price       float64
Stock         int64
Rating      float64
dtype: object


In [13]:
# Statistical summary
print("Statistical summary (numeric columns):")
print(df.describe())
print()

# Summary including non-numeric columns
print("Summary of all columns:")
print(df.describe(include='all'))

Statistical summary (numeric columns):
            Price       Stock    Rating
count    5.000000    5.000000  5.000000
mean   311.990000  115.000000  4.360000
std    397.831623   60.207973  0.304959
min     29.990000   50.000000  4.000000
25%     79.990000   75.000000  4.200000
50%    149.990000  100.000000  4.300000
75%    299.990000  150.000000  4.500000
max    999.990000  200.000000  4.800000

Summary of all columns:
       Product     Category       Price       Stock    Rating
count        5            5    5.000000    5.000000  5.000000
unique       5            2         NaN         NaN       NaN
top     Laptop  Accessories         NaN         NaN       NaN
freq         1            3         NaN         NaN       NaN
mean       NaN          NaN  311.990000  115.000000  4.360000
std        NaN          NaN  397.831623   60.207973  0.304959
min        NaN          NaN   29.990000   50.000000  4.000000
25%        NaN          NaN   79.990000   75.000000  4.200000
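50%        NaN          NaN  149.990000  100.000000  4.300000
75%        NaN          NaN  299.990000  150.000000  4.500000
max        NaN          NaN  999.990000  200.000000  4.800000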

In [14]:
# Check for missing values
print("Missing values count:")
print(df.isnull().sum())
print()

# Unique values in categorical columns
print("Unique values in 'Category':")
print(df['Category'].unique())
print(f"Number of unique categories: {df['Category'].nunique()}")
print()

# Value counts
print("Value counts for 'Category':")
print(df['Category'].value_counts())

Missing values count:
Product     0
Category    0
Price       0
Stock       0
Rating      0
dtype: int64

Unique values in 'Category':
['Electronics' 'Accessories']
Number of unique categories: 2

Value counts for 'Category':
Category
Accessories    3
Electronics    2
Name: count, dtype: int64


## 5. Data Selection and Indexing {#selection-indexing}

Pandas provides multiple ways to select and filter data.

In [15]:
# Selecting columns
print("Single column selection:")
print(df['Product'])
print(f"Type: {type(df['Product'])}")
print()

# Multiple columns
print("Multiple column selection:")
print(df[['Product', 'Price']])
print(f"Type: {type(df[['Product', 'Price']])}")

Single column selection:
0        Laptop
1         Mouse
2      Keyboard
3       Monitor
4    Headphones
Name: Product, dtype: object
Type: <class 'pandas.core.series.Series'>

Multiple column selection:
      Product   Price
0      Laptop  999.99
1       Mouse   29.99
2    Keyboard   79.99
3     Monitor  299.99
4  Headphones  149.99
Type: <class 'pandas.core.frame.DataFrame'>


In [16]:
# Row selection by index
print("First row (using iloc):")
print(df.iloc[0])
print()

# Multiple rows
print("First 3 rows (using iloc):")
print(df.iloc[0:3])
print()

# Specific rows and columns
print("Specific rows and columns (using iloc):")
print(df.iloc[0:3, 1:3])  # First 3 rows, columns 1-2

First row (using iloc):
Product          Laptop
Category    Electronics
Price            999.99
Stock                50
Rating              4.5
Name: 0, dtype: object

First 3 rows (using iloc):
    Product     Category   Price  Stock  Rating
0    Laptop  Electronics  999.99     50     4.5
1     Mouse  Accessories   29.99    200     4.2
2  Keyboard  Accessories   79.99    150     4.0

Specific rows and columns (using iloc):
      Category   Price
0  Electronics  999.99
1  Accessories   29.99
2  Accessories   79.99


In [17]:
# Boolean indexing (filtering)
print("Products with price > 100:")
expensive_products = df[df['Price'] > 100]
print(expensive_products)
print()

# Multiple conditions
print("Electronics with price > 200:")
expensive_electronics = df[(df['Category'] == 'Electronics') & (df['Price'] > 200)]
print(expensive_electronics)
print()

# Using isin() for multiple values
print("Products that are either Laptop or Monitor:")
selected_products = df[df['Product'].isin(['Laptop', 'Monitor'])]
print(selected_products)

Products with price > 100:
      Product     Category   Price  Stock  Rating
0      Laptop  Electronics  999.99     50     4.5
3     Monitor  Electronics  299.99     75     4.8
4  Headphones  Accessories  149.99    100     4.3

Electronics with price > 200:
   Product     Category   Price  Stock  Rating
0   Laptop  Electronics  999.99     50     4.5
3  Monitor  Electronics  299.99     75     4.8

Products that are either Laptop or Monitor:
   Product     Category   Price  Stock  Rating
0   Laptop  Electronics  999.99     50     4.5
3  Monitor  Electronics  299.99     75     4.8


In [18]:
# Using loc for label-based selection
df_indexed = df.set_index('Product')
print("DataFrame with Product as index:")
print(df_indexed)
print()

print("Select Laptop row using loc:")
print(df_indexed.loc['Laptop'])
print()

print("Select specific rows and columns using loc:")
print(df_indexed.loc['Laptop':'Keyboard', 'Price':'Stock'])

DataFrame with Product as index:
               Category   Price  Stock  Rating
Product                                       
Laptop      Electronics  999.99     50     4.5
Mouse       Accessories   29.99    200     4.2
Keyboard    Accessories   79.99    150     4.0
Monitor     Electronics  299.99     75     4.8
Headphones  Accessories  149.99    100     4.3

Select Laptop row using loc:
Category    Electronics
Price            999.99
Stock                50
Rating              4.5
Name: Laptop, dtype: object

Select specific rows and columns using loc:
           Price  Stock
Product                
Laptop    999.99     50
Mouse      29.99    200
Keyboard   79.99    150


## 6. Data Cleaning {#data-cleaning}

Real-world data often contains missing values, duplicates, and inconsistencies that need to be addressed.

In [19]:
# Create data with missing values and duplicates for demonstration
messy_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eve', 'Frank'],
    'Age': [25, 30, np.nan, 25, 45, 35],
    'City': ['New York', 'LA', 'Chicago', 'New York', None, 'Boston'],
    'Salary': [70000, 80000, 90000, 70000, 85000, np.nan],
    'Email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 
              'alice@email.com', 'eve@email.com', 'frank@email.com']
})

print("Original messy data:")
print(messy_data)
print(f"\nMissing values:\n{messy_data.isnull().sum()}")
print(f"\nDuplicate rows: {messy_data.duplicated().sum()}")

Original messy data:
      Name   Age      City   Salary              Email
0    Alice  25.0  New York  70000.0    alice@email.com
1      Bob  30.0        LA  80000.0      bob@email.com
2  Charlie   NaN   Chicago  90000.0  charlie@email.com
3    Alice  25.0  New York  70000.0    alice@email.com
4      Eve  45.0      None  85000.0      eve@email.com
5    Frank  35.0    Boston      NaN    frank@email.com

Missing values:
Name      0
Age       1
City      1
Salary    1
Email     0
dtype: int64

Duplicate rows: 1


In [None]:
# Handling missing values
print("Methods to handle missing values:\n")

# Drop rows with any missing values
print("1. Drop rows with any missing values:")
clean_dropna = messy_data.dropna()
print(clean_dropna)
print(f"Shape: {clean_dropna.shape}")
print()

# Drop rows only if all values are missing
print("2. Drop rows only if all values are missing:")
clean_dropna_all = messy_data.dropna(how='all')
print(clean_dropna_all)
print()

# Fill missing values
print("3. Fill missing values:")
clean_filled = messy_data.copy()
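# Assign the result back: calling fillna(inplace=True) on a selected column
# is chained assignment, which newer pandas versions warn about or ignore
clean_filled['Age'] = clean_filled['Age'].fillna(clean_filled['Age'].mean())
clean_filled['City'] = clean_filled['City'].fillna('Unknown')
clean_filled['Salary'] = clean_filled['Salary'].fillna(clean_filled['Salary'].median())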
print(clean_filled)
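
When several columns need different fill values, one option is to pass a dictionary to `DataFrame.fillna` rather than filling column by column. This is a sketch on a tiny made-up frame (`people` is a hypothetical name), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column
people = pd.DataFrame({
    'Age': [25.0, np.nan, 45.0],
    'City': ['New York', None, 'Boston'],
})

# A dict maps each column to its own fill value; a new DataFrame is returned
filled = people.fillna({'Age': people['Age'].mean(), 'City': 'Unknown'})

print(filled)
```

The original `people` frame is left unchanged, which keeps the cleaning step easy to rerun.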

In [None]:
# Handling duplicates
print("Handling duplicates:\n")

# Identify duplicates
print("Duplicate rows:")
print(messy_data[messy_data.duplicated()])
print()

# Remove duplicates (keep first occurrence)
print("After removing duplicates:")
clean_no_duplicates = messy_data.drop_duplicates()
print(clean_no_duplicates)
print(f"Shape before: {messy_data.shape}")
print(f"Shape after: {clean_no_duplicates.shape}")
print()

# Remove duplicates based on specific columns
print("Remove duplicates based on 'Name' and 'Email':")
clean_subset = messy_data.drop_duplicates(subset=['Name', 'Email'])
print(clean_subset)

In [None]:
# Data type conversion
print("Data type conversion:\n")

# Create sample data with wrong types
df_types = pd.DataFrame({
    'ID': ['001', '002', '003', '004'],
    'Score': ['85.5', '92.0', '78.5', '96.0'],
    'Grade': ['A', 'A+', 'B', 'A+'],
    'Pass': ['True', 'True', 'False', 'True']
})

print("Original data types:")
print(df_types.dtypes)
print()

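# Convert data types
df_types['ID'] = pd.to_numeric(df_types['ID'])  # Note: leading zeros are lost ('001' -> 1)
df_types['Score'] = pd.to_numeric(df_types['Score'])
# astype('bool') would turn every non-empty string (including 'False') into True,
# so map the strings to booleans explicitly
df_types['Pass'] = df_types['Pass'].map({'True': True, 'False': False})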

print("After conversion:")
print(df_types.dtypes)
print()
print(df_types)
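
Real columns often contain entries that cannot be parsed as numbers at all. A minimal sketch, assuming a made-up `raw_scores` Series with one bad entry, shows how `errors='coerce'` handles that case.

```python
import pandas as pd

# Hypothetical input with one entry that is not a number
raw_scores = pd.Series(['85.5', '92.0', 'n/a', '96.0'])

# errors='coerce' turns unparseable entries into NaN instead of raising
scores = pd.to_numeric(raw_scores, errors='coerce')

print(scores)
print(f"Unparseable entries: {scores.isna().sum()}")
```

Counting the resulting NaN values is a quick way to see how much of the column failed to convert.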

## 7. Data Transformation {#data-transformation}

Transformation derives new columns from existing ones and reshapes data to prepare it for analysis.

In [None]:
# Creating new columns
df = sample_data.copy()
print("Original data:")
print(df)
print()

# Add new columns
df['Total_Value'] = df['Price'] * df['Stock']  # Calculate total inventory value
df['Price_Category'] = pd.cut(df['Price'], 
                             bins=[0, 50, 200, float('inf')], 
                             labels=['Cheap', 'Medium', 'Expensive'])
df['High_Rating'] = df['Rating'] >= 4.5

print("With new columns:")
print(df)

In [None]:
# String operations
df['Product_Upper'] = df['Product'].str.upper()
df['Product_Length'] = df['Product'].str.len()
df['Category_Code'] = df['Category'].str[:4].str.upper()  # First 4 characters, uppercase

print("String operations:")
print(df[['Product', 'Product_Upper', 'Product_Length', 'Category', 'Category_Code']])

In [None]:
# Apply custom functions
def categorize_stock(stock):
    if stock < 75:
        return 'Low'
    elif stock < 150:
        return 'Medium'
    else:
        return 'High'

df['Stock_Level'] = df['Stock'].apply(categorize_stock)

print("Custom function application:")
print(df[['Product', 'Stock', 'Stock_Level']])
print()

# Using lambda functions
df['Discounted_Price'] = df['Price'].apply(lambda x: x * 0.9 if x > 100 else x)

print("Lambda function (10% discount for items > $100):")
print(df[['Product', 'Price', 'Discounted_Price']])

In [None]:
# Sorting data
print("Sorting by Price (ascending):")
print(df.sort_values('Price')[['Product', 'Price']])
print()

print("Sorting by multiple columns:")
print(df.sort_values(['Category', 'Price'], ascending=[True, False])[['Product', 'Category', 'Price']])
print()

# Ranking
df['Price_Rank'] = df['Price'].rank(ascending=False)
df['Rating_Rank'] = df['Rating'].rank(ascending=False)

print("Rankings:")
print(df[['Product', 'Price', 'Price_Rank', 'Rating', 'Rating_Rank']])
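
`apply` with a Python function runs once per element, which can be slow on large columns. As a sketch of a vectorized alternative, the same stock thresholds and discount rule can be expressed with `pd.cut` and `np.where`; the Series below copy the tutorial's values so the block runs on its own.

```python
import numpy as np
import pandas as pd

stock = pd.Series([50, 200, 150, 75, 100], name='Stock')

# Same thresholds as categorize_stock: <75 Low, <150 Medium, otherwise High.
# right=False makes each bin closed on the left, e.g. [75, 150)
stock_level = pd.cut(stock,
                     bins=[-np.inf, 75, 150, np.inf],
                     labels=['Low', 'Medium', 'High'],
                     right=False)

# np.where is a vectorized form of the lambda discount
price = pd.Series([999.99, 29.99, 79.99, 299.99, 149.99], name='Price')
discounted = pd.Series(np.where(price > 100, price * 0.9, price),
                       name='Discounted_Price')

print(stock_level.tolist())
print(discounted.tolist())
```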

## 8. Grouping and Aggregation {#grouping-aggregation}

Grouping data and performing aggregate operations is essential for data analysis.

In [None]:
# Create more comprehensive sample data (seeded so the random values are reproducible)
np.random.seed(42)
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=20, freq='D'),
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'] * 4,
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'] * 4,
    'Sales': np.random.randint(1, 10, 20),
    'Revenue': np.random.randint(100, 1000, 20),
    'Region': ['North', 'South', 'East', 'West', 'North'] * 4
})

print("Sales data:")
print(sales_data.head(10))

In [None]:
# Basic grouping and aggregation
print("Sales by Category:")
category_sales = sales_data.groupby('Category')['Sales'].sum()
print(category_sales)
print()

print("Multiple aggregations:")
category_stats = sales_data.groupby('Category').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Revenue': ['sum', 'mean', 'max', 'min']
})
print(category_stats)

In [None]:
# Multiple grouping columns
print("Sales by Category and Region:")
category_region_sales = sales_data.groupby(['Category', 'Region'])['Sales'].sum()
print(category_region_sales)
print()

# Unstack for better visualization
print("Unstacked format:")
print(category_region_sales.unstack(fill_value=0))

In [None]:
# Custom aggregation functions
def revenue_per_sale(group):
    return group['Revenue'].sum() / group['Sales'].sum()

print("Custom aggregation - Average Revenue per Sale:")
# Select only the columns the function needs before applying it
avg_revenue_per_sale = sales_data.groupby('Product')[['Revenue', 'Sales']].apply(revenue_per_sale)
print(avg_revenue_per_sale)
print()

# Using transform to add group statistics back to original DataFrame
sales_data['Category_Avg_Sales'] = sales_data.groupby('Category')['Sales'].transform('mean')
sales_data['Sales_vs_Category_Avg'] = sales_data['Sales'] - sales_data['Category_Avg_Sales']

print("Transform example (first 10 rows):")
print(sales_data[['Product', 'Category', 'Sales', 'Category_Avg_Sales', 'Sales_vs_Category_Avg']].head(10))
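
The dictionary form of `agg` shown earlier produces a two-level column index. Named aggregation is an alternative that yields flat, readable column names. The sketch below uses a small fixed `sales` frame (made up here so the numbers are predictable).

```python
import pandas as pd

# Hypothetical fixed data so the result is predictable
sales = pd.DataFrame({
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics'],
    'Sales': [3, 5, 2, 4],
    'Revenue': [900, 150, 80, 1200],
})

# Each keyword names an output column: (input column, aggregation)
summary = sales.groupby('Category').agg(
    total_sales=('Sales', 'sum'),
    avg_revenue=('Revenue', 'mean'),
    max_revenue=('Revenue', 'max'),
)

print(summary)
```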

## 9. Merging and Joining {#merging-joining}

Combining data from multiple DataFrames is a common task in data analysis.

In [None]:
# Create sample datasets
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 2, 3, 1, 6],  # Note: customer_id 6 doesn't exist in customers
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones', 'Tablet'],
    'amount': [999, 29, 79, 299, 149, 399]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

In [None]:
# Inner join (default) - only matching records
print("Inner join:")
inner_join = pd.merge(customers, orders, on='customer_id')
print(inner_join)
print()

# Left join - all records from left DataFrame
print("Left join:")
left_join = pd.merge(customers, orders, on='customer_id', how='left')
print(left_join)
print()

# Right join - all records from right DataFrame
print("Right join:")
right_join = pd.merge(customers, orders, on='customer_id', how='right')
print(right_join)
print()

# Outer join - all records from both DataFrames
print("Outer join:")
outer_join = pd.merge(customers, orders, on='customer_id', how='outer')
print(outer_join)

In [None]:
# Joining with different column names
products = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'price': [999, 29, 79, 299, 149]
})

print("Products:")
print(products)
print()

# Merge with different column names
order_details = pd.merge(orders, products, 
                        left_on='product', right_on='product_name', 
                        how='left')
print("Order details (merged on different column names):")
print(order_details)
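
After a join it is often useful to know which rows found a match. One way to check, sketched here on cut-down versions of the customer and order tables, is the `indicator` and `validate` parameters of `pd.merge`.

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'customer_id': [1, 1, 6]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
# validate='one_to_many' raises MergeError if customer_id repeats in customers.
audit = pd.merge(customers, orders, on='customer_id', how='outer',
                 indicator=True, validate='one_to_many')

unmatched = audit[audit['_merge'] != 'both']

print(audit)
print(unmatched[['customer_id', '_merge']])
```

Here customers 2 and 3 have no orders, and the order for customer 6 has no matching customer record.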

In [None]:
# Concatenating DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
df3 = pd.DataFrame({'C': [13, 14, 15], 'D': [16, 17, 18]})

print("Original DataFrames:")
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
print("\ndf3:")
print(df3)
print()

# Vertical concatenation (stacking rows)
print("Vertical concatenation (df1 + df2):")
vertical_concat = pd.concat([df1, df2], ignore_index=True)
print(vertical_concat)
print()

# Horizontal concatenation (side by side)
print("Horizontal concatenation (df1 + df3):")
horizontal_concat = pd.concat([df1, df3], axis=1)
print(horizontal_concat)

## 10. Time Series Analysis {#time-series}

Pandas has excellent support for working with dates and time series data.

In [None]:
# Create time series data (seeded so the random values are reproducible)
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=365, freq='D')
ts_data = pd.DataFrame({
    'date': dates,
    'sales': np.random.normal(100, 20, 365) + np.sin(np.arange(365) * 2 * np.pi / 365) * 10,
    'temperature': np.random.normal(20, 10, 365) + np.sin(np.arange(365) * 2 * np.pi / 365) * 15
})

# Set date as index
ts_data.set_index('date', inplace=True)

print("Time series data (first 10 rows):")
print(ts_data.head(10))
print(f"\nData types:\n{ts_data.dtypes}")
print(f"\nIndex type: {type(ts_data.index)}")

In [None]:
# Date/time operations
ts_data['year'] = ts_data.index.year
ts_data['month'] = ts_data.index.month
ts_data['day_of_week'] = ts_data.index.day_name()
ts_data['quarter'] = ts_data.index.quarter

print("Date components:")
print(ts_data[['sales', 'year', 'month', 'day_of_week', 'quarter']].head(10))

In [None]:
# Time-based selection
print("January 2024 data:")
# Partial-string row selection needs .loc; ts_data['2024-01'] would look for a column
january_data = ts_data.loc['2024-01']
print(january_data.head())
print(f"Shape: {january_data.shape}")
print()

# Date range selection
print("First week of 2024:")
first_week = ts_data.loc['2024-01-01':'2024-01-07']
print(first_week)

In [None]:
# Resampling (aggregating by time periods)
# 'M' means month-end; pandas 2.2+ deprecates it in favor of 'ME'
print("Monthly averages:")
monthly_avg = ts_data.resample('M')[['sales', 'temperature']].mean()
print(monthly_avg)
print()

print("Weekly sums:")
weekly_sales = ts_data['sales'].resample('W').sum()
print(weekly_sales.head(10))
print()

# Rolling calculations
ts_data['sales_7day_avg'] = ts_data['sales'].rolling(window=7).mean()
ts_data['sales_30day_avg'] = ts_data['sales'].rolling(window=30).mean()

print("Rolling averages (first 40 rows):")
print(ts_data[['sales', 'sales_7day_avg', 'sales_30day_avg']].head(40))
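
A rolling window returns NaN until it has a full window of observations, which is why the first rows of the moving averages above are empty. The sketch below, on a short made-up `daily` Series, contrasts that default with `min_periods=1` and adds a day-over-day change.

```python
import pandas as pd

# Hypothetical four-day series
daily = pd.Series([100.0, 110.0, 121.0, 133.1],
                  index=pd.date_range('2024-01-01', periods=4, freq='D'))

# A plain 3-day window leaves the first two entries as NaN
strict = daily.rolling(window=3).mean()

# min_periods=1 averages over whatever is available so far
lenient = daily.rolling(window=3, min_periods=1).mean()

# pct_change compares each value with the previous one
growth = daily.pct_change()

print(pd.DataFrame({'strict': strict, 'lenient': lenient, 'growth': growth}))
```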

## 11. Data Visualization with Pandas {#visualization}

Pandas integrates well with matplotlib for quick data visualization.

In [None]:
# Set up plotting
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)

# Create sample data for visualization
viz_data = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones', 'Tablet'],
    'Sales': [150, 300, 200, 120, 180, 90],
    'Revenue': [149850, 8970, 15980, 35880, 26820, 35910],
    'Rating': [4.5, 4.2, 4.0, 4.8, 4.3, 4.1]
})

print("Data for visualization:")
print(viz_data)

In [None]:
# Bar plot
plt.figure(figsize=(10, 6))
viz_data.set_index('Product')['Sales'].plot(kind='bar', color='skyblue')
plt.title('Sales by Product')
plt.xlabel('Product')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Horizontal bar plot
plt.figure(figsize=(10, 6))
viz_data.set_index('Product')['Revenue'].plot(kind='barh', color='lightcoral')
plt.title('Revenue by Product')
plt.xlabel('Revenue ($)')
plt.tight_layout()
plt.show()

In [None]:
# Scatter plot
# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure()
ax = viz_data.plot(x='Sales', y='Revenue', kind='scatter', s=100, alpha=0.7, figsize=(10, 6))
ax.set_title('Sales vs Revenue')
ax.set_xlabel('Sales')
ax.set_ylabel('Revenue ($)')

# Add product labels
for _, row in viz_data.iterrows():
    ax.annotate(row['Product'], (row['Sales'], row['Revenue']),
                xytext=(5, 5), textcoords='offset points')

plt.tight_layout()
plt.show()

In [None]:
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Sales bar chart
viz_data.set_index('Product')['Sales'].plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('Sales by Product')
axes[0,0].tick_params(axis='x', rotation=45)

# Revenue pie chart
viz_data.set_index('Product')['Revenue'].plot(kind='pie', ax=axes[0,1], autopct='%1.1f%%')
axes[0,1].set_title('Revenue Distribution')
axes[0,1].set_ylabel('')

# Rating histogram
viz_data['Rating'].plot(kind='hist', bins=10, ax=axes[1,0], color='lightgreen', alpha=0.7)
axes[1,0].set_title('Rating Distribution')
axes[1,0].set_xlabel('Rating')

# Box plot
viz_data[['Sales', 'Rating']].plot(kind='box', ax=axes[1,1])
axes[1,1].set_title('Box Plot: Sales and Rating')

plt.tight_layout()
plt.show()

In [None]:
# Time series plot
plt.figure(figsize=(15, 8))

# Plot original sales and moving averages
ts_data['sales'].plot(label='Daily Sales', alpha=0.3)
ts_data['sales_7day_avg'].plot(label='7-Day Moving Average', linewidth=2)
ts_data['sales_30day_avg'].plot(label='30-Day Moving Average', linewidth=2)

plt.title('Sales Time Series with Moving Averages')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Summary and Best Practices

### Key Takeaways

1. **Data Structures**: Master Series and DataFrame - they're the foundation of pandas
2. **Data Loading**: Use the appropriate `read_*` function and parameters such as `usecols`, `dtype`, and `index_col` to load only what you need
3. **Data Exploration**: Always start with `.info()`, `.describe()`, `.head()`, and `.tail()`
4. **Data Cleaning**: Handle missing values and duplicates systematically
5. **Selection**: Use `.loc[]` for label-based and `.iloc[]` for position-based selection
6. **Grouping**: Leverage `groupby()` for powerful aggregations and analysis
7. **Merging**: Choose the right join type based on your analysis needs
8. **Time Series**: Set datetime as index for time-based operations

### Best Practices

- **Memory Management**: Use appropriate data types (`category` for categorical data, `int32` instead of `int64` when possible)
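- **Method Chaining**: Chain operations for cleaner code: `df.dropna().groupby('column').mean(numeric_only=True)`
- **Copy vs. View**: Selecting a subset may return a view or a copy of the data; call `.copy()` when you need an independent DataFrame, and assign through `.loc[]` rather than chained indexing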
- **Performance**: Use vectorized operations instead of loops when possible
- **Documentation**: Document your data transformations and assumptions
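
As a rough illustration of the memory point above, the sketch below compares the footprint of a repetitive text column before and after conversion to `category`. The `regions` Series is made up, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical column with many repeated labels
regions = pd.Series(['North', 'South', 'East', 'West'] * 25_000)

# deep=True counts the memory of the underlying string objects as well
as_text = regions.memory_usage(deep=True)
as_category = regions.astype('category').memory_usage(deep=True)

print(f"text:     {as_text:,} bytes")
print(f"category: {as_category:,} bytes")
```

The saving comes from storing each distinct label once and keeping small integer codes per row, so it shrinks as the number of distinct values grows.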

### Next Steps

- Explore advanced pandas features like MultiIndex
- Learn about pandas extensions and integrations with other libraries
- Practice with real datasets to solidify your understanding
- Consider learning complementary libraries like NumPy, Matplotlib, and Scikit-learn

In [None]:
# Clean up created files
import os
if os.path.exists('sample_products.csv'):
    os.remove('sample_products.csv')
    print("Cleaned up sample CSV file")