# Data Manipulation with Pandas and NumPy

This notebook is a guide to data manipulation with two of Python's most widely used libraries: NumPy and Pandas. You'll learn how to work efficiently with arrays and DataFrames, clean data, perform transformations, and prepare datasets for analysis and machine learning.

---
## 1. Introduction to NumPy Arrays and Operations

NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays efficiently.

### Why NumPy?
- **Speed**: NumPy arrays store elements of a single type in contiguous memory, so operations on them are significantly faster than equivalent loops over Python lists
- **Vectorization**: Perform operations on entire arrays without explicit loops
- **Broadcasting**: Combine arrays of different but compatible shapes without manually repeating data
- **Foundation**: NumPy is the foundation for Pandas, SciPy, scikit-learn, and many other libraries
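
As a rough illustration of the vectorization and broadcasting points above, the sketch below centers each column of a small matrix. The `samples` values are made up for this example.

```python
import numpy as np

# Hypothetical data: 3 samples (rows) x 2 features (columns)
samples = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

# Vectorization: one expression replaces an explicit Python loop
doubled = samples * 2

# Broadcasting: column_means has shape (2,) and samples has shape (3, 2).
# NumPy applies the (2,) array to each of the 3 rows.
column_means = samples.mean(axis=0)
centered = samples - column_means

print("Column means:", column_means)
print("Centered:\n", centered)
```

Shapes are compatible for broadcasting when, compared from the last dimension backward, each pair of sizes is equal or one of them is 1.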

In [None]:
# Import NumPy library
import numpy as np

# Display NumPy version
print(f"NumPy version: {np.__version__}")

### Creating NumPy Arrays

There are multiple ways to create NumPy arrays depending on your data source and needs.

In [None]:
# Creating arrays from Python lists
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)
print("Data type:", arr1.dtype)

# Creating a 2D array (matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\n2D Array:\n", arr2)
print("Shape:", arr2.shape)  # Returns (rows, columns)
print("Dimensions:", arr2.ndim)

In [None]:
# Creating arrays with built-in functions

# Array of zeros
zeros = np.zeros((3, 4))  # 3 rows, 4 columns
print("Zeros array:\n", zeros)

# Array of ones
ones = np.ones((2, 3))
print("\nOnes array:\n", ones)

# Identity matrix
identity = np.eye(3)
print("\nIdentity matrix:\n", identity)

# Array with a range of values
range_arr = np.arange(0, 10, 2)  # Start, stop, step
print("\nRange array:", range_arr)

# Array with evenly spaced values
linspace_arr = np.linspace(0, 1, 5)  # Start, stop, number of values
print("\nLinspace array:", linspace_arr)

In [None]:
# Creating random arrays

# Random values between 0 and 1
random_arr = np.random.random((2, 3))
print("Random array:\n", random_arr)

# Random integers
random_int = np.random.randint(1, 100, size=(3, 3))  # Integers from 1 to 99 (upper bound is exclusive)
print("\nRandom integers:\n", random_int)

# Random values from normal distribution
normal_arr = np.random.randn(4)  # Mean=0, Std=1
print("\nNormal distribution:", normal_arr)

### Array Operations and Manipulation

NumPy allows you to perform element-wise operations efficiently without explicit loops.

In [None]:
# Basic arithmetic operations (element-wise)
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print("Array a:", a)
print("Array b:", b)

print("\nAddition:", a + b)
print("Subtraction:", b - a)
print("Multiplication:", a * b)
print("Division:", b / a)
print("Power:", a ** 2)

# Operations with scalars
print("\nScalar multiplication:", a * 10)
print("Scalar addition:", a + 5)

In [None]:
# Universal functions (ufuncs)
arr = np.array([1, 4, 9, 16, 25])

print("Original array:", arr)
print("Square root:", np.sqrt(arr))
print("Exponential:", np.exp(arr[:3]))  # First 3 elements only
print("Natural logarithm:", np.log(arr))
print("Sine:", np.sin(arr))

In [None]:
# Statistical operations
data = np.array([12, 15, 18, 22, 25, 30, 35, 40])

print("Data:", data)
print("\nMean:", np.mean(data))
print("Median:", np.median(data))
print("Standard deviation:", np.std(data))
print("Variance:", np.var(data))
print("Min:", np.min(data))
print("Max:", np.max(data))
print("Sum:", np.sum(data))
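
One detail worth checking when comparing results across libraries: `np.std` and `np.var` compute the population statistic by default (`ddof=0`), while pandas (introduced in the next section) defaults to the sample statistic (`ddof=1`). A minimal sketch, reusing the same numbers as the cell above:

```python
import numpy as np
import pandas as pd

data = np.array([12, 15, 18, 22, 25, 30, 35, 40])

population_std = np.std(data)        # ddof=0: divide by n
sample_std = np.std(data, ddof=1)    # ddof=1: divide by n - 1
pandas_std = pd.Series(data).std()   # pandas uses ddof=1 by default

print("Population std (NumPy default):", population_std)
print("Sample std (NumPy, ddof=1):", sample_std)
print("Sample std (pandas default):", pandas_std)
```

Pass `ddof` explicitly whenever the distinction matters, so the result does not depend on which library computed it.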

In [None]:
# Array indexing and slicing
arr = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])

print("Original array:", arr)
print("First element:", arr[0])
print("Last element:", arr[-1])
print("Slice [2:5]:", arr[2:5])  # Elements at index 2, 3, 4
print("Every other element:", arr[::2])
print("Reversed:", arr[::-1])

In [None]:
# 2D array indexing
matrix = np.array([[1, 2, 3], 
                   [4, 5, 6], 
                   [7, 8, 9]])

print("Matrix:\n", matrix)
print("\nElement at [0, 0]:", matrix[0, 0])  # First row, first column
print("Element at [1, 2]:", matrix[1, 2])  # Second row, third column
print("\nFirst row:", matrix[0, :])  # All columns of first row
print("Second column:", matrix[:, 1])  # All rows of second column
print("\nSubmatrix:\n", matrix[0:2, 1:3])  # First 2 rows, columns 1-2

In [None]:
# Boolean indexing (filtering)
scores = np.array([45, 78, 92, 65, 88, 55, 73, 95])

print("All scores:", scores)

# Create a boolean mask
passing = scores >= 60
print("\nPassing mask:", passing)

# Filter using the mask
passing_scores = scores[passing]
print("Passing scores:", passing_scores)

# One-line filtering
high_scores = scores[scores > 80]
print("High scores (>80):", high_scores)

In [None]:
# Reshaping arrays
arr = np.arange(12)  # Creates [0, 1, 2, ..., 11]
print("Original array:", arr)
print("Shape:", arr.shape)

# Reshape to 2D
reshaped = arr.reshape(3, 4)  # 3 rows, 4 columns
print("\nReshaped (3x4):\n", reshaped)

# Reshape to 3D
reshaped_3d = arr.reshape(2, 2, 3)  # 2 blocks, 2 rows, 3 columns
print("\nReshaped (2x2x3):\n", reshaped_3d)

# Flatten back to 1D
flattened = reshaped.flatten()
print("\nFlattened:", flattened)
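
Reshaping raises the question of whether the result shares memory with the original array. The sketch below, using a small throwaway array, suggests how to check: `reshape` and `ravel` return views when they can, while `flatten` always copies.

```python
import numpy as np

base = np.arange(6)
as_matrix = base.reshape(2, 3)    # a view onto base's data (base is contiguous)
flat_copy = as_matrix.flatten()   # always an independent copy
flat_view = as_matrix.ravel()     # a view when the memory layout allows it

base[0] = 99  # modify the original array

print("reshape sees the change:", as_matrix[0, 0])
print("ravel sees the change:", flat_view[0])
print("flatten does not:", flat_copy[0])
print("Shares memory with base:", np.shares_memory(base, as_matrix))
```

Use `flatten()` (or `.copy()`) when later edits must not affect the source array.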

In [None]:
# Combining arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Concatenate
concatenated = np.concatenate([a, b])
print("Concatenated:", concatenated)

# Vertical stack (rows)
vstacked = np.vstack([a, b])
print("\nVertically stacked:\n", vstacked)

# Horizontal stack (for 1D arrays this gives the same result as concatenate)
hstacked = np.hstack([a, b])
print("\nHorizontally stacked:", hstacked)

---
## 2. Pandas Series and DataFrames

Pandas is built on top of NumPy and provides two primary data structures: **Series** (1-dimensional) and **DataFrame** (2-dimensional). These structures are designed specifically for data analysis and manipulation.

### Why Pandas?
- **Labeled data**: Work with row and column labels instead of just numeric indices
- **Heterogeneous data**: Store different data types in the same structure
- **Missing data handling**: Built-in methods for dealing with missing values
- **Data alignment**: Automatic alignment based on labels during operations
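
To make the data alignment point concrete, here is a small sketch with two made-up Series whose labels only partly overlap. Values are matched by label, not by position.

```python
import pandas as pd

# Hypothetical quantities; the labels are in different orders and only partly shared
first_batch = pd.Series({'apples': 10, 'pears': 5, 'plums': 7})
second_batch = pd.Series({'pears': 3, 'apples': 4, 'kiwis': 8})

# Plain addition aligns on labels; labels present on only one side become NaN
total = first_batch + second_batch
print(total)

# .add with fill_value treats a missing label as 0 instead
total_filled = first_batch.add(second_batch, fill_value=0)
print(total_filled)
```

The NaN results are a useful signal that the two inputs do not cover the same labels.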

In [None]:
# Import Pandas library
import pandas as pd

# Display Pandas version
print(f"Pandas version: {pd.__version__}")

### Pandas Series

A Series is a one-dimensional labeled array that can hold any data type. Think of it as a column in a spreadsheet or a labeled NumPy array.

In [None]:
# Creating a Series from a list
temperatures = pd.Series([22, 25, 28, 24, 26])
print("Temperature Series:")
print(temperatures)
print("\nData type:", temperatures.dtype)

In [None]:
# Creating a Series with custom index
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
temperatures = pd.Series([22, 25, 28, 24, 26], index=days)
print("Temperature by day:")
print(temperatures)

# Accessing values by label
print("\nWednesday temperature:", temperatures['Wed'])

# Accessing values by position (use .iloc; plain [0] on a labeled Series is deprecated)
print("First temperature:", temperatures.iloc[0])

In [None]:
# Creating a Series from a dictionary
population = {
    'Tokyo': 37400000,
    'Delhi': 30290000,
    'Shanghai': 27058000,
    'Mumbai': 20410000
}

pop_series = pd.Series(population)
print("City populations:")
print(pop_series)

# Series attributes
print("\nValues:", pop_series.values)  # Returns NumPy array
print("Index:", pop_series.index.tolist())

In [None]:
# Series operations
prices = pd.Series([100, 200, 150, 300], index=['A', 'B', 'C', 'D'])

print("Original prices:")
print(prices)

# Arithmetic operations
print("\nAfter 10% discount:")
print(prices * 0.9)

# Boolean filtering
print("\nPrices above 150:")
print(prices[prices > 150])

# Statistical methods
print("\nMean price:", prices.mean())
print("Max price:", prices.max())

### Pandas DataFrames

A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types. It's similar to a spreadsheet, SQL table, or a dictionary of Series objects.

In [None]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Mumbai'],
    'Salary': [70000, 80000, 75000, 90000, 85000]
}

df = pd.DataFrame(data)
print("Employee DataFrame:")
print(df)

In [None]:
# DataFrame attributes and methods
print("Shape (rows, columns):", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nIndex:", df.index.tolist())
print("\nData types:\n", df.dtypes)

# Quick statistics
print("\nBasic statistics:")
print(df.describe())

In [None]:
# Viewing data
print("First 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

# Getting information about the DataFrame
print("\nDataFrame info:")
df.info()

In [None]:
# Accessing columns
print("Names column:")
print(df['Name'])  # Returns a Series

# Attribute access (works only if the column name is a valid Python identifier
# and does not clash with an existing DataFrame attribute or method)
print("\nAges:")
print(df.Age)

# Accessing multiple columns
print("\nName and City:")
print(df[['Name', 'City']])  # Returns a DataFrame

In [None]:
# Adding new columns
df['Department'] = ['IT', 'HR', 'IT', 'Finance', 'HR']
print("After adding Department:")
print(df)

# Creating calculated columns
df['Salary_Thousands'] = df['Salary'] / 1000
print("\nWith calculated column:")
print(df)

In [None]:
# Deleting columns
df_copy = df.copy()  # Create a copy to preserve original

# Method 1: Using drop (returns new DataFrame)
df_dropped = df_copy.drop('Salary_Thousands', axis=1)
print("After dropping column:")
print(df_dropped)

# Method 2: Using del (modifies in-place)
del df_copy['Salary_Thousands']
print("\nUsing del:")
print(df_copy)

---
## 3. Data Loading and Exploration

Pandas can read data from various file formats including CSV, Excel, JSON, SQL databases, and more. This section covers the most common data loading scenarios.

### Creating Sample Data for Examples

First, let's create sample datasets that we'll use throughout this section.

In [None]:
# Create a sample dataset
import numpy as np

np.random.seed(42)  # For reproducibility

# Sample sales data
sales_data = {
    'Date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch'], 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Sales': np.random.randint(100, 1000, 100),
    'Quantity': np.random.randint(1, 20, 100)
}

sales_df = pd.DataFrame(sales_data)
print("Sample sales data created:")
print(sales_df.head())

### Reading CSV Files

CSV (Comma-Separated Values) is one of the most common file formats for data exchange.

In [None]:
# Save DataFrame to CSV for demonstration
sales_df.to_csv('sales_data.csv', index=False)
print("CSV file created successfully!")

# Read CSV file
df_from_csv = pd.read_csv('sales_data.csv')
print("\nData loaded from CSV:")
print(df_from_csv.head())
print(f"\nShape: {df_from_csv.shape}")

In [None]:
# CSV reading with options

# Read only first 10 rows
df_limited = pd.read_csv('sales_data.csv', nrows=10)
print("First 10 rows only:")
print(df_limited)

# Read specific columns
df_columns = pd.read_csv('sales_data.csv', usecols=['Product', 'Sales', 'Quantity'])
print("\nSpecific columns:")
print(df_columns.head())

# Read with Date column as datetime
df_dates = pd.read_csv('sales_data.csv', parse_dates=['Date'])
print("\nWith parsed dates:")
print(df_dates.dtypes)

### Reading Excel Files

Pandas can read and write `.xlsx` files using the `openpyxl` library. Reading legacy `.xls` files relies on `xlrd`.

In [None]:
# Save DataFrame to Excel (requires openpyxl)
try:
    sales_df.to_excel('sales_data.xlsx', sheet_name='Sales', index=False)
    print("Excel file created successfully!")
    
    # Read Excel file
    df_from_excel = pd.read_excel('sales_data.xlsx', sheet_name='Sales')
    print("\nData loaded from Excel:")
    print(df_from_excel.head())
    
except ImportError:
    print("Note: Install openpyxl to work with Excel files")
    print("Run: pip install openpyxl")

### Reading from Other Formats

In [None]:
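# Reading JSON (wrap the string in StringIO; passing raw JSON text
# directly to read_json is deprecated in recent pandas versions)
from io import StringIO

json_data = sales_df.head().to_json(orient='records')
df_from_json = pd.read_json(StringIO(json_data))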
print("From JSON:")
print(df_from_json)

# Reading from dictionary
dict_data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df_from_dict = pd.DataFrame(dict_data)
print("\nFrom dictionary:")
print(df_from_dict)

# Reading from clipboard (useful for quick data pasting)
# df_from_clipboard = pd.read_clipboard()  # Uncomment to use

### Data Exploration Methods

Once data is loaded, it's essential to explore and understand its structure and content.

In [None]:
# Load our sample data for exploration
df = pd.read_csv('sales_data.csv')

# Basic information
print("Dataset shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nMemory usage:")
print(df.memory_usage(deep=True))

In [None]:
# Statistical summary
print("Statistical summary:")
print(df.describe())

# Include non-numeric columns
print("\nAll columns summary:")
print(df.describe(include='all'))

In [None]:
# Checking for missing values
print("Missing values:")
print(df.isnull().sum())

print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df)) * 100)

In [None]:
# Value counts for categorical columns
print("Product distribution:")
print(df['Product'].value_counts())

print("\nRegion distribution:")
print(df['Region'].value_counts())

# Proportions instead of counts
print("\nProduct proportions:")
print(df['Product'].value_counts(normalize=True))

In [None]:
# Unique values
print("Unique products:", df['Product'].unique())
print("Number of unique products:", df['Product'].nunique())

# Check for duplicates
print("\nNumber of duplicate rows:", df.duplicated().sum())

---
## 4. Data Selection and Indexing

Pandas provides multiple ways to select and filter data. Understanding these methods is crucial for efficient data manipulation.

### Label-based Indexing with .loc

`.loc[]` selects rows and columns by label. Unlike ordinary Python slices, label slices include both endpoints.

In [None]:
# Create a sample DataFrame with custom index
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [70000, 60000, 75000, 80000, 65000]
}, index=['E001', 'E002', 'E003', 'E004', 'E005'])

print("Employee DataFrame:")
print(employees)

In [None]:
# Select a single row by label
print("Employee E003:")
print(employees.loc['E003'])

# Select multiple rows
print("\nEmployees E001 and E003:")
print(employees.loc[['E001', 'E003']])

# Select range of rows (inclusive)
print("\nEmployees E002 to E004:")
print(employees.loc['E002':'E004'])

In [None]:
# Select specific rows and columns
print("Name and Age for E001 and E005:")
print(employees.loc[['E001', 'E005'], ['Name', 'Age']])

# Select all rows with specific columns
print("\nAll employees' names and salaries:")
print(employees.loc[:, ['Name', 'Salary']])

# Select specific rows with all columns
print("\nEmployee E002 (all details):")
print(employees.loc['E002', :])

In [None]:
# Modifying data with .loc
employees_copy = employees.copy()

# Update a single value
employees_copy.loc['E001', 'Salary'] = 72000
print("After updating E001's salary:")
print(employees_copy.loc['E001'])

# Update multiple values
employees_copy.loc['E002', ['Age', 'Salary']] = [31, 62000]
print("\nAfter updating E002:")
print(employees_copy.loc['E002'])

### Position-based Indexing with .iloc

`.iloc[]` selects by integer position (0-based), like NumPy arrays, and ignores the index labels. Slices exclude the end position.

In [None]:
# Select by position
print("First row (position 0):")
print(employees.iloc[0])

# Select multiple rows by position
print("\nFirst and third rows:")
print(employees.iloc[[0, 2]])

# Select range of rows (exclusive end)
print("\nRows at positions 1 and 2:")
print(employees.iloc[1:3])

In [None]:
# Select rows and columns by position
print("First two rows, first two columns:")
print(employees.iloc[0:2, 0:2])

# Select specific positions
print("\nRows 0 and 2, columns 1 and 3:")
print(employees.iloc[[0, 2], [1, 3]])

# Get last row
print("\nLast row:")
print(employees.iloc[-1])

In [None]:
# Combining slicing techniques
print("Every other row, all columns:")
print(employees.iloc[::2, :])

print("\nAll rows, every other column:")
print(employees.iloc[:, ::2])
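
The difference between `.loc` and `.iloc` is easiest to miss when the index labels are themselves integers. The following sketch uses a small hypothetical DataFrame (`df_int`) whose labels start at 1, so the same slice `1:2` means different things to each accessor.

```python
import pandas as pd

# Hypothetical frame with integer labels 1, 2, 3 (not 0-based)
df_int = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[1, 2, 3])

by_label = df_int.loc[1:2]      # labels 1 and 2, end included
by_position = df_int.iloc[1:2]  # position 1 only, end excluded

print("loc[1:2]:\n", by_label)
print("\niloc[1:2]:\n", by_position)

# Single-cell access shows the same split
print("\nloc[1, 'value']:", df_int.loc[1, 'value'])
print("iloc[1, 0]:", df_int.iloc[1, 0])
```

Being explicit about which accessor is intended avoids off-by-one surprises after filtering or sorting changes the index.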

### Boolean Indexing (Filtering)

Boolean indexing allows you to filter data based on conditions. This is one of the most powerful features for data analysis.

In [None]:
# Simple condition
print("Employees with salary > 65000:")
high_salary = employees[employees['Salary'] > 65000]
print(high_salary)

# Multiple conditions (AND)
print("\nIT employees with age > 30:")
it_senior = employees[(employees['Department'] == 'IT') & (employees['Age'] > 30)]
print(it_senior)

In [None]:
# Multiple conditions (OR)
print("HR employees OR salary > 70000:")
hr_or_high = employees[(employees['Department'] == 'HR') | (employees['Salary'] > 70000)]
print(hr_or_high)

# NOT condition
print("\nNon-IT employees:")
non_it = employees[~(employees['Department'] == 'IT')]
print(non_it)

In [None]:
# Using isin() for multiple values
print("IT or HR employees:")
it_hr = employees[employees['Department'].isin(['IT', 'HR'])]
print(it_hr)

# String methods for filtering
print("\nEmployees whose name starts with 'A' or 'C':")
names_ac = employees[employees['Name'].str.startswith(('A', 'C'))]
print(names_ac)

In [None]:
# Between condition
print("Employees aged between 28 and 32 (inclusive):")
age_range = employees[employees['Age'].between(28, 32)]
print(age_range)

# Combining .loc with boolean indexing
print("\nNames of employees with salary < 70000:")
low_salary_names = employees.loc[employees['Salary'] < 70000, 'Name']
print(low_salary_names)

### Query Method

The `.query()` method provides a convenient way to filter data using string expressions.

In [None]:
# Simple query
print("Salary > 65000 using query:")
print(employees.query('Salary > 65000'))

# Multiple conditions
print("\nAge > 28 and Department == 'IT':")
print(employees.query('Age > 28 and Department == "IT"'))

# Using variables in query
min_salary = 70000
print(f"\nSalary >= {min_salary}:")
print(employees.query('Salary >= @min_salary'))

---
## 5. Data Cleaning

Real-world data is often messy and requires cleaning before analysis. This section covers handling missing values, duplicates, and data type conversions.

### Handling Missing Values

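Pandas represents missing data as `NaN` (Not a Number) in numeric columns, `None` or `NaN` in object columns, and `NaT` in datetime columns. Identify and handle these values before analysis, because most aggregations skip them silently.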

In [None]:
# Create a DataFrame with missing values
data_with_nulls = {
    'Name': ['Alice', 'Bob', None, 'David', 'Eve', 'Frank'],
    'Age': [25, np.nan, 35, 28, np.nan, 40],
    'City': ['NYC', 'LA', 'Chicago', None, 'Boston', 'Seattle'],
    'Salary': [70000, 60000, np.nan, 80000, 65000, 75000]
}

df_nulls = pd.DataFrame(data_with_nulls)
print("DataFrame with missing values:")
print(df_nulls)

In [None]:
# Detecting missing values
print("Is null (True where missing):")
print(df_nulls.isnull())

print("\nCount of missing values per column:")
print(df_nulls.isnull().sum())

print("\nTotal missing values:", df_nulls.isnull().sum().sum())
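
Missing values do not always look the same. The sketch below, built on a made-up three-column frame, suggests that `isna()` (an alias of `isnull()`) treats `None`, `NaN`, and `NaT` alike, while empty or blank strings are not counted as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing entry of each kind
mixed_missing = pd.DataFrame({
    'text': ['a', None, 'c'],
    'number': [1.0, np.nan, 3.0],
    'when': pd.to_datetime(['2024-01-01', None, '2024-01-03']),  # None becomes NaT
})

missing_counts = mixed_missing.isna().sum()
print(missing_counts)

# Empty and whitespace-only strings are ordinary values, not missing ones
blank_strings = pd.Series(['', ' ', 'x'])
print("Blank strings flagged as missing:", blank_strings.isna().sum())
```

If blanks should count as missing in your data, convert them first, for example with `replace` and `np.nan`.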

In [None]:
# Dropping rows with any missing values
print("After dropping rows with ANY missing values:")
df_dropped_any = df_nulls.dropna()
print(df_dropped_any)
print(f"Rows remaining: {len(df_dropped_any)} out of {len(df_nulls)}")

# Dropping rows where ALL values are missing
print("\nDropping rows where ALL values are missing:")
df_dropped_all = df_nulls.dropna(how='all')
print(df_dropped_all)

In [None]:
# Dropping rows based on specific columns
print("Drop rows where 'Name' is missing:")
df_dropped_name = df_nulls.dropna(subset=['Name'])
print(df_dropped_name)

# Dropping columns with missing values
print("\nDrop columns with any missing values:")
df_dropped_cols = df_nulls.dropna(axis=1)
print(df_dropped_cols)

In [None]:
# Filling missing values with a constant
df_filled = df_nulls.copy()

# Fill all missing values with a specific value
df_filled_zero = df_filled.fillna(0)
print("Filled with 0:")
print(df_filled_zero)

# Fill different columns with different values
df_filled_custom = df_nulls.fillna({
    'Name': 'Unknown',
    'Age': df_nulls['Age'].mean(),  # Mean age
    'City': 'Not Specified',
    'Salary': df_nulls['Salary'].median()  # Median salary
})
print("\nFilled with custom values:")
print(df_filled_custom)

In [None]:
# Forward fill and backward fill
time_series = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=6),
    'Value': [100, np.nan, np.nan, 200, np.nan, 300]
})

print("Original time series:")
print(time_series)

# Forward fill (use previous valid value)
# Note: fillna(method='ffill') is deprecated in recent pandas; use .ffill() / .bfill()
print("\nForward fill:")
print(time_series.ffill())

# Backward fill (use next valid value)
print("\nBackward fill:")
print(time_series.bfill())

In [None]:
# Interpolation for numeric data
print("Linear interpolation:")
time_series['Value_Interpolated'] = time_series['Value'].interpolate()
print(time_series)

### Handling Duplicates

Duplicate rows can skew analysis results and should be identified and handled appropriately.

In [None]:
# Create DataFrame with duplicates
data_with_dupes = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'David'],
    'Age': [25, 30, 25, 35, 30, 28],
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'Boston']
}

df_dupes = pd.DataFrame(data_with_dupes)
print("DataFrame with duplicates:")
print(df_dupes)

In [None]:
# Identifying duplicates
print("Duplicate rows (True if duplicate):")
print(df_dupes.duplicated())

print("\nNumber of duplicate rows:", df_dupes.duplicated().sum())

# Show duplicate rows
print("\nActual duplicate rows:")
print(df_dupes[df_dupes.duplicated()])

In [None]:
# Removing duplicates (keeps first occurrence)
df_no_dupes = df_dupes.drop_duplicates()
print("After removing duplicates:")
print(df_no_dupes)

# Keep last occurrence instead
df_keep_last = df_dupes.drop_duplicates(keep='last')
print("\nKeeping last occurrence:")
print(df_keep_last)

In [None]:
# Check duplicates based on specific columns
print("Duplicates based on 'Name' only:")
print(df_dupes[df_dupes.duplicated(subset=['Name'])])

# Remove duplicates based on specific columns
df_unique_names = df_dupes.drop_duplicates(subset=['Name'])
print("\nUnique names only:")
print(df_unique_names)
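
By default `duplicated()` flags only the second and later occurrences. When reviewing duplicates before deleting anything, it can help to see every copy side by side. A minimal sketch with `keep=False`, rebuilding the same rows under the hypothetical name `df_people`:

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'David'],
    'Age': [25, 30, 25, 35, 30, 28],
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'Boston'],
})

# keep=False marks every member of a duplicated group, including the first
all_copies = df_people[df_people.duplicated(keep=False)]
print("Every row that has a duplicate:")
print(all_copies.sort_values('Name'))

# With drop_duplicates, keep=False removes all copies and leaves only unique rows
only_unique = df_people.drop_duplicates(keep=False)
print("\nRows that never repeat:")
print(only_unique)
```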

### Data Type Conversion

Ensuring correct data types is essential for proper analysis and memory efficiency.

In [None]:
# Create DataFrame with mixed types
mixed_data = {
    'ID': ['1', '2', '3', '4', '5'],
    'Price': ['100.5', '200.3', '150.0', '300.7', '250.2'],
    'Quantity': ['10', '20', '15', '30', '25'],
    'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05']
}

df_mixed = pd.DataFrame(mixed_data)
print("Original data types:")
print(df_mixed.dtypes)
print("\nDataFrame:")
print(df_mixed)

In [None]:
# Converting to appropriate types
df_converted = df_mixed.copy()

# Convert to numeric
df_converted['ID'] = df_converted['ID'].astype(int)
df_converted['Price'] = df_converted['Price'].astype(float)
df_converted['Quantity'] = pd.to_numeric(df_converted['Quantity'])

# Convert to datetime
df_converted['Date'] = pd.to_datetime(df_converted['Date'])

print("Converted data types:")
print(df_converted.dtypes)
print("\nConverted DataFrame:")
print(df_converted)
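
The memory-efficiency point can be illustrated with the `category` dtype, which stores each distinct value once plus a compact integer code per row. The column below is synthetic, and the exact byte counts will vary with the pandas version and platform.

```python
import pandas as pd

# Synthetic column: 10,000 strings drawn from only 4 distinct values
regions = pd.Series(['North', 'South', 'East', 'West'] * 2500)
as_category = regions.astype('category')

plain_bytes = regions.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("Plain string column:", plain_bytes, "bytes")
print("Category column:", category_bytes, "bytes")
print("Categories:", as_category.cat.categories.tolist())
```

The saving shrinks as the number of distinct values approaches the number of rows, so this conversion suits low-cardinality columns such as regions or product types.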

In [None]:
# Handling errors during conversion
messy_numbers = pd.Series(['1', '2', 'three', '4', '5'])

# This will raise an error: messy_numbers.astype(int)

# Use to_numeric with error handling
clean_numbers = pd.to_numeric(messy_numbers, errors='coerce')  # Invalid = NaN
print("With 'coerce' (invalid become NaN):")
print(clean_numbers)

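# errors='ignore' is deprecated in recent pandas versions. To keep the
# original values when conversion fails, catch the exception explicitly.
try:
    kept_original = pd.to_numeric(messy_numbers)
except ValueError:
    kept_original = messy_numbers
print("\nKeeping original on failure:")
print(kept_original)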

### String Cleaning

Text data often requires cleaning to standardize format and remove unwanted characters.

In [None]:
# Create DataFrame with messy strings
messy_strings = pd.DataFrame({
    'Name': ['  Alice  ', 'BOB', 'charlie', '  DAVID'],
    'Email': ['alice@GMAIL.com', 'bob@yahoo.COM', 'charlie@Gmail.Com', 'david@YAHOO.com']
})

print("Messy strings:")
print(messy_strings)

In [None]:
# String cleaning operations
cleaned = messy_strings.copy()

# Remove leading/trailing whitespace
cleaned['Name'] = cleaned['Name'].str.strip()

# Standardize case
cleaned['Name'] = cleaned['Name'].str.title()  # Title Case
cleaned['Email'] = cleaned['Email'].str.lower()  # Lowercase

print("Cleaned strings:")
print(cleaned)

In [None]:
# More string operations
text_data = pd.Series(['Hello World', 'Python-Programming', 'Data_Science'])

print("Original:", text_data.tolist())
print("Replace space with underscore:", text_data.str.replace(' ', '_').tolist())
print("Replace dash and underscore:", text_data.str.replace('[-_]', ' ', regex=True).tolist())
print("Extract first word:", text_data.str.split().str[0].tolist())

---
## 6. Data Transformation

Data transformation involves reshaping, sorting, grouping, and aggregating data to extract insights and prepare it for analysis.

### Sorting Data

Sorting helps organize data for better visualization and analysis.

In [None]:
# Create sample data for transformation
transform_data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Watch', 'Laptop', 'Phone'],
    'Region': ['North', 'South', 'East', 'West', 'South', 'North'],
    'Sales': [1000, 1500, 800, 600, 1200, 1300],
    'Quantity': [5, 10, 8, 12, 6, 9]
}

df_transform = pd.DataFrame(transform_data)
print("Original DataFrame:")
print(df_transform)

In [None]:
# Sort by single column
print("Sorted by Sales (ascending):")
print(df_transform.sort_values('Sales'))

print("\nSorted by Sales (descending):")
print(df_transform.sort_values('Sales', ascending=False))

In [None]:
# Sort by multiple columns
print("Sorted by Product (asc) then Sales (desc):")
sorted_df = df_transform.sort_values(['Product', 'Sales'], ascending=[True, False])
print(sorted_df)

# Sort by index
print("\nSorted by index:")
print(sorted_df.sort_index())

### Grouping and Aggregation

GroupBy operations allow you to split data into groups, apply functions, and combine results.

In [None]:
# Simple groupby with single aggregation
print("Total sales by Product:")
product_sales = df_transform.groupby('Product')['Sales'].sum()
print(product_sales)

print("\nAverage quantity by Region:")
region_avg = df_transform.groupby('Region')['Quantity'].mean()
print(region_avg)

In [None]:
# Multiple aggregations
print("Multiple statistics by Product:")
product_stats = df_transform.groupby('Product')['Sales'].agg(['sum', 'mean', 'count', 'min', 'max'])
print(product_stats)

In [None]:
# Different aggregations for different columns
print("Custom aggregations:")
custom_agg = df_transform.groupby('Product').agg({
    'Sales': ['sum', 'mean'],
    'Quantity': ['sum', 'max']
})
print(custom_agg)
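
The dictionary form above produces two-level column names such as `('Sales', 'sum')`. If flat, self-describing column names are preferred, named aggregation is one option. The sketch below rebuilds the relevant columns under the hypothetical name `df_sales`; the output column names are also just examples.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Watch', 'Laptop', 'Phone'],
    'Sales': [1000, 1500, 800, 600, 1200, 1300],
    'Quantity': [5, 10, 8, 12, 6, 9],
})

# Each keyword becomes an output column: name=(source column, aggregation)
summary = df_sales.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    max_quantity=('Quantity', 'max'),
)
print(summary)
```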

In [None]:
# Groupby multiple columns
print("Sales by Product and Region:")
product_region = df_transform.groupby(['Product', 'Region'])['Sales'].sum()
print(product_region)

# Convert to DataFrame for better readability
print("\nAs DataFrame:")
print(product_region.reset_index())

In [None]:
# Apply custom functions
def sales_range(x):
    return x.max() - x.min()

print("Sales range by Product:")
range_by_product = df_transform.groupby('Product')['Sales'].apply(sales_range)
print(range_by_product)

### Pivot Tables

Pivot tables provide a spreadsheet-style way to aggregate data.

In [None]:
# Create pivot table
print("Pivot table: Sales by Product and Region:")
pivot = df_transform.pivot_table(
    values='Sales',
    index='Product',
    columns='Region',
    aggfunc='sum',
    fill_value=0  # Fill missing combinations with 0
)
print(pivot)

In [None]:
# Pivot with multiple aggregations
print("Pivot with multiple aggregations:")
multi_pivot = df_transform.pivot_table(
    values='Sales',
    index='Product',
    columns='Region',
    aggfunc=['sum', 'mean'],
    fill_value=0
)
print(multi_pivot)

### Adding Calculated Columns

Create new columns based on existing data.

In [None]:
# Simple calculated column
df_calc = df_transform.copy()
df_calc['Revenue_per_Unit'] = df_calc['Sales'] / df_calc['Quantity']
print("With calculated column:")
print(df_calc)

# Using apply with lambda
df_calc['Sales_Category'] = df_calc['Sales'].apply(
    lambda x: 'High' if x > 1000 else 'Low'
)
print("\nWith category column:")
print(df_calc)

In [None]:
# Conditional column with np.where
df_calc['Performance'] = np.where(
    df_calc['Sales'] > 1000,
    'Excellent',
    np.where(df_calc['Sales'] > 800, 'Good', 'Needs Improvement')
)
print("Multi-condition column:")
print(df_calc[['Product', 'Sales', 'Performance']])
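
Nested `np.where` calls become hard to read beyond two or three conditions. One alternative is `np.select`, which takes parallel lists of conditions and labels and uses the first condition that matches. A sketch with the same thresholds, using a standalone `sales` Series for illustration:

```python
import numpy as np
import pandas as pd

sales = pd.Series([1000, 1500, 800, 600, 1200, 1300])

# Conditions are checked in order; the first True one decides the label
conditions = [sales > 1000, sales > 800]
choices = ['Excellent', 'Good']
performance = np.select(conditions, choices, default='Needs Improvement')

print(pd.DataFrame({'Sales': sales, 'Performance': performance}))
```

Because order matters, list the most specific condition first, as with the nested version.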

### Binning and Categorization

Convert continuous values into categories.

In [None]:
# Create bins for sales
bins = [0, 800, 1200, 2000]
labels = ['Low', 'Medium', 'High']

df_binned = df_transform.copy()
df_binned['Sales_Bin'] = pd.cut(df_binned['Sales'], bins=bins, labels=labels)

print("With binned sales:")
print(df_binned)

print("\nValue counts per bin:")
print(df_binned['Sales_Bin'].value_counts())

In [None]:
# Quantile-based binning (q=3 produces terciles, not quartiles)
df_quantile = df_transform.copy()
df_quantile['Sales_Tercile'] = pd.qcut(
    df_quantile['Sales'],
    q=3,  # Split into 3 groups with (roughly) equal numbers of rows
    labels=['Bottom', 'Middle', 'Top']
)

print("Quantile-based bins:")
print(df_quantile)

---
## 7. Merging and Joining Datasets

Combining multiple datasets is essential for comprehensive analysis. Pandas provides several methods to merge, join, and concatenate DataFrames.

### Concatenation

Concatenation stacks DataFrames vertically (rows) or horizontally (columns).

In [None]:
# Create sample DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 88]
})

df2 = pd.DataFrame({
    'ID': [4, 5, 6],
    'Name': ['David', 'Eve', 'Frank'],
    'Score': [92, 87, 95]
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [None]:
# Vertical concatenation (default)
print("Concatenated vertically:")
vertical_concat = pd.concat([df1, df2])
print(vertical_concat)

# Reset index after concatenation
print("\nWith reset index:")
print(vertical_concat.reset_index(drop=True))

In [None]:
# Horizontal concatenation
df_extra = pd.DataFrame({
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
})

print("Concatenated horizontally:")
horizontal_concat = pd.concat([df1, df_extra], axis=1)
print(horizontal_concat)

In [None]:
# Concatenation with keys (hierarchical index)
print("With hierarchical index:")
keyed_concat = pd.concat([df1, df2], keys=['Group1', 'Group2'])
print(keyed_concat)

### Merging DataFrames

Merging combines DataFrames based on common columns (similar to SQL joins).

In [None]:
# Create sample DataFrames for merging
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'DepartmentID': [101, 102, 101, 103]
})

departments = pd.DataFrame({
    'DepartmentID': [101, 102, 103, 104],
    'Department': ['IT', 'HR', 'Finance', 'Marketing'],
    'Location': ['NYC', 'LA', 'Chicago', 'Boston']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

In [None]:
# Inner join (default) - only matching rows
print("Inner join:")
inner_merge = pd.merge(employees, departments, on='DepartmentID', how='inner')
print(inner_merge)

In [None]:
# Left join - all rows from left DataFrame
print("Left join:")
left_merge = pd.merge(employees, departments, on='DepartmentID', how='left')
print(left_merge)

# Right join - all rows from right DataFrame
print("\nRight join:")
right_merge = pd.merge(employees, departments, on='DepartmentID', how='right')
print(right_merge)

In [None]:
# Outer join - all rows from both DataFrames
print("Outer join:")
outer_merge = pd.merge(employees, departments, on='DepartmentID', how='outer')
print(outer_merge)

In [None]:
# Merging on different column names
sales = pd.DataFrame({
    'EmpID': [1, 2, 3],
    'Sales': [50000, 60000, 55000]
})

print("Merging on different column names:")
merged_diff = pd.merge(
    employees,
    sales,
    left_on='EmployeeID',
    right_on='EmpID',
    how='left'
)
print(merged_diff)

In [None]:
# Merging on index
df_indexed1 = pd.DataFrame(
    {'A': [1, 2, 3]},
    index=['a', 'b', 'c']
)

df_indexed2 = pd.DataFrame(
    {'B': [4, 5, 6]},
    index=['a', 'b', 'd']
)

print("Merging on index:")
merged_index = pd.merge(
    df_indexed1,
    df_indexed2,
    left_index=True,
    right_index=True,
    how='outer'
)
print(merged_index)

### Join Method

The `.join()` method merges DataFrames on their index and performs a left join unless `how` says otherwise.
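
If both frames contain a column with the same name, `.join()` raises a `ValueError` unless suffixes are supplied. The sketch below uses hypothetical frames to show `lsuffix` and `rsuffix`, and builds the same result with `pd.merge` for comparison.

```python
import pandas as pd

# Hypothetical frames that share the column name 'Value'
left = pd.DataFrame({'Value': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'Value': [10, 20]}, index=['a', 'b'])

# Suffixes disambiguate the overlapping column names
joined = left.join(right, how='left', lsuffix='_left', rsuffix='_right')

# The same left join written with merge on both indexes
merged = pd.merge(
    left,
    right,
    left_index=True,
    right_index=True,
    how='left',
    suffixes=('_left', '_right'),
)

print(joined)
print("Same result as merge:", joined.equals(merged))
```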

In [None]:
# Set index and use join
emp_indexed = employees.set_index('EmployeeID')
sales_indexed = sales.set_index('EmpID')

print("Using join method:")
joined = emp_indexed.join(sales_indexed, how='left')
print(joined)

---
## 8. Time Series Basics

Time series data is indexed by timestamps. Pandas provides tools for creating date ranges, parsing dates, and selecting, resampling, and shifting time-indexed data.
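
Raw data often arrives with dates stored as strings, some of them invalid. The following sketch, on a hypothetical `raw` frame, parses the strings with `errors='coerce'` (invalid entries become `NaT`), drops them, and moves the dates into the index so that date-based selection works.

```python
import pandas as pd

# Hypothetical input: '2024-02-30' is not a real calendar date
raw = pd.DataFrame({
    'when': ['2024-01-01', '2024-01-15', '2024-02-30', '2024-02-10'],
    'value': [10, 20, 30, 40],
})

# Invalid dates become NaT instead of raising an error
raw['when'] = pd.to_datetime(raw['when'], format='%Y-%m-%d', errors='coerce')

# Drop unparseable rows, then use the dates as a sorted index
ts = raw.dropna(subset=['when']).set_index('when').sort_index()

print(type(ts.index).__name__)
print(ts.loc['2024-01'])  # partial-string selection of one month
```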

### Working with Dates and Times

In [None]:
# Creating date ranges
date_range = pd.date_range(start='2024-01-01', end='2024-01-10', freq='D')
print("Daily date range:")
print(date_range)

# Different frequencies
print("\nWeekly dates:")
weekly = pd.date_range(start='2024-01-01', periods=5, freq='W')
print(weekly)

print("\nHourly dates:")
hourly = pd.date_range(start='2024-01-01', periods=5, freq='h')  # uppercase 'H' is deprecated since pandas 2.2
print(hourly)

In [None]:
# Converting strings to datetime
date_strings = ['2024-01-01', '2024-02-15', '2024-03-20']
dates = pd.to_datetime(date_strings)
print("Converted to datetime:")
print(dates)
print("Data type:", dates.dtype)

In [None]:
# Parsing different date formats
custom_format = ['01-01-2024', '15-02-2024', '20-03-2024']
parsed_dates = pd.to_datetime(custom_format, format='%d-%m-%Y')
print("Parsed custom format:")
print(parsed_dates)

### Creating Time Series DataFrames

In [None]:
# Create a time series DataFrame
dates = pd.date_range('2024-01-01', periods=30, freq='D')
np.random.seed(42)
values = np.random.randn(30).cumsum() + 100  # Random walk

ts_df = pd.DataFrame({
    'Date': dates,
    'Value': values
})

print("Time series DataFrame:")
print(ts_df.head(10))

# Set date as index
ts_df.set_index('Date', inplace=True)
print("\nWith date index:")
print(ts_df.head())

### Extracting Date Components

In [None]:
# Extract date components
ts_analysis = ts_df.copy()
ts_analysis['Year'] = ts_analysis.index.year
ts_analysis['Month'] = ts_analysis.index.month
ts_analysis['Day'] = ts_analysis.index.day
ts_analysis['DayOfWeek'] = ts_analysis.index.dayofweek  # Monday=0, Sunday=6
ts_analysis['DayName'] = ts_analysis.index.day_name()

print("With extracted components:")
print(ts_analysis.head())

### Time-based Indexing and Slicing

In [None]:
# Select data by date
print("Data for January 5, 2024:")
print(ts_df.loc['2024-01-05'])

# Select date range
print("\nData from Jan 10 to Jan 15:")
print(ts_df.loc['2024-01-10':'2024-01-15'])

# Select by month
print("\nAll January data:")
print(ts_df.loc['2024-01'])

### Resampling Time Series

Resampling allows you to change the frequency of time series data (e.g., daily to weekly).
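
It helps to know how the bins are drawn. With the `'W'` alias, each bin ends on a Sunday and is labeled with that date. The sketch below uses a small hypothetical series so the weekly totals can be checked by hand.

```python
import pandas as pd

# 14 days starting on Monday 2024-01-01, with values 1..14
days = pd.date_range('2024-01-01', periods=14, freq='D')
daily = pd.Series(range(1, 15), index=days)

# Several aggregations at once; each row is labeled with the Sunday ending the week
weekly = daily.resample('W').agg(['sum', 'mean', 'count'])
print(weekly)
```

The first bin covers values 1 to 7 and the second covers 8 to 14, so the sums are 28 and 77.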

In [None]:
# Resample to weekly (downsampling)
print("Weekly mean values:")
weekly_mean = ts_df.resample('W').mean()
print(weekly_mean)

print("\nWeekly sum:")
weekly_sum = ts_df.resample('W').sum()
print(weekly_sum)

In [None]:
# Upsample to 12-hour frequency (forward fill)
print("Upsampled to 12-hour frequency (first 10 rows):")
upsampled = ts_df.resample('12h').ffill()  # Each new 12:00 row repeats the value from 00:00
print(upsampled.head(10))

### Rolling Window Calculations

Rolling windows compute statistics over a sliding window of data points.
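
By default a rolling window returns `NaN` until it holds `window` observations. The sketch below, on a small hypothetical series, compares the default with `min_periods=1` and with a centered window.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: NaN until three values are available
strict = s.rolling(window=3).mean()

# min_periods=1: average whatever is available so far
partial = s.rolling(window=3, min_periods=1).mean()

# center=True: the window is centered on each point instead of trailing it
centered = s.rolling(window=3, center=True).mean()

print(pd.DataFrame({'strict': strict, 'partial': partial, 'centered': centered}))
```

For forecasting features, the trailing (default) window is usually the safer choice because a centered window uses values that come after the row it labels.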

In [None]:
# Calculate rolling mean (moving average)
ts_rolling = ts_df.copy()
ts_rolling['Rolling_Mean_7'] = ts_rolling['Value'].rolling(window=7).mean()
ts_rolling['Rolling_Std_7'] = ts_rolling['Value'].rolling(window=7).std()

print("With rolling statistics (7-day window):")
print(ts_rolling.head(10))

In [None]:
# Rolling sum and other aggregations
print("7-day rolling sum:")
print(ts_df['Value'].rolling(window=7).sum().head(10))

print("\n7-day rolling max:")
print(ts_df['Value'].rolling(window=7).max().head(10))

### Shifting Data

Shifting moves data forward or backward in time, useful for calculating changes and lags.
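
Two built-in methods are shortcuts for common shift calculations: `diff()` subtracts the previous value and `pct_change()` divides by it. The sketch below checks this on a hypothetical three-day price series.

```python
import numpy as np
import pandas as pd

prices = pd.Series(
    [100.0, 110.0, 121.0],
    index=pd.date_range('2024-01-01', periods=3, freq='D'),
)

# Manual versions built from shift(1)
manual_change = prices - prices.shift(1)
manual_pct = prices / prices.shift(1) - 1

# Built-in equivalents
auto_change = prices.diff()
auto_pct = prices.pct_change()

print(manual_change.equals(auto_change))
print(np.allclose(manual_pct.dropna(), auto_pct.dropna()))
```

The first row of every result is `NaN` because there is no previous value to compare against.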

In [None]:
# Shift data
ts_shift = ts_df.copy()
ts_shift['Previous_Day'] = ts_shift['Value'].shift(1)  # Shift forward (lag)
ts_shift['Next_Day'] = ts_shift['Value'].shift(-1)  # Shift backward (lead)
ts_shift['Daily_Change'] = ts_shift['Value'] - ts_shift['Previous_Day']

print("With shifted values:")
print(ts_shift.head(10))

---
## 9. Practical Examples with Real-World Scenarios

Let's apply everything we've learned to realistic data analysis scenarios.

### Example 1: Sales Analysis

Analyze sales data to identify trends, top products, and regional performance.
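
Passing a dict of lists to `.agg()` produces a two-level column index. Named aggregation is an alternative that yields flat, readable column names. The sketch below uses a hypothetical `orders` frame, and the output column names are illustrative.

```python
import pandas as pd

# Hypothetical order data
orders = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'South'],
    'Sales_Amount': [100.0, 300.0, 50.0, 150.0, 100.0],
    'Quantity': [1, 3, 1, 2, 1],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = orders.groupby('Region').agg(
    total_sales=('Sales_Amount', 'sum'),
    avg_sales=('Sales_Amount', 'mean'),
    n_orders=('Sales_Amount', 'count'),
    units=('Quantity', 'sum'),
)
print(summary)
```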

In [None]:
# Create comprehensive sales dataset
np.random.seed(42)
n_records = 200

sales_complete = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch', 'Headphones'], n_records),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'Sales_Amount': np.random.randint(100, 2000, n_records).astype(float),  # float so NaN can be assigned below
    'Quantity': np.random.randint(1, 20, n_records),
    'Customer_Segment': np.random.choice(['Individual', 'Corporate', 'Government'], n_records)
})

# Add some missing values to make it realistic
missing_indices = np.random.choice(sales_complete.index, size=10, replace=False)
sales_complete.loc[missing_indices, 'Sales_Amount'] = np.nan

print("Sales dataset:")
print(sales_complete.head())
print(f"\nShape: {sales_complete.shape}")
print(f"Missing values: {sales_complete.isnull().sum().sum()}")

In [None]:
# Data cleaning
sales_clean = sales_complete.copy()

# Fill missing sales amounts with median
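# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated
sales_clean['Sales_Amount'] = sales_clean['Sales_Amount'].fillna(sales_clean['Sales_Amount'].median())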

# Add calculated columns
sales_clean['Price_per_Unit'] = sales_clean['Sales_Amount'] / sales_clean['Quantity']
sales_clean['Month'] = sales_clean['Date'].dt.month
sales_clean['Quarter'] = sales_clean['Date'].dt.quarter

print("Cleaned dataset:")
print(sales_clean.head())
print(f"\nMissing values after cleaning: {sales_clean.isnull().sum().sum()}")

In [None]:
# Analysis 1: Top products by revenue
print("Products ranked by total sales:")
top_products = sales_clean.groupby('Product')['Sales_Amount'].sum().sort_values(ascending=False)
print(top_products)

# Analysis 2: Regional performance
print("\nSales by region:")
regional_sales = sales_clean.groupby('Region').agg({
    'Sales_Amount': ['sum', 'mean', 'count'],
    'Quantity': 'sum'
}).round(2)
print(regional_sales)

In [None]:
# Analysis 3: Monthly trend
print("Monthly sales trend:")
monthly_sales = sales_clean.groupby('Month')['Sales_Amount'].agg(['sum', 'mean']).round(2)
monthly_sales.columns = ['Total_Sales', 'Average_Sales']
print(monthly_sales)

# Analysis 4: Best performing product-region combination
print("\nTop 10 Product-Region combinations:")
product_region = sales_clean.groupby(['Product', 'Region'])['Sales_Amount'].sum().sort_values(ascending=False).head(10)
print(product_region)

In [None]:
# Analysis 5: Customer segment analysis
print("Sales by customer segment:")
segment_pivot = sales_clean.pivot_table(
    values='Sales_Amount',
    index='Product',
    columns='Customer_Segment',
    aggfunc='sum',
    fill_value=0
).round(2)
print(segment_pivot)

### Example 2: Customer Data Analysis

Analyze customer demographics and purchase behavior.
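
When segmenting a numeric column with `pd.cut`, note that bins include their right edge by default, so a value equal to an edge falls into the lower bin. The sketch below shows the difference between the default and `right=False` on a few hypothetical ages; the labels are illustrative.

```python
import pandas as pd

ages = pd.Series([18, 25, 26, 60, 61])
edges = [0, 25, 40, 60, 100]

# Default right=True: intervals are (0, 25], (25, 40], (40, 60], (60, 100]
right_closed = pd.cut(ages, bins=edges, labels=['<=25', '26-40', '41-60', '61+'])

# right=False: intervals are [0, 25), [25, 40), [40, 60), [60, 100)
left_closed = pd.cut(ages, bins=edges, labels=['<25', '25-39', '40-59', '60+'], right=False)

print(pd.DataFrame({'age': ages, 'right_closed': right_closed, 'left_closed': left_closed}))
```

Age 60 lands in `41-60` with the default and in `60+` with `right=False`, so the labels should be written to match whichever edge is included.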

In [None]:
# Create customer dataset
np.random.seed(42)

customers = pd.DataFrame({
    'CustomerID': range(1, 101),
    'Age': np.random.randint(18, 70, 100),
    'Gender': np.random.choice(['M', 'F', 'Other'], 100),
    'City': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix'], 100),
    'Total_Purchases': np.random.randint(1, 50, 100),
    'Total_Spend': np.random.randint(100, 10000, 100),
    'Join_Date': pd.date_range('2023-01-01', periods=100, freq='3D')
})

print("Customer dataset:")
print(customers.head())
print(f"\nBasic statistics:\n{customers.describe()}")

In [None]:
# Analysis 1: Age segmentation
customers['Age_Group'] = pd.cut(
    customers['Age'],
    bins=[0, 25, 40, 60, 100],
    labels=['18-25', '26-40', '41-60', '61+']  # bins include their right edge, so age 60 falls in '41-60'
)

print("Spending by age group:")
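# observed=False keeps empty age groups in the result and makes the current default explicit
age_analysis = customers.groupby('Age_Group', observed=False).agg({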
    'Total_Spend': ['mean', 'sum'],
    'Total_Purchases': 'mean',
    'CustomerID': 'count'
}).round(2)
age_analysis.columns = ['Avg_Spend', 'Total_Spend', 'Avg_Purchases', 'Customer_Count']
print(age_analysis)

In [None]:
# Analysis 2: Calculate metrics
customers['Avg_Order_Value'] = customers['Total_Spend'] / customers['Total_Purchases']
customers['Days_Since_Join'] = (pd.Timestamp('2024-07-18') - customers['Join_Date']).dt.days

print("Top 10 customers by average order value:")
top_aov = customers.nlargest(10, 'Avg_Order_Value')[['CustomerID', 'Age', 'City', 'Avg_Order_Value']]
print(top_aov)

In [None]:
# Analysis 3: Geographic analysis
print("City-wise performance:")
city_analysis = customers.groupby('City').agg({
    'CustomerID': 'count',
    'Total_Spend': 'sum',
    'Avg_Order_Value': 'mean'
}).round(2)
city_analysis.columns = ['Customer_Count', 'Total_Revenue', 'Avg_Order_Value']
city_analysis = city_analysis.sort_values('Total_Revenue', ascending=False)
print(city_analysis)

In [None]:
# Analysis 4: Customer lifetime value segments
customers['CLV_Segment'] = pd.qcut(
    customers['Total_Spend'],
    q=4,
    labels=['Low', 'Medium', 'High', 'Premium']
)

print("Customer distribution by value segment:")
print(customers['CLV_Segment'].value_counts().sort_index())

print("\nSegment characteristics:")
# observed=False makes the current default for categorical groupers explicit
segment_stats = customers.groupby('CLV_Segment', observed=False).agg({
    'Age': 'mean',
    'Total_Purchases': 'mean',
    'Total_Spend': 'mean',
    'Avg_Order_Value': 'mean'
}).round(2)
print(segment_stats)

### Example 3: Time Series Forecasting Preparation

Prepare time series data for forecasting models.
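
Lag and rolling features leave `NaN` values in the first rows, and a forecasting model should be evaluated on dates later than the ones it was trained on. The sketch below shows one way to handle both on a hypothetical 20-day frame. The 80/20 split ratio is an assumption for illustration, not a recommendation from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series
idx = pd.date_range('2024-01-01', periods=20, freq='D')
frame = pd.DataFrame({'Sales': np.arange(20, dtype=float)}, index=idx)

# Lag features introduce NaN in the first rows
frame['Sales_Lag1'] = frame['Sales'].shift(1)
frame['Sales_Lag7'] = frame['Sales'].shift(7)

# Drop the rows that lack a full set of lag values (the first 7 here)
model_data = frame.dropna()

# Chronological split: no shuffling, so the test period follows the training period
split_at = int(len(model_data) * 0.8)
train = model_data.iloc[:split_at]
test = model_data.iloc[split_at:]

print(f"Train: {train.index.min().date()} to {train.index.max().date()} ({len(train)} rows)")
print(f"Test:  {test.index.min().date()} to {test.index.max().date()} ({len(test)} rows)")
```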

In [None]:
# Create daily sales time series
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2024-06-30', freq='D')

# Simulate sales with trend and seasonality
trend = np.linspace(1000, 2000, len(dates))
seasonal = 200 * np.sin(np.arange(len(dates)) * 2 * np.pi / 365)
noise = np.random.normal(0, 100, len(dates))
sales_values = trend + seasonal + noise

ts_sales = pd.DataFrame({
    'Date': dates,
    'Sales': sales_values
}).set_index('Date')

print("Time series sales data:")
print(ts_sales.head())
print(f"\nDate range: {ts_sales.index.min()} to {ts_sales.index.max()}")
print(f"Total days: {len(ts_sales)}")

In [None]:
# Feature engineering for time series
ts_features = ts_sales.copy()

# Date features
ts_features['Year'] = ts_features.index.year
ts_features['Month'] = ts_features.index.month
ts_features['Day'] = ts_features.index.day
ts_features['DayOfWeek'] = ts_features.index.dayofweek
ts_features['Quarter'] = ts_features.index.quarter
ts_features['WeekOfYear'] = ts_features.index.isocalendar().week

# Lag features
ts_features['Sales_Lag1'] = ts_features['Sales'].shift(1)
ts_features['Sales_Lag7'] = ts_features['Sales'].shift(7)
ts_features['Sales_Lag30'] = ts_features['Sales'].shift(30)

# Rolling statistics
ts_features['Sales_Rolling7_Mean'] = ts_features['Sales'].rolling(window=7).mean()
ts_features['Sales_Rolling30_Mean'] = ts_features['Sales'].rolling(window=30).mean()
ts_features['Sales_Rolling7_Std'] = ts_features['Sales'].rolling(window=7).std()

print("With engineered features:")
print(ts_features.head(35))

In [None]:
# Monthly aggregation ('M' means month end; pandas 2.2+ prefers the alias 'ME')
monthly_agg = ts_sales.resample('M').agg({
    'Sales': ['sum', 'mean', 'min', 'max', 'std']
}).round(2)

print("Monthly aggregated data:")
print(monthly_agg.head(12))

In [None]:
# Calculate growth rates
ts_growth = ts_sales.copy()
ts_growth['Daily_Change'] = ts_growth['Sales'].diff()
ts_growth['Daily_Pct_Change'] = ts_growth['Sales'].pct_change() * 100
ts_growth['Weekly_Pct_Change'] = ts_growth['Sales'].pct_change(periods=7) * 100

print("Growth metrics:")
print(ts_growth.head(10))

print("\nSummary statistics for growth:")
print(ts_growth[['Daily_Pct_Change', 'Weekly_Pct_Change']].describe().round(2))

---
## Summary

### Key Takeaways

This guide covered the core skills for data manipulation with Pandas and NumPy:

1. **NumPy Fundamentals**
   - Creating and manipulating arrays efficiently
   - Performing vectorized operations for speed
   - Using universal functions and statistical operations
   - Boolean indexing and array reshaping

2. **Pandas Data Structures**
   - Working with Series (1D labeled arrays)
   - Creating and manipulating DataFrames (2D labeled tables)
   - Understanding the power of labeled data

3. **Data Loading and Exploration**
   - Reading data from various formats (CSV, Excel, JSON)
   - Exploring data with `.head()`, `.info()`, `.describe()`
   - Identifying data characteristics and quality issues

4. **Data Selection Techniques**
   - Label-based selection with `.loc[]`
   - Position-based selection with `.iloc[]`
   - Boolean indexing for filtering data
   - Using the `.query()` method for readable filters

5. **Data Cleaning**
   - Detecting and handling missing values (dropna, fillna)
   - Identifying and removing duplicates
   - Converting data types appropriately
   - Cleaning and standardizing text data

6. **Data Transformation**
   - Sorting data by single or multiple columns
   - Grouping data with `.groupby()` and aggregating
   - Creating pivot tables for summarization
   - Adding calculated columns and categorizing data

7. **Combining Datasets**
   - Concatenating DataFrames vertically and horizontally
   - Merging data with different join types (inner, left, right, outer)
   - Understanding when to use merge vs. join

8. **Time Series Analysis**
   - Working with datetime objects and date ranges
   - Extracting date components (year, month, day)
   - Resampling time series to different frequencies
   - Calculating rolling statistics and shifts

9. **Real-World Applications**
   - Sales analysis: identifying trends and top performers
   - Customer analytics: segmentation and behavior analysis
   - Time series preparation: feature engineering for forecasting

### Next Steps

To continue your data manipulation journey:

- **Practice**: Work with real datasets from sources like Kaggle, UCI ML Repository, or government open data portals
- **Visualization**: Learn data visualization libraries (Matplotlib, Seaborn, Plotly) to visualize your findings
- **Advanced Pandas**: Explore multi-indexing, categorical data, and performance optimization
- **Statistical Analysis**: Study statistical methods and hypothesis testing
- **Machine Learning**: Apply these skills to prepare data for machine learning models

### Resources

- Official documentation: [Pandas](https://pandas.pydata.org/docs/) and [NumPy](https://numpy.org/doc/)
- Practice datasets: [Kaggle](https://www.kaggle.com/datasets), [UCI ML Repository](https://archive.ics.uci.edu/ml/index.php)
- Community: Stack Overflow, Reddit (r/datascience, r/learnpython)

Remember: Data manipulation is a skill that improves with practice. Start with simple datasets and gradually work towards more complex analyses!