# Merging and Joining Data

Combine multiple DataFrames with `pd.concat`, `pd.merge`, and `DataFrame.join`.

## Key Concepts
- **Concatenation:** Stacking data (vertical/horizontal)
- **Merging:** Database-style joins (SQL-like)
- **Joining:** Index-based combining
- **Keys:** Columns (or index labels) whose values are matched between datasets

In [None]:
import pandas as pd
import numpy as np

# Display options
pd.set_option('display.max_rows', 10)

## 1. Concatenation (pd.concat)
`pd.concat` stacks DataFrames on top of each other (`axis=0`, the default) 
or side by side (`axis=1`).

In [None]:
# Create dummy data
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']
}, index=[4, 5, 6, 7])

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [None]:
# Vertical Stack (Axis 0)
result = pd.concat([df1, df2])
print("Concatenated (Vertical):")
print(result)

In [None]:
# Horizontal Stack (Axis 1)
df3 = pd.DataFrame({
    'E': ['E0', 'E1', 'E2', 'E3'],
    'F': ['F0', 'F1', 'F2', 'F3']
})

# Note: concat aligns rows by index label; a label missing from one frame yields NaN there
result_h = pd.concat([df1, df3], axis=1)
print("Concatenated (Horizontal):")
print(result_h)
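
The alignment behaviour of horizontal concatenation is easier to see when the indexes only partly overlap. The sketch below uses two small frames, `left` and `right`, which are hypothetical and not part of the notebook's data. It compares the default `join='outer'` with `join='inner'`.

```python
import pandas as pd

# Hypothetical frames whose indexes overlap only on labels 1 and 2
left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
right = pd.DataFrame({'E': ['E1', 'E2', 'E3']}, index=[1, 2, 3])

# Default join='outer': keep the union of index labels, fill gaps with NaN
outer_h = pd.concat([left, right], axis=1)

# join='inner': keep only the labels present in both frames
inner_h = pd.concat([left, right], axis=1, join='inner')

print("Outer (union of labels):")
print(outer_h)
print("\nInner (shared labels only):")
print(inner_h)
```

If the row order matters rather than the labels, resetting the index on both frames before concatenating is one way to force positional alignment.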

## 2. Merging (pd.merge)
SQL-style joins using common columns.

In [None]:
# Employees dataset
employees = pd.DataFrame({
    'emp_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': ['D1', 'D2', 'D1', 'D3']
})

# Departments dataset
departments = pd.DataFrame({
    'dept_id': ['D1', 'D2', 'D4'],
    'dept_name': ['HR', 'IT', 'Marketing']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

In [None]:
# Inner Join (Intersection)
# Only matching records from both
inner = pd.merge(
    employees, 
    departments, 
    on='dept_id', 
    how='inner'
)
print("Inner Join (Only matched):")
print(inner)

In [None]:
# Left Join
# All employees, match dept if exists
left = pd.merge(
    employees, 
    departments, 
    on='dept_id', 
    how='left'
)
print("Left Join (All employees):")
print(left)
print("\nNote: David has NaN because D3 isn't in Departments")

In [None]:
# Right Join
# All departments, match emp if exists
right = pd.merge(
    employees, 
    departments, 
    on='dept_id', 
    how='right'
)
print("Right Join (All departments):")
print(right)
print("\nNote: Marketing has NaN because no employee is in D4")

In [None]:
# Outer Join (Union)
# Everything from both
outer = pd.merge(
    employees, 
    departments, 
    on='dept_id', 
    how='outer'
)
print("Outer Join (Everything):")
print(outer)
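
When auditing an outer join it can help to know which side each row came from. The sketch below rebuilds the same `employees` and `departments` frames so it runs on its own, then passes `indicator=True` to `pd.merge`, which adds a `_merge` column. The names `audit` and `unmatched` are illustrative only.

```python
import pandas as pd

# Same data as the employees/departments cells above
employees = pd.DataFrame({
    'emp_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': ['D1', 'D2', 'D1', 'D3']
})
departments = pd.DataFrame({
    'dept_id': ['D1', 'D2', 'D4'],
    'dept_name': ['HR', 'IT', 'Marketing']
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
audit = pd.merge(
    employees,
    departments,
    on='dept_id',
    how='outer',
    indicator=True
)
print(audit)

# Rows that found no partner on the other side
unmatched = audit[audit['_merge'] != 'both']
print("\nUnmatched rows:")
print(unmatched)
```

Here the unmatched rows should be David (department D3 has no entry in `departments`) and Marketing (no employee belongs to D4).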

## 3. Merging on Multiple Columns

In [None]:
df_left = pd.DataFrame({
    'key1': ['K0', 'K0', 'K1', 'K2'],
    'key2': ['K0', 'K1', 'K0', 'K1'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})

df_right = pd.DataFrame({
    'key1': ['K0', 'K1', 'K1', 'K2'],
    'key2': ['K0', 'K0', 'K0', 'K0'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})

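# how defaults to 'inner': a row is kept only if BOTH keys match.
# ('K1', 'K0') appears twice in df_right, so that left row is repeated.
result = pd.merge(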
    df_left, 
    df_right, 
    on=['key1', 'key2']
)
print("Merge on multiple keys:")
print(result)

## 4. Joining (Index-based)
Combine using indices instead of columns.

In [None]:
left_idx = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2']},
    index=['K0', 'K1', 'K2']
)

right_idx = pd.DataFrame(
    {'B': ['B0', 'B2', 'B3']},
    index=['K0', 'K2', 'K3']
)

result = left_idx.join(right_idx, how='outer')
print("Index Join:")
print(result)
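
An index join can also be written with `pd.merge` by setting `left_index=True` and `right_index=True`. The sketch below rebuilds the two frames from the cell above and compares both spellings. One difference to keep in mind: `DataFrame.join` defaults to `how='left'`, while `pd.merge` defaults to `how='inner'`, so `how` is passed explicitly to both.

```python
import pandas as pd

# Same data as the index-join cell above
left_idx = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2']},
    index=['K0', 'K1', 'K2']
)
right_idx = pd.DataFrame(
    {'B': ['B0', 'B2', 'B3']},
    index=['K0', 'K2', 'K3']
)

# Index join via DataFrame.join
via_join = left_idx.join(right_idx, how='outer')

# The same operation spelled with pd.merge
via_merge = pd.merge(
    left_idx,
    right_idx,
    left_index=True,
    right_index=True,
    how='outer'
)

print(via_join)
print("\nSame result from merge:", via_join.sort_index().equals(via_merge.sort_index()))
```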

## 5. Handling Duplicates in Merge
Use `validate` to assert the expected key relationship (one-to-one, one-to-many, many-to-one); pandas raises `MergeError` if the keys violate it.

In [None]:
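try:
    # validate='one_to_one' requires the merge keys to be unique in BOTH frames.
    # df_right repeats ('K1', 'K0'), so pandas raises MergeError.
    pd.merge(
        df_left,
        df_right,
        on=['key1', 'key2'],
        validate='one_to_one'
    )
except pd.errors.MergeError as e:
    print(f"Merge Failed: {e}")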
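
To see why the one-to-one check fails, it can help to inspect the keys directly. The sketch below rebuilds `df_left` and `df_right` so it runs on its own, uses `DataFrame.duplicated` to find repeated key pairs, and then retries the merge with `validate='one_to_many'`, which only requires unique keys on the left side. The variable names `keys` and `checked` are illustrative.

```python
import pandas as pd

# Same data as the multi-key merge cell above
df_left = pd.DataFrame({
    'key1': ['K0', 'K0', 'K1', 'K2'],
    'key2': ['K0', 'K1', 'K0', 'K1'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})
df_right = pd.DataFrame({
    'key1': ['K0', 'K1', 'K1', 'K2'],
    'key2': ['K0', 'K0', 'K0', 'K0'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})

keys = ['key1', 'key2']

# Check each side for repeated key pairs before merging
print("Left has duplicate keys: ", df_left.duplicated(subset=keys).any())
print("Right has duplicate keys:", df_right.duplicated(subset=keys).any())

# one_to_many: keys must be unique on the left, may repeat on the right
checked = pd.merge(df_left, df_right, on=keys, validate='one_to_many')
print(checked)
```

Relaxing the check is only appropriate when repeated keys on the right are expected. If they are not, deduplicating `df_right` first is the safer fix.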

## Practice Exercises

### Exercise 1
Merge two datasets with different column names 
(e.g., 'id' vs 'customer_id') using `left_on` and `right_on`.

In [None]:
# Your code here


### Exercise 2
Combine three DataFrames using `concat` and ignore index.

In [None]:
# Your code here


## Key Takeaways

✅ **concat** - Simple stacking  
✅ **merge** - Powerful SQL-style joins  
✅ **join** - Index-based combining  
✅ **Types** - Inner (shared), Left (all left), Right (all right), Outer (all)  
✅ **Keys** - Match on single or multiple columns  

**Next:** [Time Series Analysis](05_time_series.ipynb) →