# Chapter 23: Data Wrangling — Join, Combine, Reshape

Merging datasets, concatenation, reshaping, and pivoting



### Why Data Wrangling? (Slide 78)


<p>Real-world data lives in <strong>multiple tables</strong>. You need to combine, reshape, and reorganize it before analysis.</p>
<p><strong>Core Operations:</strong></p>
<ul>
<li><strong>Merge / Join</strong> — combine datasets by matching keys (like SQL JOIN)</li>
<li><strong>Concatenate</strong> — stack datasets vertically or horizontally</li>
<li><strong>Reshape</strong> — pivot between long and wide formats</li>
</ul>
<p><strong>pandas Tools:</strong></p>
<ul>
<li><code>pd.merge()</code> — database-style joins on columns or indexes</li>
<li><code>pd.concat()</code> — glue DataFrames together along an axis</li>
<li><code>.stack()</code> / <code>.unstack()</code> — pivot between row and column levels</li>
<li><code>.pivot_table()</code> — spreadsheet-style pivot with aggregation</li>
<li><code>pd.melt()</code> — unpivot wide format to long format</li>
</ul>


> **Note:** Merging is the pandas equivalent of SQL JOIN


### pd.merge: Inner Join (Slide 79)


In [1]:
# pd.merge(left, right, on='key') — merge two DataFrames on a column
# how='inner'  — keep only matching rows (default)
# how='left'   — keep all left rows, NaN for no match
# how='right'  — keep all right rows
# how='outer'  — keep all rows from both

import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# Inner join (default) — only matching keys
print(pd.merge(df1, df2, on='key'))
# 'c' from df1 dropped (not in df2)
# 'd' from df2 dropped (not in df1)

# Different column names? Use left_on / right_on
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c'],
                    'data1': range(4)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
print(pd.merge(df3, df4, left_on='lkey', right_on='rkey'))


  key  data1  data2
0   b      0      1
1   b      1      1
2   a      2      0
3   a      4      0
4   a      5      0
5   b      6      1
  lkey  data1 rkey  data2
0    b      0    b      1
1    b      1    b      1
2    a      2    a      0


> **Note:** Inner join = intersection of keys (only matching rows)
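The "intersection of keys" idea can be checked directly. The sketch below reuses `df1` and `df2` from the cell above; `shared_keys` and `merged` are illustrative names, not part of the original slides.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# Keys present in both tables (plain Python set intersection)
shared_keys = set(df1['key']) & set(df2['key'])

merged = pd.merge(df1, df2, on='key')  # how='inner' is the default

# Every key that survives the inner join is in the intersection
print(sorted(shared_keys))             # ['a', 'b']
print(sorted(merged['key'].unique()))  # ['a', 'b']
```

Note that the join keeps all six matching rows of `df1`, not one row per key: the intersection applies to key values, not to row counts.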


### pd.merge: Outer, Left, Right Joins (Slide 80)


In [2]:
# how='outer' — ALL keys from BOTH tables (NaN where no match)
# how='left'  — ALL keys from LEFT table
# how='right' — ALL keys from RIGHT table
# suffixes=('_left', '_right') — rename overlapping columns

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'val2': [4, 5, 6]})

# Outer join — union of all keys
print(pd.merge(df1, df2, on='key', how='outer'))
#   key  val1  val2
#    a   1.0   NaN
#    b   2.0   4.0
#    c   3.0   5.0
#    d   NaN   6.0

# Left join — all rows from df1
print(pd.merge(df1, df2, on='key', how='left'))

# Right join — all rows from df2
print(pd.merge(df1, df2, on='key', how='right'))

# Multiple keys
left = pd.DataFrame({'key1': ['a', 'a', 'b'], 'key2': ['one', 'two', 'one'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['a', 'b', 'b'], 'key2': ['one', 'one', 'two'], 'rval': [4, 5, 6]})
print(pd.merge(left, right, on=['key1', 'key2'], how='outer'))

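# suffixes only affect overlapping non-key columns; val1 and val2 do not
# overlap, so this inner join is printed without any suffix
print(pd.merge(df1, df2, on='key', suffixes=('_L', '_R')))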


  key  val1  val2
0   a   1.0   NaN
1   b   2.0   4.0
2   c   3.0   5.0
3   d   NaN   6.0
  key  val1  val2
0   a     1   NaN
1   b     2   4.0
2   c     3   5.0
  key  val1  val2
0   b   2.0     4
1   c   3.0     5
2   d   NaN     6
  key1 key2  lval  rval
0    a  one   1.0   4.0
1    a  two   2.0   NaN
2    b  one   3.0   5.0
3    b  two   NaN   6.0
  key  val1  val2
0   b     2     4
1   c     3     5


> **Note:** Left join is the usual choice for enriching a table: it keeps every row of the left table and adds matching data where it exists
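In the cell above, `val1` and `val2` do not overlap, so `suffixes` has nothing to rename. A minimal sketch of a case where it does matter follows, using two hypothetical tables that both carry a `val` column. It also passes `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical tables: both carry a non-key column named 'val'
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'val': [4, 5, 6]})

result = pd.merge(left, right, on='key', how='left',
                  suffixes=('_L', '_R'),  # disambiguate the two 'val' columns
                  indicator=True)         # adds a '_merge' column

print(list(result.columns))                   # ['key', 'val_L', 'val_R', '_merge']
print(result['_merge'].astype(str).tolist())  # ['left_only', 'both', 'both']
```

Filtering on `_merge == 'left_only'` is a quick way to list the left rows that found no match.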


### Merging on Index (Slide 81)


In [3]:
# left_index=True  — use left DataFrame's index as join key
# right_index=True — use right DataFrame's index as join key
# df.join(other)   — shorthand for merging on index

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                     'value': range(6)})
right = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Merge left column with right index
print(pd.merge(left, right, left_on='key', right_index=True))

# Both indexes
left2 = pd.DataFrame({'val': [1, 2, 3]}, index=['a', 'b', 'c'])
right2 = pd.DataFrame({'val2': [4, 5, 6]}, index=['b', 'c', 'd'])
print(pd.merge(left2, right2, left_index=True, right_index=True,
               how='outer'))

# .join() shorthand (merges on index by default)
print(left2.join(right2, how='outer'))

# Join multiple DataFrames at once
other1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
other2 = pd.DataFrame({'y': [3, 4]}, index=['a', 'c'])
print(left2.join([other1, other2], how='outer'))


  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
   val  val2
a  1.0   NaN
b  2.0   4.0
c  3.0   5.0
d  NaN   6.0
   val  val2
a  1.0   NaN
b  2.0   4.0
c  3.0   5.0
d  NaN   6.0
   val    x    y
a    1  1.0  3.0
b    2  2.0  NaN
c    3  NaN  4.0


> **Note:** .join() is a convenient shorthand for index-based merges
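`.join()` can also match a column of the calling DataFrame against the other frame's index through its `on` parameter. The sketch below reuses `left` and `right` from the cell above; `joined` is an illustrative name. Unlike the `pd.merge` call in that cell, `.join()` defaults to `how='left'`, so the unmatched `'c'` row is kept.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                     'value': range(6)})
right = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Match left['key'] against right's index; .join() defaults to how='left'
joined = left.join(right, on='key')
print(joined)  # the 'c' row is kept, with NaN in group_val
```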


### pd.concat: Stacking DataFrames (Slide 82)


In [4]:
# pd.concat([df1, df2, ...])          — stack vertically (default)
# pd.concat([df1, df2], axis=1)       — stack horizontally
# ignore_index=True                   — reset index 0, 1, 2, ...
# keys=['a', 'b']                     — add hierarchical index
# join='inner'                        — only keep shared columns

import pandas as pd
import numpy as np

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# Vertical concat (default)
print(pd.concat([s1, s2, s3]))  # a=0, b=1, c=2, ..., g=6

# Horizontal concat
print(pd.concat([s1, s2, s3], axis=1))  # NaN where missing

# Add keys to identify source
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
print(result)  # Hierarchical index

# DataFrames — reset index after concat
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
print(pd.concat([df1, df2], ignore_index=True))


a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
   A  B
0  1  3
1  2  4
2  5  7
3  6  8


> **Note:** Use ignore_index=True when original indexes are meaningless
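The `join='inner'` option listed in the cell's header comments is not exercised above. A small sketch with two hypothetical frames that share only column `B` shows the difference from the default `join='outer'`:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

# join='outer' is the default: keep the union of columns
outer = pd.concat([df1, df2], ignore_index=True)
# join='inner': keep only the columns present in every frame
inner = pd.concat([df1, df2], ignore_index=True, join='inner')

print(list(outer.columns))  # ['A', 'B', 'C'], with NaN where a frame lacks the column
print(list(inner.columns))  # ['B']
```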


### combine_first: Patching Missing Data (Slide 83)


In [5]:
# .combine_first(other) — fill NaN values in self with values from other
# Think of it as: 'patch holes in my data with this backup'
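# np.where(pd.isnull(a), b, a) is the same idea, but it pairs values by
# position: it only matches combine_first when a and b share the same index order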

import pandas as pd
import numpy as np

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=['a', 'b', 'c', 'd', 'e', 'f'])

# Fill NaN in 'a' with values from 'b'
print(a.combine_first(b))
# a    0.0   ← was NaN in a, filled from b
# b    4.5   ← kept from a
# c    3.5   ← kept from a
# d    NaN   ← NaN in both a and b
# e    2.5   ← kept from a
# f    5.0   ← was NaN in a, filled from b

# Works on DataFrames too
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.]})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3.],
                    'b': [np.nan, 3., 4., 5.]})
print(df1.combine_first(df2))


a    0.0
b    4.5
c    3.5
d    NaN
e    2.5
f    5.0
dtype: float64
     a    b
0  1.0  NaN
1  4.0  2.0
2  5.0  4.0
3  3.0  6.0


> **Note:** combine_first = 'use my data, patch holes from the other'
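The difference between `combine_first` and a raw `np.where` is label alignment. The sketch below reuses `a` and `b` from the cell above, whose indexes are in opposite order; `aligned`, `manual`, and `positional` are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=['a', 'b', 'c', 'd', 'e', 'f'])

# combine_first aligns on index labels before patching
aligned = a.combine_first(b).sort_index()

# np.where works by position, so b must be reindexed to a's labels first
b_on_a = b.reindex(a.index)
manual = pd.Series(np.where(pd.isnull(a), b_on_a, a), index=a.index).sort_index()

# Without the reindex, values are paired by position and labels are ignored
positional = pd.Series(np.where(pd.isnull(a), b, a), index=a.index).sort_index()

print(manual.equals(aligned))      # True
print(positional.equals(aligned))  # False
```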


### Reshaping: stack & unstack (Slide 84)


In [6]:
# .stack()    — pivot COLUMNS → ROWS (wide → long)
# .unstack()  — pivot ROWS → COLUMNS (long → wide)
# .unstack(level) — specify which level to unstack
# Both return a new object; the original is left unchanged

import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=['Ohio', 'Colorado'],
                    columns=['one', 'two', 'three'])

# stack: columns become inner row index
stacked = data.stack()
print(stacked)
# Ohio      one      0
#           two      1
#           three    2
# Colorado  one      3
#           two      4
#           three    5

# unstack: inner row index → columns
print(stacked.unstack())     # Back to original
print(stacked.unstack(0))    # Unstack level 0 (the outer level) instead

# Unstacking with missing data → NaN fills the gaps
s = pd.Series([1, 2, 3, 4], index=[['a', 'a', 'b', 'b'],
                                    ['one', 'two', 'three', 'four']])
print(s.unstack())


Ohio      one      0
          two      1
          three    2
Colorado  one      3
          two      4
          three    5
dtype: int64
          one  two  three
Ohio        0    1      2
Colorado    3    4      5
       Ohio  Colorado
one       0         3
two       1         4
three     2         5
   four  one  three  two
a   NaN  1.0    NaN  2.0
b   4.0  NaN    3.0  NaN


> **Note:** stack = wide→long, unstack = long→wide
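Two properties from the cell above can be checked in a few lines: stacking and then unstacking recovers the original values, and `unstack` accepts a `fill_value` for the gaps it would otherwise mark as NaN. The names `roundtrip` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=['Ohio', 'Colorado'],
                    columns=['one', 'two', 'three'])

# stack followed by unstack recovers the original values
roundtrip = data.stack().unstack()
print((roundtrip.values == data.values).all())  # True

# fill_value replaces the gaps that unstack would otherwise mark as NaN
s = pd.Series([1, 2, 3, 4], index=[['a', 'a', 'b', 'b'],
                                    ['one', 'two', 'three', 'four']])
filled = s.unstack(fill_value=0)
print(filled)  # the gap at ('a', 'four') now holds the fill value
```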


### Pivoting: Long to Wide (Slide 85)


In [7]:
# df.pivot(index=..., columns=..., values=...) reshapes long → wide (pass arguments by keyword)
# ⚠️ Raises ValueError if duplicate (index, columns) pairs exist!
# Use pivot_table() for duplicates (it aggregates)

import pandas as pd

data = pd.DataFrame({'date': ['2024-01-01'] * 3 + ['2024-01-02'] * 3,
                     'item': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'value': [1, 2, 3, 4, 5, 6]})

# Long format (tidy)
print(data)
#         date item  value
# 0 2024-01-01    A      1
# 1 2024-01-01    B      2
# ...

# Pivot to wide format
wide = data.pivot(index='date', columns='item', values='value')
print(wide)
# item         A  B  C
# date
# 2024-01-01   1  2  3
# 2024-01-02   4  5  6

# Reverse: wide → long with pd.melt()
long = pd.melt(wide.reset_index(), id_vars=['date'],
               value_vars=['A', 'B', 'C'])
print(long)


         date item  value
0  2024-01-01    A      1
1  2024-01-01    B      2
2  2024-01-01    C      3
3  2024-01-02    A      4
4  2024-01-02    B      5
5  2024-01-02    C      6
item        A  B  C
date               
2024-01-01  1  2  3
2024-01-02  4  5  6
         date item  value
0  2024-01-01    A      1
1  2024-01-02    A      4
2  2024-01-01    B      2
3  2024-01-02    B      5
4  2024-01-01    C      3
5  2024-01-02    C      6


> **Note:** pivot for unique keys; pivot_table when you need aggregation
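The failure mode mentioned in the cell's header comments can be reproduced with a tiny hypothetical frame in which one `(date, item)` pair appears twice. The sketch catches the error from `pivot` and then lets `pivot_table` aggregate the duplicate; `dup`, `pivot_failed`, and `summed` are illustrative names.

```python
import pandas as pd

# Hypothetical data: ('2024-01-01', 'A') appears twice
dup = pd.DataFrame({'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
                    'item': ['A', 'A', 'A'],
                    'value': [1, 10, 4]})

try:
    dup.pivot(index='date', columns='item', values='value')
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print('pivot failed:', exc)

# pivot_table resolves the duplicate by aggregating it
summed = dup.pivot_table(index='date', columns='item', values='value',
                         aggfunc='sum')
print(summed)
```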


### pivot_table: Aggregated Pivoting (Slide 86)


In [8]:
# df.pivot_table(values, index, columns, aggfunc)
# aggfunc='mean'    — average (default)
# aggfunc='sum'     — total
# aggfunc='count'   — count
# aggfunc=['mean', 'sum'] — multiple aggregations
# margins=True      adds an 'All' row/column computed with aggfunc (totals only for 'sum')
# fill_value=0      — replace NaN with 0

import pandas as pd
import numpy as np

# C and D are unseeded random draws, so the values printed below change on every run
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                   'C': np.random.randn(6),
                   'D': np.random.randn(6)})

# Average of C grouped by A and B
print(df.pivot_table(values='C', index='A', columns='B',
                     aggfunc='mean'))

# Sum with row/column totals
print(df.pivot_table(values=['C', 'D'], index='A',
                     aggfunc='sum', margins=True))

# Multiple aggregation functions
print(df.pivot_table(values='C', index='A', columns='B',
                     aggfunc=['mean', 'count'],
                     fill_value=0))


B         one       two
A                      
bar  1.760446  1.240286
foo  1.089991 -0.076091
            C         D
A                      
bar  4.761178  0.812009
foo  2.103892  1.257166
All  6.865070  2.069175
         mean           count    
B         one       two   one two
A                                
bar  1.760446  1.240286     2   1
foo  1.089991 -0.076091     2   1


> **Note:** pivot_table is like Excel's PivotTable feature
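Because the cell above uses random data, its numbers cannot be reproduced. A sketch with fixed, hypothetical values for `C` makes the behaviour of `margins=True` easier to verify: the `All` entries are computed with the same `aggfunc`, so they are totals for `'sum'` but means for `'mean'`.

```python
import pandas as pd

# Fixed values instead of random ones, so the result is reproducible
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                   'C': [1, 2, 3, 4, 5, 6]})

totals = df.pivot_table(values='C', index='A', columns='B',
                        aggfunc='sum', margins=True)
means = df.pivot_table(values='C', index='A', columns='B',
                       aggfunc='mean', margins=True)

print(totals)  # 'All' holds sums here
print(means)   # 'All' holds means here, not totals
```

For example, the `bar` row of `means` has `All` equal to the mean of its three underlying values (4, 5, 6), not the mean of the two cell means.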


### pd.melt: Wide to Long (Slide 87)


In [9]:
# pd.melt(df, id_vars, value_vars) — unpivot wide → long
# id_vars=['col']    — columns to KEEP as identifiers
# value_vars=['a','b'] — columns to UNPIVOT into rows
# var_name='name'    — name for the 'variable' column
# value_name='val'   — name for the 'value' column

import pandas as pd

# Wide format (one column per measurement)
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'math': [90, 80, 70],
                   'science': [85, 95, 75],
                   'english': [88, 78, 92]})
print(df)

# Melt to long format (one row per measurement)
long = pd.melt(df, id_vars=['name'],
               value_vars=['math', 'science', 'english'],
               var_name='subject', value_name='score')
print(long)
#      name  subject  score
# 0   Alice     math     90
# 1     Bob     math     80
# 2 Charlie     math     70
# 3   Alice  science     85
# ...

# Melt is the inverse of pivot!


      name  math  science  english
0    Alice    90       85       88
1      Bob    80       95       78
2  Charlie    70       75       92
      name  subject  score
0    Alice     math     90
1      Bob     math     80
2  Charlie     math     70
3    Alice  science     85
4      Bob  science     95
5  Charlie  science     75
6    Alice  english     88
7      Bob  english     78
8  Charlie  english     92


> **Note:** melt = unpivot. Converts wide tables into tidy long format
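The claim that melt is the inverse of pivot can be checked with a round trip. The sketch below reuses the `df` from the cell above and omits `value_vars`, in which case every column not listed in `id_vars` is unpivoted; `wide_again` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'math': [90, 80, 70],
                   'science': [85, 95, 75],
                   'english': [88, 78, 92]})

# value_vars omitted: every column not in id_vars is unpivoted
long = pd.melt(df, id_vars=['name'], var_name='subject', value_name='score')

# pivot undoes the melt; the column order may differ from the original
wide_again = long.pivot(index='name', columns='subject', values='score')
print(wide_again)
```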


### pd.crosstab: Frequency Tables (Slide 88)


In [10]:
# pd.crosstab(row_data, col_data) — frequency table
# Similar to pivot_table with aggfunc='count'
# normalize=True      : proportions of the grand total (same as 'all')
# normalize='index'   : each row sums to 1
# normalize='columns' : each column sums to 1
# margins=True       — add row/column totals

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'gender': np.random.choice(['M', 'F'], 100),
                   'handedness': np.random.choice(['R', 'L'], 100),
                   'city': np.random.choice(['NY', 'LA', 'CHI'], 100)})

# Basic frequency table
print(pd.crosstab(df.gender, df.handedness))
# handedness   L   R
# gender
# F           25  31
# M           19  25

# With margins (totals)
print(pd.crosstab(df.gender, df.handedness, margins=True))

# Proportions
print(pd.crosstab(df.gender, df.handedness, normalize=True))

# Multiple variables
print(pd.crosstab([df.gender, df.city], df.handedness))


handedness   L   R
gender            
F           25  31
M           19  25
handedness   L   R  All
gender                 
F           25  31   56
M           19  25   44
All         44  56  100
handedness     L     R
gender                
F           0.25  0.31
M           0.19  0.25
handedness    L   R
gender city        
F      CHI    8   9
       LA    11  12
       NY     6  10
M      CHI    9   7
       LA     2   6
       NY     8  12


> **Note:** crosstab is great for quick categorical data exploration
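To see what `crosstab` computes, it can be compared with a hand-built count. The sketch below uses a small hypothetical sample with fixed values rather than the random data above; `counts`, `by_hand`, and `row_share` are illustrative names.

```python
import pandas as pd

# Small hypothetical sample with fixed values
df = pd.DataFrame({'gender': ['F', 'F', 'F', 'M', 'M', 'M', 'M', 'F'],
                   'handedness': ['R', 'L', 'R', 'R', 'R', 'L', 'R', 'R']})

counts = pd.crosstab(df['gender'], df['handedness'])

# The same table built by hand: count rows per (gender, handedness) pair
by_hand = df.groupby(['gender', 'handedness']).size().unstack(fill_value=0)

# normalize='index' turns each row into proportions that sum to 1
row_share = pd.crosstab(df['gender'], df['handedness'], normalize='index')

print(counts)
print(row_share)
```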


### GroupBy Basics (Slide 89)


In [11]:
# df.groupby('col')        — group rows by column values
# .agg(func)               — apply aggregation function
# .agg(['mean', 'sum'])    — multiple aggregations
# .agg({'col1': 'sum', 'col2': 'mean'}) — per-column aggregation
# .transform(func)         — apply func, return same-shape result
# .filter(func)            — filter groups based on condition

import pandas as pd
import numpy as np

# data1 and data2 are unseeded random draws, so the values printed below change on every run
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# Group by key1, compute mean
print(df.groupby('key1').mean(numeric_only=True))

# Group by multiple keys
print(df.groupby(['key1', 'key2']).mean(numeric_only=True))

# Multiple aggregations
print(df.groupby('key1')['data1'].agg(['mean', 'std', 'count']))

# Per-column aggregation
print(df.groupby('key1').agg({'data1': 'sum', 'data2': 'mean'}))

# Iterate over groups
for name, group in df.groupby('key1'):
    print(f'Group: {name}, Shape: {group.shape}')


         data1     data2
key1                    
a     1.076427 -0.282492
b    -0.178197 -0.147071
              data1     data2
key1 key2                    
a    one   1.177846 -0.691905
     two   0.873589  0.536336
b    one   0.009144 -0.914691
     two  -0.365539  0.620548
          mean       std  count
key1                           
a     1.076427  0.557175      3
b    -0.178197  0.264941      2
         data1     data2
key1                    
a     3.229281 -0.282492
b    -0.356395 -0.147071
Group: a, Shape: (3, 4)
Group: b, Shape: (2, 4)


> **Note:** GroupBy = split-apply-combine pattern
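The per-column aggregation above can also be written with named aggregation, which lets each output column be given its own name. The sketch uses fixed, hypothetical values so the result can be verified by hand; the output column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

# Named aggregation: output_column=(input_column, function)
summary = df.groupby('key1').agg(
    data1_total=('data1', 'sum'),
    data2_mean=('data2', 'mean'),
    n_rows=('data1', 'count'),
)
print(summary)
```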


### GroupBy: Transform & Filter (Slide 90)


In [12]:
# .transform(func) — apply func to each group, broadcast back to original shape
# .filter(func)    — keep/discard entire groups based on a condition
# .apply(func)     — flexible: apply any function to each group

import pandas as pd
import numpy as np

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1, 2, 3, 4, 5, 100]})

# Transform: normalize within each group (z-score)
df['normalized'] = df.groupby('key')['value'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df)

# Filter: keep only groups with mean > 3
print(df.groupby('key').filter(lambda x: x['value'].mean() > 3))

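# Apply: custom function per group
def top_n(group, n=1):
    return group.nlargest(n, 'value')

# Select the data columns explicitly so the grouping column 'key' is not
# passed to top_n, whatever the installed pandas version does by default
print(df.groupby('key')[['value', 'normalized']].apply(top_n))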

# Fill NaN with group mean
df2 = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'val': [1, np.nan, np.nan, 4]})
df2['val'] = df2.groupby('key')['val'].transform(
    lambda x: x.fillna(x.mean()))  # val is now [1.0, 1.0, 4.0, 4.0]


  key  value  normalized
0   a      1   -0.707107
1   a      2    0.707107
2   b      3   -0.707107
3   b      4    0.707107
4   c      5   -0.707107
5   c    100    0.707107
  key  value  normalized
2   b      3   -0.707107
3   b      4    0.707107
4   c      5   -0.707107
5   c    100    0.707107
       value  normalized
key                     
a   1      2    0.707107
b   3      4    0.707107
c   5    100    0.707107


> **Note:** transform keeps original shape — great for normalization
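The shape difference between `.agg()` and `.transform()` is easiest to see side by side. The sketch below reuses the `key`/`value` data from the cell above; `per_group`, `per_row`, and the `centered` column are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1, 2, 3, 4, 5, 100]})

grouped = df.groupby('key')['value']

per_group = grouped.agg('mean')      # one value per group
per_row = grouped.transform('mean')  # one value per original row

print(len(per_group), len(per_row))  # 3 6

# Because per_row lines up with df, it can be used directly in arithmetic
df['centered'] = df['value'] - per_row
print(df)
```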


### Wrangling Quick Reference (Slide 91)


<p><strong>Combining Data:</strong></p>
<table>
<tr><th>Tool</th><th>When to Use</th></tr>
<tr><td><code>pd.merge()</code></td><td>Join on matching column values (SQL-style)</td></tr>
<tr><td><code>.join()</code></td><td>Join on index (shorthand for merge)</td></tr>
<tr><td><code>pd.concat()</code></td><td>Stack DataFrames vertically or horizontally</td></tr>
<tr><td><code>.combine_first()</code></td><td>Patch NaN holes from another source</td></tr>
</table>
<p><strong>Reshaping Data:</strong></p>
<table>
<tr><th>Tool</th><th>Direction</th></tr>
<tr><td><code>.pivot()</code> / <code>.pivot_table()</code></td><td>Long → Wide</td></tr>
<tr><td><code>pd.melt()</code></td><td>Wide → Long</td></tr>
<tr><td><code>.stack()</code></td><td>Columns → Rows</td></tr>
<tr><td><code>.unstack()</code></td><td>Rows → Columns</td></tr>
<tr><td><code>pd.crosstab()</code></td><td>Frequency tables</td></tr>
</table>
<p><strong>GroupBy Pattern:</strong> <code>split → apply → combine</code></p>
<ul>
<li><code>.agg()</code> — reduce each group to one row</li>
<li><code>.transform()</code> — return same-sized result</li>
<li><code>.filter()</code> — keep/discard entire groups</li>
<li><code>.apply()</code> — anything goes</li>
</ul>


> **Note:** merge = SQL JOIN, concat = SQL UNION ALL (duplicate rows are kept), pivot = Excel PivotTable
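The comparison between `pd.concat` and SQL's UNION holds in the UNION ALL sense: `concat` never removes repeated rows. A minimal sketch with two hypothetical tables, where `drop_duplicates()` is added to get plain UNION behaviour:

```python
import pandas as pd

t1 = pd.DataFrame({'id': [1, 2], 'name': ['ann', 'bob']})
t2 = pd.DataFrame({'id': [2, 3], 'name': ['bob', 'cat']})

# concat keeps the repeated (2, 'bob') row, like UNION ALL
union_all = pd.concat([t1, t2], ignore_index=True)
# dropping duplicates afterwards gives SQL UNION semantics
union = union_all.drop_duplicates().reset_index(drop=True)

print(len(union_all))  # 4
print(len(union))      # 3
```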
