## Joining

In [1]:
import pandas as pd

In [2]:
nyt = pd.read_csv('chapter6/nyt_names.csv')
nyt

Unnamed: 0,nyt_name,category
0,Lucifer,forbidden
1,Lilith,forbidden
2,Danger,forbidden
3,Amen,evangelical
4,Savior,evangelical
5,Canaan,evangelical
6,Creed,evangelical
7,Saint,evangelical
8,Susan,boomer
9,Debbie,boomer


In [3]:
baby = pd.read_csv('chapter6/babynames.txt')
baby

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020
2,Oliver,M,14147,2020
3,Elijah,M,13034,2020
4,William,M,12541,2020
...,...,...,...,...
2020717,Ula,F,5,1880
2020718,Vannie,F,5,1880
2020719,Verona,F,5,1880
2020720,Vertie,F,5,1880


#### Inner Joins

In [6]:
baby.merge(nyt, left_on='Name', right_on='nyt_name')

Unnamed: 0,Name,Sex,Count,Year,nyt_name,category
0,Julius,M,960,2020,Julius,mythology
1,Cassius,M,596,2020,Cassius,mythology
2,Saint,M,476,2020,Saint,evangelical
3,Onyx,M,442,2020,Onyx,mineral
4,Creed,M,288,2020,Creed,evangelical
...,...,...,...,...,...,...
2288,Cassius,M,17,1880,Cassius,mythology
2289,Creed,M,7,1880,Creed,evangelical
2290,Susan,F,286,1880,Susan,boomer
2291,Celestia,F,6,1880,Celestia,celestial


The ```merge()``` method matches rows using the values in the ```Name``` and ```nyt_name``` columns and drops rows that have no match in the other table. This is an inner join, which is the default behavior of ```merge()```.
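To make the matching rule concrete, here is a minimal sketch on two tiny tables. The tables ```left``` and ```right``` are made up for illustration and are not the chapter's data files; only the column names mirror ```baby``` and ```nyt```.

```python
import pandas as pd

# Hypothetical toy tables, not the chapter's data files
left = pd.DataFrame({'Name': ['Susan', 'Liam', 'Creed'],
                     'Count': [286, 19659, 288]})
right = pd.DataFrame({'nyt_name': ['Susan', 'Creed', 'Lucifer'],
                      'category': ['boomer', 'evangelical', 'forbidden']})

# Inner join (the default): keep only rows whose key appears in both tables
inner = left.merge(right, left_on='Name', right_on='nyt_name')
print(inner)
```

Here ```Liam``` appears only in ```left``` and ```Lucifer``` only in ```right```, so neither shows up in the result.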

#### Left, Right, and Outer Joins
Sometimes we want to keep rows without a match instead of dropping them entirely.

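- **Left join** (```how='left'```): rows in the left table without a match are kept in the final result.
- **Right join** (```how='right'```): rows in the right table without a match are kept in the final result.
- **Outer join** (```how='outer'```): rows from both tables are kept, even when they don't have a match.

In each case, the columns that would have come from the other table are filled with missing values.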
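The three variants can be compared side by side on a small example. This is only a sketch: the tables ```left``` and ```right``` below are hypothetical toy data, chosen so that each table has exactly one row without a match.

```python
import pandas as pd

# Hypothetical toy tables: 'Liam' has no match in right,
# and 'Lucifer' has no match in left
left = pd.DataFrame({'Name': ['Susan', 'Liam'], 'Count': [286, 19659]})
right = pd.DataFrame({'nyt_name': ['Susan', 'Lucifer'],
                      'category': ['boomer', 'forbidden']})

keys = dict(left_on='Name', right_on='nyt_name')
left_join = left.merge(right, how='left', **keys)    # keeps unmatched 'Liam'
right_join = left.merge(right, how='right', **keys)  # keeps unmatched 'Lucifer'
outer_join = left.merge(right, how='outer', **keys)  # keeps both

print(len(left_join), len(right_join), len(outer_join))
```

The unmatched rows carry missing values in the columns that come from the other table, for example ```category``` for ```Liam``` in the left join.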

In [7]:
baby.merge(nyt,
           left_on='Name',
           right_on='nyt_name',
           how='left')

Unnamed: 0,Name,Sex,Count,Year,nyt_name,category
0,Liam,M,19659,2020,,
1,Noah,M,18252,2020,,
2,Oliver,M,14147,2020,,
3,Elijah,M,13034,2020,,
4,William,M,12541,2020,,
...,...,...,...,...,...,...
2020717,Ula,F,5,1880,,
2020718,Vannie,F,5,1880,,
2020719,Verona,F,5,1880,,
2020720,Vertie,F,5,1880,,
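Since most rows in a left join like this one have no match, it can be useful to count how many did. One possible approach, sketched here on hypothetical toy tables rather than the chapter's data, is the ```indicator``` argument of ```merge()```, which adds a ```_merge``` column recording where each row came from.

```python
import pandas as pd

# Hypothetical toy tables, not the chapter's data files
left = pd.DataFrame({'Name': ['Susan', 'Liam', 'Noah']})
right = pd.DataFrame({'nyt_name': ['Susan'], 'category': ['boomer']})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'
joined = left.merge(right, left_on='Name', right_on='nyt_name',
                    how='left', indicator=True)
match_counts = joined['_merge'].value_counts()
print(match_counts)
```

Rows labeled ```left_only``` are the ones a left join keeps and an inner join would drop.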


#### Example: Popularity of NYT Name Categories

In [8]:
baby.head(2)

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020


In [9]:
nyt.head(2)

Unnamed: 0,nyt_name,category
0,Lucifer,forbidden
1,Lilith,forbidden


We want to know how the popularity of name categories in ```nyt``` has changed over time. To answer this question:
1. Inner join ```baby``` with ```nyt```.
2. Group the table by ```category``` and ```Year```.
3. Aggregate the counts using a sum:

In [10]:
cate_counts = (
    baby.merge(nyt, left_on='Name', right_on='nyt_name')  # 1. inner join
    .groupby(['category', 'Year'])                        # 2. group
    ['Count']                                             # 3. select the counts
    .sum()                                                # 3. and sum them
    .reset_index()
)
cate_counts

Unnamed: 0,category,Year,Count
0,boomer,1880,292
1,boomer,1881,298
2,boomer,1882,326
3,boomer,1883,322
4,boomer,1884,335
...,...,...,...
645,mythology,2016,2671
646,mythology,2017,2797
647,mythology,2018,2944
648,mythology,2019,3320


Now we can plot the popularity of ```boomer``` names and ```mythology``` names:

In [17]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [22]:
# Select the rows for the two categories we want to compare
boomers = cate_counts.query('category == "boomer"')
myths = cate_counts.query('category == "mythology"')

# Create subplots
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=['Boomer names', 'Mythology names'])

# Add traces
fig.add_trace(go.Scatter(x=boomers['Year'], y=boomers['Count'],
                         mode='lines', name='Boomer'), row=1, col=1)
fig.add_trace(go.Scatter(x=myths['Year'], y=myths['Count'],
                         mode='lines', name='Mythology'), row=1, col=2)

# Update layout
fig.update_layout(width=500, height=200, margin=dict(t=30), showlegend=False)
fig.show()

As the NYT article claims, baby boomer names have become less popular since 2000, while mythology names have become more popular.

We can also plot the popularity of all the categories at once.

In [27]:
fig = px.line(cate_counts, x='Year', y='Count',
              facet_col='category', facet_col_wrap=3,
              facet_col_spacing=0.05,
              width=800, height=400)
fig.update_yaxes(matches=None, showticklabels=False)