
## Merging Data
### [Merging DataFrames](#MD)
* [Merging company DataFrames](#McD)
* [Merging on a specific column](#Moasc)
* [Merging on columns with non-matching labels](#Mocwnl)
* [Merging on multiple columns](#Momc)


### [Joining DataFrames](#JD)
* [Joining by Index](#JbI)
* [Choosing a joining strategy](#Cajs)
* [Left & right merging on multiple columns](#L&rmomc)
* [Merging DataFrames with outer join](#MDwoj)

### [Ordered merges](#Om)
* [Using merge_ordered()](#Um)
* [Using merge_asof()](#Um)      


## Case Study - Summer Olympics
* [Medals in the Summer Olympics](#MitSO)
* [Loading Olympic edition DataFrame](#LOeD)
* [Loading IOC codes DataFrame](#LIcD)
* [Building medals DataFrame](#BmD)
* [Quantifying Performance](#QP)
* [Counting medals by country/edition in a pivot table](#Cmbciapt)
* [Computing fraction of medals per Olympic edition](#CfompOe)
* [Computing percentage change in fraction of medals won](#Cpcifomw)
* [Reshaping and plotting](#Rap)
* [Building hosts DataFrame](#BhD)
* [Reshaping for analysis](#Rfa)
* [Merging to compute influence](#Mtci)
* [Plotting influence of host country](#Piohc)
* [Final thoughts](#Ft)


In [74]:
import pandas as pd

In [162]:
revenue = pd.DataFrame.from_dict({
    'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
    'branch_id': [10, 20, 30, 47],
    'revenue': [100, 83, 4, 200]
})

In [163]:
manager = pd.DataFrame.from_dict({
    'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
    'branch_id': [10, 20, 30, 47],
    'manager': ['Charles', 'Joel', 'Brett', 'Sally']
})

In [164]:

sales = pd.DataFrame.from_dict({
    'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
    'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
    'units': [1, 4, 2, 5, 1]
})

<p id ='MD'><p>
### Merging DataFrames

<p id ='McD'><p>
### Merging company DataFrames

<p id ='Moasc'><p>
### Merging on a specific column

In [77]:
manager

Unnamed: 0,city,branch_id,manager
0,Austin,10,Charles
1,Denver,20,Joel
2,SpringField,30,Brett
3,Mendocino,47,Sally


In [78]:
revenue

Unnamed: 0,city,branch_id,revenue
0,Austin,10,100
1,Denver,20,83
2,SpringField,30,4
3,Mendocino,47,200


In [79]:
merge_by_city = pd.merge(manager, revenue, on= ['city'])
merge_by_city

Unnamed: 0,city,branch_id_x,manager,branch_id_y,revenue
0,Austin,10,Charles,10,100
1,Denver,20,Joel,20,83
2,SpringField,30,Brett,30,4
3,Mendocino,47,Sally,47,200


In [80]:
merge_by_id = pd.merge(manager, revenue, on= ['branch_id'])
merge_by_id

Unnamed: 0,city_x,branch_id,manager,city_y,revenue
0,Austin,10,Charles,Austin,100
1,Denver,20,Joel,Denver,83
2,SpringField,30,Brett,SpringField,4
3,Mendocino,47,Sally,Mendocino,200
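
The `_x` and `_y` endings in the outputs above are the default suffixes that `pd.merge()` appends to overlapping non-key columns. As a minimal sketch, assuming the same `manager` and `revenue` data (rebuilt here so the block runs on its own), the `suffixes` parameter can give those columns more descriptive names. The names `merge_by_id_named`, `_mng`, and `_rev` are illustrative choices only.

```python
import pandas as pd

# Rebuild the two example frames so this block is self-contained.
manager = pd.DataFrame({
    'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
    'branch_id': [10, 20, 30, 47],
    'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
})
revenue = pd.DataFrame({
    'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
    'branch_id': [10, 20, 30, 47],
    'revenue': [100, 83, 4, 200],
})

# `suffixes` replaces the default ('_x', '_y') on overlapping non-key columns.
merge_by_id_named = pd.merge(
    manager, revenue, on='branch_id', suffixes=('_mng', '_rev')
)
print(merge_by_id_named.columns.tolist())
```

Only columns that appear in both frames and are not join keys receive a suffix, so `branch_id`, `manager`, and `revenue` keep their names.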


In [81]:
merger = pd.merge(manager, revenue, on= ['city', 'branch_id'])
merger

Unnamed: 0,city,branch_id,manager,revenue
0,Austin,10,Charles,100
1,Denver,20,Joel,83
2,SpringField,30,Brett,4
3,Mendocino,47,Sally,200


<p id ='Mocwnl'><p>
### Merging on columns with non-matching labels

In [82]:
# Rename 'city' to 'branch' so the two DataFrames no longer share a key label
manager=manager.rename(columns = {'city':'branch'})
manager

Unnamed: 0,branch,branch_id,manager
0,Austin,10,Charles
1,Denver,20,Joel
2,SpringField,30,Brett
3,Mendocino,47,Sally


In [83]:
revenue

Unnamed: 0,city,branch_id,revenue
0,Austin,10,100
1,Denver,20,83
2,SpringField,30,4
3,Mendocino,47,200


In [84]:
combined = pd.merge(manager, revenue, left_on='branch', right_on='city')
combined

Unnamed: 0,branch,branch_id_x,manager,city,branch_id_y,revenue
0,Austin,10,Charles,Austin,10,100
1,Denver,20,Joel,Denver,20,83
2,SpringField,30,Brett,SpringField,30,4
3,Mendocino,47,Sally,Mendocino,47,200


<p id ='Momc'><p>
### Merging on multiple columns

In [165]:
# Add a 'state' column to revenue
revenue['state'] = ['TX','CO','IL','CA']

# Add a 'state' column to manager (the last two values differ from revenue on purpose)
manager['state'] = ['TX','CO','CA','MO']

In [166]:
revenue

Unnamed: 0,city,branch_id,revenue,state
0,Austin,10,100,TX
1,Denver,20,83,CO
2,SpringField,30,4,IL
3,Mendocino,47,200,CA


In [87]:
manager=manager.rename(columns = {'branch':'city'})
manager

Unnamed: 0,city,branch_id,manager,state
0,Austin,10,Charles,TX
1,Denver,20,Joel,CO
2,SpringField,30,Brett,CA
3,Mendocino,47,Sally,MO


In [88]:
combined = pd.merge(revenue, manager, on = ['city', 'state', 'branch_id'])
combined

Unnamed: 0,city,branch_id,revenue,state,manager
0,Austin,10,100,TX,Charles
1,Denver,20,83,CO,Joel
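
Only two rows survive the inner merge above because every key column must match, and the `state` values disagree for two branches. One way to see which rows failed to match is an outer merge with `indicator=True`, which adds a `_merge` column. This is a sketch that rebuilds the frames as they stand at this point in the notebook; the names `keys`, `diagnostic`, and `unmatched` are illustrative.

```python
import pandas as pd

# The frames as they look after the 'state' columns were added.
revenue = pd.DataFrame({
    'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
    'branch_id': [10, 20, 30, 47],
    'revenue': [100, 83, 4, 200],
    'state': ['TX', 'CO', 'IL', 'CA'],
})
manager = pd.DataFrame({
    'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
    'branch_id': [10, 20, 30, 47],
    'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
    'state': ['TX', 'CO', 'CA', 'MO'],
})

keys = ['city', 'state', 'branch_id']

# An outer merge keeps every row; '_merge' records where each row came from
# ('both', 'left_only', or 'right_only').
diagnostic = pd.merge(revenue, manager, on=keys, how='outer', indicator=True)

# Rows that did not find a partner on all three keys.
unmatched = diagnostic[diagnostic['_merge'] != 'both']
print(unmatched[keys + ['_merge']])
```

With this data the SpringField and Mendocino branches each appear twice in `unmatched`, once per frame, because their `state` values differ.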


<p id ='JD'><p>
### Joining DataFrames

In [89]:
revenue = revenue.set_index('branch_id')
revenue

branch_id,city,revenue,state
10,Austin,100,TX
20,Denver,83,CO
30,SpringField,4,IL
47,Mendocino,200,CA


In [138]:
manager.index = [10, 20, 31, 47]
manager.index.name = 'branch_id'

In [140]:
manager

branch_id,city,manager
10,Austin,Charles
20,Denver,Joel
31,SpringField,Brett
47,Mendocino,Sally


In [142]:
pd.merge(revenue, manager, on= 'branch_id')

branch_id,city_x,revenue,state,city_y,manager
10,Austin,100,TX,Austin,Charles
20,Denver,83,CO,Denver,Joel
47,Mendocino,200,CA,Mendocino,Sally


In [143]:
# how='left' keeps ALL rows from the left DataFrame (revenue); unmatched right-hand columns are left empty
pd.merge(revenue, manager, on= 'branch_id', how = 'left')

branch_id,city_x,revenue,state,city_y,manager
10,Austin,100,TX,Austin,Charles
20,Denver,83,CO,Denver,Joel
30,SpringField,4,IL,,
47,Mendocino,200,CA,Mendocino,Sally


In [144]:
revenue.join(manager, lsuffix='_rev', rsuffix='_mng', how = 'outer')

branch_id,city_rev,revenue,state,city_mng,manager
10,Austin,100.0,TX,Austin,Charles
20,Denver,83.0,CO,Denver,Joel
30,SpringField,4.0,IL,,
31,,,,SpringField,Brett
47,Mendocino,200.0,CA,Mendocino,Sally


In [145]:
manager.join(revenue, lsuffix='_mng', rsuffix = '_rev', how = 'left')

branch_id,city_mng,manager,city_rev,revenue,state
10,Austin,Charles,Austin,100.0,TX
20,Denver,Joel,Denver,83.0,CO
31,SpringField,Brett,,,
47,Mendocino,Sally,Mendocino,200.0,CA
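
The joins above differ only in which index labels they keep. As a rough sketch, the loop below runs `DataFrame.join()` with each `how` option on index-aligned copies of the two frames and records the number of rows. The frames are trimmed to the columns shown in the outputs above, and `row_counts` is an illustrative name.

```python
import pandas as pd

# Index-aligned copies of the example frames (branch 30 vs 31 is the mismatch).
revenue = pd.DataFrame(
    {'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
     'revenue': [100, 83, 4, 200]},
    index=pd.Index([10, 20, 30, 47], name='branch_id'),
)
manager = pd.DataFrame(
    {'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
     'manager': ['Charles', 'Joel', 'Brett', 'Sally']},
    index=pd.Index([10, 20, 31, 47], name='branch_id'),
)

# Count the rows each strategy keeps.
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    joined = revenue.join(manager, lsuffix='_rev', rsuffix='_mng', how=how)
    row_counts[how] = len(joined)

print(row_counts)
```

An inner join keeps only the three shared labels, left and right joins keep the four labels of one side, and the outer join keeps all five distinct labels.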


<p id ='JbI'><p>
### Joining by Index

In [167]:
sales

Unnamed: 0,city,state,units
0,Mendocino,CA,1
1,Denver,CO,4
2,Austin,TX,2
3,Springfield,MO,5
4,Springfield,IL,1


In [168]:
revenue

Unnamed: 0,city,branch_id,revenue,state
0,Austin,10,100,TX
1,Denver,20,83,CO
2,SpringField,30,4,IL
3,Mendocino,47,200,CA


In [176]:
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how = 'right', on = ['city', 'state'])
revenue_and_sales

Unnamed: 0,city,branch_id,revenue,state,units
0,Austin,10.0,100.0,TX,2
1,Denver,20.0,83.0,CO,4
2,Mendocino,47.0,200.0,CA,1
3,Springfield,,,MO,5
4,Springfield,,,IL,1


In [174]:
manager_and_sales = pd.merge(sales, manager, how ='left', on= ['city', 'state'] )
manager_and_sales

Unnamed: 0,city,state,units,branch_id,manager
0,Mendocino,CA,1,,
1,Denver,CO,4,20.0,Joel
2,Austin,TX,2,10.0,Charles
3,Springfield,MO,5,,
4,Springfield,IL,1,,
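
Both Springfield rows come back empty above partly because the key text differs in case: `revenue` and `manager` spell the city `SpringField`, while `sales` uses `Springfield`. If the two spellings are meant to denote the same city (an assumption, since the notebook does not say so), one possible fix is to merge on a normalized copy of the key. The `city_key` column and the `*_norm` names below are hypothetical helpers.

```python
import pandas as pd

revenue = pd.DataFrame({
    'city': ['Austin', 'Denver', 'SpringField', 'Mendocino'],
    'branch_id': [10, 20, 30, 47],
    'revenue': [100, 83, 4, 200],
    'state': ['TX', 'CO', 'IL', 'CA'],
})
sales = pd.DataFrame({
    'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
    'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
    'units': [1, 4, 2, 5, 1],
})

# Hypothetical helper column: a stripped, lower-cased copy of 'city'.
revenue_norm = revenue.assign(city_key=revenue['city'].str.strip().str.lower())
sales_norm = sales.assign(city_key=sales['city'].str.strip().str.lower())

# Merge on the normalized key; keep the 'city' spelling from `sales`.
normalized = pd.merge(
    revenue_norm.drop(columns='city'), sales_norm,
    how='right', on=['city_key', 'state'],
)
print(normalized[['city', 'state', 'branch_id', 'revenue', 'units']])
```

With the normalized key, the Springfield, IL row picks up branch 30, while Springfield, MO still has no match because `revenue` contains no MO branch.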


In [182]:
manager

Unnamed: 0,city,branch_id,manager,state
0,Austin,10,Charles,TX
1,Denver,20,Joel,CO
2,SpringField,30,Brett,CA
3,Mendocino,47,Sally,MO


In [172]:
sales

Unnamed: 0,city,state,units
0,Mendocino,CA,1
1,Denver,CO,4
2,Austin,TX,2
3,Springfield,MO,5
4,Springfield,IL,1


<p id ='Om'><p>
### Ordered merges

In [183]:
austin = pd.DataFrame.from_dict({
	'date':['2016-01-01', '2016-02-08', '2016-01-17'],
	'ratings': ['Cloudy', 'Cloudy', 'Sunny'],
})

houston  = pd.DataFrame.from_dict({
	'date':['2016-01-04', '2016-01-01', '2016-03-01'],
	'ratings': ['Rainy', 'Cloudy', 'Sunny']
})

<p id ='Um'><p>
### Using `merge_ordered()`
Weather conditions in Austin and Houston were recorded on different days. To combine the two DataFrames so that the rows come out ordered by date, use `pd.merge_ordered()`, which performs an outer join by default and sorts the result on the join key. Compare the row order of the inputs with the order of the merged result.



In [184]:
austin

Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-02-08,Cloudy
2,2016-01-17,Sunny


In [185]:
houston

Unnamed: 0,date,ratings
0,2016-01-04,Rainy
1,2016-01-01,Cloudy
2,2016-03-01,Sunny


In [186]:
tx_weather = pd.merge_ordered(austin, houston)
tx_weather


Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-01-04,Rainy
2,2016-01-17,Sunny
3,2016-02-08,Cloudy
4,2016-03-01,Sunny
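
The result above has a single `ratings` column because no `on` argument was given, so `pd.merge_ordered()` joins on every column the two frames share, here both `date` and `ratings`. The sketch below rebuilds the frames and checks that naming both columns explicitly gives the same result; `implicit` and `explicit` are illustrative names.

```python
import pandas as pd

austin = pd.DataFrame({
    'date': ['2016-01-01', '2016-02-08', '2016-01-17'],
    'ratings': ['Cloudy', 'Cloudy', 'Sunny'],
})
houston = pd.DataFrame({
    'date': ['2016-01-04', '2016-01-01', '2016-03-01'],
    'ratings': ['Rainy', 'Cloudy', 'Sunny'],
})

# No `on`: all shared columns are used as join keys.
implicit = pd.merge_ordered(austin, houston)

# Naming the shared columns explicitly should give the same frame.
explicit = pd.merge_ordered(austin, houston, on=['date', 'ratings'])

print(implicit.equals(explicit))
```

Joining on `date` alone keeps the two `ratings` columns apart and distinguishes them by suffix.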


In [187]:
tx_weather_suff = pd.merge_ordered(austin, houston, on= 'date', suffixes = ['_aus', '_hus'])
tx_weather_suff



Unnamed: 0,date,ratings_aus,ratings_hus
0,2016-01-01,Cloudy,Cloudy
1,2016-01-04,,Rainy
2,2016-01-17,Sunny,
3,2016-02-08,Cloudy,
4,2016-03-01,,Sunny


In [188]:
tx_weather_ffill = pd.merge_ordered(austin, houston, on= 'date', suffixes = ['_aus', '_hus'], fill_method='ffill')
tx_weather_ffill



Unnamed: 0,date,ratings_aus,ratings_hus
0,2016-01-01,Cloudy,Cloudy
1,2016-01-04,Cloudy,Rainy
2,2016-01-17,Sunny,Rainy
3,2016-02-08,Cloudy,Rainy
4,2016-03-01,Cloudy,Sunny


<p id ='Um'><p>
### Using `merge_asof()`

In [218]:
# parse_dates=True would only try to parse the index, so name the 'Date' column explicitly
oil = pd.read_csv('./data/oil_price.csv', parse_dates=['Date'])
oil.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 2 columns):
Date     156 non-null datetime64[ns]
Price    156 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 2.5 KB


In [219]:
auto = pd.read_csv('./data/automobiles.csv')
auto['yr'] = pd.to_datetime(auto['yr'])
auto.head()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,1970-01-01,US,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,1970-01-01,US,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,1970-01-01,US,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,1970-01-01,US,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,1970-01-01,US,ford torino


In [204]:
print(oil.shape)
print(auto.shape)

(156, 2)
(392, 9)


In [222]:
merged = pd.merge_asof(auto, oil , left_on='yr', right_on='Date')
merged.head()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name,Date,Price
0,18.0,8,307.0,130,3504,12.0,1970-01-01,US,chevrolet chevelle malibu,1970-01-01,3.35
1,15.0,8,350.0,165,3693,11.5,1970-01-01,US,buick skylark 320,1970-01-01,3.35
2,18.0,8,318.0,150,3436,11.0,1970-01-01,US,plymouth satellite,1970-01-01,3.35
3,16.0,8,304.0,150,3433,12.0,1970-01-01,US,amc rebel sst,1970-01-01,3.35
4,17.0,8,302.0,140,3449,10.5,1970-01-01,US,ford torino,1970-01-01,3.35
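
`pd.merge_asof()` matches each left row to the closest right key rather than requiring an exact match. By default it looks backward, taking the last right row whose key is less than or equal to the left key, and both frames must already be sorted on their keys. The sketch below uses small made-up frames (`cars` and `prices` are hypothetical stand-ins, not the `auto` and `oil` data) to contrast the default with `direction='nearest'`.

```python
import pandas as pd

# Hypothetical stand-ins for the automobile and oil-price data.
cars = pd.DataFrame({
    'yr': pd.to_datetime(['1970-01-01', '1970-03-25', '1970-06-01']),
    'name': ['car_a', 'car_b', 'car_c'],
})
prices = pd.DataFrame({
    'Date': pd.to_datetime(['1970-01-01', '1970-03-01', '1970-04-01']),
    'Price': [3.35, 3.40, 3.50],
})

# Default direction='backward': last price dated on or before each 'yr'.
backward = pd.merge_asof(cars, prices, left_on='yr', right_on='Date')

# direction='nearest': closest price date on either side of each 'yr'.
nearest = pd.merge_asof(
    cars, prices, left_on='yr', right_on='Date', direction='nearest'
)

print(backward[['name', 'Date', 'Price']])
print(nearest[['name', 'Date', 'Price']])
```

The two results differ only for `car_b`: 1970-03-25 falls back to the 1970-03-01 price under the default, but is closer to 1970-04-01 under `direction='nearest'`.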


<p id ='MitSO'><p>
# Medals in the Summer Olympics

<p id ='LOeD'><p>
### Loading Olympic edition DataFrame

<p id ='LIcD'><p>
### Loading IOC codes DataFrame

<p id ='BmD'><p>
### Building medals DataFrame

<p id ='QP'><p>
### Quantifying Performance

<p id ='Cmbciapt'><p>
### Counting medals by country/edition in a pivot table

<p id ='CfompOe'><p>
### Computing fraction of medals per Olympic edition

<p id ='Cpcifomw'><p>
### Computing percentage change in fraction of medals won

<p id ='Rap'><p>
### Reshaping and plotting

<p id ='BhD'><p>
### Building hosts DataFrame

<p id ='Rfa'><p>
### Reshaping for analysis

<p id ='Mtci'><p>
### Merging to compute influence

<p id ='Piohc'><p>
### Plotting influence of host country

<p id ='Ft'><p>
### Final thoughts