## Exercise 7.5
### 1. Merging company DataFrames

Suppose your company has operations in several cities under several managers. The DataFrames `revenue` and `managers` each contain partial information about the company: the values in their `city` columns don't quite match, because the Mendocino branch has just opened and has no revenue yet, and the manager of the Springfield branch recently left the company.

In [28]:
import pandas as pd

rev = {'city': ['Austin', 'Denver', 'Springfield'], 'revenue': [100, 83, 4]}
man = {'city': ['Austin', 'Denver', 'Mendocino'], 'manager': ['Charles', 'Joel', 'Brett']}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
display(revenue, managers)

Unnamed: 0,city,revenue
0,Austin,100
1,Denver,83
2,Springfield,4


Unnamed: 0,city,manager
0,Austin,Charles
1,Denver,Joel
2,Mendocino,Brett


#### Instructions (2 points)
- Use `pd.merge()` to merge the DataFrames `revenue` and `managers` on the `'city'` column of each, and store the result in `combined`.
- Print `combined` to check how many records the combined DataFrame contains. This has been done for you.

In [29]:
# use `pd.merge()` to merge the DataFrames `revenue` and `managers` on the `'city'` column of each.
combined = pd.merge(revenue, managers, on='city')
combined

Unnamed: 0,city,revenue,manager
0,Austin,100,Charles
1,Denver,83,Joel
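
Only two records survive because `pd.merge()` performs an inner join by default. As a minimal sketch, an outer merge keeps every city and records where each row came from. The `how='outer'` and `indicator=True` arguments and the name `all_branches` are additions for illustration, not part of the exercise.

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield'],
                        'revenue': [100, 83, 4]})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Mendocino'],
                         'manager': ['Charles', 'Joel', 'Brett']})

# Default merge is an inner join: only cities present in both frames remain.
combined = pd.merge(revenue, managers, on='city')
print(len(combined))

# Illustrative addition: an outer join keeps unmatched rows, and
# indicator=True adds a '_merge' column naming each row's source frame.
all_branches = pd.merge(revenue, managers, on='city', how='outer', indicator=True)
print(all_branches)
```

Rows marked `left_only` or `right_only` in the `_merge` column are the branches that the inner join dropped.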


### 2. Merging on a specific column

This exercise follows on from the last one, using the DataFrames `revenue` and `managers` for your company. You expect your company to grow and, eventually, to operate in cities with the same name in different states. As such, you decide that every branch should have a numerical branch identifier, so you add a `branch_id` column to both DataFrames. New cities have also been added to both the `revenue` and `managers` DataFrames.

In [3]:
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'], 'revenue': [100, 83, 4, 200], 'branch_id': [10, 20, 30, 47]}
man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'], 'manager': ['Charles', 'Joel', 'Brett', 'Sally'], 'branch_id': [10, 20, 47, 31]}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
display(revenue, managers)

Unnamed: 0,city,revenue,branch_id
0,Austin,100,10
1,Denver,83,20
2,Springfield,4,30
3,Mendocino,200,47


Unnamed: 0,city,manager,branch_id
0,Austin,Charles,10
1,Denver,Joel,20
2,Mendocino,Brett,47
3,Springfield,Sally,31


#### Instructions (4 points)

* Using `pd.merge()`, merge the DataFrames `revenue` and `managers` on the `'city'` column of each. Store the result as `merge_by_city`.
* Merge the DataFrames `revenue` and `managers` on the `'branch_id'` column of each. Store the result as `merge_by_id`.

In [13]:
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')

# Print merge_by_city
merge_by_city

Unnamed: 0,city,revenue,branch_id_x,manager,branch_id_y
0,Austin,100,10,Charles,10
1,Denver,83,20,Joel,20
2,Springfield,4,30,Sally,31
3,Mendocino,200,47,Brett,47


In [14]:
# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')

# Print merge_by_id
merge_by_id

Unnamed: 0,city_x,revenue,branch_id,city_y,manager
0,Austin,100,10,Austin,Charles
1,Denver,83,20,Denver,Joel
2,Mendocino,200,47,Mendocino,Brett


Notice that merging on `'city'` produces a peculiar result: in row 2, the city Springfield has two different branch IDs. This is because there are actually two different cities named Springfield, one in Illinois and the other in Missouri. The `revenue` DataFrame has the one from Illinois, and the `managers` DataFrame has the one from Missouri. Consequently, when you merge on `'branch_id'`, both of these get dropped from the merged DataFrame.
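
One way to spot this kind of collision programmatically is to compare the two suffixed ID columns after merging on `'city'`. The following is a sketch only; the name `mismatched` is introduced here for illustration and is not part of the exercise.

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'revenue': [100, 83, 4, 200],
                        'branch_id': [10, 20, 30, 47]})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
                         'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
                         'branch_id': [10, 20, 47, 31]})

# Merging on 'city' keeps both branch_id columns, suffixed _x and _y.
merge_by_city = pd.merge(revenue, managers, on='city')

# Rows whose two IDs disagree are same-named cities that are
# actually different branches (hypothetical helper name: mismatched).
mismatched = merge_by_city[merge_by_city['branch_id_x'] != merge_by_city['branch_id_y']]
print(mismatched)
```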

### 3. Merging on multiple columns

Another strategy to disambiguate cities with identical names is to add information on the states in which the cities are located. To this end, you add a column called `state` to both DataFrames from the preceding exercises. 

#### Instructions (2 points)

* Create a column called `'state'` in the DataFrame `revenue`, consisting of the list `['TX','CO','IL','CA']`.
* Create a column called `'state'` in the DataFrame `managers`, consisting of the list `['TX','CO','CA','MO']`.

In [15]:
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']
# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']
display(revenue, managers)

Unnamed: 0,city,revenue,branch_id,state
0,Austin,100,10,TX
1,Denver,83,20,CO
2,Springfield,4,30,IL
3,Mendocino,200,47,CA


Unnamed: 0,city,manager,branch_id,state
0,Austin,Charles,10,TX
1,Denver,Joel,20,CO
2,Mendocino,Brett,47,CA
3,Springfield,Sally,31,MO


#### Instructions (3 points)
* Merge the DataFrames `revenue` and `managers` using two columns: `'city'` and `'state'`. Pass them in as a list to the `on` parameter of `pd.merge()`. Use the suffixes `_rev` and `_man` for the two DataFrames. Store the result in `combined`.

In [18]:

# Merge revenue & managers on 'city' & 'state'; use suffixes `_rev` and `_man`; store in `combined`
combined = pd.merge(revenue, managers, on=['city','state'], suffixes=['_rev', '_man'])

# Print combined
print(combined)

        city  revenue  branch_id_rev state  manager  branch_id_man
0     Austin      100             10    TX  Charles             10
1     Denver       83             20    CO     Joel             20
2  Mendocino      200             47    CA    Brett             47
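
To check that the two-column key really does disambiguate the branches, a sketch like the following may help. It assumes that each `(city, state)` pair should appear at most once per DataFrame; the `validate='one_to_one'` argument and the name `ids_agree` are additions for illustration, not part of the exercise.

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'revenue': [100, 83, 4, 200],
                        'branch_id': [10, 20, 30, 47],
                        'state': ['TX', 'CO', 'IL', 'CA']})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
                         'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
                         'branch_id': [10, 20, 47, 31],
                         'state': ['TX', 'CO', 'CA', 'MO']})

# validate='one_to_one' makes pandas raise a MergeError if a
# (city, state) key is duplicated on either side.
combined = pd.merge(revenue, managers, on=['city', 'state'],
                    suffixes=('_rev', '_man'), validate='one_to_one')

# With the composite key, every surviving row should carry matching IDs.
ids_agree = (combined['branch_id_rev'] == combined['branch_id_man']).all()
print(len(combined), ids_agree)
```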


## Exercise 7.6
### 1. Joining by index

The DataFrames `revenue` and `managers` are again made available. This time, they are indexed by `'branch_id'`.

In [19]:
import pandas as pd
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
       'state': ['TX','CO','IL','CA'],
       'revenue': [100, 83, 4, 200],
       'branch_id': [10, 20, 30, 47]}

man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
       'state': ['TX','CO','CA','MO'],
       'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
       'branch_id': [10, 20, 47, 31]}

revenue = pd.DataFrame.from_dict(rev)
revenue.set_index('branch_id', inplace=True)
managers = pd.DataFrame.from_dict(man)
managers.set_index('branch_id', inplace=True)
display(revenue, managers)

branch_id,city,state,revenue
10,Austin,TX,100
20,Denver,CO,83
30,Springfield,IL,4
47,Mendocino,CA,200


branch_id,city,state,manager
10,Austin,TX,Charles
20,Denver,CO,Joel
47,Mendocino,CA,Brett
31,Springfield,MO,Sally


#### Instructions (3 points)
- Call `.join()` on `revenue` to join it with `managers` using an outer join. Use `_rev` as `lsuffix` and `_man` as `rsuffix`.

In [20]:
# Call `.join()` on `revenue` to outer-join it with `managers` on the shared 'branch_id' index.
revenue.join(managers, lsuffix='_rev', rsuffix='_man', how='outer')

branch_id,city_rev,state_rev,revenue,city_man,state_man,manager
10,Austin,TX,100.0,Austin,TX,Charles
20,Denver,CO,83.0,Denver,CO,Joel
30,Springfield,IL,4.0,,,
31,,,,Springfield,MO,Sally
47,Mendocino,CA,200.0,Mendocino,CA,Brett
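
For comparison, the same index-based outer join can also be written with `pd.merge()` by setting `left_index=True` and `right_index=True`. This is a sketch for illustration; the names `joined` and `merged` are not part of the exercise.

```python
import pandas as pd

rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
       'state': ['TX', 'CO', 'IL', 'CA'],
       'revenue': [100, 83, 4, 200],
       'branch_id': [10, 20, 30, 47]}
man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
       'state': ['TX', 'CO', 'CA', 'MO'],
       'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
       'branch_id': [10, 20, 47, 31]}

revenue = pd.DataFrame(rev).set_index('branch_id')
managers = pd.DataFrame(man).set_index('branch_id')

# Index-on-index outer join via the DataFrame method.
joined = revenue.join(managers, lsuffix='_rev', rsuffix='_man', how='outer')

# The same join spelled with pd.merge(), matching on both indexes.
merged = pd.merge(revenue, managers, left_index=True, right_index=True,
                  how='outer', suffixes=('_rev', '_man'))

print(joined.equals(merged))
```

Branch IDs 30 and 31 each appear on one side only, which is why their rows are half empty in the joined result.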


### 2. Left & right merging on multiple columns

You now have, in addition to the `revenue` and `managers` DataFrames, a DataFrame `sales` that summarizes units sold from specific branches.

The `managers` DataFrame uses the label `branch` where the other two DataFrames use `city`. Your task here is to employ *left* and *right* joins to preserve data and identify where data is missing.

In [21]:
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
       'state': ['TX','CO','IL','CA'],
       'revenue': [100, 83, 4, 200]}

man = {'branch': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
       'state': ['TX','CO','CA','MO'],
       'manager': ['Charles', 'Joel', 'Brett', 'Sally']}

sal = {'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
       'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
       'units': [1, 4, 2, 5, 1]}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
sales = pd.DataFrame.from_dict(sal)
display(revenue, managers, sales)

Unnamed: 0,city,state,revenue
0,Austin,TX,100
1,Denver,CO,83
2,Springfield,IL,4
3,Mendocino,CA,200


Unnamed: 0,branch,state,manager
0,Austin,TX,Charles
1,Denver,CO,Joel
2,Mendocino,CA,Brett
3,Springfield,MO,Sally


Unnamed: 0,city,state,units
0,Mendocino,CA,1
1,Denver,CO,4
2,Austin,TX,2
3,Springfield,MO,5
4,Springfield,IL,1


#### Instructions (6 points)

* Execute a right merge using `pd.merge()` with `revenue` and `sales` to yield a new DataFrame `revenue_and_sales`.
    * Use `on=['city', 'state']`.
* Execute a left merge with `sales` and `managers` to yield a new DataFrame `sales_and_managers`.
    * Use `left_on=['city', 'state']`, and `right_on=['branch', 'state']`.

In [22]:
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])

# Print revenue_and_sales
revenue_and_sales

Unnamed: 0,city,state,revenue,units
0,Mendocino,CA,200.0,1
1,Denver,CO,83.0,4
2,Austin,TX,100.0,2
3,Springfield,MO,,5
4,Springfield,IL,4.0,1
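
A right merge of `revenue` with `sales` keeps every row of `sales`, so it should hold the same records as a left merge with the arguments swapped. The sketch below checks this after normalizing column and row order; the helper `normalized` and the names `right`, `left`, and `same_rows` are hypothetical and introduced only for illustration.

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'state': ['TX', 'CO', 'IL', 'CA'],
                        'revenue': [100, 83, 4, 200]})
sales = pd.DataFrame({'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
                      'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
                      'units': [1, 4, 2, 5, 1]})

# Right merge: keep all rows of the right-hand frame (sales).
right = pd.merge(revenue, sales, how='right', on=['city', 'state'])

# Left merge with the arguments swapped: keep all rows of sales again.
left = pd.merge(sales, revenue, how='left', on=['city', 'state'])


def normalized(df):
    """Put columns and rows in a fixed order so the frames can be compared."""
    cols = ['city', 'state', 'revenue', 'units']
    return df[cols].sort_values(['city', 'state']).reset_index(drop=True)


same_rows = normalized(right).equals(normalized(left))
print(same_rows)
```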


In [23]:
# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
sales_and_managers

Unnamed: 0,city,state,units,branch,manager
0,Mendocino,CA,1,Mendocino,Brett
1,Denver,CO,4,Denver,Joel
2,Austin,TX,2,Austin,Charles
3,Springfield,MO,5,Springfield,Sally
4,Springfield,IL,1,,
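
Since the goal is to identify where data is missing, one possible follow-up is to filter the left merge for rows with no manager. This is a sketch only; the name `unmanaged` is introduced here and is not part of the exercise.

```python
import pandas as pd

managers = pd.DataFrame({'branch': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
                         'state': ['TX', 'CO', 'CA', 'MO'],
                         'manager': ['Charles', 'Joel', 'Brett', 'Sally']})
sales = pd.DataFrame({'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
                      'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
                      'units': [1, 4, 2, 5, 1]})

# Left merge keeps every sales row; unmatched rows get NaN for manager columns.
sales_and_managers = pd.merge(sales, managers, how='left',
                              left_on=['city', 'state'],
                              right_on=['branch', 'state'])

# Rows with a missing manager are branches that sold units but have
# no entry in `managers` (hypothetical name: unmanaged).
unmanaged = sales_and_managers[sales_and_managers['manager'].isna()]
print(unmanaged[['city', 'state', 'units']])
```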
