## Introduction and Setup

- Notebook Author: [Trenton McKinney](https://trenton3983.github.io/)
- Course: **[DataCamp: Merging DataFrames with pandas](https://learn.datacamp.com/courses/merging-dataframes-with-pandas)**
  - This [notebook](https://github.com/trenton3983/DataCamp/blob/master/2019-03-23_merging_dataframes_with_pandas.ipynb) was created as a reproducible reference.
  - The material is from the course.
  - I completed the exercises.
  - If you find the content beneficial, consider a [DataCamp Subscription](https://www.datacamp.com/pricing?period=yearly).

### Synopsis

The code cells that follow have been updated for `pandas v2.2.2`, so some code does not match the associated instructions; the differences are minor.

Here is a concise synopsis of each major section within the document:

### Preparing Data
- Discusses various techniques for importing multiple files into DataFrames.
- Focuses on how to use Indexes to share information between DataFrames.
- Covers essential commands for reading data files such as `pd.read_csv()` and highlights their importance for subsequent merging tasks.

### Concatenating Data
- Explores database-style operations to append and concatenate DataFrames using real-world datasets.
- Describes `.append()` and `pd.concat()`, which stack rows or join DataFrames along an axis.
  - Some instructions reference `.append()`, which was removed in pandas 2.0 in favor of `pd.concat()`.
- Emphasizes the handling of indices during the concatenation process and introduces the concept of hierarchical indexing.

### Merging Data
- Provides an in-depth look at merging techniques in pandas, including different types of joins (left, right, inner, outer).
- Explains the use of the `merge()` function to align rows using one or more columns.
- Discusses ordered merging, which is particularly useful when dealing with columns that have a natural ordering, like dates.

### Case Study - Summer Olympics
- Applies previously discussed DataFrame skills to analyze Olympic medal data, integrating lessons from both the current and prior pandas courses.
- Uses the dataset of Summer Olympic medalists from 1896 to 2008 to showcase data manipulation capabilities in pandas.

Each section builds on the previous one, culminating in a practical case study that uses all of the DataFrame manipulation techniques discussed.

**Imports**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint as pp
import csv
from pathlib import Path
import yfinance as yf

In [None]:
print(f'Pandas Version: {pd.__version__}')
print(f'Matplotlib Version: {plt.matplotlib.__version__}')
print(f'Numpy Version: {np.__version__}')
print(f'Yahoo Finance Version: {yf.__version__}')

**Pandas Configuration Options**

In [None]:
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

**Data Files Location**

* Most data files for the exercises can be found on the [course site](https://www.datacamp.com/courses/merging-dataframes-with-pandas)
    * [Baby Names](https://assets.datacamp.com/production/repositories/516/datasets/43c9b6bf4c283ab024b2d7d61fbf15a0baa1e44d/Baby%20names.zip)
    * [Summer Olympic Medals](https://assets.datacamp.com/production/repositories/516/datasets/2d14df8d3c6a1773358fa000f203282c2e1107d6/Summer%20Olympic%20medals.zip)
    * [Automobile Fuel Efficiency](https://assets.datacamp.com/production/repositories/516/datasets/2f3d8b2156d5669fb7e12137f1c2e979c3c9ce0b/automobiles.csv)
    * [Exchange Rates](https://assets.datacamp.com/production/repositories/516/datasets/e91482db6a7bae394653278e4e908e63ed9ac833/exchange.csv)
    * [GDP](https://assets.datacamp.com/production/repositories/516/datasets/a0858a700501f88721ca9e4bdfca99b9e10b937f/GDP.zip)
    * [Oil Prices](https://assets.datacamp.com/production/repositories/516/datasets/707566cf46c4dd6290b9029f5e07a92baf3fe3f7/oil_price.csv)
    * [Pittsburgh Weather](https://assets.datacamp.com/production/repositories/516/datasets/58c1ead59818b2451324e9e84239db7bda6b11d3/pittsburgh2013.csv)
    * [Sales](https://assets.datacamp.com/production/repositories/516/datasets/2b89c1b00016e1ebcfd7f08a127d2c79589ce5c0/Sales.zip)
    * [S&P 500](https://assets.datacamp.com/production/repositories/516/datasets/7a9b570a02ef589891d9576a86876a616ca5f3c8/sp500.csv)
* Other data files may be found in my [DataCamp repository](https://github.com/trenton3983/DataCamp/tree/master/data)

**Data File Objects**

In [None]:
data = Path.cwd() / 'data' / 'merging-dataframes-with-pandas'
auto_fuel_file = data / 'auto_fuel_efficiency.csv'
baby_1881_file = data / 'baby_names1881.csv'
baby_1981_file = data / 'baby_names1981.csv'
exch_rates_file = data / 'exchange_rates.csv'
gdp_china_file = data / 'gdp_china.csv'
gdp_usa_file = data / 'gdp_usa.csv'
oil_price_file = data / 'oil_price.csv'
pitts_file = data / 'pittsburgh_weather_2013.csv'
sales_feb_hardware_file = data / 'sales-feb-Hardware.csv'
sales_feb_service_file = data / 'sales-feb-Service.csv'
sales_feb_software_file = data / 'sales-feb-Software.csv'
sales_jan_2015_file = data / 'sales-jan-2015.csv'
sales_feb_2015_file = data / 'sales-feb-2015.csv'
sales_mar_2015_file = data / 'sales-mar-2015.csv'
sp500_file = data / 'sp500.csv'
so_bronze_file = data / 'summer_olympics_Bronze.csv'
so_bronze5_file = data / 'summer_olympics_bronze_top5.csv'
so_gold_file = data / 'summer_olympics_Gold.csv'
so_gold5_file = data / 'summer_olympics_gold_top5.csv'
so_silver_file = data / 'summer_olympics_Silver.csv'
so_silver5_file = data / 'summer_olympics_silver_top5.csv'
so_all_medalists_file = data / 'summer_olympics_medalists 1896 to 2008 - ALL MEDALISTS.tsv'
so_editions_file = data / 'summer_olympics_medalists 1896 to 2008 - EDITIONS.tsv'
so_ioc_codes_file = data / 'summer_olympics_medalists 1896 to 2008 - IOC COUNTRY CODES.csv'

### Course Description

As a Data Scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions.

## Preparing Data

In this chapter, you'll learn about different techniques you can use to import multiple files into DataFrames. Having imported your data into individual DataFrames, you'll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential information that you'll need for merging DataFrames later in the course.

### Reading multiple data files

#### Tools for pandas data import

* pd.read_csv() for CSV files
    * dataframe = pd.read_csv(filepath)
    * dozens of optional input parameters
* Other data import tools:
    * pd.read_excel()
    * pd.read_html()
    * pd.read_json()

#### Loading separate files

```python
import pandas as pd
dataframe0 = pd.read_csv('sales-jan-2015.csv')
dataframe1 = pd.read_csv('sales-feb-2015.csv')
```

#### Using a loop

```python
filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))
```

#### Using a comprehension

```python
filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = [pd.read_csv(f) for f in filenames]
```

#### Using glob

```python
from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]
```
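
The glob pattern above can be tried without the course files. The following is a minimal sketch, assuming hypothetical file names and contents written to a temporary directory. Glob results come back in arbitrary, filesystem-dependent order, so the sketch sorts them; the sorting is an addition here, not part of the course code.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample files, written to a temporary directory for illustration
tmp_dir = Path(tempfile.mkdtemp())
for month, units in [('jan', 3), ('feb', 5), ('mar', 7)]:
    (tmp_dir / f'sales-{month}-2015.csv').write_text(f'Company,Units\nAcme,{units}\n')

# Path.glob yields matches in arbitrary order, so sort for reproducibility
filenames = sorted(tmp_dir.glob('sales*.csv'))
dataframes = [pd.read_csv(f) for f in filenames]

print([f.name for f in filenames])
print(len(dataframes))
```

Note that an alphabetical sort puts `feb` before `jan`; if chronological order matters, the list of file names has to be ordered explicitly.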

### Exercises

#### Reading DataFrames from multiple files

When data is spread among several files, you usually invoke pandas' <code>read_csv()</code> (or a similar data import function) multiple times to load the data into several DataFrames.

The data files for this example have been derived from a [list of Olympic medals awarded between 1896 & 2008](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data) compiled by the Guardian.

The column labels of each DataFrame are <code>NOC</code>, <code>Country</code>, & <code>Total</code> where <code>NOC</code> is a three-letter code for the name of the country and <code>Total</code> is the number of medals of that type won (bronze, silver, or gold).

**Instructions**
<ul>
<li>Import <span style="background-color: #A733FF">pandas</span> as <span style="background-color: #A733FF">pd</span>.</li>
<li>Read the file <span style="background-color: #A733FF">'Bronze.csv'</span> into a DataFrame called <span style="background-color: #A733FF">bronze</span>.</li>
<li>Read the file <span style="background-color: #A733FF">'Silver.csv'</span> into a DataFrame called <span style="background-color: #A733FF">silver</span>.</li>
<li>Read the file <span style="background-color: #A733FF">'Gold.csv'</span> into a DataFrame called <span style="background-color: #A733FF">gold</span>.</li>
<li>Print the first 5 rows of the DataFrame <span style="background-color: #A733FF">gold</span>. This has been done for you, so hit 'Submit Answer' to see the results.</li></ul>

In [None]:
# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(so_bronze_file)

# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(so_silver_file)

# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv(so_gold_file)

# Print the first five rows of gold
gold.head()

#### Reading DataFrames from multiple files in a loop

As you saw in the video, loading data from multiple files into DataFrames is more efficient in a <em>loop</em> or a <em>list comprehension</em>.

Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.

Here, you'll continue working with [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

**Instructions**

* Create a list of file names called <mark>filenames</mark> with three strings <mark>'Gold.csv'</mark>, <mark>'Silver.csv'</mark>, & <mark>'Bronze.csv'</mark>. This has been done for you.
* Use a <mark>for</mark> loop to create another list called <mark>dataframes</mark> containing the three DataFrames loaded from <mark>filenames</mark>:
    * Iterate over <mark>filenames</mark>.
    * Read each CSV file in <mark>filenames</mark> into a DataFrame and append it to <mark>dataframes</mark> by using <mark>pd.read_csv()</mark> inside a call to <mark>.append()</mark>.
* Print the first 5 rows of the first DataFrame of the list <mark>dataframes</mark>. This has been done for you, so hit 'Submit Answer' to see the results.

In [None]:
# Create the list of file names: filenames
filenames = [so_bronze_file, so_silver_file, so_gold_file]

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

# Print top 5 rows of 1st DataFrame in dataframes
dataframes[0].head()

#### Combining DataFrames from multiple data files

In this exercise, you'll *combine* the three DataFrames from earlier exercises - `gold`, `silver`, & `bronze` - into a single DataFrame called `medals`. The approach you'll use here is clumsy. Later on in the course, you'll see various powerful methods that are frequently used in practice for *concatenating* or *merging* DataFrames.

Remember, the column labels of each DataFrame are `NOC`, `Country`, and `Total`, where `NOC` is a three-letter code for the name of the country and `Total` is the number of medals of that type won.

**Instructions**

* Construct a copy of the DataFrame `gold` called `medals` using the `.copy()` method.
* Create a list called `new_labels` with entries `'NOC'`, `'Country'`, & `'Gold'`. This is the same as the column labels from `gold` with the column label `'Total'` replaced by `'Gold'`.
* Rename the columns of `medals` by assigning `new_labels` to `medals.columns`.
* Create new columns `'Silver'` and `'Bronze'` in medals using `silver['Total']` & `bronze['Total']`.
* Print the top 5 rows of the final DataFrame `medals`. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
# Make a copy of gold: medals
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels

# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']

# Print the head of medals
medals.head()
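
One reason the column-assignment approach can be fragile: when a Series is assigned to a DataFrame column, pandas aligns on index labels rather than on any column such as `NOC`. The sketch below uses small hypothetical DataFrames (the countries and totals are made up) whose rows are in different orders to show the difference.

```python
import pandas as pd

# Hypothetical data: the same three countries, listed in different row orders
gold = pd.DataFrame({'NOC': ['USA', 'URS', 'GBR'], 'Total': [10, 8, 5]})
silver = pd.DataFrame({'NOC': ['GBR', 'USA', 'URS'], 'Total': [6, 9, 7]})

medals = gold.rename(columns={'Total': 'Gold'})

# Assignment aligns on the default integer index (0, 1, 2), not on 'NOC',
# so each silver total lands on the row with the same integer label
medals['Silver_by_position'] = silver['Total']

# Setting 'NOC' as the index first makes the alignment follow the country code
medals = medals.set_index('NOC')
medals['Silver_by_label'] = silver.set_index('NOC')['Total']
print(medals)
```

In this sketch the `Silver_by_position` column pairs USA with GBR's total, while `Silver_by_label` pairs each country with its own total.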

In [None]:
del bronze, silver, gold, dataframes, medals

### Reindexing DataFrames

#### "Indexes" vs. "Indices"

* indices: many index labels within Index data structures
* indexes: many pandas Index data structures

![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2019-03-23_merging_dataframes_with_pandas/indices_indexes.JPG "Indices & Indexes")

#### Importing weather data

```python
import pandas as pd
w_mean = pd.read_csv('quarterly_mean_temp.csv', index_col='Month')
w_max = pd.read_csv('quarterly_max_temp.csv', index_col='Month')
```

#### Examining the data

```python
print(w_mean)
        Mean TemperatureF
Month
Apr     61.956044
Jan     32.133333
Jul     68.934783
Oct     43.434783

print(w_max)
        Max TemperatureF
Month
Jan     68
Apr     89
Jul     91
Oct     84
```

#### The DataFrame indexes

```python
print(w_mean.index)
Index(['Apr', 'Jan', 'Jul', 'Oct'], dtype='object', name='Month')

print(w_max.index)
Index(['Jan', 'Apr', 'Jul', 'Oct'], dtype='object', name='Month')

print(type(w_mean.index))
<class 'pandas.core.indexes.base.Index'>
```

#### Using .reindex()

```python
ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
print(w_mean2)

        Mean TemperatureF
Month
Jan     32.133333
Apr     61.956044
Jul     68.934783
Oct     43.434783
```

#### Using .sort_index()

```python
w_mean2.sort_index()
        Mean TemperatureF
Month
Apr     61.956044
Jan     32.133333
Jul     68.934783
Oct     43.434783
```

#### Reindex from a DataFrame Index

```python
w_mean.reindex(w_max.index)
        Mean TemperatureF
Month
Jan     32.133333
Apr     61.956044
Jul     68.934783
Oct     43.434783
```

#### Reindexing with missing labels

```python
w_mean3 = w_mean.reindex(['Jan', 'Apr', 'Dec'])
print(w_mean3)
        Mean TemperatureF
Month
Jan     32.133333
Apr     61.956044
Dec     NaN
```

#### Reindex from a DataFrame Index

```python
w_max.reindex(w_mean3.index)
        Max TemperatureF
Month
Jan     68.0
Apr     89.0
Dec     NaN

w_max.reindex(w_mean3.index).dropna()
        Max TemperatureF
Month
Jan     68.0
Apr     89.0
```

#### Order matters

```python
w_max.reindex(w_mean.index)
        Max TemperatureF
Month
Apr     89
Jan     68
Jul     91
Oct     84

w_mean.reindex(w_max.index)
        Mean TemperatureF
Month
Jan     32.133333
Apr     61.956044
Jul     68.934783
Oct     43.434783
```
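
The reindexing slides above can be reproduced without the CSV files. This is a minimal sketch that builds `w_mean` and `w_max` inline from the values shown in the slides; constructing them with `pd.DataFrame` instead of `pd.read_csv()` is an assumption made for self-containment.

```python
import pandas as pd

# Values copied from the slides above
w_mean = pd.DataFrame(
    {'Mean TemperatureF': [61.956044, 32.133333, 68.934783, 43.434783]},
    index=pd.Index(['Apr', 'Jan', 'Jul', 'Oct'], name='Month'),
)
w_max = pd.DataFrame(
    {'Max TemperatureF': [68, 89, 91, 84]},
    index=pd.Index(['Jan', 'Apr', 'Jul', 'Oct'], name='Month'),
)

# Reorder w_mean to follow w_max's row order
aligned = w_mean.reindex(w_max.index)
print(aligned.index.tolist())

# A label absent from the original ('Dec') yields a NaN row
with_missing = w_max.reindex(['Jan', 'Apr', 'Dec'])
print(with_missing)

# NaN cannot be stored in an int64 column, so the column is upcast to float64
print(with_missing['Max TemperatureF'].dtype)
```

The upcast is why the slides show `68.0` and `89.0` after reindexing with a missing label, even though the original column holds integers.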

### Exercises

#### Sorting DataFrame with the Index & columns

It is often useful to rearrange the sequence of the rows of a DataFrame by *sorting*. You don't have to implement these yourself; the principal methods for doing this are `.sort_index()` and `.sort_values()`.

In this exercise, you'll use these methods with a DataFrame of temperature values indexed by month names. You'll sort the rows alphabetically using the Index and numerically using a column. Notice, for this data, the original ordering is probably most useful and intuitive: the purpose here is for you to understand what the sorting methods do.

**Instructions**

* Read `'monthly_max_temp.csv'` into a DataFrame called `weather1` with `'Month'` as the index.
* Sort the index of `weather1` in alphabetical order using the `.sort_index()` method and store the result in `weather2`.
* Sort the index of `weather1` in *reverse* alphabetical order by specifying the additional keyword argument `ascending=False` inside `.sort_index()`.
* Use the `.sort_values()` method to sort `weather1` in increasing numerical order according to the values of the column `'Max TemperatureF'`.

In [None]:
monthly_max_temp = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
                    'Max TemperatureF': [68, 60, 68, 84, 88, 89, 91, 86, 90, 84, 72, 68]}

In [None]:
# Read 'monthly_max_temp.csv' into a DataFrame: weather1
# weather1 = pd.read_csv('monthly_max_temp.csv', index_col='Month')
weather1 = pd.DataFrame.from_dict(monthly_max_temp)
weather1.set_index('Month', inplace=True)

# Print the head of weather1
print(weather1.head())

# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
print(weather2.head())

# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)

# Print the head of weather3
print(weather3.head())

# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values(by='Max TemperatureF')

# Print the head of weather4
print(weather4.head())

#### Reindexing DataFrame from a list

Sorting methods are not the only way to change DataFrame Indexes. There is also the `.reindex()` method.

In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of *upsampling* or increasing the rate of samples, which you may recall from the [pandas Foundations](https://www.datacamp.com/courses/pandas-foundations) course).

The original data has the first month's abbreviation of the quarter (three-month interval) on the Index, namely `Apr`, `Jan`, `Jul`, and `Oct`. This data has been loaded into a DataFrame called `weather1` and has been printed in its entirety in the IPython Shell. Notice it has only four rows (corresponding to the first month of each quarter) and that the rows are not sorted chronologically.

You'll initially use a list of all twelve month abbreviations and subsequently apply the `.ffill()` method to *forward-fill* the null entries when upsampling. This list of month abbreviations has been pre-loaded as `year`.

**Instructions**

* Reorder the rows of `weather1` using the `.reindex()` method with the list `year` as the argument, which contains the abbreviations for each month.
* Reorder the rows of `weather1` just as you did above, this time chaining the `.ffill()` method to replace the null values with the last preceding non-null value.

In [None]:
quarterly_mean_temp = {'Month': ['Jan', 'Apr', 'Jul', 'Oct'],
                       'Mean TemperatureF': [32.133333, 61.956044, 68.934783, 43.434783]}
weather1 = pd.DataFrame.from_dict(quarterly_mean_temp)
weather1.set_index('Month', inplace=True)
weather1

In [None]:
year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
weather2

In [None]:
# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
weather3

#### Reindexing DataFrame using another DataFrame Index

Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame `.reindex()` method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its `.index` attribute.

The [Baby Names Dataset](https://www.data.gov/developers/baby-names-dataset/) from [data.gov](https://data.gov/) summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames `names_1981` and `names_1881` loaded for you.

The DataFrames `names_1981` and `names_1881` both have a MultiIndex with levels `name` and `gender` giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, `names_1981` and `names_1881` were read in using the following commands:

```python
names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))
```

As you can see by looking at their shapes, which have been printed in the IPython Shell, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.

Your job here is to use the DataFrame `.reindex()` and `.dropna()` methods to make a DataFrame `common_names` counting names from 1881 that were still popular in 1981.

**Instructions**

* Create a new DataFrame `common_names` by reindexing `names_1981` using the Index of the DataFrame `names_1881` of older names.
* Print the shape of the new `common_names` DataFrame. This has been done for you. It should be the same as that of `names_1881`.
* Drop the rows of `common_names` that have null counts using the `.dropna()` method. These rows correspond to names that fell out of fashion between 1881 & 1981.
* Print the shape of the reassigned `common_names` DataFrame. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
names_1981 = pd.read_csv(baby_1981_file, header=None, names=['name', 'gender', 'count'], index_col=(0,1))
names_1981.head()

In [None]:
names_1881 = pd.read_csv(baby_1881_file, header=None, names=['name','gender','count'], index_col=(0,1))
names_1881.head()

In [None]:
# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
common_names.shape

In [None]:
# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
common_names.shape

In [None]:
common_names.head(10)
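
The same reindex-then-drop pattern can be seen on a toy example. The sketch below uses hypothetical names and counts (not taken from the baby-names files) with a two-level `(name, gender)` MultiIndex.

```python
import pandas as pd

# Hypothetical counts, indexed by (name, gender) like the baby-names data
names_old = pd.DataFrame(
    {'count': [7065, 2, 5]},
    index=pd.MultiIndex.from_tuples(
        [('Mary', 'F'), ('Zilpha', 'F'), ('Adelbert', 'M')], names=['name', 'gender']
    ),
)
names_new = pd.DataFrame(
    {'count': [11030, 12, 30000]},
    index=pd.MultiIndex.from_tuples(
        [('Mary', 'F'), ('Adelbert', 'M'), ('Jennifer', 'F')], names=['name', 'gender']
    ),
)

# Look up the newer counts for every (name, gender) pair in the older index
common = names_new.reindex(names_old.index)
print(common.shape)

# Pairs missing from names_new come back as NaN; dropping them keeps the overlap
common = common.dropna()
print(common.index.tolist())
```

Before `.dropna()`, the result has exactly as many rows as `names_old`, which mirrors the shape check in the exercise.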

In [None]:
del weather1, weather2, weather3, weather4, common_names, names_1881, names_1981

### Arithmetic with Series & DataFrames

#### Loading weather data

```python
import pandas as pd
weather = pd.read_csv('pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.loc['2013-7-1':'2013-7-7', 'PrecipitationIn']

Date
2013-07-01 0.18
2013-07-02 0.14
2013-07-03 0.00
2013-07-04 0.25
2013-07-05 0.02
2013-07-06 0.06
2013-07-07 0.10
Name: PrecipitationIn, dtype: float64
```

#### Scalar multiplication

```python
weather.loc['2013-07-01':'2013-07-07', 'PrecipitationIn'] * 2.54

Date
2013-07-01 0.4572
2013-07-02 0.3556
2013-07-03 0.0000
2013-07-04 0.6350
2013-07-05 0.0508
2013-07-06 0.1524
2013-07-07 0.2540
Name: PrecipitationIn, dtype: float64
```

#### Absolute temperature range

```python
week1_range = weather.loc['2013-07-01':'2013-07-07', ['Min TemperatureF', 'Max TemperatureF']]
print(week1_range)
Min TemperatureF Max TemperatureF
Date
2013-07-01 66 79
2013-07-02 66 84
2013-07-03 71 86
2013-07-04 70 86
2013-07-05 69 86
2013-07-06 70 89
2013-07-07 70 77
```

#### Average temperature

```python
week1_mean = weather.loc['2013-07-01':'2013-07-07', 'Mean TemperatureF']
print(week1_mean)
Date
2013-07-01 72
2013-07-02 74
2013-07-03 78
2013-07-04 77
2013-07-05 76
2013-07-06 78
2013-07-07 72
Name: Mean TemperatureF, dtype: int64
```

#### Relative temperature range

```python
week1_range / week1_mean
RuntimeWarning: Cannot compare type 'Timestamp' with type 'str', sort order is
undefined for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)

2013-07-01 00:00:00 2013-07-02 00:00:00 2013-07-03 00:00:00 \
Date
2013-07-01 NaN NaN NaN
2013-07-02 NaN NaN NaN
2013-07-03 NaN NaN NaN
2013-07-04 NaN NaN NaN
2013-07-05 NaN NaN NaN
2013-07-06 NaN NaN NaN
2013-07-07 NaN NaN NaN
2013-07-04 00:00:00 2013-07-05 00:00:00 2013-07-06 00:00:00 \
Date
2013-07-01 NaN NaN NaN
... ...
```

#### Relative temperature range

```python
week1_range.divide(week1_mean, axis='rows')

Min TemperatureF Max TemperatureF
Date
2013-07-01 0.916667 1.097222
2013-07-02 0.891892 1.135135
2013-07-03 0.910256 1.102564
2013-07-04 0.909091 1.116883
2013-07-05 0.907895 1.131579
2013-07-06 0.897436 1.141026
2013-07-07 0.972222 1.069444
```
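
The contrast between the `/` operator and `.divide(..., axis='rows')` can be reproduced with the first two rows of the slide data. This is a sketch under the assumption that the data is built inline rather than read from `pittsburgh2013.csv`; the warning filter is also an addition, used only to keep the output quiet.

```python
import warnings

import pandas as pd

dates = pd.to_datetime(['2013-07-01', '2013-07-02'])

# First two rows of the slide data
week1_range = pd.DataFrame(
    {'Min TemperatureF': [66, 66], 'Max TemperatureF': [79, 84]}, index=dates
)
week1_mean = pd.Series([72, 74], index=dates, name='Mean TemperatureF')

# The / operator matches the Series index (dates) against the DataFrame columns,
# finds no labels in common, and returns all NaN
with warnings.catch_warnings():
    # Mixing string column labels with Timestamp labels may emit a RuntimeWarning
    warnings.simplefilter('ignore', RuntimeWarning)
    by_operator = week1_range / week1_mean
print(by_operator.isna().all().all())

# .divide(..., axis='rows') matches the Series index against the DataFrame rows
relative = week1_range.divide(week1_mean, axis='rows')
print(relative.round(6))
```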

#### Percentage changes

```python
week1_mean.pct_change() * 100

Date
2013-07-01 NaN
2013-07-02 2.777778
2013-07-03 5.405405
2013-07-04 -1.282051
2013-07-05 -1.298701
2013-07-06 2.631579
2013-07-07 -7.692308
Name: Mean TemperatureF, dtype: float64
```

#### Bronze Olympic medals

```python
bronze = pd.read_csv('bronze_top5.csv', index_col=0)
print(bronze)
Total
Country
United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
```

#### Silver Olympic medals

```python
silver = pd.read_csv('silver_top5.csv', index_col=0)
print(silver)
Total
Country
United States 1195.0
Soviet Union 627.0
United Kingdom 591.0
France 461.0
Italy 394.0
```

#### Gold Olympic medals

```python
gold = pd.read_csv('gold_top5.csv', index_col=0)
print(gold)
Total
Country
United States 2088.0
Soviet Union 838.0
United Kingdom 498.0
Italy 460.0
Germany 407.0
```

#### Adding bronze, silver

```python
bronze + silver

Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
```

#### Adding bronze, silver

```python
bronze + silver

Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
print(bronze['United States'])
1052.0
print(silver['United States'])
1195.0
```

#### Using the .add() method

```python
bronze.add(silver)

Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
```

#### Using a fill_value

```python
bronze.add(silver, fill_value=0)

Country
France 936.0
Germany 454.0
Italy 394.0
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
```

#### Adding bronze, silver, gold

```python
bronze + silver + gold

Country
France NaN
Germany NaN
Italy NaN
Soviet Union 2049.0
United Kingdom 1594.0
United States 4335.0
Name: Total, dtype: float64
```

#### Chaining .add()

```python
bronze.add(silver, fill_value=0).add(gold, fill_value=0)

Country
France 936.0
Germany 861.0
Italy 854.0
Soviet Union 2049.0
United Kingdom 1594.0
United States 4335.0
Name: Total, dtype: float64
```
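
The medal arithmetic above can be checked end to end. The sketch below assumes the three top-5 tables are held as Series indexed by country (the slide output shows a Series named `Total`) and builds them inline from the slide values instead of reading the CSV files.

```python
import pandas as pd

# Medal totals copied from the slides above, held as Series indexed by country
bronze = pd.Series(
    {'United States': 1052.0, 'Soviet Union': 584.0, 'United Kingdom': 505.0,
     'France': 475.0, 'Germany': 454.0},
    name='Total',
)
silver = pd.Series(
    {'United States': 1195.0, 'Soviet Union': 627.0, 'United Kingdom': 591.0,
     'France': 461.0, 'Italy': 394.0},
    name='Total',
)
gold = pd.Series(
    {'United States': 2088.0, 'Soviet Union': 838.0, 'United Kingdom': 498.0,
     'Italy': 460.0, 'Germany': 407.0},
    name='Total',
)

# Plain + yields NaN for any country missing from at least one Series
plain = bronze + silver + gold
print(plain)

# fill_value=0 treats a missing entry as 0 at each step of the chain
filled = bronze.add(silver, fill_value=0).add(gold, fill_value=0)
print(filled)
```

`fill_value` only substitutes for a value that is missing on one side; a label missing from both operands would still produce `NaN`.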

### Exercises

#### Adding unaligned DataFrames

The DataFrames `january` and `february`, which have been printed in the IPython Shell, represent the sales a company made in the corresponding months.

The Indexes in both DataFrames are called `Company`, identifying which company bought that quantity of units. The column `Units` is the number of units sold.

If you were to add these two `DataFrames` by executing the command `total = january + february`, how many rows would the resulting DataFrame have? Try this in the IPython Shell and find out for yourself.

In [None]:
jan_dict = {'Company': ['Acme Corporation', 'Hooli', 'Initech', 'Mediacore', 'Streeplex'],
            'Units': [19, 17, 20, 10, 13]}
feb_dict = {'Company': ['Acme Corporation', 'Hooli', 'Mediacore', 'Vandelay Inc'],
            'Units': [15, 3, 12, 25]}

january = pd.DataFrame.from_dict(jan_dict)
january.set_index('Company', inplace=True)
print(january)

february = pd.DataFrame.from_dict(feb_dict)
february.set_index('Company', inplace=True)
print('\n', february, '\n')

print(january + february)

#### Broadcasting in Arithmetic formulas

In this exercise, you'll work with weather data pulled from [wunderground.com](https://www.wunderground.com/). The DataFrame <code>weather</code> has been pre-loaded along with <code>pandas as pd</code>. It has 365 rows (observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.

You'll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.

Remember, ordinary arithmetic operators (like <code>+</code>, <code>-</code>, <code>*</code>, and <code>/</code>) broadcast scalar values to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions. Broadcasting also works with pandas Series and NumPy arrays.

**Instructions**

* Create a new DataFrame `temps_f` by extracting the columns `'Min TemperatureF'`, `'Mean TemperatureF'`, & `'Max TemperatureF'` from `weather`. To do this, pass the relevant columns as a list to `weather[]`.
* Create a new DataFrame `temps_c` from `temps_f` using the formula `(temps_f - 32) * 5/9`.
* Rename the columns of `temps_c` to replace `'F'` with `'C'` using the `.str.replace('F', 'C')` method on `temps_c.columns`.
* Print the first 5 rows of DataFrame `temps_c`. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
weather = pd.read_csv(pitts_file)
weather.set_index('Date', inplace=True)
weather.head(3)

In [None]:
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F', 'C')

# Print first 5 rows of temps_c
temps_c.head()

#### Computing percentage growth of GDP

Your job in this exercise is to compute the yearly percent-change of US GDP ([Gross Domestic Product](https://en.wikipedia.org/wiki/Gross_domestic_product)) since 2008.

The data has been obtained from the [Federal Reserve Bank of St. Louis](https://fred.stlouisfed.org/series/GDP/downloaddata) and is available in the file `GDP.csv`, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP. For a refresher on resampling, check out the relevant material from [pandas Foundations](https://campus.datacamp.com/courses/pandas-foundations/time-series-in-pandas?ex=7).

**Instructions**

* Read the file `'GDP.csv'` into a DataFrame called `gdp`.
* Use `parse_dates=True` and `index_col='DATE'`.
* Create a DataFrame `post2008` by slicing `gdp` such that it comprises all rows from 2008 onward.
* Print the last 8 rows of the slice `post2008`. This has been done for you. This data has quarterly frequency so the indices are separated by three-month intervals.
* Create the DataFrame `yearly` by resampling the slice `post2008` by year. Remember, you need to chain `.resample()` (using the alias `'A'` for annual frequency) with some kind of aggregation; you will use the aggregation method `.last()` to select the last element when resampling.
* Compute the percentage growth of the resampled DataFrame `yearly` with `.pct_change() * 100`.

In [None]:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv(gdp_usa_file, parse_dates=True, index_col='DATE')
gdp.head()

In [None]:
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':]

# Print the last 8 rows of post2008
post2008.tail(8)

In [None]:
# Resample post2008 by year, keeping last(): yearly
# ('YE' replaces the 'A' alias named in the instructions, which pandas 2.2 deprecated)
yearly = post2008.resample('YE').last()

# Print yearly
yearly

In [None]:
# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change() * 100

# Print yearly again
yearly
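
The resample-then-`pct_change()` steps can be verified on numbers small enough to check by hand. This is a sketch with hypothetical quarterly values (not real GDP figures); the column name `VALUE` and the version check for the year-end alias are assumptions made so the block runs on pandas versions both before and after 2.2.

```python
import pandas as pd

# Hypothetical quarterly values (not real GDP figures)
quarterly = pd.DataFrame(
    {'VALUE': [100.0, 102.0, 104.0, 110.0, 112.0, 115.0, 118.0, 121.0]},
    index=pd.date_range('2008-01-01', periods=8, freq='QS'),
)

# pandas 2.2 renamed the year-end alias from 'A' to 'YE'; pick whichever applies
major, minor = (int(part) for part in pd.__version__.split('.')[:2])
year_end = 'YE' if (major, minor) >= (2, 2) else 'A'

# Keep the last quarter of each year, then compute year-over-year growth in percent
yearly = quarterly.resample(year_end).last()
yearly['growth'] = yearly['VALUE'].pct_change() * 100
print(yearly)
```

The first year has no predecessor, so its growth is `NaN`; the second is (121 - 110) / 110, or 10 percent. Selecting the `VALUE` column before calling `.pct_change()` yields a Series, which makes the column assignment explicit.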

#### Converting currency of stocks

In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from [Yahoo Finance](https://finance.yahoo.com/). The files `sp500.csv` for sp500 and `exchange.csv` for the exchange rates are both provided to you.

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

**Instructions**

* Read the DataFrames `sp500` & `exchange` from the files `'sp500.csv'` & `'exchange.csv'` respectively.
* Use `parse_dates=True` and `index_col='Date'`.
* Extract the columns `'Open'` & `'Close'` from the DataFrame `sp500` as a new DataFrame `dollars` and print the first 5 rows.
* Construct a new DataFrame `pounds` by converting US dollars to British pounds. You'll use the `.multiply()` method of `dollars` with `exchange['GBP/USD']` and `axis='rows'`
* Print the first 5 rows of the new DataFrame `pounds`. This has been done for you, so hit 'Submit Answer' to see the results!
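
The conversion step relies on `.multiply()` lining up the exchange-rate Series with the rows of the price DataFrame. Here is a minimal sketch of that alignment, assuming made-up prices and rates (the names `dollars_demo`, `rate_demo`, and `pounds_demo` are hypothetical and are not taken from the course files):

```python
import pandas as pd

# Hypothetical prices and exchange rates (made-up numbers, not the course data)
dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06'])
dollars_demo = pd.DataFrame(
    {'Open': [100.0, 200.0, 300.0], 'Close': [110.0, 190.0, 310.0]},
    index=dates,
)
rate_demo = pd.Series([0.5, 0.6, 0.7], index=dates, name='GBP/USD')

# axis='rows' matches the Series index against the DataFrame's row index,
# so every column in a given row is scaled by that day's rate
pounds_demo = dollars_demo.multiply(rate_demo, axis='rows')
print(pounds_demo)
```

Without `axis='rows'`, pandas would try to match the Series index against the column labels (`'Open'`, `'Close'`) instead of the dates.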

In [None]:
# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv(sp500_file, parse_dates=True, index_col='Date')
sp500.head()

In [None]:
# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv(exch_rates_file, parse_dates=True, index_col='Date')
exchange.head()

In [None]:
# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]

# Print the head of dollars
dollars.head()

In [None]:
# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')

# Print the head of pounds
pounds.head()

In [None]:
del january, february, feb_dict, jan_dict, weather, temps_f, temps_c, gdp, post2008, yearly, sp500, exchange, dollars, pounds

## Concatenating Data

Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you'll learn how to perform database-style operations to combine DataFrames. In particular, you'll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.

### Appending & concatenating Series

#### append()

* `.append()`: Series & DataFrame method (removed in pandas 2.0; use `pd.concat()` instead)
* Invocation:
    * `s1.append(s2)`
* Stacks rows of `s2` below `s1`
* Method for Series & DataFrames

#### concat()

* `concat()`: pandas module function
* Invocation:
    * `pd.concat([s1, s2, s3])`
* Can stack row-wise or column-wise

#### concat() & .append()

* Equivalence of `concat()` & `.append()`:
    * `result1 = pd.concat([s1, s2, s3])`
    * `result2 = s1.append(s2).append(s3)`
    * `result1 == result2` elementwise

#### Series of US states

```python
import pandas as pd
northeast = pd.Series(['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'])
south = pd.Series(['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'])
midwest = pd.Series(['IL', 'IN', 'MN', 'MO', 'NE', 'ND', 'SD', 'IA', 'KS', 'MI', 'OH', 'WI'])
west = pd.Series(['AZ', 'CO', 'ID', 'MT',
```

#### Using .append()

```python
east = northeast.append(south)
print(east)
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
0 DE
1 FL
2 GA
3 MD
4 NC
5 SC
6 VA
7 DC
8 WV
9 AL
10 KY
11 MS
12 TN
13 AR
14 LA
15 OK
16 TX
dtype: object
```

#### The appended Index

```python
print(east.index)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')

print(east.loc[3])
3 NH
3 MD
dtype: object
```

#### Using .reset_index()

```python
new_east = northeast.append(south).reset_index(drop=True)
print(new_east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
9 DE
10 FL
dtype: object

print(new_east.index)
RangeIndex(start=0, stop=26, step=1)
```

#### Using concat()

```python
east = pd.concat([northeast, south])
print(east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
0 DE
1 FL
dtype: object
print(east.index)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')
```

#### Using ignore_index

```python
new_east = pd.concat([northeast, south], ignore_index=True)
print(new_east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
9 DE
10 FL
dtype: object
print(new_east.index)
RangeIndex(start=0, stop=26, step=1)
```

### Exercises

#### Appending Series with nonunique Indices

The Series `bronze` and `silver`, which have been printed in the IPython Shell, represent the 5 countries that won the most bronze and silver Olympic medals respectively between 1896 & 2008. The Indexes of both Series are called `Country` and the values are the corresponding number of medals won.

If you were to run the command `combined = bronze.append(silver)`, how many rows would `combined` have? And how many rows would `combined.loc['United States']` return? Find out for yourself by running these commands in the IPython Shell.

**Instructions**

Possible Answers
* combined has 5 rows and combined.loc['United States'] is empty (0 rows).
* <mark>combined has 10 rows and combined.loc['United States'] has 2 rows.</mark>
* combined has 6 rows and combined.loc['United States'] has 1 row.
* combined has 5 rows and combined.loc['United States'] has 2 rows.
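
The behaviour behind this question can be reproduced on a small scale. This is a sketch with made-up medal counts (the Series `bronze_demo`, `silver_demo`, and `combined_demo` are hypothetical and hold three rows each rather than five); `pd.concat()` stands in for the removed `.append()`:

```python
import pandas as pd

# Hypothetical medal counts; the real values live in the course CSV files
bronze_demo = pd.Series(
    [10, 8, 6],
    index=pd.Index(['United States', 'France', 'Italy'], name='Country'),
)
silver_demo = pd.Series(
    [12, 9, 7],
    index=pd.Index(['United States', 'Germany', 'Italy'], name='Country'),
)

# Stacking keeps every input row, even when index labels repeat
combined_demo = pd.concat([bronze_demo, silver_demo])

print(len(combined_demo))                  # one row per input row
print(combined_demo.index.is_unique)       # repeated labels are allowed
print(combined_demo.loc['United States'])  # every row carrying that label
```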

In [None]:
bronze = pd.read_csv(so_bronze5_file, index_col=0)
bronze

In [None]:
silver = pd.read_csv(so_silver5_file, index_col=0)
silver

In [None]:
combined = pd.concat([bronze, silver])
combined

In [None]:
combined.loc['United States']

#### Appending pandas Series

In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the `'Units'` column from each and append them together with method chaining using `.append()`.

To check that the stacking worked, you'll print slices from these Series, and finally, you'll add the result to figure out the total units sold in the first quarter.

**Instructions**

* Read the files `'sales-jan-2015.csv'`, `'sales-feb-2015.csv'` and `'sales-mar-2015.csv'` into the DataFrames `jan`, `feb`, and `mar` respectively.
* Use `parse_dates=True` and `index_col='Date'`.
* Extract the `'Units'` column of `jan`, `feb`, and `mar` to create the Series `jan_units`, `feb_units`, and `mar_units` respectively.
* Construct the Series `quarter1` by appending `feb_units` to `jan_units` and then appending `mar_units` to the result. Use chained calls to the `.append()` method to do this.
* Verify that `quarter1` has the individual Series stacked vertically. To do this:
    * Print the slice containing rows from `jan 27, 2015` to `feb 2, 2015`.
    * Print the slice containing rows from `feb 26, 2015` to `mar 7, 2015`.
* Compute and print the total number of units sold from the Series `quarter1`. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv(sales_jan_2015_file, parse_dates=True, index_col='Date')

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv(sales_feb_2015_file, parse_dates=True, index_col='Date')

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv(sales_mar_2015_file, parse_dates=True, index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = pd.concat([jan_units, feb_units, mar_units])

# Print the first slice from quarter1
display(quarter1.sort_index().loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
display(quarter1.sort_index().loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
display(quarter1.sum())

#### Concatenating pandas Series along row axis

Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. This time, the DataFrames `jan`, `feb`, and `mar` have been pre-loaded.

Your job is to use `pd.concat()` with a list of Series to achieve the same result that you would get by chaining calls to `.append()`.

You may be wondering about the difference between `pd.concat()` and pandas' `.append()` method. One way to think of the difference is that `.append()` is a specific case of a concatenation, while `pd.concat()` gives you more flexibility, as you'll see in later exercises.
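
To make the "specific case" idea concrete, here is a small sketch. The helper `append_like()` is hypothetical (it is not part of pandas) and only imitates what the removed `.append()` method did for two objects; the Series `s1`, `s2`, and `s3` are made-up:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['c', 'd'])
s3 = pd.Series([5, 6], index=['e', 'f'])


def append_like(first, second):
    """Hypothetical stand-in for the removed .append(): stack `second` below `first`."""
    return pd.concat([first, second])


chained = append_like(append_like(s1, s2), s3)  # mirrors s1.append(s2).append(s3)
at_once = pd.concat([s1, s2, s3])               # one call over the whole list

print(chained.equals(at_once))
```

The single `pd.concat()` call also accepts options such as `axis`, `keys`, and `join`, which is where the extra flexibility comes from.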

**Instructions**

* Create an empty list called `units`. This has been done for you.
* Use a `for` loop to iterate over `[jan, feb, mar]`:
    * In each iteration of the loop, append the `'Units'` column of each DataFrame to `units`.
* Concatenate the Series contained in the list `units` into a longer Series called `quarter1` using `pd.concat()`.
    * Specify the keyword argument `axis='rows'` to stack the Series vertically.
* Verify that `quarter1` has the individual Series stacked vertically by printing slices. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month.Units)

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')

# Print slices from quarter1
print(quarter1.sort_index().loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.sort_index().loc['feb 26, 2015':'mar 7, 2015'])

In [None]:
del bronze, silver, combined, jan, feb, mar, jan_units, feb_units, mar_units, quarter1

### Appending & concatenating DataFrames

#### Loading population data

```python
import pandas as pd
pop1 = pd.read_csv('population_01.csv', index_col=0)
pop2 = pd.read_csv('population_02.csv', index_col=0)
```

In [None]:
pop1_data = {'Zip Code ZCTA': [66407, 72732, 50579, 46421], '2010 Census Population': [479, 4716, 2405, 30670]}
pop2_data = {'Zip Code ZCTA': [12776, 76092, 98360, 49464], '2010 Census Population': [2180, 26669, 12221, 27481]}

pop1 = pd.DataFrame.from_dict(pop1_data)
pop1.set_index('Zip Code ZCTA', drop=True, inplace=True)
pop2 = pd.DataFrame.from_dict(pop2_data)
pop2.set_index('Zip Code ZCTA', drop=True, inplace=True)

#### Examining population data

In [None]:
pop1

In [None]:
pop2

In [None]:
print(type(pop1), pop1.shape)
print(type(pop2), pop2.shape)

#### Appending population DataFrames

In [None]:
pd.concat([pop1, pop2])

In [None]:
print(pop1.index.name, pop1.columns)
print(pop2.index.name, pop2.columns)

#### Population & unemployment data

```python
population = pd.read_csv('population_00.csv', index_col=0)
unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
```

In [None]:
pop_data = {'Zip Code ZCTA': [57538, 59916, 37660, 2860], '2010 Census Population': [322, 130, 40038, 45199]}
emp_data = {'Zip': [2860, 46167, 1097, 80808], 'unemployment': [0.11, 0.02, 0.33, 0.07], 'participants': [34447, 4800, 42, 4310]}

population = pd.DataFrame.from_dict(pop_data)
population.set_index('Zip Code ZCTA', drop=True, inplace=True)
unemployment = pd.DataFrame.from_dict(emp_data)
unemployment.set_index('Zip', drop=True, inplace=True)

In [None]:
population

In [None]:
unemployment

#### Appending population & unemployment

In [None]:
pd.concat([population, unemployment], sort=True)

#### Repeated index labels

In [None]:
pd.concat([population, unemployment], sort=True)

#### Concatenating rows

* with `axis=0`, `pd.concat` is the same as `population.append(unemployment, sort=True)`

In [None]:
pd.concat([population, unemployment], axis=0, sort=True)

#### Concatenating columns

* outer join

In [None]:
pd.concat([population, unemployment], axis=1, sort=True)

In [None]:
del pop1_data, pop2_data, pop1, pop2, pop_data, emp_data, population, unemployment

### Exercises

#### Appending DataFrames with ignore_index

In this exercise, you'll use the [Baby Names Dataset](https://www.data.gov/developers/baby-names-dataset/) (from [data.gov](https://data.gov/)) again. This time, both DataFrames `names_1981` and `names_1881` are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame `.append()` method to make a DataFrame `combined_names`. To distinguish rows from the original two DataFrames, you'll add a `'year'` column to each with the year (1881 or 1981 in this case). In addition, you'll specify `ignore_index=True` so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled `0, 1, ..., n-1`, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
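
The effect of `ignore_index=True` is easiest to see on a tiny example. This is a sketch with two hypothetical frames (`early` and `late`, with made-up counts) standing in for `names_1881` and `names_1981`, using `pd.concat()` in place of the removed `.append()`:

```python
import pandas as pd

# Two tiny hypothetical frames standing in for names_1881 and names_1981
early = pd.DataFrame({'name': ['Mary', 'Anna'], 'count': [10, 5]})
late = pd.DataFrame({'name': ['Morgan', 'Mary'], 'count': [7, 3]})

early['year'] = 1881  # a scalar is broadcast to every row
late['year'] = 1981

kept = pd.concat([early, late])                           # labels 0, 1, 0, 1
renumbered = pd.concat([early, late], ignore_index=True)  # labels 0, 1, 2, 3

print(list(kept.index))
print(list(renumbered.index))

# The 'year' column still tells the two sources apart after renumbering
print(renumbered.loc[renumbered['name'] == 'Mary'])
```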

**Instructions**

* Create a `'year'` column in the DataFrames `names_1881` and `names_1981`, with values of `1881` and `1981` respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
* Create a new DataFrame called `combined_names` by appending the rows of `names_1981` underneath the rows of `names_1881`. Specify the keyword argument `ignore_index=True` to make a new RangeIndex of unique integers for each row.
* Print the shapes of all three DataFrames. This has been done for you.
* Extract all rows from `combined_names` that have the name `'Morgan'`. To do this, use the `.loc[]` accessor with an appropriate filter. The relevant column of `combined_names` here is `'name'`.

In [None]:
names_1881 = pd.read_csv(baby_1881_file, header=None, names=['name', 'gender', 'count'])
names_1981 = pd.read_csv(baby_1981_file, header=None, names=['name', 'gender', 'count'])

In [None]:
names_1981.head()

In [None]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = pd.concat([names_1881, names_1981], ignore_index=True, sort=False)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

# Print all rows that contain the name 'Morgan'
combined_names.loc[combined_names['name'] == 'Morgan']

#### Concatenating pandas DataFrames along column axis

The function `pd.concat()` can concatenate DataFrames *horizontally* as well as *vertically* (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument `axis=1` or `axis='columns'`.

In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the rows of both and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an outer join (which you will explore in more detail in later exercises).

The files `'quarterly_max_temp.csv'` and `'monthly_mean_temp.csv'` have been pre-loaded into the DataFrames `weather_max` and `weather_mean` respectively, and `pandas` has been imported as `pd`.

**Instructions**

* Create a new DataFrame called `weather` by concatenating the DataFrames `weather_max` and `weather_mean` *horizontally*.
    * Pass the DataFrames to `pd.concat()` as a list and specify the keyword argument `axis=1` to stack them horizontally.
* Print the new DataFrame `weather`.

In [None]:
weather_mean_data = {'Mean TemperatureF': [53.1, 70., 34.93548387, 28.71428571, 32.35483871, 72.87096774, 70.13333333, 35., 62.61290323, 39.8, 55.4516129 , 63.76666667],
                     'Month': ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep']}
weather_max_data = {'Max TemperatureF': [68, 89, 91, 84], 'Month': ['Jan', 'Apr', 'Jul', 'Oct']}

weather_mean = pd.DataFrame.from_dict(weather_mean_data)
weather_mean.set_index('Month', inplace=True, drop=True)
weather_max = pd.DataFrame.from_dict(weather_max_data)
weather_max.set_index('Month', inplace=True, drop=True)

In [None]:
weather_max

In [None]:
weather_mean

In [None]:
# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max, weather_mean], axis=1, sort=True)

# Print weather
weather

#### Reading multiple files to build a DataFrame

It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.
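
As a rough sketch of the pattern, the loop below reads several "files" and concatenates them in one call. To keep it self-contained, the files are simulated with in-memory strings via `io.StringIO`; the dictionary `fake_files` and all of its numbers are made-up assumptions, not the Guardian data:

```python
import io

import pandas as pd

# In-memory stand-ins for CSV files; contents are made up for illustration
fake_files = {
    'bronze': "Country,Total\nUnited States,3\nFrance,2\n",
    'silver': "Country,Total\nUnited States,4\nItaly,1\n",
    'gold': "Country,Total\nUnited States,5\nFrance,6\n",
}

frames = []
for medal, text in fake_files.items():
    # header=0 discards the file's own header row so names= can relabel the columns
    frame = pd.read_csv(
        io.StringIO(text), header=0, index_col='Country', names=['Country', medal]
    )
    frames.append(frame)

# One concat call at the end, rather than growing a DataFrame inside the loop
table = pd.concat(frames, axis='columns')
print(table)
```

Collecting the pieces in a list and concatenating once scales to many files, since each additional file only adds one more element to the list.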

Here, you'll work with DataFrames compiled from [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

`pandas` has been imported as `pd` and two lists have been pre-loaded: An empty list called `medals`, and `medal_types`, which contains the strings `'bronze'`, `'silver'`, and `'gold'`.

**Instructions**

* Iterate over `medal_types` in the `for` loop.
* Inside the `for` loop:
    * Create `file_name` using string interpolation with the loop variable `medal`. This has been done for you. The expression `"%s_top5.csv" % medal` evaluates as a string with the value of `medal` replacing `%s` in the format string.
    * Create the list of column names called `columns`. This has been done for you.
    * Read `file_name` into a DataFrame called `medal_df`. Specify the keyword arguments `header=0`, `index_col='Country'`, and `names=columns` to get the correct row and column Indexes.
    * Append `medal_df` to `medals` using the list `.append()` method.
* Concatenate the list of DataFrames `medals` horizontally (using `axis='columns'`) to create a single DataFrame called `medals`. Print it in its entirety.

In [None]:
top_five = data.glob('*_top5.csv')
for file in top_five:
    print(file)

In [None]:
medal_types = ['bronze', 'silver', 'gold']
medal_list = list()

for medal in medal_types:

    # Create the file name: file_name
    file_name = data / f'summer_olympics_{medal}_top5.csv'
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medal_list
    medal_list.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medal_list, axis='columns', sort=True)

# Print medals
medals

In [None]:
del names_1881, names_1981, combined_names, weather_mean_data, weather_max_data, weather_mean, weather_max, weather, top_five, medals, medal_list

### Concatenation, keys & MultiIndexes

#### Loading rainfall data

```python
import pandas as pd
file1 = 'q1_rainfall_2013.csv'
rain2013 = pd.read_csv(file1, index_col='Month', parse_dates=True)
file2 = 'q1_rainfall_2014.csv'
rain2014 = pd.read_csv(file2, index_col='Month', parse_dates=True)
```

In [None]:
rain_2013_data = {'Month': ['Jan', 'Feb', 'Mar'], 'Precipitation': [0.096129, 0.067143, 0.061613]}
rain_2014_data = {'Month': ['Jan', 'Feb', 'Mar'], 'Precipitation': [0.050323, 0.082143, 0.070968]}

rain2013 = pd.DataFrame.from_dict(rain_2013_data)
rain2013.set_index('Month', inplace=True)
rain2014 = pd.DataFrame.from_dict(rain_2014_data)
rain2014.set_index('Month', inplace=True)

#### Examining rainfall data

In [None]:
rain2013

In [None]:
rain2014

#### Concatenating rows

In [None]:
pd.concat([rain2013, rain2014], axis=0)

#### Using multi-index on rows

In [None]:
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)
rain1314

#### Accessing a multi-index

In [None]:
rain1314.loc[2014]

#### Concatenating columns

In [None]:
rain1314 = pd.concat([rain2013, rain2014], axis='columns')
rain1314

#### Using a multi-index on columns

In [None]:
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis='columns')
rain1314

In [None]:
rain1314[2013]

#### pd.concat() with dict

In [None]:
rain_dict = {2013: rain2013, 2014: rain2014}
rain1314 = pd.concat(rain_dict, axis='columns')
rain1314

In [None]:
del rain_2013_data, rain_2014_data, rain2013, rain2014, rain1314

### Exercises

#### Concatenating vertically to get MultiIndexed rows

When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the `keys` parameter in the call to `pd.concat()`, which generates a hierarchical index with the labels from `keys` as the outermost index label. So you don't have to rename the columns of each DataFrame as you load it. Instead, only the Index column needs to be specified.
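
A minimal sketch of the `keys` parameter, assuming two small hypothetical DataFrames (`bronze_demo` and `silver_demo`, with made-up totals) that share the column name `'Total'`:

```python
import pandas as pd

bronze_demo = pd.DataFrame(
    {'Total': [3, 2]},
    index=pd.Index(['United States', 'France'], name='Country'),
)
silver_demo = pd.DataFrame(
    {'Total': [4, 1]},
    index=pd.Index(['United States', 'Italy'], name='Country'),
)

# keys become the outer level of the row MultiIndex, one key per input frame
stacked = pd.concat([bronze_demo, silver_demo], keys=['bronze', 'silver'])

print(stacked)
print(stacked.index.nlevels)   # two levels: medal type, then Country
print(stacked.loc['silver'])   # selecting on the outer level returns one source frame
```

Because the outer level records the source, both frames can keep the same column name without any renaming.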

Here, you'll continue working with DataFrames compiled from [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data). Once again, `pandas` has been imported as `pd` and two lists have been pre-loaded: An empty list called `medals`, and `medal_types`, which contains the strings `'bronze'`, `'silver'`, and `'gold'`.

**Instructions**

* Within the `for` loop:
    * Read `file_name` into a DataFrame called `medal_df`. Specify the index to be `'Country'`.
    * Append `medal_df` to `medals`.
* Concatenate the list of DataFrames `medals` into a single DataFrame called `medals`. Be sure to use the keyword argument `keys=['bronze', 'silver', 'gold']` to create a vertically stacked DataFrame with a MultiIndex.
* Print the new DataFrame `medals`. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
medal_types = ['bronze', 'silver', 'gold']
medal_list = list()

for medal in medal_types:

    # Create the file name: file_name
    file_name = data / f'summer_olympics_{medal}_top5.csv'
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col='Country')
    
    # Append medal_df to medals
    medal_list.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'])

# Print medals in entirety
print(medals)

#### Slicing MultiIndexed DataFrames

This exercise picks up where the last ended (again using [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data)).

You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use the `pd.IndexSlice` to extract specific slices. Check out [this exercise](https://campus.datacamp.com/courses/manipulating-dataframes-with-pandas/advanced-indexing?ex=10) from Manipulating DataFrames with pandas to refresh your memory on how to deal with MultiIndexed DataFrames.

`pandas` has been imported for you as `pd` and the DataFrame `medals` is already in your namespace.

**Instructions**

* Create a new DataFrame `medals_sorted` with the entries of `medals` sorted. Use `.sort_index(level=0)` to ensure the Index is sorted suitably.
* Print the number of bronze medals won by Germany and all of the silver medal data. This has been done for you.
* Create an alias for `pd.IndexSlice` called `idx`. A slicer `pd.IndexSlice` is required when slicing on the *inner* level of a MultiIndex.
* Slice all the data on medals won by the United Kingdom. To do this, use the `.loc[]` accessor with `idx[:,'United Kingdom'], :`.
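
The sorting and slicing steps can be tried on a small hypothetical frame first. In this sketch, `medals_demo` and its totals are made-up; only `sort_index()`, `.loc[]`, and `pd.IndexSlice` come from pandas:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('gold', 'United Kingdom'),
    ('bronze', 'Germany'),
    ('silver', 'United Kingdom'),
    ('bronze', 'United Kingdom'),
])
medals_demo = pd.DataFrame({'Total': [5, 2, 4, 3]}, index=index)

# Sort so that label-based slicing on the MultiIndex is well defined
medals_demo_sorted = medals_demo.sort_index(level=0)

# A tuple addresses one row: outer label first, then inner label
print(medals_demo_sorted.loc[('bronze', 'Germany')])

# IndexSlice lets ':' stand for "every outer label" while fixing the inner label
idx = pd.IndexSlice
uk = medals_demo_sorted.loc[idx[:, 'United Kingdom'], :]
print(uk)
```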

In [None]:
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])

In [None]:
# Print data about silver medals
print(medals_sorted.loc['silver'])

In [None]:
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
medals_sorted.loc[idx[:, 'United Kingdom'], :]

#### Concatenating horizontally to get MultiIndexed columns

It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise, you'll start with pandas imported and a list of three DataFrames called `dataframes`. All three DataFrames contain `'Company'`, `'Product'`, and `'Units'` columns with a `'Date'` column as the index pertaining to sales transactions during the month of February, 2015. The first DataFrame describes `Hardware` transactions, the second describes `Software` transactions, and the third, `Service` transactions.

Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns. From there, you can summarize the resulting DataFrame and slice some information from it.

**Instructions**

* Construct a new DataFrame `february` with MultiIndexed columns by concatenating the list `dataframes`.
* Use `axis=1` to stack the DataFrames horizontally and the keyword argument `keys=['Hardware', 'Software', 'Service']` to construct a hierarchical Index from each DataFrame.
* Print summary information from the new DataFrame `february` using the `.info()` method. This has been done for you.
* Create an alias called `idx` for `pd.IndexSlice`.
* Extract a slice called `slice_2_8` from `february` (using `.loc[]` & `idx`) that comprises rows between Feb. 2, 2015 to Feb. 8, 2015 from columns under `'Company'`.
* Print the `slice_2_8`. This has been done for you, so hit 'Submit Answer' to see the sliced data!
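
Here is a small sketch of the same idea with two hypothetical DataFrames instead of three. The company names, unit counts, and variable names (`hardware`, `software`, `both`, `window`) are made-up assumptions:

```python
import pandas as pd

dates = pd.to_datetime(['2015-02-02', '2015-02-05', '2015-02-09'])
hardware = pd.DataFrame(
    {'Company': ['Acme', 'Hooli', 'Acme'], 'Units': [3, 9, 1]}, index=dates
)
software = pd.DataFrame(
    {'Company': ['Hooli', 'Initech', 'Acme'], 'Units': [7, 2, 5]}, index=dates
)

# With axis=1, keys become the outer level of the *column* MultiIndex
both = pd.concat([hardware, software], axis=1, keys=['Hardware', 'Software'])

# Rows: a date range; columns: every outer key, inner label 'Company'
idx = pd.IndexSlice
window = both.loc['2015-02-02':'2015-02-08', idx[:, 'Company']]
print(window)
```

The date-range slice depends on the index holding real timestamps, which is why the sketch builds it with `pd.to_datetime()`.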


In [None]:
hw = pd.read_csv(sales_feb_hardware_file, parse_dates=True, index_col='Date')
sw = pd.read_csv(sales_feb_software_file, parse_dates=True, index_col='Date')
sv = pd.read_csv(sales_feb_service_file, parse_dates=True, index_col='Date')

dataframes = [hw, sw, sv]
dataframes

In [None]:
# Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'], sort=True)

# Print february.info()
february.info()

In [None]:
february

In [None]:
# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['2015-02-02':'2015-02-08', idx[:, 'Company']]

# Print slice_2_8
slice_2_8

#### Concatenating DataFrames from a dict

You're now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames `jan`, `feb`, and `mar` have been pre-loaded for you. Your task is to aggregate the sum of all sales over the `'Company'` column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then concatenating them.
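
A compact sketch of concatenating from a dictionary, assuming two made-up monthly frames (`jan_demo` and `feb_demo` are hypothetical and contain only `'Company'` and `'Units'` columns):

```python
import pandas as pd

jan_demo = pd.DataFrame({'Company': ['Acme', 'Mediacore', 'Acme'], 'Units': [1, 2, 3]})
feb_demo = pd.DataFrame({'Company': ['Mediacore', 'Hooli'], 'Units': [4, 5]})

# Aggregate each month, storing the result under the month's name
by_month = {}
for label, frame in [('january', jan_demo), ('february', feb_demo)]:
    by_month[label] = frame.groupby('Company').sum()

# Passing a dict to pd.concat uses its keys as the outer index level
sales_demo = pd.concat(by_month)
print(sales_demo)

idx = pd.IndexSlice
print(sales_demo.loc[idx[:, 'Mediacore'], :])
```

This gives the same kind of result as passing a list together with `keys=[...]`, with the labels and the data kept side by side in one object.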

**Instructions**

* Create a list called `month_list` consisting of the tuples `('january', jan)`, `('february', feb)`, and `('march', mar)`.
* Create an empty dictionary called `month_dict`.
* Inside the `for` loop:
    * Group `month_data` by `'Company'` and use `.sum()` to aggregate.
* Construct a new DataFrame called `sales` by concatenating the DataFrames stored in `month_dict`.
* Create an alias for `pd.IndexSlice` and print all sales by `'Mediacore'`. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
jan = pd.read_csv(sales_jan_2015_file)
feb = pd.read_csv(sales_feb_2015_file)
mar = pd.read_csv(sales_mar_2015_file)

In [None]:
mar

In [None]:
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = dict()

for month_name, month_data in month_list:

    # Group month_data by 'Company' and sum the numeric columns: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum(numeric_only=True)

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
display(sales)

# Print all sales by Mediacore
idx = pd.IndexSlice
display(sales.loc[idx[:, 'Mediacore'], :])

In [None]:
del medal_types, medal_list, medal_df, medals, medals_sorted, idx, hw, sw, sv, dataframes, february, slice_2_8

### Outer & inner joins

#### Using with arrays

In [None]:
A = np.arange(8).reshape(2, 4) + 0.1
A

In [None]:
B = np.arange(6).reshape(2,3) + 0.2
B

In [None]:
C = np.arange(12).reshape(3,4) + 0.3
C

#### Stacking arrays horizontally

In [None]:
np.hstack([B, A])

In [None]:
np.concatenate([B, A], axis=1)

#### Stacking arrays vertically

In [None]:
np.vstack([A, C])

In [None]:
np.concatenate([A, C], axis=0)

#### Incompatible array dimensions

In [None]:
np.concatenate([A, B], axis=0) # incompatible columns

In [None]:
np.concatenate([A, C], axis=1) # incompatible rows
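
Both calls in this subsection fail because the arrays disagree on the dimension that is *not* being concatenated. As a sketch, the check can be wrapped so that a notebook keeps running; the helper `can_concatenate()` is hypothetical and not part of NumPy:

```python
import numpy as np

A = np.arange(8).reshape(2, 4) + 0.1  # 2 rows, 4 columns
B = np.arange(6).reshape(2, 3) + 0.2  # 2 rows, 3 columns


def can_concatenate(arrays, axis):
    """Hypothetical helper: report whether np.concatenate accepts these shapes."""
    try:
        np.concatenate(arrays, axis=axis)
    except ValueError:
        return False
    return True


print(can_concatenate([A, B], axis=1))  # row counts match, so side by side works
print(can_concatenate([A, B], axis=0))  # column counts differ, so stacking fails
```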

#### Population & unemployment data

```python
population = pd.read_csv('population_00.csv', index_col=0)

unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
print(population)
               2010 Census Population
Zip Code ZCTA
57538                             322
59916                             130
37660                           40038
2860                            45199

print(unemployment)
       unemployment  participants
Zip
2860           0.11         34447
46167          0.02          4800
1097           0.33            42
80808          0.07          4310
```

#### Converting to arrays

```python
population_array = np.array(population)
print(population_array) # Index info is lost
[[  322]
 [  130]
 [40038]
 [45199]]

unemployment_array = np.array(unemployment)
print(unemployment_array)
[[ 1.10000000e-01 3.44470000e+04]
 [ 2.00000000e-02 4.80000000e+03]
 [ 3.30000000e-01 4.20000000e+01]
 [ 7.00000000e-02 4.31000000e+03]]
```

#### Manipulating data as arrays

```python
print(np.concatenate([population_array, unemployment_array], axis=1))
[[ 3.22000000e+02 1.10000000e-01 3.44470000e+04]
 [ 1.30000000e+02 2.00000000e-02 4.80000000e+03]
 [ 4.00380000e+04 3.30000000e-01 4.20000000e+01]
 [ 4.51990000e+04 7.00000000e-02 4.31000000e+03]]
```

#### Joins

* Joining tables: Combining rows of multiple tables
* Outer join
    * Union of index sets (all labels, no repetition)
    * Missing fields filled with NaN
    * Preserves the indices in the original tables, filling null values for missing rows
    * Has all the indices of the original tables without repetition (like a set union)
* Inner join
    * Intersection of index sets (only common labels)
    * Has only labels common to both tables (like a set intersection)
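
The set analogy can be checked directly. This sketch uses a cut-down, two-row version of the population and unemployment tables (the frame names `left` and `right` and the shortened column name `'population'` are assumptions made for brevity):

```python
import pandas as pd

left = pd.DataFrame({'population': [322, 45199]}, index=[57538, 2860])
right = pd.DataFrame({'unemployment': [0.11, 0.02]}, index=[2860, 46167])

outer = pd.concat([left, right], axis=1, join='outer')
inner = pd.concat([left, right], axis=1, join='inner')

# Outer join: the row labels are the union of both indexes
print(set(outer.index) == set(left.index) | set(right.index))

# Inner join: the row labels are the intersection of both indexes
print(set(inner.index) == set(left.index) & set(right.index))

# Labels found in only one table leave NaN in the other table's columns
print(outer.isna().sum().sum())
```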

#### Concatenation & inner join

* only the row label present in both DataFrames is preserved

```python
pd.concat([population, unemployment], axis=1, join='inner')

      2010 Census Population  unemployment  participants
2860                   45199          0.11         34447
```

#### Concatenation & outer join

* All row indices from the original two indexes exist in the joined DataFrame index.
* When a row occurs in one DataFrame, but not in the other, the missing column entries are filled with null values

```python
pd.concat([population, unemployment], axis=1, join='outer')

       2010 Census Population  unemployment  participants
1097                      NaN          0.33          42.0
2860                  45199.0          0.11       34447.0
37660                 40038.0           NaN           NaN
46167                     NaN          0.02        4800.0
57538                   322.0           NaN           NaN
59916                   130.0           NaN           NaN
80808                     NaN          0.07        4310.0
```

#### Inner join on other axis

* The resulting DataFrame is empty because no column index label appears in both `population` and `unemployment`.

```python
pd.concat([population, unemployment], join='inner', axis=0)

Empty DataFrame
Columns: []
Index: [2860, 46167, 1097, 80808, 57538, 59916, 37660, 2860]
```

### Exercises

#### Concatenating DataFrames with inner join

Here, you'll continue working with DataFrames compiled from [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

The DataFrames `bronze`, `silver`, and `gold` have been pre-loaded for you.

Your task is to compute an inner join.

**Instructions**

* Construct a list of DataFrames called `medal_list` with entries `bronze`, `silver`, and `gold`.
* Concatenate `medal_list` horizontally with an inner join to create `medals`.
    * Use the keyword argument `keys=['bronze', 'silver', 'gold']` to yield suitable hierarchical indexing.
    * Use `axis=1` to get horizontal concatenation.
    * Use `join='inner'` to keep only rows that share common index labels.
* Print the new DataFrame `medals`.

In [None]:
bronze = pd.read_csv(so_bronze5_file)
silver = pd.read_csv(so_silver_file)
gold = pd.read_csv(so_gold_file)

In [None]:
# Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')

# Print medals
medals

#### Resampling & concatenating DataFrames with inner join

In this exercise, you'll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China starts in 1961 and is recorded annually.

You'll need to use a combination of resampling and an inner join to align the index labels. You'll need an appropriate [offset alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for resampling, and the method `.resample()` must be chained with some kind of aggregation method (`.pct_change()` and `.last()` in this case).
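
The alignment step can be sketched separately from the resampling. The example below assumes two made-up annual series that double every year, so the 10-year change is easy to verify by hand. The helper `year_ends()` is hypothetical and builds year-end dates directly, which sidesteps the choice of offset alias:

```python
import pandas as pd


def year_ends(first, last):
    """Hypothetical helper: one December 31 timestamp per year, inclusive."""
    return pd.to_datetime([f'{year}-12-31' for year in range(first, last + 1)])


# Made-up annual series with different start years (values double each year)
shorter = pd.DataFrame({'A': [2.0 ** i for i in range(12)]}, index=year_ends(1961, 1972))
longer = pd.DataFrame({'B': [2.0 ** i for i in range(15)]}, index=year_ends(1958, 1972))

# The first 10 rows of each result have no value 10 years earlier, so they are dropped
shorter_growth = shorter.pct_change(10).dropna()
longer_growth = longer.pct_change(10).dropna()

# The inner join keeps only the dates present in both results
aligned = pd.concat([shorter_growth, longer_growth], join='inner', axis=1)
print(aligned)
```

Because the two series start in different years, they have different numbers of valid rows after `.dropna()`; the inner join trims both to the overlapping dates.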

`pandas` has been imported as `pd`, and the DataFrames `china` and `us` have been pre-loaded, with the output of `china.head()` and `us.head()` printed in the IPython Shell.

**Instructions**

* Make a new DataFrame `china_annual` by resampling the DataFrame `china` with `.resample('A').last()` (i.e., with *annual* frequency) and chaining two method calls:
    * Chain `.pct_change(10)` as an aggregation method to compute the percentage change with an offset of ten years.
    * Chain `.dropna()` to eliminate rows containing null values.
* Make a new DataFrame `us_annual` by resampling the DataFrame `us` exactly as you resampled `china`.
* Concatenate `china_annual` and `us_annual` to construct a DataFrame called `gdp`. Use `join='inner'` to perform an *inner* join and use `axis=1` to concatenate *horizontally*.
* Print the result of resampling `gdp` every decade (i.e., using `.resample('10A')`) and aggregating with the method `.last()`. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
china = pd.read_csv(gdp_china_file, parse_dates=['Year'])
china.rename(columns={'GDP': 'China'}, inplace=True)
china.set_index('Year', inplace=True)

us = pd.read_csv(gdp_usa_file, parse_dates=['DATE'])
us.rename(columns={'DATE': 'Year', 'VALUE': 'US'}, inplace=True)
us.set_index('Year', inplace=True)

In [None]:
china.head()

In [None]:
us.head()

In [None]:
# Resample and tidy china: china_annual
china_annual = china.resample('YE').last().pct_change(10).dropna()
china_annual.head()

In [None]:
# Resample and tidy us: us_annual
us_annual = us.resample('YE').last().pct_change(10).dropna()
us_annual.head()

In [None]:
# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], join='inner', axis=1)

# Resample gdp and print
gdp.resample('10YE').last()
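
The sketch below isolates the `.pct_change(n).dropna()` and inner-join steps on made-up annual data. It skips the resampling step, uses an offset of 2 rather than 10 to stay small, and every name and value in it is hypothetical.

```python
import pandas as pd

# Hypothetical annual values; each series grows 10% per year
idx_a = pd.to_datetime(['2000-12-31', '2001-12-31', '2002-12-31', '2003-12-31'])
idx_b = pd.to_datetime(['2001-12-31', '2002-12-31', '2003-12-31', '2004-12-31'])
a = pd.DataFrame({'A': [100.0, 110.0, 121.0, 133.1]}, index=idx_a)
b = pd.DataFrame({'B': [50.0, 55.0, 60.5, 66.55]}, index=idx_b)

# Percentage change relative to the value two rows earlier; the first two rows
# have nothing to compare against, so they are NaN and dropna() removes them
a_growth = a.pct_change(2).dropna()  # rows for 2002 and 2003
b_growth = b.pct_change(2).dropna()  # rows for 2003 and 2004

# The inner join keeps only the dates present in both results
both = pd.concat([a_growth, b_growth], axis=1, join='inner')
print(both)
```

Here only the 2003 row is common to both growth tables, which mirrors how the inner join above trims `gdp` to the years covered by both countries.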

In [None]:
del bronze, silver, gold, medal_list, medals, china, us, china_annual, us_annual, gdp

## Merging Data

Here, you'll learn all about merging pandas DataFrames. You'll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You'll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.

* `merge()` extends `concat()` with the ability to align rows using multiple columns

### Merging DataFrames

In [None]:
pa_zipcode_population = {'Zipcode': [16855, 15681, 18657, 17307, 15635],
                         '2010 Census Population': [282, 5241, 11985, 5899, 220]}
pa_zipcode_city = {'Zipcode': [17545,18455, 17307, 15705, 16833, 16220, 18618, 16855, 16623, 15635, 15681, 18657, 15279, 17231, 18821],
                   'City': ['MANHEIM', 'PRESTON PARK', 'BIGLERVILLE', 'INDIANA', 'CURWENSVILLE', 'CROWN', 'HARVEYS LAKE', 'MINERAL SPRINGS',
                            'CASSVILLE', 'HANNASTOWN', 'SALTSBURG', 'TUNKHANNOCK', 'PITTSBURG', 'LEMASTERS', 'GREAT BEND'],
                   'State': ['PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA']}

#### Population DataFrame

In [None]:
population = pd.DataFrame.from_dict(pa_zipcode_population)
population

#### Cities DataFrame

In [None]:
cities = pd.DataFrame.from_dict(pa_zipcode_city)
cities

#### Merging

* `pd.merge()` computes a merge on ALL columns that occur in both DataFrames
    * in the following case, the common column is **Zipcode**
    * for any row in which the Zipcode entry in cities matches a row in population, a new row is made in the merged DataFrame.
    * by default, this is an inner join
        * it's an inner join because it glues together only rows that match in the joining columns of **BOTH** DataFrames

In [None]:
pd.merge(population, cities)

#### Medal DataFrames

In [None]:
bronze = pd.read_csv(so_bronze_file)
bronze.head()

In [None]:
len(bronze)

In [None]:
gold = pd.read_csv(so_gold_file)
gold.head()

In [None]:
len(gold)

#### Merging all columns

* by default, `pd.merge()` uses all columns common to both DataFrames to merge
* the rows of the merged DataFrame consist of all rows where the **NOC**, **Country**, and **Totals** columns are identical in both DataFrames

In [None]:
so_merge = pd.merge(bronze, gold)
so_merge.head()

In [None]:
len(so_merge)

In [None]:
so_merge.columns

In [None]:
so_merge.index
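
To make the "all common columns" default concrete, here is a minimal sketch on hypothetical data. It assumes two small frames that share the columns `NOC` and `Total`; the values are invented.

```python
import pandas as pd

# Hypothetical frames sharing the columns 'NOC' and 'Total'
left = pd.DataFrame({'NOC': ['USA', 'GBR', 'FRA'], 'Total': [5, 7, 9]})
right = pd.DataFrame({'NOC': ['USA', 'GBR', 'FRA'], 'Total': [5, 8, 9]})

default_merge = pd.merge(left, right)                        # keys inferred: every shared column
explicit_merge = pd.merge(left, right, on=['NOC', 'Total'])  # the same keys, spelled out

print(default_merge)
print(default_merge.equals(explicit_merge))
```

`GBR` is dropped because its `Total` differs between the two frames, even though the `NOC` code matches.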

#### Merging on

In [None]:
so_merge = pd.merge(bronze, gold, on='NOC')
so_merge.head()

In [None]:
len(so_merge)

#### Merging on multiple columns

* this is where merging extends concatenation, by allowing matching on multiple columns

In [None]:
so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'])
so_merge.head()

#### Using suffixes

In [None]:
so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'])
so_merge.head()
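
A small sketch of how `suffixes` changes the names of overlapping non-key columns. The frames and values are hypothetical; a tuple is used for `suffixes` here, although the list form used above is accepted as well.

```python
import pandas as pd

# Hypothetical frames with an overlapping non-key column 'Total'
bronze_toy = pd.DataFrame({'NOC': ['USA', 'GBR'], 'Total': [10, 20]})
gold_toy = pd.DataFrame({'NOC': ['USA', 'GBR'], 'Total': [12, 22]})

default_names = pd.merge(bronze_toy, gold_toy, on='NOC')
custom_names = pd.merge(bronze_toy, gold_toy, on='NOC', suffixes=('_bronze', '_gold'))

print(list(default_names.columns))  # ['NOC', 'Total_x', 'Total_y']
print(list(custom_names.columns))   # ['NOC', 'Total_bronze', 'Total_gold']
```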

#### Counties DataFrame

In [None]:
pa_counties = {'CITY NAME': ['SALTSBURG', 'MINERAL SPRINGS', 'BIGLERVILLE', 'HANNASTOWN', 'TUNKHANNOCK'],
               'COUNTY NAME': ['INDIANA', 'CLEARFIELD', 'ADAMS', 'WESTMORELAND', 'WYOMING']}
counties = pd.DataFrame.from_dict(pa_counties)
counties

In [None]:
cities.tail()

#### Specifying columns to merge

In [None]:
pd.merge(counties, cities, left_on='CITY NAME', right_on='City')

#### Switching left/right DataFrames

In [None]:
pd.merge(cities, counties, left_on='City', right_on='CITY NAME')
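
The following sketch, built on a three-row subset of the data above (the `_toy` names are introduced here for illustration), shows two consequences of `left_on`/`right_on`: both key columns are kept, and the column order follows the left/right order of the arguments.

```python
import pandas as pd

counties_toy = pd.DataFrame({'CITY NAME': ['SALTSBURG', 'HANNASTOWN'],
                             'COUNTY NAME': ['INDIANA', 'WESTMORELAND']})
cities_toy = pd.DataFrame({'Zipcode': [15681, 15635, 17545],
                           'City': ['SALTSBURG', 'HANNASTOWN', 'MANHEIM']})

counties_first = pd.merge(counties_toy, cities_toy, left_on='CITY NAME', right_on='City')
cities_first = pd.merge(cities_toy, counties_toy, left_on='City', right_on='CITY NAME')

print(list(counties_first.columns))  # ['CITY NAME', 'COUNTY NAME', 'Zipcode', 'City']
print(list(cities_first.columns))    # ['Zipcode', 'City', 'CITY NAME', 'COUNTY NAME']
```

Both results contain the same two matched rows; only the column order differs.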

In [None]:
del pa_zipcode_population, pa_zipcode_city, population, cities, bronze, gold, so_merge, pa_counties, counties

### Exercises

#### Merging company DataFrames

Suppose your company has operations in several different cities under several different managers. The DataFrames **revenue** and **managers** contain partial information related to the company. That is, the rows of the **city** columns don't quite match in **revenue** and **managers** (the Mendocino branch has no revenue yet since it just opened and the manager of Springfield branch recently left the company).

The DataFrames have been printed in the IPython Shell. If you were to run the command `combined = pd.merge(revenue, managers, on='city')`, how many rows would **combined** have?

In [None]:
rev = {'city': ['Austin', 'Denver', 'Springfield'], 'revenue': [100, 83, 4]}
man = {'city': ['Austin', 'Denver', 'Mendocino'], 'manager': ['Charles', 'Joel', 'Brett']}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)

In [None]:
combined = pd.merge(revenue, managers, on='city')
combined

#### Merging on a specific column

This exercise follows on the last one with the DataFrames `revenue` and `managers` for your company. You expect your company to grow and, eventually, to operate in cities with the same name in different states. As such, you decide that every branch should have a numerical branch identifier. Thus, you add a `branch_id` column to both DataFrames. Moreover, new cities have been added to both the `revenue` and `managers` DataFrames as well. `pandas` has been imported as `pd` and both DataFrames are available in your namespace.

At present, there should be a 1-to-1 relationship between the `city` and `branch_id` fields. In that case, the result of a merge on the `city` columns ought to give you the same output as a merge on the `branch_id` columns. Do they? Can you spot an ambiguity in one of the DataFrames?

**Instructions**

* Using `pd.merge()`, merge the DataFrames `revenue` and `managers` on the `'city'` column of each. Store the result as `merge_by_city`.
* Print the DataFrame `merge_by_city`. This has been done for you.
* Merge the DataFrames `revenue` and `managers` on the `'branch_id'` column of each. Store the result as `merge_by_id`.
* Print the DataFrame `merge_by_id`. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'], 'revenue': [100, 83, 4, 200], 'branch_id': [10, 20, 30, 47]}
man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'], 'manager': ['Charles', 'Joel', 'Brett', 'Sally'], 'branch_id': [10, 20, 47, 31]}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)

In [None]:
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')

# Print merge_by_city
merge_by_city

In [None]:
# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')

# Print merge_by_id
merge_by_id

Notice that when you merge on `'city'`, the resulting DataFrame has a peculiar result: In row 2, the city Springfield has two different branch IDs. This is because there are actually two different cities named Springfield - one in the State of Illinois, and the other in Missouri. The `revenue` DataFrame has the one from Illinois, and the `managers` DataFrame has the one from Missouri. Consequently, when you merge on `'branch_id'`, both of these get dropped from the merged DataFrame.
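
One way to surface this kind of ambiguity programmatically is sketched below. It rebuilds the exercise data under `_toy` names (introduced here for illustration) and compares the two `branch_id` columns that a merge on `'city'` produces.

```python
import pandas as pd

revenue_toy = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                            'revenue': [100, 83, 4, 200],
                            'branch_id': [10, 20, 30, 47]})
managers_toy = pd.DataFrame({'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
                             'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
                             'branch_id': [10, 20, 47, 31]})

by_city = pd.merge(revenue_toy, managers_toy, on='city')
by_id = pd.merge(revenue_toy, managers_toy, on='branch_id')

# After merging on 'city', the overlapping 'branch_id' columns carry the default _x/_y suffixes
conflicts = by_city[by_city['branch_id_x'] != by_city['branch_id_y']]
print(conflicts[['city', 'branch_id_x', 'branch_id_y']])
print(len(by_city), len(by_id))
```

The only conflicting row is Springfield (30 versus 31), and the merge on `'branch_id'` returns three rows instead of four.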

#### Merging on columns with non-matching labels

You continue working with the `revenue` & `managers` DataFrames from before. This time, someone has changed the field name `'city'` to `'branch'` in the `managers` table. Now, when you attempt to merge DataFrames, an exception is thrown:

```python
>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
    ... <text deleted> ...
    pd.merge(revenue, managers, on='city')
    ... <text deleted> ...
KeyError: 'city'
```
    
Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the `left_on` and `right_on` parameters in the call to `pd.merge()`.

As before, `pandas` has been pre-imported as `pd` and the `revenue` and `managers` DataFrames are in your namespace. They have been printed in the IPython Shell so you can examine the columns prior to merging.

Are you able to merge better than in the last exercise? How should the rows with `Springfield` be handled?

**Instructions**

* Merge the DataFrames `revenue` and `managers` into a single DataFrame called `combined` using the `'city'` and `'branch'` columns from the appropriate DataFrames.
    * In your call to `pd.merge()`, you will have to specify the parameters `left_on` and `right_on` appropriately.
* Print the new DataFrame `combined`.

In [None]:
state_rev = {'Austin': 'TX', 'Denver': 'CO', 'Springfield': 'IL', 'Mendocino': 'CA'}
state_man = {'Austin': 'TX', 'Denver': 'CO', 'Mendocino': 'CA', 'Springfield': 'MO'}

In [None]:
revenue['state'] = revenue['city'].map(state_rev)
managers['state'] = managers['city'].map(state_man)

In [None]:
managers.rename(columns={'city': 'branch'}, inplace=True)

In [None]:
revenue

In [None]:
managers

In [None]:
combined = pd.merge(revenue, managers, left_on='city', right_on='branch')
combined

#### Merging on multiple columns

Another strategy to disambiguate cities with identical names is to add information on the states in which the cities are located. To this end, you add a column called `state` to both DataFrames from the preceding exercises. Again, `pandas` has been pre-imported as `pd` and the `revenue` and `managers` DataFrames are in your namespace.

Your goal in this exercise is to use `pd.merge()` to merge DataFrames using multiple columns (using `'branch_id'`, `'city'`, and `'state'` in this case).

Are you able to match all your company's branches correctly?

**Instructions**

* Create a column called `'state'` in the DataFrame `revenue`, consisting of the list `['TX','CO','IL','CA']`.
* Create a column called `'state'` in the DataFrame `managers`, consisting of the list `['TX','CO','CA','MO']`.
* Merge the DataFrames `revenue` and `managers` using three columns: `'branch_id'`, `'city'`, and `'state'`. Pass them in as a list to the `on` parameter of `pd.merge()`.

In [None]:
managers.rename(columns={'branch': 'city'}, inplace=True)

In [None]:
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']

# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']

# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])

# Print combined
print(combined)

In [None]:
del rev, man, revenue, managers, merge_by_city, merge_by_id, combined

### Joining DataFrames

* Pandas has to search through DataFrame rows for matches when computing joins and merges
    * It's useful to have different kinds of joins to mitigate costs

#### Medal DataFrames

In [None]:
bronze = pd.read_csv(so_bronze_file)
bronze.head()

In [None]:
len(bronze)

In [None]:
gold = pd.read_csv(so_gold_file)
gold.head()

In [None]:
len(gold)

#### Merging with inner join

* `merge()` does an inner join by default
    * it extracts the rows that match in joining columns from both DataFrames and it glues them together in the joined DataFrame
    * the argument `how='inner'` is the default behavior

In [None]:
so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='inner')
so_merge.head()

#### Merging with left join

* using `how='left'` keeps all rows of the left DataFrame (DF) in the merged DF
* For rows in the left DF with matches in the right DF:
    * Non-joining columns of right DF are appended to left DF
* For rows in the left DF with no matches in the right DF:
    * Non-joining columns are filled with nulls

In [None]:
bronze = pd.read_csv(so_bronze5_file)
gold = pd.read_csv(so_gold5_file)

In [None]:
g_noc = ['USA', 'URS', 'GBR', 'ITA', 'GER']
b_noc = ['USA', 'URS', 'GBR', 'FRA', 'GER']

In [None]:
gold['NOC'] = g_noc
bronze['NOC'] = b_noc

In [None]:
gold

In [None]:
bronze

In [None]:
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='left')

#### Merging with right join

In [None]:
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='right')

#### Merging with outer join

In [None]:
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='outer')
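
To compare the four join types side by side, the sketch below runs the same merge with each `how` value on hypothetical three-row frames and records which `NOC` codes survive. All names and values are made up.

```python
import pandas as pd

# Hypothetical frames: 'FRA' appears only on the left, 'ITA' only on the right
bronze_toy = pd.DataFrame({'NOC': ['USA', 'URS', 'FRA'], 'Total': [1, 2, 3]})
gold_toy = pd.DataFrame({'NOC': ['USA', 'URS', 'ITA'], 'Total': [4, 5, 6]})

noc_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(bronze_toy, gold_toy, on='NOC', how=how,
                      suffixes=('_bronze', '_gold'))
    noc_by_how[how] = sorted(result['NOC'])  # sorted so row order does not matter here
    print(how, noc_by_how[how])
```

In the left, right, and outer results, the unmatched rows carry `NaN` in the columns that come from the other frame.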

#### Population & unemployment data

In [None]:
population = pd.DataFrame.from_dict({'Zip Code ZCTA': [57538, 59916, 37660, 2860],
                                     '2010 Census Population': [322, 130, 40038, 45199]})
population.set_index('Zip Code ZCTA', inplace=True)
population

In [None]:
unemployment = pd.DataFrame.from_dict({'Zip': [2860, 46167, 1097],
                                       'unemployment': [0.11, 0.02, 0.33],
                                       'participants': [ 34447, 4800, 32]})
unemployment.set_index('Zip', inplace=True)
unemployment

#### Using .join(how='left')

* computes a left join using the Index by default

In [None]:
population.join(unemployment)

#### Using .join(how='right')

In [None]:
population.join(unemployment, how='right')

#### Using .join(how='inner')

In [None]:
population.join(unemployment, how='inner')

#### Using .join(how='outer')

In [None]:
population.join(unemployment, how='outer')

In [None]:
del bronze, gold, so_merge, g_noc, b_noc, population, unemployment

#### Which should you use?

* df1.append(df2): stacking vertically (removed in pandas 2.0; use pd.concat([df1, df2]) instead)
* pd.concat([df1, df2]):
    * stacking many horizontally or vertically
    * simple inner/outer joins on Indexes
* df1.join(df2): inner/outer/left/right joins on Indexes
* pd.merge(df1, df2): many joins on multiple columns
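
For an index-based inner join, the three tools overlap. The sketch below uses a few rows of the population and unemployment data from earlier under `_toy` names (introduced here for illustration) and should be read as one possible comparison, not a recommendation.

```python
import pandas as pd

population_toy = pd.DataFrame({'population': [322, 130, 45199]},
                              index=[57538, 59916, 2860])
unemployment_toy = pd.DataFrame({'unemployment': [0.11, 0.02]},
                                index=[2860, 46167])

# Three ways to express an inner join on the Index
via_join = population_toy.join(unemployment_toy, how='inner')
via_concat = pd.concat([population_toy, unemployment_toy], axis=1, join='inner')
via_merge = pd.merge(population_toy, unemployment_toy,
                     left_index=True, right_index=True)

# Vertical stacking goes through pd.concat, since DataFrame.append() is gone in pandas 2.x
stacked = pd.concat([population_toy, population_toy])

print(via_join)
```

All three joins return the single shared label `2860`; the choice between them is mostly about which keys (Index or columns) and which join types are needed.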

### Exercises

#### Data

In [None]:
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
       'state': ['TX','CO','IL','CA'],
       'revenue': [100, 83, 4, 200],
       'branch_id': [10, 20, 30, 47]}

man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
       'state': ['TX','CO','CA','MO'],
       'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
       'branch_id': [10, 20, 47, 31]}

revenue = pd.DataFrame.from_dict(rev)
revenue.set_index('branch_id', inplace=True)
managers = pd.DataFrame.from_dict(man)
managers.set_index('branch_id', inplace=True)

In [None]:
revenue

In [None]:
managers

#### Joining by Index

The DataFrames `revenue` and `managers` are displayed in the IPython Shell. Here, they are indexed by `'branch_id'`.

Choose the function call below that will join the DataFrames on their indexes and return 5 rows with index labels `[10, 20, 30, 31, 47]`. Explore each of them in the IPython Shell to get a better understanding of their functionality.

In [None]:
revenue.join(managers, lsuffix='_rev', rsuffix='_mng', how='outer')

#### Choosing a joining strategy

Suppose you have two DataFrames: `students` (with columns `'StudentID'`, `'LastName'`, `'FirstName'`, and `'Major'`) and `midterm_results` (with columns `'StudentID'`, `'Q1'`, `'Q2'`, and `'Q3'` for their scores on midterm questions).

You want to combine the DataFrames into a single DataFrame `grades`, and be able to easily spot which students wrote the midterm and which didn't (their midterm question scores `'Q1'`, `'Q2'`, & `'Q3'` should be filled with `NaN` values).

You also want to drop rows from `midterm_results` in which the `StudentID` is not found in `students`.

Which of the following strategies gives the desired result?

In [None]:
students = pd.DataFrame.from_dict({'StudentID': [], 'LastName': [], 'FirstName': [], 'Major': []})
midterm_results = pd.DataFrame.from_dict({'StudentID': [], 'Q1': [], 'Q2': [], 'Q3': []})

In [None]:
students

In [None]:
midterm_results

In [None]:
grades = pd.merge(students, midterm_results, how='left')
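
Because the frames above are empty placeholders, the effect of the left merge is easier to see on a few made-up rows. The student names, IDs, and scores below are hypothetical, and only one score column is included to keep the sketch short.

```python
import pandas as pd

students_toy = pd.DataFrame({'StudentID': [1, 2, 3],
                             'LastName': ['Ames', 'Brown', 'Cruz']})
# StudentID 9 is not enrolled; student 2 did not write the midterm
midterm_toy = pd.DataFrame({'StudentID': [1, 3, 9], 'Q1': [8, 6, 7]})

# Left merge on the shared 'StudentID' column: every student is kept,
# missing scores become NaN, and the unknown StudentID 9 is dropped
grades_toy = pd.merge(students_toy, midterm_toy, how='left')
print(grades_toy)
```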

#### Left & right merging on multiple columns

You now have, in addition to the `revenue` and `managers` DataFrames from prior exercises, a DataFrame `sales` that summarizes units sold from specific branches (identified by `city` and `state` but not `branch_id`).

Once again, the `managers` DataFrame uses the label `branch` in place of `city` as in the other two DataFrames. Your task here is to employ *left* and *right* merges to preserve data and identify where data is missing.

By merging `revenue` and `sales` with a *right* merge, you can identify the missing `revenue` values. Here, you don't need to specify `left_on` or `right_on` because the columns to merge on have matching labels.

By merging `sales` and `managers` with a *left* merge, you can identify the missing `manager`. Here, the columns to merge on have conflicting labels, so you must specify `left_on` and `right_on`. In both cases, you're looking to figure out how to connect the fields in rows containing `Springfield`.

`pandas` has been imported as `pd` and the three DataFrames `revenue`, `managers`, and `sales` have been pre-loaded. They have been printed for you to explore in the IPython Shell.

**Instructions**

* Execute a right merge using `pd.merge()` with `revenue` and `sales` to yield a new DataFrame `revenue_and_sales`.
    * Use `how='right'` and `on=['city', 'state']`.
* Print the new DataFrame `revenue_and_sales`. This has been done for you.
* Execute a left merge with `sales` and `managers` to yield a new DataFrame `sales_and_managers`.
    * Use `how='left'`, `left_on=['city', 'state']`, and `right_on=['branch', 'state']`.
* Print the new DataFrame `sales_and_managers`. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
       'branch_id': [10, 20, 30, 47],
       'state': ['TX','CO','IL','CA'],
       'revenue': [100, 83, 4, 200]}

man = {'branch': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
       'branch_id': [10, 20, 47, 31],
       'state': ['TX','CO','CA','MO'],
       'manager': ['Charles', 'Joel', 'Brett', 'Sally']}

sale = {'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
        'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
        'units': [1, 4, 2, 5, 1]}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
sales = pd.DataFrame.from_dict(sale)

In [None]:
revenue

In [None]:
managers

In [None]:
sales

In [None]:
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])

# Print revenue_and_sales
revenue_and_sales

In [None]:
sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
sales_and_managers

#### Merging DataFrames with outer join

This exercise picks up where the previous one left off. The DataFrames `revenue`, `managers`, and `sales` are pre-loaded into your namespace (and, of course, `pandas` is imported as `pd`). Moreover, the merged DataFrames `revenue_and_sales` and `sales_and_managers` have been pre-computed exactly as you did in the previous exercise.

The merged DataFrames contain enough information to construct a DataFrame with 5 rows with all known information correctly aligned and each branch listed only once. You will try to merge the merged DataFrames on all matching keys (which computes an inner join by default). You can compare the result to an outer join and also to an outer join with a restricted subset of columns as keys.

**Instructions**

* Merge `sales_and_managers` with `revenue_and_sales`. Store the result as `merge_default`.
* Print `merge_default`. This has been done for you.
* Merge `sales_and_managers` with `revenue_and_sales` using `how='outer'`. Store the result as `merge_outer`.
* Print `merge_outer`. This has been done for you.
* Merge `sales_and_managers` with `revenue_and_sales` only on `['city','state']` using an outer join. Store the result as `merge_outer_on` and hit 'Submit Answer' to see what the merged DataFrames look like!

In [None]:
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)

# Print merge_default
merge_default

In [None]:
# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')

# Print merge_outer
merge_outer

In [None]:
# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, on=['city', 'state'], how='outer')

# Print merge_outer_on
merge_outer_on

In [None]:
del rev, man, revenue, managers, students, midterm_results, grades, sale, sales, revenue_and_sales, sales_and_managers, merge_default, merge_outer, merge_outer_on

### Ordered merges

#### Software & hardware sales

In [None]:
software = pd.read_csv(sales_feb_software_file, parse_dates=['Date']).sort_values('Date')
software.head(10)

In [None]:
hardware = pd.read_csv(sales_feb_hardware_file, parse_dates=['Date']).sort_values('Date')
hardware.head()

#### Using merge()

* attempting to merge yields an empty DataFrame because it's doing an INNER join on all columns with matching names by default
    * 'Units' and 'Date' columns have no overlapping values, so the result is empty

In [None]:
sales_merge = pd.merge(hardware, software)
sales_merge

In [None]:
sales_merge.info()

#### Using merge(how='outer')

In [None]:
sales_merge = pd.merge(hardware, software, how='outer')
sales_merge.head(14)

#### Sorting merge(how='outer')

In [None]:
sales_merge = pd.merge(hardware, software, how='outer').sort_values('Date')
sales_merge.head(14)

#### Using merge_ordered()

* the default is an OUTER join

In [None]:
sales_merged = pd.merge_ordered(hardware, software)
sales_merged.head(14)
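
The contrast between the default `pd.merge()` and `pd.merge_ordered()` can be reproduced on a handful of made-up rows. The company names, dates, and unit counts below are hypothetical.

```python
import pandas as pd

hardware_toy = pd.DataFrame({'Date': pd.to_datetime(['2015-02-02', '2015-02-21']),
                             'Company': ['Hooli', 'Acme'],
                             'Units': [3, 9]})
software_toy = pd.DataFrame({'Date': pd.to_datetime(['2015-02-04', '2015-02-16']),
                             'Company': ['Initech', 'Hooli'],
                             'Units': [13, 10]})

# Inner join on every shared column (Date, Company, Units): no row matches on all three
inner = pd.merge(hardware_toy, software_toy)

# Outer join on the same keys, with the rows returned in key order
ordered = pd.merge_ordered(hardware_toy, software_toy)

print(len(inner))
print(ordered)
```

The ordered result interleaves the hardware and software rows by `Date`, without a separate `.sort_values()` call.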

#### Using on & suffixes

In [None]:
sales_merged = pd.merge_ordered(hardware, software, on=['Date', 'Company'], suffixes=['_hardware', '_software'])
sales_merged.head()

#### Stocks data

In [None]:
pwd()

In [None]:
stocks_dir = Path.cwd() / 'data' / 'merging-dataframes-with-pandas'

tickers = ['^gspc', 'AAPL', 'CSCO', 'AMZN', 'MSFT', 'IBM']

for tk in tickers:
    print(tk)
    df = yf.download(tk, start='1980-01-01', end='2024-04-30', interval='1d').assign(tkr=tk)
    if tk == '^gspc':
        tk = 'SP500'
    df.to_csv( stocks_dir / f'{tk}.csv', index=True)

In [None]:
sp500_stocks = stocks_dir / 'SP500.csv'
aapl = stocks_dir / 'AAPL.csv'
csco = stocks_dir / 'CSCO.csv'
amzn = stocks_dir / 'AMZN.csv'
msft = stocks_dir / 'MSFT.csv'
ibm = stocks_dir / 'IBM.csv'

In [None]:
sp500_df = pd.read_csv(sp500_stocks, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
aapl_df = pd.read_csv(aapl, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
csco_df = pd.read_csv(csco, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
amzn_df = pd.read_csv(amzn, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
msft_df = pd.read_csv(msft, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
ibm_df = pd.read_csv(ibm, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])

In [None]:
sp500_df.rename(columns={'Close': 'S&P'}, inplace=True)
aapl_df.rename(columns={'Close': 'AAPL'}, inplace=True)
csco_df.rename(columns={'Close': 'CSCO'}, inplace=True)
amzn_df.rename(columns={'Close': 'AMZN'}, inplace=True)
msft_df.rename(columns={'Close': 'MSFT'}, inplace=True)
ibm_df.rename(columns={'Close': 'IBM'}, inplace=True)

In [None]:
sp500_df

In [None]:
stocks = pd.concat([sp500_df, aapl_df, csco_df, amzn_df, msft_df, ibm_df], axis=1)

In [None]:
stocks.head()

In [None]:
stocks.tail()

In [None]:
stocks.to_csv(stocks_dir / 'stocks.csv', index=True, index_label='Date')

#### GDP data

In [None]:
gdp = pd.read_csv(gdp_usa_file, parse_dates=['DATE'])
gdp.sort_values(by=['DATE'], ascending=False, inplace=True)
gdp.reset_index(inplace=True, drop=True)
gdp.rename(columns={'VALUE': 'GDP', 'DATE': 'Date'}, inplace=True)
gdp.head(8)

#### Ordered merge

In [None]:
gdp_2000_2015 = gdp[(gdp['Date'].dt.year >= 2000) & (gdp['Date'].dt.year <= 2015)]

In [None]:
stocks.reset_index(inplace=True)
stocks.head(5)

In [None]:
stocks_2000_2015 = stocks[(stocks['Date'].dt.year >= 2000) & (stocks['Date'].dt.year <= 2015)]

In [None]:
ordered_df = pd.merge_ordered(stocks_2000_2015, gdp_2000_2015, on='Date')
ordered_df.head()

#### Ordered merge with ffill

In [None]:
ordered_df = pd.merge_ordered(stocks_2000_2015, gdp_2000_2015, on='Date', fill_method='ffill')
ordered_df.head()
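
A minimal sketch of what `fill_method='ffill'` changes, using a made-up daily series and a sparser made-up series in place of the stock and GDP data. All names and values are hypothetical.

```python
import pandas as pd

daily_toy = pd.DataFrame({'Date': pd.to_datetime(['2000-01-03', '2000-01-04',
                                                  '2000-04-03', '2000-04-04']),
                          'Close': [1.0, 2.0, 3.0, 4.0]})
quarterly_toy = pd.DataFrame({'Date': pd.to_datetime(['2000-01-03', '2000-04-03']),
                              'GDP': [100.0, 200.0]})

# Without filling, days that have no quarterly observation get NaN in 'GDP'
plain = pd.merge_ordered(daily_toy, quarterly_toy, on='Date')

# With forward fill, each gap takes the most recent earlier value
filled = pd.merge_ordered(daily_toy, quarterly_toy, on='Date', fill_method='ffill')

print(plain)
print(filled)
```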

In [None]:
del software, hardware, sales_merge, sales_merged, stocks_dir, sp500_stocks, aapl
del csco, amzn, msft, ibm, sp500_df, aapl_df, csco_df, amzn_df, msft_df, ibm_df, stocks
del gdp, gdp_2000_2015, stocks_2000_2015, ordered_df

### Exercises

#### Using merge_ordered()

This exercise uses pre-loaded DataFrames `austin` and `houston` that contain weather data from the cities Austin and Houston respectively. They have been printed in the IPython Shell for you to examine.

Weather conditions were recorded on separate days and you need to merge these two DataFrames together such that the dates are ordered. To do this, you'll use `pd.merge_ordered()`. After you're done, note the order of the rows before and after merging.

**Instructions**

* Perform an ordered merge on `austin` and `houston` using `pd.merge_ordered()`. Store the result as `tx_weather`.
* Print `tx_weather`. You should notice that the rows are sorted by the date but it is not possible to tell which observation came from which city.
* Perform another ordered merge on `austin` and `houston`.
    * This time, specify the keyword arguments `on='date'` and `suffixes=['_aus','_hus']` so that the rows can be distinguished. Store the result as `tx_weather_suff`.
* Print `tx_weather_suff` to examine its contents. This has been done for you.
* Perform a third ordered merge on `austin` and `houston`.
    * This time, in addition to the `on` and `suffixes` parameters, specify the keyword argument `fill_method='ffill'` to use *forward-filling* to replace `NaN` entries with the most recent non-null entry, and hit 'Submit Answer' to examine the contents of the merged DataFrames!

In [None]:
austin = pd.DataFrame.from_dict({'date': ['2016-01-01', '2016-02-08', '2016-01-17'], 'ratings': ['Cloudy', 'Cloudy', 'Sunny']})
houston = pd.DataFrame.from_dict({'date': ['2016-01-04', '2016-01-01', '2016-03-01'], 'ratings': ['Rainy', 'Cloudy', 'Sunny']})

In [None]:
# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)

# Print tx_weather
tx_weather

In [None]:
# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'])

# Print tx_weather_suff
tx_weather_suff

In [None]:
# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'], fill_method='ffill')

# Print tx_weather_ffill
tx_weather_ffill

In [None]:
del austin, houston, tx_weather, tx_weather_suff, tx_weather_ffill

#### Using merge_asof()

Similar to `pd.merge_ordered()`, the `pd.merge_asof()` function also merges values in order using the `on` column, but for each row in the left DataFrame it keeps only the last row from the right DataFrame whose `on` value is less than or equal to the left value (the default backward search, with exact matches allowed).

This function can be used to align disparate datetime frequencies without having to first resample.
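
The matching rule can be checked on a few made-up rows. The sketch below assumes both frames are already sorted by their key columns, which `pd.merge_asof()` requires; the `_toy` names and all values are hypothetical.

```python
import pandas as pd

auto_toy = pd.DataFrame({'yr': pd.to_datetime(['2000-01-15', '2000-02-01', '2000-03-20']),
                         'mpg': [18.0, 25.0, 30.0]})
oil_toy = pd.DataFrame({'Date': pd.to_datetime(['2000-01-01', '2000-02-01', '2000-03-01']),
                        'Price': [10.0, 20.0, 30.0]})

# For each 'yr', take the last oil row whose 'Date' is on or before it
merged_toy = pd.merge_asof(auto_toy, oil_toy, left_on='yr', right_on='Date')
print(merged_toy)
```

The middle row shows an exact match (`2000-02-01` on both sides), while the other two rows pick up the most recent earlier price.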

Here, you'll merge monthly oil prices (US dollars) into a full automobile fuel efficiency dataset. The oil and automobile DataFrames have been pre-loaded as `oil` and `auto`. The first 5 rows of each have been printed in the IPython Shell for you to explore.

These datasets will align such that the first price of the year will be broadcast into the rows of the automobiles DataFrame. This is considered correct since by the start of any given year, most automobiles for that year will have already been manufactured.

You'll then inspect the merged DataFrame, resample by year and compute the mean `'Price'` and `'mpg'`. You should be able to see a trend in these two columns, that you can confirm by computing the Pearson correlation between resampled `'Price'` and `'mpg'`.

**Instructions**

* Merge `auto` and `oil` using `pd.merge_asof()` with `left_on='yr'` and `right_on='Date'`. Store the result as `merged`.
* Print the tail of `merged`. This has been done for you.
* Resample `merged` using `'A'` (annual frequency), and `on='Date'`. Select `[['mpg','Price']]` and aggregate the mean. Store the result as `yearly`.
* Hit Submit Answer to examine the contents of `yearly` and `yearly.corr()`, which shows the Pearson correlation between the resampled `'Price'` and `'mpg'`.

In [None]:
oil = pd.read_csv(oil_price_file, parse_dates=['Date'])
auto = pd.read_csv(auto_fuel_file, parse_dates=['yr'])

In [None]:
oil.head()

In [None]:
auto.head()

In [None]:
# Merge auto and oil: merged
merged = pd.merge_asof(auto, oil, left_on='yr', right_on='Date')

# Print the tail of merged
merged.tail()

In [None]:
# Resample merged: yearly
yearly = merged.resample('YE', on='Date')[['mpg','Price']].mean()

# Print yearly
yearly

In [None]:
# print yearly.corr()
yearly.corr()

## Case Study - Summer Olympics

To cement your new skills, you'll apply them by working on an in-depth study involving Olympic medal data. The analysis involves integrating your multi-DataFrame skills from this course and also skills you've gained in previous pandas courses. This is a rich dataset that will allow you to fully leverage your pandas data manipulation skills. Enjoy!

### Medals in the Summer Olympics

#### Summer Olympic medalists 1896 to 2008 - IOC COUNTRY CODES.csv

In [None]:
pd.read_csv(so_ioc_codes_file).head(8)

#### Summer Olympic medalists 1896 to 2008 - EDITIONS.tsv

In [None]:
pd.read_csv(so_editions_file, sep='\t').head(8)

#### summer_1896.csv, summer_1900.csv, …, summer_2008.csv

In [None]:
pd.read_csv(so_all_medalists_file, sep='\t', header=4).head(8)

#### Reminder: loading & merging files

* pd.read_csv() (& its many options)
* Looping over files, e.g.,
    * [pd.read_csv(f) for f in glob('*.csv')]
* Concatenating & appending, e.g.,
    * pd.concat([df1, df2], axis=0)
    * df1.append(df2) (removed in pandas 2.0; use pd.concat instead)
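
The loop-and-concatenate pattern from this reminder might look like the sketch below. To stay self-contained it writes two tiny hypothetical CSV files into a temporary directory; the file contents are invented, and `Path.glob` stands in for `glob('*.csv')`.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Write two small hypothetical files so the sketch runs anywhere
    (tmp_dir / 'summer_1896.csv').write_text('Athlete,Medal\nA,Gold\nB,Silver\n')
    (tmp_dir / 'summer_1900.csv').write_text('Athlete,Medal\nC,Bronze\n')

    # One DataFrame per file, read in a predictable (sorted) order
    frames = [pd.read_csv(f) for f in sorted(tmp_dir.glob('*.csv'))]

# Stack the frames vertically and rebuild a clean 0..n-1 index
combined = pd.concat(frames, axis=0, ignore_index=True)
print(combined)
```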

### Case Study Explorations

#### Loading Olympic edition DataFrame

In this chapter, you'll be using [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

Your first task here is to prepare a DataFrame `editions` from a *tab-separated values* (TSV) file.

Initially, `editions` has 26 rows (one for each Olympic edition, i.e., a year in which the Olympics was held) and 7 columns: `'Edition'`, `'Bronze'`, `'Gold'`, `'Silver'`, `'Grand Total'`, `'City'`, and `'Country'`.

For the analysis that follows, you won't need the overall medal counts, so you want to keep only the useful columns from `editions`: `'Edition'`, `'Grand Total'`, `'City'`, and `'Country'`.

**Instructions**

* Read `file_path` into a DataFrame called `editions`. The identifier `file_path` has been pre-defined with the filename `'Summer Olympic medallists 1896 to 2008 - EDITIONS.tsv'`. You'll have to use the option `sep='\t'` because the file uses tabs to delimit fields (`pd.read_csv()` expects commas by default).
* Select only the columns `'Edition'`, `'Grand Total'`, `'City'`, and `'Country'` from `editions`.
* Print the final DataFrame `editions` in entirety (there are only 26 rows). This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
editions = pd.read_csv(so_editions_file, sep='\t')
editions = editions[['Edition', 'Grand Total', 'City', 'Country']]
editions.head()
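
To see why `sep='\t'` matters, the sketch below reads the same tab-separated text twice, once with the default comma separator and once with tabs. The text in `tsv_text` is a hypothetical two-row sample with placeholder values, not the real editions file.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample with placeholder values
tsv_text = (
    "Edition\tBronze\tGold\tSilver\tGrand Total\tCity\tCountry\n"
    "1896\t1\t2\t3\t6\tCity A\tCountry A\n"
    "1900\t2\t2\t2\t6\tCity B\tCountry B\n"
)

# With the default separator (a comma), every line lands in a single column
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep='\t' the fields are split as intended
editions_demo = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# Keep only the columns needed for the later analysis
editions_demo = editions_demo[['Edition', 'Grand Total', 'City', 'Country']]
print(wrong.shape, editions_demo.shape)
```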

#### Loading IOC codes DataFrames

Your task here is to prepare a DataFrame `ioc_codes` from a comma-separated values (CSV) file.

Initially, `ioc_codes` has 200 rows (one for each country) and 3 columns: `'Country'`, `'NOC'`, & `'ISO code'`.

For the analysis that follows, you want to keep only the useful columns from `ioc_codes`: `'Country'` and `'NOC'` (the column `'NOC'` contains three-letter codes representing each country).

**Instructions**

* Read `file_path` into a DataFrame called `ioc_codes`. The identifier `file_path` has been pre-defined with the filename `'Summer Olympic medallists 1896 to 2008 - IOC COUNTRY CODES.csv'`.
* Select only the columns `'Country'` and `'NOC'` from `ioc_codes`.
* Print the leading 5 and trailing 5 rows of the DataFrame `ioc_codes` (there are 200 rows in total). This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
ioc_codes = pd.read_csv(so_ioc_codes_file)
ioc_codes = ioc_codes[['Country', 'NOC']]
ioc_codes.head()

#### Building medals DataFrame

Here, you'll start with the DataFrame `editions` from the previous exercise.

You have a sequence of files `summer_1896.csv`, `summer_1900.csv`, ..., `summer_2008.csv`, one for each Olympic edition (year).

You will build up a dictionary `medals_dict` with the Olympic editions (years) as keys and DataFrames as values.

The dictionary is built up inside a loop over the year of each Olympic edition (from the Index of `editions`).

Once the dictionary of DataFrames is built up, you will combine the DataFrames using `pd.concat()`.

**Instructions**

* Within the `for` loop:
    * Create the file path. This has been done for you.
    * Read `file_path` into a DataFrame. Assign the result to the `year` key of `medals_dict`.
    * Select only the columns `'Athlete'`, `'NOC'`, and `'Medal'` from `medals_dict[year]`.
    * Create a new column called `'Edition'` in the DataFrame `medals_dict[year]` whose entries are all `year`.
* Concatenate the dictionary of DataFrames `medals_dict` into a DataFrame called `medals`. Specify the keyword argument `ignore_index=True` to prevent repeated integer indices.
* Print the first and last 5 rows of `medals`. This has been done for you, so hit 'Submit Answer' to see the result!

* The following code shows how the per-edition files would be combined by year.
    * The individual files are not available, so the code is shown for reference only.
    * The combined dataset is provided and is loaded in the cell after it.

```python
# Start with an empty dictionary, keyed by edition year
medals_dict = {}

for year in editions['Edition']:

    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)
    
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)
    
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]
    
    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year
    
# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)
```
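
The effect of `ignore_index=True` when concatenating a dictionary is easier to see on a small case. This is a sketch with a hypothetical two-entry dictionary, `medals_by_year`, whose athlete names are placeholders.

```python
import pandas as pd

# Hypothetical dictionary of per-edition DataFrames
medals_by_year = {
    1896: pd.DataFrame({'Athlete': ['A. One'], 'NOC': ['GRE']}),
    1900: pd.DataFrame({'Athlete': ['B. Two', 'C. Three'], 'NOC': ['FRA', 'USA']}),
}

# Without ignore_index, the dict keys become the outer level of a MultiIndex
keyed = pd.concat(medals_by_year)

# With ignore_index=True, the keys are discarded and rows are renumbered
flat = pd.concat(medals_by_year, ignore_index=True)

print(keyed.index.tolist())
print(flat.index.tolist())
```

Since `ignore_index=True` throws the keys away, the year has to be stored in a column (the `'Edition'` column in the loop above) before concatenating, or it is lost.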

In [None]:
medals = pd.read_csv(so_all_medalists_file, sep='\t', header=4)
medals = medals[['Athlete', 'NOC', 'Medal', 'Edition']]
medals.head()

### Quantifying Performance

#### Constructing a pivot table

* Apply the DataFrame `pivot_table()` method
    * `index`: column to use as index of pivot table
    * `values`: column(s) to aggregate
    * `aggfunc`: function to apply for aggregation
    * `columns`: categories as columns of pivot table
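
The four arguments listed above can be sketched on a tiny, made-up medal table (the frame `medals_demo` and its single-letter athlete names are hypothetical):

```python
import pandas as pd

# Hypothetical medal records: one row per medal awarded
medals_demo = pd.DataFrame({
    'Edition': [1896, 1896, 1896, 1900, 1900],
    'NOC': ['GRE', 'GRE', 'USA', 'FRA', 'USA'],
    'Athlete': ['A', 'B', 'C', 'D', 'E'],
})

# Rows are editions, columns are country codes, cells count the athletes
counts_demo = medals_demo.pivot_table(
    index='Edition', columns='NOC', values='Athlete', aggfunc='count'
)
print(counts_demo)
```

Combinations with no records (for example `'FRA'` in 1896) show up as `NaN` rather than `0`.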

### Case Study Explorations

#### Counting medals by country/edition in a pivot table

Here, you'll start with the concatenated DataFrame `medals` from the previous exercise.

You can construct a pivot table to see the number of medals each country won in each year. The result is a new DataFrame with the Olympic edition on the Index and with 138 country `NOC` codes as columns. If you want a refresher on pivot tables, it may be useful to refer back to the relevant exercises in [Manipulating DataFrames with pandas](https://campus.datacamp.com/courses/manipulating-dataframes-with-pandas/rearranging-and-reshaping-data?ex=14).

**Instructions**

* Construct a pivot table from the DataFrame `medals`, aggregating by `count` (by specifying the `aggfunc` parameter). Use `'Edition'` as the `index`, `'Athlete'` for the `values`, and `'NOC'` for the `columns`.
* Print the first & last 5 rows of `medal_counts`. This has been done for you, so hit 'Submit Answer' to see the results!

In [None]:
# Construct the pivot_table: medal_counts
medal_counts = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')

# Print the first & last 5 rows of medal_counts
medal_counts.head()

In [None]:
medal_counts.tail()

#### Computing fraction of medals per Olympic edition

In this exercise, you'll start with the DataFrames `editions`, `medals`, & `medal_counts` from prior exercises.

You can extract a Series with the total number of medals awarded in each Olympic edition.

The DataFrame `medal_counts` can be divided row-wise by the total number of medals awarded each edition; the method `.divide()` performs the broadcast as you require.

This gives you a normalized indication of each country's performance in each edition.
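
A minimal sketch of the row-wise division, using hypothetical counts and totals (`counts_demo` and `totals_demo` are made-up values chosen so the fractions are easy to check by hand):

```python
import pandas as pd

editions_index = pd.Index([1896, 1900], name='Edition')

# Hypothetical medal counts per country and totals per edition
counts_demo = pd.DataFrame({'GRE': [2.0, 1.0], 'USA': [2.0, 3.0]}, index=editions_index)
totals_demo = pd.Series([4, 8], index=editions_index, name='Grand Total')

# axis='rows' aligns the Series with the row labels, so every value in a row
# is divided by that edition's total. A plain `/` would instead try to align
# the Series labels with the column labels.
fractions_demo = counts_demo.divide(totals_demo, axis='rows')
print(fractions_demo)
```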

**Instructions**

* Set the index of the DataFrame `editions` to be `'Edition'` (using the method `.set_index()`). Save the result as `totals`.
* Extract the `'Grand Total'` column from `totals` and assign the result back to `totals`.
* Divide the DataFrame `medal_counts` by `totals` along each row. You will have to use the `.divide()` method with the option `axis='rows'`. Assign the result to `fractions`.
* Print first & last 5 rows of the DataFrame `fractions`. This has been done for you, so hit 'Submit Answer' to see the results!

In [None]:
# Set Index of editions: totals
totals = editions.set_index('Edition')
totals.head()

In [None]:
# Reassign totals['Grand Total']: totals
totals = totals['Grand Total']
totals.head()

In [None]:
# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis='rows')

# Print first & last 5 rows of fractions
fractions.head()

In [None]:
fractions.tail()

#### Computing percentage change in fraction of medals won

Here, you'll start with the DataFrames `editions`, `medals`, `medal_counts`, & `fractions` from prior exercises.

To see if there is a host country advantage, you first want to see how the fraction of medals won changes from edition to edition.

The *expanding mean* provides a way to see this down each column. It is the value of the mean with all the data available up to that point in time. If you are interested in learning more about pandas' expanding transformations, this section of the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html) has additional information.
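
As a small worked example, the sketch below applies an expanding mean and then a percentage change to one hypothetical column of fractions (the values in `fractions_demo` are made up so the arithmetic is easy to follow).

```python
import pandas as pd

# Hypothetical medal fractions for one country over three editions
fractions_demo = pd.DataFrame(
    {'USA': [0.25, 0.75, 0.5]},
    index=pd.Index([1896, 1900, 1904], name='Edition'),
)

# Expanding mean: 0.25, then (0.25 + 0.75) / 2, then (0.25 + 0.75 + 0.5) / 3
mean_demo = fractions_demo.expanding().mean()

# Percentage change between consecutive rows of the expanding mean
change_demo = mean_demo.pct_change() * 100
print(change_demo)
```

The first row of the change is `NaN` because there is no earlier edition to compare against.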

**Instructions**

* Create `mean_fractions` by chaining the methods `.expanding().mean()` to `fractions`.
* Compute the percentage change in `mean_fractions` down each column by applying `.pct_change()` and multiplying by `100`. Assign the result to `fractions_change`.
* Reset the index of `fractions_change` using the `.reset_index()` method. This will make `'Edition'` an ordinary column.
* Print the first and last 5 rows of the DataFrame `fractions_change`. This has been done for you, so hit 'Submit Answer' to see the results!

In [None]:
# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()
mean_fractions.head()

In [None]:
# Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change()*100
fractions_change.head()

In [None]:
# Reset the index of fractions_change: fractions_change
fractions_change = fractions_change.reset_index()

# Print first & last 5 rows of fractions_change
fractions_change.head()

In [None]:
fractions_change.tail()

### Reshaping and plotting

### Case Study Explorations

#### Building hosts DataFrame

Your task here is to prepare a DataFrame `hosts` by left joining `editions` and `ioc_codes`.

Once created, you will subset the `Edition` and `NOC` columns and set `Edition` as the Index.

There are some missing `NOC` values; you will set those explicitly.

Finally, you'll reset the Index & print the final DataFrame.
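
The whole sequence can be sketched on two hypothetical rows. The country spellings in `editions_demo` and `ioc_demo` are assumptions chosen to force a mismatch; only the `'FRG'` code for 1972 comes from the exercise itself.

```python
import pandas as pd

# Hypothetical inputs whose country names do not all match
editions_demo = pd.DataFrame({'Edition': [1968, 1972],
                              'Country': ['Mexico', 'West Germany']})
ioc_demo = pd.DataFrame({'Country': ['Mexico', 'Germany'],
                         'NOC': ['MEX', 'GER']})

# The left join keeps every edition; unmatched countries get a missing NOC
hosts_demo = pd.merge(editions_demo, ioc_demo, how='left')
hosts_demo = hosts_demo[['Edition', 'NOC']].set_index('Edition')

# Locate the rows that need fixing, then assign the code by edition label
missing = hosts_demo.loc[hosts_demo['NOC'].isnull()]
hosts_demo.loc[1972, 'NOC'] = 'FRG'

# Turn 'Edition' back into an ordinary column
hosts_demo = hosts_demo.reset_index()
print(hosts_demo)
```

Setting `'Edition'` as the index first is what makes the label-based assignment `hosts_demo.loc[1972, 'NOC']` possible.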

**Instructions**

* Create the DataFrame `hosts` by doing a left join on DataFrames `editions` and `ioc_codes` (using `pd.merge()`).
* Clean up `hosts` by subsetting and setting the Index.
    * Extract the columns `'Edition'` and `'NOC'`.
    * Set `'Edition'` column as the Index.
* Use the `.loc[]` accessor to find and assign the missing values to the `'NOC'` column in `hosts`. This has been done for you.
* Reset the index of `hosts` using `.reset_index()`, which you'll need to save as the `hosts` DataFrame.

In [None]:
# Left join editions and ioc_codes: hosts
hosts = pd.merge(editions, ioc_codes, how='left')
hosts.head()

In [None]:
# Extract relevant columns and set index: hosts
hosts = hosts[['Edition', 'NOC']].set_index('Edition')
hosts.head()

In [None]:
# Find the rows of hosts with missing 'NOC' values (fixed in the next cell)
hosts.loc[hosts.NOC.isnull()]

In [None]:
hosts.loc[1972, 'NOC'] = 'FRG'
hosts.loc[1980, 'NOC'] = 'URS'
hosts.loc[1988, 'NOC'] = 'KOR'

In [None]:
# Reset Index of hosts: hosts
hosts.reset_index(inplace=True)

In [None]:
hosts.head()

#### Reshaping for analysis

This exercise starts off with `fractions_change` and `hosts` already loaded.

Your task here is to reshape the `fractions_change` DataFrame for later analysis.

Initially, `fractions_change` is a wide DataFrame of 26 rows (one for each Olympic edition) and 139 columns (one for the edition and 138 for the competing countries).

On reshaping with `pd.melt()`, as you will see, the result is a tall DataFrame with 3588 rows and 3 columns that summarizes the fractional change in the expanding mean of the percentage of medals won for each country in blocks.
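
The row count follows from the reshape itself: each of the 138 country columns contributes one row per edition, so 26 × 138 = 3588. A minimal sketch on a hypothetical 2 × 2 case (the values in `wide` are made up) shows the same pattern:

```python
import pandas as pd

# Hypothetical wide table: one row per edition, one column per country
wide = pd.DataFrame({'Edition': [2004, 2008],
                     'CHN': [1.5, 13.0],
                     'USA': [-1.0, -0.5]})

# Mirror the pivot table output, whose columns axis is named 'NOC';
# melt uses that name for the new variable column
wide.columns.name = 'NOC'

# One row per (edition, country) pair; the values go into 'Change'
tall = pd.melt(wide, id_vars='Edition', value_name='Change')
print(tall)
```

Here 2 editions × 2 countries give 4 rows and 3 columns: `'Edition'`, `'NOC'`, and `'Change'`.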

**Instructions**

* Create a DataFrame `reshaped` by reshaping the DataFrame `fractions_change` with `pd.melt()`.
* You'll need to use the keyword argument `id_vars='Edition'` to set the identifier variable.
* You'll also need to use the keyword argument `value_name='Change'` to name the column that holds the measured values.
* Print the shape of the DataFrames `reshaped` and `fractions_change`. This has been done for you.
* Create a DataFrame `chn` by extracting all the rows from `reshaped` in which the three letter code for each country (`'NOC'`) is `'CHN'`.
* Print the last 5 rows of the DataFrame `chn` using the `.tail()` method.

In [None]:
# Reshape fractions_change: reshaped
reshaped = pd.melt(fractions_change, id_vars='Edition', value_name='Change')

# Print reshaped.shape and fractions_change.shape
reshaped.shape

In [None]:
fractions_change.shape

In [None]:
# Extract rows from reshaped where 'NOC' == 'CHN': chn
chn = reshaped[reshaped.NOC == 'CHN']

# Print last 5 rows of chn with .tail()
chn.tail()

**On looking at the hosting countries from the last 5 Olympic editions and the fractional change of medals won by China the last 5 editions, you can see that China fared significantly better in 2008 (i.e., when China was the host country).**

#### Merging to compute influence

This exercise starts off with the DataFrames `reshaped` and `hosts` in the namespace.

Your task is to merge the two DataFrames and tidy the result.

The end result is a DataFrame summarizing the fractional change in the expanding mean of the percentage of medals won for the host country in each Olympic edition.
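
Because no `on` argument is given, `pd.merge()` joins on every column the two frames share, here `'Edition'` and `'NOC'`, which is what filters the result down to host countries. A small sketch with hypothetical change values:

```python
import pandas as pd

# Hypothetical tall table of changes for two countries over two editions
reshaped_demo = pd.DataFrame({
    'Edition': [2004, 2004, 2008, 2008],
    'NOC': ['CHN', 'GRE', 'CHN', 'GRE'],
    'Change': [1.0, 2.0, 3.0, 4.0],
})

# Host country of each edition, deliberately listed out of order
hosts_demo = pd.DataFrame({'Edition': [2008, 2004], 'NOC': ['CHN', 'GRE']})

# The inner join keeps only rows where both 'Edition' and 'NOC' match a host
merged_demo = pd.merge(reshaped_demo, hosts_demo, how='inner')

# Index by edition and sort chronologically
influence_demo = merged_demo.set_index('Edition').sort_index()
print(influence_demo)
```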

**Instructions**

* Merge `reshaped` and `hosts` using an inner join. Remember, `how='inner'` is the default behavior for `pd.merge()`.
* Print the first 5 rows of the DataFrame `merged`. This has been done for you. You should see that the rows are jumbled chronologically.
* Set the index of `merged` to be `'Edition'` and sort the index.
* Print the first 5 rows of the DataFrame `influence`.

In [None]:
# Merge reshaped and hosts: merged
merged = pd.merge(reshaped, hosts, how='inner')
# Print first 5 rows of merged
merged.head()

In [None]:
# Set Index of merged and sort it: influence
influence = merged.set_index('Edition').sort_index()

# Print first 5 rows of influence
influence.head()

#### Plotting influence of host country

This final exercise starts off with the DataFrames `influence` and `editions` in the namespace. Your job is to plot the influence of being a host country.

**Instructions**

* Create a Series called `change` by extracting the `'Change'` column from `influence`.
* Create a bar plot of `change` using the `.plot()` method with `kind='bar'`. Save the result as `ax` to permit further customization.
* Customize the bar plot of `change` to improve readability:
    * Apply the method `.set_ylabel("% Change of Host Country Medal Count")` to `ax`.
    * Apply the method `.set_title("Is there a Host Country Advantage?")` to `ax`.
    * Apply the method `.set_xticklabels(editions['City'])` to `ax`.
* Reveal the final plot using `plt.show()`.

In [None]:
# Extract influence['Change']: change
change = influence['Change']

# Make bar plot of change: ax
ax = change.plot(kind='bar')

# Customize the plot to improve readability
ax.set_ylabel("% Change of Host Country Medal Count")
ax.set_title("Is there a Host Country Advantage?")
ax.set_xticklabels(editions['City'])

# Display the plot
plt.show()

# Certificate

![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2019-03-23_merging_dataframes_with_pandas/2019-05-02_merging_dataframes_with_pandas_certificate.JPG)