# Advanced Indexing, Merging and Reshaping Data

Data can be organized in many forms. Today, we will discuss some of these variants and methods for processing them. In addition, we will discuss some common methods for merging multiple data sources together.

Friendly Reminders:

* DataCamp Module - Bringing it all together! (Python Data Science Toolbox), due tonight by 11:59 p.m.
* Homework #4 due Thursday by 11:59 p.m.
* Project progress report due March 28 by 11:59 p.m.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## Alternative Indexing and Selection

We have explored several approaches for selecting values within a Series or DataFrame, including:

* .loc, .iloc methods

Alternatively, we can use filtering approaches to reduce a Series or DataFrame object to observations of interest.

By default, when we create a Series or DataFrame object, we are assigned an Index that ranges from 0 to one less than the length of the object. Selecting values according to this default Index is not particularly convenient, because we need to know the positional location of the value. Alternatively, it can be useful to use a more meaningful index that makes it easier to find the observations that we want to select.
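
As a minimal sketch of the contrast described above, the following compares label-based and position-based selection on a Series with a meaningful index. The names and values (`scores`, `'alice'`, and so on) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
scores = pd.Series([88, 92, 79], index=['alice', 'bob', 'carol'])

# Label-based selection: no need to know where 'bob' sits in the Series
by_label = scores.loc['bob']

# Position-based selection is still available through .iloc
by_position = scores.iloc[1]

print(by_label, by_position)
```

Both lookups return the same value, but only the first one survives a reordering of the data.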

In [2]:
# Define example Series object
ser = Series(np.random.rand(10))
ser

0    0.518709
1    0.810080
2    0.895074
3    0.700830
4    0.150092
5    0.858899
6    0.387727
7    0.064722
8    0.976734
9    0.632055
dtype: float64

In [3]:
# Basic Series indexing
ser[5]

0.8588992972507546

When importing data (primarily applicable to DataFrames), we have the option of specifying the *index_col* argument, which can set the index to a specific column of the data. This approach works well if we would like to access the values with an alternative index.
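
For readers without the data file at hand, here is a small self-contained sketch of the same idea. It assumes a hypothetical three-row CSV held in memory (via `io.StringIO`) rather than a file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a file on disk
csv_text = """Rank,Name,Platform
1,Wii Sports,Wii
2,Super Mario Bros.,NES
3,Mario Kart Wii,Wii
"""

# index_col moves the Rank column out of the data and into the index
games = pd.read_csv(io.StringIO(csv_text), index_col='Rank')

# .loc now selects by rank label rather than by position
second = games.loc[2, 'Name']
print(second)
```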

In [4]:
# Import video game data
# path = '/Users/seanbarnes/Dropbox/Teaching/Courses/BUDT758X/data/'
df = pd.read_csv('vgsales.csv', index_col='Rank')
df.head()

Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [5]:
# Select row(s) via rank index
df.loc[100]

Name              Battlefield 3
Platform                   X360
Year                       2011
Genre                   Shooter
Publisher       Electronic Arts
NA_Sales                   4.46
EU_Sales                   2.13
JP_Sales                   0.06
Other_Sales                0.69
Global_Sales               7.34
Name: 100, dtype: object

In [6]:
# Select row(s) via filtering
df[df.index == 100]

Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
100,Battlefield 3,X360,2011.0,Shooter,Electronic Arts,4.46,2.13,0.06,0.69,7.34


Sometimes, it's useful to make adjustments to the index, based on the next step in your processing/analysis. There are two primary methods for doing this:

* .set_index(*keys*, *drop*=True, *inplace*=False) - Reindex object using specific column(s) in the DataFrame
* .reset_index(*level*, *drop*=False, *inplace*=False) - Shift one or more index levels back to DataFrame columns

The *drop* argument controls what happens to the data being moved. For .set_index, *drop*=True removes the chosen column(s) from the DataFrame columns once they become the index. For .reset_index, *drop*=True discards the index level(s) instead of inserting them as columns.
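
The effect of *drop* on each method can be sketched with a toy DataFrame. The frame `df_toy` and its column names are hypothetical, used only to make the four cases easy to compare.

```python
import pandas as pd

# Hypothetical toy data
df_toy = pd.DataFrame({'id': [10, 20, 30], 'value': ['x', 'y', 'z']})

# set_index with drop=True (default): 'id' leaves the columns
indexed = df_toy.set_index('id')

# set_index with drop=False: 'id' is the index and also stays as a column
indexed_keep = df_toy.set_index('id', drop=False)

# reset_index with drop=False (default): the index returns as a column
restored = indexed.reset_index()

# reset_index with drop=True: the index is discarded entirely
discarded = indexed.reset_index(drop=True)

print(list(indexed.columns))
print(list(indexed_keep.columns))
print(list(restored.columns))
print(list(discarded.columns))
```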

In [7]:
# Reset index - drop=False
df.reset_index().head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [8]:
# Reset index - drop=True
df.reset_index(drop=True).head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


### 2-Minute Activity

Create a new DataFrame (*df_nes*) that only contains NES games, then preview the first 5 rows. What do you notice about the index when you perform this filtering?

In [9]:
# First, reset index of the video game DataFrame
df.reset_index(inplace=True)
# inplace=True updates the current DataFrame rather than returning a modified copy

In [10]:
# Then, filter to games released on the NES platform
df_nes = df[df.Platform == 'NES']
df_nes.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
9,10,Duck Hunt,NES,1984.0,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31
22,23,Super Mario Bros. 3,NES,1988.0,Platform,Nintendo,9.54,3.44,3.84,0.46,17.28
96,97,Super Mario Bros. 2,NES,1988.0,Platform,Nintendo,5.39,1.18,0.7,0.19,7.46
127,128,The Legend of Zelda,NES,1986.0,Action,Nintendo,3.74,0.93,1.69,0.14,6.51


In [11]:
df_nes.reset_index(drop=True).head()
# drop=True discards the old positional index rather than adding it back as a column

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
1,10,Duck Hunt,NES,1984.0,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31
2,23,Super Mario Bros. 3,NES,1988.0,Platform,Nintendo,9.54,3.44,3.84,0.46,17.28
3,97,Super Mario Bros. 2,NES,1988.0,Platform,Nintendo,5.39,1.18,0.7,0.19,7.46
4,128,The Legend of Zelda,NES,1986.0,Action,Nintendo,3.74,0.93,1.69,0.14,6.51


## Hierarchical Indexing

Hierarchical indexing in pandas allows you to have multiple index levels on an axis (e.g., row or column). This functionality lets us work with higher-dimensional data and select observations more directly (as in the previous example).

There are multiple ways of specifying a hierarchical index (MultiIndex) for a Series or DataFrame object:

1. On creation, using the *index* or *columns* arguments for the Series or DataFrame functions
2. Assigning to the .index and/or .columns attributes
3. Using the .reindex method, specifying the *index* and/or *columns* arguments
4. Using the .set_index method, specifying multiple columns (e.g., as a list) to the *keys* argument

A MultiIndex object is essentially a sequence of unique tuples, where each value of the tuple corresponds to a particular level of the hierarchy. Together, the values in the tuple specify the index of the observation(s). There are three ways to create a MultiIndex object:

1. pd.MultiIndex.from_arrays - Corresponding elements of each array are zipped into tuples of hierarchical indices
2. pd.MultiIndex.from_tuples - Provide list of hierarchical index tuples explicitly
3. pd.MultiIndex.from_product - Generates all combinations of indices with values from each iterable (level)

The *from_arrays* approach (without creating the MultiIndex) is also accepted for many of the aforementioned options for specifying a hierarchical index (e.g., 1-3 above).
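
A minimal sketch of that shortcut: passing a list of arrays straight to the *index* argument should yield the same MultiIndex as the explicit constructors. The variable names here (`ser_implicit`, `midx_explicit`, `midx_product`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

nums = [1, 1, 1, 2, 2, 2, 3, 3, 3]
lets = ['a', 'b', 'c'] * 3

# A list of arrays passed as the index builds a MultiIndex implicitly
ser_implicit = pd.Series(np.arange(9), index=[nums, lets])

# Explicit constructions for comparison
midx_explicit = pd.MultiIndex.from_arrays([nums, lets])
midx_product = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b', 'c']])

print(isinstance(ser_implicit.index, pd.MultiIndex))
print(ser_implicit.index.equals(midx_explicit))
print(midx_explicit.equals(midx_product))
```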

Additional comments:

* Data alignment still occurs with hierarchical indexing, so you have to be careful if performing operations or comparisons between hierarchically indexed objects.
* Our coverage today focuses on hierarchically indexed *rows*, but the methods described apply to hierarchically indexed *columns* as well
* For additional details on MultiIndex objects and hierarchical indexing, see https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
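
To illustrate the data alignment caveat in the first comment above, here is a small sketch that adds two Series whose hierarchical indices only partly overlap. The index tuples and values are hypothetical.

```python
import pandas as pd

# Two hypothetical indices that share only (1, 'a') and (2, 'a')
idx_a = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'a')])
idx_b = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'a'), (2, 'b')])

s_a = pd.Series([1.0, 2.0, 3.0], index=idx_a)
s_b = pd.Series([10.0, 20.0, 30.0], index=idx_b)

# Addition aligns on the full (outer, inner) tuple;
# tuples present in only one Series produce NaN
total = s_a + s_b
print(total)
```

The result covers the union of both indices, so it is worth checking for missing values after arithmetic on hierarchically indexed objects.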

In [12]:
# MultiIndex from arrays
nums = [1,1,1,2,2,2,3,3,3]
lets = ['a','b','c'] * 3
midx = pd.MultiIndex.from_arrays([nums,lets])
midx

MultiIndex(levels=[[1, 2, 3], ['a', 'b', 'c']],
           codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])

In [13]:
ser = Series(np.random.randn(9), index=midx)
ser

1  a   -1.469009
   b    1.322576
   c   -0.245392
2  a   -1.302881
   b    1.316681
   c    0.444896
3  a    1.632043
   b    0.330817
   c    0.583783
dtype: float64

In [14]:
# Equivalent MultiIndex from tuples - Explicit
midx = pd.MultiIndex.from_tuples([(1,'a'),(1,'b'),(1,'c'),(2,'a'),(2,'b'),(2,'c'),(3,'a'),(3,'b'),(3,'c')])
midx

MultiIndex(levels=[[1, 2, 3], ['a', 'b', 'c']],
           codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])

In [15]:
# Generating tuples in a Pythonic way
tuples = list(zip(*[nums,lets]))
print(tuples)

[(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c'), (3, 'a'), (3, 'b'), (3, 'c')]


In [16]:
# Equivalent MultiIndex from tuples
midx = pd.MultiIndex.from_tuples(tuples)
midx

MultiIndex(levels=[[1, 2, 3], ['a', 'b', 'c']],
           codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])

In [17]:
# Equivalent MultiIndex from product
midx = pd.MultiIndex.from_product([[1,2,3],['a','b','c']])
midx

MultiIndex(levels=[[1, 2, 3], ['a', 'b', 'c']],
           codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])

In [18]:
# Note the difference from atomic tuple indices
ser.index = tuples
ser

(1, a)   -1.469009
(1, b)    1.322576
(1, c)   -0.245392
(2, a)   -1.302881
(2, b)    1.316681
(2, c)    0.444896
(3, a)    1.632043
(3, b)    0.330817
(3, c)    0.583783
dtype: float64

In [19]:
# .reindex method
ser = ser.reindex(index=midx)
ser

1  a   -1.469009
   b    1.322576
   c   -0.245392
2  a   -1.302881
   b    1.316681
   c    0.444896
3  a    1.632043
   b    0.330817
   c    0.583783
dtype: float64

In [20]:
# set_index method
df = df.set_index(keys=['Platform','Genre']).sort_index(level=[0,1])
df.head(10)

Platform,Genre,Rank,Name,Year,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
2600,Action,736,Frogger,1981.0,Parker Bros.,2.06,0.12,0.0,0.02,2.2
2600,Action,866,E.T.: The Extra Terrestrial,1981.0,Atari,1.84,0.11,0.0,0.02,1.97
2600,Action,1587,Combat,,Atari,1.17,0.07,0.0,0.01,1.25
2600,Action,2234,Spider-Man,1981.0,Parker Bros.,0.87,0.05,0.0,0.01,0.93
2600,Action,2518,Custer's Revenge,1981.0,Mystique,0.76,0.05,0.0,0.01,0.82
2600,Action,2598,Alien,1981.0,20th Century Fox Video Games,0.74,0.04,0.0,0.01,0.79
2600,Action,2666,Air Raid,1981.0,Men-A-Vision,0.72,0.04,0.0,0.01,0.77
2600,Action,2674,Crystal Castles,1983.0,Atari,0.72,0.04,0.0,0.01,0.77
2600,Action,2942,King Kong,1981.0,Tigervision,0.65,0.04,0.0,0.01,0.69
2600,Action,3046,Adventures of Tron,1981.0,Mattel Interactive,0.63,0.03,0.0,0.01,0.67


The advantage of hierarchical indexing over the atomic approach is that we can perform various operations across specific levels of the hierarchy:

* Selection
* Summary statistics
* Reshaping (next)
* Group operations (after Spring Break)

In [21]:
# Selecting outer level
ser[2]

a   -1.302881
b    1.316681
c    0.444896
dtype: float64

In [22]:
# Selecting multiple levels
ser[2,'a']

-1.3028808948170674

In [23]:
# Selecting inner level
ser[:,'b']

1    1.322576
2    1.316681
3    0.330817
dtype: float64

In [24]:
# Selection - DataFrame
df.loc[('SNES', 'Platform')].head()

Platform,Genre,Rank,Name,Year,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
SNES,Platform,19,Super Mario World,1990.0,Nintendo,12.78,3.75,3.54,0.55,20.61
SNES,Platform,58,Super Mario All-Stars,1993.0,Nintendo,5.99,2.15,2.12,0.29,10.55
SNES,Platform,72,Donkey Kong Country,1994.0,Nintendo,4.36,1.71,3.0,0.23,9.3
SNES,Platform,188,Donkey Kong Country 2: Diddy's Kong Quest,1995.0,Nintendo,2.1,0.74,2.2,0.11,5.15
SNES,Platform,283,Super Mario World 2: Yoshi's Island,1995.0,Nintendo,1.65,0.61,1.76,0.09,4.12


In [25]:
# .xs method
df.xs('Sports', level=1).head()
# Selects the 'Sports' genre (level 1) across all platforms
# Analogous to ser[:,'b']

Platform,Rank,Name,Year,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
2600,3645,The Activision Decathlon,1982.0,Activision,0.52,0.03,0.0,0.01,0.55
2600,3882,Fishing Derby,,Activision,0.48,0.03,0.0,0.01,0.51
2600,4015,RealSports Tennis,1982.0,Atari,0.46,0.03,0.0,0.01,0.5
2600,4027,Ice Hockey,1980.0,Activision,0.46,0.03,0.0,0.01,0.49
2600,5958,RealSports Boxing,1986.0,Atari,0.28,0.02,0.0,0.0,0.29


In [26]:
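# Summary statistics - Series
# (the level argument of .mean was removed in pandas 2.0; grouping by level is equivalent)
ser.groupby(level=1).mean()
# ser.groupby(level=0).mean()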

a   -0.379949
b    0.990025
c    0.261096
dtype: float64

In [27]:
# Summary statistics - DataFrame
df['Global_Sales'].groupby(level=0).max().sort_values(ascending=False).head(10)

Platform
Wii     82.74
NES     40.24
GB      31.37
DS      30.01
X360    21.82
PS3     21.40
PS2     20.81
SNES    20.61
GBA     15.85
3DS     14.35
Name: Global_Sales, dtype: float64

In [31]:
df['Global_Sales'].groupby(level=1).max().sort_values(ascending=False).head(10)

Genre
Sports          82.74
Platform        40.24
Racing          35.82
Role-Playing    31.37
Puzzle          30.26
Misc            29.02
Shooter         28.31
Simulation      24.76
Action          21.40
Fighting        13.04
Name: Global_Sales, dtype: float64

## Reshaping Data

There are several methods for reshaping data structures in pandas:

* .T to transpose DataFrames (i.e., exchange rows with columns)
* .stack, .unstack for pivoting hierarchically indexed data
* .pivot, .melt for converting between long and wide data formats
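
The three families of methods listed above can be sketched on one tiny DataFrame before turning to the real data. The frame `wide` and its labels are hypothetical, and each step is paired with the call that reverses it.

```python
import pandas as pd

# Hypothetical 2x2 frame
wide = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])

# Transpose: rows and columns swap places
transposed = wide.T

# Stack: column labels become the inner level of the row index
stacked = wide.stack()
# Unstack: the inner row level moves back to the columns
unstacked = stacked.unstack()

# Melt: wide -> long (one row per index/variable pair)
long = wide.reset_index().melt(id_vars='index')
# Pivot: long -> wide
back = long.pivot(index='index', columns='variable', values='value')

print(stacked)
print(long)
```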

### Transpose

In [34]:
# Import movies data
mov = pd.read_csv('movies.csv', index_col='Film').sort_index()
mov.head()

Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
(500) Days of Summer,Comedy,Fox,81,8.096,87,60.72,2009
27 Dresses,Comedy,Fox,71,5.343622,40,160.308654,2008
A Dangerous Method,Drama,Independent,89,0.448645,79,8.972895,2011
A Serious Man,Drama,Universal,64,4.382857,89,30.68,2009
Across the Universe,Romance,Independent,84,0.652603,54,29.367143,2007


In [35]:
# Transpose DataFrame
mov[mov['Worldwide Gross'] > 200].head().T

Film,Enchanted,High School Musical 3: Senior Year,It's Complicated,Knocked Up,Mamma Mia!
Genre,Comedy,Comedy,Comedy,Comedy,Comedy
Lead Studio,Disney,Disney,Universal,Universal,Universal
Audience score %,80,76,63,83,76
Profitability,4.00574,22.9131,2.64235,6.6364,9.23445
Rotten Tomatoes %,93,65,56,91,53
Worldwide Gross,340.488,252.045,224.6,219.001,609.474
Year,2007,2008,2009,2007,2008


### Stack/Unstack

In [36]:
# Unstack inner level (default) - Pivots row index to column(s)
ser.unstack()
# Converts the Series into a DataFrame

Unnamed: 0,a,b,c
1,-1.469009,1.322576,-0.245392
2,-1.302881,1.316681,0.444896
3,1.632043,0.330817,0.583783


In [37]:
# Unstack specific level
ser.unstack(level=0)

Unnamed: 0,1,2,3
a,-1.469009,-1.302881,1.632043
b,1.322576,1.316681,0.330817
c,-0.245392,0.444896,0.583783


In [38]:
# Stack - Pivots column index into row(s)
entries = 3
mov.stack().head(entries * len(mov.columns))

Film                                   
(500) Days of Summer  Genre                     Comedy
                      Lead Studio                  Fox
                      Audience  score %             81
                      Profitability              8.096
                      Rotten Tomatoes %             87
                      Worldwide Gross            60.72
                      Year                        2009
27 Dresses            Genre                     Comedy
                      Lead Studio                  Fox
                      Audience  score %             71
                      Profitability            5.34362
                      Rotten Tomatoes %             40
                      Worldwide Gross          160.309
                      Year                        2008
A Dangerous Method    Genre                      Drama
                      Lead Studio          Independent
                      Audience  score %             89
                      Pro

### Pivot/Melt

In [40]:
# Import time series data
ts = pd.read_csv('macrodata1.csv')
ts.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [41]:
# Combine year and quarter into date
ts.insert(0, 'date', pd.PeriodIndex(year=ts.year, quarter=ts.quarter))
del ts['year'], ts['quarter']
ts.head()

Unnamed: 0,date,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959Q1,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959Q2,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959Q3,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959Q4,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960Q1,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [42]:
# Melt - Convert from wide to long format
ts_melt = ts.melt(id_vars='date', value_vars=['realgdp','unemp','infl']).sort_values(by='date')
ts_melt.head()

Unnamed: 0,date,variable,value
0,1959Q1,realgdp,2710.349
203,1959Q1,unemp,5.8
406,1959Q1,infl,0.0
1,1959Q2,realgdp,2778.801
204,1959Q2,unemp,5.1


In [43]:
# Pivot - Convert from long to wide format
ts_melt.pivot(index='date', columns='variable', values='value').head()

date,infl,realgdp,unemp
1959Q1,0.0,2710.349,5.8
1959Q2,2.34,2778.801,5.1
1959Q3,2.74,2775.488,5.3
1959Q4,0.27,2785.204,5.6
1960Q1,2.31,2847.699,5.2


## Merging Data

Oftentimes, data is spread across multiple sources (e.g., files, database tables, websites), and we need to combine them. The two primary functions in pandas for combining data are:

* pd.concat - Concatenates (or stacks) data together, similar to NumPy functions (np.concatenate, stack variations)
* pd.merge - Database-style merge, based on key column(s) and/or Index/MultiIndex

Whether to use pd.concat or pd.merge depends on how the data structures to be combined are organized:

* If the data structures are aligned in some manner (e.g., the row indices are aligned or the column indices are aligned), then concatenation is probably the correct approach
* If you need to combine the data structures based on one or more common keys (e.g., name of person/organization/company/team, record or identification number, reference code), then pd.merge is probably more appropriate.
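
The rule of thumb above can be sketched with two small cases. All frames and values here (`q1`, `q2`, `cities`) are hypothetical: the first pair shares the same columns and is stacked with pd.concat, while the second carries different information linked by a common key and is combined with pd.merge.

```python
import pandas as pd

# Same columns in both pieces: stack the rows with pd.concat
q1 = pd.DataFrame({'team': ['A', 'B'], 'wins': [3, 5]})
q2 = pd.DataFrame({'team': ['C', 'D'], 'wins': [2, 4]})
stacked = pd.concat([q1, q2], ignore_index=True)

# Different information linked by a shared key: combine with pd.merge
cities = pd.DataFrame({'team': ['A', 'B', 'C'],
                       'city': ['Austin', 'Boston', 'Chicago']})
merged = pd.merge(stacked, cities, on='team')  # default how='inner'

print(stacked)
print(merged)
```

Team D has no match in `cities`, so the default inner join leaves it out of `merged`.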

### Concatenation

In [None]:
pd.concat?

In [44]:
# Import summer Olympics data - www.sports-reference.com
so = pd.read_html('https://www.sports-reference.com/olympics/summer/', match='City')[0]
so['Type'] = 'Summer'
so.head()

Unnamed: 0,Year,City,Country,Countries,Participants,Men,Women,Sports,Events,Type
0,2016,Rio de Janeiro,Brazil,207,11191,6147,5037,34,306,Summer
1,2012,London,Great Britain,205,10517,5864,4653,32,302,Summer
2,2008,Beijing,China,204,10902,6290,4610,34,303,Summer
3,2004,Athina,Greece,201,10560,6257,4303,34,301,Summer
4,2000,Sydney,Australia,200,10648,6579,4068,34,300,Summer


In [45]:
# Import winter Olympics data - www.sports-reference.com
wo = pd.read_html('https://www.sports-reference.com/olympics/winter/', match='City')[0]
wo['Type'] = 'Winter'
wo.head()

Unnamed: 0,Year,City,Country,Countries,Participants,Men,Women,Sports,Events,Type
0,2014,Sochi,Russia,89,2749,1643,1103,15,98,Winter
1,2010,Vancouver,Canada,82,2536,1503,1033,15,86,Winter
2,2006,Torino,Italy,79,2494,1539,955,15,84,Winter
3,2002,Salt Lake City,United States,77,2399,1513,886,15,78,Winter
4,1998,Nagano,Japan,72,2180,1390,789,14,68,Winter


In [46]:
# Concatenate DataFrames together
olympics = pd.concat([so, wo], axis=0, ignore_index=True)
olympics

Unnamed: 0,Year,City,Country,Countries,Participants,Men,Women,Sports,Events,Type
0,2016,Rio de Janeiro,Brazil,207,11191,6147,5037,34,306,Summer
1,2012,London,Great Britain,205,10517,5864,4653,32,302,Summer
2,2008,Beijing,China,204,10902,6290,4610,34,303,Summer
3,2004,Athina,Greece,201,10560,6257,4303,34,301,Summer
4,2000,Sydney,Australia,200,10648,6579,4068,34,300,Summer
5,1996,Atlanta,United States,197,10344,6820,3521,31,271,Summer
6,1992,Barcelona,Spain,169,9386,6659,2721,29,257,Summer
7,1988,Seoul,South Korea,159,8454,6249,2203,27,237,Summer
8,1984,Los Angeles,United States,140,6799,5224,1567,26,221,Summer
9,1980,Moskva,Soviet Union,80,5259,4135,1123,23,203,Summer


In [47]:
# Re-sort DataFrame
olympics.sort_values(by=['Year','Type'], ascending=False)

Unnamed: 0,Year,City,Country,Countries,Participants,Men,Women,Sports,Events,Type
0,2016,Rio de Janeiro,Brazil,207,11191,6147,5037,34,306,Summer
29,2014,Sochi,Russia,89,2749,1643,1103,15,98,Winter
1,2012,London,Great Britain,205,10517,5864,4653,32,302,Summer
30,2010,Vancouver,Canada,82,2536,1503,1033,15,86,Winter
2,2008,Beijing,China,204,10902,6290,4610,34,303,Summer
31,2006,Torino,Italy,79,2494,1539,955,15,84,Winter
3,2004,Athina,Greece,201,10560,6257,4303,34,301,Summer
32,2002,Salt Lake City,United States,77,2399,1513,886,15,78,Winter
4,2000,Sydney,Australia,200,10648,6579,4068,34,300,Summer
33,1998,Nagano,Japan,72,2180,1390,789,14,68,Winter


### Merging

In [None]:
pd.merge?

In [48]:
# Create DataFrame example 1
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data': range(7)})
df1

Unnamed: 0,key,data
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [49]:
# Create DataFrame example 2
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data': range(3)})
df2

Unnamed: 0,key,data
0,a,0
1,b,1
2,d,2


In [50]:
# Join by key on both DataFrames - Default (how='inner'), overlapping keys only
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data_x,data_y
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [51]:
# Join by key on both DataFrames - All keys
pd.merge(df1, df2, on='key', how='outer')

Unnamed: 0,key,data_x,data_y
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0


In [52]:
# Update df2 index
df2.set_index('key', inplace=True)
df2

key,data
a,0
b,1
d,2


In [53]:
# Merge based on index
# pd.merge can join on column + index (as here), column + column, or index + index
pd.merge(df1, df2, left_on='key', right_index=True, how='left', suffixes=('','_new'))

Unnamed: 0,key,data,data_new
0,b,0,1.0
1,b,1,1.0
2,a,2,0.0
3,c,3,
4,a,4,0.0
5,a,5,0.0
6,b,6,1.0
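
The three key combinations named in the comment of the cell above can be sketched side by side. The frames `left` and `right` are hypothetical stand-ins with distinct value columns, so no suffixes are needed, and all three merges use the default inner join.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'a', 'c'], 'lval': [0, 1, 2]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [10, 11, 12]})

# Column + column: both frames carry the key as a regular column
col_col = pd.merge(left, right, on='key')

# Column + index: the right frame carries the key in its index
col_idx = pd.merge(left, right.set_index('key'),
                   left_on='key', right_index=True)

# Index + index: both frames carry the key in their index
idx_idx = pd.merge(left.set_index('key'), right.set_index('key'),
                   left_index=True, right_index=True)

print(col_col)
print(col_idx)
print(idx_idx)
```

Each result keeps only the keys present in both frames ('a' and 'b'); what differs is whether the key ends up as a column or as the index.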


In [54]:
df1

Unnamed: 0,key,data
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


## Next Time: Data Wrangling Lab