*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

# Combining Datasets: Concat and Append

In [1]:
import pandas as pd
import numpy as np

In [2]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


## Simple Concatenation with ``pd.concat``

In [3]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
s = pd.concat([ser1, ser2, ser2, ser1])

In [4]:
ser1

1    A
2    B
3    C
dtype: object

In [5]:
ser2

4    D
5    E
6    F
dtype: object

In [6]:
s

1    A
2    B
3    C
4    D
5    E
6    F
4    D
5    E
6    F
1    A
2    B
3    C
dtype: object

## Operations on data frames

In [7]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])


In [8]:
df1

Unnamed: 0,A,B
1,A1,B1
2,A2,B2


In [9]:
df2

Unnamed: 0,A,B
3,A3,B3
4,A4,B4


In [10]:
pd.concat([df1, df2])

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


In [11]:
df1 = make_df('AB', [1, 2])
df2 = make_df('BC', [3, 4])

In [12]:
df1

Unnamed: 0,A,B
1,A1,B1
2,A2,B2


In [13]:
df2

Unnamed: 0,B,C
3,B3,C3
4,B4,C4


In [14]:
pd.concat([df1, df2])

Unnamed: 0,A,B,C
1,A1,B1,
2,A2,B2,
3,,B3,C3
4,,B4,C4


In [15]:
df1 = make_df('AB', [1, 2])
df2 = make_df('CD', [3, 4])

In [16]:
df1

Unnamed: 0,A,B
1,A1,B1
2,A2,B2


In [17]:
df2

Unnamed: 0,C,D
3,C3,D3
4,C4,D4


In [18]:
pd.concat([df1, df2])

Unnamed: 0,A,B,C,D
1,A1,B1,,
2,A2,B2,,
3,,,C3,D3
4,,,C4,D4


By default, the concatenation takes place row-wise within the ``DataFrame`` (i.e., ``axis=0``).
Like ``np.concatenate``, ``pd.concat`` allows specification of an axis along which concatenation will take place.
Consider the following example:

In [19]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])

In [20]:
df3

Unnamed: 0,A,B
0,A0,B0
1,A1,B1


In [21]:
df4

Unnamed: 0,C,D
0,C0,D0
1,C1,D1


In [22]:
pd.concat([df3, df4], axis='columns')

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
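
As a small check on the `axis` argument, the sketch below rebuilds `df3` and `df4` by hand (rather than with `make_df`, so the block is self-contained) and compares the string form `axis='columns'` with the integer form `axis=1`. The names `by_name` and `by_number` are illustrative only.

```python
import pandas as pd

# Hand-built copies of df3 and df4 from the cells above
df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df4 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])

# axis='columns' and axis=1 both request column-wise concatenation
by_name = pd.concat([df3, df4], axis='columns')
by_number = pd.concat([df3, df4], axis=1)

print(list(by_name.columns))
print(by_name.equals(by_number))
```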


In [23]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [1, 2])

In [24]:
df3

Unnamed: 0,A,B
0,A0,B0
1,A1,B1


In [25]:
df4

Unnamed: 0,C,D
1,C1,D1
2,C2,D2


In [26]:
pd.concat([df3, df4], axis='columns')

Unnamed: 0,A,B,C,D
0,A0,B0,,
1,A1,B1,C1,D1
2,,,C2,D2


### Duplicate indices


In [27]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!

In [28]:
x

Unnamed: 0,A,B
0,A0,B0
1,A1,B1


In [29]:
y

Unnamed: 0,A,B
0,A2,B2
1,A3,B3


In [30]:
pd.concat([x, y])

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
0,A2,B2
1,A3,B3


#### Catching the repeats as an error

If you'd like to simply verify that the indices in the result of ``pd.concat()`` do not overlap, you can specify the ``verify_integrity`` flag.
With this set to ``True``, the concatenation will raise an exception if there are duplicate indices.
Here is an example, where for clarity we'll catch and print the error message:

In [31]:
# pd.concat([x, y], verify_integrity=True)  # if uncommented, this raises a ValueError

In [32]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
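
If raising an exception is not desired, one possible alternative is to inspect the index of the result after the fact. The sketch below is an assumption about how one might do that, using `Index.is_unique` and `Index.duplicated()`; `x` and `y` are rebuilt by hand so the block runs on its own, and `combined` and `dupes` are illustrative names.

```python
import pandas as pd

# Hand-built copies of x and y, both indexed by [0, 1]
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

combined = pd.concat([x, y])

# is_unique is False when any index label appears more than once
print(combined.index.is_unique)

# Collect the labels that are repeated in the combined index
dupes = combined.index[combined.index.duplicated()].unique()
print(list(dupes))
```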


#### Ignoring the index

Sometimes the index itself does not matter, and you would prefer it to simply be ignored.
This option can be specified using the ``ignore_index`` flag.
With this set to ``True``, the concatenation will create a new integer index for the resulting ``DataFrame``:

In [33]:
x

Unnamed: 0,A,B
0,A0,B0
1,A1,B1


In [34]:
y

Unnamed: 0,A,B
0,A2,B2
1,A3,B3


In [35]:
pd.concat([x, y], ignore_index=True)

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3
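
For comparison, a similar result can be sketched by concatenating first and then discarding the index with `reset_index(drop=True)`. This is only an illustration of what `ignore_index=True` does, not a claim about how pandas implements it; `x` and `y` are rebuilt by hand, and `ignored` and `reset` are illustrative names.

```python
import pandas as pd

# Hand-built copies of x and y, both indexed by [0, 1]
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# Let concat build a fresh integer index
ignored = pd.concat([x, y], ignore_index=True)

# Concatenate with the original labels, then drop them afterwards
reset = pd.concat([x, y]).reset_index(drop=True)

print(list(ignored.index))
print(list(reset.index))
```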


#### Adding MultiIndex keys

Another option is to use the ``keys`` option to specify a label for each data source; the result is a hierarchically indexed ``DataFrame`` containing the data:

In [36]:
pd.concat([x, y], keys=['x', 'y'])

Unnamed: 0,Unnamed: 1,A,B
x,0,A0,B0
x,1,A1,B1
y,0,A2,B2
y,1,A3,B3


In [37]:
pd.concat([x, y], keys=['sales', 'marketing'])

Unnamed: 0,Unnamed: 1,A,B
sales,0,A0,B0
sales,1,A1,B1
marketing,0,A2,B2
marketing,1,A3,B3
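
One reason to add keys is that each source can later be pulled back out by its label. The following sketch, which rebuilds `x` and `y` by hand and uses the illustrative name `combined`, shows selection on the outer level with `.loc`.

```python
import pandas as pd

# Hand-built copies of x and y, both indexed by [0, 1]
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

combined = pd.concat([x, y], keys=['x', 'y'])

# The result carries a two-level index: (key, original label)
print(combined.index.nlevels)

# Selecting on the outer level recovers the rows that came from y
print(combined.loc['y'])

# A (key, label) tuple addresses a single row; here column A of y's row 1
print(combined.loc[('y', 1), 'A'])
```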


### Concatenation with joins

In the simple examples we just looked at, we were mainly concatenating ``DataFrame``s with shared column names.
In practice, data from different sources might have different sets of column names, and ``pd.concat`` offers several options in this case.
Consider the concatenation of the following two ``DataFrame``s, which have some (but not all!) columns in common:

In [38]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])

In [39]:
df5

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2


In [40]:
df6

Unnamed: 0,B,C,D
3,B3,C3,D3
4,B4,C4,D4


In [41]:
pd.concat([df5, df6])

Unnamed: 0,A,B,C,D
1,A1,B1,C1,
2,A2,B2,C2,
3,,B3,C3,D3
4,,B4,C4,D4


In [42]:
pd.concat([df5, df6]).dropna(axis=1)

Unnamed: 0,B,C
1,B1,C1
2,B2,C2
3,B3,C3
4,B4,C4


In [43]:
pd.concat([df5, df6], join='inner')

Unnamed: 0,B,C
1,B1,C1
2,B2,C2
3,B3,C3
4,B4,C4


In [44]:
pd.concat([df5, df6], join='outer')  # outer is the default

Unnamed: 0,A,B,C,D
1,A1,B1,C1,
2,A2,B2,C2,
3,,B3,C3,D3
4,,B4,C4,D4
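
The `join` option applies to whichever axis is not being concatenated. As a sketch of that, the block below reuses the earlier column-wise example, with hand-built copies of `df3` (index `[0, 1]`) and `df4` (index `[1, 2]`), and asks for the intersection of the row labels. The name `inner` is illustrative.

```python
import pandas as pd

# Hand-built copies of the df3/df4 pair whose indices only partly overlap
df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df4 = pd.DataFrame({'C': ['C1', 'C2'], 'D': ['D1', 'D2']}, index=[1, 2])

# Concatenate column-wise, keeping only row labels present in both inputs
inner = pd.concat([df3, df4], axis='columns', join='inner')
print(inner)
```

Only the row labeled `1` appears in both inputs, so it is the only row kept.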


In [45]:
# pd.concat([df5, df6], join_axes=[df5.columns])  # join_axes was removed in pandas 1.0, so this no longer works

In [46]:
pd.concat([df5, df6])[['A', 'B', 'C']]

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2
3,,B3,C3
4,,B4,C4
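
Another way to keep only the columns of the first frame, without typing the column names, is to reindex the concatenated result against `df5.columns`. This is a sketch of one possible replacement for the removed `join_axes` argument, not the only one; `df5` and `df6` are rebuilt by hand and `result` is an illustrative name.

```python
import pandas as pd

# Hand-built copies of df5 (columns A, B, C) and df6 (columns B, C, D)
df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Outer-join concat, then restrict the columns to those of df5
result = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(result)
```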


### The ``append()`` method

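Because direct array concatenation is so common, ``Series`` and ``DataFrame`` objects have historically had an ``append`` method that accomplishes the same thing in fewer keystrokes.
Note that ``append`` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells below run only on older versions.
For example, rather than calling ``pd.concat([df1, df2])``, you can simply call ``df1.append(df2)``: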

In [47]:
df1

Unnamed: 0,A,B
1,A1,B1
2,A2,B2


In [48]:
df2

Unnamed: 0,C,D
3,C3,D3
4,C4,D4


In [49]:
df1.append(df2)

Unnamed: 0,A,B,C,D
1,A1,B1,,
2,A2,B2,,
3,,,C3,D3
4,,,C4,D4
