# `pandas` - Concatenation

__Contents:__

1.  Concatenating objects
2.  Set logic on other axes
3.  Concatenating using append
4.  Ignoring indexes on the concatenation axis
5.  Concatenating with mixed ndims

Related/useful documentation:
- https://pandas.pydata.org/pandas-docs/stable/merging.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html

In [4]:
from IPython.display import HTML, display
from PIL import Image

### Load libraries

In [6]:
import pandas as pd
import numpy as np
(pd.__version__, np.__version__)

__Concatenation__

The `concat` function does the heavy lifting: it concatenates objects along one axis and applies optional set logic (union or intersection) to the indexes, if any, on the other axes.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html

__1. Concatenating Objects__

`pandas.concat` takes a list or dict of homogeneously-typed objects and concatenates them, with configurable handling of “what to do with the other axes”:

__Example__

Define two sample pandas DataFrames, `df1` and `df2`:

In [10]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df1

In [11]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df2

In [12]:
frames = [df1, df2]

Use the `concat` function to concatenate the two DataFrames `df1` and `df2`. The concatenated DataFrame is stored in `result`.

In [14]:
result = pd.concat(frames)
result

Suppose we wanted to associate specific keys with each of the pieces of the chopped up DataFrame. We can use the keys argument:

In [16]:
result = pd.concat(frames, keys=['x', 'y'])
result

The resulting object has a hierarchical index (a `MultiIndex`), so we can select each chunk by key:

In [18]:
result.loc['y']
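
The dict input form mentioned earlier can be sketched as follows. This is a minimal illustration on two small stand-in frames (the shortened `df1` and `df2` here are hypothetical, not the ones defined above): passing a dict lets its keys act as the `keys` argument.

```python
import pandas as pd

# Small stand-in frames, assumed for illustration only
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

# List input with explicit keys
by_keys = pd.concat([df1, df2], keys=['x', 'y'])

# Dict input: the dict keys become the outer level of the row index
by_dict = pd.concat({'x': df1, 'y': df2})

# Both results have a two-level row index, and each chunk is selectable by key
print(by_keys.index.nlevels)
print(by_dict.loc['y'])
```

Either form should give the same result here; the dict form can be convenient when the pieces are already stored under meaningful names.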

__2. Set logic on other axes__

When concatenating multiple DataFrames, you can choose how the other axes (the ones not being concatenated) are handled. There are three ways:

1. `join = 'outer'` takes the union of all labels, so no information is lost
2. `join = 'inner'` takes the intersection
3. `join_axes` uses a specific index or indexes (this argument was removed in pandas 1.0; on newer versions, reindex the result instead)

`join = 'outer'` (the default)

In [21]:
df3 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])
df3

In [22]:
df1

In [23]:
result = pd.concat([df1, df3], axis=1)
result

`join = 'inner'`

In [25]:
result = pd.concat([df1, df3], axis=1, join='inner')
result

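`join_axes` (removed in pandas 1.0; reindexing the result is the replacement)

In [27]:
# join_axes=[df1.index] only works on pandas < 1.0; reindex keeps df1's row labels instead
result = pd.concat([df1, df3], axis=1).reindex(df1.index)
result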
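
To compare the three options side by side, here is a minimal sketch on two small stand-in frames (`left` and `right` are hypothetical names, chosen so the column names do not collide). It prints the row labels that each option keeps.

```python
import pandas as pd

# Hypothetical stand-in frames with partly overlapping row labels
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']},
                    index=[0, 1, 2, 3])
right = pd.DataFrame({'D': ['D2', 'D3', 'D6', 'D7'],
                      'F': ['F2', 'F3', 'F6', 'F7']},
                     index=[2, 3, 6, 7])

# Union of row labels; missing cells are filled with NaN
outer = pd.concat([left, right], axis=1)

# Intersection of row labels
inner = pd.concat([left, right], axis=1, join='inner')

# Row labels of the left frame only (replacement for the old join_axes argument)
left_only = pd.concat([left, right], axis=1).reindex(left.index)

print(list(outer.index))
print(list(inner.index))
print(list(left_only.index))
```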

__3. Concatenating using append__

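A useful shortcut to `concat` is the `append` instance method on Series and DataFrame. These methods actually predate `concat`. They concatenate along `axis=0`, namely the index. Note that `append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells in this section need an older pandas; on current versions, `pd.concat` is the replacement.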

In [29]:
# Requires pandas < 2.0; on newer versions use pd.concat([df1, df2])
result = df1.append(df2)
result

`append` may take multiple objects to concatenate:

In [31]:
# Requires pandas < 2.0; on newer versions use pd.concat([df1, df2, df3])
result = df1.append([df2, df3])
result
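
Because `append` is not available in pandas 2.0 and later, the same results can be produced with `pd.concat`. The sketch below uses small stand-in frames (hypothetical, shortened versions of the ones above) to show the two equivalents.

```python
import pandas as pd

# Small stand-in frames, assumed for illustration only
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])
df3 = pd.DataFrame({'B': ['B4', 'B5'], 'F': ['F4', 'F5']}, index=[4, 5])

# Counterpart of df1.append(df2)
two = pd.concat([df1, df2])

# Counterpart of df1.append([df2, df3]); columns missing from a frame become NaN
three = pd.concat([df1, df2, df3])

print(two.shape)
print(three.shape)
```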

__4. Ignoring indexes on the concatenation axis__

For DataFrames that don’t have a meaningful index, you may want to concatenate them and ignore the fact that their indexes overlap. To do this, set `ignore_index=True`. The same argument works in a similar way with `DataFrame.append`.

In [33]:
result = pd.concat([df1, df3], ignore_index=True)
result
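
A minimal sketch of what `ignore_index=True` changes, using two hypothetical frames `a` and `b` whose row labels overlap. It also shows the `verify_integrity` argument of `pd.concat`, which raises an error on overlapping labels instead of silently keeping duplicates.

```python
import pandas as pd

# Hypothetical frames whose row labels overlap at label 1
a = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
b = pd.DataFrame({'A': ['A2', 'A3']}, index=[1, 2])

# Default: original labels are kept, so label 1 appears twice
kept = pd.concat([a, b])

# ignore_index=True: labels are discarded and replaced by 0..n-1
fresh = pd.concat([a, b], ignore_index=True)

# verify_integrity=True: overlapping labels raise ValueError
try:
    pd.concat([a, b], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(list(kept.index))
print(list(fresh.index))
print(overlap_detected)
```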

__5. Concatenating with mixed ndims__

We can also concatenate a mix of Series and DataFrames. Each Series is converted to a DataFrame whose column name is the name of the Series.

In [35]:
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
s1

In [36]:
result = pd.concat([df1, s1], axis=1)
result
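
The role of the Series name can be sketched on a small stand-in frame (`df`, `named`, and `unnamed` are hypothetical names). A named Series contributes a column with that name, while an unnamed Series still contributes a column, with a label that pandas assigns automatically.

```python
import pandas as pd

# Small stand-in frame and Series, assumed for illustration only
df = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
named = pd.Series(['X0', 'X1'], name='X')
unnamed = pd.Series(['U0', 'U1'])

# The Series name becomes the new column label
with_named = pd.concat([df, named], axis=1)

# Without a name, pandas generates the column label itself
with_unnamed = pd.concat([df, unnamed], axis=1)

print(list(with_named.columns))
print(list(with_unnamed.columns))
```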

## Example of concat using the iris dataset

In [38]:
iris = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/iris.csv')
iris

In [39]:
iris.shape

Split the dataset into two dataframes of different sizes

In [41]:
from sklearn.model_selection import train_test_split
iris_df1, iris_df2 = train_test_split(iris, test_size=0.4, train_size=0.6)

Now apply the concatenation functions to the two subsets (`iris_df1`, `iris_df2`) of the iris dataset:

In [43]:
frame = [iris_df1, iris_df2]

In [44]:
iris_concat1 = pd.concat(frame, keys=['x', 'y'])
iris_concat1

In [45]:
iris_concat2 = pd.concat([iris_df1, iris_df2])
iris_concat2.shape

In [46]:
iris_df1

In [47]:
# Requires pandas < 2.0; on newer versions use pd.concat([iris_df1, iris_df2])
iris_append = iris_df1.append(iris_df2)
iris_append.shape
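
The split-and-recombine round trip can be checked on a tiny stand-in table. In this sketch, `data` is a hypothetical five-row frame rather than the iris file, and the split uses `DataFrame.sample` instead of `train_test_split` so that the example depends only on pandas. Concatenating the two parts restores every row, and sorting by index restores the original order.

```python
import pandas as pd

# Hypothetical stand-in for the iris table
data = pd.DataFrame({'length': [5.1, 4.9, 6.3, 5.8, 7.1],
                     'species': ['setosa', 'setosa', 'virginica',
                                 'virginica', 'virginica']})

# Random split into two parts that together cover every row exactly once
part1 = data.sample(frac=0.6, random_state=0)
part2 = data.drop(part1.index)

# Concatenation keeps the original row labels, in shuffled order
combined = pd.concat([part1, part2])

# Sorting by the index recovers the original row order
restored = combined.sort_index()

print(combined.shape)
print(restored.equals(data))
```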

In [48]:
iris

__The End__