# Pandas Concatenation

`pd.concat` concatenates a list of `DataFrame` or `Series` objects along either the rows (`axis=0`) or the columns (`axis=1`).

In [1]:
import numpy as np
import pandas as pd

X = pd.DataFrame(np.r_[:9].reshape(3,3), columns='A B C'.split(), index='x y z'.split())
X

Unnamed: 0,A,B,C
x,0,1,2
y,3,4,5
z,6,7,8


In [2]:
Y = pd.DataFrame(np.r_[:900:100].reshape(3,3), columns='A C F'.split(), index='w x y'.split())
Y

Unnamed: 0,A,C,F
w,0,100,200
x,300,400,500
y,600,700,800


If you concatenate across rows, Pandas aligns the columns, filling in NaN / None wherever a column is missing from one of the inputs:

In [3]:
Z = pd.concat([X, Y], sort=False)
Z

Unnamed: 0,A,B,C,F
x,0,1.0,2,
y,3,4.0,5,
z,6,7.0,8,
w,0,,100,200.0
x,300,,400,500.0
y,600,,700,800.0
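
If the NaN padding is unwanted, `pd.concat` also accepts `join="inner"`, which keeps only the columns shared by every input. The sketch below rebuilds the same toy frames (using `np.arange` and `list(...)` in place of `np.r_` and `.split()`, purely for readability) and is meant as an illustration rather than a recommendation:

```python
import numpy as np
import pandas as pd

# Same toy frames as above, rebuilt so this snippet runs on its own
X = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("ABC"), index=list("xyz"))
Y = pd.DataFrame(np.arange(0, 900, 100).reshape(3, 3), columns=list("ACF"), index=list("wxy"))

# join="inner" keeps only the columns present in every input (here A and C),
# so no NaN padding is introduced and the integer dtypes survive
Z_inner = pd.concat([X, Y], join="inner")
print(Z_inner)
```

The default, `join="outer"`, is what produced the padded result above.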


Likewise, concatenating across columns aligns on the index:

In [4]:
Z = pd.concat([X, Y], axis=1, sort=False)
Z

Unnamed: 0,A,B,C,A.1,C.1,F
x,0.0,1.0,2.0,300.0,400.0,500.0
y,3.0,4.0,5.0,600.0,700.0,800.0
z,6.0,7.0,8.0,,,
w,,,,0.0,100.0,200.0
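
Because both inputs have columns named `A` and `C`, the combined frame ends up with ambiguous labels. One way to keep them apart, sketched here with the arbitrary key names `"X"` and `"Y"`, is the `keys` argument, which adds an outer level to the column index:

```python
import numpy as np
import pandas as pd

# Same toy frames as above, rebuilt so this snippet runs on its own
X = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("ABC"), index=list("xyz"))
Y = pd.DataFrame(np.arange(0, 900, 100).reshape(3, 3), columns=list("ACF"), index=list("wxy"))

# keys= labels each input; with axis=1 the labels become the outer column level
Z_keyed = pd.concat([X, Y], axis=1, keys=["X", "Y"])

# Columns are now (source, column) pairs, so the two "A" columns are distinct
print(Z_keyed.columns.tolist())
print(Z_keyed[("Y", "A")])
```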


Concatenation *will* copy the underlying data, so modifying `X` afterwards leaves `Z` unchanged:

In [5]:
X.loc['x', 'A'] = 232

In [6]:
Z

Unnamed: 0,A,B,C,A.1,C.1,F
x,0.0,1.0,2.0,300.0,400.0,500.0
y,3.0,4.0,5.0,600.0,700.0,800.0
z,6.0,7.0,8.0,,,
w,,,,0.0,100.0,200.0


In [7]:
X

Unnamed: 0,A,B,C
x,232,1,2
y,3,4,5
z,6,7,8


## Dealing with Scikit-Learn datasets

When dealing with Scikit-Learn datasets, the target column is provided as a separate entry. If we want to store the whole dataset as one object, we need to concatenate it:

In [8]:
from sklearn import datasets

iris = datasets.load_iris()

In [9]:
type(iris)

sklearn.utils.Bunch

In [13]:
data = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.Series(iris.target, name='Species')

In [14]:
data.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [15]:
target.head()

0    0
1    0
2    0
3    0
4    0
Name: Species, dtype: int64

In [19]:
# There is an error here -- can you guess what it is before executing?

df_iris = pd.concat([data, target])
df_iris.head()


Unnamed: 0,0,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm)
0,,1.4,0.2,5.1,3.5
1,,1.4,0.2,4.9,3.0
2,,1.3,0.2,4.7,3.2
3,,1.5,0.2,4.6,3.1
4,,1.4,0.2,5.0,3.6


In [17]:
df_iris.tail()

Unnamed: 0,0,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm)
145,2.0,,,,
146,2.0,,,,
147,2.0,,,,
148,2.0,,,,
149,2.0,,,,


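The default is `axis=0`, so `target` was stacked beneath the rows of `data` (in a column of its own) instead of being added alongside them as a new column. Passing `axis=1` fixes it:
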

In [18]:
df_iris = pd.concat([data, target], axis=1)
df_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
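
Column-wise concatenation works here because `data` and `target` share the same default integer index. As a cautionary sketch with small made-up frames (`features` and `labels` are hypothetical stand-ins, not the iris data), this is what can happen when the indexes no longer match, for example after filtering rows:

```python
import pandas as pd

# Hypothetical stand-ins for a feature table and its target column
features = pd.DataFrame({"length": [5.1, 4.9, 4.7], "width": [3.5, 3.0, 3.2]})
subset = features[features["length"] < 5.0]          # keeps index labels 1 and 2
labels = pd.Series([0, 1], name="Species")           # index labels 0 and 1

# Index labels differ, so alignment produces extra rows padded with NaN
misaligned = pd.concat([subset, labels], axis=1)

# Resetting the index first lines the rows up by position
aligned = pd.concat([subset.reset_index(drop=True), labels], axis=1)

print(misaligned)
print(aligned)
```

Checking `shape` or `isna()` after a concatenation is a cheap way to catch this kind of mismatch.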


# Merging

Although you can use `concat` to do "joins" (especially on the index), I usually use `pd.merge` for that purpose.
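
To make the comparison concrete, here is a minimal sketch on two made-up frames (`left` and `right` are illustrative names) showing an index-based inner join written both ways:

```python
import pandas as pd

# Two small hypothetical frames that share part of their index
left = pd.DataFrame({"price": [10, 20, 30]}, index=["a", "b", "c"])
right = pd.DataFrame({"qty": [1, 2]}, index=["b", "c"])

# concat along columns, keeping only index labels present in both inputs
via_concat = pd.concat([left, right], axis=1, join="inner")

# the same join expressed with merge (how="inner" is the default)
via_merge = pd.merge(left, right, left_index=True, right_index=True)

print(via_concat)
print(via_merge)
```

`merge` becomes the more natural choice once the join key is an ordinary column rather than the index, as in the examples that follow.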

In [23]:
sales = pd.read_csv('./data/kaggle-sales/sales_train.csv.gz', parse_dates=['date'])
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,2013-02-01,0,59,22154,999.0,1.0
1,2013-03-01,0,25,2552,899.0,1.0
2,2013-05-01,0,25,2552,899.0,-1.0
3,2013-06-01,0,25,2554,1709.05,1.0
4,2013-01-15,0,25,2555,1099.0,1.0


In [24]:
items = pd.read_csv('./data/kaggle-sales/items.csv.gz')
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [25]:
categories = pd.read_csv('./data/kaggle-sales/item_categories.csv.gz')
categories.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


We can merge in the item data to sales first...

In [26]:
data = pd.merge(sales, items)  # merge on common column names
# data = pd.merge(sales, items, on='item_id')
# data = pd.merge(sales, items, left_on='item_id', right_on='item_id')
# data = pd.merge(sales, items, left_on='item_id', right_index=True)  # if items has index

data.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id
0,2013-02-01,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37
1,2013-01-23,0,24,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37
2,2013-01-20,0,27,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37
3,2013-02-01,0,25,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37
4,2013-03-01,0,25,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37
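
By default `pd.merge` performs an inner join, so sales rows whose `item_id` has no match in `items` would be dropped silently. The following sketch uses tiny made-up tables (`sales_small` and `items_small` are hypothetical, not the Kaggle data) to show a few `merge` options that make the join more explicit:

```python
import pandas as pd

# Hypothetical miniature versions of the sales and items tables
sales_small = pd.DataFrame({"item_id": [1, 1, 2, 9], "item_cnt_day": [1.0, 2.0, 1.0, 3.0]})
items_small = pd.DataFrame({"item_id": [1, 2, 3], "item_category_id": [40, 76, 40]})

merged = pd.merge(
    sales_small,
    items_small,
    on="item_id",            # name the key explicitly instead of relying on shared column names
    how="left",              # keep every sale, even when the item is unknown
    validate="many_to_one",  # raise an error if item_id is duplicated in items_small
    indicator=True,          # add a _merge column recording where each row came from
)
print(merged)
```

Rows marked `left_only` in the `_merge` column are sales with no matching item, which is often worth inspecting before continuing.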


... and then merge in the categories to get our 'fully-flattened' data:

In [27]:
data = pd.merge(data, categories)
data.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name
0,2013-02-01,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
1,2013-01-23,0,24,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
2,2013-01-20,0,27,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
3,2013-02-01,0,25,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
4,2013-03-01,0,25,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray


Now, we can answer questions like "which categories had the most/fewest transactions?"

In [28]:
data.item_category_name.value_counts()  # the top-level pd.value_counts is deprecated in recent pandas

Кино - DVD                             564652
Игры PC - Стандартные издания          351591
Музыка - CD локального производства    339585
Игры - PS3                             208219
Кино - Blu-Ray                         192674
                                        ...  
PC - Гарнитуры/Наушники                     3
Аксессуары - PS2                            2
Книги - Открытки                            2
Книги - Познавательная литература           1
Игровые консоли - PS2                       1
Name: item_category_name, Length: 84, dtype: int64
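
Since `value_counts` sorts in descending order by default, the most and least frequent categories sit at the two ends of the result. A small sketch on a made-up series (the category labels are placeholders):

```python
import pandas as pd

# Hypothetical category labels standing in for item_category_name
names = pd.Series(["DVD", "CD", "DVD", "PS3", "DVD", "CD"], name="item_category_name")

counts = names.value_counts()                # sorted from most to least frequent
most_common = counts.head(1)                 # category with the most rows
least_common = counts.tail(1)                # category with the fewest rows
shares = names.value_counts(normalize=True)  # fractions instead of raw counts

print(most_common)
print(least_common)
print(shares)
```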

In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935849 entries, 0 to 2935848
Data columns (total 9 columns):
 #   Column              Dtype         
---  ------              -----         
 0   date                datetime64[ns]
 1   date_block_num      int64         
 2   shop_id             int64         
 3   item_id             int64         
 4   item_price          float64       
 5   item_cnt_day        float64       
 6   item_name           object        
 7   item_category_id    int64         
 8   item_category_name  object        
dtypes: datetime64[ns](1), float64(2), int64(4), object(2)
memory usage: 224.0+ MB


Open the [Pandas merging lab][pandas-merging-lab]

[pandas-merging-lab]: ./pandas-merging-lab.ipynb