# Merging and Concatenating Dataframes


In this section, you will merge and concatenate multiple dataframes. Merging is one of the most common operations you will perform, since data is often spread across several files.

In our case, we have sales data of a retail store spread across multiple files. We will now work with all these data files and learn to:
* Merge multiple dataframes on common columns/keys using ```pd.merge()```
* Concatenate dataframes using ```pd.concat()```

Let's first read all the data files.

In [1]:
# loading libraries and reading the data
import numpy as np
import pandas as pd

market_df = pd.read_csv("./global_sales_data/market_fact.csv")
customer_df = pd.read_csv("./global_sales_data/cust_dimen.csv")
product_df = pd.read_csv("./global_sales_data/prod_dimen.csv")
shipping_df = pd.read_csv("./global_sales_data/shipping_dimen.csv")
orders_df = pd.read_csv("./global_sales_data/orders_dimen.csv")

### Merging Dataframes Using ```pd.merge()```

There are five data files:
1. The ```market_fact``` table contains the sales data of each order
2. The other four files are dimension tables, which contain metadata about customers, products, shipping details and order details

If you are familiar with star schemas and data warehouse designs, you will note that we have one fact table and four dimension tables. 


In [2]:
# Market fact table: each row is one order line (an order can span several rows, e.g. Ord_5446 below)
market_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38


In [3]:
# Customer dimension table: Each row contains metadata about customers
customer_df.head()

Unnamed: 0,Customer_Name,Province,Region,Customer_Segment,Cust_id
0,MUHAMMED MACINTYRE,NUNAVUT,NUNAVUT,SMALL BUSINESS,Cust_1
1,BARRY FRENCH,NUNAVUT,NUNAVUT,CONSUMER,Cust_2
2,CLAY ROZENDAL,NUNAVUT,NUNAVUT,CORPORATE,Cust_3
3,CARLOS SOLTERO,NUNAVUT,NUNAVUT,CONSUMER,Cust_4
4,CARL JACKSON,NUNAVUT,NUNAVUT,CORPORATE,Cust_5


In [4]:
customer_df.loc[:, ['Cust_id', 'Customer_Name', 'Customer_Segment', 'Province', 'Region']].head(10)

Unnamed: 0,Cust_id,Customer_Name,Customer_Segment,Province,Region
0,Cust_1,MUHAMMED MACINTYRE,SMALL BUSINESS,NUNAVUT,NUNAVUT
1,Cust_2,BARRY FRENCH,CONSUMER,NUNAVUT,NUNAVUT
2,Cust_3,CLAY ROZENDAL,CORPORATE,NUNAVUT,NUNAVUT
3,Cust_4,CARLOS SOLTERO,CONSUMER,NUNAVUT,NUNAVUT
4,Cust_5,CARL JACKSON,CORPORATE,NUNAVUT,NUNAVUT
5,Cust_6,MONICA FEDERLE,CORPORATE,NUNAVUT,NUNAVUT
6,Cust_7,DOROTHY BADDERS,HOME OFFICE,NUNAVUT,NUNAVUT
7,Cust_8,NEOLA SCHNEIDER,HOME OFFICE,NUNAVUT,NUNAVUT
8,Cust_9,CARLOS DALY,HOME OFFICE,NUNAVUT,NUNAVUT
9,Cust_10,CLAUDIA MINER,SMALL BUSINESS,NUNAVUT,NUNAVUT


In [5]:
# Product dimension table
product_df.head()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2
2,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
3,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
4,FURNITURE,OFFICE FURNISHINGS,Prod_5


In [6]:
# Shipping metadata
shipping_df.head()

Unnamed: 0,Order_ID,Ship_Mode,Ship_Date,Ship_id
0,3,REGULAR AIR,20-10-2010,SHP_1
1,293,DELIVERY TRUCK,02-10-2012,SHP_2
2,293,REGULAR AIR,03-10-2012,SHP_3
3,483,REGULAR AIR,12-07-2011,SHP_4
4,515,REGULAR AIR,30-08-2010,SHP_5


In [7]:
# Orders dimension table
orders_df.head()

Unnamed: 0,Order_ID,Order_Date,Order_Priority,Ord_id
0,3,13-10-2010,LOW,Ord_1
1,293,01-10-2012,HIGH,Ord_2
2,483,10-07-2011,HIGH,Ord_3
3,515,28-08-2010,NOT SPECIFIED,Ord_4
4,613,17-06-2011,HIGH,Ord_5


### Merging Dataframes

Say you want to select all orders placed by the *Corporate* customer segment and examine their ```Sales```. Since customer segment details are stored in ```customer_df```, you first need to merge it with ```market_df```.


In [8]:
# Merging the dataframes
# Note that Cust_id is the common column/key, which is provided to the 'on' argument
# how = 'inner' makes sure that only the customer ids present in both dfs are included in the result
df_1 = pd.merge(market_df, customer_df, how='inner', on='Cust_id')
df_1.head()


Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54,AARON BERGMAN,ALBERTA,WEST,CORPORATE
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59,AARON BERGMAN,ALBERTA,WEST,CORPORATE
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37,AARON BERGMAN,ALBERTA,WEST,CORPORATE
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38,AARON BERGMAN,ALBERTA,WEST,CORPORATE


In [9]:
# Check the shape (rows, columns) of the merged dataframe
df_1.shape




(8399, 14)
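
Checking the shape is one sanity check on a merge. As a further hedged sketch, the `validate` and `indicator` arguments of `pd.merge()` can confirm the key relationship and show where each row came from. The frames `orders` and `customers` below are made-up miniatures, not the actual `market_df` and `customer_df`.

```python
import pandas as pd

# Hypothetical miniature versions of the fact table and the customer table
orders = pd.DataFrame({'Ord_id': ['Ord_1', 'Ord_2', 'Ord_3'],
                       'Cust_id': ['Cust_1', 'Cust_1', 'Cust_2'],
                       'Sales': [100.0, 250.0, 75.0]})
customers = pd.DataFrame({'Cust_id': ['Cust_1', 'Cust_2', 'Cust_3'],
                          'Customer_Segment': ['CORPORATE', 'CONSUMER', 'HOME OFFICE']})

# validate='many_to_one' raises a MergeError if Cust_id is duplicated in customers
# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
checked = pd.merge(orders, customers, how='left', on='Cust_id',
                   validate='many_to_one', indicator=True)

print(checked['_merge'].value_counts())
print(checked.shape)
```

If every row is marked `both`, each order found its customer, and the row count of the result equals that of the fact table.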

In [10]:
df_1['Customer_Segment'] == 'CORPORATE'

0        True
1        True
2        True
3        True
4        True
        ...  
8394    False
8395    False
8396    False
8397    False
8398    False
Name: Customer_Segment, Length: 8399, dtype: bool

In [11]:
columns= ['Cust_id', 'Customer_Name', 'Ord_id', 'Prod_id', 'Sales','Profit']
df_1.loc[(df_1.Customer_Segment == 'CORPORATE') & (df_1.Profit > 1000), columns]

Unnamed: 0,Cust_id,Customer_Name,Ord_id,Prod_id,Sales,Profit
2,Cust_1818,AARON BERGMAN,Ord_5446,Prod_4,4701.6900,1148.90
4,Cust_1818,AARON BERGMAN,Ord_5485,Prod_17,4233.1500,1219.87
57,Cust_1474,ADAM HART,Ord_4546,Prod_1,5208.7800,1547.78
64,Cust_1474,ADAM HART,Ord_4475,Prod_4,7640.2250,2027.68
82,Cust_1749,ADAM SHILLINGSBURG,Ord_5156,Prod_4,8374.1320,2568.10
...,...,...,...,...,...,...
8025,Cust_1311,TOM STIVERS,Ord_2396,Prod_4,4687.6735,1304.56
8110,Cust_1235,TRACY COLLINS,Ord_3400,Prod_2,7841.5700,2347.18
8137,Cust_890,TRACY PODDAR,Ord_2267,Prod_2,6608.2400,2164.64
8263,Cust_1694,VICTORIA PISTEKA,Ord_4954,Prod_15,7987.4300,1304.90


In [12]:
# Now, you can subset the orders made by customers from 'Corporate' segment
df_1.loc[df_1['Customer_Segment'] == 'CORPORATE', :]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54,AARON BERGMAN,ALBERTA,WEST,CORPORATE
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.00,26,1148.90,2.50,0.59,AARON BERGMAN,ALBERTA,WEST,CORPORATE
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.30,0.37,AARON BERGMAN,ALBERTA,WEST,CORPORATE
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.30,0.38,AARON BERGMAN,ALBERTA,WEST,CORPORATE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8385,Ord_1833,Prod_3,SHP_2527,Cust_637,611.16,0.04,46,100.22,4.98,0.40,YANA SORENSEN,NEWFOUNDLAND,ATLANTIC,CORPORATE
8386,Ord_2324,Prod_7,SHP_3189,Cust_851,121.87,0.07,39,11.32,1.35,0.40,YANA SORENSEN,QUEBEC,QUEBEC,CORPORATE
8387,Ord_2220,Prod_3,SHP_3019,Cust_851,41.06,0.04,4,-16.39,6.28,0.35,YANA SORENSEN,QUEBEC,QUEBEC,CORPORATE
8388,Ord_4424,Prod_1,SHP_6165,Cust_1519,994.04,0.03,10,-335.06,35.00,,YANA SORENSEN,YUKON,YUKON,CORPORATE


In [13]:
# The same filter written with isin(), which can also test against several values at once
filter_1_CORPORATE_isin_Customer_Segment = df_1['Customer_Segment'].isin(['CORPORATE'])
filter_1_CORPORATE_isin_Customer_Segment

0        True
1        True
2        True
3        True
4        True
        ...  
8394    False
8395    False
8396    False
8397    False
8398    False
Name: Customer_Segment, Length: 8399, dtype: bool

In [15]:
df_1[filter_1_CORPORATE_isin_Customer_Segment]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54,AARON BERGMAN,ALBERTA,WEST,CORPORATE
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.00,26,1148.90,2.50,0.59,AARON BERGMAN,ALBERTA,WEST,CORPORATE
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.30,0.37,AARON BERGMAN,ALBERTA,WEST,CORPORATE
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.30,0.38,AARON BERGMAN,ALBERTA,WEST,CORPORATE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8385,Ord_1833,Prod_3,SHP_2527,Cust_637,611.16,0.04,46,100.22,4.98,0.40,YANA SORENSEN,NEWFOUNDLAND,ATLANTIC,CORPORATE
8386,Ord_2324,Prod_7,SHP_3189,Cust_851,121.87,0.07,39,11.32,1.35,0.40,YANA SORENSEN,QUEBEC,QUEBEC,CORPORATE
8387,Ord_2220,Prod_3,SHP_3019,Cust_851,41.06,0.04,4,-16.39,6.28,0.35,YANA SORENSEN,QUEBEC,QUEBEC,CORPORATE
8388,Ord_4424,Prod_1,SHP_6165,Cust_1519,994.04,0.03,10,-335.06,35.00,,YANA SORENSEN,YUKON,YUKON,CORPORATE


In [16]:
# Example 2: Select all orders from product category = office supplies and from the corporate segment
# We now need to merge the product_df

df_2 = pd.merge(df_1, product_df, how='inner', on='Prod_id')
df_2.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment,Product_Category,Product_Sub_Category
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
1,Ord_2978,Prod_16,SHP_4112,Cust_1088,305.05,0.04,27,23.12,3.37,0.57,AARON HAWKINS,ONTARIO,ONTARIO,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
2,Ord_5484,Prod_16,SHP_7663,Cust_1820,322.82,0.05,35,-17.58,3.98,0.56,ADRIAN SHAMI,ALBERTA,WEST,CONSUMER,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
3,Ord_3730,Prod_16,SHP_5175,Cust_1314,459.08,0.04,34,61.57,3.14,0.6,ALEKSANDRA GANNAWAY,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
4,Ord_4143,Prod_16,SHP_5771,Cust_1417,207.21,0.06,24,-78.64,6.14,0.59,ALLEN ARMOLD,NEW BRUNSWICK,ATLANTIC,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"


In [17]:
df_1.head(2)

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54,AARON BERGMAN,ALBERTA,WEST,CORPORATE


In [18]:
product_df.head(2)

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2


In [19]:
# Same merge as df_2 above; the order of the keyword arguments does not matter
df2 = pd.merge(df_1, product_df, on='Prod_id', how='inner')


In [20]:
columns= ['Cust_id', 'Customer_Name', 'Prod_id', 'Ord_id', 'Sales','Profit']
df2.loc[df2.Profit > 1000, columns].sort_values(by='Profit', ascending=False)

Unnamed: 0,Cust_id,Customer_Name,Prod_id,Ord_id,Sales,Profit
3008,Cust_1151,EMILY PHAN,Prod_17,Ord_3084,89061.050,27220.69
2901,Cust_1307,ANDY REITER,Prod_17,Ord_3707,28359.400,14440.39
8334,Cust_1170,DEBORAH BRUMFIELD,Prod_14,Ord_3143,28664.520,13340.26
8360,Cust_1571,KAREN CARLISLE,Prod_14,Ord_4614,29884.600,12748.86
3161,Cust_1763,RICK WILSON,Prod_17,Ord_5186,26095.130,12606.81
...,...,...,...,...,...,...
2322,Cust_1361,KELLY ANDREADA,Prod_6,Ord_3956,2123.640,1013.77
2813,Cust_582,TODD BOYES,Prod_6,Ord_1694,2145.050,1012.67
5680,Cust_180,HAROLD ENGLE,Prod_15,Ord_490,5186.310,1009.38
819,Cust_1385,AMY HUNT,Prod_4,Ord_4066,4390.029,1007.49


In [21]:
# Select all orders from product category = office supplies and from the corporate segment
df_2.loc[(df_2['Product_Category']=='OFFICE SUPPLIES') & (df_2['Customer_Segment']=='CORPORATE'),:]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment,Product_Category,Product_Sub_Category
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
3,Ord_3730,Prod_16,SHP_5175,Cust_1314,459.08,0.04,34,61.57,3.14,0.60,ALEKSANDRA GANNAWAY,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
7,Ord_4506,Prod_16,SHP_6273,Cust_1544,92.02,0.07,9,-24.88,4.68,0.59,AMY COX,YUKON,YUKON,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
9,Ord_1551,Prod_16,SHP_2145,Cust_531,184.77,0.00,29,-71.96,5.30,0.55,ANDY YOTOV,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
11,Ord_1429,Prod_16,SHP_1976,Cust_510,539.06,0.05,42,-123.07,4.59,0.82,ANNA HABERLIN,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7545,Ord_4629,Prod_1,SHP_6447,Cust_1587,848.19,0.06,25,120.02,5.49,0.60,VICTORIA PISTEKA,BRITISH COLUMBIA,WEST,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7546,Ord_4604,Prod_1,SHP_6403,Cust_1522,234.24,0.09,24,-151.80,9.45,0.60,VICTORIA PISTEKA,YUKON,YUKON,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7551,Ord_3543,Prod_1,SHP_4905,Cust_1266,1184.11,0.07,6,-145.07,19.99,0.71,WILLIAM BROWN,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7552,Ord_2722,Prod_1,SHP_3731,Cust_1006,3508.33,0.04,21,-546.98,35.00,0.85,XYLONA PRICE,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION


In [22]:
fltr_1_OfficeSupplies_isin_ProductCategory = df_2['Product_Category'].isin(['OFFICE SUPPLIES'])
fltr_2_Corporate_isin_CustomerSegment = df_2['Customer_Segment'].isin(['CORPORATE'])

In [23]:
df_2[fltr_1_OfficeSupplies_isin_ProductCategory & fltr_2_Corporate_isin_CustomerSegment]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment,Product_Category,Product_Sub_Category
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
3,Ord_3730,Prod_16,SHP_5175,Cust_1314,459.08,0.04,34,61.57,3.14,0.60,ALEKSANDRA GANNAWAY,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
7,Ord_4506,Prod_16,SHP_6273,Cust_1544,92.02,0.07,9,-24.88,4.68,0.59,AMY COX,YUKON,YUKON,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
9,Ord_1551,Prod_16,SHP_2145,Cust_531,184.77,0.00,29,-71.96,5.30,0.55,ANDY YOTOV,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
11,Ord_1429,Prod_16,SHP_1976,Cust_510,539.06,0.05,42,-123.07,4.59,0.82,ANNA HABERLIN,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7545,Ord_4629,Prod_1,SHP_6447,Cust_1587,848.19,0.06,25,120.02,5.49,0.60,VICTORIA PISTEKA,BRITISH COLUMBIA,WEST,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7546,Ord_4604,Prod_1,SHP_6403,Cust_1522,234.24,0.09,24,-151.80,9.45,0.60,VICTORIA PISTEKA,YUKON,YUKON,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7551,Ord_3543,Prod_1,SHP_4905,Cust_1266,1184.11,0.07,6,-145.07,19.99,0.71,WILLIAM BROWN,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7552,Ord_2722,Prod_1,SHP_3731,Cust_1006,3508.33,0.04,21,-546.98,35.00,0.85,XYLONA PRICE,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
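
The two filtering styles used above (comparison with `==` and `isin()`) can be combined, and `DataFrame.query()` offers a third spelling of the same condition. The sketch below uses a small made-up frame, `toy`, rather than `df_2`, to show that the approaches select the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the merged dataframe
toy = pd.DataFrame({'Product_Category': ['OFFICE SUPPLIES', 'TECHNOLOGY', 'FURNITURE', 'OFFICE SUPPLIES'],
                    'Customer_Segment': ['CORPORATE', 'CORPORATE', 'CONSUMER', 'HOME OFFICE'],
                    'Profit': [-30.51, 1148.90, 12.00, 23.12]})

# isin() accepts several values; combine conditions with & and wrap comparisons in parentheses
mask = (toy['Product_Category'].isin(['OFFICE SUPPLIES', 'TECHNOLOGY'])
        & (toy['Customer_Segment'] == 'CORPORATE'))
by_mask = toy[mask]

# The same condition written as a query string
by_query = toy.query("Product_Category in ['OFFICE SUPPLIES', 'TECHNOLOGY'] "
                     "and Customer_Segment == 'CORPORATE'")

print(by_mask.equals(by_query))
```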



Similarly, you can merge the other dimension tables, ```shipping_df``` and ```orders_df```, to create a ```master_df``` and then filter on any column in the master dataframe.


In [24]:
# Merging shipping_df
df_3 = pd.merge(df_2, shipping_df, how='inner', on='Ship_id')
df_3.shape

(8399, 19)

In [25]:
# Merging the orders table to create a master df
master_df = pd.merge(df_3, orders_df, how='inner', on='Ord_id')
master_df.shape
master_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,...,Region,Customer_Segment,Product_Category,Product_Sub_Category,Order_ID_x,Ship_Mode,Ship_Date,Order_ID_y,Order_Date,Order_Priority
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,...,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36262,REGULAR AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED
1,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59,...,WEST,CORPORATE,TECHNOLOGY,TELEPHONES AND COMMUNICATION,36262,EXPRESS AIR,27-07-2010,36262,27-07-2010,NOT SPECIFIED
2,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37,...,WEST,CORPORATE,OFFICE SUPPLIES,PAPER,36262,EXPRESS AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED
3,Ord_2978,Prod_16,SHP_4112,Cust_1088,305.05,0.04,27,23.12,3.37,0.57,...,ONTARIO,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",37863,REGULAR AIR,26-02-2011,37863,24-02-2011,HIGH
4,Ord_5484,Prod_16,SHP_7663,Cust_1820,322.82,0.05,35,-17.58,3.98,0.56,...,WEST,CONSUMER,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",53026,REGULAR AIR,03-03-2012,53026,26-02-2012,LOW
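
Notice the columns `Order_ID_x` and `Order_ID_y` in the master dataframe: both `shipping_df` and `orders_df` carry an `Order_ID` column, so pandas appends its default suffixes to tell them apart. A minimal sketch of two ways to handle this, using made-up miniature tables (`fact`, `shipping` and `orders` are hypothetical, and the suffix names are arbitrary):

```python
import pandas as pd

# Hypothetical miniature versions of the fact, shipping and orders tables
fact = pd.DataFrame({'Ord_id': ['Ord_1', 'Ord_2'],
                     'Ship_id': ['SHP_1', 'SHP_2'],
                     'Sales': [10.0, 20.0]})
shipping = pd.DataFrame({'Order_ID': [3, 293],
                         'Ship_Mode': ['REGULAR AIR', 'DELIVERY TRUCK'],
                         'Ship_id': ['SHP_1', 'SHP_2']})
orders = pd.DataFrame({'Order_ID': [3, 293],
                       'Order_Priority': ['LOW', 'HIGH'],
                       'Ord_id': ['Ord_1', 'Ord_2']})

step = pd.merge(fact, shipping, how='inner', on='Ship_id')

# Option 1: choose descriptive suffixes instead of the default _x and _y
labelled = pd.merge(step, orders, how='inner', on='Ord_id',
                    suffixes=('_ship', '_order'))

# Option 2: drop the redundant column from one side before merging
deduped = pd.merge(step, orders.drop(columns='Order_ID'), how='inner', on='Ord_id')

print(labelled.columns.tolist())
print(deduped.columns.tolist())
```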


Similarly, you can perform left, right and outer merges (joins) by passing ```how='left'```, ```how='right'``` or ```how='outer'```.
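
To make the difference between the join types concrete, here is a small sketch on two made-up frames (`left` and `right` are illustrative names, not part of the sales data) that share only some key values.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Count the rows produced by each join type
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, how=how, on='key')
    row_counts[how] = len(merged)

print(row_counts)
```

An inner merge keeps only the keys present in both frames, left and right merges keep every key from one side, and an outer merge keeps the union, filling the gaps with NaN.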

### Concatenating Dataframes

Concatenation is much more straightforward than merging. It is used when you have dataframes with the same columns and want to stack them (pile one on top of the other), or with the same rows and want to place them side by side.

#### Concatenating Dataframes Having the Same Columns

Say you have two dataframes having the same columns, like so:

In [40]:
# dataframes having the same columns
df1 = pd.DataFrame({'Name': ['Aman', 'Joy', 'Rashmi', 'Saif'],
                    'Age': ['34', '31', '22', '33'],
                    'Gender': ['M', 'M', 'F', 'M']}
                  )

df2 = pd.DataFrame({'Name': ['Akhil', 'Asha', 'Preeti'],
                    'Age': ['31', '22', '23'],
                    'Gender': ['M', 'F', 'F']}
                  )
df1

Unnamed: 0,Name,Age,Gender
0,Aman,34,M
1,Joy,31,M
2,Rashmi,22,F
3,Saif,33,M


In [41]:
df2

Unnamed: 0,Name,Age,Gender
0,Akhil,31,M
1,Asha,22,F
2,Preeti,23,F


In [42]:
# To concatenate them, one on top of the other, you can use pd.concat
# The first argument is a sequence (list) of dataframes
# axis = 0 indicates that we want to concat along the row axis
pd.concat([df1, df2], axis = 0)

Unnamed: 0,Name,Age,Gender
0,Aman,34,M
1,Joy,31,M
2,Rashmi,22,F
3,Saif,33,M
0,Akhil,31,M
1,Asha,22,F
2,Preeti,23,F
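
Note that the index labels 0, 1 and 2 are repeated in the result, because `pd.concat()` keeps each dataframe's original index. A short sketch of two options, `ignore_index` and `keys`, on made-up frames (`df_a` and `df_b` are hypothetical names chosen to avoid clashing with the dataframes above):

```python
import pandas as pd

df_a = pd.DataFrame({'Name': ['Aman', 'Joy'], 'Age': ['34', '31']})
df_b = pd.DataFrame({'Name': ['Akhil', 'Asha'], 'Age': ['31', '22']})

# Default: the original index labels are kept, so they repeat
stacked = pd.concat([df_a, df_b], axis=0)

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([df_a, df_b], axis=0, ignore_index=True)

# keys adds an outer index level recording which dataframe each row came from
labelled = pd.concat([df_a, df_b], keys=['first', 'second'])

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(labelled.loc['second'])
```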


In [48]:
# pd.concat also works on Series; ignore_index=True renumbers the result from 0
s1 = pd.Series(['a', 'b', 'c'])
s2 = pd.Series(['p', 'q', 'r', 's'])
s3 = pd.Series(['x', 'y', 'z'])
pd.concat([s1, s2, s3], ignore_index=True)

0    a
1    b
2    c
3    p
4    q
5    r
6    s
7    x
8    y
9    z
dtype: object

In [50]:
# Older tutorials use df1.append(df2) to concatenate along the rows.
# DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so use pd.concat() instead
pd.concat([df1, df2])

Unnamed: 0,Name,Age,Gender
0,Aman,34,M
1,Joy,31,M
2,Rashmi,22,F
3,Saif,33,M
0,Akhil,31,M
1,Asha,22,F
2,Preeti,23,F


#### Concatenating Dataframes Having the Same Rows

You may also have dataframes with the same rows but different columns (and no columns in common). In this case, you may want to concatenate them side by side. For example:

In [52]:
df1 = pd.DataFrame({'Name': ['Aman', 'Joy', 'Rashmi', 'Saif'],
                    'Age': ['34', '31', '22', '33'],
                    'Gender': ['M', 'M', 'F', 'M']}
                  )
df1

Unnamed: 0,Name,Age,Gender
0,Aman,34,M
1,Joy,31,M
2,Rashmi,22,F
3,Saif,33,M


In [53]:
df2 = pd.DataFrame({'School': ['RK Public', 'JSP', 'Carmel Convent', 'St. Paul'],
                    'Graduation Marks': ['84', '89', '76', '91']}
                  )
df2

Unnamed: 0,School,Graduation Marks
0,RK Public,84
1,JSP,89
2,Carmel Convent,76
3,St. Paul,91


In [54]:
# To join the two dataframes, use axis = 1 to indicate joining along the columns axis
# Rows are matched by index label, so this works because both dataframes share the index 0 to 3
pd.concat([df1, df2], axis = 1)

Unnamed: 0,Name,Age,Gender,School,Graduation Marks
0,Aman,34,M,RK Public,84
1,Joy,31,M,JSP,89
2,Rashmi,22,F,Carmel Convent,76
3,Saif,33,M,St. Paul,91


Note that ```pd.concat()``` can also combine dataframes on shared index labels through its ```join``` argument, though we will not discuss that here. For simplicity, we have used ```pd.merge()``` for database-style merging on key columns and ```pd.concat()``` for stacking dataframes on top of each other or placing them side by side.
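
For reference, a brief sketch of what happens when the indices only partly overlap. The frames `people` and `schools` are made up for this illustration; `join` is an actual `pd.concat()` argument that accepts `'outer'` (the default) or `'inner'`.

```python
import pandas as pd

people = pd.DataFrame({'Age': [34, 31, 22]}, index=['Aman', 'Joy', 'Rashmi'])
schools = pd.DataFrame({'School': ['JSP', 'Carmel Convent', 'St. Paul']},
                       index=['Joy', 'Rashmi', 'Saif'])

# Default join='outer': keep every index label, with NaN where a frame has no row
outer = pd.concat([people, schools], axis=1)

# join='inner': keep only the labels present in both frames
inner = pd.concat([people, schools], axis=1, join='inner')

print(outer)
print(inner)
```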

#### Performing Arithmetic Operations on Two or More Dataframes

We can also perform simple arithmetic operations on two or more dataframes. Below are the stats for IPL 2018 and 2017.

In [60]:
# Teamwise stats for IPL 2018
IPL_2018 = pd.DataFrame({'IPL Team': ['CSK', 'SRH', 'KKR', 'RR', 'MI', 'RCB', 'KXIP', 'DD'],
                         'Matches Played': [16, 17, 16, 15, 14, 14, 14, 14],
                         'Matches Won': [11, 10, 9, 7, 6, 6, 6, 5]}
                       )

# Set the 'IPL Team' column as the index so that arithmetic operations on the other columns are aligned by team
IPL_2018.set_index('IPL Team', inplace = True)
IPL_2018

Unnamed: 0_level_0,Matches Played,Matches Won
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1
CSK,16,11
SRH,17,10
KKR,16,9
RR,15,7
MI,14,6
RCB,14,6
KXIP,14,6
DD,14,5


In [61]:
# Similarly, we have the stats for IPL 2017
IPL_2017 = pd.DataFrame({'IPL Team': ['MI', 'RPS', 'KKR', 'SRH', 'KXIP', 'DD', 'GL', 'RCB'],
                         'Matches Played': [17, 16, 16, 15, 14, 14, 14, 14],
                         'Matches Won': [12, 10, 9, 8, 7, 6, 4, 3]}
                       )
IPL_2017.set_index('IPL Team', inplace = True)
IPL_2017

Unnamed: 0_level_0,Matches Played,Matches Won
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1
MI,17,12
RPS,16,10
KKR,16,9
SRH,15,8
KXIP,14,7
DD,14,6
GL,14,4
RCB,14,3


In [62]:
# Simply add the two dataframes using the + operator

Total = IPL_2018 + IPL_2017
Total

Unnamed: 0_level_0,Matches Played,Matches Won
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1
CSK,,
DD,28.0,11.0
GL,,
KKR,32.0,18.0
KXIP,28.0,13.0
MI,31.0,18.0
RCB,28.0,9.0
RPS,,
RR,,
SRH,32.0,18.0


In [75]:
pd.concat([IPL_2017, IPL_2018], axis=1).sort_values(by='IPL Team')

Unnamed: 0_level_0,Matches Played,Matches Won,Matches Played,Matches Won
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CSK,,,16.0,11.0
DD,14.0,6.0,14.0,5.0
GL,14.0,4.0,,
KKR,16.0,9.0,16.0,9.0
KXIP,14.0,7.0,14.0,6.0
MI,17.0,12.0,14.0,6.0
RCB,14.0,3.0,14.0,6.0
RPS,16.0,10.0,,
RR,,,15.0,7.0
SRH,15.0,8.0,17.0,10.0


In [88]:
# Addition is commutative, so swapping the operands gives the same result
IPL_df = IPL_2017 + IPL_2018
IPL_df

Unnamed: 0_level_0,Matches Played,Matches Won
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1
CSK,,
DD,28.0,11.0
GL,,
KKR,32.0,18.0
KXIP,28.0,13.0
MI,31.0,18.0
RCB,28.0,9.0
RPS,,
RR,,
SRH,32.0,18.0


Notice that there are a lot of NaN values. This is because some teams that played in IPL 2017 did not play in IPL 2018, and IPL 2018 also had new teams. The + operator returns NaN whenever a team is missing from either dataframe. We can handle this by using `df.add()` with a `fill_value` instead of the plain + operator. Let's see how.

In [90]:
# fill_value = 0 substitutes 0 for a team that is missing from one of the two dataframes before adding
# If a value is missing from both dataframes, the result is still NaN
Total = IPL_2018.add(IPL_2017, fill_value = 0)
Total

Unnamed: 0_level_0,Matches Played,Matches Won
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1
CSK,16.0,11.0
DD,28.0,11.0
GL,14.0,4.0
KKR,32.0,18.0
KXIP,28.0,13.0
MI,31.0,18.0
RCB,28.0,9.0
RPS,16.0,10.0
RR,15.0,7.0
SRH,32.0,18.0


Also notice how the resultant dataframe is sorted by the index, i.e. 'IPL Team' alphabetically.
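
The behaviour of `fill_value` can be checked on a tiny made-up example. The frames `a` and `b` below are hypothetical; they are built so that one label is missing from one frame and one value is NaN in both.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'Won': [5.0, np.nan]}, index=['X', 'Y'])
b = pd.DataFrame({'Won': [3.0, np.nan, 4.0]}, index=['X', 'Y', 'Z'])

total = a.add(b, fill_value=0)

# X is present in both frames, so the values are simply added
# Y is NaN in both frames, so the result stays NaN
# Z is missing from a, so it is treated as 0 before adding
print(total)
```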

In [92]:
# Creating a new column - 'Win Percentage' (stored as a fraction between 0 and 1)

Total['Win Percentage'] = Total['Matches Won']/Total['Matches Played']
Total

Unnamed: 0_level_0,Matches Played,Matches Won,Win Percentage
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CSK,16.0,11.0,0.6875
DD,28.0,11.0,0.392857
GL,14.0,4.0,0.285714
KKR,32.0,18.0,0.5625
KXIP,28.0,13.0,0.464286
MI,31.0,18.0,0.580645
RCB,28.0,9.0,0.321429
RPS,16.0,10.0,0.625
RR,15.0,7.0,0.466667
SRH,32.0,18.0,0.5625


In [95]:
Total['Matches Lost'] = Total['Matches Played'] - Total['Matches Won']
Total

Unnamed: 0_level_0,Matches Played,Matches Won,Win Percentage,Matches Lost
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CSK,16.0,11.0,0.6875,5.0
DD,28.0,11.0,0.392857,17.0
GL,14.0,4.0,0.285714,10.0
KKR,32.0,18.0,0.5625,14.0
KXIP,28.0,13.0,0.464286,15.0
MI,31.0,18.0,0.580645,13.0
RCB,28.0,9.0,0.321429,19.0
RPS,16.0,10.0,0.625,6.0
RR,15.0,7.0,0.466667,8.0
SRH,32.0,18.0,0.5625,14.0


In [96]:
# Sorting to find the teams with the most wins. If two teams have the same number of wins, sort by win percentage.

Total.sort_values(by=['Matches Won', 'Win Percentage'], ascending=False)

Unnamed: 0_level_0,Matches Played,Matches Won,Win Percentage,Matches Lost
IPL Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MI,31.0,18.0,0.580645,13.0
KKR,32.0,18.0,0.5625,14.0
SRH,32.0,18.0,0.5625,14.0
KXIP,28.0,13.0,0.464286,15.0
CSK,16.0,11.0,0.6875,5.0
DD,28.0,11.0,0.392857,17.0
RPS,16.0,10.0,0.625,6.0
RCB,28.0,9.0,0.321429,19.0
RR,15.0,7.0,0.466667,8.0
GL,14.0,4.0,0.285714,10.0


Apart from `add()`, there are other operator-equivalent methods that you can use on dataframes. Below are the common ones, each of which accepts `fill_value` in the same way:
-  `add()`: +
-  `sub()`: -
-  `mul()`: *
-  `div()`: /
-  `floordiv()`: //
-  `mod()`: %
-  `pow()`: **
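
As an illustration of the methods listed above, the following sketch applies `sub()` and `div()` to two small made-up frames (`season_a` and `season_b` are hypothetical and only loosely modelled on the IPL tables).

```python
import pandas as pd

season_a = pd.DataFrame({'Matches Won': [11, 9]}, index=['CSK', 'KKR'])
season_b = pd.DataFrame({'Matches Won': [9, 12]}, index=['KKR', 'MI'])

# The plain operator gives NaN for teams that appear in only one frame
plain = season_a - season_b

# sub() with fill_value treats the missing team as 0 instead
change = season_a.sub(season_b, fill_value=0)

# The methods also accept a scalar, just like the operators
halved = season_a.div(2)

print(plain)
print(change)
print(halved)
```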

In [3]:
import pandas as pd

In [6]:
restaurant_1 = pd.read_csv("./restaurant_data/restaurant-1.csv")
restaurant_1

Unnamed: 0,name,address,city,cuisine,unique_id
0,arnie morton's of chicago,"""435 s. la cienega blvd.""","""los angeles""","""steakhouses""",'0'
1,art's deli,"""12224 ventura blvd.""","""studio city""","""delis""",'1'
2,bel-air hotel,"""701 stone canyon rd.""","""bel air""","""californian""",'2'
3,cafe bizou,"""14016 ventura blvd.""","""sherman oaks""","""french bistro""",'3'
4,campanile,"""624 s. la brea ave.""","""los angeles""","""californian""",'4'
...,...,...,...,...,...
107,mifune,"""1737 post st.""","""san francisco""","""japanese""",'107'
108,plumpjack cafe,"""3127 fillmore st.""","""san francisco""","""american (new)""",'108'
109,postrio,"""545 post st.""","""san francisco""","""californian""",'109'
110,ritz-carlton dining room (san francisco),"""600 stockton st.""","""san francisco""","""french (new)""",'110'


In [9]:
restaurant_1.describe()

Unnamed: 0,name,address,city,cuisine,unique_id
count,112,112,112,112,112
unique,112,111,16,31,112
top,arnie morton's of chicago,"""3434 peachtree rd. ne""","""new york city""","""american (new)""",'0'
freq,1,2,43,20,1


In [10]:
restaurant_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       112 non-null    object
 1   address    112 non-null    object
 2   city       112 non-null    object
 3   cuisine    112 non-null    object
 4   unique_id  112 non-null    object
dtypes: object(5)
memory usage: 4.5+ KB


In [8]:
restaurant_2 = pd.read_csv("./restaurant_data/restaurant-2.csv")
restaurant_2

Unnamed: 0,name_2,address_2,city_2,cuisine_2,unique_id
0,arnie morton's of chicago,"""435 s. la cienega blv.""","""los angeles""","""american""",'0'
1,art's delicatessen,"""12224 ventura blvd.""","""studio city""","""american""",'1'
2,hotel bel-air,"""701 stone canyon rd.""","""bel air""","""californian""",'2'
3,cafe bizou,"""14016 ventura blvd.""","""sherman oaks""","""french""",'3'
4,campanile,"""624 s. la brea ave.""","""los angeles""","""american""",'4'
...,...,...,...,...,...
747,ti couz,"""3108 16th st.""","""san francisco""","""french""",'748'
748,trio cafe,"""1870 fillmore st.""","""san francisco""","""american""",'749'
749,tu lan,"""8 sixth st.""","""san francisco""","""vietnamese""",'750'
750,vicolo pizzeria,"""201 ivy st.""","""san francisco""","""pizza""",'751'


In [30]:
restaurant_2.describe() 

Unnamed: 0,name_2,address_2,city_2,cuisine_2,unique_id
count,752,752,752,752,752
unique,746,731,49,80,752
top,le colonial,"""3570 las vegas blvd. s""","""new york""","""american""",'0'
freq,2,5,250,152,1


In [27]:
restaurant_1.pivot_table(index=['city'], values=[ 'cuisine'], aggfunc='count')

Unnamed: 0_level_0,cuisine
city,Unnamed: 1_level_1
"""atlanta""",20
"""bel air""",1
"""beverly hills""",2
"""brooklyn""",1
"""chinatown""",1
"""las vegas""",7
"""los angeles""",7
"""los feliz""",1
"""malibu""",1
"""new york city""",43


In [29]:
restaurant_2.pivot_table(index=['city_2'], values=[ 'name_2'], aggfunc='count')

Unnamed: 0_level_0,name_2
city_2,Unnamed: 1_level_1
"""atlanta""",100
"""bel air""",1
"""beverly hills""",6
"""brentwood""",1
"""brooklyn""",5
"""burbank""",1
"""century city""",1
"""chinatown""",1
"""college park""",1
"""culver city""",1


In [32]:
restaurant_df = pd.merge(restaurant_1, restaurant_2, how='inner', on='unique_id')
restaurant_df

Unnamed: 0,name,address,city,cuisine,unique_id,name_2,address_2,city_2,cuisine_2
0,arnie morton's of chicago,"""435 s. la cienega blvd.""","""los angeles""","""steakhouses""",'0',arnie morton's of chicago,"""435 s. la cienega blv.""","""los angeles""","""american"""
1,art's deli,"""12224 ventura blvd.""","""studio city""","""delis""",'1',art's delicatessen,"""12224 ventura blvd.""","""studio city""","""american"""
2,bel-air hotel,"""701 stone canyon rd.""","""bel air""","""californian""",'2',hotel bel-air,"""701 stone canyon rd.""","""bel air""","""californian"""
3,cafe bizou,"""14016 ventura blvd.""","""sherman oaks""","""french bistro""",'3',cafe bizou,"""14016 ventura blvd.""","""sherman oaks""","""french"""
4,campanile,"""624 s. la brea ave.""","""los angeles""","""californian""",'4',campanile,"""624 s. la brea ave.""","""los angeles""","""american"""
...,...,...,...,...,...,...,...,...,...
107,mifune,"""1737 post st.""","""san francisco""","""japanese""",'107',mifune japan center kintetsu building,"""1737 post st.""","""san francisco""","""asian"""
108,plumpjack cafe,"""3127 fillmore st.""","""san francisco""","""american (new)""",'108',plumpjack cafe,"""3201 fillmore st.""","""san francisco""","""mediterranean"""
109,postrio,"""545 post st.""","""san francisco""","""californian""",'109',postrio,"""545 post st.""","""san francisco""","""american"""
110,ritz-carlton dining room (san francisco),"""600 stockton st.""","""san francisco""","""french (new)""",'110',ritz-carlton restaurant and dining room,"""600 stockton st.""","""san francisco""","""american"""
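
Once the two restaurant files are merged on `unique_id`, the paired columns can be compared directly. A minimal sketch, assuming the goal is to spot records whose names differ between the two sources; `first` and `second` are made-up miniatures that reuse a few rows from the output above.

```python
import pandas as pd

# Hypothetical miniature versions of the two restaurant files
first = pd.DataFrame({'unique_id': ["'0'", "'1'", "'2'", "'3'"],
                      'name': ["arnie morton's of chicago", "art's deli",
                               'bel-air hotel', 'cafe bizou']})
second = pd.DataFrame({'unique_id': ["'0'", "'1'", "'2'", "'3'"],
                       'name_2': ["arnie morton's of chicago", "art's delicatessen",
                                  'hotel bel-air', 'cafe bizou']})

pairs = pd.merge(first, second, how='inner', on='unique_id')

# Flag the rows where the two sources agree on the name
pairs['same_name'] = pairs['name'] == pairs['name_2']
mismatches = pairs.loc[~pairs['same_name'], ['unique_id', 'name', 'name_2']]

print(pairs['same_name'].sum())
print(mismatches)
```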


In [35]:
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]}
                   )
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]}
                     )
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]}
                     )

In [54]:
# An outer merge keeps countries that appear in either dataframe
# The clashing 'Medals' columns receive the default suffixes _x and _y
df2 = pd.merge(gold, silver, how='outer', on='Country')
df2


Unnamed: 0,Country,Medals_x,Medals_y
0,USA,15.0,29.0
1,France,13.0,
2,Russia,9.0,16.0
3,Germany,,20.0


In [55]:
df3= pd.merge(df2, bronze, how='outer', on='Country')
df3

Unnamed: 0,Country,Medals_x,Medals_y,Medals
0,USA,15.0,29.0,28.0
1,France,13.0,,40.0
2,Russia,9.0,16.0,
3,Germany,,20.0,
4,UK,,,27.0


In [56]:
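# With no aggfunc given, pivot_table averages each numeric column per country
# Every country appears once here, so the values are unchanged
df3.pivot_table(index=['Country'])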


Unnamed: 0_level_0,Medals,Medals_x,Medals_y
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
France,40.0,13.0,
Germany,,,20.0
Russia,,9.0,16.0
UK,27.0,,
USA,28.0,15.0,29.0
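
The column names `Medals`, `Medals_x` and `Medals_y` no longer say which medal they count. One way around this, sketched below as an illustration rather than the only approach, is to rename the columns before merging. The names `Gold`, `Silver`, `Bronze` and `Total` are choices made for this example.

```python
import pandas as pd

gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'], 'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'], 'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'], 'Medals': [40, 28, 27]})

# Rename each 'Medals' column first so the merged columns are self-describing
medals = (gold.rename(columns={'Medals': 'Gold'})
              .merge(silver.rename(columns={'Medals': 'Silver'}), how='outer', on='Country')
              .merge(bronze.rename(columns={'Medals': 'Bronze'}), how='outer', on='Country'))

# sum(axis=1) skips NaN by default, so missing medal types count as 0
medals['Total'] = medals[['Gold', 'Silver', 'Bronze']].sum(axis=1)

print(medals.sort_values(by='Total', ascending=False))
```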


In [67]:
# 'Country' has been set as the index of each medal dataframe (as shown for silver below)
gold

Unnamed: 0_level_0,Medals
Country,Unnamed: 1_level_1
USA,15
France,13
Russia,9


In [65]:
silver.set_index('Country', inplace=True)

In [66]:
silver

Unnamed: 0_level_0,Medals
Country,Unnamed: 1_level_1
USA,29
Germany,20
Russia,16


In [71]:
bronze

Unnamed: 0_level_0,Medals
Country,Unnamed: 1_level_1
France,40
USA,28
UK,27


In [75]:
pd.concat([gold, silver, bronze], axis=1)

Unnamed: 0_level_0,Medals,Medals,Medals
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
USA,15.0,29.0,28.0
France,13.0,,40.0
Russia,9.0,16.0,
Germany,,20.0,
UK,,,27.0


In [79]:
result = gold.add(silver, fill_value=0).add(bronze, fill_value=0)
result.sort_values(by='Medals', ascending=False)

Unnamed: 0_level_0,Medals
Country,Unnamed: 1_level_1
USA,72.0
France,53.0
UK,27.0
Russia,25.0
Germany,20.0
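
The same totals can be reached without setting an index at all, by stacking the three dataframes with `pd.concat()` and grouping by country. This is a sketch of an alternative route, with the medal dataframes redefined here so that the block runs on its own.

```python
import pandas as pd

gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'], 'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'], 'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'], 'Medals': [40, 28, 27]})

# Stack the three tables, then add up the medals for each country
all_medals = pd.concat([gold, silver, bronze], axis=0, ignore_index=True)
totals = all_medals.groupby('Country')['Medals'].sum().sort_values(ascending=False)

print(totals)
```

Because no NaN values are introduced along the way, the totals stay integers instead of being converted to floats.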
