# Joining Data in Pandas DataFrames

Here are the steps involved in joining data using the Pandas DataFrame APIs:

- First, set the join key as the index of both DataFrames that are to be joined.
- Use `join` to join the DataFrames based on the index. The default join type is `left`; pass `how` to select another type such as `right` or `inner`, depending on the requirement.

In our case, we will join `orders` and `customers` using an inner join.
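
The two steps above can be sketched on a pair of tiny, made-up DataFrames before running them on the real data. The sample values and the names `customers_idx` and `orders_idx` are illustrative assumptions, not part of the retail dataset.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are made up for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "customer_fname": ["Richard", "Mary", "Ann"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "order_customer_id": [2, 2, 1],
})

# Step 1: make the key column the index of each DataFrame.
customers_idx = customers.set_index("customer_id")
orders_idx = orders.set_index("order_customer_id")

# Step 2: join on the index. how="inner" keeps only keys present on both sides,
# so customer 3 (no orders) is dropped and customer 2 appears twice.
joined = customers_idx.join(orders_idx, how="inner")
print(joined)
```

Customer 3 has no orders in this sample, so it does not appear in the inner join result.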

In [1]:
import json
import pandas as pd

In [2]:
# First, create two pandas DataFrames from the orders and customers CSV files.
# The column names (headers) come from the schemas.json file.
# The function below, developed earlier, returns the column names sorted by position.
schema_file_path = 'E:/Projects/Data_Engineering/Data-Engineering/data/retail_db/schemas.json'
with open(schema_file_path) as schema_file:
    schema = json.load(schema_file)

def get_column_name(schema, tableName, sortingKey='column_position'):
    column_details = schema[tableName]
    column_details_sort = sorted(column_details, key=lambda col:col[sortingKey])
    return [col['column_name'] for col in column_details_sort]


orders_column_names = get_column_name(schema, 'orders')
customer_column_names = get_column_name(schema, 'customers')
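
To see what `get_column_name` does without the file on disk, here is a small sketch that runs it on an in-memory dictionary. The `sample_schema` fragment is hypothetical; it only assumes the structure the function relies on, which is a list of dictionaries per table with `column_name` and `column_position` keys.

```python
def get_column_name(schema, tableName, sortingKey='column_position'):
    column_details = schema[tableName]
    column_details_sort = sorted(column_details, key=lambda col: col[sortingKey])
    return [col['column_name'] for col in column_details_sort]


# Hypothetical schema fragment, deliberately listed out of order.
sample_schema = {
    "orders": [
        {"column_name": "order_date", "column_position": 2},
        {"column_name": "order_id", "column_position": 1},
        {"column_name": "order_status", "column_position": 4},
        {"column_name": "order_customer_id", "column_position": 3},
    ]
}

sample_columns = get_column_name(sample_schema, "orders")
print(sample_columns)
# ['order_id', 'order_date', 'order_customer_id', 'order_status']
```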


In [3]:
# We now have the column lists for both tables, so we can read the data from the CSV files.

orders_data_file_path = 'E:/Projects/Data_Engineering/Data-Engineering/data/retail_db/orders/part-00000'
customer_data_file_path = 'E:/Projects/Data_Engineering/Data-Engineering/data/retail_db/customers/part-00000'

orders = pd.read_csv(
    orders_data_file_path,
    names=orders_column_names
)

customers = pd.read_csv(
    customer_data_file_path,
    names=customer_column_names
)
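
As a minimal sketch of what `names=` does, the snippet below reads two header-less rows from an in-memory buffer instead of a file. The rows are copied from the `orders` output in this notebook; `raw` and `sample_orders` are illustrative names.

```python
import io

import pandas as pd

# Two header-less rows, held in memory instead of a file on disk.
raw = io.StringIO(
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
)
column_names = ["order_id", "order_date", "order_customer_id", "order_status"]

# Passing names= tells read_csv that the data has no header row,
# so the first line is treated as data and the given names become the columns.
sample_orders = pd.read_csv(raw, names=column_names)
print(sample_orders)
```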

In [4]:
orders

,order_id,order_date,order_customer_id,order_status
0,1,2013-07-25 00:00:00.0,11599,CLOSED
1,2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
2,3,2013-07-25 00:00:00.0,12111,COMPLETE
3,4,2013-07-25 00:00:00.0,8827,CLOSED
4,5,2013-07-25 00:00:00.0,11318,COMPLETE
...,...,...,...,...
68878,68879,2014-07-09 00:00:00.0,778,COMPLETE
68879,68880,2014-07-13 00:00:00.0,1117,COMPLETE
68880,68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68881,68882,2014-07-22 00:00:00.0,10000,ON_HOLD


In [5]:
customers

,customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode
0,1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521
1,2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126
2,3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,3422 Blue Pioneer Bend,Caguas,PR,725
3,4,Mary,Jones,XXXXXXXXX,XXXXXXXXX,8324 Little Common,San Marcos,CA,92069
4,5,Robert,Hudson,XXXXXXXXX,XXXXXXXXX,10 Crystal River Mall,Caguas,PR,725
...,...,...,...,...,...,...,...,...,...
12430,12431,Mary,Rios,XXXXXXXXX,XXXXXXXXX,1221 Cinder Pines,Kaneohe,HI,96744
12431,12432,Angela,Smith,XXXXXXXXX,XXXXXXXXX,1525 Jagged Barn Highlands,Caguas,PR,725
12432,12433,Benjamin,Garcia,XXXXXXXXX,XXXXXXXXX,5459 Noble Brook Landing,Levittown,NY,11756
12433,12434,Mary,Mills,XXXXXXXXX,XXXXXXXXX,9720 Colonial Parade,Caguas,PR,725


In [6]:
# To perform the join, first analyze both datasets and identify the key columns on which the DataFrames can be joined.
# Set each key column as the index; the join is then performed on those indexes.

orders = orders.set_index('order_customer_id')
customers = customers.set_index('customer_id')
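
Setting the index is what `join` needs here. As an alternative sketch, `merge` can match on ordinary columns with `left_on` and `right_on`, so no index has to be set. This is not the approach used in this notebook; the toy data below is made up for illustration.

```python
import pandas as pd

# Toy data; the values are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "customer_fname": ["Richard", "Mary", "Ann"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "order_customer_id": [2, 2, 1],
})

# merge matches on regular columns, so neither DataFrame needs set_index.
merged = customers.merge(
    orders,
    left_on="customer_id",
    right_on="order_customer_id",
    how="inner",
)
print(merged)
```

With `merge`, both key columns stay in the result as regular columns.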


In [14]:
customers.shape

(12435, 8)

In [10]:
order_customer = customers.join(
    orders,
    how='inner'
)


In [15]:
order_customer

customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode,order_id,order_date,order_status
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521,22945,2013-12-13 00:00:00.0,COMPLETE
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,15192,2013-10-29 00:00:00.0,PENDING_PAYMENT
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,33865,2014-02-18 00:00:00.0,COMPLETE
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,57963,2013-08-02 00:00:00.0,ON_HOLD
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,67863,2013-11-30 00:00:00.0,COMPLETE
...,...,...,...,...,...,...,...,...,...,...,...
12434,Mary,Mills,XXXXXXXXX,XXXXXXXXX,9720 Colonial Parade,Caguas,PR,725,42915,2014-04-16 00:00:00.0,COMPLETE
12434,Mary,Mills,XXXXXXXXX,XXXXXXXXX,9720 Colonial Parade,Caguas,PR,725,51800,2014-06-14 00:00:00.0,ON_HOLD
12434,Mary,Mills,XXXXXXXXX,XXXXXXXXX,9720 Colonial Parade,Caguas,PR,725,61777,2013-12-26 00:00:00.0,COMPLETE
12435,Laura,Horton,XXXXXXXXX,XXXXXXXXX,5736 Honey Downs,Summerville,SC,29483,41643,2014-04-08 00:00:00.0,PENDING


In [12]:
order_customer.shape

(68883, 11)
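
The shape is consistent with how the join works: the 8 customer columns plus the 3 remaining order columns give 11, and each matching order produces one row, so a customer with several orders is repeated (customer 2 appears at least four times above). The sketch below, on made-up data, contrasts the inner join with the default left join, which also keeps customers that have no orders.

```python
import pandas as pd

# Toy data already indexed by the join key; the values are illustrative only.
customers = pd.DataFrame(
    {"customer_fname": ["Richard", "Mary", "Ann"]},
    index=pd.Index([1, 2, 3], name="customer_id"),
)
orders = pd.DataFrame(
    {"order_id": [10, 11, 12]},
    index=pd.Index([2, 2, 1], name="order_customer_id"),
)

inner = customers.join(orders, how="inner")  # only customers with orders
left = customers.join(orders)                # default how="left": all customers

print(inner.shape)                      # (3, 2)
print(left.shape)                       # (4, 2)
print(left["order_id"].isna().sum())    # 1 (customer 3 has no order)
```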

## Perform data operations on the join result 

In [16]:
# Aggregations
# First, reset the index. Grouping by the index is possible, but it is not a good approach.
# Note: reset_index() returns a new DataFrame; order_customer is unchanged unless the result is assigned.

order_customer.reset_index()

,customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode,order_id,order_date,order_status
0,1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521,22945,2013-12-13 00:00:00.0,COMPLETE
1,2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,15192,2013-10-29 00:00:00.0,PENDING_PAYMENT
2,2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,33865,2014-02-18 00:00:00.0,COMPLETE
3,2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,57963,2013-08-02 00:00:00.0,ON_HOLD
4,2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126,67863,2013-11-30 00:00:00.0,COMPLETE
...,...,...,...,...,...,...,...,...,...,...,...,...
68878,12434,Mary,Mills,XXXXXXXXX,XXXXXXXXX,9720 Colonial Parade,Caguas,PR,725,42915,2014-04-16 00:00:00.0,COMPLETE
68879,12434,Mary,Mills,XXXXXXXXX,XXXXXXXXX,9720 Colonial Parade,Caguas,PR,725,51800,2014-06-14 00:00:00.0,ON_HOLD
68880,12434,Mary,Mills,XXXXXXXXX,XXXXXXXXX,9720 Colonial Parade,Caguas,PR,725,61777,2013-12-26 00:00:00.0,COMPLETE
68881,12435,Laura,Horton,XXXXXXXXX,XXXXXXXXX,5736 Honey Downs,Summerville,SC,29483,41643,2014-04-08 00:00:00.0,PENDING
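
A short sketch of how `reset_index` behaves, assuming a small made-up DataFrame indexed by `customer_id`: the call returns a new DataFrame and leaves the original untouched, so the result has to be assigned to be reused. The names `df` and `flat` are illustrative.

```python
import pandas as pd

# Toy joined result indexed by customer_id; the values are illustrative only.
df = pd.DataFrame(
    {"order_id": [12, 10, 11]},
    index=pd.Index([1, 2, 2], name="customer_id"),
)

# reset_index() returns a new DataFrame; df itself keeps its index.
flat = df.reset_index()

print(list(df.columns))    # ['order_id']
print(list(flat.columns))  # ['customer_id', 'order_id']
```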


In [20]:
(
    order_customer
    .groupby('customer_id')['order_id']
    .agg(order_count='count')
    .reset_index()
    .query('order_count >= 10')
)

,customer_id,order_count
70,71,10
171,172,10
173,174,12
196,197,11
219,221,15
...,...,...
12311,12341,10
12317,12347,10
12375,12406,10
12400,12431,16
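
The same count-and-filter pattern can be checked by hand on a few rows. This is a minimal sketch on made-up data, with a threshold of 2 instead of 10 so the small sample has matches; `joined`, `order_counts`, and `frequent` are illustrative names.

```python
import pandas as pd

# Toy joined data with customer_id as a regular column; values are illustrative.
joined = pd.DataFrame({
    "customer_id": [1, 2, 2, 2, 3, 3],
    "order_id": [12, 10, 11, 13, 14, 15],
})

# Named aggregation: the keyword (order_count) becomes the output column name.
order_counts = (
    joined
    .groupby("customer_id")["order_id"]
    .agg(order_count="count")
    .reset_index()
)

# Keep only customers with at least 2 orders.
frequent = order_counts.query("order_count >= 2")
print(frequent)
```

Customer 1 has a single order and is filtered out; customers 2 and 3 remain with counts 3 and 2.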
