In [None]:
# loading libraries and reading the data
import numpy as np
import pandas as pd

market_df = pd.read_csv("./global_sales_data/market_fact.csv")
customer_df = pd.read_csv("./global_sales_data/cust_dimen.csv")
product_df = pd.read_csv("./global_sales_data/prod_dimen.csv")
shipping_df = pd.read_csv("./global_sales_data/shipping_dimen.csv")
orders_df = pd.read_csv("./global_sales_data/orders_dimen.csv")

### Merging Dataframes

There are five data files:
1. The ```market_fact``` table contains the sales data of each order
2. The other four files are dimension tables, which contain metadata about customers, products, shipping details and order details.

In [None]:
market_df.head()

In [None]:
# Customer dimension table: Each row contains metadata about customers
customer_df.head()

In [None]:
# Product dimension table
product_df.head()

In [None]:
# Shipping metadata
shipping_df.head()

In [None]:
# Orders dimension table
orders_df.head()

### Merging Dataframes

Say you want to select all orders and observe the ```Sales``` of the customer segment *Consumer*. Since customer segment details are present in the dataframe ```customer_df```, we first need to merge it with ```market_df```.


In [None]:
# Merging the dataframes
# Note that Cust_id is the common column/key, which is provided to the 'on' argument
# how = 'inner' makes sure that only the customer ids present in both dfs are included in the result
df_1 = pd.merge(market_df, customer_df, how='inner', on='Cust_id')
df_1.head()

In [None]:
# Now you can subset the orders made by customers from the 'Consumer' segment
df_1.loc[df_1['Customer_Segment'] == 'CONSUMER', :]

In [None]:
# Example 2: Select all orders from product category = furniture and from the consumer segment
# Product category is stored in product_df, so we first merge it in on Prod_id

df_2 = pd.merge(df_1, product_df, how='inner', on='Prod_id')
df_2.head()

In [None]:
# Select all orders from product category = FURNITURE and from the consumer segment
df_2.loc[(df_2['Product_Category']=='FURNITURE') & (df_2['Customer_Segment']=='CONSUMER'),:]


Similarly, you can merge the other dimension tables, ```shipping_df``` and ```orders_df```, to create a ```master_df``` and perform indexing using any column in the master dataframe.


In [None]:
# Merging shipping_df
df_3 = pd.merge(df_2, shipping_df, how='inner', on='Ship_id')
df_3.shape

In [None]:
# Merging the orders table to create a master df
master_df = pd.merge(df_3, orders_df, how='inner', on='Ord_id')
print(master_df.shape)  # printed explicitly, since only the last expression in a cell is displayed
master_df.head()

Similarly, you can perform left, right and outer merges (joins) by using the argument ```how = 'left' / 'right' / 'outer'```.
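As a minimal sketch of how the ```how``` argument changes the result, the example below merges two small made-up dataframes. The names ```left``` and ```right``` and all of their values are illustrative assumptions and are not part of the sales data. The ```indicator=True``` option adds a ```_merge``` column that shows which table each row came from.

```python
import pandas as pd

# Hypothetical toy tables; the ids and values are invented for illustration
left = pd.DataFrame({'Cust_id': ['C1', 'C2', 'C3'],
                     'Sales': [100, 250, 75]})
right = pd.DataFrame({'Cust_id': ['C2', 'C3', 'C4'],
                      'Customer_Segment': ['CONSUMER', 'CORPORATE', 'HOME OFFICE']})

# inner: only keys present in both tables (C2, C3)
inner = pd.merge(left, right, how='inner', on='Cust_id')

# left: every key from the left table; unmatched rows get NaN in the right-hand columns
left_join = pd.merge(left, right, how='left', on='Cust_id')

# right: every key from the right table
right_join = pd.merge(left, right, how='right', on='Cust_id')

# outer: the union of keys; the _merge column records the origin of each row
outer = pd.merge(left, right, how='outer', on='Cust_id', indicator=True)

print(outer)
```

An outer merge with ```indicator=True``` is a convenient way to check which keys fail to match before settling on an inner merge.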

### Concatenating Dataframes

#### Concatenating Dataframes Having the Same Columns

In [None]:
# dataframes having the same columns
df1 = pd.DataFrame({'Name': [<4 names from class>],
                    'Age': [<their age>],
                    'Gender': [<their gender in 'M', 'F' format>]}
                  )

df2 = pd.DataFrame({'Name': [<3 names from class>],
                    'Age': [<their age>],
                    'Gender': [<their gender in 'M', 'F' format>]}
                  )
df1

In [None]:
df2

In [None]:
# To concatenate them, one on top of the other, you can use pd.concat
# The first argument is a sequence (list) of dataframes
# axis = 0 indicates that we want to concat along the row axis
pd.concat([df1, df2], axis = 0)

In [None]:
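# DataFrame.append() used to offer the same row-wise concatenation, but it was
# deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is used here instead
# ignore_index = True renumbers the rows of the result from 0
pd.concat([df1, df2], ignore_index = True)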
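
The row-wise concatenation above can be tried out with a small runnable sketch. The dataframes ```class_a``` and ```class_b``` and every value in them are invented for illustration. The sketch also shows that ```pd.concat``` keeps the original row labels unless ```ignore_index=True``` is passed.

```python
import pandas as pd

# Hypothetical sample data; names, ages and genders are made up
class_a = pd.DataFrame({'Name': ['Asha', 'Ben', 'Chen', 'Dia'],
                        'Age': [21, 22, 23, 21],
                        'Gender': ['F', 'M', 'M', 'F']})
class_b = pd.DataFrame({'Name': ['Eli', 'Fay', 'Gus'],
                        'Age': [22, 24, 23],
                        'Gender': ['M', 'F', 'M']})

# Stack the rows of class_b below the rows of class_a; row labels are kept as they were
stacked = pd.concat([class_a, class_b], axis=0)

# Same data, but with a fresh 0..n-1 index
renumbered = pd.concat([class_a, class_b], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Duplicate row labels are legal in pandas, but they make label-based selection with ```.loc``` return several rows, so renumbering is often the safer choice.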


#### Concatenating Dataframes Having the Same Rows

You may also have dataframes that have the same rows but different columns (and no common columns). In this case, you may want to concatenate them side by side. For example:

In [None]:
df1 = pd.DataFrame({'Name': [<4 names from class>],
                    'Age': [<their age>],
                    'Gender': [<their gender in 'M', 'F' format>]}
                  )
df1

In [None]:
df2 = pd.DataFrame({'School': [<their school names>],
                    'Marks': [<12th marks>]}
                  )
df2

In [None]:
# To join the two dataframes, use axis = 1 to indicate joining along the columns axis
# The join is possible because the corresponding rows have the same indices
pd.concat([df1, df2], axis = 1)

Note that ```pd.concat()``` can also be used to merge dataframes on common keys, though we will not discuss that here. For simplicity, we have used ```pd.merge()``` for database-style merging and ```pd.concat()``` for stacking dataframes on top of each other or placing them side by side.
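Side-by-side concatenation relies on the row indices of the two dataframes matching. The sketch below, which uses made-up dataframes named ```people``` and ```scores```, shows what happens when they do not match: ```pd.concat``` aligns rows by index label and fills the gaps with NaN. Resetting the index first is one possible way to pair rows by position instead.

```python
import pandas as pd

# Hypothetical data; the index labels are deliberately shifted by one
people = pd.DataFrame({'Name': ['Asha', 'Ben', 'Chen']}, index=[0, 1, 2])
scores = pd.DataFrame({'Marks': [88, 92, 79]}, index=[1, 2, 3])

# Rows are matched on index labels, so labels 0 and 3 each have one missing value
side_by_side = pd.concat([people, scores], axis=1)

# Dropping the old labels pairs the rows purely by position
by_position = pd.concat([people.reset_index(drop=True),
                         scores.reset_index(drop=True)], axis=1)

print(side_by_side)
print(by_position)
```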

#### Performing Arithmetic Operations on Two or More Dataframes

We can also perform simple arithmetic operations on two or more dataframes. Below are the stats for IPL 2019 and 2018.

In [None]:
# Teamwise stats for IPL 2019
IPL_2019 = pd.DataFrame({'IPL Team': [<short forms of the teams>],
                         'Matches Played': [<number of matches>],
                         'Matches Won': [<out of matches played how many won>]}
                       )

# Set the 'IPL Team' column as the index so that arithmetic operations on the other columns are aligned by team
IPL_2019.set_index('IPL Team', inplace = True)
IPL_2019

In [None]:
# Similarly, we have the stats for IPL 2018
IPL_2018 = pd.DataFrame({'IPL Team': [<short forms of the teams>],
                         'Matches Played': [<number of matches>],
                         'Matches Won': [<out of matches played how many won>]}
                       )
IPL_2018.set_index('IPL Team', inplace = True)
IPL_2018

In [None]:
# Simply add the two DFs using the add operator

Total = IPL_2019 + IPL_2018
Total

Notice that there are a lot of NaN values. This is because some teams that played in IPL 2018 did not play in IPL 2019, and IPL 2019 also had new teams. We can handle these NaN values by using `df.add()` instead of the simple add operator.

In [None]:
# fill_value = 0 treats a value that is missing from one of the two dataframes as 0 before adding,
# so a team that appears in only one season keeps its own numbers instead of becoming NaN
Total = IPL_2019.add(IPL_2018, fill_value = 0)
Total

Also notice how the resultant dataframe is sorted by the index, i.e. 'IPL Team' alphabetically.
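The difference between the ```+``` operator and `df.add()` with `fill_value` can be checked on a small example. The sketch below uses two made-up dataframes, ```season_a``` and ```season_b```, whose team labels and numbers are purely hypothetical.

```python
import pandas as pd

# Hypothetical stats; team 'T1' appears only in season_a and 'T3' only in season_b
season_a = pd.DataFrame({'Matches Played': [14, 14], 'Matches Won': [9, 6]},
                        index=['T1', 'T2'])
season_b = pd.DataFrame({'Matches Played': [14, 14], 'Matches Won': [8, 5]},
                        index=['T2', 'T3'])

# Plain addition: rows that exist in only one dataframe become NaN
plain_sum = season_a + season_b

# add() with fill_value=0: a missing row is treated as zeros before adding
filled_sum = season_a.add(season_b, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Only 'T2' has values in ```plain_sum```, whereas ```filled_sum``` has a complete row for every team. If a value were missing from both dataframes, the result would still be NaN.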

In [None]:
# Creating a new column - 'Win Percentage' (stored as a fraction between 0 and 1)

Total['Win Percentage'] = Total['Matches Won']/Total['Matches Played']
Total

In [None]:
# Sorting to determine the teams with the most wins. If two teams have the same number of wins, the win percentage breaks the tie.

Total.sort_values(by = ['Matches Won', 'Win Percentage'], ascending = False)
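
The tie-breaking behaviour of a multi-column sort can be seen in a small sketch. The dataframe ```totals``` and its values are hypothetical. Passing a list to ```ascending``` sets the direction for each sort column separately.

```python
import pandas as pd

# Hypothetical totals; 'T1' and 'T2' are tied on wins
totals = pd.DataFrame({'Matches Won': [10, 10, 7],
                       'Win Percentage': [0.50, 0.62, 0.70]},
                      index=['T1', 'T2', 'T3'])

# Sort by wins first; within equal wins, the higher win percentage comes first
ranked = totals.sort_values(by=['Matches Won', 'Win Percentage'],
                            ascending=[False, False])

print(ranked)
```

Here 'T3' has the best win percentage but still ranks last, because the second column is only consulted when the first one is tied.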

Apart from `add()`, there are other operator-equivalent functions that you can use on dataframes. Below is a list of commonly used functions for performing operations on two or more dataframes, each with its equivalent operator:
-  `add()`: +
-  `sub()`: -
-  `mul()`: *
-  `div()`: /
-  `floordiv()`: //
-  `mod()`: %
-  `pow()`: **
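
A short sketch of a few of these functions is given below, using two made-up single-column dataframes ```a``` and ```b```. As with `add()`, each function accepts a `fill_value` argument, and without it the function behaves like its operator.

```python
import pandas as pd

# Hypothetical data; only the row label 'r2' is shared
a = pd.DataFrame({'x': [10, 20]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [3, 4]}, index=['r2', 'r3'])

# Subtraction, treating a missing row as 0
diff = a.sub(b, fill_value=0)

# Multiplication, treating a missing row as 1 so the existing value is kept
prod = a.mul(b, fill_value=1)

# Without fill_value, only the shared row 'r2' produces a number
quotient = a.floordiv(b)
remainder = a.mod(b)

# The function form and the operator form give the same result
same = (a - b).equals(a.sub(b))

print(diff)
print(prod)
print(same)
```

The choice of `fill_value` depends on the operation: 0 is neutral for addition and subtraction, while 1 is neutral for multiplication and division.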