# Merging DataFrames

# Table of Contents

1.  [Overview](# 1. Overview)
2.  [products, aisles, departments .csv files](# 2. Products, Aisles, departments csv files)
3.  [Create a final_products dataframe](# 3. Create a final_products dataframe)
4.  [orders, order_products_prior, order_products_train .csv files](# 4. orders, order_products_prior, order_products_train .csv files)
5.  [Create a final_orders dataframe](# 5. Create a final_orders dataframe)
6.  [Merge final_orders with final_products dataframe](# 6. Merge final_orders with final_products dataframe)



# 1. Overview
In this notebook we take a closer look at all the datasets that Instacart provides.<br/>
We then show how to combine (merge) all of them into a single DataFrame.<br/>
Below you will find a summary of the available .csv files and the main attributes they hold.

![CSV_NAMES](https://kaggle2.blob.core.windows.net/forum-message-attachments/183176/6539/instacartFiles.png)

## 1.1 Import Packages
As always, we first import the pandas package, along with a new package called "gc" [Garbage Collector].


In [None]:
import pandas as pd
import gc

The gc module can prove really helpful when handling big DataFrames: every DataFrame manipulation reserves a considerable amount of memory (e.g. RAM), and gc helps release memory that is no longer needed.<br/>
In practice, it does not change the results of our computations, but it helps our machine (local or cloud) handle subsequent requests.
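
As a minimal sketch of how gc is typically used, the cell below builds a toy DataFrame named `tmp` (hypothetical, not part of the Instacart data), deletes it, and then asks the collector to run. In CPython, `del` alone usually frees an object once nothing else references it; `gc.collect()` additionally cleans up objects caught in reference cycles.

```python
import gc
import pandas as pd

# Toy stand-in for a large intermediate DataFrame (hypothetical data).
tmp = pd.DataFrame({"order_id": range(1_000), "product_id": range(1_000)})
row_count = len(tmp)

# Drop the only reference to the DataFrame, then ask the collector to run.
del tmp
unreachable = gc.collect()  # number of unreachable objects the collector found
```

The same pattern could be applied later in this notebook to intermediate DataFrames once they have been merged into a new one.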

# 2. products, aisles, departments csv files
## 2.1 Load data from the CSV files

At this stage we load the first three of our .csv files.

In [None]:
products = pd.read_csv('../input/products.csv')
aisles = pd.read_csv('../input/aisles.csv')
departments = pd.read_csv('../input/departments.csv')

## 2.2 View and Understand data

We use the <b>head()</b> method to get the first 5 rows of each of these three dataframes.

In [None]:
products.head()

The products table describes the 49688 products available on Instacart with:
1. product_id as the unique identifier of each product.
2. product_name to store the name of the product.
3. aisle_id to indicate the aisle (product category) that the product belongs to.
4. department_id to indicate the department that the product belongs to.

The aisles DF below holds the names of the different product categories that Instacart has.

In [None]:
aisles.head()

And the departments DF holds the names of the different product departments that Instacart has.

In [None]:
departments.head()

What can we notice by looking at the products, aisles and departments dataframes?

Answer: The products dataframe includes some columns that also appear in the aisles and departments dataframes: <br/>
"aisle_id", which can be found in the aisles DF, and "department_id", which can be found in the departments DF.<br/>
These columns in the products DF are keys that match a record in the aisles & departments DFs.

So, for example, if we look at the first product (product_id = 1) we see that: <br/>
"Chocolate Sandwich Cookies" belongs to the category with aisle_id = 61. <br/>
By checking aisle_id = 61 in the aisles DF:

In [None]:
aisles[aisles.aisle_id == 61]

We see that "Chocolate Sandwich Cookies" belongs to the product category "cookies cakes". <br/>
In the same fashion, "Chocolate Sandwich Cookies" has department_id = 19. <br/>
By checking the departments DF:

In [None]:
departments[departments.department_id == 19]

We see that "Chocolate Sandwich Cookies" also belongs to the department "snacks". <br/>
So, the information regarding the category and department **for each** product can be found in the aisles & departments DFs. <br/>
This means that we can <b>merge</b> these dataframes into a new one.



# 3. Create a final_products dataframe

In order to create a merged dataframe, we need to join the dataframes we have. We will create a new dataframe, final_products, which combines the products, aisles and departments dataframes. The products dataframe includes the columns "aisle_id" and "department_id", which also appear in the aisles and departments dataframes respectively. To combine them, we use the merge() function, which performs a join operation on columns or indexes.

First of all we have to choose the right type of join in order to create a dataframe with the data we want. There are four types of join:
1. (INNER) JOIN: Returns the records (rows) that have matching values in both dataframes.
2. LEFT (OUTER) JOIN: Returns all records from the left dataframe, and the matched records from the right dataframe.
3. RIGHT (OUTER) JOIN: Returns all records from the right dataframe, and the matched records from the left dataframe.
4. FULL (OUTER) JOIN: Returns all records from both dataframes, matching them where possible.
<img src="https://imgur.com/yLDkld9.png" width="400">
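
To make the four join types concrete, here is a small sketch on two toy dataframes (`left` and `right` are hypothetical and not part of the Instacart data). It counts how many rows each join type returns when only the keys 2 and 3 exist on both sides.

```python
import pandas as pd

# Toy dataframes: keys 2 and 3 appear on both sides, 1 only left, 4 only right.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Run the same merge with each join type and record the number of rows.
row_counts = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

Rows without a match on the other side are kept by the left, right and outer joins, with NaN in the columns that come from the missing side.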

## 3.1 Merge of products and aisles dataframes

The new_products (merged) dataframe should contain only the data we want: all rows and columns of the products dataframe, plus the column <b>"aisle"</b> from the aisles dataframe. According to the diagram above, this calls for a left join.


In [None]:
new_products = pd.merge(products,aisles,on="aisle_id", how="left")
new_products.head()

The function we used is <b><i>pd.merge(products,aisles)</i></b>. Let's explain what happened above:
* By default (without any extra arguments) the function performs an <b>inner join</b> between the two dataframes. That's why we passed the argument <b>how="left"</b>. With the "how" argument we can select any of the four join types.
* By default the function joins on the columns that the two dataframes have in common. In our example we pass <b>on="aisle_id"</b> to make the common column explicit. If you run the code without this argument you will get the same result. Can you think of a case in which this argument is useful?

Answer:
If we want to merge two dataframes that have more than one common column, we should use the "on" argument to indicate the column or columns that the function will use, e.g.:

<i>merge( x, y, on="key")</i>    
<i>merge( x, y, on=["first_key","second_key"])</i>
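
The following sketch illustrates why "on" matters, assuming two toy dataframes `x` and `y` (hypothetical) that share both "first_key" and "second_key". Without "on", both shared columns form the join key; with `on="first_key"`, only that column is matched and the other shared column is kept twice with suffixes.

```python
import pandas as pd

x = pd.DataFrame({"first_key": [1, 1, 2],
                  "second_key": ["a", "b", "a"],
                  "x_val": [10, 20, 30]})
y = pd.DataFrame({"first_key": [1, 1, 2],
                  "second_key": ["a", "b", "b"],
                  "y_val": [100, 200, 300]})

# No `on`: every shared column (first_key and second_key) is part of the key.
both_keys = pd.merge(x, y)

# Explicit `on`: only first_key is matched; second_key appears twice,
# as second_key_x and second_key_y.
one_key = pd.merge(x, y, on="first_key")

print(len(both_keys), len(one_key))  # 2 5
```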

## 3.2 Merge of new_products and departments dataframes

In this section, we would like to merge the "new_products" dataframe that we created before with the departments dataframe. In order to study a more complicated case of the "merge()" function, we will first make a small change to the column names of the new_products dataframe by assigning new labels to its columns. Let's treat this as our starting point and see how we can handle it.

In [None]:
new_products.columns = ['product_id', 'product_name', 'aisle_id', 'departments_id', 'aisle']
new_products.head()

Looking at the output of "head()" above, we can observe that the column <b>"department_id"</b> is now named <b>"departments_id"</b>. As we said previously, we would like to merge the "new_products" dataframe with "departments". The column the two dataframes have in common is the department id, but its name now differs between them. How will we handle it?

In [None]:
final_products = pd.merge(new_products,departments,left_on="departments_id",right_on="department_id",how="left")
final_products.head()

The function we used is <b><i>pd.merge(new_products,departments,left_on="departments_id",right_on="department_id",how="left")</i></b>. Let's explain what happened above:
* We passed <b>how="left"</b>, as before.
* We passed <b>left_on</b> and <b>right_on</b> to specify which column of each dataframe should be used for the merge.
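
A small sketch of the same idea on toy data (the dataframes `toy_products` and `toy_departments` and their values are hypothetical). It shows that when the key columns have different names, the merged result keeps both of them, which is why one of the two is redundant afterwards.

```python
import pandas as pd

toy_products = pd.DataFrame({"product_id": [1, 2],
                             "departments_id": [19, 13]})
toy_departments = pd.DataFrame({"department_id": [13, 19],
                                "department": ["pantry", "snacks"]})

# Match departments_id (left) against department_id (right).
merged = pd.merge(toy_products, toy_departments,
                  left_on="departments_id", right_on="department_id",
                  how="left")

print(list(merged.columns))
# ['product_id', 'departments_id', 'department_id', 'department']
```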

## 3.3 Make a uniform format for string columns
In the final_products DF, the columns that contain strings are product_name, aisle & department. In the following lines, we show how to convert every string to a single lowercase token (spaces are replaced by underscores and all letters are turned to lower case).

In [None]:
final_products.product_name = final_products.product_name.str.replace(' ', '_').str.lower()
final_products.department = final_products.department.str.replace(' ', '_').str.lower()
final_products.aisle = final_products.aisle.str.replace(' ', '_').str.lower()
final_products.head()
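
The same transformation can also be written as a loop over the string columns. This is only a sketch on a toy dataframe (`toy` is hypothetical); `regex=False` is passed so that the space is treated as a literal character regardless of the pandas version's default.

```python
import pandas as pd

toy = pd.DataFrame({"product_name": ["Chocolate Sandwich Cookies"],
                    "aisle": ["cookies cakes"],
                    "department": ["snacks"]})

# Apply the same normalization to every string column.
for col in ["product_name", "aisle", "department"]:
    toy[col] = toy[col].str.replace(" ", "_", regex=False).str.lower()

print(toy.loc[0, "product_name"])  # chocolate_sandwich_cookies
```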

## 3.4 Delete unnecessary columns

Finally, in this section we will delete some columns which are not useful in our new dataframe. These columns are "aisle_id", "departments_id" and "department_id".

In [None]:
del final_products["aisle_id"]
del final_products["departments_id"]
del final_products["department_id"]
final_products.head()
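
An alternative to repeated `del` statements is the `DataFrame.drop` method, which removes several columns in one call and returns a new dataframe. A sketch on a toy dataframe (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"product_id": [1],
                    "aisle_id": [61],
                    "departments_id": [19],
                    "department_id": [19],
                    "department": ["snacks"]})

# drop() returns a new dataframe; the original `toy` keeps all its columns.
trimmed = toy.drop(columns=["aisle_id", "departments_id", "department_id"])

print(list(trimmed.columns))  # ['product_id', 'department']
```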

# 4. orders, order_products_prior, order_products_train .csv files
## 4.1 Load data from the CSV files


Now let's have another look at the overview of our available .csv files: <br/>
![CSV_NAMES](https://kaggle2.blob.core.windows.net/forum-message-attachments/183176/6539/instacartFiles.png)


At this stage we will work with the remaining .csv files, except for sample_submission.csv, which is mainly used for the competition hosted on Kaggle.
First we load the csv files:

In [None]:
orders = pd.read_csv('../input/orders.csv' )
op_prior = pd.read_csv('../input/order_products__prior.csv')
op_train = pd.read_csv('../input/order_products__train.csv' )

## 4.2 View and Understand data

Now we will explore each DF:

In [None]:
orders.head()

The orders DF keeps track of the basic information for each order.
1. order_id is the unique index key for each order.
2. user_id is a unique index key for each customer.
3. eval_set has three distinct values [prior, train, test]; for the time being we will not worry about this attribute.
4. order_number indicates the rank of a given order for a specific user [in orders.head( ) we see the first five orders of user_id=1].
5. order_dow indicates a day of the week [values 0,1,...6].
6. order_hour_of_day indicates an hour of the day that an order has been placed.
7. days_since_prior_order indicates how many days have passed since the previous order [that's why the first order has Not A Number (NaN) value].


In [None]:
op_prior.head()

The op_prior DF keeps track of the products purchased in each order:
1. order_id indicates the matching key in the orders DF.
2. product_id indicates the unique id of a product purchased in this particular order.
3. add_to_cart_order indicates the position in which the product was added to the cart in a specific order.
4. reordered shows whether this product has been ordered by this user before [1: reordered / 0: not reordered].

So if we would like to see which products the first order of the orders DF includes, we have to check all rows of op_prior that have order_id=2539329:

In [None]:
op_prior[op_prior.order_id == 2539329]

The op_train DF holds the same info, but for the orders labeled as train in the orders DF:

In [None]:
op_train.head()

For the purposes of this notebook, we will not examine why there are two DFs that hold the products of each order.

# 5. Create a final_orders dataframe
## 5.1 Merge  op_prior & op_train with orders

From the above example we understand that both op_prior & op_train contain additional information about the orders found in the orders DF, namely the products purchased in each order. <br/>

As these two DFs have exactly the same columns, we will stack them one below the other:
we take the rows of op_train and append them below the rows of op_prior.<br/>
To do this we use the pd.concat( ) function, passing the two DFs as a list.

In [None]:
final_orders = pd.concat([op_prior, op_train])
final_orders.head()
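
One detail worth knowing about pd.concat( ) is that, by default, each DF keeps its own row labels, so the stacked result can contain repeated index values. The sketch below uses two toy DFs (`toy_prior` and `toy_train`, with hypothetical values) to show this, along with the optional `ignore_index=True` argument that renumbers the rows.

```python
import pandas as pd

toy_prior = pd.DataFrame({"order_id": [2, 2], "product_id": [33120, 28985]})
toy_train = pd.DataFrame({"order_id": [1], "product_id": [49302]})

# Default: original row labels are kept, so label 0 appears twice.
stacked = pd.concat([toy_prior, toy_train])

# ignore_index=True: rows are relabeled 0..n-1.
stacked_reindexed = pd.concat([toy_prior, toy_train], ignore_index=True)

print(list(stacked.index))            # [0, 1, 0]
print(list(stacked_reindexed.index))  # [0, 1, 2]
```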

final_orders now contains the rows of both op_prior & op_train. <br/>
Next we add to it all the relevant info for each order from the "orders" DF. <br/>
We use a left join with final_orders on the left, as we want to keep all the products purchased in each order, and merge it with the "orders" DF to fetch the relevant info.

In [None]:
#execution time 20s
final_orders = pd.merge(final_orders , orders,how='left')
final_orders.head()

The final DF includes all the orders, as well as the products purchased in each order and other metrics.
However, we see that the first row has order_id==2, which means that our rows are not sorted properly. <br/>
To sort the rows of our DF, we use the .sort_values( ) method. <br/>
The sort_values( ) method takes as an argument a list of the column names to sort by. It returns a new, sorted DF rather than modifying the original, so we assign the result back to final_orders. <br/>
In our case, we would like to order the DF by 'order_id' and, as a second criterion (when several rows share the same order_id), by 'add_to_cart_order':


In [None]:
final_orders = final_orders.sort_values(['order_id', 'add_to_cart_order'])
final_orders.head()
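
The sketch below, on a toy DF (`toy` is hypothetical), illustrates that sort_values( ) leaves the original DF untouched unless its result is assigned. The optional `reset_index(drop=True)` call renumbers the rows after sorting.

```python
import pandas as pd

toy = pd.DataFrame({"order_id": [2, 1, 2],
                    "add_to_cart_order": [2, 1, 1]})

# The sorted result is discarded here, so `toy` keeps its original order.
toy.sort_values(["order_id", "add_to_cart_order"])
unchanged = list(toy["order_id"])

# Assigning the result keeps the sorted rows.
toy_sorted = (toy.sort_values(["order_id", "add_to_cart_order"])
                 .reset_index(drop=True))

print(unchanged)                     # [2, 1, 2]
print(list(toy_sorted["order_id"]))  # [1, 2, 2]
```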

# 6. Merge final_orders with final_products dataframe

In the last stage of this notebook, we will merge the final_orders DF with the final_products DF on their common column, product_id. This will lead to a DF that contains all the available information provided by Instacart.

In [None]:
final = pd.merge(final_orders, final_products, how='left')
final.head()

The final DF contains all the orders, all the products placed in each order, as well as all available information regarding each product.
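
As an optional sanity check on a merge like this one, pandas offers the `validate` and `indicator` arguments of pd.merge. The sketch below uses toy DFs (`toy_orders` and `toy_products`, both hypothetical): `validate="many_to_one"` raises an error if a product_id appears more than once on the right side, and `indicator=True` adds a `_merge` column showing whether each row found a match.

```python
import pandas as pd

toy_orders = pd.DataFrame({"order_id": [1, 1, 2],
                           "product_id": [10, 20, 10]})
toy_products = pd.DataFrame({"product_id": [10, 20],
                             "product_name": ["banana", "milk"]})

checked = pd.merge(toy_orders, toy_products,
                   on="product_id", how="left",
                   validate="many_to_one",  # each product_id must be unique on the right
                   indicator=True)          # adds the _merge column

# Every order row should have found its product.
all_matched = bool((checked["_merge"] == "both").all())
print(len(checked), all_matched)  # 3 True
```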