# Merging your Instacart Data

# 1. If you haven’t merged your prepared Instacart orders data with the new orders_products_prior dataframe per the instructions in this Exercise, do so now.

The df_ords and df_prods dataframes don’t contain a common column. How, then, are you supposed to combine them?

To solve this problem, you’ll combine your df_ords dataframe with a new dataframe called orders_products_prior.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import os

# Import the data set as a new dataframe

In [2]:
# Define path to the dfs
path = r'/Users/renataherrera/Documents/CF RH 2023-2024/CF DATA IMMERSION/CF RH A4 PYTHON/RH_PYTHON_Instacart Basket Analysis'

In [3]:
# Importing prepared data orders_checked.csv df
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked.csv'), index_col = False)

In [4]:
# Importing new original data orders_products_prior.csv df
df_ords_prior = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders_products_prior.csv'), index_col = False)

In [5]:
# Check the output
df_ords_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


Get into the habit of using the head() function to ensure nothing looks off about your imported dataframes.

You’ll also want to run some shape checks on both dataframes using the shape attribute. This is a great way to confirm the total size of your imported dataframes:

In [6]:
df_ords_prior.shape

(32434489, 4)

In [7]:
# Confirm the total size of your imported dataframes
df_ords.shape

(3421083, 9)

Your new dataframe is quite large: over 32 million rows. This is because it contains each and every order of each and every user, as well as exactly what they ordered.

In [8]:
# Check the output
df_ords.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order
0,0,0,2539329,1,prior,1,2,8,
1,1,1,2398795,1,prior,2,3,7,15.0
2,2,2,473747,1,prior,3,3,12,21.0
3,3,3,2254736,1,prior,4,4,7,29.0
4,4,4,431534,1,prior,5,4,15,28.0


In [9]:
# Importing prepared data products_checked.csv df
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'), index_col = False)

In [10]:
# Check the output
df_prods.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


Run a shape check on this dataframe as well, using the shape attribute, to confirm its total size:

In [11]:
df_prods.shape

(49693, 6)

You want to merge the new df_ords_prior dataframe with your df_ords dataframe, something you should be able to do easily despite their different shapes, thanks to their shared column: “order_id.” In theory, you should have a fully matching “order_id” column, so you shouldn’t need to specify a type of join.
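To see why two dataframes of different shapes can still be merged on a shared key, here is a minimal sketch using toy data. The `orders` and `order_products` names and their values are made up for illustration; they only stand in for df_ords (one row per order) and df_ords_prior (one row per product in an order).

```python
import pandas as pd

# Hypothetical toy data: one row per order
orders = pd.DataFrame({'order_id': [1, 2], 'user_id': [10, 11]})

# Hypothetical toy data: one row per product within an order
order_products = pd.DataFrame({
    'order_id': [1, 1, 2, 2, 2],
    'product_id': [196, 14084, 33120, 28985, 9327],
})

# Each order row is repeated once for every matching product row
merged = orders.merge(order_products, on='order_id')

print(orders.shape)          # (2, 2)
print(order_products.shape)  # (5, 2)
print(merged.shape)          # (5, 3)
```

When every key matches, the merged dataframe has as many rows as the “many” side of the relationship, which is why the row count of the large dataframe is the one to keep an eye on.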

As a reminder, the default type for a join is “inner,” which means the resulting data set will only contain observations included in both input data sets. Go ahead and execute the following code:

In [12]:
# Creating a new df, df_merged_large, that contains the combined df_ords and df_ords_prior dfs
# Uses the "order_id" column as its key
# Also includes the indicator = True argument to check for a full match
df_merged_large = df_ords.merge(df_ords_prior, on = 'order_id', indicator = True)

In [13]:
# Checking the output of columns
df_merged_large.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order,product_id,add_to_cart_order,reordered,_merge
0,0,0,2539329,1,prior,1,2,8,,196,1,0,both
1,0,0,2539329,1,prior,1,2,8,,14084,2,0,both
2,0,0,2539329,1,prior,1,2,8,,12427,3,0,both
3,0,0,2539329,1,prior,1,2,8,,26088,4,0,both
4,0,0,2539329,1,prior,1,2,8,,26405,5,0,both


In [14]:
# Count the frequency of each value in the "_merge" column
df_merged_large['_merge'].value_counts()

_merge
both          32434489
left_only            0
right_only           0
Name: count, dtype: int64

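Let’s see what the merge flag frequency shows you here. In the “_merge” column, every entry has a value of “both,” which might lead you to think that your key column, “order_id,” matches completely across both dataframes.

But does this mean that you have a full match? The answer is no. Because the default join is “inner,” any row whose “order_id” appears in only one dataframe is dropped from the result, so the flag of an inner join can only ever show “both.” To check for a true full match, you would need to run the merge as an outer join (how = 'outer') and look for “left_only” or “right_only” values.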
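A small sketch on toy data shows why an inner join cannot confirm a full match. The dataframes below are hypothetical; order 3 deliberately has no products so that the two joins behave differently.

```python
import pandas as pd

# Hypothetical toy data: order 3 exists only in the left dataframe
orders = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [10, 10, 11]})
order_products = pd.DataFrame({
    'order_id': [1, 1, 2],
    'product_id': [196, 14084, 33120],
})

# Default (inner) join: unmatched rows are dropped, so only "both" can appear
inner = orders.merge(order_products, on='order_id', indicator=True)

# Outer join: unmatched rows are kept and flagged as left_only or right_only
outer = orders.merge(order_products, on='order_id', how='outer', indicator=True)

print(inner['_merge'].value_counts())
print(outer['_merge'].value_counts())
```

In this sketch the inner join reports only “both,” while the outer join also reports one “left_only” row for the order that has no products.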

To recap, the resulting dataframe (after the merge) has 32,434,489 rows, and each of those rows has information found in both input data sets. Keep track of this number! It can help you keep your dataframes straight when working with numerous dataframes.

Also, running checks in your notebooks before and after performing significant procedures will allow you to track the way the shape of your data is changing. This is most important after importing or just before exporting data.
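One way to make these before-and-after checks routine is a small helper that prints a labeled shape. This is only a sketch: `report_shape` is a hypothetical helper name, not part of pandas, and the toy dataframes stand in for your real ones.

```python
import pandas as pd

def report_shape(df, label):
    """Print a labeled row/column count and return the shape (hypothetical helper)."""
    print(f'{label}: {df.shape[0]:,} rows x {df.shape[1]} columns')
    return df.shape

# Hypothetical toy data with a fully matching key
left = pd.DataFrame({'order_id': [1, 2], 'user_id': [10, 11]})
right = pd.DataFrame({'order_id': [1, 1, 2], 'product_id': [196, 14084, 33120]})

report_shape(left, 'left before merge')
report_shape(right, 'right before merge')

merged = left.merge(right, on='order_id')
rows, cols = report_shape(merged, 'merged')

# With a fully matching one-to-many key, the merged row count equals the "many" side
print(rows == len(right))
```

Calling the helper right after an import and right before an export leaves a record of each dataframe’s size in the notebook output.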



# 2. Export the merged file in pickle format as “orders_products_combined.pkl”.

A pickle, or “.pkl,” is a Python serialization format that pandas can use to store data on your computer. While it serves a similar purpose to a “.csv” file, a pickle can only be opened using Python. Importing a pickle into your Jupyter notebook follows the same procedure as importing a “.csv” file and produces the same dataframe.

The biggest difference between importing and exporting “.csv” files and “.pkl” files is efficiency. Your df_merged_large dataframe, for instance, would likely take around two minutes to export as a pickle, while it could take upwards of ten minutes to export as a “.csv” file.

Importing pickle files also follows a similar syntax to its “.csv” counterpart: the only difference comes in the function (read_pickle()) and the lack of an index_col, since pickle-format files include this information already.

In [15]:
# Exporting the merged file in pickle format
df_merged_large.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_combined.pkl'))

# to_pickle() is used. Pay attention to the file extension and ensure it matches the function you use!
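The import side described earlier in this section can be sketched as a round trip. The example below uses a small made-up dataframe and a temporary folder (an assumption made so that the sketch runs anywhere) instead of your project path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy dataframe standing in for df_merged_large
df = pd.DataFrame({'order_id': [1, 1, 2], 'product_id': [196, 14084, 33120]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'orders_products_combined.pkl')

    # Export: the file extension matches the function used
    df.to_pickle(pkl_path)

    # Import: no index_col argument is needed, the index is stored in the file
    df_back = pd.read_pickle(pkl_path)

# The re-imported dataframe matches the original, including index and dtypes
print(df_back.equals(df))
```

In your own notebook, you would pass the same os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl') location to read_pickle() that you used for the export.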
