In [1]:
# Import libraries 
import pandas as pd 
import numpy as np 
import os 

In [2]:
# Create path
path = r'/Users/bentley/Documents/Instacart'

In [4]:
# Import csv file into a dataframe
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), index_col = False)

In [5]:
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [6]:
# Rename a column and overwrite dataframe 
df_ords.rename(columns = {'order_dow' : 'order_day_of_week'}, inplace = True)

In [7]:
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [8]:
# Import csv file into dataframe
df_ords_prior = pd.read_csv(os.path.join(path,'02 Data', 'Prepared Data','orders_products_prior.csv'), index_col = False)

In [9]:
df_ords_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [10]:
# Run a shape check on dataframes 
df_ords.shape

(3421083, 7)

In [11]:
df_ords_prior.shape

(32434489, 4)

Note: Both dataframes are large. To prepare for a merge, I need a shared column to use as a key. The "order_id" column appears in both dataframes, so I'll use that. The default join type is INNER, which means the resulting data set will only contain observations included in <i>both</i> input data sets. I'm executing the following code: 

In [12]:
df_merged_large = df_ords.merge(df_ords_prior, on = 'order_id', indicator = True)

Note: df_merged_large is a new dataframe that combines df_ords and df_ords_prior using the 'order_id' column as its key. It also includes the indicator = True argument, which adds a _merge column I can use to check for a full match.

In [13]:
# Check output 
df_merged_large.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,prior,1,2,8,,196,1,0,both
1,2539329,1,prior,1,2,8,,14084,2,0,both
2,2539329,1,prior,1,2,8,,12427,3,0,both
3,2539329,1,prior,1,2,8,,26088,4,0,both
4,2539329,1,prior,1,2,8,,26405,5,0,both


In [14]:
# Check to see if a full match exists in the merge 
df_merged_large['_merge'].value_counts()

both          32434489
right_only           0
left_only            0
Name: _merge, dtype: int64

Note: value_counts() lets me quickly count the values in the _merge column, which tells me how many rows came from both dataframes, only the left one, or only the right one. At first glance, the output suggests there's a full match in the merge. <br>
<br>
However, I learned this conclusion is wrong. Here's why:<br> 
What pandas does here is fill in information about each product for every “order_id” in the df_ords dataframe, which is why the resulting dataframe has 32,434,489 rows (the same total count as the df_ords_prior dataframe). That row count alone doesn't confirm a full match. There's one particular intricacy when using and interpreting the merge flag, and it has a lot to do with the way I chose to merge the dataframes. In this case, I chose the default option of INNER JOIN. This means that the resulting table will only contain observations found in both dataframes, so any unmatched rows are dropped before I can count them. As such, the merge flag here can only ever show a value of “both.” <br>
<br> 
How do I check whether or not I have a full match? I need the frequency of the merge flag from a merge that uses the argument how = 'outer'. Merging like this keeps all the observations from both dataframes and shows the real merge rate:

In [15]:
# Re-check the merge flag (this still uses the inner-join result; how = 'outer' was not applied) 
df_merged_large['_merge'].value_counts()

both          32434489
right_only           0
left_only            0
Name: _merge, dtype: int64

Note: The output above is still based on the default INNER JOIN, so nothing changed. The example provided in CareerFoundry, which used the how = 'outer' argument, showed counts in both (32434489) and left_only (206209), with right_only at 0, meaning there's not a full match. <br>
<br>
For the Instacart project, I'll only be working with data sets that have a full merge rate, so I won't need to worry about this or apply any changes to the merge I just completed (using how = 'inner'). However, I should always double-check my merge rates using an OUTER JOIN, as well, especially when I'm exploring new data and performing test merges.<br>
<br>
Recap: The resulting dataframe (after the merge) has 32,434,489 rows, and each of those rows has information found in both input data sets. I need to keep track of this number! It can help me keep my dataframes straight when working with numerous dataframes. Also, running checks in my notebooks before and after performing significant procedures will allow me to track the way the shape of my data is changing. This is most important after importing or just before exporting data.  
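
Note: To see how the merge flag behaves under the two join types, here's a minimal sketch on toy data. The two small dataframes (orders, order_products) and their values are made up for illustration and are not the Instacart data. Only the merge arguments mirror what I used above.

```python
import pandas as pd

# Hypothetical stand-ins for df_ords and df_ords_prior
orders = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [10, 10, 11]})
order_products = pd.DataFrame({'order_id': [1, 1, 2],
                               'product_id': [196, 14084, 33120]})

# Inner join (the default): unmatched rows are dropped,
# so the _merge flag can only contain 'both'
inner = orders.merge(order_products, on='order_id', indicator=True)
inner_counts = inner['_merge'].value_counts()

# Outer join: unmatched rows are kept, so the flag shows the real merge rate
outer = orders.merge(order_products, on='order_id', how='outer', indicator=True)
outer_counts = outer['_merge'].value_counts()

# Shape checks before and after the merge
print(orders.shape, order_products.shape, inner.shape, outer.shape)
print(inner_counts)
print(outer_counts)
```

In this toy case, order 3 has no products, so it disappears from the inner join and shows up as left_only in the outer join. That's the same pattern as the CareerFoundry example.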

## Exporting Data in Pickle Format 

The biggest difference when it comes to importing and exporting CSV files and PKL files is efficiency. The df_merged_large dataframe, for instance, would likely take around two minutes to export as a pickle, while it could take upwards of ten minutes to export as a .csv file. <br> 
<br> 
The exporting syntax for both is very similar: <br>
<i>Export data to csv</i> <br>
df_merged_large.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.csv'))<br>
<br> 
<i>Export data to pkl</i><br> 
df_merged_large.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl')) <br> 
<br> 
Importing pickle files also follows a similar syntax to its “.csv” counterpart: the only difference comes in the function (pd.read_pickle()) and the lack of an index_col, since pickle-format files include this information already.
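
Note: Here's a minimal sketch of a pickle round trip, assuming a small made-up dataframe and a temporary folder instead of my project path. The file name example.pkl is hypothetical. It shows that to_pickle() and pd.read_pickle() bring the dataframe back unchanged, index included.

```python
import os
import tempfile

import pandas as pd

# Small made-up dataframe (not the Instacart data)
df = pd.DataFrame({'order_id': [2, 2], 'product_id': [33120, 28985]})

# Write to a temporary folder so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'example.pkl')  # hypothetical file name
    df.to_pickle(pkl_path)             # export
    df_back = pd.read_pickle(pkl_path)  # import, no index_col needed

# Confirm the round trip preserved values, dtypes, and index
same = df_back.equals(df)
print(same, df_back.shape)
```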

In [16]:
# Export data to pkl 
df_merged_large.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl'))

In [17]:
# Export data to csv 
df_ords.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked_2.csv'))