## 4.6(b) Combining & Exporting data

### This script contains the following points:
1. Import the data sets into Jupyter
2. Check the dimensions of the imported dataframes
3. Determine a suitable way to combine the orders_products_combined dataframe with the products data set
4. Confirm the results of the merge using the merge flag
5. Export the newly created dataframe as ords_prods_merge in a suitable format

In [4]:
# Import libraries
import pandas as pd
import numpy as np
import os

### 1. Import the data sets into Jupyter

In [5]:
# Store the main project folder path for reuse in later cells
path = r'/Users/gideon/Desktop/27-06-2020 Instacart Basket analysis'

In [6]:
# Import dataset orders_products_combined.pkl
df_ords_prods_combined = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl'))

In [7]:
# Import dataset products_checked.csv
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'), index_col = False)

### 2. Check the dimensions of the imported dataframes

In [8]:
# Check the output
df_ords_prods_combined.head()

,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,7.0,196,1,0,both
1,2539329,1,1,2,8,7.0,14084,2,0,both
2,2539329,1,1,2,8,7.0,12427,3,0,both
3,2539329,1,1,2,8,7.0,26088,4,0,both
4,2539329,1,1,2,8,7.0,26405,5,0,both


In [9]:
df_ords_prods_combined.shape

(32434489, 10)

In [10]:
# Check the output
df_prods.head()

,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


In [11]:
# Drop the redundant 'Unnamed: 0' column (a leftover index from the CSV file)
df_prods = df_prods.drop(['Unnamed: 0'], axis=1)

In [12]:
df_prods.head()

,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [13]:
df_prods.shape

(49672, 5)
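
The 'Unnamed: 0' column dropped above usually appears when a CSV file was written together with its index and then read back without treating that first column as the index. Whether products_checked.csv was actually produced this way is an assumption. The following minimal sketch uses a small made-up dataframe (not the Instacart data) to show the effect and two ways to avoid it:

```python
import io

import pandas as pd

# Small illustrative dataframe (hypothetical values, not the real products data)
df_example = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

# to_csv writes the index by default, which adds an unnamed first column
csv_with_index = df_example.to_csv()

# Reading with index_col=False keeps that column as data, named 'Unnamed: 0'
reread = pd.read_csv(io.StringIO(csv_with_index), index_col=False)

# Option 1: treat the first column as the index when reading
clean_read = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index in the first place
clean_write = pd.read_csv(io.StringIO(df_example.to_csv(index=False)))

print(list(reread.columns))
print(list(clean_read.columns))
print(list(clean_write.columns))
```

Either option would make the later drop step unnecessary, but dropping the column after import, as done in this notebook, gives the same result.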

### 3. Determine a suitable way to combine the orders_products_combined dataframe with the products data set

In [15]:
# First merge attempt (fails because of an existing '_merge' column)
df_ords_prods_merge = df_prods.merge(df_ords_prods_combined, on = 'product_id', indicator = True)

ValueError: Cannot use name of an existing column for indicator column

#### The problem here is that a column named "_merge" already exists in the df_ords_prods_combined dataframe, so pandas cannot create a new indicator column with the same name. Therefore, I have to drop it from the dataframe before performing the merge.
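
The name clash can be reproduced on a small scale. The sketch below uses two made-up dataframes (hypothetical values and an illustrative indicator name, "_merge_products") to show the error and two possible ways around it: dropping the old flag, which is the approach taken in this notebook, or passing a string to indicator so the new flag gets a different name.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
products = pd.DataFrame({"product_id": [1, 2, 3], "prices": [5.8, 9.3, 4.5]})
orders = pd.DataFrame(
    {
        "order_id": [10, 11, 12],
        "product_id": [1, 1, 2],
        "_merge": ["both", "both", "both"],  # flag left over from an earlier merge
    }
)

# indicator=True tries to create a column called '_merge', which already exists
try:
    products.merge(orders, on="product_id", indicator=True)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Way 1: drop the old flag first, then merge
merged_drop = products.merge(
    orders.drop(columns=["_merge"]), on="product_id", indicator=True
)

# Way 2: keep the old flag and give the new one a different name
merged_rename = products.merge(orders, on="product_id", indicator="_merge_products")

print(merged_drop.columns.tolist())
print(merged_rename.columns.tolist())
```

Dropping the old flag keeps the merged dataframe smaller, which matters with more than 32 million rows; renaming is mainly useful when the earlier flag still carries information worth keeping.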

In [16]:
# Drop the existing '_merge' column so that a new merge flag can be created
df_ords_prods_combined = df_ords_prods_combined.drop(['_merge'], axis=1)

In [17]:
# Check output
df_ords_prods_combined.head()

,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered
0,2539329,1,1,2,8,7.0,196,1,0
1,2539329,1,1,2,8,7.0,14084,2,0
2,2539329,1,1,2,8,7.0,12427,3,0
3,2539329,1,1,2,8,7.0,26088,4,0
4,2539329,1,1,2,8,7.0,26405,5,0


In [18]:
df_ords_prods_combined.shape

(32434489, 9)

In [21]:
# Merge the updated dataframes
df_ords_prods_merge = df_prods.merge(df_ords_prods_combined, on = 'product_id', indicator = True)

In [20]:
# Check the output
df_ords_prods_merge.head()

,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,add_to_cart_order,reordered,_merge
0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,3.0,5,0,both
1,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,20.0,1,1,both
2,1,Chocolate Sandwich Cookies,61,19,5.8,389851,709,2,0,21,6.0,20,0,both
3,1,Chocolate Sandwich Cookies,61,19,5.8,652770,764,1,3,13,7.0,10,0,both
4,1,Chocolate Sandwich Cookies,61,19,5.8,1813452,764,3,4,17,9.0,11,1,both


In [22]:
df_ords_prods_merge.shape

(32404859, 14)

### 4. Confirm the results of the merge using the merge flag

In [23]:
df_ords_prods_merge['_merge'].value_counts()

_merge
both          32404859
left_only            0
right_only           0
Name: count, dtype: int64

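#### After merging, the resulting dataframe has 32,404,859 rows, and each of those rows has information found in both input data sets. This is expected, because merge performs an inner join by default, which keeps only rows whose product_id appears in both dataframes, so the flag can only take the value "both". The result has 29,630 fewer rows than orders_products_combined (32,434,489), which suggests that order rows without a matching product_id in the products data set were dropped.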
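
To see what an inner join leaves out, the merge flag is more informative with an outer join. The following sketch uses small made-up dataframes (hypothetical values, not the Instacart data) and compares the flag counts for both join types:

```python
import pandas as pd

# Hypothetical miniature data: product 3 is never ordered,
# and product 99 is ordered but missing from the products table
products = pd.DataFrame({"product_id": [1, 2, 3], "product_name": ["A", "B", "C"]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13], "product_id": [1, 1, 2, 99]})

# Inner join (the default): only matching keys survive
inner = products.merge(orders, on="product_id", indicator=True)

# Outer join: unmatched rows from either side are kept and flagged
outer = products.merge(orders, on="product_id", how="outer", indicator=True)

inner_counts = inner["_merge"].value_counts().to_dict()
outer_counts = outer["_merge"].value_counts().to_dict()

# Order rows lost by the inner join
rows_dropped = len(orders) - len(inner)

print(inner_counts)
print(outer_counts)
print(rows_dropped)
```

On the full data sets an outer join needs more memory than the inner join used here, so it is best treated as an optional diagnostic rather than a replacement for the merge above.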

### 5. Export the newly created dataframe as ords_prods_merge in a suitable format

In [24]:
# Export data to pkl
df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'ords_prods_merge.pkl'))
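
One reason pickle can be a suitable format here is that it preserves pandas data types, including the categorical merge flag, which a CSV file would not. The sketch below illustrates this with a small made-up dataframe and a temporary folder (the file name example.pkl is hypothetical), and is one possible way to check that an export can be read back unchanged:

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature dataframe with a categorical flag column
df_example = pd.DataFrame(
    {
        "product_id": [1, 2],
        "prices": [5.8, 9.3],
        "_merge": pd.Categorical(
            ["both", "both"], categories=["left_only", "right_only", "both"]
        ),
    }
)

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "example.pkl")
    df_example.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# Values and data types should match the original dataframe
round_trip_ok = restored.equals(df_example)
flag_dtype = str(restored["_merge"].dtype)

print(round_trip_ok, flag_dtype)
```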