# Table of Content
- [Imports](#imports)
- [Resources](#resources)
- [Load Data Sets](#load-data-sets)
- [Merge Data](#merge-data)
- [Export Data](#export-data)


## Imports [#](#table-of-content)

In [1]:
import time
import os
from pathlib import Path

import numpy as np
import pandas as pd

## Resources [#](#table-of-content)

In [2]:
# project folder
project_folder = Path(r"C:\Users\vynde\Desktop\CareerFoundry Data Analytics\Data Immersion - 4 Python Fundamentals for Data Analysts\Instacart_Basket_Analysis")

# resource folders
original_data_folder = project_folder / "02_Data" / "Original_Data"
prepared_data_folder = project_folder / "02_Data" / "Prepared_Data"

# input files
cleaned_products_data_file = prepared_data_folder / "products_cleaned.csv"
orders_products_combined_data_file = prepared_data_folder / "orders_products_combined.pkl"

# output files
orders_products_merged_data_file = prepared_data_folder / "orders_products_merged.pkl"

## Load Data Sets [#](#table-of-content)

Products cleaned

In [3]:
df_prods = pd.read_csv(cleaned_products_data_file).drop(columns=["Unnamed: 0"])
df_prods.head()

| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 |
| 1 | 2 | All-Seasons Salt | 104 | 13 | 9.3 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | 4.5 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | 10.5 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 | 4.3 |
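An `Unnamed: 0` column like the one dropped in the cell above usually appears when a DataFrame index was written to CSV and then read back as a regular column. The following is a minimal sketch, assuming that is what happened with `products_cleaned.csv`; the toy frame and its values are illustrative only. It shows two ways to avoid the extra column: read it as the index, or do not write it at all.

```python
import io

import pandas as pd

# Toy stand-in for the products table (illustrative values only)
df = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

# to_csv writes the index as an unnamed first column by default
csv_with_index = df.to_csv()

# Option 1: treat the first column as the index when reading
df_a = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index in the first place
csv_without_index = df.to_csv(index=False)
df_b = pd.read_csv(io.StringIO(csv_without_index))

print(list(df_a.columns))
print(list(df_b.columns))
```

Either option makes the `.drop(columns=["Unnamed: 0"])` step unnecessary. Option 2 also keeps the file slightly smaller.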


Orders products combined

In [4]:
df_orders_products_combined = pd.read_pickle(orders_products_combined_data_file)

In [5]:
df_orders_products_combined.head()

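| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | 0.0 | 196 | 1 | 0 |
| 1 | 2539329 | 1 | 1 | 2 | 8 | 0.0 | 14084 | 2 | 0 |
| 2 | 2539329 | 1 | 1 | 2 | 8 | 0.0 | 12427 | 3 | 0 |
| 3 | 2539329 | 1 | 1 | 2 | 8 | 0.0 | 26088 | 4 | 0 |
| 4 | 2539329 | 1 | 1 | 2 | 8 | 0.0 | 26405 | 5 | 0 |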


Check shape

In [6]:
df_orders_products_combined.shape

(32434489, 9)

## Merge Data [#](#table-of-content)

In [7]:
df_orders_products_merged = df_orders_products_combined.merge(df_prods, on="product_id")
df_orders_products_merged.head()

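| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | 0.0 | 196 | 1 | 0 | Soda | 77 | 7 | 9.0 |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |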
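`DataFrame.merge` defaults to `how="inner"`, so any order row whose `product_id` has no counterpart in `df_prods` is left out of `df_orders_products_merged`. The sketch below uses two small, made-up frames (not the Instacart data) to compare the row counts of an inner and a left merge; whether dropping or keeping unmatched rows is the better choice depends on the analysis.

```python
import pandas as pd

# Hypothetical toy data: product_id 99 has no entry in the products table
orders = pd.DataFrame({"order_id": [10, 11, 12], "product_id": [1, 2, 99]})
products = pd.DataFrame(
    {"product_id": [1, 2, 3], "product_name": ["Soda", "Tea", "Salt"]}
)

# Default inner merge: keeps only rows with a match on both sides
inner = orders.merge(products, on="product_id")

# Left merge: keeps every order row, filling missing product columns with NaN
left = orders.merge(products, on="product_id", how="left")

print(len(orders), len(inner), len(left))  # 3 2 3
print(left["product_name"].isna().sum())   # 1
```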


Check results

In [8]:
df_orders_products_combined.merge(df_prods, on="product_id", how="outer", indicator=True)["_merge"].value_counts()

both          32404859
left_only        30200
right_only          11
Name: _merge, dtype: int64

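>The merge is incomplete: 30,200 order rows have no matching product (`left_only`) and 11 products never appear in an order (`right_only`). The inner merge in `In [7]` therefore drops those 30,200 order rows.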
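The `_merge` indicator can also show which keys failed to match, not only how many. In addition, the counts above give 32,404,859 + 30,200 = 32,435,059 rows, 570 more than the 32,434,489 rows reported by `shape`. That would be consistent with some `product_id` values occurring more than once in `df_prods`, although this is an inference from the printed numbers and not something the notebook verifies. A minimal sketch on made-up frames, showing how both situations could be inspected:

```python
import pandas as pd

# Hypothetical toy data: product 99 is unknown, product 3 is never ordered,
# and product 2 is listed twice in the products table
orders = pd.DataFrame({"order_id": [10, 11, 12, 13], "product_id": [1, 1, 2, 99]})
products = pd.DataFrame(
    {"product_id": [1, 2, 2, 3], "product_name": ["Soda", "Tea", "Tea", "Salt"]}
)

check = orders.merge(products, on="product_id", how="outer", indicator=True)

# Keys that exist on one side only
unmatched_orders = check.loc[check["_merge"] == "left_only", "product_id"].unique()
unordered_products = check.loc[check["_merge"] == "right_only", "product_id"].unique()

# Keys that occur more than once in the lookup table
duplicate_keys = products.loc[
    products["product_id"].duplicated(keep=False), "product_id"
].unique()

# validate="many_to_one" raises MergeError if the right-hand keys are not unique
try:
    orders.merge(products, on="product_id", validate="many_to_one")
    keys_are_unique = True
except pd.errors.MergeError:
    keys_are_unique = False

print(unmatched_orders, unordered_products, duplicate_keys, keys_are_unique)
```

Passing `validate="many_to_one"` to the real merge would turn a silent row multiplication into an explicit error, at the cost of an extra uniqueness check on the key column.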

## Export Data [#](#table-of-content)

Benchmark of write time, read time, and file size per file format

In [9]:
# file formats to compare (parquet and feather require pyarrow: pip install pyarrow)
file_types = ["csv", "pickle", "parquet", "feather"]

for file_type in file_types:
    # construct the file name, e.g. orders_products_merged.csv
    file = orders_products_merged_data_file.with_suffix("." + file_type)

    # time the export method (to_csv, to_pickle, ...) for each file type
    start_time = time.perf_counter()
    getattr(df_orders_products_merged, "to_" + file_type)(file)
    write_time = time.perf_counter() - start_time

    # time the matching read function (pd.read_csv, pd.read_pickle, ...)
    start_time = time.perf_counter()
    getattr(pd, "read_" + file_type)(file)
    read_time = time.perf_counter() - start_time

    # get file size in bytes
    fsize = os.path.getsize(file)

    # summary (size converted to MB)
    print(f"{file_type:10} - Write Time: {write_time:6.2f} s | Read Time: {read_time:6.2f} s | Size: {fsize/1024/1024:4.0f} MB")

csv        - Write Time:  76.61 s | Read Time:  21.25 s | Size: 2528 MB
pickle     - Write Time:   4.42 s | Read Time:   1.97 s | Size: 3357 MB
parquet    - Write Time:   8.34 s | Read Time:   2.82 s | Size:  461 MB
feather    - Write Time:   2.57 s | Read Time:   2.27 s | Size:  928 MB


>Conclusion

>Best write time: feather (2.57 s), a good all-rounder<br>
>Best read time: pickle (1.97 s), suited to data that is loaded often when file size does not matter<br>
>Best file size: parquet (461 MB), suited to cases where file size matters<br>

>My choice: **pickle**
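The benchmark loop can also be written with an explicit mapping from format name to writer and reader, which avoids building method names with `getattr`. The sketch below is one possible variant, not the notebook's method: `benchmark_format`, `formats`, and `df_small` are hypothetical names, the data is a tiny made-up frame, and only `csv` and `pickle` are included so that it runs without `pyarrow`. Single-run timings like these are noisy, so they should be read as rough indications only.

```python
import tempfile
import time
from pathlib import Path

import pandas as pd


def benchmark_format(df, path, writer, reader):
    """Return (write_seconds, read_seconds, size_bytes) for one file format."""
    start = time.perf_counter()
    writer(df, path)
    write_time = time.perf_counter() - start

    start = time.perf_counter()
    reader(path)
    read_time = time.perf_counter() - start

    return write_time, read_time, path.stat().st_size


# Hypothetical mapping: format name -> (writer, reader)
formats = {
    "csv": (lambda df, p: df.to_csv(p, index=False), pd.read_csv),
    "pickle": (lambda df, p: df.to_pickle(p), pd.read_pickle),
}

# Small made-up frame standing in for the merged data
df_small = pd.DataFrame({"product_id": range(1000), "prices": [1.5] * 1000})

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, (writer, reader) in formats.items():
        path = Path(tmp) / f"sample.{name}"
        results[name] = benchmark_format(df_small, path, writer, reader)

for name, (w, r, size) in results.items():
    print(f"{name:10} - write {w:.4f} s | read {r:.4f} s | {size / 1024:.1f} KiB")
```

Parquet and Feather could be added to `formats` in the same way once `pyarrow` is installed.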

In [10]:
# save to pickle
df_orders_products_merged.to_pickle(orders_products_merged_data_file)
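After an export it can be worth confirming that the file reads back unchanged. A minimal sketch of such a round-trip check, using a small made-up frame and a temporary directory in place of the real `orders_products_merged.pkl`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the merged data (illustrative values only)
df = pd.DataFrame(
    {
        "order_id": [2539329, 2398795],
        "product_name": ["Soda", "Soda"],
        "prices": [9.0, 9.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.pkl"
    df.to_pickle(path)
    df_loaded = pd.read_pickle(path)

# equals() compares values, dtypes, and index in one call
round_trip_ok = df.equals(df_loaded)
print(round_trip_ok)
```

On the full data set the same check would load all rows a second time, so comparing only `shape` and `dtypes` may be a cheaper alternative.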