# Dealing with MemoryErrors

If you're looking at this script, you've probably run into a MemoryError in Jupyter 😥. Not to worry - we've got you sorted! A MemoryError means your machine can't allocate enough RAM to perform some hefty task. These errors are common with pandas because it loads entire data sets into memory, so large merges can quickly exhaust the available RAM. In this script you'll find a few ways to work around the problem.



⚠️ Before you run any of this code, please check the following:
- We often run into MemoryErrors because many notebooks are open in Jupyter. Every open notebook "steals" RAM!
- Go to the Running tab on the Jupyter homepage and check how many notebooks are open - they're shown in green.
- If you see green notebooks that you don't need, close them. To do this correctly, go to the notebook in question, open the File menu and choose Close and Halt.
- Once you've closed the unnecessary notebooks, restart the kernel in the notebook you're working in. NB: save your work first, e.g. export any data you've manipulated, because restarting the kernel wipes everything stored in memory.
- If that doesn't work, try restarting Python and your browser.
- In general, avoid keeping many browser tabs or other programs open on your machine (for example, large Excel files).

## Contents 

##### <span style="color:blue"> Plan A </span>
In this scenario we'll try to alter the variable types in order to reduce the memory used.

##### <span style="color:blue"> Plan B </span>
If Plan A alone doesn't suffice, combine it with Plan B. Here we'll merge the data sets chunkwise.

##### <span style="color:blue"> Plan C  </span>
If neither Plan A nor Plan B works, we'll resort to a completely different strategy that involves installing a new library called Terality. You can still use the modified data from Plan A though.

### <span style="color:blue"> Plan A </span>

In the following code we've experimented and come up with the data types that save the most memory. Don't worry if not all of the columns are familiar to you - some of them are only derived in 4.9. If you're in 4.6 you wouldn't have derived them yet. Feel free to comment out the lines not relevant to you (Ctrl+/).
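
If you'd rather not pick every type by hand, pandas can choose the smallest fitting type for you. The following is only a minimal sketch on made-up data: the helper name `downcast_numeric` and the `demo` dataframe are our own illustrations, not part of the course material. Note that `pd.to_numeric` never goes below float32, whereas the cells below use float16 for some columns.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every numeric column uses the smallest dtype that fits."""
    out = df.copy()
    # Integer columns: int64 -> int8 / int16 / int32, depending on the values
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Float columns: float64 -> float32 (to_numeric does not go lower than that)
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical mini data set that mimics a few columns of the orders data
demo = pd.DataFrame({
    "order_id": [1, 2, 3_000_000],
    "orders_day_of_week": [0, 3, 6],
    "days_since_prior_order": [None, 7.0, 30.0],
})

small = downcast_numeric(demo)
print(small.dtypes)
```

Because the helper looks at the actual values, it can't pick a type that is too small for the data it sees. If later data contains larger values, you'd need to run it again.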

In [26]:
import os
import numpy as np
import pandas as pd

In [27]:
# Set path
path = r'C:\Users\asus\Documents\Instacart Basket Analysis'

In [28]:
# Import the existing data - NB: these are just example data sets for 4.6, so use the ones you need to merge

ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked.csv'), index_col = None)
ords_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'order_products_prior.csv'), index_col = None)

In [29]:
ords.columns

Index(['Unnamed: 0', 'order_id', 'user_id', 'eval_set', 'order_number',
       'order_dow', 'order_hour_of_day', 'days_since_prior_order'],
      dtype='object')

In [30]:
ords.drop(columns = ['eval_set', 'Unnamed: 0'], inplace = True)

In [31]:
ords.columns

Index(['order_id', 'user_id', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order'],
      dtype='object')

In [32]:
ords.rename(columns = {'order_dow':'orders_day_of_week', 'order_hour_of_day':'order_time_of_day'}, inplace = True)

In [34]:
ords.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_time_of_day', 'days_since_prior_order'],
      dtype='object')

In [33]:
ords_prods.columns

Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered'], dtype='object')

In [12]:
ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   order_number            int64  
 3   orders_day_of_week      int64  
 4   order_time_of_day       int64  
 5   days_since_prior_order  float64
dtypes: float64(1), int64(5)
memory usage: 156.6 MB


In [36]:
# Change types for ords data set

ords['order_id']=ords['order_id'].astype('int32')
ords['user_id'] = ords['user_id'].astype('int32')
ords['order_number']=ords['order_number'].astype('int8')
ords['orders_day_of_week']=ords['orders_day_of_week'].astype('int8')
ords['order_time_of_day']=ords['order_time_of_day'].astype('int8')
ords['days_since_prior_order']=ords['days_since_prior_order'].astype('float16')

In [37]:
# Check output to see memory usage

ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int32  
 1   user_id                 int32  
 2   order_number            int8   
 3   orders_day_of_week      int8   
 4   order_time_of_day       int8   
 5   days_since_prior_order  float16
dtypes: float16(1), int32(2), int8(3)
memory usage: 42.4 MB
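
The two `memory usage` figures can be checked by hand: each value takes a fixed number of bytes depending on its type, so the total is roughly rows × bytes per row. The sketch below redoes that arithmetic for the `ords` data set. The function name `estimate_mb` is just an illustration, and the small overhead of the index is ignored.

```python
# Bytes needed to store a single value of each type
BYTES_PER_VALUE = {"int8": 1, "int32": 4, "int64": 8, "float16": 2, "float64": 8}


def estimate_mb(n_rows, dtypes):
    """Estimate memory in MB (1 MB = 1024 * 1024 bytes, the unit info() reports in)."""
    bytes_per_row = sum(BYTES_PER_VALUE[d] for d in dtypes)
    return n_rows * bytes_per_row / 1024**2


n_rows = 3_421_083  # number of rows reported by ords.info()

before = estimate_mb(n_rows, ["int64"] * 5 + ["float64"])
after = estimate_mb(n_rows, ["int32", "int32", "int8", "int8", "int8", "float16"])

print(f"before: {before:.1f} MB, after: {after:.1f} MB")
```

Each row shrinks from 48 bytes to 13 bytes, which is where the drop from 156.6 MB to 42.4 MB comes from.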


In [38]:
ords_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Dtype
---  ------             -----
 0   order_id           int64
 1   product_id         int64
 2   add_to_cart_order  int64
 3   reordered          int64
dtypes: int64(4)
memory usage: 989.8 MB


In [39]:
# Change types for ords prods data set 

ords_prods['product_id'] =ords_prods['product_id'].astype('int32')
ords_prods['reordered']=ords_prods['reordered'].astype('int8')
ords_prods['add_to_cart_order']=ords_prods['add_to_cart_order'].astype('int32')
ords_prods['order_id']=ords_prods['order_id'].astype('int32')

In [40]:
ords_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Dtype
---  ------             -----
 0   order_id           int32
 1   product_id         int32
 2   add_to_cart_order  int32
 3   reordered          int8 
dtypes: int32(3), int8(1)
memory usage: 402.1 MB


In [None]:
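# Further columns from other data sets

df['aisle_id'] = df['aisle_id'].astype('int16') # int8 only holds -128 to 127, and aisle IDs go above 127
df['department_id'] = df['department_id'].astype('int8')
df['prices'] = df['prices'].astype('float16') # float16 tops out at 65504 and keeps about 3 significant digits; use float32 if that's too coarse
df['age'] = df['age'].astype('int8')
df['n_dependants'] = df['n_dependants'].astype('int8')
df['income'] = df['income'].astype('int32') # incomes are far larger than 127, so int8 would corrupt them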

NB: the code above doesn't list every column. You may have derived additional columns - mainly flags - that aren't reflected above. Keep the following in mind:
- Flags hold text labels, so they stay as objects (if they're still too heavy, the `category` type stores repeated labels more compactly).
- Avoid the int64 data type because it consumes a lot of memory. Try int32, int16 or int8, but make sure the column's largest value fits (int8 only goes up to 127).
- After every change, run `info()` to see how much the memory usage has decreased (shown at the bottom), and also `head()`, because some of these changes may affect the data!
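
To avoid damaging the data, it helps to check a column's smallest and largest value against the range of the target type before converting. Here's a minimal sketch on made-up values: the helper `fits`, the `demo` dataframe and the flag column `loyalty_flag` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def fits(series: pd.Series, dtype: str) -> bool:
    """True if every value of the series lies inside the range of the integer dtype."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)


# Hypothetical sample values
demo = pd.DataFrame({
    "aisle_id": [1, 61, 134],
    "income": [26_000, 98_500, 165_000],
    "loyalty_flag": ["New customer", "Loyal customer", "New customer"],
})

for col, target in [("aisle_id", "int8"), ("aisle_id", "int16"),
                    ("income", "int8"), ("income", "int32")]:
    print(col, target, fits(demo[col], target))

# Flags with few distinct labels can be stored as category instead of object
demo["loyalty_flag"] = demo["loyalty_flag"].astype("category")
print(demo.dtypes)
```

Only convert when the check returns `True`. On a tiny frame like this one the `category` type won't save anything; the benefit shows up when a handful of labels repeat over millions of rows.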

In [20]:
# Attempt a merge

ords_prods_merged = ords.merge(ords_prods, on = ['order_id'], how = 'outer', indicator = True)

In [21]:
ords_prods_merged['_merge'].value_counts(dropna = False)

both          32434489
left_only       206209
right_only           0
Name: _merge, dtype: int64
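
To read the `_merge` counts: `both` means the `order_id` was found in both data sets, while `left_only` marks orders from `ords` that have no matching rows in `ords_prods`. The toy example below, with made-up values, shows the same pattern on a scale you can check by eye.

```python
import pandas as pd

# Made-up miniature versions of the two data sets
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 11, 12]})
order_products = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 102]})

# Same kind of merge as above: outer join on order_id with a merge flag
merged = orders.merge(order_products, on="order_id", how="outer", indicator=True)
counts = merged["_merge"].value_counts(dropna=False)

print(merged)
print(counts)
```

Order 3 has no products, so it appears once with the flag `left_only`, while orders 1 and 2 appear once per product with the flag `both`.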

### <span style="color:blue"> Plan B </span>

In this plan we'll merge the 2 dataframes (with the altered data types) using a chunkwise method. This means we'll read the large dataframe in chunks and merge each chunk with the smaller dataframe, piece by piece.
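
Before running this on the real files, the idea can be tried on toy data. The sketch below is a simplified illustration with made-up values: it keeps the merged chunks in a list instead of appending them to a CSV file, and it reads the "large" data set from a string rather than from disk.

```python
import io

import pandas as pd

# The small data set stays in memory the whole time
small = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 11, 12]})

# Stand-in for the large CSV file on disk
large_csv_text = "order_id,product_id\n1,100\n1,101\n2,102\n2,103\n2,104\n"

# Read the large data set 2 rows at a time and merge each chunk separately
pieces = []
for chunk in pd.read_csv(io.StringIO(large_csv_text), chunksize=2):
    pieces.append(small.merge(chunk, on="order_id"))
chunked = pd.concat(pieces, ignore_index=True)

# For comparison: the same merge done in one go
full = small.merge(pd.read_csv(io.StringIO(large_csv_text)), on="order_id")

print(chunked)
```

Both approaches give the same rows. Keep in mind that each chunk is merged with an inner join (the default of `merge`), so orders without any matching rows, like order 3 here, don't appear in the result.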

In [None]:
# df1 = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked.csv'))
# df2 = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'order_products_prior.csv'), index_col = None)

In [22]:
%%time

# Create an empty shell (header row only) to save the result
df_result = pd.DataFrame(columns=(ords.columns.append(ords_prods.columns)).unique())
df_result.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'df3.csv'), index=False)

# Delete ords_prods (the large one) to save memory
del ords_prods

# The basic idea is that we'll now load the large data set in chunks and use a function
# to merge each chunk with the smaller data set, appending the result to the CSV file


# Define a function that merges one chunk and appends it to the result file
def preprocess(x):
    merged_chunk = pd.merge(ords, x, on='order_id')
    merged_chunk.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'df3.csv'), mode='a', header=False, index=False)


reader = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'order_products_prior.csv'), chunksize=10000) # the right chunksize depends on how many columns you have

# A plain loop avoids printing a long list of None values
for r in reader:
    preprocess(r)

Wall time: 42min 24s

In [16]:
# Import the result to see what happened 

df_test = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'df3.csv'))

In [17]:
df_test.shape

(32434489, 11)

In [18]:
df_test.head()

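| | Unnamed: 0 | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6226 | 40 | 382 | prior | 29 | 1 | 15 | 23.0 | 10070 | 1 | 1 |
| 1 | 6226 | 40 | 382 | prior | 29 | 1 | 15 | 23.0 | 42450 | 2 | 1 |
| 2 | 6226 | 40 | 382 | prior | 29 | 1 | 15 | 23.0 | 33198 | 3 | 1 |
| 3 | 6226 | 40 | 382 | prior | 29 | 1 | 15 | 23.0 | 34866 | 4 | 1 |
| 4 | 8161 | 214 | 503 | prior | 5 | 0 | 16 | 9.0 | 5499 | 1 | 0 |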


In [22]:
df_test.columns

Index(['Unnamed: 0', 'order_id', 'user_id', 'eval_set', 'order_number',
       'order_dow', 'order_hour_of_day', 'days_since_prior_order',
       'product_id', 'add_to_cart_order', 'reordered'],
      dtype='object')

In [23]:
df1.columns

Index(['Unnamed: 0', 'order_id', 'user_id', 'eval_set', 'order_number',
       'order_dow', 'order_hour_of_day', 'days_since_prior_order'],
      dtype='object')
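
Keep in mind that a CSV file doesn't store data types. When you re-import a result such as `df3.csv`, pandas works the types out again, so whole-number columns come back as int64. One way around this is to pass the types directly to `read_csv`. The sketch below uses a small made-up CSV string in place of a real file.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV file on disk
csv_text = (
    "order_id,product_id,add_to_cart_order,reordered\n"
    "2,33120,1,1\n"
    "2,28985,2,1\n"
)

# Without dtype, every whole-number column is read as int64
default_df = pd.read_csv(io.StringIO(csv_text))

# With dtype, the columns get the compact types straight away
dtypes = {
    "order_id": "int32",
    "product_id": "int32",
    "add_to_cart_order": "int32",
    "reordered": "int8",
}
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(default_df.dtypes)
print(compact_df.dtypes)
```

The `dtype` argument also works together with `chunksize`, so each chunk arrives in the compact types as well.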

#### Try with the 4.9 data set 

In [33]:
orders_products_merged = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.csv'))

In [34]:
cust = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'cust_data_final.csv'))

In [35]:
orders_products_merged.columns

Index(['order_id', 'user_id', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order', 'product_id', 'add_to_cart_order',
       'reordered', 'product_name', 'aisle_id', 'department_id', '_merge'],
      dtype='object')

In [36]:
cust.columns

Index(['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')

In [37]:
%%time

# Create an empty bucket (header row only) to save the result
df_result = pd.DataFrame(columns=(cust.columns.append(orders_products_merged.columns)).unique())
df_result.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'df_result_3.csv'), index=False)

# Delete orders_products_merged (the large one) to save memory
del orders_products_merged

# Merge one chunk with the customer data and append it to the result file
def preprocess(x):
    merged_chunk = pd.merge(cust, x, on='user_id')
    merged_chunk.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'df_result_3.csv'), mode='a', header=False, index=False)

reader = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.csv'), chunksize=10000) # the right chunksize depends on how many columns you have

# A plain loop avoids printing a long list of None values
for r in reader:
    preprocess(r)

Wall time: 7min 23s

In [38]:
# Check the result

df_test = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'df_result_3.csv'))

In [39]:
df_test.columns

Index(['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income', 'order_id',
       'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order', 'product_id', 'add_to_cart_order',
       'reordered', 'product_name', 'aisle_id', 'department_id', '_merge'],
      dtype='object')

In [40]:
df_test.shape

(32434489, 22)

### <span style="color:blue"> Plan C </span>

#### Setup

First, click [here](https://app.terality.com/) and follow the instructions. Here's a recap:

- Create an account
- Install terality by typing `pip install --upgrade terality` in the command prompt/terminal/Jupyter
- Get and save your API key somewhere
- Configure the terality library in the command prompt/terminal (or in Jupyter as per the instructions in the webpage)


Terality allows 1000GB data usage per month. This is actually quite a lot, and the good news is that once you upload a data set onto their server, it stays there - meaning it doesn't get reuploaded when you run the same script again, so it doesn't take up more of your data plan.


_*This script builds onto the tutorial script from Terality's website. The first part has been added as an example of how you can use it if you're facing difficulties with the size of the data sets in the Instacart project._

## First steps - data upload

Terality exposes dataframes and other data structures with exactly the same API as pandas. No need to learn a new framework, just import the package and start processing data exactly as in pandas!

In [23]:
import terality as te # the new library

You are using version 0.14.3 of the Terality client, but version 0.14.14 is available. Consider upgrading your version to get the latest fixes and features.


In [24]:
pip install --upgrade terality

Collecting terality
  Downloading terality-0.14.14-py3-none-any.whl (152 kB)
     -------------------------------------- 152.5/152.5 KB 4.6 MB/s eta 0:00:00


You should consider upgrading via the 'C:\Users\asus\Anaconda3\python.exe -m pip install --upgrade pip' command.



Installing collected packages: terality
  Attempting uninstall: terality
    Found existing installation: terality 0.14.3
    Uninstalling terality-0.14.3:
      Successfully uninstalled terality-0.14.3
Successfully installed terality-0.14.14


 An easy way to get started and create a `terality.DataFrame` is by importing a `pandas.DataFrame` using the function `from_pandas`:

In [41]:
df_te_ords_p = te.DataFrame.from_pandas(ords_prods)
df_te_ords = te.DataFrame.from_pandas(ords)




TeralityClientError: Terality can not transfer the data structure: unsupported data type: halffloat
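
The `halffloat` in the error message is another name for the 16-bit float type, i.e. the `float16` we assigned to `days_since_prior_order` in Plan A. A possible fix, sketched below on made-up data, is to convert any float16 columns to float32 before uploading. The `demo` dataframe is hypothetical, and we haven't verified which other types Terality accepts.

```python
import numpy as np
import pandas as pd

# Made-up dataframe with one float16 column, like ords after Plan A
demo = pd.DataFrame({
    "order_id": np.array([1, 2, 3], dtype="int32"),
    "days_since_prior_order": np.array([np.nan, 7.0, 30.0], dtype="float16"),
})

# Find every float16 column and convert it to float32
half_cols = demo.select_dtypes(include=["float16"]).columns
demo = demo.astype({col: "float32" for col in half_cols})

print(demo.dtypes)
```

The other columns keep their compact types, so most of the memory savings from Plan A are preserved.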

In [42]:
df_te_ords.info() 

# Check out the memory usage at the bottom - this data is held on Terality's servers rather than in your machine's RAM, which is one of the greatest features of Terality.

<class 'terality.DataFrame'>
Index: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype
---  ------                  -----
 0   order_id                int64
 1   user_id                 int64
 2   order_number            int64
 3   orders_day_of_week      int64
 4   order_time_of_day       int64
 5   days_since_prior_order  float64
dtypes: int64(5), float64(1)
memory usage: 182.7 MB (run with deep=True)




## Merging the data

In [43]:
%%time 
# this command shows how long the merge takes. If you want to use it, always make sure that
# it's the very first line of code in the cell

# Merge the 2 data sets

df_2 = df_te_ords_p.merge(df_te_ords, on=["order_id"], how="left", indicator = True)

Wall time: 30.9 s


In [44]:
df_2.head()

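| | order_id | product_id | add_to_cart_order | reordered | user_id | order_number | orders_day_of_week | order_time_of_day | days_since_prior_order | _merge |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 | 202279 | 3 | 5 | 9 | 8.0 | both |
| 1 | 2 | 28985 | 2 | 1 | 202279 | 3 | 5 | 9 | 8.0 | both |
| 2 | 2 | 9327 | 3 | 0 | 202279 | 3 | 5 | 9 | 8.0 | both |
| 3 | 2 | 45918 | 4 | 1 | 202279 | 3 | 5 | 9 | 8.0 | both |
| 4 | 2 | 30035 | 5 | 0 | 202279 | 3 | 5 | 9 | 8.0 | both |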


In [45]:
df_2.shape 

# The result for the rows shows that we have a successful merge. 

(32434489, 10)

In [46]:
df_2['_merge'].value_counts(dropna = False)

both          32434489
right_only           0
left_only            0
Name: _merge, dtype: int64

In [47]:
# Let's try to convert the terality dataframe back to a pandas dataframe using the special function to_pandas

df_pd_roundtrip = df_2.to_pandas()

ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

😱😱😱

##### Now that we've seen we can't directly convert the Terality dataframe to a pandas dataframe, we need to resort to a different method. The Terality documentation offers one solution we can try:
- Save the merged dataframe in partitions, as multiple Parquet files (Parquet is a compressed, column-based file format)
- Import the Parquet files into Jupyter
- Convert these smaller pieces to pandas dataframes, as this requires less memory
- Concatenate the dataframes


####  😈 <span style="color:red"> The upcoming sections contain the splitting into multiple files, which requires writing very similar names multiple times. This is very tedious. In order to save you time we have created an Excel Helper file that you can download from Exercise 4.6. You can then simply copy them and paste them in the cells below 😈</span>

<span style="color:dark green">  In the code below you'll see that the argument `num_rows_per_file` is set to 4,000,000 rows. This is because the merge in 4.6 doesn't result in such an enormous dataframe, as it has only 10 columns (including the merge flag). However, if you're using this script for the merge in 4.9, you'll probably need to reduce the number of rows, as there will be many more columns (20+) after all the aggregations. You can start with 1,000,000 rows, but if you keep getting an error, try fewer, for example 500,000 rows.  </span>
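
The value of `num_rows_per_file` also determines how many files you'll have to load afterwards. Assuming every file is filled up to the limit (which matches the 9 files used below, though we haven't verified that Terality always splits this way), the count is simply the number of rows divided by the rows per file, rounded up. The function name `n_parquet_files` is our own.

```python
import math


def n_parquet_files(total_rows: int, num_rows_per_file: int) -> int:
    """Number of files needed if each one holds at most num_rows_per_file rows."""
    return math.ceil(total_rows / num_rows_per_file)


total_rows = 32_434_489  # number of rows in the merged dataframe (see df_2.shape)

for rows_per_file in (4_000_000, 1_000_000, 500_000):
    print(rows_per_file, "rows per file ->", n_parquet_files(total_rows, rows_per_file), "files")
```

With fewer rows per file you'll need to adjust the number of `read_parquet` and `to_pandas` lines in the following cells accordingly.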

In [56]:
%%time

# Save the terality dataframe to multiple parquet files
df_2.to_parquet_folder(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_*.parquet",
                      num_rows_per_file=4000000)

Wall time: 1min


In [57]:
%%time

# Load these files into the memory

te_1 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_1.parquet")
te_2 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_2.parquet")
te_3 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_3.parquet")
te_4 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_4.parquet")
te_5 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_5.parquet")
te_6 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_6.parquet")
te_7 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_7.parquet")
te_8 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_8.parquet")
te_9 = te.read_parquet(path = "C:/Users/asus/Documents/Instacart Basket Analysis/order_prior_merged_9.parquet")




Wall time: 1min 28s


In [58]:
%%time

# Convert the parquets to pandas dataframes

df_pd_roundtrip_1 = te_1.to_pandas()
df_pd_roundtrip_2 = te_2.to_pandas()
df_pd_roundtrip_3 = te_3.to_pandas()
df_pd_roundtrip_4 = te_4.to_pandas()
df_pd_roundtrip_5 = te_5.to_pandas()
df_pd_roundtrip_6 = te_6.to_pandas()
df_pd_roundtrip_7 = te_7.to_pandas()
df_pd_roundtrip_8 = te_8.to_pandas()
df_pd_roundtrip_9 = te_9.to_pandas()










Wall time: 1min 47s


In [59]:
%%time

# Concatenate the dataframes

frames = [
    df_pd_roundtrip_1,
    df_pd_roundtrip_2,
    df_pd_roundtrip_3,
    df_pd_roundtrip_4,
    df_pd_roundtrip_5,
    df_pd_roundtrip_6,
    df_pd_roundtrip_7,
    df_pd_roundtrip_8,
    df_pd_roundtrip_9,
]

df_concat = pd.concat(frames)

Wall time: 381 ms


In [60]:
#Check the shape

df_concat.shape

(11101161, 10)
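
A useful sanity check at this point is to compare the row count of the concatenated dataframe with the row count before the split (`df_2.shape`). If the numbers differ, some pieces were probably not loaded or converted. The sketch below shows the check on made-up data, with plain pandas slices standing in for the Parquet pieces.

```python
import pandas as pd

# Made-up "full" dataframe standing in for the merged data
full = pd.DataFrame({"order_id": range(10), "product_id": range(100, 110)})

# Split it into pieces of at most 4 rows, similar to num_rows_per_file
rows_per_piece = 4
pieces = [full.iloc[i:i + rows_per_piece] for i in range(0, len(full), rows_per_piece)]

# Put the pieces back together and compare the row counts
rebuilt = pd.concat(pieces, ignore_index=True)
rows_match = len(rebuilt) == sum(len(p) for p in pieces) == len(full)

print(len(pieces), "pieces,", len(rebuilt), "rows, match:", rows_match)
```

Here `ignore_index=True` gives the result a fresh index from 0 upwards, which is optional but makes the check easier to read.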

In [61]:
# Check the head

df_concat.head()

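| | order_id | product_id | add_to_cart_order | reordered | user_id | order_number | orders_day_of_week | order_time_of_day | days_since_prior_order | _merge |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 | 202279 | 3 | 5 | 9 | 8.0 | both |
| 1 | 2 | 28985 | 2 | 1 | 202279 | 3 | 5 | 9 | 8.0 | both |
| 2 | 2 | 9327 | 3 | 0 | 202279 | 3 | 5 | 9 | 8.0 | both |
| 3 | 2 | 45918 | 4 | 1 | 202279 | 3 | 5 | 9 | 8.0 | both |
| 4 | 2 | 30035 | 5 | 0 | 202279 | 3 | 5 | 9 | 8.0 | both |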


In [62]:
# Check out info

df_concat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11101161 entries, 0 to 32434488
Data columns (total 10 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   order_id                int64   
 1   product_id              int32   
 2   add_to_cart_order       int32   
 3   reordered               int8    
 4   user_id                 int64   
 5   order_number            int64   
 6   orders_day_of_week      int64   
 7   order_time_of_day       int64   
 8   days_since_prior_order  float64 
 9   _merge                  category
dtypes: category(1), float64(1), int32(2), int64(5), int8(1)
memory usage: 698.7 MB


## Success! 🥳 