# How to reduce dataset size and speed up processing

In [1]:
# Import dependencies
import pandas as pd
import numpy as np

# Generate a fake dataset with `size` rows
def get_dataset(size):
    # Build each column from random draws
    df = pd.DataFrame()
    df['size'] = np.random.choice(['big','medium','small'], size)
    df['age'] = np.random.randint(1, 50, size)
    df['team'] = np.random.choice(['red','blue','yellow','green'], size)
    df['win'] = np.random.choice(['yes','no'], size)
    dates = pd.date_range('2020-01-01', '2022-12-31')
    df['date'] = np.random.choice(dates, size)
    df['prob'] = np.random.uniform(0, 1, size)
    return df

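def set_dtypes(df):
    # Low-cardinality strings take far less space as categoricals
    df['size'] = df['size'].astype('category')
    df['team'] = df['team'].astype('category')
    # Ages fall in 1-49, so 16 bits is more than enough
    df['age'] = df['age'].astype('int16')
    # Store yes/no as booleans
    df['win'] = df['win'].map({'yes': True, 'no': False})
    # Trade some precision for half the storage
    df['prob'] = df['prob'].astype('float32')
    return df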
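
To see what `set_dtypes` buys before anything is written to disk, the in-memory footprint can be compared before and after the conversion. The following is a minimal sketch, not part of the original notebook: it rebuilds a smaller dataset (100,000 rows, without the `date` column) with a seeded generator, and the names `raw`, `typed`, `raw_bytes`, and `typed_bytes` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # assumption: far fewer rows than the 5,000,000 used in the benchmarks

# Columns mirror get_dataset (minus 'date'), before any dtype changes
raw = pd.DataFrame({
    'size': rng.choice(['big', 'medium', 'small'], n),
    'age': rng.integers(1, 50, n),
    'team': rng.choice(['red', 'blue', 'yellow', 'green'], n),
    'win': rng.choice(['yes', 'no'], n),
    'prob': rng.uniform(0, 1, n),
})

# Same conversions as set_dtypes
typed = raw.copy()
typed['size'] = typed['size'].astype('category')
typed['team'] = typed['team'].astype('category')
typed['age'] = typed['age'].astype('int16')
typed['win'] = typed['win'].map({'yes': True, 'no': False})
typed['prob'] = typed['prob'].astype('float32')

# deep=True also counts the memory held by the string values themselves
raw_bytes = raw.memory_usage(deep=True).sum()
typed_bytes = typed.memory_usage(deep=True).sum()
print(f'raw:   {raw_bytes / 1e6:.1f} MB')
print(f'typed: {typed_bytes / 1e6:.1f} MB')
```

The exact figures depend on the pandas version and platform, so none are quoted here; the typed frame should be considerably smaller.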

## 1. CSV

In [2]:
print('Reading and writing CSV')
df = get_dataset(5_000_000)
df = set_dtypes(df)
%time df.to_csv('test.csv', index=False)
%time df_csv = pd.read_csv('test.csv')

Reading and writing CSV
CPU times: total: 12.7 s
Wall time: 16.4 s
CPU times: total: 1.98 s
Wall time: 3.23 s


In [3]:
# Check the CSV file size
%ls test.csv

 Volume in drive E is New Volume
 Volume Serial Number is B8F9-3D09

 Directory of e:\DS-work\data_processes

11/08/2024  23:32       210,559,519 test.csv
               1 File(s)    210,559,519 bytes
               0 Dir(s)  11,781,398,528 bytes free
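
Size and speed are not the only cost of CSV: it is plain text, so the dtypes applied by `set_dtypes` are not stored in the file and pandas has to infer them again on read. The sketch below is an illustration on a tiny hand-made frame (not the benchmark data) of how the types might be restored with the `dtype` and `parse_dates` arguments of `pd.read_csv`; it uses an in-memory buffer rather than `test.csv`.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'size': pd.Categorical(['big', 'small', 'medium', 'big']),
    'age': pd.Series([21, 34, 8, 45], dtype='int16'),
    'win': [True, False, True, True],
    'date': pd.to_datetime(['2020-01-01', '2021-06-15', '2022-12-31', '2020-03-09']),
    'prob': pd.Series([0.1, 0.5, 0.9, 0.3], dtype='float32'),
})

# With no path argument, to_csv returns the CSV text as a string
csv_text = df.to_csv(index=False)

# Default read: pandas guesses every column type from the text
guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed.dtypes)

# Explicit read: state the types the data had before it was written
restored = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'size': 'category', 'age': 'int16', 'prob': 'float32'},
    parse_dates=['date'],
)
print(restored.dtypes)
```

The binary formats in the following sections store the dtypes with the data, so they do not need this extra step.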


In [4]:
# Optional: df.to_pickle and pd.read_pickle below work without this import
import pickle

## 2. Pickle

In [5]:
print('Reading and writing Pickle')
df = get_dataset(5_000_000)
df = set_dtypes(df)
%time df.to_pickle('test.pickle')
%time df_pickle = pd.read_pickle('test.pickle')

Reading and writing Pickle
CPU times: total: 0 ns
Wall time: 141 ms
CPU times: total: 31.2 ms
Wall time: 77.7 ms


In [6]:
# Check the Pickle file size
%ls test.pickle

 Volume in drive E is New Volume
 Volume Serial Number is B8F9-3D09

 Directory of e:\DS-work\data_processes

11/08/2024  23:33        85,001,729 test.pickle
               1 File(s)     85,001,729 bytes
               0 Dir(s)  11,781,402,624 bytes free
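
The Pickle file above is uncompressed. If size matters more than speed, `to_pickle` can also write a compressed file; with the default `compression='infer'`, the codec is chosen from the file extension. This is a small sketch on made-up data, not a measurement from the benchmark, and the file names under the temporary directory are illustrative. Compression normally costs extra time on both write and read, which is not measured here.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000  # assumption: a small sample, not the 5,000,000-row benchmark
df = pd.DataFrame({
    'team': pd.Categorical(rng.choice(['red', 'blue', 'yellow', 'green'], n)),
    'age': rng.integers(1, 50, n).astype('int16'),
})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, 'sample.pickle')
    gzip_path = os.path.join(tmp, 'sample.pickle.gz')

    df.to_pickle(plain_path)
    df.to_pickle(gzip_path)  # the .gz suffix selects gzip compression

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)
    restored = pd.read_pickle(gzip_path)  # decompression is inferred the same way

print(f'plain: {plain_size:,} bytes, gzip: {gzip_size:,} bytes')
print('round trip equal:', restored.equals(df))
```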


## 3. Parquet

In [7]:
# Serializing with Parquet
# pandas needs a Parquet engine; install one with either of the commands below
# %pip install pyarrow
# or
# %pip install fastparquet

In [8]:
print('Reading and writing Parquet')
df = get_dataset(5_000_000)
df = set_dtypes(df)
%time df.to_parquet('test.parquet')
%time df_parquet = pd.read_parquet('test.parquet')

Reading and writing Parquet
CPU times: total: 391 ms
Wall time: 1.13 s
CPU times: total: 594 ms
Wall time: 599 ms


In [9]:
# Check the Parquet file size
%ls test.parquet

 Volume in drive E is New Volume
 Volume Serial Number is B8F9-3D09

 Directory of e:\DS-work\data_processes

11/08/2024  23:33        36,960,385 test.parquet
               1 File(s)     36,960,385 bytes
               0 Dir(s)  11,781,402,624 bytes free
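
Because Parquet stores data column by column, a reader can load only the columns it needs, which may cut read time further when a frame is wide. The sketch below shows this with the `columns` argument of `pd.read_parquet` on a tiny hand-made frame. It assumes a Parquet engine (pyarrow or fastparquet) is installed and prints a hint if none is found; the file name is illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'team': pd.Categorical(['red', 'blue', 'green', 'red']),
    'age': pd.Series([21, 34, 8, 45], dtype='int16'),
    'prob': pd.Series([0.1, 0.5, 0.9, 0.3], dtype='float32'),
})

subset = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.parquet')
    try:
        df.to_parquet(path)
        # Load only the two requested columns; 'age' is skipped
        subset = pd.read_parquet(path, columns=['team', 'prob'])
    except ImportError:
        # pandas raises ImportError when no Parquet engine is available
        print('No Parquet engine found; install pyarrow or fastparquet first')

if subset is not None:
    print(subset.dtypes)
```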


## 4. Feather

In [10]:
# Serializing with Feather (requires pyarrow)
print('Reading and writing Feather')
df = get_dataset(5_000_000)
df = set_dtypes(df)
%time df.to_feather('test.feather')
%time df_feather = pd.read_feather('test.feather')

Reading and writing Feather
CPU times: total: 125 ms
Wall time: 570 ms
CPU times: total: 125 ms
Wall time: 152 ms


In [11]:
# Check the Feather file size
%ls test.feather

 Volume in drive E is New Volume
 Volume Serial Number is B8F9-3D09

 Directory of e:\DS-work\data_processes

11/08/2024  23:33        51,267,490 test.feather
               1 File(s)     51,267,490 bytes
               0 Dir(s)  11,781,402,624 bytes free


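Parquet produces the smallest file (about 37 MB, versus 51 MB for Feather, 85 MB for Pickle, and 211 MB for CSV), so it is the best of the four at reducing the dataset size on disk. Pickle was the fastest to write and read in these runs, and CSV was by far the slowest.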
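
To put the four file sizes side by side, the byte counts from the directory listings above can be turned into ratios relative to CSV. This is only arithmetic on the reported numbers; the `sizes` and `ratios` names are illustrative.

```python
# File sizes in bytes, copied from the directory listings above
sizes = {
    'CSV': 210_559_519,
    'Pickle': 85_001_729,
    'Feather': 51_267_490,
    'Parquet': 36_960_385,
}

# How many times larger the CSV file is than each format's file
csv_size = sizes['CSV']
ratios = {name: csv_size / size for name, size in sizes.items()}

# Print smallest file first
for name in sorted(sizes, key=sizes.get):
    print(f'{name:<8} {sizes[name] / 1e6:7.1f} MB   CSV is {ratios[name]:.2f}x this size')
```

These ratios come from a single run on one machine and one synthetic dataset, so they are best read as a rough guide.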

End