<a href="https://colab.research.google.com/github/sudama-inc/EDA-with-Pandas-and-Numpy/blob/main/efficient_memory_use_in_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Efficient memory use in pandas: type casting the columns

In [1]:
import pandas as pd
import numpy as np

In [2]:
size = 5
def get_dataset(size):
  df = pd.DataFrame()

  df['position'] = np.random.choice(['left', 'middle', 'right'], size=size)
  df['age'] = np.random.randint(1, 50, size=size)
  df['team'] = np.random.choice(['red', 'blue', 'yellow', 'green'], size=size)
  df['win'] = np.random.choice(['yes', 'no'], size=size)
  df['prob'] = np.random.uniform(1, 50, size=size)
  return df


display(get_dataset(size))

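|   | position | age | team  | win | prob      |
|---|----------|-----|-------|-----|-----------|
| 0 | right    | 8   | green | yes | 37.436137 |
| 1 | right    | 32  | blue  | yes | 22.21501  |
| 2 | left     | 32  | blue  | yes | 19.805534 |
| 3 | right    | 31  | green | no  | 2.819209  |
| 4 | middle   | 15  | blue  | yes | 12.294267 |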


#### Generate a synthetic dataset with 1M rows

In [3]:
df = get_dataset(1_000_000)
df.shape

(1000000, 5)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   position  1000000 non-null  object 
 1   age       1000000 non-null  int64  
 2   team      1000000 non-null  object 
 3   win       1000000 non-null  object 
 4   prob      1000000 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 38.1+ MB
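
The `+` in `38.1+ MB` means the figure is a lower bound: for `object` columns pandas counts only the 8-byte references by default, not the Python strings they point to. The following is a minimal sketch of how the deep figure can be obtained with `memory_usage(deep=True)`. The frame `sample` and its size are hypothetical stand-ins for the notebook's `df`, and the string column is forced to `object` dtype as an assumption so that it matches the `df.info()` output above.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame: one object (string) column and one numeric column.
sample = pd.DataFrame({
    "team": pd.Series(
        np.random.choice(["red", "blue", "yellow", "green"], size=1_000),
        dtype=object,
    ),
    "age": np.random.randint(1, 50, size=1_000),
})

# Shallow: object columns are counted as references only.
shallow = sample.memory_usage(deep=False)
# Deep: also counts the Python string objects behind those references.
deep = sample.memory_usage(deep=True)

print(shallow)
print(deep)
```

Calling `df.info(memory_usage="deep")` reports the same deep total in the `info()` summary, at the cost of inspecting every object value.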


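## Type cast the dataset columns to reduce their size


1.   Cast `object` columns with few distinct values to `category`
2.   Cast `int64` to the smallest integer type that holds the value range (`int32` / `int16` / `int8`)
3.   Cast `float64` to `float32` (or `float16` when the precision loss is acceptable)
4.   Cast two-valued `object` columns such as yes/no to `bool`


Each cast stores the same values in fewer bytes per row, which reduces the memory used by the column and by the DataFrame as a whole.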
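
Choosing the target type by hand requires knowing the value range of each column. As a hedged sketch, the range can be checked with `np.iinfo`, or the choice can be delegated to `pd.to_numeric(..., downcast=...)`. The series `ages` and `probs` below are hypothetical stand-ins for the `age` and `prob` columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'age' column (values 1..49).
ages = pd.Series(np.random.randint(1, 50, size=1_000), dtype="int64")

# Check the value range against the target dtype before casting:
# astype to a too-small integer type can wrap around rather than raise.
limits = np.iinfo("int8")
fits_int8 = ages.min() >= limits.min and ages.max() <= limits.max

# Alternatively, let pandas pick the smallest integer dtype that fits.
small_ages = pd.to_numeric(ages, downcast="integer")

# Hypothetical stand-in for the 'prob' column.
probs = pd.Series(np.random.uniform(1, 50, size=1_000), dtype="float64")
# downcast="float" goes no lower than float32.
small_probs = pd.to_numeric(probs, downcast="float")

print(fits_int8, small_ages.dtype, small_probs.dtype)
```

The automatic downcast is convenient for exploration. An explicit `astype` keeps the dtype stable if later data has a wider range.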







In [5]:
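def set_dtypes(df):
  # Note: modifies the passed DataFrame in place and also returns it
  # Few distinct strings -> category (integer codes plus a lookup table)
  df['position'] = df['position'].astype('category')
  # Ages are generated in 1..49, which fits in int8 (-128..127)
  df['age'] = df['age'].astype('int8')
  df['team'] = df['team'].astype('category')
  # 'yes' / 'no' strings -> bool
  df['win'] = df['win'].map({'yes':True, 'no':False})
  # Halves the storage per value at reduced precision
  df['prob'] = df['prob'].astype('float32')
  return df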

In [7]:
df = set_dtypes(df)
df.shape

(1000000, 5)
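
The shape is unchanged by the cast; only the per-column storage differs. A minimal sketch of how the before and after footprints could be compared column by column follows. It rebuilds a smaller, hypothetical frame (`raw`, 10,000 rows) instead of reusing the notebook's `df`, forces the string columns to `object` dtype as an assumption, and prints no fixed numbers because the byte counts depend on the pandas and Python versions.

```python
import numpy as np
import pandas as pd

n = 10_000  # smaller than the notebook's 1M rows, to keep the sketch quick

# Hypothetical uncast frame with the same kinds of columns as the notebook.
raw = pd.DataFrame({
    "position": pd.Series(
        np.random.choice(["left", "middle", "right"], size=n), dtype=object
    ),
    "age": np.random.randint(1, 50, size=n).astype("int64"),
    "win": pd.Series(np.random.choice(["yes", "no"], size=n), dtype=object),
    "prob": np.random.uniform(1, 50, size=n),
})

# Apply the same casts as set_dtypes, on a copy so `raw` stays untouched.
cast = raw.copy()
cast["position"] = cast["position"].astype("category")
cast["age"] = cast["age"].astype("int8")
cast["win"] = cast["win"].map({"yes": True, "no": False})
cast["prob"] = cast["prob"].astype("float32")

# deep=True counts the string payloads of object columns as well.
before = raw.memory_usage(deep=True)
after = cast.memory_usage(deep=True)

comparison = pd.DataFrame({"before_bytes": before, "after_bytes": after})
comparison["ratio"] = comparison["before_bytes"] / comparison["after_bytes"]
print(comparison)
```

The `object` columns typically show the largest ratios, since each string is replaced by a one-byte code or boolean.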

#### Compare the time taken on the uncast and cast DataFrames

In [8]:
without_cast_df = get_dataset(1_000_000)
%timeit without_cast_df['age_rank'] = without_cast_df.groupby(['team','position'])['age'].rank()
%timeit without_cast_df['prob_rank'] = without_cast_df.groupby(['team','position'])['prob'].rank()
%timeit without_cast_df['win_prob_rank'] = without_cast_df.groupby(['team','position','win'])['prob'].rank()

252 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
344 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
455 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
cast_df = get_dataset(1_000_000)
cast_df = set_dtypes(cast_df)
%timeit cast_df['age_rank'] = cast_df.groupby(['team','position'])['age'].rank()
%timeit cast_df['prob_rank'] = cast_df.groupby(['team','position'])['prob'].rank()
%timeit cast_df['win_prob_rank'] = cast_df.groupby(['team','position','win'])['prob'].rank()

116 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
246 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
262 ms ± 9.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
# Same benchmark without casting, on a dataset 10 times larger
without_cast_df = get_dataset(10_000_000)
%timeit without_cast_df['age_rank'] = without_cast_df.groupby(['team','position'])['age'].rank()
%timeit without_cast_df['prob_rank'] = without_cast_df.groupby(['team','position'])['prob'].rank()
%timeit without_cast_df['win_prob_rank'] = without_cast_df.groupby(['team','position','win'])['prob'].rank()

3.5 s ± 239 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.96 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.8 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
# Same benchmark with casting, on a dataset 10 times larger
cast_df = get_dataset(10_000_000)
cast_df = set_dtypes(cast_df)
%timeit cast_df['age_rank'] = cast_df.groupby(['team','position'])['age'].rank()
%timeit cast_df['prob_rank'] = cast_df.groupby(['team','position'])['prob'].rank()
%timeit cast_df['win_prob_rank'] = cast_df.groupby(['team','position','win'])['prob'].rank()

2.25 s ± 359 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.17 s ± 270 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.42 s ± 266 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
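
To put the four benchmark cells side by side, the speedup from casting can be computed as the uncast mean divided by the cast mean. The sketch below copies the mean timings from the `%timeit` outputs above. It ignores the reported standard deviations, so the ratios are rough indications from a single machine and run.

```python
# Mean timings in seconds, copied from the %timeit outputs above:
# (rows, operation) -> (uncast mean, cast mean)
timings = {
    ("1M", "age_rank"): (0.252, 0.116),
    ("1M", "prob_rank"): (0.344, 0.246),
    ("1M", "win_prob_rank"): (0.455, 0.262),
    ("10M", "age_rank"): (3.5, 2.25),
    ("10M", "prob_rank"): (4.96, 4.17),
    ("10M", "win_prob_rank"): (5.8, 4.42),
}

# Speedup = uncast time / cast time; values above 1 mean casting was faster.
speedups = {key: uncast / cast for key, (uncast, cast) in timings.items()}

for (rows, op), ratio in speedups.items():
    print(f"{rows:>3} {op:<14} {ratio:.2f}x")
```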


# Casting pandas columns to smaller dtypes can save memory and time