<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

<a name='s1-2.2'></a>
### Memory Utilization ###
Memory utilization of a DataFrame depends largely on the data type of each column. 

<p><img src='images/dtypes.png' width=720></p>

We can use `DataFrame.memory_usage()` to see the memory usage of each column in bytes. Most common data types, such as `int`, `float`, `datetime`, and `bool`, have a fixed size in memory, so a column's memory usage is the size of one value multiplied by the number of rows. For `string` data, which pandas stores here with the `object` data type, the reported usage is the number of rows times 8 bytes. This accounts for the 64-bit pointer that refers to each value's address in memory, but not for the memory used by the string values themselves. In CPython, each ASCII `string` value needs 49 bytes of overhead plus one additional byte per character, although the exact overhead can vary with the Python version. Passing `deep=True` gives a more accurate report that accounts for the system-level memory consumption of the contained `string` values. 
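
As a rough illustration of the difference between the shallow and deep reports, the sketch below builds a small DataFrame and compares the two. The column names and values are made up for this example, and it assumes a 64-bit CPython build with the string column forced to the `object` data type.

```python
import sys
import pandas as pd

# hypothetical toy data, not the course dataset
toy_df = pd.DataFrame({
    'price': pd.Series([1.5, 2.5, 3.5], dtype='float64'),
    'brand': pd.Series(['acme', 'globex', 'initech'], dtype='object'),
})

# shallow report: 8 bytes per row for both columns (a value or a pointer)
shallow = toy_df.memory_usage(index=False)

# deep report: also counts the Python string objects behind each pointer
deep = toy_df.memory_usage(index=False, deep=True)

# per-string cost: a fixed overhead plus one byte per ASCII character
empty_size = sys.getsizeof('')
string_sizes = [sys.getsizeof(s) for s in toy_df['brand']]

print(shallow)
print(deep)
print(f'empty string: {empty_size} bytes, brand strings: {string_sizes}')
```

The numeric column reports the same size in both cases, while the string column grows once the string objects themselves are counted.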

Separately, we've provided a `dli_utils.make_decimal()` function that converts a memory size in bytes into units based on powers of 2. In contrast to units based on powers of 10, this convention is customarily used to report memory capacity. More information about the two definitions can be found [here](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units). 
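
To make the two conventions concrete, here is a minimal sketch of a power-of-2 formatter. The `format_binary_size` helper is hypothetical and written only for this illustration; it is not the `dli_utils.make_decimal()` implementation, whose output format may differ.

```python
def format_binary_size(num_bytes):
    """Format a byte count using power-of-2 units (hypothetical helper)."""
    units = ['B', 'KiB', 'MiB', 'GiB', 'TiB']
    size = float(num_bytes)
    for unit in units:
        # stop at the first unit where the value drops below 1024,
        # or at the largest unit available
        if size < 1024 or unit == units[-1]:
            return f'{size:.2f} {unit}'
        size /= 1024


one_million = 1_000_000

# powers of 10: 1 MB is exactly 1,000,000 bytes
print(f'{one_million / 10**6:.2f} MB')

# powers of 2: the same byte count is a little under 1 MiB (1,048,576 bytes)
print(format_binary_size(one_million))
```

The same byte count therefore reads as a smaller number in binary units, and the gap widens as the units get larger.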

In [None]:
# import dependencies
import pandas as pd
import sys
import random

# import utility
from dli_utils import make_decimal

# import data
df=pd.read_csv('2020-Mar.csv')

# preview DataFrame
df.head()

In [None]:
# convert feature to datetime data type
df['event_time']=pd.to_datetime(df['event_time'])

In [None]:
# list each column's shallow memory usage (8 bytes per row)
memory_usage_df=df.memory_usage(index=False)
memory_usage_df.name='memory_usage'
dtypes_df=df.dtypes
dtypes_df.name='dtype'

# estimate the total: each column uses roughly number of rows * 8 bytes
# 64-bit numeric and datetime values take 8 bytes each; object columns store an 8-byte pointer per row
byte_size=len(df) * 8 * len(df.columns)

print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')

pd.concat([memory_usage_df, dtypes_df], axis=1)

In [None]:
# lists each column's full memory usage
memory_usage_df=df.memory_usage(deep=True, index=False)
memory_usage_df.name='memory_usage'

byte_size=memory_usage_df.sum()

# show total memory usage
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')

pd.concat([memory_usage_df, dtypes_df], axis=1)

In [None]:
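# alternatively, use sys.getsizeof(), which for a DataFrame reports deep memory usage
# the result also counts the index, so it can differ slightly from the sum above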
byte_size=sys.getsizeof(df)

print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')

In [None]:
# check random string-typed column
string_cols=[col for col in df.columns if df[col].dtype=='object' ]
column_to_check=random.choice(string_cols)

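# fixed per-string overhead (49 bytes on many 64-bit CPython builds)
overhead=sys.getsizeof('')
pointer_size=8

# NaN is never equal to itself, so item==item is False only for missing values
# a missing value is a float NaN: 24 bytes for the object plus the 8-byte pointer = 32 bytes
# len(item) equals the byte count only for ASCII strings
column_bytes=sum((len(item)+overhead+pointer_size) if item==item else 32 for item in df[column_to_check].values)
print(f'{column_to_check} column uses : {column_bytes} bytes of memory.')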

<p><img src='images/tip.png' width=720></p>
When Python stores a string, it uses memory for the overhead of the Python object, for metadata about the string, and for the string itself. The memory usage we calculated includes temporary objects that are deallocated after the initial import. It's important to note that Python has memory optimization mechanisms for strings: when the same string is created multiple times, Python may cache or "intern" it in memory and reuse it for later string objects. 
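
The interning behavior can be sketched with `sys.intern()` from the standard library. The example strings are made up for illustration, and whether two equal strings share one object before explicit interning depends on the interpreter, so that result is printed rather than assumed.

```python
import sys

# build two equal strings at runtime so they are created independently
first = ''.join(['elec', 'tronics', '.', 'phone'])
second = ''.join(['elec', 'tronics', '.', 'phone'])

print(first == second)   # equal by value
print(first is second)   # identity here depends on the interpreter

# sys.intern() returns one canonical object for equal strings,
# so both names end up referring to the same object in memory
first_interned = sys.intern(first)
second_interned = sys.intern(second)
print(first_interned is second_interned)
```

When many rows hold the same value and share one interned object, a per-row sum of string sizes overstates the memory actually held.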
