# Profile memory usage
Our data files are large: reading a whole year's worth of trip data into a pandas DataFrame approaches the limit of what a machine with 16 GiB of RAM can handle.
This notebook profiles memory usage and the savings possible by
* Using smaller numeric types
* Using the `category` dtype instead of `object` (strings)
* Converting timestamp strings to `datetime64`

In [None]:
import pandas as pd
import numpy as np

### Initial Memory Usage

In [None]:
df = pd.read_csv("data/NY_2019.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
# initial usage, no optimization. see column dtypes
df.info(memory_usage='deep')

In [None]:
# Total GiB (matches df.info above, apart from the small index overhead)
start_memory = df.memory_usage(index=False, deep=True).sum()/(2**30)
print(f'{round(start_memory, 3)} GiB')
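
The same measurement is repeated after every optimization step, so it could be wrapped in a small helper. The sketch below is only a convenience; `memory_gib` and the `demo` frame are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd


def memory_gib(frame: pd.DataFrame) -> float:
    """Return the deep memory footprint of `frame` in GiB, excluding the index."""
    return frame.memory_usage(index=False, deep=True).sum() / 2**30


# Tiny stand-in frame with made-up values; the notebook would pass `df` instead.
demo = pd.DataFrame(
    {
        "tripduration": [320, 799, 1201],
        "usertype": ["Subscriber", "Customer", "Subscriber"],
    }
)
print(f"{memory_gib(demo):.9f} GiB")
```

With `deep=True`, pandas inspects the actual Python string objects rather than counting only the 8-byte pointers, which is why it is used throughout this notebook.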

### Use smaller numeric types

In [None]:
# Max int values (we have no negative numbers in our data)
print(np.iinfo(np.int64).max)
print(np.iinfo(np.int32).max)
print(np.iinfo(np.int16).max)
print(np.iinfo(np.int8).max)

In [None]:
# What is the max for each numeric column?

print("Max values:")
# Someone took a 1000+ hour (44 day) trip. Should we drop outliers?
# Perhaps not: without those trips we can't determine bike rebalancing.
print(df.tripduration.max())
print(df['startstationid'].max())
print(df['endstationid'].max())
print(df.bikeid.max())
print(df.birthyear.max())

# NOTE: we have no negative values (as expected) so can use unsigned ints when downcasting
print("\nMin values:")
print(df.tripduration.min())
print(df['startstationid'].min())
print(df['endstationid'].min())
print(df.bikeid.min())
print(df.birthyear.min())
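
Given a non-negative maximum, NumPy can report the smallest unsigned integer type that holds it. The sketch below uses `np.min_scalar_type` for that lookup; the maxima are made-up placeholders, not values read from the trip data.

```python
import numpy as np

# Hypothetical column maxima, invented for illustration only.
example_maxima = {
    "station_id_like": 3500,
    "birth_year_like": 2003,
    "duration_like": 4_000_000,
}

for name, max_value in example_maxima.items():
    # For a non-negative Python int this returns an unsigned dtype.
    smallest = np.min_scalar_type(max_value)
    limit = np.iinfo(smallest).max
    print(f"{name}: max={max_value} fits in {smallest} (limit {limit})")
```

Comparing the real maxima printed above against these limits is a quick way to sanity-check whatever dtypes the automatic downcast picks.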


In [None]:
# Drop NAs before downcasting
df.dropna(axis=0, inplace=True)

# # Use smaller numeric types
# df['tripduration'] = df['tripduration'].astype('int32')
# df['startstationid'] = df['startstationid'].astype('int16')
# df['endstationid'] = df['endstationid'].astype('int16')
# df['bikeid'] = df['bikeid'].astype('int32')
# df['birthyear'] = df['birthyear'].astype('int16')
# df['gender'] = df['gender'].astype('int8')

# Actually, let's downcast automatically instead of manually.
# NOTE: downcasting floats loses precision; whether that matters depends on the operations we perform on these columns.
# E.g., float32 gives roughly 7 significant decimal digits, as opposed to roughly 15-16 for float64.
for column in df:
    if df[column].dtype == 'float64':
        # smallest float type pandas will pick is float32
        df[column] = pd.to_numeric(df[column], downcast='float')
    elif df[column].dtype == 'int64':
        # safe because the minimum values above are all non-negative
        df[column] = pd.to_numeric(df[column], downcast='unsigned')
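
To see which dtypes the automatic downcast picks, the same loop can be run on a tiny frame. This is a sketch on made-up values; `example_latitude` is a hypothetical column name, and the real columns may end up with different dtypes depending on their ranges.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; only the dtypes matter here.
demo = pd.DataFrame(
    {
        "tripduration": np.array([320, 799, 86_400], dtype="int64"),
        "gender": np.array([0, 1, 2], dtype="int64"),
        "example_latitude": np.array([40.7128, 40.7306, 40.7580], dtype="float64"),
    }
)

before = demo.dtypes.astype(str).to_dict()
for column in demo:
    if demo[column].dtype == "float64":
        demo[column] = pd.to_numeric(demo[column], downcast="float")
    elif demo[column].dtype == "int64":
        demo[column] = pd.to_numeric(demo[column], downcast="unsigned")
after = demo.dtypes.astype(str).to_dict()

for column in demo:
    print(f"{column}: {before[column]} -> {after[column]}")
```

Each integer column gets the smallest unsigned type that holds its maximum, while floats stop at `float32` because `pd.to_numeric` does not downcast below that.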

In [None]:
# profile memory again
downcasted_memory = df.memory_usage(index=False, deep=True).sum()/(2**30)
print(f'{round(downcasted_memory, 3)} GiB')

### Use categorical type

In [None]:
df['usertype'] = df['usertype'].astype('category')
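
The saving comes from storing each distinct label once and replacing every row with a small integer code. The sketch below illustrates this on a synthetic column with two repeated labels; the labels and row count are assumptions chosen for the demo, not taken from the data.

```python
import pandas as pd

# Synthetic stand-in for `usertype`: two labels repeated many times.
labels = pd.Series(["Subscriber", "Customer"] * 50_000, dtype="object")
as_category = labels.astype("category")

object_bytes = labels.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(f"codes dtype: {as_category.cat.codes.dtype}")
```

The benefit shrinks as the number of distinct values grows, so this conversion suits low-cardinality columns best.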

In [None]:
# profile memory again
categorical_memory = df.memory_usage(index=False, deep=True).sum()/(2**30)
print(f'{round(categorical_memory, 3)} GiB')

### DateTime
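We were not sure beforehand whether this would reduce or increase the size. A `datetime64` value takes 8 bytes, while each timestamp string is a separate Python object, so a reduction seems likely. The conversion is necessary for our time series analysis anyway, so let's measure it.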

In [None]:
df['starttime'] = pd.to_datetime(df['starttime'])
df['stoptime'] = pd.to_datetime(df['stoptime'])
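
The effect can be checked in isolation on synthetic timestamps. This sketch assumes a `YYYY-MM-DD HH:MM:SS` string layout, which may differ from the layout in the real files; `raw` and `parsed` are hypothetical names.

```python
import pandas as pd

# Synthetic timestamp strings in an assumed "YYYY-MM-DD HH:MM:SS" layout.
stamps = pd.date_range("2019-01-01", periods=100_000, freq="min")
raw = pd.Series(stamps.strftime("%Y-%m-%d %H:%M:%S"), dtype="object")

# Passing an explicit format avoids per-value format inference.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

string_bytes = raw.memory_usage(index=False, deep=True)
datetime_bytes = parsed.memory_usage(index=False, deep=True)

print(f"strings:    {string_bytes:,} bytes")
print(f"datetime64: {datetime_bytes:,} bytes")
```

Each parsed value occupies a fixed 8 bytes, whereas every string carries Python object overhead on top of its characters.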

In [None]:
# profile memory again
datetime_memory = df.memory_usage(index=False, deep=True).sum()/(2**30)
print(f'{round(datetime_memory, 3)} GiB')

### Outcome
Wow! Converting the timestamp columns to DateTime helps a lot, and switching `usertype` to a categorical type also gives a significant memory reduction. Smaller numeric types give a smaller percentage reduction, but it is still useful.

In [None]:
print(f'Reduced dataframe size by {round(100*(start_memory - datetime_memory)/start_memory, 2)}%')
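
Since the conversions pay off, they could also be requested while reading the file, so the unoptimized frame never has to fit in memory. The sketch below is one possible approach, not something this notebook has measured; the inline CSV, its values, and the chosen dtypes are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny inline CSV with made-up rows, standing in for the real file.
csv_text = """tripduration,starttime,usertype
320,2019-01-01 00:01:47,Subscriber
799,2019-01-01 00:04:43,Customer
"""

df_small = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"tripduration": "uint32", "usertype": "category"},
    parse_dates=["starttime"],
)
print(df_small.dtypes)
```

One caveat: a column containing missing values cannot be read directly into a NumPy integer dtype, so such columns would still need the `dropna` step (or a float dtype) first.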