# Profile memory usage
We notice that our data files are large: a pandas dataframe holding a whole year's worth of trip data approaches the limit of what a computer with 16 GiB of RAM can handle.
This notebook profiles memory usage and the savings possible by
* Using smaller `numeric` types
* Using `categorical` type instead of `object` (strings)
* Using `datetime` instead of `object` (strings) for timestamps

In [None]:
import numpy as np
import pandas as pd
from tabulate import tabulate

### Initial Memory Usage
How big is this 4 GiB (on disk) data when we load it as a pandas dataframe? Expect some size inflation, largely because string columns are stored as Python objects.
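
To get a feel for where the inflation comes from, here is a minimal sketch on a made-up toy frame (not the trip data). The helper name `memory_gib` and the toy values are assumptions for illustration; the helper wraps the same `memory_usage(index=False, deep=True)` measurement used throughout this notebook. The gap between the shallow and deep numbers is the space taken by the Python string objects themselves.

```python
import numpy as np
import pandas as pd


def memory_gib(frame: pd.DataFrame) -> float:
    """Deep memory footprint of all columns (index excluded), in GiB."""
    return frame.memory_usage(index=False, deep=True).sum() / (2 ** 30)


# Toy stand-in for the trip data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        # dtype="object" mirrors how string columns are stored in this notebook
        "usertype": pd.Series(["Subscriber", "Customer"] * 50_000, dtype="object"),
        "tripduration": np.arange(100_000, dtype="int64"),
    }
)

# Shallow counts only the 8-byte pointers of an object column;
# deep also counts the string objects those pointers refer to.
shallow_bytes = toy.memory_usage(index=False, deep=False).sum()
deep_bytes = toy.memory_usage(index=False, deep=True).sum()
print(f"shallow: {shallow_bytes:,} bytes, deep: {deep_bytes:,} bytes")
print(f"{memory_gib(toy):.4f} GiB")
```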

In [None]:
%%time
df = pd.read_csv("data/NY_2019.csv")
df.drop("Unnamed: 0", axis=1, inplace=True)

In [None]:
# Here's what the data looks like
# Surprising... why does `startstationid` have a decimal point?
# (Probably because the column has missing values: NaN forces pandas to use float64.)
df.sample(10)

In [None]:
# initial usage, no optimization. see column dtypes
df.info(memory_usage="deep")

In [None]:
# Total Memory Usage GiB (Note that this is larger than the space the data occupies on disk: ~4GiB)
start_memory = df.memory_usage(index=False, deep=True).sum() / (2 ** 30)
print(f"{round(start_memory, 3)} GiB")

### Use smaller numeric types
Let's take a look at how big a number we can store in each integer type. Then we can decide which type of int to use for each numeric column.
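
The decision itself can be sketched as a small lookup over `np.iinfo`. The helper `smallest_int_dtype` is hypothetical (it is not part of NumPy or pandas), and the example ranges are made up rather than taken from the trip data.

```python
import numpy as np


def smallest_int_dtype(min_value: int, max_value: int) -> np.dtype:
    """Return the narrowest integer dtype whose range covers [min_value, max_value]."""
    # Unsigned types are only an option when there are no negative values.
    if min_value >= 0:
        candidates = [np.uint8, np.uint16, np.uint32, np.uint64]
    else:
        candidates = [np.int8, np.int16, np.int32, np.int64]
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= min_value and max_value <= info.max:
            return np.dtype(candidate)
    raise ValueError(f"no integer dtype can hold [{min_value}, {max_value}]")


# Made-up ranges, purely for illustration
print(smallest_int_dtype(0, 255))     # uint8
print(smallest_int_dtype(0, 2019))    # uint16
print(smallest_int_dtype(0, 70_000))  # uint32
print(smallest_int_dtype(-1, 100))    # int8
```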

In [None]:
# Integer ranges per type (we have no negative numbers in our data, so the max values are what matter)
int_min_max = [
    ["int64", np.iinfo(np.int64).min, np.iinfo(np.int64).max],
    ["int32", np.iinfo(np.int32).min, np.iinfo(np.int32).max],
    ["int16", np.iinfo(np.int16).min, np.iinfo(np.int16).max],
    ["int8", np.iinfo(np.int8).min, np.iinfo(np.int8).max],
]
print(
    tabulate(
        int_min_max,
        headers=["type", "min value", "max value"],
        showindex=True,
        tablefmt="github",
        numalign="right",
    )
)

In [None]:
# What are the min and max for each numeric column?

citibike_min_max = [
    ["tripduration", df.tripduration.min(), df.tripduration.max()],
    ["startstationid", df.startstationid.min(), df.startstationid.max()],
    ["endstationid", df.endstationid.min(), df.endstationid.max()],
    ["bikeid", df.bikeid.min(), df.bikeid.max()],
    ["birthyear", df.birthyear.min(), df.birthyear.max()],
]
print(
    tabulate(
        citibike_min_max,
        headers=["column", "min value", "max value"],
        showindex=True,
        tablefmt="github",
        numalign="right",
    )
)

#### What do we find out?
Someone took a 44-day bike trip...
More pertinently, we have no negative values, so we can use **unsigned ints** when downcasting to save even more space.
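
As a rough illustration of when unsigned types pay off, the sketch below compares `downcast="integer"` with `downcast="unsigned"` in `pd.to_numeric`. The station IDs are hypothetical, and the duration example assumes `tripduration` is recorded in seconds. Unsigned only wins when the maximum falls in the upper half of an unsigned range (for example between 32,768 and 65,535); otherwise both choices land on the same width.

```python
import pandas as pd

# Hypothetical station IDs; 40_000 is too large for int16 (max 32_767)
station_ids = pd.Series([72, 3911, 40_000], dtype="int64")
signed = pd.to_numeric(station_ids, downcast="integer")
unsigned = pd.to_numeric(station_ids, downcast="unsigned")
print(signed.dtype, unsigned.dtype)  # int32 uint16

# A 44-day trip, assuming durations are in seconds: 44 * 24 * 3600 = 3_801_600
longest_trip = 44 * 24 * 3600
durations = pd.Series([60, longest_trip], dtype="int64")
# Too large for uint16 (max 65_535), so 32 bits are needed either way
print(pd.to_numeric(durations, downcast="unsigned").dtype)  # uint32
```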

#### Downcasting the data
Now that we've looked at the numeric values, we can tell pandas to downcast each column to the appropriate numeric type.
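
One thing to watch for: a column of whole-number IDs that contains missing values is stored as `float64`, and downcasting it as a float only gets it down to `float32`. A minimal sketch, with made-up IDs, of how such a column could be converted to an unsigned integer type once the NAs are dropped (whether this applies to a given column in the trip data is an assumption to verify):

```python
import numpy as np
import pandas as pd

# Made-up IDs with one missing value; the NaN forces float64
ids = pd.Series([72, 3911, np.nan, 519])
print(ids.dtype)  # float64

clean = ids.dropna()

# Only cast to an integer type if every remaining value is a whole number
if not (clean % 1 == 0).all():
    raise ValueError("column holds fractional values; keep it as a float")

as_float32 = pd.to_numeric(clean, downcast="float")  # 4 bytes per value
as_uint16 = clean.astype(np.uint16)                  # 2 bytes per value

print(as_float32.dtype, as_float32.memory_usage(index=False))
print(as_uint16.dtype, as_uint16.memory_usage(index=False))
```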

In [None]:
# Drop NAs before downcasting
df.dropna(axis=0, inplace=True)

# # Use smaller numeric types
# df['tripduration'] = df['tripduration'].astype('int32')
# df['startstationid'] = df['startstationid'].astype('int16')
# df['endstationid'] = df['endstationid'].astype('int16')
# df['bikeid'] = df['bikeid'].astype('int32')
# df['birthyear'] = df['birthyear'].astype('int16')
# df['gender'] = df['gender'].astype('int8')

# Actually, let's downcast automatically instead of manually...
# NOTE: for floats we lose precision, but that isn't important here because we are not
# doing arithmetic operations that require high precision.
# E.g., float32 gives about 6 significant digits of precision as opposed to about 15 for float64
for column in df:
    if df[column].dtype == "float64":
        df[column] = pd.to_numeric(df[column], downcast="float")
    elif df[column].dtype == "int64":
        df[column] = pd.to_numeric(df[column], downcast="unsigned")

In [None]:
# profile memory again
downcasted_memory = df.memory_usage(index=False, deep=True).sum() / (2 ** 30)
print(f"{round(downcasted_memory, 3)} GiB")

### Use categorical type
The `usertype` column is categorical: a Citi Bike user is either a `Subscriber` or a `Customer`.
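
A small sketch of why this helps, using a made-up column rather than the real data: a categorical stores each distinct string once and replaces every row with a small integer code.

```python
import pandas as pd

# Made-up user types; dtype="object" mirrors how the strings are stored after read_csv
usertype = pd.Series(
    ["Subscriber", "Customer", "Subscriber", "Subscriber"] * 25_000, dtype="object"
)
as_category = usertype.astype("category")

object_bytes = usertype.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(as_category.cat.categories.tolist())  # the distinct labels, stored once
print(as_category.cat.codes.dtype)          # one small integer code per row
print(f"object: {object_bytes:,} bytes, category: {category_bytes:,} bytes")
```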

In [None]:
df["usertype"] = df["usertype"].astype("category")

In [None]:
# profile memory again
categorical_memory = df.memory_usage(index=False, deep=True).sum() / (2 ** 30)
print(f"{round(categorical_memory, 3)} GiB")

### DateTime
The `starttime` and `stoptime` columns for a trip are stored as strings.
When we do our time series analysis, we'd like them to be `datetime`s.
Will this reduce the size?
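
A rough way to see why it should: a parsed timestamp takes a fixed 8 bytes, while each string is a separate Python object several times that size. The sketch below uses made-up timestamps; the exact string format in the trip data may differ.

```python
import pandas as pd

# Made-up timestamps stored as strings, as they would be after read_csv
starttime = pd.Series(
    ["2019-01-01 00:01:47", "2019-01-01 00:04:43", "2019-01-01 00:06:03"] * 10_000,
    dtype="object",
)
parsed = pd.to_datetime(starttime)

string_bytes = starttime.memory_usage(index=False, deep=True)
datetime_bytes = parsed.memory_usage(index=False, deep=True)

print(parsed.dtype)
print(f"strings: {string_bytes:,} bytes, datetimes: {datetime_bytes:,} bytes")
```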

In [None]:
df["starttime"] = pd.to_datetime(df["starttime"])
df["stoptime"] = pd.to_datetime(df["stoptime"])

In [None]:
# profile memory again
datetime_memory = df.memory_usage(index=False, deep=True).sum() / (2 ** 30)
print(f"{round(datetime_memory, 3)} GiB")

### Outcome
Wow! Using DateTime helps a lot, and we get significant gains in memory reduction from using a categorical type. Smaller numeric types give a smaller percentage reduction, but it is still useful.
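
To compare the steps side by side, each step's saving can be expressed as a share of the starting size. The helper `stage_reductions` is hypothetical, and the sizes below are made-up round numbers, not measurements from this notebook.

```python
def stage_reductions(stages: dict[str, float]) -> list[tuple[str, float]]:
    """For each step after the first, return the percentage of the starting size it saved."""
    items = list(stages.items())
    start = items[0][1]
    previous = start
    results = []
    for name, size in items[1:]:
        results.append((name, round(100 * (previous - size) / start, 2)))
        previous = size
    return results


# Placeholder sizes in GiB (NOT real measurements)
placeholder = {"start": 10.0, "downcast": 9.0, "categorical": 6.0, "datetime": 2.0}
for name, percent in stage_reductions(placeholder):
    print(f"{name}: {percent}% of the starting size saved")
```

In the notebook, the same helper could be called with `start_memory`, `downcasted_memory`, `categorical_memory` and `datetime_memory` in that order.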

In [None]:
print(
    f"Reduced dataframe size by {round(100*(start_memory - datetime_memory)/start_memory, 2)}%"
)