# Data Type Optimization
With the data we are working with in this course, it is easy to take advantage of the abundant computing resources available. Most of the datasets easily fit in memory. But what happens if your dataset is massive, say 64 GB? You can either get a bigger machine, or you can see whether optimizing how pandas stores the dataset provides another solution.

Here we will take the `clean_08.csv` dataset we produced in the last lesson, *Fixing Data Types*, and show how altering the data types can shrink the memory footprint of a DataFrame.

In [None]:
import pandas as pd

In [None]:
# read the clean_08 CSV
df = pd.read_csv("clean_08.csv")

In [None]:
# use .info() to view the current Dtypes, and the memory usage.
df.info()

### Numerical Optimization
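Currently the DataFrame is a mixture of `object`, `float64`, and `int64` data types. We can also see the memory usage: `100.4+ KB`. The `+` means the figure is a lower bound, because pandas does not measure the full size of the strings held in `object` columns by default.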
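To get a per-column view of memory, and to see what the default measurement leaves out for `object` columns, you can compare `memory_usage()` with `memory_usage(deep=True)`. The sketch below uses a tiny made-up DataFrame as a stand-in for `clean_08.csv`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for clean_08.csv: one object (string) column, one float column.
sample = pd.DataFrame({
    "fuel": pd.Series(["Gasoline", "Diesel", "Gasoline", "Ethanol"], dtype=object),
    "city_mpg": [15.0, 17.0, 16.0, 13.0],
})

# Default: object columns are measured by the size of their pointers only.
shallow = sample.memory_usage()

# deep=True also counts the Python string objects those pointers refer to.
deep = sample.memory_usage(deep=True)

print(shallow)
print(deep)
```

On the real DataFrame, `df.info(memory_usage="deep")` reports the deep total directly.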

Let's see which values are present in the `*_mpg` columns.

In [None]:
# find city_mpg value counts
df.city_mpg.value_counts()

In [None]:
# find hwy_mpg value counts
df.hwy_mpg.value_counts()

In [None]:
# find cmb_mpg value counts
df.cmb_mpg.value_counts()

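Even though the DataFrame stores these columns as `float64`, inspecting them shows that every value is a whole number. (The `int64` shown at the bottom of each `value_counts()` output is the data type of the counts, not of the column.) Let's convert the columns to integers to make it official.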
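Before casting a float column to an integer type, it can help to confirm that it really holds only whole numbers and no missing values. The following is a minimal sketch on made-up data; the column names match the lesson, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the three mpg columns, stored as floats.
mpg = pd.DataFrame({
    "city_mpg": [15.0, 17.0, 16.0],
    "hwy_mpg": [20.0, 24.0, 22.0],
    "cmb_mpg": [17.0, 20.0, 18.0],
})

# True for a column only if every value has no fractional part.
is_whole = (mpg % 1 == 0).all()

# Missing values cannot be cast to a NumPy integer type, so count them as well.
n_missing = mpg.isna().sum()

print(is_whole)
print(n_missing)
```

If any column came back `False` or had missing values, a plain `.astype("int")` would either lose information or raise an error.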

In [None]:
# Change city_mpg, hwy_mpg, cmb_mpg to be an int using .astype()
df["city_mpg"] = df["city_mpg"].astype("int")
df["hwy_mpg"] = df["hwy_mpg"].astype("int")
df["cmb_mpg"] = df["cmb_mpg"].astype("int")

In [None]:
# df info to view data type and memory usage changes
df.info()

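Well, that did not change the memory usage, because `float64` and `int64` both use 8 bytes per value. Instead of `int64`, let's change the columns to `int8`, which uses 1 byte per value and can hold whole numbers from -128 to 127. The values in each column only range from 8 to 48. Use `.describe()` to view the min/max of each column.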

In [None]:
df[["city_mpg", "hwy_mpg", "cmb_mpg"]].describe()
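The range check can also be done in code rather than by eye. This sketch uses `np.iinfo` to look up the limits of `int8` and compares them with each column's min and max; the data is a hypothetical sample spanning the 8 to 48 range mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical mpg values covering the range seen in the lesson (8 to 48).
mpg = pd.DataFrame({
    "city_mpg": [8, 21, 48],
    "hwy_mpg": [13, 29, 48],
    "cmb_mpg": [10, 24, 48],
})

# Smallest and largest values an int8 can represent.
limits = np.iinfo(np.int8)

# True for each column whose values all fit inside the int8 range.
fits = (mpg.min() >= limits.min) & (mpg.max() <= limits.max)

print(limits.min, limits.max)
print(fits)
```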

In [None]:
# Change the data type to be an int8
df["city_mpg"] = df["city_mpg"].astype("int8")
df["hwy_mpg"] = df["hwy_mpg"].astype("int8")
df["cmb_mpg"] = df["cmb_mpg"].astype("int8")
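As an alternative to choosing `int8` by hand, `pd.to_numeric` with `downcast="integer"` lets pandas pick the smallest signed integer type that can hold the data. This is a sketch of that option on a hypothetical Series, not the approach the lesson itself takes.

```python
import pandas as pd

# Hypothetical mpg values stored as the default 64-bit integers.
city_mpg = pd.Series([8, 21, 35, 48], dtype="int64")

# Ask pandas to choose the smallest signed integer dtype that fits the values.
downcasted = pd.to_numeric(city_mpg, downcast="integer")

print(city_mpg.dtype, "->", downcasted.dtype)
```

The automatic choice is convenient when you have many columns, while an explicit `.astype("int8")` documents the intended type more clearly.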

In [None]:
# df info to view data type and memory usage changes
df.info()

Now we are getting somewhere! We just reduced the memory usage from `100.4+ KB` to `80.1+ KB` by changing how we store our integer values.

How about changing how we store strings?

### String Optimization
Look at the value counts of each `object` data type: `trans`, `drive`, `fuel`, `veh_class`, `smartway`, and `model`.

In [None]:
# find trans value counts
df["trans"].value_counts()

In [None]:
# find drive value counts
df["drive"].value_counts()

In [None]:
# find fuel value counts
df["fuel"].value_counts()

In [None]:
# find veh_class value counts
df["veh_class"].value_counts()

In [None]:
# find smartway value counts
df["smartway"].value_counts()

In [None]:
# find model value counts
df["model"].value_counts()
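Instead of reading through each `value_counts()` output, the number of unique values in every `object` column can be summarized in one step with `select_dtypes` and `nunique`. The sketch below runs on a small made-up DataFrame; the model names and values are illustrative only.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
sample = pd.DataFrame({
    "model": ["ACURA MDX", "ACURA RDX", "AUDI A4", "BMW 328i"],
    "drive": ["4WD", "4WD", "2WD", "2WD"],
    "smartway": ["no", "no", "yes", "no"],
    "city_mpg": [15, 17, 21, 18],
}).astype({"model": object, "drive": object, "smartway": object})

# Count distinct values per object column, fewest first.
unique_counts = sample.select_dtypes(include="object").nunique().sort_values()

print(unique_counts)
```

Columns near the top of this list are the natural candidates for a more compact representation.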

Except for `model`, all of the `object` columns have between 2 and 13 unique values. In pandas there is a specialized data type called [Categorical](https://pandas.pydata.org/docs/user_guide/categorical.html#). Categorical data types are useful when you have `object` columns with a low number of unique values. When you convert such a column to a category, pandas stores each unique value only once and represents every row with a small integer code, which is much more compact than storing a full string per row.

Let's change the 5 low-cardinality `object` columns to `category`.
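To see why a category column is smaller, it helps to look at what it holds: a short list of categories plus one integer code per row. Here is a minimal sketch on a hypothetical `drive` column with only two distinct values, repeated to mimic a longer dataset.

```python
import pandas as pd

# Hypothetical column with two distinct values repeated many times.
drive = pd.Series(["2WD", "4WD", "2WD", "2WD", "4WD"] * 1000, dtype=object)
as_category = drive.astype("category")

# The unique values are stored once...
print(as_category.cat.categories)

# ...and each row is stored as a small integer code pointing at one of them.
print(as_category.cat.codes.dtype)

# Compare the full memory footprint of both representations.
object_bytes = drive.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The saving shrinks as the number of unique values approaches the number of rows, which is why `model` is left as an `object` column here.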

In [None]:
# assign trans, drive, fuel, veh_class, and smartway to "category" using .astype()
df["trans"] = df["trans"].astype("category")
df["drive"] = df["drive"].astype("category")
df["fuel"] = df["fuel"].astype("category")
df["veh_class"] = df["veh_class"].astype("category")
df["smartway"] = df["smartway"].astype("category")

In [None]:
# df info to view data type and memory usage changes
df.info()

Wow! By changing those columns to categories, we further reduced the memory usage from `80.1+ KB` to `47.8+ KB`. Compared with the original `100.4+ KB`, that is an overall reduction of roughly 50%. While the savings may not matter for a small dataset such as this one, the same techniques can make a real difference when working with large data.
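For a large file, you may not want to load everything with the default types and convert afterwards. One option is to pass the target types to `pd.read_csv` through its `dtype` parameter so the columns are compact from the start. The sketch below reads from an in-memory string with made-up rows in place of a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """model,drive,city_mpg
ACURA MDX,4WD,15
ACURA RDX,4WD,17
AUDI A4,2WD,21
"""

# Map each column to the compact dtype chosen during exploration.
dtypes = {"drive": "category", "city_mpg": "int8"}

df_small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(df_small.dtypes)
```

This assumes the integer columns are written without a decimal point. If a file stores them as values like `15.0`, reading them straight into an integer type may fail, and loading them as floats and casting afterwards is the safer route.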

For more information, check out the [Use efficient datatypes](https://pandas.pydata.org/docs/user_guide/scale.html#use-efficient-datatypes) section of the pandas user guide.