## Reduce the size of large datasets by using the right formats

Pandas for large datasets -
https://www.dataquest.io/blog/pandas-big-data/

Other useful reads

* https://pythonspeed.com/articles/pandas-load-less-data/
* https://pythonspeed.com/articles/pandas-reduce-memory-lossy/
* https://pythonspeed.com/articles/chunking-pandas/
* https://pythonspeed.com/articles/faster-pandas-dask/

In [1]:
import pandas as pd
import numpy as np

In [2]:
btomb = 1024**2 # Byte to MBytes

In [4]:
df = pd.read_csv("/v/courses/dataexpviz.public/Datasets/D-LargeData/crimes.csv")





In [5]:
# Memory occupied by each column after loading, in MB (object columns count only their 8-byte pointers).
usage = df.memory_usage()
usage = usage / btomb
usage 

Index                    0.000122
ID                      54.369461
Case Number             54.369461
Date                    54.369461
Block                   54.369461
IUCR                    54.369461
Primary Type            54.369461
Description             54.369461
Location Description    54.369461
Arrest                   6.796183
Domestic                 6.796183
Beat                    54.369461
District                54.369461
Ward                    54.369461
Community Area          54.369461
FBI Code                54.369461
X Coordinate            54.369461
Y Coordinate            54.369461
Year                    54.369461
Updated On              54.369461
Latitude                54.369461
Longitude               54.369461
Location                54.369461
dtype: float64
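
A detail worth keeping in mind when reading these numbers: without `deep=True`, `memory_usage()` counts only the 8-byte pointer for each value of an `object` column, which is why every such column reports the same 54.37 MB as the `int64` columns. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative assumptions, not part of the crimes data) to show the difference.

```python
import pandas as pd

# Hypothetical toy frame standing in for the crimes data.
toy = pd.DataFrame({
    "ID": range(1000),
    "Primary Type": pd.Series(["THEFT", "BURGLARY"] * 500, dtype=object),
})

shallow = toy.memory_usage()         # object column: pointers only
deep = toy.memory_usage(deep=True)   # also counts the Python string objects

print(shallow)
print(deep)
```

For numeric columns both measurements agree; only `object` columns grow under `deep=True`.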

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7126314 entries, 0 to 7126313
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(1

In [7]:
print(np.iinfo('uint32'), np.iinfo('uint16'), df.ID.max())

Machine parameters for uint32
---------------------------------------------------------------
min = 0
max = 4294967295
---------------------------------------------------------------
 Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------
 12067449


In [8]:
#conv_ID = df.ID.apply(pd.to_numeric,downcast='unsigned')
conv_df = pd.DataFrame()
conv_df['ID'] = df.ID.astype("uint32")

In [10]:
print(f"Old dataframe: \n{df[['ID']].memory_usage() / btomb}\n New dataframe: {conv_df.memory_usage() / btomb}")

Old dataframe: 
Index     0.000122
ID       54.369461
dtype: float64
 New dataframe: Index     0.000122
ID       27.184731
dtype: float64
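
Instead of choosing `uint32` by hand, `pd.to_numeric` with `downcast='unsigned'` (hinted at in the commented-out line above) can pick the smallest unsigned type that holds all values. A minimal sketch on a few made-up IDs in the same range as the real column:

```python
import numpy as np
import pandas as pd

# Made-up sample; the largest value matches df.ID.max() from above.
ids = pd.Series([11034701, 11227287, 12067449], dtype="int64")

# Downcast to the smallest unsigned integer dtype that fits every value.
small = pd.to_numeric(ids, downcast="unsigned")

print(small.dtype)
print(ids.memory_usage(index=False), small.memory_usage(index=False))
```

Because the largest ID exceeds the `uint16` maximum of 65535, the result is `uint32`, the same choice made manually above.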


In [14]:
print(df.District.min(), df.District.max())

1.0 31.0


In [15]:
conv_df['District'] = df.District.astype("uint8")

ValueError: Cannot convert non-finite values (NA or inf) to integer

The column contains NaNs, which have no integer representation. First replace them with an integer that does not occur in the data and can be filtered out later. Since the valid districts range from 1 to 31, 0 is a good choice for this.

In [16]:
df.loc[df.District.isna(), 'District'] = 0

In [17]:
conv_df['District'] = df.District.astype("uint8")

In [18]:
conv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7126314 entries, 0 to 7126313
Data columns (total 2 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   ID        uint32
 1   District  uint8 
dtypes: uint32(1), uint8(1)
memory usage: 34.0 MB


In [22]:
# Memory usage of old and new Dataframe
print(df[['ID', 'District']].memory_usage() / btomb)
print(conv_df.memory_usage() / btomb)
print("\nWe saved more than half of the space here")

Index        0.000122
ID          54.369461
District    54.369461
dtype: float64
Index        0.000122
ID          27.184731
District     6.796183
dtype: float64

We saved more than half of the space here


In [24]:
# Check that the converted column still holds the same values as the original
sim = conv_df['District'].astype("float") != df.District
print("Nr. of discrepancies: %d"%sim.sum())

Nr. of discrepancies: 0
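
An alternative to filling the NaNs with a sentinel such as 0 is the nullable integer dtype of pandas (note the capital `U` in `"UInt8"`), which keeps missing values as `<NA>`. This is a sketch on made-up values, not what the notebook above does. The nullable type stores an extra one-byte mask per value, so it needs about two bytes per row instead of one.

```python
import numpy as np
import pandas as pd

# Made-up district values including a missing one.
district = pd.Series([1.0, 31.0, np.nan, 4.0])

# Nullable unsigned 8-bit integers: NaN becomes <NA> instead of raising.
nullable = district.astype("UInt8")

print(nullable)
print(district.memory_usage(index=False), nullable.memory_usage(index=False))
```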


In [25]:
df.columns

Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')

In [26]:
print(df['X Coordinate'].min(), df['X Coordinate'].max())

0.0 1205119.0


In [27]:
np.finfo('float16')

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

In [28]:
x = df['X Coordinate'].dropna()
(x.astype('int').astype('float64') != x).sum()

0

It seems that none of the X coordinates has a fractional part, so the column is presumably float only because it contains NaNs.
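
The `float16` limits printed above show a maximum of about 65504, so that type cannot hold coordinates up to 1205119. `float32`, however, represents every integer up to 2**24 = 16777216 exactly and still supports NaN. A minimal sketch with made-up coordinates (only the minimum and maximum are taken from the output above):

```python
import numpy as np
import pandas as pd

# Made-up sample: whole-number coordinates plus a missing value.
x = pd.Series([0.0, 1205119.0, np.nan, 1176352.0])

x32 = x.astype("float32")

# Compare after converting back; NaN never equals NaN, so treat it separately.
lossless = (x32.astype("float64") == x) | x.isna()

print(lossless.all())
print(x.memory_usage(index=False), x32.memory_usage(index=False))
```

This halves the memory of the column without changing any value and without having to pick a sentinel for the NaNs.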

### Objects are still there. What can we do with them?

We can use categories where appropriate. If the number of unique values is very low, say fewer than 100, and the values can be named easily, converting the column to a category really makes sense.

In [29]:
print(len(df['Primary Type'].unique()), df['Primary Type'].notna().sum())

36 7126314


'Primary Type' has only 36 unique values across more than 7 million rows, so it is a column where categories make sense.

In [30]:
conv_df['Primary Type'] = df['Primary Type'].astype('category')

In [31]:
print(df[['ID', 'District', 'Primary Type']].memory_usage() / btomb)
print(conv_df.memory_usage() / btomb)

Index            0.000122
ID              54.369461
District        54.369461
Primary Type    54.369461
dtype: float64
Index            0.000122
ID              27.184731
District         6.796183
Primary Type     6.797480
dtype: float64
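
The same rule of thumb can be applied to all `object` columns at once. The helper below is a hypothetical sketch (the function name `categorize_low_cardinality` and the threshold of 100 are assumptions based on the heuristic described earlier), demonstrated on a made-up frame.

```python
import pandas as pd

def categorize_low_cardinality(frame, max_unique=100):
    """Return a copy where object columns with few unique values become categories."""
    out = frame.copy()
    for col in out.columns:
        if out[col].dtype == object and out[col].nunique() < max_unique:
            out[col] = out[col].astype("category")
    return out

# Made-up data: one low-cardinality column, one column of unique identifiers.
toy = pd.DataFrame({
    "Primary Type": pd.Series(["THEFT", "BURGLARY"] * 100, dtype=object),
    "Case Number": pd.Series([f"HZ{i:06d}" for i in range(200)], dtype=object),
})

converted = categorize_low_cardinality(toy)
print(converted.dtypes)
```

Columns with mostly unique values, such as identifiers, are left alone because a category would have to store every value anyway.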


In [32]:
conv_df.head()

         ID  District         Primary Type
0  11034701         4   DECEPTIVE PRACTICE
1  11227287        22  CRIM SEXUAL ASSAULT
2  11227583         8             BURGLARY
3  11227293         3                THEFT
4  11227634         1  CRIM SEXUAL ASSAULT


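Caveat! Categories are not always plain labels: `pd.cut`, for example, bins numeric values into ordered interval categories.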

In [33]:
ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
pd.cut(ages, bins=5)

[(9.95, 20.0], (9.95, 20.0], (9.95, 20.0], (9.95, 20.0], (20.0, 30.0], (20.0, 30.0], (20.0, 30.0], (50.0, 60.0], (50.0, 60.0]]
Categories (5, interval[float64]): [(9.95, 20.0] < (20.0, 30.0] < (30.0, 40.0] < (40.0, 50.0] < (50.0, 60.0]]

Internally, each row stores only a small integer code that refers to its category:

In [34]:
conv_df['Primary Type'].cat.codes

0           9
1           5
2           3
3          34
4           5
           ..
7126309    26
7126310    12
7126311    12
7126312     9
7126313     9
Length: 7126314, dtype: int8
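
To make the relation between codes and labels concrete, here is a minimal sketch on a made-up series. The unique labels live once in `cat.categories`, each row holds an index into that table, and a missing value gets the code -1.

```python
import pandas as pd

# Made-up values, including one missing entry.
s = pd.Series(["THEFT", "BURGLARY", "THEFT", None], dtype=object).astype("category")

print(s.cat.categories)  # lookup table of unique labels (sorted)
print(s.cat.codes)       # one small integer per row; -1 marks a missing value
```

With at most 127 categories the codes fit into `int8`, which is why the 'Primary Type' column above takes roughly one byte per row.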

In [35]:
df[['Date']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7126314 entries, 0 to 7126313
Data columns (total 1 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Date    object
dtypes: object(1)
memory usage: 54.4+ MB


In [36]:
df[['Date']].head()

                     Date
0  01/01/2001 11:00:00 AM
1  10/08/2017 03:00:00 AM
2  03/28/2017 02:00:00 PM
3  09/09/2017 08:17:00 PM
4  08/26/2017 10:00:00 AM


In [37]:
# Parse the date strings with an explicit format (12-hour clock with AM/PM)
conv_df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')

In [38]:
print(df[['ID', 'District', 'Primary Type', 'Date']].memory_usage() / btomb)
print(conv_df.memory_usage() / btomb)

Index            0.000122
ID              54.369461
District        54.369461
Primary Type    54.369461
Date            54.369461
dtype: float64
Index            0.000122
ID              27.184731
District         6.796183
Primary Type     6.797480
Date            54.369461
dtype: float64


In [39]:
conv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7126314 entries, 0 to 7126313
Data columns (total 4 columns):
 #   Column        Dtype         
---  ------        -----         
 0   ID            uint32        
 1   District      uint8         
 2   Primary Type  category      
 3   Date          datetime64[ns]
dtypes: category(1), datetime64[ns](1), uint32(1), uint8(1)
memory usage: 95.1 MB
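
The Date column reports the same 54.37 MB before and after parsing because the shallow measurement counts 8 bytes per value in both cases: a pointer for each string, and a 64-bit timestamp for each datetime. The actual saving only shows up with `deep=True`. A minimal sketch on three date strings taken from the `head()` output above:

```python
import pandas as pd

raw = pd.Series(
    ["01/01/2001 11:00:00 AM", "10/08/2017 03:00:00 AM", "03/28/2017 02:00:00 PM"],
    dtype=object,
)
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# deep=True includes the Python string objects behind the pointers.
raw_bytes = raw.memory_usage(index=False, deep=True)
parsed_bytes = parsed.memory_usage(index=False, deep=True)

print(raw_bytes, parsed_bytes)
```

Parsed timestamps also allow date arithmetic and filtering, which the raw strings do not.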


In [40]:
cs = conv_df[['ID', 'District', 'Primary Type']].memory_usage().sum()
ds = df[['ID', 'District', 'Primary Type']].memory_usage().sum()
print("%d %%"%((cs/ds)*100))


25 %


### What was not covered here

* [Sparse column representation](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html): useful when a column holds very few relevant values and is otherwise filled with a single value such as NaN or 0.
* [Loading data in chunks can shrink memory usage](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) (see the `chunksize` parameter of `read_csv`):
  * This works when the function applied to each chunk can run independently of the other chunks.
  * Dask provides an API that emulates pandas while handling chunking and parallelization transparently.
* [Lossy compression](https://pythonspeed.com/articles/pandas-reduce-memory-lossy/)
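
As a rough illustration of the chunking idea from the list above, the sketch below reads a tiny in-memory CSV (made-up data standing in for `crimes.csv`) in chunks of two rows, sets a compact dtype at load time, and combines the per-chunk results. It assumes the per-chunk computation, here a value count, can be merged afterwards by summing.

```python
import io
import pandas as pd

# Made-up CSV content; in practice this would be a path to a large file.
csv_text = "District,Arrest\n1,True\n2,False\n1,False\n3,True\n2,True\n"

reader = pd.read_csv(
    io.StringIO(csv_text),
    chunksize=2,                  # rows per chunk
    dtype={"District": "uint8"},  # apply the compact dtype while loading
)

# Each chunk is processed independently; only the small partial results are kept.
partials = [chunk["District"].value_counts() for chunk in reader]

# Merge the partial counts by summing entries that share the same district.
counts = pd.concat(partials).groupby(level=0).sum().sort_index()
print(counts)
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the file.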