# Saving and Serialising a dataframe

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Let's make a new dataframe and save it out using various formats

df = pd.DataFrame(np.random.random(size=(100000, 4)), columns=["A", "B", "C", "D"])

df.head()

          A         B         C         D
0  0.928150  0.481537  0.652907  0.333602
1  0.934688  0.010675  0.386517  0.518924
2  0.178277  0.364158  0.435528  0.517645
3  0.792251  0.744749  0.709547  0.874203
4  0.300547  0.582488  0.546538  0.131833


In [5]:
# Writing data to a file
# Round floats to 4 decimal places and skip writing the index

df.to_csv("save.csv", index=False, float_format="%0.4f")
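
One consequence of `float_format="%0.4f"` is that the CSV no longer holds the full-precision values. The sketch below is a rough illustration of that, not part of the original notebook: it uses a smaller stand-in frame (`small`) and a temporary directory, both of which are assumptions made for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for the DataFrame built above (fewer rows, fixed seed).
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.random(size=(1000, 4)), columns=["A", "B", "C", "D"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "save.csv")  # temporary path, for illustration only
    small.to_csv(path, index=False, float_format="%0.4f")
    restored = pd.read_csv(path)

# Shape and column names survive, but values only agree to about 4 decimals.
max_error = float((small - restored).abs().to_numpy().max())
print(restored.shape, list(restored.columns), max_error)
```

If the exact values matter, leaving out `float_format` or using a binary format such as pickle avoids the rounding.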

In [6]:
# saving to pkl format
df.to_pickle("save.pkl")
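
For comparison, a pickle round trip should give back an identical frame. This is a minimal sketch under the same assumptions as before (a small stand-in frame and a temporary file, neither of which comes from the original notebook).

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for the DataFrame built above (fewer rows, fixed seed).
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.random(size=(1000, 4)), columns=["A", "B", "C", "D"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "save.pkl")  # temporary path, for illustration only
    small.to_pickle(path)
    restored_pkl = pd.read_pickle(path)

# DataFrame.equals checks values, dtypes, index and columns together.
exact_match = small.equals(restored_pkl)
print(exact_match)
```

Pickle files can run arbitrary code when loaded, so only call `pd.read_pickle` on files you trust.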

In [7]:
# to_hdf needs PyTables: pip install tables
# HDF5 (Hierarchical Data Format) is a binary format for large datasets; it is not related to Hadoop

df.to_hdf("save.hdf", key="data", format="table")

In [None]:
# !pip install feather-format

In [None]:
# !conda install feather-format -c conda-forge

In [None]:
# df.to_feather("save.fth")

In [3]:
# If you want to get the timings you can see in the video, you'll need this extension:
# https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/execute_time/readme.html
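
If the extension is not available, timings can also be collected by hand. The sketch below is one possible way to do it with `time.perf_counter`; the helper `time_call`, the stand-in frame and the temporary paths are assumptions made for the example. IPython's built-in `%%time` cell magic is another option inside a notebook.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_call(func, *args, **kwargs):
    """Return the wall-clock seconds taken by a single call to func."""
    start = time.perf_counter()
    func(*args, **kwargs)
    return time.perf_counter() - start


# Stand-in numeric frame (smaller than the one in the notebook).
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random(size=(10_000, 4)), columns=["A", "B", "C", "D"])

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "save.csv")
    pkl_path = os.path.join(tmp, "save.pkl")
    timings["to_csv"] = time_call(
        frame.to_csv, csv_path, index=False, float_format="%0.4f"
    )
    timings["to_pickle"] = time_call(frame.to_pickle, pkl_path)
    timings["read_csv"] = time_call(pd.read_csv, csv_path)
    timings["read_pickle"] = time_call(pd.read_pickle, pkl_path)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

A single call is a noisy measurement, so treat the numbers as rough; repeating each call a few times and taking the best result gives a steadier picture.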

Now this is a very fast test, because it's only numeric data. If we add strings and categorical data, things can slow down a lot! Let's try this on the mixed Astronaut data from Kaggle: https://www.kaggle.com/nasa/astronaut-yearbook
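
Speed is not the only thing that changes with mixed data: the formats also differ in how well they keep column types. The sketch below illustrates this with a tiny hypothetical frame (the names and values are invented and are not from the astronaut file). It assumes only that CSV stores plain text, so a categorical column comes back as ordinary strings, while pickle keeps the dtype.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mixed-type data standing in for the astronaut CSV (values invented).
mixed = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "status": pd.Categorical(["active", "retired", "active"]),
        "flights": [2, 5, 1],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "mixed.csv")
    pkl_path = os.path.join(tmp, "mixed.pkl")
    mixed.to_csv(csv_path, index=False)
    mixed.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# Compare the dtype of the categorical column after each round trip.
print("original:", mixed["status"].dtype)
print("from CSV:", from_csv["status"].dtype)
print("from pickle:", from_pkl["status"].dtype)
```

To get the categorical column back from a CSV, it has to be converted again after loading, for example with `astype("category")`.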

In [13]:
# df = pd.read_csv("astronauts.csv")
# df.head()

In [6]:
df.to_csv("save.csv", index=False, float_format="%0.4f")

In [12]:
# pd.read_csv("save.csv")

In [8]:
df.to_pickle("save.pkl")

In [11]:
# pd.read_pickle("save.pkl")

In [4]:
df.to_hdf("save.hdf", key="data", format="table")

In [5]:
pd.read_hdf("save.hdf")

              A         B         C         D
0      0.928150  0.481537  0.652907  0.333602
1      0.934688  0.010675  0.386517  0.518924
2      0.178277  0.364158  0.435528  0.517645
3      0.792251  0.744749  0.709547  0.874203
4      0.300547  0.582488  0.546538  0.131833
...         ...       ...       ...       ...
99995  0.477538  0.274994  0.699612  0.296594
99996  0.762881  0.231498  0.131880  0.955424
99997  0.729568  0.194363  0.814328  0.199783
99998  0.869640  0.723131  0.723032  0.554974


In [8]:
# df.to_feather("save.fth")

In [9]:
# pd.read_feather("save.fth")

In [10]:
%ls

 Volume in drive C has no label.
 Volume Serial Number is 8896-429E

 Directory of C:\Users\Syed Shahid Ali\Desktop\AI Q2\AIC-Q2-Codes-and-Books\onclass practice\Nasir\onClass_code_real\Class#5

02/02/2021  03:26 PM    <DIR>          .
02/02/2021  03:26 PM    <DIR>          ..
02/01/2021  10:45 PM    <DIR>          .ipynb_checkpoints
02/01/2021  10:16 PM            14,840 1DataLoading.ipynb
02/01/2021  10:44 PM            28,772 2DataInspecting.ipynb
02/02/2021  03:26 PM            53,029 3DataSavingAndSerializing.ipynb
01/27/2021  01:51 AM            81,951 astronauts.csv
02/01/2021  07:31 PM            45,173 Class#5.ipynb
01/27/2021  01:51 AM            11,328 heart.csv
02/01/2021  07:23 PM    <DIR>          practiceResource
02/02/2021  12:37 AM            87,030 save.csv
02/02/2021  03:22 PM         4,130,767 save.hdf
02/02/2021  12:41 AM            90,714 save.pkl
               9 File(s)      4,543,604 bytes
               4 Dir(s)  43,710,140,416 bytes free
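
`%ls` output depends on the operating system. As a portable alternative, file sizes can be read with `os.path.getsize`. The sketch below writes a stand-in frame to a temporary directory and reports the size of each file; the frame, the paths and the choice of only CSV and pickle are assumptions made to keep the example free of optional dependencies.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in numeric frame (smaller than the one in the notebook).
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random(size=(10_000, 4)), columns=["A", "B", "C", "D"])

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    # Map each output filename to the call that writes it.
    writers = {
        "save.csv": lambda p: frame.to_csv(p, index=False, float_format="%0.4f"),
        "save.pkl": frame.to_pickle,
    }
    for filename, write in writers.items():
        path = os.path.join(tmp, filename)
        write(path)
        sizes[filename] = os.path.getsize(path)  # size in bytes

for filename, size in sizes.items():
    print(f"{filename}: {size:,} bytes")
```

Other writers, such as `frame.to_hdf`, could be added to `writers` in the same way when their dependencies are installed.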


# Recap

In terms of file size, HDF5 is the largest for this example (about 4.1 MB in the listing above). The CSV and pickle files are approximately equal (about 87 KB and 91 KB). For small data sizes, CSV is often the easiest choice because it is human readable. HDF5 is great for <i>loading</i> huge amounts of data quickly. Pickle is faster than CSV, but not human readable.

There are lots of options, so don't get hung up on any one of them. CSV and pickle are easy and work fine in most cases.