# saving data to disk

In 601 we have focused on getting data from plain text files (CSV, JSON, XML) and Excel. Other file formats can also hold tables.

The purpose of this notebook is to illustrate alternative storage formats. We can read data from any of these formats into Pandas.

We will not be using SQL or Pickle or HDF5 in 601. <BR>
These options are not relevant for your homework or projects.

In [1]:
# https://stackoverflow.com/questions/25980018/importerror-hdfstore-requires-pytables-no-module-named-tables
!pip install tables



In [2]:
import os
import sys
print(sys.version)
import h5py
import pandas
print('pandas',pandas.__version__)
import numpy
print('numpy',numpy.__version__)
import sqlite3
print('sqlite3',sqlite3.version)
import pickle
from faker import Faker
fake = Faker()
import time

3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
pandas 0.23.4
numpy 1.13.3
sqlite3 2.6.0


# create a couple dataframes to save

## numeric data

In [3]:
row_count=1000000
df_numeric = pandas.DataFrame(numpy.random.randint(0,1000,
                      size=(row_count, 4)), 
                      columns=list('ABCD'))

print(df_numeric.shape)

df_numeric.head()

(1000000, 4)


Unnamed: 0,A,B,C,D
0,125,913,2,296
1,769,2,494,38
2,201,256,104,347
3,803,615,682,710
4,670,704,193,175


### metadata 
Did you know you can add attributes to a dataframe?

<a href="https://stackoverflow.com/questions/14688306/adding-meta-information-metadata-to-pandas-dataframe">source</a>

In [4]:
df_numeric.author_name = "Ben"

In [5]:
df_numeric.author_name

'Ben'
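Newer pandas releases (1.0 and later, so not the 0.23.4 used in this notebook) also provide a `DataFrame.attrs` dictionary for this kind of metadata; the pandas documentation marks it as experimental. A minimal sketch, using a small hypothetical frame `df` in place of `df_numeric`:

```python
import pandas

# Hypothetical small frame standing in for df_numeric
df = pandas.DataFrame({'A': [1, 2], 'B': [3, 4]})

# attrs is a plain dict attached to the DataFrame (requires pandas >= 1.0)
df.attrs['author_name'] = 'Ben'

print(df.attrs)
```

Unlike an ad hoc attribute, the dictionary keeps metadata names from colliding with column names or DataFrame methods.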

## text data

In [6]:
start_time=time.time()
list_of_dicts=[]
num_rows=8500 # 850 is ~1MB CSV and takes 1.5 seconds; 8500 takes 15 seconds
for indx in range(num_rows):
    list_of_dicts.append({'name':fake.name(),
                    'date':fake.date(),
                    'domain name':fake.domain_name(),
                    'day of month':fake.day_of_month(),
                    'day of week':fake.day_of_week(),
                    'country':fake.country(),
                    'company':fake.company(),
                    'city':fake.city(),
                    'email':fake.ascii_email(),
                    'bank':fake.bank_country()})
    
df_text = pandas.DataFrame(list_of_dicts)
print('elapsed',round(time.time()-start_time,2),'seconds')

elapsed 23.31 seconds


In [7]:
df_text.head()

Unnamed: 0,bank,city,company,country,date,day of month,day of week,domain name,email,name
0,GB,Priceport,Howard PLC,United Arab Emirates,2006-10-29,20,Wednesday,jones.org,kari87@yahoo.com,Leonard Bailey
1,GB,Jenniferfurt,Hall-Davidson,Lesotho,2010-10-14,24,Tuesday,peters-cardenas.com,martinezdavid@yahoo.com,Michael Coleman
2,GB,West Andreastad,Wolf-Pratt,Guernsey,1998-02-26,9,Thursday,daniels.biz,jennifer37@gmail.com,Julia Watson
3,GB,New David,Murphy-Gay,Macedonia,1988-01-30,26,Wednesday,chapman.org,perezjustin@pratt.com,Patrick Jenkins
4,GB,Michellebury,Logan LLC,Puerto Rico,1982-11-04,13,Thursday,johnston.info,ellensmith@parks-kim.com,Taylor Padilla


# compare size on disk for single dataframe 

The point of this comparison is not that file size matters.

Instead, the objectives are to show 
1. how to read and write variables to various file formats
1. that the file size on disk can be read into Python
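The size lookup used throughout this notebook can be wrapped in a small helper. This is only a sketch: `size_in_mb` and `demo_path` are hypothetical names, and the throwaway file exists only so the example runs on its own.

```python
import os
import tempfile

def size_in_mb(path):
    """Return the size of the file at `path` in MB (1 MB = 1024*1024 bytes), rounded to 2 places."""
    return round(os.path.getsize(path) / (1024 * 1024), 2)

# Write a throwaway 2 MB file so the example is self-contained
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(demo_path, 'wb') as f:
    f.write(b'\0' * (2 * 1024 * 1024))

print(size_in_mb(demo_path), 'MB file on disk')
```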

## HDF5
HDF overview:

https://en.wikipedia.org/wiki/Hierarchical_Data_Format

Python package:

https://www.h5py.org/<BR>
http://docs.h5py.org/en/stable/

Pandas integration:

https://glowingpython.blogspot.com/2014/08/quick-hdf5-with-pandas.html<BR>
https://stackoverflow.com/questions/28170623/how-to-read-hdf5-files-in-python<BR>
https://medium.com/@jerilkuriakose/using-hdf5-with-python-6c5242d08773<BR>
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
    
Metadata:
    
https://support.hdfgroup.org/HDF5/doc/Advanced/HDF5_Metadata/index.html<BR>
http://docs.h5py.org/en/stable/high/attr.html

In [8]:
# https://stackoverflow.com/questions/41173254/how-should-i-use-h5py-lib-for-storing-time-series-data
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_hdf.html

with pandas.HDFStore('temp.h5', 'w') as h:
    # pass key by keyword; newer pandas deprecates positional arguments other than the path
    df_numeric.to_hdf(h, key='temp')

In [29]:
print(round(os.path.getsize("temp.h5")/(1024*1024),2),'MB file on disk')

48.76 MB file on disk


To read the content, use

In [10]:
df_out = pandas.read_hdf('temp.h5', 'temp')

The attribute we added does not get written by ".to_hdf"

https://stackoverflow.com/questions/29129095/save-additional-attributes-in-pandas-dataframe/29130146#29130146

https://dev.to/epassaro/gsoc-2019-june-ii-fjk

https://www.science-emergence.com/Articles/How-to-save-a-large-dataset-in-a-hdf5-file-using-python--Quick-Guide/

https://www.science-emergence.com/Articles/How-to-add-metadata-to-a-data-frame-with-pandas-in-python-/

In [11]:
with pandas.HDFStore('temp_2.h5', 'w') as stor:
    stor.put('mydata', df_numeric)
    stor.get_storer('mydata').attrs.metadata = df_numeric.author_name

In [12]:
df_out_2 = pandas.read_hdf('temp_2.h5', 'mydata')

In [30]:
df_out_2.head()

Unnamed: 0,A,B,C,D
0,125,913,2,296
1,769,2,494,38
2,201,256,104,347
3,803,615,682,710
4,670,704,193,175
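`read_hdf` returns only the table, not the attribute stored next to it. A minimal sketch of reading the metadata back through the storer, assuming PyTables is installed and using a small hypothetical frame `df` and a temporary file in place of `df_numeric` and `temp_2.h5`:

```python
import os
import tempfile

import pandas

# Hypothetical small frame standing in for df_numeric
df = pandas.DataFrame({'A': [125, 769], 'B': [913, 2]})
h5_path = os.path.join(tempfile.mkdtemp(), 'demo.h5')

# Write the table plus one metadata attribute, as in the cell above
with pandas.HDFStore(h5_path, 'w') as stor:
    stor.put('mydata', df)
    stor.get_storer('mydata').attrs.metadata = 'Ben'

# Reopen read-only and fetch both the table and the attribute
with pandas.HDFStore(h5_path, 'r') as stor:
    df_back = stor['mydata']
    author = stor.get_storer('mydata').attrs.metadata

print(author)
```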


## compare to CSV on disk

In [14]:
df_numeric.to_csv('temp.csv')

In [31]:
print(round(os.path.getsize("temp.csv")/(1024*1024),2),'MB file on disk')

21.41 MB file on disk
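Because `to_csv` writes the index as an unnamed first column by default, reading the file back needs `index_col=0` to avoid an extra column. A small sketch, using a hypothetical two-row frame and an in-memory buffer in place of `temp.csv`:

```python
import io

import pandas

# Hypothetical small frame standing in for df_numeric
df = pandas.DataFrame({'A': [125, 769], 'B': [913, 2]})

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as an unnamed first column
text = buffer.getvalue()

# Without index_col the old index comes back as a data column
with_extra_column = pandas.read_csv(io.StringIO(text))
# With index_col=0 the first column is used as the index again
restored = pandas.read_csv(io.StringIO(text), index_col=0)

print(list(with_extra_column.columns))
print(list(restored.columns))
```

Alternatively, `df.to_csv(path, index=False)` skips writing the index in the first place.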


## compare to SQLite

https://www.dataquest.io/blog/python-pandas-databases/<BR>
https://stackoverflow.com/questions/14431646/how-to-write-pandas-dataframe-to-sqlite-with-index<BR>
https://pythonspot.com/sqlite-database-with-pandas/<BR>
https://datacarpentry.org/python-ecology-lesson/09-working-with-sql/index.html<BR>
http://sdsawtelle.github.io/blog/output/large-data-files-pandas-sqlite.html

In [16]:
conn = sqlite3.connect("temp.db")
cur = conn.cursor() # https://docs.python.org/3/library/sqlite3.html#cursor-objects

In [17]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

df_numeric.to_sql(name="data", con=conn, if_exists="append", index=False)

In [18]:
print(os.path.getsize("temp.db")/(1024*1024),'MB file on disk')

54.08984375 MB file on disk
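To read the table back, `pandas.read_sql_query` accepts the same connection. The sketch below uses a hypothetical two-row frame and an in-memory database in place of `temp.db`. It also uses `if_exists='replace'`, since with `if_exists='append'` every rerun of the write cell adds the rows again and the file keeps growing.

```python
import sqlite3

import pandas

# Hypothetical small frame standing in for df_numeric
df = pandas.DataFrame({'A': [125, 769], 'B': [913, 2]})

conn = sqlite3.connect(':memory:')  # in-memory database, nothing written to disk

# 'replace' drops and recreates the table, so rerunning does not duplicate rows
df.to_sql(name='data', con=conn, if_exists='replace', index=False)

df_from_sql = pandas.read_sql_query('SELECT * FROM data', conn)
row_count = conn.execute('SELECT COUNT(*) FROM data').fetchone()[0]
conn.close()

print(row_count)
print(df_from_sql)
```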


## Python pickle serialization

In [19]:
df_numeric.to_pickle("temp.pkl")

In [20]:
print(os.path.getsize("temp.pkl")/(1024*1024),'MB file on disk')

30.518251419067383 MB file on disk
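The matching read call is `pandas.read_pickle`. A minimal round trip, using a hypothetical small frame and a temporary file in place of `temp.pkl`; note that pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.

```python
import os
import tempfile

import pandas

# Hypothetical small frame standing in for df_numeric
df = pandas.DataFrame({'A': [125, 769], 'B': [913, 2]})
pkl_path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')

df.to_pickle(pkl_path)
df_from_pickle = pandas.read_pickle(pkl_path)

# The index, column names, and dtypes all survive the round trip
print(df_from_pickle.equals(df))
```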



<BR>
<BR>
<BR>
    
# Save two dataframes to file
    
So far we've shown how to save one variable to one file. 

Sometimes we want to save more than one table to a single file.

## two variables to one HDF5 file

https://datascience.stackexchange.com/questions/33171/what-s-the-best-way-to-save-many-pandas-dataframes-together

In [21]:
# explicit mode 'a' (append): older h5py used it by default, h5py 3.x defaults to read-only
with h5py.File('temp.h5', 'a') as h5_fout:
    # h5py stores arrays, not DataFrames: .values drops the index and the column names
    h5_fout.create_dataset(
            name='numeric',
            data=df_numeric.values,
            compression='gzip', compression_opts=4)

    h5_fout.create_dataset(
            name='text',
            data=df_text.values,
            compression='gzip', compression_opts=4,
            dtype=h5py.special_dtype(vlen=str)) # http://docs.h5py.org/en/stable/special.html

    h5_fout.create_dataset('description', data='some dataframes')

## Python pickle serialization

In [22]:
with open('temp.pkl', "wb") as f:
    pickle.dump(df_text, f)
    pickle.dump(df_numeric, f)

In [32]:
print(round(os.path.getsize("temp.pkl")/(1024*1024),2),'MB file on disk')

32.0 MB file on disk


In [24]:
# https://stackoverflow.com/questions/20716812/saving-and-loading-multiple-objects-in-pickle-file
def loadall(filename):
    with open(filename, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

In [25]:
items = list(loadall('temp.pkl'))

In [26]:
len(items)

2

In [27]:
items[0].head()

Unnamed: 0,bank,city,company,country,date,day of month,day of week,domain name,email,name
0,GB,Priceport,Howard PLC,United Arab Emirates,2006-10-29,20,Wednesday,jones.org,kari87@yahoo.com,Leonard Bailey
1,GB,Jenniferfurt,Hall-Davidson,Lesotho,2010-10-14,24,Tuesday,peters-cardenas.com,martinezdavid@yahoo.com,Michael Coleman
2,GB,West Andreastad,Wolf-Pratt,Guernsey,1998-02-26,9,Thursday,daniels.biz,jennifer37@gmail.com,Julia Watson
3,GB,New David,Murphy-Gay,Macedonia,1988-01-30,26,Wednesday,chapman.org,perezjustin@pratt.com,Patrick Jenkins
4,GB,Michellebury,Logan LLC,Puerto Rico,1982-11-04,13,Thursday,johnston.info,ellensmith@parks-kim.com,Taylor Padilla


In [28]:
items[1].head()

Unnamed: 0,A,B,C,D
0,125,913,2,296
1,769,2,494,38
2,201,256,104,347
3,803,615,682,710
4,670,704,193,175
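An alternative to several sequential `pickle.dump` calls is to pickle one dictionary that holds all the tables, so each table can be looked up by name after loading. This is a sketch under the assumption that all tables fit in memory at once; the frames, the key names, and the temporary file are hypothetical stand-ins for `df_numeric`, `df_text`, and `temp.pkl`.

```python
import os
import pickle
import tempfile

import pandas

# Hypothetical small frames standing in for df_numeric and df_text
df_a = pandas.DataFrame({'A': [125, 769], 'B': [913, 2]})
df_b = pandas.DataFrame({'name': ['Leonard Bailey'], 'city': ['Priceport']})

tables = {'numeric': df_a, 'text': df_b}
pkl_path = os.path.join(tempfile.mkdtemp(), 'tables.pkl')

# One dump call writes the whole dictionary
with open(pkl_path, 'wb') as f:
    pickle.dump(tables, f)

# One load call returns it; tables are retrieved by key rather than by position
with open(pkl_path, 'rb') as f:
    loaded = pickle.load(f)

print(sorted(loaded))
```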
