xuancong84/pandas-serializer

pandas-serializer

A utility for text-based serialization and deserialization (i.e., save/load) of nested Python container objects (such as list, dict, and set) that contain Pandas DataFrame/Series objects, NumPy arrays/matrices, etc. Deserialized DataFrame and Series objects are restored to match the original object type where possible. However, because the conversion is based on to_csv() and read_csv(), values that pandas.DataFrame.to_csv() writes identically remain indistinguishable after loading, e.g., None vs. NaN.

Exported functions:

  • pandas_save(obj, fp_fn) : save object into the file/filename
  • pandas_saves(obj) : save object into a string
  • pandas_load(fp_fn) : load object from the file/filename
  • pandas_loads(str) : load object from the string

Internal functions:

  • Open(filename, mode) : open a plain or gzipped file transparently; - stands for STDIN/STDOUT
  • pandas_encode : recursively convert every non-convertible object into a specially-coded string
  • pandas_decode : recursively restore every non-convertible object from its specially-coded string
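
The behaviour described for Open can be illustrated with a minimal sketch. The name smart_open, the ".gz" suffix check, and the handling of text versus binary modes are assumptions made for illustration; they are not taken from the utility's source.

```python
import gzip
import sys


def smart_open(filename, mode="r"):
    """Hypothetical stand-in for Open(filename, mode); not the utility's code."""
    if filename == "-":
        # '-' selects a standard stream: STDIN for reading, STDOUT otherwise.
        return sys.stdin if "r" in mode else sys.stdout
    if filename.endswith(".gz"):
        # gzip.open defaults to binary, so request text mode unless the
        # caller explicitly asked for binary.
        return gzip.open(filename, mode if "b" in mode else mode + "t")
    return open(filename, mode)
```

With this sketch, smart_open("data.txt.gz", "w") would write gzip-compressed text, while smart_open("-", "w") would return sys.stdout.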

The working principle is to first convert every non-convertible object (such as DataFrame/Series/types/modules) into a specially-coded string, then use Python's repr() to convert the entire object in one go. In addition, the utility tries to convert and restore other Python objects whose repr() cannot be reversed, such as types, modules, and functions. However, successful deserialization of such objects is not guaranteed, so those data types should generally be avoided.
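
The encode-then-repr() principle can be sketched with the standard library alone. Everything in this sketch is an assumption made for illustration: the "@@py:" tag, the function names encode, decode, saves, and loads, and the use of ast.literal_eval on the loading side. It handles only types and modules; DataFrame/Series support is left out, and the utility's actual coding scheme may differ.

```python
import ast
import importlib
import math
import types

TAG = "@@py:"  # hypothetical marker for specially-coded strings


def encode(obj):
    """Recursively replace objects that repr() cannot round-trip."""
    if isinstance(obj, types.ModuleType):
        return f"{TAG}module:{obj.__name__}"
    if isinstance(obj, type):
        return f"{TAG}type:{obj.__module__}.{obj.__qualname__}"
    if isinstance(obj, dict):
        return {encode(k): encode(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return type(obj)(encode(x) for x in obj)
    return obj


def decode(obj):
    """Recursively restore objects from their specially-coded strings."""
    if isinstance(obj, str) and obj.startswith(TAG):
        kind, _, name = obj[len(TAG):].partition(":")
        if kind == "module":
            return importlib.import_module(name)
        module_name, _, attr = name.rpartition(".")
        return getattr(importlib.import_module(module_name), attr)
    if isinstance(obj, dict):
        return {decode(k): decode(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return type(obj)(decode(x) for x in obj)
    return obj


def saves(obj):
    # After encoding, only plain literals remain, so repr() is reversible.
    return repr(encode(obj))


def loads(text):
    return decode(ast.literal_eval(text))


data = [1, 3.4, 1 + 2j, None, True, b"ab12", "abc", int,
        {"a": 1, float: 2, "mod": math}, {2, "aa", str}]
text = saves(data)
restored = loads(text)
```

In this sketch, text is an ordinary string that can be inspected or edited by hand, and restored compares equal to data. Values such as NaN, whose repr() is not a Python literal, would need their own coded form and are omitted here.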

As the built-in example shows, this utility can serialize and deserialize the following deeply nested data structure:

[1, 3.4, 1.1+2.1j, np.nan, None, True, False, b'ab12', 'abc', int, float,
 pd.Series(), pd.DataFrame(), pd.DataFrame, type(pd.DataFrame), ['a', 1],
 lambda a,b,*args,x=0.5,y=-0.1: a**2+b+sum([*args])+x*y+1,
 {
  'a':1,
  'b':2,
  type:0,
  int:1,
  print:max,
  pd:np,
  0:pd.DataFrame(np.random.randint(0,256,[4,4]),
                 columns=['index a1', 'index a2', 'b', 'c'],
                 index=pd.date_range('2020-01-01', '2020-01-04')).set_index(['index a1', 'index a2'], append=True),
  1:pd.Series([1, 2.5, 3+1j, np.nan, 'abc', True, None, float], index=pd.date_range('2020-01-01', '2020-01-08', tz='Asia/Singapore')),
  2:np.array([[1, 2.5, 'a'], [1+.5j, np.nan, 'b']]),
  3:np.matrix([[1, 2.5], [1+.5j, np.nan]])
 },
 {1, 3.4, 1+2j, np.nan, True, False, None, int, 'aa', os, sys, pd.concat}]

Lambdas are serialized at the code level, but named functions are serialized by name only. To deserialize a function, it must therefore already be imported or defined; likewise, modules must already be imported. Otherwise, the resulting object stays in its specially-coded string form, which is nevertheless readable.
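
Name-level serialization of functions can be illustrated as follows. The "@@function:" prefix and the helpers encode_function and decode_function are hypothetical; the sketch only shows why the name must be resolvable at load time and what happens when it is not.

```python
PREFIX = "@@function:"  # hypothetical coded form, not the utility's format


def encode_function(func):
    # Only the name is stored; the function body is not serialized.
    return PREFIX + func.__name__


def decode_function(text, namespace):
    """Look the name up in `namespace`; keep the coded string if it is missing."""
    if not text.startswith(PREFIX):
        return text
    return namespace.get(text[len(PREFIX):], text)


def area(width, height):
    return width * height


coded = encode_function(area)
resolved = decode_function(coded, {"area": area})  # name is defined: restored
unresolved = decode_function(coded, {})            # name is unknown: stays a string
```

Here resolved is the original area function, while unresolved remains the readable coded string because no function of that name is available.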

This utility is designed to be semi-human-readable, so it is slower than PyArrow. For serializing huge tables and big data, we recommend PyArrow's binary serialization instead.
