pandas-serializer

A utility for text-based serialization and deserialization (i.e., save/load) of Python nested container objects (such as list, dict and set), containing Pandas DataFrame/Series and NumPy arrays/matrices, etc. De-serialized DataFrame and Series object will try to conform to the original object type. However, since the conversion is based on to_csv() and read_csv(), data types indistinguishable by pandas.DataFrame.to_csv() will remain indistinguishable, e.g., None vs NaN, etc.

Exported functions:

pandas_save(obj, fp_fn) : save object into the file/filename
pandas_saves(obj) : save object into a string
pandas_load(fp_fn) : load object from the file/filename
pandas_loads(str) : load object from the string

Internal functions:

Open(filename, mode) : smartly open normal or gzipped file, - for STDIN/STDOUT
pandas_encode : recursively convert every non-convertible objects into specially-coded string
pandas_decode : recursively restore every non-convertible objects from specially-coded string

The working principle is to first convert every non-convertible object (such as DataFrame/Series/types/modules) into a specially-coded string, then use Python repr() to convert the entire object in one go. In addition, this powerful utility will also try to convert/restore other non-reverse-convertible Python objects such as types, modules, functions, etc. However, non-reverse-convertible Python object does not guarantee successful deserialization, so those data types should be avoided in general.

As shown in the built-in example, this utility can successfully serialize and deserialize the following super-complex nested data structure:

[1, 3.4, 1.1+2.1j, np.nan, None, True, False, b'ab12', 'abc', int, float,
 pd.Series(), pd.DataFrame(), pd.DataFrame, type(pd.DataFrame), ['a', 1],
 lambda a,b,*args,x=0.5,y=-0.1: a**2+b+sum([*args])+x*y+1,
 {
  'a':1,
  'b':2,
  type:0,
  int:1,
  print:max,
  pd:np,
  0:pd.DataFrame(np.random.randint(0,256,[4,4]),
                 columns=['index a1', 'index a2', 'b', 'c'],
                 index=pd.date_range('2020-01-01', '2020-01-04')).set_index(['index a1', 'index a2'], append=True),
  1:pd.Series([1, 2.5, 3+1j, np.nan, 'abc', True, None, float], index=pd.date_range('2020-01-01', '2020-01-08', tz='Asia/Singapore')),
  2:np.array([[1, 2.5, 'a'], [1+.5j, np.nan, 'b']]),
  3:np.matrix([[1, 2.5], [1+.5j, np.nan]])
 },
 {1, 3.4, 1+2j, np.nan, True, False, None, int, 'aa', os, sys, pd.concat}]

It should be noted that serialization of lambda is supported at code level, but functions are only serialized at surface name level. Therefore, in order to successfully deserialize functions, they have to be imported or defined first; and similarly for modules, they have to be imported in the first place. Otherwise, the resulting objects will be in the specially-coded string form which is nevertheless readable.

This utility is designed for semi-human-readablility, thus, it is less efficient than PyArrow on speed. For serialization of huge tables and big data, we recommend using PyArrow for binary serialization.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
README.md		README.md
pandas_serializer.py		pandas_serializer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

pandas_serializer.py

pandas_serializer.py

Repository files navigation

pandas-serializer

About

Releases

Packages

Languages

xuancong84/pandas-serializer

Folders and files

Latest commit

History

Repository files navigation

pandas-serializer

About

Resources

Stars

Watchers

Forks

Languages