## Pickle vs. JSON

### [JSON](https://docs.python.org/2/library/json.html "18.2. json — JSON encoder and decoder")
- portable
- dict keys must be basic types (str, unicode, int, long, float, bool, None) and are cast to strings
- complex types are cast to basic types (e.g. defaultdict -> dict)
- load/dump speed depends on the Python version and the data (in the run below, pickle was faster for both)
- dump file is smaller
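
The key and type casting described above can be seen in a small round trip. This is a minimal sketch; the `data` dictionary is a made-up example, not part of the experiment below.

```python
import json
from collections import defaultdict

# Hypothetical sample data: a defaultdict with non-string keys
data = defaultdict(list)
data[1].append("a")
data[None].append("b")

# Round trip through a JSON string
restored = json.loads(json.dumps(data))

print(type(restored))  # a plain dict, the defaultdict type is lost
print(restored)        # the keys 1 and None come back as the strings "1" and "null"

# Keys outside the basic types (for example tuples) are rejected
try:
    json.dumps({(1, 2): "pair"})
    tuple_key_ok = True
except TypeError:
    tuple_key_ok = False
print(tuple_key_ok)
```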

### [Pickle](https://docs.python.org/2/library/pickle.html "11.1. pickle — Python object serialization")
- almost every type is picklable
- lambdas are not picklable (e.g. `defaultdict(lambda: 1)`)
- slow with the pure-Python `pickle` of Python 2 (in the Python 3 run below, it was faster than JSON)
- keeps the original types

# JSON and Pickle: speed and data type experiment

In [1]:
!head -10000 VOA.UTF8.txt > small_VOA.txt

In [2]:
for line in open('small_VOA.txt'):
    print(line)

A German court has convicted four Algerian men of plotting to bomb a busy Christmas market in the French city of Strasbourg , and sentenced them to prison terms of 10-12 years .

The plot was foiled by German police just before it was to have taken place in late December 2000 .

In handing down the Frankfurt court 's court sentence , presiding Judge Karlheinz Zeiher said the four men had intended to kill defenseless people with the aim of spreading terror in France and throughout Europe .

He said they wanted to punish France because of its support for the Algerian government .

Judge Zeiher said the four planned to place a bomb in the middle of the square in Strasbourg , where the city 's city cathedral is located and a Christmas market is held every year .

The bomb , he said , consisted of one or more pressure cookers packed with explosives , a technique he said three of the men learned while training at an al-Qaida camp in Afghanistan .

But the prosecution was unable to establish 

In [3]:
from collections import defaultdict, Counter
import operator
import codecs

def to_ngrams(unigrams, length):
    # JSON object keys must be strings, so join each n-gram's tokens with spaces
    return (' '.join(ngram)
            for ngram in zip(*[unigrams[i:] for i in range(length)]))

ngram_counts = defaultdict(Counter)
# with open('small_VOA.txt', 'rb') as text_file:
with codecs.open('small_VOA.txt', "r", encoding='utf-8') as text_file:
    for line in text_file: 
        words = line.strip().lower().split()
        for n in range(2, 7):
            ngram_counts[n].update(to_ngrams(words, n))
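
To make the sliding-window trick in `to_ngrams` concrete, here is a small sketch that runs it on a short token list. The sentence is an arbitrary example; the function body mirrors the one in the cell above.

```python
def to_ngrams(unigrams, length):
    # zip the list against copies of itself shifted by 0..length-1 positions
    return (' '.join(ngram)
            for ngram in zip(*[unigrams[i:] for i in range(length)]))

words = 'the plot was foiled'.split()

bigrams = list(to_ngrams(words, 2))
trigrams = list(to_ngrams(words, 3))
too_long = list(to_ngrams(words, 5))  # longer than the sentence

print(bigrams)   # ['the plot', 'plot was', 'was foiled']
print(trigrams)  # ['the plot was', 'plot was foiled']
print(too_long)  # []
```

Because `zip` stops at the shortest input, lines with fewer than `length` tokens simply contribute no n-grams.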

-----
The original data is a defaultdict

In [4]:
type(ngram_counts) 

collections.defaultdict

## Dump to pickle/JSON files

In [5]:
import pickle
with open('VOA_2to6gram.pkl', 'wb') as pkl_file:
    %time pickle.dump(ngram_counts, pkl_file)
    
import json
with open('VOA_2to6gram.json', 'w') as json_file:
    %time json.dump(ngram_counts, json_file)

CPU times: user 394 ms, sys: 92.7 ms, total: 486 ms
Wall time: 494 ms
CPU times: user 1.67 s, sys: 35.9 ms, total: 1.71 s
Wall time: 1.72 s


**Dump time: in this run pickle (0.49 s wall time) was about 3.5 times faster than json (1.72 s)**

In [6]:
!du -h VOA_2to6gram.*

 23M	VOA_2to6gram.json
 26M	VOA_2to6gram.pkl


The pickle file is slightly larger than the JSON file

## Load the pickle and JSON files back into objects

In [7]:
with open('VOA_2to6gram.pkl', 'rb') as pkl_file:
    %time ngram_counts_frm_pkl = pickle.load(pkl_file)
with open('VOA_2to6gram.json', 'r') as json_file:
    %time ngram_counts_frm_jsn = json.load(json_file)

CPU times: user 294 ms, sys: 76 ms, total: 370 ms
Wall time: 374 ms
CPU times: user 492 ms, sys: 88.2 ms, total: 580 ms
Wall time: 581 ms


**Load time: in this run pickle (374 ms wall time) was about 1.5 times faster than json (581 ms)**
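
A single `%time` measurement is noisy, so the comparison is worth repeating. The following is a rough sketch that uses `timeit.repeat` on in-memory dumps; `sample` is a small synthetic stand-in for `ngram_counts` and `best_time` is a hypothetical helper, so the absolute numbers will not match the ones above.

```python
import json
import pickle
import timeit
from collections import Counter, defaultdict

# Synthetic stand-in for ngram_counts (assumption: same shape, much smaller)
sample = defaultdict(Counter)
for n in range(2, 7):
    sample[n].update(' '.join(['w%d' % i] * n) for i in range(2000))

def best_time(func, repeat=5):
    # Run func once per repetition and keep the fastest run
    return min(timeit.repeat(func, number=1, repeat=repeat))

pkl_bytes = pickle.dumps(sample)
json_text = json.dumps(sample)

timings = {
    'pickle dump': best_time(lambda: pickle.dumps(sample)),
    'json dump': best_time(lambda: json.dumps(sample)),
    'pickle load': best_time(lambda: pickle.loads(pkl_bytes)),
    'json load': best_time(lambda: json.loads(json_text)),
}

for name, seconds in timings.items():
    print('%s: %.2f ms' % (name, seconds * 1000))
```

Results will vary with the Python version, the pickle protocol, and the shape of the data, so treat any ratio as specific to the machine it was measured on.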

In [8]:
type(ngram_counts_frm_pkl) # pickle

collections.defaultdict

In [9]:
type(ngram_counts_frm_pkl[2]) # pickle

collections.Counter

In [10]:
type(ngram_counts_frm_jsn) # json

dict

In [11]:
type(ngram_counts_frm_jsn['2']) #json

dict

**After the JSON round trip, everything is a plain dict**

In [12]:
ngram_counts_frm_pkl.keys()

dict_keys([2, 3, 4, 5, 6])

In [13]:
ngram_counts_frm_jsn.keys()

dict_keys(['2', '3', '4', '5', '6'])

**The JSON dict keys became strings**
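
If the original types are needed after loading from JSON, they have to be rebuilt by hand. A minimal sketch, assuming the structure used in this notebook (integer n mapped to a Counter of n-gram strings); `restore_ngram_counts` is a hypothetical helper name.

```python
import json
from collections import Counter, defaultdict

def restore_ngram_counts(loaded):
    # Convert string keys back to int and plain dicts back to Counter
    restored = defaultdict(Counter)
    for n, counts in loaded.items():
        restored[int(n)] = Counter(counts)
    return restored

# Small made-up example instead of the full VOA data
sample = defaultdict(Counter)
sample[2].update(['a b', 'a b', 'b c'])

loaded = json.loads(json.dumps(sample))
restored = restore_ngram_counts(loaded)

print(type(restored), type(restored[2]))
print(list(restored.keys()))
```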

# Working around unpicklable lambdas

In [14]:
from collections import defaultdict
try:
    pickle.loads(pickle.dumps(defaultdict(lambda: 1)))  # a lambda cannot be pickled
except Exception:
    print("can't pickle lambda")

can't pickle lambda
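
The same check can be wrapped in a small helper to test objects before dumping them. This is only a sketch: `is_picklable` is a hypothetical name, and the exception types caught are an assumption, since the exact exception raised for a lambda differs between Python versions and contexts.

```python
import pickle
from collections import defaultdict

def is_picklable(obj):
    # Try to serialize obj and report whether pickle accepted it
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True

lambda_ok = is_picklable(defaultdict(lambda: 1))
int_ok = is_picklable(defaultdict(int))

print(lambda_ok)  # False
print(int_ok)     # True
```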


## Use a type or a named function

In [15]:
import pickle
from collections import defaultdict, Counter

pickle.loads(pickle.dumps(defaultdict(int)))     # default value 0

def one():
    return 1
pickle.loads(pickle.dumps(defaultdict(one)))     # default value 1 (using a named function)

defaultdict(<function __main__.one>, {})

## Use partial

In [16]:
# partial handles this more elegantly

from functools import partial
pickle.loads(pickle.dumps(defaultdict(partial(int, 1))))   # default value 1

pickle.loads(pickle.dumps(defaultdict(partial(float, 1.0))))   # default value 1.0

pickle.loads(pickle.dumps(defaultdict(partial(defaultdict, list))))   # default value defaultdict(list)

# default value defaultdict(defaultdict(Counter))
pickle.loads(pickle.dumps(defaultdict(partial(
                                            defaultdict, partial(
                                                                defaultdict, Counter)))))

defaultdict(functools.partial(<class 'collections.defaultdict'>, functools.partial(<class 'collections.defaultdict'>, <class 'collections.Counter'>)),
            {})
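
To check that the default factories still work after unpickling, the restored objects can be exercised directly. A minimal sketch; the keys `'en'`, `'news'`, `'fr'`, `'blog'` and the n-gram strings are made up for illustration.

```python
import pickle
from collections import Counter, defaultdict
from functools import partial

# Three levels of nesting, with a Counter at the innermost level
nested = defaultdict(partial(defaultdict, partial(defaultdict, Counter)))
nested['en']['news']['bigram'].update(['a b', 'a b', 'b c'])

restored = pickle.loads(pickle.dumps(nested))

# Existing data survives the round trip
print(restored['en']['news']['bigram']['a b'])

# Missing keys are still created on access at every level
restored['fr']['blog']['bigram']['x y'] += 1
print(type(restored['fr']['blog']['bigram']))

# A partial with a bound argument supplies a non-zero default
ones = pickle.loads(pickle.dumps(defaultdict(partial(int, 1))))
print(ones['missing'])
```

Note that `partial` objects, like named functions, are pickled by reference to what they wrap, so the wrapped callables must be importable wherever the file is loaded.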