Python for Data Analysis
----

# Book
CN: 利用Python进行数据分析 78.4MB.pdf
EN: Python for Data Analysis 2nd Edition.pdf

![cover](images/cover1.png)


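# Overview
The original English edition was published by O'Reilly in 2013; the Chinese edition was published by China Machine Press (机械工业出版社).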

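The book has 12 chapters and an appendix:
- Preliminaries
- Introductory Examples
- IPython
- NumPy Basics
- Getting Started with pandas
- Data Loading, Storage, and File Formats
- Data Wrangling
- Plotting and Visualization
- Data Aggregation and Group Operations
- Time Series
- Financial and Economic Data
- Advanced NumPy
- Appendix
  - Python Language Essentials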


# Source Code

 `git clone https://github.com/pydata/pydata-book`

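# Reading Notes
## Ch01 Preliminaries
As the title suggests, this chapter covers the groundwork for analyzing data with Python.
- What: what kind of data is processed? Mainly structured, tabular data
- How: which tools are used? Python, with NumPy/Matplotlib/IPython/pandas/SciPy
- Why: Python is simple and easy to use, and has a rich set of public libraries
- Setup: prepare an environment for writing and debugging code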
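
The setup step can be sanity-checked with a few lines of code. The sketch below is an assumption about how one might do that, not something taken from the book: it only reports which of the libraries named above are importable, using the standard library's `importlib.util.find_spec`. The helper name `check_packages` is hypothetical.

```python
from importlib.util import find_spec

# Import names of the libraries listed above.
PACKAGES = ["numpy", "matplotlib", "IPython", "pandas", "scipy"]

def check_packages(names):
    """Return {name: bool} telling whether each package can be located."""
    return {name: find_spec(name) is not None for name in names}

status = check_packages(PACKAGES)
for name, ok in status.items():
    print(f"{name:<12} {'installed' if ok else 'missing'}")
```

Any package reported as missing would need to be installed before running the Ch02 cells.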

## Ch02 Introductory Examples

In [None]:
# Load JSON: the file holds one JSON record per line
import json
path = '/opt/Work/ML/pydata-book/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
with open(path) as f:  # close the file once all lines are parsed
    records = [json.loads(line) for line in f]
records[0]

In [None]:
type(records[0])

# Not every record has a 'tz' key, so the unfiltered version raises KeyError:
# time_zones = [rec['tz'] for rec in records]
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:5]

In [None]:
def get_counts(sequence):
    counts = {}  # plain dict: value -> number of occurrences
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
            
get_counts(time_zones)

In [None]:
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

get_counts2(time_zones)

In [None]:
counts = get_counts2(time_zones)

In [None]:
type(counts) #collections.defaultdict
type(counts.items()) # dict_items

In [None]:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
counts = get_counts2(time_zones)
top_counts(counts)
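
The counting and top-N steps above can also be done with the standard library's `collections.Counter`, whose `most_common` method returns the most frequent items first. This is a minimal sketch; `sample_time_zones` is a hypothetical stand-in for the `time_zones` list built earlier.

```python
from collections import Counter

# Hypothetical sample standing in for the time_zones list.
sample_time_zones = [
    "America/New_York", "", "America/Chicago",
    "America/New_York", "", "America/New_York",
]

counts = Counter(sample_time_zones)  # value -> number of occurrences
top_two = counts.most_common(2)      # (value, count) pairs, highest count first
print(top_two)
```

Unlike `top_counts`, which returns `(count, tz)` pairs in ascending order, `most_common` returns `(tz, count)` pairs in descending order.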

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
frame = DataFrame(records)

In [None]:
tz_counts = frame['tz'].value_counts()

In [None]:
clean_tz = frame['tz'].fillna('Missing')

In [None]:
type(clean_tz)

In [None]:
# NB: boolean-mask assignment replaces every empty string in place
clean_tz[clean_tz == ''] = 'Unknown'

In [None]:
tz_counts = clean_tz.value_counts()
tz_counts[:10]
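
The cleaning steps above treat two kinds of gaps differently: missing values (NaN) and empty strings. A minimal sketch on a toy Series makes that visible; `toy_tz` is hypothetical data, not taken from the bitly file.

```python
import pandas as pd

# Hypothetical toy data: two empty strings and one missing value.
toy_tz = pd.Series(["America/New_York", "", None, "America/New_York", ""])

clean = toy_tz.fillna("Missing")  # replace missing values, returns a new Series
clean[clean == ""] = "Unknown"    # boolean mask selects the empty strings
counts = clean.value_counts()     # frequency of each distinct value
print(counts)
```

`fillna` does not touch empty strings, which is why the mask assignment is needed as a separate step.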

In [None]:
frame['a'].head()
frame.a.head()

In [None]:
results = Series([x.split()[0] for x in frame.a.dropna()])  # first token of each user-agent string
print(results.head(5))
print(results.value_counts()[:8])

In [None]:
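# Keep only rows that have a user-agent string, then label each row
# 'Windows' or 'Not Windows' depending on whether the string mentions Windows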
cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe['a'].str.contains('Windows'),
                            'Windows', 'Not Windows')

In [None]:
print(operating_system[:5])

In [None]:
by_tz_os = cframe.groupby(['tz', operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]
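
The group-and-reshape step above can be followed on a small table. The sketch below uses hypothetical toy data (`toy`, with made-up time zones and user-agent strings) to show how `size()` counts rows per group, how `unstack()` turns the inner group key into columns, and how `fillna(0)` fills combinations that never occur.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a time zone and a user-agent string per row.
toy = pd.DataFrame({
    "tz": ["A", "A", "B", "B", "C"],
    "a": ["Windows NT 6.1", "Macintosh", "Windows NT 5.1",
          "Windows NT 6.1", "Linux"],
})

# Label each row, then group by the 'tz' column and the label array together.
os_label = np.where(toy["a"].str.contains("Windows"), "Windows", "Not Windows")
table = toy.groupby(["tz", os_label]).size().unstack().fillna(0)
print(table)
```

Time zone `B` has no non-Windows rows and `C` has no Windows rows, so those cells would be NaN after `unstack()` and become 0 after `fillna(0)`.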

In [None]:
# Sum each row (total per time zone), then get the positions that would sort those totals ascending
indexer = agg_counts.sum(axis=1).argsort()
indexer[:10]
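
An argsort does not sort the values themselves; it returns the positions that would put them in ascending order. The pure-Python sketch below is a stand-in for the idea, not the pandas implementation; the function `argsort` and the sample `labels` and `totals` are hypothetical.

```python
def argsort(values):
    """Return the positions that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical row totals, one per time zone label.
labels = ["tz_a", "tz_b", "tz_c", "tz_d"]
totals = [5, 1, 3, 2]

indexer = argsort(totals)
# Reordering by the indexer lists the labels from smallest to largest total.
ordered = [labels[i] for i in indexer]
top_two = ordered[-2:]  # the two labels with the largest totals
print(indexer, ordered, top_two)
```

Taking the last entries of the reordered list is one way to pick out the rows with the largest totals.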