## The Data Science Toolbox


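Python has many packages for scientific computing; the following combination is more or less the standard data science toolbox:

- **Numpy**: provides vector, array, and matrix types with support for vectorized operations.

- **Pandas**: provides the DataFrame data structure, similar to R's data.frame. It includes common operations such as grouped statistics and missing-value handling, as well as DataFrame I/O and conversions between data types.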
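As a rough illustration of the missing-value handling and type conversion mentioned above, here is a minimal sketch. The toy DataFrame and its column names (`city`, `sales`) are made up for this example and are not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df_demo = pd.DataFrame({"city": ["A", "B", "C"],
                        "sales": [10.0, np.nan, 30.0]})

# Count the missing values in one column.
n_missing = int(df_demo["sales"].isna().sum())

# Impute missing values with the column mean (NaN is skipped when averaging).
filled = df_demo["sales"].fillna(df_demo["sales"].mean())

# Convert the now-complete float column to integers.
as_int = filled.astype("int64")

print(n_missing, as_int.tolist())
```

Mean imputation is only one possible choice here; `dropna()` would discard the incomplete rows instead.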





## Numpy

In [1]:
import numpy as np
array_example = np.array([1, 2, 3])
print('Vectorized operation: %s + 1 = %s' % (array_example, array_example + 1))

Vectorized operation: [1 2 3] + 1 = [2 3 4]
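To make the idea of a vectorized operation concrete, the sketch below compares an explicit Python loop with the equivalent array expression. The variable names are chosen for this example only.

```python
import numpy as np

values = np.array([1, 2, 3])

# Explicit loop: add 1 to every element one at a time.
looped = np.array([v + 1 for v in values])

# Vectorized form: the scalar is applied to the whole array at once.
vectorized = values + 1

# Both approaches give the same result.
print(np.array_equal(looped, vectorized))

# Mixing an integer array with a float scalar yields a float array.
scaled = values * 2.5
print(scaled.dtype)
```

The vectorized form is usually both shorter and faster, because the loop runs inside NumPy's compiled code rather than in the Python interpreter.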


In [17]:
array_example = np.array([[1, 2, 3],
                          [4, 5, 6],
                          [7, 8, 9]
                         ])
# Element-wise: arrays of the same shape are added/multiplied element by element, as in R
print('Addition: \n', array_example + array_example)

print('Multiplication: \n', array_example * array_example)

Addition: 
 [[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]
Multiplication: 
 [[ 1  4  9]
 [16 25 36]
 [49 64 81]]


Note: **after conversion to a matrix, multiplication follows the rules of matrix multiplication**

$$C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}$$

In [18]:
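# A matrix is always two-dimensional, whereas numpy arrays (ndarrays) can have any number of dimensions (1D, 2D, 3D, ..., ND).
# np.asmatrix is used here because the np.mat alias was removed in NumPy 2.0.
mat_example = np.asmatrix([[1, 2, 3],
                           [4, 5, 6],
                           [7, 8, 9]])
mat_example * mat_example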

matrix([[ 30,  36,  42],
        [ 66,  81,  96],
        [102, 126, 150]])
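
The matrix product can also be computed on a plain ndarray with the `@` operator, without converting to a matrix type. The sketch below checks `@` against a direct implementation of the summation formula for C_ij; the names `A`, `product`, and `manual` are introduced for this example only.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Matrix product on ordinary ndarrays.
product = A @ A

# Direct implementation of C_ij = sum_k A_ik * B_kj with B = A.
n_rows, m = A.shape
manual = np.zeros_like(A)
for i in range(n_rows):
    for j in range(A.shape[1]):
        manual[i, j] = sum(A[i, k] * A[k, j] for k in range(m))

print(np.array_equal(product, manual))
```

Note that `A * A` on an ndarray remains element-wise; only `@` (or `np.matmul` / `np.dot`) performs the matrix product.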

In [19]:
mat_example

matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

Row-wise and column-wise operations:

In [20]:
np.sum(mat_example, axis=0).tolist()

[[12, 15, 18]]

In [21]:
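# Functions such as max and min also accept an axis argument to operate along rows or columns
print('Sum of all elements: ', np.sum(mat_example))
print('Sum of each column: ', np.sum(mat_example, axis=0))
print('Sum of each row: ', np.sum(mat_example, axis=1).tolist())

Sum of all elements:  45
Sum of each column:  [[12 15 18]]
Sum of each row:  [[6], [15], [24]]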
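
The `axis` argument works the same way for `max` and `min`. The following sketch uses a plain ndarray rather than a matrix, so reductions return 1D arrays unless `keepdims=True` is passed; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# axis=0 collapses the rows, giving one value per column.
col_max = arr.max(axis=0)

# axis=1 collapses the columns, giving one value per row.
row_min = arr.min(axis=1)

# keepdims=True keeps the reduced axis with length 1, so the result stays 2D.
row_sum_2d = arr.sum(axis=1, keepdims=True)

print(col_max, row_min, row_sum_2d.shape)
```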


In [22]:
# Equivalent method-call form of the sum above
a = mat_example.sum()
print(a)

45


## Pandas 

### Convenient data loading

In [2]:
import pandas as pd
# Forward slashes avoid accidental backslash escape sequences and work on all platforms
df = pd.read_csv('Data/pandas/card.csv', encoding='gbk')
df.head(3)

| | expensure | transfer |
|---|---|---|
| 0 | 2976 | 1863 |
| 1 | 2963 | 2048 |
| 2 | 2950 | 1940 |
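
Since `card.csv` may not be at hand, the sketch below reads the same kind of data from an in-memory string instead of a file. It assumes the file has a header row with the two columns shown above; the three data rows are copied from the `df.head(3)` output, and `csv_text` and `df_demo` are names introduced for this example.

```python
import io

import pandas as pd

# A small in-memory stand-in for the CSV file (assumed layout: header + rows).
csv_text = (
    "expensure,transfer\n"
    "2976,1863\n"
    "2963,2048\n"
    "2950,1940\n"
)

# read_csv accepts any file-like object, not just a path.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)
print(df_demo.dtypes)
```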


In [24]:
df.tail(10)

| | expensure | transfer |
|---|---|---|
| 290 | 3222 | 2218 |
| 291 | 3156 | 2245 |
| 292 | 3130 | 2124 |
| 293 | 3247 | 2100 |
| 294 | 3162 | 2161 |
| 295 | 3293 | 2015 |
| 296 | 3206 | 2207 |
| 297 | 3191 | 2080 |
| 298 | 3161 | 2263 |
| 299 | 3152 | 2048 |


In [25]:
df.describe()

| | expensure | transfer |
|---|---|---|
| count | 300.0 | 300.0 |
| mean | 3001.156667 | 1992.273333 |
| std | 189.562863 | 182.281835 |
| min | 2600.0 | 1544.0 |
| 25% | 2859.5 | 1852.0 |
| 50% | 2984.5 | 1998.0 |
| 75% | 3148.0 | 2132.5 |
| max | 3424.0 | 2446.0 |


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 2 columns):
expensure    300 non-null int64
transfer     300 non-null int64
dtypes: int64(2)
memory usage: 4.8 KB


In [27]:
# Add a column indicating whether expensure is above its mean
df['group'] = df['expensure'] > df['expensure'].mean()
df.head()

| | expensure | transfer | group |
|---|---|---|---|
| 0 | 2976 | 1863 | False |
| 1 | 2963 | 2048 | False |
| 2 | 2950 | 1940 | False |
| 3 | 2926 | 1737 | False |
| 4 | 2861 | 2227 | False |


In [28]:
# Grouped statistics
print(df.groupby('group').size())

group
False    160
True     140
dtype: int64


In [29]:
df.groupby('group').mean()

| group | expensure | transfer |
|---|---|---|
| False | 2850.975 | 1878.40625 |
| True | 3172.792857 | 2122.407143 |
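
Several statistics can be computed per group in one call with named aggregation. The sketch below uses a four-row subset of the values shown earlier (two rows from `head`, two from `tail`), so its numbers differ from the full-data results; the output column names `n`, `mean_expensure`, and `max_transfer` are chosen for this example.

```python
import pandas as pd

# Toy subset of the data shown above, for illustration only.
df_demo = pd.DataFrame({
    "expensure": [2976, 2963, 3222, 3156],
    "transfer": [1863, 2048, 2218, 2245],
})

# Same grouping rule as in the tutorial: above the mean or not.
df_demo["group"] = df_demo["expensure"] > df_demo["expensure"].mean()

# Named aggregation: output_column=(input_column, aggregation).
summary = df_demo.groupby("group").agg(
    n=("expensure", "size"),
    mean_expensure=("expensure", "mean"),
    max_transfer=("transfer", "max"),
)

print(summary)
```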


## Further reading

- [Numpy Tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)


- [10 Minutes to pandas](http://python.jobbole.com/84416/)



