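# Introduction to Pandas

Pandas is a powerful, easy-to-use data analysis tool with database-like operations. It represents data with two main objects: Series and DataFrame.

Further help: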

- http://pandas.pydata.org/pandas-docs/stable/10min.html
- http://pandas.pydata.org/pandas-docs/stable/tutorials.html

<table>
<tr>
    <td><img src="http://www.scipy.org/_static/images/numpylogo_med.png"  style="width:50px;height:50px;" /></td>
    <td><h4>NumPy</h4> Base N-dimensional array package </td>
    <td><img src="http://www.scipy.org/_static/images/scipy_med.png" style="width:50px;height:50px;" /></td>
    <td><h4>SciPy</h4> Fundamental library for scientific computing </td>
    <td><img src="http://www.scipy.org/_static/images/matplotlib_med.png" style="width:50px;height:50px;" /></td>
    <td><h4>Matplotlib</h4> Comprehensive 2D Plotting </td>
</tr>
<tr>
    <td><img src="http://www.scipy.org/_static/images/ipython.png" style="width:50px;height:50px;" /></td>
    <td><h4>IPython</h4> Enhanced Interactive Console </td>
    <td><img src="http://www.scipy.org/_static/images/sympy_logo.png" style="width:50px;height:50px;" /></td>
    <td><h4>SymPy</h4> Symbolic mathematics </td>
    <td style="background:Lavender;"><img src="http://www.scipy.org/_static/images/pandas_badge2.jpg" style="width:50px;height:50px;" /></td>
    <td style="background:Lavender;"><h4>Pandas</h4> Data structures & analysis </td>
</tr>
</table>

## Loading libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Working with Series

A Series is a one-dimensional, array-like object with an enhanced (labeled) index.

#### pd.Series(self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

In [3]:
x = pd.Series([1,2,3,4,5])
x

0    1
1    2
2    3
3    4
4    5
dtype: int64

Note that the Series automatically generated an index for its items.

## Basic Series operations
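
The generated index above is just the default. As a minimal sketch (the labels `"a"`, `"b"`, `"c"` are hypothetical and chosen only for illustration), a Series can also be given explicit labels, which then work for lookups alongside positional access:

```python
import pandas as pd

# Hypothetical labels, used only to illustrate a non-default index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]         # label-based access
by_position = s.iloc[0]   # position-based access
labels = list(s.index)

print(by_label, by_position, labels)
```

Using `.iloc` for positions keeps the intent explicit when the index is not the default 0, 1, 2, ...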

In [4]:
x + 100

0    101
1    102
2    103
3    104
4    105
dtype: int64

In [5]:
(x ** 2) + 100

0    101
1    104
2    109
3    116
4    125
dtype: int64

In [6]:
x > 2

0    False
1    False
2     True
3     True
4     True
dtype: bool

## `any()` and `all()`  

In [7]:
larger_than_2 = x > 2
larger_than_2

0    False
1    False
2     True
3     True
4     True
dtype: bool

In [8]:
larger_than_2.any()   # at least one value is True

True

In [9]:
larger_than_2.all()   # every value is True

False
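
A boolean Series like `larger_than_2` is useful beyond `any()` and `all()`. The sketch below, which rebuilds `x` so that it is self-contained, shows the same mask used to select elements and to count matches:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
mask = x > 2

selected = x[mask]             # keeps only the elements where the mask is True
count_true = int(mask.sum())   # True counts as 1, False as 0

print(selected.tolist(), count_true)
```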

## `apply()`

In [10]:
def f(x):
    if x % 2 == 0:
        return x * 2
    else:
        return x * 3

x.apply(f)

0     3
1     4
2     9
3     8
4    15
dtype: int64

**Avoid looping over your data**

These are `%%timeit` results for a for loop and for `apply()`.

Some consider `apply`, `map`, and `applymap` to be the core, classic pandas functions: they cover most data transformations and avoid explicit for loops.

In [11]:
%%timeit

ds = pd.Series(range(10000))

for counter in range(len(ds)):
    ds[counter] = f(ds[counter])

1 loop, best of 3: 163 ms per loop


In [12]:
%%timeit

ds = pd.Series(range(10000))

ds = ds.apply(f)

100 loops, best of 3: 5.18 ms per loop
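
For a simple arithmetic rule like `f`, a vectorized expression avoids calling a Python function once per element. The following is a minimal sketch that assumes NumPy's `np.where` is an acceptable substitute for `f`. It only checks that both approaches agree and makes no timing claims:

```python
import numpy as np
import pandas as pd

def f(x):
    # Same rule as above: double even values, triple odd values.
    return x * 2 if x % 2 == 0 else x * 3

ds = pd.Series(range(10000))

via_apply = ds.apply(f)
# np.where evaluates the condition on the whole array at once.
via_where = pd.Series(np.where(ds % 2 == 0, ds * 2, ds * 3), index=ds.index)

same_values = bool((via_apply == via_where).all())
print(same_values)
```

Wrapping each variant in its own `%%timeit` cell, as above, would show how they compare on a given machine.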


### `astype()`: converting data types

In [9]:
x.astype(np.float64)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
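
`astype()` returns a new Series and leaves the original unchanged. As a small sketch with made-up values, it can also convert between strings and numbers:

```python
import pandas as pd

raw = pd.Series(["1", "2", "3"])      # strings, e.g. as read from a text file
as_int = raw.astype("int64")          # parse the strings as integers
as_float = as_int.astype("float64")
back_to_str = as_int.astype(str)

print(as_int.dtype, as_float.dtype, back_to_str.tolist())
```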

## `copy()`

In [11]:
y = x

In [12]:
y[0]

1

In [13]:
y[0] = 100

In [14]:
y

0    100
1      2
2      3
3      4
4      5
dtype: int64

In [15]:
x

0    100
1      2
2      3
3      4
4      5
dtype: int64

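Plain assignment (`y = x`) binds a second name to the same Series, which is why changing `y` also changed `x`. `copy()` creates an independent object instead.

**Avoid using `copy()` (if you can) to save memory**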

In [16]:
y = x.copy()

In [17]:
x[0]=1

In [18]:
x

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [19]:
y

0    100
1      2
2      3
3      4
4      5
dtype: int64
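
The difference between the two assignments can be checked directly. This sketch uses hypothetical names `alias` and `snapshot` in place of `y`:

```python
import pandas as pd

x = pd.Series([1, 2, 3])
alias = x             # same object, no data copied
snapshot = x.copy()   # independent copy of the data

x[0] = 100

print(alias is x)                 # identity check
print(alias[0], snapshot[0])
```

Only `snapshot` keeps the old value, at the cost of holding a second copy of the data in memory.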

In [20]:
x.describe()

count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64

## DataFrame

#### pd.DataFrame(self, data=None, index=None, columns=None, dtype=None, copy=False)

In [21]:
data = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(data, columns=["x"])

In [22]:
df

   x
0  1
1  2
2  3
3  4
4  5
5  6
6  7
7  8
8  9


## Selecting data

In [23]:
df["x"]

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: x, dtype: int64

In [24]:
df["x"][0]

1
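
`df["x"][0]` works here because the row labels are the default 0, 1, 2, ... As a hedged sketch of a more explicit alternative, `.loc` selects by label and `.iloc` by position, each in a single step:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

first_by_label = df.loc[0, "x"]     # row label 0, column "x"
first_by_position = df.iloc[0, 0]   # first row, first column
first_three = df.loc[0:2, "x"]      # label slices include the end label

print(first_by_label, first_by_position, first_three.tolist())
```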

## Adding new columns

In [25]:
df["x_plus_2"] = df["x"] + 2
df

   x  x_plus_2
0  1         3
1  2         4
2  3         5
3  4         6
4  5         7
5  6         8
6  7         9
7  8        10
8  9        11


In [26]:
import math  # np.math was an alias of this module and is gone from recent NumPy releases

df["x_square"] = df["x"] ** 2
df["x_factorial"] = df["x"].apply(math.factorial)
df

   x  x_plus_2  x_square  x_factorial
0  1         3         1            1
1  2         4         4            2
2  3         5         9            6
3  4         6        16           24
4  5         7        25          120
5  6         8        36          720
6  7         9        49         5040
7  8        10        64        40320
8  9        11        81       362880


In [27]:
df["is_even"] = df["x"] % 2   # remainder mod 2: 1 for odd x, 0 for even x (despite the column name)
df

   x  x_plus_2  x_square  x_factorial  is_even
0  1         3         1            1        1
1  2         4         4            2        0
2  3         5         9            6        1
3  4         6        16           24        0
4  5         7        25          120        1
5  6         8        36          720        0
6  7         9        49         5040        1
7  8        10        64        40320        0
8  9        11        81       362880        1


### Using `map()`

In [28]:
df["odd_even"] = df["is_even"].map({1:"odd", 0:"even"})
df

   x  x_plus_2  x_square  x_factorial  is_even odd_even
0  1         3         1            1        1      odd
1  2         4         4            2        0     even
2  3         5         9            6        1      odd
3  4         6        16           24        0     even
4  5         7        25          120        1      odd
5  6         8        36          720        0     even
6  7         9        49         5040        1      odd
7  8        10        64        40320        0     even
8  9        11        81       362880        1      odd
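
Two details of `Series.map()` are worth knowing. With a dictionary, values that have no matching key become `NaN`. A function can be passed instead of a dictionary. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([0, 1, 2])

# Dictionary lookup: 2 has no entry, so the result is NaN for that row.
labels = s.map({0: "even", 1: "odd"})

# Function applied element by element: every value gets a result.
parity = s.map(lambda v: "even" if v % 2 == 0 else "odd")

print(labels.tolist(), parity.tolist())
```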


### `drop()`: removing data

In [29]:
df = df.drop(columns="is_even")  # drop one column; recent pandas no longer accepts the positional axis argument
df

   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
4  5         7        25          120      odd
5  6         8        36          720     even
6  7         9        49         5040      odd
7  8        10        64        40320     even
8  9        11        81       362880      odd
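
`drop()` returns a new DataFrame rather than modifying the original, which is why the result is assigned back to `df` above. A short sketch on a throwaway frame (the column names `a`, `b`, `c` are hypothetical) shows dropping by column name and by row label:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_b = df.drop(columns=["b"])   # drop by column name
without_row_0 = df.drop(index=[0])   # drop by row label

print(list(without_b.columns), without_row_0.index.tolist(), list(df.columns))
```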


## Selecting multiple columns

In [30]:
df[["x", "x_square"]]

   x  x_square
0  1         1
1  2         4
2  3         9
3  4        16
4  5        25
5  6        36
6  7        49
7  8        64
8  9        81


## Controlling display options

In [31]:
pd.options.display.max_columns= 60
pd.options.display.max_rows= 6
pd.options.display.notebook_repr_html = False
df

    x  x_plus_2  x_square  x_factorial odd_even
0   1         3         1            1      odd
1   2         4         4            2     even
2   3         5         9            6      odd
.. ..       ...       ...          ...      ...
6   7         9        49         5040      odd
7   8        10        64        40320     even
8   9        11        81       362880      odd

[9 rows x 5 columns]
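
Assigning to `pd.options.display` changes the setting for the rest of the session. When a setting is needed only for one display, `pd.option_context` restores the previous value afterwards. A minimal sketch with a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(20)})

before = pd.get_option("display.max_rows")
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    print(df)   # printed with the temporary row limit
after = pd.get_option("display.max_rows")

print(before, inside, after)
```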

## Filtering data

In [32]:
df[df["x"] >3]

   x  x_plus_2  x_square  x_factorial odd_even
3  4         6        16           24     even
4  5         7        25          120      odd
5  6         8        36          720     even
6  7         9        49         5040      odd
7  8        10        64        40320     even
8  9        11        81       362880      odd

In [33]:
df[df.odd_even == "even"]

   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

### Combining filters

#### `|` OR

In [39]:
df[(df.odd_even == "even") | (df.x_square < 20)]

   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

#### `&` AND

In [40]:
df[(df.odd_even == "even") & (df.x_square < 20)]

   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even
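
The parentheses around each condition are required, because `&` and `|` bind more tightly than comparisons such as `==` and `<`. The sketch below rebuilds the relevant columns so it is self-contained. It also shows `~` for negation and `DataFrame.query` as an alternative spelling of the AND filter:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1, 10)})
df["x_square"] = df["x"] ** 2
df["odd_even"] = df["x"].map(lambda v: "even" if v % 2 == 0 else "odd")

both = df[(df["odd_even"] == "even") & (df["x_square"] < 20)]
not_even = df[~(df["odd_even"] == "even")]   # ~ negates a boolean mask
via_query = df.query("odd_even == 'even' and x_square < 20")

print(both["x"].tolist(), not_even["x"].tolist(), via_query["x"].tolist())
```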

### Chaining further selections

In [34]:
df[(df.odd_even == "even") & (df.x_square < 20)]["x_plus_2"][:1]

1    4
Name: x_plus_2, dtype: int64
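
The expression above indexes three times in a row (rows, then a column, then a slice). As a sketch of an equivalent spelling, `.loc` can select the rows and the column in one step. The frame is rebuilt here so the example runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1, 10)})
df["x_plus_2"] = df["x"] + 2
df["x_square"] = df["x"] ** 2
df["odd_even"] = df["x"].map(lambda v: "even" if v % 2 == 0 else "odd")

mask = (df["odd_even"] == "even") & (df["x_square"] < 20)

chained = df[mask]["x_plus_2"][:1]          # three separate indexing steps
single = df.loc[mask, "x_plus_2"].head(1)   # rows and column in one step

print(chained.tolist(), single.tolist())
```

The single-step form matters most when assigning values, because assignment through chained indexing may not reach the original DataFrame.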

In [35]:
df.describe()

              x   x_plus_2   x_square    x_factorial
count  9.000000   9.000000   9.000000       9.000000
mean   5.000000   7.000000  31.666667   45457.000000
std    2.738613   2.738613  28.080242  119758.341137
...         ...        ...        ...            ...
50%    5.000000   7.000000  25.000000     120.000000
75%    7.000000   9.000000  49.000000    5040.000000
max    9.000000  11.000000  81.000000  362880.000000

[8 rows x 4 columns]

## Reading data from CSV/TSV files

In [13]:
url = "data/2014_indian_elections_results.csv"
elections_data = pd.read_csv(url)

In [14]:
elections_data

   Name of State/ UT  Parliamentary Constituency     Candidate Name  Total Votes Polled  Winner or Not?  Party Abbreviation                 Party Name
0     Andhra Pradesh                    Adilabad       GODAM NAGESH              430847             yes                 TRS  Telangana Rashtra Samithi
1     Andhra Pradesh                    Adilabad   NETHAWATH RAMDAS               41032              no                 IND                Independent
2     Andhra Pradesh                    Adilabad      RAMESH RATHOD              184198              no                 TDP               Telugu Desam
3     Andhra Pradesh                    Adilabad    RATHOD SADASHIV               94420              no                 BSP        Bahujan Samaj Party
4     Andhra Pradesh                    Adilabad   MOSALI CHINNAIAH                8859              no                 IND                Independent
5     Andhra Pradesh                    Adilabad             NARESH              259557              no                 INC   Indian National Congress
6     Andhra Pradesh                    Adilabad      PAWAR KRISHNA                5055              no                 IND                Independent
7     Andhra Pradesh                    Adilabad      BANKA SAHADEV                4787              no                 IND                Independent
8     Andhra Pradesh                    Adilabad  None of the Above               17084              no                NOTA          None of the Above
9     Andhra Pradesh                  Peddapalle        BALKA SUMAN              565496             yes                 TRS  Telangana Rashtra Samithi
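
The section title mentions TSV files as well. `read_csv` handles them through its `sep` parameter. The sketch below uses small made-up in-memory tables (not the election data) so that it runs without any file on disk:

```python
import io

import pandas as pd

# Made-up sample rows for illustration only.
csv_text = "name,votes\nA,10\nB,25\n"
tsv_text = "name\tvotes\nA\t10\nB\t25\n"

from_csv = pd.read_csv(io.StringIO(csv_text))            # comma is the default separator
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-separated values

print(from_csv.equals(from_tsv))
```

Passing a file path or URL instead of the `StringIO` object works the same way.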


# Thanks