# 3. pandas

DataFrame: a two-dimensional array with row labels and column labels that can hold columns of different types as well as missing values.

Series: a one-dimensional array of indexed data.
```
values -> ndarray
index  -> RangeIndex

A Series associates its values with an explicitly defined index.

A Series is a specialized dictionary.

How do you create a Series object?


```


DataFrame

Index


In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
print(data.values)
print(f'Series values type {type(data.values)}')

[0.25 0.5  0.75 1.  ]
Series values type <class 'numpy.ndarray'>


In [4]:
print(data.index)
print(f'Series index type {type(data.index)}')


RangeIndex(start=0, stop=4, step=1)
Series index type <class 'pandas.core.indexes.range.RangeIndex'>


In [5]:
print(data[1])

print(data[1:3])


0.5
1    0.50
2    0.75
dtype: float64


In [6]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [7]:
# The index may be non-contiguous or out of order
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
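
With integer labels like these, it is easy to confuse a label with a position. The following minimal sketch reuses the Series from the cell above and uses the `loc` and `iloc` accessors (not used elsewhere in these notes) to make the intent explicit. The variable names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

# Same Series as above: integer labels that are neither contiguous nor sorted
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[2]      # the value whose label is 2
by_position = data.iloc[2]  # the value at position 2 (the third element)

print(by_label)     # 0.25
print(by_position)  # 0.75
```

Spelling out `loc` or `iloc` avoids relying on how plain `data[...]` resolves an integer key.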

In [8]:
"""
A Series can be viewed as a specialized Python dictionary:
a structure that maps typed keys to a set of typed values.
"""

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [9]:
# Values can be looked up dictionary-style, by key
population['California']

38332521

In [10]:
population['California': 'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64
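
The dictionary analogy and the slice above can be checked side by side. This is a small sketch, assuming the same `population` data; it shows that a label slice includes its end label while a positional slice (via `iloc`) excludes its end position. The names `label_slice` and `position_slice` are illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style membership test and key listing
has_texas = 'Texas' in population
keys = list(population.keys())
print(has_texas, keys)

# Label slice: the end label 'New York' is included
label_slice = population['California':'New York']
# Positional slice: the end position 2 is excluded
position_slice = population.iloc[0:2]

print(len(label_slice), len(position_slice))  # 3 2
```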

In [11]:
"""
In each of these forms an explicit index can be passed to select the result;
the Series keeps only the key-value pairs named in that index.
"""

s1 = pd.Series([2, 4, 6])
s2 = pd.Series(5, index=[100, 200, 300])
s3 = pd.Series({2: 'a', 1: 'b', 3: 'c'})
s4 = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
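
The cell above prints nothing, so here is a short sketch that inspects three of those Series. It only rebuilds `s2`, `s3`, and `s4` from the cell and converts them to plain Python objects for display.

```python
import pandas as pd

# A scalar is repeated for every label in the index
s2 = pd.Series(5, index=[100, 200, 300])
# Dictionary keys become the index
s3 = pd.Series({2: 'a', 1: 'b', 3: 'c'})
# An explicit index keeps only the listed keys, in the listed order
s4 = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

print(s2.tolist())        # [5, 5, 5]
print(s3.index.tolist())  # the three keys; ordering depends on the pandas version
print(s4.to_dict())       # {3: 'c', 2: 'a'}
```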



## The DataFrame object

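The index attribute gives the row labels.

The columns attribute gives the column labels.

A DataFrame can also be viewed as a specialized dictionary: each column name maps to a Series holding that column's data.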


In [12]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [13]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [15]:
# index holds the row labels, columns holds the column labels
print(states.index)

print(states.columns)

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
Index(['population', 'area'], dtype='object')


In [16]:
print(states['area'])

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
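
To make the "specialized dictionary" view concrete, the sketch below rebuilds `states` and treats it like a dict keyed by column name. The `density` column is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})

# Like a dict, a DataFrame is keyed by column name, not by row label
print('area' in states)     # True
print('Texas' in states)    # False
print(list(states.keys()))  # ['population', 'area']

# Assigning to a new key adds a column (hypothetical 'density' column)
states['density'] = states['population'] / states['area']
print(states.columns.tolist())  # ['population', 'area', 'density']
```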


### Ways to create a DataFrame



In [17]:
# From a single Series object
df1 = pd.DataFrame(population, columns=['population'])
print(df1)

# From a list of dicts; when a key is missing from some dict, pandas fills in NaN
data = [{'a': i, 'b': 2 * i} for i in range(3)]
df2 = pd.DataFrame(data)
print(df2)

# From a dict of Series objects
df3 = pd.DataFrame({'population': population, 'area': area})
print(df3)

# From a two-dimensional NumPy array
df4 = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
print(df4)

# From a NumPy structured array
df5 = pd.DataFrame(np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')]))
print(df5)

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
   a  b
0  0  0
1  1  2
2  2  4
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
        foo       bar
a  0.020944  0.610888
b  0.286238  0.961391
c  0.603803  0.002924
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
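
The list-of-dicts example above has the same keys in every dict, so no NaN appears in its output. A minimal sketch of the missing-key case, using made-up records (`records` and `missing` are illustrative names):

```python
import pandas as pd

# The first dict has no 'c' key and the second has no 'a' key
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df = pd.DataFrame(records)
print(df)

# Count the NaN entries pandas filled in for each column
missing = {col: int(n) for col, n in df.isna().sum().items()}
print(missing)  # {'a': 1, 'b': 0, 'c': 1}
```

Columns that receive a NaN are stored as floating point, because NaN is itself a float value.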


## 3.2.3 The Index object

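An Index can be viewed either as an immutable array or as an ordered set. Strictly speaking it is a multiset, because an Index may contain duplicate values. It also has many attributes similar to those of a NumPy array.

Viewed as an ordered set, an Index supports set operations such as intersection, union, and symmetric difference.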


In [20]:
ind = pd.Index([2, 3, 5, 7, 11])

print(ind)
print(ind[1])
print(ind[::2])

print(ind.size, ind.shape, ind.ndim, ind.dtype)

try:
    ind[1] = 4
except TypeError as e:
    # Index objects are immutable, so item assignment raises TypeError
    print(e)

Int64Index([2, 3, 5, 7, 11], dtype='int64')
3
Int64Index([2, 5, 11], dtype='int64')
5 (5,) 1 int64
Index does not support mutable operations


In [26]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB))
print(indA.union(indB))
print(indA.symmetric_difference(indB))


Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
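
Two further points from the description above can be checked directly: one more set operation (`difference`) and the fact that an Index may hold duplicates. This is a small sketch; `dup` and `only_in_a` are illustrative names.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Labels that appear in indA but not in indB
only_in_a = indA.difference(indB)
print(only_in_a.tolist())  # [1, 9]

# An Index may hold repeated labels, which makes it a multiset rather than a set
dup = pd.Index(['a', 'b', 'a'])
print(dup.is_unique)          # False
print(dup.unique().tolist())  # ['a', 'b']
```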
