# Pandas
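Pandas is a data analysis library built on top of NumPy. It brings together a large number of supporting routines and a few standard data models, and provides the tools needed to work efficiently with large datasets.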



# Installation and import

## Installation

    pip install pandas    # or: conda install pandas

## Import

    import pandas as pd

# Checking the pandas version

In [12]:
import pandas as pd
# pd.show_versions()
print(pd.__version__)

1.3.5


# Differences between Pandas and NumPy
![image.png](attachment:image.png)

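# Series and DataFrame
Series and DataFrame are the two basic data structures in Pandas.

A Series is essentially a one-dimensional array made up of two parts: an index and values. It behaves like a fixed-length ordered dictionary that maps index labels to values. All values share one dtype; values of mixed types are stored with the object dtype.

A DataFrame consists of three parts: a row index (index), values (values), and column names (columns). Its structure resembles a two-dimensional table with both a row index and a column index, and it can be viewed as a large dictionary made up of several Series.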
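
The relationship between the two structures can be checked directly. The following is a minimal sketch with hypothetical sample data (the ZY and RS column names are borrowed from later cells of this notebook): selecting one column of a DataFrame returns a Series that carries the same row index.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({"ZY": ["JK", "WLW", "DK"], "RS": [101, 92, 48]},
                  index=["a", "b", "c"])

col = df["RS"]                      # selecting one column returns a Series
print(type(col).__name__)           # Series
print(list(col.index))              # ['a', 'b', 'c']
print(col.index.equals(df.index))   # True: the column shares the row index
print(list(df.columns))             # ['ZY', 'RS']
print(df.values.shape)              # (3, 2)
```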


## Series operations

In [5]:
## Create a Series
import pandas as pd
a = [1, 2, 3]
myvar = pd.Series(a)
print(myvar)

0    1
1    2
2    3
dtype: int64


In [13]:
## Read data by index label:
import pandas as pd
a=[1, 2, 3]
myvar = pd.Series(a)
print(myvar[1])

2


In [14]:
import pandas as pd
a=[1, 2, 3]
myvar = pd.Series(a)
print(myvar[:2])

0    1
1    2
dtype: int64


In [15]:
import pandas as pd
a=[1, 2, 3]
myvar = pd.Series(a)
print(myvar[-2:])

1    2
2    3
dtype: int64


In [16]:
## Create a Series with a specified index:
import pandas as pd
a = ["Jk", "Wlw", "Dk"]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

x     Jk
y    Wlw
z     Dk
dtype: object


In [17]:
## Read data by index label:
import pandas as pd
a = ["Jk", "Wlw", "Dk"]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar["y"])

Wlw


In [18]:
import pandas as pd
a = ["Jk", "Wlw", "Dk"]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar[["y"]])

y    Wlw
dtype: object


In [19]:
import pandas as pd
a = ["Jk", "Wlw", "Dk"]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar[["y", "z"]])

y    Wlw
z     Dk
dtype: object


In [20]:
## Create a Series from key/value pairs, such as a dictionary:
import pandas as pd
sites = {1: "JK", 2: "WLW", 3: "DK"}
myvar = pd.Series(sites)
print(myvar)

1     JK
2    WLW
3     DK
dtype: object


In [21]:
## If only part of the dictionary is needed, pass the index labels of the wanted entries:
import pandas as pd
sites = {1: "JK", 2: "WLW", 3: "DK"}
myvar = pd.Series(sites, index = [1, 2])
print(myvar)

1     JK
2    WLW
dtype: object
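
A related case worth knowing: when the index passed alongside a dictionary contains a label that is not one of the dictionary's keys, pandas fills that position with NaN instead of raising an error. A small sketch, where label 4 is a made-up key chosen only to trigger this behavior:

```python
import pandas as pd

sites = {1: "JK", 2: "WLW", 3: "DK"}
# Label 4 is not a key of the dictionary, so its value becomes NaN
myvar = pd.Series(sites, index=[2, 3, 4])
print(myvar)
print(myvar.isna().tolist())  # [False, False, True]
```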


In [23]:
## Set the Series name parameter:
import pandas as pd
sites = {1: "JK", 2: "WLW", 3: "DK"}
myvar = pd.Series(sites, index = [1, 2], name="JSJ-Series-TEST" )
print(myvar)

1     JK
2    WLW
Name: JSJ-Series-TEST, dtype: object


# DataFrame
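A DataFrame is a tabular data structure that holds an ordered collection of columns, each of which can have a different value type (numeric, string, boolean). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that share one index.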
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

## Creating a DataFrame

In [25]:
## Create a DataFrame from a dictionary of Series
import pandas as pd
d = {'ZY' : pd.Series(['JK','WLW', 'DK'], index=['a', 'b', 'c']),
     'RS' : pd.Series([102, 92, 48], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print (df)

    ZY   RS
a   JK  102
b  WLW   92
c   DK   48


In [26]:
## Create a DataFrame from a list of lists:
import pandas as pd
data = [['JK',101],['WLW',92],['DK',48]]
df = pd.DataFrame(data,columns=['ZY','RS'])
print(df)

    ZY   RS
0   JK  101
1  WLW   92
2   DK   48


In [27]:
## Create a DataFrame from a dictionary of lists (ndarrays work the same way)
import pandas as pd
data = {'ZY':['JK', 'WLW', 'DK'], 'RS':[101, 92, 48]}
df = pd.DataFrame(data)
print (df)

    ZY   RS
0   JK  101
1  WLW   92
2   DK   48


In [28]:
## Create a DataFrame from a list of dictionaries
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print (df)

   a   b     c
0  1   2   NaN
1  5  10  20.0


## Column operations

In [29]:
## Add columns to a DataFrame
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Original columns:\n", df)
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("After adding a new column from a Series:\n", df)
df['four'] = df['one'] + df['three']
print("After adding a column computed from existing columns:\n", df)

Original columns:
    one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
After adding a new column from a Series:
    one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
After adding a column computed from existing columns:
    one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


In [33]:
## Delete columns from a DataFrame
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print ("Original DataFrame:\n", df)
del df['one']
print ("After deleting column one with del:\n", df)
df.pop('two')
print ("After removing column two with pop:\n",df)

Original DataFrame:
    one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
After deleting column one with del:
    two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
After removing column two with pop:
    three
a   10.0
b   20.0
c   30.0
d    NaN
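
Both del and pop modify the DataFrame in place. As a complementary sketch (the data below is hypothetical), pop also returns the removed column as a Series, while drop(columns=...) leaves the original untouched and returns a new DataFrame:

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6], "three": [7, 8, 9]})

popped = df.pop("two")               # removes 'two' in place and returns it
print(popped.tolist())               # [4, 5, 6]

trimmed = df.drop(columns=["one"])   # returns a new DataFrame
print(list(trimmed.columns))         # ['three']
print(list(df.columns))              # ['one', 'three']: drop did not change df
```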


## Row operations

In [60]:
## Read rows from a DataFrame
## Rows can be selected by passing row labels to loc[]
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Original DataFrame:\n", df)
print ("Row b:\n", df.loc['b'])
print ("Rows b through d:\n", df.loc['b':'d'])

Original DataFrame:
    one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
Row b:
 one    2.0
two    2.0
Name: b, dtype: float64
Rows b through d:
    one  two
b  2.0    2
c  3.0    3
d  NaN    4

In [43]:
## Select a row by passing an integer position to iloc[]
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Original DataFrame:\n", df)
print ("Row at position 2 (label c):\n", df.iloc[2])

Original DataFrame:
    one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
Row at position 2 (label c):
 one    3.0
two    3.0
Name: c, dtype: float64


In [45]:
## Select rows by slicing
import pandas as pd
d = {'one' : pd.Series([100, 200, 300], index=['a', 'b', 'c']),
     'two' : pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df[2:4])

     one  two
c  300.0  300
d    NaN  400
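
The row-selection cells above mix label-based and position-based access, and the two treat the end of a slice differently. A minimal sketch with hypothetical data: an iloc slice excludes the stop position, whereas a loc slice includes the stop label.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({"one": [100, 200, 300, 400]}, index=["a", "b", "c", "d"])

by_position = df.iloc[1:3]       # positions 1 and 2; the stop position is excluded
by_label = df.loc["b":"c"]       # labels 'b' through 'c'; the stop label is included
print(list(by_position.index))   # ['b', 'c']
print(list(by_label.index))      # ['b', 'c']
print(list(df[1:3].index))       # ['b', 'c']: a plain integer slice is positional here
```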


In [46]:
## Add rows to a DataFrame
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'], index=[0, 1])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a', 'b'], index=[2, 3])
df = pd.concat([df, df2])
print (df)

   a  b
0  1  2
1  3  4
2  5  6
3  7  8
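
In the cell above the two frames were given non-overlapping index labels by hand. When that is not the case, concat keeps the original labels and the result contains duplicates. A short sketch, assuming a fresh 0..n-1 index is what is wanted, uses ignore_index=True:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])  # also labeled 0 and 1

kept = pd.concat([df, df2])                           # original labels are kept
renumbered = pd.concat([df, df2], ignore_index=True)  # rows are relabeled 0..3
print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```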


In [47]:
## Delete rows from a DataFrame
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'], index=[0, 1])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a', 'b'], index=[2, 3])
df = pd.concat([df, df2])
print ("Original DataFrame:\n", df)
df = df.drop(0)
print("After deleting the row:\n",df)

Original DataFrame:
    a  b
0  1  2
1  3  4
2  5  6
3  7  8
After deleting the row:
    a  b
1  3  4
2  5  6
3  7  8


## Renaming columns and index labels

In [48]:
## Rename column names and index labels
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'], index=[0, 1])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a', 'b'], index=[2, 3])
df = pd.concat([df, df2])
print ("Original DataFrame:\n", df)
df3 = df.rename(columns={'a':'wlw','b':'dk'},index ={0:'XY',1:'ZY',2:'BJ',3:'RS'})
print("After renaming:\n", df3)

Original DataFrame:
    a  b
0  1  2
1  3  4
2  5  6
3  7  8
After renaming:
     wlw  dk
XY    1   2
ZY    3   4
BJ    5   6
RS    7   8
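
Two details of rename are easy to miss: it returns a new DataFrame rather than changing the original, and labels absent from the mapping are left as they are. A minimal sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

# Only 'a' appears in the mapping, so 'b' keeps its name
df3 = df.rename(columns={"a": "wlw"})
print(list(df3.columns))  # ['wlw', 'b']
print(list(df.columns))   # ['a', 'b']: the original is unchanged
```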


## Number of dimensions

In [49]:
## Show the number of dimensions:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print (s)
print ("The dimensions of the object:")
print (s.ndim)

0   -0.145557
1    0.561418
2   -0.286116
3   -1.634518
dtype: float64
The dimensions of the object:
1


## Number of elements

In [50]:
## Number of elements:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(2))
print (s)
print ("The size of the object:")
print (s.size)

0   -0.244504
1   -1.603545
dtype: float64
The size of the object:
2
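
The same attributes exist on a DataFrame. As a small sketch on a hypothetical 3 x 2 frame, ndim is 2, shape gives (rows, columns), and size is the product of the two:

```python
import pandas as pd
import numpy as np

# Hypothetical 3 x 2 frame filled with zeros, for illustration only
df = pd.DataFrame(np.zeros((3, 2)), columns=["col1", "col2"])
print(df.ndim)   # 2
print(df.shape)  # (3, 2)
print(df.size)   # 6
```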


## Sorting

In [51]:
## Sort by index
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
print("Unsorted data:\n",unsorted_df )
sorted_df=unsorted_df.sort_index()
print("Sorted data:\n", sorted_df)

Unsorted data:
        col2      col1
1 -1.412828 -0.379650
4  0.745770 -0.786419
6  0.145453  2.361911
2 -0.699238  1.092076
3  0.453284 -1.634442
5  0.237350 -0.911404
9  1.512742 -0.070094
8 -0.136771  1.300646
0  1.382505 -0.147575
7 -0.067565  1.626745
Sorted data:
        col2      col1
0  1.382505 -0.147575
1 -1.412828 -0.379650
2 -0.699238  1.092076
3  0.453284 -1.634442
4  0.745770 -0.786419
5  0.237350 -0.911404
6  0.145453  2.361911
7 -0.067565  1.626745
8 -0.136771  1.300646
9  1.512742 -0.070094


In [55]:
## Sort by values
import pandas as pd
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col2', ascending=False)
print (sorted_df)

   col1  col2
3     1     4
1     1     3
2     1     2
0     2     1
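
sort_values also accepts a list of columns, with one ascending flag per column. The sketch below reuses the data from the cell above and sorts by col1 ascending, breaking ties by col2 descending; the choice of sort directions is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 1, 1, 1], "col2": [1, 3, 2, 4]})

# Sort by col1 ascending, then break ties by col2 descending
sorted_df = df.sort_values(by=["col1", "col2"], ascending=[True, False])
print(sorted_df.index.tolist())    # [3, 1, 2, 0]
print(sorted_df["col2"].tolist())  # [4, 3, 2, 1]
```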


## Merging

In [56]:
## Merge
import pandas as pd
frame1 = pd.DataFrame({'id':['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                       'price':['12.33', '11.44', '33.21', '12.23', '33.62']})
frame2 = pd.DataFrame({'id':['pencil', 'pencil', 'ball', 'pen'],
                       'color':['white', 'red', 'red', 'black']})
print("First DataFrame:\n", frame1)
print("Second DataFrame:\n", frame2)
# With no extra arguments, merge joins on the shared column 'id' (inner join)
print("Merged DataFrame:\n", pd.merge(frame1, frame2))

First DataFrame:
         id  price
0     ball  12.33
1   pencil  11.44
2      pen  33.21
3      mug  12.23
4  ashtray  33.62
Second DataFrame:
        id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black
Merged DataFrame:
        id  price  color
0    ball  12.33    red
1  pencil  11.44  white
2  pencil  11.44    red
3     pen  33.21  black
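
In the result above, mug and ashtray are missing because the default merge is an inner join: only ids present in both frames survive. A minimal sketch on smaller hypothetical frames shows how how="left" keeps every row of the first frame and fills the unmatched color with NaN:

```python
import pandas as pd

# Hypothetical, shortened versions of the frames used above
frame1 = pd.DataFrame({"id": ["ball", "pencil", "mug"],
                       "price": [12.33, 11.44, 12.23]})
frame2 = pd.DataFrame({"id": ["pencil", "ball"],
                       "color": ["white", "red"]})

inner = pd.merge(frame1, frame2, on="id")              # default how="inner"
left = pd.merge(frame1, frame2, on="id", how="left")   # keep every row of frame1
print(inner["id"].tolist())            # ['ball', 'pencil']
print(left["id"].tolist())             # ['ball', 'pencil', 'mug']
print(left["color"].isna().tolist())   # [False, False, True]
```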
