## DataFrame
Source material: the WeChat official account 秦路 and the official pandas documentation.

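- A DataFrame is a tabular data structure made up of columns, each of which can hold a different data type. It has both a row index and a column index, and it resembles a SQL table.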
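As a small illustration of per-column data types, the sketch below builds a frame from made-up data (the `people` name and its values are assumptions, not part of the original notebook) and prints the dtype of each column along with the row and column labels.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different data type.
people = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid"],     # strings
        "age": [31, 45, 27],               # integers
        "height_m": [1.62, 1.80, 1.75],    # floats
        "member": [True, False, True],     # booleans
    },
    index=["r1", "r2", "r3"],
)

print(people.dtypes)         # one dtype per column
print(list(people.index))    # row labels
print(list(people.columns))  # column labels
```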

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(np.arange(16).reshape(4,4),index=['x1', 'x2', 'x3','x4'],
     columns=['feature1', 'feature2','feature3','feature4'])
df

Unnamed: 0,feature1,feature2,feature3,feature4
x1,0,1,2,3
x2,4,5,6,7
x3,8,9,10,11
x4,12,13,14,15


In [3]:
df.index

Index(['x1', 'x2', 'x3', 'x4'], dtype='object')

In [4]:
df.columns

Index(['feature1', 'feature2', 'feature3', 'feature4'], dtype='object')

In [5]:
# Get the underlying values as a NumPy array
df.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [6]:
# Assign a new column; note that the label column is numeric
df.loc[:,'label']=[0,1,1,2]
df

Unnamed: 0,feature1,feature2,feature3,feature4,label
x1,0,1,2,3,0
x2,4,5,6,7,1
x3,8,9,10,11,1
x4,12,13,14,15,2


## Selecting rows and columns
- x.loc[row label, column label]
- x.iloc[row position, column position]
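The difference between labels and positions is easiest to see when the row labels are integers that do not match the positions. The following is a minimal sketch with a hypothetical `scores` frame (not from the original notebook).

```python
import pandas as pd

# Hypothetical frame whose integer row labels differ from the row positions.
scores = pd.DataFrame({"score": [90, 80, 70]}, index=[10, 20, 30])

by_label = scores.loc[10, "score"]   # the row labelled 10
by_position = scores.iloc[0, 0]      # the first row, first column

print(by_label, by_position)

# scores.loc[0, "score"] would raise KeyError: no row carries the label 0.
```

Both lookups reach the same cell, but only because label `10` happens to sit at position `0`.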

In [8]:
# To select non-contiguous rows or columns, pass lists; the loc line below gives exactly the same result
#df.loc[['x1','x2'],['feature1','feature2','label']]

df.iloc[[0,1],[0,1,4]]

Unnamed: 0,feature1,feature2,label
x1,0,1,0
x2,4,5,1


In [10]:
df

Unnamed: 0,feature1,feature2,feature3,feature4,label
x1,0,1,2,3,0
x2,4,5,6,7,1
x3,8,9,10,11,1
x4,12,13,14,15,2


In [11]:
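# Slicing behaves very differently in the two indexers
#  loc:  label-based, includes both the start and the stop label
#  iloc: position-based, includes the start but excludes the stop position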

#df.loc['x1':'x3','feature1':'feature4']
df.iloc[0:2,0:3]

Unnamed: 0,feature1,feature2,feature3
x1,0,1,2
x2,4,5,6
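To make the endpoint rule concrete, here is a small sketch that rebuilds the same 4x4 frame (without the `label` column) and selects the same block both ways. The variable names `by_label` and `by_position` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["x1", "x2", "x3", "x4"],
    columns=["feature1", "feature2", "feature3", "feature4"],
)

by_label = df.loc["x1":"x3", "feature1":"feature3"]  # both endpoints included
by_position = df.iloc[0:3, 0:3]                       # stop position excluded

print(by_label.shape, by_position.shape)
print(by_label.equals(by_position))
```

To cover three rows, the `loc` slice stops at the third label, while the `iloc` slice has to stop one position past it.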


## Simple ways to get rows or columns
- Rows: use a slice
- Columns: use label names (a single label or a list); **a slice cannot be used to select columns this way**
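The sketch below summarizes what plain `[]` indexing returns in each case and shows `.loc` as one way to take a range of columns, since a slice inside `[]` selects rows. It rebuilds the 4x4 example frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["x1", "x2", "x3", "x4"],
    columns=["feature1", "feature2", "feature3", "feature4"],
)

one_column = df["feature1"]                      # single label -> Series
two_columns = df[["feature1", "feature3"]]       # list of labels -> DataFrame
first_rows = df[0:2]                             # slice -> rows, by position
column_range = df.loc[:, "feature2":"feature4"]  # a column slice needs .loc

print(type(one_column).__name__, type(two_columns).__name__)
print(list(first_rows.index), list(column_range.columns))
```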

In [12]:
# Select columns by label
df[['feature1','feature3']]

Unnamed: 0,feature1,feature3
x1,0,2
x2,4,6
x3,8,10
x4,12,14


In [15]:
# Select rows with a slice
#df['x1':'x3']
df[0:2]

Unnamed: 0,feature1,feature2,feature3,feature4,label
x1,0,1,2,3,0
x2,4,5,6,7,1


In [20]:
# Select the rows where feature3 > 6
df.loc[df['feature3']>6]

Unnamed: 0,feature1,feature2,feature3,feature4,label
x3,8,9,10,11,1
x4,12,13,14,15,2


In [21]:
df.loc[lambda df: df['feature1'] == 8]

Unnamed: 0,feature1,feature2,feature3,feature4,label
x3,8,9,10,11,1
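Boolean conditions can also be combined. The sketch below, on the same 4x4 example frame, uses `&` and `|` with each comparison wrapped in parentheses, which is needed because these operators bind more tightly than comparisons in Python. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["x1", "x2", "x3", "x4"],
    columns=["feature1", "feature2", "feature3", "feature4"],
)

# AND: feature3 greater than 2 and feature1 less than 12
both = df.loc[(df["feature3"] > 2) & (df["feature1"] < 12)]

# OR: feature1 equal to 0 or equal to 12
either = df.loc[(df["feature1"] == 0) | (df["feature1"] == 12)]

print(list(both.index))
print(list(either.index))
```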


In [17]:
# A slice may run past the end; only the rows that exist are returned
df.iloc[2:10]

Unnamed: 0,feature1,feature2,feature3,feature4,label
x3,8,9,10,11,1
x4,12,13,14,15,2


In [20]:
#A single indexer that is out of bounds will raise an IndexError
df.iloc[10]

IndexError: single positional indexer is out-of-bounds
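When a position may be out of range, the error can be caught explicitly. The helper `safe_row` below is hypothetical (it is not a pandas function) and only sketches one way to handle the case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["x1", "x2", "x3", "x4"],
    columns=["feature1", "feature2", "feature3", "feature4"],
)


def safe_row(frame, position):
    """Hypothetical helper: return the row at `position`, or None if out of bounds."""
    try:
        return frame.iloc[position]
    except IndexError:
        return None


print(safe_row(df, 1))      # an existing row, returned as a Series
print(safe_row(df, 10))     # out of bounds, so None
print(len(df.iloc[2:10]))   # an out-of-bounds slice is simply truncated
```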

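### Arithmetic works directly on DataFrame columns; operations between two DataFrames align on row and column labels before adding, subtracting, multiplying, or dividing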

In [21]:
df1=pd.DataFrame(np.arange(4).reshape(2,2),columns=['a','b'])
df2=pd.DataFrame(np.arange(6).reshape(2,3),columns=['a','b','c'])
df2

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5


In [22]:
df1

Unnamed: 0,a,b
0,0,1
1,2,3


In [23]:
df1+df2

Unnamed: 0,a,b,c
0,0,2,NaN
1,5,7,NaN


In [26]:
df1.add(1,fill_value=0)
# Subtraction, multiplication, and division use sub, mul, and div

Unnamed: 0,a,b
0,1,2
1,3,4
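`fill_value` matters most when two frames are combined. As a sketch using the same `df1` and `df2` as above: the plain `+` leaves column `c` as NaN because `df1` has no such column, while `add(..., fill_value=0)` treats the missing side as 0. The names `plain` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=["a", "b"])
df2 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

plain = df1 + df2                     # column c has no partner in df1 -> NaN
filled = df1.add(df2, fill_value=0)   # missing cells in df1 count as 0

print(plain)
print(filled)
```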


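## Statistical functions
- describe: computes summary statistics for a Series or for each column of a DataFrame
- min, max: the minimum and maximum values
- idxmin, idxmax: return the index labels of the minimum and maximum values
- mean: arithmetic mean; std: standard deviation; median: median; diff: first-order difference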
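A short sketch of the functions listed above, run on the 4x4 example frame (rebuilt here as `X` so the block is self-contained; the other variable names are illustrative):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["x1", "x2", "x3", "x4"],
    columns=["feature1", "feature2", "feature3", "feature4"],
)

summary = X.describe()              # count, mean, std, min, quartiles, max per column
col = X["feature1"]                 # values 0, 4, 8, 12

smallest_at = col.idxmin()          # index label of the minimum
largest_at = col.idxmax()           # index label of the maximum
average = col.mean()
spread = col.std()                  # sample standard deviation (ddof=1)
middle = col.median()
steps = col.diff()                  # first-order difference; first entry is NaN

print(summary)
print(smallest_at, largest_at, average, spread, middle)
print(steps)
```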

In [27]:
X=df.iloc[:,:4]
y=df.iloc[:,4:]
X

Unnamed: 0,feature1,feature2,feature3,feature4
x1,0,1,2,3
x2,4,5,6,7
x3,8,9,10,11
x4,12,13,14,15


In [28]:
y

Unnamed: 0,label
x1,0
x2,1
x3,1
x4,2


In [29]:
print(X.min(),X.max())

feature1    0
feature2    1
feature3    2
feature4    3
dtype: int32 feature1    12
feature2    13
feature3    14
feature4    15
dtype: int32


In [30]:
X.loc[:,'feature1']=(X.loc[:,'feature1']-X.loc[:,'feature1'].min())/(X.loc[:,'feature1'].max()-X.loc[:,'feature1'].min())
X

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,feature1,feature2,feature3,feature4
x1,0.0,1,2,3
x2,0.333333,5,6,7
x3,0.666667,9,10,11
x4,1.0,13,14,15
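The warning above appears because `X` was taken from `df` with `iloc`, so pandas cannot tell whether the assignment is meant to affect `df` as well. One common way to avoid it, sketched here under the assumption that `X` should be independent of `df`, is to take an explicit copy first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["x1", "x2", "x3", "x4"],
    columns=["feature1", "feature2", "feature3", "feature4"],
)

# An explicit copy makes X independent of df, so assigning into it is unambiguous.
X = df.iloc[:, :4].copy()

col = X["feature1"]
X["feature1"] = (col - col.min()) / (col.max() - col.min())  # min-max scaling

print(X)
print(df["feature1"].tolist())  # the original frame is unchanged
```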


In [31]:
# Same operation: min-max normalize feature2
X.loc[:,'feature2']=(X.loc[:,'feature2']-X.loc[:,'feature2'].min())/(X.loc[:,'feature2'].max()-X.loc[:,'feature2'].min())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [32]:
X

Unnamed: 0,feature1,feature2,feature3,feature4
x1,0.0,0.0,2,3
x2,0.333333,0.333333,6,7
x3,0.666667,0.666667,10,11
x4,1.0,1.0,14,15
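Instead of repeating the expression per column, the same min-max scaling can be applied to every column at once, because `min()` and `max()` return one value per column and the arithmetic aligns on column labels. This is a sketch on a rebuilt copy of the feature frame (named `features` here); note that a constant column would produce a division by zero and yield NaN.

```python
import numpy as np
import pandas as pd

features = pd.DataFrame(
    np.arange(16, dtype=float).reshape(4, 4),
    index=["x1", "x2", "x3", "x4"],
    columns=["feature1", "feature2", "feature3", "feature4"],
)

# Column-wise min-max scaling: every column is mapped onto the range [0, 1].
normalized = (features - features.min()) / (features.max() - features.min())

print(normalized)
```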


## Generating dummy variables (one-hot encoding)
- pd.get_dummies()

In [33]:
df[['label']]

Unnamed: 0,label
x1,0
x2,1
x3,1
x4,2


In [34]:
y.label=y.label.astype('str')
y.dtypes

label    object
dtype: object

In [35]:
y_label=pd.get_dummies(y,prefix='class')
y_label

Unnamed: 0,class_0,class_1,class_2
x1,1,0,0
x2,0,1,0
x3,0,1,0
x4,0,0,1


In [36]:
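# Recover the original labels from the one-hot encoding
# (the loop variable is named row so that the DataFrame y is not overwritten)
for row in y_label.values:
    print(np.argmax(row))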

0
1
1
2
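The loop can also be written without iterating row by row. The sketch below rebuilds the label frame (as `labels`, an illustrative name) and shows two vectorized options: `argmax(axis=1)` for the position of the 1 in each row, and `idxmax(axis=1)` for the matching column name. Passing `dtype=int` to `get_dummies` is a choice made here so the dummies are 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

labels = pd.DataFrame({"label": ["0", "1", "1", "2"]}, index=["x1", "x2", "x3", "x4"])
one_hot = pd.get_dummies(labels, prefix="class", dtype=int)

# Position of the 1 in every row, computed in a single call.
positions = one_hot.to_numpy().argmax(axis=1)

# Column name holding the 1 in every row, with the prefix stripped again.
names = one_hot.idxmax(axis=1)
restored = names.str.replace("class_", "", regex=False)

print(positions)
print(restored.tolist())
```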


In [47]:
# Compare: does get_dummies still encode column 'C' if it is numeric?
df1 = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['1', '2', '3']})
#df1= pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],'C': [1, 2, 3]})
df1

Unnamed: 0,A,B,C
0,a,b,1
1,b,a,2
2,a,c,3


In [48]:
df1.dtypes

A    object
B    object
C    object
dtype: object

In [49]:
#pd.get_dummies(df1, prefix=['col1', 'col2'])
pd.get_dummies(df1, prefix=['col1', 'col2','col3'])

Unnamed: 0,col1_a,col1_b,col2_a,col2_b,col2_c,col3_1,col3_2,col3_3
0,1,0,0,1,0,1,0,0
1,0,1,1,0,0,0,1,0
2,1,0,0,0,1,0,0,1
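To answer the question posed in the comment above: by default `get_dummies` only encodes object, string, and categorical columns, so a numeric `C` passes through unchanged. The sketch below uses the numeric variant of the frame (named `mixed` here) and shows the `columns` parameter as one way to force encoding of a numeric column.

```python
import pandas as pd

mixed = pd.DataFrame({"A": ["a", "b", "a"], "B": ["b", "a", "c"], "C": [1, 2, 3]})

# Default behaviour: only the string columns A and B are encoded.
default = pd.get_dummies(mixed, dtype=int)

# Naming the columns explicitly encodes the numeric column C as well.
forced = pd.get_dummies(mixed, columns=["A", "B", "C"], dtype=int)

print(list(default.columns))
print(list(forced.columns))
```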


## The End