
## Common data frame operations in R


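1. Inspect the data

  + `dplyr::glimpse`: compact overview of every column
  + `%>%`: pipe operator; passes the left-hand result on as the first argument of the next call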

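2. Reshape the data

  + `tidyr::gather`: wide --> long (superseded by `tidyr::pivot_longer` since tidyr 1.0.0)
  + `tidyr::spread`: long --> wide (superseded by `tidyr::pivot_wider` since tidyr 1.0.0)
  + `tidyr::separate`: col_paste --> col1, col2, col3, ...
  + `tidyr::unite`: col1, col2, col3, ... --> col_paste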

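3. Subset rows

  + `dplyr::filter`: keep the rows that satisfy a condition
  + `dplyr::distinct`: drop duplicate rows
  + `dplyr::sample_frac`: randomly sample a fraction of the rows
  + `dplyr::sample_n`: randomly sample n rows
  + `dplyr::slice`: select rows by position
  + `dplyr::top_n`: keep the top n rows ranked by a variable (not simply the first n rows)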

4. Subset columns

  + `dplyr::select`: select variables (columns)

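5. Summarise the data

  + `dplyr::summarise`: collapse the data to a single row of summary values
  + `dplyr::summarise_each`: apply a summary function to every column (deprecated in later dplyr releases in favour of `across()`)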

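6. Create new columns

  + `dplyr::mutate`: create one or more columns
  + `dplyr::mutate_each`: apply a window function to every column; common window functions include `cumsum`, `dplyr::lead` and `pmin` (deprecated in later dplyr releases in favour of `across()`)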

7. Group the data

  + `dplyr::group_by`: usually combined with other verbs, e.g. `iris %>% group_by(...) %>% summarise(...)` or `iris %>% group_by(...) %>% mutate(...)`

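8. Combine data (joins)

  + `dplyr::left_join`: left join; keep every row of the left table
  + `dplyr::right_join`: right join; keep every row of the right table
  + `dplyr::full_join`: full join; keep every row of both tables
  + `dplyr::inner_join`: inner join; keep only rows with a match in both tables
  + `dplyr::semi_join`: keep the rows of the left table that have a match in the right table (left columns only)
  + `dplyr::anti_join`: keep the rows of the left table that have no match in the right table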

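9. Set operations

  + `dplyr::intersect`: rows that appear in both tables
  + `dplyr::union`: rows that appear in either table, duplicates removed
  + `dplyr::setdiff`: rows that appear in the first table but not in the second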

10. Bind data

  + `dplyr::bind_rows`: stack tables on top of each other
  + `dplyr::bind_cols`: place tables side by side
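
For orientation, here is a rough pandas counterpart for the reshaping, grouping, joining and binding verbs in the list above. It is only a sketch: the frames `wide` and `labels` are toy data made up for illustration, and the pandas calls are approximate equivalents, not exact matches for the dplyr/tidyr semantics.

```python
import pandas as pd

# Hypothetical toy data, not the real iris table.
wide = pd.DataFrame({
    "id": [1, 2],
    "sepal_len": [5.1, 4.9],
    "petal_len": [1.4, 1.5],
})
labels = pd.DataFrame({"id": [1, 3], "species": ["setosa", "virginica"]})

# wide --> long, roughly tidyr::gather
long = wide.melt(id_vars="id", var_name="measure", value_name="value")

# long --> wide, roughly tidyr::spread
wide_again = long.pivot(index="id", columns="measure", values="value").reset_index()

# group_by(...) %>% summarise(...)
means = long.groupby("measure", as_index=False)["value"].mean()

# left_join / inner_join: the join type is chosen with how=
left = wide.merge(labels, on="id", how="left")
inner = wide.merge(labels, on="id", how="inner")

# bind_rows / bind_cols (concat aligns on the index when axis=1)
stacked = pd.concat([wide, wide], ignore_index=True)
side_by_side = pd.concat([wide, labels.add_prefix("lab_")], axis=1)

print(long)
print(means)
print(left)
```

Unlike `bind_cols`, `pd.concat(..., axis=1)` matches rows by index label rather than by position, so the two frames should share an index before they are placed side by side.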



## Equivalent operations in pandas


In [22]:
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
# print(iris.keys())
dat = pd.DataFrame(iris.data, columns=iris.feature_names)

In [23]:
dat.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [28]:
# rename the columns
dat.columns = ["sepal_len", "sepal_width", "petal_len", "petal_width"]
dat.head()
# row labels live in dat.index and can be reassigned the same way
# dat.index

|   | sepal_len | sepal_width | petal_len | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
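
Assigning to `dat.columns` replaces every label at once and modifies the frame in place. When only some columns need new names, `DataFrame.rename` with a mapping may be more convenient. A minimal sketch on a hypothetical two-column frame `df` (not the full iris data):

```python
import pandas as pd

# Toy frame using two of the original iris column labels.
df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7],
    "sepal width (cm)": [3.5, 3.0, 3.2],
})

# rename() returns a new frame and only touches the labels named in the mapping.
renamed = df.rename(columns={
    "sepal length (cm)": "sepal_len",
    "sepal width (cm)": "sepal_width",
})

# Row labels can be renamed the same way through index=.
relabelled = renamed.rename(index={0: "a", 1: "b", 2: "c"})

print(list(renamed.columns))
print(list(relabelled.index))
```

Because `rename` returns a copy, `df` keeps its original labels unless the result is assigned back.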


In [25]:
# summary statistics per column (closest to R's summary(); dat.info() is the nearer match to glimpse)
dat.describe()

|   | sepal_len | sepal_width | petal_len | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
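
`describe()` returns a fixed set of statistics. For something closer to `dplyr::summarise`, where you choose the statistics yourself, `DataFrame.agg` is one option. A small sketch on a hypothetical five-row frame `df`:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
})

# One statistic for every column: a Series with one value per column.
col_means = df.mean()

# Several statistics at once: a small frame with one row per statistic.
stats = df.agg(["mean", "min", "max"])

# A different statistic for each column: a Series keyed by column name.
mixed = df.agg({"sepal_len": "mean", "sepal_width": "max"})

print(stats)
print(mixed)
```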


In [26]:
dat.dtypes

sepal_len      float64
sepal_width    float64
petal_len      float64
petal_width    float64
dtype: object
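
`dtypes` shows only the column types. For an overview nearer to `dplyr::glimpse`, `DataFrame.info()` also reports the row count and the number of non-null values per column. A sketch on a hypothetical frame `df`; the output is captured in a buffer here only so that it can be inspected as a string:

```python
import io

import pandas as pd

df = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 4.7],
    "class": [0, 0, 0],
})

# info() normally prints to stdout; buf= redirects the text into a buffer.
buf = io.StringIO()
df.info(buf=buf)
overview = buf.getvalue()

print(overview)
print(df.shape)  # (rows, columns)
```

Unlike `glimpse`, `info()` does not show any values; pairing it with `head()` gives a similar picture.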

In [30]:
# add one column
dat["class"] = pd.Series(iris.target)
dat.head()

|   | sepal_len | sepal_width | petal_len | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
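
Column assignment changes `dat` in place. A closer analogue of `dplyr::mutate`, which returns a new table, is `DataFrame.assign`. The sketch below uses a hypothetical frame `df`, and the new column names are invented for illustration; it also shows `cumsum` and `shift(-1)` as rough counterparts of the window functions `cumsum` and `dplyr::lead`.

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
})

out = df.assign(
    # new column computed from existing ones
    sepal_ratio=lambda d: d["sepal_len"] / d["sepal_width"],
    # running total, a typical window function
    sepal_len_cumsum=lambda d: d["sepal_len"].cumsum(),
    # value from the next row, similar to dplyr::lead (last row becomes NaN)
    next_sepal_len=lambda d: d["sepal_len"].shift(-1),
)

print(out)
```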


In [41]:
# subset row
dat[(dat["class"] == 0) | (dat["class"] == 1)].shape

(100, 5)
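
Chained `|` comparisons get long as the list of accepted values grows. Two alternatives are `Series.isin` and `DataFrame.query`. A sketch on a hypothetical four-row frame `df`; the `query` example filters on `sepal_len` because `class` is a Python keyword and is awkward to use inside a query string.

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_len": [5.1, 7.0, 6.3, 4.9],
    "class": [0, 1, 2, 0],
})

# Boolean mask built from explicit comparisons, as in the cell above.
by_mask = df[(df["class"] == 0) | (df["class"] == 1)]

# Same rows, using membership in a list of accepted values.
by_isin = df[df["class"].isin([0, 1])]

# Condition written as a string expression.
long_sepal = df.query("sepal_len > 5.0")

print(by_isin.shape)
print(long_sepal)
```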

In [45]:
# view one row (.ix was removed in pandas 1.0; use .loc for labels or .iloc for positions)
dat.loc[0]

sepal_len      5.1
sepal_width    3.5
petal_len      1.4
petal_width    0.2
class          0.0
Name: 0, dtype: float64
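
Label-based and position-based row access give the same result only while the index is the default 0..n-1. A sketch with a hypothetical frame `df` whose index is deliberately different:

```python
import pandas as pd

df = pd.DataFrame(
    {"sepal_len": [5.1, 4.9, 4.7], "class": [0, 0, 1]},
    index=[10, 20, 30],  # labels chosen so that they differ from positions
)

first_by_position = df.iloc[0]   # first row, whatever its label is
row_by_label = df.loc[20]        # the row whose index label is 20
cell = df.loc[20, "sepal_len"]   # a single value: row label, column label

print(first_by_position)
print(cell)
```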

In [56]:
# drop duplicated rows
print(dat.shape)
print(dat.duplicated().head())

# these duplicated rows are genuine observations, so they should not be removed in this case
print(dat[dat.duplicated()])
print(dat.drop_duplicates().shape)                # see ?dat.drop_duplicates for details

(150, 5)
0    False
1    False
2    False
3    False
4    False
dtype: bool
     sepal_len  sepal_width  petal_len  petal_width  class
34         4.9          3.1        1.5          0.1      0
37         4.9          3.1        1.5          0.1      0
142        5.8          2.7        5.1          1.9      2
(147, 5)
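
`duplicated` and `drop_duplicates` compare whole rows by default. Both accept `keep=`, and both can be restricted to a few columns with `subset=`, which is roughly what `dplyr::distinct` does when given column names. A sketch on a hypothetical four-row frame `df`:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_len": [4.9, 4.9, 5.8, 5.8],
    "petal_len": [1.5, 1.5, 5.1, 5.0],
    "class": [0, 0, 2, 2],
})

# True for the second and later copies of a fully identical row.
full_dupes = df.duplicated()

# keep=False flags every member of a duplicated group, including the first.
all_copies = df[df.duplicated(keep=False)]

# Drop fully identical rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Compare only two columns and keep the last row of each group.
by_key = df.drop_duplicates(subset=["sepal_len", "class"], keep="last")

print(all_copies)
print(by_key)
```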


In [59]:
# sample rows

dat.sample(frac=.5, replace=False).shape        # see ?dat.sample for details

(75, 5)
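
`sample` covers both `dplyr::sample_frac` (via `frac=`) and `dplyr::sample_n` (via `n=`). Passing `random_state` makes the draw reproducible. A sketch on a hypothetical ten-row frame `df`; the seed value 0 is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# A fraction of the rows, without replacement.
half = df.sample(frac=0.5, random_state=0)

# A fixed number of rows; the same seed gives the same rows again.
three = df.sample(n=3, random_state=0)
again = df.sample(n=3, random_state=0)

# Sampling with replacement can return more rows than the frame has.
boot = df.sample(n=20, replace=True, random_state=0)

print(half.shape, three.shape, boot.shape)
```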

In [61]:
# select columns by position (dat[[1, 2, 3]] raises a KeyError in current pandas, so use .iloc)

dat.iloc[:, [1, 2, 3]].head()

|   | sepal_width | petal_len | petal_width |
|---|---|---|---|
| 0 | 3.5 | 1.4 | 0.2 |
| 1 | 3.0 | 1.4 | 0.2 |
| 2 | 3.2 | 1.3 | 0.2 |
| 3 | 3.1 | 1.5 | 0.2 |
| 4 | 3.6 | 1.4 | 0.2 |


In [62]:
dat[["sepal_len", "sepal_width"]].head()

|   | sepal_len | sepal_width |
|---|---|---|
| 0 | 5.1 | 3.5 |
| 1 | 4.9 | 3.0 |
| 2 | 4.7 | 3.2 |
| 3 | 4.6 | 3.1 |
| 4 | 5.0 | 3.6 |


In [64]:
?dat.filter
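
Despite its name, `DataFrame.filter` is not the counterpart of `dplyr::filter`: it selects row or column labels and never looks at the values, so it is closer to `dplyr::select` with helpers such as `contains()` or `matches()`. A sketch on a hypothetical one-row frame `df`:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_len": [5.1], "sepal_width": [3.5],
    "petal_len": [1.4], "petal_width": [0.2],
    "class": [0],
})

# Columns whose label contains a substring.
sepal_cols = df.filter(like="sepal")

# Columns whose label matches a regular expression.
len_cols = df.filter(regex="_len$")

# Columns listed explicitly; labels that do not exist are silently skipped.
named = df.filter(items=["class", "petal_len", "not_a_column"])

# The complementary operation: drop columns by name.
no_class = df.drop(columns=["class"])

print(list(sepal_cols.columns))
print(list(len_cols.columns))
```

Row filtering by value is done with boolean masks or `query`, as in the row-subset cell earlier.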