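# Advanced Data Management
## A Data Processing Challenge
To derive a single performance measure for every student, the Math, Science, and English scores shown in the table below need to be combined. You also want to assign an A to the top 20% of students, a B to the next 20%, and so on. Finally, you want to sort the students alphabetically.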

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-12/35367117.jpg)

## Numerical and Character Functions
### Mathematical Functions

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-12/73026866.jpg)

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-12/82857416.jpg)

### Statistical Functions

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-12/23372509.jpg)

Listing 5-1 Calculating the mean and standard deviation

In [1]:
x <- c(1,2,3,4,5,6,7,8)
mean(x)

In [2]:
sd(x)

#### Standardizing Data
By default, the scale() function standardizes each column of a matrix or data frame to a mean of 0 and a standard deviation of 1:

```r
newdata <- scale(mydata)
```

指定均值(M)和标准差(SD)的标准化：

```r
newdata <- scale(mydata) * SD + M
```

To standardize a single column rather than the whole matrix or data frame:

```r
newdata <- transform(mydata, myvar = scale(myvar) * SD + M)
```
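The following is a minimal sketch of how the two scale() patterns behave. The data frame, its column names `a` and `b`, the seed, and the target values M = 50 and SD = 10 are all made up for illustration.

```r
set.seed(42)                                   # arbitrary seed, only for reproducibility
mydata <- data.frame(a = rnorm(20, mean = 10, sd = 3),
                     b = runif(20, min = 0, max = 100))

z <- scale(mydata)                             # every column now has mean 0 and sd 1
z_means <- colMeans(z)
z_sds   <- apply(z, 2, sd)

M  <- 50                                       # hypothetical target mean
SD <- 10                                       # hypothetical target standard deviation
rescaled <- scale(mydata) * SD + M             # every column now has mean M and sd SD
r_means <- colMeans(rescaled)
r_sds   <- apply(rescaled, 2, sd)
```

Note that scale() returns a matrix even when it is given a data frame, so wrap the result in as.data.frame() if a data frame is needed afterwards.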

### Probability Functions
In R, probability functions take the form:

```r
[dpqr]distribution_abbreviation()
```

The first letter indicates which aspect of the distribution the function returns:

- d = density function
- p = (cumulative) distribution function
- q = quantile function
- r = random generation (random deviates)
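As a short illustration of the naming scheme, the sketch below applies all four prefixes to the normal distribution (abbreviation `norm`). The variable names and the seed are arbitrary choices.

```r
d0 <- dnorm(0)                    # density of the standard normal at 0, about 0.399
p  <- pnorm(1.96)                 # P(Z <= 1.96), about 0.975
q  <- qnorm(0.975)                # 97.5th percentile, about 1.96

set.seed(1234)                    # arbitrary seed so the draw is reproducible
r  <- rnorm(3, mean = 50, sd = 10)  # three random deviates from N(50, 10^2)
```

pnorm() and qnorm() are inverses of each other, so `pnorm(qnorm(0.975))` returns 0.975.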

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-12/96211627.jpg)

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-12/99964175.jpg)

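#### 1. Setting the Seed for Random Number Generation
Use the set.seed() function to specify the seed explicitly, which makes the generated pseudo-random numbers reproducible.

Listing 5-2 Generating pseudo-random numbers from a uniform distribution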

In [3]:
runif(5) # pseudo-random numbers uniformly distributed on (0,1)

In [4]:
runif(5)

In [5]:
set.seed(1234)
runif(5)

In [6]:
set.seed(1234)
runif(5)
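Since the cell outputs are not shown here, the effect of the seed can be checked programmatically. This is a minimal sketch; the variable names `first`, `second`, and `third` are illustrative.

```r
set.seed(1234)
first <- runif(5)

set.seed(1234)            # resetting the same seed restarts the same sequence
second <- runif(5)

third <- runif(5)         # no reset: the generator continues, so the values differ

same_after_reset <- identical(first, second)   # TRUE
same_without     <- identical(second, third)   # FALSE
```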

#### 2. Generating Multivariate Normal Data
Use the mvrnorm() function in the MASS package:

```r
mvrnorm(n, mean, sigma)
```

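- n: the desired sample size
- mean: the vector of means (the argument is named `mu` in mvrnorm())
- sigma: the variance-covariance (or correlation) matrix (the argument is named `Sigma` in mvrnorm())

Listing 5-3 Generating data from a multivariate normal distribution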

In [7]:
library(MASS)
options(digits=3)
set.seed(1234)

In [8]:
mean <- c(230.7, 146.7, 3.6)
sigma <- matrix(c(15360.8, 6721.2, -47.1,
                  6721.2, 4700.9, -16.5,
                  -47.1,   -16.5, 0.3), nrow=3, ncol=3)
mydata <- mvrnorm(500, mean, sigma)
mydata <- as.data.frame(mydata)
names(mydata) <- c("y","x1","x2")

In [9]:
dim(mydata)

In [10]:
head(mydata, n=10)

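| y | x1 | x2 |
|---|---|---|
| 98.8 | 41.3 | 3.43 |
| 244.5 | 205.2 | 3.8 |
| 375.7 | 186.7 | 2.51 |
| -59.2 | 11.2 | 4.71 |
| 313.0 | 111.0 | 3.45 |
| 288.8 | 185.1 | 2.72 |
| 134.8 | 165.0 | 4.39 |
| 171.7 | 97.4 | 3.64 |
| 167.2 | 101.0 | 3.5 |
| 121.1 | 94.5 | 4.1 |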
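One way to sanity-check simulated data is to compare its sample means with the requested mean vector. The sketch below reuses the mean vector and covariance matrix from Listing 5-3; the names `mu`, `Sigma`, `sim`, and `exact` are illustrative, and `empirical = TRUE` is an mvrnorm() option that makes the sample moments match the targets rather than merely approximate them.

```r
library(MASS)

mu <- c(230.7, 146.7, 3.6)
Sigma <- matrix(c(15360.8, 6721.2, -47.1,
                  6721.2, 4700.9, -16.5,
                  -47.1,   -16.5,   0.3), nrow = 3, ncol = 3)

set.seed(1234)
sim <- mvrnorm(500, mu, Sigma)                       # ordinary draw: sample means only approximate mu
sim_means <- colMeans(sim)

exact <- mvrnorm(500, mu, Sigma, empirical = TRUE)   # sample means reproduce mu
exact_means <- colMeans(exact)
```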


### Character Functions
![](http://ou8qjsj0m.bkt.clouddn.com//17-11-12/1832183.jpg)

### Other Useful Functions
![](http://ou8qjsj0m.bkt.clouddn.com//17-11-12/98382513.jpg)

### Applying Functions to Matrices and Data Frames
Listing 5-4 Applying functions to data objects

In [1]:
a <- 5
sqrt(a)

In [2]:
b <- c(1.243, 5.654, 2.99)
round(b)

In [3]:
c <- matrix(runif(12), nrow=3)
c

| [,1] | [,2] | [,3] | [,4] |
|---|---|---|---|
| 0.98950616 | 0.822879 | 0.4278709 | 0.4509339 |
| 0.43458092 | 0.2253356 | 0.1094183 | 0.6266923 |
| 0.06919287 | 0.8825891 | 0.235145 | 0.2817198 |


In [4]:
log(c)

| [,1] | [,2] | [,3] | [,4] |
|---|---|---|---|
| -0.01054929 | -0.1949462 | -0.8489338 | -0.7964345 |
| -0.8333731 | -1.4901643 | -2.212577 | -0.4672996 |
| -2.67085752 | -0.1248956 | -1.4475531 | -1.2668423 |


In [5]:
mean(c)

The apply() function has the following format:

```r
apply(x, MARGIN, FUN, ...)
```

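- x is the data object
- MARGIN is the dimension index (1 for rows, 2 for columns)
- FUN is the function to apply, either built-in or user-defined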
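To make the role of MARGIN and FUN concrete, here is a small sketch on a hypothetical 2 x 3 matrix, once with an anonymous user-defined function and once with a built-in one.

```r
m <- matrix(1:6, nrow = 2)     # filled column by column: rows are (1, 3, 5) and (2, 4, 6)

# MARGIN = 1: apply a user-defined function to each row
row_range <- apply(m, 1, function(v) max(v) - min(v))   # 4 4

# MARGIN = 2: apply a built-in function to each column
col_sums <- apply(m, 2, sum)                            # 3 7 11
```

Any further arguments after FUN are passed on to it, which is how an option such as `trim` can reach mean().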

Listing 5-5 Applying a function to the rows (columns) of a matrix

In [6]:
mydata <- matrix(rnorm(30), nrow=6)
mydata

| [,1] | [,2] | [,3] | [,4] | [,5] |
|---|---|---|---|---|
| -0.4539043 | 0.3313093 | -0.294497 | -1.525677 | -0.4540602 |
| 0.3775933 | -1.202011 | 1.2455713 | -0.497861 | 0.5445697 |
| 1.4159429 | -1.6577526 | -0.9670505 | 0.6904934 | -0.6496752 |
| -1.9689923 | 0.3985245 | 0.6669216 | -0.5416062 | 0.3743046 |
| 0.1625864 | 0.833311 | 1.130807 | -0.3811411 | 1.2769574 |
| -0.4050438 | 0.2914321 | 0.6424401 | -0.3838955 | -0.3006475 |


In [7]:
apply(mydata, 1, mean) # mean of each row

In [8]:
apply(mydata, 2, mean) # mean of each column

In [9]:
apply(mydata, 2, mean, trim=0.2) # trimmed mean of each column; the lowest and highest 20% of values are dropped

## A Solution to the Data Processing Challenge
Listing 5-6 One solution to the example

In [10]:
options(digits=2)
Student <- c("John Davis", "Angela Williams", "Bullwinkle Moose",
             "David Jones", "Janice Markhammer", "Cheryl Cushing",
             "Reuven Ytzrhak", "Greg Knox", "Joel England", "Mary Rayburn")
Math <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)
Science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86)
English <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18)
roster <- data.frame(Student, Math, Science, English, stringsAsFactors=FALSE)
roster

| Student | Math | Science | English |
|---|---|---|---|
| John Davis | 502 | 95 | 25 |
| Angela Williams | 600 | 99 | 22 |
| Bullwinkle Moose | 412 | 80 | 18 |
| David Jones | 358 | 82 | 15 |
| Janice Markhammer | 495 | 75 | 20 |
| Cheryl Cushing | 512 | 85 | 28 |
| Reuven Ytzrhak | 410 | 80 | 15 |
| Greg Knox | 625 | 95 | 30 |
| Joel England | 573 | 89 | 27 |
| Mary Rayburn | 522 | 86 | 18 |


In [11]:
z <- scale(roster[,2:4])    # standardize each subject (z-scores) so the scales are comparable
score <- apply(z, 1, mean)  # mean z-score of each row (student)
roster <- cbind(roster, score)
roster

| Student | Math | Science | English | score |
|---|---|---|---|---|
| John Davis | 502 | 95 | 25 | 0.56 |
| Angela Williams | 600 | 99 | 22 | 0.92 |
| Bullwinkle Moose | 412 | 80 | 18 | -0.86 |
| David Jones | 358 | 82 | 15 | -1.16 |
| Janice Markhammer | 495 | 75 | 20 | -0.63 |
| Cheryl Cushing | 512 | 85 | 28 | 0.35 |
| Reuven Ytzrhak | 410 | 80 | 15 | -1.05 |
| Greg Knox | 625 | 95 | 30 | 1.34 |
| Joel England | 573 | 89 | 27 | 0.7 |
| Mary Rayburn | 522 | 86 | 18 | -0.18 |


In [12]:
y <- quantile(score, c(.8, .6, .4, .2)) # percentile cut points (80th, 60th, 40th, 20th)
y

In [13]:
# create the grade variable by recoding each student's score against the percentile cut points
roster$grade[score >= y[1]] <- "A"
roster$grade[score < y[1] & score >= y[2]] <- "B"
roster$grade[score < y[2] & score >= y[3]] <- "C"
roster$grade[score < y[3] & score >= y[4]] <- "D"
roster$grade[score < y[4]] <- "F"
roster

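| Student | Math | Science | English | score | grade |
|---|---|---|---|---|---|
| John Davis | 502 | 95 | 25 | 0.56 | B |
| Angela Williams | 600 | 99 | 22 | 0.92 | A |
| Bullwinkle Moose | 412 | 80 | 18 | -0.86 | D |
| David Jones | 358 | 82 | 15 | -1.16 | F |
| Janice Markhammer | 495 | 75 | 20 | -0.63 | D |
| Cheryl Cushing | 512 | 85 | 28 | 0.35 | C |
| Reuven Ytzrhak | 410 | 80 | 15 | -1.05 | F |
| Greg Knox | 625 | 95 | 30 | 1.34 | A |
| Joel England | 573 | 89 | 27 | 0.7 | B |
| Mary Rayburn | 522 | 86 | 18 | -0.18 | C |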
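The five recoding assignments can also be written as a single cut() call. This is an alternative sketch, not the method used in the listing. The `score` vector holds the rounded values copied from the table above, on the assumption that rounding does not move any student across a cut point. `right = FALSE` makes each interval closed on the left, which mirrors the `>=` comparisons in the original recoding.

```r
# Rounded composite scores, in the same student order as the roster
score <- c(0.56, 0.92, -0.86, -1.16, -0.63, 0.35, -1.05, 1.34, 0.70, -0.18)

# Cut points at the 20th, 40th, 60th, and 80th percentiles, padded with -Inf and Inf
breaks <- c(-Inf, quantile(score, c(.2, .4, .6, .8)), Inf)

# Intervals are [lower, upper), labelled from the lowest band to the highest
grade <- cut(score, breaks = breaks,
             labels = c("F", "D", "C", "B", "A"), right = FALSE)
as.character(grade)
```

cut() returns a factor, which keeps the grade levels in a fixed order for later tabulation.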


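The strsplit() call below splits each full name at the space, producing a list with one component per student. sapply() then extracts the first element of each component into the vector Firstname and the second element into the vector Lastname. "[" is a function that extracts part of an object.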

In [14]:
name <- strsplit((roster$Student), " ")
Lastname <- sapply(name, "[", 2)
Firstname <- sapply(name, "[", 1)
roster <- cbind(Firstname, Lastname, roster[, -1]) # drop the first column (Student)
roster

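| Firstname | Lastname | Math | Science | English | score | grade |
|---|---|---|---|---|---|---|
| John | Davis | 502 | 95 | 25 | 0.56 | B |
| Angela | Williams | 600 | 99 | 22 | 0.92 | A |
| Bullwinkle | Moose | 412 | 80 | 18 | -0.86 | D |
| David | Jones | 358 | 82 | 15 | -1.16 | F |
| Janice | Markhammer | 495 | 75 | 20 | -0.63 | D |
| Cheryl | Cushing | 512 | 85 | 28 | 0.35 | C |
| Reuven | Ytzrhak | 410 | 80 | 15 | -1.05 | F |
| Greg | Knox | 625 | 95 | 30 | 1.34 | A |
| Joel | England | 573 | 89 | 27 | 0.7 | B |
| Mary | Rayburn | 522 | 86 | 18 | -0.18 | C |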
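The extraction step can be tried in isolation. The sketch below uses three of the names from the roster and swaps in vapply(), a stricter relative of sapply() that declares the expected result type up front; the variable names are illustrative.

```r
students <- c("John Davis", "Angela Williams", "Bullwinkle Moose")
parts <- strsplit(students, " ")                 # list of character vectors, one per student

# `[` is called as a function: `[`(x, 1) is the same as x[1]
first <- vapply(parts, `[`, character(1), 1)     # first element of each component
last  <- vapply(parts, `[`, character(1), 2)     # second element of each component
```

vapply() stops with an error if any component does not yield exactly one string, which would flag a malformed name early. Names with more than two words would still need separate handling.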


In [15]:
roster <- roster[order(Lastname, Firstname), ] # sort by last name, then first name
roster

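| | Firstname | Lastname | Math | Science | English | score | grade |
|---|---|---|---|---|---|---|---|
| 6 | Cheryl | Cushing | 512 | 85 | 28 | 0.35 | C |
| 1 | John | Davis | 502 | 95 | 25 | 0.56 | B |
| 9 | Joel | England | 573 | 89 | 27 | 0.7 | B |
| 4 | David | Jones | 358 | 82 | 15 | -1.16 | F |
| 8 | Greg | Knox | 625 | 95 | 30 | 1.34 | A |
| 5 | Janice | Markhammer | 495 | 75 | 20 | -0.63 | D |
| 3 | Bullwinkle | Moose | 412 | 80 | 18 | -0.86 | D |
| 10 | Mary | Rayburn | 522 | 86 | 18 | -0.18 | C |
| 2 | Angela | Williams | 600 | 99 | 22 | 0.92 | A |
| 7 | Reuven | Ytzrhak | 410 | 80 | 15 | -1.05 | F |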


## Control Flow
### Repetition and Looping
#### 1. The for Construct
```r
for (var in seq) statement
```

In [16]:
for (i in 1:10)  print("Hello")

[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"


#### 2. The while Construct
```r
while (cond) statement
```

In [17]:
i <- 10
while (i > 0) {print("Hello"); i <- i - 1}

[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"
[1] "Hello"


### Conditional Execution
#### 1. The if-else Construct
```r
if (cond) statement
if (cond) statement1 else statement2
```

In [19]:
grade <- 'A'
if (is.character(grade)) grade <- as.factor(grade)
if (!is.factor(grade)) grade <- as.factor(grade) else print("Grade already is a factor")

[1] "Grade already is a factor"


#### 2. The ifelse Construct
```r
ifelse(cond, statement1, statement2)
```

In [20]:
score <- 0.1
ifelse(score > 0.5, print("Passed"), print("Failed"))
outcome <- ifelse (score > 0.5, "Passed", "Failed")

[1] "Failed"
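ifelse() is vectorized, so it can recode an entire vector in one call, whereas if-else tests a single condition. A minimal sketch with a hypothetical vector `scores`:

```r
scores <- c(0.1, 0.7, 0.5, 0.9)                        # made-up scores

# The test is evaluated element by element; 0.5 is not greater than 0.5
outcome <- ifelse(scores > 0.5, "Passed", "Failed")
outcome   # "Failed" "Passed" "Failed" "Passed"
```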


#### 3. The switch Construct
```r
switch(expr, ...)
```

Listing 5-7 A switch example

In [21]:
feelings <- c("sad", "afraid")
for (i in feelings)
    print(
        switch(i,
               happy  = "I am glad you are happy",
               afraid = "There is nothing to fear",
               sad    = "Cheer up",
               angry  = "Calm down now"
            )
        )

[1] "Cheer up"
[1] "There is nothing to fear"


## User-Written Functions
```r
myfunction <- function(arg1, arg2, ... ){
    statements
    return(object)
}
```

Listing 5-8 mystats(): a user-written function for computing descriptive statistics

In [25]:
mystats <- function(x, parametric=TRUE, print=FALSE) {
    # choose parametric (mean, sd) or nonparametric (median, mad) statistics
    if (parametric) {
        center <- mean(x); spread <- sd(x)
    } else {
        center <- median(x); spread <- mad(x)
    }
    # && is the scalar logical operator, appropriate for single TRUE/FALSE flags
    if (print && parametric) {
        cat("Mean=", center, "\n", "SD=", spread, "\n")
    } else if (print && !parametric) {
        cat("Median=", center, "\n", "MAD=", spread, "\n")
    }
    result <- list(center=center, spread=spread)
    return(result)
}

set.seed(1234)
x <- rnorm(500)
y <- mystats(x)
y

In [23]:
y <- mystats(x, parametric=FALSE, print=TRUE)

Median= -0.021 
 MAD= 1 


A user-written function that uses the switch construct:

In [31]:
Sys.getenv("TZ")

In [32]:
Sys.setenv(TZ="Asia/Shanghai")
Sys.getenv("TZ")

In [33]:
mydate <- function(type="long") {
    switch(type,
           long = format(Sys.time(), "%A %B %d %Y"),
           short = format(Sys.time(), "%m-%d-%y"),
           cat(type, "is not a recognized type\n")
        )
}

mydate("long")

In [34]:
mydate("short")

In [35]:
mydate()

In [36]:
mydate("medium")

medium is not a recognized type


## Aggregation and Reshaping
### Transpose
Listing 5-9 Transposing a dataset

In [37]:
cars <- mtcars[1:5,1:4]
cars

| | mpg | cyl | disp | hp |
|---|---|---|---|---|
| Mazda RX4 | 21 | 6 | 160 | 110 |
| Mazda RX4 Wag | 21 | 6 | 160 | 110 |
| Datsun 710 | 23 | 4 | 108 | 93 |
| Hornet 4 Drive | 21 | 6 | 258 | 110 |
| Hornet Sportabout | 19 | 8 | 360 | 175 |


In [38]:
t(cars)

| | Mazda RX4 | Mazda RX4 Wag | Datsun 710 | Hornet 4 Drive | Hornet Sportabout |
|---|---|---|---|---|---|
| mpg | 21 | 21 | 23 | 21 | 19 |
| cyl | 6 | 6 | 4 | 6 | 8 |
| disp | 160 | 160 | 108 | 258 | 360 |
| hp | 110 | 110 | 93 | 110 | 175 |


### Aggregating Data
```r
aggregate(x, by, FUN)
```

Listing 5-10 Aggregating data

In [39]:
options(digits=3)
attach(mtcars)
aggdata <-aggregate(mtcars, by=list(cyl,gear), FUN=mean, na.rm=TRUE)
aggdata

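| Group.1 | Group.2 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 3 | 21.5 | 4 | 120 | 97 | 3.7 | 2.46 | 20.0 | 1.0 | 0.0 | 3 | 1.0 |
| 6 | 3 | 19.8 | 6 | 242 | 108 | 2.92 | 3.34 | 19.8 | 1.0 | 0.0 | 3 | 1.0 |
| 8 | 3 | 15.1 | 8 | 358 | 194 | 3.12 | 4.1 | 17.1 | 0.0 | 0.0 | 3 | 3.08 |
| 4 | 4 | 26.9 | 4 | 103 | 76 | 4.11 | 2.38 | 19.6 | 1.0 | 0.75 | 4 | 1.5 |
| 6 | 4 | 19.8 | 6 | 164 | 116 | 3.91 | 3.09 | 17.7 | 0.5 | 0.5 | 4 | 4.0 |
| 4 | 5 | 28.2 | 4 | 108 | 102 | 4.1 | 1.83 | 16.8 | 0.5 | 1.0 | 5 | 2.0 |
| 6 | 5 | 19.7 | 6 | 145 | 175 | 3.62 | 2.77 | 15.5 | 0.0 | 1.0 | 5 | 6.0 |
| 8 | 5 | 15.4 | 8 | 326 | 300 | 3.88 | 3.37 | 14.6 | 0.0 | 1.0 | 5 | 6.0 |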
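In the result above, Group.1 is cyl and Group.2 is gear. The generic names arise because the grouping list was unnamed. As a sketch of two alternatives that give readable column names without attach(): name the list elements, or use the formula interface of aggregate(). Only mpg is aggregated here to keep the output small; the object names are illustrative.

```r
# Named grouping list: the names become the grouping column names
agg_named <- aggregate(mtcars["mpg"],
                       by = list(cyl = mtcars$cyl, gear = mtcars$gear),
                       FUN = mean)

# Formula interface: mean mpg for each cyl/gear combination
agg_formula <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)

agg_formula
```

Both calls return one row per cyl/gear combination that occurs in mtcars, with columns cyl, gear, and mpg.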


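### The reshape2 Package
You first `melt` the data so that each row is a unique identifier-variable combination. You then `cast` the melted data into whatever shape you want. During the cast, you can aggregate the data with any function.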

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-13/83299132.jpg)

In this dataset, the `measurements` are the values in the last two columns (5, 6, 3, 5, 6, 1, 2, 4).

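#### 1. Melting
Melting a dataset restructures it into a format in which each measured variable occupies its own row, together with the identifier variables needed to uniquely identify that measurement.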

In [41]:
library(reshape2)
md <- melt(mydata, id=c("ID", "Time"))
md

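Note that the output below comes from melting the 6 x 5 matrix `mydata` created in Listing 5-5, not the ID/Time dataset described above. For a matrix, melt() returns Var1 (row index), Var2 (column index), and value, and the `id` argument has no effect, as the column names show. Only the first 10 of the 30 rows appear. To get identifier-variable output, `mydata` must first be defined as a data frame containing ID, Time, and the measured variables.

| Var1 | Var2 | value |
|---|---|---|
| 1 | 1 | -0.454 |
| 2 | 1 | 0.378 |
| 3 | 1 | 1.416 |
| 4 | 1 | -1.969 |
| 5 | 1 | 0.163 |
| 6 | 1 | -0.405 |
| 1 | 2 | 0.331 |
| 2 | 2 | -1.202 |
| 3 | 2 | -1.658 |
| 4 | 2 | 0.399 |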
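For comparison, here is a minimal sketch of melting a data frame that has the ID/Time layout. The X1 and X2 values are the measurements quoted earlier (5, 6, 3, 5, 6, 1, 2, 4, read row by row). The ID and Time values and the column names X1 and X2 are assumptions made for illustration, because the original table is available only as an image.

```r
library(reshape2)

# Hypothetical wide-format data: two subjects (ID), each measured at two times
mydata <- data.frame(ID   = c(1, 1, 2, 2),
                     Time = c(1, 2, 1, 2),
                     X1   = c(5, 3, 6, 2),
                     X2   = c(6, 5, 1, 4))

# id.vars names the identifier columns; all other columns are treated as measured
md <- melt(mydata, id.vars = c("ID", "Time"))
md
```

Each of the 8 rows of `md` holds one measurement, tagged with ID, Time, and the name of the variable it came from (in a column called `variable`).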


#### 2. Casting
The dcast() function reads the melted data and reshapes it using a formula you provide and an (optional) function for aggregating the data.

```r
newdata <- dcast(md, formula, fun.aggregate)
```

- md: the melted data
- formula: describes the desired end result
- fun.aggregate: the (optional) aggregating function

The formula takes the form:

```r
rowvar1 + rowvar2 + ... ~ colvar1 + colvar2 + ...
```
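In this formula, the variables to the left of `~` identify the rows of the result and those to the right spread out into columns. The sketch below shows two casts on a small melted data frame. The data values for ID and Time and the column names X1 and X2 are assumptions for illustration, as are the object names `by_id` and `wide`.

```r
library(reshape2)

# Hypothetical wide-format data, melted so that ID and Time are identifiers
mydata <- data.frame(ID   = c(1, 1, 2, 2),
                     Time = c(1, 2, 1, 2),
                     X1   = c(5, 3, 6, 2),
                     X2   = c(6, 5, 1, 4))
md <- melt(mydata, id.vars = c("ID", "Time"))

# One row per ID, one column per measured variable, averaging over Time
by_id <- dcast(md, ID ~ variable, mean)

# ID + Time identifies every row uniquely, so no aggregation is needed
# and the original wide layout is recovered
wide <- dcast(md, ID + Time ~ variable)
```

When the row and column variables do not identify each cell uniquely, as in the first cast, an aggregating function is needed to collapse the multiple values into one.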

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-13/67558187.jpg)