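# Basic Data Management
## A working example
1. Do men and women in management positions differ in the degree to which they defer to superiors?
1. Does this vary from country to country, or are these gender differences universal?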

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-8/71100876.jpg)

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-8/60349231.jpg)

Listing 4-1. Creating the leadership data frame

```r
manager <- c(1, 2, 3, 4, 5)
date <- c("10/24/08", "10/28/08", "10/1/08", "10/12/08", "5/1/09")
country <- c("US", "US", "UK", "UK", "UK")
gender <- c("M", "F", "F", "M", "F")
age <- c(32, 45, 25, 39, 99)
q1 <- c(5, 3, 3, 3, 2)
q2 <- c(4, 5, 5, 3, 2)
q3 <- c(5, 2, 5, 4, 1)
q4 <- c(5, 5, 5, NA, 2)
q5 <- c(5, 5, 2, NA, 1)
leadership <- data.frame(manager, date, country, gender, age,
                         q1, q2, q3, q4, q5, stringsAsFactors=FALSE)
leadership
```

| manager | date | country | gender | age | q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10/24/08 | US | M | 32 | 5 | 4 | 5 | 5 | 5 |
| 2 | 10/28/08 | US | F | 45 | 3 | 5 | 2 | 5 | 5 |
| 3 | 10/1/08 | UK | F | 25 | 3 | 5 | 5 | 5 | 2 |
| 4 | 10/12/08 | UK | M | 39 | 3 | 3 | 4 | NA | NA |
| 5 | 5/1/09 | UK | F | 99 | 2 | 2 | 1 | 2 | 1 |


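## Creating new variables
Listing 4-2. Creating new variables

```r
# Option 1: reference each column explicitly with $
mydata<-data.frame(x1 = c(2, 2, 6, 4),
                   x2 = c(3, 4, 2, 8))
mydata$sumx  <-  mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2)/2

# Option 2: attach the data frame so columns can be referenced by name
attach(mydata)
mydata$sumx  <-  x1 + x2
mydata$meanx <- (x1 + x2)/2
detach(mydata)

# Option 3: use transform()
mydata <- transform(mydata,
                    sumx = x1+x2,
                    meanx = (x1 + x2)/2)
```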

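## Recoding variables
Recoding is the process of creating new values of a variable based on the existing values of the same variable and/or other variables. For example, you may want to:
- change a continuous variable into a set of categories;
- replace miscoded values with correct values;
- create a pass/fail variable based on a set of cutoff scores.

Recode the age value 99 as missing: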

```r
leadership$age[leadership$age == 99] <- NA
```

Next, create the agecat variable:

```r
leadership$agecat[leadership$age  > 75]  <- "Elder"
leadership$agecat[leadership$age >= 55 & leadership$age <= 75]  <- "Middle Aged"
leadership$agecat[leadership$age  < 55]  <- "Young"
```

A more compact way to write this uses within():

```r
leadership <- within(leadership,{
                    agecat <- NA
                    agecat[age > 75]              <- "Elder"
                    agecat[age >= 55 & age <= 75] <- "Middle Aged"
                    agecat[age < 55]              <- "Young" })
```
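The same binning can also be sketched with base R's `cut()`. This is an illustrative alternative rather than the approach used above. It assumes ages are whole numbers, so a break at 54 reproduces the `age < 55` boundary, and the `ages` vector is made-up sample data.

```r
# Hypothetical sample ages (whole numbers), including the boundary values
ages <- c(32, 45, 25, 39, NA, 55, 75, 76)

# Intervals are right-closed by default: (-Inf, 54], (54, 75], (75, Inf)
agecat <- cut(ages,
              breaks = c(-Inf, 54, 75, Inf),
              labels = c("Young", "Middle Aged", "Elder"))

# cut() returns a factor; convert to character to match the code above
agecat <- as.character(agecat)
agecat
```

Missing ages stay missing in the result, so no separate initialization with `NA` is needed.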

## Renaming variables
You can rename variables with the names() function:

```r
names(leadership)[2] <- "testDate"
```

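```r
names(leadership)

names(leadership)[2] <- "testdate"
leadership
```

| manager | testdate | country | gender | age | q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10/24/08 | US | M | 32 | 5 | 4 | 5 | 5 | 5 |
| 2 | 10/28/08 | US | F | 45 | 3 | 5 | 2 | 5 | 5 |
| 3 | 10/1/08 | UK | F | 25 | 3 | 5 | 5 | 5 | 2 |
| 4 | 10/12/08 | UK | M | 39 | 3 | 3 | 4 | NA | NA |
| 5 | 5/1/09 | UK | F | 99 | 2 | 2 | 1 | 2 | 1 |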


In a similar fashion, you can rename several variables at once:

```r
names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")
``` 

The plyr package provides a rename() function for changing variable names. Its format is:

```r
rename(dataframe, c(oldname="newname", oldname="newname",...))
```

```r
library(plyr)
leadership <- rename(leadership,
                     c(manager="managerID", testdate="testDate"))

names(leadership)
```
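If plyr is not available, renaming by name can be sketched in base R with a lookup vector. The data frame `df` and the `lookup` vector below are hypothetical and exist only for illustration.

```r
# Small stand-in data frame (hypothetical)
df <- data.frame(manager = 1:3,
                 testdate = c("10/24/08", "10/28/08", "10/1/08"),
                 stringsAsFactors = FALSE)

# Names are the old column names, values are the new ones
lookup <- c(manager = "managerID", testdate = "testDate")

# Rename only the columns that have an entry in the lookup
hit <- names(df) %in% names(lookup)
names(df)[hit] <- unname(lookup[names(df)[hit]])
names(df)
```

Matching on names rather than positions keeps the code correct if the column order changes.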

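## Missing values
Listing 4-3. Applying the is.na() function

```r
is.na(leadership[,6:10])
```

| q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|
| FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | TRUE | TRUE |
| FALSE | FALSE | FALSE | FALSE | FALSE |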


### Recoding values to missing
```r
leadership$age[leadership$age == 99] <- NA
```

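### Excluding missing values from analyses

```r
x <- c(1, 2, NA, 3)
y <- x[1] + x[2] + x[3] + x[4]
y   # NA: arithmetic involving NA yields NA

z <- sum(x)
z   # NA as well

x <- c(1, 2, NA, 3)
y <- sum(x, na.rm=TRUE)
y   # 6: na.rm=TRUE drops missing values before summing
```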

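Listing 4-4. Using na.omit() to delete incomplete observations

```r
leadership
```

| managerID | testDate | country | gender | age | q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10/24/08 | US | M | 32 | 5 | 4 | 5 | 5 | 5 |
| 2 | 10/28/08 | US | F | 45 | 3 | 5 | 2 | 5 | 5 |
| 3 | 10/1/08 | UK | F | 25 | 3 | 5 | 5 | 5 | 2 |
| 4 | 10/12/08 | UK | M | 39 | 3 | 3 | 4 | NA | NA |
| 5 | 5/1/09 | UK | F | 99 | 2 | 2 | 1 | 2 | 1 |

```r
newdata <- na.omit(leadership)
newdata
```

| | managerID | testDate | country | gender | age | q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 10/24/08 | US | M | 32 | 5 | 4 | 5 | 5 | 5 |
| 2 | 2 | 10/28/08 | US | F | 45 | 3 | 5 | 2 | 5 | 5 |
| 3 | 3 | 10/1/08 | UK | F | 25 | 3 | 5 | 5 | 5 | 2 |
| 5 | 5 | 5/1/09 | UK | F | 99 | 2 | 2 | 1 | 2 | 1 |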
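Before deleting rows, it can help to check how many would be lost. A minimal sketch using `complete.cases()` follows; the data frame `df` is a made-up stand-in with the same missing-value pattern as q4 and q5.

```r
# Hypothetical data frame with one incomplete row
df <- data.frame(id = 1:5,
                 q4 = c(5, 5, 5, NA, 2),
                 q5 = c(5, 5, 2, NA, 1))

ok <- complete.cases(df)   # TRUE for rows with no missing values
n_dropped <- sum(!ok)      # number of rows na.omit() would remove
kept <- df[ok, ]           # the same rows that na.omit(df) keeps

n_dropped
kept
```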


## Date values
![](http://ou8qjsj0m.bkt.clouddn.com//17-11-8/28463186.jpg)

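```r
# Default input format is yyyy-mm-dd
mydates <- as.Date(c("2007-06-22", "2004-02-13"), tz = "GMT")
mydates

# Read dates given as mm/dd/yyyy
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y", tz = "GMT")
dates

# Convert the testDate column, stored as mm/dd/yy strings, to class Date
myformat <- "%m/%d/%y"
leadership$testDate <- as.Date(leadership$testDate, myformat, tz = "GMT")
leadership
```

| managerID | testDate | country | gender | age | q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2008-10-24 | US | M | 32 | 5 | 4 | 5 | 5 | 5 |
| 2 | 2008-10-28 | US | F | 45 | 3 | 5 | 2 | 5 | 5 |
| 3 | 2008-10-01 | UK | F | 25 | 3 | 5 | 5 | 5 | 2 |
| 4 | 2008-10-12 | UK | M | 39 | 3 | 3 | 4 | NA | NA |
| 5 | 2009-05-01 | UK | F | 99 | 2 | 2 | 1 | 2 | 1 |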


```r
Sys.Date() # returns the current date

date()     # returns the current date and time as a character string

today <- Sys.Date()
format(today, format="%B %d %Y")

format(today, format="%A")

# Arithmetic can be performed on dates
startdate <- as.Date("2004-02-13")
enddate   <- as.Date("2011-01-22")
days      <- enddate - startdate
days
# Time difference of 2535 days

# difftime() computes a time interval and can express it in
# weeks, days, hours, minutes, or seconds
today <- Sys.Date()
dob   <- as.Date("1956-10-12")
difftime(today, dob, units="weeks")
# Time difference of 3186.714 weeks (the value depends on the current date)
```
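Going the other way, a Date can be turned back into text or plain numbers. The sketch below reuses the two sample dates from above; the variable names `iso`, `yrs`, and `gap_days` are illustrative only.

```r
dates <- as.Date(c("01/05/1965", "08/16/1975"), format = "%m/%d/%Y")

# Convert dates back to character strings (yyyy-mm-dd)
iso <- as.character(dates)

# Extract a single component, here the four-digit year
yrs <- format(dates, "%Y")

# Strip the difftime class to get the interval as a plain number of days
gap_days <- as.numeric(difftime(dates[2], dates[1], units = "days"))

iso
yrs
gap_days
```

Only numeric format codes are used here, since `%B` and `%A` produce locale-dependent names.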

## Type conversions
![](http://ou8qjsj0m.bkt.clouddn.com//17-11-8/54975715.jpg)

Listing 4-5. Converting from one data type to another

```r
a <- c(1,2,3)
a
is.numeric(a)    # TRUE
is.vector(a)     # TRUE

a <- as.character(a)
a                # "1" "2" "3"
is.numeric(a)    # FALSE
is.vector(a)     # TRUE
is.character(a)  # TRUE
```
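Conversions can fail silently for individual elements. As a small sketch with made-up input, `as.numeric()` turns strings that are not numbers into `NA` and issues a warning, which `is.na()` can then be used to locate:

```r
# Hypothetical character input with one non-numeric entry
b <- c("1", "2", "three")

# Non-numeric strings become NA; the coercion warning is suppressed here
nums <- suppressWarnings(as.numeric(b))

# Identify which original values could not be converted
bad <- b[is.na(nums)]

nums
bad
```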

## Sorting data
You can sort a data frame with the order() function. By default the sort order is ascending.

```r
newdata <- leadership[order(leadership$age),]
```

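```r
# Sort by gender, then by ascending age within each gender
attach(leadership)
newdata <- leadership[order(gender, age),]
detach(leadership)
newdata
```

```
The following objects are masked _by_ .GlobalEnv:

    age, country, gender, q1, q2, q3, q4, q5
```

This message appears because the vectors age, country, gender, and q1 to q5 created in Listing 4-1 still exist in the global environment, where they take precedence over the attached columns of the same name. Here both hold the same values, so the result is unaffected.

| | managerID | testDate | country | gender | age | q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | 2008-10-01 | UK | F | 25 | 3 | 5 | 5 | 5 | 2 |
| 2 | 2 | 2008-10-28 | US | F | 45 | 3 | 5 | 2 | 5 | 5 |
| 5 | 5 | 2009-05-01 | UK | F | 99 | 2 | 2 | 1 | 2 | 1 |
| 1 | 1 | 2008-10-24 | US | M | 32 | 5 | 4 | 5 | 5 | 5 |
| 4 | 4 | 2008-10-12 | UK | M | 39 | 3 | 3 | 4 | NA | NA |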


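```r
# Sort by gender, then by descending age within each gender
attach(leadership)
newdata <-leadership[order(gender, -age),]
detach(leadership)
newdata
```

```
The following objects are masked _by_ .GlobalEnv:

    age, country, gender, q1, q2, q3, q4, q5
```

| | managerID | testDate | country | gender | age | q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | 2009-05-01 | UK | F | 99 | 2 | 2 | 1 | 2 | 1 |
| 2 | 2 | 2008-10-28 | US | F | 45 | 3 | 5 | 2 | 5 | 5 |
| 3 | 3 | 2008-10-01 | UK | F | 25 | 3 | 5 | 5 | 5 | 2 |
| 4 | 4 | 2008-10-12 | UK | M | 39 | 3 | 3 | 4 | NA | NA |
| 1 | 1 | 2008-10-24 | US | M | 32 | 5 | 4 | 5 | 5 | 5 |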
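The masking message shows one risk of attach(): a global variable with the same name as a column silently wins. One way to sidestep this, sketched here with a made-up data frame `df`, is `with()`, which looks up names in the data frame first.

```r
# Hypothetical stand-in for the leadership columns used in the sort
df <- data.frame(gender = c("M", "F", "F", "M", "F"),
                 age    = c(32, 45, 25, 39, 99),
                 stringsAsFactors = FALSE)

# A stray global variable that would mask df$age after attach(df)
age <- c(0, 0, 0, 0, 0)

# with() evaluates order() using the columns of df, not the global age
sorted <- df[with(df, order(gender, -age)), ]
sorted
```

Because nothing is attached, there is also no detach() call to forget.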


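## Merging datasets
### Adding columns to a data frame
To merge two data frames horizontally, use the merge() function. In most cases, the two data frames are joined on one or more common key variables (an inner join):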

```r
total <- merge(dataframeA, dataframeB, by="ID")
```

This merges dataframeA and dataframeB by ID. Similarly, to join on two key variables:

```r
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
```

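To join two data frames horizontally without a common key, use cbind(). Both objects must have the same number of rows, sorted in the same order: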

```r
total <- cbind(A, B)
```
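To make the join behavior concrete, here is a small sketch with two made-up data frames. It contrasts the default inner join with a full outer join requested through `all = TRUE`; the column names `score` and `country` are illustrative.

```r
# Hypothetical inputs that share only IDs 2 and 3
dataframeA <- data.frame(ID = c(1, 2, 3), score = c(90, 85, 70))
dataframeB <- data.frame(ID = c(2, 3, 4),
                         country = c("US", "UK", "UK"),
                         stringsAsFactors = FALSE)

# Inner join (default): only IDs present in both data frames
inner <- merge(dataframeA, dataframeB, by = "ID")

# Full outer join: every ID from either side, with NA where data is missing
outer <- merge(dataframeA, dataframeB, by = "ID", all = TRUE)

inner
outer
```

`all.x = TRUE` or `all.y = TRUE` would give a left or right join instead.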

### Adding rows to a data frame
To join two data frames vertically, use rbind():

```r
total <- rbind(dataframeA, dataframeB)
```

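The two data frames must have the same variables, although they need not be in the same order. If dataframeA has variables that dataframeB lacks, do one of the following before joining them:

- delete the extra variables in dataframeA;
- create the additional variables in dataframeB and set their values to NA.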
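The second option can be sketched as follows. The two data frames are made up for illustration, and `missing_vars` is a helper name introduced here.

```r
# Hypothetical inputs: dataframeB lacks the country variable
dataframeA <- data.frame(ID = 1:2, age = c(32, 45),
                         country = c("US", "UK"),
                         stringsAsFactors = FALSE)
dataframeB <- data.frame(ID = 3:4, age = c(25, 39))

# Find variables present in A but absent from B, and add them to B as NA
missing_vars <- setdiff(names(dataframeA), names(dataframeB))
dataframeB[missing_vars] <- NA

# Now both data frames have the same variables and can be stacked
total <- rbind(dataframeA, dataframeB)
total
```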

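## Subsetting datasets
### Selecting (keeping) variables
Method 1, by column position:
```r
newdata <- leadership[, c(6:10)]
```

Method 2, by column name:
```r
myvars <- c("q1", "q2", "q3", "q4", "q5")
newdata <-leadership[myvars]
```

Method 3, building the names with paste():
```r
myvars <- paste("q", 1:5, sep="")
newdata <- leadership[myvars]
```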

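### Excluding (dropping) variables
Exclude the variables q3 and q4:
```r
myvars <- names(leadership) %in% c("q3", "q4")
newdata <- leadership[!myvars]
```

Knowing that q3 and q4 are the 8th and 9th variables:
```r
newdata <- leadership[c(-8,-9)]
```

Alternatively:

```r
leadership$q3 <- leadership$q4 <- NULL
```

This sets the columns q3 and q4 to undefined (NULL), which deletes them from the data frame. Note that NULL is not the same as NA (missing).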

### Selecting observations
Listing 4-6. Selecting observations

```r
newdata <- leadership[1:3,] # select rows 1 through 3
newdata <- leadership[leadership$gender=="M" & leadership$age > 30,] # select all men over 30

attach(leadership)
newdata <- leadership[gender=='M' & age > 30,]
detach(leadership)

# Restrict the study to a date range.
# testDate was already converted to class Date above, so it is compared directly.
startdate <- as.Date("2009-01-01")
enddate <- as.Date("2009-10-31")
newdata <- leadership[which(leadership$testDate >= startdate &
                            leadership$testDate <= enddate),]
```

```
The following objects are masked _by_ .GlobalEnv:

    age, country, gender, q1, q2, q3, q4, q5
```

### subset()  
Select all rows where age is greater than or equal to 35 or less than 24, keeping the variables q1 through q4:

```r
newdata <- subset(leadership, age >= 35 | age < 24,
                  select=c(q1, q2, q3, q4))
```

Select all men over the age of 25, keeping the variables gender through q4:

```r
newdata <- subset(leadership, gender == "M" & age > 25,
                  select=gender:q4)
```
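The three selection styles shown in this section differ in how they treat missing values in the condition. A minimal sketch with a made-up data frame follows: plain logical indexing returns an all-NA row when the condition evaluates to NA, whereas `which()` and `subset()` drop such rows.

```r
# Hypothetical data with one missing age
df <- data.frame(gender = c("M", "F", "M", "M"),
                 age    = c(32, 45, NA, 39),
                 stringsAsFactors = FALSE)

# Logical indexing: the NA condition for row 3 yields a row of NAs
by_index <- df[df$gender == "M" & df$age > 30, ]

# which() keeps only positions where the condition is TRUE
by_which <- df[which(df$gender == "M" & df$age > 30), ]

# subset() treats NA conditions as FALSE
by_subset <- subset(df, gender == "M" & age > 30)

nrow(by_index)
nrow(by_which)
nrow(by_subset)
```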

### Random samples
Draw a random sample of size 3 from the leadership dataset, without replacement:

```r
mysample <- leadership[sample(1:nrow(leadership), 3, replace=FALSE),]
```
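Each call to sample() gives a different result. If the same sample is needed again, for instance to reproduce an analysis, the random number generator can be seeded first. In this sketch the data frame `df` is made up and the seed value 1234 is arbitrary.

```r
# Hypothetical stand-in data frame
df <- data.frame(id = 1:5, age = c(32, 45, 25, 39, 99))

set.seed(1234)  # arbitrary seed, chosen only for illustration
idx <- sample(seq_len(nrow(df)), 3, replace = FALSE)
mysample <- df[idx, ]

# Resetting the same seed reproduces the same row indices
set.seed(1234)
idx2 <- sample(seq_len(nrow(df)), 3, replace = FALSE)

mysample
identical(idx, idx2)
```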

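## Using SQL statements to manipulate data frames

```r
# install.packages("sqldf")
library(sqldf)
# Select the cars with one carburetor, sorted by ascending mpg;
# row.names=TRUE carries the row names of mtcars over to the result
newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names=TRUE)
newdf
```

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.7 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.9 | 1 | 1 | 4 | 1 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.2 | 19.47 | 1 | 1 | 4 | 1 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.9 | 1 | 1 | 4 | 1 |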
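For comparison, the same query can be sketched in base R using the subsetting and sorting techniques from earlier sections. It relies only on the built-in mtcars dataset and needs no extra package.

```r
# WHERE carb = 1
newdf <- mtcars[mtcars$carb == 1, ]

# ORDER BY mpg (ascending)
newdf <- newdf[order(newdf$mpg), ]

newdf
```

Row names are preserved automatically, which corresponds to `row.names=TRUE` in the sqldf() call.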
