# Basic Statistics
## Descriptive statistics
- Miles per gallon (mpg)
- Horsepower (hp)
- Weight (wt)

In [1]:
myvars <- c("mpg", "hp", "wt")
head(mtcars[myvars])

Unnamed: 0,mpg,hp,wt
Mazda RX4,21.0,110,2.62
Mazda RX4 Wag,21.0,110,2.875
Datsun 710,22.8,93,2.32
Hornet 4 Drive,21.4,110,3.215
Hornet Sportabout,18.7,175,3.44
Valiant,18.1,105,3.46


### Descriptive statistics for the full dataset
Listing 7-1. Descriptive statistics via summary()

In [2]:
myvars <- c("mpg", "hp", "wt")
summary(mtcars[myvars])

      mpg              hp              wt       
 Min.   :10.40   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :335.0   Max.   :5.424  

Listing 7-2. Descriptive statistics via sapply()

In [3]:
mystats <- function(x, na.omit=FALSE) {
    if (na.omit)
        x <- x[!is.na(x)]
    m <- mean(x)
    n <- length(x)
    s <- sd(x)
    skew <- sum((x-m)^3/s^3)/n
    kurt <- sum((x-m)^4/s^4)/n - 3
    return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
}

myvars <- c("mpg", "hp", "wt")
sapply(mtcars[myvars], mystats)

Unnamed: 0,mpg,hp,wt
n,32.0,32.0,32.0
mean,20.090625,146.6875,3.21725
stdev,6.026948,68.5628685,0.97845744
skew,0.610655,0.7260237,0.42314646
kurtosis,-0.372766,-0.1355511,-0.02271075
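The `na.omit` argument of `mystats()` only matters when a column contains missing values, which `mtcars` does not. The following is a minimal sketch on a made-up vector (the values are illustrative and not from the source) showing what the flag changes:

```r
# Same summary function as in Listing 7-2
mystats <- function(x, na.omit = FALSE) {
    if (na.omit)
        x <- x[!is.na(x)]
    m <- mean(x)
    n <- length(x)
    s <- sd(x)
    skew <- sum((x - m)^3 / s^3) / n
    kurt <- sum((x - m)^4 / s^4) / n - 3
    c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}

x <- c(2, 4, NA, 9)   # hypothetical data with one missing value

with_na    <- mystats(x)                  # NA propagates through mean() and sd()
without_na <- mystats(x, na.omit = TRUE)  # statistics on the 3 observed values

print(with_na)
print(without_na)
```

With a data frame, the flag can be passed through `sapply()`, for example `sapply(mtcars[myvars], mystats, na.omit = TRUE)`.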


Listing 7-3. Descriptive statistics via describe() in the Hmisc package

In [5]:
library(Hmisc)
myvars <- c("mpg", "hp", "wt")
describe(mtcars[myvars])

mtcars[myvars] 

 3  Variables      32  Observations
--------------------------------------------------------------------------------
mpg 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       25    0.999    20.09    6.796    12.00    14.34 
     .25      .50      .75      .90      .95 
   15.43    19.20    22.80    30.09    31.30 

lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
--------------------------------------------------------------------------------
hp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       22    0.997    146.7    77.04    63.65    66.00 
     .25      .50      .75      .90      .95 
   96.50   123.00   180.00   243.50   253.55 

lowest :  52  62  65  66  91, highest: 215 230 245 264 335
--------------------------------------------------------------------------------
wt 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32    

Listing 7-4. Descriptive statistics via stat.desc() in the pastecs package

In [7]:
library(pastecs)
myvars <- c("mpg", "hp", "wt")
stat.desc(mtcars[myvars])

Unnamed: 0,mpg,hp,wt
nbr.val,32.0,32.0,32.0
nbr.null,0.0,0.0,0.0
nbr.na,0.0,0.0,0.0
min,10.4,52.0,1.513
max,33.9,335.0,5.424
range,23.5,283.0,3.911
sum,642.9,4694.0,102.952
median,19.2,123.0,3.325
mean,20.090625,146.6875,3.21725
SE.mean,1.065424,12.1203173,0.1729685


Listing 7-5. Descriptive statistics via describe() in the psych package

In [10]:
library(psych)
myvars <- c("mpg", "hp", "wt")
describe(mtcars[myvars])

Unnamed: 0,vars,n,mean,sd,median,trimmed,mad,min,max,range,skew,kurtosis,se
mpg,1,32,20.09062,6.0269481,19.2,19.696154,5.41149,10.4,33.9,23.5,0.610655,-0.37276603,1.065424
hp,2,32,146.6875,68.5628685,123.0,141.192308,77.0952,52.0,335.0,283.0,0.7260237,-0.13555112,12.1203173
wt,3,32,3.21725,0.9784574,3.325,3.152692,0.7672455,1.513,5.424,3.911,0.4231465,-0.02271075,0.1729685


### Descriptive statistics by group
Listing 7-6. Descriptive statistics by group using aggregate()

In [11]:
myvars <- c("mpg", "hp", "wt")
aggregate(mtcars[myvars], by=list(am=mtcars$am), mean)

am,mpg,hp,wt
0,17.14737,160.2632,3.768895
1,24.39231,126.8462,2.411


In [12]:
aggregate(mtcars[myvars], by=list(am=mtcars$am), sd)

am,mpg,hp,wt
0,3.833966,53.9082,0.7774001
1,6.166504,84.06232,0.6169816


Listing 7-7. Descriptive statistics by group using by()

In [13]:
dstats <- function(x) {
    sapply(x, mystats)
}

myvars <- c("mpg", "hp", "wt")
by(mtcars[myvars], mtcars$am, dstats)

mtcars$am: 0
                 mpg           hp         wt
n        19.00000000  19.00000000 19.0000000
mean     17.14736842 160.26315789  3.7688947
stdev     3.83396639  53.90819573  0.7774001
skew      0.01395038  -0.01422519  0.9759294
kurtosis -0.80317826  -1.20969733  0.1415676
------------------------------------------------------------ 
mtcars$am: 1
                 mpg          hp         wt
n        13.00000000  13.0000000 13.0000000
mean     24.39230769 126.8461538  2.4110000
stdev     6.16650381  84.0623243  0.6169816
skew      0.05256118   1.3598859  0.2103128
kurtosis -1.45535200   0.5634635 -1.1737358

Listing 7-8. Summary statistics by group using summaryBy() in the doBy package

In [14]:
library(doBy)
summaryBy(mpg+hp+wt~am, data=mtcars, FUN=mystats)

am,mpg.n,mpg.mean,mpg.stdev,mpg.skew,mpg.kurtosis,hp.n,hp.mean,hp.stdev,hp.skew,hp.kurtosis,wt.n,wt.mean,wt.stdev,wt.skew,wt.kurtosis
0,19,17.14737,3.833966,0.01395038,-0.8031783,19,160.2632,53.9082,-0.01422519,-1.2096973,19,3.768895,0.7774001,0.9759294,0.1415676
1,13,24.39231,6.166504,0.05256118,-1.455352,13,126.8462,84.06232,1.35988586,0.5634635,13,2.411,0.6169816,0.2103128,-1.1737358


Listing 7-9. Summary statistics by group using describeBy() in the psych package

In [15]:
library(psych)
myvars <- c("mpg", "hp", "wt")
describeBy(mtcars[myvars], list(am=mtcars$am))


 Descriptive statistics by group 
am: 0
    vars  n   mean    sd median trimmed   mad   min    max  range  skew
mpg    1 19  17.15  3.83  17.30   17.12  3.11 10.40  24.40  14.00  0.01
hp     2 19 160.26 53.91 175.00  161.06 77.10 62.00 245.00 183.00 -0.01
wt     3 19   3.77  0.78   3.52    3.75  0.45  2.46   5.42   2.96  0.98
    kurtosis    se
mpg    -0.80  0.88
hp     -1.21 12.37
wt      0.14  0.18
------------------------------------------------------------ 
am: 1
    vars  n   mean    sd median trimmed   mad   min    max  range skew kurtosis
mpg    1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90  18.90 0.05    -1.46
hp     2 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00 283.00 1.36     0.56
wt     3 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57   2.06 0.21    -1.17
       se
mpg  1.71
hp  23.31
wt   0.17

## Frequency and contingency tables

In [17]:
library(vcd)
head(Arthritis)

ID,Treatment,Sex,Age,Improved
57,Treated,Male,27,Some
46,Treated,Male,29,None
77,Treated,Male,30,None
17,Treated,Male,32,Marked
36,Treated,Male,46,Marked
23,Treated,Male,58,Marked


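### Generating frequency tables
Table 7-1. Functions for creating and manipulating contingency tables

| Function | Description |
| --- | --- |
| table(var1, var2, ..., varN) | Creates an N-way contingency table from N categorical variables (factors) |
| xtabs(formula, data) | Creates an N-way contingency table based on a formula and a matrix or data frame |
| prop.table(table, margins) | Expresses table entries as fractions of the marginal table defined by margins |
| margin.table(table, margins) | Computes the sum of table entries for the marginal table defined by margins |
| addmargins(table, margins) | Puts summary margins (sums by default) on a table |
| ftable(table) | Creates a compact "flat" contingency table |

#### 1. One-way tables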

In [18]:
mytable <- with(Arthritis, table(Improved))
mytable

Improved
  None   Some Marked 
    42     14     28 

In [19]:
prop.table(mytable)

Improved
     None      Some    Marked 
0.5000000 0.1666667 0.3333333 

In [20]:
prop.table(mytable)*100

Improved
    None     Some   Marked 
50.00000 16.66667 33.33333 

#### 2. Two-way tables

In [21]:
mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
mytable

         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

In [22]:
margin.table(mytable, 1)

Treatment
Placebo Treated 
     43      41 

In [23]:
prop.table(mytable, 1)

         Improved
Treatment      None      Some    Marked
  Placebo 0.6744186 0.1627907 0.1627907
  Treated 0.3170732 0.1707317 0.5121951

In [24]:
margin.table(mytable, 2)

Improved
  None   Some Marked 
    42     14     28 

In [25]:
prop.table(mytable, 2)

         Improved
Treatment      None      Some    Marked
  Placebo 0.6904762 0.5000000 0.2500000
  Treated 0.3095238 0.5000000 0.7500000

In [26]:
prop.table(mytable)

         Improved
Treatment       None       Some     Marked
  Placebo 0.34523810 0.08333333 0.08333333
  Treated 0.15476190 0.08333333 0.25000000

In [27]:
addmargins(mytable)

Unnamed: 0,None,Some,Marked,Sum
Placebo,29,7,7,43
Treated,13,7,21,41
Sum,42,14,28,84


In [28]:
addmargins(prop.table(mytable))

Unnamed: 0,None,Some,Marked,Sum
Placebo,0.3452381,0.08333333,0.08333333,0.5119048
Treated,0.1547619,0.08333333,0.25,0.4880952
Sum,0.5,0.16666667,0.33333333,1.0


In [29]:
addmargins(prop.table(mytable, 1), 2)

Unnamed: 0,None,Some,Marked,Sum
Placebo,0.6744186,0.1627907,0.1627907,1
Treated,0.3170732,0.1707317,0.5121951,1


In [30]:
addmargins(prop.table(mytable, 2), 1)

Unnamed: 0,None,Some,Marked
Placebo,0.6904762,0.5,0.25
Treated,0.3095238,0.5,0.75
Sum,1.0,1.0,1.0


Listing 7-10. Two-way table using CrossTable

In [31]:
library(gmodels)
CrossTable(Arthritis$Treatment, Arthritis$Improved)


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  84 

 
                    | Arthritis$Improved 
Arthritis$Treatment |      None |      Some |    Marked | Row Total | 
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 | 
                    |     2.616 |     0.004 |     3.752 |           | 
                    |     0.674 |     0.163 |     0.163 |     0.512 | 
                    |     0.690 |     0.500 |     0.250 |           | 
                    |     0.345 |     0.083 |     0.083 |           | 
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 | 
                    |     2.744 |     0.004 |     3.935 |        

#### 3. Multidimensional tables
Listing 7-11. Three-way contingency table

In [32]:
mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)
mytable

, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5


In [33]:
ftable(mytable)

                 Improved None Some Marked
Treatment Sex                             
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5

In [34]:
margin.table(mytable, 1)

Treatment
Placebo Treated 
     43      41 

In [35]:
margin.table(mytable, 2)

Sex
Female   Male 
    59     25 

In [36]:
margin.table(mytable, 3)

Improved
  None   Some Marked 
    42     14     28 

In [37]:
margin.table(mytable, c(1, 3))

         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

In [38]:
ftable(prop.table(mytable, c(1, 2)))

                 Improved       None       Some     Marked
Treatment Sex                                             
Placebo   Female          0.59375000 0.21875000 0.18750000
          Male            0.90909091 0.00000000 0.09090909
Treated   Female          0.22222222 0.18518519 0.59259259
          Male            0.50000000 0.14285714 0.35714286

In [39]:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3))

                 Improved       None       Some     Marked        Sum
Treatment Sex                                                        
Placebo   Female          0.59375000 0.21875000 0.18750000 1.00000000
          Male            0.90909091 0.00000000 0.09090909 1.00000000
Treated   Female          0.22222222 0.18518519 0.59259259 1.00000000
          Male            0.50000000 0.14285714 0.35714286 1.00000000

In [40]:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100

                 Improved       None       Some     Marked        Sum
Treatment Sex                                                        
Placebo   Female           59.375000  21.875000  18.750000 100.000000
          Male             90.909091   0.000000   9.090909 100.000000
Treated   Female           22.222222  18.518519  59.259259 100.000000
          Male             50.000000  14.285714  35.714286 100.000000

### Tests of independence
#### 1. Chi-square test of independence
Listing 7-12. Chi-square test of independence

In [42]:
library(vcd)
mytable <- xtabs(~Treatment+Improved, data=Arthritis)
mytable

         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

In [43]:
chisq.test(mytable)


	Pearson's Chi-squared test

data:  mytable
X-squared = 13.055, df = 2, p-value = 0.001463


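- Null hypothesis: treatment and improvement are independent.
- Conclusion: the p-value is small (p = 0.0015), so reject the null hypothesis; the treatment received and the level of improvement appear to be related.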
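To make the test statistic less of a black box, the sketch below recomputes it by hand from the counts in the Treatment x Improved table. The matrix is typed in from the output above rather than loaded from `vcd`, so the block runs with base R only.

```r
# Counts copied from the Treatment x Improved table above
observed <- matrix(c(29, 7, 7,
                     13, 7, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Treatment = c("Placebo", "Treated"),
                                   Improved  = c("None", "Some", "Marked")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic and its p-value on (r - 1) * (c - 1) degrees of freedom
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p  <- pchisq(x2, df, lower.tail = FALSE)

round(c(X2 = x2, df = df, p = p), 4)
```

The per-cell terms `(observed - expected)^2 / expected` are the "Chi-square contribution" values that `CrossTable()` prints.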

In [44]:
mytable <- xtabs(~Improved+Sex, data=Arthritis)
mytable

        Sex
Improved Female Male
  None       25   17
  Some       12    2
  Marked     22    6

In [46]:
chisq.test(mytable)

“Chi-squared approximation may be incorrect”


	Pearson's Chi-squared test

data:  mytable
X-squared = 4.8407, df = 2, p-value = 0.08889


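- Null hypothesis: sex and improvement are independent.
- Conclusion: p > 0.05, so there is not enough evidence to reject independence. The warning appears because one cell (male, some improvement) has an expected count below 5, which can make the chi-squared approximation unreliable.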
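The warning can be investigated by inspecting the expected counts directly. This sketch rebuilds the Improved x Sex table from the counts printed above and also shows one possible workaround, a Monte Carlo p-value via `simulate.p.value`; the seed and the number of replicates `B` are arbitrary choices of mine, not from the source.

```r
# Counts copied from the Improved x Sex table above
tab <- matrix(c(25, 17,
                12,  2,
                22,  6),
              nrow = 3, byrow = TRUE,
              dimnames = list(Improved = c("None", "Some", "Marked"),
                              Sex      = c("Female", "Male")))

# Expected counts under independence; a common rule of thumb wants all >= 5
expected <- suppressWarnings(chisq.test(tab))$expected
print(round(expected, 2))

# Row/column positions of the cells that fall below 5
small <- which(expected < 5, arr.ind = TRUE)
print(small)

# Simulated p-value that does not rely on the chi-squared approximation
set.seed(1234)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
print(sim$p.value)
```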

#### 2. Fisher's exact test
Null hypothesis of Fisher's exact test: rows and columns are independent in a contingency table with fixed marginals.

In [48]:
mytable <- xtabs(~Treatment+Improved, data=Arthritis)
mytable

         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

In [49]:
fisher.test(mytable)


	Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided


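#### 3. Cochran-Mantel-Haenszel test
Null hypothesis: two nominal variables are conditionally independent in each stratum of a third variable. The following code tests whether treatment and improvement are independent within each level of sex. The test assumes that there is no three-way (Treatment x Improved x Sex) interaction.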

In [50]:
mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
mytable

, , Sex = Female

         Improved
Treatment None Some Marked
  Placebo   19    7      6
  Treated    6    5     16

, , Sex = Male

         Improved
Treatment None Some Marked
  Placebo   10    0      1
  Treated    7    2      5


In [51]:
mantelhaen.test(mytable)


	Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647


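The results suggest that the treatment received and the improvement reported are not independent within each level of sex (that is, when controlling for sex, treated patients improved more than those receiving a placebo).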

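### Measures of association
If you can reject the null hypothesis, interest naturally turns to measures of association that gauge how strong the relationship is.

Listing 7-13. Measures of association for a two-way table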

In [52]:
library(vcd)
mytable <- xtabs(~Treatment+Improved, data=Arthritis)
mytable

         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

In [53]:
assocstats(mytable)

                    X^2 df  P(> X^2)
Likelihood Ratio 13.530  2 0.0011536
Pearson          13.055  2 0.0014626

Phi-Coefficient   : NA 
Contingency Coeff.: 0.367 
Cramer's V        : 0.394 
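The two coefficients reported above can be reproduced from the Pearson chi-square statistic with the standard textbook formulas. The sketch below uses base R only, with the counts typed in from the table. The phi coefficient is `NA` here because, as far as I know, `assocstats()` reports it only for 2 x 2 tables.

```r
# Counts copied from the Treatment x Improved table above
observed <- matrix(c(29, 7, 7,
                     13, 7, 21), nrow = 2, byrow = TRUE)
n  <- sum(observed)
x2 <- unname(chisq.test(observed)$statistic)

# Contingency coefficient: sqrt(X2 / (X2 + n))
cc <- sqrt(x2 / (x2 + n))

# Cramer's V: sqrt(X2 / (n * (min(rows, cols) - 1)))
v <- sqrt(x2 / (n * (min(dim(observed)) - 1)))

round(c(contingency = cc, cramers_v = v), 3)
```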

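## Correlations
### Types of correlations
#### 1. Pearson, Spearman, and Kendall correlations
1. The Pearson product-moment correlation measures the degree of linear relationship between two quantitative variables.
1. Spearman's rank-order correlation measures the degree of relationship between two rank-ordered variables.
1. Kendall's tau is also a nonparametric measure of rank correlation.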
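One way to see how the three coefficients relate is to compute them on the same pair of variables. The sketch below uses Income and HS Grad from `state.x77` and also checks that Spearman's coefficient is simply Pearson's coefficient applied to the ranks.

```r
states <- state.x77[, 1:6]
x <- states[, "Income"]
y <- states[, "HS Grad"]

pearson  <- cor(x, y)                        # linear association on raw values
spearman <- cor(x, y, method = "spearman")   # association between ranks
by_hand  <- cor(rank(x), rank(y))            # Pearson applied to the ranks
kendall  <- cor(x, y, method = "kendall")    # based on concordant/discordant pairs

round(c(pearson = pearson, spearman = spearman,
        ranks = by_hand, kendall = kendall), 3)
```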

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-15/11290305.jpg)

Listing 7-14. Covariances and correlations

In [54]:
states<- state.x77[,1:6]
states

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad
Alabama,3615,3624,2.1,69.05,15.1,41.3
Alaska,365,6315,1.5,69.31,11.3,66.7
Arizona,2212,4530,1.8,70.55,7.8,58.1
Arkansas,2110,3378,1.9,70.66,10.1,39.9
California,21198,5114,1.1,71.71,10.3,62.6
Colorado,2541,4884,0.7,72.06,6.8,63.9
Connecticut,3100,5348,1.1,72.48,3.1,56.0
Delaware,579,4809,0.9,70.06,6.2,54.6
Florida,8277,4815,1.3,70.66,10.7,52.6
Georgia,4931,4091,2.0,68.54,13.9,40.6


In [55]:
cov(states)

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad
Population,19931683.7588,571229.7796,292.8679592,-407.8424612,5663.523714,-3551.509551
Income,571229.7796,377573.3061,-163.7020408,280.6631837,-521.894286,3076.76898
Illiteracy,292.868,-163.702,0.3715306,-0.4815122,1.581776,-3.235469
Life Exp,-407.8425,280.6632,-0.4815122,1.8020204,-3.86948,6.312685
Murder,5663.5237,-521.8943,1.5817755,-3.8694804,13.627465,-14.549616
HS Grad,-3551.5096,3076.769,-3.2354694,6.3126849,-14.549616,65.237894


In [56]:
cor(states) # Pearson

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad
Population,1.0,0.2082276,0.1076224,-0.06805195,0.3436428,-0.09848975
Income,0.20822756,1.0,-0.4370752,0.34025534,-0.2300776,0.61993232
Illiteracy,0.10762237,-0.4370752,1.0,-0.58847793,0.7029752,-0.65718861
Life Exp,-0.06805195,0.3402553,-0.5884779,1.0,-0.7808458,0.5822162
Murder,0.34364275,-0.2300776,0.7029752,-0.78084575,1.0,-0.48797102
HS Grad,-0.09848975,0.6199323,-0.6571886,0.5822162,-0.487971,1.0


In [57]:
cor(states, method="spearman") # Spearman

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad
Population,1.0,0.1246098,0.3130496,-0.1040171,0.3457401,-0.3833649
Income,0.1246098,1.0,-0.3145948,0.324105,-0.2174623,0.5104809
Illiteracy,0.3130496,-0.3145948,1.0,-0.5553735,0.6723592,-0.6545396
Life Exp,-0.1040171,0.324105,-0.5553735,1.0,-0.7802406,0.523941
Murder,0.3457401,-0.2174623,0.6723592,-0.7802406,1.0,-0.436733
HS Grad,-0.3833649,0.5104809,-0.6545396,0.523941,-0.436733,1.0


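1. Income and high school graduation rate are strongly positively correlated (Pearson r = 0.62).
1. Illiteracy rate and life expectancy are strongly negatively correlated (Pearson r = -0.59).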

In [58]:
x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[,c("Life Exp", "Murder")]
cor(x,y)

Unnamed: 0,Life Exp,Murder
Population,-0.06805195,0.3436428
Income,0.34025534,-0.2300776
Illiteracy,-0.58847793,0.7029752
HS Grad,0.5822162,-0.487971


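#### 2. Partial correlations
A `partial correlation` is the correlation between two quantitative variables, controlling for one or more other quantitative variables.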

In [60]:
library(ggm)
colnames(states)

In [61]:
pcor(c(1,5,2,3,6), cov(states))

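In this case, 0.346 is the correlation between population and murder rate, controlling for the influence of income, illiteracy rate, and high school graduation rate.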
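The same partial correlation can be obtained without the `ggm` package by inverting the relevant covariance submatrix. This is a base-R sketch of the standard formula, not a description of how `pcor()` is implemented internally.

```r
states <- state.x77[, 1:6]

# Columns: 1 = Population, 5 = Murder; conditioning on 2 = Income,
# 3 = Illiteracy, 6 = HS Grad (the same index vector as in the pcor() call)
idx <- c(1, 5, 2, 3, 6)

# Invert the covariance submatrix; the partial correlation of the first two
# variables is -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
P <- solve(cov(states)[idx, idx])
partial <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
round(partial, 3)
```

An equivalent view: regress each of the two variables on the control variables and correlate the two sets of residuals.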

### Testing correlations for significance
```r
cor.test(x, y, alternative = , method = )
```

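- x and y are the variables to be correlated
- alternative specifies a two-tailed or one-tailed test ("two.sided", "less", or "greater")
- method specifies the type of correlation ("pearson", "kendall", or "spearman")
- Use alternative="less" when the research hypothesis is that the population correlation is less than 0
- Use alternative="greater" when the research hypothesis is that the population correlation is greater than 0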
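As a small illustration of the `alternative` argument, the sketch below runs the same Pearson test on illiteracy and murder rate twice, once two-sided and once with the directional hypothesis that the population correlation is greater than 0. The choice of variables is mine, for illustration.

```r
states <- state.x77[, 1:6]

# Two-sided test (the default) versus a one-sided test of "correlation > 0"
two_sided <- cor.test(states[, "Illiteracy"], states[, "Murder"])
one_sided <- cor.test(states[, "Illiteracy"], states[, "Murder"],
                      alternative = "greater", method = "pearson")

c(two_sided = two_sided$p.value, greater = one_sided$p.value)
```

Because the sample correlation is positive, the one-sided p-value is half the two-sided one; had the sample correlation been negative, `alternative = "greater"` would instead give a p-value above 0.5.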

Listing 7-15. Testing a correlation coefficient for significance

In [62]:
cor.test(states[,3], states[,5])


	Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5279280 0.8207295
sample estimates:
      cor 
0.7029752 


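This code tests the null hypothesis that the Pearson correlation between illiteracy rate (column 3) and murder rate (column 5) is 0. If the population correlation were 0, a sample correlation as large as 0.703 would be expected fewer than 1 time in 10 million (p = 1.258e-08). Because the p-value is so small, reject the null hypothesis and conclude that the population correlation between illiteracy rate and murder rate is not 0.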
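The t statistic and p-value in this output follow directly from the sample correlation and the sample size. A short sketch of the arithmetic, using the standard formula for testing a Pearson correlation against 0:

```r
states <- state.x77[, 1:6]
r <- cor(states[, 3], states[, 5])   # column 3 = Illiteracy, column 5 = Murder
n <- nrow(states)

# t statistic for H0: rho = 0, on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(t = t_stat, df = n - 2, p = p_val)
```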

Listing 7-16. Correlation matrix and tests of significance via corr.test

In [63]:
library(psych)
corr.test(states, use="complete")

Call:corr.test(x = states, use = "complete")
Correlation matrix 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.21       0.11    -0.07   0.34   -0.10
Income           0.21   1.00      -0.44     0.34  -0.23    0.62
Illiteracy       0.11  -0.44       1.00    -0.59   0.70   -0.66
Life Exp        -0.07   0.34      -0.59     1.00  -0.78    0.58
Murder           0.34  -0.23       0.70    -0.78   1.00   -0.49
HS Grad         -0.10   0.62      -0.66     0.58  -0.49    1.00
Sample Size 
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       0.00   0.59       1.00      1.0   0.10       1
Income           0.15   0.00       0.01      0.1   0.54       0
Illiteracy       0.46   0.00       0.00      0.0   0.00       0
Life Exp         0.64   0.02       0.00      0.0   0.00       0
Murder           0.01   0.11       0.00      0.0   0.00       0
H

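- use can be "pairwise" or "complete", for pairwise or listwise deletion of missing values, respectively.
- method can be "pearson", "spearman", or "kendall".

The correlation between population size and high school graduation rate (-0.10) is not significantly different from 0 (p = 0.5).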
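The output notes that entries above the diagonal are adjusted for multiple tests. The sketch below shows the general idea in base R: collect the unadjusted p-value for each of the 15 variable pairs with `cor.test()`, then adjust them with `p.adjust()`. Using the Holm method here is my assumption about the default adjustment in `corr.test()`; check the psych documentation before relying on it.

```r
states <- state.x77[, 1:6]
pairs  <- combn(colnames(states), 2)   # all 15 variable pairs, one per column

# Unadjusted p-value for each pair
raw_p <- apply(pairs, 2, function(v)
    cor.test(states[, v[1]], states[, v[2]])$p.value)
names(raw_p) <- apply(pairs, 2, paste, collapse = " ~ ")

# Holm adjustment for the 15 simultaneous tests
holm_p <- p.adjust(raw_p, method = "holm")

round(cbind(raw = raw_p, holm = holm_p), 3)
```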

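## t-tests
### Independent t-test
A two-group independent t-test can be used to test the hypothesis that two population means are equal. It assumes that the two groups are independent and that the data are sampled from normal populations. The call format is: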

```r
t.test(y ~ x, data)
```

- y is a numeric variable
- x is a dichotomous variable

or:

```r
t.test(y1, y2)
```

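- y1 and y2 are numeric vectors (the outcome variable for each group).

The following code uses a two-tailed test that does not assume equal variances to compare the probability of imprisonment in Southern (group 1) and non-Southern (group 0) states: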

In [2]:
library(MASS)
t.test(Prob ~ So, data=UScrime)


	Welch Two Sample t-test

data:  Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1 
     0.03851265      0.06371269 


Reject the hypothesis that Southern and non-Southern states have equal probabilities of imprisonment (p < 0.001).
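The Welch statistic and its fractional degrees of freedom (24.925 above) can be recomputed by hand. The sketch below assumes the MASS package is installed, as it is with standard R distributions, and uses the usual Welch-Satterthwaite approximation.

```r
data(UScrime, package = "MASS")

x0 <- UScrime$Prob[UScrime$So == 0]   # non-Southern states
x1 <- UScrime$Prob[UScrime$So == 1]   # Southern states

# Squared standard errors of each group mean, without pooling the variances
v0 <- var(x0) / length(x0)
v1 <- var(x1) / length(x1)
t_stat <- (mean(x0) - mean(x1)) / sqrt(v0 + v1)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v0 + v1)^2 / (v0^2 / (length(x0) - 1) + v1^2 / (length(x1) - 1))
p  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

c(t = t_stat, df = df, p = p)
```

Passing `var.equal = TRUE` to `t.test()` would instead pool the variances and use n0 + n1 - 2 degrees of freedom.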

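### Dependent t-test
When the observations in the two groups are related, you have a dependent groups design. Pre-post and repeated measures designs also produce dependent groups.

The dependent t-test assumes that the difference between groups is normally distributed.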

```r
t.test(y1, y2, paired=TRUE)
```

where y1 and y2 are the numeric vectors for the two dependent groups.

In [3]:
library(MASS)
sapply(UScrime[c("U1","U2")], function(x)(c(mean=mean(x),sd=sd(x))))

Unnamed: 0,U1,U2
mean,95.46809,33.97872
sd,18.02878,8.44545


In [4]:
with(UScrime, t.test(U1, U2, paired=TRUE))


	Paired t-test

data:  U1 and U2
t = 32.407, df = 46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 57.67003 65.30870
sample estimates:
mean of the differences 
               61.48936 


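The mean difference (61.5) is large enough to warrant rejecting the hypothesis that the mean unemployment rate is the same for older (U2) and younger (U1) males. Younger males have the higher rate. In fact, if the population means were equal, the probability of obtaining a sample difference this large would be less than 2.2e-16.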
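A paired t-test is the same thing as a one-sample t-test on the within-pair differences, which is why the assumption concerns the distribution of the differences. A short sketch, again assuming MASS is available:

```r
data(UScrime, package = "MASS")

paired  <- with(UScrime, t.test(U1, U2, paired = TRUE))
one_smp <- with(UScrime, t.test(U1 - U2, mu = 0))   # same test on differences

c(paired = unname(paired$statistic), differences = unname(one_smp$statistic))
```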

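## Nonparametric tests of group differences
### Comparing two groups
If the two groups are independent, you can use the Wilcoxon rank sum test (also known as the Mann-Whitney U test) to assess whether the observations are sampled from the same probability distribution (that is, whether the probability of obtaining a higher score is greater in one population than in the other).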

```r
wilcox.test(y ~ x, data)
```

- y is a numeric variable
- x is a dichotomous variable

or

```r
wilcox.test(y1, y2)
```

- y1 and y2 are the outcome variables for each group.

In [5]:
with(UScrime, by(Prob, So, median))

So: 0
[1] 0.038201
------------------------------------------------------------ 
So: 1
[1] 0.055552

In [6]:
wilcox.test(Prob ~ So, data=UScrime)


	Wilcoxon rank sum test

data:  Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0


拒绝南方各州和非南方各州监禁率相同的假设（p<0.001）。

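The Wilcoxon signed rank test is a nonparametric alternative to the dependent-sample t-test. It is appropriate when the groups are paired and the assumption of normality is unwarranted.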

In [7]:
sapply(UScrime[c("U1","U2")], median)

In [9]:
with(UScrime, wilcox.test(U1, U2, paired=TRUE))

“cannot compute exact p-value with ties”


	Wilcoxon signed rank test with continuity correction

data:  U1 and U2
V = 1128, p-value = 2.464e-09
alternative hypothesis: true location shift is not equal to 0


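Once again, you reach the same conclusion as with the paired t-test.

### Comparing more than two groups
If the groups are independent, the Kruskal-Wallis test is a useful approach. If the groups are dependent (for example, a repeated measures or randomized block design), the Friedman test is more appropriate.

The call format for the Kruskal-Wallis test is: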

```r
kruskal.test(y ~ A, data)
```

- y is a numeric outcome variable
- A is a grouping variable with two or more levels; with two levels, the test is equivalent to the Mann-Whitney U test.

The call format for the Friedman test is:

```r
friedman.test(y ~ A | B, data)
```

- y is a numeric outcome variable
- A is a grouping variable
- B is a blocking variable that identifies matched observations.
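The notes give no worked Friedman example, so here is a minimal sketch on the built-in `warpbreaks` data, which is not part of the original material. `friedman.test()` needs exactly one value per group-by-block cell, so the replicates are averaged first.

```r
# One mean per wool x tension cell, so each block has one value per group
wb <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)
print(wb)

# breaks is the outcome, wool the grouping variable, tension the block
res <- friedman.test(breaks ~ wool | tension, data = wb)
res
```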

Use the Kruskal-Wallis test to address the illiteracy question: do illiteracy rates differ across the four US regions?

In [11]:
states <- data.frame(state.region, state.x77)
states

Unnamed: 0,state.region,Population,Income,Illiteracy,Life.Exp,Murder,HS.Grad,Frost,Area
Alabama,South,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,West,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,West,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,South,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,West,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,West,2541,4884,0.7,72.06,6.8,63.9,166,103766
Connecticut,Northeast,3100,5348,1.1,72.48,3.1,56.0,139,4862
Delaware,South,579,4809,0.9,70.06,6.2,54.6,103,1982
Florida,South,8277,4815,1.3,70.66,10.7,52.6,11,54090
Georgia,South,4931,4091,2.0,68.54,13.9,40.6,60,58073


In [12]:
kruskal.test(Illiteracy ~ state.region, data=states)


	Kruskal-Wallis rank sum test

data:  Illiteracy by state.region
Kruskal-Wallis chi-squared = 22.672, df = 3, p-value = 4.726e-05


显著性检验的结果意味着美国四个地区的文盲率各不相同（p < 0.001）。

虽然你可以拒绝不存在差异的原假设，但这个检验并没有告诉你哪些地区显著地与其他地区不同。要回答这个问题，你可以使用Wilcoxon检验每次比较两组数据。
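One way to run those two-at-a-time comparisons is `pairwise.wilcox.test()` from the stats package, which also adjusts the p-values for the number of comparisons. The Holm adjustment and `exact = FALSE` (to avoid warnings about ties) are my choices for this sketch, not something the source prescribes.

```r
states <- data.frame(state.region, state.x77)

# All pairwise Wilcoxon rank sum tests between regions, Holm-adjusted;
# exact = FALSE uses the normal approximation because Illiteracy has ties
res <- pairwise.wilcox.test(states$Illiteracy, states$state.region,
                            p.adjust.method = "holm", exact = FALSE)
res$p.value
```

The result is a lower-triangular matrix of adjusted p-values, one entry per pair of regions.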