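### Chi-Square Test
**Introduction**
- A widely used hypothesis test for count (categorical) data
- A non-parametric test, used mainly to compare rates (or proportions) across two or more samples and to test for association between two categorical variables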

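**Basic idea**
- Measure how well the observed frequencies agree with the theoretical (expected) frequencies, i.e. the goodness of fit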

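**Chi-square distribution**
- If n random variables ξ₁, ξ₂, ..., ξₙ are mutually independent
- and each follows the standard normal distribution (i.e. they are independent and identically distributed standard normal variables),
- then the sum of their squares is a new random variable,
- whose distribution is called the chi-square distribution with n degrees of freedom.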
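
The definition above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `simulate_chi_square`, the sample size, and the seed are illustrative assumptions, not part of the original notebook. It sums the squares of n independent standard normal draws and compares the sample mean and variance with n and 2n, the mean and variance of a chi-square distribution with n degrees of freedom.

```python
import random
import statistics

def simulate_chi_square(n, samples=20000, seed=0):
    """Draw `samples` values of xi_1^2 + ... + xi_n^2, each xi ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(samples)]

n = 3
draws = simulate_chi_square(n)
sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)
print(f"mean     = {sample_mean:.3f} (theory: {n})")
print(f"variance = {sample_var:.3f} (theory: {2 * n})")
```

With enough samples, both printed values should land close to their theoretical counterparts.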

**Applications**
- Comparing two rates or two proportions
- Comparing several rates or several proportions
- Association analysis for categorical data

### Chi-Square Test Formula

&emsp;&emsp;&emsp;&emsp; $\chi^{2}=\sum\frac{(A-T)^2}{T}$

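- A is the observed (actual) frequency
- T is the theoretical (expected) frequency
- χ² measures how far the observed values deviate from the theoretical values (the core idea of the chi-square test). It reflects two things:
    - 1. The absolute size of the deviation between observed and theoretical values (squaring removes the sign and amplifies large deviations)
    - 2. The size of that deviation relative to the theoretical value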
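
As a small illustration of the formula, here is a sketch in plain Python. The helper name `chi_square_statistic` and the coin-toss numbers are hypothetical and chosen only for this example. The second part shows the "relative size" point: the same absolute deviation contributes less when the theoretical count is larger.

```python
def chi_square_statistic(observed, expected):
    """Return sum((A - T)^2 / T) over paired observed (A) and expected (T) counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((a - t) ** 2 / t for a, t in zip(observed, expected))

# Hypothetical example: 100 coin tosses, 60 heads and 40 tails, fair coin expected.
stat = chi_square_statistic([60, 40], [50, 50])
print(stat)  # 4.0

# Same absolute deviation (10), but measured against a larger expected count.
small_cell = chi_square_statistic([60], [50])    # 10^2 / 50
large_cell = chi_square_statistic([510], [500])  # 10^2 / 500
print(small_cell, large_cell)
```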

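### Chi-Square Critical Value Table
- Degrees of freedom: v = (rows - 1) × (columns - 1); for a 2×2 (fourfold) table, v = 1
- In the table below, n denotes the degrees of freedom
- p denotes the upper-tail probability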

In [1]:
import pandas as pd
chi_distribution_values = pd.read_excel('./chi_distribution_values-.xls')
chi_distribution_values.head()

Unnamed: 0,n/p,0.995,0.99,0.975,0.95,0.9,0.75,0.5,0.25,0.1,0.05,0.025,0.01,0.005
0,1,…,…,…,…,0.02,0.1,0.45,1.32,2.71,3.84,5.02,6.63,7.88
1,2,0.01,0.02,0.02,0.1,0.21,0.58,1.39,2.77,4.61,5.99,7.38,9.21,10.6
2,3,0.07,0.11,0.22,0.35,0.58,1.21,2.37,4.11,6.25,7.81,9.35,11.34,12.84
3,4,0.21,0.3,0.48,0.71,1.06,1.92,3.36,5.39,7.78,9.49,11.14,13.28,14.86
4,5,0.41,0.55,0.83,1.15,1.61,2.67,4.35,6.63,9.24,11.07,12.83,15.09,16.75
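
The n = 1 row of the table can be cross-checked without the spreadsheet. For one degree of freedom, the upper-tail probability has the closed form P(χ² > x) = erfc(√(x/2)), because the variable is the square of a single standard normal. The sketch below inverts that expression by bisection; the helper names `chi2_sf_df1` and `chi2_critical_df1` are hypothetical, and the approach only covers n = 1.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability P(chi2 > x) for 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_critical_df1(p, lo=0.0, hi=100.0, tol=1e-10):
    """Find x with P(chi2 > x) = p by bisection (the tail probability decreases in x)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_sf_df1(mid) > p:
            lo = mid  # tail probability still above p: the critical value lies to the right
        else:
            hi = mid
    return (lo + hi) / 2.0

for p in (0.1, 0.05, 0.025, 0.01, 0.005):
    print(p, round(chi2_critical_df1(p), 2))
```

The printed values should agree with the n = 1 row of the table for the same tail probabilities.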


### Standard Steps of a 2×2 Chi-Square Test

#### Goal
- Does skipping dinner affect weight loss?


#### Dataset

In [2]:
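# Observed counts: rows are groups, columns are outcomes
data = pd.DataFrame({'weight_loss':[123, 45], 'not_weight_loss':[467,106]},index=['dinner', 'not_dinner'])
# Add the column totals as a 'sum' row (DataFrame.append was removed in pandas 2.0)
col_sum = pd.DataFrame({'sum':data.sum()}).T
data = pd.concat([data, col_sum])
# Row totals and the weight-loss rate of each row
data['sum'] = data.sum(axis=1)
data['weight_loss_rate'] = data.weight_loss / data['sum']
data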

Unnamed: 0,weight_loss,not_weight_loss,sum,weight_loss_rate
dinner,123,467,590,0.208475
not_dinner,45,106,151,0.298013
sum,168,573,741,0.226721


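**The weight-loss rates of the dinner and no-dinner groups are 20.85% and 29.80%, respectively.**

**The difference may be due to sampling error, or skipping dinner may genuinely affect weight loss.**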

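#### Set Up the Hypothesis
- Null hypothesis: skipping dinner has no effect on weight loss; significance level α=0.05
- If there really is no effect, the theoretical and observed values in the table should differ only slightly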

#### Compute the Theoretical Values

In [3]:
# Overall (pooled) weight-loss rate, used as the expected rate under the null hypothesis
real_rate = data.loc['sum','weight_loss']/data.loc['sum', 'sum']
print('%0.2f%%'% (real_rate * 100))

22.67%


In [4]:
# Theoretical counts: each group's size multiplied by the pooled rate
theoretical_data = pd.DataFrame({'weight_loss':data['sum'] * real_rate, 'not_weight_loss':data['sum'] * (1 - real_rate)},index=['dinner', 'not_dinner'])
col_sum = pd.DataFrame({'sum':theoretical_data.sum()}).T
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
theoretical_data = pd.concat([theoretical_data, col_sum])
theoretical_data['sum'] = theoretical_data.sum(axis=1)
theoretical_data['weight_loss_rate'] = theoretical_data.weight_loss / theoretical_data['sum']
theoretical_data

Unnamed: 0,weight_loss,not_weight_loss,sum,weight_loss_rate
dinner,133.765182,456.234818,590.0,0.226721
not_dinner,34.234818,116.765182,151.0,0.226721
sum,168.0,573.0,741.0,0.226721
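
The theoretical counts above are equivalent to the usual contingency-table rule: expected count = row total × column total / grand total. A minimal sketch in plain Python follows; the helper name `expected_counts` is hypothetical, and the observed counts are the ones from the dataset above.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# rows = dinner, not_dinner; columns = weight_loss, not_weight_loss
observed = [[123, 467], [45, 106]]
expected = expected_counts(observed)
for row in expected:
    print([round(value, 6) for value in row])
```

Because the rule only uses the marginal totals, the same function also applies to tables with more than two rows or columns.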


#### Compute the Chi-Square Value

In [5]:
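# Per-cell contributions (A - T)^2 / T; only the four inner cells are meaningful,
# the 'sum' row/column and the rate column are ignored in the next step
chi = ((data - theoretical_data)**2 / theoretical_data)
chi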

Unnamed: 0,weight_loss,not_weight_loss,sum,weight_loss_rate
dinner,0.866363,0.254012,0.0,0.001468
not_dinner,3.385125,0.992497,0.0,0.022418
sum,0.0,0.0,0.0,0.0


In [6]:
# Keep only the four cells of the 2x2 table and add up their contributions
chi_x = chi.iloc[0:2,0:2]
chi_x.sum().sum()

5.497997391433855
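
For a 2×2 table with cells a, b (first row) and c, d (second row), the same statistic can be cross-checked with the well-known shortcut χ² = n(ad - bc)² / ((a+b)(c+d)(a+c)(b+d)), where n is the grand total. The sketch below applies it to the observed counts; the function name `chi_square_2x2` is hypothetical, and no continuity correction is applied, matching the calculation above.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Observed counts: dinner (123, 467), not_dinner (45, 106)
chi_shortcut = chi_square_2x2(123, 467, 45, 106)
print(round(chi_shortcut, 3))  # 5.498
```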

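#### Look Up the Chi-Square Table to Bound the P Value
**Degrees of freedom**
- Formula: v = (rows - 1)(columns - 1)
- Here v = (2 - 1)(2 - 1) = 1
- From the chi-square critical value table, the critical value for v = 1 at α = 0.05 is 3.84 (this is a critical value of χ², not a P value)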

In [7]:
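# Critical value for 1 degree of freedom at upper-tail probability 0.05
# (despite its name, chi_p holds a critical value, not a P value)
chi_p = chi_distribution_values.loc[chi_distribution_values['n/p'] == 1, 0.05].iloc[0]
chi_p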

3.84

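#### Conclusion

**chi_x = 5.498 > chi_p = 3.84, so P < 0.05; the null hypothesis is rejected and the difference is statistically significant.**

**At the α=0.05 level, skipping dinner is considered to be associated with weight loss.**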
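
The table only brackets the P value. For one degree of freedom it can also be computed directly, since P(χ² > x) = erfc(√(x/2)). The following is a sketch using the standard library; the helper name `p_value_df1` is hypothetical, and the closed form holds only for one degree of freedom.

```python
import math

def p_value_df1(chi_square):
    """Upper-tail P value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))

chi_x = 5.497997391433855  # chi-square value computed in the notebook
alpha = 0.05
p = p_value_df1(chi_x)
print(f"P = {p:.4f}, reject null hypothesis: {p < alpha}")
```

The result is roughly 0.019, which is consistent with the table-based conclusion that P < 0.05.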

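### Summary
**General steps of the chi-square test**
1. State the null hypothesis that the variables are independent
2. Compute the chi-square value: the larger it is, the smaller the P value and the stronger the evidence that the variables are associated
3. Draw a conclusion

    If chi_x ≥ chi_p (the critical value at α = 0.05, i.e. a 95% confidence level), then P ≤ 0.05 and the null hypothesis is rejected: the variables are considered associated (as in the example, 5.498 > 3.84)
    
    If chi_x < chi_p, then P > 0.05 and the null hypothesis cannot be rejected; this does not prove that the variables are independent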
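
In practice the whole procedure is usually delegated to a library. The sketch below assumes SciPy is installed and uses `scipy.stats.chi2_contingency` on the observed counts from the example. Note that for 2×2 tables this function applies Yates' continuity correction by default; `correction=False` is passed here so that the result matches the manual calculation in this notebook.

```python
from scipy.stats import chi2_contingency

# rows = dinner, not_dinner; columns = weight_loss, not_weight_loss
observed = [[123, 467], [45, 106]]

# correction=False disables Yates' continuity correction (the default for 2x2 tables)
stat, p, dof, expected = chi2_contingency(observed, correction=False)

alpha = 0.05
print(f"chi-square = {stat:.3f}, dof = {dof}, P = {p:.4f}")
print("reject null hypothesis" if p <= alpha else "cannot reject null hypothesis")
```

Leaving the correction enabled yields a somewhat smaller statistic, so the choice should be stated when reporting results.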