# Descriptive Statistics and Regression Analysis
- Descriptive statistics
- Regression analysis

### Import the common packages `numpy`, `pandas`, `matplotlib.pyplot`, `seaborn`

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Import the statistics and econometrics packages

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols, glm

### Load the data
- `pd.read_csv`
- `sep`: the delimiter
- `header`: the row number that holds the column names

In [None]:
wine = pd.read_csv('winequality-both.csv', sep = ',', header=0)
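
The effect of `sep` and `header` is easiest to see on a tiny in-memory file. The sketch below is an illustration only: the three-line sample and its semicolon delimiter are hypothetical and are not taken from `winequality-both.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real data file is not reproduced here.
raw = io.StringIO(
    "type;fixed acidity;quality\n"
    "red;7.4;5\n"
    "white;7.0;6\n"
)

# sep=';' sets the delimiter; header=0 takes the column names from the first line.
sample = pd.read_csv(raw, sep=';', header=0)
print(sample.columns.tolist())
print(sample.shape)
```

With the wrong `sep`, each whole line would be read as a single column, so checking `columns` and `shape` right after loading is a cheap sanity check.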

- Replace the spaces in the column names with underscores (`_`)

In [None]:
wine.columns = wine.columns.str.replace(' ','_')
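
Underscored names matter later because a formula string such as `'quality ~ citric_acid'` treats each term as an identifier, and a name containing a space would need extra quoting. A minimal sketch of the renaming on a hypothetical set of column names:

```python
import pandas as pd

# Hypothetical column names used only for illustration.
cols = pd.Index(['fixed acidity', 'pH', 'total sulfur dioxide'])

# regex=False makes the replacement a plain string substitution.
cleaned = cols.str.replace(' ', '_', regex=False)
print(cleaned.tolist())
```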

In [None]:
wine.columns

In [None]:
wine.head()

- Show the variable names in the data with `columns`

In [None]:
wine.columns

- Show descriptive statistics for all variables

In [None]:
wine.describe()

- Find the distinct values or labels with `unique()`

In [None]:
print(sorted(wine.quality.unique()))

- Count how often each value occurs with `value_counts()`

In [None]:
print(wine.quality.value_counts())
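
As a small illustration of the difference between the two calls, the sketch below uses a hypothetical six-element series rather than the wine data. `normalize=True` is an optional argument that turns counts into shares.

```python
import pandas as pd

# Hypothetical quality scores used only for illustration.
quality = pd.Series([5, 6, 5, 7, 6, 5], name='quality')

levels = sorted(quality.unique())        # distinct values, in ascending order
counts = quality.value_counts()          # occurrences, most frequent first
shares = quality.value_counts(normalize=True).sort_index()  # proportions by value

print(levels)
print(counts)
print(shares)
```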

### Grouping and distributions

- Group by wine type (`type`) and show descriptive statistics for quality

In [None]:
wine.groupby('type')['quality'].describe()

- Group by wine type (`type`) and show the 25th and 75th percentiles of quality

In [None]:
wine.groupby('type')[['quality']].quantile([0.25,0.75]).unstack('type')
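
To make the shape of this result easier to follow, here is a minimal sketch on a hypothetical eight-row frame. `quantile` returns one row per (type, quantile) pair, and `unstack('type')` then moves the wine type into the columns.

```python
import pandas as pd

# Hypothetical data used only for illustration.
toy = pd.DataFrame({
    'type': ['red'] * 4 + ['white'] * 4,
    'quality': [4, 5, 6, 7, 5, 6, 7, 8],
})

# Long form: the index has two levels, (type, quantile).
long_form = toy.groupby('type')[['quality']].quantile([0.25, 0.75])

# Wide form: quantiles as rows, one column per wine type.
wide = long_form.unstack('type')
print(wide)
```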

In [None]:
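# Restrict to numeric columns: the string column `type` cannot be correlated (needs pandas >= 1.5)
wine.corr(numeric_only=True)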

- To keep only the rows that meet a condition, use `wine.loc[condition, columns]`

In [None]:
wine['pH'].describe()

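### Find and extract the rows that satisfy a condition: `loc[condition, columns]`
- Find the rows with `pH` greater than 3 and keep only the `quality` variable
- Find the rows with `pH` greater than 3 and keep only the `pH` and `quality` variables
- Find the rows with `pH` greater than 3 that are also red wine and keep only the `quality` variable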

In [None]:
red_wine = wine.loc[wine['pH'] > 3, 'quality']
red_wine

In [None]:
red_wine = wine.loc[wine['pH'] > 3, ['quality','pH']]
red_wine

In [None]:
red_wine = wine.loc[(wine['pH'] > 3) & (wine['type'] == 'red'), 'quality']
red_wine
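
The three selections differ only in the row mask and in the column argument. A minimal sketch on a hypothetical four-row frame shows what each form returns. Note that each comparison must be wrapped in parentheses when combined with `&`, because `&` binds more tightly than `>` and `==`.

```python
import pandas as pd

# Hypothetical data used only for illustration.
toy = pd.DataFrame({
    'type': ['red', 'white', 'red', 'white'],
    'pH': [3.5, 2.9, 2.8, 3.2],
    'quality': [5, 6, 7, 6],
})

high_ph = toy.loc[toy['pH'] > 3, 'quality']        # one column name -> Series
both = toy.loc[toy['pH'] > 3, ['pH', 'quality']]   # list of names -> DataFrame

# Combine two conditions element-wise with & (and), each in parentheses.
mask = (toy['pH'] > 3) & (toy['type'] == 'red')
red_high = toy.loc[mask, 'quality']

print(high_ph.tolist(), both.shape, red_high.tolist())
```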

In [None]:
red_wine = wine.loc[wine['type'] == 'red', 'quality']

In [None]:
white_wine = wine.loc[wine['type'] == 'white', 
                     'quality']

In [None]:
sns.set_style("dark")

In [None]:
# histplot replaces the deprecated distplot; stat='density' corresponds to norm_hist=True
sns.histplot(red_wine, stat='density', color='red', label='Red wine')
sns.histplot(white_wine, stat='density', color='white', label='White wine')
plt.title("Distribution of Quality by Wine Type")
plt.legend()
plt.show()

## Statistical tests
- t-test: tests whether the means of two samples are equal
- Reference: https://www.statsmodels.org/dev/stats.html

In [None]:
print(wine.groupby(['type'])[['quality']].agg(['std', 'mean']))

- Test whether the quality of red and white wine differs significantly (`sm.stats.ttest_ind(red_wine, white_wine)`)

In [None]:
tstat, pvalue, df = sm.stats.ttest_ind(red_wine, white_wine)

In [None]:
print('tstat: %.3f  pvalue: %.4f' % (tstat, pvalue))

In [None]:
sm.stats.ttest_ind(red_wine, white_wine)
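
By default `sm.stats.ttest_ind` assumes equal variances in the two groups and uses a pooled variance estimate. To show where the statistic and the degrees of freedom come from, here is a minimal NumPy sketch of that pooled calculation. The helper name `pooled_t` and the two four-element samples are hypothetical, and the p-value is left out because it needs the t distribution.

```python
import numpy as np

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance (equal-variance assumption)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (a.mean() - b.mean()) / se, df

# Hypothetical samples used only for illustration.
t, df = pooled_t([5, 5, 6, 6], [6, 7, 7, 8])
print(t, df)
```

A large absolute value of the statistic relative to its degrees of freedom is what drives the small p-value reported by the library call.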

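### Linear regression (theory)
- Suppose we have a data set of 100 observations
$$
(x_1,y_1), (x_2,y_2), (x_3,y_3), \cdots, (x_{100},y_{100}).
$$
- Plot the observations on a two-dimensional plane
- We want a model (function) that describes these 100 observations
- Specifically, we want a linear model (function) that describes them
- This is the linear regression model
$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i=1,2,\cdots,100.
$$
- Ordinary least squares (OLS) takes the difference between each observed value and the model's fitted value, squares it, sums over all observations, and chooses the parameters that make this sum as small as possible
$$
(\hat{\beta}_0,\hat{\beta}_1) = \arg\min_{\beta_0,\beta_1} \sum_{i=1}^{100} (y_i - \beta_0 - \beta_1 x_i)^2
$$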
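
With a single regressor, the least-squares estimates have a well-known closed form: the slope is the sum of cross deviations divided by the sum of squared deviations of $x$, and the intercept follows from the sample means. The sketch below checks this on synthetic data. The true intercept 2.0, slope 0.5 and noise level are assumptions chosen for the illustration, not values from the wine data.

```python
import numpy as np

# Synthetic data: 100 points around the assumed line y = 2.0 + 0.5 x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)

# Closed-form least-squares estimates for one regressor.
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Cross-check against a generic least-squares solver on the design matrix [1, x].
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b0, b1)
print(beta)
```

Both routes minimize the same sum of squared residuals, so they should agree up to floating-point error.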

In [None]:
wine.columns

In [None]:
x = wine.columns[[1,2,3,4,5,6,7,8,9,10,11]]

### Linear regression (how to run it)
- Specify the linear model: `my_formula = 'y ~ x1 + x2 + ...'`
- Estimate it by ordinary least squares: `ols(my_formula, data).fit()`

In [None]:
my_formula = 'quality ~ pH'

In [None]:
my_formula = 'quality ~ alcohol + chlorides + citric_acid'
my_formula

In [None]:
my_formula = 'quality ~ alcohol + chlorides + citric_acid + density + fixed_acidity + free_sulfur_dioxide + pH + residual_sugar + sulphates + total_sulfur_dioxide + volatile_acidity'

In [None]:
lm = ols(my_formula, data = wine).fit()
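
The same two steps can be tried on synthetic data where the true coefficients are known, which makes it easier to judge whether the output looks sensible. In this sketch the frame `toy`, its columns `x1`, `x2`, `y`, and the coefficients 1.0, 2.0 and -1.0 are all assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data generated from an assumed model y = 1 + 2*x1 - 1*x2 + noise.
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
toy['y'] = 1.0 + 2.0 * toy['x1'] - 1.0 * toy['x2'] + rng.normal(size=n)

# The formula names the response on the left of ~ and the regressors on the right;
# an intercept is included automatically.
fit = smf.ols('y ~ x1 + x2', data=toy).fit()
print(fit.params)
print(fit.rsquared)
```

The estimated coefficients should land close to the assumed values, with the intercept reported under the name `Intercept`.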

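### Linear regression results
- `lm.summary()`
- Significance of the estimated coefficients: `t`, `P>|t|`
- Sign of the estimated coefficients: `coef`
- How much of the variation in the response the model explains: `R-squared`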


In [None]:
print(lm.summary())

- Extract only the coefficients: `lm.params`
- Extract only the coefficient test statistics: `lm.tvalues`
- Extract only the coefficient p-values: `lm.pvalues`

In [None]:
lm.pvalues
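
Each t statistic is the coefficient divided by its standard error, and under the default (non-robust) covariance the standard errors come from the residual variance and the inverse of $X^\top X$. A minimal NumPy sketch of that calculation on synthetic data follows. The data-generating coefficients 1.0, 2.0 and 0.0 are assumptions made for the illustration.

```python
import numpy as np

# Synthetic data: x2 has no true effect, so its t statistic should be small.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                  # coefficient estimates

resid = y - X @ beta
dof = n - X.shape[1]                      # residual degrees of freedom
sigma2 = resid @ resid / dof              # estimated error variance
se = np.sqrt(sigma2 * np.diag(XtX_inv))   # standard errors
tvalues = beta / se                       # t statistics

print(beta)
print(tvalues)
```

The p-values then follow from comparing each t statistic with a t distribution on `dof` degrees of freedom.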