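# Course Outline
### Probability
- Probability distributions
- Relationships between variables

### Estimation
- MSE (Mean Squared Error)
- Likelihood
- Linear models / logistic linear models
- Confidence intervals

### Experiments
- Design
- Difference in means
- MAB (Multi-Armed Bandit)

--------------------Covered so far----------------
### Prediction
: The core of data analysis
- Regression
- Various problems in prediction

### Latent variables
: How to handle variables that cannot be observed (not directly visible)
- Clustering
- Factor analysis and dimensionality reduction

### Time
: The future
- Time series analysis and survival analysis
- Marketing effects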

# Correlation
- The foundation of regression analysis
- Two variables are related to each other

### Scatterplot
- Visualizes the relationship between two continuous variables
- Each observation is drawn as one point

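### Covariance
- Quantifies the relationship between two continuous variables
> $\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_i (X_i - \bar{X})(Y_i - \bar{Y})$
- Positive when the two variables move in the same direction, negative when they move in opposite directions
- The stronger the tendency to move together, the larger the absolute value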
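
As a minimal sketch of the formula above, the snippet below computes the covariance by hand and compares it with `np.cov`. The arrays `x` and `y` are made-up illustrative values, not lecture data. Note that `np.cov` divides by N-1 by default; `bias=True` switches to division by N as in the formula.

```python
import numpy as np

# Illustrative data (assumed for this example only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def covariance(x, y):
    """Population covariance: mean of the products of the deviations."""
    return np.mean((x - x.mean()) * (y - y.mean()))

cov_manual = covariance(x, y)
cov_numpy = np.cov(x, y, bias=True)[0, 1]  # bias=True divides by N

# Flipping the direction of one variable flips the sign
cov_flipped = covariance(x, -y)
```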

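### Correlation Coefficient (Pearson)
- The covariance divided by the standard deviations of the two variables
- Always between -1 and +1
- Determined by how tightly the points cluster around a line; it has nothing to do with the slope of that line
- **It cannot capture patterns that are not a straight line**: this is why drawing a scatterplot matters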

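Correlation coefficient | Interpretation
--|--
+1|The variables move in exactly the same direction
0|No linear relationship
-1|The variables move in exactly opposite directions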
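
The two claims above (the slope does not matter, and non-linear patterns are missed) can be checked with a small sketch. The helper `pearson` and the synthetic `x` grid are assumptions made for this illustration only.

```python
import numpy as np

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

x = np.linspace(-3, 3, 61)  # symmetric grid around 0

r_steep = pearson(x, 10 * x)      # steep straight line
r_shallow = pearson(x, 0.1 * x)   # shallow straight line
r_parabola = pearson(x, x ** 2)   # perfect pattern, but not a straight line
```

Both straight lines give a coefficient of 1 regardless of slope, while the parabola gives a coefficient of essentially 0 even though `y` is fully determined by `x`.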

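# Make a habit of looking at the scatterplot and the correlation coefficient together
Get a feel for roughly what the correlation coefficient is for a given pattern of points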

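### Spurious Correlation
- A correlation that shows up even though the two variables are not actually related
- More likely to appear the less data there is
- Check the confidence interval of the correlation coefficient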
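
A rough simulation can illustrate why small samples are prone to spurious correlation. This is only a sketch: the function name, the threshold of 0.5, and the sample sizes are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def share_of_large_correlations(n, trials=2000, threshold=0.5):
    """Fraction of trials where two independent variables show |r| > threshold."""
    count = 0
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)  # generated independently of x
        r = np.corrcoef(x, y)[0, 1]
        if abs(r) > threshold:
            count += 1
    return count / trials

small = share_of_large_correlations(n=5)    # tiny samples
large = share_of_large_correlations(n=100)  # larger samples
```

With only 5 observations a sizeable share of the trials shows a strong correlation by chance alone; with 100 observations this almost never happens.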

In [1]:
import pandas as pd

In [3]:
cars = pd.read_csv('cars.csv')
cars.head()

Unnamed: 0.1,Unnamed: 0,speed,dist
0,1,4,2
1,2,4,10
2,3,7,4
3,4,7,22
4,5,8,16


### Checking the correlation coefficient

In [4]:
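# pearsonr returns (Pearson correlation coefficient, p-value)
from scipy.stats import pearsonr

(correlation coefficient, p-value)
- $p < 0.05$ : the 95 % confidence interval contains no values of the opposite sign (it excludes 0)
- $p < 0.01$ : the 99 % confidence interval contains no values of the opposite sign
- $p < 0.001$: the 99.9 % confidence interval contains no values of the opposite sign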

In [6]:
pearsonr(cars['speed'], cars['dist'])

(0.8068949006892105, 1.4898364962950702e-12)

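### Interpreting the p-value

    The significance level would have to drop to about 1.49e-10 % before the confidence interval included 0
    = even a 99.999999 % confidence interval excludes 0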

### Getting a confidence interval for the correlation coefficient by bootstrapping

In [8]:
from sklearn.utils import resample

In [15]:
df = resample(cars)

In [16]:
res = pearsonr(df['speed'], df['dist'])
res

(0.7518267818144415, 3.121723057687681e-10)

In [17]:
res[0]

0.7518267818144415

In [22]:
cors = [] 
for _ in range(10000):
    df = resample(cars)
    res = pearsonr(df['speed'], df['dist'])
    cors.append(res[0])

In [23]:
import numpy as np

In [24]:
np.quantile(cors, [.025, .975]) # 95 % confidence interval of the correlation coefficient

array([0.6994029, 0.8828204])

In [25]:
np.quantile(cors, [.005, .995]) # 99 % confidence interval of the correlation coefficient

array([0.64847972, 0.90094786])
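
The resampling loop above can be wrapped in a reusable function. The sketch below is one possible version, not the lecture's code: `bootstrap_ci` and `pearson_r` are hypothetical helper names, it resamples row indices with NumPy instead of `sklearn.utils.resample`, and it runs on synthetic data because `cars.csv` is not included here.

```python
import numpy as np

def bootstrap_ci(x, y, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(x, y)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # draw rows with replacement
        values.append(stat(x[idx], y[idx]))  # statistic on the resample
    alpha = 1 - level
    return np.quantile(values, [alpha / 2, 1 - alpha / 2])

def pearson_r(x, y):
    return np.corrcoef(x, y)[0, 1]

# Synthetic stand-in for a dataset with a positive linear relationship
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

lo95, hi95 = bootstrap_ci(x, y, pearson_r, level=0.95)
lo99, hi99 = bootstrap_ci(x, y, pearson_r, level=0.99)
```

Fixing `seed` makes the interval reproducible; a higher confidence level gives a wider interval, as in the 95 % and 99 % results above.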

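# Types of correlation coefficients
- Pearson correlation coefficient (the default)
- Spearman correlation coefficient $ρ$
    - Uses the rank of each value instead of the value itself
    - The Pearson correlation of the ranks
- Kendall correlation coefficient $τ$
    - Also uses ranks instead of the actual values
    - Compares how many pairs of observations are ordered the same way (concordant) versus the opposite way (discordant)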
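
To make the rank-based definitions concrete, here is a small sketch that implements both coefficients directly. The helpers `ranks`, `spearman`, and `kendall` are written for this illustration and assume data without ties; in practice `scipy.stats.spearmanr` and `scipy.stats.kendalltau` handle ties as well.

```python
import numpy as np

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(x, y):
    """Pearson correlation computed on the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

def kendall(x, y):
    """(concordant - discordant) pairs / all pairs; assumes no ties."""
    n = len(x)
    score = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            score += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return score / (n * (n - 1) / 2)

x = np.array([0.5, 1.0, 2.0, 3.0, 5.0, 8.0])
y = np.exp(x)  # monotone but strongly non-linear

pearson_r = np.corrcoef(x, y)[0, 1]
spearman_rho = spearman(x, y)
kendall_tau = kendall(x, y)
```

Because `y` always increases with `x`, both rank-based coefficients equal 1, while the Pearson coefficient stays below 1 because the relationship is not a straight line.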

### Spearman and Kendall examples

In [27]:
liar = pd.read_csv('liar.csv')
liar.head()
# Position: Ranking

Unnamed: 0,Creativity,Position,Novice
0,53,1,0
1,36,3,1
2,31,4,0
3,43,2,0
4,30,4,1


In [28]:
from scipy.stats import spearmanr

In [30]:
spearmanr(liar['Creativity'], liar['Position'])

SpearmanrResult(correlation=-0.37321838128767815, pvalue=0.0017204168895658578)

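In [38]:
cors = []
for _ in range(10000):
    df = resample(liar)
    # compute on the resampled df; using liar here would return the same value every time
    res = spearmanr(df['Creativity'], df['Position'])
    cors.append(res[0])

In [39]:
np.quantile(cors, [.025, .975])

If the loop is run on `liar` instead of `df`, every iteration reproduces the full-sample value and the interval collapses to a single point:

array([-0.37321838, -0.37321838])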

In [36]:
from scipy.stats import kendalltau

In [37]:
kendalltau(liar['Creativity'], liar['Position'])

KendalltauResult(correlation=-0.3002413080651747, pvalue=0.001258802279346817)

In [40]:
cors = []
for _ in range(10000):
    df = resample(liar)
    # compute on the resampled df, not on liar
    res = kendalltau(df['Creativity'], df['Position'])
    cors.append(res[0])

In [41]:
np.quantile(cors, [.025, .975]) # 95 % confidence interval

In [42]:
np.quantile(cors, [.005, .995]) # 99 % confidence interval

In [43]:
np.quantile(cors, [.0005, .9995]) # 99.9 % confidence interval

In [44]:
np.quantile(cors, [.00005, .99995]) # 99.99 % confidence interval

If the loop is run on `liar` instead of `df`, all four calls return the same degenerate interval, because every iteration reproduces the full-sample value:

array([-0.30024131, -0.30024131])

1. Is the bootstrap confidence interval reasonable (does it exclude 0)?
2. Is the p-value smaller than 0.05?

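# Regression
- Broadest sense: predicting $Y$ from $X$ ($X$ → $Y$)
- Intermediate sense: the case where $Y$ is continuous (when $Y$ is categorical it is called classification)
- **Narrowest sense** (the usual one): linear regression (regression using a linear model)

#### Regression - what "going back" means
: Even when some values stray from the trend line, they eventually return to the trend line (the regression line)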

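## Intercept and coefficient
- Intercept: the value of the dependent variable when every independent variable is 0; it usually carries little meaning
- Coefficient: the change in the dependent variable when the independent variable increases by 1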
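
The two definitions above can be checked on a toy example. The sketch below uses synthetic data that follow y = 3 + 2x exactly (an assumption made for illustration) and fits a straight line with `np.polyfit`.

```python
import numpy as np

# Synthetic data following y = 3 + 2x exactly (assumed for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

slope, intercept = np.polyfit(x, y, deg=1)  # coefficients, highest degree first

def predict(x_new):
    return intercept + slope * x_new

at_zero = predict(0.0)                # prediction when x is 0: the intercept
step = predict(11.0) - predict(10.0)  # change when x increases by 1: the coefficient
```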

## Example) Regression on the cars data

In [1]:
from statsmodels.formula.api import ols

In [2]:
import pandas as pd

cars = pd.read_csv('cars.csv')

In [6]:
# dist: dependent variable y
# speed: independent variable x
res = ols('dist ~ speed', data=cars).fit()

In [7]:
res.summary()

0,1,2,3
Dep. Variable:,dist,R-squared:,0.651
Model:,OLS,Adj. R-squared:,0.644
Method:,Least Squares,F-statistic:,89.57
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,1.49e-12
Time:,14:01:32,Log-Likelihood:,-206.58
No. Observations:,50,AIC:,417.2
Df Residuals:,48,BIC:,421.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-17.5791,6.758,-2.601,0.012,-31.168,-3.990
speed,3.9324,0.416,9.464,0.000,3.097,4.768

0,1,2,3
Omnibus:,8.975,Durbin-Watson:,1.676
Prob(Omnibus):,0.011,Jarque-Bera (JB):,8.189
Skew:,0.885,Prob(JB):,0.0167
Kurtosis:,3.893,Cond. No.,50.7


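### Confidence interval of a coefficient
: The confidence interval of the intercept is usually not examined
- Standard error (std err): the standard deviation of the coefficient's sampling distribution

> 95 % confidence interval ≈ $\text{coef} \pm 2 \times \text{std err}$, so its length is about $4 \times \text{std err}$

- $t$: regression coefficient / standard error
- $P$ ($P>|t|$): the probability of a $t$ at least as extreme as the observed one (the tail area from the observed $|t|$ out to infinity, on both sides) if the true coefficient were 0; lower is better

        Reading the p-value and reading the confidence interval must lead to the same conclusion
($P < 0.05$) ⇔ (both ends of the 95 % confidence interval have the same sign)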
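
As a quick check, the `speed` row of the cars summary can be reproduced by hand. The sketch below uses the rounded values printed in the table and the "2 standard errors" rule of thumb (the exact multiplier for 48 residual degrees of freedom is about 2.01), so the results match the table only approximately.

```python
# Values copied from the `speed` row of the summary table
coef = 3.9324
std_err = 0.416

t_value = coef / std_err      # summary reports 9.464 (computed before rounding)
ci_low = coef - 2 * std_err   # rule of thumb: about 2 standard errors on each side
ci_high = coef + 2 * std_err

# Both ends positive: the interval excludes 0, consistent with P < 0.05
same_sign = ci_low * ci_high > 0
```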

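### R-squared
- R-squared ($η^2$, eta squared): a goodness-of-fit index for the model; higher is better
        
        This model explains 65 % of the variance in dist
- Adj. R-squared (adjusted R-squared): R-squared corrected for the number of independent variables; the index to use when comparing models
        
        Plain R-squared never decreases when an independent variable is added, so it always favors the larger model.
- With a single independent variable, R-squared equals $(\text{correlation of } x \text{ and } y)^2$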
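
Both relationships can be verified with the numbers already shown for the cars data. The sketch below assumes the usual adjusted R-squared formula, $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, with $n$ observations and $p$ independent variables.

```python
r = 0.8068949006892105   # Pearson correlation of speed and dist (from pearsonr above)
n, p = 50, 1             # observations and number of independent variables

r_squared = r ** 2                                           # summary: 0.651
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)  # summary: 0.644
```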

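### F-statistic: whether the model as a whole is significant
- Prob (F-statistic): the probability of getting a result like this even if none of the independent variables mattered (all their coefficients were 0)
- Read it like a p-value (it should be below 0.05)

        If Prob (F-statistic) is close to 0, at least one independent variable is useful

- Log-Likelihood: higher is better
- AIC: lower is better
- BIC: lower is better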
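
The indices in the cars summary are linked by simple formulas. The sketch below recomputes them from the printed values, assuming the standard definitions AIC = 2k - 2·logL and BIC = k·ln(n) - 2·logL, where k counts the intercept and the coefficients. The identity F = t² holds only when there is a single independent variable.

```python
import math

# Values copied from the cars summary
t_speed = 9.464
log_likelihood = -206.58
n_obs = 50
k_params = 2  # intercept + speed

f_stat = t_speed ** 2                                   # summary: 89.57
aic = 2 * k_params - 2 * log_likelihood                 # summary: 417.2
bic = k_params * math.log(n_obs) - 2 * log_likelihood   # summary: 421.0
```

BIC penalizes each extra parameter by ln(n) instead of 2, so for n larger than about 7 it punishes additional variables more heavily than AIC.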

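# Summary
1. Get the confidence interval of each independent variable's coefficient 
        Check whether it runs from + to + or from - to - (both ends share a sign)
2. Prob (F-statistic) < 0.05
3. Model comparison 
        Compare three indices: Adj. R-squared (↑), AIC (↓), BIC (↓)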

## Example) Child aggression

In [8]:
child = pd.read_csv('child.csv')

In [9]:
child.head()

Unnamed: 0,Aggression,Television,Computer_Games,Sibling_Aggression,Diet,Parenting_Style
0,0.37416,0.172671,0.141907,-0.328216,-0.110303,-0.279034
1,0.771153,-0.032872,0.709918,0.576837,-0.02299,-1.248167
2,-0.097728,-0.07446,-0.390141,-0.217184,0.280301,-0.328063
3,0.015935,-0.004427,-0.40808,0.046223,-0.263479,-1.005119
4,-0.275385,-0.675239,-0.277778,-0.891045,0.226581,0.489478


In [11]:
res = ols('Aggression ~ Computer_Games', child).fit()

In [13]:
res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.035
Model:,OLS,Adj. R-squared:,0.033
Method:,Least Squares,F-statistic:,23.9
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,1.27e-06
Time:,14:56:55,Log-Likelihood:,-172.63
No. Observations:,666,AIC:,349.3
Df Residuals:,664,BIC:,358.3
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0068,0.012,-0.560,0.576,-0.031,0.017
Computer_Games,0.1742,0.036,4.889,0.000,0.104,0.244

0,1,2,3
Omnibus:,25.478,Durbin-Watson:,1.929
Prob(Omnibus):,0.0,Jarque-Bera (JB):,66.334
Skew:,-0.011,Prob(JB):,3.94e-15
Kurtosis:,4.546,Cond. No.,2.93


According to R-squared, the model explains only 3.5 % of the variance in Aggression
        
    → Computer_Games has little explanatory power

In [20]:
res = ols('Aggression ~ Sibling_Aggression', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.017
Model:,OLS,Adj. R-squared:,0.015
Method:,Least Squares,F-statistic:,11.3
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,0.000821
Time:,15:25:36,Log-Likelihood:,-178.79
No. Observations:,666,AIC:,361.6
Df Residuals:,664,BIC:,370.6
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0061,0.012,-0.493,0.622,-0.030,0.018
Sibling_Aggression,0.1264,0.038,3.361,0.001,0.053,0.200

0,1,2,3
Omnibus:,24.126,Durbin-Watson:,1.903
Prob(Omnibus):,0.0,Jarque-Bera (JB):,60.452
Skew:,-0.025,Prob(JB):,7.47e-14
Kurtosis:,4.475,Cond. No.,3.06


In [19]:
res = ols('Aggression ~ Parenting_Style', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.044
Model:,OLS,Adj. R-squared:,0.043
Method:,Least Squares,F-statistic:,30.84
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,4.05e-08
Time:,15:25:14,Log-Likelihood:,-169.29
No. Observations:,666,AIC:,342.6
Df Residuals:,664,BIC:,351.6
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0050,0.012,-0.414,0.679,-0.029,0.019
Parenting_Style,0.0673,0.012,5.554,0.000,0.044,0.091

0,1,2,3
Omnibus:,27.44,Durbin-Watson:,1.907
Prob(Omnibus):,0.0,Jarque-Bera (JB):,72.583
Skew:,0.08,Prob(JB):,1.73e-16
Kurtosis:,4.609,Cond. No.,1.0


In [14]:
# Use two independent variables: Television and Computer_Games
res = ols('Aggression ~ Television + Computer_Games', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.051
Model:,OLS,Adj. R-squared:,0.049
Method:,Least Squares,F-statistic:,17.99
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,2.45e-08
Time:,15:21:28,Log-Likelihood:,-166.81
No. Observations:,666,AIC:,339.6
Df Residuals:,663,BIC:,353.1
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0029,0.012,-0.237,0.813,-0.027,0.021
Television,0.1353,0.040,3.420,0.001,0.058,0.213
Computer_Games,0.1539,0.036,4.293,0.000,0.083,0.224

0,1,2,3
Omnibus:,24.166,Durbin-Watson:,1.934
Prob(Omnibus):,0.0,Jarque-Bera (JB):,57.964
Skew:,0.091,Prob(JB):,2.59e-13
Kurtosis:,4.434,Cond. No.,3.42


In [18]:
res = ols('Aggression ~ Television + Computer_Games + Parenting_Style', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.067
Model:,OLS,Adj. R-squared:,0.063
Method:,Least Squares,F-statistic:,15.97
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,4.92e-10
Time:,15:24:17,Log-Likelihood:,-161.14
No. Observations:,666,AIC:,330.3
Df Residuals:,662,BIC:,348.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0048,0.012,-0.402,0.688,-0.029,0.019
Television,0.0569,0.046,1.248,0.212,-0.033,0.146
Computer_Games,0.1355,0.036,3.765,0.000,0.065,0.206
Parenting_Style,0.0481,0.014,3.370,0.001,0.020,0.076

0,1,2,3
Omnibus:,25.946,Durbin-Watson:,1.915
Prob(Omnibus):,0.0,Jarque-Bera (JB):,64.91
Skew:,0.097,Prob(JB):,8.03e-15
Kurtosis:,4.517,Cond. No.,3.93


In [22]:
res = ols('Aggression ~ Television + Computer_Games + Parenting_Style + Sibling_Aggression', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.071
Model:,OLS,Adj. R-squared:,0.066
Method:,Least Squares,F-statistic:,12.66
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,6.19e-10
Time:,15:27:05,Log-Likelihood:,-159.82
No. Observations:,666,AIC:,329.6
Df Residuals:,661,BIC:,352.1
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0056,0.012,-0.462,0.644,-0.029,0.018
Television,0.0455,0.046,0.986,0.324,-0.045,0.136
Computer_Games,0.1244,0.037,3.399,0.001,0.053,0.196
Parenting_Style,0.0472,0.014,3.312,0.001,0.019,0.075
Sibling_Aggression,0.0622,0.038,1.621,0.106,-0.013,0.138

0,1,2,3
Omnibus:,24.527,Durbin-Watson:,1.904
Prob(Omnibus):,0.0,Jarque-Bera (JB):,60.107
Skew:,0.079,Prob(JB):,8.87e-14
Kurtosis:,4.463,Cond. No.,4.06


In [23]:
res = ols('Aggression ~ Computer_Games + Parenting_Style + Sibling_Aggression', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.07
Model:,OLS,Adj. R-squared:,0.066
Method:,Least Squares,F-statistic:,16.56
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,2.19e-10
Time:,15:27:51,Log-Likelihood:,-160.31
No. Observations:,666,AIC:,328.6
Df Residuals:,662,BIC:,346.6
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0069,0.012,-0.574,0.566,-0.030,0.017
Computer_Games,0.1256,0.037,3.433,0.001,0.054,0.197
Parenting_Style,0.0542,0.012,4.385,0.000,0.030,0.079
Sibling_Aggression,0.0680,0.038,1.793,0.073,-0.006,0.142

0,1,2,3
Omnibus:,25.031,Durbin-Watson:,1.901
Prob(Omnibus):,0.0,Jarque-Bera (JB):,63.246
Skew:,0.057,Prob(JB):,1.85e-14
Kurtosis:,4.505,Cond. No.,3.42


In [24]:
res = ols('Aggression ~ Television + Parenting_Style + Sibling_Aggression', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.055
Model:,OLS,Adj. R-squared:,0.051
Method:,Least Squares,F-statistic:,12.83
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,3.71e-08
Time:,15:28:11,Log-Likelihood:,-165.59
No. Observations:,666,AIC:,339.2
Df Residuals:,662,BIC:,357.2
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0043,0.012,-0.357,0.721,-0.028,0.019
Television,0.0507,0.046,1.091,0.276,-0.041,0.142
Parenting_Style,0.0541,0.014,3.804,0.000,0.026,0.082
Sibling_Aggression,0.0867,0.038,2.281,0.023,0.012,0.161

0,1,2,3
Omnibus:,24.853,Durbin-Watson:,1.894
Prob(Omnibus):,0.0,Jarque-Bera (JB):,60.93
Skew:,0.088,Prob(JB):,5.88e-14
Kurtosis:,4.471,Cond. No.,4.04


In [25]:
res = ols('Aggression ~ Television + Computer_Games + Parenting_Style + Diet', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.076
Model:,OLS,Adj. R-squared:,0.071
Method:,Least Squares,F-statistic:,13.67
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,1.02e-10
Time:,15:28:52,Log-Likelihood:,-157.94
No. Observations:,666,AIC:,325.9
Df Residuals:,661,BIC:,348.4
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0042,0.012,-0.346,0.730,-0.028,0.019
Television,0.0491,0.046,1.079,0.281,-0.040,0.139
Computer_Games,0.1540,0.037,4.210,0.000,0.082,0.226
Parenting_Style,0.0565,0.015,3.873,0.000,0.028,0.085
Diet,-0.0950,0.038,-2.527,0.012,-0.169,-0.021

0,1,2,3
Omnibus:,27.141,Durbin-Watson:,1.923
Prob(Omnibus):,0.0,Jarque-Bera (JB):,70.409
Skew:,0.092,Prob(JB):,5.14e-16
Kurtosis:,4.582,Cond. No.,4.01


In [21]:
res = ols('Aggression ~ Television + Computer_Games + Parenting_Style + Sibling_Aggression + Diet', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.083
Model:,OLS,Adj. R-squared:,0.076
Method:,Least Squares,F-statistic:,11.88
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,5.02e-11
Time:,15:26:42,Log-Likelihood:,-155.71
No. Observations:,666,AIC:,323.4
Df Residuals:,660,BIC:,350.4
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0050,0.012,-0.416,0.677,-0.029,0.019
Television,0.0329,0.046,0.715,0.475,-0.058,0.123
Computer_Games,0.1422,0.037,3.851,0.000,0.070,0.215
Parenting_Style,0.0566,0.015,3.891,0.000,0.028,0.085
Sibling_Aggression,0.0817,0.039,2.106,0.036,0.006,0.158
Diet,-0.1091,0.038,-2.864,0.004,-0.184,-0.034

0,1,2,3
Omnibus:,24.817,Durbin-Watson:,1.913
Prob(Omnibus):,0.0,Jarque-Bera (JB):,61.941
Skew:,0.067,Prob(JB):,3.55e-14
Kurtosis:,4.488,Cond. No.,4.19


In [26]:
res = ols('Aggression ~ Computer_Games + Parenting_Style + Sibling_Aggression + Diet', child).fit()

res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.082
Model:,OLS,Adj. R-squared:,0.076
Method:,Least Squares,F-statistic:,14.74
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,1.54e-11
Time:,15:35:27,Log-Likelihood:,-155.96
No. Observations:,666,AIC:,321.9
Df Residuals:,661,BIC:,344.4
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0059,0.012,-0.497,0.619,-0.029,0.017
Computer_Games,0.1434,0.037,3.891,0.000,0.071,0.216
Parenting_Style,0.0619,0.013,4.925,0.000,0.037,0.087
Sibling_Aggression,0.0863,0.038,2.258,0.024,0.011,0.161
Diet,-0.1116,0.038,-2.947,0.003,-0.186,-0.037

0,1,2,3
Omnibus:,25.206,Durbin-Watson:,1.911
Prob(Omnibus):,0.0,Jarque-Bera (JB):,64.229
Skew:,0.051,Prob(JB):,1.13e-14
Kurtosis:,4.518,Cond. No.,3.48


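# When there are several independent variables
- When B and C explain A at the same time, an indirect relationship between B and C comes into play

ex. Sibling_Aggression bundles several direct and indirect factors.      
- Genetic factors
- TV and other shared environmental factors
- The direct factor: hitting back because a sibling hits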

### TV, Computer_Games, Sibling_Aggression

In [32]:
res = ols('Aggression ~ Television + Computer_Games + Sibling_Aggression', child).fit()
res.summary()

0,1,2,3
Dep. Variable:,Aggression,R-squared:,0.056
Model:,OLS,Adj. R-squared:,0.051
Method:,Least Squares,F-statistic:,13.03
Date:,"Mon, 16 Sep 2019",Prob (F-statistic):,2.81e-08
Time:,16:05:02,Log-Likelihood:,-165.3
No. Observations:,666,AIC:,338.6
Df Residuals:,662,BIC:,356.6
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0037,0.012,-0.304,0.761,-0.027,0.020
Television,0.1214,0.040,3.015,0.003,0.042,0.201
Computer_Games,0.1416,0.036,3.879,0.000,0.070,0.213
Sibling_Aggression,0.0669,0.039,1.731,0.084,-0.009,0.143

0,1,2,3
Omnibus:,22.454,Durbin-Watson:,1.923
Prob(Omnibus):,0.0,Jarque-Bera (JB):,52.404
Skew:,0.07,Prob(JB):,4.18e-12
Kurtosis:,4.367,Cond. No.,3.6


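#### When TV and games are **statistically controlled** 
: Sibling_Aggression no longer explains the child's Aggression (p = 0.084; the 95 % confidence interval -0.009 to 0.143 includes 0)

ex. Grades rise as spending on private tutoring rises, but 
- once the guardian's attention is statistically controlled,
- that is, once a guardian-attention variable is added to the model, tutoring spending no longer explains grades

cf. Experimental control

    Whether the model as a whole fits well and whether each variable has an effect are separate questions
    If the model as a whole fits well, a variable can stay in even when it explains little on its own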
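
Statistical control can be illustrated with simulated data. In the sketch below everything is hypothetical: `z` plays the role of the guardian's attention, `x` the tutoring spending, and `y` the grades, and the data are generated so that `x` has no effect of its own on `y`. `ols_coefs` is a helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed data-generating process: z drives both x and y; x has no effect on y
z = rng.normal(size=n)           # e.g. guardian's attention
x = z + rng.normal(size=n)       # e.g. spending on private tutoring
y = 2 * z + rng.normal(size=n)   # e.g. grades

def ols_coefs(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

coef_x_alone = ols_coefs(y, x)[1]          # x alone picks up z's effect
coef_x_controlled = ols_coefs(y, x, z)[1]  # close to 0 once z is in the model
```

On its own, `x` appears to explain `y`; once `z` is added, the coefficient of `x` shrinks toward 0, which is the pattern seen for Sibling_Aggression above.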

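# Statistics vs machine learning

Analysis step|Statistics | Machine learning
--|--|--
Model selection|O|O
Variable interpretation|O|X
Focus|Interpretation|Prediction

→ In machine learning, as long as predictions are accurate, what each variable means is of little interest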

# Variable Selection

### Forward Selection: closer to statistical thinking
- Start from a model with only the intercept
- Add the variable that improves the model the most
- Stop when the model no longer improves
- Risk: ending up with too few variables
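
The steps above can be sketched as follows. This is an illustrative implementation, not the lecture's: it uses AIC as the improvement criterion (computed up to an additive constant), and the function names and the synthetic variables `a`, `b`, and `noise` are assumptions.

```python
import numpy as np

def aic(y, X):
    """AIC of a least-squares fit (Gaussian likelihood, up to a constant)."""
    n, k = X.shape
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coefs) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_selection(y, candidates):
    """candidates: dict mapping variable name -> 1-D array."""
    selected = []
    columns = [np.ones(len(y))]           # start from the intercept-only model
    best = aic(y, np.column_stack(columns))
    while True:
        trials = {
            name: aic(y, np.column_stack(columns + [values]))
            for name, values in candidates.items() if name not in selected
        }
        if not trials:
            break
        name = min(trials, key=trials.get)  # variable that improves AIC the most
        if trials[name] >= best:            # no further improvement: stop
            break
        best = trials[name]
        selected.append(name)
        columns.append(candidates[name])
    return selected

rng = np.random.default_rng(0)
n = 500
data = {name: rng.normal(size=n) for name in ["a", "b", "noise"]}
y = 3 * data["a"] - 2 * data["b"] + rng.normal(size=n)

chosen = forward_selection(y, data)  # variables in the order they were added
```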

### Backward Selection
- Start from a model containing all variables
- Remove the variable whose removal improves the model the most
- Stop when the model no longer improves
- Risk: ending up with too many variables
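
A matching sketch for backward selection, again using AIC (up to an additive constant) as an assumed improvement criterion and synthetic variables named for illustration only:

```python
import numpy as np

def aic(y, columns):
    """AIC of a least-squares fit with an intercept (up to a constant)."""
    X = np.column_stack([np.ones(len(y))] + columns)
    n, k = X.shape
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coefs) ** 2)
    return n * np.log(rss / n) + 2 * k

def backward_selection(y, candidates):
    """candidates: dict mapping variable name -> 1-D array."""
    selected = list(candidates)           # start with every variable included
    best = aic(y, [candidates[v] for v in selected])
    while selected:
        trials = {
            name: aic(y, [candidates[v] for v in selected if v != name])
            for name in selected
        }
        name = min(trials, key=trials.get)  # removal that improves AIC the most
        if trials[name] >= best:            # no removal improves the model: stop
            break
        best = trials[name]
        selected.remove(name)
    return selected

rng = np.random.default_rng(0)
n = 500
data = {name: rng.normal(size=n) for name in ["a", "b", "noise"]}
y = 3 * data["a"] - 2 * data["b"] + rng.normal(size=n)

kept = backward_selection(y, data)  # variables remaining in the final model
```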

### Better methods follow in the next session

OLS (Ordinary Least Squares): the ordinary (default) fit that minimizes the MSE
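
As a small check of this statement, the sketch below fits a line by least squares on synthetic data (assumed for illustration) and confirms that nudging either coefficient away from the fitted value increases the MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 0.8 * x + rng.normal(size=200)  # synthetic data

X = np.column_stack([np.ones_like(x), x])     # intercept column + x
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution

def mse(coefs):
    """Mean squared error of the line defined by coefs on this data."""
    return np.mean((y - X @ coefs) ** 2)

mse_ols = mse(beta)
mse_shifted = mse(beta + np.array([0.1, 0.0]))  # intercept moved slightly
mse_tilted = mse(beta + np.array([0.0, 0.05]))  # slope moved slightly
```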