### Contents
- Prediction: prediction is not only about the future

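# Correlation
## Scatter plot
- A plot that draws an x-axis and a y-axis and marks each observation as a point
- x: price of a coffee, y: taste of the coffee -> each coffee is plotted at the position given by its price and taste
- Drawing a scatter plot lets you check by eye what kind of relationship exists
- You should be able to draw a scatter plot in Excel (a basic skill): use the Scatter chart type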


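## Covariance
- Expresses the relationship between two variables as a number
- Subtract the mean of x from each x and the mean of y from each y, multiply the two deviations, sum the products, and divide by n
- Tells you in which direction the points are spread
- A positive covariance means the variables move in the same direction (positive relationship); a negative covariance means they move in opposite directions (negative relationship)
- Example of a negative relationship: x is debt, y is happiness -> people with more debt are less happy, and people with less debt are happier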
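The recipe above can be sketched in a few lines of plain Python. The debt and happiness numbers are made up for illustration, and the helper name `covariance` is hypothetical. It divides by n as in the notes; many libraries divide by n - 1 instead.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Made-up illustrative data, not taken from any real survey
debt = [1, 2, 3, 4, 5]
happiness = [9, 7, 6, 4, 2]

cov_debt_happiness = covariance(debt, happiness)
print(cov_debt_happiness)  # negative: more debt goes with less happiness
```

The sign carries the direction of the relationship, but the size depends on the units of x and y, which is why the value is hard to interpret on its own.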

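## Correlation coefficient
- Covariance rescaled to a fixed range: it always falls between -1 and +1
- +1: the variables move together perfectly, -1: they move in perfectly opposite directions, 0: no linear relationship
- It is unrelated to the slope; it depends on how tightly the points cluster around a line
- Exception: on a perfectly horizontal line y has no variance, so strictly the coefficient is undefined (0/0); it is usually read as 0, i.e. no relationship
- Patterns that are not linear are hard to quantify this way
- A U-shape also comes out as 0 -> yet it is hard to say there is no relationship
- Always compute the correlation coefficient and also draw the scatter plot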
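Two of the points above (the slope does not matter, and a U-shape gives 0) can be checked with a small sketch. The helper `pearson_r` and the data are illustrative assumptions, written with the standard library only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]

r_shallow = pearson_r(x, [0.1 * v for v in x])  # gentle slope, perfect line
r_steep = pearson_r(x, [10 * v for v in x])     # steep slope, perfect line
r_u_shape = pearson_r(x, [v ** 2 for v in x])   # clear U-shaped dependence

print(r_shallow, r_steep, r_u_shape)
```

Both straight lines give a coefficient of 1 regardless of slope, while the U-shape gives 0 even though y is completely determined by x. A constant y would make `syy` zero and the division fail, which is the undefined horizontal-line case.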

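## Spurious correlation
- With only two data points the coefficient will always come out as +1 or -1
- Even when there is no real relationship, a spurious correlation can appear because there is too little data or purely by chance
- The Bible code?
- https://www.tylervigen.com/spurious-correlations
    - Science and technology is suffocating us to death? 99.7%
    - When Nicolas Cage appears in more films, more people drown? 66.6%
    - Cheese consumption vs. number of people who died tangled in their bedsheets: 94.71%
    - Margarine consumption and the divorce rate: 99.26%
- Check the confidence interval of the correlation coefficient: it can be obtained by bootstrapping, e.g. draw 100 observations at random with replacement, compute the correlation coefficient, and repeat this 10,000 times
- Keep collecting data until the confidence interval lies entirely on one side of 0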
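A rough simulation illustrates why small samples produce spurious correlations. Everything here is an assumption for illustration: the data are independent uniform random numbers, the seed, trial counts and the 0.8 threshold are arbitrary, and `share_of_strong_r` is a hypothetical helper.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations around the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Two points always lie exactly on a line, so r is +1 or -1
two_point_rs = []
for _ in range(1000):
    xs = [rng.random(), rng.random()]
    ys = [rng.random(), rng.random()]
    two_point_rs.append(pearson_r(xs, ys))

def share_of_strong_r(n, trials=2000, threshold=0.8):
    """Share of trials in which unrelated random data gives |r| >= threshold."""
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.random() for _ in range(n)]
        if abs(pearson_r(xs, ys)) >= threshold:
            hits += 1
    return hits / trials

share_n5 = share_of_strong_r(5)    # tiny samples
share_n50 = share_of_strong_r(50)  # larger samples
print(share_n5, share_n50)
```

With five observations a strong correlation shows up by chance noticeably often, while with fifty it practically never does, which is the motivation for checking the confidence interval rather than the coefficient alone.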

### Loading the data

In [1]:
import pandas

In [2]:
cars = pandas.read_csv("cars.csv")

In [3]:
cars.head()

| | Unnamed: 0 | speed | dist |
|---|---|---|---|
| 0 | 1 | 4 | 2 |
| 1 | 2 | 4 | 10 |
| 2 | 3 | 7 | 4 |
| 3 | 4 | 7 | 22 |
| 4 | 5 | 8 | 16 |


### Checking the correlation coefficient

In [4]:
from scipy.stats import pearsonr

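(correlation coefficient, p-value)  
p-value < .05: the 95% confidence interval does not include the opposite sign  
p-value < .01: the 99% confidence interval does not include the opposite sign  
p-value < .001: the 99.9% confidence interval does not include the opposite sign  
A small p-value means the sign does not flip between + and - anywhere within the confidence interval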

In [5]:
pearsonr(cars['speed'], cars['dist'])

(0.8068949006892105, 1.4898364962950763e-12)

In [6]:
from sklearn.utils import resample

In [7]:
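cors = [] # start with an empty list
for _ in range(10000): # repeat 10,000 times
    df = resample(cars) # bootstrap sample: draw rows with replacement
    res = pearsonr(df['speed'], df['dist']) # compute the correlation coefficient
    cors.append(res[0]) # store the coefficient; res[0] is the coefficient, res[1] the p-value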

In [8]:
import numpy

In [9]:
numpy.quantile(cors, [.025, .975]) # 95% confidence interval of the correlation coefficient

array([0.69867863, 0.88461586])

In [10]:
numpy.quantile(cors, [.005, .995]) # 99% confidence interval of the correlation coefficient

array([0.65027436, 0.90242746])

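Raising the confidence level makes the confidence interval wider, because it has to cover more cases  
In neither case does the interval touch 0, so however the data are resampled the correlation can be considered positive  
Raising the confidence level further and further pushes the significance level (1 - confidence level) very low; the p-value is the significance level at which the interval first reaches 0  
When you read a p-value, look at the confidence interval as well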

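## Types of correlation coefficient
- Pearson correlation coefficient (the standard one)
    - Computed from the raw data
- Spearman correlation coefficient
    - Computed from rank (ordinal) data
- Kendall correlation coefficient
    - Computed from rank (ordinal) data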
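To make the difference between the three coefficients concrete, here is a minimal sketch on made-up data. It treats Spearman as Pearson applied to ranks and uses a simplified Kendall tau-a that assumes no ties (`scipy.stats.kendalltau` handles ties differently, so results can differ on tied data). All helper names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation on the raw values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant pairs) / all pairs; assumes no ties."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5, 6]
y_cubic = [v ** 3 for v in x]    # always increasing, but not a straight line
y_swapped = [2, 1, 4, 3, 6, 5]   # neighbouring pairs swapped

print(pearson_r(x, y_cubic), spearman_r(x, y_cubic))
print(spearman_r(x, y_swapped), kendall_tau_a(x, y_swapped))
```

On the cubic data the rank-based coefficient is exactly 1 while Pearson is lower, because only the ordering matters to Spearman. On the swapped data Spearman and Kendall disagree in size although both are positive, since one works with rank distances and the other counts pairs.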

In [12]:
liar = pandas.read_csv("liar.csv")

In [13]:
liar.head()

| | Creativity | Position | Novice |
|---|---|---|---|
| 0 | 53 | 1 | 0 |
| 1 | 36 | 3 | 1 |
| 2 | 31 | 4 | 0 |
| 3 | 43 | 2 | 0 |
| 4 | 30 | 4 | 1 |


In [14]:
from scipy.stats import spearmanr, kendalltau

In [15]:
spearmanr(liar['Creativity'], liar['Position'])

SpearmanrResult(correlation=-0.37321838128767815, pvalue=0.0017204168895658578)

In [16]:
kendalltau(liar['Creativity'], liar['Position'])

KendalltauResult(correlation=-0.3002413080651747, pvalue=0.001258802279346817)

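The values differ slightly because the two coefficients are computed differently  
The higher the creativity, the smaller the rank number (Position), i.e. the better the placing  
Negative correlation between creativity and lying rank -> the more creative, the better the liar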

In [17]:
pearsonr(liar['Creativity'], liar['Position']) # the p-value is larger (> .01): widening to a 99% confidence interval could let the sign flip between + and -

(-0.30603143483570205, 0.01114802877289378)

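With Kendall the interval only crosses between + and - at the 99.9% level, whereas with Pearson it already does at the 99% level  
"Crossing 0" = the sign flipping between + and -  
With Kendall the 99% interval does not reach 0, because it is narrower than the 99.9% interval  
For rank (ordinal) data, use Kendall or Spearman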

- To check the correlation between categorical variables  
code them as numbers, e.g. male 0 / female 1 and dog 0 / cat 1, and compute the correlation
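As a sketch of the 0/1 coding idea, the example below computes the Pearson correlation of two binary variables and compares it with the phi coefficient taken from the 2x2 table of counts. The eight observations are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation on the numeric codes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: one entry per person
sex = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = male, 1 = female
pet = [0, 0, 0, 1, 0, 1, 1, 1]  # 0 = dog, 1 = cat

r_binary = pearson_r(sex, pet)

# The same quantity from the 2x2 table of counts (phi coefficient)
a = sum(1 for s, p in zip(sex, pet) if s == 0 and p == 0)  # male, dog
b = sum(1 for s, p in zip(sex, pet) if s == 0 and p == 1)  # male, cat
c = sum(1 for s, p in zip(sex, pet) if s == 1 and p == 0)  # female, dog
d = sum(1 for s, p in zip(sex, pet) if s == 1 and p == 1)  # female, cat
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(r_binary, phi)
```

Both routes give the same number, so coding two-category variables as 0 and 1 and calling a Pearson routine is a legitimate shortcut. Which category gets the 1 only affects the sign.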

# Regression analysis

In [19]:
# Regression analysis with statsmodels

In [22]:
from statsmodels.formula.api import ols

In [25]:
res = ols('dist ~ speed', data = cars).fit() # the ~ formula syntax comes from R: dependent variable (y) ~ independent variable (x)

In [26]:
res.summary() 
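# coef -> dist = 3.93*speed - 17.58
# [0.025 0.975] -> 95% confidence interval
# P>|t| -> P below 0.05 -> the 95% confidence interval does not include 0 (the sign is consistent)
# the confidence interval of the Intercept is usually not that important
# roughly, coef ± 2 * std err gives the 95% confidence interval
# t = coef / std err, the theoretical statistic from which P is computed
# R-squared = eta squared (share of variance explained): 65% of the variance in stopping distance is explained by speed
# Adj. R-squared -> R-squared adjusted for the number of independent variables (slide p. 15)
# R-squared equals the squared correlation between dist and speed (its square root gives the size of the correlation; only with a single independent variable!)
# F-statistic -> test statistic for the hypothesis that all regression coefficients are 0
# Prob (F-statistic) -> read like a p-value -> should be below 0.05
# Log-Likelihood -> log of the probability (density) of the observed data under the fitted model (higher is better; it never decreases when independent variables are added)
# AIC/BIC: lower is better

# Summary
# 1. Coefficients of the independent variables -> check the confidence intervals
# 2. Prob(F) < 0.05
# 3. Model comparison: higher Adj. R-squared is better, lower AIC/BIC is better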

| | | | |
|---|---|---|---|
| Dep. Variable: | dist | R-squared: | 0.651 |
| Model: | OLS | Adj. R-squared: | 0.644 |
| Method: | Least Squares | F-statistic: | 89.57 |
| Date: | Mon, 16 Sep 2019 | Prob (F-statistic): | 1.49e-12 |
| Time: | 14:00:55 | Log-Likelihood: | -206.58 |
| No. Observations: | 50 | AIC: | 417.2 |
| Df Residuals: | 48 | BIC: | 421.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -17.5791 | 6.758 | -2.601 | 0.012 | -31.168 | -3.990 |
| speed | 3.9324 | 0.416 | 9.464 | 0.000 | 3.097 | 4.768 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.975 | Durbin-Watson: | 1.676 |
| Prob(Omnibus): | 0.011 | Jarque-Bera (JB): | 8.189 |
| Skew: | 0.885 | Prob(JB): | 0.0167 |
| Kurtosis: | 3.893 | Cond. No. | 50.7 |
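The link between the regression output and the correlation coefficient can be checked by hand. The sketch below fits a least-squares line using only the five rows printed by `cars.head()`, so its numbers will not match the 50-row fit above. It is meant to show the mechanics, not to reproduce the table.

```python
import math

# First five rows of cars as printed by cars.head(); the full data set has 50 rows
speed = [4, 4, 7, 7, 8]
dist = [2, 10, 4, 22, 16]

n = len(speed)
mean_x = sum(speed) / n
mean_y = sum(dist) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(speed, dist))
sxx = sum((x - mean_x) ** 2 for x in speed)
syy = sum((y - mean_y) ** 2 for y in dist)

slope = sxy / sxx                    # least-squares slope
intercept = mean_y - slope * mean_x  # the line passes through the point of means

predicted = [intercept + slope * x for x in speed]
ss_residual = sum((y - p) ** 2 for y, p in zip(dist, predicted))
r_squared = 1 - ss_residual / syy    # share of the variance in dist explained

r = sxy / math.sqrt(sxx * syy)       # Pearson correlation of speed and dist
print(slope, intercept, r_squared, r ** 2)
```

With a single independent variable, `r_squared` and `r ** 2` agree up to floating-point error. For the full data the same relation links the 0.651 in the table to the correlation of about 0.807 computed earlier.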


In [28]:
child = pandas.read_csv('child.csv')

In [29]:
child.head()

| | Aggression | Television | Computer_Games | Sibling_Aggression | Diet | Parenting_Style |
|---|---|---|---|---|---|---|
| 0 | 0.37416 | 0.172671 | 0.141907 | -0.328216 | -0.110303 | -0.279034 |
| 1 | 0.771153 | -0.032872 | 0.709918 | 0.576837 | -0.02299 | -1.248167 |
| 2 | -0.097728 | -0.07446 | -0.390141 | -0.217184 | 0.280301 | -0.328063 |
| 3 | 0.015935 | -0.004427 | -0.40808 | 0.046223 | -0.263479 | -1.005119 |
| 4 | -0.275385 | -0.675239 | -0.277778 | -0.891045 | 0.226581 | 0.489478 |


In [34]:
res_games = ols('Aggression~Computer_Games', child).fit()

In [35]:
res_games.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Aggression | R-squared: | 0.035 |
| Model: | OLS | Adj. R-squared: | 0.033 |
| Method: | Least Squares | F-statistic: | 23.9 |
| Date: | Mon, 16 Sep 2019 | Prob (F-statistic): | 1.27e-06 |
| Time: | 15:08:53 | Log-Likelihood: | -172.63 |
| No. Observations: | 666 | AIC: | 349.3 |
| Df Residuals: | 664 | BIC: | 358.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0068 | 0.012 | -0.560 | 0.576 | -0.031 | 0.017 |
| Computer_Games | 0.1742 | 0.036 | 4.889 | 0.000 | 0.104 | 0.244 |

| | | | |
|---|---|---|---|
| Omnibus: | 25.478 | Durbin-Watson: | 1.929 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 66.334 |
| Skew: | -0.011 | Prob(JB): | 3.94e-15 |
| Kurtosis: | 4.546 | Cond. No. | 2.93 |


In [37]:
res_tv = ols('Aggression~Television', child).fit()

In [38]:
res_tv.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Aggression | R-squared: | 0.025 |
| Model: | OLS | Adj. R-squared: | 0.024 |
| Method: | Least Squares | F-statistic: | 17.11 |
| Date: | Mon, 16 Sep 2019 | Prob (F-statistic): | 3.98e-05 |
| Time: | 15:09:35 | Log-Likelihood: | -175.93 |
| No. Observations: | 666 | AIC: | 355.9 |
| Df Residuals: | 664 | BIC: | 364.9 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0005 | 0.012 | -0.041 | 0.967 | -0.025 | 0.024 |
| Television | 0.1634 | 0.040 | 4.137 | 0.000 | 0.086 | 0.241 |

| | | | |
|---|---|---|---|
| Omnibus: | 24.471 | Durbin-Watson: | 1.931 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 58.038 |
| Skew: | 0.108 | Prob(JB): | 2.5e-13 |
| Kurtosis: | 4.43 | Cond. No. | 3.23 |


In [39]:
res_diet = ols('Aggression~Diet', child).fit()

In [40]:
res_diet.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Aggression | R-squared: | 0.0 |
| Model: | OLS | Adj. R-squared: | -0.001 |
| Method: | Least Squares | F-statistic: | 0.04891 |
| Date: | Mon, 16 Sep 2019 | Prob (F-statistic): | 0.825 |
| Time: | 15:10:06 | Log-Likelihood: | -184.38 |
| No. Observations: | 666 | AIC: | 372.8 |
| Df Residuals: | 664 | BIC: | 381.8 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0049 | 0.012 | -0.397 | 0.692 | -0.029 | 0.019 |
| Diet | -0.0081 | 0.037 | -0.221 | 0.825 | -0.080 | 0.064 |

| | | | |
|---|---|---|---|
| Omnibus: | 27.097 | Durbin-Watson: | 1.928 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 73.373 |
| Skew: | -0.023 | Prob(JB): | 1.17e-16 |
| Kurtosis: | 4.625 | Cond. No. | 2.97 |


In [41]:
res_ps = ols('Aggression~Parenting_Style', child).fit()

In [42]:
res_ps.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Aggression | R-squared: | 0.044 |
| Model: | OLS | Adj. R-squared: | 0.043 |
| Method: | Least Squares | F-statistic: | 30.84 |
| Date: | Mon, 16 Sep 2019 | Prob (F-statistic): | 4.05e-08 |
| Time: | 15:10:33 | Log-Likelihood: | -169.29 |
| No. Observations: | 666 | AIC: | 342.6 |
| Df Residuals: | 664 | BIC: | 351.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0050 | 0.012 | -0.414 | 0.679 | -0.029 | 0.019 |
| Parenting_Style | 0.0673 | 0.012 | 5.554 | 0.000 | 0.044 | 0.091 |

| | | | |
|---|---|---|---|
| Omnibus: | 27.44 | Durbin-Watson: | 1.907 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 72.583 |
| Skew: | 0.08 | Prob(JB): | 1.73e-16 |
| Kurtosis: | 4.609 | Cond. No. | 1.0 |


In [43]:
res_comb = ols('Aggression~Television+Computer_Games', child).fit()

In [44]:
res_comb.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Aggression | R-squared: | 0.051 |
| Model: | OLS | Adj. R-squared: | 0.049 |
| Method: | Least Squares | F-statistic: | 17.99 |
| Date: | Mon, 16 Sep 2019 | Prob (F-statistic): | 2.45e-08 |
| Time: | 15:23:58 | Log-Likelihood: | -166.81 |
| No. Observations: | 666 | AIC: | 339.6 |
| Df Residuals: | 663 | BIC: | 353.1 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0029 | 0.012 | -0.237 | 0.813 | -0.027 | 0.021 |
| Television | 0.1353 | 0.040 | 3.420 | 0.001 | 0.058 | 0.213 |
| Computer_Games | 0.1539 | 0.036 | 4.293 | 0.000 | 0.083 | 0.224 |

| | | | |
|---|---|---|---|
| Omnibus: | 24.166 | Durbin-Watson: | 1.934 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 57.964 |
| Skew: | 0.091 | Prob(JB): | 2.59e-13 |
| Kurtosis: | 4.434 | Cond. No. | 3.42 |


### Exercise: combine variables to find the best Adj. R-squared, AIC, and BIC


In [47]:
res_1 = ols('Aggression~Television+Computer_Games+Sibling_Aggression', child).fit()
res_1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Aggression | R-squared: | 0.056 |
| Model: | OLS | Adj. R-squared: | 0.051 |
| Method: | Least Squares | F-statistic: | 13.03 |
| Date: | Mon, 16 Sep 2019 | Prob (F-statistic): | 2.81e-08 |
| Time: | 15:26:14 | Log-Likelihood: | -165.3 |
| No. Observations: | 666 | AIC: | 338.6 |
| Df Residuals: | 662 | BIC: | 356.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0037 | 0.012 | -0.304 | 0.761 | -0.027 | 0.020 |
| Television | 0.1214 | 0.040 | 3.015 | 0.003 | 0.042 | 0.201 |
| Computer_Games | 0.1416 | 0.036 | 3.879 | 0.000 | 0.070 | 0.213 |
| Sibling_Aggression | 0.0669 | 0.039 | 1.731 | 0.084 | -0.009 | 0.143 |

| | | | |
|---|---|---|---|
| Omnibus: | 22.454 | Durbin-Watson: | 1.923 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 52.404 |
| Skew: | 0.07 | Prob(JB): | 4.18e-12 |
| Kurtosis: | 4.367 | Cond. No. | 3.6 |


In [62]:
res_2 = ols('Aggression~Computer_Games+Sibling_Aggression+Diet+Television+Parenting_Style', child).fit()
res_2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Aggression | R-squared: | 0.083 |
| Model: | OLS | Adj. R-squared: | 0.076 |
| Method: | Least Squares | F-statistic: | 11.88 |
| Date: | Mon, 16 Sep 2019 | Prob (F-statistic): | 5.02e-11 |
| Time: | 15:30:08 | Log-Likelihood: | -155.71 |
| No. Observations: | 666 | AIC: | 323.4 |
| Df Residuals: | 660 | BIC: | 350.4 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0050 | 0.012 | -0.416 | 0.677 | -0.029 | 0.019 |
| Computer_Games | 0.1422 | 0.037 | 3.851 | 0.000 | 0.070 | 0.215 |
| Sibling_Aggression | 0.0817 | 0.039 | 2.106 | 0.036 | 0.006 | 0.158 |
| Diet | -0.1091 | 0.038 | -2.864 | 0.004 | -0.184 | -0.034 |
| Television | 0.0329 | 0.046 | 0.715 | 0.475 | -0.058 | 0.123 |
| Parenting_Style | 0.0566 | 0.015 | 3.891 | 0.000 | 0.028 | 0.085 |

| | | | |
|---|---|---|---|
| Omnibus: | 24.817 | Durbin-Watson: | 1.913 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 61.941 |
| Skew: | 0.067 | Prob(JB): | 3.55e-14 |
| Kurtosis: | 4.488 | Cond. No. | 4.19 |
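The model-comparison figures in these summaries follow from a few simple formulas. The sketch below recomputes them for the five-predictor model from its reported log-likelihood and R-squared. The function names are hypothetical, and the parameter count (slopes plus intercept) is an assumption that happens to agree with the reported values.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: penalises each parameter by 2."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: penalises each parameter by ln(n)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """R-squared shrunk according to the number of independent variables."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read off the summary of the five-predictor model (res_2)
n_obs = 666
n_predictors = 5
n_params = n_predictors + 1  # assumption: slopes plus the intercept
log_likelihood = -155.71
r_squared = 0.083

model_aic = aic(log_likelihood, n_params)
model_bic = bic(log_likelihood, n_params, n_obs)
model_adj_r2 = adjusted_r_squared(r_squared, n_obs, n_predictors)
print(round(model_aic, 1), round(model_bic, 1), round(model_adj_r2, 3))
```

Because the BIC penalty per parameter, ln(666) or about 6.5, is larger than the AIC penalty of 2, BIC favours smaller models more strongly. That is why the two criteria can rank candidate models differently in this exercise.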
