# <font color='blue'><div style="text-align: center">Regression Analysis</div></font>

* Francis Galton's 1875 illustration of the correlation between the heights of adults and their parents. The observation that adult children's heights tended to deviate less from the mean height than their parents' heights did suggested the concept of "regression toward the mean", giving regression its name.

<img src="https://drive.google.com/uc?id=1hEIe9tmbVxr5UlIVddtz7wdVc6XE3e_5" width="400" height="300">

* In other words, in 1875 Galton used a plot of the relationship between parents' heights and their children's heights to argue for "regression toward the mean".

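* So what does "regression toward the mean" actually mean?
* It means that children of tall parents tend to be somewhat shorter than their parents, and children of short parents tend to be somewhat taller, so the children's heights regress toward the mean.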
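
The effect described above can be reproduced with simulated data. The following is a minimal sketch, not Galton's data: the mean, standard deviation, and correlation (`mean_height`, `sd`, `rho`) are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 100_000
mean_height, sd, rho = 69.0, 2.5, 0.5  # hypothetical values, not Galton's estimates

# Parent and child heights share the same mean and spread, with correlation rho.
parent = rng.normal(mean_height, sd, n)
noise = rng.normal(0.0, sd, n)
child = mean_height + rho * (parent - mean_height) + np.sqrt(1 - rho**2) * noise

tall = parent > mean_height + sd    # parents more than one SD above the mean
short = parent < mean_height - sd   # parents more than one SD below the mean

print("tall parents :", parent[tall].mean(), "-> children:", child[tall].mean())
print("short parents:", parent[short].mean(), "-> children:", child[short].mean())
```

In both groups the children's average should land between the parents' average and the overall mean, which is the pattern the term "regression toward the mean" describes.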



<img src="https://drive.google.com/uc?id=1p4DT5ylKomcKNVlmoOGPIpdGewl89xXK" width="600" height="500">
<img src="https://drive.google.com/uc?id=10hgkjaK8k3ujPqViaFXIdHoZ4dk7FGPW" width="600" height="500">
<img src="https://drive.google.com/uc?id=19XCwkkYDIakPGUtAp8E8retKygQSliYh" width="400" height="300">
<img src="https://drive.google.com/uc?id=1r9CjLGiJgV-Mr8wLnGhM25fZpdal2wZX" width="500" height="400">


In [7]:
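import numpy as np

# Toy data lying exactly on the line y = x
x = np.array([1, 2, 3])
y = np.array([1, 2, 3])

xbar = np.mean(x)
ybar = np.mean(y)

# Accumulate Sxx = sum((x - xbar)^2) and Sxy = sum((x - xbar) * (y - ybar))
summ1 = summ2 = 0
for i in range(len(x)):
    summ1 += (x[i] - xbar) ** 2
    summ2 += (x[i] - xbar) * (y[i] - ybar)

beta = summ2 / summ1        # slope estimate
alpha = ybar - beta * xbar  # intercept estimate
alpha, beta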

(0.0, 1.0)
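
The loop above computes the simple least-squares estimates, slope = Sxy / Sxx and intercept = ȳ − slope · x̄. As a sketch, the same calculation can be written without a loop and cross-checked against `np.polyfit`; the names `dx`, `dy`, `slope`, and `intercept` are introduced here only for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

# Deviations from the sample means
dx = x - x.mean()
dy = y - y.mean()

beta = np.sum(dx * dy) / np.sum(dx ** 2)   # slope = Sxy / Sxx
alpha = y.mean() - beta * x.mean()         # intercept = ybar - beta * xbar

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print(alpha, beta)
print(intercept, slope)
```

Both approaches should agree up to floating-point rounding.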

In [9]:
import pandas as pd
from sklearn import datasets, linear_model

# Load Galton's family height data (heights in inches) and split the children by gender
data = pd.read_csv("D:/NaverCloud/Lecture/통계교육원/파이썬통계/GaltonFamilies.csv")
maleData = data[data['gender'] == "male"]
femaleData = data[data['gender'] == "female"]
data[:10]  # show the first 10 rows

Unnamed: 0.1,Unnamed: 0,family,father,mother,midparentHeight,children,childNum,gender,childHeight
0,1,1,78.5,67.0,75.43,4,1,male,73.2
1,2,1,78.5,67.0,75.43,4,2,female,69.2
2,3,1,78.5,67.0,75.43,4,3,female,69.0
3,4,1,78.5,67.0,75.43,4,4,female,69.0
4,5,2,75.5,66.5,73.66,4,1,male,73.5
5,6,2,75.5,66.5,73.66,4,2,male,72.5
6,7,2,75.5,66.5,73.66,4,3,female,65.5
7,8,2,75.5,66.5,73.66,4,4,female,65.5
8,9,3,75.0,64.0,72.06,2,1,male,71.0
9,10,3,75.0,64.0,72.06,2,2,female,68.0


In [10]:
maleData['father'].mean()

69.13762993762994

In [None]:
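# Split the sons by whether the father is taller than the mean father height (about 69.14)
maleData2 = maleData[maleData['father'] > 69.14]   # fathers above the mean
maleData1 = maleData[maleData['father'] <= 69.14]  # fathers at or below the mean

# Group 2: fit childHeight = coef * father with no intercept (line through the origin)
Y = maleData2['childHeight'].to_numpy()
X = maleData2['father'].to_numpy().reshape(-1, 1)  # sklearn expects a 2-D feature array
linreg = linear_model.LinearRegression(fit_intercept=False)
model = linreg.fit(X, Y)
coef2 = model.coef_

# Group 1: the same fit for fathers at or below the mean
Y = maleData1['childHeight'].to_numpy()
X = maleData1['father'].to_numpy().reshape(-1, 1)
linreg = linear_model.LinearRegression(fit_intercept=False)
model = linreg.fit(X, Y)
coef1 = model.coef_
print(coef1, coef2)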

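* The code above fits a regression of the son's height on the father's height, without an intercept (`fit_intercept = False`), separately for two groups of fathers.
* For fathers taller than the mean, the estimated slope is 0.986, which is less than 1; for fathers at or below the mean, it is 1.015, which is greater than 1.
* That is, sons of tall fathers tend to be slightly shorter than their fathers, and sons of short fathers tend to be slightly taller, which is consistent with Galton's law of regression.
* Repeating the experiment with mothers' and daughters' heights gave the same result.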
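
Because the fits above force the line through the origin, each slope is Σxy / Σx², which is roughly the ratio of a typical son's height to a typical father's height in that group. The sketch below illustrates this on synthetic data (not the Galton file): the means, spreads, and the true slope of 0.5 are hypothetical, and `slope_through_origin` is a helper defined only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

n = 5000
# Hypothetical heights in inches; sons regress halfway toward the mean of 69.
father = rng.normal(69.0, 2.5, n)
son = 69.0 + 0.5 * (father - 69.0) + rng.normal(0.0, 2.0, n)

tall = father > father.mean()  # same kind of split as in the cell above

def slope_through_origin(x, y):
    """Least-squares slope when the line is forced through (0, 0)."""
    return np.sum(x * y) / np.sum(x ** 2)

b_tall = slope_through_origin(father[tall], son[tall])
b_short = slope_through_origin(father[~tall], son[~tall])

# Ordinary fit with an intercept, using all father-son pairs
b_full, a_full = np.polyfit(father, son, deg=1)

print("through origin, tall fathers :", b_tall)
print("through origin, short fathers:", b_short)
print("with intercept, all pairs    :", b_full)
```

Under these assumptions the through-origin slopes sit just below and just above 1, while the fit with an intercept recovers a slope near 0.5. A slope below 1 in the fit with an intercept is the more direct way to see regression toward the mean.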

## 다중회귀 모형(Multiple Regression Model)

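* A multiple regression model is a regression model for the case where several independent variables affect the dependent variable.
* For example, if salary is the dependent variable, the factors that affect it might include age, gender, education, occupation, and monthly working hours.
* This can be written as the following equation.
$$ \text{salary} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{gender} + \beta_3\,\text{education} + \beta_4\,\text{occupation} + \beta_5\,\text{monthly working hours} + \epsilon $$
* Generalizing, this can be written as follows.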

<img src="https://drive.google.com/uc?id=1tTetmsnANGs4p-Ou3nhLtCAAsDDi6tcq" width="700" height="600">
<img src="https://drive.google.com/uc?id=1z6-ZNwexwEX_xmx_8xpKETkrNB66yp6O" width="400" height="300">
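
Since the figures above are images, the following restates the standard form of the model as an assumption about what they show, not a transcription. With $p$ independent variables and $n$ observations, the model and its least-squares estimate can be written as:

$$ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i, \qquad i = 1, \ldots, n $$

$$ Y = X\beta + \epsilon, \qquad \hat{\beta} = (X^{\top}X)^{-1}X^{\top}Y $$

Here $X$ is the $n \times (p+1)$ design matrix whose first column is all ones (for the intercept $\beta_0$), and the estimate exists when $X^{\top}X$ is invertible.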


In [16]:
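# Design matrix: the first column of ones is the intercept, the second is x
X = np.array([[1, 1],
              [1, 2],
              [1, 3]])
Y = np.array([1,
              2,
              3])

# Normal equations: beta_hat = (X^T X)^{-1} X^T Y
_tmp1 = np.linalg.inv(np.dot(X.T, X))  # (X^T X)^{-1}
_tmp2 = np.dot(X.T, Y)                 # X^T Y

np.dot(_tmp1, _tmp2)  # [intercept, slope]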


array([-1.77635684e-15,  1.00000000e+00])
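
The intercept of about -1.8e-15 in the output above is floating-point rounding from the explicit matrix inverse; the exact answer is 0. As a sketch of an alternative, the same estimate can be obtained without forming the inverse, either by solving the normal equations with `np.linalg.solve` or by calling `np.linalg.lstsq`, which is generally the more numerically robust choice.

```python
import numpy as np

# Same toy design matrix (column of ones for the intercept) and response
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.0, 2.0, 3.0])

# Solve (X^T X) beta = X^T Y directly instead of inverting X^T X
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver; also reports the rank of X and its singular values
beta_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X, Y, rcond=None)

print(beta_normal)
print(beta_lstsq, rank)
```

Both results should match `[0, 1]` (intercept, slope) up to rounding.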

In [34]:
import pandas as pd
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

data = pd.read_csv('D:/NaverCloud/Lecture/통계교육원/파이썬통계/iris.csv')
data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,name
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [None]:
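# Keep only the versicolor rows
new_data = data[data['name']=='versicolor']

# Simple regression: sepal_width explained by petal_length
model = ols(" sepal_width ~ petal_length", new_data).fit()
print(model.summary())

# Fitted values for the observed petal lengths
pred = model.predict(new_data['petal_length'])

# Scatter plot of the data with the fitted regression line on top
plt.scatter(new_data['petal_length'], new_data['sepal_width'])
plt.plot(new_data['petal_length'], pred)
plt.show()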

In [36]:
model = ols(" sepal_width ~ sepal_length+petal_length+petal_width", new_data).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            sepal_width   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.451
Method:                 Least Squares   F-statistic:                     14.43
Date:                Tue, 30 Jun 2020   Prob (F-statistic):           9.29e-07
Time:                        23:25:56   Log-Likelihood:                 4.0875
No. Observations:                  50   AIC:                           -0.1750
Df Residuals:                      46   BIC:                             7.473
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.8097      0.384      2.109   

In [33]:
model = ols('sepal_width ~ name + petal_length', data).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            sepal_width   R-squared:                       0.478
Model:                            OLS   Adj. R-squared:                  0.468
Method:                 Least Squares   F-statistic:                     44.63
Date:                Tue, 30 Jun 2020   Prob (F-statistic):           1.58e-20
Time:                        23:21:14   Log-Likelihood:                -38.185
No. Observations:                 150   AIC:                             84.37
Df Residuals:                     146   BIC:                             96.41
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              2.9813      0

$$ \hat{Y} = 2.98 - 1.48\, I_{\text{versicolor}} - 1.663\, I_{\text{virginica}} + 0.298 \cdot \text{petal length} $$
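
To make the role of the indicator variables concrete, the fitted equation above can be evaluated by hand. This is a minimal sketch that hard-codes the rounded coefficients from the equation; `predict_sepal_width` and `SPECIES_OFFSET` are hypothetical names, not part of statsmodels. The baseline species (setosa) has both indicators equal to 0, so its offset is 0.

```python
# Rounded coefficients taken from the fitted equation above
INTERCEPT = 2.98
PETAL_LENGTH_COEF = 0.298
SPECIES_OFFSET = {
    "setosa": 0.0,         # baseline: both indicators are 0
    "versicolor": -1.48,   # I_versicolor = 1
    "virginica": -1.663,   # I_virginica = 1
}

def predict_sepal_width(species, petal_length):
    """Evaluate the fitted equation for one flower (hypothetical helper)."""
    return INTERCEPT + SPECIES_OFFSET[species] + PETAL_LENGTH_COEF * petal_length

print(predict_sepal_width("setosa", 1.4))
print(predict_sepal_width("versicolor", 4.0))
print(predict_sepal_width("virginica", 4.0))
```

Each species gets its own intercept while sharing the same slope for petal length, so the model describes three parallel lines.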