
# **Multiple Regression Analysis**
To examine omitted variable bias, let us load the following data.

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the wage1 data (Wooldridge) from an online source
url = "https://vincentarelbundock.github.io/Rdatasets/csv/wooldridge/wage1.csv"
df = pd.read_csv(url)

# Inspect the data
print(df[['wage', 'educ', 'exper', 'female']].head())

   wage  educ  exper  female
0  3.10    11      2       1
1  3.24    12     22       1
2  3.00    11      2       0
3  6.00     8     44       0
4  5.30    12      7       0


In [6]:
# Regress log wage (Y) on education (X1)
import statsmodels.formula.api as smf
model = smf.ols('np.log(wage) ~ educ ', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           np.log(wage)   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     119.6
Date:                Wed, 07 May 2025   Prob (F-statistic):           3.27e-25
Time:                        07:19:31   Log-Likelihood:                -359.38
No. Observations:                 526   AIC:                             722.8
Df Residuals:                     524   BIC:                             731.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5838      0.097      5.998      0.0

In [7]:
# Regress log wage (Y) on education (X1) and experience (X2)
model = smf.ols('np.log(wage) ~ educ + exper', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           np.log(wage)   R-squared:                       0.249
Model:                            OLS   Adj. R-squared:                  0.246
Method:                 Least Squares   F-statistic:                     86.86
Date:                Wed, 07 May 2025   Prob (F-statistic):           2.68e-33
Time:                        07:19:33   Log-Likelihood:                -338.01
No. Observations:                 526   AIC:                             682.0
Df Residuals:                     523   BIC:                             694.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.2169      0.109      1.997      0.0

The estimates are β1 = 0.098 and β2 = 0.010.
Next, regress X2 (exper) on X1 (educ).

In [8]:
model = smf.ols('exper ~ educ ', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  exper   R-squared:                       0.090
Model:                            OLS   Adj. R-squared:                  0.088
Method:                 Least Squares   F-statistic:                     51.65
Date:                Wed, 07 May 2025   Prob (F-statistic):           2.30e-12
Time:                        07:19:35   Log-Likelihood:                -2093.0
No. Observations:                 526   AIC:                             4190.
Df Residuals:                     524   BIC:                             4198.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     35.4615      2.628     13.494      0.0

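The estimated slope δ from this regression is -1.47.

This confirms the total effect: 0.083 = 0.098 + 0.010 × (-1.47), where 0.083 is the coefficient on educ in the regression that omits exper.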
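
The decomposition above (short-regression slope = long-regression slope on X1 + long-regression slope on X2 × δ) holds exactly in any sample, not only for wage1. The following is a minimal sketch on simulated data; the variable names, the helper `ols`, and the data-generating numbers are assumptions made for illustration and are not taken from the wage1 results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated data (hypothetical): x2 is negatively correlated with x1
x1 = rng.normal(size=n)
x2 = -1.5 * x1 + rng.normal(size=n)
y = 0.1 * x1 + 0.01 * x2 + rng.normal(scale=0.5, size=n)

def ols(y, *cols):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_short = ols(y, x1)        # y on x1 only (x2 omitted)
b_long = ols(y, x1, x2)     # y on x1 and x2
delta = ols(x2, x1)         # x2 on x1

# Total effect of x1 = direct effect + effect through x2
total = b_long[1] + b_long[2] * delta[1]
print(b_short[1], total)
```

The two printed numbers agree up to floating-point error, which is the same identity checked by hand above.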

Now let us return to the economic literacy data.

In [9]:
# Load and preprocess the data
import pandas as pd
import numpy as np

df = pd.read_excel('econliteracy (1).xlsx', index_col=0)

df = df.rename(columns={'sq1': 'gender', 'sq2': 'age', 'sq3': 'region', 'sq4': 'job', 'sq5': 'edu', 'sq6': 'income'})  # rename columns

df = df.dropna(subset=['gender'])        # drop samples with missing gender

# Recode variables

df['gender'] = df['gender'].replace(2, 0)
df['income'] = df['income'].replace(8, 7)
df['job'] = df['job'].replace(6, 5)
df['edu'] = df['edu'].replace(1, 2)
# Note: code 7 appears in both the second and third lists;
# the first match wins, so 7 is assigned to group 2.
df['region2'] = df['region'].apply(
    lambda x: 1 if x in [1, 4, 9]
    else 2 if x in [6, 7, 8, 10, 11, 12]
    else 3 if x in [2, 3, 7, 15, 16]
    else 4 if x in [5, 13, 14]
    else None
)
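
A chained conditional like the one above can hide a code that is listed in more than one group. As a hedged alternative, the sketch below builds an explicit lookup table and reports any code claimed twice. The group lists are copied from the recode above; the names `groups`, `lookup`, and `duplicates` and the toy `region` series are assumptions for illustration, not survey data.

```python
import pandas as pd

# Group definitions copied from the recode above
groups = {
    1: [1, 4, 9],
    2: [6, 7, 8, 10, 11, 12],
    3: [2, 3, 7, 15, 16],
    4: [5, 13, 14],
}

# Build a code -> group lookup. The first group that claims a code keeps it
# (same behavior as the chained conditional); repeats are recorded.
lookup, duplicates = {}, []
for group, codes in groups.items():
    for code in codes:
        if code in lookup:
            duplicates.append(code)
        else:
            lookup[code] = group

region = pd.Series([1, 7, 2, 5, 17])   # toy data, not the survey
region2 = region.map(lookup)           # codes without a group become NaN
print(duplicates)
print(region2.tolist())
```

Whether code 7 belongs in group 2 or group 3 depends on the survey codebook, which is not shown here; the sketch only makes the overlap visible.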



Take economic literacy as the dependent variable and add an interaction term.

In [10]:
import statsmodels.formula.api as smf
model = smf.ols('score ~ age + I(age**2) + gender+ edu + gender*edu', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.126
Model:                            OLS   Adj. R-squared:                  0.111
Method:                 Least Squares   F-statistic:                     8.413
Date:                Wed, 07 May 2025   Prob (F-statistic):           1.89e-07
Time:                        07:19:47   Log-Likelihood:                -1287.6
No. Observations:                 297   AIC:                             2587.
Df Residuals:                     291   BIC:                             2609.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       4.2499     15.167      0.280      
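
In the formula syntax, `gender*edu` expands to `gender + edu + gender:edu`, so the slope on `edu` is allowed to differ by gender. The sketch below illustrates how to read such a model, using simulated data and a simpler specification than the one above (no age terms); all names and numbers in it are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
g = rng.integers(0, 2, size=n).astype(float)   # hypothetical 0/1 dummy
e = rng.integers(2, 7, size=n).astype(float)   # hypothetical ordinal regressor
y = 1.0 + 2.0 * g + 3.0 * e + 0.5 * g * e + rng.normal(size=n)

def ols(y, *cols):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# y ~ g + e + g:e
b0, b_g, b_e, b_ge = ols(y, g, e, g * e)

slope_g0 = b_e            # effect of e when g == 0
slope_g1 = b_e + b_ge     # effect of e when g == 1

# With only g, e, and g:e in the model, these match separate regressions by group
slope_sub0 = ols(y[g == 0], e[g == 0])[1]
slope_sub1 = ols(y[g == 1], e[g == 1])[1]
print(slope_g0, slope_sub0)
print(slope_g1, slope_sub1)
```

Once other regressors that are not interacted with the dummy are added, as in the model above, the match with separate subgroup regressions is no longer exact, but the reading of the interaction coefficient as a difference in slopes is the same.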

In [11]:
# Convert to categorical
df['gender'] = df['gender'].astype('category')
df['edu'] = df['edu'].astype('category')
df['region2'] = df['region2'].astype('category')

model = smf.ols('score ~ age + I(age**2) + gender*region2', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.110
Model:                            OLS   Adj. R-squared:                  0.082
Method:                 Least Squares   F-statistic:                     3.937
Date:                Wed, 07 May 2025   Prob (F-statistic):           9.65e-05
Time:                        07:19:50   Log-Likelihood:                -1290.4
No. Observations:                 297   AIC:                             2601.
Df Residuals:                     287   BIC:                             2638.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept           

To change the reference group of a categorical variable, use the `Treatment` option.

In [12]:
model = smf.ols(
    'score ~ C(gender) * C(region2, Treatment(reference=4))',
    data=df
).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.018
Method:                 Least Squares   F-statistic:                     1.787
Date:                Wed, 07 May 2025   Prob (F-statistic):             0.0895
Time:                        07:19:52   Log-Likelihood:                -1301.4
No. Observations:                 297   AIC:                             2619.
Df Residuals:                     289   BIC:                             2648.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                                               coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
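
Changing the reference group only changes which dummy column is left out of the regression; the remaining coefficients are then measured relative to that group. A minimal sketch of this idea with `pd.get_dummies`, using a toy `region2` series that is an assumption for illustration:

```python
import pandas as pd

region2 = pd.Series([1, 2, 3, 4, 2, 4], name="region2")   # toy data
dummies = pd.get_dummies(region2, prefix="region2")

# Default treatment coding: the first level (1) is the reference and is dropped
default_ref = dummies.drop(columns="region2_1")

# Reference group 4: drop the dummy for level 4 instead
ref4 = dummies.drop(columns="region2_4")

print(list(default_ref.columns))
print(list(ref4.columns))
```

This only mimics the coding by hand; in the cell above the formula interface builds the corresponding columns itself.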

Following 황선호 and 김성희 (2024), let us run a regression on the economic literacy score.

In [13]:
df = df.rename(columns={'A4_A': 'channel', 'A5_A': 'econedu'})  # rename columns

In [14]:
# Merge response options 6-8 into 4
df['channel'] = df['channel'].replace(7, 4)
df['channel'] = df['channel'].replace(8, 4)
df['channel'] = df['channel'].replace(6, 4)
df['channel'].value_counts().sort_index()

# Reduce the number of income categories
df['income'] = df['income'].replace(7, 6)
df['income'] = df['income'].replace(1, 2)

# Reduce the number of job categories
df['job'] = df['job'].replace(4, 3)
df['job'] = df['job'].replace(6, 5)
df['job'] = df['job'].replace(8, 7)
df['job'] = df['job'].replace(9, 7)

# Dummy for out-of-school economic education experience
df['econedu'] = df['econedu'].replace(2, 0)
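
Several recodes of the same column can also be written as one `replace` call with a dictionary. The sketch below checks on a toy series (an assumption for illustration, not the survey data) that the one-call form gives the same result as the step-by-step form used for `channel`.

```python
import pandas as pd

channel = pd.Series([1, 6, 7, 8, 4, 3])   # toy data

# Step-by-step recode, as in the cell above
step_by_step = channel.replace(7, 4).replace(8, 4).replace(6, 4)

# Same recode in a single call with a mapping
one_call = channel.replace({6: 4, 7: 4, 8: 4})

print(one_call.tolist())
print(step_by_step.equals(one_call))
```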


In [15]:
# Cross-tabulation of channel by age group
import pandas as pd

# margins=True adds row and column totals ("All"); use normalize= to see proportions
pd.crosstab(df['channel'], df['ssq2'], margins=True, rownames=['Channel'], colnames=['Agegroup'])

Agegroup   2   3   4   5   6   7  All
Channel
1         10   6  22  31  46  22  137
2          1   0   3   3   3   1   11
3         35  24  26  16   9   2  112
4          8   6   8   4   3   2   31
5          1   0   3   1   1   0    6
All       55  36  62  55  62  27  297
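
To see shares rather than counts, `pd.crosstab` accepts a `normalize` argument (`'index'`, `'columns'`, or `'all'`). A minimal sketch on a toy frame; the data and the names `toy`, `counts`, and `shares` are assumptions for illustration.

```python
import pandas as pd

# Toy data, not the survey
toy = pd.DataFrame({
    "channel": [1, 1, 3, 3, 3, 4],
    "agegroup": [2, 3, 2, 2, 3, 3],
})

# Counts with row and column totals
counts = pd.crosstab(toy["channel"], toy["agegroup"], margins=True)

# Column shares: within each age group, the fraction using each channel
shares = pd.crosstab(toy["channel"], toy["agegroup"], normalize="columns")

print(counts)
print(shares.round(2))
```

With `normalize="columns"` each column sums to 1, which is the natural view for comparing channel use across age groups.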


In [16]:
import statsmodels.formula.api as smf
model = smf.ols('score ~ C(channel) + age + I(age**2) + gender+ edu + econedu+ C(income)+C(job, Treatment(reference=7))+C(region2)', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.175
Model:                            OLS   Adj. R-squared:                  0.115
Method:                 Least Squares   F-statistic:                     2.926
Date:                Wed, 07 May 2025   Prob (F-statistic):           4.45e-05
Time:                        07:19:59   Log-Likelihood:                -1279.1
No. Observations:                 297   AIC:                             2600.
Df Residuals:                     276   BIC:                             2678.
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
In

Now add the interaction between the knowledge acquisition channel and gender.

In [17]:
import statsmodels.formula.api as smf
model = smf.ols('score ~ C(channel)*gender + age + I(age**2) + gender+ edu + econedu+ C(income)+C(job, Treatment(reference=7))+C(region2)', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.178
Model:                            OLS   Adj. R-squared:                  0.106
Method:                 Least Squares   F-statistic:                     2.461
Date:                Wed, 07 May 2025   Prob (F-statistic):           0.000263
Time:                        07:20:03   Log-Likelihood:                -1278.5
No. Observations:                 297   AIC:                             2607.
Df Residuals:                     272   BIC:                             2699.
Df Model:                          24                                         
Covariance Type:            nonrobust                                         
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
In

# Preview of the next class: when the dependent variable is a dummy

In [18]:
import statsmodels.formula.api as smf
model = smf.ols('econedu ~ age + I(age**2) + gender+ edu + C(income)+C(job, Treatment(reference=7))+C(region2)', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                econedu   R-squared:                       0.050
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9766
Date:                Wed, 07 May 2025   Prob (F-statistic):              0.480
Time:                        07:20:05   Log-Likelihood:                 146.21
No. Observations:                 297   AIC:                            -260.4
Df Residuals:                     281   BIC:                            -201.3
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
In
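
Running OLS on a 0/1 dependent variable, as in the cell above, is commonly called a linear probability model: fitted values are read as predicted probabilities, although nothing keeps them inside [0, 1]. The sketch below shows this on simulated data; the data-generating process and all names are assumptions for illustration and say nothing about the `econedu` results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated binary outcome whose probability rises with x (hypothetical)
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-2 * x))
y = (rng.uniform(size=n) < p).astype(float)

# OLS of the dummy on a constant and x
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

print(beta)                                   # intercept and slope
print(fitted.mean(), y.mean())                # equal, because the model has an intercept
print(((fitted < 0) | (fitted > 1)).sum())    # fitted "probabilities" outside [0, 1]
```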