<a href="https://colab.research.google.com/github/seongheek/econtheory/blob/main/8%EC%A3%BC%EC%B0%A8_%EA%B0%95%EC%9D%98_%ED%9A%8C%EA%B7%80%EB%B6%84%EC%84%9D%ED%95%98%EA%B8%B0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Regression Analysis**
Let's run a multiple regression analysis on the economic literacy data.

In [1]:
# Load and preprocess the data

import pandas as pd
import numpy as np

df = pd.read_excel('econliteracy.xlsx', index_col=0)

df = df.rename(columns={'sq1': 'gender', 'sq2': 'age', 'sq3': 'region', 'sq4': 'job', 'sq5': 'edu', 'sq6': 'income'})  # rename columns

df = df.dropna(subset=['gender'])        # drop samples with missing gender

df['income'] = df['income'].replace(8, 7)     # recode income code 8 as 7
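
The three preprocessing steps can be tried without the Excel file. The following is a minimal sketch on a small made-up frame (the values are invented for illustration and are not taken from `econliteracy.xlsx`); it only shows what each step does to the rows and columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the survey data)
toy = pd.DataFrame(
    {"sq1": [1, 2, np.nan, 2], "sq6": [3, 8, 5, 7]},
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

toy = toy.rename(columns={"sq1": "gender", "sq6": "income"})  # readable names
toy = toy.dropna(subset=["gender"])                            # drops the row with ID 3
toy["income"] = toy["income"].replace(8, 7)                    # fold code 8 into code 7

print(toy)
```

Note that `dropna(subset=[...])` only looks at the listed columns, so rows with gaps in other columns are kept.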

Before going further, let's check the job, education, and region variables for anything unusual.

In [2]:
df['job'].value_counts().sort_index()

job,count
1,10
2,47
3,55
4,51
5,70
6,3
7,40
8,11
9,10


Job category 6 (managerial/professional) has very few respondents, so let's merge it into the similar category 5 (clerical).

In [3]:
df['job'] = df['job'].replace(6, 5)
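
Spotting sparse categories by eye works for one variable, but it can also be done programmatically. The sketch below uses made-up job codes and an arbitrary cutoff (`min_count` is an assumption chosen for the illustration, not a rule from the lecture) to list categories that fall below the cutoff before merging them.

```python
import pandas as pd

# Toy job codes (hypothetical); code 6 is rare on purpose
job = pd.Series([5, 5, 5, 6, 3, 3, 3, 5], name="job")

min_count = 2  # assumed cutoff, chosen only for this illustration
counts = job.value_counts().sort_index()
sparse = counts[counts < min_count].index.tolist()
print(sparse)

# Merge the sparse code 6 into the similar code 5, as in the cell above
job = job.replace(6, 5)
print(job.value_counts().sort_index())
```

Which category to merge into remains a substantive judgment; the code only identifies candidates.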

In [4]:
df['edu'].value_counts().sort_index()

edu,count
1,24
2,134
3,139


In [5]:
df['edu'] = df['edu'].replace(1, 2)   # merge the small category 1 (24 respondents) into category 2

Let's check region as well.

In [6]:
df['region'].value_counts().sort_index()

region,count
1,54
2,23
3,15
4,13
5,10
6,8
7,9
8,4
9,83
10,10


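There are too many region categories for the current sample size, so let's group them into broad regions. We will create four categories: Capital area (1), Central (2), Yeongnam (3), and Honam (4). The replace command would also work, but here we create a new variable for the broad region.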

In [7]:
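df['region2'] = df['region'].apply(
    lambda x: 1 if x in [1, 4, 9]                 # Capital area
    else 2 if x in [6, 7, 8, 10, 11, 12]          # Central
    else 3 if x in [2, 3, 15, 16]                 # Yeongnam; code 7 is already assigned to group 2 above
    else 4 if x in [5, 13, 14]                    # Honam
    else None                                     # any other code is left missing
)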
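
A chained conditional is easy to get wrong when a code appears in two lists. As an alternative sketch (not the approach used in this notebook), the same grouping can be written as a dictionary lookup with `Series.map`, where each region code can appear only once. The group lists are copied from the cell above; the sample codes, including the unmapped code 17, are made up.

```python
import pandas as pd

# Broad-region groups, copied from the apply() cell above
# (code 7 falls in group 2 because that branch is checked first)
groups = {
    1: [1, 4, 9],
    2: [6, 7, 8, 10, 11, 12],
    3: [2, 3, 15, 16],
    4: [5, 13, 14],
}
# Invert to a flat lookup: region code -> broad-region code
lookup = {code: g for g, codes in groups.items() for code in codes}

region = pd.Series([1, 7, 2, 13, 17])   # 17 is a made-up, unmapped code
region2 = region.map(lookup)            # unmapped codes become NaN
print(region2.tolist())
```

Because unmapped codes turn into `NaN`, a quick `region2.isna().sum()` afterwards shows whether any code was overlooked.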

In [8]:
df['region2'].value_counts().sort_index()

region2,count
1,150
2,57
3,69
4,21


Let's compute descriptive statistics for the main variables.

In [9]:
df[['score', 'gender', 'age', 'job', 'income', 'edu', 'region2']].describe()

stat,score,gender,age,job,income,edu,region2
count,297.0,297.0,297.0,297.0,297.0,297.0,297.0
mean,57.441077,1.461279,48.787879,4.363636,3.538721,2.468013,1.868687
std,19.800971,0.49934,15.470944,1.97341,1.643789,0.499818,1.003169
min,15.0,1.0,18.0,1.0,1.0,2.0,1.0
25%,45.0,1.0,36.0,3.0,2.0,2.0,1.0
50%,60.0,1.0,48.0,4.0,4.0,2.0,1.0
75%,70.0,2.0,62.0,5.0,5.0,3.0,3.0
max,100.0,2.0,79.0,9.0,7.0,3.0,4.0


For categorical variables, the mean and similar statistics are not meaningful, so it is better to adapt the value_counts command to report proportions instead.

In [10]:
df['region2'].value_counts(normalize=True).sort_index()

region2,proportion
1,0.505051
2,0.191919
3,0.232323
4,0.070707


In [11]:
(df['region2'].value_counts(normalize=True) * 100).round(2).sort_index()  # convert to percentages, rounded to 2 decimal places

region2,proportion
1,50.51
2,19.19
3,23.23
4,7.07
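
When several categorical variables need the same treatment, the percentage calculation can be wrapped in a small function. This is a minimal sketch on made-up data; the helper name `pct_table` is hypothetical and not part of pandas.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "gender": [1, 2, 2, 1],
    "region2": [1, 1, 3, 4],
})

def pct_table(s: pd.Series) -> pd.Series:
    """Share of each category in percent, rounded to 2 decimals."""
    return (s.value_counts(normalize=True) * 100).round(2).sort_index()

for col in ["gender", "region2"]:
    print(pct_table(toy[col]), end="\n\n")
```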


Gender is a dummy variable. Since we will use a male dummy in the analysis, recode the value for women from 2 to 0.

In [12]:
df['gender'] = df['gender'].replace(2, 0)
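
The recode turns gender into a 0/1 male dummy. The sketch below, on made-up codes, shows the `replace` approach next to an alternative that builds the dummy from a comparison; both give the same result when only the codes 1 and 2 occur.

```python
import pandas as pd

# Toy gender codes in the original scheme: 1 = male, 2 = female
gender = pd.Series([1.0, 2.0, 2.0, 1.0])

male_a = gender.replace(2, 0)            # same recode as the cell above
male_b = (gender == 1).astype(int)       # alternative: build the dummy directly

print(male_a.tolist(), male_b.tolist())
```

The two differ for unexpected input: the comparison maps every value other than 1 (including missing values) to 0, whereas `replace` leaves such values unchanged.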

In [13]:
df

ID,q1,a1,b1,q2,a2,b2,q3,a3,b3,q4,...,A9_A,A9e,gender,age,ssq2,region,job,edu,income,region2
1,3,3,1,1,1,1,3,2,0,4,...,2,,1.0,56,5,2,4,2,6,3
3,3,3,1,1,1,1,2,2,1,4,...,4,,1.0,36,3,1,5,3,4,1
4,2,3,0,3,1,0,1,2,0,3,...,2,,0.0,62,6,1,7,3,1,1
5,3,3,1,1,1,1,2,2,1,2,...,6,,0.0,33,3,7,3,3,2,2
6,1,3,0,1,1,1,2,2,1,4,...,1,,0.0,63,6,9,3,2,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,1,3,0,1,1,1,2,2,1,4,...,4,,0.0,38,3,13,7,3,1,4
297,3,3,1,2,1,0,3,2,0,1,...,2,,1.0,54,5,1,3,2,7,1
298,3,3,1,1,1,1,2,2,1,3,...,1,,1.0,74,7,9,2,3,3,1
299,2,3,0,2,1,0,2,2,1,1,...,4,,1.0,27,2,1,5,3,7,1


Now let's run the regression. We use the statsmodels package.

In [14]:
import statsmodels.formula.api as smf

model = smf.ols('score ~ age', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.036
Model:                            OLS   Adj. R-squared:                  0.033
Method:                 Least Squares   F-statistic:                     11.04
Date:                Wed, 30 Apr 2025   Prob (F-statistic):            0.00100
Time:                        06:26:37   Log-Likelihood:                -1302.2
No. Observations:                 297   AIC:                             2608.
Df Residuals:                     295   BIC:                             2616.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     69.3001      3.744     18.510      0.0
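
To see what `score ~ age` estimates, the simple-regression coefficients can be computed by hand. The sketch below uses synthetic data (the slope of -0.25 and the noise level are arbitrary choices, not estimates from the survey) and checks the closed-form formulas, slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x), against a least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the survey): score falls slightly with age, plus noise
age = rng.uniform(18, 80, size=200)
score = 70 - 0.25 * age + rng.normal(0, 15, size=200)

# Closed-form simple-regression estimates
slope = np.cov(age, score, ddof=1)[0, 1] / np.var(age, ddof=1)
intercept = score.mean() - slope * age.mean()

# Cross-check against a least-squares solve on the design matrix [1, age]
X = np.column_stack([np.ones_like(age), age])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

print(intercept, slope)
print(beta)
```

The intercept formula also implies that the fitted line passes through the point of sample means.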

Now let's add control variables, starting with gender.

In [15]:
model = smf.ols('score ~ age + gender', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.048
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     7.389
Date:                Wed, 30 Apr 2025   Prob (F-statistic):           0.000739
Time:                        06:26:55   Log-Likelihood:                -1300.4
No. Observations:                 297   AIC:                             2607.
Df Residuals:                     294   BIC:                             2618.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     67.2014      3.886     17.292      0.0

For a categorical variable, wrap it in C().

In [16]:
model = smf.ols('score ~ age + gender+C(region2)', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.071
Model:                            OLS   Adj. R-squared:                  0.055
Method:                 Least Squares   F-statistic:                     4.424
Date:                Wed, 30 Apr 2025   Prob (F-statistic):           0.000672
Time:                        06:27:05   Log-Likelihood:                -1296.8
No. Observations:                 297   AIC:                             2606.
Df Residuals:                     291   BIC:                             2628.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          69.6286      3.971     
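
In a formula, `C(region2)` is expanded by default into 0/1 dummy columns, one per category except the first, which serves as the reference group. That is why this model reports Df Model = 5: age, gender, and three region dummies. The sketch below reproduces the same coding with `pd.get_dummies` on made-up rows; it illustrates the idea and is not how statsmodels builds the matrix internally.

```python
import pandas as pd

# Toy broad-region codes (hypothetical rows)
toy = pd.DataFrame({"region2": [1, 2, 3, 4, 1]})

# Treatment coding: one 0/1 column per category except the first (the reference)
dummies = pd.get_dummies(toy["region2"], prefix="region2", drop_first=True, dtype=int)
print(dummies)
```

Rows belonging to the reference category have zeros in every dummy column, so each dummy coefficient measures a difference from that category.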

In [18]:
model = smf.ols('score ~ age + gender+C(edu)+C(income)+C(job)+ C(region2)', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.151
Model:                            OLS   Adj. R-squared:                  0.092
Method:                 Least Squares   F-statistic:                     2.583
Date:                Wed, 30 Apr 2025   Prob (F-statistic):           0.000415
Time:                        06:27:30   Log-Likelihood:                -1283.5
No. Observations:                 297   AIC:                             2607.
Df Residuals:                     277   BIC:                             2681.
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          59.0718     11.541     
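
The header of the summary can be checked by hand. The sketch below counts the regressors of the full model and recomputes the adjusted R-squared from the printed (rounded) R-squared values. The helper `adjusted_r2` is a hypothetical name, and the count assumes that all seven income codes occur in the sample, which is consistent with the reported Df Model of 19.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k regressors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 297

# Regressors in the full model: age, gender, and the dummies for each C() term.
# Category counts after the merges above: edu 2, income 7, job 8, region2 4.
k = 2 + (2 - 1) + (7 - 1) + (8 - 1) + (4 - 1)
print(k, n - k - 1)   # Df Model and Df Residuals

# R-squared values are the rounded figures printed in the summaries
print(round(adjusted_r2(0.036, n, 1), 3))   # model with age only
print(round(adjusted_r2(0.151, n, k), 3))   # full model
```

Because the inputs are rounded to three decimals, the recomputed values can differ from the printed ones in the last digit. The comparison also shows why R-squared rises from 0.036 to 0.151 while the adjusted value rises much less: the 18 extra regressors are penalized.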

To change the reference group of a categorical variable, use the Treatment option.

In [20]:
model = smf.ols('score ~ age + gender+C(edu)+C(income)+C(job)+ C(region2, Treatment(reference=4))', data=df).fit()  # use Honam as the reference group
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.151
Model:                            OLS   Adj. R-squared:                  0.092
Method:                 Least Squares   F-statistic:                     2.583
Date:                Wed, 30 Apr 2025   Prob (F-statistic):           0.000415
Time:                        06:27:43   Log-Likelihood:                -1283.5
No. Observations:                 297   AIC:                             2607.
Df Residuals:                     277   BIC:                             2681.
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
                                              coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
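
Changing the reference group changes how the coefficients are expressed, not the fit itself, which is why R-squared stays at 0.151 in both summaries. The sketch below illustrates this on synthetic data (group means and noise are arbitrary, not survey estimates) with a model that has only the categorical regressor, so that each intercept equals the mean of its reference group.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic data (not the survey): four groups with different mean scores
toy = pd.DataFrame({"region2": np.tile([1, 2, 3, 4], 30)})
toy["score"] = 50 + 3 * toy["region2"] + rng.normal(0, 10, size=len(toy))

m1 = smf.ols("score ~ C(region2)", data=toy).fit()                          # reference = 1
m4 = smf.ols("score ~ C(region2, Treatment(reference=4))", data=toy).fit()  # reference = 4

# Same fit, different parameterization
print(m1.rsquared, m4.rsquared)
# The intercept is the mean of the reference group in each model
print(m1.params["Intercept"], m4.params["Intercept"])
```

With additional regressors such as age, the intercept no longer equals a raw group mean, but the dummy coefficients are still differences relative to the chosen reference group.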

Following 황선호, 김성희 (2024), let's break the economic literacy score down into sub-scores. For example, let's create a finance score from items b12 to b16.

In [21]:
df['finance'] = df[['b12', 'b13', 'b14', 'b15', 'b16']].sum(axis=1)
df['finance']

ID,finance
1,5
3,4
4,2
5,0
6,2
...,...
296,5
297,3
298,3
299,3


In [22]:
df['finance'] = df['finance'] * 4   # 4 points per item: range 0 to 20 (a 100-point scale would need * 20)
df['finance']

ID,finance
1,20
3,16
4,8
5,0
6,8
...,...
296,20
297,12
298,12
299,12
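
With five 0/1 items, multiplying the sum by 4 gives a range of 0 to 20, while a 0 to 100 scale requires multiplying by 20. A generic way to rescale any sub-score is to divide by the number of items. The sketch below uses made-up item scores, and the helper name `subscore` is hypothetical.

```python
import pandas as pd

# Toy item scores (hypothetical): 1 = correct, 0 = incorrect
toy = pd.DataFrame({
    "b12": [1, 1, 0],
    "b13": [1, 0, 0],
    "b14": [1, 1, 0],
    "b15": [1, 1, 0],
    "b16": [1, 1, 0],
})

def subscore(frame: pd.DataFrame, items: list[str], scale: float = 100) -> pd.Series:
    """Share of correct items, rescaled so that all-correct equals `scale`."""
    return frame[items].sum(axis=1) * scale / len(items)

finance_items = [f"b{i}" for i in range(12, 17)]   # b12 ... b16
print(subscore(toy, finance_items).tolist())
```

Rescaling the dependent variable by a constant multiplies every coefficient and standard error by that constant, so t-statistics and R-squared are unaffected by the choice of scale.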


In [23]:
model = smf.ols('finance ~ age + gender+C(region2)', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                finance   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.035
Method:                 Least Squares   F-statistic:                     3.142
Date:                Wed, 30 Apr 2025   Prob (F-statistic):            0.00886
Time:                        06:28:06   Log-Likelihood:                -887.70
No. Observations:                 297   AIC:                             1787.
Df Residuals:                     291   BIC:                             1810.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          15.1606      1.002     

# **Exercise**


*   Choose one of the economic literacy sub-categories from the paper as the dependent variable, run a regression with control variables, and interpret each variable.
https://padlet.com/nathalieskim/padlet-5q0zb66s5dv95fna

