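Logistic regression is a probabilistic model proposed in 1958 [1] by the British statistician D. R. Cox. It is a statistical technique that uses a linear combination of independent variables to predict the likelihood that an event occurs.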

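The goal of logistic regression is the same as that of regression analysis in general: to express the relationship between the dependent variable and the independent variables as a concrete function that can later be used for prediction. It resembles linear regression in that it explains the dependent variable through a linear combination of the independent variables. Unlike linear regression, however, logistic regression targets a categorical dependent variable. Because the outcome for a given input falls into a specific class, it can also be viewed as a classification technique.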

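The term logistic regression usually refers to the binomial case, in which the dependent variable has exactly two valid categories. When the dependent variable has more than two categories, the model is called multinomial logistic regression or polytomous logistic regression, and when those categories are ordered it is called ordinal logistic regression.[2] Logistic regression is widely used as a classification and prediction model in fields such as medicine, telecommunications, and data mining.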
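A minimal sketch of the idea described above: a linear combination of the independent variables is passed through the logistic function to give a probability. The function names and the coefficient values are illustrative assumptions and are not taken from the analysis below.

```python
import math

def logistic(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(intercept, coefs, xs):
    """Probability of the event for one observation (hypothetical helper)."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))  # linear combination
    return logistic(z)

# Made-up coefficients: z = -1.0 + 0.5 * 2.0 + 0.25 * 0.0 = 0.0
p = predict_proba(-1.0, [0.5, 0.25], [2.0, 0.0])
```

A score of 0 corresponds to even odds, so `p` is 0.5 here; positive scores push the probability toward 1 and negative scores toward 0.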

In [1]:
import pandas as pd

In [2]:
bank = pd.read_csv("bank.csv", sep = ";")
bank.head()

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [3]:
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      4521 non-null   int64 
 14  previous   4521 non-null   int64 
 15  poutcome   4521 non-null   object
 16  y          4521 non-null   object
dtypes: int64(7), object(10)
memory usage: 600.6+ KB


# Recoding
- marital

In [4]:
bank['marital'].value_counts()

married     2797
single      1196
divorced     528
Name: marital, dtype: int64

In [5]:
bank['marital_G'] = bank['marital'].replace(['single', 'married', 'divorced'],
                                           [1, 2, 3])
bank['marital_G'].head()

0    2
1    2
2    1
3    2
4    2
Name: marital_G, dtype: int64

In [6]:
bank['marital_G'].value_counts()

2    2797
1    1196
3     528
Name: marital_G, dtype: int64

- education

In [7]:
bank['education_G'] = bank['education'].replace(['primary', 'secondary', 'tertiary','unknown'],
                                              [1, 2, 3, None])
bank['education_G'].head()

0    1.0
1    2.0
2    3.0
3    3.0
4    2.0
Name: education_G, dtype: float64

In [8]:
# housing
bank['housing_G'] = bank['housing'].replace(["yes", "no"],[1, 0])

# loan
bank['loan_G'] = bank['loan'].replace(["yes", "no"],[1, 0])

# default
bank['default_G'] = bank['default'].replace(["yes", "no"],[1, 0])

# contact
bank['contact_G'] = bank['contact'].replace(["cellular", "telephone", "unknown"],[1, 2, None])

# poutcome
bank['poutcome_G'] = bank['poutcome'].replace(["failure", "success", "unknown", "other"],[1, 2, None, None])

# y
bank['y_G'] = bank['y'].replace(["yes", "no"],[1, 0])
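The recoding above pairs two parallel lists. As an alternative sketch, `Series.map` with a dictionary keeps each label next to its code, and any label left out of the dictionary becomes NaN. The `toy` frame and the dictionary names are assumptions made for this illustration; they are not part of the bank data.

```python
import pandas as pd

# Small stand-in frame (hypothetical data)
toy = pd.DataFrame({"housing": ["yes", "no", "yes"],
                    "contact": ["cellular", "unknown", "telephone"]})

yes_no = {"yes": 1, "no": 0}
contact_codes = {"cellular": 1, "telephone": 2}  # "unknown" deliberately left out

toy["housing_G"] = toy["housing"].map(yes_no)
toy["contact_G"] = toy["contact"].map(contact_codes)  # unmapped labels become NaN
```

Because of the NaN, `contact_G` comes out as float64, which matches the dtype seen for the recoded columns in this notebook.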

- Building the dataset

In [9]:
bank.head()

,age,job,marital,education,default,balance,housing,loan,contact,day,...,poutcome,y,marital_G,education_G,housing_G,loan_G,default_G,contact_G,poutcome_G,y_G
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,...,unknown,no,2,1.0,0,0,0,1.0,,0
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,...,failure,no,2,2.0,1,1,0,1.0,1.0,0
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,...,failure,no,1,3.0,1,0,0,1.0,1.0,0
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,...,unknown,no,2,3.0,1,1,0,,,0
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,...,unknown,no,2,2.0,1,0,0,,,0


In [40]:
bank_lr = bank[["y_G", "age", "duration", "pdays", "marital_G", "loan_G", "contact_G"]]
bank_lr.head()

,y_G,age,duration,pdays,marital_G,loan_G,contact_G
0,0,30,79,-1,2,0,1.0
1,0,33,220,339,2,1,1.0
2,0,35,185,330,1,0,1.0
3,0,30,199,-1,2,1,
4,0,59,226,-1,2,0,


In [41]:
bank_lr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   y_G        4521 non-null   int64  
 1   age        4521 non-null   int64  
 2   duration   4521 non-null   int64  
 3   pdays      4521 non-null   int64  
 4   marital_G  4521 non-null   int64  
 5   loan_G     4521 non-null   int64  
 6   contact_G  3197 non-null   float64
dtypes: float64(1), int64(6)
memory usage: 247.4 KB


In [42]:
bank_lr.isnull().sum()

y_G             0
age             0
duration        0
pdays           0
marital_G       0
loan_G          0
contact_G    1324
dtype: int64
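Before dropping rows, it can help to see what share of the data is affected. The sketch below uses a small made-up frame (`toy` is an assumption) and then applies the same arithmetic to the counts printed above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same kind of gap as contact_G
toy = pd.DataFrame({"age": [30, 33, 35, 30],
                    "contact_G": [1.0, np.nan, 2.0, np.nan]})

missing_share = toy.isnull().mean()          # fraction of missing values per column
kept_share = len(toy.dropna()) / len(toy)    # fraction of rows that survive dropna()

# With the counts printed above: 1324 of 4521 rows lack contact_G
bank_missing_share = 1324 / 4521
```

For the bank data this comes to roughly 29% of the rows, so removing missing values shrinks the sample considerably.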

In [43]:
bank_lr_na = bank_lr.dropna() # drop rows with missing values

In [45]:
bank_lr_na.head()

,y_G,age,duration,pdays,marital_G,loan_G,contact_G
0,0,30,79,-1,2,0,1.0
1,0,33,220,339,2,1,1.0
2,0,35,185,330,1,0,1.0
5,0,35,141,176,1,0,1.0
6,0,36,341,330,2,0,1.0


In [14]:
bank_lr_na.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3197 entries, 0 to 4520
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   y_G        3197 non-null   int64  
 1   age        3197 non-null   int64  
 2   duration   3197 non-null   int64  
 3   pdays      3197 non-null   int64  
 4   marital_G  3197 non-null   int64  
 5   loan_G     3197 non-null   int64  
 6   contact_G  3197 non-null   float64
dtypes: float64(1), int64(6)
memory usage: 199.8 KB


In [15]:
bank_lr_na = bank_lr_na.reset_index()
bank_lr_na.head()

,index,y_G,age,duration,pdays,marital_G,loan_G,contact_G
0,0,0,30,79,-1,2,0,1.0
1,1,0,33,220,339,2,1,1.0
2,2,0,35,185,330,1,0,1.0
3,5,0,35,141,176,1,0,1.0
4,6,0,36,341,330,2,0,1.0


In [16]:
bank_lr_na = bank_lr_na.drop(["index"], axis = 1)
bank_lr_na.head()

,y_G,age,duration,pdays,marital_G,loan_G,contact_G
0,0,30,79,-1,2,0,1.0
1,0,33,220,339,2,1,1.0
2,0,35,185,330,1,0,1.0
3,0,35,141,176,1,0,1.0
4,0,36,341,330,2,0,1.0
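The two steps above (reset the index, then drop the leftover `index` column) can be done in one call with `reset_index(drop=True)`. The `toy` frame is a made-up stand-in for a frame whose index has gaps after `dropna()`.

```python
import pandas as pd

# Index with gaps, as left behind by dropna() (hypothetical data)
toy = pd.DataFrame({"y_G": [0, 0, 1]}, index=[0, 5, 6])

# Renumber rows 0..n-1 without keeping the old index as a column
toy = toy.reset_index(drop=True)
```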


In [17]:
bank_lr_na["contact_G"] = bank_lr_na["contact_G"].astype("int") # convert to int
bank_lr_na

,y_G,age,duration,pdays,marital_G,loan_G,contact_G
0,0,30,79,-1,2,0,1
1,0,33,220,339,2,1,1
2,0,35,185,330,1,0,1
3,0,35,141,176,1,0,1
4,0,36,341,330,2,0,1
...,...,...,...,...,...,...,...
3192,0,32,624,-1,1,0,1
3193,0,33,329,-1,2,0,1
3194,0,57,151,-1,2,0,1
3195,0,28,129,211,2,0,1


- Logistic regression model

In [18]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [19]:
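# Wrap the recoded categorical variables in C() so they are treated as categorical
logit_model = smf.logit("y_G ~ age + duration + pdays + C(marital_G) + C(loan_G) + C(contact_G)",
                       data = bank_lr_na).fit()
# "Current function value" is the mean negative log-likelihood per observation; the closer it is to 0, the better the fit.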

Optimization terminated successfully.
         Current function value: 0.340131
         Iterations 7


In [20]:
print(logit_model.summary2())

                          Results: Logit
Model:               Logit            Pseudo R-squared: 0.174     
Dependent Variable:  y_G              AIC:              2190.8001 
Date:                2021-02-18 17:27 BIC:              2239.3599 
No. Observations:    3197             Log-Likelihood:   -1087.4   
Df Model:            7                LL-Null:          -1317.0   
Df Residuals:        3189             LLR p-value:      4.6240e-95
Converged:           1.0000           Scale:            1.0000    
No. Iterations:      7.0000                                       
------------------------------------------------------------------
                   Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
------------------------------------------------------------------
Intercept         -3.4404   0.2281 -15.0798 0.0000 -3.8876 -2.9932
C(marital_G)[T.2] -0.3806   0.1392  -2.7332 0.0063 -0.6535 -0.1077
C(marital_G)[T.3] -0.0581   0.1969  -0.2951 0.7679 -0.4439  0.3277
C(loan_G)[T.1]    -0.

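- Interpretation: the coefficients of age, duration, and pdays are positive, while those of marital_G (2 = married, 3 = divorced), loan_G (1 = yes), and contact_G (2 = telephone) are negative. Not every term is statistically significant, though: in the summary above, C(marital_G)[T.2] is significant (p = 0.0063) but C(marital_G)[T.3] is not (p = 0.7679).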
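The fit statistics in the summary header can be cross-checked by hand. The sketch below copies the rounded values from the summary output and assumes the usual definitions: McFadden's pseudo R-squared, AIC = 2k - 2LL, and BIC = k ln(n) - 2LL, with k counting the intercept.

```python
import math

n_obs = 3197        # No. Observations
llf = -1087.4       # Log-Likelihood (rounded in the summary)
ll_null = -1317.0   # LL-Null (rounded in the summary)
n_params = 8        # Df Model (7) plus the intercept

mean_neg_loglik = -llf / n_obs                # matches "Current function value"
pseudo_r2 = 1 - llf / ll_null                 # McFadden's pseudo R-squared
aic = 2 * n_params - 2 * llf
bic = n_params * math.log(n_obs) - 2 * llf
```

Up to rounding, these reproduce 0.340131, 0.174, 2190.80, and 2239.36 from the output.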

=============================================================

In [21]:
import numpy as np

np.exp(logit_model.params) # exponentiated coefficients are odds ratios: the factor by which the odds of y = yes change, not a probability

Intercept            0.032052
C(marital_G)[T.2]    0.683454
C(marital_G)[T.3]    0.943562
C(loan_G)[T.1]       0.381319
C(contact_G)[T.2]    0.972695
age                  1.016632
duration             1.003636
pdays                1.001779
dtype: float64

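- Interpretation: for age, exp(0.0165) = 1.0166 and the coefficient is significant, so each one-unit increase in age multiplies the odds of y = yes (the ratio of the probability of yes to the probability of no) by about 1.02.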
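A short worked example of the odds-ratio arithmetic, using the age coefficient quoted above. The ten-year case is an added illustration and was not computed in the original notebook.

```python
import math

coef_age = 0.0165                          # change in log-odds per year of age
or_one_year = math.exp(coef_age)           # about 1.0166
or_ten_years = math.exp(10 * coef_age)     # odds ratios multiply across units

# Equivalent way to get the ten-year effect
or_ten_years_alt = or_one_year ** 10
```

Ten extra years therefore multiply the odds by roughly 1.18, not by 10 × 1.0166, because the effect is additive on the log-odds scale.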

====================================================================

- Predicting y (term deposit subscription)

In [47]:
# use predict(); with no arguments it returns the fitted probabilities for the training data
predict = pd.DataFrame({"predict" : logit_model.predict()})
predict.head()

,predict
0,0.045597
1,0.05521
2,0.167266
3,0.115214
4,0.197323
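To see where these probabilities come from, the first prediction can be reproduced by hand. The sketch takes the logarithm of the odds ratios printed earlier to recover the coefficients, then applies the logistic function to row 0 (age = 30, duration = 79, pdays = -1, marital_G = 2, loan_G = 0, contact_G = 1). The dictionary is copied from that output; everything else is plain arithmetic.

```python
import math

# Odds ratios copied from the np.exp(logit_model.params) output
odds_ratios = {
    "Intercept": 0.032052,
    "C(marital_G)[T.2]": 0.683454,
    "C(marital_G)[T.3]": 0.943562,
    "C(loan_G)[T.1]": 0.381319,
    "C(contact_G)[T.2]": 0.972695,
    "age": 1.016632,
    "duration": 1.003636,
    "pdays": 1.001779,
}
coef = {name: math.log(value) for name, value in odds_ratios.items()}

# Row 0 is married (T.2); loan_G = 0 and contact_G = 1 are reference levels,
# so their dummy terms contribute nothing.
z = (coef["Intercept"]
     + coef["C(marital_G)[T.2]"]
     + coef["age"] * 30
     + coef["duration"] * 79
     + coef["pdays"] * -1)
p_row0 = 1 / (1 + math.exp(-z))
```

The result agrees with the first value of `predict` (0.045597) up to the rounding of the printed odds ratios.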


In [23]:
bank_lr_na["predict"] = predict
bank_lr_na.head()

,y_G,age,duration,pdays,marital_G,loan_G,contact_G,predict
0,0,30,79,-1,2,0,1,0.045597
1,0,33,220,339,2,1,1,0.05521
2,0,35,185,330,1,0,1,0.167266
3,0,35,141,176,1,0,1,0.115214
4,0,36,341,330,2,0,1,0.197323


In [24]:
def pre_group(prob):
    # Classify a single predicted probability with a 0.5 cutoff
    if prob < 0.5:
        return 0
    else:
        return 1

In [25]:
bank_lr_na['preGroup'] = bank_lr_na['predict'].apply(pre_group)
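The same cutoff can be applied without a helper function by comparing the whole column at once. This is a sketch on a made-up Series (`probs` is an assumption), not a change to the notebook's own approach.

```python
import pandas as pd

probs = pd.Series([0.045597, 0.5, 0.72])   # hypothetical predicted probabilities

# Same rule as pre_group: below 0.5 -> 0, otherwise 1
labels = (probs >= 0.5).astype(int)
```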

In [26]:
bank_lr_na.head()

,y_G,age,duration,pdays,marital_G,loan_G,contact_G,predict,preGroup
0,0,30,79,-1,2,0,1,0.045597,0
1,0,33,220,339,2,1,1,0.05521,0
2,0,35,185,330,1,0,1,0.167266,0
3,0,35,141,176,1,0,1,0.115214,0
4,0,36,341,330,2,0,1,0.197323,0


In [27]:
pd.crosstab(bank_lr_na.y_G, bank_lr_na.preGroup, margins = True)

preGroup,0,1,All
y_G,,,
0,2684,53,2737
1,374,86,460
All,3058,139,3197


In [28]:
pd.crosstab(bank_lr_na.y_G, bank_lr_na.preGroup, margins = True, normalize = True) # normalize = True shows proportions of the total instead of counts

preGroup,0,1,All
y_G,,,
0,0.839537,0.016578,0.856115
1,0.116985,0.0269,0.143885
All,0.956522,0.043478,1.0
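The cross-tabulation can be turned into a few summary rates. The counts below are copied from the first crosstab; the metric names follow common usage and are not computed in the original notebook.

```python
tn, fp = 2684, 53    # actual 0: predicted 0 / predicted 1
fn, tp = 374, 86     # actual 1: predicted 0 / predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)     # share of actual "yes" cases that were found
specificity = tn / (tn + fp)     # share of actual "no" cases that were found
baseline = (tn + fp) / total     # accuracy of always predicting 0
```

Accuracy is about 0.866, only slightly above the 0.856 obtained by always predicting "no", while sensitivity is about 0.19. With a 0.5 cutoff the model misses most of the actual "yes" cases.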


- Using scikit-learn

In [29]:
from sklearn.linear_model import LogisticRegression

In [30]:
bank_lr_na.head()

,y_G,age,duration,pdays,marital_G,loan_G,contact_G,predict,preGroup
0,0,30,79,-1,2,0,1,0.045597,0
1,0,33,220,339,2,1,1,0.05521,0
2,0,35,185,330,1,0,1,0.167266,0
3,0,35,141,176,1,0,1,0.115214,0
4,0,36,341,330,2,0,1,0.197323,0


In [31]:
# target = dependent variable y
y = bank_lr_na[["y_G"]]
y.head()

,y_G
0,0
1,0
2,0
3,0
4,0


In [32]:
# features = independent variables x
x = bank_lr_na[["age", "duration", "pdays", "marital_G", "loan_G", "contact_G"]]
x.head()

,age,duration,pdays,marital_G,loan_G,contact_G
0,30,79,-1,2,0,1
1,33,220,339,2,1,1
2,35,185,330,1,0,1
3,35,141,176,1,0,1
4,36,341,330,2,0,1


In [33]:
logit = LogisticRegression()

In [34]:
logit.fit(x, y)

  return f(**kwargs)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
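The warning above says the solver hit its iteration limit and suggests either scaling the data or raising `max_iter`. The sketch below does both in a pipeline and passes a 1-D target, which avoids scikit-learn's column-vector warning. It runs on synthetic data (the arrays `X` and `y_1d` are assumptions) so that it is self-contained. With the notebook's data, `x` and `bank_lr_na["y_G"]` would take their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales
X = np.column_stack([rng.normal(40, 10, 500), rng.normal(250, 200, 500)])
signal = 0.01 * (X[:, 1] - 250) + rng.normal(0, 1, 500)
y_1d = (signal > 0).astype(int)   # 1-D target, shape (n_samples,)

# Standardize the features, then fit with a higher iteration cap
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y_1d)
train_accuracy = model.score(X, y_1d)
```

Note that `LogisticRegression` applies L2 regularization by default and that `marital_G` enters here as a plain number rather than as a categorical term, so the fitted coefficients are not expected to match the statsmodels results exactly.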

In [35]:
y_predict = logit.predict(x)
y_predict

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

- Checking prediction accuracy with confusion_matrix() and accuracy_score()

In [36]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [37]:
confusion_matrix(y, y_predict)  # arguments are (y_true, y_pred): rows = actual, columns = predicted

array([[2686,   51],
       [ 374,   86]], dtype=int64)

In [38]:
accuracy_score(y, y_predict)  # (y_true, y_pred); accuracy is the same in either order

0.8670628714419768