# 로지스틱 회귀분석


* 로지스틱 회귀분석은 회귀분석을 분류에 이용한 방법으로 독립변수의 선형결합을 이용해 사건의 발생 가능성을 예측히는 분석방법으로 종속변수(y)가 범주형 변수일 때 사용 가능하다.
* 로지스틱 회귀분석은 종속변수가 바로 범주형 변수를 반환하지 않고 각 범주(집단)에 포함될 확률값을 반환하여 분류한다. 예컨대 ‘어떤 사건이 발생할 확률과 발생하지 않을 확률’로 나누어서 예측한다. 

< 로지스틱스 회귀분석의 변수 >
* 일반적인 로지스틱 회귀분석은 종속변수가 속할 수 있는 집단이 두 개로 이진 분류가 기본이며，세개 이상의 집단을 분류하는 경우 이를 ‘다중 로지스틱 회귀분석’이라 한다. 
* 로지스틱 회귀분석은독립변수가 연속형이면서 종속변수가범주형일 때 가능하다. 만약 독립변수가 범주형일 경우에는그 범주형 독립변수를 더미변수로 변환하면 가능하다. 
* 로지스틱 회귀분석은 독립변수가 어떤 값을 가지든 상관없이 종속변수는 확률값을 가진다. 따라서 로지스틱 회귀분석의 종속변수는 항상 '0과 1’사이의 값을 가지는데，이를 위해 '오즈(Odds), 로짓변환, 시그모이드 함수 같은 개념이 등장하게 된다. 

* 오즈(Odds) : 성공할 확률이 실패할 확률의 몇 배인지를 나타내는 값. 로지스틱 회귀분석에서 이 오즈를 사용하여 각 범주(집단)에 분류될 확률 값을 추정한다. 

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import koreanize_matplotlib

In [2]:
data = pd.read_csv("https://raw.githubusercontent.com/haram4th/ablearn/main/Taitanic_train.csv")
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [4]:
X = data[['Pclass', 'Sex', 'Age']]
y = data['Survived']

In [5]:
X

Unnamed: 0,Pclass,Sex,Age
0,3,male,22.0
1,1,female,38.0
2,3,female,26.0
3,1,female,35.0
4,3,male,35.0
...,...,...,...
886,2,male,27.0
887,1,female,19.0
888,3,female,
889,1,male,26.0


In [6]:
X = pd.get_dummies(X, drop_first=True)
X['Age'] = X['Age'].fillna('29')
X

Unnamed: 0,Pclass,Age,Sex_male
0,3,22.0,True
1,1,38.0,False
2,3,26.0,False
3,1,35.0,False
4,3,35.0,True
...,...,...,...
886,2,27.0,True
887,1,19.0,False
888,3,29,False
889,1,26.0,True


In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Pclass    891 non-null    int64 
 1   Age       891 non-null    object
 2   Sex_male  891 non-null    bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 14.9+ KB


'Age' 컬럼이 object 이기 때문에 int로 변환을 해준다. 

In [8]:
X['Age'] = X['Age'].astype(int)

In [9]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Pclass    891 non-null    int64
 1   Age       891 non-null    int32
 2   Sex_male  891 non-null    bool 
dtypes: bool(1), int32(1), int64(1)
memory usage: 11.4 KB


In [10]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=8) 

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve

In [12]:
lg = LogisticRegression()
lg.fit(X_train, y_train)
pred = lg.predict(X_test)
print("accuracy_score ", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
print("roc_auc_score ", roc_auc_score(y_test, pred))

accuracy_score  0.7597765363128491
              precision    recall  f1-score   support

           0       0.77      0.85      0.81       108
           1       0.73      0.62      0.67        71

    accuracy                           0.76       179
   macro avg       0.75      0.74      0.74       179
weighted avg       0.76      0.76      0.76       179

roc_auc_score  0.7357850808555034


In [13]:
from sklearn.preprocessing import MinMaxScaler

In [14]:
mms = MinMaxScaler()
X_train_scaled = mms.fit_transform(X_train)
X_test_scaled = mms.fit_transform(X_test)

In [15]:
X_train_scaled 

array([[1.    , 0.3625, 1.    ],
       [0.5   , 0.45  , 0.    ],
       [1.    , 0.3375, 0.    ],
       ...,
       [0.5   , 0.3625, 1.    ],
       [0.5   , 0.025 , 1.    ],
       [1.    , 0.3625, 1.    ]])

In [16]:
import statsmodels.api as sm


In [18]:
lg2 = sm.GLM(y_train, X_train_scaled, family=sm.families.Binomial())
res = lg2.fit()
res.summary()

0,1,2,3
Dep. Variable:,Survived,No. Observations:,712.0
Model:,GLM,Df Residuals:,709.0
Model Family:,Binomial,Df Model:,2.0
Link Function:,Logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-358.18
Date:,"Mon, 02 Sep 2024",Deviance:,716.37
Time:,14:49:48,Pearson chi2:,778.0
No. Iterations:,4,Pseudo R-squ. (CS):,0.2757
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
x1,-0.6416,0.171,-3.752,0.000,-0.977,-0.306
x2,2.6895,0.366,7.355,0.000,1.973,3.406
x3,-2.2544,0.195,-11.553,0.000,-2.637,-1.872


In [20]:
# 상수항 추가
X_train_scaled = sm.add_constant(X_train_scaled)
X_test_scaled = sm.add_constant(X_test_scaled)
print(model.summary())

                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  712
Model:                          Logit   Df Residuals:                      709
Method:                           MLE   Df Model:                            2
Date:                Mon, 02 Sep 2024   Pseudo R-squ.:                  0.2428
Time:                        15:05:59   Log-Likelihood:                -358.18
converged:                       True   LL-Null:                       -473.03
Covariance Type:            nonrobust   LLR p-value:                 1.330e-50
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.6416      0.171     -3.752      0.000      -0.977      -0.306
x2             2.6895      0.366      7.355      0.000       1.973       3.406
x3            -2.2544      0.195    -11.553      0.0

In [21]:
model = sm.Logit(y_train, X_train_scaled).fit()
pred = model.predict(X_test_scaled)
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.435826
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  712
Model:                          Logit   Df Residuals:                      708
Method:                           MLE   Df Model:                            3
Date:                Mon, 02 Sep 2024   Pseudo R-squ.:                  0.3440
Time:                        15:06:00   Log-Likelihood:                -310.31
converged:                       True   LL-Null:                       -473.03
Covariance Type:            nonrobust   LLR p-value:                 3.097e-70
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.4584      0.398      8.680      0.000       2.677       4.239
x1            -2.3657      0.

In [22]:
pred

array([0.07458202, 0.06062846, 0.08826877, 0.5871924 , 0.91566857,
       0.15860331, 0.42537823, 0.93364129, 0.13982742, 0.6727924 ,
       0.07985546, 0.38127959, 0.06277127, 0.40420743, 0.06062846,
       0.06963047, 0.08261805, 0.40738969, 0.68125718, 0.05460399,
       0.3554493 , 0.80013587, 0.29817892, 0.60501475, 0.40738969,
       0.06062846, 0.92890462, 0.05654669, 0.06062846, 0.81730497,
       0.08261805, 0.0884054 , 0.56913856, 0.41315264, 0.06062846,
       0.86607194, 0.50480486, 0.21413372, 0.06062846, 0.50480486,
       0.3138995 , 0.63979431, 0.54130329, 0.07206703, 0.80013587,
       0.20193898, 0.20797045, 0.81145324, 0.07985546, 0.40738969,
       0.26094386, 0.17399033, 0.13982742, 0.72003579, 0.16360578,
       0.49555093, 0.48630005, 0.13982742, 0.08546733, 0.40738969,
       0.84311381, 0.50480486, 0.05855423, 0.09143439, 0.78178406,
       0.04420956, 0.66790378, 0.56003907, 0.5781916 , 0.76224788,
       0.82810094, 0.05090583, 0.40738969, 0.14433927, 0.06062

In [25]:
pred = np.round(pred)