## Logistic Function (Logistic Regression)

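#### odds = the ratio of the probability that an event occurs to the probability that it does not occur
- The larger the odds, the more likely the event is to occur.
- $\ln\frac{p}{1-p} = w_1 x + w_0$, where $w_0$ is the intercept
- $\frac{p}{1-p} = e^{w_1 x + w_0}$
- $p = \frac{1}{1 + e^{-(w_1 x + w_0)}}$
- Passing the regression value through the sigmoid function yields the corresponding probability.
- $\frac{1}{1 + e^{-x}}$ : the sigmoid function
- C is 1/alpha, so the smaller C is, the stronger the regularization (the same holds for C in SVM: a smaller C means stronger regularization).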
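The chain from log-odds to probability can be checked numerically. The following is a minimal sketch using only the standard library; the function names `logit` and `sigmoid`, the weight `w1`, the intercept (written `w0` here), and the input `x` are illustrative choices, not values from these notes.

```python
import math

def logit(p):
    """Log-odds ln(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Sigmoid 1 / (1 + e^-z), the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weight and intercept for a one-feature model (illustrative only).
w1, w0 = 2.0, -1.0
x = 0.5

z = w1 * x + w0        # regression value, interpreted as the log-odds
p = sigmoid(z)         # probability of the event
odds = p / (1.0 - p)   # equals e^z by the second identity above

print(z, p, odds)
```

Applying `logit` to the output of `sigmoid` returns the original regression value, which is why the sigmoid converts a linear score into a probability without losing information.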

## sklearn.linear_model.LogisticRegression
* class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
* Parameters
    - penalty : {'l1', 'l2', 'elasticnet', None}, default='l2'
    - C : inverse of regularization strength, float, default=1.0
    - max_iter : maximum number of iterations the solver may take to converge, int, default=100
    - multi_class : {'auto', 'ovr', 'multinomial'}, default='auto'
        - 'auto' : chosen automatically. It selects 'ovr' if the data is binary or if solver='liblinear', and otherwise selects 'multinomial'.
        - 'ovr' : a binary problem is fit for each label.
        - 'multinomial' : the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. 'multinomial' is unavailable when solver='liblinear'.
    - solver : {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs'
        - For small datasets, 'liblinear' is a good choice.
        - 'sag' and 'saga' are faster for large ones.
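* Multiclass classification : when the label takes more than two values
    - The regression values are passed through the softmax function instead of the sigmoid.
    - The loss is the categorical cross-entropy, computed against one-hot encoded labels.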
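To make the multiclass case concrete, here is a small sketch of softmax and categorical cross-entropy using only the standard library. The helper names `softmax` and `categorical_cross_entropy` and the example scores are assumptions made for illustration; scikit-learn computes these internally.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs) if t > 0)

scores = [2.0, 1.0, 0.1]   # hypothetical regression values for 3 classes
probs = softmax(scores)
label = [1, 0, 0]          # one-hot encoding of class 0
loss = categorical_cross_entropy(label, probs)

print(probs, loss)
```

Because the label is one-hot, the loss reduces to the negative log of the probability assigned to the true class.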


In [9]:
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')


In [3]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

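cancer = load_breast_cancer()

# Standardize every feature to zero mean and unit variance.
# Note: the scaler is fit on the full dataset before the split,
# so test-set statistics influence the training features.
scaler = StandardScaler()
data_scaled = scaler.fit_transform(cancer.data)

# Hold out 30% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    data_scaled, cancer.target, test_size=0.3, random_state=0)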

In [5]:
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.linear_model import LogisticRegression

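# Fit a logistic regression with default settings (L2 penalty, C=1.0).
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
lr_pred = lr_clf.predict(X_test)

# roc_auc_score receives hard 0/1 predictions here, not predicted probabilities.
print(f'accuracy:{accuracy_score(y_test, lr_pred):.3f}')
print(f'roc_auc:{roc_auc_score(y_test, lr_pred):.3f}')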

accuracy:0.977
roc_auc:0.972
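ROC AUC is normally computed from a continuous score rather than from hard class labels. The sketch below is one possible variant, not the notebook's own cell: it fits the scaler on the training split only and passes the positive-class column of `predict_proba` to `roc_auc_score`. The variable names are illustrative, and the resulting numbers may differ from the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

clf = LogisticRegression().fit(X_train_s, y_train)

# Column 1 of predict_proba is the probability of the positive class (label 1).
proba_pos = clf.predict_proba(X_test_s)[:, 1]
auc_from_proba = roc_auc_score(y_test, proba_pos)
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test_s))

print(f'roc_auc (probabilities): {auc_from_proba:.3f}')
print(f'roc_auc (hard labels):   {auc_from_labels:.3f}')
```

With hard labels the ROC curve has a single operating point, so the probability-based score is usually the more informative of the two.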


In [10]:
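from sklearn.model_selection import GridSearchCV

# Search over the penalty type and C (inverse regularization strength).
# The default solver 'lbfgs' does not support 'l1', so those candidates fail to fit
# and are scored as NaN; only the 'l2' candidates are actually compared.
params = {'penalty': ['l2', 'l1'],
          'C': [0.01, 0.1, 1, 5, 10]}
grid_clf = GridSearchCV(lr_clf, param_grid=params, scoring='accuracy', cv=3)
grid_clf.fit(data_scaled, cancer.target)

print('Best hyperparameters:{0}, best mean accuracy:{1:.3f}'.format(grid_clf.best_params_, grid_clf.best_score_))

Best hyperparameters:{'C': 1, 'penalty': 'l2'}, best mean accuracy:0.975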
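With the default `lbfgs` solver, `penalty='l1'` is not supported, so a grid that includes it only really evaluates the `'l2'` candidates. The sketch below shows one way to search both penalties, assuming `solver='liblinear'` (which supports both) and a `Pipeline` so the scaler is refit inside each cross-validation fold. The step names `'scaler'` and `'lr'` are arbitrary labels chosen here, and the best parameters found may differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# Scaling lives inside the pipeline, so each CV fold is scaled from its own training part.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(solver='liblinear')),  # liblinear handles both 'l1' and 'l2'
])
param_grid = {
    'lr__penalty': ['l2', 'l1'],
    'lr__C': [0.01, 0.1, 1, 5, 10],
}
grid = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy', cv=3)
grid.fit(cancer.data, cancer.target)

# Count candidates whose fit failed (these show up as NaN mean scores).
n_failed = int(np.isnan(grid.cv_results_['mean_test_score']).sum())

print(grid.best_params_, round(grid.best_score_, 3))
print('failed candidates:', n_failed)
```

Checking `cv_results_` for NaN scores is a cheap way to notice unsupported parameter combinations when warnings are suppressed.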


In [28]:
from sklearn.datasets import load_iris

# Iris has three classes, so this is a multiclass problem.
iris = load_iris()
scaler = StandardScaler()
iris_scaled = scaler.fit_transform(iris.data)

X_train, X_test, y_train, y_test = train_test_split(
    iris_scaled, iris.target, test_size=0.3, random_state=0)

lg_clf = LogisticRegression()
lg_clf.fit(X_train, y_train)

# predict_proba returns one probability column per class.
lg_pred = lg_clf.predict_proba(X_test)
pd.DataFrame(lg_pred).describe()



| | 0 | 1 | 2 |
|---|---|---|---|
| count | 45.0 | 45.0 | 45.0 |
| mean | 0.3583 | 0.357943 | 0.283757 |
| std | 0.464811 | 0.36349 | 0.3469014 |
| min | 2.6e-05 | 0.004243 | 7.354107e-08 |
| 25% | 0.002583 | 0.02612 | 1.614249e-06 |
| 50% | 0.024095 | 0.161514 | 0.1207245 |
| 75% | 0.968179 | 0.730649 | 0.5657152 |
| max | 0.995757 | 0.950313 | 0.9899404 |
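The link between the multiclass probabilities and softmax can be verified directly. This sketch assumes a scikit-learn version in which the default multiclass handling with the `lbfgs` solver is multinomial; under that assumption, applying softmax to `decision_function` should reproduce `predict_proba`. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Raw per-class regression values, shape (n_samples, n_classes).
scores = clf.decision_function(X_test)

# Row-wise softmax, shifted by the row maximum for numerical stability.
shifted = scores - scores.max(axis=1, keepdims=True)
manual = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

proba = clf.predict_proba(X_test)
print(np.allclose(manual, proba))   # whether softmax(scores) matches predict_proba
print(proba.sum(axis=1)[:5])        # each row of probabilities sums to 1
```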


In [26]:
# One-vs-rest: fit one binary classifier per class instead of a single multinomial model.
lg_clf = LogisticRegression(multi_class='ovr')
lg_clf.fit(X_train, y_train)
lg_pred = lg_clf.predict_proba(X_test)
pd.DataFrame(lg_pred).describe()

| | 0 | 1 | 2 |
|---|---|---|---|
| count | 45.0 | 45.0 | 45.0 |
| mean | 0.335794 | 0.356153 | 0.308052 |
| std | 0.425608 | 0.268351 | 0.3046 |
| min | 0.000239 | 0.018802 | 1.9e-05 |
| 25% | 0.00523 | 0.106595 | 6.9e-05 |
| 50% | 0.043824 | 0.277675 | 0.236031 |
| 75% | 0.881048 | 0.542622 | 0.536531 |
| max | 0.981176 | 0.917569 | 0.903337 |
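In the one-vs-rest scheme each class gets its own binary sigmoid model, and the per-class probabilities are then rescaled so each row sums to 1. The sketch below illustrates this with `OneVsRestClassifier` wrapped around `LogisticRegression`. This is a stand-in for `multi_class='ovr'`, not the cell above: to my understanding the `multi_class` argument is deprecated in recent scikit-learn releases, and the numbers produced here are not guaranteed to match the table above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X_train, y_train)

# One binary model per class: column k is P(class k vs. rest) from its own sigmoid.
per_class = np.column_stack(
    [est.predict_proba(X_test)[:, 1] for est in ovr.estimators_])

# Rescale each row so the three per-class probabilities sum to 1.
normalized = per_class / per_class.sum(axis=1, keepdims=True)

ovr_proba = ovr.predict_proba(X_test)
print(np.allclose(normalized, ovr_proba))   # whether manual normalization matches
```

Unlike softmax, the per-class sigmoids are fit independently, which is one reason the one-vs-rest probabilities tend to be less extreme than the multinomial ones.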
