# Logistic Regression

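Hypothesis: $h(x)=\frac{1}{1+e^{-x^{'}\theta}}=p(y=1\mid x,\theta)=p$

Parameters: $\theta=(\theta_0,\dots,\theta_n)^{'}$

Likelihood function: $L(\theta)=\prod_{i=1}^{m}\left(p^{(i)}\right)^{y^{(i)}}\left(1-p^{(i)}\right)^{1-y^{(i)}}$, where $p^{(i)}=h(x^{(i)})$

Cost function: $J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\ln h(x^{(i)})+(1-y^{(i)})\ln\left(1-h(x^{(i)})\right)\right]$

Gradient: $\nabla J(\theta)=\frac{1}{m}X^{'}(h(X)-y)$, where $X$ stacks the samples $x^{(i)}$ as rows and $h(X)$ applies $h$ to each row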
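
The formulas above can be sketched directly in NumPy. This is a minimal illustration, not the solver scikit-learn uses: the helper names (`sigmoid`, `cost`, `gradient`, `fit`), the learning rate, the iteration count, and the toy data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average negative log-likelihood J(theta)."""
    p = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(theta, X, y):
    """Gradient of J: (1/m) * X' (h(X) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def fit(X, y, lr=0.5, n_iter=2000):
    """Plain batch gradient descent (hypothetical helper for illustration)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= lr * gradient(theta, X, y)
    return theta

# Toy data: an intercept column plus one feature, labels drawn from a known theta.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_theta = np.array([-0.5, 2.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

theta_hat = fit(X, y)
print(theta_hat, cost(theta_hat, X, y))
```

Comparing `gradient` against a finite-difference approximation of `cost` is a quick way to check that the two formulas agree.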


In [None]:
import pandas as pd

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler
from sklearn.decomposition import TruncatedSVD

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

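from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report,
)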

## Data

In [None]:
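# The CSV files are assumed to have no header row (label first, then the pixel
# values), so header=None keeps the first sample from being read as column names.
train = pd.read_csv('/content/sample_data/mnist_train_small.csv', header=None)
test = pd.read_csv('/content/sample_data/mnist_test.csv', header=None)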

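# Keep only the digits 0 and 1 to get a binary classification problem
train_bin_cat = train[train.iloc[:, 0].isin([0, 1])]
test_bin_cat = test[test.iloc[:, 0].isin([0, 1])]

# Column 0 holds the label; the remaining columns are pixel features
X_train, y_train = train_bin_cat.iloc[:, 1:], train_bin_cat.iloc[:, 0]
X_test, y_test = test_bin_cat.iloc[:, 1:], test_bin_cat.iloc[:, 0]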

In [None]:
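# Store the mostly-zero pixel matrices in sparse form
X_train = csr_matrix(X_train)
X_test = csr_matrix(X_test)

# Scale each feature by its maximum absolute value, which keeps the matrix sparse;
# the scaler is fitted on the training set only
scaler = MaxAbsScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Reduce to 96 components, again fitting on the training set only
dec = TruncatedSVD(n_components=96)
X_train = dec.fit_transform(X_train)
X_test = dec.transform(X_test)
print(dec.explained_variance_ratio_.sum())  # share of variance retained
print(X_test.shape)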

0.9503326891265398
(2115, 96)
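
The notebook does not say how `n_components=96` was chosen. One possible approach, sketched below on synthetic data, is to fit with more components than needed and pick the smallest number whose cumulative explained variance ratio reaches a target. The synthetic matrix, the 0.95 target, and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the scaled pixel matrix: low-rank signal plus small noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 40))
X = csr_matrix(signal + 0.01 * rng.normal(size=(300, 40)))

# Fit with more components than we expect to need, then accumulate the ratios.
svd = TruncatedSVD(n_components=30, random_state=0)
svd.fit(X)
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.95
k = int(np.searchsorted(cumulative, target) + 1)
print(k, cumulative[k - 1])
```

If the target is never reached, `k` exceeds the number of fitted components, so in practice the result should be checked against `n_components` before refitting.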


## Model

In [None]:
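clf = LogisticRegression(solver='saga', max_iter=500, random_state=100)
# C is the inverse regularization strength: smaller values regularize more strongly
params = {
    'C': [0.001, 0.01, 0.1, 1, 10]
}
# 5-fold cross-validation, selecting the C with the highest mean ROC AUC
grid = GridSearchCV(clf, params, cv=5, scoring='roc_auc', verbose=2, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)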

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:   25.2s finished


{'C': 0.1}
0.9999931835737745
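
`best_score_` reports only the winning candidate. To see how the score varies with `C`, the per-candidate results in `cv_results_` can be inspected. The sketch below uses a synthetic dataset from `make_classification` and the default solver as stand-ins, so its numbers are not comparable to the run above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem standing in for the 0-vs-1 digits.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.001, 0.01, 0.1, 1, 10]},  # smaller C means stronger regularization
    cv=5,
    scoring='roc_auc',
)
grid.fit(X, y)

# Mean and spread of the cross-validated AUC for every candidate C.
results = grid.cv_results_
for C, mean, std in zip(results['param_C'],
                        results['mean_test_score'],
                        results['std_test_score']):
    print(f'C={C}: AUC {mean:.4f} +/- {std:.4f}')
```

When several candidates score within one standard deviation of each other, the choice between them is largely a matter of preference for more or less regularization.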


In [None]:
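# Best model refitted on the full training set by GridSearchCV
model = grid.best_estimator_
y_hat = model.predict(X_test)               # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1, used for AUC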
print(f'accuracy: {accuracy_score(y_test, y_hat)}')
print(f'precision: {precision_score(y_test, y_hat)}')
print(f'recall: {recall_score(y_test, y_hat)}')
print(f'f1: {f1_score(y_test, y_hat)}')
print(f'auc: {roc_auc_score(y_test, y_prob)}')
print(confusion_matrix(y_test, y_hat))
print(classification_report(y_test, y_hat))

accuracy: 0.9990543735224586
precision: 0.9982409850483729
recall: 1.0
f1: 0.9991197183098591
auc: 1.0
[[ 978    2]
 [   0 1135]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       980
           1       1.00      1.00      1.00      1135

    accuracy                           1.00      2115
   macro avg       1.00      1.00      1.00      2115
weighted avg       1.00      1.00      1.00      2115
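
As a sanity check, the accuracy, precision, recall, and F1 values printed above can be recomputed by hand from the confusion matrix, treating 1 as the positive class. The variable names below are chosen for illustration.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions.
tn, fp = 978, 2
fn, tp = 0, 1135

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The only errors are the two digits with true label 0 that were predicted as 1, which is why recall is exactly 1 while precision falls slightly below it.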

