## Walk-through for Marginal Information

Suppose we have a dataset with a response variable $Y$ and explanatory variables $X_1$, $X_2$, and $X_3$. We want to fit a logistic regression model
$$\mathbb{P}(Y = 1) = \frac{1}{1 + \exp(- \sum_{j=1}^3 \beta_j X_j)}$$

Question: how do we know whether one of the $X_j$'s produces better results than the others?

Answer: 

(1) The first approach is to try the variables one at a time. For each $j \in \{1, 2, 3\}$, fit a single-variable model
$$Y \sim X_j$$
where the model takes the same logistic form as $\mathbb{P}(Y=1)$ defined above. In other words, we use
$$\mathbb{P}(Y = 1) = \frac{1}{1 + \exp(- \beta_j X_j)}$$
and then compare the fitted coefficients $\hat{\beta}_j$. Note that coefficient magnitudes are only comparable when the $X_j$ are on the same scale.

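(2) The second approach is to use the AUC. The Receiver Operating Characteristic (ROC) curve measures how well the predictions $\hat{Y}$ agree with the labels $Y$: it plots the true positive rate against the false positive rate as the classification threshold varies. The area under this curve is called the AUC. It lies between 0 and 1, where 0.5 corresponds to a random ranking and 1 to perfect separation. A value below 0.5 means the score ranks the classes in reverse, so reversing the score gives $1 - \text{AUC}$; after this flip the AUC falls between 50% and 100% (the higher the better). In scikit-learn it is computed with `roc_auc_score`: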

```
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_auc_score
>>> X, y = load_breast_cancer(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X)[:, 1])
0.99...
>>> roc_auc_score(y, clf.decision_function(X))
0.99...
```
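To make the definition concrete, the AUC can also be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below is a plain-Python illustration of that reading, not scikit-learn's implementation; the helper name `pairwise_auc` and the toy data are made up for this example.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two negatives, two positives.
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_toy, s_toy)
flipped = pairwise_auc(y_toy, [-s for s in s_toy])
print(auc)      # 0.75: three of the four pairs are ranked correctly
print(flipped)  # 0.25: reversing the score gives 1 - AUC
```

The second call shows why an AUC below 0.5 is still informative: negating the score turns it into 1 minus the original value.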

In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import pandas as pd



In [2]:
X, y = load_breast_cancer(return_X_y=True)
X.shape, y.shape

((569, 30), (569,))

Note that the data $X$ above is multivariate: it has more than one column (30 columns).
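Approach (1) from the answer above, one single-variable fit per column, might look like the following on this dataset. This is a minimal sketch under a few assumptions that are not in the original notebook: the columns are standardized with `StandardScaler` so that the coefficients share a scale, and scikit-learn's defaults (an intercept term and L2 regularization) are left on, so the fitted model is not exactly the intercept-free formula written earlier.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Assumption: put every column on a common scale before comparing coefficients.
X_std = StandardScaler().fit_transform(X)

coefs = []
for j in range(X_std.shape[1]):
    # Fit Y ~ X_j using only column j (the slice j:j+1 keeps the input 2-D).
    clf_j = LogisticRegression(solver="liblinear", random_state=0)
    clf_j.fit(X_std[:, j:j + 1], y)
    coefs.append(clf_j.coef_[0, 0])

coefs = np.array(coefs)

# Columns ordered from largest to smallest absolute coefficient.
order = np.argsort(-np.abs(coefs))
print(order[:5])
print(coefs[order[:5]])
```

A large absolute coefficient on a standardized column suggests that the column alone shifts the predicted probability strongly; the sign indicates the direction.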

In [43]:
for j in range(30):
    # Use the raw column as the score and compute its AUC once
    auc = roc_auc_score(y, X[:, j])
    # An AUC below 0.5 means the column ranks the classes in reverse, so report 1 - AUC
    print(j, max(auc, 1 - auc))

0 0.9375165160403784
1 0.7758244807356905
2 0.9468976269753184
3 0.9383158923946937
4 0.7220416468474182
5 0.8637823053749801
6 0.9378270175994926
7 0.9644376618571957
8 0.6985624438454627
9 0.5154656202103483
10 0.8683341261032715
11 0.5115942603456477
12 0.8763939538079383
13 0.9264111304899318
14 0.5311624649859944
15 0.7272805348554516
16 0.7808189313461233
17 0.7917921885735426
18 0.5551107235346969
19 0.6203028381163787
20 0.9704428941387876
21 0.7846308334654617
22 0.9754505575815232
23 0.9698284974367105
24 0.7540563395169388
25 0.8623024681570741
26 0.9213638285502881
27 0.9667036625971143
28 0.736939115268749
29 0.6859706146609588


From the results above, we know the AUC of each column on its own, that is, how well that single column separates the two classes. We can use these values to select variables.
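The selection step can be sketched as a ranking of the columns by their individual AUC. The names `marginal_auc`, `ranking`, and `top_k`, and the cutoff `k = 7`, are illustrative choices for this sketch, not something the notebook prescribes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)

# Individual AUC of each column, folded so that it is always >= 0.5.
raw = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
marginal_auc = np.maximum(raw, 1 - raw)

# Column indices sorted from highest to lowest individual AUC.
ranking = np.argsort(-marginal_auc)

k = 7  # hypothetical cutoff; any small number of columns could be tried
top_k = ranking[:k]
print(top_k)
print(marginal_auc[top_k])
```

Going by the values printed above, the five highest-scoring columns are 22, 20, 23, 27, and 7.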

#### Benchmark

As a benchmark, we fit the model on all 30 variables.

In [47]:
clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y)

In [48]:
roc_auc_score(y, clf.predict_proba(X)[:, 1]) # training performance

0.9946488029173934

#### Proposed

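Earlier we used the AUC to score each variable individually. The proposed model tests whether far fewer variables can produce a similar result. Below we first look at the two weakest columns (9 and 11, each with an AUC near 0.5), then fit the model on seven columns that each have an individual AUC above 0.92.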

In [37]:
X[:, [9, 11]]

array([[0.07871, 0.9053 ],
       [0.05667, 0.7339 ],
       [0.05999, 0.7869 ],
       ...,
       [0.05648, 1.075  ],
       [0.07016, 1.595  ],
       [0.05884, 1.428  ]])

In [51]:
clf = LogisticRegression(solver="liblinear", random_state=0).fit(X[:, [0, 2, 3, 6, 7, 13, 20]], y)

In [52]:
roc_auc_score(y, clf.predict_proba(X[:, [0, 2, 3, 6, 7, 13, 20]])[:, 1]) # training performance

0.9888351567041911
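Both AUC values above (0.9946 with all 30 columns, 0.9888 with seven) are training performance. One way to check whether the smaller model holds up on unseen data is to repeat the comparison on a held-out split. The sketch below assumes a 70/30 stratified split via `train_test_split`; the split size, the seed, and the helper `held_out_auc` are choices made for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
cols = [0, 2, 3, 6, 7, 13, 20]  # the seven columns used in the proposed model

# Assumption: hold out 30% of the rows, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)


def held_out_auc(columns):
    """Fit on the training rows of `columns`, return the AUC on the test rows."""
    clf = LogisticRegression(solver="liblinear", random_state=0)
    clf.fit(X_train[:, columns], y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test[:, columns])[:, 1])


auc_all = held_out_auc(list(range(X.shape[1])))  # benchmark: all 30 columns
auc_sub = held_out_auc(cols)                     # proposed: 7 columns
print(auc_all, auc_sub)
```

Strictly speaking, the per-column AUC screening should also be run on the training rows only, since choosing columns with the full data lets information from the test rows leak into the selection.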