# Logistic regression diagnostics

[Credit to this notebook](https://www.kaggle.com/cdeotte/logistic-regression-0-800)

[Also credit to this](https://www.kaggle.com/mnassrib/titanic-logistic-regression-with-python)

## Table of Contents

[Q1](#tag1)

[Q2](#tag2)

[Regularization and grid search](#tag3)

In [None]:
import os

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

In [None]:

train = pd.read_csv('../input/instant-gratification/train.csv')
test = pd.read_csv('../input/instant-gratification/test.csv')

train.head()

In [None]:
train.shape

In [None]:
train.columns

In [None]:
# To use every feature instead: cols = [c for c in train.columns if c not in ['id', 'target']]
cols = ['muggy-smalt-axolotl-pembus', 'dorky-peach-sheepdog-ordinal',
       'slimy-seashell-cassowary-goose']
oof = np.zeros(len(train))
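# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)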
   
for train_index, test_index in skf.split(train.iloc[:,1:-1], train['target']):
    clf = LogisticRegression(solver='liblinear',penalty='l2',C=1.0)
    clf.fit(train.loc[train_index][cols],train.loc[train_index]['target'])
    oof[test_index] = clf.predict_proba(train.loc[test_index][cols])[:,1]
    
auc = roc_auc_score(train['target'],oof)
print('LR without interactions scores CV =',round(auc,5))

In [None]:
clf.get_params()

In [None]:
clf.coef_

<a id='tag1'></a>
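## Q1: What happens to the coefficients if we multiply or divide one variable by 100?

* Multiply the variable by 100: its coefficient shrinks by a factor of about 100, and the other coefficients barely change. The match is approximate rather than exact because the L2 penalty acts on the coefficients.
* Divide the variable by 100: the reverse happens and its coefficient grows, by a factor of 100 if the penalty is negligible. The L2 penalty works against large coefficients, so the observed factor can be smaller.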
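
The scaling effect can be checked on synthetic data before running it on the competition data. The sketch below is only an illustration: the data, the `true_coef` values, and the `fit_coef` helper are made up for this example and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): three standard-normal features and a binary target
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
true_coef = np.array([1.0, -0.5, 0.25])  # made-up coefficients for illustration
p = 1.0 / (1.0 + np.exp(-(X @ true_coef)))
y = rng.binomial(1, p)

# Same data, but with the first feature multiplied by 100
X_scaled = X.copy()
X_scaled[:, 0] *= 100

def fit_coef(features, target):
    """Fit an L2-penalized logistic regression (the default penalty) and return its coefficients."""
    clf = LogisticRegression(solver='liblinear', C=1.0, tol=1e-6, max_iter=1000)
    return clf.fit(features, target).coef_.ravel()

coef_orig = fit_coef(X, y)
coef_scaled = fit_coef(X_scaled, y)

# Ratio of the first coefficient before and after scaling (close to 100, not exactly 100)
ratio = coef_orig[0] / coef_scaled[0]
print('original coefficients:', coef_orig)
print('scaled coefficients:  ', coef_scaled)
print('ratio for the scaled feature:', ratio)
```

With this many rows the penalty is small relative to the data term, so the ratio lands very near 100; with fewer rows or a smaller `C` the gap would be larger.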

In [None]:
train['muggy-smalt-axolotl-pembus_100'] = train['muggy-smalt-axolotl-pembus']*100 

In [None]:
cols = ['muggy-smalt-axolotl-pembus_100', 'dorky-peach-sheepdog-ordinal',
       'slimy-seashell-cassowary-goose']

for train_index, test_index in skf.split(train.iloc[:,1:-1], train['target']):
    clf = LogisticRegression(solver='liblinear',penalty='l2',C=1.0)
    clf.fit(train.loc[train_index][cols],train.loc[train_index]['target'])
    oof[test_index] = clf.predict_proba(train.loc[test_index][cols])[:,1]
    
auc = roc_auc_score(train['target'],oof)
print('LR without interactions scores CV =',round(auc,5))

In [None]:
clf.coef_

In [None]:
train['muggy-smalt-axolotl-pembus_s100'] = train['muggy-smalt-axolotl-pembus']/100 

In [None]:
cols = ['muggy-smalt-axolotl-pembus_s100', 'dorky-peach-sheepdog-ordinal',
       'slimy-seashell-cassowary-goose']

for train_index, test_index in skf.split(train.iloc[:,1:-1], train['target']):
    clf = LogisticRegression(solver='liblinear',penalty='l2',C=1.0)
    clf.fit(train.loc[train_index][cols],train.loc[train_index]['target'])
    oof[test_index] = clf.predict_proba(train.loc[test_index][cols])[:,1]
    
auc = roc_auc_score(train['target'],oof)
print('LR without interactions scores CV =',round(auc,5))

In [None]:
clf.coef_

<a id='tag2'></a>
## Q2: What if you change the distribution of the data?

Here the first 10,000 rows of one feature are multiplied by 100, which changes that feature's distribution. The CV score decreases slightly as a result.
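
To see why rescaling only some rows hurts while rescaling every row does not, here is a minimal sketch on synthetic data (the data and variable names are assumptions for illustration). It scores each observation by the raw feature, which ranks observations the same way as a single-feature logistic regression with a positive coefficient, so AUC can be compared directly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic single-feature problem (hypothetical data)
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * x)))

# Rescale every row: the ordering of observations is unchanged
x_all = x * 100

# Rescale only the first half of the rows: the ordering across the two halves changes
x_part = x.copy()
x_part[: n // 2] *= 100

auc_orig = roc_auc_score(y, x)
auc_all = roc_auc_score(y, x_all)
auc_part = roc_auc_score(y, x_part)

print('AUC, original feature:     ', round(auc_orig, 5))
print('AUC, all rows scaled:      ', round(auc_all, 5))
print('AUC, first half rows scaled:', round(auc_part, 5))
```

AUC depends only on how observations are ranked, so a uniform rescale leaves it unchanged, while a partial rescale mixes two scales in one column and degrades the ranking.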

In [None]:
train['muggy-smalt-axolotl-pembus'].hist();

In [None]:
# Multiply the first 10,000 rows by 100 (.loc label slicing is inclusive, so :9999 covers rows 0 to 9999)
train.loc[:9999, 'muggy-smalt-axolotl-pembus'] *= 100

In [None]:
train['muggy-smalt-axolotl-pembus'].hist();

In [None]:
cols = ['muggy-smalt-axolotl-pembus', 'dorky-peach-sheepdog-ordinal',
       'slimy-seashell-cassowary-goose']

for train_index, test_index in skf.split(train.iloc[:,1:-1], train['target']):
    clf = LogisticRegression(solver='liblinear',penalty='l2',C=1.0)
    clf.fit(train.loc[train_index][cols],train.loc[train_index]['target'])
    oof[test_index] = clf.predict_proba(train.loc[test_index][cols])[:,1]
    
auc = roc_auc_score(train['target'],oof)
print('LR without interactions scores CV =',round(auc,5))

<a id='tag3'></a>
## Regularization and grid search

In [None]:
train = pd.read_csv('../input/instant-gratification/train.csv')
cols = ['muggy-smalt-axolotl-pembus', 'dorky-peach-sheepdog-ordinal',
       'slimy-seashell-cassowary-goose']

for train_index, test_index in skf.split(train.iloc[:,1:-1], train['target']):
    clf = LogisticRegression(solver='liblinear',penalty='l1')
    clf.fit(train.loc[train_index][cols],train.loc[train_index]['target'])
    oof[test_index] = clf.predict_proba(train.loc[test_index][cols])[:,1]
    
auc = roc_auc_score(train['target'],oof)
print('LR with L1 penalty scores CV =',round(auc,5))

In [None]:
train = pd.read_csv('../input/instant-gratification/train.csv')
cols = ['muggy-smalt-axolotl-pembus', 'dorky-peach-sheepdog-ordinal',
       'slimy-seashell-cassowary-goose']

for train_index, test_index in skf.split(train.iloc[:,1:-1], train['target']):
    clf = LogisticRegression(solver='liblinear',penalty='l2')
    clf.fit(train.loc[train_index][cols],train.loc[train_index]['target'])
    oof[test_index] = clf.predict_proba(train.loc[test_index][cols])[:,1]
    
auc = roc_auc_score(train['target'],oof)
print('LR with L2 penalty scores CV =',round(auc,5))

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = {'C': np.arange(1e-5, 3, 0.1)}
scoring = {'Accuracy': 'accuracy', 'AUC': 'roc_auc', 'Log loss': 'neg_log_loss'}
gs = GridSearchCV(LogisticRegression(), param_grid=param_grid, scoring=scoring,
                  cv=5, refit='AUC', return_train_score=False)
gs.fit(train[cols], train['target'])

In [None]:
print('='*20)
print("best estimator: " + str(gs.best_estimator_))
print("best params: " + str(gs.best_params_))
print('best score:', gs.best_score_)
print('='*20)


## Linear regression diagnostics

In [None]:
df_train = pd.read_csv('../input/random-linear-regression/train.csv')
df_test = pd.read_csv('../input/random-linear-regression/test.csv')

In [None]:
df_test.head()

In [None]:
df_train.fillna(49.94, inplace=True)

In [None]:
df_test.isna().sum()

In [None]:
df_train.describe()

In [None]:
X_train = df_train['x'].values.reshape(-1,1)
y_train = df_train['y'].values
X_test = df_test['x'].values.reshape(-1,1)
y_test = df_test['y'].values

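#### Question: if we duplicate the 500 observations to get 1000, how do the training error and the variances change?

* Training error: the mean squared error does not change, because the fitted line is the same.
* Variance of the error: does not change either. With an intercept the residuals average to zero, so their variance equals the mean squared error.
* Variance of the model: the estimate decreases by a factor of about 2, because the sample size doubles. The duplicates carry no new information, so this reflects how the variance is estimated rather than a real gain in the ability to generalize.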
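
The three points above can be checked on synthetic data. This is a rough sketch, not the notebook's method: the data and the `fit_summary` helper are hypothetical, and the model variance is approximated by the textbook formula for the variance of the OLS slope rather than by `bias_variance_decomp`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical): 500 points on a noisy line
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 100, size=n)
y = 1.0 * x + rng.normal(scale=3.0, size=n)

def fit_summary(x, y):
    """Fit OLS and return slope, training MSE, residual variance, and naive slope variance."""
    X = x.reshape(-1, 1)
    model = LinearRegression().fit(X, y)
    pred = model.predict(X)
    resid = y - pred
    mse = mean_squared_error(y, pred)
    # Textbook estimate: sigma^2 / sum((x - mean(x))^2), with sigma^2 = RSS / (n - 2)
    sigma2 = resid @ resid / (len(x) - 2)
    slope_var = sigma2 / np.sum((x - x.mean()) ** 2)
    return model.coef_[0], mse, resid.var(), slope_var

# Original data versus the same data stacked twice
slope1, mse1, resid_var1, slope_var1 = fit_summary(x, y)
slope2, mse2, resid_var2, slope_var2 = fit_summary(np.concatenate([x, x]),
                                                   np.concatenate([y, y]))

print('slope:            ', slope1, slope2)
print('training MSE:     ', mse1, mse2)
print('residual variance:', resid_var1, resid_var2)
print('slope variance ratio (original / duplicated):', slope_var1 / slope_var2)
```

The slope, training MSE, and residual variance match across the two fits, while the naive slope variance roughly halves only because the formula treats the duplicated rows as independent observations.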

In [None]:
from mlxtend.evaluate import bias_variance_decomp
from sklearn.linear_model import LinearRegression


In [None]:
lr = LinearRegression()
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    lr, X_train, y_train, X_test, y_test, loss='mse',
    random_seed=123)

In [None]:
avg_bias, avg_var

In [None]:
df_train_new = pd.concat([df_train, df_train], axis=0)

In [None]:
df_train_new.shape

In [None]:
df_train.shape

In [None]:
X_train = df_train_new['x'].values.reshape(-1,1)
y_train = df_train_new['y'].values
X_test = df_test['x'].values.reshape(-1,1)
y_test = df_test['y'].values

In [None]:
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    lr, X_train, y_train, X_test, y_test, loss='mse',
    random_seed=123)

In [None]:
avg_bias, avg_var

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
md1 = lr.fit(df_train['x'].values.reshape(-1,1),df_train['y'].values)

In [None]:
pred1 = md1.predict(df_train['x'].values.reshape(-1,1))

In [None]:
err1 = pred1- df_train['y'].values

In [None]:
np.mean(err1), np.var(err1), np.sum(err1)

In [None]:
mean_squared_error(df_train['y'].values, pred1)  # argument order is (y_true, y_pred)

In [None]:
md2 = lr.fit(df_train_new['x'].values.reshape(-1,1), df_train_new['y'].values)
pred2 = md2.predict(df_train_new['x'].values.reshape(-1,1))
mean_squared_error(df_train_new['y'].values, pred2)  # argument order is (y_true, y_pred)

In [None]:
err2 = pred2 - df_train_new['y'].values
np.mean(err2), np.var(err2), np.sum(err2)