This kernel draws on ideas from the following tutorial:

https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

In this kernel we use lasso (L1-penalized) regression:

https://en.wikipedia.org/wiki/Lasso_(statistics)

In [17]:
# Loading the packages
import numpy as np
import pandas as pd 
from sklearn.preprocessing import StandardScaler
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import make_scorer 
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

In [18]:
# Loading the training dataset
df_train = pd.read_csv("../input/train.csv")

In [19]:
y = df_train["target"]
# We exclude the target and id columns from the training dataset
df_train.pop("target");
df_train.pop("id")
colnames1 = df_train.columns

We standardize the explanatory variables by removing the mean and scaling to unit variance. This matters for penalized logistic regression because the penalty acts on the size of the coefficients, which depends on the scale of each variable. The standard score for the variable X is calculated as follows:

$$ z = \frac{X - \mu}{s} $$

where μ is the mean and s is the standard deviation of X.
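As a quick check of the formula, the sketch below compares a manual standard score with the output of `StandardScaler`. The array `X_toy` holds made-up values (it is not taken from `train.csv`); the only point is that `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with hypothetical values, two columns on very different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Manual standard score, column by column: z = (X - mean) / std
# StandardScaler uses the population standard deviation, i.e. ddof=0
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=0)

# Same transformation done by scikit-learn
z_sklearn = StandardScaler().fit_transform(X_toy)

same = np.allclose(z_manual, z_sklearn)
print("Manual and StandardScaler results agree:", same)
```

After the transformation every column has mean 0 and unit variance, so the penalty treats all variables on the same footing.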

In [20]:


scaler = StandardScaler()
scaler.fit(df_train)
X = scaler.transform(df_train)
df_train = pd.DataFrame(data = X, columns=colnames1)   # df_train is standardized 

In this kernel:

https://www.kaggle.com/ricardorios/random-forests-don-t-overfit

we found that the following variables are related to the target variable: 33, 279, 272, 83, 237, 241, 91, 199, 216, 19, 65, 141, 70, 243, 137, 26, 90. We are going to use these variables to fit the model.

In [21]:
random_forest_predictors = ["33", "279", "272", 
                           "83", "237", "241", 
                           "91", "199", "216", 
                           "19", "65", "141", "70", "243", "137", "26", "90"]

predictors = random_forest_predictors

df_train = df_train[predictors]


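To regularize the model fitted in this kernel:

https://www.kaggle.com/ricardorios/logistic-regression-model-don-t-overfit

we are going to use [Lasso Regression](https://www.statisticshowto.datasciencecentral.com/lasso-regression/), that is, logistic regression with an L1 penalty. One advantage of this approach is that the fitted model is sparse: the coefficients of weak predictors are shrunk to exactly zero, so the remaining variables act as a built-in variable selection. Next, we perform a grid search over the parameter C, the inverse of the regularization strength (smaller values of C mean a stronger penalty).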
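To illustrate how C controls sparsity, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the names `X_syn`, `y_syn` and `nonzero_counts`, and the short list of C values are assumptions chosen for illustration, not the competition data or the grid used in this kernel. The solver is set to `liblinear` explicitly because it supports the L1 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 250 rows, 17 variables, 5 informative
X_syn, y_syn = make_classification(n_samples=250, n_features=17,
                                   n_informative=5, n_redundant=0,
                                   random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

# Count the non-zero coefficients for increasing values of C
nonzero_counts = {}
for C in [0.001, 0.01, 0.1, 1, 10]:
    m = LogisticRegression(penalty="l1", solver="liblinear",
                           C=C, random_state=0)
    m.fit(X_syn, y_syn)
    nonzero_counts[C] = int(np.count_nonzero(m.coef_))
    print("C =", C, "-> non-zero coefficients:", nonzero_counts[C])
```

With a very small C the penalty dominates and every coefficient is driven to zero; as C grows, more variables enter the model.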

In [24]:
# We adapt code from this kernel: 
# https://www.kaggle.com/vincentlugat/logistic-regression-rfe

# Find best hyperparameters (roc_auc)
random_state = 0
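# Note: this relies on the default solver of the scikit-learn version used here
# (liblinear). From scikit-learn 0.22 on, the default is lbfgs, which does not
# support the l1 penalty, so solver='liblinear' must be passed explicitly there.
clf = LogisticRegression(random_state = random_state)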
param_grid = {'class_weight' : ['balanced'], 
              'penalty' : ['l1'],  
              'C' : [0.0001, 0.0005, 0.001, 
                     0.005, 0.01, 0.05, 0.1, 0.5, 1, 
                     10, 100, 1000, 1500, 2000, 2500, 
                     2600, 2700, 2800, 2900, 3000, 3100, 3200  
                     ], 
              'max_iter' : [100, 1000, 2000, 5000, 10000] }

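# Make an roc_auc scoring object using make_scorer()
# Note: built this way, the scorer is computed from the hard labels returned
# by predict(), not from predicted probabilities
scorer = make_scorer(roc_auc_score)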

grid = GridSearchCV(estimator = clf, param_grid = param_grid , 
                    scoring = scorer, verbose = 1, cv=20,
                    n_jobs = -1)

X = df_train.values

grid.fit(X,y)

print("Best Score:" + str(grid.best_score_))
print("Best Parameters: " + str(grid.best_params_))

best_parameters = grid.best_params_

Fitting 20 folds for each of 110 candidates, totalling 2200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Best Score:0.7501500000000001
Best Parameters: {'C': 0.1, 'class_weight': 'balanced', 'max_iter': 100, 'penalty': 'l1'}


[Parallel(n_jobs=-1)]: Done 2200 out of 2200 | elapsed:    3.0s finished
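A note on the scorer used above: `make_scorer(roc_auc_score)` evaluates the AUC on the class labels returned by `predict`, whereas the built-in `scoring="roc_auc"` string uses the continuous scores of the classifier. The sketch below compares the two on synthetic data; the dataset and the names `clf_demo`, `auc_from_labels` and `auc_from_scores` are assumptions for illustration, and the numbers it prints are unrelated to the results reported in this kernel.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical), not the competition dataset
X_syn, y_syn = make_classification(n_samples=250, n_features=17,
                                   n_informative=5, n_redundant=0,
                                   random_state=0)

clf_demo = LogisticRegression(penalty="l1", solver="liblinear", C=0.1,
                              class_weight="balanced", random_state=0)

# AUC computed from hard 0/1 predictions (what make_scorer does by default)
auc_from_labels = cross_val_score(clf_demo, X_syn, y_syn, cv=5,
                                  scoring=make_scorer(roc_auc_score)).mean()

# AUC computed from continuous scores (built-in 'roc_auc' scorer)
auc_from_scores = cross_val_score(clf_demo, X_syn, y_syn, cv=5,
                                  scoring="roc_auc").mean()

print("AUC from hard labels:      ", auc_from_labels)
print("AUC from continuous scores:", auc_from_scores)
```

Since the competition metric is computed on predicted probabilities, passing `scoring="roc_auc"` to `GridSearchCV` may be the closer match; doing so would change the scores reported above.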


In [25]:
# We get the best model 
best_clf = grid.best_estimator_
print(best_clf)

LogisticRegression(C=0.1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l1', random_state=0,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)


The best model is obtained with C=0.1. Next, we fit this model on the whole training dataset.

In [26]:
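# Same settings as the best estimator; the solver is named explicitly
# (liblinear supports the l1 penalty) so the cell also runs on newer
# scikit-learn versions, where 'warn' is no longer a valid value
model = LogisticRegression(C=0.1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          penalty='l1', random_state=0,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False);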

model.fit(X, y);


The coefficients of the model are shown as follows. 

In [27]:
print(model.coef_)

[[ 0.67397473  0.          0.04381562  0.         -0.13063115  0.
  -0.22047058  0.2367313   0.          0.          0.48645642  0.
   0.          0.          0.          0.         -0.08626306]]
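To see which variables these coefficients belong to, they can be paired with the predictor names. The sketch below copies the coefficient values from the output above and the names from the `predictors` list; `coef_by_var`, `selected` and `ranked` are illustrative names introduced here.

```python
import pandas as pd

# Predictor names, in the same order as the columns used to fit the model
predictors = ["33", "279", "272", "83", "237", "241", "91", "199", "216",
              "19", "65", "141", "70", "243", "137", "26", "90"]

# Coefficients copied from the printed model.coef_ output
coef = [0.67397473, 0.0, 0.04381562, 0.0, -0.13063115, 0.0,
        -0.22047058, 0.2367313, 0.0, 0.0, 0.48645642, 0.0,
        0.0, 0.0, 0.0, 0.0, -0.08626306]

coef_by_var = pd.Series(coef, index=predictors)

# Keep only the variables that survived the L1 penalty
selected = coef_by_var[coef_by_var != 0]

# Order them by absolute size of the coefficient
ranked = selected.reindex(selected.abs().sort_values(ascending=False).index)
print(ranked)
```

In the notebook itself the same series can be built directly with `pd.Series(model.coef_[0], index=predictors)`.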


Only 7 variables have non-zero coefficients, so the resulting model is more parsimonious and, in consequence, less prone to overfitting. Finally, we generate the submission file.

In [None]:
df_test = pd.read_csv("../input/test.csv")
df_test.pop("id");
X = df_test 
X = scaler.transform(X)
df_test = pd.DataFrame(data = X, columns=colnames1)   # df_test is standardized with the training statistics
df_test = df_test[predictors]


X = df_test.values
y_pred = model.predict_proba(X)
y_pred = y_pred[:,1]    

In [None]:
# submit prediction
smpsb_df = pd.read_csv("../input/sample_submission.csv")
smpsb_df["target"] = y_pred
smpsb_df.to_csv("logistic_regression_l2_v1.csv", index=False)