# LASSO FEATURE SELECTION

### The lasso method for variable selection

The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables.
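As a minimal sketch of this preprocessing step, the snippet below codes a categorical regressor with dummy variables via `pandas.get_dummies` and then standardizes every column, dummies included, with `StandardScaler`. The toy data frame and its column names (`class_size`, `school_type`) are hypothetical and are not taken from the dataset used in this notebook. Dropping one dummy level (`drop_first=True`) is an assumption made here, not something the text above prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric and one categorical regressor.
toy = pd.DataFrame({
    "class_size": [20, 25, 30, 25],
    "school_type": ["public", "private", "public", "public"],
})

# Code the categorical regressor with dummy variables.
# drop_first=True removes one level (an assumption of this sketch).
toy_dummies = pd.get_dummies(
    toy, columns=["school_type"], drop_first=True, dtype=float
)

# Standardize all columns to zero mean and unit variance,
# so the l1 penalty treats every regressor on the same scale.
toy_std = pd.DataFrame(
    StandardScaler().fit_transform(toy_dummies),
    columns=toy_dummies.columns,
)
print(toy_std.round(3))
```

After this step each column has mean 0 and (population) standard deviation 1, which is the scale the penalization assumes.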

## LASSO Regression

The Lasso is a linear model that estimates sparse coefficients.

Mathematically, it consists of a linear model trained with an $\ell_1$ penalty as regularizer. The objective function to minimize is:

$$\min_{w}\frac{1}{2n_{samples}} \big|\big|Xw - y\big|\big|_2^2 + \alpha \big|\big|w\big|\big|_1$$

The lasso estimate thus minimizes the least-squares loss plus the penalty term $\alpha \big|\big|w\big|\big|_1$, where $\alpha$ is a constant that controls the strength of the penalty and $\big|\big|w\big|\big|_1$ is the $\ell_1$-norm of the parameter vector $w$.
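To make the objective concrete, the following sketch evaluates it directly with NumPy for two candidate coefficient vectors. The function name `lasso_objective` and the toy arrays are illustrative assumptions, not part of scikit-learn or of this notebook's data.

```python
import numpy as np

def lasso_objective(X, y, w, alpha):
    """Return (1 / (2 n)) * ||Xw - y||_2^2 + alpha * ||w||_1."""
    n_samples = X.shape[0]
    residual = X @ w - y
    least_squares = residual @ residual / (2 * n_samples)
    l1_penalty = alpha * np.abs(w).sum()
    return least_squares + l1_penalty

# Hypothetical toy problem with 4 samples and 2 regressors.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
y_toy = np.array([1.0, 2.0, 3.0, 2.0])

w_exact = np.array([1.0, 2.0])   # reproduces y_toy exactly, larger l1-norm
w_sparse = np.array([1.0, 0.0])  # second coefficient set to zero

obj_exact = lasso_objective(X_toy, y_toy, w_exact, alpha=0.1)
obj_sparse = lasso_objective(X_toy, y_toy, w_sparse, alpha=0.1)
print(obj_exact, obj_sparse)
```

The first vector pays only the penalty term, while the second trades a smaller penalty for a nonzero least-squares term; increasing `alpha` shifts that trade-off toward sparser vectors.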

The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods.
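The link to soft-thresholding can be illustrated in the simplest case. For the objective above with a single regressor $x$ scaled so that $x^\top x / n_{samples} = 1$, setting the subgradient to zero gives the solution $w = S(z, \alpha)$ with $z = x^\top y / n_{samples}$ and $S(z, \alpha) = \operatorname{sign}(z)\max(|z| - \alpha, 0)$. The sketch below implements this operator; the function name and the input values are made up for illustration.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha; values with |z| <= alpha become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# Hypothetical unpenalized coefficients for standardized regressors.
z = np.array([-0.50, -0.02, 0.00, 0.01, 0.40])
shrunk = soft_threshold(z, alpha=0.03)
print(shrunk)
```

Coefficients smaller in magnitude than `alpha` are set exactly to zero, and the rest are pulled toward zero by `alpha`, which is how the Lasso performs shrinkage and selection at once.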

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
seed=42
kf=4

In [3]:
XY_train=pd.read_excel(r'Z:\SecondArticle\X11Y11_train.xlsx')

In [4]:
XY_train=XY_train.drop(['Unnamed: 0', 'AcYear_11', 'AcYear_12'], axis=1)

In [5]:
XY_train.shape

(21934, 121)

In [6]:
X_train=XY_train.iloc[:,:120]
Y_train=XY_train.iloc[:,-1]

In [7]:
Y_train=Y_train.to_numpy()

In [8]:
# standardization
scaler=StandardScaler()

## Lasso Cross Validation

In [None]:
# Scaling is handled by the StandardScaler step, so Lasso needs no normalization of its own.
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('lasso', Lasso(alpha=0.01, fit_intercept=True, copy_X=True, max_iter=10000,
                    tol=0.0001, warm_start=False, positive=False, random_state=seed)),
])

In [None]:
params = {'lasso__alpha':(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4)}

In [None]:
lasso_grid = GridSearchCV(pipeline, params, n_jobs=-1,
                            cv=kf,scoring='neg_mean_absolute_error', verbose=1) 

In [None]:
grid_result=lasso_grid.fit(X_train,Y_train)

In [None]:
best = grid_result.best_estimator_.get_params()

for k in sorted(params.keys()): 
    print('\t{0}: \t {1:.2f}'.format(k, best[k]))

In [None]:
df_lasso_grid_res= pd.DataFrame(grid_result.cv_results_)

In [None]:
df_lasso_grid_res.to_excel('LASSO_RESULTS.xlsx', sheet_name='X11Y11_train')

## Lasso FIT

In [9]:
# alpha = 0.03: the last grid value whose mean_test_score is greater than
# the best mean_test_score minus the std_test_score of the best model
Lasso_rgr = Lasso(alpha=0.03, copy_X=True, fit_intercept=True, max_iter=10000,
                  positive=False, precompute=False,
                  random_state=seed, selection='cyclic', tol=0.0001,
                  warm_start=False)
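The selection rule stated in the comment above can be sketched as a small helper that works on a table shaped like `GridSearchCV.cv_results_`. The helper name `pick_alpha_one_std` and the scores in `toy_results` are hypothetical; they are not the results of this notebook's grid search. The sketch assumes a scorer for which higher is better, as with `neg_mean_absolute_error`.

```python
import pandas as pd

def pick_alpha_one_std(cv_results, param_col="param_lasso__alpha"):
    """Largest alpha whose mean score exceeds best mean minus the best model's std."""
    res = pd.DataFrame(cv_results).sort_values(param_col)
    best_row = res.loc[res["mean_test_score"].idxmax()]
    threshold = best_row["mean_test_score"] - best_row["std_test_score"]
    within = res[res["mean_test_score"] > threshold]
    return float(within[param_col].max())

# Made-up scores for illustration only.
toy_results = {
    "param_lasso__alpha": [0.01, 0.02, 0.03, 0.04],
    "mean_test_score": [-1.00, -1.01, -1.04, -1.20],
    "std_test_score": [0.05, 0.05, 0.05, 0.05],
}
chosen_alpha = pick_alpha_one_std(toy_results)
print(chosen_alpha)
```

Preferring the largest admissible `alpha` favors the sparsest model whose cross-validated score is still close to the best one.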

In [10]:
tscale=scaler.fit(X_train)
X_train_std=tscale.transform(X_train)

In [11]:
Lasso_rgr.fit(X_train_std, Y_train)

Lasso(alpha=0.03, copy_X=True, fit_intercept=True, max_iter=10000,
      normalize=False, positive=False, precompute=False, random_state=42,
      selection='cyclic', tol=0.0001, warm_start=False)

In [12]:
Lasso_coef=pd.DataFrame(Lasso_rgr.coef_.reshape(1,-1), columns=X_train.columns[0:120])

In [13]:
Lasso_coef.T

Unnamed: 0,0
Std_Gender_F,0.382498
N_Retentions,-0.371471
School_Size,0.054262
Class_Size,0.000285
Student_Computer,-0.000000
...,...
Teacher_NoTeachingDedicatedTime,0.000000
Teacher_EducationSupportDedicatedTime,-0.027214
SubjClass_Foreign_Lang,0.683672
SubjClass_Qual,-0.006190


In [14]:
with pd.ExcelWriter('Lasso_coef.xlsx',engine='openpyxl', mode='a') as writer:
    Lasso_coef.to_excel(writer, sheet_name='Coef_11')

#Lasso_coef.to_excel("Lasso_coef.xlsx", sheet_name='Coef_11')

In [15]:
m2=(Lasso_coef == 0).any()

In [16]:
a=Lasso_coef.columns[m2]

In [17]:
a

Index(['Student_Computer', 'Student_Internet', 'Student_ActiveWorking',
       'Student_Parish', 'Student_County', 'STD_Resp_Himself',
       'STD_Resp_LegalResp', 'FTH_Nation_EEUR', 'FTH_Nation_OTHERS',
       'FTH_Nation_RICH', 'SES_STDRESP_ProfClass_UnivI',
       'SES_STDRESP_ProfClass_UnivII',
       'SES_STDRESP_ProfClass_Unknown_NoProfession',
       'SES_FATH_ProfClass_BasicI', 'SES_FATH_ProfClass_Unknown_NoProfession',
       'SES_MOTH_ProfClass_UnivI', 'SES_MOTH_ProfClass_Unknown_NoProfession',
       'SES_STDRESP_JobSit_HomeAffairs', 'SES_STDRESP_JobSit_Other',
       'SES_STDRESP_JobSit_Retired', 'SES_STDRESP_JobSit_Student',
       'SES_STDRESP_JobSit_Unemployed', 'SES_STDRESP_JobSit_Unknown',
       'SES_FATH_JobSit_Employer', 'SES_FATH_JobSit_Other',
       'SES_FATH_JobSit_Retired', 'SES_FATH_JobSit_SelfEmployed',
       'SES_FATH_JobSit_Student', 'SES_FATH_JobSit_Unemployed',
       'SES_FATH_JobSit_Unknown', 'SES_MOTH_JobSit_Employer',
       'SES_MOTH_JobSit_HomeAffa

In [18]:
len(a)

71
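The regressors retained by the Lasso are the complement of this index, namely the columns with a nonzero coefficient. A minimal sketch, using a hypothetical one-row coefficient frame laid out like `Lasso_coef` (the column names `f1` to `f5` and the values are made up):

```python
import pandas as pd

# Hypothetical coefficients in the same one-row layout as Lasso_coef.
toy_coef = pd.DataFrame(
    [[0.38, -0.37, 0.0, -0.0, 0.05]],
    columns=["f1", "f2", "f3", "f4", "f5"],
)

zero_mask = (toy_coef == 0).any()       # True where the coefficient is exactly zero
dropped = toy_coef.columns[zero_mask]   # regressors removed by the Lasso
selected = toy_coef.columns[~zero_mask] # regressors kept by the Lasso

print(list(dropped))
print(list(selected))
```

Note that `-0.0 == 0` evaluates to true, so coefficients printed as `-0.000000` are counted as dropped.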