Data Standardization before modeling #15

twang15 opened this issue Jan 18, 2021 · 12 comments

twang15 commented Jan 18, 2021

Two seemingly conflicting goals: interpretability and feature importance

  1. Much software for multiple linear regression will report standardised coefficients, which are equivalent to the unstandardised coefficients you would get by manually standardising the predictors and the response variable (though the question here concerns standardising only the predictors).
  2. When the original metric is meaningful to the person interpreting the regression equation, unstandardised coefficients are often more informative.
  3. You can always convert standardised coefficients to unstandardised coefficients if you know the mean and standard deviation of the predictor variable in the original sample (a minimal sketch of the conversion follows below).
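
A minimal sketch of that conversion (variable names and data are hypothetical), assuming both the predictors and the response were z-scored before fitting:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 50.0], scale=[2.0, 5.0], size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(size=200)

# Fit on z-scored predictors and response; no intercept column is needed
# because centered data forces the fitted intercept to zero.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
beta_std = np.linalg.lstsq(Xz, yz, rcond=None)[0]

# Convert the standardized coefficients back to the original scale.
beta = beta_std * y.std() / X.std(axis=0)
intercept = y.mean() - beta @ X.mean(axis=0)
print(intercept, beta)  # recovers roughly 3.0 and [1.5, -0.4]
```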

twang15 commented Jan 18, 2021

Use correlations and semipartial correlations between the outcome variable and the predictors as measures of feature importance:
http://jeromyanglim.blogspot.com/2009/09/variable-importance-and-multiple.html
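
For illustration, a minimal NumPy sketch (the function name is hypothetical) of the semipartial correlation of each predictor with the outcome, i.e. the correlation between y and the part of each predictor not explained by the other predictors:

```python
import numpy as np

def semipartial_correlations(X, y):
    """Correlation of y with each predictor after residualizing that
    predictor on the remaining predictors (plus an intercept)."""
    n, p = X.shape
    result = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        residual = X[:, j] - others @ coef  # part of x_j unique to x_j
        result.append(np.corrcoef(y, residual)[0, 1])
    return np.array(result)
```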


twang15 commented Jan 18, 2021

In regression, it is often recommended to center the variables so that the predictors have mean 0. This makes it easier to interpret the intercept term: it becomes the expected value of Y_i when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of Y_i when the predictors are set to 0, which may not be a realistic or interpretable situation (e.g. what if the predictors were height and weight?). Note that centering/scaling does not affect statistical inference in regression models: the estimates are adjusted appropriately and the p-values will be the same.

Other situations where centering and/or scaling may be useful:

  • When you're trying to sum or average variables that are on different scales, perhaps to create a composite score of some kind. Without scaling, one variable may have a larger impact on the sum purely because of its scale, which may be undesirable.
  • To simplify calculations and notation. For example, the sample covariance matrix of a data matrix whose columns have been centered by their sample means is simply X'X (up to a factor of 1/(n−1)). Similarly, if a univariate random variable X has been mean-centered, then var(X) = E(X²), and the variance can be estimated from a sample by the sample mean of the squared observed values.
  • Related to the above, PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.

Note that scaling is not necessary in the last two bullet points, and centering may not be necessary in the first, so the two do not need to go hand in hand at all times.
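
A quick statsmodels sketch of the inference point above (data are synthetic): centering changes only the intercept's interpretation, not the slopes or their p-values:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(loc=[170.0, 70.0], scale=[10.0, 12.0], size=(100, 2))  # e.g. height, weight
y = 5.0 + 0.3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=100)

raw = sm.OLS(y, sm.add_constant(X)).fit()
centered = sm.OLS(y, sm.add_constant(X - X.mean(axis=0))).fit()

print(raw.params[1:], centered.params[1:])    # slopes: identical
print(raw.pvalues[1:], centered.pvalues[1:])  # p-values: identical
print(centered.params[0])                     # intercept: E[y] at mean predictor values
```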


twang15 commented Jan 18, 2021

https://stats.stackexchange.com/questions/342140/standardization-of-continuous-variables-in-binary-logistic-regression?rq=1

  1. You don't need to standardize for plain (unregularized) logistic regression, as long as you keep the units in mind when interpreting the coefficients.
  2. Standardizing can help with interpreting feature importance, because the coefficients are then apples to apples (i.e. if two standardized continuous variables have coefficients of 0.01 and 0.7, you know that the second one is much more important).
  3. For regularized logistic regression, continuous variables should be standardized for best results (see the sketch below).
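
A sketch of point 2 using scikit-learn (the dataset choice is arbitrary): standardizing inside a pipeline makes coefficient magnitudes comparable across features. `penalty=None` assumes scikit-learn ≥ 1.2; older versions spell it `penalty='none'`:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Unregularized logistic regression on z-scored features.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty=None, max_iter=5000))
model.fit(X, y)

coefs = model[-1].coef_.ravel()
top = np.argsort(np.abs(coefs))[::-1][:5]
print(top, coefs[top])  # largest standardized coefficients = most influential
```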


twang15 commented Jan 18, 2021

https://stats.stackexchange.com/questions/86434/is-standardisation-before-lasso-really-necessary

  • In general, you do not need to center or standardize your data for multiple regression.
  • One thing that people sometimes say is that if you have standardized your variables first, you can then interpret the betas as measures of importance. For instance, if β1 = 0.6 and β2 = 0.3, then the first explanatory variable is twice as important as the second. While this idea is appealing, unfortunately it is not valid. There are several issues, but perhaps the easiest to follow is that you have no way to control for possible range restrictions in the variables. Inferring the 'importance' of different explanatory variables relative to each other is a very tricky philosophical issue.
  • The only case I can think of off the top of my head where centering is helpful is before creating power terms. Let's say you have a variable X that ranges from 1 to 2, but you suspect a curvilinear relationship with the response variable, and so you want to create an X² term. If you don't center X first, the squared term will be highly correlated with X, which could muddy the estimation of the beta. Centering first addresses this issue.
  • If an interaction / product term is created from two variables that are not centered on 0, some amount of collinearity will be induced (with the exact amount depending on various factors). Centering first addresses this potential problem (a short demonstration follows this list).
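
The demonstration mentioned above, as a minimal NumPy sketch for the X vs. X² case (the same logic applies to product terms):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 2.0, size=1000)  # a variable ranging from 1 to 2

print(np.corrcoef(x, x**2)[0, 1])     # ~0.99+: x and x^2 nearly collinear

xc = x - x.mean()                     # center first, then square
print(np.corrcoef(xc, xc**2)[0, 1])   # near 0 for a roughly symmetric x
```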

Standardizing is needed when using regularization:

  • The least squares estimators of β1, β2, … are not affected by shifting. The reason is that these are the slopes of the fitting surface: how much the surface changes if you change x1, x2, … by one unit. This does not depend on location. (The estimator of β0, however, does.)
  • In case you use gradient descent to fit your model, standardizing covariates may speed up convergence (because when you have unscaled covariates, the corresponding parameters may inappropriately dominate the gradient).
  • Also, for some applications of SVMs, scaling may improve predictive performance: see "Feature scaling in support vector data description".
  • Numerical stability is an algorithm-related reason to center and/or scale data.
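
On the regularization point above, a minimal scikit-learn sketch (the alpha value is arbitrary): putting the scaler and the Lasso in one pipeline keeps the L1 penalty acting on comparable-scale coefficients:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Standardize, then apply the L1 penalty on comparable-scale coefficients.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.5))
model.fit(X, y)
print(model[-1].coef_)  # the penalty shrinks some coefficients toward (possibly exactly) zero
```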



twang15 commented Jan 18, 2021

Summary:

  1. Feature importance: Z-score transformation is needed for comparing feature importance (even though this common practice is questionable, it is how it is typically done).
  2. Feature selection: fitting the model iteratively and eliminating the features with relatively small coefficients (the less important features) also requires Z-score transformation.
  3. Standardized and unstandardized coefficients can be converted to each other using the mean and standard deviation of each feature in the training dataset (when z-score transformation was used to standardize the training data).
  4. Z-score transformation / centering / min-max normalization does not affect statistical inference in multiple linear regression (ordinary least squares) or logistic regression; estimates and p-values adjust accordingly.
  5. Z-score transformation is required for LASSO / Ridge regression.
  6. Centering can eliminate the multicollinearity between X and its quadratic term X².
  7. Centering does not necessarily require subtracting the sample mean from each observation; the subtracted quantity can be any other meaningful value, such as the mean of a contrast group.
  8. Z-score transformation / normalization of dummy variables is unnecessary for linear / logistic regression.
  9. In medical/financial modeling, z-score transformation hurts model interpretation, but standardized coefficients can be converted back to unstandardized coefficients as stated in 3.
  10. Normalization: min-max normalization, which transforms inputs into the [0, 1] range.
  11. Standardization: z-score transformation, (x − μ) / σ.
  • "Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data, like K-nearest neighbors and neural networks.
  • Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Also, unlike normalization, standardization does not have a bounding range. So, even if you have outliers in your data, they will not be affected by standardization." (https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/, Feature Scaling _ Standardization Vs Normalization.pdf)
  • However, at the end of the day, the choice of normalization or standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule that tells you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized, and standardized data and compare the performance.
  • It is good practice to fit the scaler on the training data and then use it to transform the testing data, which avoids data leakage during model testing (see the sketch below). Also, scaling the target values is generally not required.
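
The leakage-free scaling practice from the last bullet, as a minimal scikit-learn sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)    # learn mean/std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse training statistics: no leakage
```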

twang15 pinned this issue Mar 20, 2021

twang15 commented Apr 5, 2021

Importance of Feature Scaling

  • PCA
  • SVM
  • K-nearest Neighbours
  • Logistic regression (really?)
  • While many algorithms (such as SVM, K-nearest neighbors, and logistic regression) require features to be normalized, intuitively we can think of Principal Component Analysis (PCA) as a prime example of when normalization is important. In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance corresponds more closely with the 'weight' axis when those features are not scaled. Since a change in height of one meter can be considered much more important than a change in weight of one kilogram, this is clearly incorrect.
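
A minimal sketch of the PCA point (the wine data is just a convenient example with one large-scale feature, proline): compare the variance captured with and without standardization:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

raw = PCA(n_components=2).fit(X)
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

# Unscaled: the first component is dominated by the largest-scale feature.
print(raw.explained_variance_ratio_)
print(scaled[-1].explained_variance_ratio_)
```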
