Data Standardization before modeling #15

twang15 opened this issue Jan 18, 2021 · 12 comments

twang15 commented Jan 18, 2021

Two seemingly conflicting goals: interpretability and feature importance

  1. Much software for multiple linear regression will report standardised coefficients, which are equivalent to the unstandardised coefficients you would get by manually standardising the predictors and the response variable (though the question here concerns standardising only the predictors).
  2. When the original metric is meaningful to the person interpreting the regression equation, unstandardised coefficients are often more informative.
  3. You can always convert standardised coefficients to unstandardised coefficients if you know the mean and standard deviation of the predictor variable in the original sample (a minimal sketch of the conversion follows below).
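
A minimal sketch of that conversion (variable names and data are hypothetical), assuming both the predictors and the response were z-scored before fitting:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 50.0], scale=[2.0, 5.0], size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(size=200)

# Fit on z-scored predictors and response; no intercept column is needed
# because centered data forces the fitted intercept to zero.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
beta_std = np.linalg.lstsq(Xz, yz, rcond=None)[0]

# Convert the standardized coefficients back to the original scale.
beta = beta_std * y.std() / X.std(axis=0)
intercept = y.mean() - beta @ X.mean(axis=0)
print(intercept, beta)  # recovers roughly 3.0 and [1.5, -0.4]
```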

twang15 commented Jan 18, 2021

Use correlations and semipartial correlations between the outcome variable and the predictors as measures of feature importance:
http://jeromyanglim.blogspot.com/2009/09/variable-importance-and-multiple.html
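
For illustration, a minimal NumPy sketch (the function name is hypothetical) of the semipartial correlation of each predictor with the outcome, i.e. the correlation between y and the part of each predictor not explained by the other predictors:

```python
import numpy as np

def semipartial_correlations(X, y):
    """Correlation of y with each predictor after residualizing that
    predictor on the remaining predictors (plus an intercept)."""
    n, p = X.shape
    result = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        residual = X[:, j] - others @ coef  # part of x_j unique to x_j
        result.append(np.corrcoef(y, residual)[0, 1])
    return np.array(result)
```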


twang15 commented Jan 18, 2021

In regression, it is often recommended to center the variables so that the predictors have mean 0. This makes it easier to interpret the intercept term: it becomes the expected value of Y_i when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of Y_i when the predictors are set to 0, which may not be a realistic or interpretable situation (e.g. what if the predictors were height and weight?). Note that centering/scaling does not affect statistical inference in regression models: the estimates are adjusted appropriately and the p-values will be the same.

Other situations where centering and/or scaling may be useful:

  • When you're trying to sum or average variables that are on different scales, perhaps to create a composite score of some kind. Without scaling, one variable may have a larger impact on the sum purely because of its scale, which may be undesirable.
  • To simplify calculations and notation. For example, the sample covariance matrix of a data matrix whose columns have been centered by their sample means is simply X'X (up to a factor of 1/(n−1)). Similarly, if a univariate random variable X has been mean-centered, then var(X) = E(X²), and the variance can be estimated from a sample by the sample mean of the squared observed values.
  • Related to the above, PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.

Note that scaling is not necessary in the last two bullet points, and centering may not be necessary in the first, so the two do not need to go hand in hand at all times.
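
A quick statsmodels sketch of the inference point above (data are synthetic): centering changes only the intercept's interpretation, not the slopes or their p-values:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(loc=[170.0, 70.0], scale=[10.0, 12.0], size=(100, 2))  # e.g. height, weight
y = 5.0 + 0.3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=100)

raw = sm.OLS(y, sm.add_constant(X)).fit()
centered = sm.OLS(y, sm.add_constant(X - X.mean(axis=0))).fit()

print(raw.params[1:], centered.params[1:])    # slopes: identical
print(raw.pvalues[1:], centered.pvalues[1:])  # p-values: identical
print(centered.params[0])                     # intercept: E[y] at mean predictor values
```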


twang15 commented Jan 18, 2021

https://stats.stackexchange.com/questions/342140/standardization-of-continuous-variables-in-binary-logistic-regression?rq=1

  1. You don't need to standardize for plain (unregularized) logistic regression, as long as you keep the units in mind when interpreting the coefficients.
  2. Standardizing can help with interpreting feature importance, because the coefficients are then apples to apples (i.e. if two standardized continuous variables have coefficients of 0.01 and 0.7, you know that the second one is much more important).
  3. For regularized logistic regression, continuous variables should be standardized for best results (see the sketch below).
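
A sketch of point 2 using scikit-learn (the dataset choice is arbitrary): standardizing inside a pipeline makes coefficient magnitudes comparable across features. `penalty=None` assumes scikit-learn ≥ 1.2; older versions spell it `penalty='none'`:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Unregularized logistic regression on z-scored features.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty=None, max_iter=5000))
model.fit(X, y)

coefs = model[-1].coef_.ravel()
top = np.argsort(np.abs(coefs))[::-1][:5]
print(top, coefs[top])  # largest standardized coefficients = most influential
```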


twang15 commented Jan 18, 2021

https://stats.stackexchange.com/questions/86434/is-standardisation-before-lasso-really-necessary

  • In general, you do not need to center or standardize your data for multiple regression.
  • One thing that people sometimes say is that if you have standardized your variables first, you can then interpret the betas as measures of importance. For instance, if β1 = 0.6 and β2 = 0.3, then the first explanatory variable is twice as important as the second. While this idea is appealing, unfortunately it is not valid. There are several issues, but perhaps the easiest to follow is that you have no way to control for possible range restrictions in the variables. Inferring the 'importance' of different explanatory variables relative to each other is a very tricky philosophical issue.
  • The only case I can think of off the top of my head where centering is helpful is before creating power terms. Let's say you have a variable X that ranges from 1 to 2, but you suspect a curvilinear relationship with the response variable, and so you want to create an X² term. If you don't center X first, the squared term will be highly correlated with X, which could muddy the estimation of the beta. Centering first addresses this issue.
  • If an interaction / product term is created from two variables that are not centered on 0, some amount of collinearity will be induced (with the exact amount depending on various factors). Centering first addresses this potential problem (a short demonstration follows this list).
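
The demonstration mentioned above, as a minimal NumPy sketch for the X vs. X² case (the same logic applies to product terms):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 2.0, size=1000)  # a variable ranging from 1 to 2

print(np.corrcoef(x, x**2)[0, 1])     # ~0.99+: x and x^2 nearly collinear

xc = x - x.mean()                     # center first, then square
print(np.corrcoef(xc, xc**2)[0, 1])   # near 0 for a roughly symmetric x
```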

Standardizing is needed when using regularization:

  • The least squares estimators of β1, β2, … are not affected by shifting. The reason is that these are the slopes of the fitting surface: how much the surface changes if you change x1, x2, … by one unit. This does not depend on location. (The estimator of β0, however, does.)
  • In case you use gradient descent to fit your model, standardizing covariates may speed up convergence (because when you have unscaled covariates, the corresponding parameters may inappropriately dominate the gradient).
  • Also, for some applications of SVMs, scaling may improve predictive performance: see "Feature scaling in support vector data description".
  • Numerical stability is an algorithm-related reason to center and/or scale data.
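
On the regularization point above, a minimal scikit-learn sketch (the alpha value is arbitrary): putting the scaler and the Lasso in one pipeline keeps the L1 penalty acting on comparable-scale coefficients:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Standardize, then apply the L1 penalty on comparable-scale coefficients.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.5))
model.fit(X, y)
print(model[-1].coef_)  # the penalty shrinks some coefficients toward (possibly exactly) zero
```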



twang15 commented Jan 18, 2021

Summary:

  1. Feature importance: Z-score transformation is needed for comparing feature importance (even though this common practice is questionable, it is how it is typically done).
  2. Feature selection: fitting the model iteratively and eliminating the features with relatively small coefficients (the less important features) also requires Z-score transformation.
  3. Standardized and unstandardized coefficients can be converted to each other using the mean and standard deviation of each feature in the training dataset (when z-score transformation was used to standardize the training data).
  4. Z-score transformation / centering / min-max normalization does not affect statistical inference in multiple linear regression (ordinary least squares) or logistic regression; estimates and p-values adjust accordingly.
  5. Z-score transformation is required for LASSO / Ridge regression.
  6. Centering can eliminate the multicollinearity between X and its quadratic term X².
  7. Centering does not necessarily require subtracting the sample mean from each observation; the subtracted quantity can be any other meaningful value, such as the mean of a contrast group.
  8. Z-score transformation / normalization of dummy variables is unnecessary for linear / logistic regression.
  9. In medical/financial modeling, z-score transformation hurts model interpretation, but standardized coefficients can be converted back to unstandardized coefficients as stated in 3.
  10. Normalization: min-max normalization, which transforms inputs into the [0, 1] range.
  11. Standardization: z-score transformation, (x − μ) / σ.
  • "Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data, like K-nearest neighbors and neural networks.
  • Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Also, unlike normalization, standardization does not have a bounding range. So, even if you have outliers in your data, they will not be affected by standardization." (https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/, Feature Scaling _ Standardization Vs Normalization.pdf)
  • However, at the end of the day, the choice of normalization or standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule that tells you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized, and standardized data and compare the performance.
  • It is good practice to fit the scaler on the training data and then use it to transform the testing data, which avoids data leakage during model testing (see the sketch below). Also, scaling the target values is generally not required.
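
The leakage-free scaling practice from the last bullet, as a minimal scikit-learn sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)    # learn mean/std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse training statistics: no leakage
```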

twang15 pinned this issue Mar 20, 2021

twang15 commented Apr 5, 2021

Importance of Feature Scaling

  • PCA
  • SVM
  • K-nearest Neighbours
  • Logistic regression (really?)
  • While many algorithms (such as SVM, K-nearest neighbors, and logistic regression) require features to be normalized, intuitively we can think of Principal Component Analysis (PCA) as a prime example of when normalization is important. In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance corresponds more closely with the 'weight' axis when those features are not scaled. Since a change in height of one meter can be considered much more important than a change in weight of one kilogram, this is clearly incorrect.
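
A minimal sketch of the PCA point (the wine data is just a convenient example with one large-scale feature, proline): compare the variance captured with and without standardization:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

raw = PCA(n_components=2).fit(X)
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

# Unscaled: the first component is dominated by the largest-scale feature.
print(raw.explained_variance_ratio_)
print(scaled[-1].explained_variance_ratio_)
```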
