# Chapter 03 - Data Pre-processing

* Data pre-processing: e.g. addition, deletion, or transformation of training set data
* the need for pre-processing is determined by the type of model being used
* _unsupervised_ vs. _supervised_ approaches to data processing: whether the outcome variable is considered (supervised) or not (unsupervised)
* _feature engineering_: how predictors are encoded

## 3.2 Data Transformations for Individual Predictors

### Centering and Scaling

* **center** a predictor by subtracting the mean value of the predictor from all values
    * results in a mean of '0'
* **scale** a predictor by dividing each value by the predictor's standard deviation
    * results in a standard deviation of '1'
* benefit: improves numerical stability for some calculations
* negative: lose some interpretability in that values are no longer in "original" units
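
As a minimal sketch of the two steps above, the following uses only the Python standard library. The function name `center_and_scale` and the sample data are hypothetical, chosen for illustration.

```python
from statistics import mean, stdev

def center_and_scale(values):
    """Return values shifted to mean 0 and rescaled to standard deviation 1."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("cannot scale a zero-variance predictor")
    return [(v - mu) / sd for v in values]

# Hypothetical predictor measured in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = center_and_scale(heights_cm)
print(z)  # values are now unitless "standard deviations from the mean"
```

In practice the mean and standard deviation would be estimated on the training set and then reused unchanged for new samples.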

### Transformations to Resolve Skewness

* un-skewed: distribution is roughly symmetric; probability of falling on either side of the mean is roughly equal
* right-skewed: most points fall on the left (low) side of the distribution, with a long tail to the right
* left-skewed: the mirror image; most points fall on the right (high) side, with a long tail to the left
* skewness statistic:

\begin{equation}
skewness = \frac{\sum{(x_i - \bar{x})^3}}{(n - 1)v^{3/2} }
\end{equation}

where 
\begin{equation}
v = \frac{\sum{(x_i - \bar{x})^2}}{(n-1)}
\end{equation}

* if the distribution is symmetric, the skewness statistic is close to 0
* right-skewed distributions give positive values that grow with the degree of skew
* left-skewed distributions give negative values

Some possible approaches to remove skewness:

* replace data with log, square root, or inverse
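
The skewness statistic defined above, and the effect of a log transform on it, can be sketched as follows. The function name `skewness` and the data are hypothetical; the data were picked so that their logs are evenly spaced.

```python
import math

def skewness(values):
    """Sample skewness using the (n - 1) denominators from the formula above."""
    n = len(values)
    x_bar = sum(values) / n
    v = sum((x - x_bar) ** 2 for x in values) / (n - 1)
    return sum((x - x_bar) ** 3 for x in values) / ((n - 1) * v ** 1.5)

# Hypothetical right-skewed predictor: most values are small, a few are large.
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(x) for x in raw]

print(skewness(raw))     # positive: right-skewed
print(skewness(logged))  # approximately 0: logs are evenly spaced, hence symmetric
```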


### Empirical Identification of Transformation

* Box-Cox family of transformations, indexed by a parameter $\lambda$
* estimate $\lambda$ from the training data via maximum likelihood estimation

\begin{equation}
    x^* =
    \begin{cases}
      \frac{x^\lambda - 1}{\lambda{}}, & \text{if}\ \lambda{} \ne 0 \\
      \log{x}, & \text{if}\ \lambda{} = 0
    \end{cases}
\end{equation}

* only if all $x > 0$
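
A rough sketch of this idea follows. It assumes the usual profile log-likelihood for the transform under a normality assumption, and it searches a coarse grid of $\lambda$ values from -2 to 2; the grid, the function names, and the data are all illustrative assumptions rather than part of the notes.

```python
import math

def box_cox(x, lam):
    """Apply the transform above to a single strictly positive value."""
    if x <= 0:
        raise ValueError("the transform requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

def log_likelihood(values, lam):
    """Profile log-likelihood of lambda, assuming the transformed data are normal."""
    n = len(values)
    t = [box_cox(x, lam) for x in values]
    t_bar = sum(t) / n
    var = sum((ti - t_bar) ** 2 for ti in t) / n  # maximum likelihood variance
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(x) for x in values)

def estimate_lambda(values, grid=None):
    """Pick the grid value of lambda with the highest log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # assumed search range: -2.0 to 2.0
    return max(grid, key=lambda lam: log_likelihood(values, lam))

raw = [1, 2, 4, 8, 16, 32, 64]  # hypothetical right-skewed predictor
best = estimate_lambda(raw)
transformed = [box_cox(x, best) for x in raw]
print(best)
```

A grid search keeps the sketch short; a numerical optimizer would give a finer estimate of $\lambda$.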

## 3.3 Data Transformations for Multiple Predictors

### Outliers

* are any "scientifically" invalid (e.g. negative blood pressure, etc.)?
* are any due to data recording errors?
* care must be taken not to hastily remove or change
    * in small samples, they may be an indication of skewed data
    * they may be a special part of the population under study, or a different population altogether
* some models are resistant to outliers: e.g. trees, classification SVMs
* potential transform to resolve outliers: **spatial sign**
    * projects predictor values onto a multidimensional sphere
    * makes all samples the same distance from the center of that sphere
    
\begin{equation}
x^*_{ij} = \frac{ x_{ij}  }{  \sqrt{  \sum_{j = 1}^{P}{x_{ij}^2}  }  }
\end{equation}

* important to center and scale predictor values before using spatial sign transform
* removing predictors after transforming with spatial sign may be problematic
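
The spatial sign formula above amounts to dividing each sample by its Euclidean norm. A minimal sketch, assuming the predictors have already been centered and scaled (the function name and sample values are hypothetical):

```python
import math

def spatial_sign(sample):
    """Project one sample (a list of predictor values) onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in sample))
    if norm == 0:
        return list(sample)  # the origin has no direction; left unchanged here
    return [x / norm for x in sample]

typical = [0.5, -0.5]
outlier = [30.0, 40.0]  # far from the center before the transform

print(spatial_sign(typical))
print(spatial_sign(outlier))  # same distance from the center as every other sample
```

Because the denominator uses all predictors of a sample together, dropping a predictor afterwards leaves the remaining values scaled by a norm that no longer matches them.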

### Data Reduction and Feature Extraction

* generate smaller number of predictors that attempt to capture the majority of the information in the original variables
* "signal extraction" or "feature extraction"

#### PCA

* Seeks to find **Principal Components**: linear combinations of the predictors capturing the most possible variance
* the 1st PC captures the most variance
* the next PC captures the most remaining variance AND is _uncorrelated_ with all previous PCs
* and so on...

\begin{equation}
PC_j = (a_{j1} \times \text{Predictor 1}) + (a_{j2} \times \text{Predictor 2}) + \dots + (a_{jP} \times \text{Predictor P})
\end{equation}

* primary advantage: PCs are _uncorrelated_
* BUT, PCA can also generate components that summarize irrelevant characteristics of the data
* because PCA seeks high variability, it's influenced by scale & skewness
    * pre-PCA, transform skewed predictors
    * pre-PCA, center and scale
* if the predictor -> response relationship is not connected to the predictors' variability, consider a supervised technique (e.g. PLS) rather than an unsupervised technique like PCA
* to determine the number of PCs to retain, use a scree plot
    * select component number immediately before variation tapers off
    * alternative approach is to use cross-validation to identify optimal cutoff
* visually examining PCs
    * plot PCs against each other
    * use symbology to indicate classes/groupings
    * might demonstrate clusters or outliers
    * might demonstrate potential separation of classes
    * might show significant overlap if no clear separation
    * use care in setting the scale used to visualize
        * avoid over-emphasis of variation in later PCs
* loadings can provide details about predictor contribution to PCs
    * loadings closer to 0 indicate a lower contribution of that predictor into the PC
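
To make the loadings and the "variance captured" idea concrete, here is a small sketch that finds only the first PC. It uses power iteration on the correlation matrix, which is one simple way to get the leading eigenvector and is a substitute for a full PCA routine. All names and data are hypothetical, and the starting vector is assumed not to be orthogonal to the first PC.

```python
import math
from statistics import mean, stdev

def standardize(column):
    """Center and scale one predictor, as recommended before PCA."""
    mu, sd = mean(column), stdev(column)
    return [(v - mu) / sd for v in column]

def correlation_matrix(columns):
    """Covariance matrix of the standardized predictors (i.e. correlations)."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def first_principal_component(columns, iterations=200):
    """Return (loadings, share of total variance) for the first PC."""
    m = correlation_matrix(columns)
    p = len(m)
    a = [float(i + 1) for i in range(p)]  # arbitrary asymmetric starting vector
    for _ in range(iterations):
        a = [sum(m[i][j] * a[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in a))
        if norm == 0:
            raise ValueError("starting vector is orthogonal to the data's variation")
        a = [v / norm for v in a]
    eigenvalue = sum(a[i] * sum(m[i][j] * a[j] for j in range(p)) for i in range(p))
    return a, eigenvalue / p  # total variance of p standardized predictors is p

# Hypothetical predictors: the first two move together, the third does not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8]
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]
loadings, share = first_principal_component([x1, x2, x3])
print(loadings, share)
```

Later components could be found the same way after removing the first component's contribution from the matrix, but a linear algebra library is the more practical route.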

## 3.4 Dealing with Missing Values

* _structurally missing_: (e.g. number of children a man has given birth to)
* or, value cannot or was not determined at time of model building
* important to understand _why_ values are missing
    * is pattern of missing data related to outcome? 
    * **informative missingness**; can introduce significant bias in the model
    * "Napoleon Dynamite Effect": majority of customer ratings were at extremes (positive and negative) - only those with strong opinions provided ratings
    * **censored data**: exact value is missing, but _something_ is known of its value
    * for predictive modeling, common to treat censored data either as missing data or as the observed quantity
* missing data tends to be related to predictor variables rather than the samples, but could be either
    * if isolated to specific predictors, these could be removed if percentage missing is high enough
    * if concentrated in specific samples:
        * in large datasets, removal of these samples may be un-problematic
        * for smaller sets, cost of removal is higher
* some modeling approaches can handle missing data: (e.g. trees, etc.)
* for others, consider **imputation** of missing data:
    * effectively predicting missing _predictors_ from other predictor values
    * adds uncertainty
    * if using resampling to select tuning parameter values or estimate performance, imputation should be incorporated within the resampling
* are there correlations between a predictor with missing values and other predictors with few missing values?
* another approach is to use K-nearest neighbors to find "nearby points" without missing values to be used for imputation
    * one advantage of this approach is that the imputed values are confined to the range of the training set values
    * one disadvantage is that it requires the entire training set to impute a missing value
    * $K$ is a tuning parameter, as is the distance metric
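
A minimal sketch of K-nearest-neighbor imputation, under several simplifying assumptions: Euclidean distance over the predictors that are observed in the incomplete sample, only complete training rows as candidate neighbors, the mean of the neighbors as the imputed value, and predictors already on comparable scales. The function name and data are hypothetical.

```python
import math

def knn_impute(sample, training, target, k=3):
    """Impute sample[target] from the k nearest complete training rows."""
    # Predictors we can actually compare on: observed in the incomplete sample.
    observed = [j for j, v in enumerate(sample) if v is not None and j != target]
    # For simplicity, only training rows without missing values are candidates.
    complete = [row for row in training if all(v is not None for v in row)]

    def distance(row):
        return math.sqrt(sum((row[j] - sample[j]) ** 2 for j in observed))

    neighbors = sorted(complete, key=distance)[:k]
    return sum(row[target] for row in neighbors) / len(neighbors)

training = [
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [5.2, 4.9, 54.0],
]
incomplete = [1.05, 1.0, None]  # third predictor is missing
imputed = knn_impute(incomplete, training, target=2, k=2)
print(imputed)
```

Because the imputed value is an average of training values, it cannot fall outside the range seen in the training set, which matches the advantage noted above.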

## 3.5 Removing Predictors

* fewer predictors means decreased computational time and complexity
* in case of correlated predictors, removing one should maintain similar performance, while improving simplicity and interpretability
* zero variance predictors (i.e. with a single value) can cause problems, and can easily be discarded
* near-zero variance predictors: predictors that have only a handful of unique values that occur with very low frequencies:
    * the fraction of unique values over the sample size is low (e.g. 10%)
    * the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20)
    * if both are true, and model is susceptible to this type of predictor, consider removing the variable    
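
The two criteria can be checked with a few lines of Python. The cutoffs of 10% and 20 come from the examples above; treating them as strict inequalities, and the function name itself, are assumptions of this sketch.

```python
from collections import Counter

def is_near_zero_variance(values, unique_cut=0.10, freq_cut=20.0):
    """Flag a predictor that is zero-variance or meets both criteria above."""
    counts = Counter(values).most_common()  # (value, frequency), most frequent first
    if len(counts) == 1:
        return True  # a single unique value: zero variance
    fraction_unique = len(counts) / len(values)
    freq_ratio = counts[0][1] / counts[1][1]
    return fraction_unique < unique_cut and freq_ratio > freq_cut

mostly_zero = [0] * 98 + [1] * 2   # 2% unique values, frequency ratio 49
balanced = [0] * 50 + [1] * 50     # 2% unique values, frequency ratio 1

print(is_near_zero_variance(mostly_zero))
print(is_near_zero_variance(balanced))
```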

### Between-Predictor Correlations

* **Collinearity**: when a pair of predictors has a substantial correlation
* **Multicollinearity**: collinearity among multiple predictors at once
* Visualizing the correlation matrix (e.g. as a heat map): visual technique for identifying the presence of collinearity
* PCA can be used as a non-visual technique to characterize the magnitude of collinearity
    * e.g. if first PC accounts for large percentage of variance, this implies there is at least one group of predictors that represent the same information
    * loadings can be used to identify _which_ predictors are within those groups
* Reasons to avoid collinearity:
    * redundant predictors add more complexity
    * for some techniques (e.g. linear regression) they can result in unstable models, numerical errors, and degraded predictive performance
* variance inflation factor (VIF) can be used to identify predictors that are impacted by collinearity in the context of linear regression
    * drawbacks of using VIF beyond linear regression:
        * developed for linear models
        * requires # samples > # predictors
        * doesn't indicate which predictors should be removed
        
Heuristic for dealing with multicollinearity: Remove the minimum number of predictors to ensure that all pairwise correlations are below a certain threshold:

* calculate correlation matrix
* identify largest absolute, pairwise correlation
* for each of the two predictors identified, determine its average correlation with all _other_ predictors
* remove the predictor with the larger average correlation
* iterate until there are no absolute pairwise correlations above the threshold
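
The steps above can be sketched as follows. The threshold of 0.75, the dictionary-of-lists input format, and the function names are assumptions for illustration; the sketch also assumes no predictor has zero variance.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_correlation(name, names, corr):
    """Mean absolute correlation of one predictor with the remaining ones."""
    others = [o for o in names if o != name]
    return sum(corr[(name, o)] for o in others) / len(others)

def filter_correlated(predictors, threshold=0.75):
    """predictors: dict of name -> values. Returns the names to keep."""
    keep = list(predictors)
    corr = {(a, b): abs(pearson(predictors[a], predictors[b]))
            for a in keep for b in keep if a != b}
    while len(keep) > 1:
        pairs = [(a, b) for i, a in enumerate(keep) for b in keep[i + 1:]]
        a, b = max(pairs, key=lambda pair: corr[pair])  # largest absolute correlation
        if corr[(a, b)] <= threshold:
            break  # nothing left above the threshold
        if average_correlation(a, keep, corr) >= average_correlation(b, keep, corr):
            keep.remove(a)
        else:
            keep.remove(b)
    return keep

data = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.1],   # nearly a copy of "a"
    "c": [5, 1, 4, 2, 3],
}
kept = filter_correlated(data)
print(kept)
```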

PCA (and other feature extraction methods) can also be used in response to the presence of collinearities, at the cost of adding complexity to the predictor-response relationship.

## 3.6 Adding Predictors

* Categoricals: encoded as **dummy variables**
    * each category gets its own bit variable (0/1)
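    * for models with an intercept (e.g. linear regression), only n-1 (n = # of categories) variables are included: the full set of n always sums to 1, duplicating the intercept and causing numerical issues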
    * for techniques not affected by this, including n variables helps from an interpretability standpoint
* In some cases, adding predictors can help the predictive performance of a simpler modeling technique, avoiding the need to use a more complex technique.
    * e.g. Adding non-linear (i.e. non-first-order) terms to a linear regression model.    
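
A minimal sketch of the dummy-variable encoding described above. The `is_` column prefix, the `drop_first` flag, and the choice to drop the alphabetically first category are all hypothetical conventions of this example.

```python
def dummy_encode(values, drop_first=False):
    """Encode a categorical predictor as 0/1 columns, one dict per sample."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # n - 1 columns; the dropped level is the baseline
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

colors = ["red", "green", "blue", "green"]

full = dummy_encode(colors)                      # n columns per sample
reduced = dummy_encode(colors, drop_first=True)  # n - 1 columns per sample

print(full[0])     # {'is_blue': 0, 'is_green': 0, 'is_red': 1}
print(reduced[2])  # {'is_green': 0, 'is_red': 0}
```

With `drop_first=True`, a sample from the dropped category is represented by zeros in every remaining column.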

## 3.7 Binning Predictors

* Avoid manual binning of numeric values:
  * can lead to loss of performance of model
  * loss of precision in the predictions
  * can lead to high rate of false positives
* Note: this is in contrast to _automatic_ binning that happens as part of the algorithmic processing of some techniques.  