# Feature engineering for Machine Learning

## Variable types

A variable is any characteristic, number or quantity that can be measured or counted.

### Numerical variables

A variable whose values are numbers. Numerical variables are either <span class="burk">Discrete</span> or <span class="burk">Continuous</span>.

<span class="burk">Discrete</span>

A numerical variable that takes only whole-number values, usually counts; its values belong to the set of integers (Z).

<span class="burk">Continuous</span>

A numerical variable that can take any value within a range; its values belong to the set of real numbers (R).

Another type of numerical variable that commonly appears in datasets is <span class="burk">Binary</span>.

<span class="burk">Binary</span>

A numerical variable with only two possible values, usually 0 and 1, often used to represent categories through encoding.

### Categorical variables

A variable whose values are selected from a group of categories, also called labels. Categorical variables can be divided into <span class="burk">Ordinal</span> and <span class="burk">Nominal</span>.

<span class="burk">Ordinal</span>

Labels that can be ordered in a meaningful way (for example, low, medium, high).

<span class="burk">Nominal</span>

Labels that carry no intrinsic order or ranking (for example, country of birth).

Categorical variables are usually encoded as numerical variables before being used in ML, so that the algorithm can build a model.
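
As a minimal sketch of such an encoding, the snippet below maps an ordinal variable to ordered integers and one-hot encodes a nominal one with pandas. The column names, the labels and the chosen order are illustrative assumptions, not part of any particular dataset.

```python
import pandas as pd

# Hypothetical toy data: "size" is ordinal, "colour" is nominal
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "colour": ["red", "blue", "green", "red"],
})

# Ordinal: map each label to an integer that preserves the order
size_order = {"small": 0, "medium": 1, "large": 2}  # assumed ordering
df["size_encoded"] = df["size"].map(size_order)

# Nominal: one-hot encoding, one indicator column per label, no order implied
colour_dummies = pd.get_dummies(df["colour"], prefix="colour")
encoded = pd.concat([df, colour_dummies], axis=1)

print(encoded)
```

Mapping a nominal variable to integers in the same way would suggest an order that does not exist, which is why the two cases are usually treated differently.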

### Date and time variables

A representation of the point in time at which the instance was recorded. It can be a <span class="burk">Date</span>, a <span class="burk">Time</span> or a <span class="burk">Date and Time</span>.

It can be seen as a special case of categorical variable, but it receives special treatment because of the richness of the information it provides.
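
One common way to exploit that information is to break a timestamp into several simpler features. The sketch below does this with the pandas `dt` accessor; the `timestamp` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical timestamps stored as strings
df = pd.DataFrame({"timestamp": ["2021-03-15 08:30:00", "2021-12-31 23:59:00"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Derive simpler numerical features from the datetime column
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0, Sunday = 6
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["day_of_week"] >= 5

print(df)
```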

### Mixed variables

Sometimes real-world features mix numbers and categories. This can take the form of <span class="burk">numbers and labels among the same feature</span> (some instances hold a number, others a label) or <span class="burk">numbers and labels in the same instance of a feature</span> (each value contains both).

This kind of feature usually condenses a lot of information and requires special treatment to extract it.
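
The two cases can be sketched as follows, assuming a hypothetical `missed_payments` column (numbers or status codes per instance) and a hypothetical `cabin` column (a letter and a number in every value). In both cases the idea is to split the mixed feature into one numerical and one categorical variable.

```python
import pandas as pd

df = pd.DataFrame({
    "missed_payments": ["0", "3", "A", "1", "D"],  # numbers and labels among the same feature
    "cabin": ["C85", "E46", "B28", "C123", "G6"],   # numbers and labels in the same instance
})

# Case 1: values that parse as numbers go to a numerical column,
# everything else is kept as a label in a categorical column
df["missed_payments_num"] = pd.to_numeric(df["missed_payments"], errors="coerce")
df["missed_payments_cat"] = df["missed_payments"].where(df["missed_payments_num"].isna())

# Case 2: split every value into its letter part and its number part
df["cabin_letter"] = df["cabin"].str.extract(r"([A-Za-z]+)", expand=False)
df["cabin_number"] = df["cabin"].str.extract(r"(\d+)", expand=False).astype(int)

print(df)
```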

## Variable Characteristics

Every dataset has its own characteristics, which result from the characteristics of its features and how they relate to each other. All of these need to be addressed so that the data is clear to the algorithm and the model is more accurate.

### Missing Data

<span class="mark">What is missing data?</span>

Missing data, or missing values, occur when no value is stored for a variable in a given instance. It is a very common problem in datasets and can have a big impact on the conclusions derived from the data.

An observation can be missing for several reasons: <span class="burk">Lost</span>, when the value was forgotten, lost or not stored properly; <span class="burk">Doesn't exist</span>, which commonly happens when a variable is the result of a division that is not defined (dividing by 0, for instance); or <span class="burk">Not found | Not identified</span>, when matching records across sources and, for some reason, a value is wrong or missing.

<span class="mark">Impacts of missing data</span>

- Incompatibility with libraries that do not accept missing values
- Changes in model performance
- Imputation may distort the data distribution

Before imputing data, it helps to know which of the three main missing data mechanisms is at work:

- <span class="girk">Missing Data Completely at Random (MCAR)</span>:
    - Probability of being missing is the same for all the observations
    - No relationship with other variables
    - Disregarding those cases would not bias the inference made
- <span class="girk">Missing Data at Random (MAR)</span>:
    - The probability of a value being missing is related to other information available in the dataset
- <span class="girk">Missing Data not at Random (MNAR)</span>:
    - There is a mechanism or reason why some values are missing from the table

To understand why data is missing, it is important to understand how the data was collected. Since that is not always possible, it pays to learn as much as possible about the data collection process so features can be engineered appropriately.
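
A first look at missing data can be sketched with pandas as below. The toy columns (`age`, `income`, `employed`) are assumptions made for illustration; comparing the fraction of missing values across groups of another variable is only a rough hint that the data may be MAR, not a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in "age" and "income"
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan, 31, 58],
    "income": [3000, 2500, np.nan, 4100, 3900, 5200],
    "employed": ["yes", "no", "yes", "no", "yes", "yes"],
})

# Fraction of missing values per variable
missing_fraction = df.isnull().mean()

# Does the missingness of "age" depend on another observed variable?
age_missing_by_group = df["age"].isnull().groupby(df["employed"]).mean()

print(missing_fraction)
print(age_missing_by_group)
```

In this toy example `age` is missing only where `employed` is "no", which is the kind of pattern that would suggest the values are not missing completely at random.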

### Cardinality - Categorical variables

The number of distinct labels a categorical variable can assume is called cardinality. It matters because it affects how the variable impacts certain algorithms during training.

Cardinality can play a big role in the outcome of a model:

- Since categorical variables are commonly stored as strings, the values need to be encoded
    - Encoding techniques change the feature space and how the variables interact with each other
- High cardinality can cause an uneven distribution of labels between datasets
    - Labels may appear only in the training set, leading to overfitting
    - Labels may appear only in the test set, so the model won't know how to treat them
- Overfitting, mostly in tree-based algorithms
    - Variables with too many labels may dominate over other variables, mostly in entropy-based algorithms
    - High cardinality may introduce noise with little information
- Operational problems
    - Unseen labels may cause errors since the model doesn't know how to handle them
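
A quick way to inspect these issues is to count the distinct labels per variable and compare the labels found in the training and test sets. The sketch below uses two small hypothetical DataFrames, `train` and `test`.

```python
import pandas as pd

# Hypothetical train and test splits
train = pd.DataFrame({
    "city": ["Lisbon", "Porto", "Lisbon", "Faro"],
    "gender": ["F", "M", "F", "M"],
})
test = pd.DataFrame({
    "city": ["Porto", "Braga", "Lisbon"],
    "gender": ["M", "F", "F"],
})

# Cardinality: number of distinct labels per variable
cardinality = train.nunique()

# Labels the model will never have seen during training
unseen_in_test = set(test["city"]) - set(train["city"])

# Labels present only in training, which the model may overfit to
only_in_train = set(train["city"]) - set(test["city"])

print(cardinality)
print(unseen_in_test, only_in_train)
```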

### Rare Labels

Rare labels are also a frequent problem in datasets. They usually come from high-cardinality variables, so they bring the same problems as high cardinality.

Whether to keep or change those labels depends on model performance, since it is difficult to understand the role rare labels play in predicting the outcome.

Replacing them with a single new label that groups all rare labels may be a way to improve model performance.
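
A minimal sketch of that grouping, assuming a 5% frequency threshold and the replacement label "Rare" (both arbitrary choices for illustration):

```python
import pandas as pd

# Hypothetical variable with two frequent and three infrequent labels
labels = pd.Series(["A"] * 50 + ["B"] * 40 + ["C"] * 6 + ["D"] * 3 + ["E"] * 1)

# Relative frequency of each label
freq = labels.value_counts(normalize=True)

threshold = 0.05  # assumed cut-off below which a label counts as rare
rare_labels = freq[freq < threshold].index

# Keep frequent labels, replace rare ones with a single "Rare" label
grouped = labels.where(~labels.isin(rare_labels), "Rare")

print(grouped.value_counts())
```

In practice the set of rare labels would be learned from the training set only and then applied to new data.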

### Assumptions of linear models

Sometimes fitting a linear model to a dataset is hard. It is therefore important to understand how well the features meet the assumptions made by a linear fit, since the behaviour of each variable is a good indicator of problems with model performance.

- Linearity
    - The mean values of the outcome variable for each increment of the predictors lie along a straight line.
- No perfect multicollinearity
    - No perfect linear relationship between two or more predictors
- Normally distributed errors
    - The residual errors are random and normally distributed around a mean of 0
- Homoscedasticity
    - At any level of the predictor variables, the variance of the residual terms should be constant

When these assumptions are not met, the features as they stand are not well suited to predicting the outcome with a linear model. Three problems are commonly behind this:

- Outliers
- Lack of homoscedasticity
- The variables are too skewed

Some procedures help to overcome these problems, in particular outliers and lack of homoscedasticity:

- Mathematical transformations
- Discretisation
- Remove or censor outliers

<span class="mark">Evaluating model performance</span>

Some quantities that result from a linear regression help to assess the quality of the fit.

- Residual error analysis N(0,sigma)
    - Since linear regression assumes the error term is normally distributed and centered around zero, plotting the residuals should show whether this holds
    - Normality can be statistically tested with the Kolmogorov-Smirnov test, for example
- Homoscedasticity
    - There are some tests and plots to determine homoscedasticity such as:
        - Residuals plot
        - Levene's test
        - Bartlett's test
        - Goldfeld-Quandt test
    - Visual evaluations are also possible, since the residuals measure the distance of each point from the fitted linear model.
- No collinearity
    - This can be assessed by plotting the correlation matrix of all variables
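
The checks above can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the data is simulated, the model is fitted with `numpy.linalg.lstsq`, the Levene test compares residuals below and above the median fitted value, and the Kolmogorov-Smirnov test is applied to standardized residuals, which is only an approximate check because the mean and standard deviation are estimated from the same sample.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data that satisfies the linear assumptions by construction
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Ordinary least squares fit (intercept, x1, x2)
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

# Normality of residuals: Kolmogorov-Smirnov test against N(0, 1)
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, ks_p = stats.kstest(standardized, "norm")

# Homoscedasticity: Levene's test on residuals for low vs high fitted values
low = residuals[fitted <= np.median(fitted)]
high = residuals[fitted > np.median(fitted)]
lev_stat, lev_p = stats.levene(low, high)

# Collinearity: correlation matrix of the predictors
corr = pd.DataFrame({"x1": x1, "x2": x2}).corr()

print(f"KS p-value: {ks_p:.3f}, Levene p-value: {lev_p:.3f}")
print(corr)
```

Large p-values mean the tests found no evidence against normality or equal variance; they do not prove that the assumptions hold.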

### Probability distributions

**What to do with non-linear models?**

Sometimes a dataset can't be explained by a linear model. In this case, a broader approach and a different way of thinking about the data are required.

**Probability Distributions**

A probability distribution is a function that gives the likelihood of a variable taking a certain value. Each probability must be between 0 and 1, and the sum of all probabilities must be 1.
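
As a small worked example of these two properties, the sketch below computes the probabilities of a binomial distribution with illustrative parameters `n = 10` and `p = 0.3` and checks that they sum to 1. The helper `binomial_pmf` is written here for illustration only.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Every probability lies between 0 and 1, and together they sum to 1
total = sum(pmf)
print(round(total, 10))
```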

Depending on the nature of the variable, different distributions can be used to describe and analyse it:

- Discrete
    - Binomial
    - Poisson
- Continuous
    - Gaussian
    - Skewed
    - Many others

**The Normal Distribution**

A normal distribution has a few important properties that help when studying variables: the values are centered around a single central peak, and the distance from the center dictates the probability of a certain value occurring.

**The Skewed Distribution**

When one tail of the distribution is longer than the other, the distribution is skewed, which usually happens with unbalanced variables. The main difference in a skewed distribution is that the mean is affected by the skew, so the values are no longer centered around it. Usually, the mean is pulled toward the longer tail.
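
The effect on the mean can be seen in a small simulation. The sketch below draws one symmetric and one right-skewed sample (the distribution parameters are arbitrary assumptions) and compares mean, median and skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Symmetric sample (normal) and right-skewed sample (log-normal)
symmetric = pd.Series(rng.normal(loc=50, scale=5, size=10_000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

summary = pd.DataFrame(
    {
        "mean": [symmetric.mean(), right_skewed.mean()],
        "median": [symmetric.median(), right_skewed.median()],
        "skew": [symmetric.skew(), right_skewed.skew()],
    },
    index=["symmetric", "right_skewed"],
)

print(summary)
```

For the right-skewed sample the mean ends up above the median, that is, on the side of the longer tail, while for the symmetric sample the two are close.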

**Model Performance**

Among the models discussed here, linear regression is the one that makes a normality assumption, since its residuals must be normally distributed. Even models that make no such assumption may perform better with normally distributed values.

Mathematical transformations and discretisation are techniques that make values more normally distributed and usually improve model performance.
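
Both techniques can be sketched on a simulated right-skewed variable. The `income` variable, the log transformation and the choice of five equal-frequency bins are illustrative assumptions; other transformations or binning strategies may suit real data better.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical right-skewed variable
income = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=5_000))

# Mathematical transformation: the logarithm compresses the long right tail
log_income = np.log(income)

# Discretisation: five equal-frequency bins, labelled 0 to 4
income_bin = pd.qcut(income, q=5, labels=False)

print(f"skew before: {income.skew():.2f}, after log: {log_income.skew():.2f}")
print(income_bin.value_counts().sort_index())
```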

### Outliers

**What is an outlier?**

An outlier is a value so different from the other points that it suggests it was generated by a different mechanism.

**Algorithms Susceptible to Outliers**

- Linear Regression
- AdaBoost

**Identifying an Outlier**

- For Normal Distributions
About 99.7% of all observations lie within 3 standard deviations of the mean, so anything outside this range can be treated as an outlier.

- For Skewed Distributions
Since skewed distributions aren't symmetrical, another method is needed to identify outliers. A common choice is the IQR (Inter-Quartile Range), computed as follows:
    
    - IQR = 75th percentile - 25th percentile = 3rd quartile - 1st quartile
    - Upper Limit = 75th percentile + IQR * 1.5 = 3rd quartile + IQR * 1.5
    - Lower Limit = 25th percentile - IQR * 1.5 = 1st quartile - IQR * 1.5
    
To detect only extreme outliers, multiply the IQR by 3 instead of 1.5.
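
Both rules can be sketched as small helper functions. The names `gaussian_limits` and `iqr_limits` and the toy series are hypothetical, introduced only for this example.

```python
import pandas as pd

def gaussian_limits(s, n_std=3.0):
    """Limits at mean -/+ n_std standard deviations."""
    return s.mean() - n_std * s.std(), s.mean() + n_std * s.std()

def iqr_limits(s, factor=1.5):
    """Limits at the quartiles -/+ factor * IQR (use factor=3 for extreme outliers)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

lower, upper = iqr_limits(values)
outliers = values[(values < lower) | (values > upper)]

g_lower, g_upper = gaussian_limits(values)

print(lower, upper)          # IQR-based limits
print(outliers.tolist())     # values outside the IQR-based limits
print(g_lower, g_upper)      # mean -/+ 3 standard deviations
```

In this toy series the IQR rule flags 100, while the 3 standard deviations rule does not, because the outlier itself inflates the standard deviation. That is one reason the IQR rule is often preferred for small or skewed samples.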

**Visualizing Outliers**

Boxplots are commonly used to visualize outliers. Note that if a distribution is skewed, values in the long tail may be flagged as outliers simply because of the skewness.

### Feature Magnitude

It is important to keep an eye on the scales of the variables. For example, the coefficients of a linear regression depend on the scale of the inputs, and variables with larger scales tend to dominate over those with smaller ones.

The range of each variable also influences how much weight its coefficient carries in the output, so this needs to be addressed as well.

Several parts of machine learning algorithms benefit from features on similar scales: *gradient descent* converges faster, *support vectors* are found faster, and any *distance-based algorithm* is sensitive to magnitude.
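
Two common ways of bringing features to similar scales are standardization and min-max scaling. The sketch below implements both directly in pandas on a hypothetical two-column DataFrame; dedicated scaler classes in ML libraries do the same job.

```python
import pandas as pd

# Hypothetical features on very different scales
df = pd.DataFrame({
    "age": [20, 35, 50, 65],
    "income": [20_000, 48_000, 75_000, 120_000],
})

# Standardization: zero mean and unit variance per column
standardized = (df - df.mean()) / df.std(ddof=0)

# Min-max scaling: every column mapped to the [0, 1] range
min_max = (df - df.min()) / (df.max() - df.min())

print(standardized)
print(min_max)
```

In a real pipeline the means, standard deviations, minima and maxima would be computed on the training set only and reused for the test set.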

**Algorithms sensitive to magnitude**
- Linear and Logistic Regression
- Neural Networks
- SVM
- K-means clustering
- KNN
- LDA
- PCA

**Algorithms not sensitive to magnitude**
- Classification and Regression Trees
- Random Forests
- Gradient Boosted Trees

### How models deal with variables: a summary table

The table below summarizes how models deal with problems in your variables.

```python
from IPython.display import IFrame
IFrame("ML_Comparison.pdf", width=980, height=500)
```