# Week 5 Notes
### Primary Focus: Feature Engineering


##### What is Feature Engineering?
[Reference: Kaggle](https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods#1.-Introduction-to-Feature-Engineering-)

**Feature Engineering** is the process of using domain knowledge to extract features from raw data via data mining techniques to improve the performance of machine learning algorithms.

"Coming up with features is difficult, time-consuming, and requires expert knowledge. 'Applied machine learning' is basically feature engineering." - Andrew Ng

Boiled down to one concept, feature engineering is about transforming raw data into a form that's more useful for your model.

### Feature Scaling
Models that rely on the **distance between data points** (like k-nearest neighbors or Support Vector Machines) are **highly sensitive to the scale of the features**. If one feature has a much larger range than others, it can dominate the distance calculations and skew the results.

Two common scalers are covered here; other scaling methods are discussed in the other references.

- **StandardScaler**
    - This technique standardizes features by removing the mean and scaling to unit variance. 
    - The formula is z = (x - u) / s, where u is the mean and s is the standard deviation. It's great for features that follow a **normal or near-normal distribution**.
    $$z = \frac{x - u}{s}$$
- **MinMaxScaler**
    - This scales features to a fixed range, typically 0 to 1.
    - The formula is X_scaled = (X - X_min) / (X_max - X_min).
        $$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
    - This is useful when you need to **constrain your data to a specific range or when your data isn't normally distributed**.

Both scalers apply a linear transformation, so they don't change the shape of a feature's distribution; they only change its scale. This helps ensure that all features contribute comparably to the model.
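To make the two formulas concrete, here is a minimal sketch using scikit-learn's `StandardScaler` and `MinMaxScaler`. The toy array (an age column and an income column) is made up for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: column 0 is age, column 1 is income (much larger range).
X = np.array([
    [25.0, 30_000.0],
    [35.0, 60_000.0],
    [45.0, 90_000.0],
])

# StandardScaler applies z = (x - u) / s to each column separately.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler applies (X - X_min) / (X_max - X_min) to each column separately.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)     # each column now has mean 0 and unit variance
print(X_minmax)  # each column now spans 0 to 1
```

After scaling, both columns hold the same values even though income started out with a range thousands of times larger than age, so neither would dominate a distance calculation.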

### Categorical Encoding 
Most machine learning algorithms require numerical input. Categorical features (like 'city' or 'product type') need to be converted to numbers before being used.
- **OneHotEncoder**
    - This is a common method for handling nominal (non-ordered) categorical data. It creates a new binary column for each category, with a **1 indicating the presence of that category and a 0 for its absence**. For example, if you have a 'city' feature with values 'New York', 'London', and 'Paris', OneHotEncoder would create three new columns: 'city_New York', 'city_London', and 'city_Paris'. This **prevents the model from assuming an arbitrary ordinal relationship between categories**.

### Creating new features
This is where the "engineering" comes in. By combining or transforming existing features, you can capture more complex relationships.

- **Polynomial features**
    - You can create new features by raising existing features to a power e.g., $$x_1^2, x_2^2$$
    - This **allows linear models to capture non-linear relationships**.
- **Interaction terms**
    - These are new features created by multiplying two or more existing features e.g., $$x_1 \times x_2$$
    - This allows the model to capture the combined effect of two features.

Plan of study: 
- Go over each reference on Day 2 to 3 and implement on Day 4.

References:
- [Kaggle: A Reference Guide to Feature Engineering Methods](https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods)
- [Geeks For Geeks: What is Feature Engineering?](https://www.geeksforgeeks.org/machine-learning/what-is-feature-engineering/)
- [Datacamp: Feature Engineering in Machine Learning: A Practical Guide](https://www.datacamp.com/tutorial/feature-engineering)
- [Practical Guide on Data Preprocessing in Python using Scikit Learn](https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/#:~:text=It%20should%20be%20kept%20in,due%20to%20its%20larger%20range.)
    - Downloaded the PDF here, as the site requires a login.
- [Data Preprocessing with Scikit-learn](https://medium.com/@drpa/data-preprocessing-with-scikit-learn-dcaaf82d000a)
- [10 Powerful Techniques for Feature Engineering in Machine Learning](https://thetechthinker.com/feature-engineering-in-machine-learning/)

Additional items to study/read up on for later weeks:

Can do this on the 5th or 6th day along with Cybersecurity this week, since the activity is just research.
- Pipelines/orchestration tools (Airflow, Prefect, dbt)
- Just background here, no need for a deep dive. Add tutorial videos if possible.


--- 


## Kaggle: A Reference Guide to Feature Engineering Methods
[Kaggle: A Reference Guide to Feature Engineering Methods](https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods)

Feature engineering is a very broad term that covers different techniques for processing data. These techniques turn raw data into processed data that is ready to be fed into a machine learning algorithm. They include filling in missing values, encoding categorical variables, transforming variables, creating new variables from existing ones, and others.

There will be a separate lab for this one in `feature_engineering_kaggle.ipynb`.


This reference covers six items, namely:
1. Missing data imputation
2. Categorical Encoding
3. Variable Transformation
4. Discretization
5. Outlier Engineering
6. Date and Time Engineering

### Missing Data Imputation

- Missing data, or missing values, occur when no value is stored for a certain observation within a variable.
- Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. 
- Incomplete data is an unavoidable problem in dealing with most data sources.
- Imputation is the act of replacing missing data with statistical estimates of the missing values. 
- The goal of any imputation technique is to produce a complete dataset that can be used to train machine learning models.

There are multiple techniques for missing data imputation:
1. Complete case analysis
2. Mean / Median / Mode imputation
3. Random Sample Imputation
4. Replacement by Arbitrary Value
5. End of Distribution Imputation
6. Missing Value Indicator
7. Multivariate imputation

### Missing Data Mechanisms
Three mechanisms lead to missing data. In two of them, values are missing randomly or almost randomly; in the third, the loss of data is systematic.

1. Missing Completely at Random, MCAR

    A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. When data is MCAR, there is absolutely no relationship between the data missing and any other values, observed or missing, within the dataset. In other words, those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

    If values for observations are missing completely at random, then disregarding those cases would not bias the inferences made.

2. Missing at Random, MAR

    MAR occurs when there is a systematic relationship between the propensity of missing values and the observed data. In other words, the probability of an observation being missing depends only on available information (other variables in the dataset). For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for those men and women who decided not to disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.

    In a situation like the above, if we decide to proceed with the variable with missing values (in this case weight), we might benefit from including gender to control the bias in weight for the missing observations.

3. Missing Not at Random, MNAR
    
    Values are missing not at random (MNAR) if their being missing depends on information not recorded in the dataset. In other words, there is a mechanism or a reason why missing values are introduced in the dataset.
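The three mechanisms can be illustrated with a small simulation. This is only a sketch: the probabilities, the weight distribution, and the 85 kg cut-off are made-up assumptions, and the MNAR case is modelled as missingness driven by the unrecorded weight value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey data: gender is always observed, weight may go missing.
df = pd.DataFrame({
    "gender": rng.choice(["man", "woman"], size=n),
    "weight": rng.normal(75, 12, size=n),
})

# MCAR: every observation has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of being missing depends only on an observed variable (gender).
mar_mask = rng.random(n) < np.where(df["gender"] == "woman", 0.40, 0.10)

# MNAR (assumed form): the chance depends on the value that ends up unrecorded.
mnar_mask = rng.random(n) < np.where(df["weight"] > 85, 0.50, 0.05)

# Missing rate per gender under each mechanism.
mcar_rate = pd.Series(mcar_mask).groupby(df["gender"]).mean()
mar_rate = pd.Series(mar_mask).groupby(df["gender"]).mean()
print(mcar_rate, mar_rate, sep="\n")

# Under MNAR, the values that went missing are heavier than average.
print(df.loc[mnar_mask, "weight"].mean(), df["weight"].mean())
```

Under MCAR the missing rate is about the same for both groups, while under MAR it is clearly higher for women, which is why including gender helps control the bias.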

**Noting here**

From here on out, the lab will contain the sample code and the notes will be in this document.

### 1. Complete Case Analysis

Complete case analysis implies analysing only those observations in the dataset that contain values in all the variables. In other words, in complete case analysis we remove all observations with missing values. This procedure is suitable when there are few observations with missing data in the dataset.

Complete case analysis (CCA), also called listwise deletion, consists of simply discarding observations where the value of any variable is missing. It literally means analysing only those observations for which there is information in all of the variables (Xs).

However, if the dataset contains missing data across multiple variables, or some variables contain a high proportion of missing observations, we can easily remove a big chunk of the dataset, and this is undesirable.

CCA can be applied to both categorical and numerical variables.

In practice, CCA may be an acceptable method when the amount of missing information is small. In many real-life datasets, the amount of missing data is rarely small, and therefore CCA is often not an option.

So, in datasets with many variables that contain missing data, CCA will typically not be an option, as it will produce a much smaller dataset containing only the complete observations. However, if only a subset of the variables from the dataset will be used, we could evaluate, variable by variable, whether to discard observations with NA or to replace the NA using other methods.
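A minimal pandas sketch of complete case analysis, including the variable-by-variable option mentioned above. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value in three of its four rows.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "city": ["Paris", "London", None, "Paris"],
    "fare": [7.25, 71.28, 8.05, np.nan],
})

# Share of missing values per column, to judge how much CCA would discard.
missing_share = df.isna().mean()

# Complete case analysis: keep only rows with a value in every column.
complete_cases = df.dropna()

# Variable-by-variable: require completeness only in the columns actually used.
age_fare_cases = df.dropna(subset=["age", "fare"])

print(missing_share)
print(len(df), len(complete_cases), len(age_fare_cases))
```

Here each column is only 25% missing, yet full CCA keeps a single row out of four, which shows how quickly missing values spread across variables can shrink the dataset.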


### 2. Mean/Median/Mode Imputation

We can replace missing values with the mean, median or mode of the variable. Mean / median / mode imputation is widely adopted in organisations and data competitions. Although in practice this technique is used in almost every situation, the procedure is suitable if data is missing at random and in small proportions. If there are a lot of missing observations, however, we will distort the distribution of the variable, as well as its relationship with other variables in the dataset. Distortion in the variable distribution may affect the performance of linear models.

Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable by the mean (if the variable has a Gaussian distribution) or median (if the variable has a skewed distribution).

For categorical variables, replacement by the mode is also known as replacement by the most frequent category.

Mean/median imputation assumes that the data are missing completely at random (MCAR). If this is the case, we can think of replacing the NA with the most typical value of the variable, which is the mean if the variable has a Gaussian distribution, or the median otherwise.

The rationale is to replace the missing values with a central, representative value of the variable, since this is the most likely occurrence.

When replacing NA with the mean or median, the variance of the variable will be distorted if the number of NA is large relative to the total number of observations, since the imputed values do not differ from the mean or from each other. This leads to an underestimation of the variance.

In addition, estimates of covariance and correlations with other variables in the dataset may also be affected. This is because we may be destroying intrinsic correlations, since the mean/median that now replaces the NA will not preserve the relation with the remaining variables.

Imputation should be done over the training set, and then propagated to the test set. This means that the mean/median used to fill missing values in both the train and test sets should be extracted from the train set only. This avoids leaking information from the test set into training, which would make the evaluation overly optimistic.

Mean/Median/Mode imputation is the most common method to impute missing values.
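The train-then-propagate rule can be sketched with scikit-learn's `SimpleImputer`. The arrays are toy values; the median is used because the made-up train column is skewed by one large value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical skewed feature with one missing value in each split.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])
X_test = np.array([[np.nan], [4.0]])

# strategy can also be "mean" (Gaussian-like data) or "most_frequent" (mode).
imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from the train set only.
X_train_imputed = imputer.fit_transform(X_train)

# transform reuses that same train median on the test set.
X_test_imputed = imputer.transform(X_test)

print(imputer.statistics_)  # the learned fill value per column
print(X_test_imputed)
```

The test set is never passed to `fit`, so its values cannot influence the statistic used for imputation.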


### 3. Random Sample Imputation

Random sample imputation refers to randomly selecting values from the variable to replace the missing data. This technique preserves the variable distribution, and is well suited for data missing at random. However, we need to account for randomness by setting a seed. Otherwise, the same missing observation could be replaced by different values in different code runs, and therefore lead to different model predictions. This is not desirable when using our models within an organisation.

Replacing of NA by random sampling for categorical variables is exactly the same as for numerical variables.

Random sampling consists of taking a random observation from the pool of available observations of the variable (for categorical variables, from the pool of available categories) and using that randomly extracted value to fill the NA. In random sampling, one takes as many random observations as there are missing values in the variable.

By random sampling observations of the present categories, we guarantee that the frequency of the different categories/labels within the variable is preserved.

Assumptions:
Random sample imputation assumes that the data are missing completely at random (MCAR). If this is the case, it makes sense to substitute the missing values with values extracted from the original variable distribution / category frequency.


Important note:
Imputation should be done over the training set, and then propagated to the test set. This means that the random sample to be used to fill missing values both in train and test set, should be extracted from the train set.
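A minimal sketch of random sample imputation with a fixed seed. The helper `random_sample_impute` is a hypothetical function written for these notes, not part of pandas or scikit-learn; it always draws from the train values, as the note above requires.

```python
import numpy as np
import pandas as pd

def random_sample_impute(train: pd.Series, target: pd.Series, seed: int = 0) -> pd.Series:
    """Fill NA in `target` with values drawn at random from the observed values of `train`."""
    filled = target.copy()
    missing = filled.isna()
    # Draw one observed train value per missing entry; the seed makes runs repeatable.
    draws = train.dropna().sample(n=int(missing.sum()), replace=True, random_state=seed)
    # Assign by position, because the drawn values carry the train index.
    filled[missing] = draws.to_numpy()
    return filled

# Hypothetical numerical variable; the same helper works for categorical values.
train = pd.Series([10.0, 12.0, np.nan, 11.0, 15.0, np.nan])
test = pd.Series([np.nan, 13.0])

train_filled = random_sample_impute(train, train, seed=42)
test_filled = random_sample_impute(train, test, seed=42)
print(train_filled.tolist(), test_filled.tolist())
```

Sampling with replacement is a design choice made here so the helper still works when there are more missing values than observed ones.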

### 4. Replacement by Arbitrary Value

Replacement by an arbitrary value, as its name indicates, refers to replacing missing data with an arbitrarily chosen value, using the same value for all missing data. Replacement by an arbitrary value is suitable if data is not missing at random, or if there is a huge proportion of missing values. If all values are positive, a typical replacement is -1. Alternatively, replacing with 999 or -999 is common practice. We need to make sure that the arbitrary value is not a common occurrence in the variable. Replacement by arbitrary values, however, may not be suited for linear models, as it will most likely distort the distribution of the variables, and therefore model assumptions may not be met.

For categorical variables, this is the equivalent of replacing missing observations with the label “Missing”, which is a widely adopted procedure.

Replacing the NA with arbitrary values should be used when there are reasons to believe that the NA are not missing at random. In situations like this, we would not like to replace with the median or the mean, and therefore make the NA look like the majority of our observations.

Instead, we want to flag them. We want to capture the missingness somehow.

The arbitrary value has to be determined for each variable specifically. The choice is totally arbitrary, but the approach is used in industry. Typical values chosen by companies are -9999 or 9999, or similar.
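A short pandas sketch of arbitrary value replacement for a numerical and a categorical variable. The data is invented, and the sentinel -1 is an assumption that only makes sense because every observed age is positive.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing city.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "city": ["Paris", None, "London"],
})

# Check that the chosen sentinel does not already occur in the variable.
assert not (df["age"] == -1).any()

df_filled = df.copy()
# Numerical variable: every NA gets the same arbitrary value.
df_filled["age"] = df["age"].fillna(-1)
# Categorical variable: every NA gets an explicit "Missing" label.
df_filled["city"] = df["city"].fillna("Missing")

print(df_filled)
```

The missing observations now stand apart from the rest instead of blending into the majority, which is the point of the technique.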

### 5. End of Distribution Imputation
End of tail imputation involves replacing missing values by a value at the far end of the tail of the variable distribution. This technique is similar in essence to imputing by an arbitrary value. However, by placing the value at the end of the distribution, we need not look at each variable distribution individually, as the algorithm does it automatically for us. This imputation technique tends to work well with tree-based algorithms, but it may affect the performance of linear models, as it distorts the variable distribution.

On occasion, one has reason to suspect that missing values are not missing at random. If the value is missing, there has to be a reason for it. Therefore, we would like to capture this information.

Adding an additional variable indicating missingness may help with this task. However, the values are still missing in the original variable, and they need to be replaced if we plan to use the variable in machine learning.

So, we will replace the NA with values that are at the far end of the distribution of the variable.

The rationale is that if the value is missing, it has to be for a reason. Therefore, we would not like to replace missing values with the mean and make that observation look like the majority of our observations. Instead, we want to flag that observation as different, and therefore we assign a value that is at the tail of the distribution, where observations are rarely represented in the population.
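A sketch of end of distribution imputation. The notes do not fix how the tail value is chosen, so the rule used here (mean plus 3 standard deviations, computed on the train set) is an assumption that suits roughly Gaussian variables; for skewed variables a rule based on the interquartile range may fit better.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test values for one numerical variable.
train = pd.Series([20.0, 25.0, 30.0, np.nan, 35.0, 40.0])
test = pd.Series([np.nan, 28.0])

# Assumed rule: far right tail = mean + 3 standard deviations of the train set.
tail_value = train.mean() + 3 * train.std()

# The same train-derived value is propagated to both sets.
train_filled = train.fillna(tail_value)
test_filled = test.fillna(tail_value)

print(round(tail_value, 2))
print(train_filled.tolist(), test_filled.tolist())
```

The fill value lands beyond the largest observed value, so the imputed rows are flagged as different without anyone having to pick a number by hand for each variable.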

### 6. Missing Value Indicator

The missing indicator technique involves adding a binary variable to indicate whether the value is missing for a certain observation. This variable takes the value 1 if the observation is missing, or 0 otherwise. One thing to notice is that we still need to replace the missing values in the original variable, which we tend to do with mean or median imputation. By using these two techniques together, if the missing value has predictive power, it will be captured by the missing indicator, and if it doesn’t, it will be masked by the mean / median imputation.

These 2 techniques in combination tend to work well with linear models. But, adding a missing indicator expands the feature space and, as multiple variables tend to have missing values for the same observations, many of these newly created binary variables could be identical or highly correlated.
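The indicator plus median combination can be sketched in pandas as follows. The helper `add_indicator_and_impute` and the `_NA` column suffix are hypothetical names chosen for these notes.

```python
import numpy as np
import pandas as pd

def add_indicator_and_impute(df: pd.DataFrame, column: str, fill_value: float) -> pd.DataFrame:
    """Add a 0/1 missingness flag for `column`, then fill its NA with `fill_value`."""
    out = df.copy()
    # The flag must be created before filling, while the NA are still visible.
    out[f"{column}_NA"] = out[column].isna().astype(int)
    out[column] = out[column].fillna(fill_value)
    return out

# Hypothetical train and test sets.
train = pd.DataFrame({"age": [22.0, np.nan, 35.0, 58.0]})
test = pd.DataFrame({"age": [np.nan, 40.0]})

# The fill value comes from the train set only.
median_age = train["age"].median()

train_out = add_indicator_and_impute(train, "age", median_age)
test_out = add_indicator_and_impute(test, "age", median_age)
print(train_out)
print(test_out)
```

scikit-learn's `SimpleImputer` offers a similar combination through its `add_indicator=True` parameter, which may be more convenient inside a pipeline.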

### Conclusion - When to use each imputation method?
- If missing values are less than 5% of the variable:
    - Numerical variables: go for mean/median imputation or random sample replacement.
    - Categorical variables: impute by the most frequent category.
- If missing values are more than 5% of the variable:
    - Numerical variables: do mean/median imputation and add an additional binary variable to capture missingness.
    - Categorical variables: add a 'Missing' label.

If the number of NA in a variable is small, they are unlikely to have a strong impact on the variable / target that you are trying to predict. Treating them specially will therefore most likely just add noise to the variables, so it is more useful to replace them with the mean or a random sample to preserve the variable distribution.

However, if the variable / target you are trying to predict is highly imbalanced, then it might be the case that this small number of NA is indeed informative.
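To apply the 5% rule of thumb, it helps to measure the share of missing values per variable first. This is a small sketch with made-up data; the threshold constant simply mirrors the 5% figure from these notes.

```python
import numpy as np
import pandas as pd

MISSING_THRESHOLD = 0.05  # the 5% rule of thumb from the notes

# Hypothetical dataset: "age" is 25% missing, "fare" is 2.5% missing.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0] * 10,
    "fare": [7.25] * 39 + [np.nan],
})

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Split the columns by which side of the threshold they fall on.
low_missing = missing_share[missing_share < MISSING_THRESHOLD].index.tolist()
high_missing = missing_share[missing_share >= MISSING_THRESHOLD].index.tolist()

print(missing_share)
print("simple imputation candidates:", low_missing)
print("imputation plus indicator candidates:", high_missing)
```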

#### Exceptions
If we suspect that the NA are not missing at random and do not want to attribute the most common occurrence to them, and we also don't want to increase the feature space by adding an additional variable to indicate missingness, then we should replace them with a value at the far end of the distribution or with an arbitrary value.