# Week 5 Notes
### Primary Focus: Feature Engineering


##### What is Feature Engineering?
[Reference: Kaggle](https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods#1.-Introduction-to-Feature-Engineering-)

**Feature Engineering** is the process of using domain knowledge to extract features from raw data via data mining techniques to improve the performance of machine learning algorithms.

Coming up with features is difficult, time-consuming, and requires expert knowledge. "Applied machine learning" is basically feature engineering. - Andrew Ng

If we can boil it down to one concept, it's about transforming raw data into a form that's more useful for your model.

### Feature Scaling
Models that rely on the **distance between data points** (like k-nearest neighbors or Support Vector Machines) are **highly sensitive to the scale of the features**. If one feature has a much larger range than others, it can dominate the distance calculations and skew the results.

Other scaling techniques are covered in the other references.

- **StandardScaler**
    - This technique standardizes features by removing the mean and scaling to unit variance. 
    - The formula is z = (x - u) / s, where u is the mean and s is the standard deviation. It's great for features that follow a **normal or near-normal distribution**.
        $$z = \frac{x - u}{s}$$
- **MinMaxScaler**
    - This scales features to a fixed range, typically 0 to 1.
    - The formula is X_scaled = (X - X_min) / (X_max - X_min).
        $$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
    - This is useful when you need to **constrain your data to a specific range or when your data isn't normally distributed**.

Scaling doesn't change the shape of your data's distribution; it only changes the scale. This helps ensure that no feature dominates the model simply because of its units.
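
As a rough illustration of the two formulas above, here is a minimal NumPy sketch. The feature matrix is made up for the example; scikit-learn's `StandardScaler` and `MinMaxScaler` apply the same per-column formulas.

```python
import numpy as np

# Hypothetical feature matrix: column 0 is age (years), column 1 is income.
X = np.array([[25.0, 40_000.0],
              [35.0, 60_000.0],
              [45.0, 80_000.0]])

# Standardization: z = (x - u) / s, computed per column.
u = X.mean(axis=0)
s = X.std(axis=0)  # population standard deviation (ddof=0)
X_standardized = (X - u) / s

# Min-max scaling: (X - X_min) / (X_max - X_min), computed per column.
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_minmax = (X - X_min) / (X_max - X_min)

print(X_standardized)
print(X_minmax)
```

After either transformation, age and income live on comparable scales, so income no longer dominates a distance calculation just because its raw values are larger.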

### Categorical Encoding 
Most machine learning algorithms require numerical input. Categorical features (like 'city' or 'product type') need to be converted to numbers before being used.
- OneHotEncoder
    - This is a common method for handling nominal (non-ordered) categorical data. It creates a new binary column for each category, with a **1 indicating the presence of that category and a 0 for its absence**. For example, if you have a 'city' feature with values 'New York', 'London', and 'Paris', OneHotEncoder would create three new columns: 'city_New York', 'city_London', and 'city_Paris'. This **prevents the model from assuming an arbitrary ordinal relationship between categories**.
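
The 'city' example can be sketched with pandas. This uses `pd.get_dummies` rather than scikit-learn's `OneHotEncoder`, purely to keep the example short; the data frame is invented for illustration.

```python
import pandas as pd

# Hypothetical data with the three cities from the example above.
df = pd.DataFrame({"city": ["New York", "London", "Paris", "London"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the three new columns, so no ordering between the cities is implied.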

### Creating new features
This is where the "engineering" comes in. By combining or transforming existing features, you can capture more complex relationships.

- **Polynomial features**
    - You can create new features by raising existing features to a power e.g., $$x_1^2, x_2^2$$
    - This **allows linear models to capture non-linear relationships**.
- **Interaction terms**
    - These are new features created by multiplying two or more existing features e.g., $$x_1 * x_2$$ 
    - This allows the model to capture the combined effect of two features.
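
A small NumPy sketch of both ideas, using two made-up features `x1` and `x2`. The new columns are the squares and the product described above.

```python
import numpy as np

# Two hypothetical features with three observations each.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])

# Original features, their squares (polynomial terms),
# and their product (interaction term), stacked as columns.
X_new = np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

print(X_new)
```

A linear model fitted on `X_new` is still linear in its coefficients, but it can now represent curvature and the combined effect of `x1` and `x2`.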

Plan of study: 
- Go over each reference on Day 2 to 3 and implement on Day 4.

References:
- [Kaggle: A Reference Guide to Feature Engineering Methods](https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods)
- [Geeks For Geeks: What is Feature Engineering?](https://www.geeksforgeeks.org/machine-learning/what-is-feature-engineering/)
- [Datacamp: Feature Engineering in Machine Learning: A Practical Guide](https://www.datacamp.com/tutorial/feature-engineering)
- [Practical Guide on Data Preprocessing in Python using Scikit Learn](https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/#:~:text=It%20should%20be%20kept%20in,due%20to%20its%20larger%20range.)
    - Downloaded the PDF, as the site requires a login.
- [Data Preprocessing with Scikit-learn](https://medium.com/@drpa/data-preprocessing-with-scikit-learn-dcaaf82d000a)
- [10 Powerful Techniques for Feature Engineering in Machine Learning](https://thetechthinker.com/feature-engineering-in-machine-learning/)

Additional items to study/read up on for later weeks:

Can do this on the 5th or 6th day along with Cybersecurity this week, since the activity is just research.
- Pipelines/orchestration tools (Airflow, Prefect, dbt)
- Just a background here, no need for a deep dive. Add vids for tutorials if possible.


--- 


# Kaggle: A Reference Guide to Feature Engineering Methods
[Kaggle: A Reference Guide to Feature Engineering Methods](https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods)

Feature engineering is a very broad term that covers different techniques to process data. These techniques help us turn raw data into processed data ready to be fed into a machine learning algorithm. They include filling missing values, encoding categorical variables, transforming variables, creating new variables from existing ones, and others.

There will be a separate Lab for this one under feature_engineering_kaggle.ipynb


This reference covers six items, namely:
1. Missing data imputation
2. Categorical Encoding
3. Variable Transformation
4. Discretization
5. Outlier Engineering
6. Date and Time Engineering

## Missing Data Imputation

- Missing data, or Missing values, occur when no data / no value is stored for a certain observation within a variable.
- Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. 
- Incomplete data is an unavoidable problem in dealing with most data sources.
- Imputation is the act of replacing missing data with statistical estimates of the missing values. 
- The goal of any imputation technique is to produce a complete dataset that can be used to train machine learning models.

There are multiple techniques for missing data imputation. These are as follows:-
1. Complete case analysis
2. Mean / Median / Mode imputation
3. Random Sample Imputation
4. Replacement by Arbitrary Value
5. End of Distribution Imputation
6. Missing Value Indicator
7. Multivariate imputation

### Missing Data Mechanisms
There are 3 mechanisms that lead to missing data and 2 of them involve missing data randomly or almost randomly with the third one caused by a systematic loss of data.

1. Missing Completely at Random, MCAR

    A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. When data is MCAR, there is absolutely no relationship between the data missing and any other values, observed or missing, within the dataset. In other words, those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

    If values for observations are missing completely at random, then disregarding those cases would not bias the inferences made.

2. Missing at Random, MAR

    MAR occurs when there is a systematic relationship between the propensity of missing values and the observed data. In other words, the probability of an observation being missing depends only on available information (other variables in the dataset). For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for those men and women that decided not to disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.

    In a situation like the above, if we decide to proceed with the variable with missing values (in this case weight), we might benefit from including gender to control the bias in weight for the missing observations.

3. Missing Not at Random, MNAR
    
    Values are missing not at random (MNAR) if their being missing depends on information not recorded in the dataset. In other words, there is a mechanism or a reason why missing values are introduced in the dataset.

**Noting here**

From here on out, the lab will contain the sample codes and the notes will be in this document.

### 1. Complete Case Analysis

Complete case analysis implies analysing only those observations in the dataset that contain values in all the variables. In other words, in complete case analysis we remove all observations with missing values. This procedure is suitable when there are few observations with missing data in the dataset.

So complete-case analysis (CCA), also called list-wise deletion of cases, consists in simply discarding observations where values in any of the variables are missing. Complete Case Analysis means literally analysing only those observations for which there is information in all of the variables (Xs).

But, if the dataset contains missing data across multiple variables, or some variables contain a high proportion of missing observations, we can easily remove a big chunk of the dataset, and this is undesirable.

CCA can be applied to both categorical and numerical variables.

In practice, CCA may be an acceptable method when the amount of missing information is small. In many real-life datasets, however, the amount of missing data is not small, and CCA is then rarely an option.

So, in datasets with many variables that contain missing data, CCA will typically not be an option as it will produce a reduced dataset with complete observations. However, if only a subset of the variables from the dataset will be used, we could evaluate variable by variable, whether we choose to discard values with NA, or to replace them with other methods.
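
A minimal pandas sketch of complete case analysis. The data frame and its column names are hypothetical. Checking the share of missing values first shows how many rows CCA would cost.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in two variables.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 26.55],
    "sex": ["male", "female", "female", "male"],
})

# Share of missing values per variable.
missing_share = df.isna().mean()

# Complete case analysis: keep only rows with no missing value in any column.
complete_cases = df.dropna()

print(missing_share)
print(complete_cases)
```

Here each variable is missing in only one row, yet half of the dataset is discarded, which is the drawback described above.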


### 2. Mean/Median/Mode Imputation

We can replace missing values with the mean, median or mode of the variable. Mean / median / mode imputation is widely adopted in organisations and data competitions. Although in practice this technique is used in almost every situation, the procedure is suitable if data is missing at random and in small proportions. If there are a lot of missing observations, however, we will distort the distribution of the variable, as well as its relationship with other variables in the dataset. Distortion in the variable distribution may affect the performance of linear models.

Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable by the mean (if the variable has a Gaussian distribution) or median (if the variable has a skewed distribution).

For categorical variables, replacement by the mode is also known as replacement by the most frequent category.

Mean/median imputation assumes that the data are missing completely at random (MCAR). If this is the case, we can replace the NA with a typical value of the variable: the mean if the variable has a Gaussian distribution, or the median otherwise.

The rationale is to replace the missing values with a central value of the distribution, since values near the centre are the most likely to occur.

When replacing NA with the mean or median, the variance of the variable will be distorted if the number of NA is large relative to the total number of observations (since the imputed values do not differ from the mean or from each other). This leads to an underestimation of the variance.

In addition, estimates of covariance and correlations with other variables in the dataset may also be affected. This is because we may be destroying intrinsic correlations, since the mean/median that now replaces NA will not preserve the relation with the remaining variables.

Imputation should be done over the training set, and then propagated to the test set. This means that the mean/median used to fill missing values in both the train and test sets should be extracted from the train set only. This avoids leaking information from the test set into training, which would make the evaluation overly optimistic.

Mean/Median/Mode imputation is the most common method to impute missing values.
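
The train/test rule can be sketched as follows. The data is made up; the median is used because the toy variable is skewed by one large value.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets with missing ages.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0, 90.0]})
test = pd.DataFrame({"age": [np.nan, 25.0]})

# Learn the statistic on the training set only.
train_median = train["age"].median()

# Apply the same value to both sets.
train_imputed = train["age"].fillna(train_median)
test_imputed = test["age"].fillna(train_median)

print(train_median)
print(test_imputed.tolist())
```

The test set never contributes to the imputation value, which keeps the evaluation honest.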


### 3. Random Sample Imputation

Random sample imputation refers to randomly selecting values from the variable to replace the missing data. This technique preserves the variable distribution, and is well suited for data missing at random. But, we need to account for randomness by adequately setting a seed. Otherwise, the same missing observation could be replaced by different values in different code runs, and therefore lead to different model predictions. This is not desirable when using our models within an organisation.

Replacing of NA by random sampling for categorical variables is exactly the same as for numerical variables.

Random sampling consists of taking a random observation from the pool of available observations of the variable, that is, from the pool of available categories, and using that randomly extracted value to fill the NA. In random sampling, one takes as many random observations as there are missing values in the variable.

By random sampling observations of the present categories, we guarantee that the frequency of the different categories/labels within the variable is preserved.

Assumptions:
Random sample imputation has the assumption that the data are missing completely at random (MCAR). If this is the case, it makes sense to substitute the missing values, by values extracted from the original variable distribution/ category frequency.


Important Note
Imputation should be done over the training set, and then propagated to the test set. This means that the random sample to be used to fill missing values both in train and test set, should be extracted from the train set.
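
A minimal sketch of random sample imputation on a made-up training variable. The fixed `random_state` is what makes the result reproducible between runs.

```python
import numpy as np
import pandas as pd

# Hypothetical training variable with two missing values.
train = pd.DataFrame({"age": [20.0, np.nan, 30.0, 40.0, np.nan, 50.0]})

missing_mask = train["age"].isna()
observed = train["age"].dropna()

# Draw as many observed values as there are missing ones.
# The fixed seed makes every run fill the same rows with the same values.
sampled = observed.sample(n=int(missing_mask.sum()), replace=True, random_state=0)

imputed = train["age"].copy()
imputed.loc[missing_mask] = sampled.to_numpy()

print(imputed.tolist())
```

To impute a test set, the values would still be drawn from `observed`, that is, from the training data.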

### 4. Replacement by Arbitrary Value

Replacement by an arbitrary value, as its name indicates, refers to replacing missing data by any arbitrarily determined value, but the same value for all missing data. Replacement by an arbitrary value is suitable if data is not missing at random, or if there is a huge proportion of missing values. If all values are positive, a typical replacement is -1. Alternatively, replacing by 999 or -999 is common practice. We need to make sure that these arbitrary values are not a common occurrence in the variable. Replacement by arbitrary values, however, may not be suited for linear models, as it most likely will distort the distribution of the variables, and therefore model assumptions may not be met.
For categorical variables, this is the equivalent of replacing missing observations with the label “Missing”, which is a widely adopted procedure.

Replacing the NA by arbitrary values should be used when there are reasons to believe that the NA are not missing at random. In situations like this, we would not like to replace with the median or the mean, and therefore make the NA look like the majority of our observations.

Instead, we want to flag them. We want to capture the missingness somehow.

The arbitrary value has to be determined for each variable specifically.
We can see that this is totally arbitrary. But, it is used in the industry. Typical values chosen by companies are -9999 or 9999, or similar.

### 5. End of Distribution Imputation
End of tail imputation involves replacing missing values by a value at the far end of the tail of the variable distribution. This technique is similar in essence to imputing by an arbitrary value. However, by placing the value at the end of the distribution, we need not look at each variable distribution individually, as the algorithm does it automatically for us. This imputation technique tends to work well with tree-based algorithms, but it may affect the performance of linear models, as it distorts the variable distribution.

On occasions, one has reasons to suspect that missing values are not missing at random. And if the value is missing, there has to be a reason for it. Therefore, we would like to capture this information.

Adding an additional variable indicating missingness may help with this task. However, the values are still missing in the original variable, and they need to be replaced if we plan to use the variable in machine learning.

So, we will replace the NA, by values that are at the far end of the distribution of the variable.

The rationale is that if the value is missing, it has to be for a reason; therefore, we would not like to replace missing values with the mean and make that observation look like the majority of our observations. Instead, we want to flag that observation as different, and therefore we assign a value that is at the tail of the distribution, where observations are rarely represented in the population.
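
A minimal sketch of end of distribution imputation. The notes do not state how the tail value is chosen, so the rule used here (mean plus 3 standard deviations, for the right tail) is an assumption for illustration; the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical training variable with one missing value.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0, 50.0]})

# Assumed rule: place missing values at mean + 3 standard deviations,
# i.e. at the far right tail of the observed distribution.
tail_value = train["age"].mean() + 3 * train["age"].std()

imputed = train["age"].fillna(tail_value)

print(tail_value)
print(imputed.tolist())
```

The imputed observation now sits well beyond the observed values, so a tree-based model can isolate it with a single split.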

### 6. Missing Value Indicator

The missing indicator technique involves adding a binary variable to indicate whether the value is missing for a certain observation. This variable takes the value 1 if the observation is missing, or 0 otherwise. One thing to notice is that we still need to replace the missing values in the original variable, which we tend to do with mean or median imputation. By using these 2 techniques together, if the missing value has predictive power, it will be captured by the missing indicator, and if it doesn’t it will be masked by the mean / median imputation.

These 2 techniques in combination tend to work well with linear models. But, adding a missing indicator expands the feature space and, as multiple variables tend to have missing values for the same observations, many of these newly created binary variables could be identical or highly correlated.
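
The combination of a missing indicator with median imputation could look like this. The column name `age_missing` is a hypothetical naming choice, and in practice the median would come from the training set only.

```python
import numpy as np
import pandas as pd

# Hypothetical variable with two missing values.
df = pd.DataFrame({"age": [20.0, np.nan, 40.0, np.nan, 60.0]})

# 1 if the value was missing, 0 otherwise. Must be created before imputing.
df["age_missing"] = df["age"].isna().astype(int)

# The original variable still needs its gaps filled.
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```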

### Conclusion - When to use each imputation method?
- If missing values are less than 5% of the variable, go for mean/median imputation or random sample replacement for numerical variables, and impute by the most frequent category for categorical variables.
- If missing values are more than 5% of the variable, do mean/median imputation plus an additional binary variable to capture missingness for numerical variables, and add a 'Missing' label for categorical variables.

If the number of NA in a variable is small, they are unlikely to have a strong impact on the variable / target that you are trying to predict. Therefore, treating them specially, will most certainly add noise to the variables. Therefore, it is more useful to replace by mean/random sample to preserve the variable distribution.

If the variable / target you are trying to predict is however highly unbalanced, then it might be the case that this small number of NA are indeed informative.

#### Exceptions
If we suspect that NAs are not missing at random and do not want to attribute the most common occurrence to NA, and if we don't want to increase the feature space by adding an additional variable to indicate missingness - in these cases, replace by a value at the far end of the distribution or an arbitrary value.

---

## Categorical Encoding

Categorical data is data that takes only a limited number of values.

For example, if people responded to a survey about what brand of car they owned, the result would be categorical (because the answers would be things like Honda, Toyota, Ford, None, etc.). Responses fall into a fixed set of categories.

You will get an error if you try to plug these variables into most machine learning models in Python without "encoding" them first. Here we'll show the most popular method for encoding categorical variables.

Categorical variable encoding is a broad term for collective techniques used to transform the strings or labels of categorical variables into numbers. There are multiple techniques under this method:

1. One-Hot encoding (OHE)
2. Ordinal encoding
3. Count and Frequency encoding
4. Target encoding / Mean encoding
5. Weight of Evidence
6. Rare label encoding

#### One-Hot Encoding (OHE)
OHE is the standard approach to encode categorical data.

One hot encoding (OHE) creates a binary variable for each one of the different categories present in a variable. These binary variables take 1 if the observation shows a certain category or 0 otherwise. OHE is suitable for linear models. But, OHE expands the feature space quite dramatically if the categorical variables are highly cardinal, or if there are many categorical variables. In addition, many of the derived dummy variables could be highly correlated.

OHE consists of replacing the categorical variable by different boolean variables, which take value 0 or 1, to indicate whether or not a certain category / label of the variable was present for that observation. Each one of the boolean variables is also known as a dummy variable or binary variable.

For example, from the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is female or 0 otherwise. We can also generate the variable male, which takes 1 if the person is "male" and 0 otherwise.

Note that for categorical variables that only have 2 categories, we only need 1 dummy variable. For example, Gender could be Male or Female: if Male then 1, else 0 (meaning Female), or vice versa.

The scikit-learn API provides a class for [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

Important Note:
Older versions of scikit-learn's one-hot encoder class (before 0.20) only accepted numerical categorical values, so any value of string type had to be label encoded before being one-hot encoded. Current versions accept string categories directly.

In the Titanic example from the reference, the gender of the passengers is label encoded first before being one-hot encoded using scikit-learn's one-hot encoder class; with a current scikit-learn version that extra step is not needed.

#### Ordinal Encoding

Categorical variables whose categories can be meaningfully ordered are called ordinal. For example:

- Student's grade in an exam (A, B, C or Fail).
- Days of the week can be ordinal with Monday = 1, and Sunday = 7.
- Educational level, with the categories: Elementary school, High school, College graduate, PhD ranked from 1 to 4.

When the categorical variable is ordinal, the most straightforward approach is to replace the labels by some ordinal number.

In ordinal encoding we replace the categories by digits, either arbitrarily or in an informed manner. If we encode categories arbitrarily, we assign an integer per category from 1 to n, where n is the number of unique categories. If instead, we assign the integers in an informed manner, we observe the target distribution: we order the categories from 1 to n, assigning 1 to the category for which the observations show the highest mean of target value, and n to the category with the lowest target mean value.
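
For a variable with a natural order, the mapping can be written out by hand. This sketch uses the educational level example; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data using the educational levels listed above.
df = pd.DataFrame({
    "education": ["High school", "PhD", "Elementary school", "College graduate"],
})

# Explicit mapping from label to rank, lowest to highest level.
education_order = {
    "Elementary school": 1,
    "High school": 2,
    "College graduate": 3,
    "PhD": 4,
}

df["education_encoded"] = df["education"].map(education_order)

print(df)
```

Writing the mapping explicitly keeps the order under our control, instead of relying on alphabetical order or order of appearance.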

#### Count and Frequency Encoding 

In count encoding we replace the categories by the count of the observations that show that category in the dataset. Similarly, we can replace the category by the frequency -or percentage- of observations in the dataset. That is, if 10 of our 100 observations show the colour blue, we would replace blue by 10 if doing count encoding, or by 0.1 if replacing by the frequency. These techniques capture the representation of each label in a dataset, but the encoding may not necessarily be predictive of the outcome.

This approach is heavily used in Kaggle competitions, wherein we replace each label of the categorical variable by the count, that is, the number of times each label appears in the dataset, or by the frequency, that is, the percentage of observations within that category. The two encodings differ only by a constant factor (the total number of observations), so they carry the same information.
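
The colour example (10 blue observations out of 100) can be reproduced with `value_counts`. The other colours and the column names are made up to fill the dataset.

```python
import pandas as pd

# Hypothetical dataset: 10 blue, 60 red and 30 green observations.
df = pd.DataFrame({"colour": ["blue"] * 10 + ["red"] * 60 + ["green"] * 30})

counts = df["colour"].value_counts()                     # absolute counts
frequencies = df["colour"].value_counts(normalize=True)  # share of rows

df["colour_count"] = df["colour"].map(counts)
df["colour_freq"] = df["colour"].map(frequencies)

print(df.drop_duplicates())
```

One caveat: two categories with the same count receive the same code, so the model can no longer tell them apart.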

#### Target/Mean Encoding

In target encoding, also called mean encoding, we replace each category of a variable, by the mean value of the target for the observations that show a certain category. For example, we have the categorical variable “city”, and we want to predict if the customer will buy a TV provided we send a letter. If 30 percent of the people in the city “London” buy the TV, we would replace London by 0.3.

This technique has 3 advantages:
- it does not expand the feature space,
- it captures some information regarding the target at the time of encoding the category, and
- it creates a monotonic relationship between the variable and the target.

Monotonic relationships between variable and target tend to improve linear model performance.
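
A sketch of mean encoding for the "city" example, with invented purchase data in which 30 percent of the London customers buy the TV. As with imputation, the means should be learned on the training set only.

```python
import pandas as pd

# Hypothetical training data: 3 of 10 London customers bought the TV,
# 4 of 5 Paris customers did.
train = pd.DataFrame({
    "city": ["London"] * 10 + ["Paris"] * 5,
    "bought_tv": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0],
})

# Mean of the target per category.
city_means = train.groupby("city")["bought_tv"].mean()

train["city_encoded"] = train["city"].map(city_means)

print(city_means)
```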

#### Weight of Evidence
Weight of evidence (WOE) is a technique used to encode categorical variables for classification. WOE is the natural logarithm of the probability of the target being 1 divided by the probability of the target being 0. WOE has the property that its value will be 0 if the phenomenon is random (both outcomes equally likely); it will be bigger than 0 if the probability of the target being 1 is bigger, and it will be smaller than 0 when the probability of the target being 0 is greater.

WOE transformation creates a nice visual representation of the variable, because by looking at the WOE encoded variable, we can see, category by category, whether it favours the outcome of 0, or of 1. In addition, WOE creates a monotonic relationship between variable and target, and leaves all the variables within the same value range.
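
A minimal sketch of WOE per category, taking the definition above literally as ln(P(target = 1) / P(target = 0)) within each category. The data is made up, and categories where one outcome never occurs would need special handling, since the ratio becomes 0 or infinite.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with a binary target.
train = pd.DataFrame({
    "city": ["London"] * 4 + ["Paris"] * 4 + ["Rome"] * 4,
    "target": [1, 1, 1, 0] + [1, 1, 0, 0] + [1, 0, 0, 0],
})

# Probability of target = 1 and target = 0 within each category.
p1 = train.groupby("city")["target"].mean()
p0 = 1 - p1

# WOE as the natural log of the ratio of the two probabilities.
woe = np.log(p1 / p0)

train["city_woe"] = train["city"].map(woe)

print(woe)
```

With this definition, Paris (outcomes equally likely) gets 0, London (mostly 1) gets a positive value, and Rome (mostly 0) gets a negative value.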

---

## Variable Transformation
Some machine learning models, like linear and logistic regression, are often described as assuming that the variables are normally distributed (strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed inputs). Others benefit from Gaussian-like distributions, as in such distributions the observations of X available to predict Y vary across a greater range of values. Thus, Gaussian distributed variables may boost the machine learning algorithm performance.

If a variable is not normally distributed, sometimes it is possible to find a mathematical transformation so that the transformed variable is Gaussian. Typically used mathematical transformations are:

1. Logarithm transformation - $$\log(x)$$
2. Reciprocal transformation - $$1 / x$$
3. Square root transformation - $$\sqrt{x}$$
4. Exponential transformation - $$\exp(x)$$
5. Box-Cox transformation

Refer to the lab to see how these are transformed, and expound here.

#### BoxCox Transformation

The Box-Cox transformation is defined as:

$$ T(Y) = \frac{Y^{\lambda} - 1}{\lambda} $$

where: 
- Y is the response variable, and 
- λ is the transformation parameter. 

λ varies from -5 to 5. In the transformation, all values of λ are considered and the optimal value for a given variable is selected.

Briefly, for each λ (the transformation tests several λs), the correlation coefficient of the probability plot (Q-Q plot, i.e. the correlation between ordered values and theoretical quantiles) is calculated.

The value of λ corresponding to the maximum correlation on the plot is then the optimal choice for λ.

In Python, we can evaluate and obtain the best λ with the `stats.boxcox` function from the `scipy` package.
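
To make the formula concrete, here is a small NumPy sketch that applies Box-Cox for a given λ, assuming the standard form T(Y) = (Y^λ - 1) / λ for λ ≠ 0 and ln(Y) for λ = 0. The helper `box_cox` is a hypothetical name for this illustration; it does not search for the optimal λ as `stats.boxcox` does.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        # Limiting case of the formula as lambda approaches 0.
        return np.log(y)
    return (y**lam - 1.0) / lam

# Hypothetical right-skewed, strictly positive variable.
y = np.array([1.0, 4.0, 9.0, 16.0])

sqrt_like = box_cox(y, 0.5)  # behaves like a shifted, scaled square root
log_like = box_cox(y, 0.0)   # plain natural logarithm

print(sqrt_like)
print(log_like)
```

This also shows why Box-Cox generalises the simpler transformations: λ = 0.5 is a square root up to a shift and scale, λ = 0 is the logarithm, and λ = -1 is related to the reciprocal.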

--- 

## Discretization

- Discretisation is the process of transforming continuous variables into discrete variables by creating a set of contiguous intervals that spans the range of the variable's values.
- Discretisation helps handle outliers and highly skewed variables
- Discretisation helps handle outliers by placing these values into the lower or higher intervals together with the remaining inlier values of the distribution. Thus, these outlier observations no longer differ from the rest of the values at the tails of the distribution, as they are now all together in the same interval / bucket. In addition, by creating appropriate bins or intervals, discretisation can help spread the values of a skewed variable across a set of bins with equal number of observations.

There are several approaches to transform continuous variables into discrete ones. This process is also known as binning, with each bin being each interval.

Discretisation refers to sorting the values of the variable into bins or intervals, also called buckets. There are multiple ways to discretise variables:
1. Equal width discretisation
2. Equal Frequency discretisation
3. Domain knowledge discretisation
4. Discretisation using decision trees


**Discretising data with pandas cut and qcut functions**


When dealing with continuous numeric data, it is often helpful to bin the data into multiple buckets for further analysis. Pandas supports these approaches using the cut and qcut functions.

The `cut` function creates equally spaced bins, but the number of samples in each bin is unequal.

The `qcut` function creates bins of unequal width, but the number of samples in each bin is equal.
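
The contrast between the two functions is easiest to see on a skewed variable. This sketch uses made-up values with one large observation and 5 bins for both functions.

```python
import pandas as pd

# Hypothetical skewed variable: nine small values and one large one.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal width: 5 intervals of the same length across the range.
equal_width = pd.cut(values, bins=5)

# Equal frequency: 5 intervals chosen so each holds the same number of values.
equal_freq = pd.qcut(values, q=5)

width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)

print(width_counts)
print(freq_counts)
```

With `cut`, nine of the ten values fall into the first bin, while `qcut` places two values in each bin.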

#### Equal width discretisation with pandas cut function

Equal width binning divides the scope of possible values into N bins of the same width. The width is determined by the range of values in the variable and the number of bins we wish to use to divide the variable.

$$\text{width} = \frac{\text{max value} - \text{min value}}{N}$$

For example, if the values of the variable vary between 0 and 100, we create 5 bins like this: width = (100-0) / 5 = 20. The bins thus are 0-20, 20-40, 40-60, 60-80, 80-100. The first and final bins (0-20 and 80-100) can be expanded to accommodate outliers (that is, values under 0 or greater than 100 would be placed in those bins as well).

There is no rule of thumb to define N. Typically, we would not want more than 10.

Source : https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html

#### Equal frequency discretisation with pandas qcut function

Equal frequency binning divides the scope of possible values of the variable into N bins, where each bin carries the same amount of observations. This is particularly useful for skewed variables as it spreads the observations over the different bins equally. Typically, we find the interval boundaries by determining the quantiles.

Equal frequency discretisation using quantiles consists of dividing the continuous variable into N quantiles, N to be defined by the user. There is no rule of thumb to define N. However, if we think of the discrete variable as a categorical variable, where each bin is a category, we would like to keep N (the number of categories) low (typically no more than 10).

Source : https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.qcut.html

#### Domain knowledge discretisation

Frequently, when engineering variables in a business setting, the business experts determine the intervals in which they think the variable should be divided so that it makes sense for the business. These intervals may be defined both arbitrarily or following some criteria of use to the business. Typical examples are the discretisation of variables like Age and Income.

Income, for example, is usually capped at a certain maximum value, and all incomes above that value fall into the last bucket. As for Age, it is usually divided into certain groups according to the business need, for example division into 0-21 (for under-aged), 20-30 (for young adults), 30-40, 40-60, and > 60 (for retired or close to) are frequent.

## Outlier Engineering

Outliers are values that are unusually high or unusually low relative to the rest of the observations of the variable. There are a few techniques for outlier handling:
- Outlier removal
- Treating outliers as missing values
- Discretisation
- Top / bottom / zero coding

#### Identifying outliers

**Extreme Value Analysis**

The most basic form of outlier detection is Extreme Value Analysis of 1-dimensional data. The key for this method is to determine the statistical tails of the underlying distribution of the variable, and then find the values that sit at the very end of the tails.

In the typical scenario, the distribution of the variable is Gaussian and thus outliers will lie outside the mean plus or minus 3 times the standard deviation of the variable.

If the variable is not normally distributed, a general approach is to calculate the quartiles, and then the interquartile range (IQR), as follows:
- IQR = 75th percentile - 25th percentile

An outlier will sit outside the following upper and lower boundaries:
- Upper boundary = 75th percentile + (IQR * 1.5)
- Lower boundary = 25th percentile - (IQR * 1.5)

or for extreme cases:
- Upper boundary = 75th percentile + (IQR * 3)
- Lower boundary = 25th percentile - (IQR * 3)
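
The IQR rule with the 1.5 multiplier can be sketched in NumPy as follows. The values are made up and contain one obvious outlier.

```python
import numpy as np

# Hypothetical variable with one extreme value.
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], dtype=float)

# 25th and 75th percentiles and the range between them.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Boundaries using the 1.5 multiplier; use 3 for extreme cases.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]

print(lower, upper)
print(outliers)
```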

#### Outlier Removal

Outlier removal refers to removing outlier observations from the dataset. Outliers, by nature are not abundant, so this procedure should not distort the dataset dramatically. But if there are outliers across multiple variables, we may end up removing a big portion of the dataset.

#### Treating outliers as missing values
We can treat outliers as missing information and then apply any of the imputation methods described earlier in these notes.
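
As a sketch of this idea, the outliers can be masked to `NaN` and then filled. The IQR rule and median imputation used here are assumptions chosen for illustration; any detection rule and imputation method could be substituted.

```python
import pandas as pd

# Made-up sample with one extreme value.
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], dtype="float64")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Replace outliers with NaN, then impute with the median of the remaining values.
s_missing = s.mask(is_outlier)
s_imputed = s_missing.fillna(s_missing.median())

print(s_imputed.tolist())
```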

#### Discretisation
Discretisation handles outliers automatically, as outliers are sorted into the terminal bins, together with the other higher or lower value observations. Equal-frequency and tree-based discretisation tend to be the most suitable approaches.
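
A minimal equal-frequency sketch using `pd.qcut`, with invented values and bin labels. The extreme value simply lands in the top bin next to an ordinary observation.

```python
import pandas as pd

# Made-up sample; 95 is far from the rest of the values.
s = pd.Series([10, 12, 11, 13, 15, 14, 16, 95])

# Four equal-frequency bins: each bin receives the same number of observations.
binned = pd.qcut(s, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"value": s, "bin": binned}))
```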

#### Top/Bottom/Zero Coding
Top and bottom coding are also known as Winsorisation or outlier capping. The procedure involves capping the maximum and minimum values at a predefined value. This predefined value can be arbitrary, or it can be derived from the variable distribution. Zero coding is the special case of bottom-coding at zero, for variables that cannot take negative values.

If the variable is normally distributed, we can cap the maximum and minimum values at the mean plus or minus 3 times the standard deviation. If the variable is skewed, we can use the interquartile range proximity rule or cap at the top and bottom percentiles.

**Important note on top-coding**

Top-coding and bottom-coding, like any other feature pre-processing step, should be determined on the training set and then transferred to the test set. This means that we should find the upper and lower bounds in the training set only, and use those bounds to cap the values in the test set.
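
A sketch of capping with bounds learned on the training set only, assuming the IQR proximity rule and invented train/test values. `Series.clip` does the actual capping.

```python
import pandas as pd

# Invented training and test values.
train = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], dtype="float64")
test = pd.Series([9, 13, 120, -40], dtype="float64")

# Learn the caps from the training set only.
q1, q3 = train.quantile(0.25), train.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Apply the same caps to both sets.
train_capped = train.clip(lower=lower, upper=upper)
test_capped = test.clip(lower=lower, upper=upper)

print("Caps:", lower, upper)
print("Test after capping:", test_capped.tolist())
```

The test set never influences `lower` or `upper`, so no information leaks from test data into the preprocessing.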

## Date and Time Engineering
Date variables are a special type of categorical variable. By their nature, date variables contain a multitude of different labels, each one corresponding to a specific date and sometimes time. Date variables, when preprocessed properly, can greatly enrich a dataset. For example, from a date variable we can extract:

- Month
- Quarter
- Semester
- Day (number)
- Day of the week
- Is Weekend?
- Hour
- Time differences in years, months, days, hours, etc.

It is important to understand that date variables should not be used like the categorical variables we have been working with so far when building a machine learning model. This is not only because they have a multitude of categories, but also because when we actually use the model to score a new observation, that observation will most likely be in the future. Its date label will therefore differ from the ones contained in the training set, which are the only ones the machine learning algorithm has seen.
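
The extractions listed above can be sketched with the pandas `.dt` accessor. The `signup` column, the sample timestamps, and the reference date are assumptions made for this example.

```python
import pandas as pd

# Hypothetical timestamps, invented for illustration.
df = pd.DataFrame({
    "signup": pd.to_datetime(
        ["2023-01-14 09:30", "2023-07-03 18:05", "2024-11-29 23:59"]
    )
})

d = df["signup"].dt
df["month"] = d.month
df["quarter"] = d.quarter
df["semester"] = (d.quarter + 1) // 2   # quarters 1-2 -> 1, quarters 3-4 -> 2
df["day"] = d.day
df["day_of_week"] = d.dayofweek         # Monday = 0, Sunday = 6
df["is_weekend"] = d.dayofweek >= 5
df["hour"] = d.hour

# Time difference relative to an assumed reference date, in whole days.
reference = pd.Timestamp("2025-01-01")
df["days_since_signup"] = (reference - df["signup"]).dt.days

print(df)
```

Each derived column takes a small, fixed set of values (or is a plain number), so it stays meaningful for dates the model has never seen.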


# Geeks for Geeks: Feature Engineering Notes

### Importance of Feature Engineering
Feature engineering can significantly influence model performance. By refining features, we can:

- Improve accuracy: Choosing the right features helps the model learn better, leading to more accurate predictions.
- Reduce overfitting: Using fewer, more important features helps the model avoid memorizing the data and perform better on new data.
- Boost interpretability: Well-chosen features make it easier to understand how the model makes its predictions.
- Enhance efficiency: Focusing on key features speeds up the model’s training and prediction process, saving time and resources.

### Processes Involved in Feature Engineering

1. Feature Creation: Feature creation involves generating new features from domain knowledge or by observing patterns in the data. It can be:
    - Domain-specific: Created based on industry knowledge like business rules.
    - Data-driven: Derived by recognizing patterns in data.
    - Synthetic: Formed by combining existing features.
2. Feature Transformation: Transformation adjusts features to improve model learning:
    - Normalization & Scaling: Adjust the range of features for consistency.
    - Encoding: Converts categorical data to numerical form, e.g. one-hot encoding.
    - Mathematical transformations: Like logarithmic transformations for skewed data.
3. Feature Extraction: Extracting meaningful features can reduce dimensionality and improve model accuracy:
    - Dimensionality reduction: Techniques like PCA reduce features while preserving important information.
    - Aggregation & Combination: Summing or averaging features to simplify the model.
4. Feature Selection: Feature selection involves choosing a subset of relevant features to use:
    - Filter methods: Based on statistical measures like correlation.
    - Wrapper methods: Select based on model performance.
    - Embedded methods: Feature selection integrated within model training.
5. Feature Scaling: Scaling ensures that all features contribute equally to the model:
    - Min-Max scaling: Rescales values to a fixed range like 0 to 1.
    - Standard scaling: Normalizes to have a mean of 0 and variance of 1.
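
As one concrete instance of the mathematical transformations mentioned above, here is a small sketch of a logarithmic transformation on a right-skewed variable. The sample values are invented, and `np.log10` is chosen only because the values are strictly positive; `np.log1p` is a common alternative when zeros are present.

```python
import numpy as np
import pandas as pd

# Invented, strongly right-skewed values (each is 10x the previous one).
raw = pd.Series([1, 10, 100, 1_000, 10_000], dtype="float64")

# The log transform compresses the long right tail.
logged = np.log10(raw)

print("Skewness before:", raw.skew())
print("Skewness after: ", logged.skew())
```

After the transform the values are evenly spaced, so the skewness drops to approximately zero.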

## Steps in Feature Engineering
Feature engineering can vary depending on the specific problem but the general steps are:

1. Data Cleaning: Identify and correct errors or inconsistencies in the dataset to ensure data quality and reliability.
2. Data Transformation: Transform raw data into a format suitable for modeling including scaling, normalization and encoding.
3. Feature Extraction: Create new features by combining or deriving information from existing ones to provide more meaningful input to the model.
4. Feature Selection: Choose the most relevant features for the model using techniques like correlation analysis, mutual information and stepwise regression.
5. Feature Iteration: Continuously refine features based on model performance by adding, removing or modifying features for improvement.
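
The first four steps can be sketched end to end on a toy table. Everything here is an assumption made for illustration: the housing columns, the median imputation, the `area_per_room` feature, and the 0.5 correlation threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data with one duplicated row and one missing value.
raw = pd.DataFrame({
    "area":   [50.0, 60.0, np.nan, 80.0, 90.0, 90.0],
    "rooms":  [2, 3, 3, 4, 4, 4],
    "garden": ["no", "no", "no", "yes", "yes", "yes"],
    "price":  [50_000, 60_000, 70_000, 80_000, 90_000, 90_000],
})

# 1. Data cleaning: drop exact duplicates, fill the missing area with the median.
clean = raw.drop_duplicates().copy()
clean["area"] = clean["area"].fillna(clean["area"].median())

# 2. Data transformation: encode the categorical column as 0/1.
clean["garden"] = clean["garden"].map({"no": 0, "yes": 1})

# 3. Feature extraction: derive a new feature from existing ones.
clean["area_per_room"] = clean["area"] / clean["rooms"]

# 4. Feature selection: keep features whose absolute correlation with the
#    target reaches an (assumed) threshold of 0.5.
features = ["area", "rooms", "garden", "area_per_room"]
corr = clean[features].corrwith(clean["price"]).abs()
selected = corr[corr >= 0.5].index.tolist()

print(corr.round(3))
print("Selected features:", selected)
```

Here the engineered `area_per_room` feature does not pass the filter. Deciding whether to drop it, rework it, or try a different feature is exactly what step 5, feature iteration, is about.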

### Common Techniques in Feature Engineering

1. **One-Hot Encoding**: One-Hot Encoding converts categorical variables into binary indicators, allowing them to be used by machine learning models.

```python
import pandas as pd

data = {'Color': ['Red', 'Blue', 'Green', 'Blue']}
df = pd.DataFrame(data)

df_encoded = pd.get_dummies(df, columns=['Color'], prefix='Color')

print(df_encoded)
```

Output:

```
   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
```


2. **Binning**: Binning transforms continuous variables into discrete bins, making them categorical for easier analysis.

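```python
import pandas as pd

data = {'Age': [23, 45, 18, 34, 67, 50, 21]}
df = pd.DataFrame(data)

bins = [0, 20, 40, 60, 100]
labels = ['0-20', '21-40', '41-60', '61+']

# The default right=True makes each bin include its upper edge, so the
# intervals (0, 20], (20, 40], (40, 60], (60, 100] match the labels.
# include_lowest=True also keeps an age of exactly 0 in the first bin.
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels,
                         include_lowest=True)

print(df)
```

Output:

```
   Age Age_Group
0   23     21-40
1   45     41-60
2   18      0-20
3   34     21-40
4   67       61+
5   50     41-60
6   21     21-40
```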


3. **Text Data Preprocessing**: Involves removing stop-words, stemming and vectorizing text data to prepare it for machine learning models.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

texts = ["This is a sample sentence.", "Text data preprocessing is important."]

# Download the stop-word corpus (skipped if it is already up to date).
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
vectorizer = CountVectorizer()


def preprocess_text(text):
    # Split on whitespace, drop stop-words, and stem the remaining words.
    # Punctuation stays attached to tokens, so "sentence." is not stemmed.
    words = text.split()
    words = [stemmer.stem(word)
             for word in words if word.lower() not in stop_words]
    return " ".join(words)


cleaned_texts = [preprocess_text(text) for text in texts]

# Bag-of-words counts; CountVectorizer strips the punctuation when tokenizing.
X = vectorizer.fit_transform(cleaned_texts)

print("Cleaned Texts:", cleaned_texts)
print("Vectorized Text:", X.toarray())
```

Output:

```
Cleaned Texts: ['sampl sentence.', 'text data preprocess important.']
Vectorized Text: [[0 0 0 1 1 0]
 [1 1 1 0 0 1]]
```


4. **Feature Splitting**: Divides a single feature into multiple sub-features, uncovering valuable insights and improving model performance.

```python
import pandas as pd

data = {'Full_Address': [
    '123 Elm St, Springfield, 12345', '456 Oak Rd, Shelbyville, 67890']}
df = pd.DataFrame(data)

df[['Street', 'City', 'Zipcode']] = df['Full_Address'].str.extract(
    r'([0-9]+\s[\w\s]+),\s([\w\s]+),\s(\d+)')

print(df)
```

Output:

```
                     Full_Address      Street         City Zipcode
0  123 Elm St, Springfield, 12345  123 Elm St  Springfield   12345
1  456 Oak Rd, Shelbyville, 67890  456 Oak Rd  Shelbyville   67890
```


### Tools for Feature Engineering
There are several tools available for feature engineering. Here are some popular ones:

- Featuretools: Automates feature engineering by extracting and transforming features from structured data. It integrates well with libraries like pandas and scikit-learn making it easy to create complex features without extensive coding.
- TPOT: Uses genetic programming to optimize machine learning pipelines, automating feature selection, preprocessing and model optimization. It can export the best pipeline it finds as Python code, helping you identify a good combination of features and algorithms.
- DataRobot: Automates machine learning workflows including feature engineering, model selection and optimization. It supports time-dependent and text data and offers collaborative tools for teams to efficiently work on projects.
- Alteryx: Offers a visual interface for building data workflows, simplifying feature extraction, transformation and cleaning. It integrates with popular data sources and its drag-and-drop interface makes it accessible for non-programmers.
- H2O.ai: Provides both automated and manual feature engineering tools for a variety of data types. It includes features for scaling, imputation and encoding and offers interactive visualizations to better understand model results.