# Feature Engineering

## What is feature engineering?
All machine learning algorithms take input data and generate outputs. The input data consists of features, which are often not in a form that can be given to a model directly. They need some processing first, and this is where feature engineering helps. Feature engineering has two main goals:

- It prepares the input dataset in the form required by a specific model or machine learning algorithm.
- It helps improve the performance of machine learning models.

The main feature engineering techniques that will be discussed are:

1. Missing data imputation
2. Categorical encoding
3. Variable transformation
4. Outlier engineering
5. Date and time engineering

### Missing Data Imputation for Feature Engineering
Imputation is the act of replacing missing data with statistical estimates of the missing values. It completes the training data so that it can be passed to a model or algorithm for prediction.

There are multiple techniques for missing data imputation:

1. Complete case analysis
2. Mean / Median / Mode imputation
3. Missing Value Indicator

#### Complete Case Analysis for Missing Data Imputation
Complete case analysis means analyzing only those observations in the dataset that have values for all variables. In other words, you remove every observation that contains a missing value. This method is only suitable when few observations have missing values; otherwise it shrinks the dataset so much that little useful data remains.

Real-life datasets often have a large amount of missing data, so complete case analysis is rarely a practical option, although it can be used when the amount of missing data is small.
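
As a minimal sketch of complete case analysis, assuming pandas and a small made-up DataFrame (the column names and values are illustrative and not taken from the dataset used in this article), the snippet below first measures how many rows contain a missing value and then drops them. Checking the share first is one quick way to judge whether the method is viable.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the five rows contain a missing value
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0, 29.0],
    "Fare": [7.25, 71.28, np.nan, 8.05, 13.00],
})

# Share of rows with at least one missing value
missing_share = df.isnull().any(axis=1).mean()
print(f"Rows with missing data: {missing_share:.0%}")

# Complete case analysis: keep only the rows with no missing values
complete_cases = df.dropna()
print(complete_cases)
```

If the reported share is large, dropping those rows would discard much of the dataset, and one of the imputation techniques below is likely a better fit.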

#### Mean / Median / Mode for Missing Data Imputation
Missing values can also be replaced with the mean, median, or mode of the variable (feature). This approach is widely used, including in data competitions. It is suitable when data is missing at random and in small proportions.


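The following code imputes the missing values of `Age` in the train and test sets:

    # Median of Age, computed on the training set only
    median = X_train['Age'].median()

    # Fill missing Age values in both sets with the training median
    for df in [X_train, X_test]:
        df['Age'] = df['Age'].fillna(median)

    # Number of missing Age values left in the training set
    X_train['Age'].isnull().sum()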

One important point to consider during imputation is that the fill value must be learned from the training set and then applied to both the training set and the test set. All missing values in the train set and test set should be filled with the value extracted from the train set only. This keeps information from the test set out of the training process and helps avoid overfitting.


Explanation:

- **Imputing missing values:** The goal is to handle missing values in the `Age` column of both the training set (`X_train`) and the test set (`X_test`).
- **Choosing the imputation value:** The median of the `Age` column in the training set (`X_train`) is calculated and stored in the variable `median`.
- **Iterating through DataFrames:** A loop iterates over both the training set (`X_train`) and the test set (`X_test`).
- **Filling missing values:** For each DataFrame (`df`), the missing values in the `Age` column are filled with the previously calculated median using the `fillna` method, so the original DataFrames are modified.
- **Checking missing values after imputation:** `X_train['Age'].isnull().sum()` counts the missing values left in the `Age` column of the training set after the imputation.
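
The same train-only rule applies to the mode, which is the usual choice for categorical variables. The sketch below is an illustration under stated assumptions: the toy values and the `Embarked` column are hypothetical, and `fill_values` is just a helper name introduced here. It learns one fill value per column from the training set and applies it to both sets.

```python
import numpy as np
import pandas as pd

# Toy train/test split (hypothetical values and column names)
train = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0],
    "Embarked": ["S", "C", None, "S"],
})
test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Embarked": [None, "Q"],
})

# Learn the fill values from the training set only
fill_values = {
    "Age": train["Age"].median(),             # numerical: median
    "Embarked": train["Embarked"].mode()[0],  # categorical: most frequent value
}

# Apply the same training-set values to both sets
train = train.fillna(value=fill_values)
test = test.fillna(value=fill_values)

print(fill_values)
print(test)
```

Passing a dictionary to `fillna` fills each column with its own value, which keeps the learned statistics in one place and makes it harder to compute one of them on the test set by accident.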

#### Missing Value Indicator for Missing Data Imputation
This technique adds a binary variable that indicates whether the value is missing for a given observation. The variable takes the value 1 if the value is missing and 0 otherwise. The missing values in the original variable still need to be replaced, which is usually done with mean or median imputation. When the two techniques are used together, any predictive power of the missingness is captured by the missing indicator, and if the missingness has none, it is masked by the mean / median imputation.

    import numpy as np

    # 1 where Age is missing, 0 otherwise
    X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
    X_test['Age_NA'] = np.where(X_test['Age'].isnull(), 1, 0)
    X_train.head()

Here `X_train.Age.mean(), X_train.Age.median()` returns `(29.915338645418327, 29.0)`. Since the mean and median are close, let's fill the missing values with the median.

    # Median of Age from the training set, used for both sets
    age_median = X_train['Age'].median()
    X_train['Age'] = X_train['Age'].fillna(age_median)
    X_test['Age'] = X_test['Age'].fillna(age_median)
    X_train.head(10)

So, the `Age_NA` variable was created to capture the missingness.

To recap, this technique creates a binary indicator variable that explicitly records whether an observation has a missing value, and then imputes the missing values in the original variable using the mean or median.

Let's break down the steps:

1. **Create Missing Indicator Variable:**
   ```python
   X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
   X_test['Age_NA'] = np.where(X_test['Age'].isnull(), 1, 0)
   ```
   - This code creates a new binary variable, 'Age_NA,' which takes the value 1 if the corresponding 'Age' value is missing and 0 otherwise. This variable serves as an indicator of missingness.

2. **Impute Missing Values with Median:**
   ```python
   # Compute the training-set median once and reuse it for both sets
   age_median = X_train['Age'].median()
   X_train['Age'] = X_train['Age'].fillna(age_median)
   X_test['Age'] = X_test['Age'].fillna(age_median)
   ```
   - The missing values in the 'Age' variable are imputed using the median of the non-missing values in the training set. Both the training and test sets are imputed with the median from the training set.

3. **Reasoning:**
   - The combination of creating a missing indicator and imputing with the median allows for capturing the information about missing values. If the fact that a value is missing has predictive power, it can be captured by the 'Age_NA' variable. Meanwhile, if the missing value itself is not informative, it gets replaced by the median value.

4. **Check the Result:**
   ```python
   X_train.head(10)
   ```
   - This line prints the first 10 rows of the modified training set to show the impact of the missing value indicator and imputation.

5. **Why the Median:**
   - In this dataset the mean (about 29.9) and the median (29.0) of the 'Age' variable are close, so the median is used as the fill value in step 2.

The key idea is to retain information about missing values using the binary indicator while imputing missing values with a central tendency measure (median in this case). This approach is particularly useful when missingness itself might be predictive or informative in the dataset.
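
The steps above can be wrapped into one reusable function. This is a minimal sketch on made-up data; the helper name `add_indicator_and_impute` and the toy values are assumptions for illustration, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Toy train/test data (hypothetical values)
X_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 41.0, np.nan]})
X_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# The fill value comes from the training set only
age_median = X_train["Age"].median()


def add_indicator_and_impute(df, column, fill_value):
    """Return a copy with a 0/1 missing indicator and the column imputed."""
    out = df.copy()
    # Build the indicator first: after imputation the missingness is gone
    out[f"{column}_NA"] = out[column].isnull().astype(int)
    out[column] = out[column].fillna(fill_value)
    return out


X_train = add_indicator_and_impute(X_train, "Age", age_median)
X_test = add_indicator_and_impute(X_test, "Age", age_median)
print(X_train)
```

The order inside the function matters: if the column were imputed before the indicator was built, every indicator value would be 0.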

### Categorical encoding in Feature Engineering
Categorical data is data that takes only a limited number of values. Let’s understand this with an example. A Gender parameter in a dataset will have categorical values like Male and Female. If a survey asks which car people own, the result will be categorical, because the answers fall into categories like Honda, Toyota, Hyundai, Maruti, None, etc. The point to notice is that the data falls into a fixed set of categories.

If you give a dataset with categorical variables directly to a model, many algorithms will raise an error because they expect numeric input. Hence, the variables need to be encoded. There are multiple techniques to do so:

1. One-Hot encoding (OHE)
2. Ordinal encoding
3. Count and Frequency encoding
4. Target encoding / Mean encoding

#### One-Hot Encoding

One-hot encoding is a commonly used technique for encoding categorical variables. It creates one binary variable for each category present in the categorical variable. Each binary variable is 1 if the observation belongs to that category and 0 otherwise. Each new variable is called a dummy variable or binary variable.

Example, using a `Color` variable:

    import pandas as pd

    # Create a dummy dataset with a categorical variable 'Color'
    data = {'ID': [1, 2, 3, 4, 5],
            'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']}
    df = pd.DataFrame(data)

    # Display the original dataset
    print("Original Dataset:")
    print(df)

    # Perform one-hot encoding without dropping any variable
    encoded_df = pd.get_dummies(df['Color'])
    print("\nOne-Hot Encoding without dropping any variable:")
    print(encoded_df)

    # Concatenate the original 'Color' column with dummy variables
    concatenated_df = pd.concat([df['Color'], encoded_df], axis=1)
    print("\nConcatenated Dataset:")
    print(concatenated_df)

    # Perform one-hot encoding with drop_first=True
    encoded_df_drop_first = pd.get_dummies(df['Color'], drop_first=True)
    print("\nOne-Hot Encoding with drop_first=True:")
    print(encoded_df_drop_first)


Output:

    Original Dataset:
       ID  Color
    0   1    Red
    1   2  Green
    2   3   Blue
    3   4    Red
    4   5  Green

    One-Hot Encoding without dropping any variable:
        Blue  Green    Red
    0  False  False   True
    1  False   True  False
    2   True  False  False
    3  False  False   True
    4  False   True  False

    Concatenated Dataset:
       Color   Blue  Green    Red
    0    Red  False  False   True
    1  Green  False   True  False
    2   Blue   True  False  False
    3    Red  False  False   True
    4  Green  False   True  False

    One-Hot Encoding with drop_first=True:
       Green    Red
    0  False   True
    1   True  False
    2  False  False
    3  False   True
    4   True  False


The same tables with `True`/`False` written as 1/0:

**Original Dataset**:

    ID  Color
    1   Red
    2   Green
    3   Blue
    4   Red
    5   Green

**One-Hot Encoding without dropping any variable**:

       Blue  Green  Red
    0     0      0    1
    1     0      1    0
    2     1      0    0
    3     0      0    1
    4     0      1    0

**Concatenated Dataset**:

       Color  Blue  Green  Red
    0    Red     0      0    1
    1  Green     0      1    0
    2   Blue     1      0    0
    3    Red     0      0    1
    4  Green     0      1    0

**One-Hot Encoding with drop_first=True**:

       Green  Red
    0      0    1
    1      1    0
    2      0    0
    3      0    1
    4      1    0

The last output shows one-hot encoding with the `drop_first=True` argument, which produces n-1 dummy variables for n categories. In this case, 'Blue' is dropped, and 'Green' and 'Red' are each represented by one dummy variable.

When using one-hot encoding with drop_first=True, one of the categorical levels is dropped to avoid multicollinearity in certain statistical models, such as linear regression. Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to determine the individual effect of each variable on the response variable.

In the context of one-hot encoding:

If you have a categorical variable with n levels, creating n dummy variables would introduce perfect multicollinearity because knowing the values of n-1 dummy variables would uniquely determine the value of the remaining dummy variable.

By dropping one of the dummy variables, you prevent multicollinearity issues. The dropped variable becomes the reference category, and the information about that category is captured by the other dummy variables.
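
The multicollinearity argument can be checked numerically. The sketch below is an illustration that assumes a regression model with an intercept column; `design_rank` is a hypothetical helper introduced here. It compares the rank of the design matrix built from all n dummies with the one built from n-1 dummies.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Red", "Green"], name="Color")

full = pd.get_dummies(colors, dtype=int)                      # n dummy columns
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)  # n-1 dummy columns

# The full set of dummies sums to 1 in every row,
# which duplicates the intercept column of a regression model
print(full.sum(axis=1).tolist())


def design_rank(dummies):
    """Rank of the design matrix [intercept | dummies]."""
    intercept = np.ones((len(dummies), 1))
    return np.linalg.matrix_rank(np.hstack([intercept, dummies.to_numpy()]))


# With all n dummies the rank is lower than the number of columns
print(design_rank(full), "of", full.shape[1] + 1, "columns")
# With n-1 dummies the matrix has full column rank
print(design_rank(reduced), "of", reduced.shape[1] + 1, "columns")
```

A rank below the number of columns means the coefficients cannot be determined uniquely, which is the practical consequence of perfect multicollinearity.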

With `drop_first=True` in this example, each row is read as follows:

- If 'Green' is 1, the category is 'Green'.
- If 'Red' is 1, the category is 'Red'.
- If both 'Green' and 'Red' are 0, neither category is present, which implies that the observation belongs to the dropped category ('Blue' in this case).

The logic follows from the fact that exactly one of the n dummy variables is 1 for each observation, so the dropped category can be inferred from the absence of all the others.

In summary, the dropped category needs no column of its own: a row in which every remaining dummy variable is 0 belongs to the dropped category.
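
This inference can be sketched in code. The example assumes the same `Color` values as above and that 'Blue' is the dropped category (it is dropped because it sorts first); the variable names are illustrative.

```python
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Red", "Green"], name="Color")
dummies = pd.get_dummies(colors, drop_first=True, dtype=int)  # columns: Green, Red

# A row of all zeros means the dropped category ('Blue' here)
is_reference = dummies.sum(axis=1) == 0

# idxmax picks the column holding the 1; all-zero rows are overridden with 'Blue'
recovered = dummies.idxmax(axis=1).where(~is_reference, "Blue")

print(recovered.tolist())
```

The recovered labels match the original `Color` column, which shows that no information is lost by dropping one dummy variable.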