# Week 5 Notes
### Primary Focus: Feature Engineering


##### What is Feature Engineering?
[Reference: Kaggle](https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods#1.-Introduction-to-Feature-Engineering-)

**Feature Engineering** is the process of using domain knowledge to extract features from raw data via data mining techniques to improve the performance of machine learning algorithms.

Coming up with features is difficult, time-consuming, and requires expert knowledge. "Applied machine learning" is basically feature engineering. Boiled down to one concept, it's about transforming raw data into a form that's more useful for your model.

### Feature Scaling
Models that rely on the **distance between data points** (like k-nearest neighbors or Support Vector Machines) are **highly sensitive to the scale of the features**. If one feature has a much larger range than others, it can dominate the distance calculations and skew the results.
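To make the "dominating feature" point concrete, here is a minimal sketch using made-up (age, income) pairs. The data and the `euclidean` helper are hypothetical and only meant to illustrate the idea:

```python
import math

# Hypothetical points: (age in years, annual income in dollars)
a = (25, 50_000)
b = (60, 51_000)  # very different age, similar income
c = (26, 58_000)  # similar age, different income


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


dist_ab = euclidean(a, b)
dist_ac = euclidean(a, c)

# Income differences are in the thousands, age differences in the tens,
# so income decides which point looks "closer".
print(f"a-b: {dist_ab:.2f}, a-c: {dist_ac:.2f}")
```

Here `a` ends up closer to `b` than to `c`, even though `a` and `b` are 35 years apart in age: the 35-year gap adds less than 1 to a distance of roughly 1000.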

- **StandardScaler**
    - This technique standardizes features by removing the mean and scaling to unit variance. 
    - The formula is z = (x - u) / s, where u is the mean and s is the standard deviation. It's great for features that follow a **normal or near-normal distribution**.
    $$z = \frac{x - u}{s}$$
- **MinMaxScaler**
    -  This scales features to a fixed range, typically 0 to 1. 
    - The formula is X_scaled = (X - X_min) / (X_max - X_min).
        $$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
    - This is useful when you need to **constrain your data to a specific range or when your data isn't normally distributed**.

Scaling doesn't change the shape of your data's distribution; it only changes the scale. This helps keep any single feature from dominating the model just because of its units or range.
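The two formulas above can be sketched in plain Python. This is a hand-rolled illustration, not the scikit-learn implementation: the function names and the `incomes` list are made up, and using the population standard deviation for s is an assumption of this sketch.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Apply z = (x - u) / s with u = mean and s = population std dev.

    Assumes the values are not all identical (otherwise s would be 0).
    """
    u = mean(values)
    s = pstdev(values)
    return [(x - u) / s for x in values]


def min_max_scale(values):
    """Apply (X - X_min) / (X_max - X_min), mapping values into [0, 1].

    Assumes at least two distinct values (otherwise the range would be 0).
    """
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical feature column
incomes = [30_000, 40_000, 50_000, 60_000, 70_000]

z_scores = standard_scale(incomes)  # centered on 0, unit variance
scaled = min_max_scale(incomes)     # smallest -> 0.0, largest -> 1.0

print([round(z, 3) for z in z_scores])
print(scaled)
```

Both outputs keep the same ordering and relative spacing as the input; only the units change.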

### Categorical Encoding 
Most machine learning algorithms require numerical input. Categorical features (like 'city' or 'product type') need to be converted to numbers before being used.
- **OneHotEncoder**
    - This is a common method for handling nominal (non-ordered) categorical data. It creates a new binary column for each category, with a **1 indicating the presence of that category and a 0 for its absence**. For example, if you have a 'city' feature with values 'New York', 'London', and 'Paris', OneHotEncoder would create three new columns: 'city_New York', 'city_London', and 'city_Paris'. This **prevents the model from assuming an arbitrary ordinal relationship between categories**.
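The 'city' example can be sketched by hand as follows. The `one_hot_encode` helper is hypothetical and written only to show the idea; it is not how any particular library does it, and sorting the categories alphabetically is an arbitrary choice of this sketch.

```python
def one_hot_encode(values, feature_name):
    """Return (column_names, rows) with one binary column per category."""
    categories = sorted(set(values))  # alphabetical order, chosen arbitrarily
    columns = [f"{feature_name}_{cat}" for cat in categories]
    # Each row has a 1 in the column matching its category and 0 elsewhere.
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return columns, rows


cities = ["New York", "London", "Paris", "London"]
columns, rows = one_hot_encode(cities, "city")

print(columns)
for city, row in zip(cities, rows):
    print(city, row)
```

Every row contains exactly one 1, so no city is treated as "larger" or "smaller" than another.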

### Creating new features
This is where the "engineering" comes in. By combining or transforming existing features, you can capture more complex relationships.

- **Polynomial features**
    - You can create new features by raising existing features to a power, e.g., $$x_1^2, x_2^2$$
    - This **allows linear models to capture non-linear relationships**.
- **Interaction terms**
    - These are new features created by multiplying two or more existing features, e.g., $$x_1 \times x_2$$
    - This allows the model to capture the combined effect of two features.
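A minimal sketch of both ideas for a single row with two features. The helper name and the column labels are made up for illustration:

```python
def add_engineered_features(x1, x2):
    """Return the original features plus squared and interaction terms."""
    return {
        "x1": x1,
        "x2": x2,
        "x1^2": x1 ** 2,   # polynomial feature
        "x2^2": x2 ** 2,   # polynomial feature
        "x1*x2": x1 * x2,  # interaction term
    }


row = add_engineered_features(2.0, 3.0)
print(row)
```

A linear model fitted on these five columns is still linear in its weights, but its predictions can now curve with x1 and x2 and depend on their product.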

Plan of study: 
- Go over each reference on Days 2 to 3 and implement on Day 4.

References:
- [Kaggle: A Reference Guide to Feature Engineering Methods](https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods)
- [Geeks For Geeks: What is Feature Engineering?](https://www.geeksforgeeks.org/machine-learning/what-is-feature-engineering/)
- [Datacamp: Feature Engineering in Machine Learning: A Practical Guide](https://www.datacamp.com/tutorial/feature-engineering)
- [Practical Guide on Data Preprocessing in Python using Scikit Learn](https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/#:~:text=It%20should%20be%20kept%20in,due%20to%20its%20larger%20range.)
    - Downloaded the PDF here, since the site requires a login.
- [Data Preprocessing with Scikit-learn](https://medium.com/@drpa/data-preprocessing-with-scikit-learn-dcaaf82d000a)
- [10 Powerful Techniques for Feature Engineering in Machine Learning](https://thetechthinker.com/feature-engineering-in-machine-learning/)

Additional items to study/read up on for later weeks:

Can do this on the 5th or 6th day along with Cybersecurity this week, since the activity is just research.
- Pipelines/orchestration tools (Airflow, Prefect, dbt)