![image.png](attachment:09743b49-618e-496d-9408-f5aa1f819ecc.png)

- Feature engineering refers to the process of using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modeling.
- A feature engineering pipeline consists of the preprocessing steps that transform raw data into features that machine learning algorithms, such as predictive models, can use. Predictive models consist of an outcome variable and predictor variables, and it is during feature engineering that the most useful predictor variables are created and selected. Automated feature engineering has been available in some machine learning software since 2016. Feature engineering in ML consists of four main steps: Feature Creation, Transformations, Feature Extraction, and Feature Selection.

Feature engineering consists of creation, transformation, extraction, and selection of features, also known as variables, that are most conducive to creating an accurate ML algorithm. These processes entail:

- **Feature Creation**: Creating features involves identifying the variables that will be most useful in the predictive model. This is a subjective process that requires human intervention and creativity. Existing features are combined through addition, subtraction, multiplication, and ratios to create new derived features that have greater predictive power.

- **Transformations:** Transformation involves manipulating the predictor variables to improve model performance, for example: making the model flexible in the variety of data it can ingest; putting variables on the same scale, which makes the model easier to understand; improving accuracy; and avoiding computational errors by keeping all features within an acceptable range for the model.

- **Feature Extraction:** Feature extraction is the automatic creation of new variables by extracting them from raw data. The purpose of this step is to automatically reduce the volume of data into a more manageable set for modeling. Some feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis.

- **Feature Selection:** Feature selection algorithms essentially analyze, judge, and rank various features to determine which features are irrelevant and should be removed, which features are redundant and should be removed, and which features are most useful for the model and should be prioritized.
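
As a rough illustration of the first two steps above, the sketch below derives a ratio feature (creation) and then standardizes every column (transformation). The customer data and the names `raw`, `spend_per_order`, and `scaled` are hypothetical, chosen only for this example; it is one possible sketch, not a prescribed pipeline.

```python
import numpy as np

# Hypothetical raw data: one row per customer (total_spend, n_orders).
raw = np.array([
    [120.0, 4.0],
    [300.0, 10.0],
    [45.0, 3.0],
    [500.0, 5.0],
])

# Feature creation: derive a ratio feature (average spend per order).
spend_per_order = raw[:, 0] / raw[:, 1]
features = np.column_stack([raw, spend_per_order])

# Transformation: standardize each column to zero mean and unit variance
# so that all features end up on the same scale.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(spend_per_order)
print(scaled.round(2))
```

Standardization is only one of several scaling choices; min-max scaling or a log transform may suit other data better.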

The art of feature engineering varies among data scientists; however, the steps for performing feature engineering for most machine learning algorithms include the following:

- **Data Preparation:**  This preprocessing step involves the manipulation and consolidation of raw data from different sources into a standardized format so that it can be used in a model. Data preparation may entail data augmentation, cleaning, delivery, fusion, ingestion, and/or loading. 

- **Exploratory Analysis:** This step is used to identify and summarize the main characteristics of a data set through data analysis and investigation. Data science experts use data visualizations to better understand how best to manipulate data sources, to determine which statistical techniques are most appropriate for data analysis, and to choose the right features for a model.

- **Benchmark:** Benchmarking is setting a baseline standard for accuracy to which all variables are compared. This is done to reduce the rate of error and improve a model’s predictability. Experimentation, testing, and optimizing metrics for benchmarking are performed by data scientists with domain expertise and by business users.
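
One simple way to set such a baseline for a classification task is to measure the accuracy of always predicting the most frequent class; any engineered feature set should then beat it. The sketch below assumes this majority-class baseline, which is one common choice rather than the only one, and uses hypothetical labels and a hypothetical helper name.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Hypothetical labels for illustration only.
train = ["no", "no", "yes", "no", "yes", "no"]
test = ["no", "yes", "no", "no"]

baseline = majority_baseline_accuracy(train, test)
print(f"baseline accuracy: {baseline:.2f}")
```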

![image.png](attachment:0970a4db-5c6e-4229-a31c-08571caab4aa.png)

- Feature selection is the process of isolating the most consistent, non-redundant, and relevant features to use in model construction. The main goal of feature selection is to improve the performance of a predictive model and reduce the computational cost of modeling.
- Feature Selection is one of the core concepts in machine learning which hugely impacts the performance of your model. The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.
- Irrelevant or partially relevant features can negatively impact model performance.
- Feature selection and data cleaning should be the first and most important steps in designing your model.

What are the benefits of performing feature selection before modeling your data?
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Fewer features reduce algorithm complexity, so algorithms train faster.

**Feature Selection Methods:**

- Univariate Selection
- Feature Importance
- Correlation Matrix with Heatmap
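
A minimal sketch of univariate selection and of the correlation matrix behind a heatmap, using synthetic data. The three features (`x_signal`, `x_noise`, `x_copy`) and the target are made up for illustration, and Pearson correlation is assumed as the scoring statistic; other statistics could be substituted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x_signal = rng.normal(size=n)                   # drives the target
x_noise = rng.normal(size=n)                    # unrelated to the target
x_copy = x_signal + 0.01 * rng.normal(size=n)   # near-duplicate of x_signal
X = np.column_stack([x_signal, x_noise, x_copy])
y = 2.0 * x_signal + 0.1 * rng.normal(size=n)

# Univariate selection: score each feature on its own by the absolute
# Pearson correlation with the target.
relevance = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
)

# Feature-to-feature correlation matrix (the values a heatmap would show).
# A large off-diagonal entry flags a redundant pair of features.
feature_corr = np.corrcoef(X, rowvar=False)

print(relevance.round(3))
print(feature_corr.round(3))
```

Here the noise feature should score low on relevance, while the near-duplicate scores high on relevance but is flagged as redundant by the correlation matrix.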

https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e#:~:text=Feature%20Selection%20is%20the%20process,learn%20based%20on%20irrelevant%20features.

![image.png](attachment:353f8036-ebab-49d4-bc64-e24a291e32a2.png)

- The main difference between the filter and wrapper methods for feature selection is that filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.

![image.png](attachment:1110bc4e-2795-4dde-a653-53b64e0c7cc1.png)

![image.png](attachment:16836011-0605-4ba3-8392-d11439b7f969.png)

![image.png](attachment:4e496e2d-7a6e-4cce-bbcb-b01dcfd959f0.png)

**Overall Feature Selection Process**

- Feature selection, one of the main components of feature engineering, is the process of selecting the most important features to input in machine learning algorithms. Feature selection techniques are employed to reduce the number of input variables by eliminating redundant or irrelevant features and narrowing down the set of features to those most relevant to the machine learning model. 

`The main benefits of performing feature selection in advance, rather than letting the machine learning model figure out which features are most important, include:`

- `simpler models:` simple models are easy to explain - a model that is too complex and unexplainable is not valuable
- `shorter training times:` a more precise subset of features decreases the amount of time needed to train a model
- `variance reduction:` fewer features lower the variance of the model, increasing the precision of the estimates it can produce
- `avoid the curse of high dimensionality:` the curse of dimensionality states that, as the number of features (dimensions) increases, the volume of the feature space increases so fast that the available data become sparse. PCA may also be used to reduce dimensionality, although it is strictly a feature extraction technique.
- `The most common input variable data types include:` Numerical Variables, such as Integer Variables and Floating Point Variables; and Categorical Variables, such as Boolean Variables, Ordinal Variables, and Nominal Variables. Popular tools for feature selection include scikit-learn's `sklearn.feature_selection` module in Python and feature selection packages in R.

`What makes one variable better than another? Typically, three key properties make a feature representation most desirable: it is easy to model, it works well with regularization strategies, and it disentangles causal factors.`

**Feature Selection Methods**

- Feature selection algorithms are categorized as either supervised, which can be used for labeled data, or unsupervised, which can be used for unlabeled data. The techniques are further classified as filter methods, wrapper methods, embedded methods, or hybrid methods:

- `Filter methods:` Filter methods select features based on statistics rather than on cross-validated model performance. A selected metric is applied to score the features and identify irrelevant attributes, which are then dropped. Filter methods are either univariate, in which an ordered ranking of features is established to inform the final selection of the feature subset; or multivariate, in which the relevance of the features is evaluated as a whole, identifying redundant as well as irrelevant features.
- `Wrapper methods:` Wrapper feature selection methods treat the selection of a set of features as a search problem, in which different combinations of features are prepared, evaluated, and compared with one another. This facilitates the detection of possible interactions among variables. Wrapper methods focus on feature subsets that improve the quality of the results of the learning algorithm used for the selection. Popular examples include Boruta and forward feature selection.
- `Embedded methods:` Embedded feature selection methods integrate feature selection into the learning algorithm itself, so that classification and feature selection are performed simultaneously. The features that contribute the most to each iteration of the model training process are extracted. Random forest feature selection, decision tree feature selection, and LASSO feature selection are common embedded methods.

`How to Choose a Feature Selection Method`

Choosing the best feature selection method depends on the input and output in consideration:

- Numerical Input, Numerical Output: a regression problem with numerical input variables. Use a correlation coefficient, such as Pearson’s correlation coefficient (for linear relationships) or Spearman’s rank coefficient (for nonlinear, monotonic relationships).
- Numerical Input, Categorical Output: a classification problem with numerical input variables. Use a statistic that takes the categorical target into account, such as the ANOVA F-statistic (linear) or Kendall’s rank coefficient (nonlinear; it assumes the categories are ordinal).
- Categorical Input, Numerical Output: a regression problem with categorical input variables (rare). Use the same measures, the ANOVA F-statistic (linear) or Kendall’s rank coefficient (nonlinear), but in reverse, with the roles of input and output swapped.
- Categorical Input, Categorical Output: a classification problem with categorical input variables. Use a measure of association, such as the Chi-Squared test (contingency tables) or Mutual Information, which is a powerful method that is agnostic to data types.
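
For the numerical-input, numerical-output case, the difference between Pearson and Spearman can be sketched on a relationship that is monotonic but not linear. The helper names and the cubic toy data below are assumptions made for illustration; the Spearman helper assumes there are no tied values.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: strength of a linear relationship."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.

    This simple version assumes there are no ties in a or b.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

# A perfectly monotonic but nonlinear relationship.
x = np.arange(1, 11, dtype=float)
y = x ** 3

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because `y` always increases with `x`, the rank correlation is perfect, while the linear correlation is lower since the points do not lie on a straight line.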

**Why Feature Selection is Important**

- Feature selection is an invaluable asset for data scientists. Understanding how to select important features in machine learning is crucial to the efficacy of the machine learning algorithm. Irrelevant, redundant, and noisy features can pollute an algorithm, negatively impacting learning performance, accuracy, and computational cost. Feature selection is increasingly important as the size and complexity of the average dataset continue to grow.

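**Principle of Feature Extraction and Algorithms**

- Feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features). This new, reduced set of features should then be able to summarize most of the information contained in the original set of features.

- Common algorithms include principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA), locally linear embedding (LLE), t-distributed stochastic neighbor embedding (t-SNE), and autoencoders (AE).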

![image.png](attachment:b24e27e5-5989-4f22-b78f-06874707cdca.png)

- The most important part of text classification is feature engineering: the process of creating features for a machine learning model from raw text data.
- Some of the common features that we can extract from a sentence are the number of words, number of capitalized words, number of punctuation marks, number of unique words, number of stopwords, average sentence length, etc. We can define these features based on the data set we are using. We can also add other features, such as the number of hashtags or the number of mentions.
- https://www.analyticsvidhya.com/blog/2021/04/a-guide-to-feature-engineering-in-nlp/
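
The counts described above can be sketched with the standard library alone. The function name `text_features`, the whitespace tokenization, and the choice to count only fully uppercase tokens as "capital words" are assumptions made for this example.

```python
import string

def text_features(sentence):
    """Return simple count-based features for one piece of text."""
    words = sentence.split()  # naive whitespace tokenization
    return {
        "n_words": len(words),
        "n_unique_words": len({w.lower() for w in words}),
        # Assumption: a "capital word" is a token written fully in uppercase.
        "n_capital_words": sum(1 for w in words if w.isupper()),
        "n_punctuation": sum(1 for ch in sentence if ch in string.punctuation),
        "n_hashtags": sum(1 for w in words if w.startswith("#")),
        "n_mentions": sum(1 for w in words if w.startswith("@")),
    }

features = text_features("WOW this is great, thanks @sam! #happy #happy")
print(features)
```

Note that `string.punctuation` includes `#` and `@`, so hashtags and mentions also contribute to the punctuation count in this sketch.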

![image.png](attachment:d39f2650-ddc5-4181-8fbe-445792f51c0f.png)

- Cosine similarity is advantageous because two similar documents can still have a small angle between them even when they are far apart by Euclidean distance because of their size. The smaller the angle, the higher the similarity.
- Cosine similarity is mostly used with vectors produced by word embeddings. With a model such as Doc2Vec, you get one vector for the whole document, and documents can then be compared or grouped by the cosine similarity of their vectors. For text classification, an LSTM classifier with embedding layers is another option to try.
- Word2Vec is a model that represents words as vectors. A similarity value for two words can then be computed by applying the cosine similarity formula to their Word2Vec vectors. The configuration of the Word2Vec model affects the similarity values it produces.
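
The size argument above can be checked with a small sketch: a document and a ten-times-longer document with the same word proportions are far apart by Euclidean distance yet point in the same direction. The term-count vectors and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term counts: same word proportions, 10x the length.
short_doc = [1, 2, 0, 1]
long_doc = [10 * count for count in short_doc]

print(f"cosine similarity:  {cosine_similarity(short_doc, long_doc):.3f}")
print(f"euclidean distance: {euclidean_distance(short_doc, long_doc):.3f}")
```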

https://www.machinelearningplus.com/nlp/cosine-similarity/

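```python
import numpy as np

# Two term-count vectors, one per document
a = np.array([2, 3, 2, 0, 2, 3, 3, 0, 1])
b = np.array([2, 1, 0, 0, 3, 2, 1, 3, 1])

# Cosine similarity: dot product divided by the product of the vector norms
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)
```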

https://stackoverflow.com/questions/1746501/can-someone-give-an-example-of-cosine-similarity-in-a-very-simple-graphical-wa

![image.png](attachment:d636b2b4-310b-401a-a105-5a562fe987ae.png)

- Hamming distance is the number of differing characters between two strings of equal length, or the number of differing bits between two numbers.

For the given binary numbers, we calculate the Hamming distance by comparing corresponding positions.
 - Only 2 corresponding bit positions do not match, so the Hamming distance is 2.
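
A minimal sketch of both variants of the definition, for strings and for integers. The bit strings below are hypothetical values chosen so that exactly two positions differ; they are not taken from the figure.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

def hamming_distance_int(x, y):
    """Number of differing bits between two non-negative integers."""
    # XOR leaves a 1 bit exactly where the two numbers disagree.
    return bin(x ^ y).count("1")

print(hamming_distance("1011101", "1001001"))
print(hamming_distance_int(0b1011101, 0b1001001))
print(hamming_distance("karolin", "kathrin"))
```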

![image.png](attachment:316c069e-f3ee-465a-baf9-a739f0220314.png)

- A high-dimensional data set is simply a data set that has a large number of features.
- High-dimensional means that the number of dimensions is staggeringly high, so high that calculations become extremely difficult. With high-dimensional data, the number of features can exceed the number of observations. For example, microarrays, which measure gene expression, can contain tens to hundreds of samples. Each sample can contain tens of thousands of genes.

- Example
 - Healthcare image data.
 - Banking streaming data.
 - Voice recognition data.

Dimensionality reduction simplifies the data so that it is easier to understand, numerically or visually, while preserving as much of the original information as possible. To reduce dimensionality, you could combine related data into groups using a tool like multidimensional scaling to identify similarities in the data. You could also use clustering to group items together.

The curse of dimensionality usually refers to what happens when you add more and more variables to a multivariate model. The more dimensions you add to a data set, the more difficult it becomes to predict certain quantities. You would think that more is better. However, when the number of observations stays fixed, each added variable spreads the data more thinly across the feature space, and predictive power can drop sharply.

As a simple example, let’s say you are using a model to predict the location of a large bacterium in a 25 cm² petri dish. The model might be fairly accurate at pinning the bacterium down to the nearest square cm. However, let’s say you add just one more dimension: instead of a 2D petri dish, you use a 3D beaker. The predictive space increases exponentially, from 25 cm² to 125 cm³. When you add more dimensions, it makes sense that the computational burden also increases. It wouldn’t be impossible to pinpoint where the bacterium might be in a 3D model. However, it’s a more challenging task.

The statistical curse of dimensionality refers to a related fact: the required sample size n grows exponentially with the number of dimensions d. In simple terms, adding more dimensions could mean that the sample size you need quickly becomes unmanageable.
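
The growth in the petri dish example can be reproduced with a few lines of arithmetic. The sketch assumes each axis is divided into 5 unit-length bins, so the number of cells that data would need to cover is 5 raised to the number of dimensions; the function name is hypothetical.

```python
def cells_to_cover(bins_per_axis, n_dimensions):
    """Number of grid cells when every axis is split into equal bins."""
    return bins_per_axis ** n_dimensions

# 5 bins per axis: 25 cells in 2D (the dish), 125 cells in 3D (the beaker).
for d in (1, 2, 3, 10):
    print(f"{d:>2} dimensions -> {cells_to_cover(5, d):>9} cells")
```

With at least one observation needed per cell, the required sample size grows at the same exponential rate.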

![image.png](attachment:3bdd17ca-b698-4c5b-8ec1-fe9fd1eb1085.png)

**PCA is an acronym for Principal Component Analysis**

- Principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed.
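
A minimal sketch of PCA computed with a singular value decomposition in NumPy, assuming the data only needs centering (no scaling). The function name `pca` and the synthetic three-column data set are hypothetical; library implementations add more options than shown here.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T      # the "summary indices"
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained_ratio[:n_components]

# Synthetic data: the second column is almost a multiple of the first,
# and the third column is low-variance noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.05 * rng.normal(size=200),
    0.05 * rng.normal(size=200),
])

scores, components, ratio = pca(X, n_components=1)
print(scores.shape)
print(ratio.round(4))
```

Because the three columns carry essentially one underlying signal, a single component should capture almost all of the variance.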

**Use Of Vector**

- A vector, in mathematics, is a quantity that has both magnitude and direction but not position.

**Embedded Technique**

- In Embedded Methods, the feature selection algorithm is integrated as part of the learning algorithm.
- Embedded methods combine the qualities of filter and wrapper methods.
- It’s implemented by algorithms that have their own feature selection methods in them.
- A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification/regression at the same time.
- The most common embedded techniques are tree algorithms such as Random Forest and Extra Trees.
- Tree algorithms select a feature at each recursive step of the tree-growing process and split the sample set into smaller subsets. The more samples in a child node belong to the same class, the more informative the feature is.
- Other embedded methods are LASSO, which uses an L1 penalty, and Ridge, which uses an L2 penalty, for constructing a linear model. Both shrink coefficients toward zero, but only the L1 penalty sets some coefficients exactly to zero and so removes features; the L2 penalty leaves them small but nonzero.
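
How the two penalties treat coefficients can be sketched with closed-form solutions, under the simplifying assumption of an orthonormal design matrix (XᵀX = I). In that special case, LASSO with objective ½‖y − Xβ‖² + λ‖β‖₁ soft-thresholds each least-squares coefficient, and Ridge with objective ½‖y − Xβ‖² + (λ/2)‖β‖² rescales it. The coefficient values and function names below are hypothetical.

```python
def lasso_shrink(beta_ols, lam):
    """Soft-thresholding: the LASSO solution under an orthonormal design."""
    if abs(beta_ols) <= lam:
        return 0.0  # small coefficients are set exactly to zero
    return beta_ols - lam if beta_ols > 0 else beta_ols + lam

def ridge_shrink(beta_ols, lam):
    """Proportional shrinkage: the Ridge solution under an orthonormal design."""
    return beta_ols / (1.0 + lam)

# Hypothetical least-squares coefficients and penalty strength.
ols_coefficients = [3.0, -0.4, 0.2, -2.0]
lam = 0.5

lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge = [ridge_shrink(b, lam) for b in ols_coefficients]

print("lasso:", lasso)
print("ridge:", [round(b, 4) for b in ridge])
```

In this sketch the L1 penalty zeroes out the two small coefficients, which is what makes LASSO usable as a feature selector, while the L2 penalty shrinks every coefficient but keeps all of them nonzero.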

![image.png](attachment:de27fa85-553d-47ec-963b-aaef24b0c2bd.png)

SBS :- `Sequential Backward Selection` starts with all features, then removes one feature at a time, at each step dropping the feature whose removal hurts the evaluation score the least, until the desired number of features remains.

SFS :- `Sequential Forward Selection` starts with no features, then greedily adds one feature at a time, at each step adding the feature that improves the evaluation score the most, until the desired number of features is reached.
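
A minimal sketch of sequential forward selection on synthetic data. It assumes in-sample R² of a least-squares fit as the evaluation score to keep the example short; in practice a cross-validated score would usually be preferred. The function names and the data are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - residual.var() / y.var()

def sequential_forward_selection(X, y, n_features):
    """Greedily add the feature that most improves the score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        # Try each remaining feature together with those already selected.
        best = max(remaining, key=lambda j: r_squared(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only features 0 and 2 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

selected = sequential_forward_selection(X, y, n_features=2)
print(selected)
```

Sequential backward selection follows the same pattern in reverse: start from all column indices and repeatedly remove the one whose removal lowers the score the least.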

![image.png](attachment:29968b56-0b6f-428b-a08e-c58f776e7a39.png)

**SMC vs. Jaccard index**

SMC (the simple matching coefficient) counts both mutual presence (when an attribute is present in both sets) and mutual absence (when an attribute is absent in both sets) as matches and compares the total to the number of attributes in the universe, whereas the Jaccard index counts only mutual presence as a match and compares it to the number of attributes that have been chosen by at least one of the two sets.
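
The difference can be sketched on two binary attribute vectors, where 1 means an attribute is present and 0 means it is absent. The vectors and function names are hypothetical; the Jaccard helper assumes at least one attribute is present in at least one of the vectors.

```python
def match_counts(a, b):
    """Return (mutual presences, mutual absences, mismatches)."""
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    mismatches = len(a) - m11 - m00
    return m11, m00, mismatches

def simple_matching_coefficient(a, b):
    """Matches (presences and absences) over all attributes."""
    m11, m00, _ = match_counts(a, b)
    return (m11 + m00) / len(a)

def jaccard_index(a, b):
    """Mutual presences over attributes present in at least one vector."""
    m11, _, mismatches = match_counts(a, b)
    return m11 / (m11 + mismatches)  # undefined if both vectors are all zeros

a = [1, 0, 0, 1, 0, 0]
b = [1, 1, 0, 0, 0, 0]

print(f"SMC:     {simple_matching_coefficient(a, b):.3f}")
print(f"Jaccard: {jaccard_index(a, b):.3f}")
```

The shared absences raise the SMC but are ignored by the Jaccard index, which is why Jaccard is often preferred for sparse data where most attributes are absent.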

**Function Selection Method**

![image.png](attachment:328acbb2-12d5-417f-be3a-6c2ed41f00dd.png)